MDTS11 pre-processing

Title: Lexical Extraction
Short description: Lexical extraction has been applied to the titles and abstracts of the dataset (many thanks to Telmo Menezes for his sqlite parser). The resulting list of terms can be used to provide a lexical description of each paper, for example to support analyses based on co-occurrence data.
The final list features 2,000 n-terms such as cell cycle arrest, fetal fibroblasts, ventricle, amphibian embryos, tyrosine phosphorylation, etc. The extraction is fully automatic.


The term extraction process has been applied according to the following method:
  • 1. The corpus is tagged with a POS tagger (NLTK), e.g. central nervous system development -> JJ_central JJ_nervous NN_system NN_development.
  • 2. The n-terms are chunked: a syntactic filter is applied to retrieve relevant noun groups.
  • 3. n-terms are grouped together if they are composed of the same stemmed terms, regardless of order, e.g. central develop nervou system -> {central nervous system development|development central nervous systems|development central nervous system}.
  • 4. A simple stop-word list is applied to discard irrelevant terms such as Elsevier, Academic Press, etc.
  • 6. The C-value (which accounts for nested terms; Frantzi et al. 1998) of every n-term is computed. The 4,000 n-terms with the highest C-values are selected.
  • 7. chi-2 co-occurrence scores (Matsuo et al. 2004) are computed on these n-terms. The final list is made up of 2,000 n-terms.
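
Steps 3 and 6 of the method can be illustrated with a short Python sketch. It is not the contributor's code. `crude_stem` is a hypothetical toy stemmer that only strips two suffixes, standing in for whatever stemmer was actually used. `c_value` follows the usual formulation attributed to Frantzi et al., log2|a| times the frequency of a minus the mean frequency of the longer candidates that contain a. The term frequencies in the usage lines are invented for illustration.

```python
import math
from collections import defaultdict


def crude_stem(word):
    """Toy stand-in for a real stemmer: strips 'ment' or a final 's' only."""
    word = word.lower()
    for suffix in ("ment", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_key(nterm):
    """Order-independent key: the sorted stems of the words in the n-term."""
    return tuple(sorted(crude_stem(w) for w in nterm.split()))


def group_variants(nterms):
    """Step 3: group n-terms that share the same unordered stemmed words."""
    groups = defaultdict(list)
    for nterm in nterms:
        groups[group_key(nterm)].append(nterm)
    return dict(groups)


def c_value(term, freq):
    """Step 6: C-value of `term` (a tuple of words).

    `freq` maps every candidate term (tuple of words) to its frequency.
    Longer candidates that contain `term` as a contiguous sub-sequence
    reduce its score by their mean frequency.
    """
    n = len(term)
    containers = [
        t for t in freq
        if len(t) > n and any(t[i:i + n] == term for i in range(len(t) - n + 1))
    ]
    f = freq[term]
    if containers:
        f -= sum(freq[t] for t in containers) / len(containers)
    return math.log2(n) * f


variants = [
    "central nervous system development",
    "development central nervous systems",
    "development central nervous system",
]
print(group_variants(variants))

# Invented frequencies, for illustration only.
toy_freq = {("cell", "cycle"): 10, ("cell", "cycle", "arrest"): 4}
print(c_value(("cell", "cycle"), toy_freq))
```

With this formulation a single-word term always scores 0, because log2(1) = 0. Since the final list contains single words such as ventricle, the variant actually used probably differs on this point.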


The final list is available as an xls file (https://public.iscpif.fr/~cointet/MDTS11-data_challenge/nterms.xls). The file features the following columns:
unique key of the n-term, main form of the n-term, exhaustive list of the n-term's forms (separated by '|'), total occurrences of the n-term, and chi-2 pertinence score.

The complete index of the dataset in sqlite format is also available (https://public.iscpif.fr/~cointet/MDTS11-data_challenge/stat-terms.db). The database is made up of two tables. The first one (terms) lists the n-terms along with their indexes; the second one (article2terms) enumerates the occurrences of each n-term in the notices extracted from the dataset. The columns of article2terms are the following: wos_id (wos identifier (no "00"!)), terms_id (as identified in the table terms), title_or_abstract (0 if the term has been identified in the title, 1 otherwise), sentence_id (i means that the n-term has been detected in the i-th sentence).
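
As an illustration of how the article2terms table could feed a co-occurrence analysis, here is a minimal Python sketch using the standard sqlite3 module. The table and column names come from the description above. The column types, the in-memory demo database and its rows are assumptions made up for the example, and `cooccurrence_counts` and `demo_connection` are hypothetical helper names.

```python
import sqlite3


def cooccurrence_counts(conn):
    """For each pair of n-terms, count the notices in which both appear."""
    query = """
        SELECT a.terms_id, b.terms_id, COUNT(DISTINCT a.wos_id)
        FROM article2terms AS a
        JOIN article2terms AS b
          ON a.wos_id = b.wos_id AND a.terms_id < b.terms_id
        GROUP BY a.terms_id, b.terms_id
        ORDER BY a.terms_id, b.terms_id
    """
    return {(t1, t2): n for t1, t2, n in conn.execute(query)}


def demo_connection():
    """Build a tiny in-memory table with the documented columns (types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article2terms ("
        "wos_id TEXT, terms_id INTEGER, "
        "title_or_abstract INTEGER, sentence_id INTEGER)"
    )
    rows = [  # invented rows, for illustration only
        ("A1", 1, 0, 0),
        ("A1", 2, 1, 0),
        ("A1", 2, 1, 3),
        ("A2", 1, 1, 1),
        ("A2", 2, 1, 1),
        ("A2", 3, 1, 2),
    ]
    conn.executemany("INSERT INTO article2terms VALUES (?, ?, ?, ?)", rows)
    return conn


print(cooccurrence_counts(demo_connection()))
```

To run the same query on the real data, replace `demo_connection()` with `sqlite3.connect("stat-terms.db")` after downloading the file, assuming the schema matches the description.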

Thanks to Andreï Mogoutov, Elias Showk and the Cortext Team for their help in designing the lexical extraction methodology.


Type of pre-processing: natural language processing
URL: https://public.iscpif.fr/~cointet/MDTS11-data_challenge/
Contributor: Jean-Philippe Cointet
Created: Saturday 19 March, 2011 11:54:39