Collaborative Data Challenge winners 2011

  • CorText Team (IFRIS) for Knowledge tube reconstruction - 600€

  • SONIC Lab (Yun Huang, Alina Lungeanu, Zhang Chuang),
    NICO (Mike Stringer, Jonathan Haynes), and AMARAL Lab (Dan McClary,
    Xiaohan Zeng) at Northwestern University - Structured and Relational Information Extraction - 200€

  • Telmo Menezes - Sci-pie - 200€

View the awards page


List of shared preprocessing


Each entry lists the title, a short description, the type of pre-processing, the contributor, the creation date, and the number of attachments.

ISI 2 ISI converter

The attached script (Python) simply converts every file in a given directory (dir=) into a new file that should be closer to the usual ISI format one can download from the website.

The final structure of the file does not perfectly mimic every feature or field of the regular ISI exports, but the conversion may be sufficient for most uses. Otherwise, the conversion script can certainly be enhanced to fit the original ISI files more closely.

Two example files are also provided. The one ending in '.isi' is the output of the script applied to the other file, which comes directly from the data challenge raw data repository.
Type: data format conversion | Contributor: Jean-Philippe Cointet | Created: Thu 31 Mar 2011, 18:54 | Attachments: 4
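As a rough illustration of the idea, a directory-walking converter might look like the sketch below. The ISI header and terminator tags (FN/VR, ER/EF) are standard ISI export conventions, but the conversion logic here is a hypothetical simplification, not the attached script:

```python
import os

# Hypothetical sketch only: walk every file in a given directory and wrap
# its records with the header and terminator lines (FN/VR, ER/EF) that the
# usual ISI exports carry. The attached script's actual field handling is
# richer than this.
def convert_dir(src_dir, dst_dir):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        with open(src, encoding="utf-8") as f:
            body = f.read().rstrip("\n")
        with open(os.path.join(dst_dir, name + ".isi"), "w", encoding="utf-8") as out:
            out.write("FN ISI Export Format\nVR 1.0\n")  # standard ISI file header
            out.write(body + "\nER\nEF\n")               # record / file terminators
```

A real converter would also remap individual field tags; this sketch only shows the file-level structure.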
Structured and Relational Information Extraction

Collaboration between SONIC Lab (Yun Huang, Alina Lungeanu, Zhang Chuang), NICO (Mike Stringer, Jonathan Haynes), and the AMARAL Lab (Dan McClary, Xiaohan Zeng) at Northwestern University.

In the data preprocessing, we developed AWK and Python scripts to extract more than 30 attributes related to articles, issues, and authors, and to construct 16 relational tables in MySQL.

Using the SQL stored procedures provided, users can easily extract author-publication, author-citation, co-authorship, and citation-similarity relations, as well as related author keywords, keywords plus, addresses, publication years, and subject categories, for a subset of authors or for all of them.
Type: data format conversion | Contributor: yunhuang | Created: Tue 22 Mar 2011, 05:09 | Attachments: 4
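The submission used MySQL stored procedures; the sketch below only illustrates the kind of relational extraction involved (here, co-authorship from an author-publication table), using Python's built-in sqlite3 and hypothetical table and column names:

```python
import sqlite3

# Illustrative only: the actual submission builds 16 MySQL tables and uses
# stored procedures. Table/column names here are invented for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author_publication (author TEXT, article_id TEXT);
INSERT INTO author_publication VALUES
  ('Huang, Y', 'A1'), ('Lungeanu, A', 'A1'),
  ('Huang, Y', 'A2'), ('Zeng, X', 'A2');
""")

# Co-authorship: pairs of authors sharing at least one article,
# extracted with a self-join on the article identifier.
coauthors = conn.execute("""
    SELECT a.author, b.author, COUNT(*) AS n_papers
    FROM author_publication a
    JOIN author_publication b
      ON a.article_id = b.article_id AND a.author < b.author
    GROUP BY a.author, b.author
""").fetchall()
```

The same self-join pattern generalizes to author-citation and citation-similarity relations by joining on citation identifiers instead of article identifiers.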
Analysis and visualization of WoS data with Pajek

Pajek is a program for the analysis and visualization of large networks. We used the program WoS2Pajek to transform the WoS data into Pajek format. From the data we obtained the citation network of works; two-mode networks (works x authors, works x journals, works x keywords, works x institutions, works x countries); a partition of works by publication year; a partition of works by the completeness of their description; and a vector of the number of pages.

There are always plenty of ideas for analyzing networks, especially when several networks are available that can be analyzed in combination with each other. The networks listed above were analyzed to identify important authors, works and main topics, and to explore collaboration and citation among authors.

The analysis report, organized by the networks used, is available at DocuWiki.
Type: analysis of large networks | Contributor: monk | Created: Wed 23 Mar 2011, 23:52 | Attachments: 0
WoS2Pajek/MDTS

Pajek is a program for the analysis and visualization of large networks.

The Python program WoS2Pajek, which transforms WoS data into Pajek format, was adapted for the MDTS 11 Challenge data. Several Pajek partitions and networks were obtained. The contribution is described in detail at

It consists of:
  • the adapted WoS2Pajek/MDTS program
  • a collection of Pajek networks and additional data
Type: network extraction | Contributor: batagelj | Created: Wed 23 Mar 2011, 18:30 | Attachments: 0
sci-pie

A collection of scripts to process, analyze, and generate visualizations of the dataset, based on a relational database derived from the data.

For now:
* createdb.py: creates a sqlite3 relational database to hold the data in a more organized way. The main tables are: publications, issues, articles, authors, keywords and organizations.
* import.py: imports a set of files in Thomson/WoS format into the database.
* trimyears.py: discards all articles and issues that do not fall within a given year interval.

See also: the associated blog post

Hopefully more to come soon...
Contributor: telmo | Created: Thu 10 Mar 2011, 17:06 | Attachments: 0
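As an illustration of what a createdb.py-style script might set up, here is a minimal sketch. The table names come from the description above; the columns are guesses for illustration, not the actual sci-pie schema:

```python
import sqlite3

# Minimal sketch of a createdb.py-style setup. Table names are taken from
# the sci-pie description; the columns are hypothetical.
def create_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE publications  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE issues        (id INTEGER PRIMARY KEY, publication_id INTEGER, year INTEGER);
        CREATE TABLE articles      (id INTEGER PRIMARY KEY, issue_id INTEGER, title TEXT, abstract TEXT);
        CREATE TABLE authors       (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE keywords      (id INTEGER PRIMARY KEY, term TEXT);
        CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT);
    """)
    return conn
```

A trimyears.py-style pass would then reduce to a DELETE over issues and articles whose year falls outside the chosen interval.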
Lexical extraction: Treetagger + Yatea

From Thierry Poibeau:

This lexical extraction is based on the combination of Treetagger and Yatea (http://search.cpan.org/~thhamon/Lingua-YaTeA-0.5/).

Due to limited computational resources, the analysis was applied to 3% of the abstracts.

Some additional manual filtering was done on the Treetagger output to work around bugs.

Yatea also produces a more detailed XML output, but I am not sure it would be more useful than the plain list of candidates for a whitelist.

This list should be checked manually to keep only meaningful terms.

It should nevertheless be possible to apply additional simple filtering to improve the results (removing some kinds of words based on the tagging).

Since it is based on only part of the corpus, I am not sure it is really useful as such, but it gives an idea of what can be done with these tools.

The file can be found in the attachments.
Type: natural language processing | Contributor: David Chavalarias | Created: Tue 22 Mar 2011, 02:31 | Attachments: 1
Lexical Extraction

Lexical extraction has been applied to the titles and abstracts from the dataset (many thanks to Telmo Menezes for his sqlite parser). This list of terms can be used to provide a lexical description of each paper, for example to run analyses based on co-occurrence data.
The final list features 2,000 n-terms such as cell cycle arrest, fetal fibroblasts, ventricle, amphibian embryos, tyrosine phosphorylation, etc. The extraction is fully automatic.

The term extraction process has been applied according to the following method:
  • 1. The corpus is tagged with a POS tagger (NLTK): central nervous system development -> JJ_central JJ_nervous NN_system NN_development
  • 2. Chunking of the n-terms: a syntactic filter is applied to retrieve relevant noun groups.
  • 3. N-terms are grouped together if they are composed of the same unordered stemmed terms: central develop nervou system -> {central nervous system development|development central nervous systems|development central nervous system}
  • 4. A simple stop-word list is applied to discard irrelevant terms such as Elsevier, Academic Press, etc.
  • 5. The C-value (a treatment of nested terms; Frantzi et al. 1998) of every n-term is computed. The 4,000 n-terms with the highest C-values are selected.
  • 6. Chi-2 co-occurrence scores (Matsuo et al. 2004) are computed on these n-terms. The final list is made of the 2,000 best-scoring terms.
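The grouping step (variants that share the same unordered stemmed terms) can be sketched as follows. A real pipeline would use a proper stemmer (e.g. Porter via NLTK); the naive suffix stripping here is a deliberate simplification for illustration:

```python
from collections import defaultdict

# Deliberately naive stemmer for illustration only; a real pipeline would
# use a proper stemmer such as Porter's.
def naive_stem(word):
    for suffix in ("ments", "ment", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Group n-term variants whose unordered sets of stems coincide, as in the
# example: "central nervous system development" and
# "development central nervous systems" share {central, nervou, system, develop}.
def group_variants(nterms):
    groups = defaultdict(set)
    for term in nterms:
        key = frozenset(naive_stem(w) for w in term.lower().split())
        groups[key].add(term)
    return list(groups.values())
```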

The final list is available as an xls file (https://public.iscpif.fr/~cointet/MDTS11-data_challenge/nterms.xls). The file features the following columns:
unique key of the n-term, main form of the n-term, exhaustive list of the n-term's forms (separated by '|'), total number of occurrences of the n-term, and chi-2 pertinence score.

The complete indexation of the dataset in sqlite format is also available (https://public.iscpif.fr/~cointet/MDTS11-data_challenge/stat-terms.db). The database is made of two tables: the first (terms) lists the n-terms along with their indexes; the second (article2terms) enumerates the occurrences of each n-term in the notices extracted from the dataset. Its columns are the following: wos_id (the WoS identifier, without the leading "00"), terms_id (as identified in the terms table), title_or_abstract (0 if the term was identified in the title, 1 otherwise), and sentence_id (a value of i means the n-term was detected in the i-th sentence).
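A typical query against this layout might look like the sketch below. The article2terms columns are those listed in the description; the exact column names of the terms table are assumed (id, main_form), and the tables are recreated in memory so the example is self-contained:

```python
import sqlite3

# Self-contained sketch of querying the stat-terms.db layout. The terms
# table's column names are assumptions; article2terms columns follow the
# description (wos_id, terms_id, title_or_abstract, sentence_id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE terms (id INTEGER PRIMARY KEY, main_form TEXT);
CREATE TABLE article2terms (wos_id TEXT, terms_id INTEGER,
                            title_or_abstract INTEGER, sentence_id INTEGER);
INSERT INTO terms VALUES (1, 'cell cycle arrest');
INSERT INTO article2terms VALUES ('A190002', 1, 1, 3);
""")

# All notices whose abstract (title_or_abstract = 1) mentions a given n-term:
rows = conn.execute("""
    SELECT a.wos_id, a.sentence_id
    FROM article2terms a JOIN terms t ON a.terms_id = t.id
    WHERE t.main_form = 'cell cycle arrest' AND a.title_or_abstract = 1
""").fetchall()
```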

Thanks to Andreï Mogoutov, Elias Showk and the CorText Team for their help in designing the lexical extraction methodology.
Type: natural language processing | Contributor: Jean-Philippe Cointet | Created: Sat 19 Mar 2011, 11:54 | Attachments: 2
easiparse

Extracts and transforms data from and to WoS ISI raw files and/or MongoDB. A co-occurrence network module will be proposed soon.
Type: data format conversion | Contributor: elishowk | Created: Wed 23 Feb 2011, 15:29 | Attachments: 0



The MDTS11 workshop is sponsoring a data challenge. The challenge is based on a dataset provided by Thomson ISI Web of Science with a focus on embryology and embryonic science from 1956 to 2010 (paper abstracts with citations, authors, patents, etc.). We invite you to download the dataset, explore it, learn something interesting about it, and submit a contribution about it on the MDTS11 webpage. The workshop itself will feature presentations by authors on March 25, as well as a broader discussion of data issues and science mapping issues. The deadline to participate in the data challenge itself is March 23.

Applicants are also invited to share pre-processing of the data on the workshop website until the submission deadline. Pre-processing and contribution submissions are independent: applicants need not submit to the challenge to be eligible for the best pre-processing award.

There will therefore be an earlier collaborative working session, on March 22, to share experience on pre-processing and collaboratively draft the pre-processing section of a collective paper.

Awards:

  1. 400€ for the best pre-processing (NLP, SNA, dataviz, etc.),

  2. 600€ for the best contribution and first author position for the collective paper of the workshop.

The best contributions will be gathered in a collective paper whose first author will be the winner of the data challenge. The workshop will hire a reviewer to ensure the final rewriting of the paper.

Important Dates 

  • February 11 2011: Registration is open; the embryology data set is online

  • February 11 2011: Preprocessing propositions are open. Share a preprocessing | view the list of propositions

  • March 22 2011: Collaborative working session to first draft the collective paper

  • March 23 2011 : Deadline for challenge submissions

  • March 25 2011 : Data challenge presentation and award

Good topics for submitting to the challenge include...

  • visualization of scientific data and science maps

  • citation and co-word analysis,

  • social network extraction,

  • tracing the evolution of science

  • bibliometric and scientometric analysis,

  • analysis of influence among scientists

  • ….

Good pre-processing include...

  • network extraction of any kind

  • topic extraction and white list extraction,

  • natural language processing of any kind,

  • ….

Data confidentiality agreement

To get this data, extracted from Thomson ISI Web Of Science, the following confidentiality agreement has to be sent by email to mdts /at/ sciencemapping /dot/ com:


By this mail, I agree not to disclose, publish or otherwise reveal any of the Thomson ISI data provided by the workshop MDTS11 to any other party whatsoever, and to use it only for the purpose of the data challenge of the MDTS11 workshop.

I agree to use results from the analysis of this dataset only for the purpose of the publications issued by the

You can share pre-processing or view the list of shared pre-processing on this site.


We have a forum for discussing the datasets.  Please join the discussions about whatever you're doing with the data. In particular, if you are looking for groups to collaborate with, here's a forum for you. We also have a project at GitHub, https://github.com/moma, where we can host tools and resources that you create to go along with the datasets.


Awards will be given by a panel of scientists from the broad domains of science mapping, infoviz, science and technology studies and embryology.

Criteria for preprocessing award

The judging panel will be asked to consider the following criteria:

  • Usefulness for other participants, number of challenge proposals that rely on it.

  • Code for the preprocessing is available, documented and open,

Criteria for data challenge award

The judging panel will be asked to consider the following criteria:

  • The proposal is useful for people from the target field (embryology), for scholars working on epistemology, philosophy, science and technology studies, or the history of science, or for the biomedical field in general.

  • The proposal is not specific to the target field but the approach can be applied to other fields of science,

  • The proposal relies on open-source software; methods and treatments are public and reproducible.

  • The proposal uses pre-processing from other participants.

Rules and restrictions

  • Any individual or group can participate in the pre-processing challenge and the data challenge. Registration should be made online. Young researchers are encouraged to participate. Attendance at the workshop is not required to be eligible for any of the awards. However, a short presentation (5 to 10 min) on March 25 at the collaborative challenge session is required to be eligible for the data challenge award. The workshop organizers may, on request, consider remote participation in the form of a video.

  • Any one individual or group can receive at most one award. If you submit both a pre-processing and a challenge proposal, you will receive only the challenge award; the pre-processing award will cascade to the next best proposition.

  • Pre-processing and challenge proposals must make use of the dataset provided by MDTS11. Information from other datasets and sources can be used in addition to it, provided the other dataset is open.

  • The judging panel has final say on all submissions, without restriction. Members of the judging panel will be allowed to submit pre-processing and challenge propositions but will not be allowed to claim an award.
  • In case of a tie for one of the awards, the prize will be shared by the winners and split according to the number of teams.
  • The order of authors for the workshop paper will be drawn at random, first from the winners of the challenge award, then from the remaining selected participants.


Description of the data

From the Web of Knowledge

In partnership with Thomson-Reuters, we extracted the data set for this challenge from the ~36 million references within the ISI Web of Science (WoS).

We provide, in two different formats, an extract of 196,239 articles having at least a title and matching one of the following words in the title or the abstract: embryolog*, embryo(s), or embryonic*.

We also provide each participant with full documentation of the data. After you sign the confidentiality agreement, you will get a login and password giving you access to the data we prepared, both in raw ISI WoS format and as a semi-structured document database, along with their specifications.

Available formats

We propose two different data formats:

  • raw/original files
  • semi-structured MongoDB/JSON dump.


Here's an example of an item from the raw data set file we prepared. All fields available in WoS items are present in this file. We will provide every participant with a complete specification, as well as extraction software.

Considering that articles (items) are grouped into journal issues, this flat format organizes elements in the following way:

UI open a journal issue
data associated with the issue
UT open an article
data associated with the article, possibly nested blocks
UT another article

Let's consider an example of an article from the raw files. Full documentation of all tags will be provided along with the data.

UT A190002
T9 00351
AU Heavy, RW
AU Peters, TF
TI Solution of ...
BP 11
EP 26
PG 16
DT @ Article
LA EN English
DE mass conservation
AB We extend the ...
NR 24
R9 0003495
/Y 1995
/V 32
/P 404
RA Healy, RW
NN MS 13, BOX 25
NZ 8023       AP
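A minimal parser for this flat format, based only on the excerpt above (not the official extraction software), could be sketched as:

```python
# Sketch of a parser for the flat format: `UI` opens a journal issue,
# `UT` opens an article, and every other line is a tag followed by its
# value. Repeated tags (e.g. AU) accumulate into lists. This illustrates
# the structure described above and is not the official extraction tool.
def parse_flat(lines):
    issues = []
    article = None
    for line in lines:
        if not line.strip():
            continue
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "UI":
            issues.append({"UI": value, "articles": []})
            article = None
        elif tag == "UT":
            article = {"UT": value}
            issues[-1]["articles"].append(article)
        elif article is not None:
            article.setdefault(tag, []).append(value)
    return issues
```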

Semi-structured database

Here's an example in JSON of the database contents. We chose MongoDB as our database server. We provide a dump of the database so that you can import the dataset into your own instance of MongoDB. During the workshop, we will also provide a running database server at iscpif.fr (accessible locally at ISCPIF). If you want to use the dump file, please install MongoDB and restore the database using mongorestore.

This data is partial compared to the raw version, but on the other hand it targets fields that are interesting for analysis and offers a structured, queryable version of the data set. Please consult the documentation to learn about all the querying possibilities of the database.

As some fields are optional, not all documents are structured the same way. To construct this database, we searched for the following fields:

  • article identifiers (WoS and DOI/PII/UNSP if available)
  • associated publication (structured) with its metadata (date, type, keywords, etc)
  • authors (list)
  • article keywords (list)
  • title and abstract (text)
  • language (text)
  • article citations (structured)
  • patent citations (structured)
  • institution, country & city (structured)

{
    # database ID
    "_id": "0001700010",
    # author's keywords
    "DE": [
        "minimal residual disease",
        "disseminated tumor cells",
        ...
    ],
    # identifier for the citation network
    "T9": "6412",
    # language
    "LA": "EN English",
    # abstract
    "AB": "There are different reasons ... tumor cells (2, 11).",
    # authors
    "AU": [
        "Amdros, PE",
        "Meles, G",
        ...
    ],
    # title
    "TI": "Detection, quantification ...",
    # cited references
    "CR": [
        {
            # identifier for the citation network
            "R9": "007965",
            # cited author
            "/A": "AMDROS, PF",
            # cited article's year
            "/Y": "2001",
            "/W": "LEUKEMIA"
        },
        ...
    ],
    # document type
    "DT": "@ Article",
    # institutions and locations
    "NC": "St Anna Childrens Hosp",
    "NF": "St Anna Childrens Hosp, Childrens Canc Res Inst, A-1090 Vienna, Austria",
    "NY": "Vienna",
    "NU": "Austria",
    # associated issue
    "issue": {
        "IO": "20100002821097",
        "JI": "Acad. Pediatr.",
        "LD": "20101004-21:17:48",
        "PT": "J",
        "TV": "N",
        "PY": "2010",
        "PA": "360 PARK AVE SOUTH, NEW YORK, NY 10010-1710 USA",
        "PD": "SEP-OCT",
        "PI": "NEW YORK",
        "_id": "0002821097"
    }
}

For any information on this data challenge, or to register, please contact mdts /at/ sciencemapping /d0t/ com. You can also follow us on Twitter: @mdts11

Contributors to this page: Jean-Philippe Cointet , elishowk and David Chavalarias .
Page last modified on Thursday 07 April, 2011 12:28:54 by Jean-Philippe Cointet.