Evaluation of GROBID against PubMed Central
Individual models can be evaluated as explained in Training the different models of Grobid.
For an end-to-end evaluation, covering the whole extraction process from the parsing of PDF to the end result of the cascading of several CRF models, GROBID includes an evaluation process against a set of PubMed Central articles. For its publications, PubMed Central provides both PDF and fulltext XML files in the NLM format. Keeping in mind some limits described below, it is possible to estimate the ability of Grobid to extract and normalize the content of the PDF documents so that it matches the quality of the NLM file.
Getting PubMedCentral gold-standard data
We are currently evaluating GROBID using the PMC_sample_1943 dataset compiled by Alexandru Constantin. The dataset is available at this URL (around 1.5GB in size). The sample dataset contains 1943 articles from 1943 different journals corresponding to the latest publications from a 2011 snapshot.
Any similar set of PubMed Central articles can normally be used, as long as it follows the same directory structure: one directory per article containing at least the corresponding PDF file and the reference NLM file.
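Before launching a long run, it can be useful to check that a dataset follows this layout. The following is only a minimal sketch: the helper name `find_incomplete_articles` and the `.pdf` and `.nxml` file extensions are assumptions made for illustration, not something GROBID prescribes.

```python
from pathlib import Path


def find_incomplete_articles(root, pdf_ext=".pdf", nlm_ext=".nxml"):
    """Return the names of article directories missing a PDF or an NLM file.

    The extensions are assumptions; adjust them to the dataset at hand.
    """
    incomplete = []
    for article_dir in sorted(Path(root).iterdir()):
        if not article_dir.is_dir():
            continue  # ignore stray files at the top level
        suffixes = {p.suffix.lower() for p in article_dir.iterdir() if p.is_file()}
        if pdf_ext not in suffixes or nlm_ext not in suffixes:
            incomplete.append(article_dir.name)
    return incomplete
```

Calling `find_incomplete_articles("PATH_TO_PMC/PMC_sample_1943")` would then list the article directories that are likely to be skipped or to fail during evaluation.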
We suppose in the following that the archive is decompressed under
Running and evaluating
From the grobid-trainer/ directory, the following command line is used to run and evaluate Grobid on the dataset:
> mvn compile exec:exec -PPubMedCentralEval -Dpmc=*PATH_TO_PMC/PMC_sample_1943* -Drun=1
The run parameter indicates whether GROBID has to be executed on all the PDFs of the dataset. The resulting TEI files are added to each article subdirectory. If you only want to run the evaluation without re-executing Grobid on the PDFs, set the parameter to 0:
> mvn compile exec:exec -PPubMedCentralEval -Dpmc=*PATH_TO_PMC/PMC_sample_1943* -Drun=0
It is also possible to evaluate only a fraction of the data, expressed as a number between 0 and 1 and set with the fileRatio parameter. For instance, if you want to evaluate Grobid against only 10% of the PubMed Central files, use:
> mvn compile exec:exec -PPubMedCentralEval -Dpmc=*PATH_TO_PMC/PMC_sample_1943* -Drun=0 -DfileRatio=0.1
The evaluation provides precision, recall and f-score for the different fields in the header and bibliographical references. The scores are also computed at instance level, that is, at the level of a complete header or a complete citation.
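As a reminder of how these three scores relate to each other, here is a minimal sketch computing them from raw counts. The function name and the example counts are hypothetical, and the exact counting rules applied by GROBID are not described in this page.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1-score from raw counts."""
    predicted = true_positives + false_positives   # everything the system extracted
    expected = true_positives + false_negatives    # everything in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for one field: 8 correct, 2 spurious, 2 missed.
p, r, f = precision_recall_f1(8, 2, 2)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

At instance level the same formulas apply, but a header or citation only counts as correct when all of its fields match.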
An experimental evaluation of the structures of the full text body is also provided. It is not reliable in its current state, because most of the full text annotations in PubMed Central are not uniform. For instance, the numbering of a section header is sometimes included in the section header annotation and sometimes not. The PubMed Central annotations will need to be standardized in a pre-processing step before a meaningful evaluation is possible, which is a task planned for the next releases.
The evaluation covers four different string matching techniques for textual fields, based on the existing evaluation approaches observed in the literature:
strict, i.e. exact match,
soft, i.e. matching that ignores punctuation, character case and space character mismatches,
relative Levenshtein distance, i.e. the Levenshtein distance normalized by the length of the longer of the two strings.
These matching variants only apply to textual fields, not to numerical and date fields (such as volume, issue, dates, pages).
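To make the strict, soft and relative Levenshtein variants concrete, they can be sketched as follows. This is an illustration under stated assumptions, not GROBID's actual implementation: the function names, the ASCII-only punctuation handling and the `min_similarity` threshold of 0.8 are all hypothetical choices.

```python
import string


def strict_match(expected, extracted):
    """Exact string equality."""
    return expected == extracted


def _normalize(text):
    # Fold case and drop ASCII punctuation and whitespace (assumed normalization).
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in text.lower() if ch not in drop)


def soft_match(expected, extracted):
    """Equality after ignoring punctuation, case and spacing."""
    return _normalize(expected) == _normalize(extracted)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_levenshtein_match(expected, extracted, min_similarity=0.8):
    """Match when 1 - distance / max length reaches the (assumed) threshold."""
    longest = max(len(expected), len(extracted))
    if longest == 0:
        return True  # two empty strings are identical
    similarity = 1.0 - levenshtein(expected, extracted) / longest
    return similarity >= min_similarity
```

With these definitions, a title that differs from the reference only by a trailing period fails the strict match but passes the soft one, while a single-character OCR error in a long title passes only the relative Levenshtein match.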
A significant number of citations in the NLM files are encoded only as raw strings, for example in the first file of the set:
...
<ref id="CR9">
  <label>9.</label>
  <mixed-citation publication-type="other">Piraña and PCluster: a modeling environment and cluster infrastructure for NONMEM. Keizer RJ, van Benten M, Beijnen JH, Schellens JH, Huitema AD. Comput Methods Programs Biomed. 2011;101(1):72–9.</mixed-citation>
</ref>
<ref id="CR10">
  <label>10.</label>
  <mixed-citation publication-type="other">Holford, N. VPC, the visual predictive check—superiority to standard diagnostic (Rorschach) plots. In: PAGE 2005. 2005.</mixed-citation>
</ref>
...
(this file contains for instance 3 non-encoded citations out of 18)
As a consequence, the fields extracted by GROBID for these citations will not match any reference 'expected' values and will all be counted as false positives. The scores for the citation structures are thus lower than the actual performance of the system.
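To get a rough idea of how much a given NLM file is affected, the raw-string citations can be counted. The sketch below is an assumption-laden illustration: the function name is hypothetical, only `<mixed-citation>` elements are considered, a citation is treated as raw when it has no child elements, and the sample reference list is made up for the example.

```python
import xml.etree.ElementTree as ET


def count_raw_citations(nlm_xml):
    """Return (raw, total) counts of <mixed-citation> elements.

    A citation is considered raw when it contains no child elements,
    i.e. no field-level markup that extracted fields could be compared to.
    """
    root = ET.fromstring(nlm_xml)
    citations = list(root.iter("mixed-citation"))
    raw = sum(1 for citation in citations if len(citation) == 0)
    return raw, len(citations)


# Hypothetical reference list: one raw citation, one structured citation.
sample = """<ref-list>
  <ref id="R1">
    <label>1.</label>
    <mixed-citation publication-type="other">Doe J. A raw reference string. 2011.</mixed-citation>
  </ref>
  <ref id="R2">
    <label>2.</label>
    <mixed-citation publication-type="journal"><article-title>A structured reference</article-title> <year>2011</year></mixed-citation>
  </ref>
</ref-list>"""

print(count_raw_citations(sample))
```

On this sample the function reports one raw citation out of two. Real NLM files may declare namespaces or entities that need extra handling before they can be parsed this way.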