General principles

The annotation guidelines describe the rules to follow for creating additional training data to be used with Grobid.

This may be of interest if the current models do not correctly recognize structures in your own PDF files. Grobid lets you improve the quality of its models by training them to recognize more structures. The following section describes that process.

Generating pre-annotated training data

Training data is not added to Grobid from scratch, but from pre-annotated training data generated by the existing Grobid models. This ensures that the syntax of the new training data is (normally) correct and that the stream of text is easy to align with the text extracted from the PDF. It also takes advantage of the existing models, which already annotate a certain amount of text correctly, so that the annotator can focus on corrections and work more productively.

To generate pre-annotated training files for Grobid based on the existing models, see the instructions for running the software in batch mode.

After running the createTraining batch on a set of PDF files, each article comes with:

  • the PDF used to generate the training data, for instance toto.pdf,

  • a set of pre-annotated XML files with an extension *.training.*.tei.xml for the different GROBID models: e.g. toto.training.header.tei.xml for the pre-annotated header section, toto.training.fulltext.tei.xml for the pre-annotated body section (so-called fulltext model), etc. These files need to be edited and corrected to obtain the "gold-standard" training data used by GROBID,

  • a set of files without XML extension, containing the list of tokens with the associated features to be used for training: e.g. toto.training.header for the header model, toto.training.fulltext for the fulltext model. These files can be ignored and should not be edited, but are necessary for training Grobid.
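As a rough illustration of the three kinds of files listed above, the following Python sketch groups the content of an output directory by article and by role. The function name `split_training_files` and the role labels are hypothetical and not part of Grobid; the only assumption taken from the text is that generated names contain `.training.` and that the editable files end in `.xml`.

```python
from collections import defaultdict


def split_training_files(file_names):
    """Group createTraining output by article stem and by role.

    Roles (hypothetical labels, not Grobid terminology):
      "pdf"      - the source PDF
      "xml"      - pre-annotated XML files, to be corrected by hand
      "features" - token/feature files, to be left untouched
    """
    articles = defaultdict(lambda: {"pdf": [], "xml": [], "features": []})
    for name in file_names:
        if ".training." in name:
            # Everything before ".training." identifies the article.
            stem = name.split(".training.", 1)[0]
            role = "xml" if name.endswith(".xml") else "features"
        elif name.lower().endswith(".pdf"):
            stem, role = name[:-4], "pdf"
        else:
            continue  # unrelated file, ignored
        articles[stem][role].append(name)
    return dict(articles)
```

Passing the result of `os.listdir()` on the output directory gives, for each article, the list of XML files that still need manual correction.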

The exact list of generated files depends on the structures occurring in the article, so it is not unusual for some of the training file types listed below to be missing. The following is a complete list of training files that can be produced:

| File name | Model | Purpose of the pre-annotated file |
|---|---|---|
| *.training.segmentation.tei.xml | segmentation | for the initial model used to segment a complete article into the principal zones |
| *.training.header.tei.xml | header | a pre-annotated file for the header model |
| *.training.header.affiliation.tei.xml | affiliation-address | for the detailed affiliation and address recognition |
| *.training.header.authors.tei.xml | header | for the detailed authors recognition in the header |
| *.training.header.date.xml | date | for the detailed structure of dates appearing in the header |
| *.training.header-references.xml | header | for the detailed bibliographical reference segment structure if one appears in the header |
| *.training.fulltext.tei.xml | fulltext | for the structured body |
| *.training.figure.tei.xml and *.training.table.tei.xml | figure, table | for the different figures and tables |
| *.training.references.referenceSegmenter.tei.xml | reference-segmenter | for the reference-segmenter model (segment a bibliographical section into individual reference entries) |
| *.training.references.tei.xml | fulltext | for all the bibliographical references of the article |
| *.training.references.authors.tei.xml | citation | for all the authors appearing in the bibliographical references of the article |

These files must be reviewed and corrected manually before being added to the training data. Note that exploiting any additional training data requires GROBID to re-create its models by retraining them.

Correcting pre-annotated files

The most important principle when correcting the pre-annotated training data is to keep the stream of text untouched. Only the tags can be moved; the text itself shall not be modified or corrected. The stream of text present in the training file after extraction of the content of the PDF is similar to the stream of text Grobid will have to process once the models are (re)created. It is thus important to have Grobid trained on this real-world input, even if it contains OCR errors, noise, unknown Unicode characters, etc.

There are two exceptions to this main rule:

  • actual end-of-lines from the PDF files are indicated by the element <lb/>. These <lb/> tags should be considered part of the stream of text and should not be moved or removed with respect to the overall text stream.

  • in the TEI/XML files, an end-of-line is equivalent to a space character. It is thus possible to add or remove end-of-line characters as long as the spacing is preserved. For instance for GROBID:

    <title level="a">In XML training files, end-of-line and space are the same</title> <lb/> <author>Kermitt Jr</author> <lb/> <date>2017</date> 

is equivalent to

    <title level="a">In XML training files, end-of-line and space are
    the same</title> <lb/>
    <author>Kermitt Jr</author>
    <lb/>

    <date>2017</date>
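The rule above (tags may move, text and <lb/> may not, and end-of-line counts as a space) can be checked mechanically. The following is a minimal sketch, not a Grobid tool: the names `text_stream`, `same_text_stream` and `LB_TOKEN` are hypothetical, and the sketch assumes each file is well-formed XML with a single root element. It flattens a file into whitespace-separated tokens, keeps one placeholder per <lb/> element, and compares the result before and after correction.

```python
import xml.etree.ElementTree as ET

LB_TOKEN = "<lb/>"  # placeholder kept in the stream for each <lb/> element


def text_stream(xml_string):
    """Return the text of an XML document as a list of tokens.

    Tags are dropped, except <lb/>, which is kept as a placeholder
    because it belongs to the text stream.
    """
    root = ET.fromstring(xml_string)
    parts = []

    def walk(elem):
        # Ignore a namespace prefix such as "{http://www.tei-c.org/ns/1.0}lb".
        if elem.tag.rsplit("}", 1)[-1] == "lb":
            parts.append(LB_TOKEN)
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(root)
    # split() without arguments treats spaces, tabs and newlines alike,
    # so an end-of-line is equivalent to a space.
    return "".join(parts).split()


def same_text_stream(original_xml, corrected_xml):
    """True if only tags or line breaks in the markup were changed."""
    return text_stream(original_xml) == text_stream(corrected_xml)
```

Applied to the two snippets above (each wrapped in a root element), `same_text_stream` returns `True`. It returns `False` if a word is corrected, if a space is added or dropped between tokens, or if an <lb/> is removed.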

In the standard Grobid installation, examples of existing annotations can be found under grobid-trainer/resources/ for each model.

In the current correction process, XML files can only be corrected or discarded:

  • if a file was produced incorrectly, e.g. *.training.header-references.xml was generated because a chunk of text in the header was wrongly identified as a reference, the XML file must be removed;

  • if a file is missing, e.g. a chunk of text was a reference but Grobid pre-annotated it as a note, the additional XML file *.training.header-references.xml shall not be created by hand.

XML files are therefore either modified or deleted, but never created.
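The "modified or deleted, but never created" rule can be verified by comparing file names before and after correction. The helper below is a hypothetical sketch, not part of Grobid: it assumes you kept a listing of the files originally generated by createTraining, and it reports any XML file in the corrected set that was not in that listing.

```python
def created_xml_files(generated_names, corrected_names):
    """Return XML files present after correction that were never generated.

    An empty result means the corrected set only contains files that
    createTraining produced (possibly modified, possibly fewer of them).
    """
    generated = {n for n in generated_names if n.endswith(".xml")}
    corrected = {n for n in corrected_names if n.endswith(".xml")}
    return sorted(corrected - generated)
```

Deleted files are deliberately not reported, since removing an incorrectly produced file is allowed.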

In the next sections, the annotation guidelines for each model are presented with various examples.