Project reference

If you want to cite this work, please simply refer to the GitHub project:

GROBID (2008-2023) <https://github.com/kermitt2/grobid>

Please do not include any particular person's name, so that the emphasis stays on the project and the tool!

We also ask you to cite the current project itself rather than any research paper (this might be rejected by reviewers, the editorial style, or editors, but at least you tried!).

Here's a BibTeX entry using the Software Heritage project-level permanent identifier:

@misc{GROBID,
    title = {GROBID},
    howpublished = {\url{https://github.com/kermitt2/grobid}},
    publisher = {GitHub},
    year = {2008--2023},
    archivePrefix = {swh},
    eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
}

Evaluation and usages

The following articles are provided for information only. Listing them does not mean that we agree with all their statements about Grobid (please refer to the present documentation for the actual features and capabilities of the tool) or with all the evaluation methodologies used, but they all explore interesting aspects related to Grobid.

Articles on CRF for bibliographical reference parsing

For archeological purposes, the following first paper was the main motivation and influence for starting GROBID. Many thanks to Fuchun Peng and Andrew McCallum.

Datasets

For end-to-end evaluation, we are making available a corpus of PDF/XML pairs at https://zenodo.org/record/7708580. It includes the original PMC_sample_1943 dataset, an updated version of bioRxiv 10k with additional annotations relevant for Grobid, and two additional evaluation sets from PLOS (1000 articles) and eLife (984 articles). See End-to-end evaluation for more details.

For layout/zoning identification:

Transformer/Layout joint approaches (open source)

Other

Created in the context of PdfPig, the following page is a great collection of resources on Document Layout Analysis: https://github.com/BobLd/DocumentLayoutAnalysis