Project reference
If you want to cite this work, please simply refer to the GitHub project:

GROBID (2008-2022) <https://github.com/kermitt2/grobid>

Please do not cite a particular person's name, so as to emphasize the project and the tool itself! We also ask you not to cite old research papers, but the current project itself, because, yes, we can try to cite a software project in the bibliographical references and not just mention it in a footnote ;) It might well (likely) be rejected by reviewers, the editorial style, or the editors, but at least you tried!
Here's a BibTeX entry using the Software Heritage project-level permanent identifier:

```bibtex
@misc{GROBID,
    title = {GROBID},
    howpublished = {\url{https://github.com/kermitt2/grobid}},
    publisher = {GitHub},
    year = {2008--2022},
    archivePrefix = {swh},
    eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
}
```
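As a minimal illustration, the entry above can be used in a standard LaTeX/BibTeX workflow; the file name `references.bib` below is just an example, and plain BibTeX styles may ignore the `archivePrefix`/`eprint` fields (they are kept for archival traceability):

```latex
% Minimal sketch: assumes the BibTeX entry above is saved as
% references.bib alongside this document.
\documentclass{article}
\usepackage{url}
\begin{document}
Metadata was extracted with GROBID~\cite{GROBID}.
\bibliographystyle{unsrt}
\bibliography{references}
\end{document}
```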
Evaluation and usages
The following articles are provided for information. This does not mean that we agree with all their statements about GROBID (please refer to the present documentation for the actual features and capacities of the tool) or with all the evaluation methodologies they use, but they all explore interesting aspects related to GROBID.
- M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp. Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, USA, 2013.
- Joseph Boyd. Automatic Metadata Extraction: The High Energy Physics Use Case. Master's Thesis, EPFL, Switzerland, 2015.
- Phil Gooch and Kris Jack. How well does Mendeley's Metadata Extraction Work? 2015.
- Meta-eval, 2015.
- D. Tkaczyk, A. Collins, P. Sheridan, and J. Beel. Evaluation and Comparison of Open Source Bibliographic Reference Parsers: A Business Use Case. arXiv:1802.01168, 2018.
- Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. S2ORC: The Semantic Scholar Open Research Corpus. arXiv:1911.02782, GitHub, 2019.
- CORD-19: The COVID-19 Open Research Dataset. 2020. https://pages.semanticscholar.org/coronavirus-research, arXiv:2004.10706. (See also here)
- Mark Grennan and Joeran Beel. Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and Cora. arXiv:2004.10410, 2020.
- J.M. Nicholson, M. Mordaunt, P. Lopez, A. Uppala, D. Rosati, N.P. Rodrigues, P. Grabitz, and S.C. Rife. scite: a smart citation index that displays the context of citations and classifies their intent using deep learning. bioRxiv preprint, 2021. doi: https://doi.org/10.1101/2021.03.15.435418
Articles on CRF for bibliographical reference parsing
For archaeological purposes: the first paper below was the main motivation and influence for starting GROBID.
- Fuchun Peng and Andrew McCallum. Accurate Information Extraction from Research Papers using Conditional Random Fields. In Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.
- Isaac G. Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marrakesh, Morocco, 2008.
Datasets
For end-to-end evaluation:
For layout/zoning identification:
Similar open source tools
Transformer/Layout joint approaches (open source)
Other
Created in the context of PdfPig, the following page is a great collection of resources on Document Layout Analysis: https://github.com/BobLd/DocumentLayoutAnalysis