Project reference
If you want to cite this work, please simply refer to the GitHub project:
GROBID (2008-2021) <https://github.com/kermitt2/grobid>
Please do not include any particular person's name, so as to emphasize the project and the tool!
We also ask you not to cite old research papers, but the current project itself, because, yes, a software project can be cited in the bibliographical references and not just mentioned in a footnote ;)
Here's a BibTeX entry using the Software Heritage project-level permanent identifier:
@misc{GROBID,
    title = {GROBID},
    howpublished = {\url{https://github.com/kermitt2/grobid}},
    publisher = {GitHub},
    year = {2008--2021},
    archivePrefix = {swh},
    eprint = {1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c}
}
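For reference, the archivePrefix and eprint fields above combine into the full Software Heritage identifier (SWHID). If you want to check the archived snapshot yourself, the identifier should resolve directly in the Software Heritage archive:

swh:1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c
https://archive.softwareheritage.org/swh:1:dir:dab86b296e3c3216e2241968f0d63b68e8209d3c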
Presentations on GROBID
GROBID in 30 slides (2015).
GROBID in 20 slides (2012).
P. Lopez. Automatic Extraction and Resolution of Bibliographical References in Patent Documents. First Information Retrieval Facility Conference (IRFC), Vienna, May 2010. LNCS 6107, pp. 120-135. Springer, Heidelberg, 2010.
Evaluation and usage
The following articles are provided for information. Listing them does not mean that we agree with all their statements about GROBID (please refer to the present documentation for the actual features and capabilities of the tool) or with all the various methodologies used for evaluation, but they all explore interesting aspects of GROBID.
- M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp. Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, USA, 2013.
- Joseph Boyd. Automatic Metadata Extraction: The High Energy Physics Use Case. Master Thesis, EPFL, Switzerland, 2015.
- Phil Gooch and Kris Jack. How well does Mendeley's Metadata Extraction Work?, 2015.
- Meta-eval, 2015.
- D. Tkaczyk, A. Collins, P. Sheridan, and J. Beel. Evaluation and Comparison of Open Source Bibliographic Reference Parsers: A Business Use Case. arXiv:1802.01168, 2018.
- Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. S2ORC: The Semantic Scholar Open Research Corpus. arXiv:1911.02782, GitHub, 2019.
- CORD-19: The COVID-19 Open Research Dataset. https://pages.semanticscholar.org/coronavirus-research, arXiv:2004.10706, 2020.
- Mark Grennan and Joeran Beel. Synthetic vs. Real Reference Strings for Citation Parsing, and the Importance of Re-training and Out-Of-Sample Data for Meaningful Evaluations: Experiments with GROBID, GIANT and Cora. arXiv:2004.10410, 2020.
Articles on CRF for bibliographical reference parsing
- Fuchun Peng and Andrew McCallum. Accurate Information Extraction from Research Papers using Conditional Random Fields. In Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.
- Isaac G. Councill, C. Lee Giles, and Min-Yen Kan. ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, 2008.
Other similar Open Source tools
The CiteSeerX page on Scholarly Information Extraction lists tools and related information (now outdated).