Project reference

You can simply refer to the GitHub project:

GROBID (2008-2017)

(Please do not include a particular person's name, so that the emphasis stays on the project and the tool.)

Presentations on GROBID

GROBID in 30 slides (2015).

GROBID in 20 slides (2012).

Papers on GROBID

GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. P. Lopez. Proceedings of the 13th European Conference on Digital Libraries (ECDL), Corfu, Greece, 2009.

Automatic Extraction and Resolution of Bibliographical References in Patent Documents. P. Lopez. First Information Retrieval Facility Conference (IRFC), Vienna, May 2010. LNCS 6107, pp. 120-135. Springer, Heidelberg (2010).

Automatic Metadata Extraction: The High Energy Physics Use Case. Joseph Boyd. Master's thesis, EPFL, Switzerland, 2015.


M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp, Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents, in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, USA, 2013.

Phil Gooch and Kris Jack, How well does Mendeley’s Metadata Extraction Work?


Articles on CRF for bibliographical extraction

Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.

Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marrakesh, Morocco.

Other similar Open Source tools

CiteSeerX page on Scholarly Information Extraction, which lists many tools and related information.