Introduction

Purpose

GROBID (or Grobid, but not GroBid nor GroBiD) means GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents, with a particular focus on technical and scientific publications. Development started in 2008 as a hobby project. In 2011 the tool was made available as open source, and work on GROBID has continued steadily as a side project since then; it is expected to continue as such.

The following functionalities are available:

  • Header extraction and parsing from articles in PDF format. The extraction covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords).
  • References extraction and parsing from articles in PDF format, with around 0.87 f-score on an independent PubMed Central set of 1,943 PDFs containing 90,125 references. All the usual publication metadata are covered (including DOI).
  • Citation context recognition and linking to the full bibliographical references of the article. The accuracy of citation context resolution is above 0.77 f-score (which corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
  • Parsing of references in isolation (above 0.90 f-score).
  • Parsing of names (e.g. person title, forenames, middle names), in particular author names in the header and author names in references (two distinct models).
  • Parsing of affiliation and address blocks.
  • Parsing of dates, with ISO-normalized day, month and year.
  • Full text extraction and structuring from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference callout, figure, table, etc.).
  • Consolidation/resolution of the extracted bibliographical references using the biblio-glutton service or the CrossRef REST API. In both cases, DOI resolution from PDF extraction reaches an f-score higher than 0.95.
  • Extraction and parsing of patent and non-patent references in patent publications.
  • PDF coordinates for extracted information, allowing the creation of "augmented" interactive PDFs.

GROBID includes batch processing, a comprehensive web service API, a Java API, a Docker container, a relatively generic evaluation framework (precision, recall, n-fold cross-validation, etc.) and semi-automatic generation of training data.
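
The web service API can be exercised with any HTTP client. As a minimal sketch, assuming a GROBID server running locally on the default port 8070 (the class name and the wiring around the call are illustrative, not an official client), the documented /api/processCitation endpoint parses a raw reference string and returns a TEI-encoded result:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class GrobidCitationDemo {
    public static void main(String[] args) throws Exception {
        // Assumption: a GROBID server is running locally on the default port.
        String base = "http://localhost:8070";
        // Example raw reference string to be parsed.
        String rawRef = "Lopez P. (2009) GROBID: Combining Automatic Bibliographic Data "
                + "Recognition and Term Extraction for Scholarship Publications. ECDL 2009.";

        String form = "citations=" + URLEncoder.encode(rawRef, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(base + "/api/processCitation"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The response body is a TEI <biblStruct> describing the parsed reference.
        System.out.println(response.body());
    }
}
```

The same pattern applies to the PDF-oriented endpoints (header, references, full text), which take the PDF as a multipart upload instead of a form string.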

In a complete PDF processing, GROBID manages 55 final labels used to build relatively fine-grained structures, from traditional publication metadata (title, author first/middle/last names, affiliation types, detailed address, journal, volume, issue, pages, DOI, PMID, etc.) to full text structures (section title, paragraph, reference markers, head/foot notes, figure headers, etc.).

GROBID can be considered production-ready. Deployments in production include ResearchGate, HAL Research Archive, the European Patent Office, INIST-CNRS, Mendeley, CERN (Invenio), Internet Archive, and many others.

The key aspects of GROBID are the following:

  • Written in Java, with JNI calls to native CRF libraries and/or Deep Learning libraries via a Python JNI bridge.
  • Speed: on a low-profile Linux machine (8 threads), header extraction from 4,000 PDFs takes 2 minutes (36 PDFs per second with the RESTful API), parsing of 3,500 references takes 4 seconds, and full processing of 4,000 PDFs (body, header and references, fully structured) takes 26 minutes (around 2.5 PDFs per second).
  • Scalability and robustness: the complete fulltext processing has been run at around 10.6 PDFs per second (around 915,000 PDFs per day, around 20M pages per day) during one week on a single 16-CPU machine (16 threads, 32GB RAM, no SSD, articles from mainstream publishers), see here (11.3M PDFs were processed in 6 days by 2 servers without a crash). A concurrent client sketch is given after this list.
  • Lazy loading of models and resources: depending on the selected process, only the required data are loaded in memory. For instance, extracting only the header metadata from a PDF requires less than 2GB of memory in a multithreaded setting, extracting citations around 3GB, and extracting all the PDF structures around 4GB.
  • Robust and fast PDF processing with pdfalto, based on xpdf, and dedicated post-processing.
  • Modular and reusable machine learning models for sequence labelling. The default extractions are based on Linear Chain Conditional Random Fields, with the possibility to use various Deep Learning architectures for sequence labelling (including ELMo and BERT-CRF). The specialized sequence labelling models are cascaded to build a complete (hierarchical) document structure.
  • Full encoding in TEI, both for the training corpus and the parsed results.
  • Optional consolidation of extracted bibliographical data via online calls to biblio-glutton or the CrossRef REST API, with export to OpenURL, BibTeX, etc. for easier integration into Digital Library environments. For scalability, reliability and accuracy, we recommend using biblio-glutton when possible (see the snippet after this list).
  • Rich bibliographical processing: fine-grained parsing of author names, dates, affiliations, addresses, etc., but also, for instance, quite reliable automatic attachment of affiliations and emails to authors.
  • Automatic generation of pre-annotated training data from new PDF documents based on the current models, to support semi-automatic creation of training data.
  • Support for CJK and Arabic languages based on customized Lucene analyzers provided by WIPO.
  • PDF coordinates for extracted information, allowing the creation of "augmented" interactive PDFs.
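
Throughput figures such as the ones above rely on calling the service concurrently. As a minimal illustration (not the project's official client; the class name, pool size and output handling are assumptions), the following sketch posts every PDF in a directory to the documented /api/processHeaderDocument endpoint from a fixed thread pool:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class GrobidHeaderBatch {
    // Assumption: a GROBID server is running locally on the default port.
    static final String BASE = "http://localhost:8070";
    static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Client-side thread pool; size it to match the server's capacity.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try (Stream<Path> files = Files.list(Paths.get(args[0]))) {
            files.filter(p -> p.toString().endsWith(".pdf"))
                 .forEach(p -> pool.submit(() -> {
                     try {
                         String tei = processHeader(p);
                         System.out.println(p.getFileName() + ": " + tei.length() + " chars of TEI");
                     } catch (Exception e) {
                         System.err.println(p.getFileName() + ": " + e.getMessage());
                     }
                 }));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Uploads one PDF to the header extraction service as multipart/form-data,
    // with the PDF in the "input" field, and returns the TEI response.
    static String processHeader(Path pdf) throws IOException, InterruptedException {
        String boundary = "----grobid" + System.nanoTime();
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(("--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"input\"; filename=\""
                + pdf.getFileName() + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n").getBytes(StandardCharsets.UTF_8));
        body.write(Files.readAllBytes(pdf));
        body.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));

        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE + "/api/processHeaderDocument"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
                .build();
        return CLIENT.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```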

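The consolidation mentioned in the list can be requested per call through a form parameter. Building on the processCitation example earlier (the parameter name follows the GROBID service documentation, but treat it as an assumption to check against your server version):

```java
// Same call as in the processCitation example, with one extra form parameter:
// the server then matches the parsed reference against its configured
// consolidation service (biblio-glutton or the CrossRef REST API).
String form = "citations=" + URLEncoder.encode(rawRef, StandardCharsets.UTF_8)
        + "&consolidateCitations=1";
```
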
The GROBID extraction and parsing algorithms use by default a fork of the Wapiti CRF library. On Linux (64 bits) and macOS, as an alternative, it is possible to perform the sequence labelling with DeLFT deep learning models (typically BidLSTM-CRF with or without ELMo, or BERT-CRF, with additional feature channels) instead of Wapiti CRF models, using a native integration via JEP. The native libraries, in particular TensorFlow, are transparently integrated as JNI with dynamic calls based on the current OS. However, the best Deep Learning algorithms for sequence labelling have so far provided relatively limited improvements compared to CRF, while being considerably slower (10 to 1,000 times slower, even with a GPU) and memory-intensive. They should be used when accuracy is the main priority, at the price of reduced scalability. See the related benchmarking.

GROBID should run properly "out of the box" on Linux (32 and 64 bits) and macOS.

Credits

The main author is Patrice Lopez (patrice.lopez@science-miner.com).

Core committers and maintenance: Patrice Lopez (science-miner) and Luca Foppiano (NIMS).

Many thanks to:

  • Vyacheslav Zholudev (Sum&Substance, formerly at ResearchGate)
  • Achraf Azhar (CCSD)
  • Daniel Ecer (eLife)
  • Laurent Romary (Inria)
  • Vitalii Bezsheiko (PKP)
  • Bryan Newbold (Internet Archive)
  • Christopher Boumenot (Microsoft) in particular for the Windows support
  • CERN contributors Andreas la Roi and Micha Moskovic
  • Florian Zipser (Humboldt University) who developed the first historical version of the REST API in 2011
  • the other contributors from ResearchGate: Michael Häusler, Kyryl Bilokurov, Artem Oboturov
  • Damien Ridereau (Infotel)
  • Oliver Kopp (JabRef research)
  • Bruno Pouliquen (WIPO) for the custom analyzers for Eastern languages
  • Thomas Lavergne, Olivier Cappé and François Yvon for Wapiti
  • The JEP team for their great JVM CPython embedding solution
  • Taku Kudo for CRF++ (not used anymore, but all the same, thanks!)
  • Hervé Déjean and his colleagues from Xerox Research Centre Europe, for xml2pdf
  • and the other contributors: @elonzh, Jakob Fix, Tanti Kristanti, Bryan Newbold, Dmitry Katsubo, Phil Gooch, Romain Loth, Maud Medves, Chris Mattmann, Sujen Shah, Joseph Boyd, Guillaume Muller, ...