GROBID batch mode

For the best performance, for benchmarking, and to exploit multithreading, we recommend using the service mode (see Use GROBID as a service) rather than the batch mode.

Using the batch mode

Go under the grobid/grobid-core/ directory where the core library has been built:

> cd grobid/grobid-core/

The following command displays help for the batch commands:

> java -jar target/grobid-core-<current version>.one-jar.jar -h

Be sure to replace <current version> with the current version of GROBID that you have installed and built.
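
To avoid hard-coding the version, the jar can be looked up in the build directory. The following is a minimal sketch, assuming the jar follows the grobid-core-<current version>.one-jar.jar naming shown above; the function name find_grobid_jar is hypothetical and not part of GROBID.

```shell
# Hypothetical helper (not part of GROBID): print the path of the first
# one-jar artifact found in the given build directory (default: target),
# or return 1 if no such file exists.
find_grobid_jar() {
  build_dir="${1:-target}"
  for jar in "$build_dir"/grobid-core-*.one-jar.jar; do
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Usage, from grobid/grobid-core/ (java only runs if a jar was found):
# jar="$(find_grobid_jar target)" && java -jar "$jar" -h
```

If several versions have been built, the sketch returns the first match in glob order, which may not be the most recent one.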

The available batch commands are listed below. For these commands, at least -Xmx1G should be used to set the JVM memory and avoid an OutOfMemoryError, given the current size of the GROBID models and the unpredictability of some PDF files. For complete full-text processing, which involves all the GROBID models, -Xmx4G is recommended (although allocating less memory is usually fine).
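
As an illustration, the heap sizes used in the examples on this page can be collected in a small lookup: -Xmx4G for processFullText and createTraining, -Xmx2G for processReferences and processPDFAnnotation, and -Xmx1G for everything else. This is only a sketch of those example values, not a GROBID setting; the function name grobid_heap_for is hypothetical.

```shell
# Hypothetical helper (not part of GROBID): map a batch command name to the
# -Xmx value used in the examples of this page.
grobid_heap_for() {
  case "$1" in
    processFullText|createTraining)
      echo "-Xmx4G" ;;   # commands that load all the models
    processReferences|processPDFAnnotation)
      echo "-Xmx2G" ;;
    *)
      echo "-Xmx1G" ;;   # minimum suggested for any batch command
  esac
}

# Usage (paths are placeholders):
# java "$(grobid_heap_for processFullText)" -jar target/grobid-core-0.4.2.one-jar.jar \
#   -gH ../grobid-home -dIn /path/to/input/directory -exe processFullText
```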

processHeader

The processHeader batch command extracts, structures and normalises in TEI the header of PDF files. The output is a TEI file corresponding to the structured article header. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input PDF files

  • -dOut: path to the output directory (if omitted the current directory)

  • -r: recursive processing of files in the sub-directories (by default not recursive)

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH ../grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processHeader 

WARNING: the expected extension of the PDF files to be processed is .pdf
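
Because of this extension requirement, it can help to check the input directory before launching a run. The sketch below lists the files whose name does not end in lowercase .pdf (for instance paper.PDF), so that they can be renamed first. The function name is hypothetical, and the check is recursive on the assumption that -r may be used.

```shell
# Hypothetical pre-flight check (not part of GROBID): list the regular files
# under the given directory whose name does not end in lowercase ".pdf".
list_unexpected_extensions() {
  find "$1" -type f ! -name '*.pdf'
}

# Usage:
# list_unexpected_extensions /path/to/input/directory
```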

processFullText

The processFullText batch command extracts, structures and normalizes in TEI the full text of PDF files. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input PDF files

  • -dOut: path to the output directory (if omitted the current directory)

  • -r: recursive processing of files in the sub-directories (by default not recursive)

  • -ignoreAssets: do not extract and save the PDF assets (bitmaps, vector graphics), by default the assets are extracted and saved

  • -teiCoordinates: output a subset of the identified structures with coordinates in the original PDF, by default no coordinates are present

Example:

> java -Xmx4G -jar target/grobid-core-0.4.2.one-jar.jar -gH ../grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processFullText 

WARNING: the expected extension of the PDF files to be processed is .pdf

processDate

processDate batch command will parse and format in XML/TEI the date given as string input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -s: the input date to format as raw string

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -exe processDate -s "some date to extract and format"

processAuthorsHeader

processAuthorsHeader batch command will parse and format in TEI the authors given in input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -s: the input header author sequence as raw string

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -exe processAuthorsHeader -s "some authors"

processAuthorsCitation

processAuthorsCitation batch command will parse and format in TEI the authors given in input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -s: the input citation author sequence as raw string

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -exe processAuthorsCitation -s "some authors"

processAffiliation

processAffiliation batch command will parse and format in TEI the affiliation/address given in input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -s: the input affiliation/address as raw string

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -exe processAffiliation -s "some affiliation"

processRawReference

processRawReference batch command will parse and format in TEI the raw bibliographical reference given in input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -s: the input bibliographical reference in raw text

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -exe processRawReference -s "a reference string"
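
The string-based commands above (processDate, processAuthorsHeader, processAuthorsCitation, processAffiliation and processRawReference) all take the same -gH, -exe and -s arguments, so they can be wrapped in one small function. This is a minimal sketch: the variables JAVA_BIN, GROBID_JAR_PATH and GROBID_HOME_PATH and the function grobid_string are hypothetical names introduced for this example, and the default paths are placeholders to adapt.

```shell
# Hypothetical settings for this sketch (not GROBID options); adjust the
# paths to your installation.
JAVA_BIN="${JAVA_BIN:-java}"
GROBID_JAR_PATH="${GROBID_JAR_PATH:-target/grobid-core-0.4.2.one-jar.jar}"
GROBID_HOME_PATH="${GROBID_HOME_PATH:-../grobid-home}"

# Run one of the string-based batch commands.
#   $1: command name passed to -exe (for example processDate)
#   $2: raw input string passed to -s
grobid_string() {
  "$JAVA_BIN" -Xmx1G -jar "$GROBID_JAR_PATH" -gH "$GROBID_HOME_PATH" \
    -exe "$1" -s "$2"
}

# Usage:
# grobid_string processDate "some date to extract and format"
# grobid_string processAffiliation "some affiliation"
```

Each call starts a new JVM and loads the models again, so for many inputs the service mode is likely a better fit.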

processReferences

The processReferences batch command will process, extract and format in TEI all the references in the PDF files present in the directory given in input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input PDF files

  • -dOut: path to the output directory (if omitted the current directory)

  • -r: recursive processing of files in the sub-directories (by default not recursive)

Example:

> java -Xmx2G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processReferences

WARNING: the expected extension of the PDF files to be processed is .pdf

processCitationPatentTEI

processCitationPatentTEI batch command will process, extract and format the citations in the patents encoded in TEI given in input (we assume here the TEI PDM format for patent "fulltexts"). The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input tei files

  • -dOut: path to save the tei annotated data

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTEI

WARNING: extension of the TEI files to be processed must be .tei or .tei.xml

processCitationPatentST36

processCitationPatentST36 batch command will process, extract and format the citations in the patents encoded in ST.36 given in input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input xml files

  • -dOut: path to save the tei results

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentST36

WARNING: extension of the ST.36 files to be processed must be .xml

processCitationPatentTXT

processCitationPatentTXT batch command will process, extract and format the citations in the patents encoded in UTF-8 text given in input. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input text files

  • -dOut: path to save the tei results

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentTXT

WARNING: extension of the text files to be processed must be .txt, and expected encoding is UTF-8
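
The encoding requirement can be checked before a run. The sketch below lists the .txt files of a directory that are not valid UTF-8, assuming the standard iconv utility is available; the function name list_non_utf8 is hypothetical.

```shell
# Hypothetical pre-flight check (not part of GROBID): print the .txt files
# of the given directory that iconv cannot read as UTF-8.
list_non_utf8() {
  for f in "$1"/*.txt; do
    [ -f "$f" ] || continue          # skip the unexpanded glob if no match
    if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
      printf '%s\n' "$f"
    fi
  done
}

# Usage:
# list_non_utf8 /path/to/input/directory
```

Files reported by this check could then be converted with iconv from their actual encoding before being processed.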

processCitationPatentPDF

The processCitationPatentPDF batch command will process, extract and format the citations in the patents given in input as PDF files. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input pdf files

  • -dOut: path to save the tei results

Example:

> java -Xmx1G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe processCitationPatentPDF

WARNING: extension of the PDF files to be processed must be .pdf

createTraining

The createTraining batch command generates GROBID training data files for all the models from PDF files. The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input pdf files

  • -dOut: path to save the generated training data

Example:

> java -Xmx4G -jar target/grobid-core-0.4.2.one-jar.jar -gH /path/to/Grobid/grobid/grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -exe createTraining

WARNING: the expected extension of the PDF files to be processed is .pdf

processPDFAnnotation

The batch command processPDFAnnotation adds annotations to the PDF files. These annotations correspond to the citation information: more precisely, PDF "goto" annotations for reference callouts in the article text, and URL link annotations for the bibliographical section (by default to the DOI registry when the DOI is recognized, to the arXiv article when the arXiv id is recognized, or to the indicated URL if present in the reference).

The needed parameters for that command are:

  • -gH: path to grobid-home directory

  • -dIn: path to the directory of input PDF files

  • -dOut: path to save the PDF result files

Example:

> java -Xmx2G -jar target/grobid-core-0.4.2.one-jar.jar -gH ../grobid-home -dIn /path/to/input/directory -dOut /path/to/output/directory -r -exe processPDFAnnotation

WARNING: extension of the PDF files to be processed must be .pdf