Configuration of GROBID
The configuration of GROBID is done via the file grobid/grobid-home/config/grobid.yaml. You will need to restart the GROBID service after modifying this file for the new configuration parameters to take effect.
Configuration of a Docker image
See here.
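As a quick illustration, a locally edited grobid.yaml can typically be mounted into the container so that it overrides the default configuration. A minimal docker-compose sketch, assuming the official image layout (the image tag and paths are illustrative and should be adapted to your deployment):
services:
  grobid:
    image: grobid/grobid:0.8.0
    ports:
      - "8070:8070"
    volumes:
      # mount a locally edited configuration over the default one inside the image
      - ./grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro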
Description of the configuration parameters
GROBID home
The GROBID home is the path where all the runtime GROBID resources are located (models, lexicon, etc.). The default is grobid-home/ and it should normally never be changed:
# where all the Grobid resources are stored (models, lexicon, native libraries, etc.), normally no need to change
grobidHome: "grobid-home"
By default, temporary files are currently written under grobid-home/tmp:
# path relative to the grobid-home path (e.g. tmp for grobid-home/tmp) or absolute path (/tmp)
temp: "tmp"
By default, native libraries are currently written under grobid-home/lib:
# normally nothing to change here, path relative to the grobid-home path (e.g. grobid-home/lib)
nativelibrary: "lib"
PDF processing
PDF processing is done by pdfalto, a command line executable. The choice to keep it as a command line tool, rather than a natively integrated library, is motivated by the requirement of robustness: at scale, PDF processing can lead to various issues, and it helps to keep it isolated in its own external process. The pdfalto parameters relate to security limits (aka circuit breakers). memoryLimitMb indicates the maximum amount of memory that the parsing of one PDF can use; the motivation is to avoid OOM errors and swapping with problematic PDFs. timeoutSec is the maximum runtime allowed for pdfalto to process one PDF.
pdf:
  pdfalto:
    # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
    path: "pdfalto"
    # security for PDF parsing
    memoryLimitMb: 6096
    timeoutSec: 60
The following parameters are related to the ALTO file generated by pdfalto. They are used to prevent processing documents that are too large.
# security relative to the PDF parsing result
blocksMax: 100000
tokensMax: 1000000
Consolidation
Consolidation and its configuration are described here.
Proxy
Optionally a proxy can be defined, which will be used when calling the CrossRef REST service or the biblio-glutton service, if selected.
proxy:
  # proxy to be used when doing external calls to the consolidation service
  host:
  port:
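For example, to route these external calls through a hypothetical proxy (host and port values are purely illustrative):
proxy:
  host: "proxy.example.org"
  port: 3128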
CORS
CORS for the GROBID web API service can be configured with the following YAML part:
# CORS configuration for the GROBID web API service
corsAllowedOrigins: "*"
corsAllowedMethods: "OPTIONS,GET,PUT,POST,DELETE,HEAD"
corsAllowedHeaders: "X-Requested-With,Content-Type,Accept,Origin"
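For example, to restrict cross-origin access to a single hypothetical front-end instead of allowing any origin (values are illustrative):
corsAllowedOrigins: "https://ui.example.org"
corsAllowedMethods: "OPTIONS,GET,POST"
corsAllowedHeaders: "X-Requested-With,Content-Type,Accept,Origin"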
Language processing implementation
GROBID uses external implementations for recognizing the language used in a publication and for performing sentence segmentation.
There is currently only one language recognition implementation (the Cybozu Language Detector) and two possible sentence segmenters (OpenNLP, the default, and the Pragmatic Segmenter).
# the actual implementation for language recognition to be used
languageDetectorFactory: "org.grobid.core.lang.impl.CybozuLanguageDetectorFactory"
# the actual implementation for optional sentence segmentation to be used (PragmaticSegmenter or OpenNLP)
#sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"
sentenceDetectorFactory: "org.grobid.core.lang.impl.OpenNLPSentenceDetectorFactory"
NOTE: While OpenNLP is about 60 times faster than the Pragmatic Segmenter, it performs "slightly" worse. The Pragmatic Segmenter runs with the JRuby interpreter.
Service configuration
The maximum number of threads used by the GROBID service can be set with the concurrency parameter. GROBID manages a pool of threads of the indicated size to ensure service stability and availability over time. If all the threads are in use, new requests are put in a queue and wait poolMaxWait seconds before trying to be assigned to a thread again. If the queue reaches a certain limit, depending on concurrency, a 503 response is sent back to the client, indicating that it should wait a bit before sending the request again. If you need more explanation on this mechanism, see here and the GROBID client implementations.
# maximum concurrency allowed to GROBID server for processing parallel requests - change it according to your CPU/GPU capacities
# for a production server running only GROBID, set the value slightly above the available number of threads of the server
# to get best performance and security
concurrency: 10
# when the pool is full, for queries waiting for the availability of a Grobid engine, this is the maximum time to wait to try
# to get an engine (in seconds) - normally never change it
poolMaxWait: 1
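For example, following the comments above, a dedicated production server with 16 available threads could be configured slightly above that number (values are illustrative, not a recommendation):
concurrency: 18
poolMaxWait: 1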
When executing the service, models can be loaded lazily (to save memory, if you plan to use only some services) or at server startup (to avoid slowing down the first query):
# for **service only**: how to load the models,
# false -> models are loaded when needed, avoiding putting useless models in memory (only in the case of CRF), but this slows down
# the service significantly at first call
# true -> all the models are loaded into memory at server startup (default), slowing the start of the service,
# and models not used will take some more memory (only in the case of CRF), but the server is immediately warm and ready
modelPreload: true
Finally, the following part specifies the port to be used by the GROBID web service:
server:
  type: custom
  applicationConnectors:
    - type: http
      port: 8070
  adminConnectors:
    - type: http
      port: 8071
  registerDefaultExceptionMappers: false
Wapiti global parameters
Under wapiti, we find the generic parameters of the Wapiti engine; currently only one is present. The following parameter applies only when training with CRF Wapiti and indicates how many threads to use when training a Wapiti model.
wapiti:
  # Wapiti global parameters
  # number of threads for training the wapiti models (0 to use all available processors)
  nbThreads: 0
DeLFT global parameters
Under delft, we find the generic parameters of the DeLFT engine. To use Deep Learning models, you will need an installation of the python library DeLFT, or you can use the Docker image. For a local build, use the following parameters to indicate the location of this installation and, optionally, the path to the virtual environment folder of this installation:
delft:
  # delft installation path if Deep Learning architectures are used to implement one of the sequence labeling models,
  # embeddings are usually compiled as lmdb under delft/data (this parameter is ignored if only feature-engineered CRF are used)
  install: "../delft"
  pythonVirtualEnv: "../delft/env"
Configuring the models
Each model has its own configuration indicating:
- which "engine" to use, with value wapiti for feature-based CRF or delft for Deep Learning models;
- for Deep Learning models, which neural architecture to use, with choices normally among BidLSTM_CRF, BidLSTM_CRF_FEATURES, BERT, BERT_CRF, BERT_CRF_FEATURES. The corresponding model/architecture combination needs to be available under grobid-home/models/. If it is not, you will need to train the model with this particular architecture.
Wapiti CRF training uses three parameters related to stopping criteria: window, epsilon and nbMaxIterations.
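For example, a model can be configured to stay on feature-engineered CRF Wapiti with explicit stopping criteria, along these lines (a sketch only; the model name and parameter values are illustrative, not recommendations):
models:
  - name: "affiliation-address"
    engine: "wapiti"
    wapiti:
      # training-time stopping criteria
      window: 20
      epsilon: 0.00001
      nbMaxIterations: 2000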
DeLFT models use traditional Deep Learning parameters, with the possibility to override the defaults via two specific parameter subgroups: training (parameters used during training) and runtime (parameters used for inference). With this distinction it is possible, for instance, to train with parameters adapted to a given GPU and to run inference on a smaller GPU.
For instance, the citation model is configured below to use a BidLSTM_CRF_FEATURES architecture with the indicated architecture-specific parameters:
models:
  # we configure here how each sequence labeling model should be implemented
  # for feature-engineered CRF, use "wapiti" and possible training parameters are window, epsilon and nbMaxIterations
  # for Deep Learning, use "delft" and select the target DL architecture (see the DeLFT library), the training
  # parameters then depend on this selected DL architecture
  - name: "citation"
    engine: "delft"
    delft:
      # deep learning parameters
      architecture: "BidLSTM_CRF_FEATURES"
      useELMo: false
      runtime:
        # parameters used at runtime/prediction
        max_sequence_length: 3000
        batch_size: 20
      training:
        # parameters used for training
        max_sequence_length: 3000
        batch_size: 30
A Deep Learning model comes with predefined word embeddings (e.g. glove-840B or word2vec) or a transformer pre-trained model name (e.g. bert-base-cased, allenai/scibert_scivocab_cased, reusing Hugging Face transformers Hub model names). They are part of the trained DL model and will be downloaded the first time they are used if not available in the local DeLFT install.
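As an illustration, a model could be switched to a BERT-family architecture by naming a Hugging Face model, along these lines (a sketch only; the transformer key follows recent grobid.yaml samples and may differ in your version, so check the sample configuration file shipped with your GROBID release):
- name: "citation"
  engine: "delft"
  delft:
    architecture: "BERT_CRF"
    # Hugging Face transformers Hub model name, downloaded on first use if not present locally (assumed key name)
    transformer: "allenai/scibert_scivocab_cased"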
Logging
Logging can be set with the following parameter group:
logging:
  level: INFO
  loggers:
    org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
    org.glassfish.jersey.internal: "OFF"
    com.squarespace.jersey2.guice.JerseyGuiceUtils: "OFF"
  appenders:
    - type: console
      threshold: WARN
      timeZone: UTC
      # uncomment to have the logs in json format
      #layout:
      #  type: json
    - type: file
      currentLogFilename: logs/grobid-service.log
      threshold: INFO
      archive: true
      archivedLogFilenamePattern: logs/grobid-service-%d.log
      archivedFileCount: 5
      timeZone: UTC
      # uncomment to have the logs in json format
      #layout:
      #  type: json