Configuration of GROBID
The configuration of GROBID is done via the file grobid/grobid-home/config/grobid.yaml. You will need to restart the GROBID service after modifying the file for the new configuration parameters to become active.
Configuration of a Docker image
Description of the configuration parameters
The GROBID home is the path where all the runtime GROBID resources are located (models, lexicon, etc.). The default is grobid-home/ and it should normally never be changed:
```yaml
# where all the Grobid resources are stored (models, lexicon, native libraries, etc.), normally no need to change
grobidHome: "grobid-home"
```
By default temporary files are currently written under
```yaml
# path relative to the grobid-home path (e.g. grobid-home/tmp)
temp: "tmp"
```
By default native libraries are currently written under
```yaml
# normally nothing to change here, path relative to the grobid-home path (e.g. grobid-home/lib)
nativelibrary: "lib"
```
PDF processing is done by pdfalto, a command line executable. The choice to keep it as a command line tool, and not as a natively integrated library, is motivated by the requirement of robustness: at scale, PDF processing can lead to various issues, and it helps to keep it isolated in its own external process.
pdfalto parameters are related to security limits (aka circuit breakers). memoryLimitMb indicates the maximum amount of memory that the parsing of one PDF can use; the motivation is to avoid OOM errors and swapping with problematic PDFs. timeoutSec is the maximum runtime allowed for pdfalto to process one PDF.
```yaml
pdf:
  pdfalto:
    # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
    path: "pdfalto"
    # security for PDF parsing
    memoryLimitMb: 6096
    timeoutSec: 60
```
The following parameters are related to the ALTO files generated by pdfalto. They are used to prevent processing overly large documents:
```yaml
# security relative to the PDF parsing result
blocksMax: 100000
tokensMax: 1000000
```
Consolidation and its configuration are described here.
Optionally a proxy can be defined, which will be used when calling the CrossRef REST service or the biblio-glutton service, if selected:
```yaml
proxy:
  # proxy to be used when doing external call to the consolidation service
  host:
  port:
```
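For illustration, a filled-in proxy declaration could look like the following sketch (the host name and port are placeholders, not real values from the GROBID distribution):

```yaml
proxy:
  # hypothetical corporate proxy used for outgoing consolidation calls
  host: "proxy.example.com"
  port: 8080
```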
CORS for the GROBID web API service can be configured with the following YAML part:
```yaml
# CORS configuration for the GROBID web API service
corsAllowedOrigins: "*"
corsAllowedMethods: "OPTIONS,GET,PUT,POST,DELETE,HEAD"
corsAllowedHeaders: "X-Requested-With,Content-Type,Accept,Origin"
```
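As a sketch, instead of the wildcard you could restrict the API to a single front-end origin (the origin below is a placeholder):

```yaml
# only allow cross-origin requests from one hypothetical web application
corsAllowedOrigins: "https://myapp.example.com"
```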
Language processing implementation
GROBID uses external implementations for recognizing the language used in a publication and for performing sentence segmentation.
There is currently only one possible language recognition implementation (the Cybozu Language Detector) and two possible sentence segmenters (OpenNLP, the default, and the Pragmatic Segmenter).
```yaml
# the actual implementation for language recognition to be used
languageDetectorFactory: "org.grobid.core.lang.impl.CybozuLanguageDetectorFactory"
# the actual implementation for optional sentence segmentation to be used (PragmaticSegmenter or OpenNLP)
#sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"
sentenceDetectorFactory: "org.grobid.core.lang.impl.OpenNLPSentenceDetectorFactory"
The maximum number of threads to be used by the GROBID service can be set with the concurrency parameter. GROBID manages a pool of threads of the indicated size to ensure service stability and availability over time. If all the threads are in use, a request is put in a queue and waits poolMaxWait seconds before trying again to be assigned to a thread. If the queue reaches a certain limit (which depends on the configured concurrency), a 503 response is sent back to the client, indicating that it should wait a bit before re-sending the request. If you need more explanation on this mechanism, see here and the GROBID client implementations.
```yaml
# maximum concurrency allowed to GROBID server for processing parallel requests - change it according to your CPU/GPU capacities
# for a production server running only GROBID, set the value slightly above the available number of threads of the server
# to get best performance and security
concurrency: 10
# when the pool is full, for queries waiting for the availability of a Grobid engine, this is the maximum time to wait to try
# to get an engine (in seconds) - normally never change it
poolMaxWait: 1
```
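On the client side, this mechanism amounts to "retry the same request after a pause whenever a 503 is received". A minimal, hypothetical sketch of such a retry loop (not part of the official GROBID clients; the function names and backoff values are illustrative):

```python
import time

def post_with_retry(send_request, max_retries=5, backoff_sec=2.0):
    """Call send_request() until it no longer returns HTTP 503.

    send_request is any callable returning an HTTP status code; a real
    client would wrap an actual POST to the GROBID service here.
    """
    for _ in range(max_retries):
        status = send_request()
        if status != 503:
            return status
        # 503 means the GROBID thread pool and its queue are saturated:
        # wait a bit, then re-send the exact same request
        time.sleep(backoff_sec)
    raise RuntimeError("GROBID server still overloaded after retries")
```

This is essentially what the existing GROBID clients do, which is why they are recommended over ad hoc scripting for large-scale processing.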
When executing the service, models can be loaded in a lazy manner (if you plan to use only some services) to save memory, or at service startup to avoid slowing down the first query:
```yaml
# for **service only**: how to load the models,
# false -> models are loaded when needed (default), avoiding putting useless models in memory, but significantly slowing
# down the service at first call
# true -> all the models are loaded into memory at server startup, slowing the start of the service, and models not
# used will take some memory, but the server is immediately warm and ready
modelPreload: false
```
Finally, the following part specifies the ports to be used by the GROBID web service:
```yaml
server:
  type: custom
  applicationConnectors:
    - type: http
      port: 8070
  adminConnectors:
    - type: http
      port: 8071
  registerDefaultExceptionMappers: false
```
Wapiti global parameters
Under wapiti, we find the generic parameters of the Wapiti engine; currently only one is present. The following parameter applies only when training with Wapiti CRF: it indicates how many threads to use when training a Wapiti model.
```yaml
wapiti:
  # Wapiti global parameters
  # number of threads for training the wapiti models (0 to use all available processors)
  nbThreads: 0
```
DeLFT global parameters
Under delft, we find the generic parameters of the DeLFT engine. For using Deep Learning models, you will need an installation of the Python library DeLFT. Use the following parameters to indicate the location of this installation, and optionally the path to the virtual environment folder of this installation:
```yaml
delft:
  # delft installation path if Deep Learning architectures are used to implement one of the sequence labeling models,
  # embeddings are usually compiled as lmdb under delft/data (this parameter is ignored if only feature-engineered CRF models are used)
  install: "../delft"
  pythonVirtualEnv:
```
Configuring the models
Each model has its own configuration indicating:

- which "engine" to be used, with value wapiti for feature-engineered CRF or delft for Deep Learning models;
- for Deep Learning models, which neural architecture to be used, with choices among those supported by DeLFT (for instance BidLSTM_CRF_FEATURES or scibert). The corresponding model/architecture combination needs to be available under grobid-home/models/; if it is not the case, you will need to train the model with this particular architecture.
Wapiti CRF training uses three parameters, window, epsilon and nbMaxIterations, related to stopping criteria.
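As a sketch, a model entry using the Wapiti engine could look like the following (the model name and parameter values are illustrative, not recommended defaults):

```yaml
models:
  - name: "affiliation-address"
    engine: "wapiti"
    wapiti:
      # CRF training parameters (stopping criteria)
      window: 20
      epsilon: 0.00001
      nbMaxIterations: 1500
```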
DeLFT models use traditional Deep Learning parameters, with the possibility to override default parameters via two specific parameter subgroups: training (parameters to be used during training) and runtime (parameters to be used when doing inference). With this distinction it is possible, for instance, to train with parameters adapted to a given GPU and to do inference with a smaller GPU.
For instance, the citation model is configured below to use a BidLSTM_CRF_FEATURES architecture with glove-840B embeddings and with the indicated architecture-specific parameters:
```yaml
models:
  # we configure here how each sequence labeling model should be implemented
  # for feature-engineered CRF, use "wapiti"; possible training parameters are window, epsilon and nbMaxIterations
  # for Deep Learning, use "delft" and select the target DL architecture (see the DeLFT library); the training
  # parameters then depend on the selected DL architecture
  - name: "citation"
    engine: "delft"
    delft:
      # deep learning parameters
      architecture: "BidLSTM_CRF_FEATURES"
      useELMo: false
      embeddings_name: "glove-840B"
      runtime:
        # parameters used at runtime/prediction
        max_sequence_length: 3000
        batch_size: 20
      training:
        # parameters used for training
        max_sequence_length: 3000
        batch_size: 30
```
Logging can be set with the following parameter group:
```yaml
logging:
  level: INFO
  loggers:
    org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
  appenders:
    - type: console
      threshold: ALL
      timeZone: UTC
    - type: file
      currentLogFilename: logs/grobid-service.log
      threshold: ALL
      archive: true
      archivedLogFilenamePattern: logs/grobid-service-%d.log
      archivedFileCount: 5
      timeZone: UTC
```