Configuration of GROBID

GROBID is configured via the file grobid/grobid-home/config/grobid.yaml. You need to restart the GROBID service after modifying the file for the new configuration parameters to take effect.

Configuration of a Docker image

See here.

Description of the configuration parameters

GROBID home

The GROBID home is the path where all the runtime GROBID resources are located (models, lexicon, etc.). The default is grobid-home/ and it should normally never be changed:

  # where all the Grobid resources are stored (models, lexicon, native libraries, etc.), normally no need to change
  grobidHome: "grobid-home"

By default temporary files are currently written under grobid-home/tmp:

  # path relative to the grobid-home path (e.g. tmp for grobid-home/tmp) or absolute path (/tmp)
  temp: "tmp"
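
The relative-or-absolute convention stated in the comment above can be illustrated with a short Python sketch. The helper name resolve_under_home is hypothetical and not part of GROBID; it only mirrors the rule that a relative value is taken under grobid-home while an absolute value is used as is.

```python
from pathlib import PurePosixPath


def resolve_under_home(grobid_home: str, value: str) -> PurePosixPath:
    """Resolve a configured path against grobid-home (hypothetical helper)."""
    # Joining with an absolute path discards the left-hand side,
    # so "/tmp" stays "/tmp" while "tmp" becomes "grobid-home/tmp".
    return PurePosixPath(grobid_home) / value


print(resolve_under_home("grobid-home", "tmp"))   # grobid-home/tmp
print(resolve_under_home("grobid-home", "/tmp"))  # /tmp
```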

By default, native libraries are located under grobid-home/lib:

  # normally nothing to change here, path relative to the grobid-home path (e.g. grobid-home/lib)
  nativelibrary: "lib"

PDF processing

PDF processing is done by pdfalto, which is a command line executable. The choice to keep it as a command line tool rather than a natively integrated library is motivated by robustness: at scale, PDF processing can lead to various issues, and it helps to keep it isolated in its own external process.

The pdfalto parameters are security limits (also known as circuit breakers). memoryLimitMb indicates the maximum amount of memory that the parsing of one PDF can use. The motivation is to avoid out-of-memory errors and swapping with problematic PDFs. timeoutSec is the maximum runtime allowed for pdfalto to process one PDF.

  pdf:
    pdfalto:
      # path relative to the grobid-home path (e.g. grobid-home/pdfalto), you don't want to change this normally
      path: "pdfalto"
      # security for PDF parsing
      memoryLimitMb: 6096
      timeoutSec: 60
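
To illustrate why running the parser as an external process makes a time limit easy to enforce, here is a minimal Python sketch. It is not GROBID's implementation: the function name run_with_timeout is hypothetical, a sleeping child Python process stands in for a slow pdfalto run, and only the timeout is covered (the memory limit is not).

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_sec: float) -> bool:
    """Run cmd as a separate process; return True if it finished in time."""
    try:
        subprocess.run(cmd, timeout=timeout_sec, check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a stuck
        # parser cannot outlive its time budget.
        return False


# Stand-ins for a fast and a slow PDF parse (assumptions for the demo).
fast = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(fast, timeout_sec=10))
print(run_with_timeout(slow, timeout_sec=1))
```

Because the work happens in a child process, a timeout only costs that one document; the calling service keeps running.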

The following parameters relate to the ALTO file generated by pdfalto. They are used to prevent the processing of overly large documents.

    # security relative to the PDF parsing result
    blocksMax: 100000
    tokensMax: 1000000

Consolidation

Consolidation and its configuration are described here.

Proxy

Optionally, a proxy can be defined. It will be used when calling the CrossRef REST service or the biblio-glutton service, if selected.

  proxy:
    # proxy to be used when doing external call to the consolidation service
    host: 
    port:

CORS

CORS for the GROBID web API service can be configured with the following YAML part:

  # CORS configuration for the GROBID web API service
  corsAllowedOrigins: "*"
  corsAllowedMethods: "OPTIONS,GET,PUT,POST,DELETE,HEAD"
  corsAllowedHeaders: "X-Requested-With,Content-Type,Accept,Origin"

Language processing implementation

GROBID uses external implementations for recognizing the language used in a publication and for performing sentence segmentation.

There is currently only one language recognition implementation (Cybozu Language Detector) and two sentence segmenters (OpenNLP, the default, and the Pragmatic Segmenter).

  # the actual implementation for language recognition to be used
  languageDetectorFactory: "org.grobid.core.lang.impl.CybozuLanguageDetectorFactory"

  # the actual implementation for optional sentence segmentation to be used (PragmaticSegmenter or OpenNLP)
  #sentenceDetectorFactory: "org.grobid.core.lang.impl.PragmaticSentenceDetectorFactory"
  sentenceDetectorFactory: "org.grobid.core.lang.impl.OpenNLPSentenceDetectorFactory"

Service configuration

The maximum number of threads used by the GROBID service is set with the concurrency parameter. GROBID manages a pool of threads of the indicated size to ensure service stability and availability over time. If all the threads are in use, incoming requests are put in a queue and wait poolMaxWait seconds before trying again to be assigned to a thread. If the queue reaches a certain limit, which depends on concurrency, a 503 response is sent back to the client, indicating that it should wait a bit before resending the request. If you need more explanation on this mechanism, see here and the GROBID client implementations.

  # maximum concurrency allowed to GROBID server for processing parallel requests - change it according to your CPU/GPU capacities
  # for a production server running only GROBID, set the value slightly above the available number of threads of the server
  # to get best performance and security
  concurrency: 10  
  # when the pool is full, for queries waiting for the availability of a Grobid engine, this is the maximum time wait to try 
  # to get an engine (in seconds) - normally never change it
  poolMaxWait: 1
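
On the client side, the 503 behaviour described above is usually handled by waiting and resending. The following Python sketch shows one way to do that; it is an illustration rather than the logic of the official GROBID clients. The function name send_with_retry, the default of five attempts and the five-second wait are assumptions, and the HTTP call itself is abstracted behind a send callable so that the example runs without a server.

```python
import time
from typing import Callable, Tuple


def send_with_retry(send: Callable[[], Tuple[int, str]],
                    max_attempts: int = 5,
                    wait_sec: float = 5.0) -> Tuple[int, str]:
    """Call send() until it returns a status other than 503."""
    status, body = send()
    attempts = 1
    while status == 503 and attempts < max_attempts:
        # The server is saturated: pause before resending the request.
        time.sleep(wait_sec)
        status, body = send()
        attempts += 1
    return status, body


# Simulated server: busy twice, then successful.
responses = iter([(503, ""), (503, ""), (200, "<TEI/>")])
print(send_with_retry(lambda: next(responses), wait_sec=0.01))
```

In a real client, send would wrap the HTTP request to the GROBID service and return the status code and response body.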

When running the service, models can either be loaded lazily, which saves memory if you plan to use only some services, or be loaded when the service starts, which avoids slowing down the first query:

  # for **service only**: how to load the models, 
  # false -> models are loaded when needed, avoiding putting in memory useless models (only in case of CRF) but slow down 
  #          significantly the service at first call
  # true -> all the models are loaded into memory at the server startup (default), slow the start of the services 
  #         and models not used will take some more memory (only in case of CRF), but server is immediately warm and ready
  modelPreload: true
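
The trade-off between the two settings can be sketched in a few lines of Python. This is a conceptual illustration only, not GROBID code: the class ModelPool, its methods and the loader functions are hypothetical, and the strings returned by the loaders stand in for real models.

```python
from typing import Callable, Dict, List


class ModelPool:
    """Holds models, loading them at startup or on first use."""

    def __init__(self, loaders: Dict[str, Callable[[], object]], preload: bool):
        self._loaders = loaders
        self._loaded: Dict[str, object] = {}
        if preload:
            # Eager mode: pay the full loading cost once, at startup.
            for name in loaders:
                self.get(name)

    def get(self, name: str) -> object:
        # Lazy mode: the first request for a model triggers its loading.
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def loaded_names(self) -> List[str]:
        return sorted(self._loaded)


loaders = {"header": lambda: "header-model", "citation": lambda: "citation-model"}
print(ModelPool(loaders, preload=True).loaded_names())
print(ModelPool(loaders, preload=False).loaded_names())
```

With preload enabled every model is in memory before the first request; with it disabled nothing is loaded until get is called, so the first call for each model is slower.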

Finally the following part specifies the port to be used by the GROBID web service:

server:
    type: custom
    applicationConnectors:
    - type: http
      port: 8070
    adminConnectors:
    - type: http
      port: 8071
    registerDefaultExceptionMappers: false

Wapiti global parameters

Under wapiti, we find the generic parameters of the Wapiti engine. Currently only one is present. It applies only when training with CRF Wapiti and indicates how many threads to use when training a Wapiti model.

  wapiti:
    # Wapiti global parameters
    # number of threads for training the wapiti models (0 to use all available processors)
    nbThreads: 0

DeLFT global parameters

Under delft, we find the generic parameters of the DeLFT engine. To use Deep Learning models, you need either an installation of the Python library DeLFT or the Docker image. For a local build, use the following parameters to indicate the location of this installation and, optionally, the path to the virtual environment folder of this installation:

  delft:
    # delft installation path if Deep Learning architectures are used to implement one of the sequence labeling model, 
    # embeddings are usually compiled as lmdb under delft/data (this parameter is ignored if only feature-engineered CRF are used)
    install: "../delft"
    pythonVirtualEnv: "../delft/env"

Configuring the models

Each model has its own configuration indicating:

which "engine" to use, with values wapiti for feature-based CRF or delft for Deep Learning models.
for Deep Learning models, which neural architecture to use, with choices normally among BidLSTM_CRF, BidLSTM_CRF_FEATURES, BERT, BERT-CRF, BERT_CRF_FEATURES. The corresponding model/architecture combination needs to be available under grobid-home/models/. If it is not, you will need to train the model with this particular architecture.

Wapiti CRF training uses three parameters: window, epsilon and nbMaxIterations related to stopping criteria.

DeLFT models use traditional Deep Learning parameters. Default parameters can be overridden with two specific parameter subgroups: training (parameters used during training) and runtime (parameters used for inference). With this distinction it is possible, for instance, to train with parameters adapted to a given GPU and to run inference on a smaller GPU.

For instance, the citation model is configured below to use a BidLSTM_CRF_FEATURES architecture with the indicated architecture-specific parameters:

  models:
    # we configure here how each sequence labeling model should be implemented
    # for feature-engineered CRF, use "wapiti" and possible training parameters are window, epsilon and nbMaxIterations
    # for Deep Learning, use "delft" and select the target DL architecture (see DeLFT library), the training 
    # parameters then depends on this selected DL architecture 

    - name: "citation"
      engine: "delft"
      delft:
        # deep learning parameters
        architecture: "BidLSTM_CRF_FEATURES"
        useELMo: false
        runtime:
          # parameters used at runtime/prediction
          max_sequence_length: 3000
          batch_size: 20
        training:
          # parameters used for training
          max_sequence_length: 3000  
          batch_size: 30
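
As a rough illustration of the training/runtime split, the Python sketch below merges the shared settings of a model with one of the two subgroups. It shows one plausible reading of the override behaviour, not the actual GROBID or DeLFT code, and the function name effective_params is hypothetical. The dictionary mirrors the citation example above.

```python
def effective_params(delft_cfg: dict, mode: str) -> dict:
    """Merge shared DeLFT settings with the 'training' or 'runtime' subgroup."""
    if mode not in ("training", "runtime"):
        raise ValueError(f"unknown mode: {mode}")
    # Shared settings are everything that is not itself a subgroup.
    shared = {k: v for k, v in delft_cfg.items()
              if k not in ("training", "runtime")}
    # Values in the selected subgroup take precedence over shared ones.
    return {**shared, **delft_cfg.get(mode, {})}


citation = {
    "architecture": "BidLSTM_CRF_FEATURES",
    "useELMo": False,
    "runtime": {"max_sequence_length": 3000, "batch_size": 20},
    "training": {"max_sequence_length": 3000, "batch_size": 30},
}
print(effective_params(citation, "runtime")["batch_size"])   # 20
print(effective_params(citation, "training")["batch_size"])  # 30
```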

A Deep Learning model comes with predefined word embeddings (e.g. glove-840B or word2vec) or the name of a pre-trained transformer model (e.g. bert-base-cased or allenai/scibert_scivocab_cased, reusing Hugging Face transformers Hub model names). They are part of the trained DL model and will be downloaded the first time they are needed if not available in the local DeLFT install.

Logging

Logging can be set with the following parameter group:

logging:
  level: INFO
  loggers:
    org.apache.pdfbox.pdmodel.font.PDSimpleFont: "OFF"
    org.glassfish.jersey.internal: "OFF"
    com.squarespace.jersey2.guice.JerseyGuiceUtils: "OFF"
  appenders:
    - type: console
      threshold: WARN
      timeZone: UTC
      # uncomment to have the logs in json format
      #layout:
      #  type: json
    - type: file
      currentLogFilename: logs/grobid-service.log
      threshold: INFO
      archive: true
      archivedLogFilenamePattern: logs/grobid-service-%d.log
      archivedFileCount: 5
      timeZone: UTC
      # uncomment to have the logs in json format
      #layout:
      #  type: json