Error in the logs
The logs of Grobid are located in
logs/grobid-service.log, the console log from gradle are usually not very useful to understand the problem.
When opening an issue, remember to attach the logs from
grobid-service.log in the report.
The exact server configuration depends on the service you want to call.
Following are the configuration used to process with
processFulltextDocument around 10.6 PDF per second (around 915,000 PDF per day, around 20M pages per day) with the node.js client during one week on a 16 CPU machine (16 threads, 32GB RAM, no SDD).
The process ran without any crash during ~7 days at this rate (11.3M PDF processed) using two 16-CPU servers.
- if the server has 8-10 threads available, you can use the default settings of the docker image, otherwise you would rather need to build and start the service yourself to tune the parameters
- keep the concurrency at the client (number of simultaneous calls) slightly higher than the available number of threads at the server side, for instance if the server has 16 threads, use a concurrency between 20 and 24 (it's the option
nin the above mentioned clients)
grobid/grobid-home/grobid.properties, set the property
org.grobid.max.connectionsto your number of available thread at server side or slightly higher (e.g. 16 to 20 for a 16 threads-machine)
grobid/grobid-service/config/config.yaml, it will avoid some strange behavior at startup
- in the query, consolidateHeader can be
2if you are using the CrossRef consolidation. It significantly improves the accuracy and add useful metadata.
- If you want to consolidate all the bibliographical references and use
2, CrossRef query rate limit will avoid scale to more than 1 document per second... For scaling the bibliographical reference resolution, you will need to use a local consolidation service. The overall capacity will depend on the biblio-glutton service then, and the number of elasticsearch nodes you can exploit. From experience, it is difficult to go beyond 300K PDF per day when using consolidation for every extracted bibliographical references.
See full thread.
Memory constraints and Out of Memory errors
In case of running on limited memory hardware, there are various ways to deal with memory constraints in Grobid:
- reduce the number of parallel threads on server side in grobid/grobid-home/config/grobid.properties change the property org.grobid.max.connections (default is
10, this is the max size of the thread pool that is reported in the logs)
reduce the number of parallel processing at client side: this is the parameter
nin the grobid python client command line
reduce the maximum amount of memory dedicated to the PDF parsing by pdfalto: in
grobid/grobid-home/config/grobid.propertieschange the property
grobid.3rdparty.pdf2xml.memory.limit.mbtoa lower amount (note: not too low, PDF parsing remains memory-hungry).
Windows related issues
Grobid is developed and tested on Linux. macOS is also supported, although some components might behave slighly different due to the natural incompatibility of Apple with the rest of the world and the availability on some proprietary fonts on this platform.
Windows, unfortunately, is currently not anymore supported, due to lack of experience and time constraints. We recommend Windows users to use the Grobid Docker image (documented here) and call the system via API using one of the various grobid clients.
Before opening a new issue which might be related to Windows, please check that it is not redundant here
NOTE: If you are a developer using Windows and you like to help, please let us know.
Invalid model format
Using macOS 10.12 Sierra with a non-default locales (e.g. French), you might get the following error:
error: invalid model format
This is because Wapiti is not reading the binary files correctly (most likely due to some incompatibles introduced from OSX to macOS).
As workaround, before launching GROBID, type in the console:
See here the open issue.
Missing libxml2 library
libxml2: GROBID is currenly shipped with all the needed libraries (Mac and Linux 32/64 bit).
libxml2 is required by pdfalto, and is normally shipped by default on all standard installation (Ubuntu, Mac OSX, etc).
For minimal or cloud based / container system like Linode, AWS, Docker, etc. libxml2 might not be installed by default and should thus be installed as prerequisite.
See here the open issue.