Recompiling and integrating CRF libraries
Grobid can be used with two different CRF libraries: Wapiti (the default and most performant one) and CRF++ (the historical one). However, CRF++ is deprecated and is no longer considered. For the Grobid models, Wapiti decodes about twice as fast as CRF++, and Wapiti models are five to ten times smaller in memory than CRF++ ones. Training time is also significantly reduced with the Wapiti L-BFGS training algorithm, with similar accuracy.
Wapiti
Wapiti is the default CRF library used by Grobid. It is integrated transparently into Grobid via JNI, and normally nothing additional needs to be done. This section explains how to rebuild and install the native library in order to integrate a new version of Wapiti into Grobid.
For Grobid, we are using a specific fork of the original Wapiti distribution: Wapiti fork for Grobid.
In particular, this version includes a SWIG mapping that provides a JNI interface, as well as some bug fixes.
1) Build
The instructions for building the library are given in the above GitHub repo. The libraries libwapiti.so or libwapiti.dylib are then available under the subdirectory build/.
2) Portability
On Mac OS X, libwapiti.dylib is already portable.
On Linux, to make the library portable rather than dynamically linked against the installed system libraries, proceed as follows:
- in the subdirectory build/, execute the provided script to collect the dependencies:
> ../collect-dependencies.sh libwapiti.so .
The script will copy the required libraries alongside libwapiti.so.
- Local linking is prioritized to ensure portability of the JNI, which can be checked on Linux with:
> ldd libwapiti.so
which should display something like the following, with linking only to the local dependencies:
linux-vdso.so.1 => (0x00007fff8e591000)
libstdc++.so.6 => /home/at-sac/plopez/Wapiti/build/./libstdc++.so.6 (0x00007f6a0e46b000)
libm.so.6 => /home/at-sac/plopez/Wapiti/build/./libm.so.6 (0x00007f6a0e1e6000)
libgcc_s.so.1 => /home/at-sac/plopez/Wapiti/build/./libgcc_s.so.1 (0x00007f6a0dfd0000)
libc.so.6 => /home/at-sac/plopez/Wapiti/build/./libc.so.6 (0x00007f6a0dc3c000)
/lib64/ld-linux-x86-64.so.2 (0x00000032a0c00000)
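The check above can also be scripted instead of read by eye. The following is only a sketch: `check_local_deps` is a hypothetical helper, not part of Wapiti or Grobid, and it assumes the `name => /path (address)` line format shown in the sample output. It reads `ldd` output on standard input and lists every dependency resolved outside the given directory.

```shell
#!/bin/sh
# Hypothetical helper (not provided by Wapiti or Grobid).
# Reads `ldd` output on stdin and prints each dependency whose resolved
# path is not located under the directory given as first argument.
# Exit status: 0 if every resolved dependency is local, 1 otherwise.
check_local_deps() {
    build_dir=$1
    awk -v dir="$build_dir" '
        BEGIN { bad = 0 }
        # Only lines of the form "name => /path (address)" carry a resolved
        # path; the vdso and the dynamic loader lines are skipped.
        $2 == "=>" && $3 ~ /^\// {
            if (index($3, dir "/") != 1) { print $1 " -> " $3; bad = 1 }
        }
        END { exit bad }
    '
}

# Usage, from the build/ directory:
#   ldd libwapiti.so | check_local_deps "$(pwd)"
```

An empty output with exit status 0 suggests that all resolved dependencies come from the build directory. The dynamic loader line (`/lib64/ld-linux-x86-64.so.2`) is deliberately ignored, since it has no `=>` part.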
3) Install native libraries
The next step is to install the updated libraries in the Grobid distribution. On Linux:
> cp ld-linux-x86-64.so.2 libc.so.6 libgcc_s.so.1 libm.so.6 libstdc++.so.6 libwapiti.so GROBID-ROOT-DIRECTORY/grobid-home/lib/lin-<nb bits of the OS>
On Mac OS X:
> cp libwapiti.dylib GROBID-ROOT-DIRECTORY/grobid-home/lib/mac-64/
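The target sub-directory in the two commands above depends on the platform. As a minimal sketch, the choice could be automated as follows. `grobid_lib_subdir` and the `GROBID_ROOT` variable are hypothetical names introduced for illustration, and `getconf LONG_BIT` is assumed to be an acceptable stand-in for "nb bits of the OS" (it reports the word size of the userland, which may differ from the kernel in unusual setups).

```shell
#!/bin/sh
# Hypothetical helper: maps an OS name (as printed by `uname -s`) and a
# word size in bits to the sub-directory name used under grobid-home/lib.
grobid_lib_subdir() {
    case "$1" in
        Linux)  echo "lin-$2" ;;
        Darwin) echo "mac-64" ;;
        *)      echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Usage: GROBID_ROOT is an assumed variable pointing at GROBID-ROOT-DIRECTORY.
#   subdir=$(grobid_lib_subdir "$(uname -s)" "$(getconf LONG_BIT)")
#   echo "$GROBID_ROOT/grobid-home/lib/$subdir"
```

The function only computes the directory name; the copy itself is still done with the `cp` commands given above.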
4) Install the JNI jar
Finally, the JNI jar file has to be deployed in the local repository in grobid-core (we assume here that the Wapiti version is 1.5.0; adapt as necessary):
> mvn install:install-file -Dfile=wapiti-1.5.0.jar -DgroupId="fr.limsi.wapiti" -DartifactId="wapiti" -Dversion="1.5.0" -Dpackaging="jar" -DlocalRepositoryPath="GROBID-ROOT-DIRECTORY/grobid-core/lib"
If the Wapiti library version changes, the dependency version in grobid-core/pom.xml has to be updated.
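To verify that the jar was deployed, it can help to know where it should end up. The sketch below assumes the standard Maven repository layout (group id with dots turned into slashes, then artifact id, version, and `artifactId-version.jar`); `maven_artifact_path` is a hypothetical helper written for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints the path of a jar relative to the root of a
# Maven repository, following the standard layout
#   <groupId with '.' replaced by '/'>/<artifactId>/<version>/<artifactId>-<version>.jar
maven_artifact_path() {
    group_id=$1
    artifact_id=$2
    version=$3
    group_path=$(printf '%s' "$group_id" | tr '.' '/')
    printf '%s/%s/%s/%s-%s.jar\n' \
        "$group_path" "$artifact_id" "$version" "$artifact_id" "$version"
}

# Usage, with the coordinates of the mvn command above:
#   ls "GROBID-ROOT-DIRECTORY/grobid-core/lib/$(maven_artifact_path fr.limsi.wapiti wapiti 1.5.0)"
```

If the file is listed, the deployment into the local repository most likely succeeded for that version.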
CRF++
CRF++ has not been supported since GROBID version 0.4.
Integration to GROBID source
The generated library has to be added to the open source project. Copy libcrfpp.so (see previous paragraph) and all linked libraries to GROBID-ROOT-DIRECTORY/grobid-home/lib/lin-<nb bits of the OS>, for instance:
> cp ld-linux-x86-64.so.2 libwapiti.so libc.so.6 libgcc_s.so.1 libm.so.6 libpthread.so.0 libstdc++.so.6 GROBID-ROOT-DIRECTORY/grobid-home/lib/lin-<nb bits of the OS>
The Java dependency file has to be deployed in the local repository for grobid-core. Finally, the dependency version of wapiti in build.gradle has to be updated.