Recompiling and integrating CRF libraries
Grobid can be used with two different CRF libraries: Wapiti (the default and most performant one) and CRF++ (the historical one). However, CRF++ is deprecated and we no longer consider it. For the Grobid models, Wapiti decodes about twice as fast as CRF++, and Wapiti models take five to ten times less memory than the CRF++ ones. Training time is also significantly reduced with the Wapiti L-BFGS training algorithm, for a similar accuracy.
Wapiti
Wapiti is the default CRF library used by Grobid. It is integrated transparently into Grobid via JNI, and normally nothing more needs to be done. This section explains how to rebuild and install the native library in order to integrate a new version of Wapiti into Grobid.
For Grobid, we are using a specific fork of the original Wapiti distribution: Wapiti fork for Grobid.
This version includes, in particular, a SWIG mapping providing a JNI interface, as well as some bug fixes.
1) Build
The instructions for building the library are given in the above GitHub repository. The libraries libwapiti.so or libwapiti.dylib are then available under the subdirectory build/.
2) Portability
On Mac OS X architecture, libwapiti.dylib is already portable.
On Linux, to make the library portable and avoid dynamic linking to the installed system libraries, proceed as follows:
- In the subdirectory build/, execute the provided script to collect the dependencies:
  > ../collect-dependencies.sh libwapiti.so .
  The script will copy the required libraries together with libwapiti.so.
- The local linking will be prioritized to ensure portability of the JNI, which can be checked on Linux by:
  > ldd libwapiti.so
  which should display something like the following, with linking only to the local dependencies:
  linux-vdso.so.1 => (0x00007fff8e591000)
  libstdc++.so.6 => /home/at-sac/plopez/Wapiti/build/./libstdc++.so.6 (0x00007f6a0e46b000)
  libm.so.6 => /home/at-sac/plopez/Wapiti/build/./libm.so.6 (0x00007f6a0e1e6000)
  libgcc_s.so.1 => /home/at-sac/plopez/Wapiti/build/./libgcc_s.so.1 (0x00007f6a0dfd0000)
  libc.so.6 => /home/at-sac/plopez/Wapiti/build/./libc.so.6 (0x00007f6a0dc3c000)
  /lib64/ld-linux-x86-64.so.2 (0x00000032a0c00000)
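The ldd check described above can also be scripted rather than read by eye. The following is a minimal sketch, not part of the Wapiti or Grobid distributions: the function name `non_local_deps` is hypothetical, and it assumes the usual `name => path (address)` layout of ldd output. It prints every dependency whose resolved path lies outside a given directory, so an empty result suggests that linking is local only.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Wapiti or Grobid).
# Reads `ldd` output on stdin and prints each dependency whose resolved
# path does not start with the directory given as first argument.
# Entries without a resolved path (linux-vdso, the ld-linux loader)
# are skipped.
non_local_deps() {
    build_dir="$1"
    awk -v dir="$build_dir" '
        $2 == "=>" && $3 ~ /^\// && index($3, dir) != 1 { print $1 " -> " $3 }
    '
}

# Usage, from the build/ directory:
#   ldd libwapiti.so | non_local_deps "$(pwd)"
```

If the function prints nothing, every resolved library comes from the build directory; any printed line names a library still taken from the system.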
3) Install native libraries
The next step is to install the updated libraries in the Grobid distribution. On Linux:
> cp ld-linux-x86-64.so.2 libc.so.6 libgcc_s.so.1 libm.so.6 libstdc++.so.6 libwapiti.so GROBID-ROOT-DIRECTORY/grobid-home/lib/lin-<nb bits of the OS>
On Mac OS X:
> cp libwapiti.dylib GROBID-ROOT-DIRECTORY/grobid-home/lib/mac-64/
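When the copy step is scripted, the target subdirectory can be derived from the platform. The sketch below is only an illustration: the function name `grobid_lib_subdir` and the variable `GROBID_ROOT` are hypothetical, and the names `lin-32` and `lin-64` are assumed from the `lin-<nb bits of the OS>` pattern given above.

```shell
#!/bin/sh
# Hypothetical helper: map an OS name (as printed by `uname -s`) and a
# word size to the grobid-home/lib subdirectory name used in the text.
grobid_lib_subdir() {
    case "$1" in
        Linux)  echo "lin-$2" ;;   # assumed from lin-<nb bits of the OS>
        Darwin) echo "mac-64" ;;
        *)      echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Usage (GROBID_ROOT is an assumed variable pointing at GROBID-ROOT-DIRECTORY):
#   subdir=$(grobid_lib_subdir "$(uname -s)" "$(getconf LONG_BIT)")
#   cp libwapiti.so "$GROBID_ROOT/grobid-home/lib/$subdir/"
```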
Finally, the JNI jar file has to be deployed in the local repository in grobid-core (we assume here that the Wapiti version is 1.5.0; adapt it if necessary):
> mvn install:install-file -Dfile=wapiti-1.5.0.jar -DgroupId="fr.limsi.wapiti" -DartifactId="wapiti" -Dversion="1.5.0" -Dpackaging="jar" -DlocalRepositoryPath="GROBID-ROOT-DIRECTORY/grobid-core/lib"
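Since the version number appears both in the jar file name and in the `-Dversion` argument, one way to keep the two in sync is to derive the version from the file name. This is a minimal sketch under the assumption that the jar is named `wapiti-<version>.jar`, as in the command above; the function name `wapiti_version_from_jar` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the version from a jar named
# wapiti-<version>.jar (e.g. wapiti-1.5.0.jar gives 1.5.0).
wapiti_version_from_jar() {
    name=$(basename "$1" .jar)   # strip directory and .jar suffix
    echo "${name#wapiti-}"       # strip the leading "wapiti-" prefix
}

# Usage, reusing the Maven options from the command above:
#   jar=wapiti-1.5.0.jar
#   version=$(wapiti_version_from_jar "$jar")
#   mvn install:install-file -Dfile="$jar" -DgroupId="fr.limsi.wapiti" \
#       -DartifactId="wapiti" -Dversion="$version" -Dpackaging="jar" \
#       -DlocalRepositoryPath="GROBID-ROOT-DIRECTORY/grobid-core/lib"
```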
CRF++
CRF++ has not been supported since GROBID version 0.4.
Integration to GROBID source
The generated library has to be added to the open source project. Copy libcrfpp.so (see previous paragraph) and all linked libraries to GROBID-ROOT-DIRECTORY/grobid-home/lib/lin-<nb bits of the OS>, for instance:
> cp ld-linux-x86-64.so.2 libcrfpp.so libc.so.6 libgcc_s.so.1 libm.so.6 libpthread.so.0 libstdc++.so.6 GROBID-ROOT-DIRECTORY/grobid-home/lib/lin-<nb bits of the OS>
build.gradle has to be updated.