GROBID and containers (Docker)

NOTE: Docker support is still experimental.

Docker is an open-source project that automates the deployment of applications inside software containers. The documentation on how to install it and start using it can be found here.

GROBID can be instantiated and run using Docker. The image information can be found here.

To fetch and run the image (assuming Docker is installed and working):

  • Pull the image from Docker Hub:
> docker pull lfoppiano/grobid:0.5.3
  • Run the container (note that the new version listens on port 8070 inside the container, which is mapped to port 8080 on your host):
> docker run -t --rm --init -p 8080:8070 -p 8081:8071 lfoppiano/grobid:0.5.3

Alternatively, you can look up the image ID and run the container from it:

> docker images | grep lfoppiano/grobid | grep 0.5.3
> docker run -t --rm --init -p 8080:8070 -p 8081:8071 $image_id_from_previous_command
  • Access the service:
      • open your browser at http://localhost:8080
      • the health check is available at http://localhost:8081
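
After starting the container, the service may need a moment before it answers. The following is a minimal sketch, not something shipped with GROBID, of a helper that polls a URL until it responds. The function name `wait_for_url` and the default retry count are assumptions chosen for illustration, and the sketch presumes `curl` is installed on the host.

```shell
#!/bin/sh
# Hypothetical helper (not part of GROBID): poll a URL until it answers
# with a successful HTTP status, or give up after a number of attempts.
wait_for_url() {
    url=$1
    attempts=${2:-30}   # assumed default: up to 30 tries, one second apart
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failures; -s: silent; -o: discard the body
        if curl -fs -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example usage, with the port mapping used above:
#   wait_for_url http://localhost:8080 && echo "service reachable"
#   wait_for_url http://localhost:8081 && echo "health check port reachable"
```

The helper only reports whether the port answers at all; it does not interpret the content of the health check page.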

Troubleshooting

Out of memory while processing

This might be due to insufficient memory on the Docker machine. Check how much memory is allocated to it:

> docker-machine inspect

You should see something like:

{
    "ConfigVersion": 3,
    "Driver": {
        "IPAddress": "192.168.99.100",
        "MachineName": "default",
        "SSHUser": "docker",
        "SSHPort": 55933,
        "SSHKeyPath": "/Users/lfoppiano/.docker/machine/machines/default/id_rsa",
        "StorePath": "/Users/lfoppiano/.docker/machine",
        "SwarmMaster": false,
        "SwarmHost": "tcp://0.0.0.0:3376",
        "SwarmDiscovery": "",
        "VBoxManager": {},
        "HostInterfaces": {},
        "CPU": 1,
        "Memory": 2048,     #<---- Memory: 2Gb                   
        "DiskSize": 204800,
        "NatNicType": "82540EM",
        "Boot2DockerURL": "",
        "Boot2DockerImportVM": "",
        "HostDNSResolver": false,
        "HostOnlyCIDR": "192.168.99.1/24",
        "HostOnlyNicType": "82540EM",
        "HostOnlyPromiscMode": "deny",
        "NoShare": false,
        "DNSProxy": true,
        "NoVTXCheck": false
    },
    "DriverName": "virtualbox",
    "HostOptions": {
      [...]
        },
        "SwarmOptions": {
         [...]
        },
        "AuthOptions": {
           [...]
        }
    },
    "Name": "default"
}
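
Rather than reading the full JSON, the `Memory` field can be extracted directly. The sketch below assumes the VirtualBox driver shown in the output above (other drivers may not expose `Driver.Memory`) and a machine named `default`. The helper names and the 4096 MB threshold in the usage comment are illustrative assumptions, not GROBID requirements.

```shell
#!/bin/sh
# Hypothetical helpers (not part of GROBID or Docker).

# Print the memory (in MB) allocated to a docker-machine VM.
# Assumes the VirtualBox driver, whose inspect output has Driver.Memory.
machine_memory_mb() {
    docker-machine inspect --format '{{.Driver.Memory}}' "${1:-default}"
}

# Succeed when an amount of memory (MB) reaches a minimum (MB).
has_enough_memory() {
    [ "$1" -ge "$2" ]
}

# Example usage; 4096 is an arbitrary illustrative threshold:
#   mem=$(machine_memory_mb default)
#   has_enough_memory "$mem" 4096 || echo "only ${mem} MB allocated"
```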

For more information see the GROBID main page.

pdf2xml zombie processes

~~When running Docker without an init process, the pdf2xml processes will hang as zombies, eventually filling up the machine. The Docker solution is to pass --init when running the image; however, we are discussing a more long-term solution that is compatible with, for example, Kubernetes.~~ The solution shipped with the current Dockerfile, which uses tini (https://github.com/krallin/tini), should provide a correct init process that cleans up killed processes.
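
To check whether zombie processes are accumulating in a running container, one option is to count the processes whose state starts with `Z`. The following is only a sketch: the function name `count_zombies` is hypothetical, and it assumes the input has a header line followed by rows whose second column is the process state, as produced by a `ps`-style listing with `pid,stat,comm` columns.

```shell
#!/bin/sh
# Hypothetical helper: read a ps-style listing on stdin (header line first,
# process state in the second column) and print the number of zombies.
count_zombies() {
    awk 'NR > 1 && $2 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Example usage ({container_name} is a placeholder, as elsewhere in this page):
#   docker top {container_name} -eo pid,stat,comm | count_zombies
```

A count that stays at 0 while documents are being processed suggests that the init process is reaping terminated pdf2xml processes as intended.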

Build caveat

NOTE: The following part is only for development purposes. We recommend using the official Docker images from Docker Hub.

Starting from 0.5.3, the Docker build clones the repository using git, so there is no need for custom builds. The only important piece of information is the version, which is checked out from the tags.

> docker build -t lfoppiano/grobid:0.5.3 --build-arg GROBID_VERSION=0.5.3 .
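
Since the version appears twice in the build command (in the image tag and in the build argument), a small wrapper can keep the two in sync. This is a sketch under the assumption that it is run from the directory containing the Dockerfile; the function name `build_grobid_image` is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: build the image for a given release tag, using the
# same value for the image tag and for the GROBID_VERSION build argument.
# Run it from the directory that contains the Dockerfile.
build_grobid_image() {
    version=$1
    docker build -t "lfoppiano/grobid:${version}" \
        --build-arg "GROBID_VERSION=${version}" .
}

# Example usage:
#   build_grobid_image 0.5.3
```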

To run a container from the newly created image:

> docker run -t --rm -p 8080:8070 -p 8081:8071 lfoppiano/grobid:0.5.3

For testing or debugging purposes, you can connect to the container with a bash shell:

> docker exec -i -t {container_name} /bin/bash
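
If the container was started without an explicit name, its generated name has to be looked up first. The sketch below lists running containers created from a given image, using the `ancestor` filter of `docker ps`. The helper name `grobid_containers` is hypothetical, and the default image is the one used throughout this page.

```shell
#!/bin/sh
# Hypothetical helper: print the names of running containers created from
# a given image (default: the image used in this document).
grobid_containers() {
    docker ps --filter "ancestor=${1:-lfoppiano/grobid:0.5.3}" \
        --format '{{.Names}}'
}

# Example usage: open a shell in the first matching container
#   docker exec -i -t "$(grobid_containers | head -n 1)" /bin/bash
```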