Annotating training data with the PDF-TEI Editor
pdf-tei-editor is a web-based, open-source tool for editing and correcting GROBID TEI training data side-by-side with the source PDF. It provides a graphical alternative to editing the *.training.*.tei.xml files by hand in a text editor, which makes the correction of pre-annotated training data considerably faster and less error-prone.
The tool is developed as part of the Legal Theory Knowledge Graph project at the Max Planck Institute of Legal History and Legal Theory.
Note
The PDF-TEI Editor is a third-party project and is not maintained by the GROBID team. Refer to its repository and documentation for support, issues, and the most up-to-date instructions.
Why use it
When you generate pre-annotated training data with GROBID's createTraining (via the service API or in batch), the output is a set of TEI XML files that must be reviewed and corrected before they can be added to the gold-standard corpus. Doing this in a raw text editor is tedious: you constantly switch between the XML and the original PDF to check whether a label is on the right span of text, and it is easy to accidentally alter the text stream — which must be kept untouched.
The PDF-TEI Editor addresses this by:
- Synchronized dual-pane interface — the rendered PDF and the editable TEI XML are shown next to each other, so you can verify annotations against the source layout at a glance.
- Schema validation — the TEI is validated for compliance as you edit, catching malformed markup before it reaches the training corpus.
- Version control — branching, merging, comparison (diff) between versions, and detailed revision tracking, which is useful for collaborative, multi-annotator gold-standard creation.
- Role-based access control and collection management — for organizing documents and contributors across a shared dataset.
- Multiple extraction engines — GROBID is supported as one of the AI extraction backends, so documents can be pre-annotated and then corrected within the same interface.
Typical workflow with GROBID
The editor fits into the standard GROBID training-data preparation loop described in the annotation guidelines:
- Run GROBID's createTraining (see Generation of training data) to pre-annotate your PDFs, producing the *.training.*.tei.xml files, or use the editor's built-in GROBID extraction.
- Open the PDF together with its generated TEI XML in the PDF-TEI Editor.
- Visually correct the annotations against the PDF, moving tags without altering the text stream (the <lb/> line-break markers and the order of the text must be preserved; see the correction principles).
- Validate the TEI and save a clean, gold-standard version.
- Move the corrected file into the corresponding model's corpus directory (grobid-trainer/resources/dataset/<MODEL>/corpus/) and retrain the model.
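The requirement that corrections move tags but leave the text stream untouched can be spot-checked outside the editor. The following is a rough sketch, not part of GROBID or the PDF-TEI Editor: the function names `text_stream` and `same_text_stream` and the `@LB@` token are hypothetical, and the tag stripping is a line-based approximation that assumes no tag is split across lines. It ignores whitespace differences on purpose and only compares the sequence of words and `<lb/>` markers.

```shell
# Hypothetical helper: reduce a TEI file to a one-token-per-line text stream.
# <lb/> markers are turned into a visible token first, so a dropped or moved
# line break shows up as a difference; every other tag is then removed.
text_stream() {
  sed -e 's/<lb[[:space:]]*\/>/ @LB@ /g' -e 's/<[^>]*>//g' "$1" \
    | tr -s '[:space:]' '\n' \
    | sed '/^$/d'
}

# Succeeds when both files carry the same words and <lb/> markers in the
# same order, regardless of where the surrounding tags sit.
same_text_stream() {
  [ "$(text_stream "$1")" = "$(text_stream "$2")" ]
}

# Example use (file names are placeholders):
#   same_text_stream pre-annotated.tei.xml corrected.tei.xml \
#     || echo "text stream changed, review the correction"
```

A check like this is only a safety net before a file goes into the corpus directory; the editor's own validation and the annotation guidelines remain the reference.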
Tip
Remember that GROBID training data is curated by editing or deleting the pre-annotated files — you should not create new *.training.*.tei.xml files from scratch. The editor is there to make the correction of GROBID's output efficient, not to author TEI independently of GROBID's extraction.
Getting started
Connecting to a GROBID server
The editor does not bundle GROBID — it talks to a running GROBID server over its REST API to pre-annotate documents. You tell the editor which server to use with the GROBID_SERVER_URL environment variable.
Pre-annotation quality directly determines how much manual correction you have to do, so always point the editor at a full GROBID server (one running the Deep Learning models): it gives noticeably better reference and citation extraction than the CRF-only light instances. You have two options:
- Use the public full instance on Hugging Face: https://grobidOrg-grobid-full.hf.space (mirror https://grobidOrg-grobid-full2.hf.space). It runs the Deep Learning models and requires no installation, so it is the quickest way to start annotating. This is a good fit for occasional work or a first pass. (The light instances https://grobidOrg-grobid.hf.space and its mirror https://grobidOrg-grobid2.hf.space, documented on the Quick start page, are CRF-only and not recommended for building a gold-standard corpus.)
- Run your own full GROBID instance: recommended once you settle into iterative sessions of correction and retraining. Hosting it yourself removes the rate/availability limits of the public space and, more importantly, lets you point the editor at your own freshly retrained models so each correction cycle pre-annotates with the improvements from the previous one. Use the full image (grobid/grobid:{version}-full, or the mirror lfoppiano/grobid:{version}-full); see Run with Docker for the full instructions. GROBID's REST API listens on port 8070, so the URL is typically http://localhost:8070:
docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.9.0-full
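Before pointing the editor at a server, it can help to confirm that the server answers at all. The sketch below assumes the server exposes the `/api/isalive` health-check endpoint described in GROBID's REST API documentation; the function names `grobid_isalive_url` and `grobid_is_alive` are hypothetical helpers, not part of either tool.

```shell
# Build the health-check URL for a GROBID base URL.
# A single trailing slash on the base URL is tolerated.
grobid_isalive_url() {
  printf '%s/api/isalive\n' "${1%/}"
}

# Succeeds when the endpoint answers with a successful HTTP status
# within 10 seconds; the response body is discarded.
grobid_is_alive() {
  curl -fsS --max-time 10 -o /dev/null "$(grobid_isalive_url "$1")"
}

# Example use:
#   grobid_is_alive "http://localhost:8070" && echo "GROBID is reachable"
```

The same base URL that passes this check from the host is not necessarily reachable from inside the editor's container, so run the check from wherever the editor will actually run.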
Running the editor (Docker)
The fastest way to run the editor itself is its Docker image, passing the GROBID endpoint via GROBID_SERVER_URL:
docker run -p 8000:8000 \
  -e APP_ADMIN_PASSWORD=secure_password \
  -e GROBID_SERVER_URL=https://grobidOrg-grobid-full.hf.space \
  cboulanger/pdf-tei-editor:latest
The application is then available at http://localhost:8000 (user admin, with the password set above). The equivalent docker-compose.yml:
services:
  pdf-tei-editor:
    image: cboulanger/pdf-tei-editor:latest
    ports:
      - "8000:8000"
    environment:
      - APP_ADMIN_PASSWORD=secure_password
      - GROBID_SERVER_URL=https://grobidOrg-grobid-full.hf.space
Note
The example above uses the public Hugging Face instance so it works without any further setup. If you instead run your own GROBID (see above), remember that GROBID_SERVER_URL must be reachable from inside the editor's container: when GROBID runs on the same host, use http://host.docker.internal:8070 (Docker Desktop) or put both containers on the same Docker network and use the GROBID container name. A bare http://localhost:8070 refers to the editor container itself and will not reach GROBID.
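The shared-network option mentioned in the note can be sketched as follows. This is one possible setup under stated assumptions, not the project's documented deployment: the network name `grobid-net`, the container name `grobid`, and the helper functions `run` and `start_stack` are hypothetical, and the `--gpus all` flag is left out so the sketch also starts on hosts without a GPU. Setting `DRY_RUN` prints the commands instead of executing them.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Start GROBID and the editor on one user-defined Docker network so the
# editor can reach GROBID by container name.
start_stack() {
  net="${1:-grobid-net}"        # hypothetical network name
  grobid_name="${2:-grobid}"    # hypothetical container name

  # Fails if the network already exists; remove it or skip this step.
  run docker network create "$net"

  # GROBID needs no published port here: the editor talks to it over the
  # shared network on port 8070.
  run docker run -d --rm --init --ulimit core=0 \
    --network "$net" --name "$grobid_name" \
    grobid/grobid:0.9.0-full

  run docker run -d --rm --network "$net" -p 8000:8000 \
    -e APP_ADMIN_PASSWORD=secure_password \
    -e GROBID_SERVER_URL="http://$grobid_name:8070" \
    cboulanger/pdf-tei-editor:latest
}

# Example use:
#   DRY_RUN=1; start_stack     # inspect the three commands first
#   unset DRY_RUN; start_stack # then run them
```

On a user-defined network, Docker resolves container names through its built-in DNS, which is why the GROBID container name can stand in for a hostname in GROBID_SERVER_URL.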
Trying the bundled demo
The repository also ships a one-command demo deployment:
git clone https://github.com/mpilhlt/pdf-tei-editor.git
cd pdf-tei-editor
npm run deploy .env.deploy.demo.localhost
It becomes available at http://localhost:8080 with demo credentials admin/admin or demo/demo — change these for any non-local use.
For the authoritative and most current installation, configuration, and usage instructions, see the pdf-tei-editor repository and its documentation.