Annotation guidelines for bibliographical references


This section describes how to annotate training data for the citation model. This model parses a bibliographical reference in isolation (as typically present in the bibliographical section at the end of an article).

In the model cascade, each isolated bibliographical reference is produced by the reference segmenter model, which splits a bibliographical section into a list of individual bibliographical references.

Note that the mark-up used for this training data broadly follows the TEI (Text Encoding Initiative) guidelines.


The complete bibliographical reference is enclosed in a <bibl> element. One <bibl> structure must correspond to exactly one bibliographical reference. The following tags are then used to structure the bibliographical reference:

  • <author> for the complete sequence of authors

  • <orgName type="collaboration"> for a collaboration, i.e. a time-limited, project-based grouping of authors from different affiliations. Examples include the ATLAS, CMS and DELPHI collaborations in high-energy particle physics (arXiv:1503.07589, the current world record holder for number of authors, combines the two CERN collaborations ATLAS and CMS) and the LUNA collaboration in astrophysics. A particularity of collaborations is that they can be used both as an authorship and as an affiliation component.

  • <title level="a"> for article title and chapter title. Here "a" stands for analytic (a part of a monograph)

  • <title level="j"> for journal title

  • <title level="s"> for series title (e.g. "Lecture Notes in Computer Science")

  • <title level="m"> for the non-journal bibliographical item holding the cited article, e.g. a conference proceedings title. Note that if a book is cited, the title of the book is annotated with <title level="m">. If a thesis is cited, the title of the thesis is annotated with <title level="m">, and the type of thesis with <note type="report">. Here "m" stands for monograph.

  • <date> the date sequence (excluding parenthesis, etc.)

  • <biblScope unit="page"> the full range of pages of the article

  • <biblScope unit="volume"> the value of the volume (e.g. vol. <biblScope unit="volume">7</biblScope>,)

  • <biblScope unit="issue"> the value of the issue, also known as number (e.g. no. <biblScope unit="issue">3</biblScope>,)

  • <orgName> the institution for theses or technical reports

  • <publisher> the name of the publisher

  • <pubPlace> publication place, or location of the "publishing" institution

  • <editor> for all the sequence of editors

  • <ptr type="web"> for web URL

  • <idno> for the document-specific identifier, in particular DOI and arXiv identifiers. The type of identifier can optionally be given by the attribute @type, for instance <idno type="DOI">...</idno>. For report identifiers, <idno type="report">...</idno> is used

  • <note type="report"> for the indication of the kind of report, in the broad sense of technical report. This includes "technical report" from an institution, but also the kind of thesis ("Ph.D. thesis", "M.Sc. thesis", etc.)

  • <note> for any indications related to the reference and not covered by one of the previous tags (for instance "personal communication")
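
The tag inventory above can be turned into a quick sanity check for annotated references. The following is only a minimal sketch: the `ALLOWED` table and the `check_tags` helper are hypothetical names introduced here for illustration, they are not part of any existing tool, and the attribute constraints are simply transcribed from the list above.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory transcribed from the tag list above.
# None means that no attribute constraint is checked for the element.
ALLOWED = {
    "author": None,
    "editor": None,
    "orgName": None,
    "date": None,
    "publisher": None,
    "pubPlace": None,
    "idno": None,
    "note": None,
    "title": ("level", {"a", "j", "s", "m"}),
    "biblScope": ("unit", {"page", "volume", "issue"}),
    "ptr": ("type", {"web"}),
}


def check_tags(annotated_reference):
    """Return a list of problems found in one annotated reference."""
    # Wrap the fragment in <bibl> so that it parses as a single element.
    root = ET.fromstring(f"<bibl>{annotated_reference}</bibl>")
    problems = []
    for element in root.iter():
        if element is root:
            continue
        if element.tag not in ALLOWED:
            problems.append(f"unexpected element <{element.tag}>")
            continue
        constraint = ALLOWED[element.tag]
        if constraint is not None:
            attribute, values = constraint
            if element.get(attribute) not in values:
                problems.append(
                    f"<{element.tag}> needs {attribute} in {sorted(values)}"
                )
    return problems
```

Calling `check_tags` on a fragment returns an empty list when every element is in the inventory and a list of readable messages otherwise. Raw characters such as `&` must already be escaped (`&amp;`) for the fragment to parse.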

Additional text or characters that do not belong to one of these elements (punctuation, syntactic sugar, etc.) have to be left untagged under the <bibl> element. This is the case, for instance, for the tag <date>: characters such as parentheses have to be put outside this element (see the example below).

Example: Biostatistics (2008), 9(2), pp. 234–248

            <title level="j">Biostatistics</title> (<date>2008</date>), 
            <biblScope unit="volume">9</biblScope>(<biblScope unit="issue">2</biblScope>), 
            pp. <biblScope unit="page">234–248</biblScope>
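
Because punctuation and other untagged characters stay in place, removing the tags from an annotation should give back the raw reference string. The sketch below illustrates this with the example above. The helper names `untagged_text` and `matches_raw` are hypothetical, and collapsing runs of whitespace (so that line breaks added for readability are ignored) is an assumption of this sketch, not a rule stated in these guidelines.

```python
import xml.etree.ElementTree as ET


def untagged_text(annotated_reference):
    """Strip all tags and collapse runs of whitespace into single spaces."""
    # Wrap the fragment in <bibl> so that it parses as a single element.
    root = ET.fromstring(f"<bibl>{annotated_reference}</bibl>")
    return " ".join("".join(root.itertext()).split())


def matches_raw(annotated_reference, raw_reference):
    """True if the annotation preserves every character of the raw reference."""
    return untagged_text(annotated_reference) == " ".join(raw_reference.split())


raw = "Biostatistics (2008), 9(2), pp. 234–248"
annotated = """<title level="j">Biostatistics</title> (<date>2008</date>),
<biblScope unit="volume">9</biblScope>(<biblScope unit="issue">2</biblScope>),
pp. <biblScope unit="page">234–248</biblScope>"""

print(matches_raw(annotated, raw))
```

A check of this kind flags annotations where a parenthesis was dropped or moved inside a tag pair incorrectly, and also annotations where a space was introduced between two fields that are adjacent in the raw text.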


Special case with years

When a letter is appended to the year, for instance in the Harvard bibliographical reference style:

Gavazzi G., Piertini D., Boselli A., Tuffs R., 1996c, A&AS, 120, 489(Paper I)

with (Gavazzi et al, 1996c) used to reference the citation in the full text.

The year and its letter are tagged together:

        <author>Gavazzi G., Piertini D., Boselli A., Tuffs R.</author>, <date>1996c</date>, 
        <title level="j">A&amp;AS</title>, <biblScope unit="volume">120</biblScope>, 
        <biblScope unit="page">489</biblScope>(<note>Paper I</note>) 

No segmentation between fields

In examples like the following one:

D. Foo, P. Bar, Phys. Rev. D95, 34(2017)

there is no spacing between the title of the journal Phys. Rev. D and the volume 95. For annotating this case, we tag these two fields without separation (not introducing a space or end-of-line between D and 95):

    <author>D. Foo, P. Bar</author>, 
    <title level="j">Phys. Rev. D</title><biblScope unit="volume">95</biblScope>, 
    <biblScope unit="page">34</biblScope>(<date>2017</date>)


In the case of identifiers, we expect the following identifier types to be indicated: DOI, PMID, PMC (for PMC ID), ISSN, PII, ISBN and arxiv (in lowercase). Identifiers for technical reports are indicated with the type report (e.g. <idno type="report">).

As a principle for identifiers, the identifier type string/prefix must be kept in the tagged field (except when it's impossible, for instance due to a formatting error in the PDF). This will help to disambiguate the recognized identifiers when the trained models are labelling new content. For example, see the different <idno> in the following reference:

    <author>Dillman, Allissa A et al.</author> (<date>2013</date>). 
    "<title level="a">mRNA expression, splicing and editing in the embryonic and adultmouse cerebral cortex</title>". eng. 
    In :<title level="j">Nature neuroscience</title> <biblScope unit="volume">16</biblScope>. 
    <biblScope unit="issue">4</biblScope>. <idno type="PMID">PMID :23416452</idno> 
    <idno type="PMC">PMCID: PMC3609882</idno>, pp. <biblScope unit="page">499-506</biblScope>. 
    <idno type="ISSN">issn: 1546- 1726</idno>. 
    <idno type="DOI">doi: 10. 1038/nn.3332</idno>.
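
The two identifier conventions above (a known @type value, and the type prefix kept inside the tagged field) can be checked mechanically. The following is a minimal sketch: `EXPECTED_TYPES` is transcribed from the list of identifier types above, `check_idno` is a hypothetical helper name, and the case-insensitive "text starts with the type" test is a heuristic assumed here for illustration. Since the guidelines allow the prefix to be missing when it cannot be recovered from the PDF, a hit is a prompt for review and not necessarily an error.

```python
import xml.etree.ElementTree as ET

# Identifier types listed in the guidelines ("arxiv" is lowercase).
EXPECTED_TYPES = {"DOI", "PMID", "PMC", "ISSN", "PII", "ISBN", "arxiv", "report"}


def check_idno(annotated_reference):
    """Return (identifier text, message) pairs for <idno> fields to review."""
    # Wrap the fragment in <bibl> so that it parses as a single element.
    root = ET.fromstring(f"<bibl>{annotated_reference}</bibl>")
    findings = []
    for idno in root.iter("idno"):
        id_type = idno.get("type")
        text = "".join(idno.itertext()).strip()
        if id_type not in EXPECTED_TYPES:
            findings.append((text, f"unexpected type {id_type!r}"))
        # Report identifiers carry no fixed prefix, so they are skipped here.
        elif id_type != "report" and not text.lower().startswith(id_type.lower()):
            findings.append((text, "type prefix not kept in the tagged field"))
    return findings


sample = (
    '<idno type="PMID">PMID :23416452</idno> '
    '<idno type="PMC">PMCID: PMC3609882</idno> '
    '<idno type="ISSN">issn: 1546- 1726</idno> '
    '<idno type="DOI">doi: 10. 1038/nn.3332</idno>'
)
print(check_idno(sample))
```

For the identifiers of the reference above, every field passes both checks, so the printed list is empty.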