Affiliation parsing in CERMINE

13 November 2014 by Dominika Tkaczyk

CERMINE is our Java library for extracting metadata from scientific literature. Among other information, CERMINE extracts the authors of the input document and their affiliations, and associates authors with affiliations. A new functionality has recently been added: affiliation parsing.

The goal of affiliation parsing is to recognize the fragments of an affiliation string that relate to the institution, address and country. Additionally, country names are decorated with their ISO codes. Here is an example of a parsed affiliation string (it conforms to the JATS document description format):

<aff id="id">
  <institution>Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw</institution>,
  <addr-line>ul. Prosta 69, 00-838 Warsaw</addr-line>,
  <country country="PL">Poland</country>
</aff>
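To make the structure of this output concrete, here is a minimal sketch of reading such an element back into its parts. It uses only Python's standard library; the function name read_affiliation and the dictionary keys are made up for this illustration and are not part of CERMINE. The input string is the example above, with the closing aff tag included.

```python
import xml.etree.ElementTree as ET

# The parsed affiliation from the example above, as a complete element.
AFF_XML = """<aff id="id">
  <institution>Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw</institution>,
  <addr-line>ul. Prosta 69, 00-838 Warsaw</addr-line>,
  <country country="PL">Poland</country>
</aff>"""


def read_affiliation(xml_text):
    """Return the labelled fragments of one <aff> element as a dict.

    Hypothetical helper for illustration only; fragments that are
    missing from the input come back as None.
    """
    aff = ET.fromstring(xml_text)
    country = aff.find("country")
    return {
        "institution": aff.findtext("institution"),
        "address": aff.findtext("addr-line"),
        "country": country.text if country is not None else None,
        # The ISO code is stored in the "country" attribute of <country>.
        "country_code": country.get("country") if country is not None else None,
    }


parsed = read_affiliation(AFF_XML)
print(parsed["country_code"], "|", parsed["address"])
```

The commas between the elements are plain text inside the aff element, so they are simply skipped when only the child elements are read.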

Affiliations are parsed using a Conditional Random Fields (CRF) classifier. First the affiliation string is tokenized, then each token is classified as institution, address, country or other, and finally neighbouring tokens with the same label are concatenated. The main feature used by the CRF is the classified word itself. The additional features are all binary: whether the token is a number, whether it is an all-uppercase or all-lowercase word, whether it is a word that starts with an uppercase letter followed by lowercase letters, and whether it appears in dictionaries of countries or of words commonly found in institutions or addresses. Moreover, the token's feature vector contains not only the features of the token itself, but also the features of the two preceding and two following tokens.
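The steps around the classifier can be sketched roughly as follows. This is not CERMINE's code: the tokenization rule, the feature names, the tiny dictionaries and the space-joined concatenation are all assumptions made for illustration, and the CRF itself is left out, so the labels in the usage example are supplied by hand.

```python
import re
from itertools import groupby

# Hypothetical toy dictionaries; the real ones are not listed in this post.
COUNTRIES = {"poland"}
INSTITUTION_WORDS = {"university", "centre", "institute", "department"}
ADDRESS_WORDS = {"ul", "street", "road", "avenue"}


def tokenize(text):
    # Assumed rule: runs of letters, runs of digits, single punctuation marks.
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text)


def token_features(token):
    """Features of a single token: the word itself plus binary flags."""
    lower = token.lower()
    feats = {"word=" + lower}
    if token.isdigit():
        feats.add("is_number")
    if token.isalpha():
        if token.isupper():
            feats.add("all_upper")
        elif token.islower():
            feats.add("all_lower")
        elif token[:1].isupper() and token[1:].islower():
            feats.add("capitalized")
    if lower in COUNTRIES:
        feats.add("in_country_dict")
    if lower in INSTITUTION_WORDS:
        feats.add("in_institution_dict")
    if lower in ADDRESS_WORDS:
        feats.add("in_address_dict")
    return feats


def window_features(tokens, i, size=2):
    """Features of token i plus those of up to `size` neighbours on each side.

    Each feature is prefixed with its relative position, e.g. "-1:is_number".
    """
    feats = set()
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats.update(f"{offset:+d}:{f}" for f in token_features(tokens[j]))
    return feats


def merge_labelled(tokens, labels):
    """Concatenate neighbouring tokens that share a label (joined by spaces)."""
    pairs = zip(tokens, labels)
    return [
        (label, " ".join(tok for tok, _ in group))
        for label, group in groupby(pairs, key=lambda pair: pair[1])
    ]


tokens = tokenize("University of Warsaw, Poland")
# Labels written by hand here; in the parser they come from the CRF.
labels = ["institution", "institution", "institution", "other", "country"]
fragments = merge_labelled(tokens, labels)
print(fragments)
```

Prefixing neighbour features with their offset keeps "the previous token is a number" distinct from "this token is a number", which is one simple way to give a sequence classifier the local context described above.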

The affiliation parser was evaluated by 5-fold cross-validation on 8,000 affiliations from the PubMed Central Open Access Subset. A labelled affiliation fragment (institution, address or country) was considered correct only if the entire string was identical to the ground truth. The following results were obtained:

The affiliation parser can be used via a REST service. It can be accessed using the cURL tool:

$ curl -X POST --data "affiliation=the text of the affiliation"
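The same kind of request can be prepared from a script. The sketch below only builds the POST request and does not send it; the function name build_request is made up for this example, and the service address is a placeholder because the actual endpoint is not given here. Note that curl's --data option sends the string as written, so special characters in the affiliation text need to be percent-encoded, which urlencode does in this sketch.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(service_url, affiliation):
    """Build a form-encoded POST request carrying one 'affiliation' field.

    Hypothetical helper: service_url must be replaced with the real
    address of the service.
    """
    body = urlencode({"affiliation": affiliation}).encode("utf-8")
    return Request(service_url, data=body, method="POST")


# Placeholder address, not the real endpoint.
req = build_request("http://example.invalid/parse", "University of Warsaw, Poland")
print(req.get_method(), req.data)
```

Passing the prepared request to urllib.request.urlopen would send it once a real service address is filled in.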

For more information about the usage, visit CERMINE's GitHub page.
