How does it work?

The goal of affiliation parsing is to recognize affiliation string fragments related to institution, address and country. Additionally, country names are decorated with their ISO codes.

Affiliations are parsed using a Conditional Random Fields (CRF) classifier. First the affiliation string is tokenized, then each token is classified as institution, address, country or other, and finally neighbouring tokens with the same label are concatenated. The main feature used by the CRF is the classified word itself. All additional features are binary: whether the token is a number, whether it is an all-uppercase or all-lowercase word, whether it is a word that starts with an uppercase letter followed by lowercase letters, and whether the token appears in dictionaries of countries or of words commonly found in institutions or addresses. The token's feature vector contains not only the features of the token itself, but also the features of the two preceding and two following tokens.
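The pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual implementation: the tokenization rule, the feature names, the tiny dictionaries and the function names are all hypothetical, and the CRF classifier itself is left out (the labels in the usage example are written by hand).

```python
import re
from itertools import groupby

# Hypothetical, tiny dictionaries; the real ones are not given in the text.
COUNTRIES = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "department"}
ADDRESS_WORDS = {"street", "avenue", "road"}


def tokenize(affiliation):
    """Assumed rule: runs of letters, runs of digits, or single symbols."""
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", affiliation)


def token_features(token):
    """The word itself plus the binary features listed in the text."""
    lower = token.lower()
    feats = {"W=" + lower}
    if token.isdigit():
        feats.add("IsNumber")
    if token.isalpha():
        if token.isupper():
            feats.add("AllUpper")
        elif token.islower():
            feats.add("AllLower")
        elif token[0].isupper() and token[1:].islower():
            feats.add("StartsUpper")
    if lower in COUNTRIES:
        feats.add("InCountryDict")
    if lower in INSTITUTION_WORDS:
        feats.add("InInstitutionDict")
    if lower in ADDRESS_WORDS:
        feats.add("InAddressDict")
    return feats


def window_features(tokens, i, radius=2):
    """Features of token i and of up to `radius` neighbours on each side.

    Each feature is suffixed with its relative position, e.g. "@-2" or "@0".
    """
    feats = set()
    for offset in range(-radius, radius + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for f in token_features(tokens[j]):
                feats.add(f"{f}@{offset}")
    return feats


def merge_labeled_tokens(tokens, labels):
    """Concatenate neighbouring tokens that share the same label."""
    pairs = zip(labels, tokens)
    return [
        (label, " ".join(tok for _, tok in group))
        for label, group in groupby(pairs, key=lambda p: p[0])
    ]


tokens = tokenize("ICM, University of Warsaw, Poland")
# Labels a trained classifier might assign; written by hand for this example.
labels = ["institution", "other", "institution", "institution",
          "institution", "other", "country"]
fragments = merge_labeled_tokens(tokens, labels)
```

In a real system the sets returned by `window_features` would be passed to a CRF toolkit for training and prediction; here they only show how context from neighbouring tokens ends up in each token's feature vector.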

RESTful API

Example


$ curl -X POST --data "affiliation=the text of the affiliation" \
  http://cermine.ceon.pl/parse.do
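
The same request can be issued from Python using only the standard library. This is a minimal sketch: the function names `build_request` and `parse_affiliation` are made up for illustration, and because the response format is not described here, the body is returned as raw text. Unlike the curl call above, this version URL-encodes the affiliation string before sending it.

```python
import urllib.parse
import urllib.request

PARSE_URL = "http://cermine.ceon.pl/parse.do"


def build_request(affiliation, url=PARSE_URL):
    """Build a form-encoded POST request carrying the `affiliation` field."""
    body = urllib.parse.urlencode({"affiliation": affiliation}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def parse_affiliation(affiliation, timeout=30):
    """Send the request and return the response body as text (format not assumed)."""
    with urllib.request.urlopen(build_request(affiliation), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `parse_affiliation("the text of the affiliation")` performs the network request; `build_request` alone can be used to inspect what would be sent.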
