How does it work?

Reference strings contain important metadata. The information we extract from a reference string includes: author, title, journal name, volume, issue, pages, publisher, location and year.

First, the reference string is tokenized. Each token is then transformed into a feature vector, and the classifier assigns a label to it. Finally, adjacent tokens with the same label are concatenated to form the resulting reference metadata record. The heart of the implementation is the token classifier, which employs Conditional Random Fields and is built on top of the GRMM and MALLET packages.
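
The tokenize and concatenate steps can be sketched in a few lines of Python. This is only an illustration, not the actual implementation: the tokenization rule (runs of letters, runs of digits, single punctuation characters) is an assumption, the function names are hypothetical, and the labels are assigned by hand where the real system would use its Conditional Random Fields classifier.

```python
import re
from itertools import groupby

# Assumed tokenization rule: a run of letters, a run of digits,
# or a single punctuation character. The real tokenizer may differ.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(reference):
    """Return (text, start, end) triples for each token in the string."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(reference)]


def merge_labeled_tokens(reference, tokens, labels):
    """Concatenate adjacent tokens that share a label.

    Each run of equally labeled tokens becomes one field value, cut from
    the original string so that spacing and punctuation are preserved.
    """
    record = {}
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        run = list(run)
        start = run[0][0][1]   # start offset of the first token in the run
        end = run[-1][0][2]    # end offset of the last token in the run
        record.setdefault(label, []).append(reference[start:end])
    return record


reference = "J. Smith, 2015"
tokens = tokenize(reference)
# Hand-assigned labels standing in for the classifier's output.
labels = ["author", "author", "author", "text", "year"]
print(merge_labeled_tokens(reference, tokens, labels))
# {'author': ['J. Smith'], 'text': [','], 'year': ['2015']}
```

Each field maps to a list because a label such as author can occur in several separate runs within one reference.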

We use 42 features to describe the tokens. Some of them are based on the presence of a particular character class, e.g. digits or lowercase/uppercase letters. Others check whether the token is a particular character (e.g. a dot, a square bracket, a comma or a dash) or a particular word. Finally, some features check whether the token appears in a dictionary built from the dataset, e.g. a dictionary of cities or of words commonly appearing in journal titles. It is worth noting that a token's label depends not only on its own feature vector, but also on the surrounding tokens. To reflect this in the classifier, the token's feature vector contains not only the features of the token itself, but also the features of the two preceding and the two following tokens.
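
A minimal sketch of how such context-aware feature vectors could be built is shown below. It covers only a handful of illustrative features, not the 42 used by the classifier. The feature names, the `CITIES` dictionary and both function names are hypothetical and chosen for this example.

```python
# Hypothetical dictionary; the real one is built from the training dataset.
CITIES = {"warsaw", "london", "berlin"}


def token_features(token):
    """Compute a few character-class, character and dictionary features."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "all_digits": token.isdigit(),
        "starts_upper": token[:1].isupper(),
        "all_lower": token.islower(),
        "is_dot": token == ".",
        "is_comma": token == ",",
        "is_dash": token == "-",
        "is_bracket": token in ("[", "]"),
        "in_city_dict": token.lower() in CITIES,
    }


def windowed_features(tokens, window=2):
    """Extend each token's features with those of its neighbours.

    Features are keyed as "name@offset", where offset 0 is the token
    itself, negative offsets are preceding tokens and positive offsets
    are following tokens. Neighbours outside the sequence are skipped.
    """
    local = [token_features(t) for t in tokens]
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                for name, value in local[j].items():
                    row[f"{name}@{offset}"] = value
        rows.append(row)
    return rows


rows = windowed_features(["Warsaw", ",", "2015"])
print(rows[0]["in_city_dict@0"], rows[0]["is_comma@1"], rows[0]["all_digits@2"])
# True True True
```

With the default window of two, a token in the middle of a long reference carries five times as many features as a token considered in isolation.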

RESTful API

Example

$ curl -X POST --data "reference=the text of the reference" \
  http://cermine.ceon.pl/parse.do

$ curl -X POST --data "reference=L.+Bolikowski%2C+N.+Houssos%2C+P.+Manghi%2C+J.+Schirrwagen%2C+%22Data+as+First-class+Citizens%2C%22+D-Lib+Magazine%2C+vol.+21%2C+no.+1%2F2%2C+2015.&format=nlm" \
  http://cermine.ceon.pl/parse.do
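
The same request can be prepared from Python using only the standard library. The sketch below mirrors the curl calls above: the endpoint and the `reference` and `format` parameters are taken from those examples, while the function name `build_parse_request` is made up for illustration. The request is only built here, not sent, and the shape of the response is not assumed.

```python
from urllib.parse import urlencode
from urllib.request import Request

PARSE_URL = "http://cermine.ceon.pl/parse.do"


def build_parse_request(reference, fmt=None):
    """Build a form-encoded POST request for the parse endpoint.

    urlencode performs the same escaping seen in the curl example
    (spaces become "+", commas become "%2C", and so on).
    """
    params = {"reference": reference}
    if fmt is not None:
        params["format"] = fmt
    body = urlencode(params).encode("ascii")
    return Request(PARSE_URL, data=body, method="POST")


request = build_parse_request(
    'L. Bolikowski, N. Houssos, P. Manghi, J. Schirrwagen, '
    '"Data as First-class Citizens," D-Lib Magazine, vol. 21, no. 1/2, 2015.',
    fmt="nlm",
)
print(request.data.decode("ascii"))
```

To send it, pass the request to `urllib.request.urlopen` and read the response body, for example `urlopen(request).read().decode("utf-8")`.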

References