How does it work?

Reference strings contain important metadata. The information we extract from a reference string includes: author, title, journal name, volume, issue, pages, publisher, location and year.

Parsing proceeds in three steps. First, the reference string is tokenized. Next, each token is transformed into a feature vector and assigned a label. Finally, adjacent tokens with the same label are concatenated to form the resulting reference metadata record. The heart of the implementation is the token classifier, which uses Conditional Random Fields and is built on top of the GRMM and MALLET packages.
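The first and last steps can be sketched in Python as follows. This is only an illustration: the regular expression used for tokenization, the label name `text` for punctuation, and the hand-assigned labels are all assumptions made for this example, and the CRF classifier that would normally produce the labels is left out.

```python
import re
from itertools import groupby


def tokenize(reference):
    # Assumption: a token is a run of word characters or a single
    # punctuation character; the real tokenizer may differ.
    return re.findall(r"\w+|[^\w\s]", reference)


def merge_labeled_tokens(tokens, labels):
    """Concatenate adjacent tokens that share a label into (label, text) pairs."""
    fields = []
    for label, group in groupby(zip(labels, tokens), key=lambda pair: pair[0]):
        # Tokens are joined with a space here for simplicity.
        fields.append((label, " ".join(token for _, token in group)))
    return fields


tokens = tokenize("Smith, Parsing citations, 2015.")
# Hand-assigned labels stand in for the classifier output (hypothetical).
labels = ["author", "text", "title", "title", "text", "year", "text"]
record = merge_labeled_tokens(tokens, labels)
```

Here `record` holds one entry per run of identically labeled tokens, so the two `title` tokens end up in a single field.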

We use 42 features to describe the tokens. Some of them are based on the presence of a particular character class, e.g. digits or lowercase/uppercase letters. Others check whether the token is a particular character (e.g. a dot, a square bracket, a comma or a dash) or a particular word. Finally, some features check whether the token appears in a dictionary built from the dataset, e.g. a dictionary of cities or of words that commonly appear in journal titles. It is worth noting that a token's label depends not only on its own feature vector, but also on the surrounding tokens. To reflect this in the classifier, the token's feature vector contains the features of the token itself as well as the features of the two preceding and the two following tokens.
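A minimal Python sketch of this idea is shown below. The feature names, the tiny `CITIES` dictionary, and the `offset:name` key scheme are hypothetical and cover only a handful of features, not the 42 used by the actual classifier.

```python
# Hypothetical stand-in for a dictionary built from the dataset.
CITIES = {"warsaw", "london"}


def token_features(token):
    """Compute a few per-token features of the kinds described above."""
    return {
        "has_digit": any(ch.isdigit() for ch in token),
        "all_digits": token.isdigit(),
        "starts_upper": token[:1].isupper(),
        "all_lower": token.islower(),
        "is_dot": token == ".",
        "is_comma": token == ",",
        "in_city_dict": token.lower() in CITIES,
    }


def windowed_features(tokens, index, radius=2):
    """Combine features of the token at `index` with those of its neighbours.

    Keys are prefixed with the relative offset (-2 .. +2); neighbours that
    fall outside the token sequence are simply skipped.
    """
    features = {}
    for offset in range(-radius, radius + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            for name, value in token_features(tokens[position]).items():
                features[f"{offset:+d}:{name}"] = value
    return features


tokens = ["Warsaw", ",", "2015", "."]
features = windowed_features(tokens, 2)
```

For the token `2015`, the resulting vector contains its own features under the `+0:` prefix and those of `Warsaw`, `,` and `.` under `-2:`, `-1:` and `+1:`.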

RESTful API

Warning! The RESTful API is experimental and subject to change!

URL: http://cermine.ceon.pl/parse.do
Method: POST
Content-Type: application/x-www-form-urlencoded
Parameters:

reference (required): A citation string to be parsed.

format (optional, defaults to bibtex): Response format. Can be either bibtex or nlm.

Example

$ curl -X POST --data "reference=the text of the reference" \
  http://cermine.ceon.pl/parse.do

$ curl -X POST --data "reference=L.+Bolikowski%2C+N.+Houssos%2C+P.+Manghi%2C+J.+Schirrwagen%2C+%22Data+as+First-class+Citizens%2C%22+D-Lib+Magazine%2C+vol.+21%2C+no.+1%2F2%2C+2015.&format=nlm" \
  http://cermine.ceon.pl/parse.do
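
The same request can be issued from Python using only the standard library. The sketch below is a hedged example rather than an official client: the function names are hypothetical, and it assumes the response body is UTF-8 text. Form encoding of the reference string is handled by `urlencode`, so the citation can be passed as plain text.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

PARSE_URL = "http://cermine.ceon.pl/parse.do"


def build_parse_request(reference, fmt="bibtex"):
    """Build a POST request carrying the form-encoded parameters."""
    body = urlencode({"reference": reference, "format": fmt}).encode("ascii")
    return Request(
        PARSE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def parse_reference(reference, fmt="bibtex", timeout=30):
    """Send the request and return the response body as text.

    Assumption: the service answers with UTF-8 encoded text.
    """
    with urlopen(build_parse_request(reference, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `parse_reference("the text of the reference", fmt="nlm")` would then send the reference to the service and return the response in the requested format.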

References

D. Tkaczyk, P. Szostek, P. J. Dendek, M. Fedoryszak, and Ł. Bolikowski, “CERMINE - automatic extraction of metadata and references from scientific literature,” in Proceedings of the 11th IAPR International Workshop on Document Analysis Systems, 2014.
