Text Mining Services in OpenAIRE

Recently in Athens there was an impressive kick-off of the OpenAIRE2020 project, during which we presented OpenAIRE’s plans in the area of text and data mining of scholarly publications. Publications contain all kinds of rich information which, although understandable to a human reader, is not machine-readable and thus cannot be used directly for indexing and recommendation. Authors’ affiliations, document classifications, references to biological and chemical databases, and acknowledgements of research funding agencies are all valuable pieces of information for OpenAIRE’s scholarly communication services.

[Image: Text analysis. Photo credit: Wouter Vandenneucker, license: CC BY-SA 2.0]

Enriching content

At ADA Lab we are primarily interested in mining scholarly publications and extracting from them information that will help make OpenAIRE even more attractive to our users. On the one hand, the knowledge we extract will shed more light on the state of scholarly communication. Thanks to quantitative indicators, policy-makers and enthusiasts alike will be able to take the pulse of open science and observe the latest trends in research. On the other hand, researchers will have better means to find the research outputs they need, as the extracted knowledge will allow our indexing services to better “understand” the content and come up with more relevant search results and recommendations.

How will we do this?

Together with our colleagues from CNR in Pisa and ARC in Athens, in the work package devoted to Knowledge Extraction Services (WP10), we will improve and expand the Information Inference Service created in the OpenAIREplus project. We will improve the inference infrastructure by adding visual workflow management and strengthening quality assurance. We will extend the existing document content analysis functionality to extract information about the structure of the document, the affiliations of the authors, and the sentiment of citations. We will also enhance our automatic document classification and introduce clustering of similar documents. In addition, we will search for new types of links to outside knowledge bases, i.e., third-party, domain-specific repositories describing genes, chemicals, organisms, etc. Some solutions will be built from scratch; others will be based on software developed by the partners, such as CERMINE and MadIS.

Code on GitHub

Finally, we will work on better uptake of the project’s deliverables by making our results even more discoverable and usable by the general public. To that end, we plan to migrate our code to GitHub (star our repository now) and to publish our data sets on Zenodo. Both the source code and the data sets will be available under open licenses, of course! The first deliverables in our work package are scheduled for August 2015. We’ll keep you up to date about our research on this blog, so stay tuned!

This blog post has been simultaneously published on the official OpenAIRE blog and on the ADA Lab blog. It is available under the CC BY 4.0 license.