Applied Data Analysis Lab

Sparkling-ferns for #ApacheSpark (Part 1: The Algorithm)

2015-11-17T12:00:00+00:00

Two weeks ago, together with Mateusz Fedoryszak, I attended the first European Spark Summit (#SparkSummitEU). What did we find there, and how did we enrich the Spark community? Let me tell you the story of the summit and Sparkling Ferns...

First of all, yes, Matei Zaharia was as excited as usual during his presentation. Second, yes, everyone was super enthusiastic on every occasion during the Summit. Third, yes, you must attend Spark Summit if you haven’t done so yet. If you have, you know nothing else even comes close to this experience. Anyway, we shall meet there next year!

Many attendees asked us about Random Ferns and our implementation of it for Apache Spark: Sparkling-ferns. Let’s go through a Random Ferns FAQ. BTW: You may enjoy watching our talk:

Our slides are publicly available as well.

1. What are Random Ferns?

Random Ferns is a supervised learning classification algorithm.

In classification algorithms items are described with a feature vector f. We want to label each item using a class c from the set of classes C. To do so we create a model, which takes a vector f and returns c, the most suitable class for the vector f.

Now, there are many ways to do so. We may use Naive Bayes, Random Forests, SVM, etc. or Random Ferns.

2. When should I be interested in Random Ferns?

There are some natural indicators for choosing a specific algorithm.

With Random Ferns these indicators are:

a requirement to train a model in time linear in the number of items
a requirement to train a model in time linear in the number of features
plenty of memory to use (a Random Ferns model can be quite big; I am going to explain this later)

3. I’ve heard about Random Forests - can you compare it with Random Ferns?

Sure!

3.1. Random Forests

With Random Forests you have many decision trees. Each is trained on some subset of the data. Let’s investigate one tree within the forest (see Fig.1.1.). In each node one or more features of an item are tested to decide which child node to visit next. When you arrive at a leaf node you obtain a class c or a list of probabilities for each class from C.

The important things to note are:

one or more features may be used to choose child node,
each node has its own test - sibling nodes can use completely different features and/or different thresholds.

The final class c assigned by a model is chosen, e.g., by voting over the results of the individual trees.

Fig.1.1. The example of a decision tree within a Random Forest model.

3.2. Random Ferns

Now, when we use Random Ferns two things are different:

We are using perfect binary trees, where all levels of a tree are filled (they are balanced), and on each level only one feature is checked.
Each fern has a threshold set for each feature. As a result, on each level the decision whether an item's feature passes the threshold can be encoded with a binary value: 0 or 1.

As you can see in Fig.1.2, we have exactly 2^N leaves, where N is the number of features used. We also have |C| classes. More importantly, we can enumerate all leaves using the bits collected on the way from the root to the chosen leaf. Going from the root to the bottom-left leaf in Fig.1.2, we collect the bits 000, which represent the integer 0. Going to the bottom-right leaf, we collect the bits 111, which represent the number 7.

The natural next step is to switch from the tree representation to a 2D array representation, where the leaf code is the first coordinate and the class number is the second. The value stored for a given leaf code and class number c is the probability of correctly labeling an item that reaches this leaf as belonging to class c.

Fig.1.2. The example of a fern (represented as a tree) within a Random Ferns model.

Fig.1.3. The example of a fern (represented as a 2D array) within a Random Ferns model.
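To make the leaf code and the 2D array more tangible, here is a minimal sketch in plain Python. It is not the Sparkling-ferns code: the function names, the hand-picked thresholds, the add-one smoothing and the way scores from several ferns are combined (summing log-probabilities) are all my assumptions for illustration. Other formulations normalise the table per class rather than per leaf.

```python
import math
from typing import List, Sequence, Tuple

Fern = Tuple[Sequence[int], Sequence[float], List[List[float]]]


def leaf_code(item: Sequence[float], feature_ids: Sequence[int],
              thresholds: Sequence[float]) -> int:
    """Collect one bit per level (root first) and read the bits as an integer."""
    code = 0
    for fid, threshold in zip(feature_ids, thresholds):
        bit = 1 if item[fid] > threshold else 0
        code = (code << 1) | bit
    return code


def train_fern(items, labels, feature_ids, thresholds, num_classes) -> List[List[float]]:
    """Build the 2D array: table[leaf][cls] = smoothed share of class cls in that leaf."""
    num_leaves = 2 ** len(feature_ids)
    # Start every cell at 1 (add-one smoothing, an assumption) to avoid zero probabilities.
    counts = [[1.0] * num_classes for _ in range(num_leaves)]
    for item, label in zip(items, labels):
        counts[leaf_code(item, feature_ids, thresholds)][label] += 1.0
    return [[cell / sum(row) for cell in row] for row in counts]


def classify_with_ferns(item: Sequence[float], ferns: Sequence[Fern]) -> int:
    """Sum log-probabilities over all ferns and return the best-scoring class."""
    num_classes = len(ferns[0][2][0])
    scores = [0.0] * num_classes
    for feature_ids, thresholds, table in ferns:
        row = table[leaf_code(item, feature_ids, thresholds)]
        for cls in range(num_classes):
            scores[cls] += math.log(row[cls])
    return max(range(num_classes), key=lambda cls: scores[cls])


# Toy data (made up): class 0 has low feature values, class 1 has high ones.
items = [[0.1, 0.1], [0.2, 0.3], [0.9, 0.8], [0.7, 0.9]]
labels = [0, 0, 1, 1]
feature_ids, thresholds = [0, 1], [0.5, 0.5]
table = train_fern(items, labels, feature_ids, thresholds, num_classes=2)
print(classify_with_ferns([0.0, 0.2], [(feature_ids, thresholds, table)]))
```

Note that training touches each item once and each feature of a fern once per item, which is where the linear training time mentioned earlier comes from.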

4. How big can a Random Ferns model be?

Now let’s use some numbers.

Example 1
Example 2
Observation 1: Adding 10 more features to a model makes it 1024 times bigger than the previous one.
Observation 2: Doubling the number of ferns or classes makes a new model 2 times bigger than the previous one.
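Both observations follow from counting the cells of the 2D arrays: each fern has 2^N leaves and one value per class. A small sketch of that arithmetic is below; the function name and the sample numbers are hypothetical, and the 8 bytes per cell is only an assumption (one double per value).

```python
def fern_model_cells(num_ferns: int, features_per_fern: int, num_classes: int) -> int:
    """Number of stored values: ferns x 2^N leaves x classes."""
    return num_ferns * (2 ** features_per_fern) * num_classes


# Hypothetical configuration, chosen only to illustrate the growth.
base = fern_model_cells(num_ferns=10, features_per_fern=10, num_classes=5)
more_features = fern_model_cells(num_ferns=10, features_per_fern=20, num_classes=5)
more_ferns = fern_model_cells(num_ferns=20, features_per_fern=10, num_classes=5)

print(more_features // base)  # 1024 (ten more features)
print(more_ferns // base)     # 2 (twice as many ferns)
print(base * 8, "bytes, assuming 8 bytes per cell")
```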

5. How fast can I train Random Ferns model?

During tests of model creation time, the following empirical dependency was established:

It is easier to think about this dependency in terms of

So with a fixed number of features f, the model creation time grows linearly with the dataset size D. Conversely, with a fixed dataset size D, the model creation time grows linearly with the number of features f.

What does it mean? When you are training your model, you can easily predict when it will be ready.

6. How can I “plug” Random Ferns package into Spark?

Use the command:

and enjoy Random Ferns right away!

7. How can I use Random Ferns code in Spark?

You should import mllib Vectors and LabeledPoint as well as all classes from the package pl.edu.icm.sparkling_ferns.
Next you should convert your data into instances of the LabeledPoint class.
Finally, you should feed the method FernForest.train with the LabeledPoint instances. Please also pass numberOfFerns, numberOfFeatures and the mapping from feature ID to the number of possible values of a feature.
1. Having two features, e.g. "doors" and "persons", with possible values (respectively): List("2", "3", "4", "more") and List("2", "4", "more"), the map passed should be: Map(0 -> 4, 1 -> 3)
2. If feature values are continuous you should pass an empty map (Map.empty).
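The mapping from feature ID to the number of possible values can be derived mechanically from the value lists. The sketch below shows the idea in plain Python rather than Scala, so it does not call the package itself. The helper names are hypothetical, and encoding each categorical value as its index in the list is my assumption (it mirrors the usual MLlib convention for categorical features).

```python
from typing import Dict, List, Optional, Sequence

FeatureValues = Sequence[Optional[Sequence[str]]]


def categorical_arity(feature_values: FeatureValues) -> Dict[int, int]:
    """Map feature ID -> number of possible values; continuous features (None) are skipped."""
    return {fid: len(values)
            for fid, values in enumerate(feature_values)
            if values is not None}


def encode_item(raw: Sequence[str], feature_values: FeatureValues) -> List[float]:
    """Turn raw strings into numbers: an index for categorical values, float() otherwise."""
    encoded = []
    for value, allowed in zip(raw, feature_values):
        encoded.append(float(value) if allowed is None else float(allowed.index(value)))
    return encoded


doors = ["2", "3", "4", "more"]
persons = ["2", "4", "more"]

print(categorical_arity([doors, persons]))           # {0: 4, 1: 3}
print(encode_item(["4", "more"], [doors, persons]))  # [2.0, 2.0]
print(categorical_arity([None, None]))               # {} (all features continuous)
```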

8. Where can I find out more about Random Ferns?

For more details consult GitHub or Spark-packages.org. A complete example of using the package is in the test section of the code.

PhD defense of ADA Laber Mateusz Kobos

2015-06-18T12:00:00+00:00

On 2015-06-11, I defended my PhD thesis entitled "Multiresolution classification using combination of density estimators" in the Systems Research Institute of the Polish Academy of Sciences.

In the thesis, we introduce a classification algorithm based on the idea of a "multiple-resolution" (or "multiscale") approach to data analysis. In practice, the method uses an average of kernel density estimators, where each estimator corresponds to a different data "resolution" (see the figure above for densities produced by kernel density estimators with different smoothing parameters, which can be interpreted as a multiple-resolution view of the given five data points). First, we examine theoretical properties of this method; next, we propose a practical implementation of such an algorithm, with the parameters of the density estimators and their number adjusted to minimize the misclassification probability. Subsequently, we test the algorithm on artificial data sets characterized by a multiple-resolution property. The tests show that the introduced algorithm is superior to the basic version based on one estimator per class. We also test the algorithm on benchmark data sets and compare the results obtained with the results of the basic version and other popular classification algorithms. The method is shown to fare better than the basic version and to be on a par with other popular algorithms.

The image above is an intuitive data-based justification of why the proposed algorithm using many kernel density estimators is better for certain data sets than the basic version of the method using a single density estimator per class. The image shows the mean classification error computed for two benchmark data sets: BUPA liver disorders and Pima Indians diabetes. Two density estimators per class with the same value of smoothing parameter per class were used. The more the color of a point resembles blue, the smaller the value of the function at this point. The global minima of the function lying outside of the diagonal are marked with triangles, while the minimum for points lying on the diagonal is marked with a circle. The basic version of the method can achieve only the values lying on the diagonal; however, the proposed algorithm can achieve all shown values. The value next to the ∆ symbol above the plots is the difference between the minimal value of the function for points lying on the diagonal and the value of the global minimum. One can see that in the case of the BUPA liver disorders data set the difference is nonzero, so we can expect a smaller classification error when applying the proposed algorithm. In the case of the Pima Indians diabetes data set the difference is zero, so no gain is expected.

The thesis is mainly an extension of the paper M. Kobos, J. Mańdziuk. Multiple-resolution classification with combination of density estimators. Connection Science, 23(4):219–237, 2011.
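The core idea of averaging kernel density estimators with different smoothing parameters can be sketched in a few lines. This is only an illustration under simplifying assumptions of mine: one-dimensional data, Gaussian kernels, hand-picked bandwidths and class priors taken from class frequencies. The thesis adjusts the estimator parameters and their number to minimize the misclassification probability, which the sketch does not attempt.

```python
import math
from typing import Dict, Sequence


def gaussian_kde(x: float, points: Sequence[float], bandwidth: float) -> float:
    """Plain kernel density estimate at x with a Gaussian kernel."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def multires_density(x: float, points: Sequence[float],
                     bandwidths: Sequence[float]) -> float:
    """Average of estimators, one per 'resolution' (smoothing parameter)."""
    return sum(gaussian_kde(x, points, h) for h in bandwidths) / len(bandwidths)


def classify_multires(x: float, samples_by_class: Dict[str, Sequence[float]],
                      bandwidths: Sequence[float]) -> str:
    """Pick the class with the largest prior-weighted averaged density at x."""
    total = sum(len(points) for points in samples_by_class.values())

    def score(label: str) -> float:
        points = samples_by_class[label]
        return (len(points) / total) * multires_density(x, points, bandwidths)

    return max(samples_by_class, key=score)


# Toy data (made up). With a single bandwidth this reduces to the basic version.
samples = {"a": [0.0, 0.2, 0.4], "b": [3.0, 3.1, 3.3]}
print(classify_multires(0.1, samples, bandwidths=[0.1, 1.0]))
```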

CERMINE wins award at ESWC 2015

2015-06-17T09:00:00+00:00

ADA Lab's CERMINE participated in the Semantic Publishing challenge during the recent Extended Semantic Web Conference (ESWC 2015) in Portorož, Slovenia and we won the Best Performing Approach Award!

The Extended Semantic Web Conference gathers researchers interested in various semantic technologies. The week between May 31st and June 4th was filled with workshops, tutorials, challenges, poster sessions, networking events and, of course, regular presentations. Apart from our participation in the Semantic Publishing Challenge, I came to ESWC to learn about the state of the art in knowledge representation (taking notes on ontologies and tools), and to meet researchers interested in machine-friendly scholarly communication.

Each of the three days of the main conference was kicked off by an excellent keynote: Viktor Mayer-Schönberger spoke about Big Data, Lise Getoor about Statistical Relational Learning and Massimo Poesio about Games with a Purpose. Two posters caught my attention: GERBIL, a system for benchmarking semantic annotations, and ODSF for managing data-intensive scientific collaboration.

The Semantic Publishing challenge, in which our CERMINE took part, gathered 9 teams, which worked on two tasks: one for extracting information from HTML pages and one for mining scholarly PDFs. The challenge was a good opportunity to meet some old friends (hello Christoph and Stefan!) and to make new ones (hello Angelo, Bahar, Francesco, Silvio and others!). Also, thank you Mendeley and Springer for sponsoring the awards!

This year the conference took place in lovely Portorož, Slovenia, right by the Adriatic Sea. It was the 12th edition of the event. As a newcomer, I was enchanted by the friendly and relaxed atmosphere; both the pre-organized and the spontaneous social events were a testimony that the community is well-integrated. I'm looking forward to next year's edition!

Introducing ADA Lab Open Science APIs

2015-05-11T12:00:00+00:00

Having our roots in the Centre for Open Science (CeON) we're very keen on making sure anybody interested can take advantage of algorithms we design. Today we are making another step in that direction: we introduce ADA Lab Open Science APIs.

APIs is the section of our website that will allow you to quickly see our technology in action. It contains demonstrators, each showcasing a small part of the methods that we have designed. Although experimental for the time being, a RESTful API is also provided so that you can use it in your apps.

To begin with we provide two CERMINE-based demonstrators: citation and affiliation parsers. Stay tuned as we'll regularly extend this section.

Text Mining Services in OpenAIRE

2015-02-16T11:00:00+00:00

Recently in Athens there was an impressive kick-off of the OpenAIRE2020 project, during which we presented OpenAIRE’s plans in the area of text and data mining of scholarly publications. Publications contain all kinds of rich information, which, although understandable to a human reader, is not machine-readable and thus cannot be used directly for indexing and recommending purposes. Authors’ affiliations, document classifications, references to biological and chemical databases, and acknowledgements to research funding agencies are all valuable pieces of information for OpenAIRE’s scholarly communication services.

(Photo credit: Wouter Vandenneucker, license: CC BY-SA 2.0)

Enriching content

At ADA Lab we are primarily interested in mining scholarly publications and extracting from them information that will help make OpenAIRE even more attractive to our users. On the one hand, the knowledge which we’re going to extract will shed more light on the state of scholarly communication. Thanks to quantitative indicators, policy-makers and enthusiasts alike will be able to take the pulse of open science and observe the latest trends in research. On the other hand, researchers will have better means to find the research outputs they need, as the extracted knowledge will allow our indexing services to better “understand” the content and come up with more relevant search results and recommendations.

How will we do this?

Together with our colleagues from CNR in Pisa and ARC in Athens, in the work package devoted to Knowledge Extraction Services (WP10), we will improve and expand the Information Inference Service created in the OpenAIREplus project. We will make improvements to the inference infrastructure: add visual workflow management and improve quality assurance. We will extend the existing document content analysis functionality to extract information about the structure of the document, the affiliations of the authors, and the sentiment of the citations. We will also enhance our automatic document classification functionality and introduce functionality for creating clusters of similar documents. We will also search for new types of links to outside knowledge bases, i.e., 3rd party, domain-specific repositories describing genes, chemicals, organisms, etc. Some solutions will be built from scratch, others will be based on software developed by the partners, like CERMINE and MadIS.

Code on Github

Finally, we will work on better uptake of the project’s deliverables by making our results even more discoverable and usable by the general public. To that end, we plan to migrate our code to GitHub (star our repository now) and to publish our data sets on Zenodo. Both the source code and the data sets will be available under open licenses, of course! The first deliverables in our work package are scheduled for August 2015. We’ll keep you up-to-date about our research on this blog, so stay tuned!

This blog post has been simultaneously published on the official OpenAIRE blog and on the ADA Lab blog. It is available under the CC BY 4.0 license.

Let's join FORCEs and make a difference in scholarly communication

2015-02-02T09:00:00+00:00

Two weeks ago I participated in FORCE2015 in Oxford. It was the third conference organized by the FORCE11 community and a must-attend event for people interested in scholarly communication, and in particular its problems and various ways of addressing them.

One great thing about FORCE11 conferences is that they gather together people from a wide variety of backgrounds and professions: publishers, funders, librarians, researchers, programmers, and so on. This made FORCE2015 a great place to discuss various groups' needs and expectations, gain collaborators, advertise one's own work to potential consumers, exchange ideas, and provide and receive feedback about various initiatives.

The day before the main conference I attended the ContentMine workshop. ContentMine is a community-driven initiative aiming at extracting facts (e.g. species, molecules, particles) from scientific literature and making them accessible and reusable. The project is still young and in the process of building the community, but definitely worth taking a closer look at.

Another young but already interesting initiative I came across during the conference is Libraccess - a project which aims at collecting, aggregating, deduplicating and making available all kinds of open access scientific resources. Since Libraccess has a lot in common with our COMAC project, we decided to join forces and use this great opportunity to achieve common goals collaboratively, making use of individual complementary strengths. There aren't a lot of details yet, but stay tuned!

During the conference I also presented a demo of CERMINE - our Java library for extracting metadata and bibliography from scientific literature. Many thanks to everyone who showed interest; it was really great to meet you all!

All the interesting presentations and discussions at FORCE2015 painted a clear picture of the current state of scholarly communication, its problems and the efforts made to solve them. For me the most important (and very encouraging) development is the growing understanding in the community that academic data is in fact not only text, and therefore simply putting paper publications into computers is not enough. Data sets, code and images should become first-class citizens - properly identified, shared and cited. So on the one hand, more and more tools and platforms for managing scientific artifacts other than text are emerging, and on the other, a lot of effort is dedicated to automatically processing the huge volume of already existing unstructured scientific text in order to reverse engineer the process of creating it, mine the knowledge buried in it and transform it into machine-readable formats. The latter is exactly what we are passionate about at ADA Lab.

FORCE2015 was a very interesting and unique experience for me. The event proved without a doubt that there are a lot of enthusiasts interested in the future of scholarly communication and the ways of improving it. Instead of individual people and teams attacking the same problems separately, we should start organizing in larger groups and collaborate across teams, organizations and countries. If we manage to do so, the FORCE will definitely be with us!

Kraków – where AI meets the law

2014-12-22T10:00:00+00:00

Recently, I was lucky enough to participate in the JURIX 2014 conference, taking place in Kraków, 10-12 December 2014. This was an event aimed at injecting the advancements of computer science into the legal domain. I must admit that the Organizers really achieved their goal. At least from my strongly computer-scientish perspective...

During the conference, I presented a proof-of-concept study on how to detect and analyze topical trends in public procurement judgments. You can have a look at my poster, preprint or paper. Being able to present your work and gather feedback is great (by the way, I am very grateful for all the questions and comments during the poster session). However, listening to other talks is even better, especially since JURIX 2014 provided loads of interesting material for me...

At the heart of each conference there are the invited talks. JURIX was no exception. Both of them were stunning.

On the first day, Noam Slonim presented the research related to the IBM Debating Technologies Project. Following its previous endeavour, Watson, IBM has come up with a new challenge. Watson was created, roughly speaking, to answer sophisticated questions formulated in natural language. Now IBM wants to teach the machine to search for claims for or against a given topic, together with the evidence supporting them. Typical topics could be banning violent video games or permitting performance-enhancing drugs in sports. To get a feeling for what it is like to debate with the machine, just spare 3 minutes to watch this video. If you are interested in the science behind it, read the very fresh papers of the IBM Debater group: the ACL Argumentation Mining Workshop 2014 paper or the COLING 2014 paper. For me the most amazing thing is that this debating technology works on the basis of a large body of raw text (e.g., Wikipedia). You basically make the computer read, understand and find only the information that is most relevant to you. As Noam pointed out, this is not another search engine, this is a research engine!

The second talk, despite its very difficult subject, was a great match for the first one. Pieter Adriaans talked about measures of the information present in data. This talk addressed very fundamental questions, which, sadly, are not asked frequently enough in the age of the Big Data fuss. The roots of this subject date back to the giants: Shannon, Fisher and Kolmogorov. You can have a look at this very interesting paper full of insights and further references.

Apart from the keynotes, there were a lot of interesting talks covering a large variety of subjects, such as linked data, legal information interchange standards/datasets, Bayesian networks, legal information retrieval systems, computer-aided analysis of legislation, etc. Browse the (unfortunately pay-walled) proceedings if you are hungry for more. The conference was accompanied by four workshops and a doctoral consortium. The Organizers opted for parallel sessions, so it was impossible to see all the interesting material. For me the definite highlights were the semantic workshop SW4LAW and the network analysis workshop NAiL2014. Luckily, the proceedings from both are freely available on-line here and there.

Altogether, JURIX 2014 was a very fruitful conference for me. I have learnt a lot about AI, law, and the intersection of both domains. A big "thank you" to the Organizers! I hope to make it to Braga in 2015!

Spark, D3, data visualization and Super Cow Powers

2014-11-26T15:00:00+00:00

Did you know that the amount of milk given by a cow depends on the number of days since its last calving? A plot of this correlation is called a lactation curve. Read on to find out how we use Apache Spark and D3 to estimate how much milk we can expect on a particular day.

Fig. 1. Milk yield per day.

Background — a lot of cows

Recently, ADA Lab started a cooperation with the Polish Federation of Cattle Breeders and Dairy Farmers (PFHBiPM). One of the goals of our project is as simple as that: predict how much milk a particular cow will produce on a particular day. It turns out PFHBiPM was doing big data before it was cool: they have gathered 80M records of test milkings from 3M cows over two decades. This created a great opportunity for data analysis.

While drilling through the data, we thought it would be interesting to visualize lactation curves passing certain points on a chart. Just drawing them wouldn't tell us much: they were too numerous. Our goal then was to create an interactive 2D histogram of data points with respect to days after calving and the amount of obtained milk.

Our technology toolbox

We've decided to harness Spark to choose interesting points and group them. The first idea was to create an app with GUI which at some point would trigger computation on a cluster. However, we have come across a Spark Summit talk about Spark Job Server. It's a piece of software which allows you to talk with your cluster via REST. It would be a shame not to make use of that!

Other pieces were easy to fit: we've crafted a website which uses AJAX to fire Spark jobs and uses D3 to present the results. Working with bleeding edge technologies previously taught us they tend to have sharp edges. Actually none of them were very serious: during compilation Job Server didn't pass all the tests (so we've turned them off...) and AJAX refused to send requests to the remote domain (so we've hacked the Job Server to include Access-Control-Allow-Origin: * HTTP header).

As for D3, we haven't used any off-the-shelf 2D histogram function. Instead we've used a range of lower-level D3 features: AJAX request handling, SVG manipulation and chart axis drawing.

The results

Fig. 2. Lactation curves passing a given point.

Above are the results of our work. Once again we've experienced that a picture is worth a thousand words. The X-axis represents days after calving, while the Y-axis corresponds to the milk yield. The darker the rectangle, the more data points in it. The red line is the arithmetic mean. We'll keep you informed about our milky project. Moo!
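The aggregation behind such a chart can be sketched without a cluster. The snippet below is plain Python, not our Spark job; the function names, the bin sizes and the sample numbers are made up for illustration. It keeps the curves that pass through a selected rectangle, bins their points into a 2D histogram and computes the mean yield per day bin (the red line).

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (days after calving, milk yield)


def curves_through(curves: Dict[str, List[Point]], day_range: Tuple[float, float],
                   yield_range: Tuple[float, float]) -> List[Point]:
    """Return all points of the curves that have a point inside the rectangle."""
    selected: List[Point] = []
    for points in curves.values():
        if any(day_range[0] <= d < day_range[1] and yield_range[0] <= y < yield_range[1]
               for d, y in points):
            selected.extend(points)
    return selected


def histogram_2d(points: List[Point], day_bin: float, yield_bin: float) -> Counter:
    """Count points per (day bin, yield bin) rectangle."""
    return Counter((int(d // day_bin), int(y // yield_bin)) for d, y in points)


def mean_yield_per_day_bin(points: List[Point], day_bin: float) -> Dict[int, float]:
    """Arithmetic mean of the yield in every day bin."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Counter = Counter()
    for d, y in points:
        sums[int(d // day_bin)] += y
        counts[int(d // day_bin)] += 1
    return {b: sums[b] / counts[b] for b in sums}


# Made-up sample: two cows, two test milkings each.
curves = {"cow1": [(5, 21.0), (15, 30.0)], "cow2": [(7, 12.0), (16, 18.0)]}
selected = curves_through(curves, day_range=(0, 10), yield_range=(20.0, 25.0))
print(histogram_2d(selected, day_bin=10, yield_bin=5))
print(mean_yield_per_day_bin(selected, day_bin=10))
```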

Affiliation parsing in CERMINE

2014-11-13T09:00:00+00:00

CERMINE is our Java library for extracting metadata from scientific literature. Among other information, CERMINE extracts the authors of the input document, their affiliations, and also associates authors with affiliations. Recently new functionality has been added: affiliation parsing.

The goal of affiliation parsing is to recognize affiliation string fragments related to institution, address and country. Additionally, country names are decorated with their ISO codes. Here follows an example of a parsed affiliation string (it conforms to the JATS document description format):

<aff id="id">
  <label>id</label>
  <institution>Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw</institution>,
  <addr-line>ul. Prosta 69, 00-838 Warsaw</addr-line>,
  <country country="PL">Poland</country>
</aff>

Affiliations are parsed with the use of a Conditional Random Fields (CRF) classifier. First the affiliation string is tokenized, then each token is classified as institution, address, country or other, and finally neighbouring tokens with the same label are concatenated. The main feature used in the CRF is the classified word itself. Additional features are all binary: whether the token is a number, whether it is an all-uppercase or all-lowercase word, whether it is a lowercase word that starts with an uppercase letter, and whether the token is contained in dictionaries of countries or of words commonly appearing in institutions or addresses. Additionally, the token's feature vector contains not only the features of the token itself, but also the features of the two preceding and two following tokens.

The affiliation parser was evaluated using 5-fold cross-validation on 8,000 affiliations from the PubMed Central Open Access Subset. A labelled affiliation fragment (institution, address or country) was considered correct only if the entire string was identical to the ground truth. The following results were obtained:
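The feature vector described above can be sketched as follows. This is a Python illustration, not CERMINE's Java code: the function names, the feature names and the tiny dictionaries are hypothetical, and the real dictionaries are of course much larger.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, bool]

# Tiny stand-in dictionaries (assumptions, for illustration only).
COUNTRIES: Set[str] = {"poland", "slovenia"}
INSTITUTION_WORDS: Set[str] = {"university", "centre", "institute"}
ADDRESS_WORDS: Set[str] = {"ul.", "street", "road"}


def token_features(token: str) -> Dict[str, Feature]:
    """The word itself plus the binary features of a single token."""
    lower = token.lower()
    return {
        "word": lower,
        "is_number": token.isdigit(),
        "is_all_upper": token.isalpha() and token.isupper(),
        "is_all_lower": token.isalpha() and token.islower(),
        "is_capitalized": len(token) > 1 and token[0].isupper() and token[1:].islower(),
        "in_countries": lower in COUNTRIES,
        "in_institution_words": lower in INSTITUTION_WORDS,
        "in_address_words": lower in ADDRESS_WORDS,
    }


def windowed_features(tokens: List[str], index: int, window: int = 2) -> Dict[str, Feature]:
    """Features of the token at `index` and of up to `window` neighbours on each side."""
    features: Dict[str, Feature] = {}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            for name, value in token_features(tokens[position]).items():
                features[f"{offset:+d}:{name}"] = value
    return features


tokens = ["University", "of", "Warsaw", ",", "Poland"]
for name, value in sorted(windowed_features(tokens, 4).items()):
    print(name, value)
```

Each such dictionary would then be passed, one per token, to a CRF implementation that predicts the institution/address/country/other label sequence.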

institution was correctly recognized in 92.4% of cases,
address was correctly recognized in 92.3% of cases,
country was correctly recognized in 99.5% of cases,
92.1% of affiliations were entirely correctly parsed.
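The strict exact-match criterion used for these numbers can be written down as a short scoring function. The sketch below is my own illustration in Python: the function name and the toy data are made up, and it covers only the scoring step, not the 5-fold cross-validation loop or the parser itself.

```python
from typing import Dict, List

FIELDS = ("institution", "address", "country")


def exact_match_rates(predicted: List[Dict[str, str]],
                      truth: List[Dict[str, str]]) -> Dict[str, float]:
    """Share of affiliations where a field (or all fields) equals the ground truth exactly."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    field_hits = {field: 0 for field in FIELDS}
    full_hits = 0
    for pred, gold in zip(predicted, truth):
        matches = {field: pred.get(field) == gold.get(field) for field in FIELDS}
        for field, ok in matches.items():
            field_hits[field] += ok
        full_hits += all(matches.values())
    rates = {field: field_hits[field] / len(truth) for field in FIELDS}
    rates["entire affiliation"] = full_hits / len(truth)
    return rates


# Toy data (made up): the second prediction gets the address slightly wrong.
truth = [
    {"institution": "ICM, University of Warsaw", "address": "ul. Prosta 69", "country": "Poland"},
    {"institution": "Some Institute", "address": "Main Street 1", "country": "Slovenia"},
]
predicted = [
    dict(truth[0]),
    {"institution": "Some Institute", "address": "Main Street", "country": "Slovenia"},
]
print(exact_match_rates(predicted, truth))
```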

The affiliation parser can be used via a REST service. It can be accessed using the cURL tool:

$ curl -X POST --data "affiliation=the text of the affiliation" http://cermine.ceon.pl/parse.do

For more information about the usage, visit CERMINE's GitHub page.
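The same request can be made from code. Below is a minimal sketch using only the Python standard library; the function names are mine, the URL and the affiliation parameter are the ones from the cURL call above, and I am assuming the service answers with text that can be decoded as UTF-8. Unlike curl --data, urlencode also percent-encodes special characters in the affiliation string.

```python
import urllib.parse
import urllib.request

# URL taken from the cURL example above; it may have changed since this post.
CERMINE_PARSE_URL = "http://cermine.ceon.pl/parse.do"


def build_affiliation_request(affiliation: str) -> urllib.request.Request:
    """Prepare a form-encoded POST request carrying the affiliation text."""
    body = urllib.parse.urlencode({"affiliation": affiliation}).encode("utf-8")
    return urllib.request.Request(CERMINE_PARSE_URL, data=body, method="POST")


def parse_affiliation(affiliation: str, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body (needs network access)."""
    request = build_affiliation_request(affiliation)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_affiliation_request("ICM, University of Warsaw")
print(request.get_method(), request.full_url)
print(request.data)
```

Calling parse_affiliation("the text of the affiliation") performs the actual HTTP call.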

Interview with Michael Jordan about machine learning, big data, and other things

2014-10-27T19:00:00+00:00

Recently, IEEE Spectrum interviewed Michael Jordan - a leading researcher in machine learning. He gave his view on hype in machine learning as well as in big data analysis and presented his point of view related to some other interesting issues (technological singularity, P=NP, Turing test).

Here are the more interesting points related to machine learning and big data that were made by the researcher:

Biological interpretations seem to be overused in the field of machine learning. Case in point: the activation function used in the neural network perceptron model is the same function as the one used in the statistical method called logistic regression, which dates back to the 1950s. The latter method has nothing to do with neurons.
Due to the huge amount of data analyzed, it is very easy to find spurious dependencies in big-data projects. People active in the field are not paying enough attention to this problem. There are some statistical methods to deal with these problems, like familywise error statistical tests, but many of them haven't been studied computationally and it will take decades to get them right.
Data analysis can deliver inferred data at a certain level of quality and we need to be explicit about it. We need to add error bars to the inferred data that we show. This approach is missing in much of the current machine learning literature.
Because big data analyses often do not present information about the quality of the produced predictions and, in more general terms, are often not methodologically sound, this might result in a "big-data winter": a general state of disappointment and lack of funding related to big data after its hype bubble bursts.

Paperity chooses CERMINE as its content extraction engine

2014-10-23T12:00:00+00:00

This is a guest post by Selcuk Ayguney and Marcin Wojnarski, creators of Paperity. We invited the authors to share their reasons for choosing ADA Lab's (recently awarded) CERMINE as their content extraction engine. Here's their story.

Paperity (www.paperity.org) is the first multi-disciplinary aggregator of peer-reviewed Open Access journals and papers, "gold" and "hybrid". It was launched at the beginning of October 2014 to facilitate access to scholarly literature across all different fields and already includes nearly 200,000 articles from over 2,000 journals. While developing Paperity, we encountered the problem of extracting full text from PDF documents. The vast majority of academic papers are published as PDFs, and we wanted to unlock their contents and make them searchable in Paperity.

Extracting text from a PDF document is one of the hardest practical problems that seem easy at first sight. It should not be much different than using a word processor, right? Absolutely wrong. The PDF format is designed for laying out pages and faithfully reproducing the same visual layout everywhere, be it a screen or a printer. Therefore, it does not consist of a continuous stream of letters, words, and sentences, but of pages and objects with specific sizes and coordinates relative to the page. This is a very low-level representation that must be thoroughly preprocessed before it can be analyzed as a complete text. Moreover, PDF authoring tools apply different typographical tricks while converting the text to PDF. For example, letters "f" and "i" are typically joined in a single "glyph", to make them look better when printed, so that, for instance, the word "justification" in your word processor becomes "justiﬁcation" (note the single character that is a combination of "f" and "i") when converted to PDF. This adds another level of complexity while extracting text. Split words at the end of lines pose another problem.
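When an extractor does hand back the ligature as a single character, Unicode compatibility normalization is one generic way to split it again. The sketch below uses Python's standard unicodedata module. It is not how any of the tools discussed here handle ligatures, and the function name is ours. Keep in mind that NFKC also rewrites other compatibility characters (superscripts, for example), which may or may not be desirable.

```python
import unicodedata


def normalize_ligatures(text: str) -> str:
    """Replace compatibility characters such as the 'fi' ligature (U+FB01) with plain letters."""
    return unicodedata.normalize("NFKC", text)


extracted = "justi\ufb01cation"        # the word with a single 'fi' glyph
print(len(extracted))                  # 12 characters
print(normalize_ligatures(extracted))  # justification (13 characters)
```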

We evaluated several toolkits designed for text extraction from PDF documents. In this article, we will share our findings and the rationale behind our final choice of CERMINE – the extraction tool developed by ADA Lab and CeON in ICM UW. After reviewing many packages, we shortlisted the following three open source tools:

PyPDF2: Written in Python, PyPDF2 is the successor of pyPDF and mainly focuses on document manipulation.
Apache PDFBox: Written in Java, it allows creation of new PDF documents, manipulation of existing documents and the ability to extract contents from documents.
CERMINE: Written in Java, designed specifically for analysis of scholarly articles; it is both a library and a web service for extracting metadata and content from scientific papers.

We will use a sample Open Access paper (Sulfolobus chromatin proteins modulate strand displacement by DNA polymerase B1, Nucleic Acids Research, 2013, Vol. 41, No. 17, accessible at http://paperity.org/p/34709961) to demonstrate some of the test cases we were concerned with. An image of the first page is included here for understanding the test results:

Content Selection Comparison

The first problem that arises during analysis of scholarly PDFs is how to correctly detect different blocks of text. A scholarly paper is not a continuous stream of words. Rather, it contains many different sections that can be arranged in very different ways on the page and within the entire document. Each block plays a different role, therefore it is important to correctly detect all of them and discover what roles they play in the document. Only CERMINE was able to do this job. Below we give a preview of the outputs of all three tools.

PDFMiner:

PDFBox:

CERMINE:

(Please note that the red arrows denote line wraps.)

At the beginning of the text extracted by PDFMiner and PDFBox, you can see metadata (title, journal name, authors) and the article abstract, all combined into one stream of text. Moreover, paragraphs are most often broken into many separate lines in the output stream. CERMINE, on the other hand, can properly concatenate lines that belong to the same paragraph. It also detects the type of each block and starts extraction directly with the main body of the article. When necessary, CERMINE can also extract metadata fields as separate items in the output XML file.

All three tools did a good job of eliminating the download link written vertically on the right-hand side of the pages.

Paragraph Structure and Formatting Comparison

Below is a more detailed example of how paragraphs are processed by the evaluated tools.

Original:

PDFMiner:

PDFBox:

CERMINE:

We can see that both PDFMiner and PDFBox produce line breaks wherever the original PDF has one, while CERMINE successfully joins lines into paragraphs, retaining the document structure. In other experiments, we have also found that CERMINE can actually represent the paragraph structure with high accuracy.

Moreover, only CERMINE was able to successfully merge split words, such as “com-pacted” in the example above, which it merged into “compacted”. This is a very important feature, especially in documents with a multi-column layout, where split words are very common. While PDFMiner was unable to parse typographical shorthands (like the joined f and i), both PDFBox and CERMINE were able to interpret them correctly.
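The two repairs discussed in this section, joining the lines of a paragraph and merging words split at a line end, can be sketched in a few lines of Python. This is a naive illustration under our own assumptions (paragraphs separated by blank lines, every line-final hyphen treated as a split word) and not CERMINE's actual method. The function name join_lines is hypothetical.

```python
import re

# A hyphen at the very end of a line, between two word characters
_SPLIT_WORD = re.compile(r"(\w)-\n(\w)")
# A single line break that is not part of a blank-line paragraph separator
_SOFT_BREAK = re.compile(r"(?<!\n)\n(?!\n)")


def join_lines(text: str) -> str:
    """Merge words split across lines, then join lines within each paragraph."""
    merged = _SPLIT_WORD.sub(r"\1\2", text)  # "com-\npacted" -> "compacted"
    return _SOFT_BREAK.sub(" ", merged)      # remaining soft breaks become spaces


raw = "DNA is com-\npacted by chromatin\nproteins.\n\nNext paragraph."
print(join_lines(raw))
```

The rule is deliberately simple: it would also turn a genuine compound such as "well-" followed by "known" on the next line into "wellknown", so a production tool needs a dictionary or a learned model to decide when the hyphen should stay.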

Conclusions

All in all, we found CERMINE to be a very sophisticated and reliable tool for analysis of scholarly PDFs. It does an excellent job of detecting different types of blocks, preserving the structure of paragraphs and decoding characters correctly. Just as importantly, CERMINE is open source, which enables unconstrained use and guarantees that the tool will be easy to customize when necessary and will have a growing community of users and contributors. That is why we decided to use it in Paperity.

We hope that CERMINE will be developed further. One possible direction we might suggest is the recognition of national characters. This is a difficult task, given the peculiar encoding used in many PDFs, but it is very important for the scholarly community and surely one that the CERMINE team can solve.

Summer internship at ADA Lab

2014-10-06T09:00:00+00:00

My name is Jan Lasek and I was an intern with the ADA Lab team at ICM this summer. I have to say that it was a great experience to work here!

I cooperated with Dominika Tkaczyk on adding new functionality to the CERMINE project. Our goal was to extract tables of contents from scholarly publications. Like many other problems in PDF processing, this is a challenging task. It is quite simple, even straightforward, for a human, but machines still struggle with tasks of this type.

To be more precise: a given PDF is a sequence of consecutive lines of text. The goal is to extract the header lines into a structured table of contents. We divided the main task into two parts:

extracting outlying lines in a document that are presumably header lines, and
grouping the lines according to similar formatting to arrive at section, subsection and possibly subsubsection headers.

Our initial approach to the first task was to cluster the lines according to their various properties. We wanted to identify outliers, i.e. the lines that did not fit into the main group of lines. This solution retrieved most of the headers, but also many false positives: other lines that looked "suspicious" (e.g. equations, parts of tables, the last line in a column). We therefore decided to try a supervised classifier that learns to select header lines. To this end, we labeled all lines in 150 documents as regular or header lines. For the classification task we employed the Random Forest algorithm as "the best off-the-shelf classifier". This gave us promising results, with an F-score of about 0.95.
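For readers unfamiliar with the F-score: it is the harmonic mean of precision (what fraction of the lines flagged as headers really are headers) and recall (what fraction of the true headers were found). The Python sketch below shows the arithmetic. The counts in the usage line are made up for illustration and are not the numbers from our experiment.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 95 headers found correctly, 5 regular lines
# wrongly flagged as headers, 5 headers missed
print(round(f_score(true_pos=95, false_pos=5, false_neg=5), 2))
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot reach a high F-score by flagging every line as a header or by flagging almost none.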

As far as the second task is concerned, we used a clustering algorithm to choose the best division of the lines (those identified as "headers" in the first step) into one, two or three groups. In initial experiments, when we applied the clustering algorithm to "clean data" (that is, manually labeled lines), we were able to extract 3 out of 4 tables of contents without any mistake. When we merged both steps of the project, we obtained a solution that retrieves every second table of contents without an error. There is still some work to be done to arrive at 9 out of 10 accuracy!
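To give a feeling for the second step, here is a deliberately simplified Python sketch. Instead of a clustering algorithm, it groups header lines by exact formatting and ranks the groups by font size, so it is a stand-in for the idea and not the method we implemented. The HeaderLine class and its font_size and bold attributes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderLine:
    text: str
    font_size: float  # hypothetical formatting attributes of a line
    bold: bool


def assign_levels(headers, max_levels=3):
    """Map each header to level 1..max_levels: larger fonts rank higher,
    and bold ranks above regular at the same size."""
    formats = sorted({(h.font_size, h.bold) for h in headers}, reverse=True)
    level_of = {fmt: min(i + 1, max_levels) for i, fmt in enumerate(formats)}
    return [(level_of[(h.font_size, h.bold)], h.text) for h in headers]


headers = [
    HeaderLine("Introduction", 14.0, True),
    HeaderLine("Background", 12.0, True),
    HeaderLine("Methods", 14.0, True),
    HeaderLine("Data cleaning", 12.0, False),
]
for level, text in assign_levels(headers):
    print("  " * (level - 1) + text)
```

Exact matching breaks down as soon as font sizes differ slightly between pages, which is one reason a clustering algorithm is a more robust choice for real documents.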

Summing up, my internship at ICM UW was time well spent and I really appreciate that experience. I had an opportunity to work on an interesting and challenging task and cooperate with people passionate about their work. See you soon!

Impressions from PolTAL 2014

2014-09-30T20:00:00+00:00

A couple of days ago, members of our lab participated in PolTAL 2014, a conference bringing together linguists, computer scientists, and other researchers involved in computational linguistics and natural language processing.

After Iceland, Japan and numerous other distant places, the TAL conference made it to Warsaw this year. Two of the ADALabers (Michał Jungiewicz and Michał Łopuszyński) therefore used this opportunity to present a poster on Unsupervised Keyword Extraction from Polish Legal Texts.

The conference was full of interesting and informative research reports. All the materials were generously made available by the participants with the encouragement and help of the organisers. Springer published the LNCS volume with the PolTAL Proceedings. They promise to keep it available for free for a few weeks after the closing of the conference. So do not hesitate to grab your copy!

Just to whet your appetite before browsing the above materials, let us mention some of our personal highlights.

Johan Bos presented an excellent keynote on Adventures in Meaning Banking, where he gave an overview of the experiences from building the Groningen Meaning Bank (GMB). GMB is free! Interestingly, you can not only browse and download it, but also improve it. You do the latter by playing a game called wordrobe (sic!). This is an excellent example of a game with a purpose, a recently popular strategy for attracting volunteers to a project.

Melanie Reiplinger presented a very tasty talk on Relation Extraction for the Food Domain. This was mostly focused on relations substitutedBy and suitsTo. The analysis was carried out on unlabelled data from the chefkoch.de community forum. Melanie and her colleagues even put up a web page with data together with a little demo in German to let you play with their results...

There were also many interesting posters, e.g., Evaluation of IR Strategies for Polish (this is always interesting if you work with search engines for documents in Polish, like we do), Named Entity Recognition in Tweets (the form of microposts renders many traditional linguistic tools useless, so you have to develop new ones), or Slovak Web Discussion Corpus and accompanying NLP tools (in terms of NLP resources, Slavic languages are unpopular and difficult to work with, so we appreciate the efforts of our colleagues from Košice).

The above highlights are by no means exhaustive; we definitely encourage you to fish for your own favourites here and there!

In the meantime, we count down to xTAL 2016, wherever x turns out to be ...

Mind the gap! – DL2014

2014-09-23T09:30:00+00:00

Recently a few people from our lab visited London to participate in Digital Libraries 2014, a joint edition of TPDL and JCDL, the two best-known conferences on digital libraries.

Łukasz was co-chairing the 2nd Workshop on Linking and Contextualizing Publications and Datasets, while Dominika and Mateusz were presenting their results at the 3rd International Workshop on Mining Scientific Publications.

LCPD2014 had a number of great presentations and — even more importantly — lively discussions about managing, sharing, peer-reviewing and citing research data sets. There was a good mix of theory and practice, including an excellent case study of the CRAWDAD repository presented by Tristan Henderson, and a comprehensive list of "dos" and "don'ts" for data repositories by Sarah Callaghan, based on her peer review of Earth sciences data sets.

WOSP2014 was an amazing opportunity to meet people from all over the world interested in scholarly communication, including gurus like C. Lee Giles. We discovered several very promising areas for cooperation and can't wait to unveil more details soon. Meanwhile, here are our presentation slides:

Dominika Tkaczyk, Pawel Szostek and Łukasz Bolikowski: GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles.
Mateusz Fedoryszak and Łukasz Bolikowski: Efficient blocking method for a large scale citation matching.

Proceedings from both events will soon appear as special issues in the D-Lib Magazine.

Want to remember Spark API or learn Scala? Use our courses on memrise.com

2014-09-15T15:30:00+00:00

You need 20 hours to get reasonably good at something and 10,000 hours to become an expert in any domain. Become an expert more easily and faster!

So how do you do it? First of all, you do the same thing for a long time. In fact, this is everything you need. The longer version is that you are good at something because you refresh your knowledge repeatedly. In order to do it effectively, you should make the Learning Curve work for you and not against you.

Now, how do you become an expert in everything you want and at the same time use the Learning Curve? To do so, you need to be exposed to the knowledge at appropriate time intervals. Doing it by yourself is tedious, but what if there was an app that could do it for you? And what if its name was Memrise?

Finally, the most important question: how would this blog post go if no more freaky questions were asked?!

So we have this great tool and we can use it. In IT there are myriads of APIs and other stuff.

For example, learning Google Guava is as easy as growing and watering when following the course "Google Guava Library" (when you follow the link you will know why I have suddenly switched to gardening vocabulary).
Description of your stuff in Linux OS is in the course "Linux file system hierarchy".
Fancy Vim shortcuts are in "Vi Keyboard" and "Vim".

This is a good time to state that I am not an employee of Memrise and I have no shares of it in my pocket. Also, although the above courses were mainly created by me, they are free and you can use them to boost your own skills.

As I have many more courses to mention, I have grouped them into tracks, each of which builds one skill set at a time.

Perfect UNIX Administrator

Programming Polyglot

Google Cloud Engine Specialist

Connoisseur of Poland :-)

datadr: split-apply-combine package for R backed by Hadoop

2014-09-04T17:40:00+00:00

datadr is a package for the R programming language that provides split-apply-combine functionality for data transformation. See the Quickstart section in the project's documentation for a nice overview of the package's capabilities.

The package has three back-ends:

one to process small data on a single processor core,
a second one to process medium data on many cores,
and a third one, called RHIPE, to process big data on a Hadoop cluster using the MapReduce programming model.

Split-apply-combine allows the user to split data into subsets, apply a certain transformation to each of the subsets, and combine the results; its counterpart in SQL is GROUP BY. It is analogous to what two other popular R libraries, plyr and dplyr, provide. The main difference is that datadr can use a Hadoop cluster and thus process really large amounts of data (see the project's FAQ for more details).

The GitHub page of the datadr project shows that the first commit was made in March 2013 and that the project has one developer. The GitHub page of its Hadoop backend, RHIPE, shows that the first commit was made in August 2009 and that it has 5 developers.
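For readers who do not use R, the split-apply-combine pattern itself can be sketched in plain Python. This is only an illustration of the pattern on a single machine, not of datadr's API or of its Hadoop back-end. The function name and the toy records are our own.

```python
from collections import defaultdict


def split_apply_combine(rows, key, apply):
    """Split rows into groups by key, apply a function to each group,
    and combine the per-group results into one dictionary."""
    groups = defaultdict(list)
    for row in rows:                       # split
        groups[key(row)].append(row)
    return {k: apply(group) for k, group in groups.items()}  # apply + combine


# Toy records of (group, value); roughly SELECT grp, AVG(value) ... GROUP BY grp
rows = [("a", 2), ("a", 4), ("b", 10)]
averages = split_apply_combine(
    rows,
    key=lambda row: row[0],
    apply=lambda group: sum(value for _, value in group) / len(group),
)
print(averages)  # {'a': 3.0, 'b': 10.0}
```

Since each group is processed independently, the "apply" step can run in parallel, which is what makes the pattern a natural fit for MapReduce.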

CockroachDB: an open source version of Google Spanner

2014-07-25T10:20:00+00:00

A team of ex-Googlers is building an open source version of Google Spanner, i.e., a transactional database that spans across many data centers.

Google Spanner is a globally scalable database system used internally by Google; it is the successor of the BigTable database. The database supports SQL-like queries, implements transactions, can be distributed among many data centers, and is fast. The paper describing the technology, published by a team of Googlers in 2012, caused quite a stir in the community, since it managed to marry features that were thought to be impossible to combine. See, e.g., the High Scalability blog, or Doug Cutting's statement that Spanner-like technology should be the next step in Hadoop's evolution (in the clip where he talks about Hadoop's evolution, the interesting part starts at 5:50).

Now, as reported in a recent article at Wired, a team of ex-Googlers is developing an open source version of Google Spanner called CockroachDB. However, the team does not plan to implement all the features of the original solution in the first version of the project. Currently, their focus is on automatic replication of data between data centers and on ensuring that the solution does not depend on any particular distributed file system or system manager. According to the article, the project is in the "alpha" development phase and is "nowhere near ready for use with production services."

Building Apache Spark App with Maven

2014-07-15T12:00:00+00:00

Recently we've been working on building Spark apps with Maven.

We've found these two texts very valuable:

However, we wanted something more: deployment of a Spark app dependent on additional libraries. In this note we'll show you how to do that. The complete code is available on GitHub; here we'll just highlight some snippets.

Two elements are needed. The first is appropriate settings in the dependencies section of the POM file. Note that the Spark and Scala libraries are marked as provided: they are available at compile time but are not bundled into the application jar, because the cluster's Spark installation supplies them at runtime.

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.3</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>0.9.1</version>
    <scope>provided</scope>
  </dependency>
  <!-- that's our additional dependency -->
  <dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0</version>
  </dependency>
</dependencies>

Second, the submit.sh script to make building the classpath easier.

#!/bin/bash

# Example usage:
# ./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master=local SimpleApp

JAR_FILE=$1
# Everything after the jar path: JVM options and the main class
REMAINDER=("${@:2}")

# If needed, alter these paths to conform to your config

source /etc/spark/conf/spark-env.sh

export SPARK_HOME=/usr/lib/spark
export HADOOP_HOME=/usr/lib/hadoop

# system jars:
CLASSPATH=/etc/hadoop/conf
CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-mapreduce/*:$HADOOP_HOME/../hadoop-mapreduce/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-yarn/*:$HADOOP_HOME/../hadoop-yarn/lib/*
CLASSPATH=$CLASSPATH:$HADOOP_HOME/../hadoop-hdfs/*:$HADOOP_HOME/../hadoop-hdfs/lib/*
CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*

# app jar:
CLASSPATH=$CLASSPATH:$JAR_FILE
CONFIG_OPTS="-Dspark.jars=$JAR_FILE"
# Quote the classpath so the shell passes the * wildcards to java unexpanded
java -cp "$CLASSPATH" "$CONFIG_OPTS" "${REMAINDER[@]}"

To run everything just type:

git clone https://github.com/CeON/spark-intro.git
cd spark-intro  
mvn clean compile assembly:single
./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master=local SimpleApp

Happy coding!

Data science workflow

2014-06-13T16:30:00+00:00

A description of a data scientist's workflow, published on the CACM blog.

Generally, the workflow consists of 4 interconnected phases with some sub-steps:

Preparation:
- Acquire data
- Reformat and clean data
Analysis:
- Edit analysis scripts
- Execute scripts
- Inspect outputs
- Debug
Reflection
Dissemination

Each of these phases comes with its own challenges. The author developed prototype tools that are supposed to address them.

One interesting insight from the blog entry is that data scientists report manual data cleaning as the most tedious and time-consuming part of their workflows. However, the author stresses that this step is also a very important one since "the chore of data reformatting and cleaning can lend insights into what assumptions are safe to make about the data, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply."

FUSE: project for mining game-changing technologies from scientific publications and patents

2014-06-12T12:00:00+00:00

In May's Nature, there is a column about an interesting text mining project called FUSE. The project is backed by a US intelligence agency; its goal is to predict game-changing technologies by mining scientific publications and patent applications.

FUSE is one of the first projects to mine the full texts instead of only abstracts.

There are 3 teams that take part in the project:

One that "mines text for keywords, citations and phrases that indicate authors' outlooks in scholarly papers."
Another one that extracts "sentiment" in the natural language of papers.
The third team analyses connections between different topics, keywords and authors.

This four-year project started in 2011.