My name is Jan Lasek and I spent the summer as an intern in the ICM ADA Lab team. I have to say that it was a great experience to work here!
I worked with Dominika Tkaczyk on implementing new functionality for the CERMINE project. Our goal was to extract a table of contents from scholarly publications. Like many other problems in PDF processing, this is a challenging task: it is simple, even straightforward, for a human, yet machines still struggle with tasks of this type.
To be more precise: a given PDF is a sequence of consecutive lines of text. The goal is to extract the header lines into a structured table of contents. We divided the main task into two parts:
- extracting outlying lines in a document that are presumably header lines, and
- grouping those lines by similar formatting to arrive at section, subsection and possibly subsubsection headers.
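To make the target of the two steps concrete, here is a minimal sketch of what a "structured table of contents" could look like once header lines have been found and assigned a level. The `TocEntry` class and the `build_toc` function are hypothetical names made up for this illustration; they are not part of CERMINE.

```python
from dataclasses import dataclass, field


@dataclass
class TocEntry:
    # Hypothetical container for one header in the table of contents.
    title: str
    level: int  # 1 = section, 2 = subsection, 3 = subsubsection
    children: list = field(default_factory=list)


def build_toc(headers):
    """Nest (title, level) pairs, given in document order, into a tree."""
    root = TocEntry(title="", level=0)
    stack = [root]  # path from the root to the most recent entry
    for title, level in headers:
        entry = TocEntry(title, level)
        # Climb back up until the top of the stack is a proper parent.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(entry)
        stack.append(entry)
    return root.children


toc = build_toc([
    ("Introduction", 1),
    ("Methods", 1),
    ("Data", 2),
    ("Labels", 3),
    ("Results", 1),
])
```

In this example `toc` holds three top-level sections, and "Data" sits under "Methods" with "Labels" nested below it.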
Our initial approach to the first task was to cluster the lines according to their various properties. We wanted to identify outliers, i.e. the lines that did not fit into the main group of lines. This solution retrieved most of the headers, but it also produced many false positives: other lines that looked "suspicious" (e.g. equations, parts of tables, the last line in a column). We therefore decided to try a supervised classifier that learns to select header lines. To this end, we labeled all lines in 150 documents as either regular or header lines. For the classification itself we employed the Random Forest algorithm as "the best off-the-shelf classifier". This gave us promising results, with an F-score of about 0.95.
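For readers unfamiliar with the F-score, here is a small worked sketch of how it can be computed for the header class. I assume the balanced variant (F1, the harmonic mean of precision and recall); the `f_score` function and the toy labels below are made up for illustration and are not our actual evaluation code or data.

```python
def f_score(true_labels, predicted_labels, positive="header"):
    """Balanced F-score (F1) for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    # Headers that were correctly found.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # Regular lines wrongly marked as headers (false positives).
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Headers that were missed (false negatives).
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: six lines, three of them true headers.
truth = ["header", "regular", "regular", "header", "regular", "header"]
guess = ["header", "regular", "header", "header", "regular", "regular"]
score = f_score(truth, guess)
print(round(score, 3))  # prints 0.667
```

Here two headers are found, one regular line is wrongly flagged and one header is missed, so precision and recall are both 2/3 and so is the F-score.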
As for the second task, we used a clustering algorithm to choose the best division of the lines (those identified as "headers" in the first step) into one, two or three groups. In initial experiments, when we applied the clustering algorithm to "clean data" (that is, manually labeled lines), we were able to extract 3 out of 4 tables of contents without any mistake. When we merged both steps of the project, we obtained a solution that retrieves every second table of contents without an error. There is still some work to be done to arrive at 9 out of 10 accuracy!
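The idea of splitting headers into one, two or three groups can be illustrated with a deliberately simplified sketch. Instead of the clustering algorithm we actually used, it looks only at font size and cuts the sorted sizes at the largest gaps. The `group_headers` function, the `min_gap` threshold and the sample headers are all assumptions made for this example.

```python
def group_headers(headers, max_levels=3, min_gap=1.0):
    """Assign a level (1 = largest font) to each (text, font_size) pair.

    Simplified stand-in for clustering: sort the distinct font sizes and
    cut at up to `max_levels - 1` of the largest gaps of at least `min_gap`.
    """
    sizes = sorted({size for _, size in headers}, reverse=True)
    # Gap between each distinct size and the next smaller one.
    gaps = [(sizes[i] - sizes[i + 1], i) for i in range(len(sizes) - 1)]
    largest = sorted(gaps, reverse=True)[: max_levels - 1]
    cuts = {i for gap, i in largest if gap >= min_gap}
    level_of = {}
    level = 1
    for i, size in enumerate(sizes):
        level_of[size] = level
        if i in cuts:
            level += 1  # sizes after a cut belong to the next level down
    return [(text, level_of[size]) for text, size in headers]


headers = [
    ("1 Introduction", 14.0),
    ("1.1 Motivation", 12.0),
    ("2 Methods", 14.0),
    ("2.1 Data", 12.0),
    ("2.1.1 Labels", 10.5),
]
levels = group_headers(headers)
```

With three clearly separated font sizes the sketch yields three levels; if all headers share one size, everything stays at level 1. Real documents need more features than font size (boldness, numbering, indentation), which is why a proper clustering step is needed in practice.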
Summing up, my internship at ICM UW was time well spent and I really appreciate the experience. I had the opportunity to work on an interesting and challenging task and to work alongside people who are passionate about what they do. See you soon!