Visit to ScraperWiki

26 March 2014 by Dominika Tkaczyk

I spent last week in Liverpool visiting ScraperWiki. ScraperWiki provides tools for extracting, cleaning, analysing and managing data coming from various sources.

I spent most of my time in conversations, getting familiar with the variety of ScraperWiki's excellent extraction utilities, which can deal with web pages, Twitter data, PDFs, and more. Since my work is partly related to processing PDF files, I was mostly interested in pdftables - a library for extracting tabular data from PDFs. During a brown bag session I also presented CERMINE - ADA Lab's system for extracting metadata and content from scientific literature (which currently barely touches tables in articles). The two solutions have different purposes and scopes, so they cannot be directly compared. Currently the only common part is parsing the PDF content, which is done by Poppler in pdftables and by iText in CERMINE.

Another interesting difference is that pdftables (and, I believe, ScraperWiki's other tools as well) focuses strongly on providing perfect results for specific, known cases, while CERMINE is a generic solution designed to deal with a wide variety of layouts. As a result, pdftables and other extraction tools will most likely need some manual work to adapt to documents and sources they've never seen before, but once this is done, they simply WORK. No excuses. CERMINE, on the other hand, will work out of the box for many different cases and document layouts, but the extraction results may not be perfect.

Apart from the technical aspects, I was also very interested to see the differences between the "academic" and "commercial" workplace (I am mostly familiar with the former). Unfortunately, my hopes of experiencing a work environment different from the one I am used to were shattered very quickly. It turned out that the atmosphere at ScraperWiki is not that far from the academic world I know. I felt it even before I entered the building for the first time - ScraperWiki is located on the university campus, among university buildings. The warm and welcoming atmosphere in the company's room only added to this feeling. From day one, and in particular from the first stand-up meeting (held every morning), I was treated as a member of the family. The tea (no milk!) miraculously appeared on my desk every day, and I never even had to wash the tea mug (sorry, guys!).

The visit was an awesome experience. I met a lot of wonderful people, smart and passionate about their work. It seems that no matter where you come from, you will always feel good among other "computer people".
