<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	
	<title type="text" xml:lang="en">Applied Data Analysis Lab | Technology and news from the world</title>
	<link type="application/atom+xml" href="http://adalab.icm.edu.pl/feeds/news-atom.xml" rel="self"/>
 	<link type="text" href="http://paulstamatiou.com" rel="alternate"/>
	<updated>2018-03-02T08:54:49+00:00</updated>
	<id>http://adalab.icm.edu.pl/blog/</id>
	<author>
		<name>ADA Lab, ICM UW</name>
	</author>
	<rights>CC-BY-SA 3.0</rights>
	
	<entry>
		<title>Interview with Michael Jordan about machine learning, big data, and other things</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/10/27/interview-with-michael-jordan.html"/>
		<updated>2014-10-27T19:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/10/27/interview-with-michael-jordan</id>
		<content type="html">&lt;p&gt;Recently, &lt;a href=&quot;http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts&quot;&gt;IEEE Spectrum interviewed&lt;/a&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/Michael_I._Jordan&quot;&gt;Michael Jordan&lt;/a&gt; - a leading researcher in machine learning. He gave his view on hype in machine learning as well as in big data analysis and presented his point of view related to some other interesting issues (technological singularity, P=NP, Turing test).&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Here are some more interesting points about machine learning and big data made by the researcher:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;biological interpretations seem to be overused&lt;/strong&gt; in the field of machine learning. Case in point: the activation function used in the neural network perceptron model is the same function as the one used in the statistical method called logistic regression, which dates back to the 1950s. The latter method has nothing to do with neurons (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Due to the huge amounts of data analyzed, it is &lt;strong&gt;very easy to find spurious dependencies in big-data projects&lt;/strong&gt;. People active in the field are not paying enough attention to this problem. There are statistical methods to deal with it, like familywise error rate tests, but many of them haven&amp;#39;t been studied computationally and it will take decades to get them right.&lt;/li&gt;
&lt;li&gt;Data analysis can deliver inferred data only at a certain level of quality, and we need to be explicit about it. We &lt;strong&gt;need to add error bars to the inferred data that we show&lt;/strong&gt;. This approach is missing from much of the current machine learning literature.&lt;/li&gt;
&lt;li&gt;Because big data analyses often do not present information about the quality of the produced predictions, and, in more general terms, are often not methodologically sound, this might result in a &lt;strong&gt;&amp;quot;big-data winter&amp;quot;&lt;/strong&gt;: a general state of disappointment and lack of funding related to big data after its hype bubble bursts.&lt;/li&gt;
&lt;/ul&gt;
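
&lt;p&gt;To make the first point concrete, here is a minimal sketch (our illustration, not code from the interview) showing that the sigmoid activation of a perceptron-style unit and the inverse link function of logistic regression are literally the same formula:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object LogisticSketch {
  // The logistic function, long used in statistics
  def logistic(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // A one-layer neural unit: a weighted sum passed through a sigmoid activation
  def neuronOutput(w: Array[Double], x: Array[Double]): Double =
    logistic(w.zip(x).map { case (wi, xi) =&amp;gt; wi * xi }.sum)

  // Logistic regression models P(y = 1 | x) with exactly the same expression
  def logisticRegression(beta: Array[Double], x: Array[Double]): Double =
    neuronOutput(beta, x)

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.2)
    val x = Array(1.0, 2.0)
    println(neuronOutput(w, x) == logisticRegression(w, x)) // true; no neurons involved
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;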
</content>
	</entry>
	
	<entry>
		<title>Want to remember the Spark API or learn Scala? Use our courses on memrise.com</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/09/15/memrise.html"/>
		<updated>2014-09-15T15:30:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/09/15/memrise</id>
		<content type="html">&lt;p&gt;You need &lt;a href=&quot;https://www.youtube.com/watch?v=5MgBikgcWnY&quot;&gt;20 hours to be initially good at something&lt;/a&gt; and &lt;a href=&quot;http://www.bbc.com/news/magazine-26384712&quot;&gt;10000 hours to be an expert in any domain&lt;/a&gt;. Be an expert easier and faster!&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;So how do you do it? First of all, you do the same thing for a long time. In fact, that is all you need. The longer version is that you stay good at something because you refresh your knowledge repeatedly. To do this effectively, you should make the &lt;a href=&quot;http://en.wikipedia.org/wiki/Learning_curve&quot;&gt;Learning Curve&lt;/a&gt; work for you and not against you.&lt;/p&gt;

&lt;p&gt;Now, how do you become an expert in anything you want while putting the Learning Curve to work? You need to be exposed to the knowledge at appropriate time intervals. Doing this by yourself is tedious, but what if there were an app that could do it for you? And what if its name was &lt;a href=&quot;http://www.memrise.com/&quot;&gt;Memrise&lt;/a&gt;?&lt;/p&gt;
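
&lt;p&gt;The idea behind such apps is spaced repetition: every successful review pushes the next review further into the future, so you see an item just before you would forget it. Here is a minimal sketch of the scheduling idea (our illustration; Memrise&amp;#39;s actual algorithm is more sophisticated):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object SpacedRepetition {
  // After a successful review, roughly double the time until the next one;
  // a failed review resets the interval back to one day.
  def nextIntervalDays(current: Int, recalled: Boolean): Int =
    if (recalled) current * 2 else 1

  def main(args: Array[String]): Unit = {
    // A card recalled correctly five times in a row:
    val intervals = Iterator.iterate(1)(d =&amp;gt; nextIntervalDays(d, recalled = true)).take(6).toList
    println(intervals) // List(1, 2, 4, 8, 16, 32) days between reviews
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;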

&lt;p&gt;And finally, the most important question: what would this blog post be if no more silly questions were asked?!&lt;/p&gt;

&lt;p&gt;So we have this great tool and we can use it. In IT there are myriads of APIs and other things worth memorizing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For example, learning the Google Guava library is as easy as planting and watering when following the course &lt;a href=&quot;http://www.memrise.com/course/388287/google-guava-library/&quot;&gt;&amp;quot;Google Guava Library&amp;quot;&lt;/a&gt; (when you follow the link, you will know why I have suddenly switched to gardening vocabulary).&lt;/li&gt;
&lt;li&gt;A description of where your files live in a Linux OS is in the course &lt;a href=&quot;http://www.memrise.com/course/388195/linux-file-system-hierarchy/&quot;&gt;&amp;quot;Linux file system hierarchy&amp;quot;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Fancy Vim shortcuts are in &lt;a href=&quot;http://www.memrise.com/course/376522/vi-keyboard/&quot;&gt;&amp;quot;Vi Keyboard&amp;quot;&lt;/a&gt; and &lt;a href=&quot;http://www.memrise.com/course/52903/vim-2/&quot;&gt;&amp;quot;Vim&amp;quot;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a good time to state that I am not an employee of Memrise and I hold no shares in it. Also, although the above courses were mostly created by me, they are free and you can use them to boost your own skills.&lt;/p&gt;

&lt;p&gt;As I have many more courses to mention, I have grouped them into tracks, each of which builds one skill set at a time.&lt;/p&gt;

&lt;h3&gt;Perfect UNIX Administrator&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/50252/shell-fu/&quot;&gt;Shell-fu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/195200/ubuntu-keyboard-shortcuts/&quot;&gt;Ubuntu Keyboard Shortcuts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/52903/vim-2/&quot;&gt;VIM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/376522/vi-keyboard/&quot;&gt;Vi Keyboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/18097/an-introduction-to-regular-expressions/&quot;&gt;An introduction to regular expressions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/388195/linux-file-system-hierarchy/&quot;&gt;Linux file system hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Programming Polyglot&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/311897/frequently-used-r-commands/&quot;&gt;Frequently Used R Commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/58531/r-reference-card/&quot;&gt;R Reference Card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/368089/scala-cheat-sheet/&quot;&gt;Scala Cheat Sheet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/373527/scala-unauthorised-twitter-school/&quot;&gt;Scala [unauthorized] Twitter School&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/388355/apache-pig-built-in-udfs/&quot;&gt;Apache Pig Built-in UDFs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/375026/apache-spark-api-basics/&quot;&gt;Apache Spark API Basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/388287/google-guava-library/&quot;&gt;Google Guava Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/378101/githubs-git-cheat-sheet/&quot;&gt;Github Git Cheat Sheet&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Google Cloud Engine Specialist&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/379024/unauthorised-gcloud-usage/&quot;&gt;[unauthorized] GCloud Usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/379048/unauthorised-bdutil-usage/&quot;&gt;[unauthorized] BDUtil Usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/375026/apache-spark-api-basics/&quot;&gt;Apache Spark API Basics&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Connoisseur of Poland :-)&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/80288/poczet-wadcow-polski/&quot;&gt;Poczet władców Polski&lt;/a&gt; (the rulers of Poland)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/84282/podzia-administracyjny-polski/&quot;&gt;Podział administracyjny Polski&lt;/a&gt; (the administrative divisions of Poland)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/107235/cities-of-poland/&quot;&gt;Miasta Polski&lt;/a&gt; (the cities of Poland)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/197593/dzielnice-warszawy/&quot;&gt;Dzielnice Warszawy&lt;/a&gt; (the districts of Warsaw)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/68562/prawo-jazdy/&quot;&gt;Prawo jazdy&lt;/a&gt; (the driving licence)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/171583/polskie-sowka-ktore-zwieksza-obwod-twojego-bicka/&quot;&gt;Polskie słówka, które zwiększą obwód twojego bicka&lt;/a&gt; (Polish words that will grow your biceps)&lt;/li&gt;
&lt;/ol&gt;
</content>
	</entry>
	
	<entry>
		<title>datadr: split-apply-combine package for R backed by Hadoop</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/09/04/R-datadr.html"/>
		<updated>2014-09-04T17:40:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/09/04/R-datadr</id>
		<content type="html">&lt;p&gt;&lt;code&gt;datadr&lt;/code&gt; is a package for the &lt;a href=&quot;http://www.r-project.org&quot;&gt;R&lt;/a&gt; programming language that provides a functionality of split-apply-combine for data transformation. See the &lt;a href=&quot;http://tesseradata.org/docs-datadr/#quickstart&quot;&gt;Quickstart section&lt;/a&gt; in project&amp;#39;s documentation for a nice overview of package&amp;#39;s capabilities.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The package has three back-ends: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one to process small data on a single processor core, &lt;/li&gt;
&lt;li&gt;a second one to process medium data on many cores, &lt;/li&gt;
&lt;li&gt;and a third one, called &lt;a href=&quot;http://www.datadr.org&quot;&gt;RHIPE&lt;/a&gt;, to process big data on a Hadoop cluster using the MapReduce programming model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The split-apply-combine functionality allows the user to &lt;strong&gt;split&lt;/strong&gt; data into subsets, &lt;strong&gt;apply&lt;/strong&gt; a certain transformation to each of the subsets, and &lt;strong&gt;combine&lt;/strong&gt; the results; its counterpart in SQL is the &lt;code&gt;GROUP BY&lt;/code&gt; clause. This functionality is analogous to that provided by two other popular R libraries: &lt;code&gt;plyr&lt;/code&gt; and &lt;code&gt;dplyr&lt;/code&gt;. The main difference is that &lt;code&gt;datadr&lt;/code&gt; can use a Hadoop cluster and thus process really large amounts of data (see the project&amp;#39;s &lt;a href=&quot;http://tesseradata.org/docs-datadr/#faq&quot;&gt;FAQ&lt;/a&gt; for more details).&lt;/p&gt;
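
&lt;p&gt;For readers unfamiliar with the pattern, here is the shape of split-apply-combine in plain Scala collections (our illustration, not datadr code): group records by a key, apply a computation to each group, and combine the per-group results:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object SplitApplyCombine {
  case class Measurement(station: String, temperature: Double)

  def main(args: Array[String]): Unit = {
    val data = List(
      Measurement(&amp;quot;WAW&amp;quot;, 21.5), Measurement(&amp;quot;WAW&amp;quot;, 19.0),
      Measurement(&amp;quot;KRK&amp;quot;, 23.1), Measurement(&amp;quot;KRK&amp;quot;, 22.4)
    )
    // split: group by station; apply: average each group; combine: collect results
    val means = data
      .groupBy(_.station)
      .map { case (station, ms) =&amp;gt; station -&amp;gt; ms.map(_.temperature).sum / ms.size }
    println(means) // Map(WAW -&amp;gt; 20.25, KRK -&amp;gt; 22.75)
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;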

&lt;p&gt;The &lt;a href=&quot;https://github.com/tesseradata/datadr&quot;&gt;GitHub page of the &lt;code&gt;datadr&lt;/code&gt;&lt;/a&gt; project shows that the first commit was made in March 2013 and that the project has one developer. On the other hand, the &lt;a href=&quot;https://github.com/tesseradata/RHIPE&quot;&gt;GitHub page of the Hadoop back-end RHIPE&lt;/a&gt; shows that the first commit was made in August 2009 and that it has 5 developers.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>CockroachDB: an open source version of Google Spanner</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/07/25/cocroachdb.html"/>
		<updated>2014-07-25T10:20:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/07/25/cocroachdb</id>
		<content type="html">&lt;p&gt;A team of ex-Googlers is building an open source version of Google Spanner, i.e., a transactional database that spans across many data centers.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Spanner_%28database%29&quot;&gt;Google Spanner&lt;/a&gt; is a globally-scalable database system used internally by Google; it is the successor of the &lt;a href=&quot;http://en.wikipedia.org/wiki/BigTable&quot;&gt;BigTable&lt;/a&gt; database. The database supports SQL-like queries, implements transactions, is distributable among many data centers, and is fast. The &lt;a href=&quot;http://research.google.com/archive/spanner.html&quot;&gt;paper describing the technology&lt;/a&gt;, published by a team of Googlers in 2012, caused quite a stir in the community, since it managed to marry features that were thought impossible to combine (e.g., see the &lt;a href=&quot;http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html&quot;&gt;High Scalability blog&lt;/a&gt;, or Doug Cutting&amp;#39;s statement that Spanner-like technology should be the next step in Hadoop&amp;#39;s evolution in &lt;a href=&quot;http://www.cloudera.com/content/cloudera/en/resources/library/aboutcloudera/beyond-batch-the-evolution-of-the-hadoop-ecosystem-doug-cutting-video.html&quot;&gt;the clip where he talks about Hadoop&amp;#39;s evolution&lt;/a&gt;; the interesting part starts at 5:50).&lt;/p&gt;

&lt;p&gt;Now, as reported in a &lt;a href=&quot;http://www.wired.com/2014/07/cockroachdb/&quot;&gt;recent article at Wired&lt;/a&gt;, a team of ex-Googlers is developing an open source version of Google Spanner called &lt;a href=&quot;http://cockroachdb.org/&quot;&gt;CockroachDB&lt;/a&gt;. However, the team does not plan to implement all the features of the original solution in the first version of the project. Currently, their focus is on automatic replication of data between data centers and on keeping the solution independent of any particular distributed file system or system manager. According to the article, the project is in the &amp;quot;alpha&amp;quot; development phase and is &amp;quot;nowhere near ready for use with production services.&amp;quot;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Building Apache Spark App with Maven</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/07/15/building-apache-spark-app.html"/>
		<updated>2014-07-15T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/07/15/building-apache-spark-app</id>
		<content type="html">&lt;p&gt;Recently we&amp;#39;ve been working on building Spark apps with Maven.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;We&amp;#39;ve found these two texts very valuable: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://spark.apache.org/docs/0.9.1/quick-start.html&quot;&gt;http://spark.apache.org/docs/0.9.1/quick-start.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/&quot;&gt;http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, we wanted something more: deployment of a Spark app dependent
on additional libraries. In this note we&amp;#39;ll show you how to do that.
The complete code is available on &lt;a href=&quot;https://github.com/CeON/spark-intro&quot;&gt;GitHub&lt;/a&gt;;
here we&amp;#39;ll just highlight some snippets.&lt;/p&gt;

&lt;p&gt;Two elements are needed. First, appropriate settings in the dependencies
section of the POM file. Note that the Spark and Scala libraries are marked as provided,
since the cluster supplies them at runtime.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.scala-lang&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;scala-library&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.10.3&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;provided&lt;span class=&quot;nt&quot;&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.spark&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spark-core_2.10&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;0.9.1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;provided&lt;span class=&quot;nt&quot;&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;&amp;lt;!-- that&amp;#39;s our additional dependency --&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.google.guava&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;guava&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;14.0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
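
&lt;p&gt;The application code can then use the additional library directly. The actual SimpleApp lives in the repository; below is just a hypothetical minimal sketch, written against the Spark 0.9 API, of an app exercising the extra Guava dependency:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;import org.apache.spark.{SparkConf, SparkContext}
import com.google.common.base.Splitter
import scala.collection.JavaConverters._

object SimpleAppSketch {
  def main(args: Array[String]) {
    // SparkConf picks up the -Dspark.master and -Dspark.jars
    // system properties set by submit.sh
    val sc = new SparkContext(new SparkConf().setAppName(&amp;quot;SimpleAppSketch&amp;quot;))
    val lines = sc.parallelize(Seq(&amp;quot;a,b&amp;quot;, &amp;quot;b,c,a&amp;quot;, &amp;quot;c&amp;quot;))
    // Guava, our additional dependency, does the splitting on the workers
    val words = lines.flatMap(line =&amp;gt; Splitter.on(&amp;#39;,&amp;#39;).split(line).asScala)
    words.countByValue().foreach(println)
    sc.stop()
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;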

&lt;p&gt;Second, the submit.sh script, which makes building the classpath easier.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;#!/bin/bash&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Example usage:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# ./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master=local SimpleApp&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;JAR_FILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;REMAINDER&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;@:&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# If needed, alter these paths to conform to your config&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; /etc/spark/conf/spark-env.sh

&lt;span class=&quot;nv&quot;&gt;SPARK_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/lib/spark&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;SPARK_HOME
&lt;span class=&quot;nv&quot;&gt;HADOOP_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/lib/hadoop&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;HADOOP_HOME

&lt;span class=&quot;c&quot;&gt;# system jars:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/etc/hadoop/conf
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-mapreduce/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-mapreduce/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-yarn/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-yarn/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-hdfs/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-hdfs/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$SPARK_HOME&lt;/span&gt;/assembly/lib/*

&lt;span class=&quot;c&quot;&gt;# app jar:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&amp;quot;$JAR_FILE&amp;quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CONFIG_OPTS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;-Dspark.jars=$JAR_FILE&amp;quot;&lt;/span&gt;
java -cp &lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CONFIG_OPTS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$REMAINDER&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To run everything just type:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;bash&quot;&gt;git clone https://github.com/CeON/spark-intro.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;spark-intro  
mvn clean compile assembly:single
./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;local &lt;/span&gt;SimpleApp&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Happy coding!&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Data science workflow</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/06/13/data-science-workflow.html"/>
		<updated>2014-06-13T16:30:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/06/13/data-science-workflow</id>
		<content type="html">&lt;p&gt;Description of a workflow of a data scientists published on &lt;a href=&quot;http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext&quot;&gt;CACM blog&lt;/a&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Generally, the workflow consists of &lt;strong&gt;4 interconnected phases&lt;/strong&gt; with some sub-steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preparation:

&lt;ul&gt;
&lt;li&gt;Acquire data&lt;/li&gt;
&lt;li&gt;Reformat and clean data&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Analysis:

&lt;ul&gt;
&lt;li&gt;Edit analysis scripts&lt;/li&gt;
&lt;li&gt;Execute scripts&lt;/li&gt;
&lt;li&gt;Inspect outputs&lt;/li&gt;
&lt;li&gt;Debug&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Reflection&lt;/li&gt;
&lt;li&gt;Dissemination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these phases comes with its own challenges. The author has developed prototype tools that are supposed to address them.&lt;/p&gt;

&lt;p&gt;One interesting insight from the blog entry is that &lt;strong&gt;manual data cleaning&lt;/strong&gt; is reported by data scientists as the most tedious and time-consuming part of their workflows. However, the author stresses that this step is also a very important one, since &amp;quot;the chore of data reformatting and cleaning can lend insights into what assumptions are safe to make about the data, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply.&amp;quot;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>FUSE: project for mining game-changing technologies from scientific publications and patents</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/06/12/FUSE_scientific_publications_mining.html"/>
		<updated>2014-06-12T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/06/12/FUSE_scientific_publications_mining</id>
		<content type="html">&lt;p&gt;In &lt;a href=&quot;http://www.nature.com/news/text-mining-offers-clues-to-success-1.15263&quot;&gt;May&amp;#39;s Nature&lt;/a&gt;, there is a column about an interesting text mining project called FUSE. The project is backed by US intelligence agency; its goal is to predict game-changing technologies based on mining of scientific publications and patent applications.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;FUSE is one of the first projects to mine the full texts instead of only abstracts.&lt;/p&gt;

&lt;p&gt;There are 3 teams that take part in the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One that &amp;quot;mines text for keywords, citations and phrases that indicate authors&amp;#39; outlooks in scholarly papers.&amp;quot;&lt;/li&gt;
&lt;li&gt;Another one that extracts &amp;quot;sentiment&amp;quot; in the natural language of papers.&lt;/li&gt;
&lt;li&gt;The third team analyses connections between different topics, keywords and authors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This four-year project started in 2011.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Article: Cloudera Oryx as the next Mahout</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/03/13/cloudera-oryx.html"/>
		<updated>2014-03-13T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/03/13/cloudera-oryx</id>
		<content type="html">&lt;p&gt;Quite interesting &lt;a href=&quot;http://gigaom.com/2014/02/28/cloudera-is-rebuilding-machine-learning-for-hadoop-with-oryx/&quot;&gt;article on Gigaom.com&lt;/a&gt; which says that Cloudera is developing a system called Oryx. The system is aiming to be a better Mahout. &lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The things that are supposed to differentiate it from Mahout are mainly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It will not only provide means for exploratory analysis of data, but also tools for deploying production services that contain models produced by machine learning algorithms.&lt;/li&gt;
&lt;li&gt;It will not only be based on MapReduce, but will use Apache Spark as well (Spark is an increasingly popular Hadoop-like technology).&lt;/li&gt;
&lt;/ol&gt;
</content>
	</entry>
	
	<entry>
		<title>Debugging and the manipulate function in RStudio</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2013/12/30/r-studio-update.html"/>
		<updated>2013-12-30T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2013/12/30/r-studio-update</id>
		<content type="html">&lt;p&gt;An information for R and RStudio enthusiasts about cool new features in the most recent version of RStudio (0.98) which I noticed today.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.rstudio.com/ide/docs/debugging/overview&quot;&gt;Graphical debugging&lt;/a&gt; with the possibility to click, view local variables, etc. (this was not available in the previous version). Finally, a sensible debugger in R!&lt;/li&gt;
&lt;li&gt;The &lt;a href=&quot;http://www.rstudio.com/ide/docs/advanced/manipulate&quot;&gt;function &amp;quot;manipulate&amp;quot;&lt;/a&gt;, which allows you to control an R plot using sliders, buttons, and other widgets.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>12-factor app</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2013/12/17/twelve_factor-app.html"/>
		<updated>2013-12-17T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2013/12/17/twelve_factor-app</id>
		<content type="html">&lt;p&gt;&lt;a href=&quot;http://12factor.net/&quot;&gt;12-factor app&lt;/a&gt; is a manifest or a set of good engineering practices for modern web applications (but not only for them) created by people from Heroku, based on their huge experience.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The most interesting points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;II. Dependencies: Explicitly declare and isolate dependencies&lt;/li&gt;
&lt;li&gt;III. Config: Store config in the environment (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;IV. Backing Services: Treat backing services as attached resources&lt;/li&gt;
&lt;li&gt;V. Build, release, run: Strictly separate build and run stages&lt;/li&gt;
&lt;li&gt;VI. Processes: Execute the app as one or more stateless processes&lt;/li&gt;
&lt;li&gt;VII. Port binding: Export services via port binding&lt;/li&gt;
&lt;li&gt;VIII. Concurrency: Scale out via the process model&lt;/li&gt;
&lt;li&gt;IX. Disposability: Maximize robustness with fast startup and graceful shutdown&lt;/li&gt;
&lt;li&gt;X. Dev/prod parity: Keep development, staging, and production as similar as possible &lt;/li&gt;
&lt;/ul&gt;
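
&lt;p&gt;Factor III is easy to illustrate. Here is a minimal sketch (our example, not from the manifesto) of reading configuration from environment variables instead of hard-coding it or committing config files to the repository:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object AppConfig {
  // Read config from the environment (factor III); fail fast when a
  // required variable is missing, fall back to a default otherwise.
  def required(name: String): String =
    sys.env.getOrElse(name, sys.error(&amp;quot;missing environment variable: &amp;quot; + name))

  def main(args: Array[String]): Unit = {
    val databaseUrl = required(&amp;quot;DATABASE_URL&amp;quot;)   // e.g. set by the platform
    val port = sys.env.getOrElse(&amp;quot;PORT&amp;quot;, &amp;quot;8080&amp;quot;).toInt
    println(s&amp;quot;starting on port $port, using $databaseUrl&amp;quot;)
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;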
</content>
	</entry>
	
	<entry>
		<title>Facebook Presto</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2013/11/08/facebook-presto.html"/>
		<updated>2013-11-08T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2013/11/08/facebook-presto</id>
		<content type="html">&lt;p&gt;Facebook just open sourced its Hadoop solution called &lt;strong&gt;Presto&lt;/strong&gt; for doing SQL queries on Big Data.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Interesting features of this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn&amp;#39;t use the MapReduce paradigm.&lt;/li&gt;
&lt;li&gt;It&amp;#39;s many times faster than Hive: &amp;quot;Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook.&amp;quot;&lt;/li&gt;
&lt;li&gt;Its data sources are not limited to HDFS and HBase; other sources can be used by implementing a certain API for the given data source.&lt;/li&gt;
&lt;li&gt;It seems that the system is already of production quality: &amp;quot;The system is actively used by over a thousand employees, who run more than 30,000 queries processing one petabyte daily.&amp;quot;&lt;/li&gt;
&lt;li&gt;In general, it seems to be a direct counterpart of Google&amp;#39;s Dremel/BigQuery tool, which we discussed at one of our journal club meetings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A general description in &lt;a href=&quot;http://www.computerworld.com/s/article/9243848/Facebook_goes_open_source_with_query_engine_for_big_data&quot;&gt;Computerworld&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A more detailed one on &lt;a href=&quot;https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920&quot;&gt;Facebook&amp;#39;s engineering blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
</feed>
