<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	
	<title type="text" xml:lang="en">Applied Data Analysis Lab | Technology and news from the world</title>
	<link type="application/atom+xml" href="http://adalab.icm.edu.pl/feeds/news-atom.xml" rel="self"/>
 	<link type="text" href="http://paulstamatiou.com" rel="alternate"/>
	<updated>2018-03-02T08:54:49+00:00</updated>
	<id>http://adalab.icm.edu.pl/blog/</id>
	<author>
		<name>ADA Lab, ICM UW</name>
	</author>
	<rights>CC-BY-SA 3.0</rights>
	
	<entry>
		<title>Interview with Michael Jordan about machine learning, big data, and other things</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/10/27/interview-with-michael-jordan.html"/>
		<updated>2014-10-27T19:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/10/27/interview-with-michael-jordan</id>
		<content type="html">&lt;p&gt;Recently, &lt;a href=&quot;http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts&quot;&gt;IEEE Spectrum interviewed&lt;/a&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/Michael_I._Jordan&quot;&gt;Michael Jordan&lt;/a&gt; - a leading researcher in machine learning. He gave his view on hype in machine learning as well as in big data analysis and presented his point of view related to some other interesting issues (technological singularity, P=NP, Turing test).&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Here are some more interesting points about machine learning and big data made by the researcher:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;biological interpretations seem to be overused&lt;/strong&gt; in the field of machine learning. Case in point: the activation function used in the neural network perceptron model is the same function as the one used in the statistical method called logistic regression, which dates back to the 1950s. The latter method has nothing to do with neurons (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Due to the huge amounts of data analyzed, it is &lt;strong&gt;very easy to find spurious dependencies in big-data projects&lt;/strong&gt;. People active in the field are not paying enough attention to this problem. There are statistical methods to deal with it, like familywise error rate tests, but many of them haven&amp;#39;t been studied computationally and it will take decades to get them right.&lt;/li&gt;
&lt;li&gt;Data analysis can deliver inferred data only at a certain level of quality, and we need to be explicit about it. We &lt;strong&gt;need to add error bars to the inferred data that we show&lt;/strong&gt;. This approach is missing from much of the current machine learning literature.&lt;/li&gt;
&lt;li&gt;Because big data analyses often do not present information about the quality of the produced predictions, and, in more general terms, are often not methodologically sound, this might result in a &lt;strong&gt;&amp;quot;big-data winter&amp;quot;&lt;/strong&gt;: a general state of disappointment and lack of funding related to big data after its hype bubble bursts.&lt;/li&gt;
&lt;/ul&gt;
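
&lt;p&gt;To make the first point concrete, here is a minimal sketch (our illustration, not code from the interview) showing that the sigmoid activation of a perceptron-style unit and the inverse link function of logistic regression are literally the same formula:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object LogisticSketch {
  // The logistic function, long used in statistics
  def logistic(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // A one-layer neural unit: a weighted sum passed through a sigmoid activation
  def neuronOutput(w: Array[Double], x: Array[Double]): Double =
    logistic(w.zip(x).map { case (wi, xi) =&amp;gt; wi * xi }.sum)

  // Logistic regression models P(y = 1 | x) with exactly the same expression
  def logisticRegression(beta: Array[Double], x: Array[Double]): Double =
    neuronOutput(beta, x)

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.2)
    val x = Array(1.0, 2.0)
    println(neuronOutput(w, x) == logisticRegression(w, x)) // true; no neurons involved
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;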
</content>
	</entry>
	
	<entry>
		<title>Want to remember the Spark API or learn Scala? Use our courses on memrise.com</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/09/15/memrise.html"/>
		<updated>2014-09-15T15:30:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/09/15/memrise</id>
		<content type="html">&lt;p&gt;You need &lt;a href=&quot;https://www.youtube.com/watch?v=5MgBikgcWnY&quot;&gt;20 hours to be initially good at something&lt;/a&gt; and &lt;a href=&quot;http://www.bbc.com/news/magazine-26384712&quot;&gt;10000 hours to be an expert in any domain&lt;/a&gt;. Be an expert easier and faster!&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;So how do you do it? First of all, you do the same thing for a long time. In fact, that is all you need. The longer version is that you stay good at something because you refresh your knowledge repeatedly. To do this effectively, you should make the &lt;a href=&quot;http://en.wikipedia.org/wiki/Learning_curve&quot;&gt;Learning Curve&lt;/a&gt; work for you and not against you.&lt;/p&gt;

&lt;p&gt;Now, how do you become an expert in anything you want while putting the Learning Curve to work? You need to be exposed to the knowledge at appropriate time intervals. Doing this by yourself is tedious, but what if there were an app that could do it for you? And what if its name was &lt;a href=&quot;http://www.memrise.com/&quot;&gt;Memrise&lt;/a&gt;?&lt;/p&gt;
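
&lt;p&gt;The idea behind such apps is spaced repetition: every successful review pushes the next review further into the future, so you see an item just before you would forget it. Here is a minimal sketch of the scheduling idea (our illustration; Memrise&amp;#39;s actual algorithm is more sophisticated):&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object SpacedRepetition {
  // After a successful review, roughly double the time until the next one;
  // a failed review resets the interval back to one day.
  def nextIntervalDays(current: Int, recalled: Boolean): Int =
    if (recalled) current * 2 else 1

  def main(args: Array[String]): Unit = {
    // A card recalled correctly five times in a row:
    val intervals = Iterator.iterate(1)(d =&amp;gt; nextIntervalDays(d, recalled = true)).take(6).toList
    println(intervals) // List(1, 2, 4, 8, 16, 32) days between reviews
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;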

&lt;p&gt;And finally, the most important question: what would this blog post be if no more silly questions were asked?!&lt;/p&gt;

&lt;p&gt;So we have this great tool and we can use it. In IT there are myriads of APIs and other things worth memorizing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For example, learning the Google Guava library is as easy as planting and watering when following the course &lt;a href=&quot;http://www.memrise.com/course/388287/google-guava-library/&quot;&gt;&amp;quot;Google Guava Library&amp;quot;&lt;/a&gt; (when you follow the link, you will know why I have suddenly switched to gardening vocabulary).&lt;/li&gt;
&lt;li&gt;A description of where your files live in a Linux OS is in the course &lt;a href=&quot;http://www.memrise.com/course/388195/linux-file-system-hierarchy/&quot;&gt;&amp;quot;Linux file system hierarchy&amp;quot;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Fancy Vim shortcuts are in &lt;a href=&quot;http://www.memrise.com/course/376522/vi-keyboard/&quot;&gt;&amp;quot;Vi Keyboard&amp;quot;&lt;/a&gt; and &lt;a href=&quot;http://www.memrise.com/course/52903/vim-2/&quot;&gt;&amp;quot;Vim&amp;quot;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a good time to state that I am not an employee of Memrise and I hold no shares in it. Also, although the above courses were mostly created by me, they are free and you can use them to boost your own skills.&lt;/p&gt;

&lt;p&gt;As I have many more courses to mention, I have grouped them into tracks, each of which builds one skill set at a time.&lt;/p&gt;

&lt;h3&gt;Perfect UNIX Administrator&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/50252/shell-fu/&quot;&gt;Shell-fu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/195200/ubuntu-keyboard-shortcuts/&quot;&gt;Ubuntu Keyboard Shortcuts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/52903/vim-2/&quot;&gt;VIM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/376522/vi-keyboard/&quot;&gt;Vi Keyboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/18097/an-introduction-to-regular-expressions/&quot;&gt;An introduction to regular expressions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/388195/linux-file-system-hierarchy/&quot;&gt;Linux file system hierarchy&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Programming Polyglot&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/311897/frequently-used-r-commands/&quot;&gt;Frequently Used R Commands&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/58531/r-reference-card/&quot;&gt;R Reference Card&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/368089/scala-cheat-sheet/&quot;&gt;Scala Cheat Sheet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/373527/scala-unauthorised-twitter-school/&quot;&gt;Scala [unauthorized] Twitter School&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/388355/apache-pig-built-in-udfs/&quot;&gt;Apache Pig Built-in UDFs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/375026/apache-spark-api-basics/&quot;&gt;Apache Spark API Basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/388287/google-guava-library/&quot;&gt;Google Guava Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/378101/githubs-git-cheat-sheet/&quot;&gt;Github Git Cheat Sheet&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Google Cloud Engine Specialist&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/379024/unauthorised-gcloud-usage/&quot;&gt;[unauthorized] GCloud Usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/379048/unauthorised-bdutil-usage/&quot;&gt;[unauthorized] BDUtil Usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/375026/apache-spark-api-basics/&quot;&gt;Apache Spark API Basics&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Connoisseur of Poland :-)&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/80288/poczet-wadcow-polski/&quot;&gt;Poczet władców Polski&lt;/a&gt; (the rulers of Poland)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/84282/podzia-administracyjny-polski/&quot;&gt;Podział administracyjny Polski&lt;/a&gt; (the administrative divisions of Poland)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/107235/cities-of-poland/&quot;&gt;Miasta Polski&lt;/a&gt; (the cities of Poland)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/197593/dzielnice-warszawy/&quot;&gt;Dzielnice Warszawy&lt;/a&gt; (the districts of Warsaw)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/68562/prawo-jazdy/&quot;&gt;Prawo jazdy&lt;/a&gt; (the driving licence)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.memrise.com/course/171583/polskie-sowka-ktore-zwieksza-obwod-twojego-bicka/&quot;&gt;Polskie słówka, które zwiększą obwód twojego bicka&lt;/a&gt; (Polish words that will grow your biceps)&lt;/li&gt;
&lt;/ol&gt;
</content>
	</entry>
	
	<entry>
		<title>datadr: split-apply-combine package for R backed by Hadoop</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/09/04/R-datadr.html"/>
		<updated>2014-09-04T17:40:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/09/04/R-datadr</id>
		<content type="html">&lt;p&gt;&lt;code&gt;datadr&lt;/code&gt; is a package for the &lt;a href=&quot;http://www.r-project.org&quot;&gt;R&lt;/a&gt; programming language that provides a functionality of split-apply-combine for data transformation. See the &lt;a href=&quot;http://tesseradata.org/docs-datadr/#quickstart&quot;&gt;Quickstart section&lt;/a&gt; in project&amp;#39;s documentation for a nice overview of package&amp;#39;s capabilities.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The package has three back-ends: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one to process small data on a single processor core, &lt;/li&gt;
&lt;li&gt;a second one to process medium data on many cores, &lt;/li&gt;
&lt;li&gt;and a third one, called &lt;a href=&quot;http://www.datadr.org&quot;&gt;RHIPE&lt;/a&gt;, to process big data on a Hadoop cluster using the MapReduce programming model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The split-apply-combine functionality allows the user to &lt;strong&gt;split&lt;/strong&gt; data into subsets, &lt;strong&gt;apply&lt;/strong&gt; a certain transformation to each of the subsets, and &lt;strong&gt;combine&lt;/strong&gt; the results; its counterpart in SQL is the &lt;code&gt;GROUP BY&lt;/code&gt; clause. This functionality is analogous to that provided by two other popular R libraries: &lt;code&gt;plyr&lt;/code&gt; and &lt;code&gt;dplyr&lt;/code&gt;. The main difference is that &lt;code&gt;datadr&lt;/code&gt; can use a Hadoop cluster and thus process really large amounts of data (see the project&amp;#39;s &lt;a href=&quot;http://tesseradata.org/docs-datadr/#faq&quot;&gt;FAQ&lt;/a&gt; for more details).&lt;/p&gt;
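
&lt;p&gt;For readers unfamiliar with the pattern, here is the shape of split-apply-combine in plain Scala collections (our illustration, not datadr code): group records by a key, apply a computation to each group, and combine the per-group results:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object SplitApplyCombine {
  case class Measurement(station: String, temperature: Double)

  def main(args: Array[String]): Unit = {
    val data = List(
      Measurement(&amp;quot;WAW&amp;quot;, 21.5), Measurement(&amp;quot;WAW&amp;quot;, 19.0),
      Measurement(&amp;quot;KRK&amp;quot;, 23.1), Measurement(&amp;quot;KRK&amp;quot;, 22.4)
    )
    // split: group by station; apply: average each group; combine: collect results
    val means = data
      .groupBy(_.station)
      .map { case (station, ms) =&amp;gt; station -&amp;gt; ms.map(_.temperature).sum / ms.size }
    println(means) // Map(WAW -&amp;gt; 20.25, KRK -&amp;gt; 22.75)
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;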

&lt;p&gt;The &lt;a href=&quot;https://github.com/tesseradata/datadr&quot;&gt;GitHub page of the &lt;code&gt;datadr&lt;/code&gt;&lt;/a&gt; project shows that the first commit was made in March 2013 and that the project has one developer. On the other hand, the &lt;a href=&quot;https://github.com/tesseradata/RHIPE&quot;&gt;GitHub page of the Hadoop back-end RHIPE&lt;/a&gt; shows that the first commit was made in August 2009 and that it has 5 developers.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>CockroachDB: an open source version of Google Spanner</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/07/25/cocroachdb.html"/>
		<updated>2014-07-25T10:20:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/07/25/cocroachdb</id>
		<content type="html">&lt;p&gt;A team of ex-Googlers is building an open source version of Google Spanner, i.e., a transactional database that spans across many data centers.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Spanner_%28database%29&quot;&gt;Google Spanner&lt;/a&gt; is a globally-scalable database system used internally by Google; it is the successor of the &lt;a href=&quot;http://en.wikipedia.org/wiki/BigTable&quot;&gt;BigTable&lt;/a&gt; database. The database supports SQL-like queries, implements transactions, is distributable among many data centers, and is fast. The &lt;a href=&quot;http://research.google.com/archive/spanner.html&quot;&gt;paper describing the technology&lt;/a&gt;, published by a team of Googlers in 2012, caused quite a stir in the community, since it managed to marry features that were thought impossible to combine (e.g., see the &lt;a href=&quot;http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html&quot;&gt;High Scalability blog&lt;/a&gt;, or Doug Cutting&amp;#39;s statement that Spanner-like technology should be the next step in Hadoop&amp;#39;s evolution in &lt;a href=&quot;http://www.cloudera.com/content/cloudera/en/resources/library/aboutcloudera/beyond-batch-the-evolution-of-the-hadoop-ecosystem-doug-cutting-video.html&quot;&gt;the clip where he talks about Hadoop&amp;#39;s evolution&lt;/a&gt;; the interesting part starts at 5:50).&lt;/p&gt;

&lt;p&gt;Now, as reported in a &lt;a href=&quot;http://www.wired.com/2014/07/cockroachdb/&quot;&gt;recent article at Wired&lt;/a&gt;, a team of ex-Googlers is developing an open source version of Google Spanner called &lt;a href=&quot;http://cockroachdb.org/&quot;&gt;CockroachDB&lt;/a&gt;. However, the team does not plan to implement all the features of the original solution in the first version of the project. Currently, their focus is on automatic replication of data between data centers and on keeping the solution independent of any particular distributed file system or system manager. According to the article, the project is in the &amp;quot;alpha&amp;quot; development phase and is &amp;quot;nowhere near ready for use with production services.&amp;quot;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Building Apache Spark App with Maven</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/07/15/building-apache-spark-app.html"/>
		<updated>2014-07-15T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/07/15/building-apache-spark-app</id>
		<content type="html">&lt;p&gt;Recently we&amp;#39;ve been working on building Spark apps with Maven.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;We&amp;#39;ve found these two texts very valuable: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://spark.apache.org/docs/0.9.1/quick-start.html&quot;&gt;http://spark.apache.org/docs/0.9.1/quick-start.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/&quot;&gt;http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, we wanted something more: deployment of a Spark app dependent
on additional libraries. In this note we&amp;#39;ll show you how to do that.
The complete code is available on &lt;a href=&quot;https://github.com/CeON/spark-intro&quot;&gt;GitHub&lt;/a&gt;;
here we&amp;#39;ll just highlight some snippets.&lt;/p&gt;

&lt;p&gt;Two elements are needed. First, appropriate settings in the dependencies
section of the POM file. Note that the Spark and Scala libraries are marked as provided,
since the cluster supplies them at runtime.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;xml&quot;&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.scala-lang&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;scala-library&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.10.3&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;provided&lt;span class=&quot;nt&quot;&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.apache.spark&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spark-core_2.10&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;0.9.1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;scope&amp;gt;&lt;/span&gt;provided&lt;span class=&quot;nt&quot;&gt;&amp;lt;/scope&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;c&quot;&gt;&amp;lt;!-- that&amp;#39;s our additional dependency --&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;com.google.guava&lt;span class=&quot;nt&quot;&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;guava&lt;span class=&quot;nt&quot;&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;14.0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
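
&lt;p&gt;The application code can then use the additional library directly. The actual SimpleApp lives in the repository; below is just a hypothetical minimal sketch, written against the Spark 0.9 API, of an app exercising the extra Guava dependency:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;import org.apache.spark.{SparkConf, SparkContext}
import com.google.common.base.Splitter
import scala.collection.JavaConverters._

object SimpleAppSketch {
  def main(args: Array[String]) {
    // SparkConf picks up the -Dspark.master and -Dspark.jars
    // system properties set by submit.sh
    val sc = new SparkContext(new SparkConf().setAppName(&amp;quot;SimpleAppSketch&amp;quot;))
    val lines = sc.parallelize(Seq(&amp;quot;a,b&amp;quot;, &amp;quot;b,c,a&amp;quot;, &amp;quot;c&amp;quot;))
    // Guava, our additional dependency, does the splitting on the workers
    val words = lines.flatMap(line =&amp;gt; Splitter.on(&amp;#39;,&amp;#39;).split(line).asScala)
    words.countByValue().foreach(println)
    sc.stop()
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;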

&lt;p&gt;Second, the submit.sh script, which makes building the classpath easier.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;#!/bin/bash&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Example usage:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# ./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master=local SimpleApp&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;JAR_FILE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;REMAINDER&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;@:&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# If needed, alter these paths to conform to your config&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;source&lt;/span&gt; /etc/spark/conf/spark-env.sh

&lt;span class=&quot;nv&quot;&gt;SPARK_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/lib/spark&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;SPARK_HOME
&lt;span class=&quot;nv&quot;&gt;HADOOP_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/lib/hadoop&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;HADOOP_HOME

&lt;span class=&quot;c&quot;&gt;# system jars:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/etc/hadoop/conf
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-mapreduce/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-mapreduce/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-yarn/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-yarn/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-hdfs/*:&lt;span class=&quot;nv&quot;&gt;$HADOOP_HOME&lt;/span&gt;/../hadoop-hdfs/lib/*
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;nv&quot;&gt;$SPARK_HOME&lt;/span&gt;/assembly/lib/*

&lt;span class=&quot;c&quot;&gt;# app jar:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLASSPATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&amp;quot;$JAR_FILE&amp;quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CONFIG_OPTS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;-Dspark.jars=$JAR_FILE&amp;quot;&lt;/span&gt;
java -cp &lt;span class=&quot;nv&quot;&gt;$CLASSPATH&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CONFIG_OPTS&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$REMAINDER&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To run everything just type:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;bash&quot;&gt;git clone https://github.com/CeON/spark-intro.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;spark-intro  
mvn clean compile assembly:single
./submit.sh target/spark-intro-1.0-SNAPSHOT-jar-with-dependencies.jar -Dspark.master&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;local &lt;/span&gt;SimpleApp&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Happy coding!&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Data science workflow</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/06/13/data-science-workflow.html"/>
		<updated>2014-06-13T16:30:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/06/13/data-science-workflow</id>
		<content type="html">&lt;p&gt;Description of a workflow of a data scientists published on &lt;a href=&quot;http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext&quot;&gt;CACM blog&lt;/a&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Generally, the workflow consists of &lt;strong&gt;4 interconnected phases&lt;/strong&gt; with some sub-steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preparation:

&lt;ul&gt;
&lt;li&gt;Acquire data&lt;/li&gt;
&lt;li&gt;Reformat and clean data&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Analysis:

&lt;ul&gt;
&lt;li&gt;Edit analysis scripts&lt;/li&gt;
&lt;li&gt;Execute scripts&lt;/li&gt;
&lt;li&gt;Inspect outputs&lt;/li&gt;
&lt;li&gt;Debug&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Reflection&lt;/li&gt;
&lt;li&gt;Dissemination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these phases comes with its own challenges. The author has developed prototype tools that are supposed to address them.&lt;/p&gt;

&lt;p&gt;One interesting insight from the blog entry is that &lt;strong&gt;manual data cleaning&lt;/strong&gt; is reported by data scientists as the most tedious and time-consuming part of their workflows. However, the author stresses that this step is also a very important one, since &amp;quot;the chore of data reformatting and cleaning can lend insights into what assumptions are safe to make about the data, what idiosyncrasies exist in the collection process, and what models and analyses are appropriate to apply.&amp;quot;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>FUSE: project for mining game-changing technologies from scientific publications and patents</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/06/12/FUSE_scientific_publications_mining.html"/>
		<updated>2014-06-12T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/06/12/FUSE_scientific_publications_mining</id>
		<content type="html">&lt;p&gt;In &lt;a href=&quot;http://www.nature.com/news/text-mining-offers-clues-to-success-1.15263&quot;&gt;May&amp;#39;s Nature&lt;/a&gt;, there is a column about an interesting text mining project called FUSE. The project is backed by US intelligence agency; its goal is to predict game-changing technologies based on mining of scientific publications and patent applications.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;FUSE is one of the first projects to mine the full texts instead of only abstracts.&lt;/p&gt;

&lt;p&gt;There are 3 teams that take part in the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One that &amp;quot;mines text for keywords, citations and phrases that indicate authors&amp;#39; outlooks in scholarly papers.&amp;quot;&lt;/li&gt;
&lt;li&gt;Another one that extracts &amp;quot;sentiment&amp;quot; in the natural language of papers.&lt;/li&gt;
&lt;li&gt;The third team analyses connections between different topics, keywords and authors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This four-year project started in 2011.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Article: Cloudera Oryx as the next Mahout</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2014/03/13/cloudera-oryx.html"/>
		<updated>2014-03-13T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2014/03/13/cloudera-oryx</id>
		<content type="html">&lt;p&gt;Quite interesting &lt;a href=&quot;http://gigaom.com/2014/02/28/cloudera-is-rebuilding-machine-learning-for-hadoop-with-oryx/&quot;&gt;article on Gigaom.com&lt;/a&gt; which says that Cloudera is developing a system called Oryx. The system is aiming to be a better Mahout. &lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The things that are supposed to differentiate it from Mahout are mainly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It will not only provide means for exploratory analysis of data, but also tools for deploying production services that contain models produced by machine learning algorithms.&lt;/li&gt;
&lt;li&gt;It will not only be based on MapReduce, but will use Apache Spark as well (Spark is an increasingly popular Hadoop-like technology).&lt;/li&gt;
&lt;/ol&gt;
</content>
	</entry>
	
	<entry>
		<title>Debugging and the manipulate function in RStudio</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2013/12/30/r-studio-update.html"/>
		<updated>2013-12-30T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2013/12/30/r-studio-update</id>
		<content type="html">&lt;p&gt;An information for R and RStudio enthusiasts about cool new features in the most recent version of RStudio (0.98) which I noticed today.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.rstudio.com/ide/docs/debugging/overview&quot;&gt;Graphical debugging&lt;/a&gt; with the possibility to click, view local variables, etc. (this was not available in the previous version). Finally, a sensible debugger in R!&lt;/li&gt;
&lt;li&gt;The &lt;a href=&quot;http://www.rstudio.com/ide/docs/advanced/manipulate&quot;&gt;function &amp;quot;manipulate&amp;quot;&lt;/a&gt;, which allows you to control an R plot using sliders, buttons, and other widgets.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>12-factor app</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2013/12/17/twelve_factor-app.html"/>
		<updated>2013-12-17T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2013/12/17/twelve_factor-app</id>
		<content type="html">&lt;p&gt;&lt;a href=&quot;http://12factor.net/&quot;&gt;12-factor app&lt;/a&gt; is a manifest or a set of good engineering practices for modern web applications (but not only for them) created by people from Heroku, based on their huge experience.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The most interesting points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;II. Dependencies: Explicitly declare and isolate dependencies&lt;/li&gt;
&lt;li&gt;III. Config: Store config in the environment (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;IV. Backing Services: Treat backing services as attached resources&lt;/li&gt;
&lt;li&gt;V. Build, release, run: Strictly separate build and run stages&lt;/li&gt;
&lt;li&gt;VI. Processes: Execute the app as one or more stateless processes&lt;/li&gt;
&lt;li&gt;VII. Port binding: Export services via port binding&lt;/li&gt;
&lt;li&gt;VIII. Concurrency: Scale out via the process model&lt;/li&gt;
&lt;li&gt;IX. Disposability: Maximize robustness with fast startup and graceful shutdown&lt;/li&gt;
&lt;li&gt;X. Dev/prod parity: Keep development, staging, and production as similar as possible &lt;/li&gt;
&lt;/ul&gt;
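
&lt;p&gt;Factor III is easy to illustrate. Here is a minimal sketch (our example, not from the manifesto) of reading configuration from environment variables instead of hard-coding it or committing config files to the repository:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object AppConfig {
  // Read config from the environment (factor III); fail fast when a
  // required variable is missing, fall back to a default otherwise.
  def required(name: String): String =
    sys.env.getOrElse(name, sys.error(&amp;quot;missing environment variable: &amp;quot; + name))

  def main(args: Array[String]): Unit = {
    val databaseUrl = required(&amp;quot;DATABASE_URL&amp;quot;)   // e.g. set by the platform
    val port = sys.env.getOrElse(&amp;quot;PORT&amp;quot;, &amp;quot;8080&amp;quot;).toInt
    println(s&amp;quot;starting on port $port, using $databaseUrl&amp;quot;)
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;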
</content>
	</entry>
	
	<entry>
		<title>Facebook Presto</title>
		<link href="http://adalab.icm.edu.pl/blog/technology/2013/11/08/facebook-presto.html"/>
		<updated>2013-11-08T12:00:00+00:00</updated>
		<id>http://adalab.icm.edu.pl/blog/technology/2013/11/08/facebook-presto</id>
		<content type="html">&lt;p&gt;Facebook just open sourced its Hadoop solution called &lt;strong&gt;Presto&lt;/strong&gt; for doing SQL queries on Big Data.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Interesting features of this system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn&amp;#39;t use the MapReduce paradigm.&lt;/li&gt;
&lt;li&gt;It&amp;#39;s many times faster than Hive: &amp;quot;Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook.&amp;quot;&lt;/li&gt;
&lt;li&gt;Its data sources are not limited to HDFS and HBase; other sources can be used by implementing a certain API for the given data source.&lt;/li&gt;
&lt;li&gt;It seems that the system is already of production quality: &amp;quot;The system is actively used by over a thousand employees, who run more than 30,000 queries processing one petabyte daily.&amp;quot;&lt;/li&gt;
&lt;li&gt;In general, it seems to be a direct counterpart of Google&amp;#39;s Dremel/BigQuery tool, which we discussed at one of our journal club meetings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A general description in &lt;a href=&quot;http://www.computerworld.com/s/article/9243848/Facebook_goes_open_source_with_query_engine_for_big_data&quot;&gt;Computerworld&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A more detailed one on &lt;a href=&quot;https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920&quot;&gt;Facebook&amp;#39;s engineering blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
</feed>
