datadr: split-apply-combine package for R backed by Hadoop

04 September 2014 by Mateusz Kobos

datadr is a package for the R programming language that provides split-apply-combine functionality for data transformation. See the Quickstart section of the project's documentation for a good overview of the package's capabilities.

The package has three back-ends:

The split-apply-combine functionality lets the user split data into subsets, apply a transformation to each subset, and combine the results; its counterpart in SQL is GROUP BY. It is analogous to the functionality provided by two other popular R libraries: plyr and dplyr. The main difference is that datadr can run on a Hadoop cluster and thus process very large amounts of data (see the project's FAQ for more details).
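To make the split-apply-combine idea concrete, here is a minimal, language-neutral sketch in plain Python. It is not datadr's API and does not touch Hadoop; the sample records and the `split_apply_combine` helper are hypothetical, invented only for illustration. It computes the per-group mean, which is what `SELECT region, AVG(amount) FROM sales GROUP BY region` would do in SQL.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: (region, amount) records.
sales = [
    ("east", 10),
    ("east", 20),
    ("west", 5),
    ("north", 7),
    ("north", 9),
    ("north", 11),
]


def split_apply_combine(records, apply_fn):
    """Group records by key, apply a function to each group, collect results."""
    # Split: bucket the values by their key.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)

    # Apply: run apply_fn on each subset independently.
    # Combine: gather the per-subset results into one dictionary.
    return {key: apply_fn(values) for key, values in groups.items()}


result = split_apply_combine(sales, mean)
print(result)
```

Because each subset is processed independently in the apply step, the work can in principle be distributed across machines, which is the property a cluster back-end relies on.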

The GitHub page of the datadr project shows that the first commit was made in March 2013 and that the project has one developer. On the other hand, the GitHub page of RHIPE, the Hadoop back-end, shows that its first commit was made in August 2009 and that it has 5 developers.
