datadr is a package for the R programming language that provides split-apply-combine functionality for data transformation. See the Quickstart section of the project's documentation for an overview of the package's capabilities.
The package has three back-ends:
- one that processes small data on a single processor core,
- a second that processes medium data on many cores,
- and a third, called RHIPE, that processes big data on a Hadoop cluster using the MapReduce programming model.
The split-apply-combine functionality allows the user to split data into subsets, apply a transformation to each subset, and combine the results; its counterpart in SQL is the GROUP BY clause. It is analogous to the functionality provided by two other popular R packages, plyr and dplyr. The main difference is that datadr can use a Hadoop cluster and thus process very large amounts of data (see the project's FAQ for more details).
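The split-apply-combine pattern itself can be illustrated with a minimal sketch in base R. This sketch deliberately does not use datadr's own API. It assumes the built-in iris data set, and the variable names (subsets, means, result) are chosen here for illustration only.

```r
# Split: one data frame per species (iris ships with base R).
subsets <- split(iris, iris$Species)

# Apply: compute the mean sepal length within each subset.
means <- lapply(subsets, function(df) {
  data.frame(
    Species = as.character(df$Species[1]),
    mean_sepal_length = mean(df$Sepal.Length)
  )
})

# Combine: stack the per-subset results into a single data frame.
result <- do.call(rbind, means)
rownames(result) <- NULL

print(result)
```

The result has one row per species, much like the output of SELECT Species, AVG(Sepal.Length) ... GROUP BY Species in SQL. Packages such as plyr, dplyr, and datadr wrap these three steps in their own interfaces.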
The GitHub page of the datadr project shows that the first commit was made in March 2013 and that the project has one developer. By contrast, the GitHub page of RHIPE, the Hadoop back-end, shows that its first commit was made in August 2009 and that it has five developers.