datadr is a package for the R programming language that provides split-apply-combine functionality for data transformation. See the Quickstart section of the project's documentation for an overview of the package's capabilities.
The package has three back-ends:
- one to process small data on a single processor core,
- a second to process medium data on many cores,
- and a third, called RHIPE, to process big data on a Hadoop cluster using the MapReduce programming model.
The split-apply-combine functionality lets the user split data into subsets, apply a transformation to each subset, and combine the results; its counterpart in SQL is GROUP BY. It is analogous to the functionality provided by other popular R libraries such as dplyr. The main difference is that datadr can use a Hadoop cluster and thus process very large amounts of data (see the project's FAQ for more details).
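The split-apply-combine pattern itself can be sketched in a few lines of base R. This is only an illustration of the idea, not datadr's own API: the `measurements` data frame and its `group` and `value` columns are made up for the example, and the three steps use the standard functions `split()`, `lapply()`, and `rbind()`.

```r
# Hypothetical toy data: one measurement per row, tagged with a group label.
measurements <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Split: one data frame per distinct value of `group`.
subsets <- split(measurements, measurements$group)

# Apply: summarise each subset independently of the others.
summaries <- lapply(subsets, function(subset) {
  data.frame(
    group = subset$group[1],
    n = nrow(subset),
    mean_value = mean(subset$value)
  )
})

# Combine: stack the per-subset results into a single data frame.
result <- do.call(rbind, summaries)
print(result)
```

Because each subset is processed independently in the apply step, the same computation can in principle be spread over several cores or cluster nodes, which is what makes the pattern suitable for the larger back-ends listed above. In SQL terms, the sketch corresponds roughly to a count and an average computed per group with GROUP BY.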
The GitHub page of the datadr project shows that the first commit was made in March 2013 and that the project has one developer. By contrast, the GitHub page of RHIPE, the Hadoop back-end, shows that its first commit was made in August 2009 and that it has 5 developers.