R is among the top five data science tools in use today, according to O'Reilly research; the latest KDnuggets survey puts it in first place, and IEEE Spectrum ranks it as the fifth most popular programming language.
The latest Rexer Data Miner survey revealed that over the past eight years there has been a three-fold increase in the number of respondents using R, and a seven-fold increase in the number of analysts/scientists who say R is their primary tool.
Despite its popularity, the main drawback of vanilla R is its inherently "single threaded" nature and its need to fit all the data being processed in RAM. But nowadays, data sets are typically in the range of GBs, and they are growing quickly toward TBs. In short, current growth in data volume and variety demands more efficient tools from data scientists.
Every data science analysis starts with preparing, cleaning, and transforming the raw input data into some tabular data that can be further used in machine learning models.
In the particular case of R, data size problems usually arise when the input data do not fit in the RAM of the machine, and when data analysis takes a long time because parallelism does not happen automatically. Without making the data smaller (through sampling, for example), this problem can be solved in two different ways:
While the first approach is obvious and can use the same code to deal with different data sizes, it can only scale to the memory limits of the machine being used. The second approach, by contrast, is more powerful but it is also more difficult to set up and adapt to existing legacy code.
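On a single machine, the memory ceiling of the first approach can be pushed back by streaming the data in chunks from disk instead of loading it all at once. A minimal base-R sketch of the idea (the file and column name are made up for illustration):

```r
# Scale-up sketch: aggregate a CSV larger than RAM by streaming fixed-size chunks.
path <- tempfile(fileext = ".csv")                 # stand-in for a large file
write.csv(data.frame(value = 1:1000), path, row.names = FALSE)

chunk_rows <- 300
con <- file(path, open = "r")
invisible(readLines(con, n = 1))                   # consume the header line
total <- 0
count <- 0
repeat {
  chunk <- read.csv(con, header = FALSE, nrows = chunk_rows,
                    col.names = "value")
  total <- total + sum(chunk$value)                # update running statistics
  count <- count + nrow(chunk)
  if (nrow(chunk) < chunk_rows) break              # last (short) chunk was read
}
close(con)
mean_value <- total / count                        # 500.5, same as the full-data mean
```

Only one chunk is ever resident in memory, so the same code handles files far larger than RAM; the trade-off is that every statistic must be expressible as an incremental update.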
There is a third approach: scaling out horizontally by using R as an interface to the most popular distributed paradigms:
Novel distributed platforms also combine batch and stream processing, providing a SQL-like expression language; Apache Flink is one example. There are also higher-level abstractions for defining data processing pipelines, such as the recently open-sourced Apache Beam project from Google. However, these newer projects are still under development and so far do not include R support.
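Whatever the engine, these distributed paradigms share the same map-reduce shape: partition the data, compute partial results in parallel, and combine them. A minimal local sketch using base R's `parallel` package, with two local workers standing in for cluster nodes:

```r
# Map-reduce in miniature: partition, map in parallel, reduce partial results.
library(parallel)

x <- as.numeric(1:1e5)
parts <- split(x, cut(seq_along(x), 4, labels = FALSE))   # partition into 4 pieces

cl <- makeCluster(2)                    # 2 local workers stand in for cluster nodes
partial <- parLapply(cl, parts,         # "map": per-partition sum and count
                     function(p) c(sum = sum(p), n = length(p)))
stopCluster(cl)

combined <- Reduce(`+`, partial)        # "reduce": add up the partial statistics
global_mean <- unname(combined["sum"] / combined["n"])    # 50000.5
```

The key design point is that workers return small summaries (sum and count) rather than data, so the combine step is cheap; real frameworks apply exactly this discipline across machines.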
After the data preparation step, the next common data science phase consists of training machine learning models, which can also be performed on a single machine or distributed among different machines. In the case of distributed machine learning frameworks, the most popular approaches using R are the following:
Sidestepping the coding and customization issues of these approaches, you can seek out a commercial solution that uses R to access data on the front-end but uses its own big-data-native processing under the hood.
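Under the hood, many of these frameworks distribute model training by computing partial results on each data partition and combining them. For ordinary least squares this is exact: each partition's X'X and X'y matrices simply add up. A local base-R sketch with simulated data (the variable names and coefficients are invented for illustration):

```r
# OLS from per-partition sufficient statistics: X'X and X'y add up exactly,
# which is why linear models distribute so well. Simulated data, 4 partitions.
set.seed(42)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(n, sd = 0.1)

parts <- split(d, rep(1:4, length.out = n))     # partitions stand in for nodes

stats <- lapply(parts, function(p) {            # "map": per-partition X'X, X'y
  X <- cbind(1, p$x1, p$x2)                     # design matrix with intercept
  list(xtx = crossprod(X), xty = crossprod(X, p$y))
})

xtx <- Reduce(`+`, lapply(stats, `[[`, "xtx"))  # "reduce": combine statistics
xty <- Reduce(`+`, lapply(stats, `[[`, "xty"))
beta <- drop(solve(xtx, xty))                   # matches coef(lm(y ~ x1 + x2, d))
```

Models without such additive sufficient statistics (e.g., many iterative learners) need repeated map-reduce rounds instead, which is where frameworks like Spark's MLlib earn their keep.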
Teradata has also released an open source CRAN package called toaster that allows users to compute, analyze, and visualize data with (on top of) the Teradata Aster database. It performs computations inside Aster, taking advantage of Aster's distributed and parallel engines, and then creates visualizations of the results directly in R. For example, it lets users run K-means clustering or execute several cross-validation iterations of a linear regression model in parallel.
Also related is MADlib, an open source library for scalable in-database analytics, currently incubating at Apache. There are other open source CRAN packages for dealing with big data, such as biglm, bigpca, biganalytics, bigmemory, or pbdR, but they are focused on specific issues rather than addressing the data science pipeline in general.
Big data analysis presents many opportunities to extract hidden patterns, provided you use the right algorithms and an underlying technology that helps gather insights. Connecting new scales of data with familiar tools is a challenge, but tools like Aster R offer a way to combine the elegance of the R language with a distributed environment, allowing data to be processed at scale.