“R” you into statistics?

“R” is a disruptive open source analytics technology with a rapidly growing user community. It is a powerful and extensible object-oriented statistical programming language, built to support rapid development of computational analytics and data visualisation. It can be, and is, easily extended (for example with a new interpretative engine), and the benefits of these extensions are shared with the user community. There are now thousands of R analytics packages available to download, contributed by users in a growing academic and business community. Several large software packages (such as Oracle, Greenplum and SAS) have integrations with the R language, or have released R language support. R has a vast array of standard graphical formats that can be reused, as well as good functionality for producing bespoke plots and graphical formats.

While the product is not new and has a large following, especially in the analyst communities, general awareness of it is still low. Early adopters gain a very significant lowering of TCO compared with full SAS licensing and implementation. As with any open source product, some risks remain (documentation, support and protection of content rights, to name a few), and the ease of learning and cross-training into R is reasonable, but not trivial. However, depending on an organisation’s maturity curve, “R” should be on the evaluation list for analytics and data mining.

Introduction

“R” is a software tool that, in its simplest definition, is designed to help find patterns or exceptions in data. “R” is an interactive tool that allows the end user to “steer” the outcomes and continually improve the model run on the data until a statistically significant result can be deduced. As such it allows the use of many statistical and interpretative methods. These can be extended even further through add-ons (object-oriented extensions), which can even be written from scratch should the required method not be available. What makes “R” so different from other tools in this area is that:

a) It is open source, which even allows modification of the program code itself.

b) It is free to download and use.

c) It is extremely object oriented, which makes it flexible to extend.

d) It has over 2 million users.

As such, this technology is a very disruptive market play, as traditionally these capabilities were only available to businesses at a premium price.

“R” is similar to S, the statistical language developed at Bell Laboratories, yet its architecture is influenced by Scheme, a powerful dialect of Lisp. R holds particular appeal for statisticians because it contains a number of built-in mechanisms for organizing data, running calculations, and creating graphical representations of data sets. The open source community for R has over 2 million users. The current R is the result of a collaborative effort with contributions from all over the world. R was initially written by Robert Gentleman and Ross Ihaka (also known as “R & R”) of the Statistics Department of the University of Auckland. Since mid-1997 there has been a core group with write access to the R source.

R is underpinned by the R Foundation which, like Apache, GNOME and other open source software foundations, aims to support the continued development of R, the exploration of new methodology, and teaching and training, to organise meetings and conferences, and most importantly to provide routes to sufficient funding to keep the project going.

What is R?

R is an object-oriented, interpreted programming language. It was built to support rapid development of computational analytics and data visualisation of the results.
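
As a rough illustration of what object orientation looks like in practice, here is a minimal sketch using S3, one of the object systems available in R. The class name `reading` and the constructor `new_reading` are hypothetical names invented for this example; `summary()` and `print()` are existing generic functions that the sketch extends.

```r
# Constructor for a hypothetical S3 class "reading" (illustrative name only).
new_reading <- function(values, unit) {
  stopifnot(is.numeric(values), is.character(unit), length(unit) == 1)
  structure(list(values = values, unit = unit), class = "reading")
}

# Method for the existing generic summary(); R dispatches on the class attribute.
summary.reading <- function(object, ...) {
  sprintf("%d readings, mean %.2f %s",
          length(object$values), mean(object$values), object$unit)
}

# Method for print(), so the object displays sensibly at the console.
print.reading <- function(x, ...) {
  cat(summary(x), "\n")
  invisible(x)
}

temps <- new_reading(c(20, 22, 24), unit = "C")
msg <- summary(temps)   # calls summary.reading()
print(temps)            # calls print.reading()
```

The point of the design is that existing functions such as `summary()` do not need to be modified: a new method is simply added for the new class.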

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has

  • an effective data handling and storage facility,
  • a suite of operators for calculations on arrays, in particular matrices,
  • a large, coherent, integrated collection of intermediate tools for data analysis,
  • graphical facilities for data analysis and display either directly at the computer or on hardcopy,
  • a well developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most of the system supplied functions are themselves written in the S language.)
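
To make the list above a little more concrete, the following sketch touches a few of the facilities mentioned: matrix operators, a conditional, a loop and a user-defined recursive function. The names `A`, `b`, `factorial_rec` and `total` are illustrative only.

```r
# Matrix arithmetic: matrix() fills column by column, solve() solves A %*% x = b.
A <- matrix(c(2, 0, 1, 3), nrow = 2)
b <- c(4, 6)
x <- solve(A, b)
check <- as.vector(A %*% x)   # multiplying back should reproduce b

# A user-defined recursive function built around a conditional.
factorial_rec <- function(n) {
  if (n <= 1) 1 else n * factorial_rec(n - 1)
}

# A loop accumulating the sum of squares of 1..5.
total <- 0
for (i in 1:5) {
  total <- total + i^2
}
```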

R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
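
A short-lived, interactive analysis of that kind might look like the sketch below. It assumes only the `mtcars` data set that ships with R; the choice of variables (`mpg`, `wt`, `hp`) is arbitrary and made purely for illustration.

```r
# mtcars is a built-in data set (datasets package) describing 32 cars.
fit1 <- lm(mpg ~ wt, data = mtcars)

# Refine the model by adding a predictor, reusing the first fit.
fit2 <- update(fit1, . ~ . + hp)

# Compare the two nested models with an F-test.
comparison <- anova(fit1, fit2)
p_value <- comparison[["Pr(>F)"]][2]
print(comparison)
```

Each step returns an object that can be inspected, plotted or passed to the next step, which is what makes this "steering" style of analysis practical.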

The R base package has a number of built-in statistical and analytical capabilities, which are extended via “packages” deployed through a library site (http://CRAN.R-project.org and others).

This approach is one of the main advantages of programming in R: it is a thoroughly object-oriented language in which data processing functions and algorithms can be created, managed and shared easily between users. Software packages are centrally released and deployed from a hub called “CRAN”, in a style similar to Perl’s CPAN, and installation is simple and effective, incorporating features such as dependency checking, differential upgrades and the like.

The R language itself is clean, expressive and powerful once you become familiar with how to read it, and how to manipulate data with it (see example below). Learning R is not hard, but does take some time due to the vast libraries of functions available. There are now thousands of R analytics packages available to download, contributed by users in a growing academic and business community. Examples can be found here: http://cran.r-project.org/web/packages/
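
As a sketch of how a package is typically obtained from CRAN, the helper below installs a package only if it is missing and then loads it. The function name `use_package` and the choice of mirror URL are assumptions made for this example; `requireNamespace()`, `install.packages()` and `library()` are standard R functions.

```r
# Hypothetical helper: load a package, installing it from CRAN first if missing.
use_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # By default install.packages() also installs the packages pkg depends on.
    install.packages(pkg, repos = repos)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" is part of every R installation, so nothing is downloaded here.
loaded <- use_package("stats")
```

Installed packages can later be brought up to date with `update.packages()`.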

Technology

The following diagram outlines the core components of “R” and illustrates its basic logical architecture.

Example usage of “R”

“R” is surprisingly widespread, both in the fields it is applied to and in the organisations that use it. Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group and Shell all have some aspect of their business relying on people using “R”.

“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”

Close to 1,600 different packages reside on just one of the many Web sites devoted to R, and the number of packages has grown exponentially. One package, called BiodiversityR, offers a graphical interface aimed at making calculations of environmental trends easier.

Another package, called Emu, analyzes speech patterns, while GenABEL is used to study the human genome. The financial services community has demonstrated a particular affinity for R; dozens of packages exist for derivatives analysis alone. “The great beauty of R is that you can modify it to do all sorts of things,” said Hal Varian, chief economist at Google. “And you have a lot of prepackaged stuff that’s already available, so you’re standing on the shoulders of giants.”

Packages Interoperability

Several large software packages in general use by our clients have integrations with the R language, or have released R language support. Below I have listed some of these and included a link referencing the integration.

  • Statistica (StatSoft) also has R language support

Native Interoperability

The R language itself offers many native bindings and integrations of its own. Of note:

  • RExcel
    • R can be fully integrated with Excel via the free RExcel module. In effect Excel becomes a front-end GUI for R.
    • Note that this interoperability is achieved through the RCOM module, which provides a full COM interface to the R environment.
    • http://rcom.univie.ac.at/
  • RMySQL

R analytics is almost at the point where R based “Analytics Factories” could be put together. The developments below are moving in that direction:

  • PMML Export support
    • R can export its predictive models to PMML (Predictive Model Markup Language), which can then be deployed onto the Amazon compute cloud through ADAPA by Zementis.
  • Rattle
    • A data mining GUI built over the R language to facilitate building data mining algorithms, which are then easily exportable to PMML, for example for large-scale deployment in the cloud. Rattle supports PMML export for:
      • Association Model
      • Clustering Model
      • Neural Network
      • SVM Model, SVM Binary Model
      • SVM Multinomial Model
      • Tree Model
      • Tree Random Survival Classification

Graphical Capabilities

R has a vast array of standard graphical formats that can be reused, as well as good functionality for the production of bespoke plot and graphical format types. To the left are just some of the thumbnails from the “R Gallery”, a repository highlighting the various graphical methods available to the R community. http://addictedtor.free.fr/graphiques/thumbs.php

Each of the thumbnails on the site links to the functions or source on which the capability depends. Below is an example of a bespoke plot. Notice how lattices can be constructed easily allowing several plot types to be co-presented cleanly.

Figure panels: Progression plot, 3D plot, CoPlot, Confidence interval plot.
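
A rough sketch of how several plot types can be presented together on one page, using only base R graphics and the built-in `mtcars` data set. The helper name `draw_panel` and the output file name are hypothetical; this is not the code behind the gallery figures.

```r
# Hypothetical helper: draw four different plot types on one page of a PDF.
draw_panel <- function(file) {
  pdf(file, width = 8, height = 8)
  on.exit(dev.off())                 # always close the graphics device
  par(mfrow = c(2, 2))               # arrange plots in a 2 x 2 grid
  plot(mpg ~ wt, data = mtcars, main = "Scatter plot")
  hist(mtcars$mpg, main = "Histogram", xlab = "mpg")
  boxplot(mpg ~ cyl, data = mtcars, main = "Box plot")
  barplot(table(mtcars$cyl), main = "Bar plot", xlab = "cyl")
  invisible(file)
}

out <- draw_panel(file.path(tempdir(), "panel.pdf"))
```

Writing to a file device such as `pdf()` keeps the example independent of any screen; at an interactive console the same plotting calls draw straight to the display.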

Assessment

  • Early adopters will be ahead of the curve, benefiting from a much improved TCO over high-end packages like SAS.
  • The R language requires learning, like any programming language, and building a competent R development team would require a significant investment in people. It would not, however, require software licence costs, so in many ways this is still an attractive proposition.
  • There is still a lack of documentation and best practices around the technical architecture for running R in lights-out production environments.
  • Architecturally, “R” is still immature (for example, running only in single-thread mode on Windows).
  • The ease of learning and cross-training into R is reasonable, but not trivial.
  • In terms of direction, R is moving towards mainstream enterprise use:
    • Major vendors are providing mechanisms for working with R code
    • Professional grade IDE tools are in development by various parties
    • Commercial ventures and start-ups are now offering enterprise support for R
    • The R community at large is working on scale-out of R to cloud environments
    • R graphical capabilities are being integrated into web services to support dashboards and visualisation tools
  • A comprehensive list of SAS and SPSS functionality mapped to the equivalent R functions is available at the following link: http://rforsasandspssusers.com/ . It is worth noting that R additionally includes a large number of modules that have no equivalent in SAS or SPSS.
  • The attitude of the “R” community to commercial usage is still immature, in line with many open source projects, and requires compensatory action before business use is possible. An example of this is the lack of clarity on change procedures and support for older versions.

Example R code

The Quantile Regression diagram above was hand coded, and below is the final function that creates it:

### Here are the actual calculations and graph plotting #######################
# nlrq() comes from the quantreg package. SSfuzremOrig2(), rqFreqPlot() and the
# objects agesize, agecurves, ages and freqs are assumed to be defined earlier
# in the original script; they are not shown here.
library(quantreg)
# Quantile regressions for tau=0.05, 0.50 & 0.95 with the chosen growth model
rq.05 <- nlrq(size ~ SSfuzremOrig2(age, Asym, lrc1, lrc2, c0),
    data = agesize, tau = 0.05, trace = FALSE)
rq.50 <- nlrq(size ~ SSfuzremOrig2(age, Asym, lrc1, lrc2, c0),
    data = agesize, tau = 0.50, trace = FALSE)
rq.95 <- nlrq(size ~ SSfuzremOrig2(age, Asym, lrc1, lrc2, c0),
    data = agesize, tau = 0.95, trace = FALSE)
# Predict sizes with these three growth models
c.05 <- predict(rq.05, newdata = list(age = agecurves))
c.50 <- predict(rq.50, newdata = list(age = agecurves))
c.95 <- predict(rq.95, newdata = list(age = agecurves))
curves <- data.frame(c.05, c.50, c.95)
# Plot the graph (histograms with time, curves for the three quantile
# regressions and residuals as boxplots)
rqFreqPlot(ages, freqs, agecurves, curves, barscale = .35, barcol = "gray90",
    boxwex = .15, xlim = c(0, 7), ylim = c(0,67), main = "",
    xlab = expression(paste(italic("t"), " (years)")),
    ylab1 = expression(paste(italic("D"), " (mm)")),
    ylab2 = expression(paste(Delta, italic("D"), " (mm)")), las = 1, lty = 1)
title("Sea urchin growth modeled using quantile regression")

Useful links and further reading

  • Official R website: http://www.r-project.org/
  • R Foundation Statutes: http://www.r-project.org/foundation/Rfoundation-statutes.pdf
  • Commercialised R: http://www.revolutionanalytics.com/
  • Visualisation examples for R: http://addictedtor.free.fr/graphiques/
  • Introduction to R programming: http://en.wikibooks.org/wiki/R_Programming
  • Statistics: An Introduction Using R by Michael J. Crawley (Paperback – 11 Mar 2005)
  • The R Book by Michael J. Crawley (Hardcover – 20 April 2007)
  • Introductory Statistics with R (Statistics and Computing) by Peter Dalgaard (Paperback – 1 Sep 2008)
  • Comprehensive reading list: http://www.r-project.org/doc/bib/R-books.html

Sources:

R Project website
New York Times, by Ashlee Vance, published Jan 6, 2009
