Thursday, March 19, 2015

Why is R awesome?

For some people R is magic, others hate its notation (maybe you!), and it has some weird rules. Maybe it is just a programming language like the others (that is also my opinion). Like other programming languages, R has its good and bad properties, but I would say it is the best candidate for the toolbox of a statistician or a researcher who works on data analysis.

In this blog post, I collect 8 (from 0 to 7) nice properties of R. As a lecturer and researcher, I have found that many students understand statistical concepts better when I demonstrate them, and let the students work on them, with Monte Carlo simulations. In R, we can write compact code to demonstrate these concepts, which would be difficult to implement in another programming language. R is not a simple toy, so we can always extend our knowledge and programming skills, and write better code by bringing in external code written in "real" programming languages (the old joke that real men use C).


So, why is R awesome?



0. Syntax of Algol Family

R has a weird assignment operator, but the rest of its syntax is similar to Algol family languages such as C, C++, Java and C#. R also has a facility similar to operator overloading (it is not exactly operator overloading): a single character or a sequence of characters wrapped in percent signs can be used as a function name and then called as an infix operator, like this:


> '%_%' <- function(a,b){
+    return(exp(a+b))
+ }
> 5 %_% 2
[1] 1096.633
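
As a small additional sketch of the same idea (the operator name `%paste%` is made up for this illustration), any name wrapped in percent signs can become an infix operator, and the built-in operators are ordinary functions that can be called by name:

```r
# A user-defined infix operator: any name wrapped in percent signs works.
# `%paste%` is a hypothetical name chosen only for this sketch.
`%paste%` <- function(a, b) paste0(a, b)
greeting <- "Hello, " %paste% "R"
print(greeting)

# Built-in operators are ordinary functions and can be called by name.
seven <- `+`(5, 2)
print(seven)
```

This is why the feature is only "similar" to operator overloading: a new operator is just a function with a special name.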


1. Vectors are primitive data types

In the other members of the Algol family, vectors are written with an opening and a closing bracket: they are arrays of primitives in C/C++ and objects in Java. In contrast, vectors are primitives in R, and binary operators are directly applicable to vectors and matrices. For example, estimating the least squares coefficients is a single-line expression in R:


> assign("x",cbind(1,1:30))
> assign("y",3+3*x[,2]+rnorm(30))
> solve(t(x) %*% x) %*% t(x) %*% y
         [,1]
[1,] 2.858916
[2,] 3.003787
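
As a quick sanity check, the matrix expression can be compared with R's built-in lm(). This is only a sketch: the fixed seed and the variable names beta_manual and beta_lm are my own choices for the illustration.

```r
set.seed(1)                          # fixed seed, only for reproducibility
x <- cbind(1, 1:30)                  # design matrix with an intercept column
y <- 3 + 3 * x[, 2] + rnorm(30)

# Normal equations, the same expression as in the one-liner above
beta_manual <- solve(t(x) %*% x) %*% t(x) %*% y

# Same estimate from lm(); "- 1" drops lm's own intercept because
# x already contains a column of ones
beta_lm <- coef(lm(y ~ x - 1))

print(cbind(manual = as.vector(beta_manual), lm = unname(beta_lm)))
```

Both columns should agree up to numerical precision.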

This example shows the differences between a scalar and a vector:


> assign("x", c(1,2,3))
> assign("a", 5)
> typeof(x)
[1] "double"
> typeof(a)
[1] "double"
> class(x)
[1] "numeric"
> class(a)
[1] "numeric"

No difference!
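
A few more checks make the same point. This is a small sketch with helper variable names of my own: a "scalar" in R is simply a vector of length one.

```r
x <- c(1, 2, 3)
a <- 5

len_a  <- length(a)       # a "scalar" is simply a vector of length one
is_vec <- is.vector(a)    # and it passes the vector test
first  <- a[1]            # so it can be indexed like any other vector
scaled <- x * a           # the length-one vector is recycled across x

print(len_a)
print(is_vec)
print(first)
print(scaled)
```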


2. Theorems come alive in minutes

Suppose that X is a random variable that follows an exponential distribution with rate = 5.
The sum or mean of a random sample of size N approximately follows a normal distribution when N is large. This is the Central Limit Theorem explained with an example. Theorems are theorems, but you may want to see a quick demonstration (something like a proof, for educational purposes only) and try to write a rapid application. Writing code like this takes minutes if you use R.


> assign("nsamp", 5000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
> 
> for (i in 1:nsamp){
+     sums[i] <- sum(rexp(n = n, rate = theta)) 
+ }
> hist(sums)
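
The same simulation can be sketched more compactly with replicate(), and the result can be compared with what the theory predicts: a sum of n exponential variables with rate theta has mean n / theta and standard deviation sqrt(n) / theta. The fixed seed and the names expected_mean and expected_sd are my own additions.

```r
set.seed(42)                         # fixed seed, only for reproducibility
nsamp <- 5000
n     <- 100
theta <- 5.0

# replicate() repeats the expression nsamp times and collects the results
sums <- replicate(nsamp, sum(rexp(n = n, rate = theta)))

expected_mean <- n / theta           # mean of a sum of n Exp(theta) draws
expected_sd   <- sqrt(n) / theta     # standard deviation of that sum
print(c(mean = mean(sums), sd = sd(sums)))

# Histogram on the density scale with the limiting normal curve on top
hist(sums, freq = FALSE)
curve(dnorm(x, mean = expected_mean, sd = expected_sd), add = TRUE)
```

The simulated mean and standard deviation should land close to 20 and 2.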




3. There is always a second plan for faster code

Now suppose that we draw 50,000 random samples using the code above. What would the computation time be?


> assign("nsamp", 50000)
> assign("n", 100)
> assign("theta", 5.0)
> assign("sums", rep(0,nsamp))
> 
> s <- system.time(
+     for (i in 1:nsamp){
+         sums[i] <- sum(rexp(n = n, rate = theta)) 
+     }
+ )
> 
> print(s)
   user  system elapsed 
  0.582   0.000   0.572 




Drawing 50,000 samples of size 100 takes 0.582 seconds. Is that fast enough? Let's try to write it in C++!


#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
NumericVector CalculateRandomSums(int m, int n) {
   NumericVector result(m);
   int i;
   for (i = 0; i < m; i++){
     result[i] = sum(rexp(n, 5.0));
   }
   return(result);
}


After compiling the code with Rcpp, we can call the function CalculateRandomSums() from R.
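
One possible way to do the compilation step is sketched below. It assumes that the Rcpp package and a working C++ compiler are installed, and it passes the source as a string to Rcpp::sourceCpp(); saving the code in a file and passing the file name would work as well. The variable name cpp_source is my own.

```r
# Assumption: the Rcpp package and a C++ compiler are available.
cpp_source <- '
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector CalculateRandomSums(int m, int n) {
   NumericVector result(m);
   for (int i = 0; i < m; i++) {
     result[i] = sum(rexp(n, 5.0));
   }
   return result;
}
'

if (requireNamespace("Rcpp", quietly = TRUE)) {
    # Compiles the code and makes the exported function available in R
    Rcpp::sourceCpp(code = cpp_source)
    vect <- CalculateRandomSums(50000, 100)
    print(length(vect))
}
```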


> s <- system.time(
+     vect <- CalculateRandomSums(50000,100)
+ )
> print(s)
   user  system elapsed 
  0.185   0.000   0.184 

Now our R code is about 3.15 times slower than the code written in C++ (0.582 / 0.185 = 3.145946).
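
Before reaching for C++, there is often another plan inside R itself: replacing the loop with vectorized calls. The sketch below is my own variation, not a benchmark. It draws all the numbers at once and sums them row by row; timings will differ from machine to machine, so none are quoted here.

```r
nsamp <- 50000
n     <- 100
theta <- 5.0

s <- system.time({
    # One call draws all nsamp * n values; each row is one sample of size n
    draws <- matrix(rexp(nsamp * n, rate = theta), nrow = nsamp)
    sums  <- rowSums(draws)
})
print(s)
```

The price is memory: the whole nsamp by n matrix is held at once, while the loop needs only one sample at a time.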


4. Interaction with C/C++/Fortran is enjoyable

Since a huge amount of R is written in C, migrating old C libraries is easy: write wrapper functions that use the SEXP data type. Rcpp hides these routines in a clever way. Fortran code is also linkable. Interaction with other languages makes it possible to use old libraries in R and to write faster new libraries. It is also possible to create instances of R in C and C++ applications.
For an enjoyable example, have a look at section 3, "There is always a second plan for faster code".
The R package eive includes a small portion of C++ code and is a compact example of calling C++ functions from within R. Accessing C++ objects from R is also possible thanks to Rcpp. Click here to see the explanation and an example.


5. Interaction with Java

Calling Java from R (rJava) and calling R from Java (JRI, RCaller) are both possible. Renjin takes a different approach, as it is an R interpreter written in Java (another way of calling R from Java, huh?). A detailed comparison of these methods is given in this documentation and this.


6. Sophisticated variable scoping

In R, functions have their own variable scopes, and accessing variables at the top level is possible. In addition, variable scoping is handled by list-like objects called environments, and user-defined environments can be created anywhere in the code. For detailed information, visit Environment in R.
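
A minimal sketch of both ideas follows. The names counter_env, increment, total and add_to_total are made up for the illustration.

```r
# A user-created environment that holds one variable
counter_env <- new.env()
assign("count", 0, envir = counter_env)

increment <- function() {
    # Modifies the variable inside counter_env, not a local copy
    counter_env$count <- counter_env$count + 1
}
increment()
increment()
print(get("count", envir = counter_env))

# Accessing a variable outside the function's own scope
total <- 0
add_to_total <- function(value) {
    total <<- total + value          # `<<-` assigns in the enclosing scope
}
add_to_total(10)
print(total)
```

Unlike ordinary lists, environments are not copied when they are modified inside a function, which is why the counter keeps its value between calls.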


7. Optional Object Oriented Programming (O-OOP) 

R functions take their arguments by value rather than by address. If a vector of size 100,000 is passed to a function, the function works on its own copy of the vector (R delays the actual copy until the function modifies the vector). After the body of the function has run, the copy is marked as free for later garbage collection. As C/C++ programmers know, passing objects by address rather than by value is a good way to use less memory and spend less computation time. Instances of reference classes in R are passed to functions by address, in a way similar to passing C++ references and Java objects to functions and methods:



Person <- setRefClass(
    Class = "Person",
    fields = c("name","surname","email"),
    methods = list(
        initialize = function(name, surname, email){
            .self$name <- name
            .self$surname <- surname
            .self$email <- email
        },
        
        setName = function(name){
            .self$name <- name
        },
        
        setSurname = function(surname){
            .self$surname <- surname
        },
        
        setEMail = function (email){
            .self$email <- email
        },
        
        toString = function (){
            return(paste(name, " ", surname, " ", email))
        }   
    ) # End of methods
) # End of class



p <- Person$new("John","Brown","brown@server.org")
print(p$toString())

The output is

[1] "John   Brown   brown@server.org"

Java and C++ programmers probably like this notation!
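
The difference between the two ways of passing arguments can be sketched as follows. The Counter class and the helper functions set_first and bump are hypothetical and exist only for this illustration.

```r
library(methods)   # setRefClass() lives in the methods package

# Value semantics: the function changes only its own copy of the vector
set_first <- function(v) {
    v[1] <- 99
    v
}
original <- c(1, 2, 3)
modified <- set_first(original)
print(original)    # the caller's vector is unchanged
print(modified)

# Reference semantics: the function changes the caller's object
Counter <- setRefClass("Counter", fields = list(value = "numeric"))
bump <- function(counter) {
    counter$value <- counter$value + 1
    invisible(NULL)
}
cnt <- Counter$new(value = 0)
bump(cnt)
print(cnt$value)   # the change made inside bump() is visible here
```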


Have a nice read!

