Notes on R

Installing R

R can be downloaded from http://cran.r-project.org/ and installed (on Windows) by double-clicking the .EXE installer.

Using R behind a proxy server

R has facilities for grabbing and updating packages from external sources, such as Bioconductor. If you're stuck behind a firewall or a proxy, there are options for allowing R to access the internet. Two examples:
  1. Create a shortcut to R and modify it by adding the argument: --internet2
    e.g. in the target field: "C:\Program Files\R\rw1091\bin\Rgui.exe" --internet2
  2. Same as above, but hand it specific parameters for http_proxy and http_proxy_user
    e.g.: "C:\Program Files\R\rw1091\bin\Rgui.exe" http_proxy=http://proxyname:8080/ http_proxy_user=ask
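The same proxy variable can also be set from inside R. This is only a sketch: the proxy address is the placeholder from the shortcut example above, and whether the setting takes effect may depend on the download method in use and on it being set before R first touches the network.

```r
# Placeholder proxy address, copied from the shortcut example above.
Sys.setenv(http_proxy = "http://proxyname:8080/")

# Confirm what the current R session sees.
proxy <- Sys.getenv("http_proxy")
print(proxy)
```

Putting the same name=value pair in a .Renviron file is another option, since R reads that file at startup.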

Using R with a database such as MySQL

I'd like to use R on my PC with some MySQL databases I have on various local Linux boxes. I don't have MySQL installed on my PC, so R will have to supply all the correct libraries. RMySQL requires DBI. Install DBI from the R menu for installing packages from CRAN.
Note: while upgrading to R v2.0.1 I was unable to find the RMySQL package for Windows on the menu of packages at CRAN. Searching CRAN explicitly, I found the package source as a .tar.gz file, but there was no Windows binary in .zip format as there is for other packages. I finally found the Windows binary at: http://stat.bell-labs.com/RS-DBI/download/ where I was able to download the package and install it as a local zip file.
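The menu steps above can also be done from the R prompt. The following is a minimal sketch, not the method used in these notes: ensure_package is a hypothetical helper name, and the zip path in the comment is a placeholder for wherever the downloaded file was saved.

```r
# Hypothetical helper: install a package from CRAN only if it is missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs internet access (see the proxy notes above)
  }
  requireNamespace(pkg, quietly = TRUE)   # TRUE if the package can now be loaded
}

# Usage (not run here):
# ensure_package("DBI")
# A binary downloaded by hand can be installed from a local file (placeholder path):
# install.packages("C:/path/to/RMySQL.zip", repos = NULL)
```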

Loading the RMySQL library gives the following error:

> library(RMySQL)
Error in dyn.load(x, as.logical(local), as.logical(now)) : 
        unable to load shared library "C:/PROGRA~1/R/rw1091/library/RMySQL/libs/RMySQL.dll":
  LoadLibrary failure:  The specified module could not be found.
Error in library(RMySQL) : .First.lib failed

The RMySQL DLLs are not being found. I know they exist (libmySQL.dll, RMySQL.dll) because a search reveals that they are located at: C:\Program Files\R\rw1091\library\RMySQL\libs
There are at least three ways to solve this problem:

Now loading the RMySQL library works, but returns a warning message:

> library(RMySQL)
Warning message: 
DLL attempted to change FPU control word from 8001f to 9001f 
> 
This warning appears to be of no consequence.

Let's try a sample session of loading the library, connecting to a database, executing a query, and viewing the results:

> library(RMySQL)
> mycon <- dbConnect(MySQL(), user='cws', dbname="cws", host="pi", password='delores')
> rs <- dbSendQuery(mycon, "SELECT slide_name FROM arrays limit 5")
> data1 <- fetch(rs, n = -1)
> data1
  slide_name
1        DM1-110
2        DM1-111
3        DM1-112
4        DM1-113
5        DM1-114
(n = -1 tells fetch to return all remaining rows of the result)
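The session above leaves the result set and the connection open. As a hedged sketch, the same query can be wrapped so that the connection is always closed afterwards. run_query is a hypothetical helper name; dbGetQuery and dbDisconnect are DBI functions, and dbGetQuery sends the query, fetches all rows, and clears the result in one call.

```r
# Hypothetical helper: run one query and always release the connection.
run_query <- function(sql, ...) {
  con <- DBI::dbConnect(RMySQL::MySQL(), ...)   # same arguments as dbConnect above
  on.exit(DBI::dbDisconnect(con), add = TRUE)   # close the connection on exit
  DBI::dbGetQuery(con, sql)                     # returns a data frame
}

# Usage (needs a reachable MySQL server, so not run here):
# data1 <- run_query("SELECT slide_name FROM arrays LIMIT 5",
#                    user = "cws", dbname = "cws", host = "pi", password = "...")
```

When working step by step as in the session above, the equivalent cleanup would be dbClearResult(rs) followed by dbDisconnect(mycon).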


Simple plot examples in R (and plot symbol summary)

Using R to plot a matrix of numbers as an image


Using Index Vectors

Problem: You have a vector of numbers and you want to extract the subset that meets a certain criterion.

Solution: Apply a logical condition to the vector and use the result as an index vector on the vector itself. It sounds confusing, but the condition is simply evaluated for every element, yielding a logical vector of the same length. Indexing with that logical vector keeps the elements where the condition is TRUE and drops those where it is FALSE.

Thus for a vector of numbers:
> # a skewed distribution of numbers
> a <- c(0,0,1,2,2,3,4,5,6,6,7,7,7,8,9,500,567,566,1000)
> # select only those in the range of interest - say numbers less than 10
> b <- a[ a < 10 ]
> # examine the results
> b
 [1] 0 0 1 2 2 3 4 5 6 6 7 7 7 8 9
> # another condition could be specified with the "&" symbol
> b <- a[ a < 10 & a > 0]
> b
 [1] 1 2 2 3 4 5 6 6 7 7 7 8 9
> # plot a histogram
> hist(b, freq=FALSE, main="Histogram")
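To make the index vector itself visible, the condition can be stored in a variable before it is applied. This is a small sketch using the same vector as above; keep and pos are names chosen here for illustration.

```r
a <- c(0,0,1,2,2,3,4,5,6,6,7,7,7,8,9,500,567,566,1000)

# The condition yields one TRUE/FALSE value per element of a.
keep <- a < 10 & a > 0

# which() converts the logical vector into the positions that are TRUE.
pos <- which(keep)

# Indexing with either form selects the same elements.
b <- a[keep]
print(keep)
print(pos)
print(b)
```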


Sorting a matrix based on a particular column

Generate a 4 x 5 matrix of random numbers:
# declare a 4 x 5 matrix, and fill it with 20 samplings 
# from a series of integers between 1 and 100.

> x <- array(sample(1:100,20), dim=c(4,5))
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]   77   95   35   29   79
[2,]   55   15   99    7   88
[3,]   50   71   64   91   34
[4,]   49   31   66   19   28

# We can use the order function to sort the matrix
# based on a particular column. For example x[order(x[ ,2]), ] 
# will sort based on the 2nd column:

> x[order(x[ ,2]), ]
     [,1] [,2] [,3] [,4] [,5]
[1,]   55   15   99    7   88
[2,]   49   31   66   19   28
[3,]   50   71   64   91   34
[4,]   77   95   35   29   79

# return just the top 2 rows
> x[order(x[ ,2])[1:2], ]
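Two variations on the same idea, sketched with the matrix values printed above entered by hand so the result is reproducible (sample() gives different numbers on every run): sorting in decreasing order, and keeping a single row as a matrix.

```r
# The matrix shown above, typed in column by column.
x <- matrix(c(77,55,50,49, 95,15,71,31, 35,99,64,66, 29,7,91,19, 79,88,34,28),
            nrow = 4)

# Sort on the 2nd column, largest value first.
desc <- x[order(x[, 2], decreasing = TRUE), ]

# Selecting one row normally drops to a plain vector;
# drop = FALSE keeps the result as a 1 x 5 matrix.
top1 <- x[order(x[, 2])[1], , drop = FALSE]

print(desc)
print(top1)
```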

Tying data together into a matrix or data frame

cbind or data.frame? A matrix is all one type, whereas a data frame can contain mixed types.
age  <- c( 20,25,30,50,20,30)
sex  <- c("M","F","F","M","M","F")
hgt  <- c(1.80,1.65,1.70,1.75,1.85,1.68)
obs  <- c("first","second","third","fourth","fifth","sixth")
base <- cbind(age=age,sex=sex,hgt=hgt,obs=obs)

> base
     age  sex hgt    obs
[1,] "20" "M" "1.8"  "first"
[2,] "25" "F" "1.65" "second"
[3,] "30" "F" "1.7"  "third"
[4,] "50" "M" "1.75" "fourth"
[5,] "20" "M" "1.85" "fifth"
[6,] "30" "F" "1.68" "sixth"
In binding variables together, they form a matrix and all elements of a matrix must have the same type. So all columns become character variables. A better approach would be
 base <- data.frame(age, sex, hgt, obs)
except that, in R versions before 4.0.0, this turns character columns such as "obs" into factors. Use I() to treat obs "as is":
base<-data.frame(age, sex, hgt, obs=I(obs))

> base
  age sex  hgt    obs
1  20   M 1.80  first
2  25   F 1.65 second
3  30   F 1.70  third
4  50   M 1.75 fourth
5  20   M 1.85  fifth
6  30   F 1.68  sixth

(based on an R-Help posting from 2000)
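As a quick check on the difference, the column types of the data frame can be inspected directly. This sketch is not part of the posting cited above; it passes stringsAsFactors = FALSE explicitly so that the result is the same on old and new versions of R.

```r
age <- c(20,25,30,50,20,30)
sex <- c("M","F","F","M","M","F")
hgt <- c(1.80,1.65,1.70,1.75,1.85,1.68)
obs <- c("first","second","third","fourth","fifth","sixth")

# Keep character columns as character regardless of the R version.
base <- data.frame(age, sex, hgt, obs, stringsAsFactors = FALSE)

# Each column keeps its own type, unlike the cbind() matrix.
types <- sapply(base, class)
print(types)

# Numeric columns can be used directly, e.g. to select rows.
females <- base[base$sex == "F", ]
print(females)
```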


My other R pages
Chris Seidel