30 April 2015
"Education is not the filling of a pail, but the lighting of a fire"
W.B. Yeats
(t.test, anova, regression)
(random forests, clustering, machine learning)
It is used widely in academia and industry
It is currently one of the most popular analytical software tools among professional data scientists (academic and commercial)
C++ etc.)
Simple arithmetic
5 * 5
[1] 25
Advanced calculator
\[I_n(Q;J=j) = -p_{j}\log_{e}p_{j} + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log_{e}p_{ij}\]
In[j] <- -p[j] * log(p[j]) + sum((p[i, j] / K) * log(p[i, j]))
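As a rough worked example of the formula above, the sketch below evaluates it for one value of j. The frequency matrix `p_ij` is made up for illustration, and taking `p_j` as the column mean of `p_ij` is an assumption, since the slides do not define it.

```r
# Hypothetical data: K = 3 rows (index i) and 2 columns (index j)
p_ij <- matrix(c(0.5, 0.5,
                 0.2, 0.8,
                 0.9, 0.1), nrow = 3, byrow = TRUE)

K   <- nrow(p_ij)        # number of terms in the sum
p_j <- colMeans(p_ij)    # ASSUMPTION: p_j is the mean of p_ij over i

j <- 1
# log() is the natural logarithm (base e) in R
I_n <- -p_j[j] * log(p_j[j]) + sum((p_ij[, j] / K) * log(p_ij[, j]))
I_n
```

Storing `p_j` and `p_ij` as separate objects keeps the two symbols of the formula distinct, which is easier to check than indexing a single object in two different ways.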
shiny package
Choosing a good integrated development environment (IDE) is important for keeping projects convenient and efficient (e.g. the analysis for paper x)
They also provide some niceties such as syntax highlighting, code completion, and code diagnostics (RStudio only)
Although RStudio has minimised the practical importance of the working directory, it is still essential to understand it
The working directory is the place where R looks for files that you ask to read, and the place where results will be written
The working directory can be determined using:
getwd()
To manually specify a new working directory, use:
setwd("path/to/new/folder")
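A minimal sketch of how the working directory affects reading and writing files. The file name `example.csv` is hypothetical, and `tempdir()` is used only as a safe folder to experiment in.

```r
old_wd <- getwd()        # remember the current working directory
setwd(tempdir())         # move to a temporary folder for this example

# Relative paths are resolved against the working directory
write.csv(data.frame(a = 1:3), "example.csv", row.names = FALSE)
file.exists("example.csv")
dat_in <- read.csv("example.csv")

unlink("example.csv")    # tidy up the example file
setwd(old_wd)            # restore the original working directory
```

Saving and restoring the old directory avoids leaving the session pointing at an unexpected folder.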
Although the best place to find help when using R is online, there is also a more formal built-in help system
All functions have a dedicated help file that can be accessed as follows:
?function_name # or help(function_name)
Provided you have the name correct, this will open a help file
We will see some examples of these files later
Packages can be installed using:
install.packages("package_name")
library(devtools)
install_github("username/repo")
5 + 5
[1] 10
log10(10)
[1] 1
variable <- value
value -> variable
variable = value
"<-" since it is pretty explicit
x <- 10
x
[1] 10
y <- x
y
[1] 10
^ or **    Powers: 2^10 == 2**10
* and /    Multiplication and division
+ and -    Addition and subtraction
%/%        Integer division
%*%        Conformable matrix multiplication
etc.
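The operators listed above can be tried directly at the console. The short sketch below also uses `%%` (remainder), which is not in the list but is a standard R operator; the values are arbitrary.

```r
pow_same <- 2^10 == 2**10     # TRUE: both spellings give 1024
int_div  <- 7 %/% 2           # integer division: 3
rem      <- 7 %% 2            # remainder: 1

no_brackets   <- 2 + 3 * 4    # 14: multiplication is done before addition
with_brackets <- (2 + 3) * 4  # 20: brackets change the order

m <- matrix(1:4, nrow = 2)    # columns are (1, 2) and (3, 4)
mat_prod  <- m %*% m          # matrix multiplication
elem_prod <- m * m            # element-wise multiplication, a different result
```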
Common built-in mathematical functions
sin() | cos() | tan() | log() | log10()
sqrt() | sum() | floor() | ceiling()
round() | abs() | acos() | atan() | factorial()
Example
sum(1,3)
[1] 4
sum(1,3) == 1 + 3
[1] TRUE
5 * "a"
Error in 5 * "a": non-numeric argument to binary operator
typeof(5.3)
[1] "double"
typeof("ABC")
[1] "character"
typeof(TRUE)
[1] "logical"
is.character(4)
[1] FALSE
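The type checks above pair naturally with explicit conversion. A small sketch, using arbitrary values, of how the `as.*` functions move a value between types so that arithmetic like the failing `5 * "a"` case can work when the text really is a number:

```r
int_type <- typeof(5L)         # "integer": the L suffix requests an integer
dbl_type <- typeof(5)          # "double": plain numbers are doubles by default

txt <- "5"
is.character(txt)              # TRUE
result <- as.numeric(txt) * 5  # convert the text to a number first: 25

back_to_text <- as.character(result)  # "25"
```

Text that does not represent a number (such as "a") cannot be converted this way; `as.numeric()` returns `NA` with a warning in that case.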
Simple: all values have the same type (e.g. all numeric)
vector, matrix, array
Complex: values can have multiple/complex types
factor(), list(), data.frame()
x
x <- c(1, 2, 3)
x
[1] 1 2 3
y
y <- c("Kevin", "Keenan")
y
[1] "Kevin" "Keenan"
z from x and y?
z <- c(x, y)
z
[1] "1" "2" "3" "Kevin" "Keenan"
z <- list(numbers = x, strings = y)
z
$numbers
[1] 1 2 3

$strings
[1] "Kevin"  "Keenan"
In statistical analysis in R, the most common data structure used is the dataframe
This is actually just a list with two dimensions (rows and columns)
z <- list(rank = c(1,2,3), people = c("John", "Sarah", "Liz"))
z
$rank
[1] 1 2 3

$people
[1] "John"  "Sarah" "Liz"

as.data.frame(z)
  rank people
1    1   John
2    2  Sarah
3    3    Liz
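The claim that a dataframe is a list with two dimensions can be checked directly. A short sketch, reusing the `z` list from above (the name `df` is introduced here only for illustration):

```r
z  <- list(rank = c(1, 2, 3), people = c("John", "Sarah", "Liz"))
df <- as.data.frame(z)

is.list(df)      # TRUE: a dataframe is stored as a list of columns
dim(df)          # 3 rows, 2 columns
names(df)        # the list names become the column names

# Columns can be extracted with list syntax
df$rank
df[["people"]]
```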
dataframe. However, other data structures, such as the matrix, may also come in useful.
sex | hgt | col |
---|---|---|
M | 75.22 | blond |
M | 76.11 | black |
F | 75.84 | red |
M | 75.31 | brown |
M | 75.22 | red |
str(dat)
'data.frame':	5 obs. of  3 variables:
 $ sex: Factor w/ 2 levels "F","M": 2 2 1 2 2
 $ hgt: num  75.2 76.1 75.8 75.3 75.2
 $ col: Factor w/ 4 levels "black","blond",..: 2 1 4 3 4
dat$sex
[1] M M F M M
Levels: F M
tapply(X = dat$hgt, INDEX = dat$sex, FUN = mean)
     F      M
75.840 75.465
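The slides do not show how `dat` was created. One way to rebuild it from the table above is sketched here; the `stringsAsFactors = TRUE` argument is an assumption, chosen so that the text columns become factors as in the `str(dat)` output.

```r
dat <- data.frame(
  sex = c("M", "M", "F", "M", "M"),
  hgt = c(75.22, 76.11, 75.84, 75.31, 75.22),
  col = c("blond", "black", "red", "brown", "red"),
  stringsAsFactors = TRUE   # store text columns as factors
)

levels(dat$sex)             # the groups that tapply will split by

# Mean height within each level of sex
means <- tapply(X = dat$hgt, INDEX = dat$sex, FUN = mean)
means
```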
class | subclass | genome_size |
---|---|---|
Mammal | Placental | 3.922 |
Amphibian | Frog | 5.221 |
Amphibian | Salamander | 33.21 |
Bird | Bird | 1.202 |
Mammal | Placental | 3.471 |
Mammal | Placental | 3.424 |
tapply(X = genome$genome_size, INDEX = genome$class, FUN = mean, na.rm = TRUE)
Amphibian      Bird      Fish    Mammal  Reptiles
17.833651  1.410088  1.505703  3.404727  2.116691
levels function
levels(genome$class) <- c("a", "b", "f", "m", "r")
tapply(X = genome$genome_size, INDEX = genome$class, FUN = mean, na.rm = TRUE)
        a         b         f         m         r
17.833651  1.410088  1.505703  3.404727  2.116691
$ index function
dat$hgt
[1] 75.22 76.11 75.84 75.31 75.22
dat[ , 2]
[1] 75.22 76.11 75.84 75.31 75.22
dat[ , "hgt"]
[1] 75.22 76.11 75.84 75.31 75.22
dat[1, 2]
[1] 75.22
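The same square-bracket syntax also accepts logical conditions, which is a common way to pick out rows. A brief sketch, rebuilding `dat` from the table shown earlier so that it runs on its own (the names `females` and `tall` are illustrative):

```r
dat <- data.frame(
  sex = c("M", "M", "F", "M", "M"),
  hgt = c(75.22, 76.11, 75.84, 75.31, 75.22),
  col = c("blond", "black", "red", "brown", "red"),
  stringsAsFactors = TRUE
)

females <- dat[dat$sex == "F", ]       # all columns, rows where sex is F
tall    <- dat[dat$hgt > 75.5, "hgt"]  # hgt values above 75.5

# Three equivalent ways to extract the second column
same_col <- identical(dat$hgt, dat[, 2]) && identical(dat$hgt, dat[["hgt"]])
```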
dat dataframe.
with(genome, plot(class, genome_size))
library(ggplot2)
ggplot(data = genome, aes(x = class, y = genome_size, fill = class)) +
  geom_boxplot()
fun_name <- function(arguments) {
  # work done
  return(results)
}
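To make the template concrete, here is a small sketch of a function that follows it. The function name `std_err` and its `na.rm` argument are made up for illustration and are not part of base R.

```r
# Standard error of the mean: sd(x) / sqrt(n)
std_err <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]            # drop missing values if requested
  }
  se <- sd(x) / sqrt(length(x))  # the "work done"
  return(se)                     # the result handed back to the caller
}

std_err(c(2, 4, 4, 4, 5, 5, 7, 9))
std_err(c(1, 2, 3, NA), na.rm = TRUE)
```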
for loops
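A minimal sketch of a for loop, using arbitrary values: it visits each element of a vector in turn and stores a result for it.

```r
x <- c(1, 4, 9, 16)
roots <- numeric(length(x))   # pre-allocate the result vector

for (i in seq_along(x)) {     # i takes the values 1, 2, 3, 4
  roots[i] <- sqrt(x[i])
}

roots
```

Pre-allocating `roots` avoids growing the vector inside the loop, which becomes slow for long inputs.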
apply is used to apply a function to a margin of an object with more than one dimension
# create a matrix
x <- matrix(rnorm(100), ncol = 5, nrow = 20)
head(x)
             [,1]       [,2]       [,3]       [,4]       [,5]
[1,]  0.946585640 -1.1735769  0.1288554  1.5929138  1.0177542
[2,]  0.004398704 -0.1556425 -0.1458756  0.0450106 -0.2511646
[3,] -0.352322306 -1.9189098 -0.1639110 -0.7151284 -1.4299934
[4,] -0.529695509 -0.1952588  1.7635520  0.8652231  1.7091210
[5,]  0.739589226 -2.5923277  0.7625865  1.0744410  1.4350696
[6,] -1.063457415  1.3140022  1.1114311  1.8956548 -0.7103711
# calculate the standard deviation of each row
apply(X = x, MARGIN = 1, FUN = sd)
 [1] 1.0724441 0.1224504 0.7405635 1.0592279 1.6324541 1.3127500 0.4743751
 [8] 0.7211607 1.2977208 1.1263022 1.3564019 0.5357712 1.3171231 1.2722752
[15] 1.4457967 0.1638006 1.1846336 0.3935585 1.0542124 0.6171212
# calculate the standard deviation of each column
apply(X = x, MARGIN = 2, FUN = sd)
[1] 1.0310062 0.9299603 1.0337509 0.8722552 1.1473742
The apply function can also be used on dataframes and multi-dimensional arrays.
sapply and lapply are very similar, the major difference being that sapply will usually return a vector, while lapply always returns a list.
sapply
x <- 1:5
x
[1] 1 2 3 4 5
# return the square root of each value in x using sapply
sapply(X = x, FUN = sqrt)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
lapply
# return the square root of each value in x using lapply
lapply(X = x, FUN = sqrt)
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
[1] 2

[[5]]
[1] 2.236068
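The two results above hold the same values in different containers. The sketch below compares them and also shows one case behind the word "usually": when the function returns results of different lengths, sapply cannot simplify and falls back to a list. The object names are illustrative.

```r
x <- 1:5
s <- sapply(X = x, FUN = sqrt)   # simplified to a numeric vector
l <- lapply(X = x, FUN = sqrt)   # always a list

is.numeric(s)                    # TRUE
is.list(l)                       # TRUE
same_values <- identical(s, unlist(l))  # unlist() flattens the list to a vector

# seq_len(i) returns i values, so the results differ in length
ragged <- sapply(X = 1:3, FUN = seq_len)
is.list(ragged)                  # TRUE: no simplification was possible
```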