Introduction to R

30 April 2015

Introduction

Intro

"Education is not the filling of a pail, but the lighting of a fire"

W.B. Yeats

Session Structure

The course will be three days of relaxed learning
Day 1 will be entirely focused on getting familiar with R
The day will begin with getting everything installed and a short presentation providing an overview of R (10:30 - 12:30)
- Capabilities + syntax
Then we will do a short practical to hone our skills (13:30 - 15:30)
For the final session we can have an interactive session where you can provide challenges and we can work through the solutions together (maybe with your own data)

What is R?

A programming language
- Computational problems
- Academia
- Industry
A statistical package
- Basic stats (t.test, anova, regression)
- Advanced stats (random forests, clustering, machine learning)
A scripting environment
- Building complex analysis pipelines
- Bioinformatics

What is R?

It is used widely in academia and industry
It is currently one of the most popular analytical tools among professional data scientists (academic + commercial)

What is R?

Positives
- Fully functional statistical programming environment
- Open source
- Cross-platform (Windows/Linux/Mac)
- Excellent graphing capabilities
- Thousands of free extension packages
- Large online community for support
- Reproducible research
- High level syntax (relatively easy to learn and code)

What is R?

Negatives
- Steep learning curve (Lots of help available)
- Minimal GUI capabilities
- Analysing large data sets can be troublesome (not impossible)
- Scripts cannot be compiled into stand alone .exe programs (web apps?)
- Interpreted language (slow compared to compiled C++ etc.)
- Thousands of (free) extension packages

What can you do in R?

What can you do in R

Simple arithmetic
```
  5 * 5
```
```
[1] 25
```
Advanced calculator \[I_n(Q;J=j) = -p_{j}log_{e}p_{j} + \sum_{i=1}^K\frac{p_{ij}}{K}log_{e}p_{ij}\]
```
In[j] <- (-(p[j])*log(p[j])) + sum((p[i, j]/k)*log(p[i, j]))
```

What can you do in R?

Data analysis
- General statistical tests (t-tests/anova/glm etc.)
- Specialist analysis (abc/clustering/networks etc.)
Data visualisation
- Basic plotting

What can you do in R?

More complex plotting

What can you do in R?

Design web applications using the shiny package
- diveRsity-online (diveRsity package app for popgen analysis)
- divMigrate-online (web app for directional gene flow analysis)
- k-means cluster (demo of k-means clustering of iris data)
- Global sea surface temperature

Programming in R

Choice of integrated development environment (IDE) is essential for convenient and efficient projects (e.g. analysis for paper x)
They also provide some niceties such as syntax highlighting, code completion, and code diagnostics (RStudio only)
There are a number of choices available

Programming in R

Why RStudio?
- Well supported
- The most actively developed IDE
- Some of the best R developers in the world involved (Hadley Wickham, JJ Allaire)
- Much more than R
  - $\LaTeX$ (Write Word/PDF/HTML papers in RStudio)
  - markdown (Write reports, papers or presentations, e.g. this presentation)
- Version control (git + SVN)
- Projects
- Lots of new features with each release

Playing with RStudio: practical 1

Working with R

The working directory

Although RStudio has minimised the practical importance of this, it is still essential to understand it
The working directory is the place where R looks for files that you ask to read, and the place where results will be written
The working directory can be determined using:
```
getwd()
```
To manually specify a new working directory, use:
```
setwd("path/to/new/folder")
```
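The following is a minimal sketch of how relative file names depend on the working directory. It uses `tempdir()` as a stand-in folder purely for illustration; the file name `example.txt` is made up.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Switch to a temporary folder (used here only as a safe example location)
tmp <- tempdir()
setwd(tmp)

# A relative file name is now resolved against the new working directory
writeLines("hello", "example.txt")
in_tmp <- file.exists(file.path(tmp, "example.txt"))
in_tmp

# Restore the original working directory
setwd(old_wd)
```

Restoring the original directory at the end is a habit worth forming, since later code may rely on it.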

Getting help in R

Although the best place to find help when using R is online, there is a more formal built-in help system
All functions have a dedicated help file that can be accessed as follows:
```
?function_name
# or
help(function_name)
```
Providing you have got the name correct, this will open a help file
We will see some examples of these files later
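When the exact function name is not known, a few other base R tools may help. The sketch below uses `apropos()` and `args()`; the search term "mean" is just an example.

```r
# List objects whose names contain "mean" (handy when the exact name is forgotten)
candidates <- apropos("mean")
head(candidates)

# Show the arguments a function accepts without opening its help page
args(sd)

# Full-text search of installed help pages (best run interactively):
# help.search("standard deviation")   # equivalent to ??"standard deviation"
```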

Getting help in R

There are many useful websites dedicated to using R

Getting help in R

Hundreds of books for all kinds of research using R

Getting help in R

A list of almost every book ever written for R
- This site also appears to have links to PDF versions of many of the books. I am not clear on the legality of this, so exercise caution.
Each other!
- One of the major benefits of R is the active and enthusiastic community
- There are some 'bad eggs' who make learning more difficult than it should be, but the majority understand the challenge.

Beyond base R

The true power of R comes from the 1000s of additional packages available
These can be installed using:
```
install.packages("package_name")
```
A growing number of packages are available on github
- Difficult to find the gems
- May have poor documentation and bugs

library(devtools)
install_github("username/repo")
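Before loading a package in a script, it can be useful to check whether it is installed. This is one possible pattern, using `ggplot2` only as an example package name.

```r
pkg <- "ggplot2"  # example package name; substitute the one you need

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
has_pkg <- requireNamespace(pkg, quietly = TRUE)

if (has_pkg) {
  library(pkg, character.only = TRUE)  # attach the package for this session
} else {
  message("Not installed. Run: install.packages(\"", pkg, "\")")
}
```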

Syntax: the grammar of the language

Basic operations

R can be used as a simple calculator

5 + 5

[1] 10

And a less simple calculator

log10(10)

[1] 1

Basic operations

Assignment is the process of storing values in variable names
- Three types of assignment in R
```
variable <- value
value -> variable
variable = value
```
I prefer "<-" since it is pretty explicit
- Variables are stored in the computer's memory for interactive use
```
  x <- 10
  x
```
```
[1] 10
```
```
  y <- x
  y
```
```
[1] 10
```

Basic operations

Common mathematical operators

^ or **  powers: 2^10 == 2**10
* and /  Multiplication and division
+ and -  Addition and subtraction
%/%      Integer division
%*%      Conformable matrix multiplication
etc. etc.
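A few of these operators in action. The modulo operator `%%` is not listed above, but it is the natural companion to `%/%`, so it is included here as an extra.

```r
2^10          # power
7 %/% 2       # integer division: how many whole 2s fit into 7
7 %% 2        # remainder after integer division (modulo)
-2^2          # ^ binds tighter than unary minus, so this is -(2^2)

m <- matrix(1:4, nrow = 2)   # filled column by column
m * m         # element-wise multiplication
m %*% m       # matrix multiplication
```

Note the difference between `*` and `%*%` on matrices: the first multiplies matching elements, the second is the linear algebra product.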

Basic operations

Common built-in mathematical functions

sin()|cos()|tan()|log()|log10()
sqrt()|sum()|floor()|ceiling()
round()|abs()|acos()|atan()|factorial()

Example

sum(1,3)

[1] 4

sum(1,3) == 1 + 3

[1] TRUE

Data types in R

All variables in R will have a type
This information is important for a number of reasons
- A numeric type can't be multiplied by a character type
```
  5 * "a"
```
```
Error in 5 * "a": non-numeric argument to binary operator
```
Certain data structures can only contain a single data type
- vector
- matrix
Others allow a mix of types
- lists
- dataframe

Data types in R

The best way to learn this stuff is through practice and getting errors
Test functions are also available:

typeof(5.3)

[1] "double"

typeof("ABC")

[1] "character"

typeof(TRUE)

[1] "logical"

is.character(4)

[1] FALSE
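Alongside the `is.*` test functions, base R has matching `as.*` functions for converting between types. A small sketch with made-up values:

```r
x <- "3.14"                # a number stored as text
is.numeric(x)              # FALSE: it is a character string

y <- as.numeric(x)         # convert to a number
is.numeric(y)

# Text that cannot be converted becomes NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)

typeof(5L)                 # the L suffix creates an integer rather than a double
```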

Data structures in R

There are simple and complex data structures in R
- Simple: all values have the same type (e.g. all numeric)
```
matrix
vector
array
```
- Complex: values can have multiple/complex types
```
factor()
list()
data.frame()
```

Data structures in R

Creating a numeric vector named x

x <- c(1, 2, 3)
x

[1] 1 2 3

Creating a character vector named y

y <- c("Kevin", "Keenan")
y

[1] "Kevin"  "Keenan"

Data structures in R

What happens if we try to create a vector z from x and y?

z <- c(x, y)
z

[1] "1"      "2"      "3"      "Kevin"  "Keenan"

This is known as coercion, and is important to be aware of
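Coercion follows a fixed order: logical values become integers, integers become doubles, and everything becomes character if any element is text. A short illustration:

```r
typeof(c(TRUE, 2L))     # logical + integer  -> "integer"
typeof(c(2L, 2.5))      # integer + double   -> "double"
typeof(c(2.5, "a"))     # double + character -> "character"

# Coercion can be handy: TRUE counts as 1, so sum() counts the TRUE values
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true
```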

Data structures in R

What if we really did want a variable with both numeric and character values

z <- list(numbers = x, strings = y)
z

$numbers
[1] 1 2 3

$strings
[1] "Kevin"  "Keenan"

This feature is extremely useful for statistical computing
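Elements of a list keep their original types and can be pulled out by name. A minimal sketch, rebuilding the same list so the example runs on its own:

```r
x <- c(1, 2, 3)
y <- c("Kevin", "Keenan")
z <- list(numbers = x, strings = y)

z$numbers          # extract an element by name: a numeric vector
z[["strings"]]     # double brackets do the same: a character vector
z["strings"]       # single brackets return a (smaller) list instead

mean(z$numbers)    # the numbers are still numeric, so arithmetic works
```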

Data structures in R

In statistical analysis in R, the most common data structure used is the dataframe
This is actually just a list with two dimensions (rows and columns)

z <- list(rank = c(1,2,3), people = c("John", "Sarah", "Liz"))
z

$rank
[1] 1 2 3

$people
[1] "John"  "Sarah" "Liz"

as.data.frame(z)

  rank people
1    1   John
2    2  Sarah
3    3    Liz
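A dataframe can also be built directly with `data.frame()`. The sketch below shows that the result is still a list underneath, and that columns must have compatible lengths.

```r
df <- data.frame(rank = c(1, 2, 3),
                 people = c("John", "Sarah", "Liz"),
                 stringsAsFactors = FALSE)  # keep text as character, not factor

is.list(df)    # a dataframe is a list of equal-length columns
dim(df)        # rows, then columns
names(df)

# Columns whose lengths do not match (3 vs 2 here) are rejected with an error
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")
```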

R for data analysis

Introduction

Although population genetics analysis is slightly different from standard analyses in R, a number of 'standard' skills will come in handy for:
- Data visualisation
- Taking results beyond the basics
Most data analysis (and manipulation) tasks are carried out on the dataframe. However, other data structures, such as the matrix, may also come in useful.

Dataframes

A list with two-dimensions
Used to store data tables in R
Allows for the storage of vectors of equal length with different types
- Example

sex	hgt	col
M	75.22	blond
M	76.11	black
F	75.84	red
M	75.31	brown
M	75.22	red
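One way the table above could be entered by hand is sketched below. The object name `dat` matches the one used in the following slides; the text columns are wrapped in `factor()` explicitly because recent versions of R no longer convert them automatically.

```r
dat <- data.frame(
  sex = factor(c("M", "M", "F", "M", "M")),
  hgt = c(75.22, 76.11, 75.84, 75.31, 75.22),
  col = factor(c("blond", "black", "red", "brown", "red"))
)

dat
str(dat)   # shows the type of each column
```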

Dataframes

Let's see what we mean by vectors of different types

str(dat)

'data.frame':   5 obs. of  3 variables:
 $ sex: Factor w/ 2 levels "F","M": 2 2 1 2 2
 $ hgt: num  75.2 76.1 75.8 75.3 75.2
 $ col: Factor w/ 4 levels "black","blond",..: 2 1 4 3 4

What is a factor?

A special vector in R, where elements take on one of a set of values known as levels

dat$sex

[1] M M F M M
Levels: F M

Factors are very useful for grouping variables in statistical analyses and plotting

tapply(X = dat$hgt, INDEX = dat$sex, FUN = mean)

     F      M 
75.840 75.465
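It may help to look at a factor on its own. This sketch rebuilds the `sex` and `hgt` columns as stand-alone vectors and shows how levels, counts and grouping relate.

```r
sex <- factor(c("M", "M", "F", "M", "M"))
hgt <- c(75.22, 76.11, 75.84, 75.31, 75.22)

levels(sex)        # the set of allowed values, sorted alphabetically
as.integer(sex)    # internally each element is stored as a level number
table(sex)         # counts per level

# split() groups one vector by a factor; tapply() does this and then applies FUN
groups <- split(hgt, sex)
groups
```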

What is a factor?

A more realistic example using genome sizes

class	subclass	genome_size
Mammal	Placental	3.922
Amphibian	Frog	5.221
Amphibian	Salamander	33.21
Bird	Bird	1.202
Mammal	Placental	3.471
Mammal	Placental	3.424

Calculate the mean genome size per class

tapply(X = genome$genome_size, INDEX = genome$class, FUN = mean, na.rm = TRUE)

Amphibian      Bird      Fish    Mammal  Reptiles 
17.833651  1.410088  1.505703  3.404727  2.116691

What is a factor?

The levels of a factor can be changed using the levels function

levels(genome$class) <- c("a", "b", "f", "m", "r")
tapply(X = genome$genome_size, INDEX = genome$class, FUN = mean, na.rm = TRUE)

        a         b         f         m         r 
17.833651  1.410088  1.505703  3.404727  2.116691
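Assigning to `levels()` is positional, so the new names must be in the same order as the existing levels. A slightly safer approach, sketched here, is to look the new names up by the old ones. Only the six rows printed earlier are used, so this small `genome` has three classes rather than five.

```r
genome <- data.frame(
  class = factor(c("Mammal", "Amphibian", "Amphibian", "Bird", "Mammal", "Mammal")),
  genome_size = c(3.922, 5.221, 33.21, 1.202, 3.471, 3.424)
)

levels(genome$class)   # check the current order first (alphabetical)

# Map old names to new names explicitly, then apply in the existing order
new_names <- c(Amphibian = "a", Bird = "b", Mammal = "m")
levels(genome$class) <- unname(new_names[levels(genome$class)])

genome$class
```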

Dataframes

Variables can be accessed (indexed) from a dataframe in a number of ways in R

Using the $ index function

  dat$hgt

[1] 75.22 76.11 75.84 75.31 75.22

Using standard indexing

  dat[ , 2]

[1] 75.22 76.11 75.84 75.31 75.22

Using named indexes

  dat[ , "hgt"]

[1] 75.22 76.11 75.84 75.31 75.22

Dataframes

Single values can be accessed from a dataframe as follows

dat[1, 2]

[1] 75.22

This gives us the height for the first row in the dat dataframe.
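Rows can also be selected with a logical condition in the row position. The sketch below rebuilds `dat` from the table shown earlier so that it runs on its own.

```r
dat <- data.frame(
  sex = factor(c("M", "M", "F", "M", "M")),
  hgt = c(75.22, 76.11, 75.84, 75.31, 75.22),
  col = factor(c("blond", "black", "red", "brown", "red"))
)

# Keep only the rows where sex is "M" (all columns)
males <- dat[dat$sex == "M", ]
males

# Hair colour of everyone taller than 75.5
tall_col <- dat[dat$hgt > 75.5, "col"]
tall_col
```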

Plotting

Using what we know

Graphics are one of the most attractive features of R
- Basic plotting {base graphics}
- Complex/pretty plotting {ggplot2, ggvis etc.}
Basic

with(genome, plot(class, genome_size))

Using what we know

Less basic

library(ggplot2)
ggplot(data = genome, aes(x = class, y = genome_size, fill = class)) +
  geom_boxplot()
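Plots can be written to a file rather than the screen by opening a graphics device first. This is a sketch using base graphics and only the six genome rows printed earlier; the output file is a temporary PDF chosen for illustration.

```r
genome <- data.frame(
  class = factor(c("Mammal", "Amphibian", "Amphibian", "Bird", "Mammal", "Mammal")),
  genome_size = c(3.922, 5.221, 33.21, 1.202, 3.471, 3.424)
)

out <- tempfile(fileext = ".pdf")   # temporary file name for the example
pdf(out)                            # open a PDF graphics device

# Formula interface: genome_size grouped by class
boxplot(genome_size ~ class, data = genome,
        xlab = "Class", ylab = "Genome size")

invisible(dev.off())                # close the device so the file is written
file.exists(out)
```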

Functions in R

Functions

As R is a programming language, it is also possible to write functions (programs/modules/routines)

fun_name <- function(arguments){
  work
  done
  return(results)
}

We will get the chance to write our own functions later
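As a concrete instance of the template above, here is a small function that calculates the standard error of the mean. The name `std_err` and its `na.rm` argument are choices made for this example, not part of base R.

```r
# Standard error of the mean: sd(x) / sqrt(n)
std_err <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]        # drop missing values if requested
  }
  se <- sd(x) / sqrt(length(x))
  return(se)
}

std_err(c(1, 2, 3))
std_err(c(1, 2, 3, NA), na.rm = TRUE)   # same answer once the NA is dropped
```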

Meet the apply family

The apply family

Useful for applying an expression many times
These are some of the most useful functions in R
- apply
- lapply
- sapply
- mapply
- by
- rapply
- eapply
Differ in both input and output structure
Often remove the need for explicit for loops
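To make the comparison with loops concrete, the sketch below computes the same result twice, once with a `for` loop and once with `sapply`. The input values are arbitrary.

```r
x <- c(4, 9, 16)

# Loop version: create the result vector first, then fill it in
res_loop <- numeric(length(x))
for (i in seq_along(x)) {
  res_loop[i] <- sqrt(x[i])
}

# Apply version: one line, no bookkeeping
res_apply <- sapply(x, sqrt)

identical(res_loop, res_apply)
```

For a function like `sqrt` that is already vectorised, `sqrt(x)` alone gives the same answer; the apply functions earn their keep when the function only handles one element at a time.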

The apply family

apply is used to apply a function to a margin of an object with > 1 dimensions

# create a matrix
x <- matrix(rnorm(100), ncol = 5, nrow = 20)
head(x)

             [,1]       [,2]       [,3]       [,4]       [,5]
[1,]  0.946585640 -1.1735769  0.1288554  1.5929138  1.0177542
[2,]  0.004398704 -0.1556425 -0.1458756  0.0450106 -0.2511646
[3,] -0.352322306 -1.9189098 -0.1639110 -0.7151284 -1.4299934
[4,] -0.529695509 -0.1952588  1.7635520  0.8652231  1.7091210
[5,]  0.739589226 -2.5923277  0.7625865  1.0744410  1.4350696
[6,] -1.063457415  1.3140022  1.1114311  1.8956548 -0.7103711

# calculate the standard deviation of each row
apply(X = x, MARGIN = 1, FUN = sd)

 [1] 1.0724441 0.1224504 0.7405635 1.0592279 1.6324541 1.3127500 0.4743751
 [8] 0.7211607 1.2977208 1.1263022 1.3564019 0.5357712 1.3171231 1.2722752
[15] 1.4457967 0.1638006 1.1846336 0.3935585 1.0542124 0.6171212

The apply family

If we wanted to calculate the standard deviation for each column, we simply need to 'apply' the function to the second margin (col) of the matrix:

apply(X = x, MARGIN = 2, FUN = sd)

[1] 1.0310062 0.9299603 1.0337509 0.8722552 1.1473742

The apply function can also be used on dataframes, and multi-dimensional arrays
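`FUN` does not have to be an existing function; a short function can be written in place. The sketch below uses `set.seed()` so the random matrix is the same on every run, and the seed value is arbitrary.

```r
set.seed(1)                                  # make the random numbers repeatable
x <- matrix(rnorm(100), ncol = 5, nrow = 20)

# Column means via apply, checked against the built-in shortcut colMeans()
col_means <- apply(X = x, MARGIN = 2, FUN = mean)
all.equal(col_means, colMeans(x))

# A function written in place: the range (max minus min) of each column
col_range <- apply(X = x, MARGIN = 2, FUN = function(v) max(v) - min(v))
col_range
```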

The apply family

sapply and lapply are very similar, the major difference being sapply will usually return a vector, while lapply always returns a list.
- sapply

x <- 1:5
x

[1] 1 2 3 4 5

# return the square root of each value in x using sapply
sapply(X = x, FUN = sqrt)

[1] 1.000000 1.414214 1.732051 2.000000 2.236068

The apply family

sapply and lapply are very similar, the major difference being sapply will usually return a vector, while lapply always returns a list.
- lapply

# return the square root of each value in x using lapply
lapply(X = x, FUN = sqrt)

[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
[1] 2

[[5]]
[1] 2.236068
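Two other members of the family are worth a quick look: `vapply`, which is like `sapply` but requires the expected output type to be stated, and `mapply`, which steps through several inputs in parallel. A minimal sketch:

```r
x <- 1:5

# sapply is essentially lapply followed by simplification
as_list <- lapply(X = x, FUN = sqrt)
as_vec  <- sapply(X = x, FUN = sqrt)
identical(unlist(as_list), as_vec)

# vapply: state that each result must be a single number
roots <- vapply(X = x, FUN = sqrt, FUN.VALUE = numeric(1))

# mapply: add the first elements together, then the second, and so on
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
sums
```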

Really getting to know R: Practical 2