30 April 2015

Introduction

Intro

"Education is not the filling of a pail, but the lighting of a fire"

W.B. Yeats

Session Structure

  • The course will be three days of relaxed learning
  • Day 1 will be entirely focused on getting familiar with R
  • The day will begin with getting everything installed and a short presentation providing an overview of R (10:30 - 12:30)
    • Capabilities + syntax
  • Then we will do a short practical to hone our skills (13:30 - 15:30)
  • For the final session we can have an interactive session where you can provide challenges and we can work through the solutions together (maybe with your own data)

What is R?

What is R?

  • A programming language
    • Computational problems
    • Academia
    • Industry
  • A statistical package
    • Basic stats (t.test, anova, regression)
    • Advanced stats (random forests, clustering, machine learning)
  • A scripting environment
    • Building complex analysis pipelines
    • Bioinformatics

What is R?

  • It is used widely in academia and industry

  • It is currently one of the most popular analytical tools among professional data scientists (academic + commercial)

What is R?

  • Positives
    • Fully functional statistical programming environment
    • Open source
    • Cross-platform (Windows/Linux/Mac)
    • Excellent graphing capabilities
    • Thousands of free extension packages
    • Large online community for support
    • Reproducible research
    • High level syntax (relatively easy to learn and code)

What is R?

  • Negatives
    • Steep learning curve (Lots of help available)
    • Minimal GUI capabilities
    • Analysing large data sets can be troublesome (not impossible)
    • Scripts cannot be compiled into standalone .exe programs (web apps?)
    • Interpreted language (slow compared to compiled C++ etc.)
    • Thousands of (free) extension packages

What can you do in R?

What can you do in R?

  • Simple arithmetic

      5 * 5
    [1] 25
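  • Advanced calculator \[I_n(Q;J=j) = -p_{j}\log_{e}p_{j} + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log_{e}p_{ij}\]

    # p_j is assumed to be a vector of the p_j values and p_ij a K-row matrix of the p_ij values
    In[j] <- -p_j[j] * log(p_j[j]) + sum((p_ij[, j] / K) * log(p_ij[, j]))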

What can you do in R?

  • Data analysis
    • General statistical tests (t-tests/anova/glm etc.)
    • Specialist analysis (abc/clustering/networks etc.)
  • Data visualisation
    • Basic plotting

What can you do in R?

  • More complex plotting

What can you do in R?

Programming in R

Programming in R

  • The choice of integrated development environment (IDE) has a large effect on how convenient and efficient a project is (e.g. analysis for paper x)

  • IDEs also provide some niceties such as syntax highlighting, code completion, and code diagnostics (RStudio only)

  • There are a number of choices available

Programming in R

  • Why RStudio?
    • Well supported
    • The most actively developed IDE
    • Some of the best R developers in the world are involved (Hadley Wickham, JJ Allaire)
    • Much more than R
      • \(\LaTeX\) (Write Word/PDF/HTML papers in RStudio)
      • markdown (Write reports, papers or presentations, e.g. this presentation)
    • Version control (git + SVN)
    • Projects
    • Lots of new features with each release

Playing with RStudio: practical 1

Working with R

The working directory

  • Although RStudio has minimised the practical importance of this, it is still essential to understand it

  • The working directory is the place where R looks for files that you ask to read, and the place where results will be written

  • The working directory can be determined using:

    getwd()
  • To manually specify a new working directory, use:

    setwd("path/to/new/folder")
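  • A minimal sketch of changing the working directory and restoring it afterwards. The folder used here (tempdir()) and the file name example.csv are only stand-ins chosen so that the example runs anywhere; substitute your own project folder.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is used only because it is a folder that is guaranteed to exist
setwd(tempdir())

# Relative file names are now resolved against the new working directory
write.csv(data.frame(a = 1:3), "example.csv", row.names = FALSE)
file_found <- file.exists("example.csv")
file_found

# Go back to where we started
setwd(old_wd)
```

  • Restoring the original directory at the end keeps later code from silently reading or writing files in the wrong place.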

Getting help in R

  • Although the best place to find help when using R is online, there is a more formal built-in help system

  • All functions have a dedicated help file that can be accessed as follows:

    ?function_name
    # or
    help(function_name)
  • Provided the name is correct, this will open a help file

  • We will see some examples of these files later
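  • When the exact function name is not known, a few other base R tools can help. The sketch below is illustrative only; sd and read.csv are arbitrary examples, not functions singled out by the course.

```r
# Show the arguments a function accepts without opening its help file
args(sd)

# List objects whose names match a pattern (a regular expression)
matches <- apropos("^read\\.csv")
matches

# Search the help pages by keyword rather than by function name
# (commented out because it opens a results viewer)
# help.search("standard deviation")
```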

Getting help in R

Getting help in R

  • Hundreds of books for all kinds of research using R

Getting help in R

  • A list of almost every book ever written for R
    • This site also appears to link to PDF versions of many of the books. I am not clear on the legality of this, so exercise caution.
  • Each other!
    • One of the major benefits of R is the active and enthusiastic community
    • There are some 'bad eggs' who make learning more difficult than it should be, but the majority understand the challenge.

Beyond base R

  • The true power of R comes from the thousands of additional packages available
  • These can be installed using:

    install.packages("package_name")
  • A growing number of packages are available on GitHub
    • Difficult to find the gems
    • May have poor documentation and bugs
library(devtools)
install_github("username/repo")
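  • Packages only need to be installed once, so it can be convenient to check before installing. The helper name is_installed below is hypothetical (it is not part of R); it wraps the base function requireNamespace().

```r
# Hypothetical helper: TRUE if a package can be loaded, without attaching it
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with every R installation
is_installed("stats")

# Install only when missing (commented out to avoid a download here)
# if (!is_installed("ggplot2")) install.packages("ggplot2")
```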

Syntax: the grammar of the language

Basic operations

  • R can be used as a simple calculator
5 + 5
[1] 10
  • And a less simple calculator
log10(10)
[1] 1

Basic operations

  • Assignment is the process of storing a value under a variable name
    • There are three forms of assignment in R
    variable <- value
    value -> variable
    variable = value
  • I prefer "<-" since it is pretty explicit
    • Variables are stored in the computer's memory for interactive use
      x <- 10
      x
    [1] 10
      y <- x
      y
    [1] 10
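  • One consequence worth seeing for yourself: assigning y <- x copies the value, so changing x afterwards leaves y untouched. A small sketch:

```r
x <- 10
y <- x    # y receives a copy of the value stored in x
x <- 20   # reassigning x does not affect y

x   # 20
y   # 10
```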

Basic operations

  • Common mathematical operators
^ or **  powers: 2^10 == 2**10
* and /  Multiplication and division
+ and -  Addition and subtraction
%/%      Integer division
%*%      Conformable matrix multiplication
etc. etc.
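  • A few of these operators in action. The remainder operator %% is not in the table above but usually travels with %/%; the matrix m and vector v are made-up values for illustration.

```r
2^10          # power: 1024
7 / 2         # ordinary division: 3.5
7 %/% 2       # integer division: 3
7 %% 2        # remainder after integer division: 1
2 + 3 * 4^2   # powers first, then multiplication, then addition: 50

# Matrix multiplication needs conformable dimensions (2 x 2 times 2 x 1)
m <- matrix(1:4, nrow = 2)   # filled column by column
v <- c(1, 1)
m %*% v                      # row sums of m: 4 and 6
```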

Basic operations

  • Common built-in mathematical functions

    sin()|cos()|tan()|log()|log10()
    sqrt()|sum()|floor()|ceiling()
    round()|abs()|acos()|atan()|factorial()
  • Example

sum(1,3)
[1] 4
sum(1,3) == 1 + 3
[1] TRUE
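  • A quick tour of some of the functions listed above, using arbitrary input values:

```r
sqrt(16)                     # 4
floor(2.7)                   # 2
ceiling(2.1)                 # 3
round(3.14159, digits = 2)   # 3.14
abs(-3)                      # 3
factorial(5)                 # 120
log10(1000)                  # 3

# sum() also accepts a whole vector
sum(c(1, 3, 5))              # 9
```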

Data types in R

  • All variables in R will have a type
  • This information is important for a number of reasons
    • A numeric type can't be multiplied by a character type
      5 * "a"
    Error in 5 * "a": non-numeric argument to binary operator
  • Certain data structures can only contain a single data type
    • vectors
    • matrices
  • Others allow a mix of types
    • lists
    • dataframes

Data types in R

  • The best way to learn this stuff is through practice and getting errors
  • Test functions are also available:
typeof(5.3)
[1] "double"
typeof("ABC")
[1] "character"
typeof(TRUE)
[1] "logical"
is.character(4)
[1] FALSE
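  • Types can also be converted explicitly with the as.* family of functions. A short sketch with made-up values:

```r
typeof(5L)          # "integer" (the L suffix requests an integer)
typeof(5)           # "double"

# A number stored as text must be converted before doing arithmetic
n <- as.numeric("5")
n * 5               # 25
is.numeric(n)       # TRUE

# And back the other way
as.character(5.3)   # "5.3"
```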

Data structures in R

  • There are simple and complex data structures in R
    • Simple: all values have the same type (e.g. all numeric)

      matrix
      vector
      array
    • Complex: values can have multiple/complex types

      factor()
      list()
      data.frame()

Data structures in R

  • Creating a numeric vector named x
x <- c(1, 2, 3)
x
[1] 1 2 3
  • Creating a character vector named y
y <- c("Kevin", "Keenan")
y
[1] "Kevin"  "Keenan"

Data structures in R

  • What happens if we try to create a vector z from x and y?
z <- c(x, y)
z
[1] "1"      "2"      "3"      "Kevin"  "Keenan"
  • This is known as coercion, and is important to be aware of
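  • Coercion always moves towards the more general type (logical, then numeric, then character). The sketch below uses small invented vectors to show the direction, and what happens when we try to go back.

```r
c(TRUE, 2)      # logical is coerced to numeric: 1 2
c(2, "a")       # numeric is coerced to character: "2" "a"
c(TRUE, "a")    # logical is coerced to character: "TRUE" "a"

# Converting back only works for values that look like numbers;
# anything else becomes NA (R normally warns about this)
z <- c(1, 2, "Kevin")
nums <- suppressWarnings(as.numeric(z))
nums            # 1 2 NA
```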

Data structures in R

  • What if we really did want a variable with both numeric and character values?
z <- list(numbers = x, strings = y)
z
$numbers
[1] 1 2 3

$strings
[1] "Kevin"  "Keenan"
  • This feature is extremely useful for statistical computing
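  • Once values are in a list, each element keeps its own type and can be pulled out by name. A sketch using the same list as above:

```r
z <- list(numbers = c(1, 2, 3), strings = c("Kevin", "Keenan"))

z$numbers            # the numeric vector
z[["strings"]][2]    # double brackets extract the element: "Keenan"
z["numbers"]         # single brackets return a (shorter) list

# The numbers are still numeric, so arithmetic works
mean(z$numbers)      # 2
```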

Data structures in R

  • In statistical analysis in R, the most common data structure used is the dataframe

  • This is actually just a list of equal-length vectors, laid out in two dimensions (rows and columns)

z <- list(rank = c(1,2,3), people = c("John", "Sarah", "Liz"))
z
$rank
[1] 1 2 3

$people
[1] "John"  "Sarah" "Liz"  
as.data.frame(z)
  rank people
1    1   John
2    2  Sarah
3    3    Liz
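  • A dataframe can also be built directly with data.frame(), skipping the list step. The sketch below reuses the values above; the variable name df is arbitrary.

```r
df <- data.frame(rank = c(1, 2, 3),
                 people = c("John", "Sarah", "Liz"),
                 stringsAsFactors = FALSE)  # keep people as character

is.list(df)   # TRUE: a dataframe really is a list underneath
dim(df)       # 3 rows, 2 columns
names(df)     # "rank" "people"
```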

R for data analysis

Introduction

  • Although population genetics analysis is slightly different from standard analyses in R, a number of 'standard' skills will come in handy for:
    • Data visualisation
    • Taking results beyond the basics
  • Most data analysis (and manipulation) tasks are carried out on the dataframe. However, other data structures, such as the matrix, may also come in useful.

Dataframes

  • A list with two dimensions
  • Used to store data tables in R
  • Allows for the storage of vectors of equal length with different types
    • Example
    sex   hgt   col
    M   75.22 blond
    M   76.11 black
    F   75.84   red
    M   75.31 brown
    M   75.22   red
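  • The slides do not show how this table was loaded into R. One way to recreate it, assuming it is stored under the name dat and that text columns should become factors, is sketched below.

```r
# Each column is a vector of the same length but a different type
dat <- data.frame(
  sex = c("M", "M", "F", "M", "M"),
  hgt = c(75.22, 76.11, 75.84, 75.31, 75.22),
  col = c("blond", "black", "red", "brown", "red"),
  stringsAsFactors = TRUE   # store sex and col as factors
)

nrow(dat)   # 5
ncol(dat)   # 3
```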

Dataframes

  • Let's see what we mean by vectors of different types
str(dat)
'data.frame':   5 obs. of  3 variables:
 $ sex: Factor w/ 2 levels "F","M": 2 2 1 2 2
 $ hgt: num  75.2 76.1 75.8 75.3 75.2
 $ col: Factor w/ 4 levels "black","blond",..: 2 1 4 3 4

What is a factor?

  • A special vector in R, where elements take on one of a set of values known as levels
dat$sex
[1] M M F M M
Levels: F M
  • Factors are very useful for grouping variables in statistical analyses and plotting
tapply(X = dat$hgt, INDEX = dat$sex, FUN = mean)
     F      M 
75.840 75.465 
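  • Under the hood a factor stores integer codes plus a lookup table of levels. A sketch using the sex column from above, built here as a standalone vector:

```r
sex <- factor(c("M", "M", "F", "M", "M"))

levels(sex)       # "F" "M" (sorted alphabetically by default)
as.integer(sex)   # the underlying codes: 2 2 1 2 2
table(sex)        # counts per level: F = 1, M = 4
```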

What is a factor?

  • A more realistic example using genome sizes
class     subclass   genome_size
Mammal    Placental  3.922
Amphibian Frog       5.221
Amphibian Salamander 33.21
Bird      Bird       1.202
Mammal    Placental  3.471
Mammal    Placental  3.424
  • Calculate the mean genome size per class
tapply(X = genome$genome_size, INDEX = genome$class, FUN = mean, na.rm = TRUE)
Amphibian      Bird      Fish    Mammal  Reptiles 
17.833651  1.410088  1.505703  3.404727  2.116691 

What is a factor?

  • The levels of a factor can be changed using the levels function
levels(genome$class) <- c("a", "b", "f", "m", "r")
tapply(X = genome$genome_size, INDEX = genome$class, FUN = mean, na.rm = TRUE)
        a         b         f         m         r 
17.833651  1.410088  1.505703  3.404727  2.116691 
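  • Note that replacement levels are matched by position, so the new names must be given in the same order as levels() reports them. A safer alternative for changing a single level is to select it by name. The factor f below is a small invented example, not the genome data.

```r
f <- factor(c("Mammal", "Bird", "Mammal", "Amphibian"))
levels(f)                               # "Amphibian" "Bird" "Mammal"

# Rename one level by name, leaving the others alone
levels(f)[levels(f) == "Bird"] <- "b"

# Rename all levels by position: the order must match levels(f)
levels(f) <- c("a", "b", "m")
as.character(f)                         # "m" "b" "m" "a"
```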

Dataframes

  • Variables can be accessed (indexed) from a dataframe in a number of ways in R
    • Using the $ operator
      dat$hgt
    [1] 75.22 76.11 75.84 75.31 75.22
    • Using standard indexing
      dat[ , 2]
    [1] 75.22 76.11 75.84 75.31 75.22
    • Using named indexes
      dat[ , "hgt"]
    [1] 75.22 76.11 75.84 75.31 75.22

Dataframes

  • Single values can be accessed from a dataframe as follows
dat[1, 2]
[1] 75.22
  • This gives us the height for the first row in the dat dataframe.
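  • Rows can be selected with a logical condition in the first index position. The sketch below rebuilds a two-column version of dat so that it runs on its own; the name males is arbitrary.

```r
dat <- data.frame(
  sex = c("M", "M", "F", "M", "M"),
  hgt = c(75.22, 76.11, 75.84, 75.31, 75.22)
)

# Keep only the rows where the condition is TRUE
males <- dat[dat$sex == "M", ]
nrow(males)                    # 4
mean(males$hgt)                # 75.465

# Combine a row condition with a column name
dat[dat$hgt > 75.5, "hgt"]     # 76.11 75.84
```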

Plotting

Using what we know

  • Graphics are one of the most attractive features of R
    • Basic plotting {base graphics}
    • Complex/pretty plotting {ggplot2, ggvis etc.}
  • Basic
with(genome, plot(class, genome_size))
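  • The same kind of plot can be written with the formula interface of boxplot(). The sketch below uses the small height table from the Dataframes slides rather than the genome data, and draws to a null PDF device only so that it runs without a screen; in RStudio the pdf(NULL) and dev.off() lines can be dropped.

```r
dat <- data.frame(
  sex = c("M", "M", "F", "M", "M"),
  hgt = c(75.22, 76.11, 75.84, 75.31, 75.22)
)

pdf(NULL)   # null device: nothing is written to disk
bp <- boxplot(hgt ~ sex, data = dat,
              xlab = "Sex", ylab = "Height",
              main = "Height by sex")
dev.off()

# boxplot() also returns the summary it plotted
bp$names    # "F" "M"
bp$n        # group sizes: 1 4
```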

Using what we know

  • Less basic
library(ggplot2)
ggplot(data = genome, aes(x = class, y = genome_size, fill = class)) +
  geom_boxplot()

Functions in R

Functions

  • As R is a programming language, it is also possible to write functions (programs/modules/routines)
fun_name <- function(arguments){
  work
  done
  return(results)
}
  • We will get the chance to write our own functions later
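  • As a concrete instance of the template above, here is a sketch of a function that calculates the standard error of the mean. The name std_error and its na.rm argument are my own choices for illustration.

```r
# Standard error of the mean: sd / sqrt(n)
std_error <- function(x, na.rm = FALSE) {
  # Optionally drop missing values before calculating
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  result <- sd(x) / sqrt(length(x))
  return(result)
}

std_error(c(1, 2, 3))
std_error(c(1, 2, 3, NA), na.rm = TRUE)   # same answer once NA is dropped
```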

Meet the apply family

The apply family

  • Useful for applying an expression many times
  • These are some of the most useful functions in R
    • apply
    • lapply
    • sapply
    • mapply
    • by
    • rapply
    • eapply
  • Differ in both input and output structure
  • Often remove the need for explicit for loops
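  • To see what replacing a loop looks like, the sketch below squares each value of a small vector twice: once with a for loop and once with sapply. The variable names are arbitrary.

```r
x <- 1:5

# Loop version: create the result vector first, then fill it in
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# apply-family version: one line, no bookkeeping
squares_apply <- sapply(x, function(v) v^2)

squares_loop    # 1 4 9 16 25
squares_apply   # 1 4 9 16 25
```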

The apply family

  • apply is used to apply a function to a margin of an object with more than one dimension (1 = rows, 2 = columns)
# create a matrix
x <- matrix(rnorm(100), ncol = 5, nrow = 20)
head(x)
             [,1]       [,2]       [,3]       [,4]       [,5]
[1,]  0.946585640 -1.1735769  0.1288554  1.5929138  1.0177542
[2,]  0.004398704 -0.1556425 -0.1458756  0.0450106 -0.2511646
[3,] -0.352322306 -1.9189098 -0.1639110 -0.7151284 -1.4299934
[4,] -0.529695509 -0.1952588  1.7635520  0.8652231  1.7091210
[5,]  0.739589226 -2.5923277  0.7625865  1.0744410  1.4350696
[6,] -1.063457415  1.3140022  1.1114311  1.8956548 -0.7103711
# calculate the standard deviation of each row
apply(X = x, MARGIN = 1, FUN = sd)
 [1] 1.0724441 0.1224504 0.7405635 1.0592279 1.6324541 1.3127500 0.4743751
 [8] 0.7211607 1.2977208 1.1263022 1.3564019 0.5357712 1.3171231 1.2722752
[15] 1.4457967 0.1638006 1.1846336 0.3935585 1.0542124 0.6171212

The apply family

  • If we wanted to calculate the standard deviation for each column, we simply need to 'apply' the function to the second margin (col) of the matrix:
apply(X = x, MARGIN = 2, FUN = sd)
[1] 1.0310062 0.9299603 1.0337509 0.8722552 1.1473742
  • The apply function can also be used on dataframes, and multi-dimensional arrays
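  • FUN does not have to be a built-in function; any function that accepts a vector will do. The sketch below applies a hand-written range calculation to each column, then sums the rows of a small made-up dataframe. set.seed() is used only to make the random numbers repeatable.

```r
set.seed(1)
x <- matrix(rnorm(100), nrow = 20, ncol = 5)

# Custom function applied to each column (MARGIN = 2)
col_ranges <- apply(X = x, MARGIN = 2, FUN = function(col) max(col) - min(col))
length(col_ranges)   # 5, one value per column

# For common summaries there are dedicated shortcuts
all.equal(apply(x, 2, mean), colMeans(x))   # TRUE

# apply on a dataframe (it is converted to a matrix first)
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
row_sums <- apply(X = df, MARGIN = 1, FUN = sum)
row_sums             # 3.5 5.5 7.5
```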

The apply family

  • sapply and lapply are very similar, the major difference being sapply will usually return a vector, while lapply always returns a list.
    • sapply
x <- 1:5
x
[1] 1 2 3 4 5
# return the square root of each value in x using sapply
sapply(X = x, FUN = sqrt)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

The apply family

  • sapply and lapply are very similar, the major difference being sapply will usually return a vector, while lapply always returns a list.
    • lapply
# return the square root of each value in x using lapply
lapply(X = x, FUN = sqrt)
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
[1] 2

[[5]]
[1] 2.236068
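  • The word "usually" matters: sapply falls back to a list when the results cannot be simplified. The sketch below shows that case, along with vapply, a stricter relative that states the expected result type up front. vapply is not in the list above; it is included here as an optional extra.

```r
x <- 1:5

# Flattening the lapply result gives the same vector as sapply
roots_list <- lapply(X = x, FUN = sqrt)
roots_vec <- unlist(roots_list)
identical(roots_vec, sapply(X = x, FUN = sqrt))   # TRUE

# Results of different lengths cannot be simplified, so sapply returns a list
ragged <- sapply(X = 1:3, FUN = function(n) seq_len(n))
is.list(ragged)                                   # TRUE

# vapply fails loudly if a result is not a single number
roots_checked <- vapply(X = x, FUN = sqrt, FUN.VALUE = numeric(1))
roots_checked
```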

Really getting to know R: Practical 2