R Practical Session

Kevin Keenan (SSCB)
16/09/2013

Introduction

The following practical session will allow us to review some of the topics covered in the taught sessions.

  • R & RStudio installation
  • Getting familiar with our setup (RStudio, wd, etc.)
  • Warm up (some interactive tasks)
  • Data structures (with basic manipulation)
  • Reading/writing data
  • Advanced data manipulation
  • Basic statistical analyses
  • Plotting in R

Installing R & RStudio

Follow the instructions below:

  • To install R, see "R installation instructions"

  • To install RStudio, see "RStudio installation instructions"

  • Otherwise, await further instructions

R installation instructions

The latest version of R can be downloaded from the CRAN website using the following link:

Choose the distribution of R that is suitable for your operating system, and install as normal.

If you have not installed RStudio, continue to the RStudio installation instructions below; otherwise, await instructions.

RStudio installation instructions

The latest version of RStudio can be downloaded using the following link:

Choose the distribution of RStudio that is suitable for your operating system, and install as normal.

If you have not installed R, follow the R installation instructions above; otherwise, await instructions.

Getting familiar with our setup

In this section we will take a brief tour of some of the capabilities of RStudio and learn how R interacts with the file system.

RStudio

Basic layout of RStudio

[Figure: screenshot of the basic RStudio layout]

Setting the working directory (code)

The working directory (wd) is the folder in which R will read and write files by default. It is important to know where this is on your system.

  • To find out where your current wd is:
getwd()
  • Set your working directory to a convenient location on your system
setwd("my/preferred/directory")
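A short sketch of how these two functions can be combined in a script. It uses the session's temporary directory (from tempdir()) as a stand-in for "my/preferred/directory"; that choice is an assumption made only so that the example runs on any system.

```r
# remember where we started so that we can return later
old_wd <- getwd()

# tempdir() stands in for "my/preferred/directory" (assumption for this sketch)
setwd(tempdir())

# files are now written to, and looked up in, the new working directory
writeLines("hello", "wd-test.txt")
file.exists("wd-test.txt")

# tidy up and return to the original working directory
file.remove("wd-test.txt")
setwd(old_wd)
```

Storing the original location first means the script leaves the session as it found it.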

Setting the working directory (GUI)

One of the benefits of using RStudio is that it provides many tools which make certain tasks more efficient.

It is still important to know how to do most things through the command line.

Using RStudio to set the working directory

  1. Navigate to the Files tab
  2. Navigate to your folder of choice
  3. Click the More tab
  4. Choose the Set as Working Directory option

Notice that the equivalent setwd() command appears in the console.

Warm up

In this section, we will quickly get used to executing commands in R.

Simple arithmetic

Exercise

Using the R console as a calculator, complete the following tasks:

  1. \( 13 + 4 \)
  2. \( 82 \times 3 \)
  3. \( \sqrt{144} \)
  4. \( \frac{100}{20} \)
  5. \( \log_{e}{(100)} \)
  6. \( \log_{10}{(100)} \)
  7. \( {5}! \)
  8. \( \frac{1}{2.0 \times 10^{-3}} \)

Simple arithmetic

Answers

# Q1
13 + 4
[1] 17
# Q2
82 * 3
[1] 246

# Q3
sqrt(144)
[1] 12
# Q4
100 / 20
[1] 5

# Q5
log(100)
[1] 4.605
# Q6
log10(100)
[1] 2

# Q7
factorial(5)
[1] 120
# OR
prod(5, 4, 3, 2, 1)
[1] 120

# Q8
1 / 2e-3
[1] 500
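Note that log() in R is the natural logarithm by default. As a small aside, the base can also be given explicitly through the base argument, so Q5 and Q6 can be checked against each other:

```r
# natural logarithm: exp() is its inverse
ln_100 <- log(100)
exp(ln_100)

# the same logarithm with an explicit base
log(100, base = exp(1))

# base 10, equivalent to log10(100)
log(100, base = 10)
```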

Basic Scripting

Scripting is a good way to keep a record of your analyses. Code is generally written in a text editor (such as the RStudio source pane) and then sent to the R console.
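As a minimal sketch of this workflow, the lines below write a two-line script to a temporary file and then run it with source(). The file name my-analysis.R and the variable names are placeholders chosen for this example; in practice the script would be typed and saved in an editor.

```r
# hypothetical script file, created in a temporary folder
script_path <- file.path(tempdir(), "my-analysis.R")

# the contents of the script: two simple calculations
writeLines(c("total <- 13 + 4",
             "root <- sqrt(144)"),
           script_path)

# run every line of the script in the current session
source(script_path)

total
root
```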

Assignment operations

Exercise

  1. Using either the seq function, the ':' operator or the c function, create a variable x with the value:

    1 2 3 4 5 6 7 8 9 10
    
  2. Create a variable y with the value \( x^2 \)

    Hint:
    Remember that arithmetic in R is vectorised
    

Assignment operations

Answers

# Q1
# using 'seq'
x <- seq(from = 1, to = 10, by = 1)
# using ':'
x <- 1:10
# using 'c'
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# Q2
y <- x^2
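To make the hint about vectorisation concrete, the sketch below compares the vectorised answer with the same calculation written as an explicit loop. The name y_loop is introduced only for this comparison.

```r
x <- 1:10

# vectorised: the operation is applied to every element at once
y <- x^2

# the same calculation written as an explicit loop
y_loop <- numeric(length(x))
for (i in seq_along(x)) {
  y_loop[i] <- x[i]^2
}

# both approaches give the same values
all(y == y_loop)
```

The vectorised form is shorter and is the usual style in R.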

Basic indexing

Exercise

  1. Using the extraction operators '[]', what is the value of the \( 7^{th} \) element of y?

  2. Assign the \( 5^{th} \) element of y to a new variable, y5.

  3. Assign the \( 5^{th} \) element of x to the new variable, x5.

  4. Can you think of a way to test if x5 is equal to the square root of y5?

Hint:
?sqrt

Basic indexing

Answers

# Q1
y[7]
[1] 49
# Q2
y5 <- y[5]
# Q3
x5 <- x[5]

# Q4
x5 == sqrt(y5)
[1] TRUE
# OR
x5 == y5/x5
[1] TRUE
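Exact comparison with == works here because the values are whole numbers. With non-integer results, rounding error can make == return FALSE for values that are equal for all practical purposes. A hedged sketch of a more tolerant test uses all.equal(), which compares within a small tolerance:

```r
a <- sqrt(2)^2

# exact comparison fails because of floating point rounding
a == 2

# comparison within a tolerance; isTRUE() guarantees a single TRUE/FALSE
isTRUE(all.equal(a, 2))
```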

Some simple functions

Exercises

  1. What does the paste function do?

  2. Using the paste function, generate a variable, strng, containing three elements in the format below:

    A1 B2 C3
    

Some simple functions

Answers

# Q1
?paste
# Q2

# Method 1
strng <- paste(c("A", "B", "C"), 1:3, sep = "")

# Method 2
strng <- paste(LETTERS[1:3], 1:3, sep = "")

# Method 3
let <- c("A", "B", "C")
num <- 1:3
strng <- paste(let, num, sep = "")
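Two related options are worth knowing, shown here as a short sketch: paste0 is shorthand for paste with sep = "", and the collapse argument joins the resulting elements into a single string. The name one_string is a placeholder.

```r
# paste0() is shorthand for paste(..., sep = "")
strng <- paste0(LETTERS[1:3], 1:3)
strng

# collapse joins the elements into a single string
one_string <- paste(strng, collapse = " ")
one_string
```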

Data structures

In this section we will learn how to explore and manipulate some of the most common data structures in R.

Vector

Exercise

  1. Create a random normal variable, rand, with 100 elements, a mean of 5 and a standard deviation of 1.

    Hint:
    ?rnorm
    
  2. What is the class/mode of the rand object?

  3. Is rand a list or a vector?

  4. Check to see that rand actually does have \( 100 \) elements, a mean of \( 5 \) and a standard deviation of \( 1 \).

Vector

Answers

# Q1
rand <- rnorm(n = 100, mean = 5, sd = 1)

  [1] 5.936 5.470 3.572 5.236 6.445 6.327 4.930 5.600 5.055 7.131 4.463
 [12] 2.662 5.034 5.678 5.620 5.710 5.495 4.967 6.050 6.539 4.882 4.270
 [23] 6.446 3.974 7.087 4.758 5.163 6.192 6.429 3.094 5.253 6.050 5.486
 [34] 3.836 4.730 5.792 5.340 5.373 5.197 4.938 5.193 5.316 4.874 4.769
 [45] 3.950 6.304 4.031 4.843 3.959 5.191 4.985 6.260 2.908 5.555 4.732
 [56] 6.384 5.028 5.165 3.947 3.627 5.932 4.768 6.628 4.249 6.238 5.334
 [67] 6.066 5.682 4.269 4.915 3.137 6.097 5.256 6.028 6.422 5.787 6.444
 [78] 6.140 4.670 5.260 3.682 6.900 4.065 4.836 6.929 5.642 5.438 4.917
 [89] 5.597 5.005 6.187 6.061 5.693 6.122 7.035 5.051 5.173 4.497 5.698
[100] 4.182

# Q2
class(rand)
[1] "numeric"
# OR
mode(rand)
[1] "numeric"

# Q3
is.vector(rand)
[1] TRUE
# OR
is.list(rand)
[1] FALSE
Hint:
Try placing the '!' character in front of these functions

# Q4
length(rand)
[1] 100
mean(rand)
[1] 5.273
sd(rand)
[1] 0.9609
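The mean and standard deviation above are sample estimates, so they are close to, but not exactly, 5 and 1, and they change every time rnorm is run. If repeatable values are wanted, the random number generator can be seeded first. The seed value 123 below is arbitrary.

```r
# seed the generator, then draw the sample
set.seed(123)
rand_a <- rnorm(n = 100, mean = 5, sd = 1)

# the same seed gives exactly the same sample
set.seed(123)
rand_b <- rnorm(n = 100, mean = 5, sd = 1)

identical(rand_a, rand_b)
```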

Matrix

Remember that a matrix is simply a vector with dimensions.

Exercise

  1. Convert the vector, rand, to a matrix, rand_mat, with \( 10 \) columns.

  2. Can you think of a way to test how R 'fills' the matrix using the data from rand?

Matrix

Answers

# Q1
rand_mat <- matrix(rand, ncol = 10)

       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,] 5.936 4.463 4.882 5.253 5.193 4.985 5.932 3.137 3.682 6.187
 [2,] 5.470 2.662 4.270 6.050 5.316 6.260 4.768 6.097 6.900 6.061
 [3,] 3.572 5.034 6.446 5.486 4.874 2.908 6.628 5.256 4.065 5.693
 [4,] 5.236 5.678 3.974 3.836 4.769 5.555 4.249 6.028 4.836 6.122
 [5,] 6.445 5.620 7.087 4.730 3.950 4.732 6.238 6.422 6.929 7.035
 [6,] 6.327 5.710 4.758 5.792 6.304 6.384 5.334 5.787 5.642 5.051
 [7,] 4.930 5.495 5.163 5.340 4.031 5.028 6.066 6.444 5.438 5.173
 [8,] 5.600 4.967 6.192 5.373 4.843 5.165 5.682 6.140 4.917 4.497
 [9,] 5.055 6.050 6.429 5.197 3.959 3.947 4.269 4.670 5.597 5.698
[10,] 7.131 6.539 3.094 4.938 5.191 3.627 4.915 5.260 5.005 4.182

# Q2
rand_mat[ , 1] == rand[1:10]
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
rand_mat[ , 2] == rand[11:20]
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
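The comparison shows that matrix fills column by column by default. A small sketch, using the numbers 1 to 6 instead of rand so that the pattern is easy to see, contrasts this with the byrow argument:

```r
# default: values run down each column in turn
m_col <- matrix(1:6, ncol = 3)
m_col[, 1]

# byrow = TRUE: values run along each row in turn
m_row <- matrix(1:6, ncol = 3, byrow = TRUE)
m_row[1, ]
```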

Matrix

Exercise

  1. Give the columns of rand_mat the following names:

    col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
    
  2. Using the web, find a function to easily calculate the mean of each column of rand_mat.

  3. Can you think of two ways to extract the \( 5^{th} \) column of rand_mat?

Matrix

Answers

# Q1
colnames(rand_mat) <- paste("col_", 1:10, 
                            sep = "")

      col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
 [1,] 5.936 4.463 4.882 5.253 5.193 4.985 5.932 3.137 3.682  6.187
 [2,] 5.470 2.662 4.270 6.050 5.316 6.260 4.768 6.097 6.900  6.061
 [3,] 3.572 5.034 6.446 5.486 4.874 2.908 6.628 5.256 4.065  5.693
 [4,] 5.236 5.678 3.974 3.836 4.769 5.555 4.249 6.028 4.836  6.122
 [5,] 6.445 5.620 7.087 4.730 3.950 4.732 6.238 6.422 6.929  7.035
 [6,] 6.327 5.710 4.758 5.792 6.304 6.384 5.334 5.787 5.642  5.051
 [7,] 4.930 5.495 5.163 5.340 4.031 5.028 6.066 6.444 5.438  5.173
 [8,] 5.600 4.967 6.192 5.373 4.843 5.165 5.682 6.140 4.917  4.497
 [9,] 5.055 6.050 6.429 5.197 3.959 3.947 4.269 4.670 5.597  5.698
[10,] 7.131 6.539 3.094 4.938 5.191 3.627 4.915 5.260 5.005  4.182

# Q2
colMeans(rand_mat)
 col_1  col_2  col_3  col_4  col_5  col_6  col_7  col_8  col_9 col_10 
 5.570  5.222  5.229  5.200  4.843  4.859  5.408  5.524  5.301  5.570 
# Q3
rand_mat[ ,5] == rand_mat[ , "col_5"]
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
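colMeans is a fast special case of the more general apply function, which applies any function over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix. The sketch below uses a small made-up matrix, m, rather than rand_mat.

```r
# small example matrix with named columns (values are arbitrary)
m <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
colnames(m) <- paste("col_", 1:2, sep = "")

# column means, two ways
colMeans(m)
apply(m, MARGIN = 2, FUN = mean)

# extracting one column returns a plain vector by default;
# drop = FALSE keeps the result as a one-column matrix
dim(m[, "col_2"])
dim(m[, "col_2", drop = FALSE])
```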

List

The most flexible data structure in the R language is the list. Unlike vectors and matrices, a list can hold elements of different modes (and different lengths).
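To illustrate the difference, the sketch below (with arbitrary values) combines a number, a string and a logical value, first with c and then with list.

```r
# c() builds a vector, so everything is coerced to a single mode
v <- c(1, "a", TRUE)
class(v)

# list() keeps each element as it was given
l <- list(1, "a", TRUE)
sapply(l, class)
```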

Exercise

  1. Create a list, my_details, containing the following three pieces of information:

    a. Your full_name as a character vector

    b. Your DOB as a numeric vector with three elements, relating to the day, month and year

    c. Your country of birth (COB) as a string

List

Answers

my_details <- list(full_name = c("Kevin", "Michael",
                                 "Keenan"),
                   DOB = c(16, 7, 1986),
                   COB = "Northern Ireland")
$full_name
[1] "Kevin"   "Michael" "Keenan" 

$DOB
[1]   16    7 1986

$COB
[1] "Northern Ireland"

List

Exercise

  1. Can you think of two ways to extract only your surname from my_details?

  2. Return your month of birth from my_details.

  3. How many letters are in your first name?

List

Answers

# Q1
my_details[[1]][3]
[1] "Keenan"
# OR
my_details$full_name[3]
[1] "Keenan"

List

Additional help

In the previous answer, it was necessary to know how many elements were present in my_details$full_name.

It is good to try to generalise your code to require as little manual information as possible. This can be done by incorporating information about the object into our index term.

my_details$full_name[length(my_details$full_name)]
[1] "Keenan"
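Another option, offered here as a sketch rather than as part of the original answer, is the tail function, which returns the last n elements of a vector. The list is recreated below so that the example runs on its own.

```r
my_details <- list(full_name = c("Kevin", "Michael", "Keenan"),
                   DOB = c(16, 7, 1986),
                   COB = "Northern Ireland")

# last element of full_name, whatever its length
surname <- tail(my_details$full_name, n = 1)
surname
```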

# Q2
my_details$DOB[2]
[1] 7

# Q3
nchar(my_details$full_name[1])

[1] 5

Data frame

The data frame is the data structure of choice for statistical analyses. It is similar in structure to a matrix, except that each column is allowed a different class/mode. In this respect, a data frame is better described as a list with two dimensions.
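The sketch below illustrates the 'list with two dimensions' idea on a small made-up data frame, df. The column names and values are placeholders and are not taken from class-stats.csv.

```r
# hypothetical data: one character column and one numeric column
df <- data.frame(id = c("a", "b", "c"),
                 height = c(170, 182.5, 165),
                 stringsAsFactors = FALSE)

# a data frame is a list of equal-length columns...
is.list(df)
sapply(df, class)

# ...so columns can be extracted like list elements
df$height
df[["height"]]

# ...and it has dimensions, so matrix-style indexing also works
dim(df)
df[2, "height"]
```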

In this section we will combine the tasks of exploring the data frame object type, and reading and writing data in R, as well as some statistical testing and plotting.

Reading a .csv file

Background

In the working directory, you will find a file named class-stats.csv. This file contains information about the attendees of a previous course that I taught. The participants were required to measure and record some basic information about themselves. The details of these data are not important yet, but we will do some simple analyses on them later.

For now, our task is to get the hang of reading and manipulating data within R.

Reading a .csv file

Exercise

  1. Using the web or the lecture notes, find a suitable function to read class-stats.csv, and assign it to the variable, class_data.

    Hint:
    Does the file have a header?
    
  2. When you have successfully read the file, find out how many people attended the previous course.

Reading a .csv file

Answers

# Q1
class_data <- read.csv("class-stats.csv", 
                       header = TRUE)
# Q2
nrow(class_data)
[1] 37

Exploring a data frame

Exercise

  1. Using a single function, how can you find out the class of each column in class_data?

  2. Find the levels of the variable Gender in class_data.

Hint:
When trying to index variables in a data frame, remember it is just a list with two dimensions.

Exploring a data frame

Answers

# Q1
str(class_data)

'data.frame':   37 obs. of  5 variables:
 $ Gender  : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 2 1 ...
 $ Colour  : Factor w/ 6 levels "Black","Blonde",..: 1 1 3 5 3 2 2 3 1 3 ...
 $ Height  : num  185 193 173 157 170 ...
 $ LengthRH: Factor w/ 15 levels "?","1.6","15",..: 11 14 11 5 4 4 4 6 15 9 ...
 $ LengthLF: Factor w/ 14 levels "?","22","22.5",..: 13 13 13 7 11 7 8 7 13 7 ...

# Q2
levels(class_data$Gender)
[1] "F" "M"

Missing data

Background

There are many ways to 'probe' a data frame for information. However, it is important to ensure that all data are in a format that R understands. You will see from the output of str that LengthRH and LengthLF have been coded as factors, despite obviously being continuous variables.

Remember, missing data in R are generally coded as NA. In class_data, missing data have been coded as '?', causing R to interpret these columns as categorical data.

   Gender Colour Height LengthRH LengthLF
35      M  Brown  185.4        ?        ?

Missing data

Exercise

  1. Replace all ? values in class_data with NA.

  2. Convert LengthRH and LengthLF back to numeric vectors.

    Hint:
    You will need to assign the converted values back to class_data. The safest way is to create two new columns
    
  3. There is a more straightforward way to deal with missing data at the 'read' stage. Can you find it?

Missing data

Answers

# Q1
class_data[class_data == "?"] <- NA
   Gender Colour Height LengthRH LengthLF
35      M  Brown  185.4     <NA>     <NA>
# Q2
# convert via character: as.numeric() on a factor returns the level codes
class_data$LengthRH_new <- as.numeric(as.character(class_data$LengthRH))
class_data$LengthLF_new <- as.numeric(as.character(class_data$LengthLF))

  'data.frame': 37 obs. of  7 variables:
   $ Gender      : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 2 1 ...
   $ Colour      : Factor w/ 6 levels "Black","Blonde",..: 1 1 3 5 3 2 2 3 1 3 ...
   $ Height      : num  185 193 173 157 170 ...
   $ LengthRH    : Factor w/ 15 levels "?","1.6","15",..: 11 14 11 5 4 4 4 6 15 9 ...
   $ LengthLF    : Factor w/ 14 levels "?","22","22.5",..: 13 13 13 7 11 7 8 7 13 7 ...
   $ LengthRH_new: num  18 21 18 16.2 16 16 16 16.5 24 17.5 ...
   $ LengthLF_new: num  30 30 30 25 28 25 26 25 30 25 ...
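Converting a factor to numbers needs care. Applied directly to a factor, as.numeric returns the internal integer codes of the levels rather than the values they represent, so the factor should be converted to character first. The toy factor f below is made up for illustration.

```r
# a factor holding numbers as labels, with "?" marking a missing value
f <- factor(c("18", "21", "?", "16.2"))

# as.numeric() on the factor gives the level codes, not the measurements
codes <- as.numeric(f)
codes

# recode "?" as NA, convert to character, then to numeric
f[f == "?"] <- NA
vals <- as.numeric(as.character(f))
vals
```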

# Q3

class_data <- read.csv("class-stats.csv",
                       header = TRUE,
                       na.strings = "?")

'data.frame':   37 obs. of  5 variables:
 $ Gender  : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 2 1 ...
 $ Colour  : Factor w/ 6 levels "Black","Blonde",..: 1 1 3 5 3 2 2 3 1 3 ...
 $ Height  : num  185 193 173 157 170 ...
 $ LengthRH: num  18 21 18 16.2 16 16 16 16.5 24 17.5 ...
 $ LengthLF: num  30 30 30 25 28 25 26 25 30 25 ...
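The effect of na.strings can be tried on a small file without class-stats.csv to hand. The sketch below writes three made-up rows to a temporary file and reads them back; the values are invented for the example.

```r
# write a tiny .csv file that uses "?" for missing values
tmp <- tempfile(fileext = ".csv")
writeLines(c("Colour,LengthRH",
             "Black,18",
             "Brown,?",
             "Fair,16.2"),
           tmp)

# without na.strings, "?" prevents the column from being read as numeric
raw <- read.csv(tmp, header = TRUE)
is.numeric(raw$LengthRH)

# with na.strings, "?" becomes NA and the column is numeric
clean <- read.csv(tmp, header = TRUE, na.strings = "?")
is.numeric(clean$LengthRH)
clean$LengthRH
```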

Writing data

Exercise

After reading our data correctly, we now have an object named class_data with our data in the right format.

In some instances it might be useful to write this 'corrected' data for future use.

  1. Write this object to a new .csv file named 'class-stat-corr.csv'.
Hint:
Make sure you export the data as well as the column names, but not the row names, which R sets to row numbers by default.

Writing data

Answer

# Q1
write.csv(class_data, 
          file = "class-stat-corr.csv",
          row.names = FALSE)
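A quick way to check an export is to read the file straight back in. The sketch below does this with a small made-up data frame and a temporary file, so it does not touch class-stat-corr.csv.

```r
# hypothetical data with one missing value
df <- data.frame(Height = c(185.4, 157.5),
                 LengthLF = c(NA, 25))

out_file <- tempfile(fileext = ".csv")
write.csv(df, file = out_file, row.names = FALSE)

# read the file back: same column names, no extra row-name column
check <- read.csv(out_file, header = TRUE)
names(check)
dim(check)
```

Missing values are written as NA by default, so they are read back as missing without any extra arguments.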

Exploring your data

Exercise

  1. After reading the data into R using the correct missing data code, calculate the mean height for males and females.

    Hint:
    ?tapply
    
  2. Find the maximum right hand length for each hair colour.

Exploring your data

Answers

# Q1
tapply(class_data$Height,
       INDEX = class_data$Gender,
       FUN = "mean")
    F     M 
165.7 180.6 

# Q2
tapply(class_data$LengthRH,
       INDEX = class_data$Colour,
       FUN = "max")

 Black Blonde  Brown   Fair Ginger   Grey 
  24.0   19.0     NA   20.0   17.5   20.0 

Exploring your data

Exercise

  1. Why do you think the maximum for Brown is NA?

  2. Can you find a solution?

Exploring your data

Answers

# Q1
# max() returns NA if any value in a group is missing, unless missing
# values are explicitly removed with na.rm = TRUE
# Q2
tapply(class_data$LengthRH,
       INDEX = class_data$Colour,
       FUN = "max",
       na.rm = TRUE)

 Black Blonde  Brown   Fair Ginger   Grey 
  24.0   19.0   20.0   20.0   17.5   20.0 
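The same behaviour can be seen on a plain vector. In the sketch below the values are made up; the point is that tapply passes any extra arguments, such as na.rm, on to FUN.

```r
len <- c(17.5, NA, 20, 24)
grp <- factor(c("Brown", "Brown", "Brown", "Black"))

# one missing value is enough to make the result NA
max(len)

# removing missing values first gives the largest observed value
max(len, na.rm = TRUE)

# tapply hands na.rm = TRUE on to max for every group
group_max <- tapply(len, INDEX = grp, FUN = max, na.rm = TRUE)
group_max
```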

Basic statistics in R

Exercise

  1. Using our sample data, class_data and the function t.test, can you test the following hypothesis?

    Male humans are taller than female humans.
    
    Hint:
    Remember to make sure that our data are normally distributed with equal variance
    

Basic statistics in R

Answer

# Test for normality
# males
shapiro.test(class_data$Height[class_data$Gender == "M"])

    Shapiro-Wilk normality test

data:  class_data$Height[class_data$Gender == "M"]
W = 0.8728, p-value = 0.07093
# females
shapiro.test(class_data$Height[class_data$Gender == "F"])

    Shapiro-Wilk normality test

data:  class_data$Height[class_data$Gender == "F"]
W = 0.9416, p-value = 0.1612

# OR
tapply(class_data$Height, class_data$Gender, shapiro.test)
$F

    Shapiro-Wilk normality test

data:  X[[1L]]
W = 0.9416, p-value = 0.1612


$M

    Shapiro-Wilk normality test

data:  X[[2L]]
W = 0.8728, p-value = 0.07093

# variance test
var.test(class_data$Height[class_data$Gender == "M"],
         class_data$Height[class_data$Gender == "F"])

    F test to compare two variances

data:  class_data$Height[class_data$Gender == "M"] and class_data$Height[class_data$Gender == "F"]
F = 1.735, num df = 11, denom df = 24, p-value = 0.2506
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6708 5.5046
sample estimates:
ratio of variances 
             1.735 

# t-test
t.test(class_data$Height[class_data$Gender == "M"],
       class_data$Height[class_data$Gender == "F"],
       alternative = "greater",
       var.equal = TRUE)

    Two Sample t-test

data:  class_data$Height[class_data$Gender == "M"] and class_data$Height[class_data$Gender == "F"]
t = 4.808, df = 35, p-value = 1.429e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 9.66  Inf
sample estimates:
mean of x mean of y 
    180.6     165.7 
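t.test also accepts a formula, which avoids the repeated subsetting. The sketch below uses simulated heights (not the class data) for 25 females and 12 males; the means and standard deviation are arbitrary. Note that with a formula the groups are compared in the order of the factor levels, here F then M, so the one-sided alternative becomes "less".

```r
set.seed(1)
sim <- data.frame(Gender = factor(rep(c("F", "M"), times = c(25, 12))),
                  Height = c(rnorm(25, mean = 166, sd = 8),
                             rnorm(12, mean = 180, sd = 8)))

# formula interface: Height explained by Gender (F is the first level)
res_formula <- t.test(Height ~ Gender, data = sim,
                      alternative = "less",
                      var.equal = TRUE)

# the equivalent two-vector call, in the style used above
res_vectors <- t.test(sim$Height[sim$Gender == "M"],
                      sim$Height[sim$Gender == "F"],
                      alternative = "greater",
                      var.equal = TRUE)

# both calls test the same hypothesis and give the same p-value
all.equal(res_formula$p.value, res_vectors$p.value)
```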

Basic plotting in R

Exercise

  1. Using the function plot, visualise the relationship between height and foot length.

  2. What do you notice from the plot? Can you figure out how to fix the problem?

Basic plotting in R

Answers

# Q1
plot(class_data$Height, class_data$LengthLF)

[Figure: scatter plot of Height against LengthLF]

# Q2
# There is an outlier in LengthLF

# find which entry is incorrect
outlier <- which(class_data$LengthLF == 
                 max(class_data$LengthLF, na.rm = TRUE))

# print outlier
outlier
[1] 11
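A shorter alternative, given here as a sketch, is which.max, which returns the position of the (first) largest value and ignores missing values. The vector below is made up.

```r
foot <- c(25, 30, NA, 250, 28)

# position of the largest non-missing value
which.max(foot)
```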

We now know that the error is in the \( 11^{th} \) row of class_data. We can remove the outlying value by replacing it with NA.

class_data$LengthLF[11] <- NA

# check
class_data[11, ]
   Gender Colour Height LengthRH LengthLF
11      F  Black  157.5     17.2       NA

Basic plotting in R

Exercise

Plot the relationship again.

  1. Can you find a way to plot male and female data as different coloured, solid points?

  2. Can you change the axis labels and add a title to your plot?

Basic plotting in R

Answers

plot(class_data$Height, class_data$LengthLF)

[Figure: scatter plot of Height against LengthLF with the outlier removed]

# Q1
# create an empty plot
plot(class_data$Height, class_data$LengthLF,
     type = "n")

# add male data in blue
points(class_data$Height[class_data$Gender == "M"],
       class_data$LengthLF[class_data$Gender == "M"],
       col = "blue", pch = 16)

# add female data in red
points(class_data$Height[class_data$Gender == "F"],
       class_data$LengthLF[class_data$Gender == "F"],
       col = "red", pch = 16)

[Figure: Height against LengthLF, with males in blue and females in red]
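A coloured plot of this kind can also be drawn in one call by building a vector of colours from the Gender factor. The sketch below uses a few made-up points rather than class_data, and adds a legend; the object names are placeholders.

```r
# hypothetical data
gender <- factor(c("F", "M", "M", "F"))
height <- c(160, 182, 175, 168)
foot   <- c(24, 29, 27, 25)

# levels are in the order F, M, so F maps to red and M to blue
point_cols <- c("red", "blue")[as.integer(gender)]

plot(height, foot, col = point_cols, pch = 16)
legend("topleft", legend = levels(gender),
       col = c("red", "blue"), pch = 16)
```

This relies on the order of the factor levels, so it is worth checking levels(gender) before choosing the colours.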

# Q2
plot(class_data$Height, class_data$LengthLF,
     xlab = "Height (cm)",
     ylab = "Left foot length (cm)",
     main = "Feet size vs Height")

[Figure: Height against LengthLF with axis labels and a title]

Reproducibility

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.4.1

loaded via a namespace (and not attached):
[1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    stringr_0.6.2 
[5] tools_3.0.1
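A listing like the one above is typically produced with sessionInfo(). Including its output with an analysis records which versions of R and its packages were used.

```r
# record the R version, platform and attached packages
info <- sessionInfo()
info

# the R version alone
R.version.string
```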