Kevin Keenan (SSCB)
16/09/2013
The following practical session will allow us to review some of the topics covered in the taught sessions.
R & RStudio (installation, RStudio, wd, etc.)
In this section we will take a brief tour of some of the capabilities of RStudio, as well as learning how R interacts with the file system.
Basic layout of RStudio
The working directory (wd) is the folder in which R will read and write files by default. It is important to know where this is on your system.
getwd()
setwd("my/preferred/directory")
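A minimal sketch of a pattern that is often useful in scripts: store the current working directory, move to another folder, then move back. The use of tempdir() is only an assumption made so that the example runs on any system; substitute your own folder.

```r
# Store the current working directory so it can be restored later
old_wd <- getwd()

# Switch to a folder that is guaranteed to exist (R's session temp folder)
setwd(tempdir())

# List the files R can see in the new working directory
list.files()

# Restore the original working directory
setwd(old_wd)
```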
One of the benefits of using RStudio is that it provides many tools which make certain tasks more efficient.
It is still important to know how to do most things through the command line.
Using RStudio to set the working directory
Notice that the actual code appears in the console
In this section, we will quickly get used to executing commands in R
Exercise
Using the R console as a calculator, complete the following tasks
Answers
# Q1
13 + 4
[1] 17
# Q2
82 * 3
[1] 246
Answers
# Q3
sqrt(144)
[1] 12
# Q4
100 / 20
[1] 5
Answers
# Q5
log(100)
[1] 4.605
# Q6
log10(100)
[1] 2
Answers
# Q7
factorial(5)
[1] 120
# OR
prod(5, 4, 3, 2, 1)
[1] 120
Answers
# Q8
1 / 2e-3
[1] 500
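The difference between Q5 and Q6 comes from the base of the logarithm: log uses base e by default, while log10 uses base 10. A short sketch of the related base R calls:

```r
log(100)             # natural logarithm (base e)
log10(100)           # base 10
log(100, base = 10)  # same result as log10(100)
log2(8)              # base 2
exp(log(100))        # exp() reverses the natural logarithm
```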
Scripting is a good way to keep a record of your analyses. Code is generally written in a text editor before it is passed to the R terminal.
Exercise
Using either the seq function, the ':' operator or the c function, create a variable x with the value:
1 2 3 4 5 6 7 8 9 10
Create a variable y with the value \( x^2 \)
Hint:
Remember that arithmetic in R is vectorised
Answers
# Q1
# using 'seq'
x <- seq(from = 1, to = 10, by = 1)
# using ':'
x <- 1:10
# using 'c'
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# Q2
y <- x^2
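To illustrate what vectorisation means here, the sketch below compares the vectorised answer with an explicit loop. The names y_vec and y_loop are hypothetical and used only for this comparison.

```r
x <- 1:10

# Vectorised: the operation is applied to every element at once
y_vec <- x^2

# Equivalent explicit loop, shown only for comparison
y_loop <- numeric(length(x))
for (i in seq_along(x)) {
  y_loop[i] <- x[i]^2
}

# Both approaches give the same values
all.equal(y_vec, y_loop)
```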
Exercise
Using the extraction operator '[]', what is the value of the \( 7^{th} \) element of y?
Assign the \( 5^{th} \) element of y to a new variable, y5.
Assign the \( 5^{th} \) element of x to the new variable, x5.
Can you think of a way to test if x5 is equal to the square root of y5?
Hint:
?sqrt
Answers
# Q1
y[7]
[1] 49
# Q2
y5 <- y[5]
# Q3
x5 <- x[5]
Answers
# Q4
x5 == sqrt(y5)
[1] TRUE
# OR
x5 == y5/x5
[1] TRUE
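The == test works here because 5 and 25 are represented exactly. With non-integer values, floating point rounding can make == return FALSE even when two numbers are mathematically equal. A sketch of a more tolerant comparison using all.equal (the variable a is hypothetical):

```r
a <- sqrt(2)^2

# Exact comparison fails because of floating point rounding
a == 2

# all.equal() compares within a small tolerance
isTRUE(all.equal(a, 2))
```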
Exercises
What does the paste function do?
Using the paste function, generate a variable, strng, containing three elements in the format below:
A1 B2 C3
Answers
# Q1
?paste
# Q2
# Method 1
strng <- paste(c("A", "B", "C"), 1:3, sep = "")
# Method 2
strng <- paste(LETTERS[1:3], 1:3, sep = "")
# Method 3
let <- c("A", "B", "C")
num <- 1:3
strng <- paste(let, num, sep = "")
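Two related features of paste are worth knowing. The sketch below uses paste0, which is shorthand for paste with sep = "", and the collapse argument, which joins all elements into one string.

```r
# paste0() is shorthand for paste(..., sep = "")
strng <- paste0(LETTERS[1:3], 1:3)
strng

# 'collapse' joins the elements into a single string
paste(strng, collapse = " ")
```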
In this section we will learn how to explore and manipulate some of the most common data structures in R.
Exercise
Create a random normal variable, rand, with 100 elements, a mean of 5 and a standard deviation of 1.
Hint:
?rnorm
What is the class/mode of the rand object?
Is rand a list or a vector?
Check to see that rand actually does have \( 100 \) elements, a mean of \( 5 \) and a standard deviation of \( 1 \).
Answers
# Q1
rand <- rnorm(n = 100, mean = 5, sd = 1)
[1] 5.936 5.470 3.572 5.236 6.445 6.327 4.930 5.600 5.055 7.131 4.463
[12] 2.662 5.034 5.678 5.620 5.710 5.495 4.967 6.050 6.539 4.882 4.270
[23] 6.446 3.974 7.087 4.758 5.163 6.192 6.429 3.094 5.253 6.050 5.486
[34] 3.836 4.730 5.792 5.340 5.373 5.197 4.938 5.193 5.316 4.874 4.769
[45] 3.950 6.304 4.031 4.843 3.959 5.191 4.985 6.260 2.908 5.555 4.732
[56] 6.384 5.028 5.165 3.947 3.627 5.932 4.768 6.628 4.249 6.238 5.334
[67] 6.066 5.682 4.269 4.915 3.137 6.097 5.256 6.028 6.422 5.787 6.444
[78] 6.140 4.670 5.260 3.682 6.900 4.065 4.836 6.929 5.642 5.438 4.917
[89] 5.597 5.005 6.187 6.061 5.693 6.122 7.035 5.051 5.173 4.497 5.698
[100] 4.182
Answers
# Q2
class(rand)
[1] "numeric"
# OR
mode(rand)
[1] "numeric"
Answers
# Q3
is.vector(rand)
[1] TRUE
# OR
is.list(rand)
[1] FALSE
Hint:
Try placing the '!' character in front of these functions
Answers
# Q4
length(rand)
[1] 100
mean(rand)
[1] 5.273
sd(rand)
[1] 0.9609
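The sample mean and standard deviation are close to, but not exactly, 5 and 1 because the values are random draws, and they will differ each time the code is run. If reproducible values are needed, one option is to set the random seed first. The seed value 123 and the names rand_a and rand_b are arbitrary choices for illustration.

```r
set.seed(123)   # arbitrary seed, chosen only for illustration
rand_a <- rnorm(n = 100, mean = 5, sd = 1)

set.seed(123)   # resetting the seed reproduces the same draws
rand_b <- rnorm(n = 100, mean = 5, sd = 1)

# The two vectors are identical
identical(rand_a, rand_b)
```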
Remember that a matrix is simply a vector with dimensions.
Exercise
Convert the vector, rand, to a matrix, rand_mat, with \( 10 \) columns.
Can you think of a way to test how R 'fills' the matrix using the data from rand?
Answers
# Q1
rand_mat <- matrix(rand, ncol = 10)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 5.936 4.463 4.882 5.253 5.193 4.985 5.932 3.137 3.682 6.187
[2,] 5.470 2.662 4.270 6.050 5.316 6.260 4.768 6.097 6.900 6.061
[3,] 3.572 5.034 6.446 5.486 4.874 2.908 6.628 5.256 4.065 5.693
[4,] 5.236 5.678 3.974 3.836 4.769 5.555 4.249 6.028 4.836 6.122
[5,] 6.445 5.620 7.087 4.730 3.950 4.732 6.238 6.422 6.929 7.035
[6,] 6.327 5.710 4.758 5.792 6.304 6.384 5.334 5.787 5.642 5.051
[7,] 4.930 5.495 5.163 5.340 4.031 5.028 6.066 6.444 5.438 5.173
[8,] 5.600 4.967 6.192 5.373 4.843 5.165 5.682 6.140 4.917 4.497
[9,] 5.055 6.050 6.429 5.197 3.959 3.947 4.269 4.670 5.597 5.698
[10,] 7.131 6.539 3.094 4.938 5.191 3.627 4.915 5.260 5.005 4.182
Answers
# Q2
rand_mat[ , 1] == rand[1:10]
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
rand_mat[ , 2] == rand[11:20]
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
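The comparison above shows that the matrix is filled column by column. The same behaviour is easier to see on a small example, sketched here with the hypothetical objects m_col and m_row; the byrow argument switches to row-wise filling.

```r
# By default, matrix() fills column by column
m_col <- matrix(1:6, ncol = 3)
m_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE fills row by row instead
m_row <- matrix(1:6, ncol = 3, byrow = TRUE)
m_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```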
Exercise
Give the columns of rand_mat the following names:
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
Using the web, find a function to easily calculate the mean of each column of rand_mat.
Can you think of two ways to extract the \( 5^{th} \) column of rand_mat?
Answers
# Q1
colnames(rand_mat) <- paste("col_", 1:10,
sep = "")
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
[1,] 5.936 4.463 4.882 5.253 5.193 4.985 5.932 3.137 3.682 6.187
[2,] 5.470 2.662 4.270 6.050 5.316 6.260 4.768 6.097 6.900 6.061
[3,] 3.572 5.034 6.446 5.486 4.874 2.908 6.628 5.256 4.065 5.693
[4,] 5.236 5.678 3.974 3.836 4.769 5.555 4.249 6.028 4.836 6.122
[5,] 6.445 5.620 7.087 4.730 3.950 4.732 6.238 6.422 6.929 7.035
[6,] 6.327 5.710 4.758 5.792 6.304 6.384 5.334 5.787 5.642 5.051
[7,] 4.930 5.495 5.163 5.340 4.031 5.028 6.066 6.444 5.438 5.173
[8,] 5.600 4.967 6.192 5.373 4.843 5.165 5.682 6.140 4.917 4.497
[9,] 5.055 6.050 6.429 5.197 3.959 3.947 4.269 4.670 5.597 5.698
[10,] 7.131 6.539 3.094 4.938 5.191 3.627 4.915 5.260 5.005 4.182
Answers
# Q2
colMeans(rand_mat)
col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10
5.570 5.222 5.229 5.200 4.843 4.859 5.408 5.524 5.301 5.570
# Q3
rand_mat[ ,5] == rand_mat[ , "col_5"]
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
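colMeans is one of several ways to summarise a matrix by column. As a sketch on a small hypothetical matrix m, the more general apply function gives the same result and accepts any summary function.

```r
m <- matrix(1:6, ncol = 3)

# colMeans() is the dedicated function for column means
colMeans(m)

# apply() with MARGIN = 2 applies any function to each column
apply(m, 2, mean)

# rowMeans() does the same for rows
rowMeans(m)
```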
The most flexible data structure in the R language is the list. Unlike matrices and vectors, lists can contain elements of different modes.
Exercise
Create a list, my_details
, containing the following three pieces of information;
a. Your full_name
as a character vector
b. Your DOB
as a numeric vector with three
elements, relating to the day, month and year
c. Your country of birth (COB
) as a string
Answers
my_details <- list(full_name = c("Kevin", "Michael",
"Keenan"),
DOB = c(16, 7, 1986),
COB = "Northern Ireland")
$full_name
[1] "Kevin" "Michael" "Keenan"
$DOB
[1] 16 7 1986
$COB
[1] "Northern Ireland"
Exercise
Can you think of two ways to extract only your surname from my_details?
Return your month of birth from my_details.
How many letters are in your first name?
Answers
# Q1
my_details[[1]][3]
[1] "Keenan"
# OR
my_details$full_name[3]
[1] "Keenan"
Additional help
In the previous answer, it was necessary to know how many elements were present in my_details$full_name.
It is good to try to generalise your code to require as little manual information as possible. This can be done by incorporating information about the object into our index term.
my_details$full_name[length(my_details$full_name)]
[1] "Keenan"
Answers
# Q2
my_details$DOB[2]
[1] 7
# Q3
# the first name is the first element of full_name
nchar(my_details$full_name[1])
[1] 5
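The answers above rely on the difference between the list extraction operators. A minimal sketch with a hypothetical list, lst, shows what each operator returns.

```r
lst <- list(name = c("A", "B"), year = 2013)

# Single brackets return a list containing the selected element
class(lst["name"])

# Double brackets (or $) return the element itself
class(lst[["name"]])

# Once the element is extracted, it can be indexed like any vector
lst$name[2]
```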
The data frame is the data structure of choice for statistical analyses. It is similar in structure to a matrix, except that each column is allowed a different class/mode. In this respect, a data frame is better described as a list with two dimensions.
In this section we will combine the tasks of exploring the data frame object type, and reading and writing data in R, as well as some statistical testing and plotting.
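The description of a data frame as a list with two dimensions can be checked directly. The sketch below uses a small made-up data frame, df; its column names and values are assumptions for illustration only.

```r
# Toy data; values are made up for illustration
df <- data.frame(id = 1:3,
                 group = c("a", "b", "a"),
                 score = c(2.5, 3.1, 4.0))

is.list(df)        # a data frame is a list of columns
dim(df)            # ...with two dimensions (rows, columns)
sapply(df, class)  # each column can have its own class

# List-style and matrix-style indexing return the same column
identical(df$score, df[, "score"])
```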
Background
In the working directory, you will notice a file named class-stats.csv. This file contains information about the attendees of a previous course I taught on. The participants were required to measure and record some basic information about themselves. The details of these data are not important yet, but we will do some simple analyses on them later.
For now, our task is to get the hang of reading and manipulating data within R.
Exercise
Using the web or the lecture notes, find a suitable function to read class-stats.csv, and assign it to the variable, class_data.
Hint:
Does the file have a header?
When you have successfully read the file, find out how many people attended the previous course.
Answers
# Q1
class_data <- read.csv("class-stats.csv",
header = TRUE)
# Q2
nrow(class_data)
[1] 37
Exercise
Using a single function, how can you find out the class of each column in class_data?
Find the levels of the variable Gender in class_data.
Hint:
When trying to index variables in a data frame, remember it is just a list with two dimensions.
Answers
# Q1
str(class_data)
'data.frame': 37 obs. of 5 variables:
$ Gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 2 1 ...
$ Colour : Factor w/ 6 levels "Black","Blonde",..: 1 1 3 5 3 2 2 3 1 3 ...
$ Height : num 185 193 173 157 170 ...
$ LengthRH: Factor w/ 15 levels "?","1.6","15",..: 11 14 11 5 4 4 4 6 15 9 ...
$ LengthLF: Factor w/ 14 levels "?","22","22.5",..: 13 13 13 7 11 7 8 7 13 7 ...
# Q2
levels(class_data$Gender)
[1] "F" "M"
Background
There are many ways to 'probe' a data frame for information. However, it is important to ensure that all data are in a format that R understands. You will see from the output of str that LengthRH and LengthLF have been coded as factors, despite obviously being continuous variables.
Remember, missing data in R are generally coded as NA. In class_data, missing data have been coded as '?', causing R to interpret them as categorical data.
Gender Colour Height LengthRH LengthLF
35 M Brown 185.4 ? ?
Exercise
Replace all ? values in class_data with NA.
Convert LengthRH and LengthLF back to numeric vectors.
Hint:
You will need to assign the converted values back to class_data. The safest way is to create two new columns.
There is a more straightforward way to deal with missing data at the 'read' stage. Can you find it?
Answers
# Q1
class_data[class_data == "?"] <- NA
Gender Colour Height LengthRH LengthLF
35 M Brown 185.4 <NA> <NA>
# Q2
# as.numeric() applied directly to a factor returns the level codes,
# so convert to character first
class_data$LengthRH_new <- as.numeric(as.character(class_data$LengthRH))
class_data$LengthLF_new <- as.numeric(as.character(class_data$LengthLF))
'data.frame': 37 obs. of 7 variables:
$ Gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 2 1 ...
$ Colour : Factor w/ 6 levels "Black","Blonde",..: 1 1 3 5 3 2 2 3 1 3 ...
$ Height : num 185 193 173 157 170 ...
$ LengthRH : Factor w/ 15 levels "?","1.6","15",..: 11 14 11 5 4 4 4 6 15 9 ...
$ LengthLF : Factor w/ 14 levels "?","22","22.5",..: 13 13 13 7 11 7 8 7 13 7 ...
$ LengthRH_new: num 18 21 18 16.2 16 16 16 16.5 24 17.5 ...
$ LengthLF_new: num 30 30 30 25 28 25 26 25 30 25 ...
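When a factor holds numbers stored as text, as.numeric applied directly to it returns the internal level codes rather than the recorded values. A minimal sketch with a hypothetical factor, f, shows the difference.

```r
# Hypothetical measurements stored as a factor
f <- factor(c("18", "21", "16.2"))

# as.numeric() on a factor returns the internal level codes
as.numeric(f)

# Converting to character first recovers the recorded values
as.numeric(as.character(f))
```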
Answers
# Q3
class_data <- read.csv("class-stats.csv",
header = TRUE,
na.strings = "?")
'data.frame': 37 obs. of 5 variables:
$ Gender : Factor w/ 2 levels "F","M": 1 2 2 1 1 1 1 1 2 1 ...
$ Colour : Factor w/ 6 levels "Black","Blonde",..: 1 1 3 5 3 2 2 3 1 3 ...
$ Height : num 185 193 173 157 170 ...
$ LengthRH: num 18 21 18 16.2 16 16 16 16.5 24 17.5 ...
$ LengthLF: num 30 30 30 25 28 25 26 25 30 25 ...
Exercise
After reading our data correctly, we now have an object named class_data with our data in the right format.
In some instances it might be useful to write this 'corrected' data to a file for future use.
Hint:
Make sure you export the data as well as the column names, but not the row names, which R sets as row numbers by default.
Answer
# Q1
write.csv(class_data,
file = "class-stat-corr.csv",
row.names = FALSE)
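One way to check an export is to read the file back in. The sketch below does this with a small made-up data frame and a temporary file; the names df, out_file and back are hypothetical.

```r
# Toy data with one missing value; values are made up
df <- data.frame(a = 1:3, b = c(1.5, NA, 3))

# Write to a temporary file, without row names
out_file <- tempfile(fileext = ".csv")
write.csv(df, file = out_file, row.names = FALSE)

# Read the file back and inspect it
back <- read.csv(out_file, header = TRUE)
names(back)    # column names were exported
is.na(back$b)  # the missing value is read back as NA
```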
Exercise
After reading the data into R using the correct missing data code, calculate the mean height for males and females.
Hint:
?tapply
Find the maximum right hand length for each hair colour.
Answers
# Q1
tapply(class_data$Height,
INDEX = class_data$Gender,
FUN = "mean")
F M
165.7 180.6
Answers
# Q2
tapply(class_data$LengthRH,
INDEX = class_data$Colour,
FUN = "max")
Black Blonde Brown Fair Ginger Grey
24.0 19.0 NA 20.0 17.5 20.0
Exercise
Why do you think the maximum for Brown is NA?
Can you find a solution?
Answers
# Q1
Functions such as max return NA when any input value is missing, unless you explicitly allow missing data to be removed with na.rm = TRUE
# Q2
tapply(class_data$LengthRH,
INDEX = class_data$Colour,
FUN = "max",
na.rm = TRUE)
Black Blonde Brown Fair Ginger Grey
24.0 19.0 20.0 20.0 17.5 20.0
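The same behaviour can be seen on a plain vector. This sketch uses a hypothetical vector, v, with one missing value.

```r
v <- c(1, NA, 3)

max(v)                 # NA, because one value is missing
max(v, na.rm = TRUE)   # 3, missing value removed first
mean(v, na.rm = TRUE)  # 2
sum(is.na(v))          # 1, the number of missing values
```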
Exercise
Using our sample data, class_data, and the function t.test, can you test the following hypothesis?
Male humans are taller than female humans.
Hint:
Remember to make sure that our data are normally distributed with equal variance
Answer
# Test for normality
# males
shapiro.test(class_data$Height[class_data$Gender == "M"])
Shapiro-Wilk normality test
data: class_data$Height[class_data$Gender == "M"]
W = 0.8728, p-value = 0.07093
# females
shapiro.test(class_data$Height[class_data$Gender == "F"])
Shapiro-Wilk normality test
data: class_data$Height[class_data$Gender == "F"]
W = 0.9416, p-value = 0.1612
Answer
# OR
tapply(class_data$Height, class_data$Gender, shapiro.test)
$F
Shapiro-Wilk normality test
data: X[[1L]]
W = 0.9416, p-value = 0.1612
$M
Shapiro-Wilk normality test
data: X[[2L]]
W = 0.8728, p-value = 0.07093
Answer
# variance test
var.test(class_data$Height[class_data$Gender == "M"],
class_data$Height[class_data$Gender == "F"])
F test to compare two variances
data: class_data$Height[class_data$Gender == "M"] and class_data$Height[class_data$Gender == "F"]
F = 1.735, num df = 11, denom df = 24, p-value = 0.2506
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.6708 5.5046
sample estimates:
ratio of variances
1.735
Answer
# t-test
t.test(class_data$Height[class_data$Gender == "M"],
class_data$Height[class_data$Gender == "F"],
alternative = "greater",
var.equal = TRUE)
Two Sample t-test
data: class_data$Height[class_data$Gender == "M"] and class_data$Height[class_data$Gender == "F"]
t = 4.808, df = 35, p-value = 1.429e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
9.66 Inf
sample estimates:
mean of x mean of y
180.6 165.7
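t.test also accepts a formula, which avoids subsetting the data by hand. The sketch below uses simulated heights, not the class data, so its results will differ from those above. The data frame sim, the seed, and the simulated means and standard deviation are all assumptions for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated heights (cm); these are NOT the class data
sim <- data.frame(
  Gender = factor(rep(c("F", "M"), times = c(25, 12))),
  Height = c(rnorm(25, mean = 165, sd = 7),
             rnorm(12, mean = 180, sd = 7))
)

# With a formula, the difference is (first level) - (second level),
# i.e. F - M, so "males are taller" becomes alternative = "less"
res <- t.test(Height ~ Gender, data = sim,
              alternative = "less", var.equal = TRUE)
res$p.value
```

Note that the direction of the alternative depends on the order of the factor levels, so it is worth checking levels(sim$Gender) before choosing "less" or "greater".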
Exercise
Using the function plot, visualise the relationship between height and foot size.
What do you notice from the plot? Can you figure out how to fix the problem?
Answers
# Q1
plot(class_data$Height, class_data$LengthLF)
Answers
# Q2
There is an outlier in LengthLF
# find which entry is incorrect
outlier <- which(class_data$LengthLF ==
max(class_data$LengthLF, na.rm = TRUE))
# print outlier
outlier
[1] 11
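There are other ways to locate a suspicious value. The sketch below uses a hypothetical vector, len; the threshold of 50 cm is an assumption chosen only for illustration.

```r
# Hypothetical foot lengths with one implausible value and one NA
len <- c(25, 30, NA, 250, 26)

# which.max() returns the position of the largest value, ignoring NAs
which.max(len)

# A logical test flags every value above a chosen threshold
which(len > 50)
```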
Answers
We know now that the error is in the \( 11^{th} \) row of class_data. We can delete the outlier value by replacing it with an NA.
class_data$LengthLF[11] <- NA
# check
class_data[11, ]
Gender Colour Height LengthRH LengthLF
11 F Black 157.5 17.2 NA
Exercise
Plot the relationship again.
Can you find a way to plot male and female data as different coloured, solid points?
Can you change the axes labels and add a title to your plot?
Answers
plot(class_data$Height, class_data$LengthLF)
Answers
# Q1
# create an empty plot
plot(class_data$Height, class_data$LengthLF,
type = "n")
# add male data in blue
points(class_data$Height[class_data$Gender == "M"],
class_data$LengthLF[class_data$Gender == "M"],
col = "blue", pch = 16)
# add female data in red
points(class_data$Height[class_data$Gender == "F"],
class_data$LengthLF[class_data$Gender == "F"],
col = "red", pch = 16)
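An alternative to adding each group separately is to pass a vector of colours, one per point, to a single plot call. This is a sketch on made-up data; the data frame toy and the vector pt_col are hypothetical stand-ins for class_data.

```r
# Toy data standing in for class_data; values are made up
toy <- data.frame(Gender = c("M", "F", "F", "M"),
                  Height = c(180, 165, 170, 185),
                  LengthLF = c(28, 24, 25, 29))

# Build one colour per row from the Gender column
pt_col <- ifelse(toy$Gender == "M", "blue", "red")

# A single call draws all points, each in its own colour
plot(toy$Height, toy$LengthLF, col = pt_col, pch = 16)

# Add a legend so the colours can be interpreted
legend("topleft", legend = c("Male", "Female"),
       col = c("blue", "red"), pch = 16)
```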
Answers
# Q2
plot(class_data$Height, class_data$LengthLF,
xlab = "Height (cm)",
ylab = "Left foot length (cm)",
main = "Feet size vs Height")
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.4.1
loaded via a namespace (and not attached):
[1] digest_0.6.3 evaluate_0.4.7 formatR_0.9 stringr_0.6.2
[5] tools_3.0.1