In this section we will cover:
A working directory is the folder on your computer where you are currently working. You can find out your current working directory by typing getwd()
> getwd()
[1] "/Users/zhian/Documents/Everhart/IntroRNCAPS"
If you’ve downloaded and un-zipped this directory to your desktop, you might see something like /Users/<yourname>/Desktop/IntroR
. This is the default place where R will begin reading and writing files. For example, you can use the function list.files()
to view the files in your current working directory. These are the same files that we downloaded earlier. If you’re using Rstudio, you can compare the file list with the “Files” tab in the bottom right panel.
In order to use list.files()
, we should provide it with a file path. We can provide it a “.”, which means “here” to most computer programs.
> list.files(".")
[1] "data" "docs"
[3] "IntroRNCAPS.Rproj" "LICENSE"
[5] "Makefile" "Part1-Introduction.R"
[7] "Part2-Analysis.R" "Part99-ggplot2Visualization.R"
[9] "PartAEL-Basic_Visualization.R" "README.md"
You can see that the first entry in here is “data”. This is where we’ve placed the fungicide example data.
> list.files("data")
[1] "FungicideExample.csv" "README.md"
We can use the read.table()
function to read these data in to R. It’s important to remember that while in R, these data are simply a copy kept in memory, not on the disk, so we don’t have to worry too much about accidentally deleting the data :).
So, how do we actually USE the read.table()
function? A good first step to figuring out how you can use a function is to look at it’s help page. The way you can do that is by typing either help(“function_name”) or ?function_name.
> stop("
+
+ Type ?read.table and answer these three questions:
+
+ 1. What does it do? (Description)
+ 2. What are the first three arguments and their defaults? (Usage/Arguments)
+ 3. What does it return? (Value)
+
+ ")
Error in eval(expr, envir, enclos):
Type ?read.table and answer these three questions:
1. What does it do? (Description)
2. What are the first three arguments and their defaults? (Usage/Arguments)
3. What does it return? (Value)
In order to read our data into R, we will need to provide three things:
Now that we have these elements, we can read the data into a variable, which we can call “fungicide”. Once we do this, we can check the dimensions to make sure that we have all of the data.
> fungicide <- read.csv("data/FungicideExample.csv", header = TRUE, sep = ",")
> nrow(fungicide)
[1] 9
> ncol(fungicide)
[1] 7
We can print the data to screen by simply typing its name
> fungicide
Julian.Date TwentyOneThirtySevenWheat TwentyOneThirtySevenWheat.trt
1 97 0.00 0.00
2 104 0.00 0.00
3 111 0.00 0.00
4 118 0.00 0.00
5 125 0.00 0.00
6 132 0.00 0.00
7 139 2.34 1.81
8 146 7.56 7.89
9 154 28.78 15.04
CutterWheat CutterWheat.Trt JaggerWheat JaggerWheat.Trt
1 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00
4 0.00 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00
6 0.00 0.00 0.00 0.00
7 1.15 1.79 1.85 2.27
8 3.62 2.40 6.92 5.00
9 17.89 6.21 47.39 20.17
We can also use the str()
function (short for “structure”) to have a broad overview of what our data looks like. This is useful for data frames with many columns.
> str(fungicide)
'data.frame': 9 obs. of 7 variables:
$ Julian.Date : int 97 104 111 118 125 132 139 146 154
$ TwentyOneThirtySevenWheat : num 0 0 0 0 0 ...
$ TwentyOneThirtySevenWheat.trt: num 0 0 0 0 0 ...
$ CutterWheat : num 0 0 0 0 0 ...
$ CutterWheat.Trt : num 0 0 0 0 0 0 1.79 2.4 6.21
$ JaggerWheat : num 0 0 0 0 0 ...
$ JaggerWheat.Trt : num 0 0 0 0 0 ...
The data presented here are from http://www.apsnet.org/edcenter/advanced/topics/EcologyAndEpidemiologyInR/DiseaseProgress/Pages/StripeRust.aspx and consist of three cultivars of wheat treated with different fungicides Jagger wheat is a succeptible variety whereas Cutter and 2137 are both resistant varieties. It is assumed that the fungicide will only be able to prevent new infections for two weeks after applicaton. With these data, we want to answer the following questions:
To answer these questions, we will use the summary statistic, Area Under the Disease Progress Curve (AUDPC).
So, how do you calculate this? You could code the trapezoid rule yourself, OR you could find a package that was designed for this.
> stop("
+
+ Do an internet search for AUDPC in R. What did you find?
+
+ ")
Error in eval(expr, envir, enclos):
Do an internet search for AUDPC in R. What did you find?
The first thing that likely popped up was the function audpc()
in the agricolae package. If we want to use it, we can download the package to our computer with the function install.packages()
. This will install a package from CRAN and place it into your R Library. Where is your R library? Type .libPaths()
to find out!
> .libPaths()
>
> install.packages("agricolae", repos = "https://cran.rstudio.com")
Now that we have the agricolae package installed, we can load into our R workspace using the library()
function, which looks in our library and loads the functions and data sets within that package. You can find out about what functions are avialable with help(package = “agricolae”).
To accomplish our task, however, we need to find out how to use the audpc()
function. First we should load the agricolae package
> library("agricolae")
Now we can look up help for audpc()
> ?audpc
This function will take in a data frame of severity and a vector of dates. Since our vector of dates is in the first column, we will need to manipulate the data frame by subsetting.
> jdate <- fungicide$Julian.Date # this is also the first column
> jdate
[1] 97 104 111 118 125 132 139 146 154
> fungicide.audpc <- audpc(evaluation = fungicide[, -1], dates = jdate, type = "relative")
Error:
The number of dates of evaluation
must agree with the number of evaluations
Well this doesn’t look good. What can this error message mean? Let’s take a look at the audpc help page one more time; this time, we’ll look at the Examples section. Try to copy and paste the code one line at a time and see what happens. Check the evaluaton data relative to the dates and see how it’s relevant.
> stop("
+
+ Stop and look at the examples from ?audpc.
+
+ ")
Error in eval(expr, envir, enclos):
Stop and look at the examples from ?audpc.
We can see from example 3, that the data must be arranged where dates are in separate columns. We need to transpose our data. To do this, we can use the t()
function:
> t(fungicide[, -1])
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
TwentyOneThirtySevenWheat 0 0 0 0 0 0 2.34 7.56
TwentyOneThirtySevenWheat.trt 0 0 0 0 0 0 1.81 7.89
CutterWheat 0 0 0 0 0 0 1.15 3.62
CutterWheat.Trt 0 0 0 0 0 0 1.79 2.40
JaggerWheat 0 0 0 0 0 0 1.85 6.92
JaggerWheat.Trt 0 0 0 0 0 0 2.27 5.00
[,9]
TwentyOneThirtySevenWheat 28.78
TwentyOneThirtySevenWheat.trt 15.04
CutterWheat 17.89
CutterWheat.Trt 6.21
JaggerWheat 47.39
JaggerWheat.Trt 20.17
Now we can plug this into the audpc()
function:
> fungicide.audpc <- audpc(evaluation = t(fungicide[, -1]), dates = jdate, type = "relative")
> fungicide.audpc
TwentyOneThirtySevenWheat TwentyOneThirtySevenWheat.trt
0.033017544 0.023158772
CutterWheat CutterWheat.Trt
0.018729825 0.009714035
JaggerWheat JaggerWheat.Trt
0.044633333 0.023521053
This gives us a vector with the AUDPC values per treatment. The vector format does not make it particularly easy to compare the effect of cultivar or treatment. A better format to store these data would be in a table. Since all the values are numeric, we can place them into a matrix using the matrix()
function. We are going to put the samples in rows and experiments in columns.
> fungicide.res <- matrix(fungicide.audpc, nrow = 3, ncol = 2, byrow = TRUE)
> fungicide.res
[,1] [,2]
[1,] 0.03301754 0.023158772
[2,] 0.01872982 0.009714035
[3,] 0.04463333 0.023521053
We have our result, but there are no labels for the rows and columns. Notice how we selected byrow = TRUE
. This means that the matrix was filled row by row. So the first two elements of the vector would go in the first row, the next two would go in the second, and so on and so forth.
> names(fungicide.audpc)[1:2]
[1] "TwentyOneThirtySevenWheat" "TwentyOneThirtySevenWheat.trt"
> names(fungicide.audpc)[3:4]
[1] "CutterWheat" "CutterWheat.Trt"
> names(fungicide.audpc)[5:6]
[1] "JaggerWheat" "JaggerWheat.Trt"
Since this is a small matrix, we can easily name these rows and columns ourselves using the rownames()
and colnames()
functions.
> rownames(fungicide.res) <- c("2137", "Cutter", "Jagger")
> colnames(fungicide.res) <- c("Control", "Treated")
> fungicide.res
Control Treated
2137 0.03301754 0.023158772
Cutter 0.01872982 0.009714035
Jagger 0.04463333 0.023521053
If we wanted to use these data as a table in a paper, we should export it to a csv file. We do this using the function write.table()
> dir.create("results")
> write.table(fungicide.res, file = "results/audpc.csv", sep = ",",
+ col.names = NA,
+ row.names = TRUE)
> #
Imagine for a second that, instead of one control and one treatment for each cultivar, that we had one control and two treatments. Without the Julian.Date
column, the dimensions of our data frame would be 9 X 9.
What would happen if we had tried the audpc()
function on that matrix? It would have given us an answer, but not the correct one.
One solution to this would be to standardize the data format so that we have one observation per row. Our data, for example, might look something like this truncated example:
Experiment Julian.Date Cultivar Severity
1 control 139 TwentyOneThirtySevenWheat 2.34
2 control 146 TwentyOneThirtySevenWheat 7.56
3 control 154 TwentyOneThirtySevenWheat 28.78
4 control 139 CutterWheat 1.15
5 control 146 CutterWheat 3.62
6 control 154 CutterWheat 17.89
7 control 139 JaggerWheat 1.85
8 control 146 JaggerWheat 6.92
9 control 154 JaggerWheat 47.39
10 treatment 139 TwentyOneThirtySevenWheat 1.81
11 treatment 146 TwentyOneThirtySevenWheat 7.89
12 treatment 154 TwentyOneThirtySevenWheat 15.04
13 treatment 139 CutterWheat 1.79
14 treatment 146 CutterWheat 2.40
15 treatment 154 CutterWheat 6.21
16 treatment 139 JaggerWheat 2.27
17 treatment 146 JaggerWheat 5.00
18 treatment 154 JaggerWheat 20.17
The advantages here is that, when you look at the data, it is immediately clear what all the values mean, because each column represents either measured data or metadata.
This data format is known by many names, including an un-pivot table and long data. We will be referring to this as “tidy” data. Now we could go into a spreadsheet program and reshape the data ourselves, but why should we do that when we know:
If we use R to reshape our data, we can keep a track record of what we did to it! Luckily, in the past few years, a suite of packages have come out explicitly designed to deal with tidy data frames aptly called the “tidyverse”. If you are taking this as a workshop, you should have downloaded this package beforehand. Otherwise, please download it now.
The tidyverse is a package that loads several packages dealing with tidy data including:
For details, see https://tidyverse.tidyverse.org
> library("tidyverse")
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ----------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
There are several ways of creating the tidy data frame; we will show one here, but there are others that are worth exploring on your own.
The process we will demonstrate takes three steps:
gather()
(This is also known as “unpivot” in Excel: http://www.excel-university.com/unpivot-excel-data/).bind_rows()
Subset control and treatmend data into separate data frames. Since we have a data frame, we will use matrix-like subsetting.
> control <- fungicide[, c(1, 2, 4, 6)]
> treatment <- fungicide[, c(1, 3, 5, 7)]
Now we have two separate data frames with 4 columns each, but still we have three observations per row.
> stop("
+ Would this approach work for a different data set? Why or why not?
+ ")
Error in eval(expr, envir, enclos):
Would this approach work for a different data set? Why or why not?
Now that we have our experiments in separate tables, we can reshape each table into tidy format by using tidyr’s gather()
. This will “gather” all of the data that are spread across columns into single rows. Users of excel might be familiar with this operation as “unpivot”.
In order to do this, we will need to specify three things:
I often find the key/value syntax confusing and have to look at the examples in the help page for gather()
.
> stop("
+ Look at the examples for gather and determine what the names of the
+ key and value columns should be for our data.
+ (hint: you can use example(gather) to run them automatically).
+ ")
Error in eval(expr, envir, enclos):
Look at the examples for gather and determine what the names of the
key and value columns should be for our data.
(hint: you can use example(gather) to run them automatically).
Now that we have our keys and values, we can plug them into gather()
.
Note, however, this uses “bare” names (names that are not between quotation marks) to specify columns. This is a feature of the tidyverse to make typing easier, but does not necessarily work outside of the tidyverse.
To avoid typing all of the columns we want to collapse, we’ll simply specify the column that we don’t want to collapse, which is Julian.Date. This is done by placing a minus (-) symbol in front of it.
> con.tidy <- gather(data = control, key = Cultivar, value = Severity, -Julian.Date)
> con.tidy
Julian.Date Cultivar Severity
1 97 TwentyOneThirtySevenWheat 0.00
2 104 TwentyOneThirtySevenWheat 0.00
3 111 TwentyOneThirtySevenWheat 0.00
4 118 TwentyOneThirtySevenWheat 0.00
5 125 TwentyOneThirtySevenWheat 0.00
6 132 TwentyOneThirtySevenWheat 0.00
7 139 TwentyOneThirtySevenWheat 2.34
8 146 TwentyOneThirtySevenWheat 7.56
9 154 TwentyOneThirtySevenWheat 28.78
10 97 CutterWheat 0.00
11 104 CutterWheat 0.00
12 111 CutterWheat 0.00
13 118 CutterWheat 0.00
14 125 CutterWheat 0.00
15 132 CutterWheat 0.00
16 139 CutterWheat 1.15
17 146 CutterWheat 3.62
18 154 CutterWheat 17.89
19 97 JaggerWheat 0.00
20 104 JaggerWheat 0.00
21 111 JaggerWheat 0.00
22 118 JaggerWheat 0.00
23 125 JaggerWheat 0.00
24 132 JaggerWheat 0.00
25 139 JaggerWheat 1.85
26 146 JaggerWheat 6.92
27 154 JaggerWheat 47.39
Note that we can easily get back our original data by using the spread()
function, which is the opposite of gather()
:
> spread(data = con.tidy, key = Cultivar, value = Severity)
Julian.Date CutterWheat JaggerWheat TwentyOneThirtySevenWheat
1 97 0.00 0.00 0.00
2 104 0.00 0.00 0.00
3 111 0.00 0.00 0.00
4 118 0.00 0.00 0.00
5 125 0.00 0.00 0.00
6 132 0.00 0.00 0.00
7 139 1.15 1.85 2.34
8 146 3.62 6.92 7.56
9 154 17.89 47.39 28.78
This was simply to demonstrate that with tidy data, it’s possible to reshape it in most any way possible.
Now we can tidy our treatment data. But first, because we want to combine this data with our control, we want to make sure that the values in the “Cultivar” column are the same. Currently we have the following columns in our treatment data.
> colnames(treatment)
[1] "Julian.Date" "TwentyOneThirtySevenWheat.trt"
[3] "CutterWheat.Trt" "JaggerWheat.Trt"
If we take a look at these column names, we can see that they all end in .Trt, except for the second column, which ends in .trt. Because R is case-sensitive, it will see .trt and .Trt as two different things. To avoid this problem, we will simply replace the column names of the treatment to those of the control.
> colnames(control)
[1] "Julian.Date" "TwentyOneThirtySevenWheat"
[3] "CutterWheat" "JaggerWheat"
> colnames(treatment) <- colnames(control)
> treatment
Julian.Date TwentyOneThirtySevenWheat CutterWheat JaggerWheat
1 97 0.00 0.00 0.00
2 104 0.00 0.00 0.00
3 111 0.00 0.00 0.00
4 118 0.00 0.00 0.00
5 125 0.00 0.00 0.00
6 132 0.00 0.00 0.00
7 139 1.81 1.79 2.27
8 146 7.89 2.40 5.00
9 154 15.04 6.21 20.17
Now we can tidy our treatment data.
> trt.tidy <- gather(data = treatment, key = Cultivar, value = Severity, -Julian.Date)
We can combine the treatments by using the dplyr function bind_rows()
, which takes any number of data frames and places them on top of one another. It takes an optional argument, .id
(note the “.”) that specifies a separate column to create identifying the source data frame.
> fungicide.tidy <- bind_rows(control = con.tidy, fungicide = trt.tidy, .id = "Experiment")
> fungicide.tidy
Experiment Julian.Date Cultivar Severity
1 control 97 TwentyOneThirtySevenWheat 0.00
2 control 104 TwentyOneThirtySevenWheat 0.00
3 control 111 TwentyOneThirtySevenWheat 0.00
4 control 118 TwentyOneThirtySevenWheat 0.00
5 control 125 TwentyOneThirtySevenWheat 0.00
6 control 132 TwentyOneThirtySevenWheat 0.00
7 control 139 TwentyOneThirtySevenWheat 2.34
8 control 146 TwentyOneThirtySevenWheat 7.56
9 control 154 TwentyOneThirtySevenWheat 28.78
10 control 97 CutterWheat 0.00
11 control 104 CutterWheat 0.00
12 control 111 CutterWheat 0.00
13 control 118 CutterWheat 0.00
14 control 125 CutterWheat 0.00
15 control 132 CutterWheat 0.00
16 control 139 CutterWheat 1.15
17 control 146 CutterWheat 3.62
18 control 154 CutterWheat 17.89
19 control 97 JaggerWheat 0.00
20 control 104 JaggerWheat 0.00
21 control 111 JaggerWheat 0.00
22 control 118 JaggerWheat 0.00
23 control 125 JaggerWheat 0.00
24 control 132 JaggerWheat 0.00
25 control 139 JaggerWheat 1.85
26 control 146 JaggerWheat 6.92
27 control 154 JaggerWheat 47.39
28 fungicide 97 TwentyOneThirtySevenWheat 0.00
29 fungicide 104 TwentyOneThirtySevenWheat 0.00
30 fungicide 111 TwentyOneThirtySevenWheat 0.00
31 fungicide 118 TwentyOneThirtySevenWheat 0.00
32 fungicide 125 TwentyOneThirtySevenWheat 0.00
33 fungicide 132 TwentyOneThirtySevenWheat 0.00
34 fungicide 139 TwentyOneThirtySevenWheat 1.81
35 fungicide 146 TwentyOneThirtySevenWheat 7.89
36 fungicide 154 TwentyOneThirtySevenWheat 15.04
37 fungicide 97 CutterWheat 0.00
38 fungicide 104 CutterWheat 0.00
39 fungicide 111 CutterWheat 0.00
40 fungicide 118 CutterWheat 0.00
41 fungicide 125 CutterWheat 0.00
42 fungicide 132 CutterWheat 0.00
43 fungicide 139 CutterWheat 1.79
44 fungicide 146 CutterWheat 2.40
45 fungicide 154 CutterWheat 6.21
46 fungicide 97 JaggerWheat 0.00
47 fungicide 104 JaggerWheat 0.00
48 fungicide 111 JaggerWheat 0.00
49 fungicide 118 JaggerWheat 0.00
50 fungicide 125 JaggerWheat 0.00
51 fungicide 132 JaggerWheat 0.00
52 fungicide 139 JaggerWheat 2.27
53 fungicide 146 JaggerWheat 5.00
54 fungicide 154 JaggerWheat 20.17
We’ve gone to the work to tidy up the data; let’s save it to our disk so that we can easily reuse it later.
> write_csv(fungicide.tidy, path = "data/FungicideTidy.csv")
The package dplyr is one of the most powerful R packages because it allows you to think about manipulating data in a way that makes sense to humans. Here, I will show you how to calculate AUDPC from the tidy data frame in three steps:
Here, we use the group_by()
function to split the rows into specific groups.
> tmp <- group_by(fungicide.tidy, Experiment, Cultivar)
> tmp
Source: local data frame [54 x 4]
Groups: Experiment, Cultivar [6]
# A tibble: 54 x 4
Experiment Julian.Date Cultivar Severity
<chr> <int> <chr> <dbl>
1 control 97 TwentyOneThirtySevenWheat 0.00
2 control 104 TwentyOneThirtySevenWheat 0.00
3 control 111 TwentyOneThirtySevenWheat 0.00
4 control 118 TwentyOneThirtySevenWheat 0.00
5 control 125 TwentyOneThirtySevenWheat 0.00
6 control 132 TwentyOneThirtySevenWheat 0.00
7 control 139 TwentyOneThirtySevenWheat 2.34
8 control 146 TwentyOneThirtySevenWheat 7.56
9 control 154 TwentyOneThirtySevenWheat 28.78
10 control 97 CutterWheat 0.00
# ... with 44 more rows
You can see here that it says “Groups: Experiment, Cultivar [6]”. This indicates that we have six groups, containing the experiment and cultivar pairs.
Now that we have told dplyr to assign groups, we can use the dplyr function summarize()
to summarize the data within the groups with a function that takes in a vector (or multiple vectors) and returns a single value. In our case, this is the AUDPC function.
> tmp <- summarize(tmp, AUDPC = audpc(Severity, Julian.Date, type = "relative"))
> tmp
Source: local data frame [6 x 3]
Groups: Experiment [?]
# A tibble: 6 x 3
Experiment Cultivar AUDPC
<chr> <chr> <dbl>
1 control CutterWheat 0.018729825
2 control JaggerWheat 0.044633333
3 control TwentyOneThirtySevenWheat 0.033017544
4 fungicide CutterWheat 0.009714035
5 fungicide JaggerWheat 0.023521053
6 fungicide TwentyOneThirtySevenWheat 0.023158772
Now we see that we have 6 rows and three columns, If we wanted to have each experiment side by side, we can use the spread()
function as we saw earlier to spread out our data
> fungicide.res <- spread(tmp, Experiment, AUDPC)
> fungicide.res
# A tibble: 3 x 3
Cultivar control fungicide
* <chr> <dbl> <dbl>
1 CutterWheat 0.01872982 0.009714035
2 JaggerWheat 0.04463333 0.023521053
3 TwentyOneThirtySevenWheat 0.03301754 0.023158772
Now we can write this to a table!
> write.table(fungicide.res,
+ file = "results/audpc_tidy.csv",
+ sep = ",",
+ row.names = FALSE,
+ col.names = TRUE)