In this section we will cover:
A working directory is the folder on your computer where you are currently working. You can find out your current working directory by typing getwd()
> getwd()
[1] "/Users/nikitagambhir/Desktop/IntroR_Workshop"
If you’ve downloaded and un-zipped this directory to your desktop, you might see something like /Users/<yourname>/Desktop/IntroR_Workshop
. This is the
default place where R will begin reading and writing files. For example, you
can use the function list.files()
to view the files in your current working directory. These are the same files that we downloaded earlier. If you’re using Rstudio, you can compare the file list with the “Files” tab in the bottom right panel.
In order to use list.files()
, we should provide it with a file path. We can provide it a “.”, which means “here” to most computer programs.
> list.files(".")
[1] "data" "docs" "IntroR_Workshop.Rproj"
[4] "LICENSE" "Makefile" "Part1-Introduction.R"
[7] "Part2-Analysis.R" "Part3-Visualization.R" "README.md"
You can see that the first entry in here is “data”. This is where we’ve placed the fungicide example data.
> list.files("data")
[1] "fungicide_dat.csv" "README.md"
We can use the read.csv()
function to read (or import) these data into R. It’s important to remember that while in R, these data are simply a copy kept in memory, not on the disk, so we don’t have to worry too much about accidentally deleting the data :).
So, how do we actually USE the read.csv()
function?
> ?read.csv
You will notice that the help page describes all the sibling functions, too: read.table()
, read.csv2()
, read.delim()
, read.delim2()
. Let’s read the Description and Usage first. All arguments except ‘file’ have a default value (e.g. header = TRUE). You do not need to specify these arguments unless you want to change the default. Let’s use read.csv()
with the default values.
> read.csv("data/fungicide_dat.csv")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
13 Fungicide_B 175.98 4.1
14 Fungicide_B 175.58 3.9
15 Fungicide_B 175.75 4.2
16 Fungicide_B 175.88 4.0
17 Fungicide_B 175.68 4.1
18 Fungicide_B 175.95 4.2
Let’s use read.table()
with the default values.
> read.table("data/fungicide_dat.csv")
V1
1 Treatment,Yield_bu_per_acre,Severity
2 Control,173.82,5.5
3 Control,174.23,5.6
4 Control,173.57,5.4
5 Control,173.61,5.4
6 Control,174.19,5.6
7 Control,173.8,5.5
8 Fungicide_A,173.98,5.1
9 Fungicide_A,174.27,5.2
10 Fungicide_A,173.61,5
11 Fungicide_A,173.88,5
12 Fungicide_A,174.17,5.2
13 Fungicide_A,173.49,5.1
14 Fungicide_B,175.98,4.1
15 Fungicide_B,175.58,3.9
16 Fungicide_B,175.75,4.2
17 Fungicide_B,175.88,4
18 Fungicide_B,175.68,4.1
19 Fungicide_B,175.95,4.2
This has two issues:
read.table()
. We should change to header = TRUE.> read.table("data/fungicide_dat.csv", header = TRUE, sep = ",")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
13 Fungicide_B 175.98 4.1
14 Fungicide_B 175.58 3.9
15 Fungicide_B 175.75 4.2
16 Fungicide_B 175.88 4.0
17 Fungicide_B 175.68 4.1
18 Fungicide_B 175.95 4.2
Now that we have these elements, we can read the data into an object, which we can call “fungicide”. Once we do this, we can check the dimensions to make sure that we have all of the data.
> fungicide <- read.csv("data/fungicide_dat.csv")
> nrow(fungicide)
[1] 18
> ncol(fungicide)
[1] 3
We can print the data to screen by simply typing its name.
> fungicide
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
13 Fungicide_B 175.98 4.1
14 Fungicide_B 175.58 3.9
15 Fungicide_B 175.75 4.2
16 Fungicide_B 175.88 4.0
17 Fungicide_B 175.68 4.1
18 Fungicide_B 175.95 4.2
We can use the str()
function (short for “structure”) to have a broad overview of what our data looks like. This is useful for data frames with many columns.
> str(fungicide)
'data.frame': 18 obs. of 3 variables:
$ Treatment : Factor w/ 3 levels "Control","Fungicide_A",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Yield_bu_per_acre: num 174 174 174 174 174 ...
$ Severity : num 5.5 5.6 5.4 5.4 5.6 5.5 5.1 5.2 5 5 ...
We can also use the View()
function to look at our data in spreadsheet-style.
> stop("
+
+ Type View(fungicide) and inspect the data
+
+ ")
Error in eval(expr, envir, enclos):
Type View(fungicide) and inspect the data
The dummy data presented here consists of yield (measured in bu/acre) and disease severity (measured on a scale of 1 to 10) of a corn culitvar treated with two fungicides. This trial was conducted to measure the efficacy of the two fungicides to manage disease. The experiment was laid out as a Completely Randomized Design.
With these data, we want to answer the following questions:
The package ‘dplyr’ provides functions for easy and advanced data manipulation. If we want to use it, we can download the package to our computer with the function install.packages()
. This will install a package from CRAN and place it into your R Library.
> install.packages("dplyr")
Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
To load this package, we can use the function library()
.
> library("dplyr")
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Let’s go step by step to answer this:
fungicide$Yield_kg_per_ha <- fungicide$Yield_bu_per_acre*62.77
We can also use the function mutate()
from ‘dplyr’. This adds a new variable using existing variables. The usage of this function is as:
mutate(data, new_variable_name = calculation_based_upon_existing_variables)
> fungicide_1 <- mutate(fungicide, Yield_kg_per_ha = Yield_bu_per_acre*62.77)
Note: We did not have to use fungicide$Yield_bu_per_acre.
Let’s print the data to see what we have
> fungicide_1
Treatment Yield_bu_per_acre Severity Yield_kg_per_ha
1 Control 173.82 5.5 10910.68
2 Control 174.23 5.6 10936.42
3 Control 173.57 5.4 10894.99
4 Control 173.61 5.4 10897.50
5 Control 174.19 5.6 10933.91
6 Control 173.80 5.5 10909.43
7 Fungicide_A 173.98 5.1 10920.72
8 Fungicide_A 174.27 5.2 10938.93
9 Fungicide_A 173.61 5.0 10897.50
10 Fungicide_A 173.88 5.0 10914.45
11 Fungicide_A 174.17 5.2 10932.65
12 Fungicide_A 173.49 5.1 10889.97
13 Fungicide_B 175.98 4.1 11046.26
14 Fungicide_B 175.58 3.9 11021.16
15 Fungicide_B 175.75 4.2 11031.83
16 Fungicide_B 175.88 4.0 11039.99
17 Fungicide_B 175.68 4.1 11027.43
18 Fungicide_B 175.95 4.2 11044.38
We have a new column Yield_kg_per_ha
. We have two columns with yield and one with severity. We don’t want severity data and want yield data only in kg/ha.
Treatment
and Yield_kg_per_ha
columns We can use the function select()
that picks variables based on their names.> fungicide_2 <- select(fungicide_1, Treatment, Yield_kg_per_ha)
> fungicide_2
Treatment Yield_kg_per_ha
1 Control 10910.68
2 Control 10936.42
3 Control 10894.99
4 Control 10897.50
5 Control 10933.91
6 Control 10909.43
7 Fungicide_A 10920.72
8 Fungicide_A 10938.93
9 Fungicide_A 10897.50
10 Fungicide_A 10914.45
11 Fungicide_A 10932.65
12 Fungicide_A 10889.97
13 Fungicide_B 11046.26
14 Fungicide_B 11021.16
15 Fungicide_B 11031.83
16 Fungicide_B 11039.99
17 Fungicide_B 11027.43
18 Fungicide_B 11044.38
summarize()
.> summarize(fungicide_2, Mean_yield = mean(Yield_kg_per_ha))
Mean_yield
1 10954.9
Wait, we just got one mean even though we had three different groups (Control, Fungicide_A, Fungicide_B). We need to find the mean for every group. How can we tell R that it needs to group the data according to the treatment?
group_by()
for this purpose.> fungicide_3 <- group_by(fungicide_2, Treatment)
> fungicide_3
# A tibble: 18 x 2
# Groups: Treatment [3]
Treatment Yield_kg_per_ha
<fct> <dbl>
1 Control 10911.
2 Control 10936.
3 Control 10895.
4 Control 10897.
5 Control 10934.
6 Control 10909.
7 Fungicide_A 10921.
8 Fungicide_A 10939.
9 Fungicide_A 10897.
10 Fungicide_A 10914.
11 Fungicide_A 10933.
12 Fungicide_A 10890.
13 Fungicide_B 11046.
14 Fungicide_B 11021.
15 Fungicide_B 11032.
16 Fungicide_B 11040.
17 Fungicide_B 11027.
18 Fungicide_B 11044.
You will notice that fungicide_2
and fungicide_3
are a little different. fungicide_3
is also a “tibble”, which is a form of data frame that gives more information about our data (e.g. what kind of data the columns are). You will notice that Treatment is a factor, while Yield_kg_per_ha is a double (decimal). It also tells us that our data is grouped by Treatment (therefore, it has three groups). To add the grouping information to this data frame, ‘dplyr’ uses a sister package to convert it into a tibble. Also, note that Yield_kg_per_ha
is not printing the whole answer (after decimals) and has underlined first two digits. The tibble format is different when we print it, but the same when we use View()
. The digits are underlined so that it is easier to read larger numbers. It’s role is similar to a comma. So, ‘10,000’ will have ‘10’ underlined and ‘100,000’ will have ‘100’ underlined.
> fungicide_4 <- summarize(fungicide_3, Mean_yield = mean(Yield_kg_per_ha))
> fungicide_4
# A tibble: 3 x 2
Treatment Mean_yield
<fct> <dbl>
1 Control 10914.
2 Fungicide_A 10916.
3 Fungicide_B 11035.
Instead of creating different objects everytime, we can perform multiple functions at once and create one final object. We use the pipe operator, %>% , for this purpose. The left hand side of %>% acts as the input on which an operation on the right side is performed. A shortcut to write %>% is ctrl+shift+m
.
> yield_summary <- fungicide %>%
+ mutate(Yield_kg_per_ha = Yield_bu_per_acre*62.77) %>%
+ select(Treatment, Yield_kg_per_ha) %>%
+ group_by(Treatment) %>%
+ summarise(Mean_yield = mean(Yield_kg_per_ha))
> yield_summary
# A tibble: 3 x 2
Treatment Mean_yield
<fct> <dbl>
1 Control 10914.
2 Fungicide_A 10916.
3 Fungicide_B 11035.
> View(yield_summary)
We answered our first question!
You will notice that on the console below, the code for creating yield_summary
has +
signs before every line. The plus signs just indicate that R is waiting for the code to be finished. This sign also appears when we forget to put a closing paranthesis ‘)’. Try this:
> str(yield_summary
+
+ )
Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of 2 variables:
$ Treatment : Factor w/ 3 levels "Control","Fungicide_A",..: 1 2 3
$ Mean_yield: num 10914 10916 11035
Let’s go to our second question.
First, we need to create a new data frame where we only have severity data for Control and Fungicide A. If we want to filter a data frame based upon specific values of a variable, we can use the function filter()
.
> filter(fungicide, Treatment == "Control" | Treatment == "Fungicide_A")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
The treatment column should either be equal (==
) to Control OR (|
) Fungicide_A. We can also write the same expression as:
> filter(fungicide, Treatment != "Fungicide_B")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
The treatment column should not be equal to (!=
) to Fungicide_B. So, it will have everything except Fungicide B (“Control” and “Fungicide_A”).
After filtering, we need to add a column Percent_Severity
, where severity is measured in percent and not on a scale of 1 to 10. Any guess on which function should we use? Let’s perform all operations in one go.
> severity_dat <- fungicide %>%
+ filter(Treatment == "Control" | Treatment == "Fungicide_A") %>%
+ mutate(Percent_Severity = Severity/10*100) %>%
+ select(Treatment, Percent_Severity)
> severity_dat
Treatment Percent_Severity
1 Control 55
2 Control 56
3 Control 54
4 Control 54
5 Control 56
6 Control 55
7 Fungicide_A 51
8 Fungicide_A 52
9 Fungicide_A 50
10 Fungicide_A 50
11 Fungicide_A 52
12 Fungicide_A 51
If we wanted to use these data as a table in a paper, we should export it to a csv file. We can do this using the function write.table()
. Before that, let’s create a new directory (or folder) named results. We can do this using the function dir.create()
.
> dir.create("results")
> write.table(severity_dat, file = "results/severity_dat.csv", sep = ",",
+ col.names = TRUE,
+ row.names = FALSE)
Let’s do ANOVA for yield and severity data. We can use alpha level = 0.05. If we do an internet search for ANOVA in R, you will find that function aov
is appropriate for this. How do we use aov
?
> ?aov
This function takes in a formula to be tested. To find if the independent variable (Treatment
) had an effect on the dependent variable (Yield_bu_per_acre
), the formula should be written as ‘dependent_variable ~ independent_variable’. If we used blocks in our experimental design, the formula can be modified to ‘dependent_variable ~ independent_variable + block’. The function also takes in a data frame in which these variables are present.
Let’s use this function.
> aov(formula=Yield_bu_per_acre ~ Treatment, data=fungicide)
Call:
aov(formula = Yield_bu_per_acre ~ Treatment, data = fungicide)
Terms:
Treatment Residuals
Sum of Squares 14.722711 0.992333
Deg. of Freedom 2 15
Residual standard error: 0.2572072
Estimated effects may be unbalanced
As mentioned in the ‘Description’, aov
fits an ANOVA model to an experimental design. Let’s check the ‘Value’ section of the help page so we know what does it return. It returns an object, which can further be explored with print()
and summary()
. Let’s assign it to an object so that we can use print()
and summary()
on it.
> fit_yield <- aov(formula=Yield_bu_per_acre ~ Treatment, data=fungicide)
> print(fit_yield)
Call:
aov(formula = Yield_bu_per_acre ~ Treatment, data = fungicide)
Terms:
Treatment Residuals
Sum of Squares 14.722711 0.992333
Deg. of Freedom 2 15
Residual standard error: 0.2572072
Estimated effects may be unbalanced
> summary(fit_yield) # This is what we want!
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 14.723 7.361 111.3 1.01e-09 ***
Residuals 15 0.992 0.066
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output is similar to an ANOVA table. ‘Pr(>F)’ is the associated p-value. As the p-value is less than 0.05, the null hypothesis is rejected. So, at least one of the treatments is different. Now that we know that differences exist between our treatments, how can we find which treatments are different? We can do this by using Tukey’s post-hoc test. If you scroll down the help page of aov
and go the section ‘See Also’, you will see a function TukeyHSD
. This is what we need.
> TukeyHSD(fit_yield)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Yield_bu_per_acre ~ Treatment, data = fungicide)
$Treatment
diff lwr upr p adj
Fungicide_A-Control 0.030000 -0.3557208 0.4157208 0.977785
Fungicide_B-Control 1.933333 1.5476125 2.3190542 0.000000
Fungicide_B-Fungicide_A 1.903333 1.5176125 2.2890542 0.000000
There is no difference between ‘Fungicide_A’ and ‘Control’ as the p-value is 0.978 (>0.05).
Let’s do similar analysis using the Severity data
> fit_severity <- aov(formula=Severity ~ Treatment, data=fungicide)
> summary(fit_severity) # p-value is less than 0.05, so let's do Tukey's post-hoc test
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 6.401 3.201 323.7 4.6e-13 ***
Residuals 15 0.148 0.010
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(fit_severity)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Severity ~ Treatment, data = fungicide)
$Treatment
diff lwr upr p adj
Fungicide_A-Control -0.400000 -0.5491295 -0.2508705 1.28e-05
Fungicide_B-Control -1.416667 -1.5657962 -1.2675371 0.00e+00
Fungicide_B-Fungicide_A -1.016667 -1.1657962 -0.8675371 0.00e+00
All the treatment pairs are significantly different from each other as the p-value is < 0.05.