In this section we will cover:
A working directory is the folder on your computer where you are currently working. You can find out your current working directory by typing getwd()
> getwd()
[1] "/Users/nikitagambhir/Desktop/APS_IntroR_2019"
If you’ve downloaded and un-zipped this directory to your desktop, you might see something like /Users/<yourname>/Desktop/APS_IntroR_2019
. This is the default place where R will begin reading and writing files. For example, you can use the function list.files()
to view the files in your current working directory. These are the same files that we downloaded earlier. If you’re using Rstudio, you can compare the file list with the “Files” tab in the bottom right panel.
In order to use list.files()
, we should provide it with a file path. We can provide it a “.”, which means “here” to most computer programs.
> list.files(".")
[1] "APS_IntroR_2019.Rproj" "data" "docs"
[4] "LICENSE" "Makefile" "Part1-Introduction.R"
[7] "Part2-Analysis.R" "Part3-Visualization.R" "README.md"
You can see that the first entry in here is “data”. This is where we’ve placed the field_data example data.
> list.files("data")
[1] "fungicide_dat.csv" "README.md"
We can use the read.csv()
function to read (or import) these data into R. It’s important to remember that while in R, these data are simply a copy kept in memory, not on the disk, so we don’t have to worry too much about accidentally deleting the data :).
So, how do we actually USE the read.csv()
function?
> ?read.csv
You will notice that the help page describes all the sibling functions, too: read.table()
, read.csv2()
, read.delim()
, read.delim2()
. Let’s read the Description and Usage first. All arguments except ‘file’ have a default value (e.g. header = TRUE). You do not need to specify these arguments unless you want to change the default. Let’s use read.csv()
with the default values.
> read.csv("data/fungicide_dat.csv")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
13 Fungicide_B 175.98 4.1
14 Fungicide_B 175.58 3.9
15 Fungicide_B 175.75 4.2
16 Fungicide_B 175.88 4.0
17 Fungicide_B 175.68 4.1
18 Fungicide_B 175.95 4.2
Let’s use read.table()
with the default values.
> read.table("data/fungicide_dat.csv")
V1
1 Treatment,Yield_bu_per_acre,Severity
2 Control,173.82,5.5
3 Control,174.23,5.6
4 Control,173.57,5.4
5 Control,173.61,5.4
6 Control,174.19,5.6
7 Control,173.8,5.5
8 Fungicide_A,173.98,5.1
9 Fungicide_A,174.27,5.2
10 Fungicide_A,173.61,5
11 Fungicide_A,173.88,5
12 Fungicide_A,174.17,5.2
13 Fungicide_A,173.49,5.1
14 Fungicide_B,175.98,4.1
15 Fungicide_B,175.58,3.9
16 Fungicide_B,175.75,4.2
17 Fungicide_B,175.88,4
18 Fungicide_B,175.68,4.1
19 Fungicide_B,175.95,4.2
This has two issues:
read.table()
. We should change to header = TRUE.> read.table("data/fungicide_dat.csv", header = TRUE, sep = ",")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
13 Fungicide_B 175.98 4.1
14 Fungicide_B 175.58 3.9
15 Fungicide_B 175.75 4.2
16 Fungicide_B 175.88 4.0
17 Fungicide_B 175.68 4.1
18 Fungicide_B 175.95 4.2
Now that we have these elements, we can read the data into an object, which we can call “field_data”. Once we do this, we can check the no. of rows and columns to make sure that we have all of the data.
> field_data <- read.csv("data/fungicide_dat.csv")
> nrow(field_data)
[1] 18
> ncol(field_data)
[1] 3
We can print the data to screen by simply typing its name.
> field_data
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_A 173.98 5.1
8 Fungicide_A 174.27 5.2
9 Fungicide_A 173.61 5.0
10 Fungicide_A 173.88 5.0
11 Fungicide_A 174.17 5.2
12 Fungicide_A 173.49 5.1
13 Fungicide_B 175.98 4.1
14 Fungicide_B 175.58 3.9
15 Fungicide_B 175.75 4.2
16 Fungicide_B 175.88 4.0
17 Fungicide_B 175.68 4.1
18 Fungicide_B 175.95 4.2
We can also use the View()
function to look at our data in spreadsheet-style.
View()
is helpful, but if we have a data frame with too many columns it may become difficult to scroll and see all the columns. We can use the str()
function (short for “structure”) to have a broad overview of what our data looks like.
> str(field_data)
'data.frame': 18 obs. of 3 variables:
$ Treatment : Factor w/ 3 levels "Control","Fungicide_A",..: 1 1 1 1 1 1 2 2 2 2 ...
$ Yield_bu_per_acre: num 174 174 174 174 174 ...
$ Severity : num 5.5 5.6 5.4 5.4 5.6 5.5 5.1 5.2 5 5 ...
The dummy data presented here consists of yield (measured in bu/acre) and disease severity (measured on a scale of 1 to 10) of a corn culitvar treated with two fungicides. This trial was conducted to measure the efficacy of the two fungicides to manage disease. The experiment was laid out as a Completely Randomized Design. Say our target journal wants the results published in S.I. units.
With these data, we want to answer the following questions:
The package ‘dplyr’ provides functions for easy and advanced data manipulation. If we want to use it, we can download the package to our computer with the function install.packages()
. This will install a package from CRAN and place it into your R Library.
> install.packages("dplyr", repos = "https://cran.rstudio.com")
The downloaded binary packages are in
/var/folders/y1/z82xd0b501193md_9_d3vs9r0000gn/T//RtmpC8bCbI/downloaded_packages
To load this package, we can use the function library()
.
> library("dplyr")
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Let’s go step by step to answer this:
To do this conversion for corn, we need to multiply the yield in bu/acre with 62.77. So, how can we add a column with yield data in kg/ha? We can do this similar to what we learnt in Part 1.
field_data$Yield_kg_per_ha <- field_data$Yield_bu_per_acre*62.77
We can also use the function mutate()
from ‘dplyr’. This adds a new variable using existing variables. The usage of this function is as:
mutate(data, new_variable_name = calculation_based_upon_existing_variables)
> dat_kg_ha <- mutate(field_data, Yield_kg_per_ha = Yield_bu_per_acre*62.77)
Note: We did not have to use field_data$Yield_bu_per_acre. Let’s print the data to see what we have
> dat_kg_ha
Treatment Yield_bu_per_acre Severity Yield_kg_per_ha
1 Control 173.82 5.5 10910.68
2 Control 174.23 5.6 10936.42
3 Control 173.57 5.4 10894.99
4 Control 173.61 5.4 10897.50
5 Control 174.19 5.6 10933.91
6 Control 173.80 5.5 10909.43
7 Fungicide_A 173.98 5.1 10920.72
8 Fungicide_A 174.27 5.2 10938.93
9 Fungicide_A 173.61 5.0 10897.50
10 Fungicide_A 173.88 5.0 10914.45
11 Fungicide_A 174.17 5.2 10932.65
12 Fungicide_A 173.49 5.1 10889.97
13 Fungicide_B 175.98 4.1 11046.26
14 Fungicide_B 175.58 3.9 11021.16
15 Fungicide_B 175.75 4.2 11031.83
16 Fungicide_B 175.88 4.0 11039.99
17 Fungicide_B 175.68 4.1 11027.43
18 Fungicide_B 175.95 4.2 11044.38
We have a new column Yield_kg_per_ha
. We have two columns with yield and one with severity. We don’t want severity data and want yield data only in kg/ha.
Treatment
and Yield_kg_per_ha
columnsWe can use the function select()
that picks variables based on their names.
> yield_kg_ha <- select(dat_kg_ha, Treatment, Yield_kg_per_ha)
> yield_kg_ha
Treatment Yield_kg_per_ha
1 Control 10910.68
2 Control 10936.42
3 Control 10894.99
4 Control 10897.50
5 Control 10933.91
6 Control 10909.43
7 Fungicide_A 10920.72
8 Fungicide_A 10938.93
9 Fungicide_A 10897.50
10 Fungicide_A 10914.45
11 Fungicide_A 10932.65
12 Fungicide_A 10889.97
13 Fungicide_B 11046.26
14 Fungicide_B 11021.16
15 Fungicide_B 11031.83
16 Fungicide_B 11039.99
17 Fungicide_B 11027.43
18 Fungicide_B 11044.38
If we want to summarize multiple values to a single value, for example, mean, we can use the function summarize()
.
> summarize(yield_kg_ha, Mean_yield = mean(Yield_kg_per_ha))
Mean_yield
1 10954.9
Wait, we just got one mean even though we had three different groups (Control, Fungicide_A, Fungicide_B). We need to find the mean for every group. How can we tell R that it needs to group the data according to the treatment?
We can use the function group_by()
for this purpose.
> grp_yield_kg_ha <- group_by(yield_kg_ha, Treatment)
> grp_yield_kg_ha
# A tibble: 18 x 2
# Groups: Treatment [3]
Treatment Yield_kg_per_ha
<fct> <dbl>
1 Control 10911.
2 Control 10936.
3 Control 10895.
4 Control 10897.
5 Control 10934.
6 Control 10909.
7 Fungicide_A 10921.
8 Fungicide_A 10939.
9 Fungicide_A 10897.
10 Fungicide_A 10914.
11 Fungicide_A 10933.
12 Fungicide_A 10890.
13 Fungicide_B 11046.
14 Fungicide_B 11021.
15 Fungicide_B 11032.
16 Fungicide_B 11040.
17 Fungicide_B 11027.
18 Fungicide_B 11044.
You will notice that yield_kg_ha
and grp_yield_kg_ha
are a little different. grp_yield_kg_ha
is also a “tibble”, which is a form of data frame that gives more information about our data (e.g. what kind of data the columns are). You will notice that Treatment is a factor, while Yield_kg_per_ha is a double (decimal). It also tells us that our data is grouped by Treatment (therefore, it has three groups). To add the grouping information to this data frame, ‘dplyr’ uses a sister package to convert it into a tibble. Also, note that Yield_kg_per_ha
is not printing the whole answer (after decimals) and has underlined first two digits. The tibble format is different when we print it, but the same when we use View()
. The digits are underlined so that it is easier to read larger numbers. It’s role is similar to a comma. So, ‘10,000’ will have ‘10’ underlined and ‘100,000’ will have ‘100’ underlined.
> mean_yield_dat <- summarize(grp_yield_kg_ha, Mean_yield = mean(Yield_kg_per_ha))
> mean_yield_dat
# A tibble: 3 x 2
Treatment Mean_yield
<fct> <dbl>
1 Control 10914.
2 Fungicide_A 10916.
3 Fungicide_B 11035.
Instead of creating different objects everytime, we can perform multiple functions at once and create one final object. We use the pipe operator, %>% , for this purpose. The left hand side of %>% acts as the input on which an operation on the right side is performed. A shortcut to write %>% is ctrl+shift+m
.
> yield_summary <- field_data %>%
+ mutate(Yield_kg_per_ha = Yield_bu_per_acre*62.77) %>%
+ select(Treatment, Yield_kg_per_ha) %>%
+ group_by(Treatment) %>%
+ summarise(Mean_yield = mean(Yield_kg_per_ha))
> yield_summary
# A tibble: 3 x 2
Treatment Mean_yield
<fct> <dbl>
1 Control 10914.
2 Fungicide_A 10916.
3 Fungicide_B 11035.
> View(yield_summary)
We answered our first question!
You will notice that on the console below, the code for creating yield_summary
has +
signs before every line. The plus signs just indicate that R is waiting for the code to be finished. This sign also appears when we forget to put a closing paranthesis ‘)’. Try this:
> str(yield_summary
+
+ )
Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of 2 variables:
$ Treatment : Factor w/ 3 levels "Control","Fungicide_A",..: 1 2 3
$ Mean_yield: num 10914 10916 11035
Before moving to our second question, let’s do an exercise. You may use the cheatsheet for ‘dplyr’ (provided in the workshop or can be accessed from the ‘Help’ tab) or ‘Google’ to do this:
yield_summary
, create a newtest_summary
in which the column Mean_yield
is renamed toMean_yield_kg_ha
. Hint: rename
it ;)> test_summary <- yield_summary %>%
+ rename(Mean_yield_kg_ha = Mean_yield)
> test_summary
# A tibble: 3 x 2
Treatment Mean_yield_kg_ha
<fct> <dbl>
1 Control 10914.
2 Fungicide_A 10916.
3 Fungicide_B 11035.
Let’s go to our second question.
First, we need to create a new data frame where we only have severity data for Control and Fungicide B. If we want to filter a data frame based upon specific values of a variable, we can use the function filter()
.
> filter(field_data, Treatment == "Control" | Treatment == "Fungicide_B")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_B 175.98 4.1
8 Fungicide_B 175.58 3.9
9 Fungicide_B 175.75 4.2
10 Fungicide_B 175.88 4.0
11 Fungicide_B 175.68 4.1
12 Fungicide_B 175.95 4.2
The treatment column should either be equal (==
) to Control OR (|
) Fungicide_B. We can also write the same expression as:
> filter(field_data, Treatment != "Fungicide_A")
Treatment Yield_bu_per_acre Severity
1 Control 173.82 5.5
2 Control 174.23 5.6
3 Control 173.57 5.4
4 Control 173.61 5.4
5 Control 174.19 5.6
6 Control 173.80 5.5
7 Fungicide_B 175.98 4.1
8 Fungicide_B 175.58 3.9
9 Fungicide_B 175.75 4.2
10 Fungicide_B 175.88 4.0
11 Fungicide_B 175.68 4.1
12 Fungicide_B 175.95 4.2
The treatment column should not be equal to (!=
) to Fungicide_A. So, it will have everything except Fungicide A (“Control” and “Fungicide_B”).
After filtering, we need to add a column Percent_Severity
, where severity is measured in percent and not on a scale of 1 to 10. Any guess on which function should we use? Let’s perform all operations in one go.
> severity_dat <- field_data %>%
+ filter(Treatment == "Control" | Treatment == "Fungicide_B") %>%
+ mutate(Percent_Severity = Severity/10*100) %>%
+ select(Treatment, Percent_Severity)
> severity_dat
Treatment Percent_Severity
1 Control 55
2 Control 56
3 Control 54
4 Control 54
5 Control 56
6 Control 55
7 Fungicide_B 41
8 Fungicide_B 39
9 Fungicide_B 42
10 Fungicide_B 40
11 Fungicide_B 41
12 Fungicide_B 42
We do not see a column for the number of replications for each treatment. They might be unequal (environmental factors like insect damage, etc. can reduce the pool from where data can be collected).
count
manually.> field_data %>%
+ group_by(Treatment) %>%
+ count(Treatment)
# A tibble: 3 x 2
# Groups: Treatment [3]
Treatment n
<fct> <int>
1 Control 6
2 Fungicide_A 6
3 Fungicide_B 6
Now that we have answered our first two questions, what if we wanted to use these data as a table in a paper, we should export it to a csv file. We can do this using the function write.table()
. Before that, let’s create a new directory (or folder) named results. We can do this using the function dir.create()
.
> dir.create("results")
> write.table(severity_dat, file = "results/severity_dat.csv", sep = ",",
+ col.names = TRUE,
+ row.names = FALSE)
Let’s solve our final question.
Let’s do ANOVA for yield and severity data. We can use alpha level = 0.05. If we do an internet search for ANOVA in R, you will find that function aov
is appropriate for this. How do we use aov
?
> ?aov
This function takes in a formula to be tested. To find if the independent variable (Treatment
) had an effect on the dependent variable (Yield_bu_per_acre
), the formula should be written as ‘dependent_variable ~ independent_variable’. If we used blocks in our experimental design, the formula can be modified to ‘dependent_variable ~ independent_variable + block’. The function also takes in a data frame in which these variables are present.
Let’s use this function.
> aov(formula=Yield_bu_per_acre ~ Treatment, data=field_data)
Call:
aov(formula = Yield_bu_per_acre ~ Treatment, data = field_data)
Terms:
Treatment Residuals
Sum of Squares 14.722711 0.992333
Deg. of Freedom 2 15
Residual standard error: 0.2572072
Estimated effects may be unbalanced
As mentioned in the ‘Description’, aov
fits an ANOVA model to an experimental design. Let’s check the ‘Value’ section of the help page so we know what does it return. It returns an object, which can further be explored with print()
and summary()
. Let’s assign it to an object so that we can use print()
and summary()
on it.
> fit_yield <- aov(formula=Yield_bu_per_acre ~ Treatment, data=field_data)
> print(fit_yield)
Call:
aov(formula = Yield_bu_per_acre ~ Treatment, data = field_data)
Terms:
Treatment Residuals
Sum of Squares 14.722711 0.992333
Deg. of Freedom 2 15
Residual standard error: 0.2572072
Estimated effects may be unbalanced
> summary(fit_yield) # This is what we want!
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 14.723 7.361 111.3 1.01e-09 ***
Residuals 15 0.992 0.066
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output is similar to an ANOVA table. ‘Pr(>F)’ is the associated p-value. As the p-value is less than 0.05, the null hypothesis is rejected. So, at least one of the treatments is different. Now that we know that differences exist between our treatments, how can we find which treatments are different? We can do this by using Tukey’s post-hoc test. If you scroll down the help page of aov
and go the section ‘See Also’, you will see a function TukeyHSD
. This is what we need.
> TukeyHSD(fit_yield)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Yield_bu_per_acre ~ Treatment, data = field_data)
$Treatment
diff lwr upr p adj
Fungicide_A-Control 0.030000 -0.3557208 0.4157208 0.977785
Fungicide_B-Control 1.933333 1.5476125 2.3190542 0.000000
Fungicide_B-Fungicide_A 1.903333 1.5176125 2.2890542 0.000000
There is no difference between ‘Fungicide_A’ and ‘Control’ as the p-value is 0.978 (>0.05).
Let’s do a similar analysis using the Severity data.
> fit_severity <- aov(formula=Severity ~ Treatment, data=field_data)
> summary(fit_severity) # p-value is less than 0.05, so let's do Tukey's post-hoc test
Df Sum Sq Mean Sq F value Pr(>F)
Treatment 2 6.401 3.201 323.7 4.6e-13 ***
Residuals 15 0.148 0.010
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(fit_severity)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Severity ~ Treatment, data = field_data)
$Treatment
diff lwr upr p adj
Fungicide_A-Control -0.400000 -0.5491295 -0.2508705 1.28e-05
Fungicide_B-Control -1.416667 -1.5657962 -1.2675371 0.00e+00
Fungicide_B-Fungicide_A -1.016667 -1.1657962 -0.8675371 0.00e+00
All the treatment pairs are significantly different from each other as the p-value is < 0.05.