Let’s practice!

In this section we will cover:

  1. Working directories
  2. Reading in data
  3. Advanced data manipulation
  4. Statistical Analysis

Step 1: Working Directories

A working directory is the folder on your computer where you are currently working. You can find out your current working directory by typing getwd()

> getwd()
[1] "/Users/nikitagambhir/Desktop/IntroR_2019"

If you’ve downloaded and un-zipped this directory to your desktop, you might see something like /Users/<yourname>/Desktop/IntroR_2019. This is the default place where R will begin reading and writing files. For example, you can use the function list.files() to view the files in your current working directory. These are the same files that we downloaded earlier. If you’re using Rstudio, you can compare the file list with the “Files” tab in the bottom right panel.

In order to use list.files(), we should provide it with a file path. We can provide it a “.”, which means “here” to most computer programs.

> list.files(".")
[1] "data"                  "docs"                  "IntroR_2019.Rproj"    
[4] "LICENSE"               "Makefile"              "Part1-Introduction.R" 
[7] "Part2-Analysis.R"      "Part3-Visualization.R" "README.md"            

You can see that the first entry in here is “data”. This is where we’ve placed the field_data example data.

> list.files("data")
[1] "fungicide_dat.csv" "README.md"        

Step 2: Reading in Data

We can use the read.csv() function to read (or import) these data into R. It’s important to remember that while in R, these data are simply a copy kept in memory, not on the disk, so we don’t have to worry too much about accidentally deleting the data :).

So, how do we actually USE the read.csv() function?

> ?read.csv

You will notice that the help page describes all the sibling functions, too: read.table(), read.csv2(), read.delim(), read.delim2(). Let’s read the Description and Usage first. All arguments except ‘file’ have a default value (e.g. header = TRUE). You do not need to specify these arguments unless you want to change the default. Let’s use read.csv() with the default values.

> read.csv("data/fungicide_dat.csv")
     Treatment Yield_bu_per_acre Severity
1      Control            173.82      5.5
2      Control            174.23      5.6
3      Control            173.57      5.4
4      Control            173.61      5.4
5      Control            174.19      5.6
6      Control            173.80      5.5
7  Fungicide_A            173.98      5.1
8  Fungicide_A            174.27      5.2
9  Fungicide_A            173.61      5.0
10 Fungicide_A            173.88      5.0
11 Fungicide_A            174.17      5.2
12 Fungicide_A            173.49      5.1
13 Fungicide_B            175.98      4.1
14 Fungicide_B            175.58      3.9
15 Fungicide_B            175.75      4.2
16 Fungicide_B            175.88      4.0
17 Fungicide_B            175.68      4.1
18 Fungicide_B            175.95      4.2

Let’s use read.table() with the default values.

> read.table("data/fungicide_dat.csv")
                                     V1
1  Treatment,Yield_bu_per_acre,Severity
2                    Control,173.82,5.5
3                    Control,174.23,5.6
4                    Control,173.57,5.4
5                    Control,173.61,5.4
6                    Control,174.19,5.6
7                     Control,173.8,5.5
8                Fungicide_A,173.98,5.1
9                Fungicide_A,174.27,5.2
10                 Fungicide_A,173.61,5
11                 Fungicide_A,173.88,5
12               Fungicide_A,174.17,5.2
13               Fungicide_A,173.49,5.1
14               Fungicide_B,175.98,4.1
15               Fungicide_B,175.58,3.9
16               Fungicide_B,175.75,4.2
17                 Fungicide_B,175.88,4
18               Fungicide_B,175.68,4.1
19               Fungicide_B,175.95,4.2

This has two issues:

  1. The column names are treated as the first row. This is because the default for header is FALSE in read.table(). We should change to header = TRUE.
  2. The data is presented as one column, ‘V1’. This is because the separator for each cell in the data is a “,” and not “” (space). We should change to sep = “,”.
> read.table("data/fungicide_dat.csv", header = TRUE, sep = ",")
     Treatment Yield_bu_per_acre Severity
1      Control            173.82      5.5
2      Control            174.23      5.6
3      Control            173.57      5.4
4      Control            173.61      5.4
5      Control            174.19      5.6
6      Control            173.80      5.5
7  Fungicide_A            173.98      5.1
8  Fungicide_A            174.27      5.2
9  Fungicide_A            173.61      5.0
10 Fungicide_A            173.88      5.0
11 Fungicide_A            174.17      5.2
12 Fungicide_A            173.49      5.1
13 Fungicide_B            175.98      4.1
14 Fungicide_B            175.58      3.9
15 Fungicide_B            175.75      4.2
16 Fungicide_B            175.88      4.0
17 Fungicide_B            175.68      4.1
18 Fungicide_B            175.95      4.2

Now that we have these elements, we can read the data into an object, which we can call “field_data”. Once we do this, we can check the no. of rows and columns to make sure that we have all of the data.

> field_data <- read.csv("data/fungicide_dat.csv")
> nrow(field_data)
[1] 18
> ncol(field_data)
[1] 3

We can print the data to screen by simply typing its name.

> field_data
     Treatment Yield_bu_per_acre Severity
1      Control            173.82      5.5
2      Control            174.23      5.6
3      Control            173.57      5.4
4      Control            173.61      5.4
5      Control            174.19      5.6
6      Control            173.80      5.5
7  Fungicide_A            173.98      5.1
8  Fungicide_A            174.27      5.2
9  Fungicide_A            173.61      5.0
10 Fungicide_A            173.88      5.0
11 Fungicide_A            174.17      5.2
12 Fungicide_A            173.49      5.1
13 Fungicide_B            175.98      4.1
14 Fungicide_B            175.58      3.9
15 Fungicide_B            175.75      4.2
16 Fungicide_B            175.88      4.0
17 Fungicide_B            175.68      4.1
18 Fungicide_B            175.95      4.2

We can also use the View() function to look at our data in spreadsheet-style.

Exercise 1: Type View(field_data) and take a minute to inspect the data.

View() is helpful, but if we have a data frame with too many columns it may become difficult to scroll and see all the columns. We can use the str() function (short for “structure”) to have a broad overview of what our data looks like.

> str(field_data)
'data.frame':   18 obs. of  3 variables:
 $ Treatment        : Factor w/ 3 levels "Control","Fungicide_A",..: 1 1 1 1 1 1 2 2 2 2 ...
 $ Yield_bu_per_acre: num  174 174 174 174 174 ...
 $ Severity         : num  5.5 5.6 5.4 5.4 5.6 5.5 5.1 5.2 5 5 ...

The dummy data presented here consists of yield (measured in bu/acre) and disease severity (measured on a scale of 1 to 10) of a corn culitvar treated with two fungicides. This trial was conducted to measure the efficacy of the two fungicides to manage disease. The experiment was laid out as a Completely Randomized Design. Say our target journal wants the results published in S.I. units.

With these data, we want to answer the following questions:

  1. What is the mean yield of each treatment group in kg/ha?
  2. What is the percent severity of Control and Fungicide B?
  3. Which fungicide shows better results? (ANOVA)

Step 3: Advanced data manipulation

The package ‘dplyr’ provides functions for easy and advanced data manipulation. If we want to use it, we can download the package to our computer with the function install.packages(). This will install a package from CRAN and place it into your R Library.

> install.packages("dplyr", repos = "https://cran.rstudio.com")

The downloaded binary packages are in
    /var/folders/y1/z82xd0b501193md_9_d3vs9r0000gn/T//RtmpbRvnIM/downloaded_packages

To load this package, we can use the function library().

> library("dplyr")

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

1. What is the mean yield of each treatment group in kg/ha?

Let’s go step by step to answer this:

a) Convert yield data from bu/acre to kg/ha

To do this conversion for corn, we need to multiply the yield in bu/acre with 62.77. So, how can we add a column with yield data in kg/ha? We can do this similar to what we learnt in Part 1.

field_data$Yield_kg_per_ha <- field_data$Yield_bu_per_acre*62.77

We can also use the function mutate() from ‘dplyr’. This adds a new variable using existing variables. The usage of this function is as:

mutate(data, new_variable_name = calculation_based_upon_existing_variables)

> dat_kg_ha <- mutate(field_data, Yield_kg_per_ha = Yield_bu_per_acre*62.77)

Note: We did not have to use field_data$Yield_bu_per_acre. Let’s print the data to see what we have

> dat_kg_ha
     Treatment Yield_bu_per_acre Severity Yield_kg_per_ha
1      Control            173.82      5.5        10910.68
2      Control            174.23      5.6        10936.42
3      Control            173.57      5.4        10894.99
4      Control            173.61      5.4        10897.50
5      Control            174.19      5.6        10933.91
6      Control            173.80      5.5        10909.43
7  Fungicide_A            173.98      5.1        10920.72
8  Fungicide_A            174.27      5.2        10938.93
9  Fungicide_A            173.61      5.0        10897.50
10 Fungicide_A            173.88      5.0        10914.45
11 Fungicide_A            174.17      5.2        10932.65
12 Fungicide_A            173.49      5.1        10889.97
13 Fungicide_B            175.98      4.1        11046.26
14 Fungicide_B            175.58      3.9        11021.16
15 Fungicide_B            175.75      4.2        11031.83
16 Fungicide_B            175.88      4.0        11039.99
17 Fungicide_B            175.68      4.1        11027.43
18 Fungicide_B            175.95      4.2        11044.38

We have a new column Yield_kg_per_ha. We have two columns with yield and one with severity. We don’t want severity data and want yield data only in kg/ha.

b) Create a new data frame with only Treatment and Yield_kg_per_ha columns

We can use the function select() that picks variables based on their names.

> yield_kg_ha <- select(dat_kg_ha, Treatment, Yield_kg_per_ha)
> yield_kg_ha
     Treatment Yield_kg_per_ha
1      Control        10910.68
2      Control        10936.42
3      Control        10894.99
4      Control        10897.50
5      Control        10933.91
6      Control        10909.43
7  Fungicide_A        10920.72
8  Fungicide_A        10938.93
9  Fungicide_A        10897.50
10 Fungicide_A        10914.45
11 Fungicide_A        10932.65
12 Fungicide_A        10889.97
13 Fungicide_B        11046.26
14 Fungicide_B        11021.16
15 Fungicide_B        11031.83
16 Fungicide_B        11039.99
17 Fungicide_B        11027.43
18 Fungicide_B        11044.38

c) Find the mean yield

If we want to summarize multiple values to a single value, for example, mean, we can use the function summarize().

> summarize(yield_kg_ha, Mean_yield = mean(Yield_kg_per_ha))
  Mean_yield
1    10954.9

Wait, we just got one mean even though we had three different groups (Control, Fungicide_A, Fungicide_B). We need to find the mean for every group. How can we tell R that it needs to group the data according to the treatment?

d) Group data according to treatment

We can use the function group_by() for this purpose.

> grp_yield_kg_ha <- group_by(yield_kg_ha, Treatment)
> grp_yield_kg_ha 
# A tibble: 18 x 2
# Groups:   Treatment [3]
   Treatment   Yield_kg_per_ha
   <fct>                 <dbl>
 1 Control              10911.
 2 Control              10936.
 3 Control              10895.
 4 Control              10897.
 5 Control              10934.
 6 Control              10909.
 7 Fungicide_A          10921.
 8 Fungicide_A          10939.
 9 Fungicide_A          10897.
10 Fungicide_A          10914.
11 Fungicide_A          10933.
12 Fungicide_A          10890.
13 Fungicide_B          11046.
14 Fungicide_B          11021.
15 Fungicide_B          11032.
16 Fungicide_B          11040.
17 Fungicide_B          11027.
18 Fungicide_B          11044.

You will notice that yield_kg_ha and grp_yield_kg_ha are a little different. grp_yield_kg_ha is also a “tibble”, which is a form of data frame that gives more information about our data (e.g. what kind of data the columns are). You will notice that Treatment is a factor, while Yield_kg_per_ha is a double (decimal). It also tells us that our data is grouped by Treatment (therefore, it has three groups). To add the grouping information to this data frame, ‘dplyr’ uses a sister package to convert it into a tibble. Also, note that Yield_kg_per_ha is not printing the whole answer (after decimals) and has underlined first two digits. The tibble format is different when we print it, but the same when we use View(). The digits are underlined so that it is easier to read larger numbers. It’s role is similar to a comma. So, ‘10,000’ will have ‘10’ underlined and ‘100,000’ will have ‘100’ underlined.

e) Find the mean yield of every group

> mean_yield_dat <- summarize(grp_yield_kg_ha, Mean_yield = mean(Yield_kg_per_ha))
> mean_yield_dat
# A tibble: 3 x 2
  Treatment   Mean_yield
  <fct>            <dbl>
1 Control         10914.
2 Fungicide_A     10916.
3 Fungicide_B     11035.

Instead of creating different objects everytime, we can perform multiple functions at once and create one final object. We use the pipe operator, %>% , for this purpose. The left hand side of %>% acts as the input on which an operation on the right side is performed. A shortcut to write %>% is ctrl+shift+m.

> yield_summary <- field_data %>% 
+   mutate(Yield_kg_per_ha = Yield_bu_per_acre*62.77) %>%  
+   select(Treatment, Yield_kg_per_ha) %>% 
+   group_by(Treatment) %>% 
+   summarise(Mean_yield = mean(Yield_kg_per_ha))
> yield_summary
# A tibble: 3 x 2
  Treatment   Mean_yield
  <fct>            <dbl>
1 Control         10914.
2 Fungicide_A     10916.
3 Fungicide_B     11035.
> View(yield_summary)

We answered our first question!

You will notice that on the console below, the code for creating yield_summary has + signs before every line. The plus signs just indicate that R is waiting for the code to be finished. This sign also appears when we forget to put a closing paranthesis ‘)’. Try this:

> str(yield_summary
+ 
+ )
Classes 'tbl_df', 'tbl' and 'data.frame':   3 obs. of  2 variables:
 $ Treatment : Factor w/ 3 levels "Control","Fungicide_A",..: 1 2 3
 $ Mean_yield: num  10914 10916 11035

Before moving to our second question, let’s do an exercise. You may use the cheatsheet for ‘dplyr’ (provided in the workshop or can be accessed from the ‘Help’ tab) or ‘Google’ to do this:

Exercise 2: Using the existing object yield_summary, create a new

object test_summary in which the column Mean_yield is renamed to

Mean_yield_kg_ha. Hint: rename it ;)

> test_summary <- yield_summary %>% 
+   rename(Mean_yield_kg_ha = Mean_yield)
> test_summary
# A tibble: 3 x 2
  Treatment   Mean_yield_kg_ha
  <fct>                  <dbl>
1 Control               10914.
2 Fungicide_A           10916.
3 Fungicide_B           11035.

Let’s go to our second question.

2. What is the percent severity of Control and Fungicide B?

First, we need to create a new data frame where we only have severity data for Control and Fungicide B. If we want to filter a data frame based upon specific values of a variable, we can use the function filter().

> filter(field_data, Treatment == "Control" | Treatment == "Fungicide_B")
     Treatment Yield_bu_per_acre Severity
1      Control            173.82      5.5
2      Control            174.23      5.6
3      Control            173.57      5.4
4      Control            173.61      5.4
5      Control            174.19      5.6
6      Control            173.80      5.5
7  Fungicide_B            175.98      4.1
8  Fungicide_B            175.58      3.9
9  Fungicide_B            175.75      4.2
10 Fungicide_B            175.88      4.0
11 Fungicide_B            175.68      4.1
12 Fungicide_B            175.95      4.2

The treatment column should either be equal (==) to Control OR (|) Fungicide_B. We can also write the same expression as:

> filter(field_data, Treatment != "Fungicide_A")
     Treatment Yield_bu_per_acre Severity
1      Control            173.82      5.5
2      Control            174.23      5.6
3      Control            173.57      5.4
4      Control            173.61      5.4
5      Control            174.19      5.6
6      Control            173.80      5.5
7  Fungicide_B            175.98      4.1
8  Fungicide_B            175.58      3.9
9  Fungicide_B            175.75      4.2
10 Fungicide_B            175.88      4.0
11 Fungicide_B            175.68      4.1
12 Fungicide_B            175.95      4.2

The treatment column should not be equal to (!=) to Fungicide_A. So, it will have everything except Fungicide A (“Control” and “Fungicide_B”).

After filtering, we need to add a column Percent_Severity, where severity is measured in percent and not on a scale of 1 to 10. Any guess on which function should we use? Let’s perform all operations in one go.

> severity_dat <- field_data %>% 
+   filter(Treatment == "Control" | Treatment == "Fungicide_B") %>% 
+   mutate(Percent_Severity = Severity/10*100) %>% 
+   select(Treatment, Percent_Severity)
> severity_dat
     Treatment Percent_Severity
1      Control               55
2      Control               56
3      Control               54
4      Control               54
5      Control               56
6      Control               55
7  Fungicide_B               41
8  Fungicide_B               39
9  Fungicide_B               42
10 Fungicide_B               40
11 Fungicide_B               41
12 Fungicide_B               42

We do not see a column for the number of replications for each treatment. They might be unequal (environmental factors like insect damage, etc. can reduce the pool from where data can be collected).

Exercise 3: Find the no. of replications for each treatment. Don’t count manually.

> field_data %>% 
+   group_by(Treatment) %>% 
+   count(Treatment)
# A tibble: 3 x 2
# Groups:   Treatment [3]
  Treatment       n
  <fct>       <int>
1 Control         6
2 Fungicide_A     6
3 Fungicide_B     6

Now that we have answered our first two questions, what if we wanted to use these data as a table in a paper, we should export it to a csv file. We can do this using the function write.table(). Before that, let’s create a new directory (or folder) named results. We can do this using the function dir.create().

> dir.create("results")
> write.table(severity_dat, file = "results/severity_dat.csv", sep = ",", 
+             col.names = TRUE, 
+             row.names = FALSE)

Step 4: Statistical Analysis

Let’s solve our final question.

3. Which fungicide shows better results? (ANOVA)

Let’s do ANOVA for yield and severity data. We can use alpha level = 0.05. If we do an internet search for ANOVA in R, you will find that function aov is appropriate for this. How do we use aov?

> ?aov

This function takes in a formula to be tested. To find if the independent variable (Treatment) had an effect on the dependent variable (Yield_bu_per_acre), the formula should be written as ‘dependent_variable ~ independent_variable’. If we used blocks in our experimental design, the formula can be modified to ‘dependent_variable ~ independent_variable + block’. The function also takes in a data frame in which these variables are present.

Let’s use this function.

> aov(formula=Yield_bu_per_acre ~ Treatment, data=field_data) 
Call:
   aov(formula = Yield_bu_per_acre ~ Treatment, data = field_data)

Terms:
                Treatment Residuals
Sum of Squares  14.722711  0.992333
Deg. of Freedom         2        15

Residual standard error: 0.2572072
Estimated effects may be unbalanced

As mentioned in the ‘Description’, aov fits an ANOVA model to an experimental design. Let’s check the ‘Value’ section of the help page so we know what does it return. It returns an object, which can further be explored with print() and summary(). Let’s assign it to an object so that we can use print() and summary() on it.

> fit_yield <- aov(formula=Yield_bu_per_acre ~ Treatment, data=field_data) 
> print(fit_yield) 
Call:
   aov(formula = Yield_bu_per_acre ~ Treatment, data = field_data)

Terms:
                Treatment Residuals
Sum of Squares  14.722711  0.992333
Deg. of Freedom         2        15

Residual standard error: 0.2572072
Estimated effects may be unbalanced
> summary(fit_yield) # This is what we want!
            Df Sum Sq Mean Sq F value   Pr(>F)    
Treatment    2 14.723   7.361   111.3 1.01e-09 ***
Residuals   15  0.992   0.066                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output is similar to an ANOVA table. ‘Pr(>F)’ is the associated p-value. As the p-value is less than 0.05, the null hypothesis is rejected. So, at least one of the treatments is different. Now that we know that differences exist between our treatments, how can we find which treatments are different? We can do this by using Tukey’s post-hoc test. If you scroll down the help page of aov and go the section ‘See Also’, you will see a function TukeyHSD. This is what we need.

> TukeyHSD(fit_yield) 
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Yield_bu_per_acre ~ Treatment, data = field_data)

$Treatment
                            diff        lwr       upr    p adj
Fungicide_A-Control     0.030000 -0.3557208 0.4157208 0.977785
Fungicide_B-Control     1.933333  1.5476125 2.3190542 0.000000
Fungicide_B-Fungicide_A 1.903333  1.5176125 2.2890542 0.000000

There is no difference between ‘Fungicide_A’ and ‘Control’ as the p-value is 0.978 (>0.05).

Let’s do a similar analysis using the Severity data.

> fit_severity <- aov(formula=Severity ~ Treatment, data=field_data)
> summary(fit_severity) # p-value is less than 0.05, so let's do Tukey's post-hoc test
            Df Sum Sq Mean Sq F value  Pr(>F)    
Treatment    2  6.401   3.201   323.7 4.6e-13 ***
Residuals   15  0.148   0.010                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(fit_severity) 
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Severity ~ Treatment, data = field_data)

$Treatment
                             diff        lwr        upr    p adj
Fungicide_A-Control     -0.400000 -0.5491295 -0.2508705 1.28e-05
Fungicide_B-Control     -1.416667 -1.5657962 -1.2675371 0.00e+00
Fungicide_B-Fungicide_A -1.016667 -1.1657962 -0.8675371 0.00e+00

All the treatment pairs are significantly different from each other as the p-value is < 0.05.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.