Coding for Reproducible Research
Laura Roldan-Gomez
11/1/22
Emma Walker
Xinran Du
xd247@exeter.ac.uk
Laura Roldan-Gomez
lr480@exeter.ac.uk
Part of Coding for Reproducible Research Programme
Why are you here?
This material is new and designed by Theresa Walker alongside a team of people including Emma and myself.
It is based on other materials and our experience.
I have taught R before, on multiple occasions.
So today will be an experiment but should run smoothly.
Check out the material here
Be kind by default
The rest of the rules of engagement are available on the course website.
Method of delivery: three parts with tasks and breaks every hour or so
Please sign in/register your attendance, via an online sheet link
Data types
Data structures
DATA FRAMES
Matrices (bonus track)
Take a few minutes for:
questions and troubleshooting
opening RStudio, and…
setting your working directory.
… setwd()
… or Session button
Data types define how values are stored in the computer. There are 3 core data types: numeric, character and logical.
[1] "numeric"
Character data consists of letters, words, sentences or symbols, such as:
[1] "a"
[1] "datatypes"
[1] "Learning R is fun"
[1] "@%£$"
Basically, anything within “” is a character or string, so… numbers can also be characters!
NOW…
Logical values can either be TRUE or FALSE
or sometimes 1 and 0.
That means that if you type:
…if you don’t use as.logical(), you get a set of numbers rather than logical values.
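A minimal sketch of the difference. The object names nums and logi are made up for illustration.

```r
# Stored like this, 1 and 0 are just numbers
nums <- c(1, 0, 1, 1)
class(nums)        # "numeric"

# as.logical() turns 0 into FALSE and any other number into TRUE
logi <- as.logical(nums)
logi               # TRUE FALSE TRUE TRUE
class(logi)        # "logical"

# Going the other way, TRUE counts as 1 and FALSE as 0
sum(logi)          # 3
```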
When we group values of these data types together, they form a data structure. The structures covered in this session are vectors (atomic vectors and lists), data frames, factors and matrices.
A vector is the most common and basic data structure in R. Technically, vectors can be one of two types: atomic vectors and lists.
An atomic vector only holds elements (or atoms) of a single data type, and each atom is a scalar, which means it “has length one”. It is homogeneous.
Examples:
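A few atomic vectors, as a sketch. The names num_vec, chr_vec and log_vec are hypothetical and only used here.

```r
# Each vector holds values of one data type only
num_vec <- c(1.5, 2, 10)           # numeric (double)
chr_vec <- c("a", "b", "c")        # character
log_vec <- c(TRUE, FALSE, TRUE)    # logical

typeof(num_vec)   # "double"
typeof(chr_vec)   # "character"
typeof(log_vec)   # "logical"

# A single value is still a vector, of length one
length(3)         # 1
```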
CONCATENATE, INDEXING AND COERCION
The concatenate (or combine) function c() constructs a vector. We used it above to construct the logical vector.
c() can also be used to add elements to a vector.
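A short sketch of both uses of c(). The name y_with_zero is made up for this example.

```r
# Build a vector with c()
y <- c(1, 2, 3, 4)

# Add an element at the end...
y <- c(y, 5)

# ...or at the start
y_with_zero <- c(0, y)

y              # 1 2 3 4 5
y_with_zero    # 0 1 2 3 4 5
```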
Let’s put this together and create our first little data set. We’ll work with data sets further down but I want to wrap up this section with one of the functions of vectors.
We use two functions: as.data.frame and cbind or column bind.
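One way this could look, as a sketch. The vector names and values are taken from the output shown later in this section; the exact calls are an assumption, and stringsAsFactors = FALSE is only added so that older R versions behave the same way.

```r
# The individual vectors
v_name      <- c("Greg", "Sarah", "Tracy", "Jon", "Annette")
x           <- 3                       # length one, recycled by cbind()
y           <- 1:5
y_times_two <- y * 2
z           <- c("a", "b", "c", "d", "e")
v_log       <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# cbind() glues the vectors together as columns,
# as.data.frame() turns the result into a data frame
ourdata <- as.data.frame(
  cbind(v_name, x, y, y_times_two, z, v_log),
  stringsAsFactors = FALSE
)

dim(ourdata)     # 5 6
str(ourdata)     # every column is chr at this point
```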
Every atom in a vector has a ‘location’ represented by a number or index. Our vector x from the previous exercise has length one. So the index for 3 is 1.
Likewise, vector z has 5 letters and therefore, 5 indexes.
We can call these elements by their index using []
numeric(0)
[1] NA
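A sketch of indexing with [], using the vectors x and z described above (z is assumed to hold the letters a to e).

```r
x <- 3
z <- c("a", "b", "c", "d", "e")

x[1]       # 3
z[2]       # "b"
z[2:4]     # "b" "c" "d"

# Asking for a position that does not exist
x[0]       # numeric(0): index 0 returns an empty vector
x[2]       # NA: there is no second element
```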
The positions in a data frame behave like a coordinate [row, column].
So you have
And you can also use indexing to get whole columns or rows.
v_name y y_times_two z v_log
1 Greg 1 2 a TRUE
2 Sarah 2 4 b FALSE
3 Tracy 3 6 c FALSE
4 Jon 4 8 d TRUE
5 Annette 5 10 e TRUE
We can even create a new data frame using this. It’s great when you have huge data sets but you are interested in just a few variables.
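A sketch of [row, column] indexing on our data set. For brevity the data frame is rebuilt here with data.frame(), and the name smalldata is hypothetical.

```r
ourdata <- data.frame(
  v_name      = c("Greg", "Sarah", "Tracy", "Jon", "Annette"),
  x           = 3,
  y           = 1:5,
  y_times_two = (1:5) * 2,
  z           = c("a", "b", "c", "d", "e"),
  v_log       = c(TRUE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

ourdata[2, 1]      # row 2, column 1: "Sarah"
ourdata[3, ]       # the whole third row
ourdata[, 3]       # the whole third column: 1 2 3 4 5

# Drop column 2 (x) and keep the rest as a new data frame
smalldata <- ourdata[, -2]
names(smalldata)   # "v_name" "y" "y_times_two" "z" "v_log"
```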
Even though R’s vectors have a specific data type, it’s quite easy to coerce them to another type.
Let’s look at our data set and see what type of data we have:
Coercion can also be triggered by other actions. When R joined our vectors to form a data frame, it read every element as a character.
When you use real data, you must check this before running any analysis. And you need to coerce the data to whatever makes sense.
For instance, we must coerce our variables x, y and y_times_two to numeric and variable v_log to logical.
'data.frame': 5 obs. of 6 variables:
$ v_name : chr "Greg" "Sarah" "Tracy" "Jon" ...
$ x : num 3 3 3 3 3
$ y : num 1 2 3 4 5
$ y_times_two: num 2 4 6 8 10
$ z : chr "a" "b" "c" "d" ...
$ v_log : logi TRUE FALSE FALSE TRUE TRUE
Note: If you need to change something to character, you use as.character
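A sketch of the coercion step, assuming ourdata was built with cbind() so that every column starts out as character.

```r
# All-character version of the data set
ourdata <- as.data.frame(
  cbind(
    v_name      = c("Greg", "Sarah", "Tracy", "Jon", "Annette"),
    x           = 3,
    y           = 1:5,
    y_times_two = (1:5) * 2,
    z           = c("a", "b", "c", "d", "e"),
    v_log       = c(TRUE, FALSE, FALSE, TRUE, TRUE)
  ),
  stringsAsFactors = FALSE
)

# Coerce each column to the type that makes sense
ourdata$x           <- as.numeric(ourdata$x)
ourdata$y           <- as.numeric(ourdata$y)
ourdata$y_times_two <- as.numeric(ourdata$y_times_two)
ourdata$v_log       <- as.logical(ourdata$v_log)

str(ourdata)   # x, y, y_times_two are num; v_log is logi
```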
Lists are still vectors but can contain several data types (they can be heterogeneous)…
…and each atom of the list can itself be composed of more than one atom (it can have a length greater than one).
Atomic vectors are very constrained: each atom is a scalar (length one) and all atoms must be of one data type.
You might need a vector that violates these constraints - can have length > 1 and not be of the same type.
We construct a list explicitly with list() but, like atomic vectors, most lists are created some other way in real life.
# A more impressive one
(b_list <- list(logical = TRUE, integer = 4L, double = 4 * 1.2, character = "character"))
$logical
[1] TRUE
$integer
[1] 4
$double
[1] 4.8
$character
[1] "character"
# A very impressive one
(c_list <- list(letters[26:22], transcendental = c(pi, exp(1)), f = function(x) x^2))
[[1]]
[1] "z" "y" "x" "w" "v"
$transcendental
[1] 3.141593 2.718282
$f
function(x) x^2
You can also coerce other objects using as.list().
There are 3 ways to index a list and the differences are very important:
1.) With single square brackets, just like we indexed atomic vectors. Note this always returns a list, even if we request a single component.
[[1]]
[1] "four" "five"
$integer
[1] 4
$double
[1] 4.8
$transcendental
[1] 3.141593 2.718282
2.) With double square brackets. This can only be used to access a single component and it returns the “naked” component. You can request a component with a positive integer or by name.
3.) With the $ addressing named components. Like [[, this can only be used to access a single component, but it is even more limited: You must specify the component by name.
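The three indexing methods side by side, as a sketch using the b_list from above.

```r
b_list <- list(logical = TRUE, integer = 4L, double = 4 * 1.2, character = "character")

# 1. Single brackets: the result is always a list
b_list[2:3]            # a list with the components $integer and $double
class(b_list[2])       # "list"

# 2. Double brackets: one "naked" component, by position or by name
b_list[[2]]            # 4
b_list[["double"]]     # 4.8
class(b_list[[2]])     # "integer"

# 3. Dollar sign: one component, by name only
b_list$character       # "character"
```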
This is a bit advanced as it uses loops but a good example of where you can use lists…
Loop over your data to create many data frames and keep them in a list. I’m currently using this as part of my PhD, for which I am running over 13,000 models! So yes, lists can be useful.
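A toy version of that idea, as a sketch. The names groups and results and the contents of each data frame are made up; a real analysis would fit a model or filter data inside the loop.

```r
# One small data frame per group, all stored in a single list
groups  <- c("a", "b", "c")
results <- list()

for (g in groups) {
  # [[g]] stores the data frame under the name of the group
  results[[g]] <- data.frame(group = g, value = 1:3)
}

length(results)        # 3
names(results)         # "a" "b" "c"
results[["b"]]         # the data frame built for group "b"
```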
What is the class of ourdata_as_list[1]?
What is the class of ourdata_as_list[[1]]?
Consider
Use [ and [[ to attempt to retrieve elements 2 and 3 from my_vec and my_list. What succeeds vs. fails? What if you try to retrieve element 2 alone? Does [[ even work on atomic vectors? Compare and contrast the results from the various combinations of indexing method and input object.
Data frames are the data structure you will use the most for statistics or data management in general.
To get your data in R, you can do one of three things:
Create your own data set (we did that already with ‘ourdata’)
Use a built-in data set like iris
First, let’s create a data frame by hand using the command data.frame (previously, we used a different method):
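A minimal sketch of such a call. The name dat and the columns id, x and y are taken from the output in this section; the exact way the values were generated is an assumption.

```r
# Each argument becomes one column; all columns have 10 values
dat <- data.frame(
  id = letters[1:10],   # "a" to "j"
  x  = 1:10,
  y  = 11:20
)

dat
```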
Second, let’s use several commands to explore a data frame:
id x y
1 a 1 11
2 b 2 12
3 c 3 13
4 d 4 14
5 e 5 15
6 f 6 16
id x y
5 e 5 15
6 f 6 16
7 g 7 17
8 h 8 18
9 i 9 19
10 j 10 20
[1] 10 3
[1] 10
[1] 3
'data.frame': 10 obs. of 3 variables:
$ id: chr "a" "b" "c" "d" ...
$ x : int 1 2 3 4 5 6 7 8 9 10
$ y : int 11 12 13 14 15 16 17 18 19 20
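The output above is what you would expect from commands like the following. This is a sketch: dat is rebuilt here so the block runs on its own, and the exact commands used in the session are an assumption.

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

head(dat)    # first 6 rows
tail(dat)    # last 6 rows
dim(dat)     # 10 3 (rows, columns)
nrow(dat)    # 10
ncol(dat)    # 3
str(dat)     # structure: type and first values of every column
```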
Below we show that a data frame is actually a special list:
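A sketch of how to check this, using the dat data frame (rebuilt here so the block runs on its own).

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

is.list(dat)     # TRUE
class(dat)       # "data.frame"
typeof(dat)      # "list"
length(dat)      # 3: one list element per column
```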
As shown above, there are ways to retrieve specific elements from the data frame: the data frame can be sliced or indexed.
Because data frames are rectangular, elements of a data frame can be referenced by specifying the row and the column index in single square brackets.
…and because a data frame is also a list, it is possible to refer to columns (which are the elements of that list) using the list notation, i.e. either double square brackets or a $.
[1] 11 12 13 14 15 16 17 18 19 20
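Both notations on the same column, as a sketch using dat (rebuilt here so the block runs on its own).

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

# Rectangular notation: [row, column]
dat[3, 2]      # 3
dat[, 3]       # the whole third column

# List notation: the column y as a plain vector
dat[["y"]]     # 11 12 ... 20
dat$y          # same result
```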
- Add and change names:
When you look at both the dat and iris data frames from earlier, they have no row names (only row numbers):
id x y
1 a 1 11
2 b 2 12
3 c 3 13
4 d 4 14
5 e 5 15
6 f 6 16
7 g 7 17
8 h 8 18
9 i 9 19
10 j 10 20
However, when we look at the mtcars data set, it does have row names:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
To give each row a name, we can… This is a bit nonsensical, but the entire data frame is just a generic example…
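One way to do this is with rownames(), sketched below. The names row_1 to row_10 are hypothetical placeholders, not the ones used in the session.

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

# Assign one name per row (the vector must have as many names as rows)
rownames(dat) <- paste0("row_", 1:10)

head(dat, 3)
dat["row_2", ]     # rows can now be called by name
```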
We can also rename columns. Let’s assume we want to change the abbreviation of the first three columns of mtcars to the actual words:
miles per gallon cylinders displacement hp drat wt qsec
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
Datsun 710 22.8 4 108 93 3.85 2.320 18.61
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
Valiant 18.1 6 225 105 2.76 3.460 20.22
vs am gear carb
Mazda RX4 0 1 4 4
Mazda RX4 Wag 0 1 4 4
Datsun 710 1 1 4 1
Hornet 4 Drive 1 0 3 1
Hornet Sportabout 0 0 3 2
Valiant 1 0 3 1
By using [1:3] we only changed a subset of the column names. If you want to change them all, the vector with the column names must correspond to the number of columns.
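A sketch of the renaming step. It works on a copy called cars (a hypothetical name) so that the built-in mtcars stays untouched.

```r
cars <- mtcars

# Only the first three column names are replaced
colnames(cars)[1:3] <- c("miles per gallon", "cylinders", "displacement")

head(colnames(cars), 4)   # "miles per gallon" "cylinders" "displacement" "hp"

# Names with spaces need backticks when used with $
head(cars$`miles per gallon`)
```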
You can also append a column of choice to your data frame. Remember, it needs to have the same length as the other columns:
miles per gallon cylinders displacement hp drat wt
Mazda RX4 21.0 6 160.0 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875
Datsun 710 22.8 4 108.0 93 3.85 2.320
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440
Valiant 18.1 6 225.0 105 2.76 3.460
Duster 360 14.3 8 360.0 245 3.21 3.570
Merc 240D 24.4 4 146.7 62 3.69 3.190
Merc 230 22.8 4 140.8 95 3.92 3.150
Merc 280 19.2 6 167.6 123 3.92 3.440
Merc 280C 17.8 6 167.6 123 3.92 3.440
Merc 450SE 16.4 8 275.8 180 3.07 4.070
Merc 450SL 17.3 8 275.8 180 3.07 3.730
Merc 450SLC 15.2 8 275.8 180 3.07 3.780
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250
Lincoln Continental 10.4 8 460.0 215 3.00 5.424
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345
Fiat 128 32.4 4 78.7 66 4.08 2.200
Honda Civic 30.4 4 75.7 52 4.93 1.615
Toyota Corolla 33.9 4 71.1 65 4.22 1.835
Toyota Corona 21.5 4 120.1 97 3.70 2.465
Dodge Challenger 15.5 8 318.0 150 2.76 3.520
AMC Javelin 15.2 8 304.0 150 3.15 3.435
Camaro Z28 13.3 8 350.0 245 3.73 3.840
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845
Fiat X1-9 27.3 4 79.0 66 4.08 1.935
Porsche 914-2 26.0 4 120.3 91 4.43 2.140
Lotus Europa 30.4 4 95.1 113 3.77 1.513
Ford Pantera L 15.8 8 351.0 264 4.22 3.170
Ferrari Dino 19.7 6 145.0 175 3.62 2.770
Maserati Bora 15.0 8 301.0 335 3.54 3.570
Volvo 142E 21.4 4 121.0 109 4.11 2.780
qsec vs am gear carb favorites
Mazda RX4 16.46 0 1 4 4 1
Mazda RX4 Wag 17.02 0 1 4 4 2
Datsun 710 18.61 1 1 4 1 3
Hornet 4 Drive 19.44 1 0 3 1 4
Hornet Sportabout 17.02 0 0 3 2 5
Valiant 20.22 1 0 3 1 6
Duster 360 15.84 0 0 3 4 7
Merc 240D 20.00 1 0 4 2 8
Merc 230 22.90 1 0 4 2 9
Merc 280 18.30 1 0 4 4 10
Merc 280C 18.90 1 0 4 4 11
Merc 450SE 17.40 0 0 3 3 12
Merc 450SL 17.60 0 0 3 3 13
Merc 450SLC 18.00 0 0 3 3 14
Cadillac Fleetwood 17.98 0 0 3 4 15
Lincoln Continental 17.82 0 0 3 4 16
Chrysler Imperial 17.42 0 0 3 4 17
Fiat 128 19.47 1 1 4 1 18
Honda Civic 18.52 1 1 4 2 19
Toyota Corolla 19.90 1 1 4 1 20
Toyota Corona 20.01 1 0 3 1 21
Dodge Challenger 16.87 0 0 3 2 22
AMC Javelin 17.30 0 0 3 2 23
Camaro Z28 15.41 0 0 3 4 24
Pontiac Firebird 17.05 0 0 3 2 25
Fiat X1-9 18.90 1 1 4 1 26
Porsche 914-2 16.70 0 1 5 2 27
Lotus Europa 16.90 1 1 5 2 28
Ford Pantera L 14.50 0 1 5 4 29
Ferrari Dino 15.50 0 1 5 6 30
Maserati Bora 14.60 0 1 5 8 31
Volvo 142E 18.60 1 1 4 2 32
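A sketch of how a column like favorites can be appended. The copy cars is a hypothetical name, and using $ is one option among several (cbind() would also work).

```r
cars <- mtcars

# The new column needs one value per row: 32 here
cars$favorites <- seq_len(nrow(cars))

ncol(mtcars)           # 11
ncol(cars)             # 12
head(cars$favorites)   # 1 2 3 4 5 6
```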
We can also subset a data frame (that is, filter it based on a conditional statement) using subset(). The function takes two arguments: subset(x, condition). Here x is the data frame to subset and condition is the conditional statement to subset with:
miles per gallon cylinders displacement hp drat wt
Mazda RX4 21.0 6 160.0 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440
Valiant 18.1 6 225.0 105 2.76 3.460
Duster 360 14.3 8 360.0 245 3.21 3.570
Merc 280 19.2 6 167.6 123 3.92 3.440
Merc 280C 17.8 6 167.6 123 3.92 3.440
Merc 450SE 16.4 8 275.8 180 3.07 4.070
Merc 450SL 17.3 8 275.8 180 3.07 3.730
Merc 450SLC 15.2 8 275.8 180 3.07 3.780
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250
Lincoln Continental 10.4 8 460.0 215 3.00 5.424
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345
Dodge Challenger 15.5 8 318.0 150 2.76 3.520
AMC Javelin 15.2 8 304.0 150 3.15 3.435
Camaro Z28 13.3 8 350.0 245 3.73 3.840
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845
Ford Pantera L 15.8 8 351.0 264 4.22 3.170
Ferrari Dino 19.7 6 145.0 175 3.62 2.770
Maserati Bora 15.0 8 301.0 335 3.54 3.570
qsec vs am gear carb favorites
Mazda RX4 16.46 0 1 4 4 1
Mazda RX4 Wag 17.02 0 1 4 4 2
Hornet 4 Drive 19.44 1 0 3 1 4
Hornet Sportabout 17.02 0 0 3 2 5
Valiant 20.22 1 0 3 1 6
Duster 360 15.84 0 0 3 4 7
Merc 280 18.30 1 0 4 4 10
Merc 280C 18.90 1 0 4 4 11
Merc 450SE 17.40 0 0 3 3 12
Merc 450SL 17.60 0 0 3 3 13
Merc 450SLC 18.00 0 0 3 3 14
Cadillac Fleetwood 17.98 0 0 3 4 15
Lincoln Continental 17.82 0 0 3 4 16
Chrysler Imperial 17.42 0 0 3 4 17
Dodge Challenger 16.87 0 0 3 2 22
AMC Javelin 17.30 0 0 3 2 23
Camaro Z28 15.41 0 0 3 4 24
Pontiac Firebird 17.05 0 0 3 2 25
Ford Pantera L 14.50 0 1 5 4 29
Ferrari Dino 15.50 0 1 5 6 30
Maserati Bora 14.60 0 1 5 8 31
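The output above contains no 4-cylinder cars, so a condition like cylinders > 4 would produce it. That condition is a guess from the output, and the names cars and big_cars are hypothetical.

```r
cars <- mtcars
colnames(cars)[1:3] <- c("miles per gallon", "cylinders", "displacement")

# Keep only the rows where the condition is TRUE
big_cars <- subset(cars, cylinders > 4)

nrow(cars)                 # 32
nrow(big_cars)             # 21
unique(big_cars$cylinders) # 6 8
```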
Load the dataset worms
Try the command ‘summary()’
Try to selectively rename 2 rows of your choice in mtcars.
What happens if you try to add a new column with a length of less than 32?
Extract (using either [] or $) the columns Sepal.Length and Sepal.Width from the iris dataset and make a new data frame out of them using data.frame(). Subset the new data frame for Sepal.Length > 4.6.
Think of categories.
Factors are so-called derived data types. They are normally used to group variables into unique categories or levels. For example, a data set may be grouped by gender or month of the year. Such data are usually loaded into R as a numeric or character data type requiring that they be converted to a factor using the as.factor() function.
In the following chunk of code, we create a factor from a character object.
Note that a is of character data type and fact is the factor representation of a.
Tell R that you want this to be a factor…
However, the derived object fact is now stored as an integer!
Yet, when displaying the contents of fact we see character values.
Well, fact is a more complicated object than the simple objects created thus far in that the factor is storing additional information not seen in its output.
This hidden information is stored in attributes.
There are two attributes of the factor object fact : class and levels.
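A sketch of these steps. The object names a and fact come from the text; the values in a are made up so that the levels come out as F, M and N.

```r
# A character vector (hypothetical values)
a <- c("M", "F", "N", "F", "M", "N", "M", "F", "N", "M")

# Tell R that you want this to be a factor
fact <- as.factor(a)

typeof(a)          # "character"
typeof(fact)       # "integer": the values are stored as integer codes
fact               # yet printing shows the labels, plus the levels

# The hidden information lives in the attributes
attributes(fact)   # $levels and $class
levels(fact)       # "F" "M" "N"
class(fact)        # "factor"
```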
To appreciate the benefits of a factor we’ll first create a data frame. One column will be assigned the fact factor and another will be assigned some random numeric values.
The month column is now a factor with three levels: F, M and N. We can use the str() function to view the data frame’s structure as well as the classes of its columns.
There are functions that recognize factor data types and allow you to split the output into groups defined by the factor’s unique levels. For example, to create three box plots of the value min_sunshine, one for each month group F, M and N:
The tilde ~ is used in the function to split (or condition) the data into separate plots based on the factor month.
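A sketch of the data frame and the grouped box plot. The column names month and min_sunshine come from the text; the factor values, the seed and the use of runif() for the random numbers are assumptions.

```r
set.seed(42)   # arbitrary seed so the random numbers are repeatable

fact <- as.factor(c("M", "F", "N", "F", "M", "N", "M", "F", "N", "M"))

dat <- data.frame(
  month        = fact,
  min_sunshine = runif(10, min = 0, max = 300)   # 10 random values
)

str(dat)   # month is a Factor w/ 3 levels, min_sunshine is num

# One box per level of month
boxplot(min_sunshine ~ month, data = dat)
```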
A factor will define a hierarchy for its levels.
When we invoked the levels function in the last example, you may have noted that the levels were listed in the order F, M and N. By default R sorts the levels alphabetically, and this is the level order defined for month (i.e. F, then M, then N).
This means that regardless of the order in which the factors appear in a table, anytime a plot or operation is conditioned by the factor, the grouped elements will appear in the order defined by the levels’ hierarchy.
If we wanted the box plots to be plotted in a different order we must modify the month column by releveling the factor object as follows:
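A sketch of releveling with factor() and its levels argument. The new order N, F, M and the name month_new are hypothetical.

```r
month <- as.factor(c("M", "F", "N", "F", "M", "N", "M", "F", "N", "M"))
levels(month)        # "F" "M" "N"

# Spell out the order you want the levels to have
month_new <- factor(month, levels = c("N", "F", "M"))
levels(month_new)    # "N" "F" "M"

# The values themselves have not changed, only the order of the levels
identical(as.character(month), as.character(month_new))   # TRUE
```

In a data frame, you would assign the result back to the column, for example dat$month <- factor(dat$month, levels = c("N", "F", "M")), and then redraw the box plot.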
Load ‘esoph’ which contains the data from a case-control study of (o)esophageal cancer
remove the last column
rename the columns from agegp to Age_Group, from alcgp to Alcohol_consump, from tobgp to Tobacco_consump and leave the column name of ncases the same.
subset to only contain rows that have an Alcohol_consump of 120+.
convert agegp into a factor and assign it to a new variable. Assess the attributes of that variable.
What data type is Alcohol_consump?
Matrices are atomic vectors with dimensions; the number of rows and columns. As with atomic vectors, the elements of a matrix must be of the same data type.
To create an empty matrix, we need to define those dimensions:
We can find out how many dimensions a matrix has by using dim().
You can check that matrices are vectors with a class attribute of matrix by using class() and typeof().
While class() shows that m is a matrix, typeof() shows that in this case the matrix is an integer vector (these can be character vectors, too).
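A sketch of these checks. The names m_empty and m and the values 1:6 are assumptions chosen so that the matrix holds integers.

```r
# An empty 2 x 2 matrix (filled with NA)
m_empty <- matrix(nrow = 2, ncol = 2)
dim(m_empty)    # 2 2

# A matrix holding the integers 1 to 6
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)          # 2 3 (rows, columns)
class(m)        # "matrix" "array" in R 4.0 or later
typeof(m)       # "integer"
```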
When creating a matrix, it is important to remember that matrices are filled column-wise
If that is not what you want, you can use the byrow argument (a logical: can be TRUE or FALSE) to specify how the matrix is filled
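The same six numbers filled both ways, as a sketch (m_col and m_row are made-up names).

```r
# Default: the first column is filled first, then the second, ...
m_col <- matrix(1:6, nrow = 2)
m_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: the first row is filled first
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```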
You can create a matrix from a vector:
A lot is going on here. Let’s dissect it:
Note: if you want to get the same vector each time with the same parameters, you need to use set.seed() with a defined number first
All of the above takes the random integer vector and transforms it into a matrix with 2 rows and 5 columns.
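A sketch of what such code could look like. The seed, the range 1:100 and the use of sample() are assumptions; the text only says that a random integer vector becomes a matrix with 2 rows and 5 columns.

```r
set.seed(123)                        # same seed, same "random" numbers

v <- sample(1:100, size = 10)        # 10 random integers between 1 and 100
m <- matrix(v, nrow = 2, ncol = 5)   # reshape them into 2 rows, 5 columns

length(v)   # 10
dim(m)      # 2 5
m
```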
You can also bind columns and rows using cbind() and rbind():
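A short sketch with two made-up vectors, a and b.

```r
a <- 1:3
b <- 4:6

by_col <- cbind(a, b)   # a and b become columns
by_row <- rbind(a, b)   # a and b become rows

dim(by_col)   # 3 2
dim(by_row)   # 2 3
```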
Akin to vectors, we revisit our square-brackets and can retrieve elements of a matrix by specifying the index along each dimension (e.g. “row” and “column”) in single square brackets.
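A sketch of matrix indexing on a small made-up matrix m.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

m[2, 3]   # row 2, column 3: 6
m[1, ]    # the whole first row: 1 3 5
m[, 2]    # the whole second column: 3 4
```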
Transform the built-in dataset iris into a matrix using as.matrix() and assign it to a new variable of your choice.
When you use class() and typeof(), what results do you get and why? What happened to the doubles in the data frame (hint: remember the coercion rules from earlier)?
This was hard but do keep going. Join us for Session 3 - Manipulating and Plotting Data. Hopefully, you will find it easier than today.
Thank YOU! For your attention and effort
Be a part of this! You can help us by running or assisting at a future session (you can opt into this via the feedback survey).
Tell us… what you liked and what you didn’t, using the feedback survey.