Data Types and Structures

Coding for Reproducible Research

Laura Roldan-Gomez

11/1/22

We are…


Emma Walker


Xinran Du

xd247@exeter.ac.uk


Laura Roldan-Gomez

lr480@exeter.ac.uk

Welcome to this hybrid session

  • Part of Coding for Reproducible Research Programme

  • Why are you here?

Expectations for today


  • This material is new and designed by Theresa Walker alongside a team of people including Emma and myself.

  • It is based on other materials and our experience.

  • I have taught R before, on multiple occasions.

  • So today will be an experiment but should run smoothly.

Check out the material here

Code of conduct


Be kind by default


The rest of the rules of engagement are available on the course website

Today


  • Method of delivery: three parts with tasks and breaks every hour or so

  • Please sign in/register your attendance, via an online sheet link

Our plan for this session


  • Data types

  • Data structures

  • DATA FRAMES

  • Matrices (bonus track)

1. Set Up


Take a few minutes for:

  • questions and troubleshooting

  • opening RStudio, and…

  • setting your working directory.

… setwd()

… or Session button

2. Data Types - The basis of it all


Data types define how the values are stored in the computer. There are 3 core data types:


  • Numeric
  • Character or strings
  • Logical

2.1. Numeric

  • whole numbers and decimals:
3 # This is a number
[1] 3
class(3) # you can check its class
[1] "numeric"
# The functions class() and typeof() return the data type of an object
typeof(3) # another way of checking (R stores plain numbers as doubles, which class() reports as "numeric")
[1] "double"


15.5 # This is also a number
[1] 15.5
class(15.5)
[1] "numeric"


  • complex:
1+4i # This is also a number
[1] 1+4i
class(1+4i)
[1] "complex"
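As a small sketch of the difference between integers and doubles: a number typed without a suffix is stored as a double, and the `L` suffix asks R for a true integer. The values used here are purely illustrative.

```r
# Numbers typed without a suffix are stored as doubles
typeof(3)       # "double"

# The L suffix creates a true integer
typeof(3L)      # "integer"
class(3L)       # "integer"

# Both doubles and integers count as numeric
is.numeric(3)   # TRUE
is.numeric(3L)  # TRUE

# The colon operator produces integers
typeof(1:5)     # "integer"
```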

2.2. Characters or strings:


consists of letters, words, sentences or characters, such as:

"a" # note the quote marks
[1] "a"
"datatypes"
[1] "datatypes"
"Learning R is fun"
[1] "Learning R is fun"
"@%£$"
[1] "@%£$"


class("a") # Check
[1] "character"

Basically, anything within quote marks ("") is a character or string, so… numbers can also be characters!


3 # This is a number
[1] 3
2 # This is also a number
[1] 2
3-2 # I can treat them as numbers
[1] 1


NOW…

"3" # This is not a number
[1] "3"
"2" # This is not a number
[1] "2"


"3"-"2" # I cannot treat them as numbers
Error in "3" - "2": non-numeric argument to binary operator
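One way around this error, sketched here with the illustrative names `a` and `b`, is to convert the strings with `as.numeric()` before doing arithmetic.

```r
a <- "3"
b <- "2"

# Convert the strings to numbers first, then subtract
as.numeric(a) - as.numeric(b)          # 1

# Text that does not look like a number becomes NA (R also warns about it)
suppressWarnings(as.numeric("three"))  # NA
```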

2.3. Logical:

Logical values can be either TRUE or FALSE

or sometimes they are coded as 1 and 0.


That means that if you type:

z = as.logical(c(1,0,0,1))
typeof(z)
[1] "logical"


…if you don’t use as.logical you get a set of numbers

z = (c(1,0,0,1)) 
typeof(z)
[1] "double"
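The link between logicals and 1/0 also works in the other direction: in arithmetic, TRUE counts as 1 and FALSE as 0. A short sketch, using a made-up vector called `answers`:

```r
answers <- c(TRUE, FALSE, FALSE, TRUE)

sum(answers)         # 2   -> number of TRUE values
mean(answers)        # 0.5 -> proportion of TRUE values
as.numeric(answers)  # 1 0 0 1

# When converting numbers, 0 becomes FALSE and any other number becomes TRUE
as.logical(c(0, 1, 5))  # FALSE TRUE TRUE
```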

3. Data Structures

When we group values of the different data types together, they form a data structure. The different structures are:

  • vectors:
    • atomic vector
    • list
  • matrix
  • data frame
  • factors

3.1. Vectors

A vector is the most common and basic data structure in R. Technically, vectors can be one of two types:

  • Atomic vectors
  • Lists

a. atomic vectors:


An atomic vector only holds elements (or atoms) of a single data type, and each atom is a scalar, which means it “has length one”. It is homogeneous.


Examples:

x <- 3 # a vector of one number


y <- 1:5 # numbers from 1 to 5


y_times_two <- (y*2) # vector y multiplied by two


z <- letters[1:5] # the first 5 letters of the alphabet


v_log <- c(TRUE, FALSE, FALSE, TRUE, TRUE) # a logical vector
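To see what "a single data type" means in practice, here is a minimal sketch of what happens when different types are mixed in one atomic vector. The names `mixed` and `num_and_log` are illustrative only.

```r
# Mixing types: everything is converted to the most flexible type present
mixed <- c(1, "a", TRUE)
mixed                # "1" "a" "TRUE"
typeof(mixed)        # "character"

# Numbers and logicals together become numbers
num_and_log <- c(1, TRUE, FALSE)
num_and_log          # 1 1 0
typeof(num_and_log)  # "double"

# length() tells you how many atoms a vector holds
length(mixed)        # 3
```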

CONCATENATE, INDEXING AND COERCION

Concatenate

The concatenate (or combine) function c() constructs a vector. We used it above to construct the logical vector.


c() can also be used to add elements to a vector.

v_name <- c("Sarah", "Tracy", "Jon")
v_name
[1] "Sarah" "Tracy" "Jon"  


v_name <- c(v_name, "Annette")
v_name
[1] "Sarah"   "Tracy"   "Jon"     "Annette"


v_name<- c("Greg", v_name)
v_name
[1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"

Let’s use it


Let’s put this together and create our first little data set. We’ll work with data sets further down but I want to wrap up this section with one of the functions of vectors.


We use two functions: as.data.frame and cbind or column bind.

ourdata <- as.data.frame(cbind(v_name, x, y, y_times_two, z, v_log))


ourdata2 <- as.data.frame(rbind(v_name, x, y, y_times_two, z, v_log)) # we can also use rbind or row bind. Run this and see the difference.

Indexing


Every atom in a vector has a ‘location’ represented by a number or index. Our vector x from the previous exercise has length one. So the index for 3 is 1.


Likewise, vector z has 5 letters and therefore, 5 indexes.


We can access these elements by putting their index inside square brackets []


x[1]
[1] 3


x[0] # R counts from 1, so index 0 returns an empty vector
numeric(0)
x[2] # index 2 doesn't exist in this vector
[1] NA


z[3] # running this line will give you the letter in position 3.
[1] "c"

You can also:


  • use index ranges:
z[2:3]
[1] "b" "c"


  • remove an element by using a negative number
z[-5] # removes the last element
[1] "a" "b" "c" "d"


z[-(1:3)] # removes the first three elements
[1] "d" "e"
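Besides positions, a vector can also be indexed with a logical condition. The sketch below re-creates `y` and `z` so that it runs on its own.

```r
y <- 1:5
z <- letters[1:5]

# A comparison returns one TRUE/FALSE per element
y > 2         # FALSE FALSE TRUE TRUE TRUE

# Use that logical vector inside [] to keep only the TRUE positions
y[y > 2]      # 3 4 5
z[y > 2]      # "c" "d" "e"

# Pick several specific positions with c()
z[c(1, 5)]    # "a" "e"

# The last element, whatever the length of the vector
z[length(z)]  # "e"
```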

Indexing data frames


The positions in a data frame behave like a coordinate [row, column].


So you have

ourdata[1,1] # element in the first row and first column
[1] "Greg"


And you can also use indexing to get whole columns or rows.

ourdata[2] # I get the second column 
  x
1 3
2 3
3 3
4 3
5 3


ourdata[-2] # I removed the second column
   v_name y y_times_two z v_log
1    Greg 1           2 a  TRUE
2   Sarah 2           4 b FALSE
3   Tracy 3           6 c FALSE
4     Jon 4           8 d  TRUE
5 Annette 5          10 e  TRUE


ourdata[1,] # I get the first row
  v_name x y y_times_two z v_log
1   Greg 3 1           2 a  TRUE


ourdata[,(2:3)] # I get columns 2 and 3
  x y
1 3 1
2 3 2
3 3 3
4 3 4
5 3 5

Use indexing to create a subset


We can even create a new data frame using this. It’s great when you have huge data sets but you are interested in just a few variables.


new_subset <- ourdata[(1:3),(2:4)] 


new_subset[1,1] <- "Laura" # I can also change elements in my dataframe.

Data type coercion


Even though R’s vectors have a specific data type, it’s quite easy to coerce them to another type.


Let’s look at our data set and see what type of data we have:


str(ourdata)
'data.frame':   5 obs. of  6 variables:
 $ v_name     : chr  "Greg" "Sarah" "Tracy" "Jon" ...
 $ x          : chr  "3" "3" "3" "3" ...
 $ y          : chr  "1" "2" "3" "4" ...
 $ y_times_two: chr  "2" "4" "6" "8" ...
 $ z          : chr  "a" "b" "c" "d" ...
 $ v_log      : chr  "TRUE" "FALSE" "FALSE" "TRUE" ...

Careful:


Coercion can also be triggered by other actions. When cbind() joined our vectors, it built a matrix, and because a matrix can hold only one data type, every element was converted to character.


When you use real data, you must check this before running any analysis. And you need to coerce the data to whatever makes sense.

For instance, we must coerce our variables x, y, y_times_two to numeric and variable v_log to logical:


ourdata$x<-as.numeric(ourdata$x)
ourdata$y<-as.numeric(ourdata$y)
ourdata$y_times_two<-as.numeric(ourdata$y_times_two)
ourdata$v_log <- as.logical(ourdata$v_log)


str(ourdata) # This is making more sense
'data.frame':   5 obs. of  6 variables:
 $ v_name     : chr  "Greg" "Sarah" "Tracy" "Jon" ...
 $ x          : num  3 3 3 3 3
 $ y          : num  1 2 3 4 5
 $ y_times_two: num  2 4 6 8 10
 $ z          : chr  "a" "b" "c" "d" ...
 $ v_log      : logi  TRUE FALSE FALSE TRUE TRUE


Note: If you need to change something to character, you use as.character().
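As a point of comparison, and not the method used above, here is a sketch of how passing the vectors straight to `data.frame()` keeps each column's own type, so less coercion is needed afterwards. The names `via_cbind` and `direct` are illustrative, and `stringsAsFactors = FALSE` is set explicitly so the result is the same in older versions of R.

```r
v_name <- c("Greg", "Sarah", "Tracy", "Jon", "Annette")
y <- 1:5
v_log <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# Route 1: cbind() first builds a matrix, so every column ends up as character
via_cbind <- as.data.frame(cbind(v_name, y, v_log), stringsAsFactors = FALSE)
sapply(via_cbind, class)  # "character" "character" "character"

# Route 2: data.frame() keeps the type of each vector
direct <- data.frame(v_name, y, v_log, stringsAsFactors = FALSE)
sapply(direct, class)     # "character" "integer" "logical"
```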

b. lists:


…still vectors but can contain several data types (can be heterogeneous)

…each atom of the list can itself be composed of more than one atom (has a length > one).


Atomic vectors are very constrained: each atom is a scalar (length 1) and all atoms must share one data type.

You might need a vector that violates these constraints, one whose elements can have length > 1 and need not be of the same type.

We construct a list explicitly with list() but, like atomic vectors, most lists are created some other way in real life.


(a_list <- list(1:3, c("four", "five")))
[[1]]
[1] 1 2 3

[[2]]
[1] "four" "five"


# A more impressive one
(b_list <- list(logical = TRUE, integer = 4L, double = 4 * 1.2, character = "character"))
$logical
[1] TRUE

$integer
[1] 4

$double
[1] 4.8

$character
[1] "character"


# A very impressive one
(c_list <- list(letters[26:22], transcendental = c(pi, exp(1)), f = function(x) x^2))
[[1]]
[1] "z" "y" "x" "w" "v"

$transcendental
[1] 3.141593 2.718282

$f
function(x) x^2


You can also coerce other objects using as.list().


ourdata_as_list <- as.list(ourdata)

List indexing


There are 3 ways to index a list and the differences are very important:

1.) With single square brackets, just like we indexed atomic vectors. Note this always returns a list, even if we request a single component.


a_list[c(FALSE, TRUE)]
[[1]]
[1] "four" "five"
b_list[2:3]
$integer
[1] 4

$double
[1] 4.8
c_list["transcendental"]
$transcendental
[1] 3.141593 2.718282


ourdata_as_list[1]
$v_name
[1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"

2.) With double square brackets. This can only be used to access a single component and it returns the “naked” component. You can request a component with a positive integer or by name.


a_list[[2]]
[1] "four" "five"
b_list[["double"]]
[1] 4.8
ourdata_as_list[[1]]
[1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"

3.) With the $ addressing named components. Like [[, this can only be used to access a single component, but it is even more limited: You must specify the component by name.


ourdata_as_list$v_name
[1] "Greg"    "Sarah"   "Tracy"   "Jon"     "Annette"
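The difference between the three methods can be checked directly. This sketch re-creates `b_list` from above; `one_bracket` and `two_brackets` are illustrative names.

```r
b_list <- list(logical = TRUE, integer = 4L, double = 4 * 1.2, character = "character")

# Single brackets: the result is still a list (of length 1)
one_bracket <- b_list["double"]
is.list(one_bracket)   # TRUE
length(one_bracket)    # 1

# Double brackets: the result is the component itself
two_brackets <- b_list[["double"]]
is.list(two_brackets)  # FALSE
two_brackets * 2       # 9.6

# $ and [[ ]] return the same thing for a named component
identical(b_list[["double"]], b_list$double)  # TRUE
```

This matters in practice: arithmetic such as `one_bracket * 2` fails because `one_bracket` is a list, not a number.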

How can I use this?


This is a bit advanced as it uses loops but a good example of where you can use lists…


Loop over your data to create many data frames and keep them in a list. I’m currently using this as part of my PhD, for which I am running over 13,000 models! So yes, lists can be useful.


why_lists <- list() # create an empty list

for (i in 1:5){
  # keep rows i-1 and i (for i = 1, index 0 is ignored, so only row 1 is kept)
  why_lists[[i]] <- ourdata[(i-1):(i),]
} 
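The same pattern can be sketched with a built-in data set, so it runs without `ourdata`. Here the loop stores one data frame per number of cylinders in `mtcars`; the names `cars`, `cyl_values` and `by_cyl` are illustrative, and `datasets::mtcars` is used to get an unmodified copy of the data.

```r
cars <- datasets::mtcars

cyl_values <- sort(unique(cars$cyl))  # 4 6 8
by_cyl <- list()                      # create an empty list

for (value in cyl_values) {
  # store each subset under a name such as "4", "6" or "8"
  by_cyl[[as.character(value)]] <- cars[cars$cyl == value, ]
}

names(by_cyl)         # "4" "6" "8"
sapply(by_cyl, nrow)  # 11 7 14
```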

Tasks and break - 10 minutes

  1. What is the class of ourdata_as_list[1]?

  2. What is the class of ourdata_as_list[[1]]?

  3. Consider

my_vec <- c(a = 1, b = 2, c = 3)
my_list <- list(a = 1, b = 2, c = 3)

Use [ and [[ to attempt to retrieve elements 2 and 3 from my_vec and my_list. What succeeds vs. fails? What if you try to retrieve element 2 alone? Does [[ even work on atomic vectors? Compare and contrast the results from the various combinations of indexing method and input object.

4. Data frames

Data frames are the data structure you will use the most for statistics or data management in general.


Get your data:

To get your data in R, you can do one of three things:

  • Create your own data set (we did that already with ‘ourdata’)

  • Use a built-in data set like iris

data(iris)
  • Load your own data. There are multiple ways to load your data, for example using the command read.csv:
worms <- read.csv("data/worms.csv")
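If you do not have a CSV file at hand, the following sketch writes a tiny one to a temporary location and reads it back, so you can try `read.csv()` straight away. The data and the names `example_data`, `csv_path` and `loaded` are made up for illustration.

```r
# A tiny made-up data set
example_data <- data.frame(id = letters[1:3], value = c(1.5, 2.5, 3.5))

# Write it to a temporary CSV file (row.names = FALSE avoids an extra column)
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# Read it back in, just as you would with your own file
loaded <- read.csv(csv_path)
dim(loaded)   # 3 2
loaded$value  # 1.5 2.5 3.5
```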

Create a data frame:


First, let’s create a data frame by hand using the command data.frame (previously, we used a different method):

dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)
dat
   id  x  y
1   a  1 11
2   b  2 12
3   c  3 13
4   d  4 14
5   e  5 15
6   f  6 16
7   g  7 17
8   h  8 18
9   i  9 19
10  j 10 20

Explore a data frame:

Second, let’s use several commands to explore a data frame:

head(dat) # shows first 6 rows
  id x  y
1  a 1 11
2  b 2 12
3  c 3 13
4  d 4 14
5  e 5 15
6  f 6 16
tail(dat)    # shows last 6 rows
   id  x  y
5   e  5 15
6   f  6 16
7   g  7 17
8   h  8 18
9   i  9 19
10  j 10 20
dim(dat)     # returns the dimensions of data frame (i.e. number of rows and number of columns)
[1] 10  3
nrow(dat)    # number of rows
[1] 10
ncol(dat)    # number of columns
[1] 3
str(dat)     # structure of data frame - name, type and preview of data in each column
'data.frame':   10 obs. of  3 variables:
 $ id: chr  "a" "b" "c" "d" ...
 $ x : int  1 2 3 4 5 6 7 8 9 10
 $ y : int  11 12 13 14 15 16 17 18 19 20

names(dat)
[1] "id" "x"  "y" 


colnames(dat)# both show the names attribute for a data frame
[1] "id" "x"  "y" 


sapply(dat, class) # shows the class of each column in the data frame
         id           x           y 
"character"   "integer"   "integer" 

Below we show that a data frame is actually a special list:

is.list(dat)
[1] TRUE


class(dat)
[1] "data.frame"
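A short sketch of what "special list" means here: each column is one element of the list, and all elements have the same length. `dat` is re-created so the example runs on its own.

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

# As a list, a data frame has one element per column
length(dat)          # 3
ncol(dat)            # 3

# Every element (column) has the same length: the number of rows
sapply(dat, length)  # id 10, x 10, y 10
```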

Indexing/slicing


As shown above, there are ways to retrieve specific elements from a data frame: it can be sliced or indexed.


Because data frames are rectangular, elements of data frame can be referenced by specifying the row and the column index in single square brackets.

dat[1, 3] # You know this already, this is a reminder
[1] 11

Data frames are also lists

…so, it is possible to refer to columns (which are elements of such list) using the list notation, i.e. either double square brackets or a $.


dat[["y"]]
 [1] 11 12 13 14 15 16 17 18 19 20


dat$y
 [1] 11 12 13 14 15 16 17 18 19 20
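The different ways of getting at a column can be compared side by side. This is a sketch with `dat` re-created so it runs on its own; `picked` is an illustrative name.

```r
dat <- data.frame(id = letters[1:10], x = 1:10, y = 11:20)

# Three ways to get the column y as a plain vector
identical(dat[["y"]], dat$y)  # TRUE
identical(dat[, "y"], dat$y)  # TRUE

# Single brackets with just a name keep the data frame structure
class(dat["y"])               # "data.frame"

# Rows and columns can be chosen together: rows where x > 8, columns id and y
picked <- dat[dat$x > 8, c("id", "y")]
picked$id                     # "i" "j"
```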

Restructure your data frame - pimp up your data frame


a. Add and change names:

When you look at both the dat and iris data frames from earlier, they have no rownames (only automatic row numbers).

dat
   id  x  y
1   a  1 11
2   b  2 12
3   c  3 13
4   d  4 14
5   e  5 15
6   f  6 16
7   g  7 17
8   h  8 18
9   i  9 19
10  j 10 20


head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

However, when we look at the mtcars data set, it does

head(mtcars) # *mtcars has rownames!*
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Go fancy

To give each row a name, we can assign a vector of names. This is a bit nonsensical here, but the entire data frame is just a generic example…

names<-c("first","second","third","fourth","fifth","sixth", "seventh","eighth","ninth","tenth") # we create a vector
names
 [1] "first"   "second"  "third"   "fourth"  "fifth"   "sixth"   "seventh"
 [8] "eighth"  "ninth"   "tenth"  

rownames(dat)=names # we assign the vector as rownames for dat


dat
        id  x  y
first    a  1 11
second   b  2 12
third    c  3 13
fourth   d  4 14
fifth    e  5 15
sixth    f  6 16
seventh  g  7 17
eighth   h  8 18
ninth    i  9 19
tenth    j 10 20

We can also rename columns. Let’s assume we want to change the abbreviations of the first three columns of mtcars to the actual words:

colnames(mtcars)[1:3]<-c("miles per gallon","cylinders","displacement")
head(mtcars)
                  miles per gallon cylinders displacement  hp drat    wt  qsec
Mazda RX4                     21.0         6          160 110 3.90 2.620 16.46
Mazda RX4 Wag                 21.0         6          160 110 3.90 2.875 17.02
Datsun 710                    22.8         4          108  93 3.85 2.320 18.61
Hornet 4 Drive                21.4         6          258 110 3.08 3.215 19.44
Hornet Sportabout             18.7         8          360 175 3.15 3.440 17.02
Valiant                       18.1         6          225 105 2.76 3.460 20.22
                  vs am gear carb
Mazda RX4          0  1    4    4
Mazda RX4 Wag      0  1    4    4
Datsun 710         1  1    4    1
Hornet 4 Drive     1  0    3    1
Hornet Sportabout  0  0    3    2
Valiant            1  0    3    1

By using [1:3] we only changed a subset of the column names. If you want to change them all, the vector with the column names must correspond to the number of columns.
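Columns can also be renamed by matching their current name instead of their position, which still works if the column order changes. This sketch works on a copy called `cars` (an illustrative name), taken from `datasets::mtcars` so it starts from the original column names. It also shows that names containing spaces need backticks when used with `$`.

```r
cars <- datasets::mtcars

# Rename one column by matching its current name
names(cars)[names(cars) == "hp"] <- "horsepower"
names(cars)[1:4]                  # "mpg" "cyl" "disp" "horsepower"

# A name with spaces works, but must be wrapped in backticks with $
names(cars)[1] <- "miles per gallon"
head(cars$`miles per gallon`, 3)  # 21.0 21.0 22.8

# Underscores avoid the need for backticks
names(cars)[1] <- "miles_per_gallon"
head(cars$miles_per_gallon, 3)    # 21.0 21.0 22.8
```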

b. Append a column

You can also append a column of choice to your data frame. Remember, it needs to have the same length as the other columns:

  • find out how many rows our mtcars actually has:
nrow(mtcars) 
[1] 32


  • generate new column
favorites=1:32 

  • append
mtcars$favorites=favorites
mtcars
                    miles per gallon cylinders displacement  hp drat    wt
Mazda RX4                       21.0         6        160.0 110 3.90 2.620
Mazda RX4 Wag                   21.0         6        160.0 110 3.90 2.875
Datsun 710                      22.8         4        108.0  93 3.85 2.320
Hornet 4 Drive                  21.4         6        258.0 110 3.08 3.215
Hornet Sportabout               18.7         8        360.0 175 3.15 3.440
Valiant                         18.1         6        225.0 105 2.76 3.460
Duster 360                      14.3         8        360.0 245 3.21 3.570
Merc 240D                       24.4         4        146.7  62 3.69 3.190
Merc 230                        22.8         4        140.8  95 3.92 3.150
Merc 280                        19.2         6        167.6 123 3.92 3.440
Merc 280C                       17.8         6        167.6 123 3.92 3.440
Merc 450SE                      16.4         8        275.8 180 3.07 4.070
Merc 450SL                      17.3         8        275.8 180 3.07 3.730
Merc 450SLC                     15.2         8        275.8 180 3.07 3.780
Cadillac Fleetwood              10.4         8        472.0 205 2.93 5.250
Lincoln Continental             10.4         8        460.0 215 3.00 5.424
Chrysler Imperial               14.7         8        440.0 230 3.23 5.345
Fiat 128                        32.4         4         78.7  66 4.08 2.200
Honda Civic                     30.4         4         75.7  52 4.93 1.615
Toyota Corolla                  33.9         4         71.1  65 4.22 1.835
Toyota Corona                   21.5         4        120.1  97 3.70 2.465
Dodge Challenger                15.5         8        318.0 150 2.76 3.520
AMC Javelin                     15.2         8        304.0 150 3.15 3.435
Camaro Z28                      13.3         8        350.0 245 3.73 3.840
Pontiac Firebird                19.2         8        400.0 175 3.08 3.845
Fiat X1-9                       27.3         4         79.0  66 4.08 1.935
Porsche 914-2                   26.0         4        120.3  91 4.43 2.140
Lotus Europa                    30.4         4         95.1 113 3.77 1.513
Ford Pantera L                  15.8         8        351.0 264 4.22 3.170
Ferrari Dino                    19.7         6        145.0 175 3.62 2.770
Maserati Bora                   15.0         8        301.0 335 3.54 3.570
Volvo 142E                      21.4         4        121.0 109 4.11 2.780
                     qsec vs am gear carb favorites
Mazda RX4           16.46  0  1    4    4         1
Mazda RX4 Wag       17.02  0  1    4    4         2
Datsun 710          18.61  1  1    4    1         3
Hornet 4 Drive      19.44  1  0    3    1         4
Hornet Sportabout   17.02  0  0    3    2         5
Valiant             20.22  1  0    3    1         6
Duster 360          15.84  0  0    3    4         7
Merc 240D           20.00  1  0    4    2         8
Merc 230            22.90  1  0    4    2         9
Merc 280            18.30  1  0    4    4        10
Merc 280C           18.90  1  0    4    4        11
Merc 450SE          17.40  0  0    3    3        12
Merc 450SL          17.60  0  0    3    3        13
Merc 450SLC         18.00  0  0    3    3        14
Cadillac Fleetwood  17.98  0  0    3    4        15
Lincoln Continental 17.82  0  0    3    4        16
Chrysler Imperial   17.42  0  0    3    4        17
Fiat 128            19.47  1  1    4    1        18
Honda Civic         18.52  1  1    4    2        19
Toyota Corolla      19.90  1  1    4    1        20
Toyota Corona       20.01  1  0    3    1        21
Dodge Challenger    16.87  0  0    3    2        22
AMC Javelin         17.30  0  0    3    2        23
Camaro Z28          15.41  0  0    3    4        24
Pontiac Firebird    17.05  0  0    3    2        25
Fiat X1-9           18.90  1  1    4    1        26
Porsche 914-2       16.70  0  1    5    2        27
Lotus Europa        16.90  1  1    5    2        28
Ford Pantera L      14.50  0  1    5    4        29
Ferrari Dino        15.50  0  1    5    6        30
Maserati Bora       14.60  0  1    5    8        31
Volvo 142E          18.60  1  1    4    2        32
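A new column does not have to be typed in by hand; it can be computed from existing columns, which guarantees it has the right length. A sketch on a copy called `cars`, where the column `hp_per_wt` is a made-up example:

```r
cars <- datasets::mtcars

# Compute a new column from two existing ones (one value per row)
cars$hp_per_wt <- cars$hp / cars$wt
ncol(cars)  # 12

# Look at just the columns involved
head(cars[, c("hp", "wt", "hp_per_wt")], 3)

# Assigning NULL removes a column again
cars$hp_per_wt <- NULL
ncol(cars)  # 11
```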

c. Subset your data:

We can also subset (or filter based on a conditional statement) a data frame using subset(). The function takes two arguments: subset(x, condition). x is the data frame to subset, and condition is the conditional statement to subset with:

subset(mtcars, cylinders>4)
                    miles per gallon cylinders displacement  hp drat    wt
Mazda RX4                       21.0         6        160.0 110 3.90 2.620
Mazda RX4 Wag                   21.0         6        160.0 110 3.90 2.875
Hornet 4 Drive                  21.4         6        258.0 110 3.08 3.215
Hornet Sportabout               18.7         8        360.0 175 3.15 3.440
Valiant                         18.1         6        225.0 105 2.76 3.460
Duster 360                      14.3         8        360.0 245 3.21 3.570
Merc 280                        19.2         6        167.6 123 3.92 3.440
Merc 280C                       17.8         6        167.6 123 3.92 3.440
Merc 450SE                      16.4         8        275.8 180 3.07 4.070
Merc 450SL                      17.3         8        275.8 180 3.07 3.730
Merc 450SLC                     15.2         8        275.8 180 3.07 3.780
Cadillac Fleetwood              10.4         8        472.0 205 2.93 5.250
Lincoln Continental             10.4         8        460.0 215 3.00 5.424
Chrysler Imperial               14.7         8        440.0 230 3.23 5.345
Dodge Challenger                15.5         8        318.0 150 2.76 3.520
AMC Javelin                     15.2         8        304.0 150 3.15 3.435
Camaro Z28                      13.3         8        350.0 245 3.73 3.840
Pontiac Firebird                19.2         8        400.0 175 3.08 3.845
Ford Pantera L                  15.8         8        351.0 264 4.22 3.170
Ferrari Dino                    19.7         6        145.0 175 3.62 2.770
Maserati Bora                   15.0         8        301.0 335 3.54 3.570
                     qsec vs am gear carb favorites
Mazda RX4           16.46  0  1    4    4         1
Mazda RX4 Wag       17.02  0  1    4    4         2
Hornet 4 Drive      19.44  1  0    3    1         4
Hornet Sportabout   17.02  0  0    3    2         5
Valiant             20.22  1  0    3    1         6
Duster 360          15.84  0  0    3    4         7
Merc 280            18.30  1  0    4    4        10
Merc 280C           18.90  1  0    4    4        11
Merc 450SE          17.40  0  0    3    3        12
Merc 450SL          17.60  0  0    3    3        13
Merc 450SLC         18.00  0  0    3    3        14
Cadillac Fleetwood  17.98  0  0    3    4        15
Lincoln Continental 17.82  0  0    3    4        16
Chrysler Imperial   17.42  0  0    3    4        17
Dodge Challenger    16.87  0  0    3    2        22
AMC Javelin         17.30  0  0    3    2        23
Camaro Z28          15.41  0  0    3    4        24
Pontiac Firebird    17.05  0  0    3    2        25
Ford Pantera L      14.50  0  1    5    4        29
Ferrari Dino        15.50  0  1    5    6        30
Maserati Bora       14.60  0  1    5    8        31
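The same filter can be written with square brackets, and conditions can be combined with `&`. This sketch uses a copy called `cars` from `datasets::mtcars`, so the column is still called `cyl` rather than `cylinders`; the other object names are illustrative.

```r
cars <- datasets::mtcars

# subset() and logical indexing give the same rows
with_subset   <- subset(cars, cyl > 4)
with_brackets <- cars[cars$cyl > 4, ]
all.equal(with_subset, with_brackets)  # TRUE
nrow(with_subset)                      # 21

# Combine two conditions with & (both must be TRUE)
both <- subset(cars, cyl > 4 & mpg > 20)
rownames(both)  # "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive"
```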

Tasks and Break - 10 to 20 minutes

  • Load the dataset worms

  • Try the command ‘summary()’

  • Try to selectively rename 2 rows of your choice in mtcars.

  • What happens if you try to add a new column with a length of less than 32?

  • Extract (using either [] or $) the columns Sepal.Length and Sepal.Width from the iris dataset and make a new data frame out of them using data.frame(). Subset the new data frame for Sepal.Length > 4.6.

5. Special section - Factors

Think of categories.


Factors are so-called derived data types. They are normally used to group variables into unique categories or levels. For example, a data set may be grouped by gender or month of the year. Such data are usually loaded into R as a numeric or character data type requiring that they be converted to a factor using the as.factor() function.

5.1. Create a factor:

In the following chunk of code, we create a factor from a character object.

mon <- c("March","February","February","November","February","March","March","March","February","November")


Note that mon is of character data type; below, fact will be the factor representation of mon.

typeof(mon)
[1] "character"

Tell R that you want this to be a factor…

fact <- as.factor(mon)


However, the derived object fact is now stored as an integer!

typeof(fact)
[1] "integer"


Yet, when displaying the contents of fact we see character values.

fact
 [1] March    February February November February March    March    March   
 [9] February November
Levels: February March November

How can this be?

Well, fact is a more complicated object than the simple objects created thus far in that the factor is storing additional information not seen in its output.


This hidden information is stored in attributes.


attributes(fact)
$levels
[1] "February" "March"    "November"

$class
[1] "factor"

There are two attributes of the factor object fact : class and levels.

class(fact)
[1] "factor"


levels(fact) # lists the three unique values in fact. The order reflects their *numeric* representation. In essence, fact is storing each value as an integer that points to one of the three unique levels.
[1] "February" "March"    "November"
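To make the "integer that points to a level" idea concrete, this sketch re-creates `fact` and shows the stored integer codes next to the labels they stand for.

```r
mon <- c("March","February","February","November","February",
         "March","March","March","February","November")
fact <- as.factor(mon)

# The integer codes stored inside the factor
as.integer(fact)                # 2 1 1 3 1 2 2 2 1 3

# Using the codes to look up the levels rebuilds the original labels
levels(fact)[as.integer(fact)]  # "March" "February" "February" ...

# How many levels, and how often each one occurs
nlevels(fact)                   # 3
table(fact)                     # February 4, March 4, November 2
```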

5.2. Use of a factor:


To appreciate the benefits of a factor we’ll first create a data frame. One column will be assigned the fact factor and another will be assigned some random numeric values.

x <- c(166, 47, 61, 148, 62, 123, 232, 98, 93, 110)


dat_fact <- data.frame(min_sunshine = x, month = fact)
dat_fact
   min_sunshine    month
1           166    March
2            47 February
3            61 February
4           148 November
5            62 February
6           123    March
7           232    March
8            98    March
9            93 February
10          110 November

The month column is now a factor with three levels: February, March and November. We can use the str() function to view the data frame’s structure as well as its column classes.


str(dat_fact)
'data.frame':   10 obs. of  2 variables:
 $ min_sunshine: num  166 47 61 148 62 123 232 98 93 110
 $ month       : Factor w/ 3 levels "February","March",..: 2 1 1 3 1 2 2 2 1 3

There are functions that recognize factor data types and allow you to split the output into groups defined by the factor’s unique levels. For example, to create three box plots of the value min_sunshine, one for each month group (February, March and November):

boxplot(min_sunshine ~ 
          month, dat_fact, 
        horizontal = TRUE)

The tilde ~ is used in the function to split (or condition) the data into separate plots based on the factor month.
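Grouping by a factor is not limited to plots. As a sketch, the same data can be summarised per month with `tapply()`, or with `aggregate()`, which accepts the same tilde notation. `dat_fact` is re-created here so the example runs on its own; `means` and `means_df` are illustrative names.

```r
mon <- c("March","February","February","November","February",
         "March","March","March","February","November")
x <- c(166, 47, 61, 148, 62, 123, 232, 98, 93, 110)
dat_fact <- data.frame(min_sunshine = x, month = as.factor(mon))

# Mean of min_sunshine within each level of month
means <- tapply(dat_fact$min_sunshine, dat_fact$month, mean)
means  # February 65.75, March 154.75, November 129.00

# The same calculation written with the formula (tilde) notation
means_df <- aggregate(min_sunshine ~ month, data = dat_fact, FUN = mean)
means_df
```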

5.3. Rearranging level order

A factor will define a hierarchy (an order) for its levels.


When we invoked the levels function in the last example, you may have noted that the levels were ordered February, March and November. By default R sorts the levels alphabetically, so this is the level hierarchy defined for month (February first, November last).


This means that regardless of the order in which the factors appear in a table, anytime a plot or operation is conditioned by the factor, the grouped elements will appear in the order defined by the levels’ hierarchy.

If we wanted the box plots to be plotted in a different order we must modify the month column by releveling the factor object as follows:

dat_fact$month <- factor(dat_fact$month,
                  levels=c("November","February","March"))
str(dat_fact)
'data.frame':   10 obs. of  2 variables:
 $ min_sunshine: num  166 47 61 148 62 123 232 98 93 110
 $ month       : Factor w/ 3 levels "November","February",..: 3 2 2 1 2 3 3 3 2 1

boxplot(min_sunshine ~ month, 
        dat_fact, horizontal = TRUE)
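Releveling changes only the order of the levels, not the data values themselves. The sketch below uses a small made-up factor called `month` to show this, and also what happens if a level name is misspelled.

```r
month <- factor(c("March", "February", "November"))
levels(month)            # "February" "March" "November"

# Re-create the factor with the levels in the order we want
reordered <- factor(month, levels = c("November", "February", "March"))
levels(reordered)        # "November" "February" "March"

# The values are unchanged; only the underlying integer codes differ
as.character(reordered)  # "March" "February" "November"
as.integer(reordered)    # 3 2 1

# Careful: a value that matches none of the listed levels becomes NA
typo <- factor(month, levels = c("November", "Febuary", "March"))
is.na(typo)              # FALSE TRUE FALSE
```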

Tasks

  • Load ‘esoph’ which contains the data from a case-control study of (o)esophageal cancer

  • remove the last column

  • rename the columns from agegp to Age_Group, from alcgp to Alcohol_consump, from tobgp to Tobacco_consump and leave the column name of ncases the same.

  • subset to only contain rows that have an Alcohol_consump of 120+.

  • convert agegp into a factor and assign it to a new variable. Assess the attributes of that variable.

  • What data type is Alcohol_consump?

6. Matrices

Matrices are atomic vectors with dimensions: the number of rows and columns. As with atomic vectors, the elements of a matrix must be of the same data type.


To create an empty matrix, we need to define those dimensions:

m<-matrix(nrow=2, ncol=2)
m
     [,1] [,2]
[1,]   NA   NA
[2,]   NA   NA


We can find out the dimensions of a matrix (number of rows, then number of columns) by using

dim(m)
[1] 2 2


You can check that matrices are vectors with a class attribute of matrix by using class() and typeof().

m <- matrix(c(1:3))

While class() shows that m is a matrix, typeof() shows that in this case the matrix is an integer vector (these can be character vectors, too).


class(m)
[1] "matrix" "array" 


typeof(m)
[1] "integer"
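The "vector with dimensions" idea can be checked by looking at the attributes of the matrix. A minimal sketch, where `v` is an illustrative name:

```r
m <- matrix(c(1:3))

# The only extra information a basic matrix carries is its dim attribute
attributes(m)  # $dim: 3 1
dim(m)         # 3 1
length(m)      # 3 (it is still a vector of three elements)
is.matrix(m)   # TRUE

# Dropping the dimensions gives back a plain vector
v <- as.vector(m)
is.matrix(v)   # FALSE
dim(v)         # NULL
```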

When creating a matrix, it is important to remember that matrices are filled column-wise

m<-matrix(1:6, nrow=2, ncol=3)
m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


If that is not what you want, you can use the byrow argument (a logical: can be TRUE or FALSE) to specify how the matrix is filled

m<-matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

You can create a matrix from a vector:

m<-sample(1:100, size=10)
m
 [1] 46 19 68 89 66 22 72 35 73 37


dim(m)<-c(2,5)
m
     [,1] [,2] [,3] [,4] [,5]
[1,]   46   68   66   72   73
[2,]   19   89   22   35   37

A lot is going on here. Let’s dissect it:

  • We generate a random integer vector using sample(). sample() in this case randomly draws 10 (size=10) numbers from 1 to 100 (1:100).


Note: if you want to get the same vector each time with the same parameters, you need to use set.seed() with a defined number first

  • We assign the vector dimensions using dim() and c(2,5), with the latter being c(rows, columns).

All of the above takes the random integer vector and transforms it into a matrix with 2 rows and 5 columns.
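The note about set.seed() can be tried out as follows. The seed value 123 is arbitrary, and `first_draw` and `second_draw` are illustrative names.

```r
# Setting the same seed before each draw gives the same "random" numbers
set.seed(123)
first_draw <- sample(1:100, size = 10)

set.seed(123)
second_draw <- sample(1:100, size = 10)

identical(first_draw, second_draw)  # TRUE

# Assigning dimensions turns the vector into a 2 x 5 matrix
dim(first_draw) <- c(2, 5)
dim(first_draw)                     # 2 5
```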

You can also bind columns and rows using cbind() and rbind():

x <- 1:3
y <- 10:12
m<-cbind(x, y)
m
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12


n<-rbind(x,y)
n
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

6.1. Matrix indexing


Akin to vectors, we revisit our square-brackets and can retrieve elements of a matrix by specifying the index along each dimension (e.g. “row” and “column”) in single square brackets.

m[3,2] # Note that it is [row,column].
 y 
12 
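Leaving one side of the comma empty returns a whole row or column, just as with data frames. A sketch with `m` re-created so it runs on its own:

```r
x <- 1:3
y <- 10:12
m <- cbind(x, y)

m[, 2]                # 10 11 12 (the whole second column)
m[, "y"]              # 10 11 12 (the same column, selected by name)
m[2, ]                # x = 2, y = 11 (the whole second row)
m[2:3, "x"]           # 2 3

# By default a single row or column is simplified to a vector;
# drop = FALSE keeps the result as a matrix
m[, 2, drop = FALSE]
```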

Tasks

  1. Transform the built-in dataset iris into a matrix using as.matrix() and assign it to a new variable of your choice.

  2. When you use class() and typeof(), what results do you get and why? What happened to the doubles in the data frame (hint: remember the coercion rules from earlier)?

Congratulations this is the END!


This was hard but do keep going. Join us for Session 3 - Manipulating and Plotting Data. Hopefully, you will find it easier than today.

  • Thank YOU! For your attention and effort

  • Be a part of this! You can help us by running or assisting with a future session (you can opt into this via the feedback survey)

  • Tell us… What you liked and what you didn’t using the feedback survey