8. Subset selection

Working with data commonly involves subset selection. For instance, you may need to replace certain values in the data set, or you may want to restrict analysis or computations to a specific subgroup of the data set.

At the end of this section you’ll be able to select a subset of data from a vector or data frame.

8.1. Subsetting a vector by position number

Consider again the two vectors with “height” (in cm) and determined “sex” (m/f) for 16 individuals (as in section 6.1). Type and run:

height <- c(184.0, 174.2, 166.6, 193.2, 173.8, 166.4, 175.4, 183.3, 159.4, 171.8, 179.2, 165.8, 170.4, 178.1, 171.4, 159.7)
sex <- c('m', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f')

To ask for the first element of the vector object height, we indicate the position number of the element (number ‘1’) that we want to extract in square brackets. Type and run:

height[1] # display first element in object height

We may even store this element as a new object arbitrarily named firstheight. Type and run:

firstheight <- height[1]

We can also use a vector to indicate the position numbers of multiple values we want to extract. For instance, to extract the second and tenth value:

ind <- c(2, 10)
height[ind]
[1] 174.2 171.8

Or, equivalently:

height[c(2, 10)]
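
The introduction to this section mentions replacing values. Square brackets can serve that purpose too when they appear on the left-hand side of <-. The sketch below works on a copy, so that height itself stays unchanged; the name height2 and the replacement value 174.5 are made up for this illustration.

```r
height <- c(184.0, 174.2, 166.6, 193.2, 173.8, 166.4, 175.4, 183.3,
            159.4, 171.8, 179.2, 165.8, 170.4, 178.1, 171.4, 159.7)

height2 <- height     # hypothetical copy, so the original is not altered
height2[2] <- 174.5   # replace the second element (arbitrary example value)

height2[c(2, 10)]     # display the second and tenth element of the copy
height[2]             # the original vector still holds the old value
```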

8.2. Subsetting a vector by logical statement

You may want to select only those elements that pass a certain threshold. For example, suppose you want to extract all elements of the object height with a value larger than 170 cm. This can be specified as a logical statement. Type and run:

over170 <- height > 170

The object over170 is a logical vector that is TRUE for each element of height with a value larger than 170 cm and FALSE for all other elements of height.

To extract the values of height that are higher than 170, we can use the object over170. Type and run:

heightover170 <- height[over170]
heightover170 # display the selected values
 [1] 184.0 174.2 193.2 173.8 175.4 183.3 171.8 179.2 170.4 178.1 171.4

Or equivalently:

heightover170 <- height[height > 170]

The command should be read as follows: from the object height, select the subset for which the condition ‘height > 170’ is satisfied.
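
It can help to look at the logical vector itself before using it as a selector. The sketch below displays over170, counts its TRUE values with sum() (TRUE counts as 1 and FALSE as 0), and uses which() to list the position numbers for which the condition holds.

```r
height <- c(184.0, 174.2, 166.6, 193.2, 173.8, 166.4, 175.4, 183.3,
            159.4, 171.8, 179.2, 165.8, 170.4, 178.1, 171.4, 159.7)

over170 <- height > 170

over170          # one TRUE or FALSE per element of height
sum(over170)     # number of elements larger than 170
which(over170)   # position numbers of those elements
```

Using these position numbers, height[which(over170)] returns the same values as height[over170].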

8.3. More logical statements

The > symbol is an operator that compares values and returns TRUE or FALSE; it is often called a comparison (or relational) operator. R also recognizes the following operators of this kind:

  • < (less than)
  • <= (less than or equal to)
  • >= (greater than or equal to)
  • != (not equal to)
  • == (equal to)

For instance, suppose you want to extract the values in height that belong to males. Type and run:

height[sex == 'm'] # display height for males
[1] 184.0 174.2 166.6 193.2 173.8 166.4 175.4 183.3

The command should be read as follows: from the object height, select the subset for which the condition ‘sex is male’ is satisfied. Alternatively, you may use:

height[sex != 'f']

changing the condition to ‘sex is not female’.
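
The other operators in the list behave in the same way. As a brief sketch, the lines below apply each of them to a short made-up vector x (not part of the course data), so that the difference between, for example, < and <= is easy to see.

```r
x <- c(1, 2, 3, 4, 5)   # made-up values for illustration

x[x < 3]    # less than 3
x[x <= 3]   # less than or equal to 3
x[x >= 3]   # greater than or equal to 3
x[x != 3]   # not equal to 3
x[x == 3]   # equal to 3
```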

It is also possible to combine multiple logical statements. For example, you may want to select subjects that are both female and have height larger than 170 cm. Type and run:

height[sex == 'f' & height > 170] # display height for females AND height higher than 170 
[1] 171.8 179.2 170.4 178.1 171.4

The & (‘AND’) operator combines both logical statements and requires that both are satisfied. If it is enough that at least one of the two statements is satisfied, use the operator | (‘OR’). Type and run:

height[sex == 'f' | height > 170] # display height for females OR height higher than 170 
 [1] 184.0 174.2 193.2 173.8 175.4 183.3 159.4 171.8 179.2 165.8 170.4
[12] 178.1 171.4 159.7
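
To see what & and | do, it may help to apply them to two short logical vectors first. The sketch below uses the made-up vectors a and b. The last line uses the ! (‘NOT’) operator, which is not covered in the text above, to reverse a condition.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)   # made-up logical vectors
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b   # TRUE only where both a and b are TRUE
a | b   # TRUE where at least one of a and b is TRUE
!a      # TRUE where a is FALSE, and FALSE where a is TRUE
```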

8.4. Subsetting data frames

We will make use of the aneurysm data for the remainder of this section. Make sure that aneurysm.txt is stored in your working directory (if you have forgotten about the working directory or the aneurysm data, go back to section 7). Type and run:

df <- read.table(file = "aneurysm.txt", header = TRUE)

Remember that you can select a column of a data frame using the $ operator. Type and run:

df$AGE10

Notice that df$AGE10 is simply a numeric vector. To perform subset selection we can therefore use the same approach on df$AGE10 as on ordinary vectors. For instance, we may use df$AGE10[c(1, 2, 3)] to select the first three values, or df$AGE10[df$AGE10 >= 6] to select all values equal to or greater than 6.
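
Because a column is an ordinary vector, the condition inside the square brackets may also be built from a different column of the same data frame. The sketch below illustrates this on a small made-up data frame named toy. Only the column names SEX and AGE10 are borrowed from the aneurysm data; the values are invented.

```r
# made-up data frame; the values are invented for illustration
toy <- data.frame(SEX = c('Male', 'Female', 'Male', 'Female'),
                  AGE10 = c(4.3, 6.1, 7.0, 5.4))

toy$AGE10[toy$AGE10 >= 6]      # values of AGE10 equal to or greater than 6
toy$AGE10[toy$SEX == 'Male']   # values of AGE10 for males only
```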

However, data frames are two-dimensional, and it is possible to extract values directly based on row and column numbers. As with subsetting a vector, we use square brackets, but now the row and column numbers are separated by a comma. The number given before the comma indicates the row number; the number given after the comma indicates the column number. For example, you may want to extract from df the fifth value from the second column. Type and run:

df[5, 2]
[1] 5.4

To extract the first three values of the second column, type and run:

df[c(1, 2, 3), 2]
[1] 4.3 4.5 4.9

Sometimes you may want to select a few columns but all rows, or all columns but a few rows. This can be achieved by leaving the position indicator empty. Do not forget the comma! For instance, to select only the rows 237 and 238 in df, type and run:

df[c(237, 238),]
237 Male   8.0  1   1        1    1     0  Alive
238 Male   8.4  0   1        1    1     1   Dead

To select only columns 1 and 4, type and run:

df[,c(1, 4)]
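
The same rules can be tried out on a small made-up data frame, where the result is easy to check by eye. In the sketch below, the data frame toy, its column ID and all of its values are invented for illustration. The last line selects columns by name rather than by number, which R also allows and which keeps working if the column order changes.

```r
# made-up data frame; the values are invented for illustration
toy <- data.frame(SEX = c('Male', 'Female', 'Male', 'Female'),
                  AGE10 = c(4.3, 6.1, 7.0, 5.4),
                  ID = c(1, 2, 3, 4))

toy[2, 2]               # row 2, column 2: a single value
toy[c(1, 3), ]          # rows 1 and 3, all columns
toy[, c(1, 3)]          # all rows, columns 1 and 3
toy[, c('SEX', 'ID')]   # the same two columns, selected by name
```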

Subset selection of data frames is commonly used to limit an analysis to a certain subgroup. To do so, you may first want to split the complete data frame by subgroup. For example, to analyze the males and females in df separately, type and run:

df.m <- df[df$SEX == 'Male',]
df.f <- df[df$SEX == 'Female',]

Check to be sure that df.m contains only data of males and df.f only of females.
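
One way to carry out such a check is with unique(), which lists the distinct values in a column. The sketch below demonstrates the idea on a made-up data frame toy with invented values, since the contents of aneurysm.txt are not reproduced here. With the real data you would apply the same calls to df.m$SEX and df.f$SEX.

```r
# made-up data frame; the values are invented for illustration
toy <- data.frame(SEX = c('Male', 'Female', 'Male', 'Female'),
                  AGE10 = c(4.3, 6.1, 7.0, 5.4))

toy.m <- toy[toy$SEX == 'Male', ]     # rows for males only
toy.f <- toy[toy$SEX == 'Female', ]   # rows for females only

unique(toy.m$SEX)   # should list 'Male' only
unique(toy.f$SEX)   # should list 'Female' only

# TRUE if every row of toy ended up in exactly one of the two subsets
nrow(toy.m) + nrow(toy.f) == nrow(toy)
```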