6. Data frames

In this sixth section we’ll show how to work with two-dimensional data matrices in R. While there are various ways of doing so in R, we will focus only on so-called “data frames”.

At the end of this section, you’ll be able to construct and understand the basic structure of data matrices in R.


6.1. Constructing a data frame from two vectors

Remember that, to create a vector, we use the function c() (if you forgot all about vectors, see section 3.2). Suppose we have measured “height” (in cm) and determined “sex” (m/f) for 16 individuals. To store these data in two vectors, type and run:

height <- c(184.0, 174.2, 166.6, 193.2, 173.8, 166.4, 175.4, 183.3, 159.4, 171.8, 179.2, 165.8, 170.4, 178.1, 171.4, 159.7)
sex <- c('m', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f')


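If you are familiar with data analysis in programs other than R, you might notice that this approach differs from what those programs do. In MS Excel or SPSS, such data would be stored in a spreadsheet with one row per individual and one column per variable:

[Figure: spreadsheet with the columns “height” and “sex” and one row for each of the 16 individuals]
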
For these data, the rows represent different individuals (person 1 to 16) and the columns represent different variables (“height” and “sex”). To create this structure in R, we can use the function data.frame(). Type and run:

dat <- data.frame(height, sex)
dat

The order of the data points in both vectors is important! When combining both vectors into one data frame, we must be sure that the first observation in the vector height (184.0) is from the same individual as the first observation in the vector sex (“m”), that the second observation in the vector height (174.2) is from the same individual as the second observation in the vector sex (“m”), and so on.
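
As a quick check that two vectors line up before combining them, the sketch below uses a hypothetical three-person example. The names height_demo, sex_demo and demo_dat are made up for illustration and are not part of the course data; this is only a minimal illustration, not a required step.

```r
# Hypothetical mini example with three individuals (not the course data)
height_demo <- c(184.0, 174.2, 159.4)
sex_demo <- c('m', 'm', 'f')

# Both vectors should contain one value per individual, in the same order
length(height_demo) == length(sex_demo)

# Combine the vectors; each row now holds the values of one individual
demo_dat <- data.frame(height_demo, sex_demo)
demo_dat[1, ]    # first row: first height together with first sex
nrow(demo_dat)   # number of rows (individuals)
ncol(demo_dat)   # number of columns (variables)
```

Note that R can only compare the lengths of the vectors. It cannot know whether the values were entered in matching order, so that remains your own responsibility.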


6.2. Avoid confusion: keep your workspace clean

Look at the upper right panel displaying the objects stored in the R environment.



The environment stores the data frame in an object called dat. This object is the combination of height and sex, which are also still stored as separate vector objects. Having the same data in two places can easily create confusion! To remove the objects height and sex, we can use the function rm(). Type and run:

rm(sex)
rm(height)

The two objects have disappeared from the list in the Environment window on the right. If we now ask for height (or sex), R returns an error. Type and run:

height
Error: object 'height' not found
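
The rm() function also accepts several object names at once, and exists() can be used to check whether an object is still in the workspace. A small sketch, using the hypothetical objects tmp_a and tmp_b so that nothing from this section is affected:

```r
# Hypothetical throwaway objects, created only for this illustration
tmp_a <- c(1, 2, 3)
tmp_b <- c('x', 'y', 'z')

# Remove both objects with a single call
rm(tmp_a, tmp_b)

# exists() takes the object name as a character string
exists("tmp_a")   # FALSE, because tmp_a was removed
exists("tmp_b")   # FALSE, because tmp_b was removed
```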

However, the information that was stored in height and sex is still available in dat. We can access it using the $ operator as follows:

dat$height
dat$sex

As with a standard vector, we can apply functions to the columns of dat. Type and run:

mean(dat$height)
table(dat$sex)
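
Other functions work on columns in the same way. The sketch below uses a small hypothetical data frame (demo_dat, four made-up individuals) so that it can be run on its own. summary(), range() and tapply() are base R functions; tapply() is used here as one possible way to compute the mean height separately for each sex.

```r
# Hypothetical mini data frame, only for illustration;
# the columns are named directly inside data.frame()
demo_dat <- data.frame(
  height = c(184.0, 174.2, 159.4, 171.8),
  sex = c('m', 'm', 'f', 'f')
)

summary(demo_dat$height)   # minimum, quartiles, mean and maximum
range(demo_dat$height)     # smallest and largest value

# Mean height per group: apply mean() to height within each value of sex
mean_by_sex <- tapply(demo_dat$height, demo_dat$sex, mean)
mean_by_sex
```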


6.3. General construction of a data frame

Clearly, the approach of first creating two vectors, combining them into a data frame and subsequently removing the two vector objects from the environment is a bit of a hassle. Indeed, this task can be simplified! Type and run:

dat <- data.frame(
  height = c(184.0, 174.2, 166.6, 193.2, 173.8, 166.4, 175.4, 183.3, 159.4, 171.8, 179.2, 165.8, 170.4, 178.1, 171.4, 159.7), 
  sex = c('m', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f')
  )

The command above should be read as follows: “Create a new object called ‘dat’, and let it be a data frame that has one column named ‘height’ with values (184.0, 174.2, …, 159.7) and one column named ‘sex’ with values (‘m’, ‘m’, …, ‘f’)”.
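
To inspect the structure of a data frame built this way, base R offers functions such as str(), names(), dim() and head(). A minimal sketch on a hypothetical four-row data frame (demo_dat is made up for illustration and is not the course data):

```r
# Hypothetical mini data frame, only for illustration
demo_dat <- data.frame(
  height = c(184.0, 174.2, 159.4, 171.8),
  sex = c('m', 'm', 'f', 'f')
)

str(demo_dat)       # compact overview: rows, columns and column types
names(demo_dat)     # column names
dim(demo_dat)       # number of rows, then number of columns
head(demo_dat, 2)   # first two rows
```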

Now suppose that additional data have become available for these same subjects: their current smoking status (0: current non-smoker, 1: current smoker) was registered. We therefore want to add a column called smoking_status to the data frame dat. To do this, type and run:

dat$smoking_status <- c(0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0)
dat
   height sex smoking_status
1   184.0   m              0
2   174.2   m              0
3   166.6   m              0
4   193.2   m              0
5   173.8   m              0
6   166.4   m              1
7   175.4   m              0
8   183.3   m              0
9   159.4   f              1
10  171.8   f              0
11  179.2   f              0
12  165.8   f              1
13  170.4   f              1
14  178.1   f              0
15  171.4   f              1
16  159.7   f              0
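
A new column can also be computed from an existing one. The sketch below, on a hypothetical three-row data frame, adds a made-up column height_m (height in metres) and then a smoking_status column with one value per row. All names and values here are assumptions for illustration only.

```r
# Hypothetical mini data frame, only for illustration
demo_dat <- data.frame(
  height = c(184.0, 174.2, 159.4),
  sex = c('m', 'm', 'f')
)

# Derive a new column from an existing one (cm converted to m)
demo_dat$height_m <- demo_dat$height / 100

# A new column needs one value per row of the data frame
demo_dat$smoking_status <- c(0, 0, 1)
length(demo_dat$smoking_status) == nrow(demo_dat)

demo_dat
```

Assigning a vector whose length does not fit the number of rows generally produces an error, which helps to catch data entry mistakes early.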


6.4. Creating data sets in R: a good idea?

The short answer is: no.

If you need to manually “digitize” data for more than a few patients, spreadsheet programs (MS Excel, for instance) are more convenient for building such data sets. Once you have finished building your data set, or if you work with an existing data set, you can (and should!) use R to perform your analyses.

With R you are able to work with a large number of different data file types, such as SPSS files (.sav extension), MS Excel files (.xls or .xlsx extension), and comma separated values files (.csv extension).
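
As an illustration of the .csv case, the sketch below writes a small hypothetical data frame to a temporary file and reads it back with read.csv(), which is part of base R. The file name comes from tempfile() and the data are made up; in practice you would pass the path of your own file. Reading SPSS or MS Excel files typically relies on add-on packages, which are not shown here.

```r
# Hypothetical mini data frame, only for illustration
demo_dat <- data.frame(
  height = c(184.0, 174.2, 159.4),
  sex = c('m', 'm', 'f')
)

# Write the data to a temporary .csv file (stand-in for an existing data file)
csv_file <- tempfile(fileext = ".csv")
write.csv(demo_dat, csv_file, row.names = FALSE)

# Read the file back into a data frame
dat_from_csv <- read.csv(csv_file)
dat_from_csv

# Remove the temporary file again
unlink(csv_file)
```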