Chapter 6 Data Frames

R stores most datasets of the kind you might see in a spreadsheet in a class of objects called a data frame. Other packages provide variations on data frames, such as data tables and tibbles, which add advanced features or other benefits, but most of what we discuss for data frames also works for those classes of objects.

Structure of a Data Frame

While the internal workings of a data frame are surprisingly complex and allow for some fairly advanced data manipulation, the basics are pretty straightforward.

A data frame is a rectangular structure. This means it has two dimensions: rows and columns. Each row has the same number of columns and each column has the same number of rows. It is best to think of a data frame as a collection of columns. All the data within a column must be the same type of data (for example, all numbers or all text); however, different columns within a data frame can hold different types of data. Every entry in a column is called a cell. An example of a small data frame holding hypothetical student information is given below.

##      ID             Name         Major  GPA
## 1 20304      Adam Baines       English 3.30
## 2 20305 Claire Dougherty    Statistics 3.53
## 3 20306      Eric Forest    Psychology 3.87
## 4 30101      George Hull Public Health 3.79

The columns ID and GPA hold numeric variables, while Name and Major hold text, or “character” variables.
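The text does not show how this data frame was created. As a minimal sketch, assuming the values are typed in by hand rather than read from a file, the example could be rebuilt with data.frame(), giving one vector per column, and then checked for its shape and column types. The name gpa is the one used later in this chapter.

```r
# Rebuild the example data frame one column at a time
# (values copied from the table above).
gpa <- data.frame(
  ID    = c(20304, 20305, 20306, 30101),
  Name  = c("Adam Baines", "Claire Dougherty", "Eric Forest", "George Hull"),
  Major = c("English", "Statistics", "Psychology", "Public Health"),
  GPA   = c(3.30, 3.53, 3.87, 3.79),
  stringsAsFactors = FALSE  # keep text as character; the default since R 4.0
)

dim(gpa)            # number of rows, then number of columns
sapply(gpa, class)  # the type of data held in each column
```

Because every vector passed to data.frame() has four entries, the result is rectangular: four rows by four columns.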

While every column has the same number of rows, some values might be missing (for example, a student’s major may be unknown). Missing values are denoted by NA, holding a spot in the data frame and keeping the rectangular structure intact.
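To illustrate how NA holds a place, here is a small sketch using a hypothetical variation on the example in which the second student's major is unknown. The data frame name students and the missing value are assumptions made for illustration only.

```r
# Hypothetical variation: the second student's major is unknown.
students <- data.frame(
  Name  = c("Adam Baines", "Claire Dougherty", "Eric Forest"),
  Major = c("English", NA, "Psychology"),
  stringsAsFactors = FALSE
)

dim(students)              # still 3 rows and 2 columns
is.na(students$Major)      # TRUE marks the missing entry
sum(is.na(students$Major)) # how many majors are missing
```

The Major column still has three entries, so the data frame stays rectangular even though one value is unknown.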

Using Data Frames

Data frames are only useful if you can access parts of them to view data, perform analysis, or graph results.

We can access a cell within the data frame by using square brackets and giving, in order, the row and column number we want to view. For example, the data frame above is named gpa, so to access a particular cell, we would type gpa[row_number, column_number]:

gpa[2, 3]

## [1] "Statistics"

gpa[4, 1]

## [1] 30101

The first line of code gives the value in the second row and third column of the dataset: the second major listed. The second line gives the value in the fourth row and first column, the ID of the fourth student.

If we want an entire row or an entire column, we give the row or column number and leave the other entry blank:

gpa[1, ]

##      ID        Name   Major GPA
## 1 20304 Adam Baines English 3.3

gpa[, 2]

## [1] "Adam Baines"      "Claire Dougherty" "Eric Forest"      "George Hull"

The first line of code gives the first row of data, all the information for Adam Baines. The second line of code gives the second column, the student names.
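The same bracket notation also accepts a column name in place of a column number, and a vector of positions or names when you want several rows or columns at once. The sketch below rebuilds the example data frame first so that it runs on its own. How the original gpa object was created is not shown in the text, so that step is an assumption.

```r
# Rebuild the example data frame (assumed construction).
gpa <- data.frame(
  ID    = c(20304, 20305, 20306, 30101),
  Name  = c("Adam Baines", "Claire Dougherty", "Eric Forest", "George Hull"),
  Major = c("English", "Statistics", "Psychology", "Public Health"),
  GPA   = c(3.30, 3.53, 3.87, 3.79),
  stringsAsFactors = FALSE
)

gpa[2, "Major"]           # same cell as gpa[2, 3]
gpa[c(1, 3), ]            # first and third rows, all columns
gpa[, c("Name", "GPA")]   # all rows, only the Name and GPA columns
```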

While we can access columns by giving the column number, that can make code hard to read. The preferred way to access a column within a data frame is to use the dollar sign, $, and then give the column name. The full code would look like dataframe_name$column_name. For example:

gpa$Name

## [1] "Adam Baines"      "Claire Dougherty" "Eric Forest"      "George Hull"

gpa$GPA

## [1] 3.30 3.53 3.87 3.79

The above lines of code are much clearer for a human to read, assuming the column names are meaningful. Once we have a column of data, we can compute statistics, such as the mean and standard deviation:

mean(gpa$GPA)

## [1] 3.6225

sd(gpa$GPA)

## [1] 0.2594064

You can also create new variables in your dataset by modifying or combining existing variables. For example, say you realize the GPAs are calculated incorrectly and everyone should have an additional 0.10 points.

gpa$GPAcorrect = gpa$GPA + 0.1
print(gpa)

##      ID             Name         Major  GPA GPAcorrect
## 1 20304      Adam Baines       English 3.30       3.40
## 2 20305 Claire Dougherty    Statistics 3.53       3.63
## 3 20306      Eric Forest    Psychology 3.87       3.97
## 4 30101      George Hull Public Health 3.79       3.89

A new variable, GPAcorrect, is added to the end of the data frame. When we calculated gpa$GPA + 0.1, we saved it to a new variable inside the existing data frame, adding the new variable as a column at the end⁴.
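The difference between adding a column and replacing one comes down to whether the name on the left of the assignment already exists. The sketch below shows both cases. It rebuilds the example data frame so that it runs on its own, and gpa_overwritten is a hypothetical name for a copy, used so the original values stay intact.

```r
# Rebuild the example data frame (assumed construction).
gpa <- data.frame(
  ID    = c(20304, 20305, 20306, 30101),
  Name  = c("Adam Baines", "Claire Dougherty", "Eric Forest", "George Hull"),
  Major = c("English", "Statistics", "Psychology", "Public Health"),
  GPA   = c(3.30, 3.53, 3.87, 3.79),
  stringsAsFactors = FALSE
)

# A new column name appends a column and leaves GPA untouched.
gpa$GPAcorrect <- gpa$GPA + 0.1
names(gpa)

# Reusing an existing name replaces that column instead.
# Done here on a copy so the original GPA values survive in gpa.
gpa_overwritten <- gpa
gpa_overwritten$GPA <- gpa_overwritten$GPA + 0.1
ncol(gpa_overwritten)  # still 5 columns: nothing was appended
```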

Some functions that take multiple variables streamline how you can input columns from a dataset, but for a lot of simple tasks, the dataframe_name$column_name format will work just fine.
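The text does not name which functions it has in mind. As one possible illustration, assuming it refers to base R functions that accept the data frame itself, with() lets you write column names without the dataframe_name$ prefix, and formula-based functions such as aggregate() take a data argument.

```r
# Rebuild the example data frame (assumed construction).
gpa <- data.frame(
  ID    = c(20304, 20305, 20306, 30101),
  Name  = c("Adam Baines", "Claire Dougherty", "Eric Forest", "George Hull"),
  Major = c("English", "Statistics", "Psychology", "Public Health"),
  GPA   = c(3.30, 3.53, 3.87, 3.79),
  stringsAsFactors = FALSE
)

# with() evaluates an expression using the columns of gpa as variables.
with(gpa, mean(GPA))   # same result as mean(gpa$GPA)

# A formula plus data = gpa names the columns without any $ signs.
# Here: the mean GPA within each major (one student per major in this example).
by_major <- aggregate(GPA ~ Major, data = gpa, FUN = mean)
by_major
```

With only one student per major, the aggregate() result simply repeats each GPA, but the same call would average within groups in a larger dataset.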

⁴ If we had saved the new data to an existing column name, it would have overwritten that column and removed the original data. That is generally considered a bad idea, because you always want to keep the original data available to refer back to.