Chapter 7 More Data Structures

In the last chapter we introduced the concept of data frames and of variable types within data frames. Data frames are special cases of objects in the R system. You can think of objects as nouns: they are the “things” in your R environment, such as datasets, variables, and regression models.

In R, every object has a class that tells you what the object is and tells functions what they can do with it. Returning to the last chapter, data frames are one class of objects in R. They have rows and columns, different columns can have different classes, and all values within a column must be of the same class.
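You can check the class of any object with the class function. The lines here are a minimal sketch that uses the built-in mtcars dataset as a stand-in for your own data; that choice of dataset is ours for illustration.

```r
# mtcars is a data frame that ships with R
class(mtcars)
## [1] "data.frame"

# Each column has its own class
class(mtcars$mpg)
## [1] "numeric"

# sapply() applies class() to every column at once
sapply(mtcars, class)
```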

Some of the most common classes you will encounter are described below.

Data

Numbers

When people think of statistics and data, they think of numbers. The two main classes of number-based data in R are the numeric and integer classes. As you might guess from the names, the integer class requires values to be whole numbers while the numeric class allows decimals. The other difference is under the hood: integer objects take less memory to store than numeric objects do. You can do all the same calculations with one as you would with the other.
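As a small illustration of the two number classes, the sketch below creates one value of each and compares the memory used by two longer vectors. The object names are hypothetical, and the exact byte counts printed by object.size may vary by system.

```r
x = 5      # a plain number is stored as numeric
y = 5L     # the L suffix asks R for an integer
class(x)
## [1] "numeric"
class(y)
## [1] "integer"

# Arithmetic works the same way on both
x / 2
y / 2

# An integer vector takes roughly half the memory of a numeric vector
big_numeric = rep(5, 100000)
big_integer = rep(5L, 100000)
object.size(big_numeric)
object.size(big_integer)
```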

Categorical Variables

Categorical variables are primarily held in character and factor classes. Character objects are strings of text, such as names or addresses. With some R wizardry, you can edit the text character-by-character. Factors, on the other hand, are more traditional categorical data. Usually we think of factors as categorical data with a relatively small number of levels, such as highest educational degree obtained or employment status.

Similar to the distinction between numeric and integer classes, for most introductory purposes character and factor classes can be treated the same. One important difference is that character strings are wrapped in quotation marks. You must include the quotation marks when you create a string, and a character vector will show them when it is displayed on the screen by itself.
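To see the difference between the two classes, here is a short sketch that stores the same made-up values once as character and once as a factor. The variable names and degree labels are hypothetical.

```r
# The same values stored two ways
degree_chr = c("BA", "MS", "BA", "PhD", "MS")
degree_fct = factor(degree_chr)

degree_chr           # displays with quotation marks
degree_fct           # displays without quotes and lists the levels

# A factor keeps track of its set of possible values
levels(degree_fct)
## [1] "BA"  "MS"  "PhD"

# Counting how often each level occurs
table(degree_fct)
```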

Logical Variables

A somewhat different class of variables is logical. Logical variables can take one of two values, TRUE or FALSE, and are often the result of a comparison between two values. For example:

result1 = 2 < 4
result1

## [1] TRUE

result2 = 5 < 1
result2

## [1] FALSE

If you attempt to do math on logicals, TRUE will become 1 and FALSE will become 0, so adding logicals together will tell you how many TRUEs there are.
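A brief sketch of counting with logicals, using a made-up vector of ages:

```r
ages = c(15, 22, 37, 18, 64)   # hypothetical values
is_adult = ages >= 18          # one TRUE/FALSE per age
is_adult
## [1] FALSE  TRUE  TRUE  TRUE  TRUE

# Adding logicals counts the TRUEs
sum(is_adult)
## [1] 4

# Averaging logicals gives the proportion of TRUEs
mean(is_adult)
## [1] 0.8
```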

You can also use Boolean logic to create complex logical statements by combining many logicals into a single logical, such as “are all of these true” or “are any of these true.” We won’t worry about that now, but it is a powerful approach to controlling what calculations can happen in certain instances.
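For a first taste only, the sketch below shows the base R functions all and any along with the operators & (and), | (or), and ! (not). The vector name checks is hypothetical.

```r
checks = c(2 < 4, 5 < 1, 3 == 3)   # TRUE, FALSE, TRUE

all(checks)    # are all of these true?
## [1] FALSE
any(checks)    # are any of these true?
## [1] TRUE

(2 < 4) & (3 == 3)   # TRUE only when both sides are TRUE
(5 < 1) | (2 < 4)    # TRUE when at least one side is TRUE
!(5 < 1)             # flips FALSE to TRUE
```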

Dates and Times

The final classes of data that we will talk about are pretty unlikely to appear as a primary variable of focus, but still might show up from time to time. These are classes that store date and time information. First, the Date class holds dates, which usually display in the form YYYY-MM-DD, so that, for example, 2022-01-31 is January 31, 2022.
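As a quick sketch, a Date object can be created from text with as.Date and then used in simple date arithmetic. The object name d is hypothetical.

```r
d = as.Date("2022-01-31")
class(d)
## [1] "Date"

# Adding a number moves the date forward by that many days
d + 30
## [1] "2022-03-02"

# Subtracting two dates gives the number of days between them
as.Date("2022-12-25") - d
## Time difference of 328 days
```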

If you need times along with the dates, the classes POSIXct and POSIXlt are available in R. Both of these hold times down to the second (or fraction of a second) but have slightly different inner workings. The POSIX classes (and times generally) can be difficult to work with because of differences in time zones and daylight saving time. You likely won’t have to do intense work with time data, so we won’t get into too many details here.
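A minimal sketch of a date-time object is shown here. The time zone is set explicitly to UTC so the result does not depend on your computer's settings; the object name t1 and the choice of time zones are ours for illustration.

```r
t1 = as.POSIXct("2022-01-31 14:30:00", tz = "UTC")
class(t1)
## [1] "POSIXct" "POSIXt"

# POSIXct counts seconds, so adding 3600 moves forward one hour
t1 + 3600

# The same instant displayed in a different time zone
format(t1, tz = "America/New_York", usetz = TRUE)

# POSIXlt stores the pieces (hour, minute, ...) separately
as.POSIXlt(t1)$hour
## [1] 14
```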

Matrices

At first glance, matrices are a lot like data frames: they are two-dimensional and rectangular. The primary difference is that for an object to have class matrix, all the cells in the matrix must be of the same type. Any of the basic types above (numeric, character, logical, etc.) will work, but the type must be consistent throughout the matrix. As discussed in the last chapter, data frames allow each column to have a different class, so they are a bit more general than objects of class matrix.
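The sketch below shows what the one-type rule means in practice: when values of different types are placed in a matrix, R converts them all to a common type. The object names are hypothetical.

```r
# A matrix of integers: every cell has the same type
m = matrix(1:6, nrow = 2)
is.matrix(m)
## [1] TRUE
typeof(m)
## [1] "integer"

# Mixing types forces everything to character
mixed = matrix(c(1, "a", TRUE, 4), nrow = 2)
typeof(mixed)
## [1] "character"

# A data frame keeps a separate class for each column
df = data.frame(id = 1:2, name = c("a", "b"))
sapply(df, class)

# Converting that data frame to a matrix makes every cell character
typeof(as.matrix(df))
## [1] "character"
```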

Outputs

In addition to the data you start with, the results you generate are also objects. Some results might have simple classes. For example, the result of computing the mean of a set of numbers is a numeric object. In other cases, the results are more complex and have their own class structure.

Some classes you are likely to see in your results are described below.

Hypothesis Tests

Results from hypothesis tests available in base R are stored as objects of class htest. These objects have different components depending on the type of test, but most have statistic, parameter, and p.value components. The statistic component is the value of the test statistic, the parameter component is the value of the relevant parameter(s) (usually degrees of freedom), and the p.value component is the p-value derived from the test. Similar to how columns can be accessed from a dataset, these components can be accessed using the $ sign. For example:

twoway = matrix(c(12, 19, 34, 40, 27, 24), nrow = 2)
twoway

##      [,1] [,2] [,3]
## [1,]   12   34   27
## [2,]   19   40   24

mytest = chisq.test(twoway)
mytest

## 
##  Pearson's Chi-squared test
## 
## data:  twoway
## X-squared = 1.6092, df = 2, p-value = 0.4473

mytest$statistic

## X-squared 
##  1.609189

mytest$parameter

## df 
##  2

mytest$p.value

## [1] 0.4472693
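To see every component stored in an htest object, you can use the names function. The sketch below rebuilds the table from its displayed values (the matrix call is our reconstruction) and also runs a t-test on the built-in mtcars dataset, chosen only for illustration, to show that a different test produces the same class.

```r
# Rebuild the two-way table from the values displayed above
twoway = matrix(c(12, 19, 34, 40, 27, 24), nrow = 2)
mytest = chisq.test(twoway)

class(mytest)
## [1] "htest"

# List all components that can be accessed with $
names(mytest)

# A component specific to the chi-squared test: expected counts
mytest$expected

# A different test gives an object of the same class
ttest = t.test(mpg ~ am, data = mtcars)
class(ttest)
## [1] "htest"
ttest$p.value
```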

Regression Models

Regression model classes depend on the type of model being run. A linear model (function lm) results in an object of class lm. If you use the glm function to run a logistic regression or another generalized linear model, the result will have class glm⁵. Similar to the htest class, we can access certain components of an lm or glm object using the $ symbol, or we can use functions that know how to handle these classes of objects. As some examples:

model_name$coefficients (which can be shortened to model_name$coef) and coef(model_name) will give the estimated regression coefficients.
confint(model_name, level=0.95) will give 95% confidence intervals for each coefficient.
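model_name$fitted.values (which can be shortened to model_name$fitted) and fitted(model_name) will give the fitted values ($\hat{y}$ from a linear regression).
model_name$residuals and residuals(model_name) will give the model residuals ($y - \hat{y}$ from a linear model). For a glm object the two can differ, because residuals() returns deviance residuals by default while the $residuals component stores working residuals.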
summary(model_name) will produce a few summary statistics, including a table of regression coefficients with p-values and fit statistics particular to the model type (for example, $R^2$ and residual standard error in linear regression).
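The functions listed above can be tried on a small model. The sketch here uses the built-in mtcars dataset and a simple regression of fuel economy on weight; both the dataset and the choice of variables are assumptions made only for illustration.

```r
# Fit a linear model: miles per gallon as a function of weight
model_name = lm(mpg ~ wt, data = mtcars)
class(model_name)
## [1] "lm"

coef(model_name)                     # estimated coefficients
confint(model_name, level = 0.95)    # 95% confidence intervals
head(fitted(model_name))             # first few fitted values
head(residuals(model_name))          # first few residuals
summary(model_name)$r.squared        # R-squared pulled from the summary

# A logistic regression fit with glm
logit_model = glm(am ~ wt, data = mtcars, family = binomial)
class(logit_model)
## [1] "glm" "lm"
```

Notice that the glm object lists lm as a second class, which is why many functions written for lm objects also work on glm objects.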

Plots

If you use ggplot2 to create graphics, they can also be saved as objects. The saved graphics can then be recalled, displayed, and edited as needed.
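A minimal sketch of saving and editing a plot object is given here. It assumes the ggplot2 package is installed, so the code checks for it first; the built-in mtcars dataset and the object names p and p2 are our choices for illustration.

```r
# Only run if ggplot2 is available on this computer
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Save the plot as an object instead of displaying it right away
  p = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

  # Edit the saved plot by adding to it
  p2 = p + labs(title = "Fuel economy by weight")

  # Display the edited plot
  print(p2)
}
```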

The base R graphics plots can also be saved, but in most cases you can only save the underlying structure of the plot, not any annotations, color changes, etc., that you’ve made.
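As one example of saving the underlying structure, the hist function returns an object of class histogram that holds the bin boundaries and counts. The sketch below uses a made-up vector of values.

```r
values = c(2, 3, 3, 4, 5, 5, 5, 6, 7, 9)   # hypothetical data

# plot = FALSE computes the histogram without drawing it
h = hist(values, plot = FALSE)
class(h)
## [1] "histogram"

# The saved structure: bin boundaries and counts
h$breaks
h$counts

# Redraw from the saved object; titles and colors are supplied again here
plot(h, main = "Redrawn from the saved object", col = "gray")
```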

⁵ Not all regression classes match the function name, but many do.