Chapter 8 R Language and Syntax

So we’re in Chapter 8 and finally starting to talk about how to write R code. Just like any natural human language, computer coding languages have a syntax to them, an expectation about how parts of the language go together to form a coherent whole.

Here we will discuss some of the basic R syntax rules that often get overlooked in a course that isn’t dedicated to R programming, but that are necessary to write code and understand what it is doing. Some of this you will have seen already in earlier chapters, but now we will talk about what those small bits of code are doing.

Assignment to Variables

One of the most important features of the R language is the ability to “assign” results to variables. Whether it is a simple calculation or a complex statistical model, you may want to save particular values and recall them later, either as inputs to new analyses or just to view their results. Assignment usually happens through the equal sign, =, or a left arrow, <- (see footnote 1). Once you assign a value or object to a variable, that variable becomes an object in your environment. You can see it in the top right pane of your RStudio window, and you can recall that value for the rest of your session. For example:

a = 4
b = 3^2
a
## [1] 4
b
## [1] 9
a * b
## [1] 36
# using the GPA dataset from chapter 6:
gpa$GPA
## [1] 3.30 3.53 3.87 3.79
gpa$GPA / a
## [1] 0.8250 0.8825 0.9675 0.9475
avg_gpa = mean(gpa$GPA)
avg_gpa
## [1] 3.6225

One thing you will notice is that when we assign a value or result to a variable, we don’t actually get to see the value without calling that variable again (in particular, see the avg_gpa example). One way to assign a result and view it at the same time is to wrap your assignment in parentheses:

(sd_gpa = sd(gpa$GPA))
## [1] 0.2594064

The parentheses tell R to save the result to sd_gpa, but also print the result to the screen. This can save a lot of re-typing.
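
As a small sketch of the two assignment operators side by side, the lines below store the same value with = and with <-, then use the parentheses trick with the arrow. The gpa data frame is rebuilt here from the four GPA values printed above so the example runs on its own; x and y are just placeholder names.

```r
# Both operators store a value under a name
x = 5
y <- 5
identical(x, y)
## [1] TRUE

# gpa is rebuilt from the four GPA values shown above (an assumption
# made only so this example is self-contained)
gpa = data.frame(GPA=c(3.30, 3.53, 3.87, 3.79))

# Parentheses work the same way with the arrow: assign, then print
(avg_gpa <- mean(gpa$GPA))
## [1] 3.6225
```

Whichever operator you pick, the advice about consistency later in this chapter applies here too.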

Case Sensitive

R is what is called “case sensitive.” That means that capital and lower case letters are treated as different names. So apple, Apple, and APPLE can be three different objects that coexist. It is important to keep R’s case sensitivity in mind because if you call an object or function using the wrong case, you will receive an error message saying that it cannot be found or doesn’t exist. Also, just because you can use capital and lower case letters to distinguish objects doesn’t mean you should. Having to remember the difference between apple, Apple, and APPLE is going to make your coding more complicated than it already is. Usually we try to stick to lower case letters as much as possible. You can get some ideas for formatting variable names in the next section.
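
Here is a quick sketch of case sensitivity in action. The three objects hold different values, and asking for a fourth spelling that was never created produces an error. The tryCatch() call is only there to capture the error message so the rest of the script keeps running; the name msg is made up for this illustration.

```r
apple = 1
Apple = 2
APPLE = 3

# Three separate objects, each with its own value
c(apple, Apple, APPLE)
## [1] 1 2 3

# aPPle was never created, so R cannot find it;
# tryCatch() stores the error message instead of stopping
msg = tryCatch(aPPle, error=function(e) conditionMessage(e))
msg
```

The message stored in msg should say that the object could not be found, which is the usual sign of a mistyped or mis-capitalized name.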

Naming Rules and Conventions

When you assign values to variables, choosing a good name is important. In the example above, single-letter names like a and b are okay for quick calculations, but they are really meaningless. Names like avg_gpa and sd_gpa are better since they tell you what the value is. When you are choosing a name, there are a few rules to keep in mind and some other conventions, or rules of thumb, that help keep things straight.

Rule: Names of objects can contain letters, numbers, and underscores (_) and must start with a letter. (R also permits periods in names, but the conventions below do not rely on them.) So, some valid names might be mydata, mydata1, mydata2, or my_data. Some examples of invalid names would be my data, my-data, 2times, or x*2.
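
If you are ever unsure whether a name is allowed, one way to check is sketched below. It relies on the base R function make.names(), which rewrites any string into a valid name; a string that comes back unchanged was already valid. The names candidates and is_valid are made up for this illustration.

```r
candidates = c("mydata", "mydata1", "my_data",
               "my data", "my-data", "2times", "x*2")

# make.names() repairs invalid names, so an unchanged result
# means the original string was already a valid name
is_valid = make.names(candidates) == candidates
is_valid
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
```

The first three candidates pass and the last four fail, matching the valid and invalid examples listed in the rule.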

When you are working with data that has a meaningful background (which is most data that isn’t simulated for an example), your names should be as descriptive as possible without being burdensome in length. In many cases, you will want to combine multiple words into a short descriptive phrase. As seen above, spaces aren’t allowed in names, nor are dashes, which are treated as negative or minus signs. However, there are three common conventions you might want to use:

Convention: lower case. Simply smash words together by removing the spaces. The name mydata is a good example of this. Lower case convention works well if you’re only combining two short words; however, it can be difficult to read longer names. For example, maledepressionmodel1 is just too many consecutive letters.

Convention: camel case. Lower case convention is modified to camel case by capitalizing the first letter of every word (this may or may not include the first letter of the first word). The result is a name that looks like a camel’s back, with capital letters forming “humps.” So, myData and MyData would be considered camel case, as would maleDepressionModel1 and MaleDepressionModel1. The capital letters help make the words (or more importantly, the breaks between them) stand out.

Convention: snake case. Snake case replaces the spaces between words with underscores. Examples of snake case would be my_data and male_depression_model1. Snake case makes the names a little longer by adding underscore(s), but it is often the most readable approach.

Convention: consistency. Regardless of whether you choose to go with lowercase, CamelCase, or snake_case, it is helpful to be consistent throughout. Consistency will help you remember the names you have given different objects. I generally like using snake case, though there are some times when I use lower case for very simple names like mydata. For anything more complicated than that, I prefer snake case. Snake case also means I know I will never have capital letters in my names, so it’s one less thing to have to remember.

Indexing

You already saw an example of indexing in Data Frames. When you have an object containing a sequence of numbers or other values, you might want to get a specific entry or a subset of the entries held in that object. The process of obtaining these parts of an object by pointing to their locations is called indexing.

We index an object by using square brackets after the object’s name. If the object is one-dimensional, we simply give the location or locations of the values we want to see. For example:

powers = c(2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)
powers[1]
## [1] 2
powers[2]
## [1] 4
powers[3]
## [1] 8
powers[10]
## [1] 1024
powers[c(2, 4, 5)]
## [1]  4 16 32

With two or more dimensions, we need to give an index for each dimension. In an introductory class, you are unlikely to see anything with more than two dimensions, so we will only discuss that case here. Examples of two-dimensional objects are data frames and matrices. For two-dimensional objects, the square brackets contain the row first, then the column. Using the data frame gpa from Data Frames, we see:

gpa
##      ID             Name         Major  GPA GPAcorrect
## 1 20304      Adam Baines       English 3.30       3.40
## 2 20305 Claire Dougherty    Statistics 3.53       3.63
## 3 20306      Eric Forest    Psychology 3.87       3.97
## 4 30101      George Hull Public Health 3.79       3.89
gpa[1, 2]
## [1] "Adam Baines"
gpa[3, 4]
## [1] 3.87

If we leave out one of the indices, we get the entire row or column we asked for:

gpa[1, ]
##      ID        Name   Major GPA GPAcorrect
## 1 20304 Adam Baines English 3.3        3.4
gpa[, 1]
## [1] 20304 20305 20306 30101

Remember to include the comma so it’s clear whether you are giving the row (before the comma) or the column (after the comma).
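
The same row-then-column pattern works for matrices, the other two-dimensional object mentioned above. The small matrix m below is made up for this illustration; note that R fills a matrix column by column by default.

```r
# A 2 x 3 matrix, filled column by column
m = matrix(1:6, nrow=2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

m[1, 2]  # row 1, column 2
## [1] 3
m[1, ]   # all of row 1
## [1] 1 3 5
m[, 3]   # all of column 3
## [1] 5 6
```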

If we want a sequence of consecutive values, we can use the colon, :, to build one. The code a:b gives a, a+1, a+2, and so on, stopping at the last value that does not exceed b. For example:

1:5
## [1] 1 2 3 4 5
(-2):3
## [1] -2 -1  0  1  2  3
0.5:9
## [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
powers[1:5]
## [1]  2  4  8 16 32
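
One thing to watch for, sketched below, is that the colon is evaluated before addition and subtraction. When an endpoint is itself a calculation, wrapping it in parentheses keeps the sequence you intended. The variable n is just a placeholder for this illustration.

```r
n = 5

1:(n - 1)  # sequence from 1 up to n - 1
## [1] 1 2 3 4

1:n - 1    # builds 1:n first, then subtracts 1 from every value
## [1] 0 1 2 3 4

5:1        # counts down when the first value is larger
## [1] 5 4 3 2 1
```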

Brackets

R uses three types of brackets for different reasons.

In the previous section, we saw that square brackets, [], are used to index an object.

Rounded brackets, or parentheses, (), are used for order of operations in mathematical calculations (see footnote 2) as well as for providing the inputs to functions. We will talk about functions more in the next chapter, but we have already seen them in practice, for example:

mean(gpa$GPA)
## [1] 3.6225

Here, mean is the name of the function and we use gpa$GPA as an input to that function. This is similar to the mathematical notation \(f(x)\), where \(f\) is a function and \(x\), placed inside the parentheses, is the input.

The last type of bracket you might see in R is the curly bracket, {}. Curly brackets are used to set off some piece of code from the rest of the script. For example, if you are defining your own function, you write the code for the function inside curly brackets. If you are doing some kind of conditional coding, where you run different code depending on the value of an object, each part of the code is wrapped in curly brackets. You’re unlikely to see code that needs curly brackets in an intro class, but in case you do, you can tell they are setting the code apart as special in some way.
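
In case you do run into curly brackets, here is a minimal sketch that uses them in both ways described above: once around the body of a function and once around each branch of a condition. The function gpa_band() and its 3.5 cutoff are made up for this illustration.

```r
# The outer curly brackets hold the body of the function;
# the inner pairs hold the code for each branch of the condition
gpa_band = function(x) {
  if (x >= 3.5) {
    "high"
  } else {
    "other"
  }
}

gpa_band(3.87)
## [1] "high"
gpa_band(3.30)
## [1] "other"
```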

Commenting

An R script file is a sequence of commands you are telling R to execute. Sometimes these commands can be complicated, either because there is a long sequence of them or because one line is a bit tricky. If you want to leave yourself or someone else a note, you can add comments to your code using #. This symbol is called a number sign, pound sign, or hash tag, depending on when you were born.

Comments are parts of your script file that are meant to be read by a human and not interpreted as code. When you place a # in a line, anything after that is considered a comment and skipped when R is trying to run your code. For example:

mean(gpa$GPA)  # calculates average GPA
sd(gpa$GPA)  # calculates standard deviation of GPA

If you want a comment to span more than one line, you will need to start a new line and include another # in that line. We usually like to line up the # symbols that belong to the same comment:

mean(gpa$GPA, na.rm=TRUE)  # calculates the average
                           # GPA, but first removes
                           # any missing values and
                           # uses the number of remaining
                           # values in the denominator

That’s actually a pretty bad comment. It’s too wordy and really just explaining what the one function is doing. If we really wanted to know what na.rm=TRUE means, we could look in the help file for the mean function.

Comments should make your code easier to read. This is helpful for people like group members and instructors, but is also helpful for you when you return to your work. A common refrain in coding is, “Your most important collaborator is future you.” That means you need to leave yourself hints and notes so that when you go back to this code tomorrow or in six months or more, you can more easily pick it up and understand what past you was trying to do.
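
As a rough sketch of a comment that future you might actually thank you for, the one below says why an option is used instead of restating what the function does. The reason given in the comment is hypothetical, and gpa is rebuilt from the four GPA values shown earlier so the example runs on its own.

```r
# gpa is rebuilt from the four GPA values shown earlier (an assumption
# made only so this example is self-contained)
gpa = data.frame(GPA=c(3.30, 3.53, 3.87, 3.79))

# New transfer students have no GPA on file yet, so skip missing values
avg_gpa = mean(gpa$GPA, na.rm=TRUE)
avg_gpa
## [1] 3.6225
```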

Use of White Space

Another way to increase the human readability of your code is to use white space liberally. White space is any part of your script that doesn’t contain visible characters. This means everything from spaces to indentations to blank lines. While some of the suggestions below aren’t formal syntax, they are preferred practice for making code easier to read.

Spaces. As much as possible, you should use spaces around equal signs and mathematical operators, and after commas. For example:

myname = 'Travis'  # clearly separates the object name from the value it takes
myname='Travis'    # takes much longer to read and understand

2 * 3  # is preferred
2*3    # is too squished

mean(gpa$GPA, na.rm=TRUE)  # separates function inputs
mean(gpa$GPA,na.rm=TRUE)   # runs everything together

gpa[2, ]  # clearly shows a missing column index
gpa[2,]   # takes away the blank space that should help indicate an entry is missing

Indentation. If you are writing a line of code that is too long to easily see or read as a single line, you can start a new line and continue that code. This is most common in functions that take multiple inputs, especially if some inputs have long names. You will usually indent two spaces for every level you are coding, or if you are using multiple lines inside a function, you will indent to the start of the first function input. See here:

oddsratio = (pa / (1 - pa)) / 
  (pb / (1 - pb))

mean(gpa$GPA, 
     trim=0.05, 
     na.rm=TRUE)

RStudio will often default to indenting when appropriate. This is one of those cases where it’s best if you let RStudio do its thing. It may seem like a waste of space, but it’s not like you’re printing much of this out anyway.

Blank lines. You have already seen me use blank lines in the chunks of code above. You generally want to leave blank lines between lines of code. One exception is if (1) each line of code is pretty simple on its own AND (2) each line of code is leading directly into the next, as below:

mytab = table(mydata$x, mydata$y)
myprops = prop.table(mytab, margin=1)

Here, the ultimate goal was to get a table of proportions, but we needed the intermediate step of the table of counts. For someone who is moderately familiar with R code, the connection between the two lines is easy to see.
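
To see the two steps run, here is a sketch with a tiny made-up data set. The data frame mydata and its columns x and y are hypothetical values invented for this illustration.

```r
# Hypothetical data: a group label x and a yes/no response y
mydata = data.frame(
  x=c("a", "a", "b", "b", "b"),
  y=c("yes", "no", "yes", "yes", "no")
)

mytab = table(mydata$x, mydata$y)      # counts for each x and y combination
myprops = prop.table(mytab, margin=1)  # margin=1 divides each count by its row total

mytab
myprops
rowSums(myprops)  # every row of proportions adds up to 1
```

Because margin=1 works across rows, each row of myprops shows how the responses in y are split within one group of x.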

You can also do away with the blank lines if you are running a series of very simple but loosely related lines of code:

mean(gpa$GPA)
sd(gpa$GPA)
median(gpa$GPA)
quantile(gpa$GPA, probs=c(0.25, 0.75))

These lines don’t feed right into one another, but they are relatively simple calculations all done on the same data, essentially getting the basic descriptive statistics.


  1. You can also use a right arrow -> to assign in the other direction, like 3 -> w, but people who do that are basically anarchists.

  2. For example, 2 * (1 + 4) gives 2 * 5, or 10.