Using regular expressions with stringr

In lab 3, you used regular expressions in the context of text files and .csv files. However, regular expressions are also useful in the RStudio environment, especially when using the str_extract() family of functions in the package stringr.

The following examples demonstrate how to use stringr to manipulate character strings in the R environment. We’ll start by looking at some data where manipulating character strings might be useful:

# Look at some data where stringr might be useful
head(mamm.data)
##   Habitat     Site Station Tag    Genus        Species
## 1   Field Audubon1      2A 142    Zapus      hudsonius
## 2   Field Audubon1      4A 174    Zapus      hudsonius
## 3   Field Audubon1     30A 226 Microtus pennsylvanicus
## 4   Field Audubon1     30B 227 Microtus pennsylvanicus
## 5   Field Audubon1     25A 228    Zapus      hudsonius
## 6   Field Audubon1     24B 230    Zapus      hudsonius
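
If you do not have the lab data file loaded, the six rows printed above can be typed in by hand so that the examples below run. This is only a sketch of the first six rows, not the full dataset, and the column types (character for Station, integer for Tag) are assumptions based on the printed output.

```r
# Hand-typed copy of the six rows shown by head(mamm.data).
# Column types are assumed from the printed output.
mamm.data <- data.frame(
  Habitat = rep("Field", 6),
  Site    = rep("Audubon1", 6),
  Station = c("2A", "4A", "30A", "30B", "25A", "24B"),
  Tag     = c(142L, 174L, 226L, 227L, 228L, 230L),
  Genus   = c("Zapus", "Zapus", "Microtus", "Microtus", "Zapus", "Zapus"),
  Species = c("hudsonius", "hudsonius", "pennsylvanicus",
              "pennsylvanicus", "hudsonius", "hudsonius"),
  stringsAsFactors = FALSE
)

# Check the structure before manipulating any strings
str(mamm.data)
```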

There are two issues with this dataset that stringr can help us with:
* The variable “Station” contains the trap letter in addition to the station number. The letter isn’t necessary for most analyses, but deleting it from the original data frame would permanently erase that information, so we save the cleaned values in a new data frame instead.
* We need to create a variable containing species codes (the first two letters of the genus followed by the first two letters of the species epithet).

To use stringr and regex to drop the letters from the “Station” variable:

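# Load the packages that provide %>%, mutate(), and the str_*() functions
library(dplyr)
library(stringr)

# Pull numbers from variable Station
# (unlist() is safe here because every Station value has exactly one run of digits)
mamm2 <- mamm.data %>%
  mutate(Station = unlist(str_extract_all(Station,
                                          pattern = "\\d+")))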

print(head(mamm2))
##   Habitat     Site Station Tag    Genus        Species
## 1   Field Audubon1       2 142    Zapus      hudsonius
## 2   Field Audubon1       4 174    Zapus      hudsonius
## 3   Field Audubon1      30 226 Microtus pennsylvanicus
## 4   Field Audubon1      30 227 Microtus pennsylvanicus
## 5   Field Audubon1      25 228    Zapus      hudsonius
## 6   Field Audubon1      24 230    Zapus      hudsonius

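The function str_extract_all() extracts every portion of a character string that matches the given pattern. The regular expression \\d+ (the backslash is doubled because the pattern is written as an R string) matches each run of one or more consecutive digits, so the trailing trap letter is left behind.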
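
The difference between str_extract_all() and str_extract() matters once the data are less tidy than the six rows above. The sketch below uses a made-up vector of station labels (the values "12" and "A" are hypothetical, not from the lab data) to show that str_extract_all() returns a list, while str_extract() returns one value per input string and NA where nothing matches.

```r
library(stringr)

# Hypothetical station labels: "12" has no letter and "A" has no digits
stations <- c("2A", "30B", "12", "A")

# str_extract_all() returns a list with one element per input string
all_matches <- str_extract_all(stations, pattern = "\\d+")
n_matches <- lengths(all_matches)  # number of matches found in each string

# str_extract() returns a character vector the same length as the input,
# with NA where the pattern did not match
first_match <- str_extract(stations, pattern = "\\d+")

# Convert to integer if the station number is needed as a number
station_num <- as.integer(first_match)
```

Because str_extract() always returns one element per row, it can be used inside mutate() without unlist(), which is a reasonable choice whenever you only expect one match per string.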

To create species codes, we have to use the function str_extract() on two different variables (Genus and Species) and then paste the resulting strings together:

# Combine first two characters from Genus & Species to create species code
mamm3 <- mamm.data %>%
  mutate(Species_Code = paste(str_extract(Genus, pattern = "^.{2}"),
                              str_extract(Species, pattern = "^.{2}"),
                              sep = "")) %>%
  select(-c(Genus, Species))

print(head(mamm3))
##   Habitat     Site Station Tag Species_Code
## 1   Field Audubon1      2A 142         Zahu
## 2   Field Audubon1      4A 174         Zahu
## 3   Field Audubon1     30A 226         Mipe
## 4   Field Audubon1     30B 227         Mipe
## 5   Field Audubon1     25A 228         Zahu
## 6   Field Audubon1     24B 230         Zahu

The regular expression ^.{2} pulls the first two characters, no matter the type, from the character string. The function paste() combines our two strings, with the argument sep defining how the strings should be separated (or in this case, not separated).
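
To see each piece of that step in isolation, here is a small sketch on two genus and species names taken from the data above. The object names (genus, species, g2, s2, codes, codes_spaced) are only for illustration.

```r
library(stringr)

genus   <- c("Zapus", "Microtus")
species <- c("hudsonius", "pennsylvanicus")

# ^ anchors the match at the start of the string;
# .{2} matches any two characters
g2 <- str_extract(genus, pattern = "^.{2}")
s2 <- str_extract(species, pattern = "^.{2}")

# sep = "" joins the pieces with nothing in between
codes <- paste(g2, s2, sep = "")

# Leaving sep at its default inserts a space between the pieces
codes_spaced <- paste(g2, s2)
```

Here codes holds "Zahu" and "Mipe", matching the Species_Code column above, while codes_spaced holds "Za hu" and "Mi pe".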

Finally, stringr and regular expressions are useful when working with several different files of data. This can occur when running simulations where the output for each combination of parameter values is written to a separate file. You can use the str_extract() functions to pull the parameter values from the file names and put them in a data frame:

# Create list of files
files <- c("psi0.8det0.1.csv", "psi0.8det0.5.csv", "psi0.45det0.1.csv")

# Extract pattern of numbers and decimals
params <- str_extract_all(files, pattern = "(\\d+\\.\\d+)")

# Put them in a data frame
param.frame <- as.data.frame(do.call(rbind, params))

# Make column names parameter names
colnames(param.frame) <- c("psi", "det")

print(param.frame)
##    psi det
## 1  0.8 0.1
## 2  0.8 0.5
## 3 0.45 0.1

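The regular expression in str_extract_all() matches one or more digits, followed by a literal period, followed by one or more digits. Each file name contains two such matches, so do.call(rbind, params) stacks them into one row per file, and the rest of the code turns the result into a data frame with named columns. Note that the extracted values are still character strings, not numbers.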
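
If the parameter values are needed for plotting or modeling, they have to be converted to numbers. The following is a minimal sketch of one way to do that, assuming every file name contains exactly two decimal numbers in the order psi, det. It uses the simplify argument of str_extract_all(), which returns a character matrix instead of a list, and it keeps the file name alongside the parameters (a design choice for this example, not part of the original code).

```r
library(stringr)

files <- c("psi0.8det0.1.csv", "psi0.8det0.5.csv", "psi0.45det0.1.csv")

# simplify = TRUE returns a character matrix with one row per file
# (assumes each file name contains exactly two matches)
param.mat <- str_extract_all(files, pattern = "\\d+\\.\\d+", simplify = TRUE)

# Convert the character matches to numeric and keep the file name
param.frame <- data.frame(
  file = files,
  psi  = as.numeric(param.mat[, 1]),
  det  = as.numeric(param.mat[, 2]),
  stringsAsFactors = FALSE
)

str(param.frame)
```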

I find the str_extract() functions to be the most useful when manipulating strings in R. However, there are many other useful functions in the stringr package, and a helpful cheat sheet for stringr is available online.