In lab 3, you used regular expressions in the context of text files and .csv files. However, regular expressions are also useful in the RStudio environment, especially when using the str_extract() family of functions in the package stringr.
The following examples demonstrate how to use stringr to manipulate character strings in the R environment. We’ll start by looking at some data where manipulating character strings might be useful:
# Look at some data where stringr might be useful
head(mamm.data)
##   Habitat     Site Station Tag    Genus        Species
## 1   Field Audubon1      2A 142    Zapus      hudsonius
## 2   Field Audubon1      4A 174    Zapus      hudsonius
## 3   Field Audubon1     30A 226 Microtus pennsylvanicus
## 4   Field Audubon1     30B 227 Microtus pennsylvanicus
## 5   Field Audubon1     25A 228    Zapus      hudsonius
## 6   Field Audubon1     24B 230    Zapus      hudsonius
There are two issues with this dataset that stringr can help us with:
* The variable “Station” contains the trap letter in addition to the station number. The letter isn’t necessary for most analyses, but overwriting the original column would permanently erase it, so we save the result in a new data frame.
* We need to create a variable containing species codes (the first two letters of the genus plus the first two letters of the species epithet).
To use stringr and regex to drop the letters from the “Station” variable:
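# Load the packages used in this section
library(dplyr)
library(stringr)
# Pull numbers from variable Station
# (unlist() assumes each Station value contains exactly one run of digits)
mamm2 <- mamm.data %>%
  mutate(Station = unlist(str_extract_all(Station,
                                          pattern = "\\d+")))
print(head(mamm2))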
##   Habitat     Site Station Tag    Genus        Species
## 1   Field Audubon1       2 142    Zapus      hudsonius
## 2   Field Audubon1       4 174    Zapus      hudsonius
## 3   Field Audubon1      30 226 Microtus pennsylvanicus
## 4   Field Audubon1      30 227 Microtus pennsylvanicus
## 5   Field Audubon1      25 228    Zapus      hudsonius
## 6   Field Audubon1      24 230    Zapus      hudsonius
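The function str_extract_all() extracts every portion of a character string that matches the given pattern. The regular expression \d+ (written "\\d+" in an R string, because the backslash itself must be escaped) matches one or more consecutive digits, so it pulls the station number out of each string.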
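To see how the two extraction functions differ, here is a minimal sketch on a small made-up vector. The station labels below are hypothetical and are not taken from mamm.data; they are chosen so that one string has two runs of digits and one has none.

```r
library(stringr)

# Hypothetical station labels, made up for illustration
stations <- c("2A", "30B", "T12-7", "none")

# str_extract() returns the first match per string (NA if there is no match)
first_match <- str_extract(stations, pattern = "\\d+")
print(first_match)

# str_extract_all() returns a list holding every match for each string
all_matches <- str_extract_all(stations, pattern = "\\d+")
print(all_matches)

# The list elements can differ in length, so unlist() does not always
# give back exactly one value per input string
print(lengths(all_matches))
```

When every string contains exactly one run of digits, as in the Station column shown above, both approaches give the same values. If that might not hold, str_extract() is the safer choice inside mutate(), because it always returns one value per row.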
To create species codes, we have to use the function str_extract() on two different variables (Genus and Species) and then paste the resulting strings together:
# Combine first two characters from Genus & Species to create species code
mamm3 <- mamm.data %>%
  mutate(Species_Code = paste(str_extract(Genus, pattern = "^.{2}"),
                              str_extract(Species, pattern = "^.{2}"),
                              sep = "")) %>%
  select(-c(Genus, Species))
print(head(mamm3))
##   Habitat     Site Station Tag Species_Code
## 1   Field Audubon1      2A 142         Zahu
## 2   Field Audubon1      4A 174         Zahu
## 3   Field Audubon1     30A 226         Mipe
## 4   Field Audubon1     30B 227         Mipe
## 5   Field Audubon1     25A 228         Zahu
## 6   Field Audubon1     24B 230         Zahu
The regular expression ^.{2} matches the first two characters of the character string, whatever they are. The function paste() combines our two strings, with the argument sep defining how the strings should be separated (or in this case, not separated).
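The pieces of that pattern and the effect of sep can be checked on plain vectors outside of a data frame. This is a small sketch; the genus and species vectors are typed in by hand for illustration, and the variable names are hypothetical.

```r
library(stringr)

# Genus and species names typed in by hand for illustration
genus   <- c("Zapus", "Microtus", "Peromyscus")
species <- c("hudsonius", "pennsylvanicus", "leucopus")

# "^" anchors the match at the start of the string, "." matches any
# character (other than a line break), and "{2}" repeats it exactly twice
genus_part   <- str_extract(genus, pattern = "^.{2}")
species_part <- str_extract(species, pattern = "^.{2}")

# sep = "" joins the two pieces with nothing in between
codes <- paste(genus_part, species_part, sep = "")
print(codes)

# Without sep, paste() separates the pieces with a single space
spaced <- paste(genus_part, species_part)
print(spaced)
```

Base R's paste0() is shorthand for paste(..., sep = "") and would work equally well here.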
Finally, stringr and regular expressions are useful when working with several different files of data. This can occur when running simulations in which each combination of parameter values produces a separate output file. You can use the str_extract() functions to pull parameter information from the file names and put it in a data frame:
# Create list of files
files <- c("psi0.8det0.1.csv", "psi0.8det0.5.csv", "psi0.45det0.1.csv")
# Extract pattern of numbers and decimals
params <- str_extract_all(files, pattern = "(\\d+\\.\\d+)")
# Put them in a data frame
param.frame <- as.data.frame(do.call(rbind, params))
# Make column names parameter names
colnames(param.frame) <- c("psi", "det")
print(param.frame)
##    psi det
## 1  0.8 0.1
## 2  0.8 0.5
## 3 0.45 0.1
The regular expression in str_extract_all() pulls a series of numbers, followed by a period, and then another series of numbers. The rest of the code organizes these values in a data frame.
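One thing to keep in mind is that values extracted this way are stored as text rather than numbers. The sketch below shows one possible variation, not the only way to do it: it uses stringr's str_match() with capture groups so that each parameter is tied to its name in the file name, then converts the results with as.numeric(). The object names m and param.num are hypothetical.

```r
library(stringr)

files <- c("psi0.8det0.1.csv", "psi0.8det0.5.csv", "psi0.45det0.1.csv")

# Each pair of parentheses is a capture group. str_match() returns a
# character matrix: column 1 is the full match, then one column per group
m <- str_match(files, pattern = "psi(\\d+\\.\\d+)det(\\d+\\.\\d+)")

# Convert the captured text to numbers so the columns can be used in
# calculations
param.num <- data.frame(psi = as.numeric(m[, 2]),
                        det = as.numeric(m[, 3]))
print(param.num)
str(param.num)
```

Because the pattern spells out psi and det, a file name that does not follow this naming scheme produces NA values rather than silently shifting numbers into the wrong column.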
I find the str_extract() functions to be the most useful when manipulating strings in R. However, there are many other useful functions in the stringr package, and the stringr cheat sheet published by RStudio is a helpful summary of them.