What is dplyr?

  • a new(ish) package that provides a set of tools for manipulating datasets in R
  • part of the tidyverse: an opinionated collection of R packages which share philosophy, grammar, and data structure
  • specifically written to be fast!
  • has individual functions that correspond to the most common operations
  • makes it easier for you to figure out what to do with your data

The core ‘verbs’

  1. filter()
  2. arrange()
  3. select()
  4. summarize() and group_by()
  5. mutate()

Start with a built-in dataset

library(tidyverse)
data(starwars)
class(starwars)
## [1] "tbl_df"     "tbl"        "data.frame"

What is a tibble?

  • “a modern take on data frames”
  • said to keep the great aspects of data frames and drops the frustrating ones (i.e. changing variable names, changing an input type)
glimpse(starwars) #more effective than str() in this case   
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
head(starwars) #looks a little different since it's a tibble (mentions dimensions and also states variable type)
## # A tibble: 6 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

Cleaning up our data

# complete.cases is not part of dplyr, but it is very useful!
#dataClean <- starwars[complete.cases(starwars),] # removes all rows with NA values but doesn't work with lists

# to get complete cases for just certain columns 
starwarsClean <-starwars[complete.cases(starwars[,1:10]),]


# Check for NAs
is.na(starwarsClean[1,1]) #useful for only a few observations since it returns a list of True/False
##       name
## [1,] FALSE
anyNA(starwarsClean)
## [1] FALSE
anyNA(starwars[,1:10]) # compared to our original dataset
## [1] TRUE
# What does our data look like now?
glimpse(starwarsClean) # the list columns are still there but no NAs in columns 1-10
## Rows: 29
## Columns: 14
## $ name       <chr> "Luke Skywalker", "Darth Vader", "Leia Organa", "Owen Lars"…
## $ height     <int> 172, 202, 150, 178, 165, 183, 182, 188, 228, 180, 170, 170,…
## $ mass       <dbl> 77.0, 136.0, 49.0, 120.0, 75.0, 84.0, 77.0, 84.0, 112.0, 80…
## $ hair_color <chr> "blond", "none", "brown", "brown, grey", "brown", "black", …
## $ skin_color <chr> "fair", "white", "light", "light", "light", "light", "fair"…
## $ eye_color  <chr> "blue", "yellow", "brown", "blue", "blue", "brown", "blue-g…
## $ birth_year <dbl> 19.0, 41.9, 19.0, 52.0, 47.0, 24.0, 57.0, 41.9, 200.0, 29.0…
## $ sex        <chr> "male", "male", "female", "male", "female", "male", "male",…
## $ gender     <chr> "masculine", "masculine", "feminine", "masculine", "feminin…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Alderaan", "Tatooine", "Tatooine",…
## $ species    <chr> "Human", "Human", "Human", "Human", "Human", "Human", "Huma…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, "Imperial Sp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, "TIE Advanced x1", <>, <>,…
head(starwarsClean)
## # A tibble: 6 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 3 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 4 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## 5 Beru Whites…    165    75 brown   light   blue       47   fema… femin… Tatooi…
## 6 Biggs Darkl…    183    84 black   light   brown      24   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

filter(): Pick/subset observations by their values

### uses >, >=, <, <=,!=, == (not just one!) for comparisons   
### Logical operators: & | !

# filter automatically excludes NAs, have to ask for them specifically 

filter(starwarsClean, gender == "masculine" & height < 180) # you can also use commas in place of &   
## # A tibble: 8 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
## 2 Owen Lars       178   120 brown,… light   blue         52 male  mascu… Tatooi…
## 3 Wedge Antil…    170    77 brown   fair    hazel        21 male  mascu… Corell…
## 4 Palpatine       170    75 grey    pale    yellow       82 male  mascu… Naboo  
## 5 Lando Calri…    177    79 black   dark    brown        31 male  mascu… Socorro
## 6 Lobot           175    79 none    light   blue         37 male  mascu… Bespin 
## 7 Wicket Syst…     88    20 brown   brown   brown         8 male  mascu… Endor  
## 8 Darth Maul      175    80 none    red     yellow       54 male  mascu… Dathom…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld
filter(starwars, gender == "masculine", height <180, height > 100) # can add multiple conditions for the same variable   
## # A tibble: 17 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue         19 male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow      112 none  mascu… Tatooi…
##  3 Owen Lars      178   120 brown,… light   blue         52 male  mascu… Tatooi…
##  4 Greedo         173    74 <NA>    green   black        44 male  mascu… Rodia  
##  5 Jabba Desi…    175  1358 <NA>    green-… orange      600 herm… mascu… Nal Hu…
##  6 Wedge Anti…    170    77 brown   fair    hazel        21 male  mascu… Corell…
##  7 Palpatine      170    75 grey    pale    yellow       82 male  mascu… Naboo  
##  8 Lando Calr…    177    79 black   dark    brown        31 male  mascu… Socorro
##  9 Lobot          175    79 none    light   blue         37 male  mascu… Bespin 
## 10 Nien Nunb      160    68 none    grey    black        NA male  mascu… Sullust
## 11 Finis Valo…    170    NA blond   fair    blue         91 male  mascu… Corusc…
## 12 Watto          137    NA black   blue, … yellow       NA male  mascu… Toydar…
## 13 Sebulba        112    40 none    grey, … orange       NA male  mascu… Malast…
## 14 Darth Maul     175    80 none    red     yellow       54 male  mascu… Dathom…
## 15 Gasgano        122    NA none    white,… black        NA male  mascu… Troiken
## 16 Ben Quadin…    163    65 none    grey, … orange       NA male  mascu… Tund   
## 17 Eeth Koth      171    NA black   brown   brown        NA male  mascu… Iridon…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld
filter(starwars, eye_color %in% c("blue", "brown")) # we can use %in% for a few conditions, similar to ==
## # A tibble: 40 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  3 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  4 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  5 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  6 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  7 Wilhuff Ta…    180    NA auburn… fair    blue       64   male  mascu… Eriadu 
##  8 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
##  9 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## 10 Jek Tono P…    180   110 brown   fair    blue       NA   male  mascu… Bestin…
## # … with 30 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

arrange(): Reorder rows

arrange(starwarsClean, by = height) #default is ascending order   
## # A tibble: 29 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Wicket Sys…     88  20   brown   brown   brown         8 male  mascu… Endor  
##  2 Leia Organa    150  49   brown   light   brown        19 fema… femin… Aldera…
##  3 Beru White…    165  75   brown   light   blue         47 fema… femin… Tatooi…
##  4 Padmé Amid…    165  45   brown   light   brown        46 fema… femin… Naboo  
##  5 Barriss Of…    166  50   black   yellow  blue         40 fema… femin… Mirial 
##  6 Wedge Anti…    170  77   brown   fair    hazel        21 male  mascu… Corell…
##  7 Palpatine      170  75   grey    pale    yellow       82 male  mascu… Naboo  
##  8 Luminara U…    170  56.2 black   yellow  blue         58 fema… femin… Mirial 
##  9 Luke Skywa…    172  77   blond   fair    blue         19 male  mascu… Tatooi…
## 10 Lobot          175  79   none    light   blue         37 male  mascu… Bespin 
## # … with 19 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
arrange(starwarsClean, by = desc(height)) #we can use desc() to change that
## # A tibble: 29 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Ki-Adi-Mun…    198    82 white   pale    yellow     92   male  mascu… Cerea  
##  4 Jar Jar Bi…    196    66 none    orange  orange     52   male  mascu… Naboo  
##  5 Dooku          193    80 white   fair    brown     102   male  mascu… Serenno
##  6 Bossk          190   113 none    green   red        53   male  mascu… Trando…
##  7 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  8 Mace Windu     188    84 none    dark    brown      72   male  mascu… Haruun…
##  9 Plo Koon       188    80 none    orange  black      22   male  mascu… Dorin  
## 10 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
## # … with 19 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
arrange(starwarsClean, height, desc(mass)) # each additional column used to break ties with the preceding column
## # A tibble: 29 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Wicket Sys…     88  20   brown   brown   brown         8 male  mascu… Endor  
##  2 Leia Organa    150  49   brown   light   brown        19 fema… femin… Aldera…
##  3 Beru White…    165  75   brown   light   blue         47 fema… femin… Tatooi…
##  4 Padmé Amid…    165  45   brown   light   brown        46 fema… femin… Naboo  
##  5 Barriss Of…    166  50   black   yellow  blue         40 fema… femin… Mirial 
##  6 Wedge Anti…    170  77   brown   fair    hazel        21 male  mascu… Corell…
##  7 Palpatine      170  75   grey    pale    yellow       82 male  mascu… Naboo  
##  8 Luminara U…    170  56.2 black   yellow  blue         58 fema… femin… Mirial 
##  9 Luke Skywa…    172  77   blond   fair    blue         19 male  mascu… Tatooi…
## 10 Darth Maul     175  80   none    red     yellow       54 male  mascu… Dathom…
## # … with 19 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
starwars1<- arrange(starwars, height) # missing values are at the end; note we haven't been assigning anything to a variable, just printing (until now) 
tail(starwars1)
## # A tibble: 6 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Arvel Crynyd     NA    NA brown   fair    brown        NA male  mascu… <NA>   
## 2 Finn             NA    NA black   dark    dark         NA male  mascu… <NA>   
## 3 Rey              NA    NA brown   light   hazel        NA fema… femin… <NA>   
## 4 Poe Dameron      NA    NA brown   light   brown        NA male  mascu… <NA>   
## 5 BB8              NA    NA none    none    black        NA none  mascu… <NA>   
## 6 Captain Pha…     NA    NA unknown unknown unknown      NA <NA>  <NA>   <NA>   
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

select(): Choose variables by their names

## All of these do the same thing (subset)
starwarsClean[1:10,] # using base R
## # A tibble: 10 × 14
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld
select(starwarsClean, 1:10) # you can use numbers to subset
## # A tibble: 29 × 10
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 19 more rows, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld
select(starwarsClean, name:species) # you can use variables names too
## # A tibble: 29 × 11
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 19 more rows, 1 more variable: species <chr>, and abbreviated variable
## #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
select(starwarsClean, -(films:starships)) # you can subset everything except particular variables
## # A tibble: 29 × 11
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 19 more rows, 1 more variable: species <chr>, and abbreviated variable
## #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
## Rearrange columns
select(starwarsClean, name, gender, species, everything()) # using the everything() helper function is useful if you have a few variables to move to the beginning
## # A tibble: 29 × 14
##    name        gender species height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex  
##    <chr>       <chr>  <chr>    <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
##  1 Luke Skywa… mascu… Human      172    77 blond   fair    blue       19   male 
##  2 Darth Vader mascu… Human      202   136 none    white   yellow     41.9 male 
##  3 Leia Organa femin… Human      150    49 brown   light   brown      19   fema…
##  4 Owen Lars   mascu… Human      178   120 brown,… light   blue       52   male 
##  5 Beru White… femin… Human      165    75 brown   light   blue       47   fema…
##  6 Biggs Dark… mascu… Human      183    84 black   light   brown      24   male 
##  7 Obi-Wan Ke… mascu… Human      182    77 auburn… fair    blue-g…    57   male 
##  8 Anakin Sky… mascu… Human      188    84 blond   fair    blue       41.9 male 
##  9 Chewbacca   mascu… Wookiee    228   112 brown   unknown blue      200   male 
## 10 Han Solo    mascu… Human      180    80 brown   fair    brown      29   male 
## # … with 19 more rows, 4 more variables: homeworld <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year
select(starwarsClean, contains("color")) ## other helpers include: ends_with, starts_with, matches (reg ex), num_range
## # A tibble: 29 × 3
##    hair_color    skin_color eye_color
##    <chr>         <chr>      <chr>    
##  1 blond         fair       blue     
##  2 none          white      yellow   
##  3 brown         light      brown    
##  4 brown, grey   light      blue     
##  5 brown         light      blue     
##  6 black         light      brown    
##  7 auburn, white fair       blue-gray
##  8 blond         fair       blue     
##  9 brown         unknown    blue     
## 10 brown         fair       brown    
## # … with 19 more rows
## Renaming columns
select(starwars, haircolor = hair_color) # select can rename but only keeps specified variable; actual name of the column comes after stating new name
## # A tibble: 87 × 1
##    haircolor    
##    <chr>        
##  1 blond        
##  2 <NA>         
##  3 <NA>         
##  4 none         
##  5 brown        
##  6 brown, grey  
##  7 brown        
##  8 <NA>         
##  9 black        
## 10 auburn, white
## # … with 77 more rows
rename(starwarsClean,  haircolor = hair_color) # keeps all the variables
## # A tibble: 29 × 14
##    name        height  mass hairc…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 19 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​haircolor, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

mutate(): Create new variables with functions of existing variables

# Let's create a new column that is a height divided by mass
mutate(starwarsClean, ratio = height/mass) # note we use arithmetic operators 
## # A tibble: 29 × 15
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 19 more rows, 5 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, ratio <dbl>, and abbreviated variable
## #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
starwars_lbs <- mutate(starwarsClean, mass_lbs=mass*2.2) # now let's convert kg to lbs
head(starwars_lbs) # mass_lbs is now the last variable of the tibble
## # A tibble: 6 × 15
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 3 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 4 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## 5 Beru Whites…    165    75 brown   light   blue       47   fema… femin… Tatooi…
## 6 Biggs Darkl…    183    84 black   light   brown      24   male  mascu… Tatooi…
## # … with 5 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, mass_lbs <dbl>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
select(starwars_lbs, 1:3, mass_lbs, everything()) # we can bring it to the beginning using select
## # A tibble: 29 × 15
##    name        height  mass mass_…¹ hair_…² skin_…³ eye_c…⁴ birth…⁵ sex   gender
##    <chr>        <int> <dbl>   <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr> 
##  1 Luke Skywa…    172    77    169. blond   fair    blue       19   male  mascu…
##  2 Darth Vader    202   136    299. none    white   yellow     41.9 male  mascu…
##  3 Leia Organa    150    49    108. brown   light   brown      19   fema… femin…
##  4 Owen Lars      178   120    264  brown,… light   blue       52   male  mascu…
##  5 Beru White…    165    75    165  brown   light   blue       47   fema… femin…
##  6 Biggs Dark…    183    84    185. black   light   brown      24   male  mascu…
##  7 Obi-Wan Ke…    182    77    169. auburn… fair    blue-g…    57   male  mascu…
##  8 Anakin Sky…    188    84    185. blond   fair    blue       41.9 male  mascu…
##  9 Chewbacca      228   112    246. brown   unknown blue      200   male  mascu…
## 10 Han Solo       180    80    176  brown   fair    brown      29   male  mascu…
## # … with 19 more rows, 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, and abbreviated variable
## #   names ¹​mass_lbs, ²​hair_color, ³​skin_color, ⁴​eye_color, ⁵​birth_year
# If we only wanted the new variable
transmute(starwarsClean, mass_lbs=mass*2.2) # just keep the new variable
## # A tibble: 29 × 1
##    mass_lbs
##       <dbl>
##  1     169.
##  2     299.
##  3     108.
##  4     264 
##  5     165 
##  6     185.
##  7     169.
##  8     185.
##  9     246.
## 10     176 
## # … with 19 more rows
transmute(starwarsClean, mass, mass_lbs=mass*2.2) # you can mention variables you want to keep in the new dataset
## # A tibble: 29 × 2
##     mass mass_lbs
##    <dbl>    <dbl>
##  1    77     169.
##  2   136     299.
##  3    49     108.
##  4   120     264 
##  5    75     165 
##  6    84     185.
##  7    77     169.
##  8    84     185.
##  9   112     246.
## 10    80     176 
## # … with 19 more rows

summarize() and group_by(): Collapse many values down to a single summary

summarize(starwarsClean, meanHeight = mean(height)) # gives summary statistics for entire tibble
## # A tibble: 1 × 1
##   meanHeight
##        <dbl>
## 1       178.
# working with NAs
summarize(starwars, meanHeight = mean(height)) # does not calculate mean if NAs are present
## # A tibble: 1 × 1
##   meanHeight
##        <dbl>
## 1         NA
summarize(starwars, meanHeight = mean(height, na.rm=TRUE), TotalNumber = n()) # n() is a useful function in determining sample size
## # A tibble: 1 × 2
##   meanHeight TotalNumber
##        <dbl>       <int>
## 1       174.          87
summarize(starwarsClean, meanHeight = mean(height), number = n() ) #na.rm not necessary since we are working with a clean dataset
## # A tibble: 1 × 2
##   meanHeight number
##        <dbl>  <int>
## 1       178.     29
# use group_by for maximum usefulness
starwarsGenders<- group_by(starwars, gender)
head(starwarsGenders) # now we see the groups mentioned at the top of the tibble
## # A tibble: 6 × 14
## # Groups:   gender [2]
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
## #   ³​eye_color, ⁴​birth_year, ⁵​homeworld
summarize(starwarsGenders, meanHeight = mean(height, na.rm = TRUE), number = n() ) # now we can compare height and sample size between groups
## # A tibble: 3 × 3
##   gender    meanHeight number
##   <chr>          <dbl>  <int>
## 1 feminine        165.     17
## 2 masculine       177.     66
## 3 <NA>            181.      4

What is piping?

  • used to emphasize a sequence of actions
  • it lets you pass an intermediate result onto the next function (it takes the output of one statement and makes it the input of the next statement)
  • avoid when you need to manipulate more than one object at a time or there are meaningful intermediate objects
  • formatting: should always have a space before it and usually followed by a new line (usually automatic indent)
starwarsClean %>%
    group_by(gender) %>%
    summarize(meanHeight = mean(height, na.rm=T), number=n()) # so much cleaner! notice we lost a bunch of groups with the clean dataset
## # A tibble: 2 × 3
##   gender    meanHeight number
##   <chr>          <dbl>  <int>
## 1 feminine        166.      6
## 2 masculine       181.     23
## case_when() is useful for multiple ifelse statements


starwarsClean %>%
  mutate(sp = case_when(species == "Human" ~ "Human", TRUE ~ "Non-Human")) %>%
  select(name, sp, everything())
## # A tibble: 29 × 15
##    name  sp    height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr> <chr>  <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke… Human    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Dart… Human    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia… Human    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen… Human    178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru… Human    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Bigg… Human    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-… Human    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anak… Human    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Chew… Non-…    228   112 brown   unknown blue      200   male  mascu… Kashyy…
## 10 Han … Human    180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 19 more rows, 4 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld
starwarsClean %>%
  group_by(films) %>%
  mutate(
    sp = case_when(species == "Human" ~ "Human", TRUE ~ "Non-Human"),
    status = case_when(str_detect(films, "A New Hope") ~ "OG", TRUE ~ "Later")) %>%
  select(name,sp, status, everything()) %>%
  arrange(status) %>%
  {starwarsOGstatus <<- .}
## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing

## Warning in stri_detect_regex(string, pattern, negate = negate, opts_regex =
## opts(pattern)): argument is not an atomic vector; coercing
tail(starwarsOGstatus)
## # A tibble: 6 × 16
## # Groups:   films [6]
##   name    sp    status height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender
##   <chr>   <chr> <chr>   <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr> 
## 1 Beru W… Human OG        165    75 brown   light   blue         47 fema… femin…
## 2 Biggs … Human OG        183    84 black   light   brown        24 male  mascu…
## 3 Obi-Wa… Human OG        182    77 auburn… fair    blue-g…      57 male  mascu…
## 4 Chewba… Non-… OG        228   112 brown   unknown blue        200 male  mascu…
## 5 Han So… Human OG        180    80 brown   fair    brown        29 male  mascu…
## 6 Wedge … Human OG        170    77 brown   fair    hazel        21 male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year
## long to wide format and vice-versa
glimpse(starwarsClean)
## Rows: 29
## Columns: 14
## $ name       <chr> "Luke Skywalker", "Darth Vader", "Leia Organa", "Owen Lars"…
## $ height     <int> 172, 202, 150, 178, 165, 183, 182, 188, 228, 180, 170, 170,…
## $ mass       <dbl> 77.0, 136.0, 49.0, 120.0, 75.0, 84.0, 77.0, 84.0, 112.0, 80…
## $ hair_color <chr> "blond", "none", "brown", "brown, grey", "brown", "black", …
## $ skin_color <chr> "fair", "white", "light", "light", "light", "light", "fair"…
## $ eye_color  <chr> "blue", "yellow", "brown", "blue", "blue", "brown", "blue-g…
## $ birth_year <dbl> 19.0, 41.9, 19.0, 52.0, 47.0, 24.0, 57.0, 41.9, 200.0, 29.0…
## $ sex        <chr> "male", "male", "female", "male", "female", "male", "male",…
## $ gender     <chr> "masculine", "masculine", "feminine", "masculine", "feminin…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Alderaan", "Tatooine", "Tatooine",…
## $ species    <chr> "Human", "Human", "Human", "Human", "Human", "Human", "Huma…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, "Imperial Sp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, "TIE Advanced x1", <>, <>,…
# make data set wider
wideSW <- starwarsClean %>% 
  select(name,sex, height) %>%
  pivot_wider(names_from = sex, values_from = height, values_fill = NA)
wideSW
## # A tibble: 29 × 3
##    name                male female
##    <chr>              <int>  <int>
##  1 Luke Skywalker       172     NA
##  2 Darth Vader          202     NA
##  3 Leia Organa           NA    150
##  4 Owen Lars            178     NA
##  5 Beru Whitesun lars    NA    165
##  6 Biggs Darklighter    183     NA
##  7 Obi-Wan Kenobi       182     NA
##  8 Anakin Skywalker     188     NA
##  9 Chewbacca            228     NA
## 10 Han Solo             180     NA
## # … with 19 more rows
starwars %>%
  select(name, homeworld) %>%
  group_by(homeworld) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = homeworld, values_from = name) %>%
  select(-rn)
## # A tibble: 11 × 49
##    Tatooine   Naboo Alder…¹ Stewjon Eriadu Kashy…² Corel…³ Rodia Nal H…⁴ Besti…⁵
##    <chr>      <chr> <chr>   <chr>   <chr>  <chr>   <chr>   <chr> <chr>   <chr>  
##  1 Luke Skyw… R2-D2 Leia O… Obi-Wa… Wilhu… Chewba… Han So… Gree… Jabba … Jek To…
##  2 C-3PO      Palp… Bail P… <NA>    <NA>   Tarfful Wedge … <NA>  <NA>    <NA>   
##  3 Darth Vad… Jar … Raymus… <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
##  4 Owen Lars  Roos… <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
##  5 Beru Whit… Rugo… <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
##  6 R5-D4      Ric … <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
##  7 Biggs Dar… Quar… <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
##  8 Anakin Sk… Greg… <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
##  9 Shmi Skyw… Cordé <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
## 10 Cliegg La… Dormé <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
## 11 <NA>       Padm… <NA>    <NA>    <NA>   <NA>    <NA>    <NA>  <NA>    <NA>   
## # … with 39 more variables: `NA` <chr>, Kamino <chr>, Trandosha <chr>,
## #   Socorro <chr>, Bespin <chr>, `Mon Cala` <chr>, Chandrila <chr>,
## #   Endor <chr>, Sullust <chr>, `Cato Neimoidia` <chr>, Coruscant <chr>,
## #   Toydaria <chr>, Malastare <chr>, Dathomir <chr>, Ryloth <chr>,
## #   Vulpter <chr>, Troiken <chr>, Tund <chr>, `Haruun Kal` <chr>, Cerea <chr>,
## #   `Glee Anselm` <chr>, Iridonia <chr>, Iktotch <chr>, Quermia <chr>,
## #   Dorin <chr>, Champala <chr>, Geonosis <chr>, Mirial <chr>, Serenno <chr>, …
## make data set longer
glimpse(wideSW)
## Rows: 29
## Columns: 3
## $ name   <chr> "Luke Skywalker", "Darth Vader", "Leia Organa", "Owen Lars", "B…
## $ male   <int> 172, 202, NA, 178, NA, 183, 182, 188, 228, 180, 170, 170, 183, …
## $ female <int> NA, NA, 150, NA, 165, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
wideSW %>%
  pivot_longer(cols= male:female,
               names_to = "sex",
               values_to= "height",
               values_drop_na = T)
## # A tibble: 29 × 3
##    name               sex    height
##    <chr>              <chr>   <int>
##  1 Luke Skywalker     male      172
##  2 Darth Vader        male      202
##  3 Leia Organa        female    150
##  4 Owen Lars          male      178
##  5 Beru Whitesun lars female    165
##  6 Biggs Darklighter  male      183
##  7 Obi-Wan Kenobi     male      182
##  8 Anakin Skywalker   male      188
##  9 Chewbacca          male      228
## 10 Han Solo           male      180
## # … with 19 more rows