Session 2: Data Management in R

Rutgers University Sociology R Bootcamp

Fred Traylor, Lab TA (he/him)

September 1, 2022

Good Afternoon!

This Afternoon’s Goals

  1. Working Directories & Projects

  2. Environments and Packages

  3. Packages

    1. Tidyverse

    2. usdata

  4. Data Management in R

    1. Filtering Rows

    2. Selecting Columns

Directories and Projects

Working Directories

Tomorrow, we’ll be working with data from outside of R. That is, we’ll be importing our own data files.

Before we can do that, though, we first need to understand where our data and files are currently being saved.

This is called a “Working Directory.” To find it, we can run getwd(), short for “get working directory.”

getwd()
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
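As a small sketch of why the working directory matters: when you give R a file name without a full path, R looks for it relative to the working directory. The file name my_data.csv below is made up for illustration; it is not a file from this bootcamp.

```r
# Where is R currently looking for (and saving) files?
wd <- getwd()

# A bare file name is interpreted relative to the working directory.
# "my_data.csv" is a hypothetical file name used only for illustration.
full_path <- file.path(wd, "my_data.csv")
print(full_path)

# Does a file with that name exist in the working directory?
file.exists("my_data.csv")
```

If file.exists() returns FALSE for a file you know you saved, the usual cause is that the file lives somewhere other than the working directory.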

Projects

If you haven’t already, let’s go ahead and set up a project. We’ll save everything we do this semester into the project, so it’ll be a handy place to keep everything together. Projects do two useful functions:

Let’s create a new project

  1. In the RStudio menu, click File > New Project

  2. In the popup menu, click “New Directory”

  3. Click “New Project”

  4. Give it an appropriate name and choose where it should go in your files

    • I named mine “ay23_lab_ta”

    • Other good options include “soc541”, “sociology_stats_1”, etc.

      • We’ll be making another project for 542 (Stats II) in the spring.
    • All of my R files are saved in “Documents > R”, so I made it a “subdirectory” within that folder.

You should now be looking at a new project.

Your working directory might have changed, too, and we can look at it with getwd().

getwd()
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"

Official Question Time 1

Since the last OQT, we’ve done:

  1. Viewing the Working Directory

  2. Creating a New Project

Environments and Packages

Other R Environments

Up to now, we’ve used what is called “Base R.” If you go to the “Environment” pane of your R window, there is a dropdown menu to see what is in each environment. Go ahead and click on it and you’ll see a selection of other “packages.”

No need to do anything inside these packages, but now you know how R knows what to do. If we look at the documentation for a function, it gives us the function’s name followed by the name of the package holding it in curly brackets.

Generally, we don’t want to mess with the code of any functions or values that come in our other packages because fixing them involves reinstalling a lot of things.

Introduction to Packages

The beauty of R is that, because it is open-source, anybody can add new functions and make them available to anybody. Because these functions often rely on each other, they get packaged together to make for easy (and consistent) usage.

These “packages” are what we’re going to be working with today. They will come in handy during this bootcamp and throughout the entire year as you take statistics.

I generally group packages into two main categories:

  1. Quality of life packages make R easier to use or simplify code
  2. Extension packages add capabilities that would otherwise be complicated in Base R

As you can guess, there’s a lot of overlap here. We’ll mostly be working with quality of life packages this fall, but we’ll also add some extension packages as well.

Preparing to Use Packages

When you want to use a package, there are two things you have to do:

  1. You have to install it.

    • There are thousands of packages available on CRAN (where you downloaded R from) and even more available elsewhere online. It’d be too much for R to install every possible package on your computer when you first downloaded it, so you install only the ones you want.

    • Luckily, packages (like R) are free!

    • To install a package, you use the install.packages() function built into R.

  2. You have to call it from your library.

    • After it’s been downloaded, it’s saved on your computer. (YAY!)

    • But, R doesn’t know which ones you want to use yet, so you will need to call it into the working library via the library() function.

You can think of install.packages() as buying a tool and library() as actually getting it out of the toolbox.
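The two steps can be sketched as one small helper. The function name load_or_install() is hypothetical (it is not part of R or the tidyverse), and the sketch assumes a CRAN mirror is already configured, which RStudio normally does for you.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # one-time step: "buying the tool"
  }
  library(pkg, character.only = TRUE)   # every session: "getting it out of the toolbox"
}

# Try it on "tools", which ships with R, so nothing is downloaded.
load_or_install("tools")

# The same idea would apply to today's package:
# load_or_install("tidyverse")
```

The point of the sketch is the asymmetry: install.packages() is needed once per computer, while library() is needed in every new R session.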

Intro to the Tidyverse

This afternoon, and all throughout this year, we’re also going to use a popular package (well, set of packages) called the tidyverse.

Because it’ll take a minute or two to install, go ahead and type install.packages("tidyverse") into R and run it.

The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

Basically, it makes it easier to manage, use, and look at our data. Today, we’ll be working with the manage and use parts of this. In a few weeks, we’ll do a little bit with how the tidyverse makes data visualization better and easier than with Base R.

Hopefully, by now, the tidyverse has been installed, so let’s go ahead and call it into the library with library(tidyverse).

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

You’ll see these messages every time you load the tidyverse. They tell you which packages have been loaded, whether any package was built under a newer version of R than the one you have installed, and whether there are any “conflicting” functions (functions that share a name with a function from another package). Generally you don’t need to worry about these. (We’ll deal with them more in the Spring…)

The County Dataset

This afternoon, we’re going to work with a dataset of US counties. (In case you don’t know, counties are smaller than state governments but (generally) larger than cities.)

We’re going to use a package called usdata and a dataset saved in it called “county.” To download it, go ahead and type install.packages("usdata") into R and run it.

Once that’s installed, run library(usdata) to bring it into our environment.

library(usdata)
## Warning: package 'usdata' was built under R version 4.1.2

Let’s take a look at the dataframe to see what we’re working with.

View(county)

dim(county)
## [1] 3142   15
str(county)
## tibble [3,142 x 15] (S3: tbl_df/tbl/data.frame)
##  $ name             : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ state            : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pop2000          : num [1:3142] 43671 140415 29038 20826 51024 ...
##  $ pop2010          : num [1:3142] 54571 182265 27457 22915 57322 ...
##  $ pop2017          : int [1:3142] 55504 212628 25270 22668 58013 10309 19825 114728 33713 25857 ...
##  $ pop_change       : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
##  $ poverty          : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
##  $ homeownership    : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ multi_unit       : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
##  $ metro            : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 1 1 ...
##  $ median_edu       : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 2 2 2 2 2 3 2 2 ...
##  $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
##  $ median_hh_income : int [1:3142] 55317 52562 33368 43404 47412 29655 36326 43686 37342 40041 ...
##  $ smoking_ban      : Factor w/ 3 levels "none","partial",..: 1 1 2 1 1 1 NA NA 1 1 ...

We can see that we have a lot of observations (3,142) and 15 variables.

Fortunately, the names are fairly descriptive, but we should still look at the documentation (?county) to see what everything means.

It also tells us that the data is saved as a “tibble.”

In the Packages pane, scroll down to the one that says “tibble.” If it isn’t checked, go ahead and check it to bring it to the library.

The description for “tibble” is “Simple Data Frames.” Indeed, tibbles are the same as data frames, but with a few nice features that make it easier to work with large datasets:

  1. It gives us the data type for each column

  2. When you print to the console, it shortens the output

    • Only ten observations

    • Only as many columns as will fit

    • And gives a summary of what was cut off

  3. Smaller storage
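To see the printing difference for yourself, here is a minimal sketch using a made-up 30-row data frame (the columns id and score are invented for illustration):

```r
library(tibble)

# A made-up data frame with 30 rows, for illustration only
df <- data.frame(id = 1:30, score = seq(0.5, 15, by = 0.5))
tb <- as_tibble(df)

class(df)   # a plain data frame
class(tb)   # a tibble is still a data frame, with extra classes on top

print(tb)   # prints only the first ten rows, plus each column's type
```

Printing df instead would send all 30 rows to the console, which is the behavior tibbles are designed to avoid.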

Viewing Our New Packages

You can see here that we added a whole host of new packages, including dplyr, tidyr, and usdata.

We can also see them in the Packages pane in the bottom right of our RStudio window. It’s in the same place as Help, so you’ll have to click the “Packages” tab.

Scroll through some of them and look at the descriptions of the packages that are checked. For example, the dplyr package is described as “A Grammar of Data Manipulation.”

If you want to install a new package, you can click the “Install” button at the top of that pane or install it via install.packages() as we did before.

Occasionally, we need to update our packages. If that happens, you can click the “Update” button and select the packages you want to update.

Variable Management

The select() Function

The first function we’re going to work with from the tidyverse is select().

Let’s say we want to create a dataset, called “smallcounty” that has only some variables we care about from the county dataset.

With base R, we could create a new data frame built with each column individually. And it would totally work, but it would also be a lot.

smallcounty <- data.frame(county$state,
                          county$smoking_ban,
                          county$median_edu,
                          county$median_hh_income)

With the select function, we can simplify this a bit.

First, you should know that the tidyverse functions we’ll use today all take the dataset as their first argument.

Because of this, we don’t need to type it again after that.

So, we can make the same dataset as before (with cleaner column names, since select() keeps the original names) using this code:

smallcounty <- select(county, state, smoking_ban, median_edu, median_hh_income)

What we did here was take our original dataframe and “select” the variables we wanted.

As we’ll see, with all tidyverse functions, we start with the dataset. After that, we told it which columns we wanted.

Not having to repeat the dataset each time is one part of what makes these functions “tidy.”
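One practical difference between the two approaches is the column names they produce. The sketch below uses a tiny made-up data frame called toy (not the real county data) so the result is easy to check by eye.

```r
library(dplyr)

# A tiny made-up dataset, for illustration only
toy <- data.frame(name    = c("A County", "B County"),
                  state   = c("X", "Y"),
                  poverty = c(10.1, 12.3))

base_way <- data.frame(toy$name, toy$poverty)   # base R: repeat the dataset each time
tidy_way <- select(toy, name, poverty)          # tidyverse: name the dataset once

names(base_way)   # "toy.name" "toy.poverty"
names(tidy_way)   # "name" "poverty"
```

Both hold the same values, but select() keeps the original column names, while the base R version builds new names from the expressions you typed.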

Pairing select() with Other Functions

The tidyverse also includes some other functions that are nice to pair with select(). These include starts_with(), ends_with(), contains(), and the “minus sign”, -. Let’s try these now.

select(county, starts_with("s")) # Select variables/columns whose names start with the letter "s"
## # A tibble: 3,142 x 2
##    state   smoking_ban
##    <fct>   <fct>      
##  1 Alabama none       
##  2 Alabama none       
##  3 Alabama partial    
##  4 Alabama none       
##  5 Alabama none       
##  6 Alabama none       
##  7 Alabama <NA>       
##  8 Alabama <NA>       
##  9 Alabama none       
## 10 Alabama none       
## # ... with 3,132 more rows
select(county, ends_with("p"))   # Select variables whose names end with the letter "p"
## # A tibble: 3,142 x 1
##    homeownership
##            <dbl>
##  1          77.5
##  2          76.7
##  3          68  
##  4          82.9
##  5          82  
##  6          76.9
##  7          69  
##  8          70.7
##  9          71.4
## 10          77.5
## # ... with 3,132 more rows
select(county, contains("a"))    # Select variables whose names contain the letter "a" 
## # A tibble: 3,142 x 8
##    name            state pop_change unemployment_ra~ median_edu per_capita_inco~
##    <chr>           <fct>      <dbl>            <dbl> <fct>                 <dbl>
##  1 Autauga County  Alab~       1.48             3.86 some_coll~           27842.
##  2 Baldwin County  Alab~       9.19             3.99 some_coll~           27780.
##  3 Barbour County  Alab~      -6.22             5.9  hs_diploma           17892.
##  4 Bibb County     Alab~       0.73             4.39 hs_diploma           20572.
##  5 Blount County   Alab~       0.68             4.02 hs_diploma           21367.
##  6 Bullock County  Alab~      -2.28             4.93 hs_diploma           15444.
##  7 Butler County   Alab~      -2.69             5.49 hs_diploma           17015.
##  8 Calhoun County  Alab~      -1.51             4.93 some_coll~           23610.
##  9 Chambers County Alab~      -1.2              4.08 hs_diploma           21080.
## 10 Cherokee County Alab~      -0.6              4.05 hs_diploma           23068.
## # ... with 3,132 more rows, and 2 more variables: median_hh_income <int>,
## #   smoking_ban <fct>
select(county, -state, -homeownership)        # Select all variables EXCEPT the ones that are preceded by the minus sign
## # A tibble: 3,142 x 13
##    name   pop2000 pop2010 pop2017 pop_change poverty multi_unit unemployment_ra~
##    <chr>    <dbl>   <dbl>   <int>      <dbl>   <dbl>      <dbl>            <dbl>
##  1 Autau~   43671   54571   55504       1.48    13.7        7.2             3.86
##  2 Baldw~  140415  182265  212628       9.19    11.8       22.6             3.99
##  3 Barbo~   29038   27457   25270      -6.22    27.2       11.1             5.9 
##  4 Bibb ~   20826   22915   22668       0.73    15.2        6.6             4.39
##  5 Bloun~   51024   57322   58013       0.68    15.6        3.7             4.02
##  6 Bullo~   11714   10914   10309      -2.28    28.5        9.9             4.93
##  7 Butle~   21399   20947   19825      -2.69    24.4       13.7             5.49
##  8 Calho~  112249  118572  114728      -1.51    18.6       14.3             4.93
##  9 Chamb~   36583   34215   33713      -1.2     18.8        8.7             4.08
## 10 Chero~   23988   25989   25857      -0.6     16.1        4.3             4.05
## # ... with 3,132 more rows, and 5 more variables: metro <fct>,
## #   median_edu <fct>, per_capita_income <dbl>, median_hh_income <int>,
## #   smoking_ban <fct>

Of course, we can also combine these together:

select(county, -starts_with("p"))  
## # A tibble: 3,142 x 9
##    name         state homeownership multi_unit unemployment_ra~ metro median_edu
##    <chr>        <fct>         <dbl>      <dbl>            <dbl> <fct> <fct>     
##  1 Autauga Cou~ Alab~          77.5        7.2             3.86 yes   some_coll~
##  2 Baldwin Cou~ Alab~          76.7       22.6             3.99 yes   some_coll~
##  3 Barbour Cou~ Alab~          68         11.1             5.9  no    hs_diploma
##  4 Bibb County  Alab~          82.9        6.6             4.39 yes   hs_diploma
##  5 Blount Coun~ Alab~          82          3.7             4.02 yes   hs_diploma
##  6 Bullock Cou~ Alab~          76.9        9.9             4.93 no    hs_diploma
##  7 Butler Coun~ Alab~          69         13.7             5.49 no    hs_diploma
##  8 Calhoun Cou~ Alab~          70.7       14.3             4.93 yes   some_coll~
##  9 Chambers Co~ Alab~          71.4        8.7             4.08 no    hs_diploma
## 10 Cherokee Co~ Alab~          77.5        4.3             4.05 no    hs_diploma
## # ... with 3,132 more rows, and 2 more variables: median_hh_income <int>,
## #   smoking_ban <fct>
select(county, 
       starts_with("s"), 
       contains("a"), 
       -homeownership)
## # A tibble: 3,142 x 8
##    state   smoking_ban name            pop_change unemployment_rate median_edu  
##    <fct>   <fct>       <chr>                <dbl>             <dbl> <fct>       
##  1 Alabama none        Autauga County        1.48              3.86 some_college
##  2 Alabama none        Baldwin County        9.19              3.99 some_college
##  3 Alabama partial     Barbour County       -6.22              5.9  hs_diploma  
##  4 Alabama none        Bibb County           0.73              4.39 hs_diploma  
##  5 Alabama none        Blount County         0.68              4.02 hs_diploma  
##  6 Alabama none        Bullock County       -2.28              4.93 hs_diploma  
##  7 Alabama <NA>        Butler County        -2.69              5.49 hs_diploma  
##  8 Alabama <NA>        Calhoun County       -1.51              4.93 some_college
##  9 Alabama none        Chambers County      -1.2               4.08 hs_diploma  
## 10 Alabama none        Cherokee County      -0.6               4.05 hs_diploma  
## # ... with 3,132 more rows, and 2 more variables: per_capita_income <dbl>,
## #   median_hh_income <int>

In the first one here, I dropped every column starting with “p.”

In the second one, I selected the variables that start with “s” or contain “a”, but then dropped the variable “homeownership.” I also put each of these on its own line to keep things clean and easier to look at.
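If you want to check how the helpers combine without scrolling through 3,142 rows, a one-row made-up data frame works well. The data frame toy below borrows a few column names from county, but its values are invented for illustration.

```r
library(dplyr)

# One made-up row with a few of the county column names
toy <- data.frame(state = "NJ", smoking_ban = "none", pop2017 = 100,
                  poverty = 8.6, homeownership = 67)

names(select(toy, starts_with("s")))    # "state" "smoking_ban"
names(select(toy, ends_with("p")))      # "homeownership"
names(select(toy, -starts_with("p")))   # "state" "smoking_ban" "homeownership"

# Several helpers together: keep anything containing "o", then drop one column
names(select(toy, contains("o"), -homeownership))
# "smoking_ban" "pop2017" "poverty"
```

Wrapping each call in names() is just a convenience here, so that only the surviving column names are printed.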

Official Question Time 2

Since the last OQT, we’ve done:

  1. The Global Environment

  2. Packages

  3. The Tidyverse

    • Tibbles

    • select()

    • select()’s helper functions:

      • starts_with()

      • ends_with()

      • contains()

      • -

Pipe

This morning, we learned two shortcuts:

  1. Assignment arrow: ALT and - (Option and - on a Mac)
  2. Comment: Ctrl + Shift + c (Cmd + Shift + c on a Mac)

I also promised a third: the pipe. You make it with Ctrl/Cmd + Shift + m, and it looks like this: %>%

The pipe is almost magic, and it gives you a ton of power.

The pipe allows us to carry forward our data between steps.



Watch carefully

select(county, starts_with("s"))
## # A tibble: 3,142 x 2
##    state   smoking_ban
##    <fct>   <fct>      
##  1 Alabama none       
##  2 Alabama none       
##  3 Alabama partial    
##  4 Alabama none       
##  5 Alabama none       
##  6 Alabama none       
##  7 Alabama <NA>       
##  8 Alabama <NA>       
##  9 Alabama none       
## 10 Alabama none       
## # ... with 3,132 more rows
county %>% select(., starts_with("s"))
## # A tibble: 3,142 x 2
##    state   smoking_ban
##    <fct>   <fct>      
##  1 Alabama none       
##  2 Alabama none       
##  3 Alabama partial    
##  4 Alabama none       
##  5 Alabama none       
##  6 Alabama none       
##  7 Alabama <NA>       
##  8 Alabama <NA>       
##  9 Alabama none       
## 10 Alabama none       
## # ... with 3,132 more rows

With the pipe, we are able to move our dataset forward, represented only by a period, or even nothing at all, further down.

ct1 <- select(county, starts_with("s"))
ct2 <- county %>% select(., starts_with("s"))
ct3 <- county %>% select(starts_with("s"))
identical(ct1, ct2)         # This is just a function to tell if two things are the exact same in every way 
## [1] TRUE
identical(ct1, ct3)
## [1] TRUE

Our subsequent tests show that the three ways of selection (with the name, piped with a period, and piped without a period) produce identical results.
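The same idea works with any function, not just select(). Here is a minimal sketch using round() on a single number; the object names plain, dotted, bare, and moved are made up for illustration.

```r
library(dplyr)   # loading dplyr (or the whole tidyverse) makes %>% available

plain  <- round(3.14159, 2)          # ordinary call
dotted <- 3.14159 %>% round(., 2)    # piped, with the period as a placeholder
bare   <- 3.14159 %>% round(2)       # piped, placeholder left out

identical(plain, dotted)
identical(plain, bare)

# The period can also send the piped value somewhere other than the first slot:
moved <- 2 %>% round(3.14159, .)     # same as round(3.14159, 2)
identical(plain, moved)
```

In everyday use the placeholder is usually left out, because the piped value goes into the first argument by default.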

More Piping

For something this short, it wasn’t necessary, but let’s try something a little longer:

  1. Take the county dataset
  2. Select the columns that contain “pop”
  3. Show the first 4 rows

oldway <- select(county, contains("pop"))
head(oldway, 4)
## # A tibble: 4 x 4
##   pop2000 pop2010 pop2017 pop_change
##     <dbl>   <dbl>   <int>      <dbl>
## 1   43671   54571   55504       1.48
## 2  140415  182265  212628       9.19
## 3   29038   27457   25270      -6.22
## 4   20826   22915   22668       0.73

It works. We knew it would. (It’s nearly the same as what we did on the last slide.)

But let’s see how the pipe can simplify this.

county %>% 
  select(., contains("pop")) %>% 
  head(., 4)
## # A tibble: 4 x 4
##   pop2000 pop2010 pop2017 pop_change
##     <dbl>   <dbl>   <int>      <dbl>
## 1   43671   54571   55504       1.48
## 2  140415  182265  212628       9.19
## 3   29038   27457   25270      -6.22
## 4   20826   22915   22668       0.73

It’s beautiful.

Both steps completed in one operation. No need to save a middle dataset (like “oldway” was above) and hit run twice.

It also made it clear to us, at the very beginning of it, which dataset was being used — no need to search through the code to see what we’re using or if we’re changing datasets in the middle.

Rename

Another helpful function from the tidyverse is rename(). It takes the form rename(data, newname = oldname). This is helpful because it lets us string together multiple name changes. Of course, we can use our new pipe, too!

Let’s see it in use:

names(county)
##  [1] "name"              "state"             "pop2000"          
##  [4] "pop2010"           "pop2017"           "pop_change"       
##  [7] "poverty"           "homeownership"     "multi_unit"       
## [10] "unemployment_rate" "metro"             "median_edu"       
## [13] "per_capita_income" "median_hh_income"  "smoking_ban"
county %>% 
  rename(unemprate = unemployment_rate,
         county_name = name,
         pct_poverty = poverty) %>% 
  names()                 # Yep, we can use this at the end of a pipe series - isn't it great?
##  [1] "county_name"       "state"             "pop2000"          
##  [4] "pop2010"           "pop2017"           "pop_change"       
##  [7] "pct_poverty"       "homeownership"     "multi_unit"       
## [10] "unemprate"         "metro"             "median_edu"       
## [13] "per_capita_income" "median_hh_income"  "smoking_ban"
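One thing worth noticing: rename() returns a new dataset and leaves the original alone, so the new names are only kept if you assign the result. A minimal sketch with a made-up two-column data frame (toy and renamed are illustrative names):

```r
library(dplyr)

# A made-up dataset, for illustration only
toy <- data.frame(name = c("A", "B"), poverty = c(10, 12))

renamed <- toy %>%
  rename(county_name = name,
         pct_poverty = poverty)

names(toy)       # unchanged: "name" "poverty"
names(renamed)   # "county_name" "pct_poverty"
```

This is why the county dataset still has its original names after the pipe above: we printed the renamed version but never saved it.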

On Using Script

Importance of Skipping Lines


Now you see why it’s nice to put each piece of a function on its own line and to include spaces between the pieces of its arguments.

This:

county %>% 
  rename(unemprate = unemployment_rate,
         county_name = name,
         pct_poverty = poverty) %>% 
  names()


is a lot easier to read, and much easier to understand, than:

county %>% rename(unemprate = unemployment_rate, county_name = name, pct_poverty = poverty) %>% names()


Similarly, the pipe makes the code easier to follow even when it is written on one line. Another correct, but annoying, way to write this would have been:

names(rename(county, unemprate = unemployment_rate, county_name = name, pct_poverty = poverty))



If you aren’t using a script by now, you’ll want to start.


If you want to go back and change something, it is much easier to edit a line in your script than to retype it.


From this morning:

While you can type everything directly into the console pane (bottom left), it is good practice to begin typing your “script” into the source pane (top left).

To run a line from the source pane: Press Ctrl + Enter, and R will run everything it thinks you want. You can also click the “Run” button in the top right.

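This also works from any line of a multi-line pipe chain: a pipe needs something on both sides of it, so the whole chain is run together rather than just the line your cursor is on.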

Official Question Time 3

Since the last OQT, we’ve done:

  1. The pipe %>%

  2. rename()

Filtering Observations

Reviewing Logic Conditions

We previously used == for our logical tests. As a reminder: == means “equal to.”

What if we need to test two things?

2==3                    # Is this TRUE?
## [1] FALSE
2==2 & 4==4             # Are both sides TRUE?
## [1] TRUE
2==3 & 4==4 
## [1] FALSE
2==3 | 4==4             # Is at least one side TRUE?
## [1] TRUE
2==3 | 2==4 
## [1] FALSE


We can also do tests of greater than and less than.

2<3                    # Is 2 less than 3?
## [1] TRUE
2>3                    # Is 2 greater than 3?
## [1] FALSE
2<=3                    # Is 2 less than or equal to 3?
## [1] TRUE
3>=3                    # Is 3 greater than or equal to 3?
## [1] TRUE
4>=3                    # Is 4 greater than or equal to 3?
## [1] TRUE
2<=3 & 4>=4             # Are both sides TRUE?
## [1] TRUE
2<=3 | 4<=10             # Is at least one side TRUE?
## [1] TRUE


Lastly, what if we want to reverse a condition, so that something TRUE becomes FALSE?

2==3
## [1] FALSE
!(2==3)                 # Inverting the previous statement
## [1] TRUE
2!=3                    # 2 does not equal 3
## [1] TRUE
2==3 | 4==4 
## [1] TRUE
!(2==3 | 4==4) 
## [1] FALSE
2==3 | 2==4 
## [1] FALSE
!(2==3 | 2==4) 
## [1] TRUE
2==3 & 4==4
## [1] FALSE
!(2==3 & 4==4)
## [1] TRUE
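These operators also work on whole columns at once, giving one TRUE or FALSE per row. As a sketch, the two vectors below copy the pop_change and smoking_ban values of the first four counties shown earlier:

```r
# Values of the first four counties, copied from the output shown earlier
pop_change  <- c(1.48, 9.19, -6.22, 0.73)
smoking_ban <- c("none", "none", "partial", "none")

pop_change > 5                           # FALSE  TRUE FALSE FALSE
pop_change > 5 & smoking_ban == "none"   # FALSE  TRUE FALSE FALSE
pop_change > 5 | smoking_ban == "none"   #  TRUE  TRUE FALSE  TRUE
!(pop_change > 5)                        #  TRUE FALSE  TRUE  TRUE
```

Each result has four values, one per county, which is what lets us use these tests to pick out rows.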

The filter() Function

Earlier, we used the select() function to select the columns/variables we wanted. Now, we can use the filter() function to select the rows/observations we want.

When filtering, we use our logic conditions.

county %>% 
  filter(median_edu == "hs_diploma") %>%  
  select(name, median_edu)
## # A tibble: 1,397 x 2
##    name            median_edu
##    <chr>           <fct>     
##  1 Barbour County  hs_diploma
##  2 Bibb County     hs_diploma
##  3 Blount County   hs_diploma
##  4 Bullock County  hs_diploma
##  5 Butler County   hs_diploma
##  6 Chambers County hs_diploma
##  7 Cherokee County hs_diploma
##  8 Chilton County  hs_diploma
##  9 Choctaw County  hs_diploma
## 10 Clarke County   hs_diploma
## # ... with 1,387 more rows
county %>% 
  filter(pop_change >5 & smoking_ban == "none") %>% 
  select(name, pop_change, smoking_ban)
## # A tibble: 231 x 3
##    name                      pop_change smoking_ban
##    <chr>                          <dbl> <fct>      
##  1 Baldwin County                  9.19 none       
##  2 Lee County                      6.71 none       
##  3 Limestone County                6.19 none       
##  4 Denali Borough                  7.35 none       
##  5 Matanuska-Susitna Borough      11.1  none       
##  6 Benton County                  11.5  none       
##  7 Craighead County                5.52 none       
##  8 Saline County                   5.34 none       
##  9 Washington County               7.64 none       
## 10 Alameda County                  5.07 none       
## # ... with 221 more rows

Let’s filter the original county dataset to find only counties named “Middlesex” to see if we can look at our county.

county %>% 
  filter(name == "Middlesex County")
## # A tibble: 4 x 15
##   name            state pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>           <fct>   <dbl>   <dbl>   <int>      <dbl>   <dbl>         <dbl>
## 1 Middlesex Coun~ Conn~  155071  165676  163410      -1.14     7.2          75.8
## 2 Middlesex Coun~ Mass~ 1465396 1503085 1602947       2.82     8.2          63.9
## 3 Middlesex Coun~ New ~  750162  809858  842798       1.54     8.6          67  
## 4 Middlesex Coun~ Virg~    9932   10959   10679      -0.9     10.2          81.1
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## #   metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## #   median_hh_income <int>, smoking_ban <fct>

Turns out, there are four of them. Let’s try three ways we can get just our county.

county %>% 
  filter(name == "Middlesex County") %>% 
  filter(state == "New Jersey") 
## # A tibble: 1 x 15
##   name            state pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>           <fct>   <dbl>   <dbl>   <int>      <dbl>   <dbl>         <dbl>
## 1 Middlesex Coun~ New ~  750162  809858  842798       1.54     8.6            67
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## #   metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## #   median_hh_income <int>, smoking_ban <fct>
county %>% 
  filter(name == "Middlesex County" & state == "New Jersey") 
## # A tibble: 1 x 15
##   name            state pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>           <fct>   <dbl>   <dbl>   <int>      <dbl>   <dbl>         <dbl>
## 1 Middlesex Coun~ New ~  750162  809858  842798       1.54     8.6            67
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## #   metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## #   median_hh_income <int>, smoking_ban <fct>
county %>% 
  filter(name == "Middlesex County",
         state == "New Jersey") 
## # A tibble: 1 x 15
##   name            state pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>           <fct>   <dbl>   <dbl>   <int>      <dbl>   <dbl>         <dbl>
## 1 Middlesex Coun~ New ~  750162  809858  842798       1.54     8.6            67
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## #   metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## #   median_hh_income <int>, smoking_ban <fct>

When filtering on multiple conditions, you can either join them with & or list them as separate arguments, separated by commas.

Both will give you the same output.

However, this only works if you’re linking multiple filters with AND.
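A minimal sketch of the difference, using a made-up three-row data frame (toy is not the real county data): commas and & both mean AND, while OR has to be written out with |.

```r
library(dplyr)

# A made-up dataset, for illustration only
toy <- data.frame(name  = c("Middlesex County", "Middlesex County", "Sussex County"),
                  state = c("New Jersey", "Virginia", "New Jersey"))

with_and   <- toy %>% filter(name == "Middlesex County" & state == "New Jersey")
with_comma <- toy %>% filter(name == "Middlesex County", state == "New Jersey")
with_or    <- toy %>% filter(name == "Middlesex County" | state == "New Jersey")

nrow(with_and)     # 1 row: both conditions are TRUE
nrow(with_comma)   # 1 row: a comma means AND
nrow(with_or)      # 3 rows: at least one condition is TRUE
```

There is no comma-style shorthand for OR, so those conditions always go inside a single argument joined by |.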

Filtering Multiple Options: The %in% Operator

What if we want to filter with multiple options?

We can use a funky operator to see if something is inside another. The %in% operator is used to identify if a value is within a set of values. (Sadly, there is no shortcut for this one.)

The official name for this is the “match” operator. You might also just hear it called “percent-in-percent.”


For example:

fullset <- seq(12, 50, 4)        # Creating a sequence from 12 to 50 by 4's
12 %in% fullset                  # Test: Is 12 in this set?
## [1] TRUE
11 %in% fullset                  # Test: Is 11 in this set?
## [1] FALSE


We can also test vectors:

c(11, 12, 48) %in% fullset   # Are each of these numbers in the set?
## [1] FALSE  TRUE  TRUE
11 %in% c(11,12,13)          # Is 11 within a new set?
## [1] TRUE
c(11,12,13) %in% 11          # Are each of these in the set of 11?
## [1]  TRUE FALSE FALSE

If you don’t want to use it, you don’t have to, though it can make more complicated code a little simpler.
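To show what %in% saves us from typing, here is a sketch comparing it with the long way (a chain of == joined by |). The vector state is made up for illustration. The last line shows a common slip: using == against a vector of options does not do the same thing.

```r
# A made-up vector of state names, for illustration only
state <- c("Delaware", "Rhode Island", "New Jersey", "Delaware")

long_way  <- state == "Rhode Island" | state == "Delaware"
short_way <- state %in% c("Rhode Island", "Delaware")

identical(long_way, short_way)   # TRUE

# Not the same: == compares position by position, recycling the shorter vector
state == c("Rhode Island", "Delaware")   # FALSE FALSE FALSE  TRUE
```

With only two options the long way is manageable, but %in% stays the same length no matter how many options you add.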

Filtering Multiple Options

So, using the new operator (%in%), let’s filter our counties to anything named “Middlesex” or “Sussex.”

county %>% 
  filter(name %in% c("Middlesex County", "Sussex County")) 
## # A tibble: 7 x 15
##   name            state pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>           <fct>   <dbl>   <dbl>   <int>      <dbl>   <dbl>         <dbl>
## 1 Middlesex Coun~ Conn~  155071  165676  163410      -1.14     7.2          75.8
## 2 Sussex County   Dela~  156638  197145  225322       9.13    12            80  
## 3 Middlesex Coun~ Mass~ 1465396 1503085 1602947       2.82     8.2          63.9
## 4 Middlesex Coun~ New ~  750162  809858  842798       1.54     8.6          67  
## 5 Sussex County   New ~  144166  149265  141682      -2.8      5.3          84.8
## 6 Middlesex Coun~ Virg~    9932   10959   10679      -0.9     10.2          81.1
## 7 Sussex County   Virg~   12504   12087   11373      -3.32    17.8          67  
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## #   metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## #   median_hh_income <int>, smoking_ban <fct>

Let’s also filter to include counties that are in either Rhode Island or Delaware:

county %>% 
  filter(state %in% c("Rhode Island", "Delaware")) 
## # A tibble: 8 x 15
##   name            state pop2000 pop2010 pop2017 pop_change poverty homeownership
##   <chr>           <fct>   <dbl>   <dbl>   <int>      <dbl>   <dbl>         <dbl>
## 1 Kent County     Dela~  126697  162310  176824       4.54    13            72.9
## 2 New Castle Cou~ Dela~  500265  538479  559793       1.88    11.9          71.3
## 3 Sussex County   Dela~  156638  197145  225322       9.13    12            80  
## 4 Bristol County  Rhod~   50648   49875   48912      -0.6      7            72.1
## 5 Kent County     Rhod~  167090  166158  163760      -0.36     7.8          73.8
## 6 Newport County  Rhod~   85433   82888   83460       0.77     9            63.6
## 7 Providence Cou~ Rhod~  621602  626667  637357       1.16    16.7          55.5
## 8 Washington Cou~ Rhod~  123546  126979  126150      -0.17     9.6          76.1
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## #   metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## #   median_hh_income <int>, smoking_ban <fct>
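
As a rough sketch of what %in% is doing, the example below compares it with the longer version written out with == and |. The tibble toy is a hypothetical stand-in for the county dataset, typed by hand from county names shown in the output above, and the object names with_in and with_or are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for the county dataset
toy <- tibble(
  name  = c("Kent County", "Bristol County", "Middlesex County", "Sussex County"),
  state = c("Delaware", "Rhode Island", "New Jersey", "Delaware")
)

# Version 1: %in% checks each state against every value in the vector
with_in <- toy %>%
  filter(state %in% c("Rhode Island", "Delaware"))

# Version 2: the same condition spelled out with == and |
with_or <- toy %>%
  filter(state == "Rhode Island" | state == "Delaware")

# Both versions should keep the same counties
with_in$name
with_or$name
```

With only two states the long version is manageable, but %in% stays short no matter how many values are in the list.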

We can also negate this filter by putting an exclamation point (!) at the beginning of the condition:

county %>% 
  filter(!state %in% c("Rhode Island", "Delaware")) %>% 
  nrow()
## [1] 3134
nrow(county)
## [1] 3142

We can see there are 3142 counties in the full dataset. Rhode Island and Delaware together have 8 counties, as we saw earlier, so when we filter those out we are left with 3134 (3142 - 8).
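
The same check can be sketched on a small example. Here toy and small_states are hypothetical names, with the rows typed by hand from county names shown earlier. R evaluates %in% before !, so !state %in% small_states means the same thing as !(state %in% small_states); the parentheses are optional but can make the intent easier to read.

```r
library(dplyr)

# Hypothetical toy data standing in for the county dataset
toy <- tibble(
  name  = c("Kent County", "Bristol County", "Middlesex County", "Sussex County"),
  state = c("Delaware", "Rhode Island", "New Jersey", "Delaware")
)

small_states <- c("Rhode Island", "Delaware")

# Rows that match the list
kept <- toy %>%
  filter(state %in% small_states)

# Rows that do NOT match the list (parentheses added for readability)
dropped <- toy %>%
  filter(!(state %in% small_states))

# The two pieces should add back up to the full dataset
nrow(kept) + nrow(dropped) == nrow(toy)
```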

Official Question Time 5

Since the last OQT, we’ve done:

  1. Review of Logic Conditions

    • Equal to: ==

      • Which should not be confused with the single equal sign =, used for assignment or for setting arguments inside a function

      • This was Common Error #1 from earlier

    • AND: &

      • Both conditions must be TRUE
    • OR: |

      • At least one condition must be TRUE
    • NOT: !

      • Whatever the condition is, reverse it

      • Was it TRUE? It’s now FALSE.

      • Was it FALSE? It’s now TRUE.

  2. The filter() function

  3. The %in% operator
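
To see these operators outside of filter(), here is a minimal sketch on a plain numeric vector. The vector x and the result names are made up for illustration; each condition is checked one element at a time and returns one TRUE or FALSE per element.

```r
# A small made-up vector to test conditions on
x <- c(2, 5, 8)

is_five  <- x == 5          # Equal to:  FALSE  TRUE FALSE
between  <- x > 3 & x < 8   # AND:       FALSE  TRUE FALSE
extremes <- x < 3 | x > 6   # OR:         TRUE FALSE  TRUE
not_five <- !(x == 5)       # NOT:        TRUE FALSE  TRUE
in_list  <- x %in% c(2, 8)  # %in%:       TRUE FALSE  TRUE

print(between)
```

filter() keeps the rows where the condition comes out TRUE, which is why these operators can be dropped straight into it.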

Wrapping Up

Practice

Let’s put this all together then. Take some time and do the following:

  1. A dataset of NJ Counties

    1. Use the county dataset

    2. Create a copy of it, called “new_jersey_counties”

    3. Filter it to only include counties in New Jersey

    4. Rename pop2017 to population_2017

    5. Rename name to county

    6. Remove the pop2000 and pop2010 variables as well as any columns ending with the letter “e” or starting with “m”

    7. Print the tibble to the console

  2. Print the column names from the tibble from Part 1.

  3. A dataset of presidential western counties

    1. Use the county dataset

    2. Create a copy of it, called “presidential_counties”

    3. Select only the county name, state, poverty rate, and percentage multifamily units

    4. Filter it to only include counties in the following western states: Texas, Oklahoma, California, Washington, Oregon, New Mexico, Arizona

    5. Filter to only include counties named after the following presidents: Washington, Jefferson, Madison, Lincoln, Roosevelt, and Grant

    6. Print the tibble to the console.

  4. Print the dimensions (number of rows and columns) of the tibble from Part 3.

  5. A dataset of counties that shrank in population from 2010 to 2017

    1. Use the county dataset

    2. Create a copy of it, called “shrinking_counties”

    3. Select only the county name, state, and pop_change columns

    4. Filter it to only include the counties that shrank from 2010 to 2017

      • In other words, the variable pop_change should be less than 0
  6. Print the number of counties from Part 5.

The answers are on the next slide, but try to see if you can work through it without them.

Practice Answers — No Peeking

library(tidyverse)
library(usdata)  # provides the county dataset

# Part 1: NJ County Dataset 
new_jersey_counties <- county %>% 
  filter(state == "New Jersey") %>% 
  rename(population_2017 = pop2017,
         county = name) %>% 
  select(-pop2000, -pop2010, -ends_with("e"), -starts_with("m"))
print(new_jersey_counties)
## # A tibble: 21 x 5
##    county            population_2017 poverty homeownership smoking_ban
##    <chr>                       <int>   <dbl>         <dbl> <fct>      
##  1 Atlantic County            269918    15.3          70.7 none       
##  2 Bergen County              948406     7.2          67.5 <NA>       
##  3 Burlington County          448596     6.4          79   <NA>       
##  4 Camden County              510719    13.1          69.7 <NA>       
##  5 Cape May County             93553    10.6          74.3 <NA>       
##  6 Cumberland County          152538    18.8          67.4 none       
##  7 Essex County               808285    16.7          47.2 partial    
##  8 Gloucester County          292206     7.9          80.9 <NA>       
##  9 Hudson County              691643    17.1          34.3 <NA>       
## 10 Hunterdon County           125059     4.5          85.6 <NA>       
## # ... with 11 more rows
# Part 2: Column Names 
names(new_jersey_counties)
## [1] "county"          "population_2017" "poverty"         "homeownership"  
## [5] "smoking_ban"
# Part 3: Presidential Counties 
presidential_counties <- county %>% 
  
  # 3c: Selecting columns: county name, state, poverty rate, and percentage multifamily units
  select(name, state, poverty, multi_unit) %>% 
  
  # 3d: Filtering States 
  filter(state %in% c("Texas", "Oklahoma", "California",
                      "Washington", "Oregon", "New Mexico", "Arizona")) %>% 
  
  # 3e: Filtering Counties 
  filter(name %in% c("Washington County", "Jefferson County", "Madison County", 
                     "Lincoln County", "Roosevelt County", "Grant County")) 

# 3f: Printing the tibble
presidential_counties
## # A tibble: 17 x 4
##    name              state      poverty multi_unit
##    <chr>             <fct>        <dbl>      <dbl>
##  1 Grant County      New Mexico    22          7.7
##  2 Lincoln County    New Mexico    15.4       10.4
##  3 Roosevelt County  New Mexico    27.5       11.5
##  4 Grant County      Oklahoma       9.6        2.7
##  5 Jefferson County  Oklahoma      20.9        7.3
##  6 Lincoln County    Oklahoma      14.3        3.5
##  7 Washington County Oklahoma      14         11.5
##  8 Grant County      Oregon        13.7        5.3
##  9 Jefferson County  Oregon        20.9       11.2
## 10 Lincoln County    Oregon        18.4       16.2
## 11 Washington County Oregon        10.3       31.2
## 12 Jefferson County  Texas         19.4       19.9
## 13 Madison County    Texas         15.4        3.6
## 14 Washington County Texas         13.2       12.4
## 15 Grant County      Washington    15.9       14.1
## 16 Jefferson County  Washington    12.8        9  
## 17 Lincoln County    Washington    13.7        4.4
# Part 4: Dimensions 
dim(presidential_counties)
## [1] 17  4
# Part 5: Shrinking Counties 
shrinking_counties <- county %>% 
  select(name, state, pop_change) %>% 
  filter(pop_change < 0)  # Rate of growth less than 0 

# Part 6: Number from P5 
nrow(shrinking_counties)
## [1] 1594

As a note, it is good practice to comment your code with what you’re doing. Notice here that I commented each code chunk with which part I was working on and what the task was.

You should be sure to do something like this for your homework assignments.

Official Question Time 6

This afternoon, we’ve learned:

  1. Directories and Projects

  2. Environments & Packages

  3. The Tidyverse

    • select()

      • Its helpers: starts_with(), ends_with(), contains(), and -
    • rename()

    • filter()

    • The pipe %>%

      • Its shortcut: Ctrl/Cmd + Shift + M
  4. The %in% operator

  5. Logical Conditions

    • And: &

    • Or: |

    • Not: !
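
As a quick recap, the sketch below strings several of these pieces together in one pipeline. The tibble toy is a hypothetical three-row example typed by hand from values printed earlier in this session, and recap is a made-up object name; the real county dataset would work the same way.

```r
library(dplyr)

# Hypothetical toy data, copied by hand from output shown earlier
toy <- tibble(
  name    = c("Kent County", "Bristol County", "Sussex County"),
  state   = c("Delaware", "Rhode Island", "Delaware"),
  pop2000 = c(126697, 50648, 156638),
  pop2017 = c(176824, 48912, 225322),
  poverty = c(13, 7, 12)
)

recap <- toy %>%
  # select() with a helper: keep name, state, every "pop" column, and poverty
  select(name, state, starts_with("pop"), poverty) %>%
  # The minus sign drops a column
  select(-pop2000) %>%
  # rename() uses new_name = old_name
  rename(county = name, population_2017 = pop2017) %>%
  # filter() with AND: both conditions must be TRUE
  filter(state == "Delaware" & poverty < 12.5)

print(recap)
names(recap)
```

Only Sussex County passes both conditions here, since Kent County's poverty rate of 13 is above the cutoff and Bristol County is not in Delaware.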