Fred Traylor, Lab TA (he/him)
September 1, 2022
Working Directories & Projects
Environments and Packages
Packages
Tidyverse
usdata
Data Management in R
Filtering Rows
Selecting Columns
Tomorrow, we’ll be working with data from outside of R. That is, we’ll be importing our own data files.
Before we can do that, though, we first need to understand where our data and files are currently being saved.
This is called a “Working Directory.” To find it, we can run getwd(), short for “get working directory.”
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
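As a minimal sketch, checking the working directory looks like this. The path printed will be whatever folder your own R session is using, so it will not match mine. The variable name current_dir is just something I made up for the example.

```r
# Ask R which folder it is currently reading from and saving to
current_dir <- getwd()
print(current_dir)

# list.files() shows what is already sitting in that folder
list.files(current_dir)
```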
If you haven’t already, let’s go ahead and set up a project. We’ll save everything we do this semester into the project, so it’ll be a handy place to keep everything together. Projects serve two useful functions:
They keep everything together
They let us tell R to use the project as a working directory
In the RStudio menu, click File > New Project
In the popup menu, click “New Directory”
Click “New Project”
Give it an appropriate name and choose where it should go in your files
I named mine “ay23_lab_ta”
Other good options include “soc541”, “sociology_stats_1”, etc.
All of my R files are saved in “Documents > R”, so I made it a “subdirectory” within that folder.
You should now be looking at a new project.
Your working directory might have changed, too, and we can look at it
with getwd().
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
Since the last OQT, we’ve done:
Viewing the Working Directory
Creating a New Project
Up to now, we’ve used what is called “Base R.” If you go to the “Environment” pane of your R window, there is a dropdown menu to see what is in each environment. Go ahead and click on it and you’ll see a selection of other “packages.”
No need to do anything inside these packages, but now you know how R knows what to do. If we look at the documentation for a function, it gives us the function’s name followed by the name of the package holding it in curly brackets.
For example, ?table gives us table {base}, telling us that the function table comes in the base package built into R.
If we do it for head(), we see that it comes in the utilities (utils) R package.
Generally, we don’t want to mess with the code of any functions or values that come in our other packages because fixing them involves reinstalling a lot of things.
The beauty of R is that, because it is open-source, anybody can add new functions and make them available to anybody. Because these functions often rely on each other, they get packaged together to make for easy (and consistent) usage.
These “packages” are what we’re going to be working with today. They will come in handy during this bootcamp and throughout the entire year as you take statistics.
I generally group packages into two main categories: quality-of-life packages and extension packages.
As you can guess, there’s a lot of overlap here. We’ll mostly be working with quality of life packages this fall, but we’ll also add some extension packages as well.
When you want to use a package, there are two things you have to do:
You have to install it.
There are many thousands of packages available on CRAN (where you downloaded R from) and even more available elsewhere online. It’d be too much for R to install every possible package to your computer when you downloaded it the first time, so you install them only as you need them.
Luckily, packages (like R) are free!
To install a package, you use the install.packages() function built into R.
You have to call it from your library.
After it’s been downloaded, it’s saved on your computer. (YAY!)
But R doesn’t know which ones you want to use yet, so you will need to call it into the working library via the library() function.
You can think of install.packages() as buying a tool and library() as actually getting it out of the toolbox.
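Here is a rough sketch of the two steps side by side, using the tidyverse package as the example. The requireNamespace() check is my own addition so the install only happens if the package is missing. It assumes you have an internet connection and a CRAN mirror set, which RStudio normally handles for you.

```r
# Step 1: install (buy the tool). Only needed once per computer,
# so skip it when the package is already on this machine.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Step 2: load (take the tool out of the toolbox).
# This has to be done in every new R session.
library(tidyverse)
```

Notice that the package name is in quotes for install.packages() but not for library().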
This afternoon, and all throughout this year, we’re also going to use
a popular package (well, set of packages) called the
tidyverse.
Because it’ll take a minute or two to install, go ahead and type
install.packages("tidyverse") into R and run it.
The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
Basically, it makes it easier to manage, use, and look at our data. Today, we’ll be working with the manage and use parts of this. In a few weeks, we’ll do a little bit with how the tidyverse makes data visualization better and easier than with Base R.
Hopefully, by now, the tidyverse has been installed, so let’s go
ahead and call it into the library with
library(tidyverse).
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
You’ll get these messages every time you load the tidyverse. They tell you which packages have been loaded, whether any package was built using a newer version of R than the one you have installed, and whether there are any “conflicting” functions. Generally you don’t need to worry about these. (We’ll deal with them more in the Spring…)
This afternoon, we’re going to work with a dataset of US counties. (In case you don’t know, counties are smaller than states but (generally) larger than cities.)
We’re going to use a package called usdata and a dataset saved in it called “county.” To download it, go ahead and type install.packages("usdata") into R and run it.
Once that’s installed, run library(usdata) to bring it into our environment.
## Warning: package 'usdata' was built under R version 4.1.2
Let’s take a look at the dataframe to see what we’re working with.
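The output that follows looks like what dim() and str() print, so the code was most likely something along these lines. Treat it as my best reconstruction rather than the exact chunk, and note it assumes usdata is already installed.

```r
library(usdata)   # provides the county dataset

dim(county)   # number of rows, then number of columns
str(county)   # name, type, and first few values of every column
```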
## [1] 3142 15
## tibble [3,142 x 15] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pop2000 : num [1:3142] 43671 140415 29038 20826 51024 ...
## $ pop2010 : num [1:3142] 54571 182265 27457 22915 57322 ...
## $ pop2017 : int [1:3142] 55504 212628 25270 22668 58013 10309 19825 114728 33713 25857 ...
## $ pop_change : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
## $ poverty : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ homeownership : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ multi_unit : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
## $ metro : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 1 1 ...
## $ median_edu : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 2 2 2 2 2 3 2 2 ...
## $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
## $ median_hh_income : int [1:3142] 55317 52562 33368 43404 47412 29655 36326 43686 37342 40041 ...
## $ smoking_ban : Factor w/ 3 levels "none","partial",..: 1 1 2 1 1 1 NA NA 1 1 ...
We can see that we have a lot of observations (3,142) and 15 variables.
Fortunately, the names are fairly descriptive, but we should still look at the documentation (?county) to see what everything means.
It also tells us that the data is saved as a “tibble.”
In the Packages pane, scroll down to the one that says “tibble.” If it isn’t checked, go ahead and check it to bring it to the library.
Checking the box is the same as running library(tibble), and you’ll actually see this code run in the console when you check it.
The description for “tibble” is “Simple Data Frames.” Indeed, tibbles are the same as data frames, but with a few nice features that make it easier to work with large datasets:
It gives us the data type for each column
When you print to the console, it shortens the output
Only ten observations
Only as many columns as will fit
And gives a summary of what was cut off
Smaller storage
You can see here that we added a whole host of new packages, including dplyr, tidyr, and usdata.
We can also see them in the Packages pane in the bottom right of our RStudio window. It’s in the same place as Help, so you’ll have to click the “Packages” tab.
Scroll through some of them and look at the descriptions of the
packages that are checked. For example, the dplyr package
is described as “A Grammar of Data Manipulation.”
If you want to install a new package, you can click the “Install”
button at the top of that pane or install it via
install.packages() as we did before.
Occasionally, we need to update our packages. If that happens, you can click the “Update” button and select the packages you want to update.
Be careful with this though, as updates can change functions you’re using and mess with your results in the middle of a project.
For this reason, it’s generally not necessary to update unless it’s been a while or you need the latest version of a package.
select() Function
The first function we’re going to work with from the tidyverse is select().
Let’s say we want to create a dataset, called
“smallcounty” that has only some variables we care about
from the county dataset.
With base R, we could create a new data frame built with each column individually. And it would totally work, but it would also be a lot.
smallcounty <- data.frame(county$state,
county$smoking_ban,
county$median_edu,
county$median_hh_income)
With the select function, we can simplify this a bit.
First, you should know that, when working with the tidyverse, every function’s argument list will begin with the dataset.
Because of this, we don’t need to type it again after that.
So, we can make the same dataset as before using this code:
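A sketch of what that looks like, using the same four columns as the base R version above. I am loading dplyr here (the part of the tidyverse that select() lives in) and assuming usdata is installed.

```r
library(dplyr)    # select() comes from dplyr, part of the tidyverse
library(usdata)   # provides the county dataset

# The dataset goes first, then the columns we want, with no county$ needed
smallcounty <- select(county, state, smoking_ban, median_edu, median_hh_income)

names(smallcounty)
```

One small difference from the base R version: the columns keep their plain names (state, not county.state).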
What we did here was take our original dataframe and “select” the variables we wanted.
As we’ll see, with all tidyverse functions, we start with the dataset. After that, we told it which columns we wanted.
Not having to repeat the dataset each time is one part of what makes these functions “tidy.”
select() with Other Functions
The tidyverse also includes some other functions that are nice to pair with select(). These include starts_with(), ends_with(), contains(), and the “minus sign”, -. Let’s try these now.
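Judging by the columns in the output that follows, the first three calls were probably close to the ones here. The letters I picked ("s", "p", "a") are my reconstruction from those column names, and the object names are made up so we can check the results.

```r
library(dplyr)
library(usdata)

s_cols <- select(county, starts_with("s"))   # names that start with "s"
p_cols <- select(county, ends_with("p"))     # names that end with "p"
a_cols <- select(county, contains("a"))      # names with an "a" anywhere

names(s_cols)
names(p_cols)
names(a_cols)
```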
## # A tibble: 3,142 x 2
## state smoking_ban
## <fct> <fct>
## 1 Alabama none
## 2 Alabama none
## 3 Alabama partial
## 4 Alabama none
## 5 Alabama none
## 6 Alabama none
## 7 Alabama <NA>
## 8 Alabama <NA>
## 9 Alabama none
## 10 Alabama none
## # ... with 3,132 more rows
## # A tibble: 3,142 x 1
## homeownership
## <dbl>
## 1 77.5
## 2 76.7
## 3 68
## 4 82.9
## 5 82
## 6 76.9
## 7 69
## 8 70.7
## 9 71.4
## 10 77.5
## # ... with 3,132 more rows
## # A tibble: 3,142 x 8
## name state pop_change unemployment_ra~ median_edu per_capita_inco~
## <chr> <fct> <dbl> <dbl> <fct> <dbl>
## 1 Autauga County Alab~ 1.48 3.86 some_coll~ 27842.
## 2 Baldwin County Alab~ 9.19 3.99 some_coll~ 27780.
## 3 Barbour County Alab~ -6.22 5.9 hs_diploma 17892.
## 4 Bibb County Alab~ 0.73 4.39 hs_diploma 20572.
## 5 Blount County Alab~ 0.68 4.02 hs_diploma 21367.
## 6 Bullock County Alab~ -2.28 4.93 hs_diploma 15444.
## 7 Butler County Alab~ -2.69 5.49 hs_diploma 17015.
## 8 Calhoun County Alab~ -1.51 4.93 some_coll~ 23610.
## 9 Chambers County Alab~ -1.2 4.08 hs_diploma 21080.
## 10 Cherokee County Alab~ -0.6 4.05 hs_diploma 23068.
## # ... with 3,132 more rows, and 2 more variables: median_hh_income <int>,
## # smoking_ban <fct>
select(county, -state, -homeownership) # Select all variables EXCEPT the ones that are preceded by the minus sign
## # A tibble: 3,142 x 13
## name pop2000 pop2010 pop2017 pop_change poverty multi_unit unemployment_ra~
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Autau~ 43671 54571 55504 1.48 13.7 7.2 3.86
## 2 Baldw~ 140415 182265 212628 9.19 11.8 22.6 3.99
## 3 Barbo~ 29038 27457 25270 -6.22 27.2 11.1 5.9
## 4 Bibb ~ 20826 22915 22668 0.73 15.2 6.6 4.39
## 5 Bloun~ 51024 57322 58013 0.68 15.6 3.7 4.02
## 6 Bullo~ 11714 10914 10309 -2.28 28.5 9.9 4.93
## 7 Butle~ 21399 20947 19825 -2.69 24.4 13.7 5.49
## 8 Calho~ 112249 118572 114728 -1.51 18.6 14.3 4.93
## 9 Chamb~ 36583 34215 33713 -1.2 18.8 8.7 4.08
## 10 Chero~ 23988 25989 25857 -0.6 16.1 4.3 4.05
## # ... with 3,132 more rows, and 5 more variables: metro <fct>,
## # median_edu <fct>, per_capita_income <dbl>, median_hh_income <int>,
## # smoking_ban <fct>
Of course, we can also combine these together:
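As a sketch, combining helpers could look like the two calls here. They are reconstructed from the description and the column names in the output, so take the exact letters as my assumption. The object names no_p and s_or_a are hypothetical.

```r
library(dplyr)
library(usdata)

# Drop every column whose name starts with "p"
no_p <- select(county, -starts_with("p"))

# Keep columns starting with "s" or containing "a", then drop homeownership.
# Each piece gets its own line to keep things readable.
s_or_a <- select(county,
                 starts_with("s"),
                 contains("a"),
                 -homeownership)

names(no_p)
names(s_or_a)
```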
## # A tibble: 3,142 x 9
## name state homeownership multi_unit unemployment_ra~ metro median_edu
## <chr> <fct> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Autauga Cou~ Alab~ 77.5 7.2 3.86 yes some_coll~
## 2 Baldwin Cou~ Alab~ 76.7 22.6 3.99 yes some_coll~
## 3 Barbour Cou~ Alab~ 68 11.1 5.9 no hs_diploma
## 4 Bibb County Alab~ 82.9 6.6 4.39 yes hs_diploma
## 5 Blount Coun~ Alab~ 82 3.7 4.02 yes hs_diploma
## 6 Bullock Cou~ Alab~ 76.9 9.9 4.93 no hs_diploma
## 7 Butler Coun~ Alab~ 69 13.7 5.49 no hs_diploma
## 8 Calhoun Cou~ Alab~ 70.7 14.3 4.93 yes some_coll~
## 9 Chambers Co~ Alab~ 71.4 8.7 4.08 no hs_diploma
## 10 Cherokee Co~ Alab~ 77.5 4.3 4.05 no hs_diploma
## # ... with 3,132 more rows, and 2 more variables: median_hh_income <int>,
## # smoking_ban <fct>
## # A tibble: 3,142 x 8
## state smoking_ban name pop_change unemployment_rate median_edu
## <fct> <fct> <chr> <dbl> <dbl> <fct>
## 1 Alabama none Autauga County 1.48 3.86 some_college
## 2 Alabama none Baldwin County 9.19 3.99 some_college
## 3 Alabama partial Barbour County -6.22 5.9 hs_diploma
## 4 Alabama none Bibb County 0.73 4.39 hs_diploma
## 5 Alabama none Blount County 0.68 4.02 hs_diploma
## 6 Alabama none Bullock County -2.28 4.93 hs_diploma
## 7 Alabama <NA> Butler County -2.69 5.49 hs_diploma
## 8 Alabama <NA> Calhoun County -1.51 4.93 some_college
## 9 Alabama none Chambers County -1.2 4.08 hs_diploma
## 10 Alabama none Cherokee County -0.6 4.05 hs_diploma
## # ... with 3,132 more rows, and 2 more variables: per_capita_income <dbl>,
## # median_hh_income <int>
In the first one here, I dropped every column starting with “p.”
In the second, I selected the variables that start with “s” or contain “a”, but then dropped the variable “homeownership.” I also separated each of these with new lines to keep things clean and easier to look at.
Since the last OQT, we’ve done:
The Global Environment
Packages
The Tidyverse
Tibbles
select()
select()’s helper functions:
starts_with()
ends_with()
contains()
-
This morning, we learned two shortcuts:
ALT and - (Option and - on a Mac)
Ctrl + Shift + c (Cmd + Shift + c on a Mac)
I also promised a third. This one is called the pipe. You make it with CTRL/Cmd + Shift + m, and it gives you this funky thing:
%>%
The pipe is almost magic, and it gives you a ton of power.
The pipe allows us to carry forward our data between steps.
Watch carefully
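A sketch of the comparison, assuming the two calls selected the columns starting with "s" (that is what the two identical outputs suggest). The object names are hypothetical.

```r
library(dplyr)
library(usdata)

# Without the pipe: the dataset is the first argument
without_pipe <- select(county, starts_with("s"))

# With the pipe: the dataset is "carried forward" into select()
with_pipe <- county %>% select(starts_with("s"))

identical(without_pipe, with_pipe)
```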
## # A tibble: 3,142 x 2
## state smoking_ban
## <fct> <fct>
## 1 Alabama none
## 2 Alabama none
## 3 Alabama partial
## 4 Alabama none
## 5 Alabama none
## 6 Alabama none
## 7 Alabama <NA>
## 8 Alabama <NA>
## 9 Alabama none
## 10 Alabama none
## # ... with 3,132 more rows
## # A tibble: 3,142 x 2
## state smoking_ban
## <fct> <fct>
## 1 Alabama none
## 2 Alabama none
## 3 Alabama partial
## 4 Alabama none
## 5 Alabama none
## 6 Alabama none
## 7 Alabama <NA>
## 8 Alabama <NA>
## 9 Alabama none
## 10 Alabama none
## # ... with 3,132 more rows
With the pipe, we are able to move our dataset forward, represented only by a period, or even nothing at all, further down.
ct1 <- select(county, starts_with("c"))
ct2 <- county %>% select(., starts_with("c"))
ct3 <- county %>% select(starts_with("c"))
identical(ct1, ct2) # This is just a function to tell if two things are the exact same in every way
## [1] TRUE
## [1] TRUE
Our subsequent tests show that the three ways of selection (with the name, piped with a period, and piped without a period) produce identical results.
For something this short, it wasn’t necessary, but let’s try something a little longer:
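Without the pipe, a two-step job means saving a middle dataset. The sketch here is my guess at the steps (select the population columns, then keep the first four rows), based on the output that follows. The name oldway is the one used for the middle dataset later on this slide.

```r
library(dplyr)
library(usdata)

# Step 1: save a middle dataset with only the population columns
oldway <- select(county, starts_with("pop"))

# Step 2: look at the first four rows of that middle dataset
head(oldway, 4)
```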
## # A tibble: 4 x 4
## pop2000 pop2010 pop2017 pop_change
## <dbl> <dbl> <int> <dbl>
## 1 43671 54571 55504 1.48
## 2 140415 182265 212628 9.19
## 3 29038 27457 25270 -6.22
## 4 20826 22915 22668 0.73
It works. We knew it would. (It’s nearly the same as what we did on the last slide.)
But let’s see how the pipe can simplify this.
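A sketch of the piped version, again assuming the two steps are "select the population columns" and "keep the first four rows." The name piped is hypothetical and only there so we can check the result.

```r
library(dplyr)
library(usdata)

# Start with the dataset, then pass it through each step in order
piped <- county %>%
  select(starts_with("pop")) %>%
  head(4)

piped
```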
## # A tibble: 4 x 4
## pop2000 pop2010 pop2017 pop_change
## <dbl> <dbl> <int> <dbl>
## 1 43671 54571 55504 1.48
## 2 140415 182265 212628 9.19
## 3 29038 27457 25270 -6.22
## 4 20826 22915 22668 0.73
It’s beautiful.
Both steps completed in one operation. No need to save a middle
dataset (like “oldway” was above) and hit run twice.
It also made it clear to us, at the very beginning of it, which dataset was being used — no need to search through the code to see what we’re using or if we’re changing datasets in the middle.
Another helpful function from the tidyverse is rename(). It takes the form rename(data, newname = oldname). This is helpful because it lets us string in multiple name changes. Of course, we can use our new pipe, too!
Let’s see it in use:
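First, the original column names. The list of names that follows is what names() prints, so the call was presumably this one (or colnames(), which gives the same result here).

```r
library(usdata)

# The column names before any renaming
original_names <- names(county)
original_names
```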
## [1] "name" "state" "pop2000"
## [4] "pop2010" "pop2017" "pop_change"
## [7] "poverty" "homeownership" "multi_unit"
## [10] "unemployment_rate" "metro" "median_edu"
## [13] "per_capita_income" "median_hh_income" "smoking_ban"
county %>%
rename(unemprate = unemployment_rate,
county_name = name,
pct_poverty = poverty) %>%
names() # Yep, we can use this at the end of a pipe series - isn't it great?
## [1] "county_name" "state" "pop2000"
## [4] "pop2010" "pop2017" "pop_change"
## [7] "pct_poverty" "homeownership" "multi_unit"
## [10] "unemprate" "metro" "median_edu"
## [13] "per_capita_income" "median_hh_income" "smoking_ban"
Now you see why it’s nice to skip lines between pieces of our functions
and include spaces between pieces of arguments.
This:
county %>%
rename(unemprate = unemployment_rate,
county_name = name,
pct_poverty = poverty) %>%
names()
is a lot easier to read, and much easier to understand, than:
county %>% rename(unemprate = unemployment_rate, county_name = name, pct_poverty = poverty) %>% names()
Similarly, the pipe also makes it easier to understand in one line.
Another correct, but annoying way to write this would have been:
names(rename(county, unemprate = unemployment_rate, county_name = name, pct_poverty = poverty))
If you want to go back and change something, it is much easier to alter
something from script than to retype it.
From this morning:
While you can type everything directly into the console pane (bottom left), it is good practice to begin typing your “script” into the source pane (top left).
Easy to go back, see what you’ve run, change things, and rerun without having to retype everything
Eventually, you can run the entire source code at once
To run a line from the source pane: Press Ctrl +
Enter, and R will run everything it thinks you want.
You can also click the “Run” button in the top right.
If you have something highlighted, R will run ONLY the highlighted code
If you don’t have something highlighted, R will run:
The current line, where your cursor is
Anything after, if your script isn’t finished
Anything before, if it thinks the previous line leads to the current one
R will also look for something on both sides of the pipe.
If you end a line with %>%, it will run whatever
comes after.
If you run the line after an unfinished pipe, it will think you want the line before.
Since the last OQT, we’ve done:
The pipe %>%
rename()
We previously used == for our logical tests. As a reminder:
== means “equal to”
If they are, the value returned is TRUE
(occasionally abbreviated as T)
If they are not, the value returned is FALSE
(occasionally abbreviated as F)
A single equal sign = is used for
Assignment, like our assignment arrow <-
Or in a function, like in
seq(from = 1, to = 5)
What if we need to test two things?
& means AND: Both sides must be
TRUE
| means OR: Either side (or both) must be
TRUE
## [1] FALSE
## [1] TRUE
## [1] FALSE
## [1] TRUE
## [1] FALSE
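As a quick sketch of how AND and OR behave, here are a few tests of my own. The value of x is made up for illustration, and these are not necessarily the same tests that produced the output above.

```r
x <- 7   # hypothetical value, chosen only for illustration

and_both  <- (x == 7) & (x == 7)   # both sides TRUE
and_mixed <- (x == 7) & (x == 8)   # right side FALSE
or_mixed  <- (x == 7) | (x == 8)   # left side TRUE is enough
or_none   <- (x == 6) | (x == 8)   # neither side TRUE

c(and_both, and_mixed, or_mixed, or_none)
```

Wrapping each side in parentheses is optional here, but it makes it easy to see where one test ends and the next begins.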
We can also do tests of greater than and less than.
## [1] TRUE
## [1] FALSE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
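A small sketch of the four comparison operators, with numbers I picked for illustration (not necessarily the ones run above).

```r
greater          <- 5 > 3    # is 5 greater than 3?
less             <- 5 < 3    # is 5 less than 3?
greater_or_equal <- 5 >= 5   # is 5 greater than OR equal to 5?
less_or_equal    <- 3 <= 5   # is 3 less than OR equal to 5?

c(greater, less, greater_or_equal, less_or_equal)
```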
Lastly, what if we want to reverse a condition, so that something TRUE becomes FALSE?
Use ! to invert the condition
Use != to mean “does not equal”
## [1] FALSE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] TRUE
## [1] FALSE
## [1] TRUE
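To make the two uses of the exclamation point concrete, here is a short sketch with values of my own choosing (again, not necessarily the tests that were run above).

```r
flipped_true  <- !TRUE        # TRUE becomes FALSE
flipped_test  <- !(5 == 5)    # the test is TRUE, so flipping gives FALSE
not_equal     <- 5 != 3       # 5 does not equal 3
not_equal_too <- 5 != 5       # 5 does equal 5, so "does not equal" is FALSE

c(flipped_true, flipped_test, not_equal, not_equal_too)
```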
filter() Function
Earlier, we used the select() function to select the columns/variables we wanted. Now, we can use the filter() function to select the rows/observations we want.
Base R has a similar subset() function; they’re basically interchangeable.
When filtering, we use our logic conditions.
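Here is a sketch of two filters in the spirit of the output that follows. The conditions are my reconstruction: the value "hs_diploma" comes straight from the output, but the cutoff of 5 for pop_change is a guess based on the values shown. The object names are hypothetical.

```r
library(dplyr)
library(usdata)

# Counties where the median education level is a high school diploma
hs_counties <- county %>%
  select(name, median_edu) %>%
  filter(median_edu == "hs_diploma")

# Counties that grew by more than 5 (assumed cutoff) AND have no smoking ban
growing_no_ban <- county %>%
  select(name, pop_change, smoking_ban) %>%
  filter(pop_change > 5 & smoking_ban == "none")

hs_counties
growing_no_ban
```

Note that filter() quietly drops rows where the condition is NA, so counties with a missing smoking_ban do not show up in the second result.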
## # A tibble: 1,397 x 2
## name median_edu
## <chr> <fct>
## 1 Barbour County hs_diploma
## 2 Bibb County hs_diploma
## 3 Blount County hs_diploma
## 4 Bullock County hs_diploma
## 5 Butler County hs_diploma
## 6 Chambers County hs_diploma
## 7 Cherokee County hs_diploma
## 8 Chilton County hs_diploma
## 9 Choctaw County hs_diploma
## 10 Clarke County hs_diploma
## # ... with 1,387 more rows
## # A tibble: 231 x 3
## name pop_change smoking_ban
## <chr> <dbl> <fct>
## 1 Baldwin County 9.19 none
## 2 Lee County 6.71 none
## 3 Limestone County 6.19 none
## 4 Denali Borough 7.35 none
## 5 Matanuska-Susitna Borough 11.1 none
## 6 Benton County 11.5 none
## 7 Craighead County 5.52 none
## 8 Saline County 5.34 none
## 9 Washington County 7.64 none
## 10 Alameda County 5.07 none
## # ... with 221 more rows
Let’s filter the original county dataset to find only
counties named “Middlesex” to see if we can look at our county.
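A sketch of that filter. County names in this dataset include the word “County,” so I am assuming the test was against the full name.

```r
library(dplyr)
library(usdata)

# Keep only the rows where the county name matches exactly
middlesex <- county %>%
  filter(name == "Middlesex County")

middlesex
```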
## # A tibble: 4 x 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Middlesex Coun~ Conn~ 155071 165676 163410 -1.14 7.2 75.8
## 2 Middlesex Coun~ Mass~ 1465396 1503085 1602947 2.82 8.2 63.9
## 3 Middlesex Coun~ New ~ 750162 809858 842798 1.54 8.6 67
## 4 Middlesex Coun~ Virg~ 9932 10959 10679 -0.9 10.2 81.1
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## # metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## # median_hh_income <int>, smoking_ban <fct>
Turns out, there are four of them. Let’s try three ways we can get just our county.
## # A tibble: 1 x 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Middlesex Coun~ New ~ 750162 809858 842798 1.54 8.6 67
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## # metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## # median_hh_income <int>, smoking_ban <fct>
## # A tibble: 1 x 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Middlesex Coun~ New ~ 750162 809858 842798 1.54 8.6 67
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## # metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## # median_hh_income <int>, smoking_ban <fct>
## # A tibble: 1 x 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Middlesex Coun~ New ~ 750162 809858 842798 1.54 8.6 67
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## # metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## # median_hh_income <int>, smoking_ban <fct>
When filtering on multiple conditions, you can do either:
Multiple conditions in the same filter: Middlesex & NJ
Multiple filters, one for each condition: Middlesex, then NJ
Both will give you the same output.
However, this only works if you’re linking multiple filters with AND.
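The approaches above can be sketched like this. The first and third are the two described in the list. The second (separating conditions with a comma, which dplyr treats as AND) is my assumption about what the third “way” was. The object names are hypothetical.

```r
library(dplyr)
library(usdata)

# Way 1: both conditions in one filter, joined with &
way1 <- county %>%
  filter(name == "Middlesex County" & state == "New Jersey")

# Way 2: both conditions in one filter, separated by a comma (also means AND)
way2 <- county %>%
  filter(name == "Middlesex County", state == "New Jersey")

# Way 3: one filter per condition, chained with the pipe
way3 <- county %>%
  filter(name == "Middlesex County") %>%
  filter(state == "New Jersey")

isTRUE(all.equal(way1, way2))
isTRUE(all.equal(way1, way3))
```

Chaining filters cannot express OR, because each filter only ever sees the rows that survived the one before it.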
%in% Operator
What if we want to filter with multiple options?
We can use a funky operator to see if something is inside another.
The %in% operator is used to identify if a
value is within a set of values. (Sadly, there is no shortcut for this
one.)
The official name for this is the “match” operator. You might also just hear it called “percent-in-percent.”
For example:
fullset <- seq(12, 50, 4) # Creating a sequence from 12 to 50 by 4's
12 %in% fullset # Test: Is 12 in this set?
## [1] TRUE
## [1] FALSE
We can also test vectors:
## [1] FALSE TRUE TRUE
## [1] TRUE
## [1] TRUE FALSE FALSE
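As a sketch of testing a whole vector at once, here is an example with numbers I picked myself (not necessarily the ones used above). Each value on the left gets its own TRUE or FALSE.

```r
fullset <- seq(12, 50, 4)   # 12, 16, 20, ..., 48

# One answer per value on the left-hand side
each_value <- c(10, 12, 16) %in% fullset

# any() collapses those answers into a single TRUE/FALSE
at_least_one <- any(c(10, 12, 16) %in% fullset)

each_value
at_least_one
```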
If you don’t want to use it, you don’t have to, though it can make more complicated code a little simpler.
So, using the new operator (%in%),
let’s filter our counties to anything named “Middlesex” or “Sussex.”
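A sketch of that filter, assuming the full county names (with “County”) are what gets matched. The name two_names is hypothetical.

```r
library(dplyr)
library(usdata)

# Keep a row if its name is ANY of the values in the set
two_names <- county %>%
  filter(name %in% c("Middlesex County", "Sussex County"))

two_names
```

This does the same job as name == "Middlesex County" | name == "Sussex County", but it stays short as the set grows.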
## # A tibble: 7 x 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Middlesex Coun~ Conn~ 155071 165676 163410 -1.14 7.2 75.8
## 2 Sussex County Dela~ 156638 197145 225322 9.13 12 80
## 3 Middlesex Coun~ Mass~ 1465396 1503085 1602947 2.82 8.2 63.9
## 4 Middlesex Coun~ New ~ 750162 809858 842798 1.54 8.6 67
## 5 Sussex County New ~ 144166 149265 141682 -2.8 5.3 84.8
## 6 Middlesex Coun~ Virg~ 9932 10959 10679 -0.9 10.2 81.1
## 7 Sussex County Virg~ 12504 12087 11373 -3.32 17.8 67
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## # metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## # median_hh_income <int>, smoking_ban <fct>
Let’s also filter to include counties that are in either Rhode Island or Delaware:
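The same idea works on the state column. This is a sketch, with a made-up object name.

```r
library(dplyr)
library(usdata)

# Keep counties whose state is in our set of two states
ri_or_de <- county %>%
  filter(state %in% c("Rhode Island", "Delaware"))

ri_or_de
```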
## # A tibble: 8 x 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Kent County Dela~ 126697 162310 176824 4.54 13 72.9
## 2 New Castle Cou~ Dela~ 500265 538479 559793 1.88 11.9 71.3
## 3 Sussex County Dela~ 156638 197145 225322 9.13 12 80
## 4 Bristol County Rhod~ 50648 49875 48912 -0.6 7 72.1
## 5 Kent County Rhod~ 167090 166158 163760 -0.36 7.8 73.8
## 6 Newport County Rhod~ 85433 82888 83460 0.77 9 63.6
## 7 Providence Cou~ Rhod~ 621602 626667 637357 1.16 16.7 55.5
## 8 Washington Cou~ Rhod~ 123546 126979 126150 -0.17 9.6 76.1
## # ... with 7 more variables: multi_unit <dbl>, unemployment_rate <dbl>,
## # metro <fct>, median_edu <fct>, per_capita_income <dbl>,
## # median_hh_income <int>, smoking_ban <fct>
We can also negate this filter by putting the exclamation point at the beginning:
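A sketch of the negated filter, counting rows with nrow() so we can compare against the full dataset. The object names are hypothetical.

```r
library(dplyr)
library(usdata)

# The ! flips the whole %in% test: keep counties NOT in these two states
everything_else <- county %>%
  filter(!state %in% c("Rhode Island", "Delaware"))

rows_left  <- nrow(everything_else)   # counties remaining after the filter
rows_total <- nrow(county)            # counties in the original dataset

rows_left
rows_total
```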
## [1] 3134
## [1] 3142
We can see there are originally 3142 counties in the dataset. When we filter out counties in Rhode Island or Delaware, which we see from earlier is 8, we are left with 3134.
Since the last OQT, we’ve done:
Review of Logic Conditions
Equal to: ==
Which should not be confused with the single equal sign
=, used for assignment or in a function
This was Common Error #1 from earlier
AND: &
Both sides must be TRUE
OR: |
Either side (or both) must be TRUE
NOT: !
Whatever the condition is, reverse it
Was it TRUE? It’s now FALSE.
Was it FALSE? It’s now TRUE.
The filter() function
The %in% operator
Let’s put this all together then. Take some time and do the following:
A dataset of NJ Counties
Use the county dataset
Create a copy of it, called
“new_jersey_counties”
Filter it to only include counties in New Jersey
Rename pop2017 to
population_2017
Rename name to county
Remove the pop2000 and pop2010
variables as well as any columns ending with the letter “e” or starting
with “m”
Print the tibble to the console
Print the column names from the tibble from Part 1.
A dataset of presidential western counties
Use the county dataset
Create a copy of it, called
“presidential_counties”
Select only the county name, state, poverty rate, and percentage multifamily units
Filter it to only include counties in the following western states: Texas, Oklahoma, California, Washington, Oregon, New Mexico, Arizona
Filter to only include counties named after the following presidents: Washington, Jefferson, Madison, Lincoln, Roosevelt, and Grant
Print the tibble to the console.
Print the number of dimensions in the tibble from Part 3.
A dataset of counties that shrank in population from 2010 to 2017
Use the county dataset
Create a copy of it, called
“shrinking_counties”
Select only the county name, state, and pop_change
Filter to only the counties that shrank from 2010 to 2017
pop_change should be less than 0
Print the number of counties from Part 5.
The answers are on the next slide, but try to see if you can work through it without them.
library(tidyverse)
# Part 1: NJ County Dataset
new_jersey_counties <- county %>%
filter(state == "New Jersey") %>%
rename(population_2017 = pop2017,
county = name) %>%
select(-pop2000, -pop2010, -ends_with("e"), -starts_with("m"))
print(new_jersey_counties)
## # A tibble: 21 x 5
## county population_2017 poverty homeownership smoking_ban
## <chr> <int> <dbl> <dbl> <fct>
## 1 Atlantic County 269918 15.3 70.7 none
## 2 Bergen County 948406 7.2 67.5 <NA>
## 3 Burlington County 448596 6.4 79 <NA>
## 4 Camden County 510719 13.1 69.7 <NA>
## 5 Cape May County 93553 10.6 74.3 <NA>
## 6 Cumberland County 152538 18.8 67.4 none
## 7 Essex County 808285 16.7 47.2 partial
## 8 Gloucester County 292206 7.9 80.9 <NA>
## 9 Hudson County 691643 17.1 34.3 <NA>
## 10 Hunterdon County 125059 4.5 85.6 <NA>
## # ... with 11 more rows
# Part 2: Column names from Part 1 (names() is one way to get them)
names(new_jersey_counties)
## [1] "county" "population_2017" "poverty" "homeownership"
## [5] "smoking_ban"
# Part 3: Presidential Counties
presidential_counties <- county %>%
# 3c: Selecting columns: county name, state, poverty rate, and percentage multifamily units
select(name, state, poverty, multi_unit) %>%
# 3d: Filtering States
filter(state %in% c("Texas", "Oklahoma", "California",
"Washington", "Oregon", "New Mexico", "Arizona")) %>%
# 3e: Filtering Counties
filter(name %in% c("Washington County", "Jefferson County", "Madison County",
"Lincoln County", "Roosevelt County", "Grant County"))
# 3f
presidential_counties
## # A tibble: 17 x 4
## name state poverty multi_unit
## <chr> <fct> <dbl> <dbl>
## 1 Grant County New Mexico 22 7.7
## 2 Lincoln County New Mexico 15.4 10.4
## 3 Roosevelt County New Mexico 27.5 11.5
## 4 Grant County Oklahoma 9.6 2.7
## 5 Jefferson County Oklahoma 20.9 7.3
## 6 Lincoln County Oklahoma 14.3 3.5
## 7 Washington County Oklahoma 14 11.5
## 8 Grant County Oregon 13.7 5.3
## 9 Jefferson County Oregon 20.9 11.2
## 10 Lincoln County Oregon 18.4 16.2
## 11 Washington County Oregon 10.3 31.2
## 12 Jefferson County Texas 19.4 19.9
## 13 Madison County Texas 15.4 3.6
## 14 Washington County Texas 13.2 12.4
## 15 Grant County Washington 15.9 14.1
## 16 Jefferson County Washington 12.8 9
## 17 Lincoln County Washington 13.7 4.4
# Part 4: Dimensions from Part 3 (dim() is one way to get them)
dim(presidential_counties)
## [1] 17 4
# Part 5: Shrinking Counties
shrinking_counties <- county %>%
select(name, state, pop_change) %>%
filter(pop_change<0) # Rate of growth less than 0
# Part 6: Number from P5
nrow(shrinking_counties)
## [1] 1594
As a note, it is good practice to comment your code with what you’re doing. Notice here that I commented each code chunk with which part I was working on and what the task was.
You should be sure to do something like this for your homeworks.
This afternoon, we’ve learned:
Directories and Projects
Environments & Packages
The Tidyverse
select()
starts_with(), ends_with(), contains(), and -
rename()
filter()
The pipe %>%
CTRL/Cmd + Shift + m
The %in% operator
Logical Conditions
And: &
Or: |
Not: !