If you know of a good resource that you want to share, feel free to submit a pull request! If you’re not sure how to do that, you can also submit an issue with the link to the resource and a quick intro to why it’s helpful for you.
This doc is based on my google doc of resources and inspired by a coversation I had on twitter.
I’m slowly working on coverting the original doc!
| [Chroma.js](http://gka.github.io/palettes/#diverging | c0=darkred,deeppink,lightyellow | c1=lightyellow,lightgreen,teal | steps=13 | bez0=1 | bez1=1 | coL0=1 | coL1=1) |
There is no use having a visualization if folks can’t read it.
file.path() # takes folder names and returns a path to your file
file.choose() # pulls up finder so you can point to your file
download.file(url, dest_path) # makes downloading files reproducible
jsonlite is a good package to use!
fromJSON() takes in a JSON file (can be a URL) and returns an R listtoJSON() takes in a file and converts it to JSONFormatting JSON files:
minify()
{"a":1,"b":2,"c":{"x":5,"y":6}}prettify(), pretty = TRUE by default in toJSON()
{
"a": 1,
"b": 2,
"c": {
"x": 5,
"y": 6
}
}ROpenSci has an awesome package to use tabula in R– tabulizer
Tabula is also awesome on its own.
# example code from github
library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
out1 <- extract_tables(f)
out2 <- extract_tables(f, pages = 1, guess = FALSE, method = "data.frame")
extract_areas() # turns the pdf into an R graphic so you can select the tables on a page, if the whole page isn't a table!
out3 <- extract_text(f, page = 3) # gets just the text from the table, without turning it into a dataframe
cat(out3, sep = "\n")
## len supp dose
## 4.2 VC 0.5
## 11.5 VC 0.5
## 7.3 VC 0.5
## 5.8 VC 0.5
## 6.4 VC 0.5
## 10.0 VC 0.5
## 11.2 VC 0.5
## 11.2 VC 0.5
## 5.2 VC 0.5
## 7.0 VC 0.5
## 16.5 VC 1.0
## 16.5 VC 1.0
## 15.2 VC 1.0
## 17.3 VC 1.0
## 22.5 VC 1.0
## 3
How to install tabulizer:
(On Windows)
From github vignette:
In command prompt, install Chocolately if you don’t already have it:
@powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin
Then install java!
choco install jdk7 -y
To set an environment variable in R use
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_92")
if that doesn’t work, look at the other options in the vignette.
Then install the dev version of the package (not yet on CRAN)
if (!require("ghit")) {
install.packages("ghit")
}
# on 64-bit Windows
ghit::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
ghit::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))
Using the xlsx package to create an excel file
readxl package - tidyverse w/ excel
excel_sheets() # gives a character vector of the sheet names
read_excel() # reads in excel files
# args: sheets (defines which sheet you want)
col_names = TRUE
col_types ("blank" to skip a column)
skip (n rows to skip)
# put all sheets in a list to keep them together
my_workbook <- lapply(excel_sheets("data.xlsx"),
read_excel,
path = "data.xlsx")
XLConnect - dynamic connection between R and excel.
createSheet(book, “name”)
loadWorkbook(“path”)
getSheets(book)
writeWorksheet () # function (object, data, sheet, startRow = 1, startCol = 1, header = TRUE,
rownames = NULL)
writeWorkbook(object, “path”)
renameSheet() # function (object, sheet, newName)
removeSheet() # function (object, sheet)
There’s a vignette with lots of good info.
# uses oauth to access drive
gs_auth() # run to authorize
From the github site, an overview of the functions:
gs_ls() # list sheets
gs_title() # register a sheet by title
gs_key() # register a sheet by key
gs_url() # register a sheet by url
... tbd
strptime to convert character strings to POSIXct.
strptime(x, format, tz = "")
Formatting POSIXct objects:
Example use:
# from r documentation
x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
z <- strptime(x, "%d%b%Y")
Also use lubridate for working with date objects!
There’s helpful info on the tidyverse site as well as a nice vignette
Lubridate makes it easy to manipulate and work with dates.
It has some of the same capacity as strptime, although I actually prefer the former.
ymd(20101215)
#> [1] "2010-12-15"
mdy("4/1/17")
#> [1] "2017-04-01"
Lubridate also makes timezones less confusing!
time <- ymd_hms("2010-12-13 15:30:30")
time
#> [1] "2010-12-13 15:30:30 UTC"
# Changes printing
with_tz(time, "America/Chicago")
#> [1] "2010-12-13 09:30:30 CST"
# Changes time
force_tz(time, "America/Chicago")
#> [1] "2010-12-13 15:30:30 CST"
You can also pull out specific parts of a date object.
bday <- dmy("14/10/1979")
month(bday)
#> [1] 10
wday(bday, label = TRUE)
#> [1] Sun
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Say you want to turn a time into a numerical representation (i.e. 8:30 AM would be 8.50). Use lubridate!
time <- hm("8:30")
hour(time) + minute(time)/60
#> 8.5
str() is a helpful starter function for working with nested lists, but we can certainly improve upon it. First, using the max.level argument specifies how much information you want about the list. For example, the output from str(..., max.level=0) is something along the lines of List of x, replace x with the length of your list.
A more interactive way to work with nested lists is with {listviewer}, a package that lets you easily explore your list.
library(listviewer)
jsonedit(df, mode = "view")
The options for the mode argument are code, form, text, tree, and view. Even though you specify the mode, you can easily switch within the interactive list view if you think something else would work better for your particular use.
In R, you can write an if/else statement, similar to the way you would write it in Python. It helps me to think about it this way because I started programming in Python first.
It would look something like this:
if (condition) {
do something
} else if (condition) {
do something
} else {
do something
}
But let’s say you want to define a variable in a dataframe based on a conditional. It is way easier with tidyverse packages!
You can use mutate() with ifelse(condition, if-do, else-do)
For example, let’s say you want to assign high & low with a conditional based on some values. You could do:
library(dplyr)
df %>%
mutate(x = ifelse(var > 5, "high", "low"))
# this assigns "high" to anything strictly higher than 5
You can also use a nested ifelse() for more complicated conditionals, although that is discouraged. More on that in a bit…
In this example, I wanted the final variable to say “White subjects”, “Nonwhite subjects”, or “Unknown”. I started with two different variables. One said whether or not race was known, and the other gave race, if known.
library(tidyverse)
biopics <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/biopics/biopics.csv")
biopics <- biopics %>% # start with our original data
mutate(race_plotting = ifelse(race_known == "Unknown", "Unknown", ifelse(subject_race == "White", "White subjects", "Nonwhite subjects"))) # this says, if race_known is unknown, stop there and just mark unknown. But if we know the race, check to see if the race is coded as white. If it is, mark white and stop there. If not, mark nonwhite.
Instead of getting lost in nested statements, try this instead! Found in Github
df <- mutate(df, newvar = derivedFactor(
"group1" = oldvar %in% c("a", "b", "c"),
"group2" = oldvar %in% c("d", "e", "f"),
"group3" = oldvar %in% c("g", "h", "i"),
.default = NA
))
case_when() is another good alternative that draws on some SQL syntax. Here’s an example from Stack Overflow
library(dplyr) # >= 0.7.0
mtcars %>%
mutate(cg = case_when(carb <= 2 ~ "low",
carb > 2 ~ "high"))
How to change the order of the bars in geom_bar() !!!!
# from https://stackoverflow.com/questions/5208679/order-bars-in-ggplot2-bar-graph
require(forcats)
ggplot(data, aes(x = fct_infreq(var))) + geom_bar()
# from https://rstudio-pubs-static.s3.amazonaws.com/7433_4537ea5073dc4162950abb715f513469.html
x$name <- factor(x$name, levels = x$name[order(x$val)])
ggplot(x, aes(x = name, y = val)) + theme_bw() + geom_bar(stat = "identity")
To look into: gghighlight in this blog post! Looks like an easier way to highlight certain lines/bars instead of manually coding that.
Cookbook of good ggplot2 hacks here!!
# forward geocoding
output <- opencage_forward(placename = "Sarzeau")
# reverse geocoding
output2 <- opencage_reverse(latitude = 51.5034070,
longitude = -0.1275920)
# adding parameters
results3 <- opencage_forward(placename = "Berlin", country = "DE")
But for now, I’m using a function I found online. Example:
I wanted to geocode the locations of the Boston community centers recently. I found the names here. Using the chrome extension for CSS selector gadget, I scraped the names into R.
community_centers_url <- read_html("https://www.boston.gov/community-centers") # site with names
centers <- community_centers_url %>%
html_nodes(".cd-t") %>% # css selector
html_text() # I want the text
Then I used a combination of functions to geocode.
url <- function(address, return.call = "json", sensor = "false") {
root <- "http://maps.google.com/maps/api/geocode/"
u <- paste(root, return.call, "?address=", address, "&sensor=", sensor, sep = "")
return(URLencode(u))
} # calling google maps API
geoCode <- function(address,verbose=FALSE) {
if(verbose) cat(address,"\n")
u <- url(address)
doc <- getURL(u)
x <- fromJSON(doc,simplify = FALSE)
if(x$status=="OK") {
lat <- x$results[[1]]$geometry$location$lat
lng <- x$results[[1]]$geometry$location$lng
location_type <- x$results[[1]]$geometry$location_type
formatted_address <- x$results[[1]]$formatted_address
return(c(lat, lng, location_type, formatted_address))
} else {
return(c(NA,NA,NA, NA))
}
} # actually getting the addresses
I’m sure there’s a cleaner way to do this, I’ll work on using opencage and then updating this page!
Another method of geocoding is using ggmap package and Google API. First, register for a key here. After you get the key, this script will geocode for you (up to the API limit which is around 2500/day)
library(ggmap)
register_google(key = "my_key")
for(i in 1:nrow(turnout_data)){
#location is a string like "123 University Ave, Gainesville FL"
#This works best if you format your locations the same way you would for Google Maps/Waze/Tom Tom
result <- geocode(my_data_frame$location[i], output = "latlona", source = "google")
my_data_frame$lon[i] <- as.numeric(result[1])
my_data_frame$lat[i] <- as.numeric(result[2])
turnout_data$geoAddress[i] <- as.character(result[3])
}
Tidycensus for demographic data.
# median income in Boston census tracts
bos_inc <- get_acs(geography = "tract",
variables = c(medincome = "B19013_001"),
state = "MA",
county = "Suffolk")
# see available variables for 2016 acs 5-year
load_variables(year = 2016, dataset = "acs5")
Tigris to load the shapefiles to map census data!
options(tigris_use_cache = TRUE) # doesn't cache by default
bos_tract <- tracts(state = "MA", county = "Suffolk")
You can join the two with tigris::geo_join. Regular joins don’t work with spatial data.
bos_joined <- geo_join(bos_tract, bos_inc, by = "GEOID")
And if I want to exclude a particular tract (it’s mostly water, NA for all demographics) that makes my maps looks ugly :(
bos_joined <- subset(bos_joined, NAME.1 != "Census Tract 9901.01, Suffolk County, Massachusetts")
dplyr::filter doesn’t work with spatial data!!
Principles of tidy data:
The tidyverse has understandable and compatible functions for dealing with data cleaning!
Sometimes you have a table that is “wide” or “narrow,” and you want it to be the other way. Use spread and gather!
From the RStudio blog:
library(tidyr)
library(dplyr)
messy <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
a = c(67, 80, 64),
b = c(56, 90, 50)
)
messy
#> name a b
#> 1 Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50
# We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:
# gather(data, key, value, ..., na.rm = FALSE, convert = FALSE,
factor_key = FALSE)
messy %>%
gather(drug, heartrate, a:b)
#> name drug heartrate
#> 1 Wilbur a 67
#> 2 Petunia a 80
#> 3 Gregory a 64
#> 4 Wilbur b 56
#> 5 Petunia b 90
#> 6 Gregory b 50
Conversely, you also might want to spread your data.
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE,
sep = NULL)
Let’s say you have a variable for date-time, but you want one variable for date and the other for time.
In the case of the vignette, they wanted to separate location and time (randomized variables) in the table.
tidier <- messy %>%
gather(key, time, -id, -trt)
tidier %>% head(8)
#> id trt key time
#> 1 1 treatment work.T1 0.08514
#> 2 2 control work.T1 0.22544
#> 3 3 treatment work.T1 0.27453
#> 4 4 control work.T1 0.27231
#> 5 1 treatment home.T1 0.61583
#> 6 2 control home.T1 0.42967
#> 7 3 treatment home.T1 0.65166
#> 8 4 control home.T1 0.56774
tidy <- tidier %>%
separate(key, into = c("location", "time"), sep = "\\.")
tidy %>% head(8)
#> id trt location time time
#> 1 1 treatment work T1 0.08514
#> 2 2 control work T1 0.22544
#> 3 3 treatment work T1 0.27453
#> 4 4 control work T1 0.27231
#> 5 1 treatment home T1 0.61583
#> 6 2 control home T1 0.42967
#> 7 3 treatment home T1 0.65166
#> 8 4 control home T1 0.56774
The opposite of separate() is unite().
Let’s say I have a dataframe where one of the variables is a full address. I want only the rows from Minnesota.
Found in a SO question
df %>%
filter(str_detect(address, "Minnesota"))
# "detects" the string "Minnesota" in the address column
What if we have NA values and blank cells, both meant to be NA?
Answer found in a SO question
df <- df[!(is.na(df$var) | df$var==""), ]
janitor is an awesome data cleaning package. As it says on the github site, it was built with beginners in mind, but can be useful for more experienced R users by making routing processes a little faster. I now routinely add clean_names() every time I load a csv.
read_csv("data.csv") %>%
clean_names()
In fact, when I’m loading lots of csvs I’ll often write my own function that does that automatically. Something like:
clean_read <- function(data) {
x <- read_csv(data) %>%
clean_names()
return(x)
}
If you’re dealing with messy data, especially hard coded data, you might see a lot of duplicates. In that case get_dupes() from janitor will be your new best friend!
General format for linear models in R lm(response ~ explanatory, data = data)
lm(response ~ ., data = data)lm(response ~ . -vars_to_exclude, data = data)Linear model assumptions
log(response)David Robinson’s broom package gives model output in a “tidy” format.
The tidy() function gives the typical model summary as a tidy data frame– great for further analysis!
> library(fivethirtyeight)
> data("hate_crimes")
> library(broom)
> mod <- lm(hate_crimes_per_100k_splc ~ gini_index, data = hate_crimes)
> tidy(mod)
term estimate std.error statistic p.value
1 (Intercept) -1.527463 0.7833043 -1.950025 0.05741966
2 gini_index 4.020510 1.7177215 2.340606 0.02374447
Then instead of having to look for the fitted values in the nested structure of mod, you can get them + more in a tidy data frame with augment()!
> head(augment(mod))
.rownames hate_crimes_per_100k_splc gini_index .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
1 1 0.12583893 0.472 0.3702175 0.04512572 -0.24437860 0.03499099 0.2410643 0.0192795907 -1.0312203
2 2 0.14374012 0.422 0.1691920 0.06752783 -0.02545191 0.07835622 0.2439314 0.0005134118 -0.1098987
3 3 0.22531995 0.455 0.3018689 0.03520103 -0.07654890 0.02129211 0.2436851 0.0011190988 -0.3207498
4 4 0.06906077 0.458 0.3139304 0.03543832 -0.24486962 0.02158013 0.2410927 0.0116131848 -1.0261863
5 5 0.25580536 0.471 0.3661970 0.04407070 -0.11039166 0.03337397 0.2433762 0.0037397389 -0.4654371
6 6 0.39052330 0.457 0.3099099 0.03527586 0.08061342 0.02138273 0.2436546 0.0012466081 0.3377963
Then you can easily add fitted values and residuals to plots that help assess model quality. Fewer steps is good!
Python
HTML/CSS
Regex
SQL
Jekyll + blogging