An Introduction to Geographic Data Science in R with Honeybees

Katie Jolly

November 7, 2018

In this document I’ll demonstrate some code that shows the surface of geographic data science in R. We’ll use the packages sf, dplyr, and ggplot2 pretty heavily, so make sure those are installed and ready to be loaded!

# run this code if you need to install any of the packages


# and better yet, if you're using dplyr and ggplot2, try running 

install.packages("tidyverse") # to install the whole tidyverse. This takes a little while, though!

Geographic Data Science (GDS) and Geographic Information Science (GISc)

While very similar, there are some key differences between GDS and GISc. We generally think about geographic operations in R and Python as geographic data science/geocomputation. On the other hand, we think about geogaphic operations in programs like QGIS and ArcMap as GISc (often just written GIS). One of the key distinctions for me is the degree of reproducibility.

Table taken from Robin Lovelace’s post Can Geographic Data Save the World?.

With geographic operations in R, we can produce a full script of what we ran including any assumptions we made. This isn’t the case with QGIS and ArcMap (I pick on these two only because they are the most popular). An added bonus as an R user is that the sf package for geographic data is specifically designed to work within the tidyverse framework. This means spatial data works with things like dplyr data verbs and now there’s even a geom_sf() function in ggplot2!

R-Spatial Ecosystem

There are a variety of packages available for spatial data analysis specifically.

  • sp: Classes and Methods for Spatial Data
  • sf: Simple Features for R (builds on sp)
  • spdep: Spatial Dependence: Weighting Schemes, Statistics, and Models
  • lwgeom: Binding to the liblwgeom library

The newest, and in my opinion the most helpful to be comfortable with, is sf.

A good set of tutorials is available on the (a bit outdated) r-spatial site.

Along with data analysis, there is also a great set of packages for data visualization! While sf dataframes work with plot(), these packages offer a lot more flexibility for users.

  • ggplot2: An implementation of the Grammar of Graphics in R (as of version 3, I believe)
  • mapview: Interactive viewing of spatial data in R
  • leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library (great with shiny apps!)
  • tmap: R package for thematic maps

If you’re interested in keeping up with the latest r-spatial developments, consider following people like Jakub Nowosad, Robin Lovelace, Edzer Pebesma, Jannes Muenchow, Mark Padgham, Angela Li, Michael Sumner, Leah Wasser, Zev Ross, Kyle Walker, Roger Bivand, Jesse Sadler, and many, many more!

Structure of sf data

Image taken from Geocomputation with R

Simple Features is a hierarchical data model that represents a wide range of geometry types. Essentially, simple features dataframes have each spatial observation in one row (a census tract, a point, a state, …) with a list-column of the coordinates to draw that shape.

The best way to learn about spatial data is to use it! We’ll work through an example of how to map honeybee permits in Minneapolis.


If projections are new to you, the Geocomp in R chapter on projections is a great resource! In short, projections help us translate data from the 3D surface of the earth to a 2D surface on a map. Different projections make the geography look very different!

Projections are fundamental and really interesting, and they’ve even enjoyed a brief spotlight in the West Wing!

Mapping the honeybees

Go to Open Minneapolis to get the shapefile (most common spatial data filetype). You can directly download the shapefile by typing into your browser or you can navigate through the data portal.

Then, unzip it into a subfolder of the folder your RMarkdown is in called shp.

  - practice.rmd (current rmarkdown/script)
    -shapefile (collection of a few files)

You can then read it into R using st_read()

honeybee <- st_read("shp/Honey_Bee_Permits_2017.shp")
## Reading layer `Honey_Bee_Permits_2017' from data source `C:\Users\katie\Documents\r_ladies\spatial_data\shp\Honey_Bee_Permits_2017.shp' using driver `ESRI Shapefile'
## Simple feature collection with 90 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -93.32817 ymin: 44.89744 xmax: -93.20644 ymax: 45.03776
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

This output tells us the dimensions of the data, the bounding box, geometry type, and projection. These are all unique to spatial data. The epsg code refers to the projection used for this data.

There are a number of functions that can help us better understand the data we are working with. I’ll run through a few important ones.

str(honeybee$geometry) # what geometry type is your data
## sfc_POINT of length 90; first list element:  'XY' num [1:2] -93.3 45
st_crs(honeybee) # what coordinate reference system is your data in? important!! This is WGS 84. we'll change that later.
## Coordinate Reference System:
##   EPSG: 4326 
##   proj4string: "+proj=longlat +datum=WGS84 +no_defs"
sum(st_is_valid(honeybee) == TRUE) # how many of our units are valid? no self intersections, etc. in this case they're all valid! 
## [1] 90

We want our data to be projected in UTM 15N (Universal Transverse Mercator Zone 15 North), an appropriate projection for Minnesota. UTM zones are highly accurate within a specific zone, so they’re generally a good projection for something “not too big.”

honeybee_utm <- st_transform(honeybee, 26915) 
# use epsg codes to specify a projection

If we want to see where the honeybee hives in Minneapolis are, the easiest thing to do is put it on a map! We’ll use ggplot.

Static Map

First, we’ll also want to have some sort of background map. We can using the neighborhoods shapefile from Open Minneapolis.

neighborhoods <- st_read("") %>%
## Reading layer `7f88316841ce471faa33c89035fb69e8_0' from data source `' using driver `GeoJSON'
## Simple feature collection with 87 features and 10 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -93.32911 ymin: 44.89059 xmax: -93.19433 ymax: 45.05125
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

The ggplot code looks a little different than the usual plot. We often don’t need to add anything to aes() if we are just plotting to see the general data. Because of the way simple features are stored, there aren’t logical x & y values. I’ll also show an example with aes(), though.

## Registering fonts with R
ggplot() +
  geom_sf(data = neighborhoods, fill = "#e8eff7", color = "#8f98aa") +
  geom_sf(data = honeybee_utm, color = "#efcf2f") +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank(),
        text = element_text(family = "Century Gothic")) +
  ggtitle("Honeybee Permits in Minneapolis")

So now we have a pretty good map!

We can also show where the different types of hives are in the city.

ggplot() +
  geom_sf(data = neighborhoods, fill = "#e8eff7", color = "#8f98aa") +
  geom_sf(data = honeybee_utm, aes(color = HiveType)) +
  scale_color_manual(values = c("#37a347", "#50a2e0")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank(),
        text = element_text(family = "Century Gothic")) +
  ggtitle("Honeybee Permits in Minneapolis")

What if we want to know a little more about this distribution, though?

One thing we might want to see is the number of honeybee permits per neighborhood.

Spatial joins

In order to get data per neighborhood, we’ll start with a spatial left join. It’s similar to asking the question “which observations from y are in each observation from x”

nb_join <- st_join(neighborhoods, honeybee_utm)

## Simple feature collection with 6 features and 13 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 474543.4 ymin: 4975245 xmax: 484182.5 ymax: 4987884
## epsg (SRID):    26915
## proj4string:    +proj=utm +zone=15 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## 1            1 -2144139700  REFNO 23103.00   WARDAREA
## 2            2           0  REFNO 23108.00   WARDAREA
## 3            3 -2144133700  REFNO 23163.00   WARDAREA
## 3.1          3 -2144133700  REFNO 23163.00   WARDAREA
## 3.2          3 -2144133700  REFNO 23163.00   WARDAREA
## 3.3          3 -2144133700  REFNO 23163.00   WARDAREA
##                       BDNAME BDNUM TEXT_NBR  Shape__Area Shape__Length
## 1                    Victory     3       03 2.022355e-04    0.06872054
## 2   Humboldt Industrial Area     8       08 8.106073e-05    0.04906020
## 3                       Howe    63       63 3.139919e-04    0.08252545
## 3.1                     Howe    63       63 3.139919e-04    0.08252545
## 3.2                     Howe    63       63 3.139919e-04    0.08252545
## 3.3                     Howe    63       63 3.139919e-04    0.08252545
##     OBJECTID.y         Address HiveType                       geometry
## 1           68   4525 OSSEO RD   GROUND POLYGON ((474846.3 4987668,...
## 2           NA            <NA>     <NA> POLYGON ((474846 4987686, 4...
## 3           45 3407 34TH AVE S   GROUND POLYGON ((484138.7 4976477,...
## 3.1         47 3433 45TH AVE S   GROUND POLYGON ((484138.7 4976477,...
## 3.2         50 3533 36TH AVE S   GROUND POLYGON ((484138.7 4976477,...
## 3.3         60 3824 47TH AVE S  ROOFTOP POLYGON ((484138.7 4976477,...

This gives us one observation for each overlap, which is why there are more rows than in our original data. We can use dplyr data verbs to calculate the value per neighborhood.

nb <- nb_join %>%
  group_by(BDNAME) %>%
  summarise(n_permits = n()) # total permits per neighborhood

Let’s quickly see this distribution

ggplot(nb, aes(x = n_permits)) +
  geom_histogram(binwidth = 1, color = "white") +
  theme_minimal() +
  ggtitle("Permits per neighborhood") +
  theme(text = element_text(family = "Century Gothic")) +
  labs(x = "Number of permits")

Most neighborhoods have only 1 permit. The maximum number of permits in a neighborhood is 4.

Let’s map this data! We’ll fill in the aes() this time.

ggplot(nb) +
  geom_sf(aes(fill = n_permits), color = "#8f98aa") +
  scale_fill_gradient(low = "#f5f7d9", high = "#eadd27", guide = guide_legend(title = "Permits")) +
  theme_minimal() +
  theme(panel.grid.major = element_line("transparent"),
        axis.text = element_blank(),
        text = element_text(family = "Century Gothic")) +
  ggtitle("Honeybee Permits per \nNeighborhood in Minneapolis")


One thing you’ll often come across in spatial data analysis is the idea of spatial autocorrelation. Positive spatial autocorrelation means that similar values tend to occur near each other and negative spatial autocorrelation means that dissimilar values tend to occur near each other. If you’re modeling spatial data, this needs to be accounted for so that you don’t get clustered errors.

Geographers have a few ways of caluclating this, but Moran’s I is the most popular.

The easiest way to calculate this uses spdep instead of sf.


neighborhoods_sp <- as(nb, "Spatial") # convert our data to sp data

nb_obj <- poly2nb(neighborhoods_sp) # create the neighborhood object

## Neighbour list object:
## Number of regions: 87 
## Number of nonzero links: 492 
## Percentage nonzero weights: 6.500198 
## Average number of links: 5.655172 
## Link number distribution:
##  2  3  4  5  6  7  8  9 
##  2  6 15 18 19 14  9  4 
## 2 least connected regions:
## 70 80 with 2 links
## 4 most connected regions:
## 41 62 76 78 with 9 links

We can see here that the average number of neighbors for a neighborhood is 5.7 and the max number of neighbors is 9 (4 neighborhoods).

weights <- nb2listw(nb_obj, style = "B") # create a matrix of binary spatial weights (connected or not connected)
moran(neighborhoods_sp$n_permits, weights, n=length(weights$neighbours), S0=Szero(weights))
## $I
## [1] 0.08526573
## $K
## [1] 5.602669

Our Moran’s I statistic is 0.085. That means we see slight positive autocorrelation, but it’s not extreme. We might also want some measure of statistical significance for this.

The preferred method for many people is to use Monte Carlo simulation. It randomly assigns values to the polygons and calculates Moran’s I for each iteration. It then compares the observed statistic to the generated distribution.

set.seed(123)$n_permits, weights, nsim=9999) # 10000 simulations
##  Monte-Carlo simulation of Moran I
## data:  neighborhoods_sp$n_permits 
## weights: weights  
## number of simulations + 1: 10000 
## statistic = 0.085266, observed rank = 9363, p-value = 0.0637
## alternative hypothesis: greater

Our observed statistic is 0.085 with a p-value of 0.0637 so we would say our distribution is not different from randomly distributed.

So, we now know that honeybee permits are randomly distributed in Minneapolis at the neighborhood level. It’s important to note this spatial unit, clustering calculations can be very different at different units!

Resources for more information and practice