Section 6 Reading in Data
I am a strong believer in the cake-first approach to teaching/learning R. It emphasizes real-world examples, interesting data, and visual feedback. Because of that, I like to use ready-made data packages like fivethirtyeight and talk about visualization before data cleaning.
I also think reading in data is an important skill, so we will talk about it briefly at the end of today without spending too much time on it. For now, let’s eat the cake instead of going out to get the ingredients. That’s more fun anyway.
6.1 Data packages
There are a number of packages in R designed specifically to make data sharing easier. A few examples are:
fivethirtyeight, to share the data used in their articles
bikedata, to share data about certain bikeshare systems
ecoengine, to share data from the Berkeley natural history museum
We will use data from the MN Nice Ride system. I put the data in a package called metcouncilR so it’s easy to use. You can install it by typing install_github("katiejolly/metcouncilR") in your console (install_github() comes from the devtools package, so load that first).
6.1.1 Try it out
In your R Markdown file, fill in the following code to load the library:
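The chunk for this step is not reproduced here. A minimal sketch of what it might look like, assuming the package was installed as described above; `pkg_available` is just an illustrative variable name:

```r
# One-time setup, run in the console rather than in the R Markdown file.
# install_github() comes from the devtools package:
# devtools::install_github("katiejolly/metcouncilR")

# Check whether the package is installed before loading it, so the chunk
# prints a readable reminder instead of stopping with an error.
pkg_available <- requireNamespace("metcouncilR", quietly = TRUE)

if (pkg_available) {
  library(metcouncilR)
} else {
  message("metcouncilR is not installed yet; run the install step first.")
}
```

Once the package is installed, a plain library(metcouncilR) line is enough; the check is only there so the chunk still runs on a fresh machine.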
To pull a particular dataset from this package, we can use the data() function.
You should now see it in your global working environment.
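As a rough illustration of how data() behaves, the sketch below uses mtcars, a dataset that ships with R, as a stand-in. The name of the bike dataset inside metcouncilR is not shown in this section, so it is not assumed here.

```r
# mtcars comes with R's built-in datasets package. It is only a stand-in
# for the bike data, whose dataset name is not assumed in this sketch.
data("mtcars")

# After the call, the dataset can be used by name like any other object.
class(mtcars)

# To list the datasets that an installed package provides, you could run:
# data(package = "metcouncilR")
```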
I’ve written documentation for this data that you can see in the help pane in RStudio.
There are also a few different ways to get quick summaries of the data.
First, you can check the dimensions to get the number of rows and columns.
## [1] 409002 22
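Output like the line above comes from the dim() function. Here is a small sketch on a toy data frame; `trips` is a made-up name, and its values are copied from the first six rows of the bike data printed in this section:

```r
# A toy data frame standing in for the bike data (two of its columns).
trips <- data.frame(
  tripduration     = c(1373, 1730, 547, 856, 455, 1557),
  start_station_id = c(170, 2, 13, 94, 13, 43)
)

# dim() returns two numbers: the number of rows first, then columns.
dim(trips)
#> [1] 6 2
```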
What function would you use to get just the number of columns? (Google it.)
We can also print the first 6 rows of the data with the head() function.
## # A tibble: 6 x 22
##   tripduration start_datetime      end_datetime        start_station_id
##          <dbl> <dttm>              <dttm>                         <dbl>
## 1         1373 2018-04-24 16:03:04 2018-04-24 16:25:57              170
## 2         1730 2018-04-24 16:38:40 2018-04-24 17:07:31                2
## 3          547 2018-04-24 17:51:10 2018-04-24 18:00:17               13
## 4          856 2018-04-24 18:50:05 2018-04-24 19:04:22               94
## 5          455 2018-04-25 08:49:05 2018-04-25 08:56:40               13
## 6         1557 2018-04-27 11:57:03 2018-04-27 12:23:01               43
## # ... with 18 more variables: start_station_name <chr>,
## # start_station_latitude <dbl>, start_station_longitude <dbl>,
## # end_station_id <dbl>, end_station_name <chr>,
## # end_station_latitude <dbl>, end_station_longitude <dbl>, bikeid <dbl>,
## # usertype <chr>, birth_year <dbl>, gender <dbl>, bike_type <chr>,
## # start_month <dbl>, start_day <ord>, end_month <dbl>, end_day <ord>,
## # start_hour <int>, end_hour <int>
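A sketch of the kind of call that produces a preview like the one above, again on a toy data frame. The name `trips` is made up, and the last two trip durations are invented so that the toy data has more than six rows:

```r
# Toy stand-in for the bike data: the first six values match the preview
# above, the last two are invented for illustration.
trips <- data.frame(
  tripduration = c(1373, 1730, 547, 856, 455, 1557, 620, 980)
)

# With no other arguments, head() keeps only the first six rows.
first_rows <- head(trips)
nrow(first_rows)
#> [1] 6
```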
How can we modify this code to print the first 10 rows instead? (Hint: run help(head) to see the documentation.)
We can also just get a summary of each variable.
##   tripduration   start_datetime
##  Min.   :   61   Min.   :2018-04-12 08:49:49
##  1st Qu.:  432   1st Qu.:2018-06-08 15:02:34
##  Median :  800   Median :2018-07-18 08:40:36
##  Mean   : 1275   Mean   :2018-07-20 03:50:06
##  3rd Qu.: 1524   3rd Qu.:2018-08-27 20:09:25
##  Max.   :17992   Max.   :2018-11-17 23:13:49
##
##   end_datetime                 start_station_id start_station_name
##  Min.   :2018-04-12 09:31:20   Min.   :  2.0    Length:409002
##  1st Qu.:2018-06-08 15:34:45   1st Qu.: 37.0    Class :character
##  Median :2018-07-18 08:54:47   Median : 94.0    Mode  :character
##  Mean   :2018-07-20 04:11:22   Mean   :103.4
##  3rd Qu.:2018-08-27 20:37:36   3rd Qu.:171.0
##  Max.   :2018-11-17 23:19:36   Max.   :226.0
##                                NA's   :13251
##  start_station_latitude start_station_longitude end_station_id
##  Min.   :44.89          Min.   :-93.33          Min.   :  2.0
##  1st Qu.:44.96          1st Qu.:-93.27          1st Qu.: 38.0
##  Median :44.97          Median :-93.26          Median : 95.0
##  Mean   :44.97          Mean   :-93.25          Mean   :103.6
##  3rd Qu.:44.98          3rd Qu.:-93.23          3rd Qu.:170.0
##  Max.   :45.04          Max.   :-93.08          Max.   :226.0
##                                                 NA's   :13251
##  end_station_name   end_station_latitude end_station_longitude
##  Length:409002      Min.   :44.89        Min.   :-93.35
##  Class :character   1st Qu.:44.96        1st Qu.:-93.27
##  Mode  :character   Median :44.97        Median :-93.26
##                     Mean   :44.97        Mean   :-93.25
##                     3rd Qu.:44.98        3rd Qu.:-93.23
##                     Max.   :45.04        Max.   :-93.08
##
##      bikeid       usertype           birth_year       gender
##  Min.   :   2   Length:409002      Min.   :1911   Min.   :0.0000
##  1st Qu.: 530   Class :character   1st Qu.:1969   1st Qu.:0.0000
##  Median :1056   Mode  :character   Median :1969   Median :1.0000
##  Mean   :1092                      Mean   :1976   Mean   :0.7097
##  3rd Qu.:1627                      3rd Qu.:1986   3rd Qu.:1.0000
##  Max.   :3341                      Max.   :2000   Max.   :2.0000
##
##   bike_type          start_month     start_day     end_month
##  Length:409002      Min.   : 4.000   Sun:62998   Min.   : 4.000
##  Class :character   1st Qu.: 6.000   Mon:52558   1st Qu.: 6.000
##  Mode  :character   Median : 7.000   Tue:51657   Median : 7.000
##                     Mean   : 7.102   Wed:56859   Mean   : 7.102
##                     3rd Qu.: 8.000   Thu:57657   3rd Qu.: 8.000
##                     Max.   :11.000   Fri:60495   Max.   :11.000
##                                      Sat:66778
##  end_day       start_hour       end_hour
##  Sun:63368   Min.   : 0.00   Min.   : 0.00
##  Mon:52638   1st Qu.:11.00   1st Qu.:11.00
##  Tue:51585   Median :15.00   Median :15.00
##  Wed:56843   Mean   :14.26   Mean   :14.46
##  Thu:57670   3rd Qu.:18.00   3rd Qu.:18.00
##  Fri:60193   Max.   :23.00   Max.   :23.00
##  Sat:66705
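A summary like the one above comes from the summary() function. The sketch below runs it on a toy data frame with one numeric and one character column. The name `trips` and the `usertype` values are made up for illustration; the trip durations are copied from the preview earlier in this section.

```r
# Toy stand-in for the bike data with one numeric and one character column.
trips <- data.frame(
  tripduration = c(1373, 1730, 547, 856, 455, 1557),
  usertype     = c("Customer", "Subscriber", "Customer",
                   "Customer", "Subscriber", "Customer"),
  stringsAsFactors = FALSE
)

# summary() reports quartiles and the mean for numeric columns, and only
# the length, class, and mode for character columns.
trip_summary <- summary(trips)
trip_summary
```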
But you’ll notice these aren’t that meaningful for the character variables. Another way we can extract information about a variable is to use the $ operator. To just pull out one variable from a dataset, you would write data$variable. We can use this syntax to make a table of the user types.
##
##   Customer Subscriber
##     287709     121293
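The $ and table() combination described above can be sketched on a toy data frame. The name `trips` and the six `usertype` values are made up, so the counts below do not match the real data:

```r
# Toy stand-in for the bike data; the usertype values are invented.
trips <- data.frame(
  tripduration = c(1373, 1730, 547, 856, 455, 1557),
  usertype     = c("Customer", "Subscriber", "Customer",
                   "Customer", "Subscriber", "Customer"),
  stringsAsFactors = FALSE
)

# data$variable pulls out one column as a vector.
user_types <- trips$usertype

# table() counts how many times each distinct value appears.
type_counts <- table(user_types)
type_counts
# In this toy data: Customer appears 4 times, Subscriber 2 times.
```

The same pattern works for any categorical column: put the column after the $ and wrap it in table().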
6.1.2 Practice
How many of the users were female?
What was the longest trip duration?
Looking at the documentation, why might an end station name be empty?
Looking at the documentation, what is the unit of the tripduration variable?
What kinds of trips are excluded from this data?