Section 6 Reading in Data

I am a strong believer in the cake-first approach to teaching/learning R. It emphasizes real-world examples, interesting data, and visual feedback. Because of that, I like to use ready-made data packages like fivethirtyeight and talk about visualization before data cleaning.

Reading in data is still an important skill, so we will cover it briefly at the end of today without spending too much time on it. For now, let’s eat the cake instead of going out to get the ingredients. That’s more fun anyway.

6.1 Data packages

There are a number of R packages built specifically to make data sharing easier. A few examples are:

  • fivethirtyeight to share the data used in FiveThirtyEight articles

  • bikedata to share data about certain bikeshare systems

  • ecoengine to share data from the Berkeley natural history museum

We will use data from the MN Nice Ride system. I put the data in a package called metcouncilR so it’s easy to use. You can install it by typing install_github("katiejolly/metcouncilR") in your console; install_github() comes from the devtools package, so load that first.
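
If you would rather not load all of devtools, the same function is also available from the smaller remotes package. A minimal sketch of the full install, assuming you have an internet connection and are fine with installing remotes from CRAN:

```r
# install_github() is provided by remotes (devtools re-exports it).
# Install remotes from CRAN only if it is not already available.
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Install the data package from GitHub.
remotes::install_github("katiejolly/metcouncilR")
```

You only need to do this once per computer, so run it in the console rather than putting it in your R Markdown file.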

6.1.1 Try it out

In your R Markdown document, fill in the following code to load the library:
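
Filled in, the chunk is probably a single line. This sketch assumes the package installed without errors:

```r
# Attach the data package so its datasets and help pages are available.
library(metcouncilR)
```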

To pull a particular dataset from this package, we can use the data() function.
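
A sketch of what that looks like. The dataset name `niceride_2018` is only a placeholder, not a confirmed name, so list what the package ships with first and use the name you see there:

```r
# List the datasets bundled with the package, with a short title for each.
data(package = "metcouncilR")

# Load one of them by name. `niceride_2018` is a placeholder:
# replace it with a name from the listing above.
data(niceride_2018, package = "metcouncilR")
```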

You should now see it in your global working environment.

I’ve written documentation for this data that you can see in the help pane in RStudio.
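
Two ways to get there from the console, as a sketch. Again, `niceride_2018` is a placeholder for whatever the dataset is actually called:

```r
# Index of every help page in the package.
help(package = "metcouncilR")

# Help page for one dataset (placeholder name, swap in the real one).
help("niceride_2018", package = "metcouncilR")
```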

There are also a few different ways to get quick summaries of the data.

First, you can check the dimensions with the dim() function to get the number of rows and columns.
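
Here is a small sketch on a stand-in data frame, so you can try it before the real data is loaded. The name `trips` is a placeholder, and the values are two columns from the six rows printed later in this section:

```r
# Stand-in for the ride data: two columns, six rows.
# `trips` is a placeholder name; use your dataset's name instead.
trips <- data.frame(
  tripduration     = c(1373, 1730, 547, 856, 455, 1557),
  start_station_id = c(170, 2, 13, 94, 13, 43)
)

# dim() returns the number of rows first, then the number of columns.
dim(trips)
```

On the full dataset, the same call gives: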

## [1] 409002     22

What function would you use to get just the number of columns? (Google it.)

We can also print the first 6 rows of the data with the head() function.
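
A quick sketch on a stand-in data frame with eight rows, so there is something for head() to cut off. The name `trips`, the `trip_id` column, and the last two durations are made up for illustration:

```r
# Stand-in with eight rows. The last two durations are invented.
trips <- data.frame(
  trip_id      = 1:8,
  tripduration = c(1373, 1730, 547, 856, 455, 1557, 600, 900)
)

# With no other arguments, head() keeps the first six rows.
head(trips)
```

On the full dataset, the printout looks like this: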

## # A tibble: 6 x 22
##   tripduration start_datetime      end_datetime        start_station_id
##          <dbl> <dttm>              <dttm>                         <dbl>
## 1         1373 2018-04-24 16:03:04 2018-04-24 16:25:57              170
## 2         1730 2018-04-24 16:38:40 2018-04-24 17:07:31                2
## 3          547 2018-04-24 17:51:10 2018-04-24 18:00:17               13
## 4          856 2018-04-24 18:50:05 2018-04-24 19:04:22               94
## 5          455 2018-04-25 08:49:05 2018-04-25 08:56:40               13
## 6         1557 2018-04-27 11:57:03 2018-04-27 12:23:01               43
## # ... with 18 more variables: start_station_name <chr>,
## #   start_station_latitude <dbl>, start_station_longitude <dbl>,
## #   end_station_id <dbl>, end_station_name <chr>,
## #   end_station_latitude <dbl>, end_station_longitude <dbl>, bikeid <dbl>,
## #   usertype <chr>, birth_year <dbl>, gender <dbl>, bike_type <chr>,
## #   start_month <dbl>, start_day <ord>, end_month <dbl>, end_day <ord>,
## #   start_hour <int>, end_hour <int>

How can we modify this code to print the first 10 rows instead? (Hint: run help(head) to see the documentation.)

We can also get a summary of each variable with the summary() function.
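
A sketch on a stand-in data frame with one numeric and one character column, to show how differently summary() treats the two. The name `trips` is a placeholder and the `usertype` values are invented:

```r
# Stand-in data: durations are taken from the rows printed earlier,
# the usertype values are made up for illustration.
trips <- data.frame(
  tripduration = c(1373, 1730, 547, 856, 455, 1557),
  usertype     = c("Customer", "Subscriber", "Customer",
                   "Customer", "Subscriber", "Customer"),
  stringsAsFactors = FALSE
)

# Numeric columns get min, quartiles, mean and max;
# character columns only get length, class and mode.
summary(trips)
```

On the full dataset, the output is much longer: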

##   tripduration   start_datetime               
##  Min.   :   61   Min.   :2018-04-12 08:49:49  
##  1st Qu.:  432   1st Qu.:2018-06-08 15:02:34  
##  Median :  800   Median :2018-07-18 08:40:36  
##  Mean   : 1275   Mean   :2018-07-20 03:50:06  
##  3rd Qu.: 1524   3rd Qu.:2018-08-27 20:09:25  
##  Max.   :17992   Max.   :2018-11-17 23:13:49  
##                                               
##   end_datetime                 start_station_id start_station_name
##  Min.   :2018-04-12 09:31:20   Min.   :  2.0    Length:409002     
##  1st Qu.:2018-06-08 15:34:45   1st Qu.: 37.0    Class :character  
##  Median :2018-07-18 08:54:47   Median : 94.0    Mode  :character  
##  Mean   :2018-07-20 04:11:22   Mean   :103.4                      
##  3rd Qu.:2018-08-27 20:37:36   3rd Qu.:171.0                      
##  Max.   :2018-11-17 23:19:36   Max.   :226.0                      
##                                NA's   :13251                      
##  start_station_latitude start_station_longitude end_station_id 
##  Min.   :44.89          Min.   :-93.33          Min.   :  2.0  
##  1st Qu.:44.96          1st Qu.:-93.27          1st Qu.: 38.0  
##  Median :44.97          Median :-93.26          Median : 95.0  
##  Mean   :44.97          Mean   :-93.25          Mean   :103.6  
##  3rd Qu.:44.98          3rd Qu.:-93.23          3rd Qu.:170.0  
##  Max.   :45.04          Max.   :-93.08          Max.   :226.0  
##                                                 NA's   :13251  
##  end_station_name   end_station_latitude end_station_longitude
##  Length:409002      Min.   :44.89        Min.   :-93.35       
##  Class :character   1st Qu.:44.96        1st Qu.:-93.27       
##  Mode  :character   Median :44.97        Median :-93.26       
##                     Mean   :44.97        Mean   :-93.25       
##                     3rd Qu.:44.98        3rd Qu.:-93.23       
##                     Max.   :45.04        Max.   :-93.08       
##                                                               
##      bikeid       usertype           birth_year       gender      
##  Min.   :   2   Length:409002      Min.   :1911   Min.   :0.0000  
##  1st Qu.: 530   Class :character   1st Qu.:1969   1st Qu.:0.0000  
##  Median :1056   Mode  :character   Median :1969   Median :1.0000  
##  Mean   :1092                      Mean   :1976   Mean   :0.7097  
##  3rd Qu.:1627                      3rd Qu.:1986   3rd Qu.:1.0000  
##  Max.   :3341                      Max.   :2000   Max.   :2.0000  
##                                                                   
##   bike_type          start_month     start_day     end_month     
##  Length:409002      Min.   : 4.000   Sun:62998   Min.   : 4.000  
##  Class :character   1st Qu.: 6.000   Mon:52558   1st Qu.: 6.000  
##  Mode  :character   Median : 7.000   Tue:51657   Median : 7.000  
##                     Mean   : 7.102   Wed:56859   Mean   : 7.102  
##                     3rd Qu.: 8.000   Thu:57657   3rd Qu.: 8.000  
##                     Max.   :11.000   Fri:60495   Max.   :11.000  
##                                      Sat:66778                   
##  end_day       start_hour       end_hour    
##  Sun:63368   Min.   : 0.00   Min.   : 0.00  
##  Mon:52638   1st Qu.:11.00   1st Qu.:11.00  
##  Tue:51585   Median :15.00   Median :15.00  
##  Wed:56843   Mean   :14.26   Mean   :14.46  
##  Thu:57670   3rd Qu.:18.00   3rd Qu.:18.00  
##  Fri:60193   Max.   :23.00   Max.   :23.00  
##  Sat:66705

But you’ll notice these aren’t that meaningful for the character variables, which only report a length, class, and mode. Another way we can extract information about a variable is to use the $ operator. To pull out just one variable from a dataset, you would write data$variable. We can combine this syntax with the table() function to count the user types.
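
A sketch of the two steps on a stand-in data frame. The name `trips` is a placeholder and the six `usertype` values are invented:

```r
# Stand-in data with made-up user types.
trips <- data.frame(
  usertype = c("Customer", "Subscriber", "Customer",
               "Customer", "Subscriber", "Customer"),
  stringsAsFactors = FALSE
)

# $ pulls one column out of the data frame as a plain vector.
trips$usertype

# table() counts how many times each distinct value appears.
table(trips$usertype)
```

On the full dataset, the counts are: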

## 
##   Customer Subscriber 
##     287709     121293

6.1.2 Practice

  1. How many of the users were female?

  2. What was the longest trip duration?

  3. Looking at the documentation, why might an end station name be empty?

  4. Looking at the documentation, what is the unit of the tripduration variable?

  5. What kinds of trips are excluded from this data?