Introduction to Shapefiles in R

Shapefiles are a popular method of representing geospatial vector data in GIS software. This file format was developed by ESRI (a notable GIS Software Company). The shapefile format can store data in the form of points, lines, or polygons. Shapefiles are used to represent adminsistrative boundaries (think country/state/county borders), roads, rivers, lakes, etc.

The raster package in R makes it very easy to download various administrative boundaries for any country. You just have to know the three letter ISO Code for the country of interest (you can look this up on Google). Below we demonstrate how to load the raster library, and get the province level administrative boundaries for France.

library(raster)                                  ##Load the Raster Library
france<-getData('GADM', country='FRA', level=1)  ##Get the Province Shapefile for France
plot(france)                                     ##Plot this shapefile

plot of chunk unnamed-chunk-2

Note that level 1 detail give the boundaries one level below the national boundaries. Actually, level 0 gives only national boundaries, level 1 gives the equivalent of province level boundaries, and each susequent level gives finer and finer granularity. In the United States, level 0 give our national boundary, level 1 gives state boundaries, and level 2 adds county boundaries. You can see level 0 through level 3 for France in the following plot.

alt text

Below we will retrieve and plot the county level shapefile for the United States. Note that when we plot this, that the continental United States is extremely small since it also plots Alaska, Hawaii, and some other islands that belong to the US.

us<-getData('GADM', country='USA', level=2)  #Get the County Shapefile for the US

alt text

Now let’s explore this data a little further. If we check the class of the us object (by typing class(us)), we will see that it is a SpatialPolygonsDataFrame. This means that it is multiple polygons that are held together in a data frame. We can access each of these polygons in a similar way that we access elements of a normal data frame. Now let’s look at the names of this data frame:

##  [1] "PID"       "ID_0"      "ISO"       "NAME_0"    "ID_1"     
##  [6] "NAME_1"    "ID_2"      "NAME_2"    "NL_NAME_2" "VARNAME_2"
## [11] "TYPE_2"    "ENGTYPE_2"

Notice there are various types of names in this data frame. Name 0 represents national boundaries, Name 1 represents state names, and Name 2 represents county names. We can see this further if we plot all of the unique NAME_1:

##  [1] "Georgia"              "Alabama"              "Alaska"              
##  [4] "Arizona"              "Arkansas"             "California"          
##  [7] "Colorado"             "Connecticut"          "Delaware"            
## [10] "District of Columbia" "Florida"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "South Dakota"        
## [25] "Tennessee"            "Texas"                "West Virginia"       
## [28] "Minnesota"            "Mississippi"          "Missouri"            
## [31] "Montana"              "Nebraska"             "Nevada"              
## [34] "New Hampshire"        "New Jersey"           "New Mexico"          
## [37] "New York"             "North Carolina"       "North Dakota"        
## [40] "Ohio"                 "Oklahoma"             "Oregon"              
## [43] "Pennsylvania"         "Rhode Island"         "South Carolina"      
## [46] "Utah"                 "Vermont"              "Virginia"            
## [49] "Washington"           "Wisconsin"            "Wyoming"

Below we demonstrate how to subset a SpatialPolygonsDataFrame. Note here that if we wanted to only plot the Northwestern United States, we could use the following code to create a subset that only contains Washington, Oregon, and Idaho.

 northwest<-subset(us,NAME_1=="Washington" | NAME_1=="Oregon" | NAME_1=="Idaho")

plot of chunk unnamed-chunk-8

Now let’s create a subset that only contains Texas.


plot of chunk unnamed-chunk-9

Now we’ll demonstrate how to plot point data on top of a shapefile. If you start by plotting the shapefile, you only have to call the points command in order to plot the point data on top of the shapefile.

library(ggmap)   ##load the ggmap package so we can access the crime data
data(crime)      ##load the crime data
plot(texas)      ##plot texas
points(crime$lon,crime$lat,col="red",pch=16)  ##Add the crime data to the map

plot of chunk unnamed-chunk-10

Here we can see clearly that some of our points are a long way from Houston.

Choropleth Maps (Heat Maps)

Another great way to show spatial data is with a choropleth map (sometimes called a heat map). A choropleth map shades a SpatialPolygonDataFrame by some measurement of spatial data. One very useful way to shade a map is by the density of spatial points found in each polygon. For example, we could shade the map of Texas with the density of crime data in the crime data set. Before we do this, let me first give you a mapping function that I created for this purpose back in 2011. I am not going to explain all of the computations that are performed in this code, but you will need to copy and paste this code into your console in order for this function to be available to you (you can also source this from an R script file if that is easier). You will need to load the sp and RColorBrewer packages before using this code.

heatMap <-function(data,shape=NULL,col="blue",main="Sample HeatMap"){
  # Plots a Heat Map of a Polygons Data Frame.  This will 
    # demonstrate density within a finite set of polygons
    # Args:
    #   data:   Spatial Points dataframe
    #   shape:  Polygons Data Frame 
    #   Notes:  This function requires the sp and RColorBrewer
    #           Packages
    #   Beskow: 03/28/11   
    is.installed <- function(mypkg) is.element(mypkg, 
    if (is.installed(mypkg="sp")==FALSE)  {
        stop("sp package is not installed")}
    if (is.installed(mypkg="RColorBrewer")==FALSE)  {
        stop("RColorBrewer package is not installed")}
    if (!class(data)=="SpatialPointsDataFrame")  {
        stop("data argument is not SpatialPointsDataFrame")}

spdf<-SpatialPolygonsDataFrame(shape1, freq_table, match.ID = TRUE)

spplot(spdf,scales = list(draw = TRUE),
        col.regions=rw.colors(max(freq_table)), main=main)

Now that we have this function read into our Global Environment, we can use it. Before we use it, we need to create an SpatialPointsDataFrame with our crime data. This is done below (after loading the sp and RColorBrewer libraries).

crime.sp<-crime     ##create a new object that we will coerce to a SpatialPointsDataFrame
crime.sp<-crime.sp[!$lat),]   ##Get rid of all NA's in the location field
coordinates(crime.sp)<-c("lon","lat")       ##Assigning coordinates coerces this to a SpatialPointsDataFrame
proj4string(crime.sp)<-proj4string(texas)   ##Assigns the texas spatial datum to our new data frame

We can check our new data set by running the code:

class(crime.sp)  #Prints the class of an object
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"

This should show us that crime.sp is a SpatialPointsDataFrame. Now that we’ve checked this, we just have to run a simple command that will create our choropleth.

heatMap(crime.sp,texas,col="red",main="Houston Crime Data Denstiy")