I am working with the US census gazetteer data file (zcta5) which is publicly available. The version I am using has files named tl_2015_us_zcta510.shp, dbf... Plotting the file works fine.
The issue I am having seems to happen when I subset the SpatialPolygonsDataFrame with a larger number of polygons. When I use a small subset, the labels work fine.
The labels I need identify assigned groupings of postal codes an individual 5-digit polygon area belongs to. For example - for Ashtabula, OH postal codes I need all the postal codes to have a label in the middle of it that reads "503". I have labels for all the other Ohio postal code groupings - called "PostalGroupNumber" and in table form the data all checks out to be correct.
So I load libraries and read the full spatial data frame into memory:
library(sp)
library(maps)
library(mapdata)
library(maptools)
library(foreign)
#Load in the entire census gazetteer data file
zcta5=readShapeSpatial("~/R/PostalCodes/USA/US Postal Codes/ZCTA5/tl_2015_us_zcta510.shp")
Next: create vector of Ashtabula, OH postal codes:
ashtab.zips <- c("44003","44004","44005","44010","44030","44032","44041","44047","44048","44068","44076","44082","44084","44085","44088","44093","44099")
Next - subset zcta5 Spatial Data Frame to include only these postal codes:
ashtab <- zcta5[which(zcta5@data$GEOID10 %in% ashtab.zips),]
Next - add labels to new ashtab spatial data frame and plot:
ashtab@data <- cbind(ashtab@data, "PostalGroupNumber"="503")
l1 = list("sp.text", coordinates(ashtab), as.character(ashtab@data$PostalGroupNumber),col="black", cex=0.7,font=2)
spplot(ashtab,zcol="GEOID10", sp.layout=list(l1)
,main=list(label="PostalGroupNumber 503 Postal Areas",cex=2,font=1)
)
This works and gives the following, correct plot of the postal areas of northeast Ohio with the correct labels in them:
Pretty good - BUT - the scale on the right looks like it retained a huge number of GEOID10 levels where I expected only the subset of the 17 in the ashtab.zips vector. Side Question (extra credit ;-)- why are those levels still there?
Now on to the main problem. Ohio postal codes all start with a 43... or a 44... - I have a csv file for just the 5-digit codes that are in Ohio, each with their assigned PostalGroupNumber which I read into a data frame, clean up and use to subset the main data frame like I did above:
oh <- read.csv("~/R/PostalCodes/OhioPostalGroupings/OH-PGAs-PostalCodes Only.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = c("character", "character", "character"))
oh$ZIP_CODE <- trimws(oh$ZIP_CODE)
ohzcta5 <- zcta5[which(zcta5@data$GEOID10 %in% oh$ZIP_CODE),]
l1 = list("sp.text", coordinates(ohzcta5), as.character(ohzcta5@data$GEOID10),col="black", cex=0.7,font=2)
spplot(ohzcta5,zcol="GEOID10", sp.layout=list(l1)
,main=list(label="Ohio Postal Code - PostalGroupNumbers",cex=2,font=1)
)
This time - just plot with labels of the GEOID10 value to see if it plots correctly and it does - hard to read here but zooming in shows correct postal codes in each polygon (this is not a great image but shape of OH is right and labels are correct...):
Now I need to add the PostalGroupNumber labels to the spatial data frame, and make a factor to color all the groups of postal codes together as the same color per group. So Ashtabula should all be one color and all have "503" labels in them - but they do not:
ohzcta5@data <- merge(ohzcta5@data, oh, by.x="GEOID10", by.y="ZIP_CODE", all.x=TRUE)
ohzcta5@data <- cbind(ohzcta5@data, "TAcolor"=as.factor(ohzcta5@data$PostalGroupNumber))
l1 = list("sp.text", coordinates(ohzcta5), as.character(ohzcta5@data$PostalGroupNumber),col="black", cex=0.7,font=2)
spplot(ohzcta5,zcol="GEOID10", sp.layout=list(l1)
,main=list(label="Ohio Postal Code - PostalGroupNumber",cex=2,font=1)
)
Which now looks like this:
A closer look at Ashtabula (northeast corner) now looks like this - What happened to the labels?:
The labels are all wrong - and yet when examining ohzcta5@data, the PostalGroupNumber is on the correct GEOID10 records.
Help!!!! Losing my mind.
Answers to two issues:
1) The issue of too many levels being retained in the spatial data frame and appearing on the spplot scale is resolved by calling the base R function droplevels() on each of the factors in the spatial data frame.
2) Don't use "merge" because it re-orders the data so it no longer aligns to the correct polygon. Instead use "match" as shown in this post https://stackoverflow.com/a/3652472/4017087 (Thanks Ramnath!)
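Both fixes can be seen on a toy attribute table without loading any shapefile. This is only a sketch in base R: the data frames `attrs` and `lookup` are made-up stand-ins for `ohzcta5@data` and `oh`, and the group numbers other than "503" are invented for illustration.

```r
# Toy stand-in for the attribute table of a subsetted spatial object.
# Its row order must stay aligned with the polygon order.
# The factor still carries a level ("43001") that the subset no longer uses.
attrs <- data.frame(
  GEOID10 = factor(c("44099", "44003", "44041"),
                   levels = c("43001", "44003", "44041", "44099"))
)

# Hypothetical lookup table (group numbers 504 and 505 are invented)
lookup <- data.frame(
  ZIP_CODE = c("44003", "44041", "44099"),
  PostalGroupNumber = c("503", "504", "505"),
  stringsAsFactors = FALSE
)

# merge() sorts on the key by default, so the rows no longer follow polygon order
merged <- merge(attrs, lookup, by.x = "GEOID10", by.y = "ZIP_CODE", all.x = TRUE)
merged_order <- as.character(merged$GEOID10)

# match() looks each row up in place, so the original row order is kept
attrs$PostalGroupNumber <-
  lookup$PostalGroupNumber[match(as.character(attrs$GEOID10), lookup$ZIP_CODE)]

# droplevels() discards factor levels that no longer occur after subsetting
attrs$GEOID10 <- droplevels(attrs$GEOID10)
```

Here `merged_order` comes back sorted by postal code, while `attrs` keeps its rows where they were, which is what the polygons need.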
Related
I realise this has been asked about 100 times prior, but none of the answers I've read so far on SO seem to fit my problem.
I have data. I have the lat and lon values. I've read around about something called sp and made a bunch of shape objects in a dataframe. I have matched this dataframe with the variable I am interested in mapping.
I cannot for the life of me figure out how the hell to get ggplot2 to draw polygons. Sometimes it wants explicit x,y values (which are a PART of the shape anyway, so seems redundant), or some other shape files externally which I don't actually have. Short of colouring it in with highlighters, I'm at a loss.
if I take an individual sps object (built with the following function after importing, cleaning, and wrangling a shitload of data)
createShape = function(sub){
#This function takes the list of lat/lng values and returns a SHAPE which should be plottable on ggmap/ggplot
tempData = as.data.frame(do.call(rbind, as.list(VICshapes[which(VICshapes$Suburb==sub),] %>% select(coords))[[1]][[1]]))
names(tempData) = c('lat', 'lng')
p = Polygon(tempData)
ps = Polygons(list(p),1)
sps = SpatialPolygons(list(ps))
return(sps)
}
These shapes are then stored in the same dataframe as my data - which only this afternoon for some reason, I can't even look at, as trying to look at it yields the following error.
head(plotdata)
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : first argument must be atomic
I realise I'm really annoyed at this now, but I've about 70% of a grade riding on this, and my university has nobody capable of assisting.
I have pasted the first few rows of data here - https://pastebin.com/vFqy5m5U - apparently you can't print data with an S4 object - the shape file that I'm trying to plot.
Anyway. I'm trying to plot each of those shapes onto a map. Polygons want an x,y value. I don't have ANY OTHER SHAPE FILES. I created them based on a giant list of lat and long values, and the code chunk above. I'm genuinely at a loss here and don't know what question to even ask. I have the variable of interest based on locality, and the shape for each locality. What am I missing?
edit: I've pasted the summary data (BEFORE making them into shapes) here. It's a massive list of lat/lng values for EACH tile/area, so it's pretty big...
Answered on gis.stackexchange.com (link not provided).
I am attempting to count the number of points within each LSOA area within London. I have attempted to use the over function although the output does not produce a count of the number of listings per LSOA
The code I have conducted so far is as follows
ldnLSOA <- readOGR(".", "LSOA_2011_London_gen_MHW")
LondonListings <- read.csv('Londonlistings.csv')
proj4string(LdnLSOA) <- proj4string(LondonListings)
plot(ldnLSOA)
plot(LondonListings, add =T)
LSOAcounts <- over(LondonListings, ldnLSOA)
This produces a table with no additional data than the original ldnLSOA shapefile.
I was wondering if someone knew how I would be able to get a table in the format:
LSOAname | LSOAcode | Count
or that sort of framework.
Example data:
LondonListings:
longitude | latitude
-0.204406 51.52060
-0.034617 51.45037
-0.221920 51.46449
-0.126562 51.47158
-0.188879 51.57068
-0.096917 51.49281
Shapefile:
https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london
I deleted my nonspecific answer and wrote another one using your data (except for the points... but it is not hard to replace this data, right?)
Let me know if it worked!
#I'm not sure which of these libs are used, since I always have all of them loaded here
library(rgeos)
library(rgdal)
library(sp)
#Load the shapefile
ldnLSOA <- readOGR(".", "LSOA_2011_London_gen_MHW")
plot(ldnLSOA)
#It's always good to take a look in the data associated to your map
ldn_data<-as.data.frame(ldnLSOA@data)
#Create some random point in this shapefile
ldn_points<-spsample(ldnLSOA,n=1000, type="random")
plot(ldnLSOA)
plot(ldn_points, pch=21, cex=0.5, col="red", add=TRUE)
#create an empty df with as many rows as polygons in the shapefile
df<-as.data.frame(matrix(ncol=3, nrow=length(ldnLSOA@data$LSOA11NM)))
colnames(df)<- c("LSOA_name","LSOA_code", "pt_Count")
df$LSOA_name<-ldn_data$LSOA11NM
df$LSOA_code<-ldn_data$LSOA11CD
# Over = at the spatial locations of object x,
# retrieves the indexes or attributes from spatial object y
pt.poly <- over(ldn_points,ldnLSOA)
# Now let's count
pt.count<-as.data.frame(table(pt.poly$LSOA11CD))
#The table is keyed by LSOA code, so match on the code to put the counts in the same order as the data frame
pt.count_ord<-as.data.frame(pt.count[match(df$LSOA_code,pt.count$Var1),])
#Fill 3rd col with counts
df[,3]<-pt.count_ord$Freq
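The counting step on its own can be tried without the shapefile. The following is a minimal base R sketch, assuming `hit_codes` stands in for the `LSOA11CD` column that `over()` returns (one value per point, `NA` when a point falls outside every polygon) and `poly_codes` stands in for the codes in shapefile order; both vectors are made up.

```r
# Hypothetical stand-ins: polygon codes in shapefile order, and the code
# returned for each point (NA = point fell outside every polygon)
poly_codes <- c("E01000003", "E01000001", "E01000002")
hit_codes  <- c("E01000001", "E01000003", "E01000001", NA, "E01000001")

# Fixing the factor levels keeps zero-count polygons and the shapefile's order;
# table() ignores the NA by default
counts <- table(factor(hit_codes, levels = poly_codes))

# One row per polygon, in the same order as the polygons
result <- data.frame(LSOA_code = poly_codes,
                     pt_Count  = as.vector(counts),
                     stringsAsFactors = FALSE)
```

Setting the levels explicitly means no separate reordering step is needed afterwards, and polygons with no points get a count of 0 rather than going missing.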
Inputs:
Percent.Turnout US.State
70 CA
80 NM
76 RI
I have data for each of the 50 states in the US. Also, the state abbreviation for US.State is consistent with the abbreviations in the function state.abb
I would like to create a US map where the Percent.Turnout is printed on each state. Furthermore, using the ColorBrewer package, I would like to color each state based on its Percent.Turnout relative to other states.
I am not very familiar with ggplot syntax, so suggestions in base R would be appreciated (if feasible)
If you'd like to use ggplot2, then the major thing that you need to do is map the state abbreviation column to the full state name in lower case (For this, you can use state.name, but make sure to apply tolower() on it to get it in the right format).
From there, it's simply a matter of joining your dataset to the state's geospatial information and plotting the data. The following segment of code takes you through that step by step:
# First, we need the ggplot2 library:
> library(ggplot2)
# We load the geospatial data for the states
# (there are more options to the map_data function,
# if you are interested in taking a look).
> states <- map_data("state")
# Here I'm creating a sample dataset like yours.
# The dataset will have 2 columns: The region (or state)
# and a number that will represent the value that you
# want to plot (here the value is just the numerical order of the states).
> sim_data <- data.frame(region=unique(states$region), Percent.Turnout=match(unique(states$region), unique(states$region)))
# Then we merge our dataset with the geospatial data:
> sim_data_geo <- merge(states, sim_data, by="region")
# The following should give us the plot without the numbers:
> qplot(long, lat, data=sim_data_geo, geom="polygon", fill=Percent.Turnout, group=group)
This is the output of the segment of code above:
Now, you said you'd like to also add the value Percent.Turnout to the map. Here, we need to find the center point of the various states. You can calculate that from the geospatial data that we retrieved above (in the states dataframe), but the results won't look very impressive. Thankfully, R has the values for the centers of the states already calculated for us, and we can leverage that, as follows:
# We'll use the state.center list to tell us where exactly
# the center of the state is.
> snames <- data.frame(region=tolower(state.name), long=state.center$x, lat=state.center$y)
# Then again, we need to join our original dataset
# to get the value that should be printed at the center.
> snames <- merge(snames, sim_data, by="region")
# And finally, to put everything together:
> ggplot(sim_data_geo, aes(long, lat)) + geom_polygon(aes(group=group, fill=Percent.Turnout)) + geom_text(data=snames, aes(long, lat, label=Percent.Turnout))
And this is the output of the last statement above:
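The abbreviation-to-name mapping mentioned at the start is not shown in the code above, since the sample data there already uses full state names. A small sketch of that step, using the three rows from the question as input (the data frame name `turnout` is my own):

```r
# The three sample rows from the question
turnout <- data.frame(Percent.Turnout = c(70, 80, 76),
                      US.State = c("CA", "NM", "RI"),
                      stringsAsFactors = FALSE)

# state.abb and state.name are parallel built-in vectors, so match() on one
# gives the index into the other; map_data("state") uses lower-case names
turnout$region <- tolower(state.name[match(turnout$US.State, state.abb)])
```

The resulting `region` column can then be used as the join key against the `states` data frame in the same way as `sim_data` above.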
I have read so many threads and articles and I keep getting errors. I am trying to make a choropleth? map of the world using data I have from the global terrorism database. I want to color countries on a factor of nkills or just the number of attacks in that country.. I don't care at this point. Because there are so many countries with data, it is unreasonable to make any plots to show this data.
Help is strongly appreciated and if I did not ask this correctly I sincerely apologize, I am learning the rules of this website as I go.
my code (so far..)
library(maps)
library(ggplot2)
map("world")
world<- map_data("world")
gtd<- data.frame(gtd)
names(gtd)<- tolower(names(gtd))
gtd$country_txt<- tolower(rownames(gtd))
demo<- merge(world, gts, sort=FALSE, by="country_txt")
In the gtd data frame, the name for the countries column is "country_txt" so I thought I would use that but I get error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
If that were to work, I would plot as I have seen on a few websites..
I have honestly been working on this for so long and I have read so many codes/other similar questions/websites/r handbooks etc.. I will accept that I am incompetent when it comes to R gladly for some help.
Something like this? This is a solution using rgdal and ggplot. I long ago gave up on using base R for this type of thing.
library(rgdal) # for readOGR(...)
library(RColorBrewer) # for brewer.pal(...)
library(ggplot2)
setwd(" < directory with all files >")
gtd <- read.csv("globalterrorismdb_1213dist.csv")
gtd.recent <- gtd[gtd$iyear>2009,]
gtd.recent <- aggregate(nkill~country_txt,gtd.recent,sum)
world <- readOGR(dsn=".",
layer="world_country_admin_boundary_shapefile_with_fips_codes")
countries <- world@data
countries <- cbind(id=rownames(countries),countries)
countries <- merge(countries,gtd.recent,
by.x="CNTRY_NAME", by.y="country_txt", all.x=T)
map.df <- fortify(world)
map.df <- merge(map.df,countries, by="id")
ggplot(map.df, aes(x=long,y=lat,group=group)) +
geom_polygon(aes(fill=nkill))+
geom_path(colour="grey50")+
scale_fill_gradientn(name="Deaths",
colours=rev(brewer.pal(9,"Spectral")),
na.value="white")+
coord_fixed()+labs(x="",y="")
There are several versions of the Global Terrorism Database. I used the full dataset available here, and then subsetted for year > 2009. So this map shows total deaths due to terrorism, by country, from 2010-01-01 to 2013-01-01 (the last data available from this source). The files are available as MS Excel download, which I converted to csv for import into R.
The world map is available as a shapefile from the GeoCommons website.
The tricky part of making choropleth maps is associating your data with the correct polygons (countries). This is generally a four step process:
1. Find a field in the shapefile attributes table that maps (no pun intended) to a corresponding field in your data. In this case, it appears that the field "CNTRY_NAME" in the shapefile maps to the field "country_txt" in the gtd database.
2. Create an association between polygon IDs (stored in the row names of the attribute table) and the CNTRY_NAME field.
3. Merge the result with your data using CNTRY_NAME and country_txt.
4. Merge the result of that with the data frame created using fortify(world) - this associates polygons with deaths (nkill).
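To make the steps concrete, here is a rough sketch on toy data frames in base R, so it runs without the shapefile. Everything in it is made up for illustration: `attrs` stands in for the attribute table, `deaths` for the aggregated data, and `verts` for the fortified data frame (one row per polygon vertex).

```r
# Toy attribute table; the row names play the role of polygon IDs
attrs <- data.frame(CNTRY_NAME = c("Aruba", "Iraq", "Peru"),
                    row.names = c("0", "1", "2"),
                    stringsAsFactors = FALSE)

# Hypothetical aggregated data (invented numbers); Aruba has no record
deaths <- data.frame(country_txt = c("Iraq", "Peru"),
                     nkill = c(10, 2),
                     stringsAsFactors = FALSE)

# Toy stand-in for the fortified data frame: one row per polygon vertex
verts <- data.frame(id    = c("0", "0", "1", "1", "2", "2"),
                    long  = c(1, 2, 3, 4, 5, 6),
                    lat   = c(1, 2, 3, 4, 5, 6),
                    order = 1:6,
                    stringsAsFactors = FALSE)

# Step 2: associate polygon IDs with the name field
ids <- data.frame(id = rownames(attrs), CNTRY_NAME = attrs$CNTRY_NAME,
                  stringsAsFactors = FALSE)
# Step 3: attach the data by country name; all.x keeps countries with no data
ids <- merge(ids, deaths, by.x = "CNTRY_NAME", by.y = "country_txt", all.x = TRUE)
# Step 4: attach the result to every vertex of the matching polygon
map.df <- merge(verts, ids, by = "id")
# merge() may shuffle rows, so restore the drawing order of the vertices
map.df <- map.df[order(map.df$order), ]
```

Countries with no matching record end up with `NA` in `nkill`, which is what the `na.value="white"` setting in the plot is for. The final sort is a precaution: polygons are drawn in row order, so it does no harm to put the vertices back in their original sequence after merging.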
Building on the nice work by @jlhoward, you could instead use rworldmap, which already has a world map in R and has functions to aid joining data to the map. The default map is deliberately low resolution to create a 'cleaner' look. The map can be customised (see the rworldmap documentation), but here is a start:
library(rworldmap)
#3 lines from @jlhoward
gtd <- read.csv("globalterrorismdb_1213dist.csv")
gtd.recent <- gtd[gtd$iyear>2009,]
gtd.recent <- aggregate(nkill~country_txt,gtd.recent,sum)
#join data to a map
gtdMap <- joinCountryData2Map( gtd.recent,
nameJoinColumn="country_txt",
joinCode="NAME" )
mapDevice('x11') #create a world shaped window
#plot the map
mapCountryData( gtdMap,
nameColumnToPlot='nkill',
catMethod='fixedWidth',
numCats=100 )
Following a comment from @hk47, you can also add the points to the map sized by the number of casualties.
deaths <- subset(x=gtd, nkill >0)
mapBubbles(deaths,
nameX='longitude',
nameY='latitude',
nameZSize='nkill',
nameZColour='black',
fill=FALSE,
addLegend=FALSE,
add=TRUE)
Have a Question on Mapping with R, specifically around the choropleth maps in R.
I have a dataset of ZIP codes assigned to an area and some associated data (dataset is here).
My final data format is: Area ID, ZIP, Probability Value, Customer Count, Area Probability and Area Customer Total. I am attempting to present this data by plotting area probability and Area Customer Total on a Map. I have tried to do this by using the census TIGER Shapefiles but I guess R cannot handle the complete country.
I am comfortable with the Statistical capabilities and now I am moving all my Mapping from third party GIS focused applications to doing all my Mapping in R. Does anyone have any pointers to how to achieve this from within R?
To be a little more detailed, here's the point where R stops working -
shapes <- readShapeSpatial("tl_2013_us_zcta510.shp")
(where the shp file is the census/TIGER) shape file.
Edit - Providing further details. I am trying to first read the TIGER shapefiles, hoping to combine this spatial dataset with my data and eventually plot. I am having an issue at the very beginning when attempting to read the shape file. Below is the code with the output
require(maptools)
shapes<-readShapeSpatial("tl_2013_us_zcta510.shp")
Error: cannot allocate vector of size 317 Kb
There are several examples and tutorials on making maps using R, but most are very general and, unfortunately, most map projects have nuances that create inscrutable problems. Yours is a case in point.
The biggest issue I came across was that the US Census Bureau zip code tabulation area shapefile for the whole US is huge: ~800MB. When loaded using readOGR(...), the R SpatialPolygonsDataFrame object is about 913MB. Trying to process a file this size (e.g., converting to a data frame using fortify(...)), at least on my system, resulted in errors like the one you identified above. So the solution is to subset the file based on the zip codes that are actually in your data.
This map:
was made from your data using the following code.
library(rgdal)
library(ggplot2)
library(stringr)
library(RColorBrewer)
setwd("<directory containing shapefiles and sample data>")
data <- read.csv("Sample.csv",header=T) # your sample data, downloaded as csv
data$ZIP <- str_pad(data$ZIP,5,"left","0") # convert ZIP to char(5) w/leading zeros
zips <- readOGR(dsn=".","tl_2013_us_zcta510") # import zip code polygon shapefile
map <- zips[zips$ZCTA5CE10 %in% data$ZIP,] # extract only zips in your Sample.csv
map.df <- fortify(map) # convert to data frame suitable for plotting
# merge data from Sample.csv into map data frame
map.data <- data.frame(id=rownames(map@data),ZIP=map@data$ZCTA5CE10)
map.data <- merge(map.data,data,by="ZIP")
map.df <- merge(map.df,map.data,by="id")
# load state boundaries
states <- readOGR(dsn=".","gz_2010_us_040_00_5m")
states <- states[states$NAME %in% c("New York","New Jersey"),] # extract NY and NJ
states.df <- fortify(states) # convert to data frame suitable for plotting
ggMap <- ggplot(data = map.df, aes(long, lat, group = group))
ggMap <- ggMap + geom_polygon(aes(fill = Probability_1))
ggMap <- ggMap + geom_path(data=states.df, aes(x=long,y=lat,group=group))
ggMap <- ggMap + scale_fill_gradientn(name="Probability",colours=brewer.pal(9,"Reds"))
ggMap <- ggMap + coord_equal()
ggMap
Explanation:
The rgdal package facilitates the creation of R Spatial objects from ESRI shapefiles. In your case we are importing a polygon shapefile into a SpatialPolygonsDataFrame object in R. The latter has two main parts: a polygon section, which contains the latitude and longitude points that will be joined to create the polygons on the map, and a data section which contains information about the polygons (so, one row for each polygon). If, e.g., we call the Spatial object map, then the two sections can be referenced as map@polygons and map@data. The basic challenge in making choropleth maps is to associate data from your Sample.csv file with the relevant polygons (zip codes).
So the basic workflow is as follows:
1. Load polygon shapefiles into Spatial object ( => zips)
2. Subset if appropriate ( => map).
3. Convert to data frame suitable for plotting ( => map.df).
4. Merge data from Sample.csv into map.df.
5. Draw the map.
Step 4 is the one that causes all the problems. First we have to associate zip codes with each polygon. Then we have to associate Probability_1 with each zip code. This is a three step process.
Each polygon in the Spatial data file has a unique ID, but these IDs are not the zip codes. The polygon IDs are stored as row names in map@data. The zip codes are stored in map@data, in column ZCTA5CE10. So first we must create a data frame that associates the map@data row names (id) with map@data$ZCTA5CE10 (ZIP). Then we merge your Sample.csv with the result using the ZIP field in both data frames. Then we merge the result of that into map.df. This can be done in 3 lines of code.
Drawing the map involves telling ggplot what dataset to use (map.df), which columns to use for x and y (long and lat) and how to group the data by polygon (group=group). The columns long, lat, and group in map.df are all created by the call to fortify(...). The call to geom_polygon(...) tells ggplot to draw polygons and fill using the information in map.df$Probability_1. The call to geom_path(...) tells ggplot to create a layer with state boundaries. The call to scale_fill_gradientn(...) tells ggplot to use a color scheme based on the color brewer "Reds" palette. Finally, the call to coord_equal(...) tells ggplot to use the same scale for x and y so the map is not distorted.
NB: The state boundary layer uses the US States TIGER file.
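One detail from the code above that is easy to miss: read.csv parses ZIP codes as numbers, so leading zeros are lost and New Jersey codes such as 07030 would never match ZCTA5CE10. The str_pad call puts them back. A base R equivalent, as a sketch with made-up ZIP values:

```r
# Hypothetical ZIP codes as read.csv would parse them (leading zeros lost)
zip_raw <- c(7030, 10001, 501)

# Pad back to five characters with leading zeros
zip_chr <- sprintf("%05d", as.integer(zip_raw))
```

Reading the column as character in the first place (e.g. with the colClasses argument of read.csv) avoids the problem altogether.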
I would advise the following.
Use readOGR from the rgdal package rather than readShapeSpatial.
Consider using ggplot2 for good-looking maps - many of the examples use this.
Refer to one of the existing examples of creating a choropleth such as this one to get an overview.
Start with a simple choropleth and gradually add your own data; don't try and get it all right at once.
If you need more help, create a reproducible example with a SMALL fake dataset and with links to the shapefiles in question. The idea is to make it easy for us to help you, rather than discouraging us by not supplying code and data in your question.