Convert zip code or lat/long to county in R

I have a list of locations that contains a city, state, zip, latitude and longitude for each location.
I separately have a list of economic indicators at the county level. I've played with the zipcode package, the ggmap package, and several other free geocoding websites, including the US Gazetteer files, but can't seem to find a way to match the two pieces.
Are there currently any packages or other sources that do this?

I ended up using the suggestion from JoshO'Brien mentioned above and found here.
I took his code and changed state to county, as shown below:
library(sp)
library(maps)
library(maptools)
# The single argument to this function, pointsDF, is a data.frame in which:
# - column 1 contains the longitude in degrees (negative in the US)
# - column 2 contains the latitude in degrees
latlong2county <- function(pointsDF) {
    # Prepare SpatialPolygons object with one SpatialPolygon
    # per county
    counties <- map('county', fill=TRUE, col="transparent", plot=FALSE)
    IDs <- sapply(strsplit(counties$names, ":"), function(x) x[1])
    counties_sp <- map2SpatialPolygons(counties, IDs=IDs,
                                       proj4string=CRS("+proj=longlat +datum=WGS84"))

    # Convert pointsDF to a SpatialPoints object
    pointsSP <- SpatialPoints(pointsDF,
                              proj4string=CRS("+proj=longlat +datum=WGS84"))

    # Use 'over' to get _indices_ of the Polygons object containing each point
    indices <- over(pointsSP, counties_sp)

    # Return the county names of the Polygons object containing each point
    countyNames <- sapply(counties_sp@polygons, function(x) x@ID)
    countyNames[indices]
}
# Test the function using points in Wisconsin and Oregon.
testPoints <- data.frame(x = c(-90, -120), y = c(44, 44))
latlong2county(testPoints)
[1] "wisconsin,juneau" "oregon,crook" # IT WORKS

Matching zip codes to counties is difficult. (Certain zip codes span more than one county, and sometimes more than one state; for example, 30165.)
I am not aware of any specific R package that can match these up for you.
However, you can get a nice table from the Missouri Census Data Center.
You can use this page for data extraction.
A sample output might look like:
state,zcta5,ZIPName,County,County2
01,30165,"Rome, GA",Cherokee AL,
01,31905,"Fort Benning, GA",Russell AL,
01,35004,"Moody, AL",St. Clair AL,
01,35005,"Adamsville, AL",Jefferson AL,
01,35006,"Adger, AL",Jefferson AL,Walker AL
...
Note the County2 column. The metadata explanation can be found here:
county
The county in which the ZCTA is all or mostly contained. Over 90% of ZCTAs fall entirely within a single county.
county2
The "secondary" county for the ZCTA, i.e. the county which has the 2nd largest intersection with it. Over 90% of the time this value will be blank.
See also ANSI County codes
http://www.census.gov/geo/www/ansi/ansi.html
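Once you have exported such a table to a CSV, matching it to your locations is a plain merge. A minimal sketch (the file name and the zip column are assumptions; the zcta5 column follows the sample output above):
# Read the MCDC ZCTA-to-county extract (file name assumed)
zcta2county <- read.csv("zcta_county.csv", colClasses = "character")
# Hypothetical location list with a 'zip' column
locations <- data.frame(zip = c("30165", "35006"))
# Join on the 5-digit ZCTA code; County2 comes along for split zips
merge(locations, zcta2county, by.x = "zip", by.y = "zcta5", all.x = TRUE)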

I think the package "noncensus" is helpful.
The following is what I use to match zip codes with counties:
### Code to get the county based on a zip code
library(noncensus)
data(zip_codes)
data(counties)
# Build a numeric 5-digit fips code: state fips * 1000 + county fips
state_fips = as.numeric(as.character(counties$state_fips))
county_fips = as.numeric(as.character(counties$county_fips))
counties$fips = state_fips*1000 + county_fips
zip_codes$fips = as.numeric(as.character(zip_codes$fips))
# test
temp = subset(zip_codes, zip == "30329")
subset(counties, fips == temp$fips)
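To look up a whole vector of zip codes at once, a merge on the shared fips key avoids subsetting one zip at a time. A sketch, run after the fips columns are built as above:
# Attach county attributes to every zip code via the shared fips key
zip_county <- merge(zip_codes, counties, by = "fips", all.x = TRUE)
subset(zip_county, zip %in% c("30329", "30165"))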

A simple option is to use the geocode() function in ggmap, with the option output="more" or output="all".
This takes flexible input, such as an address or lat/lon, and returns the address, city, county, state, country, postal code, etc., as a list.
require("ggmap")
address <- geocode("Yankee Stadium", output="more")
str(address)
$ lon : num -73.9
$ lat : num 40.8
$ type : Factor w/ 1 level "stadium": 1
$ loctype : Factor w/ 1 level "approximate": 1
$ address : Factor w/ 1 level "yankee stadium, 1 east 161st street, bronx, ny 10451, usa": 1
$ north : num 40.8
$ south : num 40.8
$ east : num -73.9
$ west : num -73.9
$ postal_code : chr "10451"
$ country : chr "united states"
$ administrative_area_level_2: chr "bronx"
$ administrative_area_level_1: chr "ny"
$ locality : chr "new york"
$ street : chr "east 161st street"
$ streetNo : num 1
$ point_of_interest : chr "yankee stadium"
$ query : chr "Yankee Stadium"
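The county is the administrative_area_level_2 field, so you can pull it straight out of the result:
address$administrative_area_level_2
# [1] "bronx"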
Another solution is to use a census shapefile and the same over() command from the question. I ran into a problem using the maptools base map: because it uses the WGS84 datum, points in North America that were within a few miles of the coast were mapped incorrectly, and about 5% of my data set did not match up.
Try this, using the maptools/sp packages and Census TIGER/Line shapefiles:
counties <- readShapeSpatial("maps/tl_2013_us_county.shp", proj4string=CRS("+proj=longlat +datum=NAD83"))
# Convert pointsDF to a SpatialPoints object
pointsSP <- SpatialPoints(pointsDF, proj4string=CRS("+proj=longlat +datum=NAD83"))
countynames <- over(pointsSP, counties)
countynames <- countynames$NAMELSAD
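Note that readShapeSpatial comes from the now-retired maptools package; a hedged equivalent with sf (same TIGER/Line file, same point-in-polygon join) would be:
library(sf)
# TIGER/Line shapefiles carry their NAD83 georeferencing with them
counties <- st_read("maps/tl_2013_us_county.shp")
pointsSF <- st_as_sf(pointsDF, coords = c(1, 2), crs = st_crs(counties))
countynames <- st_join(pointsSF, counties)$NAMELSAD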

Related

shapefile in R: aggregate geometries by similar attributes

I have a shapefile in SpatialPolygonsDataFrame (myshape) format with several sub-divisions of the same area represented by A, B, C, etc., e.g. LIMOEIRO064A, LIMOEIRO064B, etc., for several different areas. I'd like to merge the geometries of LIMOEIRO064, for example. For this I try:
# Packages
library(raster)
library(rgdal)
# Download and unzip the shapefile example
download.file('https://www.dropbox.com/s/2zoproayzqvj1cc/myshape.zip?dl=0',
              destfile="myshape.zip",
              method="auto")
unzip(paste0(getwd(),"/myshape.zip"))
#Read target shapefile -----------------------------------------------
myshape <- readOGR(".", "myshape")
proj4string(myshape) <- CRS("+proj=longlat +ellps=GRS80 +no_defs")
# Create unique ID for each area without sub-units A, B, C, etc. if have in CD_TALHAO attribute
str(myshape@data)
#'data.frame': 419 obs. of 7 variables:
# $ OBJECTID : chr "563774" "563783" "795091" "795092" ...
# $ ID_PROJETO: chr "131" "131" "131" "131" ...
# $ PROJETO : chr "LIMOEIRO" "LIMOEIRO" "LIMOEIRO" "LIMOEIRO" ...
# $ CD_TALHAO : chr "064A" "017B" "V00204" "V00702" ...
# $ SHAPE_AREA: num 1.07e-05 1.67e-05 1.72e-07 2.46e-07 2.07e-06 ...
# $ SHAPE_LEN : num 0.02774 0.01921 0.00401 0.005 0.01916 ...
# $ CODE : chr "LIMOEIRO064A" "LIMOEIRO017B" "LIMOEIROV00204" "LIMOEIROV00702" ...
myshape@data$UNIQUE <- gsub("[a-zA-Z]", "", myshape@data$CD_TALHAO)
# New unique CODE
myshape@data$CODE <- paste0(myshape@data$PROJETO, myshape@data$UNIQUE)
# unique(myshape@data$CODE)
# [1] "LIMOEIRO064" "LIMOEIRO017" "LIMOEIRO00204" "LIMOEIRO00702" "LIMOEIRO06501" "LIMOEIRO02403"
# [7] "LIMOEIRO00201" "LIMOEIRO05002" "LIMOEIRO03516" "LIMOEIRO02203" "LIMOEIRO02904" "LIMOEIRO00405"
# [13] "LIMOEIRO01804" "LIMOEIRO01608" "LIMOEIRO03106" "LIMOEIRO00101" "LIMOEIRO010" "LIMOEIRO035"
# [19] "LIMOEIRO020" "LIMOEIRO001" "LIMOEIRO056" "LIMOEIRO059" "LIMOEIRO06402" "LIMOEIRO01801"
#...
# [295] "LIMOEIRO011" "LIMOEIRO06408"
Now, I'd like to merge the geometries with the same CODE identification into a new_myshape, but options like bind() and union() don't work for me. I need something like aggregate by myshape@data$CODE, or some option like this.
Any ideas?
Here is how you can do that with terra
library(terra)
f <- system.file("ex/lux.shp", package="terra")
v <- vect(f)
va <- aggregate(v, "ID_1")
You can use the same approach with raster/sp
p <- shapefile(system.file("external/lux.shp", package="raster"))
pa <- aggregate(p, "ID_1")
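Applied to the question's data (a sketch, assuming myshape and its CODE column were built as shown in the question):
# Convert the sp object to a SpatVector, then dissolve by CODE
v <- vect(myshape)
new_myshape <- aggregate(v, "CODE")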

Retrieving latitude/longitude coordinates for cities/countries that have since changed names?

Say I have a vector of cities and countries, which may or may not include names of places that have since changed names:
locations <- c("Paris, France", "Sarajevo, Yugoslavia", "Rome, Italy", "Leningrad, Soviet Union", "St Petersburg, Russia")
The problem is that I can't use something like ggmap::geocode since it doesn't appear to work well for locations whose names have changed:
ggmap::geocode(locations, source = "dsk")
lon lat
1 2.34880 48.85341 #Works for Paris
2 NA NA #Didn't work for Sarajevo
3 12.48390 41.89474 #Works for Rome
4 98.00000 60.00000 #Didn't work for the old name of St Petersburg; seems to just return the center of Russia
5 30.26417 59.89444 #Worked for St Petersburg
Are there alternative functions I could use? If I have to "update" the names of the cities and countries, is there an easy way to go through this? I have hundreds of locations for which I was looking to collect longitude and latitude coordinates.
This might not be what you had in mind, but if you use the exact same code with only the city names (and not the countries), at least the two cases that you mentioned (Sarajevo and Leningrad) seem to work fine. You could try to run the function with a modified locations vector including just the city names, and see if you still get errors. Something like this:
(cities <- gsub(',.*', '', locations))
## [1] "Paris" "Sarajevo" "Rome" "Leningrad" "St Petersburg"
cbind(ggmap::geocode(cities, source = 'dsk'), cities)
## lon lat cities
## 1 2.34880 48.85341 Paris
## 2 18.35644 43.84864 Sarajevo
## 3 12.48390 41.89474 Rome
## 4 30.26417 59.89444 Leningrad
## 5 30.26417 59.89444 St Petersburg
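If you need to keep the country part of the strings, another option is to maintain a small lookup of historical names and recode before geocoding. A sketch, where the renames vector is a hypothetical table you would extend yourself:
# Hypothetical old-name -> current-name lookup; extend as needed
renames <- c("Sarajevo, Yugoslavia"    = "Sarajevo, Bosnia and Herzegovina",
             "Leningrad, Soviet Union" = "St Petersburg, Russia")
updated <- ifelse(locations %in% names(renames),
                  renames[locations], locations)
ggmap::geocode(updated, source = "dsk")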

R SpatialPointsDataFrame to SpatialLinesDataFrame

I've imported some GPS points from my Sports watch into R:
library(plotKML)
route <- readGPX("Move_Cycling.gpx")
str(route)
The data looks like this:
List of 5
$ metadata : NULL
$ bounds : NULL
$ waypoints: NULL
$ tracks :List of 1
..$ :List of 1
.. ..$ Move:'data.frame': 677 obs. of 5 variables:
.. .. ..$ lon : num [1:677] -3.8 -3.8 -3.8 -3.8 -3.8 ...
.. .. ..$ lat : num [1:677] 52.1 52.1 52.1 52.1 52.1 ...
.. .. ..$ ele : chr [1:677] "152" "151" "153" "153" ...
.. .. ..$ time : chr [1:677] "2014-06-08T09:17:08.050Z" "2014-06-08T09:17:18.680Z" "2014-06-08T09:17:23.680Z" "2014-06-08T09:17:29.680Z" ...
.. .. ..$ extensions: chr [1:677] "7627.7999992370605141521101800" "7427.6000003814697141511.7000000476837210180.8490009442642210" "9127.523.13003521531.7000000476837210181.799999952316280" "10027.534.96003841534.1999998092651410181.88300029210510" ...
$ routes : NULL
I've managed to transform to get the data points into a SpatialPointsDataFrame and to plot it over Google Earth with:
SPDF <- SpatialPointsDataFrame(coords=route$tracks[[1]]$Move[1:2],
                               data=route$tracks[[1]]$Move[1:2],
                               proj4string = CRS("+init=epsg:4326"))
plotKML(SPDF)
What I really want is the cycling track, i.e. a SpatialLinesDataFrame, but I can't work out how to set the ID field correctly to match the SpatialLines object with the data.
This is how far I've got:
tmp <- Line(coords=route$tracks[[1]]$Move[1:2])
tmp2 <- Lines(list(tmp), ID=c("coord"))
tmp3 <- SpatialLines(list(tmp2), proj4string = CRS("+init=epsg:4326"))
# result should be something like,
# but the ID of tmp3 and data don't match at the moment
SPDF <- SpatialLinesDataFrame(tmp3, data)
You can read the GPX file straight into a SpatialLinesDataFrame object with readOGR from the rgdal package. A GPX file can contain tracks, waypoints, etc., and these are seen by OGR as layers in the file. So simply:
> track = readOGR("myfile.gpx","tracks")
> plot(track)
should work. You should see lines.
In your last line you've not said what your data is, but it needs to be a data frame with one row per track if you are trying to construct a SpatialLinesDataFrame from some SpatialLines and a data frame. You can also tell it not to bother matching the IDs, because you don't actually have any real per-track data you are merging. So:
> SPDF = SpatialLinesDataFrame(tmp3, data.frame(who="me"),match=FALSE)
> plot(SPDF)
But if you use readOGR you don't need to go through all that. It will also read in a bit of per-track metadata from the GPX file.
Happy cycling!
As an update, here's my final solution
library(rgdal)
library(plotKML)
track <- readOGR("Move_Cycling.gpx","tracks")
plotKML(track, colour='red', width=2, labels="Cwm Rhaeadr Trail")
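Note that rgdal has since been retired; the same one-liner now works with sf (a sketch):
library(sf)
track <- st_read("Move_Cycling.gpx", layer = "tracks")
plot(st_geometry(track))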

The variable from a netcdf file comes out flipped

I have downloaded a nc file, which I open with:
f=open.ncdf("file.nc")
[1] "file Lfile.nc has 2 dimensions:"
[1] "Longitude Size: 1440"
[1] "Latitude Size: 720"
[1] "------------------------"
[1] "file filr.nc has 8 variables:"
[1] "short ts[Latitude,Longitude] Longname:Skin Temperature (2mm) Missval:NA"
I then wanted to work with the variable soil_moisture_c
A = get.var.ncdf(nc=f,varid="soil_moisture_c",verbose=TRUE)
I then plotted A with image(A) and got the map shown below. I even transposed it with image(t(A)), but that flipped it in the other direction, still not how it should be. To figure out what was wrong, I opened the file in the netCDF viewer Panoply, which plotted the map correctly, as you can see below.
The reason is that the NetCDF interface you are using is very low-level, and all you have done is read out the variable without any of its dimension information. The orientation of the grid is really arbitrary, and the coordinate information needs to be understood in a particular context.
library(raster) ## requires ncdf package for this file
d <- raster("LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T185959Z_20040114.nc", varname = "soil_moisture_c")
(I used a different file from yours, but it should work the same.)
It turns out that even raster does not get this right without work, but it does make it easy to rectify:
d <- flip(t(d), direction = "x")
That transposed the data and flipped around "x" (longitude), keeping the georeferencing from the original context.
Plot it up with a map from maptools to check:
plot(d)
library(maptools)
data(wrld_simpl)
plot(wrld_simpl, add = TRUE)
There are many other ways to achieve this by reading the dimension information from the file, but this is at least a shortcut to do most of the hard work for you.
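With terra, raster's successor, a hedged equivalent of the same fix would be:
library(terra)
# Read the soil_moisture_c subdataset, then transpose and flip the longitudes
d <- rast("LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T185959Z_20040114.nc",
          subds = "soil_moisture_c")
d <- flip(t(d), direction = "horizontal")
plot(d)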
Just as a complement to @mdsumner's far better solution, here is a way to do that using the ncdf library only.
library(ncdf)
f <- open.ncdf("LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040101.nc")
A <- get.var.ncdf(f, "soil_moisture_c")
All you need is to find your dimensions in order to have coherent x and y axes. If you look at your netCDF object's dimensions, here is what you see:
str(f$dim)
List of 2
$ Longitude:List of 8
..$ name : chr "Longitude"
..$ len : int 1440
..$ unlim : logi FALSE
..$ id : int 1
..$ dimvarid : num 2
..$ units : chr "degrees_east"
..$ vals : num [1:1440(1d)] -180 -180 -179 -179 -179 ...
..$ create_dimvar: logi TRUE
..- attr(*, "class")= chr "dim.ncdf"
$ Latitude :List of 8
..$ name : chr "Latitude"
..$ len : int 720
..$ unlim : logi FALSE
..$ id : int 2
..$ dimvarid : num 1
..$ units : chr "degrees_north"
..$ vals : num [1:720(1d)] 89.9 89.6 89.4 89.1 88.9 ...
..$ create_dimvar: logi TRUE
..- attr(*, "class")= chr "dim.ncdf"
Hence your dimensions are:
f$dim$Longitude$vals -> Longitude
f$dim$Latitude$vals -> Latitude
Now your Latitude goes from 90 to -90 instead of the opposite, which image prefers, so reverse it along with the corresponding matrix rows:
Latitude <- rev(Latitude)
A <- A[nrow(A):1,]
Finally, as you noticed, the x and y of your object A are flipped, so you need to transpose it; also, your NA values are represented for some reason by the value -32767:
A[A==-32767] <- NA
A <- t(A)
And finally the plot:
image(Longitude, Latitude, A)
library(maptools)
data(wrld_simpl)
plot(wrld_simpl, add = TRUE)
Edit: To do that on your 31 files, let's call your vector of file names ncfiles and the directory where you stored them yourpath (for simplicity I'm going to assume your variable is always called soil_moisture_c and your NAs are always -32767):
ncfiles
[1] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040101.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040102.nc"
[3] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040103.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040104.nc"
[5] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040105.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040106.nc"
[7] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040107.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040108.nc"
[9] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040109.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040110.nc"
[11] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040111.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040112.nc"
[13] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040113.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040114.nc"
[15] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040115.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040116.nc"
[17] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040117.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040118.nc"
[19] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040119.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040120.nc"
[21] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040121.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040122.nc"
[23] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040123.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040124.nc"
[25] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040125.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040126.nc"
[27] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040127.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040128.nc"
[29] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040129.nc" "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040130.nc"
[31] "LPRM-AMSR_E_L3_D_SOILM3_V002-20120520T173800Z_20040131.nc"
yourpath
[1] "C:\\Users"
library(ncdf)
library(maptools)
data(wrld_simpl)
for(i in 1:length(ncfiles)){
    f <- open.ncdf(paste(yourpath, ncfiles[i], sep="\\"))
    A <- get.var.ncdf(f, "soil_moisture_c")
    f$dim$Longitude$vals -> Longitude
    f$dim$Latitude$vals -> Latitude
    Latitude <- rev(Latitude)
    A <- A[nrow(A):1,]
    A[A==-32767] <- NA
    A <- t(A)
    close.ncdf(f) # this is the important part
    png(paste0(gsub("\\.nc$", "", ncfiles[i]), ".png")) # or any other device such as pdf, jpg...
    image(Longitude, Latitude, A)
    plot(wrld_simpl, add = TRUE)
    dev.off()
}
You can also simply invert the latitudes from the command line using CDO:
cdo invertlat file.nc file_inverted.nc

State name to abbreviation

I have a large file with a variable state that has full state names. I would like to replace them with the state abbreviations (that is, "NY" for "New York"). Is there an easy way to do this (apart from using several if-else commands)? Maybe using a replace() statement?
R has two built-in constants that might help: state.abb with the abbreviations, and state.name with the full names. Here is a simple usage example:
> x <- c("New York", "Virginia")
> state.abb[match(x,state.name)]
[1] "NY" "VA"
1) grep the full name from state.name and use that to index into state.abb:
state.abb[grep("New York", state.name)]
## [1] "NY"
1a) or using which:
state.abb[which(state.name == "New York")]
## [1] "NY"
2) or create a vector of state abbreviations whose names are the full names and index into it using the full name:
setNames(state.abb, state.name)["New York"]
## New York
## "NY"
Unlike (1), this one works even if "New York" is replaced by a vector of full state names, e.g. setNames(state.abb, state.name)[c("New York", "Idaho")]
Old post I know, but wanted to throw mine in there. I learned on tidyverse, so for better or worse I avoid base R when possible. I wanted one with DC too, so first I built the crosswalk:
library(tidyverse)
st_crosswalk <- tibble(state = state.name) %>%
  bind_cols(tibble(abb = state.abb)) %>%
  bind_rows(tibble(state = "District of Columbia", abb = "DC"))
Then I joined it to my data:
left_join(data, st_crosswalk, by = "state")
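A quick demo with a hypothetical data frame:
data <- tibble(state = c("Ohio", "District of Columbia"))
left_join(data, st_crosswalk, by = "state")
## # A tibble: 2 x 2
##   state                abb
##   <chr>                <chr>
## 1 Ohio                 OH
## 2 District of Columbia DC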
I found that the built-in state.name and state.abb cover only the 50 states. I got a bigger table (including DC and so on) online (e.g., this link: http://www.infoplease.com/ipa/A0110468.html) and pasted it into a .csv file named States.csv. I then load state names and abbreviations from this file instead of using the built-ins. The rest is quite similar to @Aniko's answer.
library(dplyr)
library(stringr)
library(stringdist)
# setwd("your/working/directory")  # point this at the folder holding States.csv
# load data
data = c("NY", "New York", "NewYork")
data = toupper(data)
# load state name and abbr.
State.data = read.csv('States.csv')
State = toupper(State.data$State)
Stateabb = as.vector(State.data$Abb)
# match data with state names; a misspelling of 1 letter is allowed
match = amatch(data, State, maxDist=1)
data[ !is.na(match) ] = Stateabb[ na.omit( match ) ]
There's a small difference between match and amatch in how they calculate the distance from one word to another. See P25-26 here http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
You can also use base::abbreviate if you don't have US state names. This won't give you equally sized abbreviations unless you increase minlength.
state.name %>% base::abbreviate(minlength = 1)
Here is another way of doing it in case you have more than one state in your data and you want to replace the names with the corresponding abbreviations.
# creating a vector of state names
states_df <- c("Alabama","California","Nevada","New York",
               "Oregon","Texas","Utah","Washington")
states_df <- as.data.frame(states_df)
The output is
> print(states_df)
states_df
1 Alabama
2 California
3 Nevada
4 New York
5 Oregon
6 Texas
7 Utah
8 Washington
Now, using the built-in state.abb constant with match(), you can easily convert the names into abbreviations, and vice versa.
states_df$state_code <- state.abb[match(states_df$states_df, state.name)]
> print(states_df)
states_df state_code
1 Alabama AL
2 California CA
3 Nevada NV
4 New York NY
5 Oregon OR
6 Texas TX
7 Utah UT
8 Washington WA
If matching state names to abbreviations or the other way around is something you have to do frequently, you could put Aniko's solution in a function in a .Rprofile or a package:
state_to_st <- function(x){
  c(state.abb, 'DC')[match(x, c(state.name, 'District of Columbia'))]
}
st_to_state <- function(x){
  c(state.name, 'District of Columbia')[match(x, c(state.abb, 'DC'))]
}
Using that function as a part of a dplyr chain:
enframe(state.name, value = 'state_name') %>%
  mutate(state_abbr = state_to_st(state_name))
