How to join two datasets in R by matching values from one dataset to another?

I have two data frames in R, df1 and df2, as follows:
**df1**
Cust_id Cust_name Cust_dob Cust_address
1 Andrew 10/11/1990 New York
2 Dillain 01/02/1970 San Francisco
3 Alma 07/11/1985 Miami
4 Wesney 21/10/1979 New York
5 Kiko 10/12/1994 Miami
**df2**
Cust_address Latitude Longitude
New York 40.7128 74.0060
San Francisco 37.7749 122.4194
Miami 25.7617 80.1918
Texas 31.9686 99.9018
Dallas 32.7767 96.7970
I want to join these datasets so that the Latitude and Longitude columns from df2 are matched to the Cust_address column of df1, giving the following result:
**df3**
Cust_id Cust_name Cust_dob Cust_address Latitude Longitude
1 Andrew 10/11/1990 New York 40.7128 74.0060
2 Dillain 01/02/1970 San Francisco 37.7749 122.4194
3 Alma 07/11/1985 Miami 25.7617 80.1918
4 Wesney 21/10/1979 New York 40.7128 74.0060
5 Kiko 10/12/1994 Miami 25.7617 80.1918
I have tried using joins but cannot get the result that I want. I would really appreciate it if someone could help me. I am new to R. Thank you very much. This is what I have tried:
df3 = merge(x=df1,y=df2,by="Cust_address",all=TRUE)

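We could use inner_join().
inner_join(): keeps only the rows of x that have a matching key in y. Every Cust_address in df1 also appears in df2, so all five customers are kept, while the unmatched Texas and Dallas rows of df2 are dropped.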
library(dplyr)
df3 <- inner_join(df1, df2, by="Cust_address")
Cust_id Cust_name Cust_dob Cust_address Latitude Longitude
1 1 Andrew 10/11/1990 New York 40.7128 74.0060
2 2 Dillain 01/02/1970 San Francisco 37.7749 122.4194
3 3 Alma 07/11/1985 Miami 25.7617 80.1918
4 4 Wesney 21/10/1979 New York 40.7128 74.0060
5 5 Kiko 10/12/1994 Miami 25.7617 80.1918
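For comparison, here is a minimal base R sketch of why the merge(..., all=TRUE) attempt in the question returns extra rows, and how a left join avoids that. It assumes the columns are plain character and numeric vectors typed in by hand from the tables above; the reordering step at the end is my own addition to reproduce the desired df3 layout.

```r
# Sample data copied from the question
df1 <- data.frame(
  Cust_id      = 1:5,
  Cust_name    = c("Andrew", "Dillain", "Alma", "Wesney", "Kiko"),
  Cust_dob     = c("10/11/1990", "01/02/1970", "07/11/1985",
                   "21/10/1979", "10/12/1994"),
  Cust_address = c("New York", "San Francisco", "Miami", "New York", "Miami"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Cust_address = c("New York", "San Francisco", "Miami", "Texas", "Dallas"),
  Latitude     = c(40.7128, 37.7749, 25.7617, 31.9686, 32.7767),
  Longitude    = c(74.0060, 122.4194, 80.1918, 99.9018, 96.7970),
  stringsAsFactors = FALSE
)

# all = TRUE is a full outer join: the unmatched Texas and Dallas rows
# of df2 are kept, with NA in the customer columns
full <- merge(df1, df2, by = "Cust_address", all = TRUE)

# all.x = TRUE is a left join: every row of df1, plus matching df2 columns
df3 <- merge(df1, df2, by = "Cust_address", all.x = TRUE)

# merge() sorts by the key and moves it to the front, so restore the
# original row and column order
df3 <- df3[order(df3$Cust_id), c(names(df1), "Latitude", "Longitude")]
rownames(df3) <- NULL
df3
```

With dplyr loaded, left_join(df1, df2, by = "Cust_address") should give the same rows without the reordering step, and unlike inner_join() it would keep a customer whose address is missing from df2.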

Related

Map zip codes to their respective city and state in R?

I have a data frame of zip codes that I'm looking to map to a city & state for each specific zip code. Currently, I have played around with the zipcode package a bit but I'm not sure that can solve this specific issue.
Here's sample data of what I have now:
str(all_key$zip)
chr [1:406] "43031" "24517" "43224" "43832" "53022" "60185" "84104" "43081"
"85226" "85193" "54656" "43215" "94533" "95826" "64804" "49548" "54467"
The expected output would be adding a city & state column to each row of the data frame referring to the individual zips:
head(all_key)
zip city state
1 43031 city1 state1
2 24517 city2 state2
3 43224 city3 state3
4 43832 city4 state4
5 53022 city5 state5
6 60185 city6 state6
Thanks in advance for your help.
Another Update - February 2023
Another package (zipcodeR) has been added that makes this easier. See below.
Answer updated - January 2020
The zipcode package seems to have disappeared, so this answer has been updated to show how to add lat-lon from an external file. New answer at bottom.
Original answer
You can get the data from the zipcode package and just do a merge to look things up.
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
ZC = data.frame(zip)
library(zipcode)
data(zipcode)
merge(ZC, zipcode)
zip city state latitude longitude
1 24517 Altavista VA 37.12754 -79.27409
2 43031 Johnstown OH 40.15198 -82.66944
3 43081 Westerville OH 40.10951 -82.91606
4 43215 Columbus OH 39.96513 -83.00431
5 43224 Columbus OH 40.03991 -82.96772
6 43832 Newcomerstown OH 40.27738 -81.59662
7 49548 Grand Rapids MI 42.86823 -85.66391
8 53022 Germantown WI 43.21916 -88.12043
9 54467 Plover WI 44.45228 -89.54399
10 54656 Sparta WI 43.96977 -90.80796
11 60185 West Chicago IL 41.89198 -88.20502
12 64804 Joplin MO 37.04716 -94.51124
13 84104 Salt Lake City UT 40.75063 -111.94077
14 85193 Casa Grande AZ 32.86000 -111.83000
15 85226 Chandler AZ 33.31221 -111.93177
16 94533 Fairfield CA 38.26958 -122.03701
17 95826 Sacramento CA 38.55010 -121.37492
If you need to keep the rows in the same order, you can just set the rownames on the zipcode data and use that to select the desired rows and columns.
rownames(zipcode) = zipcode$zip
zipcode[zip, 1:3]
zip city state
43031 43031 Johnstown OH
24517 24517 Altavista VA
43224 43224 Columbus OH
43832 43832 Newcomerstown OH
53022 53022 Germantown WI
60185 60185 West Chicago IL
84104 84104 Salt Lake City UT
43081 43081 Westerville OH
85226 85226 Chandler AZ
85193 85193 Casa Grande AZ
54656 54656 Sparta WI
43215 43215 Columbus OH
94533 94533 Fairfield CA
95826 95826 Sacramento CA
64804 64804 Joplin MO
49548 49548 Grand Rapids MI
54467 54467 Plover WI
Updated Answer - January 2020
Since the zipcode package has disappeared, this shows how to add lat-lon information from a downloaded data set. The file that I am using exists today but the method should work for other files. See the GIS StackExchange for some leads on where to download data.
## Original Data to match
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
ZC = data.frame(zip)
## Download source file, unzip and extract into table
ZipCodeSourceFile = "http://download.geonames.org/export/zip/US.zip"
temp <- tempfile()
download.file(ZipCodeSourceFile , temp)
ZipCodes <- read.table(unz(temp, "US.txt"), sep="\t")
unlink(temp)
names(ZipCodes) = c("CountryCode", "zip", "PlaceName",
"AdminName1", "AdminCode1", "AdminName2", "AdminCode2",
"AdminName3", "AdminCode3", "latitude", "longitude", "accuracy")
## merge extra info onto original data
ZC_Info = merge(ZC, ZipCodes[,c(2:6,10:11)])
head(ZC_Info)
zip PlaceName AdminName1 AdminCode1 AdminName2 latitude longitude
1 24517 Altavista Virginia VA Campbell 37.1222 -79.2911
2 43031 Johnstown Ohio OH Licking 40.1445 -82.6973
3 43081 Westerville Ohio OH Franklin 40.1146 -82.9105
4 43215 Columbus Ohio OH Franklin 39.9671 -83.0044
5 43224 Columbus Ohio OH Franklin 40.0425 -82.9689
6 43832 Newcomerstown Ohio OH Tuscarawas 40.2739 -81.5940
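One thing to watch for when reading a zip code file this way: read.table() guesses column types, so a column of zip codes is read as numbers and leading zeros are lost. The sketch below illustrates this on two hand-typed tab-separated rows (the zip and place pairs are taken from outputs shown elsewhere in this thread). Whether you need colClasses and quote = "" for the real download is an assumption to check against your own file.

```r
# Two tab-separated sample rows standing in for the downloaded file
sample_txt <- "00210\tPortsmouth\n43031\tJohnstown"

# Default behaviour: the zip column is guessed to be integer
as_number <- read.table(text = sample_txt, sep = "\t")
names(as_number) <- c("zip", "PlaceName")

# Forcing character columns keeps the leading zeros; quote = "" stops
# apostrophes in place names from being treated as quote characters
as_text <- read.table(text = sample_txt, sep = "\t", quote = "",
                      colClasses = "character")
names(as_text) <- c("zip", "PlaceName")

as_number$zip
as_text$zip
```

The same two arguments can be passed to the read.table(unz(temp, "US.txt"), ...) call above if the zips you need to match start with a zero.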
Second Update - February 2023
Another package, zipcodeR, is now available that makes this easier. Here is some simple code to demonstrate it.
library(zipcodeR)
zip = c("43031", "24517", "43224", "43832", "53022",
"60185", "84104", "43081", "85226", "85193", "54656",
"43215", "94533", "95826", "64804", "49548", "54467")
reverse_zipcode(zip)[,c(1,3,7)]
# A tibble: 17 × 3
zipcode major_city state
<chr> <chr> <chr>
1 85193 Casa Grande AZ
2 85226 Chandler AZ
3 94533 Fairfield CA
4 95826 Sacramento CA
5 60185 West Chicago IL
6 49548 Grand Rapids MI
7 64804 Joplin MO
8 43031 Johnstown OH
9 43081 Westerville OH
10 43215 Columbus OH
11 43224 Columbus OH
12 43832 Newcomerstown OH
13 84104 Salt Lake City UT
14 24517 Altavista VA
15 53022 Germantown WI
16 54467 Plover WI
17 54656 Sparta WI
You can still use the "zipcode" package by downloading it from the archives:
https://cran.r-project.org/src/contrib/Archive/zipcode/
Once you download the tar.gz file to your computer, you can install it from the RStudio GUI Packages pane. After clicking "Install", you can change the option to "Package Archive File" and point to the downloaded tar.gz file.
Install and use the usa package, which contains a tibble of zip codes with latitudes and longitudes taken from the archived zipcode package.
library(usa)
zcs <- usa::zipcodes
head(zcs)
# A tibble: 6 x 5
zip city state lat long
<chr> <chr> <chr> <dbl> <dbl>
1 00210 Portsmouth NH 43.0 -71.0
2 00211 Portsmouth NH 43.0 -71.0
3 00212 Portsmouth NH 43.0 -71.0
4 00213 Portsmouth NH 43.0 -71.0
5 00214 Portsmouth NH 43.0 -71.0
6 00215 Portsmouth NH 43.0 -71.0
You can use the data frame in the R package zipcodeR.
To add the city and state to your data frame, you can select the variables you want from the data frame provided in zipcodeR (called zip_code_db), then join it with your data frame:
library(dplyr)
library(zipcodeR)
zip_code_db_selected =
zip_code_db %>%
select(zipcode, major_city, state)
all_key_with_city_st =
left_join(all_key, zip_code_db_selected, by = c("zip" = "zipcode"))

R: Mission impossible? How to assign "New York" to a county

I run into problems assigning a county to some city places. When querying via the acs package
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
As the output shows, "New York", for instance, spans several counties. So do Los Angeles, Portland, Oklahoma, Columbus etc. How can such a place be assigned to a single "county"?
The following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works when the query returns a single county name.
Script
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
library(tigris)
library(acs)
library(dplyr) # bind_rows(), left_join() and %>% used below
data(fips_codes) # FIPS codes with state, code, county information
GeoLookup <- lapply(dat,function(x) {
geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
In order to retain data, the left_join might better be matched as "look for the county.name that contains place.name (without the appended 'city' etc. in the name), or choose the first item by default". It would be great to see how this could be done.
In general: I assume, there's no better way than this approach?
Thanks for your help!
What about something like the code below to create a "long" data frame for joining? We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame where each county.name now gets its own row.
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York NA <NA> <NA>
2 36 New York 51000 New York city Bronx County
3 36 New York 51000 New York city Kings County
4 36 New York 51000 New York city New York County
5 36 New York 51000 New York city Queens County
6 36 New York 51000 New York city Richmond County
7 36 New York 51011 New York Mills village Oneida County
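If you would rather not depend on the tidyverse, the same "one row per county" reshaping can be sketched in base R. This is only an illustration on two rows typed in from the geo.lookup output above, not a call to the acs package; the object names are made up for the example.

```r
# Two rows copied by hand from the geo.lookup() output
dat <- data.frame(
  place.name  = c("New York city", "New York Mills village"),
  county.name = c("Bronx County, Kings County, New York County, Queens County, Richmond County",
                  "Oneida County"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into a character vector of counties
parts <- strsplit(dat$county.name, ", ", fixed = TRUE)

# Repeat each place once per county and stack the counties into one column
long <- data.frame(
  place.name  = rep(dat$place.name, lengths(parts)),
  county.name = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

Each county.name in long is now a single county, so it can be joined to a FIPS table on state and county name.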
UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
df <- map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
})
df
state state.name place place.name county.name
1 36 New York 51000 New York city Bronx County
2 36 New York 51000 New York city Kings County
3 36 New York 51000 New York city New York County
4 36 New York 51000 New York city Queens County
5 36 New York 51000 New York city Richmond County
6 36 New York 51011 New York Mills village Oneida County
7 25 Massachusetts 7000 Boston city Suffolk County
8 25 Massachusetts 7000 Boston city Suffolk County
9 6 California 20802 East Los Angeles CDP Los Angeles County
10 6 California 39612 Lake Los Angeles CDP Los Angeles County
11 6 California 44000 Los Angeles city Los Angeles County
12 48 Texas 19000 Dallas city Collin County
13 48 Texas 19000 Dallas city Dallas County
14 48 Texas 19000 Dallas city Denton County
15 48 Texas 19000 Dallas city Kaufman County
16 48 Texas 19000 Dallas city Rockwall County
17 48 Texas 40516 Lake Dallas city Denton County
18 6 California 20956 East Palo Alto city San Mateo County
19 6 California 55282 Palo Alto city Santa Clara County
UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....
map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest %>%
slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
})
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York 51000 New York city New York County
2 36 New York 51011 New York Mills village Oneida County
3 25 Massachusetts 7000 Boston city Suffolk County
4 6 California 20802 East Los Angeles CDP Los Angeles County
5 6 California 39612 Lake Los Angeles CDP Los Angeles County
6 6 California 44000 Los Angeles city Los Angeles County
7 48 Texas 19000 Dallas city Dallas County
8 48 Texas 40516 Lake Dallas city Denton County
9 6 California 20956 East Palo Alto city San Mateo County
10 6 California 55282 Palo Alto city Santa Clara County
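The selection rule itself (take the county named after the place, otherwise the first one listed) can also be written as a small standalone function, which makes it easier to test on individual cases. This is a sketch of the rule only: pick_county is a hypothetical helper, and it uses exact equality on the stripped names rather than the grepl matching in the pipeline above.

```r
# Hypothetical helper: choose one county for a place
pick_county <- function(place_name, county_names) {
  counties <- strsplit(county_names, ", ", fixed = TRUE)[[1]]
  # Drop the trailing type word: "New York city" -> "New York"
  place <- sub(" [A-Za-z]+$", "", place_name)
  # Drop the suffix: "New York County" -> "New York"
  bare <- sub(" (County|Parish)$", "", counties)
  hit <- which(bare == place)
  # Same-named county if there is one, otherwise the first county listed
  if (length(hit) > 0) counties[hit[1]] else counties[1]
}

pick_county("New York city",
            "Bronx County, Kings County, New York County, Queens County, Richmond County")
pick_county("Dallas city",
            "Collin County, Dallas County, Denton County, Kaufman County, Rockwall County")
pick_county("New York Mills village", "Oneida County")
```

Applied row by row to the original (unsplit) geo.lookup result, this leaves one county per place, which can then be joined to fips_codes.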
Could you prep the data by using something like the below code?
library(acs) # geo.lookup()
library(stringr) # str_split() and str_trim()
new_york_data <- geo.lookup(state = "NY", place = "New York")
prep_data <- function(full_data){
output <- data.frame()
for(row in 1:nrow(full_data)){
new_rows <- replicateCounty(full_data[row, ])
output <- plyr::rbind.fill(output, new_rows)
}
return(output)
}
replicateCounty <- function(row){
counties <- str_trim(unlist(str_split(row$county.name, ",")))
output <- data.frame(state = row$state,
state.name = row$state.name,
county.name = counties,
place = row$place,
place.name = row$place.name)
return(output)
}
prep_data(new_york_data)
It's a little messy, and you'll need the plyr and stringr packages. Once you prep the data, you should be able to join on it.

Using spread() in tidyr to pivot and drop NAs

I am using R and I have data like
California | Los Angeles
California | San Diego
California | San Francisco
New York | Albany
New York | New York City
which I would like to transform to
California | New York
Los Angeles | Albany
San Diego | New York City
San Francisco | NA
I am trying to use spread() in tidyr but can't quite get it to give me the output the way I need it. The closest I can come is
California | New York
Los Angeles | NA
San Diego | NA
San Francisco | NA
NA | Albany
NA | New York City
Can someone please help me get it in the desired format?
Here's how I do it in base:
df<-data.frame(v1=c(rep("California",3), rep("New York",2)), v2=c("Los Angeles", "San Diego", "San Francisco", "Albany", "New York City"))
cali<-as.character(df[df$v1=="California", 2])
ny<-as.character(df[df$v1=="New York", 2])
new <- data.frame(California=cali, NewYork=c(ny, NA))
new
California NewYork
1 Los Angeles Albany
2 San Diego New York City
3 San Francisco <NA>
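The approach above hardcodes the two states and the single NA of padding. A slightly more general base R sketch, assuming the same two-column layout (v1 for the state, v2 for the city), splits the cities by state and pads every column to the length of the longest one:

```r
df <- data.frame(
  v1 = c(rep("California", 3), rep("New York", 2)),
  v2 = c("Los Angeles", "San Diego", "San Francisco", "Albany", "New York City"),
  stringsAsFactors = FALSE
)

# One character vector of cities per state
cols <- split(df$v2, df$v1)

# Pad the shorter vectors with NA; assigning to length() extends with NA
n <- max(lengths(cols))
padded <- lapply(cols, function(x) { length(x) <- n; x })

# check.names = FALSE keeps "New York" as a column name with its space
wide <- do.call(data.frame,
                c(padded, list(check.names = FALSE, stringsAsFactors = FALSE)))
wide
```

This works for any number of states, at the cost of treating row position as meaningless: the cities in a given row have nothing in common beyond their rank within each state.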

How do I consolidate ddply across two columns?

I have some data for sites across a bunch of cities that looks about like this:
CITY STATE LAT LON SCORE
Jacksonville FL 30.328539 -81.65101 5
Jacksonville FL 30.392888 -81.67933 6
Jacksonville FL 30.268572 -81.73987 4
Jacksonville FL 30.348585 -81.49965 3
Lake Worth FL 26.579714 -80.07437 6
Lake Worth FL 26.609226 -80.12874 3
Miami FL 25.813808 -80.2058 3
Miami FL 25.753927 -80.27034 2
Miami FL 25.786326 -80.2029 6
Miami FL 25.817325 -80.19046 8
Miami FL 25.812625 -80.2369 9
Miami FL 25.885739 -80.23264 4
Miami FL 25.962069 -80.14465 5
I want to count the records for each city and average the score. I know I could do that with ddply if the cities were unique, but they aren't. There's a "Miami, KS" or something in there. So I need to do ddply on the combined city and state. Something like:
ddply(sometable, .(CITY, STATE), summarise,
mean.score=mean(SCORE),
record.count=length(SCORE)
)
Is there a way to do that? I also need to grab one of the lat/lon pairs for each city. Doesn't matter which one.
library(plyr)
ddply(data,c(.(CITY),.(STATE)),summarise,count=length(SCORE),mean=mean(SCORE))
or you can use:
library(data.table)
data <- data.table(data)
data[, list(count=length(SCORE), mean=mean(SCORE)), by=c("CITY", "STATE")]
or this:
aggregate(SCORE~CITY+STATE,data,function(x) cbind(length(x),mean(x)))
CITY STATE count mean
1 Jacksonville FL 4 4.500000
2 Lake Worth FL 2 4.500000
3 Miami FL 7 5.285714
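The question also asks for one lat/lon pair per city, which the snippets above do not return. A minimal base R sketch that takes the first pair in each CITY and STATE group is shown below; it uses a hand-typed subset of the sample rows, and the name sites is only for illustration.

```r
# Subset of the sample data from the question
sites <- data.frame(
  CITY  = c("Jacksonville", "Jacksonville", "Lake Worth", "Lake Worth",
            "Miami", "Miami"),
  STATE = "FL",
  LAT   = c(30.328539, 30.392888, 26.579714, 26.609226, 25.813808, 25.753927),
  LON   = c(-81.65101, -81.67933, -80.07437, -80.12874, -80.2058, -80.27034),
  SCORE = c(5, 6, 6, 3, 3, 2),
  stringsAsFactors = FALSE
)

# One data frame per CITY/STATE combination that actually occurs
groups <- split(sites, list(sites$CITY, sites$STATE), drop = TRUE)

# Summarise each group, keeping the first LAT/LON pair
summary_df <- do.call(rbind, lapply(groups, function(g) {
  data.frame(CITY = g$CITY[1], STATE = g$STATE[1],
             count = nrow(g), mean = mean(g$SCORE),
             LAT = g$LAT[1], LON = g$LON[1],
             stringsAsFactors = FALSE)
}))
rownames(summary_df) <- NULL
summary_df
```

With plyr, adding LAT=LAT[1], LON=LON[1] to the summarise arguments of the ddply call should have the same effect.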

Lookup values in a vectorized way

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. Now I also have another data frame with name and birthplace, but this data frame is much larger than the first, and contains some but not all of the names from the first data frame. How can I add a third column to the first table that is populated with birthplaces looked up using the second table?
What I have now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book/resource for using R 'properly'? I get the feeling that I generally think in the least computationally effective manner conceivable.
Thanks :)
See ?merge, which performs a database-like merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, replace = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
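Note that merge(d1, d2) is an inner join, so a name in the first table with no entry in the second would be dropped. Since the question says the second table contains only some of the names, a vectorized lookup with match() may be closer to what is wanted: it keeps every row and the original order, and fills in NA where there is no match. A small sketch, where "Zoe" is a made-up name added to show the unmatched case:

```r
d1 <- data.frame(ID = 1:3, Name = c("Bill", "Bob", "Zoe"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob"),
                 Birthplace = c("New York", "Tokyo", "Berlin"),
                 stringsAsFactors = FALSE)

# match() returns, for each d1$Name, its row position in d2 (or NA)
idx <- match(d1$Name, d2$Name)

# Index the lookup column with those positions in a single vectorized step
d1$Birthplace <- d2$Birthplace[idx]
d1
```

merge(d1, d2, all.x = TRUE) gives the same information as a left join, but sorted by Name rather than in the original row order.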

Resources