I'm trying to extract the city and state from the address column into two separate columns, City and State, in R. This is what my data looks like:
library(tidyr) # separate() lives here; tidyr also re-exports the %>% pipe

df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL",
                             "Baltimore MD", "Philadelphia, PA", "Trenton, NJ")) %>%
  separate(address, c("City", "State"), sep = ",")
I tried using the separate function, but that only handles the rows with commas. Any ideas on how to do this for both cases?
There is a pattern at the end (space, letter, letter) which I could exploit and then remove any commas, but I'm not sure how the syntax would work using grep.
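For what it's worth, that trailing space-letter-letter pattern can be written as a single regular expression. Here is a minimal sketch with tidyr::extract; the regex and the column names are only illustrative:

library(tidyr)

df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL",
                             "Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
# lazily capture the city, skip an optional comma plus whitespace,
# then capture the final two capital letters
extract(df, address, into = c("City", "State"),
        regex = "(.*?),?\\s+([A-Z]{2})$")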
Starting from your df:
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
> df
address
1 Los Angeles, CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia, PA
6 Trenton, NJ
It's possible to use gsub to extract the two pieces like this:
> city <- gsub(",", "", gsub("(.*).{3}", "\\1", df[,1])) # drop the last 3 chars, then any comma
> city
[1] "Los Angeles" "Pittsburgh" "Miami" "Baltimore" "Philadelphia"
[6] "Trenton"
> state <- gsub(".*(\\w{2})", "\\1", df[,1]) # keep only the final two word characters
> state
[1] "CA" "PA" "FL" "MD" "PA" "NJ"
df <- data.frame(City = city, State = state)
> df
City State
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
This is a little unorthodox, but it works well. It assumes that all state abbreviations are two characters long and that there is at least one space between the city and the state. Commas are ignored.
df <- data.frame(address = c("Los Angeles, CA", "Pittsburgh PA", "Miami FL","Baltimore MD", "Philadelphia, PA", "Trenton, NJ"))
# remove the comma, then take everything except the last 3 characters as the city
df$city <- substring(sub(",", "", df$address), 1, nchar(sub(",", "", df$address)) - 3)
# the state is always the last two characters
df$state <- substring(as.character(df$address), nchar(as.character(df$address)) - 1, nchar(as.character(df$address)))
df <- within(df,rm(address))
output:
city state
1 Los Angeles CA
2 Pittsburgh PA
3 Miami FL
4 Baltimore MD
5 Philadelphia PA
6 Trenton NJ
Need to ID states from mixed location data
I need to search for the 50 state abbreviations and the 50 full state names, and return the state abbreviation.
N <- 1:10
Loc <- c("Los Angeles, CA", "Manhattan, NY", "Florida, USA", "Chicago, IL" , "Houston, TX",
+ "Texas, USA", "Corona, CA", "Georgia, USA", "WV NY NJ", "qwerty uy PO DOPL JKF" )
df <- data.frame(N, Loc)
# Objective: create a variable State such that
# State contains the abbreviated names of states from Loc:
# for "Los Angeles, CA", State = CA
# for "Florida, USA", State = FL
# for "WV NY NJ", State = NA
# for "qwerty NJuy PO DOPL JKF", State = NA (in spite of containing the
#   string "NJ", it is not wrapped in spaces)
# End result should be Newdf
State <- c("CA", "NY", "FL", "IL", "TX","TX", "CA", "GA", NA, NA)
Newdf <- data.frame(N, Loc, State)
> Newdf
N Loc State
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
Is there a package for this, or can a loop be written? Even if the approach could be demonstrated with a few states, that would be sufficient; I will post the full solution when I get to it. By the way, this is for a Twitter dataset downloaded using the rtweet package, and the variable is place_full_name.
There are built-in constants in R, state.abb and state.name, which can be used.
vars <- stringr::str_extract(df$Loc, paste0('\\b', c(state.abb, state.name),
                                            '\\b', collapse = '|'))
#[1] "CA" "NY" "Florida" "IL" "TX" "Texas" "CA" "Georgia" "WV" NA
If you want everything as abbreviations, we can go further and do:
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" "WV" NA
However, we can see that for the 9th row you expect NA, yet this returns "WV", because it is a valid state abbreviation. In such cases, you need to prepare rules strict enough that they extract only genuine state references and nothing else.
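One possible stricter rule (an assumption on my part, not part of the answer above) is to accept an abbreviation only when it follows a comma at the end of the string, and a full name only when it sits immediately before ", USA":

# stricter matching: abbreviation after a trailing comma, or full name before ", USA"
abb_pat  <- paste0(',\\s*(', paste(state.abb,  collapse = '|'), ')$')
name_pat <- paste0('(',      paste(state.name, collapse = '|'), '),\\s*USA$')
vars <- dplyr::coalesce(stringr::str_match(df$Loc, abb_pat)[, 2],
                        stringr::str_match(df$Loc, name_pat)[, 2])
inds <- vars %in% state.name
vars[inds] <- state.abb[match(vars[inds], state.name)]
vars
#[1] "CA" "NY" "FL" "IL" "TX" "TX" "CA" "GA" NA  NA

With such a rule, the free-floating "WV" in row 9 no longer matches.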
Utilising the built-in R constants, state.abb and state.name, we can try to extract these from the Loc with regular expressions.
state.abbs <- sub('.+, ([A-Z]{2})', '\\1', df$Loc)
state.names <- sub('^(.+),.+', '\\1', df$Loc)
Now, if an extracted abbreviation is not one of the built-in ones, we use match to find the position of our state.names value in the built-in state.name vector, and use that position to index state.abb; otherwise we keep the abbreviation we already have. Rows that match neither return NA.
df$state.abb <- ifelse(!state.abbs %in% state.abb,
                       state.abb[match(state.names, state.name)], state.abbs)
df
N Loc state.abb
1 1 Los Angeles, CA CA
2 2 Manhattan, NY NY
3 3 Florida, USA FL
4 4 Chicago, IL IL
5 5 Houston, TX TX
6 6 Texas, USA TX
7 7 Corona, CA CA
8 8 Georgia, USA GA
9 9 WV NY NJ <NA>
10 10 qwerty uy PO DOPL JKF <NA>
I run into problems assigning a county to some city places. When querying via the acs package:
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
As the output shows, "New York", for instance, spans a whole list of counties. So do Los Angeles, Portland, Oklahoma, Columbus, etc. How can such data be assigned to a single county?
The following code is currently used to match county.name with the corresponding county FIPS code. Unfortunately, it only works when the query returns a single county name.
Script
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
library(tigris)
library(acs)
library(dplyr)   # for bind_rows() and left_join() below
data(fips_codes) # FIPS codes with state, code, county information
GeoLookup <- lapply(dat, function(x) {
  geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2, ]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
In order to retain the data, the left_join might better match on a county.name that contains the place.name (without the trailing "city"/"village" part of the name), or choose the first county by default. It would be great to see how this could be done.
In general: I assume there's no better way than this approach?
Thanks for your help!
What about something like the code below to create a "long" data frame for joining? We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame, where each county.name now gets its own row.
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
  group_by(state.name, place.name) %>%
  mutate(county.name = strsplit(county.name, ", ")) %>%
  unnest(county.name) # newer tidyr (>= 1.0) expects the column to unnest to be named
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York NA <NA> <NA>
2 36 New York 51000 New York city Bronx County
3 36 New York 51000 New York city Kings County
4 36 New York 51000 New York city New York County
5 36 New York 51000 New York city Queens County
6 36 New York 51000 New York city Richmond County
7 36 New York 51011 New York Mills village Oneida County
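With one county per row, the left_join from the question should now retain every county. A sketch, reusing the fips_codes renaming from the question:

data(fips_codes) # from tigris, as in the question
colnames(fips_codes) <- c("state.abb", "statefips", "state.name",
                          "countyfips", "county.name")
dat %>% left_join(fips_codes, by = c("state.name", "county.name"))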
UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
df <- map_df(strsplit(dat, ", "), function(x) {
  geo.lookup(state = x[2], place = x[1])[-1, ] %>%
    group_by(state.name, place.name) %>%
    mutate(county.name = strsplit(county.name, ", ")) %>%
    unnest(county.name)
})
df
state state.name place place.name county.name
1 36 New York 51000 New York city Bronx County
2 36 New York 51000 New York city Kings County
3 36 New York 51000 New York city New York County
4 36 New York 51000 New York city Queens County
5 36 New York 51000 New York city Richmond County
6 36 New York 51011 New York Mills village Oneida County
7 25 Massachusetts 7000 Boston city Suffolk County
8 25 Massachusetts 7000 Boston city Suffolk County
9 6 California 20802 East Los Angeles CDP Los Angeles County
10 6 California 39612 Lake Los Angeles CDP Los Angeles County
11 6 California 44000 Los Angeles city Los Angeles County
12 48 Texas 19000 Dallas city Collin County
13 48 Texas 19000 Dallas city Dallas County
14 48 Texas 19000 Dallas city Denton County
15 48 Texas 19000 Dallas city Kaufman County
16 48 Texas 19000 Dallas city Rockwall County
17 48 Texas 40516 Lake Dallas city Denton County
18 6 California 20956 East Palo Alto city San Mateo County
19 6 California 55282 Palo Alto city Santa Clara County
UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....
map_df(strsplit(dat, ", "), function(x) {
  geo.lookup(state = x[2], place = x[1])[-1, ] %>%
    group_by(state.name, place.name) %>%
    mutate(county.name = strsplit(county.name, ", ")) %>%
    unnest(county.name) %>%
    # keep the county whose name matches the place, or the first county otherwise
    slice(max(1, which(grepl(sub(" [A-Za-z]*$", "", place.name),
                             gsub(" County", "", county.name))), na.rm = TRUE))
})
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York 51000 New York city New York County
2 36 New York 51011 New York Mills village Oneida County
3 25 Massachusetts 7000 Boston city Suffolk County
4 6 California 20802 East Los Angeles CDP Los Angeles County
5 6 California 39612 Lake Los Angeles CDP Los Angeles County
6 6 California 44000 Los Angeles city Los Angeles County
7 48 Texas 19000 Dallas city Dallas County
8 48 Texas 40516 Lake Dallas city Denton County
9 6 California 20956 East Palo Alto city San Mateo County
10 6 California 55282 Palo Alto city Santa Clara County
Could you prep the data by using something like the code below?
library(stringr) # str_trim() and str_split() are used below

new_york_data <- geo.lookup(state = "NY", place = "New York")
prep_data <- function(full_data){
  output <- data.frame()
  for(row in 1:nrow(full_data)){
    new_rows <- replicateCounty(full_data[row, ])
    output <- plyr::rbind.fill(output, new_rows)
  }
  return(output)
}
replicateCounty <- function(row){
  counties <- str_trim(unlist(str_split(row$county.name, ",")))
  output <- data.frame(state = row$state,
                       state.name = row$state.name,
                       county.name = counties,
                       place = row$place,
                       place.name = row$place.name)
  return(output)
}
prep_data(new_york_data)
It's a little messy, and you'll need the plyr and stringr packages. Once you prep the data, you should be able to join on it.
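For example, the join step might look like this (a sketch reusing the fips_codes renaming from the question; treat the column order as an assumption about your tigris version):

library(dplyr)
library(tigris)
data(fips_codes)
colnames(fips_codes) <- c("state.abb", "statefips", "state.name",
                          "countyfips", "county.name")
prep_data(new_york_data) %>%
  left_join(fips_codes, by = c("state.name", "county.name"))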
I am relatively new to R.
I have a dataframe locs that has one variable, V1, and looks like this:
V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal
and another dataframe cities that has two variables that look like this:
city country
edmonton canada
san carlos spain
los angeles united states
santa maria united states
tokyo japan
madrid spain
santa maria portugal
lisbon portugal
I want to create two new variables in locs that capture any string match of cities$city within V1, so that locs looks like this:
V1                                                            city                 country
edmonton general hospital                                     edmonton             canada
cardiovascular institute, hospital san carlos, madrid spain   san carlos, madrid   spain
hospital of santa maria, lisbon, portugal                     santa maria, lisbon  portugal, united states
A few things to note: V1 may match multiple cities and countries. Also, if there is a repeat country (for instance, both san carlos and madrid are in spain), then I only want one instance of the country.
Please advise.
Thanks.
A solution using tidyverse and stringr. locs2 is the final output.
library(tidyverse)
library(stringr)
locs2 <- locs %>%
  rowwise() %>%
  mutate(city = list(str_match(V1, cities$city))) %>%
  unnest() %>%
  drop_na(city) %>%
  left_join(cities, by = "city") %>%
  group_by(V1) %>%
  summarise_all(funs(toString(sort(unique(.))))) # funs() is superseded by list(~ ...) in newer dplyr
Result
locs2 %>% as.data.frame()
V1 city country
1 cardiovascular institute, hospital san carlos, madrid spain madrid, san carlos spain
2 edmonton general hospital edmonton canada
3 hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
DATA
library(tidyverse)
locs <- data_frame(V1 = c("edmonton general hospital",
"cardiovascular institute, hospital san carlos, madrid spain",
"hospital of santa maria, lisbon, portugal"))
cities <- read.table(text = "city country
edmonton canada
'san carlos' spain
'los angeles' 'united states'
'santa maria' 'united states'
tokyo japan
madrid spain
'santa maria' portugal
lisbon portugal",
header = TRUE, stringsAsFactors = FALSE)
I've been trying to figure out how to clean and edit a column in my dataset.
The dataset I am using is supposed to cover only the city of San Francisco. A column in the dataset called "city" contains multiple different spellings of San Francisco, as well as other cities. Here is what it looks like:
table(sf$city)
""                            30401
Brentwood                        18
CA                              370
DALY CITY                         0
FOSTER CITY                       0
HAYWARD                           0
Novato                            0
Oakland                          40
OAKLAND                           0
S F                               0
S.F.                          31428
s.F. Ca                          12
SAN BRUNO                         0
SAN FRANCICSO                   221
San Franciisco                   54
san francisco                    20
san Francisco                   284
San francisco                     0
San Francisco                 78050
SAN FRANCISCO                 16603
san Francisco CA                  6
San Francisco,                   12
San Francisco, Ca                 4
San Francisco, CA                72
San Francisco, CA 94132           0
San Franciscvo                    0
San Francsico                     2
San Franicisco                   41
Sand Francisco                   30
sf                               17
Sf                              214
SF                            81226
SF , CA                           1
SF CA 94133                       0
SF, CA                            9
SF, CA 94110                     38
SF, CA 94115                      4
SF.                            1656
SF`                              31
SO. SAN FRANCISCO                 0
SO.S.F.                           6
What I am trying to do is change sf$city so that all of these variants fall under one city, "San Francisco", so that when I type table(sf$city), it only shows San Francisco.
Could I subset? Something like:
sf$city = subset(sf, city == "S.F." & "s.F. Ca" & "SAN FRANCICSO" & ...
And subset all the city values I want? Or will this distort and mess up my data?
I would try regular expressions with agrep and grep.
Example data:
d <- c("Brentwood", "CA", "DALY CITY", "FOSTER CITY", "HAYWARD", "Novato",
"Oakland", "OAKLAND", "S F", "S.F.", "s.F. Ca", "SAN BRUNO",
"SAN FRANCICSO", "San Franciisco", "san francisco", "san Francisco",
"San francisco", "San Francisco", "SAN FRANCISCO", "san Francisco CA",
"San Francisco,", "San Francisco, Ca", "San Francisco, CA", "San Francisco, CA 94132",
"San Franciscvo", "San Francsico", "San Franicisco", "Sand Francisco",
"sf", "Sf", "SF", "SF , CA", "SF CA", "94133", "SF, CA", "SF, CA 94110",
"SF, CA 94115", "SF.", "SF`", "SO. SAN FRANCISCO", "SO.S.F.")
You can target words like "San Francisco" with agrep, where the default of max.dist = 0.1 works well enough here. You can then target the S.F. variants using grep:
d[agrep("San Francisco", d, ignore.case = TRUE, max.dist = 0.1)] <- "San Francisco"
d[grep("\\bS[. ]?F\\.?\\b", d, ignore.case = TRUE, perl = TRUE)] <- "San Francisco"
# [1] "Brentwood" "CA" "DALY CITY" "FOSTER CITY"
# [5] "HAYWARD" "Novato" "Oakland" "OAKLAND"
# [9] "San Francisco" "San Francisco" "San Francisco" "SAN BRUNO"
#[13] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[17] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[21] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[25] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[29] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[33] "San Francisco" "94133" "San Francisco" "San Francisco"
#[37] "San Francisco" "San Francisco" "San Francisco" "San Francisco"
#[41] "San Francisco"
adist is another option for targeting words like "San Francisco". I found the following settings to work well; they can pick up even "San Fran":
d[adist("San Francisco", d, ignore.case = TRUE,
cost = c(del = 0.5, ins = 0.5, sub = 3)) < 3] <- "San Francisco"
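To see why those weights work, you can inspect the distances directly: insertions and deletions cost 0.5 each, so misspellings stay under the cutoff of 3, while anything needing a substitution is pushed well past it. (The argument is officially named costs; the call above relies on R's partial argument matching.)

# weighted edit distances against a few of the observed strings
adist("San Francisco", c("San Francsico", "Sand Francisco", "SAN BRUNO"),
      ignore.case = TRUE, costs = c(del = 0.5, ins = 0.5, sub = 3))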
To riff on @jota's answer, you could also take the resulting data set and run it through the Google Maps API as shown here: https://gist.github.com/josecarlosgonz/6417633
Specifically, using the functions available at that link, you could take the cleaned vector d from above and run:
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
Which will give you the following output:
# V1 V2 V3 V4
# 1 36.0331164 -86.7827772 APPROXIMATE Brentwood, TN, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
As it looks like you know that all of your locations are in CA, you may also want to append ", CA" to the elements of your vector that lack it, as shown here:
d[grep("CA", d, invert = TRUE)] <- paste0(d[grep("CA", d, invert = TRUE)], ", CA")
locations <- ldply(d, function(x) geoCode(x))
head(locations, 10)
As shown below, this will make sure that Google places Brentwood in CA.
The advantage of this approach is that you will end up with normalized cities in V4, which could be helpful when it comes to filtering and other things.
# V1 V2 V3 V4
# 1 37.931868 -121.6957863 APPROXIMATE Brentwood, CA 94513, USA
# 2 36.778261 -119.4179324 APPROXIMATE California, USA
# 3 37.6879241 -122.4702079 APPROXIMATE Daly City, CA, USA
# 4 37.5585465 -122.2710788 APPROXIMATE Foster City, CA, USA
# 5 37.6688205 -122.0807964 APPROXIMATE Hayward, CA, USA
# 6 38.1074198 -122.5697032 APPROXIMATE Novato, CA, USA
# 7 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 8 37.8043637 -122.2711137 APPROXIMATE Oakland, CA, USA
# 9 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
# 10 37.7749295 -122.4194155 APPROXIMATE San Francisco, CA, USA
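Since the normalised place ends up in V4, filtering back down to San Francisco rows is then a one-liner (V4 as named in the output above):

# keep only rows that Google normalised to San Francisco
sf_rows <- subset(locations, grepl("^San Francisco", V4))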
NOTE: Google has a rate limit on its API. If you want to avoid registering and getting an API key, you will want to chunk the ldply calls into 10-second bites, as suggested in the comment at the GitHub link above.
To overwrite sf$city to be "San Francisco" for every entry, here is the typical method:
sf$city <- "San Francisco"
However, if some of your observations are not San Francisco, and you would like to drop these, you will want to drop these first. Here is a start:
# drop non-SF observations
sfReal <- sf[!(tolower(sf$city) %in% c("daly city", "brentwood", "hayward", "oakland")), ] # note the trailing comma: we drop rows, not columns
My geography is not the best, so I may be missing some. Alternatively, you could use %in% to include only those observations that are San Francisco. Given the set you provided above, I doubt this is the case.
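If you prefer a whitelist over a blacklist, a sketch of the %in% route might look like this (the vector of spellings is illustrative, not exhaustive):

# keep only rows whose lower-cased, trimmed city is a known SF spelling
sf_spellings <- c("san francisco", "sf", "s.f.", "s f", "sf.")
sfOnly <- sf[tolower(trimws(sf$city)) %in% sf_spellings, ]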
In the future, if this is a repeated task, you should look into regular expressions and grep. This is an amazing tool that will pay gigantic dividends for string manipulation tasks. @jota provides a great method for this in the answer above.
While web scraping, I came across the following problem, for which I think there might be a better solution:
Having this data:
dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"))
query
1 Washington, USA
2 Frankfurt, Germany
I would like to query, e.g., the Google Maps API and return the formatted address(es). There might be multiple formatted addresses per query. The result should be the following:
query formatted_address
1 Washington, USA Washington, DC, USA
2 Washington, USA Washington, UT, USA
3 Washington, USA Washington, VA 22747, USA
4 Washington, USA Washington, IA 52353, USA
5 Washington, USA Washington, GA 30673, USA
6 Washington, USA Washington, PA 15301, USA
7 Frankfurt, Germany Frankfurt, Germany
What I do right now is this:
require(RCurl)
require(rvest)
require(magrittr)
build_url <- function(x, base_url = "https://maps.googleapis.com/maps/api/geocode/xml?address="){
  paste0(base_url, RCurl::curlEscape(x))
}
l <- lapply(dat$query, function(q){
  formatted_address <- q %>% build_url %>% read_xml %>% xml_nodes("formatted_address") %>% xml_text
  data.frame(query = q, formatted_address)
})
do.call(rbind, l) # This can be done via data.table::rbindlist as well
Is there a better solution? Maybe something more in a data.table or dplyr style?
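For comparison, here is roughly the same loop in a purrr/dplyr flavour (a sketch reusing the build_url helper above; not necessarily better, just more pipe-styled):

library(dplyr)
library(purrr)
library(rvest)
library(xml2)

result <- map_df(as.character(dat$query), function(q) {
  data.frame(query = q,
             formatted_address = q %>%
               build_url %>%
               read_xml %>%
               xml_nodes("formatted_address") %>%
               xml_text)
})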
I've written the package googleway to access the Google Maps API with a valid API key (so if your data is greater than 2,500 items, you can pay for an API key).
To get the address details, use google_geocode():
library(googleway)
key <- "your_api_key"
dat <- data.frame(query = c("Washington, USA", "Frankfurt, Germany"))
## To get all the data:
res <- apply(dat, 1, function(x){
  google_geocode(address = x["query"],
                 key = key) ## use simplify = F to return JSON
})
## to access the 'formatted address' part, see
res[[1]]$results$formatted_address
# [1] "Washington, DC, USA" "Washington, UT, USA" "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"
## so to get everything as a list
lapply(res, function(x){
  x$results$formatted_address
})
# [[1]]
# [1] "Washington, DC, USA" "Washington, UT, USA" "Washington, VA 22747, USA" "Washington, IA 52353, USA"
# [5] "Washington, GA 30673, USA" "Washington, PA 15301, USA"
#
# [[2]]
# [1] "Frankfurt, Germany"
## and to put back onto your original data.frame:
lst <- lapply(seq_along(res), function(x){
  data.frame(query = dat[x, "query"],
             formatted_address = res[[x]]$results$formatted_address)
})
data.table::rbindlist(lst)
# query formatted_address
# 1: Washington, USA Washington, DC, USA
# 2: Washington, USA Washington, UT, USA
# 3: Washington, USA Washington, VA 22747, USA
# 4: Washington, USA Washington, IA 52353, USA
# 5: Washington, USA Washington, GA 30673, USA
# 6: Washington, USA Washington, PA 15301, USA
# 7: Frankfurt, Germany Frankfurt, Germany