R: Mission impossible? How to assign "New York" to a county

I run into problems assigning a county to some city places. When querying via the acs package
> geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
, you can see that "New York", for instance, spans several counties. So do Los Angeles, Portland, Oklahoma, Columbus, etc. How can such a place be assigned to a single "county"?
The following code is currently used to match "county.name" with the corresponding county FIPS code. Unfortunately, it only works when the query returns exactly one county name.
Script
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
dat <- strsplit(dat, ",")
dat
library(tigris)
library(acs)
library(dplyr) # provides bind_rows(), left_join() and %>%
data(fips_codes) # FIPS codes with state, code, county information
GeoLookup <- lapply(dat,function(x) {
geo.lookup(state = trimws(x[2]), place = trimws(x[1]))[2,]
})
df <- bind_rows(GeoLookup)
#Rename cols to match
colnames(fips_codes) = c("state.abb", "statefips", "state.name", "countyfips", "county.name")
# Here is a problem, because it works with one item in "county.name" but not more than one (see output below).
df <- df %>% left_join(fips_codes, by = c("state.name", "county.name"))
df
Returns:
state state.name county.name place place.name state.abb statefips countyfips
1 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city <NA> <NA> <NA>
2 25 Massachusetts Suffolk County 7000 Boston city MA 25 025
3 6 California Los Angeles County 20802 East Los Angeles CDP CA 06 037
4 48 Texas Collin County, Dallas County, Denton County, Kaufman County, Rockwall County 19000 Dallas city <NA> <NA> <NA>
5 6 California San Mateo County 20956 East Palo Alto city CA 06 081
In order to retain data, the left_join might better be matched as "look for the county.name that contains the place.name (without the trailing "city" etc. in the name), or choose the first item by default". It would be great to see how this could be done.
In general, I assume there's no better way than this approach?
Thanks for your help!

What about something like the code below to create a "long" data frame for joining? We use the tidyverse pipe operator to chain operations. strsplit returns a list, which we unnest to stack the list values (the county names that go with each combination of state.name and place.name) into a long data frame where each county.name gets its own row.
library(tigris)
library(acs)
library(tidyverse)
dat = geo.lookup(state = "NY", place = "New York")
state state.name county.name place place.name
1 36 New York <NA> NA <NA>
2 36 New York Bronx County, Kings County, New York County, Queens County, Richmond County 51000 New York city
3 36 New York Oneida County 51011 New York Mills village
dat = dat %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York NA <NA> <NA>
2 36 New York 51000 New York city Bronx County
3 36 New York 51000 New York city Kings County
4 36 New York 51000 New York city New York County
5 36 New York 51000 New York city Queens County
6 36 New York 51000 New York city Richmond County
7 36 New York 51011 New York Mills village Oneida County
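For readers without the tidyverse installed, the same "one row per county" reshaping can be sketched in base R. This is only an illustration: the small data frame below is a hand-typed stand-in for the geo.lookup() output above, not a live query, and the object names (dat, counties, long) are mine.

```r
# Hand-typed stand-in for the geo.lookup() output shown above.
dat <- data.frame(
  state.name  = c("New York", "New York"),
  place.name  = c("New York city", "New York Mills village"),
  county.name = c("Bronx County, Kings County, New York County, Queens County, Richmond County",
                  "Oneida County"),
  stringsAsFactors = FALSE
)

# Split each comma-separated county string into a character vector.
counties <- strsplit(dat$county.name, ", ", fixed = TRUE)

# Repeat every original row once per county, then overwrite county.name
# with the individual county names.
long <- dat[rep(seq_len(nrow(dat)), lengths(counties)), ]
long$county.name <- unlist(counties)
rownames(long) <- NULL
long
```

Each place now contributes as many rows as it has counties, which is the shape a join on county.name needs.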
UPDATE: Regarding the second question in your comment, assuming you have the vector of metro areas already, how about this:
dat <- c("New York, NY","Boston, MA","Los Angeles, CA","Dallas, TX","Palo Alto, CA")
df <- map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest
})
df
state state.name place place.name county.name
1 36 New York 51000 New York city Bronx County
2 36 New York 51000 New York city Kings County
3 36 New York 51000 New York city New York County
4 36 New York 51000 New York city Queens County
5 36 New York 51000 New York city Richmond County
6 36 New York 51011 New York Mills village Oneida County
7 25 Massachusetts 7000 Boston city Suffolk County
8 25 Massachusetts 7000 Boston city Suffolk County
9 6 California 20802 East Los Angeles CDP Los Angeles County
10 6 California 39612 Lake Los Angeles CDP Los Angeles County
11 6 California 44000 Los Angeles city Los Angeles County
12 48 Texas 19000 Dallas city Collin County
13 48 Texas 19000 Dallas city Dallas County
14 48 Texas 19000 Dallas city Denton County
15 48 Texas 19000 Dallas city Kaufman County
16 48 Texas 19000 Dallas city Rockwall County
17 48 Texas 40516 Lake Dallas city Denton County
18 6 California 20956 East Palo Alto city San Mateo County
19 6 California 55282 Palo Alto city Santa Clara County
UPDATE 2: If I understand your comments, for cities (actually place names in the example) with more than one county, we want only the county that includes the same name as the city (for example, New York County in the case of New York city), or the first county in the list otherwise. The following code selects a county with the same name as the city or, if there isn't one, the first county for that city. You might have to tweak it a bit to make it work for the entire U.S. For example, for it to work for Louisiana, you might need gsub(" County| Parish"... instead of gsub(" County"....
map_df(strsplit(dat, ", "), function(x) {
geo.lookup(state = x[2], place = x[1])[-1, ] %>%
group_by(state.name, place.name) %>%
mutate(county.name = strsplit(county.name, ", ")) %>%
unnest %>%
slice(max(1, which(grepl(sub(" [A-Za-z]*$","", place.name), gsub(" County", "", county.name))), na.rm=TRUE))
})
state state.name place place.name county.name
<chr> <chr> <int> <chr> <chr>
1 36 New York 51000 New York city New York County
2 36 New York 51011 New York Mills village Oneida County
3 25 Massachusetts 7000 Boston city Suffolk County
4 6 California 20802 East Los Angeles CDP Los Angeles County
5 6 California 39612 Lake Los Angeles CDP Los Angeles County
6 6 California 44000 Los Angeles city Los Angeles County
7 48 Texas 19000 Dallas city Dallas County
8 48 Texas 40516 Lake Dallas city Denton County
9 6 California 20956 East Palo Alto city San Mateo County
10 6 California 55282 Palo Alto city Santa Clara County
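The selection rule from UPDATE 2 can also be written as a small stand-alone helper, which makes it easier to test on individual strings. This is a sketch under assumptions: pick_county() is a hypothetical name, and it uses an exact comparison of the stripped names rather than the grepl() containment test in the code above, so it may behave differently on edge cases.

```r
# Hypothetical helper: choose the county whose name equals the place name,
# or fall back to the first county listed.
pick_county <- function(place, county_string) {
  counties <- strsplit(county_string, ", ", fixed = TRUE)[[1]]
  # Drop the trailing place type ("city", "village", "CDP", ...).
  place_stem <- sub(" [A-Za-z]+$", "", place)
  # Drop the county suffix; extend the pattern for other states as needed.
  county_stem <- sub(" (County|Parish)$", "", counties)
  hit <- which(county_stem == place_stem)
  if (length(hit) > 0) counties[hit[1]] else counties[1]
}

pick_county("New York city",
            "Bronx County, Kings County, New York County, Queens County, Richmond County")
pick_county("New York Mills village", "Oneida County")
```

To apply it to whole columns, something like mapply(pick_county, df$place.name, df$county.name, USE.NAMES = FALSE) should work on the unsplit geo.lookup() output.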

Could you prep the data with something like the code below?
library(stringr) # provides str_trim() and str_split()
new_york_data <- geo.lookup(state = "NY", place = "New York")
prep_data <- function(full_data){
output <- data.frame()
for(row in 1:nrow(full_data)){
new_rows <- replicateCounty(full_data[row, ])
output <- plyr::rbind.fill(output, new_rows)
}
return(output)
}
replicateCounty <- function(row){
counties <- str_trim(unlist(str_split(row$county.name, ",")))
output <- data.frame(state = row$state,
state.name = row$state.name,
county.name = counties,
place = row$place,
place.name = row$place.name)
return(output)
}
prep_data(new_york_data)
It's a little messy, and you'll need the plyr and stringr packages. Once you prep the data, you should be able to join on it.
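Once the data is in long form, the join itself might look like the base R sketch below. The fips_lookup table is a tiny hand-built stand-in for the renamed fips_codes data from tigris (only the New York rows needed here), and the names long, fips_lookup and joined are assumptions for illustration.

```r
# Long-form places: one row per place/county combination.
long <- data.frame(
  state.name  = "New York",
  place.name  = c(rep("New York city", 5), "New York Mills village"),
  county.name = c("Bronx County", "Kings County", "New York County",
                  "Queens County", "Richmond County", "Oneida County"),
  stringsAsFactors = FALSE
)

# Hand-built stand-in for the renamed fips_codes table.
fips_lookup <- data.frame(
  state.name  = "New York",
  county.name = c("Bronx County", "Kings County", "New York County",
                  "Queens County", "Richmond County", "Oneida County"),
  countyfips  = c("005", "047", "061", "081", "085", "065"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every place row even when no FIPS code is found,
# mirroring the left_join() in the question.
joined <- merge(long, fips_lookup,
                by = c("state.name", "county.name"), all.x = TRUE)
joined
```

Note that merge() sorts the result by the key columns, so the row order differs from the input.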


R group or aggregate

I would like to do a group_by or aggregate. I have something like:
> head(affiliation_clean)
Affiliation_ID Affiliation_Name City Country
1 000001 New Mexico State University Las Cruces Las Cruces United States
2 000001 New Mexico State University Las Cruces Las Cruces <NA>
3 000001 New Mexico State University Las Cruces <NA> <NA>
4 000002 Palo Alto Research Center Incorporated Palo Alto <NA>
5 000002 Palo Alto Research Center Incorporated <NA> United States
6 000002 Palo Alto Research Center Incorporated <NA> <NA>
Grouping by "Affiliation_ID" and taking the longest string of "Affiliation_Name", "City" and "Country", I would like to get:
> head(affiliation_clean)
Affiliation_ID Affiliation_Name City Country
1 000001 New Mexico State University Las Cruces Las Cruces United States
2 000002 Palo Alto Research Center Incorporated Palo Alto United States
Thanks in advance.
Here is a dplyr solution based on your description to select the longest string of each Affiliation_ID and column.
library(dplyr)
dat2 <- dat %>%
group_by(Affiliation_ID) %>%
summarise_all(~ .[which.max(nchar(.))][1]) # lambda form; funs() is deprecated in current dplyr
dat2
# # A tibble: 2 x 4
# Affiliation_ID Affiliation_Name City Country
# <int> <chr> <chr> <chr>
# 1 1 New Mexico State University Las Cruces Las Cruces United States
# 2 2 Palo Alto Research Center Incorporated Palo Alto United States
DATA
dat <-read.table(text = " Affiliation_ID Affiliation_Name City Country
1 '000001' 'New Mexico State University Las Cruces' 'Las Cruces' 'United States'
2 '000001' 'New Mexico State University Las Cruces' 'Las Cruces' NA
3 '000001' 'New Mexico State University Las Cruces' NA NA
4 '000002' 'Palo Alto Research Center Incorporated' 'Palo Alto' NA
5 '000002' 'Palo Alto Research Center Incorporated' NA 'United States'
6 '000002' 'Palo Alto Research Center Incorporated' NA NA",
header = TRUE, stringsAsFactors = FALSE)
Assuming that there is a single unique 'City'/'Country' for each 'Affiliation_ID', 'Affiliation_Name' pair, group by the first two columns and then take the unique non-NA element of every other column with summarise_all.
library(dplyr)
affiliation_clean %>%
group_by(Affiliation_ID, Affiliation_Name) %>%
summarise_all(~ unique(.[!is.na(.)])) # lambda form; funs() is deprecated in current dplyr
# A tibble: 2 x 4
# Groups: Affiliation_ID [?]
# Affiliation_ID Affiliation_Name City Country
# <chr> <chr> <chr> <chr>
#1 000001 New Mexico State University Las Cruces Las Cruces United States
#2 000002 Palo Alto Research Center Incorporated Palo Alto United States
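Since the question also mentions aggregate, here is a rough base R equivalent of the "longest string per group" idea. It is a sketch on hand-typed data copied from the question; longest() is a hypothetical helper name.

```r
affiliation_clean <- data.frame(
  Affiliation_ID   = c("000001", "000001", "000001", "000002", "000002", "000002"),
  Affiliation_Name = c(rep("New Mexico State University Las Cruces", 3),
                       rep("Palo Alto Research Center Incorporated", 3)),
  City    = c("Las Cruces", "Las Cruces", NA, "Palo Alto", NA, NA),
  Country = c("United States", NA, NA, NA, "United States", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: longest non-missing string, or NA if all are missing.
longest <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  x[which.max(nchar(x))]
}

# Apply longest() to every non-ID column within each Affiliation_ID.
result <- aggregate(affiliation_clean[-1],
                    by = affiliation_clean["Affiliation_ID"],
                    FUN = longest)
result
```

The data-frame method of aggregate() is used here rather than the formula interface, because the formula interface drops rows containing NA by default.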

How to have bar labels be names in Plotly for R

I'm trying to make a bar chart that displays the most popular origin airports for flights to Chicago. For some reason, I'm finding it extremely difficult to label the bars with the airport names.
I have a data frame called ty
> ty
Name
1 Atlanta, GA: Hartsfield-Jackson Atlanta International
2 New York, NY: LaGuardia
3 Minneapolis, MN: Minneapolis-St Paul International
4 Los Angeles, CA: Los Angeles International
5 Denver, CO: Denver International
6 Washington, DC: Ronald Reagan Washington National
7 Orlando, FL: Orlando International
8 Phoenix, AZ: Phoenix Sky Harbor International
9 Detroit, MI: Detroit Metro Wayne County
10 Las Vegas, NV: McCarran International
11 San Francisco, CA: San Francisco International
12 Dallas/Fort Worth, TX: Dallas/Fort Worth International
13 Boston, MA: Logan International
14 Philadelphia, PA: Philadelphia International
15 Newark, NJ: Newark Liberty International
I also have a data frame called df
id numArrivals
1 10397 964
2 12953 962
3 13487 883
4 12892 823
5 11292 776
6 11278 771
7 13204 725
8 14107 700
9 11433 672
10 12889 647
11 14771 611
12 11298 580
13 10721 569
14 14100 567
15 11618 488
The id corresponds to the airport name: 10397 is Atlanta, GA: Hartsfield-Jackson Atlanta International, and the rest continue in that order.
However, when I run:
plotly::plot_ly(df,x=ty["Name"],y=df$numArrivals,type="bar",color=I("rgba(0,92,124,1)"))
I am given this chart.
How can I make the labels of my bars into the names of the airport rather than just numbers?
Feel free to use ggplotly() to create your plot. I used the code below to create a small example.
example <- data.frame(airport = c("Atlanta, GA: Hartsfield-Jackson Atlanta International","New York, NY: LaGuardia","Minneapolis, MN: Minneapolis-St Paul International"),
id = c(10397,12953,13487),
numArrivals = c(964,962,883),stringsAsFactors = F)
library(ggplot2)
library(plotly)
a <- ggplot(example,aes(x=airport,y=numArrivals,fill=id)) + geom_bar(stat = "identity") + coord_flip()
ggplotly(a)
The final result looks like this.
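If you would rather stay with plot_ly() directly, a possible approach is to put the names into the same data frame as the counts and pass them as a plain vector or column. My guess is that ty["Name"] in the question is a one-column data frame rather than a vector, which plot_ly() does not treat as axis labels. The sketch below uses the first three airports only; plot_data and p are names I made up, and the plotting call is skipped if plotly is not installed.

```r
ty <- data.frame(
  Name = c("Atlanta, GA: Hartsfield-Jackson Atlanta International",
           "New York, NY: LaGuardia",
           "Minneapolis, MN: Minneapolis-St Paul International"),
  stringsAsFactors = FALSE
)
df <- data.frame(id = c(10397, 12953, 13487),
                 numArrivals = c(964, 962, 883))

# Both data frames are in the same order, so the names can be attached
# by position. Fixing the factor levels keeps the bars in data order
# instead of sorting them alphabetically.
plot_data <- df
plot_data$Name <- factor(ty$Name, levels = ty$Name)

# Build the chart only when plotly is available; print p to display it.
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(plot_data, x = ~Name, y = ~numArrivals, type = "bar")
}
```

If the two data frames were not guaranteed to be in the same order, a join on a shared id column would be safer than attaching by position.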

Finding all string matches from another dataframe in R

I am relatively new in R.
I have a dataframe locs that has 1 variable V1 and looks like:
V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal
and another dataframe cities that has two variables that look like this:
city country
edmonton canada
san carlos spain
los angeles united states
santa maria united states
tokyo japan
madrid spain
santa maria portugal
lisbon portugal
I want to create two new variables in locs that list every city from cities whose name appears in V1, together with the matching countries, so that locs looks like this:
V1 city country
edmonton general hospital edmonton canada
hospital san carlos, madrid spain san carlos, madrid spain
hospital of santa maria, lisbon, portugal santa maria, lisbon portugal, united states
A few things to note: V1 may have multiple country names. Also, if there is a repeat country (for instance, both san carlos and madrid are in spain), then I only want one instance of the country.
Please advise.
Thanks.
A solution using tidyverse and stringr. locs2 is the final output.
library(tidyverse)
library(stringr)
locs2 <- locs %>%
rowwise() %>%
mutate(city = list(str_match(V1, cities$city))) %>%
unnest() %>%
drop_na(city) %>%
left_join(cities, by = "city") %>%
group_by(V1) %>%
summarise_all(~ toString(sort(unique(.)))) # lambda form; funs() is deprecated in current dplyr
Result
locs2 %>% as.data.frame()
V1 city country
1 cardiovascular institute, hospital san carlos, madrid spain madrid, san carlos spain
2 edmonton general hospital edmonton canada
3 hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
DATA
library(tidyverse)
locs <- data_frame(V1 = c("edmonton general hospital",
"cardiovascular institute, hospital san carlos, madrid spain",
"hospital of santa maria, lisbon, portugal"))
cities <- read.table(text = "city country
edmonton canada
'san carlos' spain
'los angeles' 'united states'
'santa maria' 'united states'
tokyo japan
madrid spain
'santa maria' portugal
lisbon portugal",
header = TRUE, stringsAsFactors = FALSE)
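For comparison, the matching step can be sketched in base R with literal substring tests. This is an illustration on the same data, not a replacement for the answer above; match_cities() is a hypothetical helper, and plain substring matching can produce false positives when a short city name happens to occur inside a longer word.

```r
locs <- data.frame(
  V1 = c("edmonton general hospital",
         "cardiovascular institute, hospital san carlos, madrid spain",
         "hospital of santa maria, lisbon, portugal"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  city    = c("edmonton", "san carlos", "los angeles", "santa maria",
              "tokyo", "madrid", "santa maria", "lisbon"),
  country = c("canada", "spain", "united states", "united states",
              "japan", "spain", "portugal", "portugal"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect the cities (and their countries) whose
# name occurs literally in one text string.
match_cities <- function(text) {
  hit <- vapply(cities$city, grepl, logical(1),
                x = text, fixed = TRUE, USE.NAMES = FALSE)
  c(city    = toString(sort(unique(cities$city[hit]))),
    country = toString(sort(unique(cities$country[hit]))))
}

# One row per location, with columns "city" and "country".
matched <- t(vapply(locs$V1, match_cities,
                    c(city = "", country = ""), USE.NAMES = FALSE))
locs$city    <- matched[, "city"]
locs$country <- matched[, "country"]
locs
```

Unlike the grouped summary above, this keeps the rows of locs in their original order.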

R append function

I'm writing an R script that parses a state abbreviation out of a column in a data.frame. It then uses the which() function to determine the index of that abbreviation in a lookup data frame that contains state abbreviations and their corresponding full state names. I then use the index to access the full state name and append it to a vector called completeList. Finally, I add completeList, which should contain the full state names, to my original data frame as a newly created column STATE_NAME.
However, for some reason completeList only contains integer indexes and not the full state names that I expected. What did I do wrong?
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, addCompleteStateName)
}
file["STATE_NAME"]<-completeList
>completeList
[1] 27 17 17 29 42 50 20 53 45 19 22 52 9 29 26 37 8 58 35
Here is the csv file where the abbreviation of the station is found
STATION STATION_NAME ELEVATION
GHCND:USC00202381 EAST JORDAN MI US 180.1
GHCND:USC00111290 CARLYLE RESERVOIR IL US 153
GHCND:USC00116661 PAW PAW 2 S IL US 274.9
GHCND:USC00228556 SUMRALL MS US 88.1
GHCND:USC00340292 ARDMORE OK US 267.9
GHCND:USC00408522 SPARTA WASTEWATER PLANT TN US 289.9
GHCND:USC00148341 VALLEY FALLS KS US 283.5
GHCND:USW00014742 BURLINGTON INTERNATIONAL AIRPORT VT US 101.2
GHCND:USC00367782 SALINA 3 W PA US 338
GHCND:USC00134142 IOWA FALLS IA US 356.9
GHCND:USC00161565 CARVILLE 2 SW LA US 9.1
GHCND:USC00421446 CITY CRK WATER PLANT UT US 1628.9
GHCND:USW00013781 WILMINGTON NEW CASTLE CO AIRPORT DE US 22.6
GHCND:USC00229400 WATER VALLEY MS US 116.1
GHCND:USC00190562 BELCHERTOWN MA US 171
GHCND:USW00094728 NEW YORK CENTRAL PARK OBS BELVEDERE TOWER NY US 40.2
GHCND:USC00060973 BURLINGTON CT US 155.4
GHCND:USC00475516 MINOCQUA WI US 484.9
GHCND:USC00286055 NEW BRUNSWICK 3 SE NJ US 38.1
Here is the csv file where we look up abbreviations and find the corresponding full state name
State/Possession Abbreviation
Alabama AL
Alaska AK
American Samoa AS
Arizona AZ
Arkansas AR
California CA
Colorado CO
Connecticut CT
Delaware DE
District of Columbia DC
Federated States of Micronesia FM
Florida FL
Georgia GA
Guam GU
Hawaii HI
Idaho ID
Illinois IL
Indiana IN
Iowa IA
Kansas KS
Kentucky KY
Louisiana LA
Maine ME
Marshall Islands MH
Maryland MD
Massachusetts MA
Michigan MI
Minnesota MN
Mississippi MS
Missouri MO
Montana MT
Nebraska NE
Nevada NV
New Hampshire NH
New Jersey NJ
New Mexico NM
New York NY
North Carolina NC
North Dakota ND
Northern Mariana Islands MP
Ohio OH
Oklahoma OK
Oregon OR
Palau PW
Pennsylvania PA
Puerto Rico PR
Rhode Island RI
South Carolina SC
South Dakota SD
Tennessee TN
Texas TX
Utah UT
Vermont VT
Virgin Islands VI
Virginia VA
Washington WA
West Virginia WV
Wisconsin WI
Wyoming WY
Why am I not getting the full state name?
figured it out 😎
#read in csv weather data file
file <- read.csv(header = TRUE, file = "C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\nov_2_1976\\734677_cleaned.csv")
#read in csv state Abbreviation file
abbreviationsFile<-read.csv(header=TRUE, file="C:\\Users\\michael.guarino1\\Desktop\\Work\\weather\\stateAbbreviationMatches.csv")
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
completeList<-append(completeList, toString(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
The type was being coerced to an integer.
The variable addCompleteStateName is a factor. You can convert it to a character to append the labels.
#iterate through STATION_NAME and store abbreviations
completeList<-c()
for(stateAbvr in file$STATION_NAME){
addTo<-(substring(stateAbvr,(nchar(stateAbvr)-4),(nchar(stateAbvr)-3)))
index<-which(abbreviationsFile$Abbreviation==addTo)
addCompleteStateName<-(abbreviationsFile[index,1])
# modified to convert addCompleteStateName to character
completeList<-append(completeList, as.character(addCompleteStateName))
}
file["STATE_NAME"]<-completeList
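The loop can also be avoided altogether with a vectorised lookup via match(). The sketch below uses a few hand-typed rows instead of the CSV files, and the column name State stands in for whatever read.csv makes of the "State/Possession" header, so treat the names as assumptions. As a side note, read.csv only converts strings to factors by default in R versions before 4.0, which is where the integer codes came from.

```r
stations <- data.frame(
  STATION_NAME = c("EAST JORDAN MI US", "CARLYLE RESERVOIR IL US", "SUMRALL MS US"),
  stringsAsFactors = FALSE
)
abbreviations <- data.frame(
  State        = c("Illinois", "Michigan", "Mississippi"),
  Abbreviation = c("IL", "MI", "MS"),
  stringsAsFactors = FALSE
)

# The state abbreviation sits just before the trailing " US".
n    <- nchar(stations$STATION_NAME)
abbr <- substring(stations$STATION_NAME, n - 4, n - 3)

# match() returns the row of each abbreviation in the lookup table (NA if
# absent); as.character() guards against factor columns.
stations$STATE_NAME <- as.character(
  abbreviations$State[match(abbr, abbreviations$Abbreviation)]
)
stations
```

Stations whose abbreviation is missing from the lookup table get NA instead of silently shortening the result, which the append() loop would do.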

Using spread() in tidyr to pivot and drop NAs

I am using R and I have data like
California | Los Angeles
California | San Diego
California | San Francisco
New York | Albany
New York | New York City
which I would like to transform to
California | New York
Los Angeles | Albany
San Diego | New York City
San Francisco | NA
I am trying to use spread() in tidyr but can't quite get it to give me the output the way I need it. The closest I can come is
California | New York
Los Angeles | NA
San Diego | NA
San Francisco | NA
NA | Albany
NA | New York City
Can someone please help me get it in the desired format?
Here's how I do it in base:
df<-data.frame(v1=c(rep("California",3), rep("New York",2)), v2=c("Los Angeles", "San Diego", "San Francisco", "Albany", "New York City"))
cali<-as.character(df[df$v1=="California", 2])
ny<-as.character(df[df$v1=="New York", 2])
new <- data.frame(California=cali, NewYork=c(ny, NA))
new
California NewYork
1 Los Angeles Albany
2 San Diego New York City
3 San Francisco <NA>
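The same idea can be generalised so that it does not hard-code the state names or the amount of NA padding. This is a base R sketch, not a spread() solution; groups, padded and wide are names I chose for illustration.

```r
df <- data.frame(
  v1 = c(rep("California", 3), rep("New York", 2)),
  v2 = c("Los Angeles", "San Diego", "San Francisco", "Albany", "New York City"),
  stringsAsFactors = FALSE
)

# One character vector of cities per state.
groups <- split(df$v2, df$v1)

# Pad the shorter vectors with NA so all columns have the same length.
longest <- max(lengths(groups))
padded  <- lapply(groups, function(x) c(x, rep(NA_character_, longest - length(x))))

# Bind the padded vectors as columns; going through a matrix keeps the
# original state names (including the space) as column names.
wide <- as.data.frame(do.call(cbind, padded), stringsAsFactors = FALSE)
wide
```

With spread() itself, the usual trick is to add a within-group row number first, so that each city has a unique row identifier to spread against.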
