Obtain State Name from Google Trends Interest by City - r

Suppose you run the following:
gtrends("google", geo="US")$interest_by_city
This returns how many searches for the term "google" occurred across cities in the US. However, it does not provide any information regarding which state each city belongs to.
I have tried merging this data set with several others including city and state names. Given that the same city name can be present in many states, it is unclear to me how to identify which city was the one Google Trends provided data for.
I provide below a more detailed MWE.
library(gtrendsR)
library(USAboundaries) # us_cities() is exported by USAboundaries; USAboundariesData only supplies the underlying data
data1 <- gtrends("google", geo= "US")$interest_by_city
data1$city <- data1$location
data2 <- us_cities(map_date = NULL)
data3 <- merge(data1, data2, by="city")
And this yields the following problem:
city state
Alexandria Louisiana
Alexandria Indiana
Alexandria Kentucky
Alexandria Virginia
Alexandria Minnesota
making it difficult to know which "Alexandria" Google Trends provided the data for.
Any hints on how to identify the state of each city would be much appreciated.

One way around this is to collect the cities per state and then just rbind the respective data frames. You could first make a vector of state codes like so
states <- paste0("US-",state.abb)
I then just used purrr for its map and reduce functionality to create a single frame
data <- purrr::reduce(
  purrr::map(states, function(x) {
    # interest by city for a single state, e.g. geo = "US-AL"
    gtrends("google", geo = x)$interest_by_city
  }),
  rbind
)
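Once the per-state frames are stacked, each row still needs a readable state name. The following is a minimal sketch, assuming that every row of the combined frame carries the queried code in a `geo` column (for example "US-VA"); the mock data frame stands in for the real result and its values are made up for illustration.

```r
# Mock of the combined result; the `geo` column is an assumption about
# what gtrends() returns for each per-state query.
data <- data.frame(
  location = c("Alexandria", "Alexandria"),
  hits = c(80, 95),
  geo = c("US-LA", "US-VA"),
  stringsAsFactors = FALSE
)

# Strip the "US-" prefix and look the abbreviation up in the built-in
# state.abb / state.name vectors.
data$state <- state.name[match(sub("^US-", "", data$geo), state.abb)]
data
```

Note that state.abb covers the 50 states only, so a code such as "US-DC" would come back as NA.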

Related

Separating geographical data strings in R

I'm working with QCEW data from BLS and would like to make the included geographical data more useful. I want to split the column "area_title" into different columns: one with the area's name, one with the level of the area, and one with the state.
I got a good start using separate:
qecw <- qecw %>% separate(area_title, c("county", "geography level", "state"))
The problem is that there's a variety of ways the geographical data are arranged into strings that makes them not uniform enough to cleanly separate. The area_title column includes names in formats that separate pretty cleanly, like:
area_title
Alabama -- Statewide
Autauga County, Alabama
which splits pretty well into
county geography level state
Alabama Statewide NA
Autauga County Alabama
but this breaks down for cases like:
area_title
Aleutians West Census Area, Alaska
Chattanooga-Cleveland-Dalton TN-GA-AL CSA
U.S. Combined statistical Areas, combined
as well as any states, counties or other place names that have more than one word.
I can go case-by-case to fix these, but I would appreciate a more efficient solution.
The exact data I'm using is "2019.q1-q3 10 10 Total, all industries," available at the link under "Current year quarterly data grouped by industry".
Thanks!
So far I came up with this:
I can get a place name by selecting a substring of area_title with everything to the left of the first comma:
qecw <- qecw %>% mutate(location = sub(",.*", "", area_title))
Then I have a series of nested if_else statements to create a location type:
mutate(`Location Type` =
if_else(str_detect(area_title, "Statewide"), "State",
if_else(str_detect(area_title, "County"), "County",
if_else(str_detect(area_title, "CSA"), "CSA",
if_else(str_detect(area_title, "MSA"), "MSA",
if_else(str_detect(area_title, "MicroSA"), "MicroSA",
if_else(str_detect(area_title, "Undefined"), "Undefined",
"other")))))))
This isn't a complete answer; I think I'm still missing some location types, and I haven't come up with a good way to extract state names yet.
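For the missing state names, one possible approach is to pull a candidate string out of each title and keep it only if it matches a known state. This is a rough sketch in base R, assuming the two layouts shown above ("State -- Statewide" and "Place, State"); the sample vector reuses the titles from the question, and multi-state areas such as CSAs are deliberately left as NA.

```r
area_title <- c(
  "Alabama -- Statewide",
  "Autauga County, Alabama",
  "Aleutians West Census Area, Alaska",
  "Chattanooga-Cleveland-Dalton TN-GA-AL CSA",
  "U.S. Combined statistical Areas, combined"
)

# Candidate 1: text after the last comma ("Place, State").
after_comma <- ifelse(grepl(",", area_title),
                      trimws(sub("^.*,", "", area_title)),
                      NA_character_)

# Candidate 2: text before " -- Statewide".
before_dash <- ifelse(grepl(" -- Statewide$", area_title),
                      sub(" -- Statewide$", "", area_title),
                      NA_character_)

candidate <- ifelse(is.na(before_dash), after_comma, before_dash)

# Keep the candidate only when it is one of the 50 built-in state names.
state <- ifelse(candidate %in% state.name, candidate, NA_character_)
state
```

Because state.name holds only the 50 states, titles for the District of Columbia or Puerto Rico would need to be added to the lookup vector by hand.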

getCensus Hawaii City Populations

I'm looking to gather populations for Hawaiian cities and am puzzled how to collect it using the censusapi getCensus() function.
census_api_key(key='YOURKEYHERE')
newpopvars <- listCensusMetadata(name = "2017/pep/population", type = "variables")
usapops <- getCensus(name = "pep/population",
vintage = 2017,
vars = c(newpopvars$name),
region = "place:*")
usapops <- usapops[which(usapops$DATE_==10),]
state <- grepl("Hawaii", usapops$GEONAME)
cities <- data.frame()
for (i in seq(1,length(state))) {
if (state[i] == TRUE) {
cities <- rbind(cities,usapops[i,])
}
}
This returns only two cities but certainly there are more than that in Hawaii. What am I doing wrong?
There is only one place (Census summary level 160) in Hawaii which is large enough to be included in the 1-year American Community Survey release: "Urban Honolulu" (GeoID 1571550). The 1-year release only includes places with 65,000+ population. I assume similar controls apply to the Population Estimates program -- I couldn't find it stated directly, but the section header on the page for Population Estimates downloads for cities and towns says "Places of 50,000 or More" -- the second most populated CDP in Hawaii is East Honolulu, which had only 47,868 in the 2013-2017 ACS release.
If you use the ACS 5-year data release, you'll find 151 places at summary level 160.
It looks as though you should change pep/population to acs/acs5 in your getCensus call. I don't know the specific variables for the API, but if you just want total population for places, use the ACS B01003 table, which has a single column with that value.
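A minimal sketch of that change, assuming a Census API key is stored in the CENSUS_KEY environment variable (which getCensus() reads by default) and that B01003_001E is the estimate column of table B01003. Hawaii's state FIPS code is 15.

```r
library(censusapi)

# Total population for every place (summary level 160) in Hawaii,
# taken from the 2013-2017 ACS 5-year release.
hawaii_places <- getCensus(
  name = "acs/acs5",
  vintage = 2017,
  vars = c("NAME", "B01003_001E"),
  region = "place:*",
  regionin = "state:15"
)

head(hawaii_places)
```

Restricting the request with regionin avoids downloading every place in the country and filtering on the name afterwards.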

Geocoding in R using googleway

I have read "Batch Geocoding with googleway R".
I am attempting to geocode some addresses using googleway. I want the geocodes, address, and county returned.
Using the answer linked to above I created the following function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
df<-as.data.frame(unlist(res[[x]]$results$address_components))
address<-paste(df[1,],df[2,],sep = " ")
city<-paste0(df[3,])
county<-paste0(df[4,])
state<-paste0(df[5,])
zip<-paste0(df[7,])
coordinates<-cbind(coordinates,address,city,county,state,zip)
coordinates<-as.data.frame(coordinates)
})
Then put it back together like so...
library(data.table)
done<-rbindlist(geocodes)
The issue is getting the address and county back out from the 'res' list. The answer linked to above pulls the address from the dataframe that was sent to google and assumes the list is in the right order and there are no multiple match results back from google (in my list there seems to be a couple). Point is, taking the addresses from one file and the coordinates from another seems rather reckless and since I need the county anyway, I need a way to pull it out of google's resulting list saved in 'res'.
The issue is that some addresses have more "types" than others which means referencing by row as I did above does not work.
I also tried including rbindlist inside the function to convert the sublist into a datatable and then pull out the fields but can't quite get it to work. The issue with this approach is that actual addresses are in a vector but the 'types' field which I would use to filter or select is in a sublist.
The best way I can describe it is like this -
list <- c(long address),c(short address), types(LIST(street number, route, county, etc.))
Obviously, I'm a beginner at this. I know there's a simpler way but I am just really struggling with lists and R seems to make extensive use of them.
Edit:
I definitely recognize that I cannot rbind the whole list. I need to pull specific elements out and bind just those. A big part of the problem, in my mind, is that I do not have a great handle on indexing and manipulating lists.
Here are some addresses to try: "301 Adams St, Friendship, WI 53934, USA" has a 7x3 "address components" matrix and a corresponding "types" list of 7. Compare that to "222 S Walnut St, Appleton, WI 45911, USA", which has a 9x3 address components matrix and a "types" list of 9. The types list needs to be connected back to the address components matrix because it identifies what each row of that matrix contains.
Then there are more complexities introduced by imperfect matches. Try "211 Grand Avenue, Rothschild, WI, 54474" and you get 2 lists, one for east grand ave and one for west grand ave. Google seems to prefer the east since that's what comes out in the "formatted address." I don't really care which is used since the county will be the same for either. The "location" interestingly contains 2 sets of geocodes which, presumably, refer to the two matches. I think this complexity can be ignored since the location consisting of two coordinates is still stored as a 'double' (not a list!) so it should stack with the coordinates for the other addresses.
Edit: This should really work but I'm getting an error in the do.call(rbind,types) line of the function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,stringsAsFactors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
R says the "types" object is not a list so it can't rbind it. I tried coercing it to a list but still get the error. I checked using the following pared-down function and found that element #294 is NULL, which halts the function. I get "over query limit" as an error but I am not over the query limit.
geocodes<-lapply(seq_along(res),function(x) {
types<-res[[x]]$results$address_components[[1]]$types
print(typeof(types))
})
Here's my solution using tidyverse functions. This gets the geocode and also the formatted address in case you want it (other components of the result can be returned as well; they just need to be added to the tibble that the function passed to map returns).
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(googleway))
set_key("your key here")
df <- tibble(full_address = c("2379 ADDISON BLVD HIGH POINT 27262",
"1751 W LEXINGTON AVE HIGH POINT 27262", "dljknbkjs"))
df %>%
mutate(geocode_result = map(full_address, function(full_address) {
res <- google_geocode(full_address)
if(res$status == "OK") {
geo <- geocode_coordinates(res) %>% as_tibble()
formatted_address <- geocode_address(res)
geocode <- bind_cols(geo, formatted_address = formatted_address)
}
else geocode <- tibble(lat = NA, lng = NA, formatted_address = NA)
return(geocode)
})) %>%
unnest(geocode_result) # name the list column explicitly; a bare unnest() is deprecated in tidyr >= 1.0
#> # A tibble: 3 x 4
#> full_address lat lng formatted_address
#> <chr> <dbl> <dbl> <chr>
#> 1 2379 ADDISON BLVD HIGH POI… 36.0 -80.0 2379 Addison Blvd, High Point, N…
#> 2 1751 W LEXINGTON AVE HIGH … 36.0 -80.1 1751 W Lexington Ave, High Point…
#> 3 dljknbkjs NA NA <NA>
Created on 2019-04-14 by the reprex package (v0.2.1)
Ok, I'll answer it myself.
Begin with a data frame of addresses. I called mine "addresses", and its single column is called "Address" (note the capital A).
Use googleway to get the geocode data. I did this using apply to loop across the rows in the address data frame:
library(googleway)
res<-apply(addresses,1,function (x){
google_geocode(address=x[['Address']], key='insert your google api key here - its free to get')
})
Here is the function I wrote to get the nested lists into a dataframe.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,stringsAsFactors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
library(data.table)
geocodes<-rbindlist(geocodes,fill=TRUE)
lapply loops along the items in the list. Within the function I create a coordinates data frame and put the geocodes there. I also wanted the other address components, particularly the county, so I created the "types" object, which identifies what each item in the address is. I cbind the address items with the types, then use spread from the tidyr package to reshape the data frame into wide format so that it is a single row. I then cbind in the latitude and longitude from the coordinates data frame.
The rbindlist stacks it all back together. You could use do.call(rbind, geocodes) but rbindlist is faster.
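Since the number of address components varies, it may be safer to look a field up by its type than by its row position. Below is a minimal sketch, assuming each result's address components arrive as a data frame with a `long_name` column and a `types` list column, as described in the question. The helper name get_component is hypothetical, and the mock data stands in for `res[[x]]$results$address_components[[1]]`.

```r
# Mock of one result's address components (structure assumed from the
# question; values are illustrative).
components <- data.frame(
  long_name = c("301", "Adams Street", "Friendship", "Adams County",
                "Wisconsin", "United States", "53934"),
  stringsAsFactors = FALSE
)
components$types <- list(
  "street_number", "route",
  c("locality", "political"),
  c("administrative_area_level_2", "political"),
  c("administrative_area_level_1", "political"),
  c("country", "political"),
  "postal_code"
)

# Hypothetical helper: return the long_name of the first component whose
# types include `type`, or NA when there is no such component.
get_component <- function(components, type) {
  hit <- vapply(components$types, function(t) type %in% t, logical(1))
  if (any(hit)) components$long_name[which(hit)[1]] else NA_character_
}

get_component(components, "administrative_area_level_2")  # the county
get_component(components, "postal_code")
```

Because the helper returns NA when nothing matches, a NULL result (such as the failed query mentioned above) yields NA instead of stopping the loop.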

Merging (two and a half) countries from maps-package to one map object in R

I am looking for a map that combines Germany, Austria and parts of Switzerland into one spatial object. This area should represent the German-speaking areas in those three countries. I have some parts in place, but cannot find a way to combine them. If there is a completely different way to solve this problem, I am still interested.
I get the German and the Austrian map by:
require(maps)
germany <- map("world",regions="Germany",fill=TRUE,col="white") #get the map
austria <- map("world",regions="Austria",fill=TRUE,col="white") #get the map
Switzerland is more complicated, as I only need the 60-70% of it that mainly speaks German. The cantons that do so (taken from the census report) are
cantonesGerman = c("Uri", "Appenzell Innerrhoden", "Nidwalden", "Obwalden", "Appenzell Ausserrhoden", "Schwyz", "Lucerne", "Thurgau", "Solothurn", "Sankt Gallen", "Schaffhausen", "Basel-Landschaft", "Aargau", "Glarus", "Zug", "Zürich", "Basel-Stadt")
The canton names can be used together with data from gadm.org/country (selecting Switzerland & SpatialPolygonsDataFrame -> Level 1 or via the direct link) to get the German-speaking areas from the gadm-object:
gadmCH = readRDS("~/tmp/CHE_adm1.rds")
dataGermanSwiss <- gadmCH[gadmCH$NAME_1 %in% cantonesGerman,]
I am now missing the merging step to get this information together. The result should look like this:
It represents a combined map consisting of the contours of the merged area (Germany + Austria + ~70% of Switzerland), without borders between the countries. If adding and leaving out the inter-country borders would be parametrizable, that would be great but not a must have.
You can do that like this:
Get the polygons you need
library(raster)
deu <- getData('GADM', country='DEU', level=0)
aut <- getData('GADM', country='AUT', level=0)
swi <- getData('GADM', country='CHE', level=1)
Subset the Swiss cantons (here an example list, not the correct one); there is no need for a loop for such things in R.
cantone <- c('Aargau', 'Appenzell Ausserrhoden', 'Appenzell Innerrhoden', 'Basel-Landschaft', 'Basel-Stadt', 'Sankt Gallen', 'Schaffhausen', 'Solothurn', 'Thurgau', 'Zürich')
GermanSwiss <- swi[swi$NAME_1 %in% cantone,]
Aggregate (dissolve) Swiss internal boundaries
GermanSwiss <- aggregate(GermanSwiss)
Combine the three countries and aggregate
german <- bind(deu, aut, GermanSwiss)
german <- aggregate(german)

How do I preserve preexisting identifiers when geocoding a list of addresses in R?

I'm currently working with an R script set up to use RDSTK, a wrapper for the Data Science Toolkit API based on this, to geocode a list of addresses from a CSV.
The script appears to work, but the list of addresses has a preexisting unique identifier which isn't preserved in the process - the input file has two columns: id, and address. The id column, for the purposes of the geocoding process, is meaningless, but I'd like the output to retain it - that is, I'd like the output, which has three columns (address, long, and lat) to have four - id being the first.
The issue is that
The output is not in the same order as the input addresses, or doesn't appear to be, so I cannot simply tack on the column of addresses at the end, and
The output does not include nulls, so the two would not be the same number of rows in any case, even if it was the same order, and
I am not sure how to effectively tie the id column in such that it becomes a part of the geocoding process, which obviously would be the ideal solution.
Here is the script:
require("RDSTK")
library(httr)
library(rjson)
dff = read.csv("C:/Users/name/Documents/batchtestv2.csv")
data <- paste0("[",paste(paste0("\"",dff$address,"\""),collapse=","),"]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,lapply(json, function(x) c(long=x$longitude,lat=x$latitude)))
geocode
write.csv(geocode, file = "C:/Users/name/Documents/geocodetest.csv")
And here is a sample of the output:
2633 Camino Ramon Suite 500 San Ramon California 94583 United States -121.96208 37.77027
555 Lordship Boulevard Stratford Connecticut 6615 United States -73.14098 41.16542
500 West 13th Street Fort Worth Texas 76102 United States -97.33288 32.74782
50 North Laura Street Suite 2500 Jacksonville Florida 32202 United States -81.65923 30.32733
7781 South Little Egypt Road Stanley North Carolina 28164 United States -81.00597 35.44482
Maybe the solution is extraordinarily simple and I'm just being dense - it's entirely possible (I don't have extensive experience with any particular language, so I sometimes miss obvious things) but I haven't been able to solve it.
Thanks in advance!
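One way to keep the id is to join the geocoded rows back to the input on the address string rather than relying on row order. The sketch below assumes the parsed response is a named list keyed by the submitted address, with NULL for addresses that could not be matched; the mock `json` and `dff` objects stand in for the real ones and their ids are made up.

```r
# Mock input file: id plus address.
dff <- data.frame(
  id = c(101, 102, 103),
  address = c("no such place",
              "500 West 13th Street Fort Worth Texas 76102 United States",
              "555 Lordship Boulevard Stratford Connecticut 6615 United States"),
  stringsAsFactors = FALSE
)

# Mock parsed response (assumed shape): named list keyed by address,
# in a different order and with a NULL for the unmatched address.
json <- list(
  "555 Lordship Boulevard Stratford Connecticut 6615 United States" =
    list(longitude = -73.14098, latitude = 41.16542),
  "no such place" = NULL,
  "500 West 13th Street Fort Worth Texas 76102 United States" =
    list(longitude = -97.33288, latitude = 32.74782)
)

# Drop unmatched entries, then build a frame that keeps the address key.
matched <- Filter(Negate(is.null), json)
geocode <- data.frame(
  address = names(matched),
  long = vapply(matched, function(x) x$longitude, numeric(1), USE.NAMES = FALSE),
  lat = vapply(matched, function(x) x$latitude, numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Left join on address so every input id survives; unmatched rows get NA.
result <- merge(dff, geocode, by = "address", all.x = TRUE)
result <- result[order(result$id), c("id", "address", "long", "lat")]
result
```

Joining on the key handles both the reordering and the missing rows, at the cost of requiring that the address strings come back exactly as they were sent.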
