Separating geographical data strings in R

I'm working with QCEW data from BLS and would like to make the geographical data included more useful. I want to split the column "area_title" into separate columns: one with the area's name, one with the level of the area, and one with the state.
I got a good start using separate:
qecw <- qecw %>% separate(area_title, c("county", "geography level", "state"))
The problem is that the geographical data are arranged into strings in a variety of ways, so they aren't uniform enough to separate cleanly. The area_title column includes names in formats that separate pretty cleanly, like:
area_title
Alabama -- Statewide
Autauga County, Alabama
which splits pretty well into
county   geography level  state
Alabama  Statewide        NA
Autauga  County           Alabama
but this breaks down for cases like:
area_title
Aleutians West Census Area, Alaska
Chattanooga-Cleveland-Dalton TN-GA-AL CSA
U.S. Combined statistical Areas, combined
as well as any states, counties or other place names that have more than one word.
I can go case-by-case to fix these, but I would appreciate a more efficient solution.
The exact data I'm using is "2019.q1-q3 10 10 Total, all industries," available at the link under "Current year quarterly data grouped by industry".
Thanks!

So far I came up with this:
I can get a place name by selecting a substring of area_title with everything to the left of the first comma:
qecw <- qecw %>% mutate(location = sub(",.*", "", area_title))
Then I have a series of nested if_else statements to create a location type:
qecw <- qecw %>%
  mutate(`Location Type` =
    if_else(str_detect(area_title, "Statewide"), "State",
    if_else(str_detect(area_title, "County"), "County",
    if_else(str_detect(area_title, "CSA"), "CSA",
    if_else(str_detect(area_title, "MSA"), "MSA",
    if_else(str_detect(area_title, "MicroSA"), "MicroSA",
    if_else(str_detect(area_title, "Undefined"), "Undefined",
            "other")))))))
This isn't a complete answer; I think I'm still missing some location types, and I haven't come up with a good way to extract state names yet.
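For extracting state names, one option is to take everything after the last comma, falling back to NA when there is no comma. This is a base-R sketch using a small stand-in for the area_title column rather than the real QCEW file:

```r
# Stand-in sample of area_title values (the real ones come from the QCEW file)
qecw <- data.frame(area_title = c("Alabama -- Statewide",
                                  "Autauga County, Alabama",
                                  "Aleutians West Census Area, Alaska",
                                  "Chattanooga-Cleveland-Dalton TN-GA-AL CSA"))

# Place name: everything before the first comma or " -- " separator
qecw$location <- sub(" --.*$", "", sub(",.*$", "", qecw$area_title))

# State: everything after the last comma; NA when there is no comma
qecw$state <- ifelse(grepl(",", qecw$area_title),
                     sub(".*,\\s*", "", qecw$area_title),
                     NA_character_)
```

Multi-state titles like "TN-GA-AL CSA" have no comma, so they still come out as NA here and would need their own rule.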

Related

Conditionally Fill a Column based on Another Column

I have a dataframe (df) where one column contains US states by their two-letter abbreviation: 'AK','AL','AR','AZ','CA', ..., 'WV','WY'.
I want to create a new column that reads the df$state column and assigns a region: West, Midwest, Northeast, Southeast, Southwest.
I have the regions broken down into vectors, for example:
list_southwest <- c('TX','AZ','NM','OK')
I duplicated the df$state column and renamed it df$region. What I want to do is replace the two-letter state elements with regions without doing it state-by-state.
I have been successful with the code: df$region [df$region == 'TX'] <- "Southwest"
But I'd like to go faster, I tried: df$region [df$region == 'list_west'] <- "Southwest"
in an attempt to check the column for all the two-letter strings in "list_west" but I'm not getting anything replaced and I'm not receiving an error of any kind.
I've also tried the tedious:
df$region[df$region == 'TX', 'AZ', ... but R doesn't seem to like that; I've tried replacing the commas with |, &&, and ||, with no luck.
I was thinking there might be a way to add a for loop and case_when(), and a lot of other things, but I'm stuck. Any help would be greatly appreciated!
Here's what I'm hoping for, without having to run a line of code per individual state:
state  region
AK     West
AL     South
AR     South
AZ     West
CA     West
CO     West
CT     NorthEast
SOLVED!!
Here's how the code looks after a comment suggested using %in% instead of ==:
df$region[df$region %in% list_west] <- "West"
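An alternative that avoids one assignment per region is a named lookup vector, indexed by the state column in one pass. This is a sketch with only a few states filled in; the full vector would cover all fifty:

```r
# Named lookup vector: names are state codes, values are regions
region_of <- c(TX = "Southwest", AZ = "Southwest", NM = "Southwest", OK = "Southwest",
               AK = "West", CA = "West", CO = "West",
               AL = "South", AR = "South",
               CT = "NorthEast")

df <- data.frame(state = c("AK", "AL", "TX", "CT"))
# Index the lookup vector by the state column; unname() drops the state names
df$region <- unname(region_of[df$state])
```

This also makes missing states easy to spot, since any state not in the vector comes back as NA rather than silently keeping its old value.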

How to reformat similar text for merging in R?

I am working with the NYC open data, and I want to merge two data frames based on community board. The issue is that the two data frames represent this in slightly different ways. I have provided an example of the two formats below.
CommunityBoards <- data.frame(
  FormatOne = c("01 BRONX", "05 QUEENS", "15 BROOKLYN", "03 STATEN ISLAND"),
  FormatTwo = c("BRONX COMMUNITY BOARD #1", "QUEENS COMMUNITY BOARD #5",
                "BROOKLYN COMMUNITY BOARD #15", "STATEN ISLAND COMMUNITY BD #3"))
Along with the different placement of the numbers and the "#", the second data frame shortens "COMMUNITY BOARD" to "COMMUNITY BD" just for Staten Island. I don't have a strong preference for what the strings look like, so long as I can discern borough and community board number. What would be the easiest way to reformat one or both of these strings so I could merge the two sets?
Thank you for any and all help!
You can use regex to extract just the district numbers. For the first format, the only thing that matters is the beginning of the string before the space, hence you could do
districtsNrs1 <- as.numeric(gsub("(\\d+) .*","\\1",CommunityBoards$FormatOne))
For the second, I assume the format looks like "something #number", hence you could do
districtsNrs2 <- as.numeric(gsub(".* #(\\d+)","\\1",CommunityBoards$FormatTwo))
to get the pure district numbers.
Now you know how to extract the district numbers. With that information, you can name/reformat the district-names how you want.
To know which district number is which district, you can create a translation data.frame between the districts and numbers like
districtNumberTranslations <- data.frame(
  districtNumber = districtsNrs2,
  districtName = sapply(strsplit(CommunityBoards$FormatTwo, " COMMUNITY "), "[[", 1)
)
giving
# districtNumber districtName
#1 1 BRONX
#2 5 QUEENS
#3 15 BROOKLYN
#4 3 STATEN ISLAND
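Putting the two extractions together, one way to set up the merge is to build a common key of borough plus district number from each format. This is a sketch using the example data from the question:

```r
CommunityBoards <- data.frame(
  FormatOne = c("01 BRONX", "05 QUEENS", "15 BROOKLYN", "03 STATEN ISLAND"),
  FormatTwo = c("BRONX COMMUNITY BOARD #1", "QUEENS COMMUNITY BOARD #5",
                "BROOKLYN COMMUNITY BOARD #15", "STATEN ISLAND COMMUNITY BD #3"))

# Format one: drop the zero-padded leading number to get the borough,
# and parse the number itself
key1 <- paste(sub("^\\d+ ", "", CommunityBoards$FormatOne),
              as.numeric(sub("(\\d+) .*", "\\1", CommunityBoards$FormatOne)))

# Format two: strip the "COMMUNITY BOARD/BD #n" tail to get the borough,
# and pull the number after the "#"
key2 <- paste(sub(" COMMUNITY (BOARD|BD) #\\d+$", "", CommunityBoards$FormatTwo),
              as.numeric(sub(".* #(\\d+)$", "\\1", CommunityBoards$FormatTwo)))
```

Both keys come out as e.g. "BRONX 1", so adding the key as a column to each data frame lets you merge on it directly.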

How to create a subset of data for most common [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
I need some help creating a subset of data. I'm sure this is a simple problem but I can't figure it out.
For example, in the table, I need to create a subset of the data that includes the presidential winner from each state. For Alabama, for example, I would need the line for Donald J. Trump, since he got the highest proportion of votes (candidate votes / total votes). I would need to isolate the winner from every state.
State    Candidate     candidatevotes  totalvotes
Alabama  D J Trump     1318255         2123372
Alabama  Clinton       729547          2123372
Alabama  Gary Johnson  44467           2123372
Alabama  Other         21712           2123372
However, I don't know how to isolate the winner from each state. I have tried using
data_sub <- filename[candidatevotes/totalvotes > .5]
but I know that since there are 3rd party candidates, not every winner from each state will win with majority votes. I have attached a picture for reference. Thank you in advance!
I just manipulated your data a little bit to demonstrate how the problem could be solved:
# Changed the last two states to Texas so you get a two-row result (not just one)
election <- data.frame(State = c("Alabama", "Alabama", "Texas", "Texas"),
                       Candidate = c("D J Trump", "Clinton", "Gary Johnson", "Other"),
                       candidatevotes = c(1318255, 729547, 44467, 21712),
                       totalvotes = c(2123372, 2123372, 2123372, 2123372))
# need library
library(dplyr)
election %>%
  # group by the variable you want the max value for (State)
  dplyr::group_by(State) %>%
  # keep the rows with the maximum candidatevotes for each State
  dplyr::filter(candidatevotes == max(candidatevotes))
We can group by 'State' and filter to the row with the maximum proportion for each 'State':
library(dplyr)
df1 %>%
  mutate(prop = candidatevotes / totalvotes) %>%
  group_by(State) %>%
  filter(prop == max(prop))
(Note that also filtering on prop > .5 would drop states won with a plurality rather than a majority, which the question says can happen.)
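If you'd rather stay in base R, ave() can compute the per-state maximum and a comparison does the filtering. This is a sketch reusing the demo data from the answer above:

```r
election <- data.frame(State = c("Alabama", "Alabama", "Texas", "Texas"),
                       Candidate = c("D J Trump", "Clinton", "Gary Johnson", "Other"),
                       candidatevotes = c(1318255, 729547, 44467, 21712),
                       totalvotes = c(2123372, 2123372, 2123372, 2123372))

# ave() repeats each state's maximum vote count on every row of that state;
# comparing it against candidatevotes keeps only the winning rows
winners <- election[election$candidatevotes ==
                      ave(election$candidatevotes, election$State, FUN = max), ]
```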

getCensus Hawaii City Populations

I'm looking to gather populations for Hawaiian cities and am puzzled about how to collect them using the censusapi getCensus() function.
census_api_key(key = 'YOURKEYHERE')

newpopvars <- listCensusMetadata(name = "2017/pep/population", type = "variables")

usapops <- getCensus(name = "pep/population",
                     vintage = 2017,
                     vars = c(newpopvars$name),
                     region = "place:*")
usapops <- usapops[which(usapops$DATE_ == 10), ]

state <- grepl("Hawaii", usapops$GEONAME)
cities <- data.frame()
for (i in seq(1, length(state))) {
  if (state[i] == TRUE) {
    cities <- rbind(cities, usapops[i, ])
  }
}
This returns only two cities but certainly there are more than that in Hawaii. What am I doing wrong?
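As a side note, the rbind loop at the end can be replaced by logical subsetting, since `state` is already a logical vector. Illustrated with a stand-in data frame, because the real `usapops` comes from the API:

```r
# Stand-in for the API result; only the GEONAME column matters here
usapops <- data.frame(GEONAME = c("Urban Honolulu CDP, Hawaii",
                                  "East Honolulu CDP, Hawaii",
                                  "Austin city, Texas"))

state <- grepl("Hawaii", usapops$GEONAME)
# One line instead of the loop: keep the rows where `state` is TRUE
cities <- usapops[state, , drop = FALSE]
```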
There is only one place (Census summary level 160) in Hawaii which is large enough to be included in the 1-year American Community Survey release: "Urban Honolulu" (GeoID 1571550). The 1-year release only includes places with 65,000+ population. I assume similar controls apply to the Population Estimates program -- I couldn't find it stated directly, but the section header on the page for Population Estimates downloads for cities and towns says "Places of 50,000 or More" -- the second most populated CDP in Hawaii is East Honolulu, which had only 47,868 in the 2013-2017 ACS release.
If you use the ACS 5-year data release, you'll find 151 places at summary level 160.
It looks as though you should change pep/population to acs/acs5 in your getCensus call. I don't know the specific variables for the API, but if you just want total population for places, use the ACS B01003 table, which has a single column with that value.

Obtain State Name from Google Trends Interest by City

Suppose you run the following query:
gtrends("google", geo="US")$interest_by_city
This returns how many searches for the term "google" occurred across cities in the US. However, it does not provide any information regarding which state each city belongs to.
I have tried merging this data set with several others including city and state names. Given that the same city name can be present in many states, it is unclear to me how to identify which city was the one Google Trends provided data for.
I provide below a more detailed MWE.
library(gtrendsR)
library(USAboundariesData)
data1 <- gtrends("google", geo= "US")$interest_by_city
data1$city <- data1$location
data2 <- us_cities(map_date = NULL)
data3 <- merge(data1, data2, by="city")
And this yields the following problem:
city state
Alexandria Louisiana
Alexandria Indiana
Alexandria Kentucky
Alexandria Virginia
Alexandria Minnesota
making it difficult to know which "Alexandria" Google Trends provided the data for.
Any hints in how to identify the state of each city would be much appreciated.
One way around this is to collect the cities per state and then just rbind the respective data frames. You could first make a vector of state codes like so
states <- paste0("US-",state.abb)
I then just used purrr for its map and reduce functionality to create a single frame
data <- purrr::reduce(
  purrr::map(states, function(x) {
    gtrends("google", geo = x)$interest_by_city
  }),
  rbind)
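One refinement: tag each per-state frame with its state code before binding, so the state survives in the combined result and the "which Alexandria?" ambiguity disappears. This sketch uses a stand-in for the gtrends call, since the real one hits the live API:

```r
# Stand-in for gtrends(...)$interest_by_city, which returns one frame per state
interest_by_city_stub <- function(geo) {
  data.frame(location = "Alexandria", hits = 10)
}

states <- c("US-VA", "US-MN")

# Add the state code as a column before binding, so each row keeps its origin
data <- do.call(rbind, lapply(states, function(x) {
  cities <- interest_by_city_stub(x)
  cities$state <- x
  cities
}))
```

With the real gtrendsR call substituted back in, every row of `data` carries the geo code it was fetched under.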
