getCensus Hawaii City Populations in R

I'm looking to gather populations for Hawaiian cities and am puzzled about how to collect them using the censusapi getCensus() function.
census_api_key(key = "YOURKEYHERE")
newpopvars <- listCensusMetadata(name = "2017/pep/population", type = "variables")
usapops <- getCensus(name = "pep/population",
                     vintage = 2017,
                     vars = c(newpopvars$name),
                     region = "place:*")
usapops <- usapops[which(usapops$DATE_==10),]
state <- grepl("Hawaii", usapops$GEONAME)
cities <- data.frame()
for (i in seq(1, length(state))) {
  if (state[i] == TRUE) {
    cities <- rbind(cities, usapops[i, ])
  }
}
This returns only two cities, but there are certainly more than that in Hawaii. What am I doing wrong?

There is only one place (Census summary level 160) in Hawaii large enough to be included in the 1-year American Community Survey release: "Urban Honolulu" (GeoID 1571550). The 1-year release only includes places with 65,000+ population. I assume similar controls apply to the Population Estimates program; I couldn't find it stated directly, but the section header on the Population Estimates downloads page for cities and towns says "Places of 50,000 or More". The second most populated CDP in Hawaii is East Honolulu, which had only 47,868 people in the 2013-2017 ACS release.
If you use the ACS 5-year data release, you'll find 151 places at summary level 160.
It looks as though you should change pep/population to acs/acs5 in your getCensus call. I don't know the specific variables for the API, but if you just want total population for places, use the ACS B01003 table, which has a single column with that value.
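For example, a minimal sketch of that query with censusapi (untested; B01003_001E is the standard total-population estimate variable, 15 is Hawaii's state FIPS code, and the vintage is assumed to match your original call):
```
library(censusapi)  # assumes your key is already set, e.g. Sys.setenv(CENSUS_KEY = "...")

hi_places <- getCensus(name = "acs/acs5",
                       vintage = 2017,
                       vars = c("NAME", "B01003_001E"),  # total population estimate
                       region = "place:*",               # all places...
                       regionin = "state:15")            # ...within Hawaii (FIPS 15)
head(hi_places)
```
Restricting the query with regionin also avoids the grepl()/rbind() loop over GEONAME entirely.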

Related

Scraping PDF tables based on title

I am trying to extract one table each from 31 pdfs. The titles of the tables all start the same way but the end varies by region.
For one document the title is "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Arusha Region, 2012 Census". Another would be "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Dodoma Region, 2012 Census."
I used tabulizer to scrape the first table manually based on the specific text lines I need, but given the similar naming conventions, I was hoping to automate the process.
```
PATH2 <- "Regions/02. Arusha Regional Profile.pdf"

txt2 <- pdf_text(PATH2) %>%
  readr::read_lines()

specific_lines2 <- txt2[4621:4639] %>%
  str_squish() %>%
  str_replace_all(",", "") %>%
  strsplit(split = " ")
```
What: Find the page containing the common part of the title in each file and extract the data from there (assuming the title occurs only once per file).
How: Build a function that pulls the table from a single PDF, then run it over all the PDFs with lapply().
Example:
First, define a function that finds the page containing the title and extracts the table from it.
get_page_text <- function(url, word_find) {
  txt <- pdftools::pdf_text(url)
  p <- grep(word_find, txt, ignore.case = TRUE)[1]  # first page containing the title
  L <- tabulizer::extract_tables(url, pages = p)    # all tables found on that page
  i <- which.max(lengths(L))                        # keep the largest one
  data.frame(L[[i]])
}
Second, get file names.
setwd("C:/Users/xyz/Regions")
files <- list.files(pattern = "pdf$|PDF$") # list the PDF file names in the Regions folder
Then, the "loop" (lapply) to run the function for each pdf.
reports <- lapply(files,
                  get_page_text,
                  word_find = "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year")
The result is a list with one data.frame per PDF. What comes next is cleaning up your data.
The function may vary a lot depending on the patterns in your PDFs. Finding the page first was effective for me; you will find what fits best for your files.
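As a small follow-up (hypothetical, building on the reports list above), labelling each result by its source file makes the cleanup stage easier to track:
```
names(reports) <- files  # one data.frame per PDF, named after its source file
str(reports[[1]])        # inspect the first extracted table
```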

Separating geographical data strings in R

I'm working with QCEW data from BLS and would like to make the geographical data included more useful. I want to split the area_title column into separate columns: one with the area's name, one with the area's geographic level, and one with the state.
I got a good start using separate:
qecw <- qecw %>% separate(area_title, c("county", "geography level", "state"))
The problem is that the geographical data are arranged into strings in a variety of ways, which makes them not uniform enough to separate cleanly. The area_title column includes names in formats that separate pretty well, like:
area_title
Alabama -- Statewide
Autauga County, Alabama
which splits pretty well into
county    geography level    state
Alabama   Statewide          NA
Autauga   County             Alabama
but this breaks down for cases like:
area_title
Aleutians West Census Area, Alaska
Chattanooga-Cleveland-Dalton TN-GA-AL CSA
U.S. Combined statistical Areas, combined
as well as any states, counties or other place names that have more than one word.
I can go case-by-case to fix these, but I would appreciate a more efficient solution.
The exact data I'm using is "2019.q1-q3 10 10 Total, all industries," available at the link under "Current year quarterly data grouped by industry".
Thanks!
So far I came up with this:
I can get a place name by selecting a substring of area_title with everything to the left of the first comma:
qecw <- qecw %>% mutate(location = sub(",.*", "", area_title))
Then I have a series of nested if_else() statements to create a location type:
qecw <- qecw %>%
  mutate(`Location Type` =
           if_else(str_detect(area_title, "Statewide"), "State",
           if_else(str_detect(area_title, "County"), "County",
           if_else(str_detect(area_title, "CSA"), "CSA",
           if_else(str_detect(area_title, "MSA"), "MSA",
           if_else(str_detect(area_title, "MicroSA"), "MicroSA",
           if_else(str_detect(area_title, "Undefined"), "Undefined",
                   "other")))))))
This isn't a complete answer; I think I'm still missing some location types, and I haven't come up with a good way to extract state names yet.
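Here is one hedged sketch for the missing state step (the column names state_raw and state are illustrative; it assumes the state name, when present, follows the last comma, which fails by design for rows like the CSA examples above):
```
library(dplyr)
library(stringr)

qecw <- qecw %>%
  mutate(
    # everything after the last comma, trimmed
    state_raw = str_trim(str_extract(area_title, "[^,]+$")),
    # keep it only if it is an actual state name (state.name is built into R)
    state = if_else(state_raw %in% state.name, state_raw, NA_character_)
  )
```
Rows without a comma (such as "Alabama -- Statewide") keep their full title in state_raw and correctly end up with state = NA.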

Obtain State Name from Google Trends Interest by City

Suppose you run the following query:
gtrends("google", geo="US")$interest_by_city
This returns how many searches for the term "google" occurred across cities in the US. However, it does not provide any information regarding which state each city belongs to.
I have tried merging this data set with several others including city and state names. Given that the same city name can be present in many states, it is unclear to me how to identify which city was the one Google Trends provided data for.
I provide below a more detailed MWE.
library(gtrendsR)
library(USAboundaries)      # provides us_cities()
library(USAboundariesData)  # city data used by us_cities()
data1 <- gtrends("google", geo= "US")$interest_by_city
data1$city <- data1$location
data2 <- us_cities(map_date = NULL)
data3 <- merge(data1, data2, by="city")
And this yields the following problem:
city state
Alexandria Louisiana
Alexandria Indiana
Alexandria Kentucky
Alexandria Virginia
Alexandria Minnesota
making it difficult to know which "Alexandria" Google Trends provided the data for.
Any hints in how to identify the state of each city would be much appreciated.
One way around this is to collect the cities state by state and then rbind the resulting data frames. You could first make a vector of state codes like so:
states <- paste0("US-",state.abb)
I then used purrr for its map and reduce functionality to create a single data frame:
data <- purrr::reduce(
  purrr::map(states, function(x) gtrends("google", geo = x)$interest_by_city),
  rbind
)
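Because each per-state frame was fetched with an explicit geo code, the state can be recovered afterwards. A hedged sketch (it assumes the interest_by_city frame keeps the geo column, e.g. "US-LA", for each request):
```
# map the state abbreviation in geo back to the full state name
data$state <- state.name[match(sub("US-", "", data$geo), state.abb)]
head(data[, c("location", "geo", "state")])
```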

Merging (two and a half) countries from maps-package to one map object in R

I am looking for a map that combines Germany, Austria, and parts of Switzerland into one spatial object. This area should represent the German-speaking areas of those three countries. I have some parts in place, but cannot find a way to combine them. If there is a completely different way to solve this problem, I am still interested.
I get the German and the Austrian map by:
require(maps)
germany <- map("world", regions = "Germany", fill = TRUE, col = "white") # get the map
austria <- map("world", regions = "Austria", fill = TRUE, col = "white") # get the map
Switzerland is more complicated, as I only need the 60-70% of the country that mainly speaks German. The cantons that do so (taken from the census report) are:
cantonesGerman <- c("Uri", "Appenzell Innerrhoden", "Nidwalden", "Obwalden",
                    "Appenzell Ausserrhoden", "Schwyz", "Lucerne", "Thurgau",
                    "Solothurn", "Sankt Gallen", "Schaffhausen", "Basel-Landschaft",
                    "Aargau", "Glarus", "Zug", "Zürich", "Basel-Stadt")
The canton names can be used together with data from gadm.org/country (selecting Switzerland & SpatialPolygonsDataFrame -> Level 1, or via the direct link) to get the German-speaking areas from the GADM object:
gadmCH = readRDS("~/tmp/CHE_adm1.rds")
dataGermanSwiss <- gadmCH[gadmCH$NAME_1 %in% cantonesGerman,]
I am now missing the merging step to bring this information together. The result should be a combined map consisting of the contours of the merged area (Germany + Austria + ~70% of Switzerland), without borders between the countries. If adding or leaving out the inter-country borders were parametrizable, that would be great, but it is not a must-have.
You can do that like this:
Get the polygons you need
library(raster)
deu <- getData('GADM', country='DEU', level=0)
aut <- getData('GADM', country='AUT', level=0)
swi <- getData('GADM', country='CHE', level=1)
Subset the Swiss cantons (here an example list, not the correct one); there is no need for a loop for such things in R.
cantone <- c('Aargau', 'Appenzell Ausserrhoden', 'Appenzell Innerrhoden',
             'Basel-Landschaft', 'Basel-Stadt', 'Sankt Gallen', 'Schaffhausen',
             'Solothurn', 'Thurgau', 'Zürich')
GermanSwiss <- swi[swi$NAME_1 %in% cantone,]
Aggregate (dissolve) Swiss internal boundaries
GermanSwiss <- aggregate(GermanSwiss)
Combine the three countries and aggregate
german <- bind(deu, aut, GermanSwiss)
german <- aggregate(german)
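To verify the result, a quick plot of the merged object should show a single outline with no internal country or canton borders (a sketch; it uses the base plot method that sp provides for SpatialPolygons):
```
plot(german, col = "grey90", border = "grey40")  # one dissolved outline
```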

Census API did not provide data for selected endyear

I'm looking to pull in the recently released 2014 ACS data through the acs package. I used the following basic query:
# Set the geo marker for all TN counties
geo <- geo.make(state = "TN", county = "*")
# Fetch Total Population for all TN counties
acs.fetch(endyear = 2014, span = 5, geography = geo, table.number = "B01003")
Output (shortened) is what I would expect to see for the 2010-2014 Total Population table:
ACS DATA:
2010 -- 2014 ;
Estimates w/90% confidence intervals;
for different intervals, see confint()
B01003_001
Anderson County, Tennessee 75346 +/- 0
Bedford County, Tennessee 45660 +/- 0
Benton County, Tennessee 16345 +/- 0
But I also get this Warning, which is odd since the values for my acs.fetch match if I do a look-up in the ACS FactFinder website:
Warning messages:
1: In acs.fetch(endyear = 2014, span = 5, geography = geo, table.number = "B01003") :
As of the date of this version of the acs package
Census API did not provides data for selected endyear
2: In acs.fetch(endyear = endyear, span = span, geography = geography[[1]], :
As of the date of this version of the acs package
Census API did not provides data for selected endyear
Am I misunderstanding something here? How can I be seeing the correct values, but the Warning Messages are telling me the Census API is not providing data for my parameters? Thank you.
From the developer, Ezra Glenn (eglenn@mit.edu):
The above is essentially correct: the data is getting fetched just fine. The warning is outdated, from a time before the 2014 data was available. (Technically it's not an error -- just a warning message to possibly explain what went wrong if the data did not get fetched. In this case, it can be safely ignored. I'll remove this in the next version.)
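If the stale warning is distracting in scripts, it can be silenced without affecting the fetch (hypothetical usage of base R's suppressWarnings()):
```
tn_pop <- suppressWarnings(
  acs.fetch(endyear = 2014, span = 5, geography = geo, table.number = "B01003")
)
```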
