The code below returns exactly what I need: the median household income for each PUMA from the 2019 ACS (1-year). However, the state names are missing. I tried state = "all", but it did not work. How can I get this data by state and PUMA?
Thanks,
NM
PUMA_level <- get_acs(geography = "puma",
variable = "B19013_001",
survey = "acs1",
# state="all",
year = 2019)
Using the usmap::fips_info function you could get a list of state codes, names and abbreviations which you could then merge to your census data like so:
library(tidycensus)
library(usmap)
PUMA_level <- get_acs(geography = "puma",
variable = "B19013_001",
survey = "acs1",
year = 2019,
keep_geo_vars = TRUE)
#> Getting data from the 2019 1-year ACS
#> The 1-year ACS provides data for geographies with populations of 65,000 and greater.
PUMA_level$fips <- substr(PUMA_level$GEOID, 1, 2)
states <- usmap::fips_info(unique(PUMA_level$fips))
#> Warning in get_fips_info(fips_, sortAndRemoveDuplicates): FIPS code(s) 72 not
#> found
PUMA_level <- merge(PUMA_level, states, by = "fips")
head(PUMA_level)
#> fips GEOID
#> 1 01 0100100
#> 2 01 0100200
#> 3 01 0100302
#> 4 01 0100400
#> 5 01 0100500
#> 6 01 0100301
#> NAME
#> 1 Lauderdale, Colbert, Franklin & Marion (Northeast) Counties PUMA; Alabama
#> 2 Limestone & Madison (Outer) Counties--Huntsville City (Far West & Southwest) PUMA, Alabama
#> 3 Huntsville City (Central & South) PUMA, Alabama
#> 4 DeKalb & Jackson Counties PUMA, Alabama
#> 5 Marshall & Madison (Southeast) Counties--Huntsville City (Far Southeast) PUMA, Alabama
#> 6 Huntsville (North) & Madison (East) Cities PUMA, Alabama
#> variable estimate moe abbr full
#> 1 B19013_001 46449 3081 AL Alabama
#> 2 B19013_001 74518 6371 AL Alabama
#> 3 B19013_001 51884 5513 AL Alabama
#> 4 B19013_001 43406 3557 AL Alabama
#> 5 B19013_001 56276 3216 AL Alabama
#> 6 B19013_001 63997 5816 AL Alabama
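One thing to watch in the output above: the warning says FIPS code 72 (Puerto Rico) was not found, and merge() does an inner join by default, so rows with that prefix are silently dropped. Here is a minimal base R sketch of the effect, using invented GEOIDs and estimates and a hand-written lookup table (all hypothetical, not real ACS output). Passing all.x = TRUE keeps the unmatched rows:

```r
# Toy stand-in for the ACS result; GEOIDs and estimates are invented for illustration
puma <- data.frame(
  GEOID    = c("0100100", "0600101", "7200102"),
  estimate = c(46449, 70000, 20000),
  stringsAsFactors = FALSE
)

# The state FIPS code is the first two characters of the PUMA GEOID
puma$fips <- substr(puma$GEOID, 1, 2)

# Hand-written lookup that, like the usmap result above, has no entry for 72
states <- data.frame(
  fips = c("01", "06"),
  abbr = c("AL", "CA"),
  full = c("Alabama", "California"),
  stringsAsFactors = FALSE
)

inner <- merge(puma, states, by = "fips")                # inner join: the 72 row is dropped
left  <- merge(puma, states, by = "fips", all.x = TRUE)  # left join: the 72 row is kept with NA names

nrow(inner)
nrow(left)
```

If the Puerto Rico rows matter, the NA names left by the left join could be filled in by hand afterwards.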
I have a dataframe with data about the US States.
One of the columns in the df is "Division", which gives the division each state belongs to ("East North Central", "East South Central", "Middle Atlantic", "Mountain", "New England", "Pacific", "South Atlantic", "West North Central", "West South Central").
I created an array with the average life expectancy for each division, using an existing column called "Life Exp":
avg.life.exp = tapply(df[["Life Exp"]], df$Division, mean, na.rm=TRUE)
Which returns the following:
East North Central East South Central Middle Atlantic
70.99000 69.33750 70.63667
Mountain New England Pacific
70.94750 71.57833 71.69400
South Atlantic West North Central West South Central
69.52625 72.32143 70.43500
Now I would like to add a new column to the df with the average life expectancy of each Division. So basically I would like to do a left join: if the state belongs to East North Central, the new column should contain 70.99000, and so on.
I need to do this without using packages.
Thank you in advance for any help you can provide!
One option would be to use merge:
merge(df, data.frame(Division = names(avg.life.exp), avg.life.exp), all.x = TRUE)
A second option would be to use match:
df$avg.life.exp <- avg.life.exp[match(df$Division, names(avg.life.exp))]
Using the gapminder dataset as example data:
library(gapminder)
# Example data
df <- gapminder[gapminder$year == 2007, c("country", "continent", "lifeExp")]
avg.life.exp <- tapply(df[["lifeExp"]], df$continent, mean, na.rm=TRUE)
avg.life.exp
#> Africa Americas Asia Europe Oceania
#> 54.80604 73.60812 70.72848 77.64860 80.71950
# Using merge
df1 <- merge(df, data.frame(continent = names(avg.life.exp), avg.life.exp), all.x = TRUE)
head(df1)
#> continent country lifeExp avg.life.exp
#> 1 Africa Reunion 76.442 54.80604
#> 2 Africa Eritrea 58.040 54.80604
#> 3 Africa Algeria 72.301 54.80604
#> 4 Africa Congo, Rep. 55.322 54.80604
#> 5 Africa Equatorial Guinea 51.579 54.80604
#> 6 Africa Malawi 48.303 54.80604
# Using match
df$avg.life.exp <- avg.life.exp[match(df$continent, names(avg.life.exp))]
head(df)
#> # A tibble: 6 × 4
#> country continent lifeExp avg.life.exp
#> <fct> <fct> <dbl> <dbl>
#> 1 Afghanistan Asia 43.8 70.7
#> 2 Albania Europe 76.4 77.6
#> 3 Algeria Africa 72.3 54.8
#> 4 Angola Africa 42.7 54.8
#> 5 Argentina Americas 75.3 73.6
#> 6 Australia Oceania 81.2 80.7
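The same lookup can be tried without any packages at all. This is a minimal sketch that assumes the df in the question resembles the built-in state.x77 and state.division datasets (an assumption, since the original data is not shown). It also shows ave() as a third base R option that skips the intermediate summary vector; avg.life.exp2 is a hypothetical column name used only for the comparison:

```r
# Built-in datasets: state.x77 (numeric matrix) and state.division (factor)
df <- data.frame(state.x77, check.names = FALSE)  # keep the "Life Exp" column name as-is
df$Division <- state.division

# Per-division means, as in the question
avg.life.exp <- tapply(df[["Life Exp"]], df$Division, mean, na.rm = TRUE)

# match() finds each row's division among the names of the summary vector
df$avg.life.exp <- as.vector(avg.life.exp[match(df$Division, names(avg.life.exp))])

# ave() returns the group mean for every row directly
df$avg.life.exp2 <- ave(df[["Life Exp"]], df$Division,
                        FUN = function(x) mean(x, na.rm = TRUE))

head(df[, c("Life Exp", "Division", "avg.life.exp", "avg.life.exp2")])
```

Unlike merge, both match and ave leave the row order of df unchanged.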
I have a big data frame of 80,000 rows. It was created by combining individual data frames from different years. The origin variable indicates the year of the entry's original data frame.
Here is an example of the first few of the big data frame rows that show how data frames from 2003 and 2011 were combined.
df_1:
ID City State origin
1 NY NY 2003
2 NY NY 2003
3 SF CA 2003
1 NY NY 2011
3 SF CA 2011
2 NY NY 2011
4 LA CA 2011
5 SD CA 2011
Now I want to create a new variable called first_appearance that takes the min of the origin variable for each ID:
final_df:
ID City State origin first_appearance
1 NY NY 2003 2003
2 NY NY 2003 2003
3 SF CA 2003 2003
1 NY NY 2011 2003
3 SF CA 2011 2003
2 NY NY 2011 2003
4 LA CA 2011 2011
5 SD CA 2011 2011
So far, I've tried using:
prestep_final <- df_1 %>% group_by(ID) %>% summarise(first_appearance = min(origin))
final_df <- merge(prestep_final, df_1, by = "ID")
Prestep_final works and produces a data frame with the ID and the first_appearance.
Unfortunately, the merge step doesn't work and yields a data frame with NA entries only.
How can I improve my code so that it produces a table like final_df above? I'd appreciate any suggestions and don't have package preferences.
If you change summarise to mutate you get your desired result without merging:
library(tidyverse)
df <- tibble::tribble(
~ID, ~City, ~State, ~origin,
1, 'NY', 'NY', 2003,
2, 'NY', 'NY', 2003,
3, 'SF', 'CA', 2003,
1, 'NY', 'NY', 2011,
3, 'SF', 'CA', 2011,
2, 'NY', 'NY', 2011,
4, 'LA', 'CA', 2011,
5, 'SD', 'CA', 2011
)
df %>% group_by(ID) %>%
mutate(first_appearance = min(origin))
#> # A tibble: 8 x 5
#> # Groups: ID [5]
#> ID City State origin first_appearance
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 NY NY 2003 2003
#> 2 2 NY NY 2003 2003
#> 3 3 SF CA 2003 2003
#> 4 1 NY NY 2011 2003
#> 5 3 SF CA 2011 2003
#> 6 2 NY NY 2011 2003
#> 7 4 LA CA 2011 2011
#> 8 5 SD CA 2011 2011
Created on 2020-06-10 by the reprex package (v0.3.0)
An option with data.table
library(data.table)
setDT(df)[, first_appearance := min(origin), ID]
Or in base R
df$first_appearance <- with(df, ave(origin, ID, FUN = min))
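For comparison, the two-step route from the question (summarise per ID, then merge back) can also be made to work in base R. This is a minimal sketch on the example rows, with `first` as a hypothetical helper name. Note that merge sorts the result by the key, so the row order differs from the ave version:

```r
# Example rows from the question
df <- data.frame(
  ID     = c(1, 2, 3, 1, 3, 2, 4, 5),
  City   = c("NY", "NY", "SF", "NY", "SF", "NY", "LA", "SD"),
  State  = c("NY", "NY", "CA", "NY", "CA", "NY", "CA", "CA"),
  origin = c(2003, 2003, 2003, 2011, 2011, 2011, 2011, 2011)
)

# Step 1: one row per ID holding the earliest origin
first <- aggregate(origin ~ ID, data = df, FUN = min)
names(first)[2] <- "first_appearance"

# Step 2: join the per-ID minimum back onto every row
final_df <- merge(df, first, by = "ID")
final_df
```

Renaming the summary column before merging avoids a clash with the existing origin column.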
Here, I am manipulating election data, and the current data is in the following format. Both a visual and a coded example are included (the visual one is slightly condensed). The values have been edited from their originals.
# Representative Example
library(tidyverse)
test.df <- tibble(yr=rep(1956),mn=rep(11),
sub=rep("Alabama"),
unit_type=rep("County"),
unit_name=c("Autauga","Baldwin","Barbour"),
TotalVotes=c(1000,2000,3000),
RepVotes=c(500,1000,1500),
RepCandidate=rep("Eisenhower"),
DemVotes=c(500,1000,1500),
DemCandidate=rep("Stevenson"),
ThirdVotes=c(0,0,0),
ThirdCandidate=rep("Uncommitted"),
RepVotesTotalPerc=rep(50.00),
DemVotesTotalPerc=rep(50.00),
ThirdVotesTotalPerc=rep(0.00)
)
----------------------------------------------------------------------------------------------------
yr | mn | sub | unit_type | unit_name | TotalVotes | RepVotes | RepCan | DemVotes | DemCan
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga 1000 500 Eisenhower 500 Stevenson
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Baldwin 2000 1000 Eisenhower 1000 Stevenson
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Barbour 3000 1500 Eisenhower 1500 Stevenson
----------------------------------------------------------------------------------------------------
I am trying to get a table that looks like the following:
----------------------------------------------------------------------------------------------------
yr | mn | sub | unit_type | unit_name | pty_n | can | TotalVotes | CanVotes
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Republican Eisenhower 1000 500
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Democrat Stevenson 1000 500
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Independent Uncommitted 1000 0
----------------------------------------------------------------------------------------------------
# and etc. for other counties in example (Baldwin, Barbour, etc)
As you can see, I pretty much want three observations per county, where candidates are all in one column, as well as their respective votes in another (CanVotes, or the like).
I have tried using things like pivot_longer() or spread(), but I am having a hard time visualizing these in code. Any help here would be greatly appreciated in sort of reorienting my data to get a candidate column, but also moving the rest of the data with it!
Here is a solution that first uses pivot_longer to bring the Votes into a long format. Then I use mutate with case_when to substitute the former column names with the actual candidate names and delete the single candidate columns:
long_table <- pivot_longer(test.df,
cols = c(RepVotes, DemVotes, ThirdVotes),
names_to = "pty_n",
values_to = "CanVotes") %>%
mutate(can = case_when(
pty_n == "RepVotes" ~ RepCandidate,
pty_n == "DemVotes" ~ DemCandidate,
pty_n == "ThirdVotes" ~ ThirdCandidate
),
pty_n = case_when(
pty_n == "RepVotes" ~ "Republican",
pty_n == "DemVotes" ~ "Democrat",
pty_n == "ThirdVotes" ~ "Independent"
)) %>%
select(-c(RepCandidate, DemCandidate, ThirdCandidate))
# A tibble: 9 x 12
yr mn sub unit_type unit_name TotalVotes RepVotesTotalPerc DemVotesTotalPerc ThirdVotesTotalPe~ pty_n CanVotes can
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1956 11 Alabama County Autauga 1000 50 50 0 Republican 500 Eisenhower
2 1956 11 Alabama County Autauga 1000 50 50 0 Democrat 500 Stevenson
3 1956 11 Alabama County Autauga 1000 50 50 0 Independe~ 0 Uncommitt~
4 1956 11 Alabama County Baldwin 2000 50 50 0 Republican 1000 Eisenhower
5 1956 11 Alabama County Baldwin 2000 50 50 0 Democrat 1000 Stevenson
6 1956 11 Alabama County Baldwin 2000 50 50 0 Independe~ 0 Uncommitt~
7 1956 11 Alabama County Barbour 3000 50 50 0 Republican 1500 Eisenhower
8 1956 11 Alabama County Barbour 3000 50 50 0 Democrat 1500 Stevenson
9 1956 11 Alabama County Barbour 3000 50 50 0 Independe~ 0 Uncommitt~
I tried to build a custom spec, but it seems that the names have to be derived from the column names and can't be directly conditional on other columns.
Here is a data.table go at things
library( data.table )
#convert data to the data.table-format
setDT( test.df )
#get the different parties to update the pty_n variable later on
parties <- gsub( "Candidate", "", grep( "^.*Candidate$", names( test.df ), value = TRUE ) )
#melt to each candidate and his/her votes
DT.melt <- melt(test.df,
id.vars = c("yr", "mn", "sub", "unit_type", "unit_name"),
measure.vars = patterns( can = "^.*Candidate$",
canVotes = "^(Rep|Dem|Third)Votes$" ),
variable.name = "pty_n"
)
#get the totals from the original data (by unit_name) through joining
DT.melt[ test.df, TotalVotes := i.TotalVotes, on = .(unit_name)]
#and pass the correct party name to the pty_n column
DT.melt[, pty_n := parties[ pty_n ] ][]
# yr mn sub unit_type unit_name pty_n can canVotes TotalVotes
# 1: 1956 11 Alabama County Autauga Rep Eisenhower 500 1000
# 2: 1956 11 Alabama County Baldwin Rep Eisenhower 1000 2000
# 3: 1956 11 Alabama County Barbour Rep Eisenhower 1500 3000
# 4: 1956 11 Alabama County Autauga Dem Stevenson 500 1000
# 5: 1956 11 Alabama County Baldwin Dem Stevenson 1000 2000
# 6: 1956 11 Alabama County Barbour Dem Stevenson 1500 3000
# 7: 1956 11 Alabama County Autauga Third Uncommitted 0 1000
# 8: 1956 11 Alabama County Baldwin Third Uncommitted 0 2000
# 9: 1956 11 Alabama County Barbour Third Uncommitted 0 3000
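If no packages are available, base R's reshape() can do the same wide-to-long step. This is a minimal sketch on a cut-down copy of the example data (only the columns needed for the reshape; `wide` and `long` are hypothetical names). Each element of varying lists the columns that collapse into one long column, and times supplies the party labels in the same order:

```r
# Cut-down copy of the example data as a plain data.frame
wide <- data.frame(
  unit_name      = c("Autauga", "Baldwin", "Barbour"),
  TotalVotes     = c(1000, 2000, 3000),
  RepVotes       = c(500, 1000, 1500),
  RepCandidate   = "Eisenhower",
  DemVotes       = c(500, 1000, 1500),
  DemCandidate   = "Stevenson",
  ThirdVotes     = c(0, 0, 0),
  ThirdCandidate = "Uncommitted",
  stringsAsFactors = FALSE
)

long <- reshape(
  wide,
  direction = "long",
  # first group becomes CanVotes, second group becomes can
  varying   = list(c("RepVotes", "DemVotes", "ThirdVotes"),
                   c("RepCandidate", "DemCandidate", "ThirdCandidate")),
  v.names   = c("CanVotes", "can"),
  timevar   = "pty_n",
  times     = c("Republican", "Democrat", "Independent"),
  idvar     = "unit_name"
)

# reshape() stacks party by party; sort by county to get three rows per county together
long <- long[order(long$unit_name), ]
long
```

Columns that are not listed in varying, such as TotalVotes, are repeated on every row for the county.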
I also want to change the state column to be in terms of the FIPS code. Just not sure what parameters to use and how to do this since I am new to R.
Here are the parameters given by R:
plot_usmap(regions = c("states", "state", "counties", "county"),
include = c(), data = data.frame(), values = "values",
theme = theme_map(), lines = "black", labels = FALSE,
label_color = "black")
It is unclear exactly what you are trying to achieve without an example, but here is how I was able to convert a column state in a data.frame from the abbreviation to the FIPS code:
> library(usmap)
> df <- statepop[1:5, -1]
> names(df)[1] <- 'state'
> df
# A tibble: 5 x 3
state full pop_2015
<chr> <chr> <dbl>
1 AL Alabama 4858979
2 AK Alaska 738432
3 AZ Arizona 6828065
4 AR Arkansas 2978204
5 CA California 39144818
> df$fips <- fips(df$state)
> df
# A tibble: 5 x 4
state full pop_2015 fips
<chr> <chr> <dbl> <chr>
1 AL Alabama 4858979 01
2 AK Alaska 738432 02
3 AZ Arizona 6828065 04
4 AR Arkansas 2978204 05
5 CA California 39144818 06
I have an origin-destination table like this.
library(dplyr)
set.seed(1983)
namevec <- c('Portugal', 'Romania', 'Nigeria', 'Peru', 'Texas', 'New Jersey', 'Colorado', 'Minnesota')
## Create OD pairs
df <- tibble(origins = sample(namevec, size = 100, replace = TRUE),
destinations = sample(namevec, size = 100, replace = TRUE))
Question
I got stuck counting the relationships for each origin-destination pair (ignoring direction).
How can I get output that Colorado-Minnesota and Minnesota-Colorado are seen as one group?
What I have tried so far:
## Counts for each OD-pairs
df %>%
group_by(origins, destinations) %>%
summarize(counts = n()) %>%
ungroup() %>%
arrange(desc(counts))
Source: local data frame [48 x 3]
origins destinations counts
(chr) (chr) (int)
1 Nigeria Colorado 5
2 Colorado Portugal 4
3 New Jersey Minnesota 4
4 New Jersey New Jersey 4
5 Peru Nigeria 4
6 Peru Peru 4
7 Romania Texas 4
8 Texas Nigeria 4
9 Minnesota Minnesota 3
10 Nigeria Portugal 3
.. ... ... ...
One way is to combine the sorted combination of the two locations into a single field. Summarizing on that will remove your two original columns, so you'll need to join them back in.
paired <- df %>%
mutate(
orderedpair = paste(pmin(origins, destinations), pmax(origins, destinations), sep = "::")
)
paired
# # A tibble: 100 × 3
# origins destinations orderedpair
# <chr> <chr> <chr>
# 1 Peru Colorado Colorado::Peru
# 2 Romania Portugal Portugal::Romania
# 3 Romania Colorado Colorado::Romania
# 4 New Jersey Minnesota Minnesota::New Jersey
# 5 Minnesota Texas Minnesota::Texas
# 6 Romania Texas Romania::Texas
# 7 Peru Peru Peru::Peru
# 8 Romania Nigeria Nigeria::Romania
# 9 Portugal Minnesota Minnesota::Portugal
# 10 Nigeria Colorado Colorado::Nigeria
# # ... with 90 more rows
left_join(
paired,
group_by(paired, orderedpair) %>% count(),
by = "orderedpair"
) %>%
select(-orderedpair) %>%
distinct() %>%
arrange(desc(n))
# # A tibble: 48 × 3
# origins destinations n
# <chr> <chr> <int>
# 1 Romania Portugal 6
# 2 New Jersey Minnesota 6
# 3 Portugal Romania 6
# 4 Minnesota New Jersey 6
# 5 Romania Texas 5
# 6 Nigeria Colorado 5
# 7 Texas Nigeria 5
# 8 Texas Romania 5
# 9 Nigeria Texas 5
# 10 Peru Peru 4
# # ... with 38 more rows
(The only reason I used "::" as the separator is in the unlikely event you need to parse orderedpair; using the default " " (space) won't work with (e.g.) "New Jersey" in the mix.)
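If one row per undirected pair is enough, the separator question can be avoided altogether by keeping the two sorted endpoints in separate columns instead of pasting them. This is a minimal base R sketch on a small hand-made table (the rows and the names `od`, `a`, `b` and `counts` are hypothetical, chosen only for illustration):

```r
# Small hand-made origin-destination table
od <- data.frame(
  origins      = c("Colorado", "Minnesota", "Texas", "New Jersey", "Peru"),
  destinations = c("Minnesota", "Colorado", "New Jersey", "Texas", "Peru"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() compare the strings element-wise, so each pair gets one canonical order
od$a <- pmin(od$origins, od$destinations)
od$b <- pmax(od$origins, od$destinations)

# Count rows per unordered pair by summing a column of ones
counts <- aggregate(n ~ a + b, data = transform(od, n = 1L), FUN = sum)
counts[order(-counts$n), ]
```

Here Colorado-Minnesota and Minnesota-Colorado end up in the same group, as do Texas-New Jersey and New Jersey-Texas.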