R Tidy Census - Variable for Voters 18+

I have been trying to find a variable in tidycensus' latest American Community Survey (ACS) variable list.
The one I'm looking for would cover everyone aged 18 and up, i.e. the voting-age population. I have yet to find it in the list. Even if I have to combine a couple of variables to make it work, that's fine too.
I searched for relevant keywords related to age, but have yet to find anything. The variables that do appear with "18 years and over" are more specific than what I am looking for. I may be missing something though; I'm new to tidycensus.
Help would be greatly appreciated!

Finding Census variables is difficult. Start here: https://data.census.gov/table?q=ACS
In table S0101, under Labels, there is a Selected Age Categories entry named "18 years and over".
Searching those keywords against the API's variable list turns up this long list: https://api.census.gov/data/2019/acs/acs1/subject/variables/
where we find the variable "S0101_C01_026":
["S0101_C01_026E","Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!18 years and over","AGE AND SEX"],
Then we can get that variable:
library(tidycensus)

county_data <- get_acs(geography = "county",
                       variables = "S0101_C01_026",
                       cache_table = TRUE,
                       year = 2021)
county_data
# A tibble: 3,221 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama S0101_C01_026 44438 122
2 01003 Baldwin County, Alabama S0101_C01_026 178105 NA
3 01005 Barbour County, Alabama S0101_C01_026 19995 28
4 01007 Bibb County, Alabama S0101_C01_026 17800 44
5 01009 Blount County, Alabama S0101_C01_026 45201 75
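The lookup can also be done from inside R with tidycensus's load_variables() plus a keyword filter. A minimal sketch: the filter below runs on a small mock of the variable table so it is self-contained; in practice you would pass the real result of load_variables(2021, "acs5/subject", cache = TRUE) instead of the mock data frame.

```r
# Mock of the table load_variables() returns (name / label / concept columns),
# so the filtering logic runs without an API call:
vars <- data.frame(
  name  = c("S0101_C01_026", "S0101_C01_027"),
  label = c("Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!18 years and over",
            "Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!21 years and over"),
  concept = "AGE AND SEX"
)

# Case-insensitive keyword search over the labels
hits <- vars[grepl("18 years and over", vars$label, ignore.case = TRUE), ]
hits$name
# "S0101_C01_026"
```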


how do I constrain a variable to 5 digits but ensure it only deletes from the right [duplicate]

This question already has answers here:
Extract the first 2 Characters in a string
(4 answers)
Closed 2 years ago.
I'm sure someone has asked this before, or that I could research an efficient way to do this, but I'm tight on time and not sure how to word my issue.
I have a data frame of large dimensions but I noticed that for some reason one of my columns has odd numbers.
head(testCA_extract[5])
ZIP_CODE
1 94801
2 94801
3 928034250
4 92714
5 95054
6 94565
from
> head(testCA_extract[2:6])
REPORTING_YEAR STATE_COUNTY_FIPS_CODE COUNTY_NAME ZIP_CODE CITY_NAME
1 1990 06013 CONTRA COSTA 94801 RICHMOND
2 1990 06013 CONTRA COSTA 94801 RICHMOND
3 1990 06059 ORANGE 928034250 ANAHEIM
4 1990 06059 ORANGE 92714 IRVINE
5 1990 06085 SANTA CLARA 95054 SANTA CLARA
6 1990 06013 CONTRA COSTA 94565 PITTSBURG
For anyone unfamiliar, ZIP codes are supposed to be exactly 5 digits. I'm not sure why there are extra digits, but it appears that the first 5 digits, regardless of length, are the correct ZIP code.
So I need to either select only the first 5 digits, or constrain the variable to the first 5 digits and delete the rest, and then I need that information to go back to its proper row and column in the data frame.
For your future posts, it is good practice to include a minimal, reproducible example. In this simple case,
x <- as.numeric(substr(as.character(x), 1, 5))
where x is the variable containing your ZIP codes should do the trick.
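One hedge worth adding: as.numeric() drops leading zeros, which real ZIP codes can have (e.g. "02139" in Massachusetts). If your column may contain them, a safer sketch keeps the result as character:

```r
zips <- c("94801", "928034250", "02139")

# Keep only the first 5 characters; the column stays character,
# so leading zeros survive
zip5 <- substr(as.character(zips), 1, 5)
zip5
# "94801" "92803" "02139"
```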

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge two datasets on two variables while keeping all the data from each. I've tried merge and all of the join operations from dplyr, as well as cbind, and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that happens, as when I use full_join in dplyr or all = TRUE in merge, is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
You can specify the join columns in merge with the by argument:
merge(x, y, by = c("col1", "col2"))
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!
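For reference, base R (since 3.2.0) ships trimws(), which does the same cleanup without a custom helper; a minimal sketch on some hypothetical county names:

```r
county <- c("Anchorage ", " Bethel", "Haines")

# Strip whitespace from both ends (the default, which = "both")
trimws(county)
# "Anchorage" "Bethel" "Haines"

# Trailing whitespace only, equivalent to the trim.trailing helper above
trimws(county, which = "right")
# "Anchorage" " Bethel" "Haines"
```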

Reshape long to wide where most columns have multiple values

I have data as below:
IDnum zipcode City County State
10011 36006 Billingsley Autauga AL
10011 36022 Deatsville Autauga AL
10011 36051 Marbury Autauga AL
10011 36051 Prattville Autauga AL
10011 36066 Prattville Autauga AL
10011 36067 Verbena Autauga AL
10011 36091 Selma Autauga AL
10011 36703 Jones Autauga AL
10011 36749 Plantersville Autauga AL
10011 36758 Uriah Autauga AL
10011 36480 Atmore Autauga AL
10011 36502 Bon Secour Autauga AL
I have a list of zipcodes, the cities they encompass, and the counties/states they are located in. IDnum is a numeric value for county and state combined. The list is in the format you see now; I need to reshape it from long to wide (vertical to horizontal), so that IDnum becomes the unique identifier and all other possible value combinations become wide variables.
IDnum zip1 city1 county1 state1 zip2 city2 county2
10011 36006 Billingsley Autauga AL 36022 Deatsville Autauga
This is just sample of the dataset, it encompasses every zip in the USA and includes more variables. I have seen other questions and answers similar to this one, but not where there are multiple values in almost every column.
There are commands in SPSS and Stata that will reshape data this way; in SPSS I can run a Restructure/Cases to Vars command that turns the 11 variables in my initial dataset into about 1750, because one county has over 290 zips and it replicates most of the other variables 290+ times. This will create many blanks, but I need it reshaped into one very long horizontal file.
I have looked at reshape and reshape2, and am hung up on the "Aggregation function missing: defaulting to length" warning. I did get melt/dcast to sort of work, but this creates one variable that is a list of all values, rather than creating variables for each value.
library(reshape2)
melted_dupes <- melt(zip_code_list_dupes, id.vars = "IDnum")
HRZ_dupes <- dcast(melted_dupes, IDnum ~ variable, fun.aggregate = list)
I have tried tidyr and dplyr but got lost in syntax. Am a little surprised there isn't a command the data similar to built in commands in other packages, making me assume there is, and I just haven't figured it out.
Any help is appreciated.
You can do this with the base function reshape after adding a consecutive count within each IDnum. Assuming your data is stored in a data.frame named df:
df2 <- within(df, count <- ave(rep(1, nrow(df)), IDnum, FUN = cumsum))
This adds a new column named "count" holding the consecutive count. Now we can reshape to wide format:
reshape(df2, direction = "wide", idvar = "IDnum", timevar = "count")
IDnum zipcode.1 City.1 County.1 State.1 zipcode.2 City.2 County.2 State.2 zipcode.3 City.3 County.3 State.3 zipcode.4 City.4 County.4 State.4
1 10011 36006 Billingsley Autauga AL 36022 Deatsville Autauga AL 36051 Marbury Autauga AL 36051 Prattville Autauga AL
(output truncated, goes all the way to zipcode.12, etc.)
There might be a more efficient way, but try the following.
I used my own (example) dataset, very similar to yours.
Run the process step by step to see how it works, as you'll have to modify some things in the code.
library(dplyr)
library(tidyr)
# get example data
dt = data.frame(id = c(1, 1, 1, 2, 2),
                zipcode = c(4, 5, 6, 7, 8),
                city = c("A", "B", "C", "A", "C"),
                county = c("A", "B", "C", "A", "C"),
                state = c("A", "B", "C", "A", "C"))
dt
# id zipcode city county state
# 1 1 4 A A A
# 2 1 5 B B B
# 3 1 6 C C C
# 4 2 7 A A A
# 5 2 8 C C C
# get maximum number of rows for a single id
# this will help you get the wide format
max_num_rows = max((dt %>% count(id))$n)
# get names of columns to reshape
col_names = names(dt)[-1]
dt %>%
  group_by(id) %>%
  mutate(nrow = paste0("row", row_number())) %>%
  unite_("V", col_names) %>%
  spread(nrow, V) %>%
  unite("z", matches("row")) %>%
  separate(z, paste0(col_names, sort(rep(1:max_num_rows, ncol(dt) - 1))), convert = TRUE) %>%
  ungroup()
# # A tibble: 2 × 13
# id zipcode1 city1 county1 state1 zipcode2 city2 county2 state2 zipcode3 city3 county3 state3
# * <dbl> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr>
# 1 1 4 A A A 5 B B B 6 C C C
# 2 2 7 A A A 8 C C C NA <NA> <NA> <NA>
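Since this answer was written, spread() and unite_() have been superseded; tidyr's pivot_wider() accepts several value columns at once, so the same reshape can be sketched more directly on the same example data (column names like zipcode1, city1 come from names_sep = ""):

```r
library(dplyr)
library(tidyr)

dt <- data.frame(id = c(1, 1, 1, 2, 2),
                 zipcode = c(4, 5, 6, 7, 8),
                 city = c("A", "B", "C", "A", "C"),
                 county = c("A", "B", "C", "A", "C"),
                 state = c("A", "B", "C", "A", "C"))

wide <- dt %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%   # consecutive count within each id
  ungroup() %>%
  pivot_wider(names_from = row,
              values_from = c(zipcode, city, county, state),
              names_sep = "")
wide
# 2 rows, 13 columns; id 2 has only two records, so its *3 columns are NA
```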

combining observations based on a criteria in R [duplicate]

This question already has answers here:
Collapsing a data frame over one variable
(3 answers)
Closed 7 years ago.
I have a data set that looks like this:
geoid zip dealers Year County
1001 36703 1 2001 Autauga County, AL
1001 36704 3 2001 Autauga County, AL
1003 36535 7 2000 Baldwin County, AL
1003 36536 3 2000 Baldwin County, AL
And I want to take all the rows that are the same except for 'dealers' and 'zip' and combine them into one row with the dealer variable summed from all the similar rows. (I'm not sure what the easiest thing is to do with zip, either list them all or leave it out? Doesn't really matter.) So this would become:
geoid dealers Year County
1001 4 2001 Autauga County, AL
1003 10 2000 Baldwin County, AL
Is there any way to create a new dataset like this? (Incidentally, I got here by merging three datasets, so if there's a better way to merge without creating these duplicates, that would be helpful as well.)
This gives you the desired result:
df <- read.table(header=TRUE, text=
'geoid zip dealers Year County
1001 36703 1 2001 "Autauga County, AL"
1001 36704 3 2001 "Autauga County, AL"
1003 36535 7 2000 "Baldwin County, AL"
1003 36536 3 2000 "Baldwin County, AL"')
aggregate(dealers ~ geoid+Year+County, data=df[-2], FUN=sum) # or
aggregate(dealers ~ geoid+Year+County, data=df, FUN=sum)
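The same collapse can also be written with dplyr's group_by()/summarise(), which many find more readable; a sketch on the question's data:

```r
library(dplyr)

df <- data.frame(geoid = c(1001, 1001, 1003, 1003),
                 zip = c(36703, 36704, 36535, 36536),
                 dealers = c(1, 3, 7, 3),
                 Year = c(2001, 2001, 2000, 2000),
                 County = c("Autauga County, AL", "Autauga County, AL",
                            "Baldwin County, AL", "Baldwin County, AL"))

# Sum dealers within each geoid/Year/County group; zip is simply dropped
collapsed <- df %>%
  group_by(geoid, Year, County) %>%
  summarise(dealers = sum(dealers), .groups = "drop")
collapsed
# geoid 1001 -> 4 dealers, geoid 1003 -> 10 dealers
```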

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City Country
=========================
New York US
San Francisco US
Los Angeles US
Paris France
Nantes France
Berlin Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table.
table(df$Country)
# France Germany US
# 2 1 3
Or using data.table
library(data.table)
setDT(df)[, .N, by=Country]
# Country N
#1: US 3
#2: France 2
#3: Germany 1
Or
library(plyr)
count(df$Country)
# x freq
#1 France 2
#2 Germany 1
#3 US 3
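To also get the counts in the descending order the question asks for (3 US, 2 France, 1 Germany), the table() result can be converted to a data frame and sorted; a base-R sketch:

```r
df <- data.frame(City = c("New York", "San Francisco", "Los Angeles",
                          "Paris", "Nantes", "Berlin"),
                 Country = c("US", "US", "US", "France", "France", "Germany"))

# table() counts occurrences; as.data.frame() gives Country/Freq columns
counts <- as.data.frame(table(Country = df$Country))

# Sort descending by count
counts[order(-counts$Freq), ]
#   Country Freq
# 3      US    3
# 1  France    2
# 2 Germany    1
```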
