combining observations based on a criteria in R [duplicate] - r

I have a data set that looks like this:
geoid zip dealers Year County
1001 36703 1 2001 Autauga County, AL
1001 36704 3 2001 Autauga County, AL
1003 36535 7 2000 Baldwin County, AL
1003 36536 3 2000 Baldwin County, AL
And I want to take all the rows that are the same except for 'dealers' and 'zip' and combine them into one row with the dealer variable summed from all the similar rows. (I'm not sure what the easiest thing is to do with zip, either list them all or leave it out? Doesn't really matter.) So this would become:
geoid dealers Year County
1001 4 2001 Autauga County, AL
1003 10 2000 Baldwin County, AL
Is there any way to create a new dataset like this? (Incidentally, I got here by merging three datasets, so if there's a better way to merge without creating these duplicates, that would be helpful as well.)

This gives you the desired result:
df <- read.table(header=TRUE, text=
'geoid zip dealers Year County
1001 36703 1 2001 "Autauga County, AL"
1001 36704 3 2001 "Autauga County, AL"
1003 36535 7 2000 "Baldwin County, AL"
1003 36536 3 2000 "Baldwin County, AL"')
aggregate(dealers ~ geoid+Year+County, data=df[-2], FUN=sum) # or
aggregate(dealers ~ geoid+Year+County, data=df, FUN=sum)
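If you would rather keep the zip codes than drop them, one option is to collapse them into a single string per group and merge that back onto the summed dealers. This is only a sketch in base R, assuming the same example data as above; the names dealers_sum, zip_list and result are made up for illustration.

```r
df <- read.table(header = TRUE, text =
'geoid zip dealers Year County
1001 36703 1 2001 "Autauga County, AL"
1001 36704 3 2001 "Autauga County, AL"
1003 36535 7 2000 "Baldwin County, AL"
1003 36536 3 2000 "Baldwin County, AL"')

# Sum dealers within each geoid/Year/County group
dealers_sum <- aggregate(dealers ~ geoid + Year + County, data = df, FUN = sum)

# Collapse the zip codes of each group into one comma-separated string
zip_list <- aggregate(zip ~ geoid + Year + County, data = df,
                      FUN = function(z) paste(z, collapse = ", "))

# Put both summaries side by side, one row per group
result <- merge(dealers_sum, zip_list, by = c("geoid", "Year", "County"))
result
```

The zip column in result is a character column, which is fine for display but would need splitting again if the individual codes are needed later.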

Related

R Tidy Census - Variable for Voters 18+

I have been trying to find a variable on the Tidy Census’ latest American Community Survey (ACS) variable list.
The one I’m looking for would be for all voters aged 18 and up. I have yet to find it in the list. Even if I have to combine a couple variables to make it work, that’s fine too.
I search for relevant keywords related to age, but have yet to find anything. Variables that appear with “18 years and over” have a greater specificity than what I am looking for. I may be missing something though; I’m new to Tidy Census.
Help would be greatly appreciated!
Finding Census variables is difficult. Start here: https://data.census.gov/table?q=ACS
In table S0101 under labels there is a selected age categories variable named 18 years and over.
Searching for those keywords turns up this long list: https://api.census.gov/data/2019/acs/acs1/subject/variables/
There we find the variable "S0101_C01_026":
["S0101_C01_026E","Estimate!!Total!!Total population!!SELECTED AGE CATEGORIES!!18 years and over","AGE AND SEX"],
Then we can get that variable:
library(tidycensus) # needs a Census API key, set once with census_api_key()
county_data<-get_acs(geography = "county",
variables = "S0101_C01_026",
cache_table=TRUE,
year=2021)
county_data
# A tibble: 3,221 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama S0101_C01_026 44438 122
2 01003 Baldwin County, Alabama S0101_C01_026 178105 NA
3 01005 Barbour County, Alabama S0101_C01_026 19995 28
4 01007 Bibb County, Alabama S0101_C01_026 17800 44
5 01009 Blount County, Alabama S0101_C01_026 45201 75

How to merge two datasets by two common columns? [duplicate]

I have two datasets that look like this that I am having difficulty with merging.
I've already tried:
ndf <- merge(df1, df2, by=c("state", "year"))
but it ended up with a data frame with 200,000 observations. Here are my two example data sets, df1 is empty in the "income" and "local_income" column:
df1
state year income local_income
CA 1992
CA 1993
CA 1994
CA 1995
CA 1996
NV 1992
NV 1993
NV 1994
NV 1995
NV 1996
CO 1992
CO 1993
CO 1994
CO 1995
CO 1996
df2
state year income local_income
CA 1992 1 1
NV 1992 4 3
CO 1992 3 2
Essentially what I want to do is merge these two datasets to look like this:
df3
state year income local_income
CA 1992 1 1
CA 1993
CA 1994
CA 1995
CA 1996
NV 1992 4 3
NV 1993
NV 1994
NV 1995
NV 1996
CO 1992 3 2
CO 1993
CO 1994
CO 1995
CO 1996
And then I'll eventually go on to merging for each year. But this is a good start to get me going. Any help will be greatly appreciated! This would otherwise take me 8+ hours to do with all the data I have, so I'm excited to see the power of R and its community!
You can also try the dplyr version.
library(dplyr)
df3 <- full_join(df1, df2, by=c("state", "year"))
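Because df1 already carries empty income and local_income columns, joining on state and year alone would leave suffixed duplicates (income.x, income.y) in the result. An unexpectedly large result, such as the 200,000 rows mentioned in the question, often means the state/year combination is duplicated in one of the tables. The sketch below assumes cut-down versions of df1 and df2 rebuilt from the example, checks the keys, drops the empty columns, and keeps every row of df1. It uses base R merge(); the dplyr equivalent is shown in a comment.

```r
# Cut-down versions of the example tables (rebuilt here for illustration)
df1 <- data.frame(
  state = rep(c("CA", "NV", "CO"), each = 5),
  year = rep(1992:1996, times = 3),
  income = NA,
  local_income = NA
)
df2 <- data.frame(
  state = c("CA", "NV", "CO"),
  year = 1992L,
  income = c(1, 4, 3),
  local_income = c(1, 3, 2)
)

keys <- c("state", "year")

# Duplicated key combinations multiply rows in a join, so check first
stopifnot(anyDuplicated(df1[keys]) == 0, anyDuplicated(df2[keys]) == 0)

# Keep only the key columns of df1, then pull in the values from df2;
# all.x = TRUE keeps df1 rows that have no match (filled with NA)
df3 <- merge(df1[keys], df2, by = keys, all.x = TRUE)
# dplyr equivalent: df3 <- left_join(df1[keys], df2, by = keys)
df3
```

Note that merge() sorts the result by the key columns, so the states come out in alphabetical order rather than in the order shown in the question.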

How to Merge Uneven Data Frames With Real Data

Problem:
I have two data sets of different sizes that I would like to merge, without dropping rows or inserting NAs. To compare this to an Excel situation: you would have five columns, and you would drag down three of them to fill the blank space left by the rows inserted when adding your data to the fourth and fifth columns.
Example Data Set
zipcode = a, step3 = b in my later brainstorming code to solve my problem
> head(zipcode_joincsv)
zip city abv latitude longitude median mean pop
226 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
227 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
228 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
229 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
230 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
231 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
> head(step3_df)
tolower.state.name. state.abb
1 alabama AL
2 alaska AK
3 arizona AZ
4 arkansas AR
5 california CA
6 colorado CO
Desired Result:
One DF where each zipcode/city combination is combined with its state's population and
income. The column the two have in common is the state abbreviation.
tolower.state.name. zip city abv latitude longitude median mean pop
1 alabama 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
2 alabama 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
3 alabama 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
4 alabama 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
5 alabama 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
6 alabama 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
7 alaska data from these rows
8 arizona data from these rows
9 arkansas data from these rows
10 california data from these rows
11 colorado data from these rows
I've contemplated using something like
sqldf ("SELECT a.Zip, a.City, a.State Abv, a.Lat, a.Long, a.median, a.mean, a.pop, b.state.name, b.states.abb, b.pop, b.income
FROM a a
LEFT JOIN b b using (abv)")
I know that probably won't work. Even if it did, every row without a match in A would get an NA, whereas what I want is for each state's average income and total population to be copied down to every row with that abbreviation (for example every NY row), then likewise for every AR, every AL, and so on, until the two data sets are one and a ggplot using all of the data can be created.
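dplyr::left_join(a, b, by="abv") should work if both data frames have a column named abv. With the column names shown above, where the abbreviation is abv in a and state.abb in b, use dplyr::left_join(a, b, by=c("abv"="state.abb")).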
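A left join already copies the state-level values down to every matching zip row, which is the "drag down" behaviour described in the question. Here is a minimal sketch in base R, assuming a tiny stand-in for a (the third row is hypothetical, added only to show a second state) and building b from R's built-in state.name and state.abb vectors, which yields the same column names as step3_df.

```r
# Small stand-in for the zip-level table; the Anchorage row is hypothetical
a <- data.frame(
  zip = c("01749", "01752", "99501"),
  city = c("Hudson", "Marlborough", "Anchorage"),
  abv = c("AL", "AL", "AK"),
  stringsAsFactors = FALSE
)

# State-level table built from R's built-in vectors;
# the columns come out as tolower.state.name. and state.abb
b <- data.frame(tolower(state.name), state.abb, stringsAsFactors = FALSE)

# Key columns have different names, so name them separately;
# all.x = TRUE keeps every zip row even if a state is missing from b
merged <- merge(a, b, by.x = "abv", by.y = "state.abb", all.x = TRUE)
merged
```

Every row of a appears once in merged, and both AL rows carry the same state name, so no row is dropped and no NA is introduced as long as each abbreviation exists in b.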

Interactive Map Drill-down ability in R

I have a dataframe like the one below:
State<-c("Alabama","Alabama","Alaska","Alaska")
StateCode<-c("AL","AL","AK","AK")
County<-c("AUTAUGA","BALDWIN","ANCHORAGE","BETHEL")
CountyCode<-c("AL001","AL003","AK020","AK050")
Murder<-c(5,6,7,8)
d<-data.frame(State,StateCode,County,CountyCode, Num=Murder)
State StateCode County CountyCode Num
1 Alabama AL AUTAUGA AL001 5
2 Alabama AL BALDWIN AL003 6
3 Alaska AK ANCHORAGE AK020 7
4 Alaska AK BETHEL AK050 8
I have been searching among R packages for a way to create a drill-down map from State to County level out of this, but I can't find a working example with code anywhere. Here is an example. Any feedback on this?

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge two datasets using two variables and keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind, and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that will happen, as when I do full_join in dplyr or all = TRUE in merge, is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
You can name the key columns explicitly with the by argument of merge:
merge(x, y, by=c("col1", "col2"))
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!
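The same diagnosis can be reproduced on a small scale. The sketch below assumes two tiny stand-in tables that reuse county names and figures from the output above, with trailing spaces added to one side on purpose. It uses base R only, with trimws() in place of the hand-written trim.trailing() helper.

```r
countyLicense <- data.frame(
  Primary_State = c("AK", "AK"),
  Primary_County = c("Aleutians West", "Anchorage"),
  n = c(1L, 961L),
  stringsAsFactors = FALSE
)
# Same counties, but with a trailing space in the name (added deliberately)
countyPops <- data.frame(
  Primary_County = c("Aleutians West ", "Anchorage "),
  Primary_State = c("AK", "AK"),
  Population = c(5647L, 298192L),
  stringsAsFactors = FALSE
)

keys <- c("Primary_State", "Primary_County")

# Before cleaning: no key matches, so every county appears twice
before <- merge(countyLicense, countyPops, by = keys, all = TRUE)

# Names present on one side only; a non-empty result hints at a mismatch
unmatched <- setdiff(countyLicense$Primary_County, countyPops$Primary_County)

# Strip leading and trailing whitespace, then merge again
countyPops$Primary_County <- trimws(countyPops$Primary_County)
after <- merge(countyLicense, countyPops, by = keys, all = TRUE)
after
```

Checking setdiff() on the key columns before merging is a quick way to spot this kind of problem, since the mismatched names look identical when printed.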
