How to merge two datasets by two common columns? [duplicate] - r

I have two datasets, shown below, that I am having difficulty merging.
I've already tried:
ndf <- merge(df1, df2, by=c("state", "year"))
but it ended up with a data frame with 200,000 observations. Here are my two example data sets; df1 is empty in the "income" and "local_income" columns:
df1
state year income local_income
CA 1992
CA 1993
CA 1994
CA 1995
CA 1996
NV 1992
NV 1993
NV 1994
NV 1995
NV 1996
CO 1992
CO 1993
CO 1994
CO 1995
CO 1996
df2
state year income local_income
CA 1992 1 1
NV 1992 4 3
CO 1992 3 2
Essentially what I want to do is merge these two datasets to look like this:
df3
state year income local_income
CA 1992 1 1
CA 1993
CA 1994
CA 1995
CA 1996
NV 1992 4 3
NV 1993
NV 1994
NV 1995
NV 1996
CO 1992 3 2
CO 1993
CO 1994
CO 1995
CO 1996
And then I'll eventually go on to merging for each year. But this is a good start to get me going. Any help will be greatly appreciated! This would otherwise take me 8+ hours to do with all the data I have, so I'm excited to see the power of R and its community!

You can also try the dplyr version.
library(dplyr)
df3 <- full_join(df1, df2, by=c("state", "year"))
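One caveat worth sketching: because df1 already carries empty income and local_income columns, joining on state and year alone would likely return suffixed duplicates (income.x, income.y) rather than the desired layout. The sketch below uses base R and toy data modelled on the question. It drops the empty columns from df1 first, then does a left join so every state/year row of df1 is kept. The duplicate-key check is an assumption about why the original merge grew to 200,000 rows; the question does not confirm the cause.

```r
# Toy data modelled on the question (values are illustrative).
df1 <- data.frame(
  state = rep(c("CA", "NV", "CO"), each = 5),
  year = rep(1992:1996, times = 3),
  income = NA,
  local_income = NA,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  state = c("CA", "NV", "CO"),
  year = 1992L,
  income = c(1, 4, 3),
  local_income = c(1, 3, 2),
  stringsAsFactors = FALSE
)

# Repeated state/year pairs multiply rows in a join, so count them first.
dup_keys <- sum(duplicated(df2[c("state", "year")]))

# Keep only the key columns of df1, then left-join df2 onto them.
df3 <- merge(df1[c("state", "year")], df2,
             by = c("state", "year"), all.x = TRUE)
df3 <- df3[order(df3$state, df3$year), ]
```

If dup_keys is greater than zero on the real data, the repeated keys would need to be aggregated or de-duplicated before merging.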

Related

How to Merge Uneven Data Frames With Real Data

Problem:
I have two data sets of different sizes that I would like to merge, without dropping rows or inserting NAs. To compare this to an Excel situation: you would have five columns, and you would drag down three of them to populate the blank space left by the rows inserted when adding your data to the 4th and 5th columns.
Example Data Set
In my later brainstorming code, a stands for the zipcode data and b for the step3 data.
> head(zipcode_joincsv)
zip city abv latitude longitude median mean pop
226 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
227 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
228 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
229 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
230 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
231 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
> head(step3_df)
tolower.state.name. state.abb
1 alabama AL
2 alaska AK
3 arizona AZ
4 arkansas AR
5 california CA
6 colorado CO
Desired Result:
One DF where each zipcode/city combination is combined with its state's pop and income. The column they have in common is the abbreviation column.
tolower.state.name. zip city abv latitude longitude median mean pop
1 alabama 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
2 alabama 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
3 alabama 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
4 alabama 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
5 alabama 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
6 alabama 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
7 alaska data from these rows
8 arizona data from these rows
9 arkansas data from these rows
10 california data from these rows
11 colorado data from these rows
I've contemplated using something like
sqldf ("SELECT a.Zip, a.City, a.State Abv, a.Lat, a.Long, a.median, a.mean, a.pop, b.state.name, b.states.abb, b.pop, b.income
FROM a a
LEFT JOIN b b using (abv)")
I know that is probably not going to work, if only because, even if it ran, every row without a matching row in the other set would get an NA. What I would like is that for every abv of NY, the state's average income and total population get copied down the line. Then the same for every AR and every AL, etc., until the two data sets are one and a ggplot using all of the data can be created.
dplyr::left_join(a, b, by="abv") should work.
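Note that in the sample data the key is called abv in one frame and state.abb in the other, so joining on "abv" alone presumably needs either a rename or an explicit mapping between the two names. A minimal base R sketch with made-up rows is below; with dplyr the equivalent mapping would be by = c("abv" = "state.abb"). Every zip row keeps its place and the state name is repeated down the matching rows.

```r
# Made-up rows that follow the column names shown in the question.
a <- data.frame(
  zip = c("01749", "01752", "99501"),
  city = c("Hudson", "Marlborough", "Anchorage"),
  abv = c("AL", "AL", "AK"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  tolower.state.name. = c("alabama", "alaska", "arizona"),
  state.abb = c("AL", "AK", "AZ"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of a, match a$abv against b$state.abb.
joined <- merge(a, b, by.x = "abv", by.y = "state.abb", all.x = TRUE)
```

States in b with no zip rows (arizona here) are dropped by the left join; all = TRUE would keep them with NA in the zip columns.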

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The value of the residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I would recommend is to take advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can rely on four dplyr functions: group_by(...), summarise(...), arrange(...) and bind_rows(...). There are also plenty of tutorials, cheat sheets, and documentation for all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bring it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of the UN values of the numbered sectors. We can achieve this with:
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total']) - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add sector column to res_UN with value 'Residual'
This yields a data frame which contains country, year, and UN. We now need to add a sector column with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns, so they can be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total']) - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

Combine rows with two matching columns in R [duplicate]

I have a df that resembles this:
Year Country Sales($M)
2013 Australia 120
2013 Australia 450
2013 Armenia 80
2013 Armenia 175
2013 Armenia 0
2014 Australia 500
2014 Australia 170
2014 Armenia 0
2014 Armenia 100
I'd like to combine the rows that match Year and Country, adding the Sales column. The result should be:
Year Country Sales($M)
2013 Australia 570
2013 Armenia 255
2014 Australia 670
2014 Armenia 100
I'm sure I could write a long loop to check whether Year and Country are the same and then add the Sales from those rows, but this is R so there must be a simple function that I'm totally missing.
Many thanks in advance.
library(tidyverse)
df %>%
group_by(Year,Country) %>%
summarise(Sales = sum(Sales))
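For comparison, the same grouping can be sketched in base R with aggregate(). The block rebuilds the example data from the question, assuming the sales column is simply named Sales (the question displays it as Sales($M)).

```r
# Example data from the question; the column name Sales is an assumption.
df <- data.frame(
  Year = c(2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014),
  Country = c("Australia", "Australia", "Armenia", "Armenia", "Armenia",
              "Australia", "Australia", "Armenia", "Armenia"),
  Sales = c(120, 450, 80, 175, 0, 500, 170, 0, 100),
  stringsAsFactors = FALSE
)

# Sum Sales within every Year/Country combination.
totals <- aggregate(Sales ~ Year + Country, data = df, FUN = sum)
```

The result has one row per Year/Country pair, matching the totals requested in the question (570, 255, 670 and 100), though the row order may differ from the order shown there.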

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge two datasets using two variables while keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind, and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that happens, for example when I use full_join in dplyr or all = TRUE in merge, is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
We can specify the key columns with the by argument of merge:
merge(x, y, by = c("col1", "col2"))
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!
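A quick way to spot this kind of mismatch before merging is to compare the key values of the two frames directly. The sketch below uses small made-up frames with the column names from the question (the Bethel population figure is illustrative only), lists the keys that fail to match, and then repeats the check after trimming with base R's trimws(). The helper name key is hypothetical.

```r
# Made-up frames; the pops county names carry trailing whitespace.
license <- data.frame(
  Primary_State = c("AK", "AK"),
  Primary_County = c("Anchorage", "Bethel"),
  n = c(961L, 1L),
  stringsAsFactors = FALSE
)
pops <- data.frame(
  Primary_State = c("AK", "AK"),
  Primary_County = c("Anchorage ", "Bethel  "),
  Population = c(298192L, 18000L),  # Bethel value is illustrative
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one comparable key string per row.
key <- function(d) paste(d$Primary_State, d$Primary_County, sep = "|")

# Keys present in license but absent from pops, before and after trimming.
unmatched_before <- setdiff(key(license), key(pops))
pops$Primary_County <- trimws(pops$Primary_County)
unmatched_after <- setdiff(key(license), key(pops))

merged <- merge(license, pops,
                by = c("Primary_State", "Primary_County"), all = TRUE)
```

When unmatched_before is non-empty and unmatched_after is empty, whitespace was the culprit and the full merge no longer duplicates rows.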

combining observations based on a criteria in R [duplicate]

I have a data set that looks like this:
geoid zip dealers Year County
1001 36703 1 2001 Autauga County, AL
1001 36704 3 2001 Autauga County, AL
1003 36535 7 2000 Baldwin County, AL
1003 36536 3 2000 Baldwin County, AL
And I want to take all the rows that are the same except for 'dealers' and 'zip' and combine them into one row with the dealer variable summed from all the similar rows. (I'm not sure what the easiest thing is to do with zip, either list them all or leave it out? Doesn't really matter.) So this would become:
geoid dealers Year County
1001 4 2001 Autauga County, AL
1003 10 2000 Baldwin County, AL
Is there any way to create a new dataset like this? (Incidentally, I got here by merging three datasets, so if there's a better way to merge without creating these duplicates, that would be helpful as well.)
This gives you the desired result:
df <- read.table(header=TRUE, text=
'geoid zip dealers Year County
1001 36703 1 2001 "Autauga County, AL"
1001 36704 3 2001 "Autauga County, AL"
1003 36535 7 2000 "Baldwin County, AL"
1003 36536 3 2000 "Baldwin County, AL"')
aggregate(dealers ~ geoid+Year+County, data=df[-2], FUN=sum) # or
aggregate(dealers ~ geoid+Year+County, data=df, FUN=sum)
