Summing data frames with different length - r

I have two data sets (one for each country) that look like this:
dfGermany
Country Sales Year Code
Germany 2000 2000 221
Germany 1500 2001 150
Germany 2150 2002 270
dfJapan
Country Sales Year Code
Japan 500 2000 221
Japan 750 2001 221
Japan 800 2001 270
Japan 1000 2002 270
Code here is the "name" of the product. What I want to do is take half of the Japanese sales and add them to the df for Germany wherever both the code and the year match.
So, for instance, half of the sales values for products 221 and 270 in dfJapan (250 € and 500 €) should be added to dfGermany for years 2000 and 2002. But nothing should happen to the value for 2001, since no row in dfJapan matches both the year and the code.
I tried merge, but could not get it to work, since the data sets are of different sizes and I want to match on both year and code.

We can do a join on 'Year' and 'Code' and then update the 'Sales' column of 'dfGermany' by reference:
library(data.table)
setDT(dfGermany)[dfJapan, Sales := Sales + i.Sales/2, on = .(Year, Code)]
dfGermany
# Country Sales Year Code
#1: Germany 2250 2000 221
#2: Germany 1500 2001 150
#3: Germany 2650 2002 270
data
dfGermany <- structure(list(Country = c("Germany", "Germany", "Germany"),
Sales = c(2000, 1500, 2150), Year = 2000:2002, Code = c(221L,
150L, 270L)), row.names = c(NA, -3L), class = "data.frame")
dfJapan <- structure(list(Country = c("Japan", "Japan", "Japan", "Japan"
), Sales = c(500L, 750L, 800L, 1000L), Year = c(2000L, 2001L,
2001L, 2002L), Code = c(221L, 221L, 270L, 270L)),
class = "data.frame", row.names = c(NA, -4L))
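Since the question mentions that merge did not work, here is a minimal base R sketch of the same idea, assuming each (Year, Code) pair occurs at most once in dfJapan. The helper names japan_half, HalfJapan, merged and result are made up for illustration; the data is rebuilt inline so the snippet runs on its own.

```r
# Sample data, rebuilt from the question
dfGermany <- data.frame(Country = "Germany", Sales = c(2000, 1500, 2150),
                        Year = 2000:2002, Code = c(221L, 150L, 270L))
dfJapan <- data.frame(Country = "Japan", Sales = c(500, 750, 800, 1000),
                      Year = c(2000L, 2001L, 2001L, 2002L),
                      Code = c(221L, 221L, 270L, 270L))

# Keep only the join keys plus half of the Japanese sales
# (HalfJapan is a hypothetical column name)
japan_half <- data.frame(Year = dfJapan$Year, Code = dfJapan$Code,
                         HalfJapan = dfJapan$Sales / 2)

# all.x = TRUE keeps every German row; unmatched rows get NA
merged <- merge(dfGermany, japan_half, by = c("Year", "Code"), all.x = TRUE)
merged$HalfJapan[is.na(merged$HalfJapan)] <- 0
merged$Sales <- merged$Sales + merged$HalfJapan

# Restore the original column order and drop the helper column
result <- merged[order(merged$Year), c("Country", "Sales", "Year", "Code")]
result
```

The different number of rows is not a problem for merge as long as both key columns are passed to by; all.x = TRUE is what keeps the German rows that have no Japanese counterpart.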

Using dplyr and @akrun's provided data:
library(dplyr)
dfGermany <- dfGermany %>%
left_join(dfJapan %>%
select(Year, Code, sales_japan = Sales),
by = c('Year', 'Code')) %>%
mutate(Sales = Sales + coalesce(sales_japan / 2, 0)) %>%
select(-sales_japan)
dfGermany
Country Sales Year Code
1 Germany 2250 2000 221
2 Germany 1500 2001 150
3 Germany 2650 2002 270

Related

for-while ifelse loop? (R-Programming)

To be honest, I am completely stuck, I'm not quite sure how to phrase the title either.
I have two datasets; let's say they look something like this:
Dataset1 (i.e. GDP-related):
Year Country
2000 Austria
2001 Austria
2000 Belgium
2001 Belgium
Dataset2 (tax-related):
Year Austria Belgium
2000 55 48
2001 51 45
So what I would like is to write some sort of function/loop that essentially says:
if a value of the country variable in dataset1 is also a column name in dataset2, use those observations.
Then, conditional on the year and country, I want to create a new variable in dataset1 called tax that holds the country's tax rate from dataset2.
So, for instance, Austria (an observation in dataset1) is also the name of a variable in dataset2, so I want to take its tax rate from dataset2 and apply 55 for year 2000 and 51 for 2001 in dataset1. And this should go on for all countries and years.
And should thus look like
Dataset1 (i.e. GDP-related):
Year Country Tax
2000 Austria 55
2001 Austria 51
2000 Belgium 48
2001 Belgium 45
My dataset is quite big, so it is much preferred if I have some sort of algorithm for this
Thanks!
Assuming the first data have more columns, then after reshaping the second data to long with pivot_longer, do a join with the first data (left_join) which matches the 'Year', 'Country'
library(dplyr)
library(tidyr)
df2 %>%
pivot_longer(cols = -Year, names_to = 'Country', values_to = 'Tax') %>%
left_join(df1, ., by = c('Year', 'Country'))
-output
Year Country Tax
1 2000 Austria 55
2 2001 Austria 51
3 2000 Belgium 48
4 2001 Belgium 45
data
df1 <- structure(list(Year = c(2000L, 2001L, 2000L, 2001L), Country = c("Austria",
"Austria", "Belgium", "Belgium")), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(Year = 2000:2001, Austria = c(55L, 51L), Belgium = c(48L,
45L)), class = "data.frame", row.names = c(NA, -2L))
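For comparison, a base R sketch of the same lookup without reshaping: it indexes the wide table by (row, column) position. This assumes every column of df2 is numeric, so that as.matrix() does not coerce the values to character; row_idx and col_idx are illustrative names.

```r
# Sample data, rebuilt from the question
df1 <- data.frame(Year = c(2000L, 2001L, 2000L, 2001L),
                  Country = c("Austria", "Austria", "Belgium", "Belgium"))
df2 <- data.frame(Year = 2000:2001, Austria = c(55L, 51L),
                  Belgium = c(48L, 45L))

# Row of df2 that holds each year, column of df2 that holds each country
row_idx <- match(df1$Year, df2$Year)
col_idx <- match(df1$Country, names(df2))

# A two-column index matrix picks one cell per row of df1;
# a country or year missing from df2 yields NA
df1$Tax <- as.matrix(df2)[cbind(row_idx, col_idx)]
df1
```

Countries in df1 that have no column in df2 simply end up with NA, which mirrors what the left join does.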
This should also work:
library(dplyr)
library(tidyr)
df2 %>%
# pivot_longer(-Year) %>% first solution
pivot_longer(cols = -Year, names_to = 'Country', values_to = 'Tax') %>% # taken from @akrun
arrange(Country)
Year Country Tax
<int> <chr> <int>
1 2000 Austria 55
2 2001 Austria 51
3 2000 Belgium 48
4 2001 Belgium 45

How to summarize two different rows with different values to a single row with that sum using dplyr?

I have the following data frame but in a bigger scale of course:
country year strain num_cases
mex 1996 sp_m014 412
mex 1996 sp_f014 214
mex 1998 sp_m014 150
mex 1998 sp_f014 200
usa 1996 sp_m014 200
usa 1996 sp_f014 180
usa 1997 sp_m014 190
usa 1997 sp_f014 150
I want to get the following result, that is the sum of sp_m014 (male) and sp_f014 (female) for mex and usa individually:
country year strain num_cases
mex 1996 sp 626
mex 1998 sp 350
usa 1996 sp 380
usa 1997 sp 340
In my real data frame I have a lot more age ranges, here I only show the 014 for males and females. But I want to summarize them that way for every age range and gender.
Thanks!
Grouped by 'country', 'year' summarise to update the 'strain' as 'sp' and get the sum of 'num_cases'
library(dplyr)
df1 %>%
group_by(country, year) %>%
summarise(strain = 'sp', num_cases = sum(num_cases), .groups = 'drop')
-output
# A tibble: 4 x 4
# country year strain num_cases
#* <chr> <int> <chr> <int>
#1 mex 1996 sp 626
#2 mex 1998 sp 350
#3 usa 1996 sp 380
#4 usa 1997 sp 340
data
df1 <- structure(list(country = c("mex", "mex", "mex", "mex", "usa",
"usa", "usa", "usa"), year = c(1996L, 1996L, 1998L, 1998L, 1996L,
1996L, 1997L, 1997L), strain = c("sp_m014", "sp_f014", "sp_m014",
"sp_f014", "sp_m014", "sp_f014", "sp_m014", "sp_f014"), num_cases = c(412L,
214L, 150L, 200L, 200L, 180L, 190L, 150L)),
class = "data.frame", row.names = c(NA,
-8L))
Here's an approach with tidyr::extract:
library(tidyr);library(dplyr)
df1 %>%
extract(strain, into = c("strain","sex","age"), "(\\w+)_([mf])(.*)") %>%
group_by(country,year,strain) %>%
summarise(across(num_cases,sum))
# A tibble: 4 x 4
# Groups: country, year [4]
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340
Now that you have the strains fully parsed you can easily group by sex or age. Thanks to @akrun for the data.
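To illustrate the case with several age ranges, here is a small base R sketch that sums males and females within each age range. The "1524" rows and their counts are invented for illustration and are not part of the original data; strain_age and totals are hypothetical names.

```r
# Hypothetical extension of the sample data: the "1524" rows and their
# counts are made up for illustration only
df1 <- data.frame(
  country   = rep("mex", 4),
  year      = rep(1996L, 4),
  strain    = c("sp_m014", "sp_f014", "sp_m1524", "sp_f1524"),
  num_cases = c(412L, 214L, 100L, 50L)
)

# Drop the sex letter so both sexes share one label: "sp_m014" -> "sp_014"
df1$strain_age <- sub("_[mf]([0-9]+)$", "_\\1", df1$strain)

# Sum the cases per country, year and sex-free strain label
totals <- aggregate(num_cases ~ country + year + strain_age,
                    data = df1, FUN = sum)
totals
```

The regular expression assumes the strain codes always follow the pattern prefix, underscore, one sex letter (m or f), then digits.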
Update:
To use the age range, you can use parse_number from readr:
library(dplyr)
library(readr)
df1 %>%
mutate(age_range=parse_number(strain)) %>%
group_by(country, year, age_range) %>%
summarise(num_cases=sum(num_cases))
Output:
country year age_range num_cases
<chr> <int> <dbl> <int>
1 mex 1996 14 626
2 mex 1998 14 350
3 usa 1996 14 380
4 usa 1997 14 340
First answer:
Thanks to akrun for the data:
library(tidyverse)
df1 %>%
# shorten the strain first, then group on the shortened value
mutate(strain=str_extract(strain, "^.{2}")) %>%
group_by(country, year, strain) %>%
summarise(num_cases=sum(num_cases))
Output:
country year strain num_cases
<chr> <int> <chr> <int>
1 mex 1996 sp 626
2 mex 1998 sp 350
3 usa 1996 sp 380
4 usa 1997 sp 340

Calculating sums of observation in time intervals in a df [duplicate]

I've posted this as another question, but realised I've got my sample data wrong.
I've got two separate datasets. df1 looks like this:
loc_ID year observations
nin212 2002 90
nin212 2003 98
nin212 2004 102
cha670 2001 18
cha670 2002 19
cha670 2003 21
df2 looks like this:
loc_ID start_year end_year
nin212 2002 2003
nin212 2003 2004
cha670 2001 2002
cha670 2002 2003
I want to calculate the number of observations in the time intervals (start_year to end_year) per loc_ID. In the example above, I would like to achieve this final dataset:
loc_ID start_year end_year observations
nin212 2002 2003 188
nin212 2003 2004 200
cha670 2001 2002 37
cha670 2002 2003 40
How could I do this?
We can do a non-equi join
library(data.table)
setDT(df2)[, observations := setDT(df1)[df2, sum(observations),
on = .(loc_ID, year >= start_year, year <= end_year),
by = .EACHI]$V1]
-output
df2
# loc_ID start_year end_year observations
#1: nin212 2002 2003 188
#2: nin212 2003 2004 200
#3: cha670 2001 2002 37
#4: cha670 2002 2003 40
data
df1 <- structure(list(loc_ID = c("nin212", "nin212", "nin212", "cha670",
"cha670", "cha670"), year = c(2002L, 2003L, 2004L, 2001L, 2002L,
2003L), observations = c(90L, 98L, 102L, 18L, 19L, 21L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(loc_ID = c("nin212", "nin212", "cha670", "cha670"
), start_year = c(2002L, 2003L, 2001L, 2002L), end_year = c(2003L,
2004L, 2002L, 2003L)), class = "data.frame", row.names = c(NA,
-4L))
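The same interval sum can be sketched in base R by looping over the rows of df2. This is only an illustration that assumes both year bounds are inclusive, as in the expected output; for large data the non-equi join is likely to be faster.

```r
# Sample data, rebuilt from the question
df1 <- data.frame(
  loc_ID = c("nin212", "nin212", "nin212", "cha670", "cha670", "cha670"),
  year = c(2002L, 2003L, 2004L, 2001L, 2002L, 2003L),
  observations = c(90L, 98L, 102L, 18L, 19L, 21L)
)
df2 <- data.frame(
  loc_ID = c("nin212", "nin212", "cha670", "cha670"),
  start_year = c(2002L, 2003L, 2001L, 2002L),
  end_year = c(2003L, 2004L, 2002L, 2003L)
)

# For each interval, sum the observations at the same location whose
# year lies between start_year and end_year (both inclusive)
df2$observations <- mapply(
  function(id, from, to) {
    sum(df1$observations[df1$loc_ID == id & df1$year >= from & df1$year <= to])
  },
  df2$loc_ID, df2$start_year, df2$end_year,
  USE.NAMES = FALSE
)
df2
```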

Transform Year-to-date to Quarterly data with data.table

Quarterly data from a data provider has the issue that for some variables the quarterly data values are actually Year-to-date figures. That means the values are the sum of all previous quarters (Q2 = Q1 + Q2 , Q3 = Q1 + Q2 + Q3, ...).
The structure of the original data looks like the following:
library(data.table)
library(plyr)
dt.quarter.test <- structure(list(Year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 2001L)
, Quarter = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L)
, Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501))
, .Names = c("Year", "Quarter", "Data.Year.to.Date"), class = c("data.table", "data.frame"), row.names = c(NA, -8L))
In order to calculate the quarterly values I therefore need to subtract the previous Quarter from Q2, Q3 and Q4.
I've managed to get the desired results by using the ddply function from the plyr package.
dt.quarter.result <- ddply(dt.quarter.test, "Year"
, transform
, Data.Quarterly = Data.Year.to.Date - shift(Data.Year.to.Date, n = 1L, type = "lag", fill = 0))
dt.quarter.result
Year Quarter Data.Year.to.Date Data.Quarterly
1 2000 1 162 162
2 2000 2 405 243
3 2000 3 610 205
4 2000 4 938 328
5 2001 1 331 331
6 2001 2 1467 1136
7 2001 3 1981 514
8 2001 4 2501 520
But I am not really happy with the command, since it seems quite clumsy and I would like to get some input on how to improve it and especially do it directly within the data.table.
Here is the data.table syntax; you might also find the data.table cheat sheet helpful:
library(data.table)
dt.quarter.test[, Data.Quarterly := Data.Year.to.Date - shift(Data.Year.to.Date, fill = 0), Year][]
# Year Quarter Data.Year.to.Date Data.Quarterly
# 1: 2000 1 162 162
# 2: 2000 2 405 243
# 3: 2000 3 610 205
# 4: 2000 4 938 328
# 5: 2001 1 331 331
# 6: 2001 2 1467 1136
# 7: 2001 3 1981 514
# 8: 2001 4 2501 520
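As a cross-check, the same de-cumulation can be sketched in base R with ave() and diff(). This assumes the rows are already ordered by quarter within each year; the object name ytd is illustrative.

```r
# Sample data, rebuilt from the question as a plain data frame
ytd <- data.frame(
  Year = rep(2000:2001, each = 4),
  Quarter = rep(1:4, times = 2),
  Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501)
)

# Within each year, prepend 0 and take successive differences, so that
# Q1 keeps its value and Q2..Q4 become year-to-date minus previous quarter
ytd$Data.Quarterly <- ave(ytd$Data.Year.to.Date, ytd$Year,
                          FUN = function(x) diff(c(0, x)))

# Sanity check: cumulating the quarterly values per year should
# reproduce the year-to-date column
rebuilt <- ave(ytd$Data.Quarterly, ytd$Year, FUN = cumsum)
ytd
```

Comparing rebuilt with Data.Year.to.Date is a cheap way to confirm that the subtraction was applied within years and not across the year boundary.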

get the mean of a variable subset of data in R [duplicate]

Imagine I have the following data:
Year Month State ppo
2011 Jan CA 220
2011 Feb CA 250
2012 Jan CA 230
2011 Jan WA 200
2011 Feb WA 210
I need to calculate the mean for each state for the year, so the output would look something like this:
Year Month State ppo annualAvg
2011 Jan CA 220 235
2011 Feb CA 250 235
2012 Jan CA 230 230
2011 Jan WA 200 205
2011 Feb WA 210 205
where the annual average is the mean of any entries for that state in the same year. If the year and state were constant I would know how to do this, but somehow the fact that they are variable is throwing me off.
Looking around, it seems like maybe ddply is what I want to be using for this (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r), but when I tried to use it I was doing something wrong and kept getting errors (I have tried so many variations of it that I won't bother to post them all here). Any idea how I am actually supposed to be doing this?
Thanks for the help!
Try this:
library(data.table)
setDT(df)
df[ , annualAvg := mean(ppo) , by =.(Year, State) ]
Base R: df$ppoAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)
Using dplyr with group_by %>% mutate to add a column:
library(dplyr)
df %>% group_by(Year, State) %>% mutate(annualAvg = mean(ppo))
#Source: local data frame [5 x 5]
#Groups: Year, State [3]
# Year Month State ppo annualAvg
# (int) (fctr) (fctr) (int) (dbl)
#1 2011 Jan CA 220 235
#2 2011 Feb CA 250 235
#3 2012 Jan CA 230 230
#4 2011 Jan WA 200 205
#5 2011 Feb WA 210 205
Using data.table:
library(data.table)
setDT(df)[, annualAvg := mean(ppo), .(Year, State)]
df
# Year Month State ppo annualAvg
#1: 2011 Jan CA 220 235
#2: 2011 Feb CA 250 235
#3: 2012 Jan CA 230 230
#4: 2011 Jan WA 200 205
#5: 2011 Feb WA 210 205
Data:
df <- structure(list(Year = c(2011L, 2011L, 2012L, 2011L, 2011L), Month = structure(c(2L,
1L, 2L, 2L, 1L), .Label = c("Feb", "Jan"), class = "factor"),
State = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("CA",
"WA"), class = "factor"), ppo = c(220L, 250L, 230L, 200L,
210L), annualAvg = c(235, 235, 230, 205, 205)), .Names = c("Year",
"Month", "State", "ppo", "annualAvg"), class = c("data.table",
"data.frame"), row.names = c(NA, -5L))
# the .internal.selfref pointer printed by dput() is not valid R and is omitted

Resources