How to de-cumulate a variable in dplyr?

I have a panel of quarterly individual-level data that is cumulative within each year: the 1st-quarter value covers the 1st quarter only, the 2nd-quarter value is the sum of the 1st and 2nd quarters, the 3rd-quarter value is the sum of the first three quarters, and the 4th-quarter value is the annual total. How can I easily de-cumulate these values in dplyr, grouping by id and year?

Assuming we have two years, with sales of 2 per quarter in year one and 3 per quarter in year two, the original data is:
df = data.frame(quarter = c("Q1","Q2","Q3","Q4","Q1","Q2","Q3","Q4"), year=c(rep(2017,4),rep(2018,4)), cum_tot= c(2,4,6,8,3,6,9,12))
quarter year cum_tot
1 Q1 2017 2
2 Q2 2017 4
3 Q3 2017 6
4 Q4 2017 8
5 Q1 2018 3
6 Q2 2018 6
7 Q3 2018 9
8 Q4 2018 12
Then we can get the sales per quarter as:
library(dplyr)
df %>% group_by(year) %>% mutate(original = c(cum_tot[1], diff(cum_tot)))
Or, as per GGamba's comment below:
df %>% group_by(year) %>% mutate(original = cum_tot - lag(cum_tot, default = 0))
They both result in:
quarter year cum_tot original
1 Q1 2017 2 2
2 Q2 2017 4 2
3 Q3 2017 6 2
4 Q4 2017 8 2
5 Q1 2018 3 3
6 Q2 2018 6 3
7 Q3 2018 9 3
8 Q4 2018 12 3
Hope this helps!
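The question asks for grouping by id and year, while the example above has no id column. Here is a minimal sketch of the same lag() approach on a panel, assuming a hypothetical id column and made-up values (the panel data frame is illustrative, not from the question):

```r
library(dplyr)

# Hypothetical panel: two ids, two years, totals cumulative within each year
panel <- data.frame(
  id      = rep(c("a", "b"), each = 8),
  year    = rep(rep(c(2017, 2018), each = 4), times = 2),
  quarter = rep(paste0("Q", 1:4), times = 4),
  cum_tot = c(2, 4, 6, 8, 3, 6, 9, 12,
              1, 3, 6, 10, 5, 5, 7, 7)
)

decumulated <- panel %>%
  arrange(id, year, quarter) %>%   # lag() depends on row order
  group_by(id, year) %>%           # restart the difference for each id and year
  mutate(original = cum_tot - lag(cum_tot, default = 0)) %>%
  ungroup()
```

Sorting before the grouped mutate matters because lag() looks at the previous row, whatever it happens to be.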

Related

Ranking of values in one quarter [duplicate]

I am trying to implement a calculation that will rank the Price values in a separate partition. Below you can see my data
df<-data.frame( year=c(2010,2010,2010,2010,2010,2010),
quarter=c("q1","q1","q1","q2","q2","q2"),
Price=c(10,20,30,10,20,30)
)
df
Now I want to rank Price within each quarter, so that the smallest Price gets 1 and the highest gets 3:
df %>% group_by(quarter) %>% mutate(id = row_number(Price))
Instead of the expected result, the ranking runs across both quarters rather than restarting within each quarter.
Can anybody help me solve this so that the ranks are computed separately for each quarter?
You probably want rank.
transform(df, id=ave(Price, year, quarter, FUN=rank))
# year quarter Price id
# 1 2010 q1 10 1
# 2 2010 q1 20 2
# 3 2010 q1 30 3
# 4 2010 q2 10 1
# 5 2010 q2 20 2
# 6 2010 q2 30 3
With dplyr, use dense_rank:
library(dplyr)
df %>%
  group_by(quarter) %>%
  mutate(id = dense_rank(Price)) %>%
  ungroup()
# A tibble: 6 × 4
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
In dplyr 1.1.0 and later, you can also use .by in mutate:
df %>%
mutate(id = dense_rank(Price), .by = 'quarter')
year quarter Price id
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
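Alternatively, with row_number(), which numbers rows in their current order and so matches the rank here only because Price is already sorted within each quarter: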
library(tidyverse)
df %>% group_by(year, quarter) %>% mutate(id=row_number())
Created on 2023-02-12 with reprex v2.0.2
# A tibble: 6 × 4
# Groups: year, quarter [2]
year quarter Price id
<dbl> <chr> <dbl> <int>
1 2010 q1 10 1
2 2010 q1 20 2
3 2010 q1 30 3
4 2010 q2 10 1
5 2010 q2 20 2
6 2010 q2 30 3
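The answers above give identical results because the example has no tied prices. Here is a small sketch of how row_number(), min_rank() and dense_rank() differ when ties are present, using made-up prices (the prices data frame is hypothetical):

```r
library(dplyr)

# Hypothetical data with a tie in q1 and unsorted prices in q2
prices <- data.frame(
  quarter = c("q1", "q1", "q1", "q2", "q2", "q2"),
  Price   = c(10, 10, 30, 20, 10, 30)
)

ranked <- prices %>%
  group_by(quarter) %>%
  mutate(
    rn = row_number(Price),  # ties broken by order of appearance: 1, 2, 3
    mr = min_rank(Price),    # ties share the lowest rank, gaps follow: 1, 1, 3
    dr = dense_rank(Price)   # ties share a rank, no gaps: 1, 1, 2
  ) %>%
  ungroup()
```

Which one fits depends on whether tied prices should share an id and whether gaps in the numbering are acceptable.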

Create incremental column year based on id and year column in R

I have the data frame below and I want to create the 'create_col' column, presumably with some kind of seq() function, using the 'year' column as the start of the sequence. How could I do that?
id <- c(1,1,2,3,3,3,4)
year <- c(2013, 2013, 2015,2017,2017,2017,2011)
create_col <- c(2013,2014,2015,2017,2018,2019,2011)
Ideal result:
id year create_col
1 1 2013 2013
2 1 2013 2014
3 2 2015 2015
4 3 2017 2017
5 3 2017 2018
6 3 2017 2019
7 4 2011 2011
You can add row_number() to the minimum year in each id:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(create_col = min(year) + row_number() - 1)
# id year create_col
# <dbl> <dbl> <dbl>
#1 1 2013 2013
#2 1 2013 2014
#3 2 2015 2015
#4 3 2017 2017
#5 3 2017 2018
#6 3 2017 2019
#7 4 2011 2011
data
df <- data.frame(id, year)
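For comparison, the same idea can be sketched in base R with ave(), which applies a function within each id and returns a vector aligned with the original rows. This is an alternative to the dplyr answer above, not part of it:

```r
# Same data as in the question
id   <- c(1, 1, 2, 3, 3, 3, 4)
year <- c(2013, 2013, 2015, 2017, 2017, 2017, 2011)
df   <- data.frame(id, year)

# Minimum year per id, plus a 0-based counter within each id
start_year <- ave(df$year, df$id, FUN = min)
position   <- ave(df$year, df$id, FUN = seq_along)
df$create_col <- start_year + position - 1
```

The counter follows the current row order within each id, just as row_number() does in the dplyr version.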

Calculate average of values in R and add result as new rows instead of as a new column

I have a dataframe like the following one:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
and I want to calculate the average value by day for the three year period (2014, 2015, 2016). The following code works for this purpose:
data %>%
group_by(day) %>%
mutate(MEAN = mean(value))
and produces this output:
day year value MEAN
1 2014 5 7
1 2015 16 7
1 2016 0 7
2 2014 3 2.67
2 2015 1 2.67
2 2016 4 2.67
but I want to add the average values as new rows in the same dataframe as follows:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
1 avg 7 <--
2 avg 2.67 <--
Any suggestions about how can I possibly do this? Thanks!
We can use summarise (instead of mutate, which adds a new column to the original dataset) to calculate the mean, and then bind the result to the original data with bind_rows. bind_rows will not combine a numeric year column with a character one, so convert year to character before binding.
library(dplyr)
data %>%
  group_by(day) %>%
  summarise(year = 'avg', value = mean(value)) %>%
  bind_rows(data %>%
              mutate(year = as.character(year)), .)
# day year value
#1 1 2014 5.00
#2 1 2015 16.00
#3 1 2016 0.00
#4 2 2014 3.00
#5 2 2015 1.00
#6 2 2016 4.00
#7 1 avg 7.00
#8 2 avg 2.67
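The same summarise and bind_rows steps can be written as a self-contained sketch with the data from the question defined explicitly and the intermediate result named (the name averages is only for illustration):

```r
library(dplyr)

# Data from the question
data <- data.frame(
  day   = c(1, 1, 1, 2, 2, 2),
  year  = rep(2014:2016, times = 2),
  value = c(5, 16, 0, 3, 1, 4)
)

# One summary row per day, labelled "avg" in the year column
averages <- data %>%
  group_by(day) %>%
  summarise(year = "avg", value = mean(value))

# year must be character in both pieces before they can be stacked
result <- bind_rows(mutate(data, year = as.character(year)), averages)
```

Naming the intermediate step avoids the `.` placeholder, which only works with the magrittr pipe.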
Another option is to split by 'day' and then use add_row (from tibble) to append a new row to each list element:
library(tibble)
library(purrr)
data %>%
  mutate(year = as.character(year)) %>%
  group_split(day) %>%
  map_dfr(~ .x %>% add_row(day = first(.$day),
                           year = 'avg', value = mean(.$value)))
Here is a base R option using aggregate (here the data frame is named df):
rbind(df,cbind(aggregate(value~day,df,mean),year = "avg")[c(1,3,2)])
or a variation (by @thelatemail from comments)
rbind(df, aggregate(df["value"], cbind(df["day"], year="avg"), FUN=mean))
which gives
day year value
1 1 2014 5.000000
2 1 2015 16.000000
3 1 2016 0.000000
4 2 2014 3.000000
5 2 2015 1.000000
6 2 2016 4.000000
7 1 avg 7.000000
8 2 avg 2.666667

Group By and summaries with condition

I have data frame df. After group_by(id, Year, Month, new_used_ind) and summarise(n = n()) it looks like:
id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2
I want the total of n for each id, Year and Month, and also the total for 'N' in new_used_ind as a new column.
Something like this:
id Year Month Total_New total
1 2001 apr 3 5
2 2002 mar 5 8
4 2004 july 4 6
library(dplyr)
read.table(text= "id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2", header = T) -> df
df %>%
  group_by(id, Year, Month) %>%
  mutate(total_New=sum(n*(new_used_ind=="N"))) %>%
  mutate(total_n=sum(n)) %>%
  summarise_at(c("total_New", "total_n"), mean)
#> # A tibble: 4 x 5
#> # Groups: id, Year [4]
#> id Year Month total_New total_n
#> <int> <int> <fct> <dbl> <dbl>
#> 1 1 2001 apr 3 5
#> 2 2 2002 mar 5 5
#> 3 3 2003 mar 0 3
#> 4 4 2004 july 4 6
Created on 2019-06-11 by the reprex package (v0.3.0)
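The two mutate calls followed by summarise_at can also be collapsed into a single summarise. Here is a sketch of that variant, assuming dplyr 1.0.0 or later for the .groups argument. It returns integer totals rather than doubles:

```r
library(dplyr)

df <- read.table(text = "id Year Month new_used_ind n
1 2001 apr N 3
1 2001 apr U 2
2 2002 mar N 5
3 2003 mar U 3
4 2004 july N 4
4 2004 july U 2", header = TRUE)

totals <- df %>%
  group_by(id, Year, Month) %>%
  summarise(
    total_New = sum(n[new_used_ind == "N"]),  # only the "N" rows; 0 if there are none
    total_n   = sum(n),                       # all rows in the group
    .groups   = "drop"
  )
```

Groups without an "N" row, such as id 3, get a total_New of 0 because the sum of an empty vector is 0.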

How to get every quarter of a date interval in R? [duplicate]

I have a data frame with many products, their prices, and the startdate/enddate between which each product was online.
product startdate enddate price
1 2012-03-17 2016-09-08 10
2 2014-05-16 2015-06-29 8
3 2015-07-01 2016-04-02 9
What I want to have is to get every quarter and year of the time the product has been online. For example for product 3: Q3 15, Q4 15, Q1 16, Q2 16.
I already transformed it into interval class via:
library(lubridate)
interval <- interval(startdate,enddate)
interval
I searched for a way to get the quarters out of that interval but couldn't find a solution.
My overall goal is to calculate the mean of the prices of every product online for every quarter.
Any help would be appreciated. Thank you!
If df is your data frame, the following generates the sequence of all months from startdate to enddate, keeps the unique combinations of product and quarter, and calculates the average price.
library(lubridate)
library(dplyr)
df <- df %>%
  mutate(startdate = ymd(startdate),
         enddate = ymd(enddate))
# One monthly sequence per row; SIMPLIFY = FALSE keeps a list of Date
# vectors even when every sequence has the same length
df$output <- mapply(function(x, y) seq(x, y, by = "month"),
                    df$startdate,
                    df$enddate,
                    SIMPLIFY = FALSE)
df %>%
  tidyr::unnest(output) %>%
  mutate(quarter = paste0("Q", quarter(output), " ", year(output))) %>%
  select(-output) %>%
  group_by(product, startdate, enddate, quarter) %>%
  filter(row_number(quarter) == 1) %>%
  summarise(mean(price))
Result for the first row of your data frame would be:
product startdate enddate quarter `mean(price)`
<int> <date> <date> <chr> <dbl>
1 1 2012-03-17 2016-09-08 Q1 2012 10
2 1 2012-03-17 2016-09-08 Q1 2013 10
3 1 2012-03-17 2016-09-08 Q1 2014 10
4 1 2012-03-17 2016-09-08 Q1 2015 10
5 1 2012-03-17 2016-09-08 Q1 2016 10
6 1 2012-03-17 2016-09-08 Q2 2012 10
7 1 2012-03-17 2016-09-08 Q2 2013 10
8 1 2012-03-17 2016-09-08 Q2 2014 10
9 1 2012-03-17 2016-09-08 Q2 2015 10
10 1 2012-03-17 2016-09-08 Q2 2016 10
11 1 2012-03-17 2016-09-08 Q3 2012 10
12 1 2012-03-17 2016-09-08 Q3 2013 10
13 1 2012-03-17 2016-09-08 Q3 2014 10
14 1 2012-03-17 2016-09-08 Q3 2015 10
15 1 2012-03-17 2016-09-08 Q3 2016 10
16 1 2012-03-17 2016-09-08 Q4 2012 10
17 1 2012-03-17 2016-09-08 Q4 2013 10
18 1 2012-03-17 2016-09-08 Q4 2014 10
19 1 2012-03-17 2016-09-08 Q4 2015 10
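A monthly sequence that starts on the start date can miss the last quarter when the end date falls early in a new quarter (for example 2015-03-31 to 2015-04-02). One possible way around this is to step quarter by quarter from the first day of the starting quarter. The helper name quarters_between below is hypothetical, not a lubridate function:

```r
library(lubridate)

# Hypothetical helper: labels of every quarter touched by [start, end]
quarters_between <- function(start, end) {
  # Snap both dates to the first day of their quarter, then step by quarter
  q_starts <- seq(floor_date(start, "quarter"),
                  floor_date(end, "quarter"),
                  by = "quarter")
  paste0("Q", quarter(q_starts), " ", year(q_starts))
}

# Product 3 from the question
quarters_between(ymd("2015-07-01"), ymd("2016-04-02"))
# "Q3 2015" "Q4 2015" "Q1 2016" "Q2 2016"
```

This could replace the monthly seq() inside the mapply call above, after which the unnest and group_by steps would no longer need to drop duplicate quarters.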
