I need to convert my data, which is on a quarterly basis, to monthly data by dividing one variable (amount) by 3.
Example dataset:
df <- data.frame(Year = c(2018,2019,2020), qtr = c(1,3,2),
amount = c(3,6,12), variable = c(5,6,7))
df
What I would need is to get months for every quarter, i.e. the final dataset would look like this:
data.frame(Year = c(2018,2018,2018,2019,2019,2019,2020,2020,2020),
qtr = c(1,2,3,7,8,9,4,5,6),
amount = c(1,1,1,2,2,2,4,4,4),
variable = c(5,5,5,6,6,6,7,7,7))
Also, a bonus question: how do I print the data frames in this environment?
Does this work:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
  # first and last month of each quarter
  mutate(qtr_start_mth = case_when(qtr == 1 ~ 1,
                                   qtr == 2 ~ 4,
                                   qtr == 3 ~ 7,
                                   qtr == 4 ~ 10),
         qtr_end_mth = case_when(qtr == 1 ~ 3,
                                 qtr == 2 ~ 6,
                                 qtr == 3 ~ 9,
                                 qtr == 4 ~ 12)) %>%
  # list-column holding the three months of the quarter, then one row per month
  mutate(month = map2(qtr_start_mth, qtr_end_mth, `:`)) %>%
  unnest(month) %>%
  mutate(amount = amount / 3) %>%
  select(Year, qtr, amount, variable, month)
# A tibble: 9 x 5
Year qtr amount variable month
<dbl> <dbl> <dbl> <dbl> <int>
1 2018 1 1 5 1
2 2018 1 1 5 2
3 2018 1 1 5 3
4 2019 3 2 6 7
5 2019 3 2 6 8
6 2019 3 2 6 9
7 2020 2 4 7 4
8 2020 2 4 7 5
9 2020 2 4 7 6
Data used:
> dput(df)
structure(list(Year = c(2018, 2019, 2020), qtr = c(1, 3, 2),
amount = c(3, 6, 12), variable = c(5, 6, 7)), class = "data.frame", row.names = c(NA,
-3L))
>
Using base R (this adds the month column; amount is left undivided):
do.call(rbind,
c(make.row.names = FALSE,
lapply(split(df, df$Year), function(i){
cbind(i, month = 1:3 + (i$qtr - 1) * 3, row.names = NULL)
})))
# Year qtr amount variable month
# 1 2018 1 3 5 1
# 2 2018 1 3 5 2
# 3 2018 1 3 5 3
# 4 2019 3 6 6 7
# 5 2019 3 6 6 8
# 6 2019 3 6 6 9
# 7 2020 2 12 7 4
# 8 2020 2 12 7 5
# 9 2020 2 12 7 6
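The base R output above keeps the quarterly amount. A minimal sketch that also divides amount by 3, and does not rely on each Year appearing only once, could repeat every row three times with rep(). The name monthly is only illustrative.

```r
df <- data.frame(Year = c(2018, 2019, 2020), qtr = c(1, 3, 2),
                 amount = c(3, 6, 12), variable = c(5, 6, 7))

# Repeat every quarterly row three times, once per month of the quarter
monthly <- df[rep(seq_len(nrow(df)), each = 3), ]

# Month number = first month of the quarter plus an offset of 0, 1 or 2
monthly$month <- (monthly$qtr - 1) * 3 + rep(1:3, times = nrow(df))

# Spread the quarterly amount evenly over the three months
monthly$amount <- monthly$amount / 3

rownames(monthly) <- NULL
monthly
```

With the example data this should give months 1, 2, 3, 7, 8, 9, 4, 5, 6 and amounts 1, 1, 1, 2, 2, 2, 4, 4, 4.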
Related
Below is the sample data, in the form in which I receive it. Each row is a quarter, and the months are columns within it. I am trying to do some month-over-month calculations and think I need to transform the data frame first. I suspect pivot_longer is the right tool, but I have not found a similar example online. The desired result is shown below the data.
year<-c(2018,2018,2018,2018,2019,2019,2019,2019,2020,2020,2020,2020)
qtr<-c(1,2,3,4,1,2,3,4,1,2,3,4)
avgemp <-c(3,5,7,9,11,13,15,17,19,21,23,25)
month1emp<-c(2,4,6,8,10,12,14,16,18,20,22,24)
month2emp<-c(3,5,7,9,11,13,15,17,19,21,23,25)
month3emp<-c(4,6,8,10,12,14,16,18,20,22,24,26)
sample<-data.frame(year,qtr,month1emp,month2emp,month3emp)
Desired Result
year qtr month employment
2018 1 1 2
2018 1 2 3
2018 1 3 4
2018 2 4 4
2018 2 5 5
2018 2 6 6
and so on. In 2019, the month value would restart and again run from 1 to 12.
We could use pivot_longer on the 'month' columns and specify names_pattern with two capture groups: the digit (\\d+), which becomes the 'month' column, and the trailing emp, which becomes the .value column.
library(dplyr)
library(tidyr)
sample %>%
pivot_longer(cols = starts_with('month'),
names_to = c("month", ".value"), names_pattern = ".*(\\d+)(emp)")%>%
rename(employment = emp)
-output
# A tibble: 36 x 4
year qtr month employment
<dbl> <dbl> <chr> <dbl>
1 2018 1 1 2
2 2018 1 2 3
3 2018 1 3 4
4 2018 2 1 4
5 2018 2 2 5
6 2018 2 3 6
7 2018 3 1 6
8 2018 3 2 7
9 2018 3 3 8
10 2018 4 1 8
# … with 26 more rows
If we need to increment the 'month' based on 'qtr' value
sample %>%
pivot_longer(cols = starts_with('month'),
names_to = c("month", ".value"), names_pattern = ".*(\\d+)(emp)")%>%
rename(employment = emp) %>%
mutate(month = as.integer(month) + c(0, 3, 6, 9)[qtr])
# A tibble: 36 x 4
year qtr month employment
<dbl> <dbl> <dbl> <dbl>
1 2018 1 1 2
2 2018 1 2 3
3 2018 1 3 4
4 2018 2 4 4
5 2018 2 5 5
6 2018 2 6 6
7 2018 3 7 6
8 2018 3 8 7
9 2018 3 9 8
10 2018 4 10 8
# … with 26 more rows
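The offset lookup c(0, 3, 6, 9)[qtr] used above indexes a vector of offsets by the quarter number. As a small illustration with made-up vectors (qtr and month_in_qtr here are not the columns of sample), it gives the same result as the arithmetic form (qtr - 1) * 3:

```r
# Hypothetical quarter numbers and within-quarter month positions
qtr <- c(1, 1, 1, 2, 2, 2, 4)
month_in_qtr <- c(1, 2, 3, 1, 2, 3, 1)

# Lookup: the qtr-th element of the offsets vector
offset_lookup <- c(0, 3, 6, 9)[qtr]

# Arithmetic: three months for every completed quarter
offset_arith <- (qtr - 1) * 3

# Calendar month within the year
month <- month_in_qtr + offset_lookup
month
# [1]  1  2  3  4  5  6 10
```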
Base R solution:
# Create a logical vector denoting whether or not
# each column should be unpivoted:
# unpivot_cols => logical vector
unpivot_cols <- startsWith(
  names(sample),
  "month"
)
# Reshape the data.frame, calculate
# the month value: rshpd_df => data.frame
rshpd_df <- transform(
  reshape(
    sample,
    direction = "long",
    varying = names(sample)[unpivot_cols],
    ids = NULL,
    timevar = "month",
    times = seq_len(sum(unpivot_cols)),
    v.names = "employment",
    # one row name per (quarter, month column) pair
    new.row.names = seq_len(
      nrow(sample) * sum(unpivot_cols)
    )
  ),
  month = ((12 / 4) * (qtr - 1)) + month
)
# Order the data.frame by year and month:
# ordered_df => data.frame
ordered_df <- with(
  rshpd_df,
  rshpd_df[order(year, month),]
)
I have a data frame like this:
my_df <- data.frame(
year = c("2018","2018","2017","2017", "2016","2016"),
my_month = c(6,7,8,9,4,5),
val=c(5,9,3,2,1,1))
> my_df
year my_month val
1 2018 6 5
2 2018 7 9
3 2017 8 3
4 2017 9 2
5 2016 4 1
6 2016 5 1
I need a data frame like this:
my_df_2 <- data.frame(
year = c("2018","2018","2017","2017", "2016","2016"),
my_month = c(6,7,8,9,4,5),
val=c(5,9,3,2,1,1),
pre_month = c(NA,4,NA,-1,NA,0))
> my_df_2
year my_month val pre_month
1 2018 6 5 NA
2 2018 7 9 4
3 2017 8 3 NA
4 2017 9 2 -1
5 2016 4 1 NA
6 2016 5 1 0
Basically, the "pre_month" column is created by taking the "val" of each "my_month" row within a year and subtracting the "val" of the previous month. So for 7-2018 -> 9-5=4, and so on.
Thank you for your help.
Here's a solution using tidyverse.
my_df <- data.frame(
year = c("2018","2018","2017","2017", "2016","2016"),
my_month = c(6,7,8,9,4,5),
val=c(5,9,3,2,1,1))
library(tidyverse)
my_df %>%
mutate(year = as.numeric(year)) %>%
group_by(year) %>%
arrange(my_month) %>%
mutate(pre_month = c(NA, diff(val))) %>%
arrange(desc(year))
I changed year to a numeric so it could be sorted sensibly.
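For comparison, a base R sketch of the same idea, assuming it is acceptable to keep the original row order: sort by year and month, take the differences within each year with ave(), and write the result back to the original positions. The helper name ord is only illustrative.

```r
my_df <- data.frame(
  year = c("2018", "2018", "2017", "2017", "2016", "2016"),
  my_month = c(6, 7, 8, 9, 4, 5),
  val = c(5, 9, 3, 2, 1, 1))

# Row positions sorted by year, then month
ord <- order(my_df$year, my_df$my_month)

# Within each year, difference to the previous month (NA for the first month)
my_df$pre_month <- NA_real_
my_df$pre_month[ord] <- ave(my_df$val[ord], my_df$year[ord],
                            FUN = function(v) c(NA, diff(v)))
my_df
```

On the example data pre_month comes out as NA, 4, NA, -1, NA, 0, which matches my_df_2.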
I have data like this:
structure(list(id = c(1, 1, 1, 2, 2, 2), time = c(1, 2, 2, 5,
6, 6)), class = "data.frame", row.names = c(NA, -6L))
If, for the same ID, the value in a row is equal to the value in the previous row, I want to increase the value of the duplicate by 1. I want to get this:
structure(list(id2 = c(1, 1, 1, 2, 2, 2), time2 = c(1, 2, 3,
5, 6, 7)), class = "data.frame", row.names = c(NA, -6L))
Using base R:
ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
# [1] 1 2 3 5 6 7
(This can be reassigned back into time.)
This deals with 2 or more duplicates, meaning if we instead have another 6th row,
df <- rbind(df, df[6,])
df$time2 <- ave(df$time, df$time, FUN = function(z) z+cumsum(duplicated(z)))
df
# id time time2
# 1 1 1 1
# 2 1 2 2
# 3 1 2 3
# 4 2 5 5
# 5 2 6 6
# 6 2 6 7
# 61 2 6 8
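Note that the ave() call above groups by time only. If the same time value can occur under different ids, a possible variation is to pass id as an additional grouping variable. The sketch below uses a made-up data frame (shared) and a helper name (bump) that are not part of the question:

```r
# Helper: add 1 for each earlier duplicate within the group
bump <- function(z) z + cumsum(duplicated(z))

# Hypothetical data in which ids 1 and 2 share the time value 5
shared <- data.frame(id = c(1, 1, 2, 2), time = c(5, 5, 5, 6))

# Grouping by time only also increments across ids
by_time <- ave(shared$time, shared$time, FUN = bump)
by_time
# [1] 5 6 7 6

# Grouping by id and time increments only within an id
by_id_time <- ave(shared$time, shared$id, shared$time, FUN = bump)
by_id_time
# [1] 5 6 5 6
```

On the question's data both variants give 1 2 3 5 6 7, because no time value is shared between ids.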
You could use accumulate
library(tidyverse)
df %>%
group_by(id) %>%
mutate(time2 = accumulate(time, ~if(.x>=.y) .x + 1 else .y))
# A tibble: 6 x 3
# Groups: id [2]
id time time2
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 2
3 1 2 3
4 2 5 5
5 2 6 6
6 2 6 7
This works even if a value is repeated more than twice within a group.
If the first data.frame is named df, this gives you what you need:
df$time[duplicated(df$id) & duplicated(df$time)] <- df$time[duplicated(df$id) & duplicated(df$time)] + 1
df
id time
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
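It finds the rows where both id and time have already appeared in an earlier row, and adds 1 to time in those rows. Note that a value repeated three or more times is only incremented once per row, so three 6s would become 6, 7, 7.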
You can use dplyr's mutate with cumsum and duplicated
df %>%
  group_by(id) %>%
  mutate(time = time + cumsum(duplicated(time))) %>%
  ungroup()
# A tibble: 6 x 2
id time
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 2 5
5 2 6
6 2 7
I'm looking for a tidy solution preferably using tidyverse
This question is in line with this answer, it does however have an added twist. My data has an overall grouping variable 'grp'. Within each such group, I want to perform calculations based on cumulative sum (cumsum) within sub-groups defined by 'trial', here X and Y.
However, for the calculations within both sub-groups, trial "X" and trial "Y", I need to use a single, common group-specific baseline, i.e. where trial is B.
My desired outcome is Value3 in the data set desired_outcome below:
# library(tidyverse)
# library(dplyr)
desired_outcome # see below for how I got this `desired_outcome`
# A tibble: 10 x 6
# Groups: grp [2]
grp trial yr value1 value2 Value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 7
5 A Y 2023 6 4 16
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 5
10 B Y 2023 4 6 14
My minimal working example. Data first,
tabl <- tribble(~grp, ~trial, ~yr, ~value1, ~value2,
'A', "B", 2021, 2, 0,
'A', "X", 2022, 3, 1,
'A', "X", 2023, 4, 2,
'A', "Y", 2022, 5, 3,
'A', "Y", 2023, 6, 4,
'B', "B", 2021, 0, 2,
'B', "X", 2022, 1, 3,
'B', "X", 2023, 2, 4,
'B', "Y", 2022, 3, 5,
'B', "Y", 2023, 4, 6) %>%
mutate(trial = factor(trial, levels = c("B", "X", "Y"))) %>%
arrange(grp, trial, yr)
Now, I need to use group_by(), but I can't group on trial as I need to use the baseline, B in calculations for both "X" and "Y".
undesired_outcome_tidier_code <- tabl %>%
group_by(grp) %>% # this does not work!
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1) %>%
select(-Value1.1, -Value2.1)
In undesired_outcome_tidier_code, rows 4-5 and 9-10 are, for obvious reasons, not using rows 1 and 6, respectively, as the baseline. As shown here,
undesired_outcome_tidier_code
# A tibble: 10 x 6
# Groups: grp [2]
grp trial yr value1 value2 Value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 17
5 A Y 2023 6 4 26
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 15
10 B Y 2023 4 6 24
I am looking for a solution that gets me desired_outcome (see below) in a tidy way.
I can, in this smaller example, work my way around it, to get to my desired_outcome, but it's a cumbersome two step solution. There must be a better/tidier way.
step1 <- tabl %>% arrange(grp, trial, yr) %>% filter(trial != 'Y') %>%
group_by(grp) %>%
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1)
step2 <- tabl %>% arrange(grp, trial, yr) %>% filter(trial != 'X') %>%
group_by(grp) %>%
mutate(Value1.1 = cumsum(value1),
Value2.1 = lag(cumsum(value2), default = 0),
Value3 = Value1.1 + Value2.1)
desired_outcome <- rbind(step1,
step2 %>% filter(trial != 'B')
) %>% select(-Value1.1, -Value2.1) %>% arrange(grp, trial, yr)
With the addition of purrr, you could do:
map(.x = c("X", "Y"),
~ tabl %>%
arrange(grp, trial, yr) %>%
filter(trial != .x) %>%
group_by(grp) %>%
mutate(value3 = cumsum(value1) + lag(cumsum(value2), default = 0))) %>%
reduce(full_join) %>%
arrange(grp, trial, yr)
grp trial yr value1 value2 value3
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 A B 2021 2 0 2
2 A X 2022 3 1 5
3 A X 2023 4 2 10
4 A Y 2022 5 3 7
5 A Y 2023 6 4 16
6 B B 2021 0 2 0
7 B X 2022 1 3 3
8 B X 2023 2 4 8
9 B Y 2022 3 5 5
10 B Y 2023 4 6 14
You can try with this.
calculate_value3 is a function that calculates value3 as you described. It does it for every letter of trial. It always includes the observation of the baseline. It doesn't matter if the letters will be different than X and Y. Note that baseline can be any letter you want, I set it up as "B" for now.
Inside the pipe, you go for a map-reduce solution. map runs the function calculate_value3 for each unique trial, and reduce combines the resulting vectors with coalesce, which fills each NA with the first non-missing value at that position. This is why I initialize v3 as a vector of all NAs in calculate_value3.
calculate_value3 <- function(ut, # trial under examination
tr, # trial vector
v1, # value1 vector
v2, # value2 vector
baseline = "B"){ # baseline id
v3 <- rep_len(NA, length(tr))
ind <- ut == tr | baseline == tr
cumv1 <- cumsum(v1[ind])
cumlv2 <- cumsum(lag(v2[ind], default = 0))
v3[ind] <- cumv1 + cumlv2
v3
}
library(purrr)
tabl %>%
group_by(grp) %>%
mutate(value3 = reduce(
map(unique(trial), calculate_value3,
tr = trial, v1 = value1, v2 = value2),
coalesce)) %>%
ungroup()
#> # A tibble: 10 x 6
#> grp trial yr value1 value2 value3
#> <chr> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A B 2021 2 0 2
#> 2 A X 2022 3 1 5
#> 3 A X 2023 4 2 10
#> 4 A Y 2022 5 3 7
#> 5 A Y 2023 6 4 16
#> 6 B B 2021 0 2 0
#> 7 B X 2022 1 3 3
#> 8 B X 2023 2 4 8
#> 9 B Y 2022 3 5 5
#> 10 B Y 2023 4 6 14
The solution is flexible to the identifiers of the trials and seems reasonably easy to debug and to edit if need be [at least to me].
Because tidyverse didn't seem like a strict requirement, I take the opportunity to suggest a data.table alternative:
Starting with the 'desired_outcome' data, just to make it easier to compare results:
library(data.table)
setDT(desired_outcome)
desired_outcome[ , v3 := {
c(value1[1], sapply(c("X", "Y"), function(g){
.SD[trial %in% c("B", g), (cumsum(value1) + cumsum(shift(value2, fill = 0)))[-1]]
}))}, by = grp]
# grp trial yr value1 value2 Value3 v3
# 1: A B 2021 2 0 2 2
# 2: A X 2022 3 1 5 5
# 3: A X 2023 4 2 10 10
# 4: A Y 2022 5 3 7 7
# 5: A Y 2023 6 4 16 16
# 6: B B 2021 0 2 0 0
# 7: B X 2022 1 3 3 3
# 8: B X 2023 2 4 8 8
# 9: B Y 2022 3 5 5 5
# 10: B Y 2023 4 6 14 14
For each 'grp' (by = grp), loop over 'trial' "X" and "Y" (sapply(c("X", "Y"), ...)). Within each sub-dataset defined by by (.SD), select rows where 'trial' is equal to "B" or the current value of the loop (trial %in% c("B", g)).
Do the calculation (cumsum(value1) + cumsum(shift(value2, fill = 0))) and remove the first value ([-1]). Prepend the first row within each 'grp', i.e. the row that corresponds to trial "B" (c(value1[1], ...)). Assign the result to a new variable by reference (v3 := ...).
How do I convert the dataframe?
Before:
set.seed(1)
df <- data.frame( n = rpois(16, 2),
year = rep(2011, 16),
month = rep(seq(1,4,1), times = rep(4,4)))
After:
df1 <- data.frame( n = c(8,11,4,9),
year = rep(2011, 4),
month = rep(seq(1,4,1)))
I think that what you want is this, using dplyr:
library(dplyr)
df %>%
group_by(year, month) %>%
summarise(n = sum(n))
# A tibble: 4 x 3
# Groups: year [1]
year month n
<dbl> <dbl> <int>
1 2011 1 8
2 2011 2 11
3 2011 3 4
4 2011 4 9
Using base R with aggregate
aggregate(n ~ ., df, sum)
# year month n
#1 2011 1 8
#2 2011 2 11
#3 2011 3 4
#4 2011 4 9
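In the formula n ~ ., the dot stands for all columns other than n, so here it should be equivalent to naming the grouping columns explicitly. A short sketch to check this on the question's data (by_dot and by_name are illustrative names):

```r
set.seed(1)
df <- data.frame(n = rpois(16, 2),
                 year = rep(2011, 16),
                 month = rep(seq(1, 4, 1), times = rep(4, 4)))

# The dot expands to every column except the response n
by_dot <- aggregate(n ~ ., df, sum)

# Same grouping with the columns spelled out
by_name <- aggregate(n ~ year + month, df, sum)

by_name
```

Spelling the columns out can be safer when the data frame has further columns that should not be used for grouping.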