I'd like to compute aggregated values in a column through time (with "year" being another column in my data).
I know how to do this easily in base R using a loop, but I feel there must be a way to do the same with dplyr using summarise in combination with something else. I would like to learn that, so I can integrate it into my code better.
I've made a toy example for this case. Consider this data, where we have the number of questions someone asked on Stack Overflow each year.
> library(tidyverse)
> data <- tribble(~year, ~questions,
2015, 1,
2016, 3,
2016, 2,
2017, 2,
2018, 3,
2018, 7,
2019, 10,
2020, 21)
> data
# A tibble: 8 x 2
year questions
<dbl> <dbl>
1 2015 1
2 2016 3
3 2016 2
4 2017 2
5 2018 3
6 2018 7
7 2019 10
8 2020 21
The following loop will do what I want
> for (i in 1:length(data$year)){
+ data$agg_questions[i] <- sum(data$questions[data$year <= data$year[i]])
+ }
> data
# A tibble: 8 x 3
year questions agg_questions
<dbl> <dbl> <dbl>
1 2015 1 1
2 2016 3 6
3 2016 2 6
4 2017 2 8
5 2018 3 18
6 2018 7 18
7 2019 10 28
8 2020 21 49
And, of course, I'm looking for a way that avoids the loop altogether. Not something like this, which still uses one:
> for (i in 1:length(data$year)){
+ data$agg_questions2[i] <- data %>%
+ filter(year <= data$year[i]) %>%
+ pull(questions) %>%
+ sum()
+ }
> data
# A tibble: 8 x 4
year questions agg_questions agg_questions2
<dbl> <dbl> <dbl> <dbl>
1 2015 1 1 1
2 2016 3 6 6
3 2016 2 6 6
4 2017 2 8 8
5 2018 3 18 18
6 2018 7 18 18
7 2019 10 28 28
8 2020 21 49 49
I know it's possible to use [] to subset inside the summarise() and mutate() functions, but I've always struggled with that. Is that possible here? Thanks!
EDIT
After reading the first answers, I realised that I had simplified the example too much. I've edited the example data by adding several rows for the same year to make it look more like what I want (which, I think, complicates just using cumsum()).
You can do this by using summarise() with sum() to create a yearly totals column, and then mutate() with cumsum() to create a column of cumulative sums over the years.
library(dplyr)
data <- tribble(~year, ~questions,
2015, 1,
2016, 3,
2016, 2,
2017, 2,
2018, 3,
2018, 7,
2019, 10,
2020, 21)
data %>%
group_by(year) %>%
summarise(year_total = sum(questions)) %>%
mutate(cum_over_years = cumsum(year_total))
#> # A tibble: 6 x 3
#> year year_total cum_over_years
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1
#> 2 2016 5 6
#> 3 2017 2 8
#> 4 2018 10 18
#> 5 2019 10 28
#> 6 2020 21 49
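If you'd rather keep one row per original observation (matching the loop's output) instead of collapsing to one row per year, a sketch along the same lines (not from the original answer, but standard dplyr):
data %>%
  arrange(year) %>%
  mutate(agg_questions = cumsum(questions)) %>%
  group_by(year) %>%
  mutate(agg_questions = max(agg_questions)) %>% # within a year, every row gets the year-end total
  ungroup()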
This answer uses the original six-row example data, where each year appears only once:
library(tibble)
data <- tribble(~year, ~questions,
2015, 1,
2016, 3,
2017, 2,
2018, 3,
2019, 10,
2020, 21)
In base R:
data <- as.data.frame(data)
data$agg_questions <- cumsum(data$questions)
> data
year questions agg_questions
1 2015 1 1
2 2016 3 4
3 2017 2 6
4 2018 3 9
5 2019 10 19
6 2020 21 40
In data.table:
library(data.table)
data <- as.data.table(data)
data[, agg_questions := cumsum(questions)]
> data
year questions agg_questions
1: 2015 1 1
2: 2016 3 4
3: 2017 2 6
4: 2018 3 9
5: 2019 10 19
6: 2020 21 40
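If the years repeat, as in the asker's edited eight-row example, you can aggregate per year first and then take the cumulative sum of the yearly totals. A sketch using the same data.table idiom, assuming data holds the edited example:
yearly <- as.data.table(data)[, .(year_total = sum(questions)), keyby = year] # one row per year, sorted
yearly[, agg_questions := cumsum(year_total)]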
I'm stumped on something that seems so silly. Is there an elegant dplyr way to filter out only the 2 rows where year == 2020 and quarter %in% 1:2?
I don't want to filter quarter for any other year besides 2020.
library(tibble)
library(dplyr)
df <- tibble(measure = rep(letters[1:4], 4),
year = rep(2017:2020,4)) %>%
arrange(year) %>%
mutate(quarter = rep(1:4, 4))
df2 <- filter(df, year != 2020 & quarter %in% 1:2)
Created on 2021-02-01 by the reprex package (v0.3.0)
What I want is all but the last two rows.
Try negating the entire expression of what you don't want:
dplyr::filter(df, !(year == 2020 & quarter %in% 1:2))
measure year quarter
1 a 2017 1
2 a 2017 2
3 a 2017 3
4 a 2017 4
5 b 2018 1
6 b 2018 2
7 b 2018 3
8 b 2018 4
9 c 2019 1
10 c 2019 2
11 c 2019 3
12 c 2019 4
13 d 2020 3
14 d 2020 4
year == 2020 & quarter %in% 1:2 says to keep rows where the year is 2020 AND quarter is 1 or 2. The ! negates the entire expression so you exclude those rows.
FYI, you can also use dplyr::between(quarter, 1, 2)
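For example, the equivalent call (both endpoints of between() are inclusive, so it matches quarter %in% 1:2):
dplyr::filter(df, !(year == 2020 & dplyr::between(quarter, 1, 2)))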
I'm trying to subset the individuals that have been present for the duration of the whole study starting in 2014 and ending in 2019. So, the output would be a list of names that are present in every year of the dataframe.
I've tried the following code:
big_data <- dplyr::bind_rows(df1, df2, df3, df4, df5, df6) # I've bound 6 different dataframes (each with data from one of the years) by row. These dfs have a different number of rows and columns. Some columns repeat in different years, while others don't.
Date <- as.POSIXlt.Date(big_data$Date)
Year <- separate(big_data, Date, into = c('Month', 'Day', 'Year') %>% select(Year)) # I've extracted the Year from the Date variable (DD/MM/YYYY)
Year <- big_data$Year # I've added it to the big_data
Interval <- Year %between% c("2014", "2019") # I've created a timeperiod with the start and end years of the study
big_data [, all.names(FocalID %in% Interval)] # I've tried to get the names of the individuals (in variable FocalID) that are present in the interval (but probably doesn't mean in every year)
Obviously this code didn't work. Could you help me out? Thank you!
If your data frame has rows with id and year, for example:
big_data <- data.frame(
id = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
year = c(2014:2019, 2014:2019, 2014:2018)
)
id year
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 1 2018
6 1 2019
7 1 2014
8 2 2015
9 2 2016
10 2 2017
11 2 2018
12 3 2019
13 3 2014
14 3 2015
15 3 2016
16 3 2017
17 3 2018
You can use the dplyr package from the tidyverse to group_by the individual subject id, and then check that the rows for that id contain all of the years 2014-2019 in year. This filters in all rows for a given id only if every year is represented.
library(dplyr)
big_data %>%
group_by(id) %>%
filter(all(2014:2019 %in% year))
A base R option would be the following:
big_data[big_data$id %in% Reduce(intersect, split(big_data$id, big_data$year)), ]
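Roughly, that one-liner decomposes into these steps (a sketch of the same computation):
ids_by_year <- split(big_data$id, big_data$year) # list of the ids seen in each year
ids_all_years <- Reduce(intersect, ids_by_year)  # ids present in every single year
big_data[big_data$id %in% ids_all_years, ]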
In this example, ids 1 and 3 appear in all years 2014-2019.
Output
id year
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 1 2018
6 1 2019
7 1 2014
12 3 2019
13 3 2014
14 3 2015
15 3 2016
16 3 2017
17 3 2018
Another option with data.table
library(data.table)
setDT(big_data)[big_data[, .I[all(2014:2019 %in% year)], id]$V1]
Output
# id year
# 1: 1 2014
# 2: 1 2015
# 3: 1 2016
# 4: 1 2017
# 5: 1 2018
# 6: 1 2019
# 7: 1 2014
# 8: 3 2019
# 9: 3 2014
#10: 3 2015
#11: 3 2016
#12: 3 2017
#13: 3 2018
data
big_data <- data.frame(
id = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
year = c(2014:2019, 2014:2019, 2014:2018)
)
I am looking to get the standard deviation grouped by year. None of the examples I have seen involve an aggregated count column.
I want to use the sum of the count column as part of the standard deviation calculation.
year count age
2018 2 0
2018 3 1
2018 4 2
2017 1 0
2017 4 1
2017 2 2
The expected answer for the above would be:
Year 2018 = 0.78567420131839
Year 2017 = 0.63887656499994
The following should do the trick. (Note that R's sd() computes the sample standard deviation, dividing by n - 1; your expected values are population standard deviations, dividing by n, which is why the numbers below differ slightly from yours.)
library(dplyr)
library(purrr)
data <- tibble(year = c(2018, 2018, 2018, 2017, 2017, 2017),
count = c(2, 3, 4, 1, 4, 2),
age = c(0, 1, 2, 0, 1, 2))
data %>%
mutate(vec = map2(age, count, ~ rep(.x, .y))) %>%
group_by(year) %>%
mutate(concs = list(unlist(vec))) %>%
ungroup() %>%
mutate(age_sd = map_dbl(concs, sd)) %>%
select(-vec, -concs)
# year count age age_sd
# <dbl> <dbl> <dbl> <dbl>
# 1 2018 2 0 0.833
# 2 2018 3 1 0.833
# 3 2018 4 2 0.833
# 4 2017 1 0 0.690
# 5 2017 4 1 0.690
# 6 2017 2 2 0.690
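Alternatively, if you want the population standard deviations from the question (0.7857 for 2018, 0.6389 for 2017), you can compute them directly from the counts without expanding each row into repeated observations. A sketch, not from the original answer:
data %>%
  group_by(year) %>%
  summarise(
    n        = sum(count),                               # total observations per year
    mean_age = sum(age * count) / n,                     # count-weighted mean
    pop_sd   = sqrt(sum(count * (age - mean_age)^2) / n) # divide by n, not n - 1
  )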
I'm working with a prescription drug claims dataset. When there is a canceled claim, the data system does not just delete the observation, but creates a new observation with the same prescription number but with the days supplied shown as a negative number.
E.g.
DaysSupply RxNumber DateSupplied
1 -10 1 2018
2 10 1 2018
I want to delete paired rows of the dataset if they 1) share the same prescription number (RxNumber), 2) if they have the same prescription date (DateSupplied), and 3) if the DaysSupply are corresponding positive and negative values (e.g. +10 and -10). The prescription number is the patient-specific key in this case.
One complication is that multiple drug fills can be redeemed from one prescription number, so I want to deduplicate JUST PAIRS that match the above conditions instead of deduplicating on all rows that share the same prescription number.
I'm not sure what approach I should be taking. I've thought about using a long if statement/deduplicate command, but I'm not sure how to instruct R to deduplicate ONLY pairs that match the above conditions.
v1 <- c(-10,10,10,-8,8,-6,6,5,4)
v2 <- c(1,1,1,2,2,3,4,9,9)
v3 <- c(2018, 2018, 2018, 2018, 2017, 2016, 2016, 2015, 2014)
df <- data.frame("DaysSupply" = v1, "RxNumber" = v2, "DateSupplied" = v3)
DaysSupply RxNumber DateSupplied
1 -10 1 2018
2 10 1 2018
3 10 1 2018
4 -8 2 2018
5 8 2 2017
6 -6 3 2016
7 6 4 2016
8 5 9 2015
9 4 9 2014
What I would like as an output is:
DaysSupply RxNumber DateSupplied
3 10 1 2018
4 -8 2 2018
5 8 2 2017
6 -6 3 2016
7 6 4 2016
8 5 9 2015
9 4 9 2014
Any ideas?
A dplyr solution using your sample data.
I included some lines toward the end to make it look nicer and get the output to look the same as yours. I'm sure someone could cut a line or two out and make the duplicate removal process a little cleaner, but I got it to do what you need.
df %>%
dplyr::mutate(AbsDaysSupply = abs(DaysSupply)) %>% # absolute value so +x and -x land in the same group
dplyr::group_by(RxNumber, DateSupplied, AbsDaysSupply) %>%
dplyr::arrange(RxNumber, DateSupplied, AbsDaysSupply, DaysSupply) %>% # negatives sort first within each group
dplyr::mutate(sum = cumsum(DaysSupply)) %>% # rows belonging to a canceled pair have a running sum <= 0
dplyr::filter(!(sum <= 0 & dplyr::n() > 1)) %>% # drop those pairs, but keep unmatched single rows
dplyr::ungroup() %>%
dplyr::select(-AbsDaysSupply, -sum) %>%
dplyr::arrange(desc(DateSupplied), RxNumber)
# A tibble: 7 x 3
DaysSupply RxNumber DateSupplied
<dbl> <dbl> <dbl>
1 10 1 2018
2 -8 2 2018
3 8 2 2017
4 -6 3 2016
5 6 4 2016
6 5 9 2015
7 4 9 2014
library(tidyverse)
v1 <- c(-10,10,10,-8,8,-6,6,5,4)
v2 <- c(1,1,1,2,2,3,4,9,9)
v3 <- c(2018, 2018, 2018, 2018, 2017, 2016, 2016, 2015, 2014)
df <- data.frame("DaysSupply" = v1, "RxNumber" = v2, "DateSupplied" = v3)
df %>%
# Create an absolute column for matching
mutate(DaysSupplyAbs = abs(DaysSupply)) %>%
# Order to make matches adjacent, but with the positive value first
arrange(RxNumber, DaysSupplyAbs, -DaysSupply) %>%
# Limit matches to Year and RxNumber
group_by(RxNumber, DateSupplied) %>%
# Get the next (lead) and prior (lag) DaysSupply values
mutate(DaysSupplyLead = lead(DaysSupply),
DaysSupplyLag = lag(DaysSupply)) %>%
# Identify the reversed and reversal
mutate(reversed = if_else(is.na(DaysSupplyLead), FALSE, DaysSupply == -DaysSupplyLead)) %>%
mutate(reversal = if_else(is.na(lag(reversed)), FALSE, lag(reversed) )) %>%
ungroup() %>%
# Filter out the reversals and the reversed
filter(!(reversed | reversal)) %>%
select(DaysSupply, RxNumber, DateSupplied, reversed, reversal )
Result:
# DaysSupply RxNumber DateSupplied reversed reversal
# <dbl> <dbl> <dbl> <lgl> <lgl>
# 1 10 1 2018 FALSE FALSE
# 2 8 2 2017 FALSE FALSE
# 3 -8 2 2018 FALSE FALSE
# 4 -6 3 2016 FALSE FALSE
# 5 6 4 2016 FALSE FALSE
# 6 4 9 2014 FALSE FALSE
# 7 5 9 2015 FALSE FALSE
I'm having some trouble with advanced operations on grouped data in dplyr. I'm not sure how to specify when I want to refer to an observation-level value and when to the entire vector.
Sample data frame:
df <- as.data.frame(
rbind(
c(11990, 2011, 1, 1, 2010),
c(11990, 2015, 1, 0, NA),
c(11990, 2017, 2, 1, NA),
c(11990, 2018, 2, 1, 2016),
c(11990, 2019, 2, 1, 2019),
c(11990, 2020, 1, 0, NA),
c(22880, 2013, 1, 1, NA),
c(22880, 2014, 1, 0, 2011),
c(22880, 2015, 1, 1, NA),
c(22880, 2018, 2, 0, 2014),
c(22880, 2020, 2, 0, 1979)))
names(df) <- c("id", "year", "house_apt", "moved", "year_moved")
# > df
# id year house_apt moved year_moved
# 1 11990 2011 1 1 2010
# 2 11990 2015 1 0 NA
# 3 11990 2017 2 1 NA
# 4 11990 2018 2 1 2016
# 5 11990 2019 2 1 2019
# 6 11990 2020 1 0 NA
# 7 22880 2013 1 1 NA
# 8 22880 2014 1 0 2011
# 9 22880 2015 1 1 NA
# 10 22880 2018 2 0 2014
# 11 22880 2020 2 0 1979
If I do simple mutate operations:
library(dplyr)
df %>% mutate(year+2)
df %>% group_by(id) %>% mutate(year+2)
It's pretty obvious that "year" here refers to each individual row value. This is the case even if I were to (for some reason) do it with a grouping. However, if I were to do the following two operations which involve a vector operation:
df %>% mutate(sum(year))
df %>% group_by(id) %>% mutate(sum(year))
dplyr understands "year" as the entire vector of year values for that whole group.
However, now I am having a lot of trouble with an operation where it is ambiguous whether I want mutate to use the row value or the entire vector. With my data frame, I want to create a variable with a guessed moving year for individuals who moved but didn't record the moving date until a later survey instance. Note that the data is extremely messy, with some nonsensical moving dates that we want to ignore.
Therefore, I want to create a "guess" value for each row where a person moved but no move_year is recorded. I want the operation to look through the entire vector of moving dates for each individual, subset to include only the ones earlier than the current year, and pick out the one that is the closest to the year for the current row. Granular example: If we look at row #3, the individual moved in that year, but there is no move date. Therefore we want to look at the entire year_moved vector for that person (2010, NA, NA, 2016, 2019, NA) and choose the one that is the closest to and preferably earlier than the row #3 value of year (2017). The guess value, therefore, would be 2016.
Getting the value we want with a given year and vector of values is simple:
year <- 2017
year_moved <- c(2010, 2016, 2017)
year_moved[which.min(year-(year_moved[year_moved<year & !is.na(year_moved)]))]
# [1] 2016
rm(year, year_moved)
However, when I try this within a mutate function, it doesn't give me the same result.
df %>%
group_by(id) %>%
mutate(
year_guess = ifelse(moved==1 & is.na(year_moved),
year_moved[which.min(year-(year_moved[year_moved<year]))],
NA))
# # A tibble: 11 x 6
# # Groups: id [2]
# id year house_apt moved year_moved year_guess
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 11990 2011 1 1 2010 NA
# 2 11990 2015 1 0 NA NA
# 3 11990 2017 2 1 NA NA
# 4 11990 2018 2 1 2016 NA
# 5 11990 2019 2 1 2019 NA
# 6 11990 2020 1 0 NA NA
# 7 22880 2013 1 1 NA 2011
# 8 22880 2014 1 0 2011 NA
# 9 22880 2015 1 1 NA 2011
# 10 22880 2018 2 0 2014 NA
# 11 22880 2020 2 0 1979 NA
# Warning message:
# In year - (year_moved[year_moved < year & !is.na(year_moved)]) :
# longer object length is not a multiple of shorter object length
(Row 3 should be 2016 and Row 9 should be 2014.) I think part of it is my inability to specify whether I am interested in a row-value or a vector. Note that the first time I refer to "year_moved" (is.na(year_moved)), I am referring to the value in that row. When I refer to it within the which.min, I am trying to refer to the groupwise vector. When I refer to "year", I'm trying to refer to the value of the individual row I'm working in. Clearly things are a little muddled, and it's a broader problem I've been running into with many different applications. Can anyone provide guidance?
I've been writing my whole project using tidyverse so would like to continue if possible.
I think the most straightforward way to modify your current attempt to get the right results is to wrap the guessing operation in sapply so that a guess is separately calculated for each year:
df %>%
group_by(id) %>%
mutate(
year_guess = ifelse(
moved==1 & is.na(year_moved),
sapply(year, function(x) year_moved[which.min(x-(year_moved[year_moved < x]))]),
NA)
)
I haven't been able to fully unpack the logic of how this works, but I think that, as written, your guessing procedure is a little too complex to be easily vectorized (although it probably can be if you approach it in a slightly different way).
Output:
# A tibble: 11 x 6
# Groups: id [2]
id year house_apt moved year_moved year_guess
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11990 2011 1 1 2010 NA
2 11990 2015 1 0 NA NA
3 11990 2017 2 1 NA 2016
4 11990 2018 2 1 2016 NA
5 11990 2019 2 1 2019 NA
6 11990 2020 1 0 NA NA
7 22880 2013 1 1 NA 2011
8 22880 2014 1 0 2011 NA
9 22880 2015 1 1 NA 2014
10 22880 2018 2 0 2014 NA
11 22880 2020 2 0 1979 NA
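For what it's worth, a slightly more defensive variant (an untested sketch, not part of the answer above) takes the maximum earlier move year directly. This avoids indexing year_moved with a position computed on a filtered copy of it, and returns NA cleanly when a group has no earlier candidate:
df %>%
  group_by(id) %>%
  mutate(
    year_guess = ifelse(
      moved == 1 & is.na(year_moved),
      sapply(year, function(x) {
        cand <- year_moved[!is.na(year_moved) & year_moved < x] # earlier recorded moves in this group
        if (length(cand) == 0) NA_real_ else max(cand)          # closest earlier year
      }),
      NA_real_
    )
  )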