Subsets in subsets without looping in R

This is my starting data:
days <- c("01.01.2018","01.01.2018","01.01.2018","01.01.2018",
          "02.01.2018","02.01.2018","02.01.2018","02.01.2018",
          "03.01.2018","03.01.2018","03.01.2018","03.01.2018")
time <- c("00:00:00","08:00:00","12:00:00","16:00:00",
          "00:00:00","08:00:00","12:00:00","16:00:00",
          "00:00:00","08:00:00","12:00:00","16:00:00")
a <- c(10,12,11,14,
       12,22,24,20,
       11,8,13,16)
b <- c(18,22,26,21,
       2,6,7,5,
       27,31,29,26)
c <- a - b
d <- c(10,10,10,10,
       20,20,20,20,
       30,30,30,30)
df <- data.frame(days, time, a, b, c, d)
So df will come out as:
days time a b c d
1 01.01.2018 00:00:00 10 18 -8 10
2 01.01.2018 08:00:00 12 22 -10 10
3 01.01.2018 12:00:00 11 26 -15 10
4 01.01.2018 16:00:00 14 21 -7 10
5 02.01.2018 00:00:00 12 2 10 20
6 02.01.2018 08:00:00 22 6 16 20
7 02.01.2018 12:00:00 24 7 17 20
8 02.01.2018 16:00:00 20 5 15 20
9 03.01.2018 00:00:00 11 27 -16 30
10 03.01.2018 08:00:00 8 31 -23 30
11 03.01.2018 12:00:00 13 29 -16 30
12 03.01.2018 16:00:00 16 26 -10 30
In this data frame I'd like to,
for each day,
find the first c value <= -10, then
add the corresponding d value to every c from that value through the last c value of the day.
This is what I've come up with:
ndays <- unique(df$days)
for(i in 1:length(ndays)) {
  idx <- df$days == ndays[i] & df$c <= -10
  if(!is.na(df$c[idx][1])) {
    df$c[idx] <- df$c[idx] + df$d[idx]
  }
}
Output will be:
days time a b c d
1 01.01.2018 00:00:00 10 18 -8 10
2 01.01.2018 08:00:00 12 22 0 10
3 01.01.2018 12:00:00 11 26 -5 10
4 01.01.2018 16:00:00 14 21 -7 10
5 02.01.2018 00:00:00 12 2 10 20
6 02.01.2018 08:00:00 22 6 16 20
7 02.01.2018 12:00:00 24 7 17 20
8 02.01.2018 16:00:00 20 5 15 20
9 03.01.2018 00:00:00 11 27 14 30
10 03.01.2018 08:00:00 8 31 7 30
11 03.01.2018 12:00:00 13 29 14 30
12 03.01.2018 16:00:00 16 26 20 30
The problem is that I'd like to avoid the for loop, since it is slow, and it isn't adding d to the rest of the day: df$c[4] should be 3.

There is a solution using dplyr and lubridate. I'm not 100% sure I understand what you want to do, but I think it should help you solve your problem.
By creating a yday variable with lubridate functions, you can then group by this variable and use the cumulative maximum, cummax(), to flag every row from the first c < -10 onward.
library(lubridate)
library(dplyr)
df %>%
  mutate(yday = day(dmy(days))) %>%
  mutate(is_below = c < -10) %>%
  group_by(yday) %>%
  mutate(to_add = cummax(is_below)) %>%
  mutate(c = if_else(to_add == 1, true = c + d, false = c))
#> # A tibble: 12 x 9
#> # Groups: yday [3]
#> days time a b c d yday is_below to_add
#> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <int> <lgl> <int>
#> 1 01.01.2018 00:00:00 10 18 -8 10 1 FALSE 0
#> 2 01.01.2018 08:00:00 12 22 -10 10 1 FALSE 0
#> 3 01.01.2018 12:00:00 11 26 -5 10 1 TRUE 1
#> 4 01.01.2018 16:00:00 14 21 3 10 1 FALSE 1
#> 5 02.01.2018 00:00:00 12 2 10 20 2 FALSE 0
#> 6 02.01.2018 08:00:00 22 6 16 20 2 FALSE 0
#> 7 02.01.2018 12:00:00 24 7 17 20 2 FALSE 0
#> 8 02.01.2018 16:00:00 20 5 15 20 2 FALSE 0
#> 9 03.01.2018 00:00:00 11 27 14 30 3 TRUE 1
#> 10 03.01.2018 08:00:00 8 31 7 30 3 TRUE 1
#> 11 03.01.2018 12:00:00 13 29 14 30 3 TRUE 1
#> 12 03.01.2018 16:00:00 16 26 20 30 3 FALSE 1
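The same cummax idea also works in base R with ave(), in case you'd rather avoid extra packages. A minimal sketch, rebuilding only the columns it needs from the question's data:

```r
# Minimal base R sketch of the cummax idea, using the question's values
days <- rep(c("01.01.2018", "02.01.2018", "03.01.2018"), each = 4)
d    <- rep(c(10, 20, 30), each = 4)
df   <- data.frame(days, c = c(-8, -10, -15, -7, 10, 16, 17, 15, -16, -23, -16, -10), d)

# per day, the flag switches on at the first c < -10 and stays on
flag <- ave(df$c < -10, df$days, FUN = cummax)
df$c <- ifelse(flag == 1, df$c + df$d, df$c)
```

This gives the same c column as the dplyr output above, without the helper columns.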

Related

Subsetting 10-day intervals with an overlapping date

I have an annual data set that I would like to break into 10-day intervals. For example, I would like to subset 2010-12-26 to 2011-01-04 and create a home range using the x and y values for those dates, then take the next 9 days plus one date overlapping the previous subset, in this case 2011-01-04 (so 2011-01-04 to 2011-01-13). Is there a good way to do this?
# Example dataset
library(lubridate)
date <- seq(dmy("26-12-2010"), dmy("15-01-2013"), by = "days")
df <- data.frame(date = date,
                 x = runif(752, min = 60000, max = 80000),
                 y = runif(752, min = 800000, max = 900000))
> df
date x y
1 2010-12-26 73649.16 894525.6
2 2010-12-27 69005.21 898233.7
3 2010-12-28 64982.90 873692.6
4 2010-12-29 64592.93 841055.2
5 2010-12-30 60475.99 854524.3
6 2010-12-31 79206.43 879468.2
7 2011-01-01 76692.40 830569.6
8 2011-01-02 70378.51 834338.2
9 2011-01-03 74977.73 820568.0
10 2011-01-04 63023.47 899482.3
11 2011-01-05 77046.80 886369.0
12 2011-01-06 68751.91 841074.7
13 2011-01-07 65471.34 888525.3
14 2011-01-08 61138.68 855039.5
15 2011-01-09 65660.66 880227.2
16 2011-01-10 75526.36 838478.6
17 2011-01-11 64485.74 808947.7
18 2011-01-12 61405.69 887784.1
19 2011-01-13 70561.86 847634.7
20 2011-01-14 69234.98 840012.1
21 2011-01-15 75539.43 817132.5
22 2011-01-16 74227.28 839230.4
23 2011-01-17 74548.59 855006.3
24 2011-01-18 72020.71 815036.7
25 2011-01-19 70814.50 883029.6
26 2011-01-20 76924.65 817289.5
27 2011-01-21 60556.21 807427.2
Thank you for your time.
What about this?
res <- lapply(
  seq(0, nrow(df), by = 10),
  function(k) df[max(k, 1):min(k + 10, nrow(df)), ]
)
which gives
> head(res)
[[1]]
date x y
1 2010-12-26 63748.27 856758.7
2 2010-12-27 73774.90 860222.6
3 2010-12-28 68893.24 804194.7
4 2010-12-29 79791.86 810624.5
5 2010-12-30 60073.50 809016.0
6 2010-12-31 74020.15 883304.9
7 2011-01-01 67144.95 889235.3
8 2011-01-02 67205.20 810514.2
9 2011-01-03 68518.68 882730.7
10 2011-01-04 70442.87 892934.1
[[2]]
date x y
10 2011-01-04 70442.87 892934.1
11 2011-01-05 65466.26 855725.2
12 2011-01-06 70034.79 879770.8
13 2011-01-07 60195.42 888653.4
14 2011-01-08 65208.12 883176.8
15 2011-01-09 63040.52 821902.3
16 2011-01-10 62302.66 815025.1
17 2011-01-11 77662.53 829474.5
18 2011-01-12 64802.65 809961.7
19 2011-01-13 71812.61 810755.1
20 2011-01-14 63086.30 820029.9
[[3]]
date x y
20 2011-01-14 63086.30 820029.9
21 2011-01-15 75548.71 806966.7
22 2011-01-16 68572.89 847679.0
23 2011-01-17 71408.65 889490.2
24 2011-01-18 73507.84 815559.7
25 2011-01-19 76854.50 899108.6
26 2011-01-20 79138.08 858537.1
27 2011-01-21 73960.14 898957.3
28 2011-01-22 75048.41 864425.6
29 2011-01-23 61059.20 857558.3
30 2011-01-24 67455.03 853017.1
[[4]]
date x y
30 2011-01-24 67455.03 853017.1
31 2011-01-25 72727.70 891708.8
32 2011-01-26 73230.11 836404.6
33 2011-01-27 67719.05 815528.3
34 2011-01-28 65139.66 826289.8
35 2011-01-29 65145.94 818736.4
36 2011-01-30 74206.03 839014.2
37 2011-01-31 77259.35 855653.0
38 2011-02-01 77809.65 836912.6
39 2011-02-02 62744.02 831549.0
40 2011-02-03 79594.93 873313.6
[[5]]
date x y
40 2011-02-03 79594.93 873313.6
41 2011-02-04 78942.86 825001.1
42 2011-02-05 61346.88 871578.5
43 2011-02-06 68526.18 863300.7
44 2011-02-07 76920.15 844180.0
45 2011-02-08 73023.08 823092.4
46 2011-02-09 64287.09 804682.7
47 2011-02-10 71377.16 829219.8
48 2011-02-11 68930.80 814626.6
49 2011-02-12 70780.95 831549.8
50 2011-02-13 73740.99 895868.0
[[6]]
date x y
50 2011-02-13 73740.99 895868.0
51 2011-02-14 79846.05 844586.6
52 2011-02-15 66559.60 835943.0
53 2011-02-16 68522.99 837633.2
54 2011-02-17 65898.75 891364.4
55 2011-02-18 73809.44 842797.9
56 2011-02-19 73336.53 821166.5
57 2011-02-20 72780.91 883200.6
58 2011-02-21 73240.81 864142.2
59 2011-02-22 78855.11 868599.6
60 2011-02-23 69236.04 845566.6
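If you'd rather key the windows on the dates themselves than on row numbers (so the overlap survives even if a day were missing), here is a base R sketch under the same assumptions; starts steps by 9 days so each 10-day window shares its last date with the next window's first:

```r
# Sketch: 10-day date windows with a one-day overlap, keyed on dates not row numbers
date <- seq(as.Date("2010-12-26"), as.Date("2013-01-15"), by = "days")
df <- data.frame(date = date,
                 x = runif(length(date), min = 60000, max = 80000),
                 y = runif(length(date), min = 800000, max = 900000))

starts <- seq(min(df$date), max(df$date), by = 9)  # 9-day step -> shared boundary date
res <- lapply(starts, function(s) df[df$date >= s & df$date <= s + 9, ])
```

So res[[1]] covers 2010-12-26 to 2011-01-04 and res[[2]] starts again at 2011-01-04.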
An alternative solution using the dplyr package, applicable when you want groups of n dates instead of groups of 10. We assume one row per date, as in your example. (Note this version does not repeat the overlapping boundary date.)
library(lubridate)
dt <- seq(dmy("26-12-2010"), dmy("15-01-2013"), by = "days")
df <- data.frame(date = dt,
                 x = runif(752, min = 60000, max = 80000),
                 y = runif(752, min = 800000, max = 900000))
library(dplyr)
n <- 10
df |>
  arrange(date) |>
  mutate(id = 0:(nrow(df) - 1),
         group = id %/% n + 1) |>
  group_by(group) |>
  group_split() |>
  head(n = 2)
#> [[1]]
#> # A tibble: 10 x 5
#> date x y id group
#> <date> <dbl> <dbl> <int> <dbl>
#> 1 2010-12-26 70488. 884674. 0 1
#> 2 2010-12-27 74133. 888636. 1 1
#> 3 2010-12-28 66635. 838681. 2 1
#> 4 2010-12-29 67931. 808998. 3 1
#> 5 2010-12-30 68032. 868329. 4 1
#> 6 2010-12-31 76891. 826684. 5 1
#> 7 2011-01-01 70793. 890401. 6 1
#> 8 2011-01-02 60427. 846447. 7 1
#> 9 2011-01-03 69902. 886152. 8 1
#> 10 2011-01-04 64253. 859245. 9 1
#>
#> [[2]]
#> # A tibble: 10 x 5
#> date x y id group
#> <date> <dbl> <dbl> <int> <dbl>
#> 1 2011-01-05 74260. 844636. 10 2
#> 2 2011-01-06 75631. 807722. 11 2
#> 3 2011-01-07 74443. 840540. 12 2
#> 4 2011-01-08 78903. 811777. 13 2
#> 5 2011-01-09 78531. 894333. 14 2
#> 6 2011-01-10 79310. 812625. 15 2
#> 7 2011-01-11 71701. 801691. 16 2
#> 8 2011-01-12 63254. 854752. 17 2
#> 9 2011-01-13 72813. 837910. 18 2
#> 10 2011-01-14 62718. 877568. 19 2
Created on 2021-07-05 by the reprex package (v2.0.0)
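If you also want the repeated boundary row in this dplyr version, so it matches the lapply output above, one sketch is to duplicate each group's last row into the following group before splitting. n, id, and group are as in the answer above; the overlap step is my addition:

```r
library(dplyr)

# Small example: 30 daily rows, windows of n = 10 with a shared boundary row
n  <- 10
df <- data.frame(date = seq(as.Date("2010-12-26"), by = "day", length.out = 30),
                 x = 1:30)

df2 <- df %>% mutate(id = row_number() - 1, group = id %/% n + 1)

# copy each group's last row (ids 9, 19, ...) into the next group,
# skipping the very last row so no trailing singleton group appears
overlap <- df2 %>% filter(id %% n == n - 1, id < max(id)) %>% mutate(group = group + 1)

res <- bind_rows(df2, overlap) %>% arrange(group, id) %>% group_split(group)
```

Group 1 then has 10 rows and every later group 11, with the boundary date appearing in both neighbours, as in the lapply result.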

A running sum for daily data that resets when month turns

I have a 2-column table (tibble) made up of a date object and a numeric variable. There is at most one entry per day, but not every day has an entry (i.e. date is a natural primary key). I am attempting to do a running sum of the numeric column along ascending dates, but with the running sum resetting when the month turns (the data is sorted by ascending date). I have replicated what I want to get as a result below.
Date score monthly.running.sum
10/2/2019 7 7
10/9/2019 6 13
10/16/2019 12 25
10/23/2019 2 27
10/30/2019 13 40
11/6/2019 2 2
11/13/2019 4 6
11/20/2019 15 21
11/27/2019 16 37
12/4/2019 4 4
12/11/2019 24 28
12/18/2019 28 56
12/25/2019 8 64
1/1/2020 1 1
1/8/2020 15 16
1/15/2020 9 25
1/22/2020 8 33
It looks like the package runner is possibly suited to this, but I don't really understand how to instruct it. I know I could use a join operation plus a group_by in dplyr, but the data set is very large and doing so would be wildly inefficient. I could also manually iterate through the list with a loop, but that seems inelegant. The last option I can think of is selecting out a unique vector of yearmon objects, cutting the original list into many shorter lists, and running a plain cumsum on each, but that also feels suboptimal. I am sure this is not the first time someone has had to do this, and given how many tools there are in the tidyverse, I think I just need help finding the right one. The reason I am looking for a tool instead of using one of the methods I described above (which would take less time than writing this post) is that this code needs to be very readable by an audience that is less comfortable with code.
We can also use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%m/%d/%Y")
          ][, monthly.running.sum := cumsum(score), by = format(Date, "%Y-%m")][]
# Date score monthly.running.sum
# 1: 2019-10-02 7 7
# 2: 2019-10-09 6 13
# 3: 2019-10-16 12 25
# 4: 2019-10-23 2 27
# 5: 2019-10-30 13 40
# 6: 2019-11-06 2 2
# 7: 2019-11-13 4 6
# 8: 2019-11-20 15 21
# 9: 2019-11-27 16 37
#10: 2019-12-04 4 4
#11: 2019-12-11 24 28
#12: 2019-12-18 28 56
#13: 2019-12-25 8 64
#14: 2020-01-01 1 1
#15: 2020-01-08 15 16
#16: 2020-01-15 9 25
#17: 2020-01-22 8 33
data
df <- structure(list(Date = c("10/2/2019", "10/9/2019", "10/16/2019",
"10/23/2019", "10/30/2019", "11/6/2019", "11/13/2019", "11/20/2019",
"11/27/2019", "12/4/2019", "12/11/2019", "12/18/2019", "12/25/2019",
"1/1/2020", "1/8/2020", "1/15/2020", "1/22/2020"), score = c(7L,
6L, 12L, 2L, 13L, 2L, 4L, 15L, 16L, 4L, 24L, 28L, 8L, 1L, 15L,
9L, 8L)), row.names = c(NA, -17L), class = "data.frame")
Using lubridate, you can extract month and year values from the date, group_by those values, and then perform the cumulative sum as follows:
library(lubridate)
library(dplyr)
df %>% mutate(Month = month(mdy(Date)),
              Year = year(mdy(Date))) %>%
  group_by(Month, Year) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 6
# Groups: Month, Year [4]
Date score monthly.running.sum Month Year SUM
<chr> <int> <int> <int> <int> <int>
1 10/2/2019 7 7 10 2019 7
2 10/9/2019 6 13 10 2019 13
3 10/16/2019 12 25 10 2019 25
4 10/23/2019 2 27 10 2019 27
5 10/30/2019 13 40 10 2019 40
6 11/6/2019 2 2 11 2019 2
7 11/13/2019 4 6 11 2019 6
8 11/20/2019 15 21 11 2019 21
9 11/27/2019 16 37 11 2019 37
10 12/4/2019 4 4 12 2019 4
11 12/11/2019 24 28 12 2019 28
12 12/18/2019 28 56 12 2019 56
13 12/25/2019 8 64 12 2019 64
14 1/1/2020 1 1 1 2020 1
15 1/8/2020 15 16 1 2020 16
16 1/15/2020 9 25 1 2020 25
17 1/22/2020 8 33 1 2020 33
An alternative is to use the floor_date function to convert each date to the first day of its month and then calculate the cumulative sum:
library(lubridate)
library(dplyr)
df %>% mutate(Floor = floor_date(mdy(Date), unit = "month")) %>%
  group_by(Floor) %>%
  mutate(SUM = cumsum(score))
# A tibble: 17 x 5
# Groups: Floor [4]
Date score monthly.running.sum Floor SUM
<chr> <int> <int> <date> <int>
1 10/2/2019 7 7 2019-10-01 7
2 10/9/2019 6 13 2019-10-01 13
3 10/16/2019 12 25 2019-10-01 25
4 10/23/2019 2 27 2019-10-01 27
5 10/30/2019 13 40 2019-10-01 40
6 11/6/2019 2 2 2019-11-01 2
7 11/13/2019 4 6 2019-11-01 6
8 11/20/2019 15 21 2019-11-01 21
9 11/27/2019 16 37 2019-11-01 37
10 12/4/2019 4 4 2019-12-01 4
11 12/11/2019 24 28 2019-12-01 28
12 12/18/2019 28 56 2019-12-01 56
13 12/25/2019 8 64 2019-12-01 64
14 1/1/2020 1 1 2020-01-01 1
15 1/8/2020 15 16 2020-01-01 16
16 1/15/2020 9 25 2020-01-01 25
17 1/22/2020 8 33 2020-01-01 33
A base R alternative :
df$Date <- as.Date(df$Date, "%m/%d/%Y")
df$monthly.running.sum <- with(df, ave(score, format(Date, "%Y-%m"),FUN = cumsum))
df
# Date score monthly.running.sum
#1 2019-10-02 7 7
#2 2019-10-09 6 13
#3 2019-10-16 12 25
#4 2019-10-23 2 27
#5 2019-10-30 13 40
#6 2019-11-06 2 2
#7 2019-11-13 4 6
#8 2019-11-20 15 21
#9 2019-11-27 16 37
#10 2019-12-04 4 4
#11 2019-12-11 24 28
#12 2019-12-18 28 56
#13 2019-12-25 8 64
#14 2020-01-01 1 1
#15 2020-01-08 15 16
#16 2020-01-15 9 25
#17 2020-01-22 8 33
The yearmon class represents year/month objects so just convert the dates to yearmon and accumulate by them using this one-liner:
library(zoo)
transform(DF, run.sum = ave(score, as.yearmon(Date, "%m/%d/%Y"), FUN = cumsum))
giving:
Date score run.sum
1 10/2/2019 7 7
2 10/9/2019 6 13
3 10/16/2019 12 25
4 10/23/2019 2 27
5 10/30/2019 13 40
6 11/6/2019 2 2
7 11/13/2019 4 6
8 11/20/2019 15 21
9 11/27/2019 16 37
10 12/4/2019 4 4
11 12/11/2019 24 28
12 12/18/2019 28 56
13 12/25/2019 8 64
14 1/1/2020 1 1
15 1/8/2020 15 16
16 1/15/2020 9 25
17 1/22/2020 8 33

Growth Rate for daily data

I have data for sales of some product and I would like to calculate its growth rate, where N_win and N_lose are the wins and losses over the period 1-19 March. I would also like to forecast the growth rate and the wins and losses.
Date N_win N_lose tot1 tot2
1 2018-03-01 0 0 0 0
2 2018-03-02 1 0 1 1
3 2018-03-03 0 0 1 1
4 2018-03-04 1 0 2 2
5 2018-03-05 3 0 5 5
6 2018-03-06 0 0 5 5
7 2018-03-07 2 0 7 7
8 2018-03-08 4 0 11 11
9 2018-03-09 4 0 15 15
10 2018-03-10 5 0 20 20
11 2018-03-11 1 1 21 20
12 2018-03-12 24 1 45 44
13 2018-03-13 41 1 86 85
14 2018-03-14 17 2 103 101
15 2018-03-15 15 3 118 115
16 2018-03-16 15 6 133 127
17 2018-03-17 38 6 171 165
18 2018-03-18 67 6 238 232
I tried to apply this function, but it doesn't seem to work:
Growthrate = function(x1, x2, n){
  gr = (x2/x1)^(1/n) - 1
  return(gr)
}
GR = NULL
for(i in 1:length(DF[,1])){
  GR[i] = Growthrate(DF[i,2], DF[i+1,2], sum(i))
}
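Two things trip that loop up: DF[i + 1, 2] runs one row past the end on the last iteration, and (x2/x1) is Inf or NaN whenever x1 is 0, which happens often in the daily N_win column (also, sum(i) is just i). A sketch of one way around both, computing a day-over-day growth rate on the cumulative total tot1 instead; whether that is the growth rate you want is an assumption:

```r
# Day-over-day growth rate of the cumulative wins (tot1 column from the table above)
tot1 <- c(0, 1, 1, 2, 5, 5, 7, 11, 15, 20, 21, 45, 86, 103, 118, 133, 171, 238)

# (x[i] / x[i-1]) - 1, with NA for the first day; still Inf where the previous total is 0
gr <- c(NA, tot1[-1] / tot1[-length(tot1)] - 1)
```

For forecasting the wins and losses you would then fit a model on top (e.g. a log-linear lm or a time-series model), which is a separate question.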

R sum two columns with condition on third column

I have a data frame like:
user_name started_at session_time_min task_completed timediff
ABC 2018-03-02 18:00:00 1 3 NA
ABC 2018-03-02 19:00:00 1036 18 1
ABC 2018-03-03 12:00:00 6 10 17
ABC 2018-03-04 21:00:00 0 1 33
ABC 2018-03-05 16:00:00 143 61 19
ABC 2018-03-05 18:00:00 12 18 2
ABC 2018-03-05 19:00:00 60 94 1
ABC 2018-03-05 20:00:00 20 46 1
ABC 2018-03-09 15:00:00 0 1 91
I want to sum session_time_min and task_completed with the previous row if timediff = 1.
I want output like:
user_name started_at session_time_min task_completed
ABC 2018-03-02 18:00:00 1037 21
ABC 2018-03-03 12:00:00 6 10
ABC 2018-03-04 21:00:00 0 1
ABC 2018-03-05 16:00:00 143 61
ABC 2018-03-05 18:00:00 92 158
ABC 2018-03-09 15:00:00 0 1
Any help will be highly appreciated.
You could use a for loop, especially if you want to stay in base R. Walking from the bottom row up, whenever timediff is 1 the row is added into the row above it and then flagged for removal:
for (i in nrow(data):2) {
  if (!is.na(data[i, 5]) && data[i, 5] == 1) {
    data[i - 1, 3] <- data[i - 1, 3] + data[i, 3]
    data[i - 1, 4] <- data[i - 1, 4] + data[i, 4]
    data[i, 3] <- NA  # flag the merged row
  }
}
data <- data[!is.na(data[, 3]), ]
This code runs through each row of your data frame from last to first and checks whether the value in column 5 (timediff) is 1. If it is, the two columns you want (positions 3 and 4) are added into the previous row (i - 1), so runs of consecutive timediff = 1 rows collapse into the first row of the run, which keeps that row's started_at.
Make a group counter using cumsum and then use that to subset the identifier columns and rowsum the value columns:
grp <- cumsum(!dat$timediff %in% 1)
#[1] 1 1 2 3 4 5 5 5 6
cbind(
  dat[match(unique(grp), grp), c("user_name","started_at")],
  rowsum(dat[c("session_time_min","task_completed")], grp)
)
# user_name started_at session_time_min task_completed
#1 ABC 2018-03-02 18:00:00 1037 21
#3 ABC 2018-03-03 12:00:00 6 10
#4 ABC 2018-03-04 21:00:00 0 1
#5 ABC 2018-03-05 16:00:00 143 61
#6 ABC 2018-03-05 18:00:00 92 158
#9 ABC 2018-03-09 15:00:00 0 1
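The same cumsum grouping translates directly to dplyr if you prefer that style. A sketch, rebuilding dat from the question; summarise keeps the first started_at of each merged run:

```r
library(dplyr)

dat <- data.frame(
  user_name = "ABC",
  started_at = c("2018-03-02 18:00:00", "2018-03-02 19:00:00", "2018-03-03 12:00:00",
                 "2018-03-04 21:00:00", "2018-03-05 16:00:00", "2018-03-05 18:00:00",
                 "2018-03-05 19:00:00", "2018-03-05 20:00:00", "2018-03-09 15:00:00"),
  session_time_min = c(1, 1036, 6, 0, 143, 12, 60, 20, 0),
  task_completed = c(3, 18, 10, 1, 61, 18, 94, 46, 1),
  timediff = c(NA, 1, 17, 33, 19, 2, 1, 1, 91)
)

res <- dat %>%
  group_by(grp = cumsum(!timediff %in% 1)) %>%  # new group whenever timediff != 1
  summarise(user_name = first(user_name),
            started_at = first(started_at),
            session_time_min = sum(session_time_min),
            task_completed = sum(task_completed),
            .groups = "drop") %>%
  select(-grp)
```

Note cumsum(!timediff %in% 1) is the same group counter as grp in the base R answer; %in% treats the NA as "not 1", so the first row starts group 1.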

Cumsum reset at certain values [duplicate]

This question already has answers here:
Cumsum with reset when 0 is encountered and by groups
(2 answers)
Cumulative sum that resets when 0 is encountered
(4 answers)
Closed 5 years ago.
I have the following dataframe
x y count
1 1 2018-02-24 4.031540
2 2 2018-02-25 5.244303
3 3 2018-02-26 5.441465
4 NA 2018-02-27 4.164104
5 5 2018-02-28 5.172919
6 6 2018-03-01 5.591410
7 7 2018-03-02 4.691716
8 8 2018-03-03 5.465360
9 9 2018-03-04 3.269378
10 NA 2018-03-05 5.300679
11 11 2018-03-06 5.489664
12 12 2018-03-07 4.423334
13 13 2018-03-08 3.808764
14 14 2018-03-09 6.450136
15 15 2018-03-10 5.541785
16 16 2018-03-11 4.762889
17 17 2018-03-12 5.511649
18 18 2018-03-13 6.795386
19 19 2018-03-14 6.615762
20 20 2018-03-15 4.749151
I want to take the cumsum of the count column, but I want the cumsum to restart when the x value is NA. I've tried the following:
df$cum_sum <- ifelse(is.na(df$x) == FALSE, cumsum(df$count), 0)
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 0.000000
5 5 2018-02-28 5.172919 24.054331
6 6 2018-03-01 5.591410 29.645741
7 7 2018-03-02 4.691716 34.337458
8 8 2018-03-03 5.465360 39.802817
9 9 2018-03-04 3.269378 43.072195
10 NA 2018-03-05 5.300679 0.000000
11 11 2018-03-06 5.489664 53.862538
12 12 2018-03-07 4.423334 58.285871
13 13 2018-03-08 3.808764 62.094635
14 14 2018-03-09 6.450136 68.544771
15 15 2018-03-10 5.541785 74.086556
16 16 2018-03-11 4.762889 78.849445
17 17 2018-03-12 5.511649 84.361094
18 18 2018-03-13 6.795386 91.156480
19 19 2018-03-14 6.615762 97.772242
20 20 2018-03-15 4.749151 102.521394
The result is that the cum_sum column is 0 at the NA values, but the cumsum doesn't reset. How can I fix this?
A possible solution:
dat$cum_sum <- ave(dat$count, cumsum(is.na(dat$x)), FUN = cumsum)
which gives:
> dat
x y count cum_sum
1 1 2018-02-24 4.031540 4.031540
2 2 2018-02-25 5.244303 9.275843
3 3 2018-02-26 5.441465 14.717308
4 NA 2018-02-27 4.164104 4.164104
5 5 2018-02-28 5.172919 9.337023
6 6 2018-03-01 5.591410 14.928433
7 7 2018-03-02 4.691716 19.620149
8 8 2018-03-03 5.465360 25.085509
9 9 2018-03-04 3.269378 28.354887
10 NA 2018-03-05 5.300679 5.300679
11 11 2018-03-06 5.489664 10.790343
12 12 2018-03-07 4.423334 15.213677
13 13 2018-03-08 3.808764 19.022441
14 14 2018-03-09 6.450136 25.472577
15 15 2018-03-10 5.541785 31.014362
16 16 2018-03-11 4.762889 35.777251
17 17 2018-03-12 5.511649 41.288900
18 18 2018-03-13 6.795386 48.084286
19 19 2018-03-14 6.615762 54.700048
20 20 2018-03-15 4.749151 59.449199
Or with dplyr:
library(dplyr)
dat %>%
  group_by(grp = cumsum(is.na(x))) %>%
  mutate(cum_sum = cumsum(count)) %>%
  ungroup() %>%
  select(-grp)
Here is the data.table version:
plouf <- setDT(df)
plouf[, group := cumsum(is.na(x))]
plouf[!is.na(x), cum_sum := cumsum(count), by = group]
x y count group cum_sum
1: 1 2018-02-24 4.031540 0 4.031540
2: 2 2018-02-25 5.244303 0 9.275843
3: 3 2018-02-26 5.441465 0 14.717308
4: NA 2018-02-27 4.164104 1 NA
5: 5 2018-02-28 5.172919 1 5.172919
6: 6 2018-03-01 5.591410 1 10.764329
7: 7 2018-03-02 4.691716 1 15.456045
8: 8 2018-03-03 5.465360 1 20.921405
9: 9 2018-03-04 3.269378 1 24.190783
10: NA 2018-03-05 5.300679 2 NA
11: 11 2018-03-06 5.489664 2 5.489664
12: 12 2018-03-07 4.423334 2 9.912998
13: 13 2018-03-08 3.808764 2 13.721762
14: 14 2018-03-09 6.450136 2 20.171898
15: 15 2018-03-10 5.541785 2 25.713683
16: 16 2018-03-11 4.762889 2 30.476572
17: 17 2018-03-12 5.511649 2 35.988221
18: 18 2018-03-13 6.795386 2 42.783607
19: 19 2018-03-14 6.615762 2 49.399369
20: 20 2018-03-15 4.749151 2 54.148520
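Note the data.table version leaves NA on the NA rows themselves. If you want those rows to seed the new running sum with their own count, as in the ave answer, a small tweak (a sketch on a reduced example with the same shape as the question's data):

```r
library(data.table)

# x has an NA where the running sum should reset
plouf <- data.table(x = c(1, 2, NA, 4, 5), count = c(1, 2, 3, 4, 5))

# group on the NA markers but keep every row, including the NA one, in the cumsum
plouf[, cum_sum := cumsum(count), by = cumsum(is.na(x))]
```

Here cum_sum is 1, 3 for the first group, then restarts at 3 (the NA row's own count), 7, 12.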
