There must be an easy, possibly recursive, solution to the following problem. I would very much appreciate it if anyone could help.
I use data.table and RcppRoll to calculate, for each product, the weekly sales in qualified weeks within the past 26 weeks. With a window of 26 this works fine as long as the current week number is > 26. However, when the current week number is <= 26, I want the window size to shrink accordingly (26, 25, ..., and so on).
The formula would be: baseline sales = sum of sales over the past 26 (or fewer) weeks (before the current week, qualified weeks only), divided by the number of qualified weeks.
Here is some code to create test data:
library("data.table")
library("RcppRoll")
products <- seq(1:10) #grouping variable
weeks <- seq(1:100) #weeks
sales <- round(rchisq(1000, 2),0) #sales
countweek <- round(runif(1000, 0,1),0) #1, if qualified weeks
data <- as.data.table(cbind(merge(weeks,products,all=T),sales,countweek))
names(data) <- c("week","product","sales","countweek")
data <- data[order(product,week)]
data[,pastsales:=shift(RcppRoll::roll_sumr(sales*countweek,26L,fill=0),1L,0,"lag"),by=.(product)]
data[,rollweekcount:=shift(RcppRoll::roll_sumr(countweek,26L,fill=0),1L,0,"lag"),by=.(product)]
data[,baseline:=pastsales/rollweekcount]
You can see the break at week 26 for product 1. After week 26, I get the desired results:
> data[product == 1]
week product sales countweek pastsales rollweekcount baseline
...
20: 20 1 1 0 0 0 NaN
21: 21 1 2 0 0 0 NaN
22: 22 1 1 1 0 0 NaN
23: 23 1 0 0 0 0 NaN
24: 24 1 3 1 0 0 NaN
25: 25 1 5 1 0 0 NaN
26: 26 1 5 1 0 0 NaN
27: 27 1 1 1 44 13 3.384615
28: 28 1 0 1 45 14 3.214286
29: 29 1 5 0 44 14 3.142857
30: 30 1 0 1 44 14 3.142857
31: 31 1 3 1 44 14 3.142857
32: 32 1 4 0 42 14 3.000000
...
You need an "adaptive" window width. I'm not sure about RcppRoll, but more recent versions of data.table have frollsum(), which can do this:
data[, pastsales := shift(frollsum(sales * countweek, pmin(1:.N, 26L), adaptive = TRUE),
                          1L, 0, "lag"),
     by = .(product)]
data[, rollweekcount := shift(frollsum(countweek, pmin(1:.N, 26L), adaptive = TRUE),
                              1L, 0, "lag"),
     by = .(product)]
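As a minimal illustration of what adaptive = TRUE does (a toy vector, not the OP's data): each element of the width vector is the length of the window ending at that position.

```r
library(data.table)

x <- 1:5
# widths 1, 2, 3, 3, 3: the window grows with the position, capped at 3
frollsum(x, pmin(seq_along(x), 3L), adaptive = TRUE)
# 1 3 6 9 12
```

With pmin(1:.N, 26L) the same idea caps the per-product window at 26 rows, so the early weeks simply use all the history available so far.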
I have the following dataset:
df1 <- data.frame(number = c(1,1,0,0,0,0,0,1,1))
In this dataset I want to create a second column that flags, for each row of the first column, the case where the first and second lags are equal to 0 and the first lead equals 1. If that is the case, a 1 is put in the second column at the row where the change from 0 to 1 occurred; otherwise the value is 44. As a result, all rows of the second column should equal 44 except the 8th.
Here is my code:
df1$t <- ifelse(df1[, 1] == 1 &
                lag(df1[, 1] == 0, 1, default = 44) &
                lag(df1[, 1] == 0, 2, default = 44) &
                lead(df1[, 1] == 1, 1, default = 44),
                1, 44)
Although the OP has asked for an explanation of why his code does not return the expected result (which is addressed by Gregor's comment), I would like to propose an alternative approach.
If I understand correctly, the OP wants to find all sub-sequences in df1$number which consist of two zeros followed by two ones, i.e., c(0, 0, 1, 1). Then, the row which contains the first one in the sub-sequence should be marked by a 1 while all other rows should get 44 as default value.
As of version v1.12.0 of data.table (on CRAN 13 Jan 2019), the shift() function accepts negative lag parameters, i.e., leads. This way, a column can be shifted by multiple lags/leads in one batch. The row numbers which fulfil the above condition are then identified by a subsequent join operation. Finally, df1 is updated selectively at these row numbers:
# use enhanced sample dataset, rows 10 to 21 added
df1 <- data.frame(number = c(1,1,0,0,0,0,0,1,1,0,1,0,1,1,0,0,1,0,0,1,1))
library(data.table)
setDT(df1)[, t := 44] # coerce to data.table, pre-populate result column
# shift and join
idx <- df1[, shift(number, 2:-1)][.(0, 0, 1, 1), on = paste0("V", 1:4), which = TRUE]
df1[idx, t := 1] # selective update
df1
number t
1: 1 44
2: 1 44
3: 0 44
4: 0 44
5: 0 44
6: 0 44
7: 0 44
8: 1 1
9: 1 44
10: 0 44
11: 1 44
12: 0 44
13: 1 44
14: 1 44
15: 0 44
16: 0 44
17: 1 44
18: 0 44
19: 0 44
20: 1 1
21: 1 44
number t
This works essentially like the OP's approach, by shifting and comparing with expected values. However, the OP's approach requires coding four comparisons and three shift operations, whereas here the shifting is done in one step and all columns are compared simultaneously by the join operation in the second step.
Additional explanations
The shift operation
df1[, shift(number, 2:-1)]
returns
V1 V2 V3 V4
1: NA NA 1 1
2: NA 1 1 0
3: 1 1 0 0
4: 1 0 0 0
5: 0 0 0 0
6: 0 0 0 0
7: 0 0 0 1
8: 0 0 1 1
9: 0 1 1 0
10: 1 1 0 1
11: 1 0 1 0
12: 0 1 0 1
13: 1 0 1 1
14: 0 1 1 0
15: 1 1 0 0
16: 1 0 0 1
17: 0 0 1 0
18: 0 1 0 0
19: 1 0 0 1
20: 0 0 1 1
21: 0 1 1 NA
V1 V2 V3 V4
In the subsequent join operation,
df1[, shift(number, 2:-1)][.(0, 0, 1, 1), on = paste0("V", 1:4), which = TRUE]
which = TRUE requests only the indices of the matching rows, which are
[1] 8 20
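For comparison, the same row indices can be found in base R with embed(), which lays the sliding windows out as matrix rows; this is my sketch on the enhanced sample data, not part of the original answer (no data.table required):

```r
number  <- c(1,1,0,0,0,0,0,1,1,0,1,0,1,1,0,0,1,0,0,1,1)
pattern <- c(0, 0, 1, 1)

# embed() returns each window in reverse order, so flip the columns
win    <- embed(number, length(pattern))[, length(pattern):1]
starts <- which(apply(win, 1, function(w) all(w == pattern)))
starts + 2L   # the first 1 sits two positions after the window start
# 8 20
```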
I'm really new to R but I haven't been able to find a simple solution to this. As an example, I have the following dataframe:
case <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
areas <- c(1,2,1,1,1,2,2,2,2,1,1,2,2,2,1,1,1,2,2,2)
A <- c(1,2,11,12,20,21,26,43,43,47,48,59,63,64,65,66,67,83,90,91)
var <- c(1,1,0,0,0,1,1,0,0,1,0,1,0,1,1,0,0,0,0,0)
outcome <- c(1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0)
df <- data.frame(case,areas,A,var,outcome)
case areas A var outcome
1 1 1 1 1 1
2 2 2 2 1 0
3 3 1 11 0 0
4 4 1 12 0 0
5 5 1 20 0 0
6 6 2 21 1 0
7 7 2 26 1 0
8 8 2 43 0 0
9 9 2 43 0 0
10 10 1 47 1 1
11 11 1 48 0 0
12 12 2 59 1 1
13 13 2 63 0 0
14 14 2 64 1 0
15 15 1 65 1 0
16 16 1 66 0 0
17 17 1 67 0 0
18 18 2 83 0 1
19 19 2 90 0 0
20 20 2 91 0 0
In the 'A' column we have a wide range of integers, and I'd like to create an extra column that groups each case by its membership in the following categories:
<5; 5 - 19; 20 - 49; 50 - 79; 80+
So the first three rows of the new column should be the string values "<5", "<5", "5 - 19", and so on, and the last value will be "80+".
I could write out something like this, but it seems very sloppy:
A_groups = ifelse(df$A<5, "<5", df$A)
A_groups = ifelse(df$A>4 & df$A<20, "5-19", A_groups)
A_groups = ifelse(df$A>19 & df$A<50, "20-49", A_groups)
What is the best alternative to this?
You're looking for the cut() function. You want to create a factor based on intervals, which is exactly what this function provides.
df$new_factor <- cut(df$A, breaks = c(-Inf, 5, 20, 50, 80, Inf),
labels = c('<5', '5-19', '20-49', '50-79', '80+'),
right = FALSE)
See the help page (?cut) for why I included right = FALSE. To double-check that it does what you want, it's always good to construct cases you're not sure about. For example: check case == 5 with and without right = FALSE and see what happens to new_factor.
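To make that concrete (my illustration, not part of the original answer): the boundary value 20 (case 5 in the data) lands in a different bin depending on right.

```r
brk <- c(-Inf, 5, 20, 50, 80, Inf)
lab <- c('<5', '5-19', '20-49', '50-79', '80+')

cut(20, breaks = brk, labels = lab, right = FALSE)  # "20-49", since 20 is in [20, 50)
cut(20, breaks = brk, labels = lab)                 # "5-19",  since 20 is in (5, 20]
```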
You can use cut() or findInterval().
breaks = c(0,5,20,50,80,Inf)
labels = c("<5", "5-19", "20-49", "50-79", "80+")
# Using cut()
df$A_groups = cut(df$A, breaks = breaks, right = FALSE, labels = labels)
# Using findInterval()
df$B_groups = factor(findInterval(df$A, breaks), labels = labels)
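A quick sanity check (my addition) that the two approaches agree on this data. Note that the factor(findInterval(...)) variant only labels correctly because every interval actually occurs in A; cut() does not have that caveat.

```r
A      <- c(1,2,11,12,20,21,26,43,43,47,48,59,63,64,65,66,67,83,90,91)
breaks <- c(0, 5, 20, 50, 80, Inf)
labels <- c("<5", "5-19", "20-49", "50-79", "80+")

cut_grp  <- cut(A, breaks = breaks, right = FALSE, labels = labels)
find_grp <- factor(findInterval(A, breaks), labels = labels)

# both bin every value identically
identical(as.character(cut_grp), as.character(find_grp))
# TRUE
```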
I have two datasets like these ones:
df <- data.frame(id = 1:20,
Sex = rep(x = c(0,1), each=10),
age = c(25,56,29,42,33,33,33,25,25,25,26,57,30,43,34,34,34,26,26,26),
ov = letters[1:20])
df1 <- data.frame(Sex = c(0,0,0,1,1),
age = c(25,33,39,41,43))
I want to take one random row for every group of Sex and age in df, according to the groups of df1. But not all age values in df1 have a match in df, so for every group of df1 without a match in df I want to impute the value of the variable ov that belongs to the same Sex and the closest age. Something like this:
df3 <- rbind(df[c(8,7),2:4],c(0,39,"d"),c(1,41,"n"),df[14,2:4])
Note that the donor for the case where Sex = 0 and age = 39 is df[4,], and the donor for the case where Sex = 1 and age = 41 is df[14,].
How can I do this?
Using data.table you can try something like this:
1) Convert data to data.table and add keys:
df1
dt1 <- as.data.table(df1) # convert to data.table
dt1[, newSex := Sex] # this will serve as grouping column
dt1[, newage := age] # also this
setkey(dt1, Sex, age) # set data.tables keys
dt1
Sex age newSex newage
1: 0 25 0 25
2: 0 33 0 33
3: 0 39 0 39
4: 1 41 1 41
5: 1 43 1 43
# we do similar with df:
dt <- as.data.table(df)
setkey(dt, Sex, age)
dt
id Sex age ov
1: 1 0 25 a
2: 8 0 25 h
3: 9 0 25 i
4: 10 0 25 j
5: 3 0 29 c
6: 5 0 33 e
7: 6 0 33 f
8: 7 0 33 g
9: 4 0 42 d
10: 2 0 56 b
11: 11 1 26 k
12: 18 1 26 r
13: 19 1 26 s
14: 20 1 26 t
15: 13 1 30 m
16: 15 1 34 o
17: 16 1 34 p
18: 17 1 34 q
19: 14 1 43 n
20: 12 1 57 l
2) Using rolling merge we get dtnew with new groups:
dtnew <- dt1[dt, roll = "nearest"]
dtnew
Sex age newSex newage id ov
1: 0 25 0 25 1 a
2: 0 25 0 25 8 h
3: 0 25 0 25 9 i
4: 0 25 0 25 10 j
5: 0 29 0 25 3 c
6: 0 33 0 33 5 e
7: 0 33 0 33 6 f
8: 0 33 0 33 7 g
9: 0 42 0 39 4 d
10: 0 56 0 39 2 b
11: 1 26 1 41 11 k
12: 1 26 1 41 18 r
13: 1 26 1 41 19 s
14: 1 26 1 41 20 t
15: 1 30 1 41 13 m
16: 1 34 1 41 15 o
17: 1 34 1 41 16 p
18: 1 34 1 41 17 q
19: 1 43 1 43 14 n
20: 1 57 1 43 12 l
3) Now we can sample. In your case we can simply put the rows in random order and then take the first row of each group:
dtnew <- dtnew[sample(.N)] #create random order
sampleDT <- unique(dtnew, by = c("newSex", "newage")) #take first unique by newSex and newage
sampleDT
Sex age newSex newage id ov
1: 0 56 0 39 2 b
2: 0 29 0 25 3 c
3: 1 43 1 43 14 n
4: 1 34 1 41 16 p
5: 0 33 0 33 7 g
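If reproducibility matters, here is a hedged variant of step 3 (my sketch, condensing steps 1 and 2): fix the RNG seed and sample one row per group directly instead of shuffling the whole table.

```r
library(data.table)

df <- data.frame(id = 1:20,
                 Sex = rep(c(0, 1), each = 10),
                 age = c(25,56,29,42,33,33,33,25,25,25,26,57,30,43,34,34,34,26,26,26),
                 ov = letters[1:20])
df1 <- data.frame(Sex = c(0, 0, 0, 1, 1), age = c(25, 33, 39, 41, 43))

dt <- as.data.table(df)
setkey(dt, Sex, age)
dt1 <- as.data.table(df1)
dt1[, `:=`(newSex = Sex, newage = age)]   # grouping columns, as in step 1
setkey(dt1, Sex, age)

dtnew <- dt1[dt, roll = "nearest"]        # rolling merge, as in step 2

set.seed(1)                               # fixed seed for reproducibility
sampleDT <- dtnew[, .SD[sample(.N, 1L)], by = .(newSex, newage)]
```

This returns exactly one randomly chosen row per (newSex, newage) group, and the same rows on every run with the same seed.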
This question already has answers here: Consecutive group number in R (3 answers). Closed 5 years ago.
I am currently dealing with car data. We recorded the speed of the car every 5 minutes, and the data contain a lot of zero values. My question is: how do I segment the data at the zero values and give each non-zero section an ordered number in R?
Let's take a sample data as example:
sample <- data.frame(
id = 1:15,
speed = c(50,0, 0, 30, 50, 40,0, 0, 25, 30, 50, 0, 30, 50, 40))
I want to add a new column that gives each non-zero section a number (starting from 1), while any run of k or more consecutive zero speeds is numbered 0.
Specifically for this sample data, let's say k equals 2, then my desired result should be like this dataframe:
sample_new <- data.frame(
id = 1:15,
speed = c(50,0, 0, 0, 50, 40,0, 0, 25, 30, 50, 0, 30, 50, 40),
number = c(1, 0, 0, 0, 2, 2, 0 ,0, 3, 3, 3, 3, 3, 3, 3))
which prints as
id speed number
1 1 50 1
2 2 0 0
3 3 0 0
4 4 0 0
5 5 50 2
6 6 40 2
7 7 0 0
8 8 0 0
9 9 25 3
10 10 30 3
11 11 50 3
12 12 0 3** <- here is the difference
13 13 30 3
14 14 50 3
15 15 40 3
There are more than 1 million rows in my data, so I hope the solution is acceptably fast.
The reason for setting a threshold k is that some drivers just leave their GPS on even when they lock the car and go to sleep. On other occasions, where the interval is shorter than k, they just stopped at a traffic light. I want to focus on the long stops and ignore the short ones.
Hope my question makes sense to you. Thank you.
As processing speed is a concern for the production data set of more than 1 M rows, I suggest using data.table.
It's quite easy to identify the groups of subsequent non-zero entries:
library(data.table)
setDT(sample)[, number := rleid(speed > 0 ) * (speed > 0)][]
id speed number
1: 1 50 1
2: 2 0 0
3: 3 0 0
4: 4 30 3
5: 5 50 3
6: 6 40 3
7: 7 0 0
8: 8 0 0
9: 9 25 5
10: 10 30 5
11: 11 50 5
12: 12 0 0
13: 13 30 7
14: 14 50 7
15: 15 40 7
The group numbers are distinct but not numbered consecutively. If that is a requirement, it gets a bit trickier:
setDT(sample)[, number := as.integer(factor(rleid(speed > 0 ) * (speed > 0), exclude = 0))][]
id speed number
1: 1 50 1
2: 2 0 NA
3: 3 0 NA
4: 4 30 2
5: 5 50 2
6: 6 40 2
7: 7 0 NA
8: 8 0 NA
9: 9 25 3
10: 10 30 3
11: 11 50 3
12: 12 0 NA
13: 13 30 4
14: 14 50 4
15: 15 40 4
If really required, the NAs can be replaced by 0 with
setDT(sample)[, number := as.integer(factor(rleid(speed > 0 ) * (speed > 0), exclude = 0))][
is.na(number), number := 0][]
There is an alternative approach:
setDT(sample)[, number := {
tmp <- speed > 0
cumsum(tmp - shift(tmp, fill = 0, type = "lag") > 0) * tmp
}][]
id speed number
1: 1 50 1
2: 2 0 0
3: 3 0 0
4: 4 30 2
5: 5 50 2
6: 6 40 2
7: 7 0 0
8: 8 0 0
9: 9 25 3
10: 10 30 3
11: 11 50 3
12: 12 0 0
13: 13 30 4
14: 14 50 4
15: 15 40 4
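Note that none of the variants above uses the threshold k from the question (zero runs shorter than k should not break a section). Here is a hedged sketch of one way to add it, with k = 2 as in the question:

```r
library(data.table)

sample <- data.frame(id = 1:15,
                     speed = c(50,0,0,30,50,40,0,0,25,30,50,0,30,50,40))
k <- 2L
setDT(sample)
sample[, stop := speed == 0]
# a zero run only counts as a real stop if it is at least k rows long
sample[, longstop := stop & .N >= k, by = rleid(stop)]
sample[, number := rleid(longstop) * (!longstop)]
# renumber the non-stop sections consecutively; real stops stay 0
sample[, number := as.integer(factor(number, exclude = 0))]
sample[is.na(number), number := 0L]
sample$number
# 1 0 0 2 2 2 0 0 3 3 3 3 3 3 3
```

The single zero at row 12 is shorter than k, so it no longer breaks section 3, which is what the question asked for.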
I'm cleaning up some eye-tracking data, which is, as expected, messy. I'm stuck on a preliminary step that I'll do my best to describe thoroughly. The solution is likely quite simple.
I've got two variables, one binary (x1) and the other continuous (x2), such as that created by:
dat <- data.frame(x1 = c(0,1,1,0,1,1,1,0,1,1),
x2 = c(22,23,44,25,36,37,28,19,30,41))
I need to create a new variable (x3) that is the cumulative sum of x2, but only over consecutive cases in which x1 is equal to 1. The end product would look like this:
dat <- data.frame(x1 = c(0,1,1,0,1,1,1,0,1,1),
x2 = c(22,23,44,25,36,37,28,19,30,41),
x3 = c(0, 23, 67, 0, 36, 73, 101, 0, 30, 71))
In other words, it's a cumsum() of x2 that "resets" after each 0 in x1.
dat$x3 <- with(dat, ave(replace(x2, x1 == 0, 0), cumsum(x1 == 0), FUN=cumsum))
dat
# x1 x2 x3
#1 0 22 0
#2 1 23 23
#3 1 44 67
#4 0 25 0
#5 1 36 36
#6 1 37 73
#7 1 28 101
#8 0 19 0
#9 1 30 30
#10 1 41 71
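To unpack how the ave() one-liner works, here is the same computation split into its three steps (my annotation of the answer above):

```r
dat <- data.frame(x1 = c(0,1,1,0,1,1,1,0,1,1),
                  x2 = c(22,23,44,25,36,37,28,19,30,41))

x2r <- replace(dat$x2, dat$x1 == 0, 0)  # zero out x2 wherever x1 == 0
grp <- cumsum(dat$x1 == 0)              # each 0 in x1 starts a new group
ave(x2r, grp, FUN = cumsum)             # cumulative sum within each group
# 0 23 67 0 36 73 101 0 30 71
```

Because the rows with x1 == 0 are both zeroed out and used as group boundaries, the running sum automatically restarts after every 0.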
In data.table, you could group by runs of x1 (using by=rleid(x1)) and then return 0 if the group of x1 is 0, or otherwise return the cumulative sum of x2. := is used to assign the variable by reference.
library(data.table)
setDT(dat)[, x3 := if(x1[1] == 0) 0 else cumsum(x2), by=rleid(x1)]
this returns
dat
x1 x2 x3
1: 0 22 0
2: 1 23 23
3: 1 44 67
4: 0 25 0
5: 1 36 36
6: 1 37 73
7: 1 28 101
8: 0 19 0
9: 1 30 30
10: 1 41 71