I want to calculate rolling cumulative sums by item in a data.table. Sometimes, data is missing for a given time period.
set.seed(8)
item <- c(rep("A",4), rep("B",3))
time <- c(1,2,3,4,1,3,4)
sales <- rpois(7,5)
DT <- data.table(item, time,sales)
For a rolling window of 2 time periods I want the following output:
item time sales sales_rolling2
1: A 1 5 5
2: A 2 3 8
3: A 3 7 10
4: A 4 6 13
5: B 1 4 4
6: B 3 6 6
7: B 4 4 10
Note, that item B has no data at time 2. Thus the result for row 6 just includes the latest observation.
We can use rollsum from library(zoo) to do the rolling sum. Before applying the rollsum, I guess we need to create another grouping variable ('indx') based on the 'time' variable. I find that for the item 'B', the time is not continous, ie. 2 is missing. So, we can use diff to create a logical index based on the difference of adjacent elements. If the difference is not 1, it will return TRUE or else FALSE. As the diff output is of length 1 less than the length of the column, we can pad with TRUE and then do the cumsum to create the 'indx' variable.
library(zoo)
DT[, indx:=cumsum(c(TRUE, diff(time)!=1))]
In the second step, we use both 'indx' and 'time' as the grouping variable, get the rollsum of 'sales' with k=2 and also based on the condition that if the number of elements in the group is greater than 1 only we need to do this (if(.N >1)), otherwise it should return the 'sales', create the 'sales_rolling2', and assign (:=) the 'indx' to NULL as it is not needed in the expected output.
DT[, sales_rolling2 := if(.N>1) c(sales[1],rollsum(sales,2)) else sales,
by = .(indx, item)][,indx:= NULL]
# item time sales sales_rolling2
#1: A 1 5 5
#2: A 2 3 8
#3: A 3 7 10
#4: A 4 6 13
#5: B 1 4 4
#6: B 3 6 6
#7: B 4 4 10
Update
As per #Khashaa's suggestion, we can use roll_sum from library(RcppRoll) can be used more effectively as it will even work with number of rows less than 'k'. In this way, we can remove the if/else condition in my previous solution. (Full credit to #Khashaa)
library(RcppRoll)
DT[, sales_rolling2 := c(sales[1L], roll_sum(sales, 2)), by = .(indx, item)]
Related
I need to summarize some data using a rolling window of different width and shift. In particular I need to apply a function (eg. sum) over some values recorded on different intervals.
Here an example of a data frame:
df <- tibble(days = c(0,1,2,3,1),
value = c(5,7,3,4,2))
df
# A tibble: 5 x 2
days value
<dbl> <dbl>
1 0 5
2 1 7
3 2 3
4 3 4
5 1 2
The columns indicate:
days how many days elapsed from the previous observation. The first value is 0 because no previous observation.
value the value I need to aggregate.
Now, let's assume that I need to sum the field value every 4 days shifting 1 day at the time.
I need something along these lines:
days value roll_sum rows_to_sum
0 5 15 1,2,3
1 7 10 2,3
2 3 3 3
3 4 6 4,5
1 2 NA NA
The column rows_to_sum has been added to make it clear.
Here more details:
The first value (15), is the sum of the 3 rows because 0+1+2 = 3 which is less than the reference value 4 and adding the next line (with value 3) will bring the total day count to 7 which is more than 4.
The second value (10), is the sum of row 2 and 3. This is because, excluding the first row (since we are shifting one day), we only summing row 2 and 3 because including row 4 will bring the total sum of days to 1+2+3 = 6 which is more than 4.
...
How can I achieve this?
Thank you
Here is one way :
library(dplyr)
library(purrr)
df %>%
mutate(roll_sum = map_dbl(row_number(), ~{
i <- max(which(cumsum(days[.x:n()]) <= 4))
if(is.na(i)) NA else sum(value[.x:(.x + i - 1)])
}))
# days value roll_sum
# <dbl> <dbl> <dbl>
#1 0 5 15
#2 1 7 10
#3 2 3 3
#4 3 4 6
#5 1 2 2
Performing this calculation in base R :
sapply(seq(nrow(df)), function(x) {
i <- max(which(cumsum(df$days[x:nrow(df)]) <= 4))
if(is.na(i)) NA else sum(df$value[x:(x + i - 1)])
})
I am trying to set the previous observation per group to NA, if a certain condition applies.
Assume I have the following datatable:
DT = data.table(group=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6,6,3,1,1,3,6), a=1:9, b=9:1)
and I am using the simple condition:
DT[y == 6]
How can I set the previous rows of DT[y == 6] within DT to NA, namely the rows with the numbers 2 and 8 of DT? That is, how to set the respectively previous rows per group to NA.
Please note: From DT we can see that there are 3 rows when y is equal to 6, but for group a (row nr 4) I do not want to set the previous row to NA, as the previous row belongs to a different group.
So what I want in different terms is the previous index of certain elements in datatable. Is that possible? Would be also interesting if one can go further back than 1 period. Thanks for any hints.
You can find the row indices where current y is not 6 and next row is 6, then set the whole row to NA:
DT[shift(y, type="lead")==6 & y!=6,
(names(DT)) := lapply(.SD, function(x) NA)]
DT
output:
group v y a b
1: b 1 1 1 9
2: <NA> NA NA NA NA
3: b 1 6 3 7
4: a 2 6 4 6
5: a 2 3 5 5
6: a 1 1 6 4
7: c 1 1 7 3
8: <NA> NA NA NA NA
9: c 2 6 9 1
As usual, Frank commenting with a more succinct version:
DT[shift(y, type="lead")==6 & y!=6, names(DT) := NA]
I have a data frame that looks a bit like this:
wt <- data.frame(region = c(rep("A", 5), rep("B", 5)), time = c(1:5, 1:5),
start = c(rep(2,5), rep(4, 5)), value = rep(1, 10))
The values in the value column could be any numbers (I am working in a very large data set), but each region will be over an equal-length time series and have a single starting point.
I want to perform a cumulative sum within each region that begins accumulating at the starting point, continues forward in the time series, and wraps to the rows before the starting point in the time series.
The full data table, WITH the intended result, would look like this:
region time start value result
A 1 2 1 5
A 2 2 1 1
A 3 2 1 2
A 4 2 1 3
A 5 2 1 4
B 1 4 1 3
B 2 4 1 4
B 3 4 1 5
B 4 4 1 1
B 5 4 1 2
A simple transformation of the time column followed by cumsum does not work, since the function cares about row order and not any particular factor.
With that in mind, I am operating on a huge data table, and runtime is absolutely a concern, so any solution must avoid re-ordering rows.
Ideas of how to do this? Thanks in advance.
EDIT: Consider time to be a cycle such as hours in a day - and for example, if the start time is 2, that means observations start at one instance of time 2 and end at the next time 1.
We can do this in an efficient way with data.table
library(data.table)
setDT(wt)[time>=start, result := seq_len(.N), region]
wt[, Max := max(result, na.rm = TRUE), region]
wt[is.na(result), result := Max +seq_len(.N) , region][, Max := NULL][]
# region time start value result
#1: A 1 2 1 5
#2: A 2 2 1 1
#3: A 3 2 1 2
#4: A 4 2 1 3
#5: A 5 2 1 4
#6: B 1 4 1 3
#7: B 2 4 1 4
#8: B 3 4 1 5
#9: B 4 4 1 1
#10: B 5 4 1 2
akrun's solution works for the example I gave (hence I accepted it as the answer), but here's a version that works for any values in the value column:
library(data.table)
setDT(wt)[time>=start, result := cumsum(value), region]
wt[, Max := max(result, na.rm = TRUE), region]
wt[is.na(result), result := Max +cumsum(value) , region][, Max := NULL][]
Just adding the... unfortunately named cumsum function in place of a calculated sequence.
I have columns 1 and 2 (ID and value). Next I would like a count column that lists the # of times that the same value occurs per id. If it occurs more than once, it will obviously repeat the value. There are other variables in this data set, but the new count variable needs to be conditional only on 2 of them. I have scoured this blog, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by #Arun, you can replace transform with mutate if you are working with large data.frame
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)
The built-in constant, ".N", is a length 1 vector reporting the number of observations in each group.
The downside to this approach would be joining this result with your initial data.frame (assuming you wish to retain the original dimensions).
I'm stuck with a quite complex problem. I have a data frame with three rows: id, info and rownum. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is to delete all other rows of one id if one of the rows contains the info a. This would mean for example that row 2 and 3 should be removed as row 1's coloumn info contains the value a. Please note that the info values are not ordered (id 3/row 5 & 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all id containing an "a"-value
a_val <- data$id[grep("a", data$info)]
# check for every id containing an "a"-value
for(i in a_val) {
temp_data <- data[which(data$id == i),]
# only go on if the given id contains more than one row
if (nrow(temp_data) > 1) {
for (ii in nrow(temp_data)) {
if (temp_data$info[ii] != "a") {
temp <- temp_data$row[ii]
if (!exists("delete_rows")) {
delete_rows <- temp
} else {
delete_rows <- c(delete_rows, temp)
}
}
}
}
}
My solution works quite well. Nevertheless, it is very, very, very slow as the original data contains more than 700k rows and more that 150k rows with an "a"-value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. #BenBarnes found out that this solution only works if the data frame is ordered according to id.
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
if df is this (RE Sacha above) use match which just finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
try to take a look at subset, it's quite easy to use and it will solve your problem.
you just need to specify the value of the column that you want to subset based on, alternatively you can choose more columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html