Repeat rows then manipulate those rows using data.table in R

Hi, I am new to data.table syntax in R (and R in general) and need help to repeat certain rows and incrementally increase them based on category.
My mock data table information is below:
> head(dt)
Time Values1 Values2 Values3 Category
1: 00:15:00 1 2 1.5 A
2: 00:30:00 3 4 2.5 A
3: 00:45:00 5 6 3.5 A
4: 01:00:00 7 8 4.5 A
5: 01:15:00 9 10 5.5 A
6: 01:30:00 11 12 6.5 A
> tail(dt)
Time Values1 Values2 Values3 Category
1: 22:45:00 182 181 92.5 B
2: 23:00:00 184 183 93.5 B
3: 23:15:00 186 185 94.5 B
4: 23:30:00 188 187 95.5 B
5: 23:45:00 190 189 96.5 B
6: 00:00:00 192 191 97.5 B
> str(dt)
Classes ‘data.table’ and 'data.frame': 192 obs. of 5 variables:
$ Time :Class 'ITime' int [1:192] 900 1800 2700 3600 4500 5400 6300 7200 8100 9000 ...
$ Values1 : int 1 3 5 7 9 11 13 15 17 19 ...
$ Values2 : int 2 4 6 8 10 12 14 16 18 20 ...
$ Values3 : num 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 ...
$ Category: chr "A" "A" "A" "A" ...
- attr(*, ".internal.selfref")=<externalptr>
If the Category is A, I want to extrapolate each value in the Time column to one-minute steps while the remaining columns keep the same values. Note that if the time is 00:15, then my extrapolated section would have times from both 00:01 to 00:14 and 00:16 to 00:29.
--- Goal ---:
If the category is B, then the time extrapolation is 5 minutes.
The final result will have the original data with all the time extrapolations and no duplicated time values based on Category.
--- Thought Process ---:
My strategy is to break up into categories A and B, somehow find ways to add the extrapolated time and append them back to the original data table.
So far, I know how to split into categories A and B, write a function to add minutes to the as.ITime Time column, and repeat each row:
add_minutes <- function(m) {
  x <- m * 60
  return(x)
}
A <- dt[Category == 'A']
B <- dt[Category == 'B']
A <- A[,list(freq=rep(1,14)), by =.(Time,Values1,Values2,Values3,Category)][,freq:=NULL]
However, I do not know how to apply the add_minutes() function to those repeated rows to:
Reset the time for each original time value. For example, if the original time is 00:30 and I have repeated that line 14 times, I want the 14 appearances of 00:30 to become a sequence from 00:31 to 00:44. If the original time is 00:45, then I want a sequence from 00:46 to 00:59, and so on.
Append this back to the original data table.
Thank you in advance for your help!!
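For reference, here is a minimal sketch of the repeat-and-reset strategy described above, assuming the dt and add_minutes() defined in the question (it re-derives A rather than reusing the rep(1, 14) grouping, and the answers that follow take a join-based approach instead):
# Category A only: repeat each row 14 times, then shift the copies by 1..14 minutes,
# so 00:30 becomes 00:31 ... 00:44. ITime is stored in seconds, so adding
# add_minutes(k) shifts the time by k minutes.
A <- dt[Category == 'A']
A_expanded <- A[rep(seq_len(.N), each = 14)]
A_expanded[, Time := Time + add_minutes(rep(1:14, times = nrow(A)))]
result_A <- rbind(A, A_expanded)[order(Time)]
# The minutes before the very first original timestamp (00:01 to 00:14) still need
# to be filled separately, e.g. as the answer below does with "nocb".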

Unfortunately, the rolling join as suggested by pseudospin will not return the expected result, because as.ITime("00:00:00") is part of the time series dt and thus will be rolled forward to the additional time steps at 00:01:00, 00:02:00, 00:03:00, etc. for Category A, or 00:05:00, 00:10:00, etc. for Category B, respectively. (Note that as.ITime("24:00:00") == as.ITime("00:00:00").)
The approach below
creates all required time steps completed_ts for each Category,
right joins with dt, which adds many NAs in the values columns,
fills the missing values for each Category by last observation carried forward (locf),
and finally fills the missing values at the top of each Category by next observation carried backward (nocb).
completed_ts <- rbind(
  data.table(Time = as.ITime(seq(1L, 1440L, 1L) * 60L), Category = "A"),
  data.table(Time = as.ITime(seq(5L, 1440L, 5L) * 60L), Category = "B")
)
res <- dt[completed_ts, on = .(Time, Category)]
cols <- paste0("Values", 1:3)
res[, (cols) := lapply(.SD, nafill, type = "locf"), .SDcols = cols, by = Category]
res[, (cols) := lapply(.SD, nafill, type = "nocb"), .SDcols = cols, by = Category]
# print interesting parts of the result
res[Category == "A", .SD[c(1:16, .N - 16:0)]]
res[Category == "B", .SD[c(1:4, .N - 4:0)]]
Time Values1 Values2 Values3 Category
1: 00:01:00 1 2 1.5 A
2: 00:02:00 1 2 1.5 A
3: 00:03:00 1 2 1.5 A
4: 00:04:00 1 2 1.5 A
5: 00:05:00 1 2 1.5 A
6: 00:06:00 1 2 1.5 A
7: 00:07:00 1 2 1.5 A
8: 00:08:00 1 2 1.5 A
9: 00:09:00 1 2 1.5 A
10: 00:10:00 1 2 1.5 A
11: 00:11:00 1 2 1.5 A
12: 00:12:00 1 2 1.5 A
13: 00:13:00 1 2 1.5 A
14: 00:14:00 1 2 1.5 A
15: 00:15:00 1 2 1.5 A
16: 00:16:00 1 2 1.5 A
17: 23:44:00 187 188 94.5 A
18: 23:45:00 189 190 95.5 A
19: 23:46:00 189 190 95.5 A
20: 23:47:00 189 190 95.5 A
21: 23:48:00 189 190 95.5 A
22: 23:49:00 189 190 95.5 A
23: 23:50:00 189 190 95.5 A
24: 23:51:00 189 190 95.5 A
25: 23:52:00 189 190 95.5 A
26: 23:53:00 189 190 95.5 A
27: 23:54:00 189 190 95.5 A
28: 23:55:00 189 190 95.5 A
29: 23:56:00 189 190 95.5 A
30: 23:57:00 189 190 95.5 A
31: 23:58:00 189 190 95.5 A
32: 23:59:00 189 190 95.5 A
33: 00:00:00 191 192 96.5 A
Time Values1 Values2 Values3 Category
1: 00:05:00 2 1 2.5 B
2: 00:10:00 2 1 2.5 B
3: 00:15:00 2 1 2.5 B
4: 00:20:00 2 1 2.5 B
5: 23:40:00 188 187 95.5 B
6: 23:45:00 190 189 96.5 B
7: 23:50:00 190 189 96.5 B
8: 23:55:00 190 189 96.5 B
9: 00:00:00 192 191 97.5 B
Note that data.table's nafill() function currently only supports double and integer data types. If you need to fill other data types, please see zoo::na.locf().
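As a small, hypothetical illustration (the vector x below is made up and not part of dt), filling a character column could look like this with zoo:
# nafill() only handles integer/double, so character NAs need zoo::na.locf()
x <- c(NA, "low", NA, NA, "high", NA)
zoo::na.locf(x, na.rm = FALSE)                   # last observation carried forward
zoo::na.locf(x, na.rm = FALSE, fromLast = TRUE)  # next observation carried backward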
Reproducible data
library(data.table)
dtA <- data.table(Time = seq(as.ITime("00:15:00"), by = 900L, length.out = 96L),
                  Values1 = seq(1L, by = 2L, length.out = 96L),
                  Values2 = seq(2L, by = 2L, length.out = 96L),
                  Values3 = seq(1.5, by = 1.0, length.out = 96L),
                  Category = rep("A", 96L))
dtB <- data.table(Time = seq(as.ITime("00:15:00"), by = 900L, length.out = 96L),
                  Values1 = seq(to = 192L, by = 2L, length.out = 96L),
                  Values2 = seq(to = 191L, by = 2L, length.out = 96L),
                  Values3 = seq(to = 97.5, by = 1.0, length.out = 96L),
                  Category = rep("B", 96L))
dt <- rbind(dtA, dtB)

The magical rolling join in data.table.
desire <- rbind(
  data.table(Category = "A", Time = as.ITime(seq(1, 1440, 1) * 60)),
  data.table(Category = "B", Time = as.ITime(seq(5, 1440, 5) * 60))
)
dt[desire, on = c('Category','Time'), roll = TRUE, rollends = c(TRUE, TRUE)]

Related

How to vectorize a loop that, for each row, sums a function of the time elapsed between the current and all prior entries of that entity ID in R

I have a large data.table (circa 900k rows) which can be represented by the following example:
row.id entity.id event.date result
1: 1 100 2015-01-20 NA
2: 2 101 2015-01-20 NA
3: 3 104 2015-01-20 NA
4: 4 107 2015-01-20 NA
5: 5 103 2015-01-23 NA
6: 6 109 2015-01-23 NA
7: 7 102 2015-01-23 NA
8: 8 101 2015-01-26 NA
9: 9 110 2015-01-26 NA
10: 10 112 2015-01-26 NA
11: 11 109 2015-01-26 NA
12: 12 130 2015-01-29 NA
13: 13 100 2015-01-29 NA
14: 14 127 2015-01-29 NA
15: 15 101 2015-01-29 NA
16: 16 119 2015-01-29 NA
17: 17 104 2015-02-03 NA
18: 18 101 2015-02-03 NA
19: 19 125 2015-02-03 NA
20: 20 130 2015-02-03 NA
Essentially I have columns containing: the ID representing the entity in question (entity.id); the date of an event in which this ID partook (note that many, and differing numbers of, entities will participate in each event). I need to calculate a factor that, for each entity.id on each event date, depends (non-linearly) on the time (in days) that has elapsed since all the previous events in which that entity ID was entered.
To put it in other, more programmatic terms, on each row of the data.table I need to find all instances with matching ID and where the date is older than the event date of the row in question, work out the difference in time (in days) between the ‘current’ and historical events, and sum some non-linear function applied to each of the time periods (I’ll use the square in this example).
In the example above, for entity.id = 101 on 03-02-2015 (row 18), we would need to look back to that ID's prior entries on rows 15, 8 and 2, calculate the differences in days from the ‘current’ event (14, 8 and 5 days), and then calculate the answer by summing the squares of those periods (14^2 + 8^2 + 5^2) = 196 + 64 + 25 = 285. (The real function is somewhat more complex but this is sufficiently representative.)
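A quick sanity check of that arithmetic in R:
sum(c(14, 8, 5)^2)
# [1] 285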
This is trivial to achieve with for-loops, as per below:
# Create sample dt
dt <- data.table(row.id = 1:20,
                 entity.id = c(100, 101, 104, 107, 103, 109, 102, 101, 110, 112,
                               109, 130, 100, 127, 101, 119, 104, 101, 125, 130),
                 event.date = as.Date(c("2015-01-20", "2015-01-20", "2015-01-20", "2015-01-20",
                                        "2015-01-23", "2015-01-23", "2015-01-23",
                                        "2015-01-26", "2015-01-26", "2015-01-26", "2015-01-26",
                                        "2015-01-29", "2015-01-29", "2015-01-29", "2015-01-29", "2015-01-29",
                                        "2015-02-03", "2015-02-03", "2015-02-03", "2015-02-03")),
                 result = NA)
setkey(dt, row.id)
for (i in 1:nrow(dt)) { # loop through each entry
  # get a subset of dt comprised of rows with this row's entity.id which occur prior to this row
  event.history <- dt[row.id < i & entity.id == entity.id[i]]
  # calc the sum of the squared differences between the current row's event date and the prior event dates contained within event.history
  dt$result[i] <- sum((as.numeric(dt$event.date[i]) - as.numeric(event.history$event.date))^2)
}
Unfortunately, on the real dataset it is also extremely slow, no doubt because of the amount of subsetting operations required. Is there a way to vectorise, or otherwise speed up, this operation? I’ve searched and searched and wracked my brains but can’t work out how to vectorially subset rows based on differing data per each row without looping.
Note that I created a row.id column to allow me to extract all prior rows (rather than prior dates), as the two are broadly equivalent (an entity cannot attend more than one event a day) and this way was much quicker (I think because it avoids the need to coerce the dates to numeric before doing the comparison, i.e. dt[as.numeric(event.date) < as.numeric(event.date[i])]).
Note also that I’m not wedded to it being a data.table; I’m happy to use dplyr or other mechanisms to achieve this if need be.
I think this can be achieved using a self-join with appropriate non-equi join criteria:
dt[, result2 := dt[
  dt,
  on = c("entity.id", "event.date<event.date"),
  sum(as.numeric(x.event.date - i.event.date)^2), by = .EACHI]$V1
]
dt
This gives a result which matches your output from the loop, with the exception of the NA values:
# row.id entity.id event.date result result2
# 1: 1 100 2015-01-20 0 NA
# 2: 2 101 2015-01-20 0 NA
# 3: 3 104 2015-01-20 0 NA
# 4: 4 107 2015-01-20 0 NA
# 5: 5 103 2015-01-23 0 NA
# 6: 6 109 2015-01-23 0 NA
# 7: 7 102 2015-01-23 0 NA
# 8: 8 101 2015-01-26 36 36
# 9: 9 110 2015-01-26 0 NA
#10: 10 112 2015-01-26 0 NA
#11: 11 109 2015-01-26 9 9
#12: 12 130 2015-01-29 0 NA
#13: 13 100 2015-01-29 81 81
#14: 14 127 2015-01-29 0 NA
#15: 15 101 2015-01-29 90 90
#16: 16 119 2015-01-29 0 NA
#17: 17 104 2015-02-03 196 196
#18: 18 101 2015-02-03 285 285
#19: 19 125 2015-02-03 0 NA
#20: 20 130 2015-02-03 25 25
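If the NA rows should read 0 to match the loop exactly, one possible follow-up (a sketch, not part of the original answer) is:
# rows with no earlier event for that entity.id got NA from the join; set them to 0
dt[is.na(result2), result2 := 0]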

igraph, POSIX, and data.table

In an earlier question, I learned that graphs are useful to collapse these data
require(data.table)
set.seed(333)
t <- data.table(old=1002:2001, dif=sample(1:10,1000, replace=TRUE))
t$new <- t$old + t$dif; t$foo <- rnorm(1000); t$dif <- NULL
> head(t)
old new foo
1: 1002 1007 -0.7889534
2: 1003 1004 0.3901869
3: 1004 1014 0.7907947
4: 1005 1011 2.0964612
5: 1006 1007 1.1834171
6: 1007 1015 1.1397910
to obtain only those rows such that new[i] = old[i-1]. The result could then be joined into a table with users who each have their own starting points
i <- data.table(id=1:3, start=sample(1000:1990,3))
> i
id start
1: 1 1002
2: 2 1744
3: 3 1656
Specifically, when only the first n=3 steps are calculated, the solution was
> library(igraph)
> i[, t[old %in% subcomponent(g, start, "out")[1:n]], by=.(id)]
id old new foo
1: 1 1002 1007 -0.7889534
2: 1 1007 1015 1.1397910
3: 1 1015 1022 -1.2193666
4: 2 1744 1750 -0.1368320
5: 2 1750 1758 0.3331686
6: 2 1758 1763 1.3040357
7: 3 1656 1659 -0.1556208
8: 3 1659 1663 0.1663042
9: 3 1663 1669 0.3781835
When implementing this where the setup is the same but new, old, and start are of class POSIXct,
set.seed(333)
u <- data.table(old = seq(from = as.POSIXct("2013-01-01"),
                          to = as.POSIXct("2013-01-02"), by = "15 mins"),
                dif = as.difftime(sample(seq(15, 120, 15), 97, replace = TRUE), units = "mins"))
u$new <- u$old + u$dif; u$foo <- rnorm(97); u$dif <- NULL
j <- data.table(id = 1:3, start = sample(seq(from = as.POSIXct("2013-01-01"),
                                             to = as.POSIXct("2013-01-01 22:00:00"), by = "15 mins"), 3))
> head(u)
old new foo
1: 2013-01-01 00:00:00 2013-01-01 01:00:00 -1.5434407
2: 2013-01-01 00:15:00 2013-01-01 00:30:00 -0.2753971
3: 2013-01-01 00:30:00 2013-01-01 02:30:00 -1.5986916
4: 2013-01-01 00:45:00 2013-01-01 02:00:00 -0.6288528
5: 2013-01-01 01:00:00 2013-01-01 01:15:00 -0.8967041
6: 2013-01-01 01:15:00 2013-01-01 02:45:00 -1.2145590
> j
id start
1: 1 2013-01-01 22:00:00
2: 2 2013-01-01 21:00:00
3: 3 2013-01-01 13:30:00
the command
> j[, u[old %in% subcomponent(h, V(h)$name %in% as.character(start), "out")[1:n]], by=.(id)]
Empty data.table (0 rows and 4 cols): id,old,new,foo
returns an empty data.table, which appears to be due to the inner u[...] part. I do not quite see where the problem lies in this case and wonder whether anyone can spot the mistake.

Dealing with apply functions of xts object in R

I have a sample xts object with some data:
library(xts)  # xts() comes from the xts package
dates <- seq.Date(from = as.Date("2010-01-01", format = "%Y-%m-%d"),
                  to = as.Date("2013-12-01", format = "%Y-%m-%d"), by = "month")
sample_data <- cbind(1:length(dates), length(dates):1)
xts_object <- xts(x = sample_data, order.by = dates)
I then use apply.yearly on it with the function cumsum:
apply.yearly(x = xts_object, FUN = cumsum)
The output is a transposed matrix, which is not what I originally intended it to return.
I would expect the snippet above to return the same output as:
rbind(apply(X = xts_object[1:12], MARGIN = 2, FUN = cumsum),
      apply(X = xts_object[13:24], MARGIN = 2, FUN = cumsum),
      apply(X = xts_object[25:36], MARGIN = 2, FUN = cumsum),
      apply(X = xts_object[37:48], MARGIN = 2, FUN = cumsum))
The problem with using apply is that it returns a matrix and not an xts object. While I could solve this by using as.xts, I would like to know if there is something I am missing, or if I am using apply.yearly incorrectly. Using plain apply seems more prone to difficult-to-catch errors and bugs.
This might not be the most elegant solution, but it works:
# Split xts_object by year
xts_list = split(xts_object, "years")
# cumsum for each year
cumsum_list = lapply(xts_list, FUN = cumsum)
# rbind them together
do.call(rbind, cumsum_list)
# [,1] [,2]
# 2010-01-01 1 48
# 2010-02-01 3 95
# 2010-03-01 6 141
# 2010-04-01 10 186
# 2010-05-01 15 230
# 2010-06-01 21 273
# 2010-07-01 28 315
# 2010-08-01 36 356
# 2010-09-01 45 396
# 2010-10-01 55 435
# 2010-11-01 66 473
# 2010-12-01 78 510
# 2011-01-01 13 36
# 2011-02-01 27 71
# 2011-03-01 42 105
# 2011-04-01 58 138
# 2011-05-01 75 170
# 2011-06-01 93 201
# 2011-07-01 112 231
# 2011-08-01 132 260
# 2011-09-01 153 288
# 2011-10-01 175 315
# 2011-11-01 198 341
# 2011-12-01 222 366
# 2012-01-01 25 24
# 2012-02-01 51 47
# 2012-03-01 78 69
# 2012-04-01 106 90
# 2012-05-01 135 110
# 2012-06-01 165 129
# 2012-07-01 196 147
# 2012-08-01 228 164
# 2012-09-01 261 180
# 2012-10-01 295 195
# 2012-11-01 330 209
# 2012-12-01 366 222
# 2013-01-01 37 12
# 2013-02-01 75 23
# 2013-03-01 114 33
# 2013-04-01 154 42
# 2013-05-01 195 50
# 2013-06-01 237 57
# 2013-07-01 280 63
# 2013-08-01 324 68
# 2013-09-01 369 72
# 2013-10-01 415 75
# 2013-11-01 462 77
# 2013-12-01 510 78
class(do.call(rbind, cumsum_list))
# [1] "xts" "zoo"
The resulting object is still of class "xts".
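For what it's worth, the three steps above can also be composed into a single expression; split.xts keeps the date index, so rbind() still returns an xts object:
# same result as the split / lapply / rbind steps above, written in one line
do.call(rbind, lapply(split(xts_object, "years"), cumsum))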

Transform zeros of the first x rows into NA in R

In my data frame, the column nr.flights contains only zeros up until row 1500, which I want to transform into NAs (I have no data available about nr.flights for the first 1500 rows). There are other values from row 1500 onwards that are zero, but those need to remain zero.
My dataframe looks like this:
Date AD Runway MTOW nr.flights
2008-01-01 A 18 376 0
2008-01-01 A 18 376 0
2008-01-01 D 36 190 0
2008-01-02 D 09 150 2
2008-01-02 A 36 280 1
2008-01-02 A 36 280 1
And I want it to look like this:
Date AD Runway MTOW nr.flights
2008-01-01 A 18 376 NA
2008-01-01 A 18 376 NA
2008-01-01 D 36 190 NA
2008-01-02 D 09 150 2
2008-01-02 A 36 280 1
2008-01-02 A 36 280 1
So far I've only managed to change the entire column into either NA's or zeros, but I want to have both of these in there. Any help would be much appreciated!
To reproduce:
df <- data.frame(Date = c("2008-01-01", "2008-01-01", "2008-01-01", "2008-01-02", "2008-01-02", "2008-01-02"),
                 AD = c("A", "A", "D", "D", "A", "A"), Runway = c(18, 18, 36, 09, 36, 36),
                 MTOW = c(376, 376, 190, 150, 280, 280), nr.flights = c(0, 0, 0, 2, 1, 1))
Here's a way:
is.na(df$nr.flights[1:1500])[df$nr.flights[1:1500] == 0] <- TRUE
It works by isolating the values equal to 0, then assigning their NA status to TRUE. This is typically the safer option compared to df[mysubset] <- NA.
df
Date AD Runway MTOW nr.flights
1 2008-01-01 A 18 376 NA
2 2008-01-01 A 18 376 NA
3 2008-01-01 D 36 190 NA
4 2008-01-02 D 9 150 2
5 2008-01-02 A 36 280 1
6 2008-01-02 A 36 280 1
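If the nested is.na() replacement reads as opaque, a roughly equivalent base R sketch is shown below (the min() only keeps the index valid on the 6-row example; on the real data the first 1500 rows are used):
# replace zeros with NA, but only within the first 1500 rows
idx <- seq_len(min(1500, nrow(df)))
df$nr.flights[idx] <- replace(df$nr.flights[idx], df$nr.flights[idx] == 0, NA)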
Here is an option using data.table
library(data.table)
setDT(df)[1:.N <=1500 & !nr.flights, nr.flights := NA]
df
# Date AD Runway MTOW nr.flights
#1: 2008-01-01 A 18 376 NA
#2: 2008-01-01 A 18 376 NA
#3: 2008-01-01 D 36 190 NA
#4: 2008-01-02 D 9 150 2
#5: 2008-01-02 A 36 280 1
#6: 2008-01-02 A 36 280 1

Calculate mean of respective column values based on condition

I have a data.frame named sampleframe where I have stored all the table values. Inside sampleframe I have columns id, month, sold.
id month SMarch SJanFeb churn
101 1 0.00 0.00 1
101 2 0.00 0.00 1
101 3 0.00 0.00 1
108 2 0.00 6.00 1
103 2 0.00 10.00 1
160 1 0.00 2.00 1
160 2 0.00 3.00 1
160 3 0.50 0.00 0
164 1 0.00 3.00 1
164 2 0.00 6.00 1
I would like to calculate the average sold over the last three months based on ID. If it is month 3, it should consider the average sold of the last two months based on ID; if it is month 2, the average sold of 1 month based on ID; and so on for all months.
I have used ifelse and the mean function to attempt this, but some rows are missing when I try to use it for all months.
The query I have used:
sampleframe$Churn <- ifelse(sampleframe$Month==4|sampleframe$Month==5|sampleframe$Month==6, ifelse(sampleframe$Sold<0.7*mean(sampleframe$Sold[sampleframe$ID[sampleframe$Month==-1&sampleframe$Month==-2&sampleframe$Month==-3]]),1,0),0)
Adding to this: according to the logic of the query, it should compare against 70% of the previous months' sold values, and if the current value is higher than the previous months' average values it should return 1, else 0.
The expected output is not entirely clear. Based on the description of calculating the average 'sold' for each 3 months, grouped by 'id', we can use roll_mean from library(RcppRoll). We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'id', if the number of rows is greater than 1, take the roll_mean with n = 3 and concatenate it with the expanding averages of the first k-1 observations; otherwise, i.e. for a single observation, we keep the value itself.
library(RcppRoll)
library(data.table)
k <- 3
setDT(df1)[, soldAvg := if (.N > 1) c(cumsum(sold[1:(k-1)])/1:(k-1),
                                      roll_mean(sold, n = k, align = 'right')) else as.numeric(sold), id]
df1
# id month sold soldAvg
#1: 101 1 124 124.0000
#2: 101 2 211 167.5000
#3: 104 3 332 332.0000
#4: 105 4 124 124.0000
#5: 101 5 211 182.0000
#6: 101 6 332 251.3333
#7: 101 7 124 222.3333
#8: 101 8 211 222.3333
#9: 101 9 332 222.3333
#10: 102 10 124 124.0000
#11: 102 12 211 167.5000
#12: 104 3 332 332.0000
#13: 105 4 124 124.0000
#14: 102 5 211 182.0000
#15: 102 6 332 251.3333
#16: 106 7 124 124.0000
#17: 107 8 211 211.0000
#18: 102 9 332 291.6667
#19: 103 11 124 124.0000
#20: 103 2 211 167.5000
#21: 108 3 332 332.0000
#22: 108 4 124 228.0000
#23: 109 5 211 211.0000
#24: 103 6 332 222.3333
#25: 104 7 124 262.6667
#26: 105 8 211 153.0000
#27: 103 10 332 291.6667
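The expanding-window part of that expression can be checked in isolation; for id 101, whose first two sold values are 124 and 211:
cumsum(c(124, 211)) / 1:2
# [1] 124.0 167.5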
The above question can also be solved using library(dplyr), with the following query to obtain the output:
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
