Efficient altering of variables in dataframe referencing previous row's value - r

I have a data frame like the following:
df <- data.frame(id  = c(1, 1, 1, 2, 2, 3, 3, 4),
                 num = c(12, 12, 12, 28, 28, 17, 17, 7))
id num
1 1 12
2 1 12
3 1 12
4 2 28
5 2 28
6 3 17
7 3 17
8 4 7
I want to increment the value of num by 1 for each additional row of the same id. I have the following code to do it:
for (i in 2:nrow(df)) {
  if (df[i, 1] == df[i - 1, 1]) {
    df[i, 2] <- df[i - 1, 2] + 1
  }
}
This would result in an answer like this:
id num
1 1 12
2 1 13
3 1 14
4 2 28
5 2 29
6 3 17
7 3 18
8 4 7
This code works, but my actual dataset has hundreds of millions of rows, so it is very inefficient. I have tried using the lag() function from dplyr in different ways but have had no success. One such attempt was to bring the id from the previous row onto the same row for comparison:
df[,lag := shift(Id, 1L, type="lag")]
df[df$id==df$lag,2]<-shift(df[df$id==df$lag,2], 1L, type="lag")+1
This obviously will not run, however. Any help speeding up my approach would be great! Thanks.

library(data.table)
setDT(df)
df[, num := num + rowid(id) - 1L]
result:
# id num
# 1: 1 12
# 2: 1 13
# 3: 1 14
# 4: 2 28
# 5: 2 29
# 6: 3 17
# 7: 3 18
# 8: 4 7
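rowid(id) is data.table's within-group row counter, so num + rowid(id) - 1L adds 0 to the first row of each id, 1 to the second, and so on:
rowid(c(1, 1, 1, 2, 2, 3, 3, 4))
# [1] 1 2 3 1 2 1 2 1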

Using ave
df$num + ave(df$id, df$id, FUN = seq_along) - 1
[1] 12 13 14 28 29 17 18 7

Another data.table approach in case you like an explicit by:
library(data.table)
setDT(df)
df[, num := num + 1:.N - 1, by = id]
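If you want to check the relative speed of these approaches at scale, here is a quick sketch with the microbenchmark package (the test data is made up for illustration):
library(data.table)
library(microbenchmark)

big <- data.frame(id = rep(1:250000, each = 4), num = 7)

microbenchmark(
  rowid = { dt <- as.data.table(big); dt[, num := num + rowid(id) - 1L] },
  ave   = big$num + ave(big$id, big$id, FUN = seq_along) - 1,
  times = 5
)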

Related

frollsum, frollapply, etc... alternative: frollmedian?

I am using frollsum with adaptive = TRUE to calculate the rolling sum over a window of 26 weeks; for weeks < 26, the window is exactly the number of available weeks.
Is there anything similar, but instead of a rolling sum, a rolling median? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added "desired" as a fourth column):
    week product sales desired
 1:    1       1     8       8
 2:    2       1     8       8
 3:    3       1     7       8
 4:    4       1     4       8
 5:    5       1     7     7.5
 6:    6       1     4     7.5
 7:    7       1     8       8
 8:    8       1     9     ...
 9:    9       1     4
10:   10       1     7
11:   11       1     5
12:   12       1     3
13:   13       1     8
14:   14       1    10
Here is some example code:
library(data.table)
set.seed(0L)
week <- 1:100
products <- 1:10
sales <- round(runif(1000, 1, 10), 0)
data <- as.data.table(cbind(merge(week, products, all = TRUE), sales))
names(data) <- c("week", "product", "sales")
data[, desired := frollapply(sales, 26, median, adaptive = TRUE)]  # this only starts at week 26
Thank you very much for your help!
Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
       fifelse(is.na(x <- roll_medianr(sales, 26L)),
               c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
               x)]
or using replace instead of fifelse (both variants overwrite the first 25 NA values with expanding-window medians):
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
                            sapply(1L:25L, function(n) median(sales[1L:n])))]
output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
data:
library(data.table)
set.seed(0L)
week <- 1:100
products <- 1:10
sales <- round(runif(1000, 1, 10), 0)
data <- as.data.table(cbind(merge(week, products, all = TRUE), sales))
names(data) <- c("week", "product", "sales")
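If you would rather avoid the RcppRoll dependency, a plain sapply over a sliding window of at most 26 weeks gives the same adaptive median in base R (slower on long series). Note that the by = product grouping below is my assumption about the intended result, since the answer above rolls across product boundaries; it also assumes rows are ordered by week within product:
data[, med_sales2 := sapply(seq_len(.N), function(i)
         median(sales[max(1L, i - 25L):i])),  # window covers at most the past 26 weeks
     by = product]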

functions with data.table variables whose names are stored in a character vector

I am not a big data.table expert, but I am somewhat puzzled by a few things. Here is my simple example:
test <- data.table(x = 1:10, y = 1:10, z = 1:10, l = 11:20, d = 21:30)
test <- test[, ..I := .I]
vec_of_names <- c("z", "l", "d")
function_test <- function(x, y) {
  sum(x) + y
}
vec_of_final_names <- c("sum_z", "sum_l", "sum_d")
When I then attempt to do something like this:
for (i in 1:length(vec_of_names)) {
  test <- test[, vec_of_final_names[i] := function_test(x = .SD,
                   y = eval(parse(text = vec_of_names[i]))),
               .SDcols = c("x", "y"), by = ..I]
}
I get an error:
Error in eval(expr, envir, enclos) : object 'z' not found
The code below, by contrast, works fine but is a little ugly and also slow. Maybe somebody can suggest a better alternative.
for (i in 1:length(vec_of_names)) {
  test <- test[, vec_of_final_names[i] := function_test(
                   x = eval(parse(text = paste("c(", paste(c("x", "y"), collapse = ","), ")", sep = ""))),
                   y = eval(parse(text = vec_of_names[i]))),
               by = ..I]
}
Grouped by ..I (a strange choice of column name) and with .SDcols specified, we unlist the .SD and take its sum, fetch the columns named in 'vec_of_names' as a list with mget, add the group sum to each element of that list with Map, and assign (:=) the results to 'vec_of_final_names' to create the new columns:
test[, (vec_of_final_names) := Map(`+`, sum(unlist(.SD)),
mget(vec_of_names)), by = ..I, .SDcols = x:y]
Based on the example, this can also be done without the grouping variable
test[, (vec_of_final_names) := Map(`+`, list(x+y), mget(vec_of_names))]
Or by specifying the .SDcols
test[, (vec_of_final_names) := Map(`+`, list(Reduce(`+`, .SD)),
mget(vec_of_names)), .SDcols = x:y]
Or using the OP's function
test[, (vec_of_final_names) := Map(function_test, list(unlist(.SD)),
mget(vec_of_names)), ..I, .SDcols = x:y]
test
# x y z l d ..I sum_z sum_l sum_d
# 1: 1 1 1 11 21 1 3 13 23
# 2: 2 2 2 12 22 2 6 16 26
# 3: 3 3 3 13 23 3 9 19 29
# 4: 4 4 4 14 24 4 12 22 32
# 5: 5 5 5 15 25 5 15 25 35
# 6: 6 6 6 16 26 6 18 28 38
# 7: 7 7 7 17 27 7 21 31 41
# 8: 8 8 8 18 28 8 24 34 44
# 9: 9 9 9 19 29 9 27 37 47
#10: 10 10 10 20 30 10 30 40 50
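If you want to keep the loop from the question, get() is a direct drop-in for eval(parse(text = ...)) and reads much better; a sketch under the same setup (note that get() in j makes data.table materialise all columns per group, so the Map/mget answers above remain the faster route):
for (i in seq_along(vec_of_names)) {
  test[, (vec_of_final_names[i]) :=
           function_test(x = c(x, y), y = get(vec_of_names[i])),
       by = ..I]
}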

Reshaping a df in a specific way in R [duplicate]

This question already has answers here:
R Partial Reshape Data from Long to Wide
(2 answers)
Closed 6 years ago.
I am struggling to reshape this df into a different one. I have this:
ID task mean sd mode
1 0 2 10 1.5 223
2 0 2 21 2.4 213
3 0 2 24 4.3 232
4 1 3 26 2.2 121
5 1 3 29 1.3 433
6 1 3 12 2.3 456
7 2 4 45 4.3 422
8 2 4 67 5.3 443
9 2 4 34 2.1 432
and I would like to reshape it, discarding sd and mode and spreading the means across columns, like this:
ID task mean mean1 mean2
1 0 2 10 21 24
2 1 3 26 29 12
3 2 4 45 67 34
Thanks a lot for your help in advance
You first need to create a new column by which the mean values can be pivoted. Using data.table, this approach works:
library(data.table)
dt <- data.table(df)  # convert to data.table
dcast(dt[, nr := seq(task), .(ID)],
      ID + task ~ nr,
      value.var = "mean")
# ID task 1 2 3
#1: 0 2 10 21 24
#2: 1 3 26 29 12
#3: 2 4 45 67 34
You can then rename the columns to whatever you want them to be called.
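For example, with setnames (the new names are taken from the desired output in the question):
result <- dcast(dt[, nr := seq(task), .(ID)], ID + task ~ nr, value.var = "mean")
setnames(result, c("1", "2", "3"), c("mean", "mean1", "mean2"))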
reshape(cbind(df, time = ave(df$ID, df$ID, FUN = seq_along)),
        dir = 'w', idvar = c('ID', 'task'), drop = c('sd', 'mode'), sep = '')
## ID task mean1 mean2 mean3
## 1 0 2 10 21 24
## 4 1 3 26 29 12
## 7 2 4 45 67 34
Data
df <- data.frame(ID   = c(0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L),
                 task = c(2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
                 mean = c(10L, 21L, 24L, 26L, 29L, 12L, 45L, 67L, 34L),
                 sd   = c(1.5, 2.4, 4.3, 2.2, 1.3, 2.3, 4.3, 5.3, 2.1),
                 mode = c(223L, 213L, 232L, 121L, 433L, 456L, 422L, 443L, 432L))
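For completeness, the same reshape is also a short pipeline in the tidyverse (shown as an alternative, assuming a dplyr/tidyr dependency is acceptable; the resulting mean1..mean3 names would still need renaming to match the question exactly):
library(dplyr)
library(tidyr)

df %>%
  select(-sd, -mode) %>%
  group_by(ID, task) %>%
  mutate(nr = paste0("mean", row_number())) %>%  # mean1, mean2, mean3 within each group
  ungroup() %>%
  pivot_wider(names_from = nr, values_from = mean)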

R - indices of matching values of two data.tables

This is my first post on StackOverflow. I am relatively new to programming and am trying to work with data.table in R because of its reputation for speed.
I have a very large data.table, named "Actions", with 5 columns and potentially several million rows. The column names are k1, k2, i, l1 and l2. I have another data.table, with the unique values of Actions in columns k1 and k2, named "States".
For every row in Actions, I would like to find the index of the matching row in States, comparing Actions' columns 4 and 5 (l1 and l2) against States' k1 and k2. A reproducible example follows:
S.disc <- c(2000,2000)
S.max <- c(6200,2300)
S.min <- c(700,100)
Traces.num <- 3
Class.str <- lapply(1:2,function(x) seq(S.min[x],S.max[x],S.disc[x]))
Class.inf <- seq_len(Traces.num)
Actions <- data.table(expand.grid(Class.inf, Class.str[[2]], Class.str[[1]], Class.str[[2]], Class.str[[1]])[,c(5,4,1,3,2)])
setnames(Actions,c("k1","k2","i","l1","l2"))
States <- unique(Actions[,list(k1,k2,i)])
If I were using a data.frame, the following line would do it:
index <- apply(Actions,1,function(x) {which((States[,1]==x[4]) & (States[,2]==x[5]))})
How can I do the same efficiently with data.table?
This is relatively simple once you get the hang of keys and the special symbols which may be used in the j expression of a data.table. Try this...
# First make an ID for each row for use in the `dcast`
# because you are going to have multiple rows with the
# same key values and you need to know where they came from
Actions[ , ID := 1:.N ]
# Set the keys to join on
setkeyv( Actions , c("l1" , "l2" ) )
setkeyv( States , c("k1" , "k2" ) )
# Join States to Actions, using '.I', which
# is the row locations in States in which the
# key of Actions are found and within each
# group the row number ( 1:.N - a repeating 1,2,3)
New <- States[ J(Actions) , list( ID , Ind = .I , Row = 1:.N ) ]
# k1 k2 ID Ind Row
#1: 700 100 1 1 1
#2: 700 100 1 2 2
#3: 700 100 1 3 3
#4: 700 100 2 1 1
#5: 700 100 2 2 2
#6: 700 100 2 3 3
# reshape using 'dcast.data.table'
dcast.data.table( Row ~ ID , data = New , value.var = "Ind" )
# Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27...
#1: 1 1 1 1 4 4 4 7 7 7 10 10 10 13 13 13 16 16 16 1 1 1 4 4 4 7 7 7...
#2: 2 2 2 2 5 5 5 8 8 8 11 11 11 14 14 14 17 17 17 2 2 2 5 5 5 8 8 8...
#3: 3 3 3 3 6 6 6 9 9 9 12 12 12 15 15 15 18 18 18 3 3 3 6 6 6 9 9 9...
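On current data.table versions the matching row numbers can also be obtained directly with which = TRUE in a join, without setting keys first (the join returns one index per match, so an Actions row with several matching States rows contributes several entries):
# for each row of Actions, the row number(s) in States where k1 == l1 and k2 == l2
idx <- States[Actions, on = .(k1 = l1, k2 = l2), which = TRUE]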

how to avoid an optimization warning in data.table

I have the following code:
> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
> dt
a b c d
1: 3 1 11 21
2: 3 2 12 22
3: 3 3 13 23
4: 3 4 14 24
5: 3 5 15 25
6: 4 6 16 26
7: 4 7 17 27
8: 4 8 18 28
9: 4 9 19 29
10: 4 10 20 30
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))'
Starting dogroups ... done dogroups in 0 secs
a b c d
1: 3 15 65 115
2: 4 40 90 140
> dt[,c(count=.N,lapply(.SD,sum)),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))'
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
done dogroups in 0 secs
a count b c d
1: 3 5 15 65 115
2: 4 5 40 90 140
How do I avoid the scary "very inefficient" warning?
I can add a count column before the aggregation:
> dt$count <- 1
> dt
a b c d count
1: 3 1 11 21 1
2: 3 2 12 22 1
3: 3 3 13 23 1
4: 3 4 14 24 1
5: 3 5 15 25 1
6: 4 6 16 26 1
7: 4 7 17 27 1
8: 4 8 18 28 1
9: 4 9 19 29 1
10: 4 10 20 30 1
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))'
Starting dogroups ... done dogroups in 0 secs
a b c d count
1: 3 15 65 115 5
2: 4 40 90 140 5
but this does not look too elegant...
One way I could think of is to assign count by reference:
dt.out <- dt[, lapply(.SD,sum), by = a]
dt.out[, count := dt[, .N, by=a][, N]]
# alternatively: count := table(dt$a)
# a b c d count
# 1: 3 15 65 115 5
# 2: 4 40 90 140 5
Edit 1: I still think it's just a message and not a warning. But if you still want to avoid it, just do:
dt.out[, count := as.numeric(dt[, .N, by=a][, N])]
Edit 2: Very interesting. Doing the equivalent assignment with the functional form of := does not produce the same message.
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# Detected that j uses these columns: a
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# Detected that j uses these columns: <none>
# Optimization is on but j left unchanged as '.N'
# Starting dogroups ... done dogroups in 0 secs
# Detected that j uses these columns: N
# Assigning to all 2 rows
# Direct plonk of unnamed RHS, no copy.
This solution removes the message about the named elements. But you have to put the names back afterwards.
require(data.table)
options(datatable.verbose = TRUE)
dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Output
> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))'
Starting dogroups ... done dogroups in 0.001 secs
a V1 V2 V3 V4
1: 3 5 15 65 115
2: 4 5 40 90 140
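If you go this route, the names can be restored afterwards with setnames (the replacement names here are read off the earlier outputs):
out <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
setnames(out, c("a", "count", "b", "c", "d"))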
