I have a dataset of user actions consisting of three columns: user, action and time. The data looks like this:
user action time
1: 618663 34 1407160424
2: 617608 33 1407160425
3: 89514 34 1407160425
4: 71160 33 1407160425
5: 443464 32 1407160426
---
996: 146038 8 1407161349
997: 528997 9 1407161350
998: 804302 8 1407161351
999: 308922 8 1407161351
1000: 803763 8 1407161352
I want to separate sessions for each user based on action times. Actions performed within a certain period of each other (for example, one hour) are assumed to belong to one session.
The simple solution is to use a for loop and compare action times for each user, but that is not efficient and my data is very large.
Is there any method I can use to overcome this problem?
I can group by user, but separating each user's actions into different sessions is somehow difficult for me :-)
Try
library(data.table)
dt <- rbind(
data.table(user=1, action=1:10, time=c(1,5,10,11,15,20,22:25)),
data.table(user=2, action=1:5, time=c(1,3,10,11,12))
)
dt[, session := cumsum(c(TRUE, !(diff(time) <= 2))), by = user][]
# user action time session
# 1: 1 1 1 1
# 2: 1 2 5 2
# 3: 1 3 10 3
# 4: 1 4 11 3
# 5: 1 5 15 4
# 6: 1 6 20 5
# 7: 1 7 22 5
# 8: 1 8 23 5
# 9: 1 9 24 5
# 10: 1 10 25 5
# 11: 2 1 1 1
# 12: 2 2 3 1
# 13: 2 3 10 2
# 14: 2 4 11 2
# 15: 2 5 12 2
I used a difference of <= 2 here to group actions into sessions.
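Applied to your actual data, a new session starts whenever the gap to the previous action exceeds your chosen threshold. A minimal sketch, assuming your table is named dt, time is a Unix timestamp in seconds, and the threshold is one hour (3600 s):
library(data.table)
setkey(dt, user, time)  # make sure actions are ordered within each user
# start a new session whenever the gap to the previous action exceeds one hour
dt[, session := cumsum(c(TRUE, diff(time) > 3600)), by = user]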
I am trying to call different columns of a data.table inside a loop, to get the unique values of each column.
Consider the simple data.table below.
> df <- data.table(var_a = rep(1:10, 2),
+ var_b = 1:20)
> df
var_a var_b
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 8 8
9: 9 9
10: 10 10
11: 1 11
12: 2 12
13: 3 13
14: 4 14
15: 5 15
16: 6 16
17: 7 17
18: 8 18
19: 9 19
20: 10 20
My code works when I call a specific column outside a loop,
> unique(df$var_a)
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, var_a])
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, "var_a"])
var_a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
but not when I do so within a loop that goes through different columns of the data.table.
> for(v in c("var_a","var_b")){
+ print(v)
+ df$v
+ unique(df[, .v])
+ unique(df[, "v"])
+ }
[1] "var_a"
Error in `[.data.table`(df, , .v) :
j (the 2nd argument inside [...]) is a single symbol but column name '.v' is not found. Perhaps you intended DT[, ...v]. This difference to data.frame is deliberate and explained in FAQ 1.1.
>
> unique(df[, ..var_a])
Error in `[.data.table`(df, , ..var_a) :
Variable 'var_a' is not found in calling scope. Looking in calling scope because you used the .. prefix.
For the first problem, when you're referencing a column name indirectly, you can either use the double-dot ..v syntax or add with = FALSE in data.table's [ construct:
for (v in c("var_a", "var_b")) {
print(v)
print(df$v)
### either one of these will work:
print(unique(df[, ..v]))
# print(unique(df[, v, with = FALSE]))
}
# [1] "var_a"
# NULL
# var_a
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# [1] "var_b"
# NULL
# var_b
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# 11: 11
# 12: 12
# 13: 13
# 14: 14
# 15: 15
# 16: 16
# 17: 17
# 18: 18
# 19: 19
# 20: 20
# var_b
But this just prints it without changing anything. If all you want to do is look at unique values within each column (and not change the underlying frame), then I'd likely go with
lapply(df[,.(var_a, var_b)], unique)
# $var_a
# [1] 1 2 3 4 5 6 7 8 9 10
# $var_b
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
which shows the name and unique values. The use of lapply (whether on df as a whole or a subset of columns) is also preferable to another recommendation to use apply(df, 2, unique), though in this case it returns the same results.
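The difference matters once the table has mixed column types: apply() coerces the whole frame to a matrix first, so every column comes back as character, while lapply() preserves each column's type. A small illustration with a hypothetical character column var_c:
dfm <- data.table(var_a = rep(1:2, 2), var_c = letters[1:4])
apply(dfm, 2, unique)   # var_a comes back as character ("1", "2")
lapply(dfm, unique)     # var_a keeps its integer type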
Use .subset2 to refer to a column by its name:
for(v in c("var_a","var_b")) {
print(unique(.subset2(df, v)))
}
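This is essentially what base double-bracket extraction does as well, so the following sketch should be equivalent:
for (v in c("var_a", "var_b")) {
  print(unique(df[[v]]))  # [[ accepts a character column name
}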
Following the hint in the first error message, this would be the correct way to make the call in a loop:
for(v in c("var_a","var_b")){
print(unique(df[, ..v]))
}
# won't print all the lines
As for the second error: you have not declared a variable called "var_a"; it looks like you want to select by name.
# works as you have shown
unique(df[, "var_a"])
# works once the variable is declared
var_a <- "var_a"
unique(df[, ..var_a])
You may also be interested in the env parameter of data.table (see the development version); here is an illustration, and a loop version is sketched further below.
v="var_a"
df[, v, env=list(v=v)]
Output:
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
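In a loop this could look like the sketch below (assuming a data.table version that supports env):
for (v in c("var_a", "var_b")) {
  # env substitutes the string stored in v for the symbol v in j
  print(unique(df[, v, env = list(v = v)]))
}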
On my projects I usually do the data prep with a few functions, so my code usually looks like this:
readAndClean("directory") %>%
processing() %>%
readyForModelling()
Where I'm passing a data.table object from one function to another.
I've gotten into the habit of always starting these functions with:
processing <- function(data_init){
data <- copy(data_init)
}
to avoid modifying the data.table in the global environment, as the following example does:
test <- data.table(cars[1:10,])
processing <- function(data_init){
data_init[, id := 1:.N]
return("done")
}
test
# speed dist
# 1: 4 2
# 2: 4 10
# 3: 7 4
# 4: 7 22
# 5: 8 16
# 6: 9 10
# 7: 10 18
# 8: 10 26
# 9: 10 34
# 10: 11 17
processing(test)
# [1] "done"
test
# speed dist id
# 1: 4 2 1
# 2: 4 10 2
# 3: 7 4 3
# 4: 7 22 4
# 5: 8 16 5
# 6: 9 10 6
# 7: 10 18 7
# 8: 10 26 8
# 9: 10 34 9
# 10: 11 17 10
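For contrast, here is a sketch of the same toy function rewritten with the copy-first habit (hypothetical name processing2, starting from a fresh test); with the copy, the global test keeps only its original columns:
processing2 <- function(data_init){
  data <- copy(data_init)  # detach from the caller's object
  data[, id := 1:.N]       # modifies only the local copy
  return("done")
}
processing2(test)
# [1] "done"
# test still has only speed and dist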
But this always seems a little ugly to me.
Is this the correct way of handling data.tables inside functions?
I am using frollsum with adaptive = TRUE to calculate a rolling sum over a window of 26 weeks, where for weeks < 26 the window is exactly the number of available weeks.
Is there anything similar, but for a rolling median instead of a rolling sum? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added the "desired" column as the fourth column):
week product sales desired
1: 1 1 8 8
2: 2 1 8 8
3: 3 1 7 8
4: 4 1 4 8
5: 5 1 7 7.5
6: 6 1 4 7.5
7: 7 1 8 8
8: 8 1 9 and
9: 9 1 4 so
10: 10 1 7 on
11: 11 1 5 ...
12: 12 1 3
13: 13 1 8
14: 14 1 10
Here is some example code:
library(data.table)
set.seed(0L)
week <- 1:100
products <- 1:10
sales <- round(runif(1000, 1, 10), 0)
# cross join of weeks and products, plus a random sales column
data <- as.data.table(cbind(merge(week, products, all = TRUE), sales))
names(data) <- c("week", "product", "sales")
data[, desired := frollapply(sales, 26, median, adaptive = TRUE)] # this only starts at week 26
Thank you very much for your help!
Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
fifelse(is.na(x <- roll_medianr(sales, 26L)),
c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
x)]
or using replace instead of fifelse:
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
sapply(1L:25L, function(n) median(sales[1L:n])))]
Output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
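Note that as written this computes the rolling median over the whole sales vector; if you need it within each product (which your sample data suggests), the same logic can presumably be wrapped in by = product:
# same expanding-then-rolling median, computed separately per product
data[, med_sales := {
  x <- roll_medianr(sales, 26L)
  replace(x, 1L:25L, sapply(1L:25L, function(n) median(sales[1L:n])))
}, by = product]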
Data:
library(data.table)
set.seed(0L)
week <- 1:100
products <- 1:10
sales <- round(runif(1000, 1, 10), 0)
# cross join of weeks and products, plus a random sales column
data <- as.data.table(cbind(merge(week, products, all = TRUE), sales))
names(data) <- c("week", "product", "sales")
I've found several options on how to generate IDs by groups using the data.table package in R, but none of them fit my problem exactly. Hopefully someone can help.
In my problem, I have 160 markets that fall within 21 regions in a country. These markets are numbered 1:160 and there may be multiple observations documented within each market. I would like to restructure my market ID variable so that it represents unique markets within each region, and starts counting over again with each new region.
Here's some code to represent my problem:
require(data.table)
dt <- data.table(region = c(1,1,1,1,2,2,2,2,3,3,3,3),
market = c(1,1,2,2,3,3,4,4,5,6,7,7))
> dt
region market
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 3
6: 2 3
7: 2 4
8: 2 4
9: 3 5
10: 3 6
11: 3 7
12: 3 7
Currently, my data is set up to represent the result of
dt[, market_new := .GRP, by = .(region, market)]
But what I'd like get is
region market market_new
1: 1 1 1
2: 1 1 1
3: 1 2 2
4: 1 2 2
5: 2 3 1
6: 2 3 1
7: 2 4 2
8: 2 4 2
9: 3 5 1
10: 3 6 2
11: 3 7 3
12: 3 7 3
This seems to return what you want:
dt[, market_new:=as.numeric(factor(market)), by=region]
Here we split the data up by region and give a unique ID to each market within each region via factor(), extracting the underlying numeric index.
From 1.9.5+, you can use frank() (or frankv()) with ties.method = "dense" as follows:
dt[, market_new := frankv(market, ties.method = "dense"), by = region]
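One design note: both the factor() approach and dense ranking number markets by sorted value. If markets always appear in contiguous runs within each region, as in the example, data.table's rleid() is another option:
# numbers each run of identical market values within a region
dt[, market_new := rleid(market), by = region]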
I am attempting to append a sequence number to a data frame grouped by individuals and date. For example, to turn this:
x y
1 A 2012-01-02
2 A 2012-02-03
3 A 2012-02-25
4 A 2012-03-04
5 B 2012-01-02
6 B 2012-02-03
7 C 2013-01-02
8 C 2012-02-03
9 C 2012-03-04
10 C 2012-04-05
in to this:
x y v
1 A 2012-01-02 1
2 A 2012-02-03 2
3 A 2012-02-25 3
4 A 2012-03-04 4
5 B 2012-01-02 1
6 B 2012-02-03 2
7 C 2013-01-02 1
8 C 2012-02-03 2
9 C 2012-03-04 3
10 C 2012-04-05 4
where "x" is the individual, "y" is the date, and "v" is the appended sequence number
I have had success on a small data frame using a for loop in this code:
x=c("A","A","A","A","B","B","C","C","C","C")
y=as.Date(c("1/2/2012","2/3/2012","2/25/2012","3/4/2012","1/2/2012","2/3/2012",
"1/2/2013","2/3/2012","3/4/2012","4/5/2012"),"%m/%d/%Y")
x
y
z=data.frame(x,y)
z$v=rep(1,nrow(z))
for(i in 2:nrow(z)){
if(z$x[i]==z$x[i-1]){
z$v[i]=(z$v[i-1]+1)
} else {
z$v[i]=1
}
}
but when I expand this to a much larger data frame (250K+ rows) the process takes forever.
Any thoughts on how I can make this more efficient?
This seems to work. May be overkill though.
## old version, kept for reference:
## d$v <- unlist(sapply(sapply(split(d, d$x), nrow), seq))
EDIT
I can't believe I got away with that ugly mess for so long. Here's a revision. Much simpler.
## revised 04/24/2014
> d$v <- unlist(sapply(table(d$x), seq))
> d
## x y v
## 1 A 2012-01-02 1
## 2 A 2012-02-03 2
## 3 A 2012-02-25 3
## 4 A 2012-03-04 4
## 5 B 2012-01-02 1
## 6 B 2012-02-03 2
## 7 C 2013-01-02 1
## 8 C 2012-02-03 2
## 9 C 2012-03-04 3
## 10 C 2012-04-05 4
Also, an interesting one is stack(). Take a look:
> stack(sapply(table(d$x), seq))
## values ind
## 1 1 A
## 2 2 A
## 3 3 A
## 4 4 A
## 5 1 B
## 6 2 B
## 7 1 C
## 8 2 C
## 9 3 C
## 10 4 C
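One caveat: both versions assume d is sorted by x, because table() tabulates the groups in sorted order before the sequences are unlisted back into the frame. A base-R sketch that works regardless of row order is ave():
# apply seq_along within each group of x, keeping results in row order
d$v <- ave(seq_along(d$x), d$x, FUN = seq_along)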
I'm removing my previous post and replacing it with this solution. Extremely efficient for my purposes.
# order the data by individual, then date
z <- z[order(z$x, z$y), ]
# convert to a data.table
dt.z <- data.table(z)
# obtain the vector of sequence numbers (1:.N within each x, returned as V1)
z$seq <- dt.z[, 1:.N, by = "x"]$V1
The above can be accomplished in fewer steps, but I wanted to illustrate what I did. This appends sequence numbers to my data sets of over 250K records in under a second. Thanks again to Henrik and Richard.
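For reference, the fewer-steps version could look like this sketch (rowid() is available in recent data.table versions):
setorder(dt.z, x, y)                # order by individual, then date
dt.z[, seq := seq_len(.N), by = x]  # sequence number within each individual
# or equivalently: dt.z[, seq := rowid(x)]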