R data.table: Count events since last occurrence (multiple, inclusive/exclusive)

[updated: tried to clarify and simplify; corrected sample code and data.]
I have a set of measurements taken over a period of days. The range of numbers that can be captured in any measurement is 1-25 (in real life, depending on the test set, the range could be as high as 100 or as low as 20).
I'd like to tally how many events have passed since a specific number last occurred, regardless of which measurement column it appeared in, and to reset the count after the number matches, as shown below.
V1, V2, ..., Vn are the captured values.
Match1, Match2, ..., Matchn are the "count since last encountered" columns.
Note: the Matchn count is incremented regardless of which Vx column the value n is encountered in.
Any help is much appreciated.
This is somewhat related to my earlier post here.
Sample input
library(data.table)
t <- data.table(
  Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5", "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
  V1 = c(4, 2, 3, 1, 7, 22, 35, 3, 29, 36),
  V2 = c(2, 5, 12, 4, 8, 2, 38, 50, 4, 1)
)
Code for creating the sample output:
t$match1 <- c(1,2,3,4,1,2,3,4,5,1)
t$match2 <- c(1,1,2,3,4,5,1,2,3,4)
t$match3 <- c(1,2,3,1,2,3,4,5,1,2)
> t
Date V1 V2 match1 match2 match3
1: 2013-05-01 4 2 1 1 1
2: 2013-05-02 2 5 2 1 2
3: 2013-05-03 3 12 3 2 3
4: 2013-05-04 1 4 4 3 1
5: 2013-05-05 7 8 1 4 2
6: 2013-05-06 22 2 2 5 3
7: 2013-05-07 35 38 3 1 4
8: 2013-05-08 3 50 4 2 5
9: 2013-05-09 29 4 5 3 1
10: 2013-05-10 36 1 1 4 2

I think the OP has a bunch of typos in it; as far as I understand, you want this:
t <- data.table(
  Date = as.Date(c("2013-5-1", "2013-5-2", "2013-5-3", "2013-5-4", "2013-5-5", "2013-5-6", "2013-5-7", "2013-5-8", "2013-5-9", "2013-5-10")),
  V1 = c(4, 2, 3, 1, 7, 22, 35, 52, 29, 36),
  V2 = c(2, 5, 2, 4, 8, 47, 38, 50, 4, 1)
)
t[, inclusive.match.1 := 1:.N, by = cumsum(V1 == 1 | V2 == 1)]
t[, exclusive.match.1 := 1:.N, by = rev(cumsum(rev(V1 == 1 | V2 == 1)))]
t
# Date V1 V2 inclusive.match.1 exclusive.match.1
# 1: 2013-05-01 4 2 1 1
# 2: 2013-05-02 2 5 2 2
# 3: 2013-05-03 3 2 3 3
# 4: 2013-05-04 1 4 1 4
# 5: 2013-05-05 7 8 2 1
# 6: 2013-05-06 22 47 3 2
# 7: 2013-05-07 35 38 4 3
# 8: 2013-05-08 52 50 5 4
# 9: 2013-05-09 29 4 6 5
#10: 2013-05-10 36 1 1 6
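For reference, the same cumsum trick generalizes to several target values at once. Here is a hedged sketch (my own addition, not part of the original answer) that builds one inclusive-style counter per value on the answer's data; swap the grouping for rev(cumsum(rev(...))) to get the exclusive variant. The incl.match.n column names are made up for illustration.
library(data.table)
# cumsum(V1 == n | V2 == n) increments at every row where n appears, so it
# labels the stretches between matches; seq_len(.N) then counts within each stretch.
for (n in 1:3) {
  t[, (paste0("incl.match.", n)) := seq_len(.N), by = cumsum(V1 == n | V2 == n)]
}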

Related

Sampling from a secondary datatable to the main datatable based on a condition

I have two data.tables, dt_main and dt_unit.
set.seed(1)
dt_main<-data.table(ID=sample(1:20,size=10),Group=sample(1:3,size=10,replace=TRUE),Unit=0)
dt_unit<-data.table(Group=sample(1:3,size=10,replace=TRUE),Unit_id=sample(1000:3000,size=10,replace=TRUE))
dt_main looks like this:
> dt_main
ID Group Unit
1: 4 1 0
2: 7 1 0
3: 1 1 0
4: 2 2 0
5: 13 2 0
6: 19 2 0
7: 11 2 0
8: 17 3 0
9: 14 1 0
10: 3 3 0
dt_unit looks like this:
> dt_unit
Group Unit_id
1: 1 2624
2: 1 2963
3: 1 1974
4: 1 1800
5: 2 1851
6: 1 1930
7: 1 1325
8: 2 1329
9: 2 1553
10: 2 2445
I would like to fill in the Unit column in dt_main by sampling one Unit_id from the rows of dt_unit that have the same Group.
For example, for the first row in dt_main (where Group=1), the code should look up dt_unit, find the rows where Group is 1 (see below), select one Unit_id, and insert it into Unit.
> dt_unit[Group==1]
Group Unit_id
1: 1 2624
2: 1 2963
3: 1 1974
4: 1 1800
5: 1 1930
6: 1 1325
I tried something like this, which assigned the same number to each row:
dt_main[,Unit:=sample(dt_unit[Group==Group]$Unit_id,size=1)]
I also attempted sapply, but with no luck.
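(Side note: the := attempt above assigns the same value everywhere because Group == Group inside dt_unit compares the column with itself, which is all TRUE, and j is then evaluated only once for the whole table. Below is a minimal data.table sketch that avoids this by grouping dt_main by Group; it is my own sketch, not one of the answers that follow.)
library(data.table)
# Draw .N values within each Group of dt_main; sample.int avoids the sample()
# pitfall when only one Unit_id matches, and groups with no match get NA.
dt_main[, Unit := {
  v <- dt_unit$Unit_id[dt_unit$Group == .BY$Group]
  if (length(v) > 0) v[sample.int(length(v), .N, replace = TRUE)] else NA_integer_
}, by = Group]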
Here is a base R solution where we match the Groups and sample one value each time:
dt_main$Unit <- sapply(dt_main$Group, function(i) {
  v1 <- dt_unit$Unit_id[dt_unit$Group %in% i]
  if (length(v1) > 0) sample(v1, 1) else NA
})
# ID Group Unit
# 1: 4 1 1930
# 2: 7 1 1325
# 3: 1 1 1325
# 4: 2 2 1329
# 5: 13 2 2445
# 6: 19 2 2445
# 7: 11 2 1851
# 8: 17 3 NA
# 9: 14 1 1930
#10: 3 3 NA
You can join dt_main and dt_unit by Group and select a random row for each ID.
Using dplyr, you can do this by:
library(dplyr)
left_join(dt_main, dt_unit, by = 'Group') %>% group_by(ID) %>% sample_n(1)
# ID Group Unit_id
# <int> <int> <int>
# 1 1 1 1800
# 2 2 2 2445
# 3 3 3 NA
# 4 4 1 2963
# 5 7 1 1800
# 6 11 2 1851
# 7 13 2 1553
# 8 14 1 1325
# 9 17 3 NA
#10 19 2 2445
I removed the Unit column from the data.table creation.
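For completeness, a rough data.table-only equivalent of the same join-then-sample idea (my own sketch; allow.cartesian is needed because each Group matches several Unit_ids):
library(data.table)
# Join every dt_main row to all matching dt_unit rows, then keep one random
# row per ID; an ID whose Group has no match keeps a single row with NA Unit_id.
joined <- dt_unit[dt_main, on = "Group", allow.cartesian = TRUE]
joined[, .SD[sample(.N, 1)], by = ID]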
Another answer with mapply, which I used for the case with multiple conditions. In this case, while looking up, I check whether the Group columns match AND whether a new column (Size) in dt_main is larger than that of dt_unit. As the OP, I had to add another condition to the original post and am therefore adding this solution to help future users.
my_fun <- function(var1, var2) {
  d <- dt_unit[(Group %in% var1) & (Size >= var2)]
  if (nrow(d) >= 2) {
    sample(x = d$Unit_id, size = 1, replace = TRUE)
  } else {
    d$Unit_id
  }
}
vars1<-dt_main$Group
vars2<-dt_main$Size
dt_main$Unit<-mapply(my_fun,vars1,vars2)

frollsum, frollapply, etc... alternative: frollmedian?

I am using frollsum with adaptive = TRUE to calculate the rolling sum over a window of 26 weeks, where for weeks < 26 the window is exactly the number of available weeks.
Is there anything similar, but for a rolling median instead of a rolling sum? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added the fourth column, "desired"):
week product sales desired
1: 1 1 8 8
2: 2 1 8 8
3: 3 1 7 8
4: 4 1 4 8
5: 5 1 7 7.5
6: 6 1 4 7.5
7: 7 1 8 8
8: 8 1 9 and
9: 9 1 4 so
10: 10 1 7 on
11: 11 1 5 ...
12: 12 1 3
13: 13 1 8
14: 14 1 10
Here is some example code:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
data[,desired:=frollapply(sales,26,median,adaptive=TRUE)] #This only starts at week 26
Thank you very much for your help!
Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
       fifelse(is.na(x <- roll_medianr(sales, 26L)),
               c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
               x)]
or using replace instead of fifelse:
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
                            sapply(1L:25L, function(n) median(sales[1L:n])))]
output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
data:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
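If you would rather avoid the extra dependency, here is a hedged sketch of the same adaptive median in plain data.table (my own alternative, not from the answer above): for each row, take the median of the last 26 or fewer sales. Note that, like the question's code, the answer above runs over the whole sales vector; the by = product below restricts each window to one product, so drop it if you want a single series.
library(data.table)
# Expanding window of up to 26 observations, computed per product; this assumes
# rows are already ordered by week within each product.
data[, med_sales2 := vapply(seq_len(.N),
                            function(i) median(sales[max(1L, i - 25L):i]),
                            numeric(1)),
     by = product]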

Data.table selecting columns by name, e.g. using grepl

Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum columns starting with an 'x', and sum columns starting with a 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns starting with "x" or "y" by their column names, e.g. using grepl().
Could you please advise me how to do this? I think I need to use with=FALSE, but I cannot get it to work in combination with by=desc.
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by #Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is an option: grep("^x", ...) returns the column names starting with x, mget gets the column data, unlist flattens the result, and then you can calculate the sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = T)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = T))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
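A more direct route to what the question asked (grep-selected columns combined with by = desc) is to pass a character vector of names to .SDcols; this is my own sketch along those lines, computing the two sums separately and joining on desc:
library(data.table)
# Sum all x-columns and all y-columns per desc, then join the two results.
sx <- dt[, .(Sumx = sum(unlist(.SD))), by = desc,
         .SDcols = grep("^x", names(dt), value = TRUE)]
sy <- dt[, .(Sumy = sum(unlist(.SD))), by = desc,
         .SDcols = grep("^y", names(dt), value = TRUE)]
sx[sy, on = "desc"]
#    desc Sumx Sumy
# 1:    a   12   54
# 2:    b   60   50
# 3:    c   38    6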

R: Ordering one column conditionally on another and partial order value

I have this data frame of retweets:
set.seed(28100)
df <- data.frame(user_id = sample(1:8, 10, replace = TRUE),
                 timestamp = sample(1:1000, 10),
                 retweet = sample(999:1002, 10, replace = TRUE))
df <- df[with(df, order(retweet, -timestamp)),]
df
# user_id timestamp retweet
# 6 8 513 999
# 9 7 339 999
# 3 3 977 1000
# 2 3 395 1000
# 5 2 333 1000
# 4 5 793 1001
# 1 3 873 1002
# 8 2 638 1002
# 7 4 223 1002
# 10 6 72 1002
There is a unique id for each retweet. For each row I want to assign a rank to the user according to the inverse order of the chain of retweets. The rank should estimate the influence of each user: the longer the chain, the higher the points for the early tweeter. In other words, I want to rank-order each retweet chain based on the timestamp and assign higher points to those who retweeted it earlier. If two users have posted the same retweet at the same time, they should be assigned the same ranking.
Or, in df:
df$ranking <- c(1,2, 1,2,3, 1, 1,2,3,4)
aggregate(ranking~user_id, data=df, sum)
# user_id ranking
# 1 2 5
# 2 3 4
# 3 4 3
# 4 5 1
# 5 6 4
# 6 7 2
# 7 8 1
Using data.table:
library(data.table)
setDT(df)[order(-timestamp), ranking2 := seq_len(.N), by = retweet]
df[, sum(ranking2), keyby = user_id]
# user_id V1
# 1: 2 5
# 2: 3 4
# 3: 4 3
# 4: 5 1
# 5: 6 4
# 6: 7 2
# 7: 8 1
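One note: seq_len(.N) gives distinct ranks even when two timestamps tie within a chain, whereas the question asks for equal ranks on ties. A hedged variant (my own) using frank() handles that:
library(data.table)
# frank() on -timestamp gives rank 1 to the latest retweet in each chain, matching
# seq_len(.N) above, while ties.method = "dense" assigns equal timestamps the same rank.
setDT(df)[, ranking3 := frank(-timestamp, ties.method = "dense"), by = retweet]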

Data Transformation in R for Panel Regression

I really need your help with a problem that may seem easy for you to solve.
I am currently working on a project which involves some panel regressions. I have several large CSV files (up to 12 million entries per sheet) formatted as in the attached picture, where the columns (V1, V2) are individuals and the rows (1, 2, 3) are time identifiers.
In order to use the plm() function, I need to convert all these files to the following data structure:
ID Time X1 X2
1 1 x1 x2
1 2 x1 x2
1 ... ... ...
2 1 x1 x2
2 2 ... ...
I'm really struggling with this transformation and quite frustrated right now, i.e., where do I get the identifier and the time index from?
I would really appreciate any information on how to solve this problem.
If my question is not clear to you, just ask.
Best regards and thanks in advance.
Suppose your data look as follows:
mydata<-structure(list(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L
)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-4L))
> mydata
V1 V2 V3
1 10 21 31
2 11 22 32
3 12 23 3
4 13 24 34
The following code can be used for your data without changing anything. For illustration, I used just the above data. I used the base R reshape function:
long <- reshape(mydata, idvar = "time", ids = row.names(mydata),
                times = names(mydata), timevar = "id",
                varying = list(names(mydata)), v.names = "value",
                new.row.names = 1:((dim(mydata)[2]) * (dim(mydata)[1])),
                direction = "long")
> long
id value time
1 V1 10 1
2 V1 11 2
3 V1 12 3
4 V1 13 4
5 V2 21 1
6 V2 22 2
7 V2 23 3
8 V2 24 4
9 V3 31 1
10 V3 32 2
11 V3 3 3
12 V3 34 4
long$id<-substr(long$id,2,4) # 4 is used to take into account your 416 variables
myout<-long[,c(1,3,2)]
> myout
id time value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 13
5 2 1 21
6 2 2 22
7 2 3 23
8 2 4 24
9 3 1 31
10 3 2 32
11 3 3 3
12 3 4 34
Here is an alternative: Use Stacked from my "splitstackshape" package.
Here it is applied to @Metrics's sample data:
# install.packages("splitstackshape")
library(splitstackshape)
Stacked(cbind(id = 1:nrow(mydata), mydata),
        id.vars = "id", var.stubs = "V", sep = "V")
# id .time_1 V
# 1: 1 1 10
# 2: 1 2 21
# 3: 1 3 31
# 4: 2 1 11
# 5: 2 2 22
# 6: 2 3 32
# 7: 3 1 12
# 8: 3 2 23
# 9: 3 3 3
# 10: 4 1 13
# 11: 4 2 24
# 12: 4 3 34
It would be very fast if your data are large. Here are the timings for the 12 MB dataset you linked to. The sorting is different, but the data are the same.
It still isn't faster than stack, though (but at some point, stack starts to slow down).
See the system.time results below:
reshape()
system.time(out <- reshape(x, idvar = "time", ids = row.names(x),
                           times = names(x), timevar = "id",
                           varying = list(names(x)),
                           v.names = "value",
                           new.row.names = 1:prod(dim(x)),
                           direction = "long"))
# user system elapsed
# 53.11 0.00 53.11
head(out)
# id value time
# 1 V1 0.003808635 1
# 2 V1 -0.018807416 2
# 3 V1 0.008875447 3
# 4 V1 0.001148695 4
# 5 V1 -0.019365004 5
# 6 V1 0.012436560 6
Stacked()
system.time(out2 <- Stacked(cbind(id = 1:nrow(x), x),
                            id.vars = "id", var.stubs = "V",
                            sep = "V"))
# user system elapsed
# 0.30 0.00 0.29
out2
# id .time_1 V
# 1: 1 1 0.003808635
# 2: 1 10 -0.014184635
# 3: 1 100 -0.013341843
# 4: 1 101 0.006784138
# 5: 1 102 0.006463707
# ---
# 963868: 2317 95 0.009569451
# 963869: 2317 96 0.002497771
# 963870: 2317 97 0.009202519
# 963871: 2317 98 0.017007545
# 963872: 2317 99 -0.002495842
stack()
system.time(out3 <- cbind(id = 1:nrow(x), stack(x)))
# user system elapsed
# 0.09 0.00 0.09
head(out3)
# id values ind
# 1 1 0.003808635 V1
# 2 2 -0.018807416 V1
# 3 3 0.008875447 V1
# 4 4 0.001148695 V1
# 5 5 -0.019365004 V1
# 6 6 0.012436560 V1
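For completeness, here is a sketch of the same wide-to-long reshape with data.table's melt() (my own addition, assuming mydata as defined above); plm() can then use the first two columns as the individual and time indexes:
library(data.table)
# Row names become the time index, column names become the individual id.
longdt <- melt(as.data.table(mydata, keep.rownames = "time"),
               id.vars = "time", variable.name = "id", value.name = "value")
longdt[, id := sub("^V", "", id)]   # drop the "V" prefix, as in the substr() step above
setcolorder(longdt, c("id", "time", "value"))
setorder(longdt, id, time)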
