I need to put a number on the first (or a random) item in each group.
I do the following:
library(data.table)
item <- sample(c("a", "b", "c"), 30, replace = TRUE)
week <- rep(c("1", "2", "3"), 10)
volume <- 1:30
DT <- data.table(item, week, volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[, .SD[1], by = list(item, week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
sampleDT comes out correct, but the last line puts 5 in newCol for all rows instead of only the conditioned ones.
What am I doing wrong?
I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
Also, there is a pending feature request (FR) to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]
The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
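A quick way to see this (a sketch, assuming the DT built above) is to count the rows that the join-based i selects:
# every [item, week] combination in sampleDT matches every row of DT,
# so the i-expression selects all 30 rows rather than only the first row of each group
nrow(DT[DT[, .SD[1], by = list(item, week)]])
# [1] 30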
A simpler solution (I think) would be to use !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, by = c("item", "week")), newCol := 5]
For these data
library(data.table)
set.seed(42)
dat <- data.table(id=1:12, group=rep(1:3, each=4), x=rnorm(12))
> dat
id group x
1: 1 1 1.37095845
2: 2 1 -0.56469817
3: 3 1 0.36312841
4: 4 1 0.63286260
5: 5 2 0.40426832
6: 6 2 -0.10612452
7: 7 2 1.51152200
8: 8 2 -0.09465904
9: 9 3 2.01842371
10: 10 3 -0.06271410
11: 11 3 1.30486965
12: 12 3 2.28664539
My goal is to get, from each group, the first id for which x is larger than some threshold, say x>1.5.
> dat[x>1.5, .SD[1], by=group]
group id x
1: 2 7 1.511522
2: 3 9 2.018424
is indeed correct, but I am unhappy about the fact that it silently yields no result for group 1. Instead, for a group in which no row fulfills the condition, I would like it to yield the last id of that group. I see that I could achieve this in two steps
> tmp <- dat[x>1.5, .SD[1], by=group]
> rbind(tmp,dat[!group%in%tmp$group,.SD[.N], by=group])
group id x
1: 2 7 1.5115220
2: 3 9 2.0184237
3: 1 4 0.6328626
but I am sure I am not making full use of the data.table capabilities here, which must permit a more elegant solution.
Using data.table, we can check for a condition and subset rows by group.
library(data.table)
dat[dat[, if(any(x>1.5)) .I[which.max(x > 1.5)] else .I[.N], by=group]$V1]
# id group x
#1: 4 1 0.6328626
#2: 7 2 1.5115220
#3: 9 3 2.0184237
The dplyr translation of that would be
library(dplyr)
dat %>%
  group_by(group) %>%
  slice(if (any(x > 1.5)) which.max(x > 1.5) else n())
Or more efficiently
dat[, .SD[{temp = x > 1.5; if (any(temp)) which.max(temp) else .N}], by = group]
Thanks to #IceCreamTouCan, #sindri_baldur and #jangorecki for their valuable suggestions to improve this answer.
Another option is:
dat[x>1.5 | group!=shift(group, -1L), .SD[1L], .(group)]
You could subset both ways (which are optimized by GForce) and then combine them:
D1 = dat[x>1.5, lapply(.SD, first), by=group]
D2 = dat[, lapply(.SD, last), by=group]
rbind(D1, D2[!D1, on=.(group)])
group id x
1: 2 7 1.5115220
2: 3 9 2.0184237
3: 1 4 0.6328626
There is some inefficiency here since we are grouping by group three times. I'm not sure if that will be outweighed by more efficient calculations in j thanks to GForce or not. #jangorecki points out that the inefficiency of grouping three times might be mitigated by setting the key first.
Comment: I used lapply(.SD, last) since .SD[.N] is not yet optimized and last(.SD) throws an error. I changed the OP's code to use lapply(.SD, first) for the sake of symmetry.
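A minimal sketch of that keyed variant (the only change from the code above is setting the key on group once up front):
setkey(dat, group)   # sort once; the three grouped subsets then reuse the sorted order
D1 = dat[x > 1.5, lapply(.SD, first), by = group]
D2 = dat[, lapply(.SD, last), by = group]
rbind(D1, D2[!D1, on = .(group)])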
I have a data table (data) which looks like the following.
rn peoplecount
1 0,2,0,1
2 1,1,0,0
3 0,1,0,5
4 5,3,0,2
5 2,2,0,1
6 1,2,0,3
7 0,1,0,0
8 0,2,0,8
9 8,2,0,0
10 0,1,0,0
My goal is to find all records whose 1st element does not match the 4th element of the previous row. In this example, the 7th row meets the criterion. How can I get a list of all such records?
My attempt so far.
data[, previous_peoplecount:= c(NA, peoplecount[shift(seq_along(peoplecount), fill = 0)])]
This gives a new table as follows:
rn peoplecount previous_peoplecount
1 0,2,0,1 NA
2 1,1,0,0 0,2,0,1
3 0,1,0,5 1,1,0,0
4 5,3,0,2 0,1,0,5
5 2,2,0,1 5,3,0,2
6 1,2,0,3 0,2,0,1
7 0,1,0,0 1,2,0,3
8 0,2,0,8 0,1,0,0
9 8,2,0,0 0,2,0,8
10 0,1,0,0 8,2,0,0
Now I have to fetch all records where the 1st element of peoplecount is not equal to the 4th element of previous_peoplecount. I am stuck at this part. Any suggestions?
Edit: peoplecount is a list of numerics.
You can try something along the lines of removing all but the first value and all but the last value, and comparing, i.e.
library(data.table)
setDT(dt)[, first_pos := sub(',.*', '', peoplecount)][,
last_pos_shifted := sub('.*,', '', shift(peoplecount))][
first_pos != last_pos_shifted,]
which gives,
rn peoplecount first_pos last_pos_shifted
1: 7 0,1,0,0 0 3
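A minimal sketch of the same comparison working directly on a list column of numerics (per the OP's edit) rather than on strings; the mismatch column is a helper introduced here:
# compare each row's 1st element with the previous row's 4th element;
# the first row drops out because its comparison is NA
data[, mismatch := sapply(peoplecount, `[`, 1) !=
       shift(sapply(peoplecount, `[`, 4))][mismatch == TRUE]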
I would convert to long format and then select the elements of interest:
dt <- data.table(rn = 1:3, x = lapply(1:3, function(x) x:(x+3)))
dt$x[[2]] <- c(4, 1, 1, 1)
dt
# rn x
# 1: 1 1,2,3,4
# 2: 2 4,1,1,1
# 3: 3 3,4,5,6
# convert to long format
dt2 <- dt[, .(rn = rep(rn, each = 4), x = unlist(x))]
dt2[, id:= 1:4]
dtSelected <- dt2[id == 1 & x == shift(x)]
dtSelected
# rn x id
# 1: 2 4 1
dt[dtSelected$rn]
# rn x
# 1: 2 4,1,1,1
I was not satisfied with the answers and came up with my own solution as follows:
# first element of each row
h <- sapply(data$peoplecount, function(x) x[1])
# fourth element of each row
t <- sapply(data$peoplecount, function(x) x[4])
# compare each row's first element with the previous row's fourth element;
# the result is offset by one, i.e. indices + 1 gives the matching row numbers
indices <- which(head(t, -1) != tail(h, -1))
Thanks to #Sotos and #minem for pushing me in the right direction.
I have a question relating to data.table in R.
I am working with acceleration data and need to generate features from the raw data. I want to group the data into 2-second windows. Non-overlapping windows are easy: generate one more column to indicate the group for each 2-second window and group with by.
However, I want overlapping windows. For example, my raw data is this:
a=data.table(x = c(1:10), y= c(2:11), z = c(5), second=rep(c(1:5),each=2))
x y z second
1: 1 2 5 1
2: 2 3 5 1
3: 3 4 5 2
4: 4 5 5 2
5: 5 6 5 3
6: 6 7 5 3
7: 7 8 5 4
8: 8 9 5 4
9: 9 10 5 5
10: 10 11 5 5
Now I want to calculate the mean of the x, y and z columns over each overlapping 2-second window: seconds 1 and 2, 2 and 3, 3 and 4, 4 and 5.
I could run for loops, but since I have a huge dataset that would take a long time. Do you know how to do it with just data.table tools?
Thanks so much.
Here's another way:
ag = data.table(
second = c(1:2, 2:3, 3:4, 4:5),
g = rep(paste(1:4, 2:5, sep="-"), each=2)
)
a[ag, on="second"][, mean(unlist(.SD)), by=g, .SDcols=x:z]
# g V1
# 1: 1-2 3.666667
# 2: 2-3 5.000000
# 3: 3-4 6.333333
# 4: 4-5 7.666667
I'm sure you could write ag less manually, but it's not clear to me what the rules behind it are.
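For what it's worth, a hedged sketch of building ag programmatically, assuming the rule is simply "every pair of consecutive seconds":
secs <- sort(unique(a$second))
ag <- data.table(
  second = as.vector(rbind(head(secs, -1), tail(secs, -1))),          # 1,2, 2,3, 3,4, 4,5
  g = rep(paste(head(secs, -1), tail(secs, -1), sep = "-"), each = 2)
)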
Generally, if you are computing statistics across columns, then your data is not well-formatted. If you have time, I'd suggest reading about making data "tidy".
As there are only 2 observations for each value of 'second', we take the lead (by 2 rows) of the 'x', 'y', 'z' columns, unlist the Subset of Data.table (.SD) and get the mean.
nm1 <- c("x", "y", "z")
na.omit(a[, paste0(nm1, 2) := lapply(.SD, function(x) shift(x, 2,
type = "lead")), .SDcols = nm1])[, .(Mean = mean(unlist(.SD))),
.(second = paste0(second, "-", second + 1))]
# second Mean
#1: 1-2 3.666667
#2: 2-3 5.000000
#3: 3-4 6.333333
#4: 4-5 7.666667
Or a slightly more compact option would be
library(dplyr)
cbind(a[second!= last(second)], a[second!= first(second)])[
,.(Mean = mean(unlist(.SD))), .(second = paste0(second, "-", second+1))]
# second Mean
#1: 1-2 3.666667
#2: 2-3 5.000000
#3: 3-4 6.333333
#4: 4-5 7.666667
Another option would be to place them in a list, rbind the datasets, create a new 'id1' column, and then either get the mean after unlisting the .SDcols or get the individual mean of each column.
dt1 <- rbindlist(list(a[second!= last(second)],
a[second!= first(second)]), idcol=TRUE)[, id1:= as.numeric(gl(.N, 2, .N)), .id][]
Get the mean for each column by 'second'
dt1[, lapply(.SD, mean), .(second = paste0(id1, "-", id1 + 1)), .SDcols = x:z]
Get the overall mean by 'second'
dt1[, mean(unlist(.SD)), .(second = paste0(id1, "-", id1 +1)), .SDcols = x:z]
I have the following data.table.
ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b
I want to subset this data.table into two. The criterion is to have approximately the first half of each group (grouped by column "id") in one data.table and the remainder in another data.table. So the expected result is two data.tables, as follows:
ts,id
1,a
2,a
3,a
4,a
1,b
2,b
and
ts,id
5,a
6,a
7,a
3,b
4,b
I tried the following:
z1 = x[, .SD[.I < .N/2, ], by = id]
z1
and got just the following
id ts
a 1
a 2
a 3
Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated.
Thanks in advance.
.I gives the row locations with respect to the whole data.table. Thus it can't be used like that within .SD.
Something like
DT[, subset := seq_len(.N) > .N/2,by='id']
subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]
subset1
# ts id
# 1: 4 a
# 2: 5 a
# 3: 6 a
# 4: 7 a
# 5: 3 b
# 6: 4 b
subset2
# ts id
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 1 b
# 5: 2 b
Should work
To split into more than 2 pieces, you could use cut to create a factor with the appropriate number of levels
Something like
DT[, subset := cut(seq_len(.N), 3, labels= FALSE),by='id']
# you could copy to the global environment a subset for each, but this
# will not be memory efficient!
list2env(setattr(split(DT, DT[['subset']]),'names', paste0('s',1:3)), .GlobalEnv)
Here's the corrected version of your expression:
dt[, .SD[, .SD[.I <= .N/2]], by = id]
# id ts
#1: a 1
#2: a 2
#3: a 3
#4: b 1
#5: b 2
The reason yours is not working is that .I and .N are not available in the i-expression (i.e. the first argument of [), so the parent data.table's .I and .N are used instead (i.e. dt's).
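An equivalent sketch that uses .I where it is available, in j, and then subsets the whole table with the collected row numbers:
dt[dt[, .I[seq_len(.N) <= .N/2], by = id]$V1]
# ts id
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 1 b
# 5: 2 b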
I have a data table similar to the one obtained with the following command:
dt <- data.table(
time = 1:8,
part = rep(c(1, 1, 2, 2), 2),
type = rep(c('A', 'B'), 4),
data = rep(c(runif(1), 0), 4))
Basically, such a table contains two different types of instances (A or B). The time column contains a timestamp for when a request arrived at or left a certain part. If the instance type is A, the timestamp states the arrival time (enter), and if the type is B, the timestamp states the leaving time (exit).
time part type data
1: 1 1 A 0.5842668
2: 2 1 B 0.0000000
3: 3 2 A 0.5842668
4: 4 2 B 0.0000000
5: 5 1 A 0.5842668
6: 6 1 B 0.0000000
7: 7 2 A 0.5842668
8: 8 2 B 0.0000000
I would like to pair A and B instances, and obtain the following data table:
part data enter.time exit.time
1: 1 0.4658239 1 2
2: 1 0.4658239 5 6
3: 2 0.4658239 3 4
4: 2 0.4658239 7 8
I have tried the following:
pair.types <- function(x) {
  a.type <- x[type == 'A']
  b.type <- x[type == 'B']
  return(data.table(
    enter.time = a.type$time,
    exit.time = b.type$time,
    data = a.type$data))
}
dt[, c('enter.time', 'exit.time', 'data') := pair.types(.SD), by = list(part)]
But, that gives me the following, which is not exactly what I want:
time part type data enter.time exit.time
1: 1 1 A 0.3441592 1 2
2: 2 1 B 0.3441592 5 6
3: 3 2 A 0.3441592 3 4
4: 4 2 B 0.3441592 7 8
5: 5 1 A 0.3441592 1 2
6: 6 1 B 0.3441592 5 6
7: 7 2 A 0.3441592 3 4
8: 8 2 B 0.3441592 7 8
It is kind of close, but since the column 'type' is kept, some rows are duplicated. Perhaps I could remove the columns 'time' and 'type', and then remove the second half of the rows. But I am not sure whether that will work in all cases, and I would like to learn a better way to do this operation.
Assuming your data looks like your example data:
dt[, list(part = part[1],
          data = data[1],
          enter.time = time[1],
          exit.time = time[2]),
   by = as.integer((seq_len(nrow(dt)) + 1)/2)]
# by = rep(seq(1, nrow(dt), 2), each = 2)]
# ^^^ a slightly shorter and a little more readable alternative
The idea is very simple: group the rows in groups of 2 (that's the by part), i.e. each group will be one A and one B; then for each group take the first part and the first data, and the enter and exit times are just the first and second time respectively. This is likely how you'd do it if you followed the by-hand logic, which makes it easy to read (once you know just a tiny bit about how data.table works).
Another way:
setkey(dt, "type")
dt.out <- cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
# part data entry.time exit.time
# 1: 1 0.1294204 1 2
# 2: 2 0.1294204 3 4
# 3: 1 0.1294204 5 6
# 4: 2 0.1294204 7 8
If you want you can now do setkey(dt.out, "part") to get the same order.
The idea: your problem seems a simple "reshaping" one to me. The way I've approached it is to first create a key column on type. Now we can subset the data.table for a specific value of the key column with dt[J("A")]. This returns all rows with type "A", with every column. Since you want the column time renamed, I explicitly mention which columns to subset using:
dt[J("A"), list(part, data, entry.time = time)]
Of course this will also return the type column (= "A"), which we have to remove. So I've added [, type := NULL] to remove the column type by reference.
Now we have the first part. All we need is the exit.time. This can be obtained similarly as:
dt[J("B"), list(time)] # I don't name the column here
But this gives a data.table, whereas you need just the time column, which can be accessed by:
dt[J("B"), list(time)]$time
So, while using cbind I name this column as exit.time to get the final result as:
cbind(dt[J("A"), list(part, data, entry.time = time)][, type := NULL],
exit.time = dt[J("B"), list(time)]$time)
Hope this helps.
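Since the problem is essentially a reshape, one more hedged sketch using dcast (assuming each part has matching A/B rows in chronological order, as in the example; the pair column is a helper introduced here):
dt[, pair := rowid(part, type)]                     # 1st, 2nd, ... occurrence of each type within a part
wide <- dcast(dt, part + pair ~ type, value.var = "time")
setnames(wide, c("A", "B"), c("enter.time", "exit.time"))
wide[dt[type == "A"], data := i.data, on = c("part", "pair")]   # carry over data from the A rows
wide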