How to group in data.table with overlapping value? - r

I have a question about data.table in R.
I am working with acceleration data and need to generate features from the raw data. I want to group the data into 2-second windows. For non-overlapping windows this is easy: add one more column that marks each 2-second group and group on it with by.
However, I want overlapping windows. For example, my raw data is this:
a=data.table(x = c(1:10), y= c(2:11), z = c(5), second=rep(c(1:5),each=2))
x y z second
1: 1 2 5 1
2: 2 3 5 1
3: 3 4 5 2
4: 4 5 5 2
5: 5 6 5 3
6: 6 7 5 3
7: 7 8 5 4
8: 8 9 5 4
9: 9 10 5 5
10: 10 11 5 5
Now I want to calculate the mean of the x, y and z columns over each pair of consecutive seconds: 1 and 2, 2 and 3, 3 and 4, 4 and 5.
I could use a for loop, but since I have a huge dataset it would take a long time. Do you know how to do it with just data.table tools?
Thanks so much

Here's another way:
ag = data.table(
second = c(1:2, 2:3, 3:4, 4:5),
g = rep(paste(1:4, 2:5, sep="-"), each=2)
)
a[ag, on="second"][, mean(unlist(.SD)), by=g, .SDcols=x:z]
# g V1
# 1: 1-2 3.666667
# 2: 2-3 5.000000
# 3: 3-4 6.333333
# 4: 4-5 7.666667
I'm sure you could write ag less manually, but it's not clear to me what the rules behind it are.
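If the rule is simply "every pair of consecutive seconds", ag can be generated rather than typed out. A minimal sketch under that assumption, which also assumes the seconds are consecutive integers (the helper names secs and starts are made up for illustration):

```r
library(data.table)

a <- data.table(x = 1:10, y = 2:11, z = 5, second = rep(1:5, each = 2))

# Assumption: a window is a pair of consecutive seconds (1-2, 2-3, ...)
secs   <- sort(unique(a$second))
starts <- head(secs, -1)                 # first second of each window

# One row per (window, second) pair; c(rbind(...)) interleaves start and start + 1
ag <- data.table(
  second = c(rbind(starts, starts + 1L)),
  g      = rep(paste(starts, starts + 1L, sep = "-"), each = 2)
)

# Same join-then-aggregate step as in the answer
res <- a[ag, on = "second"][, .(V1 = mean(unlist(.SD))),
                            by = g, .SDcols = c("x", "y", "z")]
res
```

For wider windows the same idea should carry over by interleaving more than two offsets, at the cost of a larger intermediate join.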
Generally, if you are computing statistics across columns, then your data is not well-formatted. If you have time, I'd suggest reading about making data "tidy".

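As there are exactly 2 observations for each 'second', we take the lead of the 'x', 'y', 'z' columns shifted by 2 rows, so that each row is paired with the corresponding row of the next second. na.omit then drops the rows of the last second (which have no successor), and we unlist the Subset of Data.table (.SD) and get the mean, grouped by the window label. copy(a) keeps := from adding the helper columns to 'a' itself, so the later snippets still see the original data.
nm1 <- c("x", "y", "z")
na.omit(copy(a)[, paste0(nm1, 2) := lapply(.SD, function(x) shift(x, 2,
type = "lead")), .SDcols = nm1])[, .(Mean = mean(unlist(.SD))),
.(second = paste0(second, "-", second + 1))]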
# second Mean
#1: 1-2 3.666667
#2: 2-3 5.000000
#3: 3-4 6.333333
#4: 4-5 7.666667
Or a slightly more compact option would be
library(dplyr)
cbind(a[second!= last(second)], a[second!= first(second)])[
,.(Mean = mean(unlist(.SD))), .(second = paste0(second, "-", second+1))]
# second Mean
#1: 1-2 3.666667
#2: 2-3 5.000000
#3: 3-4 6.333333
#4: 4-5 7.666667
Another option is to place the two subsets in a list, rbind them with rbindlist, and create a new 'id1' column that numbers the windows. We can then get either the overall mean after unlisting the .SDcols, or the individual mean of each column.
dt1 <- rbindlist(list(a[second!= last(second)],
a[second!= first(second)]), idcol=TRUE)[, id1:= as.numeric(gl(.N, 2, .N)), .id][]
Get the mean for each column by 'second'
dt1[, lapply(.SD, mean), .(second = paste0(id1, "-", id1 + 1)), .SDcols = x:z]
Get the whole mean by 'second'
dt1[, mean(unlist(.SD)), .(second = paste0(id1, "-", id1 +1)), .SDcols = x:z]

Related

Row operations on selected columns based on substring in data.table

I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.
Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I sum the values from all columns starting with string v and subtract from the average of the values in all columns starting with f.
Update: the proposed solution must (a) use the := operator to make the most of data.table's fast performance, and (b) be flexible enough to work with operations other than mean and sum, which I used here just for the sake of simplicity.
library(data.table)
# generate data
dt <- data.table(id= letters[1:5],
v1= 1:5,
v2= 1:5,
f1= 11:15,
f2= 11:15)
dt
#> id v1 v2 f1 f2
#> 1: a 1 1 11 11
#> 2: b 2 2 12 12
#> 3: c 3 3 13 13
#> 4: d 4 4 14 14
#> 5: e 5 5 15 15
# what I've tried
dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]
We melt the dataset into 'long' format, by making use of the measure argument, get the difference between the sum of 'v' and mean of 'f', grouped by 'id', join on the 'id' column with the original dataset and assign (:=) the 'V1' as the 'Y' variable
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
, sum(v) - mean(f), id], Y :=V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Or another option is with Reduce, after creating an index of the 'v' and 'f' columns
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
l1 <- length(nmf) # number of 'f' columns, to turn their sum into a mean
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE])- (Reduce(`+`, .SD[, nmf, with = FALSE])/l1)]
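rowSums and rowMeans combined with grep can accomplish this. Note that, as written, the line below computes the mean of 'f' minus the sum of 'v', i.e. the opposite sign to the answers above, and that an unanchored grep matches any column name containing the letter.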
dt$Y <- rowMeans(dt[,grep("f", names(dt)),with=FALSE]) - rowSums(dt[,grep("v", names(dt)),with=FALSE])
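For the := requirement stated in the question, the same rowSums/rowMeans idea can be written as an update by reference. A minimal sketch, assuming the prefixes are anchored with ^ and using the illustrative names v_cols and f_cols; other vectorised row-wise functions could be substituted for rowSums and rowMeans:

```r
library(data.table)

dt <- data.table(id = letters[1:5], v1 = 1:5, v2 = 1:5, f1 = 11:15, f2 = 11:15)

v_cols <- grep("^v", names(dt), value = TRUE)   # columns starting with "v"
f_cols <- grep("^f", names(dt), value = TRUE)   # columns starting with "f"

# Step 1: row-wise sum of the 'v' columns, assigned by reference
dt[, Y := rowSums(.SD), .SDcols = v_cols]
# Step 2: subtract the row-wise mean of the 'f' columns from that sum
dt[, Y := Y - rowMeans(.SD), .SDcols = f_cols]
dt
```

Splitting the update into two steps keeps each .SD restricted to one set of columns, which avoids indexing into .SD by position.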

R Given a list of same dimension data tables, produce a summary of the means of each cell

I'm finding it hard to put what I want into words so I will try to run through an example to explain it. Let's say I've repeated an experiment twice and have two tables:
[df1]    [df2]
X Y      X Y
2 3      4 1
5 2      2 4
These tables are stored in a list (which can contain more than two elements if necessary), and what I want to do is create an average of each cell in the tables across the list (or, for a generalised version, apply any function I choose to the cells, e.g. mad, sd, etc.)
[df1]    [df2]    [dfMeans]
X Y      X Y      X              Y
2 3      4 1      mean(c(2,4))   mean(c(3,1))
5 2      2 4      mean(c(5,2))   mean(c(2,4))
I have a code solution to my problem, but since this is in R there is most likely a cleaner way to do things:
df1 <- data.frame(X=c(2,3,4),Y=c(3,2,1))
df2 <- data.frame(X=c(5,1,3),Y=c(4,1,4))
df3 <- data.frame(X=c(2,7,4),Y=c(1,7,6))
dfList <- list(df1,df2,df3)
dfMeans <- data.frame(MeanX=c(NA,NA,NA),MeanY=c(NA,NA,NA))
for (rowIndex in 1:nrow(df1)) {
  for (colIndex in 1:ncol(df1)) {
    valuesAtCell <- c()
    for (tableIndex in 1:length(dfList)) {
      valuesAtCell <- c(valuesAtCell, dfList[[tableIndex]][rowIndex,colIndex])
    }
    dfMeans[rowIndex, colIndex] <- mean(valuesAtCell)
  }
}
print(dfMeans)
Here is a data.table solution where the mean is applied row-wise across the data frames:
library(data.table)
dtList <- rbindlist(dfList, use.names = TRUE, idcol = TRUE)
dtList
.id X Y
1: 1 2 3
2: 1 3 2
3: 1 4 1
4: 2 5 4
5: 2 1 1
6: 2 3 4
7: 3 2 1
8: 3 7 7
9: 3 4 6
dtList[, rn := 1:.N, by = .id][][, .(X = mean(X), Y = mean(Y)), by = rn]
rn X Y
1: 1 3.000000 2.666667
2: 2 3.666667 3.333333
3: 3 3.666667 3.666667
You can replace the mean by another aggregation function, eg, median. The .id column numbers the original data frames each row was sourced from.
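As a sketch of that substitution, here is the same pipeline with median in place of mean, using the data from the question. Any function that returns a single value per cell should slot in the same way:

```r
library(data.table)

df1 <- data.frame(X = c(2, 3, 4), Y = c(3, 2, 1))
df2 <- data.frame(X = c(5, 1, 3), Y = c(4, 1, 4))
df3 <- data.frame(X = c(2, 7, 4), Y = c(1, 7, 6))
dfList <- list(df1, df2, df3)

dtList <- rbindlist(dfList, use.names = TRUE, idcol = TRUE)
dtList[, rn := seq_len(.N), by = .id]   # row number within each source table

# Aggregate the cells that share a row number across the source tables
res <- dtList[, .(X = median(X), Y = median(Y)), by = rn]
res
```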
Edit
The solution can be extended to an arbitrary number of columns (provided column names and column order are identical in all data frames):
cn <- colnames(df1)
cn
[1] "X" "Y"
dtList[, rn := 1:.N, by = .id][, lapply(.SD, mean), by = rn, .SDcols = cn][, rn := NULL][]
X Y
1: 3.000000 2.666667
2: 3.666667 3.333333
3: 3.666667 3.666667
The column names are taken from one of the original data frames, which adds to the flexibility of the solution. [, rn := NULL] removes the row numbers from the result, [] ensures the result is printed.
You could simply sum all the data frames in your list using Reduce(), and divide by the length of dfList, which is equal to the number of data frames it contains.
Reduce(`+`, dfList) / length(dfList)
# X Y
#1 3.000000 2.666667
#2 3.666667 3.333333
#3 3.666667 3.666667
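Reduce with `+` only covers statistics that can be built from sums. For an arbitrary function (median, sd, mad, ...), one possible sketch, assuming all tables are numeric and share the same dimensions, is to stack them into a 3-D array and apply over the first two dimensions (arr, cellMean and cellMedian are illustrative names):

```r
df1 <- data.frame(X = c(2, 3, 4), Y = c(3, 2, 1))
df2 <- data.frame(X = c(5, 1, 3), Y = c(4, 1, 4))
df3 <- data.frame(X = c(2, 7, 4), Y = c(1, 7, 6))
dfList <- list(df1, df2, df3)

# rows x columns x tables
arr <- simplify2array(lapply(dfList, as.matrix))

# Apply a function to each cell across the third (table) dimension
cellMean   <- apply(arr, c(1, 2), mean)
cellMedian <- apply(arr, c(1, 2), median)
cellMedian
```

The result is a matrix rather than a data frame; as.data.frame() converts it back if needed.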

Use of lapply .SD in data.table R

I am not very clear about use of .SD and by.
For instance, does the below snippet mean: 'change all the columns in DT to factor except A and B?' It also says in data.table manual: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)" - so columns A and B are excluded?
DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]
However, I also read that by means like 'group by' in SQL when you do aggregation. For instance, if I would like to sum (like colsum in SQL) over all the columns except A and B do I still use something similar? Or in this case, does the below code mean to take the sum and group by values in columns A and B? (take sum and group by A,B as in SQL)
DT[,lapply(.SD,sum),by=.(A,B)]
Then how do I do a simple colsum over all the columns except A and B?
Just to illustrate the comments above with an example, let's take
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
A few notes:
You said "does the below snippet... change all the columns in DT..."
The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
You mentioned wanting to change the column types
Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, which is done with data.frames in base, is to overwrite the old data.table with the new data.table you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].
This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
You mentioned comparing efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can note this is true from the output of adding the verbose = TRUE argument within []); to see this in action, let's beef up your DT a bit and run a benchmark:
Results:
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100
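To tie the notes above back to the original question (convert every column except A and B), here is a small sketch of the by-reference update restricted to a subset of columns. The toy table and the name cols are assumptions for illustration:

```r
library(data.table)

DT <- data.table(A = 1:3, B = 4:6, C = c("u", "v", "u"), D = c(10, 20, 10))

# Every column except the two id columns
cols <- setdiff(names(DT), c("A", "B"))

# (cols) on the left-hand side tells := to use the names stored in cols;
# .SDcols restricts .SD to the same columns, so A and B are left untouched
DT[, (cols) := lapply(.SD, as.factor), .SDcols = cols]
sapply(DT, class)
```

No by is needed here: by groups rows, whereas .SDcols selects columns.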

How to group by column?

I have a data frame of student scores. Instead of getting an overall average score for every student, I need the average score by "course type" for every student; for example, courses a, c, d are one type, and courses b, e are another. I do this with the following code, but it is not "R" enough:
x <- data.frame(a=c(1,2,3), b=c(4,5,6), c=c(6,7,8),
d=c(7,8,9), e=c(10, 11, 12))
group <- data.frame(no=c(1,2,1,1,2), name=c("a", "b", "c", "d","e"))
> x
a b c d e
1 1 4 6 7 10
2 2 5 7 8 11
3 3 6 8 9 12
> group
no name
1 1 a
2 2 b
3 1 c
4 1 d
5 2 e
I think this is somewhat clumsy:
x.1 <- x[,as.character(group$name[group$no==1])]
x.2 <- x[,as.character(group$name[group$no==2])]
mean.by.no <- data.frame(x.1.mean=apply(x.1, 1, mean),
x.2.mean=apply(x.2, 1, mean))
If mean.by.no is the expected result, we could split the 'name' column by 'no' ('group' dataset) to get a list. Using one of the apply family functions (lapply/sapply/vapply), we can use each list element as a column index for 'x', and get the mean for each row (rowMeans).
vapply(with(group, split(as.character(name), no)),
function(y) rowMeans(x[y]), numeric(nrow(x)))
# 1 2
#[1,] 4.666667 7
#[2,] 5.666667 8
#[3,] 6.666667 9
Or using tapply, we can get the mean using grouping index for row and column.
indx <- xtabs(no~name, group)[col(x)]
t(tapply(as.matrix(x), list(indx, row(x)), FUN=mean))
# 1 2
#1 4.666667 7
#2 5.666667 8
#3 6.666667 9
Or another option would be to convert 'x' from 'wide' to 'long' format using melt from data.table, after converting the 'data.frame' to 'data.table' (setDT). Set the key column as 'name' (setkey(..)), join with 'group', and get the mean grouped by 'no' and 'rn' (row number column created by keep.rownames=TRUE). If needed, the output can be converted back to 'wide' format using dcast.
library(data.table)#v1.9.5+
dL <- setkey(melt(setDT(x, keep.rownames=TRUE), id.var='rn',
variable.name='name')[, name:= as.character(name)],
name)[group[2:1]][,mean(value) , by=list( no, rn)]
dcast(dL, rn~paste0('mean',no), value.var='V1')[,rn:=NULL][]
# mean1 mean2
#1: 4.666667 7
#2: 5.666667 8
#3: 6.666667 9
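The melt, join, aggregate, dcast steps described above can also be spelled out one at a time. A sketch that uses an update join with on= instead of setting a key; the intermediate names xdt, grp, long and agg are assumptions for illustration:

```r
library(data.table)

x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(6, 7, 8),
                d = c(7, 8, 9), e = c(10, 11, 12))
group <- data.frame(no = c(1, 2, 1, 1, 2), name = c("a", "b", "c", "d", "e"))

xdt <- as.data.table(x)[, rn := .I]                       # rn = student (row) number
grp <- as.data.table(group)[, name := as.character(name)]

# wide -> long: one row per (student, course)
long <- melt(xdt, id.vars = "rn", variable.name = "name",
             value.name = "score", variable.factor = FALSE)

# attach the course type to each row by an update join on the course name
long[grp, on = "name", no := i.no]

# mean per student and course type, then back to wide
agg <- long[, .(m = mean(score)), by = .(rn, no)][, type := paste0("mean", no)]
res <- dcast(agg, rn ~ type, value.var = "m")
res
```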
There's probably a more elegant way of doing this, but (note that this gives one mean per course type over all students, not one per student):
library(reshape)
library(plyr)
x <- data.frame(a=c(1,2,3), b=c(4,5,6), c=c(6,7,8), d=c(7,8,9), e=c(10, 11, 12))
group <- data.frame(no=c(1,2,1,1,2), name=c("a", "b", "c", "d","e"))
a<-melt(x)
names(a)<-c("name", "score")
b<-merge(a, group, by="name")
c<-ddply(b, c("no"), summarize, meanscore=mean(score))
c
> c
no meanscore
1 1 5.666667
2 2 8.000000

How to Enter data for only conditioned rows on data table

I need to put a number on the first (or a random) item in each group.
I do the following:
item<-sample(c("a","b", "c"), 30,replace=T)
week<-rep(c("1","2","3"),10)
volume<-c(1:30)
DT<-data.table(item, week,volume)
setkeyv(DT, c("item", "week"))
sampleDT <- DT[,.SD[1], by= list(item,week)]
item week volume newCol
1: a 1 1 5
2: a 2 14 5
3: a 3 6 5
4: b 1 13 5
5: b 2 2 5
6: b 3 9 5
7: c 1 7 5
8: c 2 5 5
9: c 3 3 5
DT[DT[,.SD[1], by= list(item,week)], newCol:=5]
The sampleDT comes out correct, but the last line puts 5 on all rows instead of only the selected ones.
What am I doing wrong?
I think you want to do this instead:
DT[DT[, .I[1], by = list(item, week)]$V1, newCol := 5]
Your version doesn't work because the join that you have results in the full data.table.
Also there is a pending FR to make the syntax simpler:
# won't work now, but maybe in the future
DT[, newCol[1] := 5, by = list(item, week)]
The problem with your command is that it is finding rows in the original data.table that have combinations of the keys [item, week] that you found in sampleDT. Since sampleDT includes all combinations of [item, week], you get the whole data.table back.
A simpler solution (I think) would be using !duplicated() to retrieve the first instance of each [item, week] combination:
DT[!duplicated(DT, by = c("item", "week")), newCol := 5]
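The question also mentions a random item per group. The .I approach shown earlier extends to that case by sampling one row number within each group. A minimal sketch with regenerated toy data (the seed and the name idx are only for illustration):

```r
library(data.table)

set.seed(42)  # only to make the random choice reproducible
DT <- data.table(item   = sample(c("a", "b", "c"), 30, replace = TRUE),
                 week   = rep(c("1", "2", "3"), 10),
                 volume = 1:30)

# .I holds row numbers in DT; pick one at random within each (item, week) group
idx <- DT[, .I[sample(.N, 1L)], by = .(item, week)]$V1
DT[idx, newCol := 5]

# Check: exactly one flagged row per group
cnt <- DT[, .(flagged = sum(!is.na(newCol))), by = .(item, week)]
cnt
```

Indexing .I with sample(.N, 1L) rather than calling sample(.I, 1L) avoids the surprise that sample() treats a single number n as the range 1:n when a group has only one row.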
