Computing squared residual in regression row-wise in a data.table - r

Suppose I have a data.table with columns X1, X2, X3, Y. For each row, I would like to treat the entries in X1, X2, X3 as a vector of length 3, take the inner product with a fixed vector, say beta of length 4, subtract the result from the entry in Y, square the result, and either output the final result for every row or save it as another column.
After much research, I came up with this
dat[, (Y-sum(.SD*beta))^2, .SDcols=c(1:3)]
which does not work as expected.
Bonus point #1: Doing this with 3 replaced by a general n.
Bonus point #2: Suppose I have a grp column with group indices and I want to average these residual squares by group.

Assuming y is the first column of your data.table dat and the remaining columns are predictors, this works for bonus 1:
mat = as.matrix(dat[, x1:x3, with = F])
pred = cbind(1, mat) %*% beta
dat[, rss := (pred - y)^2]
For bonus 2:
dat[, mean_by_grp := mean(rss), by = grp]
To avoid the matrix conversion, you could do this:
dat[, pred := beta[1] + beta[2] * x1 + beta[3] * x2 + beta[4] * x3]
writing out the inner product.
Complete reproducible example
set.seed(47)
dat = data.table(replicate(4, rnorm(5)))
setnames(dat, c("y", paste0("x", 1:3)))
dat[, grp := c("A", "A", "B", "B", "B")]
beta = 1:4
mat = as.matrix(dat[, x1:x3, with = F])
pred = cbind(1, mat) %*% beta
dat[, rss := (pred - y) ^ 2]
dat[, mean_by_grp := mean(rss), by = grp]
dat
#             y          x1          x2          x3 grp       rss mean_by_grp
# 1:  1.9946963 -1.08573747 -0.92245624  0.67077922   A 10.565250    7.064164
# 2:  0.7111425 -0.98548216  0.03960243 -0.08107805   A  3.563078    7.064164
# 3:  0.1854053  0.01513086  0.49382018  1.26424109   B 54.512843   38.263204
# 4: -0.2817650 -0.25204590 -1.82822917 -0.70338819   B 56.558929   38.263204
# 5:  0.1087755 -1.46575030  0.09147291 -0.04057817   B  3.717840   38.263204
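For a general number of predictors n, one possible sketch keeps the matrix product inside j and selects the predictor columns with .SDcols. The column names x1..xn, the helper variable xcols, and the simulated data are assumptions made for illustration; beta is assumed to hold the intercept first, followed by one coefficient per predictor, as in the answer above.

```r
library(data.table)

set.seed(47)
n <- 5                                   # number of predictors (illustrative)
dat <- data.table(replicate(n + 1, rnorm(6)))
setnames(dat, c("y", paste0("x", seq_len(n))))
dat[, grp := rep(c("A", "B"), each = 3)]

beta  <- seq_len(n + 1)                  # intercept first, then n slopes
xcols <- paste0("x", seq_len(n))         # hypothetical helper: predictor names

# Squared residual per row; .SD holds only the predictor columns
dat[, rss := (y - drop(cbind(1, as.matrix(.SD)) %*% beta))^2, .SDcols = xcols]

# Bonus 2: one averaged value per group instead of a repeated column
mean_rss <- dat[, .(mean_rss = mean(rss)), by = grp]
```

This still converts the predictors to a matrix once, but it avoids typing out each term of the inner product by hand.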


Subtract vector from matrix based on data.frame efficiently

I have a matrix X, two data frames A and B, and two vectors of indices vec_a and vec_b. A and B each contain an index variable whose values correspond to the values in vec_a and vec_b. Other than that, A and B contain as many value columns as there are columns in X:
# original data
X <- matrix(rnorm(200),100,2)
# values to subtract in data.frames
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
# indices, which values to subtract (one for each row of X)
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
What I want to achieve is the following: For each row iii in X get the values value1 and value2 from A and B based on elements iii in the vectors vec_a and vec_b. Then, subtract these values from the corresponding row in X. May sound a bit confusing, but I hope the following solution makes it more clear what the goal is:
# iterate over all rows of X
for(iii in 1:nrow(X)){
# get correct values
X_clean <- A[which(A$index_a == vec_a[iii]),-1] - # subtract correct A value
B[which(B$index_b == vec_b[iii]),-1] # subtract correct B value
# this intermediate step is necessary, otherwise we subtract a data.frame from a matrix
X_clean <- as.numeric(X_clean)
# subtract from X
X[iii,] = X[iii,] - X_clean
}
Note that we have to convert to numeric in my loop solution, otherwise X loses class matrix as we subtract a data.frame from a matrix. My solution works fine, until you need to do that for many matrices like A and B and for millions of observations. Is there a solution that does not rely on looping over all rows?
EDIT
Thanks, both answers improve the speed of the code massively. I chose the answer by StupidWolf as it was more efficient than using data.table:
Unit: microseconds
        expr        min         lq       mean     median         uq        max neval cld
   datatable   5557.355   5754.931   6052.402   5881.729   5975.386  14154.040   100   b
 stupid.wolf    818.529   1172.840   1311.784   1187.593   1221.164   4777.743   100   a
        loop 111748.790 115141.149 116677.528 116109.571 117085.048 156497.999   100   c
You can just match the rows:
set.seed(111)
# original data
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
newX <- X - as.matrix(A[match(vec_a,A$index_a),-1]-B[match(vec_b,B$index_b),-1])
Then we run your loop:
for(iii in 1:nrow(X)){
X_clean <- A[which(A$index_a == vec_a[iii]),-1] - # subtract correct A value
B[which(B$index_b == vec_b[iii]),-1] # subtract correct B value
X_clean <- as.numeric(X_clean)
X[iii,] = X[iii,] - X_clean
}
And check the values are equal:
all.equal(c(newX),c(X))
[1] TRUE
match should be pretty fast, but if it is still too slow, you can index the rows of A directly with vec_a, as in A[vec_a, ]. That shortcut works here only because index_a is 1:4 and therefore equals the row number.
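To illustrate why match is the safer lookup when the index values are not simply the row numbers, here is a small sketch. The index values 10, 20, 30 and the zero matrix X are made up for the example and are not from the question.

```r
A <- data.frame(index_a = c(10, 20, 30),
                value1  = c(0.1, 0.2, 0.3),
                value2  = c(1, 2, 3))
vec_a <- c(30, 10, 10, 20)

rows   <- match(vec_a, A$index_a)     # row of A for each element of vec_a
A_vals <- as.matrix(A[rows, -1])      # one row of values per element of vec_a

X     <- matrix(0, nrow = length(vec_a), ncol = 2)
X_new <- X - A_vals                   # whole-matrix subtraction, no loop
```

Here A[vec_a, ] would ask for rows 30, 10, 10 and 20, which do not exist, so the direct-indexing shortcut only applies when the index column coincides with the row positions.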
This approach uses data.table for easy joining.
library(data.table)
set.seed(111)
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
setDT(A);setDT(B)
dtX <- as.data.table(cbind(1:nrow(X),X,vec_a,vec_b))
as.matrix(
dtX[A, on = .(vec_a = index_a)][B,
on = .(vec_b = index_b)][order(V1),
.(V2 - (value1 - i.value1), V3 - (value2 - i.value2))]
)
V1 V2
[1,] 0.22746 0.7069
[2,] 1.84340 -0.1258
[3,] -0.70038 1.2494
...
[98,] 2.04666 0.6767
[99,] 0.02451 1.0473
[100,] -2.72553 -0.6595
Hopefully this will be pretty fast for very large matrices.

R: calculating mean of positive values in every column of a data frame

I want to calculate the mean of every column of a data frame, but only the positive values should be considered. The positive mean values of all columns should be collected in one vector.
My code:
x <- data.frame(replicate(3, sample(-5000:7000, 1000, rep = TRUE)))
meanxpositive <- c(NA)
for (n_col in 1:3) {
z <- mean(x[which(x[, ncol] > 0)])
meanxpositive[n_col] <- z
}
This code doesn't work. Maybe someone has a better idea.
sapply(x, function (y) mean(y[y > 0]))
colSums(as.matrix(x) * (x > 0)) / colSums(x > 0) # divide by the number of positive values, not by nrow(x)
A colMeans approach might look like:
x[x<=0] <- NA
colMeans(x, na.rm=T)
# X1 X2 X3
#3483.664 3626.115 3533.687
Since you also mentioned you wanted to see this solution using a for loop you could adjust your old code to:
meanxpositive<-rep(NA, ncol(x))
for (n_col in 1:3) {
z<-mean(x[which(x[,n_col]>0), n_col]) #Changed this line to reference "n_col" instead of "ncol"
meanxpositive[n_col]<-z
}
meanxpositive
#[1] 3483.664 3626.115 3533.687
Note that I changed the references from ncol to n_col and also added an explicit selection of column n_col inside your mean(...) call; before, you weren't selecting any column.
Lastly, with for loops it is best to "pre-allocate" the memory of your result. In this case that means setting it to be the size that you know it should be (3). In R growing objects inside loops is extremely slow and inefficient.
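As a sketch of the pre-allocation advice, the loop below fills a result vector of known length, and vapply is shown as an alternative that sizes and type-checks the result itself. The helper name mean_pos is hypothetical and only used for this illustration.

```r
set.seed(1)
x <- data.frame(replicate(3, sample(-5000:7000, 1000, replace = TRUE)))

# Hypothetical helper: mean of the strictly positive entries of one column
mean_pos <- function(col) mean(col[col > 0])

# Pre-allocated loop: the result has its final length before the loop starts
res_loop <- numeric(ncol(x))
for (j in seq_along(x)) {
  res_loop[j] <- mean_pos(x[[j]])
}

# vapply: same computation, each call must return exactly one numeric value
res_vapply <- vapply(x, mean_pos, numeric(1))
```

Using seq_along(x) instead of 1:3 keeps the loop correct if the number of columns changes.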
Data:
set.seed(1)
x<-data.frame(replicate(3,sample(-5000:7000,1000,rep=TRUE)))
If you want to keep your data in a data.frame,
library(dplyr)
set.seed(47)
x <- data.frame(replicate(3, sample(-5000:7000, 1000, replace = TRUE)))
x %>% summarise_all(~mean(.x[.x > 0]))
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or with the older funs() notation (deprecated in recent dplyr releases),
x %>% summarise_all(funs(mean(.[. > 0])))
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or in base R,
aggregate(. ~ TRUE, x, function(x){mean(x[x > 0])})
#> X1 X2 X3
#> 1 3578.912 3535.614 3358.444
or data.table,
library(data.table)
setDT(x)[, lapply(.SD, function(x){mean(x[x > 0])})]
#> X1 X2 X3
#> 1: 3578.912 3535.614 3358.444

Faster way to calculate distance between all individuals during each time step

I have data on positions of several individuals, each registered at several time steps. I want to calculate distance between each animal to all other animals registered at the same time step.
Here's a simplified example, with data on three individuals ('animal_id') registered on three dates ('date') each, on different positions ('x', 'y'):
library(data.table)
dt1 <- data.table(animal_id = 1, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt2 <- data.table(animal_id = 2, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt3 <- data.table(animal_id = 3, date = as.POSIXct(c("2014-01-01", "2014-01-02", "2014-01-03")),
x = runif(3, -10, 10), y = runif(3, -10, 10))
dt <- rbindlist(list(dt1, dt2, dt3))
# Create dist function between two animals at same time
dist.between.animals <- function(collar_id1, x1, y1, collar_id2, x2, y2) {
if (collar_id1 == collar_id2) return(NA)
sqrt((x1 - x2)^2 + (y1 - y2)^2)
}
# Get unique collar id of each animal, create column name for all animals per animal
animal_ids <- dt[ , unique(animal_id)]
animal_ids_str <- dt[,paste0("dist_to_", unique(animal_id))]
datetimes <- dt[ , unique(date)]
# Calculate distance of each animal to all animals, at same time
for (i in 1:length(animal_ids)) {
for (j in 1:length(datetimes)) {
x1 <- dt[.(animal_ids[i], datetimes[j]), x, on = .(animal_id, date)]
y1 <- dt[.(animal_ids[i], datetimes[j]), y, on = .(animal_id, date)]
dt[date == datetimes[j], animal_ids_str[i] := mapply(function(c, x2, y2) dist.between.animals(animal_ids[i], x1, y1, c, x2, y2), animal_id, x, y)]
}
}
Here's an example of what the output should look like:
   animal_id       date          x          y dist_to_1  dist_to_2  dist_to_3
1:         1 2014-01-01 -7.0276047  4.7660664        NA  7.1354265 13.7962800
2:         1 2014-01-02 -6.6383802  7.0087919        NA  3.7003879 16.4294999
3:         1 2014-01-03 -0.9722872 -4.8638019        NA 11.6447645 11.8313410
4:         2 2014-01-01  0.1076661  4.8131960  7.135426         NA  7.7052205
5:         2 2014-01-02 -8.9042124  4.0832364  3.700388         NA 13.3225921
6:         2 2014-01-03  8.2858839  2.1992575 11.644764         NA  0.4569632
7:         3 2014-01-01  5.7519522 -0.4320359 13.796280  7.7052205         NA
8:         3 2014-01-02 -9.0805265 -9.2381889 16.429500 13.3225921         NA
9:         3 2014-01-03  8.6832729  1.9736531 11.831341  0.4569632         NA
However, my real data have about 30 animals with 20,000 observations per animal, so my current code takes a long time to run. Is there a more efficient way to do this?
OK, so here's a somewhat unorthodox method, especially given that for once I think data.table makes the situation worse. I'm using the dist function, which calculates the Euclidean distance (or another metric of your choice). Converting its result with as.matrix gives a full symmetric matrix that you can then assign to the specified rows and columns (diag=T, upper=T only affect how a dist object is printed). Creating the columns might get tedious with multiple animals, but it's nothing that the paste function can't fix.
dt[, c("dist_to_1", "dist_to_2", "dist_to_3") := NA]
library(dplyr) # arrange() comes from dplyr
dt<- arrange(dt, date, animal_id) # order by date. here it turns into a data.frame
for (i in 1:length(unique(dt$date))){
sub<- subset(dt, dt$date == unique(dt$date)[i])
dt[which(dt$date == unique(sub$date)), c("dist_to_1", "dist_to_2", "dist_to_3")]<- as.matrix(dist(sub[, c("x","y")], diag=T, upper=T))
}
dt[dt==0]<- NA # assign NAs for 0s. Not necessary if it's OK for the diagonal to be 0.
setDT(dt) # back to datatable. Again this part is not really necessary.
dt<- dt[order(animal_id, date)] # order as initially ordered
Using this code:
> proc.time()-ptm
user system elapsed
0.051 0.007 0.068
Using earlier code:
> proc.time()-ptm
user system elapsed
0.083 0.004 0.092
If you find a way to use both dist and data.table you're golden, but I couldn't figure it out. It's pretty fast, since it calls C, and its advantage over the loop should grow as you add more observations.
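To make the shape of the dist output concrete, here is a tiny sketch with three made-up points (the coordinates are assumptions chosen so the distances are easy to check by hand):

```r
pts <- data.frame(x = c(0, 3, 0), y = c(0, 4, 1))

d <- as.matrix(dist(pts))   # 3 x 3 symmetric matrix, zeros on the diagonal
diag(d) <- NA               # distance of a point to itself is not of interest

d[1, 2]   # distance between point 1 (0, 0) and point 2 (3, 4)
```

Row i of d holds the distances from point i to every other point, which is exactly what the dist_to_* columns need for one date.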
You can make a self-join on date (dt[dt, on = "date",), and for each match (by = .EACHI) calculate the distance:
dt[dt, on = "date",
.(from_id = animal_id, to_id = i.animal_id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI]
If you wish to turn the data into a wide format (dcast), chain this to the code above:
[ , dcast(.SD, from_id + date ~ to_id, value.var = "dist")]
It seems to perform OK in a benchmark using the data of @digEmAll:
library(microbenchmark)
microbenchmark(
digemall = dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date],
henrik = dt[dt, on = "date",
.(from_id = animal_id, to_id = i.animal_id, dist = sqrt((x - i.x)^2 + (y - i.y)^2)), by = .EACHI][
, dcast(.SD, from_id + date ~ to_id, value.var = "dist")],
times = 5, unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# digemall 3.206063 2.058547 2.189487 2.035975 2.032324 2.019082 5
# henrik 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 5
Note that I haven't renamed the "to_id" in my code. That basically reflects my preference to keep the data in long format, and in that format I would like to have both the 'from_id' and 'to_id' without any prefix. If you want a prefix in the columns of the wide format, just add to_id = paste0("dist_to_", i.animal_id) in the first step.
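A minimal worked sketch of the self-join on a toy table, so the result can be checked by hand. The four rows below are made-up data, and dropping the self-pairs with from_id != to_id is an extra step assumed here rather than taken from the answer.

```r
library(data.table)

dt <- data.table(
  animal_id = c(1, 2, 1, 2),
  date = as.Date(c("2014-01-01", "2014-01-01", "2014-01-02", "2014-01-02")),
  x = c(0, 3, 1, 1),
  y = c(0, 4, 1, 5)
)

# For each row of the inner dt (prefix i.), match all rows with the same date
long <- dt[dt, on = "date",
           .(from_id = animal_id, to_id = i.animal_id,
             dist = sqrt((x - i.x)^2 + (y - i.y)^2)),
           by = .EACHI]

# Drop the zero distances of each animal to itself
long <- long[from_id != to_id]

# Optional wide format, one column per 'to_id'
wide <- dcast(long, from_id + date ~ to_id, value.var = "dist")
```

On the first date the two animals sit at (0, 0) and (3, 4), so both directed pairs get distance 5; on the second date the points differ only in y by 4.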
Here's an alternative approach which should be much faster :
library(data.table)
### CREATE A BIG DATASET
set.seed(123)
nSamples <- 20000
nAnimals <- 30
allDates <- as.POSIXct(c("2014-01-01")) + (1:nSamples)*24*3600
dts <- lapply(1:nAnimals, function(id){
data.table(animal_id=id,date=allDates,
x=runif(nSamples,-10,10), y=runif(nSamples,-10,10))
})
dt <- rbindlist(dts)
### ALTERNATIVE APPROACH (NO LOOP)
animal_ids_str <- dt[,paste0("dist_to_",sort(unique(animal_id)))]
# set keys
setkey(dt,animal_id,date)
# add the distance columns
dt[,(animal_ids_str):=as.double(NA)]
# custom function that computes animal distances for a subset of dt at the same date
distancesInSameDate <- function(subsetDT,animal_ids_str){
m <- as.matrix(dist(subsetDT[,.(x,y)]))
diag(m) <- NA
cols <- paste0("dist_to_",subsetDT$animal_id)
missingCols <- animal_ids_str[is.na(match(animal_ids_str,cols))]
m <- cbind(m,matrix(NA,nrow=nrow(m),ncol=length(missingCols)))
colnames(m) <- c(cols,missingCols)
DF <- as.data.frame(m,stringsAsFactors=F)
DF <- DF[,match(animal_ids_str,colnames(DF))]
return(DF)
}
# let's compute the distances
system.time( dt[,(animal_ids_str):=distancesInSameDate(.SD,animal_ids_str),by=date] )
On my machine it takes :
user system elapsed
13.76 0.00 13.82

R Sum every k columns in matrix

I have a matrix temp1 (dimensions Nx16) (generally, NxM)
I would like to sum every k columns in each row to one value.
Here is what I got to so far:
cbind(rowSums(temp1[,c(1:4)]), rowSums(temp1[,c(5:8)]), rowSums(temp1[,c(9:12)]), rowSums(temp1[,c(13:16)]))
There must be a more elegant (and generalized) method to do it.
I noticed a similar question here:
sum specific columns among rows
but I couldn't make it work with Ananda's solution. I got the following error:
sapply(split.default(temp1, 0:(length(temp1)-1) %/% 4), rowSums)
Error in FUN(X[[1L]], ...) :
'x' must be an array of at least two dimensions
Please advise.
You can use by:
do.call(cbind, by(t(temp1), (seq(ncol(temp1)) - 1) %/% 4, FUN = colSums))
If the dimensions are equal for the sub matrices, you could change the dimensions to an array and then do the rowSums
m1 <- as.matrix(temp1)
n <- 4
dim(m1) <- c(nrow(m1), ncol(m1)/n, n)
res <- matrix(rowSums(apply(m1, 2, I)), ncol=n)
identical(res[,1],rowSums(temp1[,1:4]))
#[1] TRUE
Or if the dimensions are unequal
t(sapply(seq(1,ncol(temp2), by=4), function(i) {
indx <- i:(i+3)
rowSums(temp2[indx[indx <= ncol(temp2)]])}))
data
set.seed(24)
temp1 <- as.data.frame(matrix(sample(1:20, 16*4, replace=TRUE), ncol=16))
set.seed(35)
temp2 <- as.data.frame(matrix(sample(1:20, 17*4, replace=TRUE), ncol=17))
Another possibility:
x1<-sapply(1:(ncol(temp1)/4),function(x){rowSums(temp1[,1:4+(x-1)*4])})
## check
x0<-cbind(rowSums(temp1[,c(1:4)]), rowSums(temp1[,c(5:8)]), rowSums(temp1[,c(9:12)]), rowSums(temp1[,c(13:16)]))
identical(x1,x0)
# TRUE
Here's another approach. Convert the matrix to an array and then use apply with sum.
n <- 4
apply(array(temp1, dim=c(dim(temp1)/c(1,n), n)), MARGIN=c(1,3), FUN=sum)
Using @akrun's data
set.seed(24)
temp1 <- matrix(sample(1:20, 16*4, replace=TRUE), ncol=16)
A function that sums all entries of each group of n consecutive columns (one total per group, not per row):
set.seed(1618)
mat <- matrix(rnorm(24 * 16), 24, 16)
f <- function(mat, n = 4) {
if (ncol(mat) %% n != 0)
stop("ncol(mat) must be a multiple of n")
cols <- split(colSums(mat), rep(1:(ncol(mat) / n), each = n))
## or use this to have n mean the number of groups you want
# cols <- split(colSums(mat), rep(1:n, each = ncol(mat) / n))
sapply(cols, sum)
}
f(mat, 4)
# 1 2 3 4
# -17.287137 -1.732936 -5.762159 -4.371258
c(sum(mat[,1:4]), sum(mat[,5:8]), sum(mat[,9:12]), sum(mat[,13:16]))
# [1] -17.287137 -1.732936 -5.762159 -4.371258
More examples:
## first 8 and last 8 cols
f(mat, 8)
# 1 2
# -19.02007 -10.13342
## each group is 16 cols, ie, the entire matrix
f(mat, 16)
# 1
# -29.15349
sum(mat)
# [1] -29.15349
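For row-wise sums over every k columns, another base R option is rowsum on the transposed matrix. This is only a sketch: the function name sum_every_k and the 2 x 17 test matrix are assumptions for illustration, and an incomplete last group is simply summed over the columns that exist.

```r
sum_every_k <- function(m, k) {
  m <- as.matrix(m)
  # One group label per column: 0,0,...,1,1,... in blocks of k
  group <- (seq_len(ncol(m)) - 1) %/% k
  # rowsum() adds up rows by group, so transpose, sum, and transpose back
  unname(t(rowsum(t(m), group)))
}

m   <- matrix(as.numeric(1:34), nrow = 2)   # 2 x 17, last group has 1 column
res <- sum_every_k(m, 4)
dim(res)
```

The result has one row per original row and one column per group of k columns.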

Find values in a given interval without a vector scan

With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
Where DT can be a very big table.
Bonus question:
is it possible to do the same thing for a set of non-overlapping intervals such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
Where both I and DT can be very big.
Thanks a lot
Here is a variation of the code proposed by @user1935457 (see the comment in @user1935457's post)
system.time({
if(!identical(key(DT), "x")) setkey(DT, x)
setkey(IT, min)
#below is the line that differs from @user1935457
#Using IT to address the lines of DT creates a smaller intermediate table
#We can also directly use .I
target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
setattr(IT, "sorted", "max")
# same here
target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# @user1935457 code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should be more noticeable with a larger DT and a smaller IT. Thanks again to @user1935457 for your answer.
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
That out of the way, we need a bigger example to exhibit difference from vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the Data Table is a not-too-random subset of 1:n. IT, the Interval Table is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
DT[, row := .I]
setkey(IT, min)
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
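The idea behind the rolling joins (locate both interval endpoints in sorted data, then take the rows in between) can also be sketched in base R with findInterval. This is an illustration under assumptions, not the data.table approach itself: the data frame name ivals is made up, x must already be sorted, and findInterval checks that its vector is sorted, which is itself a linear pass.

```r
x <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89)          # already sorted
ivals <- data.frame(i = c(1, 2), min = c(3, 20), max = c(10, 40))

# First position with x >= min: count of values strictly below min, plus one
first <- findInterval(ivals$min, x, left.open = TRUE) + 1L
# Last position with x <= max: count of values less than or equal to max
last  <- findInterval(ivals$max, x)

len <- pmax(last - first + 1L, 0L)                    # 0 for empty intervals
idx <- unlist(Map(function(f, n) f + seq_len(n) - 1L, first, len))

res <- data.frame(i = rep(ivals$i, len), x = x[idx])
res
```

With the example data from the question this selects 3, 5, 8 for the first interval and 21, 34 for the second.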
If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table :
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between% :
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT: As @mnel pointed out, %between% still does vector scans. The Note section of the help page says:
Current implementation does not make use of ordered keys.
So this doesn't answer your question.
