Subtract vector from matrix based on data.frame efficiently - r

I have a matrix X, two data frames A and B, and two vectors of indices vec_a and vec_b. A and B each contain an index variable whose values correspond to the values in vec_a and vec_b. Apart from that, A and B contain as many value columns as there are columns in X:
# original data
X <- matrix(rnorm(200),100,2)
# values to subtract, stored in data.frames
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
# indices determining which values to subtract (one for each row of X)
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
What I want to achieve is the following: for each row iii of X, get the values value1 and value2 from A and B based on element iii of the vectors vec_a and vec_b, then subtract these values from the corresponding row of X. This may sound a bit confusing, but I hope the following solution makes the goal clearer:
# iterate over all rows of X
for(iii in 1:nrow(X)){
  # get the correct values
  X_clean <- A[which(A$index_a == vec_a[iii]), -1] - # subtract correct A value
    B[which(B$index_b == vec_b[iii]), -1]            # subtract correct B value
  # this intermediate step is necessary, otherwise we subtract a data.frame from a matrix
  X_clean <- as.numeric(X_clean)
  # subtract from X
  X[iii, ] = X[iii, ] - X_clean
}
Note that we have to convert to numeric in my loop solution; otherwise we would be subtracting a data.frame from a matrix and X would lose its matrix structure. My solution works fine, but it becomes very slow once you need to do this for many matrices like A and B and for millions of observations. Is there a solution that does not rely on looping over all rows?
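As a minimal illustration of that coercion (using the objects defined above), a matrix row minus a one-row data.frame is itself a data.frame, which is why the as.numeric() step is needed:
# a numeric matrix row minus a one-row data.frame yields a data.frame
class(X[1, ] - A[1, -1])
#> [1] "data.frame"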
EDIT
Thanks, both answers improve the speed of the code massively. I chose the answer by StupidWolf as it was more efficient than using data.table:
Unit: microseconds
        expr        min         lq       mean     median         uq        max neval cld
   datatable   5557.355   5754.931   6052.402   5881.729   5975.386  14154.040   100   b
 stupid.wolf    818.529   1172.840   1311.784   1187.593   1221.164   4777.743   100   a
        loop 111748.790 115141.149 116677.528 116109.571 117085.048 156497.999   100   c

You can just match the rows:
set.seed(111)
# original data
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
newX <- X - as.matrix(A[match(vec_a,A$index_a),-1]-B[match(vec_b,B$index_b),-1])
Then we run your loop:
for(iii in 1:nrow(X)){
  X_clean <- A[which(A$index_a == vec_a[iii]), -1] - # subtract correct A value
    B[which(B$index_b == vec_b[iii]), -1]            # subtract correct B value
  X_clean <- as.numeric(X_clean)
  X[iii, ] = X[iii, ] - X_clean
}
And check the values are equal:
all.equal(c(newX),c(X))
[1] TRUE
match() should be pretty fast, but if it is still too slow, you can index directly into A with vec_a, like A[vec_a, ].
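A minimal sketch of that direct-indexing variant (it assumes index_a and index_b are exactly 1:4, so the index values double as row numbers):
# direct row indexing; valid only because index_a/index_b run 1:4 in order
newX2 <- X - as.matrix(A[vec_a, -1] - B[vec_b, -1])
all.equal(c(newX2), c(newX))
#> [1] TRUE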

This approach uses data.table for easy joining.
library(data.table)
set.seed(111)
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
setDT(A); setDT(B)
dtX <- as.data.table(cbind(1:nrow(X), X, vec_a, vec_b))
as.matrix(
  dtX[A, on = .(vec_a = index_a)][
    B, on = .(vec_b = index_b)][
      order(V1), .(V2 - (value1 - i.value1), V3 - (value2 - i.value2))]
)
V1 V2
[1,] 0.22746 0.7069
[2,] 1.84340 -0.1258
[3,] -0.70038 1.2494
...
[98,] 2.04666 0.6767
[99,] 0.02451 1.0473
[100,] -2.72553 -0.6595
Hopefully this will be pretty fast for very large matrices.

Related

Thousand separator to numeric columns in R

I am trying to format numbers as shown (adding a thousands separator). The function is working fine, but after formatting the numbers, the numeric columns no longer sort numerically since they now contain characters:
df <- data.frame(x = c(12345, 35666, 345, 5646575))

format_numbers <- function(df, column_name){
  df[[column_name]] <- ifelse(nchar(df[[column_name]]) <= 5,
                              paste(format(round(df[[column_name]] / 1e3, 1), trim = TRUE), "K"),
                              paste(format(round(df[[column_name]] / 1e6, 1), trim = TRUE), "M"))
}

df$x <- format_numbers(df, "x")
> df
x
1 12.3 K
2 35.7 K
3 0.3 K
4 5.6 M
Can we make sure the numbers still sort in ascending/descending order after formatting?
Note: this data frame df is to be displayed in a DT table.
The problem is the formatting part. If you do it correctly, i.e. while keeping your data numeric, then everything else will fall into place. Here I will demonstrate using an S3 class:
my_numbers <- function(x) structure(x, class = c('my_numbers', 'numeric'))

format.my_numbers <- function(x, ..., d = 1, L = c('', 'K', 'M', 'B', 'T')){
  ifelse(abs(x) >= 1000, Recall(x/1000, d = d + 1),
         sprintf('%.1f%s', x, L[d]))
}

print.my_numbers <- function(x, ...) print(format(x), quote = FALSE)

'[.my_numbers' <- function(x, ..., drop = FALSE) my_numbers(NextMethod('['))
Now you can run your code:
df <- data.frame(x = c(12345,35666,345,5646575))
df$x <- my_numbers(df$x)
df
x
1 12.3K
2 35.7K
3 345.0
4 5.6M
You can use any mathematical operation on column x as it is still numeric.
For example, cbinding it with its double and ordering from smallest to largest:
cbind(x = df, y = df*2)[order(df$x),]
x x
3 345.0 690.0 # smallest
1 12.3K 24.7K
2 35.7K 71.3K
4 5.6M 11.3M # largest ie Millions
Note that under the hood, x does not change:
unclass(df$x)
[1] 12345 35666 345 5646575 # Same as given

R - Calculating differences by group for all cuts of data

I have a dataset with several attributes and a value.
Input (sample)
GRP CAT TYP VAL
X H 5 0.76
X A 2 0.34
X D 3 0.70
X I 3 0.33
X F 4 0.80
X E 1 0.39
I want to:
Determine all combinations of CAT and TYP
For each combination, calculate the average value when the combination is removed
Return a final table of differences
Final Table (sample)
CAT TYP DIFF
1 <NA> NA 0.04000
2 H NA 0.03206
Row 1 means that if no records are removed, the difference between the average value of GRP='X' and GRP='Y' is 0.04. Row 2 means that if records with CAT='H' are removed, the difference is 0.032.
I have working code, but I want to make it faster. I'm open to your suggestions.
Working Code
library(dplyr)
set.seed(777)

# build example data frame
df <- data.frame(GRP = c(rep('X', 25), rep('Y', 25)),
                 CAT = sample(LETTERS[1:10], 50, T),
                 TYP = sample(1:5, 50, T),
                 VAL = sample(1:100, 50, T)/100,
                 stringsAsFactors = F)

fcnDiffCalc <- function() {
  # table of all combinations of CAT and TYP
  splits <- expand.grid(lapply(df[,-c(1,4)], function(x) c(NA, unique(x))), stringsAsFactors = F)
  # null data frame to store results
  ans <- data.frame(CAT = character(),
                    TYP = integer(),
                    DIFF = numeric(),
                    stringsAsFactors = F)
  # loop through each combination and calculate the difference between group X and Y
  for(i in 1:nrow(splits)) {
    split.i <- splits[i,]
    # determine non-NA columns
    by.cols <- colnames(split.i)[unlist(lapply(split.i, function(x) !all(is.na(x))))]
    # anti-join to remove records that match `split.i`
    if(length(by.cols) > 0){
      df.i <- df %>%
        anti_join(split.i, by = by.cols)
    } else {
      df.i <- df
    }
    # calculate average by group
    df.i <- df.i %>%
      group_by(GRP) %>%
      summarize(VAL_MEAN = mean(VAL))
    # calculate difference of averages
    DIFF <- df.i[,2] %>%
      as.matrix() %>%
      diff() %>%
      as.numeric()
    ans.tmp <- cbind(split.i, DIFF)
    # bind to final data frame
    ans <- bind_rows(ans, ans.tmp)
  }
  return(ans)
}
Speed results
> system.time(fcnDiffCalc())
user system elapsed
0.30 0.02 0.31
Consider assigning DIFF column with sapply rather than growing a data frame in a loop to avoid the repetitive in-memory copying:
fcnDiffCalc2 <- function() {
  # table of all combinations of CAT and TYP
  splits <- data.frame(expand.grid(lapply(df[,-c(1,4)], function(x) c(NA, unique(x))),
                                   stringsAsFactors = F))
  # loop through each combination and calculate the difference between group X and Y
  splits$DIFF <- sapply(1:nrow(splits), function(i) {
    split.i <- splits[i,]
    # determine non-NA columns
    by.cols <- colnames(split.i)[unlist(lapply(split.i, function(x) !all(is.na(x))))]
    # anti-join to remove records that match `split.i`
    df.i <- tryCatch(df %>%
                       anti_join(split.i, by = by.cols), error = function(e) df)
    # calculate average by group
    df.i <- df.i %>%
      group_by(GRP) %>%
      summarize(VAL_MEAN = mean(VAL))
    # calculate difference of averages
    DIFF <- df.i[,2] %>%
      as.matrix() %>%
      diff() %>%
      as.numeric()
  })
  return(splits)
}
Even better: avoid the lapply inside expand.grid, use vapply instead of sapply (and instead of unlist + lapply, which is essentially sapply/vapply) so the output structure is declared up front, and drop the pipes inside the loop in favour of base R's aggregate:
fcnDiffCalc3 <- function() {
  # table of all combinations of CAT and TYP
  splits <- data.frame(expand.grid(CAT = c(NA, unique(df$CAT)), TYP = c(NA, unique(df$TYP)),
                                   stringsAsFactors = FALSE))
  # loop through each combination and calculate the difference between group X and Y
  splits$DIFF <- vapply(1:nrow(splits), function(i) {
    split.i <- splits[i,]
    # determine non-NA columns
    by.cols <- colnames(split.i)[vapply(split.i, function(x) !all(is.na(x)), logical(1))]
    # anti-join to remove records that match `split.i`
    df.i <- tryCatch(anti_join(df, split.i, by = by.cols), error = function(e) df)
    # calculate average by group
    df.i <- aggregate(VAL ~ GRP, df.i, mean)
    # calculate difference of averages
    diff(df.i$VAL)
  }, numeric(1))
  return(splits)
}
Output
df_op <- fcnDiffCalc()
df_new <- fcnDiffCalc2()
df_new2 <- fcnDiffCalc3()
identical(df_op, df_new)
# [1] TRUE
identical(df_op, df_new2)
# [1] TRUE
library(microbenchmark)
microbenchmark(fcnDiffCalc(), fcnDiffCalc2(), fcnDiffCalc3())
# Unit: milliseconds
# expr min lq mean median uq max neval
# fcnDiffCalc() 128.1442 140.1946 152.0703 154.3662 159.6809 180.5960 100
# fcnDiffCalc2() 115.4415 126.6108 138.0991 137.4108 145.2452 266.3297 100
# fcnDiffCalc3() 107.6847 116.9920 126.9131 126.0414 133.3887 227.2758 100

Optimized way of looking for a specific value in R across all the rows in a large matrix

I have a large sparse matrix of dimension 1M x 10 (1 million rows and 10 columns). I want to search every row of the matrix for a value and create a new vector based on whether it is found. Below is my code. I am wondering if there is any way I can optimize it.
CreatenewVector <- function(TestMatrix){
  newColumn = c()
  for(i in 1:nrow(TestMatrix)){ ## loop begins
    Value = ifelse(1 %in% TestMatrix[i,], 1, 0)
    newColumn = c(newColumn, Value)
  } ## loop ends
  return(newColumn)
}
## Sample input:  TestMatrix = matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0), byrow = T, nrow = 4)
## Sample output: (1, 1, 1, 0)
## In the input TestMatrix, each vector represents a row; for instance, (1, 0, 0) is the first row and so on.
Assuming you are using a normal matrix object, not a special sparse matrix class, you should use rowSums.
rowSums(x == 1) > 0
if x is the name of your matrix. This will return a logical vector; you can easily coerce it to numeric with as.numeric() if you prefer 1/0 to TRUE/FALSE.
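As a quick sanity check, here is the idiom applied to the sample input from the question (the expected output is taken from the question):
TestMatrix <- matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0), byrow = TRUE, nrow = 4)
as.numeric(rowSums(TestMatrix == 1) > 0)
#> [1] 1 1 1 0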
To give some sense of timing I benchmarked first using a thousand row matrix, then a million row matrix:
gregor = function(x) {as.numeric(rowSums(x == 1L) > 0L)}

# original method in question
op1 = function(x){
  newColumn = c()
  for(i in 1:nrow(x)){ ## loop begins
    Value = ifelse(1 %in% x[i,], 1, 0)
    newColumn = c(newColumn, Value)
  } ## loop ends
  return(newColumn)
}

# modified original:
# - eliminated unnecessary ifelse
# - pre-allocated result vector (no growing in a loop!)
# - saved numeric conversion to the end
op2 = function(x){
  newColumn = logical(nrow(x))
  for(i in 1:nrow(x)){ ## loop begins
    newColumn[i] = 1L %in% x[i,]
  } ## loop ends
  return(as.numeric(newColumn))
}

bouncy = function(x) {
  as.numeric(apply(x, 1, function(y) any(y == 1L)))
}
Here are the results for a thousand row matrix:
n = 1e3
x = matrix(sample(c(0L, 1L), size = n, replace = T), ncol = 4)
microbenchmark(gregor(x), op1(x), op2(x), bouncy(x), times = 20)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# gregor(x) 12.164 15.7750 20.14625 20.1465 24.8980 30.410 20 a
# op1(x) 1224.736 1258.9465 1345.46110 1275.6715 1338.0105 2002.075 20 d
# op2(x) 846.140 864.7655 935.46740 886.2425 951.4325 1287.075 20 c
# bouncy(x) 439.795 453.8595 496.96475 486.5495 508.0260 711.199 20 b
Using rowSums is the clear winner. I eliminated OP1 from the next test on a million row matrix:
n = 1e6
x = matrix(sample(c(0L, 1L), size = n, replace = T), ncol = 4)
microbenchmark(gregor(x), op2(x), bouncy(x), times = 30)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor(x) 9.371777 10.02862 12.55963 10.61343 14.13236 27.70671 30 a
# op2(x) 822.171523 856.68916 937.23602 881.39219 1028.26738 1183.68569 30 c
# bouncy(x) 391.604590 412.51063 502.61117 502.02431 588.78785 656.18824 30 b
Here the relative margin is even more in favor of rowSums.
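The question mentions a sparse matrix; here is a hedged sketch assuming a Matrix-package sparse matrix, where the same idiom applies because the Matrix package supplies its own rowSums method:
library(Matrix)
# TestMatrix is the sample input from the question, converted to a sparse matrix
xs <- Matrix(TestMatrix, sparse = TRUE)
as.numeric(rowSums(xs == 1) > 0)
#> [1] 1 1 1 0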

Replace NA with 0, only in numeric columns in data.table

I have a data.table with columns of different data types. My goal is to select only numeric columns and replace NA values within these columns by 0.
I am aware that replacing NA values with zero goes like this:
DT[is.na(DT)] <- 0
To select only numeric columns, I found this solution, which works fine:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
I can achieve what I want by assigning
DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
and then do:
DT2[is.na(DT2)] <- 0
But of course I would like to have my original DT modified by reference. With the following, however:
DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE][
  is.na(DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE])] <- 0
I get:
"Error in `[.data.table`(...): i is invalid type (matrix)"
What am I missing?
Any help is much appreciated!!
We can use set():
for(j in seq_along(DT)){
  set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}
Or create an index of the numeric columns, loop through it, and set the NA values to 0:
ind <- which(sapply(DT, is.numeric))
for(j in ind){
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}
data
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:
library(data.table)
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT
#> v1 v2 v3
#> 1: NA <NA> -0.5458808
#> 2: 1 A 0.5365853
#> 3: 2 B 0.4196231
#> 4: 3 C -0.5836272
#> 5: 4 D NA
And the two methods he suggested to use:
fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}

fun2 <- function(x){
  ind <- which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
I think the first method above is really genius as it exploits the fact that NAs are typed.
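A small illustration of that mechanism, using the DT defined above: is.numeric() on a character column is a single FALSE, so the & in fun1 masks out every row index and set() never touches the non-numeric columns.
is.na(DT[["v2"]]) & is.numeric(DT[["v2"]])          # character column: all FALSE, left untouched
#> [1] FALSE FALSE FALSE FALSE FALSE
which(is.na(DT[["v3"]]) & is.numeric(DT[["v3"]]))   # numeric column: row 5 gets its NA replaced
#> [1] 5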
First of all, even though .SD is not available in the i argument, it is possible to pull the column by name with get(), so I thought I could sub-assign the data.table this way:
fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j) := 0]
  }
}
The generic case, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:
fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms) := lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols = nms]
}
But then I thought to myself: "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped into setDT():
fun5 <- function(x){
  setDT(
    lapply(x, function(i){
      if(is.numeric(i))
        i[is.na(i)] <- 0
      i
    })
  )
}
Finally, we could use the same conditional idea to limit the columns on which we apply set():
fun6 <- function(x){
  for(j in seq_along(x)){
    if (is.numeric(x[[j]]))
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
Here are the benchmarks:
microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind   = fun2(copy(DT)),
  for.get       = fun3(copy(DT)),
  for.SDcol     = fun4(copy(DT)),
  for.list      = fun5(copy(DT)),
  for.set.if    = fun6(copy(DT))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> for.set.2cond 59.812 67.599 131.6392 75.5620 114.6690 4561.597 100 a
#> for.set.ind 71.492 79.985 142.2814 87.0640 130.0650 4410.476 100 a
#> for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202 100 c
#> for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491 100 b
#> for.list 69.722 81.932 137.2275 87.7720 123.6935 3906.149 100 a
#> for.set.if 52.380 58.397 116.1909 65.1215 72.5535 4570.445 100 a
You can use the tidyverse (purrr) function map_if along with ifelse to do the job in a single line of code.
library(tidyverse)
set.seed(24)
DT <- data.table(v1= sample(c(1:3,NA),20,replace = T), v2 = sample(c(LETTERS[1:3],NA),20,replace = T), v3=sample(c(1:3,NA),20,replace = T))
Below single line code takes a DT with numeric and non numeric columns and operates just on the numeric columns to replace the NAs to 0:
DT %>% map_if(is.numeric,~ifelse(is.na(.x),0,.x)) %>% as.data.table
So, tidyverse can be less verbose than data.table sometimes :-)

How to pairwise compare values referring to distinct elements in two matrices of different formats?

I've got a set of objects, let's say with the IDs 'A' to 'J'. And I've got two data frames which look the following way (as you can see, the second data frame is symmetric):
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,69,9,83,26,NA,67,95,74,69,67,NA,6,84,9,95,6,NA), ncol = 5, nrow = 5, dimnames = list(df1$ID, df1$ID)))
For example, take the objects 'B' and 'E'. I want to know: Is 13+28 (from df1) less than 9 (from df2)? I'd like to know this for all pairs of objects. The output should be
(a) a logical data frame structured like df2 and
(b) the number of "TRUE" values.
Most of the time I will only need result (b), but sometimes I would also need (a). So if (b) can be calculated without (a) and if this would be significantly faster, then I'd like to have both algorithms in order to select the suitable one dependent on which output I need to answer a particular question.
I'm comparing around 2000 objects, so the algorithm should be reasonably fast. So far I've been only able to implement this with two nested for-loops which is awfully slow. I bet there is a much nicer way to do this, maybe exploiting vectorisation.
This is what it currently looks like:
df3 <- as.data.frame(matrix(data = NA, ncol = nrow(df1), nrow = nrow(df1),
                            dimnames = list(df1$ID, df1$ID)))
for (i in 2:nrow(df3)){
  for (j in 1:(i-1)){
    sum.val <- df1[df1$ID == rownames(df3)[i], "Var"] + df1[df1$ID == names(df3)[j], "Var"]
    df3[i,j] <- sum.val <= df2[i,j]
  }
}
Is this what you want?
# pairwise sums of Var
df3 <- outer(df1$Var, df1$Var, "+")
df3
# compare each pairwise sum with the corresponding entry of df2
df4 <- df3 < df2
df4
# count the TRUEs (the diagonal NAs are dropped)
sum(df4, na.rm = TRUE)
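Since df4 is symmetric (like df2), the sum above counts every unordered pair twice; if only unique pairs should be counted (an assumption about what is wanted), halve it:
sum(df4, na.rm = TRUE) / 2   # each unordered pair counted once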
Here's one way to do it...
# Get row and column indices
ind <- t( combn( df1$ID , 2 ) )
# Get totals
tot <- with( df1 , Var[ match( ind[,1] , ID ) ] + Var[ match( ind[,2] , ID ) ] )
# Make df2 a matrix
m <- as.matrix( df2 )
# Total number of values is simply
sum( m[ ind ] > tot )
#[1] 7
# Find which values in upper triangle part of the matrix exceed those from df1 (1 = TRUE)
m[upper.tri(m)] <- m[ ind ] > tot
# A B C D E
#A NA 1 1 1 0
#B 42 NA 1 0 1
#C 83 26 NA 1 1
#D 74 69 67 NA 0
#E 84 9 95 6 NA
This will do what you want.
# Generate the data
df1 <- data.frame(ID = LETTERS[1:5], Var = c(9,13,15,11,28))
df2 <- as.data.frame(matrix(data = c(NA,42,83,74,84,42,NA,26,
                                     69,9,83,26,NA,67,95,74,69,
                                     67,NA,6,84,9,95,6,NA),
                            ncol = 5, nrow = 5,
                            dimnames = list(df1$ID, df1$ID)))
# Define a pairwise comparison index matrix using 'combn'
idx <- combn(nrow(df1), 2)
# Create a results matrix
res <- matrix(NA, ncol = ncol(df2), nrow = nrow(df2))
# Loop through 'idx' for each possible comparison (without repeats)
for(i in 1:ncol(idx)){
  logiTest <- (df1$Var[idx[1,i]] + df1$Var[idx[2,i]]) < df2[idx[1,i], idx[2,i]]
  res[idx[1,i], idx[2,i]] <- logiTest
  res[idx[2,i], idx[1,i]] <- logiTest
}
# Count the number of 'true' comparisons
nTrues <- sum(res, na.rm = TRUE)/2
The code simply uses a pairwise comparison index (idx) to define which elements in both df1 and df2 are to be used in each iteration of the 'for loop'. It then uses this same index to define where in the 'res' matrix the answer to the logical test is to be written.
N.B. This code will break down if the order of elements in df1 and df2 is not the same. In such cases, it would be appropriate to use the actual letters (the IDs) to define which values to compare, as sketched below.
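A hypothetical sketch of that ID-based variant (df1, df2 and idx are as defined above; the dimnames on res are an added assumption), which looks values up by ID rather than by position so the two objects no longer need to share an ordering:
# index df1 and df2 by the ID letters instead of by row/column position
res <- matrix(NA, ncol = ncol(df2), nrow = nrow(df2),
              dimnames = list(rownames(df2), colnames(df2)))
for(i in 1:ncol(idx)){
  id1 <- as.character(df1$ID[idx[1, i]])
  id2 <- as.character(df1$ID[idx[2, i]])
  logiTest <- (df1$Var[df1$ID == id1] + df1$Var[df1$ID == id2]) < df2[id1, id2]
  res[id1, id2] <- logiTest
  res[id2, id1] <- logiTest
}
# count each pair once, as before
nTrues <- sum(res, na.rm = TRUE)/2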
