Operations (+, -, /, *) on unequal-sized data.tables in R

1) Is it possible to do operations (multiplication, division, addition, subtraction) between unequal-sized data.tables using data.table or will it have to be done with data.frame?
The following example is a simplified version of my original posting. In my actual data set there would be A1:A12, B1:B12, C1:C12, E1:E12, F1:F12, etc. I've added columns J and K to get close to my original data set and to show that I cannot simply do the following in a matrix.
# Sample Data
library(data.table)
input1a <- data.table(ID = c(37, 45, 900),
                      A1 = c(1, 2, 3),
                      A2 = c(43, 320, 390),
                      B1 = c(-0.94, 2.2, -1.223),
                      B2 = c(2.32, 4.54, 7.21),
                      C1 = c(1, 2, 3),
                      C2 = c(-0.94, 2.2, -1.223),
                      D = c(43, 320, 390),
                      J = paste0("measurement_1", 1:3),
                      K = paste0("type_1", 1:3))
setkey(input1a, ID)
input1a
#     ID A1  A2     B1   B2 C1     C2   D              J       K
# 1:  37  1  43 -0.940 2.32  1 -0.940  43 measurement_11 type_11
# 2:  45  2 320  2.200 4.54  2  2.200 320 measurement_12 type_12
# 3: 900  3 390 -1.223 7.21  3 -1.223 390 measurement_13 type_13
input2a <- data.table(ID = c(37, 45, 900),
                      E1 = c(23, -0.2, 12),
                      E2 = c(-0.33, -0.012, -1.342))
setkey(input2a, ID)
input2a
#     ID   E1     E2
# 1:  37 23.0 -0.330
# 2:  45 -0.2 -0.012
# 3: 900 12.0 -1.342
outputa <- 0.00066 * input1a[, c(4:5), with = FALSE] *
  input1a[, 8, with = FALSE] * input2a[, c(2:3), with = FALSE]
# i.e. outputa <- 0.00066 * B1:B2 * D * E1:E2, referring back to the column names
# the result has no keys, but I would like to keep the keys
setnames(outputa, 1:2, c("F1", "F2"))
Result using outputa
outputa # using existing code and gives a result with no keys
#            F1          F2
# 1: -0.6135756 -0.02172773
# 2: -0.0929280 -0.01150618
# 3: -3.7776024 -2.49055607
In the following code I took outputa, which did not keep the keys, and rewrote it as outputause. I would like the following question answered so that I can perform the needed operations on the data set while keeping the keys intact.
2) How can the following code be rewritten with x defined for each group of columns? This question stems from Weighted sum of variables by groups with data.table and my trouble trying to replicate any of the answers with my data set.
Each group of columns is defined below:
A1:A2 (input1a[, 2:3]),
B1:B2 (input1a[, 4:5]), and
D (input1a[, 8])
In outputause, if input1a[, c(4:5), with = FALSE] were the only group taken from input1a, then it alone would be x.
What about when you have more than one group from a single data.table as is shown below?
outputause <- input1a[, lapply(.SD, function(x) {
  0.00066 * input1a[, c(4:5), with = FALSE] * input1a[, 8, with = FALSE] *
    input2a[, c(2, 3), with = FALSE]
}), by = key(input1a)] # keeping keys intact
setnames(outputause, 2:3, c("F1", "F2"))
Result using outputause
outputause # using revised code and result includes the keys
#     ID         F1          F2
# 1:  37 -0.6135756 -0.02172773
# 2:  45 -0.0929280 -0.01150618
# 3: 900 -3.7776024 -2.49055607
UPDATE
input2at <- data.table(t(input2a))
inputs <- data.table(input1a, input2at)
I have transposed input2a and combined it with input1a in the data.table inputs. In this simple example I had 3 rows, but in my actual data set I'll have close to 1300 rows. This is the reason why I've asked question 2).
Thank you.

I am answering my own question based on an answer provided to me in R data.table operations with multiple groups in single data.table and outside function with lapply.
outputa <- data.table(input1a, input2a)
setnames(outputa, 8, "D1")
outputa[, D2 := D1]
fun <- function(B, D, E) 0.00066 * B * D * E
outputa[, lapply(1:2, function(i) fun(get(paste0('B', i)),
                                      get(paste0('D', i)),
                                      get(paste0('E', i)))),
        by = ID]
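To also get the F1/F2 names in the same call while keeping the key, the list returned in j can be wrapped in setNames; a minimal sketch using the same fun and data (the names F1 and F2 are taken from the question):
outputa[, setNames(lapply(1:2, function(i) fun(get(paste0('B', i)),
                                               get(paste0('D', i)),
                                               get(paste0('E', i)))),
                   c("F1", "F2")),
        by = ID]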


How can this R data.table join+group+summarise operation be made a lot faster?

Summary of real-world problem
Essentially this is a scenario evaluation, of a linear system of equations.
I have two data tables.
s_dt contains the scenarios, drivers (d) and values (v) for each observed scenario (o).
c_dt contains a series of terms (n) for a number of fitted model bases (b).
The individual powers of drivers and the associated coefficients are coded into d and t as name-value pairs.
Each basis (b) is essentially a polynomial with n terms.
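Spelled out, each result value is (a reading of the repro's fun below; t_{c,b,n} is the coefficient stored in the d == "c" row of term n):
$$r_{o,b} = \sum_{n} \Big( t_{c,b,n} \prod_{d} v_{o,d}^{\,t_{d,b,n}} \Big)$$
with a term contributing 0 whenever one of its driver values is missing.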
The issue
Repro case below gives desired output format.
But is far too slow for required use case, even on a cut-down problem.
Numbers are junk, but I can't share actual data. Running on real-world data gives similar timing.
Circa 3 sec for the "lil" problem on my system (12 threads).
But the "big" problem is 4000 times larger, so expect circa 3 hours. Ouch!
The aim is to have the "big" problem run sub 5 min (or ideally much faster!).
So, awesome clever people, how can this be made a lot faster?
(And what is the root cause of the slowdown?)
I'll happily accept base/tidyverse solutions too, if they meet the performance needs. I just assumed data.table was the best way to go for the size of the problem.
Current solution
Run fun on s_dt, grouping by o.
fun: Joins c_dt with each group data, to populate v, thus enabling calculation of r (the result of evaluating each of the polynomial equations).
In data.table parlance:
s_dt[, fun(.SD), keyby = .(o)]
Repro case
Creates two data.tables that have the combinations and field types matching the real-world problem.
But with cut-down size for illustrative purposes.
Defines fun, then runs to populate r for all scenarios.
library(data.table)
# problem sizing ----
dims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
dims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function() {
  o <- seq_len(dims$o)
  d <- paste0("d", seq_len(dims$d))
  v <- as.double(seq_len(dims$o * dims$d)) / 10000
  CJ(o, d)[, `:=`(v = v)]
}
s_dt <- build_s()
build_c <- function() {
  b <- paste0("c", seq_len(dims$b))
  n <- seq_len(dims$n)
  d <- c("c", paste0("d", seq_len(dims$d)))
  t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d + 1)))
  dt <- CJ(d, b, n)[, `:=`(t = t)]
  dt <- dt[t != 0]
}
c_dt <- build_c()
# define fun and evaluate ----
# (this is what needs optimising)
profvis::profvis({
  fun <- function(dt) {
    # don't use chaining here, for more useful profvis output
    dt <- dt[c_dt, on = .(d)]
    dt <- dt[, r := fcase(d == "c", t,
                          is.na(v), 0,
                          rep(TRUE, .N), v^t)]
    dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
    dt <- dt[, .(r = sum(r)), keyby = .(b)]
  }
  res <- s_dt[, fun(.SD), keyby = .(o)]
})
Example inputs and outputs
> res
          o   b            r
    1:    1  c1 0.000000e+00
    2:    1 c10 0.000000e+00
    3:    1 c11 0.000000e+00
    4:    1 c12 0.000000e+00
    5:    1 c13 0.000000e+00
   ---
 2496:  100  c5 6.836792e-43
 2497:  100  c6 6.629646e-43
 2498:  100  c7 6.840915e-43
 2499:  100  c8 6.624668e-43
 2500:  100  c9 6.842608e-43
> s_dt
         o   d      v
    1:   1  d1 0.0001
    2:   1 d10 0.0002
    3:   1 d11 0.0003
    4:   1 d12 0.0004
    5:   1 d13 0.0005
   ---
 4996: 100 d50 0.4996
 4997: 100  d6 0.4997
 4998: 100  d7 0.4998
 4999: 100  d8 0.4999
 5000: 100  d9 0.5000
> c_dt
          d  b   n t
      1:  c c1   2 1
      2:  c c1   3 2
      3:  c c1   4 3
      4:  c c1   5 4
      5:  c c1   6 5
     ---
 218567: d9 c9 195 5
 218568: d9 c9 196 6
 218569: d9 c9 198 1
 218570: d9 c9 199 2
 218571: d9 c9 200 3
This would be difficult to fully vectorize. The "big" problem requires so many operations that going parallel is probably the most straightforward way to get to ~5 minutes.
But first, we can get a ~3x speed boost by using RcppArmadillo for the product and sum calculations instead of data.table's grouping operations.
library(data.table)
library(parallel)
Rcpp::cppFunction(
  "std::vector<double> sumprod(arma::cube& a) {
     for(unsigned int i = 1; i < a.n_slices; i++) a.slice(0) %= a.slice(i);
     return(as<std::vector<double>>(wrap(sum(a.slice(0), 0))));
   }",
  depends = "RcppArmadillo",
  plugins = "cpp11"
)
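For intuition, sumprod multiplies the slices of a 3-d array element-wise and then takes column sums of the result. A quick sanity check in plain R on a toy array (hypothetical data, not part of the answer):
a <- array(as.numeric(1:12), c(2, 3, 2))
sumprod(a)
# should agree with the plain-R equivalent:
colSums(a[, , 1] * a[, , 2])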
cl <- makeForkCluster(detectCores() - 1L)
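Note that makeForkCluster works only on Unix-alikes. On Windows, a PSOCK cluster could be used instead, but compiled functions do not transfer to PSOCK workers, so each worker would have to compile sumprod itself; an untested sketch:
cl <- makeCluster(detectCores() - 1L) # PSOCK cluster; workers start with empty environments
clusterEvalQ(cl, Rcpp::cppFunction(
  "std::vector<double> sumprod(arma::cube& a) {
     for(unsigned int i = 1; i < a.n_slices; i++) a.slice(0) %= a.slice(i);
     return(as<std::vector<double>>(wrap(sum(a.slice(0), 0))));
   }",
  depends = "RcppArmadillo"
))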
The following approach requires extensive preprocessing. The upshot is that it makes it trivial to parallelize. However, it will work only if the values of s_dt$d are the same for each o as in the MRE:
identical(s_dt$d, rep(s_dt[o == 1]$d, length.out = nrow(s_dt)))
#> [1] TRUE
Now let's build the functions to accept s_dt and c_dt:
# slightly modified original function for comparison
fun1 <- function(dt, c_dt) {
  # don't use chaining here, for more useful profvis output
  dt <- dt[c_dt, on = .(d)]
  dt <- dt[, r := fcase(d == "c", t,
                        is.na(v), 0,
                        rep(TRUE, .N), v^t)]
  dt <- dt[, .(r = prod(r)), keyby = .(b, n)]
  dt <- dt[, .(r = sum(r)), keyby = .(b)]
}
fun2 <- function(s_dt, c_dt, cl = NULL) {
  s_dt <- copy(s_dt)
  c_dt <- copy(c_dt)
  # preprocess to get "a", "tt", "i", and "idxs"
  i_dt <- s_dt[o == 1][, idxs := .I][c_dt, on = .(d)][, ic := .I][!is.na(v)]
  ub <- unique(c_dt$b)
  un <- unique(c_dt$n)
  nb <- length(ub)
  nn <- length(un)
  c_dt[, `:=`(i = match(n, un) + nn*(match(b, ub) - 1L), r = 0)]
  c_dt[, `:=`(i = i + (0:(.N - 1L))*nn*nb, ni = .N), i]
  c_dt[d == "c", r := t]
  a <- array(1, c(nn, nb, max(c_dt$ni)))
  a[c_dt$i] <- c_dt$r # 3-d array to store v^t (updated for each unique "o")
  i <- c_dt$i[i_dt$ic] # the indices of "a" to update (same for each unique "o")
  tt <- c_dt$t[i_dt$ic] # c_dt$t ordered for "a" (same for each unique "o")
  idxs <- i_dt$idxs # the indices to order s_dt$v (same for each unique "o")
  uo <- unique(s_dt$o)
  v <- collapse::gsplit(s_dt$v, s_dt$o)
  if (is.null(cl)) {
    # non-parallel solution
    data.table(
      o = rep(uo, each = length(ub)),
      b = rep(ub, length(v)),
      r = unlist(
        lapply(
          v,
          function(x) {
            a[i] <- x[idxs]^tt
            sumprod(a)
          }
        )
      ),
      key = "o"
    )
  } else {
    # parallel solution
    clusterExport(cl, c("a", "tt", "i", "idxs"), environment())
    data.table(
      o = rep(uo, each = length(ub)),
      b = rep(ub, length(v)),
      r = unlist(
        parLapply(
          cl,
          v,
          function(x) {
            a[i] <- x[idxs]^tt
            sumprod(a)
          }
        )
      ),
      key = "o"
    )
  }
}
Now the data:
# problem sizing ----
bigdims <- list(o = 50000, d = 50, b = 250, n = 200) # "big" problem - real-life size
lildims <- list(o = 100, d = 50, b = 25, n = 200) # "lil" problem (make runtime shorter as example)
# build some test data tables ----
build_s <- function(dims) {
  o <- seq_len(dims$o)
  d <- paste0("d", seq_len(dims$d))
  v <- as.double(seq_len(dims$o * dims$d)) / 10000
  CJ(o, d)[, `:=`(v = v)]
}
build_c <- function(dims) {
  b <- paste0("c", seq_len(dims$b))
  n <- seq_len(dims$n)
  d <- c("c", paste0("d", seq_len(dims$d)))
  t <- as.double(rep_len(0:6, dims$b * dims$n * (dims$d + 1)))
  dt <- CJ(d, b, n)[, `:=`(t = t)]
  dt <- dt[t != 0]
}
Timing the lil problem, which is so small that parallelization doesn't help:
s_dt <- build_s(lildims)
c_dt <- build_c(lildims)
microbenchmark::microbenchmark(fun1 = s_dt[, fun1(.SD, c_dt), o],
                               fun2 = fun2(s_dt, c_dt),
                               times = 10,
                               check = "equal")
#> Unit: seconds
#>  expr      min       lq     mean   median       uq      max neval
#>  fun1 3.204402 3.237741 3.383257 3.315450 3.404692 3.888289    10
#>  fun2 1.134680 1.138761 1.179907 1.179872 1.210293 1.259249    10
Now the big problem:
s_dt <- build_s(bigdims)
c_dt <- build_c(bigdims)
system.time(dt2p <- fun2(s_dt, c_dt, cl))
#>    user  system elapsed
#>  24.937   9.386 330.600
stopCluster(cl)
A bit longer than 5 minutes with 31 cores.

Replace integers in a data frame column with other integers in R?

I want to replace the values in a data frame column that contains only 4 distinct numbers with specific numbers, as shown below:
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 with 10, 2 with 200, 3 with 458, and 4 with -0.1.
You could use recode from dplyr. Note that the old values are written as character because they become argument names; the new values are numeric, like the column itself (the output below is shown for the shorter 8-row input given in the Note at the end):
library(dplyr)
df %>%
  mutate(tt = recode(tt, '1' = 10, '2' = 200, '3' = 458, '4' = -0.1))
     tt
1  10.0
2  10.0
3 200.0
4 200.0
5 458.0
6 458.0
7  -0.1
8  -0.1
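As an aside, recode is superseded in current dplyr; case_match (dplyr >= 1.1.0) expresses the same mapping without quoting the old values (a sketch, assuming dplyr 1.1+ is available):
df %>%
  mutate(tt = case_match(tt, 1 ~ 10, 2 ~ 200, 3 ~ 458, 4 ~ -0.1))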
To correct the error in the code in the question and to provide a shorter example, we use the input in the Note at the end. Here are several alternatives; nos, defined in (1), is used in some of the others too. No packages are used.
1) indexing Since the input values are in 1:4 we can use them directly to index nos. This is probably the simplest solution given that the original values of tt are in 1:4.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization:
transform(df, tt = nos[match(tt, 1:4)])
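For instance, with arbitrary (hypothetical) codes in place of 1:4, match still lines the values up with nos:
df2 <- data.frame(tt = c(22, 44, 11)) # hypothetical codes
old <- c(11, 22, 33, 44)
transform(df2, tt = nos[match(tt, old)])
##      tt
## 1 200.0
## 2  -0.1
## 3  10.0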
2) arithmetic Another approach is to use arithmetic. Each value of tt satisfies exactly one of the conditions, so exactly one term per row is nonzero:
transform(df, tt = 10 * (tt == 1) +
                   200 * (tt == 2) +
                   458 * (tt == 3) +
                   -0.1 * (tt == 4))
3) outer/matrix multiplication This would also work: outer builds an indicator matrix with one column per old value, and the matrix product with nos picks out the replacement (c() drops the one-column result back to a vector).
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) Or as a pipeline:
transform(df, tt = tt |>
  factor(levels = 1:4, labels = nos) |>
  format() |>
  as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))

How to prepare a 3rd data.frame from two others

I've constructed a data.frame using the inefficient code below. Can you improve it? If you can think of a better starting point, please include that in your answer.
My code takes data from the first two data frames and combines them to give the third. The first data.frame is a grid of 1s and -1s representing low or high values. The second data.frame includes all the information for me to calculate the high or low values. Note that each column has similar calculations but the calculation may differ from column to column.
## Example Data for Question
## How to prepare a 3rd data.frame from two others
## Prepare 1st data.frame, a Yates table
A <- B <- C <- c(1,-1)
yates.1 <- expand.grid(A = A, B = B, C = C)
## Prepare 2nd data.frame with the reaction data
reaction.info <- data.frame(stringsAsFactors = FALSE,
                            factor.name = c("A", "B", "C"),
                            Component = c("Water", "SM", "Reagent"),
                            Mw = c(18, 36.5, 40),
                            centre.point = c(20, 1.4, 1.45),
                            positive.point = c(22, 1.54, 1.595),
                            negative.point = c(18, 1.26, 1.305))
## Prepare 3rd data.frame to be filled
reaction.quants <- as.data.frame(matrix(NA, dim(yates.1)[1], dim(yates.1)[2]))
names(reaction.quants) <- reaction.info[,2]
reaction.quants[yates.1[,1] == 1 ,1] <- round(5 * reaction.info[1,3] * reaction.info[1,5], 3)
reaction.quants[yates.1[,1] == -1 ,1] <- round(5 * reaction.info[1,3] * reaction.info[1,6], 3)
reaction.quants[yates.1[,2] == 1 ,2] <- round(5 * reaction.info[2,3] * reaction.info[2,5], 3)
reaction.quants[yates.1[,2] == -1 ,2] <- round(5 * reaction.info[2,3] * reaction.info[2,6], 3)
reaction.quants[yates.1[,3] == 1 ,3] <- round(5 * reaction.info[3,3] * reaction.info[3,5], 3)
reaction.quants[yates.1[,3] == -1 ,3] <- round(5 * reaction.info[3,3] * reaction.info[3,6], 3)
## three data.frames
yates.1
reaction.info
reaction.quants
Consider refactoring with ifelse logic and filtered vectors inside a user-defined method, since the logic is very similar but runs across different columns:
convert_col <- function(nm) {
  with(reaction.info,
       ifelse(reaction.quants_new[[nm]] == 1,
              round(5 * Mw[Component == nm] * positive.point[Component == nm], 3),
              round(5 * Mw[Component == nm] * negative.point[Component == nm], 3)))
}
# INITIALIZE DATA FRAME (CAN BE NESTED IN NEXT CALL)
reaction.quants_new <- setNames(data.frame(yates.1), reaction.info$Component)
# ADD COLUMNS (CAN ALSO USE TRANSFORM)
reaction.quants_new <- within(reaction.quants_new, {
  Water <- convert_col("Water")
  SM <- convert_col("SM")
  Reagent <- convert_col("Reagent")
})
reaction.quants_new
#   Water     SM Reagent
# 1  1980 281.05     319
# 2  1620 281.05     319
# 3  1980 229.95     319
# 4  1620 229.95     319
# 5  1980 281.05     261
# 6  1620 281.05     261
# 7  1980 229.95     261
# 8  1620 229.95     261
identical(reaction.quants, reaction.quants_new)
# [1] TRUE
And with pipes, available in R 4.1+, still using the user-defined method:
reaction.quants_new <- data.frame(yates.1) |>
  setNames(reaction.info$Component) |>
  within({
    Water <- convert_col("Water")
    SM <- convert_col("SM")
    Reagent <- convert_col("Reagent")
  })
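If the number of factors grows beyond three, the per-column assignments can be generated instead of written out; a sketch reusing convert_col, assuming the columns stay in the same order as reaction.info$Component:
# lapply finishes before the assignment, so convert_col still sees the +/-1 grid
reaction.quants_new <- setNames(data.frame(yates.1), reaction.info$Component)
reaction.quants_new[] <- lapply(reaction.info$Component, convert_col)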

Recursive update of columns in R data.table

As a simple example, I need to create the following table (this is my desired result):
library(data.table)
DT <- data.table(A = c(2, 5, 4, -2, 4),
                 B = c(1000, 1100, 1375, 1650, 1485),
                 C = c(50, 55, 68.75, 82.5, 74.25),
                 D = c(100, 275, 275, -165, 297))
DT
This is my attempt so far which is not working:
DT.2 <- data.table(A= c(2,5,4,-2,4) )
DT.2[, B := 1000 ] # B should begin at 1000, then cumulatively add column D
DT.2[, C := B * 0.05 ]
DT.2[, D := A * C ]
DT.2[, B := 1000 + cumsum(shift(D, type= "lag", fill=FALSE)) ]
DT.2
As you can see, the columns do not update correctly, because each column relies on the results of the other columns. It's a very easy calculation in Excel and I just need to understand how to adapt it to R.
Thank you
So far as I can tell, you need a loop. Here's my input data:
DT <- data.table(A = c(2, 5, 4, -2, 4),
                 B = c(1000, rep(NA, 4)),
                 C = numeric(5),
                 D = numeric(5))
And the loop I use:
#initial row
DT[1, c("C", "D") := .(.05 * B, .05 * A * B)]
#subsequent rows
for (nn in 2:nrow(DT)) {
  new_B <- DT[nn - 1L, B + D]
  DT[nn, c("B", "C", "D") := .(new_B, .05 * new_B, .05 * A * new_B)]
}
Could also easily be translated to a sapply so you don't have nn sitting around in your namespace (but now it will print something meaningless to your console):
sapply(2:nrow(DT), function(nn){
(same as before)})
And yes, I can imagine this would look much simpler in Excel since it auto-updates the formula for you with click+drag.
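That said, this particular recurrence can be unrolled: each row's balance is the previous balance times (1 + 0.05 * A), so cumprod reproduces B without a loop. A sketch using the question's numbers (the result matches the desired DT above):
DT.3 <- data.table(A = c(2, 5, 4, -2, 4))
# B[1] = 1000 and B[n] = B[n-1] * (1 + 0.05 * A[n-1])
DT.3[, B := 1000 * cumprod(1 + 0.05 * shift(A, fill = 0))]
DT.3[, C := 0.05 * B]
DT.3[, D := A * C]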
I think that you actually want a function, and I am guessing that it would make more sense like this:
fun <- function(id, ## not sure what this is, probably an id
                start = 1000,
                rate = .05,
                gain_or_loss) {
  require(data.table)
  cnt <- length(id)
  ret <- data.table(ID = c(id, NA),
                    bal = cumsum(c(start, gain_or_loss)),
                    fixed_change = rep(NA_real_, cnt + 1),
                    total_change = c(gain_or_loss, NA_real_))
  ret[, fixed_change := bal * rate] ## use the rate argument rather than hard-coding 0.05
  ret <- ret[] ## needed so the data.table auto-prints after being returned from a function
  return(ret)
}
Which would give:
fun(id = c(2, 5, 4, -2, 4),
    gain_or_loss = c(100, 275, 275, -165, 297))
#    ID  bal fixed_change total_change
# 1:  2 1000        50.00          100
# 2:  5 1100        55.00          275
# 3:  4 1375        68.75          275
# 4: -2 1650        82.50         -165
# 5:  4 1485        74.25          297
# 6: NA 1782        89.10           NA
I would also add, right after the require call, if (length(id) != length(gain_or_loss)) stop("id and gain_or_loss need to be of the same length").

Loop through data frames based upon name

I have another simple R question that hopefully someone can help with. I have a series of data frames with a repetitive name structure. I would like to loop through them and perform some analysis. Here is a hardcoded example of what I want to do, using some fake data:
#Create some fake data
n1 = c(2, 3, 5, 7)
s1 = c(1, 1, 2, 0)
b1 = c(6, 0, 0, 0)
Tank001.df = data.frame(n1, s1, b1)
n2 = c(1, 2, 4, 6)
s2 = c(2, 2, 0, 0)
b2 = c(8, 9, 10, 0)
Tank002.df = data.frame(n2, s2, b2)
n3 = c(7, 12, 0, 0)
s3 = c(5, 3, 0, 0)
b3 = c(8, 9, 10, 4)
Tank003.df = data.frame(n3, s3, b3)
The first action I would like to automate is the conversion of 0 values to NA. Here is the hardcoded version, but ideally I would automate this dependent on how many TankXXX.df data frames I have:
#Convert zeros to NA
Tank001.df[Tank001.df==0] <- NA
Tank002.df[Tank002.df==0] <- NA
Tank003.df[Tank003.df==0] <- NA
Finally I would like to complete a series of queries of the data, a simple example of which might be the number of values smaller than 5 in each dataframe:
#Return the number of values smaller than 5
Tank001.less.than.5 <- numeric(length(Tank001.df))
for (i in 1:(length(Tank001.df))) {Tank001.less.than.5[i] <- sum(Tank001.df[[i]] < 5,na.rm=TRUE)}
Tank002.less.than.5 <- numeric(length(Tank002.df))
for (i in 1:(length(Tank002.df))) {Tank002.less.than.5[i] <- sum(Tank002.df[[i]] < 5,na.rm=TRUE)}
Tank003.less.than.5 <- numeric(length(Tank003.df))
for (i in 1:(length(Tank003.df))) {Tank003.less.than.5[i] <- sum(Tank003.df[[i]] < 5,na.rm=TRUE)}
Ideally I would also like to know how to write the results of such simple calculations to a new data frame, in this case for example Less.than.5$TankXXX etc.
Any help would be greatly appreciated.
Create a list of your data.frames and use a combination of lapply and sapply as follows:
TankList <- list(Tank001.df, Tank002.df, Tank003.df)
lapply(TankList, function(x) {
  x[x == 0] <- NA
  sapply(x, function(y) sum(y < 5, na.rm = TRUE))
})
# [[1]]
# n1 s1 b1
# 2 3 0
#
# [[2]]
# n2 s2 b2
# 3 2 0
#
# [[3]]
# n3 s3 b3
# 0 1 1
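To also cover the last part of the question (collecting the counts into a new data frame addressable as Less.than.5$TankXXX), the list can be built by name with mget and the results simplified with sapply; a sketch assuming the TankXXX.df naming used above:
TankList <- mget(ls(pattern = "^Tank\\d{3}\\.df$"))
Less.than.5 <- as.data.frame(sapply(TankList, function(x) {
  x[x == 0] <- NA
  unname(sapply(x, function(y) sum(y < 5, na.rm = TRUE)))
}))
Less.than.5$Tank001.df # counts for Tank001's three columns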
This also works with a single lapply and colSums. Here "is.na<-"(x, !x) replaces the zeros with NA, since !x is TRUE exactly where x == 0:
l <- list(Tank001.df, Tank002.df, Tank003.df) # create a list
lapply(l, function(x) colSums("is.na<-"(x, !x) < 5, na.rm = TRUE))
# [[1]]
# n1 s1 b1
# 2 3 0
#
# [[2]]
# n2 s2 b2
# 3 2 0
#
# [[3]]
# n3 s3 b3
# 0 1 1
