A function for use in apply() in R

Does anyone know how I can define a function to use in apply() that returns the following value: data[i, data[i, 300]]?
data[i, 300] holds a column number in my data, and I want to know the content of that column for each row.
It is easy to address this with a for loop:
for (i in 1:nrow(data)) {
  value <- data[i, data[i, 300]]
}
But this takes too much time for 10M observations, so I am looking for a fast approach using apply().
I appreciate any help you can provide.

I am going to assume that the 300th column holds a (possibly different) column number for each row, and that for each row you want the value in the column it points to.
sample data:
set.seed(123)
df <- data.frame(
  c1 = sample(letters, 100, replace = TRUE),
  c2 = sample(letters, 100, replace = TRUE),
  c3 = sample(letters, 100, replace = TRUE),
  c4 = sample(letters, 100, replace = TRUE),
  c5 = sample(letters, 100, replace = TRUE),
  d  = sample(1:5, size = 100, replace = TRUE)
)
> head(df)
c1 c2 c3 c4 c5 d
1 o o p b r 2
2 s u w m l 2
3 n e a x o 3
4 c h h q s 5
5 j s h c u 1
6 r j j r f 3
Thus for the above sample, let's say the 6th column, d, is the one that gives the column numbers. The expected output is then df[1, 2], df[2, 2], df[3, 3], df[4, 5], df[5, 1], df[6, 3], which evaluates to c('o', 'u', 'a', 's', 'j', 'j').
To get this we can use:
df[cbind(1:nrow(df), df[[6]])] # change 6 with 300 for your case
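If you want a quick sanity check of the vectorised lookup, here is a sketch comparing it against a row-by-row version (using sapply instead of the original for loop; as.character() is only there to make the comparison type-agnostic):
# matrix indexing: one (row, column) pair per row of df
vectorised <- df[cbind(1:nrow(df), df[[6]])]
# the same lookup done row by row
looped <- sapply(seq_len(nrow(df)), function(i) df[i, df[i, 6]])
identical(as.character(vectorised), as.character(looped))  # should be TRUE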

Related

Thousand separator to numeric columns in R

I am trying to format numbers as shown below (adding a thousand separator). The function works fine, but after formatting, the numeric columns no longer sort numerically because the values are now character strings.
df <- data.frame(x = c(12345, 35666, 345, 5646575))
format_numbers <- function(df, column_name) {
  df[[column_name]] <- ifelse(nchar(df[[column_name]]) <= 5,
                              paste(format(round(df[[column_name]] / 1e3, 1), trim = TRUE), "K"),
                              paste(format(round(df[[column_name]] / 1e6, 1), trim = TRUE), "M"))
}
df$x <- format_numbers(df, "x")
> df
x
1 12.3 K
2 35.7 K
3 0.3 K
4 5.6 M
Can we make sure the numbers still sort in ascending/descending order after formatting?
Note: this data frame df is to be incorporated into a DT table.
The problem is the formatting step. If you do it correctly, i.e. while keeping your data numeric, everything else falls into place. Here I will demonstrate using an S3 class:
my_numbers <- function(x) structure(x, class = c('my_numbers', 'numeric'))

format.my_numbers <- function(x, ..., d = 1, L = c('', 'K', 'M', 'B', 'T')) {
  ifelse(abs(x) >= 1000, Recall(x / 1000, d = d + 1),
         sprintf('%.1f%s', x, L[d]))
}

print.my_numbers <- function(x, ...) print(format(x), quote = FALSE)

'[.my_numbers' <- function(x, ..., drop = FALSE) my_numbers(NextMethod('['))
Now you can run your code:
df <- data.frame(x = c(12345,35666,345,5646575))
df$x <- my_numbers(df$x)
df
x
1 12.3K
2 35.7K
3 345.0
4 5.6M
You can use any mathematical operation on column x, as it is still numeric. For example, cbinding it with its double and ordering from smallest to largest:
cbind(x = df, y = df*2)[order(df$x),]
x x
3 345.0 690.0 # smallest
1 12.3K 24.7K
2 35.7K 71.3K
4 5.6M 11.3M # largest ie Millions
Note that under the hood, x does not change:
unclass(df$x)
[1] 12345 35666 345 5646575 # Same as given
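Since the question mentions that df ends up in a DT table, another common pattern (not part of the answer above, just a sketch) is to keep the numeric column for sorting and add a purely cosmetic label column for display:
df <- data.frame(x = c(12345, 35666, 345, 5646575))
# label column used only for display; x stays numeric and sorts correctly
df$x_label <- ifelse(abs(df$x) >= 1e6,
                     paste(format(round(df$x / 1e6, 1), trim = TRUE), "M"),
                     paste(format(round(df$x / 1e3, 1), trim = TRUE), "K"))
df[order(df$x), ]  # ordering still uses the numeric column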

Subtract vector from matrix based on data.frame efficiently

I have a matrix X, two data frames A and B, and two vectors of indices, vec_a and vec_b. A and B each contain an index variable whose values correspond to the values in vec_a and vec_b. Other than that, A and B contain as many value columns as there are columns in X:
# original data
X <- matrix(rnorm(200), 100, 2)
# values to subtract, stored in data.frames
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
# indices saying which values to subtract (one for each row of X)
vec_a <- sample(1:4, nrow(X), replace = TRUE)
vec_b <- sample(1:4, nrow(X), replace = TRUE)
What I want to achieve is the following: for each row iii of X, get the values value1 and value2 from A and B based on element iii of the vectors vec_a and vec_b, then subtract these values from the corresponding row of X. It may sound a bit confusing, but I hope the following solution makes the goal clearer:
# iterate over all rows of X
for (iii in 1:nrow(X)) {
  # get the correct values
  X_clean <- A[which(A$index_a == vec_a[iii]), -1] -  # subtract correct A value
    B[which(B$index_b == vec_b[iii]), -1]              # subtract correct B value
  # this intermediate step is necessary, otherwise we subtract a data.frame from a matrix
  X_clean <- as.numeric(X_clean)
  # subtract from X
  X[iii, ] <- X[iii, ] - X_clean
}
Note that we have to convert to numeric in my loop solution; otherwise X loses the class matrix, because we would be subtracting a data.frame from a matrix. My solution works fine, until you need to do this for many matrices like A and B and for millions of observations. Is there a solution that does not rely on looping over all rows?
EDIT
Thanks, both answers improve the speed of the code massively. I chose the answer by StupidWolf as it was more efficient than using data.table:
Unit: microseconds
expr min lq mean median uq max neval cld
datatable 5557.355 5754.931 6052.402 5881.729 5975.386 14154.040 100 b
stupid.wolf 818.529 1172.840 1311.784 1187.593 1221.164 4777.743 100 a
loop 111748.790 115141.149 116677.528 116109.571 117085.048 156497.999 100 c
You can just match the rows:
set.seed(111)
# original data
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
newX <- X - as.matrix(A[match(vec_a,A$index_a),-1]-B[match(vec_b,B$index_b),-1])
Then we run your loop:
for(iii in 1:nrow(X)){
X_clean <- A[which(A$index_a == vec_a[iii]),-1] - # subtract correct A value
B[which(B$index_b == vec_b[iii]),-1] # subtract correct B value
X_clean <- as.numeric(X_clean)
X[iii,] = X[iii,] - X_clean
}
And check the values are equal:
all.equal(c(newX),c(X))
[1] TRUE
match() should be pretty fast, but if it is still too slow, you can pull the values of A out directly with vec_a, e.g. A[vec_a, ].
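A sketch of that shortcut, assuming index_a and index_b really are just 1:4 in row order (so vec_a and vec_b can be used as row indices directly); apply it to the original X, before the verification loop above overwrites it:
# direct row indexing instead of match(); only valid because index_a/index_b
# are exactly 1:4 in order
newX2 <- X - as.matrix(A[vec_a, -1] - B[vec_b, -1])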
This approach uses data.table for easy joining.
library(data.table)
set.seed(111)
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
setDT(A); setDT(B)
dtX <- as.data.table(cbind(1:nrow(X), X, vec_a, vec_b))
as.matrix(
  dtX[A, on = .(vec_a = index_a)][
    B, on = .(vec_b = index_b)][
      order(V1),
      .(V2 - (value1 - i.value1), V3 - (value2 - i.value2))]
)
V1 V2
[1,] 0.22746 0.7069
[2,] 1.84340 -0.1258
[3,] -0.70038 1.2494
...
[98,] 2.04666 0.6767
[99,] 0.02451 1.0473
[100,] -2.72553 -0.6595
Hopefully this will be pretty fast for very large matrices.

Use different formulas to calculate a column in data.table in R

I want to calculate a column called result, using a different formula depending on the value of another column.
Here is an example.
library(data.table)
# my formulas
formulas <- c('(a+b)/2', 'a*b', 'a/b', 'b^2+a')
temp <- data.table(a = rnorm(n = 100, mean = 0, sd = 1),
                   b = rnorm(n = 100, mean = 0, sd = 1),
                   formula = sample(formulas, size = 100, replace = TRUE))
temp[, 'result':=eval(parse(text = formula))]
The problem is that no matter what the formula column says, all values of result are calculated with the first formula, the average. I know I could create one column per formula, but is there a better way?
How can I fix this, and what is the most efficient approach?
Here is a snippet of the table:
> temp[1:10]
a b formula result
1: -3.2133845 -0.78415565 b^2+a -1.998770087
2: 1.0723745 -0.31782577 a/b 0.377274341
3: 0.2269515 -0.15369020 a*b 0.036630652
4: 0.3339993 -0.86385430 a*b -0.264927499
5: 2.1118212 0.33736843 a/b 1.224594821
6: 0.9475773 -0.95697168 (a+b)/2 -0.004697187
7: 0.1912716 -1.71286598 a/b -0.760797195
8: 0.7773886 -0.01156844 (a+b)/2 0.382910072
9: -1.3132885 0.42693258 (a+b)/2 -0.443177939
10: 0.4569847 -0.70861707 b^2+a -0.125816205
eval(parse(text = ...)) only uses the first parsed expression, so we need to evaluate row by row, e.g. by grouping on the row sequence:
temp[, result := eval(parse(text = formula)), 1:nrow(temp)]
head(temp, 10)
# a b formula result
# 1: -0.8745498 1.59139467 b^2+a 1.6579872
# 2: 1.8701160 -0.04637923 a/b -40.3222721
# 3: 0.7160009 1.20070549 a*b 0.8597063
# 4: -0.3374944 1.18557927 a/b -0.2846662
# 5: 1.0452618 -0.60531471 (a+b)/2 0.2199735
# 6: 1.1730676 -0.29384938 a/b -3.9920710
# 7: 1.4913614 -0.44015278 a/b -3.3882811
# 8: 1.6538467 -0.18478224 a/b -8.9502472
# 9: -0.1331562 1.84615754 b^2+a 3.2751415
#10: 1.3096422 -0.39569134 a/b -3.3097570
Checking manually:
1.8701160/-0.04637923
#[1] -40.32227
0.7160009 * 1.20070549
#[1] 0.8597062
Or with Map:
temp[, result := unlist(Map(function(x, a, b)
  eval(parse(text = x)), formula, a, b))]
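If row-by-row evaluation turns out to be slow on a large table, another option (a sketch, not from the answer above; it assumes each formula only references columns of temp) is to group by the formula string, so each distinct formula is parsed and evaluated only once per group:
# evaluate each distinct formula once across all rows that use it
temp[, result2 := eval(parse(text = .BY[[1]])), by = formula]
temp[, all.equal(result, result2)]  # should be TRUE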

dynamic number of columns in a data table with unique random draws

Suppose I have the following data.table, for arbitrary J:
set.seed(1)
J <- 2
n <- 100
BB <- data.table(r = 1:n)
BB[, (paste0("a", seq(J))) := rnorm(n, 1, 7)]
So the output is...
> BB
r a1 a2
1: 1 -3.38517668 -3.38517668
2: 2 2.28550327 2.28550327
3: 3 -4.84940029 -4.84940029
...
Why are the two columns identical rather than different rnorm draws?
The RHS of := is evaluated only once, so the single length-n vector produced by rnorm(n, 1, 7) is recycled into every column named on the LHS. To get independent draws, generate a separate vector per column, for example with the super-fast for/set combination:
for (i in seq(J))
  set(x = BB, j = paste0('a', i), value = rnorm(n, 1, 7))
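Alternatively, a sketch of the same idea in a single := call: build a list with one independent draw per column, since replicate() with simplify = FALSE returns a list and := assigns one list element per target column.
# one list element per target column, each an independent length-n draw
BB[, (paste0("a", seq(J))) := replicate(J, rnorm(n, 1, 7), simplify = FALSE)]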

operations (+, -, /, *) on unequal-sized data.table

1) Is it possible to do operations (multiplication, division, addition, subtraction) between unequal-sized data.tables using data.table, or does it have to be done with data.frame?
The following example is a simplified version of my original posting. In my actual data set it would be A1:A12, B1:B12, C1:C12, E1:E12, F1:F12, etc. I've added columns J and K to get closer to my original data set and to show that I cannot do the following in a matrix.
# Sample Data
library(data.table)
input1a <- data.table(ID = c(37, 45, 900),
                      A1 = c(1, 2, 3),
                      A2 = c(43, 320, 390),
                      B1 = c(-0.94, 2.2, -1.223),
                      B2 = c(2.32, 4.54, 7.21),
                      C1 = c(1, 2, 3),
                      C2 = c(-0.94, 2.2, -1.223),
                      D  = c(43, 320, 390),
                      J  = paste0("measurement_1", 1:3),
                      K  = paste0("type_1", 1:3))
setkey(input1a, ID)
input1a
# ID A1 A2 B1 B2 C1 C2 D J K
# 1: 37 1 43 -0.940 2.32 1 -0.940 43 measurement_11 type_11
# 2: 45 2 320 2.200 4.54 2 2.200 320 measurement_12 type_12
# 3: 900 3 390 -1.223 7.21 3 -1.223 390 measurement_13 type_13
input2a <- data.table(ID = c(37, 45, 900),
                      E1 = c(23, -0.2, 12),
                      E2 = c(-0.33, -0.012, -1.342))
setkey(input2a, ID)
input2a
# ID E1 E2
# 1: 37 -0.6135756 -0.330
# 2: 45 -0.0124872 -0.012
# 3: 900 -0.4165049 -1.342
outputa <- 0.00066 * input1a[, c(4:5), with = FALSE] *
  input1a[, 8, with = FALSE] * input2a[, c(2:3), with = FALSE]
# i.e. 0.00066 * B1:B2 * D * E1:E2, referring back to the column names;
# this result has no keys, but I would like to keep the keys
setnames(outputa, 1:2, c("F1", "F2"))
Result using outputa
outputa # using existing code and gives a result with no keys
# F1 F2
# 1: -0.6135756 -0.02172773
# 2: -0.0929280 -0.01150618
# 3: -3.7776024 -2.49055607
In the following code I took outputa, which did not keep the keys, and rewrote it as outputause. I would like question 2) answered so that I can perform the needed operations on the data set while keeping the keys intact.
2) How can the following code be rewritten with x defined for each group of columns? This question stems from Weighted sum of variables by groups with data.table and my trouble trying to replicate any of the answers with my data set.
Each group of columns is defined below:
A1:A2 (input1a[, 2:3]),
B1:B2 (input1a[, 4:5]), and
D (input1a[, 8]).
In outputause, if input1a[, c(4:5), with = FALSE] were the only group taken from input1a, then it alone would be x.
What about when you have more than one group from a single data.table as is shown below?
outputause <- input1a[, lapply(.SD, function(x) {
  0.00066 * input1a[, c(4:5), with = FALSE] * input1a[, 8, with = FALSE] *
    input2a[, c(2, 3), with = FALSE]
}), by = key(input1a)]  # keeping keys intact
setnames(outputause, 2:3, c("F1", "F2"))
Result using outputause
outputause # using revised code and result includes the keys
# ID F1 F2
# 1: 37 -0.6135756 -0.02172773
# 2: 45 -0.0929280 -0.01150618
# 3: 900 -3.7776024 -2.49055607
UPDATE
input2at <- data.table(t(input2a))
inputs <- data.table(input1a, input2at)
I have transposed input2a and combined it with input1a in the data.table inputs. In this simple example I had 3 rows, but in my actual data set I'll have close to 1300 rows. This is the reason why I've asked question 2).
Thank you.
I am answering my own question based on an answer provided to me in R data.table operations with multiple groups in single data.table and outside function with lapply.
outputa <- data.table(input1a, input2a)
setnames(outputa, 8, "D1")
outputa[, D2 := D1]

fun <- function(B, D, E) 0.00066 * B * D * E

outputa[, lapply(1:2, function(i) fun(get(paste0('B', i)),
                                      get(paste0('D', i)),
                                      get(paste0('E', i)))),
        by = ID]
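To scale this to the real data (e.g. A1:A12, B1:B12, ...), a sketch under the assumption that the same Bi/Di/Ei naming pattern holds; it also names the result columns so the output stays readable (the F names are just illustrative):
# hypothetical generalisation: change 1:2 to however many column groups exist
res <- outputa[, setNames(
  lapply(1:2, function(i) fun(get(paste0("B", i)),
                              get(paste0("D", i)),
                              get(paste0("E", i)))),
  paste0("F", 1:2)),
  by = ID]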
