R Function to write 3 calculated columns to a data.table

This may have already been answered, but I couldn't quite find the answer I am looking for. I am trying to write the output of a function that calculates 3 variables to a data.table.
Currently I am copying the function three times (with three different names), each time returning a different variable. This takes a lot more time because it runs thrice. I understand
there may be a better way to do it, using a list or some data.table idiom.
I would greatly appreciate any input you can provide to simplify this. Below is an example of how I am calling it one variable at a time.
Example
fn_1 <- function(a, b, c, d){
  for (i in 1:b) { col_1[i] = calculation }
  for (i in 1:c) { col_2[i] = calculation }
  for (i in 1:d) { col_3[i] = calculation }
  return(col_1)
}
data[, column_1 := fn_1(a, b, c, d), by = .(e, f)]
fn_2 <- function(a, b, c, d){
  for (i in 1:b) { col_1[i] = calculation }
  for (i in 1:c) { col_2[i] = calculation }
  for (i in 1:d) { col_3[i] = calculation }
  return(col_2)
}
data[, column_2 := fn_2(a, b, c, d), by = .(e, f)]

The OP has tagged the question with data.table. docendo discimus' comment shows the direction to follow.
Create sample data
library(data.table) # CRAN version 1.10.4 used
n <- 10L
DT <- data.table(
  a = 1:n, b = (n:1)^2, c = -(1:n), d = 2 * (1:n) - n/2,
  e = rep(LETTERS[1:2], length.out = n),
  f = rep(LETTERS[3:4], each = n/2, length.out = n))
DT
# a b c d e f
# 1: 1 100 -1 -3 A C
# 2: 2 81 -2 -1 B C
# 3: 3 64 -3 1 A C
# 4: 4 49 -4 3 B C
# 5: 5 36 -5 5 A C
# 6: 6 25 -6 7 B D
# 7: 7 16 -7 9 A D
# 8: 8 9 -8 11 B D
# 9: 9 4 -9 13 A D
#10: 10 1 -10 15 B D
Define function
fn <- function(p, q, r, s) {
  list(X1 = p + mean(q) + r + s,
       Y2 = p * q + r * s,
       Z3 = p * q - r * s)
}
The function takes 4 parameters and returns a list of 3 named vectors. Note that the computations inside the function don't need for loops, in contrast to the OP's approach.
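To see the shape of the return value, fn can be called by hand on a single group, for instance the e == "A", f == "C" group (rows 1, 3, and 5 of DT):
g <- DT[e == "A" & f == "C"]
fn(g$a, g$b, g$c, g$d)
# $X1
# [1] 63.66667 67.66667 71.66667
#
# $Y2
# [1] 103 189 155
#
# $Z3
# [1]  97 195 205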
Apply function to data.table
Note that the OP wants to group on columns e and f when the function is applied.
The first variant creates a new data.table. By default, the names of the list elements as defined in fn are used:
DT[, fn(a, b, c, d), .(e, f)]
# e f X1 Y2 Z3
# 1: A C 63.66667 103 97
# 2: A C 67.66667 189 195
# 3: A C 71.66667 155 205
# 4: B C 64.00000 164 160
# 5: B C 68.00000 184 208
# 6: B D 18.66667 108 192
# 7: B D 22.66667 -16 160
# 8: B D 26.66667 -140 160
# 9: A D 19.00000 49 175
#10: A D 23.00000 -81 153
The second variant updates DT by reference. The names of the new columns are explicitly stated.
DT[, c("x", "y", "z") := fn(a, b, c, d), .(e, f)]
DT
# a b c d e f x y z
# 1: 1 100 -1 -3 A C 63.66667 103 97
# 2: 2 81 -2 -1 B C 64.00000 164 160
# 3: 3 64 -3 1 A C 67.66667 189 195
# 4: 4 49 -4 3 B C 68.00000 184 208
# 5: 5 36 -5 5 A C 71.66667 155 205
# 6: 6 25 -6 7 B D 18.66667 108 192
# 7: 7 16 -7 9 A D 19.00000 49 175
# 8: 8 9 -8 11 B D 22.66667 -16 160
# 9: 9 4 -9 13 A D 23.00000 -81 153
#10: 10 1 -10 15 B D 26.66667 -140 160

You're in the second circle of hell. To solve the problem, pre-allocate what you want to add.
data <- data.table(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
Then, make a vectorized function to do the calculation, which returns the whole column to append.
calculation <- Vectorize(function(x) mean(c(x, 3)))
Write fn in terms of this new function, and return the whole block of columns to be added, then cbind it with data to add all the columns at once. It's extremely slow to do all the calculations every time, and then only return one part.
fn <- function(b, c, d) {
  toBeAdded <- data.table(matrix(nrow = nrow(data), ncol = 3))
  toBeAdded[, 1] <- calculation(b)
  toBeAdded[, 2] <- calculation(c)
  toBeAdded[, 3] <- calculation(d)
  toBeAdded
}
data <- cbind(data, fn(data[[1]], data[[2]], data[[3]]))

Answering my own question: based on inputs from @docendodiscimus and @ConCave, I solved it like this. I appreciate everyone's input!
fn_1 <- function(a, b, c, d){
  for (i in 1:b) { col_1[i] = calculation }
  for (i in 1:c) { col_2[i] = calculation }
  for (i in 1:d) { col_3[i] = calculation }
  df = data.table(col_1, col_2, col_3)
  return(df)
}
data[, c("column_1", "column_2", "column_3") := fn_1(a, b, c, d), by = .(e, f)]

Does it have to be a data.table? If not, then you can just use mutate() from dplyr:
library(dplyr)

a <- c(1, 2, 2, 1, 2, 3, 4, 2)
b <- c(3, 3, 2, 3, 5, 4, 3, 2)
c <- c(9, 9, 8, 7, 8, 9, 8, 7)
d <- c(0, 1, 1, 0, 1, 1, 0, 1)
have <- data.frame(a, b, c, d)
want <- have %>%
  mutate(abc = a + b + c,
         db = d * b,
         aa = 2 * a)
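And if it does have to be a data.table, a sketch of the equivalent with := (adding all three columns by reference in one call):
library(data.table)
DT <- as.data.table(have)
DT[, `:=`(abc = a + b + c, db = d * b, aa = 2 * a)]
DT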

How to use functions to do a recursive calculation in data.table/R?

I am new to programming and got stuck. I wanted to calculate the hourly temperature variation of an object throughout the year using some variables, which change every hour. The original data contains 60 columns and 8760 rows.
I got the desired output using a for loop, but the model takes a lot of time to run. I wonder if there is any way to replace the loop with functions, which, I suspect, could also speed up the calculation.
Here is a small reproducible example to show what I did.
library(data.table)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table
A B C
1: 1 1 10
2: 1 2 10
3: 1 3 10
4: 1 4 10
5: 1 5 10
The for loop:
for (j in 2:nrow(table)) {
  table$A[j] = (table$A[j-1] + table$B[j-1]) * table$B[j]
  table$C[j] = table$B[j] * table$A[j]
}
I got the output as I desired:
A B C
1: 1 1 10
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
but it took 15 minutes to run the whole program in my real case (not this example!).
So I tried to use a function instead of the for loop.
I tried this:
library(dplyr)

table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
myfun <- function(df){
  df = df %>% mutate(A = (lag(A) + lag(B)) * B,
                     C = B * A)
  return(df)
}
myfun(table)
But the output was
A B C
1 NA 1 NA
2 4 2 8
3 9 3 27
4 16 4 64
5 25 5 125
It seems the function refers to the rows of the original table, not the rows updated during the calculation. Is there any way to obtain the desired output using functions? This is my first R project; any help is very much appreciated. Thank you.
A much faster alternative using data.table. Note that the calculation of C can be separated from the calculation of A, so we can do less within the loop:
for (i in 2:nrow(table)) {
  set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
table[-1, C := A * B]
table
# A B C
# <num> <int> <num>
# 1: 1 1 10
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
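The gain comes from set() bypassing the overhead of [.data.table (and of $<-, which copies the column) on every iteration. A rough benchmark sketch on a longer table, assuming the same recurrence (A overflows to Inf quickly there, which is harmless for timing purposes):
library(data.table)
n <- 1e4L
tbl1 <- data.table(A = 1, B = 1:n, C = 10)
tbl2 <- copy(tbl1)
microbenchmark::microbenchmark(
  dollar = for (j in 2:n) tbl1$A[j] <- (tbl1$A[j-1] + tbl1$B[j-1]) * tbl1$B[j],
  set    = for (i in 2:n) set(tbl2, i = i, j = "A", value = (tbl2$A[i-1] + tbl2$B[i-1]) * tbl2$B[i]),
  times = 3
)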
You can try Reduce as below:
dt[, A := Reduce(function(x, Y) (x + Y[2]) * Y[1],
                 asplit(embed(B, 2), 1),
                 init = A[1],
                 accumulate = TRUE)
   ][, C := A * B]
which updates dt as
> dt
A B C
1: 1 1 1
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
Data
dt <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
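If the Reduce pattern is unfamiliar: with accumulate = TRUE, each output element is f(previous result, next input), and embed()/asplit() feed each step the pair c(B[i], B[i-1]):
Reduce(function(acc, x) acc + x, 1:5, accumulate = TRUE)
# [1]  1  3  6 10 15
embed(1:5, 2)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4
so function(x, Y) (x + Y[2]) * Y[1] computes (A[i-1] + B[i-1]) * B[i] at every step.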
Here's a solution using purrr::accumulate2, which lets you use the result of the previous computation as the input to the next one:
library(data.table)
library(purrr)
library(magrittr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table$A <- accumulate2(
  table$A,
  seq(table$A),
  ~ (..1 + table$B[..3]) * table$B[..3 + 1],
  .init = table$A[1]
) %>%
  unlist() %>%
  extract(1:nrow(table))
table$C <- table$B * table$A
table
# A B C
# 1: 1 1 1
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200

Same column (different row) operations in R

I have a big database and I'm trying to create a new column from an existing one by taking the difference between elements in consecutive cells (same column, different rows):
existing_column   new_column
A                 A-B
B                 B-C
C                 C-D
D                 D-E
...               ...
Z                 Z-NULL
The way I'm doing it is to duplicate the existing column into a dummy one, remove its first element, add NULL as the last element, and then combine the existing column with the dummy one ... is there a better way? Thank you
exist <- c("A", "B", "C", "D", "E")
db <- data.frame(exist)
dummy <- exist[-1]
dummy[length(dummy) + 1] <- "NULL"
new_col <- paste(exist, "-", dummy)
new_col
db <- data.frame(exist, new_col)
db
Does this work?
library(dplyr)
df <- data.frame(existing_column = LETTERS)
df %>% mutate(new_column = paste(existing_column, lead(existing_column, default = 'NULL'), sep = '-'))
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NULL
Try the code below
transform(
df,
new_column = paste(existing_column, c(existing_column[-1], NA), sep = "-")
)
which gives
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NA
If you are working with numeric data that is just represented as characters in your example, you can use mutate() and lead():
library(dplyr)

df <- data.frame(old_col = sample(1:10))
df %>% mutate(new_col = old_col - lead(old_col, default = 0))
old_col new_col
1 10 4
2 6 -3
3 9 8
4 1 -1
5 2 -5
6 7 3
7 4 1
8 3 -5
9 8 3
10 5 5
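For completeness, the same numeric shift works in base R without packages; a small sketch equivalent to lead() with default = 0:
df$new_col <- df$old_col - c(df$old_col[-1], 0)
df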
In case a fast data.table version is needed:
dt[, new_column:=paste(exist, shift(exist, type="lead"), sep="-")]
Edit: turns out it isn't much faster:
library(dplyr)

df <- data.table(exist = rep(letters, 80000))
m <- microbenchmark::microbenchmark(
  a = df %>% mutate(new_column = paste(exist, lead(exist, default = 'NULL'), sep = '-')),
  b = transform(df, new_column = paste(exist, c(exist[-1], NA), sep = "-")),
  d = df[, new_column := paste(exist, shift(exist, type = "lead"), sep = "-")]
)
m
# Unit: milliseconds
#  expr      min       lq     mean   median       uq       max neval
#     a 292.2430 309.6150 342.0191 323.9778 361.0937  603.8449   100
#     b 349.4509 383.3391 475.0177 423.8864 472.0276 2136.2970   100
#     d 294.6786 302.8530 332.3989 315.6228 340.9642  641.8345   100

Update values of a column based on predefined thresholds

I have a data set as follows
Name Price
A 100
B 123
C 112
D 114
E 101
F 102
I need a way to update the value in the Price column whenever the price is within +3/-3 of any value in a given vector, replacing it with that vector value. The vector may contain any number of elements.
Vector = c(100,111)
Updated dataframe
Name Price
A 100
B 123
C 111
D 111
E 100
F 100
If the vector is
Vector = c(104,122)
then the updated dataframe needs to be
Name Price
A 100
B 122
C 112
D 114
E 104
F 104
df <- data.frame('Name' = LETTERS[1:6], 'Price' = c(100, 123, 112, 114, 101, 102))

# Note: naming this function transform masks base::transform for the rest of the session
transform <- function(value, conditionals){
  for (cond in conditionals) {
    if (abs(value - cond) < 4) {
      return(cond)
    }
  }
  return(value)
}
sapply(df$Price, transform, c(104,122))
This should work. It could probably be done in one line with apply, but I sometimes find that difficult to read, so this should be easier to follow.
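For what it's worth, a hedged vectorized sketch using outer(), assuming the same within-3 rule and first-match-wins behaviour as the loop above (with Vector = c(100, 111) it reproduces the expected output):
# which prices fall within 3 of which vector values
hits <- outer(df$Price, Vector, function(p, v) abs(p - v) <= 3)
# index of the first matching vector value per price (NA if none)
first_match <- apply(hits, 1, function(r) which(r)[1])
df$Price <- ifelse(is.na(first_match), df$Price, Vector[first_match])
df
#   Name Price
# 1    A   100
# 2    B   123
# 3    C   111
# 4    D   111
# 5    E   100
# 6    F   100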
Here's one approach
bound <- 3
upper_bound <- Vector + bound
lower_bound <- Vector - bound
vi <- Reduce("pmax", lapply(seq_along(Vector), function(i) i * (df$Price <= upper_bound[i] & df$Price >= lower_bound[i])))
vi
# [1] 1 0 2 2 1 1
vi_na <- replace(vi, vi == 0, NA)
vi_na
# [1] 1 NA 2 2 1 1
df <- dplyr::mutate(df, Price = ifelse(is.na(Vector[vi_na]), Price, Vector[vi_na]))
df
#   Name Price
# 1    A   100
# 2    B   123
# 3    C   111
# 4    D   111
# 5    E   100
# 6    F   100
Data
df <- read.table(text = "Name Price
A 100
B 123
C 112
D 114
E 101
F 102", header=TRUE)
Vector = c(100,111)

Conditional sampling without replacement

I am attempting to write a simulation that involves randomly re-assigning items to categories with some restrictions.
Let's say I have a collection of pebbles 1 to N distributed across buckets A through J:
set.seed(100)
df1 <- data.frame(pebble = 1:100,
                  bucket = sample(LETTERS[1:10], 100, T),
                  stringsAsFactors = F)
head(df1)
#> pebble bucket
#> 1 1 D
#> 2 2 C
#> 3 3 F
#> 4 4 A
#> 5 5 E
#> 6 6 E
I want to randomly re-assign pebbles to buckets. Without restrictions I could do it like so:
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
colSums(table(df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
Importantly this re-assigns pebbles while ensuring that each bucket retains the same number (because we are sampling without replacement).
However, I have a set of restrictions such that certain pebbles cannot be assigned to certain buckets. I encode the restrictions in df2:
df2 <- data.frame(pebble = sample(1:100, 10),
                  bucket = sample(LETTERS[1:10], 10, T),
                  stringsAsFactors = F)
df2
#> pebble bucket
#> 1 33 I
#> 2 39 I
#> 3 5 A
#> 4 36 C
#> 5 55 J
#> 6 66 A
#> 7 92 J
#> 8 95 H
#> 9 2 C
#> 10 49 I
The logic here is that pebbles 33 and 39 cannot be placed in bucket I, or pebble 5 in bucket A, etc. I would like to permute which pebbles are in which bucket subject to these restrictions.
So far, I've thought of tackling it in a loop as below, but this does not result in buckets retaining the same number of pebbles:
perms <- character(0)
cnt <- 1
for (p in df1$pebble) {
  perms[cnt] <- sample(df1$bucket[!df1$bucket %in% df2$bucket[df2$pebble == p]], 1)
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G H I J
#> 6 7 12 22 15 1 14 7 7 9
I then tried sampling positions, and then removing that position from the available buckets and the available remaining positions. This is also not working, and I suspect it is because I am sampling my way into branches of the tree that do not yield solutions.
set.seed(42)
perms <- character(0)
cnt <- 1
ids <- 1:nrow(df1)
bckts <- df1$bucket
for (p in df1$pebble) {
  id <- sample(ids[!bckts %in% df2$bucket[df2$pebble == p]], 1)
  perms[cnt] <- bckts[id]
  bckts <- bckts[-id]
  ids <- ids[ids != id]
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G J
#> 1 1 4 1 2 1 2 2
Any thoughts or advice much appreciated (and apologies for the length).
EDIT:
I foolishly forgot to clarify that I was previously solving this by resampling until I got a draw that didn't violate any of the conditions in df2, but I now have so many conditions that this makes my code take too long to run. I am still open to brute-forcing it if I could figure out a way to make that faster.
I have a solution (I managed to write it in base R, but the data.table version is easier to understand and write):
random.permutation.df2 <- data.frame(pebble = df1$pebble, bucket = rep(NA, length(df1$pebble)))
for (bucket in unique(df1$bucket)) {
  open <- is.na(random.permutation.df2$bucket) &
          !random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket]
  N <- sum(open)
  random.permutation.df2$bucket[open] <-
    sample(c(rep(bucket, sum(df1$bucket == bucket)), rep(NA, N - sum(df1$bucket == bucket))))
}
The idea is to sample the authorised pebbles for each bucket: those not forbidden by df2 and those not already assigned. You then sample a vector of the right length, choosing between NAs (placeholders for the buckets still to come) and the current bucket's value, and voilà.
The same is easier to read with data.table:
library(data.table)
random.permutation.df2 <- setDT(random.permutation.df2)
df2 <- setDT(df2)
for (bucketi in unique(df1$bucket)) {
  random.permutation.df2[is.na(bucket) & !pebble %in% df2[bucket == bucketi, pebble],
                         bucket := sample(c(rep(bucketi, sum(df1$bucket == bucketi)),
                                            rep(NA, .N - sum(df1$bucket == bucketi))))]
}
It satisfies both conditions; the bucket sizes are preserved:
> colSums(table(df1))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
> colSums(table(random.permutation.df2))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
To verify that there isn't any contradiction with df2:
> df2
pebble bucket
1: 37 D
2: 95 H
3: 90 C
4: 80 C
5: 31 D
6: 84 G
7: 76 I
8: 57 H
9: 7 E
10: 39 A
> random.permutation.df2[pebble %in% df2$pebble,.(pebble,bucket)]
pebble bucket
1: 7 D
2: 31 H
3: 37 J
4: 39 F
5: 57 B
6: 76 E
7: 80 F
8: 84 B
9: 90 H
10: 95 D
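Rather than eyeballing the two tables, a quick programmatic check is to merge on both columns and count violations (a one-line sketch):
nrow(merge(random.permutation.df2, df2, by = c("pebble", "bucket")))
# [1] 0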
Here is a brute-force approach where one simply retries until a valid solution is found:
set.seed(123)
df1 <- data.frame(pebble = 1:100,
                  bucket = sample(LETTERS[1:10], 100, T),
                  stringsAsFactors = F)
df2 <- data.frame(pebble = sample(1:100, 10),
                  bucket = sample(LETTERS[1:10], 10, T),
                  stringsAsFactors = F)
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
The first random permutation does not satisfy the condition, so try new ones:
merge(random.permutation.df1, df2)
#> pebble bucket
#> 1 60 J
while (TRUE) {
  random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
  if (nrow(merge(random.permutation.df1, df2)) == 0)
    break
}
New permutation matches the condition:
merge(random.permutation.df1, df2)
#> [1] pebble bucket
#> <0 rows> (or 0-length row.names)
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
colSums(table(df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
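Since the OP's edit notes that rejection sampling can take too long with many constraints, a hedged variant caps the number of attempts (max_tries is an arbitrary choice):
max_tries <- 10000L
found <- FALSE
for (i in seq_len(max_tries)) {
  random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
  if (nrow(merge(random.permutation.df1, df2)) == 0) { found <- TRUE; break }
}
if (!found) warning("no valid permutation found within max_tries draws")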

combine data.tables and sum the shared column

I have some large data sets and am trying out data.table to combine them while summing the shared column over matching rows. I know how to merge on matching rows in the LHS data.table using [, as shown below with a2 as the LHS table and a as the RHS table:
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = as.integer(rep(100, 10)))
a  <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2, "b")
setkey(a, "b")
a2
b c
1: a 100
2: b 100
3: c 100
4: d 100
5: e 100
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
a
b c
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
From the second answer to Merge data frames whilst summing common columns in R, I saw how columns could be summed over matching rows, like so:
setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
b c
1: a 101
2: b 102
3: c 103
4: d 104
5: e 105
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
However, I am trying to retain the rows that don't match as well.
Alternatively, I could use merge as shown below, but I would like to avoid making a new table with four columns before reducing it to two.
res <- merge(a, a2, by = "b", all = TRUE)   # res instead of c, which would mask base::c
res <- transform(res, value = rowSums(res[, 2:3], na.rm = TRUE))
res <- res[, c(1, 4)]
res
b value
1: a 102
2: b 104
3: c 106
4: d 108
5: e 110
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
11: k 100
12: l 100
13: m 100
14: n 100
15: o 100
This last table is what I would like to achieve. Thanks in advance.
merge is likely not very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:
rbindlist(list(a, a2))[, sum(c), b]
I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).
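By default the sum comes back in a column named V1; to get the output shaped like the desired table above, name the aggregate (a small variant):
rbindlist(list(a, a2))[, .(value = sum(c)), by = b]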
Compare some timings on larger datasets:
library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))
system.time(rbindlist(list(a, a2))[, sum(c), b])
# user system elapsed
# 0.83 0.05 0.87
system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
# user system elapsed
# 159.58 0.48 162.95
## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166
nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
