Use previously calculated row value in R (continued)

I have a data.table that looks like this:
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to calculate a new column G whose first value is the average of the first 20 rows of column B; every later value of G is then computed from the value of G in the row before it.
Say the average of the first 20 rows of column B is 105. The formula for the next row of G is DT$G[2] = DT$G[1] * 2, for the row after that DT$G[3] = DT$G[2] * 2, and so on; each row uses only the value immediately before it, never the first value again.
A B C G
1: 1 10 100 105
2: 2 20 200 210
3: 3 30 300 420
4: 4 40 400 840
5: 5 50 500 1680
...
20: 20 200 2000 55050240
Any ideas on how this could be done would be appreciated.

You can do this with a little arithmetic, since the column is just the seed value doubled cumulatively:
DT$G <- mean(DT$B[1:20])                      # seed: 105
DT$G <- DT$G * cumprod(rep(2, nrow(DT))) / 2  # 105 * (1, 2, 4, ...)
Or using data.table syntax, courtesy of @DavidArenburg:
DT[ , G := mean(B[1:20]) * cumprod(rep(2, .N)) / 2]
or from @Frank:
DT$G <- cumprod(c( mean(head(DT$B,20)), rep(2,nrow(DT)-1) ))
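Equivalently, since each row simply doubles the previous one, the nth value is the seed times 2^(n-1); the same arithmetic as a single expression:
DT[, G := mean(B[1:20]) * 2^(0:(.N - 1))]
# G[20] = 105 * 2^19 = 55050240, matching the expected output above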

If the recurrence were less regular, a small helper function with a loop is the general fallback:
mycalc <- function(x, n) {
  y <- numeric(n)
  y[1] <- mean(x)                       # seed with the average of the input values
  for (i in 2:n) y[i] <- 2 * y[i - 1]   # each row doubles the previous one
  y
}
DT[ , G := mycalc(B[1:20], .N)]
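A quick sanity check, assuming the 20-row DT from the question, that the loop and the vectorized one-liners agree (g1, g2, g3 are just scratch names):
g1 <- mean(DT$B[1:20]) * cumprod(rep(2, nrow(DT))) / 2
g2 <- cumprod(c(mean(head(DT$B, 20)), rep(2, nrow(DT) - 1)))
g3 <- mycalc(DT$B[1:20], nrow(DT))
all.equal(g1, g2)  # TRUE
all.equal(g1, g3)  # TRUE
head(g1, 3)        # 105 210 420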

How to use functions to do a recursive calculation in data.table/R?

I am new to programming and got stuck. I wanted to calculate the hourly temperature variation of an object throughout the year using some variables, which change every hour. The original data contains 60 columns and 8760 rows.
I got the desired output using a for loop, but the model takes a long time to run. I wonder if there is any way to replace the loop with functions, which, I suspect, could also speed up the calculation.
Here is a small reproducible example to show what I did.
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table
A B C
1: 1 1 10
2: 1 2 10
3: 1 3 10
4: 1 4 10
5: 1 5 10
The for loop:
for (j in 2:nrow(table)) {
  table$A[j] = (table$A[j-1] + table$B[j-1]) * table$B[j]
  table$C[j] = table$B[j] * table$A[j]
}
I got the output as I desired:
A B C
1: 1 1 10
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
but it took 15 minutes to run the whole program on my real data (not this small example!).
So I tried to use a function instead of the for loop.
I tried this:
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
library(dplyr)
myfun <- function(df){
  df = df %>% mutate(A = (lag(A) + lag(B)) * B,
                     C = B * A)
  return(df)
}
myfun(table)
But the output was
A B C
1 NA 1 NA
2 4 2 8
3 9 3 27
4 16 4 64
5 25 5 125
It seems that the function refers to the rows of the original table, not to the rows updated during the calculation. Is there any way to obtain the desired output using functions? This is my first R project; any help is very much appreciated. Thank you.
A much faster alternative using data.table. Note that the calculation of C can be separated from the calculation of A, so we can do less within the loop:
for (i in 2:nrow(table)) {
  # set() updates by reference, avoiding the overhead of [.data.table on each iteration
  set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
table[-1, C := A * B]
table
# A B C
# <num> <int> <num>
# 1: 1 1 10
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
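If runtime is the real concern, it is easy to time both loop styles at the 8760-row scale mentioned in the question. A rough sketch, assuming the microbenchmark package is installed; make_tbl, slow and fast are illustrative names, and absolute numbers will vary by machine:
library(data.table)
library(microbenchmark)

n <- 8760L
make_tbl <- function() data.table(A = 1, B = as.numeric(1:n), C = 10)

slow <- function(tbl) {
  # $<- on a data.table copies on every assignment
  for (j in 2:nrow(tbl)) {
    tbl$A[j] <- (tbl$A[j-1] + tbl$B[j-1]) * tbl$B[j]
    tbl$C[j] <- tbl$B[j] * tbl$A[j]
  }
  tbl
}

fast <- function(tbl) {
  # set() updates in place; C is filled vectorized afterwards
  for (i in 2:nrow(tbl)) {
    set(tbl, i = i, j = "A", value = (tbl$A[i-1L] + tbl$B[i-1L]) * tbl$B[i])
  }
  tbl[-1, C := A * B]
  tbl
}

# values overflow to Inf quickly at this size, which is harmless for timing
microbenchmark(slow(make_tbl()), fast(make_tbl()), times = 3L)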
You can try Reduce like below:
dt[
  ,
  A := Reduce(function(x, Y) (x + Y[2]) * Y[1],  # x = previous A, Y = c(B[i], B[i-1])
              asplit(embed(B, 2), 1),
              init = A[1],
              accumulate = TRUE
  )
][
  ,
  C := A * B
]
which updates dt as
> dt
A B C
1: 1 1 1
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
Data
dt <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
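To see why the Reduce call reproduces the recursion, look at what embed does; a small illustration (base R only):
B <- 1:5
embed(B, 2)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4
# each row Y is c(B[i], B[i-1]), so (x + Y[2]) * Y[1] is exactly
# (A[i-1] + B[i-1]) * B[i] at each step, and accumulate = TRUE
# keeps every intermediate result.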
Here's a solution using purrr::accumulate2 which lets you use the result of the previous computation as the input to the next one:
library(data.table)
library(purrr)
library(magrittr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table$A <- accumulate2(
  table$A,
  seq(table$A),
  ~ (..1 + table$B[..3]) * table$B[..3 + 1],
  .init = table$A[1]
) %>%
  unlist() %>%
  extract(1:nrow(table))
table$C <- table$B * table$A
table
# A B C
# 1: 1 1 1
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200

Update values of a column based on predefined thresholds

I have a data set as follows
Name Price
A 100
B 123
C 112
D 114
E 101
F 102
I need a way to update the value in the Price column: if a price is within +3/-3 of any value in a specified vector, it should be replaced by that vector value. The vector may contain any number of elements.
Vector = c(100,111)
Updated dataframe
Name Price
A 100
B 123
C 111
D 111
E 100
F 100
If the vector is
Vector = c(104,122)
then the updated dataframe needs to be
Name Price
A 100
B 122
C 112
D 114
E 104
F 104
df <- data.frame('Name' = LETTERS[1:6], 'Price'= c(100,123,112,114,101,102))
transform <- function(value, conditionals){  # note: this masks base::transform
  for(cond in conditionals){
    if(abs(value - cond) < 4){  # i.e. within +/- 3
      return(cond)
    }
  }
  return(value)
}
sapply(df$Price, transform, c(104,122))
This should work. It could probably be done in one line with apply, but I sometimes find that hard to read, so this version should be easier to follow.
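For what it's worth, that one-liner might look like this; a sketch of the same +/-3 matching, taking the first vector value that qualifies (Vector as defined in the question):
df$Price <- sapply(df$Price, function(p) {
  hit <- Vector[abs(p - Vector) <= 3]   # vector values within +/- 3 of this price
  if (length(hit) > 0) hit[1] else p    # snap to the first hit, else keep the price
})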
Here's one approach
bound <- 3
upper_bound <- Vector+bound
lower_bound <- Vector-bound
vi <- Reduce("pmax", lapply(seq_along(Vector), function(i)
  i * (df$Price <= upper_bound[i] & df$Price >= lower_bound[i])))
vi     # index of the matching vector element, 0 if none
# [1] 1 0 2 2 1 1
vi_na <- replace(vi, vi == 0, NA)
vi_na
# [1] 1 NA 2 2 1 1
df <- dplyr::mutate(df, Price = ifelse(is.na(Vector[vi_na]), Price, Vector[vi_na]))
df
#   Name Price
# 1    A   100
# 2    B   123
# 3    C   111
# 4    D   111
# 5    E   100
# 6    F   100
Data
df <- read.table(text = "Name Price
A 100
B 123
C 112
D 114
E 101
F 102", header=TRUE)
Vector = c(100,111)

R Function to write 3 calculated columns to a data.table

This may have already been answered, but I couldn't quite find the answer I am looking for. I am trying to write the output of a function that calculates 3 variables to a data.table.
Currently I am copying the function three times (with three different names), each copy returning a different variable. This takes much longer because the function runs three times. I understand there may be a better way to do it, using a list or some data.table idiom.
I would greatly appreciate any input to simplify this. Below is an example of how I am calling it, one variable at a time.
Example
fn_1 <- function(a, b, c, d){
  for (i in 1:b) { col_1[i] = calculation }
  for (i in 1:c) { col_2[i] = calculation }
  for (i in 1:d) { col_3[i] = calculation }
  return(col_1)
}
data[ ,column_1 := fn_1(a,b,c,d) ,by= .(e,f) ]
fn_2 <- function(a, b, c, d){
  for (i in 1:b) { col_1[i] = calculation }
  for (i in 1:c) { col_2[i] = calculation }
  for (i in 1:d) { col_3[i] = calculation }
  return(col_2)
}
data[ ,column_2 := fn_2(a,b,c,d) ,by= .(e,f) ]
The OP has tagged the question with data.table. docendo discimus's comment shows the direction to follow.
Create sample data
library(data.table) # CRAN version 1.10.4 used
n <- 10L
DT <- data.table(
a = 1:n, b = (n:1)^2, c = -(1:n), d = 2 * (1:n) - n/2,
e = rep(LETTERS[1:2], length.out = n),
f = rep(LETTERS[3:4], each = n/2, length.out = n))
DT
# a b c d e f
# 1: 1 100 -1 -3 A C
# 2: 2 81 -2 -1 B C
# 3: 3 64 -3 1 A C
# 4: 4 49 -4 3 B C
# 5: 5 36 -5 5 A C
# 6: 6 25 -6 7 B D
# 7: 7 16 -7 9 A D
# 8: 8 9 -8 11 B D
# 9: 9 4 -9 13 A D
#10: 10 1 -10 15 B D
Define function
fn <- function(p, q, r, s) {
  list(X1 = p + mean(q) + r + s,
       Y2 = p * q + r * s,
       Z3 = p * q - r * s)
}
The function takes 4 parameters and returns a list of 3 named vectors. Note that the computations inside the function don't need to use for loops in contrast to OP's approach.
Apply function to data.table
Note that the OP wants to group on columns e and f when the function is applied.
The first variant creates a new data.table. By default, the names of the list elements as defined in fn are used:
DT[, fn(a, b, c, d), .(e, f)]
# e f X1 Y2 Z3
# 1: A C 63.66667 103 97
# 2: A C 67.66667 189 195
# 3: A C 71.66667 155 205
# 4: B C 64.00000 164 160
# 5: B C 68.00000 184 208
# 6: B D 18.66667 108 192
# 7: B D 22.66667 -16 160
# 8: B D 26.66667 -140 160
# 9: A D 19.00000 49 175
#10: A D 23.00000 -81 153
The second variant updates DT by reference. The names of the new columns are explicitly stated.
DT[, c("x", "y", "z") := fn(a, b, c, d), .(e, f)]
DT
# a b c d e f x y z
# 1: 1 100 -1 -3 A C 63.66667 103 97
# 2: 2 81 -2 -1 B C 64.00000 164 160
# 3: 3 64 -3 1 A C 67.66667 189 195
# 4: 4 49 -4 3 B C 68.00000 184 208
# 5: 5 36 -5 5 A C 71.66667 155 205
# 6: 6 25 -6 7 B D 18.66667 108 192
# 7: 7 16 -7 9 A D 19.00000 49 175
# 8: 8 9 -8 11 B D 22.66667 -16 160
# 9: 9 4 -9 13 A D 23.00000 -81 153
#10: 10 1 -10 15 B D 26.66667 -140 160
You're in the second circle of hell (growing objects, in The R Inferno's terms). To solve the problem, pre-allocate what you want to add.
data <- data.table(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
Then, make a vectorized function to do the calculation, which returns the whole column to append.
calculation <- Vectorize(function(x) mean(c(x, 3)))
Write fn in terms of this new function, and return the whole block of columns to be added, then cbind it with data to add all the columns at once. It is extremely slow to do all the calculations every time and then only return one part.
fn <- function(b, c, d) {
  # pre-allocate the three result columns, then fill each one once
  toBeAdded <- data.table(matrix(nrow = nrow(data), ncol = 3))
  toBeAdded[, 1] <- calculation(b)
  toBeAdded[, 2] <- calculation(c)
  toBeAdded[, 3] <- calculation(d)
  toBeAdded
}
data <- cbind(data, fn(data[[1]], data[[2]], data[[3]]))
Answering my own question, based on inputs from @docendodiscimus and @ConCave, I solved it like this. I appreciate everyone's input!
fn_1 <- function(a, b, c, d){
  for (i in 1:b) { col_1[i] = calculation }
  for (i in 1:c) { col_2[i] = calculation }
  for (i in 1:d) { col_3[i] = calculation }
  df = data.table(col_1, col_2, col_3)
  return(df)
}
data[, c("column_1", "column_2", "column_3") := fn_1(a, b, c, d), by = .(e, f)]
Does it have to be a data.table? If not, then you can just use mutate in dplyr:
library(dplyr)
a <- c(1,2,2,1,2,3,4,2)
b <- c(3,3,2,3,5,4,3,2)
c <- c(9,9,8,7,8,9,8,7)
d <- c(0,1,1,0,1,1,0,1)
have <- data.frame(a, b, c, d)
want <- have %>%
  mutate(abc = a + b + c,
         db = d * b,
         aa = 2 * a)

Replace NA with a value that is row and column specific [duplicate]

This question already has answers here:
Fastest way to replace NAs in a large data.table
(10 answers)
Closed 5 years ago.
A lot comes together in this question. First of all, I would like to segment the data by column c. The subsets are given by the factor c: the levels are 1 to 4, so there are 4 distinct segments.
Next I have two columns. Column a and b.
I would like to replace the NAs with the maximum of the corresponding column within each segment. So, for example, the NA at row 2 in column a would become 30; (b,3) would be 80, (b,8) would be 50, and (a,5) would be 80.
I have created the code below that does the job, but now I need to make it automatic (like a for loop) for all segments and columns. How could I do this?
a <- c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40)
b <- c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90)
c <- c(1,1,1,2,2,2,2,2,3,3,3,4,4,4)
a b c
1: 10 80 1
2: NA 70 1
3: 30 NA 1
4: 40 50 2
5: NA 40 2
6: 60 30 2
7: 70 20 2
8: 80 NA 2
9: 90 0 3
10: 90 0 3
11: 80 10 3
12: 90 69 4
13: 10 40 4
14: 40 90 4
mytable <- data.table(a,b,c)
mytable[which(is.na(mytable[c == 1][,1, with = FALSE]) == TRUE),1] <- max(mytable[c==1,1], na.rm = TRUE)
Unfortunately, this try results in an error:
for(i in unique(mytable$c)){
  for(j in unique(c(1:2))){
    mytable[which(is.na(mytable[c == i][,j, with = FALSE]) == TRUE),j, with = FALSE] <- max(mytable[c==i][,j, with = FALSE], na.rm = TRUE)
  }
}
Error in [<-.data.table(*tmp*, which(is.na(mytable[c == i][, j, with = FALSE]) == :
unused argument (with = FALSE)
Surprisingly, this results in an error as well:
for(i in unique(mytable$c)){
  for(j in unique(c(1:2))){
    mytable[which(is.na(mytable[c == i][,j]) == TRUE),j] <- max(mytable[c==i,j], na.rm = TRUE)
  }
}
Error in [.data.table(mytable, c == i, j) :
j (the 2nd argument inside [...]) is a single symbol but column name 'j' is not found. Perhaps you intended DT[,..j] or DT[,j,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.
library("data.table")
mytable <- data.table(
a=c(10,NA,30,40,NA,60,70,80,90,90,80,90,10,40),
b=c(80,70,NA,50,40,30,20,NA,0,0,10,69, 40, 90),
c=c(1,1,1,2,2,2,2,2,3,3,3,4,4,4))
foo <- function(x) { x[is.na(x)] <- max(x, na.rm=TRUE); x }  # replace NAs by the group-wise column max
mytable[, .(A=foo(a), B=foo(b)), by=c]
result:
> mytable[, .(A=foo(a), B=foo(b)), by=c]
# c A B
# 1: 1 10 80
# 2: 1 30 70
# 3: 1 30 80
# 4: 2 40 50
# 5: 2 80 40
# 6: 2 60 30
# 7: 2 70 20
# 8: 2 80 50
# 9: 3 90 0
#10: 3 90 0
#11: 3 80 10
#12: 4 90 69
#13: 4 10 40
#14: 4 40 90
or for direct substitution of a and b:
mytable[, `:=`(a=foo(a), b=foo(b)), by=c] # or
mytable[, c("a", "b") := (lapply(.SD, foo)), by = c] # from #Sotos
or the safer variant (thanks to @Frank for the remark), which names the columns only once:
cols <- c("a", "b")
mytable[, (cols) := lapply(.SD, foo), by=c, .SDcols=cols]
Using data.table
library(data.table)
mytable[, a := ifelse(is.na(a), max(a, na.rm = TRUE), a), by = c]
mytable[, b := ifelse(is.na(b), max(b, na.rm = TRUE), b), by = c]
Or in a single command
mytable[, c("a", "b") := lapply(.SD, function(x) ifelse(is.na(x), max(x, na.rm = TRUE), x)), .SDcols = c("a", "b"), by = c]
Use ddply() from package plyr:
df <- data.frame(a, b, c = as.factor(c))
library(plyr)
df2 <- ddply(df, .(c), transform,
             a = ifelse(is.na(a), max(a, na.rm = TRUE), a),
             b = ifelse(is.na(b), max(b, na.rm = TRUE), b))
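plyr is superseded these days; for completeness, an equivalent dplyr sketch (assuming dplyr >= 1.0 for across):
library(dplyr)
df2 <- df %>%
  group_by(c) %>%
  mutate(across(c(a, b), ~ ifelse(is.na(.x), max(.x, na.rm = TRUE), .x))) %>%
  ungroup()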

What is the most efficient way to create a column of vectors in `data.table` when matching data from a second table?

What is the most efficient way to create a column of vectors in a data.table when we need to match elements from a second data.table?
For example, given the two data.tables below
> A_ids.DT
   name id
1:    A  1
2:    B  2
3:    C  3
4:    D  4
5:    F  6
6:    G  7
7:    H  8
8:    J 10
9:    K 11
> rec_data_table
   bid counts names_list
1: 301     21        C,E
2: 302     21          E
3: 303      5      H,E,G
4: 304     10        H,D
5: 305      3          E
6: 306      5          G
7: 307      6        B,C
I would like to create a new column in rec_data_table where each element is a vector of the ids from A_ids.DT corresponding to the names in names_list.
IMPORTANT: The order of each entry of names_list must be preserved in the new column, i.e. for row 3, (H, E, G), we should get c(8, NA, 7) (E has no entry in A_ids.DT, hence the NA).
The following line, which uses sapply, works, but I question its efficiency.
Are there better (i.e. quicker, more elegant) alternatives? (Note that the actual data has several hundred thousand rows.)
rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))]
bid counts names_list A_IDs.list
1: 301 21 C,E 3,NA
2: 302 21 E NA
3: 303 5 H,E,G 8,NA,7
4: 304 10 H,D 8,4
5: 305 3 E NA
6: 306 5 G 7
7: 307 6 B,C 2,3
#--------------------------------------------------#
# SAMPLE DATA #
library(data.table)
set.seed(101)
rows <- size <- 7
varyingLengths <- c(sample(1:3, rows, TRUE))
A <- lapply(varyingLengths, function(n) sample(LETTERS[1:8], n))
counts <- round(abs(rnorm(size)*12))
rec_data_table <- data.table(bid=300+(1:size), counts=counts, names_list=A, key="bid")
A_ids.DT <- data.table(name=LETTERS[c(1:4,6:8,10:11)], id=c(1:4,6:8,10:11), key="name")
Perhaps unpack the lists, then join the whole table, then repack?
# unpack: one row per name, remembering each name's position within its list
tmp <- setkey(rec_data_table[, list(names = names_list[[1]],
                                    orig.order = seq_along(names_list[[1]])),
                             by = list(bid, counts)], names)
# join on name to pick up the ids (missing names become NA)
tmp <- A_ids.DT[tmp]
setkey(tmp, orig.order)
# repack: collapse back to one row per bid, rebuilding the list-columns
tmp <- tmp[, list(names_list = list(name), A_IDs.list = list(id)),
           by = list(bid, counts)]
# Rearrange to sample output order
setkey(tmp, bid)
setcolorder(tmp, c("bid", "counts", "names_list", "A_IDs.list"))
### Output###
> tmp
# bid counts names_list A_IDs.list
# 1: 301 21 C,E 3,NA
# 2: 302 21 E NA
# 3: 303 5 H,E,G 8,NA,7
# 4: 304 10 H,D 8,4
# 5: 305 3 E NA
# 6: 306 5 G 7
# 7: 307 6 B,C 2,3
> identical(tmp, rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))])
# [1] TRUE
Timings
I increased the number of rows in rec_data_table to 1e5 and got the following timings.
Method presented in question:
> system.time(rec_data_table[, A_IDs.list := sapply(names_list, function(n) c(A_ids.DT[n, id]$id))])
user system elapsed
196.89 0.04 197.81
Method presented here:
> system.time( {
+ tmp <- setkey(rec_data_ta .... [TRUNCATED]
user system elapsed
0.95 0.00 0.95
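As an aside, a plain named-vector lookup sidesteps the join machinery entirely and should also scale well; a sketch (not benchmarked here):
ids <- setNames(A_ids.DT$id, A_ids.DT$name)
rec_data_table[, A_IDs.list := lapply(names_list, function(n) unname(ids[n]))]
# unmatched names (e.g. "E") come back as NA, preserving the original order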
