I have a list of data.tables and I want to apply a function to each data.table. I have things set up to use := inside an lapply call. Everything works fine and my outputs are updated by reference, but my function also prints to the console. This is part of a much larger project, and printing this step to the console is not ideal.
How do I run this 'silently' without printing? Is there a better way to structure the workflow / code?
library(data.table)
dt1 <- data.table(a = rnorm(10),
b = rnorm(10))
dt2 <- data.table(a = rnorm(10),
b = rnorm(10))
dts <- list(dt1, dt2)
lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
})
dts
Each table in dts now has a c column, but the outputs were displayed in the console. This code chunk is called from another function.
You can use a for loop
for(dt in dts) dt[, ':='(c = a + b)]
You can assign the lapply call which will suppress the output
dts <- lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
})
We can use walk, which will not print anything to the console
library(purrr)
walk(dts, ~ .x[, `:=`(c = a + b)])
dts
#[[1]]
# a b c
# 1: -0.1069952 0.1115983 0.004603111
# 2: 0.3228771 -0.8400846 -0.517207530
# 3: -1.6072728 -0.2727947 -1.880067477
# 4: 0.1715614 -0.3864995 -0.214938065
# 5: 1.8233350 -1.0786569 0.744678084
# 6: 0.2366026 -0.6166318 -0.380029253
# 7: 0.2373992 0.2251559 0.462555116
# 8: -0.1075611 -1.0418174 -1.149378504
# 9: 1.6742520 -0.5635583 1.110693774
#10: 2.4733842 2.1091365 4.582520731
#[[2]]
# a b c
# 1: -0.8332617 1.67201117 0.83874947
# 2: 1.3688393 1.12168046 2.49051974
# 3: 1.0208642 -1.18482073 -0.16395650
# 4: 0.6784662 2.15979872 2.83826493
# 5: -0.4351644 -0.04629453 -0.48145894
# 6: 1.3133550 -1.03423308 0.27912197
# 7: 1.0143396 -0.84787780 0.16646185
# 8: -0.9622108 0.92338456 -0.03882627
# 9: -0.3106202 1.08886031 0.77824008
#10: 0.7602507 -0.08996701 0.67028370
Or wrap with invisible along with lapply
invisible(lapply(dts, function(dt) {
dt[, ':=' (c = a + b)]
}))
Using set:
for (i in seq_along(dts)) set(dts[[i]], j = "c", value = dts[[i]]$a + dts[[i]]$b)
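Putting the suggestions above together, here is a minimal sketch of a helper that updates every table in the list by reference and returns the list invisibly, so nothing is auto-printed at the call site. The name add_sum_col is made up for illustration and is not part of data.table.

```r
library(data.table)

dt1 <- data.table(a = rnorm(10), b = rnorm(10))
dt2 <- data.table(a = rnorm(10), b = rnorm(10))
dts <- list(dt1, dt2)

# Hypothetical helper: adds column c = a + b to each table by reference
add_sum_col <- function(tables) {
  for (i in seq_along(tables)) {
    set(tables[[i]], j = "c", value = tables[[i]]$a + tables[[i]]$b)
  }
  # invisible() keeps the returned list from being auto-printed
  invisible(tables)
}

add_sum_col(dts)   # no console output; dts[[1]] and dts[[2]] now have column c
```

Returning the list invisibly keeps the helper usable in a pipeline while staying quiet when it is called only for its side effect.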
Related
I would like to modify a data.table within a function. If I use the := feature within the function, the result is only printed for the second call.
Look at the following illustration:
library(data.table)
mydt <- data.table(x = 1:3, y = 5:7)
myfunction <- function(dt) {
dt[, z := y - x]
dt
}
When I call only the function, the table is not printed (which is the standard behaviour). However, if I save the returned data.table into a new object, it is not printed at the first call, only at the second one.
myfunction(mydt) # nothing is printed
result <- myfunction(mydt)
result # nothing is printed
result # for the second time, the result is printed
mydt
# x y z
# 1: 1 5 4
# 2: 2 6 4
# 3: 3 7 4
Could you explain why this happens and how to prevent it?
As David Arenburg mentions in a comment, the answer can be found here. There was a bug fixed in the version 1.9.6 but the fix introduced this downside.
One should call DT[] at the end of the function to prevent this behaviour.
myfunction <- function(dt) {
dt[, z := y - x][]
}
myfunction(mydt) # prints immediately
# x y z
# 1: 1 5 4
# 2: 2 6 4
# 3: 3 7 4
This is described in data.table FAQ 2.23:
Why do I have to type DT sometimes twice after using := to print the result to console?
This is an unfortunate downside to get #869 to work. If a := is used inside a function with no DT[] before the end of the function, then the next time DT is typed at the prompt, nothing will be printed. A repeated DT will print. To avoid this: include a DT[] after the last := in your function. If that is not possible (e.g., it's not a function you can change) then print(DT) and DT[] at the prompt are guaranteed to print. As before, adding an extra [] on the end of := query is a recommended idiom to update and then print; e.g. > DT[, foo := 3L][].
I'm sorry if I'm not supposed to post something here that's not an
answer, but my post is too long for a comment.
I'd like to point out that janosdivenyi's solution of adding a
trailing [] to dt does not always give the expected results (even
when using data.table 1.9.6 or 1.10.4), as I show below.
The examples below show that if dt is the last line in the function
one gets the desired behaviour without the presence of the
trailing [], but if dt is not on the last line in the function then
a trailing [] is needed to get the desired behaviour.
The first example shows that with no trailing [] on dt we get the
expected behaviour when dt is on the last line of the function
mydt <- data.table(x = 1:3, y = 5:7)
myfunction <- function(dt) {
df <- 1
dt[, z := y - x]
}
myfunction(mydt) # Nothing printed as expected
mydt # Content printed as desired
## x y z
## 1: 1 5 4
## 2: 2 6 4
## 3: 3 7 4
Adding a trailing [] on dt gives unexpected behaviour
mydt <- data.table(x = 1:3, y = 5:7)
myfunction <- function(dt) {
df <- 1
dt[, z := y - x][]
}
myfunction(mydt) # Content printed unexpectedly
## x y z
## 1: 1 5 4
## 2: 2 6 4
## 3: 3 7 4
mydt # Content printed as desired
## x y z
## 1: 1 5 4
## 2: 2 6 4
## 3: 3 7 4
Moving df <- 1 to after the dt with no trailing [] gives unexpected
behaviour
mydt <- data.table(x = 1:3, y = 5:7)
myfunction <- function(dt) {
dt[, z := y - x]
df <- 1
}
myfunction(mydt) # Nothing printed as expected
mydt # Nothing printed unexpectedly
Moving df <- 1 after the dt with a trailing [] gives the expected
behaviour
mydt <- data.table(x = 1:3, y = 5:7)
myfunction <- function(dt) {
dt[, z := y - x][]
df <- 1
}
myfunction(mydt) # Nothing printed as expected
mydt # Content printed as desired
## x y z
## 1: 1 5 4
## 2: 2 6 4
## 3: 3 7 4
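One way to get both behaviours at once (no output when the function is called, normal printing afterwards) is to combine the trailing [] with invisible(). This is only a sketch: the non-printing of the call follows from invisible(), while the claim that mydt then prints on the first attempt is an assumption based on the FAQ text and the last case above, and may depend on the data.table version.

```r
library(data.table)

mydt <- data.table(x = 1:3, y = 5:7)

myfunction <- function(dt) {
  dt[, z := y - x]   # add z by reference
  # dt[] follows the FAQ advice of a DT[] after the last :=;
  # invisible() stops the return value from being auto-printed
  invisible(dt[])
}

myfunction(mydt)   # intended: nothing printed here
mydt               # intended: printed on the first attempt
```

If the caller does want the table shown, print(myfunction(mydt)) or myfunction(mydt)[] still works.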
How can I loop over certain variables in order to standardize them? I am trying to set up the code but it's not working. My idea was to use assign or eval, but those do not seem to work. Below is a reproducible example.
if (!require('data.table')) {install.packages('data.table'); library('data.table')}
a <- seq(0,10,1)
b <- seq(99,100,0.1)
dt <- data.table(a,b)
# Expected result
dt[,z_a:= ((a-mean(a,na.rm=TRUE))/sd(a,na.rm=TRUE)) ]
dt[,z_b:= ((b-mean(b,na.rm=TRUE))/sd(b,na.rm=TRUE)) ]
# Loop not working
stdvars <- c(a,b)
for (v in stdvars) {
dt[z_v:= ((v-mean(v,na.rm=TRUE))/sd(v,na.rm=TRUE)) ]
}
dt
I would advise against using explicit loops when working with data.table, as its internal functionality is many times more efficient. In particular, you can define a function which you call through lapply over a specified subset (.SD):
standardise = function(x){(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE)} # Define a standardising function
oldcols = c('a', 'b') # Name of old columns
newcols = paste0('z_', oldcols) # Name of new columns ('z_a' and 'z_b')
dt[, (newcols) := lapply(.SD, standardise), .SDcols = oldcols]
Output:
> dt
a b z_a z_b
1: 0 99.0 -1.5075567 -1.5075567
2: 1 99.1 -1.2060454 -1.2060454
3: 2 99.2 -0.9045340 -0.9045340
4: 3 99.3 -0.6030227 -0.6030227
5: 4 99.4 -0.3015113 -0.3015113
6: 5 99.5 0.0000000 0.0000000
7: 6 99.6 0.3015113 0.3015113
8: 7 99.7 0.6030227 0.6030227
9: 8 99.8 0.9045340 0.9045340
10: 9 99.9 1.2060454 1.2060454
11: 10 100.0 1.5075567 1.5075567
.SD means that you are calling a lapply across a Subset of the Data, defined by the .SDcols argument. In this case, we define newcols as the application of standardise function across the subset oldcols.
There is a built-in function, scale, that standardizes variables.
Missing values are ignored when the mean and standard deviation are computed.
So it is more direct to proceed as follows:
cols <- c("a", "b")
dt[, paste0("z_", cols) := lapply(.SD, scale), .SDcols = cols]
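One detail to keep in mind with this approach: scale() returns a one-column matrix that carries "scaled:center" and "scaled:scale" attributes rather than a plain vector. A small variation, sketched below with the same example data, wraps the result in as.numeric() so the new columns are ordinary numeric vectors. Whether this matters depends on what is done with the columns afterwards.

```r
library(data.table)

dt <- data.table(a = seq(0, 10, 1), b = seq(99, 100, 0.1))
cols <- c("a", "b")
newcols <- paste0("z_", cols)

# as.numeric() drops the matrix dimensions and the scaling attributes
dt[, (newcols) := lapply(.SD, function(x) as.numeric(scale(x))), .SDcols = cols]

str(dt)
```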
An option is to use non-standard evaluation:
for (v in c("a", "b")) {
eval(substitute(dt[, paste0("z_", v) := (V - mean(V, na.rm=TRUE)) / sd(V, na.rm=TRUE)],
list(V=as.name(v))))
}
dt
Or putting it in a function:
f <- function(DT, v) {
lhs <- paste0("z_", as.list(match.call())$v)
eval(substitute(
DT[, (lhs) := (v - mean(v, na.rm=TRUE)) / sd(v, na.rm=TRUE)]))
}
f(dt, a)
f(dt, b)
dt
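If an explicit loop over column names is preferred, a sketch using set() avoids eval and substitute altogether. The helper name standardise is taken from the earlier answer; everything else is the example data from the question.

```r
library(data.table)

dt <- data.table(a = seq(0, 10, 1), b = seq(99, 100, 0.1))

standardise <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Loop over column names as strings; set() adds each z_ column by reference
for (v in c("a", "b")) {
  set(dt, j = paste0("z_", v), value = standardise(dt[[v]]))
}

dt
```

Because the loop variable is a character string, dt[[v]] fetches the column and paste0() builds the new name, so no non-standard evaluation is needed.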
I have a data.table of data and a data.table of fitted coefficients. I want to calculate the fitted value for each row.
dt = data.table(a = rep(c("x","y"), each = 5), b = rnorm(10), c = rnorm(10), d = rnorm(10))
coefs = data.table(a = c("x","y"), b = c(0, 1), d = c(2,3))
dt
# a b c d
# 1: x -0.25174915 -0.2130797 -0.67909764
# 2: x -0.35569766 0.6014930 0.35201386
# 3: x -0.31600957 0.4398968 -1.15475814
# 4: x -0.54113762 -2.3497952 0.64503654
# 5: x 0.11227873 0.0233775 -0.96891456
# 6: y 1.24077566 -1.2843439 1.98883516
# 7: y -0.23819626 0.9950835 -0.17279980
# 8: y 1.49353589 0.3067897 -0.02592004
# 9: y 0.01033722 -0.5967766 -0.28536224
#10: y 0.69882444 0.8702424 1.24131062
coefs # NB no "c" column
# a b d
#1: x 0 2
#2: y 1 3
For each a=="x" row in dt, I want 0*b+2*d; and for each a=="y" row in dt, I want 1*b+3*d.
Is there a data.table way to do this without hardcoding the column names? I'm happy to put the column names in a variable cols = colnames(coefs)[-1].
It's easy to loop over groups and rbind together, so if the grouping is causing trouble, please ignore that part.
Join the data.tables:
dt[coefs, res := b * i.b + d * i.d, on = "a"]
# a b c d res
#1: x 0.09901786 -0.362080111 -0.5108862 -1.0217723
#2: x -0.16128422 0.169655945 0.3199648 0.6399295
#3: x -0.79648896 -0.502279345 1.3828633 2.7657266
#4: x -0.26121421 0.480548972 -1.1559392 -2.3118783
#5: x 0.54085591 -0.601323442 1.3833795 2.7667590
#6: y 0.83662761 0.607666970 0.6320762 2.7328562
#7: y -1.92510391 -0.050515610 -0.3176544 -2.8780671
#8: y 1.65639926 -0.167090105 0.6830158 3.7054466
#9: y 1.48772354 -0.349713539 -1.2736467 -2.3332166
#10: y 1.49065993 0.008198885 -0.1923361 0.9136516
Usually you would use the matrix product here, but that would mean that you had to coerce the respective subset to a matrix. That would result in a copy being made and since data.tables are mainly used for larger data, you want to avoid copies.
If you need dynamic column names, the simplest solution that comes to mind is actually an eval/parse construct:
cols = colnames(coefs)[-1]
expr <- parse(text = paste(paste(cols, paste0("i.", cols), sep = "*"), collapse = "+"))
#expression(b*i.b+d*i.d)
dt[coefs, res := eval(expr), on = "a"]
Maybe someone else can suggest a better solution.
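As one possible alternative that avoids parse(), the coefficient rows can be looked up with match() and the products summed with Map() and Reduce(). This is a sketch rather than a benchmarked recommendation: the intermediate object w is a name introduced here, and it holds a copy of the coefficient columns expanded to the length of dt, so it costs some memory.

```r
library(data.table)

set.seed(1)
dt <- data.table(a = rep(c("x", "y"), each = 5),
                 b = rnorm(10), c = rnorm(10), d = rnorm(10))
coefs <- data.table(a = c("x", "y"), b = c(0, 1), d = c(2, 3))

cols <- setdiff(names(coefs), "a")   # "b" and "d"

# One row of coefficients per row of dt, matched on a
w <- coefs[match(dt$a, coefs$a), cols, with = FALSE]

# Multiply each data column by its coefficient column, then sum across columns
dt[, res := Reduce(`+`, Map(`*`, .SD, w)), .SDcols = cols]

dt
```

Both .SD and w list their columns in the order given by cols, so the pairing is by position and stays correct as long as the same cols vector is used for both.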
Here is a solution using matrix multiplication:
dt[, res := as.matrix(.SD) %*% unlist(coefs[a == .BY, .SD, .SDcols = cols]),
by = "a", .SDcols = cols]
Of course this makes copies, which is potentially less efficient than the eval solution.
I found out that data.tables whose columns are all numeric support arithmetic operations (+, -, *, /), but columns are matched by position, not by name.
> coefs
a b d
1: x 0 2
2: y 1 3
> coefs[, .(b,d)] * coefs[, .(b,d)]
b d
1: 0 4
2: 1 9
> coefs[, .(b,d)] * coefs[, .(d,b)]
b d
1: 0 0
2: 3 3
so a solution based on this
> cols = colnames(coefs)[-1]
> zz = rowSums(coefs[dt[,.(a)], .SD, on = 'a', .SDcols = cols] * dt[, .SD, .SDcols = cols])
> dt[, newcol := zz]
Another alternative (but slower) approach is:
dt$res <- unsplit(Map(function(x,y){x$b*y$b + x$d*y$d}, split(dt, dt$a=="x"),
split(coefs,coefs$a=="x")),dt$a=="x")
dt
a b c d res
1: x 0.47859729 1.3479271 0.5691897 1.1383794
2: x 0.28491505 -0.3291934 1.8621365 3.7242730
3: x -1.43894695 1.5555413 0.3685772 0.7371544
4: x 0.04360066 0.1358920 0.5240700 1.0481400
5: x -1.39897890 -0.0175886 -0.6876451 -1.3752901
6: y -0.60952146 1.2331907 -0.3582176 -1.6841742
7: y 0.31777772 1.4090295 -0.4053615 -0.8983067
8: y 0.42758431 -0.3746061 2.1208417 6.7901094
9: y -0.60701063 -0.9232092 1.9386482 5.2089341
10: y -1.52042316 -0.8871454 -0.9314232 -4.3146927
This same code would work in base R as well if your data was already data.frames.
My aggregation needs vary among columns / data.frames. I would like to pass the "list" argument to the data.table dynamically.
As a minimal example:
require(data.table)
type <- c(rep("hello", 3), rep("bye", 3), rep("ok",3))
a <- (rep(1:3, 3))
b <- runif(9)
c <- runif(9)
df <- data.frame(cbind(type, a, b, c), stringsAsFactors=F)
DT <-data.table(df)
This call:
DT[, list(suma = sum(as.numeric(a)), meanb = mean(as.numeric(b)), minc = min(as.numeric(c))), by= type]
will have result similar to this:
type suma meanb minc
1: hello 6 0.1332210 0.4265579
2: bye 6 0.5680839 0.2993667
3: ok 6 0.5694532 0.2069026
Future data.frames will have more columns that I will want to summarize differently. But for the sake of working with this small example: is there a way to pass the list programmatically?
I naïvely tried:
# create a different list
mylist <- "list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))"
# new call
DT[, mylist, by=type]
With the following result, which is not what I wanted:
1: hello
2: bye
3: ok
mylist
1: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
2: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
3: list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
Any hints appreciated! Best regards!
PS sorry about these as.numeric(), I could not quite figure out why, but I needed them for the example to run.
Minor edit inserted columns / before data.frame in initial sentence to clarify my needs.
This is explained in FAQ 1.6. What you are looking for is quote and eval,
something like
mycall <- quote(list(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c))))
DT[, eval(mycall), by = type]
After a bit of head-banging, here is a very ugly way of constructing the call for ddply using .()
myplyrcall <- .(lengtha = length(as.numeric(a)), maxb = max(as.numeric(b)), meanc = mean(as.numeric(c)))
do.call(ddply,c(.data = quote(DF), .variables = 'type',.fun = quote(summarise),myplyrcall))
You could also use as.quoted which has an as.quoted.character method to construct using paste0
myplc <-as.quoted(c("lengtha" = "length(as.numeric(a))", "maxb" = "max(as.numeric(b))", "meanc" = "mean(as.numeric(c))"))
This can be used with data.table as well!
dtcall <- as.quoted(mylist)[[1]]
DT[,eval(dtcall), by = type]
data.table all the way.
Another way is to use .SDcols to group together the columns on which you'd like to perform the same operation. Let's say that you require columns a, d, e to be summed by type, whereas b, g should have their mean taken and c, f their median. Then:
# constructing an example data.table:
set.seed(45)
dt <- data.table(type=rep(c("hello","bye","ok"), each=3), a=sample(9),
b = rnorm(9), c=runif(9), d=sample(9), e=sample(9),
f = runif(9), g=rnorm(9))
# type a b c d e f g
# 1: hello 6 -2.5566166 0.7485015 9 6 0.5661358 -2.2066521
# 2: hello 3 1.1773119 0.6559926 3 3 0.4586280 -0.8376586
# 3: hello 2 -0.1015588 0.2164430 1 7 0.9299597 1.7216593
# 4: bye 8 -0.2260640 0.3924327 8 2 0.1271187 0.4360063
# 5: bye 7 -1.0720503 0.3256450 7 8 0.5774691 0.7571990
# 6: bye 5 -0.7131021 0.4855804 6 9 0.2687791 1.5398858
# 7: ok 1 -0.4680549 0.8476840 2 4 0.5633317 1.5393945
# 8: ok 4 0.4183264 0.4402595 4 1 0.7592801 2.1829996
# 9: ok 9 -1.4817436 0.5080116 5 5 0.2357030 -0.9953758
# 1) set key
setkey(dt, "type")
# 2) group col-ids by similar operations
id1 <- which(names(dt) %in% c("a", "d", "e"))
id2 <- which(names(dt) %in% c("b","g"))
id3 <- which(names(dt) %in% c("c","f"))
# 3) now use these ids in with .SDcols parameter
dt1 <- dt[, lapply(.SD, sum), by="type", .SDcols=id1]
dt2 <- dt[, lapply(.SD, mean), by="type", .SDcols=id2]
dt3 <- dt[, lapply(.SD, median), by="type", .SDcols=id3]
# 4) merge them.
dt1[dt2[dt3]]
# type a d e b g c f
# 1: bye 20 21 19 -0.6704055 0.9110304 0.3924327 0.2687791
# 2: hello 11 13 16 -0.4936211 -0.4408838 0.6559926 0.5661358
# 3: ok 14 11 10 -0.5104907 0.9090061 0.5080116 0.5633317
If/when you have many columns, writing out a list like the one you have might be cumbersome.
Another method (supporting the use of paste or paste0 to build the expression):
expr <- parse(text=mylist)
DT[, eval( expr ), by=type]
#-------
type lengtha maxb meanc
1: hello 3 0.8265407 0.5244094
2: bye 3 0.4955301 0.6289475
3: ok 3 0.9527455 0.5600915
I find it worrisome that apparently eval is part of the answer. From your question it is not clear to me whether and why you really want to do what you claim to want. Thus I demonstrate here that you can also use a function:
fun <- function(a,b,c) {
list(lengtha = length(as.numeric(a)),
maxb = max(as.numeric(b)),
meanc = mean(as.numeric(c)))
}
DT[, fun(a,b,c), by=type]
type lengtha maxb meanc
1: hello 3 0.8792184 0.3745643
2: bye 3 0.8718397 0.4519999
3: ok 3 0.8900764 0.4511536
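Along the same lines, the aggregation can be described as data instead of code: a named list that maps each column to the function to apply to it. The sketch below assumes one function per column and one output name per function; spec and out_names are names introduced here for illustration. It also builds DT directly with data.table(), which keeps a, b and c numeric, so the as.numeric() calls from the question are not needed (cbind() had coerced everything to character).

```r
library(data.table)

set.seed(42)
DT <- data.table(type = rep(c("hello", "bye", "ok"), each = 3),
                 a = rep(1:3, 3), b = runif(9), c = runif(9))

# Hypothetical specification: column name -> summary function
spec <- list(a = length, b = max, c = mean)
out_names <- c("lengtha", "maxb", "meanc")

# Apply each function to its column within every group
res <- DT[, setNames(Map(function(f, col) f(col), spec, .SD), out_names),
          by = type, .SDcols = names(spec)]

res
```

A different summary then only requires a different spec and out_names, with no quoting or evaluation of strings.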