Preface:
I have a column in a data.table of difftime values with units set to days. I am trying to create another data.table summarizing the values with
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
When printing the new data.table, I see values such as
1.925988e+00 days
1.143287e+00 days
1.453975e+01 days
I would like to limit the decimal place values for this column only (i.e. not setting options() unless I can do this specifically for difftime values this way). When I try to do this using the method above, modified, e.g.
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = Group]
I am left with NA values, with both the base round() and format() functions returning the warning:
In mean(DiffTime) : argument is not numeric or logical.
Oddly enough, if I perform the same operation on a numeric field, it runs with no problems. Also, if I split the operation into two separate lines of code, I can accomplish what I am looking to do:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
dt2[, AvgTime := round(AvgTime, 2)]
Reproducible Example:
library(data.table)
set.seed(1)
dt <- data.table(
  Date1 = sample(seq(as.Date('2017/10/01'), as.Date('2017/10/31'), by = "days"),
                 24, replace = FALSE) + abs(rnorm(24)) / 10,
  Date2 = sample(seq(as.Date('2017/10/01'), as.Date('2017/10/31'), by = "days"),
                 24, replace = FALSE) + abs(rnorm(24)) / 10,
  Num1 = abs(rnorm(24)) * 10,
  Group = rep(LETTERS[1:4], each = 6)
)
dt[, DiffTime := abs(difftime(Date1, Date2, units = 'days'))]
# Warnings/NA:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = .(Group)]
# Works when numeric/not difftime:
class(dt$Num1) # "numeric"
dt2 <- dt[, .(AvgNum = round(mean(Num1), 2)), by = .(Group)]
# Works, but takes an additional step:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = .(Group)]
dt2[, AvgTime := round(AvgTime, 2)]
# Works with base::mean:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(base::mean(DiffTime), 2)), by = .(Group)]
Question:
Why am I not able to complete this conversion (rounding of the mean) in one step when the class is difftime? Am I missing something in my execution, or is this some sort of bug where data.table can't properly handle difftime?
Issue filed on GitHub.
Update: the issue appears to be resolved after updating from data.table version 1.10.4 to 1.12.8.
This was fixed by #3567 on 2019/05/15 and released in data.table version 1.12.4 on 2019/10/03.
This might be a little late but if you really want it to work you can do:
as.numeric(round(as.difftime(difftime(DATE1, DATE2)), 0))
I recently ran into the same problem using data.table_1.11.8. One quick workaround is to use base::mean instead of mean.
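Another hedged one-step option (an addition, not from either answer): convert the difftime to numeric before aggregating, which sidesteps the optimized-mean path entirely. This sketch uses hypothetical stand-in data whose columns mirror the question's, and assumes the units are already days:

```r
library(data.table)
set.seed(1)
# hypothetical stand-in data; Group and DiffTime mirror the question's columns
dt <- data.table(Group = rep(LETTERS[1:4], each = 6),
                 DiffTime = as.difftime(abs(rnorm(24)) * 5, units = "days"))
# as.numeric() drops the difftime class up front, so mean() returns a plain
# numeric and round() works in a single step
dt2 <- dt[, .(AvgTime = round(mean(as.numeric(DiffTime)), 2)), by = Group]
```

The trade-off is that the result loses its "days" label, so this only makes sense if you track the units yourself.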
I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by = 'day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
columns <- c("A", "B", "C")
for (x in columns) {
  # This doesn't work.
  # target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm = TRUE), by=year(date)]
  # This doesn't work.
  # target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm = TRUE), by=year(date)]
  # THIS WORKS! But it is tedious writing "get(x)" every time.
  target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm = TRUE), by=year(date)][, V1]
  set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things which could possibly be improved:
How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x- mean(x))/sd(x)),
year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
I think Ronak's answer is superior and preferable; I'm just writing this to demonstrate that a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[ , by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm = TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy and, in my experience, faster.
I replaced [ , V1] with $V1 -- the former incurs the overhead of [.data.table.
You might also like to know that the base function scale will do the centering and normalizing steps more concisely (if slightly inefficiently, being a bit too general).
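For completeness, here is a sketch of how scale could slot into Ronak's .SDcols approach. It wraps scale in as.vector because scale returns a one-column matrix with attributes, which you usually don't want stored in a data.table column:

```r
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by = 'day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
# scale() centres and divides by sd; as.vector() drops the matrix wrapper
DT[, (newcolumns) := lapply(.SD, function(x) as.vector(scale(x))),
   by = year(date), .SDcols = columns]
```

The results should match the manual (x - mean(x))/sd(x) version, at the cost of a little generality-related overhead inside scale.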
This may have been asked before, and I have looked through Reference semantics, but I can't seem to find the answer. SO also suggested revising my title, so I will be fine if someone posts a link to the answer!
I have an MWE below. I am trying to aggregate column val by the day of the month. From my understanding, in SCENARIO 1 below in the code, since I am not assigning the values of lapply to any new column through :=, the data.table is just printed.
However, in SCENARIO 2, when I assign new column variables by reference using := the new columns are created (with the correct values) but the value is repeated for every hour of the day, when I want just the daily values.
SCENARIO 3 also gives the desired result, but requires the creation of a new data.table.
I also wouldn't think set is the tool here, because it assigns by row index, and I need to aggregate over groups of rows.
Thanks for any help,
library(data.table)
library(magrittr)
set.seed(123)
# create data.table to group by
dt <- data.table(year = rep(2018, times = 24 * 31),
                 month = rep(1, times = 24 * 31),
                 day = rep(1:31, each = 24),
                 hour = rep(0:23, times = 31)) %>%
  .[, val := sample(100, size = .N, replace = TRUE)]
# SCENARIO 1
# computes the desired summary but only prints it; doesn't modify dt by reference (because it is missing `:=`)
dt[, lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
# SCENARIO 2
# creates the desired val column, but the group sum is duplicated for every row (hour) of the original data.table
dt[, val := lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
# SCENARIO 3
# this also works, but requires creating a new data.table
new_dt <- dt[, lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
I don't see any problem with creating a new data.table object; you can assign it to the same name to overwrite the original.
dt <- dt[, lapply(.SD, sum), .SDcols = "val", by = .(year, month, day)]
Note that you cannot change the number of rows of a data.table by reference, so a reassignment such as dt <- unique(dt) is required, per the discussion in this feature request: https://github.com/Rdatatable/data.table/issues/635.
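If you want to stay by-reference as far as possible, one hedged alternative (the daily_val column name is my own invention, not from the question) is to add the group sum by reference first and only reassign at the final collapse:

```r
library(data.table)
set.seed(123)
# same shape as the question's data: one row per hour of January 2018
dt <- data.table(year = rep(2018, times = 24 * 31),
                 month = rep(1, times = 24 * 31),
                 day = rep(1:31, each = 24),
                 hour = rep(0:23, times = 31))
dt[, val := sample(100, size = .N, replace = TRUE)]
# add the daily sum by reference (repeated across the hours of each day)...
dt[, daily_val := sum(val), by = .(year, month, day)]
# ...then one reassignment to collapse to one row per day
new_dt <- unique(dt[, .(year, month, day, daily_val)])
```

This is essentially SCENARIO 2 plus a unique() step, so it still cannot avoid that one reassignment.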
Take the following data table:
# IMPUTING VALUES
library(data.table)
set.seed(1337)
mydt <- data.table(Year = rep(2000:2005, each = 10),
                   Type = c("A", "B"),
                   Value = 30 + rnorm(60))
naRows <- sample(nrow(mydt),15)
mydt[ naRows, Value := NA]
setkey(mydt,Year,Type)
How would I go about imputing the NAs with the median by Year and Type? I have tried the following:
# computed medians
computedMedians <- mydt[, .(Median = median(Value, na.rm = TRUE)), keyby = .(Year,Type)]
# dataset of just NA rows
dtNAs <- mydt[ is.na(Value), .SD, by = .(Year,Type)]
mydt[ is.na(Value),
Imputations := dtNAs[computedMedians, nomatch = 0][, Median],
by = .(Year,Type)]
mydt
but when you run the code, you'll see that it works unless a group is missing its data completely, in which case the computed medians get recycled. Is there a simpler way, or how would you go about fixing just that last error?
If you prefer updating the rows without copying the entire column, then:
require(data.table) # v1.9.6+
cols = c("Year", "Type")
mydt[is.na(Value), Value := mydt[.BY, median(Value, na.rm=TRUE), on=cols], by=c(cols)]
.BY is a special symbol which is a named list containing the groups. Although this requires a join with the entire data.table every time, it should be quite fast, as it's searching for only one group.
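To see what .BY actually contains, here is a tiny illustration of my own (not part of the answer), using a throwaway table:

```r
library(data.table)
dt <- data.table(Year = c(2000, 2000, 2001),
                 Type = c("A", "B", "A"),
                 Value = 1:3)
# .BY is a named list holding the current group's key values;
# returning NULL from j means nothing is aggregated, we only print
dt[, { print(.BY); NULL }, by = .(Year, Type)]
```

Each group prints a list with $Year and $Type, which is exactly what the join mydt[.BY, ..., on=cols] looks up.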
There's no need to make a secondary table; it can be done inside a single by-group call:
mydt[,
Value := replace(Value, is.na(Value), median(Value, na.rm=TRUE))
, by=.(Year,Type)]
Note that this imputation doesn't guarantee that all missing values are filled: a group whose values are entirely NA (e.g. 2005-B here) still comes out NA, because the median of an all-NA group is NA.
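If you also want those all-NA groups filled, one hedged follow-up (an assumption about the desired behaviour, not part of either answer) is a second pass that falls back to the overall median:

```r
library(data.table)
set.seed(1337)
mydt <- data.table(Year = rep(2000:2005, each = 10),
                   Type = c("A", "B"),
                   Value = 30 + rnorm(60))
mydt[sample(nrow(mydt), 15), Value := NA]
# first pass: group-wise median; groups that are entirely NA stay NA
mydt[, Value := replace(Value, is.na(Value), median(Value, na.rm = TRUE)),
     by = .(Year, Type)]
# second pass: fall back to the overall median for any remaining NAs
mydt[is.na(Value), Value := median(mydt$Value, na.rm = TRUE)]
```

Whether a global fallback is statistically appropriate depends on your data; it simply guarantees no NAs remain.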
Hi, I am trying to create a data.table with lagged variables by group id. Certain ids have only one row in the data.table; in that case the shift operator for lag gives an error, but the lead operator works fine. Here is an example:
dt = data.table(id = 1, week = as.Date('2014-11-11'), sales = 1)
lead = 2
lag = 2
lagSalesNames = paste('lag_sales_', 1:lag, sep = '')
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lag'), by = list(id)]
This gives me the following error
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead
(much quicker), or cbind or merge afterwards.
But if I try the same thing with lead instead, it works fine
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lead'), by = list(id)]
It also seems to work fine if the data.table has more than one row; e.g., the following with 2 rows works fine:
dt = data.table(id = 1, week = as.Date(c('2014-11-11', '2014-11-11')), sales = 1:2)
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lag'), by = list(id)]
I am using data.table version 1.9.5 on a Linux machine with R version 3.1.0. Any help would be much appreciated.
Thanks,
Ashin
Thanks for the report. This is now fixed (issue #1014) with commit #1722 in data.table v1.9.5.
Now works as intended:
dt
#    id       week sales lag_sales_1 lag_sales_2
# 1:  1 2014-11-11     1          NA          NA
I have a 1.3-million-row data frame which I need to aggregate into regional and temporal summaries. plyr's syntax is straightforward, but it's just much too slow to be practical (I've left ddply running for an hour, and it's completed less than 25%). I'm looking for help translating the ddply syntax into data.table to exploit its vaunted speed.
My data are of the following type
library(plyr)
library(lubridate)
dat <- expand.grid(area = letters[1:2],
                   day = as.Date("2012-10-01") + c(0:10) * days(1),
                   type = paste("t", 1:2, sep = ""))
dat$val <- runif(44)
I need row counts (which will be equal here, given my toy data) and sums of the val variable for different periods.
This ddply call gives me what I'm looking for
count.and.sum <- function(i){
  if(i$day >= as.Date("2012-10-02")){
    k <- data.frame(c_1d = nrow(dat[dat$type == i$type &
                                    dat$area == i$area &
                                    dat$day %in% i$day - days(1), ]),
                    c_2d = nrow(dat[dat$type == i$type &
                                    dat$area == i$area &
                                    dat$day %in% (i$day - c(1:2) * days(1)), ]),
                    s_1d = sum(dat$val[dat$type == i$type &
                                       dat$area == i$area &
                                       dat$day %in% i$day - days(1)]),
                    s_2d = sum(dat$val[dat$type == i$type &
                                       dat$area == i$area &
                                       dat$day %in% (i$day - c(1:2) * days(1))]))
    return(k)
  }
}
ddply(dat, .(area, day, type), count.and.sum)[1:10,]
Would really appreciate any data.table syntax you could provide.
Firstly, your function is terribly inefficient and exposes a lack of understanding of what a function to be passed to plyr should look like. For ddply(), it should take a generic data frame as input and output a data frame. By 'generic' in this context, I mean a data frame that would be produced as any one of the 'splits' defined by combinations of the levels of the grouping variables. Your function should look more like this:
count.and.sum <- function(d) data.frame(n = length(d$val), valsum = sum(d$val))
The grouping variable combinations are taken care of in the ddply() call.
Secondly, your ddply() call creates one line data frames because each observation is associated with a unique combination of area, day and type. A more realistic application of ddply() for this toy example would be to summarize by day:
Direct method using summarise as the 'apply' function:
ddply(dat, .(day), summarise, nrow = length(val), valsum = sum(val))
Using count.and.sum:
ddply(dat, .(day), count.and.sum)
This is very likely to be much faster than your version of count.and.sum.
As for an equivalent data.table version (not necessarily the most efficient), try this:
library(data.table)
DT <- data.table(dat, key = c('area', 'day', 'type'))
DT[, list(n = length(val), valsum = sum(val)), by = 'day']
Here's a slightly more elaborate toy example with 100K observations:
set.seed(5490)
dat2 <- data.frame(area = sample(letters[1:2], 1e5, replace = TRUE),
                   day = sample(as.Date("2012-10-01") + c(0:10) * days(1),
                                1e5, replace = TRUE),
                   type = sample(paste0("t", 1:2), 1e5, replace = TRUE),
                   val = runif(1e5))
system.time(u <- ddply(dat2, .(area, day, type), summarise,
n = length(val), valsum = sum(val)))
DT2 <- data.table(dat2, key = c('area', 'day', 'type'))
system.time(v <- DT2[, list(n = length(val), valsum = sum(val)), by = key(DT2)])
identical(u, as.data.frame(v))
On my system, the data.table version is about 4.5 times faster than the plyr version (0.09s elapsed for plyr, 0.02 for data.table).
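As a small aside of my own (not from the answer above): data.table's built-in .N counter is a slightly more idiomatic way to get the per-group row count than length(val), and .(...) is the modern shorthand for list(...):

```r
library(data.table)
set.seed(5490)
dat2 <- data.frame(area = sample(letters[1:2], 1e5, replace = TRUE),
                   day = sample(as.Date("2012-10-01") + 0:10,
                                1e5, replace = TRUE),
                   type = sample(paste0("t", 1:2), 1e5, replace = TRUE),
                   val = runif(1e5))
DT2 <- as.data.table(dat2)
# .N is the number of rows in the current group
DT2[, .(n = .N, valsum = sum(val)), by = .(area, day, type)]
```

The result should be identical to the length(val) version; the difference is readability rather than speed.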