R: data.table dynamic aggregations on Date columns

I am trying to do a min/max aggregate on a dynamically chosen column in a data.table. It works perfectly for numeric columns but I cannot get it to work on Date columns unless I create a temporary data.table.
It works when I use the name:
dt <- data.table(Index=1:31, Date = seq(as.Date('2015-01-01'), as.Date('2015-01-31'), by='days'))
dt[, .(minValue = min(Date), maxValue = max(Date))]
#      minValue   maxValue
# 1: 2015-01-01 2015-01-31
It does not work when I use with=FALSE:
colName = 'Date'
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)), :
# could not find function "."
I can use .SDcols on a numeric column:
colName = 'Index'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
#    minValue maxValue
# 1:        1       31
But I get an error when I do the same thing for a Date column:
colName = 'Date'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
If I use lapply(.SD, min) or sapply() then the dates are changed to numbers.
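For example (an illustrative snippet, not part of the original question: sapply() simplifies its result with unlist(), which strips the Date class):
sapply(dt[, .(Date)], min)
#  Date
# 16436
# i.e. the underlying day count since 1970-01-01, not a Date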
The following works and does not seem to waste memory and is fast. Is there anything better?
a <- dt[, colName, with=F]
setnames(a, 'a')
a[, .(minValue = min(a), maxValue = max(a))]

On your first attempt:
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)), :
# could not find function "."
You should read the Introduction to data.table vignette to understand what with= means. It's easier to follow if you're aware of the with() function from base R.
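A short illustration (editorial, using the dt defined above): with=TRUE (the default) evaluates j as an expression inside the data.table, much like base::with(); with=FALSE treats j as column names or positions and evaluates it in the calling scope, where .() is not a defined function, hence the error above:
colName = 'Date'
dt[, colName, with=FALSE]  # one-column data.table holding the Date column
min(colName)               # "Date" -- min() of a character vector, not of the column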
On the second one:
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
This seems like an issue with min() and max() on a data.frame/data.table containing a column with attributes. Here's a minimal reproducible example (MRE):
df = data.frame(x=as.Date("2015-01-01"))
min(df)
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
To answer your question, you can use get():
dt[, .(min = min(get(colName)), max = max(get(colName)))]
Or, as #Frank suggested, use the [[ operator to subset the column:
dt[, .(min = min(.SD[[colName]]), max = max(.SD[[colName]]))]
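With the dt defined at the top, both forms return the same result, with the Date class intact:
dt[, .(min = min(get(colName)), max = max(get(colName)))]
#           min        max
# 1: 2015-01-01 2015-01-31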
There's not yet a nicer way of applying .SD to multiple functions (because base R doesn't seem to have one AFAICT, and data.table tries to use base R functions as much as possible). There's an FR #1063 to address this issue. If/when that gets implemented, one could do, for example:
# NOTE: not yet implemented, FR #1063
dt[, colwise(.SD, min, max), .SDcols = colName]

Related

Group by a variable number of columns in R data.table

Consider the following:
library(data.table)
dt <- data.table(CO2)
What if I wanted to conditionally do:
dt[, mean(conc), by = .(Type, round(uptake))]
OR
dt[, mean(conc), by = round(uptake)]
depending on the value of some other boolean variable bool? I'd just like to avoid repeating two very similar commands in an if/else block, and I'm wondering if it's possible at all with data.table.
I tried the following:
bool <- TRUE
dt[, mean(conc), by = .(unlist(ifelse(bool, list(Type), list(NULL))), round(uptake))]
which works in this case, but if bool <- FALSE, it gives this error:
Error in `[.data.table`(dt, , mean(conc), by = .(unlist(ifelse(FALSE, :
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
A quick something:
gby <- c('Type', 'tmp')[c(bool, TRUE)]
dt[, tmp := round(uptake)][, mean(conc), by = gby]
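The logical subscript keeps 'Type' only when bool is TRUE (a quick editorial check):
bool <- TRUE
c('Type', 'tmp')[c(bool, TRUE)]
# [1] "Type" "tmp"
bool <- FALSE
c('Type', 'tmp')[c(bool, TRUE)]
# [1] "tmp"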

Grouped mean of difftime fails in data.table

Preface:
I have a column in a data.table of difftime values with units set to days. I am trying to create another data.table summarizing the values with
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
When printing the new data.table, I see values such as
1.925988e+00 days
1.143287e+00 days
1.453975e+01 days
I would like to limit the number of decimal places shown for this column only (i.e. not setting options(), unless that can target difftime values specifically). When I try to do this using the method above, modified, e.g.
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = Group]
I am left with NA values, with both the base round() and format() functions returning the warning:
In mean(DiffTime) : argument is not numeric or logical.
Oddly enough, if I perform the same operation on a numeric field, it runs with no problems. Also, if I run it as two separate lines of code, I can accomplish what I am looking to do:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
dt2[, AvgTime := round(AvgTime, 2)]
Reproducible Example:
library(data.table)
set.seed(1)
dt <- data.table(
  Date1 =
    sample(seq(as.Date('2017/10/01'),
               as.Date('2017/10/31'),
               by = "days"), 24, replace = FALSE) +
    abs(rnorm(24)) / 10,
  Date2 =
    sample(seq(as.Date('2017/10/01'),
               as.Date('2017/10/31'),
               by = "days"), 24, replace = FALSE) +
    abs(rnorm(24)) / 10,
  Num1 =
    abs(rnorm(24)) * 10,
  Group =
    rep(LETTERS[1:4], each = 6)
)
dt[, DiffTime := abs(difftime(Date1, Date2, units = 'days'))]
# Warnings/NA:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = .(Group)]
# Works when numeric/not difftime:
class(dt$Num1) # "numeric"
dt2 <- dt[, .(AvgNum = round(mean(Num1), 2)), by = .(Group)]
# Works, but takes an additional step:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = .(Group)]
dt2[, AvgTime := round(AvgTime, 2)]
# Works with base::mean:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(base::mean(DiffTime), 2)), by = .(Group)]
Question:
Why am I not able to complete this conversion (rounding of the mean) in one step when the class is difftime? Am I missing something in my execution? Is this some sort of bug in data.table where it can't properly handle the difftime?
Issue filed on GitHub.
Update: The issue appears to be resolved after updating from data.table version 1.10.4 to 1.12.8.
This was fixed by update #3567 on 2019/05/15, data.table version 1.12.4 released 2019/10/03
This might be a little late but if you really want it to work you can do:
as.numeric(round(difftime(DATE1, DATE2), 0)) # the extra as.difftime() wrapper in the original is redundant: difftime() already returns a difftime
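Applied to the reproducible example above (an editorial adaptation; DiffNum is a hypothetical column name), note the result is plain numeric, so the "days" unit label is lost:
dt[, DiffNum := as.numeric(round(difftime(Date1, Date2, units = "days"), 0))]
class(dt$DiffNum) # "numeric"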
I recently ran into the same problem using data.table_1.11.8. One quick workaround is to use base::mean instead of mean. (The likely reason this helps: data.table's query optimizer substitutes its own fast internal mean for a bare mean() call, and in affected versions that internal version mishandled difftime; base::mean is never substituted.)
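A related hedged workaround for affected versions: datatable.optimize is a documented option, and setting it to 0 turns off data.table's internal optimizations (including the mean substitution), at the cost of speed. A minimal sketch, assuming the optimized mean is indeed the culprit:
old <- options(datatable.optimize = 0L) # disable data.table query optimization
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = Group]
options(old) # restore the previous optimization level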

R: Order by new variable created in data.table

I am trying to understand why I can't order by a new variable that I create on the same line.
Currently I need to write two lines, one for creating the new variable and then for ordering it.
Can this be done in the same line in data.table:
DF <- data.table(ID = c(1,2,1,2,1,1,1,1,2), Value = c(1,1,1,1,1,1,1,1,1))
newDF <- DF[order(-Count), .(Count = .N), by = ID]
# Gives error: Error in eval(v, x, parent.frame()) : object 'Count' not found
# Works Correctly
newDF <- DF[, .(Count = .N), by = ID]
newDF <- newDF[order(-Count)]
> newDF
   ID Count
1:  1     6
2:  2     3
You can simply chain the two operations on a single line. Ordering inside a single call fails because i (the order(-Count) part) is evaluated before j creates Count; chaining runs a second [ on the already-aggregated result:
DF[, .(Count = .N), by = ID][order(-Count)]
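With the DF above, the chained call gives the same result in one line:
DF[, .(Count = .N), by = ID][order(-Count)]
#    ID Count
# 1:  1     6
# 2:  2     3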

How to impute values in a data.table by groups?

Take the following data table:
# IMPUTING VALUES
library(data.table)
set.seed(1337)
mydt <- data.table(Year = rep(2000:2005, each = 10),
                   Type = c("A", "B"),
                   Value = 30 + rnorm(60))
naRows <- sample(nrow(mydt),15)
mydt[ naRows, Value := NA]
setkey(mydt,Year,Type)
How would I go about imputing the NAs with the median by Year and Type? I have tried the following
# computed medians
computedMedians <- mydt[, .(Median = median(Value, na.rm = TRUE)), keyby = .(Year,Type)]
# dataset of just NA rows
dtNAs <- mydt[ is.na(Value), .SD, by = .(Year,Type)]
mydt[ is.na(Value),
Imputations := dtNAs[computedMedians, nomatch = 0][, Median],
by = .(Year,Type)]
mydt
but when you run the code, you'll see that it works except where a group is missing its data completely, in which case the computed medians get recycled. Is there a simpler way? Or how would you go about fixing just that last error?
If you prefer updating the rows by reference, without copying the entire column, then:
require(data.table) # v1.9.6+
cols = c("Year", "Type")
mydt[is.na(Value), Value := mydt[.BY, median(Value, na.rm=TRUE), on=cols], by=c(cols)]
.BY is a special symbol: a named list containing the current group's values. Although this requires a join against the entire data.table for every group, it should be quite fast, as each join searches for only one group.
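A small editorial sketch (not from the original answer) of what .BY holds per group:
mydt[, { if (.GRP == 1L) print(.BY); NULL }, by = cols]
# $Year
# [1] 2000
#
# $Type
# [1] "A"
# (returning NULL from j just discards the per-group result)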
There's no need to make a secondary table; it can be done inside a single by-group call:
mydt[, Value := replace(Value, is.na(Value), median(Value, na.rm = TRUE)),
     by = .(Year, Type)]
This imputation doesn't guarantee that all missing values are filled (e.g., 2005-B is still NA).
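If you also want the all-NA groups filled (the 2005-B case), one hedged follow-up, my suggestion rather than anything from the original answers, is a second pass after the by-group call above that falls back to the overall median:
# overall-median fallback for groups whose every Value was NA
mydt[is.na(Value), Value := median(mydt$Value, na.rm = TRUE)]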

Convert *some* column classes in data.table

I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
library(data.table)
dat <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. A check that this worked:
> sapply(dat, class)
       ID   Quarter     value
 "factor"  "factor" "numeric"
As suggested by #MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (though on data this small it hardly matters). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID = c(rep("A", 1e6), rep("B", 1e6)),
                  Quarter = c(1:1e6, 1:1e6),
                  value = rnorm(10))
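A sketch of how such a timing comparison might be run (microbenchmark is my choice here, not the answer's; note both expressions modify dat in place, so repeated runs just re-convert an existing factor):
library(microbenchmark)
microbenchmark(
  lapply_way = dat[, (cols) := lapply(.SD, factor), .SDcols = cols],
  set_way    = { for (col in cols) set(dat, j = col, value = factor(dat[[col]])) },
  times = 10L
)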
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
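The inner as.character() matters because as.integer() applied directly to a factor returns the underlying level codes, not the labels:
f <- factor(c("10", "20", "30"))
as.integer(f)                # 1 2 3   (level codes)
as.integer(as.character(f))  # 10 20 30 (the intended values)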
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference, because a copy is made and stored in memory (as pointed out by #Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat, class)
       ID   Quarter     value
 "factor"  "factor" "numeric"
If you don't add with = FALSE, data.table evaluates cols as an expression in j and simply returns the character vector itself. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID"      "Quarter"
> dat[, cols, with = FALSE]
    ID Quarter
 1:  A       1
 2:  A       2
 3:  A       3
 4:  A       4
 5:  A       5
 6:  B       1
 7:  B       2
 8:  B       3
 9:  B       4
10:  B       5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols = cols]
Note that, like the with = FALSE approach above, this <- assignment copies rather than updating by reference.
