Hi, I am trying to create a data.table with lagged variables by group id. Certain ids have only one row in the data.table; in that case the shift operator for lag gives an error, but the lead operator works fine. Here is an example:
dt = data.table(id = 1, week = as.Date('2014-11-11'), sales = 1)
lead = 2
lag = 2
lagSalesNames = paste('lag_sales_', 1:lag, sep = '')
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lag'), by = list(id)]
This gives me the following error:
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead
(much quicker), or cbind or merge afterwards.
But if I try the same thing with lead instead, it works fine:
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lead'), by = list(id)]
It also seems to work fine if the data.table has more than one row. For example, the following with two rows works fine:
dt = data.table(id = 1, week = as.Date(c('2014-11-11', '2014-11-11')), sales = 1:2)
dt[,(lagSalesNames) := shift(sales, 1:lag, NA, 'lag'), by = list(id)]
I am using data.table version 1.9.5 on a Linux machine with R version 3.1.0. Any help would be much appreciated.
Thanks,
Ashin
Thanks for the report. This is now fixed (issue #1014) with commit #1722 in data.table v1.9.5.
It now works as intended:
dt
# id week sales lag_sales_1 lag_sales_2
# 1: 1 2014-11-11 1 NA NA
I came across a surprising result with data.table. Here is a really simple example:
library(data.table)
df <- data.table(x = 1:10)
df[,x[x>3][.N]]
[1] NA
This syntax gives NA, but this works:
df[,x[x>3][1]]
[1] 4
and of course this
df[,x[.N]]
[1] 10
I know that in this simple case you can do
df[x>3,x[.N]]
but I wanted to use the df[,x[x>3][.N]] syntax while using lapply on .SD, to avoid looping over the i selection, so something like
df2 <- data.table(x = rep(1:10,2), y = rep(2:11,2),ID = rep(c("A","B"),each = 10))
cols = c("x","y")
df2[,lapply(.SD,function(x){x[x>3][.N]}),.SDcols = cols, by = ID]
But this fails, just as in my simple example. Is it because .N is not implemented in this case, or am I doing something wrong?
My actual workaround:
Reduce(merge, lapply(cols, function(col) df2[get(col) > 3, setNames(list(get(col)[.N]), col), by = ID]))
ID x y
1: A 10 11
2: B 10 11
but I am not fully happy with it; I find it less readable. Does anyone have an explanation and a better workaround?
Thank you !!
df[,x[x>3]] has 7 elements, but .N is 10, so you are trying to subset the vector out of range.
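To see the mismatch concretely on the question's data:
library(data.table)
df <- data.table(x = 1:10)
df[, length(x[x > 3])] # 7 -- only seven values survive the filter
df[, .N]               # 10 -- .N is the row count of the whole group
df[, x[x > 3][7]]      # 10 -- indexing within range returns the last element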
So you can access the last element of the vector in lapply using:
df2[, lapply(.SD, function(x) tail(x[x>3], 1) ), .SDcols = c('x','y'), by = ID]
Or, more idiomatically for data.table, we can use
df2[, lapply(.SD, function(x) last(x[x>3]) ), .SDcols = c('x','y'), by = ID]
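With the df2 from the question, both calls return the last value above 3 in each column, per group (matching the workaround's output):
# ID  x  y
# 1:  A 10 11
# 2:  B 10 11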
Preface:
I have a column in a data.table of difftime values with units set to days. I am trying to create another data.table summarizing the values with
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
When printing the new data.table, I see values such as
1.925988e+00 days
1.143287e+00 days
1.453975e+01 days
I would like to limit the decimal places for this column only (i.e. without setting options(), unless I can target difftime values specifically that way). When I try to do this using the method above, modified, e.g.
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = Group]
I am left with NA values, with both the base round() and format() functions returning the warning:
In mean(DiffTime) : argument is not numeric or logical.
Oddly enough, if I perform the same operation on a numeric field, it runs with no problems. Also, if I run the code as two separate lines, I can accomplish what I am looking to do:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = Group]
dt2[, AvgTime := round(AvgTime, 2)]
Reproducible Example:
library(data.table)
set.seed(1)
dt <- data.table(
  Date1 = sample(seq(as.Date('2017/10/01'), as.Date('2017/10/31'), by = "days"),
                 24, replace = FALSE) + abs(rnorm(24)) / 10,
  Date2 = sample(seq(as.Date('2017/10/01'), as.Date('2017/10/31'), by = "days"),
                 24, replace = FALSE) + abs(rnorm(24)) / 10,
  Num1 = abs(rnorm(24)) * 10,
  Group = rep(LETTERS[1:4], each = 6)
)
dt[, DiffTime := abs(difftime(Date1, Date2, units = 'days'))]
# Warnings/NA:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(mean(DiffTime), 2)), by = .(Group)]
# Works when numeric/not difftime:
class(dt$Num1) # "numeric"
dt2 <- dt[, .(AvgNum = round(mean(Num1), 2)), by = .(Group)]
# Works, but takes an additional step:
dt2 <- dt[, .(AvgTime = mean(DiffTime)), by = .(Group)]
dt2[, AvgTime := round(AvgTime, 2)]
# Works with base::mean:
class(dt$DiffTime) # "difftime"
dt2 <- dt[, .(AvgTime = round(base::mean(DiffTime), 2)), by = .(Group)]
Question:
Why am I not able to complete this conversion (rounding of the mean) in one step when the class is difftime? Am I missing something in my execution? Is this some sort of bug in data.table where it can't properly handle the difftime?
Issue added on GitHub.
Update: the issue appears to be cleared after updating from data.table version 1.10.4 to 1.12.8.
This was fixed by #3567 on 2019/05/15 and released in data.table version 1.12.4 on 2019/10/03.
This might be a little late, but if you really want it to work you can do:
as.numeric(round(as.difftime(difftime(DATE1, DATE2)), 0))
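Applied to the reproducible example above, the same idea (dropping the difftime class so mean() sees plain numbers) might look like:
dt2 <- dt[, .(AvgTime = round(mean(as.numeric(DiffTime)), 2)), by = .(Group)] # as.numeric keeps the value in days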
I recently ran into the same problem using data.table_1.11.8. One quick workaround is to use base::mean instead of mean.
I am using data.table in R (version 3.3.2 on OS X 10.11.6) and have noticed a change in behavior from version 1.9.6 to 1.10.0 with respect to the use of the := operator with a character string for the name.
I am renaming columns inside of a loop based upon the index number. Previously, I had been using eval(as.symbol("string")) on both sides of :=, but this no longer works (this was based upon answers from a previous question). Through trial and error, I figured out I needed to use ("string") on the left side and eval(as.symbol("string")) on the right-hand side.
Here is an MCVE that demonstrates this behavior:
library(data.table)
dt <- data.table(col1 = 1:10, col2 = 11:20)
## the next lines would be inside a loop that is excluded to simplify this MCVE
colA = paste0("col", 1)
colB = paste0("col", 2)
colC = paste0("col", 3)
## Old code that worked with 1.9.6 but no longer works
dt[ , eval(as.symbol(colC)) := eval(as.symbol(colA)) + eval(as.symbol(colB))]
## New code that works with 1.10.0
dt[ , (colC) := eval(as.symbol(colA)) + eval(as.symbol(colB))]
I have looked through the data.table documentation and have not been able to figure out why this workaround works. So, here is my question:
Why do I need the eval(as.symbol("string")) on the right side, but not on the left?
Following a discussion, it is now assumed that if the LHS of := is a single string, it is treated as a column name, so that, for example, dt[, "col" := 3] will also work.
There's been a fair bit of changing around as to exactly when this became the default, but the full story is contained in both the previous post and the data.table news.
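For example, continuing with the dt from the MCVE above:
dt[, "col4" := col1 + col2] # a quoted string on the LHS of := names the new column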
It may be of interest to you, however, that with
new_cols = c("j1", "j2")
dt[, (new_cols) := value] # brackets so we don't just make a column called new_cols
or
dt[, c("j1", "j2") := value]
it may be possible for you to achieve the above without needing a loop:
library(data.table)
dt = data.table(a = c(2, 3), b = c(5, 7), c = c(11, 13))
cols1 = sapply(c("a", "b"), as.symbol)
cols2 = sapply(c("b", "c"), as.symbol)
new_cols = c("d", "e")
> print(dt)
a b c
1: 2 5 11
2: 3 7 13
dt[, (new_cols) := purrr::map2(cols1, cols2, ~ eval(.x) + eval(.y))]
a b c d e
1: 2 5 11 7 16
2: 3 7 13 10 20
I have a data frame, shown below. I want to add an additional two columns to the data set with counts based on aggregates.
df <- structure(list(ID = c(1045937900, 1045937900),
SMS.Type = c("DF1", "WCB14"),
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"),
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")
I want simply to count the number of instances of SMS.Type and Reply.Date where the value is not null. So in the toy example above, I will get 2 for SMS.Type and 1 for Reply.Date.
I then want to add this to the data frame as total counts (I'm aware they will duplicate out for the number of rows in the original dataset, but that's OK).
I have been playing around with the aggregate and count functions, but to no avail:
mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
train,
function(x) length(unique(which(!is.na(x)))))
mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
testtrain,
function(x) length(which(!is.na(x))))
Can anyone help?
Thank you for your time
Using data.table, you could do the following (I've added a real NA to your original data).
I'm also not sure whether you're really looking for length(unique()) or just length.
library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") :=
lapply(.SD, function(x) length(unique(na.omit(x)))),
.SDcols = cols,
by = ID]
# ID SMS.Type SMS.Date Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900 DF1 12/02/2015 19:51 NA 2 1
# 2: 1045937900 WCB14 13/02/2015 08:38 13/02/2015 09:52 2 1
In the devel version (v >= 1.9.5) you could also use the uniqueN function.
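For example, a sketch of the same call using uniqueN:
setDT(df)[, paste0(cols, ".count") :=
            lapply(.SD, function(x) uniqueN(na.omit(x))),
          .SDcols = cols,
          by = ID]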
Explanation
This is a general solution which will work on any number of desired columns; all you need to do is put the column names into cols.
lapply(.SD, ...) calls the given function over the columns specified in .SDcols = cols.
paste0(cols, ".count") creates the new column names by appending ".count" to the names in cols.
:= performs assignment by reference, meaning it updates the newly created columns with the output of lapply(.SD, ...) in place.
The by argument specifies the grouping columns.
After converting your empty strings to NAs:
library(dplyr)
mutate(df, SMS.Type.count = sum(!is.na(SMS.Type)),
Reply.Date.count = sum(!is.na(Reply.Date)))
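The empty-string-to-NA conversion referred to above could be done, for example, in base R before calling mutate():
df[df == ""] <- NA # turn empty strings into real NAs across the whole frame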
This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, answered thoroughly here.
The difference is that I would like to simultaneously apply a different function to another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
v1 = rnorm(100),
v2 = rnorm(100),
v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
Yields the following error:
Error in `[.data.table`(dt, , list(v1 = sum(v1), lapply(.SD, mean)), by = grp,
: object 'v1' not found
Now this makes sense because the v1 column is not included in the subset of columns which must be evaluated first. So I explored further by including it in my subset of columns:
sd.cols = c("v1","v2", "v3")
dt.out = dt[, list(sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
Now this does not cause an error, but it provides an answer containing 9 rows (for 3 groups), with the sum repeated thrice in column V1 and the means for all 3 columns (as expected, but not wanted) stacked in V2, as shown below:
> dt.out
grp V1 V2
1: c -1.070608 -0.0486639841313638
2: c -1.070608 -0.178154270921521
3: c -1.070608 -0.137625003604012
4: b -2.782252 -0.0794929150464099
5: b -2.782252 -0.149529237116445
6: b -2.782252 0.199925178109264
7: a 6.091355 0.141659419355985
8: a 6.091355 -0.0272192037753071
9: a 6.091355 0.00815760216214876
Workaround Solution using 2 steps
Clearly it is possible to solve the problem in multiple steps by calculating the mean by group for the subset of columns and joining it to the sum by group for the single column as follows:
dt.out1 = dt[, sum(v1), by = grp]
dt.out2 = dt[, lapply(.SD,mean), by = grp, .SDcols = sd.cols]
dt.out = merge(dt.out1, dt.out2, by = "grp")
> dt.out
grp V1 v2 v3
1: a 6.091355 -0.0272192 0.008157602
2: b -2.782252 -0.1495292 0.199925178
3: c -1.070608 -0.1781543 -0.137625004
I'm sure it's a fairly simple thing I am missing; thanks in advance for any guidance.
Update: Issue #495 is now solved with this recent commit, and we can now do this just fine:
require(data.table) # v1.9.7+
set.seed(1L)
dt = data.table(grp = sample(letters[1:3],100, replace = TRUE),
v1 = rnorm(100),
v2 = rnorm(100),
v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
However, note that in this case v2 would be returned as a list. That's because you're effectively doing list(val, list(...)). What you perhaps intend to do is:
dt[, c(list(v1=sum(v1)), lapply(.SD, mean)), by=grp, .SDcols = sd.cols]
# grp v1 v2 v3
# 1: a -6.440273 0.16993940 0.2173324
# 2: b 4.304350 -0.02553813 0.3381612
# 3: c 0.377974 -0.03828672 -0.2489067
See the revision history for the older answer.
Try this:
dt[,list(sum(v1), mean(v2), mean(v3)), by=grp]
In data.table, using list() in the second argument allows you to describe a set of columns that result in the final data.table.
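Naming the list elements sets the column names of the result, for example:
dt[, list(v1 = sum(v1), v2 = mean(v2), v3 = mean(v3)), by = grp]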
For what it's worth, .SD can be quite slow,[^1] so you may want to avoid it unless you truly need all of the data supplied in the subsetted data.table, as you might for a more sophisticated function.
Another option, if you have many columns for .SDcols, would be to do the merge in one line using the data.table merge syntax.
For example:
dt[, sum(v1), by=grp][dt[,lapply(.SD,mean), by=grp, .SDcols=sd.cols]]
In order to use the merge from data.table, you need to first use setkey() on your data.table so it knows how to match things up.
So really, first you need:
setkey(dt, grp)
Then you can use the line above to produce an equivalent result.
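Putting the two steps together, using the dt and sd.cols defined in the question:
setkey(dt, grp)
dt.out <- dt[, sum(v1), by = grp][dt[, lapply(.SD, mean), by = grp, .SDcols = sd.cols]]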
[^1]: I find this to be especially true as your number of groups approaches the total number of rows. For example, this might happen where your key is an individual ID and many individuals have just one or two observations.