Update a vector in a data.table [duplicate] - r

This question already has answers here:
Using lists inside data.table columns
(2 answers)
Closed 8 years ago.
I have this:
dt = data.table(index=c(1,2), items=list(c(1,2,3),c(4,5)))
# index items
#1: 1 1,2,3
#2: 2 4,5
I want to change the items value in the row where index == 2 to c(6,7).
I tried both of these, but neither works:
dt[index==2, items] = c(6,7)
dt[index==2, items := c(6,7)]

One workaround is to use ifelse:
dt[,items:=ifelse(index==2,list(c(6,7)),items)]
# index items
#1: 1 1,2,3
#2: 2 6,7
EDIT: the correct answer is:
dt[index==2,items := list(list(c(6,7)))]
Indeed, you'll need one more list() because data.table interprets the outer list(.) as the set of values to assign to columns by reference.
There are two ways to use the := operator in data.table:
The LHS := RHS form:
DT[, c("col1", "col2", ..) := list(val1, val2, ...)]
It takes a list() argument on the RHS. To add a list column, you'll need to wrap with another list (as illustrated above).
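A minimal runnable sketch of the double-list assignment, using the table from the question:

```r
library(data.table)

dt <- data.table(index = c(1, 2), items = list(c(1, 2, 3), c(4, 5)))

# the outer list() is consumed by :=, the inner list() is the actual column value
dt[index == 2, items := list(list(c(6, 7)))]

dt$items[[2]]  # c(6, 7)
```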
The functional form:
DT[, `:=`(col1 = val1,  ## some comments
          col2 = val2,  ## some more comments
          ...)]
It is especially useful for adding comments alongside the assignments.

dt[index==2]$items[[1]] <- list(c(6,7))
dt
# index items
# 1: 1 1,2,3
# 2: 2 6,7
The problem is that, the way you have it set up, dt$items is a list, not a vector, so you have to use list indexing (e.g., dt$items[[1]]). But AFAIK you can't update a list element by reference, so, e.g.,
dt[index==2,items[[1]]:=list(c(6,7))]
will not work.
BTW I also do not see the point of using data.tables for this.

This worked:
dt$items[[which(dt$index==2)]] = c(6,7)


dynamic column names seem to work when := is used but not when = is used in data.table [duplicate]

This question already has answers here:
Dynamically add column names to data.table when aggregating
(2 answers)
Closed 2 years ago.
Using this dummy dataset
setDT(mtcars_copy<-copy(mtcars))
new_col<- "sum_carb" # for dynamic column referencing
Why does Case 1 work but not Case 2?
# Case 1 - Works fine
mtcars_copy[,eval(new_col):=sum(carb)] # Works fine
# Case 2: Doesn't work
aggregate_mtcars<-mtcars_copy[,(eval(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,eval(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,c(eval(new_col)=sum(carb))] # Error
How does one get Case 2 to work? I don't want the main table (mtcars_copy in this case) to hold the new columns; instead, the results should be stored in a separate aggregation table (aggregate_mtcars).
One option is to use the base R function setNames
aggregate_mtcars <- mtcars_copy[, setNames(.(sum(carb)), new_col)]
Or you could use data.table::setnames
aggregate_mtcars <- setnames(mtcars_copy[, .(sum(carb))], new_col)
I think what you want is to simply make a copy when doing case 1.
aggregate_mtcars <- copy(mtcars_copy)[, eval(new_col) := sum(carb)]
That retains mtcars_copy as a separate dataset to the new aggregate_metcars, without the new columns.
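To make the copy() approach concrete, here is a minimal sketch (the 90 is simply sum(mtcars$carb) from the built-in dataset):

```r
library(data.table)

setDT(mtcars_copy <- copy(mtcars))
new_col <- "sum_carb"

# copy() first, then assign by reference on the copy only
aggregate_mtcars <- copy(mtcars_copy)[, eval(new_col) := sum(carb)]

"sum_carb" %in% names(mtcars_copy)  # FALSE: the original is untouched
aggregate_mtcars$sum_carb[1]        # 90
```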
The reason is that Case 2 uses the data.frame way of creating a column in a data frame (as a new list).
There is a hidden parameter in data.table, with, that controls how the object is returned: it can be a data.table or a vector.
?data.table :
By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables. In case of overlapping variable names inside the dataset and in the parent scope, you can use the double-dot prefix ..cols to explicitly refer to the cols variable in the parent scope and not in your dataset.
When j is a character vector of column names, a numeric vector of column positions to select, or of the form startcol:endcol, the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols].
# Case 2 :
aggregate_mtcars<-mtcars_copy[,(get(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,eval(new_col)=sum(carb))] # error
aggregate_mtcars<-mtcars_copy[,c(eval(new_col)=sum(carb))] # Error
mtcars_copy[, new_col, with = FALSE ] # gives a data.table
mtcars_copy[, eval(new_col), with = FALSE ] # this works and create a data.table
mtcars_copy[, eval(new_col), with = TRUE ] # with=TRUE is the default; this is the erroring case above
mtcars_copy[, get(new_col), with = TRUE ] # works and gives a vector
# Case 2 solution : affecting values the data.frame way
mtcars_copy[, eval(new_col) ] <- sum(mtcars_copy$carb) # or any vector
mtcars_copy[[eval(new_col)]] <- sum(mtcars_copy$carb) # or any vector
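Pulling the working pieces together into one runnable sketch (note the [[<- assignment is the base-R, data.frame-style route, so it does not assign by reference):

```r
library(data.table)

setDT(mtcars_copy <- copy(mtcars))
new_col <- "sum_carb"

# data.frame-style assignment works on a data.table too; the scalar is recycled
mtcars_copy[[new_col]] <- sum(mtcars_copy$carb)

mtcars_copy[, new_col, with = FALSE]  # a one-column data.table
mtcars_copy[, get(new_col)]           # a plain vector
```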

Flexible mixing of multiple aggregations in data.table for different column combinations

The following problem has so far prevented me from using data.table aggregations really flexibly.
Example:
library(data.table)
set.seed(1)
DT <- data.table(C1=c("a","b","b"),
                 C2=round(rnorm(4),4),
                 C3=1:12,
                 C4=9:12)
sum_cols <- c("C2","C3")
#I want to apply a custom aggregation over multiple columns
DT[,lapply(.SD,sum),by=C1,.SDcols=sum_cols]
### Part 1 of question ###
#but what if I want to add another aggregation, e.g. count
DT[,.N,by=C1]
#this does not work as intended (it creates 4 rows instead of 2 and doesn't contain sum_cols)
DT[,.(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols]
### Part 2 of question ###
# or another function for another set of columns, adding a prefix to keep them apart?
mean_cols <- c("C3","C4")
#intended table structure (with 2 rows again)
C1 sum_C2 sum_C3 mean_C3 mean_C4
I know I can always merge various single aggregation results by some key, but I'm sure there must be a correct, flexible and easy way to do what I would like to do (especially Part 2).
The first thing to notice is that data.table's j argument expects a list output, which can be built with c, as mentioned in #akrun's answer. Here are two ways to do it:
set.seed(1)
DT <- data.table(C1=c("a","b","b"), C2=round(rnorm(4),4), C3=1:12, C4=9:12)
sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")
# with the development version, 1.10.1+
DT[, c(
    .N,
    sum = lapply(.SD[, ..sum_cols], sum),
    mean = lapply(.SD[, ..mean_cols], mean)
), by=C1]
# in earlier versions
DT[, c(
    .N,
    sum = lapply(.SD[, sum_cols, with=FALSE], sum),
    mean = lapply(.SD[, mean_cols, with=FALSE], mean)
), by=C1]
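For reference, the earlier-version form runs end to end like this; the result column names (e.g. sum.C2) come from how c() names list elements, so treat the exact names as incidental:

```r
library(data.table)

set.seed(1)
DT <- data.table(C1 = c("a", "b", "b"), C2 = round(rnorm(4), 4),
                 C3 = 1:12, C4 = 9:12)
sum_cols  <- c("C2", "C3")
mean_cols <- c("C3", "C4")

res <- DT[, c(.N,
              sum  = lapply(.SD[, sum_cols,  with = FALSE], sum),
              mean = lapply(.SD[, mean_cols, with = FALSE], mean)),
          by = C1]
res  # 2 rows (one per group), 6 columns
```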
lapply returns a list, and c connects the elements together into the single list that j expects.
Comments
If you turn on the verbose data.table option for these calls, you'll see a message:
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Also, you'll see that the optimized group mean and sum are not being used (see ?GForce for details). We can get around this by following FAQ 1.6 perhaps, but I couldn't figure out how.
The output are lists, so we use c to concatenate both the list outputs
DT[,c(.N,lapply(.SD,sum)),by=C1,.SDcols=sum_cols]
# C1 N C2 C3
# 1: a 4 0.288 22
# 2: b 8 0.576 56

Simultaneous order, row-filter and column-select with data.table

I am trying to do multiple steps in one line in R to select a value from a data.table (dt) with multiple criteria.
For example:
set.seed(123)
dt <- data.table(id = rep(letters[1:2],2),
                 time = rnorm(4),
                 value = rnorm(4)*100)
# id time value
# 1: a -0.56047565 12.92877
# 2: b -0.23017749 171.50650
# 3: a 1.55870831 46.09162
# 4: b 0.07050839 -126.50612
# Now I want to select the last (maximum time) value from id == "a"
# My pseudo data.table code looks like this
dt[order(time) & id == "a" & .N, value]
# [1] 12.92877 46.09162
Instead of getting the two values I want to get only the last value (which has the higher time-value).
If I do it step-by-step it works:
dt <- dt[order(time) & id == "a"]
dt[.N, value]
# [1] 46.09162
Bonus:
How can I order a data.table without copying it, i.e.
dt <- dt[order(time)]
without the <-. Similar to the :=-operator such as in dt[, new_val := value*2] which creates the new variable without copying the whole data.table.
Thank you, any idea is greatly appreciated!
For your first question, try
dt[id == "a", value[which.max(time)]]
## [1] 46.09162
For the bonus question, try the setorder function, which orders your data in place (you can also order in descending order by adding - in front of time)
setorder(dt, time)
dt
# id time value
# 1: a -0.56047565 12.92877
# 2: b -0.23017749 171.50650
# 3: b 0.07050839 -126.50612
# 4: a 1.55870831 46.09162
Also, if you are already ordering your data by time, you can do both (order by reference and select a value by condition) in a single line
setorder(dt, time)[id == "a", value[.N]]
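Both routes can be checked against the question's seeded data:

```r
library(data.table)

set.seed(123)
dt <- data.table(id = rep(letters[1:2], 2),
                 time = rnorm(4),
                 value = rnorm(4) * 100)

dt[id == "a", value[which.max(time)]]  # 46.09162

setorder(dt, time)        # reorders dt by reference, no <- needed
dt[id == "a", value[.N]]  # 46.09162 again, now via the last row of the subset
```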
I know this is an older question, but I'd like to add something. Having a similar problem, I stumbled on this question, and although David Arenburg's answer does solve it, I had trouble with that approach when trying to replace/overwrite values in the filtered and ordered data.table. Here is an alternative that also lets you apply <- calls directly to the filtered and ordered data.table.
The key is that data.table lets you concatenate several [] to each other.
Example:
dt[id=="a", ][order(time), ][length(value), "value"] <- 0
This also works for more than one entry, simply provide a suitable vector as replacement value.
Note, however, that .N cannot be used in this form; replace it with, e.g., the length of the column, because an integer is expected at this position in i. Also, the column you want to select in j needs to be wrapped in quotes ("").
I found this to be the more intuitive way and it lets you not only filter the data table, but also manipulate its values without needing to worry about temporary tables.

unique working incorrectly with data.table [duplicate]

This question already has answers here:
Extracting unique rows from a data table in R [duplicate]
(2 answers)
Closed 4 years ago.
I've discovered some interesting behavior in data.table, and I'm curious if someone can explain to me why this is happening. I'm merging two data.tables (in this MWE, one has 1 row and the other 2 rows). The merged data.table has two unique rows, but when I call unique() on the merged data.table, I get a data.table with one row. Am I doing something wrong? Or is this a bug?
Here's an MWE:
library(data.table)
X = data.table(keyCol = 1)
setkey(X, keyCol)
Y = data.table(keyCol = 1, otherKey = 1:2)
setkeyv(Y, c("keyCol", "otherKey"))
X[Y, ] # 2 unique rows
unique(X[Y, ]) # Only 1 row???
I'd expect unique(X[Y, ]) to be the same as X[Y, ] since all rows are unique, but this doesn't seem to be the case.
The default value of the by argument for unique.data.table is key(x). Therefore, if you call unique(x) on a keyed data.table, it only looks at the key columns. (Since data.table v1.9.8 the default has changed to all columns.) To override it, do:
unique(x, by = NULL)
by = NULL considers all the columns. Alternatively, you can also provide by = names(x).
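A sketch against the question's MWE; in data.table versions from 1.9.8 onward unique(merged) itself already compares all columns, so the explicit by arguments below make the two behaviours visible on any version:

```r
library(data.table)

X <- data.table(keyCol = 1)
setkey(X, keyCol)
Y <- data.table(keyCol = 1, otherKey = 1:2)
setkeyv(Y, c("keyCol", "otherKey"))

merged <- X[Y]
nrow(unique(merged, by = "keyCol"))  # 1: only keyCol is compared
nrow(unique(merged, by = NULL))      # 2: all columns are compared
```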

How to avoid listing out function arguments but still subset too?

I have a function myFun(a,b,c,d,e,f,g,h) which contains a vectorised expression of its parameters within it.
I'd like to add a new column: data$result <- with(data, myFun(A,B,C,D,E,F,G,H)) where A,B,C,D,E,F,G,H are column names of data. I'm using data.table but data.frame answers are appreciated too.
So far the parameter list (column names) can be tedious to type out, and I'd like to improve readability. Is there a better way?
> myFun <- function(a,b,c) a+b+c
> dt <- data.table(a=1:5,b=1:5,c=1:5)
> with(dt,myFun(a,b,c))
[1] 3 6 9 12 15
The ultimate thing I would like to do is:
dt[isFlag, newCol:=myFun(A,B,C,D,E,F,G,H)]
However:
> dt[a==1,do.call(myFun,dt)]
[1] 3 6 9 12 15
Notice that the j expression seems to ignore the subset. The result should be just 3.
Ignoring the subset aspect for now: df$result <- do.call("myFun", df). But that copies the whole df whereas data.table allows you to add the column by reference: df[,result:=myFun(A,B,C,D,E,F,G,H)].
To include the comment from #Eddi (and I'm not sure how to combine these operations in data.frame so easily) :
dt[isFlag, newCol := do.call(myFun, .SD)]
Note that .SD can be used even when you aren't grouping, just subsetting.
Or, if your function is literally just adding its arguments together:
dt[isFlag, newCol := rowSums(.SD)]  # note: do.call(sum, .SD) would collapse everything into one grand total
This automatically places NA into newCol where isFlag is FALSE.
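To see the subset being honoured, a small sketch with the toy myFun from the question (the isFlag condition replaced by the a == 1 subset used earlier):

```r
library(data.table)

myFun <- function(a, b, c) a + b + c
dt <- data.table(a = 1:5, b = 1:5, c = 1:5)

# .SD honours the i subset, unlike passing dt itself to do.call
dt[a == 1, newCol := do.call(myFun, .SD), .SDcols = c("a", "b", "c")]

dt$newCol  # 3 NA NA NA NA
```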
You can use
df$result <- do.call(myFun, df)
