data.table assignment involving factors - r

I'm using data.table (1.8.9) and the := operator to update the values in one table from the values in another. The table to be updated (dt1) has many factor columns, and the table with the updates (dt2) has similar columns with values that may not exist in the other table. If the columns in dt2 are characters I get an error message, but when I factorize them I get incorrect values.
How can I update a table without converting all factors to characters first?
Here is a simplified example:
library(data.table)
set.seed(3957)
## Create some sample data
## Note column y is a factor
dt1<-data.table(x=1:10,y=factor(sample(letters,10)))
dt1
## x y
## 1: 1 m
## 2: 2 z
## 3: 3 t
## 4: 4 b
## 5: 5 l
## 6: 6 a
## 7: 7 s
## 8: 8 y
## 9: 9 q
## 10: 10 i
setkey(dt1,x)
set.seed(9068)
## Create a second table that will be used to update the first one.
## Note column y is not a factor
dt2<-data.table(x=sample(1:10,5),y=sample(letters,5))
dt2
## x y
## 1: 2 q
## 2: 7 k
## 3: 3 u
## 4: 6 n
## 5: 8 t
## Join the first and second tables on x and attempt to update column y
## where there is a match
dt1[dt2,y:=i.y]
## Error in `[.data.table`(dt1, dt2, `:=`(y, i.y)) :
## Type of RHS ('character') must match LHS ('integer'). To check and
## coerce would impact performance too much for the fastest cases. Either
## change the type of the target column, or coerce the RHS of := yourself
## (e.g. by using 1L instead of 1)
## Create a third table that is the same as the second, except y
## is also a factor
dt3<-copy(dt2)[,y:=factor(y)]
## Join the first and third tables on x and attempt to update column y
## where there is a match
dt1[dt3,y:=i.y]
dt1
## x y
## 1: 1 m
## 2: 2 i
## 3: 3 m
## 4: 4 b
## 5: 5 l
## 6: 6 b
## 7: 7 a
## 8: 8 l
## 9: 9 q
## 10: 10 i
## No error message this time, but it is using the levels and not the labels
## from dt3. For example, row 2 should be q but it is i.
Page 3 of the data.table help file says:
When LHS is a factor column and RHS is a character vector with items
missing from the factor levels, the new level(s) are automatically
added (by reference, efficiently), unlike base methods.
This makes it seem like what I've tried should work, but obviously I'm missing something. I wonder if this is related to this similar issue:
rbindlist two data.tables where one has factor and other has character type for a column

Here's a workaround:
dt1[dt2, z := i.y][!is.na(z), y := z][, z := NULL]
Note that z is a character column, and the second assignment works as expected. I'm not sure why the direct assignment in the question doesn't.
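The workaround can be sketched on a smaller, self-contained pair of tables. The tables and values below are made up for illustration, and the join uses `on = "x"`, which assumes a newer data.table release than the 1.8.9 used in the question; the temporary column z follows the answer above.

```r
library(data.table)

# Small illustrative tables (hypothetical values, not the question's data)
dt1 <- data.table(x = 1:4, y = factor(c("a", "b", "c", "d")))
dt2 <- data.table(x = c(2L, 4L), y = c("q", "a"))  # y is character; "q" is not yet a level

# Step 1: copy the matching values into a temporary character column z
dt1[dt2, on = "x", z := i.y]
# Step 2: assign the character values to the factor column where a match exists
dt1[!is.na(z), y := z]
# Step 3: drop the temporary column
dt1[, z := NULL]

print(dt1)
```

Rows 2 and 4 should now carry the labels from dt2 while y stays a factor.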

Related

Order data.table by a character vector of column names

I'd like to order a data.table by a variable holding the name of a column:
I've tried every combination of eval, get and c without success:
I have colVar = "someColumnName"
I'd like to apply this to: DT[order(colVar)]
data.table has special functions for that matter which will modify your data set by reference instead of copying it to a new object.
You can either use setkey or (in versions >= 1.9.4) setorder which is capable of ordering in decreasing order too.
Note the difference between setkey vs. setkeyv and setorder vs. setorderv. The v suffix means the function accepts either a quoted column name or a variable containing one.
Using @andrewzm's data set
dtbl
# x y
# 1: 1 5
# 2: 2 4
# 3: 3 3
# 4: 4 2
# 5: 5 1
setorderv(dtbl, colVar)[] # or `setkeyv(dtbl, colVar)[]` or `setorderv(dtbl, "y")[]`
# x y
# 1: 5 1
# 2: 4 2
# 3: 3 3
# 4: 2 4
# 5: 1 5
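As a minimal sketch of the setorderv approach, including the decreasing order mentioned above: the example rebuilds the same dtbl, and copy() is used only so that the original table stays available for comparison (setorderv itself sorts by reference).

```r
library(data.table)

dtbl <- data.table(x = 1:5, y = 5:1)
colVar <- "y"  # name of the column to sort by, held in a variable

# Ascending sort on the column named in colVar
asc <- setorderv(copy(dtbl), colVar)

# Descending sort: order = -1L reverses the direction
desc <- setorderv(copy(dtbl), colVar, order = -1L)

print(asc)
print(desc)
```

Without copy(), `setorderv(dtbl, colVar)` would reorder dtbl itself and no new object would be created.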
You can use double brackets for data tables:
library(data.table)
dtbl <- data.table(x = 1:5, y = 5:1)
colVar = "y"
dtbl_sorted <- dtbl[order(dtbl[[colVar]])]
dtbl_sorted

data.table avoid column name changing [duplicate]

Pass character vectors and column names to data.table as a list of columns?
I want to be able to produce a subset of columns in R using data.table in a way that I can determine some of them earlier on and pass the predetermined list on as a character vector, then combine with a static list of columns.
That is, given this:
a <- 1:4
b <- 5:8
c <- c('aa','bb','cc','dd')
e <- 1:4
z <- data.table(a,b,c,e)
I want to do this:
z[, list(a,b)]
Which produces this output:
a b
1: 1 5
2: 2 6
3: 3 7
4: 4 8
But I want to do it in some way similar to this (which works, almost):
cols <- "b"
z[, list(get(cols), a)]
Results:
Note that it doesn't return the name of the column stored in cols
V1 a
1: 5 1
2: 6 2
3: 7 3
4: 8 4
but I need to do it with more than one element of cols (which does not work):
cols <- c('a', 'b')
z[, list(mget(cols), c)]
The above produces the following error:
Error: value for ‘a’ not found
I think my problem lies with scoping and which environments mget is looking in, but I can't figure out what exactly I am doing wrong. Also, how do I preserve the column titles?
Here are two (pretty much equivalent) options. One using lapply:
z[, c(lapply(cols, get), list(c))]
# V1 V2 V3
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
And one using mget:
z[, c(mget(cols, inherits = TRUE), c = list(c))]
# a b c
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
Note that get returns a vector which loses the information about column name (and there isn't much you can do about it besides manually adding it back in), while mget returns a named list.
Attempting to mix standard and non-standard evaluation within a single call will probably end in tears / frustration / obfuscated code.
There are a number of options in data.table
Use .. notation to "look up one level" to find the vector of column names
cols <- c('a','b')
z[, ..cols]
Use .SDcols
z[, .SD, .SDcols = cols]
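Both options above also accept a vector that already contains the hard-coded name, which keeps everything in standard evaluation. The sketch below assumes a data.table version that supports the `..` prefix; the variable name all_cols is introduced here for illustration only.

```r
library(data.table)

z <- data.table(a = 1:4, b = 5:8, c = c("aa", "bb", "cc", "dd"), e = 1:4)
cols <- c("a", "b")          # predetermined column names
all_cols <- c(cols, "c")     # add the static column name as a string

res_sd   <- z[, .SD, .SDcols = all_cols]  # via .SDcols
res_dots <- z[, ..all_cols]               # via the "up one level" prefix

print(res_sd)
```

In both results the column names a, b and c are preserved.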
But if you really want to combine the two ways of referencing, then you can use something like (introducing another option, with=FALSE, which allows more general expressions for column names than a simple vector)
ll <- function(char=NULL,uneval=NULL){
Call <- match.call()
cols <- lapply(Call$uneval,as.character)
unlist(c(char,cols))}
z[, ll(cols,c), with=FALSE]
# a b c
# 1: 1 5 aa
# 2: 2 6 bb
# 3: 3 7 cc
# 4: 4 8 dd
z[, ll(char=cols), with=FALSE]
# a b
# 1: 1 5
# 2: 2 6
# 3: 3 7
# 4: 4 8
z[, ll(uneval=c), with=FALSE]
# c
# 1: aa
# 2: bb
# 3: cc
# 4: dd
Combining a variable with column names with hard-coded column names in data.table
Given z and cols from the example above:
To combine a list of column names held in the variable cols with another hard-coded column name c, we combine them in a new character vector, c(..cols, 'c'), inside the call to data.table. We can refer to cols from within j (the second argument within []) by using the "up-one-level" notation ..:
z[, c(..cols, 'c')]
Thank you to @thelatemail for providing the basis for the solution above.

Don't resort data.table rows

I am learning data.table, so I'm very new to its syntax. I am trying to use the package as a hash lookup, and it works well except that, because of my ignorance of the syntax, it reorders the rows. I want to keep the original row order without sacrificing speed (i.e., do this the efficient way). Here is an example and desired output:
library(data.table)
(key <- setNames(aggregate(mpg~as.character(carb), mtcars, mean), c("x", "y")))
set.seed(10)
terms <- data.frame(x = c(9, 12, sample(key[, 1], 6, TRUE)), stringsAsFactors = FALSE)
## > terms$x
## [1] "9" "12" "4" "2" "3" "6" "1" "2"
setDT(key)
setDT(terms)
setkey(key, x)
setkey(terms, x)
terms[key, out := i.y]
terms
This gives:
## x out
## 1: 1 25.34286
## 2: 12 NA
## 3: 2 22.40000
## 4: 2 22.40000
## 5: 3 16.30000
## 6: 4 15.79000
## 7: 6 19.70000
## 8: 9 NA
I want:
## x out
## 1: 9 NA
## 2: 12 NA
## 3: 4 15.79000
## 4: 2 22.40000
## 5: 3 16.30000
## 6: 6 19.70000
## 7: 1 25.34286
## 8: 2 22.40000
In data.table, a join x[i] has to have a key set for x, but it's not essential for the key to be set for i.
NOTE: But if you don't set the key for i,
1) Ensure that the columns of i are in the same order as the key columns of x (reorder if necessary, using setcolorder), as it doesn't join by checking for names (yet).
2) It could be a tad slower (but not by much in my benchmarks).
The issue therefore is that, if you just want to do a x[i] join without any additional preprocessing, then terms has to take the place of i with no key set in order to get the results in the order you require.
With this in mind, we can approach this in two ways (that I could think of).
First method:
This one requires no additional preprocessing. That is, we treat key as x as mentioned above, meaning its key has to be set. We don't set a key for terms.
setkey(key, x)
The first column of terms is also named x and that's the column we want to join with. So, no reordering needed here.
ans = key[terms]
> ans
# x y
# 1: 9 NA
# 2: 12 NA
# 3: 4 15.79000
# 4: 2 22.40000
# 5: 3 16.30000
# 6: 6 19.70000
# 7: 1 25.34286
# 8: 2 22.40000
The difference is that this is an entirely new data.table, not just assigning the column by reference.
Second method:
We do a little extra preprocessing - addition of an extra column N to terms, by reference, which runs from 1:nrow(terms). This basically helps us to rearrange the data back in the order required, after the join. Here, we'll consider terms as x.
terms[, N := 1:.N]
setkey(terms, x)
It doesn't matter whether key has column x set as its key. But again, ensure that x is the first column in key if its key isn't set. In my case, I'll set the key of key to x.
setkey(key, x)
setkey(terms[key, out := i.y], N)
> terms
# x N out
# 1: 9 1 NA
# 2: 12 2 NA
# 3: 4 3 15.79000
# 4: 2 4 22.40000
# 5: 3 5 16.30000
# 6: 6 6 19.70000
# 7: 1 7 25.34286
# 8: 2 8 22.40000
Personally, since you require terms unsorted, I'd go with the first method here. But feel free to benchmark on your real data dimensions and choose which suits your need best.
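For comparison, here is a sketch of an update join that sets no keys at all. It assumes a newer data.table release in which `[` accepts the `on=` argument (the answer above predates it), and the small tables are made up for illustration. Because no key is set on terms, its rows are never re-sorted.

```r
library(data.table)

# Hypothetical lookup table and query table
key   <- data.table(x = c("1", "2", "3"), y = c(10, 20, 30))
terms <- data.table(x = c("9", "3", "1", "3"))

# Update join by reference: add column out to terms, matching on x.
# Neither table is keyed, so the row order of terms is left untouched.
terms[key, on = "x", out := i.y]

print(terms)
```

The unmatched value "9" gets NA, as in the desired output of the question.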

data.table joins - Select all columns in the i argument

When joining two data.tables, I can specify the table I want a column from, like
X[Y, i.id] # `id` is taken from Y
My problem is that I have a big table with ~80 columns. Every night a data refresh happens and, according to some parameters, some rows get replaced by a new version of the table (same table, just new data).
current <- data.table(id=1:4, var=1:4, var2=1:4, key="id")
new <- data.table(id=1:4, var=11:14, var2=11:14, key="id")
current[new[c(1,3)], `:=`(var=i.var, var2=i.var2)]
> current
id var var2
1: 1 11 11
2: 2 2 2
3: 3 13 13
4: 4 4 4
As I said, in my real case I have many more columns, so (besides rbind()ing pieces of the two tables) I wonder how I can select all the columns of the data.table used as the i argument in a join. I could spend half an hour hard-coding all of them, but that wouldn't be maintainable code (in case new columns are added to the tables in the future).
How about constructing the j-expression and just eval'ing it?
nc = names(current)[-1L]
nn = paste0("i.", nc)
expr = lapply(nn, as.name)
setattr(expr, 'names', nc)
expr = as.call(c(quote(`:=`), expr))
current[new[c(1,3)], eval(expr)]
> current
## id var var2
## 1: 1 11 11
## 2: 2 2 2
## 3: 3 13 13
## 4: 4 4 4
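A shorter variant of the same idea, offered as a sketch rather than as the answer's method: instead of building the call by hand, fetch the i.-prefixed columns with mget() and assign them to a character vector of target names. It assumes a newer data.table release that supports `on=`; the variable name cols is introduced for illustration.

```r
library(data.table)

current <- data.table(id = 1:4, var = 1:4, var2 = 1:4, key = "id")
new     <- data.table(id = 1:4, var = 11:14, var2 = 11:14, key = "id")

# Every column except the join column is to be overwritten
cols <- setdiff(names(current), "id")

# (cols) on the left means "the columns named in cols";
# mget() on the right fetches i.var, i.var2, ... from the i table
current[new[c(1, 3)], on = "id", (cols) := mget(paste0("i.", cols))]

print(current)
```

New columns added to both tables later are picked up automatically, since cols is derived from names(current).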
