Removing a factor level in data.table

A public dataset contains a factor level (e.g., "(0) Omitted") that I would like to recode as NA. Ideally, I'd like to be able to scrub an entire subset at once. I'm using the data.table package and am wondering whether there is a better or faster way to accomplish this than converting the values to characters, dropping the character, and converting back to factors.
library(data.table)
DT <- data.table(V1 = factor(sample(LETTERS, size = 2000000, replace = TRUE)),
                 V2 = factor(sample(LETTERS, size = 2000000, replace = TRUE)),
                 V3 = factor(sample(LETTERS, size = 2000000, replace = TRUE)))
# Convert to character
DT1 <- DT[, lapply(.SD, as.character)]
DT2 <- copy(DT1)
DT3 <- copy(DT) # Needs to be factor
# Scrub all 'B' values
DT1$V1[DT1$V1=="B"] <- NA
# Works!
DT2[V1 == "B", V1 := NA]
# Warning message:
# In `[.data.table`(DT, V1 == "B", `:=`(V1, NA)) :
#   Coerced 'logical' RHS to 'character' to match the column's type. Either
#   change the target column to 'logical' first (by creating a new 'logical'
#   vector length 26 (nrows of entire table) and assign that; i.e. 'replace'
#   column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*,
#   etc) to make your intent clear and for speed. Or, set the column type
#   correctly up front when you create the table and stick to it, please.
identical(DT1,DT2)
# [1] TRUE
# First attempt at looping over data.table
cnames <- colnames(DT3)
system.time(for (cname in cnames) {
  DT3[, cname := gsub("B", NA, DT3[[cname]]), with = FALSE]
})
#    user  system elapsed
#   4.258   0.128   4.478
identical(DT1$V1,DT3$V1)
# [1] TRUE
# Back to factors
DT3 <- DT3[, lapply(.SD, as.factor)]

Set the factor level to NA:
levels(DT$V1)[levels(DT$V1) == 'B'] <- NA
Example:
> d <- data.table(l=factor(LETTERS[1:3]))
> d
   l
1: A
2: B
3: C
> levels(d$l)[levels(d$l) == 'B'] <- NA
> d
    l
1:  A
2: NA
3:  C
> levels(d$l)
[1] "A" "C"

You can change the levels as follows:
for (j in seq_along(DT)) {
  x = DT[[j]]
  lx = levels(x)
  lx[lx == "B"] = NA
  setattr(x, 'levels', lx)  ## reset levels by reference
}
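One caveat worth verifying in your own session: setattr() swaps the levels attribute without remapping the underlying integer codes, so the old "B" cells end up pointing at an NA level. They print as <NA>, but is.na() does not flag them, unlike the levels<-() idiom above, which remaps the codes and drops the level. A minimal illustration:
f <- factor(c("A", "B"))
setattr(f, "levels", c("A", NA))  # raw attribute swap, no remapping
f
# [1] A    <NA>
# Levels: A <NA>
is.na(f)
# [1] FALSE FALSE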

split rows by condition in R data.table

I have a data.table containing 3 columns; one of them contains a key:value list of varying length. I wish to rearrange the table such that each row has only one key, conditioned on the value. For example, suppose I want all rows whose value is <= 2, with each key on its own row:
input_tbl <- data.table::data.table(
  a = c("AA"),
  b = c("{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}"),
  c = c(1)
)
The desired table should then be:
tbl_output <- data.table::data.table(
  a = c("AA", "AA"),
  b = c("ha:llo", "wor:ld"),
  c = c(1, 1),
  s = c(1, 2)
)
I tried the following function:
data_table_clean <- function(dt) {
  dt[, "b" := data.table::tstrsplit(b, ',', fixed = TRUE), by = c(a, c)]
  dt[, c('b', 's') := data.table::tstrsplit(b, ':', fixed = TRUE)]
  return(dt[s <= 2, ])
}
This produces the following error:
"Error in eval(expr, envir, enclos) : object 'a' not found"
Any suggestions are welcome, of course.
The keys are actually of the form:
input2_tbl <- data.table::data.table(
  a = c("AA"),
  b = c("{\"99:1d:3u:7y:89:67\":1,\"99:1D:34:YY:T6:Y6\":2,\"ll:5Y:UY:56:R5:R6\":3}"),
  c = c(1)
)
and accordingly the output table should be:
tbl2_output <- data.table::data.table(
  a = c("AA", "AA"),
  b = c("99:1d:3u:7y:89:67", "99:1D:34:YY:T6:Y6"),
  c = c(1, 1),
  s = c(1, 2)
)
Thank you!
Update
data_table_clean <- function(dt) {
  res <- dt[, data.table::tstrsplit(
                unlist(strsplit(gsub('[{}"]', '', b), ',', fixed = TRUE)),
                ":(?=[^:]+$)", perl = TRUE),
            by = .(a, c)][V2 > -100]
  data.table::setnames(res, 3:4, c("b", "s"))
  res
}
When running this I get the following error:
Error in .subset(x, j) : invalid subscript type 'list'
One option would be to extract the characters that we need in the final output. We use str_extract_all to do that after grouping by 'a' and 'c'. The output is a list, which we unlist; we then put the non-numeric and numeric parts into two columns and subset the rows with the condition s < 3.
library(stringr)
library(data.table)
input_tbl[, {
  tmp <- unlist(str_extract_all(b, "[A-Za-z]+:[A-Za-z]+|\\d+"))
  list(b = tmp[c(TRUE, FALSE)], s = tmp[c(FALSE, TRUE)])
}, by = .(a, c)][s < 3]
#     a c      b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
Or, if we are using strsplit/tstrsplit: grouped by 'a' and 'c', we remove the curly brackets and quotes ([{}"]) with gsub, split by , (strsplit), unlist the output, and then use tstrsplit to split by the : that is followed by a number. The subset step is the same as above.
res <- input_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '', b), ',', fixed = TRUE)),
                             ":(?=\\d)", perl = TRUE), .(a, c)][V2 < 3]
setnames(res, 3:4, c("b", "s"))
res
#     a c      b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
Update
For the updated dataset, we can do the tstrsplit on the last delimiter (:):
res1 <- input2_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '', b), ',', fixed = TRUE)),
                               ":(?=[^:]+$)", perl = TRUE),
                   by = .(a, c)][V2 < 3]
setnames(res1, 3:4, c("b", "s"))
res1
#     a c                 b s
# 1: AA 1 99:1d:3u:7y:89:67 1
# 2: AA 1 99:1D:34:YY:T6:Y6 2
Since it seems like you are working with a JSON object, why not use something that parses the JSON, for example, the "jsonlite" package?
With that, you can write a simple function that looks like this:
myFun <- function(invec) {
  require(jsonlite)
  x <- fromJSON(invec)
  list(b = names(x), s = unlist(x))
}
Now, applied to your dataset, you would get:
input_tbl[, myFun(b), by = .(a, c)]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
# 3: AA 1 doog:bye 3
And, for the subsetting:
input_tbl[, myFun(b), by = .(a, c)][s <= 2]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
You could even rewrite myFun to take a "threshold" argument that does the subsetting within the function itself, as sketched below.
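For instance, a sketch of such a variant (myFun2 and its threshold argument are my own naming, not part of the original answer):
myFun2 <- function(invec, threshold = Inf) {
  require(jsonlite)
  x <- unlist(fromJSON(invec))  # named numeric vector: values keyed by name
  keep <- x <= threshold        # subset inside the function
  list(b = names(x)[keep], s = x[keep])
}
input_tbl[, myFun2(b, threshold = 2), by = .(a, c)]
#     a c      b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2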

data.table loses factor ordering after rbind, R

When rbinding two data.tables with ordered factors, the ordering seems to be lost:
dtb1 = data.table(id = factor(c("a", "b"), levels = c("a", "c", "b"), ordered = TRUE), key = "id")
dtb2 = data.table(id = factor(c("c"), levels = c("a", "c", "b"), ordered = TRUE), key = "id")
test = rbind(dtb1, dtb2)
is.ordered(test$id)
#[1] FALSE
Any thoughts or ideas?
data.table does some fancy footwork so that data.table:::.rbind.data.table is called when rbind is invoked on a set of objects that includes a data.table. .rbind.data.table utilizes the speedups associated with rbindlist, with a bit of extra checking to match columns by name, etc.
.rbind.data.table deals with factor columns by using c to combine them (hence retaining the levels attribute):
# the relevant code is
l = lapply(seq_along(allargs[[1L]]),
           function(i) do.call("c", lapply(allargs, "[[", i)))
In base R, using c in this manner does not retain the "ordered" attribute; it doesn't even return a factor!
For example (in base R):
f <- factor(1:2, levels = 2:1, ordered=TRUE)
g <- factor(1:2, levels = 2:1, ordered=TRUE)
# it isn't ordered!
is.ordered(c(f,g))
# [1] FALSE
# no surprise, as it isn't even a factor!
is.factor(c(f,g))
# [1] FALSE
However data.table has an S3 method c.factor, which is used to ensure that a factor is returned and the levels are retained. Unfortunately this method does not retain the ordered attribute.
getAnywhere('c.factor')
# A single object matching 'c.factor' was found
# It was found in the following places
#   namespace:data.table
# with value
#
# function (...)
# {
#     args <- list(...)
#     for (i in seq_along(args)) if (!is.factor(args[[i]]))
#         args[[i]] = as.factor(args[[i]])
#     newlevels = unique(unlist(lapply(args, levels), recursive = TRUE,
#         use.names = TRUE))
#     ind <- fastorder(list(newlevels))
#     newlevels <- newlevels[ind]
#     nm <- names(unlist(args, recursive = TRUE, use.names = TRUE))
#     ans = unlist(lapply(args, function(x) {
#         m = match(levels(x), newlevels)
#         m[as.integer(x)]
#     }))
#     structure(ans, levels = newlevels, names = nm, class = "factor")
# }
# <bytecode: 0x073f7f70>
# <environment: namespace:data.table>
So yes, this is a bug. It is now reported as #5019.
As of version 1.8.11, data.table combines ordered factors to give an ordered result if a global order exists, and warns and falls back to a plain factor if it doesn't:
DT1 = data.table(ordered('a', levels = c('a','b','c')))
DT2 = data.table(ordered('a', levels = c('a','d','b')))
rbind(DT1, DT2)$V1
#[1] a a
#Levels: a < d < b < c
DT3 = data.table(ordered('a', levels = c('b','a','c')))
rbind(DT1, DT3)$V1
#[1] a a
#Levels: a b c
#Warning message:
#In rbindlist(lapply(seq_along(allargs), function(x) { :
# ordered factor levels cannot be combined, going to convert to simple factor instead
To contrast, here's what base R does:
rbind(data.frame(DT1), data.frame(DT2))$V1
#[1] a a
#Levels: a < b < c < d
# Notice that the resulting order does not respect the suborder for DT2
rbind(data.frame(DT1), data.frame(DT3))$V1
#[1] a a
#Levels: a < b < c
# Again, suborders are not respected and new order is created
I ran into the same problem after rbind; just re-assign the ordered levels for the column:
test$id <- factor(test$id, levels = letters, ordered = TRUE)
It's better to re-define the factor after the rbind.
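If you would rather not hard-code the levels, a sketch of the same fix that reuses the level set from one of the inputs (assuming, as in the question, that dtb1 and dtb2 were built with the same level ordering):
test <- rbind(dtb1, dtb2)
# re-create the ordered factor, reusing dtb1's level order
test[, id := factor(id, levels = levels(dtb1$id), ordered = TRUE)]
is.ordered(test$id)
# [1] TRUE
levels(test$id)
# [1] "a" "c" "b"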

Select / assign to data.table when variable names are stored in a character vector

How do you refer to variables in a data.table if the variable names are stored in a character vector? For instance, this works for a data.frame:
df <- data.frame(col1 = 1:3)
colname <- "col1"
df[colname] <- 4:6
df
# col1
# 1 4
# 2 5
# 3 6
How can I perform this same operation for a data.table, either with or without := notation? The obvious thing of dt[ , list(colname)] doesn't work (nor did I expect it to).
Two ways to programmatically select variable(s):
with = FALSE:
DT = data.table(col1 = 1:3)
colname = "col1"
DT[, colname, with = FALSE]
# col1
# 1: 1
# 2: 2
# 3: 3
'dot dot' (..) prefix:
DT[, ..colname]
# col1
# 1: 1
# 2: 2
# 3: 3
For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).
To assign to variable(s), wrap the LHS of := in parentheses:
DT[, (colname) := 4:6]
# col1
# 1: 4
# 2: 5
# 3: 6
The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i were present, it would subassign by reference. The parens around (colname) are a shorthand introduced in version 1.9.4 (on CRAN, Oct 2014). Here is the news item:
Using with = FALSE with := is now deprecated in all cases, given that wrapping
the LHS of := with parentheses has been preferred for some time.
colVar = "col1"
DT[, (colVar) := 1] # please change to this
DT[, c("col1", "col2") := 1] # no change
DT[, 2:4 := 1] # no change
DT[, c("col1","col2") := list(sum(a), mean(b))] # no change
DT[, `:=`(...), by = ...] # no change
See also Details section in ?`:=`:
DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol
And to answer the further question in a comment, here's one way (as usual there are many ways), using the parenthesized-LHS form since with = FALSE together with := is deprecated:
DT[, (colname) := cumsum(get(colname))]
# col1
# 1: 4
# 2: 9
# 3: 15
Or you might find it easier to read, write, and debug to simply eval a paste, similar to constructing a dynamic SQL statement to send to a server:
expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"
eval(parse(text=expr))
# col1
# 1: 4
# 2: 13
# 3: 28
If you do that a lot, you can define a helper function EVAL :
EVAL = function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
EVAL("DT[,",colname,":=cumsum(",colname,")]")
# col1
# 1: 4
# 2: 17
# 3: 45
Now that data.table (since 1.8.2) automatically optimizes j for efficiency, it may be preferable to use the eval method; the get() in j prevents some optimizations, for example.
Or, there is set(): a low-overhead, functional form of := that would be fine here. See ?set.
set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
# col1
# 1: 4
# 2: 21
# 3: 66
This is not really an answer, but I don't have enough street cred to post comments :/
Anyway, for anyone looking to actually create a new column in a data.table with a name stored in a variable, I've gotten the following to work. I have no clue as to its performance. Any suggestions for improvement? Is it safe to assume a nameless new column will always be given the name V1?
colname <- as.name("users")
# Google Analytics query is run with chosen metric and resulting data is assigned to DT
DT2 <- DT[, sum(eval(colname, .SD)), by = country]
setnames(DT2, "V1", as.character(colname))
Notice I can reference it just fine in sum() but can't seem to get it to assign in the same step. BTW, the reason I need to do this is that colname is based on user input in a Shiny app. (A possible one-step version is sketched below.)
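A possible one-step version (a sketch; the cn helper and the use of setNames are mine, not from the original post): since j may return a named list, the name can be attached directly instead of via a setnames() pass afterwards.
cn <- as.character(colname)  # "users"
DT2 <- DT[, setNames(list(sum(eval(colname, .SD))), cn), by = country]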
Retrieve multiple columns from data.table via variable or function:
library(data.table)
x <- data.table(this=1:2,that=1:2,whatever=1:2)
# === explicit call
x[, .(that, whatever)]
x[, c('that', 'whatever')]
# === indirect via variable
# ... direct assignment
mycols <- c('that','whatever')
# ... same as result of a function call
mycols <- grep('a', colnames(x), value=TRUE)
x[, ..mycols]
x[, .SD, .SDcols=mycols]
# === direct 1-liner usage
x[, .SD, .SDcols=c('that','whatever')]
x[, .SD, .SDcols=grep('a', colnames(x), value=TRUE)]
which all yield
that whatever
1: 1 1
2: 2 2
I find the .SDcols way the most elegant.
With development version 1.14.3, data.table has gained a new interface for programming on data.table, see item 10 in New Features. It uses the new env = parameter.
library(data.table) # development version 1.14.3 used
dt <- data.table(col1 = 1:3)
colname <- "col1"
dt[, cn := cn + 3L, env = list(cn = colname)][]
#     col1
#    <int>
# 1:     4
# 2:     5
# 3:     6
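The env list substitutes into selection as well as assignment; a minimal sketch, assuming the same dt as above:
dt[, .(cn), env = list(cn = colname)]  # selects col1
#     col1
#    <int>
# 1:     4
# 2:     5
# 3:     6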
For multiple columns, with a function applied to the column values: when updating columns from a function, the RHS must be a list, so looping over .SD with lapply does the trick. The example below converts the integer columns to numeric:
a1 <- data.table(a=1:5, b=6:10, c1=letters[1:5])
sapply(a1, class) # show classes of columns
# a b c1
# "integer" "integer" "character"
# column name character vector
nm <- c("a", "b")
# Convert columns a and b to numeric type
a1[, (nm) := lapply(.SD, as.numeric), .SDcols = nm]
sapply(a1, class)
# a b c1
# "numeric" "numeric" "character"
You could try this:
colname <- as.name("COL_NAME")
DT2 <- DT[, list(COL_SUM=sum(eval(colname, .SD))), by = c(group)]
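Note that group is assumed to already hold the grouping column name(s) as a character vector; for example (hypothetical names):
group <- "country"           # grouping column(s)
colname <- as.name("users")  # column to aggregate, as a symbol
DT2 <- DT[, list(COL_SUM = sum(eval(colname, .SD))), by = c(group)]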

Splitting a data.table with the by-operator: functions that return numeric values and/or NAs fail

I have a data.table with two columns: an ID column and a value column. I want to split the table by the ID column and run a function foo on the value column. This works fine as long as foo does not return NAs; in that case, I get an error telling me that the types of the groups are not consistent. My assumption is that, since is.logical(NA) is TRUE and is.numeric(NA) is FALSE, data.table internally assumes I want to combine logical values with numeric ones and throws an error. Still, I find this behavior peculiar. Am I missing something obvious here, or is this indeed intended behavior? If so, a short explanation would be great. (Note that I do know a work-around: just let foo2 return some improbable number and filter for that later. However, that seems like bad coding.)
Here is the example:
library(data.table)
foo1 <- function(x) {if (mean(x) < 5) {return(1)} else {return(2)}}
foo2 <- function(x) {if (mean(x) < 5) {return(1)} else {return(NA)}}
DT <- data.table(ID=rep(c("A", "B"), each=5), value=1:10)
DT[, foo1(value), by=ID] # Works perfectly
#      ID V1
# [1,]  A  1
# [2,]  B  2
DT[, foo2(value), by=ID] # Throws error
# Error in `[.data.table`(DT, , foo2(value), by = ID) :
#   columns of j don't evaluate to consistent types for each group: result for group 2 has column 1 type 'logical' but expecting type 'numeric'
You can fix this by specifying that your function should return NA_real_ rather than an NA of the default (logical) type.
foo2 <- function(x) {if (mean(x) < 5) {return(1)} else {return(NA)}}
DT[, foo2(value), by=ID] #Throws error
# Error in `[.data.table`(DT, , foo2(value), by = ID) :
# columns of j don't evaluate to consistent types for each group:
# result for group 2 has column 1 type 'logical' but expecting type 'numeric'
foo3 <- function(x) {if (mean(x) < 5) {return(1)} else {return(NA_real_)}}
DT[, foo3(value), by=ID] #Works
# ID V1
# [1,] A 1
# [2,] B NA
Incidentally, the message foo2() gives when it fails is nicely informative: it essentially tells you that your NA is of the wrong type. To fix the problem, you just need to pick the NA constant of the right type (or class):
NAs <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)
data.frame(constantName = sapply(NAs, deparse),
           class = sapply(NAs, class),
           type = sapply(NAs, typeof))
#    constantName     class      type
# 1            NA   logical   logical
# 2   NA_integer_   integer   integer
# 3      NA_real_   numeric    double
# 4 NA_character_ character character
# 5   NA_complex_   complex   complex
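Alternatively (a sketch, not from the original answers), you can keep the plain NA and coerce the function's result once, so that every branch returns a double:
foo4 <- function(x) as.numeric(if (mean(x) < 5) 1 else NA)  # as.numeric(NA) is NA_real_
DT[, foo4(value), by=ID]
#      ID V1
# [1,]  A  1
# [2,]  B NA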
