I want to perform a data.table task over and over in a function call: reduce the number of levels of large categorical variables. My problem is similar to "Data.table and get() command (R)" or "pass column name in data.table using variable in R", but I can't get it to work.
Without a function call this works just fine:
# Load data.table
require(data.table)
# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = T)),
weight = rnorm(n = 10e3, mean = 70, sd = 20))
# Decide the minimum frequency a level needs...
min.freq <- 3350
# Levels that don't meet the minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, type]
# Call all these level "Other"
levels(dt$type)[fail.min.f] <- "Other"
but when I wrap it like
reduceCategorical <- function(variableName, min.freq){
fail.min.f <- dt[, .N, variableName][N < min.freq, variableName]
levels(dt[, variableName][fail.min.f]) <- "Other"
}
I only get errors like:
reduceCategorical(dt$x, 3350)
Error in levels(df[, variableName][fail.min.f]) <- "Other" :
  trying to set attribute of NULL value
And sometimes
Error is: number of levels differs
One possibility is to define your own re-leveling function using data.table::setattr that will modify dt in place. Something like
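# Sets the levels attribute by reference (no copy): keeps the levels not in
# newl, then adds one "other" entry for each dropped level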
DTsetlvls <- function(x, newl)
setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))
Then use it within another predefined function
f <- function(variableName, min.freq){
fail.min.f <- dt[, .N, by = variableName][N < min.freq, get(variableName)]
dt[, DTsetlvls(get(variableName), fail.min.f)]
invisible()
}
f("type", min.freq)
levels(dt$type)
# [1] "C" "other"
Some other data.table alternatives
f <- function(var, min.freq) {
fail.min.f <- dt[, .N, by = var][N < min.freq, get(var)]
dt[get(var) %in% fail.min.f, (var) := "Other"]
dt[, (var) := factor(get(var))]
}
Or using set/.I
f <- function(var, min.freq) {
fail.min.f <- dt[, .I[.N < min.freq], by = var]$V1
set(dt, fail.min.f, var, "other")
set(dt, NULL, var, factor(dt[[var]]))
}
Or combining with base R (doesn't modify original data set)
f <- function(df, variableName, min.freq){
fail.min.f <- df[, .N, by = variableName][N < min.freq, get(variableName)]
levels(df[[variableName]])[fail.min.f] <- "Other" # use the passed column name rather than hardcoding type
df
}
Alternatively, if type is a character column rather than a factor, you could stick with characters and simply do
f <- function(var, min.freq) dt[, (var) := if(.N < min.freq) "other", by = var]
You are referencing things a little differently in the wrapper. To refer to the "type" column you are passing the whole column (a vector) as variableName instead of the column name, and the same happens when getting the levels: you are not referencing the column through variableName the way the non-wrapped code references it directly.
The error occurs because fail.min.f ends up NULL owing to this referencing.
Related
I want to group my data.table and feed the resulting subsets into a function. This function returns a list of data.tables.
I can imagine that calling the function directly on .SD in j won't work, because the result is not a data.table, atomic list or vector. However, I was hoping to make use of .SD and call the function in a vectorized fashion (I hope that's the correct term here), instead of looping over the grouping variable, subsetting the data in the loop and calling the function.
Reproducible example:
f <- function(dt_subset) {
l <- list(dt_subset)
return(l)
}
dt <- data.table(group = rep(c(1,2,3), 3),
name = letters[seq( from = 1, to = 9)],
value = runif(9))
# obviously this won't work:
result <- dt[, f(.SD), by = group]
Naive solution (shows the desired output):
result <- list()
for (g in unique(dt$group)) {
dt_sub <- dt[group == g]
result <- c(result, f(dt_sub))
}
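For completeness, here is one loop-free way that seems to reproduce the desired output (a sketch, not from the original thread; split.data.table requires data.table >= 1.9.7):
# Split into one data.table per group (keeping the group column, like the loop),
# apply f() to each subset, and flatten the per-group lists into one flat list.
result <- unlist(
  lapply(split(dt, by = "group"), f),
  recursive = FALSE, use.names = FALSE
)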
The problem is as follows: I have a data.table with columns A and B. A summary is required, and its name is passed as a character string in the variable var1.
I have tried to find an answer for some time now, see e.g. this and this SO post. Being unable to find a proper solution, I feel forced to ask this myself.
Now what I want to do is (using data.frame)
tmp[, var1] <- rep(1, nrow(tmp))
tmp <- aggregate(formula(paste(var1, "~ A + B")), tmp, sum)
but I fail to do so with data.table with my last and best effort being
tmp <- tmp[, list(..var1 = .N), by = list(A, B)]
Now, what is wrong with my code and how do I fix it?
And note that I do NOT want to use the := operator, because I want the result to be exactly as it would be from aggregate().
Edit 1: A working example:
library(data.table)
tmp <- data.table(A=c("R","G","G","B","B","B"), B=c(1,1,1,2,1,2))
print(tmp)
var1 <- "C"
tmp[, var1] <- rep(1, nrow(tmp))
tmp2 <- aggregate(formula(paste(var1, "~ A + B")), tmp, sum)
print(tmp2)
tmp3 <- tmp[, list(..var1 = .N), by = list(A, B)]
print(tmp3)
Hope I did not misread your question. Here are some options:
1) using base::setNames
DT[, setNames(.(.N), var1), by=.(A, B)]
2) using data.table::setnames
setnames(DT[, .N, by=.(A, B)], "N", var1)[]
3) using base::structure followed by base::as.list
DT[, as.list(structure(.N, names=var1)), by=.(A, B)]
data:
DT <- data.table(A=c(1,1,2,2), B=c(1,1,2,3))
var1 <- "myCol"
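With that data, option 1, for example, returns (printed output may vary slightly by data.table version):
DT[, setNames(.(.N), var1), by=.(A, B)]
#    A B myCol
# 1: 1 1     2
# 2: 2 2     1
# 3: 2 3     1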
I am trying to combine two approaches:
Bootstrapping multiple columns in data.table in a scalable fashion
with
Bootstrap weighted mean in R
Here is some random data:
## Generate sample data
# Function to randomly generate weights
set.seed(7)
rtnorm <- function(n, mean, sd, a = -Inf, b = Inf){
qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
}
# Generate variables
nps <- round(runif(3500, min=-1, max=1), 0) # nps value which takes 1, 0 or -1
group <- sample(letters[1:11], 3500, TRUE) # groups
weight <- rtnorm(n=3500, mean=1, sd=1, a=0.04, b=16) # weights between 0.04 and 16
# Build data frame
df = data.frame(group, nps, weight)
# The following packages / libraries are required:
require("data.table")
require("boot")
This is the code from the first post above, bootstrapping the weighted mean:
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(d, w))
}
results_qsec <- boot(data= df[, 2, drop = FALSE],
statistic = samplewmean,
R=10000,
j = df[, 3 , drop = FALSE])
This works totally fine.
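For inspection, the full-sample estimate sits in the boot object's t0 component:
results_qsec$t0
# equals weighted.mean(df$nps, df$weight)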
Below is the code from the second post above, bootstrapping the mean by group within a data.table:
dt = data.table(df)
stat <- function(x, i) {x[i, (m=mean(nps))]}
dt[, list(list(boot(.SD, stat, R = 100))), by = group]$V1
This, too, works fine.
I have trouble combining both approaches:
Running …
dt[, list(list(boot(.SD, samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
… brings up the error message:
Error in weighted.mean.default(d, w) :
'x' and 'w' must have the same length
Running …
dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
… brings up a different error:
Error in weighted.mean.default(d, w) :
(list) object cannot be coerced to type 'double'
I still have problems getting my head around how arguments are handled in data.table and how to combine functions that operate on a data.table.
I would appreciate any help.
It is related to how data.table behaves within the scope of a function: d is still a data.table within samplewmean, even after subsetting with i, whereas weighted.mean expects numeric vectors of values and weights. If you unlist before calling weighted.mean, you can fix this error:
Error in weighted.mean.default(d, w) :
(list) object cannot be coerced to type 'double'
Code to unlist before passing into weighted.mean:
samplewmean <- function(d, i, j) {
d <- d[i, ]
w <- j[i, ]
return(weighted.mean(unlist(d), unlist(w)))
}
dt[, list(list(boot(dt[, 2 , drop = FALSE], samplewmean, R = 5000, j = dt[, 3 , drop = FALSE]))), by = group]$V1
A more data.table-like (data.table version >= v1.10.2) syntax is probably as follows:
# boot() calls statistic(data, indices, ...); because valCol and wgtCol are matched by name, the resample indices land in the 'original' argument (unused here)
samplewmean <- function(d, valCol, wgtCol, original) {
weighted.mean(unlist(d[, ..valCol]), unlist(d[, ..wgtCol]))
}
dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol="nps", wgtCol="weight"))), by=group]$V1
Or another possible syntax (see data.table FAQ 1.6):
samplewmean <- function(d, valCol, wgtCol, original) {
weighted.mean(unlist(d[, eval(substitute(valCol))]), unlist(d[, eval(substitute(wgtCol))]))
}
dt[, list(list(boot(.SD, statistic=samplewmean, R=1, valCol=nps, wgtCol=weight))), by=group]$V1
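As a quick sanity check (a sketch, not part of the original answer), each returned boot object's t0 should match the plain grouped weighted mean, which you can compute directly:
# Full-sample weighted mean per group, for comparison with each boot t0
dt[, .(wmean = weighted.mean(nps, weight)), by = group]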
I have a situation where I need to look at character data and convert it to numeric or integer. I need to perform this operation on a data.table, and it needs to be so fast that I don't notice it when working with a data.table that has ~1000 columns and 1e6 rows. There's a lot of missing or sparse data, which is a confounding element.
fread from the data.table package does this incredibly quickly and is well tested when reading from a csv file (among other options).
Is there a way to apply the column identification used in fread to an existing data.table?
Otherwise, here's the approach I was considering (which is still too slow):
Dummy Data:
library(data.table)
size = 1e6
resample <- function(x,size = 1e6) sample(x,size,replace = TRUE)
text <- c("Canada","Peru","Australia",
"Angola","France","", NA_character_)
text2 <- c("Oh Canada.","Arriba Peru.",
"Australia?","Vive la France.")
numerics <- rnorm(1e6)
dt <- data.table(
id = as.character(1:1e6),
i1 = resample(c(as.character(c(0:5,NA)),"")), # sometimes just blank
i2 = resample(c(as.character(c(100:500,NA)))),
n1 = as.character(round(rnorm(1e6),3)),
t1 = resample(text),
t2 = resample(text2)
)
str(dt)
My approach so far is to use grep to test the columns for alphabetic characters and a literal ., and then write a short function to apply the appropriate as.* conversion as identified.
decide <- data.frame(
vars = names(dt),
character = unlist(lapply(dt, function(x) length(grep("[a-z]",x)))),
numeric = unlist(lapply(dt, function(x) length(grep("[.]",x))))
)
what_is_it <- function(character, numeric) {
if(character == 0 & numeric == 0) {
return("as.integer")
}
if(character > 0) {
return("as.character")
}
if(numeric > 0 & character == 0) {
return("as.numeric")
}
}
decide$fun <- apply(decide[-1], 1, function(x) what_is_it(x[1],x[2]))
for(var in decide$vars) {
fun <- get(decide$fun[decide$vars == var])
dt[, (var) := fun(get(var))]
dt[]
}
system.time(source("https://gist.githubusercontent.com/1beb/183511b51d615751860204344a02c799/raw/91fcee73f24596ac6bdec00edaad944b5b1b7713/quick_convert.R"))
It runs in about 3.5 seconds on my machine, and that's for only 7 columns.
As provided by user20650: the answer is type.convert.
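A minimal sketch of how that might look applied per column (assuming base R's type.convert, and treating empty strings as NA given the sparse data):
# Convert each column in place; as.is = TRUE keeps true text columns as
# character instead of factor, and "" is treated as missing.
for (col in names(dt)) {
  set(dt, j = col,
      value = type.convert(dt[[col]], na.strings = c("NA", ""), as.is = TRUE))
}
str(dt)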
I'm trying to compute columns in a data.table with the calculation passed in via a variable. The following shows what I'm trying to achieve:
dt <- data.table(mpg)
dt[, list(manufacturer, model, mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl)]
where I want mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl to come from a variable like:
var <- c('mpg_cyl_cty=cty/cyl', 'mpg_cyl_hwy=hwy/cyl')
dt[, list(manufacturer, model, var)]
I guess there are further questions here, such as what type var should be (c or list) and how dt should be called, via list or c.
Hope somebody has a suggestion, as I'm not finding anything on the web.
library(ggplot2)
library(data.table)
dt <- data.table(mpg)
# The original calculation
dt1 <- dt[, list(manufacturer, model, mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl)]
var <- c('mpg_cyl_cty=cty/cyl', 'mpg_cyl_hwy=hwy/cyl')
# create a string to pass for evaluation
expr <- paste0("`:=`(", paste0(var, collapse = ", "), ")")
dt2 <- dt[,
.(manufacturer, model, cty, cyl, hwy)
][, eval(parse(text = expr)) # evaluate the expression
][, c("cty", "cyl", "hwy") := NULL] # delete unnecessary columns
> print(all.equal(dt1, dt2))
[1] TRUE
A slightly different approach avoids eval(parse(.)) and operates on language objects instead.
Instead of c('mpg_cyl_cty=cty/cyl', 'mpg_cyl_hwy=hwy/cyl'), it takes just c("cty", "hwy") as input.
library(data.table)
dt = as.data.table(ggplot2::mpg)
r.expected = dt[, list(manufacturer, model, mpg_cyl_cty=cty/cyl, mpg_cyl_hwy=hwy/cyl)]
cyl.ratio.j = function(var){
substitute(lhs := rhs, list(
lhs = as.name(paste0("mpg_cyl_", var)),
rhs = call("/", as.name(var), as.name("cyl"))
))
}
r = dt[, eval(cyl.ratio.j("cty"))
][, eval(cyl.ratio.j("hwy"))
][, .SD, .SDcols = c("manufacturer", "model", paste0("mpg_cyl_", c("cty","hwy")))]
all.equal(r.expected, r)
#[1] TRUE
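For reference, the helper just builds an ordinary call object that j then evaluates; you can inspect it directly (a sketch; the deparsed printing may vary):
cyl.ratio.j("cty")
# mpg_cyl_cty := cty/cyl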