Consider the following:
library(data.table)
dt <- data.table(CO2)
What if I wanted to conditionally do:
dt[, mean(conc), by = .(Type, round(uptake))]
OR
dt[, mean(conc), by = round(uptake)]
depending on the value of some other boolean variable bool? I'd like to avoid repeating two very similar commands in an if/else block, and I'm wondering whether that's possible at all with data.table.
I tried the following:
bool <- TRUE
dt[, mean(conc), by = .(unlist(ifelse(bool, list(Type), list(NULL))), round(uptake))]
which works in this case, but if bool <- FALSE, it gives this error:
Error in `[.data.table`(dt, , mean(conc), by = .(unlist(ifelse(FALSE, :
column or expression 1 of 'by' or 'keyby' is type NULL. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
A quick option: build the vector of grouping column names conditionally, then group by it:
gby <- c('Type', 'tmp')[c(bool, TRUE)]
dt[, tmp := round(uptake)][, mean(conc), by = gby]
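As a sketch of the same idea wrapped in a function, the following builds the grouping vector conditionally and works on a copy so the original table does not keep the helper column. The names mean_conc_by, by_type and uptake_rounded are hypothetical, chosen only for illustration.

```r
library(data.table)

dt <- data.table(CO2)

# Hypothetical helper: group by rounded uptake, and optionally by Type as well.
mean_conc_by <- function(DT, by_type) {
  # Work on a copy so the helper column is not added to the caller's table.
  tmp <- copy(DT)[, uptake_rounded := round(uptake)]
  # c() drops the NULL produced when by_type is FALSE.
  by_cols <- c(if (by_type) "Type", "uptake_rounded")
  tmp[, .(mean_conc = mean(conc)), by = by_cols]
}

with_type    <- mean_conc_by(dt, TRUE)
without_type <- mean_conc_by(dt, FALSE)
```

Passing a character vector to by avoids the NULL-typed grouping expression that caused the error in the ifelse() attempt.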
I am writing a function which will impute zero if a value in a column is NA.
The tables I will need to impute will be in a format of:
tab = data.table(V1 = 1, var = NA, perc = NA)
Tables will have different column names but the one to impute will always be the second one.
To simplify, the function could be:
impute = function(DT, variable) {
DT[is.na(get(variable)), variable := 0]
}
That second 'variable' needs to be wrapped in something to work I assume. I would like to point it to the
variable = colnames(tab)[2]
Can anyone help, please?
You can wrap variable in () :
library(data.table)
impute = function(DT, variable) {
DT[is.na(get(variable)), (variable) := 0]
}
variable = colnames(tab)[2]
impute(tab, variable)
tab
# V1 var perc
#1: 1 0 NA
I don't think you need a function for it, you can do
cols <- colnames(DT)[2]
DT[, (cols) := lapply(.SD, function(z) replace(z, is.na(z), 0)), .SDcols = cols]
though if you want one, you could do
na0 <- function(x, default = 0) replace(x, is.na(x), default)
DT[, (cols) := lapply(.SD, na0), .SDcols = cols]
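For comparison, here is a minimal sketch that does the same imputation with set(), which takes the column name as a plain string and so avoids the question of how to wrap the variable. The function name impute_second and the numeric toy table are assumptions made for the example; the toy columns are typed as numeric on purpose so that writing 0 into them involves no type coercion.

```r
library(data.table)

# Hypothetical helper: replace NA in the second column with `value`, by reference.
impute_second <- function(DT, value = 0) {
  col <- names(DT)[2]
  na_rows <- which(is.na(DT[[col]]))
  set(DT, i = na_rows, j = col, value = value)
  invisible(DT)
}

# Toy table with numeric NA columns (an assumption for this example).
tab <- data.table(V1 = 1:3, var = c(NA_real_, 2.5, NA_real_), perc = NA_real_)
impute_second(tab)
```

Only the second column is touched; perc keeps its NA values.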
I have the following block of code that needs to be repeated often:
flights <- fread("https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/flights14.csv")
flights$origin %>% table()
flights[grepl("jfk", origin, ignore.case = TRUE),
origin := "0",
][grepl("ewr|lga", origin, ignore.case = TRUE),
origin := "1",
][, origin := as.numeric(origin)]
flights$origin %>% table()
Here is my attempt at wrapping this in a function that allows me to supply any number of regular expressions, and replacements for them, for any given column in the data set.
my_function <- function(regex, replacement, column) {
flights[, column, with = FALSE] %>% table()
for (i in seq_along(regex)) {
flights[grepl(regex[i], column, ignore.case = TRUE),
column := replacement[i],
with = FALSE]
}
flights[, column := as.numeric(column)]
flights[, column, with = FALSE] %>% table()
}
But this spits the following warning message:
Warning messages:
1: In `[.data.table`(flights, grepl(regex[i], column, ignore.case = TRUE), :
with=FALSE together with := was deprecated in v1.9.4 released Oct 2014. Please wrap the LHS of := with parentheses; e.g., DT[,(myVar):=sum(b),by=a] to assign to column name(s) held in variable myVar. See ?':=' for other examples. As warned in 2014, this is now a warning.
2: In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
Any help would be appreciated. Many thanks.
Figured it out, will leave my solution here for everyone else's benefit.
Instead of using with = FALSE, wrap the variable in () on the left-hand side of := to refer to the column by name.
To pass the column to another function (grepl() in my case), use get().
my_function <- function(regex, # Vector of regex strings to match
replacement, # Vector of strings to replace the matches from 'regex' arg
column, # Column to operate on
as.numeric = FALSE # Optional arg; convert 'column' to numeric as final step?
) {
cat("Converting:..")
responses[, column, with = FALSE] %>% table() %>% print
for (i in seq_along(regex)) {
responses[grepl(regex[i], get(column), ignore.case = TRUE, perl = TRUE),
(column) := replacement[i]]
}
if (as.numeric) {
responses[, (column) := as.numeric(get(column))]
}
cat("to:..")
responses[, column, with = FALSE] %>% table() %>% print
}
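A reduced sketch of the same approach, taking the table as an argument rather than relying on a global, may make the two points easier to see in isolation. The function name recode_by_regex, the argument to_numeric and the toy table are hypothetical and not part of the original solution.

```r
library(data.table)

# Hypothetical helper: apply each regex/replacement pair to one column, by reference.
recode_by_regex <- function(DT, column, regex, replacement, to_numeric = FALSE) {
  stopifnot(length(regex) == length(replacement))
  for (k in seq_along(regex)) {
    # get(column) reads the column; (column) on the LHS of := writes to it.
    DT[grepl(regex[k], get(column), ignore.case = TRUE, perl = TRUE),
       (column) := replacement[k]]
  }
  if (to_numeric) {
    DT[, (column) := as.numeric(get(column))]
  }
  invisible(DT)
}

# Toy data standing in for the flights table (an assumption for this example).
toy <- data.table(origin = c("JFK", "EWR", "LGA", "JFK"))
recode_by_regex(toy, "origin", c("jfk", "ewr|lga"), c("0", "1"), to_numeric = TRUE)
```

Any value that matches none of the patterns would still become NA in the numeric conversion, which is what the "NAs introduced by coercion" warning in the question points to.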
In R I have a data.table df with an integer column X. I want to convert this column from integer to a character.
This is really easy with df[, X:=as.character(X)].
Now for the question, I have the name of the column (X) stored in a variable like this:
col_name <- 'X'
How do I access the column (and convert it to a character column) while only knowing the variable?
I tried numerous things, all yielding nothing useful or a column of NAs. Which syntax will get me the result I want?
library(data.table)
DT <- as.data.table(iris)
col_name <- "Petal.Length"
Use ( to evaluate the LHS of := and use list subsetting to select the column:
DT[, (col_name) := as.character(DT[[col_name]])]
class(DT[[col_name]])
#[1] "character"
We can specify it in .SDcols and convert the column to character:
df[, (col_name) := as.character(.SD[[1L]]), .SDcols = col_name]
If there is more than one column, use lapply (here col_names is a character vector of column names):
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]
data
df <- data.table(X = as.integer(1:5), Y = LETTERS[1:5])
col_name <- "X"
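To make the multi-column case concrete, here is a small runnable sketch. The extra column Z and the vector col_names are assumptions added for the example, since the answer does not define col_names.

```r
library(data.table)

# Toy table; Z is added only so that there are two columns to convert.
df <- data.table(X = 1:5, Y = LETTERS[1:5], Z = 6:10)
col_names <- c("X", "Z")

# (col_names) on the LHS is evaluated to the column names it holds;
# .SDcols restricts .SD to those same columns.
df[, (col_names) := lapply(.SD, as.character), .SDcols = col_names]

col_classes <- sapply(df, class)
```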
I've been getting used to data.tables and just cannot seem to find the answer to something that feels so simple (or at least is with data frames).
I want to use data.table to aggregate, however, I don't always know which column to aggregate ahead of time (it takes input from the user). I want to define what column to use based off of a character vector. Here's a short example of what I want to do:
require(data.table)
myDT <- data.table(a = 1:10, b = 11:20, n1 = c("first", "second"))
aggWith <- "a"
Now I want to use the aggWith object to define what column to sum on. This does not work:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1)]
Error in sum(aggWith) : invalid 'type' (character) of argument
Nor does this:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1), with = FALSE]
Error in sum(aggWith) : invalid 'type' (character) of argument
This does:
myDT.Agg <- myDT[, .(Agg = sum(a)), by = .(n1)]
However, I want to be able to define which column "a" is arbitrarily, based on a character vector. I've looked through ?data.table, but am just not seeing what I need. Sorry in advance if this is really simple and I'm just overlooking something.
We could specify the 'aggWith' as .SDcols and then get the sum of .SD
myDT[, list(Agg= sum(.SD[[1L]] )), by = n1, .SDcols=aggWith]
If there are multiple columns, then loop with lapply
myDT[, lapply(.SD, sum), by = n1, .SDcols= aggWith]
Another option would be to use eval(as.name()):
myDT[, list(Agg= sum(eval(as.name(aggWith)))), by = n1]
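Since the column name comes from user input, it may help to wrap the .SDcols variant in a function that checks the name first. This is only a sketch; the function name sum_by_n1 and its argument agg_col are hypothetical.

```r
library(data.table)

myDT <- data.table(a = 1:10, b = 11:20, n1 = c("first", "second"))

# Hypothetical helper: sum one user-chosen column within each n1 group.
sum_by_n1 <- function(DT, agg_col) {
  # Fail early if the requested column does not exist.
  stopifnot(length(agg_col) == 1L, agg_col %in% names(DT))
  DT[, .(Agg = sum(.SD[[1L]])), by = n1, .SDcols = agg_col]
}

agg_a <- sum_by_n1(myDT, "a")
agg_b <- sum_by_n1(myDT, "b")
```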
I am trying to do a min/max aggregate on a dynamically chosen column in a data.table. It works perfectly for numeric columns but I cannot get it to work on Date columns unless I create a temporary data.table.
It works when I use the name:
dt <- data.table(Index=1:31, Date = seq(as.Date('2015-01-01'), as.Date('2015-01-31'), by='days'))
dt[, .(minValue = min(Date), maxValue = max(Date))]
# minValue maxValue
# 1: 2015-01-01 2015-01-31
It does not work when I use with=FALSE:
colName = 'Date'
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)), :
# could not find function "."
I can use .SDcols on a numeric column:
colName = 'Index'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# minValue maxValue
# 1: 1 31
But I get an error when I do the same thing for a Date column:
colName = 'Date'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
If I use lapply(.SD, min) or sapply() then the dates are changed to numbers.
The following works and does not seem to waste memory and is fast. Is there anything better?
a <- dt[, colName, with=F]
setnames(a, 'a')
a[, .(minValue = min(a), maxValue = max(a))]
On your first attempt:
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)), :
# could not find function "."
You should simply read the Introduction to data.table vignette to understand what with= means. It's easier if you're aware of with() function from base R.
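As a brief illustration of the point about with=: with = FALSE makes j behave as a column selector, so a character vector in j picks columns instead of being evaluated as an expression, which is why it cannot be combined with .() and function calls. The toy table below is an assumption for the example; the .. prefix shown alongside is the equivalent spelling in recent data.table versions.

```r
library(data.table)

dt <- data.table(Index = 1:3, Date = as.Date("2015-01-01") + 0:2)
colName <- "Date"

# j is treated as a vector of column names: the result is a one-column data.table.
sub_with <- dt[, colName, with = FALSE]

# Equivalent selection using the .. prefix to look colName up in the calling scope.
sub_dots <- dt[, ..colName]
```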
On the second one:
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
This seems to be an issue with calling min() and max() on a data.frame/data.table that has a column with attributes (such as a Date column). Here's an MRE:
df = data.frame(x=as.Date("2015-01-01"))
min(df)
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
To answer your question, you can use get():
dt[, .(min = min(get(colName)), max = max(get(colName)))]
Or, as @Frank suggested, use the [[ operator to subset the column:
dt[, .(min = min(.SD[[colName]]), max = max(.SD[[colName]]))]
There's not yet a nicer way of applying multiple functions to .SD (because base R doesn't seem to have one AFAICT, and data.table tries to use base R functions as much as possible). There's an FR, #1063, to address this issue. If/when that gets implemented, then one could do, for example:
# NOTE: not yet implemented, FR #1063
dt[, colwise(.SD, min, max), .SDcols = colName]
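Until something like that exists, a small helper built on .SDcols and [[ works for both numeric and Date columns, because min() and max() are then applied to the column vector rather than to a one-column table. This is only a sketch; the function name col_range is hypothetical.

```r
library(data.table)

dt <- data.table(Index = 1:31,
                 Date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), by = "days"))

# Hypothetical helper: min and max of one column chosen by name.
col_range <- function(DT, col_name) {
  # .SD[[1L]] extracts the single selected column as a vector, keeping its class.
  DT[, .(minValue = min(.SD[[1L]]), maxValue = max(.SD[[1L]])), .SDcols = col_name]
}

date_range  <- col_range(dt, "Date")
index_range <- col_range(dt, "Index")
```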