fill up a numeric variable in data.table efficiently by id

As a natural consequence of the data.table subsetting logic in i, I often end up in situations where I have a variable defined for only part of an id (like "total economic crises before 2007 per country", which is computed only on pre-2007 rows and hence is NA for anything later). Here is a slightly more general example:
library("data.table")
Data <- data.table(id = rep(c(1, 2, 3), each = 4),
                   variable = c(3, 3, NA, NA, NA, NA, 4, NA, NA, NA, NA, NA))
When I subsequently need this variable defined over the entire dataset, I want to fill in the NAs by group. I usually do this using max by group:
Data[, variable_full := max(variable, na.rm = T), by = id]
Data[variable_full == -Inf, variable_full := NA] # max() returns -Inf (with a warning) for all-NA groups; overwrite those with NA
But, for whatever reason, this takes very long in large datasets. Is there a more efficient, more data.table like way of doing this?
edit: "large datasets" is currently 8 million observations and it stops my workflow because it takes several minutes. Other data.table operations take split seconds because data.table is amazing.

perhaps a join?
Data[, variable_full := variable]
Data[is.na(variable),
     variable_full := Data[!is.na(variable), max(variable), by = .(id)][
       Data[is.na(variable), ], V1, on = .(id)]][]
A (slightly) shorter version of the line with the join is:
Data[is.na(variable), variable_full := Data[!is.na(variable), max(variable), by = .(id)][.SD, V1, on = .(id)]]
Here, the [Data[is.na(variable), ], part has been replaced with [.SD, , because .SD is already derived from i (at the beginning of the line).
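As a side note (my own variant, not part of the original answer), the same idea can be written as an update join, which I find a bit easier to read; a sketch using the same Data as above:
# Aggregate the non-NA rows per id, then join back and assign in place.
# Unmatched ids (all-NA groups like id 3) are left as NA automatically.
Data[Data[!is.na(variable), max(variable), by = id],
     variable_full := i.V1, on = "id"]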

If you want to install the collapse package, I think this could be much faster:
library(collapse)
Data <- Data |> gby(id) |> fmutate(variable_full = fmax(variable)) |> setDT()
gby() is 'group by' and fmutate() is 'fast mutate'. The default output is a grouped data frame, so it needs setDT() at the end to turn it back into a data.table.
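A variant I sometimes use (my own sketch, not part of this answer) keeps the data.table syntax and only borrows fmax, which, as far as I recall, returns NA rather than -Inf for all-NA groups:
library(collapse)
library(data.table)
# fmax() returns a length-1 result per group, which := recycles across the group
Data[, variable_full := fmax(variable), by = id]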

Related

Lookup deciles from one data.table in another data.table (maybe: how to take a by= argument into another filter??)

I have a data table with prices that looks like:
library("data.table")
Data <- data.table(id = c(1, 2),
                   value = c(250, 450))
I have another data table with price decile distributions that looks like:
lookupdata <- data.table(decile = seq(0.1, 1, by = 0.1),
                         value_lookup = seq(100, 1000, by = 100))
I want to add a column to the first data.table that tells me in which price decile each id happens to be. I managed to code a loop through id that does this:
for (i in Data$id) {
  Data[id == i, decile := min(lookupdata[Data[id == i, value] <= value_lookup, decile])]
  print(i)
}
But since my actual data has 23 million observations, this loop takes ages. I'm looking for an efficient data.table solution to this issue. All that needs to happen is something like: look up your value in lookupdata and, once you have found it, take the associated decile and assign it to your id.
Here is my attempt (feel free to ignore, I'm just looking for a solution):
Data[, wrong_decile := min(lookupdata[value_lookup >= Data[,value], decile]), by = id][]
The problem here is that the inner Data[, value] does not know which id to return, so I'm just taking the global minimum. I don't manage to get this into a by = id format without a loop through ids, because I don't know how to carry the by = id argument into another data.table used as a filter. That might be my actual question, but it feels like this should be much simpler than my attempt.
I think you are looking for a rolling join:
Data[, decile := lookupdata[Data, decile, on = .(value_lookup = value), roll = "nearest"]][]
#    id value decile
# 1:  1   250    0.2
# 2:  2   450    0.4
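If the intent is the loop's semantics (the smallest decile whose value_lookup is still >= value), rolling backwards rather than to the nearest value may be closer; this is my own sketch, not part of the answer:
# roll = -Inf carries the next larger value_lookup backwards (NOCB),
# so 250 maps to 300 (decile 0.3) and 450 maps to 500 (decile 0.5)
Data[, decile := lookupdata[Data, decile, on = .(value_lookup = value), roll = -Inf]][]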

fast replacement of data.table values by labels stored in another data.table

This is related to this question and this other one, although at a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated to the integers mentioned above.
See the reproducible example:
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table(
  country   = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
  gender    = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
  age       = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
  age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
  status    = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
  children  = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table(
  label_id  = as.character(c(1:9)),
  country   = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
  gender    = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
  age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
  status    = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
  children  = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of every variable of the first data.table that has a matching name in the dictionary table with its corresponding label.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var) {
  if (length(which(repex_DT[[x]] == "0")) == 0) {
    values_vector <- labels_DT$label_id
    labels_vector <- labels_DT[[x]]
  } else {
    values_vector <- c("0", labels_DT$label_id)
    labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]]) - 1)], NA,
                       labels_DT[[x]][length(labels_DT[[x]])])
  }
  result_DT1[, (x) := plyr::mapvalues(x = get(x), from = values_vector,
                                      to = labels_vector, warn_missing = FALSE)]
}
What I want is a faster method (the fastest, if one exists), since I have thousands of variables to process over tens of thousands of records.
Any performance improvement would be more than welcome. I battled with stringi but could not get the function to run without errors unless I hard-coded the variable names. See example:
test_stringi <- copy(repex_DT) %>%
  .[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
        str = x,
        pattern = unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
        replacement = unique(na.omit(labels_DT[["country"]])),
        vectorize_all = FALSE)),
    .SDcols = c("country")]
The columns of your 2nd data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[, (same_cols) := mapply(
  function(x, y) y[as.integer(x)],
  repex_DT[, same_cols, with = FALSE],
  labels_DT[, same_cols, with = FALSE],
  SIMPLIFY = FALSE
)]
Edit: you can add NA in the first position of the labels_DT columns (similar to what you did for the other missing values), or better yet keep the labels in a list:
labels_list <- list(
  country   = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
  gender    = c("Male","Female"),
  age_group = c("Less than 35","35 and more"),
  status    = c("Employed","Unemployed","Do not want to say"),
  children  = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[, (same_cols) := mapply(
  function(x, y) y[factor(as.integer(x))],
  repex_DT[, same_cols, with = FALSE],
  labels_list,
  SIMPLIFY = FALSE
)]
Notice that this way it is necessary to convert to factor first, because the values in repex_DT may not form the sequence 1, 2, 3, ... (status, for instance, uses 9).
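To illustrate the factor trick (my own example): indexing a vector by a factor uses the underlying integer codes, so non-contiguous values such as 1, 2, 9 map to positions 1, 2, 3 of the label vector, provided the labels are stored in that order:
labels_list$status[factor(c(1, 2, 9, NA))]
# [1] "Employed"           "Unemployed"         "Do not want to say" NA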
A computationally efficient way would be to melt both tables first, match them, and cast again:
repex_DT[, idx := .I] # create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2 := i.value, on = c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx ~ variable, value.var = "value2")
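One caveat worth noting (my addition, not part of the answer): columns with no entry in labels_melt, such as the truly numeric age, get NA in value2 and would be lost in the cast. A possible fix is to fall back to the original value before casting, keeping in mind it also fills rows whose label is legitimately missing:
# keep the original value where no label was matched, then cast as before
repex_melt[is.na(value2), value2 := value]
result <- dcast(repex_melt, idx ~ variable, value.var = "value2")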
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by @det, it is not possible to handle variables with a leading '0' label in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
# Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
# Update joins via matching IDs (credit to @det for the mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]

How can I program a loop in R?

How can I program a loop so that all eight tables are calculated one after the other?
The code:
dt_M1_I <- M1_I
dt_M1_I <- data.table(dt_M1_I)
dt_M1_I[, I := as.numeric(gsub(",", ".", I))]
dt_M1_I[, day := substr(t, 1, 10)]
dt_M1_I[, hour := substr(t, 12, 16)]
dt_M1_I_median <- dt_M1_I[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
This should be calculated for:
M1_I
M2_I
M3_I
M4_I
M1_U
M2_U
M3_U
M4_U
Thank you very much for your help!
Whenever you have several variables of the same kind, especially when you find yourself numbering them, as you did, step back and replace them with a single list variable. I do not recommend doing what the other answer suggested.
That is, instead of M1_I…M4_I and M1_U…M4_U, have two variables m_i and m_u (using lower case in variable names is conventional), which are each lists of four data.tables.
Alternatively, you might want to use a single variable, m, which contains nested lists of data.tables (m = list(list(i = …, u = …), …)).
Assuming the first, you can then iterate over them as follows:
give_this_a_meaningful_name = function (df) {
dt <- data.table(df)
dt[, I := as.numeric(gsub(",", ".", I))]
dt[, day := substr(t, 1, 10)]
dt[, hour := substr(t, 12, 16)]
dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}
m_i_median = lapply(m_i, give_this_a_meaningful_name)
(Note also the introduction of consistent spacing around operators; good readability is paramount for writing bug-free code.)
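If the eight tables already exist as separate numbered objects, one pragmatic way to get from them to the recommended list form (a sketch, assuming M1_I ... M4_U live in the global environment) is mget():
# mget() returns a named list of the objects whose names are given
m_i <- mget(paste0("M", 1:4, "_I"))
m_u <- mget(paste0("M", 1:4, "_U"))
m_i_median <- lapply(m_i, give_this_a_meaningful_name)
m_u_median <- lapply(m_u, give_this_a_meaningful_name)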
You can use a combination of a for loop and the get/assign functions like this:
# create a vector of the data.frame names
dts <- c('M1_I', 'M2_I', 'M3_I', 'M4_I', 'M1_U', 'M2_U', 'M3_U', 'M4_U')
# iterate over each dataframe
for (dt in dts) {
  # get the actual dataframe (not the string name of it)
  tmp <- get(dt)
  tmp <- data.table(tmp)
  tmp[, I := as.numeric(gsub(",", ".", I))]
  tmp[, day := substr(t, 1, 10)]
  tmp[, hour := substr(t, 12, 16)]
  tmp <- tmp[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
  # assign the modified dataframe to the name you want (the paste adds the 'dt_' to the front)
  assign(paste0('dt_', dt), tmp)
}

How to impute values in a data.table by groups?

Take the following data table:
# IMPUTING VALUES
library(data.table)
set.seed(1337)
mydt <- data.table(Year = rep(2000:2005, each = 10),
                   Type = c("A", "B"),
                   Value = 30 + rnorm(60))
naRows <- sample(nrow(mydt), 15)
mydt[naRows, Value := NA]
setkey(mydt, Year, Type)
How would I go about imputing the NAs with the median by Year and Type? I have tried the following
# computed medians
computedMedians <- mydt[, .(Median = median(Value, na.rm = TRUE)), keyby = .(Year, Type)]
# dataset of just NA rows
dtNAs <- mydt[is.na(Value), .SD, by = .(Year, Type)]
mydt[is.na(Value),
     Imputations := dtNAs[computedMedians, nomatch = 0][, Median],
     by = .(Year, Type)]
mydt
But when you run the code, you'll see that it works except when a group is missing its data completely, in which case the computed medians get recycled. Is there a simpler way, or how would you go about fixing just that last error?
If you prefer updating the rows without copying the entire column, then:
require(data.table) # v1.9.6+
cols = c("Year", "Type")
mydt[is.na(Value), Value := mydt[.BY, median(Value, na.rm = TRUE), on = cols], by = c(cols)]
.BY is a special symbol which is a named list containing the groups. Although this requires a join with the entire data.table every time, it should be quite fast, as it's searching for only one group.
There's no need to make a secondary table; it can be done inside a single by-group call:
mydt[, Value := replace(Value, is.na(Value), median(Value, na.rm = TRUE)),
     by = .(Year, Type)]
This imputation doesn't guarantee that all missing values are filled (e.g., 2005-B is still NA).
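If those remaining NAs also need a value, a possible follow-up (my own sketch, not from the answers) is a second pass that falls back to the overall median for groups that were entirely missing:
# first pass: group-wise medians (all-NA groups stay NA)
mydt[, Value := replace(Value, is.na(Value), median(Value, na.rm = TRUE)),
     by = .(Year, Type)]
# second pass: overall median (computed after the group-wise fill) for all-NA groups
mydt[is.na(Value), Value := median(mydt$Value, na.rm = TRUE)]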

Elegantly assigning multiple columns in data.table with lapply()

I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table. It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of doing things.
For example:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
I don't like this because I don't like referencing the data.table in a j expression. I also know that I can use := to assign with lapply, given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(You could extend this by building an expression with unknown column names.)
Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
Yes, you're right in the question here:
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100))
dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is:
for (col in paste0("V", 20:100))
dt[ , (col) := sqrt(dt[[col]])]
because with the with = FALSE form it wasn't easy to tell whether it referred to the LHS or the RHS of :=. End aside.
As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.
The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 columns' worth of new memory which has to be allocated and populated. So you need 80 columns' worth of free RAM for that operation to succeed. That RAM usage dominates compared with the subsequent, essentially instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
As @Frank pointed out, if you have a lot of columns (say 10,000 or more), then the small overhead of dispatching to the [.data.table method starts to add up. To eliminate that overhead, there is data.table::set, which under ?set is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question:
I don't like this because I don't like referencing the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of := but use get instead.
for (col in paste0("V", 20:100))
dt[, (col) := sqrt(get(col))]
However, I fear that using get in j carries an overhead. Benchmarking was done in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that, we could sugar the LHS and allow get() as well, #1381:
for (col in paste0("V", 20:100))
dt[, get(col) := sqrt(get(col))]
Also, maybe the value argument of set could be run within the scope of DT, #1382:
for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(get(col)))
These should work if you want to refer to the columns by string name:
n = paste0("V", 20:100)
dt[, (n) := lapply(n, function(x) {sqrt(get(x))})]
or
dt[, (n) := lapply(n, function(x) {sqrt(dt[[x]])})]
Is this what you are looking for?
dt[, names(dt)[20:100] := lapply(.SD, function(x) sqrt(x)), .SDcols = 20:100]
I have heard tell that using .SD is not so efficient because it makes a copy of the table beforehand, but if your table isn't huge (obviously that's relative depending on your system specs) I doubt it will make much of a difference.
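For reference, the same .SD idiom also works with column names rather than positions, which is how I would usually write it today (a sketch, reusing the V20-V100 names from the question):
# select the target columns by name, both on the LHS of := and via .SDcols
cols <- paste0("V", 20:100)
dt[, (cols) := lapply(.SD, sqrt), .SDcols = cols]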
