Using a function within a data.table is extremely slow - r

Obviously I'm not doing things right. This method works, but it's hardly practical. Please keep in mind that the sample data I'm using runs quickly, but if you were to try this on a DT of 400k+ rows, you'll be waiting a while.
I originally did this so that I could convert Dates into Fiscal Weeks (Friday to Thursday) with a name. I wanted to be able to look at each date within the DT, test which range it fell into and then return the week name. It worked like a charm! But, like I said, extremely slowly.
I then thought this would be a good technique for any row-by-row tweaking I might need to do, but unless I can optimize this somehow I might have to rethink things...
So let's use the mtcars dataset, and say we want to look at what cyl is in each row and then translate that number to its word.
cyl.word <- function(c) {
  r <- "Huh?"
  if(c==4) {r <- "Four"}
  if(c==6) {r <- "Six"}
  if(c==8) {r <- "Eight"}
  return(r)
}
Works like you would expect: if you type cyl.word(4), you get "Four". Great. Let's apply it across the DT...
cars <- data.table(mtcars)
cars[, Word:=cyl.word(cars$cyl[.I]), by=cyl]
head(cars)
mpg cyl disp hp drat wt qsec vs am gear carb Word
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Six
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Six
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Four
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Six
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Eight
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Six
I think the slowdown is obviously caused by the [.I], but without it the function seems to grab the entire vector of cyl and use only the first element, so every row returns "Six":
cars[, Word:=cyl.word(cars$cyl), by=cyl]
head(cars)
mpg cyl disp hp drat wt qsec vs am gear carb Word
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Six
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Six
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Six
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Six
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Six
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Six
And it also gives me these warning messages:
Warning messages:
1: In if (c == 4) { :
the condition has length > 1 and only the first element will be used
2: In if (c == 6) { :
the condition has length > 1 and only the first element will be used
3: In if (c == 8) { :
the condition has length > 1 and only the first element will be used
Which is why I'm using the .I variable to tell the function which iteration to look at...

Using the vectorized function match() should speed up your computations considerably:
library(data.table)
cars <- data.table(mtcars)
cyl.word <- function(x) {
  c("Four", "Six", "Eight", "Huh?")[match(x, c(4,6,8), nomatch=4)]
}
cars[, Word:=cyl.word(cyl)]
head(cars)
# mpg cyl disp hp drat wt qsec vs am gear carb Word
# 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Six
# 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Six
# 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Four
# 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Six
# 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Eight
# 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Six
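Another equally vectorized option (a sketch of my own, not part of the original answer) is a named lookup vector; subsetting by as.character(cyl) does the translation in one shot, and unname() drops the names the subset would otherwise carry along:

```r
library(data.table)
cars <- data.table(mtcars)
# named lookup vector: the names are the cyl values as strings
lookup <- c(`4` = "Four", `6` = "Six", `8` = "Eight")
cars[, Word := unname(lookup[as.character(cyl)])]
head(cars$Word)
# [1] "Six"   "Six"   "Four"  "Six"   "Eight" "Six"
```

Note that unmatched cyl values yield NA here rather than "Huh?", which may or may not be what you want.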

You could join to another data.table as well, which should be very fast. "Huh?" becomes NA.
library(data.table)
cars <- data.table(mtcars)
cars[32, cyl := 10]  # introduce a value with no match, for demonstration
dt2 <- data.table(cyl = c(4,6,8), word = c("Four", "Six", "Eight"))
setkey(cars, cyl)
setkey(dt2, cyl)
cars <- dt2[cars, nomatch = NA]
tail(cars)
# cyl word mpg disp hp drat wt qsec vs am gear carb
# 1: 8 Eight 15.2 304 150 3.15 3.435 17.30 0 0 3 2
# 2: 8 Eight 13.3 350 245 3.73 3.840 15.41 0 0 3 4
# 3: 8 Eight 19.2 400 175 3.08 3.845 17.05 0 0 3 2
# 4: 8 Eight 15.8 351 264 4.22 3.170 14.50 0 1 5 4
# 5: 8 Eight 15.0 301 335 3.54 3.570 14.60 0 1 5 8
# 6: 10 NA 21.4 121 109 4.11 2.780 18.60 1 1 4 2
Both match and a keyed data.table join are pretty fast. Here are benchmarks on a much larger dataset.
library(microbenchmark)
N <- 50000
cars <- data.table(mtcars[rep(seq_len(nrow(mtcars)), N), ])
dim(cars)
#[1] 1600000 11
microbenchmark(
  MATCH = {
    cyl.word <- function(x) {
      c("Four", "Six", "Eight", "Huh?")[match(x, c(4,6,8), nomatch=4)]
    }
    cars[, match_word := cyl.word(cyl)]
  },
  DTJOIN = {
    dt2 <- data.table(cyl = c(4,6,8), word = c("Four", "Six", "Eight"))
    setkey(cars, cyl)
    setkey(dt2, cyl)
    new_cars <- dt2[cars, nomatch = NA]
  })
#Unit: milliseconds
# expr min lq median uq max neval
# MATCH 36.73572 41.16291 50.89132 63.22235 271.3654 100
# DTJOIN 29.56963 33.72217 39.51063 61.92716 268.3304 100

Related

How to obtain a random sample according to each category?

I have a DF called 'listings_df'. What I need is to impute the NA values in the variables 'number_of_reviews' and 'review_scores_rating' by creating random samples according to the numbers in each group of 'room_type'.
I attached a picture of how the DF looks (not reproduced here).
First of all, I tried grouping by 'room_type':
test <- listings_df %>% group_by(room_type)
Then I select the columns where I want to transform the NA values, and create the samples:
test$number_of_reviews[is.na(listings_df$number_of_reviews)] <-
  sample(listings_df$number_of_reviews, size = sum(is.na(listings_df$number_of_reviews)))
test$review_scores_rating[is.na(listings_df$review_scores_rating)] <-
  sample(listings_df$review_scores_rating, size = sum(is.na(listings_df$review_scores_rating)))
I am not sure whether this creates the random data according to room_type; I would also like to know if it's possible to manage this with a loop.
Thanks!
What you're asking for is called imputation. I'll demonstrate using mtcars as the data, cyl as the grouping variable (your room_type, I suspect), and other columns with NA values.
mt <- mtcars
set.seed(42)
mt$disp[sample(32,10)] <- NA
mt$hp[sample(32,10)] <- NA
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 NA 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 NA 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 NA NA 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 NA NA 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
From here:
dplyr
library(dplyr)
set.seed(42)
mt %>%
  group_by(cyl) %>%
  mutate(across(c(disp, hp), ~ coalesce(., sample(na.omit(.), size=n(), replace=TRUE)))) %>%
  ungroup()
# # A tibble: 32 × 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 3 22.8 4 108 91 3.85 2.32 18.6 1 1 4 1
# 4 21.4 6 145 110 3.08 3.22 19.4 1 0 3 1
# 5 18.7 8 301 180 3.15 3.44 17.0 0 0 3 2
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# 7 14.3 8 301 245 3.21 3.57 15.8 0 0 3 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 9 22.8 4 141. 52 3.92 3.15 22.9 1 0 4 2
# 10 19.2 6 225 123 3.92 3.44 18.3 1 0 4 4
# # … with 22 more rows
data.table
library(data.table)
cols <- c("disp", "hp")
set.seed(42)
as.data.table(mt)[, c(cols) := lapply(.SD, \(z) fcoalesce(z, sample(na.omit(z), size=.N, replace=TRUE))),
                  by = cyl, .SDcols = cols][] |>  # grouped by cyl, like the dplyr and base R versions
  head()
base R
set.seed(42)
mt[cols] <- lapply(mt[cols], \(z) ave(z, mt$cyl, FUN = \(z) ifelse(is.na(z), sample(na.omit(z), size=length(z), replace=TRUE), z)))
head(mt)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 91 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 145 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 301 180 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Note: while I set the random seed for imputation in each dialect, there is no expectation that the order of columns and fixes will be the same between dialects. For this reason, the replacements for the NA values are not the same across the dialects of code; the seed is provided for basic reproducibility, not identical results.

Data table cap and floor assigning new variables

I have a list of variables I want to cap and floor, but I want to save them under different names so as not to overwrite the original variables.
Example with mtcars that yields an error:
# Variables to cap and floor
mtcars_vars <- c("wt","qsec","mpg")
# Function to cap and floor
cap_floor <- function(x, pct_floor = 0.05, pct_cap = 0.95){
  quantiles <- quantile(x, c(pct_floor, pct_cap), na.rm = TRUE)
  x[x < quantiles[1]] <- quantiles[1]
  x[x > quantiles[2]] <- quantiles[2]
  x
}
mtcars[,paste0(mtcars_vars,"_cap_fl") := lapply(mtcars_vars, cap_floor)]
This yields an Error in (1 - h) * qs[i] : non-numeric argument to binary operator, although the function works fine on individual columns.
cap_floor(mtcars$qsec)
cap_floor(mtcars$wt)
cap_floor(mtcars$mpg)
Any idea why it is happening?
You are essentially lapply()'ing over a character vector of column names, so cap_floor() receives strings rather than the numeric columns. You could use .SDcols:
library(data.table)
DTmtcars <- data.table(mtcars)
DTmtcars[,paste0(mtcars_vars,"_cap_fl") := lapply(.SD, cap_floor), .SDcols = mtcars_vars]
> head(DTmtcars)
mpg cyl disp hp drat wt qsec vs am gear carb wt_cap_fl qsec_cap_fl mpg_cap_fl
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2.620 16.4600 21.0
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2.875 17.0200 21.0
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2.320 18.6100 22.8
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.215 19.4400 21.4
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3.440 17.0200 18.7
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.460 20.1045 18.1
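If you prefer not to use .SD, another option (a sketch of my own, reusing the question's cap_floor() and mtcars_vars) is mget(), which resolves the names to the actual columns inside j, so lapply() again receives vectors rather than strings:

```r
library(data.table)
mtcars_vars <- c("wt", "qsec", "mpg")
cap_floor <- function(x, pct_floor = 0.05, pct_cap = 0.95){
  quantiles <- quantile(x, c(pct_floor, pct_cap), na.rm = TRUE)
  x[x < quantiles[1]] <- quantiles[1]
  x[x > quantiles[2]] <- quantiles[2]
  x
}
DTmtcars <- data.table(mtcars)
# mget() looks each name up in the data.table's scope and returns a list of columns
DTmtcars[, paste0(mtcars_vars, "_cap_fl") := lapply(mget(mtcars_vars), cap_floor)]
```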

Create new data.table column by applying function on other columns

I want to create a new column in a data.table based on the values of other columns. Using mtcars as an example:
> library(data.table)
> dt <- as.data.table(mtcars)
> head(dt[, newval := cyl + gear])
mpg cyl disp hp drat wt qsec vs am gear carb newval
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 10
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 10
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 8
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 9
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 11
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 9
which works fine, but for an even slightly more complex function, I get warning messages:
simple_func <- function(a, b){
  if(a %in% c(4,6)){
    return(a*b)
  } else {
    return(b/a)
  }
}
head(dt[, newval := simple_func(cyl, disp)])
returns:
> head(dt[, newval := simple_func(cyl, disp)])
mpg cyl disp hp drat wt qsec vs am gear carb newval
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 960
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 960
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 432
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1548
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2880
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1350
Warning message:
In if (a %in% c(4, 6)) { :
the condition has length > 1 and only the first element will be used
the value for row 5 (cyl == 8) is clearly incorrect; the expected value of newval is 360/8 = 45.
The reason is that the function is not being evaluated one row at a time but for the entire column, and therefore if the condition is met for the first row (dt$cyl[1], dt$disp[1]), all other rows have the same formula applied to them.
How do I get around this? I tried using .SDcols but didn't get it right and got other errors instead.
Use ifelse, which is vectorized:
simple_func <- function(a, b){
  ifelse(a %in% c(4,6), a*b, b/a)
}
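With the vectorized version, every element is evaluated independently, so row 5 now gets 360/8 = 45 as expected. data.table also ships its own faster, type-stable variant, fifelse(), which drops in the same way:

```r
library(data.table)
dt <- as.data.table(mtcars)
simple_func <- function(a, b){
  fifelse(a %in% c(4, 6), a * b, b / a)
}
dt[, newval := simple_func(cyl, disp)]
dt$newval[5]
# [1] 45   (cyl == 8, disp == 360, so 360/8)
```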

Add new data.frame column based on values in other columns

I'm trying to iterate over a data table to calculate the integral between two columns, dt$xmin and dt$xmax, with a function f, writing the answer to a new column dt$integral. I'm currently trying to use something like the below code without success:
dt$integral <- mapply(f, dt$xmin, dt$xmax)
Any help would be greatly appreciated!
Perhaps you do not need mapply; if f is vectorized, a simple assignment should work: dt$integral <- f(dt$xmin, dt$xmax). There is no data or example of what you want, but here's what I think could work for you (using data.table):
library(data.table)
dt <- as.data.table(mtcars)
newfunc <- function(a, b){
  return(a + log(b) - exp(a/b) + 3.1*a^1.918)
}
# Adding a new column called "newcol"
> head(dt[, newcol := newfunc(wt, mpg/qsec),])
mpg cyl disp hp drat wt qsec vs am gear carb newcol
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 14.73145
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 16.30387
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11.45233
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 13.87593
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 13.78816
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 -10.84599
For a single new column, the above style of assignment works. For multiple new columns and functions, you would need a function that returns a list for the new columns. Look up more on assignment using := in data.table.
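That said, if f is an integrand to be passed to integrate() rather than a vectorized formula, mapply() was the right instinct: integrate() handles one (xmin, xmax) pair at a time, so it must be wrapped per element. A minimal sketch, with a hypothetical f and data since none were given:

```r
library(data.table)
f <- function(x) x^2                     # hypothetical integrand
dt <- data.table(xmin = c(0, 1), xmax = c(1, 2))
# integrate() returns a list; $value extracts the numeric result per pair
dt[, integral := mapply(function(lo, hi) integrate(f, lo, hi)$value, xmin, xmax)]
dt$integral
# [1] 0.3333333 2.3333333
```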

problems with NA in data.table

I have problems with missing values (NA) in data.table. When using mean(x) with by=z, I get NA if some observations in a group with the same value of z have x = NA. How can I treat that?
As you have not provided any example data, it's hard to guess what you are trying to do. However, here is a sample example of excluding the NA values from a calculation. Consider a data.table dt:
dt = data.table(mtcars)[1:6][2, mpg := NA][]
mpg cyl disp hp drat wt qsec vs am gear carb
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2: NA 6 160 110 3.90 2.875 17.02 0 1 4 4
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
where you have an NA value in the second row of the first column. If you want to calculate the mean of that column, you can use na.rm:
mean(dt$mpg, na.rm = TRUE)
#[1] 20.4
Or, when doing by-group calculations:
dt[, mean(mpg, na.rm = TRUE), by=cyl]
# cyl V1
# 1: 6 20.16667
# 2: 4 22.80000
# 3: 8 18.70000
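If you would rather fill in the NAs than drop them, one option (my suggestion, beyond the answer above) is to replace each NA with the mean of its group, using data.table's vectorized fifelse():

```r
library(data.table)
dt <- data.table(mtcars)[1:6][2, mpg := NA][]
# replace each NA with the mean of the non-NA values in its cyl group
dt[, mpg := fifelse(is.na(mpg), mean(mpg, na.rm = TRUE), mpg), by = cyl]
dt$mpg[2]
# [1] 20.16667   (the cyl == 6 group mean)
```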
