I have an integer64-indexed data.table object:
library(data.table)
library(bit64)
some_data = as.integer64(c(  # built from character strings so the 19-digit values are not rounded as double literals
  "1514772184120000026", "1514772184120000068", "1514772184120000042", "1514772184120000078",
  "1514772184120000011", "1514772184120000043", "1514772184120000094", "1514772184120000085",
  "1514772184120000083", "1514772184120000017", "1514772184120000013", "1514772184120000060",
  "1514772184120000032", "1514772184120000059", "1514772184120000029"))
#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
Then I have a bunch of other integer64 values whose positions I need to binary-search in the key of my original data.table object (x):
search_values <- some_data[(n+1):length(some_data)]
If these were native integers I could use findInterval() to solve the problem:
values_index <- findInterval(search_values, x$new_col)
but when the arguments to findInterval are integer64, I get:
Warning messages:
1: In as.double.integer64(vec) :
integer precision lost while converting to double
2: In as.double.integer64(x) :
integer precision lost while converting to double
and wrong indexes:
> values_index
[1] 10 10 10 10 10
i.e. it is not true that the entries of search_values are all larger than all the entries of x$new_col.
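Presumably this happens because findInterval() coerces its arguments to double, and near 1.5e18 adjacent representable doubles are roughly 256 apart, so timestamps this close together cannot be distinguished after the conversion. A quick sketch of the scale of the rounding:
2^(floor(log2(1.5e18)) - 52)  # spacing of representable doubles at this magnitude
# [1] 256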
Edit:
Desired output:
print(values_index)
9 10 6 10 1
Why?:
values_index has as many entries as search_values. For each entry of search_values, the corresponding entry of values_index gives the rank that entry would have if it were inserted into x$new_col. So the first entry of values_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.
Maybe you want something like this:
findInterval2 <- function(y, x) {
  toadd <- y[!(y %in% x$new_col)]  # search_values that are not already in the data
  x2 <- copy(x)
  x2[, i := .I]  # mark the rows of the original data set
  x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
                  use.names = TRUE, fill = TRUE)  # add the missing search_values
  setkey(x2, new_col)  # order by new_col
  x2[, index := cumsum(!is.na(i))]  # running count of original rows
  x2[match(y, new_col), index]
}
# x2 is:
# a new_col i index
# 1: 0.56602278 1514772184120000011 1 1
# 2: NA 1514772184120000013 NA 1
# 3: 0.29408237 1514772184120000017 2 2
# 4: 0.28532378 1514772184120000026 3 3
# 5: NA 1514772184120000029 NA 3
# 6: NA 1514772184120000032 NA 3
# 7: 0.66844754 1514772184120000042 4 4
# 8: 0.83008829 1514772184120000043 5 5
# 9: NA 1514772184120000059 NA 5
# 10: NA 1514772184120000060 NA 5
# 11: 0.76992760 1514772184120000068 6 6
# 12: 0.57049677 1514772184120000078 7 7
# 13: 0.14406169 1514772184120000083 8 8
# 14: 0.02044602 1514772184120000085 9 9
# 15: 0.68016024 1514772184120000094 10 10
findInterval2(search_values, x)
# [1] 1 5 3 5 3
If not, then maybe you could change the code as needed.
Update
Look at this integer example to see that the function gives the same result as base findInterval:
now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]
r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)
If I get what you want, then a quick workaround could be:
toadd <- search_values[!(search_values %in% x$new_col)]  # search_values that are not in the data
x[, i := .I]  # mark the rows of the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
               use.names = TRUE, fill = TRUE)  # add the missing search_values
setkey(x, new_col)  # order by new_col
x[, index := new_col %in% search_values]  # mark where the search values are
x[, index := cumsum(index)]  # turn the marks into running indexes
x <- x[!is.na(i)]  # remove the added rows
x$index  # should contain your desired output
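For completeness: since bit64 defines comparison operators that keep full 64-bit precision, the interval index can also be computed directly by counting, without building an augmented table. A minimal sketch (O(n*m), so slower than a keyed join for very large inputs):
values_index <- vapply(seq_along(search_values),
                       function(i) sum(x$new_col <= search_values[i]),
                       integer(1L))
values_index
# [1] 1 5 3 5 3   (the same result as findInterval2 on the data above)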
Related
When making a sub-assignment, the RHS length must either be 1 (single values are fine) or match the LHS length exactly, as the error message says when this rule is not followed.
However, the following works:
tab.01 <- data.table( a = 1L:5L, b = 11L:15L )
tab.02 <- data.table( a = c(1L, 1L, 2L), x = c(11L, 12L, 22L) )
tab.01[ tab.02, x := i.x, on = "a"]
# a b x
# 1: 1 11 12
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
The column x is not functionally dependent on the column a. Yet, an assignment is made and, if my guess is right, the last element of the subgroup is assigned.
Can this default behaviour be changed, e.g. to choose the first element? The following trials do not work:
mult = "first" has no effect.
tab.01[ tab.02, x := first(i.x), on = "a" ] assigns the value 11L to all matches.
tab.01[ tab.02, x := first(i.x), on = "a", by = "a"]
results in an error, because i.x (or any other column of i) is no longer available there.
tab.01[ tab.02, x := first(i.x), on = "a", by = .EACHI ] does not raise an error, but does not fix anything either: the values in the group are reassigned in the order of the rows of i, hence the last value is still kept.
One can use a version of tab.02 with functionally dependent columns:
tab.02[ , y := f_fd(x), by = "a" ] # e.g. f_fd <- data.table::first
tab.01[ tab.02, x := y, on = "a"]
Is this the most concise way to perform this task?
I believe there's no built-in method specifically for accomplishing this. However, it is possible to do this update without modifying tab.02.
You could create a subset
tab.01[tab.02[rowid(a) == 1], x := i.x, on = "a"][]
# a b x
# 1: 1 11 11
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
or order before joining
tab.01[tab.02[order(-x)], x := i.x, on = "a"][]
# a b x
# 1: 1 11 11
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
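Another option, offered as a sketch: unique() on a data.table keeps the first row per group by default, so de-duplicating tab.02 on "a" before the join also assigns the first i.x:
tab.01[unique(tab.02, by = "a"), x := i.x, on = "a"][]
#    a  b  x
# 1: 1 11 11
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA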
I have a DF with 800k+ rows containing repeated (random) values. For each row I need to take its value and find the indexes of the other row(s) with the same value, e.g. "asd": where else do I see it? The index of the current row itself is NOT needed.
My current solution: subset the DF and create a temp frame/table by removing the current row. Problem: it takes a minute per 1000 iterations, so 800k+ rows will take about 13 hours to run. Any ideas? Thanks!
Running on the original DF (not subsetted) takes < 1 second, but as you can imagine it returns the index of the current row.
Edit: My real-life DF has more than 1 column; the example below is simplified. I need to take V1[1] and get the row numbers of the other rows whose V1 equals V1[1], then repeat for V1[2] and so on for each row.
library(data.table)
library(fastmatch)
library(stringi)
set.seed(12345)
V1 = stringi::stri_rand_strings(800000, 3)
df0 = as.data.table(V1)
mapped = matrix("", nrow = 800000)
print(Sys.time())
for (i in 1:1000) {
  tmp_df = df0[-i, ]  # this subsetting is what takes a very long time!
  mapped[i] = fmatch(df0$V1[i], tmp_df$V1)
}
print(Sys.time())
View(mapped)
Data:
library("data.table")
set.seed(12345)
V1 = stringi::stri_rand_strings(80, 3)
df0 <- data.table( sample(V1, 100, replace = TRUE ))
Code:
df0[, id := list(list(.I)), by = V1]  # list column of all row numbers sharing this V1 value
Output:
head(df0, 10)
# V1 id
# 1: iuR 1,2,21
# 2: iuR 1,2,21
# 3: KXc 3
# 4: LwA 4
# 5: pYn 5
# 6: qoN 6,66
# 7: 5Xt 7
# 8: wBH 8,77
# 9: V9r 9,39,54
# 10: 9ks 10,28,42,48
EDIT - Removed Current Index:
df0[, id2 := 1:.N ]
df0[, id := list(list(unlist(id)[ unlist(id) != .I ] )), by = id2 ]
df0[, id2 := NULL ]
df0[ lengths(id) > 0, ]
head( df0, 10 )
# V1 id
# 1: iuR 2,21
# 2: iuR 1,21
# 3: KXc
# 4: LwA
# 5: pYn
# 6: qoN 66
# 7: 5Xt
# 8: wBH 77
# 9: V9r 39,54
# 10: 9ks 28,42,48
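A base-R sketch of the same idea (no per-row subsetting of the table): group the row numbers by value once, then drop each row from its own group. rows_by_val and other_rows are names introduced here for illustration.
rows_by_val <- split(seq_len(nrow(df0)), df0$V1)                     # row numbers per value
other_rows  <- Map(setdiff, rows_by_val[df0$V1], seq_len(nrow(df0))) # drop each row from its own group
# other_rows[[1]] is c(2, 21) and other_rows[[2]] is c(1, 21), matching the output above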
I have got a single column in a data.table:
library(data.table)
DT <- data.table(con=c(1:5))
My desired result is a data.table with a new column x calculated as follows: the first value of x should be the first value of con (here: 1); the next (second) value should be the second value of con times the first value of x; the third value of x is the third value of con times the second value of x, and so on. Result:
DT <- data.table(con=c(1:5), x = c(1,2,6,24,120))
I tried using shift but it did not help; below are some lines of my code:
DT <- data.table(con=c(1:5))
DT[, x := shift(con,1, type = "lead")]
DT[, x := shift(x, 1)]
DT[, x := con * x]
You are looking for cumprod:
DT[, x := cumprod(con)]
DT
con x
1: 1 1
2: 2 2
3: 3 6
4: 4 24
5: 5 120
We can use the accumulate function from the purrr package.
library(data.table)
library(purrr)
DT <- data.table(con=c(1:5))
DT[, x := accumulate(con, `*`)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
Or the Reduce function from base R.
DT <- data.table(con=c(1:5))
DT[, x := Reduce(`*`, con, accumulate = TRUE)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
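As a follow-up note on the design choice: cumprod covers the plain running product, while Reduce (or purrr::accumulate) also handles more general recurrences. A sketch with a hypothetical recurrence x[i] = con[i] * x[i-1] + 1, where x2 is a column name introduced here for illustration:
DT[, x2 := Reduce(function(prev, cur) cur * prev + 1, con, accumulate = TRUE)][]
#    con   x  x2
# 1:   1   1   1
# 2:   2   2   3
# 3:   3   6  10
# 4:   4  24  41
# 5:   5 120 206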
I have been trying to apply a rolling mean to several columns in a dataframe, where each column contains data from multiple individuals. I have succeeded using roll_mean from the RcppRoll package and lapply. I've included below an example using a dummy dataframe and the output.
library(data.table)
library(RcppRoll)
x <- rnorm(20, 1)
y <- rnorm(20, 2)
z <- rnorm(20, 3)
ID <- rep(1:2, each = 10)
mydf <- data.frame(ID, x, y, z)
vars <- c("x", "y", "z")
setDT(mydf)[, paste0(vars, "_", "mean") := lapply(.SD, function(x) roll_mean(x, n = 3, na.rm = TRUE)), .SDcols = vars, by = ID]
mydf
ID x y z x_mean y_mean z_mean
1: 1 0.34457704 1.9580361 2.6458335 1.2515642 1.8307447 2.569645
2: 1 1.41839352 2.0697324 1.8495358 1.7012511 1.7248261 2.988908
3: 1 1.99172192 1.4644657 3.2135652 1.8455087 1.7165419 3.184736
4: 1 1.69363783 1.6402801 3.9036227 1.5002658 2.1512764 3.289555
5: 1 1.85116646 2.0448798 2.4370206 0.9775842 3.1215589 2.563110
6: 1 0.95599300 2.7686692 3.5280206 0.8477701 3.4576141 3.106095
7: 1 0.12559300 4.5511275 1.7242892 0.9450234 3.5134499 3.020176
8: 1 1.46172438 3.0530454 4.0659766 0.9080677 3.0100022 3.371839
9: 1 1.24775283 2.9361768 3.2702614 1.2515642 1.8307447 2.569645
10: 1 0.01472603 3.0407845 2.7792776 1.7012511 1.7248261 2.988908
11: 2 -0.91146047 2.5898074 2.0328348 0.4314443 1.2688530 2.477879
12: 2 0.48183559 1.8230335 2.6910075 1.2689767 0.9650435 2.544006
13: 2 1.72395769 -0.6062819 2.7097949 0.8747931 1.2273766 1.974265
14: 2 1.60113680 1.6783790 2.2312143 0.2579207 1.6945497 2.233321
15: 2 -0.70071522 2.6100328 0.9817857 0.1162224 2.0928536 2.606608
16: 2 -0.12665946 0.7952374 3.4869635 1.3884888 2.1063817 2.986786
17: 2 1.17604187 2.8732906 3.3510742 2.0557599 2.2701173 3.178248
18: 2 3.11608400 2.6506171 2.1223190 1.5553274 2.3987061 3.015501
19: 2 1.87515393 1.2864441 4.0613513 0.4314443 1.2688530 2.477879
20: 2 -0.32525560 3.2590570 2.8628313 1.2689767 0.9650435 2.544006
As you can see from the output table (mydf), the mean columns have been created by the lapply statement and the rolling means have been calculated for each individual ID. However, the results have been recycled to fill the data.table, because roll_mean generates only 8 values from the 10 raw values for each ID; recycling then fills the last 2 rows for each ID.
My actual data is time-series data and I don't want the results recycled. I want to avoid recycling by putting the raw x values at the start of the x_mean column, up to the point where there are sufficient raw data to produce the 3-point rolling mean.
I've tried searching (on SO and google) for posts about avoiding recycling in roll_mean or similar functions with no success.
Does anyone know how to pad the first 2 rows in my example to avoid recycling in the roll_mean function?
Thanks.
The whole solution:
x <- rnorm(20,1);
y <- rnorm(20,2);
z <- rnorm(20,3);
ID <- rep(1:2, each=10);
mydf <- data.table(ID, x, y, z)  # changed to a data.table here
vars <- c("x", "y", "z");
# fill = NA and align = 'right'
mydf[, paste0(vars, "_", "mean") := lapply(.SD, function(x) RcppRoll::roll_mean(x, n = 3, na.rm = TRUE, fill = NA, align = 'right')), .SDcols = vars, by = ID]
mydf
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 NA NA NA
# 2: 1 1.1836433 2.7821363 2.746638 NA NA NA
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
mydf[is.na(x_mean), c(paste0(vars, "_", "mean")) := mget(paste0(vars))]
mydf
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 0.3735462 2.918977 2.835476
# 2: 1 1.1836433 2.7821363 2.746638 1.1836433 2.782136 2.746638
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
Edit:
Missing parts of mydf can also be filled in a slightly "smarter" way, i.e. by using rolling means with a window smaller by 1 in each iteration:
n_roll <- 3  # the rolling window size used above
for (n_inner in n_roll:1) {
  mydf[!complete.cases(mydf),
       paste0(vars, "_", "mean") := lapply(
         .SD, function(x) RcppRoll::roll_mean(x, n = n_inner, na.rm = TRUE, fill = NA, align = 'right')),
       .SDcols = vars, by = ID]
}
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 0.3735462 2.918977 2.835476 <- Values from x, y and z
# 2: 1 1.1836433 2.7821363 2.746638 0.7785948 2.850557 2.791057 <- roll_mean with window 2
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026 <- roll_mean with window 3
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422 <- as above
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
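For what it's worth, reasonably recent versions of data.table (>= 1.12.0) ship frollmean, whose adaptive windows achieve the same "grow the window from 1 up to 3" behaviour in a single call. A sketch; the *_mean2 columns are names introduced here for illustration:
mydf[, paste0(vars, "_mean2") := lapply(.SD, frollmean,
        n = pmin(seq_len(.N), 3L), adaptive = TRUE),
     .SDcols = vars, by = ID]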
I am trying to get many lm models to work in a function, and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have two values: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y, which has only one unique value once NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
There is a simple solution using the Filter function from base R:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# Select only columns whose standard deviation is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[, which(df[, unlist(sapply(.SD, function(x) length(unique(x[!is.na(x)])) > 1))]),
   with = FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique values other than NA, and we keep only the columns that have more than one such value.
If you really mean DROPping those columns (by reference), here is a solution:
library(data.table)
dt <- data.table(x=c(1,2,3,NA,5),
y=c(1,1,NA,NA,NA),
z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
for (col in names(copy(dt))) {  # loop over a snapshot of the column names
  v = var(dt[[col]], na.rm = TRUE)
  if (v == 0 | is.na(v)) dt[, (col) := NULL]  # drop constant or all-NA columns by reference
}
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
Final code:
same <- sapply(df, function(.col){ all( is.na(.col) | .col[1L] == .col ) })
df1 <- df[, !same, with = FALSE]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
For removing constant columns from a pandas DataFrame (Python), the same idea looks like this.
Numeric columns:
constant_col = [const for const in df.columns if df[const].std() == 0]
print(len(constant_col))
print(constant_col)
Categorical columns:
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print(len(constant_col))
print(constant_col)
Then you drop the columns using the drop method, e.g. df.drop(columns=constant_col).
The janitor package also handles this directly; its remove_constant() has an na.rm argument:
library(janitor)
df %>%
  remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5