How to avoid recycling when using roll_mean in R?

I have been trying to apply a rolling mean to several columns in a dataframe, where each column contains data from multiple individuals. I have succeeded using roll_mean from the RcppRoll package and lapply. I've included below an example using a dummy dataframe and the output.
library(data.table)
library(RcppRoll)
x <- rnorm(20, 1)
y <- rnorm(20, 2)
z <- rnorm(20, 3)
ID <- rep(1:2, each = 10)
mydf <- data.frame(ID, x, y, z)
vars <- c("x", "y", "z")
setDT(mydf)[, paste0(vars, "_", "mean") := lapply(.SD, function(x) roll_mean(x, n = 3, na.rm = TRUE)), .SDcols = vars, by = ID]
mydf
ID x y z x_mean y_mean z_mean
1: 1 0.34457704 1.9580361 2.6458335 1.2515642 1.8307447 2.569645
2: 1 1.41839352 2.0697324 1.8495358 1.7012511 1.7248261 2.988908
3: 1 1.99172192 1.4644657 3.2135652 1.8455087 1.7165419 3.184736
4: 1 1.69363783 1.6402801 3.9036227 1.5002658 2.1512764 3.289555
5: 1 1.85116646 2.0448798 2.4370206 0.9775842 3.1215589 2.563110
6: 1 0.95599300 2.7686692 3.5280206 0.8477701 3.4576141 3.106095
7: 1 0.12559300 4.5511275 1.7242892 0.9450234 3.5134499 3.020176
8: 1 1.46172438 3.0530454 4.0659766 0.9080677 3.0100022 3.371839
9: 1 1.24775283 2.9361768 3.2702614 1.2515642 1.8307447 2.569645
10: 1 0.01472603 3.0407845 2.7792776 1.7012511 1.7248261 2.988908
11: 2 -0.91146047 2.5898074 2.0328348 0.4314443 1.2688530 2.477879
12: 2 0.48183559 1.8230335 2.6910075 1.2689767 0.9650435 2.544006
13: 2 1.72395769 -0.6062819 2.7097949 0.8747931 1.2273766 1.974265
14: 2 1.60113680 1.6783790 2.2312143 0.2579207 1.6945497 2.233321
15: 2 -0.70071522 2.6100328 0.9817857 0.1162224 2.0928536 2.606608
16: 2 -0.12665946 0.7952374 3.4869635 1.3884888 2.1063817 2.986786
17: 2 1.17604187 2.8732906 3.3510742 2.0557599 2.2701173 3.178248
18: 2 3.11608400 2.6506171 2.1223190 1.5553274 2.3987061 3.015501
19: 2 1.87515393 1.2864441 4.0613513 0.4314443 1.2688530 2.477879
20: 2 -0.32525560 3.2590570 2.8628313 1.2689767 0.9650435 2.544006
As you can see from the output table (mydf), the mean columns have been created by the lapply call and the rolling means have been calculated within each individual ID. However, the results have been recycled to fill the data frame: roll_mean generates only 8 values from the 10 raw values for each ID, so the 8 results are recycled to fill the last 2 rows of each group.
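To see why the recycling happens, note that a 3-point rolling mean of 10 values has only 8 elements; a quick check (assuming RcppRoll is loaded as above):
length(roll_mean(rnorm(10), n = 3))
# [1] 8, so data.table recycles these 8 means to fill the 10 rows of each ID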
My actual data is time-series data and I don't want the results recycled. I want to avoid recycling by padding the start of the x_mean column with the raw x values up to the point where there are sufficient raw data to produce the 3-point rolling mean.
I've tried searching (on SO and google) for posts about avoiding recycling in roll_mean or similar functions with no success.
Does anyone know how to pad the first 2 rows in my example to avoid recycling in the roll_mean function?
Thanks.

The whole solution:
library(data.table)
x <- rnorm(20, 1)
y <- rnorm(20, 2)
z <- rnorm(20, 3)
ID <- rep(1:2, each = 10)
mydf <- data.table(ID, x, y, z) # changed to a data.table here
vars <- c("x", "y", "z")
# fill = NA and align = 'right' put NA in the first n - 1 rows of each group instead of recycling
mydf[, paste0(vars, "_", "mean") := lapply(.SD, function(x) RcppRoll::roll_mean(x, n = 3, na.rm = TRUE, fill = NA, align = 'right')), .SDcols = vars, by = ID]
mydf
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 NA NA NA
# 2: 1 1.1836433 2.7821363 2.746638 NA NA NA
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
mydf[is.na(x_mean), c(paste0(vars, "_", "mean")) := mget(paste0(vars))]
mydf
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 0.3735462 2.918977 2.835476
# 2: 1 1.1836433 2.7821363 2.746638 1.1836433 2.782136 2.746638
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
Edit:
The missing parts of mydf can also be filled in a slightly "smarter" way, i.e. by applying rolling means with a window that shrinks by 1 in every iteration:
n_roll <- 3 # the window size used above
for (n_inner in n_roll:1) {
  mydf[!complete.cases(mydf),
       paste0(vars, "_", "mean") := lapply(
         .SD, function(x) RcppRoll::roll_mean(x, n = n_inner, na.rm = TRUE, fill = NA, align = 'right')),
       .SDcols = vars, by = ID]
}
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 0.3735462 2.918977 2.835476 <- Values from x, y and z
# 2: 1 1.1836433 2.7821363 2.746638 0.7785948 2.850557 2.791057 <- roll_mean with window 2
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026 <- roll_mean with window 3
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422 <- as above
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
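For completeness, the same partial-window behaviour can be obtained in a single pass with zoo::rollapplyr and partial = TRUE; this is only a sketch of an alternative, assuming the zoo package is an acceptable extra dependency:
library(zoo)
# right-aligned rolling mean; partial = TRUE shrinks the window at the start of each group
# (window 1, then 2, then 3, ...) instead of recycling or leaving NA
mydf[, paste0(vars, "_", "mean") := lapply(.SD, function(x) rollapplyr(x, width = 3, FUN = mean, na.rm = TRUE, partial = TRUE)), .SDcols = vars, by = ID]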

Related

How to avoid redundant calculation within data.table?

I need to find the two unique minima in the data column of a data.table, by two ids, id1 and id2:
library(data.table)
n <- 12
set.seed(1234)
id1 <- rep(1:2, each = 6)
id2 <- rep(1:6, each = 2)
data <- 100 + 100 * rnorm(n)
dt <- data.table(id1 = id1, id2 = id2, data = data)
Below is the function that, given the second id id2, calculates the two unique minima at the same time and exports them as a named vector:
detect_two_lower <- function(ids, values){
  dt <- data.table(ids, values)
  dt <- dt[, .(V1 = min(values, na.rm = TRUE)), by = ids][order(V1)]
  min_1 <- dt$V1[1]
  min_2 <- dt$V1[2]
  nn <- c(min_1 = min_1, min_2 = min_2)
}
library(memoise)
detect_two_lower <- memoise(detect_two_lower)
Then apply the function on the data.table, grouping by = id1:
dt[, `:=`(min_1 = detect_two_lower(id2, data)[1],
          min_2 = detect_two_lower(id2, data)[2]),
   by = id1]
The calculation runs as expected (see below). Note, however, that the code calls detect_two_lower twice with the same arguments. As a workaround I tried to minimize the recomputation with memoise, but I would like to avoid this patch.
Is there a better way to accomplish the same result?
dt
id1 id2 data min_1 min_2
1: 1 1 -20.7065749 -134.5697703 -20.70657
2: 1 1 127.7429242 -134.5697703 -20.70657
3: 1 2 208.4441177 -134.5697703 -20.70657
4: 1 2 -134.5697703 -134.5697703 -20.70657
5: 1 3 142.9124689 -134.5697703 -20.70657
6: 1 3 150.6055892 -134.5697703 -20.70657
7: 2 4 42.5260040 0.1613555 10.99622
8: 2 4 45.3368144 0.1613555 10.99622
9: 2 5 43.5548001 0.1613555 10.99622
10: 2 5 10.9962171 0.1613555 10.99622
11: 2 6 52.2807300 0.1613555 10.99622
12: 2 6 0.1613555 0.1613555 10.99622
Return a list from the function
library(data.table)
detect_two_lower <- function(ids, values){
  dt <- data.table(ids, values)
  dt <- dt[, .(V1 = min(values, na.rm = TRUE)), by = ids][order(V1)]
  as.list(dt$V1)
}
So you can assign them directly:
dt[, c('min_1', 'min_2') := detect_two_lower(id2, data), id1]
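Another way to avoid the double call, sketched here by re-using the dt built above: compute one minimum per (id1, id2) pair, keep the two smallest per id1, and write them back with an update join.
mins <- dt[, .(m = min(data, na.rm = TRUE)), by = .(id1, id2)          # one minimum per (id1, id2)
           ][order(id1, m), .(min_1 = m[1], min_2 = m[2]), by = id1]   # two smallest per id1
dt[mins, on = "id1", c("min_1", "min_2") := .(i.min_1, i.min_2)]       # update join back onto dt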

How to optimize this for loop for bigger data in r?

I have some reproducible data below (my original dataset contains about 2,000,000 rows). At that size my for loop becomes inefficient and takes a long time to run. I was wondering if there is a more efficient way to process this data. I have attached my code with the reproducible data.
#----Reproducible data example--------------------#
#Upload first data set#
words1<-c("How","did","Quebec","nationalists","see","their","province","as","a","nation","in","the","1960s")
words2<-c("Why","does","volicty","effect","time",'?',NA,NA,NA,NA,NA,NA,NA)
words3<-c("How","do","I","wash","a","car",NA,NA,NA,NA,NA,NA,NA)
library<-c("The","the","How","see","as","a","for","then","than","example")
embedding1<-c(.5,.6,.7,.8,.9,.3,.46,.48,.53,.42)
embedding2<-c(.1,.5,.4,.8,.9,.3,.98,.73,.48,.56)
df <- data.frame(words1,words2,words3)
names(df)<-c("words1","words2","words3")
#--------Upload 2nd dataset-------#
df2 <- data.frame(library,embedding1, embedding2)
names(df2)<-c("library","embedding1","embedding2")
df2$meanembedding=rowMeans(df2[c("embedding1","embedding2")],na.rm=T)
df2<-df2[,-c(2,3)]
#-----Find columns--------#
l=ncol(df)
names<-names(df)
head(names)
classes<-sapply(df[,c(1:l)],class)
head(classes)
#------Combine and match library to training data------#
require(gridExtra)
List = list()
for (name in names) {
  df1 <- df[, name]
  df1 <- as.data.frame(df1)
  x_train2 <- merge(x = df1, y = df2,
                    by.x = "df1", by.y = 'library', all.x = T, sort = F)
  x_train2 <- x_train2[, -1]
  x_train2 <- as.data.frame(x_train2)
  names(x_train2) <- name
  List[[length(List) + 1]] = x_train2
}
A better approach would be to use lapply:
myList2 <- lapply(names(df), function(x){
  y <- merge(x = df[, x, drop = FALSE],
             y = df2,
             by.x = x,
             by.y = 'library',
             all.x = T,
             sort = F)[, -1, drop = FALSE]
  names(y) <- x
  return(y)
})
We loop over the vector names(df), subset and merge on the fly, use drop = FALSE to prevent the simplification from a one-column data.frame to a vector, and overwrite the column name. The output is a list.
Post script: you technically do not need the drop = FALSE if you use df[x] instead of df[, x], as @RuiBarradas pointed out. But I think it is helpful to know about the drop = FALSE option in cases where you need to subset both rows and columns.
When joining on large data volumes, give data.table a try:
library(data.table)
dt <- as.data.table(df)
dt2 <- as.data.table(df2)
lapply(names(dt), function(x) {
  on_expr <- parse(text = paste0("c( library = \"", x, "\")"))
  dt2[dt, on = eval(on_expr)][, 2]
})
# [[1]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.80
# 6: NA
# 7: NA
# 8: 0.90
# 9: 0.30
# 10: NA
# 11: NA
# 12: 0.55
# 13: NA
#
# [[2]]
# meanembedding
# 1: NA
# 2: NA
# 3: NA
# 4: NA
# 5: NA
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA
#
# [[3]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.30
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA
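A further alternative, only a sketch and assuming each word occurs at most once in df2$library: build a named lookup vector once and index it, which avoids merge()/joins altogether and usually scales well to millions of rows.
lookup <- setNames(df2$meanembedding, df2$library)  # named vector: word -> mean embedding
myList3 <- lapply(names(df), function(x) {
  out <- data.frame(unname(lookup[as.character(df[[x]])]))  # NA where a word is not in the library
  names(out) <- x
  out
})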

Binary search for integer64 in data.table

I have an integer64-indexed data.table object:
library(data.table)
library(bit64)
some_data = as.integer64(c(1514772184120000026, 1514772184120000068, 1514772184120000042, 1514772184120000078,1514772184120000011, 1514772184120000043, 1514772184120000094, 1514772184120000085,
1514772184120000083, 1514772184120000017, 1514772184120000013, 1514772184120000060, 1514772184120000032, 1514772184120000059, 1514772184120000029))
#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
Then I have a bunch of other integer64 values that I need to binary-search for within the index of my original data.table object (x):
search_values <- some_data[(n+1):length(some_data)]
If these were native integers I could use findInterval() to solve the problem:
values_index <- findInterval(search_values, x$new_col)
but when the arguments to findInterval are integer64, I get:
Warning messages:
1: In as.double.integer64(vec) :
integer precision lost while converting to double
2: In as.double.integer64(x) :
integer precision lost while converting to double
and wrong indexes:
> values_index
[1] 10 10 10 10 10
e.g. it is not true that the entries of search_values are all larger than all entries of x$new_col.
Edit:
Desired output:
print(values_index)
9 10 6 10 1
Why?:
values_index has as many entries as search_values. For each entry of search_values, the corresponding entry in values_index gives the rank that entry would have if it were inserted into x$new_col. So the first entry of values_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.
Maybe you want something like this:
findInterval2 <- function(y, x) {
  toadd <- y[!(y %in% x$new_col)] # search_values that is not in data
  x2 <- copy(x)
  x2[, i := .I] # mark the original data set
  x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
                  use.names = T, fill = T) # add missing search_values
  setkey(x2, new_col) # order
  x2[, index := cumsum(!is.na(i))]
  x2[match(y, new_col), index]
}
# x2 is:
# a new_col i index
# 1: 0.56602278 1514772184120000011 1 1
# 2: NA 1514772184120000013 NA 1
# 3: 0.29408237 1514772184120000017 2 2
# 4: 0.28532378 1514772184120000026 3 3
# 5: NA 1514772184120000029 NA 3
# 6: NA 1514772184120000032 NA 3
# 7: 0.66844754 1514772184120000042 4 4
# 8: 0.83008829 1514772184120000043 5 5
# 9: NA 1514772184120000059 NA 5
# 10: NA 1514772184120000060 NA 5
# 11: 0.76992760 1514772184120000068 6 6
# 12: 0.57049677 1514772184120000078 7 7
# 13: 0.14406169 1514772184120000083 8 8
# 14: 0.02044602 1514772184120000085 9 9
# 15: 0.68016024 1514772184120000094 10 10
findInterval2(search_values, x)
# [1] 1 5 3 5 3
If not, then maybe you could change the code as needed.
Update:
Look at this integer example to see that this function gives the same result as base findInterval:
now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]
r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)
If I get what you want, then a quick workaround could be:
toadd <- search_values[!(search_values %in% x$new_col)] # search_values that is not in data
x[, i := .I] # mark the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
               use.names = T, fill = T) # add missing search_values
setkey(x, new_col) # order
x[, index := new_col %in% search_values] # mark where the values are
x[, index := cumsum(index)] # get indexes
x <- x[!is.na(i)] # remove added rows
x$index # should contain your desired output
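A shorter route, sketched here assuming the original x from the question (sorted on new_col, before the in-place edits above): a rolling join with which = TRUE works directly on integer64 keys and returns, for each search value, the row number of the largest new_col less than or equal to it, which is findInterval's semantics (values below the minimum come back as NA instead of 0).
values_index2 <- x[.(search_values), on = .(new_col), roll = Inf, which = TRUE]
# roll = Inf matches each search value to the nearest preceding new_col; which = TRUE returns its row number in x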

average by group, removing current row

I want to compute group means of a variable but excluding the focal respondent:
set.seed(1)
dat <- data.table(id = 1:30, y = runif(30), grp = rep(1:3, each=10))
The first record (respondent) should have an average of... the second... and so on:
mean(dat[grp == 1, y][-1])
mean(dat[grp == 1, y][-2])
mean(dat[grp == 1, y][-3])
For the second group the same:
mean(dat[grp == 2, y][-1])
mean(dat[grp == 2, y][-2])
mean(dat[grp == 2, y][-3])
I tried this, but it didn't work:
dat[, avg := mean(dat[, y][-.I]), by = grp]
Any ideas?
You can try this solution:
set.seed(1)
dat <- data.table(id = 1:9, y = c(NA,runif(8)), grp = rep(1:3, each=3))
dat[, avg2 := sapply(seq_along(y),function(i) mean(y[-i],na.rm=T)), by=grp]
dat
# id y grp avg2
# 1: 1 NA 1 0.3188163
# 2: 2 0.2655087 1 0.3721239
# 3: 3 0.3721239 1 0.2655087
# 4: 4 0.5728534 2 0.5549449
# 5: 5 0.9082078 2 0.3872676
# 6: 6 0.2016819 2 0.7405306
# 7: 7 0.8983897 3 0.8027365
# 8: 8 0.9446753 3 0.7795937
# 9: 9 0.6607978 3 0.9215325
Seems like you're most of the way there and just need to account for NA's:
dat[, avg := (sum(y, na.rm = TRUE) - ifelse(is.na(y), 0, y)) / (sum(!is.na(y)) + is.na(y) - 1),
    by = grp]
No double loops or extra memory required.
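The idea behind that formula: the leave-one-out mean is just (group sum - own value) / (group size - 1). Without NAs it reduces to a one-liner, sketched here on the dat from the question:
dat[, avg := (sum(y) - y) / (.N - 1), by = grp]  # each row gets the mean of the other rows in its group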
If I'm understanding correctly, I think this does the job:
dat[,
  .(id, y2 = rep(y, .N), id2 = rep(id, .N), id3 = rep(id, each = .N)), by = grp
][
  !(id2 == id3),
  mean(y2),
  by = .(id3, grp)
]
First step is to duplicate the whole group data for each id, and to mark which row we want to exclude from the mean. Second step is to exclude the rows, and then group back by group/id. Obviously this isn't super memory efficient, but should work so long as you're not memory constrained.

Remove constant columns with or without NAs

I am trying to get many lm models work in a function and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have two values: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y, which has only one unique value once NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
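If the goal is to drop those columns by reference rather than build a new table, a small sketch along the same lines (assuming data.table is loaded):
const_cols <- names(df)[sapply(df, function(v) uniqueN(v, na.rm = TRUE) <= 1)]
df[, (const_cols) := NULL]  # removes y, z and d in place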
There is a simple solution using the base R function Filter:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# Select only columns for which SD is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[,which(df[,
unlist(
sapply(.SD,function(x) length(unique(x[!is.na(x)])) >1))]),
with=FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique values other than NA, and keep only the columns that have more than one such value.
If you really mean dropping those columns, here is a solution:
library(data.table)
dt <- data.table(x = c(1, 2, 3, NA, 5),
                 y = c(1, 1, NA, NA, NA),
                 z = c(NA, NA, NA, NA, NA),
                 d = c(2, 2, 2, 2, 2))
for (col in names(copy(dt))) {
  v = var(dt[[col]], na.rm = TRUE)
  if (v == 0 | is.na(v)) dt[, (col) := NULL]
}
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
Final code:
same <- sapply( df, function(.col){ all( is.na(.col) | .col[1L] == .col ) } )
df1 <- df[,!same, with=F]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
For removing constant columns with pandas (this answer is Python, not R):
Numeric columns:
constant_col = [const for const in df.columns if df[const].std() == 0]
print (len(constant_col))
print (constant_col)
Categorical columns:
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print (len(constant_col))
print (constant_col)
Then you drop the columns using the pandas drop method.
library(janitor)
df %>%
remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5
