I know that I should avoid for-loops, but I'm not exactly sure how to do what I want to do with an apply function.
Here is a slightly simplified model of what I'm trying to do. So, essentially I have a big matrix of predictors and I want to run a regression using a window of 5 predictors on each side of the indexed predictor (i in the case of a for loop). With a for loop, I can just say something like:
results<-NULL
window<-5
for(i in 1:ncol(g))
{
  first<-i-window #Set window boundaries
  if(first<1){
    first<-1
  }
  last<-i+window-1
  if(last>ncol(g)){
    last<-ncol(g)
  }
  predictors<-g[,first:last]
  #Do regression stuff and return some result
  results[i]<-regression_result  #placeholder for whatever your regression returns
}
Is there a good way to do this with an apply function? My problem is that the vector that apply would be shoving into the function really doesn't matter. All that matters is the index.
This question touches on several points made in 'The R Inferno': http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
There are some loops you should avoid, but not all of them. And using an apply function is more hiding the loop than avoiding it. This example seems like a good choice to leave in a 'for' loop.
Growing objects is generally bad form -- it can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
You can create a list with the final length by:
result <- vector("list", ncol(g))
for(i in 1:ncol(g)) {
  # stuff
  result[[i]] <- your_result   # placeholder: the result of iteration i
}
In some circumstances you might intend the command:
window<-5
to mean: give me a logical vector stating which values of 'window' are less than -5. R, however, parses it as an assignment.
Spaces are good to use, mostly so as not to confuse humans, but in a case like the one directly above so as not to confuse R about your meaning.
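A two-line illustration of the difference the spacing makes:
window <- 5   # assignment: window gets the value 5
window < -5   # comparison: is window less than minus 5?
## [1] FALSE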
Using an apply function to do your regression is mostly a matter of preference in this case; it can handle some of the bookkeeping for you (and so possibly prevent errors) but won't speed up the code.
I would suggest using vectorized functions to compute your first and last indices, though, perhaps something like:
window <- 5
ng <- 15 #or ncol(g)
xy <- data.frame(first = pmax( (1:ng) - window, 1 ),
last = pmin( (1:ng) + window, ng) )
Or be even smarter with
xy <- data.frame(first= c(rep(1, window), 1:(ng-window) ),
last = c((window+1):ng, rep(ng, window)) )
Then you could use this in a for loop like this:
results <- list()
for(i in 1:nrow(xy)) {
results[[i]] <- xy$first[i] : xy$last[i]
}
results
or with lapply like this:
results <- lapply(1:nrow(xy), function(i) {
xy$first[i] : xy$last[i]
})
where in both cases I just return the sequence between first and last; you would substitute in your actual regression code.
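For example, a minimal sketch of how the regression step might plug in. I am assuming here a response vector y of length nrow(g), and that you want, say, the R-squared of each windowed fit; both are guesses on my part, so substitute your actual model and statistic:
results <- lapply(1:nrow(xy), function(i) {
  predictors <- g[, xy$first[i]:xy$last[i], drop = FALSE]  # the window around predictor i
  fit <- lm(y ~ predictors)        # y is assumed to exist, one value per row of g
  summary(fit)$r.squared           # or whatever result you actually need
})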
Related
I have a df, YearHT, of 6.5M rows x 55 columns. There is specific information I want to extract and add, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.
I have heard that for loops should be avoided, and I wonder whether there is a way to avoid the for loop I have used, as this query takes ~3 hrs to run.
Here is my code:
srt = NULL
for(i in doubletCounts$Var1){
  s = subset(YearHT, YearHT$berthlet==i)
  e = unlist(c(strsplit(i,'\\|'), median(s$berthtime)))
  srt = rbind(srt,e)
}
srt=data.frame(srt)
s2=data.frame(srt$X2,srt$X1,srt$X3)
colnames(s2)=colnames(srt)
s=rbind(srt,s2)
doubletCounts is a 700 x 3 df, and each of its values is found within the large df.
I would be glad to hear any ideas to optimize/speed up this process.
Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.
# load library
library(data.table)
# convert your dataset into a data.table (by reference)
setDT(YearHT)
# subset YearHT, keeping rows whose berthlet value is present in doubletCounts$Var1
YearHT_df <- YearHT[berthlet %in% doubletCounts$Var1]
# aggregate values: median berthtime per berthlet group
output <- YearHT_df[, .(median = median(berthtime)), by = berthlet]
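If you also need the berthlet key split into two columns (as the strsplit in your loop suggests), a possible follow-up, assuming the values look like "A|B":
# split the grouping key on "|" into two new columns
output[, c("X1", "X2") := tstrsplit(berthlet, "|", fixed = TRUE)]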
for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.
srt = NULL
for (i in index)
{
[stuff]
srt = rbind(srt, [stuff])
}
is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do
srt <- vector("list", length = length(doubletCounts$Var1))
for (k in seq_along(doubletCounts$Var1)) {
  i <- doubletCounts$Var1[k]
  s <- subset(YearHT, YearHT$berthlet == i)
  srt[[k]] <- unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
}
srt <- data.frame(do.call(rbind, srt))
Or the apply alternative of
srt = lapply(doubletCounts$Var1,
             function(i) {
               s = subset(YearHT, YearHT$berthlet == i)
               unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
             })
Both of those should run at about the same speed.
(Note: both are untested, for lack of data, so they might be a little buggy)
Something else you can try that might have a smaller effect would be dropping the subset call and using indexing instead. The content of your for loop could be boiled down to
unlist(c(strsplit(i, '\\|'),
median(YearHT[YearHT$berthlet == i, "berthtime"])))
But I'm not sure how much time that would save.
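Putting those pieces together, the loop might collapse to something like this (untested for lack of data; the do.call(rbind, ...) step just rebuilds the row-wise structure your rbind calls were growing):
srt <- lapply(doubletCounts$Var1, function(i) {
  unlist(c(strsplit(i, '\\|'),
           median(YearHT$berthtime[YearHT$berthlet == i])))
})
srt <- data.frame(do.call(rbind, srt), stringsAsFactors = FALSE)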
Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself which combines exact and fuzzy (string-distance based) matching. The merging job I have to do is quite big (it results in multiple string distance matrices with a little under one billion cells), and I had the impression that fuzzy_join is not written very efficiently with regard to memory usage, and that its parallelization is implemented in a strange way: when there are several fuzzy variables, the computation of the individual string distance matrices is parallelized, rather than the computation of the string distances themselves.
As in fuzzy_join, the idea is to match on exact variables where possible (to keep the matrices smaller) and then to proceed to fuzzy matching within these exactly matched groups. I think the function is largely self-explanatory. I am posting it here because I would like some feedback to improve it, and because I suspect I am not the only one trying to do this kind of thing in R (although I admit that Python, SQL and the like would probably be more efficient for this task; but one has to stick to the tools one feels most comfortable with, and doing the data cleaning and preparation in the same language is nice for reproducibility).
merge.fuzzy = function(a,b,.exact,.fuzzy,.weights,.method,.ncores) {
require(stringdist)
require(matrixStats)
require(parallel)
if (length(.fuzzy)!=length(.weights)) {
stop("fuzzy and weights must have the same length")
}
if (!any(class(a)=="data.table")) {
stop(paste0("'a' must be of class data.table"))
}
if (!any(class(b)=="data.table")) {
stop(paste0("'b' must be of class data.table"))
}
#convert everything to lower
a[,c(.fuzzy):=lapply(.SD,tolower),.SDcols=.fuzzy]
b[,c(.fuzzy):=lapply(.SD,tolower),.SDcols=.fuzzy]
a[,c(.exact):=lapply(.SD,tolower),.SDcols=.exact]
b[,c(.exact):=lapply(.SD,tolower),.SDcols=.exact]
#create ids
a[,"id.a":=as.numeric(.I),by=c(.exact,.fuzzy)]
b[,"id.b":=as.numeric(.I),by=c(.exact,.fuzzy)]
c <- unique(rbind(a[,.exact,with=FALSE],b[,.exact,with=FALSE]))
c[,"exa.id":=.GRP,by=.exact]
a <- merge(a,c,by=.exact,all=FALSE)
b <- merge(b,c,by=.exact,all=FALSE)
##############
stringdi <- function(a,b,.weights,.by,.method,.ncores) {
sdm <- list()
if (is.null(.weights)) {.weights <- rep(1,length(.by))}
if (nrow(a) < nrow(b)) {
for (i in 1:length(.by)) {
sdm[[i]] <- stringdistmatrix(a[[.by[i]]],b[[.by[i]]],method=.method,ncores=.ncores,useNames=TRUE)
}
} else {
for (i in 1:length(.by)) { # nrow(b) <= nrow(a) here, so switch sides; this enhances parallelization speed
sdm[[i]] <- stringdistmatrix(b[[.by[i]]],a[[.by[i]]],method=.method,ncores=.ncores,useNames=FALSE)
}
}
rsdm = dim(sdm[[1]])
csdm = ncol(sdm[[1]])
sdm = matrix(unlist(sdm),ncol=length(.by))
sdm = rowSums(sdm*.weights,na.rm=T)/((0 + !is.na(sdm)) %*% .weights)
sdm = matrix(sdm,nrow=rsdm,ncol=csdm)
#use ids as row/ column names
rownames(sdm) <- a$id.a
colnames(sdm) <- b$id.b
mid <- max.col(-sdm,ties.method="first")
mid <- matrix(c(1:nrow(sdm),mid),ncol=2)
bestdis <- sdm[mid]
res <- data.table(as.numeric(rownames(sdm)),as.numeric(colnames(sdm)[mid[,2]]),bestdis)
setnames(res,c("id.a","id.b","dist"))
res
}
setkey(b,exa.id)
distances = a[,stringdi(.SD,b[J(.BY[[1]])],.weights=.weights,.by=.fuzzy,.method=.method,.ncores=.ncores),by=exa.id]
a = merge(a,distances,by=c("exa.id","id.a"))
res = merge(a,b,by=c("exa.id","id.b"))
res
}
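To show how I call it, here is a hypothetical example with invented toy data (the column names and values are made up purely for illustration, since I cannot share my real data):
library(data.table)
a <- data.table(country = c("USA", "France"),
                firm    = c("Acme Inc", "Saint Gobain"))
b <- data.table(country = c("USA", "USA", "France", "France"),
                firm    = c("Acme Incorporated", "Globex Corp", "St Gobain", "Total SA"))
res <- merge.fuzzy(a, b,
                   .exact   = "country",  # matched exactly first
                   .fuzzy   = "firm",     # then fuzzy matched within each country group
                   .weights = 1,
                   .method  = "jw",       # Jaro-Winkler distance from stringdist
                   .ncores  = 1)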
The following points would be interesting:
I am not quite sure how to code multiple exact matching variables in the data.table style I used above (which I believe is the fastest option).
Is it possible to have nested parallelization? That is, could a parallel foreach loop be used on top of the (already parallelized) computation of the string distance matrices?
I am also interested in ideas for making the whole thing more efficient, i.e. for it to consume less memory.
Maybe you can suggest a bigger "real world" data set so that I can create a working example. Unfortunately I cannot share even small samples of my data with you.
In the future it would also be nice to do something other than a classic left inner join, so ideas on this topic are also very much appreciated.
All your comments are welcome!
I need to do quality control on a dataset with more than 3000 variables (columns). However, I only want to apply some conditions to a couple of them. A first step would be to replace outliers with NA: I want to replace observations that are more than 3 standard deviations away from the mean with NA. I managed to do it column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
  a = list[i]
  b = which(d1==a)
  d2 = data[,b]
  for (j in 1:nrows){
    d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE), d2[j], NA)
  }
}
Someone told me that I am not storing d2. How can I write for loops that apply the conditions I want? I know there are similar questions, but I haven't managed to adapt them yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x   <- sd(x, na.rm = TRUE)
  x_outside_3s <- abs(x - mean_x) > 3 * sd_x
  x[x_outside_3s] <- NA # no need for ifelse here
  x
}
Of course, you can choose any function name you want; more descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or some subset of the columns
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
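For example, with your sample data frame (named data) and the three columns from the question:
cols_to_loop_over <- c("age", "height", "mark")
data[cols_to_loop_over] <- lapply(data[cols_to_loop_over], NA_outside_3s)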
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
I have a list of samples, each of varying lengths. I need to compare sample means (using a Mann-Whitney-Wilcoxon test) for all samples in the list. Current code is as follows:
wilcox.v = list() ##This creates the list of samples
for (i in df){
treat = list(i$treatment)
wilcox.v = c(wilcox.v,treat)
}
###This *should* iterate over all items in the list
wilcox = sapply(wilcox.v, function(i){ wilcox.test(as.numeric(wilcox.v[i,]), as.numeric(wilcox.v[-i,]), exact = FALSE)$p.value
})
I'd like to have the function return a vector of p-values, so that the broader function can re-sample if necessary.
The problem seems to lie in the need to compare a sample mean to all other sample means in the list.
I'm sure there's an easy way to do this (and I think it has something to do with calling indices correctly), but I'm not sure!
As joran said, you wrote your apply function a little wonky. There are two ways you can fix this.
Modify it so i is in fact an index reference:
wilcox = sapply(seq_along(wilcox.v), function(i) {
  wilcox.test(as.numeric(wilcox.v[[i]]),
              as.numeric(unlist(wilcox.v[-i])),   # pool all the other samples
              exact = FALSE)$p.value
})
Modify your function so it appropriately treats i as a list element. I'll leave this as an exercise for you (primarily because I don't want to deal with the wilcox.v[-i,] term).
Thanks for your help! This is the solution I ended up using. It's hardly elegant but it gets the job done.
mannwhit = vector()
for (i in mannwhit.v){
  for (j in mannwhit.v){
    if (identical(i,j) == FALSE){
      p.val = wilcox.test(i, j, paired=FALSE)$p.value
      mannwhit = c(mannwhit, p.val)
    }
  }
}
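A slightly tidier variant of the same idea, in case it is useful: since the two-sided p-value is the same for (i, j) and (j, i), each pair only needs to be computed once, and combn avoids growing the result vector (this assumes mannwhit.v is a plain list of numeric samples):
pairs <- combn(length(mannwhit.v), 2)          # all unordered pairs of sample indices
mannwhit <- apply(pairs, 2, function(idx) {
  wilcox.test(mannwhit.v[[idx[1]]], mannwhit.v[[idx[2]]], paired = FALSE)$p.value
})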
I have a question regarding R apply (and all its variants). Is there a way to update the arguments of the function while apply is working?
For example, I have a function NextSol(Prev_Sol) that generates a new solution from Prev_Sol, compares it with the original one in some way, and then returns either the original or the new one, depending on the result of the comparison. I need to save all the solutions returned. Currently, I am doing this:
for( i in 2:N ) {
Results[[i]] <- NextSol(Results[[i-1]])
}
But maybe there is a (faster) way to do it using apply? I have also seen that Reduce could help, but I have no idea how to use it. Any help will be much appreciated!
As Thomas said, the for loop is the standard way of looping when one iteration depends on a previous one. (Just make sure that you correctly handle the case of N = 1 in your code.)
An alternative is to use the Reduce function. This example is adapted from the one on the ?Reduce help page.
NextSol <- function(x) x + 1 #Or whatever you want
Funcall <- function(f, ...) f(...)
Reduce(Funcall, rep.int(list(NextSol), 5), 0, right = TRUE)
## [1] 5
It's unlikely that this will be much faster, and it's arguably harder to read, so you may well decide to stick with a for loop.
Well, I suppose we can make it easier to read by wrapping it in an Iterate function.
Iterate <- function(f, init, n)
{
Reduce(
function(f, ...) f(...),
rep.int(list(f), n),
init,
right = TRUE
)
}
Iterate(NextSol, 0, 5) #same as before
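Since you want to keep every solution rather than just the final one, note that Reduce also has an accumulate argument that returns all the intermediate results as well (shown with the same toy NextSol; the values run from the most-iterated down to the initial 0 because of right = TRUE):
Reduce(Funcall, rep.int(list(NextSol), 5), 0, right = TRUE, accumulate = TRUE)
## [1] 5 4 3 2 1 0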