Trying to replace NAs with column means in R

I am trying to run the simple code below, which I have always used, over data, which is a data frame of 800 features and 200,000 observations:
C <- ncol(data)
for (i in 1:C){
  print(i)
  data[is.na(data[,i]), i] <- mean(data[,i], na.rm=TRUE)
}
returns:
[1] 1
Error: cannot allocate vector of size 1.6 Mb
I don't really understand why, because I can compute the mean of any single feature independently without errors. Any ideas?

That error means you are running out of memory while computing the means.
Sometimes, depending on the number of references to an object, R will copy the object, make the change to the copy, and then replace the original (copy-on-modify). That is likely what is happening in your case.
I recommend the data.table package, which lets you modify columns by reference, without copying.
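A minimal sketch of that approach (assuming your data fits in memory once as a data.table; set() writes into each column by reference instead of copying the whole frame):

library(data.table)
DT <- as.data.table(data)   # or setDT(data) to convert in place
for (j in names(DT)) {
  na_rows <- which(is.na(DT[[j]]))   # rows with NA in this column
  set(DT, i = na_rows, j = j, value = mean(DT[[j]], na.rm = TRUE))
}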

Related

Most efficient way to remove elements from vector, do a calculation and put them back in R

I'm trying to find the best way to do a cut on a vector using 12 breaks, but I want the first cluster to contain values from 0 up to a specific value (onePercentQuantile). One way to do that is to remove all values below onePercentQuantile from the vector while keeping track of the removed indexes, run cut on the remaining values with only 11 breaks, and finally put the removed indexes back as cluster 1 in the new object, starting from the leftmost index.
I did not find a way to ignore values under a specific threshold in the cut function, so I did it myself.
Here is an example with the most efficient way I could find:
library(R.utils)  # provides insert(), used below

vector_gene <- sample(1:100, 10000, replace=TRUE)
onePercentQuantile <- 5
indexUnderPercentQuantile <- which(vector_gene <= onePercentQuantile)
overPercentQuantile <- vector_gene[vector_gene > onePercentQuantile]
tmp <- as.numeric(cut(c(t(overPercentQuantile)), breaks=11)) + 1
if (1 == indexUnderPercentQuantile[1]){
  tmp <- insert(tmp, 1, values=1)
  indexUnderPercentQuantile <- indexUnderPercentQuantile[2:length(indexUnderPercentQuantile)]
}
for (i in 1:length(vector_gene)){
  if (i > length(tmp)){
    tmp <- c(tmp, rep(1, length(vector_gene)-i+1))
    break
  } else if (i == indexUnderPercentQuantile[1]){
    tmp <- c(tmp[1:i-1],1,tmp[i:length(tmp)])
    if (length(indexUnderPercentQuantile) > 1){
      indexUnderPercentQuantile <- indexUnderPercentQuantile[2:length(indexUnderPercentQuantile)]
    }
  }
}
Using the profvis package I have traced memory usage and running time.
Up to 10,000 elements, the result is instantaneous and uses barely any memory.
Up to 100,000 elements, memory usage is already around 9 GB and it runs in 3.5 s.
Up to 1,000,000 elements, the running time is so long that I shut it down.
The bottleneck is the reattribution of the removed indexes:
tmp <- c(tmp[1:i-1],1,tmp[i:length(tmp)])
I've tried the insert function from R.utils but it was worse in both memory usage and running time.
Is there a better way, especially in terms of memory usage, to solve this problem? Thanks!
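For comparison, a minimal vectorized sketch of the same idea (assuming the intent is simply: values at or below onePercentQuantile become cluster 1, everything else gets the 11-break clusters shifted up by 1); it avoids growing tmp element by element:

under <- vector_gene <= onePercentQuantile
res <- integer(length(vector_gene))
res[under] <- 1                                                      # cluster 1: values under the threshold
res[!under] <- as.numeric(cut(vector_gene[!under], breaks=11)) + 1   # clusters 2..12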

Programming a Loop on R

To explain what I'm trying to do: I have 73 assets and I'm trying to backtest possible pair trading strategies.
I previously tested all possible pair combinations for cointegration and stored the cointegrated ones in a variable named combos, a 2x869 matrix containing in each column the ids of two assets that are cointegrated. So if I take the first column of this variable I get (1, 25), which means that asset number 1 and asset number 25 in my data are cointegrated.
The second variable is datats, an xts object containing the prices of my 73 assets at different dates.
Now I'm trying to run a loop over them in order to backtest the pair trading strategy using the PairTrading package in R.
Here is the code using the stock.price data within the package.
install.packages("PairTrading", repos="http://R-Forge.R-project.org")
install.packages("plyr")
library(PairTrading)
library(plyr)
datats<-stock.price
combos <- combn(ncol(datats),2)
adply(combos, 2, function(x) {
  price.pair <- datats[, c(x[1], x[2])]
  params <- EstimateParametersHistorically(price.pair, period = 180)
  signal <- Simple(params$spread, 0.05)
  return.pairtrading <- Return(price.pair, lag(signal), lag(params$hedge.ratio))
  returnS <- (100 * cumprod(1 + return.pairtrading))
  out <- data.frame("return" = returnS)
  return(out)
})
This code works fine; the problem appears when I use my own data instead of the dataset within the package.
What I'm trying to do is test the strategy (the function defined above) on the combinations stored in combos, which means that for the first execution I want to test the strategy between asset number 1 and asset number 25, then between 1 and 61, and so on, so I use datats[,c(x[1],x[2])] to do that, and I store the cumulative return of the strategy with returnS <- (100 * cumprod(1 + return.pairtrading)).
I tested the function above manually and it works perfectly; the problem is when I try to loop it, so my guess is that the loop itself is working fine, but something afterwards breaks the code.
This is the error I get:
Error in apply(merge(signal[, 1], weight.pair[, 1], return.pair[, 1], :
dim(X) must have a positive length
Here is an update: the code works fine even with my data, but it seems to break at certain points. For instance, I managed to run all combinations from 1 to 400 in a single execution, but combination 408 breaks the code; and when I run combination 408 manually the function works fine, yet it breaks when I loop over it. This error is very confusing and I've been unable to make progress today because of it, so any help is appreciated.
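One way to narrow this down (a debugging sketch, not a fix; it assumes the same adply body as above) is to wrap each combination in tryCatch so a failing pair is reported and skipped instead of aborting the whole loop:

results <- adply(combos, 2, function(x) {
  tryCatch({
    price.pair <- datats[, c(x[1], x[2])]
    params <- EstimateParametersHistorically(price.pair, period = 180)
    signal <- Simple(params$spread, 0.05)
    return.pairtrading <- Return(price.pair, lag(signal), lag(params$hedge.ratio))
    data.frame(return = 100 * cumprod(1 + return.pairtrading))
  }, error = function(e) {
    message("pair ", x[1], "/", x[2], " failed: ", conditionMessage(e))
    NULL   # this pair contributes nothing (assumption: plyr skips NULL pieces)
  })
})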

Output of parApply different from my input

I am still quite new to R (I used to program in MATLAB) and I am trying to use the parallel package to speed up some calculations. Below is an example in which I calculate the rolling standard deviation of a matrix (by column) with the zoo package, with and without parallelising the code. However, the shapes of the two outputs differ.
# load library
library('zoo')
library('parallel')
library('snow')
# Data
z <- matrix(runif(1000000,0,1),100,1000)
#This is what I want to calculate with timing
system.time(zz <- rollapply(z,10,sd,by.column=T, fill=NA))
# Trying to achieve the same output with parallel computing
cl<-makeSOCKcluster(4)
clusterEvalQ(cl, library(zoo))
system.time(yy <-parCapply(cl,z,function(x) rollapplyr(x,10,sd,fill=NA)))
stopCluster(cl)
My first output zz has the same dimensions as the input z, whereas the output yy is a vector rather than a matrix. I understand that I can do something like matrix(yy, nrow(z), ncol(z)), but I would like to know whether I have done something wrong or whether there is a better way to code this. Thank you.
From the documentation:
parRapply and parCapply always return a vector. If FUN always returns a scalar result this will be of length the number of rows or columns: otherwise it will be the concatenation of the returned values.
And:
parRapply and parCapply are parallel row and column apply functions for a matrix x; they may be slightly more efficient than parApply but do less post-processing of the result.
So, I'd suggest you use parApply.
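A minimal sketch of that suggestion (assuming the same cluster setup and z as above); with MARGIN = 2, parApply applies the function over columns and keeps the matrix shape, so the result has the same dimensions as z:

cl <- makeSOCKcluster(4)
clusterEvalQ(cl, library(zoo))
yy <- parApply(cl, z, 2, function(x) rollapplyr(x, 10, sd, fill = NA))
stopCluster(cl)
dim(yy)   # 100 x 1000, matching zz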

Faster alternative to for loop in R which calls a function with another loop

I am trying to parse a huge dataset into R (1.3 GB). The original data is a list of four million character strings, each one an observation of 137 variables.
First I created a function that splits each string according to the key provided with the dataset, where d is one of those strings. For the purpose of this question imagine that d has this form
"2005400d"
and the key would be
varName <- c("YEAR","AGE","GENDER","STATUS")
varIn <- c(1,5,7,8)
varEnd <- c(4,6,7,8)
where varIn and varEnd mark the start and end positions of each field. The function I created was:
parseLine <- function(d){
  k <- unlist(strsplit(d, ""))
  vec <- rep(NA, length(varName))
  for (i in 1:length(varName)){
    vec[i] <- paste(k[varIn[i]:varEnd[i]], sep="", collapse="")
  }
  return(vec)
}
And then in order to loop over all the data available, I've created a for loop.
df <- data.frame(matrix(ncol=length(varName)))
names(df) <- as.character(varName)
for (i in 1:length(data)){
  df <- rbind(df, parseLine(data[i]))
}
However, when I timed the function over 1,000 iterations I got a system time of 10.82 seconds, but when I increased that to 10,000, instead of roughly 108.2 seconds I got 614.77 seconds, which indicates that the time needed grows much faster than linearly as the number of iterations increases.
Any suggestion for speeding up the process? I've tried the foreach library, but it didn't give the parallel speed-up I expected.
m <- foreach(i=1:10, .combine=rbind) %dopar% parseLine(data[i])
df <- m
names(df) <- as.character(varName)
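As a sketch (assumption: the missing piece is that %dopar% silently falls back to sequential execution unless a parallel backend is registered), registering doParallel first should make the foreach attempt actually run in parallel:

library(foreach)
library(doParallel)
registerDoParallel(cores = 4)   # register a backend so %dopar% runs in parallel
m <- foreach(i = seq_along(data), .combine = rbind) %dopar% parseLine(data[i])
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- as.character(varName)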
Why re-invent the wheel? Use read.fwf in the utils package (attached by default):
> dat <- "2005400d"
> varName <- c("YEAR","AGE","GENDER","STATUS")
> varIn <- c(1,5,7,8)
> varEND <- c(4,6,7,8)
> read.fwf(textConnection(dat), col.names=varName, widths=1+varEND-varIn)
YEAR AGE GENDER STATUS
1 2005 40 0 d
You should get further efficiency if you specify colClasses, but my effort to demonstrate this failed to show a difference. Perhaps that advice only applies to read.table and its cousins.
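For reference, a sketch of what that might look like (the column types here are an assumption for this toy example; read.fwf forwards extra arguments such as colClasses to read.table):

read.fwf(textConnection(dat), col.names=varName, widths=1+varEND-varIn,
         colClasses=c("integer", "integer", "integer", "character"))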

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a data frame that is more conducive to the eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function on each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=1:200, y=1:200)
# 100 replications of sampling 100 "positions"
resamp <- replicate(100, df[sample(nrow(df), 100), ])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2, ])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
# edit: kernel.area requires an id field, but I am only dealing with one
# individual, so I'll construct a fake one of the same length as the positions
id <- replicate(100, c("id"))
id <- data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernel.area(df3[, j:(j+1)], id=id, kern="bivnorm", unin=c("m"), unout=c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (i.e. rows 1:100 for every pair of columns up to 200), and to be able to combine the results in a data frame. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now have this error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the adehabitat package, which has a kernel.area function that I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple of suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for the order in which operations are evaluated in R). To get the desired two columns, use j:(j+1) instead.
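A quick illustration of that precedence point:

j <- 3
j:j+1      # (j:j) + 1  ->  4, a single number
j:(j+1)    # 3 4, the two adjacent columns intended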
EDIT:
When loading adehabitat, I was warned to "Be careful" and to use the related newer packages, among which is adehabitatHR, which also contains a kernel.area function. This function has slightly different syntax and behavior, but it may be worth examining. Using adehabitatHR (I had to install it from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern="bivnorm")
  kernAr <- kernel.area(kud, unin=c("m"), unout=c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints results for each pair of columns; note that kernelUD() is called first and kernel.area() is then applied to its output.
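If the end goal is to keep the areas rather than just print them, a small sketch along the same lines (assumption: storing each kernel.area result in a list and combining afterwards is acceptable):

library(adehabitatHR)
results <- vector("list", ncol(df3) / 2)
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern="bivnorm")
  results[[(j+1)/2]] <- kernel.area(kud, unin=c("m"), unout=c("km2"))
}
# results[[1]] holds the areas for columns 1-2, results[[2]] for columns 3-4, ...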
