I have about 300 files, each containing 1000 time series realisations (~76 MB each file).
I want to calculate the quantiles (0.05, 0.50, 0.95) at each time step from the full set of 300000 realisations.
I cannot merge the realisations into a single file because it would become too large.
What's the most efficient way of doing this?
Each matrix is generated by running a model, but here is a sample containing random numbers:
x <- matrix(rexp(10000000, rate=.1), nrow=1000)
There are at least three options:
1) Are you sure it has to be from the full set? A 10% sample should be a very, very good approximation here.
2) 300k elements isn't that big of a vector, but a 300k x 100+ column matrix is big. Pull just the column you need into memory rather than the entire matrix (this can be repeated over every column if necessary).
3) Do it sequentially, possibly in conjunction with a smaller sample to get you started in the right ballpark. For the 5th percentile, you just need to know how many items are above the current guess and how many are below. So something like:
   a) Take a 1% sample and find its 5th percentile. Jump some tolerance above and below, such that you're sure the exact 5th percentile lies in that range.
   b) Read in the matrix in chunks. For each chunk, count the number of observations above the range and below the range. Then retain all observations which lie within the range.
   c) When you've read in the last chunk, you have three pieces of information (count above, count below, vector of observations within). One way to take a quantile is to sort the whole vector and find the nth observation, where n is the rank of the desired quantile in the full data. You can do that with the above pieces of information: sort the within-range observations, and find the (n-count_below)th.
Edit: Example of (3).
Note that I am not a champion algorithm designer and that someone has almost certainly designed a better algorithm for this. Also, this implementation is not particularly efficient. If speed matters to you, consider Rcpp, or even just more optimized R for this. Making a bunch of lists and then extracting values from them is not so smart, but it was easy to prototype this way so I went with it.
library(plyr)
set.seed(1)
# -- Configuration -- #
desiredQuantile <- .25
# -- Generate sample data -- #
# Use some algorithm (sampling, iteration, or something else to come up with a range you're sure the true value lies within)
guessedrange <- c( .2, .3 )
# Group the observations to correspond to the OP's files
dat <- data.frame( group = rep( seq(100), each=100 ), value = runif(10000) )
# -- Apply the algorithm -- #
# Count the number above/below and return the values within the range, by group
res <- dlply( dat, .( group ), function( x, guessedrange ) {
above <- x$value > guessedrange[2]
below <- x$value < guessedrange[1]
list(
aboveCount = sum( above ),
belowCount = sum( below ),
withinValues = x$value[ !above & !below ]
)
}, guessedrange = guessedrange )
# Extract the count of values below and the values within the range
belowCount <- sum( sapply( res, function(x) x$belowCount ) )
belowCount
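# lapply + unlist always yields a plain vector, even if every group returns the same number of values
withinValues <- unlist( lapply( res, function(x) x$withinValues ), use.names = FALSE )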
str(withinValues)
# Count up until we find the within value we want
desiredQuantileCount <- floor( desiredQuantile * nrow(dat) ) #! Should fix this so it averages when there's a tie
sort(withinValues)[ desiredQuantileCount - belowCount + 1 ]
# Compare to exact value
quantile( dat$value, desiredQuantile )
In the end, the value is a little off from the exact version. I suspect I'm shifted over by one or some equally silly explanation, but maybe I'm missing something fundamental.
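One possible explanation, offered as a sketch rather than a definitive fix: `quantile()` defaults to type 7, which interpolates between two neighbouring order statistics at the fractional position `(n - 1) * p + 1`, whereas the code above picks a single order statistic. Assuming the guessed range brackets both neighbours, the same interpolation can be reproduced from the below-count and the within-range values alone. The helper name `chunkedQuantile` is made up for illustration.

```r
set.seed(1)
value <- runif(10000)
p <- 0.25
guessedrange <- c(0.2, 0.3)

# Hypothetical helper: type-7 quantile from the count below the range
# and the values that fall inside the range
chunkedQuantile <- function(belowCount, withinValues, n, p) {
  h  <- (n - 1) * p + 1        # fractional position used by type 7
  lo <- floor(h)
  hi <- ceiling(h)
  s  <- sort(withinValues)
  # Both neighbouring order statistics must lie inside the retained range
  stopifnot(lo - belowCount >= 1, hi - belowCount <= length(s))
  xlo <- s[lo - belowCount]
  xhi <- s[hi - belowCount]
  xlo + (h - lo) * (xhi - xlo)
}

belowCount   <- sum(value < guessedrange[1])
withinValues <- value[value >= guessedrange[1] & value <= guessedrange[2]]

approx <- chunkedQuantile(belowCount, withinValues, length(value), p)
exact  <- unname(quantile(value, p))
all.equal(approx, exact)
```

If the `stopifnot()` check fails, the guessed range was too narrow and needs to be widened before re-reading the chunks.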
Basic idea:
As said before, is it a good idea to replace subsetting a data frame with a multidimensional list?
I have a function that needs to generate a subset from a fairly big data frame close to 30 thousand times. Creating a 4-dimensional list would give me instant access to the subset, without losing time generating it.
However, I don't know how R treats these objects, so I would like your opinion on it.
More concrete example if needed:
What I was trying to do is use the KNN imputation method. Basically, the algorithm says that a value flagged as an outlier has to be replaced using its K closest neighbors (K is a number; it could be 1, 2, 3...). The neighbors in this example are the rows with the same attributes in the first 4 columns, and the closest neighbors are the ones with the smallest difference in the fifth column. If what I said is not clear, please still consider reading the code, because I found it hard to describe in words.
These are the objects:
#create vectors of random values
values <- floor(runif(5e7, 0, 50))
possible.outliers <- floor(runif(5e7, 0, 10000))
#shuffle these values to build a data frame
df <- data.frame( sample(values), sample(values), sample(values),
sample(values), sample(values), sample(possible.outliers) )
#all the values greater than 800 will be marked as outliers
df$isOutlier = df[,6] > 800
This is the function which will be used to replace the outliers
#with the generated data frame, run this function
#Parameters:
# *df: The entire data frame from above
# *vector.row: The row that was marked as containing an outlier. The outlier will be replaced with the return value of this function
# *numberK: The number of neighbors to take into account.
# !Very important: for the last column, the higher the
# difference between their values, the less attractive
# they are for an imputation.
foo <- function(df, vector.row, numberK){
#find the neighbors
subset = df[ vector.row[1] == df[,1] & vector.row[2] == df[,2] &
vector.row[3] == df[,3] & vector.row[4] == df[,4] , ]
#take the "distance" between the rows, so that the closest neighbors
# can be found
subset$distance = subset[,5] - vector.row[5]
#no need to implement this part here
"function that find the closest neighbors from the distance on subset"
return (mean(ClosestNeighbors))
}
So, the function's runtime is quite long. For this reason, I am looking for alternatives, and I thought that maybe I could replace the subsetting with something like this:
list[[" Levels COl1 "]][[" Levels COl2 "]]
[[" Levels COl3 "]][[" Levels COl4 "]]
This should give instant access to the subset, instead of generating it every time inside the function.
Is it a reasonable idea? I'm a noob in R.
If you did not understand what is written, or would like something explained in more detail or in other words, please tell me, because I know it is not the most direct question.
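To make the idea concrete, here is a minimal sketch of such a precomputed lookup. It assumes a flat named list keyed on the pasted values of the four columns in place of four nesting levels, and it uses a small made-up data frame with hypothetical column names `a` to `d`, `e` and `target`.

```r
set.seed(42)
n <- 1e4
# Small stand-in for the real data frame (column names are hypothetical)
df <- data.frame(a = sample(0:4, n, TRUE), b = sample(0:4, n, TRUE),
                 c = sample(0:4, n, TRUE), d = sample(0:4, n, TRUE),
                 e = sample(0:49, n, TRUE), target = runif(n))

# Build the lookup once: one data frame per combination of the four keys
key <- paste(df$a, df$b, df$c, df$d, sep = "_")
groups <- split(df, key)

# Look the subset up by name instead of scanning all rows every time
row <- df[1, ]
neighbours <- groups[[paste(row$a, row$b, row$c, row$d, sep = "_")]]

# Same rows as the logical subset used inside foo()
slow <- df[df$a == row$a & df$b == row$b & df$c == row$c & df$d == row$d, ]
identical(rownames(neighbours), rownames(slow))
```

The cost of the four comparisons over the whole data frame is paid once in `split()` rather than on each of the roughly 30 thousand calls.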
I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently is, but I can't find any good way of doing it without loops.
The gist of it is; I have modified averages for each column (which is a channel), and need to compare each value in a column to the average for that column. If the value is above the adjusted mean, then I need to put that value in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(Values = numeric(0))
#need to clear the data frame in order to run it multiple times without errors
#timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0, , drop = FALSE]
#vector that collects the values at or above the adjusted mean
spikes <- numeric(0)
for (i in 1:ncol(timeFrame)){
for (g in 1:nrow(timeFrame)){
if (timeFrame[g,i] >= posCompValues[i,1])
spikes <- append(spikes, timeFrame[g,i])
}
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
If you want to save both the value and the corresponding column it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
tt <- timeFrame[,i]
spikes <- tt[tt>posCompValues[i,1]]
if (length(spikes)>0) {
resList[[i]] <- data.frame(value=spikes,orig_col=i)
}
}
res <- do.call(rbind, resList)
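As a quick sanity check, the following sketch runs the `Map` approach on made-up data and compares it with a cell-by-cell double loop. The objects are small stand-ins for the real `timeFrame` and `posCompValues`, and `>=` is used to match the comparison in the question.

```r
set.seed(1)
# Made-up stand-ins for the real objects
timeFrame <- as.data.frame(matrix(rnorm(5000), ncol = 5))
posCompValues <- data.frame(threshold = rep(1.5, 5))

# Vectorised: one comparison per column instead of one per cell
spikeList <- Map(function(x, y) x[x >= y], timeFrame, posCompValues[, 1])
spikes <- unlist(spikeList, use.names = FALSE)

# Reference: a double loop over every cell
loopSpikes <- numeric(0)
for (i in seq_len(ncol(timeFrame))) {
  for (g in seq_len(nrow(timeFrame))) {
    if (timeFrame[g, i] >= posCompValues[i, 1]) {
      loopSpikes <- c(loopSpikes, timeFrame[g, i])
    }
  }
}
identical(spikes, loopSpikes)
```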
Suppose X is a vector of length 100 holding the X position of 100 individuals. All agents start at position 0:
X <- rep(0,100)
but they are embedded in a world with boundaries. I have a function that randomly changes the X position of all the agents at a given time.
Store <- X
X <- X + runif(100)
Eventually, one agent will reach the boundary and, at that point, it should stay within the limits. The simplest way to do this is to loop through the vector and check with if (in pseudo code):
for (i in 1:length(X)) {
if (between the boundaries) {keep the new X[i]} else {assign X[i] the value in Store[i]}
}
This works for 100 individuals, but the for-loop adds too much computational time if the number of individuals (and the length of the vector) increases, for example, to 1000000.
Is there a more straightforward way to do it? I was thinking that maybe I could skip the reassignment of values that exceed the threshold during:
X <- X + runif(100)
EDIT: Also, imagine that X is not a vector but a matrix.
I realize this question is relatively old, but I just had the same question so I didn't want to leave it unanswered.
Limiting a vector or matrix to values within a certain range can be done compactly by combining an apply statement with the min and max functions, as shown in the example below.
# Create sample vector
X <- c(1:100); print(X)
# Create sample matrix
M <- matrix(c(1:100),nrow=10); print(M)
# Set limits
minV <- 15; maxV <- 85;
# Limit vector
sapply(X, function(y) min(max(y,minV),maxV))
# Limit matrix
apply(M, c(1, 2), function(x) min(max(x,minV),maxV))
For further information on the apply functionality I would refer to the R documentation and this article on R-Bloggers:
https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/
When I first came across apply statements I found it a difficult concept to wrap my head around, but would now consider it one of R's most powerful features.
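As an alternative sketch, base R's `pmin()` and `pmax()` do the same clamping element-wise without calling a function per element, and they keep the dimensions of a matrix. The second half shows the behaviour described in the question (keep the old position when the new one leaves the boundaries); the boundary values `lower` and `upper` and the step distribution are assumptions for illustration.

```r
X <- 1:100
M <- matrix(1:100, nrow = 10)
minV <- 15; maxV <- 85

# Element-wise clamp; the dimensions of M are preserved
Xc <- pmin(pmax(X, minV), maxV)
Mc <- pmin(pmax(M, minV), maxV)

# Variant from the question: revert to the stored position when the
# proposed position falls outside the (assumed) boundaries
set.seed(1)
lower <- 0; upper <- 5
Store <- runif(1e5, lower, upper)
Xnew  <- Store + runif(1e5, -1, 1)
inside <- Xnew >= lower & Xnew <= upper
Xkept <- ifelse(inside, Xnew, Store)
```

Both variants work unchanged when `Store` and `Xnew` are matrices of the same shape.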
I am trying to determine the location of raster cells that add up to a given amount, starting with the maximum value and progressing down.
Eg, my raster of 150,000 cells has a total sum value of 52,000,000;
raster1 <- raster("myvalues.asc")
cellStats(raster1, sum)  # 52,000,000
I can extract the cells above the 95th percentile;
q95 <- raster1
q95[q95 < quantile(q95,0.95)] <- NA
cellStats(q95, sum)  # 14,132,000
As you can see, the top 5% of cells (based on the quantile) account for around 14 million of the original total of 'raster1'.
What I want to do is fix the overall sum in advance at 10,000,000 (or x) and then cumulatively sum raster cells, starting with the maximum value and working down, until I have (and can plot) all cells that sum up to x.
I have attempted to convert 'raster1' to a vector, sort it, take the cumulative sum, etc., but I can't tie it back to the raster. Any help here is much appreciated.
The below is your own answer, but rewritten such that it is more useful to others (self contained). I have also changed the %in% to < which should be much more efficient.
library(raster)
r <- raster(nr=100, nc=100)
r[] = sample(ncell(r))
rs <- sort(as.vector(r), decreasing=TRUE)
r_10m <- min( rs[cumsum(rs) < 10000000] )
test <- r
test[test < r_10m ] <- NA
cellStats(test, sum)
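The same idea can be sketched on a plain vector with `order()`, which returns the cell indices directly and so ties the cumulative sum back to individual cells. This is an illustration under assumptions: `vals` stands in for the cell values in cell order (as returned by `as.vector(r)`), and writing `masked` back into the raster is left out.

```r
set.seed(1)
vals <- sample(10000)          # stand-in for the raster cell values
target <- 1e7

# Cell indices ordered from the largest value down
ord <- order(vals, decreasing = TRUE)
# Keep cells while the running total stays below the target
keep <- ord[cumsum(vals[ord]) < target]

# Everything else becomes NA, as in the masked raster
masked <- rep(NA_real_, length(vals))
masked[keep] <- vals[keep]
total <- sum(masked, na.rm = TRUE)
total
```

Because the selection is made by index rather than by value, cells that share the threshold value are not all pulled in together.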
Couldn't find the edit button.
This is something like what I need, after an hour of scratching my head:
raster1v <- as.vector(raster1)
raster1vdesc <- sort(raster1v, decreasing=TRUE)
raster1_10m <- raster1vdesc[cumsum(raster1vdesc)<10000000]
test <- raster1
test[!test%in%raster1_10m] <- NA
plot(test)
cellStats(test, sum)  # 9,968,073
This seems to work, as far as I can tell. Anything more elegant would be ideal.
So I've been pondering how to do this without a for loop and I couldn't come up with a good answer. Here is an example of what I mean:
sampleData <- matrix(rnorm(25,0,1),5,5)
meanVec <- vector(length=length(sampleData[,1]))
for(i in 1:length(sampleData[,1])){
subMat <- sampleData[1:i,]
ifelse( i == 1 , sumVec <- sum(subMat) ,sumVec <- apply(subMat,2,sum) )
meanVec[i] <- mean(sumVec)
}
meanVec
The actual matrix I want to do this to is reasonably large, and to be honest, for this application it won't make a huge difference in speed, but it's a question I think should be answered:
How can I get rid of that for loop and replace with some *ply call?
Edit: In the example given, I generate sample data and define a vector whose length equals the number of rows in the matrix.
The for loop does the following steps:
1) takes a submatrix, from row 1 to row i
2) if i is 1, it just sums up the values in that vector
3) if i is not 1, it gets the sum of each row, then gets the mean of the sum and stores that in position i of the vector meanVec.
Finally, it prints out the mean of that sum.
This does what you describe:
cumsum(rowSums(sampleData))/seq_len(nrow(sampleData))
However, your code doesn't do the same.
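Reading the loop literally, it takes column sums over rows 1 to i and averages them, which equals the cumulative total divided by the number of columns, except at i = 1 where it stores the plain sum of the first row. A sketch of a loop-free equivalent of that behaviour, checked against a cleaned-up version of the loop:

```r
set.seed(1)
sampleData <- matrix(rnorm(25, 0, 1), 5, 5)

# The loop from the question, with the same behaviour
meanVec <- numeric(nrow(sampleData))
for (i in seq_len(nrow(sampleData))) {
  subMat <- sampleData[1:i, , drop = FALSE]
  sumVec <- if (i == 1) sum(subMat) else colSums(subMat)
  meanVec[i] <- mean(sumVec)
}

# Loop-free: running total divided by the number of columns
rs <- cumsum(rowSums(sampleData))
loopFree <- rs / ncol(sampleData)
loopFree[1] <- rs[1]           # the i == 1 branch stores the plain sum
all.equal(meanVec, loopFree)
```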