Extracting raster values, from the maximum down, to cumulatively sum to x (R)

I am trying to determine the location of raster cells that add up to a given amount, starting with the maximum value and progressing down.
E.g., my raster of 150,000 cells has a total sum of 52,000,000:
raster1 <- raster("myvalues.asc")
cellStats(raster1, sum)  # 52,000,000
I can extract the cells above the 95th percentile:
q95 <- raster1
q95[q95 < quantile(q95, 0.95)] <- NA
cellStats(q95, sum)  # 14,132,000
As you can see, the top 5% of cells (based on the quantile calculation) sum to around 14 million of the original 52 million total of 'raster1'.
What I want to do is set the target sum in advance as 10,000,000 (or x) and then cumulatively sum raster cells, starting with the maximum value and working down, until I have (and can plot) all the cells that sum to x.
I have attempted to convert 'raster1' to a vector, sort it, take the cumulative sum, etc., but I can't tie the result back to the raster. Any help here is much appreciated.
S

Below is your own answer, rewritten so that it is self-contained and more useful to others. I have also changed the %in% to <, which should be much more efficient.
library(raster)
# example raster with 10,000 cells holding the values 1:10000 in random order
r <- raster(nrows=100, ncols=100)
r[] <- sample(ncell(r))
# sort all cell values from largest to smallest
rs <- sort(as.vector(r), decreasing=TRUE)
# smallest value still included when accumulating from the top down towards 10,000,000
r_10m <- min(rs[cumsum(rs) < 10000000])
# mask out every cell below that threshold and check the remaining total
test <- r
test[test < r_10m] <- NA
cellStats(test, sum)
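If you are worried about duplicated cell values making the value threshold ambiguous, a hedged alternative (my own sketch, reusing the raster r built above) is to select cells by cell number rather than by value:
# Alternative sketch: pick cells by index instead of a value threshold.
vals <- values(r)                              # all cell values as a vector
ord  <- order(vals, decreasing = TRUE)         # cell numbers, largest value first
keep <- ord[cumsum(vals[ord]) < 10000000]      # top cells whose running total stays under 1e7
test2 <- r
test2[setdiff(seq_len(ncell(r)), keep)] <- NA  # mask every other cell
cellStats(test2, sum)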

Couldn't find the edit button...
This is something like what I need, after an hour of scratching my head:
raster1v <- as.vector(raster1)
raster1vdesc <- sort(raster1v, decreasing=TRUE)
raster1_10m <- raster1vdesc[cumsum(raster1vdesc) < 10000000]
test <- raster1
test[!test %in% raster1_10m] <- NA
plot(test)
cellStats(test, sum)  # 9,968,073
It seems to work, I think. Anything more elegant would be ideal.

Related

Calculating difference between points in vector

I'm trying to calculate the difference between all points in a vector of length 10605 in R. For example, I am trying to do this:
for (i in 1:10605){
  for (j in 1:10605){
    differences[i] = housedata$Mean_household_income[i] - housedata$Mean_household_income[j]
  }
}
It is taking so long to compute, and I'm thinking there's a more timely way to calculate the difference between all the points with each other in this vector. Does anyone have any suggestions?
Thanks!
Seems like the dist function should do that. Distance matrices are only lower triangular because distance(x,y) == distance(y,x):
my.distances <- dist(housedata$Mean_household_income)
It's going to be faster since it's done in C code. Just type dist at the prompt to see how it is implemented.
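For illustration (my own made-up example, not the question's data): for a plain numeric vector, dist() returns absolute pairwise differences, not signed ones, and as.matrix() expands the lower-triangular result to a full matrix.
# Small illustrative example: in one dimension the Euclidean "distance"
# between two elements is just the absolute difference.
income <- c(30, 45, 52, 41)
d <- dist(income)   # lower-triangular set of |x_i - x_j|
as.matrix(d)        # full symmetric matrix of absolute differences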
You could loop through an incrementally shifted/wrapped copy of the vector and subtract the two vectors. You still have to loop through the length of the data once and shift and subtract the vector each time, but it will probably save some time.
Here is an example:
# make a shift/wrap function
shift <- function(df, offset){
  df[((1:length(df)) - 1 - offset) %% length(df) + 1]
}
# make some data
data <- seq(1, 4)
# make an empty vector to hold the data
difs <- vector()
# loop through the data
for(i in 1:length(data)){
  shifted <- shift(data, i)
  result <- data - shifted
  difs <- c(difs, result)
}
print(difs)
What about using outer? It uses a vectorized function (here -) on all combinations of two vectors and stores the results in a matrix.
For example,
x <- runif(10605)
system.time(
  differences <- outer(x, x, '-')
)
takes one second on my computer.

Vectorizing a column-by-column comparison to separate values

I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently is, but I can't find any good way of doing it without loops.
The gist of it is: I have modified averages for each column (each column is a channel), and I need to compare each value in a column to the average for that column. If the value is above the adjusted mean, then I need to put that value in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(dimnames <- c("Values"))
#need to clear the dataframe in order to run it multiple times without errors
#timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0,]
for (i in 1:ncol(timeFrame)){
  for (g in 1:nrow(timeFrame)){
    if (timeFrame[g,i] >= posCompValues[i,1])
      append(spikes, timeFrame[g,i])
  }
}
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
If you want to save both the value and the corresponding column it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
  tt <- timeFrame[,i]
  spikes <- tt[tt > posCompValues[i,1]]
  if (length(spikes) > 0) {
    resList[[i]] <- data.frame(value=spikes, orig_col=i)
  }
}
res <- do.call(rbind, resList)
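A further option (my own sketch, not part of the answer above) is to do the whole comparison as one matrix operation, assuming timeFrame fits in memory as a numeric matrix and posCompValues[, 1] holds one threshold per column:
# Vectorised sketch: compare every column to its threshold in a single step.
# Uses >= to match the condition in the question's loop; note the logical
# matrix 'hits' is as large as the data itself, so memory may be a concern.
m    <- as.matrix(timeFrame)
hits <- sweep(m, 2, posCompValues[, 1], ">=")  # TRUE where a value reaches its column threshold
idx  <- which(hits, arr.ind = TRUE)            # row/column positions of those values
res2 <- data.frame(value = m[idx], orig_col = idx[, "col"])  # same two columns as res above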

Compare and obtain interval intersections between rows

I have a database like the following:
pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40)
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50)
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82)
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D")
data<-data.frame(pos1,pos2,chr,pop,n)
Position 1 and position 2 designate the start and end points of an interval for each chr and population. My intention is to obtain which intervals intersect between pops A, B and C (not D) and which intervals are unique to each population.
So, for the unique intervals I would have an outcome data.frame like the following:
pos1.u<-c(25,50,92,20,30,27,90)
pos2.u<-c(30,52,100,25,32,30,100)
chr.u<-c(1,1,1,2,2,2,2)
pop.u<-c("A","B","C","A","A","B","C")
n.u<-c(78,98,74,48,98,21,12)
data.u<-data.frame(pos1.u,pos2.u,chr.u,pop.u,n.u)
And for the intervals that intersects between those 3 populations a data.frame like the following:
pos1.c<-c(5,15,40,80,5,38,85,5,16,50,70,50,60,50)
pos2.c<-c(10,17,42,90,10,42,87,10,19,60,90,60,71,60)
chr.c<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2)
pop.c<-c("A","A","A","A","B","B","B","C","C","A","A","B","B","C")
n.c<-c(25,65,56,35,78,14,25,65,85,52,69,47,53,10)
data.c<-data.frame(pos1.c,pos2.c,chr.c,pop.c,n.c)
I don't know how to write a script that does precisely this, can you help me?
I think the following code does what you ask for, although it produces different results from yours - so please check it carefully! The discrepancy I think lies in the definition of open and closed intervals. The following assumes that neither end point is included, whereas I suspect this might not be what you mean (otherwise (15,18) and (17,19) would not count as overlapping, as there is no integer value that falls in both). So you might need to adjust the open/closed definitions below.
pos1<-c(5,15,25,40,80,5,18,22,38,84,5,16,50,92,31,50,20,30,50,70,27,50,60,50,90,20,40)
pos2<-c(10,17,30,42,90,10,20,24,42,87,10,19,52,100,40,70,25,32,60,90,30,60,71,60,100,25,50)
chr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
n<-c(25,65,78,56,35,78,58,98,14,25,65,85,98,74,20,36,48,98,52,69,21,47,53,10,12,37,82)
pop<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","A","A","A","A","B","B","B","C","C","D","D")
data<-data.frame(pos1,pos2,chr,pop,n,stringsAsFactors = FALSE)
library(intervals)
data<-data[data$pop!="D",] #remove irrelevant D entries
rownames(data) <- seq_len(nrow(data)) #reset rownames to allow for removed Ds
#set ints as a list of intervals (as required by the intervals package)
ints <- tapply(1:nrow(data), data$pop, function(v)
  Intervals(as.matrix(data[v, c("pos1","pos2")]),
            closed=c(FALSE,FALSE), #this is where you adjust the open/closed lower and upper ends of the intervals - TRUE means the end value is included
            type="Z")) #Z means integers
pops <- unique(data$pop) #unique values of pop
popidx <- lapply(pops,function(x) which(data$pop==x)) #list of indices of these values in data
names(popidx) <- pops
#sets is a df of all pairwise combinations to check
sets <- expand.grid(pops,pops,stringsAsFactors = FALSE)
sets <- sets[sets$Var1!=sets$Var2,]
olap <- lapply(1:nrow(sets), function(i)
  interval_overlap(ints[[sets$Var1[i]]], ints[[sets$Var2[i]]])) #list of overlaps
olap <- lapply(1:nrow(sets), function(i) {
  df <- as.data.frame(olap[[i]], stringsAsFactors=FALSE)
  df$pos1 <- as.numeric(rownames(df))
  df$pos2 <- sapply(1:nrow(df), function(j) popidx[[sets$Var2[i]]][df[j,1][[1]][1]])
  return(df)
}) #tidy up as dfs, with correct indices in data (rather than in ints)
olap <- do.call(rbind,olap)[,-1] #join dataframes
olap$olaps <- !is.na(olap$pos2) #identify those with overlaps
#group by unique pos1 and identify max and min no of overlaps with other groups
olap <- data.frame(minoverlap=tapply(olap$olaps, olap$pos1, min),
                   maxoverlap=tapply(olap$olaps, olap$pos1, max))
olap$rowno <- as.numeric(rownames(olap))
uniques <- data[olap$rowno[olap$maxoverlap==0],] #intervals appearing in just one pop
commons <- data[olap$rowno[olap$minoverlap>0],] #intervals with an overlap in all other pops

Calculate quantiles for large data

I have about 300 files, each containing 1000 time series realisations (~76 MB each file).
I want to calculate the quantiles (0.05, 0.50, 0.95) at each time step from the full set of 300000 realisations.
I cannot merge the realisations into one file because it would become too large.
What's the most efficient way of doing this?
Each matrix is generated by running a model; here is a sample containing random numbers:
x <- matrix(rexp(10000000, rate=.1), nrow=1000)
There are at least three options:
1. Are you sure it has to be from the full set? A 10% sample should be a very, very good approximation here (a rough sketch of this follows the list below).
2. 300k elements isn't that big of a vector, but a 300k x 100+ column matrix is big. Pull just the column you need into memory rather than the entire matrix (this can be repeated over every column if necessary).
3. Do it sequentially, possibly in conjunction with a smaller sample to get you started in the right ballpark. For the 5th percentile, you just need to know how many items are above the current guess and how many are below. So something like:
   - Take a 1% sample and find its 5th percentile. Jump some tolerance above and below, such that you're sure the exact 5th percentile lies in that range.
   - Read in the matrix in chunks. For each chunk, count the number of observations above the range and below the range, and retain all observations which lie within the range.
   - When you've read in the last chunk, you have three pieces of information (count above, count below, vector of observations within). One way to take a quantile is to sort the whole vector and find the nth observation; you can do that with the above pieces of information by sorting the within-range observations and finding the (n - count_below)th.
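As a rough illustration of option (1) only (my own sketch; the .rds file names and readRDS() reader are hypothetical, so adapt the loading step to however the 300 files are actually stored):
# Sketch of option (1): keep a 10% sample of the realisations (rows) from each
# file, pool them, then take the three quantiles per time step (column).
# The pooled sample is still ~10% of the full data, so check it fits in memory.
files <- list.files(pattern = "\\.rds$")  # hypothetical file names
sampled <- lapply(files, function(f) {
  m <- readRDS(f)                                     # 1000 x T matrix
  m[sample(nrow(m), nrow(m) %/% 10), , drop = FALSE]  # 10% of its rows
})
pooled <- do.call(rbind, sampled)
apply(pooled, 2, quantile, probs = c(0.05, 0.50, 0.95))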
Edit: Example of (3).
Note that I am not a champion algorithm designer and that someone has almost certainly designed a better algorithm for this. Also, this implementation is not particularly efficient. If speed matters to you, consider Rcpp, or even just more optimized R for this. Making a bunch of lists and then extracting values from them is not so smart, but it was easy to prototype this way so I went with it.
library(plyr)
set.seed(1)
# -- Configuration -- #
desiredQuantile <- .25
# -- Generate sample data -- #
# Use some algorithm (sampling, iteration, or something else) to come up with a range you're sure the true value lies within
guessedrange <- c( .2, .3 )
# Group the observations to correspond to the OP's files
dat <- data.frame( group = rep( seq(100), each=100 ), value = runif(10000) )
# -- Apply the algorithm -- #
# Count the number above/below and return the values within the range, by group
res <- dlply( dat, .( group ), function( x, guessedrange ) {
  above <- x$value > guessedrange[2]
  below <- x$value < guessedrange[1]
  list(
    aboveCount = sum( above ),
    belowCount = sum( below ),
    withinValues = x$value[ !above & !below ]
  )
}, guessedrange = guessedrange )
# Extract the count of values below and the values within the range
belowCount <- sum( sapply( res, function(x) x$belowCount ) )
belowCount
withinValues <- do.call( c, sapply( res, function(x) x$withinValues ) )
str(withinValues)
# Count up until we find the within value we want
desiredQuantileCount <- floor( desiredQuantile * nrow(dat) ) #! Should fix this so it averages when there's a tie
sort(withinValues)[ desiredQuantileCount - belowCount + 1 ]
# Compare to exact value
quantile( dat$value, desiredQuantile )
In the end, the value is a little off from the exact version. I suspect I'm shifted over by one or some equally silly explanation, but maybe I'm missing something fundamental.

lapply function with 2 count variables

I am very green in R, so there is probably a very easy solution to this:
I want to calculate the average correlation between the column vectors in a square matrix:
x <- matrix(rnorm(10000), ncol=100)
aux <- matrix(seq(1, 10000))
loop <- sapply(aux, function(i,j) cov(x[,i], x[,j]))
cor_x <- mean(loop)
When evaluating the sapply line I get the error 'subscript out of bounds'. I know I can do this via a script but is there any way to achieve this in one line of code?
No need for any loops. Just use mean(cov(x)), which does this very efficiently.
The problem is due to aux. The variable aux has to range from 1 to 100 since you have 100 columns. But your aux is a sequence along all the elements of x and hence ranges from 1 to 10000. It will work with the following code:
aux <- seq(1, 100)
# build the full 100 x 100 matrix of pairwise covariances
loop <- sapply(aux, function(i) sapply(aux, function(j) cov(x[, i], x[, j])))
Afterwards, you can calculate mean covariance with:
cor_x <- mean(loop)
If you want to exclude duplicate fields (e.g., cov(X,Y) is inherently identical to cov(Y,X)), you can use:
cor_x <- mean(loop[upper.tri(loop, diag = TRUE)])
If you also want to exclude cov(X,X), i.e., variance, you can use:
cor_x <- mean(loop[upper.tri(loop)])
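As a quick sanity check (my own addition), the element-wise mean of the full pairwise matrix built above should agree with the loop-free mean(cov(x)) from the first answer:
# The nested-sapply matrix and cov(x) contain the same 100 x 100 covariances,
# so their means should match.
all.equal(mean(loop), mean(cov(x)))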
