Draw a heatmap with "super big" matrix - r

I want to draw a heatmap.
I have a 100k x 100k square matrix (a 50 GB CSV; values in the upper-right triangle, the rest filled with 0).
How can I draw a heatmap in R with this huge dataset?
I'm trying this code on a machine with a lot of RAM:
d = read.table("data.csv", sep=",")
d = as.matrix(d + t(d))
heatmap(d)
I also tried alternatives such as heatmap.2 (in gplots).
But they take too much time and memory.

What I suggest is to heavily downsample your matrix before plotting the heatmap, e.g. by taking the mean of each submatrix (as suggested by @IaroslavDomin):
# example of a big 10k x 10k matrix
bigMx <- matrix(rnorm(10000*10000, mean=0, sd=100), 10000, 10000)
# here we downsample the big 10k x 10k matrix to 100 x 100
# by averaging each submatrix
downSampledMx <- matrix(NA, 100, 100)
subMxSide <- nrow(bigMx) / nrow(downSampledMx)
for(i in 1:nrow(downSampledMx)){
  rowIdxs <- ((subMxSide*(i-1)):(subMxSide*i-1)) + 1
  for(j in 1:ncol(downSampledMx)){
    colIdxs <- ((subMxSide*(j-1)):(subMxSide*j-1)) + 1
    downSampledMx[i,j] <- mean(bigMx[rowIdxs, colIdxs])
  }
}
# NA to disable the dendrograms
heatmap(downSampledMx, Rowv=NA, Colv=NA)
Admittedly, with your huge matrix it will take a while to compute downSampledMx, but it should be feasible.
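Since a 50 GB CSV will not fit comfortably in memory to begin with, here is a rough sketch (my own addition, untested at this scale) of how you could compute the downsampled matrix while streaming the file in row blocks with data.table::fread, so the full 100k x 100k matrix never has to be held in RAM. The dimensions and block size below are assumptions:
library(data.table)
n         <- 100000              # full matrix dimension (assumed)
newSize   <- 1000                # target heatmap resolution (assumed)
blockRows <- n / newSize         # rows per block (here 100)
downSampledMx <- matrix(NA_real_, newSize, newSize)
for(i in 1:newSize){
  # read only the rows belonging to block-row i
  block <- as.matrix(fread("data.csv", sep=",", header=FALSE,
                           skip=(i-1)*blockRows, nrows=blockRows))
  for(j in 1:newSize){
    colIdxs <- ((j-1)*blockRows + 1):(j*blockRows)
    downSampledMx[i,j] <- mean(block[, colIdxs])
  }
}
# symmetrizing the downsampled matrix gives the same result as
# downsampling d + t(d), since the block mean is linear
downSampledMx <- downSampledMx + t(downSampledMx)
heatmap(downSampledMx, Rowv=NA, Colv=NA)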
EDIT:
I think downsampling should preserve recognizable "macro-patterns"; see the following example:
# create a matrix with some recognizable pattern
set.seed(123)
bigMx <- matrix(rnorm(50*50, mean=0, sd=100), 50, 50)
diag(bigMx) <- max(bigMx) # set maximum value on the diagonal
# set maximum value on a circle centered on the middle
for(i in 1:nrow(bigMx)){
  for(j in 1:ncol(bigMx)){
    if(abs((i - 25)^2 + (j - 25)^2 - 10^2) <= 16)
      bigMx[i,j] <- max(bigMx)
  }
}
# plot the original heatmap
heatmap(bigMx,Rowv=NA,Colv=NA, main="original")
# function used to downsample
downSample <- function(m, newSize){
  downSampledMx <- matrix(NA, newSize, newSize)
  subMxSide <- nrow(m) / nrow(downSampledMx)
  for(i in 1:nrow(downSampledMx)){
    rowIdxs <- ((subMxSide*(i-1)):(subMxSide*i-1)) + 1
    for(j in 1:ncol(downSampledMx)){
      colIdxs <- ((subMxSide*(j-1)):(subMxSide*j-1)) + 1
      downSampledMx[i,j] <- mean(m[rowIdxs, colIdxs])
    }
  }
  return(downSampledMx)
}
# downsample x 2 and plot heatmap
downSampledMx <- downSample(bigMx,25)
heatmap(downSampledMx,Rowv=NA,Colv=NA, main="downsample x 2")
# downsample x 5 and plot heatmap
downSampledMx <- downSample(bigMx,10)
heatmap(downSampledMx,Rowv=NA,Colv=NA, main="downsample x 5")
Here are the 3 heatmaps:

Related

R: Sample a matrix for cells close to a specified position

I'm trying to find sites to collect snails by using a semi-random selection method. I have set a 10 km2 grid around the region I want to collect snails from, which is broken into 10,000 10 m2 cells. I want to randomly sample this grid in R to select 200 field sites.
Randomly sampling a matrix in R is easy enough:
dat <- matrix(1:10000, nrow = 100)
sample(dat, size = 200)
However, I want to bias the sampling to pick cells closer to a single position (representing sites closer to the research station). It's easier to explain this with an image:
The yellow cell with a cross represents the position I want to sample around. The grey shading is the probability of picking a cell in the sample function, with darker cells being more likely to be sampled.
I know I can specify sampling probabilities using the prob argument in sample, but I don't know how to create a 2D probability matrix. Any help would be appreciated; I don't want to do this by hand.
I'm going to do this for a 9 x 6 grid (54 cells), just so it's easier to see what's going on, and sample only 5 of these 54 cells. You can modify this to a 100 x 100 grid where you sample 200 from 10,000 cells.
# Number of rows and columns of the grid (modify these as required)
nx <- 9 # rows
ny <- 6 # columns
# Create coordinate matrix
x <- rep(1:nx, each=ny);x
y <- rep(1:ny, nx);y
xy <- cbind(x, y); xy
# Where is the station? (edit: not snails nest)
Station <- rbind(c(x=3, y=2)) # Change as required
# Determine distance from each grid location to the station
library(SpatialTools)
D <- dist2(xy, Station)
From the help page of dist2:
dist2 takes the matrices of coordinates coords1 and coords2 and returns the inter-Euclidean distances between coordinates.
We can visualize this using the image function.
XY <- (matrix(D, nr=nx, byrow=TRUE))
image(XY) # axes are scaled to 0-1
# Create a scaling function - scales x to lie in [0-1)
scale_prop <- function(x, m=0)
(x - min(x)) / (m + max(x) - min(x))
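As a quick illustration of the m argument (my own addition, not part of the original answer): with m=0 the maximum scales to exactly 1, while m=1 keeps it strictly below 1.
scale_prop(c(0, 5, 10))        # 0.0 0.5 1.0
scale_prop(c(0, 5, 10), m=1)   # 0.0000000 0.4545455 0.9090909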
# Add the coordinates to the grid
text(x=scale_prop(xy[,1]), y=scale_prop(xy[,2]), labels=paste(xy[,1],xy[,2],sep=","))
Lighter tones indicate grids closer to the station at (3,2).
Sampling probabilities will be based on the distance from the station, scaled to lie in [0, 1) and subtracted from 1 so that closer cells get higher probabilities. We don't want the maximum distance to scale to exactly 1 (which would give that cell zero probability), hence m=1.
prob <- 1 - scale_prop(D, m=1); range (prob)
# Sample from the grid using given probabilities
sam <- sample(1:nrow(xy), size = 5, prob=prob) # Change size as required.
xy[sam,] # These are your (well, my) 5 samples
     x y
[1,] 4 4
[2,] 7 1
[3,] 3 2
[4,] 5 1
[5,] 5 3
To confirm the sample probabilities are correct, you can simulate many samples and see which coordinates were sampled the most.
snail.sam <- function(nsamples) {
  sam <- sample(1:nrow(xy), size = nsamples, prob=prob)
  apply(xy[sam,], 1, function(x) paste(x[1], x[2], sep=","))
}
SAMPLES <- replicate(10000, snail.sam(5))
tab <- table(SAMPLES)
cols <- colorRampPalette(c("lightblue", "darkblue"))(max(tab))
barplot(table(SAMPLES), horiz=TRUE, las=1, cex.names=0.5,
        col=cols[tab])
If using a 100 x 100 grid and the station is located at coordinates (60,70), then the image would look like this, with the sampled grids shown as black dots:
There is a tendency for the points to be located close to the station, although the sampling variability may make this difficult to see. If you want to give even more weight to grids near the station, you can rescale the probabilities; that is fine for saving travel costs, but these weights need to be incorporated into the analysis when estimating the number of snails in the whole region. Here I've cubed the probabilities just so you can see what happens.
sam <- sample(1:nrow(xy), size = 200, prob=prob^3)
The tendency for the points to be located near the station is now more obvious.
There may be a better way than this, but a quick way to do it is to randomly sample both the x and y coordinates from a distribution (I used the normal, bell-shaped, distribution, but you can really use any). The trick is to make the mean of the distribution the position of the research station. You can change the bias towards the research station by changing the standard deviation of the distribution.
Then use the randomly selected positions as your x and y coordinates to select the positions.
dat <- matrix(1:10000, nrow = 100)
#randomly selected a position for the research station
rs <- c(80,30)
# you can change the sd to change the bias
x <- round(rnorm(400,mean = rs[1], sd = 10))
y <- round(rnorm(400, mean = rs[2], sd = 10))
position <- rep(NA, 200)
j = 1
i = 1
# As some of the numbers sampled can be outside of the area you want, I oversampled
# and then only selected the first 200 that were in the area of interest.
while (j <= 200) {
  if(x[i] > 0 & x[i] < 100 & y[i] > 0 & y[i] < 100){
    position[j] <- dat[x[i], y[i]]
    j = j + 1
  }
  i = i + 1
}
Plot the results:
plot(x,y, pch = 19)
points(x =80,y = 30, col = "red", pch = 19) # position of the station

raster calculation with condition of each cell by layers in R

I have a raster stack with several layers. I want to calculate, for each cell, the sum over a different selection of layers, and finally generate a new layer. Does anyone have a good suggestion using calc, overlay, or some other raster calculation in R?
I can do the calculation with loops, but it takes a long time when I have many layers and also uses a lot of memory. My script is as follows:
library(raster)
make_calc <- function(rr, start, end) {
  rr    <- as.array(rr)
  start <- as.array(start)
  end   <- as.array(end)
  dms <- dim(rr)
  tmp <- array(NA, dim = dms[1:2])
  for (i in 1:dms[1]) {
    for (j in 1:dms[2]) {
      tmp[i,j] <- sum(rr[i,j, start[i,j,1]:end[i,j,1]], na.rm = TRUE)
    }
  }
  return(tmp)
}
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
start <- raster(res = 10)
start[] <- sample(1:2, ncell(start), replace = TRUE)
end <- raster(res = 10)
end[] <- sample(3:4, ncell(end), replace = TRUE)
result <- make_calc(rr, start, end)
Why are you coercing into arrays? You can easily collapse a raster into a vector, but that does not even seem necessary here. In the future, please try to be clearer about what your expected outcome is.
Based on your code, I really don't know what you are getting at, so I am going to take a few guesses: summing specified rasters in the stack, drawing a random sample across the rasters to be summed, and finally drawing a random sample of cells to be summed.
For a sum on specified rasters in a stack, you can just index what you are after in the stack using a double bracket. In this case, rasters 1 and 3 in the stack would be the only ones summed.
library(raster)
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
( sum_1_3 <- calc(rr[[c(1,3)]], sum) )
If you are wanting a random sample of the values across rasters, for every cell, you could write a function that is passed to calc. Here is an example that grabs a random sample of n size, across the raster layers values and sums them.
rs.sum <- function(x, n=2) {sum( x[sample(1:length(x),n)], na.rm=TRUE)}
rs.sum.raster <- calc(rr, rs.sum)
If you want to apply a function to a limited random selection of cells, you can create a random sample of cell indices and use it as an index. Here we create a random sample of cells, create an empty raster, and assign the sum of rasters 1 and 2 (in the stack) at the sampled cells. A raster in the stack is indexed using double brackets and its values are indexed using single brackets, so for raster 1 in the stack, limited to the values in the random sample, you would use: rr[[1]][rs]
rs <- sample(1:ncell(rr[[1]]), 300)
r.sum <- rr[[1]]
r.sum[] <- NA
r.sum[rs] <- rr[[1]][rs] + rr[[2]][rs]
plot(r.sum)
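Coming back to your original start/end layer sums: a rough vectorized sketch (my own suggestion, treat it as untested) avoids the double loop by pulling the stack values into a cells-by-layers matrix, masking every value whose layer index falls outside [start, end] for that cell, and taking row sums:
vals <- values(rr)                       # ncell x nlayers matrix
s    <- values(start)
e    <- values(end)
idx  <- matrix(1:nlayers(rr), nrow=ncell(rr), ncol=nlayers(rr), byrow=TRUE)
vals[idx < s | idx > e] <- NA            # drop layers outside [start, end] per cell
result <- raster(rr)                     # empty raster with the same geometry
values(result) <- rowSums(vals, na.rm=TRUE)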

Imputing missing values keeping circular trend in mind

Think of a picture of a sunrise where a red circle is surrounded by a thick yellow ring and then a blue background. Take red as 3, yellow as 2, and blue as 1.
11111111111
11111211111
11112221111
11222322211
22223332222
11222322221
11112221111
11111211111
This is the desired output. But the record/file/data has missing values (30% of all elements are missing).
How can we impute the missing values so as to get this desired output, keeping the circular trend in mind?
This is how I would solve a problem of this sort in a very simple, straightforward way. Please note that I corrected your sample data above to be symmetric:
d <- read.csv(header=F, stringsAsFactors=F, text="
1,1,1,1,1,1,1,1,1,1,1
1,1,1,1,1,2,1,1,1,1,1
1,1,1,1,2,2,2,1,1,1,1
1,1,2,2,2,3,2,2,2,1,1
2,2,2,2,3,3,3,2,2,2,2
1,1,2,2,2,3,2,2,2,1,1
1,1,1,1,2,2,2,1,1,1,1
1,1,1,1,1,2,1,1,1,1,1
")
library(raster)
## Plot original data as raster:
d <- raster(as.matrix(d))
plot(d, col=colorRampPalette(c("blue","yellow","red"))(255))
## Simulate 30% missing data:
d_m <- d
d_m[ sample(1:length(d), length(d)/3) ] <- NA
plot(d_m, col=colorRampPalette(c("blue","yellow","red"))(255))
## Construct a 3x3 filter for mean filling of missing values:
filter <- matrix(1, nrow=3, ncol=3)
## Fill in only missing values with the mean of the values within
## the 3x3 moving window specified by the filter. Note that this
## could be replaced with a median/mode or some other whole-number
## generating summary statistic:
r <- focal(d_m, filter, mean, na.rm=T, NAonly=T, pad=T)
## Plot imputed data:
plot(r, col=colorRampPalette(c("blue","yellow","red"))(255), zlim=c(1,3))
This is an image of the original sample data:
With 30% missing values simulated:
And only those missing values interpolated with the mean of the 3x3 moving window:
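One caveat worth adding (my own note, not part of the original answer): if a missing cell happens to have no non-missing neighbours, a single focal pass leaves it NA; repeating the pass fills progressively larger gaps.
## keep filling until no NA cells remain
while (any(is.na(values(r)))) {
  r <- focal(r, filter, mean, na.rm=TRUE, NAonly=TRUE, pad=TRUE)
}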
Here I compare Forrest's approach with a thin plate spline (TPS). Their performance is about the same, depending on the sample. The TPS could be preferable if the gaps were larger, such that focal could not estimate anymore, but in that case you could also use a larger (and perhaps Gaussian, see ?focalWeight) filter.
d <- matrix(c(
1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,2,1,1,1,1,1,
1,1,1,1,2,2,2,1,1,1,1,
1,1,2,2,2,3,2,2,2,1,1,
2,2,2,2,3,3,3,2,2,2,2,
1,1,2,2,2,3,2,2,2,1,1,
1,1,1,1,2,2,2,1,1,1,1,
1,1,1,1,1,2,1,1,1,1,1), ncol=11, byrow=TRUE)
library(raster)
d <- raster(d)
plot(d, col=colorRampPalette(c("blue","yellow","red"))(255))
## Simulate 30% missing data:
set.seed(1)
d_m <- d
d_m[ sample(1:length(d), length(d)/3) ] <- NA
plot(d_m, col=colorRampPalette(c("blue","yellow","red"))(255))
# Forrest's solution:
filter <- matrix(1, nrow=3, ncol=3)
r <- focal(d_m, filter, mean, na.rm=T, NAonly=T, pad=T)
# an alternative:
rp <- rasterToPoints(d_m)
library(fields)
# thin plate spline interpolation
#(for a simple pattern like this, IDW might work, see ?interpolate)
tps <- Tps(rp[,1:2], rp[,3])
# predict
x <- interpolate(d_m, tps)
# use the original values where available
m <- cover(d_m, x)
i <- is.na(d_m)
cor(d[i], m[i])
## [1] 0.8846869
cor(d[i], r[i])
## [1] 0.8443165

R Surface Plot from List of X,Y,Z points

I am trying to make a surface plot for data that is in a very long list of x,y,z points. To do this, I am dividing the data into a grid of 10k squares and finding the max value of z within each square. From my understanding, each z value should be stored in a matrix where each element of the matrix corresponds to a square on the grid. Is there an easier way to do this than the code below? That last line is already pretty long and it is only one square.
x<-(sequence(101)-1)*max(eff$CFaR)/100
y<-(sequence(101)-1)*max(eff$EaR)/100
effmap<-matrix(ncol=length(x)-1, nrow=length(y)-1)
someMatrix <- max(eff$Cost[which(eff$EaR[which(eff$CFaR >= x[50] & eff$CFaR <x[51], arr.ind=TRUE)]>=y[20] & eff$EaR[which(eff$CFaR >= x[50] & eff$CFaR <x[51], arr.ind=TRUE)]< y[91])])
So this is my interpretation of what you are trying to accomplish...
df <- read.csv("effSample.csv") # downloaded from your link
df <- df[c("CFaR","EaR","Cost")] # remove unnecessary columns
df$x <- cut(df$CFaR,breaks=100,labels=FALSE) # establish bins: CFaR
df$y <- cut(df$EaR,breaks=100,labels=FALSE) # establish bins: EaR
df.max <- expand.grid(x=1:100,y=1:100) # template; 10,000 grid cells
# maximum cost in each grid cell - NOTE: most of the cells are *empty*
df.max <- merge(df.max,aggregate(Cost~x+y,df,max),all.x=TRUE)
z <- matrix(df.max$Cost,nr=100,nc=100) # Cost vector -> matrix
# colors based on z-value
palette <- rev(rainbow(20)) # palette of 20 colors
zlim <- range(z[!is.na(z)])
colors <- palette[19*(z-zlim[1])/diff(zlim) + 1]
# create the plot
library(rgl)
open3d(scale=c(1,1,10)) # CFaR and EaR range ~ 10 X Cost range
x.values <- min(df$CFaR)+(0:99)*diff(range(df$CFaR))/100
y.values <- min(df$EaR)+(0:99)*diff(range(df$EaR))/100
surface3d(x.values,y.values,z,col=colors)
axes3d()
title3d(xlab="CFaR",ylab="EaR",zlab="Cost")
The code above generates a rotatable 3D plot, so the image is just a screen shot. Notice how there are lots of "holes". This is (partially) because you provided only part of your data. However, it is important to realize that just because you imagine 10,000 grid cells (e.g., a 100 x 100 grid) does not mean that there will be data in every cell.

Choose n most evenly spread points across point dataset in R

Given a set of points, I am trying to select a subset of n points that are most evenly distributed across this set of points. In other words, I am trying to thin out the dataset while still evenly sampling across space.
So far, I have the following, but this approach likely won't do well with larger datasets. Maybe there is a more intelligent way to choose the subset of points in the first place...
The following code randomly chooses a subset of the points, and seeks to minimize the distance between the points within this subset and the points outside of this subset.
Suggestions appreciated!
evenSubset <- function(xy, n) {
  bestdist <- NA
  bestSet <- NA
  alldist <- as.matrix(dist(xy))
  diag(alldist) <- NA
  alldist[upper.tri(alldist)] <- NA
  for (i in 1:1000){
    subset <- sample(1:nrow(xy), n)
    subdists <- alldist[subset, -subset]
    distsum <- sum(subdists, na.rm=T)
    if (distsum < bestdist | is.na(bestdist)) {
      bestdist <- distsum
      bestSet <- subset
    }
  }
  return(xy[bestSet,])
}
xy <- cbind(rnorm(1000), rnorm(1000))
xy2 <- evenSubset(xy, n=20)
plot(xy)
points(xy2, col='blue', cex=1.5, pch=20)
Following @Spacedman's suggestion, I used Voronoi tessellation to identify and drop the points that were closest to other points.
Here, the percentage of points to drop is given to the function. This appears to work quite well, except that it is slow with large datasets.
library(tripack)
voronoiFilter <- function(occ, drop) {
  n <- round(x=(nrow(occ) * drop), digits=0)
  subset <- occ
  dropped <- vector()
  for (i in 1:n) {
    v <- voronoi.mosaic(x=subset[,'Longitude'], y=subset[,'Latitude'], duplicate='error')
    info <- cells(v)
    areas <- unlist(lapply(info, function(x) x$area))
    smallest <- which(areas == min(areas, na.rm=TRUE))
    dropped <- c(dropped,
                 which(paste(occ[,'Longitude'], occ[,'Latitude'], sep='_') ==
                       paste(subset[smallest,'Longitude'], subset[smallest,'Latitude'], sep='_')))
    subset <- subset[-smallest,]
  }
  return(occ[-dropped,])
}
xy <- cbind(rnorm(500),rnorm(500))
colnames(xy) <- c('Longitude','Latitude')
xy2 <- voronoiFilter(xy, drop=0.7)
plot(xy)
points(xy2,col='blue',cex=1.5,pch=20)
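If the Voronoi loop is too slow, a rough alternative sketch (my own addition, not from the original answers) is to approximate an even spread by running k-means on the coordinates and keeping the point nearest each cluster centroid:
evenSubsetKmeans <- function(xy, n) {
  km <- kmeans(xy, centers = n)
  # for each cluster, keep the point nearest its centroid
  idx <- sapply(1:n, function(k) {
    members <- which(km$cluster == k)
    d <- colSums((t(xy[members, , drop=FALSE]) - km$centers[k,])^2)
    members[which.min(d)]
  })
  xy[idx,]
}
xy3 <- evenSubsetKmeans(xy, 150)
points(xy3, col='red', cex=1.5, pch=20)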
