Imputing missing values keeping the circular trend in mind (R)

Think of a picture of a sunrise, where a red circle is surrounded by a thick yellow ring against a blue background. Code red as 3, yellow as 2 and blue as 1.
11111111111
11111211111
11112221111
11222322211
22223332222
11222322221
11112221111
11111211111
This is the desired output, but the data file has missing values (30% of all elements are missing).
How can we impute the missing values so as to recover this desired output, keeping the circular trend in mind?

This is how I would solve a problem of this sort in a very simple, straightforward way. Please note that I corrected your sample data above to be symmetric:
d <- read.csv(header=F, stringsAsFactors=F, text="
1,1,1,1,1,1,1,1,1,1,1
1,1,1,1,1,2,1,1,1,1,1
1,1,1,1,2,2,2,1,1,1,1
1,1,2,2,2,3,2,2,2,1,1
2,2,2,2,3,3,3,2,2,2,2
1,1,2,2,2,3,2,2,2,1,1
1,1,1,1,2,2,2,1,1,1,1
1,1,1,1,1,2,1,1,1,1,1
")
library(raster)
## Plot original data as raster:
d <- raster(as.matrix(d))
plot(d, col=colorRampPalette(c("blue","yellow","red"))(255))
## Simulate 30% missing data:
d_m <- d
d_m[ sample(1:length(d), length(d)/3) ] <- NA
plot(d_m, col=colorRampPalette(c("blue","yellow","red"))(255))
## Construct a 3x3 filter for mean filling of missing values:
filter <- matrix(1, nrow=3, ncol=3)
## Fill in only missing values with the mean of the values within
## the 3x3 moving window specified by the filter. Note that this
## could be replaced with a median/mode or some other whole-number
## generating summary statistic:
r <- focal(d_m, filter, mean, na.rm=T, NAonly=T, pad=T)
## Plot imputed data:
plot(r, col=colorRampPalette(c("blue","yellow","red"))(255), zlim=c(1,3))
This is an image of the original sample data:
With 30% missing values simulated:
And only those missing values interpolated with the mean of the 3x3 moving window:
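As the comment in the code says, the mean can be swapped for another summary statistic. Here is a minimal sketch of a median variant (r_med is my naming; the median tends to stay closer to the original integer classes):
## Fill missing values with the median of the 3x3 window instead:
r_med <- focal(d_m, filter, median, na.rm=T, NAonly=T, pad=T)
plot(r_med, col=colorRampPalette(c("blue","yellow","red"))(255), zlim=c(1,3))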

Here I compare Forrest's approach with a thin plate spline (TPS). Their performance is about the same, depending on the sample. The TPS could be preferable if the gaps were so large that focal could no longer estimate values, but in that case you could also use a larger (and perhaps Gaussian, see ?focalWeight) filter.
d <- matrix(c(
1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,2,1,1,1,1,1,
1,1,1,1,2,2,2,1,1,1,1,
1,1,2,2,2,3,2,2,2,1,1,
2,2,2,2,3,3,3,2,2,2,2,
1,1,2,2,2,3,2,2,2,1,1,
1,1,1,1,2,2,2,1,1,1,1,
1,1,1,1,1,2,1,1,1,1,1), ncol=11, byrow=TRUE)
library(raster)
d <- raster(d)
plot(d, col=colorRampPalette(c("blue","yellow","red"))(255))
## Simulate 30% missing data:
set.seed(1)
d_m <- d
d_m[ sample(1:length(d), length(d)/3) ] <- NA
plot(d_m, col=colorRampPalette(c("blue","yellow","red"))(255))
# Forrest's solution:
filter <- matrix(1, nrow=3, ncol=3)
r <- focal(d_m, filter, mean, na.rm=T, NAonly=T, pad=T)
# an alternative:
rp <- rasterToPoints(d_m)
library(fields)
# thin plate spline interpolation
#(for a simple pattern like this, IDW might work, see ?interpolate)
tps <- Tps(rp[,1:2], rp[,3])
# predict
x <- interpolate(d_m, tps)
# use the original values where available
m <- cover(d_m, x)
i <- is.na(d_m)
cor(d[i], m[i])
## [1] 0.8846869
cor(d[i], r[i])
## [1] 0.8443165
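And a sketch of the larger-filter fallback mentioned above, for when the gaps grow too big for the 3x3 window (the 5x5 size is an arbitrary choice; a Gaussian weight matrix could be built with focalWeight instead):
## Mean filling over a wider 5x5 moving window:
filter5 <- matrix(1, nrow=5, ncol=5)
r5 <- focal(d_m, filter5, mean, na.rm=T, NAonly=T, pad=T)
cor(d[i], r5[i])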

Related

calculating area of most suitable raster habitat in R

I have run Maxent for multiple species under present conditions and also under future climate change scenarios. I was quantifying changes between present and future suitable habitat using the nicheOverlap function and Schoener's D statistic. Quite a few of the organisms in my study are just moving farther up mountains so there is a lot of overlap as the future distribution is inside the present distribution (just occupying less area at higher elevations). By looking at the ascii files in QGIS I can see that there is less suitable habitat in terms of area in the future, so I want to quantify this. I have scoured the internet for a good way to calculate area for rasters and never found anything that perfectly suited my fancy. I therefore wrote up something that is an amalgamation of bits and pieces of various scripts. It is pasted below.
Two questions:
1) do you all agree this is doing what I think it is doing (calculating area in square kilometers)?
2) is there a way to simplify this? Specifically, you'll see I go from a raster to a dataframe and back to a raster. Maybe I could stay in rasters?
Thanks for any input!
Rebecca
####
library(raster)
#load rasters
m <- raster("SpeciesA_avg.asc")
mf <- raster("SpeciesA_future_layers_avg.asc")
#change to dataframe
m.df <- as.data.frame(m, xy=TRUE)
#get rid of NAs
m.df1 <- na.omit(m.df)
#keep only cells that have a suitability score above 0.5 (scores range from 0 to 1)
m.df2 <- m.df1[m.df1$SpeciesA_avg> 0.5,]
#re-rasterize just the suitable area
m.raster <- rasterFromXYZ(m.df2)
##same as above but for future projection
mf.df <- as.data.frame(mf, xy=TRUE)
mf.df1 <- na.omit(mf.df)
mf.df2 <- mf.df1[mf.df1$SpeciesA_future_layers_avg>0.5,]
mf.raster <-rasterFromXYZ(mf.df2)
#get sizes of all cells in current distribution raster
#note my original layers were 30 seconds or 1 km2.
cell_size<-area(m.raster, na.rm=TRUE, weights=FALSE)
#delete NAs from all raster cells. It looks like these come back when switching from dataframe to raster
cell_size1<-cell_size[!is.na(cell_size)]
#compute area [km2] of all cells in raster
raster_area_present<-length(cell_size1)*median(cell_size1)
raster_area_present
#get sizes of all cells in future raster [km2]
cell_size<-area(mf.raster, na.rm=TRUE, weights=FALSE)
#delete NAs from vector of all raster cells
cell_size1<-cell_size[!is.na(cell_size)]
#compute area [km2] of all cells in geo_raster
raster_area_future<-length(cell_size1)*median(cell_size1)
raster_area_future
##calculate change in area
dif_area <- raster_area_present - raster_area_future
dif_area
When you ask a question, you should provide a simple self-contained example, not just dump a script that points to files we do not have. Writing a simple example teaches you R, and often helps you solve the problem by yourself. Anyway, here is some example data and a solution to your problem, I think:
library(raster)
#example data
m <- mf <- raster(ncol=10, nrow=10, vals=0)
m[,1] <- NA
m[,3:7] <- 1
mf[,6:9] <- 1
# get rid of NAs (the example has none); should not be needed
m <- reclassify(m, cbind(NA, NA, 0))
mf <- reclassify(mf, cbind(NA, NA, 0))
# keep cells > 0.5 (scores range from 0 to 1)
m <- round(m)
mf <- round(mf)
# now combine the two layers, for example:
x <- m + mf * 10
# area of each cell
a <- area(x)
# sum area by class
z <- zonal(a, x, sum)
z
#      zone     value
# [1,]    0 152327547
# [2,]    1 152327547
# [3,]   10 101551698
# [4,]   11 101551698
Zone 0 is "neither current nor future", 1 is "current only", 10 is "future only", and 11 is "current and future".
The areas are in m^2.
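If km2 are more convenient, you can rescale the summed areas (a small sketch; column 2 of z holds the sums):
## m^2 to km^2:
cbind(zone=z[,1], km2=z[,2]/1e6)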
You may want to check out this tutorial on maxent and other spatial distribution models: http://rspatial.org/sdm/
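And for the "maybe I could stay in rasters?" part of the question, here is a minimal sketch of the present-day area computation without the data.frame round-trip, assuming the same file name as above and a lon/lat raster (untested against the real files):
library(raster)
m <- raster("SpeciesA_avg.asc")
suitable <- m > 0.5      # 0/1 raster; NA cells stay NA
a <- area(m)             # km2 per cell for a lon/lat raster
## suitable cells contribute their area, all others contribute 0:
raster_area_present <- cellStats(a * suitable, sum)
raster_area_present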

Draw a heatmap with "super big" matrix

I want to draw a heatmap.
I have a 100k x 100k square matrix (a 50 GB csv; values in the upper triangle, the rest filled with 0).
How can I draw a heatmap of this huge dataset with R?
I'm trying this code on a large-RAM machine:
d = read.table("data.csv", sep=",")
d = as.matrix(d + t(d))
heatmap(d)
I tried some alternatives like heatmap.2 (in gplots), but they take too much time and memory.
What I suggest is to heavily down-sample your matrix before plotting the heatmap, e.g. by taking the mean of each submatrix (as suggested by @IaroslavDomin):
# example of big mx 10k x 10 k
bigMx <- matrix(rnorm(10000*10000,mean=0,sd=100),10000,10000)
# here we downsample the big matrix 10k x 10k to 100x100
# by averaging each submatrix
downSampledMx <- matrix(NA,100,100)
subMxSide <- nrow(bigMx)/nrow(downSampledMx)
for(i in 1:nrow(downSampledMx)){
  rowIdxs <- ((subMxSide*(i-1)):(subMxSide*i-1))+1
  for(j in 1:ncol(downSampledMx)){
    colIdxs <- ((subMxSide*(j-1)):(subMxSide*j-1))+1
    downSampledMx[i,j] <- mean(bigMx[rowIdxs,colIdxs])
  }
}
# Rowv/Colv = NA to disable the dendrograms
heatmap(downSampledMx,Rowv=NA,Colv=NA)
With your huge matrix it will certainly take a while to compute downSampledMx, but it should be feasible.
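If even reading the full 50 GB csv is a problem, the same block-averaging can be done chunk-wise, so that only one block of rows is in memory at a time. A rough sketch, assuming a headerless 100k x 100k file and a 1000 x 1000 target, and ignoring the d + t(d) symmetrization (untested at that scale):
n <- 100000; target <- 1000; side <- n/target
downSampledMx <- matrix(NA, target, target)
con <- file("data.csv", "r")
for(i in 1:target){
  ## read the next 'side' rows from the open connection
  block <- as.matrix(read.table(con, sep=",", nrows=side))
  for(j in 1:target){
    downSampledMx[i,j] <- mean(block[, ((j-1)*side+1):(j*side)])
  }
}
close(con)
heatmap(downSampledMx, Rowv=NA, Colv=NA)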
EDIT:
I think downsampling should preserve recognizable "macro-patterns", e.g. see the following example:
# create a matrix with some recognizable pattern
set.seed(123)
bigMx <- matrix(rnorm(50*50,mean=0,sd=100),50,50)
diag(bigMx) <- max(bigMx) # set maximum value on the diagonal
# set maximum value on a circle centered on the middle
for(i in 1:nrow(bigMx)){
  for(j in 1:ncol(bigMx)){
    if(abs((i - 25)^2 + (j - 25)^2 - 10^2) <= 16)
      bigMx[i,j] <- max(bigMx)
  }
}
# plot the original heatmap
heatmap(bigMx,Rowv=NA,Colv=NA, main="original")
# function used to down sample
downSample <- function(m,newSize){
  downSampledMx <- matrix(NA,newSize,newSize)
  subMxSide <- nrow(m)/nrow(downSampledMx)
  for(i in 1:nrow(downSampledMx)){
    rowIdxs <- ((subMxSide*(i-1)):(subMxSide*i-1))+1
    for(j in 1:ncol(downSampledMx)){
      colIdxs <- ((subMxSide*(j-1)):(subMxSide*j-1))+1
      downSampledMx[i,j] <- mean(m[rowIdxs,colIdxs])
    }
  }
  return(downSampledMx)
}
# downsample x 2 and plot heatmap
downSampledMx <- downSample(bigMx,25)
heatmap(downSampledMx,Rowv=NA,Colv=NA, main="downsample x 2")
# downsample x 5 and plot heatmap
downSampledMx <- downSample(bigMx,10)
heatmap(downSampledMx,Rowv=NA,Colv=NA, main="downsample x 5")
Here are the three heatmaps:

3-d point matching/clustering in R

I would like to match points in 3-dimensional space.
Therefore, I am using the Hungarian Method described in this question: Finding the best matching pairwise points from 2 vectors
Here is my example using R:
# packages
library(rgl)
library(clue)
library(plyr)
library(fields)
set.seed(1)
a <- c(rep(2,7), 3,4,5,6,3,4,5,6,7,7,7,7,7,7) # x values
b <- c(rep(3,7),3,3,3,3, 3,3,3,3,3,3,3,3,3,3) # y values
c <- c(seq(1,7),1,1,1,1,7,7,7,7,1,2,3,4,5,6) # z values
# transform the points
set.seed(2)
a1 <- a + seq(1,length(a))
b1 <- b + 8
c1 <- c + 9
# plot the data
plot3d(a,b,c, col="red", pch=16,size=10)
plot3d(a1,b1,c1, lwd=10, col="blue", pch=16,size=10, add=TRUE)
# run the Hungarian Method
A <- cbind(a,b,c)
B <- cbind(a1,b1,c1)
distances <- rdist(A,B) # calculate Euclidean Distance between points
min.dist <- solve_LSAP(distances) # minimizing the sum of distance
min.dist.num <- as.numeric(min.dist)
# plot the minimized lines between point sets
for (ii in 1:dim(B)[1]){
  D <- c(A[ii,1], B[min.dist.num[ii],1])
  R <- c(A[ii,2], B[min.dist.num[ii],2])
  W <- c(A[ii,3], B[min.dist.num[ii],3])
  segments3d(D,R,W,col=2,lwd=1)
}
# calculate the share of points that is matched correctly
sum(1:dim(B)[1]==min.dist.num)/dim(B)[1]* 100
The problem here is that only 5% of the points are matched correctly (see last line of the code). In my view, the main trouble is that the algorithm does not take the structure of the object (a square) into account.
Question: Is there any method that performs better for this sample data?
In my original data, the dimensional structure of the points is way more complicated. I have a cloud of data and within this cloud there are multiple subfigures.
I am primarily seeking a solution in R, but other implementations (e.g. MATLAB, Excel, Java) are also welcome.

calculate average correlation for neighboring pixels through time

I have a stack of 4 rasters. I would like the average correlation through time between a pixel and each of its 8 neighbors.
some data:
library(raster)
r1=raster(matrix(runif(25),nrow=5))
r2=raster(matrix(runif(25),nrow=5))
r3=raster(matrix(runif(25),nrow=5))
r4=raster(matrix(runif(25),nrow=5))
s=stack(r1,r2,r3,r4)
So for a pixel at position x, which has 8 neighbors at the NE, E, SE, S, etc. positions, I want the average of
cor(x,NE)
cor(x,E)
cor(x,SE)
cor(x,S)
cor(x,SW)
cor(x,W)
cor(x,NW)
cor(x,N)
with the average value saved at position x in the resulting raster. The edge cells would be NA, or, if possible, a flag to calculate the average correlation just with the cells they touch (either 3 or 5 cells).
Thanks!
I don't believe @Pascal's suggestion of using focal() could work, because focal() takes a single raster layer as an argument, not a stack. Here is the solution that is easiest to understand. It could be made more efficient by minimizing the number of times you extract values for each focal cell (see the sketch after the code):
library(raster)
set.seed(2002)
r1 <- raster(matrix(runif(25),nrow=5))
r2 <- raster(matrix(runif(25),nrow=5))
r3 <- raster(matrix(runif(25),nrow=5))
r4 <- raster(matrix(runif(25),nrow=5))
s <- stack(r1,r2,r3,r4)
## Calculate adjacent raster cells for each focal cell:
a <- adjacent(s, 1:ncell(s), directions=8, sorted=T)
## Create column to store correlations:
out <- data.frame(a)
out$cors <- NA
## Loop over all focal cells and their adjacencies,
## extract the values across all layers and calculate
## the correlation, storing it in the appropriate row of
## our output data.frame:
for (i in 1:nrow(a)) {
  out$cors[i] <- cor(c(s[a[i,1]]), c(s[a[i,2]]))
}
## Take the mean of the correlations by focal cell ID:
r_out_vals <- aggregate(out$cors, by=list(out$from), FUN=mean)
## Create a new raster object to store our mean correlations in
## the focal cell locations:
r_out <- s[[1]]
r_out[] <- r_out_vals$x
plot(r_out)
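As for the efficiency note above, the cell values can be extracted from the stack once, after which each correlation is plain matrix indexing. A sketch:
## values() returns a ncell x nlayers matrix, one row per cell:
v <- values(s)
out$cors <- sapply(1:nrow(a), function(i) cor(v[a[i,1],], v[a[i,2],]))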

Graph to compare two matrices in R

I have two matrices (of approximately 300 x 100) and I would like to plot a graph to see the parts of the first one that are higher than those of the second.
I can do, for instance:
# Calculate the matrices and put them into m1 and m2
# Note that the values are between -1 and 1
par(mfrow=c(1,3))
image(m1, zlim=c(-1,1))
image(m2, zlim=c(-1,1))
image(m1-m2, zlim=c(0,1))
This will plot only the desired regions in the 3rd plot but I would like to do something a bit different, like putting a line around those areas over the first plot in order to highlight them directly there.
Any idea how I can do that?
Thank you
nico
How about:
par(mfrow = c(1, 3))
image(m1, zlim = c(-1, 1))
contour(m1 - m2, add = TRUE)
image(m2, zlim = c(-1, 1))
contour(m1 - m2, add = TRUE)
image(m1 - m2, zlim = c(0, 1))
contour(m1 - m2, add = TRUE)
This adds a contour map around the regions. Sort of puts rings around the areas of the 3rd plot (might want to fiddle with the (n)levels of the contours to get fewer 'circles').
Another way of doing your third image might be:
image(m1>m2)
This produces a matrix of TRUE/FALSE values which gets imaged as 0/1, so you have a two-colour image. Still not sure about your 'putting a line around' thing though...
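On the earlier point about fiddling with the contour levels: to get a single outline rather than many rings, you can fix the levels explicitly. A sketch (the 0.5 level is an arbitrary choice):
image(m1, zlim = c(-1, 1))
contour(m1 - m2, add = TRUE, levels = 0.5, drawlabels = FALSE)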
Here's some code I wrote to do something similar. I wanted to highlight contiguous regions above a 0.95 threshold by drawing a box round them, so I got all the grid squares above 0.95 and ran a clustering on them. Then a bit of fiddling with the clustering output gives the rectangle coordinates of the regions:
computeHotspots = function(xyz, thresh, minsize=1, margin=1){
  ### given a list(x,y,z), return a data frame where each row
  ### is a (xmin,xmax,ymin,ymax) bounding box of a contiguous area
  ### over the given threshold, or approximately: let's use the
  ### clustering tools in R...
  overs <- which(xyz$z>thresh,arr.ind=T)
  if(length(overs)==0){
    ## found no hotspots
    return(NULL)
  }
  if(length(overs)==2){
    ## found one hotspot
    xRange <- cbind(xyz$x[overs[,1]],xyz$x[overs[,1]])
    yRange <- cbind(xyz$y[overs[,2]],xyz$y[overs[,2]])
  }else{
    oTree <- hclust(dist(overs),method="single")
    oCut <- cutree(oTree,h=10)
    oXYc <- data.frame(x=xyz$x[overs[,1]],y=xyz$y[overs[,2]],oCut)
    xRange <- do.call("rbind",tapply(oXYc[,1],oCut,range))
    yRange <- do.call("rbind",tapply(oXYc[,2],oCut,range))
  }
  ### add user margins
  xRange[,1] <- xRange[,1]-margin
  xRange[,2] <- xRange[,2]+margin
  yRange[,1] <- yRange[,1]-margin
  yRange[,2] <- yRange[,2]+margin
  ## enforce the minimum box size
  xr <- apply(xRange,1,diff)
  xm <- apply(xRange,1,mean)
  xRange[xr<minsize,1] <- xm[xr<minsize]-(minsize/2)
  xRange[xr<minsize,2] <- xm[xr<minsize]+(minsize/2)
  yr <- apply(yRange,1,diff)
  ym <- apply(yRange,1,mean)
  yRange[yr<minsize,1] <- ym[yr<minsize]-(minsize/2)
  yRange[yr<minsize,2] <- ym[yr<minsize]+(minsize/2)
  cbind(xRange,yRange)
}
Test code:
x=1:23
y=7:34
m1=list(x=x,y=y,z=outer(x,y,function(x,y){sin(x/3)*cos(y/3)}))
image(m1)
hs = computeHotspots(m1,0.95)
That should give you a matrix of rectangle coordinates:
> hs
  [,1] [,2] [,3] [,4]
1   13   15    8   11
2    3    6   17   20
3   22   24   18   20
4   13   16   27   30
Now you can draw them over the image with rect:
image(m1)
rect(hs[,1],hs[,3],hs[,2],hs[,4])
and to show they are where they should be:
image(list(x=m1$x,y=m1$y,z=m1$z>0.95))
rect(hs[,1],hs[,3],hs[,2],hs[,4])
You could of course adapt this to draw circles, but more complex shapes would be tricky. It works best when the regions of interest are fairly compact.
Barry
