Why does the gstat predict() function often return NaN values? (gstat package, R version 3.3.2, Windows 10)

I am trying to simulate a combination of two different random fields (yy1 and yy2, with different means and correlation lengths) over a domain with an irregular internal boundary, using the gstat package in R. I have attached a picture of my expected outcome. The code does not produce this output consistently: I frequently get at least one of yy1 and yy2 filled with NaNs, which results in the undesired output shown in the second image.
The key steps I used are:
1) Created two gstat objects with different means and psill values (rf1 and rf2).
2) Created two computational grids (one for each random field) in the form of data frames with two variables, the “x” and “y” coordinates.
3) Predicted two random fields using unconditional simulation.
Any help in this regard would be highly appreciated.
Attachments: 2 images (link provided) and 1 R code
1) Expected Outcome
2) Undesired Outcome
library(gstat)
xy <- expand.grid(1:150, 1:200) # 150 x 200 grid is created in the form of a dataframe with x and y vectors
names(xy)<-c('x','y') # giving names to the variables
# creating gstat objects
rf1<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(1,0,0), model=vgm(psill=0.025, range=5, model='Exp'), nmax=20) # dummy=T makes this an unconditional simulation
rf2<-gstat(formula=z~1,locations=~x+y,dummy = T,beta=c(4,0,0), model=vgm(psill=0.025, range=10, model='Exp'), nmax=20) # dummy=T makes this an unconditional simulation
# creating two computational grids
rows<-nrow(xy)
xy_shift <- expand.grid(60:90, 75:100)
names(xy_shift)<-c('x','y')
library(dplyr) # for anti_join()
xy1<-xy[1:(rows/2),]
xy1<-anti_join(xy1, xy_shift, by = c("x","y")) # creating the irregular boundary
xy2<-rbind(xy[(rows/2+1):rows,],xy_shift)
library(sp)
yy1<- predict(rf1, newdata=xy1, nsim=1) # random field 1
yy2<- predict(rf2, newdata=xy2, nsim=1) # random field 2
rf1_label<-gl(1,length(yy1[,1]),labels="field1")
rf2_label<-gl(1,length(yy2[,1]),labels="field2")
yy1<-cbind(yy1,field=rf1_label)
yy2<-cbind(yy2,field=rf2_label)
yy<-rbind(yy1,yy2)
yyplot<-yy[,c(1,2,3)]
# plotting the field
gridded(yyplot) = ~x+y
spplot(obj=yyplot[1],scales=list(draw = TRUE))
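Not part of the original code, but a quick way to see whether a given run hit the NaN problem is to inspect the simulated columns directly; a minimal sketch, assuming yy1 and yy2 from above and that predict() names the single simulation sim1:
sum(!is.finite(yy1$sim1))  # 0 for a good run, > 0 for an undesired run
sum(!is.finite(yy2$sim1))
mean(!is.finite(yy1$sim1)) # proportion of field 1 that came back NaN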

Related

Clustering leads to very concentrated clusters

To understand my problem, you will need the whole dataset: https://pastebin.com/82paf0G8
Pre-processing: I had a list of orders and 696 unique item numbers, and wanted to cluster the items based on how frequently each pair of items is ordered together. For each pair of items I counted the number of times they occur within the same order; the highest count was 489 for one pair. I then "calculated" the similarity/correlation as: frequency / "max frequency of all pairs" (489). That gives the dataset I have uploaded.
Similarity/correlation: I don't know if my similarity approach is the best one in this case. I also tried something called the "Jaccard coefficient/index", but got almost the same results.
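For illustration only (this tiny example is hypothetical, not from the question), here is a sketch of how both similarity measures described above can be computed from a long table of (order_id, item) rows:
# hypothetical example data: each row is one item appearing on one order
orders <- data.frame(order_id = c(1, 1, 2, 2, 2, 3, 3),
                     item     = c("A", "B", "A", "B", "C", "B", "C"))
# incidence matrix: orders x items, 1 if the item appears on the order
inc <- as.matrix(1 * (table(orders$order_id, orders$item) > 0))
# co-occurrence counts for every pair of items (diagonal = number of orders per item)
co <- crossprod(inc)
# similarity as in the question: pair frequency / max frequency over all pairs
off <- co
diag(off) <- 0
sim <- co / max(off)
# Jaccard index: |A and B| / (|A| + |B| - |A and B|)
n_item <- diag(co)
jac <- co / (outer(n_item, n_item, "+") - co)
round(sim, 2)
round(jac, 2)
Either matrix can then be turned into a dissimilarity (e.g. as.dist(1 - sim)) in the same spirit as the code below.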
The dataset: The dataset contains material numbers V1 and V2, and N is the correlation between the two material numbers, ranging between 0 and 1.
With help from someone else, I managed to create a distance matrix and use PAM clustering.
Why PAM clustering? A data scientist suggested this: you have more than 95% of pairs without information, which puts all these materials at the same distance and produces a single, very dispersed cluster. This problem can be addressed using a PAM algorithm, but you will still have a very concentrated group. Another solution is to increase the weight of the distances other than one.
Problem 1: The matrix is only 567x567. I think that for clustering I need the full 696x696 matrix, even though a lot of the entries are zeros. But I'm not sure.
Problem 2: The clustering does not do very well. I get very concentrated clusters: a lot of items end up in the first cluster. Also, according to how you verify PAM clusters, my clustering results are poor. Is it due to the similarity analysis? What else should I use? Is it due to 95% of the data being zeros? Should I change the zeros to something else?
The whole code and results:
# Suppose X is the dataset
library(data.table) # for data.table() and dcast()
df <- data.table(X)
ss <- dcast(rbind(df, df[, .(V1 = V2, V2 = V1, N)]), V1~V2, value.var = "N")[, -1]
ss <- ss/max(ss, na.rm = TRUE)
ss[is.na(ss)] <- 0
diag(ss) <- 1
Now using the PAM clustering:
library(cluster) # for pam()
dd2 <- as.dist(1 - sqrt(ss))
pam2 <- pam(dd2, 4)
summary(as.factor(pam2$clustering))
But I get very concentrated clusters, as:
1 2 3 4
382 100 23 62
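One simple way to quantify how poor the partition is (not mentioned in the original post) is the average silhouette width that pam already stores on its result; a minimal sketch reusing pam2 from above:
library(cluster)
# overall average silhouette width: values near 0 mean the clusters barely
# separate, values near 1 mean tight, well-separated clusters
pam2$silinfo$avg.width
# per-cluster averages, to see which clusters are weakest
pam2$silinfo$clus.avg.widths
# visual check
plot(silhouette(pam2))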
I'm not sure where you get the 696 number from. After you rbind, you have a data frame with 567 unique values for V1 and V2, and once you perform the dcast you end up with a matrix of the expected size, 567 x 567. Clustering-wise, I see no issue with your clusters.
dim(df) # [1] 7659 3
test <- rbind(df, df[, .(V1 = V2, V2 = V1, N)])
dim(test) # [1] 15318 3
length(unique(test$V1)) # 567
length(unique(test$V2)) # 567
test2 <- dcast(test, V1~V2, value.var = "N")[,-1]
dim(test2) # [1] 567 567
@Mayo, forget what the data scientist said about PAM. Since you've mentioned this work is for a thesis, then from an academic viewpoint your current justification for why PAM is required does not hold any merit. Essentially, you need to either prove or justify why PAM is a necessity for your case study. And given the nature of the (continuous) variables in the dataset, V1, V2, N, I do not see the logic of using PAM here (as I mentioned in the comments, PAM works best for mixed variable types).
Continuing further, see this post on correlation detection in R:
# Objective: Detect Highly Correlated variables, visualize them and remove them
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
# compute correlation matrix using the cor()
res<- cor(my_data)
round(res, 2) # Unfortunately, the function cor() returns only the correlation coefficients between variables.
# Visualize the correlation
# install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
# Positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. In the right side of the correlogram, the legend color shows the correlation coefficients and the corresponding colors.
# tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
#Apply correlation filter at 0.80,
#install.packages("caret", dependencies = TRUE)
library(caret)
highlyCor <- colnames(my_data)[findCorrelation(res, cutoff = 0.80, verbose = TRUE)]
# show highly correlated variables
highlyCor
[1] "disp" "mpg"
removeHighCor<- findCorrelation(res, cutoff = 0.80) # returns indices of highly correlated variables
# remove highly correlated variables from the dataset
my_data <- my_data[, -removeHighCor]
dim(my_data)
[1] 32 4
Hope this helps.
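To connect this back to the question (an assumption on my part, not something the answer above does), the same filter could in principle be run on the 567 x 567 similarity matrix ss to flag items that are highly similar to many others:
library(caret)
ss_mat <- as.matrix(ss)          # ss as built in the question, diagonal = 1
highly_similar <- findCorrelation(ss_mat, cutoff = 0.80)
length(highly_similar)           # how many items get flagged at this cutoff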

Specify a blocking factor in H2O

In the R version of H2O, is it possible to specify a blocking factor when splitting data into training/validation/test sets and/or when doing cross-validation?
I'm working on a clinical dataset with multiple observations from the same patient that should be kept together during these operations.
If this is not possible to do within the H2O framework then suggestions on how to achieve this in R and integrate with H2O functions would be great.
Thanks!
When using H2O-3 with cross validation, you can tell the training algorithm which fold number an observation belongs to with the fold_column parameter. See:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/fold_column.html
The code example below (copied from the link above) shows folds being assigned randomly, but you could alternatively write a piece of code to assign them specifically yourself; a sketch of assigning folds by patient follows the example.
library(h2o)
h2o.init()
# import the cars dataset:
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])
# set the predictor names and the response column name
predictors <- c("displacement","power","weight","acceleration","year")
response <- "economy_20mpg"
# create a fold column with 5 folds
# randomly assign fold numbers 0 through 4 for each row in the column
fold_numbers <- h2o.kfold_column(cars, nfolds=5)
# rename the column "fold_numbers"
names(fold_numbers) <- "fold_numbers"
# print the fold_assignment column
print(fold_numbers)
# append the fold_numbers column to the cars dataset
cars <- h2o.cbind(cars,fold_numbers)
# try using the fold_column parameter:
cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = cars,
fold_column="fold_numbers", seed = 1234)
# print the auc for your model
print(h2o.auc(cars_gbm, xval = TRUE))
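As noted above, the folds can also be assigned deterministically so that all rows from the same patient land in the same fold. The sketch below is only an illustration under assumed names (df, patient_id and outcome are hypothetical, not from the cars example): the fold number is derived from the patient ID in plain R and then pushed to H2O.
library(h2o)
h2o.init()
# hypothetical clinical data: 40 patients, 5 observations each
df <- data.frame(patient_id = rep(sprintf("P%03d", 1:40), each = 5),
                 x1 = rnorm(200),
                 outcome = factor(rbinom(200, 1, 0.5)))
nfolds <- 5
# every row of a given patient gets the same fold number, so patients are
# never split across folds during cross-validation
df$fold_numbers <- (as.integer(factor(df$patient_id)) - 1) %% nfolds
hf <- as.h2o(df)
gbm_blocked <- h2o.gbm(x = "x1", y = "outcome", training_frame = hf,
                       fold_column = "fold_numbers", seed = 1234)
print(h2o.auc(gbm_blocked, xval = TRUE))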

siber.ellipses -> Error in rmultireg(Y, X, Bbar, A, nu, V) : not a matrix

I am very new to R and am currently trying to create SIBER ellipses.
I watched the podcast "Using ellipses to compare community members" (http://www.tcd.ie/Zoology/research/research/theoretical/Rpodcasts.php#siber) and got along just fine in the beginning. Whenever I get to the siber.ellipses function, I get an error:
(Error in rmultireg(Y, X, Bbar, A, nu, V) : not a matrix)
I can not figure out why. I get it with my own data as well as with the example data in the zip file provided along with the script.
I have researched the Error message online but could not come up with an answer.
It can hardly be an error in the script or the data, since I used them exactly as provided. My R version is 3.3.2.
Does it have something to do with some kind of settings?
What could be the reason?
Can someone please help me? :)
Thanks!
# this demo generates some random data for M consumers based on N samples and
# constructs a standard ellipse for each based on SEAc and SEA_B
rm(list = ls())
library(siar)
# -------------------------------------------------------------------------
# ANDREW - REMOVE THESE LINES WHICH SHOULD BE REDUNDANT
# change this line
setwd("C:/Users/elisabeth/Desktop/R/demo")
# -----------------------------------------------------------------------------
# now close all currently open windows
graphics.off()
# read in some data
# NB the column names have to be exactly, "group", "x", "y"
mydata <- read.table("example_ellipse_data.txt",sep="\t",header=T)
# make the column names available for direct calling
attach(mydata)
# now loop through the data and calculate the ellipses
ngroups <- length(unique(group))
# split the isotope data based on group
spx <- split(x,group)
spy <- split(y,group)
# create some empty vectors for recording our metrics
SEA <- numeric(ngroups)
SEAc <- numeric(ngroups)
TA <- numeric(ngroups)
dev.new()
plot(x,y,col=group,type="p")
legend("topright",legend=as.character(paste("Group ",unique(group))),
pch=19,col=1:length(unique(group)))
for (j in unique(group)){
# Fit a standard ellipse to the data
SE <- standard.ellipse(spx[[j]],spy[[j]],steps=1)
# Extract the estimated SEA and SEAc from this object
SEA[j] <- SE$SEA
SEAc[j] <- SE$SEAc
# plot the standard ellipse with d.f. = 2 (i.e. SEAc)
# These are plotted here as thick solid lines
lines(SE$xSEAc,SE$ySEAc,col=j,lty=1,lwd=3)
# Also, for comparison we can fit and plot the convex hull
# the convex hull is plotted as dotted thin lines
#
# Calculate the convex hull for the jth group's isotope values
# held in the objects created using split() called spx and spy
CH <- convexhull(spx[[j]],spy[[j]])
# Extract the area of the convex hull from this object
TA[j] <- CH$TA
# Plot the convex hull
lines(CH$xcoords,CH$ycoords,lwd=1,lty=3)
}
# print the area metrics to screen for comparison
# NB if you are working with real data rather than simulated then you won't be
# able to calculate the population SEA (pop.SEA)
# If you do this enough times or for enough groups you will easily see the
# bias in SEA as an estimate of pop.SEA as compared to SEAc which is unbiased.
# Both measures are equally variable.
print(cbind(SEA,SEAc,TA))
# So far we have fitted the standard ellipses based on frequentist methods
# and calculated the relevant metrics (SEA and SEAc). Now we turn our attention
# to producing a Bayesian estimate of the standard ellipse and its area SEA_B
reps <- 10^4 # the number of posterior draws to make
# Generate the Bayesian estimates for the SEA for each group using the
# utility function siber.ellipses
SEA.B <- siber.ellipses(x,y,group,R=reps)
Error in rmultireg(Y, X, Bbar, A, nu, V) : not a matrix
Reinstall the siar package from Andrew Jackson's GitHub site:
library(devtools)
install_github("andrewljackson/siar@v4.2.2", build_vignettes = TRUE)
library(siar)

spBayes spLM function with duplicate coordinates

I am using the spRecover function in the spBayes package to produce a spatial univariate model.
Here is a reproducible example in which I made a duplicate coordinate point. The model-fitting step itself executes just fine, but it won't let me recover the spatial effects for each site:
require(spBayes)
set.seed(444)
N = 200
y = rnorm(N,0,100)
x = rnorm(N,2,7)
df <- as.data.frame(cbind((rnorm(N,5,2.5)),rep('location1',N)))
coord <- cbind(runif(N,-30,30),runif(N,-180,180))
coord[2,] <- coord [1,]
n.samples <- 1000
bef.sp <- spLM(y ~ x, ## the equation
data = df, coords=coord, ## data and coordinates
starting=list("phi"=3/200,"sigma.sq"=0.08,"tau.sq"=0.02),## start values
tuning=list("phi"=0.1, "sigma.sq"=0.05, "tau.sq"=0.05), ## tuning values
priors=list("phi.Unif"=c(3/1500, 3/50), "sigma.sq.IG"=c(2, 0.08),"tau.sq.IG"=c(2, 0.02)), ## priors
cov.model="exponential",n.samples=n.samples)
burn.in <- floor(0.75*n.samples)
bef.sp <- spRecover(bef.sp, start=burn.in, thin=2)
The error received is:
Error in spRecover(bef.sp, start = burn.in, thin = 2) :
c++ error: dpotrf failed
I found a post by the package author indicating this error might come up if one has duplicated coordinates. I definitely have duplicated coordinates, since many sites were sampled multiple times (on the same day; this is not a time-series issue). How do I get the model to accept that there is a lot of replication within each coordinate pair, and to recover individual spatial-effect values for each site?
Thanks!
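One workaround that is often suggested for spatial models that fail on duplicated locations (an assumption on my part, not an answer from this thread) is to add a tiny jitter to the repeated coordinates so the spatial covariance matrix is no longer singular; whether that is appropriate for genuinely replicated sampling is a separate modelling decision. A minimal sketch reusing coord from above:
dup <- duplicated(coord)
sum(dup)                 # number of repeated coordinate rows (1 in this example)
coord[dup, ] <- jitter(coord[dup, ], amount = 1e-4)
any(duplicated(coord))   # should now be FALSE
# then refit spLM() with the jittered coords and run spRecover() as before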

R: Calculate sill, range and nugget from a raster object

I need to calculate the sill, range and nugget from a raster layer. I have explored the gstat and usdm packages, where one can create a variogram; however, I couldn't find a function that, given a raster layer, estimates these parameters. In most functions these parameters have to be supplied, e.g. for kriging.
I have raster data layers for different heights, which look similar to the attached image.
I would like to get the sill, nugget and range from the parameters of a semivariogram fitted to these data layers, to create a plot similar to the attached example.
The original data layers are available here as a multiband tiff. Here is a figure from this paper which further illustrates the concept.
Using gstat, here is an example:
library(raster)
library(gstat)
demo(meuse, ask = FALSE, echo = FALSE)
set.seed(131) # make random numbers reproducible
# add some noise with .1 variance
meuse.grid$dist = meuse.grid$dist + rnorm(nrow(meuse.grid), sd=sqrt(.1))
r = raster(meuse.grid["dist"])
v = variogram(dist~1, as(r, "SpatialPixelsDataFrame"))
(f = fit.variogram(v, vgm("Sph")))
# model psill range
# 1 Nug 0.09035948 0.000
# 2 Sph 0.06709838 1216.737
f$psill[2] # sill
# [1] 0.06709838
f$range[2] # range
# [1] 1216.737
f$psill[1] # nugget
# [1] 0.09035948
Plug in your own raster for r, and it should work. Change Sph to fit another variogram model, and try plot(v, f) to verify the fit.
This is just a guess. This is how I estimate the semivariance: sem = (1/n) * (m - r)^2, where n is the number of layers whose mean is less than the total mean, m is the total mean across all the layers, and r is the mean of each layer that fell below the total mean.
library(raster)
s <- stack("old_gap_.tif")
m <- cellStats(mean(s), stat="mean", na.rm=T)     # total mean across all layers: 0.5620522
layer_means <- cellStats(s, stat="mean", na.rm=T) # mean of each individual layer
r <- layer_means[layer_means < m]  # layer means that fall below the total mean
n <- length(r)                     # 53 layers in this case
sem <- 1/n * (m - r)^2
plot(sem, r)
