When to define new columns in krige in R? - r

I am wondering: is it ever necessary to redefine your own columns when kriging? The error below seems to indicate this:
Warning: singular model in variogram fit
> sk1 <- krige(formula=Zs~1, locations=~Xs+Ys, data=sampled, newdata=pred.grid, model=fit.sph, beta=0)
Error in `[.data.frame`(object, , -coord.numbers, drop = FALSE) :
undefined columns selected
Is there a problem that I'm not seeing? Or, do I need to define my own columns? Thanks.
The following program is completely reproducible and runnable from here down:
library(gstat)
x <- seq(0,2000,by=20)
y <- seq(0,2000,by=20)
x = sample(x,10,replace=T)
y = sample(y,10,replace=T)
z = sample(0.532:3.7,10,replace=T)
samples = data.frame(x,y,z)
# detrend the samples:
print(mean(samples$z))
#create object of class gstat
h <- gstat(formula=z~1, locations=~x+y, data=samples)
samples.vgm <- variogram(h) # create method of class "gstatVariogram"
plot(samples.vgm,main='Variogram of Samples NOT detrended') # plot method for class "gstatVariogram"
# DETREND
z = samples$z
x = samples$x
y = samples$y
trend <- lm(z~x+y)
c = trend$coefficients[[1]]
a = trend$coefficients[[2]]
b = trend$coefficients[[3]]
#z_prime = z - (a*x + b*y +c)
# SUBTRACT THE PREDICTED LINE
Xs <- c()
Ys <- c()
Zs <- c()
print('started the loop')
for (i in 1:nrow(samples)){
i = samples[i,]
x=i$x
y=i$y
z=i$z
z_prime = z - (a*x+b*y+c)
Xs <- c(Xs,x)
Ys <- c(Ys,y)
Zs <- c(Zs,z_prime)
}
sampled <- data.frame(Xs=Xs,Ys=Ys,Zs=Zs)
print(sampled)
print('the length of sampled is')
print(length(sampled[[1]]))
# "result" is the new dataset with Z's detrended
# print(levelplot(Zs~Xs+Ys,sampled))
# define the domain or kriging estimation
x <- seq(0,2000,by=20)
y <- seq(0,2000,by=20)
# make data frame with prediction locations
pred.grid <- data.frame(x=rep(x,times=length(y)),y=rep(y,each=length(x)))
#create object of class gstat
g <- gstat(formula=Zs~1, locations=~Xs+Ys, data=sampled)
sampled.vgm <- variogram(g) # create method of class "gstatVariogram"
plot(sampled.vgm,main='Variogram of Samples hopefully detrended') # plot method for class "gstatVariogram"
vg.sph <- vgm(psill=1.0,model='Sph', range = 500)
fit.sph <- fit.variogram(sampled.vgm, model = vg.sph)
sk1 <- krige(formula=Zs~1, locations=~Xs+Ys, data=sampled, newdata=pred.grid, model=fit.sph, beta=0)

Add library(gstat) to the top of your code, then it's reproducible.
To answer your question directly: you get the undefined columns selected error because your newdata does not have the correct column names. The coordinate columns in newdata must have the same names as the coordinate columns in data, which are Xs and Ys in this case. Redefine pred.grid with columns Xs and Ys to solve your problem. I tested this, and your code runs.
pred.grid <- data.frame(Xs=rep(x,times=length(y)),Ys=rep(y,each=length(x)))
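The name mismatch can be caught before calling krige. The following is a minimal base-R sketch, not part of gstat: it assumes the coordinate names are the variables in the locations formula, and the two-row sampled data frame is made up purely for illustration.

```r
# Illustrative sample data (values are made up for this sketch)
sampled <- data.frame(Xs = c(100, 400), Ys = c(20, 900), Zs = c(0.1, -0.3))

# Prediction grid built with the same coordinate names as the sample data
x <- seq(0, 2000, by = 20)
y <- seq(0, 2000, by = 20)
pred.grid <- expand.grid(Xs = x, Ys = y)  # Xs varies fastest, like rep(x, times = length(y))

# Coordinate names taken from the locations formula
coord_names <- all.vars(~Xs + Ys)

# Both data frames must contain every coordinate column
stopifnot(all(coord_names %in% names(sampled)),
          all(coord_names %in% names(pred.grid)))
```

If the check fails, renaming the grid columns (for example with names(pred.grid) <- coord_names) is usually simpler than redefining the sample data.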
As for other comments: Warning: singular model in variogram fit is a result of not being able to fit a model based on the sample semivariogram data. If you take a look at the plot of your data (below), it's very clear that no empirical function will be able to fit this. In your case, it is because you have only one point per bin (11 points total) so there's really not enough data to fit the semivariogram. Even reducing the number of bins, there still would not be enough data support to fit an empirical semivariogram.
Changing your number of samples to 500,
x = sample(x,500,replace=T)
y = sample(y,500,replace=T)
z = sample(0.532:3.7,500,replace=T)
it becomes very clear that the data you are generating are uncorrelated such that samples closer to one another in x-y space are not more similar than samples farther away (pure nugget semivariogram). Is this what you wanted?
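Part of the reason is visible in how z is generated. A quick check, offered as a sketch: the colon operator steps by 1, so 0.532:3.7 contains only four distinct values, and sample() draws them independently of location. Whether continuous values were intended is an assumption; runif is shown only as one possible alternative, and it would still be spatially uncorrelated.

```r
# The colon operator steps by 1 from the left endpoint
vals <- 0.532:3.7
vals  # 0.532 1.532 2.532 3.532

# One possible alternative if continuous values were intended (assumption):
# uniform draws on the same interval, still independent of x and y
set.seed(42)
z_cont <- runif(10, min = 0.532, max = 3.7)
```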

Related

Creating a point distance component to a monte carlo simulation function in R

I am attempting to do some Monte Carlo simulations, where I have a population of 325 samples in a field. I want to create a list of composite samples (samples consisting of multiple subsamples) from the dataset, while increasing sample size, repeated 100 times. I have created the function that will do so, and have supplied that below in the code.
##Create an example data set
# x and y are coordinates
x <- c(1:100)
y <- rev(c(1:100))
## z and w are soil test values
set.seed(2345)
z <- rnorm(100,mean=50, sd=10)
set.seed(2345)
w <- rnorm(100, mean=75, sd=5)
data <- data.frame(x, y, z, w)
##Initialize list
data.step.sim.list <- list()
## Code that increases sample size
for(i in seq_len(nrow(data))){
thisdat <- replicate(100,data[sample(1:nrow(data), size=i, replace = F),], simplify = F)
data.step.sim.list[[i]] <- thisdat
}
The result is a list of length n (n being the number of rows in the dataset). Each element is itself a list of 100 data frames (one per replication), and the data frames in element i have i rows.
I have x and y data for each sample as well, and want to stipulate that each subsample collected would be at least 'm' meters from the other samples.
I have created a function that will calculate each distance seen below. I cannot find a way to implement this into my current code. Would anyone know how to do this?
#function to compute distances
calc.dist <- function(x1, y1, x2, y2) {
d <- sqrt(((x2 - x1)^2) + ((y2 - y1)^2))
return(d)
} #end function calc.dist
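One possible way to enforce the minimum spacing is rejection sampling: draw a candidate subsample, compute all pairwise distances with base R's dist(), and redraw if any pair is closer than m. This is only a sketch; sample_min_dist and max_tries are hypothetical names introduced here, and the approach becomes slow or fails when size is large relative to m.

```r
# Example data in the same shape as the question's data frame
x <- 1:100
y <- rev(1:100)
set.seed(2345)
z <- rnorm(100, mean = 50, sd = 10)
data <- data.frame(x, y, z)

# Hypothetical helper: draw `size` rows whose pairwise distances are all >= m.
# Returns NULL if no valid draw is found within max_tries attempts.
sample_min_dist <- function(data, size, m, max_tries = 1000) {
  for (attempt in seq_len(max_tries)) {
    idx <- sample(nrow(data), size = size, replace = FALSE)
    if (size < 2) return(data[idx, , drop = FALSE])  # nothing to compare
    d <- dist(data[idx, c("x", "y")])                # Euclidean pairwise distances
    if (min(d) >= m) return(data[idx, , drop = FALSE])
  }
  NULL
}

set.seed(1)
sub <- sample_min_dist(data, size = 5, m = 5)
```

The helper could replace the data[sample(...), ] call inside replicate(); draws that return NULL would then need to be handled or dropped.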

Plotting Forecast and Real values in one plot using a Rolling Window

I have code that takes the yield spread (dependent variable) and the forward rates (independent variable) as input and runs auto.arima to select the model orders. I then forecast the next 25 dates (forc.horizon). My training data are the first 600 rows (training). I then move the time window forward by 25 dates, i.e. I use rows 26 to 625 to estimate the auto.arima model, forecast rows 626 to 650, and so on. My data sets have 2298 rows (dates) and 30 columns (maturities).
I want to store all of the forecasts and then plot the forecasted and real values in the same plot.
This is the code I have, but it doesn't store the forecasts in a way to plot later.
forecast.func <- function(NS.spread, ind.v, maturity, training, forc.horizon){
NS.spread <- NS.spread/100
forc <- c()
j <- 0
for(i in 1:floor((nrow(NS.spread)-training)/forc.horizon)){
# training data
y <- NS.spread[(1+j):(training+j) , maturity]
f <- ind.v[(1+j):(training+j) , maturity]
# auto- arima
c <- auto.arima(y, xreg = f, test= "adf")
# forecast
e <- ind.v[(training+j+1):(training+j+forc.horizon) , maturity]
h <- forecast(c, xreg = lagmatrix(e, -1))
forc <- c(forc, list(h))
j <- j + forc.horizon
}
return(forc)
}
a <- forecast.func(spread.NS.JPM, Forward.rate.JPM, 10, 600, 25)
lapply(a, plot)
Here's a link to my two datasets:
https://drive.google.com/drive/folders/1goCxllYHQo3QJ0IdidKbdmfR-DZgrezN?usp=sharing
LOOK AT THE END for a full functional example on how to handle AUTO.ARIMA MODEL with DAILY DATA using XREG and FOURIER SERIES with ROLLING STARTING TIMES and cross validated training and test.
Without a reproducible example no one can help you, because they can't run your code. You need to provide data. :-(
Even if it's not part of StackOverflow to discuss statistics matters, why don't you do an auto.arima with xreg instead of lm + auto.arima on residuals? Especially, considering how you forecast at the end, that training method looks really wrong. Consider using:
fit <- auto.arima(y, xreg = lagmatrix(f, -1))
h <- forecast(fit, xreg = lagmatrix(e, -1))
auto.arima will automatically calculate the best parameters by max likelihood.
On your coding question..
forc <- c() should be outside of the for loop, otherwise at every run you delete your previous results.
Same for j <- 0: at every run you're setting it back to 0. Put it outside if you need to change its value at every run.
The output of forecast is an object of class forecast, which is actually a type of list. Therefore, you can't use cbind effectively.
In my opinion, you should create forc in this way: forc <- list()
And create a list of your final results in this way:
forc <- c(forc, list(h)) # instead of forc <- cbind(forc, h)
This will create a list of objects of class forecast.
You can then plot them with a for loop by getting access at every object or with a lapply.
lapply(output_of_your_function, plot)
This is as far as I can go without a reproducible example.
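The storage pattern described above can be sketched without the original data. The example below is an illustration only: it swaps in base R's arima() with a fixed AR(1) order instead of auto.arima, uses the built-in Nile series, and picks the window sizes arbitrarily. The point is that the result list is created once, outside the loop, and filled one window at a time.

```r
y <- as.numeric(Nile)   # built-in annual series, 100 observations
training <- 60          # illustrative window length
horizon  <- 10          # illustrative forecast horizon

# Start index of every training window that leaves a full horizon after it
starts <- seq(1, length(y) - training - horizon + 1, by = horizon)

# Create the container once, outside the loop
forc <- vector("list", length(starts))

for (k in seq_along(starts)) {
  j <- starts[k]
  fit <- arima(y[j:(j + training - 1)], order = c(1, 0, 0))
  # Keep only the point forecasts for this window
  forc[[k]] <- predict(fit, n.ahead = horizon)$pred
}
```

Each element of forc then holds the forecasts for one window and can be plotted against the matching slice of y.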
FINAL EDIT
FULL FUNCTIONAL EXAMPLE
Here I try to sum up a conclusion out of the million comments we wrote.
With the data you provided, I built a code that can handle everything you need.
It covers everything from the training/test split and the model through to the forecast and the final plot, which has time on the X axis as requested in one of your comments.
I removed the for loop. lapply is much better for your case.
You can leave the fourier series if you want to. That's how Professor Hyndman suggests to handle daily time series.
Functions and libraries needed:
# libraries ---------------------------
library(forecast)
library(lubridate)
# run model -------------------------------------
.daily_arima_forecast <- function(init, training, horizon, tt, ..., K = 10){
# create training and test
tt_trn <- window(tt, start = time(tt)[init] , end = time(tt)[init + training - 1])
tt_tst <- window(tt, start = time(tt)[init + training], end = time(tt)[init + training + horizon - 1])
# add fourier series [if you want to. Otherwise, cancel this part]
fr <- fourier(tt_trn[,1], K = K)
frf <- fourier(tt_trn[,1], K = K, h = horizon)
tsp(fr) <- tsp(tt_trn)
tsp(frf) <- tsp(tt_tst)
tt_trn <- ts.intersect(tt_trn, fr)
tt_tst <- ts.intersect(tt_tst, frf)
colnames(tt_tst) <- colnames(tt_trn) <- c("y", "s", paste0("k", seq_len(ncol(fr))))
# run model and forecast
aa <- auto.arima(tt_trn[,1], xreg = tt_trn[,-1])
fcst <- forecast(aa, xreg = tt_tst[,-1])
# add actual values to plot them later!
fcst$test.values <- tt_tst[,1]
# NOTE: since I modified the structure of the class forecast I should create a new class,
# but I didn't want to complicate your code
fcst
}
daily_arima_forecast <- function(y, x, training, horizon, ...){
# set up x and y together
tt <- ts.intersect(y, x)
# set up all starting point of the training set [give it a name to recognize them later]
inits <- setNames(nm = seq(1, length(y) - training, by = horizon))
# remove last one because you wouldn't have enough data in front of it
inits <- inits[-length(inits)]
# run model and return a list of all your models
lapply(inits, .daily_arima_forecast, training = training, horizon = horizon, tt = tt, ...)
}
# plot ------------------------------------------
plot_daily_forecast <- function(x){
autoplot(x) + autolayer(x$test.values)
}
Reproducible Example on how to use the previous functions
# create a sample data
tsp(EuStockMarkets) <- c(1991, 1991 + (1860-1)/365.25, 365.25)
# model
models <- daily_arima_forecast(y = EuStockMarkets[,1],
x = EuStockMarkets[,2],
training = 600,
horizon = 25,
K = 5)
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[1]]
Example for the author of the post
# your data
load("BVIS0157_Forward.rda")
load("BVIS0157_NS.spread.rda")
spread.NS.JPM <- spread.NS.JPM / 100
# pre-work [out of function!!!]
set_up_ts <- function(m){
start <- min(row.names(m))
end <- max(row.names(m))
# daily sequence
inds <- seq(as.Date(start), as.Date(end), by = "day")
ts(m, start = c(year(start), as.numeric(format(inds[1], "%j"))), frequency = 365.25)
}
mts_spread.NS.JPM <- set_up_ts(spread.NS.JPM)
mts_Forward.rate.JPM <- set_up_ts(Forward.rate.JPM)
# model
col <- 10
models <- daily_arima_forecast(y = mts_spread.NS.JPM[, col],
x = stats::lag(mts_Forward.rate.JPM[, col], -1),
training = 600,
horizon = 25,
K = 5) # notice that K falls between ... that goes directly to the inner function
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[5]]

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
starting R with --max-ppsize=50000
setting options(expressions = 50000)
but the error persists
I am working on Windows and using Rstudio
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model selecting fewer variables is not the solution I am looking for now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However, the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2)=nc*(nc-1)/2 (where nc is the number of columns). When I try with 50000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3Gb are required when nc is 25000). The code below runs on my laptop when nc <- 10000.
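The figures above follow from simple arithmetic: a numeric (double) vector uses 8 bytes per element. A small sketch, where qr_mem_gib is a hypothetical helper name and only the vector of length choose(nc, 2) is counted:

```r
# Approximate size in GiB of a double vector of length choose(nc, 2)
qr_mem_gib <- function(nc) choose(nc, 2) * 8 / 2^30

round(qr_mem_gib(c(10000, 25000, 50000)), 1)
# 0.4 2.3 9.3
```

Because the size grows with the square of nc, halving the number of columns cuts the requirement to roughly a quarter.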
I have a few caveats about this approach:
you won't be able to handle a problem with 50000 columns unless you have at least 10G of memory, because of the issue described above.
the biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard)
I have no idea if the p>>n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr>nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1,X)) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
y=NULL, X=NULL, off=0) {
if (!is.null(weights)) {
if (!inherits(weights, "formula"))
stop("`weights' must be a formula")
w <- model.frame(weights, data)[[1]]
} else w <- NULL
if (is.null(X)) {
tt <- terms(formula)
mf <- model.frame(tt, data)
if (is.null(off <- model.offset(mf)))
off <- 0
mm <- model.matrix(tt, mf)
y <- model.response(mf) - off
} else {
## model matrix specified directly
if (is.null(y)) stop("both y and X must be specified")
mm <- X
tt <- NULL
}
qr <- biglm:::bigqr.init(NCOL(mm))
qr <- biglm:::update.bigqr(qr, mm, y, w)
rval <- list(call = sys.call(), qr = qr, assign = attr(mm,
"assign"), terms = tt, n = NROW(mm), names = colnames(mm),
weights = weights)
if (sandwich) {
p <- ncol(mm)
n <- nrow(mm)
xyqr <- biglm:::bigqr.init(p * (p + 1))
xx <- matrix(nrow = n, ncol = p * (p + 1))
xx[, 1:p] <- mm * y
for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
xyqr <- update(xyqr, xx, rep(0, n), w * w)
rval$sandwich <- list(xy = xyqr)
}
rval$df.resid <- rval$n - length(qr$D)
class(rval) <- "biglm"
rval
}

How to modify slots with paired random effects in lmer

This is a follow up question to a previous post (How to modify slots lme4 >1.0). I have a similar pairwise data structure and want the random effect to consider both "pops" in the pair. I have a functional random intercept model using the code previously suggested:
dat <- data.frame(pop1 = c(2,1,1,1,1,3,2,2,2,3,5,3,5,4,6),
pop2 = c(1,3,4,5,6,2,4,5,6,4,3,6,4,6,5),
X = c(20,25,18,40,36,70,68,72,78,76,97,100,115,110,108),
Y = c(18,16,15,40,22,18,18,18,18,45,10,47,67,5,6))
#build random effects matrix
Zl<-lapply(c("pop1","pop2"),function(nm)Matrix:::fac2sparse(dat[[nm]],"d",drop=FALSE))
ZZ<-Reduce("+",Zl[-1],Zl[[1]])
#specify model structure
mod<-lFormula(Y~X+(1|pop1),data=dat,REML=TRUE)
#replace slot
mod$reTrms$Zt <- ZZ
#fit model
dfun<-do.call(mkLmerDevfun,mod)
opt<-optimizeLmer(dfun)
mkMerMod(environment(dfun),opt,mod$reTrms,fr=mod$fr)
However, when attempting to add a random slope variable:
mod2<-lFormula(Y~X+(1+X|pop1),data=dat,REML=TRUE)
mod2$reTrms$Zt <- ZZ
dfun<-do.call(mkLmerDevfun,mod2)
This results in the same error identified in the previous post (where the issue was calling the wrong data frame): "Error in Lambdat %*% Ut :
Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82"
View lm for each pop
plot(1,type="n",xlim=c(0,150),ylim=c(0,75),ylab = "Y",xlab="X")
for(i in 1:length(unique(c(dat$pop1,dat$pop2)))){
subdat<-dat[which(dat$pop1==i | dat$pop2==i),]
out<-summary(lm(subdat$Y~subdat$X))
x=subdat$X
y=x*out$coefficients[2,1]+out$coefficients[1,1]
lines(x,y,col=i)
}
legend(125,60,1:6,col=1:6,lty=1,title="Pop")
dat <- data.frame(pop1 = c(2,1,1,1,1,3,2,2,2,3,5,3,5,4,6),
pop2 = c(1,3,4,5,6,2,4,5,6,4,3,6,4,6,5),
X = c(20,25,18,40,36,70,68,72,78,76,97,100,115,110,108),
Y = c(18,16,15,32,22,29,32,38,44,45,51,47,67,59,61))
It helps to try to understand what the original code is actually doing:
## build random effects matrix
## 1. sparse dummy-variable matrices for each population ID
Zl <- lapply(dat[c("pop1","pop2")],
Matrix::fac2sparse,to="d",drop.unused.levels=FALSE)
## 2. take the sum of all components of the list of dummy-variable matrices ...
ZZ <- Reduce("+",Zl[-1],Zl[[1]])
The Reduce form is convenient in general if we have a long list, but it helps to see that in this case it's just Zl[[1]]+Zl[[2]] ...
all.equal(Zl[[1]]+Zl[[2]],ZZ) ## TRUE
What does this RE structure look like?
library(gridExtra)
grid.arrange(
image(t(Zl[[1]]),main="pop 1",sub="",xlab="pop",ylab="obs"),
image(t(Zl[[2]]),main="pop 2",sub="",xlab="pop",ylab="obs"),
image(t(ZZ),main="combined",sub="",xlab="RE",ylab="obs"),
nrow=1)
For the random slope, I think we want to take each filled element of ZZ and replace it with the X value observed for the corresponding observation/row of dat: the indexing here is a bit obscure - in this case it boils down to there being 2 filled values in each row of Z/column of Zt (the @p slot of the sparse matrix gives a zero-indexed pointer to the first non-zero element in each column ...)
vals <- dat$X[rep(1:(length(ZZ@p)-1),diff(ZZ@p))]
ZZX <- ZZ
ZZX@x <- vals
image(t(ZZX))
library(lme4)
mod <- lFormula(Y~X+(X|pop1),data=dat,REML=TRUE)
## replace slot
mod$reTrms$Zt <- rbind(ZZ,ZZX)
## fit model
dfun <- do.call(mkLmerDevfun,mod)
opt <- optimizeLmer(dfun)
m1 <- mkMerMod(environment(dfun),opt,mod$reTrms,fr=mod$fr)
This seems to work, but you should certainly check it with your own knowledge of what's supposed to be going on here ...
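The pointer arithmetic used to build the slope matrix can be checked on a tiny example. This is a sketch with made-up data (three observations, three populations, and pop1 never equal to pop2); it only illustrates how the @p and @x slots of a column-compressed sparse matrix line up with observations.

```r
library(Matrix)

# Three observations, each linking two different populations (made-up data)
pop1 <- factor(c(1, 1, 2), levels = 1:3)
pop2 <- factor(c(2, 3, 3), levels = 1:3)
X    <- c(10, 20, 30)

# Rows are populations, columns are observations
ZZ <- fac2sparse(pop1, to = "d", drop.unused.levels = FALSE) +
      fac2sparse(pop2, to = "d", drop.unused.levels = FALSE)

ZZ@p        # zero-based offset of the first non-zero in each column
diff(ZZ@p)  # number of non-zeros per column (observation)

# Repeat each observation index once per non-zero in its column,
# then overwrite the stored values with the matching X
vals <- X[rep(seq_len(ncol(ZZ)), diff(ZZ@p))]
ZZX <- ZZ
ZZX@x <- vals
as.matrix(ZZX)
```

Every column of the result carries that observation's X value in the two rows of the populations it involves, which is the structure the random slope needs.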

How to predict using a locally smoothed mean?

(Statistics beginner here).
I have some training data (x,y), and wish to make prediction for new data x_new.
Now let's assume I have the data for the plot below, but I do not know how y is computed. So I would like to use the data I have to calculate, for any given x, the local mean of y, as this seems like the best guess I can make.
install.packages("gplots")
library("gplots")
x <- abs(rnorm(500))
y <- rnorm(500, mean=2*x, sd=2+2*x)
bandplot(x,y)
Is there a R function to predict y for a given x, using the locally smoothed mean (here shown in red thanks to the function bandplot), or something similar?
wapply from gplots returns the locally smoothed mean as a table for x and y.
x <- 1:1000
y <- rnorm(1000, mean=1, sd=1 + x/1000 )
wapply(x,y,mean)
To predict, one would need, I guess, to find the closest x in the table returned by wapply, then read off the local mean of y at that point.
For a value a, the closest x will be given by the index:
index = which(abs(wapply(x,y,mean)$x-a)==min(abs(wapply(x,y,mean)$x-a)))
then the prediction should be:
pred = wapply(x,y,mean)$y[index]
So in one line:
locally_smoothed_mean_prediction = function(a) wapply(x,y,mean)$y[which(abs(wapply(x,y,mean)$x-a)==min(abs(wapply(x,y,mean)$x-a)))]
> locally_smoothed_mean_prediction(600)
[1] 1.055642
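A base-R alternative, offered as a sketch rather than an equivalent of the approach above: loess() fits a local regression and has a predict() method, so no nearest-point lookup is needed. Note that loess is a different smoother from the windowed mean computed by wapply and bandplot, so the numbers will not match exactly; the default span is used here without tuning.

```r
set.seed(1)
x <- 1:1000
y <- rnorm(1000, mean = 1, sd = 1 + x / 1000)

# Local regression fit with default settings
fit <- loess(y ~ x)

# Predicted local mean at a new x value
p <- predict(fit, newdata = data.frame(x = 600))
p
```

predict() also accepts a whole vector of new x values at once, and returns NA for values outside the range of the training x.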
