I have been using the variofit function in R's geoR package to fit semivariogram models to some spatial data, and I am confused by a couple of the models it has generated. For these few models I get a fitted range for autocorrelation but no partial sill. I was told that even without a sill the model should still have some shape reflecting the range, yet plotting these models gives the flat lines shown in the attached screenshot. I do not think it is a matter of bad initial values, since I let variofit pick the best initial values from a matrix of candidates built with expand.grid. I would like to know whether this is being plotted correctly, contrary to what I was told, and what exactly it means to have a range but no partial sill. When I used an alternative fitting function from gstat (fit.variogram), these models could be fit to a periodic or wave model, though poorly (probably overfit); could this be an indication of that, which variofit simply cannot plot? I unfortunately can't share the data, but here is an example of the code I used to build these models, in case it helps:
geo.entPC <- as.geodata(cbind(jitteryPC, log.PC[, 5]), coords.col = 1:2, data.col = 3)  # coords.col/data.col are as.geodata arguments; the bound matrix has 3 columns
test.pc.grid2 <- expand.grid(seq(0, 2, 0.2), seq(0, 100, 10))  # candidate initial (partial sill, range) pairs
variog.function.col2 <- function(x) {
  # empirical (binned) semivariogram
  vario.cloud <- variog(x, estimator.type = "classical", option = "bin")
  # fit a model, letting variofit pick the best initial values from the grid
  variogram.mod <- variofit(vario.cloud, ini.cov.pars = test.pc.grid2,
                            fix.nug = FALSE, weights = "equal")
  plot(vario.cloud)
  lines(variogram.mod, col = "red")
  summary(x)
}
variog.function.col2(geo.entPC)
From the attached plot showing the empirical variogram, I would not expect to find any sensible spatial correlation. This is consistent with the fitted variogram, which is essentially a pure nugget model. The reported range may be a relic of the numerical optimization, or the partial sill may differ from 0 only at a digit that is not shown in the summary of the fitted variogram. Either way, no matter what the range is, with a negligibly small partial sill the spatial correlation is negligible.
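To check whether the partial sill is exactly zero or merely rounds to zero in the printed summary, you can inspect the fitted parameters at full precision. A minimal sketch, assuming the variofit object (variogram.mod in the question's code) is available in the workspace, e.g. because the function returns it:
print(variogram.mod$cov.pars, digits = 15)  # c(partial sill sigmasq, range phi)
print(variogram.mod$nugget, digits = 15)    # the nugget is stored separately in geoR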
Depending on the data, it is sometimes beneficial to limit the maximum distance of pairs used to calculate the empirical variogram - but make sure to have "enough" pairs in each bin.
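For illustration, geoR's variog supports this through its max.dist argument; a common rule of thumb (my assumption here, not part of the original advice) is half the maximum inter-point distance, and the n component of the result gives the pair count per bin:
vario.bin <- variog(geo.entPC, estimator.type = "classical", option = "bin",
                    max.dist = 0.5 * max(dist(geo.entPC$coords)))
vario.bin$n  # number of point pairs per bin; check these are "enough"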
Goal: I aim to use t-SNE (t-distributed Stochastic Neighbor Embedding) in R for dimensionality reduction of my training data (with N observations and K variables, where K>>N) and subsequently to obtain the t-SNE representation for my test data.
Example: Suppose I aim to reduce the K variables to D=2 dimensions (typically, D=2 or D=3 for t-SNE). There are two R packages, Rtsne and tsne; I use the former here.
# load packages
library(Rtsne)
# Generate training data: random standard normal matrix with K=400 variables and N=100 observations
x.train <- matrix(rnorm(n=40000, mean=0, sd=1), nrow=100, ncol=400)
# Generate test data: random standard normal vector with 1 observation of the K=400 variables
x.test <- rnorm(n=400, mean=0, sd=1)
# perform t-SNE
set.seed(1)
fit.tsne <- Rtsne(X=x.train, dims=2)
where fit.tsne$Y contains the (100x2)-dimensional t-SNE representation of the training data; it can also be plotted via plot(fit.tsne$Y).
Problem: Now, what I am looking for is a function that returns a prediction pred of dimension (1x2) for my test data, based on the trained t-SNE model. Something like:
# The function I am looking for (but doesn't exist yet):
pred <- predict(object=fit.tsne, newdata=x.test)
(How) Is this possible? Can you help me out with this?
From the author himself (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper (https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf).
So you can't directly embed new data points in an existing map. However, you can fit a multivariate regression model between your input data and the embedded dimensions. The author acknowledges that this is a limitation of the method and suggests this workaround.
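As a minimal sketch of that workaround (my own illustration, not the author's code): a k-nearest-neighbour regressor is about the simplest multivariate regressor, placing a new point at the average embedded location of its nearest training neighbours. It assumes the x.train, x.test and fit.tsne objects from the question:
k <- 5
d <- sqrt(colSums((t(x.train) - x.test)^2))       # distances from x.test to each training row
nn <- order(d)[1:k]                               # the k nearest training points
pred <- colMeans(fit.tsne$Y[nn, , drop = FALSE])  # (1x2) estimated map location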
t-SNE does not really work this way:
The following is an excerpt from the t-SNE author's website (https://lvdmaaten.github.io/tsne/):
Once I have a t-SNE map, how can I embed incoming test points in that
map?
t-SNE learns a non-parametric mapping, which means that it does not
learn an explicit function that maps data from the input space to the
map. Therefore, it is not possible to embed test points in an existing
map (although you could re-run t-SNE on the full dataset). A potential
approach to deal with this would be to train a multivariate regressor
to predict the map location from the input data. Alternatively, you
could also make such a regressor minimize the t-SNE loss directly,
which is what I did in this paper.
You may be interested in his paper: https://lvdmaaten.github.io/publications/papers/AISTATS_2009.pdf
This website in addition to being really cool offers a wealth of info about t-SNE: http://distill.pub/2016/misread-tsne/
On Kaggle I have also seen people do things like this, which may also be of interest:
https://www.kaggle.com/cherzy/d/dalpozz/creditcardfraud/visualization-on-a-2d-map-with-t-sne
This is the email answer from Jesse Krijthe, the author of the Rtsne package:
Thank you for the very specific question. I had an earlier request for
this and it is noted as an open issue on GitHub
(https://github.com/jkrijthe/Rtsne/issues/6). The main reason I am
hesitant to implement something like this is that, in a sense, there
is no 'natural' way to explain what a prediction means in terms of tsne.
To me, tsne is a way to visualize a distance matrix. As such, a new
sample would lead to a new distance matrix and hence a new
visualization. So, my current thinking is that the only sensible way
would be to rerun the tsne procedure on the train and test set
combined.
Having said that, other people do think it makes sense to define
predictions, for instance by keeping the train objects fixed in the
map and finding good locations for the test objects (as was suggested
in the issue). An approach I would personally prefer over this would
be something like parametric tsne, which Laurens van der Maaten (the
author of the tsne paper) explored in a paper. However, this would best
be implemented using something else than my package, because the
parametric model is likely most effective if it is selected by the
user.
So my suggestion would be to 1) refit the mapping using all data or 2)
see if you can find an implementation of parametric tsne, the only one
I know of would be Laurens's Matlab implementation.
Sorry I can not be of more help. If you come up with any other/better
solutions, please let me know.
t-SNE fundamentally does not do what you want. t-SNE is designed only for visualizing a dataset in a low (2 or 3) dimensional space. You give it all the data you want to visualize at once. It is not a general purpose dimensionality reduction tool.
If you are trying to apply t-SNE to "new" data, you are probably not thinking about your problem correctly, or perhaps simply did not understand the purpose of t-SNE.
I am generating data via API calls, one data point at a time. I want to feed each point to a Stan model, save the updated model, and discard the data point.
Is this possible with Stan?
If so, how do you deal with group-level parameters? For example, if my model has J group-level parameters, but I'm only inputting one data point at a time, will this not generate an error?
I think your problem can be conceptualized as Bayesian updating. In other words, your beliefs about the parameters are currently represented by some joint distribution; then you get one more data point and you want to update your beliefs in light of it. And then repeat that process.
If so, then you can do a Stan model that has only one data point, but you need some way of representing your current beliefs with a probability distribution to use as the prior. This typically would be done with some multivariate normal distribution on the parameters in the unconstrained space. You can use the unconstrain_pars function in the rstan package to obtain a matrix of unconstrained posterior draws and then see what multivariate normal it is close to. You probably want to use some shrunken covariance estimator for the multivariate normal if you have a lot of parameters. Then, in your Stan program use a multivariate normal prior on the parameters and do whatever transformations you need to do to get transformed parameters in the constrained space (many such transformations are documented in the Stan User Manual).
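A minimal sketch of one update step, with the assumptions flagged: an existing stanfit object fit, a new observation y_new, and a hypothetical Stan program update.stan whose data block declares the prior mean mu0 and covariance Sigma0. For brevity it summarizes the constrained draws directly; the more careful route described above is to map the draws through unconstrain_pars first:
library(rstan)

# 1. Summarize current beliefs as a multivariate normal
u <- as.matrix(fit)              # posterior draws, one column per parameter
u <- u[, colnames(u) != "lp__"]  # drop the log-posterior column
mu0 <- colMeans(u)               # prior mean for the next step
Sigma0 <- cov(u)                 # consider a shrunken estimator if there are many parameters

# 2. Refit with the single new data point, using that normal as the prior
fit <- stan("update.stan",
            data = list(J = length(mu0), mu0 = mu0, Sigma0 = Sigma0, y_new = y_new))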
It is true that when you estimate a hierarchical model with only one data point, that data point carries essentially no information about the groups it does not belong to. However, in that case, the margins of the posterior distribution for the parameters of the omitted groups will be essentially the same as the prior distribution. That is fine.
I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality; the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year * Treatment, ~1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year * Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Do you transform the data in the Excel file or in the R code (Number1 <- log(Number)) before running this model? Does link = "log" mean that the data are already log-transformed, or that the model will transform them?
If you have data with zeros, is it acceptable to add 1 to all observations so they are greater than zero before log-transforming: Number1 <- log(Number + 1)?
Is fit <- anova(model, model1, test = "Chisq") sufficient to compare the two models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me; you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms, which I've added above ...). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects, so this might be correct.
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
- Interpreting diagnostic plots is a bit of an art, but the degree of deviation you show above doesn't look like a problem.
- Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable, but again, unnecessary here.
- anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
I'm drawing ROC curves for a series of classifiers that I've implemented. The problem is that I get the following error message whenever I use a C5.0 classifier with a cost matrix (I'm working in RStudio).
Error in predict.C5.0(classifier.cost.1, data, type="prob"): confidence values (i.e. class probabilities) should not be used with costs.
The classifier is fine, and when I don't use type="prob" in the predict call it works too, but then I can't draw the ROC curve.
This is the code I'm using to create my own ROC curves:
library(C50)   # for predict.C5.0
library(ROCR)  # for prediction() and performance()
pred.class.cost <- predict(classifier.cost.1, data, type = "prob")
perf.class.cost <- performance(prediction(pred.class.cost[, 2], data$class), "tpr", "fpr")
ROC.class.cost <- data.frame(x = perf.class.cost@x.values[[1]],
                             y = perf.class.cost@y.values[[1]])
So two questions here:
What does the error mean and how can I fix it?
If it's not possible to fix, is there any other way to create my own ROC curves? (I then use ggplot2 to plot all the ROC curves together.)
Any help would be much appreciated. Thanks!
The predict section of the C5.0 documentation explains that:
When the cost argument is used in the main function, class probabilities derived from the class
distribution in the terminal nodes may not be consistent with the final predicted class. For this
reason, requesting class probabilities from a model using unequal costs will throw an error.
To get around this, suppose you want to give more weight to the positive class: you could oversample the positives or undersample the negatives (I prefer the latter). This has a similar effect to applying a cost matrix and will then allow you to get probabilities and generate a ROC curve.
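As a minimal sketch of the undersampling route (the data frame data, the column class and its levels "pos"/"neg" are illustrative assumptions): drop a share of the negatives, refit C5.0 without a cost matrix, and type = "prob" is allowed again:
library(C50)
library(ROCR)

pos <- data[data$class == "pos", ]
neg <- data[data$class == "neg", ]
set.seed(42)
neg.down <- neg[sample(nrow(neg), round(nrow(neg) / 2)), ]  # keep half the negatives
train.bal <- rbind(pos, neg.down)

fit.bal <- C5.0(class ~ ., data = train.bal)       # no costs, so probabilities are allowed
prob.bal <- predict(fit.bal, data, type = "prob")
perf.bal <- performance(prediction(prob.bal[, 2], data$class), "tpr", "fpr")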
I do spatial modelling of a variable T (temperature). I use what is commonly done in the literature: perform regression (using variables like altitude etc.) and then spatially interpolate the residuals using IDW. The R package gstat seems to have this option:
interpolated <- idw(T ~ altitude, stations, grid, idp=6)
spplot(interpolated["var1.pred"])
But in the documentation of idw() they write:
Function idw performs [...] . Don't use with predictors in the formula.
And actually, the result looks exactly as if only the regression was performed, without spatial interpolation of the residuals. I know I can do it manually:
m1 <- lm(T ~ altitude, data = as.data.frame(stations))
stations$Tres <- resid(m1)  # attach residuals so idw() can find Tres in stations
res.int <- idw(Tres ~ 1, stations, grid, idp = 6)
Tpred <- predict(m1, newdata = as.data.frame(grid))
spplot(SpatialGridDataFrame(grid, data.frame(T = Tpred + res.int$var1.pred)))
But this has many drawbacks: the model is not in one object, so you cannot directly get a summary, check deviance or residuals, and, most importantly, do cross-validation; everything has to be done manually. So:
Is there a way to do regression and IDW in one model in R?
Note that I don't want to use a different method of spatial interpolation, because IDW is the method used in this area of modelling and has been well tested for these purposes.
So what you want is to do the regression first and then perform IDW on the residuals. This cannot be done in one go, from both a theoretical and a software point of view:
Theoretically, kriging presents a unified way of treating the model and the residuals in one go, using a linear model with spatially correlated residuals to create the prediction. In the case of a hybrid regression+IDW model, this theoretical frame is not present. Because of this ad hoc nature I would not recommend it.
Practically, gstat simply does not support doing this in one go.
I'd recommend simply going for kriging, which is a very well established and published method. If this has not been used much in your area of expertise, this is a good time to introduce it. You could also have a look at the Tps function in the fields package.
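For illustration, a minimal sketch of kriging with external drift in gstat, which keeps the regression part and the spatially correlated residuals in a single model (the exponential variogram and its parameters below are placeholder assumptions, not values fitted to your data):
library(gstat)

v <- variogram(T ~ altitude, stations)  # variogram of the regression residuals
v.fit <- fit.variogram(v, vgm(psill = 1, "Exp", range = 10, nugget = 0.1))
ked <- krige(T ~ altitude, stations, grid, model = v.fit)  # regression + kriged residuals in one call
spplot(ked["var1.pred"])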
Some years back I wrote a technical report for the Dutch Meteorological Office that might be of interest to you; it deals with the interpolation of evaporation.