Simple variogram in R: understanding gstat::variogram() and the gstat object

I have a data.frame in R whose variables represent locations and whose observations are measurements of a certain variable at those locations. I want to measure how dependence decays with distance between locations, so the variogram is particularly useful for my study.
I am trying to use the gstat library but I am a bit confused about certain parameters. As far as I understand, the (empirical) variogram should only need as basic data:
The locations of the variables
Observations for these variables
And then other parameters like maximum distance, directions, ...
Now, the gstat::variogram() function requires as its first input an object of class gstat. Checking the documentation of the gstat() function I see that it outputs an object of this class, but this function requires a formula argument, which is described as:
formula that defines the dependent variable as a linear model of independent variables; suppose the dependent variable has name z, for ordinary and simple kriging use the formula z~1; for simple kriging also define beta (see below); for universal kriging, suppose z is linearly dependent on x and y, use the formula z~x+y
Could someone explain to me what this formula is for?

try
methods(variogram)
and you'll see that gstat has several methods for variogram; only one of them requires a gstat object as its first argument.
Given a data.frame, the easiest is to use the formula method:
variogram(z~1, ~x+y, data)
which specifies that in data, z is the observed variable of interest; z~1 specifies a constant mean model, and ~x+y specifies that the coordinates are found in columns x and y of data.
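
For a concrete illustration, here is a minimal sketch using the meuse data set that ships with the sp package, showing both the direct formula call and the equivalent route through a gstat object:
library(gstat)
data(meuse, package = "sp")   # a data.frame with coordinates in columns x and y
## formula method: constant mean model for log(zinc),
## coordinates taken from columns x and y
v <- variogram(log(zinc) ~ 1, ~x + y, data = meuse)
plot(v)
## equivalent route via a gstat object, which is what the
## variogram method for gstat objects expects as its first argument
g <- gstat(id = "zinc", formula = log(zinc) ~ 1, locations = ~x + y, data = meuse)
v2 <- variogram(g)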

How to figure out the parameters from mppm in R

I am working using the spatstat library in R.
I have several point pattern objects built from my own dataset. The point patterns contain only the x and y coordinates of the points in them. I wanted to fit the point patterns to a Gibbs process with Strauss interaction to build a model and simulate similar point patterns. I was able to use ppm function for that purpose if I work with one point pattern at a time. I used rmhmodel function on the ppm object returned from the ppm function. The rmhmodel function gave me the parameters beta, gamma and r, which I needed to use in rStrauss function further to simulate new point patterns. FYI, I am not using the simulate function directly as I want the new simulated point pattern to have flexible number of points that simulate does not give me.
I have several point pattern objects built from my own dataset. The point patterns contain only the x and y coordinates of the points in them. I want to fit the point patterns to a Gibbs process with Strauss interaction to build a model and simulate similar point patterns. I was able to use the ppm function for that purpose when working with one point pattern at a time. I used the rmhmodel function on the object returned by ppm, which gave me the parameters beta, gamma and r that I need to pass to the rStrauss function to simulate new point patterns. FYI, I am not using the simulate function directly because I want the simulated point patterns to have a flexible number of points, which simulate does not give me.
Now, if I want to work with all the point patterns at once, I can build a hyperframe of point patterns as described in the replicated point patterns chapter of the Baddeley textbook, but that requires the mppm function instead of ppm to fit the model, and rmhmodel does not work on an mppm object when I try to extract the model parameters beta, gamma and r.
How can I extract the fitted beta, gamma and r from a mppm object?
There are several ways to do this.
If you print a fitted model (obtained from ppm or from mppm) simply by typing the name of the object, the printed output contains a description of the fitted model including the model parameters.
If you apply the function parameters to a fitted model obtained from ppm you will obtain a list of the parameter values with appropriate names.
fit <- ppm(cells ~ 1, Strauss(0.12))
fit
parameters(fit)
For a model obtained from mppm, there could be different parameter values applying to each row of the hyperframe of data, so you would have to do lapply(subfits(model), parameters); the result is a list with one entry for each row of the hyperframe, containing the parameters relevant to that row.
A <- hyperframe(Bugs=waterstriders)
mfit <- mppm(Bugs ~ 1, data=A, Strauss(5))
lapply(subfits(mfit), parameters)
Alternatively you can extract the canonical parameters with coef and transform them to the natural parameters.
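For example, for the Strauss fit above the canonical coefficients are log(beta) and log(gamma), so exponentiating recovers the natural parameters. A sketch; the coefficient name "Interaction" is an assumption based on what current spatstat versions print, and the interaction radius r is simply the one you supplied to Strauss:
cc <- coef(fit)
beta  <- exp(cc[["(Intercept)"]])    # first-order intensity parameter
gamma <- exp(cc[["Interaction"]])    # Strauss interaction parameter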
You wrote:
I am not using the simulate function directly because I want the simulated point patterns to have a flexible number of points, which simulate does not give me.
This cannot be right. The function simulate.mppm generates simulated realisations with a variable number of points. Try simulate(mfit).

Partial Canonical Correspondence Analysis in R

I am having issues conducting a partial Canonical Correspondence Analysis (pCCA) in R. The data associated with the code is quite extensive, so I am unable to include it here.
The following code produces the error below it. In the pCCA model, I am attempting to account for both environmental and spatial variables in explaining the species matrix. The spatial variables are latitude and longitude values. The env2 variables are a host of continuous and a few binary (0,1) environmental variables.
mod2 <- cca(species ~ env2 + spatial)
Error in model.frame.default(~env2 + spatial, na.action = na.pass, xlev = NULL) : invalid type (list) for variable 'env2'
I have used unlist() on both env2 and spatial, but it does not work.
Thoughts?
The right-hand side of the formula must contain variables, but it seems that you have data frames of several variables. This will not work, and it gives an error message similar to the one in your post (this behaviour is documented). Further, your formula will not define a partial CCA, because it does not contain the function Condition(), which defines the terms to be partialled out.
The formula interface may work if you use numerical matrices as terms, but it won't work with unlist()ed variables.
If you are using vegan 2.5-1 or newer, you can define a partial CCA without the formula interface as
cca(species, env2, spatial)
and the data frames env2 and spatial are automatically expanded to model matrices; the spatial terms are partialled out before analysing the effects of the env2 terms.
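For comparison, a partial CCA via the formula interface uses Condition(); here is a sketch with vegan's built-in dune data, where Moisture is partialled out before assessing Management:
library(vegan)
data(dune)
data(dune.env)
## Management is the constrained term; Condition(Moisture) is partialled out
mod <- cca(dune ~ Management + Condition(Moisture), data = dune.env)
mod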

Prediction at a new value using lowess function in R

I am using the lowess function to fit a regression between two variables x and y. Now I want to know the fitted value at a new value of x; for example, how do I find the fitted value at x = 2.5 in the following example? I know loess can do that, but I want to reproduce someone else's plot, and he used lowess.
set.seed(1)
x <- 1:10
y <- x + rnorm(x)
fit <- lowess(x, y)
plot(x, y)
lines(fit)
Local regression (lowess) is a non-parametric statistical method; it is not like linear regression, where you can use the fitted model directly to estimate new values.
You'll need to take the values returned by the function (that's why it only returns a list of x and y values) and choose your own interpolation scheme, then use that scheme to predict your new points.
A common technique is spline interpolation (but there are others):
https://www.r-bloggers.com/interpolation-and-smoothing-functions-in-base-r/
EDIT: I'm pretty sure the predict function does the interpolation for you (for loess fits; lowess returns a plain list with no predict method). I also can't find any information about what exactly predict uses, so I tried to trace the source code.
https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/stats/R/loess.R
else { ## interpolate
## need to eliminate points outside original range - not in pred_
I'm sure the R code calls the underlying C implementation, but it's not well documented, so I don't know what algorithm it uses.
My suggestion is: either trust the predict function (with loess) or roll your own interpolation algorithm.
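As a concrete sketch of the latter, base R's approx() (linear) and spline() (cubic) can both interpolate the lowess output at a new x:
fit <- lowess(x, y)
approx(fit$x, fit$y, xout = 2.5)$y   # linear interpolation at x = 2.5
spline(fit$x, fit$y, xout = 2.5)$y   # cubic spline interpolation at x = 2.5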

Weighting the inverse of the variance in linear mixed model

I have a linear mixed model which is run 50 different times in a loop.
Each time the model is run, I want the response variable b to be weighted inversely by its variance: if the variance of b is small, I want the weight to be larger, and vice versa. This is a simplified version of the model:
model <- lme(b ~ type, random = ~1|replicate, weights = ~ I(1/b))
Here's the R data files:
b: https://www.dropbox.com/s/ziipdtsih5f0252/b.Rdata?dl=0
type: https://www.dropbox.com/s/90682ewib1lw06e/type.Rdata?dl=0
replicate: https://www.dropbox.com/s/kvrtao5i2g4v3ik/replicate.Rdata?dl=0
I'm trying to do this using the weights option in lme. Right now I have this as:
weights = ~ I(1/b).
But I don't think this is correct... maybe weights = ~ I(1/var(b))?
I also want to adjust this slightly as b consists of two types of data specified in the factor variable (of 2 levels) type.
I want to inversely weight the variance of each of these two levels separately. How could I do this?
I'm not sure it makes sense to talk about weighting the response variable in this manner. The descriptions I have found on the R-SIG-mixed-models mailing list refer to inverse weighting derived from the predictor variables, either the fixed effects or the random effects; the weights are used when minimizing the deviations of the model fits from the response. There is a function, varFixed, that specifies a fixed variance structure (a sub-class of the varFunc family of variance functions), and it has a help page (linked from the weights section of the ?gls page):
?varFixed
?varFunc
It requires a formula object as its argument, so my original guess was:
model <- lme(b ~ type, random = ~1|replicate, weights = varFixed( ~type) )
Which you proved incorrect. How about seeing if this works:
model <- lme(b ~ type, random = ~1|replicate, weights = varFixed( ~1| type) )
(My continuing guess is that this weighting is the default situation and specifying these particular weights may not be needed; the inverse nature of the weighting is implied and does not need to be explicitly stated with "1/type". In the case of mixed models the "correct" construction depends on the design and the prior science, and none of this has been presented, so this is really only a syntactic comment and not an endorsement of this model. I did not download the files; it seems rather odd to have three separate files and no code for linking them into a dataframe. Generally one would want a single data object whose column names would be used in the formulas of the regression function. I also suspect this is the default behavior of this function, so my untested prediction is that you would get no change by omitting the weights parameter.)
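For the stated goal of a separate variance for each level of type, the standard nlme construct is the varIdent variance function; an untested sketch, assuming the three objects have been combined into a single hypothetical data frame dat:
library(nlme)
## varIdent estimates one variance parameter per level of 'type';
## observations are then weighted by the inverse of that estimated variance
model <- lme(b ~ type, random = ~ 1 | replicate,
             weights = varIdent(form = ~ 1 | type),
             data = dat)
summary(model)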

How to call randomForest predict for use with ROCR?

I am having a hard time understanding how to build a ROC curve, and I have now come to the conclusion that maybe I don't create the model correctly. I am running a randomForest model on a dataset where the class attribute "y_n" is 0 or 1. I have divided the dataset into bank_training and bank_testing for prediction purposes.
Here are the steps I take:
bankrf <- randomForest(y_n~., data=bank_training, mtry=4, ntree=2,
keep.forest=TRUE, importance=TRUE)
bankrf.pred <- predict(bankrf, bank_testing, type='response',
predict.all=TRUE, norm.votes=TRUE)
Is what I have done so far correct? The bankrf.pred object that is created is a list with 2 components named aggregate and individual. I don't understand where these 2 component names came from. Moreover, when I run:
summary(bankrf.pred)
Length Class Mode
aggregate 22606 factor numeric
individual 45212 -none- character
What does this summary mean? The datasets (training & testing) have 22605 and 22606 rows respectively. If someone can explain to me what is happening I would be very grateful; I think there is something wrong in all this.
When I try to design the ROC curve with ROCR I use the following code:
library(ROCR)
pred <- prediction(bank_testing$y_n, bankrf.pred$c(0,1))
Error in is.data.frame(labels) : attempt to apply non-function
Is it just a mistake in the way I try to create the ROC curve, or does the problem start with randomForest?
The documentation for the function you are attempting to use includes this description of its two main arguments:
predictions A vector, matrix, list, or data frame containing the
predictions.
labels A vector, matrix, list, or data frame containing the true
class labels. Must have the same dimensions as 'predictions'.
You are currently passing the variable y_n to the predictions argument, and what looks to me like nonsense to the labels argument.
The predictions will be stored in the output of the random forest model. As documented at ?predict.randomForest, it will be a list with two components. aggregate will contain the predicted values for the entire forest, while individual will contain the predicted values for each individual tree.
So you probably want to do something like this:
pred <- prediction(bankrf.pred$aggregate, bank_testing$y_n)
See how that works? The predicted values are passed to the predictions argument, while the "labels", or true values, are passed to the labels argument.
You should remove the predict.all=TRUE argument from predict if you simply want the predicted classes. With predict.all=TRUE you are telling the function to keep the predictions of every individual tree rather than just the aggregate prediction from the forest.
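Putting it together, a typical randomForest + ROCR workflow uses class probabilities rather than hard class labels. A sketch, assuming y_n is a factor with levels "0" and "1" (the column name "1" and ntree value are assumptions, not from the original post):
library(randomForest)
library(ROCR)
bankrf <- randomForest(y_n ~ ., data = bank_training, ntree = 500)
## column "1" of the probability matrix is assumed to hold P(y_n = 1)
probs <- predict(bankrf, bank_testing, type = "prob")[, "1"]
pred <- prediction(probs, bank_testing$y_n)
perf <- performance(pred, "tpr", "fpr")
plot(perf)                                # ROC curve
performance(pred, "auc")@y.values[[1]]    # area under the curve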
