Thin plate spline on Linear Network with Spatstat

I have been using lppm from spatstat and I want to fit a log-linear model.
I can define covariates as linfun object and use in the model.
Let's say we are interested in modeling car theft in Australia. Let's assume cov1 is the distance to the nearest school and cov2 is the distance to the nearest police department.
We want to use X and Y coordinates in the model.
Would lppm(L ~ cov1 + cov2 + x + y) work? Are the x and y in the model the locations of the events?
How can I use a thin-plate spline on the linear network? I can create grids for a ppp object, but lpp is not as straightforward as I thought. Can I pass a matrix to the lppm object?

Code in spatstat for linear networks is still under development, but lppm is based upon ppm, so you can look at the help files and documentation for ppm for explanation. The variable names appearing in the model formula can be:
the names of images (of class im or linim)
the names of spatial functions (class funxy or linfun)
the symbols x, y (representing Cartesian coordinates)
the symbol marks (representing categorical mark values)
A term in the model formula may be just the name of one of these variables, or an expression involving these variable names, including functions applied to these variables.
Your example would work.
You can get B-splines of the Cartesian coordinates by including a term such as bs(x).
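For example, here is a minimal sketch of such a model, assuming L is your lpp object, cov1 and cov2 are linfun covariates as in the question, and df = 4 is an arbitrary choice:
library(spatstat)
library(splines)   # provides bs() for B-spline basis terms

# L: an lpp object (points on a linear network); cov1, cov2: linfun covariates.
# bs(x) and bs(y) add B-spline terms in the Cartesian coordinates.
fit <- lppm(L ~ cov1 + cov2 + bs(x, df = 4) + bs(y, df = 4))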
If you need more help, first read chapter 9 of the spatstat book

Related

Accounting for Spatial Autocorrelation in Model

I am trying to account for spatial autocorrelation in a model in R. Each observation is a country for which I have the average latitude and longitude. Here's some sample data:
country <- c("IQ", "MX", "IN", "PY")
long <- c(43.94511, -94.87018, 78.10349, -59.15377)
lat <- c(33.9415073, 18.2283975, 23.8462264, -23.3900255)
Pathogen <- c(10.937891, 13.326284, 12.472374, 12.541716)
Answer.values <- c(0, 0, 1, 0)
data <- data.frame(country, long, lat, Pathogen, Answer.values)
I know spatial autocorrelation is an issue (Moran's i is significant in the whole dataset). This is the model I am testing (Answer Values (a 0/1 variable) ~ Pathogen Prevalence (a continuous variable)).
model <- glm(Answer.values ~ Pathogen,
             na.action = na.omit,
             data = data,
             family = "binomial")
How would I account for spatial autocorrelation with a data structure like that?
There are a lot of potential answers to this. One easy(ish) way is to use mgcv::gam() to add a spatial smoother. Most of your model would stay the same:
library(mgcv)
gam(Answer.values ~ Pathogen + s([something]),
    family = "binomial",
    data = data)
where s([something]) is some form of smooth spatial term (a short sketch follows this list). Three possible/reasonable choices would be:
a spherical spline (?mgcv::smooth.construct.sos.smooth.spec), which takes lat/long as input; this would be useful if (1) you have data over a significant fraction of the earth's surface (so that a smoother that constructs a 2D planar spatial smooth is less reasonable); (2) you want to account for distance between locations in a continuous way
a Markov random field (?mgcv::smooth.construct.mrf.smooth.spec). This is essentially the spatial analogue of a discrete order-1 autoregressive structure (i.e. countries are directly correlated only with their direct neighbours, however you choose to define that). In order to do this you have to come up somehow with a neighbourhood list (i.e. a list of countries, where the elements are lists of countries that are neighbours of the original countries). You could do this however you like, e.g. by finding nearest neighbours geographically. (Check out some introductions to spatial statistics/spatial data analysis in R.) (On the other hand, if you're testing Moran's I then you've presumably already come up with some way to identify first-order neighbours ...)
if you're comfortable treating lat/long as coordinates in a 2D plane, then you have a lot of choices of smoothing basis, e.g. ?mgcv::smooth.construct.gp.smooth.spec (Gaussian process smoothers, which include most of the standard spatial autocorrelation models as special cases)
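For concreteness, here is a hedged sketch of how two of these choices might look with the question's variable names; the four-row sample data above is far too small for any smoother, so assume the full country-level dataset:
library(mgcv)

# Spline on the sphere: lat/long treated as coordinates on the globe.
# Note the argument order for bs = "sos": latitude first, then longitude.
fit_sos <- gam(Answer.values ~ Pathogen + s(lat, long, bs = "sos"),
               family = binomial,
               data = data)

# Or, treating long/lat as planar coordinates, a Gaussian process smoother:
fit_gp <- gam(Answer.values ~ Pathogen + s(long, lat, bs = "gp"),
              family = binomial,
              data = data)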
A helpful link for getting up to speed with GAMs in R ...

Replicated point patterns on linear networks in spatstat

I have point pattern data from a replicated experiment where the points in each replicate are constrained to the same linear network (the data are from daily surveys of a bike path for snakes: each day gives a separate point pattern of locations where animals are found).
I know that in spatstat it is possible to fit point processes to multiple point patterns simultaneously (with mppm), and to fit point process models on linear networks (with lppm); is it possible to do both simultaneously? As far as I can tell, mppm will not accept lpp objects: is there another way of fitting this type of model?
This is not yet fully supported in spatstat.
However, you can do most of what you want by converting the lpp objects to quadrature schemes using linequad and then using these quadrature schemes instead of ppp objects in the hyperframe. Example:
library(spatstat)
# two point patterns on the same linear network
X1 <- spiders
X2 <- runiflpp(25, domain(spiders))
# convert each lpp to a quadrature scheme on the network
A <- linequad(X1)
B <- linequad(X2)
# a spatial covariate (here simply the x coordinate)
f <- function(x, y) { x }
# assemble a hyperframe and fit a joint model with mppm
H <- hyperframe(X = solist(A, B), Z = list(f, f))
fit <- mppm(X ~ Z, data = H)
Most methods for mppm will work correctly, except that you can't simulate or predict from the fitted model, because it doesn't know that it is supposed to be on a network.
If you have a long list of ppp objects, then instead of converting the point patterns to quadrature schemes one-by-one, you could use solapply( , linequad).
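For instance, here is a hedged sketch of that shortcut, reusing the toy objects above; the list patterns and its contents are assumptions, not real survey data:
library(spatstat)
# Hypothetical list of lpp objects, e.g. one pattern per survey day,
# all on the same linear network.
patterns <- list(spiders,
                 runiflpp(25, domain(spiders)),
                 runiflpp(30, domain(spiders)))
quads <- solapply(patterns, linequad)          # convert every pattern at once
f <- function(x, y) { x }                      # same toy covariate as above
H <- hyperframe(X = quads, Z = rep(list(f), length(quads)))
fit <- mppm(X ~ Z, data = H)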

sommer, multivariate liner mixed model analysis, plant breeding applications

I have had the opportunity to read the sommer documentation, but I was not able to find any example of regression on markers (rrBLUP parametrization), only examples using the kinship (GBLUP) parametrization. Could you kindly say whether it is possible
in sommer to regress directly on markers, instead of using the kinship matrix? Especially under multivariate scenarios (multiple traits, locations, etc.), modelling an unstructured variance-covariance for the marker effects.
In sommer >= 3.7 it is straightforward to fit an rrBLUP model in the multivariate setting; the DT_cpdata dataset has a good example:
library(sommer)
data(DT_cpdata)

# rrBLUP parametrization: regress directly on the marker matrix GT
mix.rrblup <- mmer(fixed = cbind(color, Yield) ~ 1,
                   random = ~ vs(list(GT), Gtc = unsm(2)) + vs(Rowf, Gtc = diag(2)),
                   rcov = ~ vs(units, Gtc = unsm(2)),
                   data = DT)
summary(mix.rrblup)

# GBLUP parametrization: use the additive relationship matrix built from GT
A <- A.mat(GT)
mix.gblup <- mmer(fixed = cbind(color, Yield) ~ 1,
                  random = ~ vs(id, Gu = A, Gtc = unsm(2)) + vs(Rowf, Gtc = diag(2)),
                  rcov = ~ vs(units, Gtc = unsm(2)),
                  data = DT)
summary(mix.gblup)
The vs() function builds a variance structure for a given random effect; the covariance structure for the univariate/multivariate setting is provided in the Gtc argument as a matrix, where e.g. a diagonal, unstructured, or customized structure can be specified. When the user wants to provide a customized matrix as a random effect, such as a marker matrix GT for rrBLUP, it has to be wrapped in a list() so that sommer can internally put it in the right format. In the GBLUP version, the random effect id, which carries the labels for the individuals, can have a covariance matrix provided in the Gu argument.
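As a small illustration of the Gtc argument, here is a hedged sketch based on my reading of the sommer documentation (not from the original answer):
library(sommer)
# Constraint codes as I understand them: 0 = not estimated, 1 = estimated and
# constrained positive, 2 = estimated and unconstrained.
# This custom 2x2 matrix estimates both trait variances but drops the trait
# covariance, which here is equivalent to diag(2).
Gtc_custom <- matrix(c(1, 0,
                       0, 1), nrow = 2)
# It could then be passed as, for example:
#   vs(list(GT), Gtc = Gtc_custom)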

How to plot SVM classification hyperplane

Here is my sample code for SVM classification.
library(e1071)
train <- read.csv("traindata.csv")
test  <- read.csv("testdata.csv")
svm.fit  <- svm(as.factor(value) ~ ., data = train, kernel = "linear", method = "class")
svm.pred <- predict(svm.fit, test, type = "class")
The feature value in my example is a factor with two levels (either true or false). I wanted to plot a graph of my SVM classifier and group the points into two groups: one group for those labelled "true" and another for those labelled "false". How do we produce a 2D or 3D SVM plot? I tried plot(svm.fit, train) but it doesn't seem to work for me.
There is this answer I found on SO, but I am not clear on what t, x, y, z, w, and cl are in that answer.
Plotting data from an svm fit - hyperplane
I have about 50 features in my dataset, of which the last column is a factor. Is there any simple way of doing this, or could anyone help explain that answer?
The short answer is: you cannot. Your data is 50-dimensional, and you cannot plot 50 dimensions. The only things you can do are rough approximations, reductions and projections, but none of these can actually represent what is happening inside. In order to plot a 2D/3D decision boundary, your data has to be 2D/3D (2 or 3 features, which is exactly what is happening in the link provided: they only have 3 features, so they can plot all of them). With 50 features you are left with statistical analysis, not actual visual inspection.
You can obviously take a look at some slices (select 3 features, or principal components of a PCA projection). If you are not familiar with the underlying linear algebra, you can simply use the gmum.r package, which does this for you. Simply train the SVM and plot it forcing the "pca" visualization, like here: http://r.gmum.net/samples/svm.basic.html.
library(gmum.r)
# We will perform basic classification on breast cancer dataset
# using LIBSVM with linear kernel
data(svm_breast_cancer_dataset)
# We can pass either formula or explicitly X and Y
svm <- SVM(X1 ~ ., svm.breastcancer.dataset, core="libsvm", kernel="linear", C=10)
## optimization finished, #iter = 8980
pred <- predict(svm, svm.breastcancer.dataset[,-1])
plot(svm, mode="pca")
which gives a 2D plot of the PCA projection of the points, coloured by class (figure not reproduced here).
For more examples you can refer to the project website http://r.gmum.net/
However, this only shows the projections of the points and their classification; you cannot see the hyperplane itself, because it is a high-dimensional object (in your case 49-dimensional), and in such a projection the hyperplane would be ... the whole screen. Exactly no pixel would be left "outside". Think about it in these terms: if you have a 3D space with a hyperplane inside it, that hyperplane is a 2D plane. If you try to plot it in 1D, you will end up with the whole line "filled" by your hyperplane, because no matter where you place the line in 3D, the projection of the 2D plane onto that line fills it up; the only other possibility is that the line is perpendicular to the plane, in which case the projection is a single point. The same applies here: if you try to project a 49-dimensional hyperplane onto 3D, you will end up with the whole screen "black".
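If the gmum.r package is not available on your system, a similar projection-based view can be sketched with e1071 and prcomp; this is a hypothetical illustration using the question's train data frame and its value column, not the answer's original code:
library(e1071)

# Hypothetical sketch (not from the original answer): project the data onto its
# first two principal components and colour the points by the SVM's predicted
# class. Assumes the question's data frame `train`, whose last column `value`
# is the class label and whose remaining ~50 columns are numeric features.
fit  <- svm(as.factor(value) ~ ., data = train, kernel = "linear")
pred <- predict(fit, train)                    # predicted classes

X  <- train[, setdiff(names(train), "value")]  # feature columns only
pc <- prcomp(X, scale. = TRUE)                 # PCA of the features

plot(pc$x[, 1], pc$x[, 2],
     col = as.integer(pred) + 1, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Training points on the first two principal components")
legend("topright", legend = levels(pred),
       col = seq_along(levels(pred)) + 1, pch = 19)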

loess predict with new x values

I am attempting to understand how the predict.loess function is able to compute new predicted values (y_hat) at points x that do not exist in the original data. For example (this is a simple example and I realize loess is obviously not needed for an example of this sort but it illustrates the point):
x <- 1:10
y <- x^2
mdl <- loess(y ~ x)
predict(mdl, 1.5)
[1] 2.25
loess regression works by fitting local polynomials around each x, and thus it creates a predicted y_hat at each x. However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat, for example the span or degree. When I do predict(mdl, 1.5), how is predict able to produce a value at this new x? Is it interpolating between the two nearest existing x values and their associated y_hat values? If so, what are the details behind how it is doing this?
I have read the cloess documentation online but am unable to find where it discusses this.
"However, because there are no coefficients being stored, the 'model' in this case is simply the details of what was used to predict each y_hat"
Perhaps you used the print(mdl) command, or simply typed mdl, to see what the model mdl contains, but that is not the case: the model is really complicated and stores a large number of parameters.
To get an idea of what is inside, you can use unlist(mdl) and see the long list of parameters it contains.
This is the part of the command's manual that describes how it really works:
Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighbourhood of x, weighted by their distance from x (with differences in ‘parametric’ variables being ignored when computing the distance). The size of the neighbourhood is controlled by α (set by span or enp.target). For α < 1, the neighbourhood includes proportion α of the points, and these have tricubic weighting (proportional to (1 - (dist/maxdist)^3)^3). For α > 1, all points are used, with the ‘maximum distance’ assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.
For the default family, fitting is by (weighted) least squares. For family="symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least-squares fit, this need not be a very resistant fit.
What I believe is that it fits a local polynomial model in the neighborhood of every point (not just a single polynomial for the whole set). The neighborhood does not mean just one point before and one point after: if I were implementing such a function, I would put a large weight on the points nearest to x, lower weights on more distant points, and fit a polynomial under those weights.
Then, if the new value x' at which a prediction is required is closest to the point x, the polynomial fitted on the neighborhood of x, say P, is evaluated at x', and P(x') is the prediction.
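Here is a hedged sketch of that idea, mimicking the tricube weighting from the manual excerpt above; it is not the actual loess internals:
# Toy data from the question
x <- 1:10
y <- x^2
x0 <- 1.5                                   # point at which to predict

span <- 0.75                                # default loess span
q    <- ceiling(span * length(x))           # number of neighbours in the fit
d    <- abs(x - x0)                         # distances to x0
maxd <- sort(d)[q]                          # distance to the q-th nearest point
w    <- ifelse(d <= maxd, (1 - (d / maxd)^3)^3, 0)   # tricube weights

# Weighted local quadratic (default loess degree is 2), evaluated at x0
local_fit <- lm(y ~ poly(x, 2), weights = w)
predict(local_fit, data.frame(x = x0))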
Let me know if you are looking for anything special.
To better understand what is happening in a loess fit, try running the loess.demo function from the TeachingDemos package. This lets you interactively click on the plot (even between points), and it then shows the set of points and their weights used in the prediction and the predicted line/curve for that point.
Note also that the default for loess is to do a second smoothing/interpolating on the loess fit, so what you see in the fitted object is probably not the true loess fitting information, but the secondary smoothing.
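A small, hedged way to see the effect of that secondary interpolation is to compare the default surface = "interpolate" with surface = "direct", which computes the local fit exactly at every prediction point:
x <- 1:10
y <- x^2

fit_interp <- loess(y ~ x)   # default: surface = "interpolate" (k-d tree + blending)
fit_direct <- loess(y ~ x, control = loess.control(surface = "direct"))

newx <- seq(1, 10, by = 0.1)
p_interp <- predict(fit_interp, newx)
p_direct <- predict(fit_direct, newx)

# Any difference between the two sets of predictions is due to the
# interpolation step rather than the local fitting itself.
max(abs(p_interp - p_direct))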
Found the answer on page 42 of the manual:
In this algorithm a set of points typically small in number is selected for direct computation using the loess fitting method and a surface is evaluated using an interpolation method that is based on blending functions. The space of the factors is divided into rectangular cells using an algorithm based on k-d trees. The loess fit is evaluated at the cell vertices and then blending functions do the interpolation. The output data structure stores the k-d trees and the fits at the vertices. This information is used by predict() to carry out the interpolation.
I guess that to predict at x, predict.loess makes a regression with some points near x, and calculates the y-value at x.
Visit https://stats.stackexchange.com/questions/223469/how-does-a-loess-model-do-its-prediction
