Confusion about Markov random fields in the mgcv package in R

To implement a spatial analysis, I tried a simple Markov random field smoother on an example from the mgcv package in R; the manual page is here:
https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.construct.mrf.smooth.spec.html
This is the example I tried:
library(mgcv)
data(columb) ## data frame
data(columb.polys) ## district shapes list
xt <- list(polys=columb.polys) ## neighbourhood structure info for MRF
b <- gam(crime ~ s(district,bs="mrf",xt=xt),data=columb,method="REML")
However, when I looked at the estimated coefficients in b$coefficients, there were only 48 estimates from the Markov random field smoother:
> b$coefficients
(Intercept) s(district).1 s(district).2 s(district).3 s(district).4
35.12882390 -10.96490165 20.99250496 16.04968951 10.49535483
s(district).5 s(district).6 s(district).7 s(district).8 s(district).9
16.56626217 14.55352540 17.90043996 -0.60239588 13.41215603
s(district).10 s(district).11 s(district).12 s(district).13 s(district).14
18.61920671 -11.13853418 -2.95677338 7.89719220 3.04717540
s(district).15 s(district).16 s(district).17 s(district).18 s(district).19
-11.18235328 12.57473374 19.83013619 10.56130003 12.36240748
s(district).20 s(district).21 s(district).22 s(district).23 s(district).24
15.65160761 20.40965885 24.79853590 0.05312873 -14.65881026
s(district).25 s(district).26 s(district).27 s(district).28 s(district).29
-13.01294201 7.16191556 -9.36311304 3.65410713 -16.37092777
s(district).30 s(district).31 s(district).32 s(district).33 s(district).34
11.23500771 13.92036006 -14.67653893 -12.39341674 11.02216471
s(district).35 s(district).36 s(district).37 s(district).38 s(district).39
-12.93210046 -15.48924425 3.42745125 -2.54916472 -1.90604972
s(district).40 s(district).41 s(district).42 s(district).43 s(district).44
-16.25160966 -7.46491914 -4.48126353 -7.61064264 -2.91807488
s(district).45 s(district).46 s(district).47 s(district).48
-12.12765102 6.68446503 2.55883220 -0.20920888
However, the district shapes list contains 49 areas (numbered 0 to 48). The same thing happened with my own data: 28 areas produced only 27 estimates from the Markov random field smoother.
My understanding is that a Markov random field used as a spatial term can be regarded as a structured random effect; however, the MRF smoother in mgcv seems to automatically treat the first area as a reference level. Is it instead behaving like a fixed effect that accounts for spatial autocorrelation?
If so, a follow-up problem is how to interpret such output. It seems odd to explain each spatial estimate as the difference between an area and the reference area; that interpretation is not very meaningful.
I am wondering whether we can fit a Markov random field smoother like a random effect in R. I hope someone familiar with this package can provide some suggestions. Thanks!

The coefficients of such a smooth are not, and should not be, interpreted as coefficients applied individually to each value of the covariate that s() is a function of. They describe the correlation structure among the areas, and the number of such coefficients is controlled by the k argument of the s() function.
By default, s() sets k to its maximum, n-1, where n is the number of distinct areas (levels of district): an intercept is needed to set the average level the smooth varies around, and the total number of fitted parameters cannot exceed the size of the data.
For further details, check the paragraph on choose.k in the mgcv help.
If you are interested in something directly applicable to each of your districts, look at the values predicted by the model. Following the gamObject help, these are given by the fitted.values component.
Here I get:
> b$fitted.values
[1] 18.81758 22.12502 30.13315 33.14305 44.11208 30.17184 20.96227 39.77438
[9] 35.64875 32.88071 54.08242 49.13961 43.58527 49.65618 47.64344 50.99036
[17] 32.48752 46.50207 51.70913 21.95138 40.98711 36.13709 21.90757 45.66465
[25] 52.92006 43.65122 45.45233 48.74153 53.49958 57.88845 18.43111 20.07698
[33] 40.25183 23.72681 36.74403 16.71899 44.32493 47.01028 18.41338 20.69650
[41] 20.15782 17.60067 36.51737 30.54075 31.18387 16.83831 25.62500 28.60866
[49] 25.47928
plot(b) lets you visualize the fit; it looks good, and the correspondence between observed and fitted values also seems reasonable: plot(columb$crime, b$fitted.values)
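If you want the contribution of the spatial smooth itself, rather than the fitted crime values, a sketch using predict() on the model above may help (assuming, as is usual in mgcv, that the term column is labelled "s(district)"):
sm <- predict(b, type="terms")[, "s(district)"]  # centred per-district contribution of the MRF smooth
head(sm)
# for this model the fitted values are just intercept + smooth contribution
all.equal(as.numeric(fitted(b)), as.numeric(coef(b)["(Intercept)"] + sm))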

Related

sommer, multivariate linear mixed model analysis, plant breeding applications

I have read the sommer documentation, but I was not able to find any example of regression directly on markers (the rrBLUP parametrization), only examples using the kinship parametrization (GBLUP). Could you please say whether it is possible
in sommer to regress directly on the markers instead of using the kinship matrix, especially in multivariate scenarios (multiple traits, locations, etc.) modelling an unstructured variance-covariance for the marker effects?
In sommer >= 3.7 it is straightforward to fit an rrBLUP model in the multivariate setting; the DT_cpdata dataset has a good example:
library(sommer)
data(DT_cpdata)
# rrBLUP: regress directly on the marker matrix GT
mix.rrblup <- mmer(fixed=cbind(color,Yield)~1,
                   random=~vs(list(GT),Gtc=unsm(2)) + vs(Rowf,Gtc=diag(2)),
                   rcov=~vs(units,Gtc=unsm(2)),
                   data=DT)
summary(mix.rrblup)
# GBLUP: use the genomic relationship matrix instead of the markers
A <- A.mat(GT)
mix.gblup <- mmer(fixed=cbind(color,Yield)~1,
                  random=~vs(id,Gu=A,Gtc=unsm(2)) + vs(Rowf,Gtc=diag(2)),
                  rcov=~vs(units,Gtc=unsm(2)),
                  data=DT)
summary(mix.gblup)
The vs() function builds a variance structure for a given random effect, and the covariance structure for the univariate/multivariate setting is provided in the Gtc argument as a matrix, where e.g. a diagonal, unstructured, or customized structure can be specified. When the user wants to provide a customized matrix as a random effect, such as a marker matrix GT for rrBLUP, it has to be wrapped in list() so that sommer can internally put it in the right format, whereas in the GBLUP version the random effect id, which holds the labels of the individuals, can have a covariance matrix supplied through the Gu argument.
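For reference, you can print the constraint matrices themselves to see what is being passed to Gtc (a small sketch, assuming sommer >= 3.7):
unsm(2)   # unstructured 2x2: both trait variances and the trait covariance are estimated
diag(2)   # diagonal 2x2: only the two variances are estimated, the covariance is fixed at zero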

Writing syntax for bivariate survival censored data to fit copula models in R

library(Sunclarco)
library(MASS)
library(survival)
library(SPREDA)
library(SurvCorr)
library(doBy)
#Dataset
data("diabetes")   # loads the diabetes data frame from SurvCorr
data1=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME1","STATUS1"))
data2=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME2","STATUS2"))
#Add a variable which identifies the cluster
data1$CLUSTER <- rep(1,197)
data2$CLUSTER <- rep(2,197)
#Rename the variables so that the common items have uniform names in both data sets
names(data1)[5] <- "TIME"
names(data1)[6] <- "STATUS"
names(data2)[5] <- "TIME"
names(data2)[6] <- "STATUS"
#merge the files
Total_data=rbind(data1,data2)
# Rearrange the database
diabete_full=orderBy(~LASER+TRT_EYE+AGE_DX,data=Total_data)
diabete_full
#Use the Sunclarco package for the Clayton and Gumbel copulas
Clayton_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
                                clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
                                stage=1,copula="Clayton",marginal="Weibull")
summary(Clayton_1step)
# Estimates StandardErrors
#lambda 0.01072631 0.005818201
#rho 0.79887565 0.058942208
#theta 0.10224445 0.090585891
#beta_LASER 0.16780224 0.157652947
#beta_TRT_EYE 0.24580489 0.162333369
#beta_ADULT 0.09324001 0.158931463
# Estimate StandardError
#Kendall's Tau 0.04863585 0.04099436
Clayton_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
                                clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
                                stage=2,copula="Clayton",marginal="Weibull")
summary(Clayton_2step)
# Estimates StandardErrors
#lambda 0.01131751 0.003140733
#rho 0.79947406 0.012428824
#beta_LASER 0.14244235 0.041845100
#beta_TRT_EYE 0.27246433 0.298184235
#beta_ADULT 0.06151645 0.253617142
#theta 0.18393973 0.151048024
# Estimate StandardError
#Kendall's Tau 0.08422381 0.06333791
Gumbel_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
                               clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
                               stage=1,copula="GH",marginal="Weibull")
summary(Gumbel_1step)
# Estimates StandardErrors
#lambda 0.01794495 0.01594843
#rho 0.70636113 0.10313853
#theta 0.87030690 0.11085344
#beta_LASER 0.15191936 0.14187943
#beta_TRT_EYE 0.21469814 0.14736381
#beta_ADULT 0.08284557 0.14214373
# Estimate StandardError
#Kendall's Tau 0.1296931 0.1108534
Gumbel_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
                               clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
                               stage=2,copula="GH",marginal="Weibull")
I am required to fit copula models in R for different copula classes, particularly the Gaussian, FGM, Plackett and possibly Frank (if I still have time). The data I am using is the diabetes data available in R through the packages survival and SurvCorr.
This is for my thesis, an exploratory study of how the choice of copula class affects the results obtained on the same data. I found the Sunclarco package, with which I was able to fit the Clayton and Gumbel copula classes, but it is not yet available for the other classes.
The challenge I am facing is that the censoring has to be incorporated in the likelihood estimation, which makes the syntax hard for me to write since I don't have a strong programming background. In addition, I have to incorporate the covariates and see whether or not they have an impact on the association. Talking to my promoter, he gave me insights on how to approach writing the syntax, which go as follows:
• First of all, forget about the likelihood function. We only work with the log-likelihood function. That way you do not need to take the product of the contributions over the observations; you can take the sum of the log-contributions over the different observations instead.
• Next, since we have a balanced design, we can use the regular data frame structure in which each cluster has only one row. The different variables, such as the lifetimes, the censoring indicators and all the covariates, are the columns of this data frame.
• Due to the bivariate setting, there are only 4 possible ways to contribute to the log-likelihood function: both uncensored, both censored, first uncensored and second censored, or first censored and second uncensored. To create the log-likelihood function, create a new variable in your data frame containing the correct log-likelihood contribution according to which individual in the couple is censored. When you take the sum of this variable, you have the value of the log-likelihood function.
• Since this function depends on parameters, you can use any optimizer, such as optim or nlm, to get the optimal values. Be careful here: optim and nlm look for the minimum of a function, not the maximum. This is easily solved, since the minimum of -f is attained where f is maximised.
• Since you have the different expressions for the derivatives of each copula function, it should now be possible to write down the likelihood functions.
I am still struggling because the likelihood changes for each copula class, as the generator function is unique to the respective copula and needs to be adapted during estimation. Lastly, I have to run the analysis for both one-stage and two-stage copula estimation, since I will use them to compare results.
If someone could help me figure this out I would be eternally grateful, even for just one copula class (e.g. the Gaussian); I can work out the rest from that one example, since I have tried everything, still have nothing to show for it, and feel time is running out to get the answers by myself.
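Following that outline, here is a minimal sketch for a single copula (the Clayton, which can be checked against the Sunclarco output above) with shared Weibull margins and no covariates. The data frame wide and its columns t1, t2, d1, d2 (one row per cluster, holding the two times and the two censoring indicators) are assumed names you will have to adapt; other copulas only require swapping in their own C(u,v), its partial derivatives and its density, and covariates would enter through the Weibull scale.
neg_loglik_clayton <- function(par, t1, t2, d1, d2) {
  lambda <- exp(par[1]); rho <- exp(par[2]); theta <- exp(par[3])  # keep parameters positive
  # Weibull margins: S(t) = exp(-lambda*t^rho), log f(t) = log(lambda*rho) + (rho-1)*log(t) - lambda*t^rho
  S1 <- exp(-lambda*t1^rho); S2 <- exp(-lambda*t2^rho)
  logf1 <- log(lambda*rho) + (rho-1)*log(t1) - lambda*t1^rho
  logf2 <- log(lambda*rho) + (rho-1)*log(t2) - lambda*t2^rho
  # Clayton survival copula C(u,v) = (u^-theta + v^-theta - 1)^(-1/theta)
  A <- S1^(-theta) + S2^(-theta) - 1
  logC  <- (-1/theta)*log(A)                                      # both censored
  logCu <- (-theta-1)*log(S1) + (-1/theta-1)*log(A)               # only the first uncensored
  logCv <- (-theta-1)*log(S2) + (-1/theta-1)*log(A)               # only the second uncensored
  logc  <- log(theta+1) + (-theta-1)*(log(S1)+log(S2)) + (-1/theta-2)*log(A)  # both uncensored
  # the four censoring patterns, summed on the log scale
  ll <- ifelse(d1==1 & d2==1, logc + logf1 + logf2,
        ifelse(d1==1 & d2==0, logCu + logf1,
        ifelse(d1==0 & d2==1, logCv + logf2, logC)))
  -sum(ll)  # optim minimises, so return the negative log-likelihood
}
# 'wide' is assumed to have one row per cluster with columns t1, t2, d1, d2
fit <- optim(par=log(c(0.01,0.8,0.2)), fn=neg_loglik_clayton,
             t1=wide$t1, t2=wide$t2, d1=wide$d1, d2=wide$d2, hessian=TRUE)
exp(fit$par)  # lambda, rho, theta on their original scales
As a check, Kendall's tau for the Clayton copula is theta/(theta+2), which reproduces the tau values reported by Sunclarco above from its theta estimates.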

Fit generic function (special case Power Law) to a histogram in R [duplicate]

I am trying to plot a power law line to fit x and y data that I already have in a data frame. I have tried power.law.fit in the igraph library but it isn't working. The data frame is:
dat=data.frame(
x=1:8,
ygm=c( 251.288, 167.739, 112.856, 109.705, 102.064, 94.331, 95.206, 91.415)
)
I generally use one of two strategies here: either take the log and fit a linear model, or use nls. I think you could figure out the logged model if you wanted to, so I'll show the nls method here.
nls1=nls(ygm~i*x^-z,start=list(i=-3,z=-2),data=dat)
Double-check that this is the formula you want; this method accepts a fairly broad class of formulas. Spend some time experimenting with starting values. In particular, try to think of frontiers where the likelihood surface could do weird things, and try values on both sides of those weird places so you can be reasonably sure you are not stuck in a local optimum.
> nls1
Nonlinear regression model
model: ygm ~ i * x^-z
data: dat
i z
245.0356 0.5449
residual sum-of-squares: 811.4
...
> predict(nls1)
[1] 245.03564 167.95574 134.66070 115.12256 101.94200 92.30101 84.86458
[8] 78.90891
> plot(dat)
> lines(predict(nls1))
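For completeness, the logged-model route mentioned at the start is just a linear fit on the log-log scale (a sketch; it minimises squared error on the log scale, so the estimates will differ somewhat from the nls fit):
lm1 <- lm(log(ygm) ~ log(x), data=dat)
i_hat <- exp(coef(lm1)[1])   # multiplicative constant
z_hat <- -coef(lm1)[2]       # power-law exponent
lines(dat$x, i_hat * dat$x^(-z_hat), lty=2)   # add the log-log fit to the existing plot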

loess predict with new x values

I am attempting to understand how the predict.loess function is able to compute new predicted values (y_hat) at points x that do not exist in the original data. For example (this is a simple example and I realize loess is obviously not needed for an example of this sort but it illustrates the point):
x <- 1:10
y <- x^2
mdl <- loess(y ~ x)
predict(mdl, 1.5)
[1] 2.25
loess regression works by fitting polynomials locally around each x, and thus it creates a predicted y_hat at each observed x. However, because no coefficients are stored, the "model" in this case is simply the details of what was used to predict each y_hat, for example the span or degree. When I do predict(mdl, 1.5), how is predict able to produce a value at this new x? Is it interpolating between the two nearest existing x values and their associated y_hat values? If so, what are the details of how it does this?
I have read the cloess documentation online but am unable to find where it discusses this.
However, because there are no coefficients being stored, the "model" in this case is simply the details of what was used to predict each y_hat
Maybe you have used the print(mdl) command, or simply mdl, to see what the model mdl contains, but that is not the case: the model object is really quite complicated and stores a large number of parameters.
To get an idea of what is inside, you can use unlist(mdl) and see the long list of parameters in it.
This is the part of the command's manual that describes how it really works:
Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighbourhood of x, weighted by their distance from x (with differences in ‘parametric’ variables being ignored when computing the distance). The size of the neighbourhood is controlled by α (set by span or enp.target). For α < 1, the neighbourhood includes proportion α of the points, and these have tricubic weighting (proportional to (1 - (dist/maxdist)^3)^3). For α > 1, all points are used, with the ‘maximum distance’ assumed to be α^(1/p) times the actual maximum distance for p explanatory variables.
For the default family, fitting is by (weighted) least squares. For family="symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least-squares fit, this need not be a very resistant fit.
What I believe is that it fits a polynomial model in the neighbourhood of every point (not just a single polynomial for the whole set). The neighbourhood does not mean just one point before and one point after: if I were implementing such a function, I would put a large weight on the points nearest to x and lower weights on more distant points, and fit a polynomial that best fits the data under that weighting.
Then, if the new x' at which a value should be predicted is closest to the point x, I would take the polynomial fitted on the neighbourhood of x, say P, and use P(x') as the prediction.
Let me know if you are looking for anything more specific.
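To make that idea concrete, here is a minimal sketch (not loess's exact algorithm, and ignoring the interpolation step discussed below) of predicting at a new x0 by fitting a weighted quadratic to the span*n nearest points with tricube weights; the function and variable names are mine:
local_quad_predict <- function(x, y, x0, span=0.75, degree=2) {
  q <- ceiling(span * length(x))          # size of the neighbourhood
  d <- abs(x - x0)
  idx <- order(d)[seq_len(q)]             # the q points nearest to x0
  w <- (1 - (d[idx] / max(d[idx]))^3)^3   # tricube weights
  fit <- lm(y[idx] ~ poly(x[idx], degree, raw=TRUE), weights=w)
  sum(coef(fit) * x0^(0:degree))          # evaluate the local polynomial at x0
}
local_quad_predict(1:10, (1:10)^2, 1.5)   # 2.25, matching predict(mdl, 1.5) above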
To better understand what is happening in a loess fit try running the loess.demo function from the TeachingDemos package. This lets you interactively click on the plot (even between points) and it then shows the set of points and their weights used in the prediction and the predicted line/curve for that point.
Note also that the default for loess is to do a second smoothing/interpolating on the loess fit, so what you see in the fitted object is probably not the true loess fitting information, but the secondary smoothing.
Found the answer on page 42 of the manual:
In this algorithm a set of points typically small in number is selected for direct computation using the loess fitting method and a surface is evaluated using an interpolation method that is based on blending functions. The space of the factors is divided into rectangular cells using an algorithm based on k-d trees. The loess fit is evaluated at the cell vertices and then blending functions do the interpolation. The output data structure stores the k-d trees and the fits at the vertices. This information is used by predict() to carry out the interpolation.
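A quick way to see the effect of that interpolation step is to switch it off with loess.control (a sketch; with this toy example the two answers agree closely, but on rougher data they can differ slightly):
x <- 1:10; y <- x^2
m_interp <- loess(y ~ x)                                            # default: surface = "interpolate"
m_direct <- loess(y ~ x, control=loess.control(surface="direct"))   # exact evaluation, no k-d tree blending
c(interpolated=predict(m_interp, 1.5), direct=predict(m_direct, 1.5))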
I guess that to predict at x, predict.loess fits a regression with some points near x and calculates the y-value at x.
Visit https://stats.stackexchange.com/questions/223469/how-does-a-loess-model-do-its-prediction

Error with gls function in nlme package in R

I keep getting an error like this:
Error in `coef<-.corARMA`(`*tmp*`, value = c(18.3113452983211, -1.56626248550284, :
Coefficient matrix not invertible
or like this:
Error in gls(archlogfl ~ co2, correlation = corARMA(p = 3)) : false convergence (8)
with the gls function in nlme.
The former example was with the model gls(archlogflfornma~nma,correlation=corARMA(p=3)) where archlogflfornma is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
and nma is
[1] 138 139 142 148 150 134 137 135
You can see the model in the latter, and archlogfl is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
[9] 2.105556 2.176747
and co2 is
[1] 597.5778 917.9308 1101.0430 679.7803 886.5347 597.0668 873.4995
[8] 816.3483 1427.0190 423.8917
I have R 2.13.1.
Roland
@GavinSimpson's comment above, that trying to estimate a model with 5 parameters from 10 observations is very optimistic, is correct. The general rule of thumb is that you should have at least 10 times as many data points as parameters, and that's for standard fixed-effect/regression parameters. (Variance-structure parameters such as AR parameters are generally a bit harder to estimate, and need a bit more data, than regression parameters.)
That said, in a perfect world one could hope to estimate parameters even from overfitted models. Let's explore what happens anyway:
archlogfl <- c(2.611840,2.618454,2.503317,
2.305531,2.180464,2.185764,2.221760,2.211320,
2.105556,2.176747)
co2 <- c(597.5778,917.9308,1101.0430,679.7803,
886.5347,597.0668,873.4995,
816.3483,1427.0190,423.8917)
Take a look at the data,
plot(archlogfl~co2,type="b")
library(nlme)
g0 <- gls(archlogfl~co2)
plot(ACF(g0),alpha=0.05)
This is an autocorrelation function of the residuals, with 95% confidence intervals (note that these are curvewise confidence intervals, so we would expect about 1/20 points to fall outside these boundaries in any case).
So there is indeed some (graphical) evidence of autocorrelation here. We'll fit an AR(1) model, with verbose output. (To understand the scale on which these parameters are estimated, you'll probably have to dig around in Pinheiro and Bates 2000: what's presented in the verbose printout are the unconstrained values of the parameters, while what's printed in the summaries are the constrained values.)
g1 <- gls(archlogfl ~co2,correlation=corARMA(p=1),
control=glsControl(msVerbose=TRUE))
Let's see what's left after we fit AR1:
plot(ACF(g1,resType="normalized"),alpha=0.05)
Now fit AR(2):
g2 <- gls(archlogfl ~co2,correlation=corARMA(p=2),
control=glsControl(msVerbose=TRUE))
plot(ACF(g2,resType="normalized"),alpha=0.05)
As you correctly state, trying to go to AR(3) fails.
gls(archlogfl ~co2,correlation=corARMA(p=3))
You can play with tolerances, starting conditions, etc., but I don't think it's going to help much.
gls(archlogfl ~co2,correlation=corARMA(p=3,value=c(0.9,-0.5,0)),
control=glsControl(tolerance=1e-4,msVerbose=TRUE),verbose=TRUE)
If I were absolutely desperate to get these values I would code my own generalized least-squares function, constructing the AR(3) correlation matrix from scratch, and try to run it with some slightly more robust optimizer, but I would really have to have a good reason to work that hard ...
Another alternative would be to use arima to fit to the residuals from a gls or lm fit without autocorrelation: arima(residuals(g0),c(3,0,0)). (You can see that if you do this with arima(residuals(g0),c(2,0,0)) the answers are close to (but not quite equal to) the results from gls with corARMA(p=2).)
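As a sketch of that cross-check, using the g0 and g2 objects fitted above (the coef() call on the correlation structure is, as far as I recall, how nlme exposes the constrained AR estimates):
arima(residuals(g0), order=c(2,0,0))                   # AR(2) fitted to the residuals of the uncorrelated model
coef(g2$modelStruct$corStruct, unconstrained=FALSE)    # AR(2) parameters as estimated inside gls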

Resources