Error with gls function in nlme package in R

I keep getting an error like this:
Error in `coef<-.corARMA`(`*tmp*`, value = c(18.3113452983211, -1.56626248550284, :
Coefficient matrix not invertible
or like this:
Error in gls(archlogfl ~ co2, correlation = corARMA(p = 3)) : false convergence (8)
with the gls function in nlme.
The former example was with the model gls(archlogflfornma~nma,correlation=corARMA(p=3)) where archlogflfornma is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
and nma is
[1] 138 139 142 148 150 134 137 135
You can see the model in the latter, and archlogfl is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
[9] 2.105556 2.176747
and co2 is
[1] 597.5778 917.9308 1101.0430 679.7803 886.5347 597.0668 873.4995
[8] 816.3483 1427.0190 423.8917
I have R 2.13.1.
Roland

@GavinSimpson's comment above, that trying to estimate a model with 5 parameters from 10 observations is very optimistic, is correct. The general rule of thumb is that you should have at least 10 times as many data points as parameters, and that's for standard fixed-effect/regression parameters. (Variance-structure parameters such as AR parameters are generally a bit harder to estimate, and require a bit more data, than regression parameters.)
That said, in a perfect world one could hope to estimate the parameters even of an overfitted model. Let's just explore what happens, though:
archlogfl <- c(2.611840,2.618454,2.503317,
2.305531,2.180464,2.185764,2.221760,2.211320,
2.105556,2.176747)
co2 <- c(597.5778,917.9308,1101.0430,679.7803,
886.5347,597.0668,873.4995,
816.3483,1427.0190,423.8917)
Take a look at the data,
plot(archlogfl~co2,type="b")
library(nlme)
g0 <- gls(archlogfl~co2)
plot(ACF(g0),alpha=0.05)
This is an autocorrelation function of the residuals, with 95% confidence intervals (note that these are pointwise confidence intervals, so we would expect about 1 in 20 points to fall outside these boundaries in any case).
So there is indeed some (graphical) evidence for autocorrelation here. We'll fit an AR(1) model, with verbose output. (To understand the scale on which these parameters are estimated, you'll probably have to dig around in Pinheiro and Bates 2000: what's presented in the verbose printout are the unconstrained values of the parameters, while what's printed in the summaries are the constrained values.)
g1 <- gls(archlogfl ~co2,correlation=corARMA(p=1),
control=glsControl(msVerbose=TRUE))
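If you want to see both scales for the AR(1) parameter side by side, you can pull the correlation structure out of the fitted object; coef() on the corARMA component returns the unconstrained value by default and the constrained coefficient with unconstrained=FALSE:
cs1 <- g1$modelStruct$corStruct
coef(cs1)                         # unconstrained value (the scale used in the verbose trace)
coef(cs1, unconstrained = FALSE)  # constrained AR(1) coefficient (the scale shown by summary)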
Let's see what's left after we fit AR1:
plot(ACF(g1,resType="normalized"),alpha=0.05)
Now fit AR(2):
g2 <- gls(archlogfl ~co2,correlation=corARMA(p=2),
control=glsControl(msVerbose=TRUE))
plot(ACF(g2,resType="normalized"),alpha=0.05)
As you correctly state, trying to go to AR(3) fails.
gls(archlogfl ~co2,correlation=corARMA(p=3))
You can play with tolerances, starting conditions, etc., but I don't think it's going to help much.
gls(archlogfl ~co2,correlation=corARMA(p=3,value=c(0.9,-0.5,0)),
control=glsControl(tolerance=1e-4,msVerbose=TRUE),verbose=TRUE)
If I were absolutely desperate to get these values I would code my own generalized least-squares function, constructing the AR(3) correlation matrix from scratch, and try to run it with some slightly more robust optimizer, but I would really have to have a good reason to work that hard ...
Another alternative would be to use arima to fit an AR model to the residuals from a gls or lm fit without autocorrelation: arima(residuals(g0),c(3,0,0)). (You can see that if you do this with arima(residuals(g0),c(2,0,0)), the answers are close to, but not quite equal to, the results from gls with corARMA(p=2).)
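For concreteness, a sketch of that comparison using the g0 and g2 fits from above (arima's second argument is the (p,d,q) order; the AR coefficients of the gls fit can be read off its correlation structure):
a3 <- arima(residuals(g0), order = c(3, 0, 0))         # AR(3) fitted to the raw residuals
a2 <- arima(residuals(g0), order = c(2, 0, 0))         # AR(2), to compare with g2
coef(a2)                                               # ar1, ar2 and an intercept (mean) term
coef(g2$modelStruct$corStruct, unconstrained = FALSE)  # AR coefficients from the gls fit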

Related

Writing syntax for bivariate survival censored data to fit copula models in R

library(Sunclarco)
library(MASS)
library(survival)
library(SPREDA)
library(SurvCorr)
library(doBy)
#Dataset
data("diabetes") #load the diabetes data frame into the workspace
data1=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME1","STATUS1"))
data2=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME2","STATUS2"))
#Adding variable which identify cluster
data1$CLUSTER<- rep(1,197)
data2$CLUSTER<- rep(2,197)
#Renaming the variables so that we have uniformity in the common items in the data
names(data1)[5] <- "TIME"
names(data1)[6] <- "STATUS"
names(data2)[5] <- "TIME"
names(data2)[6] <- "STATUS"
#merge the files
Total_data=rbind(data1,data2)
# Re arranging the database
diabete_full=orderBy(~LASER+TRT_EYE+AGE_DX,data=Total_data)
diabete_full
#using Sunclarco package for Clayton and Gumbel
Clayton_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=1,copula="Clayton",marginal="Weibull")
summary(Clayton_1step)
# Estimates StandardErrors
#lambda 0.01072631 0.005818201
#rho 0.79887565 0.058942208
#theta 0.10224445 0.090585891
#beta_LASER 0.16780224 0.157652947
#beta_TRT_EYE 0.24580489 0.162333369
#beta_ADULT 0.09324001 0.158931463
# Estimate StandardError
#Kendall's Tau 0.04863585 0.04099436
Clayton_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="Clayton",marginal="Weibull")
summary(Clayton_2step)
# Estimates StandardErrors
#lambda 0.01131751 0.003140733
#rho 0.79947406 0.012428824
#beta_LASER 0.14244235 0.041845100
#beta_TRT_EYE 0.27246433 0.298184235
#beta_ADULT 0.06151645 0.253617142
#theta 0.18393973 0.151048024
# Estimate StandardError
#Kendall's Tau 0.08422381 0.06333791
Gumbel_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=1,copula="GH",marginal="Weibull")
# Estimates StandardErrors
#lambda 0.01794495 0.01594843
#rho 0.70636113 0.10313853
#theta 0.87030690 0.11085344
#beta_LASER 0.15191936 0.14187943
#beta_TRT_EYE 0.21469814 0.14736381
#beta_ADULT 0.08284557 0.14214373
# Estimate StandardError
#Kendall's Tau 0.1296931 0.1108534
Gumbel_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="GH",marginal="Weibull")
I am required to fit copula models in R for different copula classes, particularly the Gaussian, FGM, Plackett, and possibly Frank (if I still have time). The data I am using is the diabetes data available in R through the survival and SurvCorr packages.
This is for my thesis, an exploratory study of how the choice of copula class matters, in the sense that different classes lead to different results on the same data. I found the Sunclarco package, with which I was able to fit the Clayton and Gumbel copula classes, but it is not yet available for the other classes.
The challenge I am facing is that my data are censored, and the censoring has to be incorporated in the likelihood estimation; this makes it harder for me to write the syntax, since I don't have a strong programming background. In addition, I have to incorporate the covariates and see whether or not they have an impact on the association. Talking to my promoter, he gave me the following insights on how to approach writing the syntax for this puzzle:
• First of all, forget about the likelihood function; we only work with the log-likelihood function. That way you do not need to take the product of the contributions over the observations, but can take the sum of the log-contributions over the different observations.
• Next, since we have a balanced design, we can use the regular data frame structure with one row per cluster. The different variables, such as the lifetimes, the censoring indicators and all the covariates, are the columns of this data frame.
• Due to the bivariate setting, there are only four possible ways to contribute to the log-likelihood: both uncensored, both censored, first uncensored and second censored, or first censored and second uncensored. To create the log-likelihood function, add a new variable to your data frame containing the correct log-contribution for each couple, depending on which member of the couple is censored. The sum of this variable is the value of the log-likelihood function.
• Since this function depends on parameters, you can use any optimizer, such as optim or nlm, to get the optimal values. Be careful here: optim and nlm look for the minimum of a function, not the maximum. This is easily solved, since the minimum of -f is attained at the same point as the maximum of f.
• Since you have the expressions for the derivatives of each copula function, it should now be possible to write down the likelihood functions.
I am still struggling, because the likelihood changes for each copula class: the generator function is unique to each copula and needs to be adapted during estimation. Lastly, I have to run the analysis for both one-stage and two-stage copula estimation so that I can compare the results.
If someone could help me figure this out, I would be eternally grateful. Even help with just one copula class, e.g. the Gaussian, would be enough; I can work out the rest from that example. I have tried everything and still have nothing to show for it, and I feel time is running out to get answers by myself.
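For what it's worth, here is a minimal sketch of the approach outlined in the bullet points above; it is not the Sunclarco implementation. It assumes a Clayton survival copula with exponential margins (the Clayton partial derivatives have simple closed forms; for a Gaussian or Plackett copula you would substitute the corresponding expressions) and a hypothetical wide-format data frame dd with one row per couple and columns t1, t2 (the times) and d1, d2 (the event indicators, 1 = event, 0 = censored):
## Sketch only: Clayton survival copula S(t1,t2) = C(S1(t1), S2(t2)) with
## exponential margins; dd, t1, t2, d1, d2 are hypothetical names.
clayton_loglik <- function(par, dd) {
  lambda1 <- exp(par[1])   # marginal hazard rate, member 1 (log scale keeps it positive)
  lambda2 <- exp(par[2])   # marginal hazard rate, member 2
  theta   <- exp(par[3])   # Clayton dependence parameter (> 0)
  ## marginal survival functions and densities
  S1 <- exp(-lambda1 * dd$t1); f1 <- lambda1 * S1
  S2 <- exp(-lambda2 * dd$t2); f2 <- lambda2 * S2
  A  <- S1^(-theta) + S2^(-theta) - 1                          # Clayton kernel (>= 1)
  logC   <- (-1/theta) * log(A)                                # log C(u,v)
  logC_u <- (-theta - 1) * log(S1) + (-1/theta - 1) * log(A)   # log dC/du
  logC_v <- (-theta - 1) * log(S2) + (-1/theta - 1) * log(A)   # log dC/dv
  logc   <- log(1 + theta) + (-theta - 1) * (log(S1) + log(S2)) +
            (-1/theta - 2) * log(A)                            # log copula density
  ## the four censoring patterns give four different log-contributions
  ll <- ifelse(dd$d1 == 1 & dd$d2 == 1, logc   + log(f1) + log(f2),
        ifelse(dd$d1 == 1 & dd$d2 == 0, logC_u + log(f1),
        ifelse(dd$d1 == 0 & dd$d2 == 1, logC_v + log(f2),
                                        logC)))
  sum(ll)
}
## optim minimizes, so hand it the negative log-likelihood
fit <- optim(c(0, 0, 0), function(p) -clayton_loglik(p, dd), hessian = TRUE)
exp(fit$par)   # back-transformed estimates: lambda1, lambda2, theta
Covariates can be brought in by replacing the constant rates with something like lambda1 <- exp(X %*% beta), and a two-stage version fixes the marginal parameters at estimates from separate marginal fits before optimizing over theta alone.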

Trouble with 'fitdistrplus' package, t-distribution

I am trying to fit t-distributions to my data but am unable to do so. My first try was
fitdistr(myData, "t")
There are 41 warnings, all saying that NaNs are produced; I don't know why, but logarithms seem to be involved. So I adjusted my data somewhat so that all values are > 0, but I still have the same problem (9 fewer warnings, though...). The same problem occurs with sstdFit(): it produces NaNs.
So instead I try with fitdist which I've seen on stackoverflow and CrossValidated:
fitdist(myData, "t")
I then get
Error in mledist(data, distname, start, fix.arg, ...) :
'start' must be defined as a named list for this distribution
What does this mean? I tried looking into the documentation, but that told me nothing. I just want to fit a t-distribution; this is so frustrating :P
Thanks!
Start is the initial guess for the parameters of your distribution. There are logs involved because it is using maximum likelihood and hence log-likelihoods.
library(fitdistrplus)
dat <- rt(100, df=10)
fit <- fitdist(dat, "t", start=list(df=2))
I think it's worth adding that in most cases, using the fitdistrplus package to fit a t-distribution to real data will lead to a very bad fit, which is actually quite misleading. This is because the default t-distribution functions in R are used, and they don't support shifting or scaling. That is, if your data has a mean other than 0, or is scaled in some way, then the fitdist function will simply lead to a bad fit.
In real life, if data fits a t-distribution, it is usually shifted (i.e. has a mean other than 0) and / or scaled. Let's generate some data like that:
data = 1.5*rt(10000,df=5) + 0.5
Given this data has been sampled from the t-distribution with 5 degrees of freedom, you'd think that trying to fit a t-distribution to this should work quite nicely. But actually, here is the result. It estimates a df of 2, and provides a bad fit as shown in the qq plot.
> fit_bad <- fitdist(data,"t",start=list(df=3))
> fit_bad
Fitting of the distribution ' t ' by maximum likelihood
Parameters:
estimate Std. Error
df 2.050967 0.04301357
> qqcomp(list(fit_bad)) # generates plot to show fit
When you fit to a t-distribution you want to not only estimate the degrees of freedom, but also a mean and scaling parameter.
The metRology package provides a version of the t-distribution called t.scaled that has a mean and sd parameter in addition to the df parameter. Now let's fit it again:
> library("metRology")
> fit_good <- fitdist(data,"t.scaled",
start=list(df=3,mean=mean(data),sd=sd(data)))
> fit_good
Fitting of the distribution ' t.scaled ' by maximum likelihood
Parameters:
estimate Std. Error
df 4.9732159 0.24849246
mean 0.4945922 0.01716461
sd 1.4860637 0.01828821
> qqcomp(list(fit_good)) # generates plot to show fit
Much better :-) The parameters are very close to how we generated the data in the first place! And the QQ plot shows a much nicer fit.

Confusion about Markov random fields in the mgcv package in R

In order to implement a spatial analysis, I tried the simple Markov random field smoother example from the mgcv package in R; the manual page is here:
https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.construct.mrf.smooth.spec.html
This is the example I tried:
library(mgcv)
data(columb) ## data frame
data(columb.polys) ## district shapes list
xt <- list(polys=columb.polys) ## neighbourhood structure info for MRF
b <- gam(crime ~ s(district,bs="mrf",xt=xt),data=columb,method="REML")
However, when I look at the estimated coefficients in b$coefficients, there are 48 estimates for the Markov random field smoother:
> b$coefficients
(Intercept) s(district).1 s(district).2 s(district).3 s(district).4
35.12882390 -10.96490165 20.99250496 16.04968951 10.49535483
s(district).5 s(district).6 s(district).7 s(district).8 s(district).9
16.56626217 14.55352540 17.90043996 -0.60239588 13.41215603
s(district).10 s(district).11 s(district).12 s(district).13 s(district).14
18.61920671 -11.13853418 -2.95677338 7.89719220 3.04717540
s(district).15 s(district).16 s(district).17 s(district).18 s(district).19
-11.18235328 12.57473374 19.83013619 10.56130003 12.36240748
s(district).20 s(district).21 s(district).22 s(district).23 s(district).24
15.65160761 20.40965885 24.79853590 0.05312873 -14.65881026
s(district).25 s(district).26 s(district).27 s(district).28 s(district).29
-13.01294201 7.16191556 -9.36311304 3.65410713 -16.37092777
s(district).30 s(district).31 s(district).32 s(district).33 s(district).34
11.23500771 13.92036006 -14.67653893 -12.39341674 11.02216471
s(district).35 s(district).36 s(district).37 s(district).38 s(district).39
-12.93210046 -15.48924425 3.42745125 -2.54916472 -1.90604972
s(district).40 s(district).41 s(district).42 s(district).43 s(district).44
-16.25160966 -7.46491914 -4.48126353 -7.61064264 -2.91807488
s(district).45 s(district).46 s(district).47 s(district).48
-12.12765102 6.68446503 2.55883220 -0.20920888
However, the district shapes list contains 49 areas (numbered 0 to 48). The same thing happened when I tried my own data: 28 areas produced only 27 estimates from the Markov random field smoother.
My understanding is that a Markov random field used as a spatial term can be regarded as a structured random effect; however, the Markov random field smoother in the mgcv package seems to automatically set the first area up as a reference level. I am wondering whether it is more like a fixed effect, but one that accounts for spatial autocorrelation.
If so, the follow-up problem is how to interpret such output. It feels odd to interpret each spatial estimate as the difference between that area and the reference area, and that interpretation is not very meaningful anyway.
I am wondering whether we can fit a Markov random field smoother like a random effect in R. I hope someone familiar with this package can provide some suggestions. Thanks!
The coefficients in a multivariate Gaussian smooth are not, and should not be, interpreted as coefficients applied individually to each covariate value that s() is a function of. They describe the correlation structure among those values, and the number of coefficients used to describe it is set by the k argument of the s() R function.
By default, s() sets k to its maximum, n-1. k cannot be bigger than n-1, where n is the number of distinct covariate values (here, districts) in s(), because an intercept is necessary to set the average level the smoothing function varies around, and the total number of fitted parameters must not be bigger than the size of the data.
For further details, check the paragraph on choose.k in the mgcv help file.
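As a quick illustration of how k controls the number of reported smooth coefficients, you can refit with a smaller basis dimension (k = 20 is an arbitrary choice here); if the constraint handling works as in the full-rank case above (48 coefficients for 49 districts), you should see the intercept plus k - 1 smooth coefficients:
b20 <- gam(crime ~ s(district, bs = "mrf", xt = xt, k = 20),
           data = columb, method = "REML")
length(coef(b20))   # expect 20: the intercept plus k - 1 = 19 s(district) coefficients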
If you are interested in something directly applicable to each of your districts, you should look at the values predicted by the smoothing function. According to the gamObject help, these are given by the fitted.values component.
Here I get:
> b$fitted.values
[1] 18.81758 22.12502 30.13315 33.14305 44.11208 30.17184 20.96227 39.77438
[9] 35.64875 32.88071 54.08242 49.13961 43.58527 49.65618 47.64344 50.99036
[17] 32.48752 46.50207 51.70913 21.95138 40.98711 36.13709 21.90757 45.66465
[25] 52.92006 43.65122 45.45233 48.74153 53.49958 57.88845 18.43111 20.07698
[33] 40.25183 23.72681 36.74403 16.71899 44.32493 47.01028 18.41338 20.69650
[41] 20.15782 17.60067 36.51737 30.54075 31.18387 16.83831 25.62500 28.60866
[49] 25.47928
The result of plot(b) lets you visualize the fit; it looks good, and the correspondence between observed and fitted values also seems reasonable: plot(columb$crime, b$fitted.values).

Fit generic function (special case Power Law) to a histogram in R [duplicate]

I am trying to plot a power law line to fit x and y data that I already have in a data frame. I have tried power.law.fit in the igraph library but it isn't working. The data frame is:
dat=data.frame(
x=1:8,
ygm=c( 251.288, 167.739, 112.856, 109.705, 102.064, 94.331, 95.206, 91.415)
)
I generally use one of two strategies here: take the log and fit a linear model, or use nls. I think you could figure out the logged model if you wanted to (there is a sketch of it at the end of this answer), so I'll show the nls method here.
nls1=nls(ygm~i*x^-z,start=list(i=-3,z=-2),data=dat)
Double-check that this is the formula you want; this method accepts a pretty broad class of formulas. Spend some time fooling with starting values. In particular, try to think of frontiers where the likelihood surface could do weird things, and try values on both sides of the weird places so you can be sure you are not in a local optimum.
> nls1
Nonlinear regression model
model: ygm ~ i * x^-z
data: dat
i z
245.0356 0.5449
residual sum-of-squares: 811.4
...
> predict(nls1)
[1] 245.03564 167.95574 134.66070 115.12256 101.94200 92.30101 84.86458
[8] 78.90891
> plot(dat)
> lines(predict(nls1))
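And, for completeness, a quick sketch of the logged-model route mentioned at the top; note that it minimizes error on the log scale, so the estimates will differ a little from the nls fit:
lm1 <- lm(log(ygm) ~ log(x), data = dat)       # log(y) = log(i) - z*log(x)
i_hat <- exp(coef(lm1)[1])                     # multiplicative constant i
z_hat <- -coef(lm1)[2]                         # exponent z in y = i * x^(-z)
plot(dat)
lines(dat$x, i_hat * dat$x^(-z_hat), lty = 2)  # overlay the back-transformed fit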

How to get bootstrapped p-values and bootstrapped t-values and how does the function boot() work?

I would like to get the bootstrapped t-value and the bootstrapped p-value of a lm.
I have the following code (basically copied from a paper) which works.
# First of all you need the following packages
install.packages("car")
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")
boot.function <- function(data, indices){
data <- data[indices,]
mod <- lm(prestige ~ income + education, data=data) # the linear model
# the first element of the following vector contains the t-value
# and the second element is the p-value
c(summary(mod)[["coefficients"]][2,3], summary(mod)[["coefficients"]][2,4])
}
Now, I compute the bootstrapping model, which gives me the following:
duncan.boot <- boot(Duncan, boot.function, 1999)
duncan.boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Duncan, statistic = boot.function, R = 1999)
Bootstrap Statistics :
original bias std. error
t1* 5.003310e+00 0.288746545 1.71684664
t2* 1.053184e-05 0.002701685 0.01642399
I have two questions:
My understanding is that the bootstrapped value is the original plus the bias, which would mean that both bootstrapped values (the bootstrapped t-value as well as the bootstrapped p-value) are greater than the original values. That in turn is not possible, because if the t-value rises (which means more significance) the p-value MUST be lower, right? Therefore I think that I have not yet really understood the output of the boot function (here: duncan.boot). How do I compute the bootstrapped values?
I do not understand how boot() works. If you look at duncan.boot <- boot(Duncan, boot.function, 1999), you see that I have not passed any arguments to the function boot.function. I suppose that R sets data <- Duncan, but since I have not passed anything for the argument indices, I do not understand how the line data <- data[indices,] inside boot.function works.
I hope the questions make sense!
The boot function expects a statistic function with two arguments: the first is a data frame, and the second is a vector of indices (possibly with duplicate entries, and possibly not using all of the original indices) used to select rows. For each of the R replicates, boot() samples row numbers with replacement from the original data frame, passes that vector as the indices argument of boot.function, and then collects the results of the R function applications.
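A small sketch of what that means in practice: each replicate is just your statistic function called on the original data frame with a resampled index vector, which you can mimic by hand with sample():
set.seed(1)
idx <- sample(nrow(Duncan), replace = TRUE)   # row numbers drawn with replacement
boot.function(Duncan, idx)                    # t-value and p-value for this one resample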
Regarding what is reported by the print method for boot objects, take a look at this (done after examining the returned object with str()):
> duncan.boot$t0
[1] 5.003310e+00 1.053184e-05
> apply(duncan.boot$t, 2, mean)
[1] 5.342895220 0.002607943
> apply(duncan.boot$t, 2, mean) - duncan.boot$t0
[1] 0.339585441 0.002597411
It becomes more obvious that the t0 value comes from the original data, while the bias is the difference between the mean of the boot()-ed values and the t0 values. I don't think it makes a lot of sense to ask why p-values based on parametric considerations increase along with an increase in the estimated t-statistics; you are really in two disparate regions of statistical thought when you do that. I would interpret the increase in p-values as an effect of the sampling process, which does not take the Normal-distribution assumptions into account. It is simply saying something about the sampling distribution of the p-value (which is really just another sample statistic).
(Comment: the sourcebook used at the time of R's development was Davison and Hinkley's "Bootstrap Methods and their Applications". I'm not claiming any support for my answer above, but I thought I would put it in as a reference after Hagen Brenner asked about sampling with two indices in the comments below. There are many unexpected aspects of bootstrapping that arise once one goes beyond simple parametric estimation, and I would turn to that reference first if I were tackling more complex sampling situations.)
