Piece-wise linear and non-linear regression in R

I have a question which is perhaps more a statistical query than one related to R directly; however, it may be that I am just invoking an R package incorrectly, so I will post the question here. I have the following dataset:
x<-c(1e-08, 1.1e-08, 1.2e-08, 1.3e-08, 1.4e-08, 1.6e-08, 1.7e-08,
1.9e-08, 2.1e-08, 2.3e-08, 2.6e-08, 2.8e-08, 3.1e-08, 3.5e-08,
4.2e-08, 4.7e-08, 5.2e-08, 5.8e-08, 6.4e-08, 7.1e-08, 7.9e-08,
8.8e-08, 9.8e-08, 1.1e-07, 1.23e-07, 1.38e-07, 1.55e-07, 1.76e-07,
1.98e-07, 2.26e-07, 2.58e-07, 2.95e-07, 3.25e-07, 3.75e-07, 4.25e-07,
4.75e-07, 5.4e-07, 6.15e-07, 6.75e-07, 7.5e-07, 9e-07, 1.15e-06,
1.45e-06, 1.8e-06, 2.25e-06, 2.75e-06, 3.25e-06, 3.75e-06, 4.5e-06,
5.75e-06, 7e-06, 8e-06, 9.25e-06, 1.125e-05, 1.375e-05, 1.625e-05,
1.875e-05, 2.25e-05, 2.75e-05, 3.1e-05)
y2<-c(-0.169718017273307, 7.28508517630734, 71.6802510299446, 164.637259265704,
322.02901173786, 522.719633360006, 631.977073772459, 792.321270345847,
971.810607095548, 1132.27551798986, 1321.01923840546, 1445.33152600664,
1568.14204073109, 1724.30089942149, 1866.79717333592, 1960.12465709003,
2028.46548012508, 2103.16027631327, 2184.10965255236, 2297.53360080873,
2406.98288043262, 2502.95194879366, 2565.31085776325, 2542.7485752473,
2499.42610084412, 2257.31567571328, 2150.92120390084, 1998.13356362596,
1990.25434682546, 2101.21333152526, 2211.08405955931, 1335.27559108724,
381.326449703455, 430.9020598199, 291.370887491989, 219.580548355043,
238.708972427248, 175.583544448326, 106.057481792519, 59.8876372379487,
26.965143266819, 10.2965349811467, 5.07812046132922, 3.19125838983254,
0.788251933518549, 1.67980552001939, 1.97695007279929, 0.770663673279958,
0.209216903989619, 0.0117903221723813, 0.000974437796492681,
0.000668823762763647, 0.000545308757270207, 0.000490042305650751,
0.000468780182460397, 0.000322977916070751, 0.000195423690538495,
0.000175847622407421, 0.000135771259866332, 9.15607623591363e-05)
which, when plotted, looks like this:
I have then attempted to use the segmented package to generate three linear regressions (solid black line) in three regions (10^-8 to 10^-7, 10^-7 to 10^-6, and >10^-6), since I have a theoretical basis for expecting different relationships in these regions. Clearly, however, my attempt using the following code was unsuccessful:
lin.mod <- lm(y2~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=c(0.0000001,0.000001))
Thus my first question: are there further parameters of the segmented fit I can tweak other than the breakpoints? As far as I understand, the number of iterations is already at its default maximum here.
My second question is: could I perhaps attempt a segmented fit using nls()? It looks as though the first two regions on the plot (10^-8 to 10^-7 and 10^-7 to 10^-6) are further from linear than the final section, so perhaps a polynomial function would be better there?
As an example of a result I would find acceptable, I have annotated the original plot by hand:
Edit: The reason for using linear fits is the simplicity they provide; to my untrained eye it would require a fairly complex nonlinear function to regress the dataset as a single unit. One thought that had crossed my mind was to fit a lognormal model to the data, as this may work given the skew along a log x-axis. I do not have enough competence in R to do this, however, as my knowledge only extends to fitdistr, which as far as I understand would not work here.
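For reference, the sort of thing I have in mind would be fitting a lognormal-shaped curve to the (x, y2) pairs with nls() rather than fitdistr() (since these are paired observations rather than a sample of x values); a rough sketch, with guessed start values and no guarantee of convergence, might be:
# fit y2 as a scaled lognormal density in x by nonlinear least squares
fit.ln <- nls(y2 ~ A * dlnorm(x, meanlog = mu, sdlog = sigma),
              start = list(A = 1e-3, mu = log(1.5e-07), sigma = 1))
plot(x, y2, log = "x")
lines(x, predict(fit.ln), col = "red")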
Any help or guidance in a relevant direction would be most appreciated.

If you are not satisfied with the segmented package, you can try the earth package, which implements the MARS algorithm (see the sketch at the end of this answer). Here, though, I find the result of the segmented model quite acceptable; see the R-squared below.
lin.mod <- lm(y2~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=c(0.0000001,0.000001))
summary(segmented.mod)
Meaningful coefficients of the linear terms:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.163e+02  1.143e+02  -1.893   0.0637 .
x            4.743e+10  3.799e+09  12.485   <2e-16 ***
U1.x        -5.360e+10  3.824e+09 -14.017       NA
U2.x         6.175e+09  4.414e+08  13.990       NA
Residual standard error: 232.9 on 54 degrees of freedom
Multiple R-Squared: 0.9468, Adjusted R-squared: 0.9419
Convergence attained in 5 iterations with relative change 3.593324e-14
You can check the result by plotting the model:
plot(segmented.mod)
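To overlay the fitted segments on the raw data, something along these lines should also work (a sketch relying on the add argument of plot.segmented):
plot(x, y2, log = "x")
plot(segmented.mod, add = TRUE, lwd = 2)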
To get the coefficients of the fitted segments, you can do this:
 intercept(segmented.mod)
$x
              Est.
intercept1 -216.30
intercept2 3061.00
intercept3   46.93
> slope(segmented.mod)
$x
             Est.   St.Err.  t value  CI(95%).l  CI(95%).u
slope1  4.743e+10 3.799e+09  12.4800  3.981e+10  5.504e+10
slope2 -6.177e+09 4.414e+08 -14.0000 -7.062e+09 -5.293e+09
slope3 -2.534e+06 5.396e+06  -0.4695 -1.335e+07  8.285e+06
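If you do want to try the MARS route mentioned at the top of this answer, a minimal sketch with the earth package would be roughly as follows (nk is just an illustrative cap on the number of basis terms, not a tuned value):
library(earth)
mars.mod <- earth(y2 ~ x, nk = 10)   # nk limits the model size; adjust as needed
summary(mars.mod)
plot(x, y2, log = "x")
lines(x, predict(mars.mod)[, 1], col = "blue")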

Related

CFA in R with lavaan: different results for fit measures with command fitmeasures() than in summary

I'm calculating a confirmatory factor analysis with the following model:
library(lavaan)
CFA <- "
A =~ BK01_01_z + BK03_01_z + BK03_03_z + BK03_04_z + BK03_05_z + BK03_07_z + BK03_08_z + BK05_01_z + BK05_02_z + BK05_03_z + BK05_04_z
B =~ GK04_01_z + GK04_02_z + GK04_03_z + GK04_04_z + GK04_05_z
C =~ GS09_01_z + GS09_02_z
Z =~ A + B + C
"
fit <- cfa(CFA, data = df_clean, estimator = "WLSMV",
           ordered = c("GS09_01_z", "GS09_02_z"))
As you can see, there are two ordinal (binary) variables that are supposed to load onto one factor. It may also be important that the data is non-normal.
When I look at the results now, I get different values from different commands.
With:
summary(fit, fit.measures=TRUE)
I'm getting RMSEA = 0.069; CFI = 0.663; TLI = 0.609
with:
fitmeasures(fit, c("cfi","rmsea","srmr","tli"))
these are the results:
cfi = 0.964; rmsea = 0.041; srmr = 0.060; tli = 0.958
I've searched for my problem, but I couldn't find out why. Maybe someone has encountered a similar issue?
You might be getting Standard results vs Robust results.
For instance, when using alternative estimators such as MLM instead of ML, you get both a Standard and a Robust result.
I have tried it myself, and with fitmeasures() I get the standard results, not the robust ones. Check whether that is the case for you as well.
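If that is what is happening, you should be able to ask fitmeasures() for the scaled/robust versions explicitly and compare them with the summary() output; something like the following (the measure names are assumptions on my part, so check fitmeasures(fit) for exactly which ones are available for your estimator):
fitmeasures(fit, c("cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))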

How to fit an inverse Gaussian distribution to my data, preferably using fitdist {fitdistrplus}

I am trying to analyze some reaction time data using a GLMM. To find a distribution that fits my data best, I used fitdist() for the gamma and lognormal distributions; the results showed that the lognormal fits my data better.
However, I recently read that the inverse Gaussian distribution might be a better fit for reaction time data.
I used nigFitStart to obtain the start values:
library(GeneralizedHyperbolic)
invstrt <- nigFitStart(RTtotal, startValues = "FN")
which gave me this:
$paramStart
           mu         delta        alpha          beta
775.953984862 314.662306398  0.007477984  -0.004930604
So I tried using these start parameters for fitdist:
require(fitdistrplus)
fitinvgauss <- fitdist(RTtotal, "invgauss", start = list(mu=776, delta=314, alpha=0.007, beta=-0.05))
but I get the following error:
Error in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, :
'start' must specify names which are arguments to 'distr'.
I also used ig_fit {goft} and got the following results:
Inverse Gaussian MLE
mu 775.954
lambda 5279.089
So, this time I used these two parameters for the start argument in fitdist and still got the exact same error:
> fitinvgauss <- fitdist(RTtotal, "invgauss", start = list(mu=776, lambda=5279))
Error in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, :
'start' must specify names which are arguments to 'distr'.
Someone had mentioned that changing the parameter names from mu and lambda to mean and shape had solved their problem, but I tried it and still got the same error.
Any idea how I can fix this? Or could you suggest an alternative way to fit an inverse Gaussian to my data?
Thank you.
dput(RTtotal)
c(594.96, 659.5, 706.14, 620.92, 811.05, 420.63, 457.08, 585.53,
488.59, 484.87, 496.72, 769.01, 458.92, 521.76, 889.08, 514.11,
553.09, 564.68, 1057.19, 437.79, 660.33, 639.58, 643.45, 419.47,
469.16, 457.78, 530.58, 538.73, 557.17, 1140.09, 560.03, 543.18,
1093.29, 607.59, 430.2, 712.06, 716.6, 566.69, 989.71, 449.96,
653.22, 556.52, 654.8, 472.54, 600.26, 548.36, 597.51, 471.97,
596.72, 600.29, 706.77, 511.6, 475.89, 599.13, 570.12, 767.57,
402.68, 601.56, 610.02, 891.95, 483.22, 588.78, 505.95, 554.15,
445.54, 489.02, 678.13, 532.06, 652.61, 654.79, 535.08, 1215.66,
633.6, 645.92, 454.37, 535.81, 508.97, 690.78, 685.97, 703.04,
731.99, 592.75, 662.03, 1400.33, 599.73, 1021.34, 1232.35, 855.1,
780.32, 554.4, 1965.77, 841.89, 1262.76, 721.62, 788.95, 1104.24,
1237.4, 1193.04, 513.91, 474.74, 380.56, 570.63, 700.96, 380.89,
481.96, 723.63, 835.22, 781.1, 468.76, 555.1, 522.22, 944.29,
541.06, 559.18, 738.68, 880.58, 500.14, 1856.97, 1001.59, 703.7,
1022.35, 1813.35, 1128.73, 864.75, 1166.77, 1220.4, 776.56, 2073.72,
1223.88, 617, 1387.71, 595.57, 1506.13, 678.41, 1797.87, 2111.04,
1116.61, 1038.6, 894.25, 778.51, 908.51, 1346.69, 989.09, 1334.17,
877.31, 649.31, 978.22, 1276.84, 1001.58, 1049.66, 1131.83, 700.8,
1267.21, 693.52, 1182.3)
So I'm guessing that you failed to tell us that you also have the statmod package loaded (or perhaps some other package with an 'invgauss' family including a dinvgauss function). You should be able to tell which package dinvgauss comes from by reading the top line of its help page. So, after installing that package and reading the help page (which one should ALWAYS do) for ?dinvgauss:
fitinvgauss <- fitdist(RTtotal, "invgauss",
start = list(mean=776, dispersion=314, shape=1))
fitinvgauss
# --------------
Fitting of the distribution ' invgauss ' by maximum likelihood
Parameters:
              estimate Std. Error
mean          779.2535         NA
dispersion  -1007.5490         NA
shape        4972.5745         NA
All I did was read the error message and then read the help page and use the correct names for that function's parameters. (And then play around a bit to get the parameter starting values into the feasible range of values.)
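One quick way to check which argument names the dinvgauss you are picking up actually has (assuming it is the one from statmod; the signature shown in the comment is roughly what current versions provide, so verify it on your installation):
library(statmod)
args(dinvgauss)
# function (x, mean = 1, shape = NULL, dispersion = 1/shape, log = FALSE)
# so a start list matching the goft output could also be tried (untested here):
fitinvgauss2 <- fitdist(RTtotal, "invgauss", start = list(mean = 776, shape = 5279))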

Ordinal regression - proportional odds assumption not met for variable in interaction

I am trying to analyze a dataset with an ordinal response (0-4) and three categorical factors. I'm interested in the interactions of all three factors as well as the main effects. I used the clm function of the package "ordinal" and checked the assumptions with the nominal_test function. It revealed a significant result for one of the predictors, and now I don't know how to proceed... I tried to put the problematic factor and all its interactions in the "nominal" argument (see code) and R gives me warnings. Nevertheless, I ran several likelihood ratio tests, always comparing a model including an interaction with one missing it (anova(without, with, test="Chisq")), and get some nice significant results. Still, I feel like I have no clue what I'm doing here and I don't trust the results. So my question is: is what I did OK? What else can I do? Or is the data just 'unanalyzable'?
Here is the code for the test:
# this is the model
res=clm(cue~ intention:outcome:age+
intention:outcome+
intention:age+
outcome:age+
intention+outcome+age+
Gender,
data=xdata)
#proportional odds assumption
nominal_test(res)
# Df logLik AIC LRT Pr(>Chi)
#<none> -221.50 467.00
#intention 3 -215.05 460.11 12.891 0.004879 **
#outcome 3 -219.44 468.87 4.124 0.248384
#age
#Gender 3 -219.50 469.00 3.994 0.262156
#intention:outcome
#intention:age
#outcome:age 6 -217.14 470.28 8.716 0.190199
#intention:outcome:age 12 -188.09 424.19 66.808 1.261e-09 ***
And here is an example of how I tried to solve it, checking the 3-way interaction of all three predictors. I did the same for the 2-way interactions as well...
res=clm(cue~ outcome:age+
outcome+age+
Gender,
nominal= ~ intention:age:outcome+
intention:age+
intention:outcome+
intention,
data=xdata)
res.red=clm(cue~ outcome:age+
outcome+age+
Gender,
nominal= ~
intention:age+
intention:outcome+
intention,
data=xdata)
anova(res,res.red, test="Chisq")
# no.par AIC logLik LR.stat df Pr(>Chisq)
#res.red 26 412.50 -180.25
#res 33 424.11 -179.05 2.3945 7 0.9348
And here is the warning that R gives me when I try to fit the model:
Warning message:
(-3) not all thresholds are increasing: fit is invalid
In addition: Absolute convergence criterion was met, but relative criterion was not met
I'm especially concerned about the phrase "fit is invalid"... I don't know what to do with this and would be happy about any idea or hint!
Thank you!
Have you tried using a more general model like the partial proportional odds model? Your data only has to be nominal, not ordinal, to use this model. If you find huge differences between the log-likelihoods, your assumption of ordinality is not met.
You can use vglm() from the VGAM package. Here are a few examples.
As I don't know how your data looks like, I can't say whether it's unanalyzable, but the code would be something like this:
library(VGAM)
res <- vglm(cue ~ intention:outcome:age+
intention:outcome+
intention:age+
outcome:age+
intention+outcome+age+
Gender,
family = cumulative(parallel = FALSE ~ intention),
data = xdata)
summary(res)
I think you could use pchisq(), as proposed in the example I posted above, to compare both models like you did before with anova():
pchisq(deviance(res) - deviance(res.red),
df = df.residual(res) - df.residual(res.red), lower.tail = FALSE)
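Alternatively, if I recall correctly, VGAM also provides an lrtest() method for vglm fits, which should give essentially the same likelihood-ratio comparison:
lrtest(res, res.red)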

How to feed data into ode while doing optimisation

I'm new to R. I found some very useful code, which I've tried to adapt for my purposes. However, I get an error:
Error in func(time, state, parms, ...) : object 'k4' not found and Error in func(time, state, parms, ...) : object 'E' not found
I don't know where the problem is, as I can see all the parameters, and the data.frame is correct as well.
Thank you everyone for taking the time to look at this. I've tried to reduce the number of parameters to 3 (k10, k11, k12), using estimated values for the remaining ones (embedded in the code). However, I still get an error message: the E value from the data.frame is not passed into the rxnrate function, and as a result ode can't use it. I've tried to use events and forcing functions, but that doesn't seem to work. Thank you for spotting P4; it was a typo, it should be P, and I've corrected it already.
Editors note: This was crossposted to Rhelp and that message included the source of this code as a stackoverflow question "r-parameter and initial conditions fitting ODE models with nls.lm."
#set working directory
setwd("~/R/wkspace")
#load libraries
library(ggplot2)
library(reshape2)
library(deSolve)
library(minpack.lm)
time=c(22,23,24,46,47,48)
cE=c(15.92,24.01,25.29,15.92,24.01,25.29)
cP=c(0.3,0.14,0.29,0.3,0.14,0.29)
cL=c(6.13,3.91,38.4,6.13,3.91,38.4)
df<-data.frame(time,cE,cP,cL)
df
names(df)=c("time","cE","cP","cL")
#rate function
rxnrate=function(t,c,parms){
#rate constant passed through a list called
k1=parms$k1
k2=parms$k2
k3=parms$k3
k4=parms$k4
k5=parms$k5
k6=parms$k6
k7=parms$k7
k8=parms$k8
k9=parms$k9
k10=parms$k10
#c is the concentration of species
#derivatives dc/dt are computed below
r=rep(0,length(c))
r[1]=(k1+(k2*E^k10)/(k3^k10+E^k10))/(1+P/k6)-k4* ((1+k5*P)/(1+k7*E))*c["pLH"]; #dRP_LH/dt
r[2]=(1/k8)*k4*((1+k5*P)/(1+k7*E))*c["p"]-k9*c["L"] #dL/dt
return(list(r))
}
ssq=function(myparms){
#initial concentration
cinit=c(pLH=unname(myparms[11]),LH=unname(myparms[12]))
print(cinit)
#time points for which conc is reported
#include the points where data is available
t=c(seq(0,46,2),df$time)
t=sort(unique(t))
#parameters from the parameters estimation
k1=myparms[1]
k2=myparms[2]
k3=myparms[3]
k4=myparms[4]
k5=myparms[5]
k6=myparms[6]
k7=myparms[7]
k8=myparms[8]
k9=myparms[9]
k10=myparms[10]
#solve ODE for a given set of parameters
out=ode(y=cinit,times=t,func=rxnrate,parms=list(k1=k1,k2=k2,k3=k3,k4=k4,k5=k5,k6=k6,k7=k7,k8=k8,k9=k9,k10=k10,E=cE,P=cP))
#Filter data that contains time points
outdf=data.frame(out)
outdf=outdf[outdf$time%in% df$time,]
#Evaluate predicted vs experimental residual
preddf=melt(outdf,id.var="time",variable.name="species",value.name="conc")
expdf=melt(df,id.var="time",variable.name="species",value.name="conc")
ssqres=preddf$conc-expdf$conc
return(ssqres)
}
# parameter fitting using levenberg marquart
#initial guess for parameters
myparms=c(k1=500, k2=4500, k3=200,k4=2.42, k5=0.26,k6=12.2,k7=0.004,k8=55,k9=24,k10=8,pLH=14.5,LH=3.55)
#fitting
fitval=nls.lm(par=myparms,fn=ssq)
#summary of fit
summary(fitval)
#estimated parameter
parest=as.list(coef(fitval))
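For reference, here is a minimal sketch (not from the original post) of the forcing-function approach mentioned above: the measured E and P series are interpolated with approxfun() and evaluated inside the derivative function, so they no longer need to be passed through parms. The state names are assumed to match cinit (pLH and LH), and parms is assumed to contain only k1 ... k10.
# interpolate the measured inputs over time (rule = 2 holds the end values constant)
E.approx <- approxfun(df$time, df$cE, rule = 2)
P.approx <- approxfun(df$time, df$cP, rule = 2)
rxnrate2 <- function(t, c, parms) {
  E <- E.approx(t)   # current value of the forcing inputs
  P <- P.approx(t)
  with(as.list(parms), {
    r <- numeric(length(c))
    r[1] <- (k1 + (k2 * E^k10) / (k3^k10 + E^k10)) / (1 + P / k6) -
            k4 * ((1 + k5 * P) / (1 + k7 * E)) * c["pLH"]                  # dRP_LH/dt
    r[2] <- (1 / k8) * k4 * ((1 + k5 * P) / (1 + k7 * E)) * c["pLH"] -
            k9 * c["LH"]                                                   # dL/dt
    list(r)
  })
}
# inside ssq(), the ode call would then pass only the rate constants, e.g.
# out <- ode(y = cinit, times = t, func = rxnrate2,
#            parms = list(k1 = k1, k2 = k2, k3 = k3, k4 = k4, k5 = k5,
#                         k6 = k6, k7 = k7, k8 = k8, k9 = k9, k10 = k10))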

GWR fitting using the mgcv and R2BayesX packages in R

I want to compare the GWR fits produced by spgwr and mgcv, but I get an error with the gam function of mgcv. Here is an example:
require(spgwr)
require(mgcv)
require(R2BayesX)
data(columbus)
col.bw <- gwr.sel(crime ~ income + housing, data=columbus,verbose=F,
coords=cbind(columbus$x, columbus$y))
col.gauss <- gwr(crime ~ income + housing, data=columbus,
coords=cbind(columbus$x, columbus$y),
bandwidth=col.bw, hatmatrix=TRUE)
#gwr fitting with Intercept
col.gam<-gam(crime ~s(x,y)+s(x,y)*income+s(x,y)*housing, data=columbus)#mgcv ERROR
b1<-bayesx(crime ~sx(x,y)+sx(x,y)*income+sx(x,y)*housing, data=columbus)#R2Bayesx ERROR
Questions:
1. How do I fit the same GWR using the gam and bayesx functions (i.e. smooth functions of location)?
2. How do I control the parameters to be as similar as possible, including the optimal bandwidth?
The mgcv error comes from the way you are specifying the "interactions" between the spatial smooth and the variables income and housing. Read ?gam.models for details on using by terms. I think for this you need:
col.gam <- gam(crime ~s(x,y, k = 5) + s(x,y, by = income, k = 5) +
s(x,y, by = housing, k = 5), data=columbus)
In this example, as there are only 49 observations, you need to restrict the dimensions of the basis functions, which I do here with k = 5, but you should investigate whether you need to vary these a little, within the constraints of the data.
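If it helps judge whether k = 5 is sufficient, mgcv's gam.check() can be run on the fitted model to inspect the basis dimensions and residual diagnostics:
gam.check(col.gam)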
By the looks of the error from bayesx, you have the same issue of specifying the model incorrectly. I'm not familiar with bayesx(), but it looks like it uses the same s() function as supplied with mgcv, so the model specification should be the same as I show above.
As for 2., can you expand on what you mean here? Comparable between gam() and bayesx(), or getting both (or one) of these comparable with the spgwr() model?
