Non-Linear Regression Error (Singular Gradient Matrix) - R

I've seen a few of these errors before for very simple functions, but the function I'm trying to fit is basically a mixture of three components:
a Gaussian (which dominates at x = 0),
an exponential (which takes over past the Gaussian),
and a constant which rounds out the values.
From the other examples of this error that I have read, the issue usually seems to be caused by poor initial guesses, but I have no idea how to correct that, or whether it is even the actual issue given the size of my function.
Here is my code and one sample of the data I'm looking at:
Value<-c(163301.080,269704.110,334570.550,409536.530,433021.260,418962.060,349554.460,253987.570,124461.710,140750.480,52612.790,54286.427,26150.025,14631.210,15780.244,8053.618,4402.581,2251.137,2743.511,1707.508,1246.894)
Height<-c(400,300,200,0,-200,-400,-600,-800,-1000,-1000,-1200,-1220,-1300,-1400,-1400,-1500,-1600,-1700,-1700,-1800,-1900)
Framed<-data.frame(Value,Height)
i <- nls(Value ~ a*exp(-Height^2/(2*b^2)) + c*exp(-d*abs(Height)) + e,
         data = Framed, start = list(a = 410000, b = 5, c = 10000, d = 5, e = 1200))
plot(Value~Height)
summary(i)
Update: Thanks for your help. I now have the same problem again. I've used your technique below (I'm an R noob; I was previously using Manipulate plots in Mathematica) and I think I've got a relatively good fit for the data. Here is a graph of the data I'm attempting to fit (sorry, I can't upload it inline, not enough reputation):
http://imgur.com/GtzIzSr
However, I am getting the same issue. Is this to do with my fit, or with the massive amount of variability at low distances?

You're right that this is usually about bad starting values, and that's (part of) your case. Looking at your data and your guesses, it's clear that something is off. But before going into that, note that Framed was not created with its columns in the conventional order. It should be x then y, or:
Framed <- data.frame(Height, Value)
With that in mind, try the following:
Vals2 <- 410000*exp(-Height^2/(2*5^2)) + 10000*exp(-5*abs(Height)) + 1200
plot(Framed)
lines(Height, Vals2)
The resulting plot shows just how far off your guesses are: the simulated curve looks nothing like the data. Playing around with the function, you can easily see that b is the main culprit. Change it to 500 and redraw:
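For example (same simulated curve, only b changed from 5 to 500):
Vals3 <- 410000*exp(-Height^2/(2*500^2)) + 10000*exp(-5*abs(Height)) + 1200
plot(Framed)
lines(Height, Vals3)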
That's much better, but it still won't fit. And if you change the other parameters (c, d, and e), you'll notice they hardly affect the curve, if at all. That's probably because a is much bigger and you have Height^2 in the first term. If you simplify your function and run:
i <- nls(Value ~ a*exp(-Height^2/(2*b^2)), data = Framed, start = list(a = 410000, b = 500))
You'll find a fit. This is probably because non-linear functions get harder to fit as the number of parameters increases, especially if there is covariance between them; fewer parameters are much easier to fit. You'll have to decide, however, whether you can work with only a and b.
But if you plot that fit, it still doesn't look good. It's clear that your Value does not have its maximum at Height = 0, as it should according to your description and the simulated curve. There seems to be a shift in your data, because if you try Height <- Height + 200 along with the above changes, you'll get:
> summary(i)
Formula: Value ~ a * exp(-Height^2/(2 * b^2))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 449820.71 10236.43 43.94 <2e-16 ***
b 496.60 12.54 39.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17790 on 19 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 2.164e-06
Now it's up to you to check whether your data is indeed shifted and whether you can simplify the function.
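If you want to visualize the simplified fit against the shifted data, a minimal sketch based on the changes above (assuming you accept the +200 shift):
Framed$Height <- Framed$Height + 200   # the suspected shift discussed above
i <- nls(Value ~ a*exp(-Height^2/(2*b^2)), data = Framed, start = list(a = 410000, b = 500))
plot(Framed)
lines(Framed$Height, predict(i))       # overlay the fitted Gaussian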

Related

Wrong degree of freedom for between-subject factor in lmer

I'm testing how visual perspective (1 = completely first person, to 11 = completely third person) varies as a function of Culture (AA, EA), Valence (Positive, Negative), and Event Type (Memory, Imagination), while controlling for age (continuous), sex (M, F), and SES (continuous) and allowing for individual differences.
This is an unbalanced design: we give participants 10 prompts, but for each prompt a participant can choose to either recall or imagine a relevant event. Therefore, each participant may have as many memories (no more than 10) and as many imaginations (no more than 10) as they want. In total we have 363 participants.
The model I fit looks like
VP.full.lm <- lmer(Visual.Perspective ~ Culture * Event.Type * Valence +
Sex + Age + SES +
(1|Participant.Number),
data=VP_Long)
When I run the anova() function to see the effects of all variables, here is the output:
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Culture 30.73 30.73 1 859.1 4.9732 0.0260008 *
Valence 6.38 6.38 1 3360.3 1.0322 0.3097185
Event.Type 1088.61 1088.61 1 3385.9 176.1759 < 2.2e-16 ***
Sex 45.12 45.12 1 358.1 7.3014 0.0072181 **
Age 7.95 7.95 1 358.1 1.2869 0.2573719
SES 6.06 6.06 1 358.7 0.9807 0.3226824
Culture:Valence 6.39 6.39 1 3364.6 1.0348 0.3091004
Culture:Event.Type 71.53 71.53 1 3389.7 11.5766 0.0006756 ***
Valence:Event.Type 2.89 2.89 1 3385.4 0.4682 0.4938573
Culture:Valence:Event.Type 3.47 3.47 1 3390.6 0.5617 0.4536399
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you can see, the DF for the effect of Culture is off -- since Culture is a between-subject factor, its DF cannot be larger than our sample size. I've tried using ddf = "Kenward-Roger" and testing the effect of Culture with emmeans::test(contrast(emmeans(VP.full.lm, c("Culture")), "trt.vs.ctrl"), joint = T), yet none of these methods solved the degrees-of-freedom issue.
I also thought that maybe the participants who did not provide both memories and imaginations were confusing the lmer model, so I subsetted my data to include only participants who provided both types of events. However, the degrees-of-freedom problem persists. It's also worth mentioning that once I removed the interaction between Culture and Event.Type, the degrees of freedom became plausible.
I wonder if anyone knows what is going on here and how we can fix this issue? Or is there a way to explain away this weird behaviour?
Thanks so much in advance!
This question might be more appropriate for CrossValidated ...
Not a complete solution, but some ideas:
From a practical point of view, the difference between 363 (or even 350) denominator df and 859 ddf is very small: a manual p-value calculation based on an F-statistic of 4.9732 gives pf(4.9732, 1, 350, lower.tail=FALSE) = 0.0264, hardly different from your value of 0.0260.
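In code, that manual calculation is just:
pf(4.9732, 1, 350, lower.tail = FALSE)   # ~ 0.0264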
Since you are fitting a simple model (an LMM, not a GLMM, with only a single simple random effect, etc.), you might be able to refit it with lme (from the nlme package): lme uses a simpler df computation that might give you the 'right' answer in this case. Alternatively, you can get code from here that implements a (slightly extended) version of the lme algorithm.
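A sketch of that refit (untested; assumes the same data frame and variable names as above):
library(nlme)
VP.full.lme <- lme(Visual.Perspective ~ Culture * Event.Type * Valence + Sex + Age + SES,
                   random = ~ 1 | Participant.Number, data = VP_Long)
anova(VP.full.lme)   # uses lme's simpler inner-outer df computation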
Since you're doing a type-III ANOVA, you should be very careful with the parameterization/contrasts in your model: if you're not using centered (sum-to-zero) contrasts, your results may not mean what you think (the afex::mixed() function does some checks to make sure this is true). It's conceivable (although I doubt it) that the contrasts are throwing off your ddf calculations as well.
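For example, one common way to switch to sum-to-zero contrasts before fitting (a sketch, not a guaranteed fix):
options(contrasts = c("contr.sum", "contr.poly"))   # sum-to-zero for unordered factors
# or per factor, e.g. for the two-level Culture factor:
contrasts(VP_Long$Culture) <- contr.sum(2)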
It's not clear how you're measuring "visual perspective", but if it's a rating scale you might be better off with an ordinal response model...

How to set up a one-way repeated measures MANOVA in R with no between-subject factors

Main Question
I'm looking for help setting up a one-way repeated measures MANOVA in R for a data-set that has no between-subject factors.
Background
While there are plenty of good guides out there for setting up RM MANOVAs with between-subject factors, I have so far been unable to find any for an entirely within-subject design. This problem seems like it should be fairly straightforward, but I am new to MANOVA, so I am not sure I'm approaching it correctly. I have been primarily using the car package in R, though I am open to suggestions for doing this differently.
To demonstrate the problem, I'll use a subset of the OBrienKaiser data set, and I'll pretend that each level of the Hours within-subjects factor instead represents the measurement of a different dependent variable. I'll then take the pre and post conditions to be the two levels of my single within-subjects independent variable. To keep things concise, I'll only look at the first three levels of Hours.
So my data set has 16 subjects measured in two conditions (pre and post) on 3 different dependent variables (1, 2, and 3).
library(car)  # provides the OBrienKaiser data and Anova()
data <- subset(OBrienKaiser, select=c(pre.1,pre.2,pre.3,post.1,post.2,post.3))
My goal is to look for a difference between pre and post across the combination of the three different dependent variables. I've relied heavily on this guide...
http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Multivariate-Linear-Models.pdf
... but the case used there is not quite the same, and it includes between-subject conditions that I do not have. So far, I have been able to produce potentially correct results with the following approach, but my problem is that I'm not fully sure I've set the problem up correctly. In general, my approach was to follow the steps outlined in that guide but simply omit the between-subject conditions.
My Approach
I start by designing the idata matrix for the Anova() function call by treating the condition and the dependent variables as two different within-subject factors
Condition <- factor(rep(c('Pre','Post'),each=3),levels=c('Pre','Post'))
Measure <- factor(rep(c('M1','M2','M3'),2),levels=c('M1','M2','M3'))
idata <- data.frame(Condition,Measure)
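Printing idata confirms that its rows line up with the column order of the response matrix used below (pre.1, pre.2, pre.3, post.1, post.2, post.3):
idata
  Condition Measure
1       Pre      M1
2       Pre      M2
3       Pre      M3
4      Post      M1
5      Post      M2
6      Post      M3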
Next, build the multivariate linear model on the data-set, ignoring the between-subject factors.
mod.mlm <- lm(cbind(pre.1,pre.2,pre.3,post.1,post.2,post.3)~1,data=data)
I then call Anova() on the linear-model object mod.mlm, using the idata I defined earlier, and setting my within-subjects design to be ~Condition.
av.out <- Anova(mod.mlm,idata=idata,idesign=~Condition,type=3)
This yields the following output...
Type III Repeated Measures MANOVA Tests: Pillai test statistic
Df test stat approx F num Df den Df Pr(>F)
(Intercept) 1 0.91438 160.189 1 15 2.08e-09 ***
Condition 1 0.37062 8.833 1 15 0.009498 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To me this process and this result seem reasonable. There are two things, however, that give me pause.
When setting up the linear model, the lack of a predictor variable seems odd, but I think the predictor ends up being defined in the call to Anova() through idesign. If that is just how one sets up the ANOVA with car, then great. It just seems odd to construct a linear model without a predictor when I have a predictor variable I'm explicitly interested in.
If I use summary(av.out) to take a more in-depth look at the model that comes out, I can see the contrasts that go into the design. The contrast produced by the above approach codes pre as 1 and post as -1. I believe this is the appropriate contrast for what I am trying to do, but I am not completely sure. Given that one can pass a custom contrast using imatrix or contrasts in the call to Anova(), I'd like to be sure that what I am trying to test (i.e., differences between pre and post across the three dependent variables) is what I'm actually testing.
Any help and/or advice on understanding repeated measures MANOVA in this context would be greatly appreciated, as well as specific advice on how to implement it in R.
Bonus
I would also like to do the same thing in Matlab, so if anyone has specific advice on that it would be appreciated (though I realize this might require its own question).

What does this short R script do?

I know Python and C++ but have very little experience with R. I'm supposed to figure out what my old coworker's script does -- he left several years ago, but I have his files. He has about 10 Python files that pass data into a temp file and then into the next Python script, which I'm able to track, but he has one R script that I don't understand because I don't know R.
The input to the R script is temp4.txt:
1.414442 0.0043
1.526109 0.0042
1.600553 0.0046
1.637775 0.0045
...etc
Where column 1 is the x-axis of a growth curve (time units) and column 2 is growth level (units OD600, which is a measure of cell density).
The R script is only 4 lines:
inp1 <- scan('/temp4.txt', list(0,0))
decay <- data.frame(t = inp1[[1]], amp = inp1[[2]])
form <- nls(amp ~ const*(exp(fact*t)), data=decay, start = list(const = 0.01, fact = 0.5))
summary(form)
The R script's output:
Parameters:
Estimate Std. Error t value Pr(>|t|)
const 2.293e-03 9.658e-05 23.74 <2e-16 ***
fact 7.106e-01 8.757e-03 81.14 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.002776 on 104 degrees of freedom
Correlation of Parameter Estimates:
const
fact -0.9905
Where the "fact" number is what he's pulling out in the next python script as the value to continue forward with analysis. Usually it's a positive value e.g. "6.649e-01 6.784e-01 6.936e-01 6.578e-01 6.949e-01 6.546e-01 0.6623768 0.6710339 6.952e-01 6.711e-01 6.721e-01 6.520e-01" but because temp file gets overwritten each time I only have one version of it with the negative value -0.9905 which he's throwing away negative values in the next python script.
I need to know what exactly he's doing to recreate it... I know the <- is passing data into an object, so it's the nls() function that's confusing me...
Thanks anyone who can explain the R for me.
The first line reads the data into R.
The second line restructures the data into a data frame (a table structure commonly used in R; it is passed as the data to nls in line 3).
This looks like older code; most modern coders would replace lines 1 and 2 with a single call to read.table.
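For reference, a rough modern equivalent of lines 1 and 2 (a sketch, assuming the same two-column whitespace-separated file):
decay <- read.table('/temp4.txt', col.names = c('t', 'amp'))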
Line 3 fits a non-linear least squares model to the data read in previously, and line 4 prints the summary of the fit, including the parameter estimates, for the next Python script to read.
The non-linear model being fit is an exponential growth curve, and the fact parameter is a measure of the growth rate.
GSee's link in the comments is a good description of the nls function, but in case you're not used to R documentation, here's a quick rundown of nls.
nls() is a nonlinear least-squares modelling function. It is similar to linear regression, except that the model is nonlinear in at least one parameter (a sine, an exponential, an x^2 term, etc.). Nonlinear parameters can sometimes be transformed into linear ones (e.g. by taking logarithms), but that's not always the case.
The first argument is the model being fitted: amp ~ const*(exp(fact*t)) means we model amp as the dependent variable, driven by const * e^(fact*t).
The next argument just tells nls which data object to use (data = decay).
The start argument gives the starting values the fit is built from, in this case const = 0.01 and fact = 0.5.
So the first command reads in the data to the object inp1.
The second creates an object that has class data.frame (which is what most analysis in R is done on). This is basically a table with two columns. In this case the columns are given names (t and amp).
The third command creates an object that has class nls. This object basically contains the information that the nls command generates.
The fourth command prints out a summary of the nls-class object - basically all the pertinent details of the analysis.
The output reads as follows:
First come the estimates and standard errors for the two parameters, const and fact, in the nonlinear model. The t value and Pr(>|t|) columns give a statistical test of whether each parameter is significantly different from 0.
Signif. codes is a legend for the stars shown to the right of the parameter estimates, indicating the range the p-value falls in.
Residual standard error is an indicator of how much variation is left unexplained by the model.
This last one I'm not sure about, as I haven't used nls in a while, but I think it's right:
Correlation of Parameter Estimates shows how strongly correlated the parameters are. In this case, -0.9905 is a very strong negative correlation: as fact goes up, const goes down, very predictably.
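If you want to see the whole pipeline in miniature, here is a self-contained sketch with simulated growth data (hypothetical parameter values chosen to resemble the output above):
# simulate an exponential growth curve with noise, then recover the rate with nls
set.seed(1)
t <- seq(0, 6, by = 0.05)
amp <- 0.0023 * exp(0.71 * t) + rnorm(length(t), sd = 0.003)
sim <- data.frame(t, amp)
fit <- nls(amp ~ const * exp(fact * t), data = sim, start = list(const = 0.01, fact = 0.5))
coef(fit)   # 'fact' should come back close to 0.71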

R: Regression with a holdout of certain variables

I'm fitting a multiple linear regression model using lm(); Y is the response variable (e.g., return on investment) and the others are explanatory variables (100+ cases, 30+ variables).
Certain variables are considered key variables (concerning investment). When I run lm(), R returns a model with an adjusted R-squared of 97%, but some of the key variables are not significant predictors.
Is there a way to run the regression while keeping all of the key variables in the model (as significant predictors)? It doesn't matter if the adjusted R-squared decreases.
If regression can't do this, is there another methodology?
Thank you!
==========================
The data set is uploaded here:
https://www.dropbox.com/s/gh61obgn2jr043y/df.csv
==========================
additional questions:
What if some variables have an impact that carries over from the previous period to the current one?
Example: one takes a pill in the morning with breakfast, and the effect of the pill might last beyond lunch (when he/she takes the 2nd pill).
I suppose I need to consider a data transformation.
* My first choice is to add a carry-over rate: obs.2_trans = obs.2 + c-o rate * obs.1
* Maybe I also need to consider the decay of the pill effect itself, so an s-curve or an exponential transformation may also be necessary.
Take the variable Main1 for example: I can use trial and error to find an ideal carry-over rate and s-curve parameter, starting from 0.5 and testing in steps of 0.05, up to 1 or down to 0, until I get the best model score -- say, lowest AIC or highest R-squared. (A rough sketch of this search is shown below.)
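A rough sketch of that single-variable search (hypothetical column names Main1 and Y; untested):
rates <- seq(0, 1, by = 0.05)                              # candidate carry-over rates
aic.by.rate <- sapply(rates, function(r) {
  main1.trans <- df$Main1 + r * c(0, head(df$Main1, -1))   # obs.2_trans = obs.2 + rate * obs.1
  AIC(lm(df$Y ~ main1.trans))
})
rates[which.min(aic.by.rate)]                              # rate with the lowest AIC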
This is already a huge number of combinations to test.
If I need to test more than 3 variables at the same time, how can I manage that in R?
Thank you!
First, a note on "significance". For each variable included in a model, the linear modeling packages report the likelihood L that the coefficient of the variable is different from zero (actually, they report p = 1 - L). We say that if L is larger (p smaller), then the coefficient is "more significant". So while it is quite reasonable to talk about one variable being "more significant" than another, there is no absolute standard for asserting "significant" vs. "not significant". In most scientific research the cutoff is L > 0.95 (p < 0.05), but this is completely arbitrary, and there are many exceptions. Recall that CERN was unwilling to assert the existence of the Higgs boson until they had collected enough data to demonstrate its effect at 6 sigma, which corresponds roughly to p < 1x10^-9. At the other extreme, many social science studies assert significance at p < 0.2 (because of the higher inherent variability and the usually small number of samples). So excluding a variable from a model because it is "not significant" really has no absolute meaning. On the other hand, you would be hard pressed to include a variable with high p while excluding another variable with lower p.
Second, if your variables are highly correlated (which they are in your case), then it is quite common that removing one variable from a model changes all the p-values greatly. A retained variable that had a high p-value (less significant) might suddenly have a low p-value (more significant), just because you removed a completely different variable from the model. Consequently, trying to optimize a fit manually is usually a bad idea.
Fortunately, there are many algorithms that do this for you. One popular approach starts with a model that has all the variables. At each step, the least significant variable is removed and the resulting model is compared to the model from the previous step. If removing the variable significantly degrades the model, based on some metric, the process stops. A commonly used metric is the Akaike information criterion (AIC); in R we can optimize a model based on the AIC criterion using stepAIC(...) from the MASS package.
Third, the validity of regression models depends on certain assumptions, especially these two: the error variance is constant (does not depend on y), and the distribution of the errors is approximately normal. If these assumptions are not met, the p-values are completely meaningless! Once we have fitted a model, we can check these assumptions using a residual plot and a Q-Q plot. It is essential that you do this for any candidate model.
Finally, the presence of outliers frequently distorts the model significantly (almost by definition!). This problem is amplified if your variables are highly correlated. So in your case it is very important to look for outliers, and see what happens when you remove them.
The code below rolls this all up.
library(MASS)
url <- "https://dl.dropboxusercontent.com/s/gh61obgn2jr043y/df.csv?dl=1&token_hash=AAGy0mFtfBEnXwRctgPHsLIaqk5temyrVx_Kd97cjZjf8w&expiry=1399567161"
df <- read.csv(url)
initial.fit <- lm(Y~.,df[,2:ncol(df)]) # fit with all variables (excluding PeriodID)
final.fit <- stepAIC(initial.fit) # best fit based on AIC
par(mfrow=c(2,2))
plot(initial.fit) # diagnostic plots for base model
plot(final.fit) # same for best model
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 11.38360 18.25028 0.624 0.53452
# Main1 911.38514 125.97018 7.235 2.24e-10 ***
# Main3 0.04424 0.02858 1.548 0.12547
# Main5 4.99797 1.94408 2.571 0.01195 *
# Main6 0.24500 0.10882 2.251 0.02703 *
# Sec1 150.21703 34.02206 4.415 3.05e-05 ***
# Third2 -0.11775 0.01700 -6.926 8.92e-10 ***
# Third3 -0.04718 0.01670 -2.826 0.00593 **
# ... (many other variables included)
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.76 on 82 degrees of freedom
# Multiple R-squared: 0.9824, Adjusted R-squared: 0.9779
# F-statistic: 218 on 21 and 82 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(initial.fit)
title("Base Model",outer=T,line=-2)
plot(final.fit)
title("Best Model (AIC)",outer=T,line=-2)
So you can see from this that the "best model", based on the AIC metric, does in fact include Main1, 3, 5, and 6, but not Main2 and 4. The residuals plot shows no dependence on y (which is good), and the Q-Q plot demonstrates approximate normality of the residuals (also good). On the other hand, the Leverage plot shows a couple of points (rows 33 and 85) with exceptionally high leverage, and the Q-Q plot shows these same points and row 47 as having residuals not really consistent with a normal distribution. So we can re-run the fits excluding these rows, as follows.
initial.fit <- lm(Y~.,df[c(-33,-47,-85),2:ncol(df)])
final.fit <- stepAIC(initial.fit,trace=0)
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 27.11832 20.28556 1.337 0.185320
# Main1 1028.99836 125.25579 8.215 4.65e-12 ***
# Main2 2.04805 1.11804 1.832 0.070949 .
# Main3 0.03849 0.02615 1.472 0.145165
# Main4 -1.87427 0.94597 -1.981 0.051222 .
# Main5 3.54803 1.99372 1.780 0.079192 .
# Main6 0.20462 0.10360 1.975 0.051938 .
# Sec1 129.62384 35.11290 3.692 0.000420 ***
# Third2 -0.11289 0.01716 -6.579 5.66e-09 ***
# Third3 -0.02909 0.01623 -1.793 0.077060 .
# ... (many other variables included)
So excluding these rows results in a fit that has all the "Main" variables with p < 0.2, and all except Main3 at p < 0.1 (90%). I'd want to look at these three rows and see whether there is a legitimate reason to exclude them.
Finally, just because you have a model that fits your existing data well does not mean that it will perform well as a predictive model. In particular, if you are trying to make predictions outside the "model space" (equivalent to extrapolation), your predictive power is likely to be poor.
Significance is determined by the relationships in your data... not by "I want them to be significant".
If the data says they are insignificant, then they are insignificant.
You are going to have a hard time getting any significance with 30 variables and only 100 observations. With only 100+ observations, you should be using just a few variables; with 30 variables, you'd need thousands of observations to get any significance.
Maybe start with the variables you think should be significant and see what happens (see the sketch below).
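A minimal sketch of that starting point (hypothetical: assumes the key variables are the "Main" columns of the uploaded file):
fit.key <- lm(Y ~ Main1 + Main2 + Main3 + Main4 + Main5 + Main6, data = df)
summary(fit.key)   # inspect the fit with only the key variables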

Using mlogit in R with variables that only apply to certain alternatives

I am attempting to use mlogit in R to produce a transportation mode choice model. The problem is that I have a variable that only applies to certain alternatives.
To be more specific, I am attempting to predict the probability of using auto, transit, and non-motorized modes of transportation. My predictors are: distance, transit wait time, number of vehicles in the household, and in-vehicle travel time.
It works when I format it this way:
> amres<-mlogit(mode~ivt+board|distance+nveh,data=AMLOGIT)
However, the results I get for in-vehicle travel time (ivt) do not make sense:
> summary(amres)
Call:
mlogit(formula = mode ~ ivt + board | distance + nveh, data = AMLOGIT,
method = "nr", print.level = 0)
Frequencies of alternatives:
auto tansit nonmotor
0.24654 0.28378 0.46968
nr method
5 iterations, 0h:0m:2s
g'(-H)^-1g = 6.34E-08
gradient close to zero
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
tansit:(intercept) 7.8392e-01 8.3761e-02 9.3590 < 2.2e-16 ***
nonmotor:(intercept) 3.2853e+00 7.1492e-02 45.9532 < 2.2e-16 ***
ivt 1.6435e-03 1.2673e-04 12.9691 < 2.2e-16 ***
board -3.9996e-04 1.2436e-04 -3.2161 0.001299 **
tansit:distance 3.2618e-04 2.0217e-05 16.1336 < 2.2e-16 ***
nonmotor:distance -2.9457e-04 3.3772e-05 -8.7224 < 2.2e-16 ***
tansit:nveh -1.5791e+00 4.5932e-02 -34.3799 < 2.2e-16 ***
nonmotor:nveh -1.8008e+00 4.8577e-02 -37.0720 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -10107
McFadden R^2: 0.30354
Likelihood ratio test : chisq = 8810.1 (p.value = < 2.22e-16)
As you can see, the stats look great, but ivt should have a negative coefficient, not a positive one. My thought is that the non-motorized portion, for which ivt is all 0, is affecting it. I believe what I have to do is use the third part of the formula, as seen below:
> amres<-mlogit(mode~board|distance+nveh|ivt,data=AMLOGIT)
However, this results in:
Error in solve.default(H, g[!fixed]) :
Lapack routine dgesv: system is exactly singular: U[10,10] = 0
I believe this is, again, because the variable is all 0s for non-motorized, but I am unsure how to fix it. How do I include an alternative-specific variable that does not apply to all alternatives?
I am not well versed in the various implementations of logit models, but I imagine it has to do with making sure you have variation across persons and alternatives, so that the model matrix can be properly determined.
What do you get from saying
amres<-mlogit(mode~distance| nveh | ivt+board,data=AMLOGIT)
mlogit separates groups of variables with the pipes. As I understand it: the first part is your basic formula, the second part is variables that don't vary across alternatives (i.e. are only person-specific, such as gender or income -- I think nveh belongs here), while the third part is variables that vary by alternative.
Ken Train, incidentally, has a set of vignettes on mlogit specifically that might be helpful. Viton mentions the partition with pipes.
Ken Train's Vignettes
Philip Viton's Vignettes
Yves Croissant's Vignettes
Looks like you may have perfect separation. Have you checked this, e.g. by looking at crosstables of the variables (see the sketch below)? You can't fit a model if one combination of predictors allows perfect prediction. It would be helpful to know the size of the dataset in this regard -- you may be over-fitting for the amount of data you have. This is a general problem in modelling, not specific to mlogit.
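For instance, a quick hypothetical check for empty cells (assuming AMLOGIT is in long format with one row per alternative):
xtabs(~ mode + I(ivt == 0), data = AMLOGIT)   # empty cells suggest separation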
You say "the stats look great" but values for Pr(>|t|)s and the Likelihood ratio test look implausibly significant, which would be consistent with this problem. This means the estimates of the coefficients are likely to be inaccurate. (Are they similar to the coefficients produced by univariate modelling ?). Perhaps a simpler model would be more appropriate.
Edit (for @user3092719):
You're fitting a generalized linear model, which can easily be overfit (as the outcome variable is discrete or nominal, i.e. has a restricted number of values). mlogit is an extension of logistic regression; here's a simple example of the latter to illustrate:
> df1 <- data.frame(x=c(0, rep(1, 3)),
y=rep(c(0, 1), 2))
> xtabs( ~ x + y, data=df1)
y
x 0 1
0 1 0
1 1 2
Note the zero in the top right corner. This shows "perfect separation": if x=0, you know for sure that y=0, based on this data set. So a probabilistic predictive model doesn't make much sense here.
Some output from
> summary(glm(y ~ x, data=df1, binomial(link = "logit")))
gives
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.57 6522.64 -0.003 0.998
x 19.26 6522.64 0.003 0.998
Here the standard errors are suspiciously large relative to the values of the coefficients. You should also be alerted by Number of Fisher Scoring iterations: 17 -- the large number of iterations needed to fit suggests numerical instability.
Whatever the fix, it will involve ensuring that this problem of complete separation does not occur in your model, although it's hard to be more specific without a minimal working example.
