Wrong degree of freedom for between-subject factor in lmer - r

I'm testing how visual perspective (1 = completely first person, 11 = completely third person) varies as a function of Culture (AA, EA), Valence (Positive, Negative) and Event Type (Memory, Imagination), while controlling for age (continuous), sex (M, F) and SES (continuous) and allowing for individual differences.
This is an unbalanced design: we give participants 10 prompts, but participants can choose to either recall or imagine a relevant event. Therefore, each participant may have as many memories (no more than 10) and as many imaginations (no more than 10) as they want. In total we have 363 participants.
My dataset looks like this:
The model I fit looks like
VP.full.lm <- lmer(Visual.Perspective ~ Culture * Event.Type * Valence +
Sex + Age + SES +
(1|Participant.Number),
data=VP_Long)
When I run anova() function to see the effects of all variables, here is the output:
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Culture 30.73 30.73 1 859.1 4.9732 0.0260008 *
Valence 6.38 6.38 1 3360.3 1.0322 0.3097185
Event.Type 1088.61 1088.61 1 3385.9 176.1759 < 2.2e-16 ***
Sex 45.12 45.12 1 358.1 7.3014 0.0072181 **
Age 7.95 7.95 1 358.1 1.2869 0.2573719
SES 6.06 6.06 1 358.7 0.9807 0.3226824
Culture:Valence 6.39 6.39 1 3364.6 1.0348 0.3091004
Culture:Event.Type 71.53 71.53 1 3389.7 11.5766 0.0006756 ***
Valence:Event.Type 2.89 2.89 1 3385.4 0.4682 0.4938573
Culture:Valence:Event.Type 3.47 3.47 1 3390.6 0.5617 0.4536399
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you can see, the DF for the effect of culture is off -- since culture is a between-subject factor, its DF cannot be larger than our sample size. I've tried using ddf = "Kenward-Roger" and testing the effect of culture using emmeans::test(contrast(emmeans(VP.full.lm,c("Culture")), "trt.vs.ctrl"), joint = T), yet neither of these methods solved the degrees-of-freedom problem.
I also thought that maybe the participants who did not provide both memories and imaginations were confusing the lmer model, so I subsetted my data to only include participants who provided both types of events. However, the degrees-of-freedom problem persists. It's also worth mentioning that once I removed the interaction between Culture and Event.Type, the degrees of freedom became plausible.
I wonder if anyone knows what is going on here, and how can we fix this issue? Or is there way we can explain away this weird issue...?
Thanks so much in advance!

This question might be more appropriate for CrossValidated ...
Not a complete solution, but some ideas:
from a practical point of view, the difference between 363 (or even 350) denominator df and 859 ddf is very small: the manual p-value calculation based on an F-statistic of 4.9732 gives pf(4.9732,1,350,lower.tail=FALSE)=0.0264, hardly different from your value of 0.0260.
since you are fitting a simple model (LMM not GLMM, only a single simple random effect, etc.), you might be able to refit your model in lme (from the nlme package): it uses a simpler df computation that might give you the 'right' answer in this case. Alternatively, you can get code from here that implements a (slightly extended) version of the algorithm from lme.
since you're doing type-III Anova, you should be very careful with the parameterization/contrasts in your model: if you're not using centered (sum-to-zero) contrasts, your results may not mean what you think (the afex::mixed() function does some checks to make sure that this is true). It's conceivable (although I doubt it) that the contrasts are throwing off your ddf calculations as well.
it's not clear how you're measuring "visual perspective", but if it's a ratings scale you might be better off with an ordinal response model ...
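The p-value comparison in the first point above can be reproduced in base R. This is only a sketch: the F statistic and the 859.1 denominator df are copied from the ANOVA table in the question, and 350 is the deliberately conservative value used in the answer.

```r
# F statistic for Culture, taken from the ANOVA table in the question
f_stat <- 4.9732

# p-value with the Satterthwaite denominator df that lmerTest reported
p_lmer <- pf(f_stat, df1 = 1, df2 = 859.1, lower.tail = FALSE)

# p-value with a conservative denominator df, below the number of participants
p_conservative <- pf(f_stat, df1 = 1, df2 = 350, lower.tail = FALSE)

# Both values round to roughly 0.026, so the conclusion does not change
round(c(ddf_859 = p_lmer, ddf_350 = p_conservative), 4)
```

A smaller denominator df always gives a slightly larger p-value for the same F statistic, which is why the conservative choice is a useful sanity check.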

Related

Interpretation of .L, .Q., .C, .4… for logistic regression

I've done a good amount of googling and the explanations either don't make any sense or they say just use factors instead of ordinal data. I understand that .L is linear, .Q is quadratic, ... etc. But I don't know how to actually say what it means. So for example let's say
Primary.L 7.73502 0.984
Primary.Q 6.81674 0.400
Primary.C -4.07055 0.450
Primary^4 1.48845 0.600
where the first column is the variable, second is the estimate, and the third is the p-value. What would I be saying about the variables as they increase in order? Is this basically saying what model I would use so this would be 7.73502x + 6.81674x^2 - 4.07055x^3 is how the model is? Or would it just include quadratic? All of this is so confusing. If anyone could shine a light into how to interpret these .L, .Q, .C, etc., that would be fantastic.
example
> summary(glm(DEPENDENT ~ Year, data = HAVE, family = "binomial"))
Call:
glm(formula = DEPENDENT ~ Year, family = "binomial", data = HAVE)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.3376 -0.2490 -0.2155 -0.1635 3.1802
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.572966 0.028179 -126.798 < 2e-16 ***
Year.L -2.212443 0.150295 -14.721 < 2e-16 ***
Year.Q -0.932844 0.162011 -5.758 8.52e-09 ***
Year.C 0.187344 0.156462 1.197 0.2312
Year^4 -0.595352 0.147113 -4.047 5.19e-05 ***
Year^5 -0.027306 0.135214 -0.202 0.8400
Year^6 -0.023756 0.120969 -0.196 0.8443
Year^7 0.079723 0.111786 0.713 0.4757
Year^8 -0.080749 0.103615 -0.779 0.4358
Year^9 -0.117472 0.098423 -1.194 0.2327
Year^10 -0.134956 0.095098 -1.419 0.1559
Year^11 -0.106700 0.089791 -1.188 0.2347
Year^12 0.102289 0.088613 1.154 0.2484
Year^13 0.125736 0.084283 1.492 0.1357
Year^14 -0.009941 0.084058 -0.118 0.9059
Year^15 -0.173013 0.088781 -1.949 0.0513 .
Year^16 -0.146597 0.090398 -1.622 0.1049
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 18687 on 80083 degrees of freedom
Residual deviance: 18120 on 80067 degrees of freedom
AIC: 18154
Number of Fisher Scoring iterations: 7
That output indicates that your predictor Year is an "ordered factor" meaning R not only understands observations within that variable to be distinct categories or groups (i.e., a factor) but also that the various categories have a natural order to them where one category is considered larger than another.
In this situation, R's default is to fit a series of polynomial functions or contrasts to the levels of the variable. The first is linear (.L), the second is quadratic (.Q), the third is cubic (.C), and so on. R will fit one fewer polynomial functions than the number of available levels. Thus, your output indicates there are 17 distinct years in your data.
You can probably think of those 17 (counting the intercept) predictors in your output as entirely new variables all based on the order of your original variable because R creates them using special values that make all the new predictors orthogonal (i.e., unrelated, linearly independent, or uncorrelated) to each other.
One way to see the values that got used is to use the model.matrix() function on your model object.
model.matrix(glm(DEPENDENT ~ Year, data = HAVE, family = "binomial"))
If you run the above, you will find a bunch of repeated numbers within each of the new variable columns where the changes in repetition correspond to where your original Year predictor switched categories. The specific values themselves hold no real meaning to you because they were chosen/computed by R to make all the contrasts linearly independent of one another.
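To see where those repeated values come from without the original data, here is a minimal sketch using a hypothetical ordered factor with four levels (the factor yr and its levels are made up for illustration). contr.poly() is the base R function that generates the polynomial contrast weights.

```r
# Polynomial contrast weights for an ordered factor with 4 levels
cp <- contr.poly(4)
colnames(cp)   # linear, quadratic and cubic columns

# The columns are orthonormal: their cross-product is the identity matrix
orth <- round(crossprod(cp), 10)

# A hypothetical ordered factor picks up these contrasts automatically
yr <- factor(c("a", "b", "c", "d"), ordered = TRUE)
mm <- model.matrix(~ yr)
colnames(mm)   # intercept plus one column per contrast
mm
```

Each row of the design matrix repeats the contrast weights for that observation's level, which is the pattern of repeated numbers described above.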
Therefore, your model in the R output would be:
logit(p) = -3.57 + -2.21 * Year.L + -0.93 * Year.Q + ... + -0.15 * Year^16
where p is the probability of presence of the characteristic of interest, and the logit transformation is defined as the logged odds where odds = p / (1 - p) and logged odds = ln(odds). Therefore logit(p) = ln(p / (1 - p)).
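The logit transformation and its inverse can be written out directly in base R. This is a small sketch: the helper names logit and inv_logit are my own, and the intercept is copied from the output above. Because polynomial contrast columns sum to zero, my understanding is that the intercept is the logit averaged over the year levels rather than the logit for any single year.

```r
# Logit and inverse logit, matching the definition in the text
logit <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Intercept from the model output above (a value on the logit scale)
eta0 <- -3.572966

# Back-transform to a probability; base R's plogis() does the same thing
p0 <- inv_logit(eta0)
c(probability = p0, check = plogis(eta0))
```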
The interpretation of a particular beta test is then generalized to: Which contrasts contribute significantly to explain any differences between levels in your dependent variable? Because your Year.L predictor is significant and negative, this suggests a linear decreasing trend in logit across years, and because your Year.Q predictor is significant and negative, this suggests a deceleration trend is detectable in the pattern of logits across years. Third order polynomials model jerk, and fourth order polynomials model jounce (a.k.a., snap). However, I would stop interpreting around this order and higher because it quickly becomes nonsensical to practical folk.
Similarly, to interpret a particular beta estimate is a bit nonsensical to me, but it would be that the odds of switching categories in your outcome at a given level of a particular contrast (e.g., quadratic) as compared to the odds of switching categories in your outcome at the given level of that contrast (e.g., quadratic) less one unit is equal to the odds ratio had by exponentiating the beta estimate. For the quadratic contrast in your example, the odds ratio would be exp(-0.9328) = 0.3935, but I say it's a bit nonsensical because the units have little practical meaning as they were chosen by R to make the predictors linearly independent from one another. Thus I prefer focusing on the interpretation of a given contrast test as opposed to the coefficient in this circumstance.
For further reading, here is a webpage at UCLA's wonderful IDRE that discusses how to interpret odds ratios in logistic regression, and here is a crazy cool but intense stack exchange answer that walks through how R chooses the polynomial contrast weights.

Power analysis for multiple regression using pwr and R

I want to determine the sample size necessary to detect an effect of an interaction term of two continuous variables (scaled) in a multiple regression with other covariates.
We have found an effect where previous smaller studies have failed. These effects are small, but a reviewer is asking us to say that previous studies were probably underpowered, and to provide some measure to support that.
I am using the pwr.f2.test() function in the pwr package, as follows:
pwr.f2.test(u = numerator, v = denominator, f2 = effect size, sig.level = 0.05, power = 0.8), with the denominator (v) set to NULL so that I can solve for the sample size.
Here is my model output from summary():
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.2333 20.8127 -1.02 0.30800
age 0.0740 0.0776 0.95 0.34094
wkdemand 1.6333 0.5903 2.77 0.00582 **
hoops 0.8662 0.6014 1.44 0.15028
wtlift 5.2417 1.3912 3.77 0.00018 ***
height05 0.2205 0.0467 4.72 2.9e-06 ***
amtRS 0.1041 0.2776 0.37 0.70779
allele1_numS -0.0731 0.2779 -0.26 0.79262
amtRS:allele1_numS 0.6267 0.2612 2.40 0.01670 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.17 on 666 degrees of freedom
Multiple R-squared: 0.0769, Adjusted R-squared: 0.0658
F-statistic: 6.94 on 8 and 666 DF, p-value: 8.44e-09
And the model effects sizes estimates from modelEffectSizes() function in lmSupport package:
Coefficients
SSR df pEta-sqr dR-sqr
(Intercept) 53.5593 1 0.0016 NA
age 46.7344 1 0.0014 0.0013
wkdemand 393.9119 1 0.0114 0.0106
hoops 106.7318 1 0.0031 0.0029
wtlift 730.5385 1 0.0209 0.0197
height05 1145.0394 1 0.0323 0.0308
amtRS 7.2358 1 0.0002 0.0002
allele1_numS 3.5599 1 0.0001 0.0001
amtRS:allele1_numS 296.2219 1 0.0086 0.0080
Sum of squared errors (SSE): 34271.3
Sum of squared total (SST): 37127.3
The question:
What value do I put in the f2 slot of pwr.f2.test()? I take it the numerator is going to be 1, and I should use the pEta-sqr from modelEffectSizes(), so in this case 0.0086?
Also, the estimated sample sizes I get are often much larger than our sample size of 675 - does this mean we were 'lucky' to have picked up significant effects (we'll only detect them 50% of the time, given the effect size)? Note that we have multiple measures of different things all pointing to the same finding, so I'm relatively satisfied there.
What value do I put in the f2 slot of pwr.f2.test()?
For each of the pwr functions, you enter three of the four quantities (effect size, sample size, significance level, power) and the fourth will be calculated (1). In pwr.f2.test, u and v are the numerator and denominator degrees of freedom, and f2 is the effect size measure, i.e. that is where you put an effect size estimate.
Is pEta-sqr the correct 'effect size' to use?
Now, there are many different effect size measures. The pwr package specifically uses Cohen's f², which is different from pEta-sqr, so I wouldn't recommend using pEta-sqr directly.
Which effect size measure I could use then?
As @42- mentioned, you could try to use the delta-R² effect, which in your output is labeled "dR-sqr". You could do this with a variation of Cohen's f² that measures local effect size, described by Selya et al. (2012). It uses the following equation: f² = (R²AB - R²A) / (1 - R²AB)
In the equation, B is the variable of interest, A is the set of all other variables, R²AB is the proportion of variance accounted for by A and B together (relative to a model with no regressors), and R²A is the proportion of variance accounted for by A (relative to a model with no regressors). I would do as @42- suggested, i.e. build two models, one with the interaction and one without, and use their delta-R² effect size.
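As a rough illustration of the Selya et al. local effect size, f² = (R²AB - R²A) / (1 - R²AB), the sketch below plugs in the rounded values printed in the question (Multiple R-squared of 0.0769 and dR-sqr of 0.0080 for the interaction). In practice you would take both R² values from the two fitted models. The pwr call is guarded because the package may not be installed, and the conversion from v to a sample size assumes the 8 predictors of the full model.

```r
# Values copied (rounded) from the question's output
r2_ab    <- 0.0769            # R-squared of the full model (A and B together)
delta_r2 <- 0.0080            # dR-sqr for amtRS:allele1_numS
r2_a     <- r2_ab - delta_r2  # implied R-squared of the model without the interaction

# Local Cohen's f2 for the interaction term
f2 <- (r2_ab - r2_a) / (1 - r2_ab)
f2

# Solve for the denominator df v at 80% power, if pwr is available
if (requireNamespace("pwr", quietly = TRUE)) {
  res <- pwr::pwr.f2.test(u = 1, f2 = f2, sig.level = 0.05, power = 0.80)
  # v = N - (number of predictors) - 1, so N is about v + 8 + 1 here
  n_needed <- ceiling(res$v) + 8 + 1
  print(n_needed)
}
```

To assess earlier studies, keep f2 fixed, supply their v (from their sample sizes) and leave power unspecified so that pwr.f2.test() solves for power instead.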
Importantly, as @42- correctly pointed out, if the reviewers ask you whether prior studies were underpowered, you need to use the sample sizes of those studies in any power calculation. If you use the parameters of your own study, first of all you already know the answer (you did have sufficient power to detect a difference), and second, you are doing it post hoc, which also doesn't sound correct.
https://www.statmethods.net/stats/power.html
Selya et al., 2012: A Practical Guide to Calculating Cohen’s f2, a Measure of Local Effect Size, from PROC MIXED. Front Psychol. 2012;3:111.

R: Regression with a holdout of certain variables

I'm fitting a multiple linear regression model using lm(); Y is the response variable (e.g. return of interests) and the others are explanatory variables (100+ cases, 30+ variables).
Certain variables are considered key variables (concerning investment). When I ran the lm() function, R returned a model with an adjusted R-squared of 97%, but some of the key variables are not significant predictors.
Is there a way to do a regression that keeps all of the key variables in the model (as significant predictors)? It doesn't matter if the adjusted R-squared decreases.
If the regression doesn't work, is there other methodology?
thank you!
==========================
the data set is uploaded
https://www.dropbox.com/s/gh61obgn2jr043y/df.csv
==========================
additional questions:
what if some variables have impact from previous period to current period?
Example: one takes a pill in the morning when he/she has breakfast and the effect of pills might last after lunch (and he/she takes the 2nd pill at lunch)
I suppose I need to take consideration of data transformation.
* My first choice is to add a carry-over term: obs.2_trans = obs.2 + c-o rate * obs.1
* Maybe I also need to consider the decay of the pill effect itself, so an s-curve or an exponential transformation is also necessary.
Taking variable main1 as an example, I can use trial and error to find an ideal c-o rate and s-curve parameter, starting from 0.5 and testing in steps of 0.05, up to 1 or down to 0, until I get the best model score - say, lowest AIC or highest R square.
This is already a huge quantity to test.
If I need to test more than 3 variables in the same time, how could I manage that by R?
Thank you!
First, a note on "significance". For each variable included in a model, the linear modeling packages report a p-value: the probability of seeing a coefficient estimate at least this far from zero if the true coefficient were zero. The smaller the p-value, the "more significant" the coefficient. So, while it is quite reasonable to talk about one variable being "more significant" than another, there is no absolute standard for asserting "significant" vs. "not significant". In most scientific research, the cutoff is p < 0.05. But this is completely arbitrary, and there are many exceptions. Recall that CERN was unwilling to assert the existence of the Higgs boson until they had collected enough data to demonstrate its effect at 5-sigma. This corresponds roughly to p < 3 × 10^-7. At the other extreme, many social science studies assert significance at p < 0.2 (because of the higher inherent variability and usually small number of samples). So excluding a variable from a model because it is "not significant" really has no meaning. On the other hand you would be hard pressed to include a variable with high p while excluding another variable with lower p.
Second, if your variables are highly correlated (which they are in your case), then it is quite common that removing one variable from a model changes all the p-values greatly. A retained variable that had a high p-value (less significant), might suddenly have low p-value (more significant), just because you removed a completely different variable from the model. Consequently, trying to optimize a fit manually is usually a bad idea.
Fortunately, there are many algorithms that do this for you. One popular approach starts with a model that has all the variables. At each step, the least significant variable is removed and the resulting model is compared to the model at the previous step. If removing this variable significantly degrades the model, based on some metric, the process stops. A commonly used metric is the Akaike information criterion (AIC), and in R we can optimize a model based on the AIC criterion using stepAIC(...) in the MASS package.
Third, the validity of regression models depends on certain assumptions, especially these two: the error variance is constant (does not depend on y), and the distribution of error is approximately normal. If these assumptions are not met, the p-values are completely meaningless!! Once we have fitted a model we can check these assumptions using a residual plot and a Q-Q plot. It is essential that you do this for any candidate model!
Finally, the presence of outliers frequently distorts the model significantly (almost by definition!). This problem is amplified if your variables are highly correlated. So in your case it is very important to look for outliers, and see what happens when you remove them.
The code below rolls this all up.
library(MASS)
url <- "https://dl.dropboxusercontent.com/s/gh61obgn2jr043y/df.csv?dl=1&token_hash=AAGy0mFtfBEnXwRctgPHsLIaqk5temyrVx_Kd97cjZjf8w&expiry=1399567161"
df <- read.csv(url)
initial.fit <- lm(Y~.,df[,2:ncol(df)]) # fit with all variables (excluding PeriodID)
final.fit <- stepAIC(initial.fit) # best fit based on AIC
par(mfrow=c(2,2))
plot(initial.fit) # diagnostic plots for base model
plot(final.fit) # same for best model
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 11.38360 18.25028 0.624 0.53452
# Main1 911.38514 125.97018 7.235 2.24e-10 ***
# Main3 0.04424 0.02858 1.548 0.12547
# Main5 4.99797 1.94408 2.571 0.01195 *
# Main6 0.24500 0.10882 2.251 0.02703 *
# Sec1 150.21703 34.02206 4.415 3.05e-05 ***
# Third2 -0.11775 0.01700 -6.926 8.92e-10 ***
# Third3 -0.04718 0.01670 -2.826 0.00593 **
# ... (many other variables included)
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.76 on 82 degrees of freedom
# Multiple R-squared: 0.9824, Adjusted R-squared: 0.9779
# F-statistic: 218 on 21 and 82 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(initial.fit)
title("Base Model",outer=T,line=-2)
plot(final.fit)
title("Best Model (AIC)",outer=T,line=-2)
So you can see from this that the "best model", based on the AIC metric, does in fact include Main 1, 3, 5, and 6, but not Main 2 and 4. The residuals plot shows no dependence on y (which is good), and the Q-Q plot demonstrates approximate normality of the residuals (also good). On the other hand, the Leverage plot shows a couple of points (rows 33 and 85) with exceptionally high leverage, and the Q-Q plot shows these same points and row 47 as having residuals not really consistent with a normal distribution. So we can re-run the fits excluding these rows as follows.
initial.fit <- lm(Y~.,df[c(-33,-47,-85),2:ncol(df)])
final.fit <- stepAIC(initial.fit,trace=0)
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 27.11832 20.28556 1.337 0.185320
# Main1 1028.99836 125.25579 8.215 4.65e-12 ***
# Main2 2.04805 1.11804 1.832 0.070949 .
# Main3 0.03849 0.02615 1.472 0.145165
# Main4 -1.87427 0.94597 -1.981 0.051222 .
# Main5 3.54803 1.99372 1.780 0.079192 .
# Main6 0.20462 0.10360 1.975 0.051938 .
# Sec1 129.62384 35.11290 3.692 0.000420 ***
# Third2 -0.11289 0.01716 -6.579 5.66e-09 ***
# Third3 -0.02909 0.01623 -1.793 0.077060 .
# ... (many other variables included)
So excluding these rows results in a fit that has all the "Main" variables with p < 0.2, and all except Main 3 at p < 0.1 (90%). I'd want to look at these three rows and see if there is a legitimate reason to exclude them.
Finally, just because you have a model that fits your existing data well, does not mean that it will perform well as a predictive model. In particular, if you are trying to make predictions outside of the "model space" (equivalent to extrapolation), then your predictive power is likely to be poor.
Significance is determined by the relationships in your data .. not by "I want them to be significant".
If the data says they are insignificant, then they are insignificant.
You are going to have a hard time getting any significance with 30 variables, and only 100 observations. With only 100+ observations, you should only be using a few variables. With 30 variables, you'd need 1000's of observations to get any significance.
Maybe start with the variables you think should be significant, and see what happens.

Using mlogit in R with variables that only apply to certain alternatives

I am attempting to use mlogit in R to fit a transportation mode choice model. The problem is that I have a variable that only applies to certain alternatives.
To be more specific, I am attempting to predict the probability of using auto, transit and non motorized modes of transportation. My predictors are: distance, transit wait time, number of vehicles in household and in vehicle travel time.
It works when I format it this way:
> amres<-mlogit(mode~ivt+board|distance+nveh,data=AMLOGIT)
However, the result I get for in vehicle travel time (ivt) does not make sense:
> summary(amres)
Call:
mlogit(formula = mode ~ ivt + board | distance + nveh, data = AMLOGIT,
method = "nr", print.level = 0)
Frequencies of alternatives:
auto tansit nonmotor
0.24654 0.28378 0.46968
nr method
5 iterations, 0h:0m:2s
g'(-H)^-1g = 6.34E-08
gradient close to zero
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
tansit:(intercept) 7.8392e-01 8.3761e-02 9.3590 < 2.2e-16 ***
nonmotor:(intercept) 3.2853e+00 7.1492e-02 45.9532 < 2.2e-16 ***
ivt 1.6435e-03 1.2673e-04 12.9691 < 2.2e-16 ***
board -3.9996e-04 1.2436e-04 -3.2161 0.001299 **
tansit:distance 3.2618e-04 2.0217e-05 16.1336 < 2.2e-16 ***
nonmotor:distance -2.9457e-04 3.3772e-05 -8.7224 < 2.2e-16 ***
tansit:nveh -1.5791e+00 4.5932e-02 -34.3799 < 2.2e-16 ***
nonmotor:nveh -1.8008e+00 4.8577e-02 -37.0720 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Log-Likelihood: -10107
McFadden R^2: 0.30354
Likelihood ratio test : chisq = 8810.1 (p.value = < 2.22e-16)
As you can see, the stats look great, but ivt should have a negative coefficient, not a positive one. My thought is that the non-motorized portion, which is all 0, is affecting it. I believe what I have to do is use the third part of the formula, as seen below:
> amres<-mlogit(mode~board|distance+nveh|ivt,data=AMLOGIT)
However, this results in:
Error in solve.default(H, g[!fixed]) :
Lapack routine dgesv: system is exactly singular: U[10,10] = 0
I believe this is, again, because the variable is all 0's for non-motorized but I am unsure how to fix this. How do I include an alternative specific variable if it does not apply to all alternatives?
I am not well versed in the various implementations of logit models, but I imagine it has to do with making sure you have variation across both persons and alternatives, so that the matrix can be properly inverted.
What do you get from saying
amres<-mlogit(mode~distance| nveh | ivt+board,data=AMLOGIT)
mlogit separates the formula into parts with pipes. As I understand it, the first part holds alternative-specific variables that share a single (generic) coefficient, the second part holds variables that don't vary across alternatives (i.e. are only person specific, such as gender or income; I think nveh should be here), and the third part holds alternative-specific variables whose coefficients differ by alternative.
Ken Train, incidentally, has a set of vignettes on mlogit specifically that might be helpful. Viton mentions the partition with pipes.
Ken Train's Vignettes
Philip Viton's Vignettes
Yves Croissant's Vignettes
Looks like you may have perfect separation. Have you checked this by e.g. looking at crosstables of the variables? (Can't fit a model if one combination of predictors allows for perfect prediction...) Would be helpful to know size of dataset in this regard - you may be over-fitting for the amount of data you have. This is a general problem in modelling, not specific to mlogit.
You say "the stats look great" but values for Pr(>|t|)s and the Likelihood ratio test look implausibly significant, which would be consistent with this problem. This means the estimates of the coefficients are likely to be inaccurate. (Are they similar to the coefficients produced by univariate modelling ?). Perhaps a simpler model would be more appropriate.
Edit @user3092719 :
You're fitting a generalized linear model, which can easily be overfit (as the outcome variable is discrete or nominal - i.e. has a restricted no. of values). mlogit is an extension of logistic regression; here's a simple example of the latter to illustrate:
> df1 <- data.frame(x=c(0, rep(1, 3)),
y=rep(c(0, 1), 2))
> xtabs( ~ x + y, data=df1)
y
x 0 1
0 1 0
1 1 2
Note the zero in the top right corner. This shows 'perfect separation', which means that if x=0 you know for sure that y=0, based on this set. So a probabilistic predictive model doesn't make much sense.
Some output from
> summary(glm(y ~ x, data=df1, binomial(link = "logit")))
gives
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.57 6522.64 -0.003 0.998
x 19.26 6522.64 0.003 0.998
Here the size of the Std. Errors are suspiciously large relative to the value of the coefficients. You should also be alerted by Number of Fisher Scoring iterations: 17 - the large no. iterations needed to fit suggests numerical instability.
Your solution seems to involve ensuring that this problem of complete separation does not occur in your model, although hard to be sure without having a minimal working example.

How to create an ADBUDG Economics Marketing Model with R

So I am trying to recreate this blog post, but the author does not give very detailed steps for use with R.
So, I have this set of data which contains total paid search spend, branded spend, non-branded spend, paid clicks (marked simply as clicks), organic visits, total visits, visitors and transactions for an ecommerce site.
I attempted to do the first part by myself and was given this output:
> fit <- lm(organic.visits ~ visits + clicks +visitors + transactions, data=mydata)
> summary(fit)
Call:
lm(formula = organic.visits ~ visits + clicks + visitors + transactions,
data = mydata)
Residuals:
Min 1Q Median 3Q Max
-5916.6 -1100.9 -237.4 811.6 8444.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7028.56997 502.55911 13.986 < 2e-16 ***
visits 1.15842 0.04406 26.295 < 2e-16 ***
clicks 0.46578 0.09884 4.712 3.39e-06 ***
visitors -1.13322 0.04442 -25.513 < 2e-16 ***
transactions -1.11505 0.19823 -5.625 3.49e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1903 on 399 degrees of freedom
Multiple R-squared: 0.8236, Adjusted R-squared: 0.8219
F-statistic: 465.8 on 4 and 399 DF, p-value: < 2.2e-16
First: I want to know if I did this correctly, or if there is a better way to do it.
Second: I want to know how to do the second part of the post, which is the ADBUDG Economics Marketing Model. I tried to search online, but have not found anything useful to help me complete this part of the analysis. The author does not give any direction on how to do it.
First, you'll want to set natural search clicks as your dependent variable - to the left of the "~" - and have the others be the independent variables (of which paid search clicks should be one).
Then take a look at the estimate for paid search clicks to see how they're affecting natural search.
To find the best estimates of the ADBUDG model parameters, you'll need to use some kind of solver. First create the equation (b + (a-b) * spend^c / (d + spend ^ c)), then calculate the error from each point by subtracting the ADBUDG predictions from the actuals. Then you can use Excel's solver or a solver in R to find the a, b, c, and d that minimize the total squared error.
a gives you the upper bound, and b gives you the lower bound based on the data.
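The solver step can be sketched in base R with optim(). Everything below is illustrative: the data are simulated, the "true" parameter values are made up, spend is assumed to be rescaled to thousands, and fitting c and d on the log scale (to keep them positive) is my own choice rather than anything from the blog post.

```r
# ADBUDG response curve: b + (a - b) * spend^c / (d + spend^c)
adbudg <- function(spend, a, b, c, d) b + (a - b) * spend^c / (d + spend^c)

# Simulated data with hypothetical parameters
set.seed(1)
spend  <- seq(0.1, 5, length.out = 60)
visits <- adbudg(spend, a = 900, b = 100, c = 1.8, d = 4) + rnorm(60, sd = 15)

# Total squared error; par = (a, b, log(c), log(d))
sse <- function(par) {
  pred <- adbudg(spend, par[1], par[2], exp(par[3]), exp(par[4]))
  sum((visits - pred)^2)
}

# Start from the observed range for a and b, and c = d = 1
start <- c(max(visits), min(visits), log(1), log(1))
fit <- optim(start, sse, method = "Nelder-Mead",
             control = list(maxit = 5000, parscale = c(100, 100, 1, 1)))

# Back-transform to the original parameters
est <- c(a = fit$par[[1]], b = fit$par[[2]],
         c = exp(fit$par[[3]]), d = exp(fit$par[[4]]))
round(est, 2)
```

With real data, replace spend and visits with your own columns. nls() is another option for the same least-squares problem, but it tends to be more sensitive to starting values.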
First of all, you need to make sure that your response variable (or dependent variable), in this case organic visits, is a continuous variable. If so, then your model is correct. However, you should always check the model assumptions, such as the distribution of the residuals and their homoscedasticity. You can visualize that by using:
plot(fit)
Regarding the interpretation of the results: each of the predictor variables (or independent variables) shows a significant effect. If these variables are also continuous, this means that each of them has a relationship with the response variable that is statistically different from 0. For instance, in the case of the variable "visits", for an increase of 1 in visits we would expect an increase of about 1.16 in organic visits, holding the other predictors fixed. The effect is negative when the estimate is smaller than zero.
You can see that by including the individual effects of these variables, you can explain 82% of the variance in your response variable (see R squared).
Cheers!
Pablo

Resources