I'm having a lot of trouble understanding this table:
For example, do I interpret line 3 as meaning that a MOOSES_Rating of 5 has an effect size .07 higher than the reference category? That difference isn't statistically significant at the 10% level, because the p-value is .281.
Thank you so much.
I have a concern with a GLMM I am running and I would be very grateful if you could help me out.
I am modelling the factors that cause a frog species to make either type 1 or type 2 calls, using a GLMM logistic regression. The data were generated from recordings of individuals in frog choruses of various sizes. For each male in the dataset, I randomly chose 100 of his calls and determined whether each was type 1 or type 2 (type 1 call = 0, type 2 call = 1). So each frog is represented by the same number of calls (100), and some frogs appear in several choruses of different sizes (total n = 12400).
The response variable is whether each call in the dataset is type 1 or type 2, and my fixed effects are:
the size of the chorus the frog is calling in (2, 3, 4, 5, 6),
the body condition of the frog (residuals from an LM of mass on body length), and
standardized body length (SVL).
Body length and body condition score are not correlated, so there are no VIF issues. I included frog ID and chorus ID as random intercepts.
Model results
The model fits fine, and the coefficients seem sensible; they are about what I expected. The only thing that worries me is that, when I calculate the 95% CI for the coefficients, the one for body condition has a huge range (-9.7 to 6.3) (see screenshot). Even when exponentiated it seems extreme (0 to 492). Is this reasonable?
This variable was involved in a significant interaction with chorus size; does that explain the wide CI, or does it suggest my approach is flawed? Instead of having each male equally represented by 100 calls in each chorus he is in, should I collapse that down to counts (e.g. the number of type 2 calls out of the 100 randomly selected calls for each male) and model those with a binomial regression? Is the way I'm doing my logistic regression a reasonable approach? I have run model checks, and they all seem to point to logistic regression being suitable for my data, at least as I have set it up currently.
Thanks for any help you can provide!
Values I get after standardizing condition:
                              2.5 %      97.5 %
.sig01                  2.0948132676  3.1943483
.sig02                  0.0000000000  2.0980214
(Intercept)            -3.1595281536 -1.2902779
chorus_size             0.8936643930  1.0418465
cond_resid             -0.8872467384  0.5746653
svl                    -0.0865697646  1.2413117
chorus_size:cond_resid -0.0005998784  0.1383067
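On the aggregation idea mentioned above: collapsing the 0/1 call-level data gives a binomial response (type 2 calls out of total calls per frog-chorus combination), not a Poisson one. A minimal sketch of the collapse step, using made-up call records:

```python
from collections import defaultdict

# Hypothetical call-level records: (frog_id, chorus_id, call_type)
# call_type: 0 = type 1 call, 1 = type 2 call
calls = [
    ("frogA", "chorus1", 1), ("frogA", "chorus1", 0),
    ("frogA", "chorus1", 1), ("frogB", "chorus1", 0),
    ("frogB", "chorus2", 1), ("frogB", "chorus2", 1),
]

# Collapse to one row per frog-chorus combination:
# [type-2 count, total calls] -- a binomial response, not Poisson.
counts = defaultdict(lambda: [0, 0])
for frog, chorus, call_type in calls:
    counts[(frog, chorus)][0] += call_type   # successes (type 2 calls)
    counts[(frog, chorus)][1] += 1           # trials (all calls)

for (frog, chorus), (k, n) in sorted(counts.items()):
    print(frog, chorus, k, n)
```

In R, each (k, n) row would become a cbind(type2, type1) response with family = binomial; because the covariates are constant within each frog-chorus group, the aggregated binomial fit has the same likelihood (up to a constant) and hence the same fixed-effect estimates as the call-level Bernoulli fit.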
Intro
I'm trying to construct a GLM that models the quantity (mass) of eggs that individuals of a fish population lay, depending on their size and age.
Thus, the variables are:
eggW: the total mass of laid eggs, continuous and positive, ranging between 300 and 30000.
fishW: mass of the fish, continuous and positive, ranging between 3 and 55.
age: either 1 or 2 years.
No 0's, no NA's.
After checking the data and realising that assuming a normal distribution was probably not appropriate, I decided to use a Gamma distribution. I chose Gamma basically because the variable is positive and continuous, with variance increasing at higher values, and appears skewed, as you can see in the image below.
Frequency distribution of eggW values:
fishW vs eggW:
The code
myglm <- glm(eggW ~ fishW * age, family = Gamma(link = identity),
             start = c(mean(data$eggW), 1, 1, 1),
             maxit = 100)
I added the maxit argument after seeing it suggested in a post on this site as a solution to the glm.fit: algorithm did not converge error, and it worked.
I chose to work with link=identity because of the more obvious and straightforward interpretation of the results in biological terms rather than using an inverse or log link.
So, the code above results in the next message:
Warning messages:
1: In log(ifelse(y == 0, 1, y/mu)) : NaNs produced
2: step size truncated due to divergence
Importantly, no warnings are shown if the variable fishW is dropped and only age is kept, and none are reported if a log link is used.
Questions
If the rationale behind the design of my model is acceptable, I would like to understand why these errors are reported and how to solve or avoid them. In any case, I would appreciate any criticism or suggestions.
You are looking to determine the weight of the eggs based upon the age and weight of the fish, correct? I think you need to use:
glm(eggW ~ fishW + age, family=Gamma(link=identity))
instead of
glm(eggW ~ fishW * age, family=Gamma(link=identity))
Does your dataset have missing values?
Are your variables highly correlated?
Turn fishW * age into a separate column and just pass that to the algorithm.
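On the warnings themselves, a likely cause (an assumption on my part, but consistent with the warning text): with link=identity the linear predictor, and hence the fitted Gamma mean, can dip below zero during iteration, making the log(y/mu) term from the warning undefined; a log link keeps the mean positive, which matches the observation that the log-link fit runs cleanly. A toy sketch with made-up coefficients:

```python
import math

def gamma_dev_term(y, mu):
    """One unit-deviance contribution for the Gamma family,
    mirroring R's warning: log(ifelse(y == 0, 1, y/mu))."""
    if mu <= 0:
        return float("nan")   # identity link can produce mu <= 0 mid-iteration
    return math.log(1.0 if y == 0 else y / mu)

# Identity link: mu = b0 + b1*fishW can go negative for some fishW
b0, b1 = 500.0, -200.0        # hypothetical coefficients
mu_identity = b0 + b1 * 3.0   # -100: not a valid Gamma mean
print(gamma_dev_term(350.0, mu_identity))   # nan

# Log link: mu = exp(eta) is positive for any eta, so the term stays finite
mu_log = math.exp(b0 / 100 + (b1 / 100) * 3.0)
print(gamma_dev_term(350.0, mu_log))        # finite
```

With real data, supplying better start values (or using a log link and back-transforming for interpretation) is the usual way around this.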
I am estimating an ordered probit (for those who only know probit, I added a very short explanation in the Overleaf link below). However, my dependent variable is a percentage that has been categorised into eight percentage groups: I know, e.g., that category 1 means 0 percent, that category 2 means 0 < y < 5, etc. Consequently, I know all of the thresholds alpha and could use them in my likelihood function (cf. equation (2) in this Overleaf link). Does somebody know a command for this in R or Stata, or does such a command even exist?
I think you can do it like this with oprobit or oglm in Stata:
webuse nhanes2f
constraint 1 [cut1]_cons=-3
constraint 2 [cut2]_cons=-2
constraint 3 [cut3]_cons=-1
constraint 4 [cut4]_cons=0
oprobit health female black age c.age#c.age, constraint(1 2 3 4)
oglm health female black age c.age#c.age, link(probit) constraint(1 2 3 4)
Stata uses Pr(y=j|x)=Pr(cut_{j-1} < x'b + u <= cut_{j}), so this may not fit with your bins exactly. You might need to add/subtract c(mindouble) from the cut point to get what you want.
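If no canned command fits your bins, the likelihood with known thresholds is also easy to write down directly. A minimal sketch of the probability calculation (Python just for illustration; the coefficient vector b would then be maximized over, with the cutpoints held fixed):

```python
from statistics import NormalDist

Phi = NormalDist().cdf   # standard normal CDF

def category_probs(xb, cuts):
    """Ordered-probit category probabilities with KNOWN thresholds.
    cuts are the interior cutpoints alpha_1 < ... < alpha_{J-1};
    Pr(y = j | x) = Phi(alpha_j - x'b) - Phi(alpha_{j-1} - x'b),
    with alpha_0 = -inf and alpha_J = +inf."""
    bounds = [float("-inf")] + list(cuts) + [float("inf")]
    return [Phi(hi - xb) - Phi(lo - xb)
            for lo, hi in zip(bounds, bounds[1:])]

# Hypothetical example: 8 bins -> 7 known interior cutpoints
cuts = [-3, -2, -1, 0, 1, 2, 3]
p = category_probs(0.5, cuts)
print(sum(p))   # telescopes to 1: a proper probability distribution
```

The log-likelihood is then the sum of log category_probs(x_i'b, cuts)[y_i] over observations, which matches Stata's Pr(cut_{j-1} < x'b + u <= cut_j) parameterization.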
This is off-topic here, but I think the ordinal approach does not square with the observation that your "latent" variable has a limited range with an unknown scale.
I would try intreg as a robustness check. It does not deal with the range issue, but the scale issue is not a problem in that setting.
I am doing an lm() regression in R on stock quotations. I used exponential weights for the regression: the older the data, the less weight. My weights formula is alpha^(seq(685, 1, by = -1)) (the data length is 685). To find alpha, I tried every value between 0.9 and 1.1 with a step of 0.0001 and chose the alpha that minimizes the difference between the predicted and the actual values. This alpha equals 0.9992, so I would like to know whether it is statistically different from 1.
In other words I would like to know if the weights are different from 1. Is it possible to achieve that and if so, how could I do this ?
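For reference, a minimal sketch (with hypothetical data, and a coarser grid than the question's 0.0001 step) of the exponentially weighted fit and the grid search described above. There is no built-in test for alpha-hat = 1; one hedged option is to bootstrap the series, refit alpha on each resample, and check whether 1 falls inside the resulting interval:

```python
def wls_fit(x, y, w):
    """Closed-form weighted least squares for y = a + b*x."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ym - b * xm, b

def sse_for_alpha(x, y, alpha):
    """Fit with exponential weights (oldest obs downweighted most),
    then score by unweighted prediction error, as in the question."""
    n = len(x)
    w = [alpha ** (n - 1 - i) for i in range(n)]
    a, b = wls_fit(x, y, w)
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical near-linear series standing in for the stock data
x = [float(i) for i in range(20)]
y = [2.0 + 0.5 * xi + (0.1 if i % 2 else -0.1) for i, xi in enumerate(x)]

# Grid search over [0.9, 1.1], step 0.001 here for brevity
best = min((round(0.9 + k * 0.001, 4) for k in range(201)),
           key=lambda a: sse_for_alpha(x, y, a))
print(best)
```

A bootstrap interval for alpha would repeat the `min(...)` step on resampled (x, y) pairs (blockwise, given the time-series structure) and take the empirical quantiles of the resulting alphas.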
I don't really know whether this question should be asked on stats.stackexchange, but it involves R, so I hope it is not misplaced.
I can't find the type of problem I have, and I was wondering if someone knew what kind of statistics it involves. I'm not sure it's even something that can be optimized.
I'd like to optimize over three variables, or more precisely the combination of two of them. The first is a Likert-scale average, the second is the number of times that item was rated on that Likert scale, and the third is the item ID. The Likert scale is [1, 2, 3, 4].
So:
3.25, 200, item1 would mean that item1 was rated 200 times and got an average rating of 3.25.
I have a bunch of items and I'd like to find the high-value ones. For instance, an item at (4, 1) would suck because, while it has the highest possible rating, it was rated only once. And a (1, 1000) would also suck, for the inverse reason.
Is there a way to optimize with a simple heuristic? Someone told me to look into confidence bands but I am not sure how that would work. Thanks!
Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the variance (standard deviation) of your data and putting a threshold on your standard error, then translating that error into the minimum number of samples required to produce that bound with 95% confidence. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
For example, if your data has standard deviation 0.5 and you want the standard error of the mean to be at most 0.1, you need (0.5/0.1)^2 = 25 ratings; for a full 95% interval of half-width 0.1 you would need about (1.96 × 0.5/0.1)^2 ≈ 97.
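As a sketch of that arithmetic (note that a standard error of 0.1 corresponds to a 95% half-width of about 0.196, so a strict 95%-within-0.1 requirement needs more ratings):

```python
import math

def n_for_se(sd, target_se):
    """Ratings needed so the standard error of the mean, sd/sqrt(n),
    is at most target_se."""
    return math.ceil((sd / target_se) ** 2)

def n_for_ci_halfwidth(sd, halfwidth, z=1.96):
    """Ratings needed so the 95% CI half-width, z * sd/sqrt(n),
    is at most halfwidth."""
    return math.ceil((z * sd / halfwidth) ** 2)

print(n_for_se(0.5, 0.1))             # 25: SE of 0.1 with sd 0.5
print(n_for_ci_halfwidth(0.5, 0.1))   # 97: 95% interval within +/- 0.1
```

Either threshold can then serve as the minimum rating count below which an item's average is ignored (or shrunk toward the overall mean).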