I'm studying the effect of different predictors (dummy, categorical and continuous variables) on the presence of birds, obtained from bird counts at sea. To do this I used the glmmadmb() function with a binomial family.
I've plotted the relationship between the response variable and the predictors in order to assess the model fit and the marginal effect of each predictor. To draw the graphs I used the visreg() function, specifying the transformation of the vertical axis:
visreg(modelo.bn7, type = "conditional", scale = "response", ylab = "Bird Presence")
The output graphs showed very wide confidence bands when I used the original scale of the response variable (they covered the whole vertical axis). For the graphs without the transformation, the confidence bands were narrower, but they had the same width across the different levels of the dummy variables. Does anyone know how the confidence bands are calculated for a binomial distribution? Could this reflect a problem with the estimated coefficients or with the model fit?
The confidence bands are based on the standard errors of the estimates for the binomial model (for a detailed explanation you can ask on stats.stackexchange.com). If the bands are very wide (and the interpretation of 'wide' is subjective and depends mostly on your goal), it means that your estimates may not be very precise. Wide bands are usually due to a small or insufficient number of observations used for building the model. If the number of observations is large, then it does indicate a poor fit.
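As a hand-rolled illustration (not the actual visreg internals, just the usual Wald construction) of why the response-scale bands can cover the whole axis: the interval is built on the link (logit) scale as estimate ± 1.96 × SE and then pushed through the inverse link, so a large standard error maps to an interval spanning almost all of [0, 1]:
eta <- 0.5                         # fitted value on the logit (link) scale
se  <- 2.0                         # its standard error (illustrative value)
ci_link <- eta + c(-1.96, 1.96) * se
plogis(ci_link)                    # back-transformed: roughly 0.03 to 0.99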
I have made a GAM model using the "mgcv" package with family = inverse.gaussian(link = identity) and I am really happy with the fit. After plotting the smooth terms using gratia::draw(GAM, residuals = TRUE) I am really confused by the y-axis. What does "effect" mean?
Any help would be much appreciated!
Thank you
Technically this should read "Partial effect" (and I'll be fixing this shortly). This is the smooth effect of the covariate on the response conditional upon the other estimated terms.
Most smooths in {mgcv} are subject to a sum-to-zero identifiability constraint (so we can include an intercept in the model, which is especially useful when we have factor parametric terms in the model also), so they are centred about 0. The 0 line then means the overall mean (on the link scale) of the response (or the reference levels if factor parametric terms are involved in the model); negative values on the axis indicate where the effect of the covariate reduces the response below the average value, and positive values on the axis indicate those covariate values where the response is increased above the average. All conditional upon the other estimated model terms.
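A minimal sketch of what that centred partial effect looks like in practice, using simulated data rather than the poster's model:
library(mgcv)
library(gratia)

set.seed(1)
dat <- gamSim(1, n = 200, verbose = FALSE)           # example data generator from mgcv
m   <- gam(y ~ s(x2), data = dat, method = "REML")

draw(m, residuals = TRUE)   # y-axis is the centred (sum-to-zero) partial effect of s(x2)
coef(m)[1]                  # the overall mean of the response (link scale) sits in the intercept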
This plot, which I previously created, shows predicted probabilities of claim onset based on two variables: PIB (scaled across the x-axis) and W, shown at its 75th and 25th percentiles. Confidence intervals for the predictions are presented alongside the two lines.
[Plot: Probability of Claim Onset]
As I theorize that W and PIB have an interactive effect on claim onset, I'd like to see if there is any significance in the marginal effect of W on PIB. Confidence intervals of the predicted probabilities alone cannot confirm that this effect is insignificant, per my reading here (https://www.sociologicalscience.com/download/vol-6/february/SocSci_v6_81to117.pdf).
I know that you can calculate the marginal effect easily from predicted probabilities by subtracting one from the other. Yet I don't understand how to get the confidence intervals for the marginal effect -- which are obviously needed to determine when and where my two sets of probabilities are indeed significantly different from one another.
The function that I used for calculating predicted probabilities of the zeroinfl() model object and the confidence intervals of those predicted probabilities is derived from an online posting (https://stat.ethz.ch/pipermail/r-help/2008-December/182806.html). I'm happy to provide more code if needed, but as this is not a question about an error, I am not sure it is needed.
So, I'm not entirely sure this is the correct answer, but to anyone who might come across the same problem I did:
Assuming that the two prediction lines have the same variance, you can pool the standard errors before calculating the interval around their difference. See the Wikipedia article on pooled variance to confirm.
# pooled SE of the difference, assuming equal variance and the same simulation_n for both
# sets of predictions (so pred_1$SE and pred_2$SE are interchangeable here);
# sd = SE * sqrt(n), and SE of the difference = pooled sd * sqrt(1/n1 + 1/n2)
SEpooled <- (pred_1$SE * sqrt(simulation_n)) * sqrt((1/simulation_n) + (1/simulation_n))
low_conf <- (pred_1$PP - pred_2$PP) - (1.96*SEpooled)
high_conf <- (pred_1$PP - pred_2$PP) + (1.96*SEpooled)
##Add this to the plot
lines(pred_1$x_val, low_conf, lty=2)
lines(pred_1$x_val, high_conf, lty=2)
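As an aside (my own note, not part of the original answer): if the two sets of predictions can be treated as independent, the standard error of their difference can also be taken directly from the two standard errors, which avoids the equal-variance assumption:
SE_diff   <- sqrt(pred_1$SE^2 + pred_2$SE^2)   # SE of a difference of independent estimates
low_conf  <- (pred_1$PP - pred_2$PP) - 1.96 * SE_diff
high_conf <- (pred_1$PP - pred_2$PP) + 1.96 * SE_diff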
I have constructed a mixed effect model using lmer() with the aim of comparing the growth in reading scores for four different groups of children as they age.
I would like to plot a graph of the 4 different slopes with confidence intervals in R in order to visualize this relationship but I keep getting stuck.
I have tried to use the plot() function and some versions of ggplot, as I have done for previous lm() models, but it isn't working so far. Here is my attempted model, which I hope captures how the change in reading scores over time (age) interacts with a child's SESDLD grouping (this indicates whether a child has a language problem and whether or not they are high or low income).
AgeSES.model <- lmer(ReadingMeasure ~ Age.c*SESDLD1 + (1|childid), data = reshapedomit, REML = FALSE)
ReadingMeasure is a continuous score, Age.c is centred age measured in months, and SESDLD1 is a categorical measure with 4 levels. I would expect four positive slopes of ReadingMeasure growth, with different intercepts and probably differing slopes.
I would really appreciate any pointers on how to do this!
Thank you so much!!
The type of plot I would like to achieve - this was done in Stata
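One possible approach (a sketch of my own, not from the original thread) is to compute group-wise predictions from the fitted model with the ggeffects package and plot them with their confidence bands:
library(lme4)
library(ggeffects)

AgeSES.model <- lmer(ReadingMeasure ~ Age.c * SESDLD1 + (1 | childid),
                     data = reshapedomit, REML = FALSE)

# predicted reading scores across age, one line per SESDLD1 group, with 95% CIs
preds <- ggpredict(AgeSES.model, terms = c("Age.c", "SESDLD1"))
plot(preds)   # returns a ggplot object, so further ggplot2 layers can be added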
I have used the lsmeans package in R to get the average estimate across all observations for my treatment factor (averaging over the levels of a block factor that was included in the experimental design as a fixed effect because it only had 3 levels). I have used a sqrt transformation for my response variable.
Thus I have used the following commands in R.
First, defining the model (the data frame name here is a placeholder):
model <- lm(sqrt(response) ~ treatment + block, data = mydata)  # a fitted model, since lsmeans() needs a model object rather than a bare formula
Then applying lsmeans:
model_lsmeans <- lsmeans(model, ~ treatment)
Then plotting this:
plot(model_lsmeans, ylab = "treatment", xlab = "response (with 95% CI)")
This gives a very nice graph with estimates and 95% confidence intervals for the different treatments.
The problem is just that this graph is for the transformed response.
How do I get this same plot with the back-transformed response (so the squared response)?
I have tried to create a new data frame and extract the lsmean, lower.CL, and upper.CL:
a <- summary(model_lsmeans)
New_dataframe <- as.data.frame(a[c("treatment", "lsmean", "lower.CL", "upper.CL")])
And then square these:
New_dataframe$lsmean<-New_dataframe$lsmean^2
New_dataframe$lower.CL<-New_dataframe$lower.CL^2
New_dataframe$upper.CL<-New_dataframe$upper.CL^2
New_dataframe
This gives me the estimates and CI boundaries squared that I need.
The problem is that I cannot make the same graph for these estimates and CIs as the one I made with lsmeans above.
How can I do this? The reason I ask is that I want graphs of a similar style throughout my article. Since I very much like this lsmeans plot, and it is very convenient to use for the non-transformed response variables, I would like to have all my graphs in this style.
Thank you very much for your help! Hope everything is clear!
Kind regards
Ditlev
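For anyone with the same problem: one way to get a very similar plot from the back-transformed values is to draw it by hand from New_dataframe, for example with ggplot2 (a sketch of my own, not part of the original post):
library(ggplot2)

ggplot(New_dataframe, aes(x = lsmean, y = treatment)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower.CL, xmax = upper.CL), height = 0.2) +
  labs(x = "response (with 95% CI)", y = "treatment")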
I have measured multiple attributes (height, species, crown width, condition etc) for about 1500 trees in a city. Using remote sensing techniques I also have the heights for the rest of the 9000 trees in the city. I want to simulate/generate/estimate the missing attributes for these unmeasured trees by using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships for each species, species-condition relationships, and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes using probability theory. So for a height of, say, 25 m, a tree is more likely to be a cedar (height range 5-30 m) than a mulberry tree (height range 2-8 m), and more likely to be a cedar (50% of the population) than an oak (same height range but 2% of the population), and hence would be assigned a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, based on the proportions in the population.
Is there a way to do this using probability theory in R preferably utilising Bayesian or machine learning methods?
I'm not asking for someone to write the code for me - I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider using a classification (decision) tree, a method available in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and need to slot your observations into those categories, so I think those packages would work for your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can also predict categorical outcomes; unfortunately, multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you then want to predict individual attribute values for the trees of each species, first fit a regression on the measured trees using all of your measured variables, including species. Then take the species labels you predicted and predict out-of-sample for the unmeasured trees, using those labels as predictors for the unmeasured variable of interest, say crown width, as in the sketch below. That way the regression will predict the average value for that species/dummy variable, plus some error, while incorporating any other information you have on that out-of-sample tree.
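A rough sketch of that two-step idea with randomForest (the data frame and column names are illustrative, not from the question):
library(randomForest)

# step 1: classify species from height using the measured trees
measured_trees$species <- factor(measured_trees$species)
sp_fit <- randomForest(species ~ height, data = measured_trees)
unmeasured_trees$species <- predict(sp_fit, newdata = unmeasured_trees)

# step 2: predict a continuous attribute (e.g. crown width) from height plus the
# (now predicted) species label
cw_fit <- randomForest(crown_width ~ height + species, data = measured_trees)
unmeasured_trees$crown_width <- predict(cw_fit, newdata = unmeasured_trees)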
If you want to use a Bayesian method, you could consider a hierarchical regression to model these out-of-sample predictions. Hierarchical models sometimes do better at prediction because they tend to be fairly conservative. Consider looking at the rstanarm package for some examples.
I suggest looking into Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over causal relationships between variables. The Bayesian network structure can be specified by hand or learned from the data by an algorithm.
R has several implementations of Bayesian networks, with bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
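A rough base-R sketch of that mixture-and-sample idea, using height alone to weight the species and per-species marginals for the other attributes (data frame and column names are illustrative, not from the question):
# per-species mixing weights and height/crown-width distributions from the measured trees
sp    <- levels(factor(measured_trees$species))
w     <- table(factor(measured_trees$species))[sp] / nrow(measured_trees)
mu_h  <- tapply(measured_trees$height,      measured_trees$species, mean)[sp]
sd_h  <- tapply(measured_trees$height,      measured_trees$species, sd)[sp]
mu_cw <- tapply(measured_trees$crown_width, measured_trees$species, mean)[sp]
sd_cw <- tapply(measured_trees$crown_width, measured_trees$species, sd)[sp]

simulate_tree <- function(h) {
  p  <- w * dnorm(h, mu_h, sd_h)        # P(species | height), up to a normalising constant
  p  <- p / sum(p)
  s  <- sample(sp, 1, prob = p)         # draw a species from its posterior given height
  cw <- rnorm(1, mu_cw[s], sd_cw[s])    # draw crown width from that species' marginal
  data.frame(height = h, species = s, crown_width = cw)
}

simulated <- do.call(rbind, lapply(unmeasured_trees$height, simulate_tree))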
Sounds like a good problem. Good luck and have fun.