Number of Likert scale points: impact on statistical inferences - R

I have a survey in which we are debating whether to use a 5- or 7-point Likert scale for questions around agreement (strongly agree to strongly disagree). The question is whether the 7-point scale would help or hinder the statistical inferences we could make from the data with a sample size of, say, 1,800.
One might assume a 7-point Likert scale would give you more variability, but at the cost of a wider confidence interval, especially when stratifying by demographic variables.
A back-of-the-envelope calculation of the confidence interval width, given responses randomly distributed along a 7-point scale and a 5-point scale with a sample size of 1,800, gives ~9% and ~6.5% respectively. Both seem high, but a ~9% CI width seems like a steep price for the added variability; I am interested in others' takes.
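For reference, here is one way to run that back-of-the-envelope calculation in R, assuming responses uniformly distributed over the scale points. The result depends heavily on how the width is normalized, so the figures quoted above may reflect a different convention:

ci_width <- function(k, n = 1800, level = 0.95) {
  v <- (k^2 - 1) / 12                 # variance of a uniform on 1..k
  z <- qnorm(1 - (1 - level) / 2)     # ~1.96 for a 95% interval
  2 * z * sqrt(v / n) / (k - 1)       # full CI width relative to scale range
}
ci_width(5)  # 5-point scale
ci_width(7)  # 7-point scale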

My thoughts:
The standard 5-category Likert scale is typical. If you need a sample size of 1,800 to get a CI width of ~6.5%, I'd go with the 5-point scale. That's a lot of people to get only a ~9% CI width, which lets you estimate little more than which decile your outcome variable falls in.

Related

Fit survival-like data or inverted S-curves

I am dealing with some data producing survival-like curves where, instead of time versus survival, I have the log concentration of a substrate against bacterial optical density (OD). The higher the concentration, the lower the OD, and the lethal concentration varies with bacterial strain. I am attaching a plot showing five bacterial strains to illustrate this. Each point in the graph represents three independent replicates, which I did not plot individually for clarity.
The questions are:
1- Can I use the survival and survminer libraries, which are devoted to survival studies?
2- If not, how can I fit such curves? An alternative would be to focus on the part of the curves where OD starts decreasing, but I do not fancy that idea much. (One possible fitting sketch follows below.)
Any help would be hugely appreciated.
Best,
David
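For what it's worth, decreasing curves of this shape are often treated as dose-response fits rather than survival data in R. A minimal sketch using the drc package, where d, OD, conc, and strain are placeholder names for the data described above:

library(drc)
# Four-parameter log-logistic fit, one curve per bacterial strain;
# d has columns OD (optical density), conc (substrate concentration), strain
fit <- drm(OD ~ conc, curveid = strain, data = d, fct = LL.4())
plot(fit)    # fitted inverted S-curves on a log-concentration axis
ED(fit, 50)  # concentration producing 50% of the OD drop, per strain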

R - calculate confidence interval of grouped data

Suppose you have a dataset named data as follows:
Gender   Pneumonia_Incidence   lower_CI   upper_CI
Male                   38000      30000      44000
Female                 34000      32000      38000
I would now like to calculate the total pneumonia incidence, which can be done easily:
sum(data$Pneumonia_Incidence)
But how can I calculate the lower and upper CI for this estimate? Is it valid to just sum the lower_CI and upper_CI values, or would I need to do something else?
How can I calculate lower and upper CI for this estimate?
You cannot with the information you have provided. You would need to know the variance of each estimated incidence. These variances must be known, since someone calculated the confidence intervals. With them you could obtain a pooled estimate of the total variance and then calculate the overall confidence interval.
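For illustration only: if each interval were a symmetric 95% Wald interval (estimate ± 1.96 × SE) and the two estimates were independent, the pooling would look like the sketch below. Note the Male interval above is not symmetric, so these assumptions already fail here, which rather underlines the advice that follows.

# Sketch: back out each SE from a symmetric 95% Wald interval, then pool.
# Assumes independent groups and symmetric intervals; verify before use.
se <- (data$upper_CI - data$lower_CI) / (2 * 1.96)
total    <- sum(data$Pneumonia_Incidence)
total_se <- sqrt(sum(se^2))          # variances add under independence
c(lower = total - 1.96 * total_se,
  total = total,
  upper = total + 1.96 * total_se)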
If this is important to you, I strongly suggest you consult a qualified statistician, or at least a reputable textbook. This is not simple high-school math. There may be other issues involved, such as sampling weights. I certainly wouldn't seek statistical advice here, and probably not even at that other place often mentioned. What if your boss asked how you calculated the confidence interval? Would you say that you consulted the internet?
Is it valid to just sum lower_CI and upper_CI values ...
No. Variances involve sample sizes. Consider two groups, one with a very large sample size and one with a very small one. The group with the large sample will have a narrower confidence interval than the group with the small sample. If you just added the two intervals, you would end up with an overall interval weighted equally by both groups, which intuitively isn't correct: it would be a biased estimate.
... or would I need to do something else?
Consult a statistician. :)

R question: How to compute area under the curve with respect to increase?

I'm looking to illustrate an effect using area under the curve (AUC) in R. Specifically, I need to illustrate the AUC with respect to increase from baseline over 5 days.
I'm looking at biological data where the outcome is not coded as 0/1 but is continuous over time. I'm interested in creating the graph described in Fekedulegn et al., 2007 (https://www.ncbi.nlm.nih.gov/pubmed/17766693) with my own data.
I've seen several wonderful packages for AUC so far, but none that computes the AUC with respect to increase from baseline, with the exception of one answer, described here: Incremental Area Under the Curve (iAUC) in R. That didn't quite work for my problem, though.
The variables are as follows in a long-form dataset:
Baseline: total # of predictive biomarker produced when assessed at baseline
Panel_value: total # of predictive biomarker produced over 5 days (continuous)
Reap: frequency of cognitive reappraisal (continuous); I needed to bin this variable
Reap.quantiles: reappraisal variable mutated into quantile bins, to plot values for each quantile
Reap.hilo: reappraisal variable mutated to plot values for those in the top and bottom quantiles of cognitive reappraisal frequency
Day: day of infection
I would greatly appreciate it if anyone can provide insight on any packages to compute AUC with respect to increase from baseline or recommend any alternative methods - bonus points if I can do it in ggplot as plotted here! Thanks in advance!
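In case it helps, here is a minimal sketch of the AUC-with-respect-to-increase (AUCi) idea from Fekedulegn et al. (2007): compute the total trapezoidal AUC and subtract the rectangle defined by the baseline value. The subject identifier id and the data frame dat are hypothetical; Day, Panel_value, and Baseline are the variables described above.

library(dplyr)

# Trapezoid-rule AUC minus the area under the baseline level (AUCi);
# assumes one row per time point, sorted by Day within subject
auci <- function(day, value, baseline) {
  aucg <- sum(diff(day) * (head(value, -1) + tail(value, -1)) / 2)
  aucg - baseline * (max(day) - min(day))
}

dat %>%
  arrange(id, Day) %>%
  group_by(id) %>%
  summarise(AUCi = auci(Day, Panel_value, first(Baseline)))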

Simulating data using existing data and probability

I have measured multiple attributes (height, species, crown width, condition, etc.) for about 1,500 trees in a city. Using remote sensing techniques I also have the heights of the rest of the city's 9,000 trees. I want to simulate/generate/estimate the missing attributes for these unmeasured trees using their heights.
From the measured data I can obtain the proportion of each species in the measured population (and thus a rough probability), height distributions for each species, height-crown width relationships per species, the species-condition relationship, and so on. I want to use the height data for the unmeasured trees to first estimate the species and then estimate the rest of the attributes using probability theory. So a tree of, say, 25 m is more likely to be a cedar (height range 5-30 m) than a mulberry (height range 2-8 m), and more likely to be a cedar (50% of the population) than an oak (same height range but 2% of the population), and hence will get a crown width of 10 m and a health condition of 95% (based on the distributions for cedar trees in my measured data). But I also expect some of the other 25 m trees to be assigned oak, just less frequently than cedar, in line with their proportions in the population.
Is there a way to do this using probability theory in R, preferably utilising Bayesian or machine learning methods?
I'm not asking for someone to write the code for me - I am fairly experienced with R. I just want to be pointed in the right direction, i.e. a package that does this kind of thing neatly.
Thanks!
Because you want to predict a categorical variable, i.e. the species, you should consider classification trees, available in the R packages rpart and randomForest. These models excel when you have a discrete number of categories and need to slot your observations into them, so I think those packages would work in your application. As a comparison, you can also look at multinomial regression (mnlogit, nnet, maxent), which can likewise predict categorical outcomes; unfortunately multinomial regression can get unwieldy with large numbers of outcomes and/or large datasets.
If you then want to predict the individual values for individual trees, first fit a regression on the measured trees using all of your measured variables, including species. Then predict out-of-sample for the unmeasured trees, using the predicted species labels as predictors for the unmeasured variable of interest, say crown width. The regression will predict the average value for that species dummy variable, plus some error, while incorporating any other information you have on the out-of-sample tree.
If you want to use a Bayesian method, consider a hierarchical regression to model these out-of-sample predictions. Hierarchical models sometimes predict better because they tend to be fairly conservative. Look at the rstanarm package for some examples.
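A minimal sketch of the classification step with randomForest, where measured and unmeasured are hypothetical data frames with the columns described in the question (species stored as a factor):

library(randomForest)

# Classify species from height using the measured trees
fit_sp <- randomForest(species ~ height, data = measured)

# Stochastic assignment: sample each unmeasured tree's species from the
# predicted class probabilities, so a 25 m tree is usually cedar but
# occasionally oak, mirroring the proportions in the measured data
p <- predict(fit_sp, newdata = unmeasured, type = "prob")
unmeasured$species <- factor(
  apply(p, 1, function(pr) sample(colnames(p), 1, prob = pr)),
  levels = levels(measured$species)
)

# Then predict remaining attributes from height plus the assigned species
fit_cw <- randomForest(crown_width ~ height + species, data = measured)
unmeasured$crown_width <- predict(fit_cw, newdata = unmeasured)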
I suggest looking at Bayesian networks with table CPDs over your random variables. This is a generative model that can handle missing data and do inference over causal relationships between variables. The network structure can be specified by hand or learned from data by an algorithm.
R has several implementations of Bayesian networks, bnlearn being one of them: http://www.bnlearn.com/
Please see a tutorial on how to use it here: https://www.r-bloggers.com/bayesian-network-in-r-introduction/
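A minimal bnlearn sketch under the same setup, with the continuous attributes discretized so that all CPDs are tables (measured is again a hypothetical data frame):

library(bnlearn)

# Discretize the continuous attributes so every CPD is a table
disc <- discretize(measured[, c("height", "crown_width", "condition")],
                   method = "quantile", breaks = 4)
disc$species <- measured$species   # species is already a factor

dag <- hc(disc)           # learn the structure by hill climbing
fit <- bn.fit(dag, disc)  # fit the table CPDs

# Sample plausible species for trees whose height falls in the top bin
cpdist(fit, nodes = "species",
       evidence = (height == levels(disc$height)[4]))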
For each species, the distribution of the other variables (height, width, condition) is probably a fairly simple bump. You can probably model the height and width as a joint Gaussian distribution; dunno about condition. Anyway with a joint distribution for variables other than species, you can construct a mixture distribution of all those per-species bumps, with mixing weights equal to the proportion of each species in the available data. Given the height, you can find the conditional distribution of the other variables conditional on height (and it will also be a mixture distribution). Given the conditional mixture, you can sample from it as usual: pick a bump with frequency equal to its mixing weight, and then sample from the selected bump.
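A base-R sketch of that mixture idea, with measured again a hypothetical data frame holding species, height, and crown_width, and each species' (height, width) pair modeled as bivariate Gaussian:

# Given a height h: reweight the species proportions by each species'
# likelihood of producing h, sample a species, then draw crown width
# from the bivariate-normal conditional width | height for that species
sim_tree <- function(h, measured) {
  sp    <- split(measured, measured$species)
  prior <- sapply(sp, nrow) / nrow(measured)   # mixing weights
  lik   <- sapply(sp, function(d) dnorm(h, mean(d$height), sd(d$height)))
  post  <- prior * lik / sum(prior * lik)      # weights conditional on h
  s <- sample(names(sp), 1, prob = post)       # pick a "bump"
  d <- sp[[s]]
  r  <- cor(d$height, d$crown_width)
  mu <- mean(d$crown_width) +
    r * sd(d$crown_width) / sd(d$height) * (h - mean(d$height))
  list(species = s,
       crown_width = rnorm(1, mu, sd(d$crown_width) * sqrt(1 - r^2)))
}
sim_tree(25, measured)  # usually cedar, occasionally oak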
Sounds like a good problem. Good luck and have fun.

visreg plot binomial distribution

I'm studying the effect of different predictors (dummy, categorical, and continuous variables) on the presence of birds, obtained from bird counts at sea. To do that I used the glmmadmb function with a binomial family.
I've plotted the relationship between the response variable and the predictors in order to assess the model fit and the marginal effect of each predictor. To draw the graphs I used the visreg function, specifying the transformation of the vertical axis:
visreg(modelo.bn7, type="conditional", scale="response", ylab= "Bird Presence")
The output graphs showed very wide confidence bands when I used the original scale of the response variable (they covered the whole vertical axis). For the graphs without the transformation, the confidence bands were narrower, but they had the same extent across the different levels of the dummy variables. Does anyone know how the confidence bands are calculated for binomial models? Could this reflect a problem with the estimated coefficients or the model fit?
The confidence bands are Wald-type intervals computed from the standard errors of the fitted values on the link (logit) scale; with scale="response" they are back-transformed through the inverse link, which is why they can stretch across the whole vertical axis. For a detailed explanation you can ask on stats.stackexchange.com. If the bands are very wide (and the interpretation of 'wide' is subjective, depending mostly on your goal), it suggests that your estimates may not be very precise. Wide bands are usually due to a small or insufficient number of observations used to build the model; if the number of observations is large, they can indicate a poor fit.
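A small illustration of why the response-scale bands can span nearly the whole axis: even a moderate interval on the logit (link) scale maps to almost (0, 1) after back-transformation. The values below are hypothetical:

eta <- 0.5   # fitted value on the logit scale
se  <- 2     # its standard error
plogis(eta + c(-1.96, 1.96) * se)   # about 0.03 to 0.99 on the response scale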
