What am I doing wrong in calculating ORs in R?

I am calculating odds ratios over several subsets of a population. Here is one subset:
"Normal" and "0" are the reference groups. "Dibirads" is the outcome and "BMIcat" are the levels of exposure. I will also attach the code I used to calculate beta coefficients and Odds Ratios
However, the results show as follows:
If you calculate them by hand, they don't match the R output. Is something wrong with my code? The beta coefficients still match up with the odds ratios, so something seems to be wrong there as well.
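Since the attached code does not show here, a minimal sketch of the usual approach (the data frame name `dat` is hypothetical; the variable and reference-level names are taken from the question):
# Set the reference levels explicitly, fit a logistic model, then exponentiate.
dat$BMIcat   <- relevel(factor(dat$BMIcat), ref = "Normal")
dat$Dibirads <- relevel(factor(dat$Dibirads), ref = "0")
fit <- glm(Dibirads ~ BMIcat, data = dat, family = binomial)
coef(fit)                   # beta coefficients (log-odds)
exp(coef(fit))              # odds ratios: OR = exp(beta)
exp(confint.default(fit))   # Wald 95% CIs for the ORs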

Related

Function to produce a single metric to compare the shape of two distributions (predictions vs actuals)

I am assessing the accuracy of a model that predicts count data.
My actual data has quite an unusual distribution - although I have a large amount of data, the shape is unlike any standard distribution (Poisson, normal, negative binomial, etc.).
As part of my assessment, I want a metric for how well the distribution of the predictions matches the distribution of the actual data. I've tried standard model performance metrics, such as MAE or RMSE, but they don't capture how well the predictions match the expected distribution.
My initial idea was to split the predictions into deciles, and calculate what proportion fall in each decile. This would be a very rough indication of the underlying distribution. I would then calculate the same for my 'actuals' and sum the absolute differences between the proportions.
This works to some extent, but feels a bit clunky, and the split into deciles feels arbitrary. Is there a function in R to produce a single metric for how well two distributions match?
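For what it's worth, a rough sketch of two options, assuming numeric vectors actuals and preds (names hypothetical): the two-sample Kolmogorov-Smirnov statistic as a single off-the-shelf metric, and the decile idea from the question written out generally.
# Single metric: largest gap between the two empirical CDFs (0 = identical).
# ks.test() warns about ties with count data but still returns the statistic.
ks.test(preds, actuals)$statistic
# The decile approach: proportions in bins defined by the actuals,
# then the sum of absolute differences across bins.
breaks <- unique(quantile(actuals, probs = seq(0, 1, by = 0.1), na.rm = TRUE))
p_act  <- table(cut(actuals, breaks, include.lowest = TRUE)) / length(actuals)
p_pred <- table(cut(preds,   breaks, include.lowest = TRUE)) / length(preds)
sum(abs(p_act - p_pred))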

Negative Binomial Regression Assumption Testing

First post!
I'm a biologist with limited background in applied statistics and R. Basically I know enough to be dangerous, so I'd appreciate it if someone could confirm/deny that I'm on the right path.
My datasets consists of count data (wildlife visits to water wells) as a response variable and multiple continuous predictor variables (environmental measurements).
First, I eliminated multicollinearity by dropping a few predictor variables. Second, I investigated the distribution of the response variable. Initially, it looked Poisson. However, a Poisson exact test came back significant, and the variance of the response variable was around 200 with a mean around 9, i.e. overdispersed. Because of this, I decided to move forward with Negative Binomial and Quasipoisson regressions. Both selected the same model, the residuals of which are approximately normally distributed. Further, a plot of residuals against predicted values appears unbiased and homoscedastic.
Questions:
1. Have I selected the correct regressions to model this data?
2. Are there additional assumptions of the NBR and QpR that I need to test? How should I/Where can I learn about how to do these?
3. Did I check for overdispersion correctly? Is there a difference in comparing the mean and variance vs comparing the conditional mean and variance of the response variable?
4. While the NBR and QpR selected the same model, is there a way to decide which is the "better" approach?
5. I would like to eventually publish. Are there more analyses I should perform on my selected model?
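A rough sketch of the workflow described above (the data frame wells, response visits, and predictor names are hypothetical):
library(MASS)                        # for glm.nb()
# Marginal overdispersion check: variance far above the mean.
mean(wells$visits); var(wells$visits)
# Conditional check: dispersion of a Poisson fit; values well above 1
# indicate overdispersion given the predictors.
fit_pois <- glm(visits ~ temp + dist, data = wells, family = poisson)
sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
# Negative binomial and quasi-Poisson fits of the same model.
fit_nb <- glm.nb(visits ~ temp + dist, data = wells)
fit_qp <- glm(visits ~ temp + dist, data = wells, family = quasipoisson)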

How to interpret a plot of a prcomp result?

My PCA result using the prcomp() function is summarised and plotted as follows. How should I interpret the plot? Some online articles say that the points represent the amount of variance attributed to the different principal components. However, the values do not seem to match any of the statistics, e.g. the standard deviation, the proportion of variance, or the cumulative proportion.
> summary(data_pca)
> plot(data_pca,type="lines")
I got the hint from @Roland and @Maurits. Here, the variance is exactly the square of the standard deviation.
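A small reproducible sketch with a built-in dataset (not the original data) shows what the plot is drawing:
data_pca <- prcomp(USArrests, scale. = TRUE)
summary(data_pca)               # sdev, proportion of variance, cumulative proportion
plot(data_pca, type = "lines")  # scree plot: y-axis is the variance of each PC
data_pca$sdev^2                 # the plotted heights: variance = sdev^2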

How to perform a bootstrapped paired t-test in R?

I would like to perform a bootstrapped paired t-test in R. I have tried this for multiple datasets that returned p < .05 when using a parametric paired t-test; however, when I run the bootstrap I get p-values between 0.4 and 0.5. Am I running this incorrectly?
differences <- groupA - groupB
t.test(differences)  # to get the observed t-statistic, e.g. 1.96
Repnumber <- 10000
tstat.values <- numeric(Repnumber)
for (i in 1:Repnumber) {
  group1 <- sample(differences, size = length(differences), replace = TRUE)
  tstat.values[i] <- t.test(group1)$statistic
}
# To get the bootstrap p-value, compare the number of tstat.values greater
# (or lesser) than or equal to the original t-statistic, divided by the
# number of reps:
sum(tstat.values <= -1.96) / Repnumber
Thank you!
It looks like you're comparing apples and oranges. For the single t-test of differences you're getting a t-statistic, which, if greater than a critical value indicates whether the difference between group1 and group2 is significantly different from zero. Your bootstrapping code does the same thing, but for 10,000 bootstrapped samples of differences, giving you an estimate of the variation in the t-statistic over different random samples from the population of differences. If you take the mean of these bootstrapped t-statistics (mean(tstat.values)) you'll see it's about the same as the single t-statistic from the full sample of differences.
sum(tstat.values<=-1.96)/Repnumber gives you the proportion of bootstrapped t-statistics less than or equal to -1.96. This is an estimate of the proportion of the time that you would get such a t-statistic in repeated random samples from your population. I think this is essentially an estimate of the power of your test to detect a difference of a given size between group1 and group2 for a given sample size and significance level, though I'm not sure how robust such a power analysis is.
In terms of properly bootstrapping the t-test, I think what you actually need to do is some kind of permutation test that checks whether your actual data is an outlier when compared with repeatedly shuffling the labels on your data and doing a t-test on each shuffled dataset. You might want to ask a question on CrossValidated, in order to get advice on how to do this properly for your data. These CrossValidated answers might help: here, here, and here.
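As a rough sketch (not necessarily the right test for your data), a sign-flipping permutation test on the paired differences would look something like this:
differences <- groupA - groupB
obs_t <- t.test(differences)$statistic
Repnumber <- 10000
perm_t <- numeric(Repnumber)
for (i in 1:Repnumber) {
  # Under the null of no difference, each paired difference is equally
  # likely to be positive or negative, so flip signs at random.
  signs <- sample(c(-1, 1), length(differences), replace = TRUE)
  perm_t[i] <- t.test(signs * differences)$statistic
}
mean(abs(perm_t) >= abs(obs_t))   # two-sided permutation p-value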

Meta analysis in R with adjusted ORs

I would like to calculate a summary odds ratio value for two or more papers where the only information I have is the individual odds ratios with their 95% confidence intervals. Is this possible? I have been poking around in the meta package, and only figured out how to do it with crude counts.
Thanks so much!
It is quite simple.
You just need to use the natural logarithm of the odds ratio (logOR) and its standard error (and corresponding variance). These can easily be back-calculated from the 95% confidence intervals according to the normal distribution. Finally, pool the logORs with their variances.
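As a rough sketch of that back-calculation (the odds ratios and confidence limits below are made-up illustration values):
or    <- c(1.80, 2.40)   # reported adjusted ORs, one per study
lower <- c(1.10, 1.30)   # lower 95% limits
upper <- c(2.95, 4.43)   # upper 95% limits
logOR    <- log(or)
se       <- (log(upper) - log(lower)) / (2 * qnorm(0.975))  # CI width on the log scale
variance <- se^2
mydata <- data.frame(logOR, variance)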
For instance, after you have built a data frame (eg called mydata) with logOR and variance for each study, you can easily proceed with a random effect meta-analysis with the metafor package in R as follows:
library(metafor)
res <- rma(logOR, variance, data = mydata, method = "DL")
forest(res)
In the future, you may consider posting similar questions in CrossValidated.
