Calculating visit-to-visit variability in R

I am computing visit-to-visit variability. This statistic is traditionally computed from blood pressure measurements: patients are divided into quantiles of the standard deviation of a blood pressure measure (e.g. mean arterial pressure [MAP]) across their visits. Given that a standard deviation is a single statistic, I need to understand how to compute its quartiles in order to calculate visit-to-visit variability (https://www.jhltonline.org/article/S1053-2498(18)31737-6/pdf).
I am trying out the code with a predictor like mpg from the mtcars dataset.
I can compute its standard deviation with the sd function, but I am not sure how to compute quartiles of the SD in R. I would appreciate any help with this issue.
summary(sd(mtcars$mpg)) # returns the same value for each quantile. I would like to see different values for the quantiles.
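A possible reading of the problem, sketched on mtcars purely for illustration (with cyl standing in for a patient ID and mpg standing in for MAP): sd() of one vector is a single number, so its quartiles are all identical; to get quartiles you first need one SD per patient and can then take quantiles across those SDs.
# one SD per "patient" (here: per cyl group, only to illustrate the mechanics)
sds <- tapply(mtcars$mpg, mtcars$cyl, sd)
# quartile cut points of the per-group SDs
quantile(sds, probs = c(0.25, 0.5, 0.75))
# assign each group to a quartile of SD, as done for visit-to-visit variability
cut(sds, breaks = quantile(sds, probs = seq(0, 1, 0.25)), include.lowest = TRUE)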

Related

Is there a method in R to find confidence interval of a proportion by non-parametric percentile bootstrap method

I have a table that was generated in a publication, and I am trying to do a similar analysis for another study. Please help me find the CI by the non-parametric percentile bootstrap method.
Here is the table; I want to know how the CI for 1 − RR is calculated.
Here is the statistical explanation for it:
"We used the Kaplan-Meier estimator [18] to construct cumulative incidence curves and to estimate the risk for each outcome. The risks were compared via ratios and differences. We estimated the risk ratio for each outcome using only matched pairs in which both individuals were still at risk 7 days after receipt of the third vaccine dose in those vaccinated. We analysed outcomes in the full population and in subgroups defined by strata of age, sex, and number of comorbidities. 95% CIs were calculated using the nonparametric percentile bootstrap method with 1000 repetitions. The effectiveness of the third dose was estimated as 1 − risk ratio. As a sensitivity analysis, vaccine effectiveness was also estimated as 1 − incidence rate ratio derived from a Poisson regression using the same dataset, with no further adjustment. Analyses were done using R software (version 4.0.4)."
How can I calculate this CI?
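A hedged sketch of the percentile bootstrap for 1 − RR, assuming individual-level matched-pair data with simple event indicators (the data below are simulated only so the sketch runs; the paper itself uses Kaplan-Meier risks, which this simplification ignores): resample whole matched pairs with replacement, recompute 1 − RR on each resample, and take the 2.5th and 97.5th percentiles.
set.seed(1)
# hypothetical matched-pair data: event indicators for the vaccinated and
# unvaccinated member of each pair (simulated purely for illustration)
matched <- data.frame(event_vax   = rbinom(5000, 1, 0.01),
                      event_unvax = rbinom(5000, 1, 0.03))
boot_ve <- replicate(1000, {
  idx <- sample(nrow(matched), replace = TRUE)   # resample whole matched pairs
  rr  <- mean(matched$event_vax[idx]) / mean(matched$event_unvax[idx])
  1 - rr                                         # effectiveness = 1 - risk ratio
})
quantile(boot_ve, c(0.025, 0.975))               # percentile bootstrap 95% CI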

R - calculate confidence interval of grouped data

Suppose you have a dataset named data as follows:
Gender Pneumonia_Incidence lower_CI upper_CI
Male 38000 30000 44000
Female 34000 32000 38000
I would now like to calculate the total pneumonia incidence, which can be done easily:
sum(data$Pneumonia_Incidence)
But how can I calculate lower and upper CI for this estimate? Is it valid to just sum lower_CI and upper_CI values or would I need to do something else?
How can I calculate lower and upper CI for this estimate?
You cannot with the information you have provided. You would need to know the variance of each estimate, which must have been known to whoever calculated those confidence intervals. With the variances you could obtain a pooled estimate of the total variance and then calculate the overall confidence interval.
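Just to make that idea concrete, here is a rough sketch that only works if the reported intervals are symmetric, normal-based 95% CIs from independent samples; those are assumptions, not facts from the question (and the example intervals are visibly not symmetric, which is part of why consulting a statistician matters). Each standard error is backed out from the CI half-width, the variances are summed, and an approximate CI for the total is formed.
data <- data.frame(Gender = c("Male", "Female"),
                   Pneumonia_Incidence = c(38000, 34000),
                   lower_CI = c(30000, 32000),
                   upper_CI = c(44000, 38000))
se       <- (data$upper_CI - data$lower_CI) / (2 * 1.96)  # assumes normal 95% CIs
total    <- sum(data$Pneumonia_Incidence)
total_se <- sqrt(sum(se^2))                               # assumes independent groups
total + c(-1.96, 1.96) * total_se                         # approximate 95% CI for the sum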
If this is important to you, I strongly suggest you consult a qualified statistician, or at least a reputable text book. This is not simple high school math. There may be other issues such as sampling weights involved. I certainly wouldn't seek statistical advice here, and probably not even at that other place often mentioned. What if your boss asked you how you calculated the confidence interval? Would you say that you consulted the internet?
Is it valid to just sum lower_CI and upper_CI values ...
No. Variances involve sample sizes. Consider this. Imagine two groups, one with a very large sample size and one with a very small one. The group with the large sample size will have a narrower confidence interval than the group with the small sample size will. If you just added the two intervals, you would end up with an overall interval that was equally weighted by both groups, which intuitively doesn't seem correct. It's a biased estimate.
... or would I need to do something else?
Consult a statistician. :)

How to perform a bootstrapped paired t-test in R?

I would like to perform a bootstrapped paired t-test in R. I have tried this for multiple datasets that returned p < .05 when using a parametric paired t-test; however, when I run the bootstrap I get p-values between 0.4 and 0.5. Am I running this incorrectly?
differences <- groupA - groupB
t.test(differences) # to get the t-statistic, e.g. 1.96
Repnumber <- 10000
tstat.values <- numeric(Repnumber)
for (i in 1:Repnumber) {
  group1 <- sample(differences, size = length(differences), replace = TRUE)
  tstat.values[i] <- t.test(group1)$statistic
}
# To get the bootstrap p-value, compare the number of tstat.values greater
# (or lesser) than or equal to the original t-statistic, divided by the number of reps:
sum(tstat.values <= -1.96) / Repnumber
Thank you!
It looks like you're comparing apples and oranges. For the single t-test of differences you're getting a t-statistic, which, if greater than a critical value, indicates whether the difference between groupA and groupB is significantly different from zero. Your bootstrapping code does the same thing, but for 10,000 bootstrapped samples of differences, giving you an estimate of the variation in the t-statistic over different random samples from the population of differences. If you take the mean of these bootstrapped t-statistics (mean(tstat.values)) you'll see it's about the same as the single t-statistic from the full sample of differences.
sum(tstat.values <= -1.96)/Repnumber gives you the proportion of bootstrapped t-statistics less than -1.96. This is an estimate of how often you would get a t-statistic less than -1.96 in repeated random samples from your population. I think this is essentially an estimate of the power of your test to detect a difference of a given size between groupA and groupB for a given sample size and significance level, though I'm not sure how robust such a power analysis is.
In terms of properly bootstrapping the t-test, I think what you actually need to do is some kind of permutation test that checks whether your actual data is an outlier when compared with repeatedly shuffling the labels on your data and doing a t-test on each shuffled dataset. You might want to ask a question on CrossValidated to get advice on how to do this properly for your data; several existing CrossValidated answers cover this.
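As a hedged illustration of the kind of resampling test described above (a sketch, not necessarily what the original poster's data requires): for paired data, shuffling the labels within each pair amounts to randomly flipping the signs of the differences, which simulates the null hypothesis of no mean difference. The p-value is then the proportion of sign-flipped t-statistics at least as extreme as the observed one. Variable names follow the question's code.
set.seed(1)
obs_t  <- t.test(differences)$statistic                 # observed t-statistic
perm_t <- replicate(10000, {
  flipped <- differences * sample(c(-1, 1), length(differences), replace = TRUE)
  t.test(flipped)$statistic                             # t-statistic under the null
})
mean(abs(perm_t) >= abs(obs_t))                         # two-sided permutation p-value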

calculating aggregate confidence intervals for forecasts

I'm using ARIMA models to estimate a sales forecast for a company. The company's sales are broken down into 4 channels, and I'm running 4 different models to estimate the sales for each channel. Eventually, I'm going to aggregate the sales of these channels to find the total forecasted sales for the whole company. My question is, how should I go about finding the confidence interval for the overall forecast? Adding up the confidence intervals of each channel is not correct, since that will give me a very large interval.
I'd really appreciate if anyone can give me some idea on how to approach this sort of issue. Thanks in advance!
The confidence interval is based on the variance. For a 95% interval, it is the forecast plus/minus 1.96*sqrt(variance). So you need to estimate the variance of the sum. It is not equal to the sum of the variances unless the four channels are uncorrelated. You will need to compute the correlations between the four channels, and then use the equation for the variance of a sum (check any intro stats book).
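A rough sketch of that idea for a one-step-ahead forecast, assuming fit1 through fit4 are the four fitted ARIMA models and fc1 through fc4 are their one-step point forecasts (all names hypothetical), with the covariances between channels estimated from the in-sample residuals:
# covariance matrix of the four channels' errors, estimated from model residuals
resid_mat <- cbind(residuals(fit1), residuals(fit2), residuals(fit3), residuals(fit4))
Sigma     <- cov(resid_mat, use = "complete.obs")
# variance of a sum = sum of all variances plus all pairwise covariances
var_total <- sum(Sigma)
total_fc  <- fc1 + fc2 + fc3 + fc4                     # sum of the point forecasts
total_fc + c(-1.96, 1.96) * sqrt(var_total)            # approximate 95% interval
For horizons beyond one step, each channel's forecast-error variance grows with the horizon, so the same calculation has to be repeated at every horizon.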

How to set a weighted least-squares in r for heteroscedastic data?

I'm running a regression on census data where my dependent variable is life expectancy and I have eight independent variables. The data is aggregated by cities, so I have many thousands of observations.
My model is somewhat heteroscedastic, though. I want to run a weighted least-squares regression where each observation is weighted by the city's population. In this case, it would mean that I want to weight the observations by the inverse of the square root of the population. It's unclear to me, however, what the best syntax would be. Currently, I have:
Model=lm(…,weights=(1/population))
Is that correct? Or should it be:
Model=lm(…,weights=(1/sqrt(population)))
(I found this question here: Weighted Least Squares - R but it does not clarify how R interprets the weights argument.)
From ?lm: "weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used." R doesn't do any further interpretation of the weights argument.
So, if what you want to minimize is the sum of (the squared distance from each point to the fit line * 1/sqrt(population)), then you want ...weights=(1/sqrt(population)). If you want to minimize the sum of (the squared distance from each point to the fit line * 1/population), then you want ...weights=1/population.
As to which of those is most appropriate... that's a question for CrossValidated!
To answer your question, Lucas, I think you want weights=(1/population). R parameterizes the weights as inversely proportional to the variances, so specifying the weights this way amounts to assuming that the variance of the error term is proportional to the population of the city, which is a common assumption in this setting.
But check the assumption! If the variance of the error term is indeed proportional to the population size, then if you divide each residual by the square root of its corresponding sample size, the residuals should have constant variance. Remember, dividing a random variable by a constant results in the variance being divided by the square of that constant.
Here's how you can check this: Obtain residuals from the regression by
residuals = lm(..., weights = 1/population)$residuals
Then divide the residuals by the square roots of the corresponding population sizes:
standardized_residuals = residuals/sqrt(population)
Then compare the sample variance of the standardized residuals corresponding to the bottom half of population sizes:
variance1 = var(standardized_residuals[population < median(population)])
to the sample variance of the standardized residuals corresponding to the upper half of population sizes:
variance2 = var(standardized_residuals[population > median(population)])
If these two numbers, variance1 and variance2, are similar, then you're doing something right. If they are drastically different, then your assumption may be violated.
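For reference, here is a self-contained version of that check on simulated data (everything below is made up purely to show the mechanics; in practice you would use your own census variables in place of the simulated ones):
set.seed(1)
n          <- 1000
population <- round(runif(n, 1e3, 1e6))
x          <- rnorm(n)
# simulate an error whose variance is proportional to population (the assumption above)
y          <- 70 + 2 * x + rnorm(n, sd = sqrt(population))
fit        <- lm(y ~ x, weights = 1/population)
standardized_residuals <- residuals(fit) / sqrt(population)
var(standardized_residuals[population < median(population)])
var(standardized_residuals[population > median(population)])
# the two variances should be of similar magnitude when the assumption holds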
