I have survey data (sample pictured below) that I'm working with to find 95% confidence intervals. The Q#d columns (Q1d, Q2d, etc.) each correspond to a different question on the survey (Likert-scale results dichotomized: 1 = yes, 0 = no). The Intervention column indicates whether the response was collected before intervention (FALSE) or after intervention (TRUE). What I want is the 95% confidence interval on the difference in the proportion answering "yes" for each question before versus after intervention.
For example, let's say for Q1d the proportion that answered "yes" before intervention is .2 and after the intervention is .5. The difference would be .3, or 30%, and I want to calculate the confidence interval (say, between 25% and 35%) on that difference. I want to do this for every question in the survey (all Q#d columns). I have not been able to find a way to iterate through all the question columns. I've written a function that works for one column, but iterating through column names isn't working for me, and I don't know how to store the results as a vector/dataframe. I've included the function below. Any guidance?
Thanks so much!!
get_conf_int <- function(df, colName) {
  myenc <- enquo(colName)
  p <- df %>%
    group_by(Intervention) %>%
    summarize(success = sum(UQ(myenc) == 1, na.rm = TRUE), total = n())
  prop.test(x = pull(p, success), n = pull(p, total))$conf.int[2:1] * -100
}
And I can call the function like:
get_conf_int(db, Q1d)
I'm using prop.test to find confidence interval for now, but open to other methods as well.
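A minimal sketch of one way to loop the single-column function above over every question column, assuming dplyr and purrr are loaded and the survey data frame is named db; the regular expression used to pick the question columns and the output column names are illustrative choices, not anything from the original post:
library(dplyr)
library(purrr)

question_cols <- grep("^Q\\d+d$", names(db), value = TRUE)  # Q1d, Q2d, ...

ci_table <- map_dfr(question_cols, function(nm) {
  # pass the column through the .data pronoun so enquo() in get_conf_int()
  # still resolves it inside summarize()
  ci <- get_conf_int(db, .data[[nm]])
  data.frame(Question = nm, Lower = ci[1], Upper = ci[2])
})
ci_table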
I can't say whether prop.test is better than binom.test; you should read more about those two.
library(dplyr)
# just for this example, you have your survey here
df <- data.frame(Intervention = sample(x = c(TRUE, FALSE), size = 20, replace = TRUE),
                 Q1d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q2d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q3d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q4d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q5d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q6d = sample(x = 0:1, size = 20, replace = TRUE),
                 Q7d = sample(x = 0:1, size = 20, replace = TRUE))
# counts of FALSE (before) and TRUE (after) rows
count_Intervention <- c(sum(!df$Intervention), sum(df$Intervention))
# group by TRUE/FALSE and sum the 1's in every question column
df_sum <- df %>%
  group_by(Intervention) %>%
  summarize(across(colnames(df)[-1], list(sum)))
# store the results; the p-value is also included, since it might be important
new_df <- data.frame(Q_d = character(), lower = numeric(),
                     upper = numeric(), pvalue = numeric())
# loop over the question columns, running prop.test once per question
for (Q_d in colnames(df_sum)[-1]) {
  test   <- prop.test(as.vector(t(df_sum[, Q_d])), count_Intervention)
  lower  <- test$conf.int[1]
  upper  <- test$conf.int[2]
  pvalue <- test$p.value
  new_df <- rbind(new_df, data.frame(Q_d, lower, upper, pvalue))
}
new_df
Q_d lower upper pvalue
1 Q1d_1 -0.2067593 0.8661000 0.34844258
2 Q2d_1 -0.9193444 -0.1575787 0.05528499
3 Q3d_1 -0.4558861 0.5218202 1.00000000
4 Q4d_1 -0.4558861 0.5218202 1.00000000
5 Q5d_1 -0.7487377 0.3751114 0.74153726
6 Q6d_1 -0.2067593 0.8661000 0.34844258
7 Q7d_1 -0.4558861 0.5218202 1.00000000
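If you prefer not to grow a data frame inside the loop, the same table can be collected with lapply and bound once at the end; a minimal sketch using the objects defined above:
# same result as the loop above, collected with lapply and bound once
rows <- lapply(colnames(df_sum)[-1], function(Q_d) {
  test <- prop.test(as.vector(t(df_sum[, Q_d])), count_Intervention)
  data.frame(Q_d,
             lower  = test$conf.int[1],
             upper  = test$conf.int[2],
             pvalue = test$p.value)
})
new_df <- do.call(rbind, rows)
new_df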
I am doing a simulation of linear regression in R.
The regression model I consider is y_i = a + b_1 * x_1i + b_2 * x_2i + e_i.
The parameters are designed as follows:
x_1i ~ N(2,1), x_2i ~ Poisson(4), e_i ~ N(0, 1), theta = (a, b_1, b_2)
In the code below, I generate 100 independent random samples of (y, x_1, x_2) 1000 times using the distributions mentioned above, and I estimate theta_hat (the estimator of theta). After getting theta_hat, I would like to plot the distributions of the estimators of a (a_hat), b_1 (b_1_hat), and b_2 (b_2_hat), respectively.
## Construct 1000 samples of x_1 (100 observations each)
x_1_1000 <- as.data.frame(replicate(n = 1000, expr = rnorm(n = 100, mean = 2, sd = 1)))
colnames(x_1_1000) <- paste("x_1", 1:1000, sep = "_")
## 1000 samples of x_2
x_2_1000 <- as.data.frame(replicate(n = 1000, expr = rpois(n = 100, lambda = 4)))
colnames(x_2_1000) <- paste("x_2", 1:1000, sep = "_")
## 1000 samples of the error term
error_1000 <- as.data.frame(replicate(n = 1000, expr = rnorm(n = 100, mean = 0, sd = 1)))
colnames(error_1000) <- paste("e", 1:1000, sep = "_")
## y = 1 + 1*x_1 - 2*x_2 + e
y_1000 <- 1 + x_1_1000 * 1 + x_2_1000 * (-2) + error_1000
colnames(y_1000) <- paste("y", 1:1000, sep = "_")
######################################################################
lms <- lapply(1:1000, function(x) lm(y_1000[,x] ~ x_1_1000[,x] + x_2_1000[,x]))
theta_hat_1000 <- as.data.frame(sapply(lms, coef))
After running the regressions, I store the results in lms, which is a list. Because I only want the coefficients, I store them in "theta_hat_1000". However, when I want to plot the distributions of the estimates, I cannot get what I need. I have tried two ways to solve the problem but am still confused.
The first thing I tried was renaming the data frame "theta_hat_1000". I successfully renamed column_i, for i from 1 to 1000, but I cannot successfully rename the rows.
rownames(theta_hat_1000[1,]) <- "ahat"
rownames(theta_hat_1000[2,]) <- "x1hat"
rownames(theta_hat_1000[3,]) <- "x2hat"
The code above did not produce any error message, but it failed to change the row names. So I tried the following:
rownames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
This renamed the rows successfully. However, when I check whether anything is stored in the data frame, it returns NULL:
theta_hat_1000$ahat
NULL
So something is clearly wrong, and I tried a second approach.
I tried to unlist "theta_hat_1000", which is stored in my global environment, but I did not get what I want. The expected result is three rows of 1000 values each, whereas I actually got 3000 observations in a single column.
The ideal result is three columns of 1000 values each, stored in a data frame so that I can go on to, for example, use ggplot to show the distributions of the estimated coefficients.
I have been stuck on this for several hours. Any help or suggestions would be appreciated.
The line theta_hat_1000$ahat in your code does not work because "ahat" is a row name, not a column name, in the data frame. You would get the result by calling theta_hat_1000["ahat", ].
However, I understand that your desired result is actually a data frame with 3 columns (and 1000 rows) representing the 3 parameters of your regression model (intercept, x1, x2). The line as.data.frame(sapply(lms, coef)) in your code produces a data frame with 3 rows and 1000 columns. You can, for instance, transpose the matrix before converting it into a data frame to get 1000 rows and 3 columns.
theta_hat_1000 <- sapply(lms, coef)
theta_hat_1000 <- as.data.frame(t(theta_hat_1000))
colnames(theta_hat_1000) <- c("ahat", "x1hat", "x2hat")
head(theta_hat_1000)
ahat x1hat x2hat
1 2.0259326 0.7417404 -2.111874
2 0.7827929 0.9437324 -1.944320
3 1.1034906 1.0091594 -2.035405
4 0.9677150 0.8168757 -1.905367
5 1.0518646 0.9616123 -1.985357
6 0.8600449 1.0781489 -2.017061
Now you could also call the variables with theta_hat_1000$ahat.
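To get to the stated goal of plotting the distributions of the estimates, here is a minimal sketch assuming theta_hat_1000 has been reshaped as above and that tidyr and ggplot2 are available:
library(tidyr)
library(ggplot2)

# reshape to long format: one row per (parameter, estimate) pair
theta_long <- pivot_longer(theta_hat_1000,
                           cols = c("ahat", "x1hat", "x2hat"),
                           names_to = "parameter",
                           values_to = "estimate")

# one histogram per parameter
ggplot(theta_long, aes(x = estimate)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ parameter, scales = "free_x")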
I am attempting to follow the example from LSD.test on my data set. Unfortunately, my data set has unequal sample sizes. I have read that this can be handled with a weighted mean; does anyone have experience with this? Is there a way to calculate it outside of the function?
Here is my example code:
library(agricolae)
data <- data.frame(Treatment = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 11)),
                   Value = c(59.15, 48.90, 29.65, 32.60, 63.85, 53.85, 66.40, 55.05, 54.75, 39.95,
                             63.20, 57.40, 59.15, 54.10, 49.40, 78.70, 66.20, 90.75, 81.20, 52.25,
                             53.70, 51.10, 48.60, 50.15, 63.40, 56.15, 38.40, 66.45, 53.35, 45.30,
                             46.60, 53.20, 53.95, 44.55, 49.15, 42.65, 68.25, 67.60, 57.90, 47.85,
                             52.90))
cal <- lm(Value ~ Treatment, data = data)
model<-aov(cal)
out <- LSD.test(model,"Treatment", p.adj="bonferroni")
#stargraph
# Variation range: max and min
plot(out)
#endgraph
# Old version LSD.test()
df<-df.residual(model)
MSerror<-deviance(model)/df
out <- with(data,LSD.test(Value,Treatment,df,MSerror))
#stargraph
# Variation interquartil range: Q75 and Q25
plot(out,variation="IQR")
#endgraph
out<-LSD.test(model,"Treatment",p.adj="hommel",console=TRUE)
plot(out,variation="SD") # variation standard deviation
This still "works", but it does not list the "Minimum significant difference" as in the typical example:
library(agricolae)
data(sweetpotato)
model<-aov(yield~virus, data=sweetpotato)
out <- LSD.test(model,"virus", p.adj="bonferroni")
#stargraph
# Variation range: max and min
plot(out)
#endgraph
# Old version LSD.test()
df<-df.residual(model)
MSerror<-deviance(model)/df
out <- with(sweetpotato,LSD.test(yield,virus,df,MSerror))
#stargraph
# Variation interquartil range: Q75 and Q25
plot(out,variation="IQR")
#endgraph
out<-LSD.test(model,"virus",p.adj="hommel",console=TRUE)
plot(out,variation="SD") # variation standard deviation
Edit 1: I have also tried using unequal groups with HSD.test in agricolae, however this returns an error:
data(sweetpotato)
A<-sweetpotato[-c(4,5,7),]
modelUnbalanced <- aov(yield ~ virus, data=A)
outUn <-HSD.test(modelUnbalanced, "virus",group=FALSE, unbalanced = TRUE)
Error in HSD.test(modelUnbalanced, "virus", group = FALSE, unbalanced = TRUE) :
unused argument (unbalanced = TRUE)
Edit 2: I have now used the group=FALSE argument to get the individual pairwise comparisons:
difference pvalue signif. LCL UCL
A - B -14.820000 0.0301 * -28.664094 -0.9759055
A - C -2.245000 1.0000 -16.089094 11.5990945
A - D -3.557727 1.0000 -17.083524 9.9680696
B - C 12.575000 0.0943 . -1.269094 26.4190945
B - D 11.262273 0.1554 -2.263524 24.7880696
C - D -1.312727 1.0000 -14.838524 12.2130696
Now how would I find the difference that corresponds to p = 0.05? I have looked at a linear relationship between the three p-values that aren't equal to 1 and the square root of the difference. The relationship isn't perfect, but it is strong (R² = 0.98), likely an effect of the sample sizes.
Could I use this to predict what the least significant difference would be at p = 0.05? Or am I completely misguided?
Any help is greatly appreciated!
Cheers,
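For the "calculate this outside of the function" part, a pairwise least significant difference can be computed by hand for two groups with unequal sizes from the residual mean square. A minimal sketch using the model and data frame defined above, based on the standard formula LSD_ij = t(1 - alpha/2, df) * sqrt(MSerror * (1/n_i + 1/n_j)); worth checking against a statistics reference before relying on it:
# pairwise LSD for unequal group sizes, computed by hand
alpha    <- 0.05
df_res   <- df.residual(model)              # residual degrees of freedom
MSerror  <- deviance(model) / df_res        # mean square error
n_by_grp <- table(data$Treatment)           # unequal group sizes (10, 10, 10, 11)
t_crit   <- qt(1 - alpha / 2, df_res)

# LSD for one pair, e.g. comparing groups "A" and "D"
lsd_AD <- t_crit * sqrt(MSerror * (1 / n_by_grp["A"] + 1 / n_by_grp["D"]))
lsd_AD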
I constructed a binary logistic regression model. The response variable is binary. There are 4 regressors: 2 binary and 2 integer. I want to find the outliers and delete them. For this I have created some plots:
# hat.ep, rstudent.ep and id are not shown in the original post; presumably
# something along these lines:
hat.ep      <- hatvalues(model_logit)
rstudent.ep <- rstudent(model_logit)
id          <- seq_along(rstudent.ep)

par(mfrow = c(2, 2))
plot(hat.ep, rstudent.ep, col = "#E69F00", main = "hat-values versus studentized residuals",
     xlab = "Hat value", ylab = "Studentized residual")
dffits.ep <- dffits(model_logit)
plot(id, dffits.ep, type = "l", col = "#E69F00", main = "Index Plot",
     xlab = "Identification", ylab = "DFFITS")
cov.ep <- covratio(model_logit)
plot(id, cov.ep, type = "l", col = "#E69F00", main = "Covariance Ratio",
     xlab = "Identification", ylab = "Covariance Ratio")
cook.ep <- cooks.distance(model_logit)
plot(id, cook.ep, type = "l", col = "#E69F00", main = "Cook's Distance",
     xlab = "Identification", ylab = "Cook's Distance")
According to the plots there is an outlier. How can I identify which observation is the outlier?
I have tried :
> outlierTest(model_logit)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
1061 1.931043 0.053478 NA
Are there some other functions for outlier detection?
This answer comes quite late; I'm unsure whether you have already found a solution. In the absence of a minimal reproducible example, I'll attempt to answer the question using some dummy data and two custom functions. For a given continuous variable, a common rule treats observations lying more than 1.5*IQR beyond the quartiles as outliers, where the IQR (interquartile range) is the difference between the 75th and 25th percentiles; the functions below use a simpler rule based on a cutoff of 3 standard deviations instead. I also recommend you see this post, which contains far better solutions than my crude answer.
> df <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
> head(df)
X Y Z
1 NA 0.8651 0.2784
2 -0.06838 0.4700 2.0483
3 -0.18734 0.9887 1.8353
4 -0.05015 0.7731 2.4464
5 0.25010 0.9941 1.3979
6 -0.26664 0.6778 1.1277
> boxplot(df$Y) # notice the outliers above the top whisker
Now I'll create a custom function to detect the outliers; a second function will replace the outlier values with NA.
# this function will return the indices of the outlier values
> findOutlier <- function(data, cutoff = 3) {
  ## calculate the column means and standard deviations
  means <- colMeans(data, na.rm = TRUE)
  sds   <- apply(data, 2, sd, na.rm = TRUE)
  ## identify the cells lying more than cutoff * sd away from the column mean
  result <- mapply(function(d, m, s) {
    which(abs(d - m) > cutoff * s)
  }, data, means, sds)
  result
}
# check for outliers
> outliers <- findOutlier(df)
# custom function to remove outliers
> removeOutlier <- function(data, outliers) {
  result <- mapply(function(d, o) {
    res <- d
    res[o] <- NA
    return(res)
  }, data, outliers)
  return(as.data.frame(result))
}
> filterData<- removeOutlier(df, outliers)
> boxplot(filterData$Y)
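Since the original model is a regression, another way to pin down which observation is driving the diagnostic plots is to index directly on the influence measures. A minimal sketch assuming the fitted model is model_logit; the 4/n cutoff is only a common rule of thumb, not a hard threshold:
# flag observations with unusually large Cook's distance
cook.ep <- cooks.distance(model_logit)
n <- length(cook.ep)
influential <- which(cook.ep > 4 / n)          # rule-of-thumb cutoff
influential                                    # indices of flagged observations
head(sort(cook.ep, decreasing = TRUE))         # the largest values, with their indices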
I am interested in building an R function that I can use to test the limits of the Taylor series approximation. I am aware that there are limits to what I am doing, but it's exactly those limits I wish to investigate.
I have two normally distributed random variables x and y. x has a mean of 4 and a standard deviation (sd) of 1. y has a mean of 5 and an sd of 4.
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
I know how to estimate the mean of the ratio y/x, like this:
# E(y/x) = E(y)/E(x) - Cov(y,x)/E(x)^2 + Var(x)*E(y)/E(x)^3
me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3
[1] 1.328125
However, I am stuck on how to estimate the standard deviation of the ratio. I realize I have to use a Taylor expansion, but I don't know how to apply it.
Doing a simple simulation I get
x <- rnorm(10^4, mean = 4, sd = 1); y <- rnorm(10^4, mean = 5, sd = 4)
sd(y/x)
[1] 2.027593
mean(y/x)
[1] 1.362142
There is an analytical expression for the PDF of the ratio of two Gaussians, derived by David Hinkley (e.g. see Wikipedia). So we could compute all the moments, the mean, etc. I typed it up, and apparently the ratio clearly doesn't have a finite second moment, thus it doesn't have a finite standard deviation. Note that I've denoted your Y Gaussian as my X, and your X as my Y (the formulas assume X/Y). I get a mean value of the ratio pretty close to what you got from simulation, but the last integral is infinite, sorry. You could sample more and more values, but the sampled std. dev. keeps growing as well, as noted by @G.Grothendieck.
library(ggplot2)
m.x <- 5; s.x <- 4
m.y <- 4; s.y <- 1
# helper terms a(z), b(z), c and d(z) from Hinkley's formula
a <- function(x) {
  sqrt( (x/s.x)^2 + (1.0/s.y)^2 )
}
b <- function(x) {
  (m.x*x)/s.x^2 + m.y/s.y^2
}
c <- (m.x/s.x)^2 + (m.y/s.y)^2
d <- function(x) {
  u <- b(x)^2 - c*a(x)^2
  l <- 2.0*a(x)^2
  exp( u / l )
}
# PDF for the ratio of the two different gaussians
PDF <- function(x) {
  r <- b(x)/a(x)
  q <- pnorm(r) - pnorm(-r)
  (r*d(x)/a(x)^2) * (1.0/(sqrt(2.0*pi)*s.x*s.y)) * q + exp(-0.5*c)/(pi*s.x*s.y*a(x)^2)
}
# normalization
nn <- integrate(PDF, -Inf, Inf)
nn <- nn[["value"]]
# plot PDF
p <- ggplot(data = data.frame(x = 0), mapping = aes(x = x))
p <- p + stat_function(fun = function(x) PDF(x)/nn) + xlim(-2.0, 6.0)
print(p)
# first momentum
m1 <- integrate(function(x) x*PDF(x), -Inf, Inf)
m1 <- m1[["value"]]
# mean
print(m1/nn)
# some sampling
set.seed(32345)
n <- 10^7L
x <- rnorm(n, mean = m.x, sd = s.x); y <- rnorm(n, mean = m.y, sd = s.y)
print(mean(x/y))
print(sd(x/y))
# second momentum - Infinite!
m2 <- integrate(function(x) x*x*PDF(x), -Inf, Inf)
Thus, it is impossible to test any Taylor expansion for std.dev.
With the cautions suggested by @G.Grothendieck in mind: a useful mnemonic for products and quotients of independent X and Y variables is
CV^2(X/Y) = CV^2(X*Y) = CV^2(X) + CV^2(Y)
where CV is the coefficient of variation (sd(X)/mean(X)), so CV^2 is Var/mean^2. In other words
Var(Y/X)/(m(Y/X))^2 = Var(X)/m(X)^2 + Var(Y)/m(Y)^2
or rearranging
sd(Y/X) = sqrt[ Var(X)*m(Y/X)^2/m(X)^2 + Var(Y)*m(Y/X)^2/m(Y)^2 ]
For random variables with the mean well away from zero, this is a reasonable approximation.
set.seed(101)
y <- rnorm(1000,mean=5)
x <- rnorm(1000,mean=10)
myx <- mean(y/x)
sqrt(var(x)*myx^2/mean(x)^2 + var(y)*myx^2/mean(y)^2) ## 0.110412
sd(y/x) ## 0.1122373
Using your example does considerably worse because the CV of Y is close to 1: I initially thought it looked OK, but now I see that it's biased as well as not capturing the variability very well. (I'm also plugging in the expected values of the mean and SD rather than their simulated values, but for such a large sample that should be a minor part of the error.)
me.x <- 4; sd.x <- 1
me.y <- 5; sd.y <- 4
myx <- me.y/me.x - 0/me.x^2 + sd.x^2*me.y/me.x^3
x <- rnorm(1e4,me.x,sd.x); y <- rnorm(1e4,me.y,sd.y)
c(myx,mean(y/x))
sdyx <- sqrt(sd.x^2*myx^2/me.x^2 + sd.y^2*myx^2/me.y^2)
c(sdyx,sd(y/x))
## 1.113172 1.197855
rvals <- replicate(1000,
sd(rnorm(1e4,me.y,sd.y)/rnorm(1e4,me.x,sd.x)))
hist(log(rvals),col="gray",breaks=100)
abline(v=log(sdyx),col="red",lwd=2)
min(rvals) ## 1.182698
All the canned delta-method approaches to computing the variance of Y/X use the point estimate for Y/X (i.e. m(Y/X) = mY/mX), rather than the second-order approximation you used above. Constructing higher-order forms for both the mean and the variance should be straightforward if possibly tedious (a computer algebra system might help ...)
mvec <- c(x = me.x, y = me.y)
V <- diag(c(sd.x, sd.y)^2)
car::deltaMethod(mvec, "y/x", V)
## Estimate SE
## y/x 1.25 1.047691
library(emdbook)
sqrt(deltavar(y/x,meanval=mvec,Sigma=V)) ## 1.047691
sqrt(sd.x^2*(me.y/me.x)^2/me.x^2 + sd.y^2*(me.y/me.x)^2/me.y^2) ## 1.047691
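As a small illustration of the symbolic route mentioned above, here is a minimal sketch that rebuilds the second-order mean correction for y/x using base R's D(); it only covers the mean (the variance analogue needs higher moments and is not attempted) and assumes the me.x, me.y, sd.x, sd.y values from earlier and Cov(x, y) = 0:
# second-order (delta-method) approximation of E[y/x] via symbolic derivatives
g   <- expression(y / x)
gxx <- D(D(g, "x"), "x")   # d^2 g / dx^2
gxy <- D(D(g, "x"), "y")   # d^2 g / dx dy
gyy <- D(D(g, "y"), "y")   # d^2 g / dy^2
env <- list(x = me.x, y = me.y)
mean_approx <- eval(g[[1]], env) +
  0.5 * (eval(gxx, env) * sd.x^2 + 2 * eval(gxy, env) * 0 + eval(gyy, env) * sd.y^2)
mean_approx  # 1.328125, matching the second-order formula in the question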
For what it's worth, I took the code in @SeverinPappadeux's answer and made it into a function gratio(mx,my,sx,sy). For the Cauchy case (gratio(0,0,1,1)) it gets confused and reports a mean of 0 (which should be NA/divergent) but correctly reports the variance/std dev as divergent. For the parameters specified by the OP (gratio(5,4,4,1)) it gives mean=1.352176, sd=NA as above. For the first parameters I tried above (gratio(10,5,1,1)) it gives mean=0.5051581, sd=0.1141726.
These numerical experiments strongly suggest to me that the ratio of Gaussians sometimes has a well-defined variance, but I don't know when (time for another question on Math StackOverflow or CrossValidated?)
Such approximations are unlikely to be useful since the distribution may not have a finite standard deviation. Look at how unstable it is:
set.seed(123)
n <- 10^6
X <- rnorm(n, me.x, sd.x)
Y <- rnorm(n, me.y, sd.y)
sd(head(Y/X, 10^3))
## [1] 1.151261
sd(head(Y/X, 10^4))
## [1] 1.298028
sd(head(Y/X, 10^5))
## [1] 1.527188
sd(Y/X)
## [1] 1.863168
Contrast that with what happens when we try the same thing with a normal random variable:
sd(head(Y, 10^3))
## [1] 3.928038
sd(head(Y, 10^4))
## [1] 3.986802
sd(head(Y, 10^5))
## [1] 3.984113
sd(Y)
## [1] 3.999024
Note: If you were in a different situation, e.g. the denominator has compact support, then you could do this:
library(car)
m <- c(x = me.x, y = me.y)
v <- diag(c(sd.x, sd.y)^2)
deltaMethod(m, "y/x", v)