I compare categorical data from three different groups.
I wonder whether there is an easy way to add the p-values of chi-squared tests to faceted ggplots (I am analyzing a big data set). I just read that there is a marvelous way to do so when comparing means: https://www.r-bloggers.com/add-p-values-and-significance-levels-to-ggplots/. However, I could not find a solution for other tests (like chisq.test in my case).
d.test <- data.frame(
results = sample(c("A","B","C"), 30, replace =TRUE),
test = sample(c("test1", "test2","test3"), 30, replace = TRUE)
)
chisq.test(d.test$results,d.test$test)
ggplot(d.test, aes(results) ) +
geom_bar() + facet_grid(test ~ .)
Many thanks for your help! ;D
Store your p-value in a variable
pval <- chisq.test(d.test$results,d.test$test)$p.value
Use annotate to plot text manually
ggplot(d.test, aes(results) ) +
geom_bar() + facet_grid(test ~ .) +
annotate("text", x=1, y=5, label=pval)
Change its positioning with x and y
ggplot(d.test, aes(results) ) +
geom_bar() + facet_grid(test ~ .) +
annotate("text", x=2, y=3, label=pval)
Change significant digits displayed with signif
ggplot(d.test, aes(results) ) +
geom_bar() + facet_grid(test ~ .) +
annotate("text", x=1, y=5, label=signif(pval,4))
Add a "p-value: " prefix to the label with paste0
ggplot(d.test, aes(results) ) +
geom_bar() + facet_grid(test ~ .) +
annotate("text", x=1, y=5, label=paste0("p-value: ", signif(pval,4)))
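A note on formatting: signif() leaves very small p-values in scientific notation (for example 1e-08), which can crowd a panel. Below is a minimal sketch of a label helper; p_label() and its threshold argument are hypothetical names introduced here for illustration, not part of ggplot2. Note also that annotate() draws the same text in every facet, which suits this case because the chi-squared test is computed once across all three tests.

```r
# Hypothetical helper (not from ggplot2): format a p-value for a plot label.
# Values below `threshold` are reported as "p < threshold" instead of in
# scientific notation.
p_label <- function(p, digits = 3, threshold = 0.001) {
  ifelse(p < threshold,
         paste0("p < ", format(threshold, scientific = FALSE)),
         paste0("p = ", signif(p, digits)))
}

p_label(c(0.02714188, 1e-6))

# Usage inside the plot from the answer above:
# annotate("text", x = 1, y = 5, label = p_label(pval))
```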
broom has methods to create tidy dataframes of most statistical test outputs. Then you can use that output as a data = argument within geom_text.
Generate data
library(broom)
library(dplyr)
library(ggplot2)
fakedata <-
data.frame(groups = sample(c("pop1", "pop2", "pop3", "pop4"), 120, replace = T),
results = sample(c("A","B","C"), 120, replace = TRUE),
test = sample(c("test1", "test2","test3"), 120, replace = TRUE))
Conduct and tidy tests
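fakedata.test <-
fakedata %>%
group_by(groups) %>%
# do(fit = ...) %>% tidy(fit) no longer works in current broom releases;
# group_modify() runs and tidies one test per group and gives the same table
group_modify(~ tidy(chisq.test(.x$results, .x$test)))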
# A tibble: 4 x 5
# Groups: groups [4]
groups statistic p.value parameter method
<fctr> <dbl> <dbl> <int> <fctr>
1 pop1 3.714286 0.44605156 4 Pearson's Chi-squared test
2 pop2 2.321429 0.67687042 4 Pearson's Chi-squared test
3 pop3 2.294897 0.68169829 4 Pearson's Chi-squared test
4 pop4 10.949116 0.02714188 4 Pearson's Chi-squared test
Visualize
fakedata %>%
ggplot(aes(results, test)) +
geom_jitter(width = 0.2, height = 0.2, shape = 1, size = 2) +
geom_text(data = fakedata.test,
aes(3, 3.5,
label = paste0("χ²(", parameter, ")=", round(statistic, 2), "; p=", round(p.value, 2))),
hjust = 1) +
facet_wrap(~groups)
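If broom is not available, the same per-group table can be assembled in base R. This is only a sketch: the seed, the per_group name and the label format are assumptions made for illustration. The resulting data frame can be passed to geom_text(data = ...) in the same way as fakedata.test above, because it carries the faceting column groups.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
fakedata <- data.frame(
  groups  = sample(c("pop1", "pop2", "pop3", "pop4"), 120, replace = TRUE),
  results = sample(c("A", "B", "C"), 120, replace = TRUE),
  test    = sample(c("test1", "test2", "test3"), 120, replace = TRUE),
  stringsAsFactors = FALSE
)

# One chi-squared test per group, collected into one row per group.
# suppressWarnings() hides the small-expected-count warning that random
# toy data of this size tends to trigger.
per_group <- do.call(rbind, lapply(split(fakedata, fakedata$groups), function(d) {
  fit <- suppressWarnings(chisq.test(d$results, d$test))
  data.frame(groups    = d$groups[1],
             statistic = unname(fit$statistic),
             parameter = unname(fit$parameter),
             p.value   = fit$p.value,
             stringsAsFactors = FALSE)
}))

# Text to print in each facet
per_group$label <- sprintf("X2(%d) = %.2f; p = %.3f",
                           per_group$parameter, per_group$statistic, per_group$p.value)
per_group
```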
sample_data = read.table("http://freakonometrics.free.fr/db.txt",
header=TRUE, sep=";")
head(sample_data)
model = glm(Y~0+X1+X2+X3,family=binomial,data=sample_data)
summary(model)
exp(coef(model ))
exp(cbind(OR = coef(model ), confint(model )))
The code above fits a logistic regression with a categorical predictor (X3) to the sample data.
Running it gives the following output:
OR 2.5 % 97.5 %
X1 1.67639337 1.352583976 2.09856514
X2 1.23377720 1.071959330 1.42496949
X3A 0.01157565 0.001429430 0.08726854
X3B 0.06627849 0.008011818 0.54419759
X3C 0.01118084 0.001339984 0.08721028
X3D 0.01254032 0.001545240 0.09539880
X3E 0.10654454 0.013141540 0.87369972
but I am wondering how to extract the OR and CI for the factor levels only. My
desired output is:
OR 2.5 % 97.5 %
X3A 0.01157565 0.001429430 0.08726854
X3B 0.06627849 0.008011818 0.54419759
X3C 0.01118084 0.001339984 0.08721028
X3D 0.01254032 0.001545240 0.09539880
X3E 0.10654454 0.013141540 0.87369972
Can anyone help me with the code to extract it?
Additionally, I want to plot the extracted ORs with their confidence
intervals.
Can you also help me with the code for that plot (or a box plot)?
You could drop the rows whose names exactly match a variable name in your data frame. The rows for factor levels have the level appended (X3A, X3B, ...), so they will not match:
result <- exp(cbind(OR = coef(model ), confint(model )))
result[!rownames(result) %in% names(sample_data),]
#> OR 2.5 % 97.5 %
#> X3A 0.01157565 0.001429430 0.08726854
#> X3B 0.06627849 0.008011818 0.54419759
#> X3C 0.01118084 0.001339984 0.08721028
#> X3D 0.01254032 0.001545240 0.09539880
#> X3E 0.10654454 0.013141540 0.87369972
To extract the necessary rows and plot them, the full reproducible code would be:
library(tidyverse)
sample_data <- read.table("http://freakonometrics.free.fr/db.txt",
header = TRUE, sep = ";")
model <- glm(Y ~ 0 + X1 + X2 + X3,family = binomial, data = sample_data)
result <- exp(cbind(OR = coef(model), confint(model)))
#> Waiting for profiling to be done...
result %>%
as.data.frame(check.names = FALSE) %>%
rownames_to_column(var = "Variable") %>%
filter(!Variable %in% names(sample_data)) %>%
ggplot(aes(x = OR, y = Variable)) +
geom_vline(xintercept = 1, linetype = 2) +
geom_errorbarh(aes(xmin = `2.5 %`, xmax = `97.5 %`), height = 0.1) +
geom_point(size = 2) +
scale_x_log10(name = "Odds ratio (log scale)") +
theme_minimal(base_size = 16)
Created on 2022-06-14 by the reprex package (v2.0.1)
One possibility, using broom to extract the coefficients, dplyr::filter to select the terms you want, and dwplot to plot.
library(broom)
library(dotwhisker)
library(dplyr)
tt <- (tidy(model, exponentiate = TRUE, conf.int = TRUE)
|> filter(stringr::str_detect(term, "^X3"))
)
dwplot(tt)
In addition, I would suggest:
library(ggplot2)
dwplot(tt) + scale_x_log10() + geom_vline(xintercept = 1, lty = 2) +
labs(x="Odds ratio")
To extract all but the first 2 rows, use a negative index on the rows.
I will also coerce to data.frame and add an id, it will be needed to plot the confidence intervals.
ORCI <- exp(cbind(OR = coef(model), confint(model)))[-(1:2), ]
ORCI <- cbind.data.frame(ORCI, id = row.names(ORCI))
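The answer above stops before the plot itself. A rough sketch of one way to finish it follows, with base graphics swapped in for ggplot2 so it runs without extra packages. The simulated data frame sim is an assumption that stands in for the downloaded sample_data, and the rows are selected by name with grepl() rather than by position.

```r
set.seed(42)  # assumption: simulated stand-in for sample_data
n <- 500
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n),
                  X3 = factor(sample(LETTERS[1:5], n, replace = TRUE)))
sim$Y <- rbinom(n, 1, plogis(0.5 * sim$X1 + 0.2 * sim$X2 - 0.5))

model <- glm(Y ~ 0 + X1 + X2 + X3, family = binomial, data = sim)
ORCI <- exp(cbind(OR = coef(model), suppressMessages(confint(model))))
ORCI <- ORCI[grepl("^X3", rownames(ORCI)), ]  # keep only the factor levels

# Dot-and-whisker plot on a log scale: one row per factor level
y <- seq_len(nrow(ORCI))
plot(ORCI[, "OR"], y, log = "x", xlim = range(ORCI), pch = 19,
     yaxt = "n", xlab = "Odds ratio (log scale)", ylab = "")
axis(2, at = y, labels = rownames(ORCI), las = 1)
segments(ORCI[, "2.5 %"], y, ORCI[, "97.5 %"], y)  # confidence intervals
abline(v = 1, lty = 2)                             # reference line at OR = 1
```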
I'm using ggplot to visualize many linear regressions and facet them by groups. I'd like geom_smooth() to show the trend line as one color if P < 0.05, a different color if P < 0.10, and not show it at all if P ≥ 0.10.
I managed to do this using a loop to extract P-values from lm() for each regression, then join them with the data used for plotting. Then I add another column of color names to pass to aes(), determined conditionally from the P-values, and use scale_color_identity() to achieve my goal.
Here's an example:
library(tidyverse)
#make mtcars a tibble and cyl a factor, for convenience
mtcars1 <- as_tibble(mtcars) %>% dplyr::mutate(cyl = as.factor(cyl))
#initialize a list to store p-values from lm() for each level of factor
p.list <- vector(mode = "list", length = length(levels(mtcars1$cyl)))
names(p.list) <- levels(mtcars1$cyl)
#loop to calculate p-values for each level of mtcars$cyl
for(i in seq_along(levels(mtcars1$cyl))){
mtcars.sub <- mtcars1 %>% dplyr::filter(cyl == levels(.$cyl)[i])
lm.pval <- mtcars.sub %>%
dplyr::distinct(cyl) %>%
dplyr::mutate(P =
summary(lm(mpg ~ disp, data = mtcars.sub))$coefficients[2,4] ##extract P-value
)
p.list[[i]] <- lm.pval
}
#join p-values to dataset and add column to use with scale_color_identity()
mtcars.p <- mtcars1 %>% dplyr::left_join(dplyr::bind_rows(p.list, .id = "cyl"), by = "cyl") %>%
dplyr::mutate(p.color = ifelse(P < 0.05, "black",
ifelse(P < 0.10, "lightblue", NA)))
#plot
ggplot(data = mtcars.p, aes(x = disp, y = mpg)) +
geom_smooth(method = "lm",
se = FALSE,
aes(color = p.color)) +
geom_point() +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
This seems like too many initial steps for something that should be relatively easy. Are these steps necessary, or is there a more efficient way of doing this? Can ggplot or any other packages out there do this on their own, without having to first extract p-values from lm()?
After fitting the regression with lm(), you can draw the fitted line in ggplot with geom_abline():
myline<-lm(mpg ~ disp, data = mtcars)
ggplot(data = mtcars, aes(x = disp, y = mpg)) +
geom_abline(slope = coef(myline)[[2]], intercept = coef(myline)[[1]], color='blue')+
geom_point(color='red') +
scale_color_identity(name = NULL,
na.translate = FALSE,
labels = c("P < 0.05", "P < 0.10"),
guide = "legend") +
facet_wrap(~cyl, scales = "free")
Alternatively, geom_smooth() can fit and draw the same line itself. It has no slope or intercept arguments; method = 'lm' refits the model from the plotted data:
geom_smooth(method='lm', se=FALSE, color='blue')+
We can simplify the steps with a grouped operation. Instead of extracting each component by hand, tidy() from broom returns the model output as a tibble:
library(broom)
library(dplyr)
library(tidyr)
mtcars1 %>%
group_by(cyl) %>%
summarise(out = list(tidy(lm(mpg ~ disp, data = cur_data())))) %>%
unnest(out)
Output:
# A tibble: 6 x 6
cyl term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) 40.9 3.59 11.4 0.00000120
2 4 disp -0.135 0.0332 -4.07 0.00278
3 6 (Intercept) 19.1 2.91 6.55 0.00124
4 6 disp 0.00361 0.0156 0.232 0.826
5 8 (Intercept) 22.0 3.35 6.59 0.0000259
6 8 disp -0.0196 0.00932 -2.11 0.0568
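To connect these p-values back to the plot in the question, here is a sketch that maps each group's slope p-value to a line colour and draws the trend line only where a colour was assigned. It uses base R for the p-values; the names pvals, line_col and mtcars_p are assumptions introduced here, and the plotting part runs only if ggplot2 is installed.

```r
# Slope p-value of mpg ~ disp within each level of cyl
pvals <- sapply(split(mtcars, mtcars$cyl), function(d) {
  summary(lm(mpg ~ disp, data = d))$coefficients["disp", "Pr(>|t|)"]
})

# black if P < 0.05, lightblue if P < 0.10, NA (no line) otherwise
line_col <- ifelse(pvals < 0.05, "black",
                   ifelse(pvals < 0.10, "lightblue", NA))

mtcars_p <- merge(mtcars,
                  data.frame(cyl = as.numeric(names(pvals)),
                             p.color = unname(line_col),
                             stringsAsFactors = FALSE),
                  by = "cyl")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars_p, aes(x = disp, y = mpg)) +
    geom_point() +
    # only groups with a colour get a trend line
    geom_smooth(data = subset(mtcars_p, !is.na(p.color)),
                aes(color = p.color),
                method = "lm", formula = y ~ x, se = FALSE) +
    scale_color_identity() +
    facet_wrap(~cyl, scales = "free")
  print(p)
}
```

Passing a filtered data frame to geom_smooth() leaves the facet for the non-significant group in place (the points layer still covers it) while drawing no line there.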
I need to run the same exercise (Monte Carlo simulations) from Stata to R.
The Stata code I have used is below. How can I do this in R? (I have searched many tutorials, but I still haven't managed to do it in R.)
* Simulations (10, 100 and 1000 sample replications/iterations)
clear
drop _all
set obs 100
set seed 54231
gen x = ((rnormal()))*10 + 40
* Generating true_y, considering Beta = 0.035
gen true_y = 5+0.03500*x
save truth, replace
twoway scatter true_y x
program hetero1
version 13
args c
use truth, clear
gen y = true_y + rnormal()
regress y x
end
foreach i in 10 100 1000 {
simulate _b, reps (`i'): hetero1
sum _b_x
twoway histogram _b_x, fraction xline(+0.03500, lcolor(red)) xline(`r(mean)', lcolor(green)) fcolor(none) lcolor(gs0) legend(off) title(`i' Repetições)
graph save graf`i'.gph, replace
}
gr combine graf10.gph graf100.gph graf1000.gph
graph export "graf.png", as(png) replace
At the end, I need to obtain these 3 histograms (of estimated beta/coefficients), considering 10, 100 and 1000 sample replications. The red line refers to the "true" coefficient and the green one is the mean of the estimated coefficients - [see the image in the link]
This should do it:
# Simulations (10, 100 and 1000 sample replications/iterations)
library(ggplot2)
library(dplyr)
library(gridExtra)
n <- 100
set.seed(54231)
x <- rnorm(n)*10 + 40
# Generating true_y, considering Beta = 0.035
true_y <- 5+0.03500*x
plot(x, true_y)
b <- t(replicate(1110, coef(lm(true_y + rnorm(n) ~ x))))
b <- as.data.frame(b) %>%
rename("a" = "(Intercept)",
"b" = "x") %>%
mutate(
obs = 1:n(),
n = case_when(
obs %in% 1:10 ~ "N = 10",
obs %in% 11:110 ~ "N = 100",
TRUE ~ "N = 1000"),
n = factor(n, levels=c("N = 10", "N = 100", "N = 1000")))
b10 <- b %>% filter(n == "N = 10")
g1 <- ggplot() +
geom_histogram(data = b10, aes(x=b), bins=3, col="white") +
geom_vline(xintercept = 0.03500, col="red") +
geom_vline(data = b10 %>% summarise(b=mean(b)), aes(xintercept = b), col="green") +
facet_wrap(~n) +
theme_bw()
b100 <- b %>% filter(n == "N = 100")
g2 <- ggplot() +
geom_histogram(data = b100, aes(x=b), bins=10, col="white") +
geom_vline(xintercept = 0.03500, col="red") +
geom_vline(data = b100 %>% summarise(b=mean(b)), aes(xintercept = b), col="green") +
facet_wrap(~n) +
theme_bw()
b1000 <- b %>% filter(n == "N = 1000")
g3 <- ggplot() +
geom_histogram(data = b1000, aes(x=b), bins=25, col="white") +
geom_vline(xintercept = 0.03500, col="red") +
geom_vline(data = b1000 %>% summarise(b=mean(b)), aes(xintercept = b), col="green") +
facet_wrap(~n) +
theme_bw()
library(gridExtra)
grid.arrange(g1, g2, g3, nrow=2)
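For comparison, here is a more compact base-R sketch that stays closer to the Stata loop: it runs three separate simulations of 10, 100 and 1000 replications instead of slicing one run of 1110. The helper name sim_slopes() is an assumption introduced for illustration, and base hist() is swapped in for ggplot2.

```r
set.seed(54231)
n <- 100
x <- rnorm(n) * 10 + 40
true_y <- 5 + 0.035 * x

# Hypothetical helper: refit y ~ x on fresh noise `reps` times and
# return the estimated slopes
sim_slopes <- function(reps) {
  replicate(reps, coef(lm(true_y + rnorm(n) ~ x))[["x"]])
}

reps <- c(10, 100, 1000)
slopes <- lapply(reps, sim_slopes)
names(slopes) <- reps

op <- par(mfrow = c(1, 3))  # three histograms side by side
for (k in names(slopes)) {
  hist(slopes[[k]], main = paste(k, "replications"), xlab = "Estimated slope")
  abline(v = 0.035, col = "red")               # true coefficient
  abline(v = mean(slopes[[k]]), col = "green")  # mean of the estimates
}
par(op)
```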
I would like to make a plot that has multiple geom_smooth(method="loess") lines for differing thresholds, but I'm having some issues.
Specifically, I want one geom_smooth() line for all points beyond ±1 standard deviation (SD), i.e. > 1 SD or < -1 SD (which includes the points beyond ±2 SD), one for the points < -2 SD or > 2 SD, and one for all the points together. However, the smooth is only fitted to the data within each category (e.g. greater than 1 SD but less than 2 SD).
I have made some toy data here:
#test data
a <- c(rnorm(10000, mean=0, sd = 1))
b <- c(rnorm(10000, mean=0, sd = 1))
test <- as.data.frame(cbind(a,b))
test$Thresholds <- cut(test$a, breaks = c(-Inf,-2*sd(test$a),-sd(test$a),0,sd(test$a), 2*sd(test$a), Inf),
labels = c("2_SD+", "1_SD", "0_SD","0_SD", "1_SD", "2_SD+"))
plot <- ggplot(test, aes(x=b, y=a, color=Thresholds, alpha = 0.25, legend = F)) + geom_point() + geom_smooth(method="loess")
This creates the following plot:
Does anyone have any suggestions?
If you want smoothing done for different quantities of x and y you have to manipulate the data component...
library(ggplot2)
library(dplyr)
#test data
a <- c(rnorm(10000, mean=0, sd = 1))
b <- c(rnorm(10000, mean=0, sd = 1))
test <- as.data.frame(cbind(a,b))
test$Thresholds <- cut(test$a, breaks = c(-Inf,-2*sd(test$a),-sd(test$a),0,sd(test$a), 2*sd(test$a), Inf),
labels = c("2_SD+", "1_SD", "0_SD","0_SD", "1_SD", "2_SD+"))
ggplot(test, aes(x=b, y=a)) +
geom_point() +
# just 2
geom_smooth(data = test %>% filter(Thresholds == "2_SD+"), method="loess") +
# 1 and 2
geom_smooth(data = test %>% filter(Thresholds == "1_SD" | Thresholds == "2_SD+" ), method="loess", color = "yellow") +
#all
geom_smooth(data = test, method="loess", color = "red")
#> `geom_smooth()` using formula 'y ~ x'
#> `geom_smooth()` using formula 'y ~ x'
#> `geom_smooth()` using formula 'y ~ x'
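The nested subsets can also be built directly from the distance to zero, without going through the cut() labels. This is a sketch under assumptions: the seed, the smaller sample size (2000, to keep loess quick) and the names beyond_1sd and beyond_2sd are all introduced here for illustration, and the plotting part runs only if ggplot2 is installed.

```r
set.seed(1)  # assumption: seed and smaller n chosen for a quick, reproducible sketch
test <- data.frame(a = rnorm(2000), b = rnorm(2000))

# Nested subsets: everything beyond 2 SD is also beyond 1 SD
beyond_1sd <- test[abs(test$a) > sd(test$a), ]
beyond_2sd <- test[abs(test$a) > 2 * sd(test$a), ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(test, aes(x = b, y = a)) +
    geom_point(alpha = 0.25) +
    # each smooth gets its own data, so the groups can overlap
    geom_smooth(data = beyond_2sd, method = "loess", formula = y ~ x, color = "blue") +
    geom_smooth(data = beyond_1sd, method = "loess", formula = y ~ x, color = "yellow") +
    geom_smooth(method = "loess", formula = y ~ x, color = "red")
  print(p)
}
```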
Suppose I have 2 data frames, one for 2015 and one for 2016. I want to run a regression on each data frame and plot one of the coefficients from each regression with its confidence interval. For example:
set.seed(1020022316)
library(dplyr)
library(stargazer)
df16 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.5 * x1 + 2 * t + e) %>%
select(-e)
df15 <- data.frame(
x1 = rnorm(1000, 0, 2),
t = sample(c(0, 1), 1000, T),
e = rnorm(1000, 0, 10)
) %>% mutate(y = 0.75 * x1 + 2.5 * t + e) %>%
select(-e)
lm16 <- lm(y ~ x1 + t, data = df16)
lm15 <- lm(y ~ x1 + t, data = df15)
stargazer(lm15, lm16, type="text", style = "aer", ci = TRUE, ci.level = 0.95)
I want to plot t=1.558, x=2015, and t=2.797, x=2016 with their respective .95 CI. What is the best way of doing this?
I could do it 'by hand', but I hope there is a better way.
library(ggplot2)
df.plot <-
data.frame(
y = c(lm15$coefficients[['t']], lm16$coefficients[['t']]),
x = c(2015, 2016),
lb = c(
confint(lm15, 't', level = 0.95)[1],
confint(lm16, 't', level = 0.95)[1]
),
ub = c(
confint(lm15, 't', level = 0.95)[2],
confint(lm16, 't', level = 0.95)[2]
)
)
df.plot %>% ggplot(aes(x, y)) + geom_point() +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1) +
geom_hline(aes(yintercept=0), linetype="dashed")
By "best" I mean: figure quality (it looks nice), elegant code, and easy to extend to more than 2 regressions.
This is a bit too long for a comment, so I post it as a partial answer.
It is unclear from your post whether your main problem is getting the data into the right shape or the plotting itself. But just to follow up on one of the comments, let me show you how to run several models using dplyr and broom in a way that makes plotting easy. Consider the mtcars dataset:
library(dplyr)
library(broom)
models <- mtcars %>% group_by(cyl) %>%
do(data.frame(tidy(lm(mpg ~ disp, data = .),conf.int=T )))
head(models) # I have abbreviated the following output a bit
cyl term estimate std.error statistic p.value conf.low conf.high
(dbl) (chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
4 (Intercept) 40.8720 3.5896 11.39 0.0000012 32.752 48.99221
4 disp -0.1351 0.0332 -4.07 0.0027828 -0.210 -0.06010
6 (Intercept) 19.0820 2.9140 6.55 0.0012440 11.591 26.57264
6 disp 0.0036 0.0156 0.23 0.8259297 -0.036 0.04360
You see that this gives you all coefficients and confidence intervals in one nice data frame, which makes plotting with ggplot easier. For instance, if your datasets have identical columns, you could add a year identifier to them (e.g. df1$year <- 2000; df2$year <- 2001 etc.) and bind them together afterwards (e.g. using bind_rows, or you can use bind_rows's .id option). Then you can use the year identifier instead of cyl in the above example.
The plotting then is simple. To use the mtcars data again, let's plot the coefficients for disp only (though you could also use faceting, grouping, etc):
ggplot(filter(models, term=="disp"), aes(x=cyl, y=estimate)) +
geom_point() + geom_errorbar(aes(ymin=conf.low, ymax=conf.high))
To use your data:
# name the inputs so that .id holds the actual years rather than "1" and "2"
df <- bind_rows(`2015` = df15, `2016` = df16, .id = "years")
df %>% group_by(years) %>%
do(data.frame(tidy(lm(y ~ x1+t, data = .),conf.int=T ))) %>%
filter(term == "t") %>%
ggplot(aes(x=years, y=estimate)) + geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high))
Note that you can easily add more models just by binding more data to the main data frame. You can also use faceting, grouping or position dodging to adjust the look of the plot if you want to show more than one coefficient.
This is the solution I have right now:
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
df.plot <- lapply(list(lm15,lm16), gen_df_plot, coef_name = 't')
df.plot <- data.table::rbindlist(df.plot)
df.plot$x <- as.factor(c(2015, 2016))
df.plot %>% ggplot(aes(x, y)) + geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") + theme_bw()
I don't love it, but it works.
Here is what might be generalized code. I have made a change to how "x" is defined so that you don't have to worry about alphabetic reordering of the factor.
#
# Paul Gronke and Paul Manson
# Early Voting Information Center at Reed College
#
# August 27, 2019
#
#
# Code to plot a single coefficient from multiple models, provided
# as an easier alternative to "coefplot" and "dotwhisker". Some users
# may find those packages more capable
#
# Code adapted from https://stackoverflow.com/questions/35582052/plot-regression-coefficient-with-confidence-intervals
# gen_df_plot function will create a tidy data frame for your plot
# Currently set up to display 95% confidence intervals
gen_df_plot <- function(reg, coef_name){
df <- data.frame(y = reg$coefficients[[coef_name]],
lb = confint(reg, coef_name, level = 0.95)[1],
ub = confint(reg, coef_name, level = 0.95)[2])
return(df)
}
# Populate the data frame with a list of your model results.
df.plot <- lapply(list(model1, # List your models here
model2),
gen_df_plot,
coef_name = 'x1') # Coefficient name
# Convert the list to a tidy data frame
df.plot <- data.table::rbindlist(df.plot)
# Provide the coefficient or regression labels below, in the
# order that you want them to appear. The "levels=unique(.)" parameter
# overrides R's desire to order the factor alphabetically
df.plot$x <- c("Group 1",
"Group 2") %>%
factor(., levels = unique(.),
ordered = TRUE)
# Create your plot
df.plot %>% ggplot(aes(x, y)) +
geom_point(size=4) +
geom_errorbar(aes(ymin = lb, ymax = ub), width = 0.1, linetype="dotted") +
geom_hline(aes(yintercept=0), linetype="dashed") +
theme_bw() +
ggtitle("Comparing Coefficients") +
ylab("Coefficient Value")
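As a possible variation on gen_df_plot, the helper can take a named list of models so that the labels and their order come from the list itself. This is a sketch only: coef_ci() is a hypothetical name, and the three mtcars models stand in for your own regressions.

```r
# Hypothetical helper: one row per model with the estimate and its CI.
# The names of `models` become the x labels, kept in list order.
coef_ci <- function(models, coef_name, level = 0.95) {
  rows <- lapply(names(models), function(nm) {
    ci <- confint(models[[nm]], coef_name, level = level)
    data.frame(x  = nm,
               y  = coef(models[[nm]])[[coef_name]],
               lb = ci[1, 1],
               ub = ci[1, 2],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  out$x <- factor(out$x, levels = names(models))  # no alphabetical reordering
  out
}

# Stand-in models: mpg ~ disp fitted separately for each cyl group
models <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ disp, data = d))
df.plot <- coef_ci(models, "disp")
df.plot
```

The resulting df.plot has the same columns (x, y, lb, ub) as in the code above, so the same ggplot call can be reused unchanged.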