Calculating Wilson interval to plot binomial proportions in R?

I have a data frame consisting of six variables -- one two-level grouping variable indicating treatment status and five binary (0/1) variables. I would like to plot the proportion of successes for each binary variable, with 95% confidence intervals as error bars and a separate dot and color for each treatment group.
I'm currently plotting these as shown below.
df2 <-
df %>%
select(., c(q1_active, # select variables
q2_appt,
q2_trmt,
q2_img,
q2_tele,
q2_trav))
df3 <-
df2 %>%
pivot_longer(cols = starts_with("q2"),
names_to = "variable",
names_prefix = "q2",
values_to = "values")
se <- function(x) sqrt(var(x)/length(x)) #creates function to calculate standard error of the mean
df4 <-
df3 %>%
group_by(variable, q1_active) %>% # group by both binom variable and treatment status
mutate(means=mean(values)) %>% # calculate proportions for binomial variables
mutate(se=se(values)) %>% # calculates std error
distinct(means, .keep_all=TRUE) %>% # keep one row per group
ungroup() %>%
drop_na() # there is one "NA" group in the treatment variable I do not need
pos <- position_dodge(.5)
p2 <-
df4 %>%
ggplot(., aes(x=variable, y=means)) +
geom_point(aes(colour=as.factor(q1_active)),position=pos) +
geom_errorbar(aes(ymin=means-(1.96*se), ymax=means+(1.96*se),
colour=as.factor(q1_active),
group=as.factor(q1_active)),
width=.2, position=pos) +
labs(title="Title Here",
subtitle="Subtitle Here",
x="",
y="")
The plot looks okay. I know the proportions are correct because I've double-checked the "means" variable.
However, I'm unsure whether I'm calculating the standard error correctly for these proportions. Additionally (as you can likely see), when I run the plot, one proportion has zero frequency. I would like to calculate and plot the Wilson interval for these proportions instead of the standard error I have used.
Could someone guide me on how to correctly calculate the Wilson (or "exact") confidence interval for these binomial proportions -- either before or after I pivot my data frame -- and how to plot the intervals using ggplot?
I'm relatively new to coding and R, so please forgive any sloppy code or misunderstandings. And please let me know if you need clarification on anything. Thank you in advance.
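One way to approach the interval calculation is sketched below in base R. The helper `wilson_ci()` and the toy data frame are assumptions made for illustration, not part of the question's data; the column names `variable`, `q1_active` and `values` mirror the pivoted data frame above. The function applies the standard Wilson score formula to the number of successes and the number of trials in each group.

```r
# wilson_ci() is a hypothetical helper implementing the Wilson score interval
# for x successes out of n trials.
wilson_ci <- function(x, n, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p <- x / n
  denom <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom
  half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  data.frame(prop = p,
             lower = pmax(0, centre - half),  # guard against rounding below 0
             upper = pmin(1, centre + half))  # guard against rounding above 1
}

# Toy long-format data (made up): two binary variables, two treatment groups,
# ten observations per cell, including one cell with zero successes.
toy <- data.frame(
  variable  = rep(c("appt", "tele"), each = 20),
  q1_active = rep(rep(c(0, 1), each = 10), times = 2),
  values    = c(rep(c(1, 0), c(4, 6)), rep(c(1, 0), c(7, 3)),
                rep(0, 10),            rep(c(1, 0), c(2, 8)))
)

# Successes and trials per variable and treatment group
x <- aggregate(values ~ variable + q1_active, data = toy, FUN = sum)
n <- aggregate(values ~ variable + q1_active, data = toy, FUN = length)

ci <- cbind(x[c("variable", "q1_active")], wilson_ci(x$values, n$values))
ci
```

The resulting `lower` and `upper` columns could then be mapped to `ymin` and `ymax` in `geom_errorbar()` in place of `means-(1.96*se)` and `means+(1.96*se)`. Unlike the normal-approximation bars, the Wilson interval has non-zero width when the observed proportion is 0. For a single proportion, `prop.test(x, n, correct = FALSE)$conf.int` should give the same limits, and `binom.test(x, n)$conf.int` gives the Clopper-Pearson interval, which is the one usually called "exact".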

Related

ggplot statistical differences in plot labels (reproducible code included)

I have code that generates two plots (actually from different datasets) like this one:
library(ggplot2)
library(patchwork) # needed for combining plots with `+` below
#Plot 1
p1 <- ggplot(mtcars,aes(x=factor(cyl),fill=factor(gear)))+
geom_bar(position="fill")+
geom_text(aes(label=scales::percent(..count../sum(..count..))),
stat='count',position=position_fill(vjust=0.5))
#Plot 2
p2 <- ggplot(mtcars,aes(x=factor(cyl),fill=factor(gear)))+
geom_bar(position="fill")+
geom_text(aes(label=scales::percent(..count../sum(..count..))),
stat='count',position=position_fill(vjust=0.5))
plot <- p1 + p2
plot
Is it possible, using ggplot or another library, to test for statistical differences among factors and, if there are any, change the label from "25%" to something like "25% ↑" or "25% **"? In other words, I want to compare values and change the labels to reflect statistical differences. In my example the values are the same, but in reality the plots come from different datasets.
As MrFlick mentioned, ggplot might not be the right tool to do the calculations. But once you have your calculations, you could do something like this:
library(dplyr) # provides %>%
# some data with calculated levels of significance
dplyr::tibble(YEAR=rep(c(2019,2020),each=3),
GRP=rep(c("A","B","C"),2),
VAL=c(20,100,30,25,70,30),
SIG=rep(c("*","***",""),2)) %>%
# create labels
dplyr::group_by(GRP) %>%
dplyr::mutate(LABEL=dplyr::case_when(VAL/sum(VAL)<0.5 ~ paste("<",SIG),
VAL/sum(VAL)>0.5 ~ paste(">",SIG),
TRUE ~ paste(""))) %>%
dplyr::ungroup() %>%
# calculate percentages
dplyr::group_by(YEAR) %>%
dplyr::mutate(VAL=VAL/sum(VAL)) %>%
dplyr::ungroup() %>%
# plot data: combining percentages and sig-levels as label
ggplot2::ggplot(ggplot2::aes(x=YEAR,
y=VAL,
fill=GRP,
label=glue::glue("{scales::percent(VAL)} {LABEL}"))) +
ggplot2::geom_bar(stat="identity") +
ggplot2::geom_text(position=ggplot2::position_fill(0.5))
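The answer above assumes the significance levels are already available in the `SIG` column. As a rough sketch of how such a column might be produced outside ggplot, the base R snippet below compares two proportions with `prop.test()` and converts the p-value into stars. The helper `p_to_stars()`, the cut-offs it uses, and the counts are all assumptions for illustration.

```r
# p_to_stars() is a hypothetical helper: it maps p-values to the usual
# star notation (*** < 0.001, ** < 0.01, * < 0.05, otherwise empty).
p_to_stars <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 1),
                   labels = c("***", "**", "*", ""),
                   right = FALSE, include.lowest = TRUE))
}

# Made-up counts: successes out of 100 observations in each of two datasets
successes <- c(20, 25)
totals    <- c(100, 100)

# Two-sample test of equal proportions
p_value <- prop.test(successes, totals)$p.value

# Label combining the percentage from the second dataset with the stars
label <- trimws(paste0(round(100 * successes[2] / totals[2]), "% ",
                       p_to_stars(p_value)))
label
```

The resulting character vector can be stored in a column and passed to `geom_text()` through the `label` aesthetic, as in the answer above.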

R - ggplot2 - limit bar chart output for categorical data

I am trying to create a bar chart in ggplot2 that limits output on the x-axis to the top 10% most frequent categories.
My dataframe is a dataset that contains statistics on personal loans. I am examining the relationship between two categories, Loan Status and Occupation.
First, I want to limit Loan Status to loans that have been "charged off." Next, I want to plot how many loans have been charged off across various occupations using a bar chart. There are 67 unique values for Occupation - I want to limit the plot to only the most frequent occupations (by integer or percentage, i.e. "7" or "10%" works).
In the code below, I am using the forcats function fct_infreq to order the bar chart by frequency in descending order. However, I cannot find a function to limit the number of x-axis categories. I have experimented with quantile, scale_x_discrete, etc. but those don't seem to work for categorical data.
Thanks for your help!
df %>% filter(LoanStatus %in% c("Chargedoff")) %>%
ggplot() +
geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
scale_x_discrete(limits = c(quantile(df$Occupation, 0.9), quantile(df$Occupation, 1)))
Resulting error:
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
UPDATE:
Using Yifu's answer below, I was able to get the desired output like this:
pd_occupation <- pd %>%
dplyr::filter(LoanStatus == "Chargedoff") %>%
group_by(Occupation) %>%
mutate(group_num = n())
table(pd_occupation$group_num)#to view the distribution
ggplot(subset(pd_occupation, group_num >= 361)) +
geom_bar(aes(fct_infreq(Occupation)), stat = 'count') +
ggtitle('Loan Charge-Offs by Occupation')
You can do it in dplyr instead:
#only use cars whose carb appears more than 7 times to create a plot
mtcars %>%
group_by(carb) %>%
mutate(group_num = n()) %>%
# you can substitute the number with the 10% percentile or whatever you want
dplyr::filter(group_num >= 7) #%>%
#ggplot()
#create your plot
The idea is to filter the observations first and pass them to ggplot, rather than filtering the data inside ggplot.
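If the goal is "the N most frequent categories" rather than a fixed count threshold, the same filter-first idea can be sketched in base R as follows. `keep_top_n()` is a hypothetical helper written for this example, and `mtcars$carb` stands in for the `Occupation` column.

```r
# keep_top_n() is a hypothetical helper, not part of dplyr or forcats.
# It keeps only the rows whose category is among the n most frequent ones.
keep_top_n <- function(data, column, n) {
  counts <- sort(table(data[[column]]), decreasing = TRUE)
  top <- names(counts)[seq_len(min(n, length(counts)))]
  data[as.character(data[[column]]) %in% top, , drop = FALSE]
}

# Keep the three most frequent carb values before plotting
top_carb <- keep_top_n(mtcars, "carb", 3)
table(top_carb$carb)
```

The filtered data frame can then be passed to `ggplot()` with `geom_bar()` as before. An alternative worth checking is `forcats::fct_lump_n()`, which collapses the remaining categories into an "Other" level instead of dropping them.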

How to split x-axis as decile in R and make ggplot

Hi, I am wondering how to split the x-axis into deciles in R and plot the result with ggplot.
I currently have age range data and NO2 pollution data. The two datasets share the same geographic reference, named ward. I wish to plot my demographic data in quantiles containing equal numbers of wards (298 in total).
I tried the quantile regression in R where I used the following:
library(SparseM)
library(quantreg)
mydata<- read.csv("M:/Desktop10/Test2.csv")
attach(mydata)
Y <- cbind(NO2.value)
X <- cbind(age.0.to.4, age..5.to.9, age.10.to.14, age.15.to.19, age.20.to.24, age.25.to.29, age.30.to.44, age.45.to.59, age.60.to.64, age.65.to.74, age.75.to.84, age.85.to.89, age.above.90)
quantreg.all <- rq(Y ~ X, tau = seq(0.05, 0.95, by = 0.05), data=mydata)
quantreg.plot <- summary(quantreg.all)
plot(quantreg.plot)
But what I get is not what I expected, as the y-axis is not the NO2 data.
The ideal plot is attached:
Many thanks for your help and suggestions.
If I understand your question, I think the cut function combined with the quantile function will create the deciles. Here's an example with fake data.
In the code below, we use the cut function to split the data into deciles and we use the quantile function to set the breaks argument for cut. This tells cut to group the data into 10 groups of equal size, from smallest values of NO2 to largest.
group_by(age) means we create the deciles separately for each age group. This means that there are equal numbers of subjects within each decile in a given age group, but the NO2 cutoff values for each decile are different for different age groups. To create deciles over the data as a whole, just remove group_by(age). This will result in the same NO2 cutoff values for each decile across all age groups, but within a given age group, the number of subjects will not be the same in each decile.
library(tidyverse)
# Fake data
set.seed(2)
dat = data.frame(NO2=c(runif(600, 0, 10), runif(400, 1, 11)),
age=rep(c("0-10","11-20"), c(600,400)))
# Create decile groups
dat = dat %>%
group_by(age) %>%
mutate(decile = cut(NO2, breaks=quantile(NO2, probs=seq(0,1,0.1)),
labels=10:1, include.lowest=TRUE),
decile = fct_rev(decile))
Now we plot using ggplot2. The stat_summary function returns the mean for each decile in each age group.
ggplot(dat, aes(decile, NO2, colour=age, group=age)) +
stat_summary(fun=mean, geom="line") + # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
stat_summary(fun=mean, geom="point") +
expand_limits(y=0) +
theme_bw()
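The difference between per-group and overall deciles described above can be checked with a short base R sketch. It reuses the fake data from the answer; `decile_of()` is a hypothetical helper, and `ave()` stands in for `group_by()` plus `mutate()`.

```r
# Same fake data as in the answer above
set.seed(2)
dat <- data.frame(NO2 = c(runif(600, 0, 10), runif(400, 1, 11)),
                  age = rep(c("0-10", "11-20"), c(600, 400)))

# decile_of() is a hypothetical helper: it assigns each value to a decile 1..10
decile_of <- function(x) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)),
      labels = 1:10, include.lowest = TRUE)
}

# Deciles over the whole data set: one set of NO2 cut-offs for everyone
dat$decile_overall <- decile_of(dat$NO2)

# Deciles within each age group: cut-offs differ between groups
dat$decile_by_age <- ave(dat$NO2, dat$age,
                         FUN = function(x) as.integer(decile_of(x)))

table(dat$age, dat$decile_by_age)   # equal counts within each age group
table(dat$age, dat$decile_overall)  # counts vary within each age group
```

The first table shows equal group sizes per decile within each age group, while the second shows equal sizes only when the age groups are added together.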

RStudio one numeric variable against n numeric variables in n plots

I'm an EViews user, and EViews draws scatter plot matrices very easily.
In the following graph, I have 13 different data groups, and EViews plots one group against each of the other 12 in 12 panels of a single graph, each with a regression line.
How can I produce the same graph in RStudio?
Here is an example on how to do the requested plot in ggplot:
First some data:
z <- matrix(rnorm(1000), ncol= 10)
The basic idea here is to convert the wide matrix to long format where the variable that is compared to all others is duplicated as many times as there are other variables. Each of these other variables gets a specific label in the key column. ggplot likes the data in this format
library(tidyverse)
z %>%
as.data.frame() %>% #convert matrix to data.frame (columns are named V1..V10)
gather(key, value, 2:10) %>% #convert to long format specifying variable columns 2:10
mutate(key = factor(key, levels = paste0("V", 1:10))) %>% #specify levels so the facets go in the correct order to avoid V10 being before V2
ggplot() +
geom_point(aes(value, V1))+ #plot points
geom_smooth(aes(value, V1), method = "lm", se = FALSE)+ #plot lm fit without se
facet_wrap(~key) #facet by key
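To make the long format concrete, here is a small base R sketch of the same reshaping step, using `stack()` instead of `gather()`. The column names `value` and `key` follow the answer above; the seed is an arbitrary choice for this example.

```r
set.seed(1)
z <- as.data.frame(matrix(rnorm(1000), ncol = 10))  # columns V1..V10

# Stack columns V2..V10 on top of each other and repeat V1 alongside each one
long <- cbind(V1 = rep(z$V1, times = ncol(z) - 1), stack(z[-1]))
names(long) <- c("V1", "value", "key")

head(long)
nrow(long)  # 100 rows for each of the 9 comparison variables
```

Each row now pairs a value of `V1` with a value of one other variable, and `key` records which variable that is, which is the layout `facet_wrap(~key)` expects.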

Creating a histogram in R that shows the difference in the number of errors made by three groups

I have to create a histogram in RStudio showing the number of errors for three data groups: GroupA, GroupB and GroupC. Each of them has 4 variables, one of which is the "errors" variable, so it's accessed like GroupA$errors, etc.
How can I combine these 3 groups and make a plot in which the x-axis shows 3 bars (one per group) and the y-axis shows the number of errors?
dput: http://pastebin.com/vGEPDNFf
With your data:
myData <- data.frame(case = 1:48,
group = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
age = c(70,68,61,68,77,72,64,65,69,67,71,75,73,68,65,69,63,70,78,73,76,78,65,68,75,65,62,69,70,71,60,69,60,66,75,70,62,63,79,79,66,76,64,61,70,67,69,63),
errors = c(9,6,7,8,10,11,4,5,5,6,12,8,9,3,7,6,8,6,12,7,13,10,8,8,11,5,9,6,9,6,9,7,5,3,6,6,7,5,9,8,6,6,3,4,7,5,4,5))
Here is the code you have to run in R:
library(ggplot2)
library(dplyr)
myData %>%
group_by(group) %>%
summarize(total.errors=sum(errors)) %>%
ggplot(aes(x=factor(group), y=total.errors)) + geom_bar(stat = "identity")
It gives you the following figure:
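If the three groups start out as separate data frames, as in the question, they would first need to be combined into one data frame with a group label. The sketch below shows one way to do that in base R; the small `GroupA`, `GroupB` and `GroupC` data frames are made-up stand-ins, not the data from the pastebin link.

```r
# Made-up stand-ins for the three separate data frames
GroupA <- data.frame(errors = c(9, 6, 7))
GroupB <- data.frame(errors = c(8, 6, 12))
GroupC <- data.frame(errors = c(5, 3, 6))

# Add a group label to each data frame, then bind them together row-wise
groups <- list(A = GroupA, B = GroupB, C = GroupC)
combined <- do.call(rbind, Map(function(d, g) cbind(group = g, d),
                               groups, names(groups)))

# Total number of errors per group
totals <- tapply(combined$errors, combined$group, sum)
totals
```

The `combined` data frame has the same shape as `myData` above (one row per case, one `group` column), so it can be passed to the `group_by()` and `summarize()` pipeline unchanged, or `totals` can be given directly to `barplot()`.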
