How to rescale y axis to see proportional amounts? - r

I have a data set of around 70k observations. I want to plot them with a factor of five (or more) levels on the x axis and facet the plot by three levels of severity.
The main problem is that the majority of observations fall into one severity level (severity = 3), so I can't read the other two panels. ylim doesn't help, because it changes the results completely instead of turning them into percentages.
Should I do the separation myself, or is there a command that could do it for me?
I am attaching an image below to make my problem clearer.
I want to judge each factor based on severity.
Here is a sample of the code.
acc.10 <- read.csv("Accidents2010.csv")
install.packages("ggplot2")
library(ggplot2)
install.packages("stringr")
library(stringr)
acc.10$Road_Type <- as.factor(acc.10$Road_Type)
acc.10$X1st_Road_Class <- as.factor(acc.10$X1st_Road_Class)
ggplot(acc.10, aes(x = Road_Type )) +
geom_bar(width = 0.4) +
ggtitle("Accidents based on Road Type") +
xlab("Road Type")
ggplot(acc.10, aes(x = X1st_Road_Class)) +
geom_bar(width = 0.4) +
ggtitle("Accidents based on 1st Road Class") +
xlab("1st Road Class")
data.10 <- acc.10[which(acc.10$X1st_Road_Class == 3),]
#we will check light conditions in order to
data.10$Light_Conditions <- as.factor(data.10$Light_Conditions)
#we plot to see the distribution
ggplot(data.10, aes(x = Light_Conditions)) +
geom_bar(width = 0.5) +
ggtitle("Accidents based on Light Conditions") +
xlab("Light Conditions")
ggplot(data.10[which(as.numeric(data.10$Accident_Severity) == 3),]
, aes(x = Light_Conditions)) +
geom_bar(width = 0.5) +
ggtitle("Accidents based on Light Conditions") +
xlab("Light Conditions")
#We drill harder to see if there are connections of survivability
data.10$Accident_Severity <- as.factor(data.10$Accident_Severity)
ggplot(data.10, aes(x = Light_Conditions, fill = Accident_Severity)) +
geom_bar(width = 0.5) +
ggtitle("Accidents based on Light Conditions and Survivability") +
xlab("Light Conditions")
# We will try to wrap them based on severity instead of the bar graph
ggplot(data.10, aes (x = Light_Conditions)) +
geom_bar(width = 0.5) +
ggtitle("Accident seperated by severity affected of Light Conditions") +
facet_wrap(~Accident_Severity) +
xlab("Light Conditions") +
ylab("Total Count")
And the file with data is here: https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data/datafile/4c03ef8d-992d-44df-8543-412d23f3661b/preview

Thanks a lot to @Peter K, his solution worked.
The y axis is not in percentages, but that does not really matter because the data are now clearly readable.
Here is the updated sample code:
ggplot(data.10, aes (x = Light_Conditions)) +
geom_bar(width = 0.5) +
ggtitle("Accident seperated by severity affected of Light Conditions") +
facet_wrap(~Accident_Severity, scales = 'free_y') +
xlab("Light Conditions") +
ylab("Total Count")
The command facet_wrap(~Accident_Severity, scales = 'free_y') solved the problem.
https://i.imgur.com/gyXV1EZ.png
The image is linked above because I don't have the reputation to post it. Thanks a lot again.
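If a true percentage axis is still wanted, one option is to compute the share of each light condition within each severity level before plotting. The following is a minimal sketch on made-up data: the data frame `toy` and its values are assumptions, and only the column names follow the question.

```r
# Made-up data standing in for data.10 (values are illustrative only)
toy <- data.frame(
  Accident_Severity = factor(c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3)),
  Light_Conditions  = factor(c(1, 4, 1, 1, 4, 1, 1, 1, 4, 4))
)

# Counts per severity (rows) and light condition (columns)
counts <- table(toy$Accident_Severity, toy$Light_Conditions)

# Row-wise proportions: the shares within each severity level sum to 1
props <- as.data.frame(prop.table(counts, margin = 1))
names(props) <- c("Accident_Severity", "Light_Conditions", "share")

# Plot the shares as percentages, one panel per severity.
# The guard only skips the plot when ggplot2 or scales is not installed.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(props, aes(x = Light_Conditions, y = share)) +
    geom_col(width = 0.5) +
    facet_wrap(~Accident_Severity) +
    scale_y_continuous(labels = scales::percent) +
    xlab("Light Conditions") +
    ylab("Share within severity")
  print(p)
}
```

Because every panel is normalised separately, all three panels share a common 0 to 100% scale and no free y axis is needed.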

Related

How to add percentages on top of an histogram when data is grouped

This is not my data (for confidentiality reasons), but I have tried to create a reproducible example using a dataset included in the ggplot2 library. I have a histogram summarizing the value of some variable by group (a factor with 2 levels). First, I did not want counts but proportions of the total, so I used this code:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>% as.data.frame() %>% filter(cut=="Premium" | cut=="Ideal")
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="Count") +
theme_bw() + theme(legend.position="none")
It gave me this as a result.
The issue is that I would like to print the numeric percentages on top of the bins and haven't found a way to do so.
As I saw it done for printing counts elsewhere, I attempted to print them using stat_bin(), including the same y and label values as the y in geom_histogram, thinking it would print the right numbers:
ggplot(df_example,aes(x=z,fill=cut)) +
geom_histogram(aes(y=after_stat(width*density)),binwidth=1,center=0.5,col="black") +
stat_bin(aes(y=after_stat(width*density),label=after_stat(width*density*100)),geom="text",vjust=-.5) +
facet_wrap(~cut) +
scale_x_continuous(breaks=seq(0,9,by=1)) +
scale_y_continuous(labels=scales::percent_format(accuracy=2,suffix="")) +
scale_fill_manual(values=c("#CC79A7","#009E73")) +
labs(x="Depth (mm)",y="%") +
theme_bw() + theme(legend.position="none")
However, it prints far more values than there are bins, the values do not match the bar heights, and they ignore vjust=-.5, which should place them slightly above the bars.
What am I missing here? I know that if there was no grouping variable/facet_wrap, I could use after_stat(count/sum(count)) instead of after_stat(width*density) and it seems that it would have fixed my issue. But I need the histograms for both groups to appear next to each other. Thanks in advance!
You have to use the same arguments in stat_bin as for the histogram when adding your labels to get same binning for both layers and to align the labels with the bars:
library(ggplot2)
library(dplyr)
df_example <- diamonds %>%
as.data.frame() %>%
filter(cut == "Premium" | cut == "Ideal")
ggplot(df_example, aes(x = z, fill = cut)) +
geom_histogram(aes(y = after_stat(width * density)),
binwidth = 1, center = 0.5, col = "black"
) +
stat_bin(
aes(
y = after_stat(width * density),
label = scales::number(after_stat(width * density), scale = 100, accuracy = 1)
),
geom = "text", binwidth = 1, center = 0.5, vjust = -.25
) +
facet_wrap(~cut) +
scale_x_continuous(breaks = seq(0, 9, by = 1)) +
scale_y_continuous(labels = scales::number_format(scale = 100)) +
scale_fill_manual(values = c("#CC79A7", "#009E73")) +
labs(x = "Depth (mm)", y = "%") +
theme_bw() +
theme(legend.position = "none")
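To see why width * density gives a per-group proportion, the same quantity can be checked outside ggplot2. This is a rough sketch using base R's hist() on made-up values (the vector `z` is an assumption, not the diamonds data), with breaks chosen to mimic binwidth = 1 and center = 0.5.

```r
set.seed(1)
z <- runif(200, min = 0, max = 9)   # made-up values, illustrative only

# Unit-wide bins from 0 to 9, comparable to binwidth = 1, center = 0.5
breaks <- seq(0, 9, by = 1)
h <- hist(z, breaks = breaks, plot = FALSE)

# Bin width times density ...
prop_from_density <- diff(h$breaks) * h$density

# ... equals the share of observations falling into each bin
prop_from_counts <- h$counts / sum(h$counts)

# Compare the two side by side, expressed as percentages
round(100 * cbind(prop_from_density, prop_from_counts), 1)
```

When the text layer does not receive the same binwidth and center, it falls back to its own default binning, which is why the labels in the question did not line up with the bars.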

How can I change the order and label of a panel in facet_wrap?

Folks-
I'm embarrassed to solicit advice for something that seems like it should be so easy, but my frustration outweighs my embarrassment. How can I change the order, label, and line color of a single panel in facet_wrap while using automatic ordering, labelling, and coloring for the other panels. Specifically, I would like to plot the "Bronx Cheer Rate" for the country Freedonia and each of its four states (Chico, Groucho, Harpo, and Zeppo, named for Freedonia's founding fathers), but making "Freedonia" the first panel in the graph and making its line black. This is what I have:
My (admittedly inelegant) solution is to:
1. Recode "Freedonia" as "aaa" (so it appears first).
2. Use a geom_line statement that subsets the data to "aaa" and changes the line color to black.
3. Change the label of the panel back to "Freedonia."
I'm fine until I get to the third step.
Here's some code with a reproducible (or is it replicable?) example:
library(dplyr)
library(ggplot2)
library(data.table)
#Simulate Data
set.seed(581)
state <- rep(c("Chico","Groucho","Harpo","Freedonia","Zeppo"), each=4)
x <- rep(1:4, times = 5)
y <- 100 + rnorm(20, 0, 5)*x + rnorm(20, 0, 20)
df <- cbind(state, x, y) %>% data.table() %>%
.[ , .(state, x = as.numeric(x), y = as.numeric(y))]
#Recode
df <- df[ , state := recode(state, "Freedonia" = "aaa")]
#Generate Labels
labels <- unique(df$state[which(state != "Freedonia")])
labels <- c("Freedonia", labels)
#Grid Plot with Freedonia First
p <- ggplot(df, aes(x, y, color = state)) +
geom_line() +
geom_line(data = subset(df, state == "aaa"), color = "black") +
ggtitle("Average Bronx Cheers by Quarter (1934)") +
theme_bw() +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Quarter") +
ylab("Bronx Cheer Rate") +
#facet_wrap(~ state)
#facet_wrap(~ state, labeller = labeller(state = labels))
facet_wrap(~ state, labeller = labeller(setNames(nm = labels)))
p
Here's the result.
I realize that with just five panels, it would be trivial to do this manually (with, say, scale_fill_manual), but you may have surmised that I'm not really interested in Freedonia, but, rather, in a state that has many counties--too many to do manually. I've looked not exhaustively, but thoroughly, and haven't seen anything that addresses this exact problem.
I'd be very grateful for your help.
Regards,
David
You could set the factor levels of state so that 'Freedonia' is the first level and the rest follow. This assumes the state column still contains 'Freedonia' rather than the recoded 'aaa'.
library(ggplot2)
df$state <- factor(df$state, levels = c('Freedonia',
setdiff(unique(df$state), 'Freedonia')))
ggplot(df, aes(x, y, color = state)) +
geom_line() +
geom_line(data = subset(df, state == "Freedonia"), color = "black") +
ggtitle("Average Bronx Cheers by Quarter (1934)") +
theme_bw() +
theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Quarter") +
ylab("Bronx Cheer Rate") +
facet_wrap(~ state)
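The reordering step can be checked on its own, without plotting. A small sketch, assuming a character vector `state` that still holds "Freedonia" (the vector here is made up for illustration); relevel() is shown as an alternative that gives the same order when the remaining levels are already alphabetical.

```r
# Illustrative state names in the order used by the question
state <- c("Chico", "Groucho", "Harpo", "Freedonia", "Zeppo")

# Put "Freedonia" first and keep the remaining values in order of appearance
state_f <- factor(state,
                  levels = c("Freedonia", setdiff(unique(state), "Freedonia")))
levels(state_f)

# Alternative: start from the default (alphabetical) levels and move one level up
state_f2 <- relevel(factor(state), ref = "Freedonia")
levels(state_f2)
```

facet_wrap() lays out panels in factor-level order, so no relabelling is needed afterwards, and the approach scales to many counties because the levels are built programmatically.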

dodge columns in ggplot2

I am trying to create a picture that summarises my data. The data is about the prevalence of drug use obtained from different practices from different countries. Each practice has contributed a different amount of data and I want to show all of this in my picture.
Here is a subset of the data to work on:
gr<-data.frame(matrix(0,36))
gr$drug<-c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b")
gr$practice<-c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r")
gr$country<-c("c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3","c1","c1","c1","c1","c1","c1","c1","c1","c1","c1","c2","c2","c2","c2","c2","c2","c3","c3")
gr$prevalence<-c(9.14,5.53,16.74,1.93,8.51,14.96,18.90,11.18,15.00,20.10,24.56,22.29,19.41,20.25,25.01,25.87,29.33,20.76,18.94,24.60,26.51,13.37,23.84,21.82,23.69,20.56,30.53,16.66,28.71,23.83,21.16,24.66,26.42,27.38,32.46,25.34)
gr$prop<-c(0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406,0.027,0.023,0.002,0.500,0.011,0.185,0.097,0.067,0.066,0.023,0.433,0.117,0.053,0.199,0.098,0.100,0.594,0.406)
gr$low.CI<-c(8.27,4.80,12.35,1.83,7.22,14.53,18.25,10.56,14.28,18.76,24.25,21.72,18.62,19.83,24.36,25.22,28.80,20.20,17.73,23.15,21.06,13.12,21.79,21.32,22.99,19.76,29.60,15.41,28.39,23.25,20.34,24.20,25.76,26.72,31.92,24.73)
gr$high.CI<-c(10.10,6.37,22.31,2.04,10.00,15.40,19.56,11.83,15.74,21.52,24.87,22.86,20.23,20.68,25.67,26.53,29.86,21.34,20.21,26.10,32.79,13.63,26.02,22.33,24.41,21.39,31.48,17.98,29.04,24.43,22.01,25.12,27.09,28.05,33.01,25.95)
The code I wrote is this
p<-ggplot(data=gr, aes(x=factor(drug), y=as.numeric(gr$prevalence), ymax=max(high.CI),position="dodge",fill=practice,width=prop))
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
p + theme_bw()+
geom_bar(stat="identity",position = position_dodge(0.9)) +
labs(x="Drug",y="Prevalence") +
geom_errorbar(ymax=gr$high.CI,ymin=gr$low.CI,position=position_dodge(0.9),width=0.25,size=0.25,colour="black",aes(x=factor(drug), y=as.numeric(gr$prevalence), fill=practice)) +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill=F)
The figure I obtain is this one where bars are all on top of each other while I want them "dodge".
I also obtain the following warning:
ymax not defined: adjusting position using y instead
Warning message:
position_dodge requires non-overlapping x intervals
Ideally I would get each bar near one another, with their error bars in the middle of its bar, all organised by country.
Also should I be concerned about the warning (which I clearly do not fully understand)?
I hope this makes sense. I hope I am close enough, but I don't seem to be going anywhere, some help would be greatly appreciated.
Thank you
ggplot's geom_bar() accepts a width aesthetic, but it does not line bars of varying width up neatly against one another when they are dodged. The following workaround references the solution here:
library(dplyr)
# calculate x-axis position for bars of varying width
gr <- gr %>%
group_by(drug) %>%
arrange(practice) %>%
mutate(pos = 0.5 * (cumsum(prop) + cumsum(c(0, prop[-length(prop)])))) %>%
ungroup()
x.labels <- gr$practice[gr$drug == "a"]
x.pos <- gr$pos[gr$drug == "a"]
ggplot(gr,
aes(x = pos, y = prevalence,
fill = country, width = prop,
ymin = low.CI, ymax = high.CI)) +
geom_col(col = "black") +
geom_errorbar(size = 0.25, colour = "black") +
facet_wrap(~drug) +
scale_fill_manual(values = c("c1" = "gray79",
"c2" = "gray60",
"c3" = "gray39"),
guide = "none") +
scale_x_continuous(name = "Drug",
labels = x.labels,
breaks = x.pos) +
labs(title = "Drug usage by country and practice", y = "Prevalence") +
theme_classic()
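The pos calculation in the code above places each bar at the midpoint between its left and right edge when bars of varying width are stacked side by side. A worked example with three made-up widths (the vector `prop` below is illustrative, not taken from the data):

```r
# Three illustrative bar widths that sum to 1
prop <- c(0.5, 0.3, 0.2)

# Right edge of each bar: running total of the widths
right_edge <- cumsum(prop)

# Left edge of each bar: running total shifted by one bar, starting at 0
left_edge <- cumsum(c(0, prop[-length(prop)]))

# Centre of each bar: midpoint of its two edges
pos <- 0.5 * (right_edge + left_edge)

data.frame(prop, left_edge, right_edge, pos)
```

Each bar then spans from pos - prop / 2 to pos + prop / 2, so neighbouring bars touch without overlapping.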
There is a lot of information you are trying to convey here. To contrast drug A and drug B across countries using bar plots while accounting for proportions, you might use the facet_grid function. Try this:
colour<-c(rep("gray79",10),rep("gray60",6),rep("gray39",2))
gr$drug <- paste("Drug", gr$drug)
p<-ggplot(data=gr, aes(x=factor(practice), y=as.numeric(prevalence),
ymax=high.CI,ymin = low.CI,
position="dodge",fill=practice, width=prop))
p + theme_bw()+ facet_grid(drug~country, scales="free") +
geom_bar(stat="identity") +
labs(x="Practice",y="Prevalence") +
geom_errorbar(position=position_dodge(0.9), width=0.25,size=0.25,colour="black") +
ggtitle("Drug usage by country and practice") +
scale_fill_manual(values = colour)+ guides(fill="none")
The bar widths are too small for country c1 and, as you indicated, the one clinic is quite influential.
Also, you can specify your aesthetics once in ggplot(aes(...)) without resetting them in each layer, and there is no need to include the data frame's name inside aes().

R: how to plot a line plot with obvious distinction between different time periods (line with dots)

I have data covering 14 different time periods, and I would like to plot it so that the viewer can see where the 14 periods lie. I used to achieve this with different colors:
mycolors = c(brewer.pal(name="Set2", n = 7), brewer.pal(name="Set2", n = 7))
ggplot(derv, aes(x=Date, y=derv, colour = Season)) +
geom_point() +
geom_abline(intercept = 0, slope = 0) +
geom_abline(intercept = neg.cut, slope = 0) +
geom_abline(intercept = pos.cut, slope = 0) +
scale_color_manual(values = mycolors) + ggtitle(" Derivative")+ylab("Derivative")
I used the above code to produce such a plot, but in a new report I can only use a black and white scheme, so I am wondering how to draw this plot in R. I have thought of alternating line types across the 14 time periods, but I do not know how to achieve that with ggplot. I have tried the following code, but it does not work: the line type stays the same.
ggplot(derv, aes(x=Date, y=derv)) +
geom_line() +
geom_abline(intercept = 0, slope = 0) +
geom_abline(intercept = neg.cut, slope = 0) +
geom_abline(intercept = pos.cut, slope = 0) +
#scale_color_manual(values = mycolors) + ggtitle("S&P 500 (Smoothed) Derivative") + ylab("Derivative")+
scale_linetype_manual(values = c("dashed","solid","dashed","solid","dashed","solid","dashed",
"solid","dashed","solid","dashed","solid","dashed","solid"))
If you need to show where season changes, couldn't you just use an alternating linetype or alternating point marker? See below for two examples. You can play around with different point markers and linetypes to get the look you want. For more on creating linetypes, see this SO answer. For more on additional point markers (beyond the standard one available through pch), see, for example, here and here. I've also included a way to add the three horizontal lines with less code.
# Fake data
x = seq(0,2*pi,length.out=20*14)
dat=data.frame(time=x, y=sin(x) + sin(5*x) + cos(2*x) + cos(7*x),
group=0:(length(x)-1) %/% 20)
ggplot(dat, aes(time, y)) +
geom_hline(yintercept=c(-0.5,0,0.5), colour="grey50") +
geom_point(aes(shape=factor(group), size=factor(group))) +
scale_shape_manual(values=rep(c(3,15),7)) +
scale_size_manual(values=rep(c(2,1.5),7)) +
theme_bw() + guides(shape="none", size="none")
ggplot(dat, aes(time, y, linetype=factor(group))) +
geom_hline(yintercept=c(-0.5,0,0.5), colour="grey50") +
geom_line(size=0.8) +
scale_linetype_manual(values=rep(1:2,7)) +
theme_bw() + guides(linetype="none")
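With real data the periods usually come as labels rather than a ready-made group index. A minimal sketch of deriving an alternating linetype from such labels follows; the `season` vector and the names `period` and `line_type` are made up for illustration. In the question's attempt the scale had no effect because no variable was mapped to linetype inside aes().

```r
# Made-up season labels: 14 periods, 3 observations each (illustrative only)
season <- rep(paste0("S", 1:14), each = 3)

# Number the periods in order of appearance: 1, 2, ..., 14
period <- as.integer(factor(season, levels = unique(season)))

# Odd periods get one linetype, even periods the other
line_type <- ifelse(period %% 2 == 1, "solid", "dashed")

head(data.frame(season, period, line_type), 6)
```

Mapping factor(period) to linetype and supplying rep(c("solid", "dashed"), 7) to scale_linetype_manual(), as in the answer's second plot, would then alternate the styles across the 14 periods.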

ggplot2: plotting error bars for groups without overlap

I wish to show the effect of two pollutants on the same outcome and was happy with the plot when there are no groups. Now when I want to plot the same data for all-year and stratified by season, I either get overlaps of error bars or three separate panels which are not optimal for my need.
Sample data could be accessed from here:
https://drive.google.com/file/d/0B_4NdfcEvU7LV2RrMjVyUmpoSDg/edit?usp=sharing
As an example with the following code I create a plot for all-year:
ally<-subset(df, seas=="allyear")
ggplot(ally,aes(x = set, y = pinc,ymin = lcinc, ymax =ucinc,color=pair,shape=pair)) +
geom_point(position=position_dodge(width=0.5) ,size = 2.5) +
geom_linerange(position=position_dodge(width=0.5), size =0.5) + theme_bw() +
geom_hline(aes(yintercept = 0)) +
labs(colour="Pollutant", shape="Pollutant", y="Percent Increase", x="") +
scale_x_discrete(labels=c(NO2=expression(NO[2]),
NOx=expression(NO[x]),
Coarse= expression(Coarse),
PM25=expression(PM[2.5]),
PM10=expression(PM[10]))) +
theme(plot.title = element_text(size = 12,face="bold" )) +
theme(axis.title=element_text(size="12") ,axis.text=element_text(size=12))
But when I add facet_grid(. ~ seas) I will have three separate panels. How can I display this data for all year and divided by seasons in one panel?
Either color or shape needs to be used to represent season, not pollutant.
Then this should come close to what you want:
library(ggplot2)
ggplot(df, aes(x = set, y = pinc,ymin = lcinc, ymax =ucinc,
color=seas, shape=pair)) +
geom_point(position=position_dodge(width=0.5), size = 2.5) +
geom_linerange(position=position_dodge(width=0.5), size =0.5) + theme_bw() +
geom_hline(aes(yintercept = 0)) +
labs(colour="Season", shape="Pollutant", y="Percent Increase", x="") +
scale_x_discrete(labels=c(NO2=expression(NO[2]),
NOx=expression(NO[x]),
Coarse= expression(Coarse),
PM25=expression(PM[2.5]),
PM10=expression(PM[10]))) +
theme(plot.title = element_text(size = 12,face="bold" )) +
theme(axis.title=element_text(size=12) ,axis.text=element_text(size=12))
I do think that facetting gives you better graphs here --
if you want to focus attention on the comparison between seasons for each pollutant, use this (facet_grid(~pair, labeller=label_both)):
if you want to focus attention on the comparison between pollutants for each season, use this (facet_grid(~seas, labeller=label_both)):

Resources