I am doing some research on defaulters and non-defaulters in banking. In that context I am plotting their distributions relative to some score in a bar plot. The higher the score, the better the credit rating.
Since the number of defaults is very small compared to the number of non-defaults, plotting both on the same bar plot is not very informative, as you can hardly see the defaults. I therefore make a second bar plot based on the defaulters' scores only, but on the same interval scale as the full bar plot of both defaulters' and non-defaulters' scores. I would then like to add vertical lines to the first bar plot indicating where the highest and the lowest defaulter scores are located. The aim is to see where the distribution of the defaulters fits into the overall distribution of both defaulters and non-defaulters.
Below is the code I am using, with the real data replaced by (seeded) random data.
library(ggplot2)
#NDS represents non-defaults and DS defaults on the same scale
#although here being just some random normals for the sake of simplicity.
set.seed(10)
NDS<-rnorm(10000,sd=1)-2
DS<-rnorm(100,sd=2)-5
#Cutoffs are constructed such that intervals of size 0.3
#contain all values of NDS & DS
minCutoff<--9.3
maxCutoff<-2.1
#Generate the actual interval "bins"
NDS_CUT<-cut(NDS,breaks=seq(minCutoff, maxCutoff, by = 0.3))
DS_CUT<-cut(DS,breaks=seq(minCutoff, maxCutoff, by = 0.3))
#Manually generate where to put the vertical lines for min(DS) and max(DS)
minDS_bar<-levels(cut(NDS,breaks=seq(minCutoff, maxCutoff, by = 0.3)))[1]
maxDS_bar<-levels(cut(NDS,breaks=seq(minCutoff, maxCutoff, by = 0.3)))[32]
#Generate data frame - seems stupid, but makes sense
#when the "real" data is used :-)
NDSdataframe<-cbind(as.data.frame(NDS_CUT),rep(factor("State-1"),length(NDS_CUT)))
colnames(NDSdataframe)<-c("Score","Action")
DSdataframe<-cbind(as.data.frame(DS_CUT),rep(factor("State-2"),length(DS_CUT)))
colnames(DSdataframe)<-c("Score","Action")
fulldataframe<-rbind(NDSdataframe,DSdataframe)
attach(fulldataframe)
#Plot the full distribution of NDS & DS
# with geom_vline(xintercept = minDS_bar) + geom_vline(xintercept = maxDS_bar)
# that unfortunately does not show :-(
fullplot<-ggplot(fulldataframe, aes(Score, fill=factor(Action,levels=c("State-2","State-1")))) + geom_bar(position="stack") + opts(axis.text.x = theme_text(angle = 45)) + opts (legend.position = "none") + xlab("Scoreinterval") + ylab("Antal pr. interval") + geom_vline(xintercept = minDS_bar) + geom_vline(xintercept = maxDS_bar)
#Generate dataframe for DS only
#It might seem stupid, but again makes sense
#when using the original data :-)
DSdataframe2<-cbind(as.data.frame(DS_CUT),rep(factor("State-2"),length(DS_CUT)))
colnames(DSdataframe2)<-c("theScore","theAction")
#Calculate max number of observations to adjust bar plot of DS only
myMax<-max(table(DSdataframe2))+1
attach(DSdataframe2)
#Generate bar plot of DS only
subplot<-ggplot(DSdataframe2, aes(theScore, fill=factor(theAction))) +
geom_bar(position="stack") +
opts(axis.text.x = theme_text(angle = 45)) +
opts(legend.position = "none") +
ylim(0, myMax) +
xlab("Scoreinterval") +
ylab("Antal pr. interval")
#plot on a grid
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 1)))
vplayout <- function(x, y)
viewport(layout.pos.row = x, layout.pos.col = y)
print(fullplot, vp = vplayout(1, 1))
print(subplot, vp = vplayout(2, 1))
#detach dataframes
detach(DSdataframe2)
detach(fulldataframe)
Furthermore, does anybody have an idea of how I can align the two plots so that the corresponding intervals sit directly below/above each other on the grid plot?
Hope somebody is able to help!
Thanks in advance,
Christian
Wrap aes around the xintercept in the geom_vline layer:
... + geom_vline(aes(xintercept = minDS_bar)) + geom_vline(aes(xintercept = maxDS_bar))
Question 1:
Since you provide the vertical lines as data, you have to map the aesthetics first, using aes():
fullplot <-ggplot(
fulldataframe,
aes(Score, fill=factor(Action,levels=c("State-2","State-1")))) +
geom_bar(position="stack") +
opts(axis.text.x = theme_text(angle = 45)) +
opts (legend.position = "none") +
xlab("Scoreinterval") +
ylab("Antal pr. interval") +
geom_vline(aes(xintercept = minDS_bar)) +
geom_vline(aes(xintercept = maxDS_bar))
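In the question, minDS_bar and maxDS_bar are picked by hard-coded level index ([1] and [32]). A minimal sketch of deriving the two bin labels from the defaulter scores themselves, assuming the same breaks as in the question; the four DS values are made up for illustration:

```r
# Same breaks as in the question: bins of width 0.3 from -9.3 to 2.1
breaks <- seq(-9.3, 2.1, by = 0.3)

# Hypothetical defaulter scores (a stand-in for the real DS vector)
DS <- c(-7.9, -5.2, -4.4, -1.3)

# All bin labels, in axis order
bins <- levels(cut(DS, breaks = breaks))

# Labels of the bins containing the lowest and highest defaulter score
minDS_bar <- as.character(cut(min(DS), breaks = breaks))
maxDS_bar <- as.character(cut(max(DS), breaks = breaks))

# Their positions among all bins (only valid if empty bins are kept on the axis)
ds_positions <- match(c(minDS_bar, maxDS_bar), bins)
```

minDS_bar and maxDS_bar computed this way can be passed to the geom_vline() calls in place of the hard-coded levels.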
Second question:
To align the plots, you can use the align.plots() function in package ggExtra
install.packages("dichromat")
install.packages("ggExtra", repos="http://R-Forge.R-project.org")
library(ggExtra)
ggExtra::align.plots(fullplot, subplot)
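In case align.plots() is not available in your installation, here is a rough sketch of the same idea using only ggplot2 and grid. It assumes a current ggplot2 (theme() and element_text() rather than opts() and theme_text()) and uses toy data; make_plot is a hypothetical helper. The key points are keeping empty bins on both axes with drop = FALSE and stacking the plots as grobs so they share column widths:

```r
library(ggplot2)
library(grid)

set.seed(10)
breaks <- seq(-12, 6, by = 0.3)  # assumed wide enough for the toy data
# Toy scores, clipped so that every value falls inside the breaks
NDS <- pmin(pmax(rnorm(1000) - 2, -11.9), 5.9)
DS  <- pmin(pmax(rnorm(100, sd = 2) - 5, -11.9), 5.9)

full <- data.frame(Score = cut(c(NDS, DS), breaks = breaks))
sub  <- data.frame(Score = cut(DS, breaks = breaks))

make_plot <- function(d) {
  ggplot(d, aes(Score)) +
    geom_bar() +
    scale_x_discrete(drop = FALSE) +  # keep empty bins so both axes match
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Stack the two plots as grobs; size = "first" makes the second plot
# reuse the column widths of the first, so the panels line up
stacked <- rbind(ggplotGrob(make_plot(full)), ggplotGrob(make_plot(sub)),
                 size = "first")
grid.newpage()
grid.draw(stacked)
```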
So I have a dataset of performance scores with an associated difficulty value, and I want to display the average performance score per difficulty value. The difficulty values range from 0 to 10, but have up to 10 decimal places and as a result are hyper-specific. To make this more legible, I've been grouping the difficulty scores into bins. I've done this at two different resolutions: bins of width 0.1 and bins of width 1.
What I would like to do is display a line plot (using the finer data points) on top of a bar plot (using the wider resolution), but I want the bar plot to maintain its structure. Right now, when I try to overlay the line plot, the x-axis seems to scale to the line plot, and the bars end up extremely narrow.
Here's the bar plot code:
g1.4 = ggplot() +
geom_bar(data = grouped_diff_wide, aes(y=mean_diff_perf, x=gr, fill=subject), stat = "identity" )+
facet_wrap(~subject)+
ggtitle("Average Performance By Difficulty")+
labs(fill = "Subject")+
ylab("Performance")+
xlab("Difficulty")+
scale_x_discrete(breaks = diff_breaks_wide, labels = seq(0, 9, 1))
g1.4
And the resulting graph: [image: just the bar plot]
Here's the line plot code:
g1.5 = ggplot() +
geom_line(data = grouped_diff_fine, aes(y=mean_diff_perf, x = gr, group = 1))+
facet_wrap(~subject)+
ggtitle("Average Performance By Difficulty")+
labs(fill = "Subject")+
ylab("Performance")+
xlab("Difficulty")+
scale_x_discrete(breaks = diff_breaks_fine, labels = seq(0, 9, 1))
g1.5
And the resulting graph: [image: just the line plot]
And here's my attempt to combine them:
g1.6 = ggplot() +
geom_bar(data = grouped_diff_wide, aes(y=mean_diff_perf, x=gr, fill=subject), stat = "identity" )+
geom_line(data = grouped_diff_fine, aes(y=mean_diff_perf, x = gr, group = 1))+
facet_wrap(~subject)+
ggtitle("Average Performance By Difficulty")+
labs(fill = "Subject")+
ylab("Performance")+
xlab("Difficulty")+
scale_x_discrete(breaks = diff_breaks_fine, labels = seq(0, 9, 1))
g1.6
And how it turns out: [image: combined plot with skinny bars]
Is there a way to maintain the proportions of the standalone bar plot but with the line plot overlaid?
You can use the width parameter of geom_bar. As a very simple example using the built-in mtcars data:
ggplot(mtcars, aes(x = mpg, y = disp)) +
geom_bar(stat = "identity", width = 1.1) +
geom_line(colour = "blue", size = 2)
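An alternative sketch, not taken from the answer above: keep both resolutions on one continuous x-axis by using numeric bin midpoints, so the bar width is expressed in data units and does not shrink when the finer line is added. The data and column names (difficulty, perf, mid_wide, mid_fine) are hypothetical stand-ins for the data in the question:

```r
library(ggplot2)

# Hypothetical data: difficulty in [0, 10), performance with some noise
set.seed(1)
d <- data.frame(difficulty = runif(2000, 0, 10))
d$perf <- 1 - d$difficulty / 10 + rnorm(2000, sd = 0.1)

# Numeric bin midpoints at the two resolutions (width 1 and width 0.1)
d$mid_wide <- floor(d$difficulty) + 0.5
d$mid_fine <- floor(d$difficulty * 10) / 10 + 0.05

# Mean performance per bin
wide <- aggregate(perf ~ mid_wide, data = d, FUN = mean)
fine <- aggregate(perf ~ mid_fine, data = d, FUN = mean)

# Bars of width 1 (in data units) with the finer line drawn on top
p <- ggplot() +
  geom_col(data = wide, aes(x = mid_wide, y = perf), width = 1, fill = "grey70") +
  geom_line(data = fine, aes(x = mid_fine, y = perf)) +
  scale_x_continuous(breaks = 0:10)
print(p)
```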
I'd like to use facet_zoom, but for some reason the zoomed area comes out empty.
The two data sets I use are just numeric vectors of 1,000,000 numbers generated from a modified polynomial distribution. In the zoomed area there is a small spike that I'd like to show.
prova <-readRDS("probcond1.rds")
prova1 <-readRDS("probpoly.rds")
dfGamma <-data.frame(prova)
ggplot(dfGamma, aes(x=prova)) + stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3)
g <- ggplot(dfGamma, aes(x=prova)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300,1000,4000,5000), trans="log1p", expand=c(0,0)) +
theme_bw()
g+expand_limits(x = c(1, 6000)) +facet_zoom(xlim = c(4000,5000))
I'm really new to R; sorry for my ignorance.
Your axis is on a log1p scale, so your xlim should be wrapped inside log1p to do a zoom. You can do as follows:
g+expand_limits(x = c(1, 6000)) +facet_zoom(xlim = c(log1p(4000),log1p(5000)))
Here is a sample using the mtcars dataset.
library(ggplot2)
library(ggforce)
g <- ggplot(mtcars, aes(x=hp)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300), trans="log1p", expand=c(0,0)) +
theme_bw()
Using facet_zoom(xlim = c(100,300)) as follows produces an empty zoom panel, because the raw values 100 and 300 do not exist on the transformed x-axis of g:
g+expand_limits(x = c(1, 300)) +facet_zoom(xlim = c(100,300))
Output-1 (flat value zoom)
If you transform the xlim using log1p, you can zoom on the corresponding values of the x-axis of plot g. You can do that as follows:
g+expand_limits(x = c(1, 300)) +facet_zoom(xlim = c(log1p(100),log1p(300)))
Output-2 (log1p zoom)
If you want to zoom in the axis independently, you can do as follows:
g+expand_limits(x = c(1, 300)) +facet_zoom(xlim = c(log1p(100),log1p(300)), ylim = c(5,10), split = TRUE)
Output
As you can see, I zoomed ylim to between 5 and 10. Setting split = TRUE creates a separate zoom panel for each zoomed axis; if you want just one view, leave split at its default value FALSE. The ggforce package manual has a lot more information, which you might want to consult.
Hope that helps.
I've made a violin plot that looks like this:
As we can see most of the data lies near the region where the score is 0.90-0.95. What I wish is to focus on the interval 0.75 to 1.00 by changing the scale giving less space to ratings from 0 to 0.75.
Is there a way to do this?
This is the code I'm currently using to create the violin plot:
ggplot(data=Violin_plots, aes(x = Year, y = Score)) +
geom_violin(aes(fill = Violin_plots$Year), trim = TRUE) +
coord_flip()+
scale_fill_brewer(palette = "Blues") +
theme(legend.position = 'none') +
labs(y = "Rating score",
fill = "Rating year",
title = "Violin-plots of credit rating scores")
While it's possible to transform the scale to focus more on the upper region (e.g. add trans = "exp" as an argument to the scale), a non-linear scale is often hard to interpret appropriately.
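As a rough back-of-the-envelope check of what such a transform buys, the sketch below computes the share of the axis length that the 0.75 to 1 range would occupy, assuming scores lie in [0, 1]. axis_share is a hypothetical helper, not part of ggplot2:

```r
# Share of a [0, 1] axis occupied by [lo, hi] after applying transform f
axis_share <- function(lo, hi, f = identity) {
  (f(hi) - f(lo)) / (f(1) - f(0))
}

linear_share <- axis_share(0.75, 1)                        # untransformed axis
exp_share    <- axis_share(0.75, 1, exp)                   # natural exp transform
exp10_share  <- axis_share(0.75, 1, function(x) 10^x)      # base-10 exp transform
```

Under these assumptions the natural exp transform only raises the share from a quarter to roughly a third of the axis, so the gain is modest compared with the loss in readability.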
For such use cases, I recommend facet_zoom from the ggforce package, which is pretty much built for this exact purpose (see the ggforce vignette).
I also switched from geom_violin() + coord_flip() to geom_violinh from the ggstance package, which extends ggplot2 by providing flipped versions of ggplot components. Example with simulated data below:
library(ggforce) # for facet_zoom
library(ggstance) # for flipped version of geom_violin
ggplot(df,
aes(x = rating, y = year, fill = year)) +
geom_violinh() + # no need to specify trim = TRUE as it's the default
scale_fill_brewer(palette = "Blues") +
theme(legend.position = 'none') +
facet_zoom(xlim = c(0.75, 0.98)) # specify zoom range here
Sample data that simulates the characteristics of the data in the question:
df <- diamonds[, c("color", "price")]
df$rating <- (max(df$price) - df$price) / max(df$price)
df$year <- df$color
You could create a second plot that zooms in on the original plot, without modifying the data, by setting limits on the coordinate system. Because the plot is flipped, pass the limits to coord_flip(); adding coord_cartesian() after coord_flip() would replace the flip:
ggplot(data=Violin_plots, aes(x=Year,y=Score)) +
geom_violin(aes(fill=Year),trim=TRUE) +
coord_flip(ylim = c(0.75, 1.00)) +
scale_fill_brewer(palette="Blues") +
theme(legend.position='none') +
labs(y="Rating score",fill="Rating year",title="Violin-plots of credit rating scores")
I want to create a graph that looks something like this:
However, I would like to incorporate density based on the connected lines (and not individual plot points, as the graph above using geom_density_2d does). The data, in reality, looks something like this:
Where I am showing gene expression over a 4-point time series (y = gene expression value, x = time) In both examples, the centre line was created using LOESS curve fitting.
How can I create a density or contour plot based on the actual individual connecting lines that span from time=1 to time=4?
This is what I have done so far:
# make a dataset
test <- data.frame(gene=rep(c((1:500)), each=4),
time=rep(c(1:4), 125),
value=rep(c(1,2,3,1), 125))
# add random noise to dataset
test$value <- jitter(test$value, factor=1,amount=2)
# first graph created as follows:
ggplot(data=test, aes(x=time, y=value)) +
geom_density_2d(colour="grey") +
scale_x_continuous(limits = c(0,5),
breaks = seq(1,4),
minor_breaks = seq(1)) +
scale_y_continuous(limits = c(-3,8)) +
guides(fill=FALSE) +
theme_classic()
# second plot created as follows
ggplot(test, aes(time, value)) +
geom_line(aes(group = gene),
size = 0.5,
alpha = 0.3,
color = "snow3") +
geom_point() +
scale_y_continuous(limits = c(-3, 8)) +
scale_x_continuous(breaks = seq(1,4), minor_breaks = seq(1)) +
theme_classic()
Thanks in advance for your help!
I want to create the following histogram with a density curve using ggplot2. The "normal" way (base packages) is really easy:
set.seed(46)
vector <- rnorm(500)
breaks <- quantile(vector,seq(0,1,by=0.1))
labels = 1:(length(breaks)-1)
den = density(vector)
hist(vector,
breaks=breaks,
col=rainbow(length(breaks)),
probability=TRUE)
lines(den)
With ggplot I have got this far:
seg <- cut(vector,breaks,
labels=labels,
include.lowest = TRUE, right = TRUE)
df = data.frame(vector=vector,seg=seg)
ggplot(df) +
geom_histogram(breaks=breaks,
aes(x=vector,
y=..density..,
fill=seg)) +
geom_density(aes(x=vector,
y=..density..))
But the "y" scale is wrong. I have noticed that the following version, without the fill, gets the "y" scale right:
ggplot(df) +
geom_histogram(breaks=breaks,
aes(x=vector,
y=..density..)) +
geom_density(aes(x=vector,
y=..density..))
I just do not understand it. y=..density.. is there, and that should be the height. So why on earth does my scale get modified when I try to fill it?
I do need the colours. I just want a histogram where the breaks and the colours of each block are directionally set according to the default ggplot fill colours.
Manually, I added colors to your percentile bars. See if this works for you.
library(ggplot2)
ggplot(df, aes(x=vector)) +
geom_histogram(breaks=breaks,aes(y=..density..),colour="black",fill=c("red","orange","yellow","lightgreen","green","darkgreen","blue","darkblue","purple","pink")) +
geom_density(aes(y=..density..)) +
scale_x_continuous(breaks=c(-3,-2,-1,0,1,2,3)) +
ylab("Density") + xlab("df$vector") + ggtitle("Histogram of df$vector") +
theme_bw() + theme(plot.title=element_text(size=20),
axis.title.y=element_text(size = 16, vjust=+0.2),
axis.title.x=element_text(size = 16, vjust=-0.2),
axis.text.y=element_text(size = 14),
axis.text.x=element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
fill=seg results in grouping. You are actually getting a different histogram for each value of seg. If you don't need the colours, you could use this:
ggplot(df) +
geom_histogram(breaks=breaks,aes(x=vector,y=..density..), position="identity") +
geom_density(aes(x=vector,y=..density..))
If you need the colours, it might be easiest to calculate the density values outside of ggplot2.
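A minimal sketch of that idea, assuming the same seed and decile breaks as in the question: hist(plot = FALSE) computes one density per bin, and geom_rect() draws the bins with one fill colour each. The bars data frame and its column names are made up for this example:

```r
library(ggplot2)

set.seed(46)
vector <- rnorm(500)
breaks <- unname(quantile(vector, seq(0, 1, by = 0.1)))

# Let hist() compute the per-bin densities without drawing anything
h <- hist(vector, breaks = breaks, plot = FALSE)
bars <- data.frame(
  xmin    = head(h$breaks, -1),
  xmax    = tail(h$breaks, -1),
  density = h$density,
  seg     = factor(seq_along(h$density))
)

# One rectangle per decile bin, coloured by bin, with the density curve on top
p <- ggplot() +
  geom_rect(data = bars,
            aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = density, fill = seg)) +
  geom_density(data = data.frame(vector = vector), aes(x = vector))
print(p)
```

Because the densities are computed once over the whole sample, the fill no longer splits the data into groups, and the bar areas (height times width) sum to 1.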
Or an option with ggpubr
library(ggpubr)
gghistogram(df, x = "vector", add = "mean", rug = TRUE, fill = "seg",
palette = c("#00AFBB", "#E7B800", "#E5A800", "#00BFAB", "#01ADFA",
"#00FABA", "#00BEAF", "#01AEBF", "#00EABA", "#00EABB"), add_density = TRUE)
The confusion in interpreting the y-axis might be because density is plotted rather than count. The values on the y-axis are densities: the area of each bar (height times bin width) is the proportion of the total sample in that bin, and the areas of all bars sum to 1.