Suppose I have a data set called "data", which is generated through:
library(reshape2) # Reshape data, needed in command "melt"
library(ggplot2) # apply ggplot
density <-rep (0.05, each=800)
tau <-rep (0.05, each=800)
# define two different models: network and non-network
model <-rep(1:2, each=400, times=1)
## Create data and factors for the plot
df <- melt(rnorm(800, -3, 0.5))
data <- as.data.frame(cbind(density, tau, model, df$value))
data$density <- factor(data$density,levels=0.05,
labels=c("Density=0.05"))
data$tau <- factor(data$tau,levels=0.05,
labels=c("tau=0.05"))
data$model<- factor(data$model,levels=c(1,2),
labels=c("Yes",
"No"))
ggplot(data=data, aes(x=V4, shape=model, colour=model, lty=model)) +
stat_density(adjust=1, geom="line",position="identity") +
facet_grid(tau~density, scale="free") +
geom_vline(xintercept=-3, lty="dashed") +
ggtitle("Kernel Density") +
xlab("Data") +
ylab("Kernel Density") +
theme(plot.title=element_text(face="bold", size=17), # change font size of title
axis.text.x= element_text(size=14),
axis.text.y= element_text(size=14),
legend.title=element_text(size=14),
legend.text =element_text(size=12),
strip.text.x=element_text(size=14), # change font size of the column facet strip labels
strip.text.y=element_text(size=14)) # change font size of the row facet strip labels
Looking at the data, variable V4 is split into two subsets by model (Yes [1:400] and No [401:800]), and the kernel density is plotted without changing the original bandwidth, since adjust=1.
What I want to do is: for the Yes model, change the bandwidth to 10 times the original, but for the No model keep the bandwidth unchanged. Can I do something like adjust=c(10, 1)? I know how to do this with plot()+lines(), but I want to do it in ggplot() for further analysis.
I wouldn't recommend this, since it creates a very misleading plot, but you can do it with two calls to stat_density(...).
ggplot(data=data, aes(x=V4, shape=model, colour=model, lty=model)) +
stat_density(data=data[data$model=="Yes",], adjust=10,
geom="line",position="identity") +
stat_density(data=data[data$model=="No",], adjust=1,
geom="line",position="identity") +
facet_grid(tau~density, scale="free") +
geom_vline(xintercept=-3, lty="dashed") +
ggtitle("Kernel Density") +
xlab("Data") +
ylab("Kernel Density") +
theme(plot.title=element_text(face="bold", size=17),
axis.text.x= element_text(size=14),
axis.text.y= element_text(size=14),
legend.title=element_text(size=14),
legend.text =element_text(size=12),
strip.text.x=element_text(size=14),
strip.text.y=element_text(size=14))
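As a cross-check on what adjust does, here is a minimal sketch that computes each group's curve with base R's density() and draws the result with geom_line(). The data frame toy and the lookup vector adjust_by_model are hypothetical stand-ins, not part of the question's code. Note that density() extends each curve a little beyond the data range, so the result may not be pixel-identical to stat_density().

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the question's data: 400 "Yes" and 400 "No" values
toy <- data.frame(
  V4 = rnorm(800, -3, 0.5),
  model = factor(rep(c("Yes", "No"), each = 400), levels = c("Yes", "No"))
)

# Per-group bandwidth multipliers (the values asked about in the question)
adjust_by_model <- c(Yes = 10, No = 1)

# Run density() once per group, each time with that group's adjust value
curves <- do.call(rbind, lapply(names(adjust_by_model), function(m) {
  d <- density(toy$V4[toy$model == m], adjust = adjust_by_model[[m]])
  data.frame(model = m, x = d$x, y = d$y, bw = d$bw)
}))
curves$model <- factor(curves$model, levels = levels(toy$model))

p <- ggplot(curves, aes(x = x, y = y, colour = model, linetype = model)) +
  geom_line() +
  geom_vline(xintercept = -3, linetype = "dashed") +
  labs(title = "Kernel Density", x = "Data", y = "Kernel Density")
print(p)
```

The bw column records the bandwidth actually used, which makes it easy to state in a caption that the two curves were smoothed differently.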
Related
I am using ggplot2 to plot a mixed-design dataset in a violin plot.
The data was collected over three sessions: Baseline (collected on Day 1), Post-training (collected on Day 3) and Follow-up (collected on Day 30) and two groups: (1) Active and (2) Sham. For the sessions I have a categorical factor called 'Session' with the labels: Baseline, Post-training and Follow-up which are plotted on the x-axis. (Please ignore the rough state of the draft plot and dummy data for demonstration purposes).
level_order <- factor(tidied_data$Session, levels = c('Baseline (Day 1)', 'Post-training (Day 3)', 'Follow-up (Day 30)'))
tidied_data %>%
ggplot(aes(x=level_order, y=Amplitude, fill=Group)) +
geom_violin(position=position_dodge(1), trim=FALSE) +
geom_jitter(binaxis='y', stackdir='center',
position=position_dodge(1)) +
stat_summary(fun = "mean", geom = "point",
size = 3, position=position_dodge(1), color="white") +
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width=0.3, position=position_dodge(1), color="white") +
theme_bw() + # removes background colour
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + # removes grid lines
theme(panel.border = element_blank()) + # removes border lines
theme(axis.line = element_line(colour = "black")) + # adds axis lines
labs(title = "Group x Session",
x = "Session",
y = "Amplitude")
I want to demonstrate to the viewer that there is a different time-course between Baseline (Day 1), Post-training (Day 3) and follow-up (Day 30), it's a 30-day scale essentially.
From previous threads I've seen that this isn't something that ggplot2 will handle well, since broken axes are generally considered questionable.
I've come across the package 'ggbreak', where you can use the function 'scale_x_break' or 'scale_y_break' to set an axis break on a continuous variable. This doesn't work for the three time-points, presumably because Session is a categorical factor.
Can anyone recommend a way that I can 'break' the axis to demonstrate the different length of time between the three sessions, or alternatively another way I could demonstrate this to the viewer? I've thought about adding custom spacing between bars, but I can only manage to set this to the same width for each bar, not different widths between different bars.
Any help would be greatly appreciated! Thanks in advance!
I can't recommend using a discontinuous axis for this, but you can use facets to visually indicate small multiples. Example with a standard dataset below:
library(ggplot2)
ggplot(mtcars, aes(factor(cyl), mpg, fill = factor(am))) +
geom_violin() +
facet_grid(~ cyl, scales = "free_x") +
theme_classic() +
theme(strip.text = element_blank()) # Hide strip text
Created on 2021-08-20 by the reprex package (v1.0.0)
I'm trying to figure out how to add legends to my R ggplot2 graphs, but clearly I'm not getting the syntax right.
# basic plot layout
ggplot() +
labs(x="random values", y="frequency", title="Examples for F-Test") +
theme_minimal() +
# histogram of distributions
geom_histogram(data=data.frame(random.data.1), aes(x=random.data.1), fill="forestgreen", color="grey", alpha=0.5, binwidth=0.5) +
geom_histogram(data=data.frame(random.data.2), aes(x=random.data.2), fill="orange", color="black", alpha=0.5, binwidth=0.5) +
# manual text annotations
annotate("text", x=10, y=5, label=paste("F-Test p-value =", signif(F.test[[3]], digits=3)), color="firebrick", fontface="bold") +
# add legend?
scale_color_manual(name="Distributions", values=c("grey", "black"))
ggplot2 usually works better if you concatenate your data into long-form columns, as I've done here, with one or more additional columns that indicate the variables or datasets that you want to use to group formatting options. In this case, since you wanted to split by dataset, I just used "1" and "2" for the fake datasets. That column should be a factor (if it's not, then R will assume that the variable is continuous). The command you are specifically looking for is guides(), I think.
Reshaping data can be done easily with either the "reshape2" package or the "tidyr" package. This post compares them.
library(ggplot2)
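random.data.1 = runif(10)
random.data.2 = runif(10)
F.test = var.test(random.data.1, random.data.2) # assumed: the question does not show how F.test was created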
df = data.frame(vals = c(random.data.1,random.data.2))
df$dset<-c(rep(1,10),rep(2,10)) #Indicates the dataset
df$dset<-factor(df$dset)
df
ggplot(data=df,aes(x=vals,color=dset,fill=dset,group=dset)) +
labs(x="random values", y="frequency", title="Examples for F-Test") +
#theme_minimal() +
# histogram of distributions (now you only need one line!)
geom_histogram(position="stack",alpha=0.5, binwidth=0.5) +
# manual text annotations
annotate("text", x=10, y=5, label=paste("F-Test p-value =", signif(F.test[[3]], digits=3)), color="firebrick", fontface="bold") +
# add legend?
#These lines set the colors
scale_color_manual(values=c("grey", "black")) +
scale_fill_manual(values=c("forest green","orange")) +
#and these set the legend manually
guides(color = guide_legend(title = "Distributions")) +
guides(fill="none") #don't show the fill legend (fill=FALSE is deprecated in current ggplot2)
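The reshaping step mentioned above can be sketched with tidyr's pivot_longer(). The data frame wide is a hypothetical example with one column per dataset. The output column names dset and vals are chosen only to match the answer's code.

```r
library(tidyr)

set.seed(1)
# Hypothetical wide data: one column per dataset, as in the question
wide <- data.frame(random.data.1 = runif(10), random.data.2 = runif(10))

# Stack both columns into one value column plus a column naming the source
long <- pivot_longer(wide, cols = everything(),
                     names_to = "dset", values_to = "vals")
long$dset <- factor(long$dset)  # a factor, so ggplot2 treats it as discrete

str(long)
```

With the data in this shape, a single geom_histogram() layer mapped to dset produces the legend automatically.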
Good morning,
I am making a heat map in ggplot of correlations between specific phenotypes. I would like to label each tile with the R^2 for the association.
I have a correlation matrix, max_all, which looks like this:
phenolist2 pheno1 pheno2 pheno3 pheno4 pheno5
max.pheno1 pheno1 0.05475998 0.05055959 0.05056578 0.10330301 0.05026997
max.pheno2 pheno2 0.15743312 0.05036100 0.05151750 0.04880302 0.31008809
max.pheno3 pheno3 0.05458550 0.07672537 0.04043422 0.16845294 0.14268895
max.pheno4 pheno4 0.05484327 0.04391523 0.05151107 0.09521869 0.19776296
max.pheno5 pheno5 0.08658449 0.05183693 0.16292683 0.22369817 0.53630569
Otherwise, my code is as follows:
tmp_Rsq <- melt(max_all)
tmp_Rsq <- ddply(tmp_Rsq, .(variable), transform, rescale=rescale(value))
labels_Rsq <- expression(paste(R^2, " = ", format(tmp_Rsq$value, digits=2), sep=""))
ggplot(tmp, aes(variable, phenolist2)) +
geom_tile(aes(fill =-log10(value)), colour = "white") +
geom_text(aes(label=as.character(labels_Rsq), parse = TRUE, size=4)) +
scale_fill_gradientn(colours = myPalette(101), name="-log10(P)", limits=c(0 , 3.5)) +
theme(axis.title.x = element_blank(), axis.title.y=element_blank(),
plot.title=element_text(size=20))+
theme(axis.text = element_text(colour="black", face="bold"))
My problem is that I can not get the expression to write out so that 2 is a superscript of R.
I realize there are a number of questions on this website addressing similar issues, for example ggplot2 two-line label with expression, Combining paste() and expression() functions in plot labels and Adding Regression Line Equation and R2 on graph but I have been unable to get the solutions suggested in these answers to apply to my case (likely because I have been trying to use a vector of labels).
Thanks a lot for your help.
Parse needs to be outside the aes, and the labels need to be a character vector.
labels_Rsq <- paste0("R^2 ==", format(tmp_Rsq$value, digits=2))
> head(labels_Rsq)
[1] "R^2 ==0.055" "R^2 ==0.157" "R^2 ==0.055" "R^2 ==0.055" "R^2 ==0.087" "R^2 ==0.051"
ggplot(tmp_Rsq, aes(variable, phenolist2)) +
geom_tile(aes(fill =-log10(value)), colour = "white") +
geom_text(aes(label=as.character(labels_Rsq)), parse = TRUE, size=4) +
# scale_fill_gradientn(colours = myPalette(101), name="-log10(P)", limits=c(0 , 3.5)) +
theme(axis.title.x = element_blank(), axis.title.y=element_blank(),
plot.title=element_text(size=20))+
theme(axis.text = element_text(colour="black", face="bold"))
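For a version that runs without the original max_all matrix, here is a small sketch on made-up data. The data frame toy and its values are hypothetical. The only change in technique is that the number is quoted inside the plotmath string, which is one way to keep trailing zeros after parsing.

```r
library(ggplot2)

set.seed(42)
# Hypothetical 3 x 3 grid standing in for the melted correlation matrix
toy <- expand.grid(variable = paste0("pheno", 1:3),
                   phenolist2 = paste0("pheno", 1:3))
toy$value <- runif(nrow(toy), min = 0.04, max = 0.5)

# One plotmath string per tile. The number is wrapped in quotes so that
# parsing does not drop trailing zeros (0.10 would otherwise print as 0.1).
toy$label <- sprintf('R^2 == "%.2f"', toy$value)

p <- ggplot(toy, aes(variable, phenolist2)) +
  geom_tile(aes(fill = value), colour = "white") +
  # parse = TRUE sits outside aes(); the labels are plain character strings
  geom_text(aes(label = label), parse = TRUE, size = 4)
print(p)
```

Keeping the label as a column of the plotted data frame, rather than a separate vector, also guards against the labels and tiles getting out of order.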
I have a dataframe of ~108m rows of data, in 7 columns. I use this R script to make a boxplot of it:
ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
scale_y_log10() +
stat_summary(fun.y=mean, geom="line", aes(group=1, colour="red")) +
ylab(expression(Exposure~to~NO[x])) +
xlab(expression(Hour~of~the~day)) +
ggtitle("Hourly exposure to NOx") +
theme(axis.text=element_text(size=12, colour="black"),
axis.title=element_text(size=12, colour="black"),
plot.title=element_text(size=12, colour="black"),
legend.position="none")
The graph looks like this. It's pretty much fine, however it would be better to have a value towards the top of the Y axis. I guess it should be something like 1000 given the Y axis is a log10 scale. I'm not sure how to do this though?
Any ideas please?
EDIT: In response to DrDom:
Try to add scale_y_log10(breaks=c(0,10,100,1000)). The output of doing that, is this:
The output of doing the following:
scale_y_log10(breaks=c(0,10,100,1000), limits=c(0,1000))
Is an error of:
Error in seq.default(dots[[1L]][[1L]], dots[[2L]][[1L]], length = dots[[3L]][[1L]]:
'from' cannot be NA, NaN or infinite
In response to Jaap, who suggested the following code:
library(ggplot2)
library(scales)
ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
stat_summary(fun.y=mean, geom="line", aes(group=1, colour="red")) +
scale_y_continuous(breaks=c(0,10,100,1000,3000), trans="log1p") +
labs(title="Hourly exposure to NOx", x=expression(Hour~of~the~day), y=expression(Exposure~to~NO[x])) +
theme(axis.text=element_text(size=12, colour="black"), axis.title=element_text(size=12, colour="black"),
plot.title=element_text(size=12, colour="black"), legend.position="none")
It produces this graph. Have I done something wrong? I'm still missing a '1000' tick label. A tick in between the 10 and the 100 would also be good, given that is where most of the data is.
You can modify your log scale by passing breaks= to scale_y_log10(). The breaks must not include 0, though, because the log is also taken of the break values and log10(0) is -Inf.
df<-data.frame(x=1:10000,y=1:10000)
ggplot(df,aes(x,y))+geom_line()+
scale_y_log10(breaks=c(1,5,10,85,300,5000))
Instead of using scale_y_log10 you can also use scale_y_continuous together with a log transformation from the scales package. When you use the log1p transformation, you are also able to include a 0 in your breaks: scale_y_continuous(breaks=c(0,1,3,10,30,100,300,1000,3000), trans="log1p")
Your complete code will then look like this (notice that I also combined the title arguments in labs):
library(ggplot2)
library(scales)
ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
stat_summary(fun=mean, geom="line", aes(group=1, colour="red")) + # fun.y is deprecated in ggplot2 >= 3.3.0
scale_y_continuous(breaks=c(0,1,3,10,30,100,300,1000,3000), trans="log1p") +
labs(title="Hourly exposure to NOx", x=expression(Hour~of~the~day), y=expression(Exposure~to~NO[x])) +
theme(axis.text=element_text(size=12, colour="black"), axis.title=element_text(size=12, colour="black"),
plot.title=element_text(size=12, colour="black"), legend.position="none")
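The difference between the two transformations at zero can be checked directly. The sketch below uses made-up data (toy is a hypothetical stand-in for expanded_results). It also sets limits, on the assumption that a tick such as 1000 goes missing when the data do not reach that far.

```r
library(ggplot2)

log10(0)   # -Inf: a break at 0 cannot be placed on a log10 axis
log1p(0)   # 0:    log1p(x) = log(1 + x) is finite at zero

set.seed(3)
# Hypothetical hourly data standing in for expanded_results
toy <- data.frame(hour = factor(rep(0:23, each = 50)),
                  dynamic_nox = rlnorm(24 * 50, meanlog = 3, sdlog = 1))

my_breaks <- c(0, 1, 3, 10, 30, 100, 300, 1000, 3000)

p <- ggplot(toy, aes(hour, dynamic_nox)) +
  geom_boxplot() +
  # limits stretch the axis to 3000 even if no observation is that large
  scale_y_continuous(breaks = my_breaks, trans = "log1p",
                     limits = c(0, 3000))
print(p)
```

Recent ggplot2 releases prefer the argument name transform over trans. The older spelling is kept here to match the answer.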
I have a time-series that I'm examining for data heterogeneity, and wish to explain some important facets of this to some data analysts. I have a density histogram overlaid with a KDE plot (in order to see both plots). However, the original data are counts, and I want to place the count values as labels above the histogram bars.
Here is some code:
tix_hist <- ggplot(tix, aes(x=Tix_Cnt)) +
geom_histogram(aes(y = ..density..), colour="black", fill="orange", binwidth=50) +
xlab("Bin") + ylab("Density") + geom_density(aes(y = ..density..),fill=NA, colour="blue") +
scale_x_continuous(breaks=seq(1,1700,by=100))
# opts() and theme_text() have been removed from ggplot2; use ggtitle(), theme() and element_text()
tix_hist + ggtitle("Ticket Density To-Date") + theme(
plot.title = element_text(face="bold", size=18),
axis.title.x = element_text(face="bold", size=16),
axis.title.y = element_text(face="bold", size=14, angle=90),
axis.text.x = element_text(face="bold", size=14),
axis.text.y = element_text(face="bold", size=14)
)
I thought about extrapolating count values using the KDE bandwidth, etc. Is it possible to put the numeric output of a ggplot frequency histogram into a data frame and add this as a 'layer'? I'm not savvy with the layer() function yet, but any ideas would be helpful. Many thanks!
If you want the y-axis to show the bin counts and also want a density curve on the same histogram,
first draw geom_histogram() and note the binwidth value (this is very important!), then add a geom_density() layer, scaled by that binwidth, to show the fitted curve.
If you don't know how to choose the binwidth value, you can calculate one:
my_binwidth = (max(tix$Tix_Cnt)-min(tix$Tix_Cnt))/30;
(This is roughly what geom_histogram does by default, since it uses 30 bins.)
The code is given below
(suppose the binwidth value you just calculated is 0.001):
tix_hist <- ggplot(tix, aes(x=Tix_Cnt)) ;
tix_hist<- tix_hist + geom_histogram(aes(y=..count..),colour="blue",fill="white",binwidth=0.001);
tix_hist<- tix_hist + geom_density(aes(y=0.001*..count..),alpha=0.2,fill="#FF6666",adjust=4);
print(tix_hist);
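To place the counts as labels above the bars, as the question originally asked, one option is a second stat_bin() layer drawn with geom = "text". This is a sketch on made-up data: tix_toy and the binwidth of 50 are assumptions, and after_stat() requires ggplot2 3.3.0 or later (older code would write ..count..).

```r
library(ggplot2)

set.seed(7)
# Hypothetical ticket counts standing in for tix$Tix_Cnt
tix_toy <- data.frame(Tix_Cnt = round(rgamma(500, shape = 2, scale = 200)))
bw <- 50  # assumed binwidth; it must be the same in every layer below

p <- ggplot(tix_toy, aes(x = Tix_Cnt)) +
  geom_histogram(binwidth = bw, colour = "black", fill = "orange") +
  # Same binning as the histogram, drawn as text just above each bar
  stat_bin(aes(label = after_stat(count)), geom = "text",
           binwidth = bw, vjust = -0.5, size = 3) +
  # Density rescaled to the count axis: binwidth * n * density
  geom_density(aes(y = bw * after_stat(count)), colour = "blue")
print(p)

# The binned counts are also available as a data frame
binned <- layer_data(p, 2)
head(binned[, c("x", "count")])
```

layer_data() returns the values computed for a layer, which covers the "data frame the numeric output" part of the question without calling layer() directly.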