using ggplot to create pareto chart

I am using this code to create a Pareto chart, but it does not work well on this dataset: the x-axis becomes unsorted in the graph.
cds <- catsvdogs
newdata <- cds[order(-cds$`Number of Households (in 1000)`),]
newdata <- data.frame(newdata$Location,newdata$`Number of Households (in 1000)`)
newdata <- newdata[c(1:10),c(1:2)]
newdata$cumulative <- cumsum(newdata$newdata..Number.of.Households..in.1000..)
ggplot(newdata, aes(x=newdata[,1])) +
geom_bar(aes(y=newdata[,2]), fill='blue', stat="identity") +
geom_point(aes(y=cumulative), color = rgb(0, 1, 0), pch=16, size=1) +
geom_path(aes(y=cumulative, group=1), colour="slateblue1", lty=3, size=0.9) +
theme(axis.text.x = element_text(angle=90, vjust=0.6)) +
labs(title = "Pareto Plot", x = 'Cities', y = 'Count')
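A likely cause is that ggplot2 places character values on a discrete axis in alphabetical order, regardless of the row order of the data frame. A minimal sketch of one way around this, assuming made-up place names and counts (the real catsvdogs values are not shown in the question, and the column name Households is a hypothetical stand-in): convert the location column to a factor whose levels follow the sorted row order.

```r
library(ggplot2)

# Hypothetical stand-in data, not the real catsvdogs values.
newdata <- data.frame(
  Location = c("Texas", "Ohio", "Maine", "Utah", "Iowa"),
  Households = c(9000, 4600, 540, 880, 1200)
)

# Sort descending, then accumulate the counts.
newdata <- newdata[order(-newdata$Households), ]
newdata$cumulative <- cumsum(newdata$Households)

# Fix the factor levels to the sorted row order so the axis keeps that order.
newdata$Location <- factor(newdata$Location,
                           levels = as.character(newdata$Location))

p <- ggplot(newdata, aes(x = Location)) +
  geom_col(aes(y = Households), fill = "blue") +
  geom_point(aes(y = cumulative), colour = "green", size = 1) +
  geom_path(aes(y = cumulative, group = 1), colour = "slateblue1", linetype = 3) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6)) +
  labs(title = "Pareto Plot", x = "Cities", y = "Count")
p
```

Referring to columns by name inside aes(), rather than as newdata[,1], also keeps the mapping tied to the data frame passed to ggplot().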

Related

Bunched up x axis ticks on multi panelled plot in ggplot

I am attempting to make a multi-panelled plot from three individual plots (see images). However, I am unable to fix the bunched x-axis tick labels when the plots are in the multi-panel format. The script for the individual plots and the multi-panel plot follows:
Individual Plot:
NewDat [[60]]
EstRes <- NewDat [[60]]
EstResPlt = ggplot(EstRes, aes(Distance3, `newBa`)) +
geom_line() +
scale_x_continuous(n.breaks = 10, limits = c(0, 3500)) +
scale_y_continuous(n.breaks = 10, limits = c(0, 25)) +
xlab("Distance from Core (μm)") +
ylab("Ba:Ca concentration(μmol:mol)") +
geom_hline(yintercept = 2.25, linetype = "dashed", color = "red") +
geom_vline(xintercept = c(1193.9, 1965.5, 2616.9, 3202.8, 3698.9), linetype = "dashed", color = "grey")
EstResPlt
Multi-panel plot:
MultiP <- grid.arrange(MigrPlt,OcResPlt,EstResPlt, nrow =1)
I have attempted to include:
MultiP <- grid.arrange(MigrPlt,OcResPlt,EstResPlt, nrow =1)+
theme(axis.text.x = element_text (angle = 45)) )
MultiP
but have only received errors. Not all tick marks need to be shown: an initial, middle and end value would be sufficient, so the labels would not need to be angled. I'm just not sure how to do this. Assistance would be much appreciated.

There are several options to resolve the crowded axes. Let's consider the following example which parallels your case. The default labelling strategy wouldn't overcrowd the x-axis.
library(ggplot2)
library(patchwork)
library(scales)
df <- data.frame(
x = seq(0, 3200, by = 20),
y = cumsum(rnorm(161))
)
p <- ggplot(df, aes(x, y)) +
geom_line()
(p + p + p) / p &
scale_x_continuous(
name = "Distance (um)"
)
However, because you've given n.breaks = 10 to the scale, it becomes crowded. So a simple solution would just be to remove that.
(p + p + p) / p &
scale_x_continuous(
n.breaks = 10,
name = "Distance (um)"
)
Alternatively, you could convert the micrometers to millimeters, which makes the labels less wide.
(p + p + p) / p &
scale_x_continuous(
n.breaks = 10,
labels = label_number(scale = 1e-3, accuracy = 0.1),
name = "Distance (mm)"
)
Yet another alternative is to put breaks only every n units; in the case below, every 1000. By chance, this gives the same breaks as omitting n.breaks = 10.
(p + p + p) / p &
scale_x_continuous(
breaks = breaks_width(1000),
name = "Distance (um)"
)
Created on 2021-11-02 by the reprex package (v2.0.1)
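
Since the question says an initial, middle and end value would be enough, one more option is to supply the breaks explicitly. A minimal sketch, assuming the 0 to 3500 limits from the question; three_breaks is a hypothetical helper written for this example, not part of ggplot2 or scales:

```r
library(ggplot2)

# Hypothetical helper: start, midpoint and end of the scale limits.
three_breaks <- function(limits) c(limits[1], mean(limits), limits[2])

set.seed(1)
df <- data.frame(
  x = seq(0, 3200, by = 20),
  y = cumsum(rnorm(161))
)

p <- ggplot(df, aes(x, y)) +
  geom_line() +
  scale_x_continuous(
    breaks = three_breaks,  # ggplot2 calls this function with the scale limits
    limits = c(0, 3500),
    name = "Distance (um)"
  )
p
```

Passing a function rather than a fixed vector means the same scale can be reused on panels with different limits.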

I thought it would be better to show this with an example.
What I meant was: you made MigrPlt, OcResPlt and EstResPlt each with ggplot() + .... For each plot whose x-axis text you want to rotate, add + theme(axis.text.x = element_text(angle = 45)).
For example, with the iris data, this rotates the x-axis text only for a:
a <- ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
geom_point() +
theme(axis.text.x = element_text (angle = 45))
b <- ggplot(iris, aes(Petal.Width, Petal.Length)) +
geom_point()
gridExtra::grid.arrange(a,b, nrow = 1)
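
The attempt in the question fails because the result of grid.arrange() is an arranged table of grobs rather than a ggplot object, so + theme(...) cannot be applied to it. If every panel should get the same rotation, a short sketch (the names rotate_x and plots are made up for this example) is to add the theme to each plot before arranging:

```r
library(ggplot2)
library(gridExtra)

a <- ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  geom_point()
b <- ggplot(iris, aes(Petal.Width, Petal.Length)) +
  geom_point()

# A theme element can be stored and reused across plots.
rotate_x <- theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Apply the theme to every plot first, then arrange the results.
plots <- lapply(list(a, b), function(p) p + rotate_x)
grid.arrange(grobs = plots, nrow = 1)
```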

Add custom, multiple geom_vline to each faceted ggplot2

I have a dataset where each species was mixed with a certain density (numeric) and type (numeric) of another species. I want to add two types of vertical lines to each of my facet_grid panels in ggplot: (a) A fixed line at the density divided by the type, e.g. 1000/1 = 1000, 1000/6 = 166.7, 10000/1 = 10000, 10000/6 = 1666.7
set.seed(111)
count <- rbinom(500,100,0.1)
species <- rep(c("A","B"),time = 250)
density <- rep(c("1000","10000","1000","10000"),time = 125)
type <- rep(c("1","1","6","6"),time = 125)
df <- data.frame(species, density, type, count) # I feel too naive, but I'm not able to get all the treatments filled. Gah.
ggplot(df, aes(x= count, colour = species, fill = species)) +
geom_histogram(position="identity", alpha=0.5) +
theme_bw() + ylab("Frequency") +
facet_grid(species ~ type + density) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(legend.position = "none") + theme(aspect.ratio = 1.75/1)
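
A common pattern for per-panel lines is to give geom_vline() its own small data frame that contains the faceting variables, so each row is drawn only in the matching panel. A rough sketch for line type (a), assuming the example data above; the names vlines and xint are made up for this example:

```r
library(ggplot2)

set.seed(111)
df <- data.frame(
  species = rep(c("A", "B"), times = 250),
  density = rep(c("1000", "10000", "1000", "10000"), times = 125),
  type = rep(c("1", "1", "6", "6"), times = 125),
  count = rbinom(500, 100, 0.1)
)

# One row per panel, with the line position computed as density / type.
vlines <- unique(df[c("species", "density", "type")])
vlines$xint <- as.numeric(as.character(vlines$density)) /
  as.numeric(as.character(vlines$type))

p <- ggplot(df, aes(x = count, colour = species, fill = species)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  # The facet columns in vlines route each line to its own panel.
  geom_vline(data = vlines, aes(xintercept = xint), linetype = "dashed") +
  facet_grid(species ~ type + density) +
  theme_bw() +
  ylab("Frequency") +
  theme(legend.position = "none")
p
```

With these toy counts (roughly 0 to 25) the lines fall far to the right of the histograms and stretch the x-axis, so the sketch only shows the mechanism; real data on a comparable scale would be needed for a readable plot.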

How to fix: when overlaying two scatter plots with using reorder of aes, the reorder gets lost

I have two scatter plots obtained from two sets of data that I would like to overlay. When using ggplot2 to create a single plot, I use a log scale and then order the values so that the scatter plot falls into a kind of horizontal S shape. But when I overlay the two, the information about the reordering gets lost and the plot loses its shape.
This is what the data frame looks like (one has 1076 entries and the other 1448):
protein Light_Dark log10
AT1G01080 1.1744852 0.06984755
AT1G01090 1.0710359 0.02980403
AT1G01100 0.4716955 -0.32633823
AT1G01320 156.6594802 2.19495668
AT1G02500 0.6406005 -0.19341276
AT1G02560 1.3381804 0.12651467
AT1G03130 0.6361147 -0.19646458
AT1G03475 0.7529015 -0.12326181
AT1G03630 0.7646064 -0.11656207
AT1G03680 0.8340107 -0.07882836
This is for the single plot:
p1 <- ggplot(ratio_log_ENR4, aes(x=reorder(protein, -log10), y=log10)) +
geom_point(size = 1) +
#coord_cartesian(xlim = c(0, 1000)) +
geom_hline(yintercept=0.1, col = "red") + #check gene
geom_hline(yintercept=-0.12, col = "red") +#check gene
labs(x = "Protein")+
theme_classic()+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
labs(y = "ratio Light_Dark log10")+
labs(x="Protein")
image=p1
ggsave(file="p1_ratio_data_ENR4_cys.svg", plot=image, width=10, height=8)
And for the overlay:
p1_14a <- ggplot(ratio_log_ENR1, aes(x=reorder(protein, -log10), y=log10)) +
geom_point(size = 1) +
#coord_cartesian(xlim = c(0, 1000)) +
geom_hline(yintercept=0.1, col = "red") + #check gene
geom_hline(yintercept=-0.12, col = "red") +#check gene
labs(x = "Protein")+
theme_classic()+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
labs(y = "ratio Light_Dark log10")+
labs(x="Protein")+
geom_point()+
geom_point(data=ratio_log_ENR4, color="red")
p=ggplot(ratio_log_ENR1, aes(x=reorder(protein, -log10), y=log10)) +
geom_point(size = 1) +
#coord_cartesian(xlim = c(0, 1000)) +
geom_hline(yintercept=0.1, col = "red") + #check gene
geom_hline(yintercept=-0.12, col = "red") +#check gene
labs(x = "Protein")+
theme_classic()+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())+
labs(y = "ratio Light_Dark log10")+
labs(x="Protein")
p = p + geom_point(data=ratio_log_ENR4, aes(x=reorder(protein, -log10), y=log10), color ="red" )
p
I tried changing the column classes, but that cannot be the problem, since the single plot works as it is.

The easiest solution I see for you is just binding together your two data frames before plotting.
library(dplyr) # for %>%
library(forcats) # for fct_reorder()
library(ggplot2)
# a and b stand for your two data frames
a$color <- 'red'
b$color <- 'blue'
ab <- a %>%
rbind(b)
ggplot(ab, aes(x = fct_reorder(protein, -log10), y = log10, color = color)) +
geom_point() +
scale_color_identity()
You can find a nice cheat-sheet for working with factors here: https://stat545.com/block029_factors.html
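
A different approach, not taken from the answer above: because the x-axis labels are hidden anyway, each data frame can get a numeric rank column, so that each dataset keeps its own S shape no matter which proteins the two sets share. A minimal sketch with simulated values; enr1, enr4 and add_rank are hypothetical names standing in for the real data frames:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-ins for ratio_log_ENR1 and ratio_log_ENR4.
enr1 <- data.frame(protein = paste0("P", 1:50), log10 = rnorm(50))
enr4 <- data.frame(protein = paste0("P", 21:80), log10 = rnorm(60))

# Rank within each data frame, largest log10 first.
add_rank <- function(d) {
  d$rank <- rank(-d$log10, ties.method = "first")
  d
}
enr1 <- add_rank(enr1)
enr4 <- add_rank(enr4)

p <- ggplot(enr1, aes(x = rank, y = log10)) +
  geom_point(size = 1) +
  geom_point(data = enr4, colour = "red", size = 1) +
  theme_classic() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(x = "Protein", y = "ratio Light_Dark log10")
p
```

The trade-off is that a given x position no longer refers to the same protein in both layers, which matters only if the plot is meant to compare individual proteins.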

Grouped scatterplot over grouped boxplot in R using ggplot2

I am creating a grouped boxplot with a scatterplot overlay using ggplot2. I would like to group each scatterplot datapoint with the grouped boxplot that it corresponds to.
However, I'd also like the scatterplot points to be different symbols. I seem to be able to get my scatterplot points to group with my grouped boxplots OR get my scatterplot points to be different symbols... but not both simultaneously. Below is some example code to illustrate what's happening:
library(scales)
library(ggplot2)
# Generates Data frame to plot
Gene <- c(rep("GeneA",24),rep("GeneB",24),rep("GeneC",24),rep("GeneD",24),rep("GeneE",24))
Clone <- c(rep(c("D1","D2","D3","D4","D5","D6"),20))
variable <- c(rep(c(rep("Day10",6),rep("Day20",6),rep("Day30",6),rep("Day40",6)),5))
value <- c(rnorm(24, mean = 0.5, sd = 0.5),rnorm(24, mean = 10, sd = 8),rnorm(24, mean = 1000, sd = 900),
rnorm(24, mean = 25000, sd = 9000), rnorm(24, mean = 8000, sd = 3000))
value <- sqrt(value*value)
Tdata <- cbind(Gene, Clone, variable)
Tdata <- data.frame(Tdata)
Tdata <- cbind(Tdata,value)
# Creates the Plot of All Data
# The below code groups the data exactly how I'd like but the scatter plot points are all the same shape
# and I'd like them to each have different shapes.
ln_clr <- "black"
bk_clr <- "white"
point_shapes <- c(0,15,1,16,2,17)
blue_cols <- c("#EFF2FB","#81BEF7","#0174DF","#0000FF","#0404B4")
lp1 <- ggplot(Tdata, aes(x=variable, y=value, fill=Gene)) +
stat_boxplot(geom ='errorbar', position = position_dodge(width = .83), width = 0.25,
size = 0.7, coef = 4) +
geom_boxplot( coef=1, outlier.shape = NA, position = position_dodge(width = .83), lwd = 0.3,
alpha = 1, colour = ln_clr) +
geom_point(position = position_jitterdodge(dodge.width = 0.83), size = 1.8, alpha = 0.7,
pch=15)
lp1 + scale_fill_manual(values = blue_cols) + labs(y = "Fold Change") +
expand_limits(y=c(0.01,10^5)) +
scale_y_log10(expand = c(0, 0), breaks = c(0.01,1,100,10000,100000),
labels = trans_format("log10", math_format(10^.x)))
ggsave("Scatter Grouped-Wrong Symbols.png")
#*************************************************************************************************************************************
# The below code doesn't group the scatterplot data how I'd like but the points each have different shapes
lp2 <- ggplot(Tdata, aes(x=variable, y=value, fill=Gene)) +
stat_boxplot(geom ='errorbar', position = position_dodge(width = .83), width = 0.25,
size = 0.7, coef = 4) +
geom_boxplot( coef=1, outlier.shape = NA, position = position_dodge(width = .83), lwd = 0.3,
alpha = 1, colour = ln_clr) +
geom_point(position = position_jitterdodge(dodge.width = 0.83), size = 1.8, alpha = 0.7,
aes(shape=Clone))
lp2 + scale_fill_manual(values = blue_cols) + labs(y = "Fold Change") +
expand_limits(y=c(0.01,10^5)) +
scale_y_log10(expand = c(0, 0), breaks = c(0.01,1,100,10000,100000),
labels = trans_format("log10", math_format(10^.x)))
ggsave("Scatter Ungrouped-Right Symbols.png")
If anyone has any suggestions I'd really appreciate it.
Thank you
Nathan

To get the boxplots to appear, the shape aesthetic needs to be inside geom_point, rather than in the main call to ggplot. The reason for this is that when the shape aesthetic is in the main ggplot call, it applies to all the geoms, including geom_boxplot. However, applying a shape=Clone aesthetic causes geom_boxplot to create a separate boxplot for each level of Clone. Since there's only one row of data for each combination of variable and Clone, no boxplot is produced.
That the shape aesthetic affects geom_boxplot seems counterintuitive to me, but maybe there's a reason for it that I'm not aware of. In any case, moving the shape aesthetic into geom_point solves the problem by applying the shape aesthetic only to geom_point.
Then, to get the points to appear with the correct boxplot, we need to group by Gene. I also added theme_classic to make it easier to see the plot (although it's still very busy):
ggplot(Tdata, aes(x=variable, y=value, fill=Gene)) +
stat_boxplot(geom ='errorbar', width=0.25, size=0.7, coef=4, position=position_dodge(0.85)) +
geom_boxplot(coef=1, outlier.shape=NA, lwd=0.3, alpha=1, colour=ln_clr, position=position_dodge(0.85)) +
geom_point(position=position_jitterdodge(dodge.width=0.85), size=1.8, alpha=0.7,
aes(shape=Clone, group=Gene)) +
scale_fill_manual(values=blue_cols) + labs(y="Fold Change") +
expand_limits(y=c(0.01,10^5)) +
scale_y_log10(expand=c(0, 0), breaks=10^(-2:5),
labels=trans_format("log10", math_format(10^.x))) +
theme_classic()
I think the plot would be easier to understand if you use faceting for Gene and the x-axis for variable. Putting time on the x-axis seems more intuitive, while using faceting frees up the color aesthetic for the points. With six different clones, it's still difficult (for me at least) to differentiate the point markers, but this looks cleaner to me than the previous version.
library(dplyr)
ggplot(Tdata %>% mutate(Gene=gsub("Gene","Gene ", Gene)),
aes(x=gsub("Day","",variable), y=value)) +
stat_boxplot(geom='errorbar', width=0.25, size=0.7, coef=4) +
geom_boxplot(coef=1, outlier.shape=NA, lwd=0.3, alpha=1, colour=ln_clr, width=0.5) +
geom_point(aes(fill=Clone), position=position_jitter(0.2), size=1.5, alpha=0.7, shape=21) +
theme_classic() +
facet_grid(. ~ Gene) +
labs(y = "Fold Change", x="Day") +
expand_limits(y=c(0.01,10^5)) +
scale_y_log10(expand=c(0, 0), breaks=10^(-2:5),
labels=trans_format("log10", math_format(10^.x)))
If you really need to keep the points, maybe it would be better to separate the boxplots and points with some manual dodging:
set.seed(10)
ggplot(Tdata %>% mutate(Day=as.numeric(substr(variable,4,5)),
Gene = gsub("Gene","Gene ", Gene)),
aes(x=Day - 2, y=value, group=Day)) +
stat_boxplot(geom ='errorbar', width=0.5, size=0.5, coef=4) +
geom_boxplot(coef=1, outlier.shape=NA, lwd=0.3, alpha=1, width=4) +
geom_point(aes(x=Day + 2, fill=Clone), size=1.5, alpha=0.7, shape=21,
position=position_jitter(width=1, height=0)) +
theme_classic() +
facet_grid(. ~ Gene) +
labs(y="Fold Change", x="Day") +
expand_limits(y=c(0.01,10^5)) +
scale_y_log10(expand=c(0, 0), breaks=10^(-2:5),
labels=trans_format("log10", math_format(10^.x)))
One more thing: For future reference, you can simplify your data creation code:
Gene = rep(paste0("Gene",LETTERS[1:5]), each=24)
Clone = rep(paste0("D",1:6), 20)
variable = rep(rep(paste0("Day", seq(10,40,10)), each=6), 5)
value = rnorm(24*5, mean=rep(c(0.5,10,1000,25000,8000), each=24),
sd=rep(c(0.5,8,900,9000,3000), each=24))
Tdata = data.frame(Gene, Clone, variable, value)

R running average for non-time data

This is the plot I have now.
It is generated by this code:
ggplot(data1, aes(x=POS,y=DIFF,colour=GT)) +
geom_point() +
facet_grid(~ CHROM,scales="free_x",space="free_x") +
theme(strip.text.x = element_text(size=40),
strip.background = element_rect(color='lightblue',fill='lightblue'),
legend.position="top",
legend.title = element_text(size=40,colour="lightblue"),
legend.text = element_text(size=40),
legend.key.size = unit(2.5, "cm")) +
guides(fill = guide_legend(title.position="top",
title = "Legend:GT='REF'+'ALT'"),
shape = guide_legend(override.aes=list(size=10))) +
scale_y_log10(breaks=trans_breaks("log10", function(x) 10^x, n=10)) +
scale_x_continuous(breaks = pretty_breaks(n=3)) +
geom_line(stat = "hline",
yintercept = "mean",
size = 1)
The last layer, geom_line, creates the mean line for each panel.
But now I want a more fine-grained running average inside each panel.
For example, if panel 1 ('chr01') has an x-axis range from 0 to 100,000,000, I would want the mean value for each range of 1,000,000:
mean1 = mean(x=0 to x=1,000,000)
mean2 = mean(x=1,000,001 to x=2,000,000)

One way to provide a running mean is with geom_smooth() using the loess local regression method. In order to demonstrate my proposed solution, I created a fake genomic dataset using R functions. You can adjust the span parameter of geom_smooth to make the running mean smoother (closer to 1.0) or rougher (closer to 1/number of data points).
# Create example data.
set.seed(27182)
y1 = rnorm(10000) +
c(rep(0, 1000), dnorm(seq(-2, 5, length.out=8000)) * 3, rep(0, 1000))
y2 = c(rnorm(2000), rnorm(1000, mean=1.5), rnorm(1000, mean=-1, sd=2),
rnorm(2000, sd=2))
y3 = rnorm(4000)
pos = c(sort(runif(10000, min=0, max=1e8)),
sort(runif(6000, min=0, max=6e7)),
sort(runif(4000, min=0, max=4e7)))
chr = rep(c("chr01", "chr02", "chr03"), c(10000, 6000, 4000))
data1 = data.frame(CHROM=chr, POS=pos, DIFF=c(y1, y2, y3))
# Plot.
p = ggplot(data1, aes(x=POS, y=DIFF)) +
geom_point(alpha=0.1, size=1.5) +
geom_smooth(colour="darkgoldenrod1", size=1.5, method="loess",
method.args=list(degree=0), # passed on to loess(); recent ggplot2 versions ignore a bare degree argument
span=0.1, se=FALSE) +
scale_x_continuous(breaks=seq(1e7, 3e8, 1e7),
labels=paste(seq(10, 300, 10)), expand=c(0, 0)) +
xlab("Position, Megabases") +
theme(axis.text.x=element_text(size=8)) +
facet_grid(. ~ CHROM, scales="free", space="free")
ggsave(filename="plot_1.png", plot=p, width=10, height=5, dpi=150)
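
If the literal per-window means from the question are wanted instead of a smoother, they can be computed before plotting and drawn as horizontal segments. A minimal sketch under stated assumptions: the data are simulated, the names bin_width, BIN, START and END are made up for this example, and the windows are half-open intervals starting at 0, which differs slightly from the 1 to 1,000,000 convention in the question.

```r
library(ggplot2)

set.seed(27182)
# Simulated stand-in for the real data.
data1 <- data.frame(
  CHROM = rep(c("chr01", "chr02"), c(3000, 2000)),
  POS = c(sort(runif(3000, min = 0, max = 1e8)),
          sort(runif(2000, min = 0, max = 6e7))),
  DIFF = rnorm(5000)
)

# Assign each position to a 1,000,000-wide window.
bin_width <- 1e6
data1$BIN <- floor(data1$POS / bin_width)

# Mean of DIFF per chromosome and window.
bin_means <- aggregate(DIFF ~ CHROM + BIN, data = data1, FUN = mean)
bin_means$START <- bin_means$BIN * bin_width
bin_means$END <- bin_means$START + bin_width

p <- ggplot(data1, aes(x = POS, y = DIFF)) +
  geom_point(alpha = 0.1) +
  # One horizontal segment per window, at the height of that window's mean.
  geom_segment(data = bin_means,
               aes(x = START, xend = END, y = DIFF, yend = DIFF),
               colour = "red") +
  facet_grid(. ~ CHROM, scales = "free_x", space = "free_x")
p
```

Windows that contain no points simply produce no segment, which leaves visible gaps in sparse regions.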
