ggplot2 & stat_ellipse: Draw ellipses around multiple groups of points - r

This might be a simple one, but I'm trying to draw ellipses around my treatments on my PCoA plot.
My data frame (sc) is:
MDS1 MDS2 Treatment
X1xF1 -0.19736183 -0.24299825 1xFlood
X1xF2 -0.17409568 -0.29727596 1xFlood
X1xF3 -0.15272444 -0.28553837 1xFlood
S1 -0.06643271 0.47049959 Start
S2 -0.15143350 0.31152966 Start
S3 -0.26156297 0.12296849 Start
X3xF1 0.29840827 0.04581617 3xFloods
X3xF2 0.50503749 -0.07011503 3xFloods
X3xF3 0.20016537 -0.05488630 3xFloods
and my code is:
ggplot(data=sc,(aes(x=MDS1,y=MDS2,colour = Treatment)))+geom_point(size=3)+
ggtitle(paste("PCoA of samples at 'class' level(method='Bray')\n",sep=''))+
theme_bw()+guides(colour = guide_legend(override.aes = list(size=3)))+
stat_ellipse()
It plots the PCoA okay up until stat_ellipse(). I've tried it with various parameters and at best I can get one ellipse for the whole plot (although I can't seem to reproduce that now).
What I'm after is three CI ellipses for the three treatments, coloured the same as the treatments. Any help would be very appreciated!
Thanks.

There is no stat_ellipse(...) in the ggplot package, so you must have retrieved it from somewhere else. Care to share? There are at least two versions that I am aware of, here and here. Neither of these seems to work with your dataset, which is odd because both have worked with other datasets.
I finally fell back on the option of generating the ellipses externally to ggplot, which is not that difficult really.
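library(ggplot2)
library(ellipse)
# Mean of each axis per treatment
centroids <- aggregate(cbind(MDS1,MDS2)~Treatment,sc,mean)
# One 95% ellipse per treatment, stacked into a single data frame.
# The centroid row is looked up by name, so this works whether
# Treatment is stored as a factor or as character.
conf.rgn <- do.call(rbind,lapply(unique(sc$Treatment),function(t)
  data.frame(Treatment=as.character(t),
             ellipse(cov(sc[sc$Treatment==t,c("MDS1","MDS2")]),
                     centre=unlist(centroids[centroids$Treatment==t,c("MDS1","MDS2")]),
                     level=0.95),
             stringsAsFactors=FALSE)))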
ggplot(data=sc,(aes(x=MDS1,y=MDS2,colour = Treatment)))+
geom_point(size=3)+
geom_path(data=conf.rgn)+
ggtitle(paste("PCoA of samples at 'class' level(method='Bray')\n",sep=''))+
theme_bw()+
guides(colour = guide_legend(override.aes = list(size=3)))
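For anyone who wants to see what "generating the ellipses externally" boils down to, here is a minimal base R sketch that does not need the ellipse package. It assumes the usual chi-square scaling for a bivariate normal region; ellipse_points and xy are names made up for this illustration, and the three rows are the 3xFloods points from the question.

```r
# Hypothetical helper: points on the ellipse
#   (x - mu)' S^-1 (x - mu) = qchisq(level, 2)
# where mu and S are the sample mean and covariance of xy.
ellipse_points <- function(xy, level = 0.95, n = 100) {
  mu <- colMeans(xy)
  S <- cov(xy)
  r <- sqrt(qchisq(level, df = 2))
  theta <- seq(0, 2 * pi, length.out = n)
  circle <- cbind(cos(theta), sin(theta))
  # chol(S) is the upper triangular R with t(R) %*% R == S, so mapping
  # the unit circle through R gives the covariance shape.
  pts <- sweep(r * (circle %*% chol(S)), 2, mu, "+")
  colnames(pts) <- colnames(xy)
  as.data.frame(pts)
}

# The 3xFloods rows from the question
xy <- cbind(MDS1 = c(0.29840827, 0.50503749, 0.20016537),
            MDS2 = c(0.04581617, -0.07011503, -0.05488630))
e <- ellipse_points(xy)
head(e)
```

The resulting data frame has the same MDS1 and MDS2 columns as the data, so after adding a Treatment column it can be handed to geom_path() in the same way as conf.rgn.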

Related

R: Cleaning GGally Plots

I am using the R programming language and I am new to the GGally library. I followed some basic tutorials online and ran the following code:
#load libraries
library(GGally)
library(survival)
library(plotly)
I changed some of the data types:
#manipulate the data
data(lung)
data = lung
data$sex = as.factor(data$sex)
data$status = as.factor(data$status)
data$ph.ecog = as.factor(data$ph.ecog)
Now I visualize:
#make the plots
#I dont know why, but this comes out messy
ggparcoord(data, groupColumn = "sex")
#Cleaner
ggparcoord(data)
Both ggparcoord() code segments successfully ran, however the first one came out pretty messy (the axis labels seem to have been corrupted). Is there a way to fix the labels?
In the second graph, it is difficult to tell how the factor variables are labelled on their respective axes (e.g. for the "sex" column, is "male" the bottom point or is "female" the bottom point?). Does anyone know if there is a way to fix this?
Finally, is there a way to use the "ggplotly()" function for "ggally" objects?
e.g.
a = ggparcoord(data)
ggplotly(a)
Thanks
Looks like your data columns get converted to a factor when adding the groupColumn. To prevent that you could exclude the groupColumn from the columns to be plotted:
BTW: Not sure about the general case. But at least for ggparcoord ggplotly works.
library(GGally)
library(survival)
data(lung)
data = lung
data$sex = as.factor(data$sex)
data$status = as.factor(data$status)
data$ph.ecog = as.factor(data$ph.ecog)
# Exclude the grouping column from the columns to be plotted
ggparcoord(data, columns = seq(ncol(data))[!names(data) %in% "sex"], groupColumn = "sex")

R: How does stat_density_ridges modify and plot data?

<Disclaimer(s) - (1) This is my first post, so please be gentle, specifically regarding formatting and (2) I did try to dig as much as I could on this topic before posting the question here>
I have a simple data vector containing returns of 40 portfolios on the same day:
Year Return
Now -17.39862061
Now -12.98954582
Now -12.98954582
Now -12.86928749
Now -12.37044334
Now -11.07007504
Now -10.68971539
Now -10.07578182
Now -9.984867096
Now -8.764036179
Now -8.698093414
Now -8.594026566
Now -8.193638802
Now -7.818599701
Now -7.622627735
Now -7.535216808
Now -7.391239166
Now -7.331315517
Now -5.58059597
Now -5.579797268
Now -4.525201797
Now -3.735909224
Now -2.687532902
Now -2.65363884
Now -2.177522898
Now -1.977644682
Now -1.353205681
Now -0.042584345
Now 0.096564181
Now 0.275416046
Now 0.638839543
Now 1.959529042
Now 3.715519428
Now 4.842819691
Now 5.475946426
Now 6.380955219
Now 6.535937309
Now 8.421762466
Now 8.556800842
Now 10.39185524
I am trying to plot these returns to compare them against other days (e.g. the rest of my history). I tried to use stat_density_ridges as per the code block below:
ggplot(data = data.plot, aes(x = Return, y = Year, fill = factor(..quantile..))) +
stat_density_ridges(geom = "density_ridges_gradient",calc_ecdf = TRUE,
quantiles = c(0.025, 0.5, 0.975),
quantile_lines = TRUE)
As you can see, the "Year" value is the same for every row, i.e. there is no height parameter, yet I get a nice ridge chart. While the chart is beautiful to behold, and very very awesome, I am at a loss to determine how the plotting function is computing the density in this case, especially the height.
This is the output chart I get (I have omitted the formatting code here since it doesn't make a difference to my question):
[Figure: Portfolio Return Distribution Plots - US versus Europe]
I tried digging into the code of the function itself, but came up with a total blank. The documentation didn't help (except perhaps to give me a hint that the function plots continuous distributions).
Any help, or guidance, or even a nudge in the right direction would be extremely helpful.
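As a nudge, and only as a sketch: assuming the ridge height comes from a kernel density estimate of the x values within each y group (which is how density-type stats generally work), the height can be reproduced in base R with density(). This is not the package's own code; returns, d and ecdf_grid are names made up here, and the vector is a subset of the values listed above.

```r
# A subset of the returns from the question (any numeric vector works).
returns <- c(-17.39862061, -12.98954582, -8.764036179, -5.58059597,
             -2.687532902, -0.042584345, 1.959529042, 5.475946426,
             8.421762466, 10.39185524)

# Kernel density estimate on a regular grid (512 points by default).
# d$x is the grid of return values, d$y is the estimated density,
# i.e. the "height" that no column in the data supplies directly.
d <- density(returns)

# A cumulative distribution along the same grid, built from the
# density values; this is one way a fill by quantile could be derived.
ecdf_grid <- cumsum(d$y) / sum(d$y)

plot(d, main = "Kernel density of returns")
# Sample quantiles of the raw data, drawn as dashed lines
abline(v = quantile(returns, c(0.025, 0.5, 0.975)), lty = 2)
```

So even with a single "Year", there is still a full curve to draw: the height at each return value is the estimated density there, and the single y value only decides on which baseline the ridge sits.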

Q: How combine two types of lines using ggplot?

I am trying to plot the following graph:
This plot was made using a command in R; however, I need to change the x-axis. As you can see, the x-axis starts at 0 and finishes at 46. I want the x-axis to start at 1972 and finish at 2018, i.e. seq(1972, 2018). The data used for this graph are the following:
For regime one
structure(c(0.996336942021931, 0.982749831853788, 0.25257000136794,
0.707797489518183, 0.339372705184362, 0.999209103898399, 0.348786927897612,
0.821500770877589, 0.569473419352121, 0.544946043345147, 0.15347485404411,
0.987921203799956, 0.00247541125926418, 0.999925918450173, 0.996940249283586,
0.0141234625702467, 0.105466117156579, 0.999992944275275, 0.991723355647765,
0.0958472062267191, 0.0362729940372193, 0.999999790503447, 0.0750715811130157,
0.999975836828039, 0.998991768987905, 0.327943641159186, 5.05723080618291e-05,
0.999999999869691, 0.995538324405397, 0.123355227931813, 0.999776636825943,
0.00875781169836433, 0.696284480883101, 0.854839147672286, 0.113243492249383,
0.00984853715078062, 0.442061195271808, 0.999959859676686, 0.0249739384218217,
0.715262186931097, 0.269481397703521, 0.708458897302807, 0.0444979324520481,
0.000133950914911277, 0.997976154782607, 0.191386380576805, 0.99775339928206,
0.97921531595208, 0.27690132186733, 0.671995422154737, 0.458800347851363,
0.999155966774432, 0.417000082142666, 0.838969001100901, 0.576424593247709,
0.439169303472056, 0.227227711549776, 0.978527102362448, 0.00408165810824898,
0.999955057843957, 0.994643622809094, 0.00847570472458959, 0.163000467960203,
0.999995704786608, 0.987482614312069, 0.0569007267419926, 0.0585312256476362,
0.999999671060746, 0.118213072794827, 0.99998536150034, 0.998897081324845,
0.212968271334585, 8.35316288758489e-05, 0.999999999920876, 0.993537683112221,
0.188538497918178, 0.999604116439039, 0.00905848219612739, 0.769430430615986,
0.794457999021984, 0.0665707154963958, 0.00776458004359329, 0.5668500474175,
0.999931021995446, 0.0265573724408095, 0.661699294173752, 0.296009575623967,
0.587638579198176, 0.0251758869152202, 0.000220356219397782,
0.997352716237698, 0.191386380576805), .Dim = c(46L, 2L))
for regime 2:
structure(c(0.00366305797806813, 0.0172501681462116, 0.74742999863206,
0.292202510481817, 0.660627294815638, 0.000790896101601132, 0.651213072102388,
0.178499229122411, 0.430526580647879, 0.455053956654853, 0.846525145955889,
0.0120787962000438, 0.997524588740736, 7.40815498269273e-05,
0.00305975071641352, 0.985876537429753, 0.894533882843421, 7.05572472485335e-06,
0.00827664435223535, 0.904152793773281, 0.963727005962781, 2.09496553467159e-07,
0.924928418886985, 2.41631719608902e-05, 0.00100823101209502,
0.672056358840815, 0.999949427691938, 1.30308744399533e-10, 0.00446167559460289,
0.876644772068187, 0.00022336317405711, 0.991242188301636, 0.303715519116899,
0.145160852327714, 0.886756507750617, 0.990151462849219, 0.557938804728191,
4.01403233139628e-05, 0.975026061578178, 0.284737813068903, 0.730518602296479,
0.291541102697193, 0.955502067547952, 0.999866049085089, 0.00202384521739295,
0.808613619423195, 0.00224660071793958, 0.0207846840479196, 0.72309867813267,
0.328004577845263, 0.541199652148637, 0.000844033225568314, 0.582999917857334,
0.161030998899099, 0.423575406752291, 0.560830696527944, 0.772772288450224,
0.0214728976375518, 0.995918341891751, 4.49421560426429e-05,
0.00535637719090558, 0.99152429527541, 0.836999532039797, 4.29521339242403e-06,
0.0125173856879312, 0.943099273258007, 0.941468774352364, 3.28939253926857e-07,
0.881786927205173, 1.46384996596921e-05, 0.00110291867515508,
0.787031728665414, 0.999916468371124, 7.91243531099699e-11, 0.00646231688777926,
0.811461502081822, 0.00039588356096145, 0.990941517803873, 0.230569569384014,
0.205542000978016, 0.933429284503604, 0.992235419956407, 0.4331499525825,
6.89780045536876e-05, 0.973442627559191, 0.338300705826248, 0.703990424376033,
0.412361420801824, 0.97482411308478, 0.999779643780602, 0.00264728376230197,
0.808613619423195), .Dim = c(46L, 2L))
I know that the red line can be plotted using geom_line, but I do not know how to plot the black bars (maybe using geom_bar?), or how to merge the two plots.
Thanks for your help
It's actually plotted using base R (good old times). Using your first data set, for regime one:
plot(Regime1[,1],type="h",xaxt="n",ylab="",cex.axis=0.6,xlab="",xlim=c(0,46))
lines(Regime1[,2],col="red")
mtext("Smoothed Probabilities",2,padj=-5,col="red",cex=0.7)
mtext("Fitted Probabilities",4,padj=1,cex=0.7)
axis(side=1,at=c(0,20,46),labels=c(1972,1992,2018))
Your xaxis values are actually 0:46, so you turn off the x-axis ticks using xaxt="n", then with axis(), you put it at 0,20,46 with the labels 1972...
It also depends on your plotting device, so you might have to change the padj parameter in the mtext() calls to adjust the axis labels. I guess you can check out posts like this for base R plotting functions.
In ggplot2, I guess you just create a data.frame with the Index as the years you need, and you call geom_segment() to plot the vertical lines :
library(ggplot2)
Regime1 = data.frame(Regime1)
colnames(Regime1) = c("Fitted","Smoothed")
Regime1$index = 1:nrow(Regime1)+1972
ggplot(Regime1,aes(x=index))+
geom_segment(aes(xend=index,y=0,yend=Fitted,col="Fitted")) +
geom_line(aes(y=Smoothed,col="Smoothed")) + theme_minimal() +
scale_color_manual(values=c("black","red"))
For a ggplot2 solution, you are going to need a data.frame or tibble with 4 columns (Regime, Year, Smoothed, and Fitted). Based on the data you provided, this would have 92 rows.
Now assuming you use those column names (and storing your data into the variable example.dat), a ggplot2 solution is
library(ggplot2)
library(magrittr)  # provides the %>% pipe
example.dat %>%
ggplot( aes(x=Year) ) +
geom_line( aes(y=Smoothed), color="red" ) +
geom_linerange( aes(ymax=Fitted), ymin=0 ) +
facet_wrap( ~ Regime, ncol=1 )
Then you might need to adjust some of the scales to get the best plot.
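In case it helps, here is one way the four-column, 92-row data frame could be assembled in base R. This is a sketch under assumptions: regime1 and regime2 are random placeholder matrices standing in for the two 46 x 2 structure(...) objects in the question, to_long is a made-up helper, column 1 is taken as Fitted and column 2 as Smoothed (as in the base R answer), and the years follow the same 1:nrow + 1972 indexing used there.

```r
set.seed(1)
n_years <- 46

# Placeholder data: swap in the real 46 x 2 matrices from the question.
regime1 <- matrix(runif(2 * n_years), ncol = 2)
regime2 <- 1 - regime1  # the two regimes' probabilities sum to 1

# Hypothetical helper: one matrix -> one long data frame
to_long <- function(m, regime) {
  data.frame(Regime = regime,
             Year = 1972 + seq_len(nrow(m)),
             Fitted = m[, 1],
             Smoothed = m[, 2])
}

example.dat <- rbind(to_long(regime1, "Regime 1"),
                     to_long(regime2, "Regime 2"))
str(example.dat)
```

Keeping both regimes in one long data frame is what lets facet_wrap(~ Regime) split them into panels without any further reshaping.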

Geom_points not dodging when geom_errorbars are

I can't figure out how to get these geom_points to properly dodge! I've searched many, MANY how-to's and questions on different stackexchange pages, but none of them fix the problem.
analyze_weighted <- data.frame(
mus = c(clean_mu,b_mu,d_mu,g_mu,bd_mu,bg_mu,dg_mu,bdg_mu,m_mu),
sds = c(clean_sigma,b_sigma,d_sigma,g_sigma,bd_sigma,bg_sigma,dg_sigma,bdg_sigma,m_sigma),
SNR =c("No shifts","1 shift","1 shift","1 shift","2 shifts","2 shifts","2 shifts","3 shifts","4 shifts")
)
And then I try to plot it:
ggplot(analyze_weighted, aes(x=SNR,y=mus,color=SNR,group=mus)) +
geom_point(position="dodge",na.rm=TRUE) +
geom_errorbar(position="dodge",aes(ymax=mus+sds/2,ymin=mus-sds/2), width=0.25)
And it manages to dodge the error bars but not the points! I'm going crazy here, what do I do?
Here's what it looks like now--I want the points to be slightly dodged!
geom_point requires that you explicitly provide the width you desire the points to dodge.
This should work:
ggplot(analyze_weighted, aes(x=SNR,y=mus,color=SNR,group=mus)) +
geom_point(position=position_dodge(width=0.2),na.rm=TRUE) +
geom_errorbar(position=position_dodge(width=0.2),aes(ymax=mus+sds/2,ymin=mus-sds/2),width=0.25)
Please notice that your example wasn't a fully reproducible one, as no values of the variables used to construct mus and sds are available.
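For completeness, a self-contained version that can be run as is. The mus and sds values below are made up purely for illustration (the originals were not posted), and dodge is just a convenience name; sharing one position_dodge() object between both layers keeps the points and error bars aligned.

```r
library(ggplot2)

# Made-up means and standard deviations, NOT the asker's data
analyze_weighted <- data.frame(
  mus = c(10.0, 9.2, 9.5, 9.8, 8.1, 8.4, 8.9, 7.3, 6.5),
  sds = c(1.0, 1.2, 0.8, 1.1, 1.5, 0.9, 1.3, 1.0, 1.4),
  SNR = c("No shifts", "1 shift", "1 shift", "1 shift",
          "2 shifts", "2 shifts", "2 shifts", "3 shifts", "4 shifts")
)

# One dodge specification, reused for both layers
dodge <- position_dodge(width = 0.2)

p <- ggplot(analyze_weighted,
            aes(x = SNR, y = mus, colour = SNR, group = mus)) +
  geom_point(position = dodge) +
  geom_errorbar(aes(ymin = mus - sds / 2, ymax = mus + sds / 2),
                position = dodge, width = 0.25)
print(p)
```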

How to plot a violin scatter boxplot (in R)?

I just came by the following plot:
And wondered how it can be done in R (or other software)?
Update 10.03.11: Thank you everyone who participated in answering this question - you gave wonderful solutions! I've compiled all the solutions presented here (as well as some others I've come by online) in a post on my blog.
Make.Funny.Plot does more or less what I think it should do. To be adapted according to your own needs, and might be optimized a bit, but this should be a nice start.
Make.Funny.Plot <- function(x){
unique.vals <- length(unique(x))
N <- length(x)
N.val <- min(N/20,unique.vals)
if(unique.vals>N.val){
x <- ave(x,cut(x,N.val),FUN=min)
x <- signif(x,4)
}
# construct the outline of the plot
outline <- as.vector(table(x))
outline <- outline/max(outline)
# determine some correction to make the V shape,
# based on the range
y.corr <- diff(range(x))*0.05
# Get the unique values
yval <- sort(unique(x))
plot(c(-1,1),c(min(yval),max(yval)),
type="n",xaxt="n",xlab="")
for(i in 1:length(yval)){
n <- sum(x==yval[i])
x.plot <- seq(-outline[i],outline[i],length=n)
y.plot <- yval[i]+abs(x.plot)*y.corr
points(x.plot,y.plot,pch=19,cex=0.5)
}
}
N <- 500
x <- rpois(N,4)+abs(rnorm(N))
Make.Funny.Plot(x)
EDIT : corrected so it always works.
I recently came upon the beeswarm package, that bears some similarity.
The bee swarm plot is a
one-dimensional scatter plot like
"stripchart", but with closely-packed,
non-overlapping points.
Here's an example:
library(beeswarm)
beeswarm(time_survival ~ event_survival, data = breast,
method = 'smile',
pch = 16, pwcol = as.numeric(ER),
xlab = '', ylab = 'Follow-up time (months)',
labels = c('Censored', 'Metastasis'))
legend('topright', legend = levels(breast$ER),
title = 'ER', pch = 16, col = 1:2)
(source: eklund at www.cbs.dtu.dk)
I have come up with code similar to Joris's; still, I think this is more than a stem plot. Here I mean that the y value in each series is an absolute value of the distance to the in-bin mean, and the x value is more about whether the value is lower or higher than the mean.
Example code (sometimes throws warnings but works):
px<-function(x,N=40,...){
x<-sort(x);
#Cutting in bins
cut(x,N)->p;
#Calculate the means over bins
sapply(levels(p),function(i) mean(x[p==i]))->meansl;
means<-meansl[p];
#Calculate the mins over bins
sapply(levels(p),function(i) min(x[p==i]))->minl;
mins<-minl[p];
#Each dot is one value.
#X is an order of a value inside bin, moved so that the values lower than bin mean go below 0
X<-rep(0,length(x));
for(e in levels(p)) X[p==e]<-(1:sum(p==e))-1-sum((x-means)[p==e]<0);
#Y is a bin minimum + absolute value of a difference between value and its bin mean
plot(X,mins+abs(x-means),pch=19,cex=0.5,...);
}
Try the vioplot package:
library(vioplot)
vioplot(rnorm(100))
(with awful default color ;-)
There is also wvioplot() in the wvioplot package, for weighted violin plot, and beanplot, which combines violin and rug plots. They are also available through the lattice package, see ?panel.violin.
Since this hasn't been mentioned yet, there is also ggbeeswarm as a relatively new R package based on ggplot2.
Which adds another geom to ggplot to be used instead of geom_jitter or the like.
In particular, geom_quasirandom (see second example below) produces really good results, and I have in fact adopted it as my default plot.
Noteworthy is also the package vipor (VIolin POints in R) which produces plots using the standard R graphics and is in fact also used by ggbeeswarm behind the scenes.
set.seed(12345)
install.packages('ggbeeswarm')
library(ggplot2)
library(ggbeeswarm)
ggplot(iris,aes(Species, Sepal.Length)) + geom_beeswarm()
ggplot(iris,aes(Species, Sepal.Length)) + geom_quasirandom()
#compare to jitter
ggplot(iris,aes(Species, Sepal.Length)) + geom_jitter()
