How to use a bubbleplot in ggplot2/R to deal with overplotting

I have a plot of categorical variables as below:
http://i.imgur.com/d1hJP21.png
This is a very small subset of the actual data (n > 10000).
While jittering handles the overplotting, it is ugly and can lead to ambiguity. I would prefer to place bubbles whose size shows the number of coincident points.
I can't seem to find a simple and repeatable way to do this.
Thank you in advance!
Edit:
Thanks for the feedback. Here is what I hope is a reproducible example:
First, a CSV of the data (long, but relevant in this example):
ID,g,wf,fi
1824848,14,2,4
1314001,14,2,3
670960,14,1,3
1313235,15,3,4
1172304,3,5,4
1859973,15,1,3
1826951,14,1,4
1868238,15,1,2
1911869,15,1,4
1911861,15,1,2
926829,14,1,3
1609578,3,4,4
1306895,3,5,4
1199557,15,1,4
692849,10,3,4
1923352,3,5,4
1881724,4,4,4
1384603,3,5,4
1928829,15,1,4
493503,3,5,4
902650,15,1,3
1887582,6,4,4
1887584,3,5,4
1933992,13,1,4
635372,3,3,4
1892765,15,1,2
1934773,13,2,4
1892530,14,2,4
936786,3,5,4
1897585,13,3,4
1895932,15,1,3
422785,15,1,3
1219573,8,1,4
1897817,3,2,4
1899612,14,3,4
1939157,15,1,4
1952043,14,1,3
1938048,14,1,3
1896607,15,1,2
1941385,15,1,3
1959437,3,5,4
1064010,15,1,3
1951600,13,3,4
541439,15,1,4
1938609,3,5,4
1958667,15,1,2
1943792,10,1,4
1943782,14,1,4
1893714,14,1,4
1335502,15,1,1
1950179,3,2,4
1959069,15,1,2
1958811,15,1,2
1958808,15,3,4
1959878,15,1,1
1949904,15,1,3
1961475,15,1,4
1876863,15,1,4
384705,15,1,3
1966338,15,1,4
1980290,3,4,4
1966997,15,2,4
1967107,15,1,1
1976077,15,1,2
1967579,11,1,4
1967387,4,2,4
1973408,3,3,4
1684881,3,3,3
...and the plot code:
library(ggplot2)
# dx is the data frame read from the CSV above
sx <- ggplot(dx, aes(x=fi, y=wf)) +
  geom_point(shape=19, alpha=1, size=1, position=position_jitter(width=0.1,height=.1))
print(sx)
I really don't know where to go from here, other than manually making a count matrix...
Thanks again (sorry, new to stackoverflow).
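One possible approach, as a minimal sketch rather than a definitive answer: count the coincident points per (fi, wf) pair and map that count to point size. The data frame below is just the first ten rows of the CSV above (wf and fi columns only), and the names counts and n are my own. The plot is guarded so the counting step still runs if ggplot2 is not installed.

```r
# First ten rows of the CSV from the question (wf and fi columns only)
dx <- data.frame(
  wf = c(2, 2, 1, 3, 5, 1, 1, 1, 1, 1),
  fi = c(4, 3, 3, 4, 4, 3, 4, 2, 4, 2)
)

# Count how many points share each (fi, wf) combination
counts <- aggregate(n ~ fi + wf, data = transform(dx, n = 1), FUN = sum)
print(counts)

# Draw one bubble per combination, with area proportional to the count
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  sx <- ggplot(counts, aes(x = fi, y = wf, size = n)) +
    geom_point(shape = 19) +
    scale_size_area(max_size = 10)
  print(sx)
}
```

If you would rather not build the count table yourself, ggplot2 also has geom_count(), which does the counting internally: ggplot(dx, aes(x = fi, y = wf)) + geom_count().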

Related

R: Cleaning GGally Plots

I am using the R programming language and I am new to the GGally library. I followed some basic tutorials online and ran the following code:
#load libraries
library(GGally)
library(survival)
library(plotly)
I changed some of the data types:
#manipulate the data
data(lung)
data = lung
data$sex = as.factor(data$sex)
data$status = as.factor(data$status)
data$ph.ecog = as.factor(data$ph.ecog)
Now I visualize:
#make the plots
#I dont know why, but this comes out messy
ggparcoord(data, groupColumn = "sex")
#Cleaner
ggparcoord(data)
Both ggparcoord() code segments successfully ran, however the first one came out pretty messy (the axis labels seem to have been corrupted). Is there a way to fix the labels?
In the second graph, it is difficult to tell how the factor variables are labelled on their respective axes (e.g. for the "sex" column, is "male" the bottom point or is "female" the bottom point?). Does anyone know if there is a way to fix this?
Finally, is there a way to use the "ggplotly()" function for "ggally" objects?
e.g.
a = ggparcoord(data)
ggplotly(a)
Thanks
It looks like your data columns get converted to factors when you add the groupColumn.
BTW: I'm not sure about the general case, but at least for ggparcoord, ggplotly works.
To prevent the conversion, you could exclude the groupColumn from the columns to be plotted:
library(GGally)
library(survival)
data(lung)
data = lung
data$sex = as.factor(data$sex)
data$status = as.factor(data$status)
data$ph.ecog = as.factor(data$ph.ecog)
# Exclude the grouping column from the columns to be plotted
ggparcoord(data, columns = seq(ncol(data))[!names(data) %in% "sex"], groupColumn = "sex")
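In case the index expression above looks cryptic, here is a small base R sketch of what it evaluates to. The toy data frame and its values are made up purely for illustration; only the expression itself comes from the answer.

```r
# Hypothetical toy data frame standing in for the lung data
toy <- data.frame(
  time = c(5, 8),
  sex  = factor(c(1, 2)),
  age  = c(60, 71)
)

# All column positions, minus the position of the grouping column
cols <- seq(ncol(toy))[!names(toy) %in% "sex"]
print(cols)

# An equivalent and arguably more readable spelling
cols_alt <- which(names(toy) != "sex")
print(identical(cols, cols_alt))
```

The resulting vector is what gets passed as the columns argument, so the grouping column is used for colouring only and not drawn as an axis.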

R: Problems while plotting sampled values from a curve

I am trying to simulate a signal in order to apply some methods of non-linear fittings, but I have some problems when plotting it.
x<-sample(seq(0,1,length.out = 1000),200)
y<-2*sin(4*pi*x)-6*abs(x-0.4)^(0.3)+2*exp(-30*(4*x-2)^2)+8*x+rnorm(200,0,0.5)
s<-2*sin(4*pi*x)-6*abs(x-0.4)^(0.3)+2*exp(-30*(4*x-2)^2)+8*x
plot(x,y)
lines(x,s,col="red")
The idea is to have 200 observations uniformly sampled with an additive white noise term, and then I would like to plot this "perturbed" signal together with the original signal (y and s respectively).
The problem is that with the code above I obtain something like:
It is probably a simple thing, but I'm kinda stuck with this.
Any hint or suggestion will be greatly appreciated.
Lines are plotted sequentially, and you decided to randomly draw your X values, so x values sitting next to each other in x are not next to each other on the axis - hence the mess. Just sort it:
x<-sort(sample(seq(0,1,length.out = 1000),200))
y<-2*sin(4*pi*x)-6*abs(x-0.4)^(0.3)+2*exp(-30*(4*x-2)^2)+8*x+rnorm(200,0,0.5)
s<-2*sin(4*pi*x)-6*abs(x-0.4)^(0.3)+2*exp(-30*(4*x-2)^2)+8*x
plot(x,y)
lines(x,s,col="red")
Another way to do this on the fly mentioned by mickey is:
ord = order(x)
lines(x[ord], s[ord], col = 'red')
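To see what order() is doing here, a small check may help. This is only a sketch using the same sampling and signal as the question; the seed and the names x_sorted and s_sorted are my own additions.

```r
set.seed(42)  # assumption: seed added only so the run is repeatable
x <- sample(seq(0, 1, length.out = 1000), 200)
s <- 2*sin(4*pi*x) - 6*abs(x - 0.4)^(0.3) + 2*exp(-30*(4*x - 2)^2) + 8*x

ord <- order(x)      # permutation that puts x in ascending order
x_sorted <- x[ord]
s_sorted <- s[ord]   # the same permutation keeps each (x, s) pair together

# lines() joins points in vector order, so x must be ascending
print(is.unsorted(x_sorted))  # FALSE

plot(x, s, col = "grey")
lines(x_sorted, s_sorted, col = "red")
```

Applying one permutation to both vectors is the important part: sorting x and s independently would break the pairing between them.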
You need to sort the x observations in ascending order. You can do that by storing everything in a data frame and then ordering it:
x<-sample(seq(0,1,length.out = 1000),200)
df_p= data.frame(x)
df_p$y<-2*sin(4*pi*df_p$x)-6*abs(df_p$x-0.4)^(0.3)+2*exp(-30*(4*df_p$x-2)^2)+8*df_p$x+rnorm(200,0,0.5)
df_p$s<-2*sin(4*pi*df_p$x)-6*abs(df_p$x-0.4)^(0.3)+2*exp(-30*(4*df_p$x-2)^2)+8*df_p$x
df_p = df_p[order(df_p$x),]
plot(df_p$x,df_p$y)
lines(df_p$x, df_p$s,col="red")
Also, if you want to avoid this step, you can use the ggplot2 library, since geom_line() connects observations in order of x:
library(ggplot2)
p <- ggplot(df_p) + geom_point(aes(x = x, y = y)) + geom_line(aes(x = x, y = s), color = "red")
print(p)

How to consistently plot a tree after removing tips?

Imagine I have a tree (or dendrogram)
require(ape)
fulltree <- rtree(n=50, br=NULL)
...and then I remove some tips
prunetree <- drop.tip(fulltree, tip=5)
If I plot the pruned tree, R rescales it so that only those tips remaining are considered.
par(mfrow=c(1,2))
plot(fulltree, type="fan")
plot(prunetree, type="fan")
But this makes it really hard to tell what part of the tree is now missing.
Is there a simple way that I can plot the pruned tree in the same scale/arrangement/etc. as the complete tree so that none of the remaining branches appear to move around? (In this example, I would get some kind of pac-man shape rather than a full circle) I'm thinking this could be done by coloring branches white or light grey. It would be really useful if someone wanted to animate a tree that was losing tips.
The problem, as you stated, is that the data is removed from the new tree, so the plot is rescaled. To avoid this, you might be better off plotting the full tree with a different color for the tip(s) you want to drop.
We can do this using the excellent package ggtree (amongst other methods):
set.seed(1234)
library(ggtree)
library(gridExtra)
fulltree <- rtree(n=10, br=NULL)
col <- rep(1, 2*fulltree$Nnode + 1)
col[5] <- 10
grid.arrange(ggtree(fulltree, layout = "fan") + geom_text(aes(label=label)),
ggtree(fulltree, col = col, layout = "circular") + geom_text(aes(label=label)))
The actual coloring comes from col[5] <- 10: change the 5 in col[5] to the index of your desired dropped tip, and the 10 to your desired colour.
Thanks jeremycg for the ggtree tip. I think this was more what I was looking for.
require(ape)
library(ggtree)
library(gridExtra)
library(ggplot2)
set.seed(1234)
fulltree <- rtree(n=50, br=NULL)
#These are the tips to drop
prunetips <- c("t41","t44","t42","t8")
#But get the tips to keep
keeptips <- fulltree$tip.label[!fulltree$tip.label %in% prunetips]
#Group the tips to keep
prunetree <- groupOTU(fulltree, focus=keeptips)
#And plot
ggtree(prunetree, layout="fan", aes(color=group))+
scale_color_manual(values=c("lightgrey","black"))+
geom_tiplab()
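For anyone who wants to stay with plain ape plotting, something along these lines might also work. It is a sketch, not tested beyond this toy tree: it uses ape's which.edge() to find the edges connecting the kept tips and greys out everything else. The name edge_col is my own.

```r
library(ape)

set.seed(1234)
fulltree <- rtree(n = 50, br = NULL)

# Tips to drop, and the tips that remain
prunetips <- c("t41", "t44", "t42", "t8")
keeptips  <- setdiff(fulltree$tip.label, prunetips)

# Start with every edge grey, then blacken the edges that connect the kept tips
edge_col <- rep("lightgrey", nrow(fulltree$edge))
edge_col[which.edge(fulltree, keeptips)] <- "black"

# Plot the full tree, so the layout never changes as tips are "removed"
plot(fulltree, type = "fan", edge.color = edge_col)
```

Because the full tree is always the one being plotted, repeating this with a growing prunetips vector should give frames that line up for an animation.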

Geom_points not dodging when geom_errorbars are

I can't figure out how to get these geom_points to properly dodge! I've searched many, MANY how-to's and questions on different stackexchange pages, but none of them fix the problem.
analyze_weighted <- data.frame(
  mus = c(clean_mu,b_mu,d_mu,g_mu,bd_mu,bg_mu,dg_mu,bdg_mu,m_mu),
  sds = c(clean_sigma,b_sigma,d_sigma,g_sigma,bd_sigma,bg_sigma,dg_sigma,bdg_sigma,m_sigma),
  SNR = c("No shifts","1 shift","1 shift","1 shift","2 shifts","2 shifts","2 shifts","3 shifts","4 shifts")
)
And then I try to plot it:
ggplot(analyze_weighted, aes(x=SNR,y=mus,color=SNR,group=mus)) +
geom_point(position="dodge",na.rm=TRUE) +
geom_errorbar(position="dodge",aes(ymax=mus+sds/2,ymin=mus-sds/2,), width=0.25)
And it manages to dodge the error bars but not the points! I'm going crazy here, what do I do?
Here's what it looks like now--I want the points to be slightly dodged!
Points have no intrinsic width, so position="dodge" has nothing to dodge by: geom_point requires that you explicitly provide the width by which the points should dodge.
This should work:
ggplot(analyze_weighted, aes(x=SNR,y=mus,color=SNR,group=mus)) +
geom_point(position=position_dodge(width=0.2),na.rm=TRUE) +
geom_errorbar(position=position_dodge(width=0.2),aes(ymax=mus+sds/2,ymin=mus-sds/2),width=0.25)
Please note that your example wasn't fully reproducible, as the values of the variables used to construct mus and sds are not available.
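For a version that can be run as-is, here is a sketch with placeholder numbers. The mus and sds values are hypothetical (the question does not give clean_mu, b_mu, and so on), and the shared dodge object is my own addition so that the points and error bars are guaranteed to use the same offset.

```r
# Hypothetical placeholder values; the real ones are not in the question
analyze_weighted <- data.frame(
  mus = c(1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6),
  sds = c(0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 0.3, 0.4, 0.2),
  SNR = c("No shifts", "1 shift", "1 shift", "1 shift",
          "2 shifts", "2 shifts", "2 shifts", "3 shifts", "4 shifts")
)

# Plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One position object, reused by both layers so they stay aligned
  dodge <- position_dodge(width = 0.2)
  p <- ggplot(analyze_weighted, aes(x = SNR, y = mus, color = SNR, group = mus)) +
    geom_point(position = dodge, na.rm = TRUE) +
    geom_errorbar(aes(ymax = mus + sds/2, ymin = mus - sds/2),
                  position = dodge, width = 0.25)
  print(p)
}
```

Note that the error bar's own width (0.25) and the dodge width (0.2) are separate settings: the first controls how wide the caps are drawn, the second how far apart the groups are shifted.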

How to color branches in cluster dendrogram?

I would really appreciate it if any of you could show me how to color the main branches of a fan cluster dendrogram.
Please use the following example:
library(ape)
library(cluster)
data(mtcars)
plot(as.phylo(hclust(dist(mtcars))),type="fan")
You will need to be more specific about what you mean by "color the main branches" but this may give you some ideas:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.col=c("black", "green")[1+(phyl$edge.length >40) ])
The odd numbered edges are the radial arms in a fan plot so this mildly ugly (or perhaps devilishly clever?) hack colors only the arms with length greater than 40:
phyl <-as.phylo(hclust(dist(mtcars)))
plot(phyl,type="fan", edge.col=c("black", "black", "green")[
c(TRUE, FALSE) + 1 + (phyl$edge.length >40) ])
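The index arithmetic in that hack is easier to follow on a tiny vector. A sketch with four made-up edge lengths (edge_length and is_odd are hypothetical names; the real code relies on c(TRUE, FALSE) being recycled along phyl$edge.length):

```r
# Hypothetical edge lengths: edges 1 and 2 are long, edges 3 and 4 are short
edge_length <- c(50, 50, 10, 10)

# TRUE for odd-numbered edges, FALSE for even-numbered ones
is_odd <- rep(c(TRUE, FALSE), length.out = length(edge_length))

# Logicals count as 0/1, so the index is 1, 2 or 3:
#   3 = odd AND long; 2 = odd or long but not both; 1 = neither
idx <- is_odd + 1 + (edge_length > 40)
print(idx)

# Only index 3 maps to green, so only odd-numbered long edges get colored
edge_colors <- c("black", "black", "green")[idx]
print(edge_colors)
```

Here only the first edge is both odd-numbered and longer than 40, so it is the only one that comes out green.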
If you want to color the main branches to indicate which class that sample belongs to, then you might find the function ColorDendrogram in the R package sparcl useful (can be downloaded from here). Here's some sample code:
library(sparcl)
# Create a fake two sample dataset
set.seed(1)
x <- matrix(rnorm(100*20),ncol=20)
y <- c(rep(1,50),rep(2,50))
x[y==1,] <- x[y==1,]+2
# Perform hierarchical clustering
hc <- hclust(dist(x),method="complete")
# Plot
ColorDendrogram(hc,y=y,main="My Simulated Data",branchlength=3)
This will generate a dendrogram where the leaves are colored according to which of the two samples they came from.
