Clustering dots in a scatterplot


Let's say I have this data.frame:
df <- data.frame(x = rep(1, 20), y = runif(20, 10, 20))
and I want to plot df$y vs. df$x.
Since the x values are constant, points with identical or close y values are plotted on top of each other in a simple scatterplot, which hides the density of points at those y values. One solution is of course a violin plot.
I'm looking for another solution: plotting clusters of points instead of the individual points, which would look similar to a bubble plot. A bubble plot, however, requires a third dimension to make the bubbles meaningful, which my data doesn't have. Does anyone know of an R function/package that takes points as input (and probably a defined radius), clusters them, and plots them?

You can jitter the x values:
plot(jitter(df$x),df$y)
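If the default amount of jitter is too small or too large, it can be set explicitly. A minimal sketch, assuming the df from the question; the value 0.05 and the name xj are only illustrative choices:

```r
set.seed(1)
df <- data.frame(x = rep(1, 20), y = runif(20, 10, 20))

# Shift each x by a uniform random amount in [-0.05, 0.05]; y is left untouched
xj <- jitter(df$x, amount = 0.05)

# Widen the x-axis so the jittered column stays visually narrow
plot(xj, df$y, xlim = c(0.5, 1.5), xlab = "x", ylab = "y")
```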

You could try a hexbin plot, using either the hexbin package or stat_binhex in ggplot2.
http://cran.r-project.org/web/packages/hexbin/
http://docs.ggplot2.org/0.9.3/stat_binhex.html

The other standard approach (vs. jitter) is to use a partially transparent color, so that overlapping points will appear darker than "lone" points.
De gustibus, etc.
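In base graphics this might look like the following sketch, again assuming the df from the question; the alpha value 0.3 and the name col_alpha are arbitrary choices, not something the answer prescribes:

```r
set.seed(1)
df <- data.frame(x = rep(1, 20), y = runif(20, 10, 20))

# adjustcolor() adds an alpha channel to a named colour
col_alpha <- adjustcolor("black", alpha.f = 0.3)

# Filled circles (pch = 16) darken where several points overlap
plot(df$x, df$y, pch = 16, cex = 2, col = col_alpha)
```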

Using transparency is another solution. E.g.:
ggplot(df, aes(x=x, y=y)) +
geom_point(alpha=0.2, size=3)
When there is only one x value, a density plot:
ggplot(df, aes(x=y)) +
stat_density(geom="line")
or a violin plot:
ggplot(df, aes(x=x, y=y)) +
geom_violin()
might also be options for displaying your data.

Look at the sunflowerplot function (and the xyTable function that it uses to count overlapping points).
You could also use the my.symbols function from the TeachingDemos package with the results of xyTable to use other shapes (polygrams, for example).
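A rough sketch of the xyTable idea, assuming the df from the question. Rounding y to whole numbers stands in for the "radius" the question mentions; that rounding, and sizing the symbols by the square root of the count, are assumptions made here for illustration:

```r
set.seed(1)
df <- data.frame(x = rep(1, 20), y = runif(20, 10, 20))

# Round y so that nearby points collapse onto the same coordinate,
# then count how many points share each (x, y) pair
y_round <- round(df$y)
tab <- xyTable(df$x, y_round, digits = 6)

# One petal per overlapping point
sunflowerplot(df$x, y_round, digits = 6)

# Bubble-like alternative: symbol size grows with the number of points
plot(tab$x, tab$y, pch = 16, cex = sqrt(tab$number), xlab = "x", ylab = "y")
```

The counts in tab$number could equally be passed on to my.symbols if other shapes are wanted.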

Related

R - Bar Plot with transparency based on values?

I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color=Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I am not understanding how I would be able to convert absolute values to a stat(density) value needed to plot those.
Any suggestions?

Thanks to u/Joris for the helpful suggestion! Since I did not find this question elsewhere, I'll go ahead and post the pretty simple solution to my question here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha=y, ...). In theory, I could apply this to any geom. I tried geom_col(), which worked, but the best solution was to use geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments in order to avoid overplotting problems similar to those described in other questions.
ggplot(myData, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=y), color='blue3', size=14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, if I wanted to "match" the intensity I normalized the data (myDataNorm) and could make the same plot. In my particular case, I kind of preferred bars that did not have a gradient, but which showed a hard edge for the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=ifelse(y>0.9,1,0))) +
theme(legend.position='none')
Better, but I did not like the faint-colored areas that were left. The final code is what gave me something that perfectly captured what I was looking for. I simply moved the ifelse() statement to apply to the x aesthetic, so the parts of the segment drawn were only those with high enough y values. Note my data "starts" at x=290 here. Probably more elegant ways to combine those x and xend terms, but whatever:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(
x=ifelse(y>0.9,x,290), xend=ifelse(y>0.9,x-1,290),
y=Sample, yend=Sample), color='blue3', size=14) +
xlim(290,400) # needed to show entire scale
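The normalization that produces myDataNorm is not shown above. One way it could be done is to divide each Sample's y values by that Sample's maximum. In this sketch the two dnorm() curves are made-up stand-ins for the real measurements, which are elided in the question:

```r
# Hypothetical long-format data with the same columns as myData
x <- 290:450
myData <- data.frame(
  x = rep(x, times = 2),
  Sample = rep(c("X52241", "X75123"), each = length(x)),
  y = c(dnorm(x, mean = 340, sd = 15), 3 * dnorm(x, mean = 400, sd = 20))
)

# Scale y within each Sample so that every Sample peaks at exactly 1
myDataNorm <- myData
myDataNorm$y <- ave(myData$y, myData$Sample, FUN = function(v) v / max(v))
```

With every Sample scaled to a maximum of 1, a threshold such as y>0.9 means the same thing for each bar.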

Wrong density values in a histogram with `fill` option in `ggplot2`

I was creating histograms with ggplot2 in R whose bins are separated with colors and noticed one thing. When the bins of a histogram are separated by colors with fill option, the density value of the histogram turns funny.
Here is the data.
set.seed(42)
x <- rnorm(10000,0,1)
df <- data.frame(x=x, b=x>1)
This is a histogram without fill.
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..))
This is a histogram with fill.
ggplot(df, aes(x = x, fill=b)) +
geom_histogram(aes(y=..density..))
You can see the latter is pretty crazy. The left side of the bins is sticking out. The density values of the bins of each color are obviously wrong.
I thought over this issue for a while. The data can't be wrong for the first histogram was normal. It should be something in ggplot2 or geom_histogram function. I googled "geom_histogram density fill" and couldn't find much help.
I want the end product to look like:
Separated by colors as you see in the second histogram
Size and shape identical to the first histogram
The vertical axis being density
How would you deal with this issue?

I think what you may want is this:
ggplot(df, aes(x = x, fill=b)) +
geom_histogram()
Rather than the density. As mentioned above, the density requires extra calculations.
One thing that is important (in my opinion) is that histograms are graphs of one variable. As soon as you start adding data from other variables, you start to turn them into bar charts or something similar.
You will want to work on setting the axis manually if you want it to range from 0 to .4.

The solution is to hand-compute density like this (instead of using the built-in ggplot2 version):
library(ggplot2)
# Generate test data
set.seed(42)
x <- rnorm(10000,0,1)
df <- data.frame(x=x, b=x>1)
ggplot(df, aes(x = x, fill=b)) +
geom_histogram(mapping = aes(y = ..count.. / (sum(..count..) * ..width..)))
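Why this works can be checked without ggplot2. The sketch below redoes the arithmetic in base R with hist(). The 30 equal-width bins and all variable names are assumptions for illustration, and it relies on sum(..count..) running over both fill groups, which appears to be the case here:

```r
set.seed(42)
x <- rnorm(10000, 0, 1)
b <- x > 1

# Shared, equal-width bins for both groups
breaks <- seq(floor(min(x)), ceiling(max(x)), length.out = 31)
widths <- diff(breaks)

cnt_false <- hist(x[!b], breaks = breaks, plot = FALSE)$counts
cnt_true  <- hist(x[b],  breaks = breaks, plot = FALSE)$counts

# Per-group density: each group is normalized by its own size,
# so each group integrates to 1 and the stacked total integrates to 2
dens_group <- cnt_false / (sum(cnt_false) * widths) +
              cnt_true  / (sum(cnt_true)  * widths)
area_group <- sum(dens_group * widths)

# Hand-computed density: both groups are normalized by the overall count,
# so the stacked total integrates to 1, like the unfilled histogram
n_total <- length(x)
dens_all <- (cnt_false + cnt_true) / (n_total * widths)
area_all <- sum(dens_all * widths)
```

Under these assumptions area_group comes out as 2 and area_all as 1, which matches the inflated bars in the filled histogram from the question.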

When you provide a column name for the fill aesthetic, ggplot groups the observations by that variable and plots each group in its own colour.
If you want a single colour for the plot, just specify the colour you want:
Fixed:
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..),fill="Blue")

tweaks to customized legends with ggplot and cowplot: colour matching issue

I'm trying to create a picture with points (actually bars, but whatever) in two distinct colours with parallel saturated-to-unsaturated colour scales, with corresponding colourbar legends. I'm most of the way there, but there are a few minor points I can't handle yet.
tl;dr the color scales I get from a red-to-white gradient and a saturated-red-to-completely-unsaturated gradient are not identical.
Set up data: y will determine both y-axis position and degree of saturation, w will determine binary colour choice.
set.seed(101)
dd <- data.frame(x=1:100,y=rnorm(100))
dd$w <- as.logical(sample(0:1,size=nrow(dd),
replace=TRUE))
Get packages:
library(ggplot2)
library(cowplot)
library(gridExtra)
I can get the plot I want by allowing alpha (transparency) to vary with y, but the legend is ugly:
g0 <- ggplot(dd,aes(x,y))+
geom_point(size=8,aes(alpha=y,colour=w))+
scale_colour_manual(values=c("red","blue"))
## + scale_alpha(guide="colourbar") ## doesn't work
I can draw each half of the points by themselves to get a legend similar to what I want:
g1 <- ggplot(dd[!dd$w,],aes(x,y))+
geom_point(size=8,aes(colour=y))+
scale_colour_gradient(low="white",high="red",name="not w")+
expand_limits(x=range(dd$x),y=range(dd$y))
g2 <- ggplot(dd[dd$w,],aes(x,y))+
geom_point(size=8,aes(colour=y))+
scale_colour_gradient(low="white",high="blue",name="w")+
expand_limits(x=range(dd$x),y=range(dd$y))
Now I can use tools from cowplot to pick off the legends and combine them with the original plot:
g1_leg <- get_legend(g1)
g2_leg <- get_legend(g2)
g0_noleg <- g0 + theme(legend.position='none')
ggdraw(plot_grid(g0_noleg,g1_leg,g2_leg,nrow=1,rel_widths=c(1,0.2,0.2)))
This is most of the way there, but:
ideally I'd like to squash the two colourbars together (I know I can probably do that with sufficient grid-hacking ...)
the colours don't quite match; the legend colours are slightly warmer than the point colours ...
Ideas? Or other ways of achieving the same goal?
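One possible contributor to the colour mismatch (an assumption, not something confirmed in this thread): scale_colour_gradient interpolates in Lab colour space, whereas a semi-transparent colour is blended with the background linearly in RGB. The base R sketch below only shows that the two interpolations give different midpoints between white and red; it does not reproduce the ggplot2 rendering, and the default grey panel background may matter as well:

```r
# Midpoint between white and red, interpolated linearly in RGB
mid_rgb <- colorRamp(c("white", "red"), space = "rgb")(0.5)

# The same midpoint, interpolated in Lab colour space
mid_lab <- colorRamp(c("white", "red"), space = "Lab")(0.5)

# Columns are red, green and blue on a 0-255 scale
print(rbind(rgb = as.numeric(mid_rgb), Lab = as.numeric(mid_lab)))
```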

Problems making a graphic in ggplot

I am working with ggplot and want to design a graphic with two continuous variables. I would like to get a graphic like this:
Here x and y are the continuous variables. My problem is that I can't get it to show circles on the line of the plot. I would like the plot to have a circle for each pair of observations of the continuous variables. For example, the attached graphic has a circle for the pairs (1,1), (2,2) and (3,3). Is it possible to get this? (The colour of the line doesn't matter.)

# dummy data
dat <- data.frame(x = 1:5, y = 1:5)
ggplot(dat, aes(x,y,color=x)) +
geom_line(size=3) +
geom_point(size=10) +
scale_colour_continuous(low="blue",high="red")
Playing with low/high will change the colours.
In general, to remove the legend, use + theme(legend.position="none")

Creating a facet_wrap plot with ggplot2 with different annotations in each plot

I am using ggplot2 to explore the result of some testing on an agent-based model. The model can end in one of three rounds per realization, and as such I am interested in how player utilities differ in terms of what round the game ends and their relative position in 2D space.
All this is to say that I have generated a facet_wrap plot to show this for each round, but I would also like to annotate each plot with the cor(x,y) for the subset of data represented in each facet. Is there a way to tell ggplot2 that I would like the annotation to use the subset of data generated by facet_wrap? Here is the code I have so far, and what it is producing
library(ggplot2)
# Load data
abm.data<-read.csv("ABM_results.csv")
# Create new column for area of Pareto set
attach(abm.data)
area<-abs(((x3*(y2-y1))+(x2*(y1-y3))+(x1*(y3-y2)))/2)
abm.data<-transform(abm.data,area=area)
detach(abm.data)
# Compare area of Pareto set with player utility
png("area_p1.png",res=100,pointsize=20,height=500,width=1600)
area.p1<-ggplot(abm.data,aes(x=area))+geom_point(aes(y=U1_2,colour="Player 1",alpha=0.4))+facet_wrap(~round,ncol=3)+
annotate("text",0.375,-1.25,label=paste("rho=",round(cor(abm.data$area,abm.data$U1_2),2)), parse=TRUE)+
scale_colour_manual(values=c("Player 1"="red"))
area.p1+xlab("Area of Pareto Set")+ylab("Player Utility at Game End")+
labs(title="Final Player 1 Utility by Pareto Set Size and Round Game Ends")+theme(legend.position="none")
dev.off()
(source: drewconway.com)
As you can see, there are two problems:
The \rho value is of the full dataset, rather than the subsets by 'round'. Is there a way to get the cor(x,y) to print based on only the data shown in each plot?
The annotation should read "\rho=some_value" but instead I get "=(\rho,value);" is there a way to fix this?

To fix the second problem use
annotate("text", 0.375, -1.25,
label=paste("rho==", round(cor(abm.data$area, abm.data$U1_2), 2)),
parse=TRUE)
i.e. "rho==".
Edit: Here is a solution to solve the first problem
library("plyr")
library("ggplot2")
set.seed(1)
df <- data.frame(x=rnorm(300), y=rnorm(300), cl=gl(3,100)) # create test data
df.cor <- ddply(df, .(cl), function(val) sprintf("rho==%.2f", cor(val$x, val$y)))
p1 <- ggplot(data=df, aes(x=x)) +
geom_point(aes(y=y, colour="col1", alpha=0.4)) +
facet_wrap(~ cl, ncol=3) +
geom_text(data=df.cor, aes(x=0, y=3, label=V1), parse=TRUE) +
scale_colour_manual(values=c("col1"="red")) +
theme(legend.position="none")
print(p1)
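The per-facet label table can also be built without plyr. A sketch in base R, assuming the same test data; the column is named V1 only so that it matches the label=V1 mapping used in geom_text above:

```r
set.seed(1)
df <- data.frame(x = rnorm(300), y = rnorm(300), cl = gl(3, 100))

# Correlation of x and y within each level of the faceting variable
rho <- sapply(split(df, df$cl), function(d) cor(d$x, d$y))

# One row per facet; cl must have the same name and levels as in df
# so that geom_text() places each label in the matching panel
df.cor <- data.frame(
  cl = factor(names(rho), levels = levels(df$cl)),
  V1 = sprintf("rho==%.2f", rho)
)
print(df.cor)
```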

The same question arises when adding segments to each facet. These problems can be solved in general by using geom_segment instead of annotate("segment",...): for any geom_foo, we can define a data.frame that holds the per-facet data for that geom.
