ggplot2 clustering in R

Can someone point me in the right direction for making a plot like this one with ggplot2? Even just the function type would help.
I've been looking around in ggplot2 and can't find anything like this.

I'm assuming that the essential features of the plot are that:
a.) the x-axis is categorical,
b.) the x-positions of the points are varied slightly, and
c.) some summary statistic is drawn for each group (I used medians).
If that's what you're looking for, try this:
library(ggplot2)
library(plyr)
# define the data
lev <- gl(2, 10, 20, labels=c("I", "II"))
y <- runif(20)
df <- data.frame(lev, y)
# calculate the medians - I'm guessing that's what the horizontal lines are?
meds <- ddply(df, .(lev), summarise, med = median(y))
ggplot(df, aes(x=lev, y=y, colour=lev)) +
  geom_point(position="jitter") +
  theme_bw() +
  scale_colour_manual(values=c("red", "darkblue")) +
  geom_errorbar(data=meds, aes(x=lev, y=med, ymin=med, ymax=med))
You can use annotate() to add the numbers and the little bracket if that is important.
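As a rough sketch of the annotate() idea: a bracket can be built from three line segments with a text label above it. The bracket height and the label text below are made-up placeholders, not values taken from the original figure, and the toy data only mimics the example above.

```r
library(ggplot2)

# Toy data shaped like the example above
set.seed(1)
df <- data.frame(lev = gl(2, 10, 20, labels = c("I", "II")), y = runif(20))

# Height of the bracket: a hypothetical placeholder, adjust to your data
top <- 1.10

p <- ggplot(df, aes(x = lev, y = y, colour = lev)) +
  geom_point(position = position_jitter(width = 0.1, height = 0)) +
  # horizontal bar of the bracket, spanning both groups
  annotate("segment", x = 1, xend = 2, y = top, yend = top) +
  # short vertical ticks at each end of the bracket
  annotate("segment", x = 1, xend = 1, y = top, yend = top - 0.03) +
  annotate("segment", x = 2, xend = 2, y = top, yend = top - 0.03) +
  # text centred above the bracket (placeholder label)
  annotate("text", x = 1.5, y = top + 0.04, label = "p = 0.03 (placeholder)")
p
```

On a categorical axis the groups sit at x = 1, 2, ..., which is why plain numbers work as annotation positions here. The point layer has to come before the annotations so that the x scale is set up as discrete first.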

Related

Creating barplot with standard errors plotted in R

I am trying to find the best way to create barplots in R with standard errors displayed. I have seen other articles but I cannot figure out the code to use with my own data (having not used ggplot before and this seeming to be the most used way and barplot not cooperating with dataframes). I need to use this in two cases for which I have created two example dataframes:
Plot df1 so that the x-axis has sites a-c, with the y-axis displaying the mean value for V1 and the standard errors highlighted, similar to this example with a grey colour. Here, plant biomass should be the mean V1 value and treatments should be each of my sites.
Plot df2 in the same way, but so that before and after are located next to each other in a similar way to this, so pre-test and post-test equate to before and after in my example.
x <- factor(LETTERS[1:3])
site <- rep(x, each = 8)
values <- as.data.frame(matrix(sample(0:10, 3*8, replace=TRUE), ncol=1))
df1 <- cbind(site,values)
z <- factor(c("Before","After"))
when <- rep(z, each = 4)
df2 <- data.frame(when,df1)
Apologies for the simplicity to more experienced R users, particularly those who use ggplot, but I cannot apply the snippets of code that I have found elsewhere to my data. I cannot even get enough code together to produce a start to a graph, so I hope my descriptions are sufficient. Thank you in advance.

Something like this?
library(ggplot2)
# mean +/- 1 standard error, in the ymin/ymax form that geom_errorbar needs
get.se <- function(y) {
  se <- sd(y)/sqrt(length(y))
  mu <- mean(y)
  c(ymin=mu-se, ymax=mu+se)
}
# fun= replaces the deprecated fun.y= in ggplot2 3.3.0 and later
ggplot(df1, aes(x=site, y=V1)) +
  stat_summary(fun=mean, geom="bar", fill="lightgreen", color="grey70")+
  stat_summary(fun.data=get.se, geom="errorbar", width=0.1)
ggplot(df2, aes(x=site, y=V1, fill=when)) +
  stat_summary(fun=mean, geom="bar", position="dodge", color="grey70")+
  stat_summary(fun.data=get.se, geom="errorbar", width=0.1, position=position_dodge(width=0.9))
So this takes advantage of the stat_summary(...) function in ggplot to, first, summarize y for given x using mean(...) (for the bars), and then to summarize y for given x using the get.se(...) function for the error-bars. Another option would be to summarize your data prior to using ggplot, and then use geom_bar(...) and geom_errorbar(...).
Also, plotting +/- 1 se is not a great practice (although it's used often enough). You'd be better served plotting legitimate confidence limits, which you could do, for instance, using the built-in mean_cl_normal function instead of the contrived get.se(...). mean_cl_normal returns the 95% confidence limits based on the assumption that the data is normally distributed (or you can set the CL to something else; read the documentation).
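To make the difference concrete, here is a minimal base-R sketch of a t-based interval of the kind mean_cl_normal reports. The helper name ci95 and the sample values are made up for illustration; mean_cl_normal itself needs the Hmisc package installed.

```r
# Hypothetical helper: mean with a t-based confidence interval, returned in
# the y / ymin / ymax layout that stat_summary(fun.data = ...) expects
ci95 <- function(y, conf = 0.95) {
  n  <- length(y)
  mu <- mean(y)
  se <- sd(y) / sqrt(n)
  # half-width of the interval: t quantile times the standard error
  half <- qt(1 - (1 - conf) / 2, df = n - 1) * se
  data.frame(y = mu, ymin = mu - half, ymax = mu + half)
}

# Made-up sample of 8 values, the same group size as in df1
vals <- c(4, 7, 1, 9, 5, 6, 3, 8)
ci95(vals)
```

It can be dropped into the plots above as stat_summary(fun.data=ci95, geom="errorbar", width=0.1). With 8 observations per site the multiplier qt(0.975, 7) is about 2.36, so these bars come out more than twice as long as the +/- 1 se bars.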

I used group_by and summarise from dplyr for this, and the std.error function from the plotrix package.
library(plotrix) # for the std.error function
library(dplyr) # for group_by and summarise
library(ggplot2) # for plotting
For the df1 plot:
# Group data by site
grouped_df1 <- group_by(df1, site)
# Summarise the grouped data: mean and standard error of V1 (std.error is from plotrix)
summarised_df1 <- summarise(grouped_df1, mean = mean(V1), std_error = std.error(V1))
# Define the top and bottom of the error bars
limits <- aes(ymax = mean + std_error, ymin = mean - std_error)
# Begin the ggplot: site on the x axis, mean on the y axis
g <- ggplot(summarised_df1, aes(site, mean))
# Draw the bars from the precomputed means
g <- g + geom_bar(stat = "identity")
# Add the error bars
g <- g + geom_errorbar(limits, width = 0.25)
# Print the graph
g
For the df2 plot:
# Group data by when and site
grouped_df2 <- group_by(df2, when, site)
# Summarise the grouped data: mean and standard error of V1
summarised_df2 <- summarise(grouped_df2, mean = mean(V1), std_error = std.error(V1))
# Define the top and bottom of the error bars
limits <- aes(ymax = mean + std_error, ymin = mean - std_error)
# Begin the ggplot: site vs mean, filled by the factor variable when
g <- ggplot(summarised_df2, aes(site, mean, fill = when))
# position_dodge places the bars for each level of when side by side
g <- g + geom_bar(stat = "identity", position = position_dodge())
# Add the error bars, dodged by the same width as the bars
g <- g + geom_errorbar(limits, width = 0.25, position = position_dodge(width = 0.9))
# Print the graph
g

How to reverse axis order and use a predefined scale in ggplot?

I've read a past post asking about using scale_reverse and scale_log10 at the same time. I have a similar issue, except my scale I'm seeking to "reverse" is a pre-defined scale in the "scales" package. Here is my code:
##Defining y-breaks for probability scale
ybreaks <- c(1,2,5,10,20,30,40,50,60,70,80,90,95,98,99)/100
#Random numbers, and their corresponding Weibull probability values (which I'm trying to plot)
x <- c(.3637, .1145, .8387, .9521, .330, .375, .139, .662, .824, .899)
p <- c(.647, .941, .255, .059, .745, .549, .853, .451, .352, .157)
df <- data.frame(x, p)
require(scales)
require(ggplot2)
ggplot(df)+
geom_point(aes(x=x, y=p, size=2))+
stat_smooth(method="lm", se=FALSE, linetype="dashed", aes(x=x, y=p))+
scale_x_continuous(trans='probit',
breaks=ybreaks,
minor_breaks=qnorm(ybreaks))+
scale_y_log10()
Resulting plot:
For more information, the scale I'm trying to achieve is the probability plotting scale, which has finer resolution on either end of the scale (at 0 and 1) to show extreme events, with ever-decreasing resolution toward the median value (0.5).
I want to be able to use scale_x_reverse concurrently with my scale_x_continuous probability scale, but I don't know how to build that in any sort of custom scale. Any guidance on this?

Arguments in scale_(x|y)_reverse() are passed to scale_(x|y)_continuous() so you should simply do:
scale_x_reverse(trans='probit', breaks = ybreaks, minor_breaks=qnorm(ybreaks))
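If passing trans= to scale_x_reverse() fails on your ggplot2 version because scale_x_reverse() already sets its own transformation (an assumption worth checking), one alternative is to fold both steps into a single custom transformation with scales::trans_new(). The sketch below is only an illustration; the name reverse_probit_trans is made up.

```r
library(scales)

# Hypothetical custom transformation: probit first, then flip the sign so that
# the axis runs in the reverse direction
reverse_probit_trans <- function() {
  trans_new(
    name      = "reverse_probit",
    transform = function(x) -qnorm(x),
    inverse   = function(x) pnorm(-x),
    domain    = c(0, 1)
  )
}

# Round trip on a few probabilities: inverse(transform(p)) should give p back
tr <- reverse_probit_trans()
tr$inverse(tr$transform(c(0.01, 0.5, 0.99)))
```

It would then be used as scale_x_continuous(trans = reverse_probit_trans(), breaks = ybreaks), with the breaks given in the original probability units. Newer ggplot2 versions name that argument transform= instead of trans=.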

Rather than try to combine two transformations, why not transform your existing data and then plot it?
The following looks like it should be right.
#http://r.789695.n4.nabble.com/Inverse-Error-Function-td802691.html
erf.inv <- function(x) qnorm((x + 1)/2)/sqrt(2)
#http://en.wikipedia.org/wiki/Probit#Computation
probit <- function(x) sqrt(2)*erf.inv((2*x)-1)
# probit(0.3637)
df$z <- probit(df$x)
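ggplot(df)+
  geom_point(aes(x=z, y=p), size=2)+
  stat_smooth(method="lm", se=FALSE, linetype="dashed", aes(x=z, y=p))+
  # z is already on the probit scale, so place the breaks at qnorm(ybreaks)
  # and label them with the original probabilities
  scale_x_reverse(breaks = qnorm(ybreaks), labels = ybreaks)+
  scale_y_log10()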

How to plot stacked point histograms?

What's the ggplot2 equivalent of "dotplot" histograms? With stacked points instead of bars? Similar to this solution in R:
Plot Histogram with Points Instead of Bars
Is it possible to do this in ggplot2? Ideally with the points shown as stacks and a faint line showing the smoothed line "fit" to these points (which would make a histogram shape.)

ggplot2 does dotplots; see geom_dotplot() in the manual.
Here is an example:
library(ggplot2)
set.seed(789); x <- data.frame(y = sample(1:20, 100, replace = TRUE))
ggplot(x, aes(y)) + geom_dotplot()
In order to make it behave like a simple dotplot, we should do this:
ggplot(x, aes(y)) + geom_dotplot(binwidth=1, method='histodot')
You should get this:
To address the density issue, you'll have to add another term, ylim(), so that your plot call will have the form ggplot() + geom_dotplot() + ylim()
More specifically, you'll write ylim(0, A), where A will be the number of stacked dots necessary to count 1.00 density. In the example above, the best you can do is see that 7.5 dots reach the 0.50 density mark. From there, you can infer that 15 dots will reach 1.00.
So your new call looks like this:
ggplot(x, aes(y)) + geom_dotplot(binwidth=1, method='histodot') + ylim(0, 15)
Which will give you this:
Usually, this kind of eyeball estimate will work for dotplots, but of course you can try other values to fine-tune your scale.
Notice how changing the ylim values doesn't affect how the data is displayed, it just changes the labels in the y-axis.

As @joran pointed out, we can use geom_dotplot
require(ggplot2)
ggplot(mtcars, aes(x = mpg)) + geom_dotplot()
Edit: (moved useful comments into the post):
The label "count" is misleading because this is actually a density estimate; maybe you could suggest changing this label to "density" by default. The ggplot implementation of the dotplot follows the original one by Leland Wilkinson, so if you want to understand clearly how it works, take a look at this paper.
As for an easy transformation to make the y axis show actual counts, i.e. "number of observations", the help page says:
When binning along the x axis and stacking along the y axis, the numbers on y axis are not meaningful, due to technical limitations of ggplot2. You can hide the y axis, as in one of the examples, or manually scale it to match the number of dots.
So you can use this code to hide y axis:
ggplot(mtcars, aes(x = mpg)) +
geom_dotplot(binwidth = 1.5) +
scale_y_continuous(name = "", breaks = NULL)

I introduce an exact approach using @Waldir Leoncio's latter method.
library(ggplot2); library(grid)
set.seed(789)
x <- data.frame(y = sample(1:20, 100, replace = TRUE))
g <- ggplot(x, aes(y)) + geom_dotplot(binwidth=0.8)
g # output to read parameter
### calculation of width and height of panel
grid.ls(view=TRUE, grob=FALSE)
real_width <- convertWidth(unit(1,'npc'), 'inch', TRUE)
real_height <- convertHeight(unit(1,'npc'), 'inch', TRUE)
### calculation of other values
width_coordinate_range <- diff(ggplot_build(g)$panel$ranges[[1]]$x.range)
real_binwidth <- real_width / width_coordinate_range * 0.8 # 0.8 is the argument binwidth
num_balls <- real_height / 1.1 / real_binwidth # the number of stacked balls. 1.1 is expanding value.
# num_balls is the value of A
g + ylim(0, num_balls)

Apologies : I don't have enough reputation to 'comment'.
I like cuttlefish44's "exact approach", but to make it work (with ggplot2 [2.2.1]) I had to change the following line from :
### calculation of other values
width_coordinate_range <- diff(ggplot_build(g)$panel$ranges[[1]]$x.range)
to
### calculation of other values
width_coordinate_range <- diff(ggplot_build(g)$layout$panel_ranges[[1]]$x.range)
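Since the location of the panel range has moved between ggplot2 releases, a defensive accessor can try each known location in turn. This is only a sketch: the helper name panel_x_range is made up, the panel_params location is an assumption about ggplot2 3.x, and all of these are internal structures that may change again.

```r
library(ggplot2)

# Hypothetical helper: look up the x-range of the first panel, trying the
# locations used by different ggplot2 versions
panel_x_range <- function(g) {
  b <- ggplot_build(g)
  rng <- b$layout$panel_params[[1]]$x.range                     # assumed: ggplot2 3.x
  if (is.null(rng)) rng <- b$layout$panel_ranges[[1]]$x.range   # ggplot2 2.2.x
  if (is.null(rng)) rng <- b$panel$ranges[[1]]$x.range          # older versions
  rng
}

set.seed(789)
x <- data.frame(y = sample(1:20, 100, replace = TRUE))
g <- ggplot(x, aes(y)) + geom_dotplot(binwidth = 0.8)

width_coordinate_range <- diff(panel_x_range(g))
width_coordinate_range
```

The result can be used in place of the width_coordinate_range line in the "exact approach" above; the rest of that calculation stays the same.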

Plot frequency of a value of 2 factors in the same plot in R

I'd like to plot the frequency of a variable, colour coded for 2 factor levels. For example, blue bars should be the histogram of level A and green bars the histogram of level B, both in the same graph. Is this possible with the hist command? The help for hist does not allow for a factor. Is there another way around this?
I managed to do this manually with barplots, but I want to ask if there is a more automatic method.
Many thanks
EC
PS. I don't need density plots

In case the other answers don't cover it, here is an approach that worked for me. I had to deal with stacking histograms recently, and here's what I did:
# keep only the samples that have V1 equal to "Yes"
data_sub <- subset(data, V1 == "Yes")
hist(data$HL)
hist(data_sub$HL, col="red", add=TRUE)
Hopefully this is what you meant?

It's rather unclear what you have as a data layout. A histogram requires that you have a variable that is ordinal or continuous so that breaks can be created. If you also have a separate grouping factor you can plot histograms conditional on that factor. A nice worked example of such a grouping and overlaying a density curve is offered in the second example on the help page for the histogram function in the lattice package.
A nice resource for learning the relative merits of lattice and ggplot2 plotting is the Learning R blog. This is from the first of a multipart series on side-by-side comparison of the two plotting systems:
library(lattice)
library(ggplot2)
data(Chem97, package = "mlmRev")
#The lattice method:
pl <- histogram(~gcsescore | factor(score), data = Chem97)
print(pl)
# The ggplot method:
pg <- ggplot(Chem97, aes(gcsescore)) + geom_histogram(binwidth = 0.5) +
facet_wrap(~score)
print(pg)

I don't think you can do that easily with a bar histogram, as you would have to "interlace" the bars from both factor levels... It would need some kind of "discretization" of the now continuous x axis (i.e. it would have to be split in "categories" and in each category you would have 2 bars, for each factor level...
But it is quite easy and without problems if you are just fine with plotting the density line function:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
dx <- density(x)
dy <- density(y)
plot(dx, xlim = range(dx$x, dy$x), ylim = range(dx$y, dy$y),
type = "l", col = "red")
lines(dy, col = "blue")
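For what it's worth, the "discretization" described above can be sketched in base R: bin both groups with the same breaks, tabulate the counts per bin and factor level, and let barplot() place the bars side by side. The data and variable names below are made up for illustration.

```r
set.seed(42)

# Toy data: one numeric variable and a two-level factor
value <- c(rnorm(200, 0, 1), rnorm(200, 1, 1.5))
group <- factor(rep(c("A", "B"), each = 200))

# Use the same breaks for both groups so that the bins line up
breaks <- pretty(range(value), n = 12)
bins   <- cut(value, breaks = breaks, include.lowest = TRUE)

# Rows are the factor levels, columns are the bins
counts <- table(group, bins)

# beside = TRUE draws the two levels next to each other within each bin
barplot(counts, beside = TRUE, col = c("blue", "green"),
        las = 2, cex.names = 0.7,
        legend.text = TRUE, ylab = "Frequency")
```

This keeps real frequencies on the y axis, which is what the question asks for, at the cost of treating the bins as categories on the x axis.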

It's very possible.
I didn't have data to work with but here's an example of a histogram with different colored bars. From here you'd need to use my code and figure out how to make it work for factors instead of tails.
Basic setup (a template; fill in the placeholders):
# histogram <- hist(scale(your_vector), breaks = <number of breaks>, plot = FALSE)
# plot(histogram, col = ifelse(abs(histogram$breaks) < <number of SDs>, <colour 1>, <colour 2>))
Example:
x<-rnorm(1000)
histogram <- hist(scale(x), breaks=20 , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < 2, "red", "green"))

I agree with the others that a density plot is more useful than merging colored bars of a histogram, particularly if the group's values are intermixed. This would be very difficult and wouldn't really tell you much. You've got some great suggestions from others on density plots, here's my 2 cents for density plots that I sometimes use:
y <- rnorm(1000, 0, 1)
x <- rnorm(1000, 0.5, 2)
# Group is made a factor explicitly so that levels(DF$Group) works for the legend
DF <- data.frame("Group"=factor(rep(c("y","x"), each=1000)), "Value"=c(y,x))
library(sm)
with(DF, sm.density.compare(Value, Group, xlab="Grouping"))
title(main="Comparative Density Graph")
legend(-9, .4, levels(DF$Group), fill=c("red", "darkgreen"))

Showing multiple axis labels using ggplot2 with facet_wrap in R

I've got a nice facet_wrap density plot that I have created with ggplot2. I would like for each panel to have x and y axis labels instead of only having the y axis labels along the left side and the x labels along the bottom. What I have right now looks like this:
library(ggplot2)
myGroups <- sample(c("Mo", "Larry", "Curly"), 100, replace=T)
myValues <- rnorm(300)
df <- data.frame(myGroups, myValues)
p <- ggplot(df) +
geom_density(aes(myValues), fill = alpha("#335785", .6)) +
facet_wrap(~ myGroups)
p
Which returns:
(source: cerebralmastication.com)
It seems like this should be simple, but my Google Fu has been too poor to find an answer.

You can do this by including the scales="free" option in your facet_wrap call:
myGroups <- sample(c("Mo", "Larry", "Curly"), 100, replace=T)
myValues <- rnorm(300)
df <- data.frame(myGroups, myValues)
p <- ggplot(df) +
geom_density(aes(myValues), fill = alpha("#335785", .6)) +
facet_wrap(~ myGroups, scales="free")
p
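Note that scales="free" also lets every panel choose its own axis range. If you want axes drawn on every panel while keeping the shared scales, recent ggplot2 releases offer an axes argument in facet_wrap. The sketch below assumes ggplot2 3.5.0 or later for that argument and falls back to scales="free" otherwise.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  myGroups = sample(c("Mo", "Larry", "Curly"), 300, replace = TRUE),
  myValues = rnorm(300)
)

# axes = "all" is assumed to be available from ggplot2 3.5.0 onwards;
# on older versions fall back to scales = "free"
if (packageVersion("ggplot2") >= "3.5.0") {
  facets <- facet_wrap(~ myGroups, axes = "all")
} else {
  facets <- facet_wrap(~ myGroups, scales = "free")
}

p <- ggplot(df, aes(myValues)) +
  geom_density(fill = alpha("#335785", 0.6)) +
  facets
p
```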

Short answer: You can't do that. It might make sense with 3 graphs, but what if you had a big lattice of 32 graphs? That would look noisy and bad. GGplot's philosophy is about doing the right thing with a minimum of customization, which means, naturally, that you can't customize things as much as other packages.
Long answer: You could fake it by constructing three separate ggplot objects and combining them. But it's not a very general solution. Here's some code from Hadley's book that assumes you've created ggplot objects a, b, and c. It puts a in the top row, with b and c in the bottom row.
library(grid)
grid.newpage()
pushViewport(viewport(layout=grid.layout(2,2)))
# helper that returns the viewport for a given row/column of the layout
vplayout <- function(x, y)
  viewport(layout.pos.row=x, layout.pos.col=y)
print(a, vp=vplayout(1, 1:2))
print(b, vp=vplayout(2, 1))
print(c, vp=vplayout(2, 2))
