colour single ggplot axis item - r

I have created a chart and want to colour one of the x-axis labels based on a variable. I have seen this post (How to get axis ticks labels with different colors within a single axis for a ggplot graph?), but am struggling to apply it to my dataset.
library(ggplot2)
df1 <- data.frame(var=c("a","b","c","a","b","c","a","b","c"),
val=c(99,120,79,22,43,53,12,27,31),
type=c("alpha","alpha","alpha","bravo","bravo","bravo","charlie","charlie","charlie"))
myvar="a"
ggplot(df1,aes(x=reorder(var,-val), y=val,fill=type)) + geom_bar(stat="identity")
Any tips on how to make the x-axis value red when it is equal to myvar?
Update: Thanks to @ddiez for some guidance. I finally came around to the fact that I would have to reorder prior to plotting. I also should have made my original example with data.table, so I am not sure whether this would have influenced the original responses. I modified my original dataset to be a data.table and used the following code to achieve success.
library(data.table)
library(ggplot2)
df1 <- data.table(var=c("a","b","c","a","b","c","a","b","c"),
val=c(99,120,79,22,43,53,12,27,31),
type=c("alpha","alpha","alpha","bravo","bravo","bravo","charlie","charlie","charlie"))
myvar="a"
df1[,axisColour := ifelse(var==myvar,"red","black")]
df1$var <- reorder(df1$var,-df1$val,sum)
setkey(df1,var,type)
ggplot(df1,aes(x=var, y=val,fill=type)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(colour=df1[,axisColour[1],by=var][,V1]))

There may be a more elegant solution but a quick hack (requires you to know the final order) would be:
ggplot(df1,aes(x=reorder(var,-val), y=val,fill=type)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(colour=c("black","black","red")))
A solution using the variable myvar (yet, there may be better ways):
# reorder the factors in the data.frame (instead of in situ).
df1$var=reorder(df1$var, -df1$val)
# create a vector of colors for each level.
mycol=rep("black", nlevels(df1$var))
names(mycol)=levels(df1$var)
# assign the desired ones a different color.
mycol[myvar]="red"
ggplot(df1,aes(x=var, y=val,fill=type)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(colour=mycol))
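The same idea can be wrapped in a small helper so that the colours always follow the factor's level order, whatever that order turns out to be. This is only a sketch: highlight_colours() is a hypothetical name, not a ggplot2 function, and the plotting step is skipped if ggplot2 is not installed.

```r
# Hypothetical helper (not part of ggplot2): one colour per factor level,
# returned in level order, which is the order of the discrete axis labels.
highlight_colours <- function(f, highlight, on = "red", off = "black") {
  ifelse(levels(as.factor(f)) %in% highlight, on, off)
}

df1 <- data.frame(
  var  = rep(c("a", "b", "c"), 3),
  val  = c(99, 120, 79, 22, 43, 53, 12, 27, 31),
  type = rep(c("alpha", "bravo", "charlie"), each = 3)
)
myvar <- "a"

# Reorder before plotting so the level order is known up front.
df1$var <- reorder(df1$var, -df1$val)
axis_cols <- highlight_colours(df1$var, myvar)

# Build the plot only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df1, aes(x = var, y = val, fill = type)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(colour = axis_cols))
}
```

With this data the levels end up as b, c, a, so axis_cols is black, black, red, the same vector as in the quick hack. Recent ggplot2 versions may warn that vectorised input to element_text() is not officially supported.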

Related

increase distance between stack of geom_line()

I have some diffraction data from XRD. I'd like to plot it all in one chart, but stacked. Because the range of y is quite large, stacking is not so straightforward. There's a link to the data below if you wish to play with it, followed by the simple script.
https://www.dropbox.com/s/b9kyubzncwxge9j/xrd.csv?dl=0
library(reshape2)  # provides melt()
library(ggplot2)
#load it up
xrd <- read.csv("xrd.csv")
#melt it
xrd.m = melt(xrd, id.var="Degrees_2_Theta")
# Reorder so factor levels are grouped together
xrd.m$variable = factor(xrd.m$variable,
levels=sort(unique(as.character(xrd.m$variable))))
names(xrd.m)[names(xrd.m) == "variable"] <- "Sample"
names(xrd.m)[names(xrd.m) == "Degrees_2_Theta"] <- "angle"
#colours use for nearly everything
cbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
#plot
ggplot(xrd.m, aes(angle, value, colour=Sample, group=Sample)) +
geom_line(position = "stack") +
scale_colour_manual(values=cbPalette) +
theme_linedraw() +
theme(legend.position = "none",
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
labs(x="Degrees 2-theta", y="Intensity - stacked for clarity")
Here is the plot; as you can see, it's not quite stacked.
Here is something I made in Excel a while back. Ugly, but slightly better.
I'm not sure that I will actually use R's stacking, because in my experience it always looks off. Instead I might use the same data manipulation I used in Excel.
It seems that you have a different understanding of the result of applying position="stack" on your geom_line() than what actually is happening. What you're looking to do is probably best served by either using faceting or creating a ridgeline plot. I will give you solutions for both of those approaches here with some example data (sorry, I don't click dropbox links and they will eventually break anyway).
What does position="stack" actually do?
With position="stack", the y values of the lines are added, or "stacked", on top of each other in the resulting plot. That means only one of the lines as drawn reflects its actual values in the data; the other is drawn "on top" of it, at the sum of the two. The behavior is best illustrated via an example:
ex <- data.frame(x=c(1,1,2,2,3,3), y=c(1,5,1,2,1,1), grp=rep(c('A','B'),3))
ggplot(ex, aes(x,y, color=grp)) + geom_line()
The y values for "A" are equal to 1 at all values of x, and that is where the line is drawn. This is the default behaviour, the same as specifying position="identity". Now, let's see what happens if we use position="stack":
ggplot(ex, aes(x,y, color=grp)) + geom_line(position="stack")
As you should see, the y value plotted for "B" is still the actual value for "B", whereas the y value plotted for "A" is the value for "A" added to the value for "B". Hope that makes sense.
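To make the arithmetic concrete, here is a small base R sketch of the values the stacked lines are drawn at. It assumes the stacking order described above ("B" at the bottom, "A" on top) and does not call ggplot2 at all.

```r
ex <- data.frame(x = c(1, 1, 2, 2, 3, 3),
                 y = c(1, 5, 1, 2, 1, 1),
                 grp = rep(c("A", "B"), 3))

a <- ex$y[ex$grp == "A"]  # actual values for "A": 1 1 1
b <- ex$y[ex$grp == "B"]  # actual values for "B": 5 2 1

# Assumed stacking order, as described above: "B" is drawn at its own
# values and "A" is drawn at A + B.
stacked <- data.frame(x = unique(ex$x), B = b, A = a + b)
print(stacked)
```

So the "A" line ends up at 6, 3, 2 rather than at 1, 1, 1, which is why a stacked line plot does not look like vertically separated copies of the original lines.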
Faceting
What you're trying to do is take the overlapping lines you have and "separate" them vertically, right? That's not quite stacking, as you likely want to maintain their y values as position="identity" (the default). One way to do that quite easily is to use faceting, which creates what you could call "stacked plots" according to one or two variables in your dataset. In this case, I'm using example data (for reasons outlined above), but you can use this to understand how you want to arrange your own data.
set.seed(1919191)
df <- data.frame(
x=rep(1:100, 5),
y=c(rnorm(100,0,0.1), rnorm(100,0,0.2), rnorm(100,0,0.3), rnorm(100,0,0.4), rnorm(100,0,0.5)),
sample_name=c(rep('A',100), rep('B',100), rep('C',100), rep('D',100), rep('E',100)))
# plot code
p <- ggplot(df, aes(x,y, color=sample_name))
p + geom_line() + facet_grid(sample_name ~ .)
Create a Ridgeline Plot
The other way that kind of does the same thing is to create what is known as a ridgeline plot. You can do this via the package ggridges and here's an example using geom_ridgeline():
library(ggridges)
p + geom_ridgeline(
aes(y=sample_name, height=y),
fill=NA, scale=1, min_height=-Inf)
The idea here is to understand that geom_ridgeline() changes your y axis to be the grouping variable (so we actually have to redefine that in aes()), and the actual y value for each of those groups should be assigned to the height= aesthetic. If you have data that has negative y values (now height= values), you'll also want to set the min_height=, or it will cut them off at 0 by default. You can also change how much each of the groups are separated by playing with scale= (does not always change in the way you think it would, btw).
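If you would rather do the separation by hand, much like the Excel manipulation mentioned in the question, a constant offset per sample gives a similar picture. This is only a sketch on similar example data: the 1.1 spacing factor and the names baseline, y_offset and p_offset are arbitrary choices of mine, and the plotting step is skipped if ggplot2 is not installed.

```r
set.seed(1919191)
df <- data.frame(
  x = rep(1:100, 5),
  y = rnorm(500, 0, rep(c(0.1, 0.2, 0.3, 0.4, 0.5), each = 100)),
  sample_name = rep(c("A", "B", "C", "D", "E"), each = 100)
)

# Hypothetical spacing: a bit more than the full range of y, so that
# neighbouring lines cannot touch.
step <- 1.1 * diff(range(df$y))

# Shift each sample up by a multiple of the spacing (A = 0, B = 1, ...).
df$baseline <- (as.integer(factor(df$sample_name)) - 1) * step
df$y_offset <- df$y + df$baseline

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_offset <- ggplot(df, aes(x, y_offset, colour = sample_name)) +
    geom_line() +
    theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
}
```

The y axis no longer shows real values after the shift, which is why the axis text is hidden, as in the question's own plot.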

Changing datastructure to create correct bar graph in ggplot

I would like to make a graph in R, which I managed to make in excel. It is a bargraph with species on the x-axis and the log number of observations on the y-axis. My current data structure in R is not suitable (I think) to make this graph, but I do not know how to change this (in a smart way).
I have (amongst others) the columns 'camera_site' (site1, site2, ...), 'species' (agouti, paca, ...) and 'count' (1, 2, ...), with about 50,000 observations.
I tried making a data frame with a column 'species' (with 18 species) and a column with 'log(total observation)' for each species (see data frame), but then I can only make a point graph.
This is how I would like the graph to look:
[Image: desired graph made in Excel]
Your data seems to be in the correct format from what I can tell from your screenshot.
The minimum amount of code you would need to get a plot like that would be the following, assuming your data.frame is called df:
ggplot(df, aes(VRM_species, log_obs_count_vrm)) +
geom_col()
Many people intuitively try geom_bar(), but geom_col() is equivalent to geom_bar(stat = "identity"), which you would use if you've pre-computed observations and don't need ggplot to do the counting for you.
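If you are starting from the raw observations rather than a pre-computed table, the summary that geom_col() expects can be built first. The sketch below uses made-up rows with the column names from the question; obs, obs_count and p_col are hypothetical names, and log10 is an assumption on my part (the question only says "log").

```r
# Made-up observations using the column names from the question.
obs <- data.frame(
  camera_site = c("site1", "site1", "site2", "site2", "site3", "site3"),
  species     = c("agouti", "paca", "agouti", "ocelot", "paca", "agouti"),
  count       = c(1, 2, 3, 1, 1, 5)
)

# One row per species with the summed count.
df <- aggregate(count ~ species, data = obs, FUN = sum)
names(df) <- c("VRM_species", "obs_count")

# Assumed base-10 log of the total number of observations.
df$log_obs_count_vrm <- log10(df$obs_count)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p_col <- ggplot(df, aes(VRM_species, log_obs_count_vrm)) +
    geom_col()
}
```

Note that a species with a single observation gets a bar of height 0 on a log scale.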
But you could probably decorate the plot a bit better with some additions:
ggplot(df, aes(VRM_species, log_obs_count_vrm)) +
geom_col() +
scale_x_discrete(name = "Species") +
scale_y_continuous(name = expression("Log"[10]*" Observations"),
expand = c(0,0,0.1,0)) +
theme(axis.text.x = element_text(angle = 90))
Of course, you could customize the theme any way you like.
Regards

compare boxplots with a single value

I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that these variables are too many (about a dozen) to use a single boxplot.
Additionally, the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible, and secondly it got similarly messed up in the faceting. Basically any workaround would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")
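The problem lies only in the use of test_dat$y inside the color aes. Never use $ in aes: ggplot2 then takes the values from the original data frame rather than from the data it has split up for the facets, and the two can get out of step.
Anyway, I think your plot would improve if you used a geom_hline for the benchmark, instead of hacking in a single-value boxplot: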
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")

Adding text to facetted histogram

Using ggplot2 I have made facetted histograms using the following code.
library(ggplot2)
library(plyr)
df1 <- data.frame(monthNo = rep(month.abb[1:5],20),
classifier = c(rep("a",50),rep("b",50)),
values = c(seq(1,10,length.out=50),seq(11,20,length.out=50))
)
means <- ddply (df1,
c(.(monthNo),.(classifier)),
summarize,
Mean=mean(values)
)
ggplot(df1,
aes(x=values, colour=as.factor(classifier))) +
geom_histogram() +
facet_wrap(~monthNo,ncol=1) +
geom_vline(data=means, aes(xintercept=Mean, colour=as.factor(classifier)),
linetype="dashed", size=1)
The vertical line showing means per month is to stay.
But I want to also add text over these vertical lines displaying the mean values for each month. These means are from the 'means' data frame.
I have looked at geom_text and I can add text to plots. But it appears my circumstance is a little different and not so easy. It's a lot simpler to add text in some cases where you just add values of the plotted data points. But cases like this when you want to add the mean and not the value of the histograms I just can't find the solution.
Please help. Thanks.
Having noted the possible duplicate (another answer of mine), the solution here might not be as (initially/intuitively) obvious. You can do what you need if you split the geom_text call into two (for each classifier):
ggplot(df1, aes(x=values, fill=as.factor(classifier))) +
geom_histogram() +
facet_wrap(~monthNo, ncol=1) +
geom_vline(data=means, aes(xintercept=Mean, colour=as.factor(classifier)),
linetype="dashed", size=1) +
geom_text(y=0.5, aes(x=Mean, label=Mean),
data=means[means$classifier=="a",]) +
geom_text(y=0.5, aes(x=Mean, label=Mean),
data=means[means$classifier=="b",])
I'm assuming you can format the numbers to the appropriate precision and place them on the y-axis where you need to with this code.
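For the formatting part, one option is to add a ready-made label column to means and map that instead of the raw Mean. This is a sketch in base R (aggregate() stands in for the ddply() call), and both the one-decimal format and the column name label are arbitrary choices of mine.

```r
df1 <- data.frame(
  monthNo = rep(month.abb[1:5], 20),
  classifier = c(rep("a", 50), rep("b", 50)),
  values = c(seq(1, 10, length.out = 50), seq(11, 20, length.out = 50))
)

# Base R stand-in for the ddply() summary: one mean per month and classifier.
means <- aggregate(values ~ monthNo + classifier, data = df1, FUN = mean)
names(means)[names(means) == "values"] <- "Mean"

# Hypothetical label column, rounded to one decimal place.
means$label <- sprintf("%.1f", means$Mean)
print(means)
```

In the geom_text() calls you would then use aes(x=Mean, label=label) instead of label=Mean.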

Label selected percentage values inside stacked bar plot (ggplot2)

I want to put labels of the percentages on my stacked bar plot. However, I only want to label the largest 3 percentages for each bar. I went through a lot of helpful posts on SO (for example: 1, 2, 3), and here is what I've accomplished so far:
library(ggplot2)
groups<-factor(rep(c("1","2","3","4","5","6","Missing"),4))
site<-c(rep("Site1",7),rep("Site2",7),rep("Site3",7),rep("Site4",7))
counts<-c(7554,6982, 6296,16152,6416,2301,0,
20704,10385,22041,27596,4648, 1325,0,
17200, 11950,11836,12303, 2817,911,1,
2580,2620,2828,2839,507,152,2)
tapply(counts,site,sum)
tot<-c(rep(45701,7),rep(86699,7), rep(57018,7), rep(11528,7))
prop<-sprintf("%.1f%%", counts/tot*100)
data<-data.frame(groups,site,counts,prop)
library(scales)  # provides percent()
ggplot(data, aes(x=site, y=counts, fill=groups)) +
geom_bar(stat="identity") +
geom_text(aes(label=prop), position=position_stack(vjust=1), vjust=1) +
scale_y_continuous(labels = percent)
I wanted to insert my output image here but don't seem to have enough reputation. The code above should be able to produce the plot, though.
So how can I label only the largest 3 percentages on each bar? Also, for the legend, is it possible to change the order of the categories, for example to put "Missing" first? This is not a big issue here, but in my real data set the order of the categories in the legend really bothers me.
I'm new on this site, so if there's anything that's not clear about my question, please let me know and I will fix it. I appreciate any answer/comments! Thank you!
I did this in a sort of hacky manner. It isn't that elegant.
Anyways, I used the plyr package, since the split-apply-combine strategy seemed to be the way to go here.
I recreated your data frame with a variable perc that represents the percentage for each site. Then, for each site, I just kept the 3 largest values for prop and replaced the rest with "".
# I added some variables, and added stringsAsFactors=FALSE
data <- data.frame(groups, site, counts, tot, perc=counts/tot,
prop, stringsAsFactors=FALSE)
# Load plyr
library(plyr)
# Split on the site variable, and keep all the other variables (is there an
# option to keep all variables in the final result?)
data2 <- ddply(data, ~site, summarize,
groups=groups,
counts=counts,
perc=perc,
prop=ifelse(perc %in% sort(perc, decreasing=TRUE)[1:3], prop, ""))
# I changed some of the plotting parameters
library(scales)  # provides percent()
ggplot(data2, aes(x=site, y=perc, fill=groups)) +
geom_bar(stat="identity") +
geom_text(aes(label=prop), position=position_stack(vjust=1), vjust=1) +
scale_y_continuous(labels = percent)
EDIT: Looks like your scales are wrong in your original plotting code. It gave me results with 7500000% on the y axis, which seemed a little off to me...
EDIT: I fixed up the code.
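For what it's worth, the "keep the 3 largest per site" step can also be done in base R with ave(), which keeps every column and breaks ties by position rather than by value (the %in% test above would label more than three bars if two percentages were equal). The sketch below also moves "Missing" to the front of the factor levels, which is one common way to change the legend order. The name rank_in_site is mine, and the totals are computed from the data rather than typed in.

```r
groups <- rep(c("1", "2", "3", "4", "5", "6", "Missing"), 4)
site   <- rep(c("Site1", "Site2", "Site3", "Site4"), each = 7)
counts <- c(7554, 6982, 6296, 16152, 6416, 2301, 0,
            20704, 10385, 22041, 27596, 4648, 1325, 0,
            17200, 11950, 11836, 12303, 2817, 911, 1,
            2580, 2620, 2828, 2839, 507, 152, 2)
data <- data.frame(groups, site, counts, stringsAsFactors = FALSE)

# Share of each group within its site.
data$perc <- data$counts / ave(data$counts, data$site, FUN = sum)

# Rank within each site (1 = largest share); ties are broken by position.
rank_in_site <- ave(-data$perc, data$site,
                    FUN = function(p) rank(p, ties.method = "first"))

# Label only the three largest shares per site.
data$prop <- ifelse(rank_in_site <= 3,
                    sprintf("%.1f%%", 100 * data$perc), "")

# Put "Missing" first; the level order drives the legend order.
data$groups <- factor(data$groups,
                      levels = c("Missing", "1", "2", "3", "4", "5", "6"))
```

The resulting data frame can be passed to the plotting code in place of data2. Changing the level order changes the stacking order of the bars as well as the legend.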
