How to make an overall boxplot alongside factors in R?

I am trying to create a boxplot that shows every level of a factor variable, along with its sample size, and at the end of the plot I also want an overall boxplot that combines all of the values into one. I am using the following code, which does everything except the overall box:
library(ggplot2)
library(plyr)
xlabels <- ddply(extract8, .(Fuel), summarize, xlabels = paste(unique(Fuel), '\n(n = ', length(Fuel),')'))
ggplot(extract8, aes(x = Fuel, y = Exfiltration.Fraction.Percentage))+geom_boxplot()+
stat_boxplot(geom='errorbar', linetype=1) +
geom_boxplot(fill="pink") + geom_hline(yintercept = 0.4) +
scale_x_discrete(labels = xlabels[['xlabels']]) + ggtitle("Exfiltration Fraction (%) by Fuel Type")
Not sure on how to proceed regarding adding a boxplot that combines all of the factors into one.

This is certainly not the most elegant way to solve it, but it works:
Copy your dataset into a new object.
Within the new object, replace the content of the variable containing the factors with the label you would like, for instance, "Total".
Use rbind to stack the old and new objects and assign the result to the new object.
In the ggplot call, replace the old object with the new object.
I had the same issue, couldn't find an answer and proceeded this way.
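The steps above can be sketched as follows. The toy extract8 built here is a hypothetical stand-in for the asker's data (only the two column names come from the question), and the explicit factor() call is an addition of this sketch so that "Total" is drawn last.

```r
library(ggplot2)

# Hypothetical stand-in for the asker's extract8 data frame
set.seed(1)
extract8 <- data.frame(
  Fuel = rep(c("Coal", "Gas", "Wood"), times = c(5, 10, 15)),
  Exfiltration.Fraction.Percentage = runif(30)
)

# Steps 1-2: copy the data and overwrite the grouping variable
overall <- extract8
overall$Fuel <- "Total"

# Step 3: stack the original rows and the relabelled copy
combined <- rbind(extract8, overall)

# Put "Total" last so its box sits at the right-hand end of the axis
combined$Fuel <- factor(combined$Fuel,
                        levels = c(sort(unique(extract8$Fuel)), "Total"))

# Sample size per box, in level order, for the axis labels
n_per_level <- table(combined$Fuel)
xlabels <- paste0(names(n_per_level), "\n(n = ", n_per_level, ")")

# Step 4: plot the combined object instead of the original one
p <- ggplot(combined, aes(x = Fuel, y = Exfiltration.Fraction.Percentage)) +
  stat_boxplot(geom = "errorbar") +
  geom_boxplot(fill = "pink") +
  scale_x_discrete(labels = xlabels)
p
```

Because every row appears twice in combined (once under its own fuel and once under "Total"), the object should only be used for this plot, not for further summaries.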

Related

How does one control the appearance (e.g. line size, line type, colour) of mqgam plots produced using plot.mgamViz from the "mgcViz" package?

I am using quantile regression in R with the qgam package and visualising them using the mgcViz package, but I am struggling to understand how to control the appearance of the plots. The package effectively turns gams (in my case mqgams) into ggplots.
Simple reprex:
egfit <- mqgam(data = iris,
Sepal.Length ~ s(Petal.Length),
qu = c(0.25,0.5,0.75))
plot.mgamViz(getViz(egfit))
I am able to control things that can be added, for example the axis labels and theme of the plot, but I'm struggling to affect things that would normally be addressed in the aes() or geom_x() functions.
How would I control the thickness of the line? If this were a normal geom_smooth() or geom_line() I'd simply put size = 1 inside of the geoms, but I cannot see how I'd do so here.
How can I control the linetype of these lines? The "id" is continuous and one cannot supply a linetype to a continuous scale. If this were a normal plot I would convert "id" to a character, but I can't see a way of doing so with the plot.mgamViz function.
How can I supply a new colour scale? It seems as though if I provide it with a new colour scale it invents new ID values to put on the legend that don't correlate to the actual "id" values, e.g.
plot.mgamViz(getViz(egfit)) + scale_colour_viridis_c()
I fully expect this to be relatively simple and that I'm missing something obvious, and I imagine the answers to all three of these subquestions are very similar to one another. Thanks in advance.
You need to extract your ggplot element using this:
p1 <- plot.mgamViz(getViz(egfit))
p <- p1$plots[[1]]$ggObj
Then, id should be as.factor:
p$data$id <- as.factor(p$data$id)
Now you can play with ggplot elements as you prefer:
library(mgcViz)
egfit <- mqgam(data = iris,
Sepal.Length ~ s(Petal.Length),
qu = c(0.25,0.5,0.75))
p1 <- plot.mgamViz(getViz(egfit))
# Taking gg infos and convert id to factor
p <- p1$plots[[1]]$ggObj
p$data$id <- as.factor(p$data$id)
# Changing ggplot attributes
p <- p +
geom_line(linetype = 3, size = 1)+
scale_color_brewer(palette = "Set1")+
labs(x="Petal Length", y="s(Petal Length)", color = "My ID labels:")+
theme_classic(14)+
theme(legend.position = "bottom")
p
Here is the generated plot: [figure not shown]
Hope it is useful!

compare boxplots with a single value

I want to compare the distribution of several variables (here X1 and X2) with a single value (here bm). The issue is that these variables are too many (about a dozen) to use a single boxplot.
Additionally, the levels are too different to use one plot. I need to use facets to make things more organised:
However with this plot my benchmark category (bm), which is a single value in X1 and X2, does not appear in X1 and seems to have several values in X2. I want it to be only this green line, which it is in the first plot. Any ideas why it changes? Is there any good workaround? I tried the options of facet_wrap/facet_grid, but nothing there delivered the right result.
I also tried combining a bar plot with bm and three empty categories with the boxplot. But firstly it looked terrible and secondly it got similarly screwed up in the facetting. Basically any workaround would help.
Below the code to create the minimal example displayed here:
# Creating some sample data & loading libraries
library(ggplot2)
library(RColorBrewer)
set.seed(10111)
x=matrix(rnorm(40),20,2)
y=rep(c(-1,1),c(10,10))
x[y==1,]=x[y==1,]+1
x[,2]=x[,2]+20
df=data.frame(x,y)
# creating a benchmark point
benchmark=data.frame(y=rep("bm",2),key=c("X1","X2"),value=c(-0.216936,20.526312))
# melting the data frame, rbinding it with the benchmark
test_dat=rbind(tidyr::gather(df,key,value,-y),benchmark)
# Creating a plot
p_box <- ggplot(data = test_dat, aes(x=key, y=value,color=as.factor(test_dat$y))) +
geom_boxplot() + scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1"))
# The first line delivers the first plot, the second line the second plot
p_box
p_box + facet_wrap(~key,scales = "free",drop = FALSE) + theme(legend.position = "bottom")
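The problem lies only in the use of test_dat$y inside the color aes. Never use $ in aes(): it refers to the whole column instead of the data ggplot has split for each facet, so the values get mismatched. Use the bare column name (y) instead.
Anyway, I think your plot would improve if you used a geom_hline for the benchmark, instead of hacking in a single-value boxplot: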
library(ggplot2)
library(RColorBrewer)
ggplot(tidyr::gather(df,key,value,-y)) +
geom_boxplot(aes(x=key, y=value, color=as.factor(y))) +
geom_hline(data = benchmark, aes(yintercept = value), color = '#4DAF4A', size = 1) +
scale_color_manual(name="Cluster",values=brewer.pal(8,"Set1")) +
facet_wrap(~key,scales = "free",drop = FALSE) +
theme(legend.position = "bottom")
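If the single-value "bm" box is still wanted, a minimal sketch of the first point is to keep the original plot and map the column by its bare name. The long data frame is rebuilt in base R here rather than with tidyr::gather; that substitution is an assumption of this sketch, made only to keep it self-contained.

```r
library(ggplot2)

# Same sample data as in the question
set.seed(10111)
x <- matrix(rnorm(40), 20, 2)
y <- rep(c(-1, 1), c(10, 10))
x[y == 1, ] <- x[y == 1, ] + 1
x[, 2] <- x[, 2] + 20

# Long format built by hand: one row per (observation, variable)
long <- data.frame(
  y     = as.character(rep(y, 2)),
  key   = rep(c("X1", "X2"), each = 20),
  value = c(x[, 1], x[, 2])
)
benchmark <- data.frame(y = rep("bm", 2), key = c("X1", "X2"),
                        value = c(-0.216936, 20.526312))
test_dat <- rbind(long, benchmark)

# Bare column name inside aes(), no test_dat$y
p <- ggplot(test_dat, aes(x = key, y = value, colour = factor(y))) +
  geom_boxplot() +
  facet_wrap(~key, scales = "free") +
  theme(legend.position = "bottom")
p
```

With the bare name, ggplot evaluates y within each facet's subset of the data, so "bm" should show up as a single flat line in both panels.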

Plotting dataframes in the same ggplot with for-loop in a function

I have a bunch of dataframes and I want to plot 2 columns of each dataframe on the same ggplot. I already have a plot from another function, coloured in blue and red and I want the new ones to be added to it. Although the way I'm trying works on the console, I can't get to save the function, call it and have it work. The error I get is :
Discrete value supplied to continuous scale.
So, the dataframes are in my environment and named BEFMORN1 to BEFMORN9. The initial plot is test_plot.
The first part that gives me the test_plot works.
test_plot<-ggplot()+geom_point(data=yy4, aes(x=Time, y=Dist), colour="red")+geom_point(data=zz4, aes(x=Time, y=Dist), colour="blue")
test_plot<-test_plot+scale_x_continuous(name="Time (Seconds from the beginning)")
test_plot<-test_plot+scale_y_continuous(name="Distance (Metres from the beginning)")
The second part will be the new function
plot_all_runs<-function(r,test_plot) {
for (i in 1:(length(r[[1]]))) {
z<-as.data.frame(mget(ls(pattern=paste0("BEFMORN",i))))
test_plot2<-test_plot+geom_point(data=z, aes_string(x=names(z)[12], y=names(z)[17]))
}
print(test_plot2)
}
r is a list of 6 lists of different dataframes, so BEFMORN came from r[[1]]. BEFNOON will come from r[[2]] etc. So my plan is to have 6 identical functions with different arguments in paste0.
I'm using aes_string(x=names(z)[12] because the data frames z will have different column names in each iteration.
Does someone understand why I'm getting an error? I have played around with the scales (removing them from the initial plot or adding them again in the next one) but no improvement.
EDIT:
All columns to be plotted have been transformed to numeric. Others are factors and integers.
EXAMPLE
BEFMORN1<-data.frame(BEFMORN1.Time=seq(0, 10, 0.5), BEFMORN1.Dist=1:21)
BEFMORN2<-data.frame(BEFMORN2.Time=seq(0, 13, 0.5), BEFMORN2.Dist=c(1:8,8,8,9,10,13,13,13,13.5,14,14,14,14:21))
yy4<-data.frame(Time=seq(0, 10, 0.5), Dist=c(1:8,8,8,9,10,13,14:21))
zz4<-data.frame(Time=seq(0, 12, 0.5), Dist=c(1:8,8,8,9,9.5,10,10.5,12,12.5,13,14:21))
test_plot<-ggplot()+geom_point(data=yy4, aes(x=Time, y=Dist), colour="red")+geom_point(data=zz4, aes(x=Time, y=Dist), colour="blue")
plot_all_runs<-function(test_plot) {
for (i in 1:9) {
z<-as.data.frame(mget(ls(pattern=paste0("BEFMORN",i))))
test_plot2<-test_plot+geom_point(data=z, aes_string(x=names(z)[12], y=names(z)[17]))
}
print(test_plot2)
}
An example of generating the long format that @biomiha and @joran suggested:
library(ggplot2)
BEFMORN1<-data.frame(Time=seq(0,10, 0.5)
, Dist=1:21, Group = "BEFMORN1")
BEFMORN2<-data.frame(Time=seq(0,13, 0.5)
, Dist=c(1:8,8,8,9,10,13,13,13,13.5,14,14,14,14:21)
, Group = "BEFMORN2")
yy4<-data.frame(Time=seq(0,10, 0.5)
, Dist=c(1:8,8,8,9,10,13,14:21)
, Group = "yy4")
zz4<-data.frame(Time=seq(0,12, 0.5)
, Dist=c(1:8,8,8,9,9.5,10,10.5,12,12.5,13,14:21)
, Group = "zz4")
allData <-
rbind(BEFMORN1, BEFMORN2, yy4, zz4)
ggplot(allData
, aes(x = Time
, y = Dist
, col = Group)) +
geom_point()
Note that if your data are already in place, adding a "Group" column may need to be done with a bit more care. However, the general principle is the same. If you want, you can use any of the scale_color_* functions to change the default colors, including scale_color_manual if you want to set them yourself.
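When the data frames already exist with differing column names, one way to add the "Group" column is to pick the two columns by position and rename them while stacking. This is only a sketch: stack_runs, its arguments and the runs list are hypothetical names introduced here, and the two toy data frames reuse the values from the example above.

```r
library(ggplot2)

# Hypothetical runs whose column names carry the data-frame name as a prefix
runs <- list(
  BEFMORN1 = data.frame(BEFMORN1.Time = seq(0, 10, 0.5),
                        BEFMORN1.Dist = 1:21),
  BEFMORN2 = data.frame(BEFMORN2.Time = seq(0, 13, 0.5),
                        BEFMORN2.Dist = c(1:8, 8, 8, 9, 10, 13, 13, 13,
                                          13.5, 14, 14, 14, 14:21))
)

# Stack a named list of data frames into long format.
# time_col / dist_col are column positions, since the names differ per run.
stack_runs <- function(runs, time_col = 1, dist_col = 2) {
  pieces <- lapply(names(runs), function(nm) {
    d <- runs[[nm]]
    data.frame(Time = d[[time_col]], Dist = d[[dist_col]], Group = nm)
  })
  do.call(rbind, pieces)
}

allRuns <- stack_runs(runs)

ggplot(allRuns, aes(x = Time, y = Dist, colour = Group)) +
  geom_point()
```

For the data frames described in the question, the call would presumably be stack_runs(runs, time_col = 12, dist_col = 17), matching the positions used there with aes_string. Passing the data frames in as a list also avoids looking them up by name with ls() and mget() from inside a function.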

Manually added legend not working in ggplot2?

Here's a facsimile of my data:
d1 <- data.frame(
e=rnorm(3000,10,10)
)
d2 <- data.frame(
e=rnorm(2000,30,30)
)
So, I got around the problem of plotting two different density distributions from two very different datasets on the same graph by doing this:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2)
But when I try to manually add a legend, like so:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2) +
scale_fill_manual(name="Data", values = c("XXXXX" = "red","YYYYY" = "blue"))
Nothing happens. Does anybody know what's going wrong? I thought I could actually manually add legends if need be.
Generally ggplot works best when your data is in a single data.frame and in long format. In your case we therefore want to combine the data from both data.frames. For this simple example, we just concatenate the data into a long variable called d and use an additional column id to indicate to which dataset that value belongs.
d.f <- data.frame(id = rep(c("XXXXX", "YYYYY"), c(3000, 2000)),
d = c(d1$e, d2$e))
More complex data manipulations can be done using packages such as reshape2 and tidyr. I find this cheat sheet often useful. Then when we plot we map fill to id, and ggplot will take care of the legend automatically.
ggplot(d.f, aes(x = d, fill = id)) +
geom_density()
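As for why the original attempt showed no legend: a scale such as scale_fill_manual only acts on aesthetics that are mapped inside aes(), and fill="red" outside aes() is a fixed setting, not a mapping. If the two data frames have to stay separate, a possible sketch is to map fill to a constant label in each layer; the alpha value is an addition of this sketch so that the overlapping densities stay visible.

```r
library(ggplot2)

set.seed(42)
d1 <- data.frame(e = rnorm(3000, 10, 10))
d2 <- data.frame(e = rnorm(2000, 30, 30))

# fill is mapped (inside aes) to a label; the manual scale turns labels into colours
p <- ggplot() +
  geom_density(aes(x = e, fill = "XXXXX"), data = d1, alpha = 0.5) +
  geom_density(aes(x = e, fill = "YYYYY"), data = d2, alpha = 0.5) +
  scale_fill_manual(name = "Data",
                    values = c("XXXXX" = "red", "YYYYY" = "blue"))
p
```

The long-format approach above is still the more idiomatic one; this variant is mainly useful when combining the data is inconvenient.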

ggplot2 - labeling maxima and minima in a facet wrap

I want to track the seven observations' performances on seven assessments using sparklines, so I thought that I could just melt my data frame then do a facet wrap by observation in ggplot. Now, I really need to label the maxima and minima for each facet. Is this possible in my current setup or do I need to graph each facet separately and add on indicators via geom_annotate? I'm sorry if this is a very rookie question. I am very new to R.
ggplot(test,aes(x=variable,y=value,group=1))+
facet_wrap("student",nrow=7)+
geom_point()+
geom_line()+
mytheme
It is possible. You have to add the max to your data frame, or make a new one to hold them and then add it to the plot. Can't reproduce, so the following code may contain errors, but something like this:
maxd <- aggregate(test$value, list(student = test$student), max)
names(maxd)[length(maxd)] <- "maxvalue"
ggplot(..) + ... +
geom_text(data = maxd, aes(label = maxvalue, x = X0, y = Y0))
#substitute X0, Y0 with your desired position of text
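A fuller sketch of the same idea, on made-up data, keeps the whole row of each student's maximum and minimum so that the labels inherit their x and y positions from the data. The test data frame below is hypothetical (only the column names student, variable and value come from the question), and mytheme is left out because it is not defined in the post.

```r
library(ggplot2)

# Hypothetical melted data: 7 students x 7 assessments
set.seed(7)
test <- expand.grid(student = paste0("S", 1:7),
                    variable = paste0("A", 1:7))
test$value <- round(runif(nrow(test), 50, 100))

# For each student keep the row holding the maximum and the row holding the minimum
extremes <- do.call(rbind, lapply(split(test, test$student), function(d) {
  rbind(d[which.max(d$value), ], d[which.min(d$value), ])
}))

# The label layer uses the small data frame; x, y and group come from the main aes()
p <- ggplot(test, aes(x = variable, y = value, group = 1)) +
  facet_wrap("student", nrow = 7) +
  geom_point() +
  geom_line() +
  geom_text(data = extremes, aes(label = value), vjust = -0.5)
p
```

Because extremes contains the student column, facet_wrap places each label in the matching panel. If a student has tied values, which.max and which.min label only the first occurrence.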
