Is it possible to use multiple x-variables for a faceted ggplot boxplot? I am using facets in ggplot to stratify my analysis but would like to have two different classifications of price on the x-axis depending on the colour (E or F). I've made a test example using the diamonds data set where I have two different price classifications but can only figure out how to apply one at a time:
But I'd like to have a plot that considers both price classifications, depending on colour:
I know I could get a similar result using grobs, and probably by assigning the price category conditionally depending on the colour (E or F), but that seems a bit cumbersome. So for simplicity, I'd like to do this using facets. Is that possible, and if so, how?
library(ggplot2)
dat <- diamonds
# price grouping I
dat$priceI <- cut(diamonds$price,
breaks = c(0,5000,10000,Inf),
labels = c("0-4,999","5,000-9,999",">=10,000"),
right = FALSE)
# price grouping II
dat$priceII <- cut(diamonds$price,
breaks = c(0,1000,5000,10000,Inf),
labels = c("0-999", "1,000-4,999","5,000-9,999",">=10,000"),
right = FALSE)
ggplot(dat[(dat$color=="E" | dat$color=="F") &
(dat$cut=="Fair" | dat$cut=="Good"),],
aes(priceI,depth)) +
geom_boxplot() +
facet_grid(cut ~ color) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
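The conditional-assignment idea mentioned in the question can stay inside facets. Below is a minimal sketch, not a definitive answer. It assumes colour E should use price grouping I and colour F should use grouping II (the question does not say which goes with which). The column name priceGrp is made up for illustration. With scales = "free_x", each facet column should show only the price categories that occur in it.

```r
library(ggplot2)

# Keep only the colours and cuts used in the question
dat <- subset(diamonds, color %in% c("E", "F") & cut %in% c("Fair", "Good"))

# The two price groupings from the question
dat$priceI <- cut(dat$price,
                  breaks = c(0, 5000, 10000, Inf),
                  labels = c("0-4,999", "5,000-9,999", ">=10,000"),
                  right = FALSE)
dat$priceII <- cut(dat$price,
                   breaks = c(0, 1000, 5000, 10000, Inf),
                   labels = c("0-999", "1,000-4,999", "5,000-9,999", ">=10,000"),
                   right = FALSE)

# Assumption: E uses grouping I, F uses grouping II.
# Convert to character first so ifelse() returns labels, not integer codes.
dat$priceGrp <- ifelse(dat$color == "E",
                       as.character(dat$priceI),
                       as.character(dat$priceII))

# Fix the display order of the combined categories
dat$priceGrp <- factor(dat$priceGrp,
                       levels = c("0-999", "1,000-4,999", "0-4,999",
                                  "5,000-9,999", ">=10,000"))

p <- ggplot(dat, aes(priceGrp, depth)) +
  geom_boxplot() +
  facet_grid(cut ~ color, scales = "free_x") +  # x-axis may differ per colour
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
p
```

The trade-off is one extra column in the data, but the plot remains a single faceted ggplot with no grob manipulation.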
I am trying to create a plot which includes multiple geom_smooth trendlines within one plot. My current code is as follows:
png(filename="D:/Users/...", width = 10, height = 8, units = 'in', res = 300)
ggplot(Data) +
geom_smooth(aes(BA,BAgp1),colour="red",fill="red") +
geom_smooth(aes(BA,BAgp2),colour="turquoise",fill="turquoise") +
geom_smooth(aes(BA,BAgp3),colour="orange",fill="orange") +
xlab(bquote('Tree Basal Area ('~cm^2~')')) +
ylab(bquote('Predicted Basal Area Growth ('~cm^2~')')) +
labs(title = expression(paste("Other Softwoods")), subtitle = "Tree Level Basal Area Growth") +
theme_bw()
dev.off()
Which yields the following plot:
The issue is I can't for the life of me include a simple legend where I can label what each trendline represents. The dataset is quite large; if it would be valuable in identifying a solution, I will post it externally to Stack Overflow.
Your data is in wide format, like a matrix. ggplot builds its legends from aesthetics mapped inside aes(), so the easiest route is to transform your current data to long format. I simulated 3 curves like yours. If you call geom_line or geom_smooth with a variable ("name" in the example below) that separates your different values, it will work and produce a legend.
library(dplyr)
library(tidyr)
library(ggplot2)
X = 1:50
#simulate data
Data = data.frame(
BA=X,
BAgp1 = log(X)+rnorm(length(X),0,0.3),
BAgp2 = log(X)+rnorm(length(X),0,0.3) + 0.5,
BAgp3 = log(X)+rnorm(length(X),0,0.3) + 1)
# convert this to long format, use BA as id
Data <- Data %>% pivot_longer(-BA)
#define colors
COLS = c("red","turquoise","orange")
names(COLS) = c("BAgp1","BAgp2","BAgp3")
###
ggplot(Data) +
geom_smooth(aes(BA,value,colour=name,fill=name)) +
# change name of legend here
scale_fill_manual(name="group",values=COLS)+
scale_color_manual(name="group",values=COLS)
I would like to plot four-dimensional data in ggplot2 in the form of bars and points. The dimensions are country, year, index and variable; for each combination, I have an observed value which I would like to represent on the graph.
I basically want to split my database into two based on the year: the oldest observations get represented by bars (per country, index and variable), and the most recent data is overlaid as points (again per country, index and variable). I need support to add the points (i.e., the recent data) in ggplot2.
Graphically, I would like to get the following graph, where I would like to add the circles (which I added manually). This is the final graph I would like to get:
Illustration with a reproducible example
creating the data
library(dplyr)
library(ggplot2)
country<-c('A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C')
year<-c('2000','2000','2000','2000',"2005","2005","2005","2005","2010","2010","2010","2010","2002","2002","2002","2002","2008","2008","2008","2008")
index<-c("1","2","1","2","1","2","1","2","1","2","1","2","1","2","1","2","1","2","1","2")
variable<-c("var1", "var1","var2", "var2","var1", "var1","var2", "var2","var1", "var1","var2", "var2","var1", "var1","var2", "var2","var1", "var1","var2", "var2")
value<-runif(20)
data<-as.data.frame(cbind(country,year,index,variable,value))
data$value<-as.numeric(data$value)
data$ct_year<-paste0(data$country,data$year) # this is used to subset between old and recent data
creating the sub-datasets
dataset 1 contains, for each country, the oldest data = this data will appear as bars
dataset 2 contains, if available, the most recent data = this is the data that I would like to appear as points on top of my graph.
sel<-c("A2000","B2005","C2002")
sel2<-c("B2010","C2008")
data1<-filter(data, ct_year %in% sel)
data2<-filter(data, ct_year %in% sel2)
creating the base graph
This is the code that leads to the base graph used in the picture above:
p<-ggplot(data1,aes(country, value ,fill=variable, alpha = index )) +
geom_bar(stat = "identity", position = "dodge" )
the issue I need to solve :
Now I would like to add the values that are stored in data2 as points on top of my base graph. (In other words, for each country I would like to superimpose, as points, the recent-year values of the different variables split by index.) Note that country A does not have any data in data2, so only countries B and C will have points appearing on the graph.
Any leads on how I could do this?
A great thanks for your support!
You could try the following.
p + geom_point(data = data2,
aes(x = country,
y = value,
col = variable,
shape = index),
size = 5,
stroke = 2,
position = position_dodge(width = 0.9),
inherit.aes = FALSE) +
scale_color_manual(values = c(var1 = "black",
var2 = "black")) +
scale_shape_manual(values = c(21, 21)) +
guides(col = "none",
shape = "none")
The plot differs from what you have posted because you use value<-runif(20) without setting a seed. For this particular example I used set.seed(1).
I am trying to simply add a legend to my Nyquist plot where I am plotting 2 sets of data: 1 is an experimental set (~600 points), and 2 is a data frame calculated using a transfer function (~1000 points)
I need to plot both and label them. Currently I have them both plotted okay, but when I try to add the label using scale_colour_manual no label appears. Also, a way to move this label around would be appreciated! Code below.
pdf("nyq_2elc.pdf")
nq2 <- ggplot() + geom_point(data = treat, aes(treat$V1,treat$V2), color = "red") +
geom_point(data = circuit, aes(circuit$realTF,circuit$V2), color = "blue") +
xlab("Real Z") + ylab("-Imaginary Z") +
scale_colour_manual(name = 'hell0',
values =c('red'='red','blue'='blue'), labels = c('Treatment','EQ')) +
ggtitle("Nyquist Plot and Equivilent Circuit for 2 Electrode Treatment Setup at 0 Minutes") +
xlim(0,700) + ylim(0,700)
print(nq2)
dev.off()
ggplot2 works best with long data frames, so I would combine the datasets like this:
treat$Cat <- "treat"
circuit$Cat <- "circuit"
CombData <- data.frame(rbind(treat, circuit))
ggplot(CombData, aes(x=V1, y=V2, col=Cat))+geom_point()
This should give you the legend you want.
You probably have to change the names/order of the columns of dataframes treat and circuit so they can be combined, but it's hard to tell because you're not giving us a reproducible example.
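Since the real data was not posted, here is a minimal sketch with simulated stand-ins for treat and circuit. The column names V1, V2 and realTF are taken from the question's code; the values, the legend title "Data set" and the choice of "bottom" for the legend position are assumptions for illustration. It shows the renaming step needed before rbind(), and one way to move the legend using theme(legend.position = ...).

```r
library(ggplot2)

# Simulated stand-ins for the two data sets (values are made up)
set.seed(1)
treat   <- data.frame(V1 = runif(600, 0, 700), V2 = runif(600, 0, 700))
circuit <- data.frame(realTF = runif(1000, 0, 700), V2 = runif(1000, 0, 700))

# Align the column names so rbind() can stack the two data frames
names(circuit)[names(circuit) == "realTF"] <- "V1"

# Label each row with the data set it came from
treat$Cat   <- "Treatment"
circuit$Cat <- "EQ"
CombData <- rbind(treat, circuit)

nq2 <- ggplot(CombData, aes(x = V1, y = V2, colour = Cat)) +
  geom_point() +
  # colour is mapped inside aes(), so scale_colour_manual() now has a legend to style
  scale_colour_manual(name = "Data set",
                      values = c(Treatment = "red", EQ = "blue")) +
  labs(x = "Real Z", y = "-Imaginary Z") +
  xlim(0, 700) + ylim(0, 700) +
  theme(legend.position = "bottom")  # also "top", "left", "right" or "none"
nq2
```

In the original code the colours were set outside aes(), so there was no colour mapping for scale_colour_manual to act on, which is why no legend appeared.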
I want to plot a ggplot2 boxplot using all columns of a data.frame, and I want to reorder the columns by the median for each column, rotate the x-axis labels, and fill each box with the colour corresponding to the same median. I can't figure out how to do the last part. There are plenty of examples where the fill colour corresponds to a factor variable, but I haven't seen a clear example of using a continuous variable to control fill colour. (The reason I'm trying to do this is that the resultant plot will provide context for a force-directed network graph with nodes that will be colour-coded in the same way as the boxplot -- the colour will then provide a mapping between the two plots.) It would be nice if I could re-use the value-to-colour mapping for later plots so that colours are consistent between plots. So, for example, the box corresponding to the column variable with a high median value will have a colour that denotes this mapping and matches perfectly the colour for the same column variable in other plots (such as the corresponding node in a force-directed network graph).
So far, I have something like this:
# Melt the data.frame:
DT.m <- melt(results, id.vars = NULL) # using reshape2
# I can now make a boxplot for every column in the data.frame:
g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
stat_summary(fun.y=mean, colour="darkred", geom="point") +
geom_boxplot(???, alpha=0.5)
The colour fill information is what I'm stuck on. "value" is a continuous variable in the range [0,1] and there are 55 columns in my data.frame. Various approaches I've tried seem to result in the boxes being split vertically down the middle, and I haven't got any further. Any ideas?
You can do this by adding the median-by-group to your data frame and then mapping the new median variable to the fill aesthetic. Here's an example with the built-in mtcars data frame. By using this same mapping across different plots, you should get the same colors:
library(ggplot2)
library(dplyr)
ggplot(mtcars %>% group_by(carb) %>%
mutate(medMPG = median(mpg)),
aes(x = reorder(carb, mpg, FUN=median), y = mpg)) +
geom_boxplot(aes(fill=medMPG)) +
stat_summary(fun=mean, colour="darkred", geom="point") + # fun.y is deprecated since ggplot2 3.3.0
scale_fill_gradient(low=hcl(15,100,75), high=hcl(195,100,75))
If you have various data frames with different ranges of medians, you can still use the method above, but to get a consistent mapping of color to median across all your plots, you'll need to also set the same limits for scale_fill_gradient in each plot. In this example, the median of mpg (by carb grouping) varies from 15.0 to 22.8. But let's say across all my data sets, it varies from 13.3 to 39.8. Then I could add this to all my plots:
scale_fill_gradient(limits=c(13.3, 39.8),
low=hcl(15,100,75), high=hcl(195,100,75))
This is just for illustration. For ease of maintenance if your data might change, you'll want to set the actual limits programmatically.
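One way to set the limits programmatically is sketched below. This is an illustration under assumptions, not the only approach: mtcars split by am stands in for "various data frames", and group_medians, fill_limits and shared_fill are hypothetical names. The idea is to take the range of all group medians across every data frame and reuse a single scale object in each plot.

```r
library(ggplot2)

# Stand-ins for several data sets: mtcars split by transmission type
dfs <- split(mtcars, mtcars$am)

# Hypothetical helper: median of `value` within each level of `group`
group_medians <- function(df, group, value) {
  tapply(df[[value]], df[[group]], median)
}

# Shared limits: range of all group medians across every data frame
all_medians <- unlist(lapply(dfs, group_medians, group = "carb", value = "mpg"))
fill_limits <- range(all_medians)

# One scale definition, reused in every plot so colours match across plots
shared_fill <- scale_fill_gradient(limits = fill_limits,
                                   low = hcl(15, 100, 75),
                                   high = hcl(195, 100, 75))

plots <- lapply(dfs, function(df) {
  # Attach the per-group median to every row
  df$medMPG <- ave(df$mpg, df$carb, FUN = median)
  ggplot(df, aes(x = reorder(carb, mpg, FUN = median), y = mpg)) +
    geom_boxplot(aes(fill = medMPG)) +
    shared_fill
})
```

If the data changes, fill_limits is recomputed on the next run, so the hard-coded numbers never go stale.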
I built on eipi10's solution and obtained the following code which does what I want:
# "results" is a 55-column data.frame containing
# bootstrapped estimates of the Gini impurity for each column variable
# (But can synthesize fake data for testing with a bunch of rnorms)
DT.m <- melt(results, id.vars = NULL) # using reshape2
g <- ggplot(DT.m %>% group_by(variable) %>%
mutate(median.gini = median(value)),
aes(x = reorder(variable, value, FUN=median), y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_boxplot(aes(fill=median.gini)) +
stat_summary(fun=mean, colour="darkred", geom="point") + # fun.y is deprecated since ggplot2 3.3.0
scale_fill_gradientn(colours = heat.colors(9)) +
ylab("Gini impurity") +
xlab("Feature") +
guides(fill=guide_colourbar(title="Median\nGini\nimpurity"))
plot(g)
Later, for the second plot:
medians <- lapply(results, median)
color <- colorRampPalette(colors =
heat.colors(9))(1000)[cut(unlist(medians), 1000, labels = FALSE)]
color is then a character vector containing the colours of the nodes in my subsequent network graph, and these colours match those in the boxplot. Job done!
If I have a dataframe like this:
obs<-rnorm(20)
d<-data.frame(year=2000:2019,obs=obs,pred=obs+rnorm(20,.1))
d$pup<-d$pred+.5
d$plow<-d$pred-.5
d$obs[20]<-NA
d
And I want the observation and model prediction error bars to look something like:
(p1<-ggplot(data=d)+aes(x=year)
+geom_point(aes(y=obs),color='red',shape=19)
+geom_point(aes(y=pred),color='blue',shape=3)
+geom_errorbar(aes(ymin=plow,ymax=pup))
)
How do I add a legend/scale/key identifying the red points as observations and the blue plusses with error bars as point predictions with ranges?
Here is one solution, melting pred/obs into one column.
library(ggplot2)
obs <- rnorm(20)
d <- data.frame(dat=c(obs,obs+rnorm(20,.1)))
d$pup <- d$dat+.5
d$plow <- d$dat-.5
d$year <- rep(2000:2019,2)
d$lab <- c(rep("Obs", 20), rep("Pred", 20))
p1<-ggplot(data=d, aes(x=year)) +
geom_point(aes(y = dat, colour = factor(lab), shape = factor(lab))) +
geom_errorbar(data = d[21:40,], aes(ymin=plow,ymax=pup), colour = "blue") +
scale_shape_manual(name = "Legend Title", values=c(6,1)) +
scale_colour_manual(name = "Legend Title", values=c("red", "blue"))
p1
Here is a ggplot solution that does not require melting and grouping.
set.seed(1) # for reproducible example
obs <- rnorm(20)
d <- data.frame(year=2000:2019,obs,pred=obs+rnorm(20,.1))
d$obs[20]<-NA
library(ggplot2)
ggplot(d,aes(x=year))+
geom_point(aes(y=obs,color="obs",shape="obs"))+
geom_point(aes(y=pred,color="pred",shape="pred"))+
geom_errorbar(aes(ymin=pred-0.5,ymax=pred+0.5))+
scale_color_manual("Legend",values=c(obs="red",pred="blue"))+
scale_shape_manual("Legend",values=c(obs=19,pred=3))
This creates a color scale and a shape scale with two components each ("obs" and "pred"), then uses scale_*_manual(...) to set the values for those scales: ("red","blue") for color and (19,3) for shape.
Generally, if you have only two categories, like "obs" and "pred", then this is a reasonable way to use ggplot, and it avoids merging everything into one data frame. If you have more than two categories, or if they are integral to the dataset (e.g., actual categorical variables), then you are much better off doing this as in the other answer.
Note that your example left out the column year so your code does not run.