I'd like some help understanding an error so that it can't happen again.
I was producing some (gg) plots and wanted to change the order of facets for aesthetic reasons. The way I did this had unexpected consequences and almost slipped through the net when I was checking the results - it could have caused serious problems with the article I'm working on!
I wanted to re-order the facets based on a numerical vector that I could define up-front
E.g. facet_order=c(1,2,4,3). This was so the graph syntax could be copied / pasted for repeat graphs more easily and I wouldn't have to dig around too much in the code each time.
# some example data:
df <- data.frame(x=c(1,2,3,4), y=c(1,2,3,4), facet_var=factor(c('A','B','C','D')))
# First plot (facet order defined by default):
ggplot(df, aes(x,y))+geom_point()+facet_wrap(~facet_var, nrow = 1)+labs(title='Original data')
In the second plot, facets 'C' and 'D' are swapped as intended:
# reorder facets (normal method)
df$facet_var2 <- factor(df$facet_var, levels=c('A','B','D','C')) # Set the facets var as a factor, to define the order
# Second plot:
ggplot(df, aes(x,y))+geom_point()+facet_wrap(~facet_var2, nrow = 1)+labs(title='Re-ordered facets', subtitle='working as expected')
However, this is the mistake I made:
# different syntax to reorder the facets
df$facet_var3 <- df$facet_var # duplicate the faceting variable
levels(df$facet_var3) <- levels(df$facet_var3)[c(1,2,4,3)] # I thought I was just re-ordering the levels here
# Third plot:
ggplot(df, aes(x,y))+geom_point()+facet_wrap(~facet_var3, nrow = 1)+labs(title='Re-ordered facets (method 2)', subtitle='Unexpected behaviour')
In the third graph, it looks like the data doesn't move, but the facet labels do, which is obviously wrong.
Digging a bit deeper, it appears that my syntax changed not only the order of the factor, but actually the underlying data in the factor variable. Is this behaviour expected?
Here's the crux of it:
facet_order <- c(1,2,4,3)
levels(df$facet_var) <- levels(df$facet_var)[facet_order] # bad
df$facet_var <- factor(df$facet_var, levels=levels(df$facet_var)[facet_order]) # good
Obviously I now know the solution but I'm still unclear what I actually did wrong here. Any pointers?
Hang on while I try and fix the images:
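Quick and dirty: reorder the levels after the fact, inside the plot call, with fct_relevel() from {forcats} (part of the tidyverse). fct_reorder() is meant for sorting levels by another variable; fct_relevel() takes the level names directly:
library(ggplot2)
library(forcats)
ggplot(df, aes(x,y)) +
  geom_point() +
  facet_wrap(~ fct_relevel(facet_var, 'B','A','D','C'),
             nrow = 1)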
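On the "what did I actually do wrong" part of the question: as far as I understand it, this is documented base R behaviour rather than a ggplot2 quirk. A factor stores integer codes plus a vector of level labels. Assigning with levels(x) <- ... overwrites the labels attached to those codes (a rename), whereas factor(x, levels = ...) keeps every value and only changes the order of the levels. A minimal sketch on a toy factor (the names f, relabelled and reordered are only for illustration):

```r
f <- factor(c("A", "B", "C", "D"))
facet_order <- c(1, 2, 4, 3)

# "bad": the integer codes stay 1,2,3,4 but the labels they point to are swapped,
# so the observations that were "C" are now called "D" and vice versa
relabelled <- f
levels(relabelled) <- levels(f)[facet_order]

# "good": every observation keeps its value, only the level order changes
reordered <- factor(f, levels = levels(f)[facet_order])

as.character(relabelled)  # "A" "B" "D" "C"  (data changed)
as.character(reordered)   # "A" "B" "C" "D"  (data intact)
levels(reordered)         # "A" "B" "D" "C"  (order changed)
```

A quick sanity check after any re-levelling is to compare as.character() of the old and new columns; they should be identical if only the order was meant to change.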
I have a set of times that I would like to plot on a histogram.
Toy example:
df <- data.frame(time = c(1,2,2,3,4,5,5,5,6,7,7,7,9,9, ">10"))
The problem is that one value is ">10" and refers to the number of times that more than 10 seconds were observed. The other time points are all numbers referring to the actual time. Now, I would like to create a histogram that treats all numbers as numeric and combines them in bins when appropriate, while plotting the counts of the ">10" at the side of the distribution, but not in a separate plot. I have tried to call geom_histogram twice, once with the continuous data and once with the discrete data in a separate column but that gives me the following error:
Error: Discrete value supplied to continuous scale
Happy to hear suggestions!
Here's a somewhat involved solution, but I believe it best answers your question: you want to place a bar representing the ">10" values (or any non-numeric values) next to a typical histogram. Critically, you want to keep the binning associated with a histogram, which means you are not looking to simply make your scale discrete and represent the histogram with a typical barplot.
The Data
Since you want to retain histogram features, I'm going to use an example dataset that is a bit more involved than that you gave us. I'm just going to specify a uniform distribution (n=100) with 20 ">10" values thrown in there.
set.seed(123)
df<- data.frame(time=c(runif(100,0,10), rep(">10",20)))
As prepared, df$time is a character vector, but for a histogram, we need that to be numeric. We're simply going to force it to be numeric and accept that the ">10" values are going to be coerced to NA (R will warn about this, which is expected). This is fine, since in the end we're just going to count up those NA values and represent them with a bar. While I'm at it, I'm creating a subset of df that will be used for creating the bar representing our NAs (">10") using the count() function, which returns a data frame consisting of one row and one column: df_na$n = 20 in this case.
library(dplyr)
library(ggplot2)
library(cowplot) # provides plot_grid(), used further down
df$time <- as.numeric(df$time) #force numeric and get NA for everything else
df_na <- count(subset(df, is.na(time)))
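The coercion step can be looked at in isolation. This is a small base R sketch on made-up values (time_raw, time_num and n_over are hypothetical names), showing that the ">10" entries become NA and can then be counted:

```r
# toy input: numbers stored as text plus the ">10" marker
time_raw <- c("1.5", "7.2", ">10", "3.3", ">10")

# as.numeric() turns anything it cannot parse into NA (with a warning,
# silenced here because the NAs are intended)
time_num <- suppressWarnings(as.numeric(time_raw))

# number of values that did not parse, i.e. the height of the ">10" bar
n_over <- sum(is.na(time_num))
n_over
```

One caveat: any genuinely missing or mistyped value would also end up as NA, so if the data may contain those, counting sum(time_raw == ">10") before coercing is the safer check.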
The Plot(s)
For the actual plot, you are asking to create a combination of (1) a histogram, and (2) a barplot. These are not the same plot, but more importantly, they cannot share the same axis, since by definition, the histogram needs a continuous axis and "NA" values or ">10" is not a numeric/continuous value. The solution here is to make two separate plots, then combine them with a bit of magic thanks to cowplot.
The histogram is created quite easily. I'm saving the number of bins for demonstration purposes later. Here's the basic plot:
bin_num <- 12 # using this later
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
Thanks to the subsetting previously, the barplot for the NA values is easy too:
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3)
Yikes! That looks horrible, but have patience.
Stitching them together
You can simply run plot_grid(p1, p2) and you get something workable... but it leaves quite a lot to be desired:
There are problems here. I'll enumerate them, then show you the final code for how I address them:
1. Need to remove some elements from the NA barplot: namely, the y axis entirely and the title for the x axis (but it can't be NULL or the x axes won't line up properly). These are theme() elements that are easily removed via ggplot.
2. The NA barplot is taking up WAY too much room. Need to cut the width down. We address this with the rel_widths= argument of plot_grid(). Easy peasy.
3. How do we know how to set the y scale upper limit? This is a bit more involved, since it will depend on the ..count.. stat for p1 as well as the number of NA values. You can access the maximum count for a histogram using ggplot_build(), which is a part of ggplot2.
So, the final code requires the creation of the basic p1 and p2 plots, then adds to them in order to fix the limits. I'm also adding an annotation for number of bins to p1 so that we can track how well the upper limit setting works. Here's the code and some example plots where bin_num is set at 12 and 5, respectively:
# basic plots
p1 <- ggplot(df, aes(x=time)) + theme_classic() +
geom_histogram(color='gray25', fill='blue', alpha=0.3, bins=bin_num)
p2 <- ggplot(df_na, aes(x=">10", y=n)) + theme_classic() +
geom_col(color='gray25', fill='red', alpha=0.3) +
labs(x="") + theme(axis.line.y=element_blank(), axis.text.y=element_blank(),
axis.title.y=element_blank(), axis.ticks.y=element_blank()
) +
scale_x_discrete(expand=expansion(add=1))
#set upper y scale limit
max_count <- max(c(max(ggplot_build(p1)$data[[1]]$count), df_na$n))
# fix limits for plots
p1 <- p1 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15))) +
annotate('text', x=0, y=max_count, label=paste('Bins:', bin_num)) # for demo purposes
p2 <- p2 + scale_y_continuous(limits=c(0,max_count), expand=expansion(mult=c(0,0.15)))
plot_grid(p1, p2, rel_widths=c(1,0.2))
So, our upper limit fixing works. You can get really crazy playing around with positioning, etc and the plot_grid() function, but I think it works pretty well this way.
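For anyone curious what ggplot_build() hands back in the upper-limit step, here is a minimal sketch on toy data. The data frame toy and the value n_over = 20 are assumptions that mirror the example above, not part of the original code:

```r
library(ggplot2)

set.seed(123)
toy <- data.frame(time = runif(100, 0, 10))  # toy data, numeric part only

p <- ggplot(toy, aes(x = time)) + geom_histogram(bins = 12)

# ggplot_build() runs the stat; data[[1]] holds the computed bins of layer 1
bins <- ggplot_build(p)$data[[1]]
head(bins[, c("xmin", "xmax", "count")])

max_hist  <- max(bins$count)        # tallest histogram bar
n_over    <- 20                     # assumed number of ">10" observations
max_count <- max(max_hist, n_over)  # shared upper y limit for both plots
```

Because the tallest bin changes with the number of bins, this has to be recomputed whenever bin_num changes, which is why the final code derives max_count from the built plot rather than hard-coding it.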
Perhaps, this is what you are looking for:
df1 <- data.frame(x=sample(1:12,50,rep=T))
df2 <- df1 %>% group_by(x) %>%
dplyr::summarise(y=n()) %>% subset(x<11)
df3 <- subset(df1, x>10) %>% dplyr::summarise(y=n()) %>% mutate(x=11)
df <- rbind(df2,df3 )
label <- ifelse((df$x<11),as.character(df$x),">10")
p <- ggplot(df, aes(x=x,y=y,color=x,fill=x)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_continuous(breaks=df$x,labels=label)
p
and you get the following output:
Please note that sometimes you could have some of the bars missing depending on the sample.
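One way to avoid the missing bars, sketched here in base R, is to bin into a factor that lists every level up front; table() then reports a zero for levels that never occur. The names x, binned and counts are only for illustration:

```r
set.seed(1)
x <- sample(1:12, 50, replace = TRUE)

# collapse everything above 10 into one ">10" category and fix the level set,
# so that values absent from the sample still get a (zero-height) bar
binned <- factor(ifelse(x > 10, ">10", as.character(x)),
                 levels = c(as.character(1:10), ">10"))

counts <- as.data.frame(table(binned))  # columns: binned, Freq
counts
```

The resulting counts data frame can be passed to geom_col(aes(x = binned, y = Freq)); since binned is a factor, the x axis is discrete and the bars keep the level order.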
I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color=Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I am not understanding how I would be able to convert absolute values to a stat(density) value needed to plot those.
Any suggestions?
Thanks to u/Joris for the helpful suggestion! Since I did not find this question elsewhere, I'll go ahead and post the pretty simple solution to my question here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha=y, ...). In theory, I could apply this over any geom. I tried geom_col(), which worked, but the best solution was to use geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments in order to avoid the problem of overplotting similar to those found here, here, and here.
ggplot(myData, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=y), color='blue3', size=14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, if I wanted to "match" the intensity I normalized the data (myDataNorm) and could make the same plot. In my particular case, I kind of preferred bars that did not have a gradient, but which showed a hard edge for the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=ifelse(y>0.9,1,0)), color='blue3', size=14) +
theme(legend.position='none')
Better, but I did not like the faint-colored areas that were left. The final code is what gave me something that perfectly captured what I was looking for. I simply moved the ifelse() statement to apply to the x aesthetic, so the parts of the segment drawn were only those with high enough y values. Note my data "starts" at x=290 here. Probably more elegant ways to combine those x and xend terms, but whatever:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(
x=ifelse(y>0.9,x,290), xend=ifelse(y>0.9,x-1,290),
y=Sample, yend=Sample), color='blue3', size=14) +
xlim(290,400) # needed to show entire scale
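The normalization that produces myDataNorm is not shown above. A plausible sketch, assuming it simply means dividing each Sample's y values by that Sample's maximum so every sample peaks at 1, could look like this in base R (the toy values below are invented for illustration):

```r
# toy long-format data with the same columns as myData (x, Sample, y)
myData <- data.frame(
  x      = rep(290:294, times = 2),
  Sample = rep(c("X52241", "X75123"), each = 5),
  y      = c(1, 4, 10, 4, 1,   2, 5, 3, 1, 0)
)

# ave() returns, for every row, the maximum y within that row's Sample
myDataNorm <- transform(myData, y = y / ave(y, Sample, FUN = max))

tapply(myDataNorm$y, myDataNorm$Sample, max)  # each sample now peaks at 1
```

With this scaling the y > 0.9 threshold used above means "within 10% of that sample's own peak", which is what makes the hard-edged bars comparable across samples.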
I have two factors and two continuous variables, and I use this to create a two-way facet plot using ggplot2. However, not all of my factor combinations have data, so I end up with dummy facets. Here's some dummy code to produce an equivalent output:
library(ggplot2)
dummy<-data.frame(x=rnorm(60),y=rnorm(60),
col=rep(c("A","B","C","B","C","C"),each=10),
row=rep(c("a","a","a","b","b","c"),each=10))
ggplot(data=dummy,aes(x=x,y=y))+
geom_point()+
facet_grid(row~col)
This produces this figure
Is there any way to remove the facets that don't plot any data? And, ideally, move the x and y axis labels up or right to the remaining plots? As shown in this GIMPed version
I've searched here and elsewhere and unless my search terms just aren't good enough, I can't find the same problem anywhere. Similar issues are often with unused factor levels, but here no factor level is unused, just factor level combinations. So facet_grid(drop=TRUE) or ggplot(data=droplevels(dummy)) doesn't help here. Combining the factors into a single factor and dropping unused levels of the new factor can only produce a 1-dimensional facet grid, which isn't what I want.
Note: my actual data has a third factor level which I represent by different point colours. Thus a single-plot solution allowing me to retain a legend would be ideal.
It's not too difficult to rearrange the graphical objects (grobs) manually to achieve what you're after.
Load the necessary libraries.
library(grid);
library(gtable);
Turn your ggplot2 plot into a grob.
gg <- ggplot(data = dummy, aes(x = x,y = y)) +
geom_point() +
facet_grid(row ~ col);
grob <- ggplotGrob(gg);
Working out which facets to remove, and which axes to move where depends on the grid-structure of your grob. gtable_show_layout(grob) gives a visual representation of your grid structure, where numbers like (7, 4) denote a panel in row 7 and column 4.
Remove the empty facets.
# Remove facets
idx <- which(grob$layout$name %in% c("panel-2-1", "panel-3-1", "panel-3-2"));
for (i in idx) grob$grobs[[i]] <- nullGrob();
Move the x axes up.
# Move x axes up
# axis-b-1 needs to move up 4 rows
# axis-b-2 needs to move up 2 rows
idx <- which(grob$layout$name %in% c("axis-b-1", "axis-b-2"));
grob$layout[idx, c("t", "b")] <- grob$layout[idx, c("t", "b")] - c(4, 2);
Move the y axes to the right.
# Move y axes right
# axis-l-2 needs to move 2 columns to the right
# axis-l-3 needs to move 4 columns to the right
idx <- which(grob$layout$name %in% c("axis-l-2", "axis-l-3"));
grob$layout[idx, c("l", "r")] <- grob$layout[idx, c("l", "r")] + c(2, 4);
Plot.
# Plot
grid.newpage();
grid.draw(grob);
Extending this to more facets is straightforward.
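When extending to more facets, the list of empty panels does not have to be typed by hand. The sketch below derives it from the data, assuming the grob names follow the "panel-<row>-<col>" pattern used above (worth checking against grob$layout$name, since naming can differ between ggplot2 versions). The names combos, empty and empty_panels are hypothetical:

```r
# same factor layout as the dummy data (x and y are not needed here)
dummy <- data.frame(
  col = rep(c("A", "B", "C", "B", "C", "C"), each = 10),
  row = rep(c("a", "a", "a", "b", "b", "c"), each = 10)
)

# cross-tabulate the two faceting variables; zero cells are empty facets
combos <- table(row = dummy$row, col = dummy$col)

# row/column positions of the zero cells
empty <- which(combos == 0, arr.ind = TRUE)

# build the grob names under the assumed naming pattern
empty_panels <- paste0("panel-", empty[, 1], "-", empty[, 2])
empty_panels  # for this data: "panel-2-1" "panel-3-1" "panel-3-2"
```

The result can replace the hard-coded vector in which(grob$layout$name %in% ...). The axis shifts still need to be worked out per layout.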
Maurits Evers's solution worked great, but is quite cumbersome to modify.
An alternative solution is to use facet_manual from {ggh4x}.
This is not equivalent though as it uses facet_wrap, but allows appropriate placement of the facets.
# devtools::install_github("teunbrand/ggh4x")
library(ggplot2)
dummy<-data.frame(x=rnorm(60),y=rnorm(60),
col=rep(c("A","B","C","B","C","C"),each=10),
row=rep(c("a","a","a","b","b","c"),each=10))
design <- "
ABC
#DE
##F
"
ggplot(data=dummy,aes(x=x,y=y))+
geom_point()+
ggh4x::facet_manual(vars(row,col), design = design, labeller = label_both)
Created on 2022-02-25 by the reprex package (v2.0.0)
One possible solution, of course, would be to create a plot for each factor combination separately and then combine them using grid.arrange() from gridExtra. This would probably lose my legend and would be an all around pain, would love to hear if anyone has any better suggestions.
This particular case looks like a job for ggpairs (link to a SO example). I haven't used it myself, but for paired plots this seems like the best tool for the job.
In a more general case, where you're not looking for pairs, you could try creating a column with a composite (pasted) factor and facet_grid or facet_wrap by that variable (example)
I'm quite new to R and ggplot and am having a tough time grasping how I am supposed to solve this problem in ggplot.
Essentially I want to draw 2 lines on a plot. One for method "a" and one for method "b". That is usually straightforward, but now I have a situation where I want to use functions in the aesthetic.
I want to do rank and length, but for each grouping separately. In this ggplot code, the rank and length are computed over all values. I have tried a lot of different configurations, but can't seem to get this! I include the code here to get the desired plot with regular plots.
d <- rbind(
data.frame(value=1:100, method=c("a")),
data.frame(value=50:60, method=c("b"))
)
ggplot(d, aes(x=value, y=rank(value)/length(value), colour=method)) + geom_point()
a <- d$value[d$method=="a"]
b <- d$value[d$method=="b"]
plot(
rank(a)/length(a),
col="red",
xlab="value",
ylab="F(value)",
pch=19
)
points(
rank(b)/length(b),
col="blue"
)
Is this possible with ggplot or do I need to do my calculations beforehand and then make a special plotting dataframe?
I am finding ggplot powerful, whenever I know how to do something, but frustrating as soon as I don't! Especially when I don't know if it can't do something, or if I just don't know how!
Thanks
Thanks to the commenters. Here is their solution in the context of my test case for reference.
Ranking the grouped values outside of ggplot.
d <- rbind(
data.frame(value=1:100, method=c("a")),
data.frame(value=50:60, method=c("b"))
)
library(dplyr) # for group_by() and mutate()
d <- mutate(group_by(d, method), rank=rank(value)/length(value))
ggplot(d, aes(x=value, y=rank, colour=method)) + geom_point()
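If pulling in dplyr is not desirable, the same per-group ranking can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order:

```r
d <- rbind(
  data.frame(value = 1:100, method = "a"),
  data.frame(value = 50:60, method = "b")
)

# rank(v) / length(v) is computed separately for each method
d$rank <- ave(as.numeric(d$value), d$method,
              FUN = function(v) rank(v) / length(v))

tapply(d$rank, d$method, max)  # both groups top out at 1
```

The resulting column plugs into the same ggplot() call as above, with y = rank.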
I often have to make stacked barplots to compare variables, and because I do all my stats in R, I prefer to do all my graphics in R with ggplot2. I would like to learn how to do two things:
First, I would like to be able to add proper percentage tick marks for each variable rather than tick marks by count. Counts would be confusing, which is why I take out the axis labels completely.
Second, there must be a simpler way to reorganize my data to make this happen. It seems like the sort of thing I should be able to do natively in ggplot2 with plyr, but the documentation for plyr is not very clear (and I have read both the ggplot2 book and the online plyr documentation).
My best graph looks like this, the code to create it follows:
The R code I use to get it is the following:
library(epicalc)
### recode the variables to factors ###
recode(c(int_newcoun, int_newneigh, int_neweur, int_newusa, int_neweco, int_newit, int_newen, int_newsp, int_newhr, int_newlit, int_newent, int_newrel, int_newhth, int_bapo, int_wopo, int_eupo, int_educ), c(1,2,3,4,5,6,7,8,9, NA),
c('Very Interested','Somewhat Interested','Not Very Interested','Not At All interested',NA,NA,NA,NA,NA,NA))
### Combine recoded variables to a common vector
Interest1<-c(int_newcoun, int_newneigh, int_neweur, int_newusa, int_neweco, int_newit, int_newen, int_newsp, int_newhr, int_newlit, int_newent, int_newrel, int_newhth, int_bapo, int_wopo, int_eupo, int_educ)
### Create a second vector to label the first vector by original variable ###
a1<-rep("News about Bangladesh", length(int_newcoun))
a2<-rep("Neighboring Countries", length(int_newneigh))
[...]
a17<-rep("Education", length(int_educ))
Interest2<-c(a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17)
### Create a Weighting vector of the proper length ###
Interest.weight<-rep(weight, 17)
### Make and save a new data frame from the three vectors ###
Interest.df<-cbind(Interest1, Interest2, Interest.weight)
Interest.df<-as.data.frame(Interest.df)
write.csv(Interest.df, 'C:\\Documents and Settings\\[name]\\Desktop\\Sweave\\InterestBangladesh.csv')
### Sort the factor levels to display properly ###
Interest.df$Interest1<-relevel(Interest$Interest1, ref='Not Very Interested')
Interest.df$Interest1<-relevel(Interest$Interest1, ref='Somewhat Interested')
Interest.df$Interest1<-relevel(Interest$Interest1, ref='Very Interested')
Interest.df$Interest2<-relevel(Interest$Interest2, ref='News about Bangladesh')
Interest.df$Interest2<-relevel(Interest$Interest2, ref='Education')
[...]
Interest.df$Interest2<-relevel(Interest$Interest2, ref='European Politics')
detach(Interest)
attach(Interest)
### Finally create the graph in ggplot2 ###
library(ggplot2)
p<-ggplot(Interest, aes(Interest2, ..count..))
p<-p+geom_bar((aes(weight=Interest.weight, fill=Interest1)))
p<-p+coord_flip()
p<-p+scale_y_continuous("", breaks=NA)
p<-p+scale_fill_manual(value = rev(brewer.pal(5, "Purples")))
p
update_labels(p, list(fill='', x='', y=''))
I'd very much appreciate any tips, tricks or hints.
Your second problem can be solved with melt and cast from the reshape package
After you've converted the columns of your data.frame (called your.df here) to factors, you can use something like:
install.packages("reshape")
library(reshape)
x <- melt(your.df, c()) ## Assume you have some kind of data.frame of all factors
x <- na.omit(x) ## Be careful, sometimes removing NA can mess with your frequency calculations
x <- cast(x, variable + value ~., length)
colnames(x) <- c("variable","value","freq")
## Presto!
ggplot(x, aes(variable, freq, fill = value)) + geom_bar(position = "fill") + coord_flip() + scale_y_continuous("", formatter="percent")
As an aside, I like to use grep to pull in columns from a messy import. For example:
x <- your.df[, grep("^int_", names(your.df))] ## pulls all columns whose names start with "int_"
And factoring is easier when you don't have to type c(' ', ...) a million times.
for (i in seq_len(ncol(df))) {
  ## one label per line; [-1] drops the empty string before the first newline
  df[,i] <- factor(df[,i], labels = strsplit('
Very Interested
Somewhat Interested
Not Very Interested
Not At All interested
NA
NA
NA
NA
NA
NA
', '\n')[[1]][-1])
}
You don't need prop.table() or counts to do the 100% stacked bars. You just need + geom_bar(position="fill"); position="stack" stacks the raw counts instead.
About percentages instead of ..count.., try:
ggplot(mtcars, aes(factor(cyl), prop.table(..count..) * 100)) + geom_bar()
but since it's not a good idea to shove a function into the aes(), you can write custom function to create percentages out of ..count.. , round it to n decimals etc.
You labeled this post with plyr, but I don't see any plyr in action here, and I bet that one ddply() can do the job. Online plyr documentation should suffice.
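For reference, here is a sketch of a 100% stacked bar chart with percentage tick labels as I would write it against a recent ggplot2. It uses mtcars as stand-in data (not the survey data from the question), and the percent labels come from the scales package, which is installed alongside ggplot2:

```r
library(ggplot2)

# each bar (cyl) is split by gear and rescaled to a total of 1
p <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "fill") +                   # stack and normalise to 100%
  scale_y_continuous(labels = scales::percent) +  # 0.25 is shown as 25%
  coord_flip() +
  labs(x = "cyl", y = NULL, fill = "gear")

p
```

Survey weights like those in the question could be added with aes(weight = ...) inside geom_bar(); I have left that out to keep the example minimal.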
If I am understanding you correctly, to fix the axis labeling problem make the following change:
# p<-ggplot(Interest, aes(Interest2, ..count..))
p<-ggplot(Interest, aes(Interest2, ..density..))
As for the second one, I think you would be better off working with the reshape package. You can use it to aggregate data into groups very easily.
In reference to aL3xa's comment below...
library(ggplot2)
r<-rnorm(1000)
d<-as.data.frame(cbind(r,1:1000))
ggplot(d,aes(r,..density..))+geom_bar()
Returns...
[Image: http://www.drewconway.com/zia/wp-content/uploads/2010/04/density.png]
The bins are now densities...
Your first question: Would this help?
geom_bar(aes(y=..count../sum(..count..)))
Your second question; could you use reorder to sort the bars? Something like
aes(reorder(Interest, Value, mean), Value)
(just back from a seven hour drive - am tired - but I guess it should work)
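In current ggplot2 (3.3.0 and later, as far as I know) the ..count.. notation has been superseded by after_stat(). A sketch of the proportion idea from the first suggestion in that style, again with mtcars as stand-in data:

```r
library(ggplot2)

# after_stat(count) refers to the counts computed by the stat;
# dividing by their sum turns bar heights into proportions of all rows
p <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "cyl", y = "share of cars")

p
```

Note that sum(count) here runs over the whole layer, so the bars add up to 100% across the plot rather than within each bar.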