General way to break on unique values in ggplot2 continuous scales - r

I often need to create custom breaks for an axis or a color/fill/size scale to reflect the actual data points. Typically in my data the variable is continuous, but the measurements are taken at discrete points. Judging from what I see on SO, I think this applies to many others as well. Below is an example using the mpg data, plotting cty vs. cyl:
library(ggplot2)
library(dplyr)

mpg %>%
  ggplot(aes(cyl, cty)) +
  geom_point() +
  scale_x_continuous(breaks = unique(mpg$cyl))
But one does not really want to type a different "mpg$cyl" for every exploratory data analysis, so I am looking for a general solution.
p.s. I read that ggplot does not pass the data to the scale functions -- probably just the range for the calculation. I filed an issue but have not yet gotten a response.

Indeed, ggplot2 does not have a general way to do this. For continuous scales, the training method is to update the range of the scale every time a new layer is examined. It makes sense in the 'grammar of graphics' that scales are mostly independent of geometry layers.
You could, in theory, tackle this problem from the bottom up by making a new Range ggproto class that keeps track of unique values. However, ggplot2 does not export its Range classes, which likely means tinkering with them isn't supported. Also, it's quite a task to set up a new type of scale.
Instead, I'm proposing to hack the ggplot_add() method to leak information from the global plot to the scale. The first thing to do is to wrap the constructor of a scale so that it tags an extra class onto that scale.
library(ggplot2)

scale_x_unique <- function(...) {
  sc <- scale_x_continuous(...)
  new <- ggproto("ScaleUnique", sc)
  new
}
Next, we want to write a ggplot_add() method for our ScaleUnique class. The function below checks whether there are any user-defined breaks and, if there are none, evaluates the scale's aesthetic against the global plot data.
ggplot_add.ScaleUnique <- function(object, plot, object_name) {
  # The "waiver" class marks unspecified arguments
  if (inherits(object$breaks, "waiver")) {
    # Find the common aesthetic between the scale and the plot mapping
    aes <- intersect(object$aesthetics, names(plot$mapping))
    # Find the expression associated with that aesthetic
    aes <- plot$mapping[[aes[[1]]]]
    # Evaluate the aesthetic in the plot data
    values <- rlang::eval_tidy(aes, plot$data)
    # Assign the unique values as breaks
    object$breaks <- sort(unique(values))
  }
  plot$scales$add(object)
  plot
}
Now you can use it like any other scale:
ggplot(mpg, aes(cyl, cty)) +
  geom_point() +
  scale_x_unique()
Created on 2021-08-11 by the reprex package (v1.0.0)
This of course only works if the aesthetic is defined in the global plot call and the data is available in the global plot. You could in theory traverse all layers and keep updating your unique values, but this becomes cumbersome.
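For completeness, here is a rough sketch (not part of the original answer) of what such a layer traversal could look like. It assumes a layer either carries its own data frame or inherits the global plot data, and it falls back to the global mapping when a layer has no relevant mapping of its own.
ggplot_add.ScaleUnique <- function(object, plot, object_name) {
  if (inherits(object$breaks, "waiver")) {
    values <- c()
    # Start with the global mapping and data
    aes_name <- intersect(object$aesthetics, names(plot$mapping))
    if (length(aes_name) > 0) {
      values <- rlang::eval_tidy(plot$mapping[[aes_name[[1]]]], plot$data)
    }
    # Then visit each layer's mapping and data
    for (layer in plot$layers) {
      aes_name <- intersect(object$aesthetics, names(layer$mapping))
      if (length(aes_name) == 0) next
      layer_data <- if (is.data.frame(layer$data)) layer$data else plot$data
      values <- c(values,
                  rlang::eval_tidy(layer$mapping[[aes_name[[1]]]], layer_data))
    }
    if (length(values) > 0) object$breaks <- sort(unique(values))
  }
  plot$scales$add(object)
  plot
}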

Related

Histograms and Density Plots do not match up

I am creating histograms of substitutions (1st, 2nd, or 3rd sub) over time, so each histogram shows the number of subs in a given minute for a given sub number. The histograms make sense to me because for the most part they are smooth (I used a bin width of 1 minute); nothing looks too out of the ordinary. However, when I overlay a density plot, the tails on the left inflate, and for one of the graphs I cannot determine why.
The dataset is of substitutions, ranging from minute 1 to a maximum time. I then cut this dataset in half to look only at subs made after minute 45. I have not folded this data back. I have tried to create a reproducible example, but cannot, given the data.
Code used to create the plot in R:
## Filter out subs that are not in the second half
df.half <- df[df$PeriodId >= 2, ]

p <- ggplot(data = df.half, aes(x = time)) +
  geom_histogram(aes(y = ..density..), position = "identity", alpha = 0.5, binwidth = 1) +
  geom_vline(data = sumy.df.half, aes(xintercept = grp.mean),
             color = "blue", linetype = "dashed", size = 1) +
  geom_density(alpha = .2) +
  facet_grid(SUB_NUMBER ~ .) +
  scale_y_continuous(limits = c(0, 0.075), breaks = c(seq(0, 0.075, 0.025)),
                     minor_breaks = c(seq(0, 0.075, 0.025)), name = 'Count')
p
Why, for the First Sub, is the density plot inflated in the tail if there are no values less than 45? Also, why isn't the density plot more inflated in the tail for the Second Sub?
Side note: I did ask this question on Cross Validated, but was told that since it involves R, I should ask it here instead. Here
So I was able to change the code and get the following:
ggplot() +
  geom_histogram(data = df.half, aes(x = time, y = ..density..),
                 position = "identity", alpha = 0.5, binwidth = 1) +
  geom_density(data = df.half, aes(x = time, y = ..density..)) +
  geom_vline(data = sumy.df.half, aes(xintercept = grp.mean),
             color = "blue", linetype = "dashed", size = 1) +
  facet_grid(SUB_NUMBER ~ .)
This looks more correct and at least now fits the dataset. However, I am still confused as to why those issues occurred in the first place.
While there is no data sample to reproduce the error, you could try to make sure that the environment used by geom_density is correct by specifying it explicitly. You can also try moving the line specifying the density (geom_density) to just after geom_histogram. Also, the y-axis label is probably wrong: it is currently set to 'Count', while the values suggest it is in fact a density.
How would I specify density explicitly?
You can specify the density parameters explicitly by passing data, aes, and position directly in the geom_density call, so it uses these stated arguments instead of inherited ones:
ggplot() +
  geom_histogram(data = df.half, aes(x = time, y = ..density..),
                 position = "identity", alpha = 0.5, binwidth = 1) +
  geom_density(data = df.half, aes(x = time, y = ..density..)) +
  geom_vline(data = sumy.df.half, aes(xintercept = grp.mean),
             color = "blue", linetype = "dashed", size = 1) +
  facet_grid(SUB_NUMBER ~ .)
I do not understand how it occurred in the first place.
I think in your initial code for geom_density you explicitly specified only the alpha argument. So for all the other parameters it needed (data, aes, position, etc.) it used the inherited arguments, and apparently it did not inherit them correctly. It probably tried to use the data argument from the geom_vline call (sumy.df.half), or was confused by the ..density.. syntax.

How to set the height of grid rows in line graphs with ggplot (R)?

I'm trying to plot a line graph using the ggplot library in R. I get a good plot, but I need to reduce the space or height between the grid rows because there is a big separation between the lines.
This is my R script:
library(ggplot2)
library(reshape2)

data <- read.csv('/Users/keepo/Desktop/G.Con/Int18/input-int18.csv')
chart_data <- melt(data, id = 'NRO')
names(chart_data) <- c('NRO', 'leyenda', 'DTF')

ggplot() +
  geom_line(data = chart_data, aes(x = NRO, y = DTF, color = leyenda), size = 1) +
  xlab("iteraciones") +
  ylab("valores")
and this is my actual graph:
...the first line is very distant from the second. How can I reduce the height?
regards.
The lines are far apart because the values of the variable plotted on the y-axis are far apart. If you need them closer together, you fundamentally have three options:
Change the scale (e.g. convert the plot to a log scale), although this can make it harder for people to interpret the numbers. This can also change the shape of each line, not just the space between the lines. I'm guessing this isn't what you ultimately want.
Normalize the data. If the actual value of the variable on the y-axis isn't important, just standardize the data (separately for each value of leyenda); a sketch of this appears after this answer.
Graph each line separately. The main drawback here is that you need three graphs where one might do.
Not recommended:
I know that some graphs use a "squiggle" to change scales or skip space. Generally, this is considered poor practice (and I doubt it's an option in ggplot2) because it masks the true separation between the data points. If you really do want a gap, I would look at this post: axis.break and ggplot2 or gap.plot? plot may be too complexe
In a nutshell, the answer here depends on what your numbers mean. What story are you trying to tell? Is the important feature of your plots the change between them (in which case normalizing might be your best option), or the actual numbers themselves (in which case the space is relevant)?
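For the normalization route (option 2 above), here is a minimal sketch, assuming the chart_data frame built in the question's code and that dplyr is available:
library(ggplot2)
library(dplyr)

# Standardize DTF separately for each value of leyenda (mean 0, sd 1)
chart_data_std <- chart_data %>%
  group_by(leyenda) %>%
  mutate(DTF_std = as.vector(scale(DTF))) %>%
  ungroup()

ggplot(chart_data_std, aes(NRO, DTF_std, color = leyenda)) +
  geom_line(size = 1) +
  xlab("iteraciones") +
  ylab("valores (estandarizados)")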
You could use an axis transformation that maps your data to the screen in a non-linear fashion:
fun_trans <- function(x) {
  d <- data.frame(x = c(800, 2500, 3100), y = c(800, 1950, 3100))
  model1 <- lm(y ~ poly(x, 2), data = d)
  model2 <- lm(x ~ poly(y, 2), data = d)
  scales::trans_new("fun",
    function(x) as.vector(predict(model1, data.frame(x = x))),
    function(x) as.vector(predict(model2, data.frame(y = x))))
}

last_plot() + scale_y_continuous(trans = "fun")

Heatmap table (ggfluctuation function)

When I run this code, I get the error "ggfluctuation is deprecated. (Defunct; last used in version 0.9.1)".
1- How can I fix this issue?
2- In my original data set, I have two string variables with many levels (the first variable with 65 levels and the second with 8 levels). Can I have a heatmap table for these two variables even though they have different numbers of levels?
3- What is the best way (plot) to show the relationship between these two categorical variables in my data set?
library(Hmisc)
library(ggplot2)
library(reshape)

data(HairEyeColor)
P <- t(HairEyeColor[,,2])
Pm <- melt(P)

ggfluctuation(Pm, type = "heatmap") +
  geom_text(aes(label = Pm$value), colour = "white") +
  opts(axis.text.x = theme_text(size = 15), axis.text.y = theme_text(size = 15))
If you want to plot a heatmap, just use geom_tile. Also, opts and theme_text are deprecated and have been replaced by theme and element_text, respectively.
So, you could use this:
ggplot(Pm, aes(Eye, Hair, fill = value)) +
  geom_tile() +
  geom_text(aes(label = Pm$value), colour = "white") +
  theme(axis.text.x = element_text(size = 15),
        axis.text.y = element_text(size = 15))
Which outputs the heatmap.
Also, just to answer the remaining questions: yes, ggplot2 can handle two categorical columns with different numbers of levels, and a heatmap is a nice way to show the relationship between two categorical variables such as the ones you have.
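As a rough sketch for the asker's own data (the data frame mydata and the columns var1 and var2 are hypothetical placeholders), the counts can be tabulated first and then passed to geom_tile in the same way:
# Tabulate the two categorical columns, then plot the counts as a heatmap
counts <- as.data.frame(table(mydata$var1, mydata$var2))
names(counts) <- c("var1", "var2", "value")

ggplot(counts, aes(var2, var1, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value), colour = "white", size = 2)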
The GGally package has a ggfluctuation2 function that replaces the deprecated ggfluctuation, but it's still pretty rough (you can't even specify axis labels), and I prefer the plain ggplot solution. You can also try ggally_ratio.

Setting breakpoints for data with scale_fill_brewer() function in ggplot2

I am creating a map (choropleth) as described on the ggplot2 wiki. Everything works like a charm, except that I am running into an issue mapping a continuous value to the polygon fill color via the scale_fill_brewer() function.
This question describes the problem I'm having. As in the answer, my workaround has been to pre-cut my data into bins using the gtools quantcut() function:
UPDATE: This first example is actually the right way to do this
require(gtools) # needed for quantcut()
...
fill_factor <- quantcut(fill_continuous, q = seq(0, 1, by = 0.25))

ggplot(mydata) +
  aes(long, lat, group = group, fill = fill_factor) +
  geom_polygon() +
  scale_fill_brewer(name = "mybins", palette = "PuOr")
This works, however, I feel like I should be able to skip the step of pre-cutting my data and do something like this with the breaks option:
ggplot(mydata) +
  aes(long, lat, group = group, fill = fill_continuous) +
  geom_polygon() +
  scale_fill_brewer(name = "mybins", palette = "PuOr", breaks = quantile(fill_continuous))
But this doesn't work. Instead I get an error something like:
Continuous variable (composite score) supplied to discrete scale_brewer.
Have I misunderstood the purpose of the "breaks" option? Or is breaks broken?
A major issue with pre-cutting continuous data is that there are three pieces of information used at different points in the code:
The Brewer palette -- determines the maximum number of colors available
The number of break points (or the bin width) -- has to be specified with the data
The actual data to be plotted -- influences the choice of the Brewer palette (sequential/diverging)
A true vicious circle. This can be broken by providing a function that accepts the data and the palette, automatically derives the number of break points and returns an object that can be added to the ggplot object. Something along the following lines:
fill_brewer <- function(fill, palette) {
  require(RColorBrewer)
  n <- brewer.pal.info$maxcolors[palette == rownames(brewer.pal.info)]
  discrete.fill <- call("quantcut", match.call()$fill, q = seq(0, 1, length.out = n))
  list(
    do.call(aes, list(fill = discrete.fill)),
    scale_fill_brewer(palette = palette)
  )
}
Use it like this:
ggplot(mydata) + aes(long, lat, group = group) + geom_polygon() +
  fill_brewer(fill = fill_continuous, palette = "PuOr")
As Hadley explains, the breaks option moves the tick marks but does not cut the data into discrete bins. Therefore pre-cutting the data, as in the first example in the question, is the right way to use the scale_fill_brewer command.
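A side note, assuming a reasonably recent ggplot2 (binned brewer scales were added around version 3.3.0): scale_fill_fermenter() can bin a continuous fill directly, which skips the pre-cutting step. A minimal sketch, reusing the question's objects:
library(ggplot2)

# Binned brewer scale on a continuous fill; the breaks shown are just the
# question's quantile idea, not a recommendation
ggplot(mydata) +
  aes(long, lat, group = group, fill = fill_continuous) +
  geom_polygon() +
  scale_fill_fermenter(name = "mybins", palette = "PuOr",
                       breaks = quantile(fill_continuous))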

ggplot2 - possible to reorder x's by value of computed y (stat_summary)?

Is it possible to reorder x values using a computed y via stat_summary?
I would think that this should work:
stat_summary( aes( x = reorder( XVarName , ..y.. ) ) )
but I get the following error:
"Error: stat_summary requires the following missing aesthetics: x"
I've seen a number of your posts, and I think this may be helpful for you: when generating a plot, always save it to a unique variable.
Create your plots without regard for ordering at first, until you're comfortable just creating the plots. Then, work your way into the structure of the ggplot objects to get a better understanding of what's in them. Then, figure out what you should be sorting.
plot1 <- ggplot() + ...
You can push plots to the viewport by typing out the object name that you've saved them to:
plot1
Creating a ggplot object (or variable) gives you the opportunity to review the structure of the plot, which, incidentally, can answer a number of the questions that you've been having so far.
str(plot1)
It is still fairly simple to reorder a plot after you've saved it as a variable/object, albeit with slightly longer names:
plot1$data$variable_tobe_recoded <- factor(...)
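To address the original question more directly, here is a minimal sketch using the built-in mpg data (assuming ggplot2 >= 3.3 for the fun argument): reorder the x variable by a summary of y in the aesthetic mapping, and let stat_summary compute the same summary for display.
library(ggplot2)

# Order the classes by their median city mileage, then plot that median
ggplot(mpg, aes(x = reorder(class, cty, FUN = median), y = cty)) +
  stat_summary(fun = median, geom = "point", size = 3) +
  xlab("class (ordered by median cty)")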