geom_step error for single point - r

This seems to be a duplicate of R - ggplot geom_step error, which has not been answered yet.
When the data to plot consists of one point only, geom_step throws an error while geom_line gives a warning only:
library(ggplot2)
data <- data.frame(x = 1, y = 2)
# works
ggplot(data = data, aes(x = x, y = y)) + geom_line()
# does not work
ggplot(data = data, aes(x = x, y = y)) + geom_step()
geom_step gives the error message: invalid line type. Is this a bug or the desired behaviour? With this behaviour geom_step loses part of ggplot's flexibility, since the single-point case needs to be dealt with manually. One brute-force solution is to manually check the number of points to be plotted and only add the step layer if there are at least two points. But surely there must be a more elegant workaround?!
packageVersion("ggplot2")
[1] ‘1.0.1’
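The brute-force check described above might look like the following minimal sketch. The helper name step_or_point is hypothetical (it is not part of ggplot2), and falling back to geom_point() for a single row is only an assumption about what is wanted. Newer ggplot2 releases may not raise this error at all.

```r
library(ggplot2)

# Hypothetical helper: pick the layer based on the number of rows.
# With fewer than two rows a step cannot be drawn, so fall back to a point.
step_or_point <- function(df) {
  if (nrow(df) >= 2) geom_step() else geom_point()
}

single  <- data.frame(x = 1, y = 2)
several <- data.frame(x = 1:3, y = c(2, 1, 3))

p_single  <- ggplot(single,  aes(x = x, y = y)) + step_or_point(single)
p_several <- ggplot(several, aes(x = x, y = y)) + step_or_point(several)
```

Printing p_single or p_several draws the plot. The check is still manual, but wrapping it in a function keeps it out of the plotting code.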

Related

geom_density blind in terms of the aesthetics supplied?

I have to admit that it has been a while since I used ggplot, but this seems a bit silly. Either I am missing something fundamental when trying to make a density plot, or there is a bug in ggplot2 (v3.3.2)
test <- data.frame(Time=rnorm(100),Age=rnorm(100))
ggplot(test,aes(y=Time,x=Age)) +
geom_density(aes(y=Time,x=Age))
produces
Error: geom_density requires the following missing aesthetics: y
how could the 'y' aesthetic be missing??
There are two cases when using geom_density(). It depends which stat layer you're specifying:
The standard case is the stat density, which makes the geom_density() function compute its y values from the frequency distribution of the given x values. In this case you must NOT provide a y aesthetic, because the y values are computed behind the curtain.
Then there is a second case, which is yours, and which you have to specify explicitly by changing the stat to identity: This is needed if, for some reason, you've precalculated values which you want to feed directly into the density function.
Your problem arises if you mix cases 1) and 2). But I agree, the error message is not really clear; it could mention checking that the stat in use is the desired one.
library(ggplot2)
test <- data.frame(time = rnorm(100), age = rnorm(100))
#if you want to use precalculated y values you have to change the used stat to identity:
ggplot(test) +
geom_density(aes(x = age, y = time),
stat = "identity")
# compared to the case with the default value of stat: stat = "density"
ggplot(test) +
geom_density(aes(x = age))
Created on 2020-08-04 by the reprex package (v0.3.0)
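To make the second case more concrete, here is a small sketch of what "precalculated values" could mean. It assumes the density is computed up front with base R's density(); the names pre and dens are made up for this illustration.

```r
library(ggplot2)

set.seed(42)
age <- rnorm(100)

# Precompute the density curve with base R (512 grid points by default)
d <- density(age)
pre <- data.frame(age = d$x, dens = d$y)

# Feed the precomputed y values straight through with stat = "identity"
p <- ggplot(pre) +
  geom_density(aes(x = age, y = dens), stat = "identity")
```

Here the y aesthetic is required, because nothing is computed by the stat any more.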
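If you want to plot both variables in one graphic, you need to "melt" the data into long format first.
library(ggplot2)
library(data.table)
test <- data.frame(Time=rnorm(100),Age=rnorm(100))
dt <- data.table(test)
# reshape to long format: one row per (variable, value) pair
dt_melt <- melt(dt, measure.vars = c("Time", "Age"))
ggplot(dt_melt,aes(x=value, fill=variable)) + geom_density(alpha=0.25)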

position_dodgev causes error in order of connecting points in geom_line

I want to plot over a timecourse x with y values that are often repeated (integer scores 1-4) and I want to visualize many subjects at once.
Because there is so much overlap, a vertical position dodge would be ideal, such as position_dodgev from ggstance package. However, when I try to connect the dots with geom_line, the order of the connection gets messed up and is connected based on y values and not x values.
I tried a coordinate flip work-around which was not successful. And replacing geom_line with geom_path (making sure it was ordered on the x scale) also did not work.
Here is a reproducible example:
#data
library(ggplot2)
library(tibble)
df<-tibble(x=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
y=c(1,2,3,7,7,1,2,3,7,7,2,1,6,7,7),
group=c('a','a','a','a','a','b','b','b','b','b','c','c','c','c','c'))
#horizontal dodge masks groups
ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodge(width=0.3))+
geom_line(position=position_dodge(width=0.3))
#line connection error with vertical dodge
library(ggstance)
ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodgev(height=0.3))+
geom_line(position=position_dodgev(height=0.3))
Horizontal dodge works as expected but does not allow visualization of all the overlapped groups in a stretch of repeated y values. Vertical dodge from ggstance connected the dots in group c in the wrong order.
I am not sure what exactly causes the issue. Knowing that position_dodge is not intended to be used with geoms and it's been called a bug, I am surprised and not at the same time about this issue.
But in any case, I found a workaround by disassembling the plot using ggplot_build, rearranging the points for geom_line within that object and then reassembling the plot again; look below:
g <- ggplot(df, aes(x=x, y=y,col=group,group=group)) +
geom_point(position=position_dodgev(height=0.3)) +
geom_line(position=position_dodgev(height=0.3))
gg <- ggplot_build(g)
# -- look at gg$data to understand following lines --
#gg$data[[2]]: data associated with geom_line as it is the 2nd geom
#c(1,2) & c(2,1): I have $group==3 ...
# ... so just need to flip 1st and 2nd datapoints within that group
gg$data[[2]][gg$data[[2]]$group==3,][c(1,2),] <-
gg$data[[2]][gg$data[[2]]$group==3,][c(2,1),]
gt <- ggplot_gtable(gg)
plot(gt)
I suspect the problem occurs due to PositionDodgev's compute_panel function, which takes in a dataset sorted by x values, & returns a dataset sorted instead by y values (within each group) after making the necessary transformations to dodge positions vertically.
The following workaround defines a new ggproto object that inherits from PositionDodgev, but reorders the dataset in compute_panel before returning it:
# new ggproto based on PositionDodgev
PositionDodgeNew <- ggproto(
"PositionDodgeNew",
PositionDodgev,
compute_panel = function (data, params, scales){
result <- ggstance:::collidev(data, params$height,
name = "position_dodgev",
strategy = ggstance:::pos_dodgev,
n = params$n,
check.height = FALSE)
result <- result[order(result$group, result$x), ] # reordering by group & x
result
})
# position function that uses PositionDodgeNew instead of PositionDodgev
position_dodgenew <- function (height = NULL, preserve = c("total", "single")) {
ggproto(NULL, PositionDodgeNew, height = height, preserve = match.arg(preserve))
}
Usage:
po <- position_dodgenew(height = 0.3)
ggplot(df,
aes(x = x, y = y, col = group)) +
geom_point(position = po) +
geom_line(position = po)
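The reordering step can be looked at in isolation with a plain data frame. This is only a sketch: the data frame below imitates what the data for group c might look like after being sorted by y. It is not actual ggstance output.

```r
# Imitation of group "c" after a sort by y: x is no longer increasing
result <- data.frame(group = 3,
                     x = c(2, 1, 3, 4, 5),
                     y = c(1, 2, 6, 7, 7))

# Restore the order by group and x, as done in compute_panel above
fixed <- result[order(result$group, result$x), ]
fixed$x
```

As far as I can tell, the rows are connected in the order in which they arrive at the drawing step, which is why restoring the x order fixes the line.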

Reminding R that integer is a factor when producing Boxplot

Good afternoon,
This is my 1st question here and every attempt is being made to be thorough.
I am working with a large data set (casualtiesdf) in R and I am trying to produce a boxplot, using ggplot2, of the variable Age_of_Casualty by the Casualty_Severity variable. The problem is that R thinks the Casualty_Severity variable is an integer. Casualty_Severity is coded in the data as the numbers 1, 2, 3.
Below you can see that I've tried to replace each integer with the name to which it corresponds and then converted the variable into a factor.
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 1] <- "Fatal"
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 2] <- "Serious"
casualtiesdf$Casualty_Severity[casualtiesdf$Casualty_Severity == 3] <- "Slight"
casualtiesdf$Casualty_Severity <- as.factor(casualtiesdf$Casualty_Severity)
When I try doing the Boxplot, however...
> ggplot(data = casualtiesdf, aes(x = Age_of_Casualty,
+ y = casualtiesdf$Casualty_Severity)) +
+ geom_boxplot()
I get: "Warning message: position_dodge requires non-overlapping x intervals"
I typed this message into Google and Stack Overflow seems to advise putting the categorical variable on the x axis (yes, I'm still very confused with my x's and y's...), so I tried:
ggplot(data = casualtiesdf, aes(x = Casualtiesdf$Casualty_Severity,
y = Age_of_Casualty +
geom_boxplot()
and get error message "Error: object 'Age_of_Casualty' not found"
I then went for thinking that maybe I have to put the as.factor in the plot code:
ggplot(data = casualtiesdf, aes(x = casualtiesdf$Casualty_Severity
as.factor(casualtiesdf$Casualty_Severity))) y = Age_of_Casualty) +
geom_boxplot()
and get error message "unexpected symbol in: geom_boxplot() ggplot"
Any help with this is greatly appreciated!
Is Age_of_Casualty also part of the data frame? If not, you might consider merging it in, or creating an Age_of_Casualty column in the data frame with a separate assignment.
I created a dummy dataframe, with two variables
casualtiesdf <- data.frame(Casualty_Severity=c(1,2,1,1,2,3,1,3),
Age_of_Casualty = c(31,32,32,33,33,33,35,35))
I then created another variable to store Casualty_Severity as a factor
casualtiesdf$Casualty_Severity_factor <- factor(x = casualtiesdf$Casualty_Severity,
levels = c(1,2,3),
labels = c("Fatal","Serious","Slight"))
With that, I can then do the box plot, with the casualty_severity as X-axis
library("ggplot2")
ggplot(data = casualtiesdf,
aes(x= Casualty_Severity_factor, y = Age_of_Casualty)) +
geom_boxplot()
This should give you some plot like this
So I would expect R to report a syntax error in your third example: "unexpected symbol in: geom_boxplot()" means "I have no idea what to do with that ...))) y = business".
In your first example, Age_of_Casualty is mistakenly assigned to x. It is really the variable whose distribution you want to analyze, so it should be the y variable.
So you're right, you need to establish Casualty_Severity as a Factor and make sure to ascribe the two variables to X and Y correctly. Something like this:
# Creating dummy data
library(ggplot2)
AC.rand <- sample(15:90, 500, replace = TRUE)
CS.rand <- sample(1:3, 500, replace = TRUE)
# Combine them into a dataframe, define the "Severity" variable as a Factor
casualtiesdf <- data.frame(Casualty_Severity = factor(CS.rand), Age_of_Casualty = AC.rand)
# Name the levels of the "Severity" variable (optional)
levels(casualtiesdf$Casualty_Severity) <- c("Fatal", "Serious", "Slight")
g <- ggplot(data = casualtiesdf, aes(x = Casualty_Severity, y = Age_of_Casualty))
g <- g + geom_boxplot()
print(g)
When I mocked up 500 rows of data I get something like:
I'm an SO noob, too, so let's learn together! :)

Getting counts on bins in a heat map using R

This question follows from these two topics:
How to use stat_bin2d() to compute counts labels in ggplot2?
How to show the numeric cell values in heat map cells in r
In the first topic, a user wants to use stat_bin2d to generate a heatmap, and then wants the count of each bin written on top of the heat map. The method the user initially wants to use doesn't work, the best answer stating that stat_bin2d is designed to work with geom = "rect" rather than "text". No satisfactory response is given.
The second question is almost identical to the first, with one crucial difference: the variables in the second question are text, not numeric. The answer produces the desired result, placing the count value for a bin over the bin in a stat_bin2d heat map.
To compare the two methods I've prepared the following code:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))
We know this first gives you the error:
"Error: geom_text requires the following missing aesthetics: x, y".
Same issue as in the first question. Interestingly, changing from stat_bin2d to stat_binhex works fine:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
ggplot(data, aes(x = x, y = y)) +
geom_hex() +
stat_binhex(geom="text", aes(label=..count..))
Which is great and all, but generally, I don't think hex binning is very clear, and for my purposes it won't work for the data I'm trying to describe. I really want to use stat_bin2d.
To get this to work, I've prepared the following workaround based on the second answer:
library(ggplot2)
data <- data.frame(x = rnorm(1000), y = rnorm(1000))
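# round to whole numbers so that every value falls into a unit-wide bin
x_t<-as.character(round(data$x))
y_t<-as.character(round(data$y))
# axis limits: the bin labels from -3 to 3 in steps of 1
x_x<-as.character(seq(-3,3,by=1))
y_y<-as.character(seq(-3,3,by=1))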
data<-cbind(data,x_t,y_t)
ggplot(data, aes(x = x_t, y = y_t)) +
geom_bin2d() +
stat_bin2d(geom="text", aes(label=..count..))+
scale_x_discrete(limits =x_x) +
scale_y_discrete(limits=y_y)
This workaround allows one to bin numerical data, but to do so you have to determine the bin width (I did it via rounding) before bringing the data into ggplot. I actually figured it out while writing this question, so I may as well finish.
This is the result: (turns out I can't post images)
So my real question here is: does anyone have a better way to do this? I'm happy I at least got it to work, but so far I haven't seen an answer for putting labels on stat_bin2d bins when using a numerical variable.
Does anyone have a method for passing on x and y arguments to geom_text from stat_bin2d without having to use a workaround? Can anyone explain why it works with text variables but not with numbers?
Another work around (but perhaps less work). Similar to the ..count.. method you can extract the counts from the plot object in two steps.
library(ggplot2)
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
# plot
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d()
# Get data - this includes counts and x,y coordinates
newdat <- ggplot_build(p)$data[[1]]
# add in text labels
p + geom_text(data=newdat, aes((xmin + xmax)/2, (ymin + ymax)/2,
label=count), col="white")
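Yet another route, sketched here only as an idea, is to do the binning yourself in base R with cut() and table() and then plot the counts with geom_tile(). The unit-wide bin edges and the names edges_x, edges_y, counts and nonempty are assumptions made for this illustration.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Assumed bin edges: unit-wide bins covering the range of the data
edges_x <- seq(floor(min(dat$x)), ceiling(max(dat$x)), by = 1)
edges_y <- seq(floor(min(dat$y)), ceiling(max(dat$y)), by = 1)

# Count the observations per 2D bin
counts <- as.data.frame(table(
  xbin = cut(dat$x, edges_x, include.lowest = TRUE),
  ybin = cut(dat$y, edges_y, include.lowest = TRUE)
))

# Bin midpoints, looked up via the factor level index
counts$x <- (head(edges_x, -1) + 0.5)[as.integer(counts$xbin)]
counts$y <- (head(edges_y, -1) + 0.5)[as.integer(counts$ybin)]

# Drop empty bins so that no zeros are printed
nonempty <- counts[counts$Freq > 0, ]

p <- ggplot(nonempty, aes(x = x, y = y)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = Freq), colour = "white")
```

The bin width is again fixed outside of ggplot, as with the rounding workaround, but the data stay numeric and the axes stay continuous.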

Transform color scale, but keep a nice legend with ggplot2

I have seen somewhat similar questions to this, but I'd like to ask my specific question as directly as I can:
I have a scatter plot with a "z" variable encoded into a color scale:
library(ggplot2)
myData <- data.frame(x = rnorm(1000),
y = rnorm(1000))
myData$z <- with(myData, x * y)
badVersion <- ggplot(myData,
aes(x = x, y = y, colour = z))
badVersion <- badVersion + geom_point()
print(badVersion)
Which produces this:
As you can see, since the "z" variable is normally distributed, very few of the points are colored with the "extreme" colors of the distribution. This is as it should be, but I am interested in emphasizing difference. One way to do this would be to use:
betterVersion <- ggplot(myData,
aes(x = x, y = y, colour = rank(z)))
betterVersion <- betterVersion + geom_point()
print(betterVersion)
Which produces this:
By applying rank() to the "z" variable, I get a much greater emphasis on minor differences within the "z" variable. One could imagine using any transformation here, instead of rank, but you get the idea.
My question is, essentially, what is the most straightforward way, or the most "true ggplot2" way, of getting a legend in the original units (units of z, as opposed to the rank of z), while maintaining the transformed version of the colored points?
I have a feeling this uses rescaler() somehow, but it is not clear to me how to use rescaler() with arbitrary transformations, etc. In general, more clear examples would be useful.
Thanks in advance for your time.
Have a look at the package scales
especially
?trans
I think that a transformation that maps the colour given the probability of getting the value or more extreme should be reasonable (basically pnorm(z))
I think that scale_colour_continuous(trans = probability_trans(distribution = 'norm')) should work, but it throws warnings.
So I defined a new transformation (see ?trans_new)
I have to define a transformation and an inverse
library(scales)
norm_trans <- function(){
trans_new('norm', function(x) pnorm(x), function(x) qnorm(x))
}
badVersion + geom_point() + scale_colour_continuous(trans = 'norm')
Using the supplied probability_trans throws a warning and doesn't seem to work
# this throws a warning
badVersion + geom_point() +
scale_colour_continuous(trans = probability_trans(distribution = 'norm'))
## Warning message:
## In qfun(x, ...) : NaNs produced
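The difference between the two transformations can be checked without any plotting. This is a minimal sketch. The vector z is made up, and the explanation in the comments is my reading of the qfun warning above: the supplied transformation applies the quantile function in the forward direction, so it expects probabilities between 0 and 1.

```r
library(scales)

# Same definition as above: forward = pnorm, inverse = qnorm
norm_trans <- function() {
  trans_new("norm", transform = pnorm, inverse = qnorm)
}

z <- c(-2, -0.5, 0, 0.5, 2)
tr <- norm_trans()

# Forward direction squeezes z into (0, 1); the inverse recovers z
p <- tr$transform(z)
back <- tr$inverse(p)

# qnorm applied directly to values outside [0, 1] gives NaN (with a warning)
outside <- suppressWarnings(qnorm(2))
```

Because the inverse recovers the original values, the legend breaks can be labelled in the units of z. The transformation object can also be passed directly, as in trans = norm_trans(), instead of by name.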
