I'm generating violin plots in ggplot2 for a time series, year_1 to year_32. The years in my df are stored as numeric values. From the examples I've seen, it seems that I must convert these numeric year values to factors to plot one violin per year; and in fact, if I run the code without as.factor, I get one big fat violin. I would like to understand why geom_violin can't have numeric values on the x axis or, if I'm wrong about that, how to use them.
So:
my_data$year <- as.factor(my_data$year)
p <- ggplot(data = my_data, aes(x = year, y = continuous_var)+
geom_violin(fill = "#FF0000", color = "#000000")+
ylim(0,500)+
labs(x = "x_label", y = "y_label")
p +my_theme()
works fine, but if I skip
my_data$year <- as.factor(my_data$year)
it doesn't work, I get one big fat violin for all years. Why?
TIA
You're missing a ) at the end of this line: p <- ggplot(data = my_data, aes(x = year, y = continuous_var)
I have constructed a reproducible example with the ToothGrowth dataset. This should work now:
library(ggplot2)
my_data <- ToothGrowth
my_data$dose <- as.factor(my_data$dose)
p <- ggplot(data = my_data, aes(x = dose, y = len))+
geom_violin(fill = "#FF0000", color = "#000000")+
ylim(0,500)+
labs(x = "x_label", y = "y_label") +
theme_bw()
p
PS: this discussion would better fit Cross Validated, as it's more of a statistics question than a coding question.
I'm not 100% sure, but here's my explanation: a violin plot shows the density of a set of data, and you can split your data into groups to plot one violin per group. But if the variable you use to define the groups (the x axis) is continuous, you would have infinitely many possible groupings (one group for the values at 0, one for 0.1, one for 0.01, etc.), so in the end you can't actually divide your data, and ggplot probably ignores the x variable and makes one violin for all your data.
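If you want to keep the x variable numeric, one possible workaround is to set the group aesthetic explicitly, so that ggplot2 knows where to split the data while x stays on a continuous scale. A minimal sketch, reusing the ToothGrowth data with dose left as a number (p_numeric is just an illustrative name):

```r
library(ggplot2)

my_data <- ToothGrowth  # dose stays numeric here: 0.5, 1, 2

# group = dose asks for one density estimate per distinct dose value,
# while the x axis itself remains continuous
p_numeric <- ggplot(my_data, aes(x = dose, y = len, group = dose)) +
  geom_violin(fill = "#FF0000", color = "#000000") +
  labs(x = "x_label", y = "y_label") +
  theme_bw()
p_numeric
```

With a continuous axis the violins are positioned at their actual numeric values, so unevenly spaced years or doses end up unevenly spaced on the plot.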
I'm wondering whether I can manipulate stat_density2d to show the density for the x values without considering the y values.
To illustrate:
df <- data.frame(x = c(1:40, rep(1:20, 3), 15:40))
ggplot(df, aes(x=x, y = x)) +
stat_density2d(aes(fill='red',alpha=..level..),geom='polygon', show.legend = F) +
geom_point(alpha = 0.3)
Obviously it doesn't really make sense to plot the same values against each other; however, I'm interested in the density of the points at a certain value. Therefore I would like to keep y constant (e.g. y = 1) but still show the same density, like so:
(In my publication I actually have multiple groups, making this a nice way to plot the group separation even though it is 1D)
I have a model which has told me there is an interaction between two variables, a and b, which significantly influences my response variable, c. All three are continuous numeric variables. For detail, c is the rate of change in my response variable, b is the rate of change in my predictor and a is mean annual rainfall. The unit of analysis is pixels in a raster. So my model is telling me that mean annual rainfall modifies how my predictor affects my response.
To visualise this interaction I would like to use a contour plot/heat map/level plot with a and b on the x and y axes and c providing the colour, to show me how my response variable changes within the space described by a and b. I can do this with a scatter plot, but it's not very pretty or easy to interpret:
qplot(b, a, colour = c) +
scale_colour_gradient(low="green", high="red")
When I try to plot a contour plot/heat map/level plot though all I get is errors, blank plots or ugly plots.
geom_contour gives me an error:
ggplot(data = Mod, aes(x = Rain, y = Bomas, z = Fire)) +
geom_contour()
Warning message:
Not possible to generate contour data
geom_raster initially gives me Error: cannot allocate vector of size 81567.2 Gb but when I round my data it produces:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_raster(aes(fill = c))
Adding interpolate = TRUE to the geom_raster code just makes the lines a little blurry.
geom_tile produces a blank graph but with a scale bar for c:
ggplot(data = df, aes(x = a, y = b, z = c)) +
geom_tile(aes(color = c))
I've also tried using stat_density2d and setting the fill and/or the colour to c, but just got an error, and I've tried using levelplot in the lattice package as well but that produces this:
levelplot(c ~ a * b, data = df,
aspect = "asp", contour = TRUE,
xlab = "a",
ylab = "b")
I suspect the problems I'm encountering are because the functions are not set up to deal with continuous x and y variables, all the examples seem to use factors. I would have thought I could compensate for that by changing bin widths but that doesn't seem to work either. Is there a function that allows you to make a heat map with 3 continuous variables? Or do I need to treat my a and b variables as factors and manually make a dataframe with bins appropriate for my data?
If you want to experiment for yourself then you get similar problems to what I'm having with:
df<- as.data.frame(rnorm(1:1068))
df[,2] <- rnorm(1:1068)
df[,3] <- rnorm(1:1068)
names(df) <- c("a", "b", "c")
You can get automatic bins, and for example calculate the means by using stat_summary_2d:
ggplot(df, aes(a, b, z = c)) +
stat_summary_2d() +
geom_point(shape = 1, col = 'white') +
viridis::scale_fill_viridis()
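If the automatic bins are too fine or too coarse, the summary function and the number of bins can be set explicitly. A minimal sketch on random data like that in the question; the choice of 15 bins and the name p_binned are arbitrary assumptions, and scale_fill_viridis_c() is used here as the ggplot2 built-in counterpart of the viridis package scale:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(a = rnorm(1068), b = rnorm(1068), c = rnorm(1068))

# each tile shows the mean of c over the points falling into that (a, b) bin
p_binned <- ggplot(df, aes(a, b, z = c)) +
  stat_summary_2d(fun = mean, bins = 15) +
  scale_fill_viridis_c()
p_binned
```

Fewer bins give smoother but coarser tiles; bins containing no points are simply left empty.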
Another good option is to slice your data by the third variable, and plot small multiples. This doesn't really show very well for random data though:
library(ggplot2)
ggplot(df, aes(a, b)) +
geom_point() +
facet_wrap(~cut_number(c, 4))
I have a scatter plot now. Each color represents a categorical group, and each group has a range of values on the x-axis. There should not be any overlap between the ranges of the groups. However, because of the thickness of the scatter points, it looks as if there is overlap. So I want to draw a line connecting the maximum point of one group to the minimum point of the adjacent group, so that as long as the line does not have a negative slope, it shows that there is no overlap between the groups.
I do not know how to use geom_line() to connect two points where the y-coordinate is a categorical variable. Is that possible?
Any help would be appreciated!!!
It sounds like you want geom_segment not geom_line. You'll need to aggregate your data into a new data frame that has the points you want plotted. I adapted Brian's sample data and use dplyr for this:
# sample data
df <- data.frame(xvals = runif(50, 0, 1))
df$cats <- cut(df$xvals, c(0, .25, .625, 1))
# aggregation
library(dplyr)
df_summ = df %>% group_by(cats) %>%
summarize(min = min(xvals), max = max(xvals)) %>%
mutate(adj_max = lead(max),
adj_min = lead(min),
adj_cat = lead(cats))
# plot
ggplot(df, aes(xvals, cats, colour = cats)) +
geom_point() +
geom_segment(data = df_summ, aes(
x = max,
xend = adj_min,
y = cats,
yend = adj_cat
))
You can keep the segments colored as the previous category, or maybe set them to a neutral color so they don't stand out as much.
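A sketch of the neutral-color variant, under the same sample-data assumptions as above. Setting colour outside aes() overrides the inherited mapping for the segments only; the filter() step and the final check are additions for illustration, not part of the original approach:

```r
library(ggplot2)
library(dplyr)

set.seed(42)
df <- data.frame(xvals = runif(50, 0, 1))
df$cats <- cut(df$xvals, c(0, .25, .625, 1))

df_summ <- df %>%
  group_by(cats) %>%
  summarize(min = min(xvals), max = max(xvals)) %>%
  mutate(adj_min = lead(min), adj_cat = lead(cats)) %>%
  filter(!is.na(adj_cat))  # the last category has no neighbour to connect to

p_seg <- ggplot(df, aes(xvals, cats, colour = cats)) +
  geom_point() +
  geom_segment(data = df_summ,
               aes(x = max, xend = adj_min, y = cats, yend = adj_cat),
               colour = "grey50")  # fixed colour, so segments stay neutral
p_seg

# numeric version of the visual check: TRUE means no overlap between neighbours
all(df_summ$max <= df_summ$adj_min)
```

The last line gives the same answer as reading the slopes off the plot, which may be handy when there are many groups.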
My reading comprehension failed me, so I misunderstood the question. Ignore this answer unless you want to learn about the lineend = argument of geom_line.
# generate dummy data
df <- data.frame(xvals = runif(1000, 0, 1))
# these categories were chosen to line up
# with tick marks to show they don't overlap
df$cats <- cut(df$xvals, c(0, .25, .625, 1))
ggplot(df, aes(xvals, cats, colour = cats)) +
geom_line(size = 3)
The caveat is that there is a lineend = argument to geom_line. The default is butt, so that lines end exactly where you want them to and butt up against things, but sometimes that's not the right look. In this case, the other options would cause visual overlap, as you can see with the gridlines.
With lineend = "square":
With lineend = "round":
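For reference, a sketch of how the non-default line ends could be requested, using the same dummy data as above. The names p_square and p_round are illustrative; size = 3 follows the code above, and newer ggplot2 versions prefer linewidth for line thickness:

```r
library(ggplot2)

set.seed(1)
df <- data.frame(xvals = runif(1000, 0, 1))
df$cats <- cut(df$xvals, c(0, .25, .625, 1))

base <- ggplot(df, aes(xvals, cats, colour = cats))

# "square" and "round" caps extend past the data by about half the line width
p_square <- base + geom_line(size = 3, lineend = "square")
p_round  <- base + geom_line(size = 3, lineend = "round")
p_round
```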
I'm quite new to ggplot, but I like the systematic way you build your plots. Still, I'm struggling to achieve the desired results. I can replicate plots where you have categorical data. However, for my use I often need to fit a model to certain observations and then highlight them in a combined plot. With the usual plot function I would do:
library(splines)
set.seed(10)
x <- seq(-1,1,0.01)
y <- x^2
s <- interpSpline(x,y)
y <- y+rnorm(length(y),mean=0,sd=0.1)
plot(x,predict(s,x)$y,type="l",col="black",xlab="x",ylab="y")
points(x,y,col="red",pch=4)
points(0,0,col="blue",pch=1)
legend("top",legend=c("True Values","Model values","Special Value"),text.col=c("red","black","blue"),lty=c(NA,1,NA),pch=c(4,NA,1),col=c("red","black","blue"),cex = 0.7)
My biggest problem is how to build the data frame for ggplot which automatically then draws the legend? In this example, how would I translate this into ggplot to get a similar plot? Or is ggplot not made for this kind of plots?
Note this is just a toy example. Usually the model values are derived from a more complex model, just in case you wanted to use a stat in ggplot.
The key part here is that you can map colors in aes by giving a string, which will produce a legend. In this case, there is no need to include the special value in the data.frame.
df <- data.frame(x = x, y = y, fit = predict(s, x)$y)
ggplot(df, aes(x, y)) +
geom_line(aes(y = fit, col = 'Model values')) +
geom_point(aes(col = 'True values')) +
geom_point(aes(col = 'Special value'), x = 0, y = 0) +
scale_color_manual(values = c('True values' = "red",
'Special value' = "blue",
'Model values' = "black"))
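To bring the legend closer to the base-graphics version (a line key for the model, point symbols for the other two entries), the keys can probably be adjusted with override.aes. A sketch under the assumption that the legend entries appear in the default alphabetical order (Model values, Special value, True values); p_legend is an illustrative name:

```r
library(ggplot2)
library(splines)

set.seed(10)
x <- seq(-1, 1, 0.01)
s <- interpSpline(x, x^2)
df <- data.frame(x = x,
                 y = x^2 + rnorm(length(x), mean = 0, sd = 0.1),
                 fit = predict(s, x)$y)

p_legend <- ggplot(df, aes(x, y)) +
  geom_line(aes(y = fit, col = "Model values")) +
  geom_point(aes(col = "True values"), shape = 4) +
  geom_point(aes(col = "Special value"), x = 0, y = 0, shape = 1) +
  scale_color_manual(name = NULL,
                     values = c("True values" = "red",
                                "Special value" = "blue",
                                "Model values" = "black")) +
  # one entry per legend key, in the assumed alphabetical key order
  guides(colour = guide_legend(override.aes = list(
    linetype = c("solid", "blank", "blank"),
    shape = c(NA, 1, 4)
  )))
p_legend
```

If the legend order is changed (for example via breaks in the scale), the vectors in override.aes need to be reordered to match.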
I am trying to create a graph in which, because there are so many points, the edges of the green region fade to black while the center stays green. The code I am currently using to create this graph is:
plot(snb$px,snb$pz,col=snb$event_type,xlim=c(-2,2),ylim=c(1,6))
I looked into contour plotting but that did not work for this. The coloring variable is a factor variable.
Thanks!
This is a great problem for ggplot2.
First, read the data in:
snb <- read.csv('MLB.csv')
With your data frame you could try plotting points that are partly transparent, and setting them to be colored according to the factor event_type:
require(ggplot2)
p1 <- ggplot(data = snb, aes(x = px, y = pz, color = event_type)) +
geom_point(alpha = 0.5)
print(p1)
and then you get this:
Or, you might want to think about plotting this as a heatmap using geom_bin2d(), and plotting facets (subplots) for each different event_type, like this:
p2 <- ggplot(data = snb, aes(x = px, y = pz)) +
geom_bin2d(binwidth = c(0.25, 0.25)) +
facet_wrap(~ event_type)
print(p2)
which makes a plot for each level of the factor, where the color is the number of data points in each bin, and the bins are 0.25 on each side. But if you have more than about 5 or 6 levels, this might look pretty bad. From the small data sample you supplied, I got this:
If the levels of the factors don't matter, there are some nice examples here of plots with too many points. You could also try looking at some of the examples on the ggplot website or the R cookbook.
Transparency could help, which is easily achieved, as @BenBolker points out, with adjustcolor:
colvec = adjustcolor(c("black", "green"), alpha = 0.2)
plot(snb$px, snb$pz,
col = colvec[snb$event_type],
xlim = c(-2,2),
ylim = c(1,6))
It's built in to ggplot:
require(ggplot2)
p <- ggplot(data = snb, aes(x = px, y = pz, color = event_type)) +
geom_point(alpha = 0.2)
print(p)