I'm generating violin plots in ggplot2 for a time series, year_1 to year_32. The years in my df are stored as numeric values. From the examples I've seen, it seems that I must convert these numeric year values to factors to plot one violin per year; and in fact, if I run the code without as.factor, I get one big fat violin. I would like to understand why geom_violin can't have numeric values on the x axis; or, if I'm wrong about that, how to use them.
So:
my_data$year <- as.factor(my_data$year)
p <- ggplot(data = my_data, aes(x = year, y = continuous_var)+
geom_violin(fill = "#FF0000", color = "#000000")+
ylim(0,500)+
labs(x = "x_label", y = "y_label")
p +my_theme()
works fine, but if I skip
my_data$year <- as.factor(my_data$year)
it doesn't work, I get one big fat violin for all years. Why?
TIA
You are missing a closing ) at the end of this line: p <- ggplot(data = my_data, aes(x = year, y = continuous_var)
I have constructed a reproducible example with the ToothGrowth dataset. This should work now:
library(ggplot2)
my_data <- ToothGrowth
my_data$dose <- as.factor(my_data$dose)
p <- ggplot(data = my_data, aes(x = dose, y = len))+
geom_violin(fill = "#FF0000", color = "#000000")+
ylim(0,500)+
labs(x = "x_label", y = "y_label") +
theme_bw()
p
PS: this discussion would better fit Cross Validated, as it's more of a statistics question than a coding question.
I'm not 100% sure, but here's my explanation: a violin plot shows the density of a set of data. You can divide your data into groups and plot one violin per group. But if the variable you use to define the groups (the x axis) is continuous, there are infinitely many possible groupings (one group for the values at 0, one for 0.1, one for 0.01, etc.), so in the end you can't actually divide your data, and ggplot probably ignores the x variable and makes one violin for all your data.
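To the "how to use them" part of the question: as far as I know, ggplot2 builds its default groups from the discrete variables in the plot, so a numeric x leaves everything in one group. A minimal sketch, assuming that supplying the group aesthetic explicitly is acceptable, keeps the axis numeric and still gives one violin per value. It uses ToothGrowth (as in the example above) rather than the asker's data:

```r
library(ggplot2)

# ToothGrowth ships with R; dose is numeric (0.5, 1, 2)
my_data <- ToothGrowth

# Keep dose numeric on the x axis, but tell ggplot which rows belong together
p <- ggplot(my_data, aes(x = dose, y = len, group = dose)) +
  geom_violin(fill = "#FF0000", color = "#000000") +
  labs(x = "dose (numeric)", y = "len") +
  theme_bw()
p
```

With a numeric axis the violins sit at their true x positions, so unevenly spaced values (here 0.5, 1, 2) are unevenly spaced on the plot as well.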
I am trying to plot the dbscan clustering result through ggplot2. If I understand it correctly the current dbscan plots noise in black colour with base plot function. Some code first,
library(dbscan)
n <- 100
x <- cbind(
x = runif(5, 0, 10) + rnorm(n, sd = 0.2),
y = runif(5, 0, 10) + rnorm(n, sd = 0.2)
)
plot(x)
kNNdistplot(x, k = 5)
abline(h=.25, col = "red", lty=2)
res <- dbscan::dbscan(x, eps = .25, minPts = 4)
plot(res, x, main = "DBSCAN")
x <- data.frame(x)
ggplot(x, aes(x = x, y=y)) + geom_point(color = res$cluster+1, pch = clusym[res$cluster+1]) +
theme_grey() + ggtitle("(c)") + labs(x ="x", y = "y")
I want to do two things differently here. First, I am trying to plot the clustering output through ggplot(). The difficulty is that if I use res$cluster to plot the points, plot() ignores points labelled 0 (the noise points), while ggplot() throws an error because the length of res$cluster is smaller than the data to plot; and if I use res$cluster+1, the noise points get label 1, which I don't want. Second, if possible, I would like to do something like what clusym[] in package fpc does: it plots clusters with labels 1, 2, 3, ... and ignores 0 labels. It would be fine if the labels for noise points stayed 0 and the noise points were then given a specific symbol, say "*", in a specific colour, say grey. I have seen a Stack Overflow post that tries to do a similar thing for convex hull plotting, but I still couldn't figure out how to do this if I don't want to draw the hull and want a cluster number for each cluster.
One possibility I thought of was to first plot the points without noise and then add the noise points, with the desired colour and symbol, to the original plot.
But since the length of res$cluster is not equal to that of x, it throws an error.
ggplot(x, aes(x = x, y=y)) + geom_point(color = res$cluster+1, pch = clusym[res$cluster+1])
+ theme_grey() + ggtitle("(c)") + labs(x ="x", y = "y") + adding noise points
Error: Aesthetics must be either length 1 or the same as the data (100): shape, colour
You should first extract the cluster assignments from the output of DBSCAN, tack them onto your original data as a new column (e.g. cluster), and convert that column to a factor.
When you make the ggplot, you can assign color or shape to cluster. As for ignoring the noise points, I would do it as follows.
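# data: the original data frame plus the cluster column (still numeric)
data <- data.frame(x, cluster = res$cluster)
# drop the noise points (cluster 0) before converting to a factor
data2 <- dplyr::filter(data, cluster > 0)
data2$cluster <- as.factor(data2$cluster)
ggplot(data2, aes(x = x, y = y)) +
geom_point(aes(color = cluster))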
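If the noise points should stay visible rather than be dropped, one option is to draw them as a second layer. The sketch below is only an illustration: the data frame is a made-up stand-in (not dbscan output), with a numeric cluster column where 0 marks noise. Clusters are drawn as their cluster number, and noise as a grey asterisk-like symbol (shape 8).

```r
library(ggplot2)

# Hypothetical stand-in for the question's data: x, y and a numeric
# cluster label per row, where 0 marks noise (as dbscan reports it)
set.seed(1)
data <- data.frame(
  x = c(rnorm(20, 2, 0.2), rnorm(20, 6, 0.2), runif(5, 0, 8)),
  y = c(rnorm(20, 2, 0.2), rnorm(20, 6, 0.2), runif(5, 0, 8)),
  cluster = c(rep(1, 20), rep(2, 20), rep(0, 5))
)

# Split into clustered points and noise; only the former become a factor
clustered <- subset(data, cluster > 0)
noise     <- subset(data, cluster == 0)
clustered$cluster <- factor(clustered$cluster)

p <- ggplot(mapping = aes(x = x, y = y)) +
  # cluster members: drawn as their cluster number, coloured per cluster
  geom_text(data = clustered, aes(label = cluster, colour = cluster)) +
  # noise: fixed grey symbol, set outside aes() so it gets no legend entry
  geom_point(data = noise, colour = "grey50", shape = 8) +
  ggtitle("DBSCAN")
p
```

Because each layer receives its own data frame, every aesthetic has the same length as the data it is drawn from, which avoids the "Aesthetics must be either length 1 or the same as the data" error.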
I'm quite new to ggplot, but I like the systematic way you build your plots. Still, I'm struggling to achieve the desired results. I can replicate plots where you have categorical data. However, for my use I often need to fit a model to certain observations and then highlight them in a combined plot. With the usual plot function I would do:
library(splines)
set.seed(10)
x <- seq(-1,1,0.01)
y <- x^2
s <- interpSpline(x,y)
y <- y+rnorm(length(y),mean=0,sd=0.1)
plot(x,predict(s,x)$y,type="l",col="black",xlab="x",ylab="y")
points(x,y,col="red",pch=4)
points(0,0,col="blue",pch=1)
legend("top",legend=c("True Values","Model values","Special Value"),text.col=c("red","black","blue"),lty=c(NA,1,NA),pch=c(4,NA,1),col=c("red","black","blue"),cex = 0.7)
My biggest problem is how to build the data frame for ggplot so that it then draws the legend automatically. In this example, how would I translate this into ggplot to get a similar plot? Or is ggplot not made for this kind of plot?
Note this is just a toy example. Usually the model values are derived from a more complex model, just in case you wanted to use a stat in ggplot.
The key part here is that you can map colors in aes by giving a string, which will produce a legend. In this case, there is no need to include the special value in the data.frame.
df <- data.frame(x = x, y = y, fit = predict(s, x)$y)
ggplot(df, aes(x, y)) +
geom_line(aes(y = fit, col = 'Model values')) +
geom_point(aes(col = 'True values')) +
geom_point(aes(col = 'Special value'), x = 0, y = 0) +
scale_color_manual(values = c('True values' = "red",
'Special value' = "blue",
'Model values' = "black"))
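The base-graphics legend in the question also shows a line for the model and distinct point symbols for the observations. A possible way to get closer to that, sketched here under assumptions, is to set the shapes per layer and adjust the legend keys with override.aes. The fit column below is simply x^2 and stands in for the spline prediction; the legend values are listed alphabetically because, as far as I know, that is the default key order for string-mapped colours.

```r
library(ggplot2)

# Toy data mirroring the question: noisy observations around y = x^2.
# `fit` stands in for the model values (an assumption for this sketch).
set.seed(10)
x <- seq(-1, 1, 0.01)
df <- data.frame(x = x, y = x^2 + rnorm(length(x), sd = 0.1), fit = x^2)

p <- ggplot(df, aes(x, y)) +
  geom_line(aes(y = fit, colour = "Model values")) +
  geom_point(aes(colour = "True values"), shape = 4) +
  geom_point(aes(colour = "Special value"), x = 0, y = 0, shape = 1) +
  scale_colour_manual(
    name = NULL,
    values = c("Model values"  = "black",
               "Special value" = "blue",
               "True values"   = "red")
  ) +
  # One entry per legend key, in the order Model, Special, True
  guides(colour = guide_legend(override.aes = list(
    linetype = c(1, 0, 0),   # draw a line only in the model key
    shape    = c(NA, 1, 4)   # draw a point only in the observation keys
  )))
p
```

If the keys come out in a different order on your ggplot2 version, the vectors in override.aes need to be reordered to match.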
I was trying to create a simple line graph of means and interactions. I have a DV (reading times) on the y-axis, one factor (Length) on the x-axis, and another as a grouping variable (position).
The syntax I used is below. The data plotted as single points on a line for each of the two Length conditions, but did not connect with lines between the two Length conditions. What am I missing in terms of syntax?
I am using R i386 2.15.2, and updated ggplot2 last week.
Here is a reproducible example
SubjectID <- c(101,101,101,101,101,101,101,101,102,102,102,102,102,102,102,102,
201,201,201,201,201,201,201,201,202,202,202,202,202,202,202,202)
Group <- c("PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA","PWA",
"PWA","PWA","PWA","PWA","PWA","Control","Control","Control",
"Control","Control","Control","Control","Control","Control",
"Control","Control","Control","Control","Control","Control",
"Control")
Length <- c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2)
Pos <- c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2)
ReadT <- c(6.7,7.6,6.4,7.9,5.4,6.4,6.3,7.4,6.9,7.2,6.7,7.4,5.7,6.1,6.5,7.8,
6.1,5.7,4.9,6.1,4.7,6.5,6.1,6.2,6.9,5.9,4.8,6.5,4.6,6.3,6.7,6.6)
data <- data.frame (SubjectID, Group,Length,Pos,ReadT)
data$Length <- factor(data$Length, order = TRUE,
levels = c(1,2),
labels = c("Length 1", "Length 2"))
data$Pos <- factor(data$Pos, order = TRUE,
levels = c(1,2),
labels = c("Position 1", "Position 2"))
qplot(Length, data=data, ReadT, geom=c("point", "line"),
stat="summary", fun.y=mean, group=Pos, colour=Pos,
facets = ~Group)
I don't think you have reproduced any inconsistency, but your issues are in part clouded by trying to condense everything into a single qplot call.
Your x variable Length is a factor, therefore ggplot is sensibly considering Length 1 and Length 2 to be independent, and won't connect the lines.
Secondly, you won't be able to use stat_summary to summarize by your x values without forcing these to be a factor (and hence independent).
I find it easiest to presummarize the data rather than rely on ggplot, e.g.:
library(plyr)
data.means <- ddply(data, .(Group, Pos, Length), summarize, ReadT = mean(ReadT))
Then construct the plot using ggplot not qplot, to give you the flexibility (and transparency) required.
The trick to get the lines connected is to treat x as numeric within the call to geom_line:
ggplot(data.means, aes(x= Length, y= ReadT, colour = Pos)) +
geom_point() +
geom_line(aes(x=as.numeric(Length))) +
facet_grid(~Group)
If you insisted on using the raw data, and stat_xxxx functions, you could also replicate this using stat_smooth to estimate the means (which would keep x classified as numeric)
ggplot(data, aes(x = Length, y= ReadT, colour = Pos)) +
stat_summary(fun.y = 'mean', geom = 'point')+
stat_smooth(method = 'lm', aes(x=as.numeric(Length)), se = FALSE) +
facet_grid(~Group)
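On more recent ggplot2 releases (3.3.0 or later, to my knowledge) the fun.y argument is deprecated in favour of fun, and an explicit group aesthetic is usually enough to connect summary points across factor levels. A minimal sketch under those assumptions, using only the rows for subjects 101 and 201 from the question so that it runs on its own:

```r
library(ggplot2)

# Rows for subjects 101 (PWA) and 201 (Control) from the question's data
d <- data.frame(
  Length = factor(rep(c("Length 1", "Length 2"), times = 8)),
  Pos    = factor(rep(rep(c("Position 1", "Position 2"), each = 2), times = 4)),
  Group  = rep(c("PWA", "Control"), each = 8),
  ReadT  = c(6.7, 7.6, 6.4, 7.9, 5.4, 6.4, 6.3, 7.4,
             6.1, 5.7, 4.9, 6.1, 4.7, 6.5, 6.1, 6.2)
)

p <- ggplot(d, aes(x = Length, y = ReadT, colour = Pos, group = Pos)) +
  # mean of ReadT per Length level, per Pos, per facet
  stat_summary(fun = mean, geom = "point") +
  # group = Pos makes each line span both Length levels
  stat_summary(fun = mean, geom = "line") +
  facet_grid(~Group)
p
```

This keeps Length as a factor on the x axis and avoids both the presummarizing step and the as.numeric() conversion, at the cost of requiring a newer ggplot2 than the one mentioned in the question.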