subgroups for discrete x Axis in ggplot2 - r

I would like to create a sub grouping in a ggplot2 (geom_point), meaning that I would like to shift discrete x values slightly according to a subgroup (see Figure).
I could achieve that by changing the discrete values to continuous ones and adding a subgroup-dependent shift value (see Fig. B), and then manually adjusting the x labels. But I thought there is probably a more elegant way that deals with the spacing and labeling issues. Below is a minimal example which hopefully describes what I mean.
library(ggplot2)
set.seed(1)
df <- data.frame(
ID = rep(seq(1,8),2),
group = rep(LETTERS[1:4],4),
subgroup = c(rep("a",8),rep("b",8)),
value = runif(16)
)
# convert to factor first: since R 4.0 data.frame() keeps strings as character
df$xpos <- as.numeric(factor(df$group))+(as.numeric(factor(df$subgroup))/4)
ggplot(data=df, aes(x=group, y= value, color=subgroup))+
geom_point()+
ggtitle("How it is")
ggplot(data=df, aes(x=xpos, y= value, color=subgroup))+
geom_point() +
ggtitle("How I would like it (without adjusted xAxes Labels)")

We can use position_dodge:
library(ggplot2)
ggplot(data=df, aes(x=group, y= value, color=subgroup))+
geom_point(position=position_dodge(width=0.5))+
ggtitle("How it is")
Data
set.seed(1)
df <- data.frame(
ID = rep(seq(1,8),2),
group = rep(LETTERS[1:4],4),
subgroup = c(rep("a",8),rep("b",8)),
value = runif(16)
)
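
To check what position_dodge() actually does to the points, one option is to inspect the computed layer data. The sketch below rebuilds the example data (without the ID column, which the plot does not use) and reads the drawn x positions with layer_data(); the variable name offsets is only illustrative.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  group = rep(LETTERS[1:4], 4),
  subgroup = rep(c("a", "b"), each = 8),
  value = runif(16)
)

p <- ggplot(df, aes(x = group, y = value, colour = subgroup)) +
  geom_point(position = position_dodge(width = 0.5))

# x positions after the position adjustment has been applied
ld <- layer_data(p)
x_drawn <- as.numeric(ld$x)

# distance of each point from the centre of its discrete group
offsets <- round(x_drawn - round(x_drawn), 3)
sort(unique(offsets))
```

Each subgroup ends up on its own side of the group centre while the axis keeps its discrete labels, so no manual relabelling is needed. A larger width spreads the subgroups further apart.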

Related

Color an ecdf plot that is grouped by one discrete factor, to be colored continuously using a different (continuous) factor?

I'm trying to make an ecdf graph (Empirical cumulative distribution function) with a different colored plot for each subject ('A', 'B' or 'C' in this example).
In this example, the X axis describes the RT (response time), and the Y axis describes the cumulative proportion of rt observations.
Using ggplot2 and ecdf function, I managed to plot each subject's ecdf plot with a different discrete color for each of them. The problem starts when I want to color the subject's plot continuously based on a totally different variable, here called 'color_factor', which is different for each subject and is continuous.
Here is my simplified example:
library(plyr)    # arrange(), ddply()
library(ggplot2)
set.seed(125)
dat <- data.frame(
subject = c(rep(c("A"), 10), rep(c("B"), 10), rep(c("C"), 10)),
color_factor = c(rep(0.3, 10), rep(0.6,10), rep(0.9,10)),
rt = sample(1:50, 30, replace =T)
)
dat <- arrange(dat,color_factor,rt)
dat.ecdf <- ddply(dat, .(color_factor), transform, ecdf=ecdf(rt)(rt) )
p <- ggplot( dat.ecdf, aes(rt, ecdf, colour = subject)) + geom_line()
p2 <- ggplot( dat.ecdf, aes(rt, ecdf, colour = color_factor)) + geom_line()
the initial data looks like this:
Plot p works great and looks like this:
But when I try to color the plots using the color_factor variable, it draws only one plot for all subjects and colors it not as intended.
What I intend to do is that the graph will look like graph p, except for the plots colors, which will be, for example colored as such: subject A- light blue, subject B- blue, and subject C- dark blue, corresponding to each subject's color_factor variable.
Does anyone have any ideas about what I can do? Any help would be greatly appreciated!
Thanks very much,
Yuval
Try any of these options:
library(plyr)
library(ggplot2)
#Data
set.seed(125)
dat <- data.frame(
subject = c(rep(c("A"), 10), rep(c("B"), 10), rep(c("C"), 10)),
color_factor = c(rep(0.3, 10), rep(0.6,10), rep(0.9,10)),
rt = sample(1:50, 30, replace =T)
)
#Transform
dat <- arrange(dat,color_factor,rt)
dat.ecdf <- ddply(dat, .(color_factor), transform, ecdf=ecdf(rt)(rt) )
#Plot 1
ggplot( dat.ecdf, aes(rt, ecdf, colour = subject,group=1)) + geom_line()+
scale_color_manual(values = c('lightblue','blue','darkblue'))
Output:
Or this:
#Plot 2
ggplot( dat.ecdf, aes(rt, ecdf, colour = factor(color_factor),group=subject)) + geom_line()+
scale_color_manual(values = c('lightblue','blue','darkblue'))+
labs(color='Factor')
Output:
Or this:
#Plot 3
ggplot( dat.ecdf, aes(rt, ecdf, colour = subject,group=subject)) + geom_line()+
scale_color_manual(values = c('lightblue','blue','darkblue'))+
labs(color='Subject')
Output:
Here is the answer that does exactly what I wanted, provided by @Lime:
p <- ggplot( dat.ecdf, aes(rt, ecdf, group = subject, colour = color_factor)) + geom_line()
This draws one line per subject and colors each line according to that subject's 'color_factor' value:
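
For reference, here is a minimal sketch of the same idea without plyr, assuming the ECDF should be computed per subject. It uses base R's ave() for the grouped ECDF and adds scale_colour_gradient() to get the light-to-dark blue ramp described in the question; the two colour names are just one possible choice.

```r
library(ggplot2)

set.seed(125)
dat <- data.frame(
  subject = rep(c("A", "B", "C"), each = 10),
  color_factor = rep(c(0.3, 0.6, 0.9), each = 10),
  rt = sample(1:50, 30, replace = TRUE)
)

# ECDF of rt within each subject, evaluated at every observation
dat$ecdf <- ave(as.numeric(dat$rt), dat$subject,
                FUN = function(v) ecdf(v)(v))

# geom_line() connects points in x order within each group
p <- ggplot(dat, aes(rt, ecdf, group = subject, colour = color_factor)) +
  geom_line() +
  scale_colour_gradient(low = "lightblue", high = "darkblue")
p
```

The group aesthetic decides how many lines are drawn, and colour only decides how they are painted. That is why mapping colour to a continuous variable alone collapsed everything into one line.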

violin_plot() with continuous axis for grouping variable?

The grouping variable for creating a geom_violin() plot in ggplot2 is expected to be discrete for obvious reasons. However my discrete values are numbers, and I would like to show them on a continuous scale so that I can overlay a continuous function of those numbers on top of the violins. Toy example:
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df) + geom_violin(aes(x=factor(x), y=y))
This works as you'd imagine: violins with their x axis values (equally spaced) labelled 1, 2, and 5, with their means at y=1,2,5 respectively. I want to overlay a continuous function such as y=x, passing through the means. Is that possible? Adding + scale_x_continuous() predictably gives Error: Discrete value supplied to continuous scale. A solution would presumably spread the violins horizontally by the numeric x values, i.e. three times the spacing between 2 and 5 as between 1 and 2, but that is not the only thing I'm trying to achieve - overlaying a continuous function is the key issue.
If this isn't possible, alternative visualisation suggestions are welcome. I know I could replace violins with a simple scatter plot to give a rough sense of density as a function of y for a given x.
The functionality to plot violin plots on a continuous scale is directly built into ggplot.
The key is to keep the original continuous variable (instead of transforming it into a factor) and to specify how to group it within the aesthetic mapping of geom_violin(). The width of the groups can be modified with the width argument of cut_width(), depending on the data at hand.
library(tidyverse)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T),
y = rnorm(1000, mean = x))
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'lm')
By using this approach, all geoms for continuous data and their varying functionalities can be combined with the violin plots, e.g. we could easily replace the line with a loess curve and add a scatter plot of the points.
ggplot(df, aes(x=x, y=y)) +
geom_violin(aes(group = cut_width(x, 1)), scale = "width") +
geom_smooth(method = 'loess') +
geom_point()
More examples can be found in the ggplot helpfile for violin plots.
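
If the goal is to overlay a known function such as y = x rather than a fitted line, a sketch along these lines may work. It assumes that grouping directly by the numeric x is acceptable for this data, since x only takes the values 1, 2 and 5, and it uses stat_function() to draw the curve.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = sample(c(1, 2, 5), size = 1000, replace = TRUE))
df$y <- rnorm(1000, mean = df$x)

p <- ggplot(df, aes(x = x, y = y)) +
  # x stays numeric, so the violins sit at 1, 2 and 5 on a continuous axis
  geom_violin(aes(group = x)) +
  # overlay the continuous function y = x
  stat_function(fun = function(x) x, colour = "red")
p
```

Any other function of x can be passed as fun in the same way.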
Try this. As you already guessed, spreading the violins by numeric values is the key to the solution. To this end I expand the df to include all x values in the interval min(x) to max(x) and use scale_x_discrete(drop = FALSE) so that all values are displayed.
Note: Thanks @ChrisW for the more general example of my approach.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(1,2,5), size = 1000, replace = T), y = rnorm(1000, mean = x^2))
# y = x^2
# add missing x values
x.range <- seq(from=min(df$x), to=max(df$x))
df <- df %>% right_join(tibble(x = x.range))
#> Joining, by = "x"
# Whatever the desired continuous function is:
df.fit <- tibble(x = x.range, y=x^2) %>%
mutate(x = factor(x))
ggplot() +
geom_violin(data=df, aes(x = factor(x, levels = 1:5), y=y)) +
geom_line(data=df.fit, aes(x, y, group=1), color = "red") +
scale_x_discrete(drop = FALSE)
#> Warning: Removed 2 rows containing non-finite values (stat_ydensity).
Created on 2020-06-11 by the reprex package (v0.3.0)

scatterplot with no x variable

My data set has a response variable and a 2-level factor explanatory variable. Is there a function for creating a scatter plot with no x axis variable? I'd like the variables to be randomly spread out along the x axis to make them easier to see and differentiate the 2 groups by color. I'm able to create a plot by creating an "ID" variable, but I'm wondering if it's possible to do it without it? The "ID" variable is causing problems when I try to add + facet_grid(. ~ other.var) to view the same plot broken out by another factor variable.
#Create dummy data set
response <- runif(500)
group <- c(rep('group1',250), rep('group2',250))
ID <- c(seq(from=1, to=499, by=2), seq(from=2, to=500, by=2))
data <- data.frame(ID, group, response)
#plot results
ggplot() +
geom_point(data=data, aes(x=ID, y=response, color=group))
How about using geom_jitter, setting the x axis to some fixed value?
ggplot() +
geom_jitter(data=data, aes(x=1, y=response, color=group))
You could plot x as the row number?
ggplot() +
geom_point(data=data, aes(x=1:nrow(data), y=response, color=group))
Or randomly order it first?
RandomOrder <- sample(1:nrow(data), nrow(data))
ggplot() +
geom_point(data=data, aes(x= RandomOrder, y=response, color=group))
Here's how you can scatter plot a variable against row index without intermediate variable:
ggplot(data = data, aes(y = response, x = seq_along(response), color = group)) +
geom_point()
To shuffle row index just add a sample function, like this:
ggplot(data = data, aes(y = response, x = sample(seq_along(response)), color = group)) +
geom_point()
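
A variation on the geom_jitter() idea is sketched below, assuming a single empty x category is acceptable. The data frame is named dat here to avoid masking base R's data(), and the jitter width of 0.4 is an arbitrary choice.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(
  group = rep(c("group1", "group2"), each = 250),
  response = runif(500)
)

p <- ggplot(dat, aes(x = "", y = response, colour = group)) +
  # jitter horizontally only, so the response values are left untouched
  geom_jitter(width = 0.4, height = 0) +
  labs(x = NULL)
p
```

Because x is a constant rather than an ID column, this form should combine with + facet_grid(. ~ other.var) without extra bookkeeping, where other.var is the faceting variable from the question.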

changing y scale when using fun.y ggplot

This is an example of my data
library(ggplot2)
set.seed(1)
df <- data.frame(Groups = factor(rep(1:10, each = 10)))
x <- sample(1:100, 50)
df[x, "Style"] <- "Lame"
df[-x, "Style"] <- "Cool"
df$Style <- factor(df$Style)
p <- ggplot() + stat_summary(data = df, aes(Groups, Style, fill = Style),
geom = "bar", fun.y = length, position=position_dodge())
(Sorry, this is my first question... I don't know how to present code snippets like head(df) or the actual plot in SO. Please run this code to understand my question.)
So the plot adequately presents the count of every 'Style' per 'Groups'. However, the y axis scale shows the levels of the factor variable 'Style'. Although values I am plotting are originally discrete, the count of every 'Cool' and 'Lame' per 'Groups' is continuous.
How do I change the 'y' scale of my barplot from discrete to continuous in ggplot2, in order to correspond to the count values and not the original factor levels???
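You can let ggplot2 do the counting for you. With a discrete x variable, use geom_bar(), whose default stat counts the rows in each group (current ggplot2 versions reject a discrete x in geom_histogram()):
p <- ggplot(df, aes(Groups, fill=Style)) + geom_bar(position=position_dodge())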
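
If you would rather compute the counts yourself and plot them as ordinary numbers, the following sketch is one way to do it. It assumes the same toy data as in the question and uses base R's table() together with geom_col(); the counts data frame is an illustrative helper.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(Groups = factor(rep(1:10, each = 10)))
x <- sample(1:100, 50)
df$Style <- "Cool"
df$Style[x] <- "Lame"
df$Style <- factor(df$Style)

# one row per Groups x Style combination, with the count in Freq
counts <- as.data.frame(table(Groups = df$Groups, Style = df$Style))

# Freq is numeric, so the y axis is continuous
p <- ggplot(counts, aes(Groups, Freq, fill = Style)) +
  geom_col(position = position_dodge())
p
```

Since Freq is a plain numeric column, the y scale is continuous and can be adjusted with scale_y_continuous() as usual.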

How can a line be overlaid on a bar plot using ggplot2?

I'm looking for a way to plot a bar chart containing two different series, hide the bars for one of the series, and instead have a line (smooth if possible) go through the tops of where the bars for the hidden series would have been (similar to how one might overlay a frequency polygon on a histogram). I've tried the example below but appear to be running into two problems.
First, I need to summarize (total) the data by group, and second, I'd like to convert one of the series (df2) to a line.
df <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,1,2,2,3,3))
df2 <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,4,3,5,1,2))
ggplot(df, aes(x=grp, y=val)) +
geom_bar(stat="identity", alpha=0.75) +
geom_bar(data=df2, aes(x=grp, y=val), stat="identity", position="dodge")
You can get group totals in many ways. One of them is
with(df, tapply(val, grp, sum))
For simplicity, you can combine bar and line data into a single dataset.
# grp is a character column in R >= 4.0, so take its sorted unique values
df_all <- data.frame(grp = factor(sort(unique(df$grp))))
df_all$bar_heights <- with(df, tapply(val, grp, sum))
df_all$line_y <- with(df2, tapply(val, grp, sum))
Bar charts use a categorical x-axis. To overlay a line you will need to convert the axis to be numeric.
ggplot(df_all) +
geom_bar(aes(x = grp, weight = bar_heights)) +
geom_line(aes(x = as.numeric(grp), y = line_y))
Perhaps your sample data aren't representative of the real data you are working with, but there are no lines to be drawn for df2. There is only one value for each x and y value. Here's a modified version of your df2 with enough data points to construct lines:
df <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,2,3,1,2,3))
df2 <- data.frame(grp=c("A","A","B","B","C","C"),val=c(1,4,3,5,0,2))
p <- ggplot(df, aes(x=grp, y=val))
p <- p + geom_bar(stat="identity", alpha=0.75)
p + geom_line(data=df2, aes(x=grp, y=val), colour="blue")
Alternatively, if your example data above is correct, you can plot this information as a point with geom_point(data = df2, aes(x = grp, y = val), colour = "red", size = 6). You can obviously change the color and size to your liking.
EDIT: In response to comment
I'm not entirely sure what a frequency polygon over a histogram is supposed to look like here. Are the x-values supposed to be connected to one another? Secondly, you keep referring to wanting lines, but your code shows geom_bar(), which I assume isn't what you want? If you want lines, use geom_line(). If the two assumptions above are correct, then here's an approach to do that:
library(plyr)
# First, let's summarise df2 by group
df3 <- ddply(df2, .(grp), summarise, total = sum(val))
df3
#>   grp total
#> 1   A     5
#> 2   B     8
#> 3   C     3
# Second, let's plot df3 as a line while treating the grp variable as numeric
# (grp is converted to a factor first so that as.numeric() returns 1, 2, 3)
p <- ggplot(df, aes(x=grp, y=val))
p <- p + geom_bar(alpha=0.75, stat = "identity")
p + geom_line(data=df3, aes(x=as.numeric(factor(grp)), y=total), colour = "red")
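
An alternative that avoids converting grp to numbers is sketched below, assuming the df and df2 from the question. It summarises both series with base R's aggregate() and connects the points across the discrete axis by setting group = 1; the names bars and line are illustrative.

```r
library(ggplot2)

df  <- data.frame(grp = c("A", "A", "B", "B", "C", "C"), val = c(1, 1, 2, 2, 3, 3))
df2 <- data.frame(grp = c("A", "A", "B", "B", "C", "C"), val = c(1, 4, 3, 5, 1, 2))

# group totals for the bars and for the line
bars <- aggregate(val ~ grp, data = df, FUN = sum)
line <- aggregate(val ~ grp, data = df2, FUN = sum)

p <- ggplot(bars, aes(grp, val)) +
  geom_col(alpha = 0.75) +
  # group = 1 puts all points in one group so they are joined across categories
  geom_line(data = line, aes(group = 1), colour = "red")
p
```

Without group = 1, geom_line() treats each category on a discrete axis as its own group and has nothing to connect.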
