I'm trying to plot a dotplot using geom_dotplot in which each dot represents an observation of my data set. Therefore, the y-axis shouldn't represent density but actual counts. I'm aware of this thread which revolves around the same topic. However, I haven't managed to solve my issue following the same methodology.
library(ggplot2)
df <- data.frame(x = sample(1:500, size = 150, replace = TRUE))
ggplot(df, aes(x)) +
  geom_dotplot(method = 'histodot', binwidth = 1)
And I obtain the following graph. I want to obtain one similar to this one, where I can manipulate the dots' size, the space between them, etc.
Thanks in advance
You can modify the binwidth argument to cause the points to stack. For example,
library(dplyr)    # provides %>%
library(ggplot2)
df %>%
  ggplot(aes(x = x)) +
  geom_dotplot(method = "histodot", binwidth = 20)
There is a dotsize argument that can be used to modify the size of the dot.
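As a rough sketch of how these arguments combine (the particular values are illustrative assumptions, not tuned recommendations): binwidth controls how many observations end up in the same stack, dotsize scales the dot diameter relative to binwidth, and stackratio sets how closely the dots are stacked. The y-axis of geom_dotplot does not carry meaningful counts when stacking this way, so it is hidden here, as the ggplot2 documentation suggests.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = sample(1:500, size = 150, replace = TRUE))

p <- ggplot(df, aes(x = x)) +
  geom_dotplot(
    method = "histodot",
    binwidth = 20,     # wider bins -> more observations per stack
    dotsize = 0.8,     # dot diameter relative to binwidth
    stackratio = 1.2   # values above 1 add space between stacked dots
  ) +
  # the y-axis numbers are not meaningful for dotplots, so hide them
  scale_y_continuous(NULL, breaks = NULL)
p
```

Each dot still corresponds to one observation; only the visual spacing and size change.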
I would like to produce multiple contour plots using ggplot2 and geom_contour_filled(), but the range of the z values is too large. To give you an idea, the values range from -2.71 to -157.28, so I thought I should change the breaks to cover all of these values.
The code below is not the data I work with, but it should represent the problem I have:
The data
h_axis <- 10^(seq(log10(0.1), log10(1000),
length.out = 20))
a_axis <- 10^(seq(log10(0.1), log10(1000),
length.out = 20))
comb <- expand.grid(h_axis, a_axis)
h_val <- comb$Var2
a_val <- comb$Var1
values <- seq(-2, -150, length.out = 400)
dt <- data.frame(h = h_val, a = a_val, values)
First, let's say I don't change the breaks. Then, using this code
ggplot(dt, aes(x = log10(h_val), y = log10(a_val), z = values)) +
geom_contour_filled() +
# geom_contour(color = "black", size = 0.1) +
xlab(expression(log[10](h))) +
ylab(expression(log[10](a))) +
guides(fill = guide_colorbar(title = expression('E ||'*g - hat(g)*'||'[2]*'')))
will produce the following figure:
So a lot of the area will be covered by the same colour, which is a problem since my data consists of multiple factors. Factor 1 is covered by the yellow, Factor 2 is covered by the green, and so on.
My second approach is to add
bar <- 10^(seq(log10(-min(values)), log10(-max(values)),
length.out = 100))
and put bar in the geom_contour_filled() like this
geom_contour_filled(breaks = -bar)
Then I get
which is nice! But, in both cases I get the following warning
Warning message:
colourbar guide needs continuous scales.
Also, the legend is not shown on the right side. What do I need to do to fix the warning and how can I make sure that the legend is shown?
Try guide_legend instead of guide_colorbar.
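A minimal sketch of that suggestion, using the faithfuld dataset that ships with ggplot2 rather than the asker's data. geom_contour_filled() bins z into discrete bands, so the fill scale is discrete and a colourbar guide does not apply, while a legend guide does. The breaks vector and the legend title here are illustrative assumptions.

```r
library(ggplot2)

p <- ggplot(faithfuld, aes(waiting, eruptions, z = density)) +
  # custom breaks define the bands; each band becomes one legend key
  geom_contour_filled(breaks = seq(0, 0.04, by = 0.005)) +
  # discrete fill scale -> use guide_legend(), not guide_colorbar()
  guides(fill = guide_legend(title = "density band"))
p
```

With many breaks (such as the 100 used in the question) the legend gets long; fewer breaks or guide_legend(ncol = 2) may help.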
For my thesis, I am making scatterplots in APA format in R.
So far, my code is as follows, and it works great for plotting just one variable with confidence interval and regression line:
scatterplot=ggplot(dat, aes(x=STAIT, y=valence))+
geom_point()+
geom_smooth(method=lm,se=T, fullrange=T,colour='black')+
labs(x='STAI-T score', y='Report length')+
apatheme
However, I have two variables that were initially measured on the same 0-100 scale: valence and arousal. Instead of two separate plots, I thought it would be nice to add both variables in a single plot, using 'valence/arousal score' as the ylab and open/closed dots to define which data points come from which variable, a bit like in this example I found online.
In that example, however, the data comes from different groups. So that code doesn't work on my data.
I've tried different things, and the closest I get, is with the following code:
sp.both=ggplot(dat, aes(x=STAIT))+
geom_point(aes(y=valence)) +
geom_point(aes(y=arousal)) +
apatheme
This gives me a scatterplot with data points of both of the variables added in the same plot.
However, I need the data points of one score to be visually different from the other, and I want to add two separate regression lines, one for each variable. But everything I've tried so far has resulted in errors, and I cannot find any examples online of people trying to do the same thing.
Any help would be highly appreciated!
Using some random example data, you could achieve your desired result like so:
It's best to reshape your data to long format using e.g. tidyr::pivot_longer which gives us two new cols, one with the names of the variables and one with the corresponding values. After reshaping you could map the values on y and set different shapes and linetypes by mapping the variables column on shape and linetype:
library(ggplot2)
library(tidyr)
set.seed(42)
dat <- data.frame(
STAIT = runif(20, 0, 1),
valence = runif(20, 0, 1),
arousal = runif(20, 0, 1)
)
dat_long <- dat %>%
pivot_longer(c(valence, arousal), names_to = "var", values_to = "value")
ggplot(dat_long, aes(x = STAIT, y = value, linetype = var, shape = var)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "black", size = .5)
#> `geom_smooth()` using formula 'y ~ x'
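To get the open/closed dots and solid/dashed lines described in the question, one option is to pin the shapes and linetypes with manual scales. This is only a sketch on the same random example data; shape codes 16 (filled) and 1 (open) are the standard R point symbols, and the legend title "Variable" is an assumption. Because both scales share the same name and levels, ggplot2 merges them into a single legend.

```r
library(ggplot2)
library(tidyr)

set.seed(42)
dat <- data.frame(
  STAIT = runif(20, 0, 1),
  valence = runif(20, 0, 1),
  arousal = runif(20, 0, 1)
)

# one row per (observation, variable) pair
dat_long <- pivot_longer(dat, c(valence, arousal),
                         names_to = "var", values_to = "value")

p <- ggplot(dat_long, aes(x = STAIT, y = value, shape = var, linetype = var)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "black") +
  # filled dots / solid line for arousal, open dots / dashed line for valence
  scale_shape_manual(name = "Variable", values = c(arousal = 16, valence = 1)) +
  scale_linetype_manual(name = "Variable",
                        values = c(arousal = "solid", valence = "dashed")) +
  labs(x = "STAI-T score", y = "valence/arousal score")
p
```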
I figured out a way to do it, with the following code:
sp.both = ggplot(dat,aes(x=STAIT)) +
geom_point(shape = 16, aes(y=arousal)) +
geom_point(shape = 1, aes(y=valence)) +
labs(x='STAI-T score', y= 'valence/arousal score')+
geom_smooth(method=lm,se=T,fullrange=T,colour='black',aes(y=arousal))+
geom_smooth(method=lm,se=T,fullrange=T,linetype ='dashed',colour='black',aes(y=valence))+
apatheme
The only thing I haven't figured out yet is how to add a legend that shows both the linetype (solid/dashed) and the corresponding data point (solid/open), along with the variable each belongs to.
But Stefan's example solved this problem, and I prefer the way the plot then looks visually as well. So that's definitely a better solution to this problem. Thanks!
I would like to take a ggplot scatterplot and overlay on top of it the mean of the y-variable within evenly-spaced bins on the x-axis.
So far what I have is this:
library(tidyverse)
data(midwest)
ggplot(arrange(midwest,percollege),aes(x=percollege,y=percbelowpoverty))+
geom_point()+
stat_summary_bin(aes(x=percollege,y=percbelowpoverty),
bins=10,fun.y='mean',geom='point',col='red')
Which produces
which is basically perfect except instead of red points I would like horizontal red lines that extend from the beginning of the bin to the end of the bin.
I can sort of mimic what I want with
library(tidyverse)
data(midwest)
ggplot(arrange(midwest,percollege),aes(x=percollege,y=percbelowpoverty))+
geom_point()+
stat_summary_bin(aes(x=percollege,y=percbelowpoverty),
bins=10,fun.y='mean',geom='point',col='red',shape="-",size=50)
which gives
Which is kinda what I want, except:
1) I have to manually set the size every time I make a new graph like this
2) Uh, ew.
Another approach I've tried is with geom='bar',fill=NA, which seems promising if I can somehow get it to only show the top bar without the sides or bottom of the bar.
Any tips for this? I've had little luck with setting the geom to pointrange or linerange or line (the first two I've yet to get to work, and the last just connects each point with non-horizontal lines). Kind of surprised this isn't default behavior for stat_summary_bin to be honest!
Thanks!
This should work. I think the rownames_to_column line may not be necessary, and the modify_if call is necessary because the bin bounds extracted from the cut labels are strings rather than numeric values.
midwest_sum <- midwest %>%
mutate(coll_bins = cut(percollege, breaks = 10)) %>%
group_by(coll_bins) %>%
summarise(bin_mean = mean(percbelowpoverty)) %>%
rownames_to_column(var = "bin_num") %>%
tidyr::extract(coll_bins, c("min", "max"), "\\((.*),(.*)]") %>%
modify_if(is.character, as.numeric)
ggplot()+
geom_point(data = midwest, aes(x=percollege,y=percbelowpoverty)) +
geom_errorbarh(data = midwest_sum, aes(xmin = min, xmax = max, y = bin_mean),
col = "red", size = 1)
Hope this helps!
I wouldn't often call this desired default behaviour; leaving out the sides of the bins necessarily makes it confusing where the bin boundaries actually are for points far above or below the bin means.
Anyway, here's a first attempt. We can calculate the bin boundaries based on some input parameter and then use geom_segment to draw them on the graph. geom_segment needs start and end coordinates, so bin_boundaries calculates the means of the y variable and the bounds of the bins for the x variable, and returns a call to geom_segment. This means we can simply add the output of our function to our ggplot call and it works as expected. Note the use of passing through ... so we can still use the geom parameters.
You can probably modify to use other bin width and dodge parameters instead of calculating from the bounds of your x variable, haven't thought too carefully about that. Note that the lines look different from your use of stat_summary_bin because they are centered differently and so use different points in each calculation. You might also consider a version that uses geom_step which would connect the ends of each horizontal line.
library(tidyverse)
bin_boundaries <- function(tbl, n_bins, x_var, y_var, ...) {
x_var <- enquo(x_var)
y_var <- enquo(y_var)
bin_bounds <- seq(
from = min(pull(tbl, !!x_var)),
to = max(pull(tbl, !!x_var)),
length.out = n_bins + 1)
bounds_tbl <- tbl %>%
mutate(bin_group = ntile(!!x_var, n_bins)) %>%
group_by(bin_group) %>%
summarise(!!y_var := mean(!!y_var)) %>%
mutate(bin_start = bin_bounds[1:n_bins], bin_end = bin_bounds[2:(n_bins + 1)])
geom_segment(
data = bounds_tbl,
mapping = aes(
x = bin_start, y = !!y_var,
xend = bin_end, yend = !!y_var
),
...
)
}
ggplot(midwest) +
geom_point(aes(x = percollege, y = percbelowpoverty)) +
bin_boundaries(midwest, 10, percollege, percbelowpoverty, colour = "red", size = 1)
Created on 2019-02-07 by the reprex package (v0.2.1)
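One caveat worth checking in the function above: ntile() splits the rows into groups of roughly equal size, whereas bin_bounds are equally spaced along x, so the mean drawn over a given interval may not come from the points that actually fall inside it. A possible variant, sketched under the assumption that equal-width bins are what is wanted, assigns rows to bins with cut() on the same bounds used for drawing. bin_means is a hypothetical helper name, and it takes column names as strings for simplicity.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: equal-width bins whose means and drawn extents agree.
bin_means <- function(tbl, x_col, y_col, n_bins) {
  x_vals <- tbl[[x_col]]
  bounds <- seq(min(x_vals), max(x_vals), length.out = n_bins + 1)
  tbl %>%
    # integer bin index; include.lowest keeps the minimum in bin 1
    mutate(bin = cut(.data[[x_col]], breaks = bounds,
                     include.lowest = TRUE, labels = FALSE)) %>%
    group_by(bin) %>%
    summarise(bin_mean = mean(.data[[y_col]]), .groups = "drop") %>%
    # look bounds up by index, so empty bins are skipped without misalignment
    mutate(bin_start = bounds[bin], bin_end = bounds[bin + 1])
}

sums <- bin_means(midwest, "percollege", "percbelowpoverty", 10)

p <- ggplot(midwest, aes(percollege, percbelowpoverty)) +
  geom_point() +
  geom_segment(data = sums,
               aes(x = bin_start, xend = bin_end,
                   y = bin_mean, yend = bin_mean),
               colour = "red")
p
```

Bins that contain no observations simply produce no segment.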
I have a scatter plot now. Each color represents a categorical group, and each group has a range of values on the x-axis. There should not be any overlap between the ranges of the categorical variables. However, because of the thickness of the scatter points, it looks like there is overlap. So I want to draw a line connecting the maximum point of each group to the minimum point of the adjacent group; as long as that line does not have a negative slope, it shows that there is no overlap between the categories.
I do not know how to use geom_line() to connect two points when the y-coordinate is a categorical variable. Is it possible to do so?
Any help would be appreciated!!!
It sounds like you want geom_segment not geom_line. You'll need to aggregate your data into a new data frame that has the points you want plotted. I adapted Brian's sample data and use dplyr for this:
# sample data
df <- data.frame(xvals = runif(50, 0, 1))
df$cats <- cut(df$xvals, c(0, .25, .625, 1))
# aggregation
library(dplyr)
library(ggplot2)
df_summ = df %>% group_by(cats) %>%
summarize(min = min(xvals), max = max(xvals)) %>%
mutate(adj_max = lead(max),
adj_min = lead(min),
adj_cat = lead(cats))
# plot
ggplot(df, aes(xvals, cats, colour = cats)) +
geom_point() +
geom_segment(data = df_summ, aes(
x = max,
xend = adj_min,
y = cats,
yend = adj_cat
))
You can keep the segments colored as the previous category, or maybe set them to a neutral color so they don't stand out as much.
My reading comprehension failed me, so I misunderstood the question. Ignore this answer unless you want to learn about the lineend = argument of geom_line.
# generate dummy data
df <- data.frame(xvals = runif(1000, 0, 1))
# these categories were chosen to line up
# with tick marks to show they don't overlap
df$cats <- cut(df$xvals, c(0, .25, .625, 1))
ggplot(df, aes(xvals, cats, colour = cats)) +
geom_line(size = 3)
The caveat is that there is a lineend = argument to geom_line. The default is butt, so that lines end exactly where you want them to and butt up against things, but sometimes that's not the right look. In this case, the other options would cause visual overlap, as you can see with the gridlines.
With lineend = "square":
With lineend = "round":
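The three variants can be reproduced along these lines. This sketch assumes ggplot2 3.4.0 or later, where line thickness is set with linewidth (older versions use size instead); the object names are illustrative.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(xvals = runif(1000, 0, 1))
df$cats <- cut(df$xvals, c(0, .25, .625, 1))

base <- ggplot(df, aes(xvals, cats, colour = cats))

# "butt" ends exactly at the data; "square" and "round" extend past it
p_butt   <- base + geom_line(linewidth = 3, lineend = "butt")
p_square <- base + geom_line(linewidth = 3, lineend = "square")
p_round  <- base + geom_line(linewidth = 3, lineend = "round")
p_round
```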
I have a dataset, where each data point has an x-value that is constrained (represents an actual instance of a quantitative variable), y-value that is arbitrary (exists simply to provide a dimension to spread out text), and a label. My datasets can be very large, and there is often text overlap, even when I try to spread the data across the y-axis as much as possible.
Hence, I am trying to use the new ggrepel. However, I am trying to keep the text labels constrained at their x-value position, while only allowing them to repel from each other in the y-direction.
As an example, the code below produces a plot for 32 data points, where the x-values show the number of cylinders in a car, and the y-values are determined randomly (have no meaning but to provide a second dimension for text plotting purposes). Without using ggrepel, there is significant overlap in the text:
library(ggrepel)
library(ggplot2)
set.seed(1)
data = data.frame(x=runif(100, 1, 10),y=runif(100, 1, 10),label=paste0("label",seq(1:100)))
origPlot <- ggplot(data) +
geom_point(aes(x, y), color = 'red') +
geom_text(aes(x, y, label = label)) +
theme_classic(base_size = 16)
I can remedy the text overlap using ggrepel, as shown below. However, this changes not only the y-values, but also the x-values. I am trying to avoid changing the x-values, as they represent an actual physical meaning (the number of cylinders):
repelPlot <- ggplot(data) +
geom_point(aes(x, y), color = 'red') +
geom_text_repel(aes(x, y, label = label)) +
theme_classic(base_size = 16)
As a note, the reason I cannot allow the x-value of the text to change is that I am only plotting the text (not the points). By contrast, most examples in ggrepel seem to keep the position of the points (so that their values remain true) and only repel the x and y values of the labels. The points are then connected to the labels with segments (you can see that in my second plot example).
I kept the points in the two examples above for demonstration purposes. However, I am only retaining the text (and hence will be removing the points and the segments), leaving me with something like this:
repelPlot2 <- ggplot(data) + geom_text_repel(aes(x, y, label = label), segment.size = 0) + theme_classic(base_size = 16)
My question is twofold:
1) Is it possible for me to repel the text labels only in the y-direction?
2) Is it possible for me to obtain a structure containing the new (repelled) y-values of the text?
Thank you for any advice!
ggrepel version 0.6.8 (install from GitHub using devtools::install_github) now supports a "direction" argument, which enables repelling labels only in the "x" or "y" direction.
repelPlot2 <- ggplot(data) + geom_text_repel(aes(x, y, label = label), segment.size = 0, direction = "y") + theme_classic(base_size = 16)
Getting the y values is harder -- one approach can be to use the "repel_boxes" function from ggrepel first to get repelled values and then input those into ggplot with geom_text. For discussion and sample code of that approach, see https://github.com/slowkow/ggrepel/issues/24. Note that if using the latest version, the repel_boxes function now also has a "direction" argument, which takes in "both","x", or "y".
I don't think it is possible to repel text labels only in one direction with ggrepel.
I would approach this problem differently, by instead generating the arbitrary y-axis positions manually. For example, for the data set in your example, you could do this using the code below.
I have used the dplyr package to group the data set by the values of x, and then created a new column of data y containing the row numbers within each group. The row numbers are then used as the values for the y-axis.
library(ggplot2)
library(dplyr)
data <- data.frame(x = mtcars$cyl, label = paste0("label", seq(1:32)))
data <- data %>%
group_by(x) %>%
mutate(y = row_number())
ggplot(data, aes(x = x, y = y, label = label)) +
geom_text(size = 2) +
xlim(3.5, 8.5) +
theme_classic(base_size = 8)
ggsave("filename.png", width = 4, height = 2)