Simple histogram of two variables with ggplot - r

I'm using ggplot2 to draw a histogram for two weight variables in my dataframe. The dataframe has two columns, a column with the case name (caso) and a value column (peso). I have 3000 cases for each, and when I put the histograms side by side with the facet_wrap option they show correctly:
df |>
pivot_longer(cols = c(peso,peso2), names_to = "caso", values_to = "peso") |>
ggplot(aes(x = peso, colour= caso, fill = caso))+
geom_histogram(alpha = 0.4) +
facet_wrap(~caso)
But when I try to overlay the two histograms in the same panel, the first one seems to have twice the number of cases, and the histograms are unequal in size:
df |>
pivot_longer(cols = c(peso,peso2), names_to = "caso", values_to = "peso") |>
ggplot(aes(x = peso, colour= caso, fill = caso))+
geom_histogram(alpha = 0.4)
I don't know what I'm doing wrong. Any advice? Thanks in advance!
Juan

The default of geom_histogram is to stack multiple series. The "identity" position scheme should fix this:
library(tidyverse)
df <- data.frame(peso = rnorm(1000, 250, 10),
peso2 = rnorm(1000, 260, 10))
df %>%
pivot_longer(everything()) %>%
ggplot(aes(x = value, fill = name)) +
geom_histogram(position = "identity", alpha = 0.5)
# geom_histogram(position = position_identity(), alpha = 0.5) # alternate syntax
The "Usage" section of the help page ?geom_histogram shows position = "stack" as the default. FWIW, geom_freqpoly defaults to "identity".
geom_histogram(
mapping = NULL,
data = NULL,
stat = "bin",
position = "stack", #### HERE
...,
binwidth = NULL,
bins = NULL,
na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE
)
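To check the stacking behaviour numerically rather than by eye, one option is to compare the bar coordinates that ggplot2 computes for both position settings. This is only a sketch on made-up data (the long data frame and bins = 30 are my assumptions, not the asker's data):

```r
library(ggplot2)

set.seed(42)
# Made-up data in long format: two series of 1000 values each
long <- data.frame(
  name  = rep(c("peso", "peso2"), each = 1000),
  value = c(rnorm(1000, 250, 10), rnorm(1000, 260, 10))
)

# Default position ("stack") versus position = "identity"
p_stack <- ggplot(long, aes(x = value, fill = name)) +
  geom_histogram(bins = 30)
p_identity <- ggplot(long, aes(x = value, fill = name)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.5)

# layer_data() returns the computed bars (count, ymin, ymax, ...)
d_stack    <- layer_data(p_stack)
d_identity <- layer_data(p_identity)

# With "identity" every bar starts at zero and is as tall as its own count
all(d_identity$ymin == 0)
all(d_identity$ymax == d_identity$count)

# With "stack" some bars start on top of the other series
any(d_stack$ymin > 0)
```

If the last check is TRUE, the bars of one series sit on top of the other, which is why one histogram looks roughly twice as tall in the overlaid plot.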

Related

selectize widget of ggplotly highlight not always visible (depends on order of geoms?)

I want to do an interactive scatterplot where I can
highlight individual points
a tooltip shows me the id
search for specific id with a selectize widget
I tried for some time with plotly and ended up with this code
library(tidyverse)
library(plotly)
set.seed(1)
dat <- tibble(id = LETTERS[1:10],
trt = factor(rep(0:1, 5)),
x = rnorm(10),
y = x + rnorm(10, sd = 0.2)) %>%
highlight_key(~id)
dat %>%
{ggplot(., aes(x = x, y = y, group = id, color = trt)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed")} %>%
ggplotly(tooltip = c("id")) %>%
highlight(on = "plotly_hover", selectize = TRUE)
It took me a very long time to understand that the order of geoms seems to be important
## no color, geom order reversed
## selectize.js widget is completely missing
dat %>%
{ggplot(., aes(x = x, y = y, group = id)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_point()} %>%
ggplotly(tooltip = c("id")) %>%
highlight(on = "plotly_hover", selectize = TRUE)
## color by trt, geom order reversed
## selectize.js widget only works for data where trt = 0
dat %>%
{ggplot(., aes(x = x, y = y, group = id, color = trt)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_point()} %>%
ggplotly(tooltip = c("id")) %>%
highlight(on = "plotly_hover", selectize = TRUE)
Can somebody explain this strange behavior? What if I would like to reverse the order of geoms, i.e. have the hline plotted behind the points?

How to highlight points from a hist chart in R

I'm having some trouble with my code. I'm a complete beginner in R, so I would like some help. I have a dataframe and I need to make a hist chart and then highlight some points, but I cannot work out how to find those points in my dataset. Here is an example of what I have.
x <- c("a","b","c","d","f","g","h","i","j","k")
y <- c(197421,77506,130474,18365,30470,22518,70183,15378,29747,11148)
z <- data.frame(x,y)
hist(z$y)
For example, how can I find where the "a" and "h" values are placed in the hist? And in a barplot? I tried the function points, but I cannot find the coordinates. Please let me know how I could do that. Thanks in advance.
Here is a way with dplyr and ggplot2. The approach is to cut the y variable into bins and then use summarise to create the counts and the labels.
library(dplyr)
library(ggplot2)
z %>%
mutate(bins = cut(y, seq(0, 200000, 50000))) %>%
group_by(bins) %>%
summarise(xes = paste0(x, collapse = ", "),
count = n()) %>%
ggplot() +
geom_bar(aes(x = bins, y = count), stat = "identity", color = "black", fill = "grey") +
geom_text(aes(x = bins, y = count + 0.5, label = xes)) +
xlab("y")
Here is a more complicated way that makes a plot that looks more like what hist() produces.
library(tidyr)    # for separate()
library(stringr)  # for str_remove()
z2 <- z %>%
mutate(bins = cut(y, seq(0, 200000, 50000))) %>%
group_by(bins) %>%
summarise(xes = paste0(x, collapse = ", "),
count = n()) %>%
separate(bins, into = c("start", "end"), sep = ",") %>%
mutate(across(start:end, ~as.numeric(str_remove(., "\\(|\\]"))))
ggplot() +
geom_histogram(data = z, aes(x = y), breaks = seq(0, 200000, 50000),
color = "black", fill = "grey") +
geom_text(data = z2, aes(x = (start + end) / 2, y = count + 0.5, label = xes))
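For the base hist() and barplot() calls in the question, a rough sketch could reuse the object that hist() returns, which holds the breaks, midpoints and counts. The choice of markers and colours here is mine, purely for illustration:

```r
z <- data.frame(
  x = c("a", "b", "c", "d", "f", "g", "h", "i", "j", "k"),
  y = c(197421, 77506, 130474, 18365, 30470, 22518, 70183, 15378, 29747, 11148)
)

# Compute the histogram without drawing it to get breaks, mids and counts
h <- hist(z$y, plot = FALSE)

# Rows to highlight
targets <- z[z$x %in% c("a", "h"), ]

# hist() uses right-closed bins (lo, hi] and includes the lowest value;
# these two flags make findInterval() follow the same convention
targets$bin   <- findInterval(targets$y, h$breaks,
                              left.open = TRUE, rightmost.closed = TRUE)
targets$mid   <- h$mids[targets$bin]
targets$count <- h$counts[targets$bin]

# Draw the histogram, mark the exact values on the axis, label the bars
plot(h, main = "Histogram of z$y", xlab = "y")
points(targets$y, rep(0, nrow(targets)), pch = 19, col = "red")
text(targets$mid, targets$count, labels = targets$x, pos = 1, col = "red")

# In a barplot each value has its own bar, so colouring by name is enough
barplot(z$y, names.arg = z$x,
        col = ifelse(z$x %in% c("a", "h"), "red", "grey"))
```

The targets data frame then tells you which bin each highlighted value falls into (bin), where that bar is centred (mid) and how tall it is (count).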

Drawing line segment connecting two points on ggplot

I have a point plot with two different points on each category and I want to create a line segment joining the two points on each row.
items %>%
group_by(category) %>%
summarise(med_buy_price = mean(buy_value, na.rm = TRUE),
med_sell_price = mean(sell_value, na.rm = TRUE)) %>%
pivot_longer(cols = c("med_buy_price", "med_sell_price"),
names_to = "measure",
values_to = "value") %>%
ggplot(aes(x = value, y = category)) +
geom_point(aes(color = measure), size = 3)
For creating a line segment, you need to have start and endpoints for the segment. Thus, you can stay with the wide format, so no pivot_longer needed.
Then create individual geom_point for sell and buy value and a geom_segment combining both points.
This code will work:
library(ggplot2)
library(dplyr)
library(tibble)
library(tidyr)
items <- tribble(
~category, ~buy_value, ~sell_value,
"Wallpaper", 2000, 5200,
"Usables", 500, 12500,
"Umbrellas", 200, 1800
)
items %>%
group_by(category) %>%
summarise(med_buy_price = mean(buy_value, na.rm = TRUE),
med_sell_price = mean(sell_value, na.rm = TRUE)) %>%
ggplot() +
geom_point(aes(x = med_buy_price, y = category), size = 3, color = "red")+
geom_point(aes(x = med_sell_price, y = category), size = 3, color = "green")+
geom_segment(aes(x = med_buy_price, xend = med_sell_price, y = category, yend = category))
If you do not insist on using geom_point, you could try geom_errorbar, which simplifies things a little bit:
items %>%
group_by(category) %>%
summarise(med_buy_price = mean(buy_value, na.rm = TRUE),
med_sell_price = mean(sell_value, na.rm = TRUE)) %>%
ggplot(aes(xmin=med_buy_price,xmax=med_sell_price, y = category)) +
geom_errorbar(width=0.1)
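If you also want to keep the colour legend from the original plot, one possible variation is to use both shapes of the data: the wide summary for the segments and the long version for the points. This is a sketch, assuming the same three example rows as above; the object names wide and long are made up for illustration:

```r
library(ggplot2)
library(dplyr)
library(tidyr)

items <- data.frame(
  category   = c("Wallpaper", "Usables", "Umbrellas"),
  buy_value  = c(2000, 500, 200),
  sell_value = c(5200, 12500, 1800)
)

# Wide format: one row per category, start and end point side by side
wide <- items %>%
  group_by(category) %>%
  summarise(med_buy_price  = mean(buy_value, na.rm = TRUE),
            med_sell_price = mean(sell_value, na.rm = TRUE))

# Long format: one row per point, so colour can be mapped to the measure
long <- wide %>%
  pivot_longer(cols = c(med_buy_price, med_sell_price),
               names_to = "measure", values_to = "value")

p <- ggplot() +
  geom_segment(data = wide,
               aes(x = med_buy_price, xend = med_sell_price,
                   y = category, yend = category),
               colour = "grey50") +
  geom_point(data = long,
             aes(x = value, y = category, colour = measure), size = 3)
p
```

The segment layer is added first so that the points are drawn on top of the line ends.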

ggplot monthly date scale on x axis uses days as units

When plotting a bar chart with monthly data, ggplot shortens the distance between February and March, making the chart look inconsistent
require(dplyr)
require(ggplot2)
require(lubridate)
## simulating sample data
set.seed(.1073)
my_df <- data.frame(my_dates = sample(seq(as.Date('2010-01-01'), as.Date('2016-12-31'), 1), 1000, replace = TRUE))
### aggregating + visualizing counts per month
my_df %>%
mutate(my_dates = round_date(my_dates, 'month')) %>%
group_by(my_dates) %>%
summarise(n_row = n()) %>%
ggplot(aes(x = my_dates, y = n_row))+
geom_bar(stat = 'identity', color = 'black',fill = 'slateblue', alpha = .5)+
scale_x_date(date_breaks = 'months', date_labels = '%y-%b') +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
I would keep the dates as dates rather than factors. Yes, factors will keep the bars uniform in size, but you'll have to remember to join in any months that are missing so that blank months aren't skipped, and factors are easy to get out of order. I would recommend adjusting your aesthetics to reduce the effect that the black outline has on the gap between February and March.
Here are two examples:
Adjust the outline color to be white. This will reduce the contrast and make the gap less noticeable.
Set the width to 20 (days).
As an aside, you don't need to summarize the data, you can use floor_date() or round_date() in an earlier step and go straight into geom_bar().
dates <- seq(as.Date("2010-01-01"), as.Date("2016-12-31"), 1)
set.seed(.1073)
my_df <-
tibble(
my_dates = sample(dates, 1000, replace = TRUE),
floor_dates = floor_date(my_dates, "month")
)
ggplot(my_df, aes(x = floor_dates)) +
geom_bar(color = "white", fill = "slateblue", alpha = .5)
ggplot(my_df, aes(x = floor_dates)) +
geom_bar(color = "black", fill = "slateblue", alpha = .5, width = 20)
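The uneven spacing itself comes from the date axis being measured in days: bars placed at the first of each month are 28 to 31 days apart. A small base R illustration (2015 is just an arbitrary non-leap year):

```r
# First day of four consecutive months in a non-leap year
month_starts <- seq(as.Date("2015-01-01"), by = "month", length.out = 4)

# Distance in days between consecutive bars on a date axis
gaps <- as.numeric(diff(month_starts))
gaps
# 31 28 31

# White space left between neighbouring bars when width = 20 (days)
bar_width <- 20
gaps - bar_width
# 11 8 11
```

So the bar for February will always sit a few days closer to the bar for March than other neighbouring bars do; a narrower width only makes the difference less visible relative to the gaps.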
Using some parts from IceCream's answer, you can try this.
Note that geom_col is now the recommended geom for this case.
my_df %>%
mutate(my_dates = factor(round_date(my_dates, 'month'))) %>%
group_by(my_dates) %>%
summarise(n_row = n()) %>%
ungroup() %>%
mutate(my_dates_x = as.numeric(my_dates)) %>%
mutate(my_dates_label = paste(month(my_dates, label = TRUE), year(my_dates))) %>%
{ggplot(.,aes(x = my_dates_x, y = n_row))+
geom_col(color = 'black',width = 0.8, fill = 'slateblue', alpha = .5) +
scale_x_continuous(breaks = .$my_dates_x, labels = .$my_dates_label) +
theme(axis.text.x = element_text(angle = 60, hjust = 1))}
You can convert it to a factor variable to use as the axis, and fix the formatting with a label argument to scale_x_discrete.
library(dplyr)
library(ggplot2)
library(lubridate)  # for round_date()
my_df %>%
mutate(my_dates = factor(round_date(my_dates, 'month'))) %>%
group_by(my_dates) %>%
summarise(n_row = n()) %>%
ggplot(aes(x = my_dates, y = n_row))+
geom_bar(stat = 'identity', color = 'black',fill = 'slateblue', alpha = .5)+
scale_x_discrete(labels = function(x) format(as.Date(x), '%Y-%b'))+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Edit: Alternate method to account for possibly missing months which should be represented as blank spaces in the plot.
library(dplyr)
library(ggplot2)
library(lubridate)
to_plot <-
my_df %>%
mutate(my_dates = round_date(my_dates, 'month'),
my_dates_ticks = interval(min(my_dates), my_dates) %/% months(1))
to_plot %>%
group_by(my_dates_ticks) %>%
summarise(n_row = n()) %>%
ggplot(aes(x = my_dates_ticks, y = n_row))+
geom_bar(stat = 'identity', color = 'black',fill = 'slateblue', alpha = .5)+
scale_x_continuous(
breaks = unique(to_plot$my_dates_ticks),
labels = function(x) format(min(to_plot$my_dates) + months(x), '%y-%b'))+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
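If you prefer to stay with the factor approach, another way to keep empty months on the axis is to define the factor levels from the full sequence of months, so that months without data get a count of zero. A base R sketch on made-up dates (the four dates are an assumption for illustration only):

```r
library(ggplot2)

# Made-up dates with no observations in March 2015
dates <- as.Date(c("2015-01-10", "2015-01-20", "2015-02-05", "2015-04-15"))

# Floor every date to the first of its month
month_start <- as.Date(format(dates, "%Y-%m-01"))

# Every month between the first and the last one, present in the data or not
all_months <- seq(min(month_start), max(month_start), by = "month")

# Using all months as factor levels keeps the empty month as a level
month_f <- factor(format(month_start, "%Y-%m"),
                  levels = format(all_months, "%Y-%m"))

# table() counts every level, so March appears with a count of zero
counts <- as.data.frame(table(month = month_f))
counts

ggplot(counts, aes(x = month, y = Freq)) +
  geom_col(color = "black", fill = "slateblue", alpha = .5) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
```

Because the levels come from a sorted date sequence, the bars also stay in chronological order.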

Reordering legend items in ggplot from two different datasets and layers

I'm combining two layers in ggplot that were created from two different data sets and want to control the order in which the legend appears.
With example data and code:
base <-
data.frame(idea_num = c(1, 2),
value = c(-50, 90),
it_cost = c(30, 10))
group <-
data.frame(idea_num = c(1, 1, 2, 2),
group = c("a", "b", "a", "b"),
is_primary = c(TRUE, FALSE, FALSE, TRUE),
group_value = c(-40, -10, 20, 70))
base %>%
left_join(group) %>%
arrange(desc(value)) %>%
mutate(idea_num = idea_num %>% factor(levels = unique(idea_num)),
is_primary = is_primary %>% factor(levels = c("TRUE", "FALSE"))) %>%
ggplot(aes(x = idea_num, y = group_value, fill = is_primary)) +
geom_bar(stat = "identity") +
geom_bar(data = base %>%
arrange(desc(value)) %>%
mutate(idea_num = idea_num %>% factor(levels = unique(idea_num))),
aes(x = idea_num, y = it_cost, alpha = 0.1, fill = "it_cost"),
stat = "identity") +
scale_fill_manual(name = "Group", labels = c("TRUE" = "Primary", "FALSE" = "Secondary", "it_cost" = "IT Cost"),
values = c("TRUE" = "blue", "FALSE" = "red", "it_cost" = "black")) +
scale_alpha(guide = "none") +
theme(legend.position = "bottom")
I get a figure
but I'd like the legend to appear in the order of Primary, Secondary, IT Cost.
Were all of the numbers I'm trying to plot part of the same grand number, I could easily melt the dataframe and sum everything; however, the values from the group$group_value need to be displayed separate from base$it_cost.
If I plot only the values from the first layer, i.e.,
base %>%
left_join(group) %>%
arrange(desc(value)) %>%
mutate(idea_num = idea_num %>% factor(levels = unique(idea_num)),
is_primary = is_primary %>% factor(levels = c("TRUE", "FALSE"))) %>%
ggplot(aes(x = idea_num, y = group_value, fill = is_primary)) +
geom_bar(stat = "identity") +
scale_fill_manual(name = "Group", labels = c("TRUE" = "Primary", "FALSE" = "Secondary"),
values = c("TRUE" = "blue", "FALSE" = "red")) +
theme(legend.position = "bottom")
I get a figure I expect
How can I add the second layer and adjust the ordering of the legend boxes? I do not believe that this question or this question are entirely relevant to mine as the former is dealing with levels of a factor and the latter deals with ordering of multiple legends.
Can I do what I'd like to do? Is there a better way of constructing this plot?
Use the limits argument of scale_fill_manual():
... +
scale_fill_manual(name = "Group",
labels = c("TRUE" = "Primary", "FALSE" = "Secondary", "it_cost" = "IT Cost"),
limits = c("TRUE", "FALSE", "it_cost"),
values = c("TRUE" = "blue", "FALSE" = "red", "it_cost" = "black")) +
...
This gives:
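A self-contained way to try this out, using a small data frame I put together from the numbers in the question (the toy object and the dodged bars are my own simplification, not the original plot):

```r
library(ggplot2)

# Toy data: group values and IT costs from the question in one long table
toy <- data.frame(
  idea_num = factor(c(1, 1, 2, 2, 1, 2)),
  y        = c(-40, -10, 20, 70, 30, 10),
  type     = c("TRUE", "FALSE", "FALSE", "TRUE", "it_cost", "it_cost")
)

# limits fixes both which keys appear in the legend and their order
fill_scale <- scale_fill_manual(
  name   = "Group",
  labels = c("TRUE" = "Primary", "FALSE" = "Secondary", "it_cost" = "IT Cost"),
  limits = c("TRUE", "FALSE", "it_cost"),
  values = c("TRUE" = "blue", "FALSE" = "red", "it_cost" = "black")
)

p <- ggplot(toy, aes(x = idea_num, y = y, fill = type)) +
  geom_col(position = "dodge") +
  fill_scale +
  theme(legend.position = "bottom")
p

# The scale reports its limits in the order they were given
fill_scale$get_limits()
```

The legend keys should then follow the order given in limits rather than the default ordering of the fill values.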
That said, I think you may want to consider a few different approaches:
A: Why do you create your data in such a complex way, ending up with multiple observations of IT Costs for the same idea number? I don't know your data, and you may well have your reasons, but a simple dataset along these lines:
  idea_num value      type
1        1   -40   Primary
2        1   -10 Secondary
3        2    20 Secondary
4        2    70   Primary
5        1   -50   IT Cost
6        2    90   IT Cost
would simplify the things quite a bit.
B: Why do you want to stack/overplot these two separate barplots? I would do position="dodge" instead to have separate bars.
library(tidyr)  # for spread() and gather()
df2 <- base %>%
left_join(group) %>%
mutate(is_primary=paste0("pri_", is_primary+0)) %>%
spread(is_primary, group_value) %>%
gather(yvar, y, it_cost, pri_0, pri_1)
# pri_1 comes from is_primary == TRUE, so it must map to "Primary"
df2$yvar <- factor(df2$yvar, levels=c("pri_1", "pri_0", "it_cost"),
labels=c("Primary", "Secondary", "IT Cost"))
df2$idea_num <- factor(df2$idea_num, levels=c(2, 1))
ggplot(df2, aes(idea_num, y, fill=yvar)) +
geom_bar(stat="identity", position="dodge") +
scale_fill_manual("Group", values=c("blue", "red", "black")) +
scale_alpha(guide = "none") +
theme(legend.position = "bottom")

Resources