Plot part of a dataset using ggplot in R

I want to create a chart, using ggplot, relating the variables "var_share" (in the y-axis) and "cbo" (in the x-axis), but by three time periods: 1996-2002, 2002-2008 and 2008-2012. Also, I want to calculate the "cbo" variable, by percentile. Here is my dataset:
ano cbo ocupado quant total share var_share
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1996 20 1 32 39675 0.0807 -0.343
2 1997 20 1 52 41481 0.125 0.554
3 1998 20 1 34 40819 0.0833 -0.336
4 1999 20 1 44 41792 0.105 0.264
5 2001 20 1 57 49741 0.115 0.0884
6 1996 21 1 253 39675 0.638 -0.0326
You can download the full dataset here.
The result is almost like this:

I believe this is what you are looking for. After reading in your data, I build a new date variable called ano2 and then a new data frame, ddf, with a column called new that holds the period bins you defined.
The first plot builds on ddf and uses stat_summary to show the mean var_share per period.
You also mentioned quantiles. I am not sure exactly what you meant, but I grouped by this new variable and used dplyr's group_modify() with a purrr-style lambda to calculate the quantiles of cbo per period.
library(tidyverse)
df <- ocupacoes
df$ano2 <- readr::parse_date(paste0('01-01-', df$ano), '%d-%m-%Y')
# assign each year to one of the three periods
ddf <- df %>%
  mutate(new = case_when(
    lubridate::year(ano2) %in% 1996:2002 ~ '96-02',
    lubridate::year(ano2) %in% 2003:2008 ~ '02-08',
    lubridate::year(ano2) %in% 2009:2012 ~ '08-12'
  ))
# mean var_share per period
ggplot(ddf, aes(x = new, y = var_share, color = new)) +
  stat_summary(fun = mean, colour = "red", size = 1) +
  scale_x_discrete(limits = c('96-02', '02-08', '08-12'))
# I think you were also looking for quantiles of cbo
ddf %>%
  group_by(new) %>%
  group_modify(~ {
    quantile(.x$cbo, probs = seq(0, 1, by = .2)) %>%
      tibble::enframe(name = "prob", value = "quantile")
  }) %>%
  ggplot(aes(x = prob, y = quantile, color = new, group = new)) +
  geom_line() +
  scale_x_discrete(limits = c('0%', '20%',
                              '40%', '60%',
                              '80%', '100%'))
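If "by percentile" instead means ranking each cbo value rather than summarising its distribution, a minimal sketch could use cut() for the periods and dplyr::percent_rank() for the ranks. The toy data and the column names period and cbo_pct are made up for illustration, and this is only one possible reading of the question, not the approach used above.

```r
library(dplyr)

# Toy data standing in for `ocupacoes` (values are made up for illustration)
toy <- tibble(
  ano = c(1996, 2001, 2003, 2008, 2009, 2012),
  cbo = c(20, 21, 20, 35, 21, 35),
  var_share = c(-0.34, 0.09, 0.55, 0.26, -0.03, 0.12)
)

toy_binned <- toy %>%
  mutate(
    # right-closed intervals: (1995, 2002], (2002, 2008], (2008, 2012]
    period = cut(ano, breaks = c(1995, 2002, 2008, 2012),
                 labels = c("96-02", "02-08", "08-12")),
    # percentile rank of cbo over the whole data set, scaled from 0 to 1
    cbo_pct = percent_rank(cbo)
  )

toy_binned
```

cbo_pct could then go on the x-axis, with facet_wrap(~ period) giving one panel per period.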

Related

Adding count to ggseqplot in R

I am using the ggseqplot package to visualize sequential data, however, I cannot add a count label to each stack. Following is my code:
ggseqdplot(biofam.seq, border = T) +
scale_fill_discrete_sequential("PuBuGn")+
scale_x_discrete() +
labs(x = "Path Number") +
theme(legend.position = "right") +
theme_minimal()
This sample dataset can be used:
data(biofam)
biofam <- biofam[sample(nrow(biofam),300),]
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
How can I add a count data label to each stack of this plot.
Including count labels in a dplot is unconventional, because in most instances the resulting plots are quite crowded. This is also true for the example data you used but, of course, this might be different in your own application.
As virtually no one uses count labels, this option is not implemented in ggseqdplot, but you can add another plot layer containing the labels using geom_text. For this purpose, we have to generate an additional data set containing the labels (state frequencies) and their positions on the x and y axes.
In the following code, I first extract the state frequencies using table and then reshape the data into the long (tidy) format required by ggplot. If you save the plot generated with ggseqdplot as an object (p), you can inspect the data underlying the plot (p$data). This gives you an idea of the data structure required for your count label data. Once the data are in the right shape, you can add the label plot layer. Note that I only display count labels for states with a frequency higher than 10.
# load required libraries
library(TraMineR)
library(ggseqplot)
library(tidyverse)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Example data
data(biofam)
set.seed(10)
biofam <- biofam[sample(nrow(biofam),300),]
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Render & save dplot
p <- ggseqdplot(biofam.seq, border = T) +
  scale_x_discrete() +
  theme_minimal() +
  theme(legend.position = "bottom")
#> Scale for 'x' is already present. Adding another scale for 'x', which will
#> replace the existing scale.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Table with state frequencies
state.freqs <- map_dfc(biofam.seq, table) |>
  map_dfc(as.numeric) |>
  mutate(state = row_number(), .before = 1)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reshape & prepare data for plot
label.data <- state.freqs |>
  pivot_longer(cols = -1,
               names_to = "x",
               values_to = "label") |>
  group_by(state) |>
  mutate(x = row_number() |> factor()) |>
  ungroup() |>
  filter(label != 0) |>
  mutate(state = factor(state, labels = stlab(biofam.seq))) |>
  group_by(x) |>
  mutate(y = cumsum(label/sum(label)))
# Have a look at the data
label.data
#> # A tibble: 92 × 4
#> # Groups: x [16]
#> state x label y
#> <fct> <fct> <dbl> <dbl>
#> 1 Parent 1 296 0.987
#> 2 Parent 2 285 0.95
#> 3 Parent 3 281 0.937
#> 4 Parent 4 269 0.897
#> 5 Parent 5 249 0.83
#> 6 Parent 6 222 0.74
#> 7 Parent 7 189 0.63
#> 8 Parent 8 162 0.54
#> 9 Parent 9 135 0.45
#> 10 Parent 10 112 0.373
#> # … with 82 more rows
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Add labels to original plot
# We only show labels for states with more than 10 occurrences
p +
  geom_text(data = label.data |> filter(label > 10),
            aes(x = x, y = y, label = label),
            nudge_y = -.02, size = 3)
Created on 2022-10-17 with reprex v2.0.2
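To make the position calculation easier to follow, here is a minimal sketch of the cumulative-proportion step on a tiny made-up table. The object toy_labels and its values are hypothetical; only the group_by/cumsum pattern mirrors the code above.

```r
library(dplyr)

# Hypothetical counts: three states at position 1, two states at position 2
toy_labels <- tibble(
  x     = c(1, 1, 1, 2, 2),
  state = c("A", "B", "C", "A", "B"),
  label = c(6, 3, 1, 5, 5)
) |>
  group_by(x) |>
  # upper edge of each segment, as a share of the whole stack
  mutate(y = cumsum(label / sum(label)))

toy_labels
```

Each y value is the upper edge of a segment within its stack, which is why the plot code nudges the labels slightly downwards.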

how to use geom_pointrange to plot data in descending order?

I have been asked to do this: "make a dotplot of the average age for each combination of sex, passengerClass and survival. Use geom_pointrange, order the dots by decreasing age and make sure the labels are on the y-axis." This is my attempt, and the plot it produces doesn't seem to be right.
# the dataset given to me
TitanicSurvival %>% group_by(survived,passengerClass,sex) %>%
summarize(age=mean(age, na.rm=T)) %>%
unite(survived, passengerClass, sex, col="group", remove = FALSE) -> age
glimpse(age)
# my attempt to plot
age %>%
unite(sex, passengerClass, survived, col="group", remove = FALSE) %>%
ggplot(aes(x = age, y= group)) +
geom_dotplot(binwidth = 1) +
geom_pointrange(aes(xmin = sort(age), xmax = sort(age)), orientation = "y")
This is generally easier to reproduce if you provide (a representative portion of) the actual data, e.g. the output of dput(head(TitanicSurvival, 100)). But as this is a fairly popular dataset, I already had it available.
The exact wording of your task may have led to some confusion here - geom_dotplot is essentially a variant of a histogram, and I believe the intention was to ask about plotting the datapoints (usually geom_point). Specifying the geom_pointrange function indicates that you should also show the range, but it's not actually asked for.
So, here is my take on it. First we prepare all statistics and labels we may want to plot. The trickiest step is the last one - setting the factor levels of the labels based on the order of the average age, so they will show up in that order on the y-axis:
library(dplyr)
library(ggplot2)
# TitanicSurvival ships with the carData package
data("TitanicSurvival", package = "carData")
df_age <- TitanicSurvival %>%
  group_by(survived, passengerClass, sex) %>%
  dplyr::summarize(
    age_min = min(age, na.rm = TRUE),
    age_mean = mean(age, na.rm = TRUE),
    age_max = max(age, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  dplyr::mutate(
    group_label = glue::glue("{sex}_{passengerClass}_{survived}")
  ) %>%
  # sort by mean age, then freeze that order as the factor levels
  dplyr::arrange(age_mean) %>%
  dplyr::mutate(
    group_label = forcats::as_factor(group_label)
  )
# A tibble: 12 x 7
# survived passengerClass sex age_min age_mean age_max group_label
# <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct>
# 1 yes 2nd male 0.667 17.4 62 male_2nd_yes
# 2 yes 3rd female 0.167 20.8 63 female_3rd_yes
# 3 yes 3rd male 0.417 22.4 45 male_3rd_yes
# 4 no 3rd female 1 23.4 48 female_3rd_no
Then we can plot the points using geom_pointrange (we already took care of the correct ordering when we set the factor levels!):
ggplot(df_age, aes(x = age_mean, y = group_label)) +
  geom_pointrange(aes(xmin = age_min, xmax = age_max)) +
  theme_bw() +
  labs(x = "Age / Years", y = "")
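As an alternative to setting the factor levels beforehand, the ordering can probably also be done inside aes() with base R's reorder(). This is a minimal sketch on a small made-up summary table (toy_age and its values are hypothetical), not a replacement for the code above.

```r
library(ggplot2)

# Hypothetical summary table with the same columns as df_age
toy_age <- data.frame(
  group_label = c("female_1st_yes", "male_2nd_yes", "male_3rd_no"),
  age_min  = c(2, 0.7, 0.3),
  age_mean = c(36.1, 17.4, 27.3),
  age_max  = c(76, 62, 74)
)

# reorder() sorts the levels by ascending age_mean
ordered_labels <- reorder(toy_age$group_label, toy_age$age_mean)
levels(ordered_labels)

p_age <- ggplot(toy_age, aes(x = age_mean, y = reorder(group_label, age_mean))) +
  geom_pointrange(aes(xmin = age_min, xmax = age_max)) +
  labs(x = "Age / Years", y = NULL)
```

Since the first level is drawn at the bottom of a discrete y-axis, the oldest group ends up at the top and ages decrease downwards.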

How do I create a stacked bar chart in R, where the y axis should denote the percentages for the bars?

I would like to create a stacked bar chart in R. My x-axis just contains data on sex, i.e. male or female. I just need the y-axis to show percentages of the stacked bars. The "Survived" column is just a mixture of 0s and 1s, i.e. 1 denotes that an individual survived an experience and 0 that the individual did not survive it. I am not sure what to put in for the y aesthetic. Can anyone help please?
ggplot(data = df, mapping = aes(x = Sex, y = ? , fill = Survived)) + geom_bar(stat = "identity")
One possible solution is to use the dplyr package to calculate the percentage of each category outside of ggplot2 and then use those values to draw the bar graph with geom_col:
library(dplyr)
df %>% count(Sex, Survive) %>%
group_by(Sex) %>%
mutate(Percent = n/sum(n)*100)
# A tibble: 4 x 4
# Groups: Sex [2]
Sex Survive n Percent
<fct> <dbl> <int> <dbl>
1 F 0 26 55.3
2 F 1 21 44.7
3 M 0 34 64.2
4 M 1 19 35.8
And now with the plotting part:
library(dplyr)
library(ggplot2)
df %>% count(Sex, Survive) %>%
  group_by(Sex) %>%
  mutate(Percent = n/sum(n)*100) %>%
  ggplot(aes(x = Sex, y = Percent, fill = as.factor(Survive))) +
  geom_col()
Reproducible example
df <- data.frame(Sex = sample(c("M","F"),100, replace = TRUE),
Survive = sample(c(0,1), 100, replace = TRUE))
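If you would rather let ggplot2 compute the shares, a minimal sketch is to skip the dplyr step and use position = "fill", which rescales each bar to a total height of 1. The seed and the object name p_fill are my own additions; the data frame follows the reproducible example above.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the example is repeatable
df <- data.frame(Sex = sample(c("M", "F"), 100, replace = TRUE),
                 Survive = sample(c(0, 1), 100, replace = TRUE))

p_fill <- ggplot(df, aes(x = Sex, fill = factor(Survive))) +
  geom_bar(position = "fill") +                          # each bar sums to 1
  scale_y_continuous(labels = scales::label_percent()) + # show 0-1 as 0%-100%
  labs(y = "Percent", fill = "Survive")

p_fill
```

No y aesthetic is needed here, because geom_bar counts the rows itself.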

ggplot2 plotting two variables of two groups and with different scales

I have a dataframe of following form:
School_type Year fund rate
1 1998 8 0.1
0 1998 7 0.2
1 1999 9 0.11
0 1999 8 0.22
1 2000 10 0.12
0 2000 15 0.23
I am thinking about plotting the "fund" and "rate" for each school type and the x axis is year, so there are four lines--two higher lines and two lower lines, but I don't know how to implement this with two scales of y-axes. Thanks in advance.
I am not sure if this is what you are looking for, but here is my two cents on your question.
# load the packages used below
library(dplyr)
library(reshape2) # for melt()
library(ggplot2)
# create the data frame (school_type alternates 1, 0 as in the question)
df = data.frame("school_type" = rep(c(1, 0), 3), "year" = c("1998","1998","1999","1999","2000","2000"),
"fund" = c("8","7","9","8","10","15"), "rate" = c("0.1","0.2","0.11","0.22","0.12","0.23"))
# convert the character columns to numeric
df$fund = as.numeric(as.character(df$fund))
df$rate = as.numeric(as.character(df$rate))
# plot the log of the variables
df %>%
  mutate(log_fund = log(fund),
         log_rate = log(rate)) %>%
  melt(id.vars = c("school_type","year")) %>%
  filter(variable %in% c("log_fund","log_rate")) %>%
  ggplot(aes(x = year, y = value, group = variable, color = variable, shape = variable)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  facet_wrap(~ school_type) +
  theme_bw()
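If you really want two y-axes rather than logs, a minimal sketch is to rescale rate onto the range of fund and undo that rescaling on a secondary axis with sec_axis(). The scaling factor k and the object names df2 and p_dual are my own choices, so treat this as one possible approach rather than the definitive one.

```r
library(ggplot2)

df2 <- data.frame(
  school_type = factor(rep(c(1, 0), 3)),
  year = rep(1998:2000, each = 2),
  fund = c(8, 7, 9, 8, 10, 15),
  rate = c(0.1, 0.2, 0.11, 0.22, 0.12, 0.23)
)

# factor that stretches rate onto the same range as fund
k <- max(df2$fund) / max(df2$rate)

p_dual <- ggplot(df2, aes(x = year, colour = school_type)) +
  geom_line(aes(y = fund, linetype = "fund")) +
  geom_line(aes(y = rate * k, linetype = "rate")) +
  # the secondary axis divides by k again, so it reads in units of rate
  scale_y_continuous(name = "fund", sec.axis = sec_axis(~ . / k, name = "rate")) +
  scale_x_continuous(breaks = 1998:2000)

p_dual
```

This gives the four lines from the question: colour separates the school types and linetype separates fund from rate.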

Plot a simple conversion funnel in ggplot

I have a simple dataframe that looks like this:
df
steps numbers rate
1 clicks 332835 100.000000
2 signup 157697 47.379933
3 cart 29866 8.973215
4 buys 17012 5.111241
How can I plot a simple conversion funnel instead of a barchart?
If you must do the funnel thing, it's just a variation on bar chart:
library(ggplot2)
library(reshape2) # for melt()
# get data
dat <- read.table(text=
"steps numbers rate
clicks 332835 100.000000
signup 157697 47.379933
cart 29866 8.973215
buys 17012 5.111241",
header = T)
# add spacing, melt, sort
total <- subset(dat, rate==100)$numbers
dat$padding <- (total - dat$numbers) / 2
molten <- melt(dat[, -3], id.var='steps')
molten <- molten[order(molten$variable, decreasing = T), ]
molten$steps <- factor(molten$steps, levels = rev(dat$steps))
ggplot(molten, aes(x=steps)) +
geom_bar(aes(y = value, fill = variable),
stat='identity', position='stack') +
geom_text(data=dat,
aes(y=total/2, label= paste(round(rate), '%')),
color='white') +
scale_fill_manual(values = c('grey40', NA) ) +
coord_flip() +
theme(legend.position = 'none') +
labs(x='stage', y='volume')
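The padding trick can probably be avoided by drawing each stage as a tile centred on zero, so the bars are symmetric by construction. This is a minimal sketch using geom_tile with width mapped to numbers; the object name p_funnel and the theme tweaks are my own additions.

```r
library(ggplot2)

dat <- data.frame(
  steps   = c("clicks", "signup", "cart", "buys"),
  numbers = c(332835, 157697, 29866, 17012),
  rate    = c(100, 47.379933, 8.973215, 5.111241)
)
# first stage at the top of the y-axis
dat$steps <- factor(dat$steps, levels = rev(dat$steps))

p_funnel <- ggplot(dat, aes(x = 0, y = steps)) +
  # each tile spans -numbers/2 .. numbers/2, so no padding column is needed
  geom_tile(aes(width = numbers), height = 0.9, fill = "grey40") +
  geom_text(aes(label = paste0(round(rate), "%")), colour = "white") +
  labs(x = "volume", y = "stage") +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

p_funnel
```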
That said, there's no real point in a "funnel chart" - the same information can be presented in a plain bar chart with less fuss:
# get data
dat <- read.table(text=
"steps numbers rate
clicks 332835 100.000000
signup 157697 47.379933
cart 29866 8.973215
buys 17012 5.111241",
header = T)
# order x axis
dat$steps <- factor(dat$steps, levels = dat$steps)
# plot
ggplot(dat, aes(x=steps, y=numbers)) +
  geom_bar(stat='identity') +
  geom_text(aes(label = paste(round(rate), '%')), vjust=-0.5)
Alternatively, you can do a simple funnel in highcharts. My dataframe looks like this:
# data is a df called check_stage
check_stage
# A tibble: 9 × 4
stage_name count x percent
<ord> <int> <chr> <dbl>
1 Opportunity Disqualified 805 1 13.5
2 Qualifying 5138 2 86.5
3 Evaluation 1773 3 29.8
4 Meeting Scheduled 4104 4 69.1
5 Quoted 4976 5 83.7
6 Order Submitted 1673 6 28.2
7 Closed Won 1413 7 23.8
8 Closed Lost 957 8 16.1
9 Nurture 1222 9 20.6
library(highcharter)
# make a funnel plot of stage
options(highcharter.theme = hc_theme_smpl(tooltip = list(valueDecimals = 1,
pointFormat = "{point.y}%")))
hc <- check_stage %>%
hchart(
"funnel", hcaes(x = stage_name, y = percent), name = "Proportion of Leads")
hc
