plotting subset of grouped data in ggplot2 - r

I am trying to make a plot that has mean (+/- SD) number (ID = total count per row) of Explorations on the y-axis and then grouped by both pp and type on the x-axis.
That is, I want to generate something that looks like this (hand-drawn and made up graph):
Here is how the dataframe is structured (available here).
pp crossingtype km type ID
0 Complete 80.0 DCC 10
1 Complete 80.0 DCC 4
0 Exploration 80.0 DCC 49
1 Exploration 80.0 DCC 4
0 Complete 144.0 DWC 235
1 Complete 144.0 DWC 22
0 Exploration 144.0 DWC 238
1 Exploration 144.0 DWC 18
1 Exploration 84.0 PC 40
0 Complete 107.0 PC 43
1 Complete 107.0 PC 22
0 Exploration 107.0 PC 389
I want to use ggplot2 and have tried this code:
ggplot(expMean, aes(x=as.factor(pp), y=crossingtype, color=factor(type), group=factor(type))) +
geom_point(shape=16, cex=3) +
geom_smooth(method=lm) +
facet_grid(.~type)
But it gives me this figure (which is not what I am trying to make).
How can I use ggplot2 to make the first plot?

You can do the statistical transformations within ggplot(), but my preference is to process the data first, then plot the results.
library(tidyverse)
expMean %>%
filter(crossingtype == "Exploration") %>%
group_by(type, pp) %>%
summarise(Mean = mean(ID), SD = sd(ID)) %>%
ggplot(aes(factor(pp), Mean)) +
geom_pointrange(aes(ymax = Mean + SD,
ymin = Mean - SD)) +
facet_wrap(~type) +
theme_bw()
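With the twelve sample rows posted in the question, each type/pp combination has only one Exploration row, so sd() returns NA and the error bars would be empty. The linked full dataset presumably has several rows per group. A minimal base-R sketch of the same summary step, using only the Exploration rows from the sample (the names expl, grp_mean, grp_sd and grp_n are made up for illustration):

```r
# Exploration rows copied from the sample in the question
expl <- data.frame(
  pp   = c(0, 1, 0, 1, 1, 0),
  type = c("DCC", "DCC", "DWC", "DWC", "PC", "PC"),
  ID   = c(49, 4, 238, 18, 40, 389)
)

# Mean, SD and row count of ID for every type/pp combination
grp_mean <- tapply(expl$ID, list(expl$type, expl$pp), mean)
grp_sd   <- tapply(expl$ID, list(expl$type, expl$pp), sd)
grp_n    <- table(expl$type, expl$pp)

grp_mean  # one mean per cell
grp_sd    # all NA here, because every cell holds a single row
```

Checking the group sizes first is a cheap way to see whether mean +/- SD is meaningful for the data at hand.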

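Is this what you want? This filters the data to only include Exploration, uses ID as the y variable, groups by pp and facets on type.
library(tidyverse)
# read_table2() is deprecated in readr >= 2.0; read_table() is its replacement there
tbl <- read_table2(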
"pp crossingtype km type ID
0 Complete 80.0 DCC 10
1 Complete 80.0 DCC 4
0 Exploration 80.0 DCC 49
1 Exploration 80.0 DCC 4
0 Complete 144.0 DWC 235
1 Complete 144.0 DWC 22
0 Exploration 144.0 DWC 238
1 Exploration 144.0 DWC 18
1 Exploration 84.0 PC 40
0 Complete 107.0 PC 43
1 Complete 107.0 PC 22
0 Exploration 107.0 PC 389"
) %>%
mutate(pp = factor(pp))
ggplot(data = tbl %>% filter(crossingtype == "Exploration")) +
geom_boxplot(aes(x = pp, y = ID)) +
facet_wrap(~type)
I ran this code on the linked dataset to produce this:

Here's the approach I used. I used colour for pp instead of a two-level x-axis.
Note that I downloaded the data to my working directory, so the read.table() call may need to be modified.
library(dplyr)
library(ggplot2)
dat <- read.table("figshare.txt")
dat <- droplevels(filter(dat, crossingtype == "Exploration"))
dat <- dat %>%
group_by(pp, type) %>%
summarise(val = mean(ID),
SD = sd(ID))
ggplot(dat, aes(x = type, y = val, colour = as.factor(pp), group = as.factor(pp))) +
geom_point(size = 3, position = position_dodge(width = 0.2)) +
geom_errorbar(aes(ymax = val + SD, ymin = val - SD),
position = position_dodge(width = 0.2), width = 0.2) +
labs(y = "Mean # of explorations (+/- SD)", colour = "pp")

Related

Barplot overlay with geom line

here is the data example:
S P C P_int C_int
10 20 164 72 64
20 550 709 92 89
30 142 192 97 96
40 45 61 99 98
50 12 20 99 99
60 5 6 99 99
70 2 2 99 99
80 4 1 99 99
90 1 0 10 99
100 0 1 10 99
Let's say I have a data frame called df. The aim is a bar chart of the variables P and C, overlaid with a line chart of the sum of the variables P_int and C_int. Currently I have these lines of code to create the bar chart:
final <- df %>% tidyr::gather(type, value, c(`P`, `C`))
ggplot(final, aes(S))+
geom_bar(aes(y=value, fill=type), stat="identity", position="dodge")
The thing I can't figure out is how to plot the sum of the variables P_int and C_int as a line chart overlaid on the above plot with a second y-axis. I would appreciate any help.
Do you need something like this?
library(ggplot2)
library(dplyr)
ggplot(final, aes(S))+
geom_bar(aes(y=value, fill=type), stat="identity", position="dodge") +
# Take the line from df, not final: gather() repeats each S once per type,
# so summing over final would double P_int + C_int
geom_line(data = df %>% mutate(total = P_int + C_int),
aes(y = total), color = 'blue') +
scale_y_continuous(sec.axis = sec_axis(~./1)) +
theme_classic()
I have kept the scale of the secondary y-axis the same as the primary y-axis since the values are in the same range, but you might need to adjust it according to your real data.
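If the line values were on a very different scale from the bars, one common approach is to multiply the line by a factor and undo that factor inside sec_axis(). A minimal sketch using the posted data. The names bars, line_df and k are made up for illustration, and scaling by the ratio of the two maxima is an assumption, not part of the answer above:

```r
library(ggplot2)

df <- data.frame(
  S     = seq(10, 100, by = 10),
  P     = c(20, 550, 142, 45, 12, 5, 2, 4, 1, 0),
  C     = c(164, 709, 192, 61, 20, 6, 2, 1, 0, 1),
  P_int = c(72, 92, 97, 99, 99, 99, 99, 99, 10, 10),
  C_int = c(64, 89, 96, 98, 99, 99, 99, 99, 99, 99)
)

# Long format for the bars (base-R equivalent of gathering P and C)
bars <- data.frame(
  S     = rep(df$S, times = 2),
  type  = rep(c("P", "C"), each = nrow(df)),
  value = c(df$P, df$C)
)

# One row per S for the line
line_df <- data.frame(S = df$S, total = df$P_int + df$C_int)

# Factor that stretches the line to the height of the tallest bar
k <- max(bars$value) / max(line_df$total)

p <- ggplot(bars, aes(S)) +
  geom_col(aes(y = value, fill = type), position = "dodge") +
  geom_line(data = line_df, aes(y = total * k), color = "blue") +
  # The secondary axis divides by k so it shows the line in its own units
  scale_y_continuous(sec.axis = sec_axis(~ . / k, name = "P_int + C_int")) +
  theme_classic()
```

The line is drawn in bar units (total * k), while the right-hand axis is labelled in the original units of P_int + C_int.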

Add more settings to a plot

I have
> head(My Data)
Time V1 V2 V3 V4 V5 V6 V7 V8 V9
1 10:00:00 0 11489 11017 6846 6832 3.95 5.75 0 464
2 10:00:10 0 11495 11012 6796 6807 3.95 5.75 0 467
3 10:00:20 0 11409 10983 6835 6797 3.95 5.75 0 458
4 10:00:30 0 11441 11064 6761 6661 0 0 0 463
5 10:00:40 0 11413 10999 6870 6853 0 0 0 461
6 10:00:50 0 11434 11053 6798 6814 0 0 0 459
I would like to plot, in the same plot:
Nine separate curves, each in a different color, one per time series: V1, ..., V9.
The name of each time series (its column name) on the plot.
Time on the x-axis. Because the timestamps are only 10 seconds apart, labelling every value would crowd the axis, so I would like the labels at 30-minute intervals instead.
A plot title, "My data".
What I usually do is use ts.plot() or autoplot(ts()) to plot multiple time series in the same graph, but I don't know how to customise the plot that way, so I think ggplot would be better for this.
Could you please help me learn how to do this easily in ggplot, or with ts.plot() if possible?
You can reshape the data to long format and convert Time to POSIXct before plotting. Since the values in each column are not in the same range, I have used facets to plot them separately.
library(tidyverse)
my_data %>%
pivot_longer(cols = -Time) %>%
mutate(Time = as.POSIXct(Time, format = '%H:%M:%S')) %>% # include %S so the 10-second steps are kept
ggplot() + aes(Time, value, color = name) +
geom_line() +
facet_wrap(~name, scales = 'free_y') +
scale_x_datetime(date_labels = '%H:%M', breaks = '30 mins') +
theme(legend.position="none") +
ggtitle('My data')
To keep everything in the same plot, you can do:
my_data %>%
pivot_longer(cols = -Time) %>%
mutate(Time = as.POSIXct(Time, format = '%H:%M:%S')) %>% # include %S so the 10-second steps are kept
ggplot() + aes(Time, value, color = name) +
geom_line() +
scale_x_datetime(date_labels = '%H:%M', breaks = '30 mins') +
ggtitle('My data')
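The Time column in the question has a seconds component, so the format string passed to as.POSIXct() decides whether the 10-second steps survive parsing. A quick base-R check on three timestamps copied from the question (the variable names and tz = "UTC" are choices made here for illustration):

```r
times <- c("10:00:00", "10:00:10", "10:00:20")

# Without %S the trailing seconds are ignored, so all three parse to 10:00:00
no_sec <- as.POSIXct(times, format = "%H:%M", tz = "UTC")

# With %S each timestamp stays distinct
with_sec <- as.POSIXct(times, format = "%H:%M:%S", tz = "UTC")

length(unique(no_sec))    # 1
length(unique(with_sec))  # 3
```

Since no date is supplied, as.POSIXct() fills in the current date, which is harmless when the axis labels show only hours and minutes.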

How to use mse function - babynames example

So I am using the 'babynames' package in RStudio and am trying to get the 35 most common unisex names. I am trying to rank the names based on the mean squared error from the 50-50 line (however, I am not sure how to do this). Any help would be greatly appreciated! (Below my code I will also put the 'reference code' we were given, which lists the top 35 unisex names.)
Reference Code:
actual_names <- c("Jessie", "Marion", "Jackie", "Alva", "Ollie",
"Jody", "Cleo", "Kerry", "Frankie", "Guadalupe",
"Carey", "Tommie", "Angel", "Hollis", "Sammie",
"Jamie", "Kris", "Robbie", "Tracy", "Merrill",
"Noel", "Rene", "Johnnie", "Ariel", "Jan",
"Devon", "Cruz", "Michel", "Gale", "Robin",
"Dorian", "Casey", "Dana", "Kim", "Shannon")
I think there are a few ways to answer the question as posed, since there's a tradeoff between "most popular" and "most unisex."
Here's a way to prep the data to collect some stats for each name.
library(babynames)
library(tidyverse)
babynames_share <-
babynames %>%
filter(year >= 1930, year <= 2012) %>%
count(name, sex, wt = n) %>%
spread(sex, n, fill = 0) %>%
mutate(Total = F + M,
F_share = F / Total,
MS_50 = ((F_share-0.5)^2 +
(0.5-F_share)^2) / 2)
It looks like around 100 names have perfect gender parity -- but they're all quite uncommon:
babynames_share %>%
filter(F == M) %>%
arrange(-Total)
# A tibble: 100 x 6
name F M Total F_share MS_50
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tyjae 157 157 314 0.5 0
2 Callaway 128 128 256 0.5 0
3 Avyn 100 100 200 0.5 0
4 Zarin 92 92 184 0.5 0
5 Tkai 72 72 144 0.5 0
6 Rayen 57 57 114 0.5 0
7 Meco 43 43 86 0.5 0
8 Pele 40 40 80 0.5 0
9 Nijay 35 35 70 0.5 0
10 Mako 27 27 54 0.5 0
# … with 90 more rows
Or we might pick some arbitrary threshold for what counts as unisex. In the example above, I've calculated the mean squared error for the female and male percent shares. We can plot that to see very gendered names at the top (MS_50 tops out at 0.25 by this measure) and unisex names toward the bottom. But it isn't obvious to me how far down we should go to count a name as unisex. Is Casey, which is 58.9% male and therefore has a squared error of (8.9%)^2 = 0.79%, unisex? Or do we need to go further, to Jessie, which is 50.8% male?
babynames_share %>%
ggplot(data = .,
aes(Total, MS_50, label = name)) +
geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
geom_text(data = . %>% filter(Total > 10000),
check_overlap = TRUE, size = 3) +
scale_x_log10(breaks = c(10^(1:7)),
labels = scales::comma)
At the "Casey" level of gender parity, here are the top 35:
unisex_names <- babynames_share %>%
filter(MS_50 <= 0.00796) %>%
arrange(-Total) %>%
top_n(35, wt = Total)
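Because the male share is 1 - F_share, the two squared terms in MS_50 are equal and the measure reduces to (F_share - 0.5)^2. A small sketch that checks the Casey figure against the 0.00796 threshold used in the filter. The function name ms_50 is made up here, and the 58.9% male share is taken from the text above:

```r
# Mean squared distance of the female and male shares from 0.5
ms_50 <- function(f_share) {
  m_share <- 1 - f_share
  ((f_share - 0.5)^2 + (m_share - 0.5)^2) / 2
}

casey <- ms_50(1 - 0.589)   # Casey: 58.9% male, so 41.1% female
casey                       # about 0.0079, i.e. 0.089^2
casey <= 0.00796            # TRUE, so Casey just passes the filter

ms_50(c(0, 1))              # fully gendered names reach the maximum, 0.25
```

Any other cut-off can be derived the same way: a name that is at most x percentage points away from 50% has MS_50 <= (x / 100)^2.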
It's also interesting to see the whole spectrum of names, with most male on the bottom, female on the top, and unisex in the middle:
babynames_share %>%
ggplot(data = .,
aes(Total, F_share, label = name)) +
geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
geom_text(data = . %>% filter(Total > 10000),
check_overlap = TRUE, size = 2) +
scale_x_log10(breaks = c(10^(1:7)),
labels = scales::comma)

Add a percent to y axis labels [duplicate]

I'm sure I missed an obvious solution to this problem, but I can't figure out how to add a percent sign to the y-axis labels.
Data Sample:
Provider Month Total_Count Total_Visits Procedures RX State
Roberts 2 19 19 0 0 IL
Allen 2 85 81 4 4 IL
Dawson 2 34 34 0 0 CA
Engle 2 104 100 4 4 CA
Goldbloom 2 7 6 1 1 NM
Nathan 2 221 192 29 20 NM
Castro 2 6 6 0 0 AK
Sherwin 2 24 24 0 0 AK
Brown 2 282 270 12 12 UT
Jackson 2 114 96 18 16 UT
Corwin 2 22 22 0 0 CO
Dorris 2 124 102 22 22 CO
Ferris 2 427 318 109 108 OH
Jeffries 2 319 237 82 67 OH
The following code gives graphs with inaccurate values because R seems to be multiplying by 100.
procs <- read.csv(paste0(dirdata, "Procedure percents Feb.csv"))
procs$Percentage <- round(procs$Procedures/procs$Total.Visits*100, 2)
procs$Percentage[is.na(procs$Percentage)] <- 0
procsplit <- split(procs, procs$State)
plots <- function(procs) {
ggplot(data = procs, aes(x= Provider, y= Percentage, fill= Percentage)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(x = Provider, y = Percentage, label = sprintf("%.1f%%", Percentage)), position = position_dodge(width = 0.9), hjust = .5, vjust = 0 , angle = 0) +
theme(axis.text.x = element_text(angle = 45, vjust = .5)) +
ggtitle("Procedure Percentages- February 2018", procs$State) +
theme(plot.title = element_text(size = 22, hjust = .5, family = "serif")) +
theme(plot.subtitle = element_text(size = 18, hjust = .5, family = "serif")) +
scale_y_continuous(name = "Percentage", labels = percent)
}
lapply(procsplit, plots)
I'm not sure if there's a way to use sprintf to add it or if there's a way to paste it onto the labels.
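Adding + scale_y_continuous(labels = function(x) paste0(x, "%")) to the ggplot statement fixes this issue. The Percentage column is already multiplied by 100, so the labels only need a "%" suffix; labels = percent multiplies by 100 a second time.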
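A minimal sketch of that labelling function on two providers from the sample data. The helper name pct_label and the shortened column name Visits are made up for illustration:

```r
library(ggplot2)

procs <- data.frame(
  Provider   = c("Ferris", "Jeffries"),
  Procedures = c(109, 82),
  Visits     = c(318, 237)
)

# Percentage is stored on a 0-100 scale, as in the question
procs$Percentage <- round(procs$Procedures / procs$Visits * 100, 2)

# Labeller that only appends the sign and leaves the value untouched
pct_label <- function(x) paste0(x, "%")

pct_label(c(0, 25, 50))   # "0%" "25%" "50%"

p <- ggplot(procs, aes(Provider, Percentage)) +
  geom_col() +
  scale_y_continuous(name = "Percentage", labels = pct_label)
```

An alternative, if the scales package is preferred, would be labels = scales::label_percent(scale = 1), which also skips the multiplication by 100.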

Issue with a drawing a vertical line in ggplot for categorical variable x-axis in R

I have the following table. I want to plot a vertical line using the "st_date_wk" column for each county. Please see my code below but it DOES NOT draw the vertical line using the "st_date_wk" column. Cannot figure out what I am doing wrong here.
Any help is appreciated.
Thanks.
dfx1:
YEAR Week Area acc_sum percentage COUNTY st_date_wk
1998 10-1 250 250 12.4 133 10-4
1998 10-2 300 550 29.0 133 10-4
1998 10-3 50 600 58.0 133 10-4
1998 10-4 100 700 75.9 133 10-4
1998 10-5 100 800 100.0 133 10-4
1999 9-3 75 75 22.0 205 10-2
1999 10-1 250 250 12.4 205 10-2
1999 10-2 300 550 29.0 205 10-2
1999 10-3 50 600 58.0 205 10-2
1999 10-4 100 700 75.9 205 10-2
1999 10-5 100 800 100.0 205 10-2
.
.
dfx1$YEAR <- as.factor(dfx1$YEAR)
dfx1$COUNTY <- as.factor(dfx1$COUNTY)
dfx1$percentage <- as.numeric(dfx1$percentage)
dfx1$acc_sum <- as.numeric(dfx1$acc_sum)
dfx1$Week <- factor(dfx1$Week, ordered = T)
dfx1$st_date_wk <- factor(dfx1$st_date_wk,ordered = T)
dfx1$Week <- factor(dfx1$Week, levels=c("6-1","6-2","6-3","6-4","6-5","7-1","7-2","7-3","7-4","7-5","8-1","8-2","8-3","8-4","8-5","9-1","9-2","9-3","9-4","9-5","10-1","10-2","10-3","10-4","10-5","11-1","11-2","11-3","11-4","11-5","12-1","12-2","12-3","12-4","12-5"))
gg <- ggplot(dfx1, aes(Week,percentage, col=YEAR, group = YEAR))
gg <- gg + geom_line()
gg <- gg + facet_wrap(~COUNTY, 2, scales = "fixed")
gg <- gg + theme(text = element_text(size=15), axis.text.x = element_text(angle=90, hjust=1))
gg <- gg + geom_vline(data=dfx1, aes(xintercept = dfx1$st_date_wk), color = "blue", linetype = "dashed", size = 1.0)+ facet_wrap(~COUNTY)
plot(gg)
1: In Ops.ordered(x, from[1]) : '-' is not meaningful for ordered factors
It is a very interesting issue, and I haven't quite figured out why it does not work. However, there is a fix for it.
First, this is the data used in the answer:
dfx1 <- read.table(text =
"YEAR Week Area acc_sum percentage COUNTY st_date_wk
1998 10-1 250 250 12.4 133 10-4
1998 10-2 300 550 29.0 133 10-4
1998 10-3 50 600 58.0 133 10-4
1998 10-4 100 700 75.9 133 10-4
1998 10-5 100 800 100.0 133 10-4
1999 9-3 75 75 22.0 133 10-1",
header = TRUE)
Convert the types of YEAR, COUNTY, percentage, and acc_sum:
dfx1$YEAR <- as.factor(dfx1$YEAR)
dfx1$COUNTY <- as.factor(dfx1$COUNTY)
dfx1$percentage <- as.numeric(dfx1$percentage)
dfx1$acc_sum <- as.numeric(dfx1$acc_sum)
Create a vector with the week_levels (more reader-friendly):
week_levels <- c("6-1","6-2","6-3","6-4","6-5",
"7-1","7-2","7-3","7-4","7-5",
"8-1","8-2","8-3","8-4","8-5",
"9-1","9-2","9-3","9-4","9-5",
"10-1","10-2","10-3","10-4","10-5",
"11-1","11-2","11-3","11-4","11-5",
"12-1","12-2","12-3","12-4","12-5")
Transform Week and st_date_wk to an ordered factor with the same levels:
dfx1$Week <- factor(dfx1$Week, levels = week_levels, ordered = TRUE)
dfx1$st_date_wk <- factor(dfx1$st_date_wk, levels = week_levels, ordered = TRUE)
Create labels for scale_x_continuous (a named vector where the names correspond to the breaks of the x-axis):
labels <- week_levels
names(labels) <- seq_along(week_levels)
Create the visualisation, but instead of putting the factor on the x-axis, use its numeric codes. In geom_vline(), use which() to get the number that corresponds to a Week on the x-axis. Then use scale_x_continuous() to label the axis with the weeks.
library(ggplot2)
ggplot(dfx1, aes(x = as.numeric(Week), y = percentage, col=YEAR, group = YEAR)) +
geom_line() +
geom_vline(xintercept = which(levels(dfx1$Week) %in% dfx1$st_date_wk), color = "blue", linetype = "dashed") +
scale_x_continuous(breaks = seq_along(labels), labels = labels) +
theme(text = element_text(size=15), axis.text.x = element_text(angle=90, hjust=1)) +
facet_wrap(~COUNTY, 2, scales = "fixed")
This will give you:
EDIT AFTER COMMENT:
library(dplyr)
dfx1 <- merge(dfx1,
(dfx1 %>%
group_by(COUNTY, st_date_wk) %>%
summarise(x = which(levels(st_date_wk) %in% st_date_wk[COUNTY == COUNTY]))),
by = c("COUNTY", "st_date_wk"), all.x = TRUE
)
ggplot(dfx1, aes(x = as.numeric(Week), y = percentage, col=YEAR, group = YEAR)) +
geom_line() +
geom_vline(data = dfx1, aes(xintercept = x), color = "blue", linetype = "dashed") +
scale_x_continuous(breaks = seq_along(labels), labels = labels) +
theme(text = element_text(size=15), axis.text.x = element_text(angle=90, hjust=1)) +
facet_wrap(~COUNTY, 2, scales = "fixed")
You just have to change the aes() in geom_vline() so that the factor is passed as a number:
aes(xintercept = as.numeric(st_date_wk))
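Both fixes rest on the same idea: a factor's numeric code is its position among the levels, which is where that week sits once Week is plotted as a number. A short base-R check that builds the same 35 levels programmatically (the name st and the use of paste() with rep() are illustrative choices):

```r
# "6-1", "6-2", ..., "12-5": five weeks for each of the months 6 to 12
week_levels <- paste(rep(6:12, each = 5), rep(1:5, times = 7), sep = "-")

# Start weeks from the question, using the same levels as Week
st <- factor(c("10-4", "10-2"), levels = week_levels, ordered = TRUE)

as.numeric(st)                  # 24 22
which(week_levels == "10-4")    # 24
```

The positions only line up if st_date_wk and Week share the same levels, which is why both columns are converted with levels = week_levels in the first answer.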
