Add more settings to a plot - r

I have
> head(my_data)
Time V1 V2 V3 V4 V5 V6 V7 V8 V9
1 10:00:00 0 11489 11017 6846 6832 3.95 5.75 0 464
2 10:00:10 0 11495 11012 6796 6807 3.95 5.75 0 467
3 10:00:20 0 11409 10983 6835 6797 3.95 5.75 0 458
4 10:00:30 0 11441 11064 6761 6661 0 0 0 463
5 10:00:40 0 11413 10999 6870 6853 0 0 0 461
6 10:00:50 0 11434 11053 6798 6814 0 0 0 459
I would like to plot, all in the same plot:
Nine separate curves, each in a different color, one per time series: V1, ..., V9.
A label for each time series (its column name).
Time on the x-axis. Because the timestamps are only 10 seconds apart, labeling every value would make the axis too dense, so I would like the axis labeled at 30-minute intervals instead.
A title for the plot: My data.
I usually use ts.plot() or autoplot(ts()) to plot multiple time series like these in the same graph, but I don't know how to customize the plot that way, so I think ggplot would be better for this.
Could you please help me learn how to do this easily in ggplot, or with ts.plot() if possible?

You can reshape the data to long format and convert Time to POSIXct before plotting. Since the values in each column are not in the same range, I have used facets to plot them separately.
library(tidyverse)
my_data %>%
  # one row per (Time, series) pair; the series names land in `name`
  pivot_longer(cols = -Time) %>%
  # Time is stored as "HH:MM:SS", so parse the seconds as well
  mutate(Time = as.POSIXct(Time, format = '%H:%M:%S')) %>%
  ggplot() + aes(Time, value, color = name) +
  geom_line() +
  facet_wrap(~name, scales = 'free_y') +
  scale_x_datetime(date_labels = '%H:%M', date_breaks = '30 min') +
  theme(legend.position = "none") +
  ggtitle('My data')
To keep everything in the same plot, you can do:
my_data %>%
  pivot_longer(cols = -Time) %>%
  # Time is stored as "HH:MM:SS", so parse the seconds as well
  mutate(Time = as.POSIXct(Time, format = '%H:%M:%S')) %>%
  ggplot() + aes(Time, value, color = name) +
  geom_line() +
  scale_x_datetime(date_labels = '%H:%M', date_breaks = '30 min') +
  ggtitle('My data')
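Because the Time column carries seconds (10:00:10, 10:00:20, ...), the format string passed to as.POSIXct() matters. The following is a minimal base-R sketch, assuming the times are plain character strings; the UTC time zone and the variable names are only there to keep the example reproducible and are not part of the answer.

```r
# Times as they appear in the question (character strings with seconds)
times <- c("10:00:00", "10:00:10", "10:00:20")

# A format without %S stops reading after the minutes, so the seconds are lost
no_sec <- as.POSIXct(times, format = "%H:%M", tz = "UTC")
# Including %S keeps each 10-second observation distinct
with_sec <- as.POSIXct(times, format = "%H:%M:%S", tz = "UTC")

length(unique(no_sec))    # number of distinct time points without %S
length(unique(with_sec))  # number of distinct time points with %S

# Positions 30 minutes apart, similar in spirit to the axis breaks
axis_breaks <- seq(with_sec[1], by = "30 min", length.out = 4)
format(axis_breaks, "%H:%M")
```

When the seconds are dropped, every observation within the same minute shares one x position, which shows up as vertical segments in a line plot.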

Related

Barplot overlay with geom line

here is the data example:
S P C P_int C_int
10 20 164 72 64
20 550 709 92 89
30 142 192 97 96
40 45 61 99 98
50 12 20 99 99
60 5 6 99 99
70 2 2 99 99
80 4 1 99 99
90 1 0 10 99
100 0 1 10 99
Let's say I have a dataframe called df. The aim is to have a bar chart of the variables P and C, overlaid with a line chart of the sum of the variables P_int and C_int. Currently I have these lines of code to create the bar chart:
final <- df %>% tidyr::gather(type, value, c(`P`, `C`))
ggplot(final, aes(S))+
geom_bar(aes(y=value, fill=type), stat="identity", position="dodge")
The thing I can't figure out is how to plot the sum of the variables P_int and C_int as a line chart overlaid on the above plot with a second y-axis. I would appreciate any help.
Do you need something like this ?
library(ggplot2)
library(dplyr)
ggplot(final, aes(S)) +
  geom_bar(aes(y = value, fill = type), stat = "identity", position = "dodge") +
  # take the totals from the wide `df`: `final` repeats each S once per type,
  # so summing there would count P_int + C_int twice
  geom_line(data = df %>% mutate(total = P_int + C_int),
            aes(y = total), color = 'blue') +
  # identity transform; change the divisor to rescale the secondary axis
  scale_y_continuous(sec.axis = sec_axis(~ . / 1)) +
  theme_classic()
I have kept the scale of the secondary y-axis the same as the primary y-axis since they are in the same range, but you might need to adjust it according to your real data.
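If the line's values were on a very different scale from the bars, one common approach is to multiply the line by a factor and divide the secondary axis by that same factor. The sketch below uses the sample data from the question; the factor k, the axis names, and the base-R reshaping are illustrative choices, not part of the original answer.

```r
library(ggplot2)

# Sample data from the question
df <- data.frame(
  S     = seq(10, 100, by = 10),
  P     = c(20, 550, 142, 45, 12, 5, 2, 4, 1, 0),
  C     = c(164, 709, 192, 61, 20, 6, 2, 1, 0, 1),
  P_int = c(72, 92, 97, 99, 99, 99, 99, 99, 10, 10),
  C_int = c(64, 89, 96, 98, 99, 99, 99, 99, 99, 99)
)
df$total <- df$P_int + df$C_int

# Long format for the bars (base-R stand-in for tidyr::gather)
long <- rbind(
  data.frame(S = df$S, type = "P", value = df$P),
  data.frame(S = df$S, type = "C", value = df$C)
)

# Hypothetical scale factor: stretch the line so its peak matches the tallest bar
k <- max(long$value) / max(df$total)

p <- ggplot(long, aes(S)) +
  geom_col(aes(y = value, fill = type), position = "dodge") +
  # the line is drawn on the primary scale, multiplied by k
  geom_line(data = df, aes(y = total * k), colour = "blue") +
  scale_y_continuous(
    name = "P / C",
    # dividing by k makes the secondary axis read in original units
    sec.axis = sec_axis(~ . / k, name = "P_int + C_int")
  ) +
  theme_classic()
p
```

The secondary axis is only a relabeling of the primary one, so the multiplication in aes() and the division in sec_axis() have to use the same factor.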

Why don't the two means match when computed manually and with stat_summary?

I'm doing a project analysing the SO data and wanted to check whether the badge classes (gold, silver and bronze) show significant differences between the top users within those groups, i.e. that the CIs don't overlap. To this end I compute Student's t CIs and draw them as error bars, but the mean computed in that summary doesn't match the mean shown by stat_summary; see here (apologies for not having a reproducible example, the data set is huge):
str(comp)
'data.frame': 4500 obs. of 10 variables:
$ userId : num 51 58 61 79 101 122 136 142 233 238 ...
$ reputation : num 35198 39731 41299 38596 38689 ...
$ creationDate : POSIXct, format: "2008-08-01 13:31:13" "2008-08-01 13:56:33" "2008-08-01 14:21:00" "2008-08-01 16:05:09" ...
$ lastAccessDate: POSIXct, format: "2019-11-30 16:40:08" "2019-10-31 15:55:12" "2019-12-01 01:41:04" "2018-04-06 01:48:22" ...
$ location : chr "Yad Binyamin, Israel" "Indianapolis, IN" "Auckland, New Zealand" "New York, NY" ...
$ views : int 3086 1825 1771 1404 1845 2936 2199 874 1655 780 ...
$ upvotes : int 2753 1049 1322 411 550 517 553 106 1734 216 ...
$ downvotes : int 44 55 219 38 64 51 98 3 211 18 ...
$ class : Factor w/ 3 levels "bronze","gold",..: 1 1 1 1 1 1 1 1 1 1 ...
$ badge : Factor w/ 91 levels "Altruist","Analytical",..: 52 52 52 52 52 52 52 52 52 52 ..
summaryRep <- comp %>%
group_by(class) %>%
summarise(n=n(), mean=mean(reputation), sd=sd(reputation), se=sd/sqrt(n), ci=qt(.975,n-1)*se)
> summaryRep
# A tibble: 3 x 6
class n mean sd se ci
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 bronze 1500 37494. 5513. 142. 279.
2 gold 1500 145712. 117260. 3028. 5939.
3 silver 1500 54451. 13118. 339. 664.
colorSpec <- c("#f9a602", "#c0c0c0", "#cd7f32")
names(colorSpec) <- c("gold", "silver", "bronze")
comp %>%
left_join(summaryRep, by="class") %>%
ggplot(aes(badge, reputation, colour=class, group=class)) +
geom_boxplot(notch=T) +
stat_summary(fun.y=mean, geom="point", shape=20, size=10) +
geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.3) +
scale_y_log10() +
scale_colour_manual(values = colorSpec) +
geom_jitter(alpha=0.3)
Note that the mean inside the error bar doesn't match the mean per class produced by stat_summary.
PS: the data is very far from normally distributed, so I'd need a different CI such as a bootstrapped CI (BCI), but I'm still very curious why the means don't match.
UPDATE: this demonstrates that whichever column ggplot groups by, badge or class, the resulting means should be the same:
identical(comp %>%
group_by(class) %>%
summarise(avgReputation=mean(reputation)) %>%
select(avgReputation) %>%
arrange(avgReputation),
comp %>%
group_by(badge) %>%
summarise(avgReputation=mean(reputation)) %>%
select(avgReputation) %>%
arrange(avgReputation))
[1] TRUE
It is the log transformation that's causing the problem. When you apply scale_y_log10(), stat_summary takes the mean (and sd) of the log10 values, which is different from log10(mean) or log10(sd). Your error bars, by contrast, are computed from the untransformed data. Ideally you should transform the data before making these calculations.
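The gap between the two quantities is easy to see on a tiny vector. This is a minimal sketch with made-up, strongly skewed values (not the real reputation data); the variable names are illustrative.

```r
# Hypothetical, strongly skewed values (not the real reputation data)
reputation <- c(10, 100, 1000)

mean_of_logs <- mean(log10(reputation))  # what a summary on a log10 axis averages
log_of_mean  <- log10(mean(reputation))  # where the raw-data mean lands on that axis

mean_of_logs
log_of_mean
10^mean_of_logs  # back-transformed mean of logs, i.e. the geometric mean
```

For right-skewed data the mean of the logs sits below the log of the mean, which is why the point drawn by the summary falls below the centre of an error bar computed on the raw values.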
Simulate some data:
comp <- data.frame(
class = sample(c("bronze","silver","gold"),1000,replace=TRUE),
reputation = rnbinom(1000,mu=100,size=1)+1
)
rep2badge = c("silver"="Good Answer","gold"="Great Answer","bronze"="Nice Answer")
# index by name; as.character() guards against class being a factor
comp$badge = rep2badge[as.character(comp$class)]
We make a function for your plot:
boxplot_func = function(DF,LOG,TITLE){
if(LOG){DF <- DF %>% mutate(reputation=log10(reputation))}
colorSpec <- c("#f9a602", "#c0c0c0", "#cd7f32")
names(colorSpec) <- c("gold", "silver", "bronze")
summaryRep <- DF %>%
group_by(class) %>%
summarise(n=n(), mean=mean(reputation),
median=median(reputation),sd=sd(reputation),
se=sd/sqrt(n), ci=qt(.975,n-1)*se)
DF %>%
left_join(summaryRep, by="class") %>%
ggplot(aes(badge, reputation, colour=class, group=class)) +
geom_boxplot(notch=T) +
stat_summary(fun=mean, geom="point", shape=20, size=3) + # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
geom_errorbar(aes(ymin=mean-ci, ymax=mean+ci), width=.3) +
scale_colour_manual(values = colorSpec) +
geom_jitter(alpha=0.3) +
ggtitle(TITLE)
}
Then we plot with and without the log transformation on reputation:
library(ggplot2)
library(dplyr)
library(gridExtra)
p1= boxplot_func(comp,TRUE,"log10scale")
p2= boxplot_func(comp,FALSE,"normal scale")
grid.arrange(p1,p2,ncol=2)

How to use mse function - babynames example

I am using the 'babynames' package in RStudio and am trying to get the 35 most common unisex names. I am trying to rank the names by their mean squared error from the 50-50 line, but I am not sure how to do this. Any help would be greatly appreciated! (Below is the 'reference code' we were given, which lists the top 35 unisex names.)
Reference Code:
actual_names <- c("Jessie", "Marion", "Jackie", "Alva", "Ollie",
"Jody", "Cleo", "Kerry", "Frankie", "Guadalupe",
"Carey", "Tommie", "Angel", "Hollis", "Sammie",
"Jamie", "Kris", "Robbie", "Tracy", "Merrill",
"Noel", "Rene", "Johnnie", "Ariel", "Jan",
"Devon", "Cruz", "Michel", "Gale", "Robin",
"Dorian", "Casey", "Dana", "Kim", "Shannon")
I think there are a few ways to answer the question as posed, since there's a tradeoff between "most popular" and "most unisex."
Here's a way to prep the data to collect some stats for each name.
library(babynames)
library(tidyverse)
babynames_share <-
babynames %>%
filter(year >= 1930, year <= 2012) %>%
count(name, sex, wt = n) %>%
spread(sex, n, fill = 0) %>%
mutate(Total = F + M,
F_share = F / Total,
MS_50 = ((F_share-0.5)^2 +
(0.5-F_share)^2) / 2)
It looks like around 100 names have perfect gender parity -- but they're all quite uncommon:
babynames_share %>%
filter(F == M) %>%
arrange(-Total)
# A tibble: 100 x 6
name F M Total F_share MS_50
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tyjae 157 157 314 0.5 0
2 Callaway 128 128 256 0.5 0
3 Avyn 100 100 200 0.5 0
4 Zarin 92 92 184 0.5 0
5 Tkai 72 72 144 0.5 0
6 Rayen 57 57 114 0.5 0
7 Meco 43 43 86 0.5 0
8 Pele 40 40 80 0.5 0
9 Nijay 35 35 70 0.5 0
10 Mako 27 27 54 0.5 0
# … with 90 more rows
Or we might pick some arbitrary threshold for what counts as unisex. In the example above, I've calculated the mean squared error of the female and male shares from 50%. We can plot that to see very gendered names at the top (MS_50 tops out at 0.25 by this measure) and unisex names toward the bottom. But it isn't obvious to me how far down we should go to count a name as unisex. Is Casey, which is 58.9% male and therefore has a squared error of 0.089^2 ≈ 0.0079, unisex? Or do we need to go further, to Jessie, which is 50.8% male?
babynames_share %>%
ggplot(data = .,
aes(Total, MS_50, label = name)) +
geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
geom_text(data = . %>% filter(Total > 10000),
check_overlap = TRUE, size = 3) +
scale_x_log10(breaks = c(10^(1:7)),
labels = scales::comma)
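The MS_50 measure can be checked by hand on a few cases. The helper below is a hypothetical function written for illustration (it is not part of babynames or the tidyverse), and the counts of 411 and 589 are made up to reproduce a 58.9% male share like Casey's.

```r
# Hypothetical helper: mean squared distance of the two sex shares from 50%
ms_50 <- function(f, m) {
  f_share <- f / (f + m)
  m_share <- m / (f + m)
  ((f_share - 0.5)^2 + (m_share - 0.5)^2) / 2
}

ms_50(157, 157)  # perfect parity
ms_50(411, 589)  # made-up counts giving a 58.9% male share
ms_50(1000, 0)   # a name used for one sex only: the maximum of the measure
```

Because the two shares always sum to 1, both squared terms are equal, so MS_50 is simply (F_share - 0.5)^2.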
At the "Casey" level of gender parity, here are the top 35:
unisex_names <- babynames_share %>%
filter(MS_50 <= 0.00796) %>%
arrange(-Total) %>%
top_n(35, wt = Total)
It's also interesting to see the whole spectrum of names, with most male on the bottom, female on the top, and unisex in the middle:
babynames_share %>%
ggplot(data = .,
aes(Total, F_share, label = name)) +
geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
geom_text(data = . %>% filter(Total > 10000),
check_overlap = TRUE, size = 2) +
scale_x_log10(breaks = c(10^(1:7)),
labels = scales::comma)

ggplot2: geom_bar(); how to alternate order of fill so bars are not lost inside a bar with a higher value?

I am trying to position two bars at the same position on the x-axis, separated by colour (almost as if stacking).
However, instead of stacking, I want one bar simply drawn inside the other, with the smaller Y-value visible inside the bar with the larger Y-value.
I can get this to work to some extent, but the issue is that neither of the two factor levels has a consistently higher Y-value. This leads to bars being 'lost' behind a bar with a higher Y-value.
Here is a subset of my dataset and the current ggplot code:
condition hours expression freq_genes
1 tofde 9 up 27
2 tofde 12 up 92
3 tofde 15 up 628
17 tofde 9 down 0
18 tofde 12 down 1
19 tofde 15 down 0
33 tofp 9 up 2462
34 tofp 12 up 786
35 tofp 15 up 298
49 tofp 9 down 651
50 tofp 12 down 982
51 tofp 15 down 1034
65 tos 0 up 27
66 tos 3 up 123
67 tos 6 up 752
81 tos 0 down 1
82 tos 3 down 98
83 tos 6 down 594
sf_plot <- ggplot(data = gene_freq,
aes(x = hours,
y = freq_genes,
group = condition,
fill = factor(expression,
labels=c("Down",
"Up"))))
sf_plot <- sf_plot + labs(fill="Expression")
sf_plot <- sf_plot + geom_bar(stat = "identity",
width = 2.5,
position = "dodge")
sf_plot <- sf_plot + scale_fill_manual(values=c("#9ecae1",
"#3182bd"))
sf_plot <- sf_plot + xlab("Time (Hours)")
sf_plot <- sf_plot + scale_x_continuous(breaks =
seq(min(gene_freq$freq_genes),
max(gene_freq$freq_genes),
by = 3))
sf_plot <- sf_plot + ylab("Gene Frequency")
sf_plot <- sf_plot + facet_grid(. ~ condition, scales = "free")
sf_plot <- sf_plot + theme_bw()
sf_plot <- sf_plot + theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
sf_plot <- sf_plot + theme(axis.text.x = element_text(angle = 90))
# Print plot
sf_plot
You can add alpha = 0.5 to your geom_bar() call to make the bars transparent, so that both overlaid bars can be seen. That change alone produces what you're looking for, although with these colors the two bars are still hard to tell apart.
Another (and maybe better) option is to change the order in which the bars are drawn. If I recall correctly, ggplot draws the bars in factor-level order, which defaults to alphabetical or numeric order. Here, your expression labels are c("Down", "Up"), so "Down" is drawn first. Forcing "Up" to be drawn first would also resolve this.
library(dplyr)
library(ggplot2)
dat <-
read.table(text = "condition hours expression freq_genes
1 tofde 9 up 27
2 tofde 12 up 92
3 tofde 15 up 628
17 tofde 9 down 0
18 tofde 12 down 1
19 tofde 15 down 0
33 tofp 9 up 2462
34 tofp 12 up 786
35 tofp 15 up 298
49 tofp 9 down 651
50 tofp 12 down 982
51 tofp 15 down 1034
65 tos 0 up 27
66 tos 3 up 123
67 tos 6 up 752
81 tos 0 down 1
82 tos 3 down 98
83 tos 6 down 594") %>%
mutate(expression2 = ifelse(expression == "up", 1, 2))
dat %>%
ggplot(aes(x = hours, y = freq_genes, group = condition,
fill = factor(expression2, labels=c("Up", "Down")))) +
labs(fill="Expression") +
geom_bar(stat = "identity", position = "dodge", width = 2.5, alpha = 0.5) +
scale_fill_manual(values=c("#9ecae1", "#3182bd")) +
xlab("Time (Hours)") +
scale_x_continuous(breaks = seq(min(dat$hours),
max(dat$hours),
by = 3)) +
ylab("Gene Frequency") +
facet_grid(. ~ condition, scales = "free") +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
axis.text.x = element_text(angle = 90))
Here, I've created a new column called expression2 that is just a numeric version of expression. I changed the fill variable in aes() to match with those new labels. I left the colors in scale_fill_manual() the same as in your original statement and kept the alpha value. "Down" is being plotted on top of "Up" but in keeping the same colors with the alpha value, both bars are easier to see. You can play with the legend to display "Down" before "Up" if that's necessary.
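As an alternative to the numeric helper column, the level order can be set directly on the factor. This is a small base-R sketch of that idea, with illustrative variable names; it only shows how the level order is controlled and assumes, as above, that the level order is what determines the drawing order.

```r
expression <- c("up", "down", "up", "down")

# Default: levels are sorted alphabetically, so "down" comes first
f_default <- factor(expression)
levels(f_default)

# Explicit levels put "up" first without needing a numeric helper column
f_forced <- factor(expression, levels = c("up", "down"), labels = c("Up", "Down"))
levels(f_forced)
```

Passing f_forced (or the same factor() call) to fill in aes() would then take the place of factor(expression2, labels = c("Up", "Down")).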
Note that providing machine readable data goes a long way in allowing others to help you out. Consider using dput() to output your data next time rather than pasting it in. Also note that you can "chain" together ggplot() statements with a +. This makes code much more compact and easier to read.
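For reference, this is roughly what sharing data with dput() looks like. The sketch uses a two-row stand-in for the real data frame; capturing the output and parsing it back is only there to show that the printed text rebuilds the same object.

```r
# A small stand-in for the real data frame
small <- data.frame(
  hours      = c(9, 12),
  expression = c("up", "down"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; paste that text into a question
dput(small)

# Round trip: evaluating the printed text rebuilds the data frame
txt     <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
identical(small, rebuilt)
```

For large data, dput(head(dat, 20)) keeps the pasted text short while preserving column types.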

plotting subset of grouped data in ggplot2

I am trying to make a plot that has mean (+/- SD) number (ID = total count per row) of Explorations on the y-axis and then grouped by both pp and type on the x-axis.
That is, I want to generate something that looks like this (hand-drawn and made up graph):
Here is how the dataframe is structured (available here).
pp crossingtype km type ID
0 Complete 80.0 DCC 10
1 Complete 80.0 DCC 4
0 Exploration 80.0 DCC 49
1 Exploration 80.0 DCC 4
0 Complete 144.0 DWC 235
1 Complete 144.0 DWC 22
0 Exploration 144.0 DWC 238
1 Exploration 144.0 DWC 18
1 Exploration 84.0 PC 40
0 Complete 107.0 PC 43
1 Complete 107.0 PC 22
0 Exploration 107.0 PC 389
I want to use ggplot2 and have tried this code:
ggplot(expMean, aes(x=as.factor(pp), y=crossingtype, color=factor(type),group=factor(type))) +
geom_point(shape=16,cex=3) +
geom_smooth(method=lm) +
facet_grid(.~type)
But it gives me this figure (which is not what I am trying to make).
How can I use ggplot2 to make the first plot?
You can do the statistical transformations within ggplot(), but my preference is to process the data first, then plot the results.
library(tidyverse)
expMean %>%
filter(crossingtype == "Exploration") %>%
group_by(type, pp) %>%
summarise(Mean = mean(ID), SD = sd(ID)) %>%
ggplot(aes(factor(pp), Mean)) +
geom_pointrange(aes(ymax = Mean + SD,
ymin = Mean - SD)) +
facet_wrap(~type) +
theme_bw()
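For comparison, here is a sketch of the other route mentioned above, letting ggplot compute the mean and SD itself through stat_summary(). It assumes ggplot2 3.3.0 or later (for the fun, fun.min and fun.max arguments) and uses made-up Exploration counts with three rows per group, since the excerpt in the question shows only one row per group and sd() of a single value is NA.

```r
library(ggplot2)

# Made-up Exploration counts: three rows per pp/type combination
explore <- data.frame(
  pp   = rep(c(0, 1), each = 3, times = 2),
  type = rep(c("DCC", "DWC"), each = 6),
  ID   = c(40, 50, 60,  2, 4, 6,  200, 240, 280,  10, 20, 30)
)

p <- ggplot(explore, aes(factor(pp), ID)) +
  # the summary is computed per x position within each facet
  stat_summary(
    fun     = mean,
    fun.min = function(x) mean(x) - sd(x),
    fun.max = function(x) mean(x) + sd(x),
    geom    = "pointrange"
  ) +
  facet_wrap(~type) +
  theme_bw()
p
```

Summarising first, as in the code above, has the advantage that the numbers being plotted can be inspected directly.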
Is this what you want? This filters the data to only include Exploration, uses ID as the y variable, groups by pp, and facets on type.
library(tidyverse)
tbl <- read_table2(
"pp crossingtype km type ID
0 Complete 80.0 DCC 10
1 Complete 80.0 DCC 4
0 Exploration 80.0 DCC 49
1 Exploration 80.0 DCC 4
0 Complete 144.0 DWC 235
1 Complete 144.0 DWC 22
0 Exploration 144.0 DWC 238
1 Exploration 144.0 DWC 18
1 Exploration 84.0 PC 40
0 Complete 107.0 PC 43
1 Complete 107.0 PC 22
0 Exploration 107.0 PC 389"
) %>%
mutate(pp = factor(pp))
ggplot(data = tbl %>% filter(crossingtype == "Exploration")) +
geom_boxplot(aes(x = pp, y = ID)) +
facet_wrap(~type)
I ran this code on the linked dataset to produce this:
Here's the approach I used. I used colour to distinguish pp instead of a two-level x-axis.
Note that I downloaded the data to my working directory, so the read.table command may need to be modified.
library(dplyr)
library(ggplot2)
dat <- read.table("figshare.txt")
dat <- droplevels(filter(dat, crossingtype == "Exploration"))
dat <- dat %>%
group_by(pp, type) %>%
summarise(val = mean(ID),
SD = sd(ID))
ggplot(dat, aes(x = type, y = val, colour = as.factor(pp), group =
as.factor(pp))) +
geom_point(size = 3, position = position_dodge(width = 0.2)) +
geom_errorbar(aes(ymax = val + SD, ymin = val - SD), position =
position_dodge(width = 0.2), width = 0.2) +
labs(y = "Mean # of explorations (+/- SD)", colour = "pp")
