Using ggplot to plot number of TRUE statements from a df - r

I'm trying to plot the number of TRUE results of a condition applied to a df column.
I have a df that looks like this
Speed Month_1
12 67
12 114
12 155
12 44
13 77
13 165
13 114
13 177
...
And I would like to plot a bargraph where we have x = Speed and y = Number of rows that are above 100 in Month_1 column.
So for X = 12 I would have a bargraph with a Y-value of 2 and for X = 13 I would have a Y-value of 3.
Can I do this directly in ggplot, or do I have to create a new DF first?

Sure, just filter out the values that are not above 100 in the data you pass to ggplot and do a normal geom_bar
ggplot(df[df$Month_1 > 100, ], aes(factor(Speed))) +
geom_bar(width = 0.5, fill = 'deepskyblue4') +
theme_bw(base_size = 16) +
labs(x = 'Speed')
If, for some reason, you really need to pass the full data frame without filtering it, you can fill the bars for values of 100 or less with a fully transparent colour:
ggplot(df, aes(factor(Speed), fill = Month_1 > 100)) +
geom_bar(width = 0.5) +
theme_bw(base_size = 16) +
scale_fill_manual(values = c('#00000000', 'deepskyblue4')) +
labs(x = 'Speed') +
theme(legend.position = 'none')

You can use dplyr to filter your data frame and then plot it with ggplot.
library(tidyverse)
df <- tibble(Speed = c(12, 12, 12, 12, 13, 13, 13, 13),
Month_1 = c(67, 114, 155, 44, 77, 165, 114, 177))
df %>% filter(Month_1 > 100) %>%
ggplot(aes(x = Speed)) + geom_bar()
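To do the counting inside ggplot without a filtered copy, one option is to map the logical condition to y and let a summary stat add it up. This is a minimal sketch, assuming the example data from the question; the `above` column and the `counts` helper are illustrative names only. The base R `aggregate()` call is there just to show the numbers the bars should reach.

```r
library(ggplot2)

df <- data.frame(
  Speed   = c(12, 12, 12, 12, 13, 13, 13, 13),
  Month_1 = c(67, 114, 155, 44, 77, 165, 114, 177)
)

# Reference counts in base R: TRUE sums as 1, FALSE as 0
df$above <- df$Month_1 > 100
counts <- aggregate(above ~ Speed, data = df, FUN = sum)

# Same counting done by ggplot: sum the 0/1 indicator per Speed
p <- ggplot(df, aes(factor(Speed), as.integer(Month_1 > 100))) +
  stat_summary(fun = sum, geom = "col", width = 0.5) +
  labs(x = "Speed", y = "Rows with Month_1 > 100")
```

Unlike filtering, this keeps a Speed value on the axis even when none of its rows pass the threshold.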

Related

stacked bar chart without using fill in geom_bar?

I have some dummy data and am able to create a bar chart and a stacked bar chart:
# some data
egdf <- data.frame(
ch = c('a', 'b', 'c'),
N = c(100, 110, 120),
M = c(10, 15, 20)
)
Looks like this:
egdf
ch N M
1 a 100 10
2 b 110 15
3 c 120 20
Now some charts:
# bar chart
ggplot(egdf, aes(x = ch, y = N)) +
geom_bar(stat = 'identity')
# stacked bar chart
egdf %>%
pivot_longer(cols = c(N, M), names_to = 'metric') %>%
ggplot(aes(x = ch, y = value, fill = metric)) +
geom_bar(stat = 'identity')
My question is, is there a way to create the stacked bar chart from egdf directly without having to first transform with pivot_longer()?
[EDIT]
Why am I asking for this? My actual dataframe has some additional fields which are based on calculations off the current structure, e.g. it looks more like this:
egdf <- data.frame(
ch = c('a', 'b', 'c'),
N = c(120, 110, 100),
M = c(10, 15, 20)
) %>%
mutate(drop = N - lag(N),
drop_pct = scales::percent(drop / N),
Rate = scales::percent(M / N))
egdf
ch N M drop drop_pct Rate
1 a 120 10 NA <NA> 8.3%
2 b 110 15 -10 -9.09% 13.6%
3 c 100 20 -10 -10.00% 20.0%
In my plot, I'm adding some additional geoms. If I were to pivot_longer, these relationships would be broken. If I could somehow tell ggplot to make a stacked bar based only on feature1 and feature2 (N and M in the example), it would be much easier for this particular use case.
Update: See the valuable comment from stefan:
ggplot(egdf, aes(x=ch, y=N+M)) +
geom_col(aes(fill="N")) +
geom_col(aes(x=ch, y=M, fill="M")) +
ylab("N") +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10))
First answer:
Are you looking for a solution like this?
ggplot(egdf, aes(x=ch, y=N)) +
geom_col(aes(fill="N")) +
geom_col(aes(x=ch, y=M, fill="M"))
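A quick way to see why the update adds N + M in the first layer: each geom_col() starts at zero, so the layers overlap rather than stack. The sketch below assumes the three-row egdf from the question and uses the illustrative names `p`, `total_heights` and `m_heights`; it reads the drawn bar heights back with layer_data().

```r
library(ggplot2)

egdf <- data.frame(
  ch = c("a", "b", "c"),
  N  = c(100, 110, 120),
  M  = c(10, 15, 20)
)

# Layer 1 draws the full height (N + M); layer 2 paints M over its lower part
p <- ggplot(egdf, aes(x = ch)) +
  geom_col(aes(y = N + M, fill = "N")) +
  geom_col(aes(y = M, fill = "M"))

total_heights <- layer_data(p, 1)$y  # heights of the back layer
m_heights     <- layer_data(p, 2)$y  # heights of the front layer
```

The visible "N" segment then runs from M up to N + M, which is what a true stacked bar would show.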

ggplot, facet, piechart, missing values

This is my final dataset. I originally obtained this table by calculating the values separately and doing rbind between females (F) and males (M) from an original bigger dataset.
I am trying to make a handsome piechart, with the percentage labels outside and I've encountered ALL problems possible for which I cannot find a solution.
Notice there is no value A for males in the dataframe.
Dataframe:
sex ms n_ms n msPerc value
1 F A 1 91 0.01098901 1.098901
2 F B 18 91 0.19780220 19.780220
3 F C 65 91 0.71428571 71.428571
4 F D 7 91 0.07692308 7.692308
5 M B 11 108 0.10185185 10.185185
6 M C 86 108 0.79629630 79.629630
7 M D 11 108 0.10185185 10.185185
library(ggplot2)
library(ggrepel)
library(tidyverse)
n<- c(91, 91 , 91, 91, 108, 108, 108 )
n_ms<-c(1,18,65,7,11,86,11)
sex<- c("F","F","F","F", "M"," M","M")
ms<- c("A","B","C","D","B","C","D")
df <- data.frame(sex, ms, n_ms, n)
df[is.na(df)]<- 0
df$msPerc <- df$n_ms /df$n
df$value <- 100*df$n_ms /df$n
df$n_ms<- as.integer(df$n_ms) # original big dataframe (doing for replication purposes)
df$n<- as.integer(df$n)
#creating position of labels
df2 <- df %>%
mutate(csum = rev(cumsum(rev(value))),
pos = value/2 + lead(csum, 1),
pos = if_else(is.na(pos), value/2, pos))
ms_pie<-ggplot(df, aes(x="", y=msPerc, group=sex, fill=ms)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start=0) +
facet_grid(.~ sex) +
theme_void()+
theme(legend.position="top",
legend.text = element_text(size = 9),
legend.title = element_text(size = 9,face = "bold"))+
scale_fill_manual(values=c("#d7191c", "#fdae61", "#abd9e9","#5e3c99"),
name="Moulting stage",
labels=c("A","B","C","D"))+
# geom_label(aes(label = percent(msPerc)),
# position = position_stack(vjust = 0.5),
# show.legend = FALSE)
# geom_text(aes(label = percent(msPerc)),size = 3,color = "black",
# position = position_stack(vjust = 0.5),
# show.legend = FALSE)
geom_label_repel(data = df2,
aes(y = pos, label = paste0(value, "%")),
size = 4.5, nudge_x = 1, show.legend = FALSE)
ms_pie
This is what happens...
What I would like is a piechart like this one from https://r-charts.com/part-whole/pie-chart-labels-outside-ggplot2/ but including facet_grid in variable "sex".
So far this is the closest I've got. Using geom_label, however my values overlap and I do not know how to separate them either... the joys of being a beginner in R.
I also tried the solutions provided in "ggplot, facet, piechart: placing text in the middle of pie chart slices", but coord_polar won't work with scales = "free".
I would much appreciate the help.
Kind regards.
There are several problems with your code, which I've tried to clean up (see the comments preceded by ####). This should get you closer:
library(ggplot2)
library(ggrepel)
library(tidyverse)
library(scales) #### using scales for number formatting
n<- c(91, 91 , 91, 91, 108, 108, 108 )
n_ms<-c(1,18,65,7,11,86,11)
sex<- c("F","F","F","F", "M","M","M") #### changed " M" to "M" at last but one element
ms<- c("A","B","C","D","B","C","D")
df <- data.frame(sex, ms, n_ms, n)
df[is.na(df)]<- 0
df$msPerc <- df$n_ms /df$n
df$value <- 100*df$n_ms /df$n
df$n_ms<- as.integer(df$n_ms) # original big dataframe (doing for replication purposes)
df$n<- as.integer(df$n)
#creating position of labels
df2 <- df %>% group_by(sex) %>% #### you need to group your data by the facets you want to show
mutate(csum = cumsum(msPerc), #### adjusted example code to use "msPerc"
pos = msPerc/2 + lag(csum, 1),
pos = if_else(is.na(pos), msPerc/2, pos))
ms_pie<-ggplot(df, aes(x="", y=msPerc, group=sex, fill=ms)) +
geom_col(width = 1) +
coord_polar("y", start=0) +
facet_grid(~sex) +
theme_void()+
theme(legend.position="top",
legend.text = element_text(size = 9),
legend.title = element_text(size = 9,face = "bold"))+
scale_fill_manual(values=c("#d7191c", "#fdae61", "#abd9e9","#5e3c99"),
name="Moulting stage",
labels=c("A","B","C","D"))+
geom_label_repel(data = df2,
aes(y = pos, label = percent(msPerc, accuracy = 0.1)), #### accuracy = 0.1 gives one decimal place
size = 4.5, nudge_x = 1, show.legend = FALSE)
ms_pie
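The grouping step is the important part of the fix: the running total has to restart for each facet. As a rough illustration in base R (the column names csum and pos follow the answer; using ave() in place of group_by() is a substitution made here), the slice midpoints can be computed like this:

```r
df <- data.frame(
  sex  = c("F", "F", "F", "F", "M", "M", "M"),
  ms   = c("A", "B", "C", "D", "B", "C", "D"),
  n_ms = c(1, 18, 65, 7, 11, 86, 11),
  n    = c(91, 91, 91, 91, 108, 108, 108)
)
df$msPerc <- df$n_ms / df$n

# Cumulative share within each sex, so every facet runs from 0 to 1
df$csum <- ave(df$msPerc, df$sex, FUN = cumsum)

# Midpoint of each slice: end of the slice minus half its width
df$pos <- df$csum - df$msPerc / 2
```

Whether these midpoints line up with the drawn slices also depends on the order in which ggplot stacks the fill levels, so it is worth checking one facet by eye.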

A circular histogram in R shows incorrect values

I'm trying to recreate a circular plot from here (the first plot on this page), but the output I get seems incorrect. The 'last' bar (between 23 and 0) is missing and the 'first' one (between 0 and 1) is disproportionately high. What's more, the bars appear shifted by one unit to the left, while on the website above the plot seems fine.
Here is the code I copied from that site. The only change I made was to remove "width=2" from geom_histogram(), because otherwise it raised an error saying that the width argument was deprecated.
library(lubridate)
library(ggplot2)
set.seed(44)
N=500
events <- as.POSIXct("2011-01-01", tz="GMT") +
days(floor(365*runif(N))) +
hours(floor(24*rnorm(N))) +
minutes(floor(60*runif(N))) +
seconds(floor(60*runif(N)))
hour_of_event <- hour(events)
eventdata <- data.frame(datetime = events, eventhour = hour_of_event)
# determine if event is in business hours
eventdata$Workday <- eventdata$eventhour %in% seq(9, 17)
ggplot(eventdata, aes(x = eventhour, fill = Workday)) +
geom_histogram(breaks = seq(0, 24), colour = "grey") +
coord_polar(start = 0) + theme_minimal() +
scale_fill_brewer() + ylab("Count") +
ggtitle("Events by Time of day") +
scale_x_continuous("", limits = c(0, 24), breaks = seq(0, 24), labels = seq(0, 24))
Here is what I got:
Here is a table of the data. You can see that hour 23 should have a value of 17 instead of 0 as in my plot.
table(eventdata$eventhour)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
23 22 18 26 28 20 19 21 16 17 20 16 18 22 16 21 24 21 22 27 25 18 23 17
Do you have an idea why my plot doesn't show correct values and how I can fix this?
I propose this solution, based on this post:
library(lubridate)
library(ggplot2)
set.seed(44)
N=500
events <- as.POSIXct("2011-01-01", tz="GMT") +
days(floor(365*runif(N))) +
hours(floor(24*rnorm(N))) +
minutes(floor(60*runif(N))) +
seconds(floor(60*runif(N)))
hour_of_event <- hour(events)
eventdata <- data.frame(datetime = events, eventhour = hour_of_event)
# determine if event is in business hours
eventdata$Workday <- eventdata$eventhour %in% seq(9, 17)
df <- data.frame(table(eventdata$eventhour),
business_hour = 0:23 %in% seq(9, 17))
colnames(df)[1:2] <- c("hour", "value")
ggplot(df, aes(hour, value, fill = business_hour)) +
coord_polar(theta = "x", start = 0) +
geom_bar(stat = "identity", width = .9)
I hope it helps. It doesn't tell you why you have a problem in your case but it gives you a viable solution.
It seems that the issue was caused by the arguments passed to geom_histogram and scale_x_continuous.
Instead of this:
geom_histogram(breaks = seq(0, 24), colour = "grey") +
scale_x_continuous("", limits = c(0, 24), breaks = seq(0, 24), labels = seq(0, 24))
it should be:
geom_histogram(bins = 24, colour = "grey") +
scale_x_continuous(breaks = seq(-0.5, 23.5), labels = seq(0, 24))
It's still a bit confusing to me why it works only this way, but it finally works...
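A likely explanation, offered as a sketch rather than a confirmed diagnosis: with breaks = seq(0, 24) the histogram bins are closed on the right by default, so hour 0 and hour 1 share the first bin and nothing lands in the last one. Base R's cut() can mimic that convention and makes the effect easy to see; the small `hours` vector is made up for illustration.

```r
# Made-up integer hours, standing in for eventdata$eventhour
hours  <- c(0, 0, 1, 2, 23, 23, 23)
breaks <- 0:24

# Right-closed bins: [0,1], (1,2], ..., (23,24]
right_closed <- as.vector(table(cut(hours, breaks, right = TRUE, include.lowest = TRUE)))

# Left-closed bins: [0,1), [1,2), ..., [23,24]
left_closed <- as.vector(table(cut(hours, breaks, right = FALSE, include.lowest = TRUE)))

right_closed[c(1, 24)]  # first bin holds hours 0 and 1, last bin is empty
left_closed[c(1, 24)]   # first bin holds hour 0 only, last bin holds hour 23

# The matching ggplot2 setting would be:
# geom_histogram(breaks = seq(0, 24), closed = "left", colour = "grey")
```

With left-closed bins the original breaks and axis limits can stay as they were, so the tick labels keep marking full hours.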

How to define day from 6am to 6am on x axis in ggplot?

I am trying to do a bar chart of an aggregate, by the hour.
hourly <- data.frame(
hour = 0:23,
N = 7+0:23,
hour.mod = c(18:23, 0:17))
The day is from 6am to 6am, so I added an offset, hour.mod, and then:
ggplot(hourly, aes(x = hour.mod, y = N)) +
geom_col() +
labs(x = "6am to 6am", y = "Count")
Except that the x-axis starts at 0, which contradicts the label. While tinkering with scales, scale_x_discrete(breaks = c(6, 10, 14, 18, 22)) made the axis labels disappear altogether, which works for now but is sub-optimal.
How do I specify x axis to start at an hour other than 0 or 23? Is there way to do so without creating an offset column? I am a novice, so please assume you are explaining to the village idiot.
You don't say what you want to see, but it's fairly clear that you should be using scale_x_continuous and shifting your labels somehow, either "by hand" or with some simple math:
ggplot(hourly, aes(x = hour.mod, y = N)) +
geom_col() +
labs(x = "6am to 6am", y = "Count") +
scale_x_continuous(breaks= c(0,4,8,12,16), labels = c(6, 10, 14, 18, 22) )
Or perhaps:
ggplot(hourly, aes(x = hour.mod, y = N)) +
geom_col() +
labs(x = "6am to 6am", y = "Count") +
scale_x_continuous(breaks= c(6, 10, 14, 18, 22)-6, # shifts all values lower
labels = c(6, 10, 14, 18, 22) )
It's possible you need to use modulo arithmetic, which in R involves the use of %% and %/%:
1:24 %% 12
[1] 1 2 3 4 5 6 7 8 9 10 11 0 1 2 3 4 5 6 7 8 9 10 11 0
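Applied to this question, modulo arithmetic can replace both the hand-written offset column and the hand-written labels. A minimal sketch, assuming the hourly data from the question; `shift` and `clock_label` are illustrative names, not part of the original answer:

```r
library(ggplot2)

hourly <- data.frame(hour = 0:23, N = 7 + 0:23)
shift  <- 6  # the day starts at 6am

# Position on the axis: hours since 6am, wrapping around midnight
hourly$hour.mod <- (hourly$hour - shift) %% 24

# Turn an axis position back into a clock hour for the tick labels
clock_label <- function(pos) (pos + shift) %% 24

p <- ggplot(hourly, aes(x = hour.mod, y = N)) +
  geom_col() +
  scale_x_continuous(breaks = seq(0, 20, by = 4), labels = clock_label) +
  labs(x = "6am to 6am", y = "Count")
```

Changing `shift` is then enough to move the start of the day; the offset column and the labels follow from it.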

Looking to split a scatterplot and return the highest values (images included)

I have two columns in R called A and B, for which I created a scatter plot. (Both scatter plots are identical; one just has labels added.)
I want to identify the names of the items after the cut off line. There are about 11 points above the cut off line, how can I go about doing this in R?
Create a data table of all values greater than 1e+05? Just looking for some direction, thanks.
Plot 1:
ggplot(df, aes(x=A, y=B)) +
geom_point(fill = "red")
Plot 2:
ggplot(df, aes(x=A, y=B)) +
geom_point(fill = "red") +
geom_text(label=rownames(df))
Sample of A:
1 1136 12 203 147 9 3 12 83 82 1304
Sample of B:
1852 90 277 329 15 8 30 165 156 2572
Two Identical Scatterplots
How about this approach:
library(ggplot2)
library(ggrepel) # will avoid overlapping of values above the cut off
A <- c( 1136, 12, 203, 147, 9, 3, 12, 83, 82, 1304)
B <- c(1852, 90, 277, 329, 15, 8, 30, 165, 156, 2572)
df <- data.frame(A , B)
df$id <- ifelse(df$A > 1e+03 & df$B > 1e+03, B, NA) # just an example of cut off.
ggplot(df, aes(x=A, y=B)) +
geom_point() +
geom_text_repel(aes(x = A, y = B, label = id), size = 2.5, na.rm = TRUE)
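If the goal is a table of the points beyond the cut-off rather than labels on the plot, plain subsetting is enough. A small sketch, assuming the sample data from the answer and its example threshold of 1e+03 on both axes (substitute your own cut-off); `cutoff` and `above` are illustrative names:

```r
A <- c(1136, 12, 203, 147, 9, 3, 12, 83, 82, 1304)
B <- c(1852, 90, 277, 329, 15, 8, 30, 165, 156, 2572)
df <- data.frame(A, B)

cutoff <- 1e3  # example threshold only

# Keep the rows above the cut-off on both axes; row names identify the items
above <- df[df$A > cutoff & df$B > cutoff, ]
rownames(above)
```

The resulting data frame can also be passed as the data argument of a label layer, so that only those points are annotated.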

Resources