I have created a bar plot in which each variable has up to four data points, and it plots successfully. The only issue is that the legend key is not in the order I would like: ideally it would run from best to worst, in this case from 'Excellent' to 'Not so good'.
What part of the code would I need to change for the order to go from best to worst?
df <- read.csv("//ecfle35/STAFF-HOME$/MaxEmery/open event feedback/October/Q3.csv")
df %>%
#First the dataset needs to be long not wide
gather(review,
count,
Excellent:Not.so.good,
factor_key = T) %>%
#Let's get rid of N/A
filter(count != 'N/A') %>%
#convert count from string to number
#Remove the annoying full stop in the middle of text
mutate(count = as.integer(count),
review = gsub('\\.', ' ', review)) %>%
ggplot(aes(
x = Faculty,
y = count,
fill = review
)) +
geom_bar(position = 'dodge',
stat = 'identity') +
scale_y_continuous(breaks = seq(0, 22, by = 2)) +
labs(title = 'Teaching Staff Ratings',
x = 'Faculty',
y = 'Count') +
theme(axis.text.x = element_text(angle = 90))
Below is an image of how it currently looks:
[image: graphic of my plot]
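When you convert to a factor without specifying the levels, the levels are put in alphabetical order. In your pipe, gsub() returns a character vector, so the column order kept by factor_key = T is lost and ggplot() falls back to alphabetical order. Add a mutate() in the pipe to relevel the review column before sending it to ggplot().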
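A minimal sketch of that releveling step, written in base R so it runs on its own. Only 'Excellent' and 'Not so good' are named in the question; the middle ratings 'Good' and 'Average' and the vector name rating_order are assumptions for illustration.

```r
# Desired legend order, best to worst. 'Good' and 'Average' are
# hypothetical middle ratings; substitute the real column names.
rating_order <- c("Excellent", "Good", "Average", "Not so good")

# Stand-in for the review column after gather() and gsub()
review <- gsub(".", " ", c("Good", "Not.so.good", "Excellent", "Average"),
               fixed = TRUE)

# Without explicit levels, the factor levels are alphabetical
default_levels <- levels(factor(review))

# With explicit levels, the legend would follow rating_order
ordered_levels <- levels(factor(review, levels = rating_order))

print(default_levels)
print(ordered_levels)
```

In the pipe from the question this would correspond to something like mutate(review = factor(review, levels = rating_order)), placed after the gsub() step and before ggplot().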
Related
Let's say I have 100 variable columns and 1 label column. The label is categorical, for example 1, 2, and 3. Now, for each variable, I would like to generate a plot for each category (e.g. a boxplot). Is there a good format for showing all the plots? With facet_grid, it seems that we can only put 16 plots together; otherwise the plots will be too small.
Example code:
label = sample.int(3, 50, replace = TRUE)
var = as.matrix(matrix(rnorm(5000),50,100))
data = as.data.frame(cbind(var,label))
Ultimately, if you want a box for each of 3 groups for each column of your data, then you would need 300 boxes in total. This seems like a bad idea from a data visualisation perspective. A plot should allow your data to tell a story, but the only story a plot like that could show is "I can make a very crowded plot". In terms of getting it to look nice, you would need a lot of room to plot this, so if it were on a large poster it might work.
To fit it all into a single page with minimal room taken up by axis annotations, you could do something like:
library(tidyverse)
pivot_longer(data, -label) %>%
mutate(name = as.numeric(sub('V', '', name))) %>%
mutate(row = (name - 1) %/% 20,
label = factor(label)) %>%
ggplot(aes(factor(name), value, fill = label)) +
geom_boxplot() +
facet_wrap(row~., nrow = 5, scales = 'free_x') +
labs(x = "data frame column") +
theme(strip.background = element_blank(),
strip.text = element_blank())
But this is still far from ideal.
An alternative, depending on the nature of your data columns, would be to plot the column number as a continuous variable. That way, you can represent the distribution in each column via its density, allowing for a heatmap-type plot which might actually convey your data's story better:
pivot_longer(data, -label) %>%
mutate(x = as.numeric(sub('V', '', name))) %>%
mutate(label = factor(label)) %>%
group_by(x, label) %>%
# reframe() (dplyr >= 1.1.0) is the replacement for summarize() when each group returns several rows
reframe(y = density(value, from = -6, to = 6)$x,
z = density(value, from = -6, to = 6)$y) %>%
ggplot(aes(x, y, fill = label, alpha = z)) +
geom_raster() +
coord_cartesian(expand = FALSE) +
labs(x = 'data frame column', y = 'value', alpha = 'density') +
facet_grid(label~.) +
guides(fill = 'none') +
theme_bw()
So I have the following code which produces:
The issue here is twofold:
1. The grouped bar chart automatically places the highest value on top (i.e. for avenue 4 CTP is on top), whereas I would always want FTP to be shown first and CTP after it (so always the blue bar, then the red bar).
2. I need all of the values to scale to 100 or 100% for their respective group (so for CTP, avenue 4 would have a huge bar but the other avenues should be extremely tiny).
I am new to R and Stack Overflow, so sorry if anything is wrong or you need more, but any help is greatly appreciated.
library(ggplot2)
library(tidyverse)
library(magrittr)
# function to specify decimals
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
# sample data
avenues <- c("Avenue1", "Avenue2", "Avenue3", "Avenue4")
flytip_amount <- c(1000, 2000, 1500, 250)
collection_amount <- c(5, 15, 10, 2000)
# create data frame from the sample data
df <- data.frame(avenues, flytip_amount, collection_amount)
# got it working - now to test
df3 <- df
SumFA <- sum(df3$flytip_amount)
df3$FTP <- (df3$flytip_amount/SumFA)*100
df3$FTP <- specify_decimal(df3$FTP, 1)
SumCA <- sum(df3$collection_amount)
df3$CTP <- (df3$collection_amount/SumCA)*100
df3$CTP <- specify_decimal(df3$CTP, 1)
# Now we have percentages remove whole values
df2 <- df3[,c(1,4,5)]
df2 <- df2 %>% pivot_longer(-avenues)
FTGraphPos <- df2$name
ggplot(df2, aes(x = avenues, fill = as.factor(name), y = value)) +
geom_col(position = "dodge", width = 0.75) + coord_flip() +
labs(title = "Flytipping & Collection %", x = "ward_name", y = "Percentageperward") +
geom_text(aes(x= avenues, label = value), vjust = -0.1, position = "identity", size = 5)
I have tried the above and looked at lots of tutorials, but none covers exactly what I need: grouped bars that keep the same order regardless of their values, and values scaled to 100/100% within each group.
As Camille notes, to handle ordering of the categories in a plot, you need to set them as factors, and then use functions from the forcats package to handle the order. Here I am using fct_relevel() (note that it will automatically convert character variables to factors).
Your numeric values are in fact set to character, so they need to be set to numeric for the chart to make sense.
To cover point #2, I'm using group_by() to calculate percentages within each name.
I have also fixed the labels so that they are properly dodged along with the bar chart. Also, note that you don't need to call ggplot2 or magrittr if you are calling tidyverse - those packages come along with it already.
df_plot <- df2 |>
mutate(name = fct_relevel(name, "CTP"),
value = as.numeric(value)) |>
group_by(name) |>
mutate(perc = value / sum(value)) |>
ungroup()
ggplot(df_plot, aes(x = value, y = avenues, fill = name)) +
geom_col(position = "dodge", width = 0.75) +
geom_text(aes(label = value), position = position_dodge(width = 0.75), size = 5) +
labs(title = "Flytipping & Collection %", x = "Percentageperward", y = "ward_name") +
guides(fill = guide_legend(reverse = TRUE))
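To see why the conversion to numeric and the within-group percentages matter, here is a small base R sketch using the collection amounts from the question. The ave() call is a base R stand-in for group_by() plus mutate(), not the code used above, and the name/value vectors in the second half are made up for illustration.

```r
# specify_decimal() from the question formats numbers as text
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall = k))

collection_amount <- c(5, 15, 10, 2000)
ctp <- specify_decimal(collection_amount / sum(collection_amount) * 100, 1)
print(is.character(ctp))     # the values are strings at this point
ctp_num <- as.numeric(ctp)   # convert back before mapping to an axis

# Base R stand-in for group_by(name) |> mutate(perc = value / sum(value)):
# ave() applies the function within each level of `name`
name  <- c("FTP", "CTP", "FTP", "CTP")
value <- c(40, 0.2, 60, 98.5)          # made-up values for illustration
perc  <- ave(value, name, FUN = function(v) v / sum(v))

# Each group's shares add up to 1
group_totals <- tapply(perc, name, sum)
print(group_totals)
```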
I've tried everywhere to find the answer to this question but I am still stuck, so here it is:
I have a data frame data_1 that contains data from an ongoing latent profile analysis. The variables of interest for this question are profiles and gender.
I would like to plot gender distribution by profile, but within each profile show what % of each gender we have compared to the entire sample of this gender. For example, if we have 10 women and 5 of them are in Profile 1, I want the text on top of the women bar for Profile 1 to show 50%.
Right now I am using the following code but it is giving me the percentage for the entire population, while I just want the percentage compared to the total number of women.
ggplot(data = subset(data_1, !is.na(gender)),
aes(x = gender, fill = gender)) + geom_bar() +
facet_grid(cols=vars(profiles)) + theme_minimal() +
scale_fill_brewer(palette = 'Accent', name = "Gender",
labels = c("Non-binary", "Man", "Woman")) +
labs(x = "Gender", title = "Gender distribution per LPA profile") +
geom_text(aes(y = ((..count..)/sum(..count..)),
label = scales::percent((..count..)/sum(..count..))),
stat = "count", vjust = -28)
Thanks in advance for your help!
I tried multiple alternatives including creating the variable within the dataset using summarize and mutate but with no success unfortunately.
As untidy as it seems, it's likely the best approach to summarise outside of the ggplot2 call, which can be done like this:
library(tidyverse)
data1 <- tibble(gender = sample(c("male", "female"), 100, replace = TRUE),
profile = sample(c("profile1", "profile2"), 100, replace = TRUE))
data1 |>
count(gender, profile) |>
group_by(gender) |>
mutate(perc = n / sum(n)) |>
ggplot(aes(x = gender, y = n, fill = gender)) +
geom_col() +
facet_grid(~profile) +
geom_text(aes(y = n + 3, label = scales::percent(perc)))
facet_grid essentially groups the dataset by profile before any values are calculated, so each facet is blind to the data in the other facet. I think the only approach is thus to summarise before the call and use geom_col (which defaults to stat = "identity") to make the plots. Note that the y value for the labels is calculated from the count variable, so the text is positioned relative to the counted heights of the bars.
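The summarise-first idea can be checked on a tiny made-up sample matching the example in the question (10 women, 5 of them in Profile 1). This sketch uses base R's table() and prop.table() as a stand-in for count() plus group_by(); the data and the name toy are invented for illustration.

```r
# Made-up sample: 10 women (5 per profile) and 4 men (1 and 3)
toy <- data.frame(
  gender  = c(rep("Woman", 10), rep("Man", 4)),
  profile = c(rep("Profile 1", 5), rep("Profile 2", 5),
              rep("Profile 1", 1), rep("Profile 2", 3))
)

# Cross-tabulate gender (rows) by profile (columns)
counts <- table(toy$gender, toy$profile)

# margin = 1 divides each row by that gender's total across all profiles
within_gender <- prop.table(counts, margin = 1)
print(within_gender)
```

Each row of within_gender sums to 1, which is the "percentage of this gender" the labels need, rather than the percentage of the whole sample.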
Edit - actually no, there's a "simpler" way
I tell a lie, you can actually do it in the ggplot2 call, but it's a little messier:
data1 |>
ggplot(aes(x = gender, fill = gender)) +
geom_bar() +
facet_grid(~ profile) +
stat_count(aes(y = after_stat(count) + 2,
label = scales::percent(after_stat(count) /
tapply(after_stat(count),
after_stat(group),
sum)[after_stat(group)]
)),
geom = "text")
Code borrowed from here. The after_stat(group) part is accessing the grouped gender count across both facets. Today I learned something!
I have a data frame (a tibble) like this:
library(tidyverse)
library(lubridate)
x = tibble(date=c("2022-04-25 07:04:07", "2022-04-25 07:09:07", "2022-04-25 07:14:07", "2022-04-26 07:04:07"),
value=c("on", "off", "on", "off"))
x$day<- as.factor(day(x$date))
x$time <- paste0(str_pad(hour(x$date),2,pad="0"),":",str_pad(minute(x$date),2,pad="0"))
When I plot the data:
x %>% ggplot() + geom_col(aes(x=day,y=time, fill=value))
the times in the y axis do not follow the bars. Each time data is supposed to be side by side with each bar segment.
I tried using as.factor(time) but that didn't solve it.
I also tried to add a numeric scale:
x = tibble(date=c("2022-04-25 07:04:07", "2022-04-25 07:09:07", "2022-04-25 07:14:07", "2022-04-26 07:04:07"),
fake_y=c(1,1,1,1),
value=c("on", "off", "on", "off"))
x$day <- as.factor(day(x$date))
x %>% ggplot() + geom_col(aes(x=day,y=fake_y, fill=value))
but then the order of the on/off bars is lost.
How can I fix this?
Since you are looking for a time line, you would probably be best with geom_segment rather than geom_col. The reason is that since you might have multiple 'on' or 'off' values in a single day, it would be difficult to get these to stack correctly. You would also need to diff the on-off times to get them to stack. Furthermore, your labels would be wrong using columns if "off" represents the time of going from an on state to an off state.
When working with times in R, it is often best to keep them in time format for plotting. If you convert times to character strings before plotting, they will be interpreted as factor levels, and therefore will not be proportionately spaced correctly.
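A small base R illustration of that point, using made-up clock times: as factor levels they sit at evenly spaced positions, while as seconds since midnight the gaps reflect the real elapsed time.

```r
times <- c("07:04", "07:09", "07:14", "09:00")   # made-up times

# As strings, each distinct time becomes one factor level, evenly spaced
as_levels <- as.integer(factor(times))

# As a time quantity (seconds since midnight) the spacing is proportional
secs <- as.numeric(as.difftime(times, format = "%H:%M", units = "secs"))

print(diff(as_levels))   # equal steps regardless of the real gap
print(diff(secs))        # steps in seconds between consecutive times
```

hms values are likewise stored as seconds since midnight, which is presumably why the y limits in the code that follows are plain numbers in the 25000 to 26500 range.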
Since you want to have the day along one axis, you will need quite a bit of data manipulation to ensure that you record the state at the start of each day and the end of each day, but it can be achieved by doing:
p <- x %>%
mutate(date = as.POSIXct(date)) %>%
mutate(day = as.factor(day(date))) %>%
group_by(day) %>%
group_modify(~ add_row(.x,
date = floor_date(as.POSIXct(first(.x$date)), 'day'),
value = ifelse(first(.x$value) == 'on', 'off', 'on'),
.before = 1)) %>%
group_modify(~ add_row(.x,
date = ceiling_date(as.POSIXct(last(.x$date)), 'day') - 1,
value = last(.x$value))) %>%
mutate(ends = lead(date)) %>%
filter(!is.na(ends)) %>%
mutate(date = hms::as_hms(date), ends = hms::as_hms(ends)) %>%
ggplot(aes(x = day, y = date)) +
# linewidth replaces size for line geoms in ggplot2 >= 3.4.0
geom_segment(aes(xend = day, yend = ends, color = value),
linewidth = 20) +
coord_cartesian(ylim = c(25120, 26500)) +
labs(y = 'time') +
guides(color = guide_legend(override.aes = list(linewidth = 8)))
p
And of course, you can easily flip the co-ordinates if you wish, and apply theme elements to make the plot more appealing:
p + coord_flip(ylim = c(25120, 26500)) +
scale_color_manual(values = c('deepskyblue4', 'orange')) +
theme_light(base_size = 16)
I'm currently trying to make my own graphical timeline like the one at the bottom of this page. I scraped the table from that link using the rvest package and cleaned it up.
Here is my code:
library(tidyverse)
library(rvest)
library(ggthemes)
library(lubridate)
URL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
justices <- URL %>%
read_html %>%
html_node("table.wikitable") %>%
html_table(fill = TRUE) %>%
data.frame()
# Removes weird row at bottom of the table
n <- nrow(justices)
justices <- justices[1:(n - 1), ]
# Separating the information I want
justices <- justices %>%
separate(Justice.2, into = c("name","year"), sep = "\\(") %>%
separate(Tenure, into = c("start", "end"), sep = "\n–") %>%
separate(end, into = c("end", "reason"), sep = "\\(") %>%
select(name, start, end)
# Removes wikipedia tags in start column
justices$start <- gsub('\\[e\\]$|\\[m\\]|\\[j\\]$$','', justices$start)
justices$start <- mdy(justices$start)
# This will replace incumbencies with NA
justices$end <- mdy(justices$end)
# Incumbent judges are still around!
justices[is.na(justices)] <- today()
justices$start = as.Date(justices$start, format = "%m/%d/%Y")
justices$end = as.Date(justices$end, format = "%m/%d/%Y")
justices %>%
ggplot(aes(reorder(x = name, X = start))) +
geom_segment(aes(xend = name,
yend = start,
y = end)) +
coord_flip() +
scale_y_date(date_breaks = "20 years", date_labels = "%Y") +
theme(axis.title = element_blank()) +
theme_fivethirtyeight() +
NULL
This is the output from ggplot (I'm not worried about aesthetics yet I know it looks terrible!):
The goal for this plot is to order the judges chronologically by their start date, so the judge with the oldest start date should be at the bottom while the judge with the most recent should be at the top. As you can see, there are multiple instances where this rule is broken.
Instead of sorting chronologically, it simply lists the judges as the order they appear in the data frame, which is also the order Wikipedia has it in.
Therefore, a line segment above another segment should always start further right than the one below it
My understanding of reorder is that it will take the X = start from geom_segment and sort that and list the names in that order.
The only help I could find to this problem is to factor the dates and then order them that way, however I get the error
Error: Invalid input: date_trans works with objects of class Date only.
Thank you for your help!
You can make the name column a factor and use forcats::fct_reorder to reorder names based on start date. fct_reorder can take a function that's used for ordering start; you can use min() to order by the earliest start date for each justice. That way, judges with multiple start dates will be sorted according to the earliest one. It is only a two-line change: add a mutate at the beginning of the pipe, and remove the reorder inside aes.
justices %>%
mutate(name = as.factor(name) %>% fct_reorder(start, min)) %>%
ggplot(aes(x = name)) +
geom_segment(aes(xend = name,
yend = start,
y = end)) +
coord_flip() +
scale_y_date(date_breaks = "20 years", date_labels = "%Y") +
theme(axis.title = element_blank()) +
theme_fivethirtyeight()
Created on 2018-06-29 by the reprex package (v0.2.0).
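For a quick look at what fct_reorder() does with min, here is a toy sketch. It assumes the forcats package is installed; the data frame toy and its dates are made up, with "B" appearing twice the way a justice with two start dates would.

```r
library(forcats)

# Made-up tenures; "B" has two start dates
toy <- data.frame(
  name  = c("A", "B", "C", "B"),
  start = as.Date(c("1800-01-01", "1795-01-01", "1790-01-01", "1810-01-01"))
)

# Default factor order is alphabetical
alphabetical <- levels(factor(toy$name))

# Order the names by each one's earliest start date
by_start <- fct_reorder(factor(toy$name), toy$start, .fun = min)

print(alphabetical)
print(levels(by_start))
```

Because ggplot2 draws discrete axes in level order, the first level ends up at the bottom after coord_flip().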
I would make this a comment, but I couldn't fit it.
This was an attempt I gave up on. It looks like it actually does fix the problem, but it broke several other aspects of the formatting and I've run out of time to fix it back.
justices <- justices[order(justices$start, decreasing = TRUE),]
any(diff(justices$start) > 0) # FALSE, i.e. it works
justices$id <- nrow(justices):1
ggplot(data=justices, mapping=aes(x = start, y=id)) +
scale_x_date(date_breaks = "20 years", date_labels = "%Y") +
# id is numeric, so a continuous scale is needed to attach the name labels
scale_y_continuous(breaks=justices$id, labels = justices$name) +
geom_segment(aes(xend = end, yend = id), size = 5) +
theme(axis.title = element_blank()) +
theme_fivethirtyeight()
Please also refer to this thread. GL!