I want to create a plot in R that combines side-by-side bars with line charts, as follows:
I tried:
Total <- c(584,605,664,711,759,795,863,954,1008,1061,1117,1150)
Infected <- c(366,359,388,402,427,422,462,524,570,560,578,577)
Recovered <- c(212,240,269,301,320,359,385,413,421,483,516,548)
Death <- c(6,6,7,8,12,14,16,17,17,18,23,25)
day <- itemizeDates(startDate="01.04.20", endDate="12.04.20")
df <- data.frame(Day=day, Infected=Infected, Recovered=Recovered, Death=Death, Total=Total)
value_matrix = matrix(, nrow = 2, ncol = 12)
value_matrix[1,] = df$Recovered
value_matrix[2,] = df$Death
plot(c(1:12), df$Total, ylim=c(0,1200), xlim=c(1,12), type = "b", col="peachpuff", xaxt="n", xlab = "", ylab = "")
points(c(1:12), df$Infected, type = "b", col="red")
barplot(value_matrix, beside = TRUE, col = c("green", "black"), width = 0.35, add = TRUE)
But the bar chart does not fit the line chart. I guess it would be easier to use ggplot2, but don't know how. Could anyone help me? Thanks a lot in advance!
With ggplot2, the margins are handled nicely for you, but you'll need the data in two separate long forms. Reshape from wide to long with tidyr::gather, tidyr::pivot_longer, reshape2::melt, reshape, or whatever you prefer.
library(tidyr)
library(ggplot2)
df <- data.frame(
Total = c(584,605,664,711,759,795,863,954,1008,1061,1117,1150),
Infected = c(366,359,388,402,427,422,462,524,570,560,578,577),
Recovered = c(212,240,269,301,320,359,385,413,421,483,516,548),
Death = c(6,6,7,8,12,14,16,17,17,18,23,25),
day = seq(as.Date("2020-04-01"), as.Date("2020-04-12"), by = 'day')
)
ggplot(
tidyr::gather(df, Population, count, Total:Infected),
aes(day, count, color = Population, fill = Population)
) +
geom_line() +
geom_point() +
geom_col(
data = tidyr::gather(df, Population, count, Recovered:Death),
position = 'dodge', show.legend = FALSE
)
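In current tidyr, gather() is superseded by pivot_longer(). A minimal sketch of the same plot with pivot_longer(), repeating the data so it runs on its own (the names lines_long and bars_long are only illustrative, not from the original answer):

```r
library(tidyr)
library(ggplot2)

df <- data.frame(
  Total = c(584,605,664,711,759,795,863,954,1008,1061,1117,1150),
  Infected = c(366,359,388,402,427,422,462,524,570,560,578,577),
  Recovered = c(212,240,269,301,320,359,385,413,421,483,516,548),
  Death = c(6,6,7,8,12,14,16,17,17,18,23,25),
  day = seq(as.Date("2020-04-01"), as.Date("2020-04-12"), by = "day")
)

# Long form for the two series drawn as lines
lines_long <- pivot_longer(df, c(Total, Infected),
                           names_to = "Population", values_to = "count")

# Long form for the two series drawn as dodged bars
bars_long <- pivot_longer(df, c(Recovered, Death),
                          names_to = "Population", values_to = "count")

p <- ggplot(lines_long, aes(day, count, color = Population, fill = Population)) +
  geom_line() +
  geom_point() +
  geom_col(data = bars_long, position = "dodge", show.legend = FALSE)
p
```

Each long data frame has one row per day and series, which is what lets ggplot map the series name to colour and dodge the bars.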
Another way to do it is to gather twice before plotting. Not sure if this is easier or harder to understand, but you get the same thing.
df %>%
tidyr::gather(Population, count, Total:Infected) %>%
tidyr::gather(Resolution, count2, Recovered:Death) %>%
ggplot(aes(x = day, y = count, color = Population)) +
geom_line() +
geom_point() +
geom_col(
aes(y = count2, color = Resolution, fill = Resolution),
position = 'dodge', show.legend = FALSE
)
You can actually plot the lines and points without reshaping by making separate calls for each, but to dodge bars (or get legends), you'll definitely need to reshape.
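As a rough sketch of the no-reshaping route for the lines and points only (this is one way to write it, not necessarily what the answerer had in mind), each series gets its own layer and a constant string inside aes() provides the legend entry:

```r
library(ggplot2)

df <- data.frame(
  Total = c(584,605,664,711,759,795,863,954,1008,1061,1117,1150),
  Infected = c(366,359,388,402,427,422,462,524,570,560,578,577),
  day = seq(as.Date("2020-04-01"), as.Date("2020-04-12"), by = "day")
)

# One geom_line()/geom_point() pair per wide column; no reshaping needed
p <- ggplot(df, aes(x = day)) +
  geom_line(aes(y = Total, color = "Total")) +
  geom_point(aes(y = Total, color = "Total")) +
  geom_line(aes(y = Infected, color = "Infected")) +
  geom_point(aes(y = Infected, color = "Infected")) +
  labs(y = "count", color = "Population")
p
```

This gets verbose quickly as the number of series grows, which is the main argument for reshaping.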
I am processing this dataset (bottom of the page) in R for a project.
First I load in the data:
count_data <- read.table(file = "../data/GSE156388_read_counts.tsv", header = T, sep = "",
row.names = 1)
I then melt the data using reshape2:
melted_count_data <- reshape2::melt(count_data)
Then I create a factor for colouring graphs by group:
color_groups <- factor(melted_count_data$variable, labels = rep(c("siTFIP11", "siGl3"), each = 3))
Now we get to the barplot I'm trying to make:
ggplot(melted_count_data, aes(x = variable, y = value / 1e6, fill = color_groups)) +
geom_bar(stat = "identity") + labs(title = "Read counts", y = "Sequencing depth (millions of reads)")
The problem is that this creates a barplot with a bunch of stripes, leading me to believe it is trying to stack a ton of bars on top of each other instead of just creating one solid block.
I also wanted to add data labels to the plot:
+ geom_text(label = value / 1e6)
but this seemed to just put a bunch of values on top of each other.
For the stacked bars problem I tried to use y = sum(values) but this just made all the bars the same height. I also tried using y = colSums(values) but this obviously didn't work because it needs "an array of at least two dimensions".
I tried figuring it out using the unmelted data but to no avail.
I just kind of gave up on the labels since I wasn't even able to fix the bars problem.
EDIT:
I found a thread suggesting this:
ggplot(melted_count_data, aes(x = variable, y = value / 1e6, color = color_groups)) +
geom_bar(stat = "identity") + labs(title = "Read counts", y = "Sequencing depth (millions of reads)")
Changing fill to color. This fixes the white lines but results in some (fewer) black lines. Looking at this new chart leads me to believe it might actually be pasting a bunch of charts on top of each other?
You could do:
library(tidyverse)
url <- paste0( "https://www.ncbi.nlm.nih.gov/geo/download/",
"?acc=GSE156388&format=file&file=GSE156388%5",
"Fread%5Fcounts%2Etsv%2Egz")
tmpfile <- tempfile()
download.file(url, tmpfile)
count_data <- readr::read_tsv(gzfile(tmpfile),
show_col_types = FALSE)
count_data %>%
pivot_longer(-1) %>%
mutate(color_groups = factor(name,
labels = rep(c("siTFIP11", "siGl3"), each = 3))) %>%
group_by(name) %>%
summarise(value = sum(value)/1e6, color_groups = first(color_groups)) %>%
ggplot(aes(name, value, fill = color_groups)) +
geom_col() +
geom_text(aes(label = round(value, 2)), nudge_y = 0.5) +
labs(title = "Read counts", x = "", fill = "Type",
y = "Sequencing depth (millions of reads)") +
scale_fill_manual(values = c("gold", "deepskyblue3")) +
theme_minimal()
Created on 2022-03-21 by the reprex package (v2.0.1)
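The stripes in the question come from one bar segment per input row being stacked on top of the others, so the key step is summing per sample before plotting. A small sketch of just that step, using made-up counts (the toy data frame and its values are hypothetical, not from GSE156388):

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format counts: three "genes" for each of two samples
toy <- data.frame(
  name = rep(c("s1", "s2"), each = 3),
  value = c(1e6, 2e6, 3e6, 2e6, 2e6, 4e6)
)

# Collapse to one row per sample, in millions of reads
totals <- toy %>%
  group_by(name) %>%
  summarise(value = sum(value) / 1e6)

# One solid bar per sample, with a single label above each bar
p <- ggplot(totals, aes(name, value)) +
  geom_col() +
  geom_text(aes(label = value), vjust = -0.5)
p
```

Because totals has exactly one row per bar, geom_text() also draws exactly one label per bar instead of a pile of overlapping values.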
I have a problem with one of my charts. It has four data sets: three have the same length and one is a month longer. Only the longest data set shows its label at the end of its line.
I'm trying to get all four labels, one per line series, to show on the chart, but I can only get the label for the longest series. Any thoughts and ideas would be greatly appreciated!
The code and the chart output are below.
library(GetBCBData)
library(ggplot2)
library(dplyr)
library(ggrepel)
# set ids
id.series <- c(ICC_sprd_total = 27443,
ICC_sprd_corps = 27444,
ICC_sprd_indivs = 27445,
SELIC = 4189)
first.date = '2013-01-01'
# get series from bcb
df_cred <- gbcbd_get_series(id = id.series,
first.date = first.date,
last.date = Sys.Date(),
use.memoise = FALSE)
glimpse(df_cred)
p <- ggplot(df_cred, aes(x =ref.date, y = value, colour = series.name)) +
geom_line() +
geom_label_repel(data = df_cred %>%
slice(which.max(ref.date)),
aes(label = value),
nudge_x = 0.05,
show.legend = FALSE,
size = 4.5) +
scale_y_continuous(limits = c(0,NA), expand = c(0,0)) +
geom_hline(yintercept=0)
print(p)
The original code is identifying points whose ref.date match the latest ref.date in the data; what you want is the latest ref.date within each series, which you can get by grouping first.
...
geom_label_repel(data = df_cred %>%
group_by(series.name) %>% # ADD THIS
slice(which.max(ref.date)),
...
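To see the difference on something small, here is a sketch with a made-up data frame (toy and its values are hypothetical; only the column names follow the question):

```r
library(dplyr)

# Series "A" runs one month longer than series "B"
toy <- data.frame(
  series.name = c("A", "A", "A", "B", "B"),
  ref.date = as.Date(c("2021-01-01", "2021-02-01", "2021-03-01",
                       "2021-01-01", "2021-02-01")),
  value = c(1, 2, 3, 10, 20)
)

# Ungrouped: a single row, the latest date in the whole data set
last_overall <- toy %>%
  slice(which.max(ref.date))

# Grouped: the latest row within each series
last_per_series <- toy %>%
  group_by(series.name) %>%
  slice(which.max(ref.date)) %>%
  ungroup()
```

Passing last_per_series as the data argument of the label layer gives one label per line, even when the series end on different dates.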
I have two dataframes and I want to plot a comparison between them. The plot and dataframes look like so
df2019 <- data.frame(Role = c("A","B","C"),Women_percent = c(65,50,70),Men_percent = c(35,50,30), Women_total =
c(130,100,140), Men_total = c(70,100,60))
df2016 <- data.frame(Role= c("A","B","C"),Women_percent = c(70,45,50),Men_percent = c(30,55,50),Women_total =
c(140,90,100), Men_total = c(60,110,100))
all_melted <- reshape2::melt(
rbind(cbind(df2019, year=2019), cbind(df2016, year=2016)),
id=c("year", "Role"))
There's no reason I need the data in melted form; I only did it because I was plotting bar graphs with it. Now I need a line graph, and I don't know how to make line graphs from melted data, or how to keep the 2019/2016 tag if the data isn't melted. When I try to make a line graph I don't know how to specify which "variable" will be used. I want the lines to be the Women and Men percent values, and the labels to be the totals. (In this picture the geom_text shows the percent values; I want it to use the total values.)
Crucially, I want the linetype to be dotted in 2016 and for the legend to show that.
I think it would be simplest to rbind the two frames after labelling them with their year, then reshape the result so that you have columns for role, year, gender, percent and total.
I would then use a bit of alpha scale trickery to hide the points and labels from 2016:
library(tidyr)
library(ggplot2)
df2016$year <- 2016
df2019$year <- 2019
rbind(df2016, df2019) %>%
pivot_longer(cols = 2:5, names_sep = "_", names_to = c("Gender", "Type")) %>%
pivot_wider(names_from = Type) %>%
ggplot(aes(Role, percent, color = Gender,
linetype = factor(year),
group = paste(Gender, year))) +
geom_line(linewidth = 1.3) +
geom_point(size = 10, aes(alpha = year)) +
geom_text(aes(label = total, alpha = year), colour = "black") +
scale_colour_manual(values = c("#07aaf6", "#ef786f")) +
scale_alpha(range = c(0, 1), guide = guide_none()) +
scale_linetype_manual(values = c(2, 1)) +
labs(y = "Percent", color = "Gender", linetype = "Year")
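To make the reshaping step easier to follow, here is a sketch of what the two pivots do to one of the frames on its own (the intermediate names long and wide are only illustrative):

```r
library(tidyr)

df2019 <- data.frame(
  Role = c("A", "B", "C"),
  Women_percent = c(65, 50, 70), Men_percent = c(35, 50, 30),
  Women_total = c(130, 100, 140), Men_total = c(70, 100, 60)
)

# Split names like "Women_percent" into Gender = "Women", Type = "percent"
long <- pivot_longer(df2019, cols = 2:5,
                     names_sep = "_", names_to = c("Gender", "Type"))

# Spread Type back out so percent and total become their own columns
wide <- pivot_wider(long, names_from = Type)
wide
```

The result has one row per Role and Gender, with percent available for the y aesthetic and total available for the labels.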
I'm just learning ggplot, so my apologies if this is a really basic question. I have data that has been aggregated by year with a few different qualities to slice on (the code below will generate sample data). I'm trying to show a few different charts: one that shows the overall total for a given metric, then a couple that show the same metric split across the qualities, but it's not going right. Ideally, I want to make the plot once, then call the geom layer for each of the individual charts. I have examples of how I want it to look in the code as well.
I'm starting to think this is a data structure issue, but really can't figure it out.
Secondary question - My years are formatted as integers, is that the best way to do that here, or should I convert them to dates?
library(data.table)
library(ggplot2)
#Generate Sample Data - Yearly summarized data
BaseData <- data.table(expand.grid(dataYear = rep(2010:2017),
Program = c("A","B","C"),
Indicator = c("0","1")))
set.seed(123)
BaseData$Metric1 <- runif(nrow(BaseData),min = 10000,100000)
BaseData$Metric2 <- runif(nrow(BaseData),min = 10000,100000)
BaseData$Metric3 <- runif(nrow(BaseData),min = 10000,100000)
BP <- ggplot(BaseData, aes(dataYear,Metric1))
BP + geom_area() #overall Aggregate
BP + geom_area(position = "stack", aes(fill = Program)) #Stacked by Program
BP + geom_area(position = "stack", aes(fill = Indicator)) #stacked by Indicator
#How I want them to look
##overall Aggregate
BP.Agg <- BaseData[,.(Metric1 = sum(Metric1)),
by = dataYear]
ggplot(BP.Agg,aes(dataYear, Metric1))+geom_area()
##Stacked by Program
BP.Pro <- BaseData[,.(Metric1 = sum(Metric1)),
by = .(dataYear,
Program)]
ggplot(BP.Pro,aes(dataYear, Metric1, fill = Program))+geom_area(position = "stack")
##stacked by Indicator
BP.Ind <- BaseData[,.(Metric1 = sum(Metric1)),
by = .(dataYear,
Indicator)]
ggplot(BP.Ind,aes(dataYear, Metric1, fill = Indicator))+geom_area(position = "stack")
I was right, it was an easy fix. I should have used stat_summary instead of geom_area. Here are the correct layers to add (the argument is fun in ggplot2 3.3.0 and later; older versions call it fun.y):
BP + stat_summary(fun = sum, geom = "area")
BP + stat_summary(fun = sum, geom = "area", position = "stack", aes(fill = Program, group = Program))
BP + stat_summary(fun = sum, geom = "area", position = "stack", aes(fill = Indicator, group = Indicator))
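A small sketch of why this works, on a made-up data set (toy and its values are hypothetical, and the fun argument assumes ggplot2 3.3.0 or later): stat_summary() sums the y values at each x before the area is drawn, which is the same thing as aggregating by hand first.

```r
library(ggplot2)

# Two programs, three years; expand.grid varies dataYear fastest
toy <- expand.grid(dataYear = 2010:2012, Program = c("A", "B"))
toy$Metric1 <- c(1, 2, 3, 10, 20, 30)

# Let ggplot do the aggregation inside the layer
p_stat <- ggplot(toy, aes(dataYear, Metric1)) +
  stat_summary(fun = sum, geom = "area")

# Equivalent: aggregate first, then draw a plain area
agg <- aggregate(Metric1 ~ dataYear, data = toy, FUN = sum)
p_agg <- ggplot(agg, aes(dataYear, Metric1)) +
  geom_area()
```

Both plots should show the same yearly totals; the stat_summary() version just keeps the base plot reusable across different groupings.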
I am trying to connect jittered points between measurements from two different methods (measure) on the x-axis. These measurements are linked to one another by the probands (a), who can be separated into two main groups, patients (pat) and controls (ctr).
My df is like that:
set.seed(1)
df <- data.frame(a = rep(paste0("id", "_", 1:20), each = 2),
value = sample(1:10, 40, rep = TRUE),
measure = rep(c("a", "b"), 20), group = rep(c("pat", "ctr"), each = 2,10))
I tried
library(ggplot2)
ggplot(df,aes(measure, value, fill = group)) +
geom_point(position = position_jitterdodge(jitter.width = 0.1, jitter.height = 0.1,
dodge.width = 0.75), shape = 1) +
geom_line(aes(group = a), position = position_dodge(0.75))
Created on 2020-01-13 by the reprex package (v0.3.0)
I used the fill aesthetic in order to separate the jittered dots from both groups (pat and ctr). I realised that when I put the group = a aesthetics into the ggplot main call, then it doesn't separate as nicely, but seems to link better to the points.
My question: Is there a way to better connect the lines to the (jittered) points, but keeping the separation of the two main groups, ctr and pat?
Thanks a lot.
The big issue you are having is that you are dodging the points by only group but the lines are being dodged by a, as well.
To keep your lines with the axes as is, one option is to manually dodge your data. This takes advantage of factors being integers under the hood, moving one level of group to the right and the other to the left.
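# factor() keeps the trick working when measure is a character column
# (the data.frame() default since R 4.0)
df = transform(df, dmeasure = ifelse(group == "ctr",
as.numeric(factor(measure)) - .25,
as.numeric(factor(measure)) + .25 ) )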
You can then make a plot with measure as the x axis but then use the "dodged" variable as the x axis variable in geom_point and geom_line.
ggplot(df, aes(x = measure, y = value) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
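The positions behind the manual dodge can be checked on a few values in base R. This is only an illustration with made-up vectors (measure, group and pos here are stand-ins, not the columns of df):

```r
measure <- c("a", "b", "a", "b")
group <- c("pat", "pat", "ctr", "ctr")

# Factor levels "a" and "b" map to the integer positions 1 and 2 on the axis
pos <- as.numeric(factor(measure))

# Shift controls left and patients right of the axis position
dmeasure <- ifelse(group == "ctr", pos - 0.25, pos + 0.25)
dmeasure
```

Since the discrete x axis places the levels at 1 and 2, the shifted values land a quarter unit to either side of each tick, and points and lines share exactly the same coordinates.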
If you also want jittering, that can also be added manually to both your x and y variables.
df = transform(df, dmeasure = ifelse(group == "ctr",
jitter(as.numeric(factor(measure)) - .25, .1),
jitter(as.numeric(factor(measure)) + .25, .1) ),
jvalue = jitter(value, amount = .1) )
ggplot(df, aes(x = measure, y = jvalue) ) +
geom_blank() +
geom_point( aes(x = dmeasure), shape = 1 ) +
geom_line( aes(group = a, x = dmeasure) )
This turned out to be an astonishingly common question and I'd like to add an answer/comment to myself with a suggestion of a - what I now think - much, much better visualisation:
The scatter plot.
I originally intended to show paired data and visually guide the eye between the two comparisons. The problem with this visualisation is evident: Every subject is visualised twice. This leads to a quite crowded graphic. Also, the two dimensions of the data (measurement before, and after) are forced into one dimension (y), and the connection by ID is awkwardly forced onto your x axis.
Plot 1: The scatter plot naturally represents the ID by showing only one point per subject, with the two dimensions on x and y. The only step needed is to make your data wider (yes, this is sometimes necessary; ggplot does not always require long data).
The box plot
Plot 2: As rightly pointed out by user AllanCameron, another option would be to plot the difference of the paired values directly, for example as a boxplot. This is a nice visualisation of the appropriate paired t-test where the mean of the differences is tested against 0. It will require the same data shaping to "wide format". I personally like to show the actual values as well (if there are not too many).
library(tidyr)
library(dplyr)
library(ggplot2)
## first reshape the data wider (one column for each measurement)
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
## now use the new columns for your scatter plot
ggplot() +
geom_point(aes(time_a, time_b, color = group)) +
## you can add a line of equality to make it even more intuitive
geom_abline(intercept = 0, slope = 1, lty = 2, linewidth = .2) +
coord_equal()
Box plot to show differences of paired values
df %>%
pivot_wider(names_from = "measure", values_from = "value", names_prefix = "time_" ) %>%
ggplot(aes(x = "", y = time_a - time_b)) +
geom_boxplot() +
# optional, if you want to show the actual values
geom_point(position = position_jitter(width = .1))
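The link between the difference plot and the paired t-test can be checked in base R. A small sketch with made-up numbers (before and after are hypothetical values, not the df above):

```r
before <- c(5, 7, 6, 9, 4)
after <- c(6, 9, 6, 12, 5)

# Paired t-test on the two measurements
paired <- t.test(before, after, paired = TRUE)

# One-sample t-test of the differences against 0
one_sample <- t.test(before - after, mu = 0)

# Both tests give the same statistic and p-value
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
```

So a box plot of time_a - time_b shows exactly the quantity the paired test examines, with 0 as the reference line.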