Related
I have a time series data (date column and a value column). I am trying for a daily distribution plot.
In the below image is the weekly distribution plot that plots the values of the days of the week. Similarly I am trying to plot a daily distribution plot where x axis would be months, y axis is the value and the plot has 10 lines where each line gives you the date 1, date 2 , date 3 and so on until date 10 (since 30 days in one subplot will be clumsy so i wanted to divide the plots into 3 , 1-10, 11-20 and 21-31)
Code for weekly distribution for reference:
#dummy data
start_date <- as.Date("2020-01-01")
end_date <- as.Date("2021-12-31")
date_seq <- seq(from = start_date, to = end_date, by = "day")
set.seed(123)
value <- round(runif(length(date_seq), min = 10000, max = 100000000), 0)
df <- data.frame(date = date_seq, value = value)
df$week_number <- as.numeric(format(as.Date(df$date), "%U")) + 1
df$weekday <- weekdays(as.Date(df$date))
df$year <- as.numeric(format(as.Date(df$date), "%Y"))
years <- unique(df$year)
# Create a list of ggplots, one for each year
plots <- lapply(years, function(y) {
year_df <- df[df$year == y, ]
ggplot(year_df, aes(x = week_number, y = value, color = weekday)) +
geom_line() +
scale_color_discrete(limits = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) +
ggtitle(paste("Weekday Distribution", y)) +
xlab("Week number") +
ylab("Value") +
theme(legend.key.size = unit(0.4, "cm")) +
theme(plot.title = element_text(hjust = 0.5, vjust = 1.5))
library(cowplot)
plot_grid(plotlist = plots, ncol = 1)
So at the end, there will be three plots(1 to 10 dates, 11 to 20 dates and 21 to 31 dates) and each plot would contain 2 subplots (as the dates ranges from 2020 to 2021). Can anyone help me with this?
Below how I would do this. The lubridate package is your friend. For the grouping, use cuts.
The result is a (in my opinion) pretty useless clutter of lines. But this is not the only reason why I do not endorse this visualisation. I feel this somehow defeats the point of a time series... one point is to visualise the auto-correlation of your data. Artificially separating out only specific days from each month impacts drastically on this particular advantage (and maybe: reason) of using a time series. You're not only losing information, but also making your own analytical life much more complicated.
library(ggplot2)
library(dplyr)
library(lubridate)
df %>%
mutate(day = mday(date),
day_group = cut(day, c(1,11,21, 31), incl = T),
month = month(date, label = T, abbr = T)) %>%
ggplot(aes(x = month, y = value, color = day, group=interaction(day, day_group))) +
geom_line() +
theme(legend.key.size = unit(0.4, "cm"),
plot.title = element_text(hjust = 0.5, vjust = 1.5),
axis.text.x = element_text(angle = 90)) +
facet_wrap(year~day_group)
I feel you want to show how the "typical" 1st day compares with the 2nd, etc. For this, an aggregate visualisation might be more useful. (Still not a good idea, but at least you get a better idea of your data). This you can do with "stat_summary" which you pass to geom_smooth which has a geometry that combines geom_line and geom_ribbon.
df %>%
mutate(day = mday(date),
month = month(date, label = T, abbr = T)) %>%
ggplot(aes(x = day, y = value)) +
geom_smooth(stat= "summary", alpha = .5, color = "black") +
facet_grid(~year)
#> No summary function supplied, defaulting to `mean_se()`
#> No summary function supplied, defaulting to `mean_se()`
Following on tjebo's answer, I would also suggest to if you must you can simply highlight a line of code that would convey something out of the clutter of lines, here is an example if you want to highlight the 11th day from the rest.
Plot
df %>%
mutate(day = mday(date),
day_group = cut(day, c(1,11,21, 31), incl = T),
month = month(date, label = T, abbr = T),
highlight = ifelse(day == 11, "Yes", "No")) %>%
ggplot(aes(x = month, y = value, color = highlight, group=interaction(day, day_group))) +
geom_line() +
theme_bw()+
theme(plot.title = element_text(hjust = 1, vjust = 2),
axis.text.x = element_text(angle = 90)) +
scale_color_manual(breaks = c("Yes", "No"),
labels = c("11th Day", "Other"),
values = c("Yes" = "red2", "No" = "grey60")) +
facet_wrap(year~day_group) +
guides(color = guide_legend(order = 1))
I am plotting max_temperature (mean_tmax) against rainfall (mean_rain) in a mirrored barplot: max temp displayed upwards, rain values downwards on the negative scale. These two are stored in the "name" variable.
To highlight the highest values out of the 32 years plotted, I created two vectors colVecTmax, colVecRain. They return a color vector of length 32 each, with the index of max values marked differently.
But when adding these two vectors to fill within geom_bar(), it turns out that ggplot stops counting the top after 16 bars, and moves down to the negative scale to continue. So it does not count by the name (mean_tmax, or mean_rain) variable.
This messes up the plot, and I am not sure how to get ggplot count through on the top bars for max_temperature first, coloring by colVecTmax, and then move down to do the same for rain on the negative scale with colVecRain.
Can anyone give a hint on how to solve this?
colVecTmax <- rep("orange",32)
colVecTmax[which.max(as.numeric(unlist(df.long[df.long$place=="sheffield" & df.long$name == "mean_tmax",4])))] <- "blue"
colVecRain <- rep("grey",32)
colVecRain[which.max(as.numeric(unlist(df.long[df.long$place=="sheffield" & df.long$name == "mean_rain",4])))] <- "blue"
ggplot(df.long[df.long$name %in% c('mean_rain', 'mean_tmax'), ] %>% filter(place== "sheffield")%>%
group_by(name) %>% mutate(value = case_when(
name == 'mean_rain' ~ value/10 * -1,
TRUE ~ value)) %>% mutate(place==str_to_sentence(placenames)) %>%
mutate(name = recode(name,'mean_rain' = "rainfall" , "mean_tmax" = "max temp"))
, aes(x = yyyy, y = value, fill=name))+
geom_bar(stat="identity", position="identity", fill=c(colVecTmax,colVecRain))+
labs(x="Year", y=expression("Rain in cm, temperature in ("*~degree *C*")"))+
geom_smooth(colour="black", lwd=0.5,se=F)+
scale_y_continuous(breaks = seq(-30, 30 , 5))+
scale_x_continuous(breaks = seq(1990, 2025, 5))+
guides(fill= guide_legend(title=NULL))+
scale_fill_discrete(labels=c("Max temperature", "Rainfall"))+
guides(fill=guide_legend(reverse=T), res=96)
Using ggplot2 there are much easier and less error prone ways to assign colors. Instead of creating color vectors which you pass to the color or fill argument you could simply map on aesthetics (which you basically already have done) and assign your desired colors using a manual scale, e.g. scale_fill_manual. The same approach works fine when you want to highlight some values. To this end you could create additional categories, e.g. in the code below I add "_max" to the name for the observations with the max temperature or rainfall and assign your desired "blue" color to these categories. As doing so will add additional categories I use the breaks argument of scale_fill_manual so that these max categories will not show up in the legend.
Using some fake random example data:
# Create example data
set.seed(123)
df.long <- data.frame(
name = rep(c("mean_rain", "mean_tmax"), each = 30),
place = "sheffield",
yyyy = rep(1991:2020, 2),
value = c(runif(30, 40, 100), runif(30, 12, 16))
)
library(ggplot2)
library(dplyr)
df_plot <- df.long %>%
filter(name %in% c("mean_rain", "mean_tmax")) |>
filter(place == "sheffield") %>%
mutate(value = case_when(
name == "mean_rain" ~ -value / 10,
TRUE ~ value
)) |>
# Maximum values
group_by(name) |>
mutate(name = ifelse(abs(value) >= max(abs(value)), paste(name, "max", sep = "_"), name))
ggplot(df_plot, aes(x = yyyy, y = value, fill = name)) +
geom_col(position = "identity") +
geom_smooth(colour = "black", lwd = 0.5, se = F) +
scale_y_continuous(breaks = seq(-30, 30, 5), labels = abs) +
scale_x_continuous(breaks = seq(1990, 2025, 5)) +
scale_fill_manual(
values = c(
mean_rain = "orange", mean_tmax = "grey",
mean_rain_max = "blue", mean_tmax_max = "blue"
),
labels = c(mean_tmax = "Max temperature", mean_rain = "Rainfall"),
breaks = c("mean_rain", "mean_tmax")
) +
labs(x = "Year", y = expression("Rain in cm, temperature in (" * ~ degree * C * ")"), fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))
I'm trying to create a nice graph of indexed prices for a few currencies so I can track relative performance from origin for different projects and price-levels.
Below is my dummy code. I've tried a lot of things but this is as far as I got...
R plot of the orignal code: prices of HEX and BTC
I wish to add other currencies as I go along.
In the end it is just a data frame with multiple columns that all need to start on the same point, the timestamp is irrelevant and I could plot only the series or shift them all to start on the same location.
This is what I'm trying to achieve:
Indexed prices of projects starting at same origin
# Dummy data that recreates my problem - two frames with different starting dates and an indexed value of the closing price.
n1 <- 366
dat1 <- data.frame(timestamp=seq.Date(as.Date("2012-12-26"), as.Date("2013-12-26"), "day"),
index.btc=seq(from = 1, to = n1, by=1, replace=TRUE)
)
dat2 <- data.frame(timestamp=seq.Date(as.Date("2013-12-26"), as.Date("2014-12-26"), "day"),
index.hex=seq(from = 1, to = n1, by=1, replace=TRUE)
)
# Merging data
jointdataset2 <- merge(dat1, dat2, by = 'timestamp', all = TRUE)
# Creating plottable data with melt function
jointdataset_plot <- melt(jointdataset2 , id.vars = 'timestamp', variable.name = 'project')
# plot on same grid, each series colored differently --
# good if the series have same scale (they have but different starting date)
ggplot(jointdataset_plot, aes(timestamp,value)) +
geom_line(aes(colour = project)) +
scale_y_log10()
# Can also plot like this
ggplot() + geom_line(data = dat1, aes(timestamp,index.btc),
color = "blue",
size = 1) +
geom_line(data = dat2, aes(timestamp,index.hex),
color = "red",
size = 1) +
labs(x = "Time",
y = "Indexed Price",
title ="Indexed historical price (daily close index)",
subtitle = "Candlesticks - data by nomics.com") +
scale_x_date(date_labels = "%Y (%b)", date_breaks = "1 year", date_minor_breaks = "1 month") +
scale_y_log10() +
theme_bw()
If I remove the timestamps, and remove N/As from one of the data frames, would I then be able to create an ID column in both frames (starting at 1, same counter) and merging them both at ID counter 1 so origins align?
Your sample data overlaps, so I've changed dat2:
library(dplyr);library(tidyr)
n1 <- 366
n2 <- 500
dat1 <- data.frame(timestamp=seq.Date(as.Date("2012-12-26"), as.Date("2013-12-26"), "day"),
index.btc=seq(from = 1, to = n1, by=1, replace=TRUE))
dat2 <- data.frame(timestamp=seq.Date(as.Date("2013-12-26"), as.Date("2014-12-26"), "day"),
index.hex=seq(from = 1, to = n2, length.out=n1))
full_join(dat1,dat2) %>%
pivot_longer(-timestamp, names_to = "index", values_to = "price") %>%
filter(!is.na(price)) %>%
group_by(index) %>%
mutate(timestamp = as.integer(timestamp - min(timestamp))) -> plotdata
ggplot(plotdata, aes(x = as.integer(timestamp),
y = price, color = index)) +
geom_line() +
labs(x = "Time (Days)",
y = "Indexed Price",
title ="Indexed historical price (daily close index)",
subtitle = "Candlesticks - data by nomics.com") +
scale_y_log10() +
theme_bw()
n1 <- 366
dat1 <- data.frame(timestamp=seq.Date(as.Date("2012-12-26"), as.Date("2013-12-26"), "day"),
index.btc=cumsum(sample(-2:10, n1, replace=TRUE))
)
dat2 <- data.frame(timestamp=seq.Date(as.Date("2013-12-26"), as.Date("2014-12-26"), "day"),
index.hex=cumsum(sample(-2:10, n1, replace=TRUE))
)
dat1$timestamp<- seq(length(dat1$timestamp))
dat2$timestamp<- seq(length(dat2$timestamp))
# Merging data
jointdataset2 <- merge(dat1, dat2, by = 'timestamp', all = TRUE)
# Creating plottable data with melt function
jointdataset_plot <- melt(jointdataset2 , id.vars = 'timestamp', variable.name = 'project')
# plot on same grid, each series colored differently --
# good if the series have same scale (they have but different starting date)
ggplot(jointdataset_plot, aes(timestamp,value)) +
geom_line(aes(colour = project)) +
scale_y_log10()
# Can also plot like this
ggplot() + geom_line(data = dat1, aes(timestamp,index.btc),
color = "blue",
size = 1) +
geom_line(data = dat2, aes(timestamp,index.hex),
color = "red",
size = 1) +
labs(x = "Time",
y = "Indexed Price",
title ="Indexed historical price (daily close index)",
subtitle = "Candlesticks - data by nomics.com") +
scale_x_continuous() +
scale_y_log10() +
theme_bw()
I am quite new to R and especially to ggplot. For my next result I think I have to change from plot() to ggplot() where I need your help:
I have a dataframe with numeric values. One column is an absolute number, the other one is the belonging percentage value. I have 3 of this "two groups" indicators a, b and c.
The rownames are the 6 observations and are stored in the first column "X".
I want to plot them in a kind of grouped barplot, where the absolute+percent column is next to each other for the 3 indicators.
Sample dataframe:
df = data.frame(X = c("e 1","e 1,5","e 2","e 2,5","e 3","e 3,5","e 4"),
a_abs=c(-0.3693,-0.0735,-0.019,0.0015,0,-0.0224,-0.0135),
a_per=c(-0.4736,-0.0943,-0.0244,0.0019,0,-0.0287,-0.0173),
b_abs=c(-0.384,-0.0733,-0.0173,0.0034,0,-0.0204,-0.0179),
b_per=c(-0.546,-0.1042,-0.0246,0.0048,0,-0.029,-0.0255),
c_abs=c(-0.3876,-0.0738,-0.019,0.0015,0,-0.0225,-0.0137),
c_per=c(-0.4971,-0.0946,-0.0244,0.0019,0,-0.0289,-0.0176))
Thanks to #jonspring i got the following plot by using this code:
df3 <- df %>%
gather(column, value, -X) %>%
mutate(group = str_sub(column, end = 2),
stat = str_sub(column, start = 4)) %>%
select(-column) %>%
spread(stat, value) %>%
mutate(combo_label = paste(sep="\n",
scales::comma(abs, accuracy = 0.001),
scales::percent(per, accuracy = 0.01)))
df3$group = gsub(df3$group,pattern = "CK",replacement = "Cohen's\nKappa")
df3$group = gsub(df3$group,pattern = "JA",replacement = "Jaccard")
df3$group = gsub(df3$group,pattern = "KA",replacement = "Krippen-\ndorff's Alpha")
crg = ifelse(df3$abs< 0,"red","darkgreen")
ggplot(df3, aes(group, abs, label = combo_label)) +
geom_segment(aes(xend = group,
yend = 0),
color = crg) +
geom_point() +
geom_text(vjust = 1.5,
size = 3,
lineheight = 1.2) +
scale_y_continuous(expand = c(0.2,0)) +
facet_grid(~X) +
labs(x= "Exponent", y = "Wert")
plot output
When i zoom and have the positive values visible, the labels are written inside the segments. How to place them above / below depending of a positive or negative value?
Zoom with coord_cartesian(ylim = c(-0.015,0.005))
zoomed plot
Thank you for your helping hands.
EDIT: I found the solution already. Like the color changement from red to green i used ifelse for the vjust parameter.
There are a lot of varieties of ways to display this sort of data with ggplot. I highly recommend you check out https://r4ds.had.co.nz/data-visualisation.html if you haven't already.
One suggestion you'll find there is that ggplot almost always works better if you first convert your data into long (aka "tidy") form. This puts each of the dimensions of the data into its own column, so that you can map the dimension to a visual aesthetic. Here's one way to do that:
library(tidyverse)
df2 <- df %>%
gather(column, value, -X) %>%
mutate(group = str_sub(column, end = 1),
stat = str_sub(column, start = 3),
value_label = if_else(stat == "per",
scales::percent(value, accuracy = 0.1),
scales::comma(value, accuracy = 0.01)))
Now, the group a/b/c is in its own column, as is the type of data abs/per, the values are all together in one column, and we also have text labels that suit the type of data.
> head(df2)
X column value group stat value_label
1 e 1 a_abs -0.3693 a abs -0.37
2 e 1,5 a_abs -0.0735 a abs -0.07
3 e 2 a_abs -0.0190 a abs -0.02
4 e 2,5 a_abs 0.0015 a abs 0.00
5 e 3 a_abs 0.0000 a abs 0.00
6 e 3,5 a_abs -0.0224 a abs -0.02
With that out of the way, it's simpler to try out different combinations of ggplot options, which can help highlight different comparisons within the data.
For instance, if you want to compare the different observations within each group, you could put each group into a facet, and each observation along the x axis:
ggplot(df2, aes(X, value, label = value_label)) +
geom_segment(aes(xend = X, yend = 0), color = "blue") +
geom_point() +
geom_text(vjust = 2, size = 2) +
facet_grid(stat~group)
Or if you want to highlight how the different groups compared within each observation, you could swap them, like this:
ggplot(df2, aes(group, value, label = value_label)) +
geom_segment(aes(xend = group, yend = 0), color = "blue") +
geom_point() +
geom_text(vjust = 2, size = 2) +
facet_grid(stat~X)
You might also try combining the abs and per data, since they only vary slightly based on the different denominators applicable to each group and/or observation. To do that, it might be simpler to transform the data to keep each abs and per together:
df3 <- df %>%
gather(column, value, -X) %>%
mutate(group = str_sub(column, end = 1),
stat = str_sub(column, start = 3)) %>%
select(-column) %>%
spread(stat, value) %>%
mutate(combo_label = paste(sep="\n",
scales::comma(abs, accuracy = 0.01),
scales::percent(per, accuracy = 0.1)))
ggplot(df3, aes(group, abs, label = combo_label)) +
geom_segment(aes(xend = group, yend = 0), color = "blue") +
geom_point() +
geom_text(vjust = 1.5, size = 2, lineheight = 0.8) +
scale_y_continuous(expand = c(0.2,0)) +
facet_grid(~X)
I am trying to plot shift data by hour (integer) ordered by 3 different shifts worked (8-16, 16-24, 24-8) by day as the x-axis. The hours I have are 24hr format and I want to plot them not in numerical order (0-24) but by the shift order (8-16, 16-24, 24-8).
Here is the code to create the data and make the plot. I want to put the 0-8 chunk above the 16-24 chunk.
set.seed(123)
Hour = sample(0:24, 500, replace=T)
Day = sample(0:1, 500, replace=T)
dat <- as.tibble(cbind(Hour, Day)) %>%
mutate(Day = factor(ifelse(Day == 0, "Mon", "Tues")),
Shift = cut(Hour, 3, labels = c("0-8", "8-16", "16-24")),
Exposure = factor(sample(0:1, 500, replace=T)))
ggplot(dat, aes(x = Day, y = Hour)) +
geom_jitter(aes(color = Exposure, shape = Exposure)) +
geom_hline(yintercept = 8) +
geom_hline(yintercept = 16) +
theme_classic()
Current plot
It is an interesting problem, and I have tried recoding a new hour variable that is in the order that I want but then I'm not sure how to plot it displaying the standard 24hr variable.
How would i accomplish this ordering?
Not sure if I completely understand, but if you facet your table on the Shift column, it should do what you want. First you must factor the Shift column to the order you specify:
dat$Shift <- factor(dat$Shift, levels = c("0-8", "16-24", "8-16"))
ggplot(dat, aes(x = Day, y = Hour)) +
geom_jitter(aes(color = Exposure, shape = Exposure)) +
facet_grid(Shift ~ ., scales = "free") +
theme_classic()
set.seed(123)
Hour = sample(0:24, 500, replace=T)
Day = sample(0:1, 500, replace=T)
dat <- as.tibble(cbind(Hour, Day)) %>%
mutate(Day = factor(ifelse(Day == 0, "Mon", "Tues")),
Shift = cut(Hour, 3, labels = c("0-8", "8-16", "16-24")),
Exposure = factor(sample(0:1, 500, replace=T)))
dat$Shift <- factor(dat$Shift, levels=rev(levels(dat$Shift)))
ggplot(dat, aes(x = Day, y = Shift)) +
geom_jitter(aes(color = Exposure, shape = Exposure)) +
geom_hline(yintercept = 8) +
geom_hline(yintercept = 16) +
theme_classic()
You just need to reverse the level.