Related
Let me first share a dummy data, from which I want to prepare ggplot graphs.
library(tidyverse)
set.seed(1)
sample_size <- 1200
dates <- sample(seq(1,31),sample_size,replace = TRUE)
Monthss <- sample(seq(1,12),sample_size,replace = TRUE)
hrs <- sample(seq(1,23),sample_size,replace = TRUE)
minutes <- sample(seq(1,59),sample_size,replace = TRUE)
date_time_vector <- paste0(dates,"-",Monthss,"-",2022," ",hrs,":",minutes) |> lubridate::parse_date_time("dmy HM")
Conversion <- sample(c(TRUE,FALSE),sample_size, prob = c(0.25,0.75), replace = TRUE)
df <- data.frame(Date = date_time_vector, Conversion_Status = Conversion)
df <- df |> mutate(Leads = round(runif(sample_size, min = 0,max = 10),digits = 0))
df <- df[complete.cases(df), ]
The code above gives me a data.frame with columns Date, Leads and Conversion_Status. I want to prepare Monthly column chart of total leads per day. (For example, daily leads in January, daily leads in February, etc.) So, basically, I will need to split the data on the basis of Month, and prepare one chart for each month. How can I prepare such charts?
I have tried following way:
bar_function <- function(df, col1, col2, title) {
df %>%
ggplot2::ggplot(aes(x = {{col1}}, y = {{col2}})) +
ggplot2::geom_col(fill = "steelblue") +
theme(plot.background = element_rect(fill = "white")) +theme(plot.title = element_text(hjust = 0.5))+coord_flip() +
ggplot2::labs(title = title)
}
mycharts <- df |> dplyr::nest_by(Month) |> dplyr::mutate(plot = bar_function(df,Date,Leads,"Daily Leads by Month"))
But it is giving me errors.
You can split according to month(year) and plot that.
library(ggplot2)
library(lubridate)
set.seed(1)
sample_size <- 1200
dates <- sample(seq(1,31),sample_size,replace = TRUE)
Monthss <- sample(seq(1,12),sample_size,replace = TRUE)
hrs <- sample(seq(1,23),sample_size,replace = TRUE)
minutes <- sample(seq(1,59),sample_size,replace = TRUE)
date_time_vector <- paste0(dates,"-",Monthss,"-",2022," ",hrs,":",minutes) |> lubridate::parse_date_time("dmy HM")
Conversion <- sample(c(TRUE,FALSE),sample_size, prob = c(0.25,0.75), replace = TRUE)
df <- data.frame(Date = date_time_vector, Conversion_Status = Conversion)
df$Leads <- round(runif(sample_size, min = 0,max = 10),digits = 0)
df <- df[complete.cases(df), ]
df$month_year <- strftime(df$Date, format = "%m-%Y")
df.split <- split(df, f = df$month_year)
out <- vector("list", length(df.split))
names(out) <- names(df.split)
for (i in seq_along(df.split)) {
out[[i]] <- ggplot(data = df.split[[i]], mapping = aes(x = Date, y = Leads)) +
geom_col(fill = "steelblue") +
theme(plot.background = element_rect(fill = "white")) +
theme(plot.title = element_text(hjust = 0.5))+
coord_flip() +
labs(title = "Daily leads by month")
}
To plot you can just print e.g. out[[1]].
If you want to change the desired columns dynamically, you can use aes_string for mapping. This can naturally be wrapped into sapply and there are probably other ways of approaching the problem. The for loop is pretty agnostic and I find that it's readable even by people who do not dabble in R (compared to say sapply).
There are some issues with your code. First, your dataset has no Month column, i.e. you have to add it for which I use lubridate::month. Second, you are passing the dataset df to your bar function instead of the splitted data column from your nested df. Third, in the mutate step you have to wrap the result in list():
library(ggplot2)
library(dplyr, warn=FALSE)
mycharts <- df |>
nest_by(Month = lubridate::month(Date)) |>
mutate(plot = list(bar_function(data, Date, Leads, "Daily Leads by Month")))
mycharts$plot[[1]]
mycharts$plot[[5]]
I finally found an answer. I used following code:
lapply(split(df, df$Month),
function(x)
ggplot(x, aes(x=Date, y=Leads)) +
geom_col(fill = "steelblue") + coord_flip()+
ggtitle(x$Month[1]))
Thank you all for your support.
I'm trying to create a nice graph of indexed prices for a few currencies so I can track relative performance from origin for different projects and price-levels.
Below is my dummy code. I've tried a lot of things but this is as far as I got...
R plot of the orignal code: prices of HEX and BTC
I wish to add other currencies as I go along.
In the end it is just a data frame with multiple columns that all need to start on the same point, the timestamp is irrelevant and I could plot only the series or shift them all to start on the same location.
This is what I'm trying to achieve:
Indexed prices of projects starting at same origin
# Dummy data that recreates my problem - two frames with different starting dates and an indexed value of the closing price.
n1 <- 366
dat1 <- data.frame(timestamp=seq.Date(as.Date("2012-12-26"), as.Date("2013-12-26"), "day"),
index.btc=seq(from = 1, to = n1, by=1, replace=TRUE)
)
dat2 <- data.frame(timestamp=seq.Date(as.Date("2013-12-26"), as.Date("2014-12-26"), "day"),
index.hex=seq(from = 1, to = n1, by=1, replace=TRUE)
)
# Merging data
jointdataset2 <- merge(dat1, dat2, by = 'timestamp', all = TRUE)
# Creating plottable data with melt function
jointdataset_plot <- melt(jointdataset2 , id.vars = 'timestamp', variable.name = 'project')
# plot on same grid, each series colored differently --
# good if the series have same scale (they have but different starting date)
ggplot(jointdataset_plot, aes(timestamp,value)) +
geom_line(aes(colour = project)) +
scale_y_log10()
# Can also plot like this
ggplot() + geom_line(data = dat1, aes(timestamp,index.btc),
color = "blue",
size = 1) +
geom_line(data = dat2, aes(timestamp,index.hex),
color = "red",
size = 1) +
labs(x = "Time",
y = "Indexed Price",
title ="Indexed historical price (daily close index)",
subtitle = "Candlesticks - data by nomics.com") +
scale_x_date(date_labels = "%Y (%b)", date_breaks = "1 year", date_minor_breaks = "1 month") +
scale_y_log10() +
theme_bw()
If I remove the timestamps, and remove N/As from one of the data frames, would I then be able to create an ID column in both frames (starting at 1, same counter) and merging them both at ID counter 1 so origins align?
Your sample data overlaps, so I've changed dat2:
library(dplyr);library(tidyr)
n1 <- 366
n2 <- 500
dat1 <- data.frame(timestamp=seq.Date(as.Date("2012-12-26"), as.Date("2013-12-26"), "day"),
index.btc=seq(from = 1, to = n1, by=1, replace=TRUE))
dat2 <- data.frame(timestamp=seq.Date(as.Date("2013-12-26"), as.Date("2014-12-26"), "day"),
index.hex=seq(from = 1, to = n2, length.out=n1))
full_join(dat1,dat2) %>%
pivot_longer(-timestamp, names_to = "index", values_to = "price") %>%
filter(!is.na(price)) %>%
group_by(index) %>%
mutate(timestamp = as.integer(timestamp - min(timestamp))) -> plotdata
ggplot(plotdata, aes(x = as.integer(timestamp),
y = price, color = index)) +
geom_line() +
labs(x = "Time (Days)",
y = "Indexed Price",
title ="Indexed historical price (daily close index)",
subtitle = "Candlesticks - data by nomics.com") +
scale_y_log10() +
theme_bw()
n1 <- 366
dat1 <- data.frame(timestamp=seq.Date(as.Date("2012-12-26"), as.Date("2013-12-26"), "day"),
index.btc=cumsum(sample(-2:10, n1, replace=TRUE))
)
dat2 <- data.frame(timestamp=seq.Date(as.Date("2013-12-26"), as.Date("2014-12-26"), "day"),
index.hex=cumsum(sample(-2:10, n1, replace=TRUE))
)
dat1$timestamp<- seq(length(dat1$timestamp))
dat2$timestamp<- seq(length(dat2$timestamp))
# Merging data
jointdataset2 <- merge(dat1, dat2, by = 'timestamp', all = TRUE)
# Creating plottable data with melt function
jointdataset_plot <- melt(jointdataset2 , id.vars = 'timestamp', variable.name = 'project')
# plot on same grid, each series colored differently --
# good if the series have same scale (they have but different starting date)
ggplot(jointdataset_plot, aes(timestamp,value)) +
geom_line(aes(colour = project)) +
scale_y_log10()
# Can also plot like this
ggplot() + geom_line(data = dat1, aes(timestamp,index.btc),
color = "blue",
size = 1) +
geom_line(data = dat2, aes(timestamp,index.hex),
color = "red",
size = 1) +
labs(x = "Time",
y = "Indexed Price",
title ="Indexed historical price (daily close index)",
subtitle = "Candlesticks - data by nomics.com") +
scale_x_continuous() +
scale_y_log10() +
theme_bw()
I have timeseries and need to find clusters of values over threshold and plot that cluster on separate plot.
My code example. Unfortunately I don't know how to generate well clustered values.
#generate sample data
Sys.setlocale("LC_ALL","English")
set.seed(8)
Values <- sample(0:100,24241, replace = T)
Values <- rpois(24241, lambda=60)
start <- as.POSIXct("2012-01-15 06:10:00")
interval <- 15
end <- start + as.difftime(4, units="days") + as.difftime(5, units = "hours")
DateTimes <- seq(from=start, by=interval, to=end)
my_data_sample <- tibble(datetime = DateTimes, Value = Values)
threshold <- 82
ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
geom_line(size = 1, color = "darkgreen") +
geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
theme_bw() +
labs(
x= "" ,
y = "",
title = paste("Threshold:", threshold )
) +
scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))
Here is what I need:
I need to find clusters of values over threshold - consecutive or near each other, sort that clusters using cluster length in seconds (longest clusters) or sum of values (most powerful clusters), and plot let's say top 3 of that time periods on separate plots.
Any suggestions how to do that?
You can find runs that follow some expectation using run-length encoding (RLE). At the RLE level, you can filter out runs that are too short on either side. You can play with the run_threshold value until it matches your data.
# Put some actual deviating runs in the data
my_data_sample$Value[5001:5100] <- rpois(100, lambda = 80)
my_data_sample$Value[10001:11000] <- rpois(1000, lambda = 80)
threshold <- 82
rle <- rle(my_data_sample$Value > threshold)
# Find sub-threshold values in between super-threshold values,
# convert these to other class
run_threshold <- 20
rle$values[!rle$values & rle$lengths < run_threshold] <- TRUE
# Restructure rle
rle <- rle(inverse.rle(rle))
# Find short super-threshold values to filter
run_threshold <- 5
rle$values[rle$values & rle$lengths < run_threshold] <- FALSE
rle <- rle(inverse.rle(rle))
# Find run starts and ends
rle_start <- {rle_end <- cumsum(rle$lengths)} - rle$lengths + 1
# Format as data.frame for ggplot
rle_df <- data.frame(
min = my_data_sample$datetime[rle_start],
max = my_data_sample$datetime[rle_end],
value = rle$values
)
ggplot(data = my_data_sample, aes(x = datetime, y = Value)) +
geom_line(size = 1, color = "darkgreen") +
geom_rect(aes(xmin = min, xmax = max, ymin = 0, ymax = 10, fill = value),
data = rle_df, inherit.aes = FALSE) +
geom_hline(yintercept=threshold, linetype="dashed", color = "red") +
theme_bw() +
labs(
x= "" ,
y = "",
title = paste("Threshold:", threshold )
) +
scale_x_datetime(date_breaks = "8 hour", labels = date_format("%b %d - %H:%M")) +
theme(axis.text.x = element_text(angle = 25, vjust = 1.0, hjust = 1.0))
I am trying to plot some data, fitted values and forecasts on a nice ggplot format but when I plot my data the way I think should work I get a gap between the real data and the forecast. The gap is meaningless but it would be nice if it was gone.
Some R code you can use to recreate my problem is:
library(xts)
library(tidyverse)
library(forecast)
dates <- seq(as.Date("2016-01-01"), length = 100, by = "days")
realdata <- arima.sim(model = list(ar = 0.7, order = c(1,1,0)), n = 99)
data <- xts(realdata, order.by = dates)
user_arima <- arima(data, order = c(1,1,0))
user_arimaf <- forecast(user_arima)
fits <- xts(user_arimaf$fitted, order.by = dates)
fcastdates <- as.Date(dates[100]) + 1:10
meancast <- xts(user_arimaf$mean[1:10], order.by = fcastdates)
lowercast95 <- xts(user_arimaf$lower[1:10], order.by = fcastdates)
uppercast95 <- xts(user_arimaf$upper[1:10], order.by = fcastdates)
frame <- merge(data, fits, meancast, uppercast95, lowercast95, all = TRUE, fill = NA)
frame <- as.data.frame(frame) %>%
mutate(date = as.Date(dates[1] + 0:(109)))
frame %>%
ggplot() +
geom_line(aes(date, data, color = "Data")) +
geom_line(aes(date, fits, color = "Fitted")) +
geom_line(aes(date, meancast, color = "Forecast")) +
geom_ribbon(aes(date, ymin=lowercast95,ymax=uppercast95),alpha=.25) +
scale_color_manual(values = c(
'Data' = 'black',
'Fitted' = 'red',
'Forecast' = 'darkblue')) +
labs(color = 'Legend') +
theme_classic() +
ylab("some data") +
xlab("Date") +
labs(title = "chart showing a gap",
subtitle = "Shaded area is the 95% CI from the ARIMA")
And the chart is below
I know there is a geom_forecast in ggplot now but I would like to build this particular plot the way i'm doing it. Although if there's no other solution to the gap then i'll use the geom_forecast.
Closing the gap requires providing a data point in the meancast column for the blank area. I guess it makes sense just to use the value for the last "real" data point.
# Grab the y-value corresponding to the date just before the gap.
last_data_value = frame[frame$date == as.Date("2016-04-09"), "data"]
# Construct a one-row data.frame.
extra_row = data.frame(data=NA_real_,
fits=NA_real_,
meancast=last_data_value,
uppercast95=last_data_value,
lowercast95=last_data_value,
date=as.Date("2016-04-09"))
# Add extra row to the main data.frame.
frame = rbind(frame, extra_row)
I have an R code that creates a linear regression. I am having some problems with the legends in a graph. I would like to use the dates specified in the trendDateRange as the legend with different colors. Since these dates are in YYYY-MM-DD format. I only need the YYYY-MM. So for example, the trendDateRage1 = c("2015-01-01", "2015-12-31") and I want to display "2015-01 - 2015-12" as a legend with a any colour. When I run this in a for loop, it's only displaying 1 legend which uses the last trendDateRange i.e trendDateRange3 which displays "2013-01 - 2013-12". It does not display the legend for the other 2 dates. I do not have any problem with graphs although they're using the same colour. I would like to see different colours for each legend even though they have different line types.
If I run the code below showing individual graphs, it's working with the proper legend. I get the legend for each graph.
Month_Names <- c("2010-11","2010-12",
"2011-01","2011-02","2011-03","2011-04","2011-05","2011-06","2011-07","2011-08","2011-09","2011-10","2011-11","2011-12",
"2012-01","2012-02","2012-03","2012-04","2012-05","2012-06","2012-07","2012-08","2012-09","2012-10","2012-11","2012-12",
"2013-01","2013-02","2013-03","2013-04","2013-05","2013-06","2013-07","2013-08","2013-09","2013-10","2013-11","2013-12",
"2014-01","2014-02","2014-03","2014-04","2014-05","2014-06","2014-07","2014-08","2014-09","2014-10","2014-11","2014-12",
"2015-01","2015-02","2015-03","2015-04","2015-05","2015-06","2015-07","2015-08","2015-09","2015-10","2015-11","2015-12",
"2016-01","2016-02","2016-03","2016-04","2016-05","2016-06","2016-07","2016-08","2016-09","2016-10","2016-11","2016-12",
"2017-01")
Actual_volume <- c(54447,57156,
52033,49547,58718,53109,56488,60095,54683,60863,56692,55283,55504,56633,
53267,52587,54680,55569,60013,56985,59709,61281,54188,59832,56489,55819,
59295,52692,56663,59698,61232,57694,63111,60473,58984,64050,54957,63238,
59460,54430,58901,61088,60496,62984,66895,62720,65591,67815,58289,72002,
61054,60329,69283,68002,63196,72267,71058,69539,71379,70925,68704,76956,
65863,70494,77348,70214,74770,77480,69721,83034,76761,77927,79768,81836,
75381)
df_data <- data.frame(Month_Names, Actual_volume)
trendDateRange1 <- c("2010-11-01", "2017-01-31")
trendDateRange2 <- c("2012-01-01", "2012-12-31")
trendDateRange3 <- c("2013-01-01", "2013-12-31")
numoftrends <- 3
list_of_df <- list()
list_of_df<- lapply(1:numoftrends, function(j) {
trend.period <- get(paste0("trendDateRange", j))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
TRx <- subset(df_data, as.character(Month_Names) >= trend1 &
as.character(Month_Names) <= trend2)
})
i = 1
trend.period <- get(paste0("trendDateRange", i))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
Trend.dates <- paste0(trend1, '-' ,trend2)
plot = ggplot() +
geom_line(data = list_of_df[[i]],
aes(x = Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
lty = i + 1)
print(ggplotly(plot))
i = 2
trend.period <- get(paste0("trendDateRange", i))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
Trend.dates <- paste0(trend1, '-' ,trend2)
plot = ggplot() +
geom_line(data = list_of_df[[i]],
aes(x=Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
lty = i + 1)
print(ggplotly(plot))
i = 3
trend.period <- get(paste0("trendDateRange", i))
trend1 <- substr(trend.period[1], 1, 7)
trend2 <- substr(trend.period[2], 1, 7)
Trend.dates <- paste0(trend1, '-' ,trend2)
plot = ggplot() +
geom_line(data = list_of_df[[i]],
aes(x = Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
lty = i+1)
print(ggplotly(plot))
But when I put this in the loop to make it one graph with each legend it does not work
plot = ggplot()
for (i in seq_along(list_of_df)) {
trend.period = get(paste0("trendDateRange", i))
trend1 = substr(trend.period[1], 1, 7)
trend2 = substr(trend.period[2], 1, 7)
Trend.dates = paste0(trend1, '-' ,trend2)
plot = plot + geom_line(aes(x = Month_Names, y = Actual_volume, group = 1 , colour = Trend.dates),
data = list_of_df[[i]], lty = i + 1)
}
print(ggplotly(plot))
You'll have a much easier time working with ggplot2 if you combine the three datasets into one with an aesthetic that separates them, rather than adding them together in a for loop.
There are a number of ways you could do this, but here's an example using the dplyr and tidyr packages. It would replace everything after your df_data <- line.
library(ggplot2)
library(dplyr)
library(tidyr)
trends <- data_frame(Start = c("2010-11", "2012-01", "2013-01"),
End = c("2017-01", "2012-12", "2013-12"))
combined_data <- df_data %>%
crossing(trends) %>%
mutate(Month_Names = as.character(Month_Names),
TrendName = paste(Start, End, sep = "-")) %>%
filter(Month_Names >= Start,
Month_Names <= End)
# rotated x-axes to make plot slightly more readable
ggplot(combined_data, aes(Month_Names, y = Actual_volume,
group = TrendName,
color = TrendName)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
If you combine your list into a data.frame with an ID representing which element the observation came from and parse the dates, getting a decent plot is pretty simple:
library(dplyr)
library(ggplot2)
list_of_df %>%
bind_rows(.id = 'id') %>%
mutate(date = as.Date(paste0(Month_Names, '-01'))) %>%
ggplot(aes(date, Actual_volume, color = id)) +
geom_line()
or without dplyr,
df <- do.call(rbind,
Map(function(df, i){df$id <- i; df},
df = list_of_df,
i = as.character(seq_along(list_of_df))))
df$date <- as.Date(paste0(df$Month_Names, '-01'))
ggplot(df, aes(date, Actual_volume, color = id)) + geom_line()
which returns the same thing.
If you'd like more descriptive group labels, set the names of the list elements or define id as a string pasted together from the formatted minimums and maximums of the parsed dates.
Here is a solution using ggplotly.
nrows <- unlist(lapply(list_of_df,nrow))
df <- data.frame(do.call(rbind,list_of_df), Grp = factor(rep(1:3, nrows)))
plot <- ggplot(aes(x=Month_Names, y=Actual_volume, group = Grp,
colour=Grp), data=df) + geom_line()
print(ggplotly(plot))