I am trying to predict the June - September Level for the year 2020 using a multiple linear regression model. In the example below, I assume that the 2016 conditions will repeat and use them to predict the June - September Level for 2020. I plot the observed Level up to May 31 as a solid black line and the forecasted Level as a dashed blue line.
library(tidyverse)
library(lubridate)

set.seed(1500)
DF <- data.frame(Date = seq(as.Date("2000-01-01"), to = as.Date("2018-12-31"), by = "days"),
                 Level = runif(6940, 360, 366), Flow = runif(6940, 1, 10),
                 PCP = runif(6940, 0, 25), MeanT = runif(6940, 1, 30)) %>%
  mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
  filter(between(Month, 6, 9))

Model <- lm(data = DF, Level ~ Flow + PCP + MeanT)

Yr_2016 <- DF %>%
  filter(Year == 2016) %>%
  select(c(3:5))

Pred2020 <- data.frame(Date = seq(as.Date("2020-06-01"), to = as.Date("2020-09-30"), by = "days"),
                       Forecast = predict(Model, Yr_2016))

Obs2020 <- data.frame(Date = seq(as.Date("2020-01-01"), to = as.Date("2020-05-31"), by = "days"),
                      Level = runif(152, 360, 366))

ggplot(data = Obs2020, aes(x = Date, y = Level)) +
  geom_line(size = 2) +
  geom_line(data = Pred2020, aes(x = Date, y = Forecast), linetype = "dashed", color = "blue")
My goal
I want to use the fitted model to predict June - September for the year 2020, assuming that every year in DF repeats itself (not just 2016), and then produce a plot where the forecasted scenarios (June - September) for all years are shown in different colours - something like below.
New answer
The code below should do what you are looking for (if I understood it correctly). The graph, however, is still chaotic.
library(tidyverse)
library(lubridate)

set.seed(1500)
DF <- data.frame(Date = seq(as.Date("2000-01-01"), to = as.Date("2018-12-31"), by = "days"),
                 Level = runif(6940, 360, 366), Flow = runif(6940, 1, 10),
                 PCP = runif(6940, 0, 25), MeanT = runif(6940, 1, 30)) %>%
  mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
  filter(between(Month, 6, 9))

Model <- lm(data = DF, Level ~ Flow + PCP + MeanT)

Obs2020 <- data.frame(Date = seq(as.Date("2020-01-01"),
                                 to = as.Date("2020-05-31"),
                                 by = "days"),
                      Level = runif(152, 362.7, 363.25))

# Nest the data by Year, then predict on each year's predictors in turn
pred_data <- DF %>%
  nest_by(Year) %>%
  mutate(pred_df = list(tibble(Date = seq(as.Date("2020-06-01"),
                                          to = as.Date("2020-09-30"),
                                          by = "days"),
                               Forecast = predict(.env$Model, data)))) %>%
  select(Year, pred_df) %>%
  unnest(pred_df)

ggplot(data = Obs2020, aes(x = Date, y = Level)) +
  geom_line(size = 0.1) +
  geom_line(data = pred_data,
            aes(x = Date, y = Forecast, group = factor(Year), color = factor(Year)),
            size = 0.1)
Created on 2020-06-20 by the reprex package (v0.3.0)
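For completeness, here is a minimal alternative sketch (my own, not from the answer above) of the same per-year forecast using group_split() and purrr::map_dfr() instead of nest_by(); it assumes the DF and Model objects defined above.

# Split DF into one tibble per year, predict on each, and bind the results
pred_data_alt <- DF %>%
  group_split(Year) %>%
  map_dfr(function(d) {
    tibble(Year = d$Year[1],
           Date = seq(as.Date("2020-06-01"), to = as.Date("2020-09-30"), by = "days"),
           Forecast = predict(Model, d))
  })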
Related
I have a dataset like this, in which we have the age of the cases, their status, date_entry (the time they entered), and date_end.
library(tidyverse)

set.seed(123)
date_entry <- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by = "day"), 1000)
age <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by = "day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df <- data.frame(date_entry, age, status)

df <- df %>%
  mutate(id = row_number()) %>%
  rowwise() %>%
  mutate(date_end = sample(seq(date_entry, Sys.Date(), by = "day"), 1))
I want to simulate a new entry date for cases with status == 1 so as to increase their mean date_entry.
df$date_sim <- df$date_entry - df$status * round(rnorm(nrow(df), -400, 500))

df %>%
  ggplot(aes(x = factor(status),
             y = as.numeric(difftime(date_sim, age, units = 'weeks')) / 52,
             fill = factor(status))) +
  geom_boxplot(width = 0.6) +
  guides(fill = guide_none()) +
  labs(x = 'Status', y = 'Age (years)')
However, after the simulation, several cases have a negative time duration, which means their simulated time (date_sim) falls after date_end.
df$time_duration <- df$date_end - df$date_sim

df %>%
  filter(time_duration < 0)
How can I avoid this for cases that have a short time duration (date_end - date_sim)?
So I need to limit the simulation for those who had a short time_duration.
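One possible way to enforce that constraint (my own hedged suggestion, not part of the original question) is simply to cap each simulated date at the corresponding date_end:

# Clamp date_sim so it never falls after date_end; pmin() works
# element-wise on Date vectors (this reuses the df built above).
df <- df %>%
  mutate(date_sim = pmin(date_sim, date_end))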
I am trying to order the time and date axes on my scatter plot into epochs/time periods, for example times between 12pm-7:59pm and 9pm-11:59pm. I want to do something similar for the dates.
I am fairly new to R, so I am just looking for suggestions, to be told whether this is even possible, and maybe some alternatives. :)
This is my code so far:
library(ggplot2)

accident <- read.csv("accidents.csv", header = TRUE)

ggplot(data = accident) +
  geom_point(mapping = aes(x = Time, y = Date), alpha = 0.5)
Thank you!
Welcome to R! Here is one set of options.
library(tidyverse)
library(lubridate)
First, simulate a dataset:
accident <-
  rnorm(n = 1000, mean = 1500000000, sd = 1000000) %>%
  tibble(date_time = .) %>%
  mutate(date_time = as.POSIXct(date_time, origin = "1970-01-01")) %>%
  separate(date_time, into = c("date", "time"), sep = " ", remove = F)
Original plot:
accident %>%
  ggplot() +
  geom_point(aes(x = time, y = date), alpha = 0.5)
Step 1: Collapse the x axis into a smaller number of groups
accidents_per_trihour <-
  accident %>%
  mutate(hour = floor_date(date_time, unit = "hour"),
         hour = as.numeric(str_sub(hour, 12, 13)),
         tri_hour = cut(hour, c(0, 3, 6, 9, 12, 15, 18, 21, 24), include.lowest = T)) %>%
  group_by(date, tri_hour) %>%
  count()
Then scale the dot size by the number of accidents:
accidents_per_trihour %>%
  ggplot() +
  geom_point(aes(x = tri_hour, y = date, size = n), alpha = 0.5) +
  labs(x = "\nTime (in three-hour groups)", y = "Day\n", size = "Accidents count")
Still not great because the y axis is too expansive. So:
Step 2: Collapse the y axis into a smaller number of groups
(For your data you may need to group into months for things to start to look reasonable)
accidents_per_trihour_per_week <-
  accident %>%
  mutate(hour = floor_date(date_time, unit = "hour"),
         hour = as.numeric(str_sub(hour, 12, 13)),
         tri_hour = cut(hour, c(0, 3, 6, 9, 12, 15, 18, 21, 24), include.lowest = T)) %>%
  mutate(week_start = floor_date(as.Date(date), unit = "weeks"),
         week = format.Date(week_start, "%Y, week %W")) %>%
  group_by(week, tri_hour) %>%
  count()
It should be much more readable now.
We’ll improve the theme as well, just because.
if (!require(ggthemr)) devtools::install_github('cttobin/ggthemr')
ggthemr::ggthemr("flat") ## helps with pretty theming
accidents_per_trihour_per_week %>%
  ggplot() +
  geom_point(aes(x = tri_hour, y = week, size = n), alpha = 0.9) +
  labs(x = "\nTime (in three-hour groups)", y = "Week\n", size = "Accidents count")
You could also do a tile plot:
accidents_per_trihour_per_week %>%
  ggplot() +
  geom_tile(aes(x = tri_hour, y = week, fill = n)) +
  geom_label(aes(x = tri_hour, y = week, label = n), alpha = 0.4, size = 2.5, fontface = "bold") +
  labs(x = "\nTime (in three-hour groups)", y = "Week\n", fill = "Accidents count")
Created on 2021-11-24 by the reprex package (v2.0.1)
I have the following data,
# Generate Data
library(tidyverse)
library(ggspectra)

fake_data <- tibble(
  time = seq(1, 100, length.out = 1000),
  gdp = time + time * (1 - sin(0.15 * time))
) %>% mutate(
  time = row_number(),
  growth = (gdp - lag(gdp)) / lag(gdp) * 100,
  peak = as.numeric(near(gdp, peaks(gdp), tol = 0.01)),
  valley = as.numeric(near(gdp, valleys(gdp), tol = 0.01)),
  type = as.factor(
    case_when(
      gdp >= lag(gdp) ~ "Expansion",
      gdp <= lag(gdp) ~ "Contraction"
    )
  )
) %>% mutate(
  cycle = as.factor(cumsum(peak + valley))
) %>% na.omit()
And I'm using ggplot2 to produce a plot
fake_data %>%
  ggplot(mapping = aes(x = time, y = gdp)) +
  geom_line() +
  geom_point(
    fake_data %>% filter(peak == 1 | valley == 1),
    mapping = aes(x = time, y = gdp)
  ) +
  geom_ribbon(aes(ymax = gdp, ymin = 0, fill = type, group = type), alpha = 0.5)
Which generates the following plot,
Ideally, the contraction and expansion would be clearly separated for illustrative purposes. I attempted to create an additional group to separate the connected ribbons, but I got the following error: Error: Aesthetics can not vary with a ribbon.
How do I generate this plot neatly?
You might benefit from setting your groups using run-length IDs; data.table::rleid() can help with that.
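To see what rleid() produces, here is a tiny standalone illustration (my own, not from the answer): it starts a new integer ID each time the value changes, so every uninterrupted run of "Expansion" or "Contraction" becomes its own group.

data.table::rleid(c("Expansion", "Expansion", "Contraction", "Contraction", "Expansion"))
#> [1] 1 1 2 2 3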
library(tidyverse)
library(ggspectra)

fake_data <- tibble(
  time = seq(1, 100, length.out = 1000),
  gdp = time + time * (1 - sin(0.15 * time))
) %>% mutate(
  time = row_number(),
  growth = (gdp - lag(gdp)) / lag(gdp) * 100,
  peak = as.numeric(near(gdp, peaks(gdp), tol = 0.01)),
  valley = as.numeric(near(gdp, valleys(gdp), tol = 0.01)),
  type = as.factor(
    case_when(
      gdp >= lag(gdp) ~ "Expansion",
      gdp <= lag(gdp) ~ "Contraction"
    )
  )
) %>% mutate(
  cycle = as.factor(cumsum(peak + valley))
) %>% na.omit()

fake_data %>%
  ggplot(mapping = aes(x = time, y = gdp)) +
  geom_line() +
  geom_point(
    fake_data %>% filter(peak == 1 | valley == 1),
    mapping = aes(x = time, y = gdp)
  ) +
  geom_area(aes(fill = type, group = data.table::rleid(type)),
            alpha = 0.5)
Created on 2021-08-27 by the reprex package (v1.0.0)
Consider this simple example:
library(tidyverse)
library(lubridate)
library(lattice)

tibble(date = seq.Date(from = ymd('2019-01-01'),
                       to = ymd('2019-06-01'),
                       by = 'days')) %>%
  mutate(var = rnorm(n())) %>%
  barchart(var ~ date, data = ., horiz = FALSE)
Here obviously the x axis is messed up. There are too many labels! How can I reduce the number of date ticks?
Ideally one could use the scales::pretty_breaks() function like in ggplot, but here I was not able to do it even manually:
myseq <- seq.Date(from = ymd('2019-01-01'),
                  to = ymd('2019-06-01'),
                  by = '7 days')

tibble(date = seq.Date(from = ymd('2019-01-01'),
                       to = ymd('2019-06-01'),
                       by = 'days')) %>%
  mutate(var = rnorm(n())) %>%
  barchart(var ~ date, data = ., horiz = FALSE,
           scales = list(x = list(at = myseq, rot = 45)))
simply removes ALL the x labels!!
What do you think? I need a lattice solution only.
Thanks!
ggplot() handles this quite nicely by default, breaking at months in this case:
library(tidyverse)
library(lubridate)

tibble(
  date = seq.Date(
    from = ymd('2019-01-01'),
    to = ymd('2019-06-01'),
    by = 'days')) %>%
  mutate(var = rnorm(n())) %>%
  ggplot() +
  geom_line(mapping = aes(x = date, y = var))
To use lattice, first store your tibble as an object, use seq() to set the tick positions (at), and then set the labels by indexing the date column with the same seq(). (The reason your attempt removed all the labels is that barchart() treats each date as a discrete level plotted at integer positions 1 to n, so Date values passed to at fall outside the plotted range.)
library(lattice)

df1 <- tibble(
  date = seq.Date(
    from = ymd('2019-01-01'),
    to = ymd('2019-06-01'),
    by = 'days')) %>%
  mutate(var = rnorm(n()))

df1 %>%
  barchart(var ~ date, data = ., horiz = FALSE,
           scales =
             list(x =
                    list(
                      at = seq(1, 152, 7),
                      labels = as.Date(unlist(df1[, 'date']), origin = '1970-01-01')[seq(1, 152, 7)],
                      rot = 45)
             )
  )
A potential lattice solution based on this: Decrease number of x-axis ticks (labels) in barchart
library(dplyr)
library(lubridate)
library(lattice)

tb <- tibble(date = seq.Date(from = ymd('2019-01-01'),
                             to = ymd('2019-06-01'),
                             by = 'days')) %>%
  mutate(var = rnorm(n()))

dateLabs <- seq(1, nrow(tb), by = 7)
scalesList <- list(x = list(rot = 45,
                            labels = format(tb$date, "%b-%d-%Y")[dateLabs],
                            at = dateLabs))

tb %>%
  barchart(var ~ date,
           data = .,
           horiz = FALSE,
           scales = scalesList)
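If you want something closer in spirit to scales::pretty_breaks(), a hedged variation on the answer above is to let base pretty() pick the integer tick positions instead of a fixed weekly seq() (this reuses the tb object defined above):

# Let pretty() choose evenly spaced positions along 1..nrow(tb), then keep
# only the positions that fall inside the data range.
pos <- pretty(seq_len(nrow(tb)))
pos <- pos[pos >= 1 & pos <= nrow(tb)]

tb %>%
  barchart(var ~ date,
           data = .,
           horiz = FALSE,
           scales = list(x = list(at = pos,
                                  labels = format(tb$date, "%b-%d-%Y")[pos],
                                  rot = 45)))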
I have the following data set:
set.seed(10)
start_date <- as.Date('2000-01-01')
end_date <- as.Date('2000-01-10')

Data <- data.frame(
  id = rep((1:1000), 10),
  group = rep(c("A", "B"), 25),
  x = sample(1:100),
  y = sample(c("1", "0"), 10, replace = TRUE),
  date = as.Date(
    sample(as.numeric(start_date):
             as.numeric(end_date), 1000,
           replace = T), origin = '2000-01-01'))
With that, I create the following plot:
library(tidyverse)

Data %>%
  mutate(treated = factor(group)) %>%
  mutate(date = as.POSIXct(date)) %>%          # convert date to datetime
  group_by(treated, date) %>%                  # group
  summarise(prop = sum(y == "1") / n()) %>%    # calculate proportion
  ggplot() +
  theme_classic() +
  geom_line(aes(x = date, y = prop, color = treated)) +
  geom_point(aes(x = date, y = prop, color = treated)) +
  geom_vline(xintercept = as.POSIXct("2000-01-05 12:00 GMT"), color = 'black', lwd = 1)
Unfortunately the plot is pretty 'jumpy' and I would like to smooth it. I tried geom_smooth() but can't get it to work. Other questions regarding smoothing didn't help me because they missed the grouping aspect and therefore had a different structure. However, the example data set is in reality part of a larger data set so I need to stick to that code.
[Edit: the geom_smooth() code I tried is geom_smooth(method = 'auto', formula = y ~ x)]
Can someone point me in the right direction?
Many thanks and all the best.
Is this what you want by a smoothed line? You call geom_smooth() with aesthetics, not in combination with geom_line(). You can choose different smoothing methods, though the default loess is usually what people want when there are few observations. As an aside, I don't think this is necessarily nicer to look at than the geom_line() version, and in fact it is slightly less readable. geom_smooth() is best used when there are many y observations for every x, which makes patterns hard to see; geom_line() is good for one-to-one data.
EDIT: After looking at what you're doing more closely, I added a second plot that doesn't directly calculate the treatment-date means and just uses geom_smooth directly. That lets you get a more reasonable confidence interval instead of having to remove it as before.
set.seed(10)
start_date <- as.Date('2000-01-01')
end_date <- as.Date('2000-01-10')

Data <- data.frame(
  id = rep((1:1000), 10),
  group = rep(c("A", "B"), 25),
  x = sample(1:100),
  y = sample(c("1", "0"), 10, replace = TRUE),
  date = as.Date(
    sample(as.numeric(start_date):
             as.numeric(end_date), 1000,
           replace = T), origin = '2000-01-01'))

library(tidyverse)

Data %>%
  mutate(treated = factor(group)) %>%
  mutate(date = as.POSIXct(date)) %>%          # convert date to datetime
  group_by(treated, date) %>%                  # group
  summarise(prop = sum(y == "1") / n()) %>%    # calculate proportion
  ggplot() +
  theme_classic() +
  geom_smooth(aes(x = date, y = prop, color = treated), se = F) +
  geom_point(aes(x = date, y = prop, color = treated)) +
  geom_vline(xintercept = as.POSIXct("2000-01-05 12:00 GMT"), color = 'black', lwd = 1)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Data %>%
  mutate(treated = factor(group)) %>%
  mutate(y = ifelse(y == "0", 0, 1)) %>%
  mutate(date = as.POSIXct(date)) %>%          # convert date to datetime
  ggplot() +
  theme_classic() +
  geom_smooth(aes(x = date, y = y, color = treated), method = "loess") +
  geom_vline(xintercept = as.POSIXct("2000-01-05 12:00 GMT"), color = 'black', lwd = 1)
Created on 2018-03-27 by the reprex package (v0.2.0).