I have a dataset like the one below. For each case it contains the age (stored as a date of birth), the status, date_entry (the date the case entered), and date_end.
library(dplyr)    # %>%, mutate(), rowwise(), filter()
library(ggplot2)  # plotting
set.seed(123)
date_entry<- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by="day"), 1000)
age <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by="day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df <- data.frame( date_entry, age, status)
df <- df %>% mutate(id = row_number()) %>% rowwise %>%
mutate(date_end = sample(seq(date_entry, Sys.Date(), by="day"), 1))
I want to simulate date_entry for the cases with status == 1 so that their mean date_entry increases.
df$date_sim <- df$date_entry - df$status * round(rnorm(nrow(df), -400, 500))
df %>%
ggplot(aes(x = factor(status),
y = as.numeric(difftime(date_sim, age, unit = 'w'))/52,
fill = factor(status))) +
geom_boxplot(width = 0.6) +
guides(fill = guide_none()) +
labs(x = 'Status', y = 'Age (years)')
However, after the simulation several cases have a negative time duration, which means their simulated date (date_sim) falls after date_end.
df$time_duration= df$date_end-df$date_sim
df %>%
filter(time_duration<0)
How can I avoid this for the cases that have a short time duration (date_end - date_sim)?
In other words, I need to limit the simulation for the cases with a short time_duration.
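One way to express this constraint, as a minimal sketch rather than a vetted method: simulate as before and then cap date_sim at date_end with pmin(). The toy data frame and its values are made up for illustration; only the column names follow the question.

```r
# Toy stand-in for the question's df (made-up values)
toy <- data.frame(
  date_entry = as.Date(c("2005-01-01", "2006-06-01", "2007-03-15")),
  date_end   = as.Date(c("2005-03-01", "2012-01-01", "2007-04-01")),
  status     = c(1, 1, 0)
)

set.seed(1)
# Same form as in the question: only status == 1 rows are shifted
toy$date_sim <- toy$date_entry - toy$status * round(rnorm(nrow(toy), -400, 500))

# Cap: a simulated date may never fall after date_end
toy$date_sim <- pmin(toy$date_sim, toy$date_end)

toy$time_duration <- toy$date_end - toy$date_sim
toy
```

Note that capping places every overshooting case exactly at date_end (a duration of 0); redrawing the shift until it fits would be an alternative if that pile-up is not acceptable.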
Related
I have a dataset like the one below. date_e is accurate for status = "1". I want to simulate date_e based on age, so that new_date_e changes for status = "0" and stays the same for status = "1". Also, status = 1 carries a higher risk, so the difference date_e - age should on average be shorter for status = "1" than for status = "0".
         age     date_e status id
1 1950-10-21 2008-11-02      0  1
2 1941-02-11 2006-08-28      0  2
3 1940-01-20 2000-05-25      0  3
4 1957-11-05 2008-03-28      1  4
5 1946-09-15 2004-03-10      0  5
The data can be generated with:
library(dplyr)
set.seed(1)
age <- sample(seq(as.Date('1930-01-01'), as.Date('1970-01-01'), by="day"), 1000)
date1 <- sample(seq(as.Date('2000-01-01'), as.Date('2010-01-01'), by="day"), 1000)
status <- sample(c(0, 1), size = 1000, replace = TRUE, prob = c(0.8, 0.2))
df <- data.frame(age, date1, status)
df <- df %>% mutate(id = row_number())
I guess what you are wanting to simulate is the effect of status on longevity (i.e. the time difference between date1 and age in your reproducible example). At the moment, status has no effect on longevity:
library(ggplot2)
df %>%
ggplot(aes(x = factor(status),
y = as.numeric(difftime(date1, age, unit = 'w'))/52,
fill = factor(status))) +
geom_boxplot(width = 0.6) +
guides(fill = guide_none()) +
labs(x = 'Status', y = 'Age (years)')
Effectively, what you need to do is to subtract a random amount of time from the date1 column where status == 1. To do this, you can take advantage of the fact that dates are stored as plain numbers (days since 1970-01-01) 'under the hood' in R, and the fact that you can multiply a random draw by the status column, since those with status == 0 will thereby always have 0 subtracted.
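Both facts can be checked on a toy vector. This is only an illustrative sketch; the dates and the fixed 365-day shift are made up and stand in for the random draw.

```r
d      <- as.Date(c("2005-01-01", "2005-01-01"))
status <- c(0, 1)

# A Date is a day count since 1970-01-01 with a class attribute on top
unclass(d)

# Multiplying the shift by status leaves the status == 0 row untouched
shifted <- d - status * 365
shifted
```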
So the answer is that you only need to do:
df$date1 <- df$date1 - df$status * round(rnorm(nrow(df), 3650, 500))
Which will remove on average 10 years from those with status == 1 but leave those with status == 0 as-is:
df %>%
ggplot(aes(x = factor(status),
y = as.numeric(difftime(date1, age, unit = 'w'))/52,
fill = factor(status))) +
geom_boxplot(width = 0.6) +
guides(fill = guide_none()) +
labs(x = 'Status', y = 'Age (years)')
I have the following data,
# Generate Data
library(tidyverse)
library(ggspectra)
fake_data <- tibble(
time = seq(1,100, length.out = 1000),
gdp = time+time*(1-sin(0.15*time))
) %>% mutate(
time = row_number(),
growth = (gdp - lag(gdp))/lag(gdp) * 100,
peak = as.numeric(near(gdp, peaks(gdp), tol = 0.01)),
valley = as.numeric(near(gdp, valleys(gdp), tol = 0.01)),
type = as.factor(
case_when(
gdp >= lag(gdp) ~ "Expansion",
gdp <= lag(gdp) ~ "Contraction"
)
)
) %>% mutate(
cycle = as.factor(cumsum(peak + valley))
) %>% na.omit()
And I'm using ggplot2 to produce a plot
fake_data %>%
ggplot(mapping = aes(x = time, y = gdp)) +
geom_line() + geom_point(
fake_data %>% filter(peak == 1 | valley == 1),
mapping = aes(x = time, y = gdp)
) + geom_ribbon(aes(ymax = gdp, ymin = 0,fill = type, group = type), alpha = 0.5)
Which generates the following plot,
Ideally, the contractions and expansions would be clearly separated for illustrative purposes. I attempted to create an additional group to separate the connected ribbons, but I got the following error: "Error: Aesthetics can not vary with a ribbon".
How do I generate this plot neatly?
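You might benefit from setting your groups from run-length based IDs. data.table::rleid() can help with that: it gives each consecutive run of identical values its own ID, so every expansion or contraction stretch becomes a separate group.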
library(tidyverse)
library(ggspectra)
fake_data <- tibble(
time = seq(1,100, length.out = 1000),
gdp = time+time*(1-sin(0.15*time))
) %>% mutate(
time = row_number(),
growth = (gdp - lag(gdp))/lag(gdp) * 100,
peak = as.numeric(near(gdp, peaks(gdp), tol = 0.01)),
valley = as.numeric(near(gdp, valleys(gdp), tol = 0.01)),
type = as.factor(
case_when(
gdp >= lag(gdp) ~ "Expansion",
gdp <= lag(gdp) ~ "Contraction"
)
)
) %>% mutate(
cycle = as.factor(cumsum(peak + valley))
) %>% na.omit()
fake_data %>%
ggplot(mapping = aes(x = time, y = gdp)) +
geom_line() +
geom_point(
fake_data %>% filter(peak == 1 | valley == 1),
mapping = aes(x = time, y = gdp)
) +
geom_area(aes(fill = type, group = data.table::rleid(type)),
alpha = 0.5)
Created on 2021-08-27 by the reprex package (v1.0.0)
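To see what run-length based IDs look like, here is a small sketch that builds them with base R's rle() as a stand-in, so it runs without data.table. The toy vector is made up; data.table::rleid() should give the same IDs for it.

```r
type <- c("Expansion", "Expansion", "Contraction", "Contraction", "Expansion")

# rle() finds the consecutive runs; repeating the run index by the run
# length gives one ID per row
runs   <- rle(type)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

data.frame(type, run_id)
```

The second "Expansion" stretch gets a new ID instead of being merged with the first, which is what keeps the filled areas from being connected across a contraction.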
I have the following fake data representing the answering times (in seconds) of different users in an online questionnaire:
library(dplyr)    # %>%
library(ggplot2)  # plotting
n <- 1000
dat <- data.frame(user = 1:n,
question = sample(paste("q", 1:10, sep = ""), size = n, replace = TRUE),  # one draw per user instead of 10 recycled values
time = round(rnorm(n, mean = 10, sd=4), 0)
)
dat %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') +
ggtitle("Answering time per question")
Then, I am plotting the answering times as boxplots for each question. But how can I first calculate a column with a binary variable showing whether a case is an outlier or not [defined as median(time) +/- 3 * mad(time) ] within each question?
library(dplyr)
dat %>%
group_by(question) %>%
mutate(outlier = abs(time - median(time)) > 3*mad(time) ) %>%
ungroup() %>%
ggplot(aes(x = question, y = time)) +
geom_boxplot(fill = 'orange') +
geom_point(data = . %>% filter(outlier), color = "red") +
ggtitle("Answering time per question")
Because the data are grouped by question first, each row is compared with the median and MAD of its own question rather than with the overall median and MAD.
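The effect of the grouping can be seen on a tiny made-up example. This is a base-R sketch using ave() in place of group_by() and mutate(); the helper name is_out and the values are assumptions for illustration only.

```r
toy <- data.frame(
  question = rep(c("q1", "q2"), each = 6),
  time     = c(9, 10, 11, 10, 12, 40,
               20, 21, 19, 20, 22, 21)
)

# Hypothetical helper: the rule from the question, median +/- 3 * MAD
is_out <- function(x) abs(x - median(x)) > 3 * mad(x)

# Per question: ave() applies the rule within each group (it returns 0/1)
toy$outlier <- as.logical(ave(toy$time, toy$question, FUN = is_out))

# Without grouping: one median and one MAD for all rows
toy$outlier_global <- is_out(toy$time)

toy
```

Here the value 40 is flagged within q1 but not by the ungrouped rule, because the slower q2 answers widen the overall spread.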
I am trying to predict the June-September Level for the year 2020 using a multiple linear regression model. In the example below, I assume that the conditions of 2016 will repeat and use them to predict the June-September Level for 2020. I plot the observed Level up to May 31 as a solid black line and the forecast Level as a dashed blue line.
library(tidyverse)
library(lubridate)
set.seed(1500)
DF <- data.frame(Date = seq(as.Date("2000-01-01"), to = as.Date("2018-12-31"), by = "days"),
Level = runif(6940, 360, 366), Flow = runif(6940, 1,10),
PCP = runif(6940, 0,25), MeanT = runif(6940, 1, 30)) %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
filter(between(Month, 6, 9))
Model <- lm(data = DF, Level~Flow+PCP+MeanT)
Yr_2016 <- DF %>%
filter(Year == 2016) %>%
select(c(3:5))
Pred2020 <- data.frame(Date = seq(as.Date("2020-06-01"), to = as.Date("2020-9-30"), by = "days"),
Forecast = predict(Model, Yr_2016))
Obs2020 <- data.frame(Date = seq(as.Date("2020-01-01"), to = as.Date("2020-05-31"), by = "days"),
Level = runif(152, 360, 366))
ggplot(data = Obs2020, aes(x = Date, y = Level), col = "black")+
geom_line(size = 2)+
geom_line(data = Pred2020, aes(x = Date, y = Forecast), linetype = "dashed")
My goal
I want to use the fitted model to predict June-September 2020 assuming that each of the years in DF (not just 2016) repeats itself, and then produce a plot in which the forecast scenarios (June-September) for all years are shown in different colours, something like the plot below.
new answer
The code below should do what you are looking for (if I understood it correctly). The graph, however, is still chaotic.
library(tidyverse)
library(lubridate)
set.seed(1500)
DF <- data.frame(Date = seq(as.Date("2000-01-01"), to = as.Date("2018-12-31"), by = "days"),
Level = runif(6940, 360, 366), Flow = runif(6940, 1,10),
PCP = runif(6940, 0,25), MeanT = runif(6940, 1, 30)) %>%
mutate(Year = year(Date), Month = month(Date), Day = day(Date)) %>%
filter(between(Month, 6, 9))
Model <- lm(data = DF, Level ~ Flow + PCP + MeanT)
Obs2020 <- data.frame(Date = seq(as.Date("2020-01-01"),
to = as.Date("2020-05-31"),
by = "days"),
Level = runif(152, 362.7, 363.25))
pred_data <- DF %>%
nest_by(Year) %>%
mutate(pred_df = list(tibble(Date = seq(as.Date("2020-06-01"),
to = as.Date("2020-09-30"),
by = "days"),
Forecast = predict(.env$Model, data)))) %>%
select(Year, pred_df) %>%
unnest(pred_df)
ggplot(data = Obs2020, aes(x = Date, y = Level), col = "black") +
geom_line(size = 0.1) +
geom_line(data = pred_data,
aes(x = Date, y = Forecast, group = factor(Year), color = factor(Year)),
size = 0.1)
Created on 2020-06-20 by the reprex package (v0.3.0)
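The core idea, one fitted model reused to predict from each year's predictors in turn, can also be sketched in base R with split() and lapply(). The toy data and the single predictor are made up to keep the example short.

```r
set.seed(1)
toy <- data.frame(
  Year  = rep(2000:2002, each = 4),
  Flow  = runif(12, 1, 10),
  Level = runif(12, 360, 366)
)

# One model fitted on all years
fit <- lm(Level ~ Flow, data = toy)

# One forecast vector per historical year, each treated as a scenario
scenarios <- lapply(split(toy, toy$Year), function(d) predict(fit, newdata = d))

str(scenarios)
```

Each list element holds the forecast for one scenario year and could be paired with the 2020 dates before plotting.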
I'm trying to replicate this histogram in R.
Here is how to mock my dataset:
dft <- data.frame(
menutype = sample(c(1,2,4,5,6,8,12), 120, replace = T),
Belief = sample(c(0,1), 120, replace = T),
Choice = sample(c(0,1), 120, replace = T)
)
Here is my code :
library(ggplot2)
library(dplyr)
library(tidyr)
library(MASS)
# Recode the mock data (dft) into labelled factors
df <- data.frame(
menutype = factor(dft$menutype, labels = c("GUILT" , "SSB0", "SSB1", "FLEX0", "FLEX1", "STD", "FLEX01"),
levels = c(1,2,4,5,6,8,12)),
Belief = factor(dft$Belief, levels = c(1), labels= c("Believe Learn")), #Interested only in this condition
Choice = factor(dft$Choice, levels = c(1), labels= c("Learn")) #Same here
)
df1 <- rbind(na.omit(df %>%
count(Belief, menutype) %>%
group_by(menutype) %>%
mutate(prop = n / sum(n))),
na.omit(df %>%
count(Choice, menutype) %>%
group_by(menutype) %>%
mutate(prop = n / sum(n))))
test <- paste(df1$Belief[1:6],paste(df1$Choice[7:13]))
test[1:6] <- paste(df1$Belief[1:6])
test[7:13] <- paste(df1$Choice[7:13])
df1$combine <- paste(test)
ggplot(data = df1, aes(menutype, prop, fill = combine)) +
labs(title = "Classification based on rank ordering\n", x = "", y = "Fraction of subjects", fill = "\n") +
geom_bar(stat = "identity", position = "dodge")+
theme_bw() +
theme(legend.position="bottom", plot.title = element_text(hjust = 0.5)) #Centering of the main title+
#geom_text(aes(label="ok"), vjust=-0.3, size=3.5)+
The problem is that this more or less works and I almost get the graph I want, but it is a workaround and there are still some errors. For example, I get the same value (0.10) for both STD bars, whereas it should be 0 and 0.10 as in the original graph.
Ideally, I would like to have two separate data frames, one with menutype and Belief and the other with menutype and Choice; then, as I did above, compute the proportion of a specific level of each of those variables within each menutype; and finally plot the proportions as a bar chart, much like the graph in the original study. Additionally, I'd like to show the proportions as fractions above each bar, but that is optional.
Could someone help me on this matter? I'm really struggling to get it working.
Thanks in advance!
EDIT: I think the issue is with the fill = argument. I would like to specify for each bar which variable it represents (e.g., fill = df2$Belief & df2$Choice), but I don't know how to proceed.
library(tidyverse)
set.seed(10)
# example data frame
df <- data.frame(
menutype = sample(c(1,2,4,5,6,8,12), 120, replace = T),
Belief = sample(c(0,1), 120, replace = T),
Choice = sample(c(0,1), 120, replace = T)
)
# calculate all metrics based on all variables you want to plot in a tidy way
df_plot = df %>%
group_by(Choice) %>%
count(menutype, Belief) %>%
mutate(prop = n / sum(n),
prop_text = paste0(n, "/", sum(n))) %>%
ungroup()
# barplots using one variable and split plots using another variable
df_plot %>%
mutate(Belief = factor(Belief),
menutype = factor(menutype)) %>%
ggplot(aes(menutype, prop, fill = Belief))+
geom_col(position = "dodge")+
facet_wrap(~Choice, ncol=1)+
geom_text(aes(label=prop_text), position = position_dodge(1), vjust = -0.5)+
ylim(0,0.2)
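For the proportion described in the question (the share of rows with Belief == 1 within each menutype), a minimal base-R sketch is shown below. It is an alternative illustration, not the method used in the code above, and the toy values are made up.

```r
toy <- data.frame(
  menutype = c(1, 1, 1, 2, 2, 2, 2, 1),
  Belief   = c(1, 0, 1, 1, 1, 0, 0, 1)
)

# The mean of a logical vector is the fraction of TRUE values, so a
# menutype with no Belief == 1 rows gives 0 instead of being dropped
share <- tapply(toy$Belief == 1, toy$menutype, mean)
share
```

The same call with Choice in place of Belief would give the second set of bars.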