I have the following dataset:
I want to measure the cumulative total at a daily level. So the result look something like:
I can use dplyr's cumsum function but the count for "missing days" won't show up. As an example, the date 1/3/18 does not exist in the original dataframe. I want this missed date to be in the resultant dataframe and its cumulative sum should be the same as the last known date i.e. 1/2/18 with the sum being 5.
Any help is appreciated! I am new to the language.
I'll use this second data.frame to fill out the missing dates:
daterange <- data.frame(Date = seq(min(x$Date), max(x$Date), by = "1 day"))
Base R:
transform(merge(x, daterange, all = TRUE),
Count = cumsum(ifelse(is.na(Count), 0, Count)))
# Date Count
# 1 2018-01-01 2
# 2 2018-01-02 5
# 3 2018-01-03 5
# 4 2018-01-04 5
# 5 2018-01-05 10
# 6 2018-01-06 10
# 7 2018-01-07 10
# 8 2018-01-08 11
# ...
# 32 2018-02-01 17
dplyr
library(dplyr)
x %>%
right_join(daterange) %>%
mutate(Count = cumsum(if_else(is.na(Count), 0, Count)))
Data:
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"), format="%m/%d/%y"),
Count = c(2,3,5,1,6))
Related
So I have a data frame which is daily data for stock prices, however, I have also a variable that indicates the week of year (1,2,3,4,...,51,52) this is repeated for 22 companies. I would like to create a new variable that takes an average of the daily prices but only across each week.
The above equation has d = day and t = week. My challenge is taking this average of days across each week. Therefore, I should have 52 values per stock that I observe.
Using ave().
dat <- transform(dat, avg_week_price=ave(price, week, company))
head(dat, 9)
# week company wday price avg_week_price
# 1 1 1 a 16.16528 15.47573
# 2 2 1 a 18.69307 15.13812
# 3 3 1 a 11.01956 12.99854
# 4 1 2 a 15.92029 14.56268
# 5 2 2 a 12.26731 13.64916
# 6 3 2 a 17.40726 17.27226
# 7 1 3 a 11.83037 13.02894
# 8 2 3 a 13.09144 12.95284
# 9 3 3 a 12.08950 15.81040
Data:
setseed(42)
dat <- expand.grid(week=1:3, company=1:5, wday=letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)
An option with dplyr
library(dplyr)
dat %>%
group_by(week, company) %>%
mutate(avg_week_price = mean(price))
I have measured hourly data of ground O3 but with some missing data (marked as NA). I want to calculate daily maximums, but only in case there are more than 17 hourly measurements per date. In case it is less than 18 measurement per date I want to write NA.
head(o3sat)
date hour O3
1/1/2010 0 50.2
1/1/2010 1 39.8
1/1/2010 2 41.8
1/1/2010 3 NA
1/1/2010 4 9.2
1/1/2010 5 6.0
Is there a possibility to add some argument to this function to indicate that at least 75% of the data must be available in a day for the value to be calculated, else the data is removed
maximums <- aggregate(o3sat["dnevnik"], list(Date = as.Date(o3sat$datum)), max, na.rm = TRUE)
It is better to provide a reproducible example when asking a question. Here, I created an example data frame based on the information you provided. This data frame contains hourly O3 measurements from 2010-01-01 to 2010-01-03.
library(dplyr)
library(tidyr)
library(lubridate)
o3sat <- read.table(text = " date hour O3
'1/1/2010' 0 50.2
'1/1/2010' 1 39.8
'1/1/2010' 2 41.8
'1/1/2010' 3 NA
'1/1/2010' 4 9.2
'1/1/2010' 5 6.0 ",
stringsAsFactors = FALSE, header = TRUE)
set.seed(1234)
o3sat_ex <- o3sat %>%
mutate(date = mdy(date)) %>%
complete(date = seq.Date(ymd("2010-01-01"), ymd("2010-01-03"), 1), hour = 0:23) %>%
mutate(O3 = c(o3sat$O3, rnorm(66, 30, 10))) %>%
mutate(O3 = ifelse(row_number() %in% sample(7:72, 18), NA, O3))
We can count how many non-NA value per day using the following code.
o3sat_ex %>%
group_by(date) %>%
summarize(sum(!is.na(O3)))
# # A tibble: 3 x 2
# date `sum(!is.na(O3))`
# <date> <int>
# 1 2010-01-01 18
# 2 2010-01-02 17
# 3 2010-01-03 18
Based on your description, we would like to calculate the maximum for 2010-01-01 and 2010-01-03, but not 2010-01-02 as it only contains 17 non-NA values.
Here is one way to achieve the task, we can define a function, max_helper, that only returns maximum if the count of non-NA values is larger than 17.
max_helper <- function(x, threshold){
if (sum(!is.na(x)) >= threshold) {
r <- max(x, na.rm = TRUE)
} else {
r <- NA
}
return(r)
}
We can apply this number using the dplyr code to get the answer.
o3sat_ex2 <- o3sat_ex %>%
group_by(date) %>%
summarize(O3 = max_helper(O3, 18))
o3sat_ex2
# # A tibble: 3 x 2
# date O3
# <date> <dbl>
# 1 2010-01-01 50.2
# 2 2010-01-02 NA
# 3 2010-01-03 47.8
I'm building a dataset and am looking to be able to add a week count to a dataset starting from the first date, ending on the last. I'm using it to summarize a much larger dataset, which I'd like summarized by week eventually.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
Week = nrow/7)
# A tibble: 93 x 2
Date Week
<date> <dbl>
1 1944-06-01 0.143
2 1944-06-02 0.286
3 1944-06-03 0.429
4 1944-06-04 0.571
5 1944-06-05 0.714
6 1944-06-06 0.857
7 1944-06-07 1
8 1944-06-08 1.14
9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't structured sequentially, there are many days missing between weeks so a straight sequential count won't work.
An ideal end result is an additional "week" column, based upon the actual dates (rather than hard-coded with a seq_along() type of result)
Similar solution to Ronak's but with lubridate:
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
week = interval(min(Date), Date) %>%
as.duration() %>%
as.numeric("weeks") %>%
floor() + 1))
You could subtract all the Date values with the first Date and calculate the difference using difftime in "weeks", floor all the values and add 1 to start the counter from 1.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
# Date week
# <date> <dbl>
# 1 1944-06-01 1
# 2 1944-06-02 1
# 3 1944-06-03 1
# 4 1944-06-04 1
# 5 1944-06-05 1
# 6 1944-06-06 1
# 7 1944-06-07 1
# 8 1944-06-08 2
# 9 1944-06-09 2
#10 1944-06-10 2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
I apologize in advance for the simplicity of this question.
How can I generate a monthly time series data set using something like set seed? I have a question about results from two packages but need to create a sample data set to show as an example. My data set needs to have some NA values within in it.
Regards,
Simon
Here's a random list of 1000 dates +- 5 years from today with some missing data using the simstudy package (please provide sample data and expected output for a more specific answer):
library(simstudy)
library(dplyr)
library(lubridate)
set.seed(1724)
# define data
def <- defData(varname = "tmp", dist = "uniform", formula = "0;1") # sumstudy seems to crash when adding missing data with only 1 column
def <- defData(def, varname = "date", dist = "uniform", formula = "-5;5") # +- 5 years
df_full <- genData(1000, def)
##### missing data ----
defM <- defMiss(varname = "date", formula = 0.1, logit.link = F)
df_missing <- genMiss(df_full, defM, idvars = "id")
# Create data with missing values
df <- genObs(df_full, df_missing, idvars = "id")
df %>%
as_tibble() %>%
select(-tmp) %>%
mutate(date = ymd(floor_date(as.POSIXct(Sys.Date()) + date * 365 * 24 * 60 * 60, unit = "day")), # +- 5 years from today
month = format(date, "%Y-%m"))
# A tibble: 1,000 x 3
id date month
<int> <date> <chr>
1 1 NA NA
2 2 2021-09-12 2021-09
3 3 2023-11-08 2023-11
4 4 2015-03-02 2015-03
5 5 2021-08-12 2021-08
6 6 2021-10-20 2021-10
7 7 2017-05-17 2017-05
8 8 2019-04-12 2019-04
9 9 NA NA
10 10 NA NA
# ... with 990 more rows
I'm working on a project and trying to create a plot of the number of open cases we have on any given date. An example of the data table is as follows.
case_files <- tibble(case_id = 1:10,
date_opened = c("2017-1-1",
"2017-1-1",
"2017-3-4",
"2017-4-4",
"2017-5-5",
"2017-5-6",
"2017-6-7",
"2017-6-6",
"2017-7-8",
"2017-7-8"),
date_closed = c("2017-4-1",
"2017-4-1",
"2017-5-4",
"2017-7-4",
"2017-7-5",
"2017-7-6",
"2017-8-7",
"2017-8-6",
"2017-9-8",
"2017-10-8"))
case_files$date_opened <- as.Date(case_files$date_opened)
case_files$date_closed <- as.Date(case_files$date_closed)
What I'm trying to do is create another data frame with the dates from the past year and the number of cases that are considered "Open" during each date. I would then be able to plot from this data frame.
daily_open_cases <- tibble(n = 0:365,
date = today() - n,
qty_open = .....)
Cases are considered Open on dates on orafter the date_opened AND on or before the date_closed
I've considered doing conditional subsetting and then using nrow(), but can't seem to get it to work. There must be an easier way to do this. I can do this easily in Excel using the COUNTIFS function.
Thanks!
The Excel funtion basically does a sum of logical 1's and 0's. Easy to do in R with sum function. I'd build a structure that had all the dates and then march through those dates summing up the logical vectors using the two inequalities below across the all paired rows in the case_files structure. The &-function in R is vectorized:
daily_open_cases <- tibble(dt = as.Date("2017-01-01")+0:365,
qty_open = NA)
daily_open_cases$qty_open = sapply(daily_open_cases$dt,
function(d) sum(case_files$date_opened <= d & case_files$date_closed >=d) )
> head( daily_open_cases)
# A tibble: 6 x 2
dt qty_open
<date> <int>
1 2017-01-01 2
2 2017-01-02 2
3 2017-01-03 2
4 2017-01-04 2
5 2017-01-05 2
6 2017-01-06 2
>
Here's a 'tidyverse' solution, the approach is the same as the one of 42 I just used dplyrs group_by and mutate instead of base-r sapply.
library(tidyverse)
library(magrittr)
days_files <- tibble(
date = as.Date("2017-01-01")+0:365,
no_open = NA_integer_
)
days_files %<>%
group_by(date) %>%
mutate(
no_open = sum(case_files$date_opened <= date & case_files$date_closed >= date)
)
# A tibble: 366 x 2
# Groups: date [366]
date no_open
<date> <int>
1 2017-01-01 2
2 2017-01-02 2
3 2017-01-03 2
4 2017-01-04 2
5 2017-01-05 2
6 2017-01-06 2
7 2017-01-07 2
8 2017-01-08 2
9 2017-01-09 2
10 2017-01-10 2
# ... with 356 more rows