How could I form a date interval with counts in R?

I have a date variable called DATE as follows:
DATE
2019-12-31
2020-01-01
2020-01-05
2020-01-09
2020-01-25
I am trying to return a result that counts how many dates fall in each week, where week 1 starts at the minimum of the DATE variable. So it would look something like this:
Week Count
1 3
2 1
3 0
4 1
Thanks in advance.

In base R:
dates <- c('2019-12-31','2020-01-01','2020-01-05','2020-01-09','2020-01-25')
weeks <- strftime(dates, format = "%V")
table(weeks)
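Note that %V gives the ISO-8601 week of the year, so this counts calendar weeks rather than weeks relative to the earliest date, and empty weeks are dropped. A base-R sketch that matches the desired output instead, counting weeks from min(dates) and using tabulate() so empty weeks show as 0:
d <- as.Date(dates)
wk <- as.integer(d - min(d)) %/% 7 + 1   # week 1 starts at the earliest date
setNames(tabulate(wk, nbins = max(wk)), seq_len(max(wk)))
# 1 2 3 4
# 3 1 0 1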

We subtract the minimum DATE value from each DATE value to get the difference in days, divide the difference by 7 to convert it to weeks, and count the occurrences. We then use tidyr::complete to fill in the missing weeks with a count of 0.
df %>%
  dplyr::count(week = floor(as.integer(DATE - min(DATE)) / 7) + 1) %>%
  tidyr::complete(week = min(week):max(week), fill = list(n = 0))
# week n
# <dbl> <dbl>
#1 1 3
#2 2 1
#3 3 0
#4 4 1
If your DATE column is not of Date class, first run this:
df$DATE <- as.Date(df$DATE)
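For a self-contained run, df can be rebuilt from the dates in the question:
df <- data.frame(DATE = as.Date(c("2019-12-31", "2020-01-01", "2020-01-05",
                                  "2020-01-09", "2020-01-25")))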

Related

Calculate length of night in data frame

I have some test data with two columns. The column "hour" shows hourly values (p.m.). The column "day" indicates the corresponding day, i.e. on day 1 there are hourly values from 7 to 11 o'clock.
I now want to calculate how big the time span is for each day and store these values in a vector.
Something like:
timespan <- c(5,7,3)
How could I calculate this in a loop?
I thought about something like length(unique...)
Thanks in advance!
Here is the code:
day <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
hour <- c(7,7,8,10,11,5,6,6,7,11,9,10,10,11,11)
df <- data.frame(day,hour)
library(dplyr)
df %>%
  group_by(day) %>%
  summarise(time_span = max(hour) - min(hour) + 1)
# A tibble: 3 x 2
# day time_span
# <dbl> <dbl>
# 1 1 5
# 2 2 7
# 3 3 3
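Since the question asks about a loop, here is a base-R sketch of the same max - min + 1 logic, using the df built above:
timespan <- numeric(0)
for (d in unique(df$day)) {
  hours <- df$hour[df$day == d]                    # hours recorded on day d
  timespan <- c(timespan, max(hours) - min(hours) + 1)
}
timespan
# [1] 5 7 3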

How to take an arithmetic average over a common variable, rather than the whole data?

So I have a data frame of daily stock prices, and I also have a variable that indicates the week of the year (1, 2, 3, ..., 51, 52); this is repeated for 22 companies. I would like to create a new variable that averages the daily prices within each week.
The equation in the original post (an image) averages the daily prices over the days d within each week t. My challenge is taking this average of days across each week. Therefore, I should have 52 values per stock that I observe.
Using ave().
dat <- transform(dat, avg_week_price=ave(price, week, company))
head(dat, 9)
# week company wday price avg_week_price
# 1 1 1 a 16.16528 15.47573
# 2 2 1 a 18.69307 15.13812
# 3 3 1 a 11.01956 12.99854
# 4 1 2 a 15.92029 14.56268
# 5 2 2 a 12.26731 13.64916
# 6 3 2 a 17.40726 17.27226
# 7 1 3 a 11.83037 13.02894
# 8 2 3 a 13.09144 12.95284
# 9 3 3 a 12.08950 15.81040
Data:
set.seed(42)
dat <- expand.grid(week = 1:3, company = 1:5, wday = letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)
An option with dplyr
library(dplyr)
dat %>%
  group_by(week, company) %>%
  mutate(avg_week_price = mean(price))
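mutate() keeps one row per day, repeating the weekly mean; if you instead want a single row per week and company (e.g. 52 values per stock), a summarise() sketch:
dat %>%
  group_by(week, company) %>%
  summarise(avg_week_price = mean(price), .groups = "drop")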

Creating new variable based on reference date calculation [duplicate]

This question already has answers here:
Calculate number of days between two dates in r
(4 answers)
Closed 2 years ago.
I have a dataframe with multiple participants (distinguished by the variable "ID") and calendar dates (MM/DD/YYYY) associated with each row of data.
I would like to create a "Day" column to calculate the number of days that has elapsed since the first calendar date for each ID (i.e. using the first date for each participant as a reference date).
Example Structure:
ID Calendar.date Day
1 06/23/2020 1
1 06/25/2020 3
1 06/26/2020 4
2 03/24/2019 1
2 03/30/2019 7
2 03/31/2019 8
Here is a dplyr approach. If you group_by the ID, you can subtract the first date within each ID from every date in that group. This assumes you have your data in a data frame df:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Calendar_date = as.Date(Calendar_date, format = "%m/%d/%Y"),
         Day = Calendar_date - first(Calendar_date) + 1)
For the output below, I modified your example data to avoid impossible dates in February. Also, the result for Day is a difftime object; if you simply want the numeric number of days, use as.numeric:
as.numeric(Calendar_date - first(Calendar_date))
Output
# A tibble: 6 x 3
# Groups: ID [2]
ID Calendar_date Day
<dbl> <date> <drtn>
1 1 2020-06-23 1 days
2 1 2020-06-25 3 days
3 1 2020-06-26 4 days
4 2 2019-02-20 1 days
5 2 2019-02-26 7 days
6 2 2019-02-27 8 days
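For comparison, a base-R sketch of the same per-ID offset using ave(), assuming a column named Calendar_date as in the answer above (min() rather than first(), so row order within an ID does not matter):
df$Calendar_date <- as.Date(df$Calendar_date, format = "%m/%d/%Y")
df$Day <- ave(as.numeric(df$Calendar_date), df$ID,
              FUN = function(x) x - min(x) + 1)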

R: cumulative total at a daily level

I have the following dataset (shown as an image in the original post; it is reproduced in the Data section below):
I want to measure the cumulative total at a daily level, so the result would look something like the second image in the original post.
I can use cumsum (base R, usable inside dplyr) but the count for "missing days" won't show up. As an example, the date 1/3/18 does not exist in the original data frame. I want this missing date to appear in the resulting data frame, and its cumulative sum should be the same as the last known date, i.e. 1/2/18, with the sum being 5.
Any help is appreciated! I am new to the language.
I'll use this second data.frame to fill out the missing dates:
daterange <- data.frame(Date = seq(min(x$Date), max(x$Date), by = "1 day"))
Base R:
transform(merge(x, daterange, all = TRUE),
          Count = cumsum(ifelse(is.na(Count), 0, Count)))
# Date Count
# 1 2018-01-01 2
# 2 2018-01-02 5
# 3 2018-01-03 5
# 4 2018-01-04 5
# 5 2018-01-05 10
# 6 2018-01-06 10
# 7 2018-01-07 10
# 8 2018-01-08 11
# ...
# 32 2018-02-01 17
With dplyr:
library(dplyr)
x %>%
  right_join(daterange) %>%
  mutate(Count = cumsum(if_else(is.na(Count), 0, Count)))
Data:
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"),
                               format = "%m/%d/%y"),
                Count = c(2, 3, 5, 1, 6))
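An alternative sketch that skips the helper data frame: tidyr::complete() can expand the date range directly and fill the new rows with 0 before the cumulative sum:
library(dplyr)
library(tidyr)
x %>%
  complete(Date = seq(min(Date), max(Date), by = "1 day"),
           fill = list(Count = 0)) %>%
  mutate(Count = cumsum(Count))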

Merging and Averaging Data in R by ID and Date

I have two datasets that I would like to merge together in an unusual way. One dataset is my master set that contains an identifier and a datetime relevant to that ID. An ID can appear multiple times with different dates attached to it:
> head(Master_Data)
# A tibble: 5 x 2
ID Date
<chr> <dttm>
1 a 2018-03-31 00:00:00
2 a 2018-02-28 00:00:00
3 b 2018-06-07 00:00:00
4 c 2018-01-31 00:00:00
5 b 2018-02-09 00:00:00
The other dataset has the same ID, a different date and a score associated with that ID and date. IDs can also show up multiple times in this dataset as well with different dates and scores:
> head(Score_Data)
# A tibble: 6 x 3
ID Date Score
<chr> <dttm> <dbl>
1 a 2018-01-19 00:00:00 3
2 a 2018-01-01 00:00:00 5
3 a 2018-03-05 00:00:00 7
4 b 2018-01-31 00:00:00 1
5 b 2018-08-09 00:00:00 5
6 c 2018-01-17 00:00:00 10
What I would like to do is add a column to Master_Data giving the mean score for that ID from the Score_Data df. The tricky part is that, for each row in Master_Data, I only want to include scores in the average if the date in Score_Data is earlier than the date for that row in Master_Data.
Example:
For row 1 in Master_Data, I would want the new column to return a value of (3+5+7)/3 = 5. However, for row 2 I would only want to see (3+5)/2 = 4 since row 3 in Score_Data has a date after 2/28
Thoughts on what would be the best approach here to get this new column in Master_Data?
This solution would work for smaller data sets, but as the size of the data grows you'll start to notice performance issues.
library(lubridate)
library(dplyr)
master_data <- data.frame(
  ID = c('a', 'a', 'b', 'c', 'b'),
  Date = c('2018-03-31 00:00:00',
           '2018-02-28 00:00:00',
           '2018-06-07 00:00:00',
           '2018-01-31 00:00:00',
           '2018-02-09 00:00:00'))
master_data$Date <- ymd_hms(master_data$Date)
Score_Data <- data.frame(
  ID = c('a', 'a', 'a', 'b', 'b', 'c'),
  Date = c('2018-01-19 00:00:00',
           '2018-01-01 00:00:00',
           '2018-03-05 00:00:00',
           '2018-01-31 00:00:00',
           '2018-08-09 00:00:00',
           '2018-01-17 00:00:00'),
  Score = c(3, 5, 7, 1, 5, 10))
Score_Data$Date <- ymd_hms(Score_Data$Date)
# For each row of master_data, average the scores of the same ID whose
# dates fall strictly before that row's date. (apply() coerces each row to
# character, but comparing a POSIXct column against a character date works.)
output <- apply(master_data, 1, function(x) {
  Score_Data %>%
    filter(ID == x[['ID']]) %>%
    filter(Date < x[['Date']]) %>%
    summarise(Val = mean(Score))
})
master_data$Output <- unlist(output)
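For larger data, a non-equi join avoids the row-by-row apply(). A sketch using dplyr's join_by() (requires dplyr >= 1.1.0), with the score dates renamed to dodge the column-name clash:
library(dplyr)
master_data %>%
  left_join(rename(Score_Data, Score_Date = Date),
            by = join_by(ID, x$Date > y$Score_Date)) %>%
  group_by(ID, Date) %>%                 # one output row per (ID, Date) pair
  summarise(Output = mean(Score), .groups = "drop")
# Rows with no earlier scores yield NA here (NaN in the apply() version).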
