Convert weekly data frame to monthly data frame in R

My data looks like the below, with different IDs (node_desc) having weekly data for 4 years:
ID1 ID2 DATE_ value
1: 00001 436 2014-06-29 175.8164
2: 00001 436 2014-07-06 188.9264
3: 00001 436 2014-07-13 167.5376
4: 00001 436 2014-07-20 160.7907
5: 00001 436 2014-07-27 185.3018
6: 00001 436 2014-08-03 179.5748
I would like to convert the data frame to monthly. I am trying the code below:
df %>%
  tq_transmute(select = c(value, ID1),
               mutate_fun = apply.monthly,
               FUN = mean)
But my output looks like this:
DATE_ value
<dttm> <dbl>
1 2014-06-29 00:00:00 144.
2 2014-07-27 00:00:00 143.
3 2014-08-31 00:00:00 143.
4 2014-09-28 00:00:00 152.
5 2014-10-26 00:00:00 156.
6 2014-11-30 00:00:00 166.
But I would like to have ID1, ID2, Date (monthly) and value (either the mean or the max of the ~4 weeks) instead of just date and value, because I have data for different ID1's across 4 years. Can someone help me in R?

Here's my take
dta <- data.frame(id1 = rep("00001", 6), id2 = rep("436", 6),
                  date_ = as.Date(c("29jun2014", "6jul2014", "13jul2014", "20jul2014", "27jul2014", "3aug2014"), "%d%b%Y"),
                  value = c(175.8164, 188.9264, 167.5376, 160.7907, 185.3018, 179.5748))
And dplyr (plus zoo for as.yearmon) would do the rest. Here I summarize the data by taking the mean value per id and month:
library(dplyr)
library(zoo)  # provides as.yearmon
my_dta <- dta %>% mutate(month_ = as.yearmon(date_))  # year-aware month, e.g. "Jun 2014"
my_dta %>% group_by(id1, id2, month_) %>% summarise(mvalue = mean(value))
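Computed by hand from the six rows above, the summary should keep one row per id and month, something like:
# id1   id2   month_   mvalue
# 00001 436   Jun 2014 175.8164
# 00001 436   Jul 2014 175.6391
# 00001 436   Aug 2014 179.5748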

The problem you have is that your dataset doesn't have daily data. The apply.monthly function comes from xts, but tidyquant wraps a lot of such functions so that they work in a tidy way. apply.monthly needs an xts object, which is basically a matrix with a time index.
Also note that apply.monthly returns the last available day of the month in your time series. Looking at your example set, the last day it returns for July 2014 will be the 27th. And if you have 5 records (weeks) in a month, the mean is taken over those 5 records. The result will never align exactly with calendar months, because weekly data never lines up with month boundaries.
But with tidyquant you can get a rough monthly result with ID1 and ID2 for your data if you join the outcome back to the original data. See the code below; I haven't removed any unwanted columns.
df1 %>%
  tq_transmute(select = c(value, ID1),
               mutate_fun = apply.monthly,
               FUN = mean) %>%
  mutate(DATE_ = as.Date(DATE_)) %>%
  inner_join(df1, by = "DATE_")
# A tibble: 3 x 5
DATE_ value.x ID1 ID2 value.y
<date> <dbl> <fct> <fct> <dbl>
1 2014-06-29 176. 00001 436 176.
2 2014-07-27 176. 00001 436 185.
3 2014-08-03 180. 00001 436 180.
data:
df1 <- data.frame(ID1 = rep("00001", 6),
                  ID2 = rep("436", 6),
                  DATE_ = as.Date(c("2014-06-29", "2014-07-06", "2014-07-13", "2014-07-20", "2014-07-27", "2014-08-03")),
                  value = c(175.8164, 188.9264, 167.5376, 160.7907, 185.3018, 179.5748))
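For completeness, a hedged alternative that skips the xts round-trip entirely: group by the IDs plus the calendar month and summarise. A minimal sketch, assuming df1 as defined above and dplyr >= 1.0 (for the .groups argument):
library(dplyr)
library(lubridate)
df1 %>%
  group_by(ID1, ID2, month = floor_date(DATE_, "month")) %>%  # first day of each calendar month
  summarise(value = mean(value), .groups = "drop")            # swap mean for max if preferred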

Related

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, e.g.:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand the rows for each month included in the range of dates with map2 from purrr. map2 iterates over multiple inputs simultaneously; in this case, it steps through the start and end dates at the same time.
In each iteration, it creates a monthly sequence using seq (or seq.Date) from the start month to the end month (determined with floor_date). The result is nested for each row of data (since one row can span multiple months), so unnest is needed afterwards.
The transmute adds a new variable called month_year (and drops the old ones), using substr to extract only the year and month (no day), i.e. the first through seventh characters of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed so the data below is reproducible.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completion month (except when the start month and the completion month coincide, if that can happen), you can end the sequence one day before the floored completion date, which pushes it into the previous month. pmax with the floored start date ensures that when the start and completion months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1,
                         floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
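If you want the exact snapshot semantics from the question (how many events are open as of each month start, including months with a count of zero), a hedged alternative is to build the month grid first and count against it, avoiding the row expansion. A minimal sketch over the same dataset:
library(dplyr)
library(purrr)
library(lubridate)
month_grid <- seq(floor_date(min(dataset$start_date), "month"),
                  floor_date(max(dataset$completed_date), "month"),
                  by = "month")
tibble(month = month_grid) %>%
  mutate(count = map_int(month, ~ sum(dataset$start_date <= .x &
                                        dataset$completed_date > .x)))  # started but not yet completed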

In R: Join two dataframes based on a time period condition

Being new to R, I am trying to merge two data frames by considering a time period condition.
df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"),
                  "second_event" = c("9346", "a839", "d939"),
                  "device_serial" = c("123", "123", "123"),
                  "start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),
                  "end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"),
                  "exp_id" = NA)
df2 <- data.frame("device_serial" = c("123", "123"),
                  exp_id = c("a", "b"),
                  start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00"),
                  end_timestamp = c("2020-01-17 00:05:10", NULL),
                  current_event_id = c("1", "2"),
                  current_event_timestamp = c("2020-01-17 00:05:09", "2020-01-17 00:05:09"))
This is a little bit difficult to explain, but I will do my best to present the problem.
Basically, I am monitoring some expeditions (df2) and I want to know which events (df1) are related to a certain expedition (have a look at the exp_id column in df1; that is the column I want to fill).
Note that each expedition is created by a device and, likewise, each event is generated by a device. You might say this is feasible by joining the two tables on the device id. However, the problem is that each device can be associated with multiple expeditions.
So, the objective is to work out which expedition the device was related to during a certain time period, so that events can be matched to that expedition. If you look at the third row of df1 you will see the difficulty with the time period condition: given the period in which the third row was recorded, we cannot relate it to expedition a.
And here comes the other problem: sometimes the expeditions are not finished, so we have to consider the timestamp of the last seen event (the current_event_timestamp in df2).
>df1
first_event second_event device_serial     start_timestamp       end_timestamp exp_id
       4f7d         9346           123  2019-12-06 11:47:0 2020-01-10 12:59:38     NA
       a10a         a839           123  2019-09-06 11:47:0 2019-11-22 12:06:28     NA
       e79b         d939           123 2019-09-05 10:00:00 2019-11-22 12:06:28     NA
>df2
device_serial exp_id     start_timestamp       end_timestamp current_event_id current_event_timestamp
          123      a 2019-12-03 07:12:20 2020-01-17 00:05:10                1     2020-01-17 00:05:09
          123      b 2019-09-04 10:00:00                NULL                2     2019-11-23 12:06:28
The result that I am looking for is a table like this df3:
>df3
first_event second_event device_serial     start_timestamp       end_timestamp exp_id
       4f7d         9346           123  2019-12-06 11:47:0 2020-01-10 12:59:38      a
       a10a         a839           123  2019-09-06 11:47:0 2019-11-22 12:06:28      b
       e79b         d939           123 2019-09-05 10:00:00 2019-11-22 12:06:28      b
Thanks for reading this question and helping me to solve it.
Here are some suggestions, if I understand you correctly.
First, your data, with a few edits:
- Per #r2evans' comment, I'm assuming the NULL was meant to be NA_real_.
- current_event_timestamp from df2 in the first block of code does not match what you typed out in the second block; I used the datetime from the second block, as it led to the answer you were looking for.
df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"),
"second_event" = c("9346","a839", "d939"),
"device_serial" = c("123","123","123") ,
"start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),
"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"),
"exp_id" = NA)
df2 <- data.frame("device_serial" = c("123","123") ,
exp_id= c("a","b") ,
start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") ,
end_timestamp = c("2020-01-17 00:05:10", NA_real_) ,
current_event_id = c("1", "2") ,
current_event_timestamp= c("2020-01-17 00:05:09", "2019-11-23 12:06:28"))
Now, to tidy the data a bit.
Two main points:
1. It seems like the start_timestamp and end_timestamp columns in df1 refer to starts and ends of events, whereas those same column names in df2 refer to starts and ends of expeditions. If so, it's good practice to assign these variables names that reflect the fact that the data they contain differ. In this case, the distinction is important when joining the two tables.
2. At least in your example dfs, note that all columns were read in as factors initially. Variables are usually much easier to work with if they're stored as the type of data they represent, and this is especially true for datetime data.
library(dplyr)
library(lubridate)
df1 <- df1 %>%
  as_tibble() %>%  # convert to tibble; prints the data type of each column
  select(-exp_id, evnt_start = start_timestamp, evnt_end = end_timestamp) %>%  # drop exp_id (not necessary & messes up the join) & rename the time cols
  mutate(evnt_start = as_datetime(evnt_start),  # convert time columns to datetime type
         evnt_end = as_datetime(evnt_end))
df1
# A tibble: 3 x 5
first_event second_event device_serial evnt_start evnt_end
<fct> <fct> <fct> <dttm> <dttm>
1 4f7d 9346 123 2019-12-06 11:47:00 2020-01-10 12:59:38
2 a10a a839 123 2019-09-06 11:47:00 2019-11-22 12:06:28
3 e79b d939 123 2019-09-05 10:00:00 2019-11-22 12:06:28
df2 <- df2 %>%
  as_tibble() %>%  # convert to tibble
  rename(exp_start = start_timestamp, exp_end = end_timestamp) %>%  # rename the time cols
  mutate_at(.vars = c("exp_start", "exp_end", "current_event_timestamp"), ~as_datetime(.))  # convert time cols from factor to datetime type
df2
# A tibble: 2 x 6
device_serial exp_id exp_start exp_end current_event_id current_event_timestamp
<fct> <fct> <dttm> <dttm> <fct> <dttm>
1 123 a 2019-12-03 07:12:20 2020-01-17 00:05:10 1 2020-01-17 00:05:09
2 123 b 2019-09-04 10:00:00 NA 2 2019-11-23 12:06:28
Now, to try for a solution using dplyr::left_join and dplyr::filter:
df3 <- df2 %>%
  mutate(exp_end_or_current = if_else(is.na(exp_end), current_event_timestamp, exp_end)) %>%  # new col: exp_end OR, if NA, the current event timestamp
  left_join(df1, ., by = "device_serial") %>%  # join df2 onto df1 by serial number
  filter(evnt_start > exp_start & evnt_end < exp_end_or_current) %>%  # keep only records where EVENT start & end times fall between expedition start & end times
  select(-c(exp_end, current_event_id, current_event_timestamp))
df3
# A tibble: 3 x 8
first_event second_event device_serial evnt_start evnt_end exp_id exp_start exp_end_or_current
<fct> <fct> <fct> <dttm> <dttm> <fct> <dttm> <dttm>
1 4f7d 9346 123 2019-12-06 11:47:00 2020-01-10 12:59:38 a 2019-12-03 07:12:20 2020-01-17 00:05:10
2 a10a a839 123 2019-09-06 11:47:00 2019-11-22 12:06:28 b 2019-09-04 10:00:00 2019-11-23 12:06:28
3 e79b d939 123 2019-09-05 10:00:00 2019-11-22 12:06:28 b 2019-09-04 10:00:00 2019-11-23 12:06:28
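If the tables grow large, a non-equi join does the interval matching in a single step instead of a join-then-filter. A hedged data.table sketch, assuming the tidied df1 and df2 from above (with the evnt_*/exp_* column names created there):
library(data.table)
setDT(df1); setDT(df2)
df2[, exp_end_or_current := fifelse(is.na(exp_end), current_event_timestamp, exp_end)]
# for each event in df1, find the expedition whose interval contains it
df3 <- df2[df1, on = .(device_serial,
                       exp_start <= evnt_start,
                       exp_end_or_current >= evnt_end)]
# note: in the result, the two non-equi join columns carry the event's timestamps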

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices at different frequencies (weekly, monthly, daily). I hope to calculate monthly returns by extracting the beginning-of-month value from each time series.
I have tried using a loop to partition the time series month by month and then min() to get the earliest date in each month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date),  # convert statistic_date to date format
              month = month(statistic_date),         # create month and year columns
              year = year(statistic_date)) %>%
  group_by(month, year) %>%       # group by month and year
  arrange(statistic_date) %>%     # make sure the df is sorted by date
  filter(row_number() == 1)       # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
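With dplyr 1.0 or later, the grouping and first-row selection can be condensed using slice_min; a minimal sketch over the same df:
library(dplyr)
library(lubridate)
df %>%
  mutate(statistic_date = ymd(statistic_date)) %>%
  group_by(month = floor_date(statistic_date, "month")) %>%
  slice_min(statistic_date, n = 1) %>%  # earliest row per calendar month
  ungroup()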
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month present in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
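Since df was read with fread and is therefore already a data.table, the same selection can be done natively; a sketch assuming the dates parse with the default %Y-%m-%d format:
library(data.table)
df[, statistic_date := as.IDate(statistic_date)]
df[, .SD[which.min(statistic_date)],                 # earliest row per group
   by = .(month = format(statistic_date, "%Y-%m"))]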

R- create dataset by removing duplicates based on a condition - filter

I have a data frame where, for each day, I have several prices.
I tried to modify my data frame with the following code:
newdf <- Data %>%
filter(
if (Data$Date == Data$Echeance) {
Data$Close == lag(Data$Close,1)
} else {
Data$Close == Data$Close
}
)
However, it is not giving me what I want, which is:
a new data frame where the variable Close takes its normal value, unless the day of Date is equal to the day of Echeance; in that case, take the following Close value.
I added filter because I wanted to remove the duplicate dates and keep only one row per day, where Close satisfies the condition above.
There is no error message; it just doesn't give me the right result.
Here is a glimpse of my data:
Date Echeance Compens. Open Haut Bas Close
1 1998-03-27 00:00:00 1998-09-10 00:00:00 125. 828 828 820 820. 197
2 1998-03-27 00:00:00 1998-11-10 00:00:00 128. 847 847 842 842. 124
3 1998-03-27 00:00:00 1999-01-11 00:00:00 131. 858 858 858 858. 2
4 1998-03-30 00:00:00 1998-09-10 00:00:00 125. 821 821 820 820. 38
5 1998-03-30 00:00:00 1998-11-10 00:00:00 129. 843 843 843 843. 1
6 1998-03-30 00:00:00 1999-01-11 00:00:00 131. 860 860 860 860. 5
Thanks a lot in advance.
Sounds like a use case for ifelse, with dplyr:
library(dplyr)
Data %>%
mutate(Close = ifelse(Date==Echeance, lead(Close,1), Close))
Here an example:
dat %>%
mutate(var_new = ifelse(date1==date2, lead(var,1), var))
# A tibble: 3 x 4
# date1 date2 var var_new
# <date> <date> <int> <int>
# 1 2018-03-27 2018-03-27 10 11
# 2 2018-03-28 2018-01-01 11 11
# 3 2018-03-29 2018-02-01 12 12
The lead function shifts the vector forward by one position. Also note that I created var_new just to show the difference, but you can mutate var directly.
Data used:
dat <- tibble(date1 = seq(from=as.Date("2018-03-27"), to=as.Date("2018-03-29"), by="day"),
date2 = c(as.Date("2018-03-27"), as.Date("2018-01-01"), as.Date("2018-02-01")),
var = 10:12)
dat
# A tibble: 3 x 3
# date1 date2 var
# <date> <date> <int>
# 1 2018-03-27 2018-03-27 10
# 2 2018-03-28 2018-01-01 11
# 3 2018-03-29 2018-02-01 12
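One caveat worth flagging: base ifelse() strips attributes such as the Date class, so if the column being overwritten is itself a date (or datetime), dplyr's type-stable if_else() is the safer choice. A minimal sketch on the same dat:
library(dplyr)
dat %>%
  mutate(date_new = if_else(date1 == date2, lead(date1, 1), date1))  # result keeps class Date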

"split" dataframe per month based on columns Start/End

I need to "split" a 15 million line df of the following form:
library(lubridate)
dateStart <- c(lubridate::ymd("2010-01-01"))
dateEnd <- c(lubridate::ymd("2010-03-06"))
length <- c(65)
Amt <- c(348.80)
df1 <- data.frame(dateStart, dateEnd, length, Amt)
df1
# dateStart dateEnd length Amt
# 1 2010-01-01 2010-03-06 65 348.8
into something like:
dateStart dateEnd length Amt
1 2010-01-01 2010-01-31 31 166.35
2 2010-02-01 2010-02-28 28 150.55
3 2010-03-01 2010-03-06 6 32.19
Where length is the number of days and Amt is the pro-rated amount for those days. Does anybody know how to do this? Someone mentioned the padr package to me, but I do not know how to use it for this specific purpose.
Thank you in advance
I'm going to assume you have some sort of unique id field in your data set so that each record is unique; otherwise this is not going to work. I also added one extra record so we can see that everything works on multiple records.
Data:
library(lubridate)
id <- c(1:2) # added id field needed for unique record and needed for grouping
dateStart <- c(lubridate::ymd("2010-01-01", "2011-01-09"))
dateEnd <- c(lubridate::ymd("2010-03-06", "2011-04-09"))
length <- c(65, 91)
Amt <- c(348.80, 468.70)
df1 <- data.frame(id , dateStart, dateEnd, length, Amt)
First, create a data.frame which has the id and the missing months. We need dplyr, tidyr and padr. Create groups per unique id and gather the dates so that the start and end dates are in one column. For padr to extend months, we first need to thicken the data.frame. Then get rid of the unneeded columns and fill in the missing months.
library(dplyr)
library(tidyr)
library(padr)
# create last_day function for later use
last_day <- function(date) {
  ceiling_date(date, "month") - days(1)
}
dates <- df1 %>%
  select(id, dateStart, dateEnd) %>%
  group_by(id) %>%
  gather(names, dates, -id) %>%
  arrange(id, dates) %>%
  thicken(interval = "month") %>%  # need to thicken first for the month interval
  select(-c(names, dates)) %>%
  pad(interval = "month")
dates
# A tibble: 7 x 2
# Groups: id [2]
id dates_month
<int> <date>
1 1 2010-01-01
2 1 2010-02-01
3 1 2010-03-01
4 2 2011-01-01
5 2 2011-02-01
6 2 2011-03-01
7 2 2011-04-01
Next join back the data to the original data.frame
df_extended <- inner_join(dates, df1, by = "id")
df_extended
# A tibble: 7 x 6
# Groups: id [2]
id dates_month dateStart dateEnd length Amt
<int> <date> <date> <date> <dbl> <dbl>
1 1 2010-01-01 2010-01-01 2010-03-06 65 349.
2 1 2010-02-01 2010-01-01 2010-03-06 65 349.
3 1 2010-03-01 2010-01-01 2010-03-06 65 349.
4 2 2011-01-01 2011-01-09 2011-04-09 91 469.
5 2 2011-02-01 2011-01-09 2011-04-09 91 469.
6 2 2011-03-01 2011-01-09 2011-04-09 91 469.
7 2 2011-04-01 2011-01-09 2011-04-09 91 469.
Now to get to the end result. We need case_when here; base ifelse doesn't return the data in Date format. The case_when calls set the correct start and end dates (I assume you need the exact start date, not the first of the month; otherwise adjust the code to use dates_month instead). I create an amount-per-day variable (amt_pd) so it can be multiplied by the number of days in the month to get the pro-rata amount for that month.
df_end <- df_extended %>%
  mutate(dateEnd = case_when(last_day(dates_month) <= dateEnd ~ last_day(dates_month),
                             TRUE ~ dateEnd),
         dateStart = case_when(dates_month <= dateStart ~ dateStart,
                               TRUE ~ dates_month),
         amt_pd = Amt / length,
         length = dateEnd - dateStart + 1,
         Amt = amt_pd * length) %>%
  select(-c(dates_month, amt_pd))  # get rid of columns that are no longer needed
df_end
# A tibble: 7 x 5
# Groups: id [2]
id dateStart dateEnd length Amt
<int> <date> <date> <time> <time>
1 1 2010-01-01 2010-01-31 31 166.350769230769
2 1 2010-02-01 2010-02-28 28 150.252307692308
3 1 2010-03-01 2010-03-06 6 32.1969230769231
4 2 2011-01-09 2011-01-31 23 118.462637362637
5 2 2011-02-01 2011-02-28 28 144.215384615385
6 2 2011-03-01 2011-03-31 31 159.667032967033
7 2 2011-04-01 2011-04-09 9 46.354945054945
All of this could be done in one go. But if you have 15 million rows it might be better to see if the intermediate steps work. Also note that pad has a break_above option.
This is a numeric value that indicates the number of rows in millions
above which the function will break. Safety net for situations where
the interval is different than expected and padding yields a very
large dataframe, possibly overflowing memory.
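One follow-up worth noting: length and Amt print as <time> above because date subtraction yields difftime objects that propagate through the arithmetic. If plain numbers are wanted, a final mutate converts them; a minimal sketch:
df_end %>%
  mutate(length = as.numeric(length),
         Amt = round(as.numeric(Amt), 2))  # plain numeric, rounded to two decimals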
