I need to "split" a 15 million line df of the following form:
library(lubridate)
dateStart <- c(lubridate::ymd("2010-01-01"))
dateEnd <- c(lubridate::ymd("2010-03-06"))
length <- c(65)
Amt <- c(348.80)
df1 <- data.frame(dateStart, dateEnd, length, Amt)
df1
# dateStart dateEnd length Amt
# 1 2010-01-01 2010-03-06 65 348.8
into something like:
dateStart dateEnd length Amt
1 2010-01-01 2010-01-31 31 166.35
2 2010-02-01 2010-02-28 28 150.25
3 2010-03-01 2010-03-06 6 32.19
Where length is the number of days and Amt is the pro-rata amount for the number of days. Does anybody know how to do this? Someone mentioned the padr package to me but I do not know how to use it for this specific purpose.
Thank you in advance
I'm going to assume you have some sort of unique id field in your data set, so that each record can be identified. Otherwise this is not going to work. I also added one extra record so we can see that everything works on multiple records.
Data:
library(lubridate)
id <- c(1:2) # added id field needed for unique record and needed for grouping
dateStart <- c(lubridate::ymd("2010-01-01", "2011-01-09"))
dateEnd <- c(lubridate::ymd("2010-03-06", "2011-04-09"))
length <- c(65, 91)
Amt <- c(348.80, 468.70)
df1 <- data.frame(id , dateStart, dateEnd, length, Amt)
First create a data.frame which has the id and the missing months. We need dplyr, tidyr and padr. Group by unique id and gather the dates so that the start and end dates are in one column. For padr to extend by month we first need to thicken the data.frame. Then drop the columns that are no longer needed and pad to fill in the missing months.
library(dplyr)
library(tidyr)
library(padr)
# create last_day function for later use
last_day <- function(date) {
  ceiling_date(date, "month") - days(1)
}
dates <- df1 %>%
  select(id, dateStart, dateEnd) %>%
  group_by(id) %>%
  gather(names, dates, -id) %>%
  arrange(id, dates) %>%
  thicken(interval = "month") %>% # need to thicken first for month interval
  select(-c(names, dates)) %>%
  pad(interval = "month")
dates
# A tibble: 7 x 2
# Groups: id [2]
id dates_month
<int> <date>
1 1 2010-01-01
2 1 2010-02-01
3 1 2010-03-01
4 2 2011-01-01
5 2 2011-02-01
6 2 2011-03-01
7 2 2011-04-01
Next, join the result back to the original data.frame:
df_extended <- inner_join(dates, df1, by = "id")
df_extended
# A tibble: 7 x 6
# Groups: id [2]
id dates_month dateStart dateEnd length Amt
<int> <date> <date> <date> <dbl> <dbl>
1 1 2010-01-01 2010-01-01 2010-03-06 65 349.
2 1 2010-02-01 2010-01-01 2010-03-06 65 349.
3 1 2010-03-01 2010-01-01 2010-03-06 65 349.
4 2 2011-01-01 2011-01-09 2011-04-09 91 469.
5 2 2011-02-01 2011-01-09 2011-04-09 91 469.
6 2 2011-03-01 2011-01-09 2011-04-09 91 469.
7 2 2011-04-01 2011-01-09 2011-04-09 91 469.
Now to get to the end result. We need to use case_when here, because ifelse strips attributes and so does not return the data in Date format. The case_when calls set the correct start and end dates (I assume you need the exact start date, not the first of the month; otherwise adjust the code to use dates_month instead). I create an amount-per-day variable (amt_pd) and multiply it by the number of days in each monthly segment to get the pro-rata amount for that segment.
df_end <- df_extended %>%
  mutate(dateEnd = case_when(last_day(dates_month) <= dateEnd ~ last_day(dates_month),
                             TRUE ~ dateEnd),
         dateStart = case_when(dates_month <= dateStart ~ dateStart,
                               TRUE ~ dates_month),
         amt_pd = Amt / length,
         length = dateEnd - dateStart + 1,
         Amt = amt_pd * length) %>%
  select(-c(dates_month, amt_pd)) # get rid of not needed columns
df_end
# A tibble: 7 x 5
# Groups: id [2]
id dateStart dateEnd length Amt
<int> <date> <date> <time> <time>
1 1 2010-01-01 2010-01-31 31 166.350769230769
2 1 2010-02-01 2010-02-28 28 150.252307692308
3 1 2010-03-01 2010-03-06 6 32.1969230769231
4 2 2011-01-09 2011-01-31 23 118.462637362637
5 2 2011-02-01 2011-02-28 28 144.215384615385
6 2 2011-03-01 2011-03-31 31 159.667032967033
7 2 2011-04-01 2011-04-09 9 46.354945054945
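On the remark that ifelse does not keep the date format: ifelse takes its attributes from the test, so the Date class is dropped. A minimal base-R sketch of that behaviour, with pmin shown as one class-preserving alternative (the variable names here are made up for illustration and are not part of the answer):

```r
# Illustration only: month_end and date_end are hypothetical stand-ins
# for last_day(dates_month) and dateEnd in the answer's code.
month_end <- as.Date("2010-01-31")
date_end  <- as.Date("2010-03-06")

# ifelse() copies attributes from the logical test, so the Date class is lost
via_ifelse <- ifelse(month_end <= date_end, month_end, date_end)
print(class(via_ifelse))

# pmin() keeps the class of its arguments, so the result is still a Date
via_pmin <- pmin(month_end, date_end)
print(class(via_pmin))
```

The same idea applies to the start date with pmax, if you prefer that over case_when.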
All of this could be done in one go. But if you have 15 million rows it might be better to see if the intermediate steps work. Also note that pad has a break_above option.
This is a numeric value that indicates the number of rows in millions
above which the function will break. Safety net for situations where
the interval is different than expected and padding yields a very
large dataframe, possibly overflowing memory.
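To sanity-check the pro-rata amounts on a single record without padr, something like the following base-R sketch could be used. split_by_month is a hypothetical helper written for this check, not part of the answer, and it handles one record at a time, so it is not meant for the full 15 million rows:

```r
# Hypothetical helper: split one [start, end] period into monthly segments
# and spread amt over them in proportion to the number of days.
split_by_month <- function(start, end, amt) {
  # first day of every month touched by the period
  month_starts <- seq(as.Date(format(start, "%Y-%m-01")), end, by = "month")
  # first day of the month after each of those months
  month_after <- seq(month_starts[1], by = "month",
                     length.out = length(month_starts) + 1)[-1]
  seg_start <- pmax(month_starts, start)   # clip the first segment to the real start
  seg_end   <- pmin(month_after - 1, end)  # clip the last segment to the real end
  days <- as.numeric(seg_end - seg_start) + 1
  data.frame(dateStart = seg_start, dateEnd = seg_end,
             length = days, Amt = amt * days / sum(days))
}

res <- split_by_month(as.Date("2010-01-01"), as.Date("2010-03-06"), 348.80)
print(res)
```

The segment amounts should add back up to the original Amt, which is a quick check to run on a sample before processing everything.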
I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from the start month to the end month (determined with floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence), so unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed to reproduce from data below.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
  mutate(month = map2(floor_date(start_date, "month"),
                      floor_date(completed_date, "month"),
                      seq.Date,
                      by = "month")) %>%
  unnest(month) %>%
  transmute(month_year = substr(month, 1, 7)) %>%
  group_by(month_year) %>%
  summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when the start month and completed month are the same, if that can occur), you can end the sequence one month earlier. Subtracting 1 from the floored completion date moves it back one day, into the previous month, so the monthly sequence stops there. pmax makes sure that if the start and end months are the same, that month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
                    pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
                    seq.Date,
                    by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
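To see what the expansion step does on a small scale, here is a base-R sketch with two made-up events (the dates and the helper first_of_month are assumptions for illustration only; Map plays the role of map2 here):

```r
# Two hypothetical events: one spans Jan-Mar 2011, the other stays within Feb 2011
starts <- as.Date(c("2011-01-14", "2011-02-20"))
ends   <- as.Date(c("2011-03-03", "2011-02-25"))

# Base-R stand-in for floor_date(x, "month")
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# One monthly sequence per event, iterating over both vectors in parallel
month_list <- Map(seq, first_of_month(starts), first_of_month(ends), by = "month")

# Flatten (the equivalent of unnest) and count events per month
all_months <- do.call(c, month_list)
counts <- table(format(all_months, "%Y-%m"))
print(counts)
```

The first event contributes to January, February and March, the second only to February, so February is counted twice.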
R:
I have a data-set with N Products sales value from some yyyy-mm-dd to some yyyy-mm-dd, I just want to filter the data for the last 12 months for each product in the data-set.
Eg:
Say, I have values from 2016-01-01 to 2020-02-01
So now I want to filter the sales values for the last 12 months that is from 2019-02-01 to 2020-02-01
I cannot simply hard-code a "filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01"))" because the end date keeps changing in my case as every month passes, so I need to automate this.
You can use :
library(dplyr)
library(lubridate)
data %>%
group_by(Product) %>%
filter(between(date, max(date) - years(1), max(date)))
#filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether the date is greater than or equal to the maximum date per product minus 365 days:
library(dplyr)
df %>%
group_by(Products) %>%
filter(Date >= max(Date)-365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
Data
df <- data.frame(
Products = c(1,1,1,1,2,2,2,3,3,3),
Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
"2000-06-01", "2002-02-24", "2002-02-10",
"2001-07-01", "2003-01-02", "2005-03-10",
"2002-05-01")))
If your aim is to just capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.Date(start.date.lt) # back to Date so it compares cleanly with Date columns
Now this does have one potential fail-state: if today is February 29th. One way to deal with that would be to write a "today.last.year" function to do the above conversion, but give an explicit treatment for leap years - possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you wanted to filter based on a start-of-month date, you can make your function also set start.date.lt$mday <- 1 (the day-of-month field of a POSIXlt object is called mday), and so forth if you need to adjust in different ways.
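The "today.last.year" idea with explicit leap-day handling could look roughly like this. It is a sketch under assumptions: the function name today_last_year and the feb29 argument are invented here, and only base R is used.

```r
# Hypothetical helper: shift dates back one year, with an explicit choice
# for what February 29th should map to.
today_last_year <- function(d, feb29 = c("feb28", "mar1")) {
  feb29 <- match.arg(feb29)
  lt <- as.POSIXlt(d)
  # mon is 0-based in POSIXlt, so 1 means February
  leap_days <- which(lt$mon == 1 & lt$mday == 29)
  lt$year <- lt$year - 1
  if (feb29 == "feb28") {
    lt$mday[leap_days] <- 28L
  } else {
    lt$mon[leap_days]  <- 2L   # March
    lt$mday[leap_days] <- 1L
  }
  as.Date(lt)
}

print(today_last_year(as.Date("2020-02-29")))
print(today_last_year(as.Date("2020-02-29"), feb29 = "mar1"))
print(today_last_year(Sys.Date()))
```

The year before a leap year is never a leap year, so every February 29th needs one of the two treatments.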
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
product = rep(c("a", "b"), 6),
date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)
yearBefore <- function(x){
year(x) <- year(x) - 1
x
}
date_DT <- DT[, .(last_date = last(date)), by = product]
date_DT[, year_before := yearBefore(last_date)]
result <- DT[, date_DT[DT, on = .(product, year_before <= date), nomatch=0]]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?
I have a database containing the values of different indices with different data frequencies (weekly, monthly, daily). I hope to calculate monthly returns by extracting the beginning-of-month value from each time series.
I have tried to use a loop to partition the time series month by month then use min() to get the earliest date in the month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter to get the rows of the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>% mutate(statistic_date = ymd(statistic_date), # convert statistic_date to date format
month = month(statistic_date), #create month and year columns
year= year(statistic_date)) %>%
group_by(month,year) %>% # group by month and year
arrange(statistic_date) %>% # make sure the df is sorted by date
filter(row_number()==1) # select first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
I have a big data frame with dates and I need to find the first date in a continuous run, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the dates overlap, and I need to return by ID (not ID_2) the date when the first gap (going backwards in time) appears. I've tried using a for loop, but it's extremely slow (the data frame has 150k rows). I've been messing around with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlap, filtering the wrong date. I've tried everything I could think of, ordering in decreasing or ascending order, and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2005-12-31 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
  mutate_at(vars(END, BEG), as.Date) %>% # pass the function directly; funs() is deprecated in current dplyr
  group_by(ID) %>%
  slice(which.max(BEG > (lead(END) + 1) | is.na(BEG > (lead(END) + 1))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the first row from the top that satisfies your criteria, i.e. the first row that is a gap (TRUE). If there is no gap, it selects the last row of the group, because lead(END) is missing there and the comparison returns NA; this is why is.na(BEG > ( lead(END) + 1 )) is included.
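The selection logic can be traced by hand on the ID 1 rows from the question. This base-R sketch mimics lead with a shifted vector (next_end is a name made up for the illustration):

```r
# The three ID == 1 rows from the question, newest first
END <- as.Date(c("2017-06-30", "2015-12-31", "2008-07-26"))
BEG <- as.Date(c("2016-01-01", "2015-11-12", "2003-02-24"))

# Base-R equivalent of dplyr::lead(END): the END of the next (older) row
next_end <- c(END[-1], NA)

# TRUE where more than one day separates this row from the older one;
# NA on the last row, which has nothing older to compare with
is_gap <- BEG > next_end + 1
print(is_gap)

# which.max() returns the position of the first TRUE
first_hit <- which.max(is_gap | is.na(is_gap))
print(first_hit)
print(BEG[first_hit])
```

Here the first row touches the second (no gap), while the second row is separated from the third by several years, so the second row is picked.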
I would use the xts package, first creating an xts object for each ID you have, then using the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet
I have a data table of three columns id, dtstart, dtend. For example:
id start end
1 01/01/2015 31/01/2015
1 02/02/2015 28/02/2015
1 01/07/2016 31/07/2016
1 01/08/2016 31/08/2016
2 01/03/2015 31/03/2015
2 01/04/2015 30/04/2015
2 01/02/2016 28/02/2016
2 01/03/2016 31/03/2016
...
I need to create another data table grouped by id with the same columns, where the new start date is the minimum of the original start dates and the new end date is the maximum of the original end dates.
When there is a break of more than one day between an end date and the next start date, the rows should be grouped separately.
For example for the above the new table would be:
id start end
1 01/01/2015 28/02/2015
1 01/07/2016 31/08/2016
2 01/03/2015 30/04/2015
2 01/02/2016 31/03/2016
...
Do I need a for loop or is there a more efficient way (data table grouping for example)? The table is over 20 million rows with 100k+ unique ids.
Cheers
Andrew
This can be done using dplyr
dt.new <- dt %>%
  arrange(id, start, end) %>%
  mutate(gr = cumsum(lag(id, default = min(id)) != id |
    as.numeric(difftime(start, lag(end, default = first(start)), units = 'days')) > 1)) %>%
  group_by(id, gr) %>%
  summarise(start = first(start),
            end = last(end))
The result is:
Source: local data frame [6 x 4]
Groups: id [?]
id gr start end
<int> <int> <dttm> <dttm>
1 1 0 2015-01-01 2015-01-31
2 1 1 2015-02-02 2015-02-28
3 1 2 2016-07-01 2016-08-31
4 2 3 2015-03-01 2015-04-30
5 2 4 2016-02-01 2016-02-28
6 2 5 2016-03-01 2016-03-31
This works, but it doesn't match your output, because you requested a one-day margin (if you want a two-day margin, switch from > 1 to > 2). The gap between 31/01/2015 and 02/02/2015 is 2 days, and because 2016 was a leap year (which R's calendar accounts for), so is the gap between 28/02/2016 and 01/03/2016.
Thanks again #akash87
For example, row 6 below lies within a month that is already covered by row 5, so it should still return one row for id 1 from 01/02/2006 to 30/09/2006, but instead it breaks into two: the first from 01/02/2006 to 12/06/2006 and the second from 01/07/2006 to 30/09/2006.
id dtstart dtend
1 01/02/2006 28/02/2006
1 01/03/2006 31/03/2006
1 01/04/2006 30/04/2006
1 01/05/2006 31/05/2006
1 01/06/2006 30/06/2006
1 10/06/2006 12/06/2006
1 01/07/2006 31/07/2006
1 01/08/2006 31/08/2006
1 01/09/2006 30/09/2006
2 01/04/2006 30/04/2006
2 01/05/2006 31/05/2006
2 01/09/2006 30/09/2006
2 01/10/2006 31/10/2006
So instead of returning
id start end
1 01/02/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
We have
id start end
1 01/02/2006 12/06/2006
1 01/07/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
Andrew
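One possible way to handle the nested-interval case described above is to compare each start date with the running maximum of all earlier end dates within the id, rather than with the immediately preceding end date. The following base-R sketch is untested on the real 20-million-row table; merge_intervals and max_gap are hypothetical names, and the sample rows are a shortened version of the example above.

```r
# Hypothetical helper: merge rows of the same id whose start is at most
# max_gap days after the latest end date seen so far for that id.
merge_intervals <- function(id, start, end, max_gap = 1) {
  ord <- order(id, start, end)
  id <- id[ord]; start <- start[ord]; end <- end[ord]
  end_num <- as.numeric(end)
  # running maximum of end within each id (ignores short nested rows)
  run_max  <- ave(end_num, id, FUN = cummax)
  prev_max <- c(NA, run_max[-length(run_max)])
  new_id   <- c(TRUE, id[-1] != id[-length(id)])
  # a new group starts at a new id or after a gap larger than max_gap
  new_group <- new_id | (as.numeric(start) - prev_max > max_gap)
  gr <- cumsum(new_group)
  group_end <- unname(vapply(split(end_num, gr), max, numeric(1)))
  data.frame(id = id[new_group],
             start = start[new_group],
             end = as.Date(group_end, origin = "1970-01-01"))
}

dt <- data.frame(
  id    = c(1, 1, 1, 1, 2, 2),
  start = as.Date(c("2006-05-01", "2006-06-01", "2006-06-10",
                    "2006-07-01", "2006-04-01", "2006-09-01")),
  end   = as.Date(c("2006-05-31", "2006-06-30", "2006-06-12",
                    "2006-07-31", "2006-04-30", "2006-09-30"))
)
merged <- merge_intervals(dt$id, dt$start, dt$end)
print(merged)
```

With the running maximum, the 10/06 to 12/06 row no longer shortens the end date that the 01/07 row is compared against, so id 1 stays in one piece.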