How to get exclusive date ranges? - r

I would like an object that gives me a date range for every month (or quarter) from 1990-01-01 to 2021-12-31, separated by a colon. So for example in the monthly case, the first object would be 1990-01-01:1990-01-31, the second object would be 1990-02-01:1990-02-28, and so on.
The issue I am having trouble with is making sure that the date range is exclusive, i.e., that no date gets repeated.
start_date1 <- as.Date("1990-01-01", "%Y-%m-%d")
end_date1 <- as.Date("2021-12-01", "%Y-%m-%d")
first_date <- format(seq(start_date1,end_date1,by="month"),"%Y-%m-%d")
start_date2 <- as.Date("1990-02-01", "%Y-%m-%d")
end_date2 <- as.Date("2022-01-01", "%Y-%m-%d")
second_date <- format(seq(start_date2,end_date2,by="month"),"%Y-%m-%d")
date<-paste0(first_date, ":")
finaldate<-paste0(date, second_date)
This code works, except that the first date of each month gets repeated ("1990-01-01:1990-02-01", "1990-02-01:1990-03-01") and the last range is "2021-12-01:2022-01-01", which includes Jan 1, 2022 rather than stopping at Dec 31, 2021.
If I go by 30 days instead, it doesn't work as well because not every month has 30 days.
What's the best way to get an exclusive date range?

You could do:
dates <- seq(as.Date("1990-01-01"), as.Date("2022-01-01"), by = "month")
dates <- paste(head(dates, -1), tail(dates-1, - 1), sep = ":")
resulting in:
dates
#> [1] "1990-01-01:1990-01-31" "1990-02-01:1990-02-28" "1990-03-01:1990-03-31"
#> [4] "1990-04-01:1990-04-30" "1990-05-01:1990-05-31" "1990-06-01:1990-06-30"
#> [7] "1990-07-01:1990-07-31" "1990-08-01:1990-08-31" "1990-09-01:1990-09-30"
#> [10] "1990-10-01:1990-10-31" "1990-11-01:1990-11-30" "1990-12-01:1990-12-31"
#> [13] "1991-01-01:1991-01-31" "1991-02-01:1991-02-28" "1991-03-01:1991-03-31"
#> [16] "1991-04-01:1991-04-30" "1991-05-01:1991-05-31" "1991-06-01:1991-06-30"
#> [19] "1991-07-01:1991-07-31" "1991-08-01:1991-08-31" "1991-09-01:1991-09-30"
#> [22] "1991-10-01:1991-10-31" "1991-11-01:1991-11-30" "1991-12-01:1991-12-31"
#> [25] "1992-01-01:1992-01-31" "1992-02-01:1992-02-29" "1992-03-01:1992-03-31"
#> [28] "1992-04-01:1992-04-30" "1992-05-01:1992-05-31" "1992-06-01:1992-06-30"
#> [31] "1992-07-01:1992-07-31" "1992-08-01:1992-08-31" "1992-09-01:1992-09-30"
#> [34] "1992-10-01:1992-10-31" "1992-11-01:1992-11-30" "1992-12-01:1992-12-31"
#> [37] "1993-01-01:1993-01-31" "1993-02-01:1993-02-28" "1993-03-01:1993-03-31"
#> [40] "1993-04-01:1993-04-30" "1993-05-01:1993-05-31" "1993-06-01:1993-06-30"
#> [43] "1993-07-01:1993-07-31" "1993-08-01:1993-08-31" "1993-09-01:1993-09-30"
#> [46] "1993-10-01:1993-10-31" "1993-11-01:1993-11-30" "1993-12-01:1993-12-31"
#> [49] "1994-01-01:1994-01-31" "1994-02-01:1994-02-28" "1994-03-01:1994-03-31"
#> [52] "1994-04-01:1994-04-30" "1994-05-01:1994-05-31" "1994-06-01:1994-06-30"
#> [55] "1994-07-01:1994-07-31" "1994-08-01:1994-08-31" "1994-09-01:1994-09-30"
#> [58] "1994-10-01:1994-10-31" "1994-11-01:1994-11-30" "1994-12-01:1994-12-31"
#> [61] "1995-01-01:1995-01-31" "1995-02-01:1995-02-28" "1995-03-01:1995-03-31"
#> [64] "1995-04-01:1995-04-30" "1995-05-01:1995-05-31" "1995-06-01:1995-06-30"
#> [67] "1995-07-01:1995-07-31" "1995-08-01:1995-08-31" "1995-09-01:1995-09-30"
#> [70] "1995-10-01:1995-10-31" "1995-11-01:1995-11-30" "1995-12-01:1995-12-31"
#> [73] "1996-01-01:1996-01-31" "1996-02-01:1996-02-29" "1996-03-01:1996-03-31"
#> [76] "1996-04-01:1996-04-30" "1996-05-01:1996-05-31" "1996-06-01:1996-06-30"
#> [79] "1996-07-01:1996-07-31" "1996-08-01:1996-08-31" "1996-09-01:1996-09-30"
#> [82] "1996-10-01:1996-10-31" "1996-11-01:1996-11-30" "1996-12-01:1996-12-31"
#> [85] "1997-01-01:1997-01-31" "1997-02-01:1997-02-28" "1997-03-01:1997-03-31"
#> [88] "1997-04-01:1997-04-30" "1997-05-01:1997-05-31" "1997-06-01:1997-06-30"
#> [91] "1997-07-01:1997-07-31" "1997-08-01:1997-08-31" "1997-09-01:1997-09-30"
#> [94] "1997-10-01:1997-10-31" "1997-11-01:1997-11-30" "1997-12-01:1997-12-31"
#> [97] "1998-01-01:1998-01-31" "1998-02-01:1998-02-28" "1998-03-01:1998-03-31"
#> [100] "1998-04-01:1998-04-30" "1998-05-01:1998-05-31" "1998-06-01:1998-06-30"
#> [103] "1998-07-01:1998-07-31" "1998-08-01:1998-08-31" "1998-09-01:1998-09-30"
#> [106] "1998-10-01:1998-10-31" "1998-11-01:1998-11-30" "1998-12-01:1998-12-31"
#> [109] "1999-01-01:1999-01-31" "1999-02-01:1999-02-28" "1999-03-01:1999-03-31"
#> [112] "1999-04-01:1999-04-30" "1999-05-01:1999-05-31" "1999-06-01:1999-06-30"
#> [115] "1999-07-01:1999-07-31" "1999-08-01:1999-08-31" "1999-09-01:1999-09-30"
#> [118] "1999-10-01:1999-10-31" "1999-11-01:1999-11-30" "1999-12-01:1999-12-31"
#> [121] "2000-01-01:2000-01-31" "2000-02-01:2000-02-29" "2000-03-01:2000-03-31"
#> [124] "2000-04-01:2000-04-30" "2000-05-01:2000-05-31" "2000-06-01:2000-06-30"
#> [127] "2000-07-01:2000-07-31" "2000-08-01:2000-08-31" "2000-09-01:2000-09-30"
#> [130] "2000-10-01:2000-10-31" "2000-11-01:2000-11-30" "2000-12-01:2000-12-31"
#> [133] "2001-01-01:2001-01-31" "2001-02-01:2001-02-28" "2001-03-01:2001-03-31"
#> [136] "2001-04-01:2001-04-30" "2001-05-01:2001-05-31" "2001-06-01:2001-06-30"
#> [139] "2001-07-01:2001-07-31" "2001-08-01:2001-08-31" "2001-09-01:2001-09-30"
#> [142] "2001-10-01:2001-10-31" "2001-11-01:2001-11-30" "2001-12-01:2001-12-31"
#> [145] "2002-01-01:2002-01-31" "2002-02-01:2002-02-28" "2002-03-01:2002-03-31"
#> [148] "2002-04-01:2002-04-30" "2002-05-01:2002-05-31" "2002-06-01:2002-06-30"
#> [151] "2002-07-01:2002-07-31" "2002-08-01:2002-08-31" "2002-09-01:2002-09-30"
#> [154] "2002-10-01:2002-10-31" "2002-11-01:2002-11-30" "2002-12-01:2002-12-31"
#> [157] "2003-01-01:2003-01-31" "2003-02-01:2003-02-28" "2003-03-01:2003-03-31"
#> [160] "2003-04-01:2003-04-30" "2003-05-01:2003-05-31" "2003-06-01:2003-06-30"
#> [163] "2003-07-01:2003-07-31" "2003-08-01:2003-08-31" "2003-09-01:2003-09-30"
#> [166] "2003-10-01:2003-10-31" "2003-11-01:2003-11-30" "2003-12-01:2003-12-31"
#> [169] "2004-01-01:2004-01-31" "2004-02-01:2004-02-29" "2004-03-01:2004-03-31"
#> [172] "2004-04-01:2004-04-30" "2004-05-01:2004-05-31" "2004-06-01:2004-06-30"
#> [175] "2004-07-01:2004-07-31" "2004-08-01:2004-08-31" "2004-09-01:2004-09-30"
#> [178] "2004-10-01:2004-10-31" "2004-11-01:2004-11-30" "2004-12-01:2004-12-31"
#> [181] "2005-01-01:2005-01-31" "2005-02-01:2005-02-28" "2005-03-01:2005-03-31"
#> [184] "2005-04-01:2005-04-30" "2005-05-01:2005-05-31" "2005-06-01:2005-06-30"
#> [187] "2005-07-01:2005-07-31" "2005-08-01:2005-08-31" "2005-09-01:2005-09-30"
#> [190] "2005-10-01:2005-10-31" "2005-11-01:2005-11-30" "2005-12-01:2005-12-31"
#> [193] "2006-01-01:2006-01-31" "2006-02-01:2006-02-28" "2006-03-01:2006-03-31"
#> [196] "2006-04-01:2006-04-30" "2006-05-01:2006-05-31" "2006-06-01:2006-06-30"
#> [199] "2006-07-01:2006-07-31" "2006-08-01:2006-08-31" "2006-09-01:2006-09-30"
#> [202] "2006-10-01:2006-10-31" "2006-11-01:2006-11-30" "2006-12-01:2006-12-31"
#> [205] "2007-01-01:2007-01-31" "2007-02-01:2007-02-28" "2007-03-01:2007-03-31"
#> [208] "2007-04-01:2007-04-30" "2007-05-01:2007-05-31" "2007-06-01:2007-06-30"
#> [211] "2007-07-01:2007-07-31" "2007-08-01:2007-08-31" "2007-09-01:2007-09-30"
#> [214] "2007-10-01:2007-10-31" "2007-11-01:2007-11-30" "2007-12-01:2007-12-31"
#> [217] "2008-01-01:2008-01-31" "2008-02-01:2008-02-29" "2008-03-01:2008-03-31"
#> [220] "2008-04-01:2008-04-30" "2008-05-01:2008-05-31" "2008-06-01:2008-06-30"
#> [223] "2008-07-01:2008-07-31" "2008-08-01:2008-08-31" "2008-09-01:2008-09-30"
#> [226] "2008-10-01:2008-10-31" "2008-11-01:2008-11-30" "2008-12-01:2008-12-31"
#> [229] "2009-01-01:2009-01-31" "2009-02-01:2009-02-28" "2009-03-01:2009-03-31"
#> [232] "2009-04-01:2009-04-30" "2009-05-01:2009-05-31" "2009-06-01:2009-06-30"
#> [235] "2009-07-01:2009-07-31" "2009-08-01:2009-08-31" "2009-09-01:2009-09-30"
#> [238] "2009-10-01:2009-10-31" "2009-11-01:2009-11-30" "2009-12-01:2009-12-31"
#> [241] "2010-01-01:2010-01-31" "2010-02-01:2010-02-28" "2010-03-01:2010-03-31"
#> [244] "2010-04-01:2010-04-30" "2010-05-01:2010-05-31" "2010-06-01:2010-06-30"
#> [247] "2010-07-01:2010-07-31" "2010-08-01:2010-08-31" "2010-09-01:2010-09-30"
#> [250] "2010-10-01:2010-10-31" "2010-11-01:2010-11-30" "2010-12-01:2010-12-31"
#> [253] "2011-01-01:2011-01-31" "2011-02-01:2011-02-28" "2011-03-01:2011-03-31"
#> [256] "2011-04-01:2011-04-30" "2011-05-01:2011-05-31" "2011-06-01:2011-06-30"
#> [259] "2011-07-01:2011-07-31" "2011-08-01:2011-08-31" "2011-09-01:2011-09-30"
#> [262] "2011-10-01:2011-10-31" "2011-11-01:2011-11-30" "2011-12-01:2011-12-31"
#> [265] "2012-01-01:2012-01-31" "2012-02-01:2012-02-29" "2012-03-01:2012-03-31"
#> [268] "2012-04-01:2012-04-30" "2012-05-01:2012-05-31" "2012-06-01:2012-06-30"
#> [271] "2012-07-01:2012-07-31" "2012-08-01:2012-08-31" "2012-09-01:2012-09-30"
#> [274] "2012-10-01:2012-10-31" "2012-11-01:2012-11-30" "2012-12-01:2012-12-31"
#> [277] "2013-01-01:2013-01-31" "2013-02-01:2013-02-28" "2013-03-01:2013-03-31"
#> [280] "2013-04-01:2013-04-30" "2013-05-01:2013-05-31" "2013-06-01:2013-06-30"
#> [283] "2013-07-01:2013-07-31" "2013-08-01:2013-08-31" "2013-09-01:2013-09-30"
#> [286] "2013-10-01:2013-10-31" "2013-11-01:2013-11-30" "2013-12-01:2013-12-31"
#> [289] "2014-01-01:2014-01-31" "2014-02-01:2014-02-28" "2014-03-01:2014-03-31"
#> [292] "2014-04-01:2014-04-30" "2014-05-01:2014-05-31" "2014-06-01:2014-06-30"
#> [295] "2014-07-01:2014-07-31" "2014-08-01:2014-08-31" "2014-09-01:2014-09-30"
#> [298] "2014-10-01:2014-10-31" "2014-11-01:2014-11-30" "2014-12-01:2014-12-31"
#> [301] "2015-01-01:2015-01-31" "2015-02-01:2015-02-28" "2015-03-01:2015-03-31"
#> [304] "2015-04-01:2015-04-30" "2015-05-01:2015-05-31" "2015-06-01:2015-06-30"
#> [307] "2015-07-01:2015-07-31" "2015-08-01:2015-08-31" "2015-09-01:2015-09-30"
#> [310] "2015-10-01:2015-10-31" "2015-11-01:2015-11-30" "2015-12-01:2015-12-31"
#> [313] "2016-01-01:2016-01-31" "2016-02-01:2016-02-29" "2016-03-01:2016-03-31"
#> [316] "2016-04-01:2016-04-30" "2016-05-01:2016-05-31" "2016-06-01:2016-06-30"
#> [319] "2016-07-01:2016-07-31" "2016-08-01:2016-08-31" "2016-09-01:2016-09-30"
#> [322] "2016-10-01:2016-10-31" "2016-11-01:2016-11-30" "2016-12-01:2016-12-31"
#> [325] "2017-01-01:2017-01-31" "2017-02-01:2017-02-28" "2017-03-01:2017-03-31"
#> [328] "2017-04-01:2017-04-30" "2017-05-01:2017-05-31" "2017-06-01:2017-06-30"
#> [331] "2017-07-01:2017-07-31" "2017-08-01:2017-08-31" "2017-09-01:2017-09-30"
#> [334] "2017-10-01:2017-10-31" "2017-11-01:2017-11-30" "2017-12-01:2017-12-31"
#> [337] "2018-01-01:2018-01-31" "2018-02-01:2018-02-28" "2018-03-01:2018-03-31"
#> [340] "2018-04-01:2018-04-30" "2018-05-01:2018-05-31" "2018-06-01:2018-06-30"
#> [343] "2018-07-01:2018-07-31" "2018-08-01:2018-08-31" "2018-09-01:2018-09-30"
#> [346] "2018-10-01:2018-10-31" "2018-11-01:2018-11-30" "2018-12-01:2018-12-31"
#> [349] "2019-01-01:2019-01-31" "2019-02-01:2019-02-28" "2019-03-01:2019-03-31"
#> [352] "2019-04-01:2019-04-30" "2019-05-01:2019-05-31" "2019-06-01:2019-06-30"
#> [355] "2019-07-01:2019-07-31" "2019-08-01:2019-08-31" "2019-09-01:2019-09-30"
#> [358] "2019-10-01:2019-10-31" "2019-11-01:2019-11-30" "2019-12-01:2019-12-31"
#> [361] "2020-01-01:2020-01-31" "2020-02-01:2020-02-29" "2020-03-01:2020-03-31"
#> [364] "2020-04-01:2020-04-30" "2020-05-01:2020-05-31" "2020-06-01:2020-06-30"
#> [367] "2020-07-01:2020-07-31" "2020-08-01:2020-08-31" "2020-09-01:2020-09-30"
#> [370] "2020-10-01:2020-10-31" "2020-11-01:2020-11-30" "2020-12-01:2020-12-31"
#> [373] "2021-01-01:2021-01-31" "2021-02-01:2021-02-28" "2021-03-01:2021-03-31"
#> [376] "2021-04-01:2021-04-30" "2021-05-01:2021-05-31" "2021-06-01:2021-06-30"
#> [379] "2021-07-01:2021-07-31" "2021-08-01:2021-08-31" "2021-09-01:2021-09-30"
#> [382] "2021-10-01:2021-10-31" "2021-11-01:2021-11-30" "2021-12-01:2021-12-31"
Created on 2022-03-19 by the reprex package (v2.0.1)
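The same head()/tail() pairing should carry over to quarters, which the question also asks about. A minimal sketch, assuming quarters aligned to 1 January (the names q_starts and q_ranges are illustrative and not part of the answer above):

```r
# Quarter start dates, with one extra start (2022-01-01) so the last quarter has an end
q_starts <- seq(as.Date("1990-01-01"), as.Date("2022-01-01"), by = "quarter")

# Pair each start with the day before the next start
q_ranges <- paste(head(q_starts, -1), tail(q_starts - 1, -1), sep = ":")

head(q_ranges, 2)
tail(q_ranges, 1)
```

Because each end date is derived from the next start date, no day can appear in two ranges.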

I used lubridate for the simplicity of its ymd() function.
require(lubridate)
You start with creating a vector of first days of the month:
start <- seq(ymd("1990-01-01"), ymd("2021-12-01"), by = "month")
Then you create another vector subtracting 1 day to obtain the last day of each month:
b <- start - 1
You remove the first element of that vector
end <- b[-1]
You join them all
paste0(start, ":", end)
There's one issue, which is easy to fix manually: end has one element fewer than start, so paste0() recycles it and the very last interval comes out wrong.
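One way to avoid the incorrect last interval is to derive the month ends from a sequence of "next month" starts that has the same length as start. This is only a sketch in base R rather than lubridate; next_start and ranges are names introduced here for illustration:

```r
# First day of every month in the requested period
start <- seq(as.Date("1990-01-01"), as.Date("2021-12-01"), by = "month")

# First day of the following month, one per element of `start`
next_start <- seq(start[2], by = "month", length.out = length(start))

# The last day of each month is the day before the next month begins
end <- next_start - 1

ranges <- paste0(start, ":", end)
tail(ranges, 2)
```

Since start and end now have equal length, nothing is recycled and the final range ends on 2021-12-31.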

1) yearmon/yearqtr Create a monthly sequence using yearmon class and then convert that to the start and end dates. Similarly for quarters and yearqtr class. Internally both represent dates by year and fraction of year so use 1/12 and 1/4 in by=. Also note that using as.Date gives the date at the start of the month or quarter and the same but with the frac=1 argument gives the end.
library(zoo)
# input
st <- as.Date("1990-01-01")
en <- as.Date("2021-12-01")
# by month
mon <- seq(as.yearmon(st), as.yearmon(en), 1/12)
paste(as.Date(mon), as.Date(mon, frac = 1), sep = ":")
# by quarter
qtr <- seq(as.yearqtr(st), as.yearqtr(en), 1/4)
paste(as.Date(qtr), as.Date(qtr, frac = 1), sep = ":")
There is some question of what the end date should be. The above give an end date on the last interval of 2021-12-31 but if the end date should be 2021-12-01 so that no interval extends past en then replace the two paste lines with these respectively.
paste(as.Date(mon), pmin(as.Date(mon, frac = 1), en), sep = ":")
paste(as.Date(qtr), pmin(as.Date(qtr, frac = 1), en), sep = ":")
2) Base R A base R alternative is to use the expressions involving cut shown below to get the end of period. (1) seems less tricky but this might be useful if using only base R were desired. A similar approach with pmin as in (1) could be used if we want to ensure that no range extends beyond en.
This and the remaining solutions, but not (1), assume that st is the first of the month; however, that could readily be handled if needed.
mon <- seq(st, en, by = "month")
paste(mon, as.Date(cut(mon + 31, "month")) - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(as.Date(qtr), as.Date(cut(qtr + 93, "month")) - 1, sep = ":")
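If the start date is not the first of a month, one possible way to handle it is to snap it back before building the sequence. The sketch below assumes that ranges should still begin on the first of the month; st_any and st_first are hypothetical names, not part of the answer:

```r
st_any <- as.Date("1990-01-17")  # hypothetical start date that is not the first of a month
en <- as.Date("2021-12-01")

# Snap back to the first day of the containing month
st_first <- as.Date(format(st_any, "%Y-%m-01"))

# Same cut()-based expression as in (2), now starting from the snapped date
mon <- seq(st_first, en, by = "month")
ranges <- paste(mon, as.Date(cut(mon + 31, "month")) - 1, sep = ":")
head(ranges, 2)
```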
3) lubridate Using various functions from this package we can write the following. A similar approach using pmin as in (1) could be used if the ranges may not extend beyond en.
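library(lubridate)
mon <- seq(st, en, by = "month")
# months(1) is a one-month Period; month(1) would just return the number 1
paste(mon, mon + months(1) - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
# a quarter is three months
paste(qtr, qtr + months(3) - 1, sep = ":")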
4) IDate We can use IDate class from data.table in which case we can make use of cut.IDate which returns another IDate object rather than a character string (as in base R).
library(data.table)
st <- as.IDate("1990-01-01")
en <- as.IDate("2021-12-01")
mon <- seq(st, en, by = "month")
paste(mon, cut(mon + 31, "month") - 1, sep = ":")
qtr <- seq(st, en, by = "quarter")
paste(qtr, cut(qtr + 93, "month") - 1, sep = ":")

Related

using the as.POSIXlt function

I am currently using the "economics" dataset in the ggplot2 package. I have been told to try this code, and it works, but I do not understand the first line. How does this date conversion function work? I have seen in the vignette that it is supposed to change the timezone, but it doesn't seem to be used for that purpose here. What does x refer to? I would be grateful for any help!
year <- function(x) as.POSIXlt(x)$year + 1900
ggplot(economics, aes(unemploy / pop, uempmed)) +
geom_path(colour = "grey50") +
geom_point(aes(colour = year(date)))
The as.POSIXlt function converts an existing date or date-time (in a variety of different formats) into an object of class "POSIXlt". This is really just a list with different components such as year, month, day, etc.
We can see this with a simple example:
my_date <- as.POSIXlt("2022-07-20")
unclass(my_date)
#> $sec
#> [1] 0
#>
#> $min
#> [1] 0
#>
#> $hour
#> [1] 0
#>
#> $mday
#> [1] 20
#>
#> $mon
#> [1] 6
#>
#> $year
#> [1] 122
#>
#> $wday
#> [1] 3
#>
#> $yday
#> [1] 200
#>
#> $isdst
#> [1] 0
#>
#> attr(,"tzone")
#> [1] "GMT"
We can use $ notation to extract any of these elements just as we can with a normal list:
my_date$year
#> [1] 122
Since the year is represented as an integer relative to 1900, adding 1900 to the result simply returns the year as an integer:
my_date$year + 1900
#> [1] 2022
So your year function will simply extract the year as an integer from a date or date-time in a variety of formats. In the case of your plot code, it is simply extracting the year from the date column.
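To see what x refers to, the helper can be called on a small vector of dates; x is simply whatever is passed in, which in the plot is the date column. A minimal sketch with made-up dates (not taken from the economics data):

```r
# Same helper as in the question: x is any date or date-time vector
year <- function(x) as.POSIXlt(x)$year + 1900

dates <- as.Date(c("1967-07-01", "2015-04-01"))  # illustrative dates

years <- year(dates)
years          # 1967 2015

# The mon component is zero-based, so add 1 for a calendar month
months_1based <- as.POSIXlt(dates)$mon + 1
months_1based  # 7 4
```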
The year function you have shown is essentially identical to the year.default function in the lubridate package, except you don't need to load in an extra package:
lubridate:::year.default
#> function (x)
#> as.POSIXlt(x, tz = tz(x))$year + 1900
#> <bytecode: 0x000001ba60f19bf8>
#> <environment: namespace:lubridate>
Created on 2022-07-20 by the reprex package (v2.0.1)

Finding out dates

I have a data frame which looks like this:
Subscription MonthlyPayment FirstPaymentDate NumberofPayments
<chr> <dbl> <date> <int>
1 Netflix 12.99 2021-05-24 21
2 Spotify 9.99 2021-08-17 7
3 PureGym 19.99 2022-07-04 9
4 DisneyPlus 7.99 2020-10-26 11
5 AmazonPrime 34.99 2020-08-11 73
6 Youtube 12.99 2020-09-27 35
I want to find out future payment dates for each subscription service. For example Netflix has 21 monthly payments, so I want to list out all the monthly payment days from the first payment date. How would I do this for each subscription service, using dplyr?
You can use dplyr and tidyr; I create a list of sequential payments (rowwise) and then unnest that list
library(dplyr)
library(tidyr)
df %>%
rowwise() %>%
mutate(Payments = list(seq(FirstPaymentDate, by="month", length.out=NumberofPayments))) %>%
unnest(Payments)
Output:
# A tibble: 156 × 5
Subscription MonthlyPayment FirstPaymentDate NumberofPayments Payments
<chr> <dbl> <date> <int> <date>
1 Netflix 13.0 2021-05-24 21 2021-05-24
2 Netflix 13.0 2021-05-24 21 2021-06-24
3 Netflix 13.0 2021-05-24 21 2021-07-24
4 Netflix 13.0 2021-05-24 21 2021-08-24
5 Netflix 13.0 2021-05-24 21 2021-09-24
6 Netflix 13.0 2021-05-24 21 2021-10-24
7 Netflix 13.0 2021-05-24 21 2021-11-24
8 Netflix 13.0 2021-05-24 21 2021-12-24
9 Netflix 13.0 2021-05-24 21 2022-01-24
10 Netflix 13.0 2021-05-24 21 2022-02-24
# … with 146 more rows
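Alternatively, you can add the months based on NumberofPayments directly to FirstPaymentDate with lubridate's %m+% operator. This approach does not require dplyr. Note that seq(n) starts at 1, so the dates listed below begin one month after the first payment date; use seq(n) - 1 as the offsets if the first payment should be included.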
library(lubridate)
library(purrr)
df <- data.frame(sub = c("Netflix", "Spotify", "PureGym", "DisneyPlus", "AmazonPrime", "Youtube"),
mo_pay = c(12.99, 9.99, 19.99, 7.99, 34.99, 12.99),
dt_fpay = as.Date(c("2021-05-24", "2021-08-17", "2022-07-04", "2020-10-26", "2020-08-11", "2020-09-27")),
n_pay = c(21, 7, 9, 11, 73, 35))
pay_dt <- map(seq(nrow(df)),
function(x) df$dt_fpay[x] %m+% months(seq(df$n_pay[x])))
names(pay_dt) <- df$sub
pay_dt
output:
> pay_dt
$Netflix
[1] "2021-06-24" "2021-07-24" "2021-08-24" "2021-09-24" "2021-10-24" "2021-11-24"
[7] "2021-12-24" "2022-01-24" "2022-02-24" "2022-03-24" "2022-04-24" "2022-05-24"
[13] "2022-06-24" "2022-07-24" "2022-08-24" "2022-09-24" "2022-10-24" "2022-11-24"
[19] "2022-12-24" "2023-01-24" "2023-02-24"
$Spotify
[1] "2021-09-17" "2021-10-17" "2021-11-17" "2021-12-17" "2022-01-17" "2022-02-17"
[7] "2022-03-17"
$PureGym
[1] "2022-08-04" "2022-09-04" "2022-10-04" "2022-11-04" "2022-12-04" "2023-01-04"
[7] "2023-02-04" "2023-03-04" "2023-04-04"
$DisneyPlus
[1] "2020-11-26" "2020-12-26" "2021-01-26" "2021-02-26" "2021-03-26" "2021-04-26"
[7] "2021-05-26" "2021-06-26" "2021-07-26" "2021-08-26" "2021-09-26"
$AmazonPrime
[1] "2020-09-11" "2020-10-11" "2020-11-11" "2020-12-11" "2021-01-11" "2021-02-11"
[7] "2021-03-11" "2021-04-11" "2021-05-11" "2021-06-11" "2021-07-11" "2021-08-11"
[13] "2021-09-11" "2021-10-11" "2021-11-11" "2021-12-11" "2022-01-11" "2022-02-11"
[19] "2022-03-11" "2022-04-11" "2022-05-11" "2022-06-11" "2022-07-11" "2022-08-11"
[25] "2022-09-11" "2022-10-11" "2022-11-11" "2022-12-11" "2023-01-11" "2023-02-11"
[31] "2023-03-11" "2023-04-11" "2023-05-11" "2023-06-11" "2023-07-11" "2023-08-11"
[37] "2023-09-11" "2023-10-11" "2023-11-11" "2023-12-11" "2024-01-11" "2024-02-11"
[43] "2024-03-11" "2024-04-11" "2024-05-11" "2024-06-11" "2024-07-11" "2024-08-11"
[49] "2024-09-11" "2024-10-11" "2024-11-11" "2024-12-11" "2025-01-11" "2025-02-11"
[55] "2025-03-11" "2025-04-11" "2025-05-11" "2025-06-11" "2025-07-11" "2025-08-11"
[61] "2025-09-11" "2025-10-11" "2025-11-11" "2025-12-11" "2026-01-11" "2026-02-11"
[67] "2026-03-11" "2026-04-11" "2026-05-11" "2026-06-11" "2026-07-11" "2026-08-11"
[73] "2026-09-11"
$Youtube
[1] "2020-10-27" "2020-11-27" "2020-12-27" "2021-01-27" "2021-02-27" "2021-03-27"
[7] "2021-04-27" "2021-05-27" "2021-06-27" "2021-07-27" "2021-08-27" "2021-09-27"
[13] "2021-10-27" "2021-11-27" "2021-12-27" "2022-01-27" "2022-02-27" "2022-03-27"
[19] "2022-04-27" "2022-05-27" "2022-06-27" "2022-07-27" "2022-08-27" "2022-09-27"
[25] "2022-10-27" "2022-11-27" "2022-12-27" "2023-01-27" "2023-02-27" "2023-03-27"
[31] "2023-04-27" "2023-05-27" "2023-06-27" "2023-07-27" "2023-08-27"
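If the first payment date itself should be part of each list, a base R sketch along the following lines may be enough. It assumes payment days no later than the 28th (as in the data shown), and it uses shortened payment counts purely for illustration:

```r
# Two subscriptions from the question, with shortened (hypothetical) payment counts
df <- data.frame(
  Subscription     = c("Netflix", "Spotify"),
  FirstPaymentDate = as.Date(c("2021-05-24", "2021-08-17")),
  NumberofPayments = c(3L, 2L),
  stringsAsFactors = FALSE
)

# One monthly sequence per row, starting at the first payment date
payments <- lapply(seq_len(nrow(df)), function(i) {
  seq(df$FirstPaymentDate[i], by = "month", length.out = df$NumberofPayments[i])
})
names(payments) <- df$Subscription

payments
```

Using length.out means each list has exactly NumberofPayments entries, the first of which is FirstPaymentDate.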

data.table vs dplyr memory use revisited

I know that data.table vs dplyr comparisons are a perennial favourite on SO. (Full disclosure: I like and use both packages.)
However, in trying to provide some comparisons for a class that I'm teaching, I ran into something surprising w.r.t. memory usage. My expectation was that dplyr would perform especially poorly with operations that require (implicit) filtering or slicing of data. But that's not what I'm finding. Compare:
First dplyr.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DF = tibble(x = rep(1:10, times = 1e5),
y = sample(LETTERS[1:10], 10e5, replace = TRUE),
z = rnorm(1e6))
DF %>% filter(x > 7) %>% group_by(y) %>% summarise(mean(z))
#> # A tibble: 10 x 2
#> y `mean(z)`
#> * <chr> <dbl>
#> 1 A -0.00336
#> 2 B -0.00702
#> 3 C 0.00291
#> 4 D -0.00430
#> 5 E -0.00705
#> 6 F -0.00568
#> 7 G -0.00344
#> 8 H 0.000553
#> 9 I -0.00168
#> 10 J 0.00661
bench::bench_process_memory()
#> current max
#> 585MB 611MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Then data.table.
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
DT = data.table(x = rep(1:10, times = 1e5),
y = sample(LETTERS[1:10], 10e5, replace = TRUE),
z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 948.47MB 1.17GB
Created on 2020-04-22 by the reprex package (v0.3.0)
So, basically data.table appears to be using nearly twice the memory that dplyr does for this simple filtering+grouping operation. Note that I'm essentially replicating a use-case that @Arun suggested here would be much more memory efficient on the data.table side. (data.table is still a lot faster, though.)
Any ideas, or am I just missing something obvious?
P.S. As an aside, comparing memory usage ends up being more complicated than it first seems because R's standard memory profiling tools (Rprofmem and co.) all ignore operations that occur outside R (e.g. calls to the C++ stack). Luckily, the bench package now provides a bench_process_memory() function that also tracks memory outside of R’s GC heap, which is why I use it here.
sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.9.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.12.8 dplyr_0.8.99.9002 bench_1.1.1.9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 knitr_1.28 magrittr_1.5 tidyselect_1.0.0
#> [5] R6_2.4.1 rlang_0.4.5.9000 stringr_1.4.0 highr_0.8
#> [9] tools_3.6.3 xfun_0.13 htmltools_0.4.0 ellipsis_0.3.0
#> [13] yaml_2.2.1 digest_0.6.25 tibble_3.0.1 lifecycle_0.2.0
#> [17] crayon_1.3.4 purrr_0.3.4 vctrs_0.2.99.9011 glue_1.4.0
#> [21] evaluate_0.14 rmarkdown_2.1 stringi_1.4.6 compiler_3.6.3
#> [25] pillar_1.4.3 generics_0.0.2 pkgconfig_2.0.3
Created on 2020-04-22 by the reprex package (v0.3.0)
UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer, even with multithreading enabled, and data.table now edges out dplyr w.r.t. high-water RSS+CACHE memory usage.
dplyr
$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user: 0.526 s
Child sys : 0.033 s
Child wall: 0.455 s
Child high-water RSS : 128952 KiB
Recursive and acc. high-water RSS+CACHE : 118516 KiB
data.table
$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user: 0.510 s
Child sys : 0.056 s
Child wall: 0.464 s
Child high-water RSS : 129032 KiB
Recursive and acc. high-water RSS+CACHE : 118320 KiB
Bottom line: Accurately measuring memory usage from within R is complicated.
I'll leave my original answer below because I think it still has value.
ORIGINAL ANSWER:
Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the latter chunk, but this time turn off multi-threading, the two results are much more comparable:
library(bench)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
set.seed(123)
setDTthreads(1) ## TURN OFF MULTITHREADING
DT = data.table(x = rep(1:10, times = 1e5),
y = sample(LETTERS[1:10], 10e5, replace = TRUE),
z = rnorm(1e6))
DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372
bench::bench_process_memory()
#> current max
#> 589MB 612MB
Created on 2020-04-22 by the reprex package (v0.3.0)
Still, I'm surprised that they're this close. The data.table memory performance actually gets comparatively worse if I try with a larger data set, despite using a single thread, which makes me suspicious that I'm still not measuring memory usage correctly...

Getting different results when using the sparklyr and dplyr

I have just started learning the sparklyr package using the sparklyr reference, and I did what was written in the documentation.
When I use the following code:
delay <- flights_tbl %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
collect
Warning messages:
1: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
2: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
> delay
# A tibble: 2,961 x 4
tailnum count dist delay
<chr> <dbl> <dbl> <dbl>
1 N14228 111 1547 3.71
2 N24211 130 1330 7.70
3 N668DN 49.0 1028 2.62
4 N39463 107 1588 2.16
5 N516JB 288 1249 12.0
6 N829AS 230 228 17.2
7 N3ALAA 63.0 1078 3.59
8 N793JB 283 1529 4.72
9 N657JB 285 1286 5.03
10 N53441 102 1661 0.941
# ... with 2,951 more rows
In a similar way, I want to apply the same operations to the nycflights13::flights dataset using the dplyr package:
nycflights13::flights %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay))
# A tibble: 1,319 x 4
tailnum count dist delay
<chr> <int> <dbl> <dbl>
1 N102UW 48 536 2.94
2 N103US 46 535 - 6.93
3 N105UW 45 525 - 0.267
4 N107US 41 529 - 5.73
5 N108UW 60 534 - 1.25
6 N109UW 48 536 - 2.52
7 N110UW 40 535 2.80
8 N111US 30 536 - 0.467
9 N11206 111 1414 12.7
10 N112US 38 535 - 0.947
# ... with 1,309 more rows
My question is: why am I getting different results?
As mentioned in the documentation, sparklyr provides a complete dplyr backend.
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 dplyr_0.7.4 sparklyr_0.7.0
loaded via a namespace (and not attached):
[1] DBI_0.7 readr_1.1.1 withr_2.1.1
[4] nycflights13_0.2.2 rprojroot_1.3-2 lattice_0.20-35
[7] foreign_0.8-69 pkgconfig_2.0.1 config_0.2
[10] utf8_1.1.3 compiler_3.4.0 stringr_1.3.0
[13] parallel_3.4.0 xtable_1.8-2 Rcpp_0.12.15
[16] cli_1.0.0 shiny_1.0.5 plyr_1.8.4
[19] httr_1.3.1 tools_3.4.0 openssl_1.0
[22] nlme_3.1-131.1 broom_0.4.3 R6_2.2.2
[25] dbplyr_1.2.1 bindr_0.1 purrr_0.2.4
[28] assertthat_0.2.0 curl_3.1 digest_0.6.15
[31] mime_0.5 stringi_1.1.6 rstudioapi_0.7
[34] reshape2_1.4.3 hms_0.4.1 backports_1.1.2
[37] htmltools_0.3.6 grid_3.4.0 glue_1.2.0
[40] httpuv_1.3.5 rlang_0.2.0 psych_1.7.8
[43] magrittr_1.5 rappdirs_0.3.1 lazyeval_0.2.1
[46] yaml_2.1.16 crayon_1.3.4 tidyr_0.8.0
[49] pillar_1.1.0 base64enc_0.1-3 mnormt_1.5-5
[52] jsonlite_1.5 tibble_1.4.2 Lahman_6.0-0
The key difference is that the plain dplyr code does not use na.rm = TRUE in mean(). Any tailnum group with an NA in 'distance' or 'arr_delay' therefore gets an NA mean and is dropped by filter(!is.na(delay)), which is why fewer rows come back. In sparklyr the NA values are already removed (as the warning says, missing values are always removed in SQL), so the argument is not needed.
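The effect of a single NA on a group mean can be seen on a toy example. This is a base R sketch with made-up values, not the flights data:

```r
# Hypothetical delays for two aircraft; aircraft "A" has one missing value
delay   <- c(5, NA, 10, 20)
tailnum <- c("A", "A", "B", "B")

# Without na.rm, the whole group mean for "A" becomes NA
plain <- tapply(delay, tailnum, mean)
plain       # A: NA, B: 15

# With na.rm = TRUE, the NA is dropped before averaging
na_removed <- tapply(delay, tailnum, mean, na.rm = TRUE)
na_removed  # A: 5, B: 15
```

A later filter on !is.na() would discard group "A" in the first case but keep it in the second, which mirrors the difference in row counts.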
We can check the NA elements in 'distance' and 'arr_delay'
nycflights13::flights %>%
summarise_at(vars(distance, arr_delay), funs(sum(is.na(.))))
# A tibble: 1 x 2
# distance arr_delay
# <int> <int>
#1 0 9430 #### number of NAs
So, if we correct for that, then the output will be the same
res <- nycflights13::flights %>%
group_by(tailnum) %>%
summarise(count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
arrange(tailnum)
res
# A tibble: 2,961 x 4
# tailnum count dist delay
# <chr> <int> <dbl> <dbl>
# 1 N0EGMQ 371 676 9.98
# 2 N10156 153 758 12.7
# 3 N102UW 48 536 2.94
# 4 N103US 46 535 - 6.93
# 5 N104UW 47 535 1.80
# 6 N10575 289 520 20.7
# 7 N105UW 45 525 - 0.267
# 8 N107US 41 529 - 5.73
# 9 N108UW 60 534 - 1.25
#10 N109UW 48 536 - 2.52
# ... with 2,951 more rows
Using sparklyr
library(sparklyr)
library(dplyr)
library(nycflights13)
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
delay <- flights_tbl %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
arrange(tailnum) %>%
collect
delay
# A tibble: 2,961 x 4
# tailnum count dist delay
# <chr> <dbl> <dbl> <dbl>
# 1 N0EGMQ 371 676 9.98
# 2 N10156 153 758 12.7
# 3 N102UW 48.0 536 2.94
# 4 N103US 46.0 535 - 6.93
# 5 N104UW 47.0 535 1.80
# 6 N10575 289 520 20.7
# 7 N105UW 45.0 525 - 0.267
# 8 N107US 41.0 529 - 5.73
# 9 N108UW 60.0 534 - 1.25
#10 N109UW 48.0 536 - 2.52
# ... with 2,951 more rows

I am unable to change a column into Date format in R

This is the structure (str) of my data frame:
'data.frame': 10652 obs. of 4 variables:
$ Date: chr "06-15-2017" "06-15-2017" "06-15-2017" "06-15-2017" ...
$ Time: Factor w/ 951 levels "00:00:01","00:00:02",..: 396 398 400 402 404 406 407 409 411 413 ...
$ CPU : num 2.4 2.4 2.3 2.3 2.2 2.2 2.1 2.1 2.1 2.1 ...
$ MEM : num 2.5 2.5 2.5 2.6 2.6 2.6 2.6 2.7 2.9 2.9 ...
I want to make R read the date and time column in Date and Time format.
I have tried:
DateData$Date_Time = within(DateData, { timestamp=format(as.POSIXct(paste(DateData$Date, DateData$Time)), "%d/%m/%Y %H:%M:%S") })
I have tried this after merging the date and time column-
DateData$Date_Time = as.chron(DateData$Date_Time, "%d/%m/%Y %H:%M:%S")
DateData = within(DateData, { timestamp=strptime(paste((DateData$Date, DateData$Time), "%Y/%m/%d%H:%M:%S") })
And this: DateData$DateTime = strptime(DateData$DateTime,"%m-%d-%Y %H:%M:%S")
Nothing seems to work for me.
Dealing with conversion after importing data
This is a sample of your data
df <- data.frame(Date = c("06-15-2017","06-15-2017","06-15-2017","06-15-2017"), Time = c("00:00:01", "00:00:02", "00:00:03", "00:00:04"), stringsAsFactors = F)
For a Date object, you can use base R, lubridate, or the anytime package.
df$Date_base <- as.Date(df$Date, format = "%m-%d-%Y") # %Y = 4-digit year; %y would read only "20" and give 2020
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df$Date_lubridate <- mdy(df$Date)
library(anytime)
df$Date_anytime <- anytime(df$Date)
For time-of-day values only (not date-times), you can use the hms package,
or Period objects from the lubridate package via lubridate::hms.
library(hms)
#>
#> Attaching package: 'hms'
#> The following object is masked from 'package:lubridate':
#>
#> hms
df$Time_hms <- as_hms(df$Time) # as_hms() replaces the deprecated as.hms()
df$Time_lubridate <- lubridate::hms(df$Time) # hms in lubridate is masked by hms package
Here is what the results look like:
df
#> Date Time Date_base Date_lubridate Date_anytime Time_hms
#> 1 06-15-2017 00:00:01 2017-06-15 2017-06-15 2017-06-15 00:00:01
#> 2 06-15-2017 00:00:02 2017-06-15 2017-06-15 2017-06-15 00:00:02
#> 3 06-15-2017 00:00:03 2017-06-15 2017-06-15 2017-06-15 00:00:03
#> 4 06-15-2017 00:00:04 2017-06-15 2017-06-15 2017-06-15 00:00:04
#> Time_lubridate
#> 1 1S
#> 2 2S
#> 3 3S
#> 4 4S
Class of each column and summary of df:
sapply(df, class)
#> $Date
#> [1] "character"
#>
#> $Time
#> [1] "character"
#>
#> $Date_base
#> [1] "Date"
#>
#> $Date_lubridate
#> [1] "Date"
#>
#> $Date_anytime
#> [1] "POSIXct" "POSIXt"
#>
#> $Time_hms
#> [1] "hms" "difftime"
#>
#> $Time_lubridate
#> [1] "Period"
#> attr(,"package")
#> [1] "lubridate"
summary(df)
#> Date Time Date_base
#> Length:4 Length:4 Min. :2017-06-15
#> Class :character Class :character 1st Qu.:2017-06-15
#> Mode :character Mode :character Median :2017-06-15
#> Mean :2017-06-15
#> 3rd Qu.:2017-06-15
#> Max. :2017-06-15
#> Date_lubridate Date_anytime Time_hms
#> Min. :2017-06-15 Min. :2017-06-15 Length:4
#> 1st Qu.:2017-06-15 1st Qu.:2017-06-15 Class1:hms
#> Median :2017-06-15 Median :2017-06-15 Class2:difftime
#> Mean :2017-06-15 Mean :2017-06-15 Mode :numeric
#> 3rd Qu.:2017-06-15 3rd Qu.:2017-06-15
#> Max. :2017-06-15 Max. :2017-06-15
#> Time_lubridate
#> Min. :1S
#> 1st Qu.:1.75S
#> Median :2.5S
#> Mean :2.5S
#> 3rd Qu.:3.25S
#> Max. :4S
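The format string passed to as.Date has to match the four-digit year in the data. A minimal base R sketch of the difference between %y (two-digit year) and %Y (four-digit year), using the same sample value; the variable names are made up for illustration:

```r
# Sample value in month-day-year order with a four-digit year
x <- "06-15-2017"

# %y consumes only two digits ("20"); the trailing "17" is silently ignored
two_digit <- as.Date(x, format = "%m-%d-%y")

# %Y consumes all four digits
four_digit <- as.Date(x, format = "%m-%d-%Y")

format(two_digit)   # "2020-06-15"
format(four_digit)  # "2017-06-15"
```

No warning is raised in the %y case, so checking a few parsed values against the raw strings is a cheap safeguard.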
Dealing with conversion directly when reading
You can handle type conversion directly when you read the file by using the readr package.
library(readr)
read_csv('Date, Time
06-15-2017, 00:00:01
06-15-2017, 00:00:02
06-15-2017, 00:00:03
06-15-2017, 00:00:04
', col_types = cols(Date = col_date(format = "%m-%d-%Y"),
Time = col_time()))
#> # A tibble: 4 x 2
#> Date Time
#> <date> <time>
#> 1 2017-06-15 00:00:01
#> 2 2017-06-15 00:00:02
#> 3 2017-06-15 00:00:03
#> 4 2017-06-15 00:00:04
With readr, the data is imported directly into a data frame (a tibble, the tidyverse's data frame format) with the columns already typed as date and time. You can find some information here
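If readr is not available, a roughly comparable result can be sketched in base R by reading every column as character and converting afterwards. This is an alternative to the readr call above, not part of it; csv_text and raw are hypothetical names, and the Time column is simply left as character here:

```r
# Inline CSV text standing in for a file on disk (hypothetical sample)
csv_text <- "Date,Time
06-15-2017,00:00:01
06-15-2017,00:00:02"

# Read all columns as character so nothing is turned into a factor
raw <- read.csv(text = csv_text, colClasses = "character")

# Convert the date column explicitly; the format must match the data
raw$Date <- as.Date(raw$Date, format = "%m-%d-%Y")

str(raw)
```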
You used date-time formats that don't match your data at multiple places.
If you paste the Date and Time columns together with a space separator, the format to parse is %m-%d-%Y %H:%M:%S.
That is, to combine the two columns and parse as date-time:
DateData$DateTime <- strptime(paste(DateData$Date, DateData$Time, sep=' '), '%m-%d-%Y %H:%M:%S')
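A small self-contained sketch of the same idea on made-up data shaped like the question's (Date as character, Time as a factor). It swaps strptime for as.POSIXct, which takes the same format string but returns a POSIXct vector that is easier to store as a data frame column; tz = "UTC" is an assumption, not something stated in the question:

```r
# Hypothetical sample mirroring the question: character Date, factor Time
DateData <- data.frame(
  Date = c("06-15-2017", "06-15-2017"),
  Time = factor(c("00:00:01", "13:45:30")),
  stringsAsFactors = FALSE
)

# paste() uses the factor labels, so no explicit as.character() is needed
stamp <- paste(DateData$Date, DateData$Time)

# Parse with a format that matches "month-day-year hour:minute:second"
DateData$DateTime <- as.POSIXct(stamp, format = "%m-%d-%Y %H:%M:%S", tz = "UTC")

format(DateData$DateTime)  # "2017-06-15 00:00:01" "2017-06-15 13:45:30"
class(DateData$DateTime)   # "POSIXct" "POSIXt"
```

Any row whose text does not match the format comes back as NA, so anyNA(DateData$DateTime) is a quick way to spot a mismatched format.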
Install and load the lubridate package:
install.packages("lubridate")
library(lubridate)
Paste the Date and Time columns together:
DFanalysis$DateStamp <- paste(DFanalysis$Date, DFanalysis$Time, sep = " ")
Check the class of DateStamp:
class(DFanalysis$DateStamp)
If the class is character, it can be converted directly:
DFanalysis$DateStamp <- mdy_hms(DFanalysis$DateStamp)

Resources