How to calculate a decimal month in R in a particular year?

If I have a date, say "2014-05-13" and I want to calculate the month in decimal, I would do this:
5 + 13/31 = 5.419355
How would it be possible in R to take a vector of dates and turn it into a "month decimal" vector?
For example:
dates = c("2010-01-24", "2013-04-08", "2014-03-05", "2013-03-08", "2014-02-14",
          "2004-01-28", "2006-02-21", "2013-03-28", "2013-04-01", "2006-02-14",
          "2006-01-28", "2014-01-19", "2012-03-12", "2014-01-30", "2005-04-17")
library(lubridate)
month(dates) + day(dates)/31
As you can see, it would be wrong to use 31 as the divisor, since the number of days differs by month and, for February, by year (leap years).
So what would be the best solution?

You can use the monthDays function from the Hmisc package:
> require(Hmisc)
> library(lubridate)
> month(dates) + day(dates)/monthDays(dates)
[1] 1.774194 4.266667 3.161290 3.258065 2.500000 1.903226 2.750000 3.903226 4.033333
[10] 2.500000 1.903226 1.612903 3.387097 1.967742 4.566667

With magrittr,
library(magrittr)
library(lubridate)
dates %>% ymd() %>% { month(.) + day(.) / days_in_month(.) }
## Jan Apr Mar Mar Feb Jan Feb Mar Apr Feb Jan
## 1.774194 4.266667 3.161290 3.258065 2.500000 1.903226 2.750000 3.903226 4.033333 2.500000 1.903226
## Jan Mar Jan Apr
## 1.612903 3.387097 1.967742 4.566667
The vector is named because days_in_month() returns a named vector, so add %>% unname() if you want to drop the names.

Here is a base R hack that uses a trick I've seen on SO to get the first day of the next month and subtract 1 to return the last day of the month of interest.
# format dates to Date class
dates <- as.Date(dates)
# get the next month
nextMonths <- as.integer(substr(dates, 6, 7)) + 1L
# replace next month with 1 if it is equal to 13
nextMonths[nextMonths == 13] <- 1L
# extract the number of days using date formatting (%d), paste, and subtraction
dayCount <- as.integer(format(as.Date(paste(substr(dates, 1, 4),
nextMonths, "01", sep="-"))-1L, format="%d"))
dayCount
[1] 31 30 31 31 28 31 28 31 30 28 31 31 31 31 30
# get month with fraction using date formatting (%m)
as.integer(format(dates, format="%m")) + (as.integer(format(dates, format="%d")) / dayCount)
[1] 1.774194 4.266667 3.161290 3.258065 2.500000 1.903226 2.750000 3.903226 4.033333 2.500000
[11] 1.903226 1.612903 3.387097 1.967742 4.566667
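A more compact base R variant is sketched below. It is not taken from the answers above: decimal_month is a hypothetical helper name, and the sketch relies on the fact that adding 31 days to the first of any month always lands in the following month.

```r
# Hypothetical helper (not from the answers above): month + day / days-in-month
decimal_month <- function(x) {
  d <- as.Date(x)
  # first day of the month containing each date
  first <- as.Date(format(d, "%Y-%m-01"))
  # first + 31 always lands in the following month, so truncating it
  # to day 1 gives the first day of the next month
  next_first <- as.Date(format(first + 31, "%Y-%m-01"))
  days_in_month <- as.integer(next_first - first)
  as.integer(format(d, "%m")) + as.integer(format(d, "%d")) / days_in_month
}

decimal_month(c("2014-05-13", "2014-02-14", "2012-02-14"))
```

For "2014-05-13" this reproduces the 5 + 13/31 from the question, and the two February dates differ because 2012 is a leap year (14/28 versus 14/29).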

Related

Date range without year in R

I have data from several years and each record has a date value (YYYY-MM-DD). I want to label each record with the season that it fell into. For example, I want to take all the records from December 15 to March 15, across all years, and put "Winter" in a season column. Is there a way in R to specify a sequence of dates using just the month and date, regardless of year?
Lubridate quarter command doesn't work because I have custom dates to define the seasons and the seasons are not all of equal length, and I can't just do month(datevalue) %in% c(12,1,2,3) because I need to split the months in half (i.e. March 15 is winter and March 16 is spring).
I could manually enter in the date range for each year in my dataset (e.g. Dec 15 2015 to March 15 2015 or Dec 15 2016 to Mar 15 2016, etc...), but is there a better way?
You can extract the month and day from the date column and use case_when to assign Season based on those two values.
library(dplyr)
library(lubridate)
df %>%
  mutate(day = day(Date),
         month = month(Date),
         Season = case_when(
           # 15 December to 15 March is Winter
           (month == 12 & day >= 15) |
             month %in% 1:2 |
             (month == 3 & day <= 15) ~ "Winter"
           # add conditions for the other seasons here
         ))
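The same year-independent test can be sketched in base R by collapsing month and day into a single integer. This is only an illustration: is_winter is a hypothetical helper, and only the winter boundaries given in the question are encoded.

```r
# Hypothetical helper: flag winter (15 Dec to 15 Mar, inclusive) regardless of year
is_winter <- function(x) {
  # month and day as one integer, e.g. 315 for 15 March, 1215 for 15 December
  md <- as.integer(format(as.Date(x), "%m%d"))
  md >= 1215 | md <= 315
}

x <- as.Date(c("2015-12-14", "2015-12-15", "2016-03-15", "2016-03-16"))
ifelse(is_winter(x), "Winter", "Other")
```

The remaining seasons would need their own cut points, which the question does not specify.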
We assume that when the question says that winter is "Dec 15 2015 to March 15 2015 or Dec 15 2016 to Mar 15 2016", what is really meant is that winter is Dec 16, 2015 to Mar 15, 2016 or Dec 16, 2016 to Mar 15, 2017.
Also it is not clear what the precise output is supposed to be but in each case below we provide a second argument which takes a vector giving the season names or numbers. The default is that winter is reported as 1, spring is 2, summer is 3 and fall is 4 but you could pass a second argument of c("Winter", "Spring", "Summer", "Fall") instead or use other names if you wish.
1) yearmon/yearqtr Convert to Date class and subtract 15. Then convert that to yearmon class which represents dates internally as year + fraction where fraction = 0 for January, 1/12 for February, ..., 11/12 for December. Add 1/12 to get to the next month. Convert that to yearqtr class which represents dates as year + fraction where fraction is 0, 1/4, 2/4 or 3/4 for the 4 quarters and take cycle of that which gives the quarter number (1, 2, 3 or 4).
If we knew that the input x was a Date vector as opposed to a character vector then we could simplify this by replacing as.Date(x) with x in season.
library(zoo)
season <- function(x, s = 1:4)
s[cycle(as.yearqtr(as.yearmon(as.Date(x) - 15) + 1/12))]
# test
d <- c(as.Date("2020-12-15") + 0:1, as.Date("2021-03-15") + 0:1)
season(d)
## [1] 4 1 1 2
season(d, c("Winter", "Spring", "Summer", "Fall"))
## [1] "Fall" "Winter" "Winter" "Spring"
2) base The above could be translated to base R using POSIXlt. Subtract 15 as before and then add 1 to the month to get to the next month. Finally extract the month and integer-divide it by 3 to get the quarter number.
season.lt <- function(x, s = 1:4) {
  lt <- as.POSIXlt(as.Date(x) - 15)
  lt$mon <- lt$mon + 1
  s[as.POSIXlt(format(lt))$mon %/% 3 + 1]
}
# test - d defined in (1)
season.lt(d)
## [1] 4 1 1 2
3) lubridate We can follow the same logic in lubridate like this:
season.lub <- function(x, s = 1:4)
s[(month((as.Date(x) - 15) %m+% months(1)) - 1) %/% 3 + 1]
# test - d defined in (1)
season.lub(d)
## [1] 4 1 1 2

Second to last Wednesday of month in R

In R, how can I produce a list of dates of all 2nd to last Wednesdays of the month in a specified date range? I've tried a few things but have gotten inconsistent results for months with five Wednesdays.
To generate a regular sequence of dates you can use seq with dates for parameter from and to. See the seq.Date documentation for more options.
Create a data frame with the date, the month and the weekday, then use aggregate to pick the second-to-last Wednesday of each month.
day_sequence = seq(as.Date("2020/1/1"), as.Date("2020/12/31"), "day")
df = data.frame(day = day_sequence,
month = months(day_sequence),
weekday = weekdays(day_sequence))
#Filter only wednesdays
df = df[df$weekday == "Wednesday",]
result = aggregate(day ~ month, df, function(x){head(tail(x,2),1)})
tail(x, 2) returns the last two Wednesdays of the month, and head(..., 1) takes the first of those two.
Result:
month day
1 April 2020-04-22
2 August 2020-08-19
3 December 2020-12-23
4 February 2020-02-19
5 January 2020-01-22
6 July 2020-07-22
7 June 2020-06-17
8 March 2020-03-18
9 May 2020-05-20
10 November 2020-11-18
11 October 2020-10-21
12 September 2020-09-23
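Two things to keep in mind with the approach above: months() and weekdays() return locale-dependent names, and grouping by month name alone would merge the same month from different years if the range spans more than one year. The following base R sketch avoids both by grouping on "%Y-%m" and using the numeric ISO weekday ("%u", where 3 is Wednesday). second_last_wday is a hypothetical helper, and months at the edges of the range are only judged on the days that fall inside the range.

```r
# Hypothetical helper: second-to-last occurrence of a weekday in each year-month
second_last_wday <- function(from, to, wday = 3L) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  # keep only the requested ISO weekday (1 = Monday, ..., 7 = Sunday)
  days <- days[as.integer(format(days, "%u")) == wday]
  ym <- format(days, "%Y-%m")
  # position of the second-to-last match within each year-month, NA if fewer than two
  idx <- vapply(split(seq_along(days), ym),
                function(i) if (length(i) >= 2L) i[length(i) - 1L] else NA_integer_,
                integer(1))
  res <- days[idx]
  names(res) <- names(idx)
  res[!is.na(res)]
}

second_last_wday("2020-01-01", "2020-03-31")
```

For January to March 2020 this should agree with the corresponding rows of the table above.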
There are probably simpler ways of doing this, but the function below does what the question asks for. It returns a named vector of days such that:
They are between from and to.
They fall on weekday day, where 1 is Monday (so 3 is Wednesday).
They are the n-th to last such weekday of the month.
By n-th to last I mean the n-th counting from the end of the month.
whichWeekday <- function(from, to, day, n, format = "%Y-%m-%d"){
from <- as.Date(from, format = format)
to <- as.Date(to, format = format)
day <- as.character(day)
d <- seq(from, to, by = "days")
m <- format(d, "%Y-%m")
f <- c(TRUE, m[-1] != m[-length(m)])
f <- cumsum(f)
wed <- tapply(d, f, function(x){
i <- which(format(x, "%u") == day)
x[ tail(i, n)[1] ]
})
y <- as.Date(wed, origin = "1970-01-01")
setNames(y, format(y, "%Y-%m"))
}
whichWeekday("2019-01-01", "2020-03-31", 3, 2)
# 2019-01 2019-02 2019-03 2019-04 2019-05
#"2019-01-23" "2019-02-20" "2019-03-20" "2019-04-17" "2019-05-22"
# 2019-06 2019-07 2019-08 2019-09 2019-10
#"2019-06-19" "2019-07-24" "2019-08-21" "2019-09-18" "2019-10-23"
# 2019-11 2019-12 2020-01 2020-02 2020-03
#"2019-11-20" "2019-12-18" "2020-01-22" "2020-02-19" "2020-03-18"

Change strings with different format to dates with the same format in a dataframe

I have a dataframe which looks like this (it has thousands of date rows like this, ranging from years 18xx until 2019)
date
1 25 February 1987
2 20 August 1974
3 9 October 1984
4 16-Oct-63
5 13-11-1961
6 03/23/87
7 01.01.1995
8 February 1988
9 1988
10 20050101-20051231
I need to change the date column to one date-format (eg.: YYYY-MM-DD, or any other).
Since some entries contain only a year, as in row 9, I also have to fill in the missing parts. A year alone should always become the last day of that year. If it is a month and a year, as in row 8, it should become the last day of that month (taking leap years into account: 1988 was one, so the result should be 1988-02-29). If it is a timeframe, as in the last row, the first part should be cut off and the result should be the 31st of December of the given year.
How can I do this?
I thought about using the lubridate package or the anytime package. With lubridate and parse_date or parse_date_time. This even works, but it always fills the missing values for days to the first day of a month and not the last.
library(lubridate)
date <- c("25 February 1987", "20 August 1974", "9 October 1984", "16-Oct-63", "13-11-1961", "03/23/87", "01.01.1995",
"February 1988", "1988", "20050101-20051231")
df <- as.data.frame(date)
parse_date(df$date)
parse_date_time(x = df$date,
orders = c("d m y", "d B Y", "d/m/Y","B Y", "Y", "m/d/y",
"Ymd-Ymd"),
locale = "eng")
My actual results
(parse_date(df$date)):
[1] "1987-02-25 UTC" "1974-08-20 UTC" "1984-10-09 UTC" "2019-10-16 UTC" "2019-11-13 UTC" "1987-03-23 UTC" "1995-01-01 UTC"
[8] "1988-02-01 UTC" "1988-01-01 UTC" "2005-12-31 UTC"
For parse_date_time I actually get an error because of the last order, "Ymd-Ymd". (If I just test parse_date("20050101-20051231"), it results in "2005-12-31 UTC", which is what I want.)
Using the lubridate cheat sheet (https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf) and some trial and error with dplyr:
library(dplyr)
library(lubridate)
df %>%
  mutate(newdate = parse_date_time(x = date, orders = c("dmy", "mdy", "my", "y"))) %>%
  mutate(newdate2 = case_when(
    # two-digit years parsed into the future: shift back by roughly a century
    newdate > today() ~ newdate - 100*365.25*24*3600,
    # unparsed ranges such as 20050101-20051231: rebuild a date from the first 8 digits
    is.na(newdate) ~ paste0(substr(x = date, start = 1, stop = 4), "-",
                            substr(x = date, start = 5, stop = 6), "-",
                            substr(x = date, start = 7, stop = 8)) %>%
      parse_date_time(., orders = c("dmy", "mdy", "my", "y", "ymd")),
    TRUE ~ newdate
  ))
Thank you. This is very close. Unfortunately, it still gives me the wrong output dates for some entries.
date newdate newdate2
1 25 February 1987 1987-02-25 1987-02-25
2 20 August 1974 1974-08-20 1974-08-20
3 9 October 1984 1984-10-09 1984-10-09
4 16-Oct-63 2063-10-16 1963-10-16
5 13-11-1961 1961-11-13 1961-11-13
6 03/23/87 1987-03-23 1987-03-23
7 01.01.1995 1995-01-01 1995-01-01
8 February 1988 1988-02-19 1988-02-19
9 1988 1988-01-01 1988-01-01
10 20050101-20051231 <NA> 2005-01-01
But I need it like this:
date newdate newdate2
1 25 February 1987 1987-02-25 1987-02-25
2 20 August 1974 1974-08-20 1974-08-20
3 9 October 1984 1984-10-09 1984-10-09
4 16-Oct-63 2063-10-16 1963-10-16
5 13-11-1961 1961-11-13 1961-11-13
6 03/23/87 1987-03-23 1987-03-23
7 01.01.1995 1995-01-01 1995-01-01
8 February 1988 1988-02-19 **1988-02-29**
9 1988 1988-01-01 **1988-12-31**
10 20050101-20051231 <NA> **2005-12-31**
This means: if I only have a year and a month, I need the last day of that month, taking leap years into account for February, as in row 8. If I only have a year, I need the 31st of December of that year. And if the entry looks like row 10, I need to cut off the first part and keep just the 31st of December of the given year; for this case I have already adjusted that part of your code:
is.na(newdate) ~ paste0(substr(x=date, start = 10, stop = 13), "-",
substr(x=date, start = 14, stop = 15), "-",
substr(x=date, start = 16, stop = 17) )
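One possible way to get the three "end of period" cases is to detect them by pattern before any general parsing. The base R sketch below is an assumption-laden illustration rather than part of the answer: end_of_period is a hypothetical helper, it assumes English month names, and it returns NA for every other format so that those can still go through parse_date_time.

```r
# Hypothetical helper: handles only the three partial formats from the question;
# everything else is returned as NA so it can be filled by another parser.
end_of_period <- function(x) {
  x <- trimws(x)
  out <- rep(as.Date(NA), length(x))

  # "1988" -> last day of that year
  yr <- grepl("^[0-9]{4}$", x)
  out[yr] <- as.Date(sprintf("%s-12-31", x[yr]), format = "%Y-%m-%d")

  # "20050101-20051231" -> the second date of the range
  rng <- grepl("^[0-9]{8}-[0-9]{8}$", x)
  out[rng] <- as.Date(substr(x[rng], 10, 17), format = "%Y%m%d")

  # "February 1988" -> last day of that month (English month names assumed)
  my <- grepl("^[A-Za-z]+ [0-9]{4}$", x)
  mon <- match(sub(" .*$", "", x[my]), month.name)
  first <- as.Date(sprintf("%s-%02d-01", sub("^.* ", "", x[my]), mon),
                   format = "%Y-%m-%d")
  # first + 31 always falls in the next month; step back from its first day
  out[my] <- as.Date(format(first + 31, "%Y-%m-01"), format = "%Y-%m-%d") - 1

  out
}

end_of_period(c("February 1988", "1988", "20050101-20051231", "25 February 1987"))
```

The result could then be combined with the parsed column, for example by using it wherever it is not NA and falling back to newdate2 otherwise.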

R: assign months to day of the year

Here's my data, which has 10 years in one column and the day of the year (1 to 365) in a second column:
dat <- data.frame(year = rep(1980:1989, each = 365), doy= rep(1:365, times = 10))
I am assuming all years are non-leap years i.e. they have 365 days.
I want to create another column month which is basically month of the year the day belongs to.
library(dplyr)
dat %>%
mutate(month = as.integer(ceiling(doy/31)))
However, this solution is wrong since it assigns the wrong months to some days. I am looking for a dplyr solution, if possible.
We can convert it to a datetime class by using the appropriate format (i.e. %Y %j) and then extract the month with format
dat$month <- with(dat, format(strptime(paste(year, doy), format = "%Y %j"), '%m'))
Or use $mon to extract the month and add 1
dat$month <- with(dat, strptime(paste(year, doy), format = "%Y %j")$mon + 1)
tail(dat$month)
#[1] 12 12 12 12 12 12
This should give you an integer value for the months (month() comes from lubridate):
dat$month.num <- month(as.Date(paste(dat$year, dat$doy), '%Y %j'))
If you want the month names:
dat$month.names <- month.name[month(as.Date(paste(dat$year, dat$doy), '%Y %j'))]
The result (only showing a few rows):
> dat[29:33,]
year doy month.num month.names
29 1980 29 1 January
30 1980 30 1 January
31 1980 31 1 January
32 1980 32 2 February
33 1980 33 2 February
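Note that the conversions above interpret doy within the actual calendar year, so in leap years such as 1980, 1984 and 1988 day 60 is 29 February rather than 1 March. If the question's assumption that every year has 365 days has to hold strictly, a lookup against cumulative month lengths is one option. This is a sketch only, and doy_to_month is a hypothetical helper name.

```r
# Hypothetical helper: month number for a day of year, always assuming a 365-day year
doy_to_month <- function(doy) {
  # last day of year belonging to each month: 31, 59, 90, ..., 365
  month_ends <- cumsum(c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31))
  # left.open = TRUE counts the month ends strictly below doy
  findInterval(doy, month_ends, left.open = TRUE) + 1L
}

doy_to_month(c(1, 31, 32, 59, 60, 365))
```

It works on a plain vector, so it can be used directly inside mutate(month = doy_to_month(doy)).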

Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
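The same per-day normalisation can also be sketched in base R. The toy data below is modelled on the three example tweets from the question, with the when column assumed to have already been converted to Date; the column name hit is made up for the illustration.

```r
# Toy data modelled on the three example tweets in the question
tweets <- data.frame(
  when = as.Date(c("2016-02-20", "2016-02-20", "2016-03-01")),
  what = c("Great goal by #lfc", "Can't wait for the weekend", "Generic boring tweet"),
  stringsAsFactors = FALSE
)
# TRUE where the tweet contains the hashtag (fixed = TRUE: match literally, not as a regex)
tweets$hit <- grepl("#lfc", tweets$what, fixed = TRUE)
# mean of a logical vector = share of matching tweets per day
daily <- aggregate(hit ~ when, data = tweets, FUN = mean)
daily
```

Because when is a Date, daily can be passed straight to plot(daily$when, daily$hit, type = "l") or to ggplot2 with a date axis.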
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
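A short base R illustration of why this solves the ordering problem from the question (the two dates are simply the 7 and 13 February examples mentioned there):

```r
d <- as.Date(c("2016-02-07", "2016-02-13"))
as.numeric(d)        # whole days since 1970-01-01
diff(as.numeric(d))  # 6 days apart
d[1] < d[2]          # TRUE: 7 February sorts before 13 February
2.7 < 2.13           # FALSE: the "month.day" encoding puts 7 February after 13 February
```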
