Dividing data based on custom date ranges in R

I have a time series which spans multiple years and want to divide it into four categories based on date (i.e., 15 April - 10 May, 11 May - 10 July, and so on). My first thought was to use lubridate to define each time period with interval() and then use %within% to determine whether an event occurs within it or not.
df
id datetime
1 HAR10 2019-06-26 04:35:06
2 HAR05 2019-08-05 19:15:00
3 HAR07 2018-07-26 01:01:00
4 HAR07 2018-07-24 23:36:00
5 HAR05 2019-08-27 18:59:43
6 HAR05 2019-07-12 03:33:00
7 HAR07 2018-08-09 16:21:00
8 HAR07 2019-05-01 00:04:28
9 HAR04 2019-07-01 05:25:00
10 HAR07 2018-07-18 15:17:00
perA <- interval(ymd(20190511), ymd(20190710))
df$datetime %within% perA
I immediately ran into a problem with year: I want to get all events from, say, April - May regardless of the year they occurred in, but an interval is year-specific, so the interval defined above works for my 2019 data but not my 2018 data. I could define a new set of intervals for each year, but that seems very messy.
Another problem is that %within% returns a vector of TRUE and FALSE, which is not what I need: I need to assign each event to a category based on which time range it falls within.
My second thought was to use filter(), but I don't think that solves either of my problems. Any help is appreciated!

You can easily extract the month, day, or even hour and set them to the same year across dates. I made up some groups. This is a dplyr solution, but you should be able to convert it to base R easily if you prefer.
library(dplyr)
library(lubridate)
df %>%
  mutate(noyeardate = as.Date(paste(2000, month(datetime), day(datetime), sep = "-")),
         interval = case_when(noyeardate %within% interval(ymd(20000101), ymd(20000331)) ~ "Group 1",
                              noyeardate %within% interval(ymd(20000401), ymd(20000630)) ~ "Group 2",
                              noyeardate %within% interval(ymd(20000701), ymd(20000930)) ~ "Group 3",
                              noyeardate %within% interval(ymd(20001001), ymd(20001231)) ~ "Group 4"))
id datetime noyeardate interval
1 HAR10 2018-07-18 15:17:00 2000-07-18 Group 3
2 HAR05 2018-07-24 23:36:00 2000-07-24 Group 3
3 HAR07 2018-07-26 01:01:00 2000-07-26 Group 3
4 HAR07 2018-08-09 16:21:00 2000-08-09 Group 3
5 HAR05 2019-05-01 00:04:28 2000-05-01 Group 2
6 HAR05 2019-06-26 04:35:06 2000-06-26 Group 2
7 HAR07 2019-07-01 05:25:00 2000-07-01 Group 3
8 HAR07 2019-07-12 03:33:00 2000-07-12 Group 3
9 HAR04 2019-08-05 19:15:00 2000-08-05 Group 3
10 HAR07 2019-08-27 18:59:43 2000-08-27 Group 3
Data:
df <- data.frame(id = c("HAR10", "HAR05", "HAR07", "HAR07", "HAR05", "HAR05", "HAR07", "HAR07", "HAR04", "HAR07"),
                 datetime = as.POSIXct(c("2018-07-18 15:17:00", "2018-07-24 23:36:00",
                                         "2018-07-26 01:01:00", "2018-08-09 16:21:00", "2019-05-01 00:04:28",
                                         "2019-06-26 04:35:06", "2019-07-01 05:25:00", "2019-07-12 03:33:00",
                                         "2019-08-05 19:15:00", "2019-08-27 18:59:43")))

Related

Sum time across different continuous time events across date and time combinations in R

I am having trouble figuring out how to account for and sum continuous time observations across multiple dates and time events in my dataset. A similar question is found here, but it only accounts for one instance of a continuous time event. I have a dataset with multiple date and time combinations. Here is an example from that dataset, which I am manipulating in R:
date.1 <- c("2021-07-21", "2021-07-21", "2021-07-21", "2021-07-29", "2021-07-29", "2021-07-30", "2021-08-01","2021-08-01","2021-08-01")
time.1 <- c("15:57:59", "15:58:00", "15:58:01", "15:46:10", "15:46:13", "18:12:10", "18:12:10","18:12:11","18:12:13")
df <- data.frame(date.1, time.1)
df
date.1 time.1
1 2021-07-21 15:57:59
2 2021-07-21 15:58:00
3 2021-07-21 15:58:01
4 2021-07-29 15:46:10
5 2021-07-29 15:46:13
6 2021-07-30 18:12:10
7 2021-08-01 18:12:10
8 2021-08-01 18:12:11
9 2021-08-01 18:12:13
I tried the following script from the question linked above:
df$missingflag <- c(1, diff(as.POSIXct(df$time.1, format="%H:%M:%S", tz="UTC"))) > 1
df
date.1 time.1 missingflag
1 2021-07-21 15:57:59 FALSE
2 2021-07-21 15:58:00 TRUE
3 2021-07-21 15:58:01 FALSE
4 2021-07-29 15:46:10 FALSE
5 2021-07-29 15:46:13 TRUE
6 2021-07-30 18:12:10 TRUE
7 2021-08-01 18:12:10 FALSE
8 2021-08-01 18:12:11 FALSE
9 2021-08-01 18:12:13 TRUE
But it did not work as anticipated and did not get me closer to my answer. It would only have been an intermediate step and probably wouldn't answer my question anyway.
The GOAL is to account for all the continuous time observations and put them into a new table like this:
date.1 time.1 secs
1 2021-07-21 15:57:59 3
4 2021-07-29 15:46:10 1
5 2021-07-29 15:46:13 1
6 2021-07-30 18:12:10 1
7 2021-08-01 18:12:10 2
9 2021-08-01 18:12:13 1
You will see that the start time of each continuous time observation is recorded, along with the total number of seconds (secs) observed since the start of that continuous observation. The script would need to account for date.1, as there are multiple dates in the dataset.
Thank you in advance.
You can create a datetime object by combining the date and time columns, take the difference of consecutive values, and create groups in which all readings one second apart belong together. For each group, count the number of rows and take its first datetime value.
library(dplyr)
library(tidyr)
df %>%
  unite(datetime, date.1, time.1, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  group_by(grp = cumsum(difftime(datetime,
                                 lag(datetime, default = first(datetime)), units = 'secs') > 1)) %>%
  summarise(datetime = first(datetime),
            secs = n(), .groups = 'drop') %>%
  select(-grp)
# datetime secs
# <dttm> <int>
#1 2021-07-21 15:57:59 3
#2 2021-07-29 15:46:10 1
#3 2021-07-29 15:46:13 1
#4 2021-07-30 18:12:10 1
#5 2021-08-01 18:12:10 2
#6 2021-08-01 18:12:13 1
I have kept datetime as a single combined column here, but if needed you can separate it again into two columns using
%>% separate(datetime, c('date', 'time'), sep = ' ')
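If you prefer base R, a rough equivalent of the same idea (my sketch, not part of the original answer; it assumes the df from the question):
dt <- as.POSIXct(paste(df$date.1, df$time.1), tz = "UTC")
# a new group starts whenever the gap to the previous reading exceeds 1 second
grp <- cumsum(c(TRUE, diff(as.numeric(dt)) > 1))
res <- data.frame(datetime = dt[!duplicated(grp)], # first timestamp of each run
                  secs = as.vector(table(grp)))    # number of rows in each run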

How to make Monthly Predictions in R Facebook Prophet, Data is also Monthly

The data (df3) looks like this. A day of "1" was appended to each date just to satisfy the date format requirement.
   ds                   y
1  2015-01-01   -390217.2
2  2015-02-01    230944.1
3  2015-03-01    367259.7
4  2015-04-01    567962.8
5  2015-05-01    753175.6
6  2015-06-01   -907767.5
7  2015-07-01 -52225619.9
8  2015-08-01    631666.1
9  2015-09-01   -792896.8
10 2015-10-01    430847.6
11 2015-11-01   5159146.7
12 2015-12-01  -2087233.7
Code I have tried:
try <- prophet(df3, seasonality.mode = 'multiplicative')
future <- make_future_dataframe(try, periods = 1)
forecast <- predict(try, future)
tail(forecast)
Result I am getting:
   ds                yhat
50 2019-02-01  -9536258.7
51 2019-03-01   -456995.5
52 2019-04-01  -1734330.0
53 2019-05-01  -3428825.1
54 2019-06-01  -2612847.0
55 2019-06-02  -2918161.2
The question is how to predict the July 2019 value instead of 2 June 2019?
future = prophet.make_future_dataframe(periods=12, freq='MS')
forecast = prophet.predict(future)
fig = prophet.plot(forecast)
fig.show()
MS stands for Month Start; the default daily frequency is what produced the 2019-06-02 row. For more information: https://towardsdatascience.com/forecasting-in-python-with-facebook-prophet-29810eb57e66
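Note that the snippet above is the Python API, while the question uses R. A hedged R translation (in the R package, make_future_dataframe() takes a freq unit understood by seq.Date, such as 'month'):
future <- make_future_dataframe(try, periods = 12, freq = 'month')
forecast <- predict(try, future)
tail(forecast)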

How to subset data by specific hours of interest?

I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are every hour and sometimes every four hours. Another issue is that when the time changed as a result of daylight savings, the readings are off by one hour. I am interested in the readings taken every four hours and would like to subset these by day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 =day
2000, 0000, 0400 =night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight savings, the equivalent times are:
0900, 1300, 1700 =day
2100, 0100, 0500 =night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
                        stringsAsFactors = FALSE)
temperature <- subset(temperature, select = -c(X)) # remove last column that contains comments, not needed
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format = "%d/%m/%Y %H:%M",
                                    tz = "Pacific/Auckland")
# subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time == c('01:00:00'), select = c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract hours from your DateTime variable, then filter for particular hours of interest.
Here is a fake example over 4 days:
library(lubridate)
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Value = sample(1:100, 97, replace = TRUE))
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now, you can extract hours with the hour() function of lubridate and subset for the desired hour:
library(lubridate)
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
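To subset by the full set of hours of interest rather than a single hour, %in% works the same way (a small extension of the example above):
subset(df, hour(DateTime) %in% c(8, 12, 16)) # all 'day' readings at once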
EDIT: Getting the mean for each site per subset of hours
Per OP's request in the comments, the goal is to calculate the mean of the values at each site for different periods of time.
Basically, you want two periods per day: one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here is a more elaborate example based on the previous one:
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"), ymd_hms("2020-02-05 00:00:00"), by = "hour"),
                 Site1 = sample(1:100, 97, replace = TRUE),
                 Site2 = sample(1:100, 97, replace = TRUE))
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
....
So, now you can label each time point as day ("Daily") or night ("Night"), group by this category within each date, and calculate the mean of each individual site using summarise_at:
library(lubridate)
library(dplyr)
df %>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime),
         Category = ifelse(between(hour(DateTime), 8, 17), "Daily", "Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1, Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
Does it answer your question?
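One follow-up on the daylight-saving issue raised in the question, which the answer above does not handle: if Date.Time was parsed with tz = "Pacific/Auckland" as in the question's code, you can view the same instants in a fixed-offset zone so that readings taken at fixed real-time intervals keep constant clock hours year-round. A sketch under that assumption:
library(dplyr)
library(lubridate)
temperature %>%
  mutate(std_time = with_tz(Date.Time, tzone = "Etc/GMT-12"), # NZ standard time, no DST
         period = case_when(hour(std_time) %in% c(8, 12, 16) ~ "day",
                            hour(std_time) %in% c(20, 0, 4) ~ "night",
                            TRUE ~ NA_character_))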

How to divide monthly totals by the seasonal monthly ratio in R

I am trying to de-seasonalize my data by dividing each monthly total by the average seasonality ratio for that month. I have two data frames: avgseasonality, which has 12 rows with the average seasonality ratio per month, and an order-total data frame. The problem is that the seasonality ratios, being one averaged ratio per month, fill only 12 rows, while the order-total data frame has 157 rows.
deseasonlize <- transform(avgseasonalityratio,
                          deseasonlizedtotal = df1$OrderTotal / avgseasonality$seasonalityratio)
This runs, but it does not pair the months appropriately: it applies the first ratio (April) to the first order total (December).
> avgseasonality
Month seasonalityratio
1 April 1.0132557
2 August 1.0054602
3 December 0.8316988
4 February 0.9813396
5 January 0.8357475
6 July 1.1181648
7 June 1.0439899
8 March 1.1772450
9 May 1.0430667
10 November 0.9841149
11 October 0.9595041
12 September 0.8312318
> df1
# A tibble: 157 x 3
DateEntLabel OrderTotal `d$Month`
<dttm> <dbl> <chr>
1 2005-12-01 00:00:00 512758. December
2 2006-01-01 00:00:00 227449. January
3 2006-02-01 00:00:00 155652. February
4 2006-03-01 00:00:00 172923. March
5 2006-04-01 00:00:00 183854. April
6 2006-05-01 00:00:00 239689. May
7 2006-06-01 00:00:00 237638. June
8 2006-07-01 00:00:00 538688. July
9 2006-08-01 00:00:00 197673. August
10 2006-09-01 00:00:00 144534. September
# ... with 147 more rows
I need the order total and the ratio for each month to be matched up. For example, the December calculation would be 512758 / 0.8316988 = 616518.86. The output of the calculations would go in a new column that corresponds to the month and order total. Any help is greatly appreciated!
The easiest way would be to merge() your data first, then do the operation. You can use the base R merge() function, though here I will show the tidyverse left_join() function. I see that one of your columns has the strange name d$Month; renaming it to Month will simplify the merge!
Reproducible example:
library(tidyverse)
df_1 <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1, 2))
df_2 <- data.frame(Month = rep(c("Jan", "Feb"), each = 2), OrderTotal = 1:4)
df_1 %>%
  left_join(df_2, by = "Month") %>%
  mutate(deseasonalizedtotal = OrderTotal / seasonalityratio)
#>   Month seasonalityratio OrderTotal deseasonalizedtotal
#> 1   Jan                1          1                 1.0
#> 2   Jan                1          2                 2.0
#> 3   Feb                2          3                 1.5
#> 4   Feb                2          4                 2.0
Created on 2019-01-30 by the reprex package (v0.2.1)
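Applied to the data in the question, the same pattern would look something like this (a sketch: I'm assuming the awkward d$Month column can be renamed with backticks, and I'm reusing the object names shown above):
library(dplyr)
df1 %>%
  rename(Month = `d$Month`) %>% # fix the awkward column name first
  left_join(avgseasonality, by = "Month") %>%
  mutate(deseasonalizedtotal = OrderTotal / seasonalityratio)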

Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"), as.POSIXct("2016-12-31"), length.out = 10000),
                 what = sample(LETTERS, 10000, replace = TRUE))
tweet.summary = dat %>%
  group_by(day = date(time)) %>% # to summarise by month: group_by(month = month(time, label=TRUE))
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label = TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var = "Month"),
       aes(Month, value, colour = variable, group = variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits = c(0, 0.06), labels = percent_format()) +
  labs(colour = "", y = "")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day = sample(1:28, 10), month = sample(1:12, 10), year = 2016,
                  time = paste0(sample(c(paste0(0, 0:9), 10:12), 10), ":", sample(10:50, 10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year, "-",
                                               sprintf("%02d", month), "-",
                                               sprintf("%02d", day), " ",
                                               time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year, "-",
                                      sprintf("%02d", month), "-",
                                      sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
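As a quick sanity check that real dates fix the "2.13 earlier than 2.7" ordering problem from the question:
2.13 < 2.7 # TRUE: numerically, 2.13 sorts before 2.7
as.Date("2016-02-13") < as.Date("2016-02-07") # FALSE: as dates, Feb 13 comes after Feb 7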
