Getting the same day across different years in R

I have a daily time series spanning a couple of years. Some observations are clearly wrong (for example, negative values for a variable that cannot go below zero), and I want to smooth or "interpolate" them. My idea is to combine the mean of the days around the bad observation with the mean of the same day, or the same few days, in other years, since the data has yearly seasonality. I'm still unsure about this part, so any comment would be greatly appreciated.
So my question is whether I can easily access the same day across different years.
Here's a dummy example of my data:
library(tidyverse)
library(lubridate)
date value
2016-10-01 00:00:00 28
2016-10-02 00:00:00 25
2016-10-03 00:00:00 24
2016-10-04 00:00:00 22
2016-10-05 00:00:00 -6
2016-10-06 00:00:00 26
I have that for years 2016 through 2020. So in this example I would use the dates around 2016-10-05 AND I would like to use the dates around the 5th of October from years 2017 to 2020 to kind of maintain the seasonality, but maybe this is incorrect.
I tried to use + years() from lubridate, but I still have to do things manually and I would like to automate this.

If your question is solely "whether [you] can easily access the same day [across] different years", you could do that as follows:
# say your data frame is called df
library(lubridate)
day(df$date)
This returns the day of the month (1 to 31) for every entry in that column of your data frame.
Edit to reply to comment from asker:
This is a very basic way to specify the day and month for which you would like to obtain the corresponding rows in your data frame:
df[day(df$date) == 5 & month(df$date) == 10, ]
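To go from selecting one calendar day to actually filling in the bad values, something along these lines might work. This is only a minimal sketch: the data frame below is made up for illustration, the md and filled columns are hypothetical names, and base R's format() is used in place of lubridate to keep it dependency-free. It only covers the "same day in other years" part, not the neighbouring-days mean.

```r
# Hypothetical data: the same two calendar days observed in three years
df <- data.frame(
  date  = as.Date(c("2016-10-04", "2016-10-05", "2017-10-04",
                    "2017-10-05", "2018-10-04", "2018-10-05")),
  value = c(22, -6, 21, 24, 23, 26)
)

# Treat impossible (negative) values as missing
df$value[df$value < 0] <- NA

# Key each row by month-day so the same calendar day matches across years
df$md <- format(df$date, "%m-%d")

# Mean of the valid values for each calendar day over all years
day_mean <- ave(df$value, df$md, FUN = function(x) mean(x, na.rm = TRUE))

# Fill each gap with the cross-year mean for that calendar day
df$filled <- ifelse(is.na(df$value), day_mean, df$value)
```

Note that 29 February only exists in leap years, so that key will have fewer observations behind it than the others.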

Related

How to remove rows in data frames with dates close to each other in R?

I have longitudinal data in a data frame in long format in R, such that a person can be present on several rows, where each row has a specific date, but never the same date twice. The data is sorted first by personal ID and then by date, so that the earliest dates for an individual come first.
Following is what I would like to accomplish:
The first date for each individual should be kept. For the rest of the dates, I want to remove all dates occurring within 30 days of a previous date for that person. However, once a row is removed, no following dates should be compared to that removed date. The dates should be removed in order, from top to bottom. For example, if a person has dates 14 May 2020, 20 May 2020, 22 May 2020 and 17 June 2020, I would like to remove the rows with the two middle dates, as they are close to the first date, 14 May 2020. I have been able to do this with for loops, but that is not at all time efficient for big data. Does anybody know how I could solve this in a better way?
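The rule described above can be sketched as follows. This is only an illustration under stated assumptions: keep_spaced is a hypothetical helper, the id and date column names are made up, and "within 30 days" is read as a gap of 30 days or fewer. The rule is sequential by nature, so the sketch still loops, but only within each person rather than over the whole data frame.

```r
# Hypothetical helper: for ONE person's dates (sorted ascending), return
# TRUE for rows to keep. Each date is compared to the last KEPT date only.
keep_spaced <- function(dates, gap = 30) {
  keep <- logical(length(dates))
  last_kept <- as.Date(NA)
  for (i in seq_along(dates)) {
    if (is.na(last_kept) || as.numeric(dates[i] - last_kept) > gap) {
      keep[i] <- TRUE
      last_kept <- dates[i]
    }
  }
  keep
}

# Made-up example data, sorted by id and then by date
df <- data.frame(
  id   = c(1, 1, 1, 1, 2, 2),
  date = as.Date(c("2020-05-14", "2020-05-20", "2020-05-22",
                   "2020-06-17", "2020-01-01", "2020-01-15"))
)

# Apply the helper per person via row indices; ave() returns 0/1 here
keep <- as.logical(ave(seq_len(nrow(df)), df$id,
                       FUN = function(i) keep_spaced(df$date[i])))
result <- df[keep, ]
```

For person 1 this keeps 14 May and 17 June and drops the two middle dates, as in the example from the question.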

Making a missing data mapping/graph out of the days missing in a time series

I have a data frame of the following structure:
df <- read.table(text="ID Time
58 15-10-2015
59 16-10-2015
60 19-10-2015
61 21-10-2015
62 26-10-2015
63 28-12-2015", header = TRUE)
I would like to make a missmap based on the days that are not represented, so for example 17-10-2015 and 18-10-2015 would be set as missing data. So would the month of November 2015, etc.
However, I have not found a way to make this visual without compromising the quantities of days missing.
I considered splitting the data into one column for every month to display it in Amelia, but that did not work for the ends of months.
Thank you in advance!
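One possible starting point, sketched here under the assumption that "missing" means any calendar day between the first and last observation that has no row, is to build the full daily sequence and flag the absent days. The calendar data frame and its missing column are names made up for this sketch; the plotting step is left out.

```r
df <- read.table(text = "ID Time
58 15-10-2015
59 16-10-2015
60 19-10-2015
61 21-10-2015
62 26-10-2015
63 28-12-2015", header = TRUE, stringsAsFactors = FALSE)

# Parse the day-month-year strings into Date objects
df$Time <- as.Date(df$Time, format = "%d-%m-%Y")

# Every calendar day between the first and last observation
all_days <- seq(min(df$Time), max(df$Time), by = "day")

# One row per calendar day, flagged TRUE when no observation exists
calendar <- data.frame(Time = all_days,
                       missing = !(all_days %in% df$Time))

# Number of missing days
n_missing <- sum(calendar$missing)
```

A frame like calendar keeps one row per day, so the number of missing days is preserved and can be plotted with whatever tool is preferred.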

Time Series Analysis - Model Choosing

I am new to time series analysis and wanted to know what the best R package is to solve my dilemma. I have a data frame with the following columns:
Date Spend Result
2017-06-22 2 17
2017-06-21 5 19
2017-06-20 11 45
2017-06-19 34 78
2017-06-18 23 56
2017-06-17 12 34
The business problem I am trying to solve is whether I can predict the Result column based on the seasonality of the data and the amount spent.
For example, let's say I wanted to increase my spend to $45 more per day, can I predict the Result based on the spend and the time of year?
I was going to use a generalized additive model but that only takes 1 variable into account. Is it possible to do a simple regression analysis with this with time being one of the variables?
I was thinking of taking the month from the date column and making the month dummy variables. Not sure if there is a better way though.
Thanks!
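The month-dummy idea mentioned above could look roughly like this with a plain linear model. This is a sketch only: the data is simulated (the real data is not shown in the question), the coefficients used to generate it are arbitrary, and dat, Month, fit and new_day are names made up for the example.

```r
set.seed(1)

# Simulated stand-in data: two years of daily Spend and Result
dates <- seq(as.Date("2016-01-01"), as.Date("2017-12-31"), by = "day")
dat <- data.frame(Date = dates, Spend = runif(length(dates), 1, 50))

# Month as a factor; lm() expands it into dummy variables automatically
dat$Month <- factor(format(dat$Date, "%m"))

# Arbitrary generating process: linear in Spend plus a summer bump
dat$Result <- 10 + 2 * dat$Spend +
  5 * (dat$Month %in% c("06", "07", "08")) + rnorm(nrow(dat))

# Regression with both Spend and time of year as predictors
fit <- lm(Result ~ Spend + Month, data = dat)

# Predict Result for a spend of 45 on a day in June
new_day <- data.frame(Spend = 45,
                      Month = factor("06", levels = levels(dat$Month)))
pred <- predict(fit, newdata = new_day)
```

Whether a linear effect of Spend is adequate for the real data is a separate question that this sketch does not address.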

difftime for multiple dates in r

I have chemistry water data taken from a river. Normally, the sample dates were on a Wednesday every two weeks. The data record starts in 1987 and ends in 2013.
Now, I want to check whether there are any inconsistencies within the data, that is, whether the samples were really taken every 14 days. For that task I want to use the R function difftime, but I have no idea how to do that for multiple dates.
Here is some data:
Date Value
1987-04-16 12:00:00 1,5
1987-04-30 12:00:00 1,2
1987-06-25 12:00:00 1,7
1987-07-14 12:00:00 1,3
Can you tell me how to use the function difftime properly in that case, or suggest any other function that does the job? The result should be the number of days between the samplings and/or a TRUE/FALSE for the 14 days.
Thanks to you guys in advance. Any google-fu was to no avail!
Assuming your data.frame is named dd, you'll want to verify that the Date column is being treated as a date. Most of the time R will read such a column as character, which may get converted to a factor in a data.frame. If class(dd$Date) is "character" or "factor", run
dd$Date<-as.POSIXct(as.character(dd$Date), format="%Y-%m-%d %H:%M:%S")
Then you can do a simple diff() to get the time differences in days
diff(dd$Date)
# Time differences in days
# [1] 14 56 19
# attr(,"tzone")
# [1] ""
so you can check which ones are over 14 days.
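To get the TRUE/FALSE flag the question asks for, the differences can be compared against 14 directly. The following is a sketch using the sample rows from the question; tz = "UTC" is an assumption added here so that daylight saving changes cannot produce fractional days, and gap_days and is_14 are names made up for the example.

```r
dd <- data.frame(
  Date = c("1987-04-16 12:00:00", "1987-04-30 12:00:00",
           "1987-06-25 12:00:00", "1987-07-14 12:00:00"),
  stringsAsFactors = FALSE
)

# Parse as date-times in a fixed time zone (assumption: UTC)
dd$Date <- as.POSIXct(dd$Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Gap in days between each sample and the previous one
gap_days <- as.numeric(difftime(dd$Date[-1], dd$Date[-nrow(dd)],
                                units = "days"))

# TRUE where the sampling interval was exactly 14 days
is_14 <- gap_days == 14
```

Forcing units = "days" avoids diff() silently choosing a different unit when the gaps are short.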

Calculating days per month between interval of two dates

I have a set of events that each have a start and end date, but they take place over the scope of a number of months. I would like to create a table that shows the number of days in each month for this event.
I have the following example.
event_start_date <- as.Date("23/10/2012", "%d/%m/%Y")
event_end_date <- as.Date("07/02/2013", "%d/%m/%Y")
I would expect to get a table out as the following:
Oct-12 8
Nov-12 30
Dec-12 31
Jan-13 31
Feb-13 7
Does anybody know about a smart and elegant way of doing this or is creating a system of loops the only viable method?
Jochem
This is not necessarily efficient because it creates a sequence of days, but it does the job:
> library(zoo)
> table(as.yearmon(seq(event_start_date, event_end_date, "day")))
Oct 2012 Nov 2012 Dec 2012 Jan 2013 Feb 2013
       9       30       31       31        7
If your time span is so large that this method is slow, you'll have to create a sequence of the firsts of the months between your two (truncated) dates, take the diff, and do a little extra work for the end points.
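The firsts-of-the-months approach could be sketched like this, in base R only. The variable names are made up for the example, both end points are counted as event days, and instead of a literal diff() the sketch clips each month to the event interval, which handles the end points in the same step.

```r
event_start_date <- as.Date("23/10/2012", "%d/%m/%Y")
event_end_date   <- as.Date("07/02/2013", "%d/%m/%Y")

# First day of every month touched by the event
first_start  <- as.Date(format(event_start_date, "%Y-%m-01"))
first_end    <- as.Date(format(event_end_date, "%Y-%m-01"))
month_firsts <- seq(first_start, first_end, by = "month")

# First day of the FOLLOWING month, for each month above
next_firsts <- seq(first_start, by = "month",
                   length.out = length(month_firsts) + 1)[-1]

# Clip each month to the event interval (plain day numbers for pmax/pmin)
from <- pmax(as.numeric(month_firsts), as.numeric(event_start_date))
to   <- pmin(as.numeric(next_firsts) - 1, as.numeric(event_end_date))

# Inclusive day count per month, labelled like "2012-10"
days_per_month <- to - from + 1
names(days_per_month) <- format(month_firsts, "%Y-%m")
```

This creates one element per month rather than one per day, so it stays cheap even for very long intervals.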
As DjSol already pointed out in his comment, you can just subtract two dates to get the number of days:
event_start_date <- as.Date("23/10/2012", "%d/%m/%Y")
event_end_date <- as.Date("07/02/2013", "%d/%m/%Y")
as.numeric(event_end_date - event_start_date)
Is that what you want? Since you mention a loop, I have the feeling that your actual problem might be getting the start and end dates into a format that lets you subtract them easily. If so, I guess we need more details on what your actual data looks like.
