Removing inconsistent observations in R

I have the data frame shown below. I want to examine every Intake record against the "03 Months" record. My goal is to remove any Intake that is far from three months before the follow-up, by comparing the dates. Intake is when a client first registers with an agency. 03 Months is the three-month follow-up.
I am running R version 3.3.2 in a Windows environment.
I have taken the difference in days between each date and the previous date. However, it is not straightforward to eliminate the observations with a gap of less than 90 days or greater than, say, 100 days, because some patients have only an intake and no follow-up, and I would like to keep those.
Any help please.
ID DATE FREQ
1 08/09/2014 Intake
1 27/03/2015 Intake
1 01/09/2015 Intake
1 07/12/2015 03 Months
1 18/03/2016 06 Months
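One way to read the request is: for each client, keep an Intake row only if it falls roughly three months before that client's "03 Months" follow-up, and keep every row for clients who have no follow-up. A minimal base R sketch of that reading follows. The 90 to 100 day window comes from the question; the second client (ID 2), the names `keep` and `cleaned`, and the choice to compare against the first "03 Months" date per client are assumptions made for illustration.

```r
# Example data from the question, plus a hypothetical client (ID 2)
# who has an intake but no follow-up.
df <- data.frame(
  ID   = c(1, 1, 1, 1, 1, 2),
  DATE = as.Date(c("08/09/2014", "27/03/2015", "01/09/2015",
                   "07/12/2015", "18/03/2016", "15/05/2015"),
                 format = "%d/%m/%Y"),
  FREQ = c("Intake", "Intake", "Intake", "03 Months", "06 Months", "Intake"),
  stringsAsFactors = FALSE
)

keep <- rep(TRUE, nrow(df))
for (id in unique(df$ID)) {
  rows   <- which(df$ID == id)
  follow <- df$DATE[rows][df$FREQ[rows] == "03 Months"]
  if (length(follow) == 0) next   # intake only: keep every row
  # Days from each row's date to the first three-month follow-up.
  gap <- as.numeric(follow[1] - df$DATE[rows])
  # Non-intake rows are always kept; intakes must fall inside the window.
  keep[rows] <- df$FREQ[rows] != "Intake" | (gap >= 90 & gap <= 100)
}

cleaned <- df[keep, ]
print(cleaned)
```

With the sample data, the intake of 01/09/2015 is 97 days before the follow-up of 07/12/2015 and is kept, while the two earlier intakes are dropped.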

Related

Getting the same day across different years in R

I have a dataset for a time series spanning a couple of years with daily observations. I'm trying to smooth some clearly wrong data inserted there (for example, negative values when the variable cannot take values below zero) and what I came up with was trying to smooth it or "interpolate" it by using both the mean of the days around that observation and the mean of the same day or couple of days from previous years, as I have yearly seasonality (I'm still unsure about this part, any comment would be greatly appreciated).
So my question is whether I can easily access the same day across different years.
Here's a dummy example of my data:
library(tidyverse)
library(lubridate)
date value
2016-10-01 00:00:00 28
2016-10-02 00:00:00 25
2016-10-03 00:00:00 24
2016-10-04 00:00:00 22
2016-10-05 00:00:00 -6
2016-10-06 00:00:00 26
I have that for years 2016 through 2020. So in this example I would use the dates around 2016-10-05 AND I would like to use the dates around the 5th of October from years 2017 to 2020 to kind of maintain the seasonality, but maybe this is incorrect.
I tried to use +years() from lubridate, but I still have to do things manually and I would like to automate this.
If your question is solely "whether [you] can easily access the same day [across] different years", you could do that as follows:
# say your data frame is called df
library(lubridate)
day(df$date)
This will return the day part of the date for every entry in that column of your data frame.
Edit to reply to comment from asker:
This is a very basic way to specify the day and month for which you would like to obtain the corresponding rows in your data frame:
df[day(df$date) == 5 & month(df$date) == 10, ]
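To automate the "same day in other years" part of the question, one option is to build a month-day key and average over it. The following is a minimal base R sketch under stated assumptions: the values for 2017 and 2018 are made up for illustration, negative values are treated as the only invalid ones, and the names `md`, `bad`, and `same_day_mean` are hypothetical. It does not include the neighbouring-days average that the question also mentions.

```r
# Small example: the 2016 rows follow the question, the 2017 and 2018
# rows are hypothetical values for the same calendar day.
df <- data.frame(
  date  = as.Date(c("2016-10-04", "2016-10-05", "2016-10-06",
                    "2017-10-05", "2018-10-05")),
  value = c(22, -6, 26, 24, 28)
)

# Month-day key, e.g. "10-05", shared by the same calendar day in every year.
md  <- format(df$date, "%m-%d")
bad <- df$value < 0

# Mean of the valid values for each calendar day across all years.
same_day_mean <- sapply(split(df$value[!bad], md[!bad]), mean)

# Replace each invalid value with the mean for its calendar day
# (NA if no valid value exists for that day in any year).
df$value[bad] <- unname(same_day_mean[md[bad]])
print(df)
```

Here the value for 2016-10-05 is replaced by the mean of the 5 October values from the other two years.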

Can the line on a line graph use an input that is linked to the x-values shown on the x-axis?

I have data showing multiple people's records on multiple days. Each row also shows the week of the year that said day happened on. Some example data:
Date        Week of year  Person  Commission
2020-12-20  51            Alice   $3
2021-12-20  51            Alice   $4
2020-12-20  51            Bob     $14
2021-12-20  51            Bob     $22
2020-12-31  52            Alice   $34
2021-12-31  52            Alice   $42
2020-12-31  52            Bob     $4
2021-12-31  52            Bob     $2
What I want is to plot a line graph that shows 'Week of year' on the x-axis, but actually plots one value - the average commission between the two employees - for each day in each week per year. Is this possible?
Whenever I tell Power BI to use 'Week of year' in the x-axis and the year part of 'Date' as the legend, it gives the correct x-axis and correctly gives me one line per year. However, it clearly uses the average value of 'Commission' for each week rather than using each day's value. That is, it gives me about 52 values on the line when I really want about 365. Using 'Date' as the x-axis appears to give me the correct lines, but then I don't have the x-axis that I want.
If it helps, I already have a table that converts each date to its corresponding week.
I'm not a PowerBI user, but if you can use it to plot from two tables, and the first table is in Excel, make a second table that has the average for each day in the first table. Copy all the dates to a new column, remove the duplicates and then calculate the average commission using a formula like
=AVERAGEIFS(salesCom[Commission],salesCom[Date],"=" &H3)
Make the result into your second table and add to PowerBI.
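The same per-day averaging step can also be sketched outside Excel. The following base R example is only an illustration of the aggregation, not a Power BI feature; the data frame name `sales` and the result name `daily_avg` are assumptions, and the commissions are entered as plain numbers without the dollar sign.

```r
# Example data from the question.
sales <- data.frame(
  Date = as.Date(c("2020-12-20", "2021-12-20", "2020-12-20", "2021-12-20",
                   "2020-12-31", "2021-12-31", "2020-12-31", "2021-12-31")),
  Person = c("Alice", "Alice", "Bob", "Bob", "Alice", "Alice", "Bob", "Bob"),
  Commission = c(3, 4, 14, 22, 34, 42, 4, 2)
)

# One row per date, holding the mean commission across people on that date.
daily_avg <- aggregate(Commission ~ Date, data = sales, FUN = mean)
print(daily_avg)
```

The result has one value per day rather than one per week, which is the shape the second table needs.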

How do I transform half-hourly data that does not span the whole day to a Time Series in R?

This is my first question on Stack Overflow, sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. I currently have data that looks like this:
The menge column represents how much water a person has actually drunk in 30 minutes (So first value represents amount from 8:00 till before 8:30 etc..). This is a 1 day sample from 3 months of data. The day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency defines the number of (usually regularly spaced) observations within a given time period. In your example, the observations are every 30 minutes between 8 AM and 8 PM, and the time period is 1 day. Choosing a period of 1 day assumes that the pattern within each day is of most interest; you could also use 1 week.
So within each day of your data (8AM-8PM) you have 24 observations (24 half hours). So a suitable frequency for this data would be 24.
You can also pad the data with 0 values, however this isn't necessary and would complicate the model. If you padded the data so that it has observations for all half-hours of the day, the frequency would then be 48.
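As a rough illustration of the unpadded option, the sketch below builds a ts object with frequency 24 from made-up data. The values, the assumed unit (litres), and the names `one_day`, `y`, and `daily_total` are hypothetical; `menge` is the column name from the question.

```r
# Hypothetical data: 24 half-hour slots (08:00 to 20:00) for one day,
# in litres, repeated for 3 days.
one_day <- c(0.2, 0, 0, 0.3, 0, 0, 0.25, 0, 0, 0, 0.4, 0,
             0, 0.2, 0, 0, 0.3, 0, 0, 0, 0.25, 0, 0, 0.1)
menge <- rep(one_day, times = 3)

# One seasonal period is one observed day, so the frequency is 24
# (it would be 48 only if the nights were padded with zeros).
y <- ts(menge, frequency = 24)

# Each column of the matrix is one day; column sums are the daily totals.
daily_total <- colSums(matrix(menge, nrow = 24))
print(frequency(y))
print(daily_total)
```

This layout relies on every day having exactly 24 observations in the same order; days with missing slots would need to be filled or dropped first.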

Representing an entire day or week or month as a number like timestamp

How can a day or week or month, essentially a range of time be represented by a single number?
The next interval would represent a number 1 more than the number for the previous interval, just how the next second is 1 more than the previous second, in timestamp representation.
Given a bunch of such numbers, the larger number simply means its representing a time interval afterwards in time, when compared to a number smaller than it.
Just realized if I stick to UTC and represent the day as YYYYMMDD, this becomes a number that I am looking for.
20180420 // 20 april 2018
20180421 // 21 april 2018
20180510 // 10 may 2018
20190101 // 1 jan 2019
This works for representing a day perfectly, I think.
For a week, one option is to take ceil() of the day of the month divided by 7 to get a week number W, and then use the format YYYYMMW.
2018043 // 3rd week of april 2018
2018045 // 5th week of april 2018, though may not be the 5th week semantically but representation model works, greater than 4th week of april 2018 and smaller number than 1st week of may 2018
For month, simply YYYYMM works.
I feel so smart right now! 😄
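The encodings above can be sketched in a few lines. This is a minimal base R illustration, assuming the dates are already in UTC; the variable names are hypothetical.

```r
d <- as.Date(c("2018-04-20", "2018-04-21", "2018-04-29",
               "2018-05-10", "2019-01-01"))

day_key   <- as.integer(format(d, "%Y%m%d"))   # YYYYMMDD
month_key <- as.integer(format(d, "%Y%m"))     # YYYYMM

# Week within the month: ceiling of the day of the month divided by 7.
week_in_month <- as.integer(ceiling(as.integer(format(d, "%d")) / 7))
week_key <- month_key * 10L + week_in_month    # YYYYMMW

# Alternative: whole days since 1970-01-01, where each day is exactly
# one more than the previous day.
days_since_epoch <- as.integer(d)

print(data.frame(d, day_key, week_key, month_key, days_since_epoch))
```

These keys preserve chronological order, but consecutive intervals are not always one apart (20180430 is followed by 20180501). If the "one more than the previous interval" property matters, a count of intervals since a fixed origin, such as `days_since_epoch`, keeps it.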

Post-Process a Stata %tw date in R

The %tw format in Stata has the form: 1960w1 which has no equivalent in R.
Therefore %tw dates must be post-processed.
Importing a .dta file into R, the date is an integer like 1304 (instead of 1985w5) or 1426 (instead of 1987w23). If it was a simple time series you could set a starting date as follows:
ts(df, start= c(1985,5), frequency=52)
Another possibility would be:
as.Date(Camp$date, format= "%Yw%W" , origin = "1985w5")
But if each row is not a single date, then you must convert it.
The package ISOweek is based on ISO-8601 with the form "1985-W05" and does not process the Stata %tw.
The lubridate package does not work with this format. Its week() function returns the number of complete seven-day periods that have occurred between the date and January 1st, plus one (see the lubridate documentation for week()).
In Stata, week 1 of any year starts on 1 January, whatever day of the week that is (see the Stata documentation on dates).
In R's %W date format, Monday is the first day of the week.
From the strptime documentation, %V is
"the Week of the year as decimal number (00--53) as defined in ISO 8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise, it is the last week of the previous year, and the next week is week 1. (Accepted but ignored on input.)"
Larmarange noted on GitHub that haven doesn't interpret these dates properly:
"months, week, quarter and halfyear are specific format from Stata, respectively %tm, %tw, %tq and %th. I'm not sure that there are corresponding formats available in R. So far they are imported as integers."
Is there a way to convert Stata %tw to a date format R understands?
Here is a Stata file with dates
This won't be an answer in terms of R code, but it is commentary on Stata weeks that can't be fitted into a comment.
Strictly, dates in Stata are not defined by the display formats that make them intelligible to people. A date in Stata is always a numeric variable or scalar or macro defined with origin the first instance in 1960. Thus it is at best a shorthand to talk about %tw dates, etc. We can use display to see the effects of different date display formats:
. di %td 0
01jan1960
. di %tw 0
1960w1
. di %tq 0
1960q1
. di %td 42
12feb1960
. di %tw 42
1960w43
. di %tq 42
1970q3
A subtle point made explicit above is that changing the display format will not change what is stored, i.e. the numeric value.
Otherwise put, dates in Stata are not distinct data types; they are just integers made intelligible as dates by a pertinent display format.
The question presupposes that it was correct to describe some weekly dates in terms of Stata weeks. This seems unlikely, as I know no instance in which a body outside StataCorp uses the week rules of Stata, not only that week 1 always starts on 1 January, but also that week 52 always includes either 8 or 9 days and hence that there is never a week 53 in a calendar year.
So, you need to go upstream and find out what the data should have been. Failing some explanation, my best advice is to map the 52 weeks of each year to the days that start them, namely days 1(7)358 of each calendar year.
Stata weeks won't map one-to-one to any other scheme for defining weeks.
More in this article on Stata weeks
It's not completely clear what the question is but the year and week corresponding to 1304 are:
wk <- 1304
1960 + wk %/% 52
## [1] 1985
wk %% 52 + 1
## [1] 5
so assuming that the first week of the year is week 1 and starts on Jan 1st, the beginning of the above week is this date:
as.Date(paste(1960 + wk %/% 52, 1, 1, sep = "-")) + 7 * (wk %% 52)
## [1] "1985-01-29"
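The same arithmetic can be wrapped in small vectorised helpers so that a whole imported column is converted at once. This is a sketch under the assumption stated above (week 1 starts on 1 January and every year has exactly 52 Stata weeks); the function names `stata_tw_to_date` and `stata_tw_label` are hypothetical.

```r
# Convert Stata weekly date integers (weeks since 1960w1) to the Date
# on which each Stata week starts.
stata_tw_to_date <- function(wk) {
  year <- 1960 + wk %/% 52
  week <- wk %% 52               # zero-based week within the year
  as.Date(paste(year, 1, 1, sep = "-")) + 7 * week
}

# Rebuild the label Stata would display with the %tw format.
stata_tw_label <- function(wk) {
  sprintf("%dw%d", as.integer(1960 + wk %/% 52), as.integer(wk %% 52 + 1))
}

wk <- c(0, 1304, 1426)
print(stata_tw_to_date(wk))
print(stata_tw_label(wk))
```

For the two values mentioned in the question, the labels come out as 1985w5 and 1987w23, matching what Stata displays.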

Resources