R - Selecting control group days close to observation times

I have a set of many different kinds of observations that contain the dates of observations on special days (sam1, sam2, sam3). My aim is to perform a wilcox.test() to find out if there is a significant difference between the observations on these special days and on non-special days, so I need a method to find suitable days to use as a control group. I want to try out at least 3 different control groups. The special days are very different: some represent a whole season, some only rainy days, some only cold days, some stormy weather. So they can be spread over the whole 2 years, or might only occur in one single month.
Start <- as.Date("2016-01-01")
End <- as.Date("2017-12-31")
all_dates <- seq(from = Start, to = End, by = 1)
set.seed(1)
sam1 <- sample(all_dates, 30)
sam2 <- sample(all_dates, 5)
sam3 <- sample(all_dates, 120)
all_dates represents my observation period; sam1-3 contain the days of the different observations. What I want to do now is to find:
The closest days to my important observations (the same number of days as in the sample)
Random days around roughly the same time the important observations took place (also the same number of days as in the sample). They need not be the closest days, just roughly around the same time, e.g. in the same month or one month earlier/later.
Any days (I know how to do that, no help needed there)
My idea was to cut out the days of important observations from my whole observation period and then apply a routine that selects my control group days. That is where I am stuck now. Any ideas?
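One possible routine for the first two options, as a minimal sketch along the lines the question describes (not from the original post; pool, ctrl_closest and ctrl_random are hypothetical names, shown here for sam1):
pool <- all_dates[!all_dates %in% sam1]  # candidate control days, observation days cut out
# distance (in days) from every candidate to its nearest observation day
d <- abs(outer(as.numeric(pool), as.numeric(sam1), "-"))
dist_to_obs <- apply(d, 1, min)
ctrl_closest <- pool[order(dist_to_obs)][seq_along(sam1)]  # option 1: closest days
# option 2: random days drawn from the months the observations fall in
same_months <- pool[format(pool, "%Y-%m") %in% format(sam1, "%Y-%m")]
ctrl_random <- sample(same_months, length(sam1))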

Related

How do I transform half-hourly data that does not span the whole day to a Time Series in R?

This is my first question on stackoverflow, sorry if the question is poorly put.
I am currently developing a project where I predict how much a person drinks each day. My data has a menge column that represents how much water a person actually drank in each 30-minute window (so the first value represents the amount from 8:00 until just before 8:30, etc.). This is a 1-day sample from 3 months of data; the day starts at 8 AM and ends at 8 PM.
I am trying to forecast the Time Series for each day. For example, given the first one or two time steps, we would predict the whole day and then we know how much in total the person has drunk until 8 PM.
I am trying to model this data as a Time Series object in R (Google Colab), in order to use Croston's Method for the forecasting. Using the ts() function, what should I set the frequency to knowing that:
The data is half-hourly
The data is from 8:00 till 20:00 each day (Does not span the whole day)
Would I need to make the data span the whole day by adding 0 values? Are there maybe better approaches for this? Thank you in advance.
When using the ts() function, the frequency defines the number of (usually regularly spaced) observations within a given time period. In your example, the observations are every 30 minutes between 8AM and 8PM, and the time period is 1 day. Choosing 1 day assumes that the pattern over each day is of most interest here; you could also use 1 week.
So within each day of your data (8AM-8PM) you have 24 observations (24 half-hours), and a suitable frequency for this data would be 24.
You could also pad the data with 0 values, but this isn't necessary and would complicate the model. If you padded the data so that it had observations for all half-hours of the day, the frequency would then be 48.
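For illustration, a minimal sketch with simulated numbers (the menge values here are made up; the forecast package is one possible home for Croston's method):
# 90 days of half-hourly intake values between 8:00 and 20:00 (simulated)
menge <- runif(24 * 90, min = 0, max = 250)
y <- ts(menge, frequency = 24)  # frequency = 24 observations per daily period
# Croston's method, e.g. from the forecast package:
# library(forecast); fc <- croston(y, h = 24)  # forecast the next day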

How to subtract a number of weeks from a yearweek/weeknumber in R?

I have a couple of week numbers of interest. Let's take '202124' (this week) as an example. How can I subtract x weeks from this week number?
Let's say I want to know the week number of 2 weeks prior; ideally I would do 202124 - 2, which would give me 202122. This is fine for most of the year, but 202101 - 2 gives 202099, which is obviously not a valid week number. This would happen on a large scale, so a more elegant solution is required. How could I go about this?
Convert the yearweek values to dates, subtract in days, and format the output.
x <- c('202124', '202101')
# append day-of-week 1 (Monday), parse as a date, step back 14 days, reformat
format(as.Date(paste0(x, 1), '%Y%W%u') - 14, '%Y%V')
#[1] "202122" "202052"
To convert a yearweek value to a date we also need the day of the week; I have used the 1st day of the week (Monday).
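One caveat worth adding (not in the original answer): %Y is the calendar year while %V is the ISO 8601 week, so the pair can disagree in late December and early January; %G (the ISO week-based year) is the consistent partner for %V.
format(as.Date("2019-12-30"), "%Y%V")  # "201901" - calendar year paired with ISO week
format(as.Date("2019-12-30"), "%G%V")  # "202001" - consistent ISO yearweek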

Create moving periods in a dataframe and calculate things (RStudio)

I have a dataframe with Precipitation data for every day from January 1961 to December 2017 that looks like this:
DF = data.frame(Years, Month, Day, Precipitation.Value)
I want to create periods of 30 days, starting with the 1st of January 1961, so the first period will be 1st January to 30th January 1961, and I want R to calculate the number of days without rain (Precipitation.Value = 0) in it. Then I want to do the same with the next day, the 2nd of January, so the period will be 2nd January-31st January, and so on. After that, I need R to create a data frame with all the results for the year 1961. So it should be a data frame of only one column whose values are the number of days without rain in every period.
Then I need to do the same with all the years, which means I will end up with 56 dataframes (1 for each year); after that I could make a matrix with all of them (putting each data frame as a row).
The thing is I DO NOT KNOW how to start. I have no idea how to make the loop. I know it should be really easy, but I am having trouble doing it. Especially, I do not know how to tell R to stop at every different year, start over, and make a NEW data frame/vector with values.
Please provide a reproducible subset of your data so others can help you more effectively. While I cannot teach you how to create a loop from scratch, here is some code that I think will help. It simply calculates the moving 30-day mean of precipitation using a for loop; you can then use dplyr to filter these moving values by year and create data frames from that. Note I'm not counting the number of no-precipitation days here, but you can easily modify the loop to do that if needed.
# simulated data: 36 years x 12 months x 30 days
df <- data.frame(year  = rep(1967:2002, each = 12 * 30),
                 month = rep(rep(1:12, each = 30), 36),
                 day   = rep(1:30, 36 * 12),
                 precipitation = sample(0:2000, 36 * 12 * 30, replace = TRUE))
df
# create a column that goes from 1 to however long your dataframe is
df$marker <- 1:nrow(df)
#' Now we create a simple loop to calculate the mean precipitation for
#' every 30-day window. You can modify this to count the number of days with
#' 0 precipitation instead, e.g. sum(df$precipitation[start:end] == 0).
#' The new column movingprecip holds the mean precipitation for the
#' past 30 days relative to its position: on row 55 it gives
#' the mean precipitation from rows 26 to 55.
df$movingprecip <- NA
for (i in 1:nrow(df)) {
  end <- i         # the window ends at the current row
  start <- i - 29  # and starts 29 rows (days) earlier
  if (start < 1) {
    # not enough days yet: the first full 30-day window ends at row 30
    print("not able to calculate, not 30 days into the data yet")
  } else {
    # mean of the past 30 days of precipitation
    df$movingprecip[end] <- mean(df$precipitation[start:end])
  }
}
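For the dry-day count the question actually asks about, a vectorized alternative is also possible; this is a sketch assuming the zoo package is available (dry30 is a hypothetical column name):
library(zoo)
# number of days without rain in each trailing 30-day window
df$dry30 <- rollapplyr(df$precipitation == 0, width = 30, FUN = sum, fill = NA)
# one set of results per year, e.g. via split()
dry_by_year <- split(df$dry30, df$year)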

How to cluster according to hour of specific day

I have the logs of the amount of arrivals at a bank, every half an hour, for one month.
I am trying to find different cluster groups according to the amount of "arrivals". I tried according to the day, and I tried according to the hour (not of a specific day). I would like to try according to the hour of a specific day.
An example:
Thursdays at 14:00 and Sundays at 15:00 are one cluster, with an average of 10000 arrivals.
Mondays at 13:00, Mondays at 10:00, and Tuesdays at 16:00 are one cluster, with an average of 15000 arrivals.
All the rest are another cluster, with an average of 2000 arrivals.
I have a csv file with the columns: Date, Day(1-7), Time, Arrivals
Until now I used this:
km <- kmeans(table, 3, 15)
plot(km)
(I tried 3 clusters.) This code clusters pairs: it produces a 3x3 matrix of plots, one for each pair of the 3 columns.
Is there a way to do that?
k-means and similar algorithms will yield meaningless results on this kind of data.
The problem is you are using the wrong tool for the wrong problem on the wrong data.
Your data is: Date, Day(1-7), Time, Arrivals
K-means will try to minimize variance. But does variance make any sense on this data set? How do you know which k makes the most sense? Since Arrivals likely has the largest variance of these attributes, it will completely dominate your result.
The question you should first try to answer is: what is a good result? Then consider ways of visualizing the results to verify that you are onto something. And once you have visualized the data, consider ways to manually mark the desired result on the visualization; this may well be good enough for you. That beats praying for k-means to yield a somewhat meaningful result, because on this kind of mixed-type data it usually does not work very well.
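As a concrete starting point for that visualization, here is a sketch with assumed names (logs stands for the data read from the csv, with the Day, Time and Arrivals columns described in the question):
# mean arrivals per (day of week, time of day) cell
avg <- aggregate(Arrivals ~ Day + Time, data = logs, FUN = mean)
wide <- xtabs(Arrivals ~ Day + Time, data = avg)
# heatmap of the day x time grid; bright cells suggest candidate "clusters"
heatmap(as.matrix(wide), Rowv = NA, Colv = NA, scale = "none",
        xlab = "Time of day", ylab = "Day of week")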

Compute average over sliding time interval (7 days ago/later) in R

I've seen a lot of solutions for working with groups of times or dates, like aggregate to sum daily observations into weekly observations, or other solutions to compute a moving average, but I haven't found a way to do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are having a strong effect on our models (I suspect). So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales is significantly dependent on day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every store's missing sales data with the average of that store's sales 7 days prior and 7 days later (ignoring, for the first go-round, the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
# assumes dframe is sorted by date within each store_id and has one row per
# store and date (as produced by expand.grid), so an offset of 7 rows
# equals an offset of 7 days
with(
  dframe,
  ave(sales, store_id, FUN = function(x) {
    naw <- which(is.na(x))
    lo <- naw - 7
    lo[lo < 1] <- NA  # guard against indices before the start of the series
    # mean of the sales 7 days later and 7 days earlier; the result stays NA
    # when a neighbour is itself missing or falls outside the series
    x[naw] <- rowMeans(cbind(x[naw + 7], x[lo]))
    x
  })
)
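A small usage note (not part of the original answer): since ave() returns the full vector in the original row order, the result can be assigned straight back, e.g. dframe$sales <- with(dframe, ave(...)) with the function above.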
