Create moving-periods in a dataframe and calculate things (R studio) - r

I have a dataframe with Precipitation data for every day from January 1961 to December 2017 that looks like this:
DF=data.frame(Years,Month,Day,Precipitation Value)
I want to create 30-day periods starting on 1 January 1961, so the first period runs from 1 January to 30 January 1961, and I want R to count the number of days without rain (Precipitation Value = 0) in that period. Then I want to do the same starting from the next day, 2 January, so the period is 2 January to 31 January, and so on. After that, I need R to create a data frame with all the results for the year 1961: a data frame with only one column, whose values are the number of days without rain in each period.
Then I need to do the same thing for all the years, which means I will end up with 57 data frames (one for each year from 1961 to 2017). After that I could build a matrix from them (putting each data frame in as a row).
The thing is I DO NOT KNOW how to start. I have no idea how to write the loop. I know it should be really easy, but I am having trouble doing it. In particular, I do not know how to tell R to stop at each new year, start over, and make a NEW data frame/vector of values.

Please provide a reproducible subset of your data so others can help you more effectively. While I cannot teach you how to write a loop from scratch, here is some code that I think will help. It calculates the moving 30-day average of precipitation using a simple for loop. You can then use dplyr to filter these moving averages by year and create one data frame per year. Note that I'm not counting the number of days without precipitation here, but you can easily modify the loop to do that.
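df <- data.frame(year = rep(1967:2002, each = 12*30),
                 month = rep(rep(1:12, each = 30), 36),
                 day = rep(seq(1, 30, by = 1), 432),
                 # one value per row (12*30*36 rows); 0 is included so dry days can occur
                 precipitation = sample(0:2000, 12*30*36, replace = TRUE))
head(df)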
# create a column that goes from 1 to however long your dataframe is
df$marker <- 1:nrow(df)
# Now we create a simple loop to calculate the mean precipitation for
# every 30-day window. You can modify this to count the number of days with
# 0 precipitation.
# The new column movingprecip holds the mean precipitation over the 30 rows
# ending at the current row. So if you're on row 55, it gives you the mean
# precipitation from row 26 to row 55.
df$movingprecip <- NA
for (i in 1:nrow(df)) {
  start <- i - 29  # the window covers 30 rows: i-29, ..., i
  if (start < 1) {
    # Not 30 days into the data yet, so the window mean cannot be
    # calculated. This happens at the beginning of the dataset;
    # these rows keep NA.
    next
  }
  # mean of the past 30 days of precipitation (including row i)
  df$movingprecip[i] <- mean(df$precipitation[start:i])
}
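To count dry days rather than average the precipitation, as the question asks, the same idea can be sketched with a cumulative sum instead of an explicit loop. This is a minimal sketch under assumptions: the data frame `DF`, its column names, the simulated values, and the helper `dry_days_30` are all hypothetical. It also assumes each window must lie entirely within one calendar year, so a year with 365 days yields 336 windows.

```r
# Hypothetical example data: one value per day for two years
set.seed(1)
dates <- seq(as.Date("1961-01-01"), as.Date("1962-12-31"), by = "day")
DF <- data.frame(Date = dates,
                 Year = as.integer(format(dates, "%Y")),
                 Precipitation = sample(c(0, 0, 1.5, 4), length(dates),
                                        replace = TRUE))

# Number of zero-precipitation days in every window of `width` days,
# one window starting at each day that still has `width` days after it
dry_days_30 <- function(precip, width = 30) {
  n <- length(precip)
  if (n < width) return(numeric(0))
  cs <- c(0, cumsum(precip == 0))  # cs[k + 1] = dry days among the first k days
  starts <- 1:(n - width + 1)
  cs[starts + width] - cs[starts]
}

# One vector of counts per year (windows never cross a year boundary)
by_year <- lapply(split(DF$Precipitation, DF$Year), dry_days_30)
length(by_year[["1961"]])
```

Each element of `by_year` is the single column the question describes for one year. If every year has the same number of days, `do.call(rbind, by_year)` would stack them into a matrix with one row per year; leap years give one extra window, so they need separate handling.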

Related

Define different timeseries for different columns

I have a dataframe where some of the columns are starting later than the other. Please find a reproducible example.
set.seed(354)
df <- data.frame(Product_Id = rep(1:100, each = 50),
Date = seq(from = as.Date("2014/1/1"),
to = as.Date("2018/2/1"),
by = "month"),
Sales = rnorm(100, mean = 50, sd= 20))
df <- df[-c(251:256, 301:312, 2551:2562, 2651:2662, 2751:2762), ]
library(zoo)
z <- read.zoo(df, index = "Date", split = "Product_Id", FUN = as.yearmon)
tt <- as.ts(z)
Now for this dataframe for the columns 6,7,52,54 and 56 I want to define them as timeseries starting from a different date as compared to the rest of the dataframe. Supposedly the data begins from Jan 2000, column 6 will begin from July 2000, column 7 from Jan 2001 and so on. How should I proceed to do this?
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I treat each column as a separate dataframe and do the forecasting? Or can I convert each column to a different timeseries object that starts from the first non-NA value?
Now for this dataframe for the columns 6,7,52,54 and 56 I want to define them as timeseries starting from a different date as compared to the rest of the dataframe. Supposedly the data begins from Jan 2000, column 6 will begin from July 2000, column 7 from Jan 2001 and so on. How should I proceed to do this?
There is, AFAIK, no way to do this in R in a time series matrix. And if each column started at a different date, then (since each column has the same number of entries) each column would also need to end at a different date. Is this really what you need? A collection of time series that all happen to be of the same length (so they can fit into a matrix), but that start and end with offsets? I struggle to see where something like this would be useful, outside a kind of forecasting competition.
If you really need this, then I would recommend you put your time series into a list structure. Then each one can start and end at any date, and they can be the same or different lengths. Take inspiration from Mcomp::M3.
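A minimal sketch of the list approach in base R follows. The small matrix `tt` and the helper `trim_leading_na` are made up for illustration; the helper assumes each column has at least one non-NA value.

```r
# Hypothetical monthly series; column b has two leading NAs
tt <- ts(cbind(a = c(1, 2, 3, 4, 5, 6),
               b = c(NA, NA, 3, 4, 5, 6)),
         start = c(2000, 1), frequency = 12)

# Drop leading NAs so the series starts at its first observed value
trim_leading_na <- function(x) {
  first <- which(!is.na(x))[1]
  window(x, start = time(x)[first])
}

# A list can hold series with different start dates and lengths
ts_list <- lapply(seq_len(ncol(tt)), function(j) trim_leading_na(tt[, j]))
names(ts_list) <- colnames(tt)

start(ts_list$a)
start(ts_list$b)
```

Each list element keeps its own start date and frequency, so a forecasting function can be applied to the elements one at a time with `lapply`.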
Later, I want to perform a forecast on this dataset. Any inputs on this? Should I treat each column as a separate dataframe and do the forecasting? Or can I convert each column to a different timeseries object that starts from the first non-NA value?
Since your tt is already a time series object, the simplest way would be simply to iterate over its columns:
library(forecast)
fcst <- matrix(nrow=10,ncol=ncol(tt))
# store each 10-step forecast in its own column instead of overwriting fcst
for ( ii in 1:ncol(tt) ) fcst[,ii] <- forecast(ets(tt[,ii]),10)$mean
Note that most modeling functions in forecast will throw a warning and do something reasonable on encountering NA values. Here, e.g.:
1: In ets(tt[, ii]) :
Missing values encountered. Using longest contiguous portion of time series
Of course, you could do something yourself inside the loop, e.g., search for the last NA and start the time series for modeling right after that (but make sure you fail gracefully if the last entry is NA).

How to match dates in 2 data frames in R, then sum specific range of values up to that date?

I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain <- rain[order(rain$date), ]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
match(nitrate$date[i] %in% rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[j-6:j-1]
And prints the sum in a new column in nitrate
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there!
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$ndate)) {
  day = nitrate$ndate[i]
  # days day-5 .. day-1: the five days before the sample, excluding the sample day
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
}
Step by step explanation:
Initialize empty result column:
nitrate$prev_five_rainfall = NA
For each line in the nitrate df: (i = 1,2,3,4,5)
for (i in 1:length(nitrate$ndate)) {
Grab the day we want the final result for:
day = nitrate$ndate[i]
Take the rainfall sum and put it in the results column:
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will break if nitrate's ndate < 6
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
@nelsonauner's answer does all the heavy lifting. One thing to note, though: in my actual data the dates are not numeric as in the example above. They are listed as MM/DD/YYYY and converted with as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall and I suspected it was a problem with the dates.
So I changed my dates in both data sets to numerical using the difference in number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$diffdays)) {
day = nitrate$diffdays[i]
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)]) # 5 days
}
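An alternative that avoids converting dates to day counts is to compare the Date columns directly, so the sum does not depend on row positions in the rain data frame. The following is only a sketch: the calendar dates are invented to mirror the numeric example in the question, and both data frames are assumed to have a `date` column of class Date.

```r
# Hypothetical data mirroring the question, with real Date values
rain <- data.frame(date = as.Date("2020-01-01") + 0:14,
                   mm = c(0, 0, 0, 0, 5,  0, 0, 2, 0, 0,  10, 0, 0, 0, 0))
nitrate <- data.frame(date = as.Date("2020-01-01") + c(5, 7, 10, 12, 13),
                      nconc = c(15, 12, 14, 20, 8.5))

# For each nitrate sample, sum the rain that fell on the 5 days before it
nitrate$prev_five_rainfall <- vapply(seq_len(nrow(nitrate)), function(i) {
  d <- nitrate$date[i]
  in_window <- rain$date >= d - 5 & rain$date <= d - 1
  sum(rain$mm[in_window])
}, numeric(1))

nitrate
```

Because the window is selected by date, a missing rain day simply contributes nothing to the sum instead of shifting the window, and samples taken in the first five days do not produce invalid indices.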

How to use scale in R for a precise lookback period?

I would like to scale & center some data, I know how to scale it with
(scale(data.test[,1],center=TRUE,scale=TRUE))
I have 365 observations (one year), and would like to scale & center my data for a lookback period of 20 days.
For example I would like to do that:
"Normalized for a 20day lookback period" means that to scale my first value 01/01/2014 (dd/mm/yy) I have to scale it only with the 20 days before. So with values from the 11/12/13 to 31/12/13
And for the 02/01/14 scale it from the 12/12/13 to the 01/01/14 etc
Normalize the data would be
= (the data - the mean of all the data) / standard deviation of all the data (see my code)
But as I want "20 day lookback period" means that I have to only look at the 20 last values it would be
= (the data - the mean of the 20 previous data) / standard deviation of the 20 previous data
I thought to make a loop maybe? As I am very new to R I don't know how to write a loop in R or even if there is a better way to do what I want...
If you could help me with this.
You want a 20-day lookback:
lookback <- 20
data.scale <- c() # vector for the scaled data
for (i in lookback:nrow(data)) {
  # parentheses matter: i-(lookback-1):i would be parsed as i - ((lookback-1):i)
  m <- mean(data[(i-(lookback-1)):i, 1], na.rm=TRUE)
  # rescale sd() from the sample (n-1) to the population (n) standard deviation
  s <- sd(data[(i-(lookback-1)):i, 1], na.rm=TRUE) * sqrt((lookback-1)/lookback)
  data.scale <- c(data.scale, (data[i,1]-m)/s)
}
Row 20 is normalized with the data from day 1 to day 20, row 21 with the data from day 2 to day 21, and so on.
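The question defines the lookback as the 20 values before the current one, excluding the current value itself. A sketch of that variant, on a made-up numeric vector `x` and using the default sample standard deviation from `sd()`, might look like this:

```r
# Hypothetical data: 60 daily observations
set.seed(42)
x <- rnorm(60)
lookback <- 20

# Preallocate; the first `lookback` values have no full history and stay NA
z <- rep(NA_real_, length(x))
for (i in (lookback + 1):length(x)) {
  prev <- x[(i - lookback):(i - 1)]  # the 20 values before position i
  z[i] <- (x[i] - mean(prev)) / sd(prev)
}

head(z, 25)
```

Preallocating `z` keeps every result aligned with its input position, which growing a vector with `c()` inside the loop does not.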

Compute average over sliding time interval (7 days ago/later) in R

I've seen a lot of solutions to working with groups of times or date, like aggregate to sum daily observations into weekly observations, or other solutions to compute a moving average, but I haven't found a way do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are having a strong effect on our models (I suspect). So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales is significantly dependent on day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every stores' sales data with the average of the store's sales 7 days prior and 7 days later (ignoring for the first go around the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
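with(
  dframe,
  # assumes rows are sorted by date within each store, one row per day
  ave(sales, store_id, FUN=function(x) {
    naw <- which(is.na(x))
    # guard the first 7 days: a zero or negative index would break the lookup
    before <- ifelse(naw > 7, x[pmax(naw-7, 1)], NA)
    after <- x[naw+7]  # NA when the index runs past the end
    x[naw] <- rowMeans(cbind(before, after))
    x
  })
)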
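A different base R route, sketched here as an alternative and not as the method from the answer, is to look up the rows 7 days before and after by key with `match()`. This makes no assumption about row order and returns NA at the edges of the date range. The example data and the column types (`date` of class Date) are assumptions.

```r
# Hypothetical data: 2 stores, 21 consecutive days each
dframe <- expand.grid(date = as.Date("2024-01-01") + 0:20, store_id = 1:2)
dframe$sales <- as.numeric(seq_len(nrow(dframe)))
dframe$sales[c(10, 30)] <- NA  # knock out one day per store

# Build a store/date key and look up the same store 7 days earlier and later
key  <- paste(dframe$store_id, dframe$date)
prev <- dframe$sales[match(paste(dframe$store_id, dframe$date - 7), key)]
nxt  <- dframe$sales[match(paste(dframe$store_id, dframe$date + 7), key)]

# Replace only the missing values with the average of the two neighbours
est  <- rowMeans(cbind(prev, nxt))
miss <- is.na(dframe$sales)
dframe$sales[miss] <- est[miss]
```

If either neighbour is missing or falls outside the data, the estimate stays NA, which matches the question's plan to ignore that case on the first pass.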

time series with 10 min frequency in R

My data is the memory consumption of an application at 10-minute intervals over the last 26 days. My start date is Oct 6th 2013 and my end date is November 2nd 2013. I've read the data in to a time frame and cleaned it up. Now I am trying to create a time series, something along the lines of my_ts<-ts(mydata[3],start=c(2013,10),frequency=10)
I'm sure this is not correct as far as the frequency goes. Can someone point me in the right direction so I can plot the time series?
In R, frequency actually means the period of the seasonality, i.e., the number of observations per season. In your case, the "season" is presumably one day. So you want
ts(mydata[3],start=c(2013,10),frequency=24*60/10)
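As a small illustration of what that frequency means, the sketch below builds one week of simulated 10-minute readings. The values, the start time, and the choice of `start = c(1, 1)` (day 1, first observation of the day) are assumptions for the example, not part of the original data.

```r
# Hypothetical data: 7 days of readings, one every 10 minutes
obs_per_day <- 24 * 60 / 10  # 144 observations per daily "season"
n <- 7 * obs_per_day
stamps <- seq(as.POSIXct("2013-10-06 00:00", tz = "UTC"),
              by = "10 min", length.out = n)

set.seed(7)
mem <- 500 + 50 * sin(2 * pi * seq_len(n) / obs_per_day) + rnorm(n, sd = 5)

# The time axis counts days: 1.0 is the start of day 1, 2.0 the start of day 2
my_ts <- ts(mem, start = c(1, 1), frequency = obs_per_day)

frequency(my_ts)
range(stamps)
```

`plot(my_ts)` then draws the series with the x axis in days. To label the axis with real timestamps instead, `plot(stamps, mem, type = "l")` uses the POSIXct vector directly.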
