How to get these moving average results - r

I am trying to compute a moving average in the format shown in the desired column below.
NA should stay NA; otherwise each value should be the average of three periods. For the first and second periods, where earlier values are missing, the missing values should be treated as an extension of the existing value.
I have tried the rollmean and rollapply functions with varying inputs, but I am not getting the results I want.
library(data.table)
library(zoo)
tempo <- data.table(original = c(NA,NA,NA,10,0,0,0,10,10,10,0,NA,NA),
                    desired = c(NA,NA,NA,10,5,3.3,0,3.3,6.6,10,6.6,NA,NA))
# Attempts so far; neither reproduces the desired column
tempo[, toto := rollmean(original, 3, align="left", fill="extend")]
tempo[, toto1 := rollapply(original, 3, mean, align="left", na.pad=FALSE)]
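The desired column appears to match a trailing (right-aligned) three-period mean that ignores missing values inside the window and stays NA wherever original is NA. Here is a minimal base R sketch under that assumption; trailing_mean is a hypothetical helper name, not a zoo function. Note that the desired column seems to truncate 6.67 to 6.6, whereas round() gives 6.7.

```r
original <- c(NA, NA, NA, 10, 0, 0, 0, 10, 10, 10, 0, NA, NA)

# Hypothetical helper: mean of the current value and up to (width - 1)
# preceding values, skipping NAs; NA is kept wherever x itself is NA.
trailing_mean <- function(x, width = 3) {
  vapply(seq_along(x), function(i) {
    if (is.na(x[i])) return(NA_real_)
    window <- x[max(1, i - width + 1):i]
    mean(window, na.rm = TRUE)
  }, numeric(1))
}

result <- trailing_mean(original)
print(round(result, 1))
# values: NA NA NA 10 5 3.3 0 3.3 6.7 10 6.7 NA NA
```

With data.table, the same helper could be applied as tempo[, toto := trailing_mean(original)].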

Related

Replacing missing values in R with the one value that follows (not the mean)

I'm trying to replace missing values in R with the value that follows. I have annual income data by country, and for the missing income value for 2001 for country A I want to pull in the next value. (This is for time series analysis with multiple countries and separate columns for different variables; income is just one of them.)
I wrote the code below to replace the missing values with the mean, but statistically I think it makes more sense to replace each missing value with the value right below it (the next year's). The numbers differ greatly between countries, and the mean would be taken over all years for all countries.
Social_data_R<-within(Social_data_R,incomeNAavg[is.na(income)]<-mean(income,na.rm=TRUE))
I tried replacing the mean part of the code above with income[i+1], but it didn't recognize 'i'. (I loaded the data from Excel, so I didn't create the data frame manually.)
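One way to pull in the next available value is to fill each NA from the row after it, separately within each country. Here is a minimal base R sketch; the column names country and year, the toy values, the helper fill_from_next and the new column incomeNAnext are all assumptions made for illustration.

```r
# Hypothetical toy data standing in for the Excel import
Social_data_R <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(2000, 2001, 2002, 2000, 2001, 2002),
  income  = c(100, NA, 120, NA, 50, NA)
)

# Hypothetical helper: walk backwards so each NA takes the nearest later value.
# A trailing NA stays NA because nothing follows it.
fill_from_next <- function(x) {
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# Sort so that "next" means the following year within the same country
Social_data_R <- Social_data_R[order(Social_data_R$country, Social_data_R$year), ]

# ave() applies the helper within each country and keeps the row order
Social_data_R$incomeNAnext <- ave(Social_data_R$income, Social_data_R$country,
                                  FUN = fill_from_next)
```

Grouping by country keeps a missing final year for one country from being filled with the first year of the next country.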

Summarise returns different values

I have a dataframe x, and I need to calculate the total number of steps (the first column) by day or by 5-minute interval.
This code for dates works fine:
b<-summarise(group_by(x,date),h = sum(steps))
But when I change date to interval,
b<-summarise(group_by(x,interval),h = sum(steps))
it returns only NA values
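One possible cause, assuming steps contains missing values: sum() returns NA for any group that includes an NA. If whole days are missing, most dates still sum cleanly, but every interval then contains at least one missing value, so all interval sums come back NA. The sketch below uses hypothetical toy data and base R's tapply to show the effect; with dplyr the analogous change would be sum(steps, na.rm = TRUE) inside summarise.

```r
# Hypothetical toy data: the first day is missing entirely
x <- data.frame(
  steps    = c(NA, NA, 5, 10, 0, 20),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# By date: only the missing day is NA
by_date_raw <- tapply(x$steps, x$date, sum)

# By interval: every interval includes the missing day, so every sum is NA
by_interval_raw <- tapply(x$steps, x$interval, sum)

# Dropping NAs before summing gives usable totals per interval
by_interval <- tapply(x$steps, x$interval, sum, na.rm = TRUE)
```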

How to remove rows in a data set according to if values exceed a given number in a particular column in Rstudio

I am trying to remove some outliers from my data set, investigating each variable one at a time. I have constructed boxplots for the variables, but I don't want to remove all the points classified as outliers, only the most extreme. So I note the value on the boxplot that I don't want a variable to exceed, and then try to remove the rows whose value in that column exceeds the chosen value.
For example,
My data set is called milk and one of the variables is called alpha_s1_casein. I thought the following would remove all rows in the data set where the value for alpha_s1_casein is greater than 29:
milk <- milk[milk$alpha_s1_casein < 29,]
In fact it did. The number of rows in the data frame decreased from 430 to 428. However, it has introduced a lot of NA values in unrelated columns of my data set.
Before I ran the above code, the number of NAs was
sum(is.na(milk))
5909 NA values
But after running the above, the number of NAs returned is
sum(is.na(milk))
75912 NA values.
I don't understand what is going wrong here and why what I'm doing is introducing more NA values than when I started when all I'm trying to do is remove observations if a column value exceeds a certain number.
Can anyone help? I'm desperate
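Without using additional packages, you can remove all rows where alpha_s1_casein is greater than 29 like this:
milk <- milk[-which(milk$alpha_s1_casein > 29),]
Note that if no value exceeds 29, which() returns an empty vector and the negative index then drops every row, so this form is only safe when at least one row matches.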
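The extra NAs most likely come from the logical index itself: wherever alpha_s1_casein is NA, the comparison yields NA, and indexing a data frame with NA returns a row in which every column is NA. Here is a minimal sketch on hypothetical toy data; the column other and the names bad, keep and clean are made up for illustration. It keeps rows with a missing alpha_s1_casein intact and also behaves when nothing exceeds the cutoff.

```r
# Hypothetical toy data with one missing value in the filtered column
milk <- data.frame(
  alpha_s1_casein = c(10, NA, 35, 20),
  other           = c(1, 2, 3, 4)
)

# The original approach: the NA comparison produces an all-NA row
bad <- milk[milk$alpha_s1_casein < 29, ]
sum(is.na(milk))  # 1 NA before filtering
sum(is.na(bad))   # 2 NAs afterwards, although a row was dropped

# Explicit logical index: keep rows that are missing or within the limit
keep  <- is.na(milk$alpha_s1_casein) | milk$alpha_s1_casein <= 29
clean <- milk[keep, ]
```

If rows with a missing alpha_s1_casein should be dropped as well, the is.na() term can be replaced with !is.na(...) combined by &.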

How do I calculate overlapping three-day log returns in the same dataframe in R?

I've just started learning R. As for now, I have prices PRC in a dataframe test together with the date and several other variables.
My goal is to calculate the following within the same dataframe so I can maintain the connection to the date.
1. Overlapping three-day log returns
2. One-day log returns
Through other posts I came up with the following code for the three day lag returns and the one-day lag returns respectively, but I am still unsure on how to incorporate it into my dataframe:
test$logR3 <- diff(log(test$PRC), lag=3)
This code currently doesn't work due to the difference in number of rows. How do I take this into account? Can I somehow put zeros or NAs in order to fill the missing rows?
Thank you in advance.
Maybe something like this, filling the rows that have no earlier price with NA:
n <- nrow(test)
logR1 <- rep(NA_real_, n) # one-day log returns
logR3 <- rep(NA_real_, n) # overlapping three-day log returns
for (i in seq_len(n)) { # loop over every row so the three-day windows overlap
  if (i > 1) logR1[i] <- log(test$PRC[i]) - log(test$PRC[i - 1]) # today vs. yesterday
  if (i > 3) logR3[i] <- log(test$PRC[i]) - log(test$PRC[i - 3]) # today vs. three days ago
}
test$logR1 <- logR1 # add both columns to test
test$logR3 <- logR3
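The row-count mismatch arises because diff() returns lag fewer elements than its input. Here is a minimal vectorised sketch that pads the front with NA so the results line up with the dates. It assumes test is sorted by date and has at least four rows; the toy prices and the column name logR1 are made up for illustration.

```r
# Hypothetical toy prices standing in for the real data
test <- data.frame(PRC = c(100, 101, 103, 102, 105, 107))

log_prc <- log(test$PRC)

# One-day log returns: the first row has no previous price
test$logR1 <- c(NA, diff(log_prc))

# Overlapping three-day log returns: the first three rows have no price three days back
test$logR3 <- c(rep(NA, 3), diff(log_prc, lag = 3))
```

Each non-missing logR3 value equals the sum of the three most recent logR1 values, which is a quick way to sanity-check the result.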

Moving average with dynamic window

I'm trying to add a new column to my data table that contains the average of some of the following rows. How many rows are selected for the average, however, depends on the time stamp of the rows.
Here is some test data:
DT<-data.table(Weekstart=c(1,2,2,3,3,4,5,5,6,6,7,7,8,8,9,9),Art=c("a","b","a","b","a","a","a","b","b","a","b","a","b","a","b","a"),Demand=c(1:16))
I want to add a column with the mean of all demands that occurred in the weeks ("Weekstart") up to three weeks before the respective week (grouped by Art, excluding the actual week).
With rollapply from zoo-library, it works like this:
setorder(DT,-Weekstart)
DT[,RollMean:=rollapply(Demand,width=list(1:3),partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
The problem, however, is that some data is missing. In the example, Art b has no row for week 4, i.e. there is no Demand in week 4. As I want the average of the three prior weeks, not the three prior rows, the average is wrong. Instead, the result for Art b for week 6 should look like this:
DT[Art=="b"&Weekstart==6,RollMean:=6]
(6 instead of 14/3, because only Week 5 and Week 3 count: (8+4)/2)
Here is what I have tried so far:
It would be possible to loop through the minima of the weeks of the following rows in order to create a vector that defines, for each row, how wide the 'width' should be (the new column 'rollwidth'):
i<-3
DT[,rollwidth:=Weekstart-rollapply(Weekstart,width=list(1:3),partial=TRUE,FUN=min,align="left",fill=1),.(Art)]
while (max(DT[,Weekstart-rollapply(Weekstart,width=list(1:i),partial=TRUE,FUN=min,align="left",fill=NA),.(Art)][,V1],na.rm=TRUE)>3) {
i<-i-1
DT[rollwidth>3,rollwidth:=i]
}
But that seems very unprofessional (excuse my poor skills). And, unfortunately, rollapply with width built from rollwidth doesn't work as intended (it produces warnings because 'rollwidth' is taken to be all the rollwidths in the table):
DT[,RollMean2:=rollapply(Demand,width=list(1:rollwidth),partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
What does work is
DT[,RollMean3:=rollapply(Demand,width=rollwidth,partial=TRUE,FUN=mean,align="left",fill=NA),.(Art)]
but then again, the average includes the actual week (not what I want).
Does anybody know how to apply a criterion (i.e. the difference in the weeks shall be <= 3) instead of a number of rows to the argument width?
Any suggestions are appreciated!
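One way to apply a week-based criterion rather than a row count is to select, for each row, the rows of the same Art whose Weekstart falls in the three weeks before it. Here is a minimal base R sketch of that idea on the test data, using a plain data frame instead of a data.table. It compares every row with every other row, so it is meant as an illustration rather than an efficient solution for large tables.

```r
DT <- data.frame(
  Weekstart = c(1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9),
  Art       = c("a", "b", "a", "b", "a", "a", "a", "b", "b", "a", "b", "a", "b", "a", "b", "a"),
  Demand    = 1:16
)

# For each row: mean Demand of the same Art in the three weeks before
# the current week, excluding the current week itself.
DT$RollMean <- vapply(seq_len(nrow(DT)), function(i) {
  in_window <- DT$Art == DT$Art[i] &
    DT$Weekstart <  DT$Weekstart[i] &
    DT$Weekstart >= DT$Weekstart[i] - 3
  if (any(in_window)) mean(DT$Demand[in_window]) else NA_real_
}, numeric(1))

DT[DT$Art == "b" & DT$Weekstart == 6, ]
# RollMean is 6 here: only weeks 3 and 5 exist for Art b, so (4 + 8) / 2
```

Rows with no earlier week in the window, such as the first week of each Art, get NA.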

Resources