How to use scale in R for a precise lookback period?

I would like to scale and center some data. I know how to scale a whole series with:
scale(data.test[,1], center = TRUE, scale = TRUE)
I have 365 observations (one year), and would like to scale & center my data for a lookback period of 20 days.
For example:
"Normalized for a 20-day lookback period" means that to scale my first value, 01/01/2014 (dd/mm/yy), I have to scale it using only the 20 days before it, i.e. the values from 11/12/13 to 31/12/13.
And for 02/01/14, scale it using the values from 12/12/13 to 01/01/14, etc.
Normalizing over the whole series would be:
(the data - the mean of all the data) / the standard deviation of all the data (see my code)
But since I want a 20-day lookback period, meaning I only look at the 20 most recent values, it should be:
(the data - the mean of the 20 previous values) / the standard deviation of the 20 previous values
I thought of making a loop, maybe? But as I am very new to R I don't know how to write a loop, or whether there is a better way to do what I want. Any help would be appreciated.

You want a 20-day lookback:
lookback <- 20
data.scale <- c()  # a vector for the scaled data
for (i in lookback:nrow(data)) {
  mean <- mean(data[(i - (lookback - 1)):i, 1], na.rm = TRUE)
  sd <- sd(data[(i - (lookback - 1)):i, 1], na.rm = TRUE) * sqrt((lookback - 1) / lookback)
  data.scale <- c(data.scale, (data[i, 1] - mean) / sd)
}
Note the parentheses around (i - (lookback - 1)): the : operator binds more tightly than -, so i-(lookback-1):i would be read as i - ((lookback-1):i), which is not the intended window. The sqrt((lookback-1)/lookback) factor converts R's sample standard deviation into the population standard deviation.
For row 20 you normalize with the data from day 1 to day 20, for row 21 with the data from day 2 to day 21, and so on...
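If you would rather avoid the explicit loop, the same rolling z-score can be sketched with zoo's rollapplyr (this assumes the zoo package is available, and uses a hypothetical vector x in place of data[,1]; as in the loop above, each window ends at the current day):

```r
library(zoo)  # provides rollapplyr for right-aligned rolling windows

set.seed(1)
x <- rnorm(365)   # hypothetical daily series standing in for data[, 1]
lookback <- 20

# rolling mean and sd over the 20 days ending at each observation
roll_mean <- rollapplyr(x, lookback, mean, fill = NA)
roll_sd   <- rollapplyr(x, lookback, sd,   fill = NA)

x_scaled <- (x - roll_mean) / roll_sd   # first 19 entries are NA
```

This uses R's sample standard deviation; multiply roll_sd by sqrt((lookback - 1) / lookback) if you want the population version from the loop above.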

Related

Average and conversion order of operation

I have some data that I need to analyse. The data consists of a series of floating-point numbers representing durations in milliseconds. From each duration I need to calculate the frequency of the event (occurrences per second), which I compute as:
occurrences per second = (1000 / time in milliseconds)
Now I need to find the average occurrence rate of that event per second, but I am not sure which order of operations is accurate. Should I average the durations first and then calculate the average occurrence rate as
average occurrence = (1000 / average time)
or should I calculate the frequency for each duration and average the results? The two results differ slightly, so I am not sure which one is the correct approach.
Example:
Say we are measuring the frame rate of a device, where each frame takes x milliseconds to draw. From that we can say
frames per second = (1000/x)
Now if my data has 1000 durations, either I can average them to get the average duration of a frame, and then
frames per second = (1000/average duration)
or I can calculate 1000 individual frames-per-second values first,
frames per second = (1000/duration)
and average those 1000 fps values. Which one is correct?
Any suggestions?
You should choose the first method: calculate the average duration of a frame and then let avg_fps = 1000 / avg_duration_in_milliseconds, or, probably easier: avg_fps = number_of_frames / total_duration_in_seconds. These give the same result.
Example:
Say you have 3 frames in one second of durations 200ms, 300ms and 500ms. Since you have 3 frames in 1 second, the avg_fps is 3.
The average duration is 333.33ms, which gives the right result (1000/333.33 = 3). But if you calculate the individual fps of each frame you get 1000/200 = 5, 1000/300 = 3.33 and 1000/500 = 2. The average of 5, 3.33 and 2 is 3.44: the shortest durations produce the highest fps values and skew the result upwards. So choose the first method instead.
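The arithmetic above is easy to check in R with the same three durations:

```r
# three frames of 200 ms, 300 ms and 500 ms, i.e. exactly one second in total
durations <- c(200, 300, 500)

avg_fps_correct <- 1000 / mean(durations)   # frames / total time -> 3
avg_fps_naive   <- mean(1000 / durations)   # mean of per-frame fps -> ~3.44
```

The naive average over-weights the short frames, which is why it comes out high; in general it is the harmonic mean of the per-frame fps values, not their arithmetic mean, that matches the true frame rate.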

Create moving periods in a dataframe and calculate things (RStudio)

I have a dataframe with Precipitation data for every day from January 1961 to December 2017 that looks like this:
DF = data.frame(Years, Month, Day, Precipitation.Value)
I want to create periods of 30 days, starting with the 1st of January 1961, so the first period is 1st January to 30th January 1961, and I want R to calculate the number of days without rain (Precipitation.Value = 0) in that period. Then I want to do the same starting from the next day, 2nd January, so the period will be 2nd January to 31st January, etc. After that, I need R to collect all the results for the year 1961 into a data frame: a single column of values, where each value is the number of days without rain in one period.
Then I need to do the same for all the years, which means I will end up with 56 data frames (one for each year), and after that I could make a matrix out of all of them (putting each data frame in a row).
The thing is, I DO NOT KNOW how to start. I have no idea how to write the loop. I know it should be really easy, but I am having trouble with it. Especially, I do not know how to tell R to stop at each year boundary and start over with a new data frame/vector of values.
Please provide a reproducible subset of your data so others can help you more effectively. While I cannot teach you how to write a loop from scratch, here is some code that I think will help. It simply calculates the moving 30-day average of precipitation using a for loop; you can then use dplyr to filter these moving averages by year and create data frames from them. Note that I'm not counting the number of no-precipitation days here, but you can easily modify the loop to do that if needed.
df <- data.frame(year = rep(1967:2002, each = 12 * 30),
                 month = rep(rep(1:12, each = 30), 36),
                 day = rep(1:30, 432),
                 # one value per row (a sample of only 12*36 values
                 # would be silently recycled across all 12,960 rows)
                 precipitation = sample(1:2000, 36 * 12 * 30, replace = TRUE))
df
#create a column that goes from 1 to however long your dataframe is
df$marker <- 1:nrow(df)
#'Now we create a simple loop to calculate the mean precipitation for
#'every 30-day window. You can modify this to count the number of days with
#'0 precipitation.
#'The new column movingprecip will tell you the mean precipitation for the
#'past 30 days relative to its position. So if you're on row 55, it will give
#'you the mean precipitation from rows 25 to 55
df$movingprecip<-NA
for(i in 1:nrow(df)){
  start = i       #this says we start at i
  end = i + 30    #we end 30 days later than i
  if(end > nrow(df)){
    #here I tell R to print this if there are not enough days left
    #in the dataset to fill a full 30-day window; this happens in the
    #last 30 rows (note also that rows 1 to 30 simply stay NA, since
    #the first window only ends at row 31)
    print("not able to calculate, fewer than 30 days of data remaining")
  }else{
    #Here I calculate the mean of the past 30 days of precip
    df$movingprecip[end] = mean(df[start:end, 4])
  }
}
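Since the original goal was the number of days without rain rather than the mean, the loop body might be adapted like this (a sketch: the toy data and column name are assumptions, and zeros are included in the sample so that dry days actually occur):

```r
set.seed(42)
# toy data: roughly one year of daily precipitation, with plenty of dry days
df <- data.frame(precipitation = sample(c(0, 0, 0, 1, 5, 20), 360, replace = TRUE))

window <- 30
df$dry30 <- NA
for (i in window:nrow(df)) {
  win <- df$precipitation[(i - window + 1):i]   # trailing 30-day window
  df$dry30[i] <- sum(win == 0)                  # rain-free days in the window
}
```

From there, grouping the results by year (e.g. with dplyr) gives the one-column-per-year data frames described in the question.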

Finding peaks in time series with duration condition

I was wondering if anybody could help. If I have a data set containing two columns, date and river flow, how can I obtain the top 100 largest values of river flow, with the condition that there are at least XX days (e.g. 14 days) between each "peak"? (i.e. two values which fall within two weeks of each other would count as only one peak.)
Date          Q
01/01/1990    24
02/01/1990    18
03/01/1990    40
I started by ranking all the values and then picking out each peak, manually checking whether the next peak fell outside the 14-day period, but I was wondering if this could be done with a formula. Thanks.
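One way to sketch the procedure you describe is to sort the flows in descending order and greedily keep a value only if it is at least 14 days away from every peak already kept. The function name below is an assumption, and the Date column is assumed to be of class Date:

```r
pick_peaks <- function(dates, q, n = 100, min_gap_days = 14) {
  ord  <- order(q, decreasing = TRUE)   # candidate rows, largest flow first
  kept <- integer(0)
  for (i in ord) {
    # accept row i only if it is far enough from every accepted peak
    if (all(abs(as.numeric(dates[i] - dates[kept])) >= min_gap_days)) {
      kept <- c(kept, i)
      if (length(kept) == n) break
    }
  }
  sort(kept)   # row indices of the selected peaks
}

flows <- data.frame(Date = as.Date(c("1990-01-01", "1990-01-02", "1990-01-03")),
                    Q    = c(24, 18, 40))
pick_peaks(flows$Date, flows$Q, n = 2)   # only row 3 survives the 14-day gap
```

With fewer than n qualifying peaks the function simply returns as many as it found, as in the three-row example above.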

How to add individual timepoint standard error column to data frame with NaNs (SE of mean of week of timepoints for 24 hours)

I need to create a plot of the body temperatures of mice. I have data points collected every 15 minutes over the course of seven days. I also have the calculated mean temperature at each timepoint for the plot. The next step is calculating the standard error of each of these mean temperatures, taking into account all seven days' worth of temperature readings. This is an image of the extended data I am working from:
https://imgur.com/ukk0iOt
I also have a separate, condensed data frame that is the mean_temp from above averaged over seven days for every timepoint, so only one temperature reading each for 24 hours worth of timepoints. It is 96 rows and only contains columns for time and mean_temp24.
With the following code, I am only able to calculate a single standard error for all the timepoints (I know it's wrong but am having a heck of a time finding a solution). I am also unable to calculate standard error from the condensed 24-hour dataset since the full seven days' worth of temperatures are not present.
Adding column with mean temperatures (7 days) of three mice to data frame 'df'
df=cbind(df,"mean_temp"=rowMeans(df[,3:5],na.rm=TRUE))
Trying to calculate standard deviation for each timepoint, to start with
times = unique(df$time)
Function to achieve individual standard errors per row
for (current_time in times){
df$se=sd(df$mean_temp24, na.rm=T)/sqrt(3-1)
}
Ideally, I will end up with a data frame that is 96 lines (each a 15-minute interval timepoint) for 24 hours of temperature data, where the values are the means of the seven temperatures for each timepoint ("mean_temp" from the image of my data frame). I will also have an additional column for standard error, which takes into account the 7 temperature values used to calculate the mean temperature in the final, 24-hour dataset.
The actual output is a single, identical SE for every timepoint in the full dataset that is not condensed to 24 hours.
Use ddply from the plyr package. The function f is called for every unique combination of dt and time:
library(plyr)
f = function(x) {
n3 = length(which(!is.na(x[,3])))
n4 = length(which(!is.na(x[,4])))
n5 = length(which(!is.na(x[,5])))
data.frame(
mean3 = mean(x[,3], na.rm=TRUE),
mean4 = mean(x[,4], na.rm=TRUE),
mean5 = mean(x[,5], na.rm=TRUE),
se3 = sd(x[,3], na.rm=TRUE)/sqrt(n3),
se4 = sd(x[,4], na.rm=TRUE)/sqrt(n4),
se5 = sd(x[,5], na.rm=TRUE)/sqrt(n5)
)
}
ddply(df, .(dt,time), f)
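If you only need the per-timepoint standard error of the mean (rather than all three columns at once), base R's tapply does the same grouping without plyr. The toy data and the temp column name here are assumptions standing in for the real data frame:

```r
# toy data: two timepoints with three temperature readings each, one NA
df2 <- data.frame(
  time = rep(c("00:00", "00:15"), each = 3),
  temp = c(36.1, 36.4, 36.7, 37.0, NA, 37.4)
)

# SE of the mean per timepoint, dividing by the number of non-NA readings
se_by_time <- tapply(df2$temp, df2$time,
                     function(v) sd(v, na.rm = TRUE) / sqrt(sum(!is.na(v))))
```

Note the sqrt(sum(!is.na(v))) denominator: using a fixed sqrt(3 - 1), as in the question's loop, both miscounts the readings when NAs are present and divides by n - 1 instead of n.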

time series with 10 min frequency in R

My data is the memory consumption of an application at every 10-minute interval for the last 26 days. My start date is Oct 6th 2013 and the end date is November 2nd 2013. I've read the data into a data frame and cleaned it up. Now I am trying to create a time series, something along the lines of
my_ts <- ts(mydata[3], start = c(2013, 10), frequency = 10)
I am sure this is not correct, as the frequency is wrong; can someone point me in the right direction so I can plot the time series?
In R, frequency actually means the period of the seasonality, i.e. the number of observations per season. In your case, the "season" is presumably one day. So you want
ts(mydata[3],start=c(2013,10),frequency=24*60/10)
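As a quick sketch with made-up memory values standing in for mydata[3]: 26 days at 10-minute intervals is 26 * 144 observations, and frequency = 144 makes one day the seasonal period:

```r
n_days  <- 26
per_day <- 24 * 60 / 10   # 144 observations per day

# synthetic memory-consumption series with a daily cycle
mem <- 500 + 50 * sin(2 * pi * seq_len(n_days * per_day) / per_day)

my_ts <- ts(mem, frequency = per_day)
frequency(my_ts)   # 144
```

plot(my_ts) will then show one seasonal cycle per day, and functions like decompose() or stl() will recognize the daily seasonality.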
