I have a data frame (71568x4) consist of several variables which are observed every hour (total are 24hours in a day) and contain many NAs.
I want to find the maximum value in every 24hours (in other word is a daily maxima) for each variable. If 12 or more hourly observation are missing during this 24hours period on any day, the data for that day is considered missing and hence reported as NA. Can anyone help me to do this in R?
Here is a sample example:
tDate <- rep(c(19980101,19980102,19980103), each = 24)
tTime <- rep(seq(1:24), 3)
x1 <- c(c(1:4),rep(NA,7),c(2:10),6,2,9,1,rep(NA,4),c(4:23),c(2:8),
rep(NA,7),c(3:5),rep(NA,7))
x2 <- c(rep(NA,3),c(11:15),NA,c(3:15),rep(NA,10),c(7:10),NA,c(2:4),NA,3,
rep(NA,5),c(6:9),NA,c(8:20),rep(NA,5),5,1)
datmat <- cbind(tDate,tTime,x1,x2)
The output will be like this
> matrix(c(10,23,NA,15,NA,20), byrow = FALSE, ncol = 2)
Many thanks in advance.
I'd define a custom function to take the max that you want:
my.max <- function(vec) {
if(length(vec[is.na(vec)])>=12) {
return(NA)
} else {
return(max(vec, na.rm=T))
}
}
Then use plyr and specifically ddply:
ddply(as.data.frame(datmat), .(tDate), summarise, x1=my.max(x1), x2=my.max(x2))
Related
I have some lab data and I am looking to calculate the difference between sample measurements over a moving time frame/window e.g 2 minutes (as apposed to 0-2, 2-4, 4-6 minute, static windows)
The problem is that although the data is sampled every second there are some missed samples (e.g. 1,2,4,6,7) so I cannot use a fixed lag function especially for larger time windows.
Here is the most promising I have tried. I have tried to calculate the difference in the row positions that will then use that to determine the lag value.
library(tidyverse)
df <- data.frame(sample_group = c(rep("a", 25), rep("b", 25)),t_seconds = c(1:50), measurement = seq(1,100,2))
df <- df[-c(5,10,23,33,44),] #remove samples
t_window = 5
df_diff <- df %>%
group_by(sample_group) %>%
arrange(t_seconds) %>%
mutate(lag_row = min(which(t_seconds >= t_seconds + t_window))- min(which(t_seconds == t_seconds)), #attempt to identify the lag value for each element
Meas_diff = measurement - lag(measurement, lag_row))
In this example (lag_row) I am trying to call an element from a vector and the vector itself, which obviously does not work! to make it clearer, I have added '_v' to identify what I wanted as a vector and '_e' as an element of that vector min(which(t_seconds_v >= t_seconds_e + t_window))- min(which(t_seconds_v == t_seconds_e))
I have tried to stay away from using loops but I have failed to solve the problem.
I would appreciate if anyone has any better ideas?
Your first step should be inserting missing observations into your time series. Then you could fill the missing values using a Last-Observation-Carried-Backwards operation. This provides you with a complete regular time series.
Your desired output is very unclear, so the next step after that in the following example is just a guess. Adjust as needed.
#complete time series (using a data.table join):
library(data.table)
setDT(df)
df_fill <- df[, .SD[data.table(t_seconds = min(t_seconds):max(t_seconds)),
on = "t_seconds"],
by = sample_group]
df_fill[, filled := is.na(measurement)]
#last observation carried backwards
library(zoo)
df_fill[, measurement := na.locf(measurement, fromLast = TRUE), by = sample_group]
#differences
df_fill[, diff_value := shift(measurement, -t_window) - measurement, by = sample_group]
I frequently use to.daily to convert 1 min OHLC data to a daily format but am trying to find a way to do the same with overnight data. I was hoping to see the option to specify what time a "day" starts and ends but didn't see that.
Overnight session being 18:00 to 09:30.
Does anyone have a simple way to do this?
You could use time-of-day subsetting with which.i = TRUE to find all of the observations you don't want. Then subset the original data with the negative of the result, so all the non-overnight observations will be dropped.
# assume data are in a xts object named 'x'
DayObs <- x["T09:30/T18:30", which.i = TRUE]
Overnight <- x[-DayObs,]
You might need to change the start and end times in the time-of-day subset call.
If you already have your data subset so that it only includes the overnight session, you can aggregate to "daily" using period.apply() and custom endpoints. Assuming your data are in an object named x:
ep <- c(0, which(diff(.indexhour(x) > 9 & .indexmin(x) > 30) == 1))
makeOHLC <- function(x) {
op <- as.numeric(first(x))
cl <- as.numeric(last(x))
c(Open = op, High = max(x), Low = min(x), Close = cl)
}
period.apply(x, ep, makeOHLC)
I'm just starting to learn R for forecasting and analysis purposes, and I've decided to try and create a full package for the forecasting model I'm using (Additive Pickup). I work for a hotel, and one of the things I do on a regular basis is forecast our demand, so this will certainly make this part of my job faster and easier!
I've already created a few functions that will get me a data frame of my pickup numbers, and now I'm working on a function to average a user defined number of columns in that new data frame. I've included code to create some sample data, and the code I'm working on below.
Sample Data:
test = data.frame(replicate(10, sample(0:2, 32, rep = TRUE)))
Broken Code:
averagePickup = function(data, day, periods) {
# data will be your Pickup Data
# day is the day you're forecasting for (think row number)
# periods is the period or range of periods that you need to average (a column or range of columns).
pStart = ncol(data)
pEnd = ncol(data) - periods
row = (day-1)
new_frame = as.data.frame(matrix(nrow = 1, ncol = periods))
for(i in pStart:pEnd) {
new_frame[1,i] = mean(data[1:row , i])
}
return(sum(new_frame[1,1:i]))
}
The goal of this is to iterate backwards from the last column in the data to a user defined period. For example, setting "periods" to 1 should return the sum of the average of the last column only. Setting it to 2 would yield the sum of the averages of the last column and second to last column.
However, when I try to run a test of this I get an error that reads
Error in [<-.data.frame(tmp`, 1, i, value = 0.9) : new columns
would leave holes after existing columns
Any advice you guys could lend would be so appreciated. Also, let me know if I made absolutely zero sense, and apologies for the essay on this question... Note that this has to iterate backwards because of the way the input data is formatted.
I think this is what you want:
averagePickup = function(data, day, periods) {
# data will be your Pickup Data
# day is the day you're forecasting for (think row number)
# periods is the period or range of periods that you need to average (a column or range of columns).
pStart = ncol(data)
pEnd = ncol(data) - (periods-1)
row = (day-1)
new_frame <- as.data.frame(matrix(nrow = 1, ncol = periods))
for(i in pStart:pEnd) {
new_frame[1,1+abs(ncol(data)-i)] <- mean(data[1:row , i])
}
return(sum(new_frame[1,1:ncol(new_frame)]))
}
averagePickup(test,1,5)
[1] 7
I believe this does what you're looking for:
colMeans will return the average for each column
colMeans(test)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1.15625 1.00000 0.90625 1.03125 1.15625 1.09375 0.81250 0.93750 1.15625 0.84375
Now, instead of every column, you only want the last x columns. dim will give you the dimensions of your matrix/dataframe, and the second value is the number of columns.
dim(test)[2]
You can now subset your dataframe dynamically
test[, (dim(test)[2] - x):dim(test)[2]]
Finally, plug the subsetted dataframe into the colMeans function, and wrap a sum around it.
sum(colMeans(test[, (dim(test)[2] - x):dim(test)[2]]))
I am trying to create two vectors of the 20th and 80th percentiles of monthly return data for companies from 1927 to 2013. The issue I have encountered is that in my nested four loop I don't know how to reference both the month and the year (i.e. the returns across all companies in April 1945). Right now the code looks like this:
qunatile<-function(r){
vec20<-c(rep(0,1038))
vec80<-c(rep(0,1038))
for(i in 1927:2013){
for(j in 1:12){
vec20[j+12(i-1927)]<-quantile(r$(i, j),20)
vec80[j+12(i-1927)]<-quantile(r$(i, j),80)
}
}
data1decilest<-rbind(ps1NYSE,vec20,vec80)
}
But I know that that r$(i, j) notation is not correct. I was wondering if anyone knew how to do what I am attempting with that clearly incorrect code (i.e. reference all returns from a given month in a given year.
Thank you!
One option that would eliminate nesting loops is to create a column in your dataframe that contains a month/year combo (e.g. "Jan1955", "Apr1999", etc.) and then split your dataframe on that variable, and apply quantile functions. It's hard to say if this is solving your problem since there is not a reproducible example. I assume here your data is called df and contains a date and a value column.
library(lubridate)
library(plyr)
df$newtime <- paste0(month(df$date, label = T, abbr = T), year(df$date))
q20 <- function(df){ quantile(df$value, 20) }
q80 <- function(df){ quantile(df$value, 80) }
vec20 <- ddply(df, .(newtime), FUN=q20)
vec80 <- ddply(df, .(newtime), FUN=q80
Need help to speed up this code!
Goal is to create a dataframe where the TPS (transaction per second) of the first DF: TPS_Jan7_11h_13h_CheckIMEI will be accumulated from record 1 to 30, then reset to 0 and do that again.
This is what it looks like in graph form:
https://docs.google.com/spreadsheets/d/1-286za99C5gdHLDErR9B4ZazVrZFFINGaH3xzVMghFk/edit?usp=sharing
My dataset has more than 6millions rows...
I start creating a sequence where I need to reset to 0 my cumulative variable. Then I go through the full dataset and just add on top of the previous value.
I have been running this for a few hours on a quad code x64 8gig machine and still running... so... crazy slow!
Any ideas how to speed this up? Subsets or some magic with Tables?
Here's the code:
# Create a sequence of when to reset the cumulative TPS
TPS_Jan7_11h_13h_CheckIMEI_seq30 <- seq(from = 1,nrow(TPS_Jan7_11h_13h_CheckIMEI),by = 30)
# Initialize Dataframe
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30 <- data.frame(matrix(ncol = 3, nrow = nrow(Jan7_11h_13h_CheckIMEI)))
colnames(TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30) <- c("CumulTPS","100%","130%")
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30[2] = 1000*30
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30[3] = (1000*30)*1.3
CumulVal = 0
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$CumulTPS[1] = TPS_Jan7_11h_13h_CheckIMEI$TPS[1]
for(i in 2:nrow(Jan7_11h_13h_CheckIMEI)) {
CumulVal = CumulVal + TPS_Jan7_11h_13h_CheckIMEI$TPS[i-1]
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$CumulTPS[i] = CumulVal
# print(CumulVal)
if (i %in% TPS_Jan7_11h_13h_CheckIMEI_seq30) CumulVal = 0
}
The TPS DF is simply a list of TPS on the TPS column and timestamp on first column.
Goal is to recreate what I put in the spreadsheet example, but on millions of rows!
Thanks,
Simon
Use dplyr to group your data into groups of 30 records, then compute the cumulative sum for each value in each group.
Here's some code; note that it needs some refinement to include all values - take a look at the cut documentation for help.:
library(dplyr)
# Create a sequence of when to reset the cumulative TPS
TPS_Jan7_11h_13h_CheckIMEI_seq30 <- seq(from = 1,nrow(TPS_Jan7_11h_13h_CheckIMEI),by = 30)
#use cut() to add a factor column to the data frame with a different level for each group of 30
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$numgroup = cut(as.numeric(row.names(TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30)), TPS_Jan7_11h_13h_CheckIMEI_seq30)
#aggregate by the new column and get the cumulative sum at each line, within each group
newdf = TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30 %>% group_by(numgroup) %>% mutate(cumulsum = cumsum(TPS))