I am new to time series and was hoping someone could provide some input/ideas here.
I am trying to find ways to impute missing values.
I was hoping to use a moving average, but most of the packages (smooth, mgcv, etc.) don't seem to take time intervals into consideration.
For example, the dataset might look something like the one below, and I would want the value at 2016-01-10 to have the greatest influence in calculating the missing value:
Date Value Diff_Days
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-01-30 50 16
I have instances where the NA might be the first or the last observation. Sometimes NA values also occur several times in a row, in which case the rolling window would need to expand; this is why I would like to use a moving average.
Is there a package that would take date intervals / separate weights into consideration?
Or please suggest if there is a better way to impute NA values in such cases.
You can use glm or any other model.
Input
con <- textConnection("Date Value Diff_Days
2015-12-14 NA 0
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-02-14 NA 0
2016-02-18 NA 0
2016-02-29 50 16")
df <- read.table(con, header = TRUE)
df$Date <- as.Date(df$Date)
df$Date.numeric <- as.numeric(df$Date)  # days since 1970-01-01, usable as a regressor
fit <- glm(Value ~ Date.numeric, data = df)  # linear trend in time
df.na <- df[is.na(df$Value), ]
predicted <- predict(fit, newdata = df.na)
df$Value[is.na(df$Value)] <- predicted  # fill the gaps with the fitted values
plot(df$Date, df$Value)
points(df.na$Date, predicted, type = "p", col = "red")  # imputed points in red
df$Date.numeric <- NULL
rm(df.na)
print(df)
Output
Date Value Diff_Days
1 2015-12-14 -3.054184 0
2 2016-01-01 10.000000 13
3 2016-01-10 14.000000 4
4 2016-01-14 18.518983 0
5 2016-01-28 30.000000 14
6 2016-02-14 40.092149 0
7 2016-02-18 42.875783 0
8 2016-02-29 50.000000 16
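If you specifically want the interpolation to respect the date gaps, so that the value 4 days away (2016-01-10) outweighs the one 14 days away, one option is zoo::na.approx, which interpolates linearly along the Date index. A minimal sketch on the same input (note that, unlike the glm fit above, it does not extrapolate a trend; rule = 2 just repeats the first/last observed value at the ends):
library(zoo)
# Rebuild the series, with its NAs, from the input above
z <- zoo(c(NA, 10, 14, NA, 30, NA, NA, 50),
         as.Date(c("2015-12-14", "2016-01-01", "2016-01-10", "2016-01-14",
                   "2016-01-28", "2016-02-14", "2016-02-18", "2016-02-29")))
na.approx(z, rule = 2)  # linear in calendar time; closer neighbours get more weight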
I need to create 'n' new variables containing lags 1 to 'n' of the original variable, built on the fly. Something like this:
OrigVar
DatePeriod, value
2/01/2018,6
3/01/2018,4
4/01/2018,0
5/01/2018,2
6/01/2018,4
7/01/2018,1
8/01/2018,6
9/01/2018,2
10/01/2018,7
Lagged 1 variable
2/01/2018,NA
3/01/2018,6
4/01/2018,4
5/01/2018,0
6/01/2018,2
7/01/2018,4
8/01/2018,1
9/01/2018,6
10/01/2018,2
11/01/2018,7
Lagged 2 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,6
5/01/2018,4
6/01/2018,0
7/01/2018,2
8/01/2018,4
9/01/2018,1
10/01/2018,6
11/01/2018,2
12/01/2018,7
Lagged 3 variable
2/01/2018,NA
3/01/2018,NA
4/01/2018,NA
5/01/2018,6
6/01/2018,4
7/01/2018,0
8/01/2018,2
9/01/2018,4
10/01/2018,1
11/01/2018,6
12/01/2018,2
13/01/2018,7
and so on
I tried using the shift function and various other functions. With most of those that worked for me, the lagged variable finished at the last date of the original variable; in other words, the length of the lagged variable was the same as that of the original variable.
What I am looking for is for the new lagged variable to be shifted down by the k-th lag and for the data series to be extended by 'k' elements, including the index.
The reason I need this is to be able to compute the value of the dependent variable, using the regression coefficients and the corresponding lagged variable values, beyond the in-sample period.
y1 <- Lag(ciresL1_usage_1601_1612, shift = 1)
head(y1)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA -5171.051 -6079.887 -3687.227 -3229.453 -2110.368
y2 <- Lag(ciresL1_usage_1601_1612, shift = 2)
head(y2)
2016-01-02 2016-01-03 2016-01-04 2016-01-05 2016-01-06 2016-01-07
NA NA -5171.051 -6079.887 -3687.227 -3229.453
tail(y2)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-2316.039 -2671.185 -4100.793 -2043.020 -1147.798 1111.674
tail(ciresL1_usage_1601_1612)
2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
-4100.793 -2043.020 -1147.798 1111.674 3498.729 2438.739
Is there a way to do this relatively easily? I know I can do it by looping: adding 'k' rows to a new vector and reloading the data into it, appropriately shifting the values, but I don't want to use that method unless I have to. I am quietly confident that there has to be a better way!
By the way, the object is a zoo object with daily dates as the index.
Convert the input zoo object to zooreg and then use lag.zooreg like this:
library(zoo)
# test input
z <- zoo(1:10, as.Date("2008-01-01") + 0:9)
zr <- as.zooreg(z)
lag(zr, -(0:3))
giving:
lag0 lag-1 lag-2 lag-3
2008-01-01 1 NA NA NA
2008-01-02 2 1 NA NA
2008-01-03 3 2 1 NA
2008-01-04 4 3 2 1
2008-01-05 5 4 3 2
2008-01-06 6 5 4 3
2008-01-07 7 6 5 4
2008-01-08 8 7 6 5
2008-01-09 9 8 7 6
2008-01-10 10 9 8 7
2008-01-11 NA 10 9 8
2008-01-12 NA NA 10 9
2008-01-13 NA NA NA 10
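If you need the lags built for an arbitrary n (the "on the fly" part of the question), a small wrapper over the same idea; this is a sketch, and make_lags and the lagN naming scheme are my own choices:
# Sketch: lags 0..n of a zoo series, extending the index as lag.zooreg does
make_lags <- function(z, n) {
  out <- lag(as.zooreg(z), -(0:n))
  colnames(out) <- paste0("lag", 0:n)
  out
}
make_lags(z, 3)  # same values as above, with columns lag0..lag3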
I have a cross section data as following:
transaction_code <- c('A_111','A_222','A_333')
loan_start_date <- c('2016-01-03','2011-01-08','2013-02-13')
loan_maturity_date <- c('2017-01-03','2013-01-08','2015-02-13')
loan_data <- data.frame(cbind(transaction_code,loan_start_date,loan_maturity_date))
Now the dataframe looks like this
>loan_data
transaction_code loan_start_date loan_maturity_date
1 A_111 2016-01-03 2017-01-03
2 A_222 2011-01-08 2013-01-08
3 A_333 2013-02-13 2015-02-13
Now I want to create a monthly time series observing the time to maturity (in months) for each of the three loans over a period of 48 months. How can I achieve that? The final output should look like the following:
>loan_data
transaction_code loan_start_date loan_maturity_date feb13 march13 april13........
1 A_111 2016-01-03 2017-01-03 46 45 44
2 A_222 2011-01-08 2013-01-08 NA NA NA
3 A_333 2013-02-13 2015-02-13 23 22 21
Here the new columns (one per month, 48 in total) represent the time to maturity of each loan as of that month.
Would really appreciate your help. Thanks
Here's an approach using tidyverse packages.
library(tidyverse); library(lubridate)
# Define the months to use in the right-hand columns.
months <- seq.Date(from = as.Date("2013-02-01"), by = "month", length.out = 48)
loan_data2 <- loan_data %>%
# Make a row for each combination of original data and the `months` list
crossing(months) %>%
# Format dates as MonYr and make into an ordered factor
mutate(month_name = format(months, "%b%y") %>% fct_reorder(months)) %>%
# Calculate months remaining -- this task is harder than it sounds! This
# approach isn't perfect, but it's hard to accomplish more simply, since
# months are different lengths.
mutate(months_remaining =
round(interval(months, loan_maturity_date) / ddays(1) / 30.5 - 1),
months_remaining = if_else(months_remaining < 0,
NA_real_, months_remaining)) %>%
# Drop the Date format of months now that calcs done
select(-months) %>%
# Spread into wide format
spread(month_name, months_remaining)
Output
loan_data2[,1:6]
# transaction_code loan_start_date loan_maturity_date Feb13 Mar13 Apr13
# 1 A_111 2016-01-03 2017-01-03 46 45 44
# 2 A_222 2011-01-08 2013-01-08 NA NA NA
# 3 A_333 2013-02-13 2015-02-13 23 22 21
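Side note: if the divide-by-30.5 heuristic bothers you, lubridate can count whole calendar months directly, by integer division of an interval by a one-month period. A sketch (check that the month-end behaviour matches your conventions):
library(lubridate)
# Whole months between two dates, counted on the calendar
interval(as.Date("2013-02-01"), as.Date("2015-02-13")) %/% months(1)
# [1] 24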
So, I have the following problem:
I have a data set, A (data.table object), of the following structure:
date days rate
1996-01-02 9 5.763067
1996-01-02 15 5.745902
1996-01-02 50 5.673317
1996-01-02 78 5.608884
1996-01-02 169 5.473762
1996-01-03 9 5.763067
1996-01-03 14 5.747397
1996-01-03 49 5.672263
1996-01-03 77 5.603705
1996-01-03 168 5.470584
1996-01-04 11 5.729460
1996-01-04 13 5.726104
1996-01-04 48 5.664931
1996-01-04 76 5.601891
1996-01-04 167 5.468961
Note that the days column, and how many rows each date has, may differ from date to date.
My goal is now to (piecewise linearly) interpolate rate along days. I am doing this for each day via
approx(x=A[,days],y=A[,rate],xout=days_vec,rule=2)
where days_vec <- min_days:max_days, i.e. the days range I am interested in (say 1:100).
I have two problems here:
approx only interpolates, i.e. it does not fit a line beyond min(x) and max(x). So if I am interested in days 1:100, I first need to handle the days below 9 by hand, using days 9 and 15 (the first two lines of A), via:
first_days <- 1:(A[1,days]-1) #1:8
rate_vec[first_days] <- A[1,rate] +
(first_days - A[1,days])/(A[2,days]-A[1,days])*(A[2,rate]-A[1,rate])
and then use the approx line above for rate_vec[9:100]. Is there a way of doing this in one step?
Right now, given that I need two steps and that the switch point between the two procedures (here 9) differs among dates, I cannot see a data.table implementation, although one would be vastly preferred (using data.table methods to interpolate/extrapolate and then returning the expanded data.table object). Thus I currently run a for loop over the dates, which is of course much, much slower.
Question: Can the above be implemented more cleanly, and is it doable with data.table methods instead of looping through A?
How about something like this:
# please try to make a fully reproducible example!
library(data.table)
df <- fread(input=
"date days rate
1996-01-02 9 5.763067
1996-01-02 15 5.745902
1996-01-02 50 5.673317
1996-01-02 78 5.608884
1996-01-02 169 5.473762
1996-01-03 9 5.763067
1996-01-03 14 5.747397
1996-01-03 49 5.672263
1996-01-03 77 5.603705
1996-01-03 168 5.470584
1996-01-04 11 5.729460
1996-01-04 13 5.726104
1996-01-04 48 5.664931
1996-01-04 76 5.601891
1996-01-04 167 5.468961")
df[,date := as.Date(date)]
1. Create NA values of rate for days in the range 1:100 that aren't in the dataset.
df <-
merge(df,
expand.grid( days=1L:100L, # whatever range you are interested in
date=df[,sort(unique(date))] ), # dates with at least one observation
all=TRUE # "outer join" on all common columns (date, days)
)
2. For each value of date, use a linear model to predict NA values of rate.
df[, rate := ifelse(is.na(rate),
predict(lm(rate~days,.SD),.SD), # impute NA w/ lm using available data
rate), # if not NA, don't impute
keyby=date]
Gives you:
head(df,10)
# date days rate
# 1: 1996-01-02 1 5.766787 <- rates for days 1-8 & 10 are imputed
# 2: 1996-01-02 2 5.764987
# 3: 1996-01-02 3 5.763186
# 4: 1996-01-02 4 5.761385
# 5: 1996-01-02 5 5.759585
# 6: 1996-01-02 6 5.757784
# 7: 1996-01-02 7 5.755983
# 8: 1996-01-02 8 5.754183
# 9: 1996-01-02 9 5.763067 <- this rate was given
# 10: 1996-01-02 10 5.750581
If there are values of date without at least two observations of rate, you will probably get an error because you won't have enough points to fit a line.
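If that is a concern, one way to guard against it is to skip the fit for sparse groups; a sketch that leaves dates with fewer than two observed rates untouched:
df[, rate := if (sum(!is.na(rate)) >= 2) {
               ifelse(is.na(rate), predict(lm(rate ~ days, .SD), .SD), rate)
             } else rate,
   keyby = date]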
Alternative: Rolling joins solution that averages the nearest observations
This requires rolling joins to the left and to the right, and an average of the two that ignores NA values. Note that the average is unweighted, so every day between two observations gets their midpoint rather than a distance-weighted (truly linear) interpolation.
This doesn't do well for extrapolation either, since it's just a constant (either the first or the last observation) outside the range of the observations.
setkey(df, date, days)
df2 <- data.table( # this is your framework of date/days pairs you want to evaluate
expand.grid( date=df[,sort(unique(date))],
days=1L:100L),
key = c('date','days')
)
# average of the non-NA values of two vectors (NaN if both are NA)
meanIfNotNA <- function(x,y){
(ifelse(is.na(x),0,x) + ifelse(is.na(y),0,y)) /
( as.numeric(!is.na(x)) + as.numeric(!is.na(y)))
}
df3 <- # this is your evaluations for the date/days pairs in df2.
setnames(
df[setnames( df[df2, roll=+Inf], # rolling join Last Obs Carried Fwd (LOCF)
old = 'rate',
new = 'rate_locf'
),
roll=-Inf], # rolling join Next Obs Carried Backwd (NOCB)
old = 'rate',
new = 'rate_nocb'
)[, rate := meanIfNotNA(rate_locf,rate_nocb)]
# once you're satisfied that this works, you can include rate_locf := NULL, etc.
head(df3,10)
# date days rate_nocb rate_locf rate
# 1: 1996-01-02 1 5.763067 NA 5.763067
# 2: 1996-01-02 2 5.763067 NA 5.763067
# 3: 1996-01-02 3 5.763067 NA 5.763067
# 4: 1996-01-02 4 5.763067 NA 5.763067
# 5: 1996-01-02 5 5.763067 NA 5.763067
# 6: 1996-01-02 6 5.763067 NA 5.763067
# 7: 1996-01-02 7 5.763067 NA 5.763067
# 8: 1996-01-02 8 5.763067 NA 5.763067
# 9: 1996-01-02 9 5.763067 5.763067 5.763067 <- this rate was given
# 10: 1996-01-02 10 5.745902 5.763067 5.754485
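For the one-step interpolate-and-extrapolate the question asks about, Hmisc::approxExtrap extends approx with straight-line extrapolation from the outermost segments. A sketch of how it could slot into a data.table grouping, using the question's table A (Hmisc is an extra dependency, and A_expanded is my own name):
library(data.table)
library(Hmisc)  # approxExtrap(): approx() plus linear extrapolation at the ends

days_vec <- 1L:100L
A_expanded <- A[, {
  fit <- approxExtrap(x = days, y = rate, xout = days_vec)
  .(days = fit$x, rate = fit$y)
}, by = date]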
I am new to "R"; I have this html table here
I need to find out if there is a gap in the "time (DT)" column of more than one minute. I need to analyze the data and create a new table with just two columns, the first one with the time and the second one with the number of the gap.
Like this: output
So far I am able to download the data!!!
require(XML)
u <- 'http://cronos.est.pr/test.html'
tables <- readHTMLTable(u)  # parse every table on the page into a list
datatest <- tables[[1]]     # the first table holds the data
View(datatest)
What's next?
Convert the first column to "POSIXct" class, take differences and replace differences of one minute or less with NA. No packages are used.
with(datatest, {
  Time <- as.POSIXct(`Time (DT)`)
  Diff <- c(0, as.numeric(diff(Time), units = "mins"))  # gap to the previous row, in minutes
  data.frame(Time, Diff = ifelse(Diff <= 1, NA, Diff))  # gaps of one minute or less become NA
})
giving:
Time Diff
1 2010-01-01 09:10:00 NA
2 2010-01-01 09:11:00 NA
3 2010-01-01 09:12:00 NA
4 2010-01-01 09:13:00 NA
5 2010-01-01 09:17:00 4
6 2010-01-01 09:18:00 NA
7 2010-01-01 09:19:00 NA
8 2010-01-01 09:20:00 NA
9 2010-01-01 09:22:00 2
10 2010-01-01 09:24:00 2
11 2010-01-01 09:25:00 NA
12 2010-01-01 09:26:00 NA
13 2010-01-01 09:38:00 12
14 2010-01-01 09:39:00 NA
15 2010-01-01 09:40:00 NA
Use the lubridate package.
library(lubridate)
# Parse the full timestamp; differencing minute() alone would wrap around at each hour boundary.
date_time <- ymd_hms(datatest[,"Time (DT)"])
gaps <- c(0, as.numeric(diff(date_time), units = "mins"))
output <- data.frame(date_time = date_time, gaps = gaps)
The output is like you requested except that every gap is recorded, not just the ones greater than 1 minute. To get just the big gaps, do
output[output$gaps > 1,]
This is probably a very simple question that has been asked already but..
I have a data frame that I have constructed from a CSV file generated in Excel. The observations are not homogeneously sampled, i.e. they are for "On Peak" times of electricity usage, which means they exclude different days each year. I have 20 years of data (1993-2012) and am running both non-robust and robust LOESS to extract seasonal and linear trends.
After the decomposition has been done, I want to focus only on the observations from June through September.
How can I create a new data frame of just those results?
Sorry about the formatting, too.
Date MaxLoad TMAX
1 1993-01-02 2321 118.6667
2 1993-01-04 2692 148.0000
3 1993-01-05 2539 176.0000
4 1993-01-06 2545 172.3333
5 1993-01-07 2517 177.6667
6 1993-01-08 2438 157.3333
7 1993-01-09 2302 152.0000
8 1993-01-11 2553 144.3333
9 1993-01-12 2666 146.3333
10 1993-01-13 2472 177.6667
As Joran notes, you don't need anything other than base R:
## Reproducible data
df <-
data.frame(Date = seq(as.Date("2009-03-15"), as.Date("2011-03-15"), by="month"),
MaxLoad = floor(runif(25,2000,3000)), TMAX=runif(25,100,200))
## One option
df[months(df$Date) %in% month.name[6:9],]
# Date MaxLoad TMAX
# 4 2009-06-15 2160 188.4607
# 5 2009-07-15 2151 164.3946
# 6 2009-08-15 2694 110.4399
# 7 2009-09-15 2460 150.4076
# 16 2010-06-15 2638 178.8341
# 17 2010-07-15 2246 131.3283
# 18 2010-08-15 2483 112.2635
# 19 2010-09-15 2174 160.9724
## Another option: strftime() will be more _generally_ useful than months()
df[as.numeric(strftime(df$Date, "%m")) %in% 6:9,]
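One caveat: months() returns month names in the current locale, so the %in% month.name test can fail on non-English systems. If lubridate is available (as in the earlier answers), month() gives the month number directly; a sketch:
library(lubridate)
df[month(df$Date) %in% 6:9, ]  # month() returns 1-12, locale-independent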