Seasonal adjustment of multiple series ignoring NA and zero columns - r

I have a dataframe with 105 months and 20 columns. The example below is simplified and shows that some of the columns start at January 2014 and some don't. Some others are zeroed:
df <- data.frame(months = c('2014-01-01','2014-02-01',
'2014-03-01','2014-04-01','2014-05-01',
'2014-06-01','2014-07-01'),
series2 = c(1754,3345,12226,1712,6703,8172,1545),
series3 = c(NA,NA,NA,NA,554,222,321),
series4 = c(NA,NA,NA,NA,0,0,0)
)
My objective is to seasonally adjust the series which can be seasonally adjusted and write a similar dataframe, keeping the seasonally adjusted series in the same order and position as in the original dataframe.
I have made a for loop to decide which columns can be seasonally adjusted. The for loop also finds out the initial date of every column.
library(seasonal)
library(dplyr) # pull()
library(lubridate) # %m+%, year(), month()
# determine initial and final date in the first column of dataframe
initial_date <- as.POSIXct(pull(df[1,1]),format = "%Y-%m-%d")
final_date <- as.POSIXct(pull(df[nrow(df),1]),format = "%Y-%m-%d")
# create an empty dataframe to be completed with seasonally adjusted data
seas_adj_df<-data.frame(matrix(ncol = ncol(df), nrow = nrow(df)))
# decide which series should be seasonally adjusted
for(i in 2:ncol(df)) { # Head of for-loop
# if a certain column contains only zeros...
if(sum(df[,i] != 0, na.rm=TRUE)==0) {
seas_adj_df[,i]<-as.numeric(NA) #fill the column with NA
} else {
#determine the number of values of the column
n_values_column<-length(df[,i][!is.na(df[,i])])
#how many months after the beginning of the dataframe did the column start?
months_to_add<-nrow(df)-n_values_column
#calculates the initial date of the column
column_initial_date<-initial_date %m+% months(months_to_add)
#transform the column values into a time series
time_series <- ts(df[,i],start = c(year(column_initial_date),
month(column_initial_date)),
end = c(year(final_date), month(final_date)),
freq = 12)
#perform seasonal adjustment
time_series_sa<- final(seas(time_series, multimode = "R"))
#insert seasonally adjusted series into the new dataframe
seas_adj_df[,i]<-time_series_sa #this part is wrong
}}
However, I receive the following error:
Error in `[<-.data.frame`(`*tmp*`, , i, value = c(928.211662624947, 993.311013042665, : replacement has 81 rows, data has 105
This happens because some of my columns have 81 values (the rest are filled with "NA").
My two questions are:
Is there a way to seasonally adjust all series while asking R to skip the NA columns and the all-zero columns? My final dataframe must have the seasonally adjusted series in exactly the same positions as the original ones (example: series5_SA must be in column 5, even if series4 couldn't be seasonally adjusted).
Using my code (or similar code), how could I add a time series with 81 values to a dataframe with 105 rows? How can I tell R that the column should be inserted from line (105-81=24) on?

You can use lapply together with an error-handling function such as tryCatch instead of a for loop. tryCatch attempts an operation and, if it encounters an error, returns an alternative result that you specify (here, the original, unadjusted ts). The order of the time series is not affected. Here is an example with the AirPassengers data set:
> library(seasonal)
> library(lubridate)
>
> data(AirPassengers)
>
> df <- replicate(5, AirPassengers)
> df <- cbind.data.frame(date_decimal(as.numeric(time(AirPassengers))), df)
>
> ## Adding NA to second and fourth data columns
> df[sample(1:nrow(df), 10), 3] <- NA
> df[sample(1:nrow(df), 10), 5] <- NA
>
> initial_date <- as.Date(df[1,1], format = "%Y-%m-%d")
>
> time_series <- lapply(df[, -1], function(x){
+ ts(x, start = c(year(initial_date), month(initial_date)), frequency = 12)
+ })
>
> time_series_sa <- lapply(time_series, function(x) {
+ tryCatch(final(seas(x, multimode = "R")), error = function(e) {x})
+ })
>
> summary(time_series_sa)
Length Class Mode
1 144 ts numeric
2 144 ts numeric
3 144 ts numeric
4 144 ts numeric
5 144 ts numeric
Hope it helps.
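For the second question (placing 81 values into a 105-row column), one possible approach, which is not part of the answer above, is to left-pad the shorter series with NA before assigning it. The sketch below uses base R only; the helper name pad_to_length and the toy numbers are hypothetical and stand in for the output of final(seas(...)).

```r
# Hypothetical helper: left-pad a shorter series with NA so it fills n rows.
pad_to_length <- function(x, n) {
  stopifnot(length(x) <= n)
  # as.numeric() drops ts attributes so the result is a plain numeric vector
  c(rep(NA_real_, n - length(x)), as.numeric(x))
}

# Toy target data frame with 7 rows, mirroring the simplified example
out <- data.frame(matrix(NA_real_, nrow = 7, ncol = 3))

# Stand-in for an adjusted series that only covers the last 3 months
short_series <- c(554, 222, 321)

# The padded vector has exactly nrow(out) elements, so the assignment works
out[, 3] <- pad_to_length(short_series, nrow(out))
out[[3]]
```

Because the padding is added at the start, the values line up with the last rows of the data frame, which matches series that begin later than the first month.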

Related

How can I remove datasets from a list based on conditon using lapply?

I have a list of data frames and need to perform the same operation on each one. I currently do this with a for loop, but it is too slow, so I would like to use lapply instead. The operation is to check what share of the dates in a reference vector is missing from the date column of each data frame.
The data have the following structure.
# the dates which are of interest
dates <- seq(as.Date("2020-02-01"), as.Date("2020-02-09"), by = "days")
# the list of data frames
df_1 <- data.frame(seq(as.Date("2020-02-01"), as.Date("2020-02-09"), by = "days"), 1:9)
names(df_1) <- c("date", "value")
df_2 <- data.frame(seq(as.Date("2020-02-01"), as.Date("2020-02-07"), by = "days"), 1:7)
names(df_2) <- c("date", "value")
df_list <- list(df_1, df_2)
The loop which is working but is too slow looks like this.
for (i in 1:length(df_list)) {
# get range of dates in the data frame
df_date_range <- unique(df_list[[i]][["date"]])
# get range of dates that occur from the point of the beginning of the data frame
dates_reduced <- dates[dates >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing values are above 1 %
if (missing > 0.1) {
df_list[[i]] <- NULL
}
}
I tried the following lapply approach.
# write function to use in lapply
clean <- function(data, date_range) {
# get range of dates in the data frame
df_date_range <- unique(data$date)
# get range of dates that occur from the point of the beginning of the data frame
dates_reduced <- date_range[date_range >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing values are above 1 %
if (missing > 0.1) {
data <- NULL
}
}
# apply the function to the list of data frames
new_df_list <- lapply(df_list, clean, date_range = dates)
This however only yields a list of NULLs. Any help on what I'm doing wrong would be greatly appreciated.
While we don't have to explicitly specify return values in R, it is often clearer to do so, and your problem illustrates the point. An R function implicitly returns the value of the last evaluated expression, but that is not always what one would expect:
Consider the following function:
no_explicit_return_value <- function() {
some_non_NULL_value <- 10000
}
If we run:
test_value <- no_explicit_return_value()
test_value
We get back:
[1] 10000
Since the last expression returned 10000... so far all good!
Now, consider this function:
no_explicit_return_value <- function() {
some_non_NULL_value <- 10000
if (1000 < 4) {
x <- NULL
}
}
If we run:
test_value <- no_explicit_return_value()
test_value
We get back:
NULL
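This is not because the if condition evaluated to TRUE. The condition is FALSE, and an if statement without an else branch returns NULL (invisibly) in that case. Since the if statement is the last expression in the function, NULL becomes the function's return value.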
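The behaviour can be checked in isolation with a small sketch. The function name value_of_if is made up for illustration.

```r
# An if without else: the value depends on whether the condition holds.
value_of_if <- function(flag) {
  if (flag) "condition was TRUE"
}

res_true  <- value_of_if(TRUE)   # the value of the if body
res_false <- value_of_if(FALSE)  # NULL, returned invisibly

is.null(res_true)
is.null(res_false)
```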
The Solution:
clean <- function(data, date_range) {
# get range of dates in the data frame
df_date_range <- unique(data$date)
# get range of dates that occur from the point of the beginning of the data frame
dates_reduced <- date_range[date_range >= min(df_date_range)]
# get the share of dates missing
missing <- mean(!(dates_reduced %in% df_date_range))
# remove data frames where the share of missing values are above 1 %
if (missing > 0.1) {
data <- NULL
}
return(data)
}
# apply the function to the list of data frames
new_df_list <- lapply(df_list, clean, date_range = dates)
new_df_list
Returns:
[[1]]
date value
1 2020-02-01 1
2 2020-02-02 2
3 2020-02-03 3
4 2020-02-04 4
5 2020-02-05 5
6 2020-02-06 6
7 2020-02-07 7
8 2020-02-08 8
9 2020-02-09 9
[[2]]
NULL
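The result above still contains a NULL element in place of the rejected data frame. If the goal is to drop those entries entirely, one option (an addition to the answer, using only base R) is to filter the list afterwards. The list below is toy data.

```r
# Toy list standing in for new_df_list: one element was set to NULL
results <- list(data.frame(a = 1:2), NULL, data.frame(a = 3:4))

# Keep only the elements that are not NULL
kept <- Filter(Negate(is.null), results)

length(kept)
```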

Not all values storing in a loop

I want to store values in "yy", but my code below stores only one row (the last value). Please see the output below. Can somebody help me store all the values in "yy"?
Thanks in advance. I am a beginner to R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
for (ik in length(arrYear)) {
newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
yy <- newPolicy[which.min(newPolicy$min_dist),]
}
}
Output:
YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016 8 41 97 21 699609 TN 0 2.6
Here is an image of the "TN_97_Lau_Cot" matrix.
No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. These are the two ways I would try and do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
x = TN_97_Lau_Cot,
y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
by = c("YEAR","POLICY"),
all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year and policy without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the policy numbers that match the minimum distance for that year.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
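As a side note on why the original loop kept only one row: for (ij in length(x)) iterates over a single number (the length) rather than over 1..length, and yy is overwritten on every pass. The sketch below, on a hypothetical toy data frame, shows a loop that visits every group and accumulates its results. It is meant as an illustration, not as a replacement for the set-based approaches above.

```r
# Toy data standing in for TN_97_Lau_Cot (values are made up)
toy <- data.frame(YEAR = c(2015, 2015, 2016, 2016),
                  POLICY = c("A", "A", "A", "A"),
                  min_dist = c(3.1, 2.0, 2.6, 4.4))

years <- unique(toy$YEAR)
length(years)      # a single number, so `for (i in length(years))` runs once
seq_along(years)   # the full index sequence

# Collect one row per year in a list, then bind the rows together
rows <- list()
for (yr in years) {
  sub <- toy[toy$YEAR == yr, ]
  rows[[length(rows) + 1]] <- sub[which.min(sub$min_dist), ]
}
yy <- do.call(rbind, rows)
yy
```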

how to use loop to subset 1000 csv files and do co-integration test in R?

data looks like sample1
sample2
I have 1000 csv files, each with two columns: the first is the date and the second is the price. The files cover different time periods; some start in 1995 and some in 2000. Since I need to run a co-integration test, all series in the test must cover exactly the same dates.
I need to extract the same time period from all 1000 csv files, for example starting from 1998-04-20.
It works if I do it individually using:
newdata208 <- subset(data208, Date >= "1998-04-20")
But when I try to do this in a loop, I get an error. Could anyone help me fix it?
v1 <- list()
for (i in 1:length(datasets)) {
v1[i] <- subset(datasets[i], Date >= "1998-04-20")
}
Error in subset.default(datasets[i], Date >= "1998-04-20") :
object 'Date' not found
For the original problem, calling get() on the dataset name inside the loop fixed the error. I do not know why; could anyone explain?
v1 <- list()
for (i in 1:length(datasets)) {
data <- get(datasets[i])
v1[[i]] <- subset(data, Date >= "1998-04-20")
}
After that, the problem is fixed.
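A likely explanation, assuming datasets holds the names of the data frames as character strings (as the later code suggests): subset() was being called on a string, which has no Date column, whereas get() looks up the object that the name refers to. The sketch below uses two made-up data frames and mget() to fetch several objects by name at once.

```r
# Two toy data frames standing in for the imported files (values are made up)
file001 <- data.frame(Date = as.Date(c("1998-04-19", "1998-04-20")), C = c(1, 2))
file002 <- data.frame(Date = as.Date(c("1998-04-21", "1998-04-22")), C = c(3, 4))

# A vector of object names, not the objects themselves
datasets <- c("file001", "file002")
class(datasets[1])   # "character": there is no Date column to subset on

# Look the objects up by name, then subset each one
v1 <- lapply(mget(datasets, envir = environment()),
             function(d) subset(d, Date >= as.Date("1998-04-20")))
sapply(v1, nrow)
```

Reading the files straight into a named list would avoid get() and assign() altogether, but that is a design choice beyond what the post describes.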
updating my stupid code
# set working directory
setwd("F:/xxx/folder")
dataset <- list.files(pattern = "*.CSV")
datasets <- c()
for (i in 1:1000) {
datasets[i] <- substr(dataset[i], 1, (nchar(dataset[i])-4))
}
# we only need closing price column and date column
setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format = "%m/%d/%Y"))
# read date column and closing price column
for (i in seq_along(datasets)) {
assign(datasets[i], read.csv(dataset[i],
colClasses = c("myDate", rep("NULL", 4),
"numeric",
rep("NULL", 2)), stringsAsFactors = FALSE, header = TRUE))
}
# extract same time period
v1 <- list()
for (i in datasets) {
data <- get(i)
v1[[i]] <- subset(data, Date >= "1998-04-20", select = C)
}
# lengths are different, file505 has short time period
index <- subset(file505, Date >= "1998-04-20")
indexs <- index$Date
# try use index to extract data
selectdate <- which(file001$Date %in% indexs)
file001CLOSE <- file001[selectdate, "C"]
#redo loop to get same period
v2 <- c()
for (i in datasets) {
data2 <- get(i)
v2[[i]] <- data2[selectdate, "C"]
}
v2table <- do.call(cbind,v2)
# right now, data is wonderful, let's begin do time series.
# test co-integration
install.packages("urca")
library("urca")
comb <- combn(1000, 2)
pairs <- c()
for (i in 1:499500) {
pairs[[i]] <- v2table[, comb[, i]]
}
# test:FF <- pairs[[88]], it is working wonderful, display all details in result
# do ca.jo
testresults <- list()
for (i in 1:499500) {
testdata <- pairs[[i]]
testresults[[i]] <- ca.jo(testdata, ecdet = "const", type = "eigen", K = 1)
}
This creates a huge list containing all test results. I need to extract the test statistics and the critical values from each result and find all pairs that are co-integrated.
I've got a two step solution for you:
library(data.table)
x <- list.files(path = "your directory", pattern = ".csv")
y <- lapply(x,fread)
data <- rbindlist(y)
This reads in all the csvs in your directory as data tables and then binds them together.
After that I'd just subset as follows:
data <- data[Date >= "your dates",]
EDIT*
I get the feeling you want to bring in your data and bind it column wise/merge. As it stands that would be incorrect as you have the same "C" value in each column. After looking at your samples, if you were to stack those row wise you'd just be getting a very long C column. I wonder whether each C column represents the same or a different variable. If it is a different variable, I've written up some code which would truncate your data appropriately.
I've used the first 6 rows of your sample data 1 and sample data 2
files <- list.files(path = dir, pattern = ".csv")
data_mock <- lapply(files,fread)
data_mock[[1]][, Date := data_mock[[2]][,Date]]
#I change the dates here because your sample dates are too far apart to test for date truncation to work
for (i in 1:length(data_mock)){
data_mock[[i]]$Date <- as.Date(data_mock[[i]]$Date, format = "%Y-%m-%d")
}
for (i in 1:length(data_mock)){
setnames(data_mock[[i]], old = names(data_mock[[i]]), new = c("Date", paste0("C",i)))
}
#I change the variable names here because I'm not sure whether you want to stack Cs ontop of one another or whether each C is a different variable.
#I've assumed each C is different.
start_finish <- function(data, start, finish){
data[Date >= start & Date <= finish,]
}
results <- list()
for (i in 1:length(data_mock)){
results[[i]] <- start_finish(data_mock[[i]], "1987-01-15", "1987-01-17")
}
This is what the original data looked like:
[[1]]
Date C
1: 1998-04-20 12.667
2: 1998-04-21 12.587
3: 1998-04-22 12.625
4: 1998-04-23 12.601
5: 1998-04-24 12.584
6: 1998-04-25 12.624
[[2]]
Date C
1: 1987-01-14 95.89
2: 1987-01-15 97.72
3: 1987-01-16 98.10
4: 1987-01-17 97.07
5: 1987-01-18 98.86
6: 1987-01-19 99.95
This is what it looks like once you run a loop over the "start_finish" function I wrote:
[[1]]
Date C
1: 1987-01-15 12.587
2: 1987-01-16 12.625
3: 1987-01-17 12.601
[[2]]
Date C2
1: 1987-01-15 97.72
2: 1987-01-16 98.10
3: 1987-01-17 97.07
I believe you wanted your data to start and end at the same time? You merely need to alter the "start" and "finish" dates in the pretty simple function I wrote.
Is that what you're after?
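If the aim is one table in which every series covers exactly the same dates, another possibility (a sketch with made-up data, not the code from the answer above) is to give each price column its own name and inner-join all tables on Date. Base R's merge() keeps only the dates present in both inputs by default.

```r
# Two toy price series with partly overlapping dates (values are made up)
a <- data.frame(Date = as.Date("1998-04-18") + 0:4,
                C = c(12.6, 12.5, 12.7, 12.6, 12.6))
b <- data.frame(Date = as.Date("1998-04-20") + 0:4,
                C = c(97.7, 98.1, 97.1, 98.9, 99.9))
series <- list(file001 = a, file002 = b)

# Rename the price column after the series so the columns stay distinguishable
for (nm in names(series)) names(series[[nm]])[2] <- nm

# Inner join on Date across the whole list: only common dates survive
aligned <- Reduce(function(x, y) merge(x, y, by = "Date"), series)
aligned
```

With 1000 files the repeated merge may be slow, so this is best read as an illustration of the alignment step rather than a tuned solution.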

Matrix Error- data length doesn't match, even though it should

I am trying to write a function that will return the number of business days between two dates (not just excluding weekends, but holidays as well). I'm approaching it by building a matrix with rownames corresponding to days of the week with the elements of the matrix either a 1 or a 0: a 0 if it is a holiday or the extra couple elements to fill the matrix.
I've checked the length of each vector in the code. It checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this error message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in Rstudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business (not including holidays)
#days until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXlt objects, which have time components and are therefore affected by daylight saving time. Observe
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
So there is not a whole number of days between the two values, which causes complications later in your code. If you use as.Date rather than strptime, you won't have this problem, because Date objects carry no time component.
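For comparison, the same two dates handled as Date objects give a whole number of days, independent of the local time zone. A minimal check:

```r
d1 <- as.Date("2014-08-24")
d2 <- as.Date("2014-12-31")

# Subtracting Date objects yields a difference in whole days
n_days <- as.numeric(d2 - d1)
n_days

# A sequence built from it therefore has a predictable length
length(d1 + 0:n_days)
```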
But I'm not really sure why you're bothering with the matrix at all. I think a simpler implementation would look like
dte <- function(date) {
d <- as.Date(date,format="%Y-%m-%d")
c <- Sys.Date()
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- seq(c, d, by="1 day")
return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}
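To try the idea without the cal file, here is a self-contained variant in which the start date and the holidays are passed in as arguments. The function name business_days and the single holiday are assumptions made for the example. Like the version above, it counts both endpoints and expects the target date not to precede the start date.

```r
# Count weekdays between two dates (inclusive) that are not holidays.
business_days <- function(from, to, holidays = as.Date(character(0))) {
  days <- seq(as.Date(from), as.Date(to), by = "1 day")
  # wday: 0 = Sunday, 1..5 = Monday..Friday, 6 = Saturday
  is_weekday <- as.POSIXlt(days)$wday %in% 1:5
  sum(is_weekday & !(days %in% holidays))
}

# Monday 2014-12-22 to Wednesday 2014-12-31, with 25 December as a holiday
n <- business_days("2014-12-22", "2014-12-31",
                   holidays = as.Date("2014-12-25"))
n
```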

Subsetting zoo series by a time that is not in the series

Is there a good package in R that allows subsetting (i.e. indexing into) time series by times that are not in the time series?
E.g. for financial applications, indexing a price series by a time stamp that is not in the database, should return the latest available price before the time stamp.
in code, this is what I would like
n =15
full.dates = seq(Sys.Date(), by = 'day', length = n)
series.dates = full.dates[c(1:10, 12, 15)]
require(zoo)
series=zoo(rep(1,length(series.dates)), series.dates)
series[full.dates[11]]
this returns
Data:
numeric(0)
Index:
character(0)
however, I would like this to return the value of the last existing date before full.dates[11], which is full.dates[10]:
series[full.dates[10]]
2014-01-03
1
Thanks
You can use index to extract index of the observations in your zoo object. The index can then be used for subsetting the object. Step by step to show the logic (you only need the last step, if I have understood you correctly):
# the index of the observations, here dates
index(series)
# are the dates smaller than your reference date?
index(series) < full.dates[11]
# subset observations: dates less than reference date
series[index(series) < full.dates[11]]
# select last observation before reference date:
tail(series[index(series) < full.dates[11]], 1)
# 2014-01-03
# 1
A possible alternative may be to expand your time series and "replac[e] each NA with the most recent non-NA" using na.locf and the xout argument (see also ?na.locf and ?approx and this answer)
# expand time series to the range of dates in 'full.dates'
series2 <- na.locf(series, xout = full.dates)
series2
# select observation at reference date
series2[full.dates[10]]
# 2014-01-03
# 1
If you would rather have missing values in your incomplete series replaced by "next observation carried backward", you need to merge your series with a 'dummy' zoo object which contains the desired range of consecutive dates.
series3 <- merge(series, zoo(, full.dates))
na.locf(series3, fromLast = TRUE)
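The same "latest available observation" lookup can also be sketched without zoo, using base R's findInterval on the sorted dates. This is a different technique from the na.locf approach above. The helper name last_obs and the toy series are hypothetical, and the lookup returns the observation at or before the query date (not strictly before).

```r
# Toy series: observations on days 1-10, 12 and 15 of January 2014
dates  <- as.Date("2014-01-01") + c(0:9, 11, 14)
values <- seq_along(dates)

# Hypothetical helper: value of the last observation at or before each query date
last_obs <- function(query, dates, values) {
  idx <- findInterval(as.numeric(as.Date(query)), as.numeric(dates))
  idx[idx == 0] <- NA   # query lies before the first observation
  values[idx]
}

last_obs(as.Date("2014-01-11"), dates, values)  # falls back to 2014-01-10
last_obs(as.Date("2013-12-31"), dates, values)  # nothing available yet: NA
```

findInterval requires the dates to be sorted in increasing order, which is normally the case for a time series index.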
na.locf(x, xout = newdate) seems not much worse than subscripting but at any rate here we define a subclass of "zoo" called "zoo2" in which [ uses na.locf. This is an untested minimal implementation but it could be extended:
as.zoo2 <- function(x) UseMethod("as.zoo2")
as.zoo2.zoo <- function(x) structure(x, class = c("zoo2", setdiff(class(x), "zoo2")))
"[.zoo2" <- function(x, i, ...) {
if (!missing(i) && inherits(i, class(index(x)))) {
zoo:::`[.zoo`(na.locf(x, xout = i),, ...)
} else as.zoo2(zoo:::`[.zoo`(x, i, ...))
}
This gives:
> series2 <- as.zoo2(series)
> series2[full.dates[11]]
2014-01-04
1
I would strongly argue that subset functions should not return the prior row if the desired index value does not exist. Subset functions should return what the user requested; they should not assume the user wanted something different than what they requested.
If this is what you want, you can handle it fairly easily with an if statement.
series.subset <- series[full.dates[11]]
if(NROW(series.subset)==0) {
# merge series with an empty zoo object
# that contains the index value you want
prior <- merge(series, zoo(,full.dates[11]))
# lag *back* one period so the NA is on the prior value
prior <- lag(prior, 1)
# get the index value at the prior value
prior <- index(prior)[is.na(prior)]
# subset again
series.subset <- series[prior]
}
