I use the following source and get an error:
>source("raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…)
Error in source("raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…) : raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…: unexpected input 1: ï»
I have to use this function. What is the error, and how can I fix it?
Here is my code (the source() line is the relevant command):
library(zoo)
library(xts)
library(data.table)
source("https://raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/na.interp1.R")
Lines <- "D1,Diff
1,20/11/2014 16:00,0.01
2,20/11/2014 17:00,0.02
3,20/11/2014 19:00,0.03
4,21/11/2014 16:00,0.04
5,21/11/2014 17:00,0.06
6,21/11/2014 20:00,0.10"
z <- read.zoo(text = Lines, tz = "", format = "%d/%m/%Y %H:%M", sep = ",")
## Source 1 begins
startdate <- as.character((start(z)))
# set the start date/time as the 1st entry in the time series and make
# this a character vector.
start <- as.POSIXct(startdate)
# transform the character vector to a POSIXct object
enddate <- as.character((end(z)))
# set the end date/time as the last entry in the time series and make
# this a character vector.
end <- as.POSIXct(enddate)
# transform the character vector to a POSIXct object
gridtime <- seq(from = start, by = 3600, to = end)
# create a sequence beginning with the start date/time with a 60 minute
# interval ending at the end date/time
## Source 1 ends
## Source 2 begins
timeframe <- data.frame(rep(NA, length(gridtime)))
# create 1 NA column spaced out by the gridtime to complement the single
# column of z
timelength <- xts(timeframe, order.by = gridtime)
# create a xts time series object using timeframe and gridtime
zDate <- merge(timelength, z)
# merge the z zoo object and the timelength xts object
## Source 2 ends
Lines <- as.data.frame(zDate)
# to data.frame from zoo
Lines[, "D1"] <- rownames(Lines)
# create column named D1
Lines <- setDT(Lines)
# create data.table out of data.frame
setcolorder(Lines, c(3, 2, 1))
# set the column order as the 3rd column followed by the 2nd and 1st
# columns
Lines <- Lines[, 3 := NULL]
# remove the 3rd column
setnames(Lines, 2, "diff")
# change the name of the 2nd column to diff
Lines <- setDF(Lines)
# return to data.frame
rowsinterps1 <- which(is.na(Lines$diff == TRUE))
# index of rows of Lines that have NA (to be interpolated)
xi <- as.numeric(Lines[which(is.na(Lines$diff == TRUE)), 1])
# the Date-Times for diff to be interpolated in numeric format
interps1 <- na.interp1(as.numeric(Lines$Time), Lines$diff, xi = xi, na.rm = FALSE, maxgap = 3)
# the interpolated values where only gap sizes of 3 are filled
The package has since been updated, which is why the code did not work.
I wish the people who downvoted would restore the points. The question was OK.
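A note on the error text itself: the characters ï» are how the first bytes of a UTF-8 byte order mark (EF BB BF) look when displayed as Latin-1, so the parser appears to be stopping on a BOM at the start of the file rather than on the code. A minimal sketch, assuming a local file that starts with a BOM (the temporary file and the variable x_bom are made up for illustration), reads the file through a connection opened with encoding "UTF-8-BOM", which drops the mark before parsing:

```r
# Write a tiny script that starts with a UTF-8 byte order mark (EF BB BF).
# The file and the variable x_bom are hypothetical, for illustration only.
script <- tempfile(fileext = ".R")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("x_bom <- 42\n")), script)

# Opening the connection with encoding "UTF-8-BOM" strips the mark on read.
con <- file(script, encoding = "UTF-8-BOM")
code <- readLines(con)
close(con)

# Parse and evaluate the cleaned lines.
eval(parse(text = code))
print(x_bom)
```

This only addresses the BOM; it does not cover changes in the package itself.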
I have a dataframe with 105 months and 20 columns. The example below is simplified and shows that some of the columns start at January 2014 and some don't. Some others are zeroed:
df <- data.frame(months = c('2014-01-01','2014-02-01',
'2014-03-01','2014-04-01','2014-05-01',
'2014-06-01','2014-07-01'),
series2 = c(1754,3345,12226,1712,6703,8172,1545),
series3 = c(NA,NA,NA,NA,554,222,321),
series4 = c(NA,NA,NA,NA,0,0,0)
)
My objective is to seasonally adjust the series which can be seasonally adjusted and write a similar dataframe, keeping the seasonally adjusted series in the same order and position as in the original dataframe.
I have made a for loop to decide which columns can be seasonally adjusted. The for loop also finds out the initial date of every column.
library(seasonal)
library(dplyr)     # for pull()
library(lubridate) # for %m+%, months(), year() and month()
# determine initial and final date in the first column of dataframe
initial_date <- as.POSIXct(pull(df[1,1]),format = "%Y-%m-%d")
final_date <- as.POSIXct(pull(df[nrow(df),1]),format = "%Y-%m-%d")
# create an empty dataframe to be completed with seasonally adjusted data
seas_adj_df<-data.frame(matrix(ncol = ncol(df), nrow = nrow(df)))
# decide which series should be seasonally adjusted
for(i in 2:ncol(df)) { # Head of for-loop
# if a certain column contains only zeros...
if(sum(df[,i] != 0, na.rm=TRUE)==0) {
seas_adj_df[,i]<-as.numeric(NA) #fill the column with NA
} else {
#determine the number of values of the column
n_values_column<-length(df[,i][!is.na(df[,i])])
#how many months after the beginning of the dataframe did the column start?
months_to_add<-nrow(df)-n_values_column
#calculates the initial date of the column
column_initial_date<-initial_date %m+% months(months_to_add)
#transform the column values into a time series
time_series <- ts(df[,i],start = c(year(column_initial_date),
month(column_initial_date)),
end = c(year(final_date), month(final_date)),
freq = 12)
#perform seasonal adjustment
time_series_sa<- final(seas(time_series, multimode = "R"))
#insert seasonally adjusted series into the new dataframe
seas_adj_df[,i]<-time_series_sa #this part is wrong
}}
However, I receive the following error:
Error in `[<-.data.frame`(`*tmp*`, , i, value = c(928.211662624947, 993.311013042665, : replacement has 81 rows, data has 105
This happens because some of my columns have 81 values (the rest are filled with "NA").
My two questions are:
Is there a way to seasonally adjust all series, but asking R to "jump" NA columns and 0 columns? My final dataframe must have the seasonally adjusted series in the exact same position as the original ones (example: series5_SA must be in column 5, even if series4 couldn't be seasonally adjusted).
Using my code (or similar code), how could I add a time series with 81 values into a dataframe with 105 rows? How can I tell R that the column should be inserted from line (105-81=24) on?
You can use lapply together with an error-handling function such as tryCatch instead of a for loop. Such functions try to perform an operation and, if they encounter an error, return an alternative result that you specify (for example, the original, unadjusted ts). The order of the time series is not affected. Here is an example with the AirPassengers data set:
> library(seasonal)
> library(lubridate)
>
> data(AirPassengers)
>
> df <- replicate(5, AirPassengers)
> df <- cbind.data.frame(date_decimal(as.numeric(time(AirPassengers))), df)
>
> ## Adding NA to second and fourth data columns
> df[sample(1:nrow(df), 10), 3] <- NA
> df[sample(1:nrow(df), 10), 5] <- NA
>
> initial_date <- as.Date(df[1,1], format = "%Y-%m-%d")
>
> time_series <- lapply(df[, -1], function(x){
+ ts(x, start = c(year(initial_date), month(initial_date)), frequency = 12)
+ })
>
> time_series_sa <- lapply(time_series, function(x) {
+ tryCatch(final(seas(x, multimode = "R")), error = function(e) {x})
+ })
>
> summary(time_series_sa)
  Length Class Mode
1 144    ts    numeric
2 144    ts    numeric
3 144    ts    numeric
4 144    ts    numeric
5 144    ts    numeric
Hope it helps.
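For the second part of the question (placing a shorter adjusted series into a longer data frame), one option is to pad the series with leading NA values so that it has one value per row. A minimal base-R sketch with made-up numbers and a hypothetical helper pad_front:

```r
# Hypothetical helper: left-pad a numeric vector with NA up to n_rows values.
pad_front <- function(x, n_rows) {
  c(rep(NA_real_, n_rows - length(x)), as.numeric(x))
}

# Toy data frame with 7 rows; the "adjusted" series covers only the last 3 rows.
df <- data.frame(months = 1:7, series3 = c(NA, NA, NA, NA, 554, 222, 321))
adjusted <- ts(c(550, 225, 320), start = c(2014, 5), frequency = 12)

# Empty result frame with the same shape, then fill column 2 in place.
seas_adj_df <- data.frame(matrix(ncol = ncol(df), nrow = nrow(df)))
seas_adj_df[, 2] <- pad_front(adjusted, nrow(df))
print(seas_adj_df[, 2])
```

With 105 rows and 81 values, the first 24 rows would stay NA and the series would fill rows 25 to 105. This assumes the missing values are all at the start of the column, as in the question.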
data looks like sample1
sample2
I have 1000 CSV files, each with two columns: the first is the date and the second is the price. The files cover different time periods; some start in 1995 and some in 2000. Since I need to run a cointegration test, all series in the test must cover exactly the same time points.
I need to extract the same time period from all 1000 CSV files, for example starting from 1998-04-20.
It works if I do it individually using:
newdata208 <- subset(data208, Date >= "1998-04-20")
but when I try to do it in a loop, I get an error. Could anyone help me fix it?
v1 <- list()
for (i in 1:length(datasets)) {
v1[i] <- subset(datasets[i], Date >= "1998-04-20")
}
Error in subset.default(datasets[i], Date >= "1998-04-20") :
object 'Date' not found
For the original problem, adding get() inside the loop fixes it. I do not know why; could anyone tell me?
v1 <- list()
for (i in 1:length(datasets)) {
data <- get(datasets[i])
v1[[i]] <- subset(data, Date >= "1998-04-20")
}
With that change, the problem is fixed.
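The likely reason get() helps: datasets holds the names of the data frames as character strings, not the data frames themselves, so subset(datasets[i], ...) is applied to a one-element character vector that has no Date column. get() looks up the object that a name refers to. A small base-R sketch with two made-up data frames (file001 and file002 and their values are illustrative only):

```r
# Two toy data frames standing in for the imported CSV files (made-up values).
file001 <- data.frame(Date = as.Date(c("1998-04-17", "1998-04-20", "1998-04-21")),
                      C = c(12.5, 12.7, 12.6))
file002 <- data.frame(Date = as.Date(c("1998-04-20", "1998-04-22")),
                      C = c(95.9, 97.7))

datasets <- c("file001", "file002")    # character names, not data frames

print(is.character(datasets[1]))       # the element is just a string
print(is.data.frame(get(datasets[1]))) # get() returns the object with that name

# Subset each data frame, storing the results in a list keyed by name.
v1 <- list()
for (nm in datasets) {
  d <- get(nm)
  v1[[nm]] <- subset(d, Date >= as.Date("1998-04-20"))
}
print(sapply(v1, nrow))
```

Using v1[[nm]] rather than v1[nm] stores each data frame as a single list element.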
Update: here is my full code.
# set working directory
setwd("F:/xxx/folder")
dataset <- list.files(pattern = "*.CSV")
datasets <- c()
for (i in 1:1000) {
datasets[i] <- substr(dataset[i], 1, (nchar(dataset[i])-4))
}
# we only need closing price column and date column
setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format = "%m/%d/%Y"))
# read date column and closing price column
for (i in 1:length(datasets)) {
assign(datasets[i], read.csv(dataset[i],
colClasses = c("myDate", rep("NULL", 4),
"numeric",
rep("NULL", 2)), stringsAsFactors = FALSE, header = TRUE))
}
# extract same time period
v1 <- list()
for (i in datasets) {
data <- get(i)
v1[[i]] <- subset(data, Date >= "1998-04-20", select = C)
}
# lengths are different, file505 has short time period
index <- subset(file505, Date >= "1998-04-20")
indexs <- index$Date
# try use index to extract data
selectdate <- which(file001$Date %in% indexs)
file001CLOSE <- file001[selectdate, "C"]
#redo loop to get same period
v2 <- c()
for (i in datasets) {
data2 <- get(i)
v2[[i]] <- data2[selectdate, "C"]
}
v2table <- do.call(cbind,v2)
# right now, data is wonderful, let's begin do time series.
# test co-integration
install.packages("urca")
library("urca")
comb <- combn(1000, 2)
pairs <- c()
for (i in 1:499500) {
pairs[[i]] <- v2table[, comb[, i]]
}
# test:FF <- pairs[[88]], it is working wonderful, display all details in result
# do ca.jo
testresults <- list()
for (i in 1:499500) {
testdata <- pairs[[i]]
testresults[[i]] <- ca.jo(testdata, ecdet = "const", type = "eigen", K = 1)
}
This creates a huge list containing all the test results. I need to extract the test statistics and critical values from each result and find all pairs that are cointegrated.
I've got a two step solution for you:
x <- list.files(path = "your directory", pattern = ".csv")
y <- lapply(x,fread)
data <- rbindlist(y)
This reads in all the CSVs in your directory as data tables, then binds them together.
After that I'd just subset as follows:
data <- data[Date >= "your dates",]
EDIT*
I get the feeling you want to bring in your data and bind it column wise/merge. As it stands that would be incorrect as you have the same "C" value in each column. After looking at your samples, if you were to stack those row wise you'd just be getting a very long C column. I wonder whether each C column represents the same or a different variable. If it is a different variable, I've written up some code which would truncate your data appropriately.
I've used the first 6 rows of your sample data 1 and sample data 2
files <- list.files(path = dir, pattern = ".csv")
data_mock <- lapply(files,fread)
data_mock[[1]][, Date := data_mock[[2]][,Date]]
#I change the dates here because your sample dates are too far apart to test for date truncation to work
for (i in 1:length(data_mock)){
data_mock[[i]]$Date <- as.Date(data_mock[[i]]$Date, format = "%Y-%m-%d")
}
for (i in 1:length(data_mock)){
setnames(data_mock[[i]], old = names(data_mock[[i]]), new = c("Date", paste0("C",i)))
}
#I change the variable names here because I'm not sure whether you want to stack Cs ontop of one another or whether each C is a different variable.
#I've assumed each C is different.
start_finish <- function(data, start, finish){
data[Date >= start & Date <= finish,]
}
results <- list()
for (i in 1:length(data_mock)){
results[[i]] <- start_finish(data_mock[[i]], "1987-01-15", "1987-01-17")
}
This is what the original data looked like:
[[1]]
Date C
1: 1998-04-20 12.667
2: 1998-04-21 12.587
3: 1998-04-22 12.625
4: 1998-04-23 12.601
5: 1998-04-24 12.584
6: 1998-04-25 12.624
[[2]]
Date C
1: 1987-01-14 95.89
2: 1987-01-15 97.72
3: 1987-01-16 98.10
4: 1987-01-17 97.07
5: 1987-01-18 98.86
6: 1987-01-19 99.95
This is what it looks like once you run a loop over the "start_finish" function I wrote:
[[1]]
Date C
1: 1987-01-15 12.587
2: 1987-01-16 12.625
3: 1987-01-17 12.601
[[2]]
Date C2
1: 1987-01-15 97.72
2: 1987-01-16 98.10
3: 1987-01-17 97.07
I believe you wanted your data to start and end at the same time? You merely need to alter the "start" and "finish" dates in the pretty simple function I wrote.
Is that what you're after?
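If the goal is for every series to cover exactly the same dates, another option is to intersect the Date columns of all files and keep only the shared dates. A base-R sketch with made-up data (the list name series_list and its contents are illustrative, not taken from the real files):

```r
# Two toy series with overlapping but different date ranges (made-up values).
series_list <- list(
  a = data.frame(Date = as.Date("1998-04-20") + 0:5,
                 C = c(12.667, 12.587, 12.625, 12.601, 12.584, 12.624)),
  b = data.frame(Date = as.Date("1998-04-22") + 0:5,
                 C = c(95.89, 97.72, 98.10, 97.07, 98.86, 99.95))
)

# Dates present in every data frame (compared as character strings).
common <- Reduce(intersect, lapply(series_list, function(d) as.character(d$Date)))

# Keep only the shared dates in each data frame.
aligned <- lapply(series_list, function(d) d[as.character(d$Date) %in% common, ])

# Bind the price columns side by side, one column per series.
prices <- do.call(cbind, lapply(aligned, function(d) d$C))
rownames(prices) <- common
print(prices)
```

This assumes each file is sorted by date and has no duplicated dates; otherwise the rows of the combined matrix would not line up.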
I have two variables date and referencenumber. Both are extracted from a text string, with the use of a regular expression. They both have the class character.
When I use the cbind.fill function to combine these variables into an already existing dataframe, the values are transformed to the numeric values 1 and 1 instead of "06-07-2016" and "123ABC". I use cbind.fill because sometimes only one of the variables is found, and that variable must still be placed in the dataframe.
When I run the same code on a computer at school, it doesn't transform the values to numeric. So maybe it has something to do with my settings?
Why is this happening?
library(rowr)
dataframevariablen <- as.data.frame(matrix(nrow = 0, ncol = 2))
colnames(dataframevariablen) <- c("date", "refnr")
rulebased(dfgg$Text[i]) #returns the date and refnr as global variable
dataframevariablen[i,] <- cbind.fill(date,refnr, fill = NULL)
Does this work for you?
x <- c("6jul2016", "2jan1960", "31mar1960", "30jul1960")
date <- as.Date(x, "%d%b%Y")
refnr="123ABC" #returns the date and refnr as global variable
for (i in 1:length(date))
dataframevariablen[i,] <- data.frame(date[i],refnr,stringsAsFactors = F)
dataframevariablen$date=as.Date(dataframevariablen$date,origin="1970-01-01")
dataframevariablen
date refnr
1 2016-07-06 123ABC
2 1960-01-02 123ABC
3 1960-03-31 123ABC
4 1960-07-30 123ABC
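One plausible explanation for the 1 and 1 (an assumption, since the question does not show the intermediate objects): in R versions before 4.0.0, data.frame() converted strings to factors by default, and a factor with a single level stores the integer code 1. That default can differ between two computers through the R version or the stringsAsFactors option. A base-R sketch:

```r
date  <- "06-07-2016"
refnr <- "123ABC"

# With factors, each column stores integer codes plus a levels attribute.
as_factors <- data.frame(date, refnr, stringsAsFactors = TRUE)
codes <- vapply(as_factors, as.integer, integer(1))
print(codes)   # both codes are 1 because each factor has a single level

# With stringsAsFactors = FALSE the strings are kept as they are.
as_chars <- data.frame(date, refnr, stringsAsFactors = FALSE)
print(as_chars)
```

This would also explain why passing stringsAsFactors = F, as in the answer above, keeps the original strings.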
I am trying to write a function that will return the number of business days between two dates (not just excluding weekends, but holidays as well). I'm approaching it by building a matrix with rownames corresponding to days of the week with the elements of the matrix either a 1 or a 0: a 0 if it is a holiday or the extra couple elements to fill the matrix.
I've checked the length of each vector in the code. It checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this error message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in Rstudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business (not including holidays)
#days until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXt objects, which have time components and are therefore affected by daylight saving time. Observe:
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
So there is not a whole number of days between the two values, which causes complications later in your code. If you use as.Date rather than strptime, you won't have this problem because Date objects don't care about time.
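To see how the fractional difference leads to the 132 in the warning, the padding arithmetic from the question can be replayed with f = 129 + 1/24 (129 days plus the extra hour). This sketch uses only base R and copies the expressions for v, g and u from the question:

```r
# Date objects have no time of day, so their difference is a whole number.
whole <- as.numeric(as.Date("2014-12-31") - as.Date("2014-08-24"))
print(whole)                       # 129

# The strptime difference carries one extra hour from the DST change.
f <- 129 + 1 / 24

v <- rep(1, f)                     # times is truncated: 129 ones
g <- ((trunc(f / 7) + 1) * 7) - f  # 133 - 129.0417 = 3.9583
u <- rep(0, g)                     # truncated again: 3 zeros
da <- c(v, u)
print(length(da))                  # 132, which is not a multiple of 7

# With the whole-number difference the padding fills exactly 19 weeks.
g_whole <- ((trunc(whole / 7) + 1) * 7) - whole
print(length(c(rep(1, whole), rep(0, g_whole))))  # 133
```

Both rep() calls silently truncate their non-integer counts, so one element goes missing and the vector no longer fills a 7-row matrix.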
But I'm not sure why you're bothering with the matrix at all. I think a simpler implementation would look like this:
dte <- function(date) {
d <- as.Date(date,format="%Y-%m-%d")
c <- Sys.Date()
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- seq(c, d, by="1 day")
return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}
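Because cal comes from an external file, the function above cannot be run as shown. A self-contained variant with the holiday list passed in as an argument (the function name business_days, its parameters and the single holiday are assumptions for illustration) might look like this:

```r
# Count weekdays between two dates (both ends included) that are not holidays.
business_days <- function(from, to, holidays = as.Date(character(0))) {
  days <- seq(as.Date(from), as.Date(to), by = "1 day")
  # wday is 0 for Sunday through 6 for Saturday, so 1:5 is Monday to Friday.
  is_weekday <- as.POSIXlt(days)$wday %in% 1:5
  sum(is_weekday & !(days %in% holidays))
}

# Monday 2014-12-22 through Sunday 2014-12-28, with 2014-12-25 as a holiday.
n <- business_days("2014-12-22", "2014-12-28", holidays = as.Date("2014-12-25"))
print(n)   # 4
```

Like the seq() call in the answer, this counts both the start and the end date if they are business days.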
Is there a good package in R that allows to sub-set (i.e. index into) timeseries by times that are not in the time series?
E.g. for financial applications, indexing a price series by a time stamp that is not in the database, should return the latest available price before the time stamp.
In code, this is what I would like:
n =15
full.dates = seq(Sys.Date(), by = 'day', length = n)
series.dates = full.dates[c(1:10, 12, 15)]
require(zoo)
series=zoo(rep(1,length(series.dates)), series.dates)
series[full.dates[11]]
this returns
Data:
numeric(0)
Index:
character(0)
however, I would like this to return the value of the last existing date before full.dates[11], which is full.dates[10]:
series[full.dates[10]]
2014-01-03
1
Thanks
You can use index to extract index of the observations in your zoo object. The index can then be used for subsetting the object. Step by step to show the logic (you only need the last step, if I have understood you correctly):
# the index of the observations, here dates
index(series)
# are the dates smaller than your reference date?
index(series) < full.dates[11]
# subset observations: dates less than reference date
series[index(series) < full.dates[11]]
# select last observation before reference date:
tail(series[index(series) < full.dates[11]], 1)
# 2014-01-03
# 1
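The same lookup can be sketched without zoo, assuming the dates are sorted: findInterval() returns the position of the last date at or before the reference date. Note that, unlike the strict < above, this returns the observation on the reference date itself when one exists. The vectors and the helper last_obs below are made up for illustration:

```r
full_dates   <- as.Date("2014-01-01") + 0:14
series_dates <- full_dates[c(1:10, 12, 15)]
values       <- seq_along(series_dates)

# Hypothetical helper: value of the last observation at or before 'when'.
last_obs <- function(when) {
  pos <- findInterval(as.numeric(when), as.numeric(series_dates))
  if (pos == 0) return(NA)   # reference date precedes the whole series
  values[pos]
}

print(last_obs(full_dates[11]))   # 10: the value observed on full_dates[10]
```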
A possible alternative may be to expand your time series and "replac[e] each NA with the most recent non-NA" using na.locf and the xout argument (see also ?na.locf and ?approx and this answer)
# expand time series to the range of dates in 'full.dates'
series2 <- na.locf(series, xout = full.dates)
series2
# select observation at reference date
series2[full.dates[10]]
# 2014-01-03
# 1
If you would rather have missing values in your incomplete series replaced by "next observation carried backward", you need to merge your series with a 'dummy' zoo object which contains the desired range of consecutive dates.
series3 <- merge(series, zoo(, full.dates))
na.locf(series3, fromLast = TRUE)
na.locf(x, xout = newdate) seems not much worse than subscripting but at any rate here we define a subclass of "zoo" called "zoo2" in which [ uses na.locf. This is an untested minimal implementation but it could be extended:
as.zoo2 <- function(x) UseMethod("as.zoo2")
as.zoo2.zoo <- function(x) structure(x, class = c("zoo2", setdiff(class(x), "zoo2")))
"[.zoo2" <- function(x, i, ...) {
if (!missing(i) && inherits(i, class(index(x)))) {
zoo:::`[.zoo`(na.locf(x, xout = i),, ...)
} else as.zoo2(zoo:::`[.zoo`(x, i, ...))
}
This gives:
> series2 <- as.zoo2(series)
> series2[full.dates[11]]
2014-01-04
1
I would strongly argue that subset functions should not return the prior row if the desired index value does not exist. Subset functions should return what the user requested; they should not assume the user wanted something different than what they requested.
If this is what you want, you can handle it fairly easily with an if statement.
series.subset <- series[full.dates[11]]
if(NROW(series.subset)==0) {
# merge series with an empty zoo object
# that contains the index value you want
prior <- merge(series, zoo(,full.dates[11]))
# lag *back* one period so the NA is on the prior value
prior <- lag(prior, 1)
# get the index value at the prior value
prior <- index(prior)[is.na(prior)]
# subset again
series.subset <- series[prior]
}