Time series data arrangement - r

I have several time series and I'm trying to align them before further analysis. As you can see in the picture, the 3 financial time series are observed on different dates. I want to eliminate a whole row if at least one series has a blank in it. To align them, I first put the full list of dates on the left side, excluding Saturdays and Sundays, from 1 Jan 2005 to 30 Jun 2015, to use as an index.
Example: at the 11th row the dates do not match, so I want to insert NA values there.
Here's what I've tried:
Day=data.frame(test[,1:2])
Rk=data.frame(test[,3:4])
Vix=data.frame(test[,5:6])
BA=data.frame(test[,7:8])
i=1
k=0
while(i<=2736){
if(Day[i,1]==Rk[i,1]){i=i+1}
else if(Day[i,1]!=Rk[i,1]){
k=k+1
Rk[i+1:k+2634,]=Rk[i:k+2633,]
Rk[i,]=c(Day[i,1],NA)
i=i+1}
}
But it shows the error message: number of items to replace is not a multiple of replacement length
Any kind of help would be very much appreciated.

This is fairly easy if you use a time-series class like xts (or zoo).
# create sample data
set.seed(21)
dayone <- as.Date("2005-01-03")
plusdays <-c(0:4, 7:11, 14:18, 21:24)
test <- data.frame(date=dayone + plusdays)
test$day <- weekdays(test$date, abbreviate=TRUE)
test$date.1 <- dayone + c(plusdays[-11L], 25)
test$Kernel <- rnorm(nrow(test), 3e-5, 1e-6)
test$date.2 <- dayone + c(plusdays[-11L], 25)
test$VIX.High <- round(rnorm(nrow(test), 14, 0.1), 2)
test$date.3 <- dayone + c(plusdays[-11L], 25)
test$Baa.Aaa <- round(rnorm(nrow(test), 0.66, 0.01), 2)
library(xts)
# create xts objects for each column
Rk <- xts(test['Kernel'], test$date.1)
Vix <- xts(test['VIX.High'], test$date.2)
Ba <- xts(test['Baa.Aaa'], test$date.3)
# use xts to merge data
testxts <- merge(xts(,test$date), Rk, Vix, Ba)
I've left out your day column because xts/zoo objects are just a matrix with an index attribute, and you cannot mix types in a matrix. But you can use the .indexwday function to extract the weekday of each row.
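If xts is not available, roughly the same alignment can be sketched in base R with merge() on data frames. This is only an illustration under assumptions: the calendar and the two series below are made-up stand-ins for the question's columns, and the left join (all.x = TRUE) is a choice that keeps only the calendar dates, unlike the outer join that merge() performs on xts objects by default. na.omit() then drops every row where at least one series is missing, which is what the question asks for.

```r
# Weekday calendar used as the index (same construction as the sample data above)
dayone <- as.Date("2005-01-03")
plusdays <- c(0:4, 7:11, 14:18, 21:24)
calendar <- data.frame(date = dayone + plusdays)

# Observation dates: 2005-01-17 is missing and one extra date is appended
obs_dates <- dayone + c(plusdays[-11L], 25)

set.seed(21)
Rk  <- data.frame(date = obs_dates,
                  Kernel = rnorm(length(obs_dates), 3e-5, 1e-6))
Vix <- data.frame(date = obs_dates,
                  VIX.High = round(rnorm(length(obs_dates), 14, 0.1), 2))

# Left joins keep every calendar date; unmatched dates get NA
aligned <- merge(calendar, Rk, by = "date", all.x = TRUE)
aligned <- merge(aligned, Vix, by = "date", all.x = TRUE)

# Drop any row in which at least one series is missing
complete <- na.omit(aligned)
```

Here aligned keeps one row per calendar date, with NA where a series has no observation, and complete is the subset that is safe to pass to further analysis.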

Related

Convert List of lists to data frame where each list within the list are the results from using Sapply + decompose on multiple columns

This is my first project in a coding environment, so I may not phrase things accurately. I am building an ARIMA forecast.
I want to forecast for multiple sectors (business areas) at a time. Using help forums, I have managed to write code that takes my time series data as input, fits the model, and sends the outputs to CSV. I am happy with this.
My problem is that I would also like to capture the results of the decomposition analysis at sector level. Currently, the solution I found elsewhere writes to CSV in an unusable format: everything is spread across rows, and the different lists are split between one row and the next.
Thanks in advance!
My current solution (probably not very efficient but, as I say, cobbled together from forum tips):
Clean data down to TS
NLDemand <- read_excel("TS Demand 2018 + Non London no lockdown.xlsx")
NLDemand <- as_tibble(NLDemand)
NLDemand <- na.omit(NLDemand)
NLDemand <- subset(NLDemand, select = -c(Month,Year))
NLDemand <- subset(NLDemand, select = -c(YearMonth))
##this gets the data to a point where each column has a business sector as its header and the time series data underneath it, with no categorical columns left, e.g.:
Sector 1a, sector1b, sector...
500,450,300
450,500,350
...,...,...
Season capture for all sectors
tsData<-sapply(NLDemand, FUN = ts, simplify = FALSE,USE.NAMES = TRUE,start=c(2018,1),frequency=12)
tsData
timeseriescomponents <- sapply(tsData,FUN=decompose,simplify = FALSE, USE.NAMES = TRUE)
timeseriescomponents
This produces a list of lists, where each sublist holds the decomposed elements of one sector's time series.
##Convert all season captures to the same length
TSC <- list(timeseriescomponents[1:41])
n.obs <- sapply(TSC, length)
seq.max <- seq_len(max(n.obs))
mat <- t(sapply(TSC, "[", i = seq.max ))
##Export to CSV
write.csv(mat, "Non london 2018 + S-T componants.csv", row.names=FALSE)
***What I want as an output is a table that shows each component as a column.
Desired output format
Current output(sample)
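One possible way to get a table with one column per component is to turn each decomposition into its own data frame and stack them. The sketch below is only an illustration: NLDemand is replaced by a small made-up data frame with two hypothetical sectors, and to_df is a helper name introduced here, not something from the original code.

```r
# Hypothetical stand-in for NLDemand: two sectors, 36 monthly values each
set.seed(1)
NLDemand <- data.frame(sector1a = 500 + rnorm(36, 0, 20),
                       sector1b = 450 + rnorm(36, 0, 20))

# Same two steps as in the question, written with lapply
tsData <- lapply(NLDemand, ts, start = c(2018, 1), frequency = 12)
timeseriescomponents <- lapply(tsData, decompose)

# Helper (hypothetical name): one row per month, one column per component
to_df <- function(name) {
  d <- timeseriescomponents[[name]]
  data.frame(sector   = name,
             period   = as.numeric(time(d$x)),
             observed = as.numeric(d$x),
             trend    = as.numeric(d$trend),
             seasonal = as.numeric(d$seasonal),
             random   = as.numeric(d$random))
}

# Stack the per-sector tables into one long data frame
components <- do.call(rbind, lapply(names(timeseriescomponents), to_df))
# write.csv(components, "components.csv", row.names = FALSE)
```

The trend and random columns contain NA at the start and end of each series, because the moving average used by decompose() is undefined there.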

Dealing with big datasets in R

I'm having a memory problem with R, which gives the "cannot allocate vector of size XX Gb" error message. I have a bunch of daily files (12784 days) in netCDF format giving sea surface temperature on a 1305x378 (longitude-latitude) grid. That gives 493290 points each day, decreasing to about 245000 when removing NAs (points over land).
My final objective is to build a time series for each of the 245000 points from the daily files and find the temporal trend for each point. My idea was to build a big data frame with one point per row and one day per column (245000x12784) so I could apply the trend calculation to any point. But when building such a data frame the memory problem appeared, as expected.
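A back-of-the-envelope estimate shows why the full table does not fit in memory. It assumes every value is stored as an 8-byte double and ignores any data frame overhead and temporary copies, so the real requirement would be higher.

```r
n_points <- 245000   # non-NA grid points (figure from the text)
n_days   <- 12784    # number of daily files (figure from the text)

bytes <- n_points * n_days * 8   # 8 bytes per double
gib   <- bytes / 2^30            # convert to GiB

round(gib, 1)   # about 23 GiB for the values alone
```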
First I tried a script I had previously used to read the data and extract a three-column (lon-lat-sst) data frame by reading the nc file and then melting the data. This led to excessive computing time when tried on a small set of days, and to the memory problem. Then I tried to subset the daily files into longitudinal slices; this avoided the memory problem, but the csv output files were too big and the process was very time consuming.
Another strategy I've tried, without success so far, is to sequentially read all the nc files, extract all the daily values for each point, and find the trend. Then I would only need to save a single data frame of 245000 points. But I think this would be time consuming and not the proper R way.
I have been reading about the bigmemory and ff packages to try to declare a big.matrix or a 3D array (1305 x 378 x 12784), but have had no success so far.
What would be the appropriate strategy to face the problem?
Extract single point time series to calculate individual trends and populate a smaller dataframe
Subset daily files in slices to avoid the memory problem but end with a lot of dataframes/files
Try to solve the memory problem with bigmemory or ff packages
Thanks in advance for your help
EDIT 1
Add code to fill the matrix
library(stringr)
library(ncdf4)
library(reshape2)
library(dplyr)
# paths
ruta_datos<-"/home/meteo/PROJECTES/VERSUS/CMEMS/DATA/SST/"
ruta_treball<-"/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL/"
setwd(ruta_treball)
sst_data_full <- function(inputfile) {
sstFile <- nc_open(inputfile)
sst_read <- list()
sst_read$lon <- ncvar_get(sstFile, "lon")
sst_read$lats <- ncvar_get(sstFile, "lat")
sst_read$sst <- ncvar_get(sstFile, "analysed_sst")
nc_close(sstFile)
sst_read
}
melt_sst <- function(L) {
dimnames(L$sst) <- list(lon = L$lon, lat = L$lats)
sst_read <- melt(L$sst, value.name = "sst")
}
# One month list file: This ends with a df of 245855 rows x 33 columns
files <- list.files(path = ruta_datos, pattern = "SST-CMEMS-198201")
sst.out=data.frame()
for (i in 1:length(files) ) {
sst<-sst_data_full(paste0(ruta_datos,files[i],sep=""))
msst <- melt_sst(sst)
msst<-subset(msst, !is.na(msst$sst))
if ( i == 1 ) {
sst.out<-msst
} else {
sst.out<-cbind(sst.out,msst$sst)
}
}
EDIT 2
Code used on a previous (smaller) data frame to calculate the temporal trend. The original data was a matrix of time series, with each column being one series.
library(forecast)
data<-read.csv(....)
output<-numeric(length(data)-1) # one trend slope per series; column 1 is fecha
for (i in 2:length(data)){
var<-paste("V",i,sep="")
ff<-data$fecha
valor<-data[,i]
datos2<-as.data.frame(cbind(data$fecha,valor))
datos.ts<-ts(datos2$valor, frequency = 365)
datos.stl <- stl(datos.ts,s.window = 365)
datos.tslm<-tslm(datos.ts ~ trend)
summary(datos.tslm)
output[i-1]<-datos.tslm$coefficients[2]
}
fecha is the name of the date variable
EDIT 3
Working code from F. Privé's answer
library(bigmemory)
tmp <- sst_data_full(paste0(ruta_datos,files[1],sep=""))
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files),backingfile = "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL" )
for (i in seq_along(files)) {
mat[, i] <- sst_data_full(paste0(ruta_datos,files[i],sep=""))$sst
}
With this code a big matrix was created
dim(mat)
[1] 493290 12783
mat[1,1]
[1] 293.05
mat[1,1:10]
[1] 293.05 293.06 292.98 292.96 292.96 293.00 292.97 292.99 292.89 292.97
ncol(mat)
[1] 12783
nrow(mat)
[1] 493290
So, to read your data into a Filebacked Big Matrix (FBM), you can do
files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201",
full.names = TRUE)
tmp <- sst_data_full(files[1])
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files))
for (i in seq_along(files)) {
mat[, i] <- sst_data_full(files[i])$sst
}
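Once the matrix is filled, the trend of every point can be computed a block of rows at a time, so only a small slice is ever held in memory. The sketch below is an illustration under assumptions: a small in-memory matrix of simulated values stands in for the FBM (which, as the output above shows, supports the same `[` indexing), the trend is taken to be the ordinary least-squares slope against the day index, and NA values (land points) are not handled.

```r
# Small simulated matrix standing in for the FBM: rows = points, columns = days
set.seed(42)
n_points <- 50
n_days <- 200
t_idx <- seq_len(n_days)
true_slope <- seq(-0.01, 0.01, length.out = n_points)
mat <- outer(true_slope, t_idx) +
  matrix(rnorm(n_points * n_days, sd = 0.01), n_points, n_days)

# OLS slope of y on t is sum((t - mean(t)) * y) / sum((t - mean(t))^2)
tc <- t_idx - mean(t_idx)
slopes <- numeric(nrow(mat))
block <- 10   # rows per block; an arbitrary choice for this sketch

for (start in seq(1, nrow(mat), by = block)) {
  rows <- start:min(start + block - 1, nrow(mat))
  chunk <- mat[rows, , drop = FALSE]            # only this slice is in memory
  slopes[rows] <- as.vector(chunk %*% tc) / sum(tc^2)
}
```

The result is a single vector with one slope per point, which is small enough to store as an ordinary data frame column.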

write a for loop to automatically create subsets of datasets in r

Please help me, as I am new to R and to programming.
I am trying to write a loop that reads the data 1000 rows at a time and creates a dataset in R from each block.
Following is my attempt:
for(i in 0:nl){
df[i] = fread('RM.csv',skip = 1000*i, nrows =1000,
col.names = colnames(read.csv('RM.csv', nrow=1, header = T)))
}
where nl is an integer equal to the length of the data in 'RM.csv'.
What I am trying to do is create a function which skips 1000 rows, reads the next 1000 rows, and terminates once it reaches nl, the length of the original data.
It is not mandatory to use only this approach.
You can try reading the entire file into a single data frame, and then subsetting off the rows you don't want:
df <- read.csv('RM.csv', header=TRUE)
y <- seq_len(nrow(df)) - 1 # zero-based row offsets, sized to the data frame
seq.keep <- y[floor(y / 1000) %% 2 == 0] # offsets in every other block of 1000
df.keep <- df[seq.keep + 1, ] # R row numbers start at 1
Here is a rather messy demo which shows that the above sequence logic is correct:
Demo
You can inspect that the sequence of offsets generated is:
0-999
2000-2999
4000-4999
etc.
Because the sequence is built from nrow(df), it never indexes past the end of the data frame, which would otherwise produce rows of NA.
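If you need to continue with your current approach, then try reading in only every other 1000 lines, e.g.
library(data.table)
# i counts blocks of 1000 rows (nl is assumed to be the number of data rows)
sq <- seq(from=0, to=floor((nl - 1) / 1000), by=2)
names <- colnames(read.csv('RM.csv', nrows=1, header=TRUE))
for(i in sq) {
# skip the header line plus the first 1000*i data rows
df_i <- fread('RM.csv', skip=1000*i + 1, nrows=1000, header=FALSE, col.names=names)
# process this chunk and move on
}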
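If the goal is simply to split the file into consecutive blocks, a list is a convenient container for the pieces. The following base-R sketch is an illustration only: it writes a tiny temporary CSV as a hypothetical stand-in for RM.csv, uses a block size of 10 instead of 1000, and uses read.csv rather than fread so that it runs without extra packages.

```r
# Small temporary CSV standing in for RM.csv (hypothetical data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, x = (1:25) * 2), tmp, row.names = FALSE)

chunk_size <- 10
col_names <- names(read.csv(tmp, nrows = 1))
n_rows <- length(readLines(tmp)) - 1   # data rows, excluding the header line

chunks <- list()
for (start in seq(0, n_rows - 1, by = chunk_size)) {
  # skip the header plus the rows already read, then read one block
  chunks[[length(chunks) + 1]] <- read.csv(tmp, skip = start + 1,
                                           nrows = chunk_size,
                                           header = FALSE,
                                           col.names = col_names)
}
```

Each element of chunks is then an ordinary data frame, and the last one is simply shorter when the row count is not a multiple of the block size.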

Loop over number preceded by underscore symbol in R

I apologize in advance but I did not find what I need in previous topic-related posts.
Suppose that I have the following data. "bchain" is a dataframe of 2192 observations. The column "Date" contains dates from 2011/01/01 to 2016/12/31. The column "Value" contains daily exchange rates.
>bchain
Date Value
1 2011-01-01 0.299998
2 2011-01-02 0.299996
3 2011-01-03 0.299998
4 2011-01-04 0.299899
5 2011-01-05 0.298998
6 2011-01-06 0.299000
7 2011-01-07 0.322000
8 2011-01-08 0.322898
. ....... .......
What I want to do is to visualize the exchange rates year by year in separate plots and save the six graphs on my desktop by using a "for" loop. Consider the following simple pseudo-code, which I built around the content of this post:
https://www.r-bloggers.com/automatically-save-your-plots-to-a-folder/
PSEUDO-CODE:
Date_2011=bchain[1:365,1]
Date_2012=bchain[366:731,1]
Date_2013=bchain[732:1096,1]
Date_2014=bchain[1097:1461,1]
Date_2015=bchain[1462:1826,1]
Date_2016=bchain[1827:2192,1]
bchain_2011=bchain[1:365,2]
bchain_2012=bchain[366:731,2]
bchain_2013=bchain[732:1096,2]
bchain_2014=bchain[1097:1461,2]
bchain_2015=bchain[1462:1826,2]
bchain_2016=bchain[1827:2192,2]
years=2011:2016
for(i in years){
mypath = file.path("C:/Users/toshiba1/Desktop",paste("myplot_", years[i], ".jpg", sep = ""))
jpeg(file=mypath)
mytitle = paste("my title is", years[i])
plot(Date_[i],bchain_[i], main = mytitle)
dev.off()
}
Then I get the following error message: object "Date_" not found. I suspect that the problem is that the above loop does not recognize the numbers which come after the underscore sign. So, any suggestion?
Thank you in advance.
Here is another approach that avoids the need to make the year-specific data frames. I used the lubridate package to extract the year from the date values, built a data.frame for that year, and plotted those data. As @Konrad also pointed out, the way in which you refer to some of the objects is giving you issues; I cleaned up some of those in your paste statements below.
library(lubridate)
# Create toy data to plot
bchain <- data.frame(Date = seq.Date(from = as.Date("2011-01-01"), to = as.Date("2016-12-31"),
by = 1),
Value = runif(2192, 0, 1))
years <- 2011:2016
for(i in years){
# Create dataset of just data to plot
bchain_plot <- bchain[year(bchain$Date) == i, ]
# Edited file name w/i jpeg call and fixed paste statement
jpeg(filename=paste0("C:/Users/toshiba1/Desktop/myplot_", i, ".jpg"))
# Plot data w/ title included in plot call
plot(bchain_plot$Date, bchain_plot$Value, main = paste("my title is", i))
dev.off()
}
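The same idea can also be sketched without lubridate by splitting the data frame on the year with split(). This is only an illustration under assumptions: the data are random toy values, the plots go to a temporary folder rather than the Desktop path, and the pdf() device is used instead of jpeg() because it is available in every R installation.

```r
# Toy data covering 2011-01-01 to 2016-12-31 (2192 days)
bchain <- data.frame(Date = seq(as.Date("2011-01-01"), as.Date("2016-12-31"),
                                by = "day"),
                     Value = runif(2192))

# One data frame per year, named "2011", "2012", ...
by_year <- split(bchain, format(bchain$Date, "%Y"))

out_dir <- tempdir()   # hypothetical output folder
for (yr in names(by_year)) {
  pdf(file.path(out_dir, paste0("myplot_", yr, ".pdf")))
  plot(by_year[[yr]]$Date, by_year[[yr]]$Value,
       main = paste("my title is", yr))
  dev.off()
}
```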
You should refer to your objects properly. One approach is to use get, along these lines:
# Now plot data number i
x <- get(paste("Date", i, sep = "_"))
# Plot
plot(x)
or simply by nesting:
plot(get(paste("Date", i, sep = "_")))
To test it, see what happens if you type Date_[i] in the R console. Are you getting the object you want to pass to the plot function? Arrive at the desired object via get or any other mechanism that suits you, and then pass it to the plotting function.
I reckon that you want to iterate through your objects, so you need i, not [i]. Type [i] in the R console and see what happens.
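The following sketch may help to see why the loop in the question fails. It assumes two small made-up vectors named like the question's objects. Because i already holds the year, years[i] indexes far past the end of the vector and gives NA, while Date_[i] is read by R as indexing an object called Date_, which does not exist.

```r
# Made-up objects named like those in the question
Date_2011 <- as.Date("2011-01-01") + 0:2
Date_2012 <- as.Date("2012-01-01") + 0:2
years <- 2011:2012

# i is the year itself, so years[i] looks up element 2011 of a length-2 vector
is.na(years[2011])

# Build each object name as a string and fetch the object with get()
looked_up <- lapply(years, function(i) get(paste("Date", i, sep = "_")))
```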

Moving window over zoo time series in R

I'm running into issues while applying a moving window function to a time series dataset. I've imported daily streamflow data (date and value) into a zoo object, as approximated by the following:
library(zoo)
df <- data.frame(sf = c("2001-04-01", "2001-04-02", "2001-04-03", "2001-04-04",
"2001-04-05", "2001-04-06", "2001-04-07", "2001-06-01",
"2001-06-02", "2001-06-03", "2001-06-04", "2001-06-05",
"2001-06-06"),
cfs = abs(rnorm(13)))
zoodf <- read.zoo(df, format = "%Y-%m-%d")
Since I want to calculate the 3-day moving minimum for each month I've defined a function using rollapply:
f.3daylow <- function(x){rollapply(x, 3, FUN=min, align = "center")}
I then use aggregate:
aggregate(zoodf, by=as.yearmon, FUN=f.3daylow)
This promptly returns an error message:
Error in zoo(df, ix[!is.na(ix)]) :
“x” : attempt to define invalid zoo object
The problem appears to be that there is an unequal number of data points in each month, since using the same data frame with an additional date for June gives a correct response. Any suggestions for how to deal with this would be appreciated!
OK, you might be thinking of something like this, then. It pastes the results for each month into one data point, so that it can be returned by the aggregate function. Otherwise, you may also have a look at ?aggregate.zoo for more precise data manipulation.
f.3daylow <- function(x){paste(rollapply(x, 3, FUN=min,
align = "center"), collapse=", ")}
data <- aggregate(zoodf, by=as.yearmon, FUN=f.3daylow)
This returns the following: the rolling window of 3 collapsed into 1 data point per month. To analyse it you will eventually have to break it down again, so this is not recommended.
Apr 2001
0.124581285281643, 0.124581285281643, 0.124581285281643,
0.342222172241979, 0.518874882033892
June 2001
0.454158221843514, 0.454158221843514, 0.656966528249837,
0.513613009234435
You can then cut it up again via strsplit(data[1],", "), but see "Convert comma separated entry to columns" for more details.
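An alternative that avoids pasting and splitting strings is to keep the per-month results in a list, which sidesteps the need to squeeze them into a single value per month. The sketch below is a base-R stand-in rather than zoo code: roll_min3 is a helper name introduced here, it uses embed() to build the windows of 3 consecutive observations, and like rollapply it yields two fewer values than there are observations in the month.

```r
dates <- as.Date(c("2001-04-01", "2001-04-02", "2001-04-03", "2001-04-04",
                   "2001-04-05", "2001-04-06", "2001-04-07", "2001-06-01",
                   "2001-06-02", "2001-06-03", "2001-06-04", "2001-06-05",
                   "2001-06-06"))
set.seed(1)
cfs <- abs(rnorm(13))

# Helper (hypothetical name): minimum over each window of 3 consecutive values
roll_min3 <- function(x) apply(embed(x, 3), 1, min)

# Split the values by month, then apply the rolling minimum within each month
by_month <- split(cfs, format(dates, "%Y-%m"))
lows <- lapply(by_month, roll_min3)
```

Each element of lows is a plain numeric vector, so months with different numbers of observations cause no problem.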

Resources