SuperLearner for survival outcome in R

I recently started reading about SuperLearner and I am trying to run it for a survival outcome in R. I found example code in the Targeted Learning book by Mark J. van der Laan and Sherri Rose, which requires the data to be converted to long format before it will run.
The function that converts the data to long format is no longer available. Here is the code:
library(survival)
data(lung)
subLung <- subset(lung, select = c(time, status, age,ph.ecog, ph.karno, pat.karno))
subLung$female <- (lung$sex - 1)
subLung <- subLung[complete.cases(subLung), ]
## Expand subLung to Long Format
longData <- SuperLearner:::createDiscrete(time = subLung$time,
    event = (subLung$status == 2),
    dataX = subset(subLung, select = -c(time, status)),
    n.delta = 30)
The createDiscrete function is no longer available in the SuperLearner package. Is there another function that will convert the data to long format? If not, a toy example of how to convert the data into the appropriate long format would be very helpful. Sample R code that runs SuperLearner for a survival outcome would also be helpful.

I found the answer. To run SuperLearner for a survival outcome, the data has to be converted to counting process format, meaning that the time variable is split into intervals so that at most one event can happen per subject in any given interval. The survSplit function in the survival package does that. Thanks to Dr. Eric C. Polley.
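A minimal sketch of that conversion on the lung data from the question. The 30-day cut points and the column names event, interval and id are illustrative choices made here, not something taken from the book or from the removed createDiscrete function.

```r
library(survival)

# Same preparation as in the question (lung ships with the survival package)
subLung <- subset(lung, select = c(time, status, age, ph.ecog, ph.karno, pat.karno))
subLung$female <- lung$sex - 1
subLung <- subLung[complete.cases(subLung), ]

# Recode status (1 = censored, 2 = dead) as a 0/1 event indicator
subLung$event <- as.integer(subLung$status == 2)
subLung$status <- NULL

# Assumed interval width of 30 days; any grid of cut points can be used
cuts <- seq(30, max(subLung$time), by = 30)

# One row per subject and interval: tstart is the interval start, time the
# interval end, event is 1 only in the interval where the death occurs
longData <- survSplit(Surv(time, event) ~ ., data = subLung,
                      cut = cuts, episode = "interval", id = "id")

head(longData)
```

Each subject now contributes one row per interval at risk, so a binary learner could be fit with event as the outcome and interval plus the baseline covariates as predictors. How that fit is wired into SuperLearner is left open here.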

Related

What is the correct time format for random Forest?

I've created a randomForest model to predict the working state of a machine. One of the variables used for prediction is "time". I've converted the timestamp of my dataset (POSIXct format) to ITime format, so that I only have the time of day without the date.
kwpower <- kwpower %>% mutate(time = as.ITime(timestamp, format = "%H:%M:%S"))
The model runs fine, and varImpPlot tells me that "time" is the most important variable. I wonder whether this can really be true and whether ITime is the right time format to use in randomForest. What is the correct or best time format when using randomForest?
I hope you can help me, thanks in advance :)

how to get tsclean working on data frame with multiple time series

I'm in the process of creating a forecast based on the hts package, but before getting that far I need to clean the data for outliers and missing values.
For this I thought of using the tsclean function in the forecast package. My data is stored in a data frame with multiple columns (time series) that I wish to clean. I can get the function to work with a single time series, but since I have quite a lot of them I'm looking for a smart way to do this.
When running the code:
SFA5 <- ts(SFA4, frequency=12, start=c(2012,1), end=c(2017,10))
ggt <- tsclean(SFA5[1:70, 1:94], replace.missing = TRUE)
I get this error message:
Error in na.interp(x, lambda = lambda) : The time series is not univariate.
The data is here:
https://www.dropbox.com/s/dow2jpuv5unmtgd/Data1850.xlsx?dl=0
My question is: what am I doing wrong, or is a loop the only solution?
The error message says that the function accepts only a univariate time series as its first argument. So you need to apply tsclean to each column, as you might have guessed.
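library(forecast)
# Clean column by column; SFA5[, j] keeps the ts attributes (start, frequency)
ggt <- SFA5
for (j in seq_len(ncol(SFA5))) ggt[, j] <- tsclean(SFA5[, j], replace.missing = TRUE)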

ts object not recognised in hybridModel of forecastHybrid package

Data is something like this:
df <- tribble(
~y, ~timestamp,
18.74682, 1500256800,
19.00424, 1500260400,
18.86993, 1500264000,
18.74960, 1500267600,
18.99854, 1500271200,
18.85443, 1500274800,
18.78031, 1500278400,
18.97948, 1500282000,
18.86576, 1500285600,
18.55633, 1500289200,
18.79052, 1500292800,
18.74790, 1500296400,
18.62743, 1500300000,
19.04696, 1500303600,
18.97851, 1500307200,
18.70956, 1500310800,
18.92302, 1500314400,
18.91465, 1500318000,
18.61556, 1500321600,
19.03535, 1500325200 )
I'm trying to apply hybridModel to time series data to build an ensemble. Below is my code:
library(tidyquant)
library(forecast)
library(timetk)
library(sweep)
library(forecastHybrid)
df <- mutate(df, timestamp = as_datetime(timestamp))
tk_ts_df <- tk_ts(df, start = 1, freq = 3600, silent = TRUE)
fit <- hybridModel(tk_ts_df)
When I pass the time series object tk_ts_df (a ts object) to hybridModel, it gives this error: "The time series must be numeric and may not be a matrix or dataframe object."
But on link: https://cran.r-project.org/web/packages/forecastHybrid/vignettes/forecastHybrid.html
It's clearly mentioned : The workhorse function of the package is hybridModel(), a function that combines several component models from the “forecast” package. At a minimum, the user must supply a ts or numeric vector for y
Please suggest what I'm doing wrong.
The "forecastHybrid" package requires the input time series to be a numeric vector or a ts object. While the "timetk" package does return a ts object, it also adds attributes that regular ts objects do not have, so the input checks fail.
See the discussion here and the fixing commit here.
The latest version from Github incorporating the fix can be downloaded with
devtools::install_github("ellisp/forecastHybrid/pkg")
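If installing from GitHub is not an option, one possible workaround is to rebuild a plain univariate ts from the numeric values before calling hybridModel. The sketch below uses base R only and a hand-made stand-in for the tk_ts() result; the attribute name extra_index is hypothetical, and the approach has not been checked against any particular forecastHybrid version.

```r
# Stand-in for the object returned by tk_ts(): a one-column ts matrix
# that also carries an extra, non-standard attribute (hypothetical name)
y <- c(18.74682, 19.00424, 18.86993, 18.74960, 18.99854, 18.85443)
ts_with_extras <- ts(matrix(y, ncol = 1, dimnames = list(NULL, "y")),
                     start = 1, frequency = 3600)
attr(ts_with_extras, "extra_index") <- seq_along(y)

# as.numeric() drops the dimensions and all attributes; ts() then restores
# only the standard time series properties
plain_ts <- ts(as.numeric(ts_with_extras),
               start = start(ts_with_extras),
               frequency = frequency(ts_with_extras))

str(plain_ts)
```

The resulting plain_ts is a dimensionless numeric ts, which is the kind of input the error message asks for.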

How to use dyn package to perform regression on xts object?

I've recently learned that there is a package called dyn which can perform regressions on xts objects, but I have trouble reading the manual.
Suppose there is a dataset like the one below:
library(xts)
library(dyn)
data(sample_matrix)
# sample_matrix is a built-in dataset in the xts package
xtsObject <- as.xts(sample_matrix)[, "Close"]
# Extract the daily close price into xtsObject
I tried the code below, but it gives me an error message.
dyn$lm(xtsObject ~ index(xtsObject))
Is this code correct? If not, how should I do it? (I want xtsObject as the dependent variable and the time or date of each observation as the independent variable.)
This looks like a bug but here is a workaround:
tt <- xts(time(xtsObject), time(xtsObject))
dyn$lm(xtsObject ~ tt)
Note that you can ask dyn questions on https://groups.google.com/forum/#!forum/sqldf

Using 'PerformanceAnalytics' package to calculate Performance Measures

I need to use the 'PerformanceAnalytics' package in R, and I understand that to use it I have to convert my data, which is panel data, into xts format. Following this forum's suggestion I have done the following:
library(foreign)
RNOM <- read.dta("Return Panel without missing.dta")
RNOM_list<-split(RNOM,RNOM$gvkey)
xts_list<-lapply(RNOM_list,function(x)
{out<-xts(x[,-1],order.by=as.Date(x$datadate,format="%d/%m/%Y")) })
It gives me RNOM_list and xts_list.
After this, can someone please help me estimate the monthly returns using the function Return.calculate and lapply, and save the output as an additional variable in my original data set for regression analysis? Subsequently, I also need to estimate VaR, ES and semi-sd.
The data can be downloaded here. Note, prccm is the monthly closing price in the data and gvkey is the firm ID.
An efficient way to achieve this is to convert the panel data (long format) into wide format using the 'reshape2' package. After performing the estimations, convert it back to long (panel) format. Here is an example:
library(foreign)
library(reshape2)
library(PerformanceAnalytics)  # also loads xts and zoo
# DDA.dta is a Stata file; keep only the date, the id and the variable of interest (three columns in total)
dd <- read.dta("DDA.dta")
wdd <- dcast(dd, datadate ~ gvkey)  # gvkey is the id
wddxts <- xts(wdd[, -1], order.by = as.Date(wdd$datadate, format = "%Y-%m-%d"))
ssd60A <- rollapply(wddxts, width = 60, SemiDeviation, by.column = TRUE, fill = NA)  # e.g. a rolling-window calculation
ssd60A.df <- as.data.frame(ssd60A)  # convert the xts object to a data frame
ssd60A.df$datadate <- rownames(ssd60A.df)  # insert the time index
lssd60A.df <- melt(ssd60A.df, id.vars = "datadate", variable.name = "gvkey")  # convert back to panel format
write.dta(lssd60A.df, "ssd60A.dta", convert.factors = "string")  # export as a Stata file
Then simply merge it with the master database to perform some regression.
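For the monthly returns asked about in the question, a simpler route may be to stay in long format and compute simple (discrete) returns per firm with base R, rather than going through Return.calculate. The sketch below uses a toy panel with made-up prices; only the column names gvkey, datadate and prccm come from the question, and ret is a name chosen here.

```r
# Toy panel (hypothetical values): gvkey is the firm ID, prccm the monthly close
panel <- data.frame(
  gvkey    = rep(c(1001, 1002), each = 4),
  datadate = rep(as.Date(c("2015-01-31", "2015-02-28",
                           "2015-03-31", "2015-04-30")), times = 2),
  prccm    = c(10, 11, 12.1, 12.1, 50, 45, 45, 54)
)

# Returns depend on row order, so sort by firm and date first
panel <- panel[order(panel$gvkey, panel$datadate), ]

# Simple return P_t / P_{t-1} - 1 within each firm; the first month is NA
panel$ret <- ave(panel$prccm, panel$gvkey,
                 FUN = function(p) c(NA, p[-1] / p[-length(p)] - 1))

panel
```

Because the result is already a column of the panel, no merge step is needed before running the regression.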
