I want to prepare my data for survival analysis modelling in R

We have 500 entries like this, and entries may be repeated.
The date represents the date on which that particular car part (carparts) malfunctioned. We have to predict the date on which a car part (carparts) is going to malfunction.
The code is written in R. The code to build the table is below:
q <- c("Mercedes", "Audi", "Tata", "Renault", "Ferrari", "Lamborghini")
w <- sample(q, 500, replace = TRUE)
m <- c("accelerator", "gear", "coolant", "brakes", "airbags")
k <- sample(m, 500, replace = TRUE)
e <- seq(as.Date("2010/1/1"), as.Date("2011/1/1"), by = "days")
l <- sample(e, 500, replace = TRUE)
test <- list(w, k, l)
t2 <- as.data.frame(test)
names(t2) <- c("carnames", "carparts", "date")
t2$Diffdate <- as.numeric(t2$date - as.Date("2010-01-01"))
head(t2)
I'm preparing my data for survival analysis. In the code above I haven't included the censor and event variables (I tried a rough draft and it went messy). I just need an idea of how to include the event and censor variables along with the carparts and carnames variables. I'm getting stuck because I'm unable to frame all the variables in a single table.
The two problems I'm facing are:
1> I can't find a way to keep the carparts, carnames, event and censor variables in one table.
2> The event variable is always 1 in each entry, because for each entry (row) a breakdown/defect of a car part happens. Is it OK to have that?
In all the examples I've seen on the internet, the event variable had both ones and zeroes.
Edit 1: It's not necessary to do it in R; you may write it down (draw the table including the existing columns as well as the censor and event variables) on a piece of paper and attach a snapshot.
Thanks
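One way to frame it, as a minimal sketch: add an event column and a capped time column in the same data frame, assuming an (invented) study cutoff date of 2010-10-01. Failures observed before the cutoff are events (1); anything later is treated as censored (0) with its time capped at the cutoff. The cutoff and seed here are illustrative, not from the question.

```r
set.seed(1)
q <- c("Mercedes", "Audi", "Tata", "Renault", "Ferrari", "Lamborghini")
w <- sample(q, 500, replace = TRUE)
m <- c("accelerator", "gear", "coolant", "brakes", "airbags")
k <- sample(m, 500, replace = TRUE)
e <- seq(as.Date("2010/1/1"), as.Date("2011/1/1"), by = "days")
l <- sample(e, 500, replace = TRUE)
t2 <- data.frame(carnames = w, carparts = k, date = l)
t2$Diffdate <- as.numeric(t2$date - as.Date("2010-01-01"))

cutoff <- as.Date("2010-10-01")             # assumed end of observation
t2$event <- ifelse(t2$date < cutoff, 1, 0)  # 1 = failure observed, 0 = censored
t2$time  <- pmin(t2$Diffdate, as.numeric(cutoff - as.Date("2010-01-01")))

## With the survival package, the response would then be Surv(time, event),
## e.g. survfit(Surv(time, event) ~ carparts, data = t2)
head(t2)
```

On the second problem: if every row really records an observed breakdown, an event column that is all 1s is legitimate (no censoring in the data). Zeroes only arise when follow-up ends before a failure is seen, which is what the cutoff above simulates.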

Convert List of lists to data frame where each list within the list are the results from using Sapply + decompose on multiple columns

This is my first project using a coded environment, so I may not phrase things accurately. I am building an ARIMA forecast.
I want to forecast for multiple sectors (business areas) at a time. Using help forums I have managed to write code that takes my time series data as input, fits the model, and sends the outputs to CSV. I am happy with this.
My problem is that I would also like to capture the results of the decomposition analysis at the sector level. Currently, when I use a solution I found elsewhere, it outputs to CSV in an unusable format, where everything is spread by row and the different lists are half in one row and half in another.
Thanks in advance!
My current solution (probably not super efficient but, like I say, cobbled together from forum tips):
## Clean data down to TS
NLDemand <- read_excel("TS Demand 2018 + Non London no lockdown.xlsx")
NLDemand <- as_tibble(NLDemand)
NLDemand <- na.omit(NLDemand)
NLDemand <- subset(NLDemand, select = -c(Month,Year))
NLDemand <- subset(NLDemand, select = -c(YearMonth))
## this gets the data to a point where each column has a business-sector header and the time series data underneath it, with no categorical columns left, e.g.:
Sector 1a, sector1b, sector...
500,450,300
450,500,350
...,...,...
## Season capture for all sectors
tsData<-sapply(NLDemand, FUN = ts, simplify = FALSE,USE.NAMES = TRUE,start=c(2018,1),frequency=12)
tsData
timeseriescomponents <- sapply(tsData,FUN=decompose,simplify = FALSE, USE.NAMES = TRUE)
timeseriescomponents
This produces a list of lists, where each sublist contains the decomposed elements of that sector's time series.
##Covert all season captures to the same length
TSC <- timeseriescomponents[1:41]
n.obs <- sapply(TSC, length)
seq.max <- seq_len(max(n.obs))
mat <- t(sapply(TSC, "[", i = seq.max ))
##Export to CSV
write.csv(mat, "Non london 2018 + S-T componants.csv", row.names=FALSE)
What I want as an output is a table that shows each component as a column.
Desired output format
Current output (sample)
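The half-in-one-row CSV comes from flattening the list of lists directly. A sketch of an alternative, using a toy stand-in for the real spreadsheet (the sector names and values below are made up, since the Excel file isn't available): loop over the named list of decompose() results and stack them into a long data frame with one row per sector per month and one column per component.

```r
## Toy stand-in for the real data: two "sectors" of monthly demand
NLDemand <- data.frame(Sector1a = sin(1:48) + 10, Sector1b = cos(1:48) + 20)

tsData <- sapply(NLDemand, FUN = ts, simplify = FALSE, USE.NAMES = TRUE,
                 start = c(2018, 1), frequency = 12)
timeseriescomponents <- sapply(tsData, FUN = decompose, simplify = FALSE,
                               USE.NAMES = TRUE)

## Stack the components: one row per sector per month, one column per component
component_df <- do.call(rbind, lapply(names(timeseriescomponents), function(sec) {
  d <- timeseriescomponents[[sec]]
  data.frame(sector   = sec,
             period   = as.numeric(time(d$x)),
             observed = as.numeric(d$x),
             trend    = as.numeric(d$trend),
             seasonal = as.numeric(d$seasonal),
             random   = as.numeric(d$random))
}))

head(component_df)
## write.csv(component_df, "components-long.csv", row.names = FALSE)
```

Each decompose() result carries its components as $x, $trend, $seasonal and $random, so pulling them out by name keeps every sector's numbers aligned in proper columns rather than spread across rows.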

R forecastML package keeps renaming outcome columns

I am trying to use the forecastML R package to run some tests, but the moment I hit this step, it renames the columns:
data <- read.csv("C:\\Users\\User\\Desktop\\DG ST Forecast\\LassoTemporalForecast.csv", header=TRUE)
date_frequency <- "1 week"
dates <- seq(as.Date("2012-10-05"), as.Date("2020-10-05"), by = date_frequency)
data_train <- data[1:357,]
data_test <- data[358:429,]
outcome_col <- 1 # The column index of our DriversKilled outcome.
horizons <- c(1,2,3,4,5,6,7,8,9,10,11,12) # 12 models that forecast 1 through 12 time steps ahead.
# A lookback across select time steps in the past. Feature lags 1 through 9, for instance, will be
# silently dropped from the 12-step-ahead model.
lookback <- c(1)
# A non-lagged feature that changes through time whose value we either know (e.g., month) or whose
# value we would like to forecast.
dynamic_features <- colnames(data_train)
data_list <- forecastML::create_lagged_df(data_train,
                                          type = "train",
                                          outcome_col = 1,
                                          horizons = horizons,
                                          lookback = lookback,
                                          date = dates[1:nrow(data_train)],
                                          frequency = date_frequency,
                                          dynamic_features = colnames(data_train))
After creating data_list, here is a snapshot of what happens in the console:
Next, when I try to create windows following the name change,
windows <- forecastML::create_windows(lagged_df = data_list, window_length = 36,
                                      window_start = NULL, window_stop = NULL,
                                      include_partial_window = TRUE)
plot(windows, data_list, show_labels = TRUE)
I get this error: "Can't subset columns that don't exist. x Column `cases` doesn't exist."
I've checked many times against my input data and the earlier code and still can't understand why the name change occurs. If anyone is familiar with this package, please assist. Thank you!
I'm the package author. It's difficult to tell without a reproducible example, but here's what I think is going on: Dynamic features are essentially features with a lag of 0. Dynamic features also retain their original names, as opposed to lagged features which have "_lag_n" appended to the feature name. So by setting dynamic_features to all column names you are getting duplicate columns specifically for the outcome column. My guess is that "cases" is the outcome here. Fix this by removing dynamic_features = colnames(data_train) and setting it to only those features that you really want to have a lag of 0.
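Following that suggestion, a minimal sketch of the fix: exclude the outcome from dynamic_features before calling create_lagged_df. The column names below (cases, temp, month) are illustrative, not from the question's data.

```r
## Toy stand-in for the training data; "cases" plays the outcome role
data_train <- data.frame(cases = 1:5, temp = 6:10, month = 11:15)
outcome_col <- 1

## Every column EXCEPT the outcome may be a lag-0 (dynamic) feature;
## keeping the outcome in this list is what produced the duplicate,
## renamed outcome column.
dynamic_features <- setdiff(colnames(data_train),
                            colnames(data_train)[outcome_col])
dynamic_features

## Then pass only that vector on, e.g.:
## forecastML::create_lagged_df(data_train, type = "train",
##                              outcome_col = outcome_col, ...,
##                              dynamic_features = dynamic_features)
```

Better still, as the author says, restrict dynamic_features to only the features you genuinely want at lag 0 (e.g. calendar variables), rather than everything except the outcome.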

Regression Analysis

I am currently working on a project and I need some help. I want to predict the length of flight delays using a statistical model. The data set does not contain the length of flight delays, but it can be calculated from the actual and scheduled departure times: actual departure time minus scheduled departure time gives the flight delay, which is the dependent variable. I am struggling to get the explanatory (independent) variables into a useful form for regression analysis; the main problem is the time format of the first two columns when you read the table in from the csv file. I have attached the data file to the question because I wasn't too sure how to attach my file; I'm new to this coding thing hehe. Any help will be appreciated. xx
https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing
EDIT:
Firstly Thank you for all the help
Okay, I'm going to try to ask more precise questions on this topic:
So after importing the file using:
1)
Delays <- read.table("FlightDelaysSM.csv",header =T,sep=",")
2) The main issue I am having is getting the schedtime and deptime columns into a format where I can do arithmetic calculations.
3)I tried the below
Delays[,1] - Delays[,2]
where the obvious issue arises: for example, 800 (8:00am) - 756 (7:56am) = 44, not 4 minutes
4) Using the help from @kerry Jackson (thank you, you're amazing x) I tried:
DepartureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")
ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")
DelayTime = DepartureTime - ScheduleTime
The values given are in seconds; I want the difference to be in minutes. How would I go about doing this?
5) I then did the following:
DelayData <- data.frame(ScheduleTime, DepartureTime, DelayTime, Delays[, 4:7])
What I attain after making DelayData:
As you can see from the image, I have seconds units in the DelayTime column, which I don't want (as stated in 4)), and the date appears in the ScheduleTime and DepartureTime columns. Could I possibly get some suggestions on how to correct this?
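For point 4), difftime() lets you request minutes directly, and as.numeric() strips the units label so the column is a plain number. A sketch with two made-up rows, assuming schedtime and deptime hold HHMM integers as described:

```r
## Toy rows standing in for FlightDelaysSM.csv
Delays <- data.frame(schedtime = c(800, 1755), deptime = c(756, 1810))

DepartureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"),
                          "%H%M")
ScheduleTime  <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"),
                          "%H%M")

## difftime() takes an explicit units argument; as.numeric() drops the
## "mins" label, so DelayMinutes is a plain numeric column
DelayMinutes <- as.numeric(difftime(DepartureTime, ScheduleTime, units = "mins"))
DelayMinutes
```

Because strptime() fills in today's date when only a time is supplied, the dates cancel in the subtraction; the date showing in the ScheduleTime and DepartureTime columns is cosmetic, and keeping only DelayMinutes in the regression table avoids it entirely.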
Create a new column called flight_delay:
install.packages('tidyverse')
library(tidyverse)
your_data <- your_data %>%
  mutate(flight_delay = deptime - schedtime)
Now, create a linear regression model predicting flight_delay by every other variable:
mod <- lm(flight_delay ~ ., data=your_data)
To optimize your model, use the step function:
mod <- step(mod)
Analyze results:
summary(mod)

Including additional data at each stage of a loop

I am trying to create minimum convex polygons (MCPs) for a set of GPS coordinates. Each day has 32 coordinates, and I want to create an MCP with 1 day's, 2 days', 3 days'... and so on worth of data. For instance, in the first step I want to include rows 1-32, which I have managed:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:32, ]
Currently, to select two days' worth of data, I have written:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:64, ]
Is there a way to automate adding 32 rows at each step (in a loop), rather than running the code manually each time and changing the amount of data used by hand?
I am very new to R, so I do not know whether this is possible; the way I thought would work was:
n <- 32
for (i in 1:100) {
  mydata <- mydata[1:n, ]
  ## CREATE MCP AND STORE HOME RANGE OUTPUT
  n <- n + 32
}
However, it does not seem possible to use n as a row index like this. Is there a way to do it?
Apologies if this is unclear, but as I said I am quite new to using R and would really appreciate any help.
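Using n as a row index is fine; the snag in the draft loop above is that mydata is overwritten with its own first n rows on the first pass, so later iterations have nothing left to grow into. A sketch that keeps the full data set intact and slices a growing window each time (with a made-up stand-in for file.csv, and a placeholder where the MCP call would go):

```r
## Stand-in for read.csv("file.csv", ...): 100 days x 32 rows of fake GPS points
mydata <- data.frame(x = runif(3200), y = runif(3200))

results <- vector("list", 100)   # one slot per cumulative window
for (i in 1:100) {
  day_slice <- mydata[1:(32 * i), ]   # rows 1..32, then 1..64, then 1..96, ...
  ## CREATE MCP AND STORE HOME RANGE OUTPUT from day_slice, e.g.
  results[[i]] <- nrow(day_slice)     # placeholder for the MCP/home-range result
}
```

Storing each iteration's output in a pre-allocated list (rather than overwriting one object) keeps all 100 home-range results available after the loop finishes.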

Creating a loop for all variables with a similar name

I'm new to R, and new to this forum. I've searched but cannot easily find an answer to this question:
I have numbers of cases of a disease by week according to location, stored in a .csv file with variable names cases.wk24, cases.wk25, etc. I also have the population for each location, and want to generate incidence rates (cases/population) for each location.
I would like to write a loop that generates incidence rates by location for each week, and stores these in new variables called "ir.wk24", "ir.wk25", etc
I am stuck at 2 points:
Is it possible to tell R to run a loop over every variable whose name looks like "cases.wk"? In some programs one would use a star: cases.wk*
How could I then generate the new variables with sequential names and store them in the dataset?
I really appreciate any help on this - been stuck with internet searches all day!
thanks
x <- data.frame(case.wk24 = c(1, 3), case.wk25 = c(3, 2), pop = c(7, 8))
weeks <- 24:25
varnames <- paste("case.wk", weeks, sep = "")
ir <- sapply(varnames, FUN = function(.varname) {
  x[, .varname] / x[, "pop"]
})
ir <- as.data.frame(ir)
names(ir) <- paste("ir.wk", weeks, sep = "")
x <- cbind(x, ir)
x
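For the wildcard-style matching asked about, grep() can find every column whose name matches a pattern, so the week numbers never have to be typed out. A sketch on the same toy data (note the toy columns are named case.wk*, matching the code above, rather than cases.wk* as in the question text):

```r
x <- data.frame(case.wk24 = c(1, 3), case.wk25 = c(3, 2), pop = c(7, 8))

## grep() with value = TRUE returns the matching names themselves;
## "^case\\.wk[0-9]+$" is the regex equivalent of the wildcard case.wk*
casevars <- grep("^case\\.wk[0-9]+$", names(x), value = TRUE)

for (v in casevars) {
  newname <- sub("^case", "ir", v)   # case.wk24 -> ir.wk24, etc.
  x[[newname]] <- x[[v]] / x$pop     # incidence rate = cases / population
}
x
```

This adds ir.wk24, ir.wk25, ... directly to the data frame, and automatically picks up any further weeks added to the file later.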
