I've got a big database which I've split up by year, and I've created a train and test set for each respective year.
#split the dataset into a list of datasets
Y <- split(dat_all, dat_all$year)
#create a train and test dataset for all years
#takes an element of Y as input
create_sets <- function(x){
  train_set <- sample(2, nrow(x), replace = TRUE, prob = c(0.7, 0.3))
  train <- x[train_set == 1, ]
  test <- x[train_set == 2, ]
  list(train = train, test = test)
}
Ylist <- lapply(Y, create_sets)
To call out each item you use Ylist$'2016'$train
I've made an accuracy ratio function which I can run each list element through individually, but I am looking for a way to do it all in one go to save a massive amount of code (there's 16 years of data).
Below is how I currently create an accuracy ratio for one year
val_train<-Ylist$'2016'$train
val_train$pred<-predict(modf,newdata=Ylist$'2016'$train)
val_train$probs<-exp(val_train$pred)/(1+exp(val_train$pred))
x<-data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
train_AR<-x[2,1]
train_AR
modfull <-ModFit(test)
val_test<-test
val_test$pred<-predict(modf,newdata=test)
val_test$probs<-exp(val_test$pred)/(1+exp(val_test$pred))
x<-data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
test_AR<-x[2,1]
test_AR
AR_Logistic1<-c(train_AR,test_AR,)
AR_Logistic2<-c(train_AR,test_AR) #just in to see if table works
AccuracyRatio<-rbind(AR_Logistic1,AR_Logistic2)
colnames(AccuracyRatio)<-c("Train","Test","All")
AccuracyRatio
Just to clarify, I'm trying to run my whole list through my accuracy ratio function and then output the AR for each year's train and test sets.
Any help is greatly appreciated
By wrapping the AR calculations in a function and using lapply, you can summarise the output as below.
Without sample data I could not test it, but let us know if you face any errors.
fn_Calc_AR <- function(yearDat = listInput) {
  #yearDat is one element of Ylist, e.g. Ylist$'2016'
  trainDat <- yearDat$train
  testDat <- yearDat$test

  val_train <- trainDat
  val_train$pred <- predict(modf, newdata = trainDat)
  val_train$probs <- exp(val_train$pred)/(1+exp(val_train$pred))
  x <- data.frame(rcorr.cens(val_train$probs, val_train$default_flag))
  train_AR <- x[2,1]
  #train_AR

  modfull <- ModFit(testDat)
  val_test <- testDat
  val_test$pred <- predict(modf, newdata = testDat)
  val_test$probs <- exp(val_test$pred)/(1+exp(val_test$pred))
  x <- data.frame(rcorr.cens(val_test$probs, val_test$default_flag))
  test_AR <- x[2,1]
  #test_AR

  AR_Logistic1 <- c(train_AR, test_AR) # removed extraneous comma, previous input c(train_AR,test_AR,)
  AR_Logistic2 <- c(train_AR, test_AR) #just in to see if table works
  AccuracyRatio <- rbind(AR_Logistic1, AR_Logistic2)
  colnames(AccuracyRatio) <- c("Train", "Test") # two values per row, so two column names

  #confirm yearName is being created
  try(yearName <- head(names(x), 1)) #retain only year
  if (length(yearName) > 0L) {
    AR_DF <- data.frame(yearName = yearName, AccuracyRatio, stringsAsFactors = FALSE)
  } else {
    AR_DF <- AccuracyRatio
  }
  return(AR_DF)
}
Summarise Output:
AR_Summary = do.call(rbind,lapply(Ylist,fn_Calc_AR))
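If you also want the year carried as an explicit column (the try(yearName <- ...) block above pulls the name from the rcorr.cens output, which may not actually contain the year), a hedged alternative is to pass the year in from the names of Ylist with Map(). This is only a sketch, untested without your data:
fn_Calc_AR_byYear <- function(yearDat, yearName) {
  # bind the year label onto whatever fn_Calc_AR returns for that element
  cbind(year = yearName, fn_Calc_AR(yearDat))
}
AR_Summary_byYear <- do.call(rbind, Map(fn_Calc_AR_byYear, Ylist, names(Ylist)))
Note that do.call(rbind, lapply(Ylist, fn_Calc_AR)) already prefixes the row names with the year (e.g. 2016.AR_Logistic1), so the Map() variant is only needed if you prefer the year as a column.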
Aggregate Dataset:
aggregateTrain = do.call(rbind,lapply(Ylist,function(x) x$train))
aggregateTest = do.call(rbind,lapply(Ylist,function(x) x$test))
aggregateList = list(train = aggregateTrain,test = aggregateTest)
AR_AggregateSummary = fn_Calc_AR(aggregateList) # aggregateList already has the train/test structure fn_Calc_AR expects
Related
I am trying to create 10 folds of my data. What I want is a data structure of length 10 (the number of folds), where each element contains an object/data structure with two attributes/elements: the training set and the test set for that fold. This is my R code.
I wanted to access, for example, the training set at fold 8 with View(data_pairs[[8]]$training_set), but it did not work. Any help would be appreciated :)
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(data), replace = TRUE)
data_pairs <- list()
for (j in i) {
  test_ind <- which(folds == j, arr.ind = TRUE)
  test <- data[test_ind, ]
  train <- data[-test_ind, ]
  data_pair <- list(training_set = list(train), test_set = list(test))
  data_pairs <- append(x = data_pairs, values = data_pair)
}
You were very close, you just needed to wrap values in a list call.
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(mtcars), replace = TRUE)
data_pairs <- list()
for (j in i) {
  test_ind <- which(folds == j, arr.ind = TRUE)
  test <- mtcars[test_ind, ]
  train <- mtcars[-test_ind, ]
  data_pair <- list(training_set = train, test_set = test)
  data_pairs <- append(x = data_pairs, values = list(data_pair))
  #data_pairs <- c(data_pairs, list(data_pair))
}
If your data is big I would suggest you read these two posts on more efficient ways to grow a list.
Append an object to a list in R in amortized constant time, O(1)?
Here we go again: append an element to a list in R
I would also like to point out that you are not creating "folds" of your data. In your case you are attempting 10-fold cross-validation, which means your data should be separated into 10 roughly equal-sized chunks. Then you create 10 train/test data sets, using each fold as the test data and the rest for training, for example as sketched below.
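A minimal sketch of that idea (using mtcars as the data, as above): assign each row a fold label from a shuffled, repeated 1:k vector so the folds are as equal-sized as possible, then build one train/test pair per fold.
k <- 10
set.seed(1)
fold_id <- sample(rep(1:k, length.out = nrow(mtcars)))  # one fold label per row
folds <- lapply(1:k, function(j) {
  list(training_set = mtcars[fold_id != j, ],
       test_set     = mtcars[fold_id == j, ])
})
table(fold_id)  # fold sizes differ by at most one row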
It seems like the package modelr could help you here.
In particular I would point you to:
https://modelr.tidyverse.org/reference/resample_partition.html
library(modelr)
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))
mod <- lm(mpg ~ wt, data = ex$train)
rmse(mod, ex$test)
#> [1] 3.229756
rmse(mod, ex$train)
#> [1] 2.88216
Alternatively, producing a dataset of these partitions can be done with:
crossv_mc(data, n, test = 0.2, id = ".id")
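For instance (a sketch based on modelr's documented interface, again with mtcars), crossv_mc returns a data frame with list-columns train and test holding resample objects, plus an .id column:
library(modelr)
library(purrr)
cv <- crossv_mc(mtcars, n = 10, test = 0.3)
models <- map(cv$train, ~ lm(mpg ~ wt, data = .))
map2_dbl(models, cv$test, rmse)  # one RMSE per partition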
I am pasting my code here.
I am following an online course in R and I was trying to automate a multiple-variable regression. I have tried to check what's going on: at the beginning it works, but when it comes to the last two variables it enters a loop and does not eliminate them, even though it enters the if.
At the end, I have this error
Error in if (maxVar > sl) { : missing value where TRUE/FALSE needed
Here is the code
backwardElimination <- function(training, sl) {
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ .,
                     data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  if (maxVar > sl){
    for (j in c(1:numVar)){
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}
Thanks in advance
Edit: this is the rest of my code
#importing dataset
dataset = read.csv('50_Startups.csv')
# Encoding categorical data
dataset$State = factor(dataset$State,
levels = c('New York', 'California', 'Florida'),
labels = c(1, 2, 3))
#splitting in train / test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 4/5)
trainingSet = subset(dataset, split == TRUE)
testSet = subset(dataset, split == FALSE)
#Transforming state in dummy variables
trainingSet$State = factor(trainingSet$State)
dummies = model.matrix(~trainingSet$State)
trainingSet = cbind(trainingSet,dummies)
profit = trainingSet$Profit
trainingSet = trainingSet[, -4]
trainingSet = trainingSet[, -4]
trainingSet = cbind(trainingSet,profit)
#calling the function
SL = 0.05
backwardElimination(trainingSet, SL)
This error indicates that you have an NA instead of a boolean value in your if statement.
if (NA) {}
## Error in if (NA) { : missing value where TRUE/FALSE needed
Either your p contains NA, or sl is NA.
The intercept's p-value is also included in p and fed into the next step, so you need to get rid of it before moving to the next iteration.
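A minimal sketch of that fix inside backwardElimination (hedged; it also skips NA p-values rather than letting them propagate into the comparison):
p = summary(funzRegressor)$coefficients[, 4]
p = p[names(p) != "(Intercept)"]   # don't consider the intercept for removal
maxVar = max(p, na.rm = TRUE)      # ignore NA p-values instead of propagating them
Matching the column to drop by coefficient name rather than by position j also avoids the off-by-one introduced by the intercept.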
I can replicate your error with the built-in R dataset state.x77:
dataset <- as.data.frame(state.x77)
dataset$State <- rownames(dataset)
dataset$profit <- rnorm(nrow(dataset))
backwardElimination <- function(training, sl) {
  if (!"profit" %in% names(training)) return(NULL)
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ .,
                     data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  #print(funzRegressor)
  if (maxVar > sl){
    for (j in c(1:numVar)){
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}
backwardElimination(dataset, 0.05)
There are NAs in some of your betas and all the p-values become NaN. Do you need to regress within states? Otherwise you can remove the State column to remove the error.
There will be another error when you reach the boundary case in your recursion, which you can fix :)
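For the State column, a minimal sketch with the replication above:
dataset$State <- NULL          # drop the character column that produces NA betas
backwardElimination(dataset, 0.05)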
I have a customized function (psup2) that gets data from a data frame and returns a result. The problem is that it takes a while since I am using a "for" loop that runs for every row and column.
Input:
I have a table that contains the ages (table_customers), an n*m matrix of different terms, and two different mortality tables (for males and females).
The mortality tables I'm using contain one column for ages and another one for the corresponding survival probabilities.
Output:
I want to create a separate data frame with the same size as the term table. The function will take the data from the different mortality tables (depending on the gender) and then apply the function above (psup2), taking the ages from the customer table and the terms from the terms matrix.
Up to now I have managed to create a very inefficient way to do this, but hopefully by using one of the functions from the apply family this could get faster.
The following code shows the idea of what I am trying to do:
#Function
psup2 <- function(x, age, term) {
  P1 = 1
  for (i in 1:term) {
    P <- x[age + i, 2]
    P1 <- P1*P
  }
  return(P1)
}
#Inputs
terms <- data.frame(V1 = c(1,2,3), V2 = c(1,3,4), V3 = c(2,3,4))
male<- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9981,0.9979,0.9978,.994,.992,.99))
female <- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9983,0.998,0.9979,.9970,.9964,.9950))
table_customers <- data.frame(id = c(1,2,3), age = c(0,0,0), gender = c(1,2,1))
#Loop
output <- data.frame(matrix(NA, nrow = 3, ncol = 0))
for (i in 1:3) {
  for (j in 1:3) {
    prob <- ifelse(table_customers[j, 3] == 1,
                   psup2(male, as.numeric(table_customers[j, 2]), as.numeric(terms[j, i])),
                   psup2(female, as.numeric(table_customers[j, 2]), as.numeric(terms[j, i])))
    output[j, i] <- prob
  }
}
Your psup2 function can be simplified to:
psup2 <- function(x, age, term) { prod(x$probability[age+(1:term)]) }
So actually we won't use it; we'll use the formula directly.
We'll put your male and female data frames next to each other, so we can use the value of the gender column to choose one or the other.
mf <- merge(male,female,by="age") # assuming you have the same ages on both sides
input_df <- cbind(table_customers,terms)
output <- t(apply(input_df, 1, function(x) {
  sapply(1:3, function(i) {
    prod(mf[x["age"] + (1:x[3 + i]), x["gender"] + 1])
  })
}))
And that's it :)
The sapply function is used to loop on the columns of terms.
x["age"]+(1:x[3+i]) are the indices of the rows you want to multiply
x["gender"]+1 is the relevant column of the mf data.frame
I created a data set like this; the reason is that I wanted to use a time-series-based data set.
getSymbols("^GSPC")
DF=data.frame(GSPC,DATE=time(GSPC))
PriceChange=(DF$GSPC.Close-DF$GSPC.Open)
DF$Class<-as.factor(ifelse(PriceChange>0,"UP","DOWN"))
DF$year = as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH = as.numeric(format(DF$DATE, format = "%m"))
GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume Class Year Month
1418.03 1429.42 1407.86 1416.60 3429160000 Down 2007 1
1416.60 1421.84 1408.43 1418.34 3004460000 Up 2007 1
Then I replaced the year with a number that can be added on to keep the month index (I guess there is a smarter way of doing this).
DF=data.table(DF)
DF[year==2007,year:=0]
DF[year==2008,year:=12]
DF[year==2009,year:=24]
DF[year==2010,year:=36]
DF[year==2011,year:=48]
DF[year==2012,year:=60]
DF[year==2013,year:=72]
DF[year==2014,year:=84]
DF[year==2015,year:=96]
DF[year==2016,year:=108]
DF$Month_Index=(DF$year+DF$MONTH)
So the data set now has the additional column Month_Index, which starts at 1 (for January 2007) and runs up to Month_Index = 115.
Then I used createTimeSlices from caret to make a growing-window prediction.
TimeSlices=createTimeSlices(1:Month_Index, 5, horizon = 2,
fixedWindow = FALSE, skip = 0)
for(i in 1:nrow(DF))
{
  plsFitTime <- train(Class ~ .,
                      data = DF[TimeSlices$train[[i]],],
                      method = "pls")
  Prediction = predict(plsFitTime, DF[TimeSlices$test[[i]],])
}
Now I want to save the predictions for each step, along with their proper index and the accuracy. My question is: how can I do this?
One of the ways to accomplish this:
library(quantmod)
library(data.table)
library(caret)
getSymbols("^GSPC")
DF <- data.frame(GSPC,DATE=time(GSPC))
PriceChange <- (DF$GSPC.Close-DF$GSPC.Open)
DF$Class <- as.factor(ifelse(PriceChange>0,"UP","DOWN"))
You can create Month indices in the following two ways:
# 1
DF$yearMon <- zoo::as.yearmon(DF$DATE)
DF <- data.table(DF)
DF[, Month_Index:= .GRP, by = yearMon]
# 2
DF$year <- as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH <- as.numeric(format(DF$DATE, format = "%m"))
DF[, Month_Index2 := .GRP, by = .(year, MONTH)]
identical(DF$Month_Index, DF$Month_Index2)
[1] TRUE
Month_Index <- length(unique(DF$Month_Index))
TimeSlices <- createTimeSlices(1:Month_Index, 5, horizon = 2,
fixedWindow = FALSE, skip = 0)
Create three empty lists to save your results:
totalSlices <- length(TimeSlices$train)
plsFitTime <- vector("list", totalSlices)
Prediction <- vector("list", totalSlices)
Accuracy <- vector("list", totalSlices)
Save all the results to these lists:
k <- 1:totalSlices
for(i in seq_along(k))
{
  plsFitTime[[i]] <- train(Class ~ .,
                           data = DF[TimeSlices$train[[i]],],
                           method = "pls")
  Prediction[[i]] <- predict(plsFitTime[[i]],
                             DF[TimeSlices$test[[i]],])
  Accuracy[[i]] <- confusionMatrix(Prediction[[i]],
                                   DF[TimeSlices$test[[i]],]$Class)$overall[1]
}
All the models are saved in plsFitTime, predictions in Prediction, and accuracies in Accuracy.
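If you want the results keyed by slice index in a single table, one way (a sketch using the objects above) is:
accuracy_df <- data.frame(slice = seq_along(Accuracy),
                          accuracy = unlist(Accuracy))
head(accuracy_df)  # one row per time slice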
Update:
A tidier approach would be to use the purrr package.
After creating time slices, you can use:
library(purrr)
customFunction <- function(x, y) {
  model <- train(Class ~ .,
                 data = DF[x],
                 method = "pls")
  prediction <- predict(model, DF[y])
  accuracy <- confusionMatrix(prediction,
                              DF[y]$Class)$overall[1]
  return(list(prediction, accuracy))
}
results <- map2_df(TimeSlices$train, TimeSlices$test, customFunction)
map2_df is a function that takes two lists .x and .y as arguments, applies the function .f to all elements of those lists, and returns the results as a data frame.
You can create the function on the fly (just as with lapply), but I created customFunction in the global environment just to keep the code clean.
DF[x] in the function is equivalent to DF[TimeSlices$train[[n]]], and DF[y] is DF[TimeSlices$test[[n]]].
map2_df now does everything that the for loop above did, and returns only the predictions and accuracies for all the models in the form of a dataframe.
class(results)
[1] "tbl_df" "tbl" "data.frame"
dim(results)
[1] 2 109
Each column in results is a list. 109 columns are the results from 109 models.
To access the result of each model (in this case the prediction and accuracy), use results$columnName or results[[columnNumber]].
If you want to store the models as well, just change the return statement in customFunction to include the model: return(list(model, prediction, accuracy))
You can use plyr to gather your results using a list:
results <- plyr::llply(1:length(TimeSlices$train), function(i){
  plsFitTime <- train(Class ~ .,
                      data = DF[TimeSlices$train[[i]],],
                      method = "pls")
  testData <- DF[TimeSlices$test[[i]],]
  Prediction <- predict(plsFitTime, testData)
  list(index = i, model = plsFitTime, prediction = Prediction)
})
# The model created for slice no. 3
results[[3]]$model
# ... and its predictions
results[[3]]$prediction
You can add accuracy inside the function passed to llply if you need it.
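For example (a sketch reusing the confusionMatrix call from the earlier answer):
results <- plyr::llply(1:length(TimeSlices$train), function(i){
  plsFitTime <- train(Class ~ .,
                      data = DF[TimeSlices$train[[i]],],
                      method = "pls")
  testData <- DF[TimeSlices$test[[i]],]
  Prediction <- predict(plsFitTime, testData)
  Accuracy <- confusionMatrix(Prediction, testData$Class)$overall[1]
  list(index = i, model = plsFitTime, prediction = Prediction, accuracy = Accuracy)
})
results[[3]]$accuracy  # accuracy for slice no. 3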
I wrote a small function to partition my dataset into training and testing sets. However, I am running into trouble when dealing with factor variables. In the model validation phase of my code, I get an error if the model was built on a dataset that doesn't have representation from each level of a factor. How can I fix this partition() function to include at least one observation from every level of a factor variable?
test.df <- data.frame(a = sample(c(0,1),100, rep = T),
b = factor(sample(letters, 100, rep = T)),
c = factor(sample(c("apple", "orange"), 100, rep = T)))
set.seed(123)
partition <- function(data, train.size = .7){
  train <- data[sample(1:nrow(data), round(train.size*nrow(data)), rep = FALSE), ]
  test <- data[-as.numeric(row.names(train)), ]
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
part.data <- partition(test.df)
table(part.data$train[,'b'])
table(part.data$test[,'b'])
EDIT - New function using 'caret' package and createDataPartition():
partition <- function(data, factor = NULL, train.size = .7){
  if (("package:caret" %in% search()) == FALSE){
    stop("Install and Load 'caret' package")
  }
  if (is.null(factor)){
    train.index <- createDataPartition(as.numeric(row.names(data)),
                                       times = 1, p = train.size, list = FALSE)
    train <- data[train.index, ]
    test <- data[-train.index, ]
  }
  else{
    train.index <- createDataPartition(factor,
                                       times = 1, p = train.size, list = FALSE)
    train <- data[train.index, ]
    test <- data[-train.index, ]
  }
  partitioned.data <- list(train = train, test = test)
  return(partitioned.data)
}
Try the caret package, particularly the function createDataPartition(). It should do exactly what you need; it's available on CRAN and its homepage is here:
caret - data splitting
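A minimal usage sketch with your test.df, stratifying on the factor b so that (as far as the level counts allow) each level is represented in the training split:
library(caret)
set.seed(123)
train.index <- createDataPartition(test.df$b, p = .7, list = FALSE)
train <- test.df[train.index, ]
test <- test.df[-train.index, ]
table(train$b)  # check that each level of b appears in the training set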
The function below is partially some code I found a while back on the net, which I then modified slightly to better handle edge cases (like when you ask for a sample size larger than the set, or for a subset).
stratified <- function(df, group, size) {
  # USE: * Specify your data frame and grouping variable (as column
  #        number) as the first two arguments.
  #      * Decide on your sample size. For a sample proportional to the
  #        population, enter "size" as a decimal. For an equal number
  #        of samples from each group, enter "size" as a whole number.
  #
  # Example 1: Sample 10% of each group from a data frame named "z",
  #            where the grouping variable is the fourth variable, use:
  #
  #            > stratified(z, 4, .1)
  #
  # Example 2: Sample 5 observations from each group from a data frame
  #            named "z"; grouping variable is the third variable:
  #
  #            > stratified(z, 3, 5)
  #
  require(sampling)
  temp = df[order(df[group]),]
  colsToReturn <- ncol(df)

  # Don't want to attempt to sample more than possible
  dfCounts <- table(df[group])
  if (size > min(dfCounts)) {
    size <- min(dfCounts)
  }
  if (size < 1) {
    size = ceiling(table(temp[group]) * size)
  } else if (size >= 1) {
    size = rep(size, times = length(table(temp[group])))
  }
  strat = strata(temp, stratanames = names(temp[group]),
                 size = size, method = "srswor")
  (dsample = getdata(temp, strat))
  dsample <- dsample[order(dsample[1]),]
  dsample <- data.frame(dsample[,1:colsToReturn], row.names = NULL)
  return(dsample)
}