Supplying several methods in a loop and saving the output - r

EDIT: Actually, looking at the function shows it only uses one method. How can I write a loop to select a method, given a vector of methods?
I've looked at a couple of questions (Loop in R: how to save the outputs?) but can't seem to figure out how to save the output of these models. Here is my "function":
library(caret)
Control<-trainControl(method="cv",number=5)
metric<-"Accuracy"
modeler<-function(df,yname,xname,method,metric,control){
  df<-df
  methods1<-method
  f1<-as.formula(paste0(yname,"~",xname))
  for(method in methods1){
    fit<-do.call("train",list(data=quote(df),f1,metric=metric,trControl=control,
                              method=method))
  }
  fit
}
An implementation of the "function":
methods1<-c("rf","rpart")
modeler(iris,yname="Species",xname=".",method = methods1,control=Control,
metric = metric)
Now I could save the above like:
mod1<-modeler(iris,yname="Species",xname=".",method = methods1,control=Control,
metric = metric)
And then:
sapply(mod1,"[",1)
This works, but it returns only the last model, and not in the best of displays.
How can I optimise this process?

Here is an abridged answer covering the key parts.
With methods1 as a vector of methods, one can use lapply:
methods1<-c("rf","rpart")
#use lapply to loop through the methods
#fit will be a list of the results from modeler
fit<-lapply(methods1, function(met) {
modeler(iris,yname="Species",xname=".",method = met,control=Control, metric = metric)
})
If the calculations are time consuming, the lapply call can easily be made to run in parallel with the parallel package, for example:
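A minimal sketch of one way to parallelise the loop, assuming modeler, Control and metric are defined as above:

library(parallel)

cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, library(caret))                      # load caret on each worker
clusterExport(cl, c("modeler", "Control", "metric"))  # ship the needed objects to the workers

fit <- parLapply(cl, methods1, function(met) {
  modeler(iris, yname = "Species", xname = ".",
          method = met, control = Control, metric = metric)
})

stopCluster(cl)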

Based on @Dave2e's comment, here is how I've managed to "solve" it:
library(caret)
Control<-trainControl(method="cv",number=5)
metric<-"Accuracy"
modeler<-function(df,yname,xname,method,metric,control){
  methods1<-method
  f1<-as.formula(paste0(yname,"~",xname))
  # when called via lapply, each call receives a single method, so this loop runs once
  for(method in methods1){
    fit<-do.call("train",list(data=quote(df),f1,metric=metric,trControl=control,
                              method=method))
  }
  data.frame(fit$results,Type=fit$modelInfo$tags[1])
}
Implement:
methods1<-c("rf","rpart")
lapply(methods1, function(met) {modeler(iris,yname="Species",xname=".",
method = met,control=Control, metric = metric)})
Result with no seed (this is still somewhat lacking, but it works):
[[1]]
mtry Accuracy Kappa AccuracySD KappaSD Type
1 2 0.9533333 0.93 0.03800585 0.05700877 Random Forest
2 3 0.9533333 0.93 0.03800585 0.05700877 Random Forest
3 4 0.9533333 0.93 0.03800585 0.05700877 Random Forest
[[2]]
cp Accuracy Kappa AccuracySD KappaSD Type
1 0.00 0.9533333 0.93 0.02981424 0.04472136 Tree-Based Model
2 0.44 0.7733333 0.66 0.14605935 0.21908902 Tree-Based Model
3 0.50 0.3333333 0.00 0.00000000 0.00000000 Tree-Based Model
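One possible refinement (not part of the original answer): since each method's data frame has a different tuning-parameter column (mtry for rf, cp for rpart), you can name the list by method and stack only the columns the methods share:

fits <- lapply(methods1, function(met) {
  modeler(iris, yname = "Species", xname = ".",
          method = met, control = Control, metric = metric)
})
names(fits) <- methods1                           # access results as fits[["rf"]], fits[["rpart"]]

common <- Reduce(intersect, lapply(fits, names))  # columns present in every result
do.call(rbind, lapply(fits, function(x) x[, common]))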

Related

RandomForest() to return probabilities of positive result as well as classification

I'm building a Random Forest classifier and I would like to return the classification and the associated probabilities. My result variable is either 1 or 0, with 1 being the positive class that I want to track.
no_of_trees <- 50
rf.under <- randomForest(as.factor(result) ~ .,
                         data = data_balanced_under,
                         importance = TRUE,
                         ntree = no_of_trees)
prediction <- predict(rf.under, df.test)
probability <- predict(rf.under, df.test, type="prob")
submit <- data.frame( predicted = prediction, actual = df.test$result)
I wanted probability to return the probability of positive results, however I get:
> probability
0 1
242339 1.00 0.00
3356431 1.00 0.00
138327 1.00 0.00
111327 1.00 0.00
3307151 1.00 0.00
222414 1.00 0.00
1817297 1.00 0.00
3860922 1.00 0.00
1710532 1.00 0.00
in my output. What are these numbers on the left? I thought they were row numbers, but then why aren't they indexed from 1, 2, 3, ...?
I tried to get probability[,2], which I'm assuming gives me the probability of the result, but that also doesn't work.
Ideally, I would like to include the probabilities in the submit data frame, but I'm currently unable to do so.
Also, confusion matrix gives me:
confusionMatrix(data = submit$predicted, reference = df.test$result , positive="1")
#Reference
Prediction 0 1
0 913730 160
1 50872 8219
Is it possible to switch this around? So that it shows positive class "1" first?
probability returns the probability for each class (here you have two classes, so two columns).
It has been built this way to allow multiclass classification.
If you want the probability of result == 1, just take the second column of probability.
Since you have highly unbalanced classes (0.8% ones), your classifier tends to predict that the result is always 0, so your probability of result == 1 is close to 0 for most examples. This is why your probabilities don't look like probabilities.
Regarding the index of probability: it is rownames(df.test), the row names of df.test. I guess you randomly split df.test from df, so the indices don't start at 1.
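A short sketch of how to pull that column into the submit data frame, indexing by the column name "1" rather than by position so it stays robust to level ordering:

# probability of the positive class, taken by column name
submit$prob_1 <- probability[, "1"]
head(submit)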

Combining binary classification algorithms

I have several algorithms which solve a binary classification problem (with response 0 or 1) by assigning to each observation a probability of the target value being equal to 1. All the algorithms try to minimize the log loss function
log loss = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
where N is the number of observations, y_i is the actual target value and p_i is the probability of 1 predicted by the algorithm. Here is some R code with sample data:
actual.response = c(1,0,0,0,1)
prediction.df = data.frame(
method1 = c(0.5080349,0.5155535,0.5338271,0.4434838,0.5002529),
method2 = c(0.5229466,0.5298336,0.5360780,0.4217748,0.4998602),
method3 = c(0.5175378,0.5157711,0.5133765,0.4372109,0.5215695),
method4 = c(0.5155535,0.5094510,0.5201827,0.4351625,0.5069823)
)
log.loss = colSums(-1/length(actual.response)*(actual.response*log(prediction.df)+(1-actual.response)*log(1-prediction.df)))
The sample code gives the log loss for each algorithm:
method1 method3 method2 method4
0.6887705 0.6659796 0.6824404 0.6719181
Now I want to combine these algorithms so I can minimize the log loss even further. Is there any R package which can do this for me? I would appreciate references to any algorithms, articles, books or research papers which solve this kind of problem. Note that as a final result I want the predicted probabilities of each class, not plain 0/1 responses.
This is called ensemble learning (Wikipedia).
Check out this article: "An intro to ensemble learning in R".
Here is an example I did using the Cornell movie review data, which can be downloaded by clicking the link. I used the data set with 1000 positive and 1000 negative reviews. Once you get the data into R:
library(RTextTools)
library(tm)
library(glmnet)
library(ipred)
library(randomForest)
library(data.table)
## text_neg and text_pos hold the negative and positive reviews read in above.
## Create a column of sentiment score: 0 for negative and 1 for positive.
text_neg$pos_neg<-rep(0,1000)
text_pos$pos_neg<-rep(1,1000)
## Combine into 1 data.table and rename.
text_all<-rbind(text_neg, text_pos)
## Don't forget to shuffle.
set.seed(26)
text2<-text_all[sample(nrow(text_all)),]
## turn the data.frame into a document term matrix. This uses the handy
##RTextTools wrappers and functions.
doc_matrix <- create_matrix(text2$V1, language="english",
removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.98)
ncol(data.frame(as.matrix(doc_matrix)))
## 2200 variables at .98 sparsity. runs pretty slow...
## create a container with the very nice RTextTools package
container <- create_container(doc_matrix, text2$pos_neg,
trainSize=1:1700, testSize=1701:2000, virgin=FALSE)
## train the data
time_glm<-system.time(GLMNET <- train_model(container,"GLMNET"));
time_glm #1.19
time_slda<-system.time(SLDA <- train_model(container,"SLDA"));
time_slda #45.03
time_bag<-system.time(BAGGING <- train_model(container,"BAGGING"));
time_bag #59.24
time_rf<-system.time(RF <- train_model(container,"RF")); time_rf #69.59
## classify with the models
GLMNET_CLASSIFY <- classify_model(container, GLMNET)
SLDA_CLASSIFY <- classify_model(container, SLDA)
BAGGING_CLASSIFY <- classify_model(container, BAGGING)
RF_CLASSIFY <- classify_model(container, RF)
## summarize results
analytics <- create_analytics(container,cbind( SLDA_CLASSIFY,
BAGGING_CLASSIFY,RF_CLASSIFY, GLMNET_CLASSIFY))
summary(analytics)
This ran an ensemble classifier using the 4 different methods (random forests, GLMNET, SLDA and bagging). The ensemble summary at the end shows:
# ENSEMBLE SUMMARY
#
# n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
# n >= 1 1.00 0.86
# n >= 2 1.00 0.86
# n >= 3 0.89 0.89
# n >= 4 0.63 0.96
This means that when all 4 methods agreed on whether the review was positive or negative, the ensemble had a 96% recall rate. But be careful: with a binary outcome (2 choices) and 4 different algorithms, there is bound to be a lot of agreement.
See the RTextTools documentation for more explanation. They also do an almost identical example with U.S. Congress data that I more or less mimicked in the above example.
Hope this was helpful.
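Since the question also asks for combined probabilities rather than hard class votes: packages such as caretEnsemble and SuperLearner implement stacking directly. As a minimal sketch of the idea on the sample data from the question, one can search for a convex combination of the four methods' probabilities that minimises log loss (purely illustrative with 5 observations; on real data, fit the weights on a held-out set or via cross-validation):

log_loss <- function(p, y) -mean(y * log(p) + (1 - y) * log(1 - p))

# objective: log loss of a weighted average of the methods' probabilities,
# with weights kept positive and summing to 1 via a softmax transform
obj <- function(w_raw) {
  w <- exp(w_raw) / sum(exp(w_raw))
  log_loss(as.matrix(prediction.df) %*% w, actual.response)
}

opt <- optim(rep(0, ncol(prediction.df)), obj)
weights <- exp(opt$par) / sum(exp(opt$par))

ensemble_pred <- as.vector(as.matrix(prediction.df) %*% weights)
log_loss(ensemble_pred, actual.response)   # compare with the individual methods' log loss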

predicting and calculating reliability test statistics from repeated multiple regression model in r

I want to run an MLR model on my data using the lm function in R. However, I am using a data-splitting cross-validation method to assess the reliability of the model. I intend to use the sample function to randomly split the data into calibration and validation datasets in an 80:20 ratio, and I want to repeat this, say, 100 times. Without setting a seed, the models from the different samplings will differ. I came across the function below in a previous post here, and it solves the first part:
lst <- lapply(1:100, function(repetition) {
mod <- lm(...)
# Replace this with the code you need to train your model
return(mod)
})
save(lst, file="myfile.RData")
The concern now is: how do I validate each of these 100 models and obtain reliability test statistics like RMSE, ME and R-squared for each of the models, and hopefully obtain confidence intervals?
If I can get an output in the form of a data frame containing the predicted values for all 100 models, then I should be able to proceed from there.
Any help please?
Thanks
To quickly recap your question: it seems that you want to fit an MLR model to a large training set and then use this model to make predictions on the remaining validation set. You want to repeat this process 100 times, and afterwards you want to be able to analyze the characteristics and predictions of the individual models.
To accomplish this you could just store temporary model information in a data structure during the model generation and prediction process. You can then retrieve and process all the information afterwards. You did not provide your own dataset in the description, so I will use one of R's built-in datasets in order to demonstrate how this might work:
> library(car)
> Prestige <- Prestige[,c("prestige","education","income","women")]
> Prestige[,c("income")] <- log2(Prestige[,c("income")])
> head(Prestige,n=5)
prestige education income women
gov.administrators 68.8 13.11 -0.09620212 11.16
general.managers 69.1 12.26 -0.04955335 4.02
accountants 63.4 12.77 -0.11643822 15.70
purchasing.officers 56.8 11.42 -0.11972061 9.11
chemists 73.5 14.62 -0.12368966 11.68
We start by initializing some variables first. Let's say you want to create 100 models and use 80% of your data for training purposes:
nrIterations=100
totalSize <- nrow(Prestige)
trainingSize <- floor(0.80*totalSize)
We also want to create the data structure that will be used to hold the intermediate model information. R is quite a generic high-level language in this regard, so we will just create a list of lists. This means that every list entry can itself hold another list of information, which gives us the flexibility to add whatever we need:
trainTestTuple <- vector(mode="list",length=nrIterations)
We are now ready to create our models and predictions. During every loop iteration a different random training subset is created, while the remaining data is used for testing. Next, we fit our model to the training data and then use the obtained model to make predictions on the test data. Note that we explicitly use the independent variables in order to predict the dependent variable:
for(i in 1:nrIterations)
{
  trainIndices <- sample(seq_len(totalSize),size = trainingSize)
  trainSet <- Prestige[trainIndices,]
  testSet <- Prestige[-trainIndices,]

  trainingFit <- lm(prestige ~ education + income + women, data=trainSet)

  # Perform predictions on the testdata
  testingForecast <- predict(trainingFit,
                             newdata=data.frame(education=testSet$education,
                                                income=testSet$income,
                                                women=testSet$women),
                             interval="confidence",level=0.95)

  # Do whatever else you want to do (compare with actual values, calculate other stuff/metrics ...)
  # ...

  # add your training and testData to a tuple and add it to a list
  tuple <- list(trainingFit,testingForecast) # Add whatever else you need ..
  trainTestTuple[[i]] <- tuple # Add this list to the "list of lists"
}
Now, the relevant part: At the end of the iteration we put both the fitted model and the out of sample prediction results in a list. This list contains all the intermediate information that we want to save for the current iteration. We finish by putting this list in our list of lists.
Now that we are done with the modeling, we still have access to all the information we need and we can process and analyze it any way we want. We will take a look at the modeling and prediction results of model 50. First, we extract both the model and the prediction results from the list of lists:
> tuple_50 <- trainTestTuple[[50]]
> trainingFit_50 <- tuple_50[[1]]
> testingForecast_50 <- tuple_50[[2]]
We take a look at the model summary:
> summary(trainingFit_50)
Call:
lm(formula = prestige ~ education + log2(income) + women, data = trainSet)
Residuals:
Min 1Q Median 3Q Max
-15.9552 -4.6461 0.5016 4.3196 18.4882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -287.96143 70.39697 -4.091 0.000105 ***
education 4.23426 0.43418 9.752 4.3e-15 ***
log2(income) 155.16246 38.94176 3.984 0.000152 ***
women 0.02506 0.03942 0.636 0.526875
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.308 on 77 degrees of freedom
Multiple R-squared: 0.8072, Adjusted R-squared: 0.7997
F-statistic: 107.5 on 3 and 77 DF, p-value: < 2.2e-16
We then explicitly obtain the model R-squared and the residual standard error (used here as an in-sample RMSE):
> summary(trainingFit_50)$r.squared
[1] 0.8072008
> summary(trainingFit_50)$sigma
[1] 7.308057
We take a look at the out of sample forecasts:
> testingForecast_50
fit lwr upr
1 67.38159 63.848326 70.91485
2 74.10724 70.075823 78.13865
3 64.15322 61.284077 67.02236
4 79.61595 75.513602 83.71830
5 63.88237 60.078095 67.68664
6 71.76869 68.388457 75.14893
7 60.99983 57.052282 64.94738
8 82.84507 78.145035 87.54510
9 72.25896 68.874070 75.64384
10 49.19994 45.033546 53.36633
11 48.00888 46.134464 49.88329
12 20.14195 8.196699 32.08720
13 33.76505 27.439318 40.09079
14 24.31853 18.058742 30.57832
15 40.79585 38.329835 43.26187
16 40.35038 37.970858 42.72990
17 38.38186 35.818814 40.94491
18 40.09030 37.739428 42.44117
19 35.81084 33.139461 38.48223
20 43.43717 40.799715 46.07463
21 29.73700 26.317428 33.15657
And finally, we obtain some more detailed results about the 2nd forecasted value and the corresponding confidence intervals:
> testingPredicted_2ndprediction <- testingForecast_50[2,1]
> testingLowerConfidence_2ndprediction <- testingForecast_50[2,2]
> testingUpperConfidence_2ndprediction <- testingForecast_50[2,3]
EDIT
After rereading, it occurred to me that you are obviously not splitting up the exact same dataset each time. You are using completely different partitions of the data during each iteration, and they should be split in an 80/20 fashion. However, the same solution can still be applied with minor modifications.
Also, for cross-validation purposes you should probably take a look at cv.lm() (from the DAAG package).
Description from the R help:
This function gives internal and cross-validation measures of predictive accuracy for multiple linear regression. (For binary logistic regression, use the CVbinary function.) The data are randomly assigned to a number of ‘folds’. Each fold is removed, in turn, while the remaining data is used to re-fit the regression model and to predict at the deleted observations.
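A minimal sketch of what that might look like for the Prestige example, assuming DAAG's cv.lm() interface (argument names differ slightly between DAAG versions; older versions use df= instead of data=):

library(DAAG)

# 5-fold cross-validation of the same linear model; prints fold-by-fold
# predictions and an overall cross-validated mean square error
cv_result <- cv.lm(data = Prestige,
                   form.lm = prestige ~ education + income + women,
                   m = 5)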
EDIT: Reply to comment.
You can just take the means of the relevant performance metrics that you saved. For example, you can use sapply on trainTestTuple to extract the relevant element from each sublist. sapply will return these elements as a vector, from which you can calculate the mean. This should work:
mean_ME <- mean(sapply(trainTestTuple,"[[",2))
mean_MAD <- mean(sapply(trainTestTuple,"[[",3))
mean_MSE <- mean(sapply(trainTestTuple,"[[",4))
mean_RMSE <- mean(sapply(trainTestTuple,"[[",5))
mean_adjRsq <- mean(sapply(trainTestTuple,"[[",6))
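For those indices to line up, the loop above would need to store the metrics in each tuple in that order. A sketch of what the end of the loop body might look like under that assumption (the metric definitions here are illustrative, not taken from the original comment):

# inside the for loop, after computing testingForecast
err    <- testSet$prestige - testingForecast[, "fit"]   # out-of-sample errors
ME     <- mean(err)                                     # mean error
MAD    <- mean(abs(err))                                # mean absolute deviation
MSE    <- mean(err^2)                                   # mean squared error
RMSE   <- sqrt(MSE)                                     # root mean squared error
adjRsq <- summary(trainingFit)$adj.r.squared            # in-sample adjusted R-squared

tuple <- list(trainingFit, ME, MAD, MSE, RMSE, adjRsq, testingForecast)
trainTestTuple[[i]] <- tuple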
Another small edit: the calculation of your MAD looks rather strange. It might be a good idea to double-check whether this is exactly what you want.

rpart not splitting obvious nodes

I am using a data set of about 54K records and 5 classes (pop), of which one class is insignificant. I am using the caret package and the following to run rpart:
model <- train(pop ~ pe + chl_small, method = "rpart", data = training)
and I get the following tree:
n= 54259
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 54259 38614 pico (0.0014 0.18 0.29 0.25 0.28)
2) pe< 5004 39537 23961 pico (0 0.22 0.39 2.5e-05 0.38)
4) chl_small< 32070.5 16948 2900 pico (0 0.00012 0.83 5.9e-05 0.17) *
5) chl_small>=32070.5 22589 10281 ultra (0 0.39 0.068 0 0.54) *
3) pe>=5004 14722 1113 synecho (0.0052 0.052 0.0047 0.92 0.013) *
It is obvious that node 5 should be further split, but rpart is not doing it. I tried using cp = .001 to cp =.1 and also minbucket = 1000 as additional parameters, but no improvement.
Appreciate any help on this.
Try running the model with an even smaller cp = 0.00001, or cp = -1. If it is still not splitting that node, it means the split would not improve the overall fit.
You can also try changing the splitting criterion from the default Gini impurity to the information gain criterion: parms = list(split = "information").
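A sketch of how those options can be passed through caret::train (train forwards extra arguments such as parms on to rpart, and cp is controlled via tuneGrid):

library(caret)

model_small_cp <- train(pop ~ pe + chl_small, method = "rpart", data = training,
                        tuneGrid = data.frame(cp = 0.00001),      # force a very small cp
                        parms = list(split = "information"))      # information gain instead of Gini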
If you do force it to split, it might be a good idea to do a quick check:
compare the accuracy on the training vs. the testing set for the original model and for the model with the small cp (see the sketch below).
If the training/testing difference is much smaller for the original model, then the other model probably overfits the data.
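A minimal sketch of that check, assuming a held-out data frame called testing with the same columns as training (the name testing is hypothetical; the question only shows training), and model_small_cp from the sketch above:

# accuracy of a caret model on a given data set
acc <- function(fit, df) mean(predict(fit, newdata = df) == df$pop)

c(train = acc(model, training),          test = acc(model, testing))           # original model
c(train = acc(model_small_cp, training), test = acc(model_small_cp, testing))  # small-cp model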

Initialise covariance structure in lme

How can I initialise an unstructured covariance matrix for the following model?
y<-data.frame(response=c(10,19,27,28,9,13,25,29,4,10,20,18,5,6,12,17),
treatment=factor(rep(1:4,4)),
subject=factor(rep(1:4,each=4))
)
fit<-lme(response~-1+treatment,y,random=~1|subject,
correlation=corSymm(form=~1|subject))
I tried some variants, but every time I get the error:
Error in lme.formula(response ~ -1 + treatment, y, random = ~1 | :
nlminb problem, convergence error code = 1
message = function evaluation limit reached without convergence (9)
It's practically difficult to fit an unstructured correlation matrix with 6 parameters in addition to a treatment mean effect (4 parameters), a random-effects variance (1), and a residual variance (1) to a data set with only 16 points. If I try with a larger, randomized version of your data set, it works fine.
nSubj <- 20
respVec <- c(10,19,27,28,9,13,25,29,4,10,20,18,5,6,12,17)
set.seed(101)
y<-data.frame(response=sample(respVec,size=4*nSubj,replace=TRUE),
treatment=factor(rep(1:4,nSubj)),
subject=factor(rep(1:nSubj,each=4))
)
library(nlme)
fit<-lme(response~-1+treatment,y,random=~1|subject,
correlation=corSymm(form=~1|subject),
control=lmeControl(msVerbose=TRUE))
Now we can experiment and see how small a data set we can get away with. Package the stuff above into a test function that simulates data and tries a fit, returning TRUE if the fit fails:
testFun <- function(nSubj) {
y<-data.frame(response=sample(respVec,size=4*nSubj,replace=TRUE),
treatment=factor(rep(1:4,nSubj)),
subject=factor(rep(1:nSubj,each=4))
)
fit <- try(lme(response~-1+treatment,y,random=~1|subject,
correlation=corSymm(form=~1|subject)),silent=TRUE)
inherits(fit,"try-error")
}
Try the test function N times and report the proportion of failures:
testFun2 <- function(nSubj,N) {
mean(replicate(N,testFun(nSubj)))
}
Try it out for a range of numbers of subjects (slow):
set.seed(101)
testRes <- sapply(4:20,testFun2,N=50)
Results:
## [1] 0.64 0.04 0.00 0.00 ... 0.00
Somewhat to my surprise, this works about a third of the time with 4 subjects, 96% of the time with 5 subjects, and always with more than 5 subjects.
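Returning to the literal question of initialising the covariance structure, here is a sketch of one way to do it, using corSymm()'s value argument (a vector of initial lower-triangular correlations) and lmeControl() to raise the iteration limits behind the convergence error. With the original 16-point data set it may still fail to converge, for the reasons above:

library(nlme)

init_cor <- rep(0.3, 6)   # 6 initial correlations for 4 treatments (values chosen arbitrarily)

fit <- lme(response ~ -1 + treatment, data = y,
           random = ~1 | subject,
           correlation = corSymm(value = init_cor, form = ~1 | subject),
           control = lmeControl(maxIter = 200, msMaxIter = 200, niterEM = 50))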
