Excluding ID field when fitting model in R - r

I have a simple Random Forest model I have created and tested in R. For now I have excluded an internal company ID from my training/testing data frames. Is there a way in R that I could include this column in my data and have the training/execution of my model ignore the field?
I obviously don't want the model to use it as a variable, but when I export the data with the predicted outcome added as a column, I need that internal ID to tie the predictions back to other customer data, so that I know how each customer has been categorized.
I am just using the out-of-the-box random forest function from the randomForest library:
#divide data into training and test sets
set.seed(3)
id<-sample(2,nrow(Churn_Model_Data_v2),prob=c(0.7,0.3),replace = TRUE)
churn_train<-Churn_Model_Data_v2[id==1,]
churn_test<-Churn_Model_Data_v2[id==2,]
#changes Churn data 1/2 to a factor for model
Churn_Model_Data_v2$`Churn` <- as.factor(Churn_Model_Data_v2$`Churn`)
churn_train$`Churn` <- as.factor(churn_train$`Churn`)
#churn_test$`Churn` <- as.factor(churn_test$`Churn`)
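#tunes mtry on the predictors only (the response must not be part of x)
bestmtry <- tuneRF(churn_train[, names(churn_train) != "Churn"], churn_train$`Churn`, stepFactor = 1.2,
improve = 0.01, trace = TRUE, plot = TRUE)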
#creates model based on training data, views model
churn_forest <- randomForest(`Churn`~. , data= churn_train )
churn_forest
#shows us what variables are most important
importance(churn_forest)
varImpPlot(churn_forest)
#predicts churn diagnosis on test data
predict_churn <- predict(churn_forest, newdata = churn_test, type="class")
predict_churn

A simple example of excluding a particular column or set of columns is as follows:
library(MASS) # provides the petrol data set
library(randomForest)
temp <- petrol
randomForest(No ~ ., data = temp[, !(colnames(temp) %in% c("SG"))]) # one way: drop the column from the data
randomForest(No ~ . - SG, data = temp) # another way, with a similar result: remove it in the formula
This method of exclusion is generally valid across other functions/algorithms in R too.
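To tie this back to the question, here is a minimal sketch of keeping the ID in the data frame, leaving it out of the fit, and attaching it to the predictions afterwards. The data is simulated, the column names (customer_id, tenure, monthly_spend) are made up, and glm() is used as a stand-in so that the sketch runs with base R only; a randomForest(churn ~ ., data = train[, features]) call should slot into the same place.

```r
set.seed(3)
n <- 200

# Simulated customer data; customer_id is a hypothetical internal ID column
customers <- data.frame(
  customer_id   = sprintf("C%04d", seq_len(n)),
  tenure        = rpois(n, 24),
  monthly_spend = round(runif(n, 10, 100), 2)
)
customers$churn <- factor(rbinom(n, 1, plogis(1 - 0.05 * customers$tenure)))

# Same style of 70/30 split as in the question
split <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
train <- customers[split == 1, ]
test  <- customers[split == 2, ]

# Fit on every column except the ID (stand-in model, see note above)
features <- setdiff(names(train), "customer_id")
fit <- glm(churn ~ ., data = train[, features], family = binomial)

# Extra columns in newdata are ignored at predict time, so test can keep its ID
prob <- predict(fit, newdata = test, type = "response")

# Row order is unchanged, so the ID can simply be placed next to the prediction
out <- data.frame(customer_id = test$customer_id,
                  predicted_churn = as.integer(prob > 0.5))
head(out)
```

Because predict() returns one value per row of newdata in the same order, no merge is needed; write.csv(out, ...) can then export the result.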

Related

Data set for regression: different response values for same combination of input variables

Hey dear stackoverflowers,
I would like to perform (multiple) regression analysis on a large customer data set, trying to predict amount spent after initial purchase based on various independent variables, observed during the first purchase.
In this data set, for the same combination of input variable values (say gender=male, age=30, income=40k, first_purchase_value = 99,90), I can have multiple observations with varying y values (i.e. multiple customers share the same independent variable attributes, but behave differently according to their observed y values).
Is this a problem for regression analysis, i.e. do I have to condense these observations by e.g. averaging? I am getting negative R2 values, that's why I'm asking (I know that a linear model might also just be the wrong assumption here) ...
Thank you for helping me. I tried using the search function, but was unable to find similar topics (probably because the question is silly?).
Cheers!
Edit: This is the code I'm using:
library(caTools) # provides sample.split()
spl <- sample.split(data$spent, SplitRatio = 0.75)
data_train <- subset(data, spl == TRUE)
data_test <- subset(data, spl == FALSE)
model_lm_spent <- lm(spent ~ ., data = data_train)
summary(model_lm_spent)
model_lm_predictions_spent <- predict(model_lm_spent, newdata = data_test)
SSE_spent = sum((data_test$spent - model_lm_predictions_spent)^2)
SST_spent = sum((data_test$spent - mean(data$spent))^2)
1 - SSE_spent/SST_spent
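As a small illustration of the situation described above, the sketch below simulates customers who share the same predictor value but have different y values. It is only a sketch on made-up data (the variable age and all numbers are assumptions). It shows two things: ordinary least squares on the raw rows gives the same coefficients as a regression on the per-group averages weighted by group size, and the test-set R2 can be computed against the training mean as baseline.

```r
set.seed(1)
n <- 400

# Many customers share the same predictor value but differ in y
d <- data.frame(age = sample(c(20, 30, 40, 50), n, replace = TRUE))
d$spent <- 10 + 2 * d$age + rnorm(n, sd = 25)

train_idx <- sample(n, size = 0.75 * n)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit directly on the raw, replicated observations
fit_raw <- lm(spent ~ age, data = train)

# Condense to one averaged row per predictor value, weighted by group size
agg   <- aggregate(spent ~ age, data = train, FUN = mean)
agg$n <- aggregate(spent ~ age, data = train, FUN = length)$spent
fit_avg <- lm(spent ~ age, data = agg, weights = n)

# Both fits give the same coefficients
all.equal(coef(fit_raw), coef(fit_avg))

# Out-of-sample R2, using the training mean as the baseline prediction
pred    <- predict(fit_raw, newdata = test)
sse     <- sum((test$spent - pred)^2)
sst     <- sum((test$spent - mean(train$spent))^2)
r2_test <- 1 - sse / sst
r2_test
```

An out-of-sample R2 computed this way drops below zero whenever the model predicts the test rows worse than the constant baseline does, so a negative value points at the model fit rather than at the replicated rows.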

Missing Formula for Plot of SVM model

I have this code I am trying to run. It gets everything right until I want to create my Plot.
# Install package to use Support Vector Machine Algorithm
install.packages("e1071")
# If this function does not work click on the packages tab and check e1071
library("e1071", lib.loc="/Library/Frameworks/R.framework/Versions/3.2/Resources/library")
# Choose File
diabetes <- read.csv(file.choose(), na.strings = "?")
View(diabetes)
##### Data Preprocessing
# Count number of rows with missing data
sum(!complete.cases(diabetes))
# Summary of data set
summary(diabetes)
str(diabetes)
# Replace "no" and ">30" with 0 and "<30" with 1
diabetes$readmitted<-as.character(diabetes$readmitted)
diabetes$readmitted[diabetes$readmitted== "NO"] <- "0"
diabetes$readmitted[diabetes$readmitted== "<30"] <- "1"
diabetes$readmitted[diabetes$readmitted== ">30"] <- "0"
diabetes$readmitted<-factor(diabetes$readmitted)
str(diabetes$readmitted)
summary(diabetes$readmitted)
# Removal of insignificant variables
diabetes$encounter_id<-NULL
diabetes$patient_nbr<-NULL
diabetes$weight<-NULL # Weight had too many missing values to be a part of our model
diabetes$payer_code<-NULL
diabetes$medical_specialty<-NULL
diabetes$nateglinide<-NULL
diabetes$chlorpropamide<-NULL
diabetes$acetohexamide<-NULL
diabetes$tolbutamide<-NULL
diabetes$acarbose<-NULL
diabetes$miglitol<-NULL
diabetes$troglitazone<-NULL
diabetes$tolazamide<-NULL
diabetes$examide<-NULL
diabetes$citoglipton<-NULL
diabetes$glyburide.metformin<-NULL
diabetes$glipizide.metformin<-NULL
diabetes$glimepiride.pioglitazone<-NULL
diabetes$metformin.rosiglitazone<-NULL
diabetes$metformin.pioglitazone<-NULL
# Change variables to be factors
diabetes$admission_type_id<-factor(diabetes$admission_type_id)
diabetes$discharge_disposition_id<-factor(diabetes$discharge_disposition_id)
diabetes$admission_source_id<-factor(diabetes$admission_source_id)
str(diabetes)
# Summary after data pre-processing
summary(diabetes)
# Set Seed and split data set into training and test data
set.seed(1234)
ind <- sample(2, nrow(diabetes), replace = TRUE, prob = c(0.7, 0.3))
train.data <- diabetes[ind == 1, ]
test.data <- diabetes[ind == 2, ]
# Create Model using readmitted as dependent variable
model1<-readmitted~.
model1<-svm(readmitted~., data=train.data)
summary(model1)
plot(model1, diabetes, type='C-classification', kernel='radial')
### I am also having trouble here making the tables###########
# Create table of model vs training data in confusion matrix
table(predict(model1), train.data$readmitted)
# Pull Test data to get confusion matrix
testPred <- predict(model1, newdata = test.data)
table (testPred, test.data$readmitted)
# Create second model using select readmitted and select variables
model2<-readmitted~race + gender + age + admission_type_id + discharge_disposition_id + time_in_hospital + num_lab_procedures + num_procedures + num_medications + number_outpatient + number_inpatient + number_emergency + number_diagnoses + change + diabetesMed
model2<-svm(model2, data=train.data)
summary(model2)
### Also having trouble here making the second table#########
# Create table using second model and training data
table(predict(model2), train.data$readmitted)
testPred2 <- predict(model2, newdata = test.data)
table (testPred2, test.data$readmitted)
I have been playing around with plot and the tables and can't seem to get anything to work.
I have been using a data set with 9999 rows to test this out on. But my real data set is 107,000 rows. So it takes a long time to run this and find out I am wrong.
Any help would be greatly appreciated. Thank You
Well, I would need the data you are working with. I have run into this kind of problem with large data sets.
For large data sets, I prefer the caret package; it also helps with parallel processing and handles large tuning grids.
For plots, library(hexbin) or the tabplot package in R might help you.
The suggestions above are about processing your data quickly, so that you can use the whole data set, and about visualizing large data sets.
I am not sure what error you are getting from plot. Please post the error message you are getting.
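On the plot itself: according to the e1071 documentation, plot() for an svm object takes the model, the data, and a formula selecting the two variables to draw, and the formula is needed whenever the model has more than two input variables (type and kernel are arguments of svm(), not of plot()). Here is a minimal sketch on the built-in iris data rather than the diabetes file, which I do not have, so the variable names are only illustrative:

```r
library(e1071)

set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data  <- iris[ind == 2, ]

# Classification model with four predictors
model <- svm(Species ~ ., data = train.data)

# With more than two predictors, plot() needs a formula naming the two
# dimensions to draw; slice fixes the remaining predictors at chosen values
plot(model, train.data, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))

# Confusion matrices: predict on explicit newdata so both vectors line up
train_tab <- table(predicted = predict(model, train.data),
                   actual    = train.data$Species)
test_tab  <- table(predicted = predict(model, test.data),
                   actual    = test.data$Species)
train_tab
test_tab
```

If the real data contains missing values, rows with NA may be dropped during fitting and prediction, in which case the two vectors passed to table() can differ in length; running na.omit() on the data before splitting is one way to rule that out.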

R object is not a matrix

I am new to R and trying to save my svm model in R and have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix" which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
Where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for(timesRun in 1:SVMtimes) {
cat("Running SVM = ",timesRun," result = ")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
trainClasses = as.factor(data[,ncol(data)])
model = svm(trainSet, trainClasses, type="C-classification",
kernel = KERNEL, degree = DEGREE, coef0=1, cost=1,
cachesize = 10000, cross = 10)
accAll = model$accuracies
cat(mean(accAll), "/", sd(accAll),"\n")
results10foldAll = rbind(results10foldAll, c(mean(accAll),sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but the svm.model call expects data to be a matrix (data is the argument you are assigning trainSet to). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that a matrix object means all columns have the same data type, all numeric or all categorical. If this is true for your data, as.matrix() may be fine.
In practice, people more often want model.matrix() or sparse.model.matrix() (from the Matrix package), which gives dummy columns for categorical variables while keeping a single column for each numerical variable. The result is still a matrix.
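A short sketch of the difference between the two conversions, on a tiny made-up data frame (the column names x and grp are assumptions for the illustration):

```r
# One numeric and one categorical column
df <- data.frame(
  x   = c(1.5, 2.0, 3.5, 4.0),
  grp = factor(c("a", "b", "c", "a"))
)

# as.matrix() on mixed column types falls back to a character matrix
m_chr <- as.matrix(df)
is.character(m_chr)

# model.matrix() keeps x numeric and expands grp into dummy columns;
# the first column is the intercept, which is dropped here
mm <- model.matrix(~ ., data = df)[, -1]
colnames(mm)
is.numeric(mm)
```

The first factor level ("a") is the reference level under the default treatment contrasts, so it gets no column of its own.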

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components rather than predictions for the independent test set. Here is what the model looks like in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10, validation =
"LOO", jackknife = TRUE)
It was then identified that a model comprising 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates are the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the below code, in which newdata = is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically is. Is there someone out there who can, and is willing to help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) that goes with the example, plsr requires a very specific data frame that I find tricky to work with. The input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation =
"LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside of a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
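To make that explanation concrete, here is a minimal base R sketch that uses lm() as a stand-in for plsr(). The assumption is that both resolve formula variables in the same way: a predictor that is not found in newdata is looked up again in the workspace, where it still holds the training rows. All object names below are made up for the illustration.

```r
set.seed(1)
n <- 20
X_all <- matrix(runif(n * 3), ncol = 3)
y_all <- drop(X_all %*% c(1, 2, 3)) + rnorm(n, sd = 0.1)
train_rows <- 1:10
test_rows  <- 11:20

# Pitfall: the predictor matrix X lives outside the data frame
X     <- X_all[train_rows, ]
train <- data.frame(y = y_all[train_rows])
test  <- data.frame(y = y_all[test_rows])
fit_bad <- lm(y ~ X, data = train)

# newdata has no column X, so X is taken from the workspace (training rows)
p_bad <- predict(fit_bad, newdata = test)
all.equal(unname(p_bad), unname(fitted(fit_bad)))

# Fix: store the matrix as a column of the data frame itself
DF <- data.frame(y = y_all)
DF$spec <- X_all
fit_ok <- lm(y ~ spec, data = DF[train_rows, ])

# Now the predictor is found in newdata and real predictions come back
p_ok <- predict(fit_ok, newdata = DF[test_rows, ])
```

In the pitfall case no warning is raised here because the training and test sets happen to have the same number of rows, which matches the 28/28 split in the question.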

R random forest - training set using target column for prediction

I am learning how to use various random forest packages and coded up the following from example code:
library(party)
library(randomForest)
set.seed(415)
#I'll try to reproduce this with a public data set; in the mean time here's the existing code
data = read.csv(data_location, sep = ',')
test = data[1:65] #basically data w/o the "answers"
m = sample(1:(nrow(factor)),nrow(factor)/2,replace=FALSE)
o = sample(1:(nrow(data)),nrow(data)/2,replace=FALSE)
train2 = data[m,]
train3 = data[o,]
#random forest implementation
fit.rf <- randomForest(train2[,66] ~., data=train2, importance=TRUE, ntree=10000)
Prediction.rf <- predict(fit.rf, test) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
#cforest implementation
fit.cf <- cforest(train3[,66]~., data=train3, controls=cforest_unbiased(ntree=10000, mtry=10))
Prediction.cf <- predict(fit.cf, test, OOB=TRUE) #to see if the predictions are accurate -- but it errors out unless I give it all data[1:66]
data[,66] is the target factor I'm trying to predict, but it seems that using "~ ." to solve for it causes the formula to use the factor in the prediction model itself.
How do I solve for the dimension I want on high-ish dimensionality data, without having to spell out exactly which dimensions to use in the formula (so I don't end up with some sort of cforest(data[,66] ~ data[,1] + data[,2] + data[,3] ... etc.)?
EDIT:
On a high level, I believe one basically
loads full data
breaks it down to several subsets to prevent overfitting
trains via subset data
generates a fitting formula so one can predict values of target (in my case data[,66]) given data[1:65].
So my PROBLEM is now that if I give it a new set of test data, let’s say test = data[1:65], it says “Error in eval(expr, envir, enclos) :” where it is expecting data[,66]. I basically want to predict data[,66] given the rest of the data!
I think that if the response is in train3 then it will be used as a feature.
I believe this is more like what you want:
crtl <- cforest_unbiased(ntree=1000, mtry=3)
mod <- cforest(iris[,5] ~ ., data = iris[,-5], controls=crtl)
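An alternative sketch, in case it helps: refer to the response by its column name in the formula, so that "." expands to every other column and prediction no longer needs the response in newdata. The example below uses mtcars and lm() as stand-ins so that it runs with base R only; the assumption is that randomForest() and cforest() handle formulas the same way.

```r
# Stand-in data: pretend column 1 (mpg) is the target, like data[,66]
dat   <- mtcars
train <- dat[1:20, ]
test  <- dat[21:32, -1]   # test set without the response column

# Pitfall: an indexed expression on the left means "." still contains mpg
bad <- lm(train[, 1] ~ ., data = train)
"mpg" %in% attr(terms(bad), "term.labels")

# Build the formula from the column name instead of typing out all predictors
target <- names(train)[1]
f      <- as.formula(paste(target, "~ ."))
fit    <- lm(f, data = train)
"mpg" %in% attr(terms(fit), "term.labels")

# Prediction now works on data that lacks the response column
pred <- predict(fit, newdata = test)
length(pred)
```

The first check returns TRUE (the target leaks in as a predictor) and the second returns FALSE, which is the behaviour the question is after.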
