I'm using a support vector machine on the Titanic dataset, and some of the observations are not being predicted when I call predict() with my model.
library(e1071)
library(data.table)
library(ISLR)
# dat holds the Titanic data (891 rows), stored as a data.table
titanic.index <- sample(891, 600)
titanic.train <- dat[titanic.index]
titanic.test <- dat[-titanic.index]
titanic.fit <- svm(Survived ~ Pclass + Sex + SibSp, data = titanic.train, kernel = "polynomial")
titanic.preds <- predict(titanic.fit, newdata = titanic.test)
titanic.preds
length(titanic.preds)
Whenever I run this on my computer I get anywhere from 220 to 240 predictions, but there are clearly 291 observations in the test data. There aren't any missing values for these predictors. To make matters weirder, when I build an SVM using the Auto dataset from the ISLR package, the same problem doesn't occur.
data("Auto")
auto <- as.data.table(Auto)
auto[, mileage := ifelse(mpg > median(mpg), 1, 0)]
auto[, mileage := factor(mileage)]
auto.index <- sample(392, 200)
auto.train <- auto[auto.index]
auto.test <- auto[-auto.index]
auto.fit <- svm(mileage ~ ., data = auto.train)
auto.preds <- predict(auto.fit, newdata = auto.test)
auto.preds
length(auto.preds)
I have no idea why this is happening. Any insight you can provide is greatly appreciated!
I'm trying to predict automobile prices from a set of independent variables using linear regression. The only attributes in my data set of type chr are Fuel and Colour; the rest are num or int. I omitted Fuel because it only has one level.
Here is my code:
# Loading Data
car_data <- read.csv("Car_Data (1).csv", header = TRUE)
car_data$Fuel <- NULL
car_data$Colour <- as.factor(car_data$Colour)
str(car_data)
set.seed(123)
indx <- sample(2, nrow(car_data), replace = T, prob = c(0.8, 0.2))
train <- car_data[indx == 1, ]
test <- car_data[indx == 2, ]
lmModel <- lm(Price ~ ., data = train)
summary(lmModel)
When I run summary(lmModel), it shows NA for every coefficient's Std. Error, t value, and Pr(>|t|).
Can someone help?
It's possible that your dataset has too few observations for the number of features you are trying to fit. It would help reproducibility if you could supply your dataset (or a minimal working example built from similar data). You could also try a simpler regression specification to see whether that isolates the problem.
lmModelSimple <- lm(Price ~ Colour, data = train)
summary(lmModelSimple)
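For intuition, here is a minimal sketch with made-up data (not your Car_Data file) showing how a rank-deficient fit, with more coefficients than observations, produces exactly those NA/NaN entries in summary():
# Hypothetical toy data: 5 rows but 7 coefficients (intercept + 6 predictors),
# so lm() cannot estimate them all and summary() shows NA/NaN throughout.
set.seed(1)
toy <- data.frame(Price = rnorm(5), x1 = rnorm(5), x2 = rnorm(5),
                  x3 = rnorm(5), x4 = rnorm(5), x5 = rnorm(5), x6 = rnorm(5))
summary(lm(Price ~ ., data = toy))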
I am fitting a model with a random site-level effect using a generalized additive model, implemented in the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam() but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete = T), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model, which uses discretization, throws this error (the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of not specifying random.x in newdata; mgcv then goes looking for random.x and finds it in the global environment. You should really gather those variables into a data frame and use the data argument when fitting your models, and try not to leave similarly named objects lying around in your global environment.
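Putting that advice together, a minimal sketch (reusing the simulated y, x, and random.x from the question) might look like this:
# Gather the variables into a data frame and pass it via `data =`.
dat <- data.frame(y = y, x = x, random.x = random.x)
fit <- bam(y ~ s(x) + s(random.x, bs = 're'), data = dat, discrete = T, nthreads = 2)
# Supply a placeholder level for the random effect, then exclude its term.
newdat <- data.frame(x = runif(200, 0, 1), random.x = dat$random.x[1])
preds <- predict(fit, newdata = newdat, exclude = "s(random.x)")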
I am trying to use the randomForest function in R. For my analysis, I have a dataset with 151 observations, which I split 70/30 to get 105 observations in training and 46 in test.
I'm using the subset argument to indicate the training observations. However, when I look at rf$predicted, I see that the model used the entire dataset (151 observations), not just the training data.
Also, when I use predict() on the test data, the model predicts on 150 observations, but the test set has only 46.
Can you please tell me what I may be doing wrong? I want to fit the model using the training dataset only and predict on the test dataset only. Thank you in advance!
Data is available here:
https://archive.ics.uci.edu/ml/datasets/teaching+assistant+evaluation
https://archive.ics.uci.edu/ml/machine-learning-databases/tae/
Code:
library("randomForest")
library(caTools)
# Importing tae.csv
setwd("C:\\Users\\Saulat Majid\\Documents\\MSDataAnalytics\\DSU\\10 STAT 702 Modern Applied Statistics II - Saunders\\HW5")
tae <- read.table(file = "tae.csv", header = FALSE, sep = ",")
head(tae)
colnames(tae) <- c("N_Speaker", "Instructor", "Course", "Summer", "C_Size", "Class")
# Numerical Summary of tae dataset
head(tae)
str(tae)
summary(tae)
# Converted categorical variables into factor
tae$Class <- as.factor(tae$Class)
tae$N_Speaker <- as.factor(tae$N_Speaker)
tae$Instructor <- as.factor(tae$Instructor)
tae$Summer <- as.factor(tae$Summer)
tae$Course <- as.factor(tae$Course)
str(tae)
# Splitting data into train and test
tae.Split <- sample.split(tae$Class, SplitRatio = 0.7)
table(tae.Split)
tae.train <- tae[tae.Split,]
tae.test <- tae[!tae.Split,]
dim(tae)
dim(tae.train)
dim(tae.test)
rf <- randomForest(Class ~ N_Speaker + Summer + C_Size, data = tae, subset = tae.Split)
rf$predicted
predict(object = rf, newdata = tae[-tae.Split, ], type = "response")
How can I use the result of a randomForest call in R to predict labels on some unlabeled data (e.g., real-world input to be classified)?
Code:
library(randomForest)
train_data = read.csv("train.csv", sep = ";")  # the files are semicolon-separated
input_data = read.csv("input.csv", sep = ";")
result_forest = randomForest(label ~ ., data = train_data)
labeled_input = result_forest.predict(input_data) # I need something like this
train.csv:
a;b;c;label;
1;1;1;a;
2;2;2;b;
1;2;1;c;
input.csv:
a;b;c;
1;1;1;
2;1;2;
I need to get something like this:
a;b;c;label;
1;1;1;a;
2;1;2;b;
Let me know if this is what you are getting at.
You train your random forest with your training data:
# Training dataset (the sample files shown are semicolon-separated)
library(randomForest)
train_data <- read.csv("train.csv", sep = ";")
# Train the random forest on the labeled rows
forest_model <- randomForest(label ~ ., data = train_data)
Now that the random forest is trained, you can give it new data so it can predict the labels.
input_data$predictedlabel <- predict(forest_model, newdata=input_data)
The above code adds a new column to your input_data showing the predicted label.
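If you then want the output in the same semicolon-separated shape as the sample files, something along these lines should work (output.csv is just a placeholder name):
# Write the labeled data back out with ";" separators, mirroring input.csv
write.table(input_data, "output.csv", sep = ";", row.names = FALSE, quote = FALSE)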
You can use the predict function. For example:
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
This is from http://ugrad.stat.ubc.ca/R/library/randomForest/html/predict.randomForest.html
Has anyone got a quick, short, educational example of how to use neural networks (nnet in R) for prediction?
Here is an example, in R, of a time series:
T = seq(0,20,length=200)
Y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
plot(T,Y,type="l")
Many thanks
David
I think you can use the caret package, and specifically the train function:
This function sets up a grid of tuning parameters for a number
of classification and regression routines.
require(quantmod)
require(nnet)
require(caret)
T = seq(0,20,length=200)
y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
dat <- dat[c(3:200),] #delete first 2 observations
#Fit model
model <- train(y ~ x1 + x2,
               data = dat,
               method = 'nnet',
               linout = TRUE,
               trace = FALSE)
ps <- predict(model, dat)
#Examine results
plot(T,Y,type="l",col = 2)
lines(T[-c(1:2)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
The solution proposed by @agstudy is useful, but in-sample fits are not a reliable guide to out-of-sample forecasting accuracy. The gold standard in forecasting accuracy measurement is to use a holdout sample. Remove the last 5, 10, or 20 observations (depending on the length of the time series) from the training sample, fit your models to the rest of the data, use the fitted models to forecast the holdout sample, and simply compare accuracies on the holdout, using Mean Absolute Deviations (MAD) or weighted Mean Absolute Percentage Errors (wMAPEs).
So to do this you can change the code above in this way:
require(quantmod)
require(nnet)
require(caret)
t = seq(0,20,length=200)
y = 1 + 3*cos(4*t+2) +.2*t^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
train_set <- dat[c(3:185),]
test_set <- dat[c(186:200),]
#Fit model
model <- train(y ~ x1 + x2,
               data = train_set,
               method = 'nnet',
               linout = TRUE,
               trace = FALSE)
ps <- predict(model, test_set)
#Examine results
plot(T,Y,type="l",col = 2)
lines(T[c(186:200)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
This last line outputs the wMAPE of the forecasts from the model:
sum(abs(ps - test_set$y)) / sum(abs(test_set$y))
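The MAD mentioned above is even simpler; as a one-line sketch:
# Mean Absolute Deviation of the holdout forecasts
mean(abs(ps - test_set$y))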