I'm fitting an SVM model on a dataset, and it runs fine on the training data. However, when I run it on the test data to get predictions, it seems to drop rows for some reason: when I try to add 'pred_SVM' back into the test dataset, the lengths are different.
Below is my code:
#SVM MODEL
SVM_swim <- svm(racetime_mins ~ event_date + event_month + year +
                  event_id + gender + place + distance + New_Condition +
                  raceNo_Updated + handicap_mins + points +
                  Wind_Speed_knots + Air_Temp_Celsius +
                  Water_Temp_Celsius + Wave_Height_m,
                data = SVMTrain, kernel = 'linear')
summary(SVM_swim)
#Predict Race_Time Using Test Data
pred_SVM <- predict(SVM_swim, SVMTest, type ="response")
View(pred_SVM)
#Add predicted Race_Times back into the test dataset.
SVMTest$Pred_RaceTimes<- pred_SVM
View(SVMTest) #Returns 13214 rows
View(pred_SVM) #Returns 12830
Error in $<-.data.frame(*tmp*, Pred_RaceTime, value = c(2 = 27.1766438249356, :
replacement has 12830 rows, data has 13214
As the error message indicates, the prediction vector is shorter than the test set, which points to NA values in your data. By default, svm() and its predict method drop rows that contain missing values (na.action = na.omit), so pred_SVM is computed only for the complete rows.
To check whether your test data contains NA values, run: sum(is.na(SVMTest))
I am pretty sure that you will see a number greater than zero.
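To see where the missing values are, a per-column count is often more useful than a single total. The following is a minimal base R sketch; SVMTest_demo is a hypothetical stand-in for your SVMTest, with made-up values.

```r
# Hypothetical stand-in for SVMTest (values are made up for illustration)
SVMTest_demo <- data.frame(
  distance         = c(1.5, NA, 3.0, 5.0),
  Wind_Speed_knots = c(10, 12, NA, NA),
  gender           = c("F", "M", "M", "F")
)

# Number of NA values in each column
na_per_column <- colSums(is.na(SVMTest_demo))

# Number of rows that contain at least one NA
n_incomplete <- sum(!complete.cases(SVMTest_demo))

print(na_per_column)
print(n_incomplete)
```

If the number of incomplete rows in your real test set equals the gap between the two lengths (13214 - 12830 = 384), missing values are very likely the whole explanation.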
Before building your SVM model, remove all rows that contain NA values:
dataset <- dataset[complete.cases(dataset), ]
Then, after splitting your data into train and test sets, you can run:
SVM_swim <- svm(.....,data = SVMTrain, kernel='linear')
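If you would rather keep every test row and leave the prediction as NA where the inputs are incomplete, you can predict on the complete rows only and write the results back by position. This is a sketch under assumptions: the toy data frames train and test, the columns x1, x2 and y, and the pred column are all hypothetical, and it assumes the e1071 package is installed.

```r
library(e1071)

set.seed(1)

# Hypothetical toy data standing in for SVMTrain / SVMTest
train <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
train$y <- 2 * train$x1 - train$x2 + rnorm(50, sd = 0.1)

test <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
test$x1[c(3, 7)] <- NA   # inject missing values into two rows

fit <- svm(y ~ x1 + x2, data = train, kernel = "linear")

# Rows of the test set with no missing predictor values
predictors <- c("x1", "x2")
ok <- complete.cases(test[, predictors])

# Predict on complete rows only, then write back into the matching rows
pred <- predict(fit, test[ok, predictors])
test$pred <- NA_real_
test$pred[ok] <- pred
```

The test set keeps all 10 rows, and the two rows with missing inputs simply have NA in pred, so the lengths always match.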
I'm having trouble with my first forecasting implementation in R. What I'd like to achieve is to predict the variable Y with 2 exogenous variables X1 and X2. The 3 datasets are each represented as a single column with 12 rows.
Following another Stack Overflow post, I took a similar approach:
DataSample <- data.frame(Y=Y[,1],Month=rep(1:12,1),
X1=X1[,1],X2=X2[,1])
predictor_matrix <- cbind(Month=model.matrix(~as.factor(DataSample$Month)),
X1=DataSample$X1,
X2=DataSample$X2)
# Remove intercept
predictor_matrix <- predictor_matrix[,-1]
# Rename columns
colnames(predictor_matrix) <- c("January","February","March","April","May","June","July","August","September","October","November","X1","X2")
# Variable to be modeled
var <- ts(DataSample$Y, frequency=12)
#Find ARIMA
modArima <- auto.arima(var, xreg = predictor_matrix)
At this line I get the following error:
Error in optim(init[mask], armaCSS, method = optim.method, hessian =
FALSE, : non-finite value supplied by optim
I presume that my predictor_matrix is not in the correct format but I can't find the error.
Any help would be appreciated.
You said that each dataset has 12 rows, but your predictor matrix has 13 columns (11 monthly dummy variables plus X1 and X2). With more columns than rows, the columns are necessarily linearly dependent, so the optimization procedure fails.
You need (ideally much) more data to support the number of predictor variables and/or a sparser set of predictors.
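You can confirm the linear dependence by comparing the rank of the predictor matrix with its number of columns. The sketch below rebuilds a matrix of the same shape in base R; the random X1 and X2 values are placeholders, not your data.

```r
set.seed(42)

# 12 monthly observations, as in the question
month <- factor(1:12)

# 11 month dummies (intercept column dropped) plus two placeholder regressors
X <- cbind(model.matrix(~ month)[, -1],
           X1 = rnorm(12),
           X2 = rnorm(12))

n_rows <- nrow(X)
n_cols <- ncol(X)
x_rank <- qr(X)$rank   # numerical rank of the predictor matrix

print(c(rows = n_rows, cols = n_cols, rank = x_rank))
```

The rank can never exceed the number of rows, so with 12 rows and 13 columns the matrix is rank deficient whatever values X1 and X2 take.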
I'm having a problem similar to the one in the two questions listed below, which concern the linear model predict function, but I am using the "time series linear model" function (tslm) from Rob Hyndman's forecast package.
Predict.lm in R fails to recognize newdata
predict.lm with newdata
totalConv <- ts(varData[,43])
metaSearch <- ts(varData[,45])
PPCBrand <- ts(varData[,38])
PPCGeneric <- ts(varData[,34])
PPCLocation <- ts(varData[,35])
brandDisplay <- ts(varData[,29])
standardDisplay <- ts(varData[,3])
TV <- ts(varData[,2])
richMedia <- ts(varData[,46])
df.HA <- data.frame(totalConv, metaSearch,
PPCBrand, PPCGeneric, PPCLocation,
brandDisplay, standardDisplay,
TV, richMedia)
As you can see I've tried to avoid the names issues by creating a data frame of the time series objects.
However, I then fit a tslm object (time series linear model) as follows -
fit1 <- tslm(totalConv ~ metaSearch
+ PPCBrand + PPCGeneric + PPCLocation
+ brandDisplay + standardDisplay
+ TV + richMedia, data = df.HA
)
Despite having created a data frame and named all the objects properly I get the same dimension error as these other users have experienced.
Error in forecast.lm(fit1) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 696 rows
2: 'newdata' had 10 rows but variables found have 696 rows
The model frame seems to give sensible names to all of the variables, so I don't know what is wrong with the forecast function:
names(model.frame(fit1))
[1] "totalConv" "metaSearch" "PPCBrand" "PPCGeneric" "PPCLocation" "brandDisplay"
[7] "standardDisplay" "TV" "richMedia"
Can anyone suggest any other improvements to my model specification that might help the forecast function to run?
EDIT 1: OK, just so there's a working example, I've used the data given in Irsal's answer to the question "Is there an easy way to revert a forecast back into a time series for plotting?" (converting to time series objects) and then fitted the tslm. I get the same error (with different dimensions, obviously).
I'm really confused about what I'm doing wrong; my code looks identical to that used in all of the examples on this.
data <- c(11,53,50,53,57,69,70,65,64,66,66,64,61,65,69,61,67,71,74,71,77,75,85,88,95,
93,96,89,95,98,110,134,127,132,107,94,79,72,68,72,70,66,62,62,60,59,61,67,
74,87,112,134,51,50,38,40,44,54,52,51,48,50,49,49,48,57,52,53,50,50,55,50,
55,60,65,67,75,66,65,65,69,72,93,137,125,110,93,72,61,55,51,52,50,46,46,45,
48,44,45,53,55,65,89,112,38,7,39,35,37,41,51,53,57,52,57,51,52,49,48,48,51,
54,48,50,50,53,56,64,71,74,66,69,71,75,84,93,107,111,112,90,75,62,53,51,52,
51,49,48,49,52,50,50,59,58,69,95,148,49,83,40,40,40,53,57,54,52,56,53,55,
55,51,54,45,49,46,52,49,50,57,58,63,73,66,63,72,72,71,77,105,97,104,85,73,
66,55,52,50,52,48,48,46,48,53,49,58,56,72,84,124,76,4,40,39,36,38,48,55,49,
51,48,46,46,47,44,44,45,43,48,46,45,50,50,56,62,53,62,63)
data2 <- c(rnorm(237))
library(forecast)
nData <- ts(data)
nData2 <- ts(data2)
dat.ts <- tslm(nData~nData2)
forecast(dat.ts)
Error in forecast.lm(dat.ts) : Variables not found in newdata
In addition: Warning messages:
1: 'newdata' had 10 rows but variables found have 237 rows
2: 'newdata' had 10 rows but variables found have 237 rows
EDIT 2: Same error even if I combine both series into a data frame.
nData.df <- data.frame(nData, nData2)
dat.ts <- tslm(nData~nData2, data = nData.df)
forecast(dat.ts)
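tslm fits a linear regression model, so to forecast you need to provide the future values of the explanatory variables. Pass them via the newdata argument of forecast.lm, as a data frame whose column names match the predictor names used in the model formula. Without newdata, forecast() has no future predictor values to work with, which is why it reports "Variables not found in newdata".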
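A minimal sketch of that call, assuming the forecast package is installed: the series x and y and the five future x values below are simulated placeholders, not real data.

```r
library(forecast)

set.seed(1)

# Simulated placeholder series: y depends linearly on x
x <- ts(rnorm(100))
y <- ts(2 * x + rnorm(100, sd = 0.1))

fit <- tslm(y ~ x)

# Hypothetical future values of the explanatory variable;
# the column name must match the predictor name in the formula
future <- data.frame(x = rnorm(5))

fc <- forecast(fit, newdata = future)
print(fc)
```

The forecast horizon is set by the number of rows in newdata, so this produces five forecasts. In practice the future predictor values have to come from somewhere: known plans, scenarios, or separate forecasts of each predictor.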
I originally had a data frame with 12 columns and N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor in order to use randomForest() as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I then build the forest with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
'<=' not meaningful for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
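So the issue is actually quite simple. It turns out that my training data was no longer a data frame: sapply() had returned a numeric matrix, which is an atomic vector and cannot hold a factor column. It first had to be converted back to a data frame, so I needed to add the following line: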
training <- as.data.frame(training)
Problem solved!
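First, your coercion to a factor is not working because sapply() returns a numeric matrix, and a matrix column cannot hold a factor, so the class stays numeric and randomForest treats the problem as regression. Second, for larger datasets it is usually better to pass the predictors and the response by indexing (the x and y arguments) instead of a formula. Here are changes to your code that should make it work.
training <- as.data.frame(sapply(training.temp, as.numeric))
training$Class <- as.factor(training$Class)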
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"],
importance=TRUE, do.trace=100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]),
importance=TRUE, do.trace=100)
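The following sketch reproduces the problem and the fix on made-up data, assuming the randomForest package is installed. The training.temp data frame here is a hypothetical stand-in with two predictors, a and b.

```r
library(randomForest)

set.seed(1)

# Hypothetical stand-in for training.temp: everything stored as character
training.temp <- data.frame(
  a     = as.character(rnorm(40)),
  b     = as.character(rnorm(40)),
  Class = rep(c("0", "1"), 20),
  stringsAsFactors = FALSE
)

# sapply() returns a numeric matrix, not a data frame
m <- sapply(training.temp, as.numeric)

# Assigning a factor into a matrix column silently stores numbers
m2 <- m
m2[, "Class"] <- factor(m2[, "Class"])
class_is_factor_in_matrix <- is.factor(m2[, "Class"])

# Convert back to a data frame first, then make the response a factor
training <- as.data.frame(m)
training$Class <- factor(training$Class)

rf <- randomForest(Class ~ ., data = training)
print(rf$type)
```

Checking rf$type is a quick way to confirm that the forest was fitted as a classifier and not as a regression.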