I am trying to use the predict function but the output does not have the number of trials I expect. I assume something is wrong with my data.frame after reading other errors but can't figure it out.
I've tried to make sure my newdata has the same variable name as my model but that won't fix it. The differing rows are the differing number of solutions being found, for example I train over 50 different sets of information, and I test over 39950 sets.
In both the train_data and the test_data there are 10 columns which are the samples that will be included in each calculation. The model correctly finds these and names them test_data1, test_data2, etc.
I'm sure there is something I'm missing but I can't seem to figure it out.
trainingSampleSize <- k
sample_sample[[k-1]] <- sample(1:ncol(pre$train_data), k, replace = FALSE)
train_data <- pre$train_data[,sample_sample[[k-1]]]
test_data <- pre$test_data[,sample_sample[[k-1]]]
data_lm <- data.frame(train_data, pre$train_targets)
cvFitList[[(k-1)]] <- lm(pre$train_targets ~ train_data, data_lm)
prediction[[k-1]] <- predict(cvFitList[[(k-1)]], data.frame(train_data=test_data))
My goal is to get a prediction for every set of test_data, 39950 results from predict.
I got a warning message:
'newdata' had 39950 rows but variables found have 50 rows
and prediction[[k-1]] has only 50 rows
Related
We created a table in R with values from the S&P500 and added rows like the simple 10 Days Moving Average. We set the NA-values to 0. Example:
myStartDate <- '2020-01-01'
myEndDate <- Sys.Date()
Dataset$SMA10 <- SMA(Dataset[,"Close"], 10)
Dataset$SMA10 <- as.numeric(Dataset$SMA10)
Dataset$SMA10[is.na(Dataset$SMA10)] <- 0
Our goal is to create a random forest model. Therefore we split the data into a train and a valid data:
set.seed(100)
train <- sample(nrow(Dataset), 0.5*nrow(Dataset), replace = FALSE)
TrainSet <- Dataset [train,]
ValidSet <- Dataset [-train,]
Now if we want to generate the model with following code;
model1 <- randomForest(SMA10~.,data=TrainSet, mtry=5, importance=TRUE,ntree=500)
print(model1)
we get this error message:
Error in x[, i] <- frame[[i]] : number of items to replace is not a multiple of replacement length
By looking up this error in the forum, we found that this is related with NA-Values. Therefore we are a little confused, because we have no NA-Values in our table. Can you tell us what we are doing wrong? Thank you very much in advance.
I have a dataframe dfab that contains 2 columns that I used as argument to generate a series of linear models as following:
models = list()
for (i in 1:10){
models[[i]] = lm(fc_ab10 ~ (poly(nUs_ab, i)), data = dfab)
}
dfab has 32 observations and I want to predict fc_ab10 for only 1 value.
I thought of doing so:
newdf = data.frame(newdf = nUs_ab)
newdf[] = 0
newdf[1,1] = 56
prediction = predict(models[[1]], newdata = newdf)
First I tried writing newdf as a dataframe with only one position, but since there are 32 in the dataset on which the model was built, I thought I had to provide at least 32 points as well. I don't think this is necessary though.
Every time I run that piece of code I am given the following error:
Error: variable 'poly(nUs_ab, i) was fitted with type “nmatrix.1” but type “numeric” was supplied.
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
I thought all I need to use predict was a LM model, predictors (the number 56) given in a column-named dataframe. Obviously, I am mistaken.
How can I fix this issue?
Thanks.
newdf should be a data.frame with column name nUs_ab, otherwise R won't be able to know which column to operate upon (i.e., generate the prediction design matrix). So the following code should work
newdf = data.frame(nUs_ab = 56)
prediction = predict(models[[1]], newdata = newdf)
I am using random forest for prediction and in the predict(fit, test_feature) line, I get the following error. Can someone help me to overcome this. I did the same steps with another dataset and had no error. but I get error here.
Error: Error in x[, vname, drop = FALSE] : subscript out of bounds
training_index <- createDataPartition(shufflled[,487], p = 0.8, times = 1)
training_index <- unlist(training_index)
train_set <- shufflled[training_index,]
test_set <- shufflled[-training_index,]
accuracies<- c()
k=10
n= floor(nrow(train_set)/k)
for(i in 1:k){
sub1<- ((i-1)*n+1)
sub2<- (i*n)
subset<- sub1:sub2
train<- train_set[-subset, ]
test<- train_set[subset, ]
test_feature<- test[ ,-487]
True_Label<- as.factor(test[ ,487])
fit<- randomForest(x= train[ ,-487], y= as.factor(train[ ,487]))
prediction<- predict(fit, test_feature) #The error line
correctlabel<- prediction == True_Label
t<- table(prediction, True_Label)
}
I had similar problem few weeks ago.
To go around the problem, you can do this:
df$label <- factor(df$label)
Instead of as.factor try just factor generic function. Also, try first naming your label variable.
Are there identical column names in your training and validation x?
I had the same error message and solved it by renaming my column names because my data was a matrix and their colnames were all empty, i.e. "".
Your question is not very clear, anyway I try to help you.
First of all check your data to see the distribution in levels of your various predictors and outcomes.
You may find that some of your predictor levels or outcome levels are very highly skewed, or some outcomes or predictor levels are very rare. I got that error when I was trying to predict a very rare outcome with a heavily tuned random forest, and so some of the predictor levels were not actually in the training data. Thus a factor level appears in the test data that the training data thinks is out of bounds.
Alternatively, check the names of your variables.
Before calling predict() to make sure that the variable names match.
Without your data files, it's hard to tell why your first example worked.
For example You can try:
names(test) <- names(train)
Add the expression
dimnames(test_feature) <- NULL
before
prediction <- predict(fit, test_feature)
I'm trying to use the random forest model to predict Gender based on Height, Weight and Number of siblings. I've gotten the data from a much larger data set that contains dozens of variables, but I've cleaned it into this "clean" data.frame with omitted NA values and only the 4 variables I care about, the last column being Gender.
I've tried fiddling with the code and searching everywhere but I can't find a concrete fix.
Here's the code:
ind <- sample(nrow(clean),0.8*nrow(clean))
train <- clean[ind,]
test <- clean[-ind,]
rf <- randomForest(Gender ~ ., data = train[,1:4], ntree = 20)
pred <- predict(rf, newdata = test[,-c(length(test))])
cm <- table(test$Gender, pred)
cm
and here's the output:
Error in `[.default`(table(observed = y, predicted = out.class), levels(y), : subscript out of bounds
Traceback:
1. randomForest(Gender ~ ., data = train[, 1:4], ntree = 20)
2. randomForest.formula(Gender ~ ., data = train[, 1:4], ntree = 20)
3. randomForest.default(m, y, ...)
4. table(observed = y, predicted = out.class)[levels(y), levels(y)]
5. `[.table`(table(observed = y, predicted = out.class), levels(y),
. levels(y))
6. NextMethod()
The problem is likely that you have some kind of a variable level in your test data that was not reflected in your training data. So when it goes to assign the outcome, it has no basis to do so.
It is impossible to say for sure without sample data, but it is the most likely scenario. Try setting a seed set.seed=3 and then change the seed number set.seed=28 and so on, a few times to see if you end up finding a combination where you do not get the error.
Compare the conflicted data frame with the un-conflicted one to see what is missing.
EDIT:
Also, try running str(train) and str(test) to be sure the fields have remained the same. You can share that if you like by editing your post.
If any of the columns are factors with levels missing (meaning it has 10 levels but only 8 are represented in the train with 9 or 10 in the test) it might be a problem. They should be balanced if you are trying to create a predictor for all possible outcomes.
If nothing else works, you can set a seed and remove predictors one at a time until it runs correctly, then look to see how the train and test sets are different in that removed column.
I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomforest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic vector. So it first had to be converted as a data frame. So I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
First, your coercion to a factor is not working because of syntax errors. Second, you should always use indexing when specifying a RF model. Here are changes in your code that should make it work.
training <- sapply(training.temp,as.numeric)
training[,"Class"] <- as.factor(training[,"Class"])
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"],
importance=TRUE, do.trace=100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]),
importance=TRUE, do.trace=100)