Why do we use the fit() method only on the training data when scaling? - scaling

In feature scaling, we call the fit() method only on the training data, and not on the validation or test data.
Why don't we use the mean and standard deviation of the test or validation data when scaling the test or validation data?
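To illustrate the pattern the question describes, here is a minimal sketch in base R rather than a fit()/transform() API, using iris purely as a stand-in dataset: the mean and standard deviation are estimated on the training rows only and then reused to scale the test rows.
train <- iris[1:100, 1:4]
test  <- iris[101:150, 1:4]
# scaling statistics come from the training data only
train_mean <- colMeans(train)
train_sd   <- apply(train, 2, sd)
# both splits are scaled with the *training* mean and sd
train_scaled <- scale(train, center = train_mean, scale = train_sd)
test_scaled  <- scale(test,  center = train_mean, scale = train_sd)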

Related

R randomForest localImp for test set

I'm using the R package randomForest, version 4.6-14. The function randomForest takes a parameter localImp, and if that parameter is set to TRUE the function computes local explanations for the predictions. However, these explanations are for the provided training set. I want to fit a random forest model on a training set and use that model to compute local explanations for a separate test set. As far as I can tell, the predict.randomForest function in the same package provides no such functionality. Any ideas?
Can you explain more about what it means to have some local explanation on a test set?
According to this answer and the package documentation, the variable importance (or the casewise importance implied by localImp) measures how each variable affects prediction accuracy. For a test set, where there are no labels to assess prediction accuracy, that importance is therefore unavailable.
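For illustration, a hedged sketch (again with iris as a stand-in dataset): localImp = TRUE attaches casewise importances for the training cases to the fitted object, while predict() on new data returns predictions only.
library(randomForest)
set.seed(1)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]
fit <- randomForest(Species ~ ., data = train, localImp = TRUE)
# casewise (local) importance: one column per *training* case
dim(fit$localImportance)
# predict() gives predictions for new data, but no local importance
pred <- predict(fit, newdata = test)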

When to use Train Validation Test sets

I know this question is quite common but I have looked at all the questions that have been asked before and I still can't understand why we also need a validation set.
I know sometimes people only use a train set and a test set, so why do we also need a validation set?
And how do we use it?
For example, in order to impute missing data, do I impute these three sets separately or not?
Thank you!
I will try to answer with an example.
If I'm training a neural network or doing linear regression using only train and test data, I can check my test-data loss at each iteration and stop when it begins to grow, or keep a snapshot of the model with the lowest test loss.
In some sense this is "overfitting" to my test data, since I decide when to stop based on it.
If I use train, validation and test data, I can do the same process as above with the validation data instead of the test data, and then, once I decide my model is done training, I can evaluate it on the never-before-seen test data to get a less biased estimate of my model's predictive performance.
For the second part of the question, I would suggest treating at least the test data as independent and imputing its missing data separately (see the sketch below), but it depends on the situation and the data.
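A sketch of that suggestion, assuming a hypothetical data frame df with a numeric column x1 that has missing values: split into the three sets, then impute every split with a statistic computed from the training set only, so nothing leaks from validation or test into the imputation.
set.seed(42)
idx <- sample(c("train", "valid", "test"), size = nrow(df),
              replace = TRUE, prob = c(0.7, 0.15, 0.15))
train <- df[idx == "train", ]
valid <- df[idx == "valid", ]
test  <- df[idx == "test", ]
# imputation statistic comes from the training set only
train_median <- median(train$x1, na.rm = TRUE)
train$x1[is.na(train$x1)] <- train_median
valid$x1[is.na(valid$x1)] <- train_median
test$x1[is.na(test$x1)]   <- train_median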

Do I exclude data used in the training set when running predict() with the model?

I am very new to machine learning. I have a question about running predict() on data that was used in the training set.
Here are the details: I took a portion of my initial dataset and split that portion into 80% (train) and 20% (test). I trained the model on the 80% training set
model <- train(name ~ ., data = train.df, method = ...)
and then ran the model on the 20% test data:
predict(model, newdata = test.df, type = "prob")
Now I want to run my trained model on the initial dataset, which also includes the training portion. Do I need to exclude the portion that was used for training?
When you report to someone else how well your machine learning model works, you always report the accuracy you get on data that was not used for training (or validation).
You can also report accuracy numbers for the overall dataset, but always include the remark that this dataset contains the partition that was used to train the algorithm.
This care is taken to make sure your algorithm has not overfitted to your training set: https://en.wikipedia.org/wiki/Overfitting
Julie, I saw your comment below your original post. I would suggest you edit the original post and include your data split to be more complete in your question. It would also help to know what method of regression/classification you're using.
I'm assuming you're trying to assess the accuracy of your model with the 90% of data you left out. Depending on the number of samples you used in your training set you may or may not have the accuracy you'd like. Accuracy will also depend on your approach to the method of regression/classification you used.
To answer your question directly: you don't need to exclude anything from your dataset - the model doesn't change when you call predict().
All you're doing when you call predict() is filling in the x-variables in your model with whatever data you supply. Your model was fitted to your training set, so if you supply training-set data again it will still produce predictions. Note, though, that for assessing accuracy your results will be skewed if you include the data the model was fitted to, since that's what it learned from to create predictions in the first place - kind of like watching a game, and then watching the same game again and being asked to make predictions about it.
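A hedged sketch continuing the caret example above (full.df stands for the initial dataset mentioned in the question, and the outcome name from the formula is assumed to be a factor): predicting on the held-out test split gives an honest estimate, while predicting on the full data works mechanically but flatters the rows the model was trained on.
library(caret)
# honest estimate: held-out test split only
test_pred <- predict(model, newdata = test.df)
confusionMatrix(test_pred, test.df$name)
# mechanically fine, but optimistic: full data includes the training rows
full_pred <- predict(model, newdata = full.df)
confusionMatrix(full_pred, full.df$name)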

Perform kNN with clustered input data

How do I perform kNN cross-validation with input data that has been clustered using k-means?
I can't seem to find the right function to do this.
prediction.strength from fpc seems to be able to compute some form of prediction rate given a classifier method, but it appears to test it against the training set, which in my mind isn't that useful.
Isn't there a function that can perform cross-validation?
Example:
library("datasets")
library("stats")
iris_c3 = kmeans(iris$Sepal.Length,center= 10, iter.max = 30)
How do I provide iris_c3 as training data for some form of kNN that also performs cross-validation, if a test set is provided in the same manner?
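One possible approach (a sketch, not taken from the original thread): treat the k-means cluster assignments as class labels and use leave-one-out cross-validated kNN from the class package.
library(class)
features <- data.frame(Sepal.Length = iris$Sepal.Length)
labels   <- factor(iris_c3$cluster)
# knn.cv() classifies each observation using all the others (leave-one-out CV)
cv_pred <- knn.cv(train = features, cl = labels, k = 5)
# agreement between the cross-validated kNN predictions and the k-means labels
mean(cv_pred == labels)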

Does Seed Value affect the result of training data in R?

I am trying to create a model with my data divided into training (70%), validation (15%) and testing (15%) sets. After running the model I get some accuracy (ROC) and values for my confusion matrix, but every time I change the seed value it affects my output. How do I address this? Is this the expected behavior? If so, how can I decide which value to report as the final result?
set.seed() defines a starting point for the generation of random values. Running an analysis with the same seed should return the same result; using a different seed can produce different output, in your case probably because of a different split into training, validation and testing.
If the differences are acceptably small, then your model is robust to different splits into training, validation and testing. If the differences are large, then your model is not robust and should not be trusted. You will have to change the way the data is split (stratification might help) or revise the model.
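One way to check this (a sketch, assuming a hypothetical data frame df with a binary 0/1 outcome y): repeat the split with different seeds and look at the spread of the resulting accuracies.
accs <- sapply(1:20, function(seed) {
  set.seed(seed)
  idx   <- sample(nrow(df), size = round(0.7 * nrow(df)))
  train <- df[idx, ]
  test  <- df[-idx, ]
  fit   <- glm(y ~ ., data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")
  mean(ifelse(prob > 0.5, 1, 0) == test$y)
})
summary(accs)  # a narrow range suggests the result is robust to the split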
