Machine learning project: split training/test sets before or after exploratory data analysis? - r

Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?
I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.
I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?
I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.
I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).
What do you do in your own work and why?
Thanks for helping a new programmer!

To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).
Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.
Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.
Having trained our final model, we often want to have an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance unsing a new batch of data, the testing data.
This discussion gives the answer your question: We should not use the testing (or validation) data set for exploratory data analysis. Because if we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability of getting an unbiased estimate of our model's performance.

I would take the problem the other way round; is it bad to use the test set ?
The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set is keeping a bunch of data aside to assess how your model behaves with new data (i.e. its variance). If you use the test set during modeling you are left with nothing to do that, and you are overfitting your data.
The objective of EDA is to understand the data you're working with; the distributions of features, their relationships, their dynamics, etc ... If you leave your test set in the data, is there a risk of "overfitting" your understanding of data ? If that was the case, you would observe on say 70% of your data some properties that are not valid for the 30% remaining (test set) ... knowing that the split is random, this is impossible, or you have been extremely unlucky.

From my understanding in Machine Learning Pipeline is exploratory data analysis should be done before splitting the data into train and test.
Here are my reasons:
The data may not be cleaned in the beginning. It might have missing values, mismatch datatypes and outliers.
Need to understand every features with the target variable in the dataset. This will help to understand the importance of every features with respect to the business problem and will help to derive the additional features as well.
The data visualization will also help to get the insights information from the dataset.
Once the above operations done, then we can split the dataset into train and test. Because the features must be similar in both train and test.

Related

normalizing test sample for supervised models?

have a few questions about normalizing training and test data.
As far as I understand it, for certain models (like k-NN) its imperative that you normalize the training sample because similarity is based on distance and if certain explanatory variables have much higher scales than others, it will disproportionally affect the similarity measurements.
So far, while I have read some competing arguments, it seems like overall one should standardize/normalize training sets before building models.
However, I am curious if one also does this with testing data, the arguments I have read is that you can't normalize "future data" you don't know, and if you normalize the dataset before you split you are leaking "future information" into your training sample which will affect the integrity of your model.
I have tried to read what literature is out there, but have still not found an answer I fully understand, though the above make some intuitive sense.
I know it is never “that simple” but I am looking for an generalization to some degree.
What approach should I use for my supervised models?
1. Normalize then split into training and testing samples?
2. Split first then normalize each sample?
3. Split and only normalize the training sample and after test for accuracy on a non-normalized testing set?
Or will it vary substantially between models?
Thanks!

Keras plot after model is finished explanaition

It is not really a programming question, I apologise for that. I have trained my network and generated these graphs.
http://imgur.com/a/zOglL
http://imgur.com/a/90JKl
I am struggling to find an answer to what do Accuracy vs Val_Accuracy and Loss vs Val_loss really represent. I do understand that if val_loss starts to jump high it means that there is over fitting going on and network just starts to memorise the data rather then learning. Could anyone explain in a bit more detail what they all mean?
During a neural network training - you usually provide two sets of data - a train one and validation one. Your training algorithm is taking data from a train set - and using a calculus and backpropagation it's trying to decrease a cost function which somehow represents how good your representation is (smaller the better). Aside of this - a cost is computed for a validation set which is not seen by a training algorithm - so one may check if model doesn't overfit on a train data provided (this happens if train loss is substancially smaller than validation loss). Despite loss different metrics might be computed - one of them might be e.g. accuracy. Sometimes they give you better insights into how your model works because loss might be hard to understand. Metrics give you better understanding on if your model works good.

Strugling to understand complete predictive model process in R

I'm very new to all this and I have a bit of a mental block on the logic of the process. I am trying to predict customer churn using a database of current and already churned customers. So far I have
1) Taken complete customer database of current customers and already churned customers along with customer service variables etc to use to predict on.
2) Split the data set randomly 70/30 into train and test
3) Using R, I have trained a random forest model to predict make predictions and then compared to the actual status using a confusion matrix.
4) I have ran that model using the test data to check accuracy for identifying the churners
I'm now a bit confused. What I want to do now is take all of our current customers and predict which ones will churn. Have I done this all wrong as alot of the current customers I need to predict if will churn have already been seen by the model as they appear in the training set?
Was I somehow supposed to use a training and test set that will not be part of the dataset I need to make predictions on?
Many thanks for any help.
As far as I have understood your question, I feel you want to know if you've done the right thing by using overlapping examples in your training and test set. You first need to understand that you need to keep your training set separate from your test set. Since your model parameters have been computed based on your training set, for similar examples in the test set, the model will give you the correct prediction, so your accuracy will definitely be positively impacted for those common training and test set examples but that is not the correct thing to do. Your test set should always contain previously unseen examples in order to properly evaluate the performance of your algorithm.
If your current customers (on which you want to test your model) are already there in the training set, you would want to leave them out in the testing process. I'd suggest you perform a check between the training set customers and the current set of customers based on some unique identifier (if present) such as the Customer ID and leave common customers out of your fresh batch of unseen test examples.
It looks to me that you have the standard training-test-validation set problem. If I understood correctly, you want to test the performance of your model (Random Forest) to all the data you have.
Standard classroom way to do this is indeed what you already did: Split the dataset for example 70% training and 30% test/validation set, train the model with training set and test with test set.
Better way to test (and predict for all of the data) is to use Cross-Validation to perform the analysis (https://en.wikipedia.org/wiki/Cross-validation_(statistics)). One example for cross-validation is 10-fold cross-validation: You split your data to 10 equal size blocks, loop over all the blocks and for every iteration use the remaining 9 blocks to train your model and the test the model on the specific block.
What you end up with cross-validation is a more comprehensive knowledge of the performance of your model, as well as the results for all of the customers in your database. Cross-validation mitigates the errors in analysis due to random selection of the test set.
Hope this helps!

mlr classification training with rpart does not complete

I have a classification task that I managed to train with mlr package using LDA ("classif.lda") in a few seconds. However when I trained it using "classif.rpart" the training never ended.
Is there any different setup to be done for the different methods?
My training data here if needed to replicate the problem. I tried to train it simply with
pred.bin.task <- makeClassifTask(id="CountyCrime", data=dftrain, target="count.bins")
train("classif.rpart", pred.bin.task)
In general, you don't need to change anything about the setup when switching learners -- one of the main points of mlr is to make this easy! This does not mean that it'll always work though, as different learning methods do different things under the hood.
It looks like in this particular case the model simply takes a long time to train, so you probably didn't wait long enough for it to complete. You have quite a large data frame.
Looking at your data, you seem to have an interval of values in count.bins. This is treated as a factor by R (i.e. intervals are only the same if the string matches completely), which is probably not what you want here. You could encode start and end as separate (numerical) features.

When to use test and training sets in Weka?

I've been working with Weka for awhile now, and in my research on it, I find that a lot of code examples use test and training sets. For instance, with Discretization and Bayesian Networks,their examples are almost always shown using test and training sets. I may be missing some fundamental understanding of data processing here, but I don't understand why this seems to always be the case. I am using Discretization and Bayesian Networks in a project and for both of them, I have not used test or training sets, and do not see why I would need to either. I am performing cross validation on the BayesNet, so I am testing its accuracy. Am I misunderstanding what test and training sets are used for??? Oh and please use the simplest of terminology; I'm still not very experienced with the world of data processing.
The idea behind training and test sets is to test the generalization error. That is, if you used just one data set, you could achieve perfect accuracy by simply learning this set (this is what nearest neighbour classifiers do, IBk in Weka). In general, this is not what you want however -- the machine learning algorithm should learn the general concept behind the example data that you give it. A way of testing whether this happens is to use separate data for training and testing.
If you're using cross-validation, you're using separate training and test sets. This is simply a way of coming up with the partition of your entire data set into training and test. If you do 10 fold cross-validation for example, your entire data is partitioned into 10 sets of equal size. Nine of these are combined and used for training, the remaining one for testing. Then the process is repeated with nine different sets combined for training and so on until all the ten individual partitions will have been used for testing.
So training/test sets and cross-validation are conceptually doing the same thing, cross-validation simply takes a more rigorous approach by averaging over the entire data set.
Training data refers to the data used to "build the model".
For example, it you are using the algorithm J48 (a tree classifier) to classify instances, the training data will be used to generate the tree that will represent the "learned concept" that should be a generalization of the concept. It means that the learned rules, generated trees, the adjusted neural network, or whatever; will be able to get new (unseen) instances and classify them correctly (the "learned concept" does not depends on the training data).
The test sets are a percentage of the data that will be used to test whether the model has learned the concept properly (it is independent of the training data).
In WEKA you can run an execution splitting your data set into trainig data (to build the tree in the case of J48) and test data (to test the model in order to determine that the concept has been learned). For example, you can use 60% of the data for training and 40% for testing (determine how much data is needed for training and testing is one of the key problems of data mining).
But I would recommend you to have a quick look to cross-validation, that is a robust testing method that is implemented in WEKA. It has been explained quite well here:
https://stackoverflow.com/a/10539247/1565171
If you have more questions just leave a comment.

Resources