My knowledge in this area is limited, so apologies if this is a trivial question.
I need to train a model and I have two data sets: training data for building the model and scoring data to apply the model to.
One important categorical variable has 200 levels in the training data but only 50 levels in the scoring data. In fact, they share only 20 levels.
So, what is the correct way to deal with such a situation? Should I limit the levels to the intersection of the two sets of levels, keep them as they are, or something else?
Best regards.
There are a number of different options here. I assume you are talking about a single attribute, and since you mention levels I'm also assuming it is numerical:
The first option is to do nothing and to see what result you get.
The second is to normalize the values, putting them all on the same scale from 0 to 1.
You could also try binning; I'm not sure off-hand what the equivalent is in R.
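For what it's worth, both of these options are straightforward in base R. Here is a minimal sketch, where x is just a hypothetical numeric attribute:
x <- c(5, 12, 7, 30, 18, 3)                                       # hypothetical attribute values
x_scaled <- (x - min(x)) / (max(x) - min(x))                      # option 2: min-max scaling to the 0-1 range
x_binned <- cut(x, breaks = 3, labels = c("low", "mid", "high"))  # option 3: binning with cut()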
I'm not an expert, but I've found that doing some testing and trying different methods doesn't hurt. A program I use at school is called Weka; it's free and open source, and there are instructional videos that will introduce you to the theory behind data analysis:
http://www.cs.waikato.ac.nz/ml/index.html
When applying the model to your scoring dataset, you will need to filter out the levels that were not present in your training data (assuming your model cannot handle unseen levels).
Alternatively, you could re-partition your data into test and training sets where all of the levels in the test set are present in the training set. The createDataPartition function from the caret package will do this for you - e.g. see here.
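For illustration, here is a rough sketch of both ideas; train, score, full_data and the column cat are placeholder names, not objects from your question:
library(caret)
# Option A: drop scoring rows whose levels were never seen in training
train_levels <- levels(factor(train$cat))
score_known  <- score[score$cat %in% train_levels, ]
score_known$cat <- factor(score_known$cat, levels = train_levels)
# Option B: stratified re-partition so every level in the holdout also appears in training
idx    <- createDataPartition(full_data$cat, p = 0.8, list = FALSE)
train2 <- full_data[idx, ]
test2  <- full_data[-idx, ]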
I wanted to try the xgboost global model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On a smaller scale it works fine (like the wmt data: 7 departments, 7 ids), but what if I would like to run it on 200,000 time series (ids)? That means step_dummy creates another 200k columns and my PC can't handle it (it can't even handle 14k ids).
I tried removing step_dummy, but then I end up with xgboost forecasting the same values for all ids.
My question is: how can I forecast 200k time series with a global xgboost model and still get proper values for each one of the 200k ids?
Or is step_dummy necessary in order to create proper forecasts for all ids?
PS: the code should be the same as the one in the link; only in my dataset there are 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (which tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables. There are a lot of other choices though. You can use an effect encoding, feature hashing, or others too.
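As one illustration (not necessarily what the linked article does), an effect encoding can be added with the embed package, which replaces the high-cardinality id factor with a single learned numeric column instead of thousands of dummies; value, id and train_tbl are placeholder names:
library(tidymodels)
library(embed)    # provides effect-encoding steps such as step_lencode_glm()
rec <- recipe(value ~ ., data = train_tbl) %>%
  step_lencode_glm(id, outcome = vars(value))   # one learned numeric score per id, instead of 200k dummy columns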
I think there is no single correct answer to the question of how to forecast 200k time series properly. Global models are the way to go here, but you need to experiment to find out which series do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of each series, for deciding which series you put inside the global model.
Also keep in mind that you can use several global models with different feature recipes.
If you want to avoid the step_dummy function, use lightgbm via the bonsai package, which is considerably faster, often more accurate, and handles categorical predictors natively, so step_dummy is not needed.
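A minimal sketch of that swap, under the same panel-data setup as the article (value, id and train_tbl are placeholder names):
library(tidymodels)
library(bonsai)    # registers the "lightgbm" engine for boost_tree()
rec_no_dummy <- recipe(value ~ ., data = train_tbl) %>%
  step_mutate(id = as.factor(id))          # keep id as a factor; no step_dummy needed
lgbm_spec <- boost_tree(trees = 500) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")
wf_fit <- workflow() %>%
  add_recipe(rec_no_dummy) %>%
  add_model(lgbm_spec) %>%
  fit(data = train_tbl)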
I have a data set called Data, with 30 scaled and centered features and one outcome column named OUTCOME, covering 700k records, stored in data.table format. I computed its PCA and observed that the first 8 components account for 95% of the variance. I want to train a random forest in h2o, so this is what I do:
Data.pca <- prcomp(Data, retx = TRUE)                                 # compute the PCA of Data
Data.rotated <- as.data.table(Data.pca$x)[, c(1:8)]                   # keep only the first 8 components
Data.dump <- cbind(Data.rotated, subset(Data, select = c(OUTCOME)))   # PCA scores plus outcomes for training
This way I have a dataset Data.dump with 8 features rotated onto the PCA components, and each record is associated with its outcome.
First question: is this a reasonable approach? Or do I have to permute the outcomes vector somehow? Or are the two things unrelated?
Then I split Data.dump into two sets, Data.train for training and Data.test for testing, both converted with as.h2o. Then I feed them to a random forest:
rf <- h2o.randomForest(training_frame = Data.train, x = 1:8, y = 9,   # 8 PCA features, column 9 is OUTCOME
                       stopping_rounds = 2, ntrees = 200,
                       score_each_iteration = TRUE, seed = 1000000)
rf.pred <- as.data.table(h2o.predict(rf, Data.test))                  # predictions on the held-out set
What happens is that rf.pred does not look very similar to the original outcomes Data.test$OUTCOME. I tried to train a neural network as well; it did not even converge and it crashed R.
Second question: is this because I am carrying over some mistake from the PCA treatment, because I set up the random forest badly, or am I just dealing with difficult data?
I do not know where to start, as I am new to data science, but the workflow seems correct to me.
Thanks a lot in advance.
The answer to your second question (i.e. "is it the data, or did I do something wrong") is hard to know. This is why you should always try to make a baseline model first, so you have an idea of how learnable the data is.
The baseline could be h2o.glm(), and/or it could be h2o.randomForest(), but either way without the PCA step. (You didn't say if you are doing a regression or a classification, i.e. if OUTCOME is a number or a factor, but both glm and random forest will work either way.)
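A rough sketch of such a baseline in h2o, reusing the names from your question (whether you need a family argument for h2o.glm depends on what OUTCOME actually is):
library(h2o)
h2o.init()
full  <- as.h2o(Data)                                    # original 30 features + OUTCOME, no PCA
parts <- h2o.splitFrame(full, ratios = 0.8, seed = 1000000)
train <- parts[[1]]; test <- parts[[2]]
features <- setdiff(colnames(train), "OUTCOME")
glm_base <- h2o.glm(x = features, y = "OUTCOME", training_frame = train)
rf_base  <- h2o.randomForest(x = features, y = "OUTCOME", training_frame = train,
                             ntrees = 200, seed = 1000000)
h2o.performance(rf_base, newdata = test)                 # baseline to compare the PCA model against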
Coming to your first question: yes, it is a reasonable thing to do, and no, you don't have to (in fact, should not) involve the outcomes vector.
Another way to answer your first question is: no, it is unreasonable. It may be that a random forest can see all the relations itself without needing you to use a PCA. Remember that when you use a PCA to reduce the number of input dimensions you are also throwing away a bit of signal. You said the 8 components only capture 95% of the variance, so you are throwing away some signal in return for having fewer inputs, which means you are trading some prediction quality for lower complexity.
By the way, concatenating the original inputs and your 8 PCA components is another approach: you might get a better model by giving it this hint about the data. (But you might not, which is why getting some baseline models first is essential before trying these more exotic ideas.)
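A sketch of that concatenation, reusing Data and Data.pca from your question:
library(data.table)
Data.augmented <- cbind(
  subset(Data, select = -OUTCOME),       # the original 30 features
  as.data.table(Data.pca$x)[, c(1:8)],   # plus the first 8 principal components
  subset(Data, select = c(OUTCOME))      # outcome column last, as before
)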
Is it best to split your data into training and test sets before doing any exploratory data analysis, or do all exploration based solely on training data?
I'm working on my first full machine learning project (a recommendation system for a course capstone project) and am looking for clarification on order of operations. My rough outline is to import and clean, do exploratory analysis, train my model, and then evaluate on a test set.
I am doing exploratory data analysis now - nothing special initially, just starting with variable distributions and whatnot. But I am not sure: should I split my data into training and test sets before or after exploratory analysis?
I don't want to potentially contaminate algorithm training by inspecting the test set. However, I also don't want to miss visual trends that might reflect real signal that my poor human eye might not see after filtering, and thus potentially miss investigating an important and relevant direction while designing my algorithm.
I checked other threads, like this, but the ones I found seem to ask more about things like regularization or actual manipulation of the original data. The answers I found were mixed but prioritized splitting first. However, I don't plan to do any actual manipulation of the data before splitting it (beyond inspecting distributions and potentially doing some factor conversions).
What do you do in your own work and why?
Thanks for helping a new programmer!
To answer this question, we should remind ourselves of why, in machine learning, we split data into training, validation and testing sets (see also this question).
Training sets are used for model development. We often carefully explore this data to get ideas for feature engineering and the general structure of the machine learning model. We then train the model using the training data set.
Usually, our goal is to generate models that will perform well not only on the training data, but also on previously unseen data. Therefore, we want to avoid models that capture the peculiarities of the data we have available now rather than the general structure of the data we will see in the future ("overfitting"). To do so, we assess the quality of the models we're training by evaluating their performance on a different set of data, the validation data, and choose the model that performs best on the validation data.
Having trained our final model, we often want an unbiased estimate of its performance. Since we have already used the validation data in the process of model development (we chose the model that performed best on the validation data), we cannot be sure that our model will perform equally well on unseen data. So, to assess model quality, we test performance using a new batch of data, the testing data.
This discussion answers your question: we should not use the testing (or validation) data set for exploratory data analysis. If we did, we would run the risk of overfitting the model to the peculiarities of the data we have, for example by engineering features that work well for the testing data. At the same time, we would lose the ability to get an unbiased estimate of our model's performance.
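As a minimal base-R illustration of that workflow (dat is a placeholder for your full data set):
set.seed(42)
idx <- sample(c("train", "valid", "test"), nrow(dat), replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_set <- dat[idx == "train", ]
valid_set <- dat[idx == "valid", ]
test_set  <- dat[idx == "test", ]
summary(train_set)   # exploratory analysis happens on train_set only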
I would take the problem the other way round: is it bad to use the test set?
The objective of modeling is to end up with a model with low variance (and small bias): that's why the test set keeps a bunch of data aside, to assess how your model behaves on new data (i.e. its variance). If you use the test set during modeling you have nothing left for that purpose, and you are overfitting your data.
The objective of EDA is to understand the data you're working with: the distributions of the features, their relationships, their dynamics, etc. If you leave your test set in the data, is there a risk of "overfitting" your understanding of the data? If that were the case, you would observe on, say, 70% of your data some properties that are not valid for the remaining 30% (the test set). Given that the split is random, that is highly unlikely, unless you have been extremely unlucky.
From my understanding, in a machine learning pipeline exploratory data analysis should be done before splitting the data into train and test sets.
Here are my reasons:
The data may not be clean to begin with. It might have missing values, mismatched datatypes and outliers.
You need to understand how every feature relates to the target variable in the dataset. This will help you understand the importance of each feature with respect to the business problem and will help you derive additional features as well.
Data visualization will also help you get insights from the dataset.
Once the above operations are done, we can split the dataset into train and test sets, because the feature distributions must be similar in both.
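For what it's worth, the basic checks described above are quick to run in base R; dat and its columns are placeholder names:
colSums(is.na(dat))        # missing values per column
str(dat)                   # datatypes, to spot mismatches
boxplot(dat$some_numeric)  # quick visual check for outliers in a (hypothetical) numeric column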
I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data takes an extremely long time (I can't run it on a subset of more than 50k observations).
I can think of two main problems that are slowing down the calculation:
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is being used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to slow down the calculation a lot is that I have a factor variable with 16 levels. Calculating which subset of the levels to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the levels into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor according to its group means (slide 20). This is my prof's recommendation, and in R it would lead me to using an ordered factor.
Finally, you need to be careful about the influence of this categorical predictor. For example, one thing you can do with the randomForest package is to set the mtry parameter to a lower number. This controls the number of variables the algorithm considers at each split; when it is set lower, your categorical predictor appears less often relative to the rest of the variables. This speeds up estimation and lets the decorrelation built into the randomForest method help ensure you don't overfit to your categorical variable. (A rough sketch of all three points follows below.)
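Here is a rough sketch of those three ideas; dat, grp, y, x1 and x2 are placeholder names, and the forcats package is just one convenient way to manipulate the factor:
library(randomForest)
library(forcats)
dat$grp_binned  <- fct_lump_n(dat$grp, n = 5)                                        # 1. collapse 16 levels into the 5 largest plus "Other"
dat$grp_ordered <- factor(fct_reorder(dat$grp, dat$y, .fun = mean), ordered = TRUE)  # 2. order levels by mean response
rf <- randomForest(y ~ grp_binned + x1 + x2, data = dat,   # x1, x2 stand in for the other predictors
                   mtry = 2, ntree = 500)                  # 3. lower mtry so grp is tried less often per split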
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for having low computational requirements.
This is the problem I am having; I hope someone can explain why.
I have a large dataset I am using to predict a categorical value (L, M, H); in the original data.frame it is a factor.
The training set is large, so I do not have enough memory to train on all of it. I took a sample of my training dataset and created a randomForest, then I created a different random sample and built a second forest, and so on. They all have similar performance, which was a concern.
I found the combine function in randomForest and decided to use it to combine my models.
I then need to use the new model to score the training set to get an OOB estimate, and then do the same with my validation sample.
I am having a problem with the prediction on the test set.
I basically get a message saying "Error in eval(expr, envir, enclos): object 'XXX' not found", where XXX is the variable name. But this makes no sense, as the variables never changed names.
I redid this a few times, in case my data got corrupted.
Any idea why am I getting this?
Without the data it is hard to know, but this is my hunch based on similar errors in the past: if you are sampling your data and running separate models, you may run into a problem with categorical variables where the factor levels in one model do not match the factor levels in another model. One way to potentially fix this is to specify the factor levels in the data frame (using the levels function) before you run each model, as in the sketch below.
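A sketch of that fix: force every sample to carry the same full set of levels before fitting, so the forests see consistent encodings when combined (full_train, cat_var and outcome are placeholder names):
library(randomForest)
all_levels <- levels(factor(full_train$cat_var))
sample1 <- full_train[sample(nrow(full_train), 50000), ]
sample1$cat_var <- factor(sample1$cat_var, levels = all_levels)
sample2 <- full_train[sample(nrow(full_train), 50000), ]
sample2$cat_var <- factor(sample2$cat_var, levels = all_levels)
rf1 <- randomForest(outcome ~ ., data = sample1, ntree = 100)
rf2 <- randomForest(outcome ~ ., data = sample2, ntree = 100)
rf_combined <- combine(rf1, rf2)   # the combined forest now uses consistent factor encodings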
Edit: one way to debug is to run two models on the same sample data, combine them, try to apply the combined model, and see if you get the same error.