how do duplicated rows effect a decision tree? - r

I am using Rpart{} to build a decision tree for a categorical variable and I am wondering whether I should use the full data set of just the set of unique rows.

I am answering this as a general question on decision trees, rather than on the R implementation.
The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, then you are implicitly putting a weight on the values in those rows.
This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.
In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.

In some ways this would depend on the data itself. Are the duplicated rows valid data? Or are they only partly duplicated but still important?
If the data were temperature measurements in a town at a given hour, maybe duplicated temperatures are important as they would weight this variable to be a more correct temperature than another lone measurement that was different.
If the data were temperature measurements that three people recorded off the same thermometer at the same time, then you would want to remove the noise from the data by reducing to just unique values.
The answer could very well be a combination of the above. If you had multiple readings that conflicted at the same time period, you would choose the most heavily weighted one, and then decide how to break ties, if all the measurements were the same you removed duplicates. In this way you are cleaning the data before you put it through an algorithm.
It all comes down to what is relevant in the data model and whether duplicated rows are of relevance to the result.

Related

Partition data while preserving groups with caret

Apologies for the cross-stack post, I wasn't sure if this is more appropriate for stackoverflow or for crossvalidated. I initiatlly posted on the latter, but realized this might be the more appropriate place.
So, I have a dataset with many rows of individuals, each with a unique indvidual ID.
For each individual, there is also a column indicating whether or not that person belongs to the same household, which is a unique householdID.
Finally, there is a Target variable, for each row, which is what I will be trying to make predictions on. Of course, there are columns with different features.
My question is--as the membership to different households is important--is there a way to partition the data into train and test sets where all the people belonging to the same household are kept together and not randomly split over both sets? (i.e., any given householdID number should not appear in both sets). But also, it is possible to split the households over both train and test sets and keep a balanced Target variable?
So, using the createDataPartition function in caret, I've managed to have a blanced Target value in both train and test when I set y = Target, and I've managed to separate the households cleanly over both train and test when I set y = unique(householdID), but I can't figure out if there's a way to get both of these results at the same time.
I'm pretty flat out of ideas, so any suggestions would be most welcome!
Thanks!
groupKFold is the way to go. But instead of using data$Target you need to split on data$householdID (or whatever your household ID column is named). This will make sure that all members of a group will be in the same fold.
After this you can use the folds in trainControl to model on data$Target.

Suggested Neural Network for small, highly varying dataset?

I am currently working with a small dataset of training values, no more than 20, and am getting large MSE. The input data vectors themselves consist of 16 parameters, many of which are binary variables. Across all the training values, a majority of the 16 parameters stay the same (but not all). The remaining input variables, across all the exemplars, vary a lot amongst one another. This is to say, two exemplars might appear to be the same except for two parameters in which they differ, one parameter being a binary variable, and another being a continuous variable, where the difference could be greater than a single standard deviation (for that variable's set of values).
My single output variable (as of now) can either be a continuous variable, OR depending on the true difficulty of reducing the error in my situation, I can make this a classification problem instead, with 12 different forms for classification.
I have long been researching different neural networks than my current implementation of a feed-forward MLP, as I have read into Stochastic NNs, Ladder NNs, and many forms of recurrent NNs. I am stuck with which one I should investigate, as I do not have time to try every NN available.
While my description may be vague, could anyone make a suggestion as to which network I should investigate to minimize my cost function (as of now, MSE) the most?
If my current setup must be rendered implacable because of the sheer difficulty involved with predicting correct output for such a small set of highly variant training values, which network would best work, should my dataset be expanded to the order of thousands of exemplars (at the cost of having a significantly more redundant, seemingly homogenous set of input values)?
Any help is most certainly appreciated.
20 samples is very small especially if you have 16 input variables. It will be hard to determine which one of those inputs is responsible for your output value. If you keep your network simple (fewer layers) you may be able to use as many samples as you would need for traditional regression.

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data is taking extremely long (I can't run on a subset of more than 50k obs).
I can think of two main problems that are slowing down the calculation
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made it's first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is getting used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor according to means (slide 20). This is my Prof's recommendation. But what it would lead me to is using an ordered factor in R
Finally, you need to be careful about the influence of this categorical predictor. For example, one thing I know that you can do with the randomForest package is to set the randomForest parameter mtry to a lower number. This controls the number of variables that the algorithm looks through for each split. When it's set lower you'll have fewer instances of your categorical predictor appear vs. the rest of the variables. This will speed up estimation times, and allow the advantage of decorrelation from the randomForest method ensure you don't overfit your categorical variable.
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for being low in computational requirement.

Tracking down the "unbalanced" in a design

I have some data from an experiment that should be balanced, but when I try
print(model.tables(aov.PDD,"means"),digits=3)
I get
Error in model.tables.aovlist(aov.PDD, "means") :
design is unbalanced so cannot proceed
I suspect this means that the coding or entry of the data was incorrect somewhere, but I'd like to be able to track this to more detail before wading into the data frame itself. How can I get more detail on what factor is producing the unbalance here?
The balance in aov refers to the number of observations per cell (combination of all factors). There are certain formulas that require balance (all the numbers are the same) and therefore give errors when there is not balance. Sometimes you don't need exactly the same numbers in all cells, but equal withing blocks of cells. You can just use the table function to count how many observations you have per cell.
Generally the parts that require balance are when you start having nested random effects, in this case it is probably better to use a mixed effects model (package nlme or lme4) which uses different techniques and does not require tha balance.

How to oversample instances in a data set in R

I have a data set with 20 classes, and it has a pretty non-uniform distribution. Is there any functionality in R that allows us to balance the data set (weighted perhaps)?
I want to use the balanced data with Weka for classification. Since my class distribution is skewed, I am hoping to get better results if there's no single majority class.
I have tried to use the SMOTE filter and Resample filter but they don't quite do what I want.
I dont want any instances to be removed, repetition is fine.
I think there's a misunderstanding in your terminology. Your question's title refers to sampling, and yet the question text involves weighting.
To clarify:
With sampling, you either have fewer, the same, or more instances than in the original set; the unique membership of a sample can be either a strict subset of the original set or can be identical to the original set (with replacement - i.e., duplicates).
By weighting, you simply adjust weights that may be used for some further purpose (e.g. sampling, machine learning) to address or impose some (im)balance relative to a uniform weighting.
I believe that you are referring to weighting, but the same answer should work in both cases. If the total # of observations is N and the frequency of each class is an element of the 20-long vector freq (e.g. the count of items in class 1 is freq[1]*N), then simply use a weight vector of 1/freq to normalize the weights. You can scale it by some constant, e.g. N, though it wouldn't matter. In case any frequency is 0 or very close to it, you might address this by using a vector of smoothed counts (e.g. Good-Turing smoothing).
As a result, each set will have an equal proportion of the total weight.

Resources