How can I split two groups as equally as possible in R?

I'm trying to do a basic A/B test, where I have to split my observations into two groups that are as equal as possible. The observations are session data from Google Analytics for different cities.
I therefore need to create a random 50/50 split between the cities, but the split has to make the two groups as equal as possible in terms of sessions. What would be my best option?
I've tried a genetic-algorithm approach to the problem, but when there's only one variable, there must be an easier and faster way, right?
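With a single balancing variable, a greedy number-partitioning heuristic often gets very close to an even split without a genetic algorithm: sort the cities by sessions and repeatedly assign the next city to whichever group currently has the smaller total. A minimal sketch; the data frame and its `city`/`sessions` columns are made up for illustration:

```r
## Toy data: city names and session counts are illustrative assumptions
set.seed(42)
cities <- data.frame(
  city = paste0("city", 1:10),
  sessions = sample(100:5000, 10)
)

## Sort by sessions (largest first) and always assign the next city
## to whichever group currently has the smaller session total
cities <- cities[order(-cities$sessions), ]
cities$group <- NA_character_
totals <- c(A = 0, B = 0)
for (i in seq_len(nrow(cities))) {
  g <- names(which.min(totals))
  cities$group[i] <- g
  totals[g] <- totals[g] + cities$sessions[i]
}

totals               # session totals per group, should be close to equal
table(cities$group)  # group sizes
```

This balances total sessions; if the two groups must also contain exactly half of the cities each, you can add that constraint to the assignment step or, for a modest number of cities, enumerate the equal-sized splits and keep the best one.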

Related

Is it possible to use the evtree package in R for panel data / over multiple years?

I would like to know whether it is possible to use evtree over multiple years.
I have an unbalanced panel data set (8 years), with two groups based on a binary dependent variable (dv). For every year, the dv value for each observation can be either 0 or 1, and thus determines group membership. I also have multiple predictor variables (pv), whose influence on dv might change over time.
evtree generally seems like the correct approach to me (at least for a single year). My goal is to train the evtree model over multiple periods (to capture possible temporal effects) in order to classify the two groups as well as possible.
Any help is highly appreciated.
Thanks in advance!
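One possible way to set this up is to pool all years and let the year enter evtree as an ordinary partitioning variable, so that temporal changes in the pv-dv relationship can surface as splits on year. A sketch with simulated data; the column names and the simulated temporal effect are assumptions, not part of the original question:

```r
library(evtree)

## Hypothetical panel: pv1/pv2 are predictors, year marks the period,
## and the effect of pv1 on the binary dv flips sign in later years
set.seed(1)
panel <- data.frame(
  pv1  = rnorm(800),
  pv2  = rnorm(800),
  year = factor(rep(2010:2017, each = 100))
)
flip <- ifelse(as.integer(as.character(panel$year)) >= 2014, -1, 1)
panel$dv <- factor(rbinom(800, 1, plogis(2 * flip * panel$pv1)))

## Pool all years and let year compete as a split variable
fit <- evtree(dv ~ pv1 + pv2 + year,
              data = panel,
              control = evtree.control(maxdepth = 3))
plot(fit)
```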

Partition data while preserving groups with caret

Apologies for the cross-stack post; I wasn't sure whether this is more appropriate for Stack Overflow or Cross Validated. I initially posted on the latter, but realized this might be the more appropriate place.
So, I have a dataset with many rows of individuals, each with a unique individual ID.
For each individual, there is also a column indicating which household that person belongs to, given as a unique householdID.
Finally, there is a Target variable for each row, which is what I will be trying to make predictions on. Of course, there are also columns with various features.
My question is: since membership in different households is important, is there a way to partition the data into train and test sets so that all the people belonging to the same household are kept together and not randomly split over both sets (i.e., any given householdID should not appear in both sets)? And is it also possible to split the households over train and test sets while keeping the Target variable balanced?
So, using the createDataPartition function in caret, I've managed to get a balanced Target in both train and test when I set y = Target, and I've managed to separate the households cleanly over train and test when I set y = unique(householdID), but I can't figure out whether there's a way to get both of these results at the same time.
I'm pretty flat out of ideas, so any suggestions would be most welcome!
Thanks!
groupKFold is the way to go. But instead of using data$Target, you need to split on data$householdID (or whatever your household ID column is named). This will make sure that all members of a household end up in the same fold.
After this you can pass the folds to trainControl and model data$Target as usual.
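A minimal sketch of how the pieces fit together; the toy data frame is made up, but the householdID and Target column names follow the question:

```r
library(caret)

## Toy data: 200 people spread over 50 households (values are illustrative)
set.seed(123)
data <- data.frame(
  householdID = sample(1:50, 200, replace = TRUE),
  x1 = rnorm(200),
  x2 = rnorm(200)
)
data$Target <- factor(ifelse(data$x1 + rnorm(200) > 0, "yes", "no"))

## Folds defined by household: all rows sharing a householdID land in the same fold
folds <- groupKFold(data$householdID, k = 5)
ctrl  <- trainControl(method = "cv", index = folds)

## Fit on Target as usual; householdID is left out of the predictors
fit <- train(Target ~ x1 + x2, data = data, method = "rpart", trControl = ctrl)
fit
```

Each element of the list returned by groupKFold contains the training rows for one resample, so any given householdID appears on only one side of each split.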

How do duplicated rows affect a decision tree?

I am using rpart to build a decision tree for a categorical variable, and I am wondering whether I should use the full data set or just the set of unique rows.
I am answering this as a general question on decision trees, rather than on the R implementation.
The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, you are implicitly putting extra weight on the values in those rows.
This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.
In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.
In some ways this would depend on the data itself. Are the duplicated rows valid data? Or are they only partly duplicated but still important?
If the data were temperature measurements in a town at a given hour, maybe duplicated temperatures are important as they would weight this variable to be a more correct temperature than another lone measurement that was different.
If the data were temperature measurements that three people recorded off the same thermometer at the same time, then you would want to remove the noise from the data by reducing to just unique values.
The answer could very well be a combination of the above. If you had multiple readings that conflicted for the same time period, you would choose the most heavily weighted one and then decide how to break ties; if all the measurements were the same, you would simply remove the duplicates. In this way you are cleaning the data before you put it through an algorithm.
It all comes down to what is relevant in the data model and whether duplicated rows are of relevance to the result.
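If the conclusion is that the duplicates are repeated information rather than independent observations, rpart's weights argument lets you make that choice explicit: collapse to unique rows and carry the duplicate count as a case weight, or drop the counts entirely. A small sketch with made-up data; splitting behaves roughly like the full data when counts are used as weights, although stopping rules based on raw row counts can differ:

```r
library(rpart)

## Toy data containing exact duplicate rows (values are illustrative)
df <- data.frame(
  x1 = c(1, 1, 1, 2, 2, 3, 3, 4),
  x2 = factor(c("a", "a", "a", "b", "b", "c", "c", "d")),
  y  = factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no"))
)

## (a) Full data: duplicated rows implicitly up-weight those observations
fit_full <- rpart(y ~ x1 + x2, data = df, minsplit = 2)

## (b) Unique rows, with the duplicate count carried as an explicit case weight
agg <- aggregate(list(w = rep(1, nrow(df))), by = df, FUN = sum)
fit_wt <- rpart(y ~ x1 + x2, data = agg, weights = w, minsplit = 2)

## (c) Unique rows, unweighted: every distinct pattern counts once
fit_uniq <- rpart(y ~ x1 + x2, data = unique(df), minsplit = 2)
```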

Complex clustering of RNA-seq data

Even though I'm working with RNA-seq data, my question is more of a statistics and machine learning nature.
So, what I have is expression data from wild types (controls) and mutants over four different conditions, where a condition is a combination of cell type (two of them) and time point. Basically, my expression matrix looks like this:
time1_loc1_control time1_loc1_mutant time1_loc2_control time1_loc2_mutant ...
gene1
gene2
...
Expression values are initially in counts-per-million, but I have tried using different transformations before clustering attempts.
What I am trying to achieve is to cluster the genes based on the direction of the change between mutant and control (upregulated or downregulated) and absolute values of expression.
So far I have been able to roughly group the genes based solely on the direction of the change, but I also need to retain the information about absolute expression values. Are there any methods that could help me with this?
One other idea was to split the dataset by condition (as the conditions are independent) and cluster the genes separately, which would mean each gene gets clustered four times. Is there any way to determine which genes end up clustered together across the four runs?
Thank you for any input on this. I realize it's a bit vague but I am not extremely experienced with this.
Kind regards,
Z
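One rough way to keep both pieces of information is to build, for each gene, a feature vector containing the per-condition log fold change between mutant and control (direction plus magnitude) together with the gene's average expression level, and to cluster on that combined matrix. A sketch under those assumptions; the matrix `expr`, the condition names, and the use of log2 CPM and k-means are all illustrative choices rather than anything from the question:

```r
## Toy stand-in for the real matrix: 200 genes x 8 samples of log2(CPM + 1);
## condition and column names are assumed from the layout shown above
set.seed(1)
conditions <- c("time1_loc1", "time1_loc2", "time2_loc1", "time2_loc2")
samples <- as.vector(outer(conditions, c("control", "mutant"), paste, sep = "_"))
expr <- matrix(rnorm(200 * length(samples), mean = 6, sd = 2),
               nrow = 200,
               dimnames = list(paste0("gene", 1:200), samples))

## Per-condition log fold change (mutant - control): the sign gives the direction,
## the magnitude the size of the change
lfc <- sapply(conditions, function(cond) {
  expr[, paste0(cond, "_mutant")] - expr[, paste0(cond, "_control")]
})

## Overall expression level, so highly and lowly expressed genes can separate
avg_expr <- rowMeans(expr)

## Combine, scale so neither block dominates the distances, then cluster
features <- scale(cbind(lfc, avg_expr))
km <- kmeans(features, centers = 6, nstart = 25)
table(km$cluster)

## For the "cluster each condition separately" idea: cross-tabulating the
## per-condition labels shows which genes travel together across conditions
km_by_cond <- lapply(conditions, function(cond) {
  kmeans(scale(cbind(lfc[, cond], avg_expr)), centers = 6, nstart = 25)$cluster
})
table(condition1 = km_by_cond[[1]], condition2 = km_by_cond[[2]])
```

The cross-tabulation at the end is one simple way to see which genes stay together when each condition is clustered on its own.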

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data takes an extremely long time (I can't run it on a subset of more than 50k observations).
I can think of two main problems that are slowing down the calculation:
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as the sample size is replenished at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is being used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 levels. Calculating which subset of the levels to split on seems to take much longer than other splits, since with 16 unordered levels there are 2^15 - 1 = 32,767 possible ways to partition them into two groups. This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the levels into a smaller number of values before putting the variable into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors that have a large number of categories. Also, I know that decision trees and random forests tend to prefer splitting on categorical predictors with a large number of categories.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor's levels according to the mean of the response within each level (slide 20). This is my professor's recommendation, and in R it amounts to using an ordered factor (see the sketch after this list).
Also, you need to be careful about the influence of this categorical predictor. For example, one thing I know you can do with the randomForest package is to set the mtry parameter to a lower number. This controls the number of variables the algorithm considers at each split; when it is set lower, your categorical predictor appears in fewer candidate splits relative to the other variables. This will speed up estimation, and the decorrelation that random forests provide helps ensure you don't overfit to your categorical variable.
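A small sketch of the reorder-by-means idea (variable names are made up); with 16 unordered levels a tree has to consider 32,767 subsets, whereas an ordered factor leaves only 15 cut points, and binning along the same ordering reduces the search even further:

```r
## Toy data: a 16-level factor and a numeric response (names are illustrative)
set.seed(1)
df <- data.frame(
  f = factor(sample(LETTERS[1:16], 1000, replace = TRUE)),
  y = rnorm(1000)
)

## (1) Order the levels by the mean response and declare the factor ordered:
##     order-preserving splits give 15 candidate cut points instead of 32,767 subsets
lvl_order <- names(sort(tapply(df$y, df$f, mean)))
df$f_ord <- factor(df$f, levels = lvl_order, ordered = TRUE)

## (2) Or collapse the ordered levels into a handful of bins
df$f_bin <- cut(as.integer(df$f_ord), breaks = 4, labels = paste0("bin", 1:4))
```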
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for having low computational requirements.
