Tracking down the "unbalanced" in a design - r

I have some data from an experiment that should be balanced, but when I try
print(model.tables(aov.PDD,"means"),digits=3)
I get
Error in model.tables.aovlist(aov.PDD, "means") :
design is unbalanced so cannot proceed
I suspect this means that the coding or entry of the data was incorrect somewhere, but I'd like to be able to track this down in more detail before wading into the data frame itself. How can I get more detail on which factor is producing the imbalance here?

The balance in aov refers to the number of observations per cell (combination of all factors). Certain formulas require balance (the same count in every cell) and therefore throw errors when the design is unbalanced. Sometimes you don't need exactly the same numbers in all cells, only equal counts within blocks of cells. You can simply use the table function to count how many observations you have per cell.
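For example, a quick way to see which cells are short (PDD and the factor names A and B below are placeholders for your own data frame and factors):

# Count observations per cell of the design; any cell whose count differs
# from the others is where the imbalance comes from.
with(PDD, table(A, B))

# Equivalent formula interface; add further factors as needed.
xtabs(~ A + B, data = PDD)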
Generally, the parts that require balance appear once you have nested random effects; in that case it is probably better to use a mixed-effects model (package nlme or lme4), which uses different estimation techniques and does not require that balance.
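As a rough sketch of the mixed-model route with lme4 (the response y, the fixed factor treatment and the random grouping factor subject are placeholders; adapt them to your design):

library(lme4)
# Random intercept per subject, fixed effect of treatment;
# the estimation does not require a balanced design.
fit <- lmer(y ~ treatment + (1 | subject), data = PDD)
summary(fit)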

Related

Can `mvabund::traitglm()` handle random effects?

I am using the R package mvabund to examine how environmental conditions and species traits are correlated with ecological community structure. The traitglm() function is a nice tool for this. However, my data consist of ~40 'sites' repeatedly sampled annually over a 20-year period. I think an appropriate method would be to include 'year' as a fixed effect and 'site' as a random effect, but I do not see an option in mvabund to include random effects. Note that the environmental conditions are not expected to vary as dramatically across space as they do through time. So incorporating the temporal element is very important, but the sites of course are resampled each year. If there is no way to include random effects in mvabund, how should I handle time?

How to code a predictor in logistic regression when some values are purposefully unknown

I decided to post my question here because, strictly speaking, it has to do with coding.
The problem is as follows. In a psychological experiment involving two conditions, an independent variable - made up of numeric values - was present in one condition but not in the other. Accordingly, in one condition the variable in question provided relevant information, ranging between 0 and 20. In the other condition participants were simply not provided with such information.
Binding the data together, in the second condition - where participants were not provided with such information - I coded the variable as NA. However, when I run my logistic model, setting na.action = na.omit causes the model to fail.
In principle, the NAs in my data are not missing values; in accordance with the experimental design, they are intended to reflect the absence of this information in one of the conditions.
Therefore, it seems to me that multivariate imputation - as could be implemented with mice or other packages - is not the correct course of action. In fact, if I wanted, I could simply retrieve the values of interest, but including them in the data would be improper because, as already mentioned, participants were kept from knowing them.
Is there any strategy to code such unknown values and cope with this problem?
Any help would be much appreciated. Thank you very much!

Suggested Neural Network for small, highly varying dataset?

I am currently working with a small dataset of training values, no more than 20, and am getting large MSE. The input data vectors themselves consist of 16 parameters, many of which are binary variables. Across all the training values, a majority of the 16 parameters stay the same (but not all). The remaining input variables, across all the exemplars, vary a lot amongst one another. This is to say, two exemplars might appear to be the same except for two parameters in which they differ, one parameter being a binary variable, and another being a continuous variable, where the difference could be greater than a single standard deviation (for that variable's set of values).
My single output variable (as of now) can either be a continuous variable, OR depending on the true difficulty of reducing the error in my situation, I can make this a classification problem instead, with 12 different forms for classification.
I have long been researching different neural networks than my current implementation of a feed-forward MLP, as I have read into Stochastic NNs, Ladder NNs, and many forms of recurrent NNs. I am stuck with which one I should investigate, as I do not have time to try every NN available.
While my description may be vague, could anyone make a suggestion as to which network I should investigate to minimize my cost function (as of now, MSE) the most?
If my current setup is rendered impractical by the sheer difficulty of predicting correct output from such a small set of highly variable training values, which network would work best should my dataset be expanded to the order of thousands of exemplars (at the cost of having a significantly more redundant, seemingly homogeneous set of input values)?
Any help is most certainly appreciated.
20 samples is very small, especially if you have 16 input variables; it will be hard to determine which of those inputs is responsible for your output value. If you keep your network simple (fewer layers), you may be able to get by with roughly as many samples as you would need for traditional regression.
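As an illustration of keeping the network simple, a single small hidden layer with weight decay using the nnet package might look like the sketch below (the data frame train and the response column y are placeholders, and this is not a guarantee of lower MSE):

library(nnet)
# One hidden layer with two units; weight decay acts as regularisation,
# which matters with only ~20 training rows and 16 inputs.
fit <- nnet(y ~ ., data = train, size = 2, decay = 0.1,
            linout = TRUE, maxit = 500)  # linout = TRUE for a continuous target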

How do duplicated rows affect a decision tree?

I am using rpart to build a decision tree for a categorical variable, and I am wondering whether I should use the full data set or just the set of unique rows.
I am answering this as a general question on decision trees, rather than on the R implementation.
The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, then you are implicitly putting a weight on the values in those rows.
This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.
In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.
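To make the record-count point concrete in rpart terms: duplicating a row behaves much like giving it a case weight (a sketch; df and the response y are placeholders, and the equivalence is approximate):

library(rpart)
# Fit on the full data, duplicates included ...
fit_full <- rpart(y ~ ., data = df)

# ... versus fitting on the unique rows with the duplicate counts as weights.
counts <- aggregate(n ~ ., data = transform(df, n = 1), FUN = sum)
fit_wt  <- rpart(y ~ . - n, data = counts, weights = counts$n)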
In some ways this would depend on the data itself. Are the duplicated rows valid data? Or are they only partly duplicated but still important?
If the data were temperature measurements in a town at a given hour, maybe the duplicated temperatures are important, as they would weight that value as more representative than another lone measurement that differed.
If the data were temperature measurements that three people recorded off the same thermometer at the same time, then you would want to remove the noise from the data by reducing to just unique values.
The answer could very well be a combination of the above. If you had multiple readings that conflicted in the same time period, you would choose the most heavily weighted one and then decide how to break ties; if all the measurements were the same, you would simply remove the duplicates. In this way you are cleaning the data before you put it through an algorithm.
It all comes down to what is relevant in the data model and whether duplicated rows are of relevance to the result.
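If you do decide the repeats are pure noise, dropping them is straightforward in base R (df is a placeholder for your data frame):

# Keep one copy of each fully duplicated row.
df_unique <- unique(df)

# Or inspect which rows are repeats before deciding how to handle them.
which(duplicated(df))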

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data takes extremely long (I can't run it on a subset of more than 50k observations).
I can think of two main problems that are slowing down the calculation:
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made its first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is getting used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 levels. Calculating which subset of levels to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the levels into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor's levels according to the mean of the response (slide 20). This is my professor's recommendation; in R, it amounts to using an ordered factor (see the sketch after this list).
You also need to be careful about the influence of this categorical predictor. For example, one thing you can do with the randomForest package is set the parameter mtry to a lower number. This controls the number of variables the algorithm considers for each split; when it is set lower, your categorical predictor will appear less often relative to the other variables. This speeds up estimation, and the decorrelation built into the randomForest method helps ensure you don't overfit to your categorical variable.
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on those here. PRIM in particular is known for being low in computational requirements.
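As a sketch of the ordering idea from the second point above (assuming a data frame dat with a 16-level factor f and a numeric response y; the names are placeholders):

# Order the factor's levels by the mean response within each level, then treat
# it as ordered, so the tree considers k - 1 ordered split points instead of
# 2^(k-1) - 1 subsets of levels.
lev <- names(sort(tapply(dat$y, dat$f, mean)))
dat$f <- factor(dat$f, levels = lev, ordered = TRUE)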
