Orthoplan in R (fractional factorial design)

I am trying to create a factorial design in R for a conjoint analysis experiment (like SPSS orthoplan).
Searching among past Stackoverflow questions, I have found this previous answer:
How to create a fractional factorial design in R?
It is indeed a useful answer, but only if your factors have numeric levels.
That's unfortunately not my case: the factors I want to use are nominal variables, i.e. their levels are not numeric but factor type. For example, I have to deal with a factor indicating the color of a product, which can be green, yellow or red.
I've tried modifying the code proposed as an answer to the question How to create a fractional factorial design in R?
in such a way:
f.design <- gen.factorial(levels.design, factors = "all")
but the result is neither balanced nor orthogonal. Moreover, you have to specify the exact number of trials in the optFederov function. In that answer the suggested number of trials was:
nTrials=sum(levels.design)
but in order to have a balanced solution in a design with nominal factors, I expect it should at least be:
nTrials=prod(unique(levels.design))
There is a package that might deal with such an issue, FrF2 by Prof. Ulrike Groemping, but it handles only dichotomous (two-level) factors and I cannot figure out how to use it to solve my problem.
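For reference, here is a minimal sketch of what I tried; the three factors and their level counts are made up, and gen.factorial and optFederov come from the AlgDesign package:

library(AlgDesign)

# hypothetical nominal factors: color (3 levels), shape (3 levels), size (2 levels)
levels.design <- c(3, 3, 2)

# full factorial with factor-type (nominal) levels
f.design <- gen.factorial(levels.design, factors = "all")

# D-optimal fraction; nTrials has to be fixed by hand
d.design <- optFederov(~ ., data = f.design, nTrials = sum(levels.design))
d.design$design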

After researching an answer by myself for a while, I can share here what I have found:
yes, you can build orthogonal designs in R, in a fashion similar to SPSS orthoplan.
Just define a vector (here levels.design) containing the number of levels of each of your factors and pass it to the nlevels argument.
Then you have to call:
library(DoE.base)
fract.design <- oa.design(nlevels=levels.design)
The function will look the design up in a library of orthogonal designs (specifically Kuhfeld, W. (2009), Orthogonal Arrays).
If there isn't a suitable orthogonal design available, the function will just return the full factorial design (and then you'll have no other choice in R but to call the optFederov function, as explained above in my question).
As an example try:
oa.design(nlevels=c(2,2,3))
oa.design(nlevels=c(2,2,4))
The first doesn't have a solution (so you'll get back the full factorial), but the second one does: an 8-run, orthogonal and balanced design.
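To make the workflow explicit, here is a minimal sketch using the second example above:

library(DoE.base)

levels.design <- c(2, 2, 4)            # number of levels of each factor

# oa.design() looks the design up in Kuhfeld's catalogue of orthogonal arrays;
# if no suitable array exists it returns the full factorial instead
fract.design <- oa.design(nlevels = levels.design)
summary(fract.design)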

Related

K-Means Distance Measure - Large Data and mixed Scales

I've a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal (Likert 0-5) variables (I think it would be okay to handle them as continuous) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach with regard to the "large" dataset? We are sticking to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use Euclidean distance on all 19 dimensions. Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in 19 dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique and hopefully make the k-means output easier to read, but you know far more about the dataset than we ever could, so our ability to help you is limited.
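A minimal sketch of that suggestion, with made-up column names; scale() standardizes, and logicals are cast to 0/1 before clustering:

set.seed(42)

# hypothetical data: one continuous, one Likert-type, one binary variable
df <- data.frame(
  income  = rnorm(100, 50000, 10000),
  likert1 = sample(0:5, 100, replace = TRUE),
  owner   = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

X  <- scale(data.frame(lapply(df, as.numeric)))   # 0/1 coding + standardization
km <- kmeans(X, centers = 4, nstart = 25)
table(km$cluster)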
You can certainly encode the binary variables as 0/1 too.
It is best practice in statistics not to treat Likert-scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense on continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and such data should be handled very differently.
Do not choose the problem to fit the hammer. Maybe your data is not a nail; and even if you would like to tackle it with k-means, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So, given your data, what is a good cluster? Until you have a measure that quantifies this, hammering away at the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem. Rather, it will only increase the data's dimensionality, an added burden. It is best practice in statistics not to convert the original data into another form, such as continuous to categorical or vice versa. If you do such a conversion, it must be in sync with the question you want to solve, and you must provide a valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues such as missing values, outliers and zero variance, and consider principal component analysis (for continuous variables), correspondence analysis (for categorical variables), etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of an analysis.
Regarding the distance measure for mixed data types: you do understand that the mean in k-means only works for continuous variables, so I do not see the logic of using k-means for mixed data types.
Consider choosing another algorithm, like k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
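A minimal sketch of k-modes, assuming the klaR package and purely categorical (factor) columns; the toy data are made up:

library(klaR)

set.seed(1)
df.cat <- data.frame(
  likert1 = factor(sample(0:5, 100, replace = TRUE)),
  likert2 = factor(sample(0:5, 100, replace = TRUE)),
  flag    = factor(sample(c("yes", "no"), 100, replace = TRUE))
)

km.modes <- kmodes(df.cat, modes = 3, iter.max = 10)
table(km.modes$cluster)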
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables by Gaussian distributions and the categorical (ordinal/binary) variables by multinomial distributions.
Moreover, missing values can be managed by the model.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/
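A minimal sketch with VarSelLCM; the toy data are made up, continuous columns are numeric, categorical columns are coded as factors, and NAs are allowed:

library(VarSelLCM)

set.seed(1)
dat <- data.frame(
  x1 = rnorm(200),                                          # continuous -> Gaussian
  x2 = factor(sample(c("yes", "no"), 200, replace = TRUE)), # binary -> multinomial
  x3 = factor(sample(0:5, 200, replace = TRUE))             # ordinal treated as categorical
)

res <- VarSelCluster(dat, gvals = 2:4)   # try 2 to 4 clusters, with variable selection
summary(res)
fitted(res)                              # cluster memberships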

How to deal with large number of factors/categories within partykit

I use the partykit package and came across the following error message:
Error in matrix(0, nrow = mi, ncol = nl) :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In matrix(0, nrow = mi, ncol = nl) :
NAs introduced by coercion to integer range
I used the example given in this article, which compares packages and their handling with a lot of categories.
The problem is that the splitting variable used has too many categories. Within the mob() function, a matrix of all possible splits is created. This matrix alone is of size p * (2^(p-1) - 1), where p is the number of categories of the splitting variable.
Depending on the available system resources (RAM etc.), the error occurs at different values of p.
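The blow-up is easy to see numerically; beyond roughly p = 33, the row count 2^(p-1) - 1 no longer even fits into R's integer range, which matches the coercion warning above:

p <- c(10, 20, 33)
2^(p - 1) - 1            # ~5.1e2, ~5.2e5, ~4.3e9 rows of the split matrix
.Machine$integer.max     # 2147483647, so the last value cannot be stored as an integer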
The article suggests the use of the Gini criterion. I think that, given the intention of the partykit package, the Gini criterion cannot be used, because I do not have a classification problem with a target variable but a model specification problem.
My question therefore: is there a way to find the split in such cases, or a way to reduce the number of splits to check?
This trick of searching just k ordered splits rather than 2^(k-1) - 1 unordered partitions only works under certain circumstances, e.g., when it is possible to order the categories by the average response value within each category. I have never looked at the underlying theory in close enough detail, but this only works under certain assumptions and I'm not sure whether these are spelled out nicely enough somewhere. You certainly need a univariate problem in the sense that only one underlying parameter (typically the mean) is optimized. Possibly continuous differentiability of the objective function might also be an issue, given the emphasis on Gini.
As mob() is probably most frequently applied in situations where you partition more than a single parameter, I don't think it is possible to exploit this trick. Similarly, ctree() can easily be applied in situations with multivariate scores, even if the response variable is univariate (e.g., for capturing location and scale differences).
I would usually recommend breaking down the factor with the many levels into smaller pieces. For example, if you have a factor for the ZIP code of an observation, then one could use a factor for state/province, a numeric variable coding the "size" (area or population), a factor coding rural vs. urban, etc. Of course, this is additional work but it typically also leads to more interpretable results.
Having said that, it is on our wish list for partykit to exploit such tricks if they are available. But it is not on the top of our current agenda...
The way I solved the problem was by transforming the variable into a contrast (dummy) matrix, using model.matrix(~ 0 + predictor, data). ctree() cannot manage long factors but can easily manage datasets with many variables.
Of course, there are drawbacks: with this technique you lose the factor-clustering feature of ctree(); each node will use just one level, since the levels are now different columns.
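A minimal sketch of that workaround; the data frame and variable names are made up:

library(partykit)

set.seed(1)
dat <- data.frame(
  y   = rnorm(500),
  x   = rnorm(500),
  zip = factor(sample(sprintf("Z%03d", 1:60), 500, replace = TRUE))
)

# one 0/1 column per level instead of a single 60-level factor
dummies <- model.matrix(~ 0 + zip, data = dat)
dat2    <- cbind(dat[c("y", "x")], as.data.frame(dummies))

fit <- ctree(y ~ ., data = dat2)   # each node now splits on a single level/column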

Dimension Reduction for Clustering in R (PCA and other methods)

Let me preface this:
I have looked extensively into this matter and I've found several intriguing possibilities to look into (such as this and this). I've also looked into principal component analysis and I've seen some sources that claim it's a poor method for dimension reduction. However, I feel it may be a good method, but am unsure how to implement it. All the sources I've found on this matter give a good explanation, but rarely do they provide any advice on how to actually go about applying one of these methods (i.e. how one can actually apply a method in R).
So, my question is: is there a clear-cut way to go about dimension reduction in R? My dataset contains both numeric and categorical variables (with multiple levels) and is quite large (~40k observations, 18 variables (but 37 if I transform categorical variables into dummies)).
A few points:
If we want to use PCA, then I would have to somehow convert my categorical variables into numeric. Would it be okay to simply use a dummy variable approach for this?
For any sort of dimension reduction for unsupervised learning, how do I treat ordinal variables? Does the concept of ordinal variables even make sense in unsupervised learning?
My real issue with PCA is that when I perform it and have my principal components, I have no idea what to actually do with them. From my knowledge, each principal component is a combination of the variables, and as such I'm not really sure how this helps us pick and choose which are the best variables.
I don't think this is an R question. This is more like a statistics question.
PCA doesn't work for categorical variables. PCA relies on decomposing the covariance matrix, which doesn't work for categorical variables.
Ordinal variables make lots of sense in supervised and unsupervised learning. What exactly are you looking for? You should only apply PCA to ordinal variables if they are not skewed and have many levels.
PCA only gives you a new transformation in terms of principal components, and their eigenvalues. It has nothing to do with dimension reduction. I repeat, it has nothing to do with dimension reduction. You reduce your data set only if you select a subset of the principal components. PCA is useful for regression, data visualisation, exploratory analysis etc.
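A minimal sketch of that point: PCA on dummy-coded data, where the reduction only happens when you keep a subset of the components (column names are made up):

set.seed(1)
dat <- data.frame(
  age    = rnorm(1000, 40, 10),
  income = rnorm(1000, 30000, 5000),
  colour = factor(sample(c("red", "green", "blue"), 1000, replace = TRUE))
)

X   <- model.matrix(~ . - 1, data = dat)      # dummy-code the factor
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                                  # cumulative proportion of variance
scores <- pca$x[, 1:3]                        # the actual "reduced" data set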
A common way is to apply optimal scaling to transform your categorical variables for PCA:
Read this:
http://www.sicotests.com/psyarticle.asp?id=159
You may also want to consider correspondence analysis for categorical variables and multiple factor analysis for both categorical and continuous.
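If you want a single method that handles numeric and categorical columns together, one option (my suggestion, not necessarily what the answer above had in mind) is factor analysis of mixed data from the FactoMineR package:

library(FactoMineR)

set.seed(1)
dat <- data.frame(
  age    = rnorm(500, 40, 10),
  income = rnorm(500, 30000, 5000),
  colour = factor(sample(c("red", "green", "blue"), 500, replace = TRUE))
)

res <- FAMD(dat, ncp = 3, graph = FALSE)
res$eig                   # eigenvalues / percentage of explained variance
head(res$ind$coord)       # observations in the reduced space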

Customized Fisher exact test in R

Beginner's question ahead!
(after spending much time, could not find straightforward solution..)
After trying all relevant posts I can't seem to find the answer, perhaps because my question is quite basic.
I want to run fisher.test on my data (whatever data, it doesn't really matter to me; mine is Rubin's Children's Television Workshop data from QR33 - http://www.stat.columbia.edu/~cook/qr33.pdf). It has to simulate a completely randomized experiment.
My assumption is that RCT in this context means that all units have the same probability of being assigned to treatment (1/N). (Of course, correct me if I'm wrong. Thanks.)
I was asked to create a customized function and my function has to include the following arguments:
Treatment observations (vector)
Control observations (vector)
A scalar representing the value, e.g., zero, of the sharp null hypothesis; and
The number of simulated experiments the function should run.
When digging into R's fisher.test I see that I can specify x, y and many other parameters, but I'm unsure about the following:
What is the meaning of y? (The documentation, "a factor object; ignored if x is a matrix", is not informative as to its statistical meaning.)
How do I specify my null hypothesis (i.e. if I don't want to use 0)? I see that there is a class "htest" with null.value, but how can I use it in the function?
Regarding the number of simulations, my plan is to run everything through a loop, which sounds expensive. Any ideas on how to write it better?
Thanks for helping - this is not an easy task I believe, hopefully will be useful for many people.
Cheers,
NB: the following explanations were found unsatisfying:
https://www.r-bloggers.com/contingency-tables-%E2%80%93-fisher%E2%80%99s-exact-test/
https://stats.stackexchange.com/questions/252234/compute-a-fisher-exact-test-in-r
https://stats.stackexchange.com/questions/133441/computing-the-power-of-fishers-exact-test-in-r
https://stats.stackexchange.com/questions/147559/fisher-exact-test-on-paired-data
It's not completely clear to me that a Fisher test is necessarily the right thing for what you're trying to do (that would be a good question for stats.SE) but I'll address the R questions.
As is explained at the start of the section on "Details", R offers two ways to specify your data.
You can either (1) supply to the argument x a contingency table of counts (omitting anything for y), or (2) supply observations on individuals as two factor vectors, x and y, that indicate the row and column categories (it doesn't matter which is which). [I'm not sure why it also doesn't let you specify x as a vector of counts and y as a data frame of factors, but it's easy enough to convert.]
With a Fisher test, the null hypothesis under which (conditionally on the margins) the observation categories become exchangeable is independence, but you can choose to make it one- or two-tailed (via the alternative argument).
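To illustrate the two input forms with a made-up 2x2 example (both calls describe the same data and give the same result):

# form 1: a contingency table of counts
tab <- matrix(c(5, 2, 1, 8), nrow = 2,
              dimnames = list(group   = c("control", "treated"),
                              outcome = c("failure", "success")))
fisher.test(x = tab)

# form 2: two factor vectors, one entry per individual
x <- factor(rep(c("control", "treated"), times = c(6, 10)))
y <- factor(c(rep("failure", 5), rep("success", 1),
              rep("failure", 2), rep("success", 8)))
fisher.test(x = x, y = y)   # internally tabulated as table(x, y), identical to 'tab'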
I'm not sure I clearly understand the simulation aspect, but I almost never use a loop for simulations (not for efficiency, but for clarity and brevity). The function replicate is very good for doing simulations. I use it roughly daily, sometimes many times a day.
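This is not the questioner's required function, but a minimal sketch of how replicate() can stand in for an explicit loop in a sharp-null (zero-effect) randomization test on two made-up vectors:

treat <- c(5.1, 6.3, 4.8, 7.0)          # hypothetical treatment observations
ctrl  <- c(4.2, 5.0, 4.9, 5.5)          # hypothetical control observations
obs.diff <- mean(treat) - mean(ctrl)

pooled <- c(treat, ctrl)
n.t    <- length(treat)

set.seed(1)
sim.diff <- replicate(10000, {
  idx <- sample(length(pooled), n.t)    # re-randomize the treatment assignment
  mean(pooled[idx]) - mean(pooled[-idx])
})

mean(abs(sim.diff) >= abs(obs.diff))    # two-sided randomization p-value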

Determine number of factors in EFA (R) using Comparison Data

I am looking for ways to determine the optimal number of factors for R's factanal function. The most used method (conduct a PCA and use a scree plot to determine the number of factors) is already known to me. I have found the method described here to be easier for non-technical folks like me. Unfortunately, the R script in which the method was implemented is no longer accessible. I was wondering if there is a package available in R that does the same?
The method was originally proposed in this study: Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure.
The R code is now moved here as per the author.
EFA.dimensions is also a nice and easy-to-use package for that.
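For completeness, a related (but different) simulation-based criterion that is easy to run is parallel analysis from the psych package; this is not the comparison-data method itself, just a commonly used alternative, shown here on placeholder data:

library(psych)

set.seed(1)
dat <- matrix(rnorm(300 * 12), ncol = 12)   # placeholder item scores for illustration
fa.parallel(dat, fa = "fa")                 # prints the suggested number of factors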
