How do I represent domain knowledge information with bnlearn - r

I am learning about Dynamic Bayesian Network models using the R package bnlearn. To this end, I am following this paper where they impose certain constraints in the form of 6 layers (Table 1 in the paper):
1 Gender, age at ALS onset
2 Onset site, onset delta (start of the trial - onset)
3 Riluzole intake, placebo/treatment
4 Variables at time t-1
5 Variables at time t, TSO
6 Survival
In this example, since gender and age are in the top layer, they cannot be influenced by Riluzole intake, but they can influence (or have a causal connection to) Riluzole intake and, ultimately, survival. This guarantees acyclicity in the network, that is, there are no never-ending feedback loops among the variables.
My question is: how can we model such prior knowledge using the R package bnlearn?

You can add domain knowledge or constraints to structure learning in a few ways.
If you want to specify both the network structure and the parameters from domain knowledge, you can define the structure by hand (e.g. with model2network) and set the parameters with custom.fit.
If you want to estimate the structure of the BN from data, you can impose constraints on edge direction and edge presence using the whitelist and blacklist arguments of the structure learning algorithms.
A prior can be placed on the edges in structure learning when a Bayesian score is used (e.g. prior = "cs", where, quoting the documentation, "If prior is cs, beta is a data frame with columns from, to and prob specifying the prior probability for a set of arcs. A uniform probability distribution is assumed for the remaining arcs."). Other priors are available as well.
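For the layered constraints in the question, the blacklist route can be sketched as follows. This is a minimal sketch, not the paper's actual setup: the node names and the simulated Gaussian data are made up for illustration, and hc is just one of the structure learning algorithms that accept a blacklist. tiers2blacklist builds a blacklist that forbids every arc pointing from a later layer back to an earlier one; arcs within a layer remain allowed.

```r
library(bnlearn)

# Hypothetical node names, one character vector per layer (earliest layer first).
tiers <- list(
  c("Gender", "AgeOnset"),
  c("OnsetSite", "OnsetDelta"),
  c("Riluzole", "Treatment"),
  c("ALSFRS_prev"),            # stand-in for "variables at time t-1"
  c("ALSFRS_now", "TSO"),      # stand-in for "variables at time t, TSO"
  c("Survival")
)

# Forbid every arc that goes from a later layer to an earlier layer.
bl <- tiers2blacklist(tiers)

# Simulated Gaussian data, for illustration only.
set.seed(42)
nodes <- unlist(tiers)
dat <- as.data.frame(matrix(rnorm(500 * length(nodes)), ncol = length(nodes)))
names(dat) <- nodes
dat$Survival <- 0.8 * dat$AgeOnset + rnorm(500)  # built-in dependence

# Score-based structure learning that respects the layering.
dag <- hc(dat, blacklist = bl)
arcs(dag)
```

Any arc the algorithm learns between two layers can then only point from the earlier layer to the later one. A whitelist can be passed in the same way if some arcs must be present.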

Related

Should I use Friedman test or Mixed Model for my data in R? Nested or not?

My response variable is the proportion of range exposed to extreme events for terrestrial mammal species in the future. More precisely, it is the Difference of Proportion of Range Exposed (DPRE) between the historical period and future greenhouse gas emission scenarios (a measure of the increase/decrease in the percentage of range exposed). This means that my response variable goes from -1 to 1 (where +1 implies that the range will experience a +100% increase in the proportion of exposure: from 0% in the historical period to 100% in the future scenario).
As said, I am analyzing these differences for all terrestrial mammals (5311 species), across different scenarios and for two time periods: near future (means of 2021-2040) and far future (means of 2081-2100).
So, my explanatory variables are:
3 scenarios of greenhouse gas emissions (Representative Concentration Pathways: RCP2.6, RCP4.5 and RCP8.5);
Time Periods (Near Future and Far Future): NF and FF;
Species: 5311 individuals.
I am not an expert in statistics, so I'm not sure which of the two suggestions I received to follow:
Friedman test with Species as blocks (but in which I should somehow do a nested model, with RCPs as groups, nested within TimePeriods; or a sort of two-way Friedman, with RCP and TimePeriod as the two different factors).
Linear Mixed Models with RCP*TimePeriod as fixed effects, and (TimePeriod | Species) as random effects.
I ran a t-test, and none of the distributions turned out to be normal, which is why it was suggested that I use Friedman instead of ANOVA. I also ran pairwise Wilcoxon rank sum tests, and found significant differences between NF and FF for all RCPs.
I should say that I ran 3 Wilcoxon tests, one for each RCP, so maybe a third option would be to create 3 different models, one for each RCP, but this would also depart from the standard "repeated measures" analysis of the Friedman test.
One last consideration: I have to run another model, where the response variable is the Difference of Proportion of Subrange Exposed. The other explanatory variables are retained, but the analysis is no longer global: it takes into consideration the differences that could be present across 14 IUCN biomes. So every analysis is made across RCPs, for NF and FF, and for all biomes. Should I create and run 14 (biomes) x 3 (RCPs) x 2 (time periods) = 84 models in this case, or a sort of doubly nested (time periods and biomes) model?
If necessary I can provide the large dataframe.

How to determine the correct mixed effects structure in a binomial GLMM (lme4)?

Could someone help me to determine the correct random variable structure in my binomial GLMM in lme4?
I will first try to explain my data as best as I can. I have binomial data of seedlings that were eaten (1) or not eaten (0), together with data of vegetation cover. I try to figure out if there is a relationship between vegetation cover and the probability of a tree being eaten, as the other vegetation is a food source that could attract herbivores to a certain forest patch.
The data has been collected in ~90 plots scattered over a National Park for 9 years now. Some plots were measured in all years, some only in a few years (destroyed/newly added plots). The original dataset is split in 2 (deciduous vs coniferous), each part containing ~55.000 entries. Per plot about 100 saplings were measured every time, so the two separate datasets probably contain about 50 trees per plot (though this will not always be the case, since the decid:conif ratio is not always equal). Each plot consists of 4 subplots.
I am aware that there might be spatial autocorrelation due to plot placement, but we will not correct for this, yet.
Every year the vegetation is surveyed in the same period. Vegetation cover is estimated at plot-level, individual trees (binary) are measured at a subplot-level.
All trees are measured, so the amount of responses per subplot will differ between subplots and years, as the forest naturally regenerates.
Unfortunately, I cannot share my original data, but I tried to create an example that captures the essentials:
#set seed for whole procedure
addTaskCallback(function(...) {set.seed(453);TRUE})
# Generate vector containing individual vegetation covers (in %)
cover1vec <- c(sample(0:100,10, replace = TRUE)) #the ',number' is amount of covers generated
# Create dataset
DT <- data.frame(
eaten = sample(c(0,1), 80, replace = TRUE),
plot = as.factor(rep(c(1:5), each = 16)),
subplot = as.factor(rep(c(1:4), each = 2)),
year = as.factor(rep(c(2012,2013), each = 8)),
cover1 = rep(cover1vec, each = 8)
)
Which will generate this dataset:
>DT
eaten plot subplot year cover1
1 0 1 1 2012 4
2 0 1 1 2012 4
3 1 1 2 2012 4
4 1 1 2 2012 4
5 0 1 3 2012 4
6 1 1 3 2012 4
7 0 1 4 2012 4
8 1 1 4 2012 4
9 1 1 1 2013 77
10 0 1 1 2013 77
11 0 1 2 2013 77
12 1 1 2 2013 77
13 1 1 3 2013 77
14 0 1 3 2013 77
15 1 1 4 2013 77
16 0 1 4 2013 77
17 0 2 1 2012 46
18 0 2 1 2012 46
19 0 2 2 2012 46
20 1 2 2 2012 46
....etc....
80 0 5 4 2013 82
Note1: to clarify again, in this example the number of responses is the same for every subplot:year combination, making the data balanced, which is not the case in the original dataset.
Note2: this example can not be run in a GLMM, as I get a singularity warning and all my random effect measurements are zero. Apparently my example is not appropriate to actually use (because using sample() caused the 0 and 1 to be in too even amounts to have large enough effects?).
As you can see from the example, cover data is the same for every plot:year combination.
Plots are measured multiple years (only 2012 and 2013 in the example), so there are repeated measures.
Additionally, a year effect is likely, given the fact that we have e.g. drier/wetter years.
First I thought about the following model structure:
library(lme4)
mod1 <- glmer(eaten ~ cover1 + (1 | year) + (1 | plot), data = DT, family = binomial)
summary(mod1)
Where (1 | year) should correct for differences between years and (1 | plot) should correct for the repeated measures.
But then I started thinking: all trees measured in plot 1, during year 2012 will be more similar to each other than when they are compared with (partially the same) trees from plot 1, during year 2013.
So, I doubt that this random model structure will correct for this within plot temporal effect.
So my best guess is to add another random variable, where this "interaction" is accounted for.
I know of two ways to possibly achieve this:
Method 1.
Adding the random variable " + (1 | year:plot)"
Method 2.
Adding the random variable " + (1 | year/plot)"
From what other people told me, I still do not know the difference between the two.
I saw that Method 2 added an extra random variable (year.1) compared to Method 1, but I do not know how to interpret that extra random variable.
As an example, I added the Random effects summary using Method 2 (zeros due to singularity issues with my example data):
Random effects:
Groups Name Variance Std.Dev.
plot.year (Intercept) 0 0
plot (Intercept) 0 0
year (Intercept) 0 0
year.1 (Intercept) 0 0
Number of obs: 80, groups: plot:year, 10; plot, 5; year, 2
Can someone explain to me the actual difference between Method 1 and Method 2?
I am trying to understand what is happening, but cannot grasp it.
I already tried to get advice from a colleague and he mentioned that it is likely more appropriate to use cbind(success, failure) per plot:year combination.
Via this site I found that cbind is used in binomial models when Ntrials > 1, which I think is indeed the case given our sampling procedure.
I wonder, if cbind is already used on a plot:year combination, whether I need to add a plot:year random variable?
When using cbind, the example data would look like this:
>DT3
plot year cover1 Eaten_suc Eaten_fail
8 1 2012 4 4 4
16 1 2013 77 4 4
24 2 2012 46 2 6
32 2 2013 26 6 2
40 3 2012 91 2 6
48 3 2013 40 3 5
56 4 2012 61 5 3
64 4 2013 19 2 6
72 5 2012 19 5 3
80 5 2013 82 2 6
What would be the correct random model structure and why?
I was thinking about:
Possibility A
mod4 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot),
data = DT3, family = binomial)
Possibility B
mod5 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year:plot),
data = DT3, family = binomial)
But doesn't cbind(success, failure) already correct for the year:plot dependence?
Possibility C
mod6 <- glmer(cbind(Eaten_suc, Eaten_fail) ~ cover1 + (1 | year) + (1 | plot) + (1 | year/plot),
data = DT3, family = binomial)
As I do not yet understand the difference between year:plot and year/plot
Thus: Is it indeed more appropriate to use the cbind-method than the raw binary data? And what random model structure would be necessary to prevent pseudoreplication and other dependencies?
Thank you in advance for your time and input!
EDIT 7/12/20: I added some extra information about the original data
You are asking quite a few questions in your question. I'll try to cover them all, but I do suggest reading the documentation and vignette from lme4 and the glmmFAQ page for more information. Also I'd highly recommend searching for these topics on google scholar, as they are fairly well covered.
I'll start somewhere simple
Note 2 (why is my model singular?)
Your model is singular because the way you simulate your data builds in no dependency at all. To simulate from a binomial model you would set up the linear predictor eta = X %*% beta, with g(p) = eta for the link function g, and obtain the probability of success as p = g^-1(eta). One can then use this probability to simulate the binary outcome. This is thus a two-step process: first, using some known X, or an X simulated from a distribution of our choosing, compute p; second, use rbinom to simulate the binary outcome so that it depends on X.
In your example you simulate X and y independently, so the probability of success does not depend on X (nor on plot or year). Thus the probability of success is p = c for all subgroups, for some constant c, and there is no between-group variance for the random effects to pick up.
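The two-step simulation described above could look roughly like this in base R. The numbers of plots, years and trees, the coefficients and the random effect standard deviations are all assumptions chosen for illustration.

```r
set.seed(453)

n_plot <- 30; n_year <- 6; n_tree <- 20   # assumed design sizes

# Design: every tree in every plot in every year.
d <- expand.grid(tree = seq_len(n_tree),
                 plot = factor(seq_len(n_plot)),
                 year = factor(2011 + seq_len(n_year)))

# Vegetation cover is constant within each plot:year combination.
py <- interaction(d$plot, d$year, drop = TRUE)
d$cover1 <- runif(nlevels(py), 0, 100)[as.integer(py)]

# Step 1: linear predictor = fixed part + random intercepts (logit scale).
beta   <- c(-1.5, 0.03)              # assumed intercept and slope
u_plot <- rnorm(n_plot, sd = 0.7)    # assumed plot effects
u_year <- rnorm(n_year, sd = 0.4)    # assumed year effects
eta <- beta[1] + beta[2] * d$cover1 +
  u_plot[as.integer(d$plot)] + u_year[as.integer(d$year)]

# Step 2: inverse link gives p, then draw the binary outcome from p.
p <- plogis(eta)
d$eaten <- rbinom(nrow(d), size = 1, prob = p)
```

Because eaten now depends on cover1, plot and year, a GLMM fitted to d has real variance components to estimate, unlike data drawn with sample(c(0, 1), ...).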
Can someone explain me the actual difference between Method 1 and Method 2? ((1| year:plot) vs (1|year/plot))
This is explained in the package vignette fitting linear mixed effects models with lme4 in the table on page 7.
(1|year/plot) indicates two random intercepts: one for year and one for plot nested within year.
(1|year:plot) indicates a single random intercept for plot nested within year, i.e. the main effect of year is not included. It is somewhat similar to fitting a model without an intercept (although less drastic, and the interpretation is not destroyed).
It is more common to see the first than the second, and the first can be written in terms of the second: (1|year/plot) is shorthand for (1|year) + (1|year:plot). This is also why the output for Method 2 lists both year and year.1: the model already contained (1|year), and the expansion of (1|year/plot) adds a second, redundant year term.
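To see which grouping factor each term refers to, it can help to count the levels in base R. The toy layout below (5 plots, 2 years, 8 trees per plot and year) is an assumption that mirrors the example data in the question.

```r
# Toy layout: 5 plots, each surveyed in 2 years, 8 trees per plot and year.
DT <- expand.grid(tree = 1:8,
                  year = factor(c(2012, 2013)),
                  plot = factor(1:5))

# (1 | year)      -> one intercept per level of year
# (1 | plot)      -> one intercept per level of plot
# (1 | year:plot) -> one intercept per observed year-by-plot combination
year_plot <- interaction(DT$year, DT$plot, drop = TRUE)

nlevels(DT$year)    # 2
nlevels(DT$plot)    # 5
nlevels(year_plot)  # 10
```

These counts match the "groups: plot:year, 10; plot, 5; year, 2" line in the model summary from the question.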
Thus: Is it indeed more appropriate to use the cbind-method than the raw binary data?
cbind in a formula is used for binomial data, i.e. aggregated binary data given as counts of successes and failures (similar to how we'd use it in glm), or for multivariate analysis. For binary data we use the raw vector of 0/1 indicating failure/success. If you are not interested in a random or fixed effect of subplot, you might be able to aggregate your data to the plot level, and then cbind would likely make sense. Otherwise stay with your 0/1 outcome vector.
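Going from the raw 0/1 rows to the success/failure counts that cbind expects can be sketched with aggregate. The data below are simulated stand-ins; the column names Eaten_suc and Eaten_fail follow the question.

```r
set.seed(1)

# Simulated binary data: 8 trees per plot and year (illustration only).
DT <- expand.grid(tree = 1:8,
                  year = factor(c(2012, 2013)),
                  plot = factor(1:5))
DT$eaten <- rbinom(nrow(DT), size = 1, prob = 0.4)

# One row per plot:year with counts of eaten (success) and not eaten (failure).
DT3 <- aggregate(cbind(Eaten_suc = eaten, Eaten_fail = 1 - eaten) ~ plot + year,
                 data = DT, FUN = sum)
DT3
```

DT3 can then be used with a cbind(Eaten_suc, Eaten_fail) response. Note that aggregating to plot:year discards the subplot information, which is why it only makes sense if subplot is not part of the model.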
What would be the correct random model structure and why?
This is a topic that is extremely hard to give a definitive answer to, and one that is still actively researched. Depending on your statistical paradigm opinions differ greatly.
Method 1: The classic approach
Classic mixed modelling is based on knowledge of the data you are working with. There are several "rules of thumb" for choosing the random effects; I've gone through a few in my answer here. In general, if you are "not interested" in the systematic effect of a variable and its levels can be thought of as a random sample from some population, then it can be a random effect. If the levels are the whole population, i.e. they would not change if the process were repeated, then it likely shouldn't be.
This approach often yields "decent" choices for those who are new to mixed effect models, but is highly criticized by authors who tend towards methods similar to those we'd use in non-mixed models (eg. visualizing to base our choice and testing for significance).
Method 2: Using visualization
If you are able to split your data into independent subgroups while keeping the fixed effect structure, a reasonable approach for checking potential random effects is to estimate marginal models (e.g. using glm) within each subgroup and see whether the fixed effect estimates look "normally distributed" across subgroups. The function lmList (in lme4) is designed for this specific approach. In linear models we would indeed expect these to be normally distributed, and thus we get an indication of whether a specific grouping "might" be a valid random effect structure. I believe the same is approximately true for generalized linear models, but I lack references. I know that Ben Bolker has advocated this approach in an earlier article of his (the first reference below) that I used during my thesis. However, this is only valid for strictly separable data, and the implementation is not robust when factor levels are not shared across all groups.
So in short: If you have the right data, this approach is simple, fast and seemingly highly reliable.
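A base R version of the per-group fits that lmList automates might look like the sketch below. It assumes simulated data in which cover varies within each plot (otherwise a per-plot slope could not be estimated); all sizes and effects are made up.

```r
set.seed(2)

n_plot <- 25; n_obs <- 200   # assumed sizes
d <- data.frame(plot   = factor(rep(seq_len(n_plot), each = n_obs)),
                cover1 = runif(n_plot * n_obs, 0, 100))
u_plot <- rnorm(n_plot, sd = 0.8)   # assumed plot-level intercept shifts
d$eaten <- rbinom(nrow(d), 1,
                  plogis(-1 + 0.02 * d$cover1 + u_plot[as.integer(d$plot)]))

# Fit one glm per plot and collect the coefficients (one row per plot).
coefs <- t(sapply(split(d, d$plot), function(g) {
  coef(glm(eaten ~ cover1, data = g, family = binomial))
}))

# If plot is a sensible random intercept, the per-plot intercepts should
# look roughly normal around a common mean.
qqnorm(coefs[, "(Intercept)"])
qqline(coefs[, "(Intercept)"])
```

A clear spread in the per-plot intercepts with a fairly tight cluster of slopes would point towards a random intercept for plot rather than a random slope.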
Method 3: Fitting maximal/minimal models and decreasing/expanding model based on AIC or AICc (or p-value tests or alternative metrics)
Finally, an alternative is to use a "step-wise"-like procedure. There are advocates of starting with maximal models and of starting with minimal models (I'm certain at least one of my references below talks about problems with both, otherwise check glmmFAQ) and then testing your random effects for their validity. Just as in classic regression, this is somewhat of a double-edged sword. The reason is both extremely simple to understand and amazingly complex to comprehend.
For this method to be successful you would have to perform cross-validation or out-of-sample validation to avoid selection bias, just as with standard models. Unlike with standard models, however, sampling becomes complicated because:
The fixed effects are conditional on the random structure.
You will need your training and testing samples to be independent.
As this depends on your random structure, and the structure is chosen in a step-wise approach, it is hard to avoid information leakage in some of your models.
The only certain way to avoid problems here is to define the space that you will be testing and to select samples based on the most restrictive model definition.
Next we also have problems with the choice of metric for evaluation. If one is interested in the random effects it makes sense to use the conditional AIC (cAIC, the AIC of the conditional model), while for fixed effects it might make more sense to optimize the marginal AIC (the AIC of the marginal model). I'd suggest checking the references on marginal and conditional AIC on glmmFAQ, and being wary, since the large-sample results for these may be uncertain outside a very restrictive set of mixed models (namely those with "enough independent samples over random effects").
Another approach here is to use p-values instead of some metric for the procedure. But one should likely be even more wary of tests on random effects. Even with a Bayesian approach or bootstrapping with an incredibly high number of resamples, these are sometimes just not very good. Again we need "enough independent samples over random effects" to ensure the accuracy.
The DHARMa package provides some very interesting testing methods for mixed effects models that might be better suited. While I was working in the area, the author was (seemingly) still developing an article documenting the validity of the chosen method. Even if one does not use it for initial selection, I can only recommend checking it out and deciding whether one believes in its methods. It is by far the simplest approach for a visual test with a simple interpretation (i.e. almost no prior knowledge is needed to interpret the plots).
A final note on this method would thus be: it is indeed an approach, but one I would personally not recommend. It requires either extreme care or the author accepting ignorance of the model assumptions.
Conclusion
Choosing a random effect structure is difficult. In my experience, a combination of methods 1 and 2 is used most often, while method 3 seems to be used mostly by newer authors, who tend to ignore out-of-sample error (measuring model metrics on the data used for training), ignore the independence-of-samples problem when fitting random effects, or restrict themselves to using this method only for testing fixed effect parameters. All 3 do however have some validity. I myself tend to be in the first group, and base my decisions on my "experience" within the field, rules of thumb and the restrictions of my data.
Your specific problem.
Given your specific problem I would assume that a random effect structure of (1|year/plot/subplot) is the correct one. If you add autoregressive (spatio-temporal) effects, year likely disappears. The reason for this structure is that in geo-analysis and the analysis of land plots the classic approach is to include an effect for each plot. If each plot can be further divided into subplots, it is natural to think of "subplot" as nested in "plot". Assuming you do not model autoregressive effects, I would treat time as random for the reasons you already stated: some years will be drier and hotter than others. As the plots measured have to be present in a given year, these would be nested in year.
This is what I'd call the maximal model, and it might not be feasible depending on the amount of data you have. In that case I would try (1|year) + (1|plot/subplot). If both are feasible I would compare these models, using either bootstrapping methods or approximate LRT tests.
Note: It seems not unlikely that (1|year/plot/subplot) would result in "individual-level effects", i.e. one random effect per row in your data. For reasons that I have long since forgotten (but once read), it is not plausible to have individual (also called subject-level) effects in binary mixed models. In this case it might also make sense to use the alternative approach, or to test whether your model assumptions still hold when subplot is left out of the random effects.
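As a rough sketch of fitting the two structures discussed above with lme4: the data are simulated stand-ins (all sizes, coefficients and variances are assumptions), the time variable is called year as in the question, and the cover variable is rescaled only to help the optimizer. Singular-fit or convergence messages are quite possible on toy data like this and are part of what you would be checking.

```r
library(lme4)
set.seed(453)

# Simulated stand-in: 20 plots x 4 subplots x 6 years x 5 trees.
d <- expand.grid(tree = 1:5, subplot = factor(1:4),
                 plot = factor(1:20), year = factor(2012:2017))
py <- interaction(d$plot, d$year, drop = TRUE)
d$cover1 <- runif(nlevels(py), 0, 100)[as.integer(py)]  # constant per plot:year
eta <- -1 + 0.02 * d$cover1 +
  rnorm(20, sd = 0.8)[as.integer(d$plot)] +   # assumed plot effects
  rnorm(6,  sd = 0.5)[as.integer(d$year)]     # assumed year effects
d$eaten   <- rbinom(nrow(d), 1, plogis(eta))
d$cover_s <- as.numeric(scale(d$cover1))

# Maximal structure: subplot nested in plot nested in year.
m_max <- glmer(eaten ~ cover_s + (1 | year/plot/subplot),
               data = d, family = binomial)

# Fallback: year crossed with plot, subplot nested in plot.
m_alt <- glmer(eaten ~ cover_s + (1 | year) + (1 | plot/subplot),
               data = d, family = binomial)

# Inspect the estimated variance components side by side.
VarCorr(m_max)
VarCorr(m_alt)
```

Variance components estimated at (or very near) zero are a hint that the corresponding term is not supported by the data. The bootstrap or LRT comparison itself is not shown here.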
Below I've added some useful references, some of which are directly relevant to the question. In addition check out the glmmFAQ site by Ben Bolker and more.
References
Bolker, B. et al. (2009). "Generalized linear mixed models: a practical guide for ecology and evolution". In: Trends in ecology & evolution 24.3, p. 127–135.
Bolker, B. et al. (2011). "GLMMs in action: gene-by-environment interaction in total fruit production of wild populations of Arabidopsis thaliana". In: Revised version, part 1 1, p. 127–135.
Eager, C. and J. Roy (2017). "Mixed effects models are sometimes terrible". In: arXiv preprint arXiv:1701.04858. url: https://arxiv.org/abs/1701.04858 (last seen 19.09.2019).
Feng, Cindy et al. (2017). "Randomized quantile residuals: an omnibus model diagnostic tool with unified reference distribution". In: arXiv preprint arXiv:1708.08527. (last seen 19.09.2019).
Gelman, A. and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Hartig, F. (2019). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.2.4. url: http://florianhartig.github.io/DHARMa/ (last seen 19.09.2019).
Lee, Y. and J. A. Nelder (2004). "Conditional and Marginal Models: Another View". In: Statistical Science 19.2, p. 219–238. doi: 10.1214/088342304000000305. url: https://doi.org/10.1214/088342304000000305
Lin, D. Y. et al. (2002). "Model-checking techniques based on cumulative residuals". In: Biometrics 58.1, p. 1–12. (last seen 19.09.2019).
Lin, X. (1997). "Variance Component Testing in Generalised Linear Models with Random Effects". In: Biometrika 84.2, p. 309–326. issn: 00063444. url: http://www.jstor.org/stable/2337459 (last seen 19.09.2019).
Stiratelli, R. et al. (1984). "Random-effects models for serial observations with binary response". In: Biometrics, p. 961–971.

Unsupervised Supervised Clusters with NAs and Qualitative Data in R

I have basketball player data that looks like the following:
Player Weight Height Shots School
A NA 70 23 AB
B 130 62 10 AB
C 180 66 NA BC
D 157 65 22 CD
and I want to do unsupervised and supervised (based on height) clustering. Looking into online resources, I found that I can use kmeans for unsupervised clustering, but I don't know how to handle NAs without losing a good amount of data. I also don't know how to handle the qualitative variable "School". Are there any ways to resolve both issues for unsupervised and supervised clustering?
K-means cannot be used for categorical data. One workaround would be to use numeric data about the schools instead, such as the number of enrollments or local SES data.
kmeans() in R cannot handle NAs, so you could either omit them (and you should check that the NAs are distributed fairly evenly among the other factors) or look into using cluster::clara() from the cluster package.
You have not asked anything specific about supervised learning, so I cannot address that part of the question.
The problem you are facing is known as missing data, and you have to decide how to deal with it before you start clustering. In most cases the samples with missing data (NAs here) are simply omitted. That happens during the data preparation and cleaning step of data mining. In R you can do it with the following code:
na.omit(yourdata)
It omits the records (rows) that contain NAs.
But if you want to include them in the clustering process, you can replace each missing value with the mean of that feature over the observed samples.
In your case, consider weight:
for player A you can set his weight to (130+180+157)/3 ≈ 155.7.
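The mean imputation described above can be sketched in base R with the four players from the question. Standardizing the columns before kmeans and using 2 clusters are assumptions added here, since only four rows are available.

```r
players <- data.frame(
  Player = c("A", "B", "C", "D"),
  Weight = c(NA, 130, 180, 157),
  Height = c(70, 62, 66, 65),
  Shots  = c(23, 10, NA, 22),
  School = c("AB", "AB", "BC", "CD")
)

num_cols <- c("Weight", "Height", "Shots")

# Replace each NA by the mean of the observed values in that column.
imputed <- players
for (col in num_cols) {
  miss <- is.na(imputed[[col]])
  imputed[[col]][miss] <- mean(imputed[[col]], na.rm = TRUE)
}

imputed$Weight[1]  # (130 + 180 + 157) / 3

# All four players can now be clustered; scale() puts the columns on a
# comparable footing first.
set.seed(1)
km <- kmeans(scale(imputed[num_cols]), centers = 2)
km$cluster
```

Mean imputation keeps every row but shrinks the variance of the imputed columns, so it is worth comparing the result with the na.omit version.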
As for your other question: it seems you are a little confused about the meaning of supervised and unsupervised learning. In supervised learning you need to define the class label of each sample. You then build a model (a classifier) and train it to learn about each class; after training you can use the model to predict the label of a test sample. For example, you give it a player with the values (W=100, H=190, shots=55) and it gives you the predicted class label.
For unsupervised learning you just cluster the data to find groups of related samples. You do not need a class label for this, but you do need to choose the features on which the samples are clustered. For example, you can cluster players based only on their weight, or only on their height, or you can use the height, weight and shots features together. This is possible in R using the following code:
clus <- kmeans(na.omit(data$Weight), 5) # cluster into 5 groups based on weight only
clus <- kmeans(na.omit(data[, c("Weight", "Height", "Shots")]), 5) # cluster into 5 groups based on weight, height and shots
Note the use of na.omit here, which removes every row that has an NA in any of the selected columns.
Let me know if this helps.

Spark ML Logistic Regression with Categorical Features Returns Incorrect Model

I've been doing a head-to-head comparison of Spark 1.6.2 ML's LogisticRegression with R's glmnet package (the closest analog I could find based on other forum posts).
I'm specifically looking at these two fitting packages when using categorical features. When using continuous features, results for the two packages are comparable.
For my first attempt with Spark, I used the ML Pipeline API to transform my single 21-level categorical variable (called FAC for faculty) with StringIndexer followed by OneHotEncoder to get a binary vector representation.
When I fit my models in Spark and R, I get the following sets of results (that aren't even close):
SPARK 1.6.2 ML
lrModel.intercept
-3.1453838659926427
lrModel.weights
[0.37664264958084287,0.697784342445422,0.4269429071484017,0.3521764371898419,0.19233585960734872,0.6708049751689226,0.49342372792676115,0.5471489576300356,0.37650628365008465,1.0447861554914701,0.5371820187662734,0.4556833133252492,0.2873530144304645,0.09916227313130375,0.1378469333986134,0.20412095883272838,0.4494641670133712,0.4499625784826652,0.489912016708041,0.5433020878341336]
R (glmnet)
(Intercept) -2.79255253
facG -0.35292166
facU -0.16058275
facN 0.69187146
facY -0.06555711
facA 0.09655696
facI 0.02374558
facK -0.25373146
facX 0.31791765
facM 0.14054251
facC 0.02362977
facT 0.07407357
facP 0.09709607
facE 0.10282076
facH -0.21501281
facQ 0.19044412
facW 0.18432837
facF 0.34494177
facO 0.13707197
facV -0.14871580
facS 0.19431703
I've manually checked the glmnet results and they're correct (calculating the proportion of training samples with a particular level of the categorical feature and comparing that to the softmax prob. under the estimated model). These results do not change even when the max. no. of iterations in the optimization is set to 1000000 and the convergence tolerance is set to 1E-15. These results also do not change when the Spark LogisticRegression weights are initialized to the glmnet-estimated weights (Spark's optimizing a different cost function?).
I should say that the optimization problem should not differ between these two approaches: in both cases you minimize the logistic loss (a convex surface) and should therefore arrive at nearly the same answer.
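The manual check described above (comparing the per-level incidence in the training data with the probability implied by the fitted coefficients) can be sketched in base R. This uses glm as a stand-in for an unregularized fit, with a made-up 5-level factor and made-up incidence rates; it is not the glmnet or Spark code from the question.

```r
set.seed(7)

lev  <- LETTERS[1:5]
rate <- c(A = 0.05, B = 0.08, C = 0.03, D = 0.10, E = 0.06)  # assumed rates
fac  <- factor(sample(lev, 5000, replace = TRUE), levels = lev)
y    <- rbinom(length(fac), 1, rate[as.character(fac)])

# Unregularized logistic regression on a single categorical feature.
fit <- glm(y ~ fac, family = binomial)
b   <- coef(fit)

# Level "A" is the reference level and is absorbed into the intercept;
# every other level adds its own coefficient on the logit scale.
p_model <- plogis(c(b[["(Intercept)"]], b[["(Intercept)"]] + b[-1]))
names(p_model) <- lev

# Observed incidence per level in the training data.
p_data <- tapply(y, fac, mean)

cbind(p_model, p_data)
```

With no regularization the two columns should agree up to numerical tolerance, whichever level is used as the reference. Running the same comparison on the Spark coefficients, with the reference level identified correctly, is one way to see whether the fit or the level-to-coefficient mapping is off.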
Now, when I manually recode the FAC feature as a binary vector in the data file and read those binary columns as "DoubleType" (using Spark's DataFrame schema), I get very comparable results. (The order of the coefficients for the following results is different from the above results. Also the reference levels are different--"B" in the first case, "A" in the second--so the coefficients for this test should not match those from the above test.)
SPARK 1.6.2 ML
lrModel.intercept
-2.9530485080391378
lrModel.weights
[-0.19233467682265934,0.8524505857034615,0.09501714540028124,0.25712829253044844,0.18430675058702053,0.09317325898819705,0.4784688407322236,0.3010877381053835,0.18417033887042242,0.2346069926274015,0.2576267066227656,0.2633474197307803,0.05448893119304087,0.35096612444193326,0.3448460751810199,0.505448794876487,0.29757609104571175,0.011785058030487976,0.3548130904832268,0.15984047288368383]
R (glmnet)
s0
(Intercept) -2.9419468179
FAC_B -0.2045928975
FAC_C 0.8402716731
FAC_E 0.0828962518
FAC_F 0.2450427806
FAC_G 0.1723424956
FAC_H -0.1051037449
FAC_I 0.4666239456
FAC_K 0.2893153021
FAC_M 0.1724536240
FAC_N 0.2229762780
FAC_O 0.2460295934
FAC_P 0.2517981380
FAC_Q -0.0660069035
FAC_S 0.3394729194
FAC_T 0.3334048723
FAC_U 0.4941379563
FAC_V 0.2863010635
FAC_W 0.0005482422
FAC_X 0.3436361348
FAC_Y 0.1487405173
Standardization is set to FALSE for both and no regularization is performed (you shouldn't perform it here, since you're really just learning the incidence rate of each level of the feature, and the binary feature columns are completely uncorrelated with one another). Also, I should mention that the 21 levels of the categorical feature range in incidence counts from ~800 to ~3500 (so the discrepancy is not due to a lack of data causing large errors in the estimates).
Anyone experience this? I'm one step away from reporting this to the Spark guys.
Thanks as always for the help.

cluster ordinal data

I want to cluster my data (kmeans or hclust) in R. My data is ordinal: it uses a Likert scale to measure the causes of cost escalation. I have 41 causes (the variables), each scored from 1 (no effect) to 5 (major effect), and about 160 observations (the respondents who rank the causes). Any help on how to cluster the 41 causes based on the observations would be appreciated. Do I have to convert the scale to percentages or z-scores before clustering, or do anything else? Here is the data to play with: https://docs.google.com/spreadsheet/ccc?key=0AlrR2eXjV8nXdGtLdlYzVk01cE96Rzg2NzRpbEZjUFE&usp=sharing
I want to cluster the variables (the columns) in terms of similarity of occurrence in observations. I followed the code at statmethods.net/advstats/cluster.html, but I couldn't cluster the variables that way. I also followed the work at mattpeeples.net/kmeans.html#help, but I don't know why he converts the data to percentages and then standardizes to z-scores.
It isn't clear to me if you want to cluster the rows (the observations) in terms of similarity in the variables, or cluster the variables (the columns) in terms of similarity of occurrence in observations?
Anyway, see package cluster. This is a recommended package that comes with all R installations.
Read ?daisy for details of what is done with ordinal data. This metric can be used in functions such as agnes (for hierarchical clustering) or pam (for partitioning about medoids, a more robust version of k-means).
By default, these will cluster the rows/observations. Simply transpose the data object using t() if you want to cluster the columns (variables). Although that may well mess up the data depending on how you have stored them.
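A minimal sketch of both directions with the cluster package, using simulated Likert scores in place of the linked spreadsheet. The choice of 3 and 4 clusters, the average linkage and the Euclidean distance between causes are all assumptions made for illustration.

```r
library(cluster)
set.seed(10)

# Simulated stand-in: 160 respondents x 41 causes, scored 1..5.
m <- matrix(sample(1:5, 160 * 41, replace = TRUE), nrow = 160,
            dimnames = list(NULL, paste0("cause", 1:41)))

# Clustering the rows (respondents): declare each column an ordered factor
# so that daisy treats it as ordinal, then partition around medoids.
likert  <- as.data.frame(lapply(as.data.frame(m), factor,
                                levels = 1:5, ordered = TRUE))
d_obs   <- daisy(likert, metric = "gower")
pam_obs <- pam(d_obs, k = 3)
table(pam_obs$clustering)

# Clustering the columns (causes): transpose the numeric scores so that
# each cause becomes a row, then cluster hierarchically.
d_var  <- daisy(t(m), metric = "euclidean")
ag_var <- agnes(d_var, method = "average")
cause_groups <- cutree(as.hclust(ag_var), k = 4)
table(cause_groups)
```

Transposing the numeric matrix rather than the data frame of factors avoids the storage problem mentioned above, at the cost of treating the scores as interval-scaled when comparing causes.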
Converting the data to percentages is a form of normalization, so that all the variables are in the range 0 - 1.
If the data are not normalized, you run the risk of bias towards dimensions with large values.
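For illustration, here are two common rescalings in base R on a small made-up data frame (the column names and values are hypothetical): min-max scaling to the range 0 - 1, and z-score standardization with scale().

```r
# Made-up data with two columns on very different scales.
x <- data.frame(cost   = c(1200, 300, 4500, 800),
                rating = c(1, 5, 3, 4))

# Min-max scaling: every column ends up in the range 0 - 1.
minmax <- as.data.frame(lapply(x, function(v) (v - min(v)) / (max(v) - min(v))))

# Z-score standardization: every column gets mean 0 and standard deviation 1.
z <- scale(x)

# Without rescaling, distances are dominated by the cost column.
dist(x)
dist(minmax)
```

If all 41 variables share the same 1 to 5 scale, as in the question, they are already comparable and this step matters much less.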
