I am new to pathway analysis. (I am working in R, but am open to trying other programs for this.)
I have built a model to analyze some ecological data, so I have some known relationships among my variables. Let's say I have a known structure among variables v1-v6, as depicted in the attached diagram. I know that an external variable, e.g. latitude, acts on v6. But I want to find out where latitude actually acts (it could be that latitude in reality affects any of v1-v5, carrying its effect over to v6 through the relationships among v1-v6).
Also, at a later stage, I'd like to do this for more external variables.
My question is: is there any way to take such known relationships into account in pathway analysis? Furthermore, a few of these known relationships are actually non-linear. I understand that non-linear relations are not easy in pathway analysis, e.g. in SEMs, but it also seems to me that the difficulty arises from testing for such non-linear relationships. Here, I would not need to test for any of them; the effect of latitude on v1-v6 is assumed to be linear.
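To illustrate what I mean, here is roughly the kind of comparison I have in mind, sketched with the lavaan package (the v1-v6 paths below are just placeholders for my actual diagram, and the data frame dat is hypothetical):

```r
library(lavaan)

# Candidate models: latitude enters the known structure at a different
# variable in each (placeholder paths; substitute the real diagram).
m_lat_on_v6 <- '
  v6 ~ v5 + latitude
  v5 ~ v4
  v4 ~ v2 + v3
  v2 ~ v1
'
m_lat_on_v3 <- '
  v6 ~ v5
  v5 ~ v4
  v4 ~ v2 + v3
  v3 ~ latitude
  v2 ~ v1
'

fit_v6 <- sem(m_lat_on_v6, data = dat)
fit_v3 <- sem(m_lat_on_v3, data = dat)

# Compare where latitude most plausibly acts, e.g. via information criteria.
AIC(fit_v6, fit_v3)
```

Something along these lines is what I'd like to do, if the approach is sound.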
Thanks for any input, appreciate to hear if anyone has dealt with a similar situation!
[Figure: pathway diagram]
I have a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal variables (Likert 0-5; I think it would be okay to handle them as continuous), and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach, with regard to the "large" dataset? We are sticking to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use Euclidean distance in the full 19-dimensional space (3 continuous + 11 ordinal + 5 binary variables). Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that a k-means solution in 19 dimensions will be hard to interpret. You could try a dimensionality reduction technique to make the k-means output easier to read, but you know far more about the data set than we ever could, so our ability to help you is limited.
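A minimal sketch of that approach, assuming a hypothetical data frame df holding the 19 variables (binary/ordinal columns possibly stored as factors):

```r
# Coerce everything to numeric: two-level factors become 0/1,
# ordinal factors become their level index.
x <- data.frame(lapply(df, function(col) {
  if (is.factor(col)) as.numeric(col) - 1 else as.numeric(col)
}))
x <- scale(x)  # normalize each feature to mean 0, sd 1

set.seed(1)
fit <- kmeans(x, centers = 4, nstart = 25)  # k = 4 is an arbitrary example
table(fit$cluster)
```

The choice of k is up to you; the usual elbow or silhouette diagnostics apply.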
You can certainly encode these binary variables as 0/1 too.
It is best practice in statistics not to treat Likert-scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense for continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and such data should be handled very differently.
Do not choose the problem by the hammer. Maybe your data is not a nail; and even if you would like to solve it with k-means, k-means won't solve your problem... Instead, formulate your problem, then choose the right tool. So, given your data, what is a good cluster? Until you have an equation that measures this, throwing an algorithm at the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem; it will only increase the data's dimensionality, an added burden. It is best practice in statistics not to convert the original data to another form (e.g. continuous to categorical or vice versa). If you do convert, the conversion must be in line with the question you are trying to answer, and you must provide a valid justification for it.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like missing values, outliers, and zero variance, and consider principal component analysis (for continuous variables), correspondence analysis (for categorical variables), etc. This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of an analysis.
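A rough sketch of those checks in R (column names are hypothetical; for the categorical side, multiple correspondence analysis is available in e.g. FactoMineR):

```r
df <- df[complete.cases(df), ]  # drop rows with missing values

# Drop zero-variance columns.
keep <- sapply(df, function(col) var(as.numeric(col)) > 0)
df <- df[, keep]

# PCA on the continuous variables only.
cont <- df[, c("x1", "x2", "x3")]  # hypothetical continuous columns
pca  <- prcomp(cont, scale. = TRUE)
summary(pca)                       # variance explained per component
```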
Regarding the distance measure for mixed data types: you do understand that the mean in k-means only works for continuous variables. So I do not understand the logic of using k-means for mixed data types.
Consider choosing another algorithm, like k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object in the data.
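A minimal sketch using the kmodes() implementation in the klaR package (df_cat is a hypothetical data frame of categorical variables; for genuinely mixed data, the k-prototypes algorithm in the clustMixType package may be the better fit):

```r
library(klaR)

set.seed(1)
fit <- kmodes(df_cat, modes = 3)  # 3 clusters, chosen arbitrarily here
fit$cluster                       # cluster assignment per row
fit$modes                         # most frequent level per cluster/variable
```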
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables with Gaussian distributions and the ordinal/binary variables with multinomial distributions.
Moreover, missing values can be handled by the model.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/
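A hedged sketch of a basic call, going from memory of that tutorial (check the link above for the current API; df_mixed is a hypothetical data frame with numeric columns for the continuous variables and factors for the ordinal/binary ones):

```r
library(VarSelLCM)

# Fit mixture models with 2-5 clusters; vbleSelec = FALSE skips the
# package's variable-selection step for a plain clustering.
fit <- VarSelCluster(df_mixed, gvals = 2:5, vbleSelec = FALSE)
summary(fit)
head(fitted(fit))  # cluster memberships
```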
For my thesis assignment I need to perform a cluster analysis on a high-dimensional data set containing purchase data from a retail store (1000+ dimensions). Because traditional clustering algorithms are not well suited to high dimensions (and dimension reduction is not really an option), I would like to try algorithms specifically developed for high-dimensional data (e.g. ProClus).
Here, however, my problem starts.
I have no clue what value I should use for the parameter d. Can anyone help me?
This is just one of the many limitations of ProClus.
The parameter is the average dimensionality of your clusters. It assumes there is a linear cluster somewhere in your data, which likely will not hold for purchase data, but you can try. For sparse data such as purchases, I would rather focus on frequent itemset mining.
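If you want to try the frequent-itemset route in R, a minimal sketch with the arules package (assuming purchases is a list of character vectors, one vector of items per customer):

```r
library(arules)

trans    <- as(purchases, "transactions")
itemsets <- apriori(trans,
                    parameter = list(supp = 0.01,
                                     target = "frequent itemsets"))
inspect(head(sort(itemsets, by = "support"), 10))
```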
There is no universal clustering algorithm. Any clustering algorithm will come with a variety of parameters that you need to experiment with.
For cluster analysis it is essential that you somehow can visualize or analyze the result, to be able to find out if and how well the method worked.
background
i have some private survey data that contains a column of confidential information: the geographic location of the survey respondents. under no circumstances can this information be released.
as is common in survey research, in order for users to correctly calculate a variance on my survey data set, those users will either need that geographic location (unacceptable) or, alternatively, a set of replicate weights. i can create that set of replicate weights; however, it's quite easy to look at the correlations between those weights and back-calculate which of the survey respondents share the same geographic location. that is also unacceptable.
to help me with this question, you don't have to be familiar with replicate weights -- just think of them as a few columns of strongly-correlated clustered data.
i understand that if i want to maintain that clustering, an evil data user will always have semi-decent guesses at who shares geographic locations; i just want to make that guessing game less precise. on the un-obfuscated replicate weights, an evil data user can figure out 100% of the cases.
request
i am looking for a technique that
prevents the public use file users from easily deducing the shared geographic location from the correlations between my replicate weights variables
does not obliterate the correlations between my columns of data (the replicate weights variables)
can be implemented on an R data.frame object without a major time investment
i say shared because the evil user might not know where the location is, but they might know if two survey respondents are from the same location -- an unacceptable possibility.
what i have tried
i don't really want to re-invent the wheel here. i am looking for r syntax, an r package, or anything else that would be relatively straightforward to implement. i've found one, two, three, four papers describing techniques that would all be suitable for my purposes; unfortunately, none of the authors have been willing to share actual code to implement them.
i can do simple things like adding and subtracting random values to my replicate weights columns according to a normal distribution, but i'd prefer to rely on the work of someone who understands privacy issues better than i do.
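for concreteness, the naive version i mean looks something like this (rw is my matrix of replicate weights; this is not a vetted privacy technique, just the baseline i'd like to improve on):

```r
# add column-wise gaussian noise scaled to 10% of each column's sd.
set.seed(1)
rw_obf <- apply(rw, 2, function(col) col + rnorm(length(col), sd = 0.1 * sd(col)))

# correlations are attenuated but not obliterated.
round(cor(rw)[1:3, 1:3], 2)
round(cor(rw_obf)[1:3, 1:3], 2)
```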
thanks!!!!
i have written this nine-step tutorial to walk through the process in an attempt to answer my own question. i am not an expert in the field of privacy/confidentiality and would love to hear both feedback about this idea and also other ideas. thanks!
http://www.asdfree.com/2014/09/how-to-provide-variance-calculation-on.html
I am aiming to better predict the buying habits of a company's customer base from several customer attributes (demographics, past purchase categories, etc.). I have a data set of about 100,000 returning customers, including the time interval since their last purchase (the dependent variable in this study) along with several attributes (both continuous and categorical).
I plan on doing a survival analysis on each segment (segments being defined as having similar time intervals across observations) to help understand likely time intervals between purchases. The problem I am encountering is how best to define these segments, i.e. groupings of attributes such that the time interval is sufficiently different between segments and similar within segments. I believe that building a decision tree is the best way to do this, presumably using recursive partitioning.
I am new to R and have poked around with the party package's mob command; however, I am confused about which variables to include in the model and which to use for partitioning (command: mob(y ~ x1 + ... + xk | z1 + ... + zk), x being model variables and z being partitioning variables). I simply want to build a tree from the set of attributes, so I suppose I want to partition on all of them? Not sure. I have also tried the rpart command, but I either get no tree or a tree with hundreds of thousands of nodes, depending on the cp level.
If anyone has any suggestions, I'd appreciate it. Sorry for the novel and thanks for the help.
From the documentation at ?mob:
MOB is an algorithm for model-based recursive partitioning yielding a
tree with fitted models associated with each terminal node.
It's asking for model variables because it will fit a model (e.g. linear, logistic) at every terminal node after splitting on the partitioning variables. If you want to partition without fitting models to the terminal nodes, the function I've used is ctree (also in the party package).
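A minimal sketch contrasting the two calls (customers, interval, and a1-a3 are hypothetical names for your data frame, response, and attributes):

```r
library(party)

# ctree: partition on all attributes, no model in the terminal nodes;
# closest to "just build a tree from the attributes".
tree <- ctree(interval ~ a1 + a2 + a3, data = customers)
plot(tree)

# mob: fit a linear model of interval on a1 within each node, partitioning
# on the remaining attributes (linearModel comes from the modeltools
# package, which party loads).
fit <- mob(interval ~ a1 | a2 + a3, data = customers, model = linearModel)
```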
I'm playing around with a genetic algorithm in which I want to evolve graphs.
Do you know a way to apply crossover and mutation when the chromosomes are graphs?
Or am I missing an encoding for the graphs that would let me apply "regular" crossover and mutation over bit strings?
thanks a lot!
Any help, even if it is not directly related to my problem, is appreciated!
Manuel
I like Sandor's suggestion of using Ken Stanley's NEAT algorithm.
NEAT was designed to evolve neural networks with arbitrary topologies, but those are basically just directed graphs. There were many ways to evolve neural networks before NEAT, but one of NEAT's most important contributions was that it provided a way to perform meaningful crossover between two networks that have different topologies.
To accomplish this, NEAT uses historical markings attached to each gene to "line up" the genes of two genomes during crossover (a process biologists call synapsis). For example:
[Figure: two genomes lined up by historical markings (source: natekohl.net)]
(In this example, each gene is a box and represents a connection between two nodes. The number at the top of each gene is the historical marking for that gene.)
In summary: Lining up genes based on historical markings is a principled way to perform crossover between two networks without expensive topological analysis.
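For concreteness, a toy sketch of that alignment idea (in R, to match the rest of this thread; this is only an illustration, not Stanley's implementation; genomes are data frames of connection genes keyed by an innovation number):

```r
cross_neat <- function(a, b) {
  matching <- intersect(a$innov, b$innov)
  # Matching genes: inherit each one at random from either parent.
  from_a <- sample(c(TRUE, FALSE), length(matching), replace = TRUE)
  child  <- rbind(a[a$innov %in% matching[from_a], ],
                  b[b$innov %in% matching[!from_a], ])
  # Disjoint/excess genes: taken from the fitter parent (here assumed to be a).
  rbind(child, a[!(a$innov %in% matching), ])
}

set.seed(1)
a <- data.frame(innov = c(1, 2, 3, 5), weight = rnorm(4))
b <- data.frame(innov = c(1, 2, 4),    weight = rnorm(3))
cross_neat(a, b)
```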
You might as well try genetic programming. A graph is the closest thing to a tree, and GP uses trees... if you still want to use GAs instead of GP, then take a look at how crossover is performed in GP, and that might give you an idea of how to perform it on the graphs of your GA:
[Figure: subtree crossover in genetic programming (source: geneticprogramming.com)]
Here is how crossover for trees (and graphs) works:
You select 2 specimens for mating.
You pick a random node (and its subtree) from one parent and swap it with a random node (and its subtree) in the other parent.
The resulting trees are the offspring.
As others have mentioned, one common way to cross graphs (or trees) in a GA is to swap subgraphs (subtrees). For mutation, just randomly change some of the nodes (with small probability).
Alternatively, if you are representing a graph as an adjacency matrix, then you might swap/mutate elements in the matrices (kind of like using a two-dimensional bit string).
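A minimal sketch of the adjacency-matrix variant (illustrative names; uniform crossover on the matrix entries, plus bit-flip mutation):

```r
random_graph <- function(n, p = 0.3) {
  matrix(rbinom(n * n, 1, p), n, n)  # random directed graph, 0/1 entries
}

crossover_adj <- function(A, B) {
  mask <- matrix(rbinom(length(A), 1, 0.5), nrow(A))  # entry-wise parent choice
  ifelse(mask == 1, A, B)
}

mutate_adj <- function(A, rate = 0.02) {
  flip <- matrix(rbinom(length(A), 1, rate), nrow(A))
  abs(A - flip)  # XOR: flip the selected 0/1 entries
}

set.seed(1)
child <- mutate_adj(crossover_adj(random_graph(6), random_graph(6)))
```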
I'm not sure using a bit string is the best idea; I'd rather represent at least the weights with real values. Nevertheless, bit strings may also work.
If you have a fixed topology, then both crossover and mutation are quite easy (assuming you only evolve the weights of the network):
Crossover: take some weights from one parent and the rest from the other; this can be done very easily if you represent the weights as an array or list (see the sketch after this list). For more details or alternatives see http://en.wikipedia.org/wiki/Crossover_%28genetic_algorithm%29.
Mutation: simply select some of the weights and adjust them slightly.
Evolving some other stuff (e.g. activation function) is pretty similar to these.
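A minimal sketch of this fixed-topology case (illustrative names; one-point crossover on a weight vector, plus small Gaussian mutation):

```r
crossover_w <- function(w1, w2) {
  cut <- sample(2:length(w1), 1)               # random crossover point
  c(w1[1:(cut - 1)], w2[cut:length(w2)])
}

mutate_w <- function(w, rate = 0.1, sd = 0.05) {
  hit    <- runif(length(w)) < rate            # which weights to adjust
  w[hit] <- w[hit] + rnorm(sum(hit), sd = sd)  # slight Gaussian perturbation
  w
}

set.seed(1)
child <- mutate_w(crossover_w(rnorm(10), rnorm(10)))
```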
If you also want to evolve the topology, then things become much more interesting. There are quite a few additional mutation possibilities, like adding a node (most likely connected to two already-existing nodes), splitting a connection (instead of A->B, have A->C->B), adding a connection, or the opposites of these.
But crossover will not be easy (at least if the number of nodes is not fixed), because you will probably want to find "matching" nodes (where "matching" can mean anything, but is likely related to a similar "role" or a similar place in the network). If you want to do that, I'd highly recommend studying existing techniques. One that I know and like is called NEAT. You can find some info about it at
http://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_topologies
http://nn.cs.utexas.edu/?neat
and http://www.cs.ucf.edu/~kstanley/neat.html
Well, I have never played with such an implementation, but for crossover you could pick a branch of one of the graphs and swap it with a branch from another graph.
For mutation you could randomly change a node inside the graph, with small probability.