How to merge nodes in a decision tree in R

I would like to use a decision tree as proposed by Yan et al. (2004), Adaptive Testing With Regression Trees in the Presence of Multidimensionality (https://journals.sagepub.com/doi/epdf/10.3102/10769986029003293), which produces trees like the one shown in the paper.
It also seems that a similar (the same?) idea was proposed more recently under the name "decision stream": https://arxiv.org/pdf/1704.07657.pdf
The goal is to perform the usual splitting as in CART or a similar decision tree at each node but then merge nodes where the difference in the target variable between nodes is smaller than some prespecified value.
I didn't find a package that can do this (I think SAS might be able to, and there is a decision stream implementation in Clojure). I looked into the partykit package with the hope that I might change it a bit to produce this behavior, but there are two main issues: the tree is constructed recursively, so nodes don't know about other nodes at the same level, and the tree representation does not allow pointing to nodes already in the tree, so a node can only have one parent, whereas I would need more. I was also thinking about repeatedly fitting tree stumps, merging nodes, and repeating, but I don't know how I would make predictions with something like that.
Edit: some example code
set.seed(1)
n_subjects <- 100
n_items <- 4

# simulate binary item responses and a weighted-sum outcome
responses <- matrix(rep(c(0, 1), times = (n_subjects / 2) * n_items),
                    ncol = n_items)
responses <- as.data.frame(apply(responses, 2, function(x) sample(x)))
weights <- c(20, 20, 20, 10)
responses$outcome <- rowSums(responses[, 1:n_items] * weights)

library(rpart)   # rpart() lives here; rattle only provides the plotting
library(rattle)
tree <- rpart(outcome ~ ., data = responses)
fancyRpartPlot(tree, tweak = 1.2, sub = '')
Outcome:
I would expect nodes 5 and 6 to be merged, because their outcome values are basically the same (34 and 35), before splitting continues from the merged node.
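For reference, a minimal sketch (not the method from the paper) of how one might flag mergeable leaves from the fitted rpart object above: it compares the fitted outcome value (yval) of every pair of leaves against a hypothetical threshold. Predicting from a merged tree is still the open part of the question.
frame    <- tree$frame
leaves   <- frame[frame$var == "<leaf>", ]
leaf_ids <- as.integer(rownames(leaves))      # rpart node numbers
threshold <- 2                                # hypothetical merge threshold on the outcome
pairs <- which(
  abs(outer(leaves$yval, leaves$yval, "-")) < threshold &
    outer(leaf_ids, leaf_ids, "<"),           # keep each unordered pair once
  arr.ind = TRUE
)
data.frame(
  node_a = leaf_ids[pairs[, 1]],
  node_b = leaf_ids[pairs[, 2]],
  yval_a = leaves$yval[pairs[, 1]],
  yval_b = leaves$yval[pairs[, 2]]
)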

Related

Generating multiple random graphs in R with same number of nodes and ties?

I would like to generate a large number of random graphs with the same number of nodes and ties, and use the result to find the distributions of the standard metrics.
I found this link for generating random graphs with a given number of nodes and ties (Graph generation given number of edges and nodes). Is there an easy way to tell R to do this 1000 times or so and combine the results into one object that I can then analyze (for things like average distance, degree, diameter, etc.)?
Ultimately I want to be able to use this information for comparison with an empirical network.
I got this answer from a friend, and it appears to work:
# requires the igraph package for erdos.renyi.game() and gsize()
library(igraph)
RR1 <- list()
for (i in 1:1000) {
  RR1[[i]] <- erdos.renyi.game(559, 9640, type = "gnm",
                               directed = FALSE, loops = FALSE)
}
number_of_edges <- sapply(RR1, gsize)
number_of_edges
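One way to go from there, sketched under the assumption that igraph is loaded as above: compute each metric over the whole list with sapply() and look at its distribution, which can then be compared with the empirical network.
# distributions of standard metrics across the 1000 simulated graphs
avg_distance <- sapply(RR1, mean_distance)                 # average path length
diameters    <- sapply(RR1, diameter)
mean_degree  <- sapply(RR1, function(g) mean(degree(g)))
summary(avg_distance)
hist(avg_distance, main = "Average path length, 1000 random graphs")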

Understanding R subspace package clustering output

Somewhat related to this question.
I am using the R subspace package for subspace clustering. As in the question above, I have not managed to use the generic plotting method to plot my resulting clusters in a way native to the package. The next step is to understand the output of the command
CLIQUE(df, xi = 40, tau = 0.2)
That looks something like this:
I understand that "object" is the row number of the clustered unit and that "subspace" indicates the dimensions of the data in which the clustering was done. However, I don't see how the clusters in the given dimensions can be distinguished.
The documentation does not contain information on the output. Ideally, my goal is to plot all the clusters with something like ggplot2, or in 3D, what have you. For that I need to know which units are in which clusters in the corresponding dimensions.
Additionally, I checked whether any two members of the output list have the same subspace (dimensions), like this:
# clique_model is the output of the CLIQUE() call above
cluster_result <- clique_model
equalities_matrix <- matrix(
  0L, nrow = length(cluster_result), ncol = length(cluster_result)
)
# mark which pairs of clusters share exactly the same subspace
for (i in 1:length(cluster_result)) {
  for (j in 1:length(cluster_result)) {
    equalities_matrix[i, j] <- (
      all(cluster_result[[i]]$subspace == cluster_result[[j]]$subspace)
    )
  }
}
sum(equalities_matrix)
The answer is no.
So, here is what my research led to; it might be helpful for someone in the future.
Above, the CLIQUE algorithm output one cluster per set of dimensions, possibly by virtue of the data or the tuning of the algorithm. I added more features to the data, ran it again, and again checked whether any two members of the output list have the same subspace. This time several subspaces were indeed the same, as sum(equalities_matrix) yielded a number larger than the number of features.
In conclusion, the output of the algorithm is a list of lists where each member list represents one cluster in one subspace:
subspace ... the dimensions making the subspace indicated with TRUE,
objects ... members of the cluster.
If there is more than one cluster in a given subspace, there will be several member lists with the same subspace but different objects.
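A hedged sketch of how the members could then be pulled into a flat data frame for ggplot2, assuming clique_model is the CLIQUE() output above and df is the data it was run on (only the first two dimensions of each subspace are plotted):
library(ggplot2)
cluster_df <- do.call(rbind, lapply(seq_along(clique_model), function(k) {
  cl   <- clique_model[[k]]
  dims <- which(cl$subspace)            # TRUE positions = dimensions of this subspace
  data.frame(cluster  = factor(k),
             subspace = paste(dims, collapse = ","),
             x = df[cl$objects, dims[1]],
             y = if (length(dims) > 1) df[cl$objects, dims[2]] else 0)
}))
ggplot(cluster_df, aes(x, y, colour = cluster)) +
  geom_point() +
  facet_wrap(~ subspace)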
Here are the papers that helped me understand the theory:
Parsons, Haque, and Liu (2004)
Agrawal, Gehrke, Gunopulos, and Raghavan (1998)

R and SPSS: Different results for Hierarchical Cluster Analysis

I'm performing a hierarchical cluster analysis using Ward's method on a dataset containing 1,000 observations and 37 variables (all 5-point Likert scales).
First, I ran the analysis in SPSS via
CLUSTER Var01 to Var37
/METHOD WARD
/MEASURE=SEUCLID
/ID=ID
/PRINT CLUSTER(2,10) SCHEDULE
/PLOT DENDROGRAM
/SAVE CLUSTER(2,10).
FREQUENCIES CLU2_1.
I additionally performed the analysis in R:
datA <- subset(dat, select = Var01:Var37)
dist <- dist(datA, method = "euclidean")
hc <- hclust(d = dist, method = "ward.D2")
table(cutree(hc, k = 2))
The resulting cluster sizes are:

      Cluster 1  Cluster 2
SPSS        712        288
R           610        390
These results are obviously confusing to me, as they differ substantially (which becomes highly visible in the dendrograms; the same applies to the 3- to 10-cluster solutions). "ward.D2" takes the squared distance into account, if I'm not mistaken, so I passed the plain Euclidean distance matrix here. However, I tried several combinations of distance and clustering methods, e.g. EUCLID instead of SEUCLID, squaring the distance matrix in R, applying the "ward.D" method, and so on. I also looked at the distance matrices generated by SPSS and R, which are identical (when applying the same method). Finally, I excluded duplicate cases (N = 29) from my data, guessing that those might have caused differences by being allocated (randomly) at some point. None of this resulted in matching outputs from R and SPSS.
I also tried running the analysis with the agnes() function from the cluster package, which again yielded different results compared to both SPSS and hclust() (but that's a topic for another post, I guess).
Are the underlying clustering procedures that different between the programs/packages? Or did I overlook a crucial detail? Is there a "correct" procedure that replicates the results yielded in SPSS?
If the distance matrices are identical and the merging methods are identical, the only thing that should create different outcomes is tied distances being handled differently by the two algorithms. Ties may be present in the original full distance matrix, or may arise during the joining process. If one program picks the first of two or more distances tied at the minimum value at a given step while the other picks the last one (or one or both pick at random among the ties), the results can diverge from that step onward.
I'd suggest starting with a small example, adding some randomness to the values so that tied distances are unlikely, and checking whether the two programs produce matching results on those data. If not, there's a deeper problem; if so, tie handling is likely the issue.
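A small sketch of that suggestion, using hypothetical toy data: add tiny random noise so ties are essentially impossible, cluster in R, and export the same jittered data for the SPSS syntax above.
set.seed(42)
small     <- matrix(sample(1:5, 20 * 5, replace = TRUE), nrow = 20)  # toy Likert-style data
small_jit <- small + matrix(runif(length(small), 0, 1e-6), nrow = nrow(small))  # break ties
hc_small  <- hclust(dist(small_jit, method = "euclidean"), method = "ward.D2")
cutree(hc_small, k = 2)
write.csv(as.data.frame(small_jit), "small_jit.csv", row.names = FALSE)  # same data for the SPSS run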

LDA - How do I normalise and "add the smoothing constant" to raw document-topic allocation counts?

Context
Thanks in advance for your help. I have run a dataset through the LDA function in Jonathan Chang's 'lda' package (N.B. this is different from the 'topicmodels' package). Below is a replicable example, which uses the cora dataset that comes with the 'lda' package.
library(lda)
data(cora.documents) #list of words contained in each of the 2,410 documents
data(cora.vocab) #vocabulary list of words that occur at least once across all documents
Thereafter, I conduct the actual LDA by setting the different parameters and running the code.
# parameters for the LDA algorithm
K <- 20       # number of topics to model from the corpus
G <- 1000     # number of Gibbs iterations; the higher, the more likely the sampler converges
alpha <- 0.1  # prior on the document-topic distributions
beta <- 0.1   # prior on the topic-term distributions (called "eta" in the package)

# fit the LDA model with the parameters above
lda_fit <- lda.collapsed.gibbs.sampler(cora.documents, K = K, vocab = cora.vocab,
                                       num.iterations = G, alpha = alpha, eta = beta)
We then examine one component of the LDA output, called document_sums. This component gives, for each document, the number of its words allocated to each of the 20 topics (based on the K value I chose). For instance, one document may have 4 words allocated to Topic 3 and 12 words allocated to Topic 19, in which case it would be assigned to Topic 19.
#gives raw figures for no. of times each document (column) had words allocated to each of 20 topics (rows)
document_sums <- as.data.frame(lda_fit$document_sums)
document_sums[1:20, 1:20]
Question
However, what I want to do is essentially use the principle of fuzzy membership. Instead of allocating each document to the topic containing most of its words, I want to extract the probability that each document belongs to each topic. document_sums is quite close to this, but I still have to do some processing on the raw counts.
Jonathan Chang, the creator of the 'lda' package, himself says this in this thread:
n.b. If you want to convert the matrix to probabilities just row normalize and add the smoothing constant from your prior. The function here just returns the raw number of assignments in the last Gibbs sampling sweep.
Separately, another reply on another forum reaffirms this:
The resulting document_sums will give you the (unnormalized) distribution over topics for the test documents. Normalize them, and compute the inner product, weighted by the RTM coefficients to get the predicted link probability (or use predictive.link.probability)
And thus my question is: how do I normalise document_sums and 'add the smoothing constant'? These are the steps I am unsure of.
As asked: you need to add the prior to the matrix of counts and then divide each row by its total. For example:
# document_sums is K x D (topics in rows, documents in columns), so transpose
# it first so that each row is a document, then add the prior and row-normalize
theta <- t(document_sums) + alpha
theta <- theta / rowSums(theta)
You'll need to do something similar for the matrix of counts relating words to topics.
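For that word-topic side, a minimal sketch under the same assumptions as above: lda_fit$topics holds the K x V word-to-topic assignment counts, so the recipe is again to add the prior and normalize each topic (row).
# lda_fit$topics: counts of how often each word (column) was assigned to each topic (row)
phi <- lda_fit$topics + beta   # beta was used as the eta prior above
phi <- phi / rowSums(phi)      # each row (topic) now sums to one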
However if you're using LDA, may I suggest you check out textmineR? It does this normalization (and other useful things) for you. I originally wrote it as a wrapper for the 'lda' package, but have since implemented my own Gibbs sampler to enable other features. Details on using it for topic modeling are in the third vignette

Low-pass filtering of a matrix

I'm trying to write a low-pass filter in R, to clean a "dirty" data matrix.
I did a Google search and came up with a dazzling range of packages. Some apply to 1D signals (mostly time series, e.g. How do I run a high pass or low pass filter on data points in R?); some apply to images. However, I'm trying to filter a plain R data matrix. The image filters are the closest equivalent, but I'm a bit reluctant to go that way, as they typically involve (i) installing more or less complex/heavy dependencies (ImageMagick, ...) and/or (ii) converting from matrix to image.
Here is sample data:
r <- seq(0:360) / 360 * (2 * pi)   # 361 angle values
x <- cos(r)
y <- sin(r)
z <- outer(x, y, "*")              # smooth 361 x 361 surface
noise <- 0.3 * matrix(runif(length(x) * length(y)), nrow = length(x))
zz <- z + noise                    # noisy version to be cleaned
image(zz)
What I'm looking for is a filter that will return a "cleaned" matrix (i.e. something close to z, in this case).
I'm aware this is a rather open-ended question, and I'm also happy with pointers ("have you looked at package so-and-so"), although of course I'd value sample code from users with experience on signal processing !
Thanks.
One option may be to use a non-linear prediction method and take the fitted values from the model.
For example, fitting a polynomial regression to a single column, the fitted values track the underlying smooth signal (the purple curve in the original answer's plot).
Following the same logic, you can do the same thing for every column of the zz matrix:
# fit a quadratic polynomial to each column of zz and keep the fitted values
predictions <- matrix(, nrow = nrow(zz), ncol = 0)
for (i in 1:ncol(zz)) {
  pred <- as.matrix(fitted(lm(zz[, i] ~ poly(1:nrow(zz), 2, raw = TRUE))))
  predictions <- cbind(predictions, pred)
}
Then you can plot the predictions,
par(mfrow=c(1,3))
image(z,main="Original")
image(zz,main="Noisy")
image(predictions,main="Predicted")
Note that I used a polynomial regression of degree 2; you can change the degree for a better fit across the columns. Alternatively, you could use other, more powerful non-linear prediction methods (SVM, ANN, etc.) to get a more accurate model.
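As another option, purely for illustration: a simple k x k moving-average (box) smoother written in base R, which is a common way to low-pass filter a matrix without converting it to an image. This is a sketch, not a tuned filter; the window size k is an arbitrary choice.
box_smooth <- function(m, k = 9) {
  half <- k %/% 2
  out  <- m
  for (i in 1:nrow(m)) {
    for (j in 1:ncol(m)) {
      rows <- max(1, i - half):min(nrow(m), i + half)
      cols <- max(1, j - half):min(ncol(m), j + half)
      out[i, j] <- mean(m[rows, cols])   # average over the local window
    }
  }
  out
}

smoothed <- box_smooth(zz)
par(mfrow = c(1, 3))
image(z,        main = "Original")
image(zz,       main = "Noisy")
image(smoothed, main = "Box-smoothed")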
