How do I draw a line in a dendrogram that corresponds to the best K for a given criterion?
Like this:
Let's suppose that this is my dendrogram, and the best K is 4.
data("mtcars")
myDend <- as.dendrogram(hclust(dist(mtcars)))
plot(myDend)
I know that the abline function can draw lines on plots like the one shown above. However, I don't know how to calculate the height so that I can call abline(h = myHeight).
The information that you need to get the heights comes with hclust. The object it returns has a component, height, containing the merge heights (in increasing order for the default complete linkage). To get 4 clusters, you want to draw your line between the 3rd-biggest and 4th-biggest heights.
HC = hclust(dist(mtcars))
myDend <- as.dendrogram(HC)
par(mar=c(7.5,4,2,2))
plot(myDend)
k = 4
n = nrow(mtcars)
MidPoint = (HC$height[n-k] + HC$height[n-k+1]) / 2
abline(h = MidPoint, lty=2)
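The midpoint calculation above can be wrapped in a small helper and checked against cutree. This is only a sketch: cut_height is a hypothetical name, not part of any package, and it assumes the heights stored in the hclust object are sorted in increasing order, which holds for the default complete linkage.

```r
# Hypothetical helper: height at which a horizontal line leaves k clusters.
# Assumes hc$height is non-decreasing (true for the default "complete" linkage).
cut_height <- function(hc, k) {
  n <- length(hc$height) + 1            # number of observations
  stopifnot(k >= 2, k <= n - 1)
  (hc$height[n - k] + hc$height[n - k + 1]) / 2
}

HC <- hclust(dist(mtcars))
h4 <- cut_height(HC, 4)

# Sanity check: cutting the tree at that height should give 4 groups
n_groups <- length(unique(cutree(HC, h = h4)))
n_groups
```

After plot(as.dendrogram(HC)), calling abline(h = h4, lty = 2) draws the same dashed line as in the answer.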
Related
I am trying to do k-means clustering using R, and this is what I have done so far:
tmp <- kmeans(ds, centers = 4, iter.max = 1000)
plot(ds[tmp$cluster==1,c(1,5)], col = "red", xlim = c(min(ds[,1]),
max(ds[,1])), ylim = c(min(ds[,5]), max(ds[,5])))
points(ds[tmp$cluster==2,c(1,5)], col = "blue")
points(ds[tmp$cluster==3,c(1,5)], col = "seagreen")
points(ds[tmp$cluster==4,c(1,5)], col = "orange")
points(tmp$centers[,c(1,5)], col = "black")
and I get the following graph:
I am quite new to this, so I may be way off, but this graph does not look quite right to me. The data is basically divided in zones and to be honest, I was expecting to see something along the lines of this:
The circles in this picture are just to showcase where I was expecting the clusters to be. Can anyone explain why the data is clustered like that? I did the clustering multiple times and I always end up with this result.
The dataset I am using can be found here.
Notice that Age runs from about 18 to 60, so the maximum distance between ages is about 40. Now notice that the incomes range from 0 to 20000. The distance between points is heavily dominated by the income. If you wish both variables to be used in the clustering, you should scale the data before clustering. Try
tmp<-kmeans(scale(ds), centers = 4, iter.max = 1000)
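To see what scale() changes, here is a small sketch on made-up data. The data frame toy and its two columns are hypothetical stand-ins for the original ds (which is not reproduced here); only the ranges mimic the ones mentioned above.

```r
set.seed(1)
# Hypothetical data with the same kind of range mismatch as in the question
toy <- data.frame(Age    = runif(200, 18, 60),
                  Income = runif(200, 0, 20000))

# Before scaling: Income spans a far larger range than Age
sapply(toy, function(v) diff(range(v)))

# After scaling: every column has mean 0 and standard deviation 1,
# so both variables contribute comparably to the Euclidean distance
scaled <- scale(toy)
col_sd <- apply(scaled, 2, sd)
col_sd

fit <- kmeans(scaled, centers = 4, iter.max = 1000, nstart = 10)
table(fit$cluster)
```

The nstart = 10 argument is an addition of mine; it reruns k-means from several random starts and keeps the best result.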
This is how the k-means clustering algorithm works. Google "k-means clustering" and look at the picture results and you will see different variations: circular clusters and the type you received. If you set the number of clusters k to a different number, you will get different clusters. The goal of the algorithm is to partition a data set into a desired number of non-overlapping clusters k, so that the total within-cluster variation is minimized. And this is the result you see in your plot.
I have an hclust tree with nearly 2000 samples. I have cut it to an appropriate number of clusters and would like to plot the dendrogram but ending at the height that I cut the clusters rather than all the way to every individual leaf. Every plotting guide is about coloring all the leaves by cluster or drawing a box, but nothing seems to just leave the leaves below the cut line out completely.
My full dendrogram looks like the following:
I would like to plot it as if it stops where I've drawn the abline here (for example):
This should get you started. I suggest reading the help page for "dendrogram".
Here is the example from the help page:
hc <- hclust(dist(USArrests))
dend1 <- as.dendrogram(hc)
plot(dend1)
dend2 <- cut(dend1, h = 100)
plot(dend2$upper)
plot(dend2$upper, nodePar = list(pch = c(1,7), col = 2:1))
By performing the cut on the dendrogram object (not the hclust object), you can then plot the upper part of the dendrogram. It will take some work to replace the branch 1, 2, 3, and 4 labels, depending on your analysis.
Good luck.
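As a possible starting point for relabelling, the lower part of the cut holds one subtree per branch, so you can look up what each branch contains. A minimal sketch on the same USArrests example (the variable names sizes and groups are mine):

```r
hc <- hclust(dist(USArrests))
dend2 <- cut(as.dendrogram(hc), h = 100)

# Number of observations under each branch of the upper tree
sizes <- sapply(dend2$lower, function(b) attr(b, "members"))
sizes

# Which states sit under the first branch
labels(dend2$lower[[1]])

# The same grouping as cluster ids, taken from the hclust object
groups <- cutree(hc, h = 100)
table(groups)
```

The branch sizes and the cutree group sizes should describe the same partition, although the numbering of branches and of cluster ids need not match.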
I am trying to do the following:
plot a time series in R using a polygonal line
plot one or more horizontal lines superimposed
find the intersections of said line with the horizontal ones
I got this far:
set.seed(34398)
c1 <- as.ts(rbeta(25, 33, 12))
p <- plot(c1, type = 'l')
# set thresholds
thresholds <- c(0.7, 0.77)
I can find no way to access the segment line object plotted by R. I really really really would like to do this with base graphics, while realizing that probably there's a ggplot2 concoction out there that would work. Any idea?
abline(h=thresholds, lwd=1, lty=3, col="dark grey")
I will just do one threshold. You can loop through the list to get all of them.
First find the points, x, so that the curve crosses the threshold between x and x+1
shift = (c1 - 0.7)
Lower = which(shift[-1]*shift[-length(shift)] < 0)
Find the actual points of crossing, by finding the roots of Series - 0.7 and plot
shiftedF = approxfun(1:length(c1), c1-0.7)
Intersections = sapply(Lower, function(x) { uniroot(shiftedF, x:(x+1))$root })
points(Intersections, rep(0.7, length(Intersections)), pch=16, col="red")
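To handle every threshold, the same idea can be put in a function and applied in a loop. This sketch swaps uniroot for the closed-form linear interpolation between the two neighbouring points, which gives the same crossing on a polygonal line. find_crossings is a hypothetical name, and the code assumes the series starts at time 1 with unit spacing, as c1 does.

```r
set.seed(34398)
c1 <- as.ts(rbeta(25, 33, 12))
thresholds <- c(0.7, 0.77)

# Hypothetical helper: x positions where the polygonal line crosses `threshold`
find_crossings <- function(series, threshold) {
  y <- as.numeric(series)
  shift <- y - threshold
  # indices i such that the line crosses between i and i + 1
  lower <- which(shift[-1] * shift[-length(shift)] < 0)
  # linear interpolation inside each crossing segment
  lower + (threshold - y[lower]) / (y[lower + 1] - y[lower])
}

crossings <- lapply(thresholds, function(th) find_crossings(c1, th))

plot(c1, type = "l")
abline(h = thresholds, lwd = 1, lty = 3, col = "dark grey")
for (i in seq_along(thresholds)) {
  points(crossings[[i]], rep(thresholds[i], length(crossings[[i]])),
         pch = 16, col = "red")
}
```

Points that lie exactly on a threshold are not reported, because the product test uses a strict inequality, as in the original answer.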
How can I plot a degree graph like that?
The picture is only indicative, the result may not be identical to the image.
The important thing is that on the X axis there are labels of the nodes and on the Y axis the degree of each node.
Then the degree can be represented as a histogram (figure), with points, etc., it is not important.
This is what I tried to do and did not come close to what I want:
d = degree(net, mode="all")
hist(d)
or
t = table(degree(net))
plot(t, xlim=c(1,77), ylim=c(0, 40), xlab="Degree", ylab="Frequency")
I think it is a trivial thing but it's the first time I use R.
Thank you
This is what I have now:
I would like a graph that was more readable (I have 77 bars). That is, with more space between the bars and between the labels.
My aim is to show that one node (Valjean) has a higher value than the others. I don't know if I am using the right kind of graphic.
You can just use a barplot and specify the row names. For example,
library(sna)  # rgraph() and degree() as used here come from the sna package
net = rgraph(10)
rownames(net) = colnames(net) = LETTERS[1:10]
d = degree(net)
barplot(d, names.arg = rownames(net))
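With many nodes, the labels get crowded. One option is to sort the bars and rotate the labels. The sketch below uses only base R and a small made-up undirected network; adj is a hypothetical adjacency matrix, and the degree is computed as its row sums rather than with a graph package.

```r
# Hypothetical 5-node undirected network: node A is linked to every other node
nodes <- LETTERS[1:5]
adj <- matrix(0, nrow = 5, ncol = 5, dimnames = list(nodes, nodes))
adj["A", -1] <- 1
adj[-1, "A"] <- 1

# Degree of each node in an undirected graph = row sum of the adjacency matrix
d <- rowSums(adj)

# Sorted bars, vertical labels (las = 2), smaller label text, gaps between bars
barplot(sort(d, decreasing = TRUE), las = 2, cex.names = 0.7, space = 0.5,
        xlab = "Node", ylab = "Degree")
```

With 77 bars, widening the plotting device or using horiz = TRUE may also help readability.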
I am attempting to plot discrete functions in R for a flow model equation. I have to plot the original function u(x) = tanh(x - 0.1), with u(x) on the Y-axis and x on the X-axis. I then must plot a discrete function that describes the slope.
u <- array(0,dim=c(21))
#Plot the original function u(x)=tanh(ax-x0)
curve(tanh(x-0.1), from=0, to=5, n=100, col="red", xlab="x", ylab = "u(x)")
grid (NULL,NULL, col = "lightgray", lty="dotted")
x = seq(0, 5, by=0.25)
for (i in 1:21){
u[i] = tanh(x[i]-0.1)
}
x1 = seq(0, 4.75, by=0.25)
du1 <- array(0,dim=c(20))
for (i in 1:20){
du1[i] = (u[i+1]-u[i])/0.25
}
plot(x1, du1, xlab = "x", ylab = "du/dx")
So per the definition of my derivative function, my du/dx vector will only have 20 vector points, but my x vector still has 21 points. I must then repeat giving defined du/dx vectors that have 19 and 18 vector points. Is there any way I can plot the du/dx vs. x functions all on the same graph without having to redefine x every time?
I'm not sure I'm totally clear on what you're asking, but here's code that prevents you from writing out 18 individual code blocks (using the "diff" function in base).
derivs <- matrix(NA, nrow=21, ncol=18)
x <- seq(0, 5, by=0.25)
orig <- tanh(x-0.1)
# first derivative: forward differences, padded with NA to keep 21 rows
derivs[,1] <- c(diff(orig)/.25, NA)
# each higher derivative is the difference of the previous column
for(col in 2:18) {
derivs[,col] <- c((diff(derivs[,col-1])/.25), NA)
}
The resulting matrix (here called "derivs") has a column for each derivative (first column is the first derivative, second is the second derivative, etc.).
One reason I'm a bit confused about what you're trying for is that, if you were to plot all these on one graph, it would be a really weird graph, because the orders of magnitude are very different between the first few and the last few derivatives.
The dimensions aren't really different for each derivative; I've simply padded it with NAs, which won't appear on a graph.
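Because every column has the same length as x, one matplot call can draw several derivatives against the same x without redefining it. A sketch limited to the first three orders, so the scales stay comparable (the step h and the loop index j are my names):

```r
h <- 0.25
x <- seq(0, 5, by = h)
orig <- tanh(x - 0.1)

# First three finite-difference derivatives, each padded with NA to length(x)
derivs <- matrix(NA_real_, nrow = length(x), ncol = 3)
derivs[, 1] <- c(diff(orig) / h, NA)
for (j in 2:3) {
  derivs[, j] <- c(diff(derivs[, j - 1]) / h, NA)
}

# One call plots every column against the same x; NA entries are skipped
matplot(x, derivs, type = "b", pch = 1:3, lty = 1, col = 1:3,
        xlab = "x", ylab = "finite difference")
legend("topright", legend = paste("order", 1:3), pch = 1:3, col = 1:3, lty = 1)
```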
Also note that the diff function can compute second-order and higher differences directly through its differences argument, e.g. diff(orig, differences = 2).
PS. The graph will probably look more reasonable if, rather than assigning each difference to the first x value of its interval (as you did, and as I did, to emulate you), you center it. E.g. every other derivative (the first, the third, ...) would actually be plotted at .125, .375, etc.
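A quick way to see the effect of centering is to compare the first difference with the analytic derivative of tanh(x - 0.1), which is 1 - tanh(x - 0.1)^2. In this sketch the names x_left, x_mid and exact are mine:

```r
h <- 0.25
x <- seq(0, 5, by = h)
u <- tanh(x - 0.1)

du     <- diff(u) / h        # 20 first differences
x_left <- head(x, -1)        # 0, 0.25, ...   (placement used in the question)
x_mid  <- x_left + h / 2     # 0.125, 0.375, ... (centered placement)

# Analytic derivative of tanh(x - 0.1), used only as a reference
exact <- function(x) 1 - tanh(x - 0.1)^2

# Largest deviation from the analytic derivative for each placement
err_left <- max(abs(du - exact(x_left)))
err_mid  <- max(abs(du - exact(x_mid)))
c(left = err_left, centered = err_mid)

curve(exact(x), from = 0, to = 5, col = "red", xlab = "x", ylab = "du/dx")
points(x_mid, du)
```

The same differences sit much closer to the true curve when plotted at the interval midpoints.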