How to add a cluster id in a separate column of a dataframe? - r

I have produced a dendrogram with hclust and cut it into two clusters. I know from the graph which row corresponds to which cluster. What I want to do is create a separate column in the dataframe that contains "class-1" if the row belongs to the first cluster and "class-2" if it belongs to the second.

Without an example dataset, I will use the built-in USArrests.
If you create a column of class factor with the labels "class-1" and "class-2", R will automatically assign them to the values 1 and 2, respectively.
hc <- hclust(dist(USArrests), "ave") # taken from the help page ?hclust
memb <- cutree(hc, k = 2)
res <- cbind(USArrests, Class = factor(unname(memb), labels = c("class-1", "class-2")))
head(res)
#            Murder Assault UrbanPop Rape   Class
# Alabama      13.2     236       58 21.2 class-1
# Alaska       10.0     263       48 44.5 class-1
# Arizona       8.1     294       80 31.0 class-1
# Arkansas      8.8     190       50 19.5 class-2
# California    9.0     276       91 40.6 class-1
# Colorado      7.9     204       78 38.7 class-2
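The label-to-value mapping can be checked in isolation with a toy vector (a minimal sketch, independent of the clustering above):

```r
# factor() maps the sorted unique values (here 1 and 2) to the
# labels in the order given, just as with cutree()'s output.
memb <- c(1, 2, 2, 1)
factor(memb, labels = c("class-1", "class-2"))
# [1] class-1 class-2 class-2 class-1
# Levels: class-1 class-2
```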

groupLabels not shown when using dendextend colour_branches

The workflow I want to implement is:
dm <- dist(data)
dend <- hclust(dm)
k <- stats::cutree(dend, k = 10)
data$clusters <- k
plot(hclust, colorBranchees = k) # ???? What can I use here?
So I searched for how to color dendrogram branches using cutree output; all I found was dendextend.
The problem is that I am failing to implement the workflow with dendextend.
This is what I came up with, but I would now like to have the cluster labels shown:
library(dendextend)
hc <- hclust(dist(USArrests))
dend <- as.dendrogram(hc)
kcl <- dendextend::cutree(dend, k = 4)
dend1 <- color_branches(dend, clusters = kcl[order.dendrogram(dend)], groupLabels = TRUE) %>%
  set("labels_cex", 1)
plot(dend1, main = "Dendrogram dist JK")
Also, trying something like groupLabels = 1:4 does not help.
Specifying the number of clusters via the parameter k does make groupLabels work. But unfortunately, the labels are then different from those generated by dendextend's own cutree method.
Note that here cluster 4 has 2 members.
> table(kcl)
kcl
 1  2  3  4
14 14 20  2
This post suggests using dendextend::cutree(dend, k = nrCluster, order_clusters_as_data = FALSE):
r dendrogram - groupLabels not match real labels (package dendextend)
But then I cannot use the output of dendextend::cutree to group the data (since the ordering does not match).
I would be happy to use a different dendrogram plotting library in R but so far my Web searches for "coloring dendrogram branches by cutree output" point to the dendextend package.
I'm sorry, but I'm not sure I fully understand your question.
It seems like you want to align cutree's output with your original data.
If that's the case, then you need to use dendextend::cutree(dend, k = nrCluster, order_clusters_as_data = TRUE), e.g.:
require(dendextend)
d1 <- USArrests[1:10,]
hc <- hclust(dist(d1))
dend <- as.dendrogram(hc)
k <- dendextend::cutree(dend, k = 3, order_clusters_as_data = TRUE)
d2 <- cbind(d1, k)
plot(color_branches(dend, 3))
d2
# an easier way to see the clusters is by ordering the rows of the data based on the order of the dendrogram
d2[order.dendrogram(dend),]
The plot is fine, and the clusters are mapped correctly to the data (see the outputs below):
> require(dendextend)
> d1 <- USArrests[1:10,]
> hc <- hclust(dist(d1))
> dend <- as.dendrogram(hc)
> k <- dendextend::cutree(dend, k = 3, order_clusters_as_data = TRUE)
> d2 <- cbind(d1, k)
> plot(color_branches(dend, 3))
> d2
             Murder Assault UrbanPop Rape k
Alabama        13.2     236       58 21.2 1
Alaska         10.0     263       48 44.5 1
Arizona         8.1     294       80 31.0 2
Arkansas        8.8     190       50 19.5 1
California      9.0     276       91 40.6 2
Colorado        7.9     204       78 38.7 1
Connecticut     3.3     110       77 11.1 3
Delaware        5.9     238       72 15.8 1
Florida        15.4     335       80 31.9 2
Georgia        17.4     211       60 25.8 1
> # an easier way to see the clusters is by ordering the rows of the data based on the order of the dendrogram
> d2[order.dendrogram(dend),]
             Murder Assault UrbanPop Rape k
Connecticut     3.3     110       77 11.1 3
Florida        15.4     335       80 31.9 2
Arizona         8.1     294       80 31.0 2
California      9.0     276       91 40.6 2
Arkansas        8.8     190       50 19.5 1
Colorado        7.9     204       78 38.7 1
Georgia        17.4     211       60 25.8 1
Alaska         10.0     263       48 44.5 1
Alabama        13.2     236       58 21.2 1
Delaware        5.9     238       72 15.8 1
Please let me know if this answers your question or if you have follow-up questions.

Question about prcomp() in R. I want my country names on the plot

I am using prcomp() on a dataset, but R says "Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric".
I know all columns should be numeric, but the country column is not one of the x variables. What should I do to let R know that? I tried deleting the whole country column, but then R uses the row numbers as labels instead of the country names.
I would like to output a plot with the country names on it, but so far I have not managed to.
dd <- `rownames<-`(within(USArrests, State <- rownames(USArrests)), NULL)
head(dd)
#   Murder Assault UrbanPop Rape      State
# 1   13.2     236       58 21.2    Alabama
# 2   10.0     263       48 44.5     Alaska
# 3    8.1     294       80 31.0    Arizona
# 4    8.8     190       50 19.5   Arkansas
# 5    9.0     276       91 40.6 California
# 6    7.9     204       78 38.7   Colorado
biplot(prcomp(dd))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
The fix is to move State back into the row names and drop the column before calling prcomp():
rownames(dd) <- dd$State
dd$State <- NULL
biplot(prcomp(dd, scale. = TRUE))
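If you only want the state (or country) names, without the variable arrows that biplot() draws, a base-R alternative (a sketch, not taken from the answer above) is to plot the first two principal-component scores and label each point with its row name:

```r
# PCA on the numeric columns; row names carry the state labels
pca <- prcomp(USArrests, scale. = TRUE)
# type = "n" draws the axes without points, then text() adds the names
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(USArrests), cex = 0.6)
```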

str() shows USArrests to have 4 columns, but it has 5. Why is this present and part of downstream analysis but not in str()?

The structure function in R shows that USArrests has only 4 variables.
However, there appear to be 5: state names are in the first column, but it is unlabeled.
I am struggling to understand the intuition behind this and how it works.
I have done K-means clustering with the data, and it seems that the first column (state names) acts as labels in the analysis without being used as categorical data.
This is the tutorial I used:
https://uc-r.github.io/kmeans_clustering
Below is some code to explain myself in a clearer manner.
str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
How it appears as a "label" in the K-means clustering:
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
Data Cleaning
df <- USArrests
df <- na.omit(df)
Scaling
(df <- scale(df))
Compute K-means Clustering
k2 <- kmeans(df, centers = 2, nstart = 25)
Sample Output
Clustering vector:
   Alabama     Alaska    Arizona   Arkansas California
         2          2          2          1          2
If there are only four variables, how does R, or the clustering algorithm, know to associate each cluster with a state name, which technically isn't a column?
The first "column" is not actually a column but the index (the row names) of the dataset. Instead of the default index 1, 2, 3, 4, etc., it is Alabama, Alaska, Arizona, Arkansas, etc. That is why running str() gives us only 4 columns: an index is never treated as a column.
Now, the clustering output shows which cluster each state belongs to. This is simply the index, and the algorithm is telling us which cluster each row belongs to. For example, if the index were 1, 2, 3, 4, etc. instead of state names, we would still get the result as row 1 being in cluster 2, row 2 in cluster 2, row 3 in cluster 2, row 4 in cluster 1, etc. The algorithm does what you tell it to do: it sees the index and reports the respective cluster against that index.
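This is easy to verify: row names are metadata on the data frame, not a column, so column-counting functions never see them.

```r
ncol(USArrests)                   # 4 -- the row names are not counted
rownames(USArrests)[1:3]          # "Alabama" "Alaska" "Arizona"
"State" %in% colnames(USArrests)  # FALSE -- no such column exists
```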
Hope this helps.

Comparing variables with values in another dataframe and replacing them with another value

I have a data.frame with:
Height <- c(169,176,173,172,176,158,168,162,178)
and another with reference heights and weights.
heights_f <- c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6)
weights_f <- c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84)
weightfactor_f <- data.frame(heights_f, weights_f)
I now need to match the heights from the first data.frame with the closest reference height in the second one and get back the corresponding reference weight.
I haven't had any success yet, as I haven't been able to find anything about matching values that are not exactly equal.
If I understand your goal, instead of taking the nearest value, consider interpolating through the approx function. For instance:
approx(weightfactor_f$heights_f,weightfactor_f$weights_f,xout=Height)$y
#[1] 60.23846 66.44400 63.85385 62.95600 66.44400 50.36000 59.35385 53.96923
#[9] 68.28400
You could do:
Height<- c(169,176,173,172,176,158,168,162,178)
heights_f<- as.numeric(c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6))
weights_f<- as.numeric(c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84))
df <- data.frame(Height = Height,
                 match_weight = sapply(Height, function(x) weights_f[which.min(abs(heights_f - x))]))
i.e. for each entry in Height, find the closest element of the heights_f vector with which.min(abs(heights_f - x)) and fetch the corresponding entry from the weights_f vector.
Output:
  Height match_weight
1    169         61.3
2    176         65.8
3    173         63.6
4    172         63.6
5    176         65.8
6    158         49.9
7    168         59.0
8    162         54.5
9    178         68.1
library(dplyr)
Slightly different structure for the reproducible example:
Height <- data.frame(height = as.numeric(c(169,176,173,172,176,158,168,162,178)))
The rest is the same:
heights_f<- as.numeric(c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6))
weights_f<- as.numeric(c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84))
weightfactor_f<- data.frame(heights_f,weights_f)
Then, round to the nearest whole number:
weightfactor_f$heights_f <- round(weightfactor_f$heights_f, 0)
Then just:
left_join(Height, weightfactor_f, by = c("height" = "heights_f"))
Output:
  height weights_f
1    169        NA
2    176        NA
3    173      63.6
4    172        NA
5    176        NA
6    158      49.9
7    168      59.0
8    162        NA
9    178      68.1
z <- vector()
for (i in 1:length(Height)) {
  z[i] <- weightfactor_f$weights_f[which.min(abs(Height[i] - weightfactor_f$heights_f))]
}

Quickest way to find the maximum value from one column with multiple duplicates in others? [duplicate]

This question already has answers here:
How to get the maximum value by group
(5 answers)
Closed 5 years ago.
In reality I have a very large data frame. One column contains an ID and another contains a value associated with that ID. However, each ID occurs multiple times with differing values, and I wish to record the maximum value for each ID while discarding the rest. Here is a replicable example using the quakes dataset in R:
data <- as.data.frame(quakes)
## Create output matrix
output <- matrix(NA, length(unique(data[, 5])), 2)
colnames(output) <- c("Station ID", "Max Mag")
## Grab unique station IDs
uni <- unique(data[, 5])
## Go through each station ID and record the maximum magnitude
for (i in 1:nrow(output)) {
  sub.data <- data[which(data[, 5] == uni[i]), ]
  ## Put station ID in column 1
  output[i, 1] <- uni[i]
  ## Put biggest magnitude in column 2
  output[i, 2] <- max(sub.data[, 4])
}
Considering that with my real data I have data frames with dimensions of 100000's of rows, this is a slow process. Is there a quicker way to execute such a task?
Any help much appreciated!
library(plyr)
ddply(data, "stations", function(data){data[which.max(data$mag),]})
     lat   long depth mag stations
1 -27.21 182.43    55 4.6       10
2 -27.60 182.40    61 4.6       11
3 -16.24 168.02    53 4.7       12
4 -27.38 181.70    80 4.8       13
-----
You can also use:
> data2 <- data[order(data$mag, decreasing = TRUE), ]
> data2[!duplicated(data2$stations), ]
        lat   long depth mag stations
152  -15.56 167.62   127 6.4      122
15   -20.70 169.92   139 6.1       94
17   -13.64 165.96    50 6.0       83
870  -12.23 167.02   242 6.0      132
1000 -21.59 170.56   165 6.0      119
558  -22.91 183.95    64 5.9      118
109  -22.55 185.90    42 5.7       76
151  -23.34 184.50    56 5.7      106
176  -32.22 180.20   216 5.7       90
275  -22.13 180.38   577 5.7      104
Also :
> library(data.table)
> data <- data.table(data)
> data[, .SD[which.max(mag)], by = stations]
     stations    lat   long depth mag
  1:       41 -23.46 180.11   539 5.0
  2:       15 -13.40 166.90   228 4.8
  3:       43 -26.00 184.10    42 5.4
  4:       19 -19.70 186.20    47 4.8
  5:       11 -27.60 182.40    61 4.6
 ---
 98:       77 -21.19 181.58   490 5.0
 99:      132 -12.23 167.02   242 6.0
100:      115 -17.85 181.44   589 5.6
101:      121 -20.25 184.75   107 5.6
102:      110 -19.33 186.16    44 5.4
data.table works better for large datasets.
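As a side note (not part of the original answer): when you only need the maximum itself rather than the whole matching row, a plain grouped aggregation avoids materialising .SD and is typically faster. This sketch assumes the data.table package is installed:

```r
library(data.table)

dt <- as.data.table(quakes)
# one row per station, with just the maximum magnitude
dt[, .(max_mag = max(mag)), by = stations]
```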
You could try tapply, too:
tapply(data$mag, data$stations, FUN=max)
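Note that tapply() returns a named numeric vector, not a data frame; if you need the two-column shape from the question, you can rebuild it from the names (a small base-R sketch):

```r
mx <- tapply(quakes$mag, quakes$stations, FUN = max)
# names(mx) holds the station IDs; the values are the per-station maxima
out <- data.frame(stations = as.integer(names(mx)), max_mag = as.numeric(mx))
head(out)
```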
You can also try the newer 'dplyr' package, which is much faster and easier to use than 'plyr'. It implements what Hadley calls "a grammar of data manipulation", chaining the operations together with %.% (replaced by %>% in later dplyr versions), like so:
library(dplyr)
df <- as.data.frame(quakes)
df %.%
group_by(stations) %.%
summarise(Max = max(mag)) %.%
arrange(desc(Max)) %.%
head(5)
Source: local data frame [5 x 2]

  stations Max
1      122 6.4
2       94 6.1
3      132 6.0
4      119 6.0
5       83 6.0
