Clustering list for hclust function - r

Using plot(hclust(dist(x))), I was able to draw a cluster dendrogram, and it works. However, I would like to get a list of all clusters rather than a tree diagram, because I have a huge amount of data (about 150K nodes) and the plot gets messy.
In other words, if a b c is one cluster and d e f g is another, then I would like to get something like this:
1 a,b,c
2 d,e,f,g
Please note that this is not exactly what I need as "output"; it is just an example. I would simply like to get a list of clusters instead of a tree plot. It could be a vector, a matrix, or just numbers that show which group each element belongs to.
How is this possible?

I will use the USArrests dataset that ships with R to demonstrate how to cut a tree into the desired number of groups. The result is a named vector of group assignments.
First, construct an hclust object.
hc <- hclust(dist(USArrests), "ave")
#plot(hc)
You can now cut the tree into as many groups as you want. Here I split the tree into two groups. You set the number of groups with the k parameter. See ?cutree for the h parameter (the height at which to cut), which may be more useful to you; for this tree, cutree(hc, k = 2) and cutree(hc, h = 110) give the same grouping.
cutree(hc, k = 2)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 2 1 1 2
Hawaii Idaho Illinois Indiana Iowa
2 2 1 2 2
Kansas Kentucky Louisiana Maine Maryland
2 2 1 2 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 2 1 2
Montana Nebraska Nevada New Hampshire New Jersey
2 2 1 2 2
New Mexico New York North Carolina North Dakota Ohio
1 1 1 2 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 2 2 1
South Dakota Tennessee Texas Utah Vermont
2 2 2 2 2
Virginia Washington West Virginia Wisconsin Wyoming
2 2 2 2 2

Let's say:
y<-dist(x)
clust<-hclust(y)
groups<-cutree(clust, k=3)
x<-cbind(x,groups)
Now x has an extra column, groups, giving the cluster each record belongs to.
You can subset the dataset as well:
x1<- subset(x, groups==1)
x2<- subset(x, groups==2)
x3<- subset(x, groups==3)
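To get output shaped like the "1 a,b,c" example in the question, the named vector returned by cutree can be split by group. This is a minimal sketch using USArrests as stand-in data; the names clusters and cluster_lines are only illustrative.

```r
# Stand-in data: USArrests ships with R
hc <- hclust(dist(USArrests), "ave")
groups <- cutree(hc, k = 2)

# A list with one character vector of member names per cluster
clusters <- split(names(groups), groups)

# One comma-separated string per cluster, like "a,b,c"
cluster_lines <- vapply(clusters, paste, character(1), collapse = ",")
print(cluster_lines)
```

Note that dist() stores all pairwise distances, so for very large inputs memory use may become the limiting factor before the plotting does.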

Related

Convert DataFrame Column to Factors

R, RStudio
How do I convert DataFrame column to Factors?
I wish 0 is "North", 1 is "South", 2 is "East" and 3 is "West".
directions <- data.frame(
state=c("New York","New Jersey","Deleware","Texax","Alaska"),
travel=c(0,0,3,2,1)
)
head(directions)
Outputs
state travel
1 New York 0
2 New Jersey 0
3 Deleware 3
4 Texax 2
5 Alaska 1
I tried the following, but the entire travel column is NA
directions$travel <- factor(directions$travel,levels=c("North","South","East","West"))
head(directions)
Outputs
state travel
1 New York <NA>
2 New Jersey <NA>
3 Deleware <NA>
4 Texax <NA>
5 Alaska <NA>
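factor() matches levels against the values actually present in the column (0 to 3), so passing the names as levels turns every entry into NA. Pass the names as labels instead: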
factor(directions$travel,labels=c("North","South","East","West"))
#[1] North North West East South
#Levels: North South East West
Specifying the levels explicitly as well fixes which value maps to which label, which matters when a value is absent from the data or a custom order is needed:
factor(directions$travel,levels = c(0, 1, 2, 3),
labels=c("North","South","East","West"))
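As a small self-contained check of the levels/labels pairing, the sketch below rebuilds the travel column and also shows a case where one code is absent from the data. The vector names travel, direction and partial are made up for illustration.

```r
travel <- c(0, 0, 3, 2, 1)

# levels = values found in the data, labels = names to show for them
direction <- factor(travel,
                    levels = c(0, 1, 2, 3),
                    labels = c("North", "South", "East", "West"))
print(direction)

# With explicit levels the mapping stays correct even if some codes are missing
partial <- factor(c(0, 3),
                  levels = c(0, 1, 2, 3),
                  labels = c("North", "South", "East", "West"))
print(partial)
```

With labels alone, the second call would fail, because the number of labels would no longer match the number of distinct values in the data.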

Mutate DF1 based on DF2 with a check

Newbie here with a dataframe/mutate question. I want to update a dataframe (df1) based on data in another dataframe (df2). For one-offs I've used mutate, so I figure this is the way to go. Additionally, I would like a check column (TRUE/FALSE) indicating whether the field in df1 was updated.
For Example..
df1-
State
<chr>
1 N.Y.
2 FL
3 AL
4 MS
5 IL
6 WS
7 WA
8 N.J.
9 N.D.
10 S.D.
11 CALL
df2
State New_State
<chr> <chr>
1 N.Y. New York
2 FL Florida
3 AL Alabama
4 MS Mississippi
5 IL Illinois
6 WS Wisconsin
7 WA Washington
8 N.J. New Jersey
9 N.D. North Dakota
10 S.D. South Dakota
11 CAL California
I want the output to look like this
df3
New_State Test
<chr>
1 New York TRUE
2 Florida TRUE
3 Alabama TRUE
4 Mississippi TRUE
5 Illinois TRUE
6 Wisconsin TRUE
7 Washington TRUE
8 New Jersey TRUE
9 North Dakota TRUE
10 South Dakota TRUE
11 CALL FALSE
In essence, I want R to read the data in df1 and, where there is a match in df2, replace the abbreviation with the full state name. Lastly, if the value in df1 was updated, mark it as TRUE (N.Y. to New York), and as FALSE if it was not updated (CALL vs CAL).
Thanks in advance for any and all help.
This should give you the result you're looking for:
match_vec <- match(df1$State, table = df2$State)
match() returns, for each abbreviated state name in df1, the position of the matching row in df2. Where there is no match (here "CALL"), the result is NA.
The following code using dplyr should then produce the df3 you requested.
library(dplyr)
df3 <- df1 %>%
  mutate(New_State = df2$New_State[match_vec]) %>%
  mutate(Test = !is.na(match_vec)) %>%
  mutate(New_State = ifelse(is.na(New_State),
                            State, New_State)) %>%
  select(New_State, Test)
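The same lookup can be sketched without dplyr. The example below is base R only and uses a shortened, made-up version of df1 and df2 (three rows each) so that it runs on its own; it is an illustration of the match() idea, not a replacement for the answer above.

```r
# Shortened stand-ins for the question's df1 and df2
df1 <- data.frame(State = c("N.Y.", "FL", "CALL"), stringsAsFactors = FALSE)
df2 <- data.frame(State = c("N.Y.", "FL", "CAL"),
                  New_State = c("New York", "Florida", "California"),
                  stringsAsFactors = FALSE)

# Position of each df1$State in df2$State, NA where there is no match
match_vec <- match(df1$State, df2$State)

df3 <- data.frame(
  # Keep the original value when nothing matched
  New_State = ifelse(is.na(match_vec), df1$State, df2$New_State[match_vec]),
  # TRUE where a replacement was found
  Test = !is.na(match_vec),
  stringsAsFactors = FALSE
)
print(df3)
```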

R: split data based on a factor, add a ranking column and extract

I still haven't figured out how to access the different elements of split data. Here is my problem:
I have a data set, that I want to split based on a column (State). I want to have a ranking column added to my data for each subset. This is part of a function I'm writing.
My data set has 3 columns: Hospital, State and Outcome. For each state, I want to add a 'Rank' column that ranks the data based on Outcome; the lowest Outcome is ranked 1 and the highest Outcome is ranked last.
How can I use split, sapply/lapply to do this? Is there a better way, like using "arrange"?
My main problem is that when I use either of these methods, I do not know how to access each element of the split or arranged data.
Here's what my data set looks like (columns Hospital, State, Outcome; the row numbers are not important here):
Hospital State Outcome
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1
7 ST VINCENT'S EAST TX 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
The desired outcome would be
Hospital State Outcome Rank
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
5 ST VINCENT'S EAST TX 17.7 1
6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
Thanks in advance.
The dplyr package provides a very elegant solution for this type of problem. I'm using the mtcars data as an example:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(rank = row_number(mpg))
The OP's example is hard to read into R because of all the spaces in the string variable.
Here's a simpler example:
set.seed(1)
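# Output below is from R < 3.6.0; later versions changed sample(), so sizes and values will differ
DF <- data.frame(id = rep(c("A", "B"), sample(5, 2)))
DF$v <- runif(nrow(DF)) * 100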
# id v
# 1 A 57.28534
# 2 A 90.82078
# 3 B 20.16819
# 4 B 89.83897
# 5 B 94.46753
# 6 B 66.07978
# 7 B 62.91140
Here's a solution without using any packages:
DF$r <- ave(DF$v,DF$id,FUN=rank)
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 4 B 89.83897 4
# 5 B 94.46753 5
# 6 B 66.07978 3
# 7 B 62.91140 2
Finally, to order by ranking within each group (id here, State in the question):
DF[order(DF$id,DF$r),]
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 7 B 62.91140 2
# 6 B 66.07978 3
# 4 B 89.83897 4
# 5 B 94.46753 5
If you have ties in the column you're ranking, read the documentation for rank and decide how you want the ties treated. The dplyr and data.table packages (mentioned in the other answers) also have nice functionality for dealing with ties, like the notion of a "dense rank."
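To make the tie handling concrete, here is a small base R sketch on a made-up vector v. The last line is one possible way to compute a dense rank without extra packages; it is an illustration, not the implementation used by dplyr or data.table.

```r
v <- c(10, 20, 20, 30)

avg_rank   <- rank(v)                       # default "average": 1 2.5 2.5 4
min_rank   <- rank(v, ties.method = "min")  # ties share the lowest rank: 1 2 2 4
dense_rank <- match(v, sort(unique(v)))     # no gaps after ties: 1 2 2 3

print(rbind(avg_rank, min_rank, dense_rank))
```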
You could try this
library(data.table)
setDT(dat)[, myrank := rank(Outcome), by = State]
dat[,.SD[order(myrank)], by=State]
# State Hospital Outcome myrank
#1: AL SOUTHEAST ALABAMA MEDICAL CENTER 14.3 1
#2: AL SHELBY BAPTIST MEDICAL CENTER 15.9 2
#3: AL DEKALB REGIONAL MEDICAL CENTER 18.0 3
#4: AL MARSHALL MEDICAL CENTER SOUTH 18.5 4
#5: TX ST VINCENT EAST 17.7 1
#6: TX ELIZA COFFEE MEMORIAL HOSPITAL 18.1 2
Or using ddply
library(plyr)
ddply(dat, .(State), function(x){x$myrank = rank(x$Outcome); x[order(x$myrank),]})
# Hospital State Outcome myrank
#1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
#2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
#3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
#4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
#5 ST VINCENT EAST TX 17.7 1
#6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
You can use by:
do.call(
  rbind,
  # rank(), not order(): order() returns sorting indices, not ranks
  by(d, list(State = d$State), function(x) { x$Rank <- rank(x$Outcome); x[order(x$Rank), ] }))
where d is your raw data.
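Since the question asks specifically about split and lapply and about reaching into the pieces, the following base R sketch shows that route. The data frame d is retyped from the question's sample; the names by_state, ranked and result are only illustrative.

```r
d <- data.frame(
  Hospital = c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH",
               "ELIZA COFFEE MEMORIAL HOSPITAL", "ST VINCENT'S EAST",
               "DEKALB REGIONAL MEDICAL CENTER", "SHELBY BAPTIST MEDICAL CENTER"),
  State = c("AL", "AL", "TX", "TX", "AL", "AL"),
  Outcome = c(14.3, 18.5, 18.1, 17.7, 18.0, 15.9),
  stringsAsFactors = FALSE
)

# split() returns a named list of data frames, one per state
by_state <- split(d, d$State)
by_state[["AL"]]   # access a single piece by name

# Add a Rank column to each piece and sort the piece by it
ranked <- lapply(by_state, function(x) {
  x$Rank <- rank(x$Outcome, ties.method = "min")
  x[order(x$Rank), ]
})

# Stack the pieces back into one data frame
result <- do.call(rbind, ranked)
rownames(result) <- NULL
print(result)
```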

R-How to obtain relationships between cutree groups?

Hopefully title is not too badly worded. I have a tree that I used cutree to obtain groups from, but it is clear that the groups are not numbered left-to-right or right-to-left (I know the orientation within a branch doesn't matter so much, was hoping the grouping would be the same as the ordering in the hclust object). Is it possible to extract groups from a tree (using the height option of cutree) and know which of those groups are more related to one another? I walk through an example using USArrests below.
hc <- hclust(dist(USArrests), "ave")
plot(hc)
cutree(hc,h=60)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 3 1 4 2
Hawaii Idaho Illinois Indiana Iowa
3 3 1 3 3
Kansas Kentucky Louisiana Maine Maryland
3 3 1 3 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 3 1 2
Montana Nebraska Nevada New Hampshire New Jersey
3 3 1 3 2
New Mexico New York North Carolina North Dakota Ohio
1 1 4 3 3
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 3 2 1
South Dakota Tennessee Texas Utah Vermont
3 2 2 3 3
Virginia Washington West Virginia Wisconsin Wyoming
2 2 3 3 2
If you plot the tree, it is clear that groups 1 and 4 are closely related to each other, and likewise groups 2 and 3. However, when I just print the contents of each group, there is no way to know what that relationship is. Is there a function or standard process I am missing? In the real data I'm working with, I split 36k values into 10 groups, so it would be tough to validate the relationships visually as I do with the example data, and I want to code this as a script for future analyses. Thanks ahead of time.
I think you want to use
hc <- hclust(dist(USArrests), "ave")
cuthc <- cut(as.dendrogram(hc), h=60)
This will return a list with an $upper showing the tree above the cut, and a $lower element which is a list of each of the subtrees made from the cut. We can plot them with
layout(matrix(1:4, ncol=2))
sapply(1:4, function(i) plot(cuthc$lower[[i]]))
Then, if you want to extract the names and groups in the order they appear in the dendrograms, you can do
stack(setNames(Map(labels, cuthc$lower),seq_along(cuthc$lower)))
Here I use stack() and setNames() just to assign a unique ID to each element of the $lower list; stack() doesn't like it when the list isn't named.
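If the group numbers from cutree need to be kept, one possible sketch is to read them off in dendrogram order via hc$order and to use the cophenetic distance (the height at which two items are first merged) as a measure of how closely two groups are related. The names left_to_right, reps and between are made up for this illustration.

```r
hc <- hclust(dist(USArrests), "ave")
groups <- cutree(hc, h = 60)

# cutree group numbers in the left-to-right order of the plotted tree
left_to_right <- unique(groups[hc$order])
print(left_to_right)

# Merge height between every pair of groups: take one representative per
# group and look up the cophenetic distance between the representatives
coph <- as.matrix(cophenetic(hc))
reps <- sapply(split(names(groups), groups), `[`, 1)
between <- coph[reps, reps]
dimnames(between) <- list(names(reps), names(reps))
print(round(between, 1))
```

Smaller entries in between mean the two groups join lower in the tree, so they are more closely related. One representative per group is enough here because every pair of items from two different groups is first merged at the same height.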

Calculate rows with same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R sum the last column, 'count', by city or state so I can map the data. For example, I want to show how many participants (the total of count) are in the state of Hawaii (HI).
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
What I've tried is:
match<- c(match(datazip$state, datazip$number))>$
but I'm really stuck, since I don't even know how to describe this in short. My plan afterwards is to make a choropleth map with the data, and by now I've seen almost all the pages that try to give advice, so your help is much appreciated. Thanks.
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count,by=list(df$state),sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count,by=list(df$city),sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36
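aggregate also has a formula interface, which keeps the original column names in the result instead of Group.1 and x. A minimal sketch, with the sample rows retyped into a data frame named df and only the columns needed for the sum:

```r
df <- data.frame(
  city  = c("Pearl Harbor", "Kaneohe Bay", "Anchorage", "Anchorage", "Elmendorf AFB"),
  state = c("HI", "HI", "AK", "AK", "AK"),
  count = c(36, 39, 12, 17, 2),
  stringsAsFactors = FALSE
)

# Sum count within each state; the result has columns state and count
by_state <- aggregate(count ~ state, data = df, FUN = sum)
print(by_state)

# Same idea per city
by_city <- aggregate(count ~ city, data = df, FUN = sum)
print(by_city)
```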
