R-How to obtain relationships between cutree groups? - r

Hopefully title is not too badly worded. I have a tree that I used cutree to obtain groups from, but it is clear that the groups are not numbered left-to-right or right-to-left (I know the orientation within a branch doesn't matter so much, was hoping the grouping would be the same as the ordering in the hclust object). Is it possible to extract groups from a tree (using the height option of cutree) and know which of those groups are more related to one another? I walk through an example using USArrests below.
hc <- hclust(dist(USArrests), "ave")
plot(hc)
cutree(hc,h=60)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 3 1 4 2
Hawaii Idaho Illinois Indiana Iowa
3 3 1 3 3
Kansas Kentucky Louisiana Maine Maryland
3 3 1 3 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 3 1 2
Montana Nebraska Nevada New Hampshire New Jersey
3 3 1 3 2
New Mexico New York North Carolina North Dakota Ohio
1 1 4 3 3
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 3 2 1
South Dakota Tennessee Texas Utah Vermont
3 2 2 3 3
Virginia Washington West Virginia Wisconsin Wyoming
2 2 3 3 2
If you plot the tree it is clear that groups 1 and 4 are more related then groups 2 and 3 are more related. However when I just print the contents of each group there is no way to know what that relationship is. Is there a function or standard process I am missing? The real data I'm working with I split 36k values into 10 groups, so it would be tough to visually validate the relationships as I do with the example data, and want to code it as a script for future analyses. Thanks ahead of time.

I think you want to use
hc <- hclust(dist(USArrests), "ave")
cuthc <- cut(as.dendrogram(hc), h=60)
This will return a list with an $upper showing the tree above the cut, and a $lower element which is a list of each of the subtrees made from the cut. We can plot them with
layout(matrix(1:4, ncol=2))
sapply(1:4, function(i) plot(cuthc$lower[[i]]))
Then, if you want to extract the names and groups in the order they appear in the dendrograms, you can do
stack(setNames(Map(labels, cuthc$lower),seq_along(cuthc$lower)))
Here I use stack() and setNames() just to assign a unique ID to each element in the $lower list. stack() doesn't like it when the list isn't named

Related

Mutate DF1 based on DF2 with a check

nubie here with a dataframe/mutate question... I want to update a dataframe (df1) based on data in another dataframe (df2). For one offs I've used MUTATE so I figure this is the way to go. Additionally I would like a check function added (TRUE/FALSE ?) to indicate if the the field in df1 was updated.
For Example..
df1-
State
<chr>
1 N.Y.
2 FL
3 AL
4 MS
5 IL
6 WS
7 WA
8 N.J.
9 N.D.
10 S.D.
11 CALL
df2
State New_State
<chr> <chr>
1 N.Y. New York
2 FL Florida
3 AL Alabama
4 MS Mississippi
5 IL Illinois
6 WS Wisconsin
7 WA Washington
8 N.J. New Jersey
9 N.D. North Dakota
10 S.D. South Dakota
11 CAL California
I want the output to look like this
df3
New_State Test
<chr>
1 New York TRUE
2 Florida TRUE
3 Alabama TRUE
4 Mississippi TRUE
5 Illinois TRUE
6 Wisconsin TRUE
7 Washington TRUE
8 New Jersey TRUE
9 North Dakota TRUE
10 South Dakota TRUE
11 CALL FALSE
In essence I want R to read the data in df1 and change df1 based on the match in df2 chaining out to the full state name and replace. Lastly if the data in df1 was update mark as "TRUE" (N.Y. to NEW YORK) and "FALSE" if not updated (CALL vs CAL)
Thanks in advance for any and all help.
This should give you the result you're looking for:
match_vec <- match(df1$State, table = df2$State)
This vector should match all the abbreviated state names in df1 with those in df2. Where there's no match, you end up with a missing value:
Then the following code using dplyr should produce the df3 you requested.
library(dplyr)
df3 <- df1 %>%
mutate(New_State = df2$New_State[match_vec]) %>%
mutate(Test = !is.na(match_vec)) %>%
mutate(New_State = ifelse(is.na(New_State),
State, New_State)) %>%
select(New_State, Test)

How to Merge Uneven Data Frames With Real Data

Problem:
I have two different size data sets that I would like to merge together. Without abandoning rows or inserting NA's. To compare this to a excel document situation you would have five columns and you would drago down 3 of them to populate the blank space left by the rows inserted by adding your data to the 4th and 5th column.
Example Data Set
zipcode = a, step3 = b in my later brainstorming code to solve my problem
> head(zipcode_joincsv)
zip city abv latitude longitude median mean pop
226 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
227 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
228 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
229 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
230 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
231 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
> head(step3_df)
tolower.state.name. state.abb
1 alabama AL
2 alaska AK
3 arizona AZ
4 arkansas AR
5 california CA
6 colorado CO
Desired Result:
One DF where each zipcode city combo is combined with their states pop and
income. A column in common they have is the abbreviation column.
tolower.state.name. zip city abv latitude longitude median mean pop
1 alabama 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
2 alabama 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
3 alabama 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
4 alabama 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
5 alabama 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
6 alabama 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
7 alaska data from these rows
8 arizona data from these rows
9 arkansas data from these rows
10 california data from these rows
11 colorado data from these rows
I've contemplated using something like
sqldf ("SELECT a.Zip, a.City, a.State Abv, a.Lat, a.Long, a.median, a.mean, a.pop, b.state.name, b.states.abb, b.pop, b.income
FROM a a
LEFT JOIN b b using (abv)")
I know that is probably not going to work if only that if it worked all the rows that there was not a matching set from A would input a NA where what I would like is that for every abv of NY the states average income and total population gets copied down the line. Than for every AR and every AL etc until the two data sets are one that a ggplot using all of the data can be created.
dplyr::left_join(a, b, by="abv") should work.

Interactive Map Drill-down ability in R

I have a dataframe like the one below:
State<-c("Alabama","Alabama","Alaska","Alaska")
StateCode<-c("AL","AL","AK","AK")
County<-c("AUTAUGA","BALDWIN","ANCHORAGE","BETHEL")
CountyCode<-c("AL001","AL003","AK020","AK050")
Murder<-c(5,6,7,8)
d<-data.frame(State,StateCode,County,CountyCode, Num)
State StateCode County CountyCode Num
1 Alabama AL AUTAUGA AL001 5
2 Alabama AL BALDWIN AL003 6
3 Alaska AK ANCHORAGE AK020 7
4 Alaska AK BETHEL AK050 8
I have been searching for an option between R packages to create a drill-down map from State to County level out of this but I can't find a working example with code anywhere. Here is an example Any feedback on this?

Calculate rows with same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R calculate the last column 'count' towards the column city so I can map the data. Therefore I would need some kind of code to match this. Since I want to show how many participants (in count) are in the state of e.g Hawaii (HI)
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
what I've tried is
match<- c(match(datazip$state, datazip$number))>$
but I'm really helpless trying to find a solution since I don't even know how to describe this in short. My plan afterwards is to make choropleth map with the data and believe me by now I've seen almost all the pages that try to give advice. so your help is pretty much appreciated. Thanks
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count,by=list(df$state),sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count,by=list(df$city),sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36

Clustering list for hclust function

Using plot(hclust(dist(x))) method, I was able to draw a cluster tree map. It works. Yet I would like to get a list of all clusters, not a tree diagram, because I have huge amount of data (like 150K nodes) and the plot gets messy.
In other words, lets say if a b c is a cluster and if d e f g is a cluster then I would like to get something like this:
1 a,b,c
2 d,e,f,g
Please note that this is not exactly what I want to get as an "output". It is just an example. I just would like to be able to get a list of clusters instead of a tree plot It could be vector, matrix or just simple numbers that show which groups elements belong to.
How is this possible?
I will use the dataset available in R to demonstrate how to cut a tree into desired number of pieces. Result is a table.
Construct a hclust object.
hc <- hclust(dist(USArrests), "ave")
#plot(hc)
You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of cuts with the k parameter. See ?cutree and the use of paramter h which may be more useful to you (see cutree(hc, k = 2) == cutree(hc, h = 110)).
cutree(hc, k = 2)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 2 1 1 2
Hawaii Idaho Illinois Indiana Iowa
2 2 1 2 2
Kansas Kentucky Louisiana Maine Maryland
2 2 1 2 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 2 1 2
Montana Nebraska Nevada New Hampshire New Jersey
2 2 1 2 2
New Mexico New York North Carolina North Dakota Ohio
1 1 1 2 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 2 2 1
South Dakota Tennessee Texas Utah Vermont
2 2 2 2 2
Virginia Washington West Virginia Wisconsin Wyoming
2 2 2 2 2
lets say,
y<-dist(x)
clust<-hclust(y)
groups<-cutree(clust, k=3)
x<-cbind(x,groups)
now you will get for each record, the cluster group.
You can subset the dataset as well:
x1<- subset(x, groups==1)
x2<- subset(x, groups==2)
x3<- subset(x, groups==3)

Resources