R - Isolate clusters with specific characteristics in hclust

I've used hclust to generate a cluster dendrogram of some data, but I need to isolate all the paired clusters, i.e. all the clusters that comprise just 2 pieces of data (the first ones to be clustered together), even if they might be clustered with other data on a "higher" branch. Does anyone know how I can do that?
I've highlighted the clusters I want to isolate in the attached image, hopefully that explains it better.
I'd like to be able to isolate all the paired data in those clusters in such a way to be able to compare the clusters on their contents. For example to see which of them contain a particular type of data.

FWIW, you could extract the "forks" like this:
hc <- hclust(dist(USArrests), "ave")
plot(hc)
res <- list()
invisible(dendrapply(as.dendrogram(hc), function(x) {
if (attr(x, "members")==2)
if (all(sapply(x[1:2], is.leaf)))
res <<- c(res, list(c(attr(x[[1]], "label"), attr(x[[2]], "label"))))
x
}))
head( do.call(rbind, res) )
# [,1] [,2]
# [1,] "Florida" "North Carolina"
# [2,] "Arizona" "New Mexico"
# [3,] "Alabama" "Louisiana"
# [4,] "Illinois" "New York"
# [5,] "Michigan" "Nevada"
# [6,] "Mississippi" "South Carolina"
(just the first 6 rows of the result)
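An alternative sketch, in case walking the dendrogram is not needed: the merge component of an hclust object records single observations as negative indices, so any row of hc$merge with two negative entries is a cluster of exactly two original observations. The data frame layout and the target label below are illustrative assumptions, not part of the answer above.

```r
# Sketch: find two-leaf clusters directly from the hclust merge matrix.
hc <- hclust(dist(USArrests), "average")

# In hc$merge, a negative entry -i denotes original observation i,
# so a row with two negative entries joins two single observations.
is_pair <- hc$merge[, 1] < 0 & hc$merge[, 2] < 0
pair_idx <- -hc$merge[is_pair, , drop = FALSE]

# One row per pair, with the height at which the pair was joined.
pairs <- data.frame(
  first  = hc$labels[pair_idx[, 1]],
  second = hc$labels[pair_idx[, 2]],
  height = hc$height[is_pair],
  stringsAsFactors = FALSE
)

# Compare pairs on their contents, e.g. keep those containing a given label.
target <- "Florida"  # hypothetical label of interest
pairs[pairs$first == target | pairs$second == target, ]
```

Because the result is an ordinary data frame, the pairs can be filtered or joined against other tables to see which of them contain a particular type of data.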

Related

Reducing a data.tree created from List

I'm working on a shiny app which plots data trees. I'm looking to incorporate the shinyTree app to permit quick comparison of plotted nodes. The issue is that the shinyTree app returns a redundant list of lists of the sub node plot.
The actual list of list is included below. I would like to keep the longest branches only. I would also like to remove the id node (integer node), I'm struggling as to why it even shows up based on the list. I have tried many different methods to work with this list but it's been a real struggle. The list concept is difficult to understand.
I create the data.tree and plot via:
dataTree.a <- FromListSimple(checkList)
plot(dataTree.a)
> checkList
[[1]]
[[1]]$Asia
[[1]]$Asia$China
[[1]]$Asia$China$Beijing
[[1]]$Asia$China$Beijing$Round
[[1]]$Asia$China$Beijing$Round$`20383994`
[1] 0
[[2]]
[[2]]$Asia
[[2]]$Asia$China
[[2]]$Asia$China$Beijing
[[2]]$Asia$China$Beijing$Round
[1] 0
[[3]]
[[3]]$Asia
[[3]]$Asia$China
[[3]]$Asia$China$Beijing
[1] 0
[[4]]
[[4]]$Asia
[[4]]$Asia$China
[[4]]$Asia$China$Shanghai
[[4]]$Asia$China$Shanghai$Round
[[4]]$Asia$China$Shanghai$Round$`23740778`
[1] 0
[[5]]
[[5]]$Asia
[[5]]$Asia$China
[[5]]$Asia$China$Shanghai
[[5]]$Asia$China$Shanghai$Round
[1] 0
[[6]]
[[6]]$Asia
[[6]]$Asia$China
[[6]]$Asia$China$Shanghai
[1] 0
[[7]]
[[7]]$Asia
[[7]]$Asia$China
[1] 0
[[8]]
[[8]]$Asia
[[8]]$Asia$India
[[8]]$Asia$India$Delhi
[[8]]$Asia$India$Delhi$Round
[[8]]$Asia$India$Delhi$Round$`25703168`
[1] 0
[[9]]
[[9]]$Asia
[[9]]$Asia$India
[[9]]$Asia$India$Delhi
[[9]]$Asia$India$Delhi$Round
[1] 0
[[10]]
[[10]]$Asia
[[10]]$Asia$India
[[10]]$Asia$India$Delhi
[1] 0
[[11]]
[[11]]$Asia
[[11]]$Asia$India
[1] 0
[[12]]
[[12]]$Asia
[[12]]$Asia$Japan
[[12]]$Asia$Japan$Tokyo
[[12]]$Asia$Japan$Tokyo$Round
[[12]]$Asia$Japan$Tokyo$Round$`38001000`
[1] 0
[[13]]
[[13]]$Asia
[[13]]$Asia$Japan
[[13]]$Asia$Japan$Tokyo
[[13]]$Asia$Japan$Tokyo$Round
[1] 0
[[14]]
[[14]]$Asia
[[14]]$Asia$Japan
[[14]]$Asia$Japan$Tokyo
[1] 0
[[15]]
[[15]]$Asia
[[15]]$Asia$Japan
[1] 0
[[16]]
[[16]]$Asia
[1] 0
Well, I did cobble together a rough hack to make this work. Here is what I did to the 'checkList' list:
checkList <- get_selected(tree, format = "slices")
# Convert and collapse shinyTree slices to data.tree
# This is a bit of a kludge to make the graphic work with
# shinyTree; an alternative one-liner is in the works
# This transform works by finding the longest branches
# and only plotting them since the other branches are
# subsets due to the slices.
# Extract the checkList name (as characters) from the checkList
tmp <- names(unlist(checkList))
# Determine the length of the individual checkList Names
lens <- lapply(tmp, function(x) length(strsplit(x, ".", fixed=TRUE)[[1]]))
# Find the elements with the highest length returns a list of high vals
lens.max <- which(lens == max(sapply(lens, max)))
# Replace all '.' with '/' prepping for DataFrameTable conversion
# (str_replace_all() comes from the stringr package)
tmp <- relist(str_replace_all(tmp, "\\.", "/"), skeleton=tmp)
# Add a root node to work with multiple branches
tmp <- unlist(lapply(tmp, function(x) paste0("Root/", x)))
# Create a list of only the longest branches
longBranches <- as.list(tmp[lens.max])
# Convert the list into a data.frame for convert
longBranches.df <- data.frame(pathString = do.call(rbind, longBranches))
# Publish the data.frame for use
vals$selDF <- longBranches.df
#save(checkList, file = "chkLists.RData") # Save for troubleshooting
print(vals$selDF)
The new checkList looks like this:
[1] "Root/Europe/France/Paris/Round/10843285" "Root/Europe/France/Paris/Round"
[3] "Root/Europe/France/Paris" "Root/Europe/France"
[5] "Root/Europe/Germany/Berlin/Diamond/3563194" "Root/Europe/Germany/Berlin/Diamond"
[7] "Root/Europe/Germany/Berlin/Round/3563194" "Root/Europe/Germany/Berlin/Round"
[9] "Root/Europe/Germany/Berlin" "Root/Europe/Germany"
[11] "Root/Europe/Italy/Rome/Round/3717956" "Root/Europe/Italy/Rome/Round"
[13] "Root/Europe/Italy/Rome" "Root/Europe/Italy"
[15] "Root/Europe/United Kingdom/London/Round/10313307" "Root/Europe/United Kingdom/London/Round"
[17] "Root/Europe/United Kingdom/London" "Root/Europe/United Kingdom"
[19] "Root/Europe"
It works :) ... but I think this could be done in two lines. I'll work on it again in a week or so. Any other ideas would be appreciated.
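One possible shorter route, sketched in base R under the assumption that the slices have already been flattened to slash-separated path strings: keep only the paths that are not continued by any other path. The paths vector below is a made-up sample shaped like the names built above, and dropping the id with dirname() is likewise an assumption about what "remove the id node" should mean.

```r
# Hypothetical slash-separated paths, shaped like the names built above.
paths <- c(
  "Asia/China/Beijing/Round/20383994",
  "Asia/China/Beijing/Round",
  "Asia/China/Beijing",
  "Asia/China/Shanghai/Round/23740778",
  "Asia/China/Shanghai/Round",
  "Asia/China/Shanghai",
  "Asia/China",
  "Asia"
)

# A path is redundant if some other path continues it with "/...".
is_prefix <- vapply(
  paths,
  function(p) any(startsWith(paths, paste0(p, "/"))),
  logical(1)
)
longest <- unname(paths[!is_prefix])

# Drop the trailing id component from each remaining path.
without_id <- dirname(longest)
```

Unlike filtering on the maximum depth, this keeps every branch that ends at a leaf, so branches of different lengths survive.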

Extract address components from coordinates

I'm trying to reverse geocode with R. I first used ggmap but couldn't get it to work with my API key. Now I'm trying it with googleway.
newframe[,c("Front.lat","Front.long")]
Front.lat Front.long
1 -37.82681 144.9592
2 -37.82681 145.9592
newframe$address <- apply(newframe, 1, function(x){
google_reverse_geocode(location = as.numeric(c(x["Front.lat"],
x["Front.long"])),
key = "xxxx")
})
This extracts the variables as a list but I can't figure out the structure.
I'm struggling to figure out how to extract the address components listed below as variables in newframe
postal_code, administrative_area_level_1, administrative_area_level_2, locality, route, street_number
I would prefer each address component as a separate variable.
Google's API returns the response as JSON, which, when translated into R, naturally forms nested lists. Internally, googleway does this through jsonlite::fromJSON().
In googleway I've given you the choice of returning the raw JSON or a list, through using the simplify argument.
I've deliberately returned ALL the data from Google's response and left it up to the user to extract the elements they're interested in through usual list-subsetting operations.
Having said all that, in the development version of googleway I've written a few functions to help accessing elements of various API calls. Here are three of them that may be useful to you
## Install the development version
# devtools::install_github("SymbolixAU/googleway")
res <- google_reverse_geocode(
location = c(df[1, 'Front.lat'], df[1, 'Front.long']),
key = apiKey
)
geocode_address(res)
# [1] "45 Clarke St, Southbank VIC 3006, Australia"
# [2] "Bank Apartments, 275-283 City Rd, Southbank VIC 3006, Australia"
# [3] "Southbank VIC 3006, Australia"
# [4] "Melbourne VIC, Australia"
# [5] "South Wharf VIC 3006, Australia"
# [6] "Melbourne, VIC, Australia"
# [7] "CBD & South Melbourne, VIC, Australia"
# [8] "Melbourne Metropolitan Area, VIC, Australia"
# [9] "Victoria, Australia"
# [10] "Australia"
geocode_address_components(res)
# long_name short_name types
# 1 45 45 street_number
# 2 Clarke Street Clarke St route
# 3 Southbank Southbank locality, political
# 4 Melbourne City Melbourne administrative_area_level_2, political
# 5 Victoria VIC administrative_area_level_1, political
# 6 Australia AU country, political
# 7 3006 3006 postal_code
geocode_type(res)
# [[1]]
# [1] "street_address"
#
# [[2]]
# [1] "establishment" "general_contractor" "point_of_interest"
#
# [[3]]
# [1] "locality" "political"
#
# [[4]]
# [1] "colloquial_area" "locality" "political"
After reverse geocoding into newframe$address the address components could be extracted further as follows:
# Make a logical vector of the valid ("OK" status) responses (other statuses may be "ZERO_RESULTS", "REQUEST_DENIED" etc).
sel <- sapply(c(1: nrow(newframe)), function(x){
newframe$address[[x]]$status == 'OK'
})
# Get the address_components of the first result (i.e. best match) returned per geocoded coordinate.
# Index with which(sel) so that rows with failed requests are skipped rather than misaligned.
address.components <- sapply(which(sel), function(x){
newframe$address[[x]]$results[1,]$address_components
})
# Get all possible component types.
all.types <- unique(unlist(sapply(c(1: length(address.components)), function(x){
unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
})))
# Get "long_name" values of the address_components for each type present (the other option is "short_name").
all.values <- lapply(c(1: length(address.components)), function(x){
types <- unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
matches <- match(all.types, types)
values <- address.components[[x]]$long_name[matches]
})
# Bind results into a dataframe.
all.values <- do.call("rbind", all.values)
all.values <- as.data.frame(all.values)
names(all.values) <- all.types
# Add columns and update original data frame.
newframe[, all.types] <- NA
newframe[sel,][, all.types] <- all.values
Note that I've only kept the first type given per component, effectively skipping the "political" type as it appears in multiple components and is likely superfluous e.g. "administrative_area_level_1, political".
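The matching step above can be tried without an API key. The sketch below uses a hand-built stand-in for one result's address_components, with values copied from the sample output earlier in this thread and the types column assumed to be a list of character vectors. components_to_row() is a hypothetical helper, not a googleway function.

```r
# Hand-built stand-in for one result's address_components (values copied
# from the sample output earlier in this thread; not a live API response).
components <- data.frame(
  long_name = c("45", "Clarke Street", "Southbank", "Melbourne City",
                "Victoria", "Australia", "3006"),
  stringsAsFactors = FALSE
)
components$types <- list(
  "street_number", "route", c("locality", "political"),
  c("administrative_area_level_2", "political"),
  c("administrative_area_level_1", "political"),
  c("country", "political"), "postal_code"
)

wanted <- c("postal_code", "administrative_area_level_1",
            "administrative_area_level_2", "locality", "route",
            "street_number")

# Hypothetical helper: one-row data frame with one column per wanted type.
components_to_row <- function(components, wanted) {
  # Keep only the first type of each component, as in the answer above.
  first_type <- vapply(components$types, function(t) t[[1]], character(1))
  # match() yields NA for absent types, so missing components become NA.
  values <- components$long_name[match(wanted, first_type)]
  as.data.frame(as.list(setNames(values, wanted)), stringsAsFactors = FALSE)
}

full <- components_to_row(components, wanted)
partial <- components_to_row(components[-1, ], wanted)  # no street_number
rbind(full, partial)
```

Since every row has the same columns, addresses that lack a street number or route still bind cleanly, with NA in the missing cells.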
You can use ggmap::revgeocode easily; see below:
library(ggmap)
df <- cbind(df,do.call(rbind,
lapply(1:nrow(df),
function(i)
revgeocode(as.numeric(
df[i,2:1]), output = "more")
[c("administrative_area_level_1","locality","postal_code","address")])))
#output:
df
# Front.lat Front.long administrative_area_level_1 locality
# 1 -37.82681 144.9592 Victoria Southbank
# 2 -37.82681 145.9592 Victoria Noojee
# postal_code address
# 1 3006 45 Clarke St, Southbank VIC 3006, Australia
# 2 3833 Cec Dunns Track, Noojee VIC 3833, Australia
You can add "route" and "street_number" to the variables that you want to extract but as you can see the second address does not have street number and that will cause an error.
Note: You may also use sub and extract the information from the address.
Data:
df <- structure(list(Front.lat = c(-37.82681, -37.82681), Front.long =
c(144.9592, 145.9592)), .Names = c("Front.lat", "Front.long"), class = "data.frame",
row.names = c(NA, -2L))

Creating a table with scraped CSV data in R

I have the following name_total = matrix(nrow = 51, ncol=3, NA), where each row corresponds to a state (51 being District of Columbia). The first column is a string giving the name of the state (for example: name_total[1,1]= "Alabama").
The second and third are urls of CSV files from the Census, respectively linking counties with the state senate districts, and counties with state house districts.
For Alabama:
name_total[1,2] ="http://www2.census.gov/geo/relfiles/cdsld13/01/co_lu_delim_01.txt"
name_total[1,3] ="http://www2.census.gov/geo/relfiles/cdsld13/01/co_ll_delim_01.txt"
I wish to get as a final output a table which would basically be all 50 states + DC with their respective counties and linked Senate and House districts. I don't know if that's very clear so here is an example:
[,1] [,2] [,3] [,4]
[1,] "Alabama" "countyX1" "Senate District Y1" "House District Z1"
[2,] "Alabama" "countyX2" "Senate District Y2" "House District Z2"
[3,] "Alabama" "countyX3" "Senate District Y3" "House District Z3"
[4,] "Alaska" "countyX4" "Senate District Y4" "House District Z4"
[5,] "Alaska" "countyX5" "Senate District Y4" "House District Z5"
I use a for loop:
for (i in 1:51){
senate= name_total[i,2]
link_senate = url(senate)
house= name_total[i,3]
link_house = url(house)
state=name_total[i,1]
data_senate= read.csv2(link_senate,sep=",",header=TRUE, skip=1)
data_house= read.csv2(link_house,sep=",",header=TRUE, skip=1)
final=cbind(state, data_senate, data_house)
}
Of course each element has a different number of rows, for Alabama (i=1) State returns "Alabama" once, the others returning respectively 3 by 122 and 3 by 207 matrices. I get an error message about these variations in the number of rows.
I'm pretty sure one of the issues is the use of the cbind function, but I do not know what to use to get a better result.
In case others have similar issues, I found a way to get what I wanted separately for State Senates and State Houses. First of all, some of the states only have one of the two, and the link for Oregon was down. Personally, I took them out of my original data.
Then I initialized for the first state outside of the loop:
senate = url(name_total[1,2])
data_senate= read.csv2(senate,sep=",",header=TRUE, skip=1)
assign(paste("Base_senate_",name_total[1,1],sep=""),data_senate)
A = assign(paste("Base_senate_",name_total[1,1],sep=""),data_senate)
house= url(name_total[1,3])
data_house= read.csv2(house,sep=",",header=TRUE, skip=1)
assign(paste("Base_house_",name_total[1,1],sep=""),data_house)
B = assign(paste("Base_house_",name_total[1,1],sep=""),data_house)
and then I used a for loop:
for (i in 2:48){
senate = url(name_total[i,2])
house= url(name_total[i,3])
data_senate= read.csv2(senate,sep=",",header=TRUE, skip=1)
assign(paste("Base_senate_",name_total[i,1],sep=""),data_senate)
names(data_senate)[2] = "County"
A = rbind(A,assign(paste("Base_senate_",name_total[i,1],sep=""),data_senate))
data_house= read.csv2(house,sep=",",header=TRUE, skip=1)
assign(paste("Base_house_",name_total[i,1],sep=""),data_house)
names(data_house)[2] = "County"
B = rbind(B,assign(paste("Base_house_",name_total[i,1],sep=""),data_house))
}
A and B give you the expected tables (without the string name of the State, but the first variable identifies the state).
I had to use the names(data_senate)[2] = "County" because the second column had a different name for some states.
Hope it helps!
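A more compact variant of the same idea, as a sketch: read each file in a function, attach the state name as a column, and bind everything once at the end instead of growing a table inside the loop. The two temporary files below are made up so the example runs offline; the real Census layout is only assumed to match the skip = 1 and header = TRUE settings used above. read_state() and sources are hypothetical names.

```r
# Two made-up files standing in for the Census downloads: one title line
# (hence skip = 1), then a header whose second column name varies.
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("Demo file", "STATE,COUNTY,SLDU", "1,1,30", "1,3,32"), f1)
writeLines(c("Demo file", "STATE,COUNTYFP,SLDU", "2,13,A"), f2)

sources <- data.frame(state = c("Alabama", "Alaska"),
                      file  = c(f1, f2),
                      stringsAsFactors = FALSE)

# Hypothetical helper: read one file and tag each row with its state.
read_state <- function(state, file) {
  # Read everything as character so columns bind without type surprises.
  d <- read.csv(file, header = TRUE, skip = 1, colClasses = "character")
  names(d)[2] <- "County"  # harmonise the column whose name differs
  data.frame(State = state, d, stringsAsFactors = FALSE)
}

# Read every file, then bind all the pieces in a single step.
combined <- do.call(rbind, unname(Map(read_state, sources$state, sources$file)))
rownames(combined) <- NULL
combined
```

read.csv() also accepts a URL in place of a file path, so the URL columns of name_total could be passed the same way. The senate and house tables would still need to be built separately, since they have different numbers of rows per state.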

R: How to identify a road type using GPS?

I have the GPS coordinates of several points and I want to know if they are on a highway, a trunk road, or a minor road. It would be even better if I could identify the road name. I'm using R leaflet to draw maps, and I can see with OpenStreetMap that different types of roads are colored differently, so I wonder how I can extract this information. It's not a problem to use Google Maps instead if that will solve my problem.
I would appreciate any help.
You can use revgeocode() from ggmap:
library(ggmap)
gc <- c(-73.596706, 45.485501)
revgeocode(gc)
Which gives:
#[1] "4333 Rue Sherbrooke O, Westmount, QC H3Z 1E2, Canada"
Note: As mentioned in the comments, this method uses the Google Maps API, not OpenStreetMap. You have a limit of 2500 queries per day. You can always check how many queries you have left using geocodeQueryCheck().
From the package documentation:
reverse geocodes a longitude/latitude location using Google Maps. Note
that in most cases by using this function you are agreeing to the
Google Maps API Terms of Service at
https://developers.google.com/maps/terms.
Update
If you need more detailed information, use output = "all" and extract the components you need:
lst <- list(
g1 = c(-73.681069, 41.433155),
g2 = c(-73.643196, 41.416240),
g3 = c(-73.653324, 41.464168)
)
res <- lapply(lst, function(x) revgeocode(x, output = "all")[[1]][[1]][[1]][[2]])
Which gives:
#$g1
#$g1$long_name
#[1] "Highway 52"
#
#$g1$short_name
#[1] "NY-52"
#
#$g1$types
#[1] "route"
#
#
#$g2
#$g2$long_name
#[1] "Carmel Avenue"
#
#$g2$short_name
#[1] "US-6"
#
#$g2$types
#[1] "route"
#
#
#$g3
#$g3$long_name
#[1] "Wakefield Road"
#
#$g3$short_name
#[1] "Wakefield Rd"
#
#$g3$types
#[1] "route"
Using Google's API it's not possible to identify the type of road (yet - they may introduce that capability in the future).
But you can use their Roads API to get the road details for a given set of coordinates.
I've written the googleway package that accesses the roads API through the functions google_snapToRoads() and google_nearestRoads(), and if you have a premium account you can use google_speedLimits()
In all calls to Google's API you need a Google API key enabled on each API you are using.
library(googleway)
df_points <- data.frame(lat = c(60.1707, 60.172, 60.192),
lon = c(24.9426, 24.86, 24.89))
## plot the points on a map
google_map(key = map_key) %>%
add_markers(df_points)
nearRoads <- google_nearestRoads(df_points, key = api_key)
nearRoads
# $snappedPoints
# location.latitude location.longitude originalIndex placeId
# 1 60.17070 24.94272 0 ChIJNX9BrM0LkkYRIM-cQg265e8
# 2 60.17229 24.86028 1 ChIJpf7azXMKkkYRsk5L-U5W4ZQ
# 3 60.17229 24.86028 1 ChIJpf7azXMKkkYRs05L-U5W4ZQ
# 4 60.19165 24.88997 2 ChIJN1s1vhwKkkYRKGm4l5KmISI
# 5 60.19165 24.88997 2 ChIJN1s1vhwKkkYRKWm4l5KmISI
In these results, the originalIndex value tells you which of the original df_points rows the value refers to (where 0 == the first row of df_points, 1 == the second row of df_points, and so on).
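Because originalIndex is 0-based and R rows are 1-based, joining the snapped points back to the input needs an offset of one. A minimal sketch, using a simplified, hand-copied stand-in for nearRoads$snappedPoints rather than a live API response (the real object has more columns):

```r
df_points <- data.frame(lat = c(60.1707, 60.172, 60.192),
                        lon = c(24.9426, 24.86, 24.89))

# Simplified, hand-copied stand-in for nearRoads$snappedPoints.
snapped <- data.frame(
  originalIndex = c(0, 1, 1, 2, 2),
  placeId = c("ChIJNX9BrM0LkkYRIM-cQg265e8", "ChIJpf7azXMKkkYRsk5L-U5W4ZQ",
              "ChIJpf7azXMKkkYRs05L-U5W4ZQ", "ChIJN1s1vhwKkkYRKGm4l5KmISI",
              "ChIJN1s1vhwKkkYRKWm4l5KmISI"),
  stringsAsFactors = FALSE
)

# originalIndex is 0-based; R row numbers are 1-based.
snapped$row <- snapped$originalIndex + 1

# Attach the original coordinates to every snapped point.
matched <- data.frame(df_points[snapped$row, ],
                      placeId = snapped$placeId,
                      stringsAsFactors = FALSE)
rownames(matched) <- NULL

# All place ids found for each input point.
ids_per_point <- split(snapped$placeId, snapped$row)
```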
The placeId value is Google's unique key that identifies each place in their database. So you can then use Google's Places API to get the information about those places
roadDetails <- lapply(nearRoads$snappedPoints$placeId, function(x){
google_place_details(place_id = x, key = api_key)
})
## road address
lapply(roadDetails, function(x){
x[['result']][['formatted_address']]
})
# [[1]]
# [1] "Rautatientori, 00100 Helsinki, Finland"
#
# [[2]]
# [1] "Svedjeplogsstigen 7-9, 00340 Helsingfors, Finland"
#
# [[3]]
# [1] "Svedjeplogsstigen 18-10, 00340 Helsingfors, Finland"
#
# [[4]]
# [1] "Meilahdentie, 00250 Helsinki, Finland"
#
# [[5]]
# [1] "Meilahdentie, 00250 Helsinki, Finland"

How to access attributes of a dendrogram in R

From a dendrogram which I created with
hc<-hclust(kk)
hcd<-as.dendrogram(hc)
I picked a sub-branch
k=hcd[[2]][[2]][[2]][[2]][[2]][[2]][[2]][1]
When I simply have k displayed, this gives:
> k
[[1]]
[[1]][[1]]
[1] 243
attr(,"label")
[1] "NAfrica_002"
attr(,"members")
[1] 1
attr(,"height")
[1] 0
attr(,"leaf")
[1] TRUE
[[1]][[2]]
[1] 257
attr(,"label")
[1] "NAfrica_016"
attr(,"members")
[1] 1
attr(,"height")
[1] 0
attr(,"leaf")
[1] TRUE
attr(,"members")
[1] 2
attr(,"midpoint")
[1] 0.5
attr(,"height")
[1] 37
How can I access, for example, the "midpoint" attribute, or the second of the "label" attributes?
(I hope I'm using the correct terminology here.)
I have tried things like
k$midpoint
attr(k,"midpoint")
but both returned 'NULL'.
A second question, sorry: how could I add a "label" attribute after the "midpoint" attribute?
Your k is still buried one layer too deep. The attributes have been set on the first element of the list k.
attributes(k[[1]]) # Display attributes
attributes(k[[1]])$label # Access attributes
attributes(k[[1]])$label <- 'new' # Change attribute
Alternatively, you can use attr:
attr(k[[1]],'label') # Display attribute
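The extra layer comes from the single bracket at the end of the subsetting chain: [ returns a plain list wrapped around the node, while [[ returns the node itself. A small sketch on a built-in data set (the positions dend[[2]][[1]] and the "my_pair" label are specific to this example, not to the asker's data):

```r
dend <- as.dendrogram(hclust(dist(USArrests[1:5, ])))

pair    <- dend[[2]][[1]]  # `[[` returns the node itself
wrapped <- dend[[2]][1]    # `[` returns a list of length one around it

attr(wrapped, "midpoint")       # NULL: the wrapper carries no node attributes
attr(wrapped[[1]], "midpoint")  # same value as attr(pair, "midpoint")
attr(pair[[2]], "label")        # label of the second leaf in the pair

# Add (or overwrite) an attribute on the node.
attr(pair, "label") <- "my_pair"
names(attributes(pair))         # "label" is now among the attribute names
```

Attributes are looked up by name, so the position at which a new attribute is stored should not matter for code that reads it with attr().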
You can change parameters manually as in the previous answer. The problem with this is that it is not efficient to do manually when you want to do it many times. Also, while it is easy to change parameters - that change may not be reflected in any other function, since they won't implement any action based on that change (it must be programmed).
For your specific question - it generally depends on which attribute we want to view. For "midpoint", use the get_nodes_attr function, with the "midpoint" parameter - from the dendextend package.
# install.packages("dendextend")
library(dendextend)
dend <- as.dendrogram(hclust(dist(USArrests[1:5,])))
# Like:
# dend <- USArrests[1:5,] %>% dist %>% hclust %>% as.dendrogram
# midpoint for all nodes
get_nodes_attr(dend, "midpoint")
And you get this:
[1] 1.25 NA 1.50 0.50 NA NA 0.50 NA NA
To also change an attribute, you can use the various assign functions from the package: assign_values_to_leaves_nodePar, assign_values_to_leaves_edgePar, assign_values_to_nodes_nodePar, assign_values_to_branches_edgePar, remove_branches_edgePar, remove_nodes_nodePar
If all you want is to change the labels, the following ability from the package would solve your question:
> labels(dend)
[1] "Arkansas" "Arizona" "California" "Alabama" "Alaska"
> labels(dend) <- 1:5
> labels(dend)
[1] 1 2 3 4 5
For more details on the package, you can have a look at its vignette.
