Interactive Map Drill-down ability in R - r

I have a dataframe like the one below:
State<-c("Alabama","Alabama","Alaska","Alaska")
StateCode<-c("AL","AL","AK","AK")
County<-c("AUTAUGA","BALDWIN","ANCHORAGE","BETHEL")
CountyCode<-c("AL001","AL003","AK020","AK050")
Murder<-c(5,6,7,8)
d<-data.frame(State,StateCode,County,CountyCode, Num)
State StateCode County CountyCode Num
1 Alabama AL AUTAUGA AL001 5
2 Alabama AL BALDWIN AL003 6
3 Alaska AK ANCHORAGE AK020 7
4 Alaska AK BETHEL AK050 8
I have been searching for an option between R packages to create a drill-down map from State to County level out of this but I can't find a working example with code anywhere. Here is an example Any feedback on this?

Related

How to Merge Uneven Data Frames With Real Data

Problem:
I have two different size data sets that I would like to merge together. Without abandoning rows or inserting NA's. To compare this to a excel document situation you would have five columns and you would drago down 3 of them to populate the blank space left by the rows inserted by adding your data to the 4th and 5th column.
Example Data Set
zipcode = a, step3 = b in my later brainstorming code to solve my problem
> head(zipcode_joincsv)
zip city abv latitude longitude median mean pop
226 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
227 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
228 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
229 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
230 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
231 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
> head(step3_df)
tolower.state.name. state.abb
1 alabama AL
2 alaska AK
3 arizona AZ
4 arkansas AR
5 california CA
6 colorado CO
Desired Result:
One DF where each zipcode city combo is combined with their states pop and
income. A column in common they have is the abbreviation column.
tolower.state.name. zip city abv latitude longitude median mean pop
1 alabama 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
2 alabama 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
3 alabama 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
4 alabama 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
5 alabama 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
6 alabama 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
7 alaska data from these rows
8 arizona data from these rows
9 arkansas data from these rows
10 california data from these rows
11 colorado data from these rows
I've contemplated using something like
sqldf ("SELECT a.Zip, a.City, a.State Abv, a.Lat, a.Long, a.median, a.mean, a.pop, b.state.name, b.states.abb, b.pop, b.income
FROM a a
LEFT JOIN b b using (abv)")
I know that is probably not going to work if only that if it worked all the rows that there was not a matching set from A would input a NA where what I would like is that for every abv of NY the states average income and total population gets copied down the line. Than for every AR and every AL etc until the two data sets are one that a ggplot using all of the data can be created.
dplyr::left_join(a, b, by="abv") should work.

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge to datasets using two variables and keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that will happen, as when I do full_join in dplyr or all = TRUE in merge is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
We can give column names in merge by mentioning "by" in merge statement
merge(x,y, by=c(col1, col2 names))
in merge statement
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!

Maps, ggplot2, fill by state is missing certain areas on the map

I am working with maps and ggplot2 to visualize the number of certain crimes in each state for different years. The data set that I am working with was produced by the FBI and can be downloaded from their site or from here (if you don't want to download the dataset I don't blame you, but it is way too big to copy and paste into this question, and including a fraction of the data set wouldn't help, as there wouldn't be enough information to recreate the graph).
The problem is easier seen than described.
As you can see California is missing a large chunk as well as a few other states. Here is the code that produced this plot:
# load libraries
library(maps)
library(ggplot2)
# load data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
states <- map_data("state")
# merge data sets by region
fbi$region <- tolower(fbi$state)
fbimap <- merge(fbi, states, by="region")
# plot robbery numbers by state for year 2012
fbimap12 <- subset(fbimap, Year == 2012)
qplot(long, lat, geom="polygon", data=fbimap12,
facets=~Year, fill=Robbery, group=group)
This is what the states data looks like:
long lat group order region subregion
1 -87.46201 30.38968 1 1 alabama <NA>
2 -87.48493 30.37249 1 2 alabama <NA>
3 -87.52503 30.37249 1 3 alabama <NA>
4 -87.53076 30.33239 1 4 alabama <NA>
5 -87.57087 30.32665 1 5 alabama <NA>
6 -87.58806 30.32665 1 6 alabama <NA>
And this is what the fbi data looks like:
Year Population Violent Property Murder Forcible.Rape Robbery
1 1960 3266740 6097 33823 406 281 898
2 1961 3302000 5564 32541 427 252 630
3 1962 3358000 5283 35829 316 218 754
4 1963 3347000 6115 38521 340 192 828
5 1964 3407000 7260 46290 316 397 992
6 1965 3462000 6916 48215 395 367 992
Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft abbr state region
1 4512 11626 19344 2853 AL Alabama alabama
2 4255 11205 18801 2535 AL Alabama alabama
3 3995 11722 21306 2801 AL Alabama alabama
4 4755 12614 22874 3033 AL Alabama alabama
5 5555 15898 26713 3679 AL Alabama alabama
6 5162 16398 28115 3702 AL Alabama alabama
I then merged the two sets along region. The subset I am trying to plot is
region Year Robbery long lat group
8283 alabama 2012 5020 -87.46201 30.38968 1
8284 alabama 2012 5020 -87.48493 30.37249 1
8285 alabama 2012 5020 -87.95475 30.24644 1
8286 alabama 2012 5020 -88.00632 30.24071 1
8287 alabama 2012 5020 -88.01778 30.25217 1
8288 alabama 2012 5020 -87.52503 30.37249 1
... ... ... ...
Any ideas on how I can create this plot without those ugly missing spots?
I played with your code. One thing I can tell is that when you used merge something happened. I drew states map using geom_path and confirmed that there were a couple of weird lines which do not exist in the original map data. I, then, further investigated this case by playing with merge and inner_join. merge and inner_join are doing the same job here. However, I found a difference. When I used merge, order changed; the numbers were not in the right sequence. This was not the case with inner_join. You will see a bit of data with California below. Your approach was right. But merge somehow did not work in your favour. I am not sure why the function changed order, though.
library(dplyr)
### Call US map polygon
states <- map_data("state")
### Get crime data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
fbi$state <- tolower(fbi$state)
### Check if both files have identical state names: The answer is NO
### states$region does not have Alaska, Hawaii, and Washington D.C.
### fbi$state does not have District of Columbia.
setdiff(fbi$state, states$region)
#[1] "alaska" "hawaii" "washington d. c."
setdiff(states$region, fbi$state)
#[1] "district of columbia"
### Select data for 2012 and choose two columns (i.e., state and Robbery)
fbi2 <- fbi %>%
filter(Year == 2012) %>%
select(state, Robbery)
Now I created two data frames with merge and inner_join.
### Create two data frames with merge and inner_join
ana <- merge(fbi2, states, by.x = "state", by.y = "region")
bob <- inner_join(fbi2, states, by = c("state" ="region"))
ana %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -119.8685 38.90956 4 676 <NA>
#2 california 56521 -119.5706 38.69757 4 677 <NA>
#3 california 56521 -119.3299 38.53141 4 678 <NA>
#4 california 56521 -120.0060 42.00927 4 667 <NA>
#5 california 56521 -120.0060 41.20139 4 668 <NA>
bob %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -120.0060 42.00927 4 667 <NA>
#2 california 56521 -120.0060 41.20139 4 668 <NA>
#3 california 56521 -120.0060 39.70024 4 669 <NA>
#4 california 56521 -119.9946 39.44241 4 670 <NA>
#5 california 56521 -120.0060 39.31636 4 671 <NA>
ggplot(data = bob, aes(x = long, y = lat, fill = Robbery, group = group)) +
geom_polygon()
The problem is in the order of arguments to merge
fbimap <- merge(fbi, states, by="region")
has the thematic data first and the geo data second. Switching the order with
fbimap <- merge(states, fbi, by="region")
the polygons should all close up.

R-How to obtain relationships between cutree groups?

Hopefully title is not too badly worded. I have a tree that I used cutree to obtain groups from, but it is clear that the groups are not numbered left-to-right or right-to-left (I know the orientation within a branch doesn't matter so much, was hoping the grouping would be the same as the ordering in the hclust object). Is it possible to extract groups from a tree (using the height option of cutree) and know which of those groups are more related to one another? I walk through an example using USArrests below.
hc <- hclust(dist(USArrests), "ave")
plot(hc)
cutree(hc,h=60)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 3 1 4 2
Hawaii Idaho Illinois Indiana Iowa
3 3 1 3 3
Kansas Kentucky Louisiana Maine Maryland
3 3 1 3 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 3 1 2
Montana Nebraska Nevada New Hampshire New Jersey
3 3 1 3 2
New Mexico New York North Carolina North Dakota Ohio
1 1 4 3 3
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 3 2 1
South Dakota Tennessee Texas Utah Vermont
3 2 2 3 3
Virginia Washington West Virginia Wisconsin Wyoming
2 2 3 3 2
If you plot the tree it is clear that groups 1 and 4 are more related then groups 2 and 3 are more related. However when I just print the contents of each group there is no way to know what that relationship is. Is there a function or standard process I am missing? The real data I'm working with I split 36k values into 10 groups, so it would be tough to visually validate the relationships as I do with the example data, and want to code it as a script for future analyses. Thanks ahead of time.
I think you want to use
hc <- hclust(dist(USArrests), "ave")
cuthc <- cut(as.dendrogram(hc), h=60)
This will return a list with an $upper showing the tree above the cut, and a $lower element which is a list of each of the subtrees made from the cut. We can plot them with
layout(matrix(1:4, ncol=2))
sapply(1:4, function(i) plot(cuthc$lower[[i]]))
Then, if you want to extract the names and groups in the order they appear in the dendrograms, you can do
stack(setNames(Map(labels, cuthc$lower),seq_along(cuthc$lower)))
Here I use stack() and setNames() just to assign a unique ID to each element in the $lower list. stack() doesn't like it when the list isn't named

Clustering list for hclust function

Using plot(hclust(dist(x))) method, I was able to draw a cluster tree map. It works. Yet I would like to get a list of all clusters, not a tree diagram, because I have huge amount of data (like 150K nodes) and the plot gets messy.
In other words, lets say if a b c is a cluster and if d e f g is a cluster then I would like to get something like this:
1 a,b,c
2 d,e,f,g
Please note that this is not exactly what I want to get as an "output". It is just an example. I just would like to be able to get a list of clusters instead of a tree plot It could be vector, matrix or just simple numbers that show which groups elements belong to.
How is this possible?
I will use the dataset available in R to demonstrate how to cut a tree into desired number of pieces. Result is a table.
Construct a hclust object.
hc <- hclust(dist(USArrests), "ave")
#plot(hc)
You can now cut the tree into as many branches as you want. For my next trick, I will split the tree into two groups. You set the number of cuts with the k parameter. See ?cutree and the use of paramter h which may be more useful to you (see cutree(hc, k = 2) == cutree(hc, h = 110)).
cutree(hc, k = 2)
Alabama Alaska Arizona Arkansas California
1 1 1 2 1
Colorado Connecticut Delaware Florida Georgia
2 2 1 1 2
Hawaii Idaho Illinois Indiana Iowa
2 2 1 2 2
Kansas Kentucky Louisiana Maine Maryland
2 2 1 2 1
Massachusetts Michigan Minnesota Mississippi Missouri
2 1 2 1 2
Montana Nebraska Nevada New Hampshire New Jersey
2 2 1 2 2
New Mexico New York North Carolina North Dakota Ohio
1 1 1 2 2
Oklahoma Oregon Pennsylvania Rhode Island South Carolina
2 2 2 2 1
South Dakota Tennessee Texas Utah Vermont
2 2 2 2 2
Virginia Washington West Virginia Wisconsin Wyoming
2 2 2 2 2
lets say,
y<-dist(x)
clust<-hclust(y)
groups<-cutree(clust, k=3)
x<-cbind(x,groups)
now you will get for each record, the cluster group.
You can subset the dataset as well:
x1<- subset(x, groups==1)
x2<- subset(x, groups==2)
x3<- subset(x, groups==3)

Resources