Assigning colors to table values using Choroplethr - choroplethr

I'm trying to use choroplethr to make a map at the county level. My csv has a value column containing categorical integer codes (0, 1, 2, 3) that vary by county, and a region column containing county FIPS codes.
I want to display each value with the following label and color (value = label = color):
0 = "None" = "white", 1 = "MD" = "#64acbe", 2 = "DO" = "#c85a5a", 3 = "Both" = "#574249"
I've tried several combinations of scale_fill_brewer without getting the results I'm looking for. Any assistance would be great. Here's code that simulates the data I'm using:
library(choroplethr)
library(ggplot2)
library(choroplethrMaps)
Res <- data.frame(
  region = c(45001, 22001, 51001, 16001, 19001, 21001, 29001, 40001, 8001, 19003,
             16003, 17001, 18001, 28001, 38001, 31001, 39001, 42001, 53001, 55001,
             50001, 72001, 72003, 72005, 72007, 72009, 45003, 27001),
  value = c(0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3),
  stringsAsFactors = FALSE)

county_choropleth(Res,
                  title = "All United States Medical Residencies",
                  legend = "Types of Medical Residencies")

Thank you for using Choroplethr.
I think there are a few issues here. The first one I'd like to address is that your value column contains numeric data. That by itself is not a problem, but because you are actually using it to encode categorical data (i.e. "MD", "DO", etc.), it becomes one. My first task will therefore be to change the data type from numeric to character:
> class(Res$value)
[1] "numeric"
> Res$value = as.character(Res$value)
> class(Res$value)
[1] "character"
Now I will replace the "numbers" with the category names that you want:
> Res[Res=="0"] = "None"
> Res[Res=="1"] = "MD"
> Res[Res=="2"] = "DO"
> Res[Res=="3"] = "Both"
> head(Res)
region value
1 45001 None
2 22001 MD
3 51001 DO
4 16001 Both
5 19001 None
6 21001 MD
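As an aside, Res[Res == "0"] = "None" scans every column of the data frame for matches; it works here because no region code equals "0" as a string, but a sketch that restricts the recoding to the value column, using a named lookup vector (a common base-R idiom), is a little safer. The four-county data frame below is just a cut-down stand-in for the example data:

```r
# Recode only the value column via a named lookup vector (base R)
Res <- data.frame(
  region = c(45001, 22001, 51001, 16001),
  value  = as.character(c(0, 1, 2, 3))
)
lookup <- c("0" = "None", "1" = "MD", "2" = "DO", "3" = "Both")
Res$value <- unname(lookup[Res$value])  # index by name, drop the names
head(Res)
```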
Now for the second issue. You said that you were trying to use scale_fill_brewer. That function applies the Brewer palettes, but you don't want those here: you have your own scale, so you want scale_fill_manual.
county_choropleth(Res) +
  scale_fill_manual(values = c("None" = "white",
                               "MD" = "#64acbe",
                               "DO" = "#c85a5a",
                               "Both" = "#574249"),
                    name = "Types of Medical Residencies")
Note: what choroplethr calls the "legend" is really a property of the ggplot2 scale: in particular, the scale's name. So if you are using your own scale, you cannot use choroplethr's legend parameter any more.
Of course, now we have a new problem: Alaska and Hawaii are rendered all black. I had actually forgotten about this issue (it's been a while since I worked on choroplethr). The reason it happens is fairly technical, and perhaps more detailed than you care for, but I will mention it here for completeness: choroplethr uses ggplot2 annotations to render AK and HI in the proper place, and the choroplethr + ggplot_scale paradigm fails for AK and HI because ggplot2 does not propagate additional layers/scales to annotations. To get around this we must use choroplethr's object-oriented interface:
c = CountyChoropleth$new(Res)
c$title = "All United States Medical Residencies"
c$ggplot_scale = scale_fill_manual(
  values = c("None" = "white", "MD" = "#64acbe", "DO" = "#c85a5a", "Both" = "#574249"),
  name = "Types of Medical Residencies")
c$render()

Related

Need to create a bivariate choropleth map from latitude/longitude and two variables

My manager asked me to create a bivariate choropleth map in R from a csv file that contains latitude/longitude data and two variables. I've tried to use this tutorial and this stack overflow post but have been unsuccessful – the plot comes up completely empty.
This is an example of what my manager is looking for: https://jech.bmj.com/content/jech/75/6/496/F1.large.jpg
I’ve tried to use this tutorial and this stack overflow post but have been unsuccessful – the plot comes up completely empty.
Below is a mini reproducible version of the data.
df <- data.frame(Region = c(1001, 1003, 1005, 1007),
ID = c(5, 6, 7, 8),
latitude = c(32.53953, 30.72775, 31.86826, 32.99642),
longitude = c(-86.64408, -87.72207, -85.38713, -87.12511),
variable_1 = c(0.3465, 0.3028, 0.4168, 0.3866),
variable_2 = c(0.44334, 0.45972, 0.46996, 0.44406))
I am not well-versed in mapping (or in R, frankly) so I would be deeply appreciative of any help this community could provide. Even understanding what additional data I need to create a bivalent plot would be really helpful.
Thank you and please let me know of any additional info I could provide!
Here is how you can achieve such a choropleth map. First, you need to load/install the necessary packages:
library(biscale)
library(ggplot2)
library(cowplot)
library(sf)
library(dplyr)
Then you need to compute the bi_class of the two variables, which assigns each observation a combined group (low/medium/high on each variable):
df = bi_class(df, x= variable_1, y=variable_2, style="quantile", dim = 3)
As per the documentation of the package, you can change the dim argument to create a 2x2 or 4x4 matrix.
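For intuition, the quantile version of bi_class() can be sketched in base R: cut each variable into dim quantile bins, then paste the two bin indices together as "x-y". This is a rough illustration of the idea, not necessarily the package's exact implementation:

```r
# Rough base-R illustration of bi_class(style = "quantile", dim = 3):
# each variable is cut into `dim` quantile bins and the two bin
# indices are pasted as "x-y".
variable_1 <- c(0.3465, 0.3028, 0.4168, 0.3866)
variable_2 <- c(0.44334, 0.45972, 0.46996, 0.44406)
bin <- function(v, dim = 3) {
  cut(v, breaks = quantile(v, probs = seq(0, 1, length.out = dim + 1)),
      include.lowest = TRUE, labels = FALSE)
}
classes <- paste(bin(variable_1), bin(variable_2), sep = "-")
classes
```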
From what I saw in your data, you are looking at counties in Alabama. For these you can use the tigris package (which is not limited to counties):
Al_county <- tigris::counties(state = "Alabama", cb = TRUE) %>% st_as_sf()
Finally, you can merge your data frame into the imported one, matching GEOID with Region. Make sure to add a 0 in front of your Region values, since it is missing:
GEOID (in imported data)   Region (in your df)
01001                      1001
01003                      1003
01005                      1005
01007                      1007
df$Region = paste0("0", df$Region) # Add 0 in front of Region values
Al_county = Al_county %>% left_join(df, by= c("GEOID"= "Region")) # Join the 2 data frames
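One caveat on the padding step: paste0("0", df$Region) assumes every Region value is missing exactly one leading zero. If the data could ever mix 4- and 5-digit FIPS codes, a fixed-width pad with base R's sprintf is safer (the region values below are hypothetical):

```r
# Pad county FIPS codes to a fixed 5-digit width only when needed,
# instead of unconditionally prepending a "0":
region <- c(1001, 1003, 45001)   # hypothetical mix of 4- and 5-digit FIPS
padded <- sprintf("%05d", as.integer(region))
padded  # "01001" "01003" "45001"
```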
Now the data is ready to be plotted, and you can follow the documentation from here:
map = ggplot() +
  geom_sf(data = Al_county, aes(fill = bi_class)) +
  bi_scale_fill(pal = "GrPink", dim = 3) +
  labs(subtitle = "Var 1 and Var 2 in Alabama") +
  bi_theme() +
  theme(legend.position = "none")

legend <- bi_legend(pal = "GrPink", dim = 3,
                    xlab = "Higher Var 1", ylab = "Higher Var 2", size = 8)

finalPlot <- ggdraw() +
  draw_plot(map, 0, 0, 1, 1) +
  draw_plot(legend, 0.6, 0.7, 0.4, 0.15)
finalPlot

How can I merge two datasets on a common variable? None of the _join functions is working

I have a dataset with route_ID and 100 other variables, and I need to add the flight length to each observation. So I created another dataset with two variables: route_ID and the flight distance. The original dataset has 125k observations, whereas the distances one has 13k, as I eliminated duplicate values. Moreover, the original dataset has 4k different routes, whereas the distance one has 11k, which should ensure that all 4k routes that need matching are among those 11k.
This is what dput(head(airplanes)) produces
structure(list(ap_id = c("15304 12478", "12478 15304", "15304 12953",
"13303 12478", "13303 12953", "14986 12478"), ORIGIN_AIRPORT_ID = c(15304L,
12478L, 15304L, 13303L, 13303L, 14986L), DEST_AIRPORT_ID = c(12478L,
15304L, 12953L, 12478L, 12953L, 12478L), distance = c(1005L,
1005L, 1010L, 1089L, 1096L, 1041L)), row.names = c(NA, 6L), class = "data.frame")
This is what dput(head(distance)) produces
structure(list(apc_id = structure(c("10135 10397 DL", "10135 10397 DL",
"10135 10397 FL", "10135 10397 FL", "10135 11057 US", "10135 11057 US"
), label = "Route-carrier unique identifier", format.stata = "%16s"),
rcid = structure(c(1, 1, 2, 2, 6, 6), label = "Route-carrier unique identifier", format.stata = "%14.0g", labels = c(`10135 10397 DL` = 1,
`10135 10397 FL` = 2, `10135 10721 FL` = 3, `10135 10821 TW` = 4,
`10135 11042 RU` = 5, `10135 11057 US` = 6, `10135 11193 DL` = 7,
This continues for A LOT of rows, giving each route code and the number of times it appears. I'm not sure why the two dput calls give such different output.
I was sure that left_join was the way to go, as it is supposed to keep the rows of the original dataset and add the values from the second one only where they match. This was my code:
left_join(airplanes, distance, by = "route_ID")
However, this generates a new dataset with more than 600k observations. The weird thing is also that many variables, such as ticket_price, have zero NAs in the new dataset, and their percentiles are almost exactly the same as in the original dataset.
If helpful, this is the dropbox link to the OG dataset
https://www.dropbox.com/s/t9ptw6a9tuuh4tg/DB1B-T100-MSA-F41%20Final%20V2.dta?dl=0
And this is the one to the distances dataset
https://www.dropbox.com/s/m0k2i1vfqfnb7o7/distance_2.csv?dl=0
Does anyone know what is happening here and how I could match them as I wish?
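No answer is recorded for this question here, but the jump from 125k to 600k rows is the classic signature of duplicate keys in the lookup table: every duplicate match multiplies the joined rows. A minimal base-R sketch (with made-up route_ID values) reproducing the inflation:

```r
# Toy reproduction of join row-inflation with duplicate lookup keys.
airplanes <- data.frame(route_ID = c("A", "B"), ticket_price = c(100, 200))
distance  <- data.frame(route_ID = c("A", "A", "B"), dist = c(10, 10, 20))

# Route "A" matches two identical lookup rows, so it appears twice:
inflated <- merge(airplanes, distance, by = "route_ID", all.x = TRUE)
nrow(inflated)   # 3 rows, not 2

# De-duplicating the lookup table first restores one row per observation:
fixed <- merge(airplanes, unique(distance), by = "route_ID", all.x = TRUE)
nrow(fixed)      # 2 rows
```

dplyr's left_join() behaves the same way, so deduplicating the distance table on the actual join key before joining should bring the result back to the original 125k rows.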

Filter one column and matching to another (expanded)

Hi, I have a similar question to this one (Filter one column by matching to another column).
For background, I'm trying to match a code for a book name to the place where it is being used. I figured out how to use filter and grepl to narrow down the book name, but now I need to filter on whether the location names match. I can't share the data since it's private, so here is an example similar to the one above, except that I first filter by what the animal name starts with:
df <- data.frame(
  pair = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
  animal = rep(c("Elephant", "Giraffe", "Antelope"), 6),
  value = seq(1, 12, 2),
  drop = c("savannah", "savannah", "jungle", "jungle", "zoo",
           "unknown", "unknown", "zoo", "my house"))
zoo_animals <- filter(df, grepl("Gir|Ele", animal))
What I'm not sure how to do is build on that to check whether the location column matches between entries. Is it just & location == location?
What I want it to find is: is there an elephant and a giraffe from the zoo? What about the savannah? From the data I made, the only match is the savannah, so it would print those data points, i.e. a data frame like this:
pair   animal     value   drop
1      Elephant   7       savannah
1      Giraffe    3       savannah
1      Giraffe    3       savannah
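No answer is recorded for this question here, but one way to express "keep the locations where both animals occur" is to group by location and count the distinct animals. Below is a base-R sketch using the question's own data (note that data.frame() recycles the columns to 18 rows as written); in dplyr the same idea would be group_by(drop) followed by filter(n_distinct(animal) == 2):

```r
df <- data.frame(
  pair = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
  animal = rep(c("Elephant", "Giraffe", "Antelope"), 6),
  value = seq(1, 12, 2),
  drop = c("savannah", "savannah", "jungle", "jungle", "zoo",
           "unknown", "unknown", "zoo", "my house"))

# Keep only elephants and giraffes, then keep only the locations
# in which both of those animals actually occur:
sub <- subset(df, grepl("Gir|Ele", animal))
n_kinds <- tapply(sub$animal, sub$drop, function(a) length(unique(a)))
result <- sub[n_kinds[sub$drop] == 2, ]
result
```

With this data the only location containing both an elephant and a giraffe is the savannah, so only those rows survive.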

Efficient way to conditionally edit value labels

I'm working with survey data containing value labels. The haven package allows one to import data with value label attributes. Sometimes these value labels need to be edited in routine ways.
The example I'm giving here is very simple, but I'm looking for a solution that can be applied to similar problems across large data.frames.
d <- dput(structure(list(var1 = structure(c(1, 2, NA, NA, 3, NA, 1, 1), labels = structure(c(1,
2, 3, 8, 9), .Names = c("Protection of environment should be given priority",
"Economic growth should be given priority", "[DON'T READ] Both equally",
"[DON'T READ] Don't Know", "[DON'T READ] Refused")), class = "labelled")), .Names = "var1", row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame")))
d$var1
<Labelled double>
[1] 1 2 NA NA 3 NA 1 1
Labels:
value label
1 Protection of environment should be given priority
2 Economic growth should be given priority
3 [DON'T READ] Both equally
8 [DON'T READ] Don't Know
9 [DON'T READ] Refused
If a value label begins with "[DON'T READ]" I want to remove "[DON'T READ]" from the beginning of the label and add "(VOL)" at the end. So, "[DON'T READ] Both equally" would now read "Both equally (VOL)."
Of course, it's straightforward to edit this individual variable with a function from haven's associated labelled package. But I want to apply this solution across all the variables in a data.frame.
library(labelled)
val_labels(d$var1) <- c("Protection of environment should be given priority" = 1,
"Economic growth should be given priority" = 2,
"Both equally (VOL)" = 3,
"Don't Know (VOL)" = 8,
"Refused (VOL)" = 9)
How can I achieve the result of the function directly above in a way that can be applied to every variable in a data.frame?
The solution must work regardless of the specific value. (In this instance it is values 3,8, & 9 that need alteration, but this is not necessarily the case).
There are a few ways to do this. You could use lapply(), or (if you want a one(ish)-liner) one of the scoped variants of mutate():
1) Using lapply()
This method loops over all columns, uses gsub() to strip the part you do not want, and appends " (VOL)" to the end of the string. You could, of course, apply it to a subset of columns as well.
d[] <- lapply(d, function(x) {
labels <- attributes(x)$labels
names(labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(labels))
attributes(x)$labels <- labels
x
})
d$var1
[1] 1 2 NA NA 3 NA 1 1
attr(,"labels")
Protection of environment should be given priority Economic growth should be given priority
1 2
Both equally (VOL) Don't Know (VOL)
3 8
Refused (VOL)
9
attr(,"class")
[1] "labelled"
2) Using mutate_all()
Using the same logic (and producing the same result), you can change the label names in a tidier way:
d %>%
  mutate_all(~{
    names(attributes(.)$labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)",
                                        names(attributes(.)$labels))
    .
  }) %>%
  map(attributes)  # just to check the result; map() is from purrr
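As a side note, mutate_all() has since been superseded in dplyr by across(). A sketch of the same relabelling with mutate(across(...)), using a plain labels attribute as a stand-in for haven's labelled class:

```r
library(dplyr)

# A plain "labels" attribute stands in for haven's labelled class here:
d <- data.frame(var1 = c(1, 2, 3))
attr(d$var1, "labels") <- c("Plain label" = 1,
                            "[DON'T READ] Both equally" = 2,
                            "[DON'T READ] Refused" = 3)

relabel <- function(x) {
  labs <- attr(x, "labels")
  names(labs) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(labs))
  attr(x, "labels") <- labs
  x
}

d <- d %>% mutate(across(everything(), relabel))
names(attr(d$var1, "labels"))
```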

Count by vector of multiple columns in sparklyr

In a related question I had some good help generating the possible combinations of a set of variables.
Assume the output of that process is
combo_tbl <- sdf_copy_to(sc = sc,
x = data.frame(
combo_id = c("combo1", "combo2", "combo3"),
selection_1 = c("Alice", "Alice", "Bob"),
selection_2 = c("Bob", "Cat", "Cat")
),
name = "combo_table")
This is a tbl reference to a Spark data frame with two selection columns, each row representing a selection of 2 values from a list of 3 (Alice, Bob, Cat), which you could imagine as 3 members of a household.
There is also a Spark data frame with a binary encoding, indicating 1 if the member was in the house that day and 0 where they were not.
obs_tbl <- sdf_copy_to(sc = sc,
x = data.frame(
obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
Alice = c(1, 1, 0, 1, 0, 1, 0),
Bob = c(1, 1, 1, 0, 0, 0, 0),
Cat = c(0, 1, 1, 1, 1, 0, 0)
),
name = "obs_table")
I can relatively simply check if a specific pair were present in the house with this code:
obs_tbl %>%
group_by(Alice, Bob) %>%
summarise(n())
However, there are two flaws with this approach.
1. Each pair is entered manually, even though every combination I need to check is already in combo_tbl.
2. The output includes every combination of the grouping values, i.e. I get the count of rows where both Alice and Bob == 1, but also where Alice == 1 and Bob == 0, where Alice == 0 and Bob == 1, etc.
The ideal end result would be an output like so:
Alice | Bob | 2
Alice | Cat | 2
Bob | Cat | 2
i.e. The count of co-habitation days per pair.
A perfect solution would allow simple modification to increase the number of selections within a combination, i.e. each combo_id may have 3 or more selections, drawn from a larger list than the one given.
So, is it possible on sparklyr to pass a vector of pairs that are iterated through?
How do I check only the cases where both of my selections are present? Instead of a vectorised group_by, should I use a vectorised filter?
I've read about quosures and standard evaluation in the tidyverse. Is that the solution to this if running locally? And if so is this supported by spark?
For reference, I have a relatively similar solution using data.table that can be run on a single-machine, non-spark context. Some pseudo code:
combo_dt[, obs_dt[get(tolower(selection_1)) == "1" &
get(tolower(selection_2)) == "1"
, .N], by = combo_id]
This nested process effectively splits each combination into its own sub-table (by = combo_id), then for that sub-table filters rows where selection_1 and selection_2 are both "1", then applies .N to count the rows in that sub-table, and finally aggregates the output.
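No answer is recorded for this question here, but the data.table pseudo code can be expressed with join-and-aggregate verbs: reshape both tables long, join on the member name, and keep the days on which every member of a combo was present. The sketch below runs locally on plain data frames with dplyr and tidyr; the join/group/summarise steps use verbs that sparklyr can translate, though on Spark the pivot_longer() reshaping step may need different helpers:

```r
library(dplyr)
library(tidyr)

combo_tbl <- data.frame(combo_id = c("combo1", "combo2", "combo3"),
                        selection_1 = c("Alice", "Alice", "Bob"),
                        selection_2 = c("Bob", "Cat", "Cat"))
obs_tbl <- data.frame(obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                      Alice = c(1, 1, 0, 1, 0, 1, 0),
                      Bob   = c(1, 1, 1, 0, 0, 0, 0),
                      Cat   = c(0, 1, 1, 1, 1, 0, 0))

# Reshape both tables long so that "member" becomes a join key:
obs_long <- obs_tbl %>%
  pivot_longer(-obs_day, names_to = "member", values_to = "present")
combo_long <- combo_tbl %>%
  pivot_longer(-combo_id, names_to = "slot", values_to = "member")

# Join, then keep only the days on which min(present) == 1, i.e.
# every member of the combo was in the house, and count those days:
result <- combo_long %>%
  inner_join(obs_long, by = "member") %>%
  group_by(combo_id, obs_day) %>%
  summarise(all_present = min(present) == 1, .groups = "drop") %>%
  filter(all_present) %>%
  count(combo_id, name = "days_together")
result
```

Because each combo joins to one long row per member, the min(present) == 1 test generalises unchanged to combos with 3 or more selections.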
