Combining Rows in R with Pivot or Spread?

I am manipulating election data that is currently in the format shown below. Both a coded and a visual example are included (the visual one is slightly condensed), and the values have been edited from their originals.
# Representative Example
library(tidyverse)
test.df <- tibble(yr=rep(1956),mn=rep(11),
sub=rep("Alabama"),
unit_type=rep("County"),
unit_name=c("Autauga","Baldwin","Barbour"),
TotalVotes=c(1000,2000,3000),
RepVotes=c(500,1000,1500),
RepCandidate=rep("Eisenhower"),
DemVotes=c(500,1000,1500),
DemCandidate=rep("Stevenson"),
ThirdVotes=c(0,0,0),
ThirdCandidate=rep("Uncommitted"),
RepVotesTotalPerc=rep(50.00),
DemVotesTotalPerc=rep(50.00),
ThirdVotesTotalPerc=rep(0.00)
)
----------------------------------------------------------------------------------------------------
yr | mn | sub | unit_type | unit_name | TotalVotes | RepVotes | RepCan | DemVotes | DemCan
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga 1000 500 Eisenhower 500 Stevenson
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Baldwin 2000 1000 Eisenhower 1000 Stevenson
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Barbour 3000 1500 Eisenhower 1500 Stevenson
----------------------------------------------------------------------------------------------------
I am trying to get a table that looks like the following:
----------------------------------------------------------------------------------------------------
yr | mn | sub | unit_type | unit_name | pty_n | can | TotalVotes | CanVotes
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Republican Eisenhower 1000 500
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Democrat Stevenson 1000 500
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Independent Uncommitted 1000 0
----------------------------------------------------------------------------------------------------
# and etc. for other counties in example (Baldwin, Barbour, etc)
As you can see, I pretty much want three observations per county, where candidates are all in one column, as well as their respective votes in another (CanVotes, or the like).
I have tried functions such as pivot_longer() and spread(), but I am having a hard time working out how to apply them here. Any help with reshaping my data so that there is a single candidate column, while carrying the rest of the data along with it, would be greatly appreciated!

Here is a solution that first uses pivot_longer to bring the Votes into a long format. Then I use mutate with case_when to substitute the former column names with the actual candidate names and delete the single candidate columns:
long_table <- pivot_longer(test.df,
cols = c(RepVotes, DemVotes, ThirdVotes),
names_to = "pty_n",
values_to = "CanVotes") %>%
mutate(can = case_when(
pty_n == "RepVotes" ~ RepCandidate,
pty_n == "DemVotes" ~ DemCandidate,
pty_n == "ThirdVotes" ~ ThirdCandidate
),
pty_n = case_when(
pty_n == "RepVotes" ~ "Republican",
pty_n == "DemVotes" ~ "Democrat",
pty_n == "ThirdVotes" ~ "Independent"
)) %>%
select(-c(RepCandidate, DemCandidate, ThirdCandidate))
# A tibble: 9 x 12
yr mn sub unit_type unit_name TotalVotes RepVotesTotalPerc DemVotesTotalPerc ThirdVotesTotalPe~ pty_n CanVotes can
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1956 11 Alabama County Autauga 1000 50 50 0 Republican 500 Eisenhower
2 1956 11 Alabama County Autauga 1000 50 50 0 Democrat 500 Stevenson
3 1956 11 Alabama County Autauga 1000 50 50 0 Independe~ 0 Uncommitt~
4 1956 11 Alabama County Baldwin 2000 50 50 0 Republican 1000 Eisenhower
5 1956 11 Alabama County Baldwin 2000 50 50 0 Democrat 1000 Stevenson
6 1956 11 Alabama County Baldwin 2000 50 50 0 Independe~ 0 Uncommitt~
7 1956 11 Alabama County Barbour 3000 50 50 0 Republican 1500 Eisenhower
8 1956 11 Alabama County Barbour 3000 50 50 0 Democrat 1500 Stevenson
9 1956 11 Alabama County Barbour 3000 50 50 0 Independe~ 0 Uncommitt~
I tried to build a custom spec, but it seems that the names have to be derived from the column names and can't be directly conditional on other columns.
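One way to avoid the two case_when() calls is the special ".value" sentinel of pivot_longer(), which splits each column name into a party prefix and a Votes/Candidate suffix in one step. The following is only a minimal sketch, assuming tidyr 1.0.0 or later; the percentage columns are left out of the example data for brevity, and the party labels are the ones used in the question.

```r
library(dplyr)
library(tidyr)

# Example data from the question, without the *TotalPerc columns
test.df <- tibble(
  yr = 1956, mn = 11, sub = "Alabama", unit_type = "County",
  unit_name = c("Autauga", "Baldwin", "Barbour"),
  TotalVotes = c(1000, 2000, 3000),
  RepVotes = c(500, 1000, 1500), RepCandidate = "Eisenhower",
  DemVotes = c(500, 1000, 1500), DemCandidate = "Stevenson",
  ThirdVotes = c(0, 0, 0), ThirdCandidate = "Uncommitted"
)

long_table <- test.df %>%
  pivot_longer(
    # only the six <Party>Votes / <Party>Candidate columns
    cols = matches("^(Rep|Dem|Third)(Votes|Candidate)$"),
    # first capture group -> pty_n, second -> name of the value column
    names_to = c("pty_n", ".value"),
    names_pattern = "^(Rep|Dem|Third)(Votes|Candidate)$"
  ) %>%
  rename(CanVotes = Votes, can = Candidate) %>%
  mutate(pty_n = recode(pty_n,
                        Rep = "Republican",
                        Dem = "Democrat",
                        Third = "Independent"))

long_table
```

Because the candidate name and the vote count are pivoted together, the candidate columns do not have to be dropped afterwards.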

Here is a data.table approach:
library( data.table )
#convert data to the data.table-format
setDT( test.df )
#get the different parties, to update the variable later on
parties <- gsub( "Candidate", "", grep( "^.*Candidate$", names( test.df ), value = TRUE ) )
#melt to each candidate and his/her votes
DT.melt <- melt(test.df,
id.vars = c("yr", "mn", "sub", "unit_type", "unit_name"),
measure.vars = patterns( can = "^.*Candidate$",
canVotes = "^(Rep|Dem|Third)Votes$" ),
variable.name = "pty_n"
)
#get the totals from the original data (by unit_name) through joining
DT.melt[ test.df, TotalVotes := i.TotalVotes, on = .(unit_name)]
#and pass the correct party name to the pty_n column
DT.melt[, pty_n := parties[ pty_n ] ][]
# yr mn sub unit_type unit_name pty_n can canVotes TotalVotes
# 1: 1956 11 Alabama County Autauga Rep Eisenhower 500 1000
# 2: 1956 11 Alabama County Baldwin Rep Eisenhower 1000 2000
# 3: 1956 11 Alabama County Barbour Rep Eisenhower 1500 3000
# 4: 1956 11 Alabama County Autauga Dem Stevenson 500 1000
# 5: 1956 11 Alabama County Baldwin Dem Stevenson 1000 2000
# 6: 1956 11 Alabama County Barbour Dem Stevenson 1500 3000
# 7: 1956 11 Alabama County Autauga Third Uncommitted 0 1000
# 8: 1956 11 Alabama County Baldwin Third Uncommitted 0 2000
# 9: 1956 11 Alabama County Barbour Third Uncommitted 0 3000
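The join on unit_name alone works for this three-county example, but it could become ambiguous if county names repeat across states or years in the full data. A possible variation, sketched here under the assumption that TotalVotes is constant within each original row, is to keep TotalVotes as an id variable so that no join is needed. The percentage columns are omitted from the example data.

```r
library(data.table)

# Example data from the question, without the *TotalPerc columns
DT <- data.table(
  yr = 1956, mn = 11, sub = "Alabama", unit_type = "County",
  unit_name = c("Autauga", "Baldwin", "Barbour"),
  TotalVotes = c(1000, 2000, 3000),
  RepVotes = c(500, 1000, 1500), RepCandidate = "Eisenhower",
  DemVotes = c(500, 1000, 1500), DemCandidate = "Stevenson",
  ThirdVotes = c(0, 0, 0), ThirdCandidate = "Uncommitted"
)

# Party prefixes in column order: "Rep", "Dem", "Third"
parties <- sub("Candidate$", "", grep("Candidate$", names(DT), value = TRUE))

# TotalVotes travels along as an id variable, so no join is required
DT.melt <- melt(
  DT,
  id.vars = c("yr", "mn", "sub", "unit_type", "unit_name", "TotalVotes"),
  measure.vars = patterns("Candidate$", "^(Rep|Dem|Third)Votes$"),
  value.name = c("can", "canVotes"),
  variable.name = "pty_n"
)

# pty_n is a factor with levels 1, 2, 3; use it to index the party prefixes
DT.melt[, pty_n := parties[pty_n]]
DT.melt[]
```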

Related

Obtaining the household median income from ACS data

The code below returns exactly what I need: the household median income for each PUMA from the 2019 ACS (1-year). However, the state names are missing. I tried the option state="all", but it did not work. How can I obtain my data of interest by state and PUMA?
Thanks,
NM
PUMA_level <- get_acs(geography = "puma",
variable = "B19013_001",
survey = "acs1",
# state="all",
year = 2019)
Using the usmap::fips_info function you could get a list of state codes, names and abbreviations which you could then merge to your census data like so:
library(tidycensus)
library(usmap)
PUMA_level <- get_acs(geography = "puma",
variable = "B19013_001",
survey = "acs1",
year = 2019,
keep_geo_vars = TRUE)
#> Getting data from the 2019 1-year ACS
#> The 1-year ACS provides data for geographies with populations of 65,000 and greater.
PUMA_level$fips <- substr(PUMA_level$GEOID, 1, 2)
states <- usmap::fips_info(unique(PUMA_level$fips))
#> Warning in get_fips_info(fips_, sortAndRemoveDuplicates): FIPS code(s) 72 not
#> found
PUMA_level <- merge(PUMA_level, states, by = "fips")
head(PUMA_level)
#> fips GEOID
#> 1 01 0100100
#> 2 01 0100200
#> 3 01 0100302
#> 4 01 0100400
#> 5 01 0100500
#> 6 01 0100301
#> NAME
#> 1 Lauderdale, Colbert, Franklin & Marion (Northeast) Counties PUMA; Alabama
#> 2 Limestone & Madison (Outer) Counties--Huntsville City (Far West & Southwest) PUMA, Alabama
#> 3 Huntsville City (Central & South) PUMA, Alabama
#> 4 DeKalb & Jackson Counties PUMA, Alabama
#> 5 Marshall & Madison (Southeast) Counties--Huntsville City (Far Southeast) PUMA, Alabama
#> 6 Huntsville (North) & Madison (East) Cities PUMA, Alabama
#> variable estimate moe abbr full
#> 1 B19013_001 46449 3081 AL Alabama
#> 2 B19013_001 74518 6371 AL Alabama
#> 3 B19013_001 51884 5513 AL Alabama
#> 4 B19013_001 43406 3557 AL Alabama
#> 5 B19013_001 56276 3216 AL Alabama
#> 6 B19013_001 63997 5816 AL Alabama
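Note that merge() performs an inner join by default, so rows whose FIPS prefix is missing from the lookup (here code 72, Puerto Rico, which triggered the warning) are dropped from the result. If those rows should be kept, a left join is one option. The sketch below uses a tiny stand-in for the get_acs() result; the first two rows are taken from the output above, while the third GEOID is a made-up placeholder for a Puerto Rico PUMA.

```r
# Toy stand-in for the get_acs() result (third row is a hypothetical placeholder)
puma <- data.frame(
  GEOID = c("0100100", "0100200", "7200101"),
  estimate = c(46449, 74518, NA),
  stringsAsFactors = FALSE
)

# Lookup table that, like the one returned above, has no entry for code 72
states <- data.frame(
  fips = "01", abbr = "AL", full = "Alabama",
  stringsAsFactors = FALSE
)

# The first two characters of the GEOID are the state FIPS code
puma$fips <- substr(puma$GEOID, 1, 2)

inner <- merge(puma, states, by = "fips")               # drops the "72" row
left  <- merge(puma, states, by = "fips", all.x = TRUE) # keeps it, with NA state

nrow(inner)
nrow(left)
```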

How to filter a dataframe so that it finds the maximum value for 10 unique occurrences of another variable

I have this dataframe, which I filter down to include only counties in the state of Washington and only the columns relevant to my question. I want to reduce it to exactly 10 rows: the counties with the highest Black prison population in Washington State, regardless of year. The part I am struggling with is that counties cannot be repeated, so each row should hold the highest Black prison population for one of the top 10 unique county names. Some counties also have missing (NA) values for the Black prison population. You should be able to reproduce the filtered dataframe with the following code.
library(dplyr)
incarceration <- read.csv("https://raw.githubusercontent.com/vera-institute/incarceration-trends/master/incarceration_trends.csv")
blackPrisPop <- incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA")
Sample of what the updated dataframe looks like (should include 1911 rows):
fips county_name state year black_pop_15to64 black_prison_pop
130 53005 Benton County WA 2001 1008 25
131 53005 Benton County WA 2002 1143 20
132 53005 Benton County WA 2003 1208 21
133 53005 Benton County WA 2004 1236 27
134 53005 Benton County WA 2005 1310 32
135 53005 Benton County WA 2006 1333 35
You can group_by county_name and then use slice_max to take the row with the maximum value of black_prison_pop. With n = 1 you get one row for each county, and with with_ties = FALSE you get a single row even when several years tie for the maximum.
You can then arrange the rows in descending order of black_prison_pop and keep the first 10 to get the overall top 10 counties.
library(dplyr)
incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA") %>%
group_by(county_name) %>%
slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
arrange(desc(black_prison_pop)) %>%
head(10)
Output
black_prison_pop black_pop_15to64 year fips county_name state
<dbl> <dbl> <int> <int> <chr> <chr>
1 1845 73480 2002 53033 King County WA
2 975 47309 2013 53053 Pierce County WA
3 224 5890 2005 53063 Spokane County WA
4 172 19630 2015 53061 Snohomish County WA
5 137 8129 2016 53011 Clark County WA
6 129 5146 2003 53035 Kitsap County WA
7 102 5663 2009 53067 Thurston County WA
8 58 706 1991 53021 Franklin County WA
9 50 1091 1991 53077 Yakima County WA
10 46 1748 2008 53073 Whatcom County WA
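To see how the pieces interact, including the counties with missing values, here is a small self-contained sketch on made-up data (the county names and counts are invented for illustration). It assumes dplyr 1.0.0 or later, where slice_max() is available.

```r
library(dplyr)

# Made-up data: one county has a partial NA, one has only NAs
toy <- tibble(
  county_name = c("A County", "A County", "B County", "B County",
                  "C County", "C County"),
  year = c(2001, 2002, 2001, 2002, 2001, 2002),
  black_prison_pop = c(10, 30, 50, NA, NA, NA)
)

top2 <- toy %>%
  group_by(county_name) %>%
  # one row per county: the year with the largest value (NAs sort last)
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
  # arrange() ignores grouping by default; ungroup() just makes that explicit
  ungroup() %>%
  # desc() also places NA last, so all-NA counties fall to the bottom
  arrange(desc(black_prison_pop)) %>%
  head(2)

top2
```

A county whose values are all NA still contributes one row after slice_max(), but it ends up at the bottom after arrange() and is cut off by head() as long as enough counties have data.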

R: conditional aggregate based on factor level and year

I have a dataset in R which I am trying to aggregate by column level and year which looks like this:
City State Year Status Year_repealed PolicyNo
Pitt PA 2001 InForce 6
Phil. PA 2001 Repealed 2004 9
Pitt PA 2002 InForce 7
Pitt PA 2005 InForce 2
What I would like to do is, for each Year, aggregate PolicyNo within each state, taking into account the year in which a policy was repealed. The result I would then get is:
Year State PolicyNo
2001 PA 15
2002 PA 22
2003 PA 22
2004 PA 13
2005 PA 15
I am not sure how to go about splitting and aggregating the data conditional on the repeal date, and was wondering if there is an easy way to achieve this in R.
It may help you to break this up into two distinct problems.
1. Get a table that shows the change in PolicyNo in every city-state-year.
2. Summarize that table to show the PolicyNo in each state-year.
To accomplish (1) we add the missing years with NA PolicyNo, and add repeals as negative PolicyNo observations.
library(dplyr)
df = structure(list(City = c("Pitt", "Phil.", "Pitt", "Pitt"), State = c("PA", "PA", "PA", "PA"), Year = c(2001L, 2001L, 2002L, 2005L), Status = c("InForce", "Repealed", "InForce", "InForce"), Year_repealed = c(NA, 2004L, NA, NA), PolicyNo = c(6L, 9L, 7L, 2L)), .Names = c("City", "State", "Year", "Status", "Year_repealed", "PolicyNo"), class = "data.frame", row.names = c(NA, -4L))
repeals = df %>%
filter(!is.na(Year_repealed)) %>%
mutate(Year = Year_repealed, PolicyNo = -1 * PolicyNo)
repeals
# City State Year Status Year_repealed PolicyNo
# 1 Phil. PA 2004 Repealed 2004 -9
all_years = expand.grid(City = unique(df$City), State = unique(df$State),
Year = 2001:2005)
df = bind_rows(df, repeals, all_years)
# City State Year Status Year_repealed PolicyNo
# 1 Pitt PA 2001 InForce NA 6
# 2 Phil. PA 2001 Repealed 2004 9
# 3 Pitt PA 2002 InForce NA 7
# 4 Pitt PA 2005 InForce NA 2
# 5 Phil. PA 2004 Repealed 2004 -9
# 6 Pitt PA 2001 <NA> NA NA
# 7 Phil. PA 2001 <NA> NA NA
# 8 Pitt PA 2002 <NA> NA NA
# 9 Phil. PA 2002 <NA> NA NA
# 10 Pitt PA 2003 <NA> NA NA
# 11 Phil. PA 2003 <NA> NA NA
# 12 Pitt PA 2004 <NA> NA NA
# 13 Phil. PA 2004 <NA> NA NA
# 14 Pitt PA 2005 <NA> NA NA
# 15 Phil. PA 2005 <NA> NA NA
Now the table shows every city-state-year and incorporates repeals. This is a table we can summarize.
df = df %>%
group_by(Year, State) %>%
summarize(annual_change = sum(PolicyNo, na.rm = TRUE))
df
# Source: local data frame [5 x 3]
# Groups: Year [?]
#
# Year State annual_change
# <int> <chr> <dbl>
# 1 2001 PA 15
# 2 2002 PA 7
# 3 2003 PA 0
# 4 2004 PA -9
# 5 2005 PA 2
That gets us PolicyNo change in each state-year. A cumulative sum over the changes gets us levels.
df = df %>%
ungroup() %>%
mutate(PolicyNo = cumsum(annual_change))
df
# # A tibble: 5 × 4
# Year State annual_change PolicyNo
# <int> <chr> <dbl> <dbl>
# 1 2001 PA 15 15
# 2 2002 PA 7 22
# 3 2003 PA 0 22
# 4 2004 PA -9 13
# 5 2005 PA 2 15
With the data.table package you could do it as follows (dat is the data frame from the question, and data.table must be loaded):
melt(setDT(dat),
measure.vars = c(3,5),
value.name = 'Year',
value.factor = FALSE)[!is.na(Year)
][variable == 'Year_repealed', PolicyNo := -1*PolicyNo
][CJ(Year = min(Year):max(Year), State = State, unique = TRUE), on = .(Year, State)
][is.na(PolicyNo), PolicyNo := 0
][, .(PolicyNo = sum(PolicyNo)), by = .(Year, State)
][, .(Year, State, PolicyNo = cumsum(PolicyNo))]
The result of the above code:
Year State PolicyNo
1: 2001 PA 15
2: 2002 PA 22
3: 2003 PA 22
4: 2004 PA 13
5: 2005 PA 15
As you can see, several steps are needed to reach the desired end result:
First you convert to a data.table (setDT(dat)), reshape it into long format, and remove the rows with no Year.
Then you make PolicyNo negative for the rows that come from 'Year_repealed'.
With a cross join (CJ) you make sure that all the years for each state are present, and you convert the NA values in the PolicyNo column to zero.
Finally, you summarise by year and state and take a cumulative sum of the result.
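As a cross-check on the numbers, the in-force total for a given year can also be computed directly: a policy counts in year y if it was introduced in or before y and has either not been repealed or is repealed only after y. This is just a base R sketch for the single-state example data, not a replacement for the approaches above.

```r
# Example data from the question
df <- data.frame(
  City = c("Pitt", "Phil.", "Pitt", "Pitt"),
  State = "PA",
  Year = c(2001L, 2001L, 2002L, 2005L),
  Year_repealed = c(NA, 2004L, NA, NA),
  PolicyNo = c(6L, 9L, 7L, 2L),
  stringsAsFactors = FALSE
)

years <- 2001:2005

# For each year, sum PolicyNo over the policies that are in force
in_force <- sapply(years, function(y) {
  active <- df$Year <= y &
    (is.na(df$Year_repealed) | df$Year_repealed > y)
  sum(df$PolicyNo[active])
})

result <- data.frame(Year = years, State = "PA", PolicyNo = in_force)
result
#   Year State PolicyNo
# 1 2001    PA       15
# 2 2002    PA       22
# 3 2003    PA       22
# 4 2004    PA       13
# 5 2005    PA       15
```

The totals match the cumulative-sum results: the 9 policies repealed in 2004 reduce 22 to 13, and the 2 added in 2005 bring the total to 15.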

Create count per item by year/decade

I have data in a data.table that is as follows:
> x<-df[sample(nrow(df), 10),]
> x
Importer Exporter Date
1: Ecuador United Kingdom 2004-01-13
2: Mexico United States 2013-11-19
3: Australia United States 2006-08-11
4: United States United States 2009-05-04
5: India United States 2007-07-16
6: Guatemala Guatemala 2014-07-02
7: Israel Israel 2000-02-22
8: India United States 2014-02-11
9: Peru Peru 2007-03-26
10: Poland France 2014-09-15
I am trying to create summaries so that, given a time period (say a decade), I can find the number of times each country appears as Importer and as Exporter. So, in the above example, the desired output when dividing up by decade should be something like:
Decade Country.Name Importer.Count Exporter.Count
2000 Ecuador 1 0
2000 Mexico 1 1
2000 Australia 1 0
2000 United States 1 3
.
.
.
2010 United States 0 2
.
.
.
So far, I have tried the aggregate and data.table methods suggested by the post here, but both seem to give me only the total number of Importers/Exporters per year (or per decade, which is what I am more interested in).
> x$Decade<-year(x$Date)-year(x$Date)%%10
> importer_per_yr<-aggregate(Importer ~ Decade, FUN=length, data=x)
> importer_per_yr
Decade Importer
2 2000 6
3 2010 4
Considering that aggregate uses the formula interface, I tried adding another criterion, but got the following error:
> importer_per_yr<-aggregate(Importer~ Decade + unique(Importer), FUN=length, data=x)
Error in model.frame.default(formula = Importer ~ Decade + :
variable lengths differ (found for 'unique(Importer)')
Is there a way to create the summary according to the decade and the importer/ exporter? It does not matter if the summary for importer and exporter are in different tables.
We can do this using data.table methods. Create the 'Decade' column by assignment (:=), then melt the data from 'wide' to 'long' format by specifying the measure columns, and reshape it back to 'wide' using dcast with length as the fun.aggregate.
x[, Decade:= year(Date) - year(Date) %%10]
dcast(melt(x, measure = c("Importer", "Exporter"), value.name = "Country"),
Decade + Country~variable, length)
# Decade Country Importer Exporter
# 1: 2000 Australia 1 0
# 2: 2000 Ecuador 1 0
# 3: 2000 India 1 0
# 4: 2000 Israel 1 1
# 5: 2000 Peru 1 1
# 6: 2000 United Kingdom 0 1
# 7: 2000 United States 1 3
# 8: 2010 France 0 1
# 9: 2010 Guatemala 1 1
#10: 2010 India 1 0
#11: 2010 Mexico 1 0
#12: 2010 Poland 1 0
#13: 2010 United States 0 2
I think this will work with aggregate in base R:
my.data <- read.csv(text = '
Importer, Exporter, Date
Ecuador, United Kingdom, 2004-01-13
Mexico, United States, 2013-11-19
Australia, United States, 2006-08-11
United States, United States, 2009-05-04
India, United States, 2007-07-16
Guatemala, Guatemala, 2014-07-02
Israel, Israel, 2000-02-22
India, United States, 2014-02-11
Peru, Peru, 2007-03-26
Poland, France, 2014-09-15
', header = TRUE, stringsAsFactors = TRUE, strip.white = TRUE)
my.data$my.Date <- as.Date(my.data$Date, format = "%Y-%m-%d")
my.data <- data.frame(my.data,
year = as.numeric(format(my.data$my.Date, format = "%Y")),
month = as.numeric(format(my.data$my.Date, format = "%m")),
day = as.numeric(format(my.data$my.Date, format = "%d")))
my.data$my.decade <- my.data$year - (my.data$year %% 10)
importer.count <- with(my.data, aggregate(cbind(count = Importer) ~ my.decade + Importer, FUN = function(x) { NROW(x) }))
exporter.count <- with(my.data, aggregate(cbind(count = Exporter) ~ my.decade + Exporter, FUN = function(x) { NROW(x) }))
colnames(importer.count) <- c('my.decade', 'country', 'importer.count')
colnames(exporter.count) <- c('my.decade', 'country', 'exporter.count')
my.counts <- merge(importer.count, exporter.count, by = c('my.decade', 'country'), all = TRUE)
my.counts$importer.count[is.na(my.counts$importer.count)] <- 0
my.counts$exporter.count[is.na(my.counts$exporter.count)] <- 0
my.counts
# my.decade country importer.count exporter.count
# 1 2000 Australia 1 0
# 2 2000 Ecuador 1 0
# 3 2000 India 1 0
# 4 2000 Israel 1 1
# 5 2000 Peru 1 1
# 6 2000 United States 1 3
# 7 2000 United Kingdom 0 1
# 8 2010 Guatemala 1 1
# 9 2010 India 1 0
# 10 2010 Mexico 1 0
# 11 2010 Poland 1 0
# 12 2010 United States 0 2
# 13 2010 France 0 1

Maps, ggplot2, fill by state is missing certain areas on the map

I am working with maps and ggplot2 to visualize the number of certain crimes in each state for different years. The data set that I am working with was produced by the FBI and can be downloaded from their site or from here (if you don't want to download the dataset I don't blame you, but it is way too big to copy and paste into this question, and including a fraction of the data set wouldn't help, as there wouldn't be enough information to recreate the graph).
The problem is easier seen than described.
As you can see California is missing a large chunk as well as a few other states. Here is the code that produced this plot:
# load libraries
library(maps)
library(ggplot2)
# load data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
states <- map_data("state")
# merge data sets by region
fbi$region <- tolower(fbi$state)
fbimap <- merge(fbi, states, by="region")
# plot robbery numbers by state for year 2012
fbimap12 <- subset(fbimap, Year == 2012)
qplot(long, lat, geom="polygon", data=fbimap12,
facets=~Year, fill=Robbery, group=group)
This is what the states data looks like:
long lat group order region subregion
1 -87.46201 30.38968 1 1 alabama <NA>
2 -87.48493 30.37249 1 2 alabama <NA>
3 -87.52503 30.37249 1 3 alabama <NA>
4 -87.53076 30.33239 1 4 alabama <NA>
5 -87.57087 30.32665 1 5 alabama <NA>
6 -87.58806 30.32665 1 6 alabama <NA>
And this is what the fbi data looks like:
Year Population Violent Property Murder Forcible.Rape Robbery
1 1960 3266740 6097 33823 406 281 898
2 1961 3302000 5564 32541 427 252 630
3 1962 3358000 5283 35829 316 218 754
4 1963 3347000 6115 38521 340 192 828
5 1964 3407000 7260 46290 316 397 992
6 1965 3462000 6916 48215 395 367 992
Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft abbr state region
1 4512 11626 19344 2853 AL Alabama alabama
2 4255 11205 18801 2535 AL Alabama alabama
3 3995 11722 21306 2801 AL Alabama alabama
4 4755 12614 22874 3033 AL Alabama alabama
5 5555 15898 26713 3679 AL Alabama alabama
6 5162 16398 28115 3702 AL Alabama alabama
I then merged the two sets along region. The subset I am trying to plot is
region Year Robbery long lat group
8283 alabama 2012 5020 -87.46201 30.38968 1
8284 alabama 2012 5020 -87.48493 30.37249 1
8285 alabama 2012 5020 -87.95475 30.24644 1
8286 alabama 2012 5020 -88.00632 30.24071 1
8287 alabama 2012 5020 -88.01778 30.25217 1
8288 alabama 2012 5020 -87.52503 30.37249 1
... ... ... ...
Any ideas on how I can create this plot without those ugly missing spots?
I played with your code, and what I can tell is that something happened when you used merge. I drew the states map using geom_path and confirmed that there were a couple of odd lines that do not exist in the original map data. I then investigated further by comparing merge and inner_join, which perform the same join here. There was one difference: with merge the row order changed, so the values in the order column were no longer in the right sequence, whereas with inner_join they stayed in sequence. You can see this in the California rows below. Your approach was right, but merge did not work in your favour. I am not sure why the function changed the order, though.
library(dplyr)
### Call US map polygon
states <- map_data("state")
### Get crime data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
fbi$state <- tolower(fbi$state)
### Check if both files have identical state names: The answer is NO
### states$region does not have Alaska, Hawaii, and Washington D.C.
### fbi$state does not have District of Columbia.
setdiff(fbi$state, states$region)
#[1] "alaska" "hawaii" "washington d. c."
setdiff(states$region, fbi$state)
#[1] "district of columbia"
### Select data for 2012 and choose two columns (i.e., state and Robbery)
fbi2 <- fbi %>%
filter(Year == 2012) %>%
select(state, Robbery)
Now I created two data frames with merge and inner_join.
### Create two data frames with merge and inner_join
ana <- merge(fbi2, states, by.x = "state", by.y = "region")
bob <- inner_join(fbi2, states, by = c("state" ="region"))
ana %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -119.8685 38.90956 4 676 <NA>
#2 california 56521 -119.5706 38.69757 4 677 <NA>
#3 california 56521 -119.3299 38.53141 4 678 <NA>
#4 california 56521 -120.0060 42.00927 4 667 <NA>
#5 california 56521 -120.0060 41.20139 4 668 <NA>
bob %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -120.0060 42.00927 4 667 <NA>
#2 california 56521 -120.0060 41.20139 4 668 <NA>
#3 california 56521 -120.0060 39.70024 4 669 <NA>
#4 california 56521 -119.9946 39.44241 4 670 <NA>
#5 california 56521 -120.0060 39.31636 4 671 <NA>
ggplot(data = bob, aes(x = long, y = lat, fill = Robbery, group = group)) +
geom_polygon()
The problem is the order of the arguments to merge. The call
fbimap <- merge(fbi, states, by="region")
has the thematic data first and the geo data second. Switching the order with
fbimap <- merge(states, fbi, by="region")
the polygons should all close up.
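Whichever join is used, polygons are drawn by connecting the vertices in row order, and map_data() records the intended sequence in its order column. A hedged alternative to relying on the join to preserve row order is therefore to re-sort explicitly after merging. The sketch below uses a made-up one-square "map" (the region name and the Robbery value are invented for illustration) and scrambles the rows by hand to mimic the problem.

```r
# Made-up geometry: a unit square with its vertex sequence in `order`
shape <- data.frame(
  region = "toyland", group = 1, order = 1:4,
  long = c(0, 1, 1, 0), lat = c(0, 0, 1, 1)
)
# Made-up thematic data
crime <- data.frame(region = "toyland", Robbery = 42)

merged <- merge(crime, shape, by = "region")

# Mimic a join that returns the vertices out of sequence
scrambled <- merged[c(3, 1, 4, 2), ]

# Restore the drawing sequence before plotting
fixed <- scrambled[order(scrambled$group, scrambled$order), ]
fixed$order
```

Applied to the question's data, the same idea would be to sort fbimap12 by its group and order columns before passing it to the plotting call.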
