How to filter a dataframe so that it finds the maximum value for 10 unique occurrences of another variable - r

I have this dataframe, which I filter down to only include counties in the state of Washington and only the columns that are relevant to the answer I am looking for. What I want to do is filter the dataframe down to only 10 rows, which have the highest Black prison population out of all of the counties in Washington State, regardless of year. The part that I am struggling with is that there can't be repeated counties, so each row should hold the highest Black prison population for one of the top 10 unique county names in the state of Washington. Some of the counties have NULL data for the Black prison populations as well. You should be able to reproduce this to get the updated dataframe:
library(dplyr)
incarceration <- read.csv("https://raw.githubusercontent.com/vera-institute/incarceration-trends/master/incarceration_trends.csv")
blackPrisPop <- incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA")
Sample of what the updated dataframe looks like (should include 1911 rows):
fips county_name state year black_pop_15to64 black_prison_pop
130 53005 Benton County WA 2001 1008 25
131 53005 Benton County WA 2002 1143 20
132 53005 Benton County WA 2003 1208 21
133 53005 Benton County WA 2004 1236 27
134 53005 Benton County WA 2005 1310 32
135 53005 Benton County WA 2006 1333 35

You can group_by the county_name column and then use slice_max to take the row with the maximum value of black_prison_pop for each county. With n = 1 you get one row per county; with with_ties = FALSE you get a single row even in the case of ties.
You can then arrange the black_prison_pop values in descending order and take the first 10 rows to get the overall top 10 across all counties.
library(dplyr)
incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA") %>%
  group_by(county_name) %>%
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
  arrange(desc(black_prison_pop)) %>%
  head(10)
Output
black_prison_pop black_pop_15to64 year fips county_name state
<dbl> <dbl> <int> <int> <chr> <chr>
1 1845 73480 2002 53033 King County WA
2 975 47309 2013 53053 Pierce County WA
3 224 5890 2005 53063 Spokane County WA
4 172 19630 2015 53061 Snohomish County WA
5 137 8129 2016 53011 Clark County WA
6 129 5146 2003 53035 Kitsap County WA
7 102 5663 2009 53067 Thurston County WA
8 58 706 1991 53021 Franklin County WA
9 50 1091 1991 53077 Yakima County WA
10 46 1748 2008 53073 Whatcom County WA
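Since the question mentions NULL (NA) values in black_prison_pop, a defensive variant drops those rows first, so a county whose records are all NA never surfaces. A minimal sketch of the same pipeline:
library(dplyr)

incarceration %>%
  select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
  filter(state == "WA", !is.na(black_prison_pop)) %>%  # drop missing populations up front
  group_by(county_name) %>%
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
  arrange(desc(black_prison_pop)) %>%
  head(10)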

Related

R: move everything after a word to a new column and then only keep the last four digits in the new column

My data frame has a column called "State" that contains the state name, the HB/HF number, and the date the law went into effect. I want the State column to contain only the state name and a second column to contain just the year. How would I do this?
library(dplyr)

Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz
# delete rows if column 2 has a blank value
mintz = mintz[mintz$Substitution.Requirements != "", ]
# remove the entire row if column 1 has the word "State"
mintz = mintz[mintz$State != "State", ]
# reset row numbers
mintz = mintz %>% data.frame(row.names = 1:nrow(.))
# delete PR
mintz = mintz[-34, ]
# reset row numbers
mintz = mintz %>% data.frame(row.names = 1:nrow(.))
I'm almost certain I'll need to use strsplit(gsub()), but I'm not sure how to do this since there's no specific pattern.
EDIT
I still need help keeping only the state name in column 1.
As for moving the year to a new column, I found the below. It works, but I don't know why it works. From my understanding, \d means that \d is the actual character it's searching for, the "." means to search for one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so it instead used characters. Isn't \d only supposed to match digits? Would someone care to explain?
mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)
One way could be:
For demonstration purposes, select the State column.
Then use str_extract to pull out the four-digit number at the end of the string with \\d{4}$ -> this gives us the Year column.
Finally, make use of the built-in state.name vector: collapse it into an alternation pattern, use it with str_extract again, and remove the NA rows.
library(dplyr)
library(stringr)
mintz %>%
  select(State) %>%
  mutate(Year = str_extract(State, '\\d{4}$'), .after = State,
         State = str_extract(State, paste(state.name, collapse = '|'))) %>%
  na.omit()
State Year
2 Arizona 2016
3 California 2016
7 Connecticut 2018
12 Florida 2013
13 Georgia 2015
16 Hawaii 2016
21 Illinois 2016
24 Indiana 2014
28 Iowa 2017
32 Kansas 2017
33 Kentucky 2016
34 Louisiana 2015
39 Maryland 2017
42 Michigan 2018
46 Missouri 2016
47 Montana 2017
50 Nebraska 2018
51 Nevada 2018
54 New Hampshire 2018
55 New Jersey 2016
59 New York 2017
62 North Carolina 2015
63 North Dakota 2013
66 Ohio 2017
67 Oregon 2016
70 Pennsylvania 2016
74 Rhode Island 2016
75 South Carolina 2017
78 South Dakota 2019
79 Tennessee 2015
82 Texas 2015
85 Utah 2015
88 Vermont 2018
89 Virginia 2013
92 Washington 2015
93 West Virginia 2018
96 Wisconsin 2019
97 Wyoming 2018
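Regarding the EDIT: in the pattern '.*(\\d{4}).*', \\d matches a single digit, {4} repeats it four times, and the parentheses form capture group 1; \\1 in the replacement string refers to whatever that group matched. Because the surrounding .* parts are greedy, the group captures the last run of four digits. When the pattern does not match at all, sub() returns the input unchanged, which is why the Minnesota row (no year present) came back as characters. A small illustration with invented strings:
sub('.*(\\d{4}).*', '\\1', 'Arizona HB 1234 2016')
# [1] "2016"               (the last four-digit run is captured)
sub('.*(\\d{4}).*', '\\1', 'Minnesota HF 14')
# [1] "Minnesota HF 14"    (no match, so the input is returned unchanged)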

Adding conditional variables to dataframe

Say we have a dataframe that looks like this:
UNIT NUMBER Year City STATE
124 1996 Prague CZECH
121 2001 Sofie BULG
122 2003 Ostrava CZECH
147 1986 Kyjev UKRAINE
133 2005 Lvov UKRAINE
...
...
...
188 2001 Rome ITALY
And say I need to add another variable to the dataframe, called Capital city, that would be equal to 1 if the City is the capital city of STATE and 0 otherwise.
How would I add this variable?
The capital cities in the above dataframe are: Prague, Sofie, Kyjev.
PS: I know I can do it 'by hand' in the above dataframe, but I need a universal solution for much bigger dataframes...
If you have many city names, and some cities in different states share the same name:
library(dplyr)
df <- data.frame(
  unit = c(124, 121, 122, 147, 133),
  Year = c(1996, 2001, 2003, 1986, 2005),
  City = c("Prague", "Sofie", "Ostrava", "Kyjev", "Lvov"),
  State = c("CZECH", "BULG", "CZECH", "UKRAINE", "UKRAINE"))

capital <- data.frame(
  City = c("Prague", "Sofie", "Kyjev"),
  State = c("CZECH", "BULG", "UKRAINE"),
  Capital = "YES"
)

left_join(df, capital, by = c("State" = "State", "City" = "City"))
Result:
> left_join(df, capital, by = c("State" = "State", "City" = "City"))
unit Year City State Capital
1 124 1996 Prague CZECH YES
2 121 2001 Sofie BULG YES
3 122 2003 Ostrava CZECH <NA>
4 147 1986 Kyjev UKRAINE YES
5 133 2005 Lvov UKRAINE <NA>
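The question asks for a 1/0 indicator rather than YES/NA; one way to convert the joined column (a sketch reusing the df and capital frames from above):
left_join(df, capital, by = c("State", "City")) %>%
  mutate(Capital = as.integer(!is.na(Capital)))  # YES -> 1, <NA> -> 0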
If all city names are unique, then:
cap_list = c("Prague", "Sofie", "Kyjev")

df %>%
  mutate(yes = as.numeric(City %in% cap_list))
unit Year City State yes
1 124 1996 Prague CZECH 1
2 121 2001 Sofie BULG 1
3 122 2003 Ostrava CZECH 0
4 147 1986 Kyjev UKRAINE 1
5 133 2005 Lvov UKRAINE 0

Find the nth largest value based on criteria [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
This is basically the same problem I had in Excel a few days ago (Excel - find nth largest value based on criteria), but this time in R (the data set contains half a million entries, which is more than Excel seems to be able to handle).
I have a table that looks like this that I have imported from Excel:
Country Region Code Product name Year Value
Sweden Stockholm 123 Apple 1991 244
Sweden Kirruna 123 Apple 1987 100
Japan Kyoto 543 Pie 1987 544
Denmark Copenhagen 123 Apple 1998 787
Denmark Copenhagen 123 Apple 1987 100
Denmark Copenhagen 543 Pie 1991 320
Denmark Copenhagen 126 Candy 1999 200
Sweden Gothenburg 126 Candy 2013 300
Sweden Gothenburg 157 Tomato 1987 150
Sweden Stockholm 125 Juice 1987 250
Sweden Kirruna 187 Banana 1998 310
Japan Kyoto 198 Ham 1987 157
Japan Kyoto 125 Juice 1987 550
Japan Tokyo 125 Juice 1991 100
What I want to do is write code that gives me the sum of the values for the nth most sold product in a specific country. For instance, the most sold product in Sweden is Apple, so I want the code to find that Apple is the most sold product (in total, which is what I am interested in) and then sum all the values of the sold apples in the country Sweden, giving 344.
I also want to be able to find the nth largest value based on both country and year. That is, if I am looking for the most sold product in Sweden in the year 2013, it should return the product Candy and the value 300.
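For reproducibility, here is a sketch that rebuilds the example data from the table above as the df object the answer below assumes:
df <- data.frame(
  Country = c("Sweden", "Sweden", "Japan", "Denmark", "Denmark", "Denmark", "Denmark",
              "Sweden", "Sweden", "Sweden", "Sweden", "Japan", "Japan", "Japan"),
  Region = c("Stockholm", "Kirruna", "Kyoto", "Copenhagen", "Copenhagen", "Copenhagen",
             "Copenhagen", "Gothenburg", "Gothenburg", "Stockholm", "Kirruna", "Kyoto",
             "Kyoto", "Tokyo"),
  Code = c(123, 123, 543, 123, 123, 543, 126, 126, 157, 125, 187, 198, 125, 125),
  Product_name = c("Apple", "Apple", "Pie", "Apple", "Apple", "Pie", "Candy", "Candy",
                   "Tomato", "Juice", "Banana", "Ham", "Juice", "Juice"),
  Year = c(1991, 1987, 1987, 1998, 1987, 1991, 1999, 2013, 1987, 1987, 1998, 1987, 1987, 1991),
  Value = c(244, 100, 544, 787, 100, 320, 200, 300, 150, 250, 310, 157, 550, 100)
)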
Solution for your first question (find most sold product per country, summarise value for this product) using dplyr:
library(tidyverse)
df %>%
  group_by(Country, Product_name) %>%
  summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Country) %>%
  filter(sum_value == max(sum_value))
# A tibble: 3 x 3
# Groups: Country [3]
Country Product_name sum_value
<fctr> <fctr> <int>
1 Denmark Apple 887
2 Japan Juice 650
3 Sweden Apple 344
Solution for second question (show nth most sold products per country and year, sum value):
df %>%
  group_by(Country, Product_name, Year) %>%
  summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Country, Year) %>%
  arrange(desc(sum_value), .by_group = TRUE) %>%
  slice(1:2)
I had to change the data a bit to get a decent output, so here's the output with all years set to 1987 (change the 2 in 1:2 in the last row for a different n):
# A tibble: 6 x 4
# Groups: Country, Year [3]
Country Product_name Year sum_value
<fctr> <fctr> <int> <int>
1 Denmark Apple 1987 887
2 Denmark Pie 1987 320
3 Japan Juice 1987 650
4 Japan Pie 1987 544
5 Sweden Apple 1987 344
6 Sweden Banana 1987 310
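If you literally want only the nth row rather than the top n (an assumption about the intent), slice also takes a single index:
n <- 2  # the nth most sold product per country and year

df %>%
  group_by(Country, Product_name, Year) %>%
  summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Country, Year) %>%
  arrange(desc(sum_value), .by_group = TRUE) %>%
  slice(n)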

State FIPS, county FIPS AND FIPS to latitude longitude?

I have a dataset that looks like this, with 600 columns:
COUNTY_NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS Year
Boone Illinois 17 007 17007 2010
Bureau Illinois 17 011 17011 2008
Champaign Illinois 17 019 17019 2010
Cook Illinois 17 031 17031 2006
I need to get the centroids of the smallest possible unit/area (counties?) for further analysis.
Is it possible to get this information as latitude/longitude in R?
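One possible approach (a sketch, assuming the sf and tigris packages; neither appears in the original question) is to download county polygons and compute their centroids:
library(sf)
library(tigris)

cnty <- counties(cb = TRUE)              # generalized US county boundaries as an sf object
cent <- st_centroid(st_geometry(cnty))   # one centroid point per county
xy   <- st_coordinates(cent)             # longitude/latitude matrix

lookup <- data.frame(FIPS = cnty$GEOID,  # 5-digit state+county FIPS (character)
                     long = xy[, 1],
                     lat  = xy[, 2])
# merge onto the original data by the 5-digit FIPS code;
# note GEOID is character, so pad or convert a numeric FIPS column first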

Maps, ggplot2, fill by state is missing certain areas on the map

I am working with maps and ggplot2 to visualize the number of certain crimes in each state for different years. The data set that I am working with was produced by the FBI and can be downloaded from their site or from here (if you don't want to download the dataset I don't blame you, but it is way too big to copy and paste into this question, and including a fraction of the data set wouldn't help, as there wouldn't be enough information to recreate the graph).
The problem is easier seen than described: California is missing a large chunk, and so are a few other states. Here is the code that produced this plot:
# load libraries
library(maps)
library(ggplot2)
# load data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
states <- map_data("state")
# merge data sets by region
fbi$region <- tolower(fbi$state)
fbimap <- merge(fbi, states, by="region")
# plot robbery numbers by state for year 2012
fbimap12 <- subset(fbimap, Year == 2012)
qplot(long, lat, geom="polygon", data=fbimap12,
facets=~Year, fill=Robbery, group=group)
This is what the states data looks like:
long lat group order region subregion
1 -87.46201 30.38968 1 1 alabama <NA>
2 -87.48493 30.37249 1 2 alabama <NA>
3 -87.52503 30.37249 1 3 alabama <NA>
4 -87.53076 30.33239 1 4 alabama <NA>
5 -87.57087 30.32665 1 5 alabama <NA>
6 -87.58806 30.32665 1 6 alabama <NA>
And this is what the fbi data looks like:
Year Population Violent Property Murder Forcible.Rape Robbery
1 1960 3266740 6097 33823 406 281 898
2 1961 3302000 5564 32541 427 252 630
3 1962 3358000 5283 35829 316 218 754
4 1963 3347000 6115 38521 340 192 828
5 1964 3407000 7260 46290 316 397 992
6 1965 3462000 6916 48215 395 367 992
Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft abbr state region
1 4512 11626 19344 2853 AL Alabama alabama
2 4255 11205 18801 2535 AL Alabama alabama
3 3995 11722 21306 2801 AL Alabama alabama
4 4755 12614 22874 3033 AL Alabama alabama
5 5555 15898 26713 3679 AL Alabama alabama
6 5162 16398 28115 3702 AL Alabama alabama
I then merged the two sets along region. The subset I am trying to plot is
region Year Robbery long lat group
8283 alabama 2012 5020 -87.46201 30.38968 1
8284 alabama 2012 5020 -87.48493 30.37249 1
8285 alabama 2012 5020 -87.95475 30.24644 1
8286 alabama 2012 5020 -88.00632 30.24071 1
8287 alabama 2012 5020 -88.01778 30.25217 1
8288 alabama 2012 5020 -87.52503 30.37249 1
... ... ... ...
Any ideas on how I can create this plot without those ugly missing spots?
I played with your code, and one thing I can tell is that something happened when you used merge. I drew the states map using geom_path and confirmed there were a couple of weird lines that do not exist in the original map data. I then investigated further by comparing merge and inner_join, which do the same job here. However, I found a difference: with merge the row order changed, so the points were no longer in the right sequence, while this was not the case with inner_join. You will see a bit of the California data below. Your approach was right, but merge somehow did not work in your favour. I am not sure why the function changed the order, though.
library(dplyr)
### Call US map polygon
states <- map_data("state")
### Get crime data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
fbi$state <- tolower(fbi$state)
### Check if both files have identical state names: The answer is NO
### states$region does not have Alaska, Hawaii, and Washington D.C.
### fbi$state does not have District of Columbia.
setdiff(fbi$state, states$region)
#[1] "alaska" "hawaii" "washington d. c."
setdiff(states$region, fbi$state)
#[1] "district of columbia"
### Select data for 2012 and choose two columns (i.e., state and Robbery)
fbi2 <- fbi %>%
  filter(Year == 2012) %>%
  select(state, Robbery)
Now I created two data frames with merge and inner_join.
### Create two data frames with merge and inner_join
ana <- merge(fbi2, states, by.x = "state", by.y = "region")
bob <- inner_join(fbi2, states, by = c("state" ="region"))
ana %>%
  filter(state == "california") %>%
  slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -119.8685 38.90956 4 676 <NA>
#2 california 56521 -119.5706 38.69757 4 677 <NA>
#3 california 56521 -119.3299 38.53141 4 678 <NA>
#4 california 56521 -120.0060 42.00927 4 667 <NA>
#5 california 56521 -120.0060 41.20139 4 668 <NA>
bob %>%
  filter(state == "california") %>%
  slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -120.0060 42.00927 4 667 <NA>
#2 california 56521 -120.0060 41.20139 4 668 <NA>
#3 california 56521 -120.0060 39.70024 4 669 <NA>
#4 california 56521 -119.9946 39.44241 4 670 <NA>
#5 california 56521 -120.0060 39.31636 4 671 <NA>
ggplot(data = bob, aes(x = long, y = lat, fill = Robbery, group = group)) +
  geom_polygon()
The problem is in the order of arguments to merge:
fbimap <- merge(fbi, states, by = "region")
has the thematic data first and the geo data second. With the order switched,
fbimap <- merge(states, fbi, by = "region")
the polygons should all close up.
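Alternatively, since map_data() records the original drawing sequence in its order column (visible in the states data shown above), you can keep merge() as-is and restore the sequence explicitly afterwards:
fbimap <- merge(fbi, states, by = "region")
fbimap <- fbimap[order(fbimap$order), ]  # restore the polygon drawing order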
