Split a data frame column based on a comma [duplicate] - r

I have a data frame with the following structure, titled "final_proj_data"
ID County Population Year
<dbl> <chr> <dbl> <dbl>
1003 Baldwin County, Alabama 169162 2006
1015 Calhoun County, Alabama 112903 2006
1043 Cullman County, Alabama 80187 2006
1049 DeKalb County, Alabama 68014 2006
I am trying to split the column County into two different columns, County and State, and remove the comma.
I tried a number of permutations of the separate() function but I keep getting back this error:
Error: var must evaluate to a single number or a column name, not a
character vector
I've tried (among others)
final_proj_data %>%
separate(final_proj_data$County, c("State", "County"), sep = ",", remove = TRUE)
final_proj_data %>%
separate(data = final_proj_data, col = County,
into = c("State", "County"), sep = ",")
I'm not sure what I am doing wrong, or why the "col =" keeps throwing this error. Any help would be appreciated!

Using dplyr and base R:
library(dplyr)
final_proj_data %>%
mutate(State=unlist(lapply(strsplit(County,", "),function(x) x[2])),
County=gsub(",.*","",County))
ID County Population Year State
1 1003 Baldwin County 169162 2006 Alabama
2 1015 Calhoun County 112903 2006 Alabama
3 1043 Cullman County 80187 2006 Alabama
4 1049 DeKalb County 68014 2006 Alabama
Original:
With dplyr and tidyr (I just saw that @Ronak Shah commented the same above):
library(dplyr)
library(tidyr)
final_proj_data %>%
separate(County, c("County", "State"), sep = ", ")
ID County State Population Year
1 1003 Baldwin County Alabama 169162 2006
2 1015 Calhoun County Alabama 112903 2006
3 1043 Cullman County Alabama 80187 2006
4 1049 DeKalb County Alabama 68014 2006
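For reference, both attempts in the question run into how separate() takes its arguments. col must be a bare column name (or a position), so final_proj_data$County, which is a character vector, triggers the error shown. And because the pipe already passes the data frame as the first argument, supplying data = final_proj_data as well pushes the piped object into a different argument. A minimal sketch of a working call, assuming tidyr is installed and using a two-row copy of the sample data (the regex separator ",\\s*" is a choice made here so that the space after the comma is dropped as well):

```r
library(tidyr)

# Two-row copy of the sample data from the question
final_proj_data <- data.frame(
  ID = c(1003, 1015),
  County = c("Baldwin County, Alabama", "Calhoun County, Alabama"),
  Population = c(169162, 112903),
  Year = c(2006, 2006),
  stringsAsFactors = FALSE
)

# col is given as a bare column name; the data frame is passed exactly once
result <- separate(
  final_proj_data,
  County,
  into = c("County", "State"),
  sep = ",\\s*"   # comma plus any following whitespace
)

print(result)
```

The names in into are matched to the pieces in order, so "County" has to come first to receive the text before the comma.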

We can try using sub here for a base R option:
County <- sub(",.*$", "", final_proj_data$County)
State <- sub("^.*,\\s*", "", final_proj_data$County)
final_proj_data$County <- County
final_proj_data$State <- State

We can do this in base R using read.csv
final_proj_data[c("County", "State")] <- read.csv(text = final_proj_data$County,
header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
final_proj_data
# ID County Population Year State
#1 1003 Baldwin County 169162 2006 Alabama
#2 1015 Calhoun County 112903 2006 Alabama
#3 1043 Cullman County 80187 2006 Alabama
#4 1049 DeKalb County 68014 2006 Alabama
data
final_proj_data <- structure(list(ID = c(1003L, 1015L, 1043L, 1049L),
County = c("Baldwin County, Alabama",
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c(169162L, 112903L, 80187L, 68014L), Year = c(2006L,
2006L, 2006L, 2006L)), class = "data.frame", row.names = c(NA,
-4L))

We can use strsplit in base R.
cbind(d, `colnames<-`(do.call(rbind, strsplit(d$County, ", ")), c("County", "State")))[-2]
# ID Population Year County State
# 1 1003 169162 2006 Baldwin County Alabama
# 2 1015 112903 2006 Calhoun County Alabama
# 3 1043 80187 2006 Cullman County Alabama
# 4 1049 68014 2006 DeKalb County Alabama
Note: Use strsplit(as.character(d$County), ", ") if County is a factor column.
Data
d <- structure(list(ID = c("1003", "1015", "1043", "1049"), County = c("Baldwin County, Alabama",
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c("169162", "112903", "80187", "68014"), Year = c("2006",
"2006", "2006", "2006")), row.names = c(NA, -4L), class = "data.frame")
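The one-liner above packs three steps into a single expression. A sketch that spells them out, using a two-row version of d and assuming every County value contains exactly one ", " (the intermediate names parts, m and out are introduced here for illustration only):

```r
# Two-row version of the data used in this answer
d <- data.frame(
  ID = c("1003", "1015"),
  County = c("Baldwin County, Alabama", "Calhoun County, Alabama"),
  stringsAsFactors = FALSE
)

# 1. Split each string on ", " into a list of length-2 character vectors
parts <- strsplit(d$County, ", ", fixed = TRUE)

# 2. Stack the list elements row-wise into a two-column character matrix
m <- do.call(rbind, parts)
colnames(m) <- c("County", "State")

# 3. Drop the old combined column and bind the two new columns
out <- cbind(d[setdiff(names(d), "County")], m, stringsAsFactors = FALSE)

print(out)
```

If some values had no comma or more than one, the pieces would have different lengths and rbind would recycle them with a warning, so it is worth checking lengths(parts) first on real data.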


Obtaining the household median income from ACS data

The code below returns exactly what I need: the household median income for each PUMA using the 2019 ACS (1-year). However, the state names are missing. I tried the option state="all", but it did not work. How can I obtain my data of interest by state and PUMA?
Thanks,
NM
PUMA_level <- get_acs(geography = "puma",
variable = "B19013_001",
survey = "acs1",
# state="all",
year = 2019)
Using the usmap::fips_info function you could get a list of state codes, names and abbreviations which you could then merge to your census data like so:
library(tidycensus)
library(usmap)
PUMA_level <- get_acs(geography = "puma",
variable = "B19013_001",
survey = "acs1",
year = 2019,
keep_geo_vars = TRUE)
#> Getting data from the 2019 1-year ACS
#> The 1-year ACS provides data for geographies with populations of 65,000 and greater.
PUMA_level$fips <- substr(PUMA_level$GEOID, 1, 2)
states <- usmap::fips_info(unique(PUMA_level$fips))
#> Warning in get_fips_info(fips_, sortAndRemoveDuplicates): FIPS code(s) 72 not
#> found
PUMA_level <- merge(PUMA_level, states, by = "fips")
head(PUMA_level)
#> fips GEOID
#> 1 01 0100100
#> 2 01 0100200
#> 3 01 0100302
#> 4 01 0100400
#> 5 01 0100500
#> 6 01 0100301
#> NAME
#> 1 Lauderdale, Colbert, Franklin & Marion (Northeast) Counties PUMA; Alabama
#> 2 Limestone & Madison (Outer) Counties--Huntsville City (Far West & Southwest) PUMA, Alabama
#> 3 Huntsville City (Central & South) PUMA, Alabama
#> 4 DeKalb & Jackson Counties PUMA, Alabama
#> 5 Marshall & Madison (Southeast) Counties--Huntsville City (Far Southeast) PUMA, Alabama
#> 6 Huntsville (North) & Madison (East) Cities PUMA, Alabama
#> variable estimate moe abbr full
#> 1 B19013_001 46449 3081 AL Alabama
#> 2 B19013_001 74518 6371 AL Alabama
#> 3 B19013_001 51884 5513 AL Alabama
#> 4 B19013_001 43406 3557 AL Alabama
#> 5 B19013_001 56276 3216 AL Alabama
#> 6 B19013_001 63997 5816 AL Alabama
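One thing to watch in the output above: the warning says FIPS code 72 (Puerto Rico) was not found, and merge() keeps only matching rows by default, so those PUMAs are dropped from the result. A small sketch of the same prefix-and-merge idea that keeps unmatched rows by using all.x = TRUE. The data frames here are hypothetical miniatures: only the first GEOID and estimate come from the output above, the other rows are made up, and the hand-written state_lookup stands in for whatever lookup table is actually used.

```r
# Hypothetical miniature of the tidycensus result
puma <- data.frame(
  GEOID = c("0100100", "0600101", "7200101"),  # 2nd and 3rd GEOIDs are made up
  estimate = c(46449, 70000, 20000),           # 2nd and 3rd estimates are made up
  stringsAsFactors = FALSE
)

# Hand-written lookup that, like the one in the answer, lacks code 72
state_lookup <- data.frame(
  fips = c("01", "06"),
  full = c("Alabama", "California"),
  stringsAsFactors = FALSE
)

# The first two characters of a PUMA GEOID are the state FIPS code
puma$fips <- substr(puma$GEOID, 1, 2)

# all.x = TRUE keeps PUMAs whose state code is missing from the lookup
out <- merge(puma, state_lookup, by = "fips", all.x = TRUE)

print(out)
```

Rows without a match get NA in the state column, which makes the gap visible instead of silently shrinking the data.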

R duplicate rows based on the elements in a string column [duplicate]

I have a more or less specific question that probably pertains to loops in R. I have a dataframe:
X location year
1 North Dakota, Minnesota, Michigan 2011
2 California, Tennessee 2012
3 Bastrop County (Texas) 2013
4 Dallas (Texas) 2014
5 Shasta (California) 2015
6 California, Oregon, Washington 2011
I have two problems with this data: 1) I need a column that consists of just the state names of each row. I guess this should be generally easy with gsub and using a list of all US state names.
list <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "etc")
pat <- paste0("\\b(", paste0(list, collapse="|"), ")\\b")
pat
data$state <- gsub(data$location, "", paragraph)
The bigger issue for me is 2) I need an individual (duplicate) row for each state that is in the dataset. So if row 6 has California, Oregon and Washington in 2011, I need a separate row for each one, like this:
X location year
1 California 2011
2 Oregon 2011
3 Washington 2011
Thank you for your help!
You can use str_extract_all to extract all the states and unnest to duplicate the rows so that each state ends up in a separate row. R has a built-in constant, state.name, which holds the names of the US states and can be used here to create the pattern.
library(dplyr)
pat <- paste0("\\b", state.name, "\\b", collapse = "|")
df %>%
mutate(states = stringr::str_extract_all(location, pat)) %>%
tidyr::unnest(states)
# A tibble: 11 x 3
# location year states
# <chr> <int> <chr>
# 1 North Dakota, Minnesota, Michigan 2011 North Dakota
# 2 North Dakota, Minnesota, Michigan 2011 Minnesota
# 3 North Dakota, Minnesota, Michigan 2011 Michigan
# 4 California, Tennessee 2012 California
# 5 California, Tennessee 2012 Tennessee
# 6 Bastrop County (Texas) 2013 Texas
# 7 Dallas (Texas) 2014 Texas
# 8 Shasta (California) 2015 California
# 9 California, Oregon, Washington 2011 California
#10 California, Oregon, Washington 2011 Oregon
#11 California, Oregon, Washington 2011 Washington
data
df <- structure(list(location = c("North Dakota, Minnesota, Michigan",
"California, Tennessee", "Bastrop County (Texas)", "Dallas (Texas)",
"Shasta (California)", "California, Oregon, Washington"), year = c(2011L,
2012L, 2013L, 2014L, 2015L, 2011L)), class = "data.frame", row.names = c(NA, -6L))
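The same extract-then-duplicate idea can also be sketched in base R, in case stringr and tidyr are not available. This is an alternative written for illustration, not the answer's method: regmatches() with gregexpr() collects every match per row, and rep() repeats each row index once per match. The names hits and out are introduced here.

```r
# Two rows taken from the sample data
df <- data.frame(
  location = c("North Dakota, Minnesota, Michigan", "Dallas (Texas)"),
  year = c(2011L, 2014L),
  stringsAsFactors = FALSE
)

# One alternation of all state names, bounded by word boundaries
pat <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

# List with one character vector of matched states per row
hits <- regmatches(df$location, gregexpr(pat, df$location, perl = TRUE))

# Repeat each row once per matched state, then attach the states
out <- df[rep(seq_len(nrow(df)), lengths(hits)), ]
out$states <- unlist(hits)
rownames(out) <- NULL

print(out)
```

Rows in which no state name is found are repeated zero times and therefore disappear from out.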

Combining Rows in R with Pivot or Spread?

I am manipulating election data that is currently in the following format. Both a coded and a visual example are included (the visual one is slightly condensed). The values have been edited from their originals.
# Representative Example
library(tidyverse)
test.df <- tibble(yr=rep(1956),mn=rep(11),
sub=rep("Alabama"),
unit_type=rep("County"),
unit_name=c("Autauga","Baldwin","Barbour"),
TotalVotes=c(1000,2000,3000),
RepVotes=c(500,1000,1500),
RepCandidate=rep("Eisenhower"),
DemVotes=c(500,1000,1500),
DemCandidate=rep("Stevenson"),
ThirdVotes=c(0,0,0),
ThirdCandidate=rep("Uncommitted"),
RepVotesTotalPerc=rep(50.00),
DemVotesTotalPerc=rep(50.00),
ThirdVotesTotalPerc=rep(0.00)
)
----------------------------------------------------------------------------------------------------
yr | mn | sub | unit_type | unit_name | TotalVotes | RepVotes | RepCan | DemVotes | DemCan
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga 1000 500 Eisenhower 500 Stevenson
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Baldwin 2000 1000 Eisenhower 1000 Stevenson
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Barbour 3000 1500 Eisenhower 1500 Stevenson
----------------------------------------------------------------------------------------------------
I am trying to get a table that looks like the following:
----------------------------------------------------------------------------------------------------
yr | mn | sub | unit_type | unit_name | pty_n | can | TotalVotes | CanVotes
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Republican Eisenhower 1000 500
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Democrat Stevenson 1000 500
----------------------------------------------------------------------------------------------------
1956 11 Alabama County Autauga Independent Uncommitted 1000 0
----------------------------------------------------------------------------------------------------
# and etc. for other counties in example (Baldwin, Barbour, etc)
As you can see, I pretty much want three observations per county, where candidates are all in one column, as well as their respective votes in another (CanVotes, or the like).
I have tried using things like pivot_longer() or spread(), but I am having a hard time working out how to express this in code. Any help with reshaping my data to get a candidate column, while carrying the rest of the data along with it, would be greatly appreciated!
Here is a solution that first uses pivot_longer to bring the Votes into a long format. Then I use mutate with case_when to substitute the former column names with the actual candidate names and delete the single candidate columns:
long_table <- pivot_longer(test.df,
cols = c(RepVotes, DemVotes, ThirdVotes),
names_to = "pty_n",
values_to = "CanVotes") %>%
mutate(can = case_when(
pty_n == "RepVotes" ~ RepCandidate,
pty_n == "DemVotes" ~ DemCandidate,
pty_n == "ThirdVotes" ~ ThirdCandidate
),
pty_n = case_when(
pty_n == "RepVotes" ~ "Republican",
pty_n == "DemVotes" ~ "Democrat",
pty_n == "ThirdVotes" ~ "Independent"
)) %>%
select(-c(RepCandidate, DemCandidate, ThirdCandidate))
# A tibble: 9 x 12
yr mn sub unit_type unit_name TotalVotes RepVotesTotalPerc DemVotesTotalPerc ThirdVotesTotalPe~ pty_n CanVotes can
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1956 11 Alabama County Autauga 1000 50 50 0 Republican 500 Eisenhower
2 1956 11 Alabama County Autauga 1000 50 50 0 Democrat 500 Stevenson
3 1956 11 Alabama County Autauga 1000 50 50 0 Independe~ 0 Uncommitt~
4 1956 11 Alabama County Baldwin 2000 50 50 0 Republican 1000 Eisenhower
5 1956 11 Alabama County Baldwin 2000 50 50 0 Democrat 1000 Stevenson
6 1956 11 Alabama County Baldwin 2000 50 50 0 Independe~ 0 Uncommitt~
7 1956 11 Alabama County Barbour 3000 50 50 0 Republican 1500 Eisenhower
8 1956 11 Alabama County Barbour 3000 50 50 0 Democrat 1500 Stevenson
9 1956 11 Alabama County Barbour 3000 50 50 0 Independe~ 0 Uncommitt~
I tried to build a custom spec, but it seems that the names have to be derived from the column names and can't be directly conditional on other columns.
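As a possible alternative to the case_when step, pivot_longer() can split each column name into a party prefix and a value name by combining names_pattern with the special ".value" entry in names_to. The sketch below assumes tidyr 1.0.0 or later and uses a cut-down version of test.df; the regular expression is written here for these particular column names and deliberately excludes the ...TotalPerc columns through the trailing $.

```r
library(tidyr)

# Cut-down version of test.df with two counties
test.df <- data.frame(
  unit_name = c("Autauga", "Baldwin"),
  TotalVotes = c(1000, 2000),
  RepVotes = c(500, 1000), RepCandidate = "Eisenhower",
  DemVotes = c(500, 1000), DemCandidate = "Stevenson",
  ThirdVotes = c(0, 0), ThirdCandidate = "Uncommitted",
  stringsAsFactors = FALSE
)

# Prefix (Rep/Dem/Third) goes to pty_n; the suffix names the output column
long <- pivot_longer(
  test.df,
  cols = matches("^(Rep|Dem|Third)(Votes|Candidate)$"),
  names_to = c("pty_n", ".value"),
  names_pattern = "^(Rep|Dem|Third)(Votes|Candidate)$"
)

# Rename the two value columns to the names used in the question
names(long)[names(long) == "Votes"] <- "CanVotes"
names(long)[names(long) == "Candidate"] <- "can"

print(long)
```

pty_n then holds the prefixes "Rep", "Dem" and "Third", which can be recoded to full party names afterwards if needed.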
Here is a data.table go at things
library( data.table )
#convert data to the data.table-format
setDT( test.df )
#get the different parties, used to update the pty_n variable later on
parties <- gsub( "Candidate", "", grep( "^.*Candidate$", names( test.df ), value = TRUE ) )
#melt to each candidate and his/her votes
DT.melt <- melt(test.df,
id.vars = c("yr", "mn", "sub", "unit_type", "unit_name"),
measure.vars = patterns( can = "^.*Candidate$",
canVotes = "^(Rep|Dem|Third)Votes$" ),
variable.name = "pty_n"
)
#get the totals from the original data (by unit_name) through joining
DT.melt[ test.df, TotalVotes := i.TotalVotes, on = .(unit_name)]
#and pass the correct party name to the pty_n column
DT.melt[, pty_n := parties[ pty_n ] ][]
# yr mn sub unit_type unit_name pty_n can canVotes TotalVotes
# 1: 1956 11 Alabama County Autauga Rep Eisenhower 500 1000
# 2: 1956 11 Alabama County Baldwin Rep Eisenhower 1000 2000
# 3: 1956 11 Alabama County Barbour Rep Eisenhower 1500 3000
# 4: 1956 11 Alabama County Autauga Dem Stevenson 500 1000
# 5: 1956 11 Alabama County Baldwin Dem Stevenson 1000 2000
# 6: 1956 11 Alabama County Barbour Dem Stevenson 1500 3000
# 7: 1956 11 Alabama County Autauga Third Uncommitted 0 1000
# 8: 1956 11 Alabama County Baldwin Third Uncommitted 0 2000
# 9: 1956 11 Alabama County Barbour Third Uncommitted 0 3000

combining observations based on a criteria in R [duplicate]

I have a data set that looks like this:
geoid zip dealers Year County
1001 36703 1 2001 Autauga County, AL
1001 36704 3 2001 Autauga County, AL
1003 36535 7 2000 Baldwin County, AL
1003 36536 3 2000 Baldwin County, AL
And I want to take all the rows that are the same except for 'dealers' and 'zip' and combine them into one row with the dealer variable summed from all the similar rows. (I'm not sure what the easiest thing is to do with zip, either list them all or leave it out? Doesn't really matter.) So this would become:
geoid dealers Year County
1001 4 2001 Autauga County, AL
1003 10 2000 Baldwin County, AL
Is there any way to create a new dataset like this? (Incidentally, I got here by merging three datasets, so if there's a better way to merge without creating these duplicates, that would be helpful as well.)
This gives you the desired result:
df <- read.table(header=TRUE, text=
'geoid zip dealers Year County
1001 36703 1 2001 "Autauga County, AL"
1001 36704 3 2001 "Autauga County, AL"
1003 36535 7 2000 "Baldwin County, AL"
1003 36536 3 2000 "Baldwin County, AL"')
aggregate(dealers ~ geoid+Year+County, data=df[-2], FUN=sum) # or
aggregate(dealers ~ geoid+Year+County, data=df, FUN=sum)
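The question also leaves open what to do with zip. If listing the ZIP codes is preferred over dropping them, one possible sketch is to aggregate twice, once summing dealers and once pasting the zips together, and then merge the two results on the grouping columns. The names sums, zips and out are introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  geoid = c(1001L, 1001L, 1003L, 1003L),
  zip = c(36703L, 36704L, 36535L, 36536L),
  dealers = c(1L, 3L, 7L, 3L),
  Year = c(2001L, 2001L, 2000L, 2000L),
  County = c("Autauga County, AL", "Autauga County, AL",
             "Baldwin County, AL", "Baldwin County, AL"),
  stringsAsFactors = FALSE
)

# Sum dealers within each geoid/Year/County combination
sums <- aggregate(dealers ~ geoid + Year + County, data = df, FUN = sum)

# Collapse the ZIP codes of each combination into one string
zips <- aggregate(zip ~ geoid + Year + County, data = df,
                  FUN = function(z) paste(z, collapse = ", "))

# Combine both summaries on the grouping columns
out <- merge(sums, zips, by = c("geoid", "Year", "County"))

print(out)
```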

Maps, ggplot2, fill by state is missing certain areas on the map

I am working with maps and ggplot2 to visualize the number of certain crimes in each state for different years. The data set that I am working with was produced by the FBI and can be downloaded from their site or from here (if you don't want to download the dataset I don't blame you, but it is way too big to copy and paste into this question, and including a fraction of the data set wouldn't help, as there wouldn't be enough information to recreate the graph).
The problem is easier seen than described.
As you can see California is missing a large chunk as well as a few other states. Here is the code that produced this plot:
# load libraries
library(maps)
library(ggplot2)
# load data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
states <- map_data("state")
# merge data sets by region
fbi$region <- tolower(fbi$state)
fbimap <- merge(fbi, states, by="region")
# plot robbery numbers by state for year 2012
fbimap12 <- subset(fbimap, Year == 2012)
qplot(long, lat, geom="polygon", data=fbimap12,
facets=~Year, fill=Robbery, group=group)
This is what the states data looks like:
long lat group order region subregion
1 -87.46201 30.38968 1 1 alabama <NA>
2 -87.48493 30.37249 1 2 alabama <NA>
3 -87.52503 30.37249 1 3 alabama <NA>
4 -87.53076 30.33239 1 4 alabama <NA>
5 -87.57087 30.32665 1 5 alabama <NA>
6 -87.58806 30.32665 1 6 alabama <NA>
And this is what the fbi data looks like:
Year Population Violent Property Murder Forcible.Rape Robbery
1 1960 3266740 6097 33823 406 281 898
2 1961 3302000 5564 32541 427 252 630
3 1962 3358000 5283 35829 316 218 754
4 1963 3347000 6115 38521 340 192 828
5 1964 3407000 7260 46290 316 397 992
6 1965 3462000 6916 48215 395 367 992
Aggravated.Assault Burglary Larceny.Theft Vehicle.Theft abbr state region
1 4512 11626 19344 2853 AL Alabama alabama
2 4255 11205 18801 2535 AL Alabama alabama
3 3995 11722 21306 2801 AL Alabama alabama
4 4755 12614 22874 3033 AL Alabama alabama
5 5555 15898 26713 3679 AL Alabama alabama
6 5162 16398 28115 3702 AL Alabama alabama
I then merged the two sets along region. The subset I am trying to plot is
region Year Robbery long lat group
8283 alabama 2012 5020 -87.46201 30.38968 1
8284 alabama 2012 5020 -87.48493 30.37249 1
8285 alabama 2012 5020 -87.95475 30.24644 1
8286 alabama 2012 5020 -88.00632 30.24071 1
8287 alabama 2012 5020 -88.01778 30.25217 1
8288 alabama 2012 5020 -87.52503 30.37249 1
... ... ... ...
Any ideas on how I can create this plot without those ugly missing spots?
I played with your code, and something goes wrong at the merge step. I drew the state map with geom_path and confirmed that there were a couple of odd lines that do not exist in the original map data. I then investigated further by comparing merge and inner_join. Both perform the same join here, but I found one difference: with merge the row order changed, so the values in the order column were no longer in sequence. This was not the case with inner_join. You will see a sample of the California data below. Your approach was right, but merge did not work in your favour. I am not sure why the function changed the order, though.
library(dplyr)
### Call US map polygon
states <- map_data("state")
### Get crime data
fbi <- read.csv("http://www.hofroe.net/stat579/crimes-2012.csv")
fbi <- subset(fbi, state != "United States")
fbi$state <- tolower(fbi$state)
### Check if both files have identical state names: The answer is NO
### states$region does not have Alaska, Hawaii, and Washington D.C.
### fbi$state does not have District of Columbia.
setdiff(fbi$state, states$region)
#[1] "alaska" "hawaii" "washington d. c."
setdiff(states$region, fbi$state)
#[1] "district of columbia"
### Select data for 2012 and choose two columns (i.e., state and Robbery)
fbi2 <- fbi %>%
filter(Year == 2012) %>%
select(state, Robbery)
Now I created two data frames with merge and inner_join.
### Create two data frames with merge and inner_join
ana <- merge(fbi2, states, by.x = "state", by.y = "region")
bob <- inner_join(fbi2, states, by = c("state" ="region"))
ana %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -119.8685 38.90956 4 676 <NA>
#2 california 56521 -119.5706 38.69757 4 677 <NA>
#3 california 56521 -119.3299 38.53141 4 678 <NA>
#4 california 56521 -120.0060 42.00927 4 667 <NA>
#5 california 56521 -120.0060 41.20139 4 668 <NA>
bob %>%
filter(state == "california") %>%
slice(1:5)
# state Robbery long lat group order subregion
#1 california 56521 -120.0060 42.00927 4 667 <NA>
#2 california 56521 -120.0060 41.20139 4 668 <NA>
#3 california 56521 -120.0060 39.70024 4 669 <NA>
#4 california 56521 -119.9946 39.44241 4 670 <NA>
#5 california 56521 -120.0060 39.31636 4 671 <NA>
ggplot(data = bob, aes(x = long, y = lat, fill = Robbery, group = group)) +
geom_polygon()
The problem is in the order of arguments to merge
fbimap <- merge(fbi, states, by="region")
has the thematic data first and the geo data second. Switching the order with
fbimap <- merge(states, fbi, by="region")
the polygons should all close up.
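Whichever join is used, geom_polygon() connects the vertices in the order the rows appear, so a defensive option is to re-sort the joined data by the order column before plotting. A self-contained sketch with toy data; the regions "alpha" and "beta" and the crime data frame are hypothetical stand-ins for map_data("state") and the FBI data.

```r
# Toy stand-in for map_data("state"): two regions, three vertices each
states <- data.frame(
  region = rep(c("alpha", "beta"), each = 3),
  long = c(0, 1, 1, 2, 3, 3),
  lat = c(0, 0, 1, 0, 0, 1),
  group = rep(1:2, each = 3),
  order = 1:6,
  stringsAsFactors = FALSE
)

# Toy stand-in for the crime data, listed in a different region order
crime <- data.frame(
  region = c("beta", "alpha"),
  Robbery = c(20, 10),
  stringsAsFactors = FALSE
)

merged <- merge(crime, states, by = "region")

# Restore the vertex sequence before handing the data to geom_polygon()
merged <- merged[order(merged$order), ]

print(merged)
```

With the real data the same two lines would apply to fbimap, since map_data() supplies an order column for exactly this purpose.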
