I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.
tail(df$states_county_MSA,n=10)
[1] "Iowa Polk Des Moines"
[2] "Mississippi Hinds Jackson"
[3] "Georgia Richmond Augusta-Richmond"
[4] "Ohio Mahoning Youngstown-Warren-Boardman"
[5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[6] "Pennsylvania Dauphin Harrisburg-Carlisle"
[7] "Florida Brevard Palm Bay-Melbourne-Titusville"
[8] "Utah Utah Provo-Orem"
[9] "Tennessee Hamilton Chattanooga"
[10] "North Carolina Durham Durham"
By modifying @jared_mamrot's solution to a similar question (splitting a state-county variable into distinct state and county variables, posted below; the full problem is at "Extracting states and counties from state-county character variable" for reference), I can split the states_county_MSA variable into two variables: states and a combined county-MSA variable.
library(tidyverse)
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df %>%
mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
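The same idea can be sketched in base R. This is only a sketch, not the linked answer's code: it anchors the alternation at the start of the string, so a state name is only matched as the leading field, and it trims the space left in front of the county. The vector x is made up for illustration.

```r
# state.name ships with base R's datasets package; DC is added by hand.
states <- c(state.name, "District of Columbia")
# Anchor at the start so that a state is only matched as the leading field.
pattern <- paste0("^(", paste(states, collapse = "|"), ")")

# Hypothetical inputs, in the same "state county" layout as above.
x <- c("West Virginia Kanawha",
       "District of Columbia District of Columbia",
       "New York Bronx")

state  <- regmatches(x, regexpr(pattern, x))      # leading state name
county <- trimws(substring(x, nchar(state) + 1))  # whatever follows it
data.frame(state, county)
```

regmatches() silently drops elements that do not match, so this assumes every string starts with a known state name.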
However, I cannot decompose states_county_MSA any further because I cannot find a built-in list of county or MSA names. I could not get county.names to work, and I tried the tigris, censusapi and maps packages but was unable to generate a vector of US county names for the string split/extract step.
> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found
I was thinking of using the word() function, but MSA names are not standardized either (they can be one or more words).
Would anyone know an efficient way to split the county-MSA variable?
EDIT - Data with (space) delimiter {state, county, MSA, MSA population, month, year}.
[1] "Virginia Richmond Richmond 1,210,063 8 2014"
[2] "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794"
[3] "North Carolina Wake Raleigh-Cary 1,137,346 6 2014"
[4] "New York Erie Buffalo-Niagara Falls 1,135,342"
[5] "Alabama Jefferson Birmingham-Hoover 1,129,034"
[6] "Utah Salt Lake Salt Lake City 1,091,432 5 2014"
[7] "New York Monroe Rochester 1,080,082"
[8] "Michigan Kent Grand Rapids-Wyoming 989,205 7 2014"
[9] "Arizona Pima Tucson 981,935 10 2013"
[10] "Hawaii Honolulu Honolulu 956,336 8 2013"
I think this should work (map_data("county") needs the maps package installed):
library(tidyverse)
data <- tibble::tribble(~state_county_msa,
"Iowa Polk Des Moines" ,
"Mississippi Hinds Jackson" ,
"Georgia Richmond Augusta-Richmond" ,
"Ohio Mahoning Youngstown-Warren-Boardman" ,
"Pennsylvania Lackawanna Scranton--Wilkes-Barre",
"Pennsylvania Dauphin Harrisburg-Carlisle" ,
"Florida Brevard Palm Bay-Melbourne-Titusville" ,
"Utah Utah Provo-Orem" ,
"Tennessee Hamilton Chattanooga" ,
"North Carolina Durham Durham")
state_county <- ggplot2::map_data("county") %>%
select(state = region,
county = subregion) %>%
as_tibble() %>%
mutate(across(everything(),str_to_title)) %>%
unite(state_county, c("state","county"), sep = " ", remove = FALSE) %>%
distinct(state_county, .keep_all = TRUE)
state_county_string <- paste(state_county$state_county, collapse = "|")
data %>%
mutate(state_county = str_extract(state_county_msa, state_county_string),
msa = str_trim(str_remove(state_county_msa, state_county_string))) %>%
left_join(state_county, by = "state_county") %>%
select(state, county, msa)
Output:
# A tibble: 10 × 3
state county msa
<chr> <chr> <chr>
1 Iowa Polk Des Moines
2 Mississippi Hinds Jackson
3 Georgia Richmond Augusta-Richmond
4 Ohio Mahoning Youngstown-Warren-Boardman
5 Pennsylvania Lackawanna Scranton--Wilkes-Barre
6 Pennsylvania Dauphin Harrisburg-Carlisle
7 Florida Brevard Palm Bay-Melbourne-Titusville
8 Utah Utah Provo-Orem
9 Tennessee Hamilton Chattanooga
10 North Carolina Durham Durham
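For the rows in the question's EDIT, which carry a population and sometimes a month and year after the MSA name, one possible first step is to peel the numeric tail off before applying the state/county split. A rough sketch in base R, assuming the population is always written with thousands separators (as in the rows shown) and that no MSA name contains such a number:

```r
# Two rows copied from the question's EDIT.
x <- c("Virginia Richmond Richmond 1,210,063 8 2014",
       "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794")

pop_pat <- "[0-9]{1,3}(,[0-9]{3})+"  # e.g. 1,210,063

# Text before the population: still "state county MSA" in one string.
names_part <- sub(paste0("\\s+", pop_pat, ".*$"), "", x)
# The population itself, converted to a number.
population <- as.numeric(gsub(",", "", regmatches(x, regexpr(pop_pat, x))))
# Anything after the population ("month year"), or "" when absent.
month_year <- sub(paste0("^.*", pop_pat, "\\s*"), "", x)

data.frame(names_part, population, month_year)
```

names_part can then go through the same state/county extraction as the original ten rows.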
Related
I have a variable below that I believe is separated by space.
[95] "Florida Volusia Deltona-Daytona Beach-Ormond Beach"
[96] "Iowa Polk Des Moines"
[97] "Mississippi Hinds Jackson"
[98] "Georgia Richmond Augusta-Richmond"
[99] "Ohio Mahoning Youngstown-Warren-Boardman"
[100] " Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[101] " Pennsylvania Dauphin Harrisburg-Carlisle"
[102] " Florida Brevard Palm Bay-Melbourne-Titusville"
[103] " Utah Utah Provo-Orem"
[104] " Tennessee Hamilton Chattanooga"
[105] " North Carolina Durham Durham"
I want to create three variables out of this string: state, county, and MSA. But the usual string-splitting commands are not working. I tried the stringi functions too, but they also fail to split the variable. I am not sure why this is happening, as the same commands work on simpler strings.
> strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df,"")
Error in strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df, :
non-character argument
> BK<-strsplit(as.character(BK_state_county_MSA$non_squished_states_county_MSA_names_df),"\\t")
> str(BK) #List of 0
list()
> stri_split(str=BK_state_county_MSA$non_squished_states_county_MSA_names_df, regex="\\t",n=3)
list()
> BK <-stri_split_lines(BK_state_county_MSA$non_squished_states_county_MSA_names_df)
list()
EDIT - The original data has 104 observations, but I am posting only 8 observations with dput command...
> dput(BK_state_county_MSA)
structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
Here's an option using stri_trim_left() from stringi and separate() from tidyr:
stri_trim_left() removes the leading whitespace from each string, which occurs in your data starting at [100]. You can then separate() the strings into the three columns state, county and MSA, which are separated by at least two spaces (sep = " {2,}").
Data
BK_state_county_MSA<- structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
Code
library(dplyr)   # mutate() and %>%
library(tidyr)   # separate()
library(stringi) # stri_trim_left()
BK_state_county_MSA %>% mutate(non_squished_state_county_MSA = stri_trim_left(non_squished_state_county_MSA)) %>%
separate(non_squished_state_county_MSA, into = c("state", "county", "MSA"), sep = " {2,}")
Output
state county MSA
1 New York Bronx New York-Wayne-White Plains
2 New York Kings New York-Wayne-White Plains
3 Pennsylvania Lackawanna Scranton--Wilkes-Barre
4 Pennsylvania Dauphin Harrisburg-Carlisle
5 Florida Brevard Palm Bay-Melbourne-Titusville
6 Utah Utah Provo-Orem
7 Tennessee Hamilton Chattanooga
8 North Carolina Durham Durham
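As for why every attempt in the question came back as an error or an empty list: the calls use the column name non_squished_states_county_MSA_names_df, while the dput() output shows a column called non_squished_state_county_MSA. If that mismatch is real (an assumption based only on the posted output), $ returns NULL and the results follow from that. A small sketch with a one-row stand-in data frame whose spacing is made up:

```r
# One-row stand-in for the posted data; the padding widths are assumed.
df <- data.frame(non_squished_state_county_MSA = " Utah     Utah     Provo-Orem",
                 stringsAsFactors = FALSE)

# A column name that does not exist gives NULL rather than an error.
missing_col <- df$non_squished_states_county_MSA_names_df
is.null(missing_col)

# as.character(NULL) is character(0), so there is nothing to split.
empty <- strsplit(as.character(missing_col), "\\t")
length(empty)

# Calling strsplit() on NULL directly raises an error instead.
msg <- tryCatch(strsplit(missing_col, ""), error = function(e) conditionMessage(e))

# With the real column name the split works.
ok <- strsplit(trimws(df$non_squished_state_county_MSA), " {2,}")[[1]]
ok
```

Running names(BK_state_county_MSA) before splitting is a quick way to check which column name is actually present.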
For further analysis it is probably useful to keep North with Carolina, New with York, and so on. A one-liner is always nice, but sometimes a few extra lines get you to a better place for moving forward. Consider playing around like this:
maddening_txt <- " North Carolina     Durham                 Durham"
strsplit(maddening_txt, split = '    ') # n space = 4L
[[1]]
[1] " North Carolina" " Durham" "" ""
[5] "" " Durham"
nchar(strsplit(maddening_txt, split = '    ')[[1]])
[1] 15 7 0 0 0 7
# you could throw in a which test for >0 for indexing here
strsplit(maddening_txt, split = '    ')[[1]][c(1,2,6)]
[1] " North Carolina" " Durham" " Durham"
string_vec <- strsplit(maddening_txt, split = '    ')[[1]][c(1,2,6)]
string_vec[1]
[1] " North Carolina"
string_vec[2:3]
[1] " Durham" " Durham"
Left as it is, a literal run of spaces like split = '    ' will make you ask later, 'How many spaces was that?'
So, more explicitly, using a regex quantifier (sc_msa is the renamed data given at the end of this answer):
> strsplit(sc_msa$entries[8], split='\\s{5,}')
[[1]]
[1] " North Carolina" "Durham" "Durham"
This also strips the padding in front of the county and MSA fields without recourse to trim, although the leading space on the state remains. And then the same split in a for loop over every entry:
sc_msa_lst <- list()
> for(i in 1:length(sc_msa$entries)) {
+ sc_msa_lst[[i]] <- strsplit(sc_msa$entries[i], split='\\s{5,}')
+ }
> sc_msa_lst
[[1]]
[[1]][[1]]
[1] "New York" "Bronx"
[3] "New York-Wayne-White Plains"
[[2]]
[[2]][[1]]
[1] "New York" "Kings"
[3] "New York-Wayne-White Plains"
[[3]]
[[3]][[1]]
[1] " Pennsylvania" "Lackawanna" "Scranton--Wilkes-Barre"
[[4]]
[[4]][[1]]
[1] " Pennsylvania" "Dauphin" "Harrisburg-Carlisle"
[[5]]
[[5]][[1]]
[1] " Florida" "Brevard"
[3] "Palm Bay-Melbourne-Titusville"
[[6]]
[[6]][[1]]
[1] " Utah" "Utah" "Provo-Orem"
[[7]]
[[7]][[1]]
[1] " Tennessee" "Hamilton" "Chattanooga"
[[8]]
[[8]][[1]]
[1] " North Carolina" "Durham" "Durham"
So our regex pattern, '\\s{5,}', works for the cases seen so far. Next, I have to remember how to make a data.frame... there's got to be a list-of-lists-to-data.frame question on SO somewhere that will help.
A completely unacceptable answer as it stands, but it shows the guts of what the one-liners are doing and gives you more control going forward.
Data, with renaming:
sc_msa <- structure(list(entries = c("New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem", " Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
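For the last step mentioned above, turning the list of split pieces into a data.frame, one common base R idiom is do.call(rbind, ...). A minimal sketch, assuming every entry splits into exactly three pieces; the two input strings use made-up padding widths, since the original spacing did not survive in the post:

```r
# Hypothetical entries with assumed padding (five spaces between fields).
entries <- c("New York     Bronx     New York-Wayne-White Plains",
             " Utah     Utah     Provo-Orem")

# Trim the ends first, then split on runs of five or more whitespace characters.
pieces <- strsplit(trimws(entries), "\\s{5,}")

# Stack the length-3 character vectors into a matrix, then convert.
sc_msa_df <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(sc_msa_df) <- c("state", "county", "msa")
sc_msa_df
```

If some entries split into a different number of pieces, rbind() will recycle values, so checking lengths(pieces) first is worthwhile.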
I have a list of university names input with spelling errors and inconsistencies. I need to match them against an official list of university names to link my data together.
I know fuzzy matching/join is my way to go, but I'm a bit lost on the correct method. Any help would be greatly appreciated.
d<-data.frame(name=c("University of New Yorkk", "The University of South
Carolina", "Syracuuse University", "University of South Texas",
"The University of No Carolina"), score = c(1,3,6,10,4))
y<-data.frame(name=c("University of South Texas", "The University of North
Carolina", "University of South Carolina", "Syracuse
University","University of New York"), distance = c(100, 400, 200, 20, 70))
And I desire an output that has them merged together as closely as possible
matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina",
"Syracuuse University","University of South Texas","The University of No Carolina"),
correctmatch = c("University of New York", "University of South Carolina",
"Syracuse University","University of South Texas", "The University of North Carolina"))
I use adist() for things like this and have a little wrapper function called closest_match() to help compare a value against a set of "good/permitted" values.
library(magrittr) # for the %>%
closest_match <- function(bad_value, good_values) {
distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
as.numeric() %>%
setNames(good_values)
distances[distances == min(distances)] %>%
names()
}
sapply(d$name, function(x) closest_match(x, y$name)) %>%
setNames(d$name)
University of New Yorkk The University of South\n Carolina Syracuuse University
"University of New York" "University of South Carolina" "University of New York"
University of South Texas The University of No Carolina
"University of South Texas" "University of South Carolina"
adist() uses the generalized Levenshtein (edit) distance to measure how similar two strings are.
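Some strings in the question contain embedded line breaks (visible as \n in the output above), which inflates their edit distances. A sketch of the same adist() idea on names typed on a single line, picking the nearest candidate per row from the full distance matrix. Here squish() is a hypothetical helper, and treating the line breaks as accidental is an assumption:

```r
# Names from the question, retyped on one line each.
bad  <- c("University of New Yorkk", "Syracuuse University",
          "University of South Texas", "The University of No Carolina")
good <- c("University of South Texas", "The University of North Carolina",
          "University of South Carolina", "Syracuse University",
          "University of New York")

# squish() is a hypothetical helper: collapse whitespace runs, trim the ends.
squish <- function(x) trimws(gsub("\\s+", " ", x))

# Rows are the misspelled names, columns the official ones.
dist_mat <- adist(squish(bad), squish(good), ignore.case = TRUE)

# which.min() returns the first minimum, so ties go to the earlier candidate.
best <- good[apply(dist_mat, 1, which.min)]
data.frame(name = bad, correctmatch = best)
```

Edit distance alone can still pick a plausible but wrong neighbour when two official names differ by only a few characters, so inspecting the minimum distances before joining is a sensible check.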
I've got a problem and I'm sure it's super simple to fix, but I've been searching for an answer for about an hour and can't seem to work it out.
I have a character vector with data that looks a bit like this:
[5] "Toronto, ON" "Manchester, UK"
[7] "New York City, NY" "Newark, NJ"
[9] "Melbourne" "Los Angeles, CA"
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, CO" "London, UK"
[15] "New York, NY"
and basically I'd like to get rid of all words that are two characters or shorter, so that the data then looks as follows:
[5] "Toronto, " "Manchester, "
[7] "New York City, " "Newark, "
[9] "Melbourne" "Los Angeles, "
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, " "London, "
[15] "New York, "
The commas I know how to get rid of. As I said, I'm sure this is super simple, any help would be greatly appreciated. Thanks!
You can put a quantifier on the word character class \\w and wrap it in word boundaries: \\b\\w{1,2}\\b matches a word of one or two characters. Use gsub rather than sub so that every match in a string is removed:
gsub("\\b\\w{1,2}\\b", "", v)
# [1] "Toronto, " "Manchester, " "New York City, " "Newark, " "Melbourne" "Los Angeles, " "New York, USA"
# [8] "Liverpool, England" "Fort Collins, " "London, " "New York, "
Note that \\w matches letters, digits and the underscore; if you only want to consider letters, you can use gsub("\\b[a-zA-Z]{1,2}\\b", "", v).
v <- c("Toronto, ON", "Manchester, UK", "New York City, NY", "Newark, NJ", "Melbourne", "Los Angeles, CA", "New York, USA", "Liverpool, England", "Fort Collins, CO", "London, UK", "New York, NY")
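To see the difference between the two patterns, here is a small sketch on a made-up vector that includes a short number. perl = TRUE is my own choice here so that word boundaries follow PCRE rules; the answer above uses the default engine.

```r
# Hypothetical inputs; "Route 66, CA" is invented to show the digit case.
places <- c("Toronto, ON", "Route 66, CA", "New York, USA")

# \w also covers digits, so the two-digit "66" is removed along with "CA".
any_word <- gsub("\\b\\w{1,2}\\b", "", places, perl = TRUE)

# Restricting to letters keeps "66" and removes only "ON" and "CA".
letters_only <- gsub("\\b[a-zA-Z]{1,2}\\b", "", places, perl = TRUE)

any_word
letters_only
```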
This splits on whitespace instead of pattern-matching the short words, but it gets the job done:
d <- c(
"Toronto, ON", "Manchester, UK",
"New York City, NY", "Newark, NJ",
"Melbourne", "Los Angeles, CA" ,
"New York, USA", "Liverpool, England" ,
"Fort Collins, CO", "London, UK" ,
"New York, NY" )
toks <- strsplit(d, "\\s+")
lens <- lapply(toks, nchar) # lapply always returns a list, even when every entry has the same number of tokens
mapply(function(a, b) {
paste(a[b > 2], collapse = " ")
}, toks, lens)
I have a column State as shown below
State
Arizona, Arizona, Arizona, Arizona,
Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona
Virginia, Virginia, Virginia
.
.
.
I want to remove duplicates of specific words only, keeping one copy of each. In this case I want to remove only the duplicated Arizona and Virginia entries, so the final dataset should look like this:
Result
Arizona
Arizona, California Carmel Beach, California LBC, California Napa
Virginia
.
.
.
# Create a test data vector
testin <- c(
"Arizona, Arizona, Arizona, Arizona, ",
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
"Virginia, Virginia, Virginia"
)
# The names to remove if duplicated
kickDuplicates <- c("Arizona", "Virginia")
# create a list of vectors of place names
broken <- strsplit(testin, ",\\s*")
# paste each broken vector of place names back together
# .......kicking out duplicated instances of the chosen names
testout <- sapply(broken, FUN = function(x) paste(x[!duplicated(x) | !x %in% kickDuplicates ], collapse = ", "))
# see what we did
testout
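I think this is what you want (note that it drops every duplicated entry, not only the chosen names):
# state is assumed to be the character vector from the State column
trimmed <- gsub('^\\s*','',state)
trimmed <- gsub('\\s*$','',trimmed)
# collapse (not sep) is what joins the unique entries back into one string
sapply(lapply(strsplit(trimmed,'\\s*,\\s*'),unique),paste,collapse =', ')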
You could try with a single gsub to get the unique values, but the order of elements will be different
df1$Result <- gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*), ', "",
df1$State, perl=TRUE)
df1$Result
#[1] "Arizona"
#[2] "California Carmel Beach, California LBC, California Napa, Arizona"
#[3] "Virginia"
data
df1 <- structure(list(State = c("Arizona, Arizona, Arizona, Arizona",
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
"Virginia, Virginia, Virginia")), .Names = "State", class = "data.frame",
row.names = c(NA, -3L))
I have a data frame with the columns city, state, and country. I want to create a string that concatenates: "City, State, Country". However, one of my cities doesn't have a State (has a NA instead). I want the string for that city to be "City, Country". Here is the code that creates the wrong string:
# define City, State, Country
city <- c("Austin", "Knoxville", "Salk Lake City", "Prague")
state <- c("Texas", "Tennessee", "Utah", NA)
country <- c("United States", "United States", "United States", "Czech Rep")
# create data frame
dff <- data.frame(city, state, country)
# create full string
dff["string"] <- paste(city, state, country, sep=", ")
When I display dff$string, I get the following. Note that the last string contains an "NA, " that is not needed:
> dff["string"]
string
1 Austin, Texas, United States
2 Knoxville, Tennessee, United States
3 Salk Lake City, Utah, United States
4 Prague, NA, Czech Rep
How do I skip that NA, together with its separator ", "?
The alternative is to just fix it up afterwards:
gsub("NA, ","",dff$string)
#[1] "Austin, Texas, United States"
#[2] "Knoxville, Tennessee, United States"
#[3] "Salk Lake City, Utah, United States"
#[4] "Prague, Czech Rep"
Alternative #2 is to use apply on the three source columns once you have your data.frame called dff:
apply(dff[c("city", "state", "country")], 1, function(x) paste(na.omit(x), collapse=", "))
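One caveat with patching the string afterwards is that gsub("NA, ", ...) removes that text wherever it occurs, including inside a real name. A small sketch with a made-up uppercase city, next to a vectorised ifelse() alternative that never creates the "NA, " in the first place:

```r
# Hypothetical rows; "REGINA" is invented to show the false match.
city    <- c("Prague", "REGINA")
state   <- c(NA, "Saskatchewan")
country <- c("Czech Rep", "Canada")

# Paste first, patch afterwards: the second row loses part of the city name.
patched <- gsub("NA, ", "", paste(city, state, country, sep = ", "))

# Decide per row whether the state takes part in the paste at all.
safe <- ifelse(is.na(state),
               paste(city, country, sep = ", "),
               paste(city, state, country, sep = ", "))

patched
safe
```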
Late to the party, but unite() from tidyr provides a one-step approach:
library(tidyr) # provides unite() and re-exports %>%
dff %>% unite("string", c(city, state, country), sep=", ", remove = FALSE, na.rm = TRUE)
string city state country
1 Austin, Texas, United States Austin Texas United States
2 Knoxville, Tennessee, United States Knoxville Tennessee United States
3 Salk Lake City, Utah, United States Salk Lake City Utah United States
4 Prague, Czech Rep Prague <NA> Czech Rep