Splitting state-county-MSA string variable - r

I have a variable below that I believe is separated by space.
[95] "Florida Volusia Deltona-Daytona Beach-Ormond Beach"
[96] "Iowa Polk Des Moines"
[97] "Mississippi Hinds Jackson"
[98] "Georgia Richmond Augusta-Richmond"
[99] "Ohio Mahoning Youngstown-Warren-Boardman"
[100] " Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[101] " Pennsylvania Dauphin Harrisburg-Carlisle"
[102] " Florida Brevard Palm Bay-Melbourne-Titusville"
[103] " Utah Utah Provo-Orem"
[104] " Tennessee Hamilton Chattanooga"
[105] " North Carolina Durham Durham"
I want to create three variables out of this string: state, county, and MSA. But the usual string-split commands are not working. I tried stringi functions too, but they also fail to split the variable. I am not sure why this is happening, as the same commands work on simpler strings.
> strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df,"")
Error in strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df, :
non-character argument
> BK<-strsplit(as.character(BK_state_county_MSA$non_squished_states_county_MSA_names_df),"\\t")
> str(BK) #List of 0
list()
> stri_split(str=BK_state_county_MSA$non_squished_states_county_MSA_names_df, regex="\\t",n=3)
list()
> BK <-stri_split_lines(BK_state_county_MSA$non_squished_states_county_MSA_names_df)
list()
EDIT - The original data has 104 observations, but I am posting only 8 observations with dput command...
> dput(BK_state_county_MSA)
structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")

Here's an option using stri_trim_left() from stringi and separate() from tidyr:
stri_trim_left() removes the leading whitespace from each string, which occurs in your data starting at [100]. You can then separate() the strings into the three columns state, county and MSA, splitting wherever there are at least 2 spaces (sep = " {2,}").
Data
BK_state_county_MSA<- structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
Code
library(dplyr)    # mutate() and %>%
library(tidyr)    # separate()
library(stringi)  # stri_trim_left()
BK_state_county_MSA %>%
  mutate(non_squished_state_county_MSA = stri_trim_left(non_squished_state_county_MSA)) %>%
  separate(non_squished_state_county_MSA, into = c("state", "county", "MSA"), sep = " {2,}")
Output
           state     county                           MSA
1       New York      Bronx   New York-Wayne-White Plains
2       New York      Kings   New York-Wayne-White Plains
3   Pennsylvania Lackawanna        Scranton--Wilkes-Barre
4   Pennsylvania    Dauphin           Harrisburg-Carlisle
5        Florida    Brevard Palm Bay-Melbourne-Titusville
6           Utah       Utah                    Provo-Orem
7      Tennessee   Hamilton                   Chattanooga
8 North Carolina     Durham                        Durham
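A likely reason the attempts in the question returned an error or an empty list: the column name used in those calls does not appear in the dput() output, and `$` returns NULL for a column that does not exist. The sketch below reproduces that behaviour on a one-row stand-in data frame; the data frame `df` and its contents are made up for illustration.

```r
# Stand-in data frame with the column name shown in the dput() output
df <- data.frame(non_squished_state_county_MSA = "a  b  c",
                 stringsAsFactors = FALSE)

# The name used in the question matches no column, so `$` returns NULL
col <- df$non_squished_states_county_MSA_names_df
is.null(col)

# strsplit() refuses NULL because it is not a character vector
msg <- tryCatch(strsplit(col, ""), error = function(e) conditionMessage(e))
msg

# as.character(NULL) is character(0), so splitting it gives an empty list
res <- strsplit(as.character(col), "\\t")
length(res)

# names() lists the columns that actually exist
names(df)
```

Checking names(BK_state_county_MSA) before splitting would confirm whether this is what happened with the real data.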

There's probably something useful, in terms of further data analysis, about keeping North with Carolina, New with York, and so on. It's always nice to have a one-liner, but sometimes a few extra lines leave you better placed for moving forward. Consider playing around like this:
maddening_txt <- " North Carolina Durham Durham"
strsplit(maddening_txt, split = ' ') # n space = 4L
[[1]]
[1] " North Carolina" " Durham" "" ""
[5] "" " Durham"
nchar(strsplit(maddening_txt, split = ' ')[[1]])
[1] 15 7 0 0 0 7
# you could throw in a which test for >0 for indexing here
strsplit(maddening_txt, split = ' ')[[1]][c(1,2,6)]
[1] " North Carolina" " Durham" " Durham"
string_vec <- strsplit(maddening_txt, split = ' ')[[1]][c(1,2,6)]
> string_vec[1]
[1] " North Carolina"
string_vec[2:3]
[1] " Durham" " Durham"
Left as it is, when returned to in the future, something like split=' ' begs 'How many spaces was that?'
So, more explicitly, and using some of the above notation:
> strsplit(sc_msa$entries[8], split='\\s{5,}')
[[1]]
[1] " North Carolina" "Durham" "Durham"
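This splits each entry into its three fields. Note that the leading whitespace on the first field survives (see " North Carolina" above), so trimws() is still needed for that. And then the same split in a for loop over all entries: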
sc_msa_lst <- list()
> for(i in 1:length(sc_msa$entries)) {
+ sc_msa_lst[[i]] <- strsplit(sc_msa$entries[i], split='\\s{5,}')
+ }
> sc_msa_lst
[[1]]
[[1]][[1]]
[1] "New York" "Bronx"
[3] "New York-Wayne-White Plains"
[[2]]
[[2]][[1]]
[1] "New York" "Kings"
[3] "New York-Wayne-White Plains"
[[3]]
[[3]][[1]]
[1] " Pennsylvania" "Lackawanna" "Scranton--Wilkes-Barre"
[[4]]
[[4]][[1]]
[1] " Pennsylvania" "Dauphin" "Harrisburg-Carlisle"
[[5]]
[[5]][[1]]
[1] " Florida" "Brevard"
[3] "Palm Bay-Melbourne-Titusville"
[[6]]
[[6]][[1]]
[1] " Utah" "Utah" "Provo-Orem"
[[7]]
[[7]][[1]]
[1] " Tennessee" "Hamilton" "Chattanooga"
[[8]]
[[8]][[1]]
[1] " North Carolina" "Durham" "Durham"
So, our regex pattern, '\s{5,}' works for cases seen so far. Next, I have to remember how to make a data.frame...there's got to be a list of lists to data.frame question on SO somewhere that'll help me.
A completely unacceptable answer, but it will get you to the guts of what simplifications are offering and provide more control going forward.
Data, with renaming:
sc_msa <- structure(list(entries = c("New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem", " Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
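For the "list of lists to data.frame" step mentioned above, one common base R approach is do.call(rbind, ...). This is a minimal sketch, not the answerer's code: the two sample entries and the column names are assumptions, and it relies on every entry splitting into exactly three pieces.

```r
# Illustrative entries; assumes at least five spaces between fields
entries <- c("New York     Bronx     New York-Wayne-White Plains",
             "  Utah     Utah     Provo-Orem")

# strsplit() is vectorised, so no explicit loop is needed
pieces <- strsplit(entries, split = "\\s{5,}")

# Stack the length-3 character vectors into a matrix, then a data frame
sc_msa_df <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
names(sc_msa_df) <- c("state", "county", "msa")

# The split leaves short leading whitespace in place, so trim every column
sc_msa_df[] <- lapply(sc_msa_df, trimws)
sc_msa_df
```

If some entries split into a different number of pieces, rbind() would recycle values silently, so checking lengths(pieces) first is a sensible guard.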

Related

Extracting counties and MSAs from state-county-MSA character variable

I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.
tail(df$states_county_MSA,n=10)
[1] "Iowa Polk Des Moines"
[2] "Mississippi Hinds Jackson"
[3] "Georgia Richmond Augusta-Richmond"
[4] "Ohio Mahoning Youngstown-Warren-Boardman"
[5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[6] "Pennsylvania Dauphin Harrisburg-Carlisle"
[7] "Florida Brevard Palm Bay-Melbourne-Titusville"
[8] "Utah Utah Provo-Orem"
[9] "Tennessee Hamilton Chattanooga"
[10] "North Carolina Durham Durham"
Modifying the solution by @jared_mamrot to a similar question (splitting a state-county variable into distinct state and county variables, posted below; full problem here for reference: Extracting states and counties from state-county character variable), I can split the states_county_MSA variable into two variables: states and a combined county-MSA variable.
library(tidyverse)
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df %>%
mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
However, in this scenario I am not able to decompose states_county_MSA further, as I cannot find a ready-made vector of county or MSA names. I could not get data(county.names) to work, and I tried the tigris, censusapi and maps packages but was unable to generate a vector of US county names for the string split/extract command.
> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found
I was thinking of using the word function but names of MSAs are not standard either (one or more words).
Would anyone know a way to split the county-MSA in an efficient manner ?
EDIT - Data with (space) delimiter {state, county, MSA, MSA population, month, year}.
[1] "Virginia Richmond Richmond 1,210,063 8 2014"
[2] "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794"
[3] "North Carolina Wake Raleigh-Cary 1,137,346 6 2014"
[4] "New York Erie Buffalo-Niagara Falls 1,135,342"
[5] "Alabama Jefferson Birmingham-Hoover 1,129,034"
[6] "Utah Salt Lake Salt Lake City 1,091,432 5 2014"
[7] "New York Monroe Rochester 1,080,082"
[8] "Michigan Kent Grand Rapids-Wyoming 989,205 7 2014"
[9] "Arizona Pima Tucson 981,935 10 2013"
[10] "Hawaii Honolulu Honolulu 956,336 8 2013"
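I think this should work. It assumes the tidyverse is loaded, as in the question, and that the maps package is installed, since ggplot2::map_data("county") reads its county data from there: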
data <- tibble::tribble(~state_county_msa,
"Iowa Polk Des Moines" ,
"Mississippi Hinds Jackson" ,
"Georgia Richmond Augusta-Richmond" ,
"Ohio Mahoning Youngstown-Warren-Boardman" ,
"Pennsylvania Lackawanna Scranton--Wilkes-Barre",
"Pennsylvania Dauphin Harrisburg-Carlisle" ,
"Florida Brevard Palm Bay-Melbourne-Titusville" ,
"Utah Utah Provo-Orem" ,
"Tennessee Hamilton Chattanooga" ,
"North Carolina Durham Durham")
state_county <- ggplot2::map_data("county") %>%
select(state = region,
county = subregion) %>%
as_tibble() %>%
mutate(across(everything(),str_to_title)) %>%
unite(state_county, c("state","county"), sep = " ", remove = FALSE) %>%
distinct(state_county, .keep_all = TRUE)
state_county_string <- paste(state_county$state_county, collapse = "|")
data %>%
mutate(state_county = str_extract(state_county_msa, state_county_string),
msa = str_trim(str_remove(state_county_msa, state_county_string))) %>%
left_join(state_county, by = "state_county") %>%
select(state, county, msa)
Output:
# A tibble: 10 × 3
   state          county     msa
   <chr>          <chr>      <chr>
 1 Iowa           Polk       Des Moines
 2 Mississippi    Hinds      Jackson
 3 Georgia        Richmond   Augusta-Richmond
 4 Ohio           Mahoning   Youngstown-Warren-Boardman
 5 Pennsylvania   Lackawanna Scranton--Wilkes-Barre
 6 Pennsylvania   Dauphin    Harrisburg-Carlisle
 7 Florida        Brevard    Palm Bay-Melbourne-Titusville
 8 Utah           Utah       Provo-Orem
 9 Tennessee      Hamilton   Chattanooga
10 North Carolina Durham     Durham
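The EDIT data in the question also carries a population and, on some rows, a month and year after the MSA name. A possible preprocessing step, sketched here in base R, is to peel that trailing numeric block off first so that only the "state county MSA" text is left for the extraction above. The pattern `num_pat` and the variable names are assumptions, and the sketch assumes that no MSA name ends in a number.

```r
x <- c("Virginia Richmond Richmond 1,210,063 8 2014",
       "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794",
       "Arizona Pima Tucson 981,935 10 2013")

# Trailing block: a population, optionally followed by a month and a year
num_pat <- "\\s+[0-9][0-9,]*(\\s+[0-9]+){0,2}$"

# Keep the matched block (every row matches in this sample)
tail_part <- trimws(regmatches(x, regexpr(num_pat, x)))

# What is left is the "state county MSA" text
state_county_msa <- sub(num_pat, "", x)

# Population is the first numeric token, with thousands separators removed
population <- as.numeric(gsub(",", "", sub("\\s.*$", "", tail_part)))

data.frame(state_county_msa, population, stringsAsFactors = FALSE)
```

Note that regmatches() drops elements with no match, so on real data it is worth confirming that every row matched before combining the results.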

Collapse elements separated by ""

I have raw bibliographic data as follows:
bib =
c("Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte",
"Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in",
"Republican China*, Cambridge: Harvard University Press, 1976.",
"", "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing",
"History* and the Crisis of Traditional Chinese Historiography,\"",
"*Historiography East & West*2.2 (Sept. 2004): 173-204", "",
"Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:",
"Yale University Press, 1988.", "")
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte"
[2] "Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in"
[3] "Republican China*, Cambridge: Harvard University Press, 1976."
[4] ""
[5] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing"
[6] "History* and the Crisis of Traditional Chinese Historiography,\""
[7] "*Historiography East & West*2.2 (Sept. 2004): 173-204"
[8] ""
[9] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven:"
[10] "Yale University Press, 1988."
[11] ""
I would like to collapse elements between the ""s in one line so that:
clean_bib[1]=paste(bib[1], bib[2], bib[3])
clean_bib[2]=paste(bib[5], bib[6], bib[7])
clean_bib[3]=paste(bib[9], bib[10])
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] "Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] "Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
Is there a one-liner that does this automatically?
You can use tapply while grouping with all "" then paste together the groups
unname(tapply(bib,cumsum(bib==""),paste,collapse=" "))
[1] "Bernal, Martin, \\\"Liu Shi-p\\'ei and National Essence,\\\" in Charlotte Furth, ed., *The Limit of Change, Essays on Conservative Alternatives in Republican China*, Cambridge: Harvard University Press, 1976."
[2] " Chen,Hsi-yuan, \"*Last Chapter Unfinished*: The Making of the *Draft Qing History* and the Crisis of Traditional Chinese Historiography,\" *Historiography East & West*2.2 (Sept. 2004): 173-204"
[3] " Dennerline, Jerry, *Qian Mu and the World of Seven Mansions*, New Haven: Yale University Press, 1988."
[4] ""
you can also do:
unname(c(by(bib,cumsum(bib==""),paste,collapse=" ")))
or
unname(tapply(bib,cumsum(grepl("^$",bib)),paste,collapse=" "))
etc
Similar to the other answer, but using split and sapply. The second line just removes any elements that consist only of "".
vec <- unname(sapply(split(bib, f = cumsum(bib %in% "")), paste0, collapse = " "))
vec[!vec %in% ""]
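Both answers group on cumsum(bib == ""), which leaves the "" separator inside each group and so produces a leading space or an empty element. A small variation, sketched below on a shortened made-up `bib`, drops the separators before pasting so that no cleanup is needed afterwards.

```r
# Shortened stand-in for the bibliography vector
bib <- c("Bernal, Martin,", "Furth, ed.", "",
         "Chen, Hsi-yuan", "History", "",
         "Dennerline, Jerry", "")

# Group counter: increases by one at every "" separator
grp <- cumsum(bib == "")

# Drop the separators, then paste what remains of each group into one line
keep <- bib != ""
clean_bib <- as.vector(tapply(bib[keep], grp[keep], paste, collapse = " "))
clean_bib
```

as.vector() is used instead of unname() here so that the result is a plain character vector rather than a one-dimensional array.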

Splitting column by separator from right to left in R

I'm working on a dataset where one column (Place) consists of a location sentence.
library(tidyverse)
example <- tibble(Datum = c("October 1st 2017",
"October 2st 2017",
"October 3rd 2017"),
Place = c("Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria",
"Abu Kamal, Deir Ezzor Governorate, Syria",
"شارع القطار al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria"))
I would like to split the Place column on the comma separator, preferably with a tidyverse solution. Because the values of Place have different numbers of parts, I would like to split from right to left, so that the country (Syria) ends up in the last column of the data frame.
Oh, and for a bonus with which RegEx code do I delete the Arabic characters?
Thanks in advance.
Edit: Found my answer:
For removing Arabic characters (thanks to @g5w):
gsub("[\u0600-\u06FF]", "", airstrikes_okt_clean$Plek)
And splitting the column in a tidyr way:
airstrikes_okt_clean <- separate(example,
Place,
into = c("detail",
"detail2",
"City_or_village",
"District",
"Country"),
sep = ",",
fill = "left")
Just split the string on the comma and then reverse it.
lapply(strsplit(example$Place, ","), rev)
[[1]]
[1] " Syria" " Deir Ezzor Governorate"
[3] " 20km south east of Deir Ezzor" "Tabiyyah Jazeera village"
[[2]]
[1] " Syria" " Deir Ezzor Governorate"
[3] "Abu Kamal"
[[3]]
[1] " Syria" " Raqqah governorate"
[3] " north of Raqqah city centre" " al-Tawassiya area"
[5] "شارع القطار al Qitar [train] street"
To get rid of the Arabic characters before splitting, try
gsub("[\u0600-\u06FF]", "", example$Place)
[1] "Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria"
[2] "Abu Kamal, Deir Ezzor Governorate, Syria"
[3] " al Qitar [train] street, al-Tawassiya area, north of Raqqah city centre, Raqqah governorate, Syria"
Here's a one-liner.
sapply(strsplit(example$Place, ","), function(x) trimws(x[length(x)]))
It will return the string after the last comma, be it Syria or any other.
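For the right-to-left layout itself in base R, one option is to pad the shorter rows with NA on the left, which is roughly what fill = "left" does in separate(). This is a sketch under assumptions: the vector `place` is a two-row excerpt of the example, and the `level_*` column names are hypothetical.

```r
place <- c("Tabiyyah Jazeera village, 20km south east of Deir Ezzor, Deir Ezzor Governorate, Syria",
           "Abu Kamal, Deir Ezzor Governorate, Syria")

# Split on commas and strip the spaces around each piece
parts <- lapply(strsplit(place, ",", fixed = TRUE), trimws)

# Pad shorter rows with NA on the left so the last piece lands in the last column
n <- max(lengths(parts))
padded <- lapply(parts, function(p) c(rep(NA_character_, n - length(p)), p))

# Hypothetical column names, numbered from the right (level_1 is the country)
m <- do.call(rbind, padded)
colnames(m) <- paste0("level_", rev(seq_len(n)))
m
```

Numbering the columns from the right keeps the country in level_1 regardless of how many parts the longest place description has.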

R - using regex to delete all strings with 2 characters or less [duplicate]

I've got a problem and I'm sure it's super simple to fix, but I've been searching for an answer for about an hour and can't seem to work it out.
I have a character vector with data that looks a bit like this:
[5] "Toronto, ON" "Manchester, UK"
[7] "New York City, NY" "Newark, NJ"
[9] "Melbourne" "Los Angeles, CA"
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, CO" "London, UK"
[15] "New York, NY"
and basically I'd like to get rid of all words that are 2 characters or shorter, so that the data can then look as follows:
[5] "Toronto, " "Manchester, "
[7] "New York City, " "Newark, "
[9] "Melbourne" "Los Angeles, "
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, " "London, "
[15] "New York, "
The commas I know how to get rid of. As I said, I'm sure this is super simple, any help would be greatly appreciated. Thanks!
You can use a quantifier on the word character \\w together with word boundaries: \\b\\w{1,2}\\b will match a word of one or two characters. Use gsub to remove every match, in case a string contains more than one:
gsub("\\b\\w{1,2}\\b", "", v)
# [1] "Toronto, " "Manchester, " "New York City, " "Newark, " "Melbourne" "Los Angeles, " "New York, USA"
# [8] "Liverpool, England" "Fort Collins, " "London, " "New York, "
Note that \\w matches letters, digits and the underscore; if you only want to take letters into account, you can use gsub("\\b[a-zA-Z]{1,2}\\b", "", v).
v <- c("Toronto, ON", "Manchester, UK", "New York City, NY", "Newark, NJ", "Melbourne", "Los Angeles, CA", "New York, USA", "Liverpool, England", "Fort Collins, CO", "London, UK", "New York, NY")
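The difference between the two patterns only shows up when a string contains a short number. A small sketch with a made-up vector `v2` (the "Apt 12, NY" entry is not from the question's data):

```r
v2 <- c("Toronto, ON", "Apt 12, NY", "New York, USA")

# \\w covers letters, digits and underscore, so the number 12 is removed too
with_w <- gsub("\\b\\w{1,2}\\b", "", v2)
with_w

# Restricting the class to letters keeps the number
letters_only <- gsub("\\b[a-zA-Z]{1,2}\\b", "", v2)
letters_only
```

Which pattern is appropriate depends on whether short numeric tokens such as house or apartment numbers should survive.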
Doesn't use regex but it gets the job done:
d <- c(
"Toronto, ON", "Manchester, UK",
"New York City, NY", "Newark, NJ",
"Melbourne", "Los Angeles, CA" ,
"New York, USA", "Liverpool, England" ,
"Fort Collins, CO", "London, UK" ,
"New York, NY" )
toks <- strsplit(d, "\\s+")
lens <- sapply(toks, nchar)
mapply(function(a, b) {
paste(a[b > 2], collapse = " ")
}, toks, lens)

String Pattern Manipulation in R

I am trying to find host and visitor names from a bunch of texts in R.
Sample Text -
dat = data.frame(Series = c('England in Australia ODI Match',
'Prudential Trophy (Australia in England)',
'Pakistan in New Zealand ODI Match',
'Prudential Trophy (New Zealand in England)',
'Prudential Trophy (West Indies in England)',
'Australia in New Zealand ODI Series',
'Texaco Trophy (Australia in England)'))
I want two new columns to be created. The desired output looks like this:
Visitor      Host
England      Australia
Australia    England
Pakistan     New Zealand
New Zealand  England
West Indies  England
Australia    New Zealand
I am trying the following function but it's incomplete.
dat$Host = sub(" in.*", "", dat$Series)
Here is something that does what you want:
re = regexpr("((New |West )?\\w+) in ((New |West )?\\w+)", dat$Series)
rm = regmatches(dat$Series, re)
d = do.call(rbind,strsplit(rm, " in "))
colnames(d) = c("Visitor","Host")
Output:
     Visitor       Host
[1,] "England"     "Australia"
[2,] "Australia"   "England"
[3,] "Pakistan"    "New Zealand"
[4,] "New Zealand" "England"
[5,] "West Indies" "England"
[6,] "Australia"   "New Zealand"
[7,] "Australia"   "England"
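The pattern above hard-codes the prefixes "New " and "West ", so a two-word team name with any other first word would be cut short. An alternative sketch strips the surrounding text instead and then splits on " in ". It assumes every title is either "<visitor> in <host> ODI Match/Series" or "<trophy> (<visitor> in <host>)", which holds for the sample but may not hold for the full data; the vector `series` is a three-row excerpt.

```r
series <- c("England in Australia ODI Match",
            "Prudential Trophy (West Indies in England)",
            "Australia in New Zealand ODI Series")

# Keep only the "<visitor> in <host>" core of each title
core <- sub("^.*\\(", "", series)               # drop everything up to "("
core <- sub("\\)$", "", core)                   # drop a closing ")"
core <- sub(" ODI (Match|Series)$", "", core)   # drop the assumed ODI suffix

# Split on the literal " in " and bind into a two-column matrix
d2 <- do.call(rbind, strsplit(core, " in ", fixed = TRUE))
colnames(d2) <- c("Visitor", "Host")
d2
```

Because the split is on the lower-case literal " in ", names such as "West Indies" are not affected.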
