String Pattern Manipulation in R

I am trying to extract the host and visitor team names from a set of series descriptions in R.
Sample data:
dat = data.frame(Series = c('England in Australia ODI Match',
'Prudential Trophy (Australia in England)',
'Pakistan in New Zealand ODI Match',
'Prudential Trophy (New Zealand in England)',
'Prudential Trophy (West Indies in England)',
'Australia in New Zealand ODI Series',
'Texaco Trophy (Australia in England)'))
I want two new columns to be created. The desired output looks like this:
Visitor Host
England Australia
Australia England
Pakistan New Zealand
New Zealand England
West Indies England
Australia New Zealand
I am trying the following function but it's incomplete.
dat$Host = sub(" in.*", "", dat$Series)

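Here is something that does what you want. The pattern captures one word on each side of " in ", optionally preceded by "New " or "West ", so any other multi-word team names would need to be added to it: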
re = regexpr("((New |West )?\\w+) in ((New |West )?\\w+)", dat$Series)
rm = regmatches(dat$Series, re)
d = do.call(rbind,strsplit(rm, " in "))
colnames(d) = c("Visitor","Host")
Output:
Visitor Host
[1,] "England" "Australia"
[2,] "Australia" "England"
[3,] "Pakistan" "New Zealand"
[4,] "New Zealand" "England"
[5,] "West Indies" "England"
[6,] "Australia" "New Zealand"
[7,] "Australia" "England"
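If the goal is to add the two columns to dat itself, a variant with regexec() may be safer, because regmatches() on a regexpr() result silently drops rows that do not match and the row counts can then disagree. The following is only a sketch under that assumption: the helper name pick and the extra "World Cup Final" row are made up for illustration, and the pattern is the same one used above.

```r
# A few series names from the question, plus one hypothetical row without "X in Y"
dat <- data.frame(
  Series = c("England in Australia ODI Match",
             "Prudential Trophy (New Zealand in England)",
             "Prudential Trophy (West Indies in England)",
             "Pakistan in New Zealand ODI Match",
             "World Cup Final"),
  stringsAsFactors = FALSE
)

pattern <- "((New |West )?\\w+) in ((New |West )?\\w+)"

# regexec() keeps one list element per input row; a non-match gives character(0)
groups <- regmatches(dat$Series, regexec(pattern, dat$Series))

# Hypothetical helper: return capture group i, or NA when the row did not match
pick <- function(g, i) if (length(g) >= i) g[i] else NA_character_

dat$Visitor <- vapply(groups, pick, character(1), i = 2)  # first outer group
dat$Host    <- vapply(groups, pick, character(1), i = 4)  # second outer group
```

Rows without an "X in Y" phrase end up as NA instead of shifting the remaining values.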

Related

r- Error when trying to use mutate with case_when

I am trying to add a column to a data frame holding the region of each US state. I have tried the following code and keep getting an error message. I'm new to the tidyverse, so any help you can offer would be appreciated. I'm guessing it's something small and embarrassing. :)
df <- df %>%
mutate(region = case_when((State=="Connecticut"|State=="Maine"|State=="Massachusetts"|State=="New Hampshire"|State=="Rhode Island"|State=="Vermont"~ "New England"),
case_when((State=="Delaware"| State=="District of Columbia" | State=="Maryland"| State=="New Jersey"| State=="New York"| State=="Pennsylvania"~ "Central Atlanic"),
case_when((State=="Florida"| State=="Georgia"| State=="North Carolina"|State=="South Carolina"| State=="Virginia"| State=="West Virginia"~ "Lower Atlantic"),
case_when((State=="Illinois"| State=="Indiana"| State=="Iowa"| State=="Kansas"| State=="Kentucky"| State=="Michigan"| State=="Minnesota"| State=="Missouri"| State=="Nebraska"| State=="North Dakota"| State=="Ohio"| State=="Oklahoma"| State=="South Dakota"| State=="Tennessee" |State=="Wisconsin"~ "Midwest"),
case_when((State=="Alabama" | State=="Arkansas" | State=="Louisiana"| State=="Mississippi"| State=="New Mexico"| State=="Texas"~ "Gulf Coast"),
case_when((State=="Colorado"| State=="Idaho" | State=="Montana"| State=="Utah"| State=="Wyoming"~ "Rocky Mountain"),
case_when((State=="Alaska" | State=="Arizona" | State=="California"| State=="Hawaii" | State=="Nevada"| State=="Oregon"| State=="Washington"~ "West Coast"), TRUE~"NA"))))))))
Error in mutate():
! Problem while computing region = case_when(...).
Caused by error in case_when():
! Case 2 ((State == "Colorado" | State == "Idaho" | State == "Montana" | State == "Utah" | State == "Wyoming" ~ "Rocky Mountain")) must be a two-sided formula, not a character vector.

As the docs show, there is no need to nest case_when. Simply pass the mutually exclusive conditions as separate arguments, divided by commas. Also, consider %in% instead of the long chains of | comparisons, and use NA_character_ for the fallback so that every branch returns a character value.
mutate(region = case_when(
  State %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont") ~ "New England",
  State %in% c("Delaware", "District of Columbia", "Maryland", "New Jersey", "New York", "Pennsylvania") ~ "Central Atlantic",
  ...,
  TRUE ~ NA_character_
))
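A self-contained sketch of the flat case_when() form follows. It assumes dplyr is installed; the three-row data frame and the vector names new_england and central_atlantic are made up for illustration, and only two of the regions are spelled out, so every other state falls through to NA.

```r
library(dplyr)

# Hypothetical input: one state per region defined below, plus one that is not covered
df <- data.frame(State = c("Maine", "New York", "Texas"), stringsAsFactors = FALSE)

new_england      <- c("Connecticut", "Maine", "Massachusetts",
                      "New Hampshire", "Rhode Island", "Vermont")
central_atlantic <- c("Delaware", "District of Columbia", "Maryland",
                      "New Jersey", "New York", "Pennsylvania")

df <- df %>%
  mutate(region = case_when(
    State %in% new_england      ~ "New England",
    State %in% central_atlantic ~ "Central Atlantic",
    TRUE                        ~ NA_character_  # states not listed above
  ))
```

Each condition is a separate two-sided formula inside a single case_when() call, which is the shape the error message asks for.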
In fact, consider simply merging with a lookup table and avoiding conditional logic altogether:
txt = 'State Region
Connecticut "New England"
Maine "New England"
Massachusetts "New England"
"New Hampshire" "New England"
"Rhode Island" "New England"
Vermont "New England"
Delaware "Central Atlantic"
"District of Columbia" "Central Atlantic"
Maryland "Central Atlantic"
"New Jersey" "Central Atlantic"
"New York" "Central Atlantic"
Pennsylvania "Central Atlantic"
...'
region_df <- read.table(text = txt, header = TRUE)
region_df
# State Region
# 1 Connecticut New England
# 2 Maine New England
# 3 Massachusetts New England
# 4 New Hampshire New England
# 5 Rhode Island New England
# 6 Vermont New England
# 7 Delaware Central Atlantic
# 8 District of Columbia Central Atlantic
# 9 Maryland Central Atlantic
# 10 New Jersey Central Atlantic
# 11 New York Central Atlantic
# 12 Pennsylvania Central Atlantic
# ...
main_df <- merge(main_df, region_df, by = "State")
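One thing to watch with the merge approach: merge() performs an inner join by default, so any state missing from the lookup table is dropped from the result. A minimal base R sketch, using small hypothetical versions of main_df and region_df, shows the difference that all.x = TRUE makes:

```r
# Hypothetical data: Texas is deliberately absent from the lookup table
main_df <- data.frame(State = c("Vermont", "Texas", "Maryland"),
                      value = c(1, 2, 3),
                      stringsAsFactors = FALSE)
region_df <- data.frame(State  = c("Vermont", "Maryland"),
                        Region = c("New England", "Central Atlantic"),
                        stringsAsFactors = FALSE)

# Default (inner join): the Texas row disappears
inner <- merge(main_df, region_df, by = "State")

# Left join: every row of main_df is kept, with NA where no region is known
left <- merge(main_df, region_df, by = "State", all.x = TRUE)
```

Note that merge() also sorts the result by the key column unless sort = FALSE is given, so the row order may differ from main_df.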

Extracting counties and MSAs from state-county-MSA character variable

I have state, county and MSA names in a single string variable states_county_MSA, and I want to split them to create three distinct variables - states, county and MSAs.
tail(df$states_county_MSA,n=10)
[1] "Iowa Polk Des Moines"
[2] "Mississippi Hinds Jackson"
[3] "Georgia Richmond Augusta-Richmond"
[4] "Ohio Mahoning Youngstown-Warren-Boardman"
[5] "Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[6] "Pennsylvania Dauphin Harrisburg-Carlisle"
[7] "Florida Brevard Palm Bay-Melbourne-Titusville"
[8] "Utah Utah Provo-Orem"
[9] "Tennessee Hamilton Chattanooga"
[10] "North Carolina Durham Durham"
Modifying the solution by @jared_mamrot to a similar question (splitting a state-county variable into distinct state and county variables, posted below; full problem for reference: Extracting states and counties from state-county character variable), I can split the states_county_MSA variable into two variables: states and a combined county-MSA variable.
library(tidyverse)
states_county_names_df <- data.frame(states_county = c(
"California San Francisco",
"New York Bronx",
"Illinois Cook",
"Massachusetts Suffolk",
"District of Columbia District of Columbia"
)
)
data(state)
states_inc_Columbia <- c(state.name, "District of Columbia")
states_county_names_df %>%
mutate(state = str_extract(states_county, paste(states_inc_Columbia, collapse = "|")),
county = str_remove(states_county, paste(states_inc_Columbia, collapse = "|")))
However, in this scenario I cannot decompose states_county_MSA further, as I cannot find a built-in source of county or MSA names. I could not get a county.names data set to work, and I tried the tigris, censusapi and maps packages but was unable to generate a vector of US county names for the string split/extract command.
> data(county.names)
Warning in data(county.names) : data set ‘county.names’ not found
I was thinking of using the word function but names of MSAs are not standard either (one or more words).
Would anyone know an efficient way to split the county-MSA part?
EDIT - Data with (space) delimiter {state, county, MSA, MSA population, month, year}.
[1] "Virginia Richmond Richmond 1,210,063 8 2014"
[2] "Louisiana Orleans New Orleans-Metairie-Kenner 1,195,794"
[3] "North Carolina Wake Raleigh-Cary 1,137,346 6 2014"
[4] "New York Erie Buffalo-Niagara Falls 1,135,342"
[5] "Alabama Jefferson Birmingham-Hoover 1,129,034"
[6] "Utah Salt Lake Salt Lake City 1,091,432 5 2014"
[7] "New York Monroe Rochester 1,080,082"
[8] "Michigan Kent Grand Rapids-Wyoming 989,205 7 2014"
[9] "Arizona Pima Tucson 981,935 10 2013"
[10] "Hawaii Honolulu Honolulu 956,336 8 2013"

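I think this should work (it assumes the tidyverse is loaded and the maps package is installed, since ggplot2::map_data("county") reads the county names from maps):
library(tidyverse)
data <- tibble::tribble(~state_county_msa,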
"Iowa Polk Des Moines" ,
"Mississippi Hinds Jackson" ,
"Georgia Richmond Augusta-Richmond" ,
"Ohio Mahoning Youngstown-Warren-Boardman" ,
"Pennsylvania Lackawanna Scranton--Wilkes-Barre",
"Pennsylvania Dauphin Harrisburg-Carlisle" ,
"Florida Brevard Palm Bay-Melbourne-Titusville" ,
"Utah Utah Provo-Orem" ,
"Tennessee Hamilton Chattanooga" ,
"North Carolina Durham Durham")
state_county <- ggplot2::map_data("county") %>%
select(state = region,
county = subregion) %>%
as_tibble() %>%
mutate(across(everything(),str_to_title)) %>%
unite(state_county, c("state","county"), sep = " ", remove = FALSE) %>%
distinct(state_county, .keep_all = TRUE)
state_county_string <- paste(state_county$state_county, collapse = "|")
data %>%
mutate(state_county = str_extract(state_county_msa, state_county_string),
msa = str_trim(str_remove(state_county_msa, state_county_string))) %>%
left_join(state_county, by = "state_county") %>%
select(state, county, msa)
Output:
# A tibble: 10 × 3
state county msa
<chr> <chr> <chr>
1 Iowa Polk Des Moines
2 Mississippi Hinds Jackson
3 Georgia Richmond Augusta-Richmond
4 Ohio Mahoning Youngstown-Warren-Boardman
5 Pennsylvania Lackawanna Scranton--Wilkes-Barre
6 Pennsylvania Dauphin Harrisburg-Carlisle
7 Florida Brevard Palm Bay-Melbourne-Titusville
8 Utah Utah Provo-Orem
9 Tennessee Hamilton Chattanooga
10 North Carolina Durham Durham
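One caveat with building a single pattern by pasting names together with "|": in PCRE and ICU regular expressions the first alternative that matches wins, so if the lookup contains one name that is a prefix of another, the shorter one may be extracted and the remainder leaks into the MSA. Sorting the alternatives by decreasing length is a cheap guard. The sketch below uses base R only; the lookup entries and input strings are illustrative, not taken from the maps data.

```r
# Illustrative lookup in which one entry is a prefix of another
lookup <- c("Maryland Baltimore", "Maryland Baltimore City",
            "Utah Utah", "North Carolina Durham")

# Longest names first, so the most specific alternative is tried first
lookup_sorted <- lookup[order(nchar(lookup), decreasing = TRUE)]
pattern <- paste0("^(", paste(lookup_sorted, collapse = "|"), ")")

x <- c("Maryland Baltimore City Baltimore-Towson",
       "Utah Utah Provo-Orem",
       "North Carolina Durham Durham")

hit <- regexpr(pattern, x, perl = TRUE)

# Keep NA for rows where no state-county prefix was found
state_county <- rep(NA_character_, length(x))
state_county[hit > 0] <- regmatches(x, hit)

# Whatever follows the matched prefix is treated as the MSA
msa <- trimws(substring(x, nchar(state_county) + 1))
```

The names here contain no regex metacharacters; names with characters such as "." would need escaping before being pasted into a pattern.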

Splitting state-county-MSA string variable

I have a variable below that I believe is separated by space.
[95] "Florida Volusia Deltona-Daytona Beach-Ormond Beach"
[96] "Iowa Polk Des Moines"
[97] "Mississippi Hinds Jackson"
[98] "Georgia Richmond Augusta-Richmond"
[99] "Ohio Mahoning Youngstown-Warren-Boardman"
[100] " Pennsylvania Lackawanna Scranton--Wilkes-Barre"
[101] " Pennsylvania Dauphin Harrisburg-Carlisle"
[102] " Florida Brevard Palm Bay-Melbourne-Titusville"
[103] " Utah Utah Provo-Orem"
[104] " Tennessee Hamilton Chattanooga"
[105] " North Carolina Durham Durham"
I want to create three variables out of this string: state, county, and MSA. But the usual string-split commands are not working. I also tried stringi, but it fails to split the variable as well. I am not sure why this is happening, as the same commands work on simpler strings.
> strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df,"")
Error in strsplit(BK_state_county_MSA$non_squished_states_county_MSA_names_df, :
non-character argument
> BK<-strsplit(as.character(BK_state_county_MSA$non_squished_states_county_MSA_names_df),"\\t")
> str(BK) #List of 0
list()
> stri_split(str=BK_state_county_MSA$non_squished_states_county_MSA_names_df, regex="\\t",n=3)
list()
> BK <-stri_split_lines(BK_state_county_MSA$non_squished_states_county_MSA_names_df)
list()
EDIT - The original data has 104 observations, but I am posting only 8 observations with dput command...
> dput(BK_state_county_MSA)
structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")

Here's an option using stri_trim_left() from stringi and separate() from tidyr:
stri_trim_left() removes the leading whitespace from each string, which occurs in your data starting at [100]. You can then separate() the strings into the three columns state, county and MSA, splitting wherever there are at least 2 consecutive spaces (sep = " {2,}").
Data
BK_state_county_MSA<- structure(list(non_squished_state_county_MSA = c(
"New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem",
" Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
Code
library(dplyr)   # mutate() and %>%
library(tidyr)   # separate()
library(stringi) # stri_trim_left()
BK_state_county_MSA %>% mutate(non_squished_state_county_MSA = stri_trim_left(non_squished_state_county_MSA)) %>%
separate(non_squished_state_county_MSA, into = c("state", "county", "MSA"), sep = " {2,}")
Output
state county MSA
1 New York Bronx New York-Wayne-White Plains
2 New York Kings New York-Wayne-White Plains
3 Pennsylvania Lackawanna Scranton--Wilkes-Barre
4 Pennsylvania Dauphin Harrisburg-Carlisle
5 Florida Brevard Palm Bay-Melbourne-Titusville
6 Utah Utah Provo-Orem
7 Tennessee Hamilton Chattanooga
8 North Carolina Durham Durham
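As for why the original attempts returned an error and empty lists: the calls in the question refer to a column named non_squished_states_county_MSA_names_df, while the dput() output shows a column named non_squished_state_county_MSA. If that mismatch is real rather than a transcription slip, the $ lookup returns NULL, which would produce exactly those results. A small base R sketch with a made-up data frame and column name:

```r
# Hypothetical data frame; the real column is called "entries"
df <- data.frame(entries = c("a  b", "c  d"), stringsAsFactors = FALSE)

# Asking for a column that does not exist returns NULL rather than an error
missing_col <- df$no_such_column

# strsplit() on NULL fails with "non-character argument"
res <- tryCatch(strsplit(missing_col, ""),
                error = function(e) conditionMessage(e))

# Coercing NULL first gives character(0), and splitting that gives an empty list
empty <- strsplit(as.character(missing_col), "\\t")
```

Checking names(BK_state_county_MSA) before splitting is a quick way to rule this out.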

There's probably something useful, in terms of further data analysis, about keeping North with Carolina, New with York, and so on. It's always nice to have a one-liner, but sometimes a few extra lines get you to a better place for moving forward. Consider playing around like this:
maddening_txt <- " North Carolina Durham Durham"
strsplit(maddening_txt, split = ' ') # n space = 4L
[[1]]
[1] " North Carolina" " Durham" "" ""
[5] "" " Durham"
nchar(strsplit(maddening_txt, split = ' ')[[1]])
[1] 15 7 0 0 0 7
# you could throw in a which test for >0 for indexing here
strsplit(maddening_txt, split = ' ')[[1]][c(1,2,6)]
[1] " North Carolina" " Durham" " Durham"
string_vec <- strsplit(maddening_txt, split = ' ')[[1]][c(1,2,6)]
> string_vec[1]
[1] " North Carolina"
string_vec[2:3]
[1] " Durham" " Durham"
Left as it is, when you return to it in the future, something like split=' ' begs the question 'How many spaces was that?'
So, more explicitly, and using some of the above notation:
> strsplit(sc_msa$entries[8], split='\\s{5,}')
[[1]]
[1] " North Carolina" "Durham" "Durham"
This also strips the whitespace around the county and MSA fields without recourse to trim, although the state keeps its leading whitespace. Then apply the same split to every entry in a for loop:
sc_msa_lst <- list()
> for(i in 1:length(sc_msa$entries)) {
+ sc_msa_lst[[i]] <- strsplit(sc_msa$entries[i], split='\\s{5,}')
+ }
> sc_msa_lst
[[1]]
[[1]][[1]]
[1] "New York" "Bronx"
[3] "New York-Wayne-White Plains"
[[2]]
[[2]][[1]]
[1] "New York" "Kings"
[3] "New York-Wayne-White Plains"
[[3]]
[[3]][[1]]
[1] " Pennsylvania" "Lackawanna" "Scranton--Wilkes-Barre"
[[4]]
[[4]][[1]]
[1] " Pennsylvania" "Dauphin" "Harrisburg-Carlisle"
[[5]]
[[5]][[1]]
[1] " Florida" "Brevard"
[3] "Palm Bay-Melbourne-Titusville"
[[6]]
[[6]][[1]]
[1] " Utah" "Utah" "Provo-Orem"
[[7]]
[[7]][[1]]
[1] " Tennessee" "Hamilton" "Chattanooga"
[[8]]
[[8]][[1]]
[1] " North Carolina" "Durham" "Durham"
So, our regex pattern, '\\s{5,}', works for the cases seen so far. Next, I have to remember how to make a data.frame... there's got to be a list-of-lists to data.frame question on SO somewhere that'll help me.
A completely unacceptable answer, but it gets you to the guts of what the shorter solutions are doing and gives you more control going forward.
Data, with renaming:
sc_msa <- structure(list(entries = c("New York Bronx New York-Wayne-White Plains",
"New York Kings New York-Wayne-White Plains",
" Pennsylvania Lackawanna Scranton--Wilkes-Barre",
" Pennsylvania Dauphin Harrisburg-Carlisle",
" Florida Brevard Palm Bay-Melbourne-Titusville",
" Utah Utah Provo-Orem", " Tennessee Hamilton Chattanooga",
" North Carolina Durham Durham")), row.names = c(NA,
-8L), class = "data.frame")
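For the remaining step of turning the split pieces into a data.frame, one base R possibility is do.call(rbind, ...). This is a sketch under assumptions: the two entries below are typed with explicit runs of spaces whose widths are guesses, and the split threshold of two or more whitespace characters is chosen for this sample rather than taken from the real data.

```r
# Assumed sample: fields separated by runs of spaces, one entry with leading spaces
entries <- c("New York     Bronx     New York-Wayne-White Plains",
             "    North Carolina     Durham     Durham")

# Trim the ends first, then split on runs of two or more whitespace characters
parts <- strsplit(trimws(entries), "\\s{2,}")

# Fail early if any row did not split into exactly three fields
stopifnot(all(lengths(parts) == 3))

# Stack the pieces row by row and name the columns
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- c("state", "county", "MSA")
```

Single spaces inside names such as "New York" or "White Plains" are left alone because the split requires at least two whitespace characters.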

R - using regex to delete all strings with 2 characters or less [duplicate]

I've got a problem and I'm sure it's super simple to fix, but I've been searching for an answer for about an hour and can't seem to work it out.
I have a character vector with data that looks a bit like this:
[5] "Toronto, ON" "Manchester, UK"
[7] "New York City, NY" "Newark, NJ"
[9] "Melbourne" "Los Angeles, CA"
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, CO" "London, UK"
[15] "New York, NY"
and basically I'd like to get rid of all words that are 2 characters or shorter, so that the data then looks as follows:
[5] "Toronto, " "Manchester, "
[7] "New York City, " "Newark, "
[9] "Melbourne" "Los Angeles, "
[11] "New York, USA" "Liverpool, England"
[13] "Fort Collins, " "London, "
[15] "New York, "
The commas I know how to get rid of. As I said, I'm sure this is super simple, any help would be greatly appreciated. Thanks!

You can use a quantifier on the word character \\w together with word boundaries: \\b\\w{1,2}\\b matches a word of one or two characters. Use gsub rather than sub so that every match in a string is removed:
gsub("\\b\\w{1,2}\\b", "", v)
# [1] "Toronto, " "Manchester, " "New York City, " "Newark, " "Melbourne" "Los Angeles, " "New York, USA"
# [8] "Liverpool, England" "Fort Collins, " "London, " "New York, "
Note that \\w matches letters, digits and the underscore; if you only want to take letters into account, you can use gsub("\\b[a-zA-Z]{1,2}\\b", "", v).
v <- c("Toronto, ON", "Manchester, UK", "New York City, NY", "Newark, NJ", "Melbourne", "Los Angeles, CA", "New York, USA", "Liverpool, England", "Fort Collins, CO", "London, UK", "New York, NY")
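To also drop the comma and space left behind, a second substitution anchored at the end of the string can follow the first. A minimal sketch on a few of the values from the question (the name cleaned is just illustrative):

```r
v <- c("Toronto, ON", "Melbourne", "New York, USA", "Fort Collins, CO")

# Step 1: remove every stand-alone word of one or two letters
cleaned <- gsub("\\b[A-Za-z]{1,2}\\b", "", v)

# Step 2: strip any commas or spaces left dangling at the end
cleaned <- sub("[, ]+$", "", cleaned)
```

Keep in mind that the first step removes any short word, so an abbreviation such as "St" inside a city name would be removed as well.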

Doesn't use regex but it gets the job done:
d <- c(
"Toronto, ON", "Manchester, UK",
"New York City, NY", "Newark, NJ",
"Melbourne", "Los Angeles, CA" ,
"New York, USA", "Liverpool, England" ,
"Fort Collins, CO", "London, UK" ,
"New York, NY" )
toks <- strsplit(d, "\\s+")
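lens <- lapply(toks, nchar) # lapply always returns a list; sapply would simplify to a matrix if every element had the same number of tokens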
mapply(function(a, b) {
paste(a[b > 2], collapse = " ")
}, toks, lens)

How to skip a paste() argument when its value is NA in R

I have a data frame with the columns city, state, and country. I want to create a string that concatenates: "City, State, Country". However, one of my cities doesn't have a State (has a NA instead). I want the string for that city to be "City, Country". Here is the code that creates the wrong string:
# define City, State, Country
city <- c("Austin", "Knoxville", "Salk Lake City", "Prague")
state <- c("Texas", "Tennessee", "Utah", NA)
country <- c("United States", "United States", "United States", "Czech Rep")
# create data frame
dff <- data.frame(city, state, country)
# create full string
dff["string"] <- paste(city, state, country, sep=", ")
When I display dff$string, I get the following. Note that the last string has a NA,, which is not needed:
> dff["string"]
string
1 Austin, Texas, United States
2 Knoxville, Tennessee, United States
3 Salk Lake City, Utah, United States
4 Prague, NA, Czech Rep
What do I do to skip that NA, including the sep = ", "?

One alternative is to just fix it up afterwards (note that this also removes "NA, " from any genuine value that happens to contain it):
gsub("NA, ","",dff$string)
#[1] "Austin, Texas, United States"
#[2] "Knoxville, Tennessee, United States"
#[3] "Salk Lake City, Utah, United States"
#[4] "Prague, Czech Rep"
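Alternative #2 is to use apply on the three original columns of dff (selecting them explicitly, so that the string column added earlier is not pasted in as well):
apply(dff[c("city", "state", "country")], 1, function(x) paste(na.omit(x), collapse=", "))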
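The same idea can be wrapped in a small helper that works directly on vectors. The function name paste_skip_na is hypothetical, and this is only a base R sketch of the row-wise approach above:

```r
# Hypothetical helper: paste vectors element-wise, skipping NA entries
paste_skip_na <- function(..., sep = ", ") {
  cols <- cbind(...)  # one column per input vector, one row per element
  unname(apply(cols, 1, function(row) paste(row[!is.na(row)], collapse = sep)))
}

city    <- c("Austin", "Prague")
state   <- c("Texas", NA)
country <- c("United States", "Czech Rep")

result <- paste_skip_na(city, state, country)
```

Because the NA is dropped before pasting, no separator is emitted for it, and genuine values that contain the letters "NA" are left untouched.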

Late to the party, but unite from tidyr provides a one-step approach:
dff %>% unite("string", c(city, state, country), sep=", ", remove = FALSE, na.rm = TRUE)
string city state country
1 Austin, Texas, United States Austin Texas United States
2 Knoxville, Tennessee, United States Knoxville Tennessee United States
3 Salk Lake City, Utah, United States Salk Lake City Utah United States
4 Prague, Czech Rep Prague <NA> Czech Rep
