I have a data frame foo.df that contains one variable that is just a very long string consisting of several substrings. Additionally, I have vectors of characters that match parts of the string. Example for the variable in the data frame:
foo.df$var[1]
[1] "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%
Now an example for the vectors of characters:
head(candidates)
[1] "Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"
[5] "Charles Butt" "Mitch Esterhazy"
I want to create a variable foo.df$candidate1 that contains the name of the first candidate appearing in the string (i.e. foo.df$candidate1[1] would be Peter Paul Smith). I was trying to approach this with grepl, but it doesn't work as grepl only uses the first entry from candidates. Any idea how this could be done efficiently?
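For illustration, a minimal sketch of the failing approach (hedged; the exact warning wording may differ by R version):
# grepl() is not vectorised over `pattern`: it uses only the first element
# of `candidates` and warns that the remaining elements are ignored
grepl(candidates, foo.df$var[1])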
You can use the regex OR character, |, with paste and regmatches/regexpr.
candidates <- scan(what = character(), text = '
"Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"')
var1 <- "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"
foo.df <- data.frame(var1)
pat <- paste(candidates, collapse = "|")
regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
#[1] "Peter Paul Smith"
foo.df$candidate1 <- regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
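One caveat (an assumption worth flagging): this works as long as the candidate names contain no regex metacharacters. If they might, you could quote each name literally with \Q...\E before joining, for example:
pat <- paste0("\\Q", candidates, "\\E", collapse = "|")  # treat each name as literal text
regmatches(foo.df$var1, regexpr(pat, foo.df$var1, perl = TRUE))
#[1] "Peter Paul Smith"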
I've got the following data.
df <- data.frame(Name = c("TOMTom Catch",
"BIBill Ronald",
"JEFJeffrey Wilson",
"GEOGeorge Sic",
"DADavid Irris"))
How do I clean the data in names column?
I've tried nchar and substring; however, some names need the first two characters removed whereas others need the first three.
We can use a regex lookahead. ^[A-Z]+ greedily matches the leading capitals, and the lookahead (?=[A-Z]) makes it back off so that one capital letter (the start of the real name) is left behind.
gsub("^[A-Z]+(?=[A-Z])", "", df$Name, perl = TRUE)
#> [1] "Tom Catch" "Bill Ronald" "Jeffrey Wilson" "George Sic"
#> [5] "David Irris"
I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText John Smith is a 60 y.o. MoreText"
b <- "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: remove the pattern that starts at the beginning of the string (^): any letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: remove the pattern that runs to the end of the string ($): a whitespace (\\s) followed by the literal sequence "is", followed by any number of non-newline characters (.*).
library(stringr)   # str_remove(), str_remove_all()
library(magrittr)  # the %>% pipe
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
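Since stringr functions are vectorised over the string argument, one hedged simplification is to clean all three strings in a single call:
str_remove_all(c(a, b, c), '^[:alpha:]*\\s+|\\sis.*$')
#[1] "John Smith"      "Will Smth Jr."   "Billy Smtih III"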
You can use sub in base R:
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} matches 2 or more whitespace characters
(.*) is a capture group that captures everything until
"is" followed by a number and "y.o" or "y/o" is encountered.
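If you would rather extract than substitute, a hedged equivalent uses regexec()/regmatches() with the same pattern and pulls out the capture group:
m <- regexec('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', c(a, b, c))
sapply(regmatches(c(a, b, c), m), `[`, 2)
#[1] "John Smith"      "Will Smth Jr."   "Billy Smtih III"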
I have a weird CSV that I can't parse with readr. Let's call it data.csv. It looks something like this:
name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451
If all of the rows were like the first one below the header row – two character columns followed by an integer column – this would be easy to parse with read_csv:
df <- read_csv("data.csv")
However, some rows are formatted like the second one, in that the second column ("info") contains a string, part of which is enclosed by double quotes and part of which is not. This makes it so read_csv doesn't read the comma after the word cool as a delimiter, and the entire following row gets appended to the offending cell.
A solution for this kind of problem is to pass FALSE to the escape_double argument in read_delim(), like so:
df <- read_delim("data.csv", delim = ",", escape_double = FALSE)
This works for the second row, but gets killed by the third, where the second column contains a string enclosed by double quotes which itself contains nested double quotes and a comma.
I have read the readr documentation but have as yet found no solution that would parse both types of rows.
Here is what worked for me with the example specified. I used read.csv rather than read_csv, which means I am working with a data frame rather than a tibble.
library(stringr)
library(tidyr)
library(dplyr)
# Read the csv; I just turned the table in your example into a csv.
# That resulted in a csv with a single column.
a <- read.csv(file = "Book1.csv", header = TRUE)
# Replace the comma in the third(!) line with just a space
a[, 1] <- str_replace_all(as.vector(a[, 1]), ", ", " ")
# Use separate from the tidyr package to split the column into three columns
# and convert to a tibble
a <- a %>%
  separate(name.info.amount_spent, c("name", "info", "spent"), ",") %>%
  as_tibble()
glimpse(a)
$name <chr> "John Doe", "Jane Doe", "Senator Sally Doe"
$info <chr> "Is a good guy", "\"Jan Doe\" is cool", "\"Sally \"Sal\" Doe is from New York NY\""
$spent <chr> "5412030", "3159", "4451"
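All three columns come back as character; a hedged follow-up if you want spent as a number:
a$spent <- as.numeric(a$spent)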
You could use a regular expression which splits on the comma in question (using (*SKIP)(*FAIL)):
input <- c('John Doe,Is a good guy,5412030', 'Jane Doe,"Jan Doe" is cool,3159',
'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')
lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)
(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name","info","amount_spent")))
This yields
name info amount_spent
1 John Doe Is a good guy 5412030
2 Jane Doe "Jan Doe" is cool 3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY" 4451
See a demo for the expression on regex101.com.
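In case the data lives in a file rather than an inline vector, a hedged sketch of getting input from there (assumes the file is data.csv and that its first line is the header):
input <- readLines("data.csv")[-1]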
With gsub I am able to remove the # from these person variables; however, the way I am trying to remove the random number is not correct. I would also like to remove the space after the person's name while keeping the space in the middle of the name.
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
c(1:99) -> numbers
person <- gsub("#", "", person, fixed=TRUE)
# MY ISSUE
person <- gsub(numbers, "", person, fixed=TRUE)
df <- data.frame(PERSON = person)
Current Results:
PERSON
mike smith 99
John johnson 2
jeff johnson 50
Expected Results:
PERSON
mike smith
John johnson
jeff johnson
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
sub("\\s+#.*", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
Here's another pattern as an alternative:
> gsub("(\\.*)\\s+#.*", "\\1", person)
[1] "mike smith" "John johnson" "jeff johnson"
In the above regex, (\\.*) will match a subgroup of any characters before a space (\\s+) following by # symbol and following by anything. Then \\1 indicates that gsub should replace all the original string with that subgroup (\\.*)
An easier way to get your desired output is :
> gsub("\\s+#.*$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
The above regex \\s+#.*$ indicates that everything consisting of whitespace (\\s+), a # symbol, and everything else until the end of the string (.*$) should be removed.
Using str_extract_all from the stringr package (note the pattern needs both cases; a lowercase-only class would strip the capital J from "John"):
> library(stringr)
> str_extract_all(person, "[A-Za-z]+", simplify = TRUE)
     [,1]   [,2]
[1,] "mike" "smith"
[2,] "John" "johnson"
[3,] "jeff" "johnson"
Also you can use:
library(stringi)
stri_extract_all(person, regex = "[A-Za-z]+", simplify = TRUE)
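Both calls return a character matrix with one word per column; to stitch the names back together (a hedged follow-up):
words <- str_extract_all(person, "[A-Za-z]+", simplify = TRUE)
apply(words, 1, paste, collapse = " ")
#[1] "mike smith"   "John johnson" "jeff johnson"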
This could alternately be done with read.table.
read.table(text = person, sep = "#", strip.white = TRUE,
as.is = TRUE, col.names = "PERSON")
giving:
PERSON
1 mike smith
2 John johnson
3 jeff johnson
We can create the pattern with paste
pat <- paste0("\\s*#(", paste(numbers, collapse = "|"), ")")
gsub(pat, "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Note that the above solution was based on creating the pattern from 'numbers'. If the goal is only to remove the # and the digits after it:
sub("\\s*#\\d+$", "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Or another option is
unlist(strsplit(person, "\\s*#\\d+"))
NOTE: All the above are base R methods
library(tidyverse)
tibble(person) %>%
separate(person, into = c("person", "notneeded"), "\\s+#") %>%
select(person)
An alternative that deletes any sequence of non (lowercase) alphabetic characters at the end of the string.
gsub("[^a-z]+$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
If you want to allow for words that are all upper case or end with an uppercase character.
gsub("[^a-zA-Z]+$", "", person)
Some names might end with .:
gsub("[^a-zA-Z.]+$", "", person)
My vector has around 3000 observations like:
clients <- c("Greg Smith", "John Coolman", "Mr. Brown", "John Nightsmith (father)", "2 Nicolas Cage")
How can I subset the elements that contain only names made of letters? For example, only Greg Smith and John Coolman (without symbols like 0-9, ., ?, :, [, } etc.).
We can use grep to match only upper- or lowercase letters, along with spaces, from the start (^) to the end ($) of the string.
grep('^[A-Za-z ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
Or just use the POSIX character class in [[:alpha:] ]+
grep('^[[:alpha:] ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"