Note I have already read Split string at first occurrence of an integer in a string however my request is different because I would like to use R.
Suppose I have the following example data frame:
> df = data.frame(name_and_address =
c("Mr. Smith12 Some street",
"Mr. Jones345 Another street",
"Mr. Anderson6 A different street"))
> df
name_and_address
1 Mr. Smith12 Some street
2 Mr. Jones345 Another street
3 Mr. Anderson6 A different street
I would like to split the string at the first occurrence of an integer. Notice that the integers are of varying length.
The desired output can be like the following:
[[1]]
[1] "Mr. Smith"
[2] "12 Some street",
[[2]]
[1] "Mr. Jones"
[2] "345 Another street",
[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"
I have tried the following, but I cannot get the regular expression right:
# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)
# Attempt 2 (Does not work)
library(stringr)
str_split(df, "\\d+")
I would use sub here:
df$name <- sub("(\\D+).*", "\\1", df$name_and_address)
df$address <- sub(".*?(\\d+.*)", "\\1", df$name_and_address)
You can use tidyr::extract:
library(tidyr)
df <- df %>%
extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
## name address
## 1 Mr. Smith 12 Some street
## 2 Mr. Jones 345 Another street
## 3 Mr. Anderson 6 A different street
The (\D*)(\d.*) regex matches the following:
(\D*) - Group 1: any zero or more non-digit chars
(\d.*) - Group 2: a digit and then any zero or more chars as many as possible.
Another solution with stringr::str_split is also possible:
str_split(df$name_and_address, "(?=\\d)", n=2)
## => [[1]]
## [1] "Mr. Smith" "12 Some street"
## [[2]]
## [1] "Mr. Jones" "345 Another street"
## [[3]]
## [1] "Mr. Anderson" "6 A different street"
The (?=\d) positive lookahead finds a location before a digit, and n=2 tells stringr::str_split to only split into 2 chunks max.
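For comparison, the same "split just before the first digit" idea can be sketched in base R without a lookahead, by locating the first digit with regexpr() and cutting the string at that position. The names x, pos, name and address are illustrative only, and the fallback for strings without any digit (whole string as name, empty address) is an assumption, not something the question specifies.

```r
# Illustrative input, taken from the question's example data
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# Position of the first digit in each string (-1 when there is none)
pos <- as.integer(regexpr("[0-9]", x))

# Everything before the first digit; assumed fallback: the whole string
name <- ifelse(pos > 0, substr(x, 1, pos - 1), x)

# The first digit and everything after it; assumed fallback: empty string
address <- ifelse(pos > 0, substr(x, pos, nchar(x)), "")

print(data.frame(name, address))
```

Because only the first digit is located, integers of any length stay together in the address part.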
Base R approach that does not return anything if there is no digit in the string:
df = data.frame(name_and_address = c("Mr. Smith12 Some street", "Mr. Jones345 Another street", "Mr. Anderson6 A different street", "1 digit is at the start", "No digits, sorry."))
df$name <- sub("^(?:(\\D*)\\d.*|.+)", "\\1", df$name_and_address)
df$address <- sub("^\\D*(\\d.*)?", "\\1", df$name_and_address)
df$name
# => [1] "Mr. Smith" "Mr. Jones" "Mr. Anderson" "" ""
df$address
# => [1] "12 Some street" "345 Another street"
# [3] "6 A different street" "1 digit is at the start" ""
See an online R demo. This also supports cases when the first digit is the first char in the string.
I have a list ("listanswer") that looks something like this:
> str(listanswer)
List of 100
$ : chr [1:3] "" "" "\t\t"
$ : chr [1:5] "" "Dr. Smith" "123 Fake Street" "New York, ZIPCODE 1" ...
$ : chr [1:5] "" "Dr. Jones" "124 Fake Street" "New York, ZIPCODE 2" ...
> listanswer
[[1]]
[1] "" "" "\t\t"
[[2]]
[1] "" "Dr. Smith" "123 Fake Street" "New York"
[5] "ZIPCODE 1"
[[3]]
[1] "" "Dr. Jones" "124 Fake Street," "New York"
[5] "ZIPCODE2"
For each element in this list, I noticed the following pattern within the sub-elements:
# first sub-element is always empty
> listanswer[[2]][[1]]
[1] ""
# second sub-element is the name
> listanswer[[2]][[2]]
[1] "Dr. Smith"
# third sub-element is always the address
> listanswer[[2]][[3]]
[1] "123 Fake Street"
# fourth sub-element is always the city
> listanswer[[2]][[4]]
[1] "New York"
# fifth sub-element is always the ZIP
> listanswer[[2]][[5]]
[1] "ZIPCODE 1"
I want to create a data frame that contains the information from this list in row format. For example:
id name address city ZIP
1 2 Dr. Smith 123 Fake Street New York ZIPCODE 1
2 3 Dr. Jones 124 Fake Street New York ZIPCODE 2
I thought of the following way to do this:
name = sapply(listanswer,function(x) x[2])
address = sapply(listanswer,function(x) x[3])
city = sapply(listanswer,function(x) x[4])
zip = sapply(listanswer,function(x) x[5])
final_data = data.frame(name, address, city, zip)
id = 1:nrow(final_data)
My Question: I just wanted to confirm - Is this the correct way to reference sub-elements in lists?
If it works, it's the correct way, although there might be a more efficient or more readable way to do the same thing.
Another way to do this is to create an empty data frame with your columns and add rows to it, i.e.
#create an empty data frame
df <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df) <- c("name", "address", "city", "zip")
#add rows (a for loop is used because an assignment inside lapply() would only change a local copy of df)
for (x in listanswer) df[nrow(df) + 1, ] <- x[2:5]
This is simply another way to solve the same problem. Readability is a personal preference, and there's nothing wrong with your solution either.
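A third option, sketched here under the assumption that every real record has at least five sub-elements, is to pull out positions 2 to 5 from each element and bind them into a data frame in one step. The small listanswer below is a stand-in built from the values shown in the question, and dropping the shorter elements is an assumption about what the blank records look like.

```r
# Stand-in for the scraped list, using the values shown in the question
listanswer <- list(
  c("", "", "\t\t"),
  c("", "Dr. Smith", "123 Fake Street", "New York", "ZIPCODE 1"),
  c("", "Dr. Jones", "124 Fake Street", "New York", "ZIPCODE 2")
)

# Assumption: elements with fewer than 5 sub-elements are blank records
keep <- lengths(listanswer) >= 5

# Take sub-elements 2..5 (name, address, city, zip) from each kept element
rows <- lapply(listanswer[keep], function(x) x[2:5])

# Stack the character vectors into a matrix, then convert to a data frame
final_data <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(final_data) <- c("name", "address", "city", "zip")

# Keep the original list position as the id, as in the desired output
final_data$id <- which(keep)

print(final_data)
```

Using which(keep) for the id keeps the link back to the position in the original list, which 1:nrow(final_data) would lose once blank records are dropped.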
If this is based on your elephant question, for businesses in Vancouver, then this mostly works.
library(rvest)
library(dplyr) #needed for bind_rows()
url<-"Website/british-columbia/"
page <-read_html(url)
#find the div tab of class=one_third
b = page %>% html_nodes("div.one_third")
listanswer <- b %>% html_text() %>% strsplit("\\n")
#listanswer2 <- b %>% html_text2() %>% strsplit("\\n")
listanswer[[1]]<-NULL #remove first blank record
rows<-lapply(listanswer, function(element){
vect<-element[-1] #remove first blank field
cityindex<-as.integer(grep("Vancouver", vect)) #find city field
#add some error checking and corrections
if(length(cityindex)==0) {
cityindex <- length(vect)-1 }
else if(length(cityindex)>1) {
cityindex <- cityindex[2] }
#get the fields of interest
address <- vect[cityindex-1]
city<-vect[cityindex]
phone <- vect[cityindex+1]
if( cityindex < 3) {
cityindex <- 3
} #error check
#first groups combine into 1 name
name <- toString(vect[1:(cityindex-2)])
data.frame(name, address, city, phone)
})
answer<-bind_rows(rows)
#clean up
answer$phone <- sub("Website", "", answer$phone)
answer
This still needs some cleanup to handle the inconsistencies, but it should be 80-90% complete.
Hello I am trying to split a dataframe column test$Name that is in this format.
[1]"Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A"
[2] "Victoria Centre<U+FF0E>Block 3<U+FF0E>20/F<U+FF0E>Flat B"
[3] "Lei King Wan<U+FF0E>Sites B<U+FF0E>Block 6 Yat Hong Mansion<U+FF0E>3/F<U+FF0E>Flat H"
[4] "Island Place<U+FF0E>Block 3 (Three Island Place)<U+FF0E>9/F<U+FF0E>Flat G"
[5] "7A Comfort Terrace<U+FF0E>5/F<U+FF0E>Flat B"
[6] "Broadview Court<U+FF0E>Block 4<U+FF0E>38/F<U+FF0E>Flat E"
[7] "Chi Fu Fa Yuen<U+FF0E>Fu Ho Yuen (Block H-5)<U+FF0E>16/F<U+FF0E>Flat G"
[8] "City Garden<U+FF0E>Phase 2<U+FF0E>Block 10<U+FF0E>9/F<U+FF0E>Flat B"
[9] "Euston Court<U+FF0E>Tower 1<U+FF0E>12/F<U+FF0E>Flat H"
[10] "Garley Building<U+FF0E>10/F<U+FF0E>Flat C"
The structure of each entry is BuildingName<U+FF0E>FloorNumber<U+FF0E>Unit. I would like to extract the building name like the following example.
Name
Fung Yat Building
Victoria Centre
Lei King Wan
...
I have tested that <U+FF0E> is actually '.' by doing this.
grepl('.',"Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A")
[1] TRUE
Hence, I have tried the following, but none of them worked:
test %>% separate(Name, c('Name'), sep = '.') %>% head
gsub(".", " ", test$Name[1], fixed=TRUE)
sub("^\\s*<U\\+\\w+>\\s*", " ", test$Name[1])
Any suggestions please? Thanks!
The easiest way is to use < as the split pattern.
library(stringr)
word("Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A", 1, sep = "\\<")
# word("Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A", 1, sep = "\\<U\\+FF0E\\>") ## building is '1', FloorNumber is '2', Unit is '3'
out:
[1] "Fung Yat Building"
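Note that the call above relies on the string containing the literal text <U+FF0E>. R typically prints <U+FF0E> when the console cannot display the character U+FF0E (a fullwidth full stop), so if the column actually holds that character, a sketch along the following lines may be closer to what is needed. The vector x is a made-up stand-in for test$Name, written with \uFF0E escapes; whether the real data looks like this is an assumption.

```r
# Stand-in for test$Name, assuming the separator is the real character U+FF0E
x <- c("Fung Yat Building\uFF0E13/F\uFF0EFlat A",
       "Victoria Centre\uFF0EBlock 3\uFF0E20/F\uFF0EFlat B")

# Split on the literal separator; fixed = TRUE avoids any regex interpretation
parts <- strsplit(x, "\uFF0E", fixed = TRUE)

# The building name is the first piece of each entry
building <- vapply(parts, function(p) p[1], character(1))

print(building)
```

With fixed = TRUE the separator is matched as a plain character, which sidesteps the problem that an unescaped "." in a regular expression matches any character.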
I have a group of names worded in a bizarre fashion. Here is a sample:
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
My goal is to clean it so that either the first and last name are returned for people or just the team name is returned for teams. Here is what I have tried:
df$name <- sub("^(.*[a-z])[A-Z]", "\\1", "\\1", df$name)
This is what I'm getting returned
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
To be clear, goal would be to have this:
Sammy Watkins
Buffalo Bills
New England Patriots
Tre'Quan Smith
JuJu Smith-Schuster
data
df <- data.frame(name = c(
"Sammy WatkinsS. Watkins",
"Buffalo BillsBUF",
"New England PatriotsNE",
"Tre'Quan SmithT. Smith",
"JuJu Smith-SchusterJ. Smith-Schuster"),
stringsAsFactors = FALSE)
I suggest
df$name <- sub("\\B[A-Z]+(?:\\.\\s+\\S+)*$", "", df$name)
See the regex demo
Pattern details
\B - a non-word boundary (there must be a letter, digit or _ right before)
[A-Z]+ - 1+ ASCII uppercase letters (use \p{Lu} to match any Unicode uppercase letters)
(?:\.\s+\S+)* - 0 or more sequences of:
\. - a dot
\s+ - 1+ whitespaces
\S+ - 1+ non-whitespaces
$ - end of string.
What about:
(?<=[a-z])[A-Z](?=[.\sA-Z]).*
Check here. Without experience in R I'm unsure if this would be accepted. Also, there may be neater patterns as I'm rather new to RegEx.
I've also included a (possibly unlikely) sample: Sammy J. WatkinsJ.S. Watkins
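For what it's worth, a pattern with lookarounds like this one can be used from R by passing perl = TRUE and doubling the backslash in \s inside the string literal. The following is a minimal sketch using the question's sample data; the variable names are illustrative.

```r
# Sample data from the question
df <- data.frame(name = c(
  "Sammy WatkinsS. Watkins",
  "Buffalo BillsBUF",
  "New England PatriotsNE",
  "Tre'Quan SmithT. Smith",
  "JuJu Smith-SchusterJ. Smith-Schuster"),
  stringsAsFactors = FALSE)

# An uppercase letter that follows a lowercase letter and is itself followed
# by a dot, whitespace or another uppercase letter starts the part to drop.
# Lookbehind/lookahead need the PCRE engine, hence perl = TRUE.
pattern <- "(?<=[a-z])[A-Z](?=[.\\sA-Z]).*"
cleaned <- sub(pattern, "", df$name, perl = TRUE)

print(cleaned)
```

The lookahead is what keeps the second "J" in "JuJu" from being treated as the start of the suffix, since it is followed by a lowercase letter.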
Two laps:
df$name <- gsub(".\\. .*", "", df$name)
df$name <- gsub("[A-Z]*$", "", df$name)
The first line removes all cases of the form "x. surname" and the second removes all capital letters at the end of the string.
Another way :
sub("(.*?\\s.*?[a-z](?=[A-Z])).*", "\\1", df$name, perl = TRUE)
#> [1] "Sammy Watkins" "Buffalo Bills" "New England Patriots"
#> [4] "Tre'Quan Smith" "JuJu Smith-Schuster"
sub(".*?\\s.*?[a-z](?=[A-Z])", "", df$name, perl = TRUE)
#> [1] "S. Watkins" "BUF" "NE"
#> [4] "T. Smith" "J. Smith-Schuster"
We're splitting between a lower case character and an upper case character, but not before we see a space.
You could also use unglue :
library(unglue)
unglue_unnest(df, name, "{name1=.*?\\s.*?[a-z]}{name2=[A-Z].*?}")
#> name1 name2
#> 1 Sammy Watkins S. Watkins
#> 2 Buffalo Bills BUF
#> 3 New England Patriots NE
#> 4 Tre'Quan Smith T. Smith
#> 5 JuJu Smith-Schuster J. Smith-Schuster
I have a list of names that I need to convert from "Firstname Lastname" to "Lastname, Firstname".
Barack Obama
Donald J. Trump
J. Edgar Hoover
Beyonce Knowles-Carter
Sting
I used G. Grothendieck's answer to "last name, first name" -> "first name last name" in serialized strings to get to gsub("([^ ]*) ([^ ]*)", "\\2, \\1", str) which gives me -
Obama, Barack
J., DonaldTrump,
Edgar, J.Hoover,
Knowles-Carter, Beyonce
Sting
What I would like to get -
Obama, Barack
Trump, Donald J.
Hoover, J. Edgar
Knowles-Carter, Beyonce
Sting
I would like a regex answer.
There is an esoteric function called person designed for holding names, a conversion function as.person which does this parsing for you, and a format method to make use of it afterwards (with a creative use of the braces argument). It even works with complex surnames (e.g. van Nistelrooy), but the single-name result is unsatisfactory. It can be fixed with a quick sub at the end though.
x <- c("Barack Obama","Donald J. Trump","J. Edgar Hoover","Beyonce Knowles-Carter","Sting", "Ruud van Nistelrooy", "John von Neumann")
y <- as.person(x)
format(y, include=c("family","given"), braces=list(family=c("",",")))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting," "van Nistelrooy, Ruud"
[7] "von Neumann, John"
## fix for single names - curse you Sting!
sub(",$", "", format(y, include=c("family","given"), braces=list(family=c("",","))))
[1] "Obama, Barack" "Trump, Donald J."
[3] "Hoover, J. Edgar" "Knowles-Carter, Beyonce"
[5] "Sting" "van Nistelrooy, Ruud"
[7] "von Neumann, John"
Use
gsub("(.*[^van])\\s(.*)", "\\2, \\1", people)
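The regex:
(.*[^van]) \\s (.*)
Group 1: as many characters as possible, ending in a character other than v, a or n... the last whitespace preceded by such a character... Group 2: the last name, containing any characters.
Note that [^van] is a negated character class: it excludes the single characters v, a and n, not the word "van". A name whose only space follows one of those letters (e.g. "John Smith") is therefore left unchanged.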
Data:
people <- c("Barack Obama",
"Donald J. Trump",
"J. Edgar Hoover",
"Beyonce Knowles-Carter",
"Sting",
"Ruud van Nistelrooy",
"Xi Jinping",
"Hans Zimvanmer")
Result:
[1] "Obama, Barack" "Trump, Donald J." "Hoover, J. Edgar"
[4] "Knowles-Carter, Beyonce" "Sting" "van Nistelrooy, Ruud"
[7] "Jinping, Xi" "Zimvanmer, Hans"
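Since a character class such as [^van] works on single characters rather than on the word "van", one alternative is to list the surname particles explicitly. The sketch below does that with a PCRE pattern; the particle list (van, von, de) is an assumption chosen for illustration and would need extending for real data, and the extra test names are made up.

```r
people <- c("Barack Obama", "Donald J. Trump", "J. Edgar Hoover",
            "Beyonce Knowles-Carter", "Sting", "Ruud van Nistelrooy",
            "John von Neumann", "John Smith")

# Group 1: given names (lazy, so it stops as early as possible)
# Group 2: optional particles (assumed list) followed by the final word
pattern <- "^(.*?)\\s+((?:(?:van|von|de)\\s+)*\\S+)$"

# Single names such as "Sting" contain no whitespace, so they do not match
# and are returned unchanged
flipped <- sub(pattern, "\\2, \\1", people, perl = TRUE)

print(flipped)
```

Anchoring with ^ and $ forces group 2 to be the end of the string, so the lazy group 1 grows only until the remainder is a valid surname.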
I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to delete the text used to split the string. Is there any way I can preserve it?
Thanks
You are almost there, just append \\K\\s* to your regex and prepend with the ^, start of string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
The \K is a match reset operator that discards the text matched so far, so after matching 1+ digits, optionally followed by a - surrounded by 0+ whitespaces and then 1+ digits, at the start of the string, this whole text is dropped from the match. Only the 0+ whitespaces end up in the match value, and they are what the string is split on.
See the R demo outputting:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
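If the goal is simply to keep both pieces, another route is to capture them instead of splitting, using regexec() with regmatches(). This is only a sketch: the vector is named addresses here purely for illustration, and the pattern assumes every entry starts with a number or a number range.

```r
addresses <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln",
               "95 - 99 Fake Street", "99 Fake Street")

# Group 1: leading number or number range; group 2: the rest of the address
pattern <- "^(\\d+(?:\\s*-\\s*\\d+)?)\\s*(.*)$"

# regexec() records the full match plus each capture group;
# regmatches() extracts them as one character vector per input string
m <- regmatches(addresses, regexec(pattern, addresses, perl = TRUE))

# Stack into a matrix: column 1 is the full match, columns 2-3 are the groups
parts <- do.call(rbind, m)
result <- data.frame(number = parts[, 2], street = parts[, 3])

print(result)
```

Capturing avoids the question of what strsplit does with the matched text, because nothing is consumed as a delimiter except the whitespace between the two groups.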
You could use lookbehinds and lookaheads to split at the space between a number and the character:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however also splits at the space between "M4" and "Ln". If your addresses are always of the format "number (possible range) followed by rest of the address" you could extract the two parts separately (as #d.b suggested):
splitDf <- data.frame(
numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
# numberPart rest
# 1 955 - 959 Fake Street
# 2 95-99 Fake Street
# 3 4-9 M4 Ln
# 4 95 - 99 Fake Street
# 5 99 Fake Street