extracting names and numbers using regex - r

I think I might have some issues with understanding the regular expressions in R.
I need to extract phone numbers and names from a sample vector and create a data-frame with corresponding columns for names and numbers using stringr package functionality.
The following is my sample vector.
phones <- c("Ann 077-789663", "Johnathan 99656565",
"Maria2 099-65-6569 office")
The code that I came up with to extract those is as follows
numbers <- str_remove_all(phones, pattern = "[^0-9]")
numbers <- str_remove_all(numbers, pattern = "[a-zA-Z]")
numbers <- trimws(numbers)
names <- str_remove_all(phones, pattern = "[A-Za-z]+", simplify = T)
phones_data <- data.frame("Name" = names, "Phone" = numbers)
It doesn't work, as it takes the digit in the name and joins with the phone number. (not optimal code as well)
I would appreciate some help in explaining the simplest way for accomplishing this task.

Not a regex expert, however with stringr package we can extract a number pattern with optional "-" in it and replace the "-" with empty string to extract numbers without any "-". For names, we extract the first word at the beginning of the string.
library(stringr)
data.frame(Name = str_extract(phones, "^[A-Za-z]+"),
Number = gsub("-","",str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+")))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
If you want to stick completely with stringr we can use str_replace_all instead of gsub
data.frame(Name = str_extract(phones, "[A-Za-z]+"),
Number=str_replace_all(str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+"), "-",""))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569

I think Ronak's answer is good for the name part, I don't really have a good alternative to offer there.
For numbers, I would go with "numbers and hyphens, with a word boundary at either end", i.e.
numbers = str_extract(phones, "\\b[-0-9]+\\b") %>%
str_remove_all("-")
# Can also specify that you need at least 5 numbers/hyphens
# in a row to match
numbers2 = str_extract(phones, "\\b[-0-9]{5,}\\b") %>%
str_remove_all("-")
That way, you're not locked into a fixed format for the number of hyphens that appear in the number (my suggested regex allows for any number).

If you (like me) prefer to use base-R and want to keep the regex as simple as possible you could do something like this:
phone_split <- lapply(
strsplit(phones, " "),
function(x) {
name_part <- grepl("[^-0-9]", x)
c(
name = paste(x[name_part], collapse = " "),
phone = x[!name_part]
)
}
)
phone_split
[[1]]
name phone
"Ann" "077-789663"
[[2]]
name phone
"Johnathan" "99656565"
[[3]]
name phone
"Maria2 office" "099-65-6569"
do.call(rbind, phone_split)
name phone
[1,] "Ann" "077-789663"
[2,] "Johnathan" "99656565"
[3,] "Maria2 office" "099-65-6569"

Related

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. So i have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them to one word in R Tool. Does anyone know how can I solve this?
This is what I have done
enter image description here
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlistto get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# split: Split will return a list with one char vector in it.
# We only want to interact with the vector not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword,"")[[1]]
jumble <- sample(jumble, # sample random element from jumble
size = length(jumble), # as many times as the length of jumble
# ergo all Letters
replace = FALSE # do not sample an element multiple times
)
restored <- paste0(jumble,
collapse = "" # bas
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.

Apply a regex only to the first word of a phrase (defined with spaces)

I have this regex to separate letters from numbers (and symbols) of a word: (?<=[a-zA-Z])(?=([[0-9]|[:punct:]])). My test string is: "CALLE15 CRA22".
I want to apply this regex only to the first word of that sentence (the word is defined with spaces). Namely, I want apply that only to "CALLE15".
One solution is split the string (sentence) into words and then apply the regex to the first word, but I want to do all in one regex. Other solution is to use r stringr::str_replace() (or sub()) that replace only the first match, but I need stringr::str_replace_all (or gsub()) for other reasons.
What I need is to insert a space between the two that I do with the replacement function. The outcome I want is "CALLE 15 CRA22" and with the posibility of "CALLE15 CRA 22". I try a lot of positions for the space and nothing, neither the ^ at the beginning.
https://rubular.com/r/7dxsHdOA3avTdX
Thanks for your help!!!!
I am unsure about your problem statement (see my comment above), but the following reproduces your expected output and uses str_replace_all
ss <- "CALLE15 CRA22"
library(stringr)
str_replace_all(ss, "^([A-Za-z]+)(\\d+)(\\s.+)$", "\\1 \\2\\3")
#[1] "CALLE 15 CRA22"
Update
To reproduce the output of the sample string from the comment above
ss <- "CLL.6 N 5-74NORTE"
pat <- c(
"(?<=[A-Za-z])(?![A-Za-z])",
"(?<![A-Za-z])(?=[A-Za-z])",
"(?<=[0-9])(?![0-9])",
"(?<![0-9])(?=[0-9])")
library(stringr)
str_split(ss, sprintf("(%s)", paste(pat, collapse = "|"))) %>%
unlist() %>%
.[nchar(trimws(.)) > 0] %>%
paste(collapse = " ")
#[1] "CLL . 6 N 5 - 74 NORTE"

Limiting word count in a character column in R and saving extra words in another variable [duplicate]

I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also for example I have a CSV file where the format looks like this :-
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write it onto a CSV file.
Please help me in this regards.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the token (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by not-words.
(\\w+\\W+){9} A word followed by not-a-word, 9 times
\\w+ One or more word characters (i.e. a word)
\\W+ One or more non-word characters (i.e. a space)
{9} Nine repetitions
\\w+ The tenth word
.* Anything else, including other following words
$ End of the token (zero-width)
\\1 when this token found, replace it with the first captured group (the 10 words)
How about using the word function from Hadley Wickham's stringr package?
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is an small function that unlist the strings, subsets the first ten words and then pastes it back together.
string_fun <- function(x) {
ul = unlist(strsplit(x, split = "\\s+"))[1:10]
paste(ul,collapse=" ")
}
string_fun(x)
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the function isn't doing anything in the second column)
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
Probably this is a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = "\ "), 10)

Splitting unseparated string and numerical variables in R

I have transformed a Pdf to text file and I have a data set which is constructed as follow:
data=c("Paris21London3Tokyo51San Francisco38")
And I would like to obtain the following structure:
matrix(c("Paris","London","Tokyo","San Francisco",21,3,51,38),4,2)
Does anyone have a method to do it ? Thanks
You could try strsplit with regex lookahead and lookbehind
v1 <- strsplit(data, '(?<=[^0-9])(?=[0-9])|(?<=[0-9])(?=[^0-9])',
perl=TRUE)[[1]]
indx <- c(TRUE, FALSE)
data.frame(Col1= v1[indx], Col2=v1[!indx])
Update
Including decimal numbers as well
data1=c("Paris21.53London3Tokyo51San Francisco38.2")
v2 <- strsplit(data1, '(?<=[^0-9.])(?=[0-9])|(?<=[0-9])(?=[^0-9.])',
perl=TRUE)[[1]]
indx <- c(TRUE, FALSE)
data.frame(Col1= v2[indx], Col2=v2[!indx])
# Col1 Col2
#1 Paris 21.53
#2 London 3
#3 Tokyo 51
#4 San Francisco 38.2
Regular expressions are the right tool here, but unlike the other answer shows, strsplit is not well suited for the job.
Better use regular expression matches, and to have two separate expressions for words and numbers:
words = '[a-zA-Z ]+'
numbers = '[+-]?\\d+(\\.\\d+)?'
word_matches = gregexpr(words, data)
number_matches = gregexpr(numbers, data)
result = cbind(regmatches(data, word_matches)[[1]],
regmatches(data, number_matches)[[1]])
This recognises any number with an optional decimal point, and an optional sign. It does not recognise numbers in scientific (exponential) notation. This can be trivially added, if necessary.

How to get the first 10 words in a string in R?

I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also for example I have a CSV file where the format looks like this :-
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write it onto a CSV file.
Please help me in this regards.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the token (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by not-words.
(\\w+\\W+){9} A word followed by not-a-word, 9 times
\\w+ One or more word characters (i.e. a word)
\\W+ One or more non-word characters (i.e. a space)
{9} Nine repetitions
\\w+ The tenth word
.* Anything else, including other following words
$ End of the token (zero-width)
\\1 when this token found, replace it with the first captured group (the 10 words)
How about using the word function from Hadley Wickham's stringr package?
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is an small function that unlist the strings, subsets the first ten words and then pastes it back together.
string_fun <- function(x) {
ul = unlist(strsplit(x, split = "\\s+"))[1:10]
paste(ul,collapse=" ")
}
string_fun(x)
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the function isn't doing anything in the second column)
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
Probably this is a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = "\ "), 10)

Resources