Extract names and covert to email addresses in R - r

I have the following string: " John Andrew Thomas"(4 empty spaces before John) and I need to split and concat it so my output is "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com", also I need to remove all whitespaces.
My best guess is:
test = unlist(lapply(names, strsplit, split = " ", fixed = FALSE))
paste(test, collapse = "#gmail.com")
but I get this as an output:
"#gmail.com#gmail.com#gmail.com#gmail.comJohn#gmail.comAndrew#gmail.comThomas"

names <- " John Andrew Thomas"
test <- unlist(lapply(names, strsplit, split = " ", fixed = FALSE))
paste(test[test != ""],"#gmail.com",sep = "",collapse = ";")
A small tweak to your paste line will remove the extra spaces and separate the email addresses with a semicolon.
Output is the following:
[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"

With stringr, so we can use its str_trim function to deal with your leading whitespace, and assuming your string is x:
library(stringr)
paste(sapply(str_split(str_trim(x), " "), function(i) sprintf("%s#gmail.com", i)), collapse = ";")
And here's a piped version, so it's easier to follow:
library(dplyr)
library(stringr)
x %>%
# get rid of leading and trailing whitespace
str_trim() %>%
# make a list with the elements of the string, split at " "
str_split(" ") %>%
# get an array of strings where those list elements are added to a fixed chunk via sprintf
sapply(., function(i) sprintf("%s#gmail.com", i)) %>%
# concatenate the resulting array into a single string with semicolons
paste(., collapse = ";")

Another approach using trimws function of base R
paste0(unlist(strsplit(trimws(names)," ")),"#gmail.com",collapse = ";")
#[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
Data
names <- " John Andrew Thomas"

Another idea using stringi:
v <- " John Andrew Thomas"
paste0(stringi::stri_extract_all_words(v, simplify = TRUE), "#gmail.com", collapse = ";")
Which gives:
#[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"

You can use gsub(), and a little creativity.
x <- " John Andrew Thomas"
paste0(gsub(" ", "#gmail.com;", trimws(x)), "#gmail.com")
# [1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
No packages, no loops, and no string splitting.

Related

How to remove n number of identical characters from string in R

I have a string in R where the words are interspaced with a random number of character \n:
mystring = c("hello\n\ni\n\n\n\n\nam\na\n\n\n\n\n\n\ndog")
I want to replace n number of repeating \n elements so that there is only a space character between words. I can currently do this as follows, but I want a tidier solution:
mystring %>%
gsub("\n\n", "\n", .) %>%
gsub("\n\n", "\n", .) %>%
gsub("\n\n", "\n", .) %>%
gsub("\n", " ", .)
[1] "hello i am a dog"
What is the best way to achieve this?
We can use + to signify one or more repetitions
gsub("\n+", " ", mystring)
[1] "hello i am a dog"
We could use same logic as akrun with str_replace_all:
library(stringr)
str_replace_all(mystring, '\n+', ' ')
[1] "hello i am a dog"
In this case, you might find str_squish() convenient. This is intended to solve this exact problem, while the other solutions show good ways to solve the more general case.
library(stringr)
mystring = c("hello\n\ni\n\n\n\n\nam\na\n\n\n\n\n\n\ndog")
str_squish(mystring)
# [1] "hello i am a dog"
If you look at the code of str_squish(), it is basically wrapper around str_replace_all().
str_squish
function (string)
{
stri_trim_both(str_replace_all(string, "\\s+", " "))
}
Another possible solution, based on stringr::str_squish:
library(stringr)
str_squish(mystring)
#> [1] "hello i am a dog"

Formatting UK Postcodes in R

I am trying to format UK postcodes that come in as a vector of different input in R.
For example, I have the following postcodes:
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
How do I write a generic code that would convert entries of the above vector into:
c("IV41 8PW","IV40 8BU","KY11 4HJ","KY1 1UU","KY4 9RW","G32 7EJ")
That is the first part of the postcode is separated from the second part of the postcode by one space and all letters are capitals.
EDIT: the second part of the postcode is always the 3 last characters (combination of a number followed by letters)
I couldn't come up with a smart regex solution so here is a split-apply-combine approach.
sapply(strsplit(sub('^(.*?)(...)$', '\\1:\\2', postcodes), ':', fixed = TRUE), function(x) {
paste0(toupper(trimws(x, whitespace = '[.\\s-]')), collapse = ' ')
})
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
The logic here is that we insert a : (or any character that is not in the data) in the string between the 1st and 2nd part. Split the string on :, remove unnecessary characters, get it in upper case and combine it in one string.
One approach:
Convert to uppercase
extract the alphanumeric characters
Paste back together with a space before the last three characters
The code would then be:
library(stringr)
postcodes<-c("IV41 8PW","IV408BU","kY11..4hJ","KY1.1UU","KY4 9RW","G32-7EJ")
postcodes <- str_to_upper(postcodes)
sapply(str_extract_all(postcodes, "[:alnum:]"), function(x)paste(paste0(head(x,-3), collapse = ""), paste0(tail(x,3), collapse = "")))
# [1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
You can remove everything what is not a word caracter \\W (or [^[:alnum:]_]) and then insert a space before the last 3 characters with (.{3})$ and \\1.
sub("(.{3})$", " \\1", toupper(gsub("\\W+", "", postcodes)))
#sub("(...)$", " \\1", toupper(gsub("\\W+", "", postcodes))) #Alternative
#sub("(?=.{3}$)", " ", toupper(gsub("\\W+", "", postcodes)), perl=TRUE) #Alternative
#[1] "IV41 8PW" "IV40 8BU" "KY11 4HJ" "KY1 1UU" "KY4 9RW" "G32 7EJ"
# Option 1 using regex:
res1 <- gsub(
"(\\w+)(\\d[[:upper:]]\\w+$)",
"\\1 \\2",
gsub(
"\\W+",
" ",
postcodes
)
)
# Option 2 using substrings:
res2 <- vapply(
trimws(
gsub(
"\\W+",
" ",
postcodes
)
),
function(ir){
paste(
trimws(
substr(
ir,
1,
nchar(ir) -3
)
),
substr(
ir,
nchar(ir) -2,
nchar(ir)
)
)
},
character(1),
USE.NAMES = FALSE
)

Find out single word names

I have a Name column and names are like this:
Preety ..
Sudalai Rajkumar S.
Parvathy M. S.
Navaraj Ranjan Arthur
I want to get which of these are single-word names, like in this case Preety.
I have tried eliminating the "." and " " and counting the length and using the difference of this length and the original string length.
But it's not giving me the desired output. Please help.
NBData3$namewodot <- gsub(" .","",NBData3$Client.Name)
NBData3$namewoblank <- gsub(" ","",NBData3$namewodot)
wordlength <- NBData3$namelengthchar-nchar(as.character(NBData3$namewoblank))
You could use str_count from stringr inside an ifelse() statement to check one worded names; first removing dots from names with gsub.
library(stringr)
NBData3$namewodot <- gsub("\\.", "", NBData3$Client.Name)
NBData3$oneword <- ifelse(str_count(NBData3$namewodot , '\\w+') == 1, TRUE, FALSE)
# Client.Name namewodot oneword
# 1 Preety .. Preety TRUE
# 2 Sudalai Rajkumar S. Sudalai Rajkumar S FALSE
# 3 Parvathy M. S. Parvathy M S FALSE
# 4 Navaraj Ranjan Arthur Navaraj Ranjan Arthur FALSE
This seems to work for your example
names = c("Preety ..",
"Sudalai Rajkumar S." ,
"Parvathy M. S.",
"Navaraj Ranjan Arthur")
names[sapply(strsplit(gsub(".","",names,fixed=T)," ",fixed=T),function(x) length(x) == 1)]
[1] "Preety .."
This may be a bit round about, but here would be a text mining approach. There are definitely more streamlined ways, but I thought there might be concepts in here that are also useful.
# define the data frame
df <- data.frame(Name = c("Preety ..",
"Sudalai Rajkumar S.",
"Parvathy M. S.",
"Navaraj Ranjan Arthur"),
stringsAsFactors = FALSE)
library(tidyverse)
library(tidytext)
# break each name out by words. remove all the periods
df_token <- df %>%
rowid_to_column(var = "name_id") %>%
mutate(Name = str_remove_all(Name, pattern = "\\.")) %>%
unnest_tokens(name_split, Name, to_lower = FALSE)
# find the lines with only one word
df_token %>%
group_by(name_id) %>%
summarize(count = n()) %>%
filter(count == 1) %>%
left_join(df_token) %>%
pull(name_split)
[1] "Preety"
in base R you could use grep:
grep("^\\S+$", gsub("\\W+$", "", names), value=T)
[1] "Preety"
If you need the names as originally given, then you will just use [:
names[grep("^\\S+$", gsub("\\W+$", "", names))]
[1] "Preety .."

How to find if a string contain certain characters without considering sequence?

I'm trying to match a name using elements from another vector with R. But I don't know how to escape sequence when using grep() in R.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.
Use .* in the pattern
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or if you are getting names as it is and can't change it, you can split on whitespace and insert the .* between the words like
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed in the below fashion
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
[1] "Cry Me A River"
The strategy here is to first split the string containing the various search terms, and generating individual regex patterns for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then, we iterate over each term, and using grepl we check that the term appears in each of the strings. Finally, we retain only those matches which contained all terms.
We can do the grepl on splitted string and Reduce the list of logical vectors to a single logicalvector` and extract the matching element in 'string'
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
Also, instead of strsplit, we can insert the .* with sub
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
Here's an approach using stringr. Is order important? Is case important? Is it important to match whole words. If you would just like to match 'Cry' and 'River' in any order and don't care about case.
name <- "Cry River"
string <- c("Yesterday Once More",
"Are You happy",
"Cry Me A River",
"Take me to the River or I'll Cry",
"The Cryogenic River Rag",
"Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]

Replace space between two words with an underscore in a vector

I have words, Genus species, and I want an underscore to replace the space between the two strings in R
Input:
>data$species
Genus species
Desired output:
>data$species
Genus_species
We can use sub from base R
data$species <- sub(" ", "_", data$species)
Or with chartr from base R
data$species <- chartr(" ", "_", data$species)
Or using tidyverse
library(tidyverse)
data %>%
mutate(species = str_replace(species, " ", "_"))
You should use gsub:
data$species <- gsub(" ", "_", data$species)
The pattern [^[:alnum:]] can be used to replace all non alphanumeric characters by underscores. Combine it with tolower() to convert all characters to snake case following the tidyverse syntax recommendations.
column_names <- c("Bla Bla", "Hips / Hops", "weight (1000 kg)")
gsub("[^[:alnum:]]", "_", tolower(column_names))
[1] "bla_bla" "hips___hops" "weight__1000_kg_"
The pattern [^[:alnum:]]+ replaces one or more occurrences by an underscore.
gsub("[^[:alnum:]]+","_",tolower(column_names))
[1] "bla_bla" "hips_hops" "weight_1000_kg_"
See help(regexp) for more information on regular expressions.

Resources