Find single-word names in R

I have a Name column and names are like this:
Preety ..
Sudalai Rajkumar S.
Parvathy M. S.
Navaraj Ranjan Arthur
I want to find which of these are single-word names; in this case, Preety.
I have tried eliminating the "." and " ", counting the remaining length, and taking the difference between this and the original string length.
But it's not giving me the desired output. Please help.
NBData3$namewodot <- gsub(" .", "", NBData3$Client.Name) # note: the unescaped "." is a regex wildcard, so this deletes each space plus the character after it
NBData3$namewoblank <- gsub(" ", "", NBData3$namewodot)
wordlength <- NBData3$namelengthchar - nchar(as.character(NBData3$namewoblank)) # namelengthchar: nchar() of the original name, computed elsewhere
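For reference, the length-difference idea from the question can be made to work once the dot is escaped and the ends are trimmed; a minimal sketch (my own completion, not code from the original post):
# Drop literal dots ("\\." is a literal dot), trim the ends, then count the
# remaining spaces; zero spaces means a single-word name.
nodot <- trimws(gsub("\\.", "", NBData3$Client.Name))
nspace <- nchar(nodot) - nchar(gsub(" ", "", nodot))
NBData3$oneword <- nspace == 0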

You could use str_count from stringr inside an ifelse() statement to flag one-word names, first removing the dots with gsub:
library(stringr)
NBData3$namewodot <- gsub("\\.", "", NBData3$Client.Name)
NBData3$oneword <- ifelse(str_count(NBData3$namewodot, '\\w+') == 1, TRUE, FALSE)
#             Client.Name             namewodot oneword
# 1             Preety ..                Preety    TRUE
# 2   Sudalai Rajkumar S.    Sudalai Rajkumar S   FALSE
# 3        Parvathy M. S.          Parvathy M S   FALSE
# 4 Navaraj Ranjan Arthur Navaraj Ranjan Arthur   FALSE
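A side note (mine, not the answer's): the comparison already returns TRUE/FALSE, so the ifelse() wrapper is optional:
# Equivalent, since str_count(...) == 1 is already logical:
NBData3$oneword <- str_count(NBData3$namewodot, '\\w+') == 1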

This seems to work for your example:
names <- c("Preety ..",
           "Sudalai Rajkumar S.",
           "Parvathy M. S.",
           "Navaraj Ranjan Arthur")
names[sapply(strsplit(gsub(".", "", names, fixed = TRUE), " ", fixed = TRUE),
             function(x) length(x) == 1)]
[1] "Preety .."

This may be a bit roundabout, but here is a text-mining approach. There are definitely more streamlined ways, but I thought there might be concepts in here that are also useful.
# define the data frame
df <- data.frame(Name = c("Preety ..",
                          "Sudalai Rajkumar S.",
                          "Parvathy M. S.",
                          "Navaraj Ranjan Arthur"),
                 stringsAsFactors = FALSE)
library(tidyverse)
library(tidytext)
# break each name out by words; remove all the periods
df_token <- df %>%
  rowid_to_column(var = "name_id") %>%
  mutate(Name = str_remove_all(Name, pattern = "\\.")) %>%
  unnest_tokens(name_split, Name, to_lower = FALSE)
# find the lines with only one word
df_token %>%
  group_by(name_id) %>%
  summarize(count = n()) %>%
  filter(count == 1) %>%
  left_join(df_token) %>%
  pull(name_split)
[1] "Preety"

In base R you could use grep:
grep("^\\S+$", gsub("\\W+$", "", names), value=T)
[1] "Preety"
If you need the names as originally given, just use [ to index:
names[grep("^\\S+$", gsub("\\W+$", "", names))]
[1] "Preety .."

Find year in random data in R

I have 71 columns in a dataframe, 10 of which contain data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
I am trying to find a way to pull the years from relevant cells and insert them in a new column.
So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:
dated_data <- select(undated_data, 1:71) %>%
  filter(grepl("1990", id_1) | grepl("1990", id_2) | grepl("1991", id_1) | grepl("1991", id_2)) # and so on, for every year and column
However, it takes a really long time to write that out for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new column.
The output I want looks like this:
dated_data$year <- c("2013", "2014", "2016", "1990")
Does anyone know how I do this? Thank you in advance for your help!
There are many ways. This is one of them:
Step 1: define a pattern you want to match with regex:
pattern <- "(1|2)\\d{3}"
Step 2: define a function to extract raw matches:
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = T)))
Step 3: apply the function to your data, e.g., id_1:
extract(id_1)
[1] "2013" "2014" "2016" "1990"
Here's another way, actually simpler ;)
It uses the str_extract function from the stringr package, so you install and load the package:
install.packages("stringr")
library(stringr)
and use str_extract to pull your matches:
years <- str_extract(id_1,"(1|2)\\d{3}")
years
[1] "2013" "2014" "2016" "1990"
EDIT:
If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.
For example, if your data is like this (note the two added strings which do not contain years):
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25")
you can set up the ifelse test like this:
years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA NA
Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.
library(tidyverse)
undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))
EDIT: based on your comment it looks like maybe some columns have a year and others do not. What we do instead is pull the year out of any column with id_* and then coalesce the columns together. Again, without your data it's tough to test this.
undated_data %>%
  select(1:71) %>%
  filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
  mutate_at(vars(starts_with("id_")), list(year = ~ str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
  mutate(year = coalesce(!!!select(., ends_with("_year")))) %>% # splice the *_year columns into coalesce()
  select(-ends_with("_year"))
Using tidyverse methods:
undated_data %>%
  mutate_at(vars(1:71),
            ~ str_extract(., "(1|2)[0-9]{3}"))
(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
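One possible shape for such a custom function (a sketch; extract_year is a made-up name): pull every 4-digit run, then keep only values inside the plausible range:
# Hypothetical helper: extract 4-digit runs, keep the first one in 1990-2019,
# and return NA when nothing qualifies.
extract_year <- function(x, lo = 1990, hi = 2019) {
  m <- regmatches(x, gregexpr("\\d{4}", x))
  vapply(m, function(v) {
    v <- v[as.integer(v) >= lo & as.integer(v) <= hi]
    if (length(v)) v[1] else NA_character_
  }, character(1))
}
extract_year(c("regkfg_2013", "abc2999def"))
# [1] "2013" NA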
Here is a similar solution to the one provided, but using dplyr and stringr on a data.frame.
library(stringr)
library(dplyr)
library(tidyr) # for pivot_longer()
df <- data.frame("X1" = id_1, "X2" = id_2)
# Set in cols the column names from which years are going to be extracted
df %>%
  pivot_longer(cols = c("X1", "X2"), names_to = "id") %>%
  arrange(id) %>%
  mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))
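One fragility worth flagging (my note, not the answer's): unlist() on str_extract_all() only lines up when every row has exactly one match. str_extract() avoids that by returning exactly one result, or NA, per row:
# Safer variant: one match (or NA) per row, so the column length always fits
df %>%
  pivot_longer(cols = c("X1", "X2"), names_to = "id") %>%
  arrange(id) %>%
  mutate(new = str_extract(value, "(1|2)\\d{3}"))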
Base R solution:
# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
# Thanks @Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"
# Convert to data.frame: df => data.frame
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)
# Subset the data to only contain date information vectors: dates_subset => data.frame
dates_subset <- df[,sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]
# Initialise the years vector: years => character vector:
df$years <- NA_character_
# Remove punctuation and letters, return valid dates, combine into a comma-separated string:
# Store the dates found in the string: years => character vector
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <-
  apply(sapply(dates_subset, function(x){
    grep(date_pattern, unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}),
    1, paste, collapse = ", ")
Here may be another solution.
We just use the gsub() function with the pattern ".*(199[0-9]|20[01][0-9]).*".
The pattern captures a year between 1990 and 2019 as a group (the only group in the pattern),
so we replace the original text with that first captured group. :)
library(magrittr)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"

Select words that are in a vector multiple times

I stored some items that didn't fulfill a criterion in a vector.
non.fulfilled <- c('positive', 'beta.1', 'beta.2', 'negative', 'alpha.1', 'alpha.2', 'alpha.3')
Now, I would like to find which words are in my vector multiple times and afterwards add them to this vector. So in this case:
non.fulfilled2 <- cbind(non.fulfilled, 'beta', 'alpha')
How do I find these words?
If we assume that a "word" here is defined as the first run of \w ("word characters"), we can do as follows to get the desired output:
non.fulfilled <- c('positive', 'beta.1', 'beta.2', 'negative', 'alpha.1', 'alpha.2', 'alpha.3')
library(stringr)
words <- str_extract(non.fulfilled, "\\w+")
unique(words[duplicated(words)])
#> [1] "beta" "alpha"
EDIT: After clarification in the comments, we can get duplicates like so:
words <- str_replace(non.fulfilled, "\\..*", "")
unique(words[duplicated(words)])
#> [1] "beta" "alpha"
Created on 2019-12-23 by the reprex package (v0.3.0)
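The two versions differ only when a name has non-word characters before its first dot; a made-up case to illustrate:
x <- "alpha-2.1"                # hypothetical input, not from the question
str_extract(x, "\\w+")          # "alpha"   -- the first run of word characters
str_replace(x, "\\..*", "")     # "alpha-2" -- everything before the first dot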
We can use sub to keep the string before the dot, count occurrences using table and select the values which occur more than once.
vals <- table(sub('\\..*', '', non.fulfilled))
names(vals[vals > 1])
#[1] "alpha" "beta"
Append them to the original vector:
c(non.fulfilled, names(vals[vals > 1]))
We can also use tidyverse approaches
library(dplyr)
library(stringr)
tibble(non.fulfilled) %>%
  mutate(non.fulfilled = str_remove(non.fulfilled, "\\.\\d+$")) %>%
  count(non.fulfilled) %>%
  filter(n > 1) %>%
  pull(non.fulfilled)
#[1] "alpha" "beta"

Splitting surnames from full names

I've used this:
String <- unlist(str_split(Invname,"[ ]",n=2))
To split the names that I have into Surnames and First Names, since the surnames come first. But I cannot figure out how to reassign the split Invname into two lists, so that I can use only the surnames for the rest of my project. Right now I have this:
" [471] "KRUEGER" "MARCUS" "
And I would like to have the left side only assigned to a new variable, so that I can work further with mining the surnames for information.
Using the data in nate.edwinton's answer, there is no need to unlist.
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
String <- stringr::str_split(Invnames, "[ ]", n = 2)
Surnames <- sapply(String, '[', 1)
Firstnames <- sapply(String, '[', 2)
data.frame(Surnames, Firstnames)
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
As mentioned in the comments, it would be easier to help if you provided some data. Anyway, here might be a solution:
Assuming that Invnames is a vector where for every first name there is (exactly) one last name, you could do the following:
# data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
# extraction
String <- unlist(stringr::str_split(Invnames,"[ ]",n=2))
# saving first and last names
lastNames <- String[seq(1,length(String),2)]
firstNames <- String[seq(2,length(String),2)]
# yields
> cbind(lastNames,firstNames)
lastNames firstNames
[1,] "Krueger" "Markus"
[2,] "Doe" "John"
[3,] "Tatum" "Jayson"
Here is some sample data and a suggested solution. Data modified from @Rui Barradas' answer:
Invnames <- c("Krueger.$Markus","Doe.John","Tatum.Jayson")
sapply(strsplit(Invnames, "\\W"), "[", 1)
# [1] "Krueger" "Doe"     "Tatum"
Again using data from an earlier answer, with the tidyverse this time:
library(tidyverse)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
Invnames <- data.frame(Invnames)
Invnames %>%
  separate(Invnames, c('Surname', 'FirstName'), sep = " ")
Surname FirstName
1 Krueger Markus
2 Doe John
3 Tatum Jayson
With base R, we can make use of read.table/read.csv to separate the string into columns
read.table(text = Invnames, header = FALSE, col.names = c("Surnames", "Firstnames"))
# Surnames Firstnames
#1 Krueger Markus
#2 Doe John
#3 Tatum Jayson
data
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson")
If only names were so straightforward! If there are few complications between strings then yes, the other answers here are good options. In my experience with name lists, though, we get hyphenated names (in both "first" and "last"), middle names, titles and shortened name forms (Dr., Mr, Md), and many other variants. I first try to clean the strings before any splitting.
Here is just one idea using dplyr (explicit code provided for clarity)
Invnames <- c("Krueger Markus","Doe John","Tatum Jayson", "Taylor - Cline Jeff", "Davis - Freud Melvin- John")
df <- data.frame(Invnames, stringsAsFactors = FALSE) %>%
  mutate(Invnames2 = gsub("- ", "-", Invnames)) %>%
  mutate(Invnames2 = gsub(" -", "-", Invnames2)) %>%
  mutate(surname = gsub(" .*", "", Invnames2))
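Tracing the cleaning through by hand (my own check, worth re-running), the hyphenated cases should come out as single surnames:
df$surname
# [1] "Krueger"      "Doe"          "Tatum"        "Taylor-Cline" "Davis-Freud"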

R extract unit of measure and attached number from string, given a list of units of measure

I'm using R and I've got two character vectors:
measures <- c('cm', 'mm', 'ml')
strings <- c('hgtrdhg cm12 mhjgf','asdfsf 12mm jhgjhg','adadf 45ml','ml89 jygjgh', 'cm 59 gfhgfd')
I have to extract for each string the unit of measure and the associated number like:
cm12, 12mm, 45ml, ml89, cm59
(there was originally a space between cm and 59 in the last string)
The number can come either before or after the unit of measure.
We can loop over the 'measures' and extract the elements
library(dplyr)
library(stringr)
library(purrr)
measures %>%
  map(~ str_extract(strings, paste0("\\d*", .x, "\\s*\\d*"))) %>%
  do.call(coalesce, .) %>%
  str_replace_all(" ", "")
#[1] "cm12" "12mm" "45ml" "ml89" "cm59"
Or if we want to use all the 'measures' at once, paste them into a single pattern by collapsing with |:
pat <- paste0("(", paste("\\d*", measures, "\\s*\\d*", sep="", collapse="|"), ")")
str_replace_all(str_extract(strings, pat), " ", "")
#[1] "cm12" "12mm" "45ml" "ml89" "cm59"
Using base R:
alts <- paste0(measures, collapse = "|") # "cm|mm|ml"
m <- paste0(".*?(\\d+\\s*(", alts, ")|(", alts, ")\\s*\\d+).*")
sub(m, "\\1", strings)
[1] "cm12" "12mm" "45ml" "ml89" "cm 59"
which expands to:
sub(".*?(\\d+\\s*(cm|mm|ml)|(cm|mm|ml)\\s*\\d+).*", "\\1", strings)
[1] "cm12" "12mm" "45ml" "ml89" "cm 59"

Extract names and convert to email addresses in R

I have the following string: " John Andrew Thomas" (4 empty spaces before John) and I need to split and concatenate it so that my output is "John@gmail.com;Andrew@gmail.com;Thomas@gmail.com"; I also need to remove all whitespace.
My best guess is:
test = unlist(lapply(names, strsplit, split = " ", fixed = FALSE))
paste(test, collapse = "@gmail.com")
but I get this as an output:
"#gmail.com#gmail.com#gmail.com#gmail.comJohn#gmail.comAndrew#gmail.comThomas"
names <- " John Andrew Thomas"
test <- unlist(lapply(names, strsplit, split = " ", fixed = FALSE))
paste(test[test != ""], "@gmail.com", sep = "", collapse = ";")
A small tweak to your paste line will remove the extra spaces and separate the email addresses with a semicolon.
Output is the following:
[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
With stringr, we can use its str_trim function to deal with your leading whitespace; assuming your string is x:
library(stringr)
paste(sapply(str_split(str_trim(x), " "), function(i) sprintf("%s@gmail.com", i)), collapse = ";")
And here's a piped version, so it's easier to follow:
library(dplyr)
library(stringr)
x %>%
  # get rid of leading and trailing whitespace
  str_trim() %>%
  # make a list with the elements of the string, split at " "
  str_split(" ") %>%
  # get an array of strings where those list elements are added to a fixed chunk via sprintf
  sapply(., function(i) sprintf("%s@gmail.com", i)) %>%
  # concatenate the resulting array into a single string with semicolons
  paste(., collapse = ";")
Another approach, using the trimws function from base R:
paste0(unlist(strsplit(trimws(names), " ")), "@gmail.com", collapse = ";")
#[1] "John@gmail.com;Andrew@gmail.com;Thomas@gmail.com"
Data
names <- " John Andrew Thomas"
Another idea using stringi:
v <- " John Andrew Thomas"
paste0(stringi::stri_extract_all_words(v, simplify = TRUE), "@gmail.com", collapse = ";")
Which gives:
#[1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
You can use gsub(), and a little creativity.
x <- " John Andrew Thomas"
paste0(gsub(" ", "#gmail.com;", trimws(x)), "#gmail.com")
# [1] "John#gmail.com;Andrew#gmail.com;Thomas#gmail.com"
No packages, no loops, and no string splitting.
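One edge case (my note): a run of several spaces between names would turn each space into a separator with the pattern above; switching it to " +" covers that too:
x2 <- " John  Andrew Thomas" # made-up variant with a double space
paste0(gsub(" +", "@gmail.com;", trimws(x2)), "@gmail.com")
# [1] "John@gmail.com;Andrew@gmail.com;Thomas@gmail.com"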
