I have a text field that contains email addresses, for which I made the pattern below:
library(stringr)
str_extract_all(Data, "([a-zA-Z0-9.-])+@([a-zA-Z0-9.-])")
It works perfectly and detects everything. However, I need to exclude emails from certain domains, like gmail.com. For instance, I don't want emails that end with @gmail.com.
Using the not symbol (^), I should be able to achieve this; however, I have no idea why I got stuck after several trials of adding ^gmail.com to my pattern.
Here are a couple of obvious ways to go, starting from...
x = c("Espanta#gmail.com","Frank#notgmail.com","Jaap#gmail.com.com")
baddoms = c("gmail.com","yahoo.com")
filter first...
str_split_fixed(x[grep(paste0("@(",paste(baddoms,collapse="|"),")$"), x, invert=TRUE)], "@", 2)
# [,1] [,2]
# [1,] "Frank" "notgmail.com"
# [2,] "Jaap" "gmail.com.com"
... or filter afterwards ...
y = str_split_fixed(x, "@", 2)
y[!(y[,2] %in% baddoms),]
# [,1] [,2]
# [1,] "Frank" "notgmail.com"
# [2,] "Jaap" "gmail.com.com"
As far as code complexity and computational time go, the second approach is much better. One could argue that the first saves RAM, but I really doubt that would be a problem in practice.
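If you want to sanity-check the timing claim on your own machine, here is a minimal sketch with base system.time(), scaling the toy vector x up (the factor is arbitrary):
big <- rep(x, 1e5)
pat <- paste0("@(", paste(baddoms, collapse = "|"), ")$")
system.time(str_split_fixed(big[grep(pat, big, invert = TRUE)], "@", 2)) # filter first
system.time({y <- str_split_fixed(big, "@", 2); y[!(y[, 2] %in% baddoms), ]}) # filter afterwards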
The OP's idea of using ^gmail.com does not work because ^ has two uses in regex:
identifying the start of the string; and
negating characters inside a character class [^...].
To exclude entire strings, negative lookaheads and lookbehinds are handy, but I know of no way to (1) extract parts from a string and (2) filter the results in a single step.
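For instance, here is a hedged sketch of the filtering half with a negative lookahead, using base grepl() with perl = TRUE on the same toy data (the extraction still needs its own step):
keep <- grepl(paste0("@(?!(", paste(baddoms, collapse = "|"), ")$)"), x, perl = TRUE)
str_split_fixed(x[keep], "@", 2)
#      [,1]    [,2]
# [1,] "Frank" "notgmail.com"
# [2,] "Jaap"  "gmail.com.com"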
Related
Suppose I want to extract all letters between the letters a and c. So far I've been using the stringr package, which gives a clear idea of the full matches and the groups. For example, the package would give the following.
library(stringr)
str_match_all("abc", "a([a-z])c")
# [[1]]
# [,1] [,2]
# [1,] "abc" "b"
Suppose I only want to replace the group, and not the full match---in this case the letter b. The following would, however, replace the full match.
str_replace_all("abc", "a([a-z])c", "z")
# [1] "z"
# Desired result: "azc"
Would there be any good way to replace only the capture group? Suppose I wanted to do multiple matches:
str_match_all("abcdef", "a([a-z])c|d([a-z])f")
# [[1]]
# [,1] [,2] [,3]
# [1,] "abc" "b" NA
# [2,] "def" NA "e"
str_replace_all("abcdef", "a([a-z])c|d([a-z])f", "z")
# [1] "zz"
# Desired result: "azcdzf"
Matching groups was easy enough, but I haven't found a solution when a replacement is desired.
It is not the way regex was designed: capturing is a mechanism for getting the parts of strings you need, and in replacements it is used to keep parts of matches, not to discard them.
Thus, a natural solution is to wrap what you need to keep in capturing groups.
In this case, use
str_replace_all("abc", "(a)[a-z](c)", "\\1z\\2")
Or with lookarounds (if the lookbehind is a fixed/known width pattern):
str_replace_all("abc", "(?<=a)[a-z](?=c)", "z")
Usually when I want to replace a certain pattern of characters in a text/string I use the grep family of functions; this is what we call working with regular expressions.
You can use the sub function from the grep family to make replacements in strings.
Example:
sub("b","z","abc")
[1] "azc"
You may face more challenges when working with replacements; for those, the grep family offers plenty of functionality:
replacing all characters except a and c with a character of your choice:
sub("[^ac]+","z","abBbbbc")
[1] "azc"
replacing the first match of two consecutive b's:
sub("b{2}","z","abBbbbc")
[1] "abBzbc"
replacing the pattern and everything after it:
sub("b.*","z","abc")
[1] "az"
the same as above, except that the match must end with a character other than c (no match in "abc", so the string is returned unchanged):
sub("b.*[^c]","z","abc")
[1] "abc"
So on...
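For instance, gsub is the companion to sub that replaces every match, not just the first:
gsub("b", "z", "abcb")
[1] "azcz"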
You can search for "regular expressions in R using grep" on the internet to find many more ways of working with regular expressions.
I'm writing a little script in R to check if an email address has a valid domain extension. I've read in and vectorized a file of the current valid extensions, e.g. com, uk, biz, etc. Let's say for example:
valid_domain_extensions <- c('com', 'biz', 'de', 'uk')
Then I've got a list of matrices of 100,000 captured emails, which were intentionally written in an obfuscatory way, e.g. name [at] domain /dot/ biz. The matrices are from a str_match_all regex pattern, with subgroups as columns.
(edited to add here:)
So the input would be a list of matrices that look like this:
     [,1]                            [,2]   [,3] [,4]            [,5]  [,6]
[1,] "name at stackoverflow dot com" "name" "at" "stackoverflow" "dot" "com"
What I want to do is check all 100,000 of the subgrouped columns (i.e., all the [,6] values from my input list that captured the domain extensions) to see whether they equal, or at least contain, one of the strings from the domain extension vector, for validation, and then spit out a canonicalized address.
Is there a more R-otic way to do it than my attempt here? It works, but it seems kind of bulky and gross.
validationFunction <- function(x) {
  y <- x[, 6]
  z <- any(sapply(y, grepl, valid_domain_extensions)) # valid_domain_extensions is a long vector
  if (z) {
    return(paste(x[, 2], '@', x[, 4], '.', x[, 6], sep = "", collapse = NULL))
  } else {
    return("Invalid Email Address")
  }
}
final_list_of_emails <- lapply(tokenized_rough_emails, validationFunction)
print(final_list_of_emails)
Thanks.
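One fully vectorized way to approach this (a sketch, assuming every matrix row has the six-column layout shown above and that an exact match on the extension column is acceptable, unlike the partial grepl() matching in the question):
m  <- do.call(rbind, tokenized_rough_emails) # stack all the capture matrices
ok <- m[, 6] %in% valid_domain_extensions    # exact test on the extension column
final_emails <- ifelse(ok,
                       paste0(m[, 2], "@", m[, 4], ".", m[, 6]),
                       "Invalid Email Address")
This drops the per-element sapply() loop entirely, which is usually where the time goes (the result is a character vector rather than a list).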
I'm sure I used to know this, and I'm sure it's covered somewhere, but since I can't find any Google/SO hits for this title search, there probably should be one.
I want to split a string without using regex, e.g.
str = "abcx*defx*ghi"
Of course we can use stringr::str_split or strsplit with argument 'x[*]', but how can we just suppress regex entirely?
The argument fixed=TRUE can be useful in this instance:
strsplit(str, "x*", fixed=TRUE)[[1]]
#[1] "abc" "def" "ghi"
Since the question also mentions stringr::str_split, a stringr way might be of help, too.
You may use str_split with fixed(<YOUR_DELIMITER_STRING_HERE>, ignore_case = FALSE) or coll(pattern, ignore_case = FALSE, locale = "en", ...). See the stringr docs:
fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.
coll: Compare strings respecting standard collation rules.
See the following R demo:
> str_split(str, fixed("x*"))
[[1]]
[1] "abc" "def" "ghi"
Collations are better illustrated with a letter that can have two representations:
> x <- c("Str1\u00e1Str2", "Str3a\u0301Str4")
> str_split(x, fixed("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3áStr4" ""
> str_split(x, coll("\u00e1"), simplify=TRUE)
[,1] [,2]
[1,] "Str1" "Str2"
[2,] "Str3" "Str4"
A note about fixed():
fixed(x) only matches the exact sequence of bytes specified by x. This is a very limited “pattern”, but the restriction can make matching much faster. Beware using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define “á”: either as a single character or as an “a” plus an accent.
...
coll(x) looks for a match to x using human-language collation rules, and is particularly important if you want to do case insensitive matching. Collation rules differ around the world, so you’ll also need to supply a locale parameter.
Simply wrap the pattern inside fixed() to stop it from being treated as a regex inside stringr::str_split().
Example
Normally, stringr::str_split() will treat the pattern as a regular expression, meaning certain characters have special meanings; this can cause errors if the pattern is not a valid regular expression, e.g.:
library(stringr)
str_split("abcdefg[[[klmnop", "[[[")
Error in stri_split_regex(string, pattern, n = n, simplify = simplify, :
Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)
But if we simply wrap the pattern we are splitting by inside fixed(), it is treated as a string literal rather than a regular expression:
str_split("abcdefg[[[klmnop", fixed("[[["))
[[1]]
[1] "abcdefg" "klmnop"
I have a data.frame with a single column, "Term". Each entry can be a string of multiple words; every term contains at least two words, with no upper limit.
From this column "Term", I would like to extract the last word and store it in a new column "Last".
# load library
library(dplyr)
library(stringi)
# read csv
df <- read("filename.txt",stringsAsFactors=F)
# show df
head(df)
# Term
# 1 this is for the
# 2 thank you for
# 3 the following
# 4 the fact that
# 5 the first
I have prepared a function, LastWord, which works well when a single string is given.
However, when a vector of strings is given, it only works on the first string in the vector. This has forced me to use mapply with mutate to add the column, as seen below.
LastWord <- function(InputWord) {
stri_sub(InputWord,stri_locate_last(str=InputWord, fixed=" ")[1,1]+1, stri_length(InputWord))
}
df <- mutate(df, Last=mapply(LastWord, df$Term))
Using mapply makes the process very slow: I generally need to process around 10 to 15 million lines or terms at a time, and it takes hours.
Could anyone suggest a way to create the LastWord function that works with vector rather than a string?
You can try:
df$LastWord <- gsub(".* ([^ ]+)$", "\\1", df$Term)
df
# Term LastWord
# 1 this is for the the
# 2 thank you for for
# 3 the following following
# 4 the fact that that
# 5 the first first
In the gsub call, the expression between the parentheses matches anything that is not a space, one or more times (instead of [^ ]+, [a-zA-Z]+ could work too), at the end of the string ($). Because it is between parentheses, the expression is captured and can be referenced with \\1, so gsub keeps only the captured part as the replacement.
EDIT:
As @akrun mentioned in the comments, sub can also be used instead of gsub in this case.
To extract the last word only, you can use a vectorized function from stringi directly which should be very fast
library(stringi)
df$LastWord <- stri_extract_last_words(df$Term)
Now if you want two new columns, one containing all words but the last and another one containing the last words, you can use some regular expression like
stri_match(df$Term, regex= "([\\w*\\s]*)\\s(\\w*)")
# [,1] [,2] [,3]
# [1,] "this is for the" "this is for" "the"
# [2,] "thank you for" "thank you" "for"
# [3,] "the following" "the" "following"
# [4,] "the fact that" "the fact" "that"
# [5,] "the first" "the" "first"
So what you want is
df[c("ExceptLast", "LastWord")] <-
stri_match(df$Term, regex= "([\\w*\\s]*)\\s(\\w*)")[, 2:3]
(Note that this won't work if df$Term contains only one word. In that case you will need to modify the regular expression, depending on which column you want it to be included in.)
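One hedged way to cover the one-word case is to make the leading part optional, so a lone word lands in the LastWord column and ExceptLast becomes NA:
stri_match(c("the first", "first"), regex = "(?:([\\w\\s]*)\\s)?(\\w+)")
#      [,1]        [,2]  [,3]
# [1,] "the first" "the" "first"
# [2,] "first"     NA    "first"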
I wish to extract, using R, all unique substrings of text from a text file that adhere to the form "matrixname[rowname,column number]". I have had only limited success with grep and str_extract_all (stringr), in the sense that they return the entire line and not just the matching substring. Trying to remove the unwanted text using gsub has also been unsuccessful. Here is an example of the code I have been using.
#Read in file
txt<-read.table("Project_R_code.R")
#create new object to create lines that contain this pattern
txt2<-grep("param\\[.*1\\]",txt$V1, value=TRUE)
#remove all text that does not match the above pattern
gsub("[^param\\[.*1\\]]","", txt2,perl=TRUE)
The second line works (but again gives me whole lines rather than just the matching substrings). However, the gsub code for removing non-matching text keeps the lines and turns them into something like this:
[200] "[p.p]param[ama1]param[ama11]*[r1]param[ama1]...
and I have no idea why. I realise this method of paring each line down into something more manageable is tedious, but it's the only way I know to get at the patterns.
Ideally, R would spit out a list of all the (unique) substrings it finds in the text file that match my pattern, but I don't know the command. Any help on this is much appreciated.
If you'd like to extract individual components, try str_match:
test <- c("aaa[name1,1]", "bbb[name2,3]", "ccc[name3,3]")
stringr::str_match(test, "([a-zA-Z0-9_]+)[[]([a-zA-Z0-9_]+),.*?(\\d+)\\]")
## [,1] [,2] [,3] [,4]
## [1,] "aaa[name1,1]" "aaa" "name1" "1"
## [2,] "bbb[name2,3]" "bbb" "name2" "3"
## [3,] "ccc[name3,3]" "ccc" "name3" "3"
Otherwise, use str_extract.
Note that to match [ in ERE/TRE we use a set containing a single [ character, i.e. [[].
Moreover, if you have many matches in a single string, use str_match_all or str_extract_all.
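To get back to the original goal of a list of unique matching substrings, a sketch along these lines should work (readLines() avoids read.table()'s parsing; the optional space after the comma is an assumption about the file's formatting):
lines <- readLines("Project_R_code.R")
hits <- stringr::str_extract_all(lines, "[a-zA-Z0-9_]+\\[[a-zA-Z0-9_]+, ?\\d+\\]")
unique(unlist(hits))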