String identification in text files using regex in R

This is my first post on Stack Overflow and I'll try to explain my problem as succinctly as possible.
The problem is pretty simple. I'm trying to identify strings containing alphanumeric characters, and alphanumeric characters with symbols, and remove them. I looked at previous questions on Stack Overflow and found a solution that looks good.
https://stackoverflow.com/a/21456918/7467476
I tried the provided regex (slightly modified) in Notepad++ on some sample data just to see if it's working (and yes, it works). Then I proceeded to use the same regex in R with gsub to replace the string with "" (code given below).
replace_alnumsym <- function(x) {
  return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[_-])[A-Za-z0-9_-]{8,}", "", x, perl = TRUE))
}
replace_alnum <- function(x) {
  return(gsub("(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{8,}", "", x, perl = TRUE))
}
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
output1 <- sapply(sample, replace_alnum)
output2 <- sapply(sample, replace_alnumsym)
The code runs fine but the output still contains the strings; they haven't been removed. I'm not getting any errors when I run the code (output below). The output format is also strange: each element is printed twice (once without and once within quotes).
> output1
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh WQWEQtWe_232
"abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
The desired result would be:
> output1
abc def ghi abcd efgh WQWEQtWe_232
> output2
abc def ghi WQE34324Wweasfsdfs23234 abcd efgh
I think I'm probably overlooking something very obvious.
Appreciate any assistance that you can provide.
Thanks

Your outputs are not printing twice; they are being printed as named vectors. The unquoted line holds the element names and the quoted line is the output itself. You can see this by checking the length of an output:
length( sapply( sample, replace_alnum ) )
# [1] 2
So you know there are only 2 elements there.
If you want them without the names, you can unname the vector on output:
unname( sapply( sample, replace_alnum ) )
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
Alternatively, you can rename them something more to your liking:
output <- sapply( sample, replace_alnum )
names( output ) <- c( "name1", "name2" )
output
# name1 name2
# "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh WQWEQtWe_232"
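As a small sketch of where the names come from: sapply() uses a character input as the names of its result unless USE.NAMES = FALSE is passed. The vector `strings` and the use of toupper() below are illustrative stand-ins, not part of the question.

```r
# Illustrative input; toupper() stands in for any function applied per element
strings <- c("abc def", "ghi jkl")

# Default behaviour: the input strings become the names of the result
named <- sapply(strings, toupper)
names(named)

# Suppress the names at the source instead of calling unname() afterwards
plain <- sapply(strings, toupper, USE.NAMES = FALSE)
names(plain)  # NULL

# Functions such as toupper() and gsub() are already vectorised,
# so they can also be called on the whole vector without sapply()
vectorised <- toupper(strings)
```

Because gsub() is vectorised in the same way, calling it directly on the whole vector also avoids the names.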
As for the regex itself, it sounds like you want to apply it to each word separately. If so, and if you want the words back in their original strings afterwards, you need to split each string on spaces, apply the function, then recombine the pieces at the end.
# split by space (leaving results in separate list items for recombining later)
input <- sapply( sample, strsplit, split = " " )
# apply your function on each list item separately
output <- sapply( input, replace_alnumsym )
# recombine each list item as they looked at the start
output <- sapply( output, paste, collapse = " " )
output <- unname( output )
output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh "
And if you want to clean up the trailing white space:
output <- trimws( output )
output
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
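The split, filter, and recombine steps can also be wrapped in one helper. This is only a sketch: `remove_words` and `word_pattern` are hypothetical names, and the pattern is an anchored variant (with ^ and $ added) so that it tests one whole word at a time, which is an assumption about the intent rather than the code from the question.

```r
# Hypothetical helper: drop every space-separated word that matches word_pattern
remove_words <- function(x, word_pattern) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    keep <- !grepl(word_pattern, words, perl = TRUE)
    paste(words[keep], collapse = " ")
  }, character(1))
}

# Anchored pattern (assumption): a whole word of 8+ letters, digits, _ or -
# that contains a lowercase letter, an uppercase letter, a digit, and _ or -
alnumsym_word <- "^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[_-])[A-Za-z0-9_-]{8,}$"

sample_strings <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
result <- remove_words(sample_strings, alnumsym_word)
result
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
```

Dropping the matching words rather than replacing them with "" means no trailing space is left behind, so trimws() is not needed here.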

No idea if this regex-based approach is really fine, but it is possible if we assume that:
alnumsym "words" are non-whitespace chunks delimited with whitespace and start/end of string
alnum words are chunks of letters/digits separated with non-letter/digits or start/end of string.
Then, you may use
sample <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")
gsub("\\b(?=\\w*[a-z])(?=\\w*[A-Z])(?=\\w*\\d)\\w{8,}", "", sample, perl=TRUE) ## replace_alnum
gsub("(?<!\\S)(?=\\S*[a-z])(?=\\S*[A-Z])(?=\\S*[0-9])(?=\\S*[_-])[A-Za-z0-9_-]{8,}", "", sample, perl=TRUE) ## replace_alnumsym
See the R demo online.
Pattern 1 details:
\\b - a leading word boundary (we need to match a word)
(?=\\w*[a-z]) - (a positive lookahead) after 0+ word chars (\w*) there must be a lowercase ASCII letter
(?=\\w*[A-Z]) - an uppercase ASCII letter must be inside this word
(?=\\w*\\d) - and a digit, too
\\w{8,} - if all the conditions above matched, match 8+ word chars
Note that to avoid matching _ (\w matches _) you need to replace \w with [^\W_].
Pattern 2 details:
(?<!\\S) - (a negative lookbehind) no non-whitespace can appear immediately to the left of the current location (a whitespace or start of string should be in front)
(?=\\S*[a-z]) - after 0+ non-whitespace chars, there must be a lowercase ASCII letter
(?=\\S*[A-Z]) - the non-whitespace chunk must contain an uppercase ASCII letter
(?=\\S*[0-9]) - and a digit
(?=\\S*[_-]) - and either _ or -
[A-Za-z0-9_-]{8,} - if all the conditions above matched, match 8+ ASCII letters, digits or _ or -.
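Putting the two patterns together on the sample data might look like the sketch below. It uses the [^\W_] variant mentioned in the note for pattern 1, so that words containing _ are left alone, and adds trimws() to drop the space left behind; the variable names are illustrative.

```r
sample_strings <- c("abc def ghi WQE34324Wweasfsdfs23234", "abcd efgh WQWEQtWe_232")

# Pattern 1 with \w replaced by [^\W_], so the underscore is not a word character here
alnum_pat <- "\\b(?=[^\\W_]*[a-z])(?=[^\\W_]*[A-Z])(?=[^\\W_]*\\d)[^\\W_]{8,}"

# Pattern 2 unchanged
alnumsym_pat <- "(?<!\\S)(?=\\S*[a-z])(?=\\S*[A-Z])(?=\\S*[0-9])(?=\\S*[_-])[A-Za-z0-9_-]{8,}"

# Remove the matches, then trim the whitespace they leave at the ends
out_alnum <- trimws(gsub(alnum_pat, "", sample_strings, perl = TRUE))
out_alnumsym <- trimws(gsub(alnumsym_pat, "", sample_strings, perl = TRUE))

out_alnum
# [1] "abc def ghi"            "abcd efgh WQWEQtWe_232"
out_alnumsym
# [1] "abc def ghi WQE34324Wweasfsdfs23234" "abcd efgh"
```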

Related

Replace matched patterns in a string based on condition

I have a text string containing digits, letters and spaces. Some of its substrings are month abbreviations. I want to perform a condition-based pattern replacement, namely to enclose a month abbreviation in whitespace if and only if a given condition is fulfilled. As an example, let the condition be as follows: "preceded by a digit and followed by a letter".
I tried stringr package but I fail to combine the functions str_replace_all() and str_locate_all():
# Input:
txt = "START1SEP2 1DECX JANEND"
# Desired output:
# "START1SEP2 1 DEC X JANEND"
# (A) What I could do without checking the condition:
library(stringr)
patt_month = paste("(", paste(toupper(month.abb), collapse = "|"), ")", sep='')
str_replace_all(string = txt, pattern = patt_month, replacement = " \\1 ")
# "START1 SEP 2 1 DEC X JAN END"
# (B) But I actually only need replacements inside the condition-based bounds:
str_locate_all(string = txt, pattern = paste("[0-9]", patt_month, "[A-Z]", sep=''))[[1]]
# start end
# [1,] 12 16
# To combine (A) and (B), I'm currently using an ugly for() loop not shown here and want to get rid of it

You are looking for lookarounds:
(?<=\d)DEC(?=[A-Z])
See a demo on regex101.com.
Lookarounds make sure a certain position is matched without consuming any characters. A lookbehind checks what comes immediately before the current position, and a lookahead checks what follows it. Each comes in a positive and a negative form, giving four types in total (positive/negative lookbehind/lookahead).
A short memo:
(?=...) is a pos. lookahead
(?!...) is a neg. lookahead
(?<=...) is a pos. lookbehind
(?<!...) is a neg. lookbehind
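Applied to the question's input in R, the lookaround idea could be sketched as follows. Generalising from DEC to all month abbreviations and using base gsub() with perl = TRUE are assumptions made for this example; `months_alt` and `pat` are illustrative names.

```r
txt <- "START1SEP2 1DECX JANEND"

# Build "JAN|FEB|...|DEC" from the built-in month.abb constant
months_alt <- paste(toupper(month.abb), collapse = "|")

# A month that has a digit immediately before it and an uppercase letter immediately after it
pat <- paste0("(?<=[0-9])(", months_alt, ")(?=[A-Z])")

# Lookarounds need the PCRE engine, hence perl = TRUE
result <- gsub(pat, " \\1 ", txt, perl = TRUE)
result
# [1] "START1SEP2 1 DEC X JANEND"
```

SEP is left alone because it is followed by a digit, and JAN because it is preceded by a space.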

A Base R version
patt_month <- paste(toupper(month.abb), collapse = "|") # join all month abbreviations with OR
pat <- paste0("(\\s\\d)(", patt_month, ")([A-Z]\\s)") # three capture groups: space + digit, month, letter + space
gsub(pattern = pat, replacement = "\\1 \\2 \\3", txt, perl = TRUE) # gives the desired output
Also works for txt2 <- "START1SEP2 1JANY JANEND" out of the box.
[1] "START1SEP2 1 JAN Y JANEND"

How to pass multiple necessary patterns to str_subset?

I am trying to find elements in a character vector that match two words in no particular order, not just any single one of them, using the stringr::str_subset function. In other words, I'm looking for the intersection, not the union of the two words.
I tried using the "or" (|) operator but this only gives me either one of the two words and returns too many results. I also tried just passing a character vector with the two words as the pattern argument. This just returns the error that "longer object length is not a multiple of shorter object length" and only returns the values that match the second one of the two words.
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")
str_subset(character_vector, pattern)
I'm looking for the pattern that will return only the second element of the character vector, i.e. "pqr abc def".

An option is str_detect. Loop over the 'pattern', check if both the 'pattern' elements match with the 'character_vector' (&), use the logical vector to extract the element from the 'character_vector'
library(tidyverse)
map(pattern, str_detect, string = character_vector) %>%
  reduce(`&`) %>%
  magrittr::extract(character_vector, .)
#[1] "pqr abc def"
Or using str_subset
map(pattern, str_subset, string = character_vector) %>%
  reduce(intersect)
#[1] "pqr abc def"
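For comparison, the same "detect each pattern, then AND the results" idea can be sketched in base R without the tidyverse. Using fixed = TRUE, so the patterns are treated as literal text, is an assumption that fits the plain words in this example; `hits` and `all_match` are illustrative names.

```r
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")

# One logical vector per pattern: does each element contain it?
hits <- lapply(pattern, grepl, x = character_vector, fixed = TRUE)

# Combine the logical vectors element-wise with AND
all_match <- Reduce(`&`, hits)

character_vector[all_match]
# [1] "pqr abc def"
```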

You can do this in base R without a loop by using a regular expression. The code looks like this:
character_vector[grepl(paste0("(?=.*",pattern,")",collapse = ""), character_vector, perl = TRUE)]
paste0() builds one lookahead per pattern, and grepl() returns TRUE for each element that satisfies all of them; that logical vector is then used to subset character_vector.
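To make the mechanism visible, the sketch below prints the combined pattern before using it. The intermediate variable `combined` is added for illustration only.

```r
character_vector <- c("abc ghi jkl mno def", "pqr abc def", "abc jkl pqr")
pattern <- c("def", "pqr")

# One positive lookahead per word, glued into a single regex
combined <- paste0("(?=.*", pattern, ")", collapse = "")
combined
# [1] "(?=.*def)(?=.*pqr)"

# Each lookahead must succeed from the start of the string, in any order
matched <- grepl(combined, character_vector, perl = TRUE)
matched
# [1] FALSE  TRUE FALSE

character_vector[matched]
# [1] "pqr abc def"
```

Note that the words are inserted into the regex as they are, so any regex metacharacters in them would need escaping first.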

As you are looking for the intersection, you can use intersect() and state the two patterns you are looking for explicitly:
pattern_1 <- 'pqr'
pattern_2 <- 'def'
intersect(
  str_subset(character_vector, pattern_1),
  str_subset(character_vector, pattern_2)
)

Will this work?
character_vector %>% purrr::reduce(pattern, str_subset, .init = . )
[1] "pqr abc def"

Get characters after and before a pattern match in R

Let's say I have a string in R:
str <- "abc abc cde cde"
and I use regmatches and gregexpr to find how many "b"s there are in my string
regmatches(str, gregexpr("b",str))
but I want an output of every word that contains the letter b.
So an output like: "abc", "abc".
Thank you!

tmp <- "abc abc cde cde"
Split the string up into separate elements, grep for "b", return elements:
grep("b", unlist(strsplit(tmp, split = " ")), value = TRUE)

Look for non-space before and after, something like:
regmatches(str, gregexpr("\\S*b\\S*", str))
# [[1]]
# [1] "abc" "abc"
Special regex characters are documented in ?regex. For this case, \\s matches "any space-like character", and \\S is its negation, so any non-space-like character. You could be more specific, such as \\w ('word' character, same as [[:alnum:]_]). The * means zero-or-more, and + means one-or-more (forcing something).
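The difference between \\S and \\w only shows up when punctuation is attached to a word. The input string below is made up for illustration and is not from the question.

```r
# Illustrative input with punctuation attached to the words
x <- "abc, b-side cde"

# \\S* extends over any non-space characters, including "," and "-"
non_space <- regmatches(x, gregexpr("\\S*b\\S*", x))[[1]]
non_space
# [1] "abc,"   "b-side"

# \\w* extends over letters, digits and underscore only
word_chars <- regmatches(x, gregexpr("\\w*b\\w*", x))[[1]]
word_chars
# [1] "abc" "b"
```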

I suppose you mean you want to find words that contain b. One regex that does this is
\w*b\w*
\w* matches 0 or more word characters, which is a-z, A-Z, 0-9, and the underscore character.
Demo

Here is a base R option using strsplit and grepl:
str <- "abc abc cde cde"
words <- strsplit(str, "\\s+")[[1]]
idx <- sapply(words, function(x) { grepl("b", x)})
matches <- words[idx]
matches
[1] "abc" "abc"

Substitute word with same word without initial # in R

I am trying to do a string substitution on a data frame column in R. I need to find all the words preceded by '#' (without a space, e.g. #word) and change the '#' to '!' (e.g. from #word to !word). At the same time, it should leave the other instances of '#' intact (e.g. # or ## or #[#]). For example, this is my original data frame (to change: #def, #jkl, #stu):
> df = data.frame(number = 1:4, text = c('abc #def ghi', '#jkl # mno', '#[#] pqr #stu', 'vwx ### yz'))
> df
number text
1 1 abc #def ghi
2 2 #jkl # mno
3 3 #[#] pqr #stu
4 4 vwx ### yz
And this is what I need it to look like:
> df_result = data.frame(number = 1:4, text = c('abc !def ghi', '!jkl # mno', '#[#] pqr !stu', 'vwx ### yz'))
> df_result
number text
1 1 abc !def ghi
2 2 !jkl # mno
3 3 #[#] pqr !stu
4 4 vwx ### yz
I have tried with
> gsub('#.+[a-z] ', '!', df$text)
[1] "abc !ghi" "!# mno" "!#stu" "vwx ### yz"
But the result is not the desired one. Any help is much appreciated.
Thank you.

How about
gsub("(^| )#(\\w)", "\\1!\\2", df$text)
# [1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
This matches an # symbol at beginning of a string, or after a space. Then, we capture the word character after the # symbol, and replace # with !.
Explanation courtesy of regex101.com:
(^| ) is the 1st Capturing Group; ^ asserts position at start of the string; | denotes "or"; blank space matches the space character literally
# matches the character # literally (case sensitive)
(\\w) is the 2nd Capturing Group, it denotes a word character
The replacement string \\1!\\2 replaces the regular expression match with the first capturing group (\\1), followed by !, followed by the second capturing group (\\2).

You can use a positive lookahead (?=...)
gsub("#(?=[A-Za-z])", "!", df$text, perl = TRUE)
[1] "abc !def ghi" "!jkl # mno" "#[#] pqr !stu" "vwx ### yz"
From the "Regular Expressions as used in R" documentation page:
Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed.
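The capture-group pattern and the lookahead pattern give the same result on the question's data, but they are not equivalent. The extra inputs below are made up to show where they differ: the first pattern only changes a # at the start of the string or after a space, while the second changes any # that is directly followed by a letter.

```r
# Illustrative edge cases, not from the question
edge_cases <- c("#word", "a#b", "##word")

# Capture-group version: # must be at the start or after a space
by_groups <- gsub("(^| )#(\\w)", "\\1!\\2", edge_cases)
by_groups
# [1] "!word"  "a#b"    "##word"

# Lookahead version: any # directly followed by a letter
by_lookahead <- gsub("#(?=[A-Za-z])", "!", edge_cases, perl = TRUE)
by_lookahead
# [1] "!word"  "a!b"    "#!word"
```

Which behaviour is wanted for inputs like a#b depends on the data, so it is worth checking both against real examples.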

R retrieving strings with sub: Why this does not work?

I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""
Now, I would like first to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I don't succeed in extracting the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.

Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures ", 1+ letters and spaces, " into Group 1 and \1 in the replacement puts it back.
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
See the R demo
The sub here will find and replace just once, and it will match the string start (with ^) followed with 1+ chars other than " with [^"]+ negated character class.

To get this to work with sub, you have to match the whole string. The help file says
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, prepend the sometimes risky catch-all ".*":
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
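If the goal is extraction rather than substitution, regmatches() with regexpr() returns just the matched text, so nothing has to be matched and thrown away. This is an alternative sketch, not the approach from either answer; the pattern '"[^"]*"' assumes the quoted part contains no embedded quotes.

```r
x <- 'ab/cd efgh "xyz xyz"'

# Extract the first double-quoted chunk, quotes included
quoted <- regmatches(x, regexpr('"[^"]*"', x))
quoted
# [1] "\"xyz xyz\""

# Strip the surrounding quotes if only the inner text is wanted
inner <- gsub('"', "", quoted, fixed = TRUE)
inner
# [1] "xyz xyz"
```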
