Remove pattern that occurs outside of words - r

I am trying to remove the pattern 'SO' from the end of a character vector. The issue I run into with the code below is that it matches any 'SO' sequence case insensitively and ends up removing the whole string instead of only the last occurrence. One workaround I considered was to clean the data manually, forcing everything to lower case except the final 'SO', and then matching case sensitively.
library(dplyr)
library(stringi)
x <- data.frame(y = c("Solutions are welcomed, please SO # 12345"))
x <- x %>% mutate(y = stri_replace_last_regex(x$y,"SO.*","",case_insensitive = TRUE)) # This will remove the string entirely - I'm not really sure why.
The desired output is:
'Solutions are welcomed, please'
I have tried variations of regex that look like \\b\\SO{2}\\b and \\b\\D{2}*\\b|[[:punct:]]. I believe the answer could lie in setting word boundaries, but I am not sure. The second one gets rid of the SO, but I suspect that if the letters "so" appear in sequence elsewhere, apart from inside words, they would get removed as well. I just need the last occurrence of SO and everything after it, including punctuation, to be removed from the string.
Any guidance on this would be much appreciated.

You can use gsub to remove the pattern you don't want.
gsub("\\sSO.+$", "", x$y)
[1] "Solutions are welcomed, please"
Use [[:upper:]]{2} if you want to generalise to any two consecutive upper case letters.
gsub("\\s[[:upper:]]{2}.+$", "", x$y)
[1] "Solutions are welcomed, please"
UPDATE: the code above removes everything from the first " SO" onward, so it gives the wrong result if the string contains more than one "SO".
To demonstrate, I have created another string with multiple "SO". Here, (.*) greedily captures everything from the start of the string (^) up to the last occurrence of "SO", and SO.+$ matches that last "SO" together with everything after it. Replacing the entire match with the first capture group (\\1) therefore removes the last "SO" and everything that follows. Note that the space before the last "SO" is kept; wrap the call in trimws() if it is unwanted.
x <- data.frame(y = "Solutions are SO welcomed, SO please SO # 12345")
gsub('^(.*)SO.+$', '\\1', x$y)
[1] "Solutions are SO welcomed, SO please "
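A minimal base-R sketch of the two points above, assuming the same example strings. It shows why a case-insensitive SO.* wipes out the whole string (it already matches the "So" of "Solutions"), and how a greedy capture group anchored on a whitespace-delimited, case-sensitive SO keeps everything before the last occurrence without the trailing space. The vector name y, the \\b word boundary, and perl = TRUE are choices made for this illustration only.

```r
y <- c("Solutions are welcomed, please SO # 12345",
       "Solutions are SO welcomed, SO please SO # 12345")

# Case-insensitive: "SO.*" matches from the "So" in "Solutions",
# so the first match already covers the entire string.
whole_gone <- sub("SO.*", "", y, ignore.case = TRUE)

# Greedy (.*) runs up to the last whitespace-delimited "SO";
# \\s+ keeps the space before it out of the captured text.
trimmed <- sub("^(.*)\\s+SO\\b.*$", "\\1", y, perl = TRUE)
print(trimmed)
```

Because the match is case sensitive and requires whitespace before SO, words such as "Solutions" or "ALSO" are left alone.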

library(dplyr)
library(stringr)
x %>%
  mutate(y = str_replace_all(y, 'SO.*', ''))
or
library(dplyr)
library(stringr)
x %>%
  mutate(y = str_replace_all(y, 'SO\\s\\#\\s\\d*', ''))
output:
y
1 Solutions are welcomed, please

Related

Extract a part of a changeable string

I have a simple but yet complicated question (at least for me)!
I would like to extract a part of a string like in this example:
From this string:
name <- "C:/Users/admin/Desktop/test/plots/"
To this:
name <- "test/plots/"
The twist in my problem is that the names change. So it's not always "test/plots/"; it could be "abc/ccc/" or "m.project/plots/" and so on.
I imagine finding the last two "/" in the string and cutting out the text around them, but I have no idea how to do it!
Thank you for your help and time!
Without regex
Use str_split to split your path by /. Because the path ends in /, the last piece is an empty string, so reverse the pieces, take the first three (the empty string plus the last two folder names), reverse them back, and paste them together with / using the collapse argument.
library(stringr)
name <- "C:/Users/admin/Desktop/m.project/plots/"
paste0(rev(rev(str_split(name, "/", simplify = TRUE))[1:3]), collapse = "/")
[1] "m.project/plots/"
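The same idea can be sketched in base R without stringr. The helper name last_two_dirs is hypothetical, and the sketch assumes forward-slash separators as in the question; it drops empty pieces, keeps the last two components, and restores the trailing slash.

```r
# Hypothetical helper: keep the last two path components.
last_two_dirs <- function(path) {
  vapply(strsplit(path, "/", fixed = TRUE), function(parts) {
    parts <- parts[nzchar(parts)]                  # drop empty pieces
    paste0(paste(utils::tail(parts, 2), collapse = "/"), "/")
  }, character(1))
}

paths <- c("C:/Users/admin/Desktop/test/plots/",
           "C:/Users/admin/Desktop/m.project/plots/")
result <- last_two_dirs(paths)
print(result)
```

Working on the split pieces rather than on character positions means the folder names can have any length.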
With regex
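Since your path could contain letters, numbers, or symbols, [^/]+/[^/]+/$ might be better: [^/]+ matches any run of characters other than /, so the pattern picks out the last two /-terminated components at the end of the string.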
library(stringr)
str_extract(name, "[^/]+/[^/]+/$")
[1] "m.project/plots/"
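For reference, a base-R sketch of the same extraction using regexpr and regmatches. The second path is a made-up example, added only to show that digits, underscores, upper case letters, and spaces inside folder names are handled.

```r
name <- c("C:/Users/admin/Desktop/m.project/plots/",
          "C:/Users/admin/Desktop/Test_01/My plots/")  # second path is hypothetical

# regexpr locates the match; regmatches extracts the matched text
last_two <- regmatches(name, regexpr("[^/]+/[^/]+/$", name))
print(last_two)
```

Note that regmatches silently drops elements without a match, so the result can be shorter than the input if some paths do not end in /.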
With {stringr}, assuming the path comprises folders with lower case letters and dots only. You could adjust the alternatives in the square brackets as required; for example, if directory names include a mix of upper and lower case letters, use [.A-Za-z] (not [.A-z], which also matches the punctuation characters that sit between Z and a).
Check a regex reference for options:
name <- c("C:/Users/admin/Desktop/m.project/plots/",
"C:/Users/admin/Desktop/test/plots/")
library(stringr)
str_extract(name, "[.a-z]+/[.a-z]+/$")
#> [1] "m.project/plots/" "test/plots/"
Created on 2022-03-22 by the reprex package (v2.0.1)
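A short base-R check of how character ranges behave, as a sketch. With perl = TRUE ranges follow code-point order, so a range written as A-z also covers the six characters between "Z" and "a" ([ \ ] ^ _ and the backtick), whereas A-Za-z covers letters only. The test characters are chosen for illustration.

```r
chars <- c("_", "^", "a", "Z")

# A-z spans every code point from "A" to "z", punctuation included
wide_range <- grepl("^[A-z]$", chars, perl = TRUE)

# A-Za-z lists the two letter ranges explicitly
letters_only <- grepl("^[A-Za-z]$", chars, perl = TRUE)

print(rbind(wide_range, letters_only))
```

If folder names may contain underscores or digits, it is clearer to list them in the class explicitly than to rely on a wide range.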

Regular Expression to find a string containing specific substring between two delimiters in R

I need to extract a string with multiple variations in between two comma delimiters.
What these strings have in common is that they contain "LED".
Possible variations include "W-LED", "OLED", "Edge LED (Local Dimming)", "Direct LED", but are not limited to those.
I want to extract the whole substring between the delimiters, with the commas removed. The strings are in a column inside a data frame. Two examples:
ori_col <- c(
"Display: 27 in, VA, Viewing angles (H/V): 170 / 160, W-LED, 1920 x 1080 pixels",
"Display: 21.5 in, VA, Edge LED (Local Dimming), 1920 x 1080 pixels"
)
df <- as.data.frame(ori_col)
What I want to extract
"W-LED"
"Edge LED (Local Dimming)"
So I plan to mutate a new column to extract the values from the original column using regex.
df %>% mutate(new_column = str_extract(ori_col, "regex"))
I figure it must use something like lookaheads and lookbehinds but have no idea how to write the in between regex.
df %>% mutate(new_column = str_extract(ori_col, "(?<=\\,)(what should I write here)(?=\\,)"))
This question is derived from my previously overcomplicated question parsing to multiple columns if you want to understand more.
If a single value without commas on the left and right should also be valid, you can match LED surrounded on both sides by any characters except a comma, using a negated character class [^,]*
[^,]*LED[^,]*
See a regex demo.
df %>% mutate(new_column = trimws(str_extract(ori_col, "[^,]*LED[^,]*")))
If the commas should be present, you can use the lookarounds (note that you don't have to escape the commas in the pattern):
df %>% mutate(new_column = trimws(str_extract(ori_col, "(?<=,)[^,]*LED[^,]*(?=,)")))
Output
new_column
1 W-LED
2 Edge LED (Local Dimming)
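A base-R sketch of the negated-character-class approach, which also keeps the result aligned with the input when a row has no LED field. The third string is a made-up row added to show the NA case, and the variable names m and led are illustrative.

```r
ori_col <- c(
  "Display: 27 in, VA, Viewing angles (H/V): 170 / 160, W-LED, 1920 x 1080 pixels",
  "Display: 21.5 in, VA, Edge LED (Local Dimming), 1920 x 1080 pixels",
  "Display: 24 in, IPS, 1920 x 1200 pixels"   # hypothetical row without LED
)

m <- regexpr("[^,]*LED[^,]*", ori_col)        # -1 where nothing matches
led <- rep(NA_character_, length(ori_col))
led[m > 0] <- trimws(regmatches(ori_col, m))  # fill only the matching rows
print(led)
```

Filling a pre-allocated NA vector avoids the length mismatch that regmatches would otherwise cause when assigning into a data frame column.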
A potential, readable solution is using:
library("tidyverse")
mutate(df, extracted_str = str_match(string = ori_col,
pattern = "(.*\\,)(.*LED.*)(\\,.*)")[,3])
Notes
In this context you will always want the second capture group, the one that contains the LED word. str_match returns a matrix whose first column is the full match and whose following columns are the capture groups, so the second group is column 3; subsetting with [,3] returns it as a single column.
As shown on regex101, the regex fills that group when it encounters the LED word after a comma.
To see which components land in each group, remove the subset [,3] from the str_match results:
mutate(df, extracted_str = str_match(string = ori_col,
pattern = "(.*\\,)(.*LED.*)(\\,.*)"))
You can run trimws / str_trim to remove white space from the results
mutate(df, extracted_str = str_trim(str_match(string = ori_col,
pattern = "(.*\\,)(.*LED.*)(\\,.*)")[,3]))
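To see how the match is laid out without stringr, here is a base-R sketch using regexec on one of the question's strings. The layout mirrors the matrix columns of str_match: element 1 is the full match and elements 2 to 4 are the three capture groups. perl = TRUE is an assumption made here so that the greedy backtracking behaves as described.

```r
x <- "Display: 27 in, VA, Viewing angles (H/V): 170 / 160, W-LED, 1920 x 1080 pixels"

groups <- regmatches(x, regexec("(.*,)(.*LED.*)(,.*)", x, perl = TRUE))[[1]]

# groups[1]: full match
# groups[2]: everything up to and including the comma before the LED field
# groups[3]: the LED field itself (with its leading space)
# groups[4]: the comma after the LED field and the rest of the string
led_field <- trimws(groups[3])
print(led_field)
```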
Use scan to split the strings on commas, then select the pieces that match the regex (txt is assumed to hold the original strings, e.g. txt <- ori_col):
> inp <- scan(text=txt, what="", sep=",")
Read 9 items
> inp[ sapply( inp, function(x){grepl("LED",x)}) ]
[1] " W-LED" " Edge LED (Local Dimming)"
Building on @rawr's comment, this one works for me
df %>% mutate(new_column = gsub(', *([^,]*LED[^,]*),|.', '\\1', ori_col))
Will appreciate if someone can explain how the regex works.
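One way to read that pattern, sketched in base R on the question's data: the regex is an alternation, and gsub replaces every match with the contents of the first capture group. The comments describe my reading of the pattern rather than anything stated by its author.

```r
ori_col <- c(
  "Display: 27 in, VA, Viewing angles (H/V): 170 / 160, W-LED, 1920 x 1080 pixels",
  "Display: 21.5 in, VA, Edge LED (Local Dimming), 1920 x 1080 pixels"
)

# Alternative 1: ", *([^,]*LED[^,]*),"
#   a comma, optional spaces, a comma-free field containing LED
#   (captured as group 1), then the closing comma.
# Alternative 2: "."
#   any other single character.
# Every match is replaced by \\1: the LED field for alternative 1,
# and the empty string for alternative 2, where group 1 took no part.
res <- gsub(", *([^,]*LED[^,]*),|.", "\\1", ori_col)
print(res)
```

Since alternative 1 requires a comma on both sides, an LED field at the very start or end of the string would be deleted character by character along with everything else.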

String Manipulation in R data frames

I have just learnt R and was trying to clean the Amount_USD column of a table for analysis, using the string manipulation code given below. I could not work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
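A base-R sketch of what the negative lookahead contributes, assuming the same vec as above. Without it, the trailing 0 of the "\xa0" text is itself a digit, so the match starts one character too early.

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# No lookahead: the "0" that ends the prefix is swallowed by the match
no_look <- regmatches(vec, regexpr("[\\d,]+$", vec, perl = TRUE))

# (?!0) rejects a match that would start on a 0, so the match
# begins at the first digit of the amount instead
with_look <- regmatches(vec, regexpr("(?!0)[\\d,]+$", vec, perl = TRUE))

print(no_look)
print(with_look)
```

The lookahead only works because none of the amounts begins with 0; the fixed-position alternative does not depend on that.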
There were a few minor issues with your code.
The main one being two unneeded backslashes in your matching statement. This also leads to a counting error in your first str_sub(), where you should be getting the first 8 characters not 10. Finally, you should be getting the substring from the next character after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
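A sketch of the same clean-up with fixed = TRUE, which treats the pattern as plain text and so avoids doubling the backslashes for the regex engine. It assumes, as the answers above do, that the column holds the literal eight characters \xc2\xa0. If the data instead contain a real non-breaking space (U+00A0, whose UTF-8 bytes are C2 A0), the pattern would be "\u00a0"; which case applies depends on how the file was read, which the question does not show.

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Literal text "\xc2\xa0": fixed = TRUE means no regex escaping is needed
clean <- sub("\\xc2\\xa0", "", vec, fixed = TRUE)

# Drop the thousands separators and convert to numeric
amount <- as.numeric(gsub(",", "", clean, fixed = TRUE))
print(amount)

# Assumed alternative: a real non-breaking space in front of the number
nbsp_clean <- sub("\u00a0", "", "\u00a010,000,000", fixed = TRUE)
```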

Extract a string or value based on specific word before and a % sign after in R

I have a Text column with thousands of rows of paragraphs, and I want to extract the values of "Capacity > x%". The operator can be >, <, =, ~, and so on. I basically need the operator and the integer value (e.g. <40%) placed in a new column next to the text, in the same row. I have tried removing the text before and after, gsub, grep, grepl, str_extract, etc., none with good results. I am not sure whether the percentage sign is throwing it off or I am just not getting the code structure. I would appreciate your assistance.
Here are some codes I have tried (aa is the df, TEXT is col name):
str_extract(string =aa$TEXT, pattern = perl("(?<=LVEF).*(?=%)"))
gsub(".*[Capacity]([^.]+)[%].*", "\\1", aa$TEXT)
genXtract(aa$TEXT, "Capacity", "%")
gsub("%.*$", "%", aa$TEXT)
grep("^Capacity.*%$",aa$TEXT)
Since you did not provide a reproducible example, I created one myself and used it here.
We can use sub to extract everything after "Capacity" until a number and % sign.
sub(".*Capacity(.*\\d+%).*", "\\1", aa$TEXT)
#[1] " > 10%" " < 40%" " ~ 230%"
Or with str_extract
stringr::str_extract(aa$TEXT, "(?<=Capacity).*\\d+%")
data
aa <- data.frame(TEXT = c("This is a temp text, Capacity > 10%",
"This is a temp text, Capacity < 40%",
"Capacity ~ 230% more text ahead"), stringsAsFactors = FALSE)
gsub solution
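I think your gsub solution was pretty close. Two things got in the way: [Capacity] in square brackets is a character class that matches a single one of those letters rather than the word, and the percentage sign was outside the capture group, so it was not brought along. So something like this should work (the result is assigned to the capacity column):
aa$capacity <- gsub(".*Capacity([^.]+%).*", "\\1", aa$TEXT)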
Alternative method
The gsub approach returns the whole string unchanged when there is no match. To avoid this, we can use the stringr package with a more specific regular expression:
library(magrittr)
library(dplyr)
library(stringr)
aa %>%
  mutate(capacity = str_extract(TEXT, "(?<=Capacity\\s)\\W\\s?\\d+\\s?%")) %>%
  mutate(capacity = str_squish(capacity)) # Remove excess white space
This code will give NA when there is no match, which I believe is your desired behaviour.
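A base-R sketch of the same idea that captures the operator and the number separately and returns NA for rows without a match. The operator list [<>=~] is an assumption based on the operators named in the question, and the fourth row is a made-up example of a non-matching paragraph.

```r
aa <- data.frame(TEXT = c("This is a temp text, Capacity > 10%",
                          "This is a temp text, Capacity < 40%",
                          "Capacity ~ 230% more text ahead",
                          "No capacity figure in this row"),  # hypothetical row
                 stringsAsFactors = FALSE)

# Group 1: the operator, group 2: the integer value
pattern <- "Capacity\\s*([<>=~])\\s*(\\d+)\\s*%"
hits <- regmatches(aa$TEXT, regexec(pattern, aa$TEXT, perl = TRUE))

# Each hit is c(full match, operator, value), or character(0) if no match
aa$capacity <- vapply(hits, function(g) {
  if (length(g) == 3) paste0(g[2], g[3], "%") else NA_character_
}, character(1))
print(aa$capacity)
```

Rebuilding the value from the two groups also normalises the spacing, so " > 10%" and ">10 %" would both come out as ">10%".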

Removing duplicate words in a string in R

Just to help someone who's just voluntarily removed their question, following a request for code he tried and other comments. Let's assume they tried something like this:
str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')
and wanted to learn a better way. So what is the best way to remove a duplicate word from the string?
If you are still interested in alternate solutions you can use unique which slightly simplifies your code.
paste(unique(d), collapse = ' ')
As per the comment by Thomas, you probably do want to remove punctuation. R's regex functions support POSIX character classes such as [[:punct:]], which you can use instead of listing the characters yourself. Of course, you can always spell out specific characters if you need a more refined regex.
d <- gsub("[[:punct:]]", "", d)
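Putting the pieces together, a base-R sketch that compares words case insensitively and without punctuation while keeping the original spelling of the words that survive. It also shows a pitfall in the -which(duplicated(d)) form from the question: when nothing is duplicated, which() returns an empty index and the negative subscript selects nothing. The names key, deduped, and no_dups are illustrative.

```r
str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split = " "))

# Compare on a normalised key, but keep the original words
key <- tolower(gsub("[[:punct:]]", "", d))
deduped <- paste(d[!duplicated(key)], collapse = " ")
print(deduped)

# Pitfall: with no duplicates, -which(...) drops every element
no_dups <- c("all", "words", "differ")
bad  <- paste(no_dups[-which(duplicated(no_dups))], collapse = " ")
good <- paste(no_dups[!duplicated(no_dups)], collapse = " ")
```

Logical indexing with !duplicated() behaves the same whether or not any duplicates exist, which is one reason to prefer it (or unique()) over the negative-index form.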
No additional package is needed:
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
Atomic function:
rem_dup.one <- function(x){
paste(unique(tolower(trimws(unlist(strsplit(x,split="(?!')[ [:punct:]]",fixed=F,perl=T))))),collapse = " ")
}
rem_dup.one("And and here's a second one one and not a third One.")
Vectorize
rem_dup.vector <- Vectorize(rem_dup.one,USE.NAMES = F)
rem_dup.vector(str)
Result
"how do i best try and find a way to improve this code" "and here's a second one not third"
To remove duplicate words while leaving special characters in place, use this function:
rem_dup_word <- function(x){
x <- tolower(x)
paste(unique(trimws(unlist(strsplit(x,split=" ",fixed=F,perl=T)))),collapse = " ")
}
Input data:
duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
rem_dup_word(duptest)
output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)
Because the input is lower-cased first, "Samsung" and "SAMSUNG" are treated as duplicates.
I'm not sure if string case is a concern. This solution uses qdap with the add-on qdapRegex package to make sure that punctuation and the capitalisation at the start of the string don't interfere with the removal but are maintained:
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
library(qdap)
library(dplyr) # so that pipe function (%>% can work)
str %>%
tolower() %>%
word_split() %>%
sapply(., function(x) unbag(unique(x))) %>%
rm_white_endmark() %>%
rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
unname()
## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."

Resources