Replacing repeated strings using regex in R

I have a string as follows:
text <- "http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I want to eliminate all duplicated addresses, so my expected result is:
expected <- "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"
I tried (^[\w|.|:|\/]*),\1+ on regex101.com, and it removes the first repetition of the string (but fails at the second). However, when I port it to R's gsub it doesn't work as expected:
gsub("(^[\\w|.|:|\\/]*),\\1+", "\\1", text)
I've tried with perl = FALSE and TRUE to no avail.
What am I doing wrong?

If they are sequential, you just need to modify your regex slightly.
Take out your BOS anchor ^.
Add a cluster group around the comma and backreference, then quantify it (?:,\1)+.
And lose the pipe symbol |; inside a character class it's just a literal.
([\w.:/]+)(?:,\1)+
https://regex101.com/r/FDzop9/1
( [\w.:/]+ ) # (1), The address
(?: # Cluster
, \1 # Comma followed by what found in group 1
)+ # Cluster end, 1 to many times
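Applied in R, a minimal sketch using the OP's text (perl = TRUE so the pattern is handled by PCRE, though the default engine accepts \w here too):
gsub("([\\w.:/]+)(?:,\\1)+", "\\1", text, perl = TRUE)
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"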
Note - if you use split and unique then combine, all duplicates are removed (not just sequential runs); unique() keeps items in order of first appearance.

An alternative approach is to split the string on the comma, take the unique results, then re-combine them into a single string:
paste0(unique(strsplit(text, ",")[[1]]), collapse = ",")
# [1] "http://x.co/imag/xyz.png,http://x.co/imag/jpg.png"

text <- c("http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/xyz.png,http://x.co/imag/jpg.png",
"http://q.co/imag/qrs.png,http://q.co/imag/qrs.png")
df <- data.frame(no = 1:2, text)
You can use functions from tidyverse if your strings are in a dataframe:
library(tidyverse)
separate_rows(df, text, sep = ",") %>%
  distinct() %>%
  group_by(no) %>%
  mutate(text = paste(text, collapse = ",")) %>%
  slice(1)
The output is:
# no text
# <int> <chr>
# 1 1 http://x.co/imag/xyz.png,http://x.co/imag/jpg.png
# 2 2 http://q.co/imag/qrs.png
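A slightly more direct variant of the same pipeline (a sketch with the same packages and data) collapses the strings in summarise() instead of mutate() plus slice(1):
separate_rows(df, text, sep = ",") %>%
  distinct() %>%
  group_by(no) %>%
  summarise(text = paste(text, collapse = ","))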

Related

Extract emails in brackets

I work with gmailR and I need to extract emails from brackets <> (sometimes a few in one row), but when there are no brackets (e.g. name@mail.com) I need to keep those elements.
This is an example
x2 <- c("John Smith <jsmith#company.ch> <abrown#company.ch>","no-reply#cdon.com" ,
"<rikke.hc#hotmail.com>")
I need output like:
[1] "jsmith#company.ch" "abrown#company.ch"
[2] "no-reply#cdon.com"
[3] "rikke.hc#hotmail.com"
I tried this, intending to merge the two results:
library("qdapRegex")
y1 <- ex_between(x2, "<", ">", extract = FALSE)
y2 <- rm_between(x2, "<", ">", extract = TRUE )
My data code sample:
from <- sapply(msgs_meta, gm_from)
from[sapply(from, is.null)] <- NA
from1 <- rm_bracket(from)
from2 <- ex_bracket(from)
gmail_DK <- gmail_DK %>%
  mutate(from = unlist(y1)) %>%
  mutate(from = unlist(y2))
but when I apply this function to my data (only one day's emails) and unlist, I get
Error in `mutate()`:
! Problem while computing `cc = unlist(cc2)`.
x `cc` must be size 103 or 1, not 104.
Run `rlang::last_error()` to see where the error occurred.
I suppose that with data from more days the size mismatch would be bigger, so I'd prefer not to go this way.
An answer in R is preferred, but if you know how to do it in, for example, PowerQuery, that would be great too.
We may also use base R. Split the strings at the space that follows the > (strsplit), then capture the substring between the < and > in sub; in the replacement we specify the backreference (\\1) of the captured group. [^>]+ matches one or more characters that are not a >.
sub(".*<([^>]+)>", "\\1", unlist(strsplit(x2,
"(?<=>)\\s+", perl = TRUE)))
[1] "jsmith#company.ch" "abrown#company.ch"
[3] "no-reply#cdon.com" "rikke.hc#hotmail.com"
Clunky but OK?
(x2
## split into single words/tokens
%>% strsplit(" ")
%>% unlist()
## find e-mail-like strings, with or without brackets
%>% stringr::str_extract("<?[\\w-.]+@[\\w-.]+>?")
## drop elements with no e-mail component
%>% na.omit()
## strip brackets
%>% stringr::str_remove_all("[<>]")
)
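If you need the addresses kept grouped per input element, as in the expected output above, a base R sketch with gregexpr()/regmatches() might look like this (the helper name get_emails is ours, not from gmailR or qdapRegex):
get_emails <- function(x) {
  m <- regmatches(x, gregexpr("<([^>]+)>", x))
  vapply(seq_along(x), function(i) {
    # no brackets: keep the element as-is; otherwise strip the brackets
    if (length(m[[i]]) == 0) x[i] else paste(gsub("[<>]", "", m[[i]]), collapse = " ")
  }, character(1))
}
get_emails(x2)
# [1] "jsmith@company.ch abrown@company.ch" "no-reply@cdon.com"
# [3] "rikke.hc@hotmail.com"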

Using R Separate_Rows doesn't work with a "|"

I have a CSV file with a column containing a variable-length list of items separated by a |.
I use the code below:
violations <- inspections %>% head(100) %>%
  select(`Inspection ID`, Violations) %>%
  separate_rows(Violations, sep = "|")
but this only creates a new row for each character in the field (including spaces)
What am I missing here on how to separate this column?
It's hard to help without a better description of your data and an example of what the correct output would look like. That said, I think part of your confusion is due to the documentation in separate_rows. A similar function, separate, documents its sep argument as:
If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values.
but the documentation for the sep argument in separate_rows doesn't say the same thing, though I think it has the same behavior. In regular expressions, | has special meaning, so it must be escaped as \\|.
library(tidyr)
library(tibble)
df <- tibble(
  Inspection_ID = c(1, 2, 3),
  Violations = c("A", "A|B", "A|B|C"))
separate_rows(df, Violations, sep = "\\|")
Yields
# A tibble: 6 x 2
Inspection_ID Violations
<dbl> <chr>
1 1 A
2 2 A
3 2 B
4 3 A
5 3 B
6 3 C
Not sure what your data looks like, but you may want to replace sep = "|" with sep = "\\|". Good luck!
Using sep = "\\|" with the separate_rows function allowed me to separate pipe-delimited values.
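For reference, any spelling that makes the | literal to the regex engine works; a small sketch with made-up data:
library(tidyr)
df <- data.frame(id = 1:2, Violations = c("A|B", "C"))
separate_rows(df, Violations, sep = "\\|") # escaped pipe
separate_rows(df, Violations, sep = "[|]") # inside a character class, | is literal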

Replace words in R

I have words matched with their synonyms. In a different data frame, I have sentences. I want to search the sentences for synonyms from the other data frame and, if found, replace them with the word for which the synonym was found.
dt = read.table(header = TRUE,
text ="Word Synonyms
Use 'employ, utilize, exhaust, spend, expend, consume, exercise'
Come 'advance, approach, arrive, near, reach'
Go 'depart, disappear, fade, move, proceed, recede, travel'
Run 'dash, escape, elope, flee, hasten, hurry, race, rush, speed, sprint'
Hurry 'rush, run, speed, race, hasten, urge, accelerate, bustle'
Hide 'conceal, cover, mask, cloak, camouflage, screen, shroud, veil'
", stringsAsFactors= F)
mydf = read.table(header = TRUE, stringsAsFactors = F,
text ="sentence
'I can utilize this file'
'I can cover these things'
")
The desired output looks like -
I can Use this file
I can Hide these things
Above is just a sample. In my real dataset, I have more than 10000 sentences.
One can replace ", " in dt$Synonyms with | so that it can be used as the pattern argument of gsub. Now, use dt$Synonyms as the pattern and replace an occurrence of any word (separated by |) with dt$Word. One can use sapply and gsub as:
Edited: Added word-boundary check (as part of pattern in gsub) as suggested by OP.
# First replace `, ` with `|` in dt$Synonyms. Now dt$Synonyms can be
# used 'pattern' argument of `gsub`.
dt$Synonyms <- paste("\\b",gsub(", ","\\\\b|\\\\b",dt$Synonyms),"\\b", sep = "")
# Loop through each row of 'dt' to replace Synonyms with word using sapply
mydf$sentence <- sapply(mydf$sentence, function(x){
  for(row in 1:nrow(dt)){
    x = gsub(dt$Synonyms[row], dt$Word[row], x)
  }
  x
})
mydf
# sentence
# 1 I can Use this file
# 2 I can Hide these things
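The same mapping can also be expressed as a single named replacement vector for stringr::str_replace_all (a sketch; it assumes dt still holds the original comma-separated Synonyms, i.e. it is run before the \\b edit above):
library(stringr)
syns <- strsplit(dt$Synonyms, ", ")
repl <- setNames(rep(dt$Word, lengths(syns)),
                 paste0("\\b", unlist(syns), "\\b"))
str_replace_all(mydf$sentence, repl)
# [1] "I can Use this file"     "I can Hide these things"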
Here is a tidyverse solution...
library(stringr)
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  mutate(Synonyms = str_split(Synonyms, ",\\s*")) %>% # split into words
  unnest(Synonyms) # this results in a long dataframe of words and synonyms
mydf2 <- mydf %>%
  mutate(Synonyms = str_split(sentence, "\\s+")) %>% # split into words
  unnest(Synonyms) %>% # expand to long form, one word per row
  left_join(dt2) %>% # match synonyms
  mutate(Word = ifelse(is.na(Word), Synonyms, Word)) %>% # keep unmatched words the same
  group_by(sentence) %>%
  summarise(sentence2 = paste(Word, collapse = " ")) # reconstruct sentences
mydf2
sentence sentence2
<chr> <chr>
1 I can cover these things I can Hide these things
2 I can utilize this file I can Use this file

Replacing words with spaces within a tibble in R without anti-join

I have a tibble of sentences like so:
A tibble: 1,782 x 1
Chat
<chr>
1 Hi i would like to find out more about the trials
2 Hello I had a guest
3 Hello my friend overseas right now
...
What I'm trying to do is to remove stopwords like "I" and "hello". I already have a list of them, and I want to replace these stopwords with a space. I tried using mutate and gsub, but gsub only takes a regex. Anti-join won't work here: since I am doing bigrams/trigrams, I don't have a single-word column to anti-join the stopwords against.
Is there a way to replace all these words in each sentences in R?
We could unnest the tokens, replace the 'word' that is found in the 'stop_words' 'word' column with space (" "), and paste the 'word' after grouping by 'lines'
library(tidytext)
library(tidyverse)
rowid_to_column(df1, 'lines') %>%
  unnest_tokens(word, Chat) %>%
  mutate(word = replace(word, word %in% stop_words$word, " ")) %>%
  group_by(lines) %>%
  summarise(Chat = paste(word, collapse = ' ')) %>%
  ungroup %>%
  select(-lines)
NOTE: This replaces the stop words found in the 'stop_words' dataset with " ". If we need only a custom subset of stop words to be replaced, then create a vector of those elements and do the change in the mutate step:
v1 <- c("I", "hello", "Hi")
rowid_to_column(df1, 'lines') %>%
...
...
mutate(word = replace(word, word %in% v1, " ")) %>%
...
...
We can construct a pattern like "\\bword\\b" for each stop word and then use gsub to replace them with " ". Here is an example. Notice that I set ignore.case = TRUE to cover both lower and upper case, but you may want to adjust that for your needs.
dat <- read.table(text = "Chat
1 'Hi i would like to find out more about the trials'
2 'Hello I had a guest'
3 'Hello my friend overseas right now'",
header = TRUE, stringsAsFactors = FALSE)
dat
# Chat
# 1 Hi i would like to find out more about the trials
# 2 Hello I had a guest
# 3 Hello my friend overseas right now
# A list of stop word
stopword <- c("I", "Hello", "Hi")
# Create the pattern
stopword2 <- paste0("\\b", stopword, "\\b")
stopword3 <- paste(stopword2, collapse = "|")
# View the pattern
stopword3
# [1] "\\bI\\b|\\bHello\\b|\\bHi\\b"
dat$Chat <- gsub(pattern = stopword3, replacement = " ", x = dat$Chat, ignore.case = TRUE)
dat
# Chat
# 1 would like to find out more about the trials
# 2 had a guest
# 3 my friend overseas right now
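The same replacement can be written with stringr, using regex() for the case-insensitive matching (a sketch reusing stopword3 from above):
library(stringr)
str_replace_all(dat$Chat, regex(stopword3, ignore_case = TRUE), " ")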

Filtering text from numbers and stopwords in R (not for tdm)

I have text corpus.
mytextdata = read.csv(path to texts.csv)
Mystopwords=read.csv(path to mystopwords.txt)
How can I filter this text? I must:
1) delete all numbers
2) filter out the stop words
3) remove the brackets
I will not work with a dtm; I just need to clean this text data of numbers and stopwords.
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
Jura and the are stopwords.
In an output I expect
Tablet for cleaning hydraulic system
Since there is only one character string available in the question at the moment, I decided to create sample data myself. I hope it is close to your actual data. As Nate suggested, using the tidytext package is one way to go. Here, I first removed numbers, punctuation, contents in the brackets, and the brackets themselves. Then, I split the words in each string using unnest_tokens() and removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added jura in the filter() part. Grouping the data by id, I combined the words to recreate character strings in summarise(). Note that I used jura instead of Jura because unnest_tokens() converts capital letters to lower case.
mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = F)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% c(stop_words$word, "jura")) %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))
# id text
# <int> <chr>
#1 1 tablet cleaning hydraulic system
#2 2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
  stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
  lapply(function(x){
    foo <- x[which(!x %in% c(stop_words$word, "Jura"))] %>%
      paste(collapse = " ")
    foo
  }) %>%
  unlist
#[1] "Tablet cleaning hydraulic system" "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can adapt @jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this with two regular expressions: the first one matches the content of the brackets and the numeric values, whereas the second one removes the stop words. The second regex has to be constructed from the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header = FALSE, stringsAsFactors = FALSE)

custom_filter <- function(string, stopwords = c()){
  string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
  # Create something like: "\\b( the|Jura)\\b"
  new_regex <- paste0("\\b( ", paste0(stopwords, collapse = "|"), ")\\b")
  gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
custom_filter(mytextdata[1], stopwords)
# [1] "Tablet for cleaning hydraulic system "
