Find and replace numbers in an R txt file - r

I am attempting to find all sentences in a text file in r that have numbers of any format in them and replace it with hashtags around them.
for example take the input below:
ex <- c("I have $5.78 in my account","Hello my name is blank","do you want 1,785 puppies?",
"I love stack overflow!","My favorite numbers are 3, 14,568, and 78")
as the output of the function, I'm looking for:
> "I have #$5.78# in my account"
> "do you want #1,785# puppies?"
> "My favorite numbers are #3#, #14,568#, and #78#"

Surrounding numbers is straight-forward, assuming that anything with a number, period, comma, and dollar-sign are all included.
gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex)
# [1] "I have $#5.78# in my account"
# [2] "Hello my name is blank"
# [3] "do you want #1,785# puppies?"
# [4] "I love stack overflow!"
# [5] "My favorite numbers are #3#, #14,568#, and #78#"
To filter out just the numbered entries:
grep("\\d", gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex), value = TRUE)
# [1] "I have $#5.78# in my account"
# [2] "do you want #1,785# puppies?"
# [3] "My favorite numbers are #3#, #14,568#, and #78#"

We can use gsub
gsub("(?<=\\s)(?=[$0-9])|(?<=[0-9])(?=,?[ ]|$)", "#", ex, perl = TRUE)
#[1] "I have #$5.78# in my account" "Hello my name is blank"
#[3] "do you want #1,785# puppies?" "I love stack overflow!"
#[5] "My favorite numbers are #3#, #14,568#, and #78#"

Another step-by-step approach is to use grep to identify the elements of text file containing the pattern "[0-9]", subset text elements with numeric entries using ex[....], and use the pipe operator %>% from library(dplyr) to pass the subset to gsub then use #r2evans' logic to place hashtags around numeric entries as shown below:
library(dplyr)
ex[do.call(grep,list("[0-9]",ex))] %>% gsub("\\b([-$0-9.,]+)\\b", "#\\1#",.)
The do.call(grep,list("[0-9]",ex)) portion of the code returns the indices for the text elements in ex with numeric entries.
Output
library(dplyr)
ex[do.call(grep,list("[0-9]",ex))] %>% gsub("\\b([-$0-9.,]+)\\b", "#\\1#",.)
[1] "I have $#5.78# in my account" "do you want #1,785# puppies?"
[3] "My favorite numbers are #3#, #14,568#, and #78#"

Related

Extract string between exact word and pattern using stringr

I have been wondering how to extract string in R using stringr or another package between the exact word "to the" (which is always lowercase) and the very second comma in a sentence.
For instance:
String: "This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want"
Desired output: "THIS IS WHAT I WANT, DO YOU SEE IT?"
I have this vector:
x<-c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
"HYU_IO TO TO to the I WANT, THIS, this i dont, want", "uiui uiu to the xxxx,,this is not, what I want")
and I am trying to use this code
str_extract(string = x, pattern = "(?<=to the ).*(?=\\,)")
but I cant seem to get it to work to properly give me this:
"THIS IS WHAT I WANT, DO YOU SEE IT?"
"I WANT, THIS"
"xxxx,"
Thank you guys so much for your time and help
You were close!
str_extract(string = x, pattern = "(?<=to the )[^,]*,[^,]*")
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
The look-behind stays the same, [^,]* matches anything but a comma, then , matches exactly one comma, then [^,]* again for anything but a comma.
Alternative approach, by far not comparable with Gregor Thomas approach, but somehow an alternative:
vector to tibble
separate twice by first to the then by ,
paste together
pull for vector output.
library(tidyverse)
as_tibble(x) %>%
separate(value, c("a", "b"), sep = 'to the ') %>%
separate(b, c("a", "c"), sep =",") %>%
mutate(x = paste0(a, ",", c), .keep="unused") %>%
pull(x)
[1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
[2] "I WANT, THIS"
[3] "xxxx,"

Looping through and replacing text in a data frame

I have a dataframe that consists of a variable with multiple words, such as:
variable
"hello my name is this"
"greetings friend"
And another dataframe that consists of two columns, one of which is words, the other of which is replacements for those words, such as:
word
"hello"
"greetings"
replacement:
replacement
"hi"
"hi"
I'm trying to find an easy way to replace the words in "variable" with the replacement words, looping over both all the observations, and all the words in each observation. The desired result is:
variable
"hi my name is this"
"hi friend"
I've looked into some methods that use cSplit, but it's not feasible for my application (there are too many words in any given observation of "variable", so this creates too many columns). I'm not sure how I would use strsplit for this, but am guessing that is the correct option?
EDIT: From my understanding of this question, my question my be a repeat of a previously unanswered question: Replace strings in text based on dictionary
stringr's str_replace_all would be handy in this case:
df = data.frame(variable = c('hello my name is this','greetings friend'))
replacement <- data.frame(word = c('hello','greetings'), replacment = c('hi','hi'), stringsAsFactors = F)
stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
Output:
> stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
[1] "hi my name is this" "hi friend"
This is similar to #amrrs's solution, but I am using a named vector instead of supplying two separate vectors. This also addresses the issue mentioned by OP in the comments:
library(dplyr)
library(stringr)
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .) %>%
str_replace_all(df1$variable, .)
# [1] "hi my name is this" "hi friend" "hi, hellomy is not a word"
# [4] "hi! my friend"
This is the named vector with regex as the names and string to replace with as elements:
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .)
# \\bhello\\b \\bgreetings\\b
# "hi" "hi"
Data:
df1 = data.frame(variable = c('hello my name is this',
'greetings friend',
'hello, hellomy is not a word',
'greetings! my friend'))
df2 = data.frame(word = c('hello','greetings'),
replacement = c('hi','hi'),
stringsAsFactors = F)
Note:
In order to address the issue of root words also being converted, I wrapped the regex with word boundaries (\\b). This makes sure that I am not converting words that live inside another, like "helloguys".

Regex in R to match a group of words and split by space

Split the regex by space if the group of words is not matched.
If group of words is matched then keep them as it is.
text <- c('considerate and helpful','not bad at all','this is helpful')
pattern <- c('considerate and helpful','not bad')
Output :
considerate and helpful, not bad, at, all, this, is, helpful
Thank you for the help!
Of course, just put the words in front of \w+:
library("stringr")
text <- c('considerate and helpful','not bad at all','this is helpful')
parts <- str_extract_all(text, "considerate and helpful|not bad|\\w+")
parts
Which yields
[[1]]
[1] "considerate and helpful"
[[2]]
[1] "not bad" "at" "all"
[[3]]
[1] "this" "is" "helpful"
It does not split on whitespaces but rather extracts the "words".

separating last sentence from a string in R

I have a vector of strings and i want to separate the last sentence from each string in R.
Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string as shown:-
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
trimws((x[length(x)]))
})
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
This regex anchors to the end of the string with $, allows an optional '.' or '!' at the end. At the front it finds the closest ". " or "! " as the end of the prior sentence. The negative lookback ?<= ensures the "." or '!' are not matched. Also provides for a single sentence by using ^ for the beginning.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library (stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^)).+(\\.|\\!)?$")
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."

Problems in a regular expression to extract names using stringr

I cannot fully understand why my regular expression does not work to extract the info I want. I have an unlisted vector that looks like this:
text <- c("Senator, 1.4balbal", "rule 46.1, declares",
"Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23)
I would like to create a regular expression to extract only the name of the "Town", even if the town has a long name as the one written in the vector ("A Town with a Long Name"). I have tried this to extract the name of the town:
reg.town <- "[[:alpha:]](.+?)+,(.+?)\\d{2}"
towns<- unlist(str_extract_all(example, reg.prov))
but I extract everything around the ",".
Thanks in advance,
It looks like a town name starts with a capital letter ([[:upper:]]), ends with a comma (or continues to the end of text if there is no comma) ([^,]+) and should be at the start of the input text (^). The corresponding regex in this case would be:
^[[:upper:]][^,]+
Demo: https://regex101.com/r/QXYtyv/1
I have solve the problem thanks to #Dmitry Egorov 's demo post in the comment. the regular expression is this one ([[:upper:]].+?, [[:digit:]])
Thanks for your quick replies!!
You may use the following regex:
> library(stringr)
> text <- c("Senator, 1.4balbal", "rule 46.1, declares", "Town, 24", "A Town with a Long Name, 23", "THIS IS A DOCUMENT,23")
> towns <- unlist(str_extract_all(text, "\\b\\p{Lu}[^,]++(?=, \\d)"))
> towns
[1] "Senator" "Town"
[3] "A Town with a Long Name"
The regex matches:
\\b - a leading word boundary
\\p{Lu} - an uppercase letter
[^,]++ - 1+ chars other than a , (possessively, due to ++ quantifier, with no backtracking into this pattern for a more efficient matching)
(?=, \\d) - a positive lookahead that requires a ,, then a space and then any digit to appear immediately after the last non-, symbol matched with [^,]++.
Note you may get the same results with base R using the same regex with a PCRE option enabled:
> towns_baseR <- unlist(regmatches(text, gregexpr("\\b\\p{Lu}[^,]++(?=, \\d)", text, perl=TRUE)))
> towns_baseR
[1] "Senator" "Town"
[3] "A Town with a Long Name"
>

Resources