Remove first 4 words after a certain string pattern in R?

I am working with really long strings. How can I remove the first 4 words after a certain string pattern occurs? For example:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
#remove the first 4 words after and including "stackoverflow"
result
"hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."

Solution with base R
A one-line solution:
pattern <- "stackoverflow"
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
How it works
Build the pattern you want with a regex: "stackoverflow" followed by up to 4 words (see ?regex for more details).
Words are matched by \\w+ and separators by \\W+ (capital W; it covers spaces and special characters such as the apostrophe in your sentence).
(...){0,4} means that the combination of word and separator may repeat up to 4 times.
\\W* catches a possible final separator, so that the two remaining pieces of the sentence won't be left with two separators between them. Try it without, and you'll see what I mean (there's a quick demonstration below).
gsub locates the pattern and replaces it with "" (thus deleting it).
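For reference, a quick sketch (using only the objects defined above) that prints the assembled regex and shows what happens if the trailing \\W* is dropped:
# the full regex that gets passed to gsub
paste0(pattern, "(\\W+\\w+){0,4}\\W*")
#> [1] "stackoverflow(\\W+\\w+){0,4}\\W*"
# without \\W* the two remaining pieces keep both separators (note the double space)
gsub(paste0(pattern, "(\\W+\\w+){0,4}"), "", string)
#> [1] "hello I am a user of  happy with all the help the community usually offers when I'm in need of some coding expertise."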
Handle Exceptions
Note that it also works in edge cases:
# end of a sentence with fewer than 4 words after
string <- "hello I am a user of stackoverflow and I am"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "hello I am a user of "
# beginning of a sentence
string <- "stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] "happy with all the help the community usually offers when I'm in need of some coding expertise."
# pattern == string
string <- "stackoverflow and I am really"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", string)
#> [1] ""
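One more edge case worth noting (not covered in the answer above): gsub is global, so the four words are dropped after every occurrence of the pattern; use sub if only the first occurrence should be affected. A small sketch with a made-up string called twice:
# pattern appearing twice: gsub trims after both occurrences, sub only after the first
twice <- "stackoverflow and I am really happy, stackoverflow and I am really happy"
gsub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", twice)
#> [1] "happy, happy"
sub(paste0(pattern, "(\\W+\\w+){0,4}\\W*"), "", twice)
#> [1] "happy, stackoverflow and I am really happy"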
A tidyverse solution
library(stringr)
# locate start and end position of pattern
tmp <- str_locate(string, paste0(pattern,"(\\W+\\w+){0,4}\\W*"))
# get positions: start_sentence-start_pattern and end_pattern-end_sentence
tmp <- invert_match(tmp)
# get the substrings
tmp <- str_sub(string, tmp[,1], tmp[,2])
# collapse substrings together
str_c(tmp, collapse = "")
#> [1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."
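If you need this repeatedly, the base R one-liner is easy to wrap in a small helper; the function name remove_after below is just an illustration, not part of either solution:
# hypothetical helper around the gsub approach; n_words defaults to 4
remove_after <- function(string, pattern, n_words = 4) {
  gsub(paste0(pattern, "(\\W+\\w+){0,", n_words, "}\\W*"), "", string)
}
remove_after(string, "stackoverflow")  # identical to the gsub() call above on the same input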

Search for your pattern followed by spaces and words. Find the start and end positions of the match, split the string there and paste it back together. At the end, gsub any double (or more) spaces.
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
pat="stackoverflow"
library(stringr)
tmp = str_locate(
  string,
  paste0(
    pat,
    paste0(
      rep("\\s?[a-zA-Z]+", 4),
      collapse = ""
    )
  )
)
gsub("\\s{2,}", " ",
     paste0(
       substring(string, 1, tmp[1] - 1),
       substring(string, tmp[2] + 1)
     )
)
[1] "hello I am a user of happy with all the help the community usually offers when I'm in need of some coding expertise."

Quick answer; I am sure you can write better code than this:
string <- "hello I am a user of stackoverflow and I am really happy with all the help the community usually offers when I'm in need of some coding expertise."
# split the sentence into one word per column
t <- read.table(textConnection(string))
string2 <- ''
j <- 0
for (i in 1:length(t)) {
  if (t[[i]] == "stackoverflow") {
    # remember where the pattern was found
    j <- i
  } else if (j > 0) {
    # after the pattern: only keep words more than 4 positions past it
    if (i - j > 4) {
      string2 <- paste0(string2, " ", t[[i]])
    }
  } else if (j == 0) {
    # before the pattern: keep everything
    if (i > 1) {
      string2 <- paste0(string2, " ", t[[i]])
    } else {
      string2 <- t[[i]]
    }
  }
}
print(string2)

Related

Remove all special characters from string except Turkish ones

There are tons of similar questions, but I couldn't find the exact answer.
I have a text like this:
str <- "NG*-#+ÜÇ12 NET GROUPنت ياترم "
I want to remove all special and non-Turkish characters and keep the others. The desired output is:
"NGÜÇ12 NET GROUP"
I really appreciate your help.
Please try
library(stringr)
str <- "NG*-#+ÜÇ12 NET GROUPنت ياترم "
str_replace_all(str, '[^[\\da-zA-Z ÜüİıÇçŞşĞğ]]', '')
Using base gsub:
gsub("[^0-9a-zA-Z ÜüİıÇçŞşĞğ]", "", str)
# [1] "NGÜÇ12 NET GROUP "
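Both calls leave trailing whitespace behind (the spaces that surrounded the removed Arabic text); if you want to match the desired output exactly, wrapping either call in trimws() should do it, e.g.:
trimws(gsub("[^0-9a-zA-Z ÜüİıÇçŞşĞğ]", "", str))
# [1] "NGÜÇ12 NET GROUP"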

How to count the number of segments in a string in r?

I have a string printed out like this:
"\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
(The backslashes aren't actually part of the text; R just prints them that way.)
I would like to calculate how many non-empty segments there are in this string. In this case the answer should be 11.
I tried to convert it to a vector, but R ignores the quotation marks so I still ended up with a vector with length 1.
I don't know whether I need to extract those segments first and then count them, or whether there is an easier way to do this.
If it's the former case, which regular expression function best suits my need?
Thank you very much.
You can use scan to convert your large string into a vector of individual ones, then use nchar to count the lengths. Assuming your large string is x:
y <- scan(text=x, what="character", sep=",", strip.white=TRUE)
Read 12 items
sum(nchar(y)>0)
[1] 11
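The "Read 12 items" line is just scan's progress message; if you'd rather not see it, quiet = TRUE suppresses it (everything else unchanged):
y <- scan(text = x, what = "character", sep = ",", strip.white = TRUE, quiet = TRUE)
sum(nchar(y) > 0)
[1] 11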
I assume a segment is defined as anything between a . or a ,. An option using strsplit:
length(grep("\\w+", trimws(strsplit(str, split=",|\\.")[[1]])))
#[1] 11
Note: trimws is not mandatory in the above statement. I have included it so that one can get the value of each segment just by adding the value = TRUE argument to grep (see the snippet after the data below).
Data:
str <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
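For instance, a sketch of the value = TRUE variant mentioned above, using the str defined in the Data block:
grep("\\w+", trimws(strsplit(str, split=",|\\.")[[1]]), value = TRUE)
# returns the 11 non-empty segments (the literal quote characters are still attached to them)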
strsplit might be one possibility?
txt <- "Jenna and Alex were making cupcakes., Jenna asked Alex whether all were ready to be frosted.,
Alex said that, some of them , were., He added, that, the rest, would be, ready, soon.,"
a <- strsplit(txt, split=",")
length(a[[1]])
[1] 11
If the backslashes are part of the text, it doesn't change much, except for the last element, which would contain "\"". By filtering that out, the result is the same:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all
were ready to be frosted.\", \"Alex said that\", \" some of them \",
\"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
a <- strsplit(txt, split=", \"")
length(a[[1]][a[[1]] != "\""])
[1] 11
This is an absurd idea, but it does work:
txt <- "\"Jenna and Alex were making cupcakes.\", \"Jenna asked Alex whether all were ready to be frosted.\", \"Alex said that\", \" some of them \", \"were.\", \"He added\", \"that\", \"the rest\", \"would be\", \"ready\", \"soon.\", \"\""
Txt <- read.csv(text = txt,
                header = FALSE,
                colClasses = "character",
                na.strings = c("", " "))
sum(!vapply(Txt, is.na, logical(1)))

gsubfn function not giving desired output when ignore.case = TRUE

I am trying to substitute multiple patterns within a character vector with their corresponding replacement strings. After doing some research I found the gsubfn package, which I think can do what I want; however, when I run the code below I don't get the expected output (see the end of the question for the actual versus expected results).
library(gsubfn)
# Our test data that we want to search through (while ignoring case)
test.data<- c("1700 Happy Pl","155 Sad BLVD","82 Lolly ln", "4132 Avent aVe")
# A list data frame which contains the patterns we want to search for
# (again ignoring case) and the associated replacement strings we want to
# exchange any matches we come across with.
frame<- data.frame(pattern= c(" Pl"," blvd"," LN"," ave"), replace= c(" Place", " Boulevard", " Lane", " Avenue"),stringsAsFactors = F)
# NOTE: I added spaces in front of each of our search terms to make
# sure we only grab matches that are their own word (for instance, if an
# address was 45 Splash Way we would not want to replace the "pl" inside
# "Splash" with "Place")
# The following paste lines are supposed to keep the substitution from
# grabbing instances like the " Ave" found directly after "4132"
# inside "4132 Avent aVe", which we don't want converted to " Avenue".
pat <- paste(paste(frame$pattern,collapse = "($|[^a-zA-Z])|"),"($|[^a-zA-Z])", sep = "")
# Here is the gsubfn function I am calling
gsubfn(x = test.data, pattern = pat, replacement = setNames(as.list(frame$replace),frame$pattern), ignore.case = T)
Output being received:
[1] "1700 Happy" "155 Sad" "82 Lolly" "4132 Avent"
Output expected:
[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
My working theory on why this isn't working is that the matches don't line up with the names of the list I am passing to gsubfn's replacement argument because of case discrepancies (e.g., the match found in "155 Sad BLVD" doesn't equal " blvd", even though it was recognized as a match thanks to the ignore.case argument). Can someone confirm that this is the issue, or point me to what else might be going wrong, and perhaps suggest a fix that doesn't require expanding my pattern vector to include all case permutations?
Seems like stringr has a simple solution for you:
library(stringr)
str_replace_all(test.data,
                regex(paste0('\\b', frame$pattern, '$'), ignore_case = T),
                frame$replace)
#[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"
Note that I had to alter the regex to look only for words at the end of the string because of the tricky 'Avent aVe'. But of course there are other ways to handle that too.
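For instance, a plain base R sketch (not from the original answers) that loops over the lookup table with gsub and ignore.case = TRUE, which also sidesteps the case-mismatch problem described in the question:
out <- test.data
for (i in seq_len(nrow(frame))) {
  # anchor each pattern to the end of the string and replace case-insensitively
  out <- gsub(paste0(frame$pattern[i], "$"), frame$replace[i], out, ignore.case = TRUE)
}
out
#[1] "1700 Happy Place" "155 Sad Boulevard" "82 Lolly Lane" "4132 Avent Avenue"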

separating last sentence from a string in R

I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!). Hence I am confused as to how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string as shown:
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
  trimws(x[length(x)])
})
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
       "By default R regexes are greedy. So only the last sentence is kept. You see ? ",
       "Single sentences are not a problem.",
       "What if there are no spaces between sentences?It won't work.",
       "You know what? Multiple marks don't break my solution!!",
       "But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
This regex anchors to the end of the string with $ and allows an optional '.' or '!' at the end. At the front it finds the closest ". " or "! " as the end of the prior sentence. The lookbehind (?<=...) checks for that ". " or "! " without including it in the match. Including ^ in the lookbehind also covers the case of a single sentence.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library(stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^)).+(\\.|\\!)?$")
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."

How to get words that end with certain characters within each string r

I have a vector of strings that looks like:
str <- c("bills slashed for poor families today", "your calls are charged", "complaints dept awaiting refund")
I want to get all the words that end with the letter s and remove the s. I have tried:
gsub("s$","",str)
but it doesn't work because it only matches an s at the end of each string, not at the end of each word. I'm trying to get an output that looks like:
[1] bill slashed for poor familie today
[2] your call are charged
[3] complaint dept awaiting refund
Any pointers as to how I can do this? Thanks
$ checks for the end of the string, not the end of a word.
To check for the word boundaries you should use \b
So:
gsub("s\\b", "", str)
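Applied to the example vector, this should give (output added here for illustration):
gsub("s\\b", "", str)
# gives "bill slashed for poor familie today", "your call are charged", "complaint dept awaiting refund"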
Here's a non base R solution:
library(rebus)
library(stringr)
plurals <- "s" %R% BOUNDARY
str_replace_all(str, pattern = plurals, replacement = "")
You could also use an atomic group (easily mistaken for a positive lookahead assertion):
gsub(pattern = "s{1}(?>\\s)", " ", x = str, perl = T)
I am no expert on regex, but I believe this expression looks for an "s" that is followed by a space. Because (?>\\s) is an atomic group rather than a lookahead, the space is part of the match, so the "s" and the space are together replaced with a single space. So, final "s"s are removed (except at the very end of a string, where no space follows).
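With a true positive lookahead the space is only asserted, not consumed, so the "s" can simply be deleted; a small sketch of that variant (not from the answer above):
# (?=\\s) asserts a following whitespace without consuming it
gsub("s(?=\\s)", "", str, perl = TRUE)
# gives "bill slashed for poor familie today", "your call are charged", "complaint dept awaiting refund"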
