Extract all text before \n in R regex - r

I want to extract all text in a string before a "\n" appears.
Test string:
string <- "Stack Overflow\nIs a great website for asking programming questions\nOther Info"
Solution extracts "Stack Overflow"
Bonus point if it grabs the first word of the string and the last word before the "\n"
Example:
string2 <- "Stack Overflow Dot Com\nIS a great website for asking programming questions\nOther Info"
Solution extracts "Stack Com"

Since you seem to want a regex solution, here is an answer to your first question:
/(.*)/
This matches everything before the first end of line (\n), because . does not match a newline by default in PCRE (tested on regex101). In R, pass perl = TRUE to get this behaviour; with the default TRE engine, . also matches a newline.
To match the first and the last word in a one-liner, you can try
/([^ ]+).* (.*)$/
Someone can probably improve this so that it matches only the first and last word before the first occurrence of a newline (tested on regex101).

Here is a trick using a nested gsub:
> s
[1] "Stack Overflow\nIs a great website for asking programming questions\nOther Info"
[2] "Stack Overflow Dot Com\nIS a great website for asking programming questions\nOther Info"
> gsub("\\s.*\\s", " ", gsub("\n.*", "", s))
[1] "Stack Overflow" "Stack Com"

Regex for the first example:
^(?:([\w\s]+)\\n)
Regex for the second example:
^(?:(\w*\w\b)[\w\s]+\s(\w*\w\b)\\n)
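As a minimal base R sketch of the two tasks in this thread (not any answerer's exact method), the first line can be taken with a negated character class, which avoids relying on whether . matches a newline, and the first and last word can then be captured from that line. The variable names first_line and first_last are illustrative only.

```r
s <- c(
  "Stack Overflow\nIs a great website for asking programming questions\nOther Info",
  "Stack Overflow Dot Com\nIS a great website for asking programming questions\nOther Info"
)

# Everything before the first newline: [^\n]* cannot cross a line break,
# so the result does not depend on how the engine treats ".".
first_line <- regmatches(s, regexpr("^[^\n]*", s))

# First and last whitespace-delimited word of that line (PCRE via perl = TRUE).
# A line with a single word does not match and is returned unchanged.
first_last <- sub("^(\\S+).*\\s(\\S+)$", "\\1 \\2", first_line, perl = TRUE)

print(first_line)
print(first_last)
```

If the input may contain Windows line endings, the trailing "\r" would need to be stripped separately; that case is not handled here.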

Related

How do I remove all questions from a text with regex and stringr?

edit to demonstrate I made an effort to solve this myself
I have the following string:
"This is a question? This is an answer This is another question? This is another answer."
What I want to do is create a regex that will match all the questions so I can remove them. In this case the preceding answer or sentence doesn't always end with a '.' (full stop). So what I am looking for is to match sentences that end with a question mark and start with a capital letter.
What I want to match:
"This is a question?" and "This is another question?"
I work in R so I prefer an answer with stringr, but I'm mostly interested in the regex that I should apply.
I tried the following regex ^[A-Z].+\? but unfortunately it matches the whole string.
This should do the trick in regex: ([A-Z][^A-Z?]*\?)
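A short sketch of how that pattern could be applied in base R follows. The trailing \\s* is an addition (an assumption about the desired output) so that the space after each removed question is dropped too; with stringr, str_remove_all(x, pattern) should behave the same way. Note that the pattern relies on questions containing no capital letters after the first one.

```r
x <- "This is a question? This is an answer This is another question? This is another answer."

# [A-Z]     a capital letter opening the sentence
# [^A-Z?]*  anything up to the next capital letter or question mark
# \\?       the closing question mark
# \\s*      (added here) whitespace following the question
pattern <- "[A-Z][^A-Z?]*\\?\\s*"

without_questions <- gsub(pattern, "", x, perl = TRUE)
print(without_questions)
```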

R - Get substring between first occurrence and last occurrence [duplicate]

I'm working with long strings in R such as:
string <- "end of section. 3. LESSONS. The previous LESSONS are very important as seen in Figure 1. This text is also important. Figure 1: Blah blah blah"
I would like to extract the substring between the first occurrence of 'LESSONS' and the last occurrence of 'Figure 1', as follows:
"The previous LESSONS are very important as seen in Figure 1. This text is also important."
I tried the following, but it returns the substring after the last occurrence of 'LESSONS', not the first:
gsub(".*LESSONS (.*) Figure 1.*", "\\1", string)
#[1] "are very important as seen in Figure 1. This text is also important."
Also tried the following but it cuts the string after the first occurrence of 'Figure 1', not the last:
library(qdapRegex)
ex_between(string, "LESSONS", "Figure 1")
#[[1]]
#[1] ". The previous LESSONS are very important as seen in"
I'd appreciate any help!
You were very close. Make the part of the regex before "LESSONS" non-greedy so that it stops at the first occurrence, and match the ". " that follows it.
Also, sub is enough here; gsub is not needed.
sub(".*?LESSONS\\.\\s*(.*) Figure 1.*", "\\1", string)
#[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
You can also use str_extract from the stringr package, with a positive lookbehind (?<=...) and a positive lookahead (?=...) to define the parts of the string that delimit what you want to extract:
str_extract(string, "(?<=LESSONS\\.\\s).*(?=\\sFigure 1)")
[1] "The previous LESSONS are very important as seen in Figure 1. This text is also important."
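The same lookaround pattern should also work without stringr. The sketch below uses base R's regexpr and regmatches with perl = TRUE, which lookarounds require; the names pattern and between are illustrative.

```r
string <- "end of section. 3. LESSONS. The previous LESSONS are very important as seen in Figure 1. This text is also important. Figure 1: Blah blah blah"

# (?<=LESSONS\\.\\s)  start right after the first "LESSONS. "
# .*                  greedy, so it runs up to the LAST " Figure 1"
# (?=\\sFigure 1)     stop just before " Figure 1"
pattern <- "(?<=LESSONS\\.\\s).*(?=\\sFigure 1)"

between <- regmatches(string, regexpr(pattern, string, perl = TRUE))
print(between)
```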

Mid sentence carriage return with regex

I have text as follows.
mytext<-c("There is a\nlot of stuff","There is a\nlot of stuff\n","There is a\n lot of stuff","Stuff is everywhere\n\n\n\n around here. Clean it\n up")
I'd like to get rid of the \n in the middle of the sentence with the output being:
There is a lot of stuff
There is a lot of stuff\n
There is a lot of stuff
Stuff is everywhere around here. Clean it up
I have tried:
gsub("([a-z]\\s*)\n+(\\s*[a-z])", "\\1 \\2", mytext)
but it gives the output:
[1] "There is a lot of stuff" "There is a lot of stuff"
[3] "There is a lot of stuff" "Stuff is everywhere\n\n\n around here. Clean it up"
I don't seem to be able to get rid of the mid sentence \n when there are multiples of them. Using the greedy operator with \n gives me odd results.
You may use
gsub("(?:\\h*\\R)++(?!\\z)\\h*", " ", mytext, perl=TRUE)
See the regex demo and the R demo online.
Details
(?:\\h*\\R)++ - 1 or more occurrences (matched possessively thanks to the ++ quantifier, so that no backtracking can occur into the non-capturing group pattern) of:
    \\h* - 0 or more horizontal whitespaces
    \\R - any line break sequence
(?!\\z) - not at the very end of the string
\\h* - 0 or more horizontal whitespaces
Since it is a PCRE pattern, perl=TRUE is required.
I think we can use negative lookahead regex.
gsub('\n(?!$)', ' ', mytext, perl = TRUE)
#[1] "There is a lot of stuff"  "There is a lot of stuff\n"
#[3] "There is a  lot of stuff" "Stuff is everywhere     around here. Clean it  up"
This replaces every \n with a single space, except one at the very end of the string. Spaces that were already next to a \n are kept, so a run of line breaks leaves a run of spaces.
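The two answers differ in how they treat runs of line breaks and neighbouring spaces. A quick side-by-side check, using the question's mytext and the two patterns as given (the names collapsed and one_to_one are illustrative):

```r
mytext <- c("There is a\nlot of stuff", "There is a\nlot of stuff\n",
            "There is a\n lot of stuff",
            "Stuff is everywhere\n\n\n\n around here. Clean it\n up")

# Possessive pattern: a whole run of line breaks, plus the horizontal
# whitespace around it, becomes one space.
collapsed <- gsub("(?:\\h*\\R)++(?!\\z)\\h*", " ", mytext, perl = TRUE)

# Lookahead pattern: each non-final "\n" becomes its own space,
# and spaces that were already there are kept.
one_to_one <- gsub("\n(?!$)", " ", mytext, perl = TRUE)

print(collapsed)
print(one_to_one)

# TRUE where both approaches give strings of the same length
print(nchar(collapsed) == nchar(one_to_one))
```

If single spaces are wanted, the possessive pattern is the closer fit; the lookahead pattern would need a follow-up step to squeeze repeated spaces.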

How do you remove an isolated number from a string in R?

This is a silly question, but I can't seem to find a solution in R online. I am trying to remove an isolated number from a long string. For example, I would like to remove the number 27198 from the sentence below.
x <- "hello3 my name 27198 is 5joey"
I tried the following:
gsub("[0-9]","",x)
Which results in:
"hello my name is joey"
But I want:
"hello3 my name is 5joey"
This seems really simple, but I am not well versed with regular expressions. Thanks for your help!
We can put a word boundary (\\b) on both sides of one or more digits ([0-9]+):
gsub("\\b[0-9]+\\b", "", x)
#[1] "hello3 my name is 5joey"
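One detail worth checking: removing only the digits leaves the spaces on both sides of the number in place. The sketch below shows a variant (an assumption about the desired output, not part of the answer) that also consumes the whitespace before the number.

```r
x <- "hello3 my name 27198 is 5joey"

# Digits only: the spaces on both sides of 27198 stay, giving a double space.
digits_only <- gsub("\\b[0-9]+\\b", "", x, perl = TRUE)

# Variant: \\s* also removes the whitespace in front of the isolated number.
# A number at the very start of the string would still leave a leading space.
with_space <- gsub("\\s*\\b[0-9]+\\b", "", x, perl = TRUE)

print(digits_only)
print(with_space)
```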

How to delete only (anyword).com in Regex?

I'd like to match the following
My best email gmail.com
email com
email.com
to become
My best email
email com
*nothing*
Specifically, I'm using Regex for R, so I know there are different rules for escaping certain characters. I'm very new to Regex, but so far I have
\ .*(com)
which turns the same input into
My
But this code does not work for instances where there are no spaces like the third example, and removes everything past the first space of a line if the line has a ".com"
Use the following solution:
x <- c("My best email gmail.com","email com", "email.com", "smail.com text here")
trimws(gsub("\\S+\\.com\\b", "", x))
## => [1] "My best email" "email com" "" "text here"
See the R demo.
The \\S+\\.com\\b pattern matches 1+ non-whitespace chars followed by a literal .com followed by the word boundary.
The trimws function will trim all the resulting strings (as, e.g. with "smail.com text here", when a space will remain after smail.com removal).
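To see why the dot has to be escaped, compare the pattern above with a version where . is left as a wildcard. This is only an illustrative check on the question's three inputs; the names escaped and unescaped are made up for the example.

```r
x <- c("My best email gmail.com", "email com", "email.com")

# Escaped dot: only a literal ".com" is removed.
escaped <- trimws(gsub("\\S+\\.com\\b", "", x, perl = TRUE))

# Unescaped dot: "." matches any character, including the space in
# "email com", so that line is wiped out as well.
unescaped <- trimws(gsub("\\S+.com\\b", "", x, perl = TRUE))

print(escaped)
print(unescaped)
```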
Note that the TRE regex engine does not support shorthand character classes inside bracket expressions.
