Remove spaces between line breaks only - r

I have the following example string with line breaks "\n" and spaces " ":
a <- "\n \n \n \nTEST TEST\n"
I would like to remove spaces (" ") that directly follow line breaks ("\n"), but not the spaces that follow other text (like "TEST" in my toy example). My desired output is therefore:
"\n\n\n\nTEST TEST\n"
I tried stringr's str_remove_all and str_replace_all but didn't succeed, as those seem to have problems with the adjacent occurrences of the line breaks in this case. This is the closest I got:
str_replace_all(a, "\n[ ]*\n", "\n\n")
I spent hours on this (probably ridiculously easy) problem, any help is thus highly appreciated!

gsub("\n *", "\n", a)
or
str_replace_all(a, "\n *", "\n") # with stringr package
will get you the desired output "\n\n\n\nTEST TEST\n"
EDIT: For space(s) only between blank lines
Note that the above will also remove spaces that appear at the start of non-blank lines, e.g., if the string was "\n TEST TEST \n".
@bobble bubble's suggestion of including (?=\n) in the search pattern (i.e., "\n *(?=\n)") restricts the match to spaces that sit between two line breaks, i.e., on otherwise blank lines. (Thank you, bobble bubble)
gsub("\n *(?=\n)", "\n", a, perl=TRUE)
or
str_replace_all(a, "\n *(?=\n)", "\n") # with stringr package
(?=(regex)) is a positive lookahead assertion. In "\n *(?=\n)", it means that the asserted regex \n must appear directly after \n * (a line break followed by zero or more spaces), but it is not consumed as part of the match. Because the asserted \n is not part of the match, it is not replaced by gsub or stringr::str_replace_all.
To illustrate this more clearly, only the "b" that appears before "bu" is replaced in the following example:
str_replace_all("bobblebbubble", "b(?=bu)", "_")
#[1] "bobble_bubble"
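To make the difference between the two patterns concrete, here is a small base-R sketch. The sample string b is made up for illustration (it is not from the question): it has one line containing only a space, followed by a non-blank line with a leading space.

```r
# Hypothetical input: one whitespace-only line, then an indented non-blank line
b <- "\n \n TEST\n"

# Removes every run of spaces that follows a line break,
# including the indentation in front of "TEST"
all_leading <- gsub("\n *", "\n", b)

# Removes spaces only when another line break follows them,
# so the indentation in front of "TEST" is kept
blank_only <- gsub("\n *(?=\n)", "\n", b, perl = TRUE)

print(all_leading)  # [1] "\n\nTEST\n"
print(blank_only)   # [1] "\n\n TEST\n"
```

Which of the two is appropriate depends on whether leading indentation on non-blank lines should survive.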

I believe you can remove the content of any line that consists solely of horizontal whitespace. With stringr, you can use
library(stringr)
a <- "\n \n \n \nTEST TEST\n"
stringr::str_replace_all(a, "(?m)^\\h+$", "")
See the R demo and the regex demo. Details:
(?m) - a multiline modifier that makes ^ match at the start of any line and $ match at the end of any line
^ - line start
\h+ - one or more horizontal whitespace chars
$ - line end.
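The same pattern can be tried without stringr. The sketch below assumes base R's gsub with perl = TRUE (PCRE) as a stand-in for the stringr call; the second input string is hypothetical and only there to show that indentation on non-blank lines is left alone.

```r
a <- "\n \n \n \nTEST TEST\n"

# Base-R counterpart of the stringr call (assumption: PCRE via perl = TRUE).
# (?m) lets ^ and $ match at every line; \\h+ is one or more horizontal
# whitespace characters, so only whitespace-only lines are emptied.
cleaned <- gsub("(?m)^\\h+$", "", a, perl = TRUE)
print(cleaned)
# [1] "\n\n\n\nTEST TEST\n"

# Hypothetical input: a line made of a tab and a space is emptied,
# while the two leading spaces in front of "TEST" are kept
indented <- gsub("(?m)^\\h+$", "", "\n\t \n  TEST\n", perl = TRUE)
print(indented)
# [1] "\n\n  TEST\n"
```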

Related

Find match in multiple line breaks

I need to extract KPRMILL from the text below. The pattern is: find the first : followed by a single space, then take the text after it up to the first line break (\n).
x <- "\n \n NSE: KPRMILL\n \n \n | \n BSE: 532889\n \n \n | INDUSTRY : TEXTILES\n | SECTOR : TEXTILES, APPARELS & ACCESSORIES\n "
I am able to solve this with a combination of str_extract() and str_replace(), but I am looking for a more efficient solution.
x %>% str_extract("[.*?:]\\s+(.*?\\n)") %>% str_replace("(:\\s+)(.*)\\n","\\2")
You can use regex lookaround to find text before and/or after your pattern without including them in the returned text. (?<=abc)qu+x means "find and return qu+x when it is preceded by abc"; similarly, qu+x(?=abc) means "find and return qu+x when it is followed by abc".
str_extract(x, "(?<=: )(.*)(?=\n)")
# [1] "KPRMILL"
I'm inferring that you only want the first of the patterns in your x, since there are four. If you want the others, use str_extract_all:
str_extract_all(x, "(?<=: )(.*)(?=\n)")
# [[1]]
# [1] "KPRMILL" "532889"
# [3] "TEXTILES" "TEXTILES, APPARELS & ACCESSORIES"
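For readers without stringr, the lookbehind idea can be sketched in base R with regmatches. The input below is a shortened, hypothetical version of the string from the question, and [^\n]* is swapped in for the .*(?=\n) part of the pattern: it takes everything up to, but not including, the next line break.

```r
# Shortened, hypothetical version of the input from the question
x <- "\n NSE: KPRMILL\n | \n BSE: 532889\n | INDUSTRY : TEXTILES\n"

# (?<=: ) requires ": " immediately before the match without returning it;
# [^\n]* then takes everything up to (not including) the next line break
pattern <- "(?<=: )[^\n]*"

# regexpr finds the first match, gregexpr finds all of them
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

print(first_match)  # "KPRMILL"
print(all_matches)  # "KPRMILL", "532889", "TEXTILES"
```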

Remove whitespace after a symbol (hyphen) in R

I'm trying to remove the hyphen that splits a word inside a string. For example, the word "example" in: "for exam- ple this".
a <- "for exam- ple this"
How could I join them?
I have tried to remove the hyphen using this command:
str_replace_all(a, "-", "")
But I got this back:
"for exam ple this"
It does not return the word joined. I have also tried this:
str_replace_all(a, "- ", "")
but I get nothing.
So I thought of first removing the whitespace after the hyphen to get the following
"for exam-ple this"
and then eliminating the hyphen.
Can you help me?
Here is an option with sub where we match the - followed by zero or more spaces (\\s*) and replace with -
sub("-\\s*", "-", a)
#[1] "for exam-ple this"
If the string contains several such hyphens, use gsub instead, which replaces every match rather than only the first
gsub("-\\s*", "-", a)
str_replace_all(a, "- ", "-")
If you are just trying to remove the whitespace after a symbol then Ricardo's answer is sufficient. If you want to remove an unknown amount of whitespace after a hyphen consider
str_replace_all(a, "- +", "-")
#[1] "for exam-ple this"
b <- "for exam- ple this"
str_replace_all(b, "- +", "-")
#[1] "for exam-ple this"
EDIT --- Explanation
The "+" is a regular-expression quantifier that tells R how to match the string. Specifically, "+" means to match the preceding character (or group/set) one or more times. You can find out more about regular expressions here.
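Since the stated goal is to end up with the word joined, the two steps (drop the whitespace, then drop the hyphen) can also be done in one pass. This is a minimal base-R sketch, assuming every hyphen that is followed by whitespace marks a split word; the second input is hypothetical.

```r
a <- "for exam- ple this"

# Drop the hyphen together with the whitespace that follows it
joined <- gsub("-\\s+", "", a)
print(joined)
# [1] "for example this"

# Hypothetical input with several spaces after the hyphen
joined_wide <- gsub("-\\s+", "", "for exam-    ple this")
print(joined_wide)
# [1] "for example this"
```

Hyphens that are not followed by whitespace (as in "well-known") are untouched, but a hyphen that is meant literally and happens to be followed by a space would be removed as well.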

Remove all words in string containing punctuation (R)

How (in R) would I remove any word in a string containing punctuation, keeping words without?
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"
desired <- "I a see works not"
Here is an approach using gsub which seems to work:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)
[1] "I a see works not"
This approach is to use the following regex pattern:
[A-Za-z]* match zero or more leading letters
[^A-Za-z ] then match a single symbol (any character that is neither a letter nor a space)
\\S* followed by zero or more non-whitespace characters
\\s* followed by any amount of whitespace
Then, we just replace with empty string, to remove the words having one or more symbols in them.
You can use this regex
(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)
(?<=\\s|^) - positive lookbehind: the match must be preceded by whitespace or the start of the string
[A-Za-z0-9]+ - match one or more letters or digits
(?=\\s|$) - positive lookahead: the match must be followed by whitespace or the end of the string
Demo
Tim's edit:
This answer uses a whitelist approach, namely identify all words which the OP does want to retain in his output. We can try matching using the regex pattern given above, and then connect the vector of matches using paste:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")
[1] "I a see works not"
Here are a couple more approaches
First approach:
library(stringr) # also provides the %>% pipe
str_split(test.string, " ", n=Inf) %>% # splitting the line into words
  unlist %>%
  .[!str_detect(., "\\W|\r")] %>% # keep words without punctuation or \r
  paste(.,collapse=" ") # collapse the words to get the line
Second approach:
str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>%
unlist %>%
trimws() %>%
paste(., collapse=" ")
^\\w+ - words having only [a-zA-Z0-9_] and also is start of the string
\\s\\w+\\s - words with [a-zA-Z0-9_] and having space before and after the word
\\w+$ - words having [a-zA-Z0-9_] and also is the end of the string
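The split-and-filter idea from the first approach can also be sketched in base R. The version below is an illustration rather than any answerer's code: it uses a whitelist test (letters and digits only) on each token. Because every token is checked on its own, two clean words in a row do not compete for the space between them.

```r
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"

# Split on single spaces, then keep only tokens made up entirely of
# letters and digits (so "\r" and punctuation both disqualify a token)
words <- strsplit(test.string, " ", fixed = TRUE)[[1]]
clean <- words[grepl("^[A-Za-z0-9]+$", words)]
result <- paste(clean, collapse = " ")
print(result)
# [1] "I a see works not"
```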

separating last sentence from a string in R

I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!). Hence I am confused as to how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string, as shown:
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
  trimws(x[length(x)])
})
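Note that splitting on the punctuation itself discards it, so the extracted sentence loses its final "." or "!". If the terminator should be kept, one possible variation (a sketch, assuming sentences are separated by whitespace) is to split on the whitespace that follows the punctuation, using a lookbehind with perl = TRUE. The input vector here is hypothetical.

```r
# Hypothetical input vector
paragraph <- c("This is a sentence. And another sentence!",
               "Single sentences are not a problem.")

# Split on the whitespace that follows ., ! or ? ; the lookbehind keeps the
# punctuation attached to the sentence instead of discarding it
pieces <- strsplit(paragraph, "(?<=[.!?])\\s+", perl = TRUE)

# Take the last piece of each element
last.sentences <- vapply(pieces, function(x) x[length(x)], character(1))
print(last.sentences)
# "And another sentence!" and "Single sentences are not a problem."
```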
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
This regex anchors to the end of the string with $ and allows an optional "." or "!" at the end. At the front, it looks for ". " or "! " as the end of the prior sentence. The positive lookbehind (?<=...) ensures that this preceding "." or "!" is not included in the match. It also provides for a single sentence by using ^ for the beginning.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library (stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^)).+(\\.|\\!)?$")
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."

remove all line breaks (enter symbols) from the string using R

How to remove all line breaks (enter symbols) from the string?
my_string <- "foo\nbar\rbaz\r\nquux"
I've tried gsub("\n", "", my_string), but it doesn't work, because a newline (\n) and a carriage return (\r) are different characters.
You need to strip \r and \n to remove carriage returns and new lines.
x <- "foo\nbar\rbaz\r\nquux"
gsub("[\r\n]", "", x)
## [1] "foobarbazquux"
Or
library(stringr)
str_replace_all(x, "[\r\n]" , "")
## [1] "foobarbazquux"
I just wanted to note here that if you want to insert spaces where you found newlines the best option is to use the following:
gsub("\r?\n|\r", " ", x)
which will insert only one space regardless of whether the text contains \r\n, \n or \r.
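The difference from the character-class pattern only shows up when the replacement is non-empty. A small base-R comparison on the string from the question:

```r
x <- "foo\nbar\rbaz\r\nquux"

# Character class: \r and \n are replaced one at a time,
# so the "\r\n" pair becomes two spaces
per_char <- gsub("[\r\n]", " ", x)

# Alternation: "\r\n" is matched as a unit, so each line break,
# whatever its style, becomes exactly one space
per_break <- gsub("\r?\n|\r", " ", x)

print(per_char)   # [1] "foo bar baz  quux"
print(per_break)  # [1] "foo bar baz quux"
```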
Have had success with the following (note that this pattern matches only \n, so any \r characters are left in place):
gsub("\\\n", "", x)
With stringr::str_remove_all
library(stringr)
str_remove_all(my_string, "[\r\n]")
# [1] "foobarbazquux"

Resources