I'm trying to remove the hyphen that splits a word inside a string. For example, the word "example" in "for exam- ple this".
a <- "for exam- ple this"
How could I join them?
I have tried to remove the hyphen using this command:
str_replace_all(a, "-", "")
But I got this back:
"for exam ple this"
It does not return the word united. I have also tried this:
str_replace_all(a, "- ", "") but I get nothing.
Therefore I have thought of first removing the white space after the hyphen to get the following
"for exam-ple this"
and then eliminating the hyphen.
Can you help me?
Here is an option with sub where we match the - followed by zero or more spaces (\\s*) and replace with -
sub("-\\s*", "-", a)
#[1] "for exam-ple this"
sub only replaces the first match. If the string contains several hyphens followed by spaces and all of them should be fixed, use gsub instead
gsub("-\\s*", "-", a)
str_replace_all(a, "- ", "-")
If you are just trying to remove the whitespace after a symbol then Ricardo's answer is sufficient. If you want to remove an unknown amount of whitespace after a hyphen consider
str_replace_all(a, "- +", "-")
#[1] "for exam-ple this"
b <- "for exam- ple this"
str_replace_all(b, "- +", "-")
#[1] "for exam-ple this"
EDIT --- Explanation
The "+" is a regular-expression quantifier: it tells R to match the preceding character (or group/set) one or more times. You can find out more in R's regular expression documentation (?regex).
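To make the effect of the quantifier concrete, here is a small base-R sketch. It uses gsub so that no package is needed, and the string b with several spaces is a made-up variant of the question's input. The last call replaces the hyphen and the spaces with an empty string, which is one way to join the word outright, assuming every hyphen followed by spaces in the text marks a broken word.

```r
a <- "for exam- ple this"
b <- "for exam-   ple this"   # hypothetical input: three spaces after the hyphen

# "- " has no quantifier, so exactly one space after the hyphen is removed
one_space <- gsub("- ", "-", b)
one_space
# [1] "for exam-  ple this"

# "- +" removes the hyphen's whole run of trailing spaces
all_spaces <- gsub("- +", "-", b)
all_spaces
# [1] "for exam-ple this"

# Replacing with "" drops the hyphen as well and joins the two halves
joined <- gsub("- +", "", a)
joined
# [1] "for example this"
```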
Related
I am trying to clean some garbage out of some text. While doing this, I am assuming that any word that has a letter (any letter) repeated three or more times is garbage - and I want to remove it.
I've come up with this:
gsub(pattern = "[a-zA-Z]\\1\\1", replacement = "", string)
in which string is the character vector, but this doesn't work. Everything else I've tried might find the pattern, but it just removes the pattern, leaving a mess. I'm trying to remove the whole word with the pattern in it.
Any ideas?
You need
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
stringr::str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
See an R demo:
string <- "This is a baaaad unnnnecessary short word"
gsub("\\s*[[:alpha:]]*([[:alpha:]])\\1{2}[[:alpha:]]*", "", string)
gsub("\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "", string, perl=TRUE)
library(stringr)
str_replace_all(string, "\\s*\\p{L}*(\\p{L})\\1{2}\\p{L}*", "")
All yielding [1] "This is a short word".
See the regex demo. Regex details:
\s* - zero or more whitespaces
\p{L}* / [[:alpha:]]* - zero or more letters
(\p{L}) - Capturing group 1: any single letter
\1{2} - two occurrences of the same value as in Group 1
\p{L}* / [[:alpha:]]* - zero or more letters.
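If a single regex feels hard to maintain, the same idea can be sketched in two steps: split the string into words, then drop every word in which some letter occurs three times in a row. This is only an alternative sketch, not the answer's method; the names words and has_triple are made up for illustration, and it assumes words are separated by whitespace.

```r
string <- "This is a baaaad unnnnecessary short word"

# Split on runs of whitespace to get one element per word
words <- strsplit(string, "\\s+")[[1]]

# TRUE for words where a letter is followed by two more copies of itself
has_triple <- grepl("([[:alpha:]])\\1{2}", words, perl = TRUE)

# Keep the remaining words and glue them back together
cleaned <- paste(words[!has_triple], collapse = " ")
cleaned
# [1] "This is a short word"
```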
You need to assign a "capture group" to the [.] class by wrapping it in parens, since the \\1 needs something to reference:
gsub("([a-zA-Z])\\1\\1", "", "aabbbccdddee")
# [1] "aaccee"
Updated on OP comment:
Try this:
gsub("([A-Z]&|[a-z])\\1{2, }", "", "AAA")
[1] "AAA"
gsub("([A-Z]&|[a-z])\\1{2, }", "", "aabbbccdddee")
[1] "aaccee"
r2evans example with different regex:
gsub("(\\w)\\1{2, }", "", "aabbbccdddee")
[1] "aaccee"
I have a column with the following kind of strings:
Author
Achebe, Chinua. Ach
Akbar, M.j. Akb
Alanahally, Srikrishna. Ala
These are author names followed by a three-letter abbreviation. I only want to remove the abbreviation at the end: if I simply removed every three-letter word, author names like Jon and Sam would be deleted too. The abbreviation usually comes after two spaces. I wrote the following regex to detect and delete these:
data$Author <- gsub("\\s([A-Z]+[A-Za-z]{2})\\s", "", data$Author)
What do I change in this so that I can delete these three letter abbreviations?
Your \\s at the end of the pattern requires a space after the three letters, and none of the samples here have one. Options:
You cannot remove it or replace it with \\s*, as those will be too permissive (and break things):
gsub("\\s([A-Z]+[A-Za-z]{2})", "", authors)
# [1] "Achebe,nua. " "Akbar, M.j. " "Alanahally,krishna. "
add a word-boundary \\b
gsub("\\s([A-Z]+[A-Za-z]{2})\\b", "", authors)
# [1] "Achebe, Chinua. " "Akbar, M.j. " "Alanahally, Srikrishna. "
change to end-of-string
gsub("\\s([A-Z]+[A-Za-z]{2})$", "", authors)
# [1] "Achebe, Chinua. " "Akbar, M.j. " "Alanahally, Srikrishna. "
(though I think this might be over-constraining).
Data
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala") # two spaces before each abbreviation, as described in the question
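The outputs above still end in a space. A possible variant of the end-of-string option, sketched here under the assumption that the abbreviation is always one capital letter plus two more letters at the very end of the string, also consumes the whitespace in front of it. The vector is redefined so the snippet runs on its own, with two spaces before each abbreviation as the question describes.

```r
authors <- c("Achebe, Chinua.  Ach",
             "Akbar, M.j.  Akb",
             "Alanahally, Srikrishna.  Ala")

# \\s+ takes all whitespace before the abbreviation; $ pins it to the end,
# so three-letter first names earlier in the string are left alone
cleaned <- sub("\\s+[A-Z][A-Za-z]{2}$", "", authors)
cleaned
# [1] "Achebe, Chinua."         "Akbar, M.j."             "Alanahally, Srikrishna."
```

Note that an entry with no abbreviation whose last word happens to be a capitalized three-letter name would lose that name, so this is only safe when every row carries the abbreviation.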
Try this with find-and-replace syntax:
Find: \s?\s\w+$
Replace: leave it empty
How (in R) would I remove any word in a string containing punctuation, keeping words without?
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"
desired <- "I a see works not"
Here is an approach using gsub which seems to work:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)
[1] "I a see works not"
This approach is to use the following regex pattern:
[A-Za-z]* matches zero or more leading letters
[^A-Za-z ] then matches one symbol (any character that is neither a letter nor a space)
\\S* followed by zero or more non-whitespace characters
\\s* followed by zero or more whitespace characters
Then, we just replace with empty string, to remove the words having one or more symbols in them.
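To check what the pattern actually consumes, one can list the matched pieces before deleting them. This is a small inspection sketch using base R's gregexpr and regmatches; the variable names pattern, removed and cleaned are made up for illustration.

```r
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
pattern <- "[A-Za-z]*[^A-Za-z ]\\S*\\s*"

# Every piece the pattern matches, i.e. what gsub() would delete:
# the seven words containing a symbol, each with its trailing space
removed <- regmatches(test.string, gregexpr(pattern, test.string))[[1]]
length(removed)
# [1] 7

# Deleting those pieces leaves only the symbol-free words
cleaned <- gsub(pattern, "", test.string)
cleaned
# [1] "I a see works not"
```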
You can use this regex
(?<=\\s|^)[a-z0-9]+(?=\\s|$)
(?<=\\s|^) - positive lookbehind, match should be preceded by space or start of string.
[a-z0-9]+ - matches one or more lowercase letters or digits (use [A-Za-z0-9]+ to include uppercase letters),
(?=\\s|$) - Match must be followed by space or end of string
Tim's edit:
This answer uses a whitelist approach, namely identify all words which the OP does want to retain in his output. We can try matching using the regex pattern given above, and then connect the vector of matches using paste:
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")
[1] "I a see works not"
Here are a couple more approaches.
First approach:
library(stringr)
library(magrittr) # provides %>%
str_split(test.string, " ", n=Inf) %>% # split the line into words
unlist %>%
.[!str_detect(., "\\W|\r")] %>% # keep words without punctuation or \r
paste(.,collapse=" ") # collapse the words to get the line
Second approach:
str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>%
unlist %>%
trimws() %>%
paste(., collapse=" ")
^\\w+ - words having only [a-zA-Z0-9_] and also is start of the string
\\s\\w+\\s - words with [a-zA-Z0-9_] and having space before and after the word
\\w+$ - words having [a-zA-Z0-9_] and also is the end of the string
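For comparison, the split-and-filter idea of the first approach can also be sketched in base R, without stringr or the pipe. The names words and keep are made up for illustration, and the sketch assumes words are separated by single spaces, as in the question's string.

```r
test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"

# One element per space-separated word
words <- strsplit(test.string, " ", fixed = TRUE)[[1]]

# Keep non-empty words that contain nothing but letters and digits
keep <- nzchar(words) & !grepl("[^A-Za-z0-9]", words)

result <- paste(words[keep], collapse = " ")
result
# [1] "I a see works not"
```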
I want to write a regex in R to remove all words of a string containing numbers.
For example, given first_text, I want to obtain second_text:
first_text = "a2c if3 clean 001mn10 string asw21"
second_text = "clean string"
Try with gsub
trimws(gsub("\\w*[0-9]+\\w*\\s*", "", first_text))
#[1] "clean string"
It is easier to select words with no numbers than to select and delete words with numbers:
> library(stringr)
> str1 <- "a2c if3 clean 001mn10 string asw21"
> paste(unlist(str_extract_all(str1, "(\\b[^\\s\\d]+\\b)")), collapse = " ")
[1] "clean string"
Note:
Backslashes have to be escaped in R to work properly, hence double backslashes
\b is word boundary
\s is white space
\d is digit character
a caret (^) at the start of square brackets negates the set: match any character that is not listed
"+" after the character group inside [] means "1 or more" occurrences of those (non white space and non digit) characters
Just another alternative using gsub
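trimws(gsub("[^\\s]*[0-9][^\\s]*\\s*", "", first_text, perl=TRUE)) # the trailing \\s* also removes the space after each deleted word, so no double spaces remain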
#[1] "clean string"
A bit longer than some of the answers but very tractable is to first convert the string to a vector of words, then check word by word if there are any numbers and use standard R subsetting.
first_text_vec <- strsplit(first_text, " ")[[1]]
first_text_vec
[1] "a2c" "if3" "clean" "001mn10" "string" "asw21"
paste(first_text_vec[!grepl("[0-9]", first_text_vec)], collapse = " ")
[1] "clean string"
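The word boundaries in the "select words with no numbers" answer above do real work, and a short comparison may make that clearer. The sketch below uses base R with perl = TRUE instead of stringr, which is an assumption of convenience; the helper extract_all is a hypothetical name introduced only for this illustration.

```r
first_text <- "a2c if3 clean 001mn10 string asw21"

# Hypothetical helper: return every match of a pattern in one string
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# Without \\b the pattern also picks up letter fragments of the mixed words
no_boundary <- extract_all("[^\\s\\d]+", first_text)
no_boundary
# [1] "a"      "c"      "if"     "clean"  "mn"     "string" "asw"

# With \\b on both sides only whole digit-free words qualify
with_boundary <- extract_all("\\b[^\\s\\d]+\\b", first_text)
paste(with_boundary, collapse = " ")
# [1] "clean string"
```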
I have a vector of strings and I want to separate the last sentence from each string in R.
Sentences may end with full stops (.) or even exclamation marks (!). Hence I am confused as to how to separate the last sentence from a string in R.
You can use strsplit to get the last sentence from each string, as shown below (note that splitting on the punctuation marks removes them from the result):
## paragraph <- "Your vector here"
result <- strsplit(paragraph, "\\.|\\!|\\?")
last.sentences <- sapply(result, function(x) {
trimws((x[length(x)]))
})
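If the closing punctuation should be kept, one possible variant is to split on the whitespace that follows a punctuation mark rather than on the mark itself. This is a sketch under assumptions: sentences are separated by whitespace, no element is an empty string, and the example vector u and the names parts and last_sentences are made up for illustration.

```r
u <- c("This is a sentence. And another sentence!",
       "Single sentences are not a problem.",
       "You know what? Multiple marks don't break my solution!!")

# Split at whitespace preceded by ., ! or ? (lookbehind needs perl = TRUE);
# the punctuation itself is not part of the split point, so it is kept
parts <- strsplit(trimws(u), "(?<=[.!?])\\s+", perl = TRUE)

# Take the last piece of each element
last_sentences <- vapply(parts, function(x) x[length(x)], character(1))
last_sentences
# [1] "And another sentence!"
# [2] "Single sentences are not a problem."
# [3] "Multiple marks don't break my solution!!"
```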
Provided that your input is clean enough (in particular, that there are spaces between the sentences), you can use:
sub(".*(\\.|\\?|\\!) ", "", trimws(yourvector))
It finds the longest substring ending with a punctuation mark and a space and removes it.
I added trimws just in case there are trailing spaces in some of your strings.
Example:
u <- c("This is a sentence. And another sentence!",
"By default R regexes are greedy. So only the last sentence is kept. You see ? ",
"Single sentences are not a problem.",
"What if there are no spaces between sentences?It won't work.",
"You know what? Multiple marks don't break my solution!!",
"But if they are separated by spaces, they do ! ! !")
sub(".*(\\.|\\?|\\!) ", "", trimws(u))
# [1] "And another sentence!"
# [2] "You see ?"
# [3] "Single sentences are not a problem."
# [4] "What if there are no spaces between sentences?It won't work."
# [5] "Multiple marks don't break my solution!!"
# [6] "!"
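This regex anchors to the end of the string with $ and allows an optional '.' or '!' at the end. At the front, the positive lookbehind (?<=...) requires the match to start right after a ". " or "! " (the end of the prior sentence) without including it in the match; the ^ alternative covers a string that holds a single sentence. The body is [^.!]+ rather than .+ because str_extract returns the leftmost match, and with .+ the ^ alternative would let the match start at the beginning and run over the whole string. It therefore assumes the last sentence contains no interior '.' or '!'.
s <- "Sentences may end with full stops(.) or even exclamatory marks(!). Hence i am confused as to how to separate the last sentence from a string in R."
library (stringr)
str_extract(s, "(?<=(\\.\\s|\\!\\s|^))[^.!]+(\\.|\\!)?$")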
yields
# [1] "Hence i am confused as to how to separate the last sentence from a string in R."