I'm cleaning some text data and I've come across a problem with removing newlines. The data contains not just \n sequences in the text but also \n\n, as well as numbered newlines such as \n2 and \n\n2. The latter are my problem. How does one remove these using regex?
I'm working in R. Here is some sample text and what I've used, so far:
#string
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"
#code attempt
gsub("[\r\\n0-9]", '', string)
The problem with this regex code is that it removes the digits I want to keep and also matches the literal letter n.
I would like to have the following output:
"There is a square in the apartment. Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten."
I'm using regexr for reference.
Writing the pattern like this, [\r\\n0-9] matches either a carriage return, one of the characters \ or n, or a digit 0-9.
You could write the pattern matching 1 or more carriage returns or newlines, followed by optional digits:
[\r\n]+[0-9]*
Example:
string <- "There is a square in the apartment. \n\n4Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten.\n2"
gsub("[\r\n]+[0-9]*", '', string)
Output
[1] "There is a square in the apartment. Great laughs, which I hear from the other room. 4 laughs. Several. 9 times ten."
See an R demo.
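Note that this assumes your text contains real newline characters. If the data instead holds the two literal characters \ and n (as sometimes happens after an export), you'd need to escape the backslash itself in the pattern - a sketch under that assumption:
gsub("(\\\\n)+[0-9]*", '', "There is a square in the apartment. \\n\\n4Great laughs.")
[1] "There is a square in the apartment. Great laughs."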
Related
I am trying to remove the decimal points in decimal numbers in R. Please note I want to keep the full stop at the end of the sentence.
Example:
data= c("It's 6.00pm, and is late.")
I know that I have to use regex for this, but I am struggling. My desired output is:
"It's 6 00pm, and is late."
Thank you in advance.
Try this:
sub("(?<=\\d)\\.(?=\\d)", " ", data, perl = TRUE)
This solution uses a lookbehind (?<=...) and a lookahead (?=...) to assert that the period you wish to remove is enclosed by digits (thus avoiding a match on the period at the sentence end). If you have several such cases within a string, use gsub instead of sub.
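For example, with a hypothetical string containing two decimal numbers, the gsub variant replaces both while leaving the sentence-ending full stop alone:
data2 <- c("Prices rose from 1.25 to 2.50 today.")
gsub("(?<=\\d)\\.(?=\\d)", " ", data2, perl = TRUE)
[1] "Prices rose from 1 25 to 2 50 today."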
I suggest using a simple pattern to find the target text, then adding parentheses to identify the parts of the matching text that you want to retain.
# Test data
data <- c("It's 6.00pm, and is late.")
The target pattern is a literal dot with a string of digits before and after it. \\d+ matches one or more digits and \\. matches a literal dot. Testing the pattern to see if it works:
grepl("\\d+\\.\\d+", data)
Result
TRUE
If we wanted to eliminate the whole thing, we could do a simple replacement with an empty string. Testing whether this targets the correct text:
sub("\\d+\\.\\d+", "", data)
Result
"It's pm, and is late."
Instead, to discard only a section of the matched text, we can identify the parts we want to keep by surrounding them with parentheses. Once that's done, we can refer to the captured text in the replacement: \\1 refers to the first chunk of captured text and \\2 to the second, corresponding to the first and second sets of parentheses.
# pattern replacement
sub("(\\d+)\\.(\\d+)", "\\1\\2", data)
Result
[1] "It's 600pm, and is late."
This effectively removes the dot by omitting it from the replacement text.
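If you want the dot replaced by a space instead - which is the desired output shown above - the same capture groups work; just put a space between the backreferences:
sub("(\\d+)\\.(\\d+)", "\\1 \\2", data)
Result
[1] "It's 6 00pm, and is late."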
I have a big dataframe with news articles. I have noticed that some of the articles have two words connected by a dot, as the following example shows: The government.said it was important to quit. I will conduct some topic modelling, so I need to separate every single word.
This is the code I have used to separate those words
#String example
test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")
#Code to separate the words
test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))
#This is what I get
> test
[1] "i need to separate the words connected by dots however, I need to keep having the dots separating sentences"
As you can see, I deleted all the dots (periods) on the text. How could I get the following outcome:
"i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"
Final note
My dataframe is composed of 17,000 articles; all the text is lowercase. I just provided a small example of the issue I am having when trying to separate two words connected by a dot. Additionally, is there any way I can use strsplit on a list?
You may use
test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences. Look at http://google.com for s.0.m.e more details.")
# Replace each dot that is in between word characters
gsub("\\b\\.\\b", " ", test, perl=TRUE)
# Replace each dot that is in between letters
gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", test, perl=TRUE)
# Replace each dot that is in between word characters, but not in URLs
gsub("(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b", " ", test, perl=TRUE)
See the R demo online.
Output:
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s 0 m e more details."
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s.0.m e more details."
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google.com for s 0 m e more details."
Details
\b\.\b - a dot that is enclosed with word boundaries (i.e. the characters immediately before and after the . must be word characters: a letter, digit or underscore)
(?<=\p{L})\.(?=\p{L}) matches a dot that is immediately preceded and followed by a letter ((?<=\p{L}) is a positive lookbehind and (?=\p{L}) is a positive lookahead)
(?:ht|f)tps?://\S*(*SKIP)(*F)|\b\.\b matches http/https or ftp/ftps, then ://, and then any zero or more non-whitespace chars; (*SKIP)(*F) then discards that match and the engine continues searching from the position it had reached, so URLs are left intact while the \b\.\b alternative still matches the dots between word characters.
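Since gsub is vectorised over character vectors, you don't need to loop or strsplit article by article - you can apply whichever of the patterns above suits your data to the whole column at once. A sketch, where df and text are placeholder names for your dataframe and its article column:
df$text <- gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", df$text, perl = TRUE)
As for strsplit on a list: strsplit is itself vectorised over a character vector and returns a list with one element per input string, so it too can be applied to a whole column directly.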
What's the most elegant way to extract the keywords in a sentence of string?
I have a list of keywords from a CSV, and i want to predict exact match with keywords which is present in the string.
sapply(keywords, regexpr, String, ignore.case=FALSE)
I used the above code, but it gives approximate match too.
The problem is that regular expressions don't know what 'words' are - they look for patterns. Consider the following keywords and string.
keywords = c("bad","good","sad","mad")
string = "Some good people live in the badlands which is maddeningly close to the sad harbor."
Here "bad" matches "badlands" because the pattern of "bad" is found in the first three characters. Same with "mad" and "maddeningly".
sapply(keywords, regexpr, string, ignore.case=FALSE)
#> bad good sad mad
#> 30 6 73 48
So, we need to modify the pattern to make it detect what we really want. The problem is knowing what we really want. If we want a distinct word, then we can add boundaries around the keywords. As Andre noted in the comments, the \b in regex is a word boundary.
sapply(paste("\\b",keywords,"\\b",sep=""), regexpr, string, ignore.case=FALSE)
#> \\bbad\\b \\bgood\\b \\bsad\\b \\bmad\\b
#> -1 6 73 -1
Note, what I did was use the paste function to stick an escaped \b before and after each keyword. This returns a no-match code for 'bad' and 'mad' but finds the whole word versions of 'good' and 'sad'.
If you wanted hyphenated words handled as well, you'd need to modify the boundary-matching portion of the expression, as sketched below.
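A sketch of one such modification, assuming you want hyphenated words (e.g. "well-mad") treated as single words so that a keyword inside them does not count as a match: replace the \b boundaries with lookarounds that forbid a word character or hyphen on either side (lookarounds need perl = TRUE):
sapply(paste0("(?<![\\w-])", keywords, "(?![\\w-])"), regexpr, string, perl = TRUE)
For the example string this gives the same result as the \b version (-1 for 'bad' and 'mad', the positions of 'good' and 'sad'), but it would also return -1 if a keyword only occurred as part of a hyphenated word.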
I have a string which has alphanumeric characters, special characters and non UTF-8 characters. I want to strip the special and non utf-8 characters.
Here's what I've tried:
gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
However, this removes the special characters (punctuation + non-UTF-8), but the output has no spaces.
gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")
The result has spaces but there are still non utf8 characters present.
Any workaround?
For the sample string above, output should be:
Sample string here
You could use the classes [:alnum:] and [:space:] for this:
sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Alternatively you can use PCRE codes to refer to specific character sets:
gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"
Both cases clearly illustrate that the characters that are still there are considered letters. The E, B, H and P inside are letters too, so the condition you're replacing on is not correct: you don't want to keep all letters, you just want to keep A-Z, a-z and 0-9:
gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
This still contains the EBHP. If you really just want to keep a section that contains only letters and numbers, you should use the reverse logic: select what you want and replace everything but that using backreferences:
gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "
Or, if you want to find a string, even not bound by spaces, use the word boundary \\b instead:
gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"
What happens here:
.*? fits anything (.) at least 0 times (*) but ungreedy (?). This means that gsub will try to fit the smallest amount possible by this piece.
everything between () will be stored and can be referred to in the replacement by \\1
\\b indicates a word boundary
This is followed at least once (+) by any character that's A-Z, a-z, 0-9 or a space. You have to spell the ranges out that way: a sloppier range like A-z would also include the characters [ \ ] ^ _ ` that sit between the uppercase and lowercase letters in the character table, and a generic letter class would keep the accented letters (which are valid UTF-8, by the way!)
after that sequence, .* fits anything at least zero times to remove the rest of the string.
the backreference \\1 in combination with .* in the regex, will make sure only the required part remains in the output.
stringr may use a different regex engine that supports additional POSIX character classes. The :ascii: names the class, which must be enclosed in square brackets, [:ascii:], within the outer square brackets. The [^ indicates negation of the match.
library(stringr)
str_replace_all("�+ Sample string here =�{�>E�BH�P<]�{�>", "[^[:ascii:]]", "")
result in
[1] "+ Sample string here ={>EBHP<]{>"
I need help with regex in R.
I have a bunch of strings each of which has a structure similar to this one:
mytext <- "\"Dimitri. It has absolutely no meaning,\": Allow me to him|\"realize that\": Poor Alice! It |\"HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes\": |\"same for the Dislikes. Thank you very much for completing this\": ME.' 'You!' sai"
Notice that this string contains substrings within "" followed by a ":" and some text without quotation marks - until we encounter a "|" - then a new quotation mark appears, etc.
Notice also that at the very end there is text after a ":" - but at the VERY end there is no "|"
My objective is to completely eliminate all text starting with any ":" (and INCLUDING ":") and until the next "|" (but "|" has to stay). I also need to eliminate all text that comes after the very last ":"
Finally (that's more of a bonus) - I want to get rid of all "\" characters and all quotation marks - because in the final solution I need to have "clean text": A bunch of strings separated only by "|" characters.
Is it possible?
Here is my awkward first attempt:
gsub('\\:.*?\\|', '', mytext)
This method uses three passes of sub/gsub.
sub("\\|$", "", gsub("[\\\\\"]", "", gsub(":.*?(\\||$)", "|", mytext)))
[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"
The first pass strips out the text between ":" and "|" inclusive and replaces it with "|". The second pass removes the backslashes and double quotes, and the third pass removes the "|" at the end.
With a single gsub you can match the text after a : (including the :), as long as it doesn't contain a pipe: :[^|]*. This also matches the case at the end of the string. You can additionally match double quotes by adding another alternative after the alternation character (|): [\"]
gsub(":[^|]*|[\"]", "", mytext)
#[1] "Dimitri. It has absolutely no meaning,|realize that|HIGHLIGHT A LOT OF THINGS. Our team is small and if each person highlights only 1 or 2 things, the counts of Likes|same for the Dislikes. Thank you very much for completing this"