only remove punctuation for words not numbers - r

I am using the tm package in R to remove punctuation.
TextDoc <- tm_map(TextDoc, removePunctuation)
Is there a way to remove punctuation only when it is attached to a letter/word rather than a number?
E.g.
I want performance. --> performance
But I want 3.14 --> 3.14
Example of how I want the function to work:
wall, --> wall
expression. --> expression
ef. --> ef
A. --> A
name: --> name
:ok --> ok
91.8.10 --> 91.8.10
EDIT:
TextDoc is of the form:

You may also try gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = T), where text is your text vector. Explanation of the regex:
(?<!\\d) negative lookbehind: the preceding character must not be a digit
[[:punct:]] matches a punctuation mark
(?=\\D) positive lookahead for any non-digit character
? makes that lookahead optional (0 or 1 times), so in practice only the lookbehind restricts the match
check this for regex demo
text <- c("wall, 88.1", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = T)
#> [1] "wall 88.1" "expression" "ef" "ok" "A"
#> [6] "3.14" "91.8.10"
long_text <- "wall, 88.1 expression. ef. :ok A. 3.14 91.8.10"
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', long_text, perl = T)
#> [1] "wall 88.1 expression ef ok A 3.14 91.8.10"
Created on 2021-06-13 by the reprex package (v2.0.0)
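Because the lookahead in the pattern above is optional, a period that ends a sentence right after a number (as in "pi is 3.14.") is kept. A minimal sketch of a stricter variant, assuming you want to keep punctuation only when it has a digit on both sides; the helper name strip_word_punct is made up for illustration and is not part of tm:

```r
# Hypothetical helper: drop punctuation unless it sits between two digits.
strip_word_punct <- function(x) {
  # First alternative: punctuation that is not preceded by a digit.
  # Second alternative: punctuation that is not followed by a digit.
  gsub("(?<!\\d)[[:punct:]]|[[:punct:]](?!\\d)", "", x, perl = TRUE)
}

text <- c("wall, 88.1", "expression.", ":ok", "3.14", "91.8.10", "pi is 3.14.")
result <- strip_word_punct(text)
print(result)
```

The trade-off is that a leading sign or decimal point (for example "-5" or ".5") is still stripped, since nothing numeric precedes it.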

I've completely revamped my answer based on your specification and Anil's answer below, which is much more widely applicable than what I originally had.
library(tm)
# Here we pretend that your texts are like this
text <- c("wall,", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10",
"w.a.ll, 6513.645+1646-5")
# and we create a corpus with them, like the one you show
corp <- Corpus(VectorSource(text))
# you create a function with any of the solutions that we've provided here
# I'm taking AnilGoyal's because it's better than my rushed purrr one.
my_remove_punct <- function(x) {
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', x, perl = T)
}
# pass the function to tm_map
new_corp <- tm_map(corp, my_remove_punct)
# Applying the function will give you a warning about dropping documents, but that is a bug in the tm package.
# We use this to confirm that the contents are indeed correct. The last line is a print-out of all the individual documents together.
sapply(new_corp, print)
#> [1] "wall"
#> [1] "expression"
#> [1] "ef"
#> [1] "ok"
#> [1] "A"
#> [1] "3.14"
#> [1] "91.8.10"
#> [1] "wall 6513.645+1646-5"
#> [1] "wall" "expression" "ef"
#> [4] "ok" "A" "3.14"
#> [7] "91.8.10" "wall 6513.645+1646-5"
The warning you receive about "dropping documents" is not real as you can see by printing. An explanation is in this other SO question.
In the future, note that you can quickly get better answers by providing raw data with the function dput to your object. Something like dput(TextDoc). If it is too much, you can subset it.

Tried to make it less ugly but here is my best shot:
library(data.table)
library(tm) # needed for Corpus(), VectorSource(), tm_map() and removePunctuation()
TextDoc <- data.table(text = c("wall",
"expression.",
"ef.",
"91.8.10",
"A.",
"name:",
":ok"))
TextDoc[grepl("[a-zA-Z]", text),
text := unlist(tm_map(Corpus(VectorSource(as.vector(text))), removePunctuation))[1:length(grepl("[a-zA-Z]", text))]]
Which gives us:
> TextDoc
text
1: wall
2: expression
3: ef
4: 91.8.10
5: A
6: name
7: ok

Related

Replace quanteda tokens through regex

I would like to explicitly replace specific tokens defined in objects of class tokens of the package quanteda. I fail to replicate a standard approach that works well with stringr.
The objective is to replace all tokens of the form "XXXof" in two tokens of the form c("XXX", "of").
Please have a look at the minimal example below:
suppressPackageStartupMessages(library(quanteda))
library(stringr)
text = "It was a beautiful day down to the coastof California."
# I would solve this with stringr as follows:
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."
# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )
# I want to replace "coastof" with the two tokens "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "It" "was" "a" "beautiful" "day"
#> [6] "down" "to" "the" "\\1 \\2" "California"
#> [11] "."
Any workaround?
Created on 2021-03-16 by the reprex package (v1.0.0)
You can use a mixture of quanteda and stringr to build a list of the words that need separating and their separated forms, then use tokens_replace() to perform the replacement. This has the advantage of letting you curate the list before applying it, so you can verify that you haven't caught replacements you don't want to apply.
suppressPackageStartupMessages(library("quanteda"))
toks <- tokens("It was a beautiful day down to the coastof California.")
keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
# nested call instead of %>%, so no pipe operator has to be attached
vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")
keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"
tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
## [1] "It" "was" "a" "beautiful" "day"
## [6] "down" "to" "the" "coast" "of"
## [11] "California" "."
Created on 2021-03-16 by the reprex package (v1.0.0)
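The curation step mentioned above can be sketched in base R on a plain character vector. This is only an illustration under assumptions: the token vector and the decision to exclude "roof" are invented, and the pattern is anchored with ^ and $ (unlike the one in the answer) so that words such as "office" are not picked up.

```r
# Invented example tokens; in practice these would come from as.character(toks).
token_vec <- c("the", "coastof", "office", "roof", "of", "California")

# Candidates: tokens that end in "of" and have at least one character before it.
candidates <- grep("^.+of$", token_vec, value = TRUE)

# Manual curation: drop genuine words that merely happen to end in "of".
keys <- setdiff(candidates, "roof")

# Split each remaining key into its two parts.
vals <- strsplit(sub("^(.+)(of)$", "\\1 \\2", keys), " ")
print(keys)
print(vals)
```

The resulting keys and vals have the same shape as the ones passed to tokens_replace(toks, keys, vals) in the answer.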

R Extract a word from a character string using pattern matching

I need some help with pattern matching in R. I need to extract a whole word that starts with a common prefix, from a long character string. The word I want to extract always starts with the same prefix (AA), but the word is not the same length, and does not occur in the same location of the string.
mytext1 <- as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH") # Return AA12345
mytext2 <- as.character("ELEPHANT AA100 KOALA POLAR.BEAR") # Want to return AA100
mytext3 <- as.character("CROCODILE DRAGON.FLY ANTELOPE") # Want to return NA
As an extension of this, what if there were two different patterns to match and I wanted to return a character string with both?
mytext4 <- as.character("TULIP AA999 DAISY BB123")
# Pattern matching to AA and BB
# Want to return AA999 BB123
Any help with this would be greatly appreciated :)
Here is a stringr approach. The regular expression matches AA preceded by a space or the start of the string, (?<=^| ), then as few characters as possible, .*?, up to the next space or the end of the string, (?=$| ). You can combine all the strings into one vector and a vector will be returned. If you want all matches for each string, use str_extract_all instead of str_extract and you get a list with one vector per string. To match more than one prefix, use alternation inside a group, (AA|BB), as shown.
mytext <- c(
as.character("HORSE MONKEY LIZARD AA12345 SWORDFISH"), # Return AA12345
as.character("ELEPHANT AA100 KOALA POLAR.BEAR"), # Want to return AA100,
as.character("AA3273 ELEPHANT KOALA POLAR.BEAR"), # Want to return AA3273
as.character("ELEPHANT KOALA POLAR.BEAR AA5785"), # Want to return AA5785
as.character("ELEPHANT KOALA POLAR.BEAR"), # Want to return nothing
as.character("ELEPHANT AA12345 KOALA POLAR.BEAR AA5785") # Can return only AA12345 or both
)
library(stringr)
mytext %>% str_extract("(?<=^| )AA.*?(?=$| )")
#> [1] "AA12345" "AA100" "AA3273" "AA5785" NA "AA12345"
mytext %>% str_extract_all("(?<=^| )AA.*?(?=$| )")
#> [[1]]
#> [1] "AA12345"
#>
#> [[2]]
#> [1] "AA100"
#>
#> [[3]]
#> [1] "AA3273"
#>
#> [[4]]
#> [1] "AA5785"
#>
#> [[5]]
#> character(0)
#>
#> [[6]]
#> [1] "AA12345" "AA5785"
as.character("TULIP AA999 DAISY BB123") %>% str_extract_all("(?<=^| )(AA|BB).*?(?=$| )")
#> [[1]]
#> [1] "AA999" "BB123"
Created on 2018-04-29 by the reprex package (v0.2.0).
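You can get a base R solution using sub. Note that sub returns its input unchanged when nothing matches, so mytext3 would come back as the whole string rather than NA.
sub(".*\\b(AA\\w*).*", "\\1", mytext1)
[1] "AA12345"
sub(".*\\b(AA\\w*).*", "\\1", mytext2)
[1] "AA100"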
I like keeping things in base R whenever possible, and there is already a solution for this. What you are really looking for is the regmatches() function. From its documentation:
Extract or replace matched substrings from match data obtained by regexpr, gregexpr or regexec.
To solve your specific problem
matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext1, perl=T)
regmatches(mytext1, matches)
> [1] "AA12345"
When there is no match:
matches = regexpr("(?<=^| )AA.*?(?=$| )", mytext3, perl=T)
regmatches(mytext3, matches)
> character(0)
If you want to avoid character(0), put your strings in a vector and run them all at once; strings without a match are then simply left out of the result.
alltext = c(mytext1, mytext2, mytext3)
matches = regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T)
regmatches(alltext, matches)
> [1] "AA12345" "AA100"
And finally, if you want a one-liner
regmatches(alltext, regexpr("(?<=^| )AA.*?(?=$| )", alltext, perl=T))
> [1] "AA12345" "AA100"
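Since regmatches() drops the strings that have no match, the result above no longer lines up with the input. A minimal base R sketch that returns NA for those strings and also covers the two-prefix case from the question; the helper name extract_first is made up, and the pattern uses \\b and \\w* instead of the lookarounds above, which assumes the wanted word contains only letters, digits and underscores:

```r
mytext <- c("HORSE MONKEY LIZARD AA12345 SWORDFISH",
            "ELEPHANT AA100 KOALA POLAR.BEAR",
            "CROCODILE DRAGON.FLY ANTELOPE")

# Hypothetical helper: first match per string, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- as.vector(m) != -1        # regexpr() reports -1 for "no match"
  out[hit] <- regmatches(x, m)     # regmatches() returns the hits only
  out
}

first_aa <- extract_first(mytext, "\\bAA\\w*")
print(first_aa)

# Two prefixes at once: collect every match and paste them together.
mytext4 <- "TULIP AA999 DAISY BB123"
all_hits <- regmatches(mytext4, gregexpr("\\b(AA|BB)\\w*", mytext4, perl = TRUE))[[1]]
both <- paste(all_hits, collapse = " ")
print(both)
```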

Using str_view with a list of words in R

I want to use str_view from stringr in R to find all the words that start with "y" and all the words that end with "x." I have a list of words generated by Corpora, but whenever I launch the code, it returns a blank view.
Common_words<-corpora("words/common")
#start with y
start_with_y <- str_view(Common_words, "^[y]", match = TRUE)
start_with_y
#finish with x
str_view(Common_words, "$[x]", match = TRUE)
Also, I would like to find the words that are only 3 letters long, but I have no ideas so far.
I'd say this is not about programming with stringr but learning some regex. Here are some sites I have found useful for learning:
http://www.regular-expressions.info/tutorial.html
http://www.rexegg.com/
https://www.debuggex.com/
Here \\w, the shorthand class for word characters (i.e., [A-Za-z0-9_]), is useful with quantifiers (+ and {3} in these 2 cases). PS: I use stringi here because stringr uses it in the backend anyway. Just skipping the middle man.
x <- c("I like yax because the rock to the max!",
"I yonx & yix to pick up stix.")
library(stringi)
stri_extract_all_regex(x, 'y\\w+x')
stri_extract_all_regex(x, '\\b\\w{3}\\b')
## > stri_extract_all_regex(x, 'y\\w+x')
## [[1]]
## [1] "yax"
##
## [[2]]
## [1] "yonx" "yix"
## > stri_extract_all_regex(x, '\\b\\w{3}\\b')
## [[1]]
## [1] "yax" "the" "the" "max"
##
## [[2]]
## [1] "yix"
EDIT Seems like these may be of use too:
## Just y starting words
stri_extract_all_regex(x, '\\by\\w+\\b')
## Just x ending words
stri_extract_all_regex(x, '\\b\\w+x\\b')
## Words with 4 or more characters (change the 4 for any other n)
stri_extract_all_regex(x, '\\b\\w{4,}\\b')
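For the original question, where each element is already a single word, anchors are enough: ^ marks the start of the string and $ the end, so the question's "$[x]" can never match because it asks for an x after the end. A small base R sketch, using an invented word vector in place of the Corpora list:

```r
# Invented stand-in for the word list from the question.
words <- c("yes", "box", "year", "fix", "tax", "young", "relax", "a", "onyx")

starts_with_y <- grep("^y", words, value = TRUE)              # y at the start
ends_with_x   <- grep("x$", words, value = TRUE)              # x at the end
three_letters <- grep("^[[:alpha:]]{3}$", words, value = TRUE) # exactly 3 letters

print(starts_with_y)
print(ends_with_x)
print(three_letters)
```

The same three patterns should work as the pattern argument of str_view, since they use only basic regex syntax.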

Multiline text extraction in R with stringr

I have a column in my dataframe which has free text in it.
I would like to extract the text after INDICATIONS FOR EXAMINATION and before the next capitalized line. In the example below the result would be 'Anaemia'
INDICATIONS FOR EXAMINATION
Anaemia
PROCEDURE PERFORMED
Gastroscopy (OGD)
I am having some trouble as I'm using stringr and I can't seem to get multiline matches.
I have been using:
EoE$IndicationsFroExamination<-str_extract(EoE$Endo_ResultText, '(?<=INDICATIONS FOR EXAMINATION).*?[A-Z]+')
It requires a little digging. You can use the regex() modifier function.
Use the multiline argument to switch on multiline fitting:
str_extract_all("a\nb\nc", "^.")
# [[1]]
# [1] "a"
str_extract_all("a\nb\nc", regex("^.", multiline = TRUE))
# [[1]]
# [1] "a" "b" "c"
Please also be aware of the dotall argument, which makes "." match line breaks as well, so that ".*" can span several lines:
str_extract_all("a\nb\nc", "a.")
# [[1]]
# character(0)
str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
# [[1]]
# [1] "a\n"
These are documented in stringi::stri_opts_regex(), which stringr::regex() passes arguments to.
I made the regular expression a bit more generic so that it matches all occurrences, and used the str_extract_all function from stringr (str here is the string holding the report text):
matches <- str_extract_all(str, "(?<=[A-Z]\n)([^\n]*)")
Which, given the string you provided, should return:
[[1]]
[1] "Anaemia" "Gastroscopy (OGD)"
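To get only the line that follows INDICATIONS FOR EXAMINATION, as asked in the question, the heading itself can go into the lookbehind. A minimal base R sketch, assuming the report uses plain \n line endings and the wanted text sits on a single line; the variable names are made up:

```r
report <- "INDICATIONS FOR EXAMINATION\nAnaemia\nPROCEDURE PERFORMED\nGastroscopy (OGD)"

# Lookbehind for the heading plus its line break, then everything up to the
# next line break.
pattern <- "(?<=INDICATIONS FOR EXAMINATION\\n)[^\\n]*"
indication <- regmatches(report, regexpr(pattern, report, perl = TRUE))
print(indication)
```

The same pattern string should also be usable with str_extract on the data frame column, which would return NA for rows without the heading.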

How to remove parentheses and the words inside them with the tm package?

Let's say I have part of the texts in a document like this:
"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
I want to remove the "(API)", and it needs to be done before
corpus <- tm_map(corpus, removePunctuation)
After removing "(API)", it should look like this:
"Other segment comprised of our active pharmaceutical ingredient business,which..."
I searched for a long time, but all I could find were answers about removing only the parentheses; I don't want the word inside them to appear in the corpus either.
I would really appreciate a hint.
You can use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE removes the parentheses automatically.
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
Added:
If you want to tokenise the text first, then you need to lapply a gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this would work:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[!grepl("^\\(.*\\)$", toks)] # logical index, so nothing is lost when no token matches
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
If you simply want to remove the parenthetical expressions as in the question, then you don't need either tm or quanteda:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
The longer regular expression also catches cases in which the parenthetical expression ends a sentence or is followed by additional punctuation such as a comma.
If it's only single words, how about (untested):
removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})
tm_map(corpus, removeBracketed)
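If the parenthetical can contain more than one word, a character class that stops at the closing parenthesis is a small generalisation of the \\w+ version. A minimal sketch in base R; the function name remove_parenthetical is made up, and the pattern assumes parentheses are not nested:

```r
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

# Hypothetical helper: remove "(...)" together with the whitespace before it.
remove_parenthetical <- function(x) {
  gsub("\\s*\\([^)]*\\)", "", x)
}

cleaned <- remove_parenthetical(txt)
print(cleaned)
print(remove_parenthetical("ingredient (active API), business"))
```

Inside tm, the helper could presumably be wrapped the same way as above, that is tm_map(corpus, content_transformer(remove_parenthetical)), before removePunctuation is applied.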

Resources