Let's say I have part of the texts in a document like this:
"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
I want to remove the "(API)", and it needs to be done before
corpus <- tm_map(corpus, removePunctuation)
After removing "(API)",it should be look like this below :
"Other segment comprised of our active pharmaceutical ingredient business,which..."
I searched for a long time, but all I could find were answers about removing only the parentheses; I don't want the word inside them to appear in the corpus either.
I really need someone to give me a hint, please.
You can use a smarter tokeniser, such as that in the quanteda package, where the removePunct = TRUE will remove the parentheses automatically.
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" ## "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
Added:
If you want to tokenise the text first, then you need to lapply a gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this would work:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
If you simply want to remove the parenthetical expressions as in the question, then you don't need either tm or quanteda:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
The longer regular expression also catches cases in which the parenthetical expression ends a sentence or is followed by additional punctuation such as a comma.
If it's only single words, how about (untested):
removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})
tm_map(corpus, removeBracketed)
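Here is a minimal end-to-end check of that idea (the txt string and the one-document corpus below are constructed for illustration; they are not from the original code):
library(tm)
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
corpus <- VCorpus(VectorSource(txt))
removeBracketed <- content_transformer(function(x, ...) gsub("\\(\\w+\\)", "", x))
corpus <- tm_map(corpus, removeBracketed)    # drop "(API)" before touching other punctuation
corpus <- tm_map(corpus, removePunctuation)  # then remove the remaining punctuation
content(corpus[[1]])
## [1] "Other segment comprised of our active pharmaceutical ingredient  businesswhich"
Note the double space left where "(API)" was removed; tm_map(corpus, stripWhitespace) will collapse it if that matters downstream.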
Related
I am using the tm package in R to remove punctuation.
TextDoc <- tm_map(TextDoc, removePunctuation)
Is there a way I can remove punctuation only when it is attached to a letter/word rather than to a number?
E.g.
I want performance. --> performance
But I want 3.14 --> 3.14
Example of how I want the function to work:
wall, --> wall
expression. --> expression
ef. --> ef
A. --> A
name: --> name
:ok --> ok
91.8.10 --> 91.8.10
EDIT:
TextDoc is of the form:
You may also try gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = T), where text is your text vector. Explanation of the regex:
(?<!\\d) negative lookbehind for any digit character
[[:punct:]] searches for punctuation marks
(?=\\D) followed by positive lookahead for any non-digit character
? 0 or once
Check this for a regex demo.
text <- c("wall, 88.1", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = T)
#> [1] "wall 88.1" "expression" "ef" "ok" "A"
#> [6] "3.14" "91.8.10"
long_text <- "wall, 88.1 expression. ef. :ok A. 3.14 91.8.10"
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', long_text, perl = T)
#> [1] "wall 88.1 expression ef ok A 3.14 91.8.10"
Created on 2021-06-13 by the reprex package (v2.0.0)
I've completely revamped my answer based on your specification and Anil's answer below, which is much more widely applicable than what I originally had.
library(tm)
# Here we pretend that your texts are like this
text <- c("wall,", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10",
"w.a.ll, 6513.645+1646-5")
# and we create a corpus with them, like the one you show
corp <- Corpus(VectorSource(text))
# you create a function with any of the solutions that we've provided here
# I'm taking AnilGoyal's because it's better than my rushed purrr one.
my_remove_punct <- function(x) {
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', x, perl = T)
}
# pass the function to tm_map
new_corp <- tm_map(corp, my_remove_punct)
# Applying the function will give you a warning about dropping documents, but that is a bug in the tm package.
# We use this to confirm that the contents are indeed correct; the final lines of output are the vector returned by sapply(), showing all the documents together.
sapply(new_corp, print)
#> [1] "wall"
#> [1] "expression"
#> [1] "ef"
#> [1] "ok"
#> [1] "A"
#> [1] "3.14"
#> [1] "91.8.10"
#> [1] "wall 6513.645+1646-5"
#> [1] "wall" "expression" "ef"
#> [4] "ok" "A" "3.14"
#> [7] "91.8.10" "wall 6513.645+1646-5"
The warning you receive about "dropping documents" is spurious, as you can see by printing. An explanation is in this other SO question.
In the future, note that you can quickly get better answers by providing raw data, for example by applying dput() to your object: something like dput(TextDoc). If that is too much, you can subset it first.
I tried to make it less ugly, but here is my best shot:
library(data.table)
TextDoc <- data.table(text = c("wall",
"expression.",
"ef.",
"91.8.10",
"A.",
"name:",
":ok"))
TextDoc[grepl("[a-zA-Z]", text),
        text := unlist(tm_map(Corpus(VectorSource(as.vector(text))),
                              removePunctuation))[1:length(grepl("[a-zA-Z]", text))]]
Which gives us:
> TextDoc
text
1: wall
2: expression
3: ef
4: 91.8.10
5: A
6: name
7: ok
I have a large corpus of text in a vector of strings (approx. 700,000 strings). I'm trying to replace specific words/phrases within the corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.
I'm looking for an efficient way of solving the problem.
I can do it in a for loop, looping through each pattern + replacement, but it scales badly (around 3 days!).
I've also tried qdap::mgsub(), but it seems to scale badly as well.
txt <- c("this is a random sentence containing bca sk",
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns",
"this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c("#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt")
#either
txt2 <- qdap::mgsub(patterns, replacements, txt)
#or
for(i in 1:length(patterns)){
txt <- gsub(patterns[i], replacements[i], txt)
}
Both solutions scale badly for my data, with approx. 40,000 patterns/replacements and 700,000 text strings.
I figure there must be a more efficient way of doing this?
If you can tokenize the texts first, then vectorized replacement is much faster. It's also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.
Here's how to do all that in the quanteda package. The last line pastes the tokens back into a single "document" as a character vector, if that is what you want.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda_options(threads = 4)
txt <- c(
"this is a random sentence containing bca sk",
"another sentence with bc a but also with zqx tt",
"this sentence contains none of the patterns",
"this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
"#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt"
)
This will tokenize the texts and then use fast replacement of the hashed types, using a fixed pattern match (but you could have used valuetype = "regex" for regular expression matching). By wrapping patterns inside the phrase() function, you are telling tokens_replace() to look for token sequences rather than individual matches, so this solves the multi-word issue.
toks <- tokens(txt) %>%
tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this" "is" "a" "random" "sentence"
## [6] "containing" "bca" "sk"
##
## text2 :
## [1] "another" "sentence"
## [3] "with" "#a-specfic-tag-#bc a"
## [5] "but" "also"
## [7] "with" "#a-specfic-tag-#zqx tt"
##
## text3 :
## [1] "this" "sentence" "contains" "none" "of" "the"
## [7] "patterns"
##
## text4 :
## [1] "this" "sentence" "contains"
## [4] "only" "#a-specfic-tag-#bc a"
Finally if you really want to put this back into character format, then convert to a list of character types and then paste them together.
sapply(as.list(toks), paste, collapse = " ")
## text1
## "this is a random sentence containing bca sk"
## text2
## "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## text3
## "this sentence contains none of the patterns"
## text4
## "this sentence contains only #a-specfic-tag-#bc a"
You'll have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
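If you want a feel for the scaling before committing the full 700k strings, a rough timing comparison on an inflated sample is easy to set up. The sketch below is illustrative only: sample_txt and the repetition factor are invented here, and with only three patterns the absolute times mean little; the point is the shape of the test.
# inflate the toy example to something measurable (illustrative only)
sample_txt <- rep(txt, 1000)
# baseline: sequential gsub() loop over every pattern
system.time({
  out_loop <- sample_txt
  for (i in seq_along(patterns)) {
    out_loop <- gsub(patterns[i], replacements[i], out_loop)
  }
})
# quanteda: tokenize once, replace fixed multi-word patterns, paste back
system.time({
  out_tok <- tokens(sample_txt) %>%
    tokens_replace(phrase(patterns), replacements, valuetype = "fixed") %>%
    as.list() %>%
    sapply(paste, collapse = " ")
})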
Create a vector of all words in each phrase
txt1 = strsplit(txt, " ")
words = unlist(txt1)
Use match() to find the index of words to replace, and replace them
idx <- match(words, patterns)
words[!is.na(idx)] = replacements[idx[!is.na(idx)]]
Re-form the phrases and paste together
phrases = relist(words, txt1)
updt = sapply(phrases, paste, collapse = " ")
I guess this won't work if patterns can have more than one word...
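With the patterns in this question, that caveat applies immediately: every pattern spans two whitespace-separated tokens, so match() never finds a hit and updt comes back identical to txt. A quick check (using the txt, patterns, and replacements from the question):
identical(updt, txt)
## [1] TRUE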
Create a map between the old and new values
map <- setNames(replacements, patterns)
Create a pattern that contains all patterns in a single regular expression
pattern = paste0("(", paste0(patterns, collapse="|"), ")")
Find all matches, and extract them
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
Unlist, map, and relist the matches to their replacement values, and update the original vector
regmatches(txt, ridx) <- relist(map[unlist(m)], m)
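Printing txt after the update shows the multi-word patterns replaced in place (expected output, assuming the txt, patterns, and replacements from the question):
txt
## [1] "this is a random sentence containing bca sk"
## [2] "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## [3] "this sentence contains none of the patterns"
## [4] "this sentence contains only #a-specfic-tag-#bc a"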
I've used quanteda's textstat_collocation() to great satisfaction for extracting multi-word expressions (MWEs). Now I'm trying to extract all matches for a specific pattern, irrespective of their frequency.
My objective is to create a character vector by extracting featnames from a dfm() built with a regex pattern. I will then use this character vector in the "select" argument for building a dfm. I might also want to use this character vector to add to a dictionary I use as an ontology for building dfms at later stages of the pipeline.
The pattern is: "aged xx-xx" where x is a digit.
I tested the regex pattern "aged\s([0-9]{2}-[0-9]{2})" in an online regex tester and got the desired matches. But when I try it in R (adding an additional "\" before "\s"), I don't get any matches.
When I do:
txt <- c("In India, male smokers aged 20-45 perceive brandX positively.",
"In Spain, female buyers aged 30-39 don't purchase brandY.")
ageGroups <- dfm(txt, select = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)
I get:
character(0)
However, when I try:
ageGroups <- dfm(txt, select = "([0-9]{2}-[0-9]{2})", valuetype = "regex")
featnames(ageGroups)
I get:
[1] "20-45" "30-39"
It seems I'm unable to capture the white space in the regex. I've gone through many similar questions on SO, with perhaps this being the most relevant, but I still can't make my specific objective work.
I also tried:
tokens <- tokens(txt, remove_punct = FALSE, remove_numbers = FALSE, remove_symbols = FALSE)
tokensCompunded <- tokens_compound(tokens, pattern = "aged\\s([0-9]{2}-[0-9]{2})", valuetype = "regex")
attr(tokensCompunded, "types")
But I get all tokens back:
[1] "In" " " "India" "," "male" "smokers" "aged" "20-45" "perceive"
[10] "brandX" "positively" "." "Spain" "female" "buyers" "30-39" "don't" "purchase"
[19] "brandY"
I think there might be several other, more efficient approaches for extracting character vectors using regex (or glob) with quanteda, and I'm happy to learn new ways to use this amazing R package.
Thanks for your help!
Edit to original question:
This other question in SO has a similar requirement, i.e. detecting multi-word phrases using kwic objects, and can be further expanded to achieve the objectives stated above with the following addition:
kwicObject <- kwic(corpus, pattern = phrase("aged ([0-9]{2}-[0-9]{2})"), valuetype = "regex")
unique(kwicObject$keyword)
The problem here is that the target text and the multi-word pattern (which contains white space) are not being tokenised the same way. In your example, you have applied a regex for multiple tokens (which includes the whitespace separator) but the target for search has already been split into individual tokens.
We devised a solution to this, a function called phrase(). From ?pattern:
Whitespace is not privileged, so that in a character vector, white space is interpreted literally. If you wish to consider whitespace-separated elements as sequences of tokens, wrap the argument in phrase().
So in this case:
pat <- "aged [0-9]{2}-[0-9]{2}"
toks2 <- tokens_select(toks, pattern = phrase(pat), valuetype = "regex")
toks2
# tokens from 2 documents.
# text1 :
# [1] "aged" "20-45"
#
# text2 :
# [1] "aged" "30-39"
Here, we see that the selection worked, because the phrase() wrapper converted the pattern into a sequence of matches.
If you want to make these a single token, you can send the same pattern argument to tokens_compound():
toks3 <- tokens_compound(toks2, pattern = phrase(pat),
valuetype = "regex", concatenator = " ")
toks3
# tokens from 2 documents.
# text1 :
# [1] "aged 20-45"
#
# text2 :
# [1] "aged 30-39"
Finally, you can use that to construct a dfm, where each multi-word match is a feature. This cannot work unless you have first performed the concatenation at the tokens stage, since by definition a dfm has no order in its features.
dfm(toks3)
# Document-feature matrix of: 2 documents, 2 features (50% sparse).
# 2 x 2 sparse Matrix of class "dfm"
# features
# docs aged 20-45 aged 30-39
# text1 1 0
# text2 0 1
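From here, the character vector described in the question is just the feature names of that dfm; a short sketch, reusing the ageGroups name from the question:
ageGroups <- featnames(dfm(toks3))
ageGroups
# [1] "aged 20-45" "aged 30-39"
That vector can then be supplied to a later select step or added to a dictionary, as the question describes.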
You can change the regex pattern:
select = "aged.*([0-9]{2}-[0-9]{2})"
I'm trying to remove apostrophes from a Corpus, but only when they are the first character in a paragraph. I have seen posts about finding the first word in a sentence, but not a paragraph.
The reason I'm trying this is that I'm analyzing text: I want to strip all the punctuation but keep apostrophes and dashes that appear in the middle of words. To start, I did:
library(tm)
library(qdap)
#docs is any corpus
docs.test=tm_map(docs, PlainTextDocument)
docs.test=tm_map(docs.test, content_transformer(strip), char.keep=c("'","-"))
for(j in seq(docs.test))
{
docs[[j]] <- gsub(" \'", " ", docs[[j]])
}
This successfully removed all of the apostrophes except those at the start of a new line. To remove those, I have tried:
for(j in seq(docs.test))
{
docs[[j]] <- gsub("\r\'", " ", docs[[j]])
docs[[j]] <- gsub("\n\'", " ", docs[[j]])
docs[[j]] <- gsub("<p>\'", " ", docs[[j]])
docs[[j]] <- gsub("</p>\'", " ", docs[[j]])
}
In general, I think it would be useful to find a way to extract the first word of a paragraph. For my specific issue, I'm trying it just as a way to get at those apostrophes. I'm currently using the packages qdap and tm, but I'm open to using more.
Any ideas?
Thank you!
You didn't supply a test example, but here is an approach that keeps intra-word apostrophes and hyphens. It's in a different package, but as the example at the end shows, the result is easily coerced to a regular list if you need it to be:
require(quanteda)
txt <- c(d1 = "\"This\" is quoted.",
d2 = "Here are hypen-words.",
d3 = "Example: 'single' quotes.",
d4 = "Possessive plurals' usage.")
(toks <- tokens(txt, removePunct = TRUE, removeHyphens = FALSE))
## tokens from 4 documents.
## d1 :
## [1] "This" "is" "quoted"
##
## d2 :
## [1] "quanteda's" "hypen-words"
##
## d3 :
## [1] "Example" "single" "quotes"
##
## d4 :
## [1] "Possessive" "plurals" "usage"
You can get back to a list this way, and of course back to documents if you need to, by sapply()ing a paste(x, collapse = " "), etc.
as.list(toks)
## $d1
## [1] "This" "is" "quoted"
##
## $d2
## [1] "quanteda's" "hypen-words"
##
## $d3
## [1] "Example" "single" "quotes"
##
## $d4
## [1] "Possessive" "plurals" "usage"
I would like to use the removeWords function with stopwords("english"), via corpus <- tm_map(corpus, removeWords, stopwords("english")), but there are some words, like "not" and other negations, that I'd like to keep.
Is it possible to use removeWords with stopwords("english") but exclude certain specified words from that list?
How could I prevent the removal of "not" for example?
(Secondary) Is it possible to apply this kind of exception list to all negations?
I'd rather not resort to creating my own custom list with only the words from that stoplist that I'm interested in.
You can create a custom list of stopwords by taking the difference between stopwords("en") and the list of words you want to exclude:
exceptions <- c("not")
my_stopwords <- setdiff(stopwords("en"), exceptions)
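A quick sanity check (assuming the stopwords("en") list used here, which does contain "not"):
"not" %in% stopwords("en")  # TRUE
"not" %in% my_stopwords     # FALSE, so it will survive removeWords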
If you want to exclude all the negations from the stopword list (so they are kept in the corpus), you can grep them from stopwords():
exceptions <- grep(pattern = "not|n't", x = stopwords(), value = TRUE)
# [1] "isn't" "aren't" "wasn't" "weren't" "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't"
# [11] "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't" "not"
my_stopwords <- setdiff(stopwords("en"), exceptions)
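Either way, you then pass the reduced list to removeWords in place of the full stopword list (a minimal sketch, assuming corpus is the tm corpus from the question):
corpus <- tm_map(corpus, removeWords, my_stopwords)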