I'm trying to remove apostrophes from a Corpus, but only when they are the first character in a paragraph. I have seen posts about finding the first word in a sentence, but not a paragraph.
The reason I'm trying this is because I'm analyzing text. I want to strip all the punctuation, but leave apostrophes and dashes only in the middle of words. To start this, I did:
library(tm)
library(qdap)
#docs is any corpus
docs.test=tm_map(docs, PlainTextDocument)
docs.test=tm_map(docs.test, content_transformer(strip), char.keep=c("'","-"))
for (j in seq(docs.test))
{
  docs[[j]] <- gsub(" \'", " ", docs[[j]])
}
This successfully removed all of the apostrophes except those at the start of new lines. To remove those, I have tried:
for (j in seq(docs.test))
{
  docs[[j]] <- gsub("\r\'", " ", docs[[j]])
  docs[[j]] <- gsub("\n\'", " ", docs[[j]])
  docs[[j]] <- gsub("<p>\'", " ", docs[[j]])
  docs[[j]] <- gsub("</p>\'", " ", docs[[j]])
}
In general, I think it would be useful to find a way to extract the first word of a paragraph. For my specific issue, I'm trying it just as a way to get at those apostrophes. I'm currently using the packages qdap and tm, but am open to using more.
Any ideas?
Thank you!
You didn't supply a test example, but here is an approach that keeps intra-word apostrophes and hyphens. It's in a different package, but as the example at the end shows, the result is easily coerced to a regular list if you need it to be:
require(quanteda)
txt <- c(d1 = "\"This\" is quoted.",
         d2 = "Here are hyphen-words.",
         d3 = "Example: 'single' quotes.",
         d4 = "Possessive plurals' usage.")
(toks <- tokens(txt, removePunct = TRUE, removeHyphens = FALSE))
## tokens from 4 documents.
## d1 :
## [1] "This" "is" "quoted"
##
## d2 :
## [1] "quanteda's" "hypen-words"
##
## d3 :
## [1] "Example" "single" "quotes"
##
## d4 :
## [1] "Possessive" "plurals" "usage"
You can get back to a list this way, and of course back to documents if you need to, by sapply()ing a paste(x, collapse = " "), etc.
as.list(toks)
## $d1
## [1] "This" "is" "quoted"
##
## $d2
## [1] "quanteda's" "hypen-words"
##
## $d3
## [1] "Example" "single" "quotes"
##
## $d4
## [1] "Possessive" "plurals" "usage"
I wonder if you can change how sentences are formed. Instead of using punctuation to delimit sentences, I would like a new row / new line to form the sentence.
This is a very minimal question, so I'm going to have to guess here what you intend, but I'm guessing that you want to segment your documents into lines, rather than sentences. There are two ways to do this: a new corpus where each line is a document, or a new tokens object where each "token" is a line.
Getting both is a matter of using the *_segment() functions. Here are two ways, with some sample text I will create where each line is a "sentence".
library("quanteda")
## Package version: 2.0.0
txt <- c(
d1 = "Sentence one.\nSentence two is on this line.\nLine three",
d2 = "This is a single sentence."
)
cat(txt)
## Sentence one.
## Sentence two is on this line.
## Line three This is a single sentence.
To make this into tokens, we use char_segment() with a newline as the segmentation pattern, then coerce the result into a list and then into tokens:
# as tokens
char_segment(txt, pattern = "\n", remove_pattern = FALSE) %>%
as.list() %>%
as.tokens()
## Tokens consisting of 4 documents.
## d1.1 :
## [1] "Sentence one."
##
## d1.2 :
## [1] "Sentence two is on this line."
##
## d1.3 :
## [1] "Line three"
##
## d2.1 :
## [1] "This is a single sentence."
If you want to make each of the lines into a "document" that can be segmented further, then use corpus_segment() after constructing a corpus from the txt object:
# as documents
corpus(txt) %>%
corpus_segment(pattern = "\n", extract_pattern = FALSE)
## Corpus consisting of 4 documents.
## d1.1 :
## "Sentence one."
##
## d1.2 :
## "Sentence two is on this line."
##
## d1.3 :
## "Line three"
##
## d2.1 :
## "This is a single sentence."
I have a large corpus of text in a vector of strings (approx. 700,000 strings). I'm trying to replace specific words/phrases within the corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.
I'm looking for an efficient way of solving the problem.
I can do it in a for loop, looping through each pattern + replacement, but it scales badly (3 days or so!).
I've also tried qdap::mgsub(), but it seems to scale badly as well:
txt <- c("this is a random sentence containing bca sk",
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns",
"this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c("#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt")
#either
txt2 <- qdap::mgsub(patterns, replacements, txt)
#or
for(i in 1:length(patterns)){
  txt <- gsub(patterns[i], replacements[i], txt)
}
Both solutions scale badly for my data with approx. 40,000 patterns/replacements and 700,000 txt strings.
I figure there must be a more efficient way of doing this?
If you can tokenize the texts first, then vectorized replacement is much faster. It's also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.
Here's how to do all that in the quanteda package. The last line pastes the tokens back into a single "document" as a character vector, if that is what you want.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda_options(threads = 4)
txt <- c(
"this is a random sentence containing bca sk",
"another sentence with bc a but also with zqx tt",
"this sentence contains none of the patterns",
"this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
"#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt"
)
This will tokenize the texts and then use fast replacement of the hashed types, using a fixed pattern match (but you could have used valuetype = "regex" for regular expression matching). By wrapping the patterns inside the phrase() function, you are telling tokens_replace() to look for token sequences rather than individual matches, so this solves the multi-word issue.
toks <- tokens(txt) %>%
  tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this" "is" "a" "random" "sentence"
## [6] "containing" "bca" "sk"
##
## text2 :
## [1] "another" "sentence"
## [3] "with" "#a-specfic-tag-#bc a"
## [5] "but" "also"
## [7] "with" "#a-specfic-tag-#zqx tt"
##
## text3 :
## [1] "this" "sentence" "contains" "none" "of" "the"
## [7] "patterns"
##
## text4 :
## [1] "this" "sentence" "contains"
## [4] "only" "#a-specfic-tag-#bc a"
Finally if you really want to put this back into character format, then convert to a list of character types and then paste them together.
sapply(as.list(toks), paste, collapse = " ")
## text1
## "this is a random sentence containing bca sk"
## text2
## "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## text3
## "this sentence contains none of the patterns"
## text4
## "this sentence contains only #a-specfic-tag-#bc a"
You'll have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
Create a vector of all words in each phrase
txt1 = strsplit(txt, " ")
words = unlist(txt1)
Use match() to find the index of words to replace, and replace them
idx <- match(words, patterns)
words[!is.na(idx)] = replacements[idx[!is.na(idx)]]
Re-form the phrases and paste together
phrases = relist(words, txt1)
updt = sapply(phrases, paste, collapse = " ")
I guess this won't work if patterns can have more than one word...
Create a map between the old and new values
map <- setNames(replacements, patterns)
Create a pattern that contains all patterns in a single regular expression
pattern = paste0("(", paste0(patterns, collapse="|"), ")")
Find all matches, and extract them
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
Unlist, map, and relist the matches to their replacement values, and update the original vector
regmatches(txt, ridx) <- relist(map[unlist(m)], m)
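For reference, here is a minimal sketch (my addition, not part of the answer above) that wraps those map/gregexpr/regmatches steps into a single helper; the name multi_replace is made up, and the patterns are still treated as regular expressions:
multi_replace <- function(txt, patterns, replacements) {
  # map each pattern to its replacement by name
  map <- setNames(replacements, patterns)
  # one alternation pattern covering all patterns
  pattern <- paste0("(", paste0(patterns, collapse = "|"), ")")
  ridx <- gregexpr(pattern, txt)
  m <- regmatches(txt, ridx)
  # look up each match and write the replacements back in place
  regmatches(txt, ridx) <- relist(map[unlist(m)], m)
  txt
}
txt <- multi_replace(txt, patterns, replacements)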
I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
How can I search a string for a word that matches, and replace it?
The expected result is the following example:
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case I would be replacing ful with full in a1.
Take a look at the hunspell package. As the comments have already suggested, your problem is much more difficult than it seems, unless you already have a dictionary of misspelled words and their correct spelling.
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
So even in your example, would you want to replace ful with full, or with one of the many other options here?
The package does let you use your own dictionary. Let's say you're doing that, or at least you're happy with the first returned suggestion.
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
But, as the other comments and answers have pointed out, you do need to be careful with the word showing up within other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
                paste("\\b",
                      hunspell(a3)[[1]],
                      "\\b",
                      collapse = "", sep = ""),
                hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Update
Based on your comment, you already have a dictionary, structured as a vector of badwords and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
Update 2
Addressing your comment, with your new example the issue is back to words showing up inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" will match "thin", "think", "thinking", etc. But if you bracket it with \\b, the pattern is anchored to word boundaries: \\bthin\\b will only match "thin".
Your example:
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create the vect.corpus as I describe in the previous update, and use in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
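Note that a bare pattern like "ful" will also match inside words such as "thankful", and a2 contains a leading space; a small variant (my addition, not part of this answer) anchors the pattern at word boundaries and trims the replacement:
# \\b keeps "thankful" untouched; trimws() drops the leading space in a2
gsub(pattern = "\\bful\\b", replacement = trimws(a2), x = a1)
# [1] " the classroom is full "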
Create a list of the corrections, then replace them using gsubfn, which is a generalization of gsub that can also take a list, function, or proto object as the replacement. The regular expression matches a word boundary, one or more word characters, and another word boundary. Each time a match is found, it is looked up in the list names and, if found, replaced with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "
For a kind of ordered replacement, you can try this
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For unordered replacement you can use approximate string matching (see stringdist::amatch). Here is an example:
a1 <- c("the classroome is ful")
a1
[1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus){
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance; see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
[1] "the classroom is full"
Let's say I have part of the texts in a document like this:
"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
I want to remove the "(API)", and it needs to be done before
corpus <- tm_map(corpus, removePunctuation)
After removing "(API)",it should be look like this below :
"Other segment comprised of our active pharmaceutical ingredient business,which..."
I searched for a long time, but all I could find were answers about removing the parentheses only; I don't want the word inside them to appear in the corpus either.
I really need someone to give me a hint, please.
You can use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE will remove the parentheses automatically.
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" ## "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
Added:
If you want to tokenise the text first, then you need to lapply a gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this would work:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
If you simply want to remove the parenthetical expressions as in the question, then you don't need either tm or quanteda:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
The longer regular expression also catches cases in which the parenthetical expression ends a sentence or is followed by additional punctuation such as a comma.
If it's only single words, how about (untested):
removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})
tm_map(corpus, removeBracketed)
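Since that snippet is marked as untested, here is a quick check of the same regex on a plain character string (using the example text from the question); note that it leaves a doubled space where "(API)" used to be:
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
gsub("\\(\\w+\\)", "", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient  business,which..."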
I am using the tm and lda packages in R to topic model a corpus of news articles. However, I am getting a "non-character" problem represented as "" that is messing up my topics. Here is my workflow:
text <- Corpus(VectorSource(d$text))
newtext <- lapply(text, tolower)
sw <- c(stopwords("english"), "ahram", "online", "egypt", "egypts", "egyptian")
newtext <- lapply(newtext, function(x) removePunctuation(x))
newtext <- lapply(newtext, function(x) removeWords(x, sw))
newtext <- lapply(newtext, function(x) removeNumbers(x))
newtext <- lapply(newtext, function(x) stripWhitespace(x))
d$processed <- unlist(newtext)
corpus <- lexicalize(d$processed)
k <- 40
result <- lda.collapsed.gibbs.sampler(corpus$documents, k, corpus$vocab, 500, .02, .05,
                                      compute.log.likelihood = TRUE, trace = 2L)
Unfortunately, when I train the lda model, everything looks great except the most frequently occurring word is "". I try to remedy this by removing it from the vocab as given below and reestimating the model just as above:
newtext <- lapply(newtext, function(x) removeWords(x, ""))
But, it's still there, as evidenced by:
str_split(newtext[[1]], " ")
[[1]]
[1] "" "body" "mohamed" "hassan"
[5] "cook" "found" "turkish" "search"
[9] "rescue" "teams" "rescued" "hospital"
[13] "rescue" "teams" "continued" "search"
[17] "missing" "body" "cook" "crew"
[21] "wereegyptians" "sudanese" "syrians" "hassan"
[25] "cook" "cargo" "ship" "sea"
[29] "bright" "crashed" "thursday" "port"
[33] "antalya" "southern" "turkey" "vessel"
[37] "collided" "rocks" "port" "thursday"
[41] "night" "result" "heavy" "winds"
[45] "waves" "crew" ""
Any suggestions on how to go about removing this? Adding "" to my list of stopwords doesn't help, either.
I deal with text a lot but not tm, so here are two approaches to get rid of the "" you have. Likely the extra "" characters are caused by a double space between sentences. You can treat this condition before or after you turn the text into a bag of words: either replace all double spaces with a single space before the strsplit, or do it afterward (you have to unlist after strsplit).
x <- "I like to ride my bicycle. Do you like to ride too?"
# TREAT BEFORE (OPTION):
a <- gsub(" +", " ", x)
strsplit(a, " ")
# TREAT AFTER (OPTION):
y <- unlist(strsplit(x, " "))
y[!y%in%""]
You might also try:
newtext <- lapply(newtext, function(x) gsub(" +", " ", x))
Again, I don't use tm, so this may not be of help, but this post hadn't seen any action, so I figured I'd share possibilities.
If you already have the corpus set up, try using the document length as a filter by attaching it to meta() as a tag and then creating a new corpus.
dtm <- DocumentTermMatrix(corpus)
## terms per document
doc.length <- rowSums(as.matrix(dtm))
## add length as a metadata tag
meta(corpus, tag = "Length") <- doc.length
## create new corpus without empty documents
corpus.noEmptyDocs <- tm_filter(corpus, FUN = sFilter, "Length > 0")
## remove Length as meta tag
meta(corpus, tag = "Length") <- NULL
With the above method, you can efficiently leverage the existing matrix manipulation support in tm with only five lines of code.
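As a simpler alternative sketch (my addition; it assumes the corpus can be subset with a logical vector, which tm VCorpus objects support), you can drop the empty documents directly by indexing on the term counts:
dtm <- DocumentTermMatrix(corpus)
doc.length <- rowSums(as.matrix(dtm))
## keep only documents containing at least one term
corpus.noEmptyDocs <- corpus[doc.length > 0]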