Replace quanteda tokens through regex - r

I would like to explicitly replace specific tokens in objects of class tokens from the quanteda package. I fail to replicate a standard approach that works well with stringr.
The objective is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").
Please have a look at the minimal example below:
suppressPackageStartupMessages(library(quanteda))
library(stringr)
text = "It was a beautiful day down to the coastof California."
# I would solve this with stringr as follows:
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."
# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )
# I want to replace "coastof" with the two tokens "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "It" "was" "a" "beautiful" "day"
#> [6] "down" "to" "the" "\\1 \\2" "California"
#> [11] "."
Any workaround?
Created on 2021-03-16 by the reprex package (v1.0.0)

You can use a mixture of quanteda and stringr to build a list of the words needing separation and their separated forms, then use tokens_replace() to perform the replacement. This has the advantage of allowing you to curate the list before applying it, so you can verify that you haven't caught replacements that you don't actually want to apply.
suppressPackageStartupMessages(library("quanteda"))
toks <- tokens("It was a beautiful day down to the coastof California.")
keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
vals <- stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2") %>%
strsplit(" ")
keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"
tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
## [1] "It" "was" "a" "beautiful" "day"
## [6] "down" "to" "the" "coast" "of"
## [11] "California" "."
Created on 2021-03-16 by the reprex package (v1.0.0)
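If the regex happens to catch words you do not want to split (a hypothetical token like "professor" also contains "of"), you can curate keys and vals before applying the replacement. A minimal sketch, assuming such an unwanted match exists in keys:
# hypothetical curation step: drop key/value pairs you do not want to apply
keep <- keys != "professor"
tokens_replace(toks, keys[keep], vals[keep])
Because keys and vals are ordinary R objects, any inspection or filtering step can go here before tokens_replace() is called.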


How to transform a list of character vectors into a quanteda tokens object?

I have a list of character vectors that hold tokens for documents.
list(doc1 = c("I", "like", "apples"), doc2 = c("You", "like", "apples", "too"))
I would like to transform this list into a quanteda tokens (or dfm) object in order to make use of some of quanteda's functionality.
What's the best way to do this?
I realize I could do something like the following for each document:
tokens(paste0(c("I", "like", "apples"), collapse = " "), what = "fastestword")
Which gives:
Tokens consisting of 1 document.
text1 :
[1] "I" "like" "apples"
But this feels like a hack and is also unreliable, as some of my tokens contain whitespace. Is there a way to transfer these data structures more smoothly?
You can construct a tokens object from:
a character vector, in which case the object is tokenised with each character element becoming a "document"
a corpus, which is a specially classed character vector, and is tokenised and converted into documents in the tokens object in the same way
a list of character elements, in which case each list element becomes a tokenised document, and each element of that list becomes a token (but is not tokenised further)
a tokens object, which is treated the same as the list of character elements.
It's also possible to convert a list of character elements to a tokens object using as.tokens(mylist). The difference is that with tokens(), you have access to all of the options such as remove_punct. With as.tokens(), the conversion is direct, without options, so would be a bit faster if you do not need the options.
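For the first two input types the construction is equally direct; a minimal sketch of those cases (my own illustration, before the list example below):
# from a named character vector: each element becomes a tokenised document
tokens(c(d1 = "I like apples.", d2 = "You like apples too."))
# from a corpus built on the same vector: tokenised into documents the same way
tokens(corpus(c(d1 = "I like apples.", d2 = "You like apples too.")))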
lis <- list(
  doc1 = c("I", "like", "apples"),
  doc2 = c("One two", "99", "three", ".")
)
library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tokens(lis)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "99" "three" "."
tokens(lis, remove_punct = TRUE, remove_numbers = TRUE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "three"
The coercion alternative, without options:
as.tokens(lis)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "99" "three" "."
According to ?tokens, x can be a list.
x - the input object to the tokens constructor, one of: a (uniquely) named list of characters; a tokens object; or a corpus or character object that will be tokenized
So we just need
library(quanteda)
lst1 <- list(doc1 = c("I", "like", "apples"),
             doc2 = c("You", "like", "apples", "too"))
tokens(lst1, what = 'fastestword')

Only remove punctuation for words, not numbers

I am using the tm package in R to remove punctuation.
TextDoc <- tm_map(TextDoc, removePunctuation)
Is there a way I can only remove punctuation if it is attached to a letter/word rather than a number?
E.g.
I want performance. --> performance
But I want 3.14 --> 3.14
Example of how I want the function to work:
wall, --> wall
expression. --> expression
ef. --> ef
A. --> A
name: --> name
:ok --> ok
91.8.10 --> 91.8.10
EDIT:
TextDoc is of the form:
You may also try gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = TRUE), where text is your text vector. Explanation of the regex:
(?<!\\d) negative lookbehind for any digit character
[[:punct:]] searches for punctuation marks
(?=\\D) followed by positive lookahead for any non-digit character
? makes the preceding element optional (matched 0 or 1 times)
text <- c("wall, 88.1", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = T)
#> [1] "wall 88.1" "expression" "ef" "ok" "A"
#> [6] "3.14" "91.8.10"
long_text <- "wall, 88.1 expression. ef. :ok A. 3.14 91.8.10"
gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', long_text, perl = T)
#> [1] "wall 88.1 expression ef ok A 3.14 91.8.10"
Created on 2021-06-13 by the reprex package (v2.0.0)
I've completely revamped my answer based on your specification and Anil's answer (above), which is much more widely applicable than what I originally had.
library(tm)
# Here we pretend that your texts are like this
text <- c("wall,", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10",
"w.a.ll, 6513.645+1646-5")
# and we create a corpus with them, like the one you show
corp <- Corpus(VectorSource(text))
# you create a function with any of the solutions that we've provided here
# I'm taking AnilGoyal's because it's better than my rushed purrr one.
my_remove_punct <- function(x) {
  gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', x, perl = TRUE)
}
# pass the function to tm_map
new_corp <- tm_map(corp, my_remove_punct)
# Applying the function will give you a warning about dropping documents; but it's a bug in the tm package.
# We use this to confirm that the contents are indeed correct. The last line is a print-out of all the individual documents together.
sapply(new_corp, print)
#> [1] "wall"
#> [1] "expression"
#> [1] "ef"
#> [1] "ok"
#> [1] "A"
#> [1] "3.14"
#> [1] "91.8.10"
#> [1] "wall 6513.645+1646-5"
#> [1] "wall" "expression" "ef"
#> [4] "ok" "A" "3.14"
#> [7] "91.8.10" "wall 6513.645+1646-5"
The warning you receive about "dropping documents" is not real, as you can see by printing. An explanation is in this other SO question.
In the future, note that you can quickly get better answers by providing raw data, using dput() on your object. Something like dput(TextDoc). If it is too much, you can subset it.
Tried to make it less ugly but here is my best shot:
library(data.table)
library(tm)
TextDoc <- data.table(text = c("wall",
                               "expression.",
                               "ef.",
                               "91.8.10",
                               "A.",
                               "name:",
                               ":ok"))
TextDoc[grepl("[a-zA-Z]", text),
        text := unlist(tm_map(Corpus(VectorSource(as.vector(text))),
                              removePunctuation))[1:length(grepl("[a-zA-Z]", text))]]
Which gives us:
> TextDoc
text
1: wall
2: expression
3: ef
4: 91.8.10
5: A
6: name
7: ok

Extracting specific data from nested list

I am scraping movies of 2019 from IMDB and extracting the Director's name from a nested list.
Now, the issue is that the Directors' names are not given for all of the movies, only for a select few, so I need to extract the Director's name wherever the term 'Director:\n' appears.
The nested list is as follows:
[[1]]
[1] "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n"
[[2]]
[1] "\n"
[2] "Director:\nJ.J. Abrams"
[3] "|"
[4] "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"
[[3]]
[1] "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n"
[[4]]
[1] "\n"
[2] "Director:\nTom Hooper"
[3] "|"
[4] "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n"
[[5]]
[1] "Guy Pearce,Andy Serkis,Stephen Graham,Joe Alwyn\n"
[[6]]
[1] "\n"
[2] "Director:\nMichael Bay"
[3] "|"
[4] "Stars:\nRyan Reynolds,Mélanie Laurent,Manuel Garcia-Rulfo,Ben Hardy\n"
As one can see here, the Director's name appears in alternating elements, but that is just for the purpose of the example. Thanks in advance.
Expected Output:
directors_data
NA,"J.J. Abrams",NA,"Michael Bay"
Here is a base R solution, where you can use either the grep + gsub method or the regmatches + gregexpr method.
Assuming your data is a list lst, you can try the following code to extract the directors' names:
sapply(lst, function(x) ifelse(length(r <- grep("Director",x,value = T)),gsub("Director:\n","",r),NA))
or
sapply(lst, function(x) ifelse(length(r<-unlist(regmatches(x,gregexpr("(?<=Director:\n)(.*)",x,perl = T)))),r,NA))
You can use str_extract to extract the string and map to loop over each element of the list:
library(purrr)
library(stringr)
map_chr(list_df, ~{temp <- na.omit(str_extract(.x, "(?<=Director:\n)(.*)"));
if(length(temp) > 0) temp else NA})
#[1] NA "J.J. Abrams" NA "Tom Hooper"
data
Since you did not provide a reproducible example I created one myself.
list_df <- list(
  "Henry Cavill,Freya Allan,Anya Chalotra,Mimi Ndiweni\n",
  c("\n", "Director:\nJ.J. Abrams", "|",
    "Stars:\nCarrie Fisher,Mark Hamill,Adam Driver,Daisy Ridley\n"),
  "Pedro Pascal,Carl Weathers,Rio Hackford,Gina Carano\n",
  c("\n", "Director:\nTom Hooper", "|",
    "Stars:\nFrancesca Hayward,Taylor Swift,Laurie Davidson,Robbie Fairchild\n")
)
Base R solution:
directors_data <- gsub("Director:\n", "",
unlist(Map(function(x){x[2]}, list_df)), fixed = TRUE)
Base R solution not using unlist and using mapply not Map:
directors_data <- gsub(".*\\\n", "",
mapply(function(x){x[2]}, list_df, SIMPLIFY = TRUE))
Base R solution if pattern appears at different indices per list element:
directors_data <- gsub(".*\\\n", "",
mapply(function(x) {
ifelse(length(x[which(grepl("Director", x))]) > 0,
x[which(grepl("Director", x))],
NA)}, list_df, SIMPLIFY = TRUE))

How do I 'efficiently' replace a vector of strings with another (pairwise) in a large text corpus

I have a large corpus of text in a vector of strings (approx. 700,000 strings). I'm trying to replace specific words/phrases within the corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.
I'm looking for an efficient way of solving the problem.
I can do it in a for loop, looping through each pattern + replacement, but it scales badly (3 days or so!).
I've also tried qdap::mgsub(), but it seems to scale badly as well.
txt <- c("this is a random sentence containing bca sk",
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns",
"this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c("#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt")
#either
txt2 <- qdap::mgsub(patterns, replacements, txt)
#or
for(i in 1:length(patterns)){
txt <- gsub(patterns[i], replacements[i], txt)
}
Both solutions scale badly for my data with approx. 40,000 patterns/replacements and 700,000 text strings.
I figure there must be a more efficient way of doing this?
If you can tokenize the texts first, then vectorized replacement is much faster. It's also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.
Here's how to do all that in the quanteda package. The last line pastes the tokens back into a single "document" as a character vector, if that is what you want.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda_options(threads = 4)
txt <- c(
"this is a random sentence containing bca sk",
"another sentence with bc a but also with zqx tt",
"this sentence contains none of the patterns",
"this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
"#a-specfic-tag-#abc sk",
"#a-specfic-tag-#bc a",
"#a-specfic-tag-#zqx tt"
)
This will tokenize the texts and then use fast replacement of the hashed types, using a fixed pattern match (but you could have used valuetype = "regex" for regular expression matching). By wrapping the patterns inside the phrase() function, you are telling tokens_replace() to look for token sequences rather than individual matches, which solves the multi-word issue.
toks <- tokens(txt) %>%
tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this" "is" "a" "random" "sentence"
## [6] "containing" "bca" "sk"
##
## text2 :
## [1] "another" "sentence"
## [3] "with" "#a-specfic-tag-#bc a"
## [5] "but" "also"
## [7] "with" "#a-specfic-tag-#zqx tt"
##
## text3 :
## [1] "this" "sentence" "contains" "none" "of" "the"
## [7] "patterns"
##
## text4 :
## [1] "this" "sentence" "contains"
## [4] "only" "#a-specfic-tag-#bc a"
Finally if you really want to put this back into character format, then convert to a list of character types and then paste them together.
sapply(as.list(toks), paste, collapse = " ")
## text1
## "this is a random sentence containing bca sk"
## text2
## "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## text3
## "this sentence contains none of the patterns"
## text4
## "this sentence contains only #a-specfic-tag-#bc a"
You'll have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
Create a vector of all words in each phrase
txt1 = strsplit(txt, " ")
words = unlist(txt1)
Use match() to find the index of words to replace, and replace them
idx <- match(words, patterns)
words[!is.na(idx)] = replacements[idx[!is.na(idx)]]
Re-form the phrases and paste together
phrases = relist(words, txt1)
updt = sapply(phrases, paste, collapse = " ")
I guess this won't work if patterns can have more than one word...
Create a map between the old and new values
map <- setNames(replacements, patterns)
Create a pattern that contains all patterns in a single regular expression
pattern = paste0("(", paste0(patterns, collapse="|"), ")")
Find all matches, and extract them
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
Unlist, map, and relist the matches to their replacement values, and update the original vector
regmatches(txt, ridx) <- relist(map[unlist(m)], m)

How to remove parentheses with words inside using the tm package?

Let's say I have part of the texts in a document like this:
"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
I want to remove the "(API)", and it needs to be done before
corpus <- tm_map(corpus, removePunctuation)
After removing "(API)",it should be look like this below :
"Other segment comprised of our active pharmaceutical ingredient business,which..."
I searched for a long time, but all I could find were answers about removing the parentheses only; I don't want the word within them to appear in the corpus either.
I really need someone to give me a hint, please.
You can use a smarter tokeniser, such as the one in the quanteda package, where removePunct = TRUE will remove the parentheses automatically.
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" ## "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
Added:
If you want to tokenise the text first, then you need to lapply a gsub until we add a regular expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this would work:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
If you simply want to remove the parenthetical expressions as in the question, then you don't need either tm or quanteda:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
The longer regular expression also catches cases in which the parenthetical expression ends a sentence or is followed by additional punctuation such as a comma.
If it's only single words, how about (untested):
removeBracketed <- content_transformer(function(x, ...) {gsub("\\(\\w+\\)", "", x)})
tm_map(corpus, removeBracketed)
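As a quick sanity check of the regex on plain text (my own check, not part of the original answer):
gsub("\\(\\w+\\)", "", "ingredient (API) business")
## [1] "ingredient  business"
Note the doubled space left where "(API)" was removed; a later tm_map(corpus, stripWhitespace) step will collapse it.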
