I wonder if you can change the formation of sentences. Instead of punctuation to form the sentence, I would like a new row/ new line forming the sentence.
This is a very minimal question, so I'm going to have to guess at what you intend, but it sounds like you want to segment your documents into lines rather than sentences. There are two ways to do this: create a new corpus where each line is a document, or a new tokens object where each "token" is a line.
Both are a matter of using the *_segment() functions. Here are the two ways, with some sample text I will create where each line is a "sentence".
library("quanteda")
## Package version: 2.0.0
txt <- c(
  d1 = "Sentence one.\nSentence two is on this line.\nLine three",
  d2 = "This is a single sentence."
)
cat(txt)
## Sentence one.
## Sentence two is on this line.
## Line three This is a single sentence.
To make this into tokens, we use char_segment() with the newline as the segmentation pattern, and then coerce the result first into a list and then into tokens:
# as tokens
char_segment(txt, pattern = "\n", remove_pattern = FALSE) %>%
  as.list() %>%
  as.tokens()
## Tokens consisting of 4 documents.
## d1.1 :
## [1] "Sentence one."
##
## d1.2 :
## [1] "Sentence two is on this line."
##
## d1.3 :
## [1] "Line three"
##
## d2.1 :
## [1] "This is a single sentence."
If you want to make each of the lines into a "document" that can be segmented further, then use corpus_segment() after constructing a corpus from the txt object:
# as documents
corpus(txt) %>%
  corpus_segment(pattern = "\n", extract_pattern = FALSE)
## Corpus consisting of 4 documents.
## d1.1 :
## "Sentence one."
##
## d1.2 :
## "Sentence two is on this line."
##
## d1.3 :
## "Line three"
##
## d2.1 :
## "This is a single sentence."
I have a corpus object that I converted into a tokens object. I then filtered this object to remove words and unify their spelling.
For my further workflow, I again need a corpus object. How can I construct this from the tokens object?
You could paste the tokens together to return a new corpus. (Although this may not be the best approach if your goal is to get back to a corpus so that you can use corpus_reshape().)
library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
  "This is an example.",
  "This, a second example."
)
corp <- corpus(txt)
toks <- tokens(corp) %>%
  tokens_remove(stopwords("en"))
toks
## Tokens consisting of 2 documents.
## text1 :
## [1] "example" "."
##
## text2 :
## [1] "," "second" "example" "."
vapply(toks, paste, FUN.VALUE = character(1), collapse = " ") %>%
  corpus()
## Corpus consisting of 2 documents.
## text1 :
## "example ."
##
## text2 :
## ", second example ."
I have a list of character vectors that hold tokens for documents.
list(doc1 = c("I", "like", "apples"), doc2 = c("You", "like", "apples", "too"))
I would like to transform this list into a quanteda tokens (or dfm) object in order to make use of some of quanteda's functionality.
What's the best way to do this?
I realize I could do something like the following for each document:
tokens(paste0(c("I", "like", "apples"), collapse = " "), what = "fastestword")
Which gives:
Tokens consisting of 1 document.
text1 :
[1] "I" "like" "apples"
But this feels like a hack and is also unreliable, since some of my tokens contain whitespace. Is there a way to transfer these data structures more smoothly?
You can construct a tokens object from:
- a character vector, in which case the object is tokenised, with each character element becoming a "document"
- a corpus, which is a specially classed character vector, and is tokenised and converted into documents in the tokens object in the same way
- a list of character elements, in which case each list element becomes a tokenised document, and each element of that list becomes a token (but is not tokenised further)
- a tokens object, which is treated the same as the list of character elements.
It's also possible to convert a list of character elements to a tokens object using as.tokens(mylist). The difference is that with tokens(), you have access to all of the options such as remove_punct. With as.tokens(), the conversion is direct, without options, so it would be a bit faster if you do not need the options.
lis <- list(
  doc1 = c("I", "like", "apples"),
  doc2 = c("One two", "99", "three", ".")
)
library("quanteda")
## Package version: 3.0.9000
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tokens(lis)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "99" "three" "."
tokens(lis, remove_punct = TRUE, remove_numbers = TRUE)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "three"
The coercion alternative, without options:
as.tokens(lis)
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I" "like" "apples"
##
## doc2 :
## [1] "One two" "99" "three" "."
According to ?tokens, x can be a list:
x - the input object to the tokens constructor, one of: a (uniquely) named list of characters; a tokens object; or a corpus or character object that will be tokenized
So, assuming the list from the question is stored as lst1, we just need:
library(quanteda)
lst1 <- list(doc1 = c("I", "like", "apples"), doc2 = c("You", "like", "apples", "too"))
tokens(lst1, what = "fastestword")
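Since each list element simply becomes a tokenised document, this should print something along these lines:
## Tokens consisting of 2 documents.
## doc1 :
## [1] "I"      "like"   "apples"
##
## doc2 :
## [1] "You"    "like"   "apples" "too"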
I have a large corpus of text in a vector of strings (approx. 700,000 strings). I'm trying to replace specific words/phrases within the corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.
I'm looking for an efficient way of solving the problem.
I can do it in a for loop, looping through each pattern + replacement, but it scales badly (3 days or so!).
I've also tried qdap::mgsub(), but it seems to scale badly as well.
txt <- c("this is a random sentence containing bca sk",
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns",
"this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c("#a-specfic-tag-#abc sk",
                  "#a-specfic-tag-#bc a",
                  "#a-specfic-tag-#zqx tt")
#either
txt2 <- qdap::mgsub(patterns, replacements, txt)
#or
for (i in seq_along(patterns)) {
  txt <- gsub(patterns[i], replacements[i], txt)
}
Both solutions scale badly for my data with approx. 40,000 patterns/replacements and 700,000 txt strings.
I figure there must be a more efficient way of doing this?
If you can tokenize the texts first, then vectorized replacement is much faster. It's also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.
Here's how to do all of that in the quanteda package. The last step pastes the tokens back together into a character vector of "documents", if that is what you want.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda_options(threads = 4)
txt <- c(
  "this is a random sentence containing bca sk",
  "another sentence with bc a but also with zqx tt",
  "this sentence contains none of the patterns",
  "this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
  "#a-specfic-tag-#abc sk",
  "#a-specfic-tag-#bc a",
  "#a-specfic-tag-#zqx tt"
)
This will tokenize the texts and then use fast replacement of the hashed types, using a fixed pattern match (but you could have used valuetype = "regex" for regular expression matching). By wrapping patterns inside the phrase() function, you are telling tokens_replace() to look for sequences of tokens rather than individual matches, so this solves the multi-word issue.
toks <- tokens(txt) %>%
  tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this" "is" "a" "random" "sentence"
## [6] "containing" "bca" "sk"
##
## text2 :
## [1] "another" "sentence"
## [3] "with" "#a-specfic-tag-#bc a"
## [5] "but" "also"
## [7] "with" "#a-specfic-tag-#zqx tt"
##
## text3 :
## [1] "this" "sentence" "contains" "none" "of" "the"
## [7] "patterns"
##
## text4 :
## [1] "this" "sentence" "contains"
## [4] "only" "#a-specfic-tag-#bc a"
Finally, if you really want to put this back into character format, convert to a list of character types and then paste them together.
sapply(as.list(toks), paste, collapse = " ")
## text1
## "this is a random sentence containing bca sk"
## text2
## "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## text3
## "this sentence contains none of the patterns"
## text4
## "this sentence contains only #a-specfic-tag-#bc a"
You'll have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
Create a vector of all words in each phrase
txt1 <- strsplit(txt, " ")
words <- unlist(txt1)
Use match() to find the index of words to replace, and replace them
idx <- match(words, patterns)
words[!is.na(idx)] <- replacements[idx[!is.na(idx)]]
Re-form the phrases and paste together
phrases <- relist(words, txt1)
updt <- sapply(phrases, paste, collapse = " ")
I guess this won't work if patterns can have more than one word...
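Indeed, with the multi-word patterns from the question, each pattern contains a space and so can never equal a single word; match() finds nothing and updt comes back unchanged:
identical(updt, txt)
## [1] TRUE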
Create a map between the old and new values
map <- setNames(replacements, patterns)
Create a pattern that contains all patterns in a single regular expression
pattern <- paste0("(", paste0(patterns, collapse = "|"), ")")
Find all matches, and extract them
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
Unlist, map, and relist the matches to their replacement values, and update the original vector
regmatches(txt, ridx) <- relist(map[unlist(m)], m)
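For the record, with the txt, patterns, and replacements from the question, the updated vector should end up looking like this:
txt
## [1] "this is a random sentence containing bca sk"
## [2] "another sentence with #a-specfic-tag-#bc a but also with #a-specfic-tag-#zqx tt"
## [3] "this sentence contains none of the patterns"
## [4] "this sentence contains only #a-specfic-tag-#bc a"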
I'm trying to remove apostrophes from a Corpus, but only when they are the first character in a paragraph. I have seen posts about finding the first word in a sentence, but not a paragraph.
The reason I'm trying this is because I'm analyzing text. I want to strip all the punctuation, but leave apostrophes and dashes only in the middle of words. To start this, I did:
library(tm)
library(qdap)
#docs is any corpus
docs.test <- tm_map(docs, PlainTextDocument)
docs.test <- tm_map(docs.test, content_transformer(strip), char.keep = c("'", "-"))
for (j in seq(docs.test)) {
  docs[[j]] <- gsub(" \'", " ", docs[[j]])
}
This successfully removed all of the apostrophes except those at the start of a new line. To remove those, I have tried:
for (j in seq(docs.test)) {
  docs[[j]] <- gsub("\r\'", " ", docs[[j]])
  docs[[j]] <- gsub("\n\'", " ", docs[[j]])
  docs[[j]] <- gsub("<p>\'", " ", docs[[j]])
  docs[[j]] <- gsub("</p>\'", " ", docs[[j]])
}
In general, I think it would be useful to find a way to extract the first word of a paragraph. For my specific issue, I'm trying it just as a way to get at those apostrophes. I'm currently using the packages qdap and tm, but am open to using more.
Any ideas?
Thank you!
You didn't supply a test example, but here is a function that keeps intra-word apostrophes and hyphens. It's in a different package, but as the example at the end shows, the result is easily coerced to a regular list if you need it to be:
require(quanteda)
txt <- c(d1 = "\"This\" is quoted.",
         d2 = "quanteda's hyphen-words.",
         d3 = "Example: 'single' quotes.",
         d4 = "Possessive plurals' usage.")
(toks <- tokens(txt, removePunct = TRUE, removeHyphens = FALSE))
## tokens from 4 documents.
## d1 :
## [1] "This" "is" "quoted"
##
## d2 :
## [1] "quanteda's" "hypen-words"
##
## d3 :
## [1] "Example" "single" "quotes"
##
## d4 :
## [1] "Possessive" "plurals" "usage"
You can get back to a list this way, and of course back to documents if you need to, by sapply()ing a paste(x, collapse = " "), etc.
as.list(toks)
## $d1
## [1] "This" "is" "quoted"
##
## $d2
## [1] "quanteda's" "hypen-words"
##
## $d3
## [1] "Example" "single" "quotes"
##
## $d4
## [1] "Possessive" "plurals" "usage"
I have one input file which contains one paragraph. I need to split the paragraph by a pattern into two sub-paragraphs.
paragraph.xml
<Text>
This is first line.
This is second line.
\delemiter\new\one
This is third line.
This is fourth line.
</Text>
R code:
library(XML)
doc <- xmlTreeParse("paragraph.xml")
top <- xmlRoot(doc)
text <- top[[1]]
I need to split this paragraph into 2 paragraphs.
paragraph1
This is first line.
This is second line.
paragraph2
This is third line.
This is fourth line.
I found the strsplit() function very useful, but it never splits multi-line text.
Since you have xml files, it is better to use the XML package facilities. I see you have already started using it, so here is a continuation of your approach.
library(XML)
doc <- xmlParse('paragraph.xml')  ## equivalent to xmlTreeParse(..., useInternalNodes = TRUE)
## extract the text of the node Text
mytext <- xpathSApply(doc, '//Text/text()', xmlValue)
## convert it to a vector of lines using scan
lines <- scan(text = mytext, sep = '\n', what = 'character')
## get the delimiter index
delim <- which(lines == "\\delemiter\\new\\one")
## get the 2 paragraphs
p1 <- lines[seq(delim - 1)]
p2 <- lines[seq(delim + 1, length(lines))]
Then you can use paste or write to get the paragraph structure, for example, using write:
write(p1, "", sep = '\n')
This is first line.
This is second line.
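And likewise for the second paragraph:
write(p2, "", sep = '\n')
This is third line.
This is fourth line.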
Here is a sort of roundabout possibility, using split, grepl, and cumsum.
Some sample data:
temp <- c("This is first line.", "This is second line.",
"\\delimiter\\new\\one", "This is third line.",
"This is fourth line.", "\\delimiter\\new\\one",
"This is fifth line")
# [1] "This is first line." "This is second line." "\\delimiter\\new\\one"
# [4] "This is third line." "This is fourth line." "\\delimiter\\new\\one"
# [7] "This is fifth line"
Use split after generating "groups" by using cumsum on grepl:
temp1 <- split(temp, cumsum(grepl("delimiter", temp)))
temp1
# $`0`
# [1] "This is first line." "This is second line."
#
# $`1`
# [1] "\\delimiter\\new\\one" "This is third line." "This is fourth line."
#
# $`2`
# [1] "\\delimiter\\new\\one" "This is fifth line"
If further cleanup is desired, here's one option:
lapply(temp1, function(x) {
  x[grep("delimiter", x)] <- NA
  x[complete.cases(x)]
})
# $`0`
# [1] "This is first line." "This is second line."
#
# $`1`
# [1] "This is third line." "This is fourth line."
#
# $`2`
# [1] "This is fifth line"