How to convert a data frame to a dfm in the quanteda package in R?

Suppose I have a data frame with a column which looks like:
tweets
#text
#text 2
#text 3
Using the quanteda package, I'm trying to count the number of hashtags in the data frame.
However, using the following code, I get an error:
tweet_dfm <- dfm(data, remove_punct = TRUE)
tag_dfm <- dfm_select(tweet_dfm, pattern = ('#*'))
toptag <- names(topfeatures(tag_dfm, 50))
head(toptag)
Error (on the first line of code):
Error in dfm.default(data, remove_punct = TRUE) : dfm() only works on character, corpus, dfm, tokens objects.
Any ideas how to fix?

You need to slice out the column of the data.frame called "tweets", using data$tweets. So:
library("quanteda")
## Package version: 2.1.2
data <- data.frame(tweets = c("#text", "#text 2", "#text 3"))
dfm(data$tweets, remove_punct = TRUE) %>%
  dfm_select(pattern = "#*") %>%
  sum()
## [1] 3
(since you wanted the total of all hashtags)
Note that remove_punct = TRUE is unnecessary here, although it also does no harm: quanteda's built-in tokeniser recognises the difference between punctuation and the hashtag character, which other tokenisers might treat as a punctuation character.
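In quanteda v3 and later, calling dfm() directly on a character vector is deprecated in favour of tokenising first. A minimal sketch of the equivalent v3-style pipeline, using the same toy data as above:
# v3 workflow: tokenise first, then build the dfm
toks <- tokens(data$tweets)
dfm(toks) %>%
  dfm_select(pattern = "#*") %>%
  sum()
## [1] 3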

Related

How to convert tokens (made by quanteda) to a data frame with a doc_id per doc? I need a data frame or tibble to calculate tf-idf

First, I used readtext() and then as_tibble() to get a tibble.
What I actually want is a one-token-per-row table so that I can calculate tf-idf (per doc_id).
I have run into two problems:
1. The tokens() function from quanteda cannot be used on a tibble.
2. I tried making a corpus() first and then calling tokens() on it, but I found that the tokens object cannot be converted to the data frame or tibble I want.
I want a tibble like this:
doc_id word n
xxx    xx   3
xxx    xx   40
xxx    xx   80
suppressPackageStartupMessages({
  library(quanteda)
  library(quanteda.textstats)
  library(jiebaR)
  library(readtext)
  library(purrr)
  library(tidyverse)
  library(RMeCab)
  library(tidytext)
  library(RcppMeCab)
})
## read all files with readtext
docs1 <- readtext("/Users/oushiei/Downloads/NORUBY")
class(docs1)  ## "readtext" "data.frame"
## 1. the corpus route
docs2 <- corpus(docs1)
docs3 <- docs2 %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)  ## "corpus" becomes "tokens"
unnest(docs3)  ## error: no applicable method for 'unnest' applied to an object of class "tokens"
## 2. the tibble route
doc_tibble <- docs1 %>% as_tibble()
doc_tibble
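One route to the desired one-token-per-row tibble, sketched on the assumption that tidytext's tidy() method for quanteda dfm objects applies here (it returns document/term/count triples), after which bind_tf_idf() adds the tf-idf columns per doc_id:
library(quanteda)
library(tidytext)
library(dplyr)
# tokens -> dfm -> tidy triples -> tf-idf
dfmat <- dfm(docs3)
tidy(dfmat) %>%                       # columns: document, term, count
  bind_tf_idf(term, document, count) %>%
  rename(doc_id = document, word = term, n = count)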

Tokenization of Compound Words not Working in Quanteda

I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately, I'm running into some error when attempting to tokenize the underlying dataset.
This is the subset of the dataset I'm using as a reproducible example:
test_cluster <- speeches_subset %>%
  filter(grepl("Schwester Agnes",
               speechContent,
               ignore.case = TRUE))
test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")
Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound word "Schwester Agnes". test_corpus transforms the underlying data into a quanteda corpus object.
When I then run the following code, I would expect, first, the content of the speechContent variable to be tokenized and, thanks to tokens_compound(), the compound word "Schwester Agnes" to be kept together as a single token. In a second step, I would expect the kwic() function to return a dataframe of six rows, with the keyword variable containing the compound word "Schwester Agnes".
Instead, however, kwic() returns an empty dataframe of 0 observations of 7 variables. I think this is because of some mistake I'm making with tokens_compound(), but I'm not sure... Any help would be greatly appreciated!
test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))
test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)
EDIT: I realize that the examples above are not easily reproducible, so please refer to the reprex below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
test_corpus <- corpus(data,
docid_field = "id",
text_field = "speechContent")
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = c("stack", "overflow"))
test_kwic <- kwic(test_tokens,
pattern = "stack overflow",
window = 5)
You need to apply phrase("stack overflow"), so the pattern is treated as a sequence of two tokens rather than a single token, and set concatenator = " " in tokens_compound(), so the compounded token matches the space-separated pattern you pass to kwic().
require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1
speech <- c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of",
"This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.",
"this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
speechContent = speech)
test_corpus <- corpus(data,
docid_field = "id",
text_field = "speechContent")
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = phrase("stack overflow"), concatenator = " ")
test_kwic <- kwic(test_tokens,
pattern = "stack overflow",
window = 5)
test_kwic
#> Keyword-in-context with 2 matches.
#> [1, 29] for example is the word | stack overflow | However there are so many
#> [2, 24] but at the very end | stack overflow |
Created on 2022-05-06 by the reprex package (v2.0.1)
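As a side note: if all you need is the kwic table itself, rather than a compounded tokens object for later use, a phrase pattern can be passed to kwic() directly, skipping the compounding step entirely. A minimal sketch using the same objects:
# no tokens_compound() needed when kwic() itself receives a phrase pattern
test_kwic2 <- kwic(tokens(test_corpus, remove_punct = TRUE, remove_numbers = TRUE),
                   pattern = phrase("stack overflow"),
                   window = 5)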

How to compare text from two data frames in a wordcloud using R's quanteda package?

Suppose I have two data frames (country_x and country_y which contain similar columns). E.g.
text_country_x
hello
bye
and
text_country_y
see ya
great
Using quanteda and quanteda.textplots packages, I have created a word cloud:
corpus_country_x <- corpus(country_x_df$text_country_x)
country_x_token <- tokens(corpus_country_x, remove_punct = TRUE, remove_numbers = TRUE)
country_x_token <- tokens_remove(country_x_token, stopwords("english"))
token_dfm_x <- dfm(country_x_token)
quanteda.textplots::textplot_wordcloud(token_dfm_x)
However, I want to create a wordcloud where half of it contains text from text_country_x and the other half contains text from text_country_y. Does anyone know how to do this?
I know there is the comparison = TRUE parameter, but I'm not sure how to make it work in practice: https://quanteda.io/reference/textplot_wordcloud.html
Do it this way:
1. Form each corpus separately.
2. Set a docvar on each corpus to differentiate the country. (Below, I use the document variable set.)
3. Combine the corpus objects using +.
4. Tokenise and form a dfm, then group the dfm using your set variable (country, in your example).
5. Plot the comparison wordcloud.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
corpus_country_x <- corpus_subset(data_corpus_inaugural, Party == "Democratic")
corpus_country_x$set <- "Dem"
corpus_country_y <- corpus_subset(data_corpus_inaugural, Party == "Republican")
corpus_country_y$set <- "Rep"
corp <- corpus_country_x + corpus_country_y
dfmat <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_group(groups = set)
library("quanteda.textplots")
textplot_wordcloud(dfmat, max_words = 60, comparison = TRUE)
Created on 2022-04-27 by the reprex package (v2.0.1)
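Mapped back onto the data frames from the question (the object and column names here are assumptions taken from the question), a minimal sketch:
corpus_country_x <- corpus(country_x_df$text_country_x)
corpus_country_x$country <- "x"   # the docvar used for grouping
corpus_country_y <- corpus(country_y_df$text_country_y)
corpus_country_y$country <- "y"
corp <- corpus_country_x + corpus_country_y
dfmat <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_group(groups = country)
quanteda.textplots::textplot_wordcloud(dfmat, comparison = TRUE)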

quanteda dfm() Error: groups must have length ndoc(x)

I'm trying to run a keyness analysis, everything worked and then, for an unknown reason, it started to give me an error.
I'm using data_corpus_inaugural which is the quanteda-package corpus object of US presidents' inaugural addresses.
My code:
corpus_pres <- corpus_subset(data_corpus_inaugural,
                             President %in% c("Obama", "Trump"))
dtm_pres <- dfm(corpus_pres, groups = "President",
                remove = stopwords("english"), remove_punct = TRUE)
Error: groups must have length ndoc(x)
In addition: Warning messages:
1: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
2: '...' should not be used for tokens() arguments; use 'tokens()' first.
3: 'groups' is deprecated; use dfm_group() instead
In quanteda v3 "dfm() constructs a document-feature matrix from a tokens object" - https://tutorials.quanteda.io/basic-operations/dfm/dfm/
Try this:
toks_pres <- tokens(corpus_pres, remove_punct = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_group(groups = President)
pres_dfm <- dfm(toks_pres)
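Equivalently, following the deprecation warning's own suggestion, the grouping can happen on the dfm side with dfm_group(); a minimal sketch:
pres_dfm <- tokens(corpus_pres, remove_punct = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  dfm() %>%
  dfm_group(groups = President)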
I came across the same problem when analyzing Twitter accounts, and this code works for me. You can search for terms across accounts:
# make a group in the corpus
twcorpus <- corpus(users) %>%
  corpus_group(groups = interaction(user_username))
# visualize with textplot_xray (tokenising first, since kwic() needs tokens in quanteda v3)
textplot_xray(kwic(tokens(twcorpus), "helsin*"), scale = "relative")

Filtering text from numbers and stopwords in R (not for a tdm)

I have text corpus.
mytextdata <- read.csv("path to texts.csv")
Mystopwords <- read.csv("path to mystopwords.txt")
How can I filter this text? I must:
1) delete all numbers
2) remove the stop words
3) remove the brackets (and their contents)
I will not work with a dtm; I just need to clean this text data of numbers and stopwords.
sample data:
112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715
"Jura" and "the" are the stopwords.
In an output I expect
Tablet for cleaning hydraulic system
Since there is only one character string available in the question at the moment, I decided to create sample data myself; I hope this is close to your actual data. As Nate suggested, using the tidytext package is one way to go.
Here, I first removed numbers, punctuation, contents in the brackets, and the brackets themselves. Then, I split the words in each string using unnest_tokens() and removed stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added "jura" in the filter() part. Grouping the data by id, I combined the words back into character strings in summarise(). Note that I used "jura" instead of "Jura": unnest_tokens() converts capital letters to lower case.
mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)
library(dplyr)
library(tidytext)
data(stop_words)
mutate(mydata, text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!word %in% c(stop_words$word, "jura")) %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))
#      id text
#   <int> <chr>
# 1     1 tablet cleaning hydraulic system
# 2     2 tablet cleaning mambojumbo system
Another way would be the following. In this case, I am not using unnest_tokens().
library(magrittr)
library(stringi)
library(tidytext)
data(stop_words)
gsub(x = mydata$text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "") %>%
  stri_split_regex(str = ., pattern = " ", omit_empty = TRUE) %>%
  lapply(function(x) {
    x[which(!x %in% c(stop_words$word, "Jura"))] %>%
      paste(collapse = " ")
  }) %>%
  unlist()
# [1] "Tablet cleaning hydraulic system"  "Tablet cleaning mambojumbo system"
There are multiple ways of doing this. If you want to rely on base R only, you can transform #jazurro's answer a bit and use gsub() to find and replace the text patterns you want to delete.
I'll do this by using two regular expressions: the first one matches the content of the brackets and numeric values, whereas the second one will remove the stop words. The second regex will have to be constructed based on the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings using sapply:
mytextdata <- read.csv("123.csv", header=FALSE, stringsAsFactors=FALSE)
custom_filter <- function(string, stopwords=c()){
string <- gsub("[-0-9]+|\\(.*\\) ", "", string)
# Create something like: "\\b( the|Jura)\\b"
new_regex <- paste0("\\b( ", paste0(stopwords, collapse="|"), ")\\b")
gsub(new_regex, "", string)
}
stopwords <- c("the", "Jura")
custom_filter(mytextdata[1], stopwords)
# [1] "Tablet for cleaning hydraulic system "
