After reading in my corpus with the quanteda package, I get the same error with various subsequent statements:
Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')").
For example, when using this simple statement: texts(mycorpus)[2]
My actual goal is to create a dfm (which gives me the same error message as above).
I read the corpus with this code:
mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
    docvarsfrom = "filenames", dvsep = "_",
    docvarnames = c("Date of Publication", "Length LexisNexis"),
    encoding = "UTF-8-BOM"))
My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.
Why am I getting this error every time? Thanks very much in advance for your help!
Response 1:
When using just readtext(), I get one step further, and texts(text.corpus)[1] does not yield an error.
However, when tokenizing, the same kind of error occurs again:
token <- tokenize(text.corpus, removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
tokens(text.corpus)
Yields:
Error in UseMethod("tokenize") :
no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"
Error in UseMethod("tokens") :
no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"
Response 2:
Now I get this error and warning in return, which I initially also got; that is why I started using corpus_frame() in the first place:
Error in UseMethod("tokens") : no applicable method for 'tokens'
applied to an object of class "c('corpus_frame', 'data.frame')"
In addition: Warning message: 'corpus' is deprecated.
Use 'corpus_frame' instead. See help("Deprecated")
Do I need to specify that 'tokenization' or any other step is only applied to the 'text' column and not to the entire dataset?
Response 3:
Thank you, Patrick, this does clarify things and brought me a bit further.
When running this:
# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
encoding = "UTF-8-BOM") %>%
corpus() %>%
tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
I get this:
Error in tokens_internal(texts(x), ...) :
the ... list does not contain 3 elements
In addition: Warning message:
removePunctremoveNumbers is deprecated; use remove_punctremove_numbers instead
So I changed it accordingly (using remove_punct and remove_numbers) and now the code runs well.
Alternatively, I also tried this:
# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
encoding = "UTF-8-BOM") %>%
term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)
Which gives this error:
Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) :
unrecognized text filter property: 'drop_numbers'
After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!
To clarify the situation:
Version 0.9.1 of the corpus package had a function called corpus. quanteda also has a function called corpus. To avoid the name clash between the two packages, the corpus function in the corpus package was deprecated and renamed to corpus_frame in version 0.9.2; it was removed in version 0.9.3.
To avoid the name clash with quanteda, either upgrade corpus to the latest version on CRAN (0.9.3), or else do
library(corpus)
library(quanteda)
instead of the other order.
Now, if you want to use quanteda to tokenize your texts, follow the advice given in Ken's answer:
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis"),
encoding = "UTF-8-BOM")) %>%
corpus() %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)
You may want to use the dfm function instead of the tokens function if your goal is to get a document-by-term count matrix.
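For example, a minimal sketch of the same pipeline, ending in a document-feature matrix (the object name mydfm is illustrative, and the path and docvars are taken from the question; adjust to your own setup):
mydfm <- readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
                  docvarsfrom = "filenames", dvsep = "_",
                  docvarnames = c("Date of Publication", "Length LexisNexis"),
                  encoding = "UTF-8-BOM") %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2) %>%
  dfm()  # the tokens object can be passed straight to dfm()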
If you want to use the corpus package, instead do
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis"),
encoding = "UTF-8-BOM")) %>%
term_matrix(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)
Depending on what you're trying to do, you might want to use the term_stats function instead of the term_matrix function.
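For example, a hedged sketch, assuming term_stats() accepts the same text-filter properties through ... as term_matrix() does (path and docvars as above):
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_",
         docvarnames = c("Date of Publication", "Length LexisNexis"),
         encoding = "UTF-8-BOM") %>%
  term_stats(drop_punct = TRUE, ngrams = 1:2)  # returns a term-frequency table instead of a matrix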
OK, you are getting this error because (as the error message states) there is no tokens() method for a readtext class object, which is a special version of a data.frame. (Note: tokenize() is older, deprecated syntax that will be removed in the next version; use tokens() instead.)
You want this:
library("quanteda")
library("readtext")
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis"),
encoding = "UTF-8-BOM")) %>%
corpus() %>%
tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
It's the corpus() step you omitted. corpus_frame() is from a different package (my friend Patrick Perry's corpus).
Related
I am new to R and new to working with Syuzhet.
I am trying to make a custom NRC-style lexicon to use with the Syuzhet package in order to categorize words. Unfortunately, although this functionality now exists within Syuzhet, it doesn't seem to recognize my custom lexicon. Please excuse my weird variable names and the extra libraries; I plan to use them for other things later on and am just testing for now.
library(sentimentr)
library(pdftools)
library(tm)
library(readxl)
library(syuzhet)
library(tidytext)
texto <- "I am so love hate beautiful ugly"
text_cust <- get_tokens(texto)
custom_lexicon <- data.frame(lang = c("eng","eng","eng","eng"), word = c("love", "hate", "beautiful", "ugly"), sentiment = c("positive","positive","positive","positive"), value = c("1","1","1","1"))
my_custom_values <- get_nrc_sentiment(text_cust, lexicon = custom_lexicon)
I get the following error:
my_custom_values <- get_nrc_sentiment(text_cust, lexicon = custom_lexicon)
New names:
• value -> value...4
• value -> value...5
Error in FUN(X[[i]], ...) :
custom lexicon must have a 'word', a 'sentiment' and a 'value' column
As far as I can tell, my data frame exactly matches that of the standard NRC library, containing columns labeled 'word', 'sentiment', and 'value'. So I'm not sure why I am getting this error.
The CRAN version of syuzhet's get_nrc_sentiment() doesn't accept a lexicon; get_sentiment() does. But your custom_lexicon has an error: the values need to be numeric, not character strings. And to use your own lexicon, you need to set method = "custom", otherwise the custom lexicon will be ignored. The code below works with just syuzhet.
library(syuzhet)
texto <- "I am so love hate beautiful ugly"
text_cust <- get_tokens(texto)
custom_lexicon <- data.frame(lang = c("eng","eng","eng","eng"),
word = c("love", "hate", "beautiful", "ugly"),
sentiment = c("positive","positive","positive","positive"),
value = c(1,1,1,1))
get_sentiment(text_cust, method = "custom", lexicon = custom_lexicon)
[1] 0 0 0 1 1 1 1
I am having trouble running my code, which was written under RStudio 1.3.959, after migrating to a new PC and installing RStudio 1.4.1717. The same error appears when running the code in base R (4.1.0). When using base R functions directly (grep, gregexpr, e.g. gregexpr("[:alpha:]+", "1234a")), there is no error message.
Code:
library(tidyverse)
data_files <- as.data.frame(list.files(data_folder))
data_files <- data_files %>%
mutate(temp = data_files[,1]) %>%
separate("temp",
c("temp", "Trash"),
sep = "\\.") %>%
select(-"Trash") %>%
separate("temp",
c("run", "Trash"),
sep = "[:alpha:]+",
remove = FALSE) %>%
select(-"Trash") %>%
separate("temp",
c("Trash", "letters"),
sep = "[:digit:]+") %>%
select(-"Trash") %>%
select("run", "letters")
My data_folder contains csv files with name pattern (date-increment-letter.csv, e.g. 21021202a.csv)
Error message:
Error in gregexpr(pattern, x, perl = TRUE) :
invalid regular expression '[:alpha:]+'
In addition: Warning message:
In gregexpr(pattern, x, perl = TRUE) : PCRE pattern compilation error
'POSIX named classes are supported only within a class'
at '[:alpha:]+'
Reproducible example using dput:
data_files <- as.data.frame(list.files(icpms_folder))
dput(head(data_files))
structure(list(`list.files(icpms_folder)` = c("21021202a.csv",
    "21021202b.csv", "21021202c.csv", "21021203a.csv",
    "21021203b.csv", "21021203c.csv")),
    row.names = c(NA, 6L), class = "data.frame")
Could you point out what is missing in my fresh installation, please?
Thank you in advance!
The answer to "why" is already in the error message: POSIX named classes are supported only within a class.
POSIX named classes are like [:digit:], [:alpha:], and so on.
By "class", the message author meant a character class, i.e. [...].
Put one inside of another:
sep = '[[:alpha:]]+'
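For instance, a quick base R check of the nested form, and the corresponding change in the pipeline above:
# the POSIX class nested inside a character class now compiles under PCRE
gregexpr("[[:alpha:]]+", "21021202a.csv", perl = TRUE)
# in the separate() calls above, the same change applies to both separators:
#   sep = "[[:alpha:]]+"   and   sep = "[[:digit:]]+"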
I'm trying to run a keyness analysis; everything worked, and then, for an unknown reason, it started to give me an error.
I'm using data_corpus_inaugural, which is the quanteda corpus object of US presidents' inaugural addresses.
My code:
> corpus_pres <- corpus_subset(data_corpus_inaugural,
+ President %in% c("Obama", "Trump"))
> dtm_pres <- dfm(corpus_pres, groups = "President",
+ remove = stopwords("english"), remove_punct = TRUE)
Error: groups must have length ndoc(x)
In addition: Warning messages:
1: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
2: '...' should not be used for tokens() arguments; use 'tokens()' first.
3: 'groups' is deprecated; use dfm_group() instead
In quanteda v3, "dfm() constructs a document-feature matrix from a tokens object" (see https://tutorials.quanteda.io/basic-operations/dfm/dfm/).
Try this:
toks_pres <- tokens(corpus_pres, remove_punct = TRUE) %>%
tokens_remove(pattern = stopwords("en")) %>%
tokens_group(groups = President)
pres_dfm <- dfm(toks_pres)
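A hedged follow-up for the keyness step itself, assuming quanteda v3, where textstat_keyness() lives in the quanteda.textstats package:
library(quanteda.textstats)
# compare Trump's addresses (target) against Obama's (reference)
textstat_keyness(pres_dfm, target = "Trump") %>%
  head()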
I came across the same problem when analyzing Twitter accounts, and this code works for me. You can then search for terms across accounts:
# to make a group in the corpus
twcorpus <- corpus(users) %>%
  corpus_group(groups = interaction(user_username))
# to visualize with textplot_xray
textplot_xray(kwic(twcorpus, "helsin*"), scale = "relative")
Suppose I have a data frame with a column of tweets which looks like:
tweets
#text
#text 2
#text 3
Using the quanteda package, I'm trying to count the number of hashtags in the data frame.
However, using the following code, I get an error:
tweet_dfm <- dfm(data, remove_punct = TRUE)
tag_dfm <- dfm_select(tweet_dfm, pattern = ('#*'))
toptag <- names(topfeatures(tag_dfm, 50))
head(toptag)
Error (on the first line of code):
Error in dfm.default(data, remove_punct = TRUE) : dfm() only works on character, corpus, dfm, tokens objects.
Any ideas how to fix this?
You need to slice out the column of the data.frame called "tweets", using data$tweets. So:
library("quanteda")
## Package version: 2.1.2
data <- data.frame(tweets = c("#text", "#text 2", "#text 3"))
dfm(data$tweets, remove_punct = TRUE) %>%
dfm_select(pattern = ("#*")) %>%
sum()
## [1] 3
(since you wanted the total of all hashtags)
Note that remove_punct = TRUE is unnecessary here, although it does no harm, since quanteda's built-in tokeniser recognises the difference between punctuation and the hashtag character, which other tokenisers might treat as punctuation.
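If you want the individual hashtags and their counts rather than just the total, the topfeatures() step from the question works on the selected dfm, for example:
tag_dfm <- dfm(data$tweets) %>%
  dfm_select(pattern = "#*")
topfeatures(tag_dfm, 50)  # top hashtags by frequency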
I want to use this API:
http(s)://lindat.mff.cuni.cz/services/morphodita/api/
with the method "tag". It will tag and lemmatize my text input. It has worked fine with a text string (see below), but I need to send an entire file to the API.
Just to show that a string as input works fine:
method <- "tag"
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = "Peter likes cakes. John likes lollypops.",
                 output = "json",
                 model = "english-morphium-wsj-140407-no_negation"),
  method = method)
This is the (correct) result:
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"Peter\",\"lemma\":\"Peter\",\"tag\":\"NNP
\",\"space\":\" \"},{\"token\":\"likes\",\"lemma\":\"like\",\"tag\":\"VBZ
\",\"space\":\" \"},{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
[truncated by me]
However, replacing the string with a character vector whose elements correspond to lines of a text file does not work, since the API requires a single string as input. Only one vector element (by default the first) gets processed:
method <- "tag"
mydata <- c("cakes.", "lollypops")
lemmatized_text <- RCurl::getForm(paste("http://lindat.mff.cuni.cz
/services/morphodita/api/", method, sep = ""),
.params = list(data = mydata, output = "json",
model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
[truncated by me]
\"result\": [[{\"token\":\"cakes\",\"lemma\":\"cake\",\"tag\":\"NNS
\"},{\"token\":\".\",\"lemma\":\".\",\"tag\":\".\"}]]\n}\n"
This issue can be worked around with sapply and a function that calls the API on each element of the vector in turn, but then each element of the resulting vector contains a separate JSON document. To parse it, though, I need the entire data to be one single JSON document.
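For reference, a minimal sketch of that sapply workaround (the function name tag_one is illustrative, not from the original post):
tag_one <- function(x) {
  RCurl::getForm("http://lindat.mff.cuni.cz/services/morphodita/api/tag",
                 .params = list(data = x, output = "json",
                                model = "english-morphium-wsj-140407-no_negation"))
}
# each element of the result is its own JSON document, which is exactly the drawback described
json_per_element <- sapply(mydata, tag_one)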
Next I tried textConnection, but it returns erroneous output:
mydata <- c("cakes.", "lollypops")
mycon <- textConnection(mydata, encoding = "UTF-8")
lemmatized_text <- RCurl::getForm(
  paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
  .params = list(data = mycon, output = "json",
                 model = "english-morphium-wsj-140407-no_negation"))
[1] "{\n \"model\": \"english-morphium-wsj-140407-no_negation\",\n
\"acknowledgements\": [\n \"http://ufal.mff.cuni.cz
/morphodita#morphodita_acknowledgements\",\n \"http://ufal.mff.cuni.cz
/morphodita/users-manual#english-morphium-wsj_acknowledgements\"\n ],\n
\"result\": [[{\"token\":\"5\",\"lemma\":\"5\",\"tag\":\"CD\"}]]\n}\n"
attr(,"Content-Type")
I should probably also say that I have already tried to paste and collapse the vector into one single element, but that is very fragile. It works with dummy data but not with larger files, and never with Czech files (although they are UTF-8 encoded). The API strictly requires UTF-8-encoded data, so I suspect encoding issues. I have tried this file:
mydata <- RCurl::getURI("https://ia902606.us.archive.org/4/items/maidmarian00966gut/maidm10.txt", .opts = list(.encoding = "UTF-8"))
and it said
Error: Bad Request
but when I only used a few lines, it suddenly worked. I also made a local copy of the file in which I changed the newlines from Macintosh to Windows style. Maybe this helped a bit, but it was definitely not sufficient.
Finally, I should add that I work on Windows 8 Professional, running R 3.2.4 (64-bit) with RStudio 0.99.879.
I should have used RCurl::postForm instead of RCurl::getForm, with all other arguments remaining the same. The postForm function is not only for writing files to the server, as I had wrongly believed. It also does not impose strict limits on the size of the data to be processed, since with postForm the data do not become part of the URL, unlike with getForm.
This is my convenience function (requires RCurl, stringi, stringr, magrittr):
process_w_morphodita <- function(method, data, output = "json",
                                 model = "czech-morfflex-pdt-161115",
                                 guesser = "yes", ...) {
  # for formally optional but very important argument-value pairs, see the
  # MorphoDiTa REST API reference at
  # http://lindat.mff.cuni.cz/services/morphodita/api-reference.php
  pokus <- RCurl::postForm(
    paste("http://lindat.mff.cuni.cz/services/morphodita/api/", method, sep = ""),
    .params = list(data = stringi::stri_enc_toutf8(data), output = output,
                   model = model, guesser = guesser, ...))
  if (output == "vertical") {
    # look for escaped (four-backslash) tabs and newlines and replace them
    # with real ones to get the vertical format in a text file
    pokus <- pokus %>%
      stringr::str_trim(side = "both") %>%
      stringr::str_conv("UTF-8") %>%
      stringr::str_replace_all(pattern = "\\\\t", replacement = "\t") %>%
      stringr::str_replace_all(pattern = "\\\\n", replacement = "\n")
  }
  return(pokus)
}
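A possible call, for example (the file name maidm10.txt refers to a hypothetical local copy; the English model matches the earlier examples):
# read a local UTF-8 text file and collapse it into one string before sending
mytext <- readLines("maidm10.txt", encoding = "UTF-8")
tagged <- process_w_morphodita(method = "tag",
                               data = paste(mytext, collapse = "\n"),
                               output = "json",
                               model = "english-morphium-wsj-140407-no_negation")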