I'm reading all my text files into a df with the readtext package:
df <- readtext(paste0(directory, "/*.txt"))
The .txt files get stored in a df with doc_id (name of the document) and text (content).
Before I upgraded to the newest version of quanteda, the doc_id was stored in the corpus object when I created my corpus using:
corpus <- corpus(df)
But this no longer works: the 'documents' data.frame of the corpus object now stores only the 'texts' values, not the doc_id values.
How do I get back my doc_id into my corpus object?
That's because of a bug we fixed prior to v1.2.0. When constructing a corpus from a data.frame, some field is required as the document id, and by default this is the readtext doc_id.
If you also want it as a document variable, you can add it yourself. First, I read in some texts from the readtext package's system files, for a reproducible example.
library("readtext")
library("quanteda")
packageVersion("readtext")
## [1] ‘0.50’
packageVersion("quanteda")
## [1] ‘1.2.0’
DATA_DIR <- system.file("extdata/", package = "readtext")
df <- readtext(paste0(DATA_DIR, "txt/EU_manifestos/*.txt"), encoding = "LATIN1")
df
## readtext object consisting of 17 documents and 0 docvars.
## # data.frame [17 × 2]
## doc_id text
## <chr> <chr>
## 1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..."
## 2 EU_euro_2004_de_V.txt "\"Gemeinsame\"..."
## 3 EU_euro_2004_en_PSE.txt "\"PES · PSE \"..."
## 4 EU_euro_2004_en_V.txt "\"Manifesto\n\"..."
## 5 EU_euro_2004_es_PSE.txt "\"PES · PSE \"..."
## 6 EU_euro_2004_es_V.txt "\"Manifesto\n\"..."
When we create the corpus from this, we see no document variables.
crp <- corpus(df)
docvars(crp)
## data frame with 0 columns and 17 rows
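The doc_id values are not lost, though: they have become the document names of the corpus, which you can confirm with docnames():
head(docnames(crp), 3)
## [1] "EU_euro_2004_de_PSE.txt" "EU_euro_2004_de_V.txt"   "EU_euro_2004_en_PSE.txt"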
But it's trivial to add them:
docvars(crp, "doc_id") <- df$doc_id
head(docvars(crp))
## doc_id
## EU_euro_2004_de_PSE.txt EU_euro_2004_de_PSE.txt
## EU_euro_2004_de_V.txt EU_euro_2004_de_V.txt
## EU_euro_2004_en_PSE.txt EU_euro_2004_en_PSE.txt
## EU_euro_2004_en_V.txt EU_euro_2004_en_V.txt
## EU_euro_2004_es_PSE.txt EU_euro_2004_es_PSE.txt
## EU_euro_2004_es_V.txt EU_euro_2004_es_V.txt
Note that you are strongly discouraged from accessing the internals of the corpus object directly, e.g. through its data.frame element crp$documents. The accessor docvars() and its replacement form docvars()<- will continue to work in future versions, but the internals of the corpus are likely to change.
This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
                   speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
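If you want a single score per document, one simple extension (my addition, not part of the original answer) is to convert that dfm to a data.frame and take the difference of the positive and negative counts:
# net sentiment = positive count minus negative count, per document
sent <- convert(dfm(tokens_lookup(toks, data_dictionary_LSD2015)), to = "data.frame")
sent$net <- sent$positive - sent$negative
sent[, c("doc_id", "negative", "positive", "net")]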
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
          pattern = "stackoverflow",
          window = 3)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic(); in my example I removed the punctuation.
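For instance (these calls are my own illustration, not from the original answer), patterns can be glob wildcards, and phrase() matches multi-word sequences:
kwic(tokens(corp), pattern = "stack*", window = 3)                     # glob matching
kwic(tokens(corp), pattern = phrase("word of interest"), window = 3)   # multi-word phrase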
The end result of the kwic() call is a data frame with the window before and after the keyword(s); in this example, a window of length 3. After that you can run some form of sentiment analysis on the pre and post results (or paste them together first).
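For example, a minimal sketch of that last suggestion, reusing the kwic object x and the LSD2015 dictionary from the first answer:
kw <- as.data.frame(x)
# paste the pre- and post-windows together, then score them with the dictionary
windows <- paste(kw$pre, kw$post)
dfm(tokens_lookup(tokens(windows), data_dictionary_LSD2015))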
What's the quanteda way of cleaning a corpus as shown in the tm example below (lowercase, remove punctuation, remove numbers, stem words)? To be clear, I don't want to create a document-feature matrix with dfm(); I just want a clean corpus that I can use for a specific downstream task.
# This is what I want to do in quanteda
library("tm")
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
PS I am aware that I could just do quanteda_corpus <- quanteda::corpus(crude) to get what I want, but I would much prefer being able to do everything in quanteda.
I think what you want to do is deliberately impossible in quanteda.
You can, of course, do the cleaning quite easily without losing the order of words using the tokens* set of functions:
library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_wordstem()
print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> [1] "Diamond" "Shamrock" "Corp" "said" "that" "effect"
#> [7] "today" "it" "had" "cut" "it" "contract"
#> [ ... and 78 more ]
#>
#> reut-00002.xml :
#> [1] "OPEC" "may" "be" "forc" "to" "meet" "befor"
#> [8] "a" "schedul" "June" "session" "to"
#> [ ... and 427 more ]
#>
#> reut-00004.xml :
#> [1] "Texaco" "Canada" "said" "it" "lower" "the"
#> [7] "contract" "price" "it" "will" "pay" "for"
#> [ ... and 40 more ]
#>
#> [ reached max_ndoc ... 17 more documents ]
But it is not possible to turn this tokens object back into a corpus. It would, however, be possible to write a new function to do this:
# NB: relies on unexported quanteda internals (:::), which may change without notice
corpus.tokens <- function(x, ...) {
  quanteda:::build_corpus(
    unlist(lapply(x, paste, collapse = " ")),
    docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
  )
}
corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#>
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#>
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#>
#> [ reached max_ndoc ... 17 more documents ]
But this object, while technically being a corpus class object, is not what a corpus is supposed to be. From ?corpus [emphasis added]:
Value
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level
metadata, and default settings for subsequent processing of the
corpus.
The object above does not meet this description, as the original texts have already been processed, yet the class of the object communicates otherwise. I don't see a reason to break this logic, as all subsequent analysis steps should be possible using either tokens* or dfm* functions.
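If all you need downstream is the cleaned text itself rather than a corpus object, a plain character vector sidesteps the class question entirely (a minimal sketch):
# collapse each document's tokens back into a single string
clean_txt <- vapply(as.list(toks), paste, character(1), collapse = " ")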
I have a list in which every element has fields with the same names (only the values differ), and I need to convert it into a data.frame whose column names match the field names. Following is my list:
Data input (data input in json format.json)
library(rjson)
data <- fromJSON(file = "data input in json format.json")
head(data,3)
[[1]]
[[1]]$floors
[1] 5
[[1]]$elevation
[1] 15
[[1]]$bmi
[1] 23.7483
[[2]]
[[2]]$floors
[1] 4
[[2]]$elevation
[1] 12
[[2]]$bmi
[1] 23.764
[[3]]
[[3]]$floors
[1] 3
[[3]]$elevation
[1] 9
[[3]]$bmi
[1] 23.7797
And my expected data.frame is,
floors elevation bmi
5 15 23.7483
4 12 23.7640
3 9 23.7797
Can you help me figure this out?
Thanks in advance.
You can use jsonlite.
library(jsonlite)
Then use fromJSON() and specify the path to your file (or alternatively a URL or the raw text) in the argument txt:
fromJSON(txt = 'path/to/json/file.json')
The result is already a data.frame, since jsonlite simplifies JSON arrays of objects by default:
floors elevation bmi
1 5 15 23.7483
2 4 12 23.7640
3 3 9 23.7797
If you prefer rjson, you could first read the file as before:
data <- rjson::fromJSON(file = 'path/to/json/file.json')
Then use do.call() and rbind.data.frame() to convert the list to a data.frame:
do.call("rbind.data.frame", data)
As an alternative to do.call(), you can use data.table's rbindlist(), which is faster:
data.table::rbindlist(data)
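Note that rbindlist() returns a data.table; if you need a plain data.frame, wrap it:
as.data.frame(data.table::rbindlist(data))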
I am trying to implement quanteda on my corpus in R, but I am getting:
Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE, :
duplicate row.names: character(0)
I don't have much experience with this. Here is a download of the dataset: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0
Here is the code:
library(tm)
library(quanteda)
tweets = read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors = FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)
quanteda.corpus <- corpus(corpus)
The processing that you're doing with tm prepares an object for tm, and quanteda doesn't know what to do with it... quanteda does all of these steps itself; see help("dfm") for the options.
If you try the following, you can move ahead:
dfm(tweets$Tweet, verbose = TRUE, toLower = TRUE, removeNumbers = TRUE,
    removePunct = TRUE, removeTwitter = TRUE, language = "english",
    ignoredFeatures = stopwords("english"), stem = TRUE)
Creating a dfm from a character vector ...
... lowercasing
... tokenizing
... indexing documents: 6,943 documents
... indexing features: 15,164 feature types
... removed 161 features, from 174 supplied (glob) feature types
... stemming features (English), trimmed 2175 feature variants
... created a 6943 x 12828 sparse dfm
... complete.
Elapsed time: 0.756 seconds.
HTH
No need to start with the tm package, or even to use read.csv() at all - this is what the quanteda companion package readtext is for.
So to read in the data, you can send the object created by readtext::readtext() straight to the corpus constructor:
myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv", text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
##
## Text Types Tokens Sentences Sentiment Sentiment_Confidence
## text1 19 21 1 2 0.7579
## text2 18 20 2 2 0.8775
## text3 23 24 1 -1 0.6805
## text5 17 19 2 0 1.0000
## text4 18 19 1 -1 0.8820
##
## Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
From there, you can perform all of the pre-processing steps directly in the dfm() call, including the choice of ngrams:
# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete.
## Elapsed time: 0.662 seconds.
# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete.
## Elapsed time: 1.419 seconds.
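Note that both answers above use an older dfm() API. In quanteda version 3 and later these arguments were removed, and the equivalent (a sketch of the modern pipeline, untested on this dataset) is:
# unigrams, quanteda v3+ style
dfm1 <- tokens(myCorpus, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem() %>%
  dfm()
# bigrams
dfm2 <- tokens(myCorpus, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem() %>%
  tokens_ngrams(n = 2) %>%
  dfm()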
How would I load data from a csv file into R if the file contains a different number of strings on every line? I need to combine it into one variable (e.g. a list of lists?). The data in the file looks like this (I don't know the maximum number of elements in one row):
Peter; Paul; Mary
Jeff;
Peter; Jeff
Julia; Vanessa; Paul
Use fill=TRUE:
read.table(text = '
Peter; Paul; Mary
Jeff;
Peter; Jeff
Julia; Vanessa; Paul', sep = ';', fill = TRUE)
V1 V2 V3
1 Peter Paul Mary
2 Jeff
3 Peter Jeff
4 Julia Vanessa Paul
Alternatively, read the lines yourself and parse each one:
r <- readLines("tmp3.csv")
getLine <- function(x) {
  r <- scan(text = x, sep = ";", what = "character", quiet = TRUE)
  r <- r[nchar(r) > 0]           ## drop empties
  gsub("(^ +| +$)", "", r)       ## strip whitespace
}
lapply(r, getLine)
## [[1]]
## [1] "Peter" "Paul" "Mary"
##
## [[2]]
## [1] "Jeff"
##
## [[3]]
## [1] "Peter" "Jeff"
##
## [[4]]
## [1] "Julia" "Vanessa" "Paul"
This is technically a list of vectors rather than a list of lists, but it might be what you want...
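If you do need a true list of lists, coercing each vector is a one-liner:
res <- lapply(r, getLine)
lapply(res, as.list)   # each name becomes its own list element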