I'm trying to run a keyness analysis. Everything worked, and then, for no reason I can identify, it started giving me an error.
I'm using data_corpus_inaugural, the quanteda package's built-in corpus of US presidents' inaugural addresses.
My code:
> corpus_pres <- corpus_subset(data_corpus_inaugural,
+ President %in% c("Obama", "Trump"))
> dtm_pres <- dfm(corpus_pres, groups = "President",
+ remove = stopwords("english"), remove_punct = TRUE)
Error: groups must have length ndoc(x)
In addition: Warning messages:
1: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
2: '...' should not be used for tokens() arguments; use 'tokens()' first.
3: 'groups' is deprecated; use dfm_group() instead
In quanteda v3, "dfm() constructs a document-feature matrix from a tokens object" (https://tutorials.quanteda.io/basic-operations/dfm/dfm/).
Try this:
toks_pres <- tokens(corpus_pres, remove_punct = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_group(groups = President)
pres_dfm <- dfm(toks_pres)
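From here, the keyness statistics themselves can be computed. A minimal sketch, assuming quanteda v3, where textstat_keyness() lives in the separate quanteda.textstats package:

library("quanteda.textstats")

# compare Trump's addresses (target) against Obama's (reference)
result_keyness <- textstat_keyness(pres_dfm, target = "Trump")
head(result_keyness)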
I came across the same problem when analyzing Twitter accounts, and this code works for me. You can search terms across accounts:
# group documents in the corpus by user
twcorpus <- corpus(users) %>%
  corpus_group(groups = interaction(user_username))

# visualize with textplot_xray
textplot_xray(kwic(twcorpus, "helsin*"), scale = "relative")
Related
I'm using R to create a descriptive actor graph based on Twitter data acquired with vosonSML. I'm trying to use the "edge_cleanup" command (using code that's been provided to me) to clean up self-loops (replies to self). When I do, I receive the following error message:
Error in data.frame(name = as.character(V(actorGraph)$name), label = as.character(V(actorGraph)$label)) :
arguments imply differing number of rows: 913, 0
Can anyone tell me why I'm getting this error message, and how to resolve it? The full excerpted code is below:
actorGraph <- twitterData %>%
  Create("actor") %>%
  Graph

## clean up the graph data, removing self-loops
edge_cleanup <- function(graph = actorGraph){
  library(igraph)
  df <- get.data.frame(actorGraph)
  names_list <- data.frame('name' = as.character(V(actorGraph)$name),
                           'label' = as.character(V(actorGraph)$label))
  df$from <- sapply(df$from, function(x) names_list$label[match(x, names_list$name)] %>% as.character())
  df$to <- sapply(df$to, function(x) names_list$label[match(x, names_list$name)] %>% as.character())
  nodes <- data.frame(sort(unique(c(df$from, df$to))))
  links <- df[, c('from', 'to')]
  net <- graph.data.frame(links, nodes, directed = TRUE)
  E(net)$weight <- 1
  net <- igraph::simplify(net, edge.attr.comb = "sum")
  return(net)
}
It's after this last command (edge_cleanup <- function(graph = actorGraph)) that I receive the above error message.
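One likely explanation: the row counts in the error (913 vs. 0) suggest the graph carries a name attribute on its 913 vertices but no label attribute, so as.character(V(actorGraph)$label) comes back zero-length. A minimal check, assuming actorGraph is the igraph object built above:

library(igraph)

# list the vertex attributes that actually exist on the graph;
# if "label" is absent, V(actorGraph)$label is NULL
vertex_attr_names(actorGraph)

length(V(actorGraph)$name)   # 913
length(V(actorGraph)$label)  # 0 when the attribute is missing, which is
                             # what makes the data.frame() call fail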
I am running a DESeq2 analysis, importing counts from featureCounts. I generate the counts matrix, but when I get to running DESeqDataSetFromMatrix(), I get the following error. I have checked my counts several times and I don't see any negative values.
I would appreciate any help with this.
library(purrr)
library(tidyverse)

f_files <- list.files("C:/Users/cash/Desktop/DESeq analysis 2", pattern = "\\.txt$", full.names = TRUE)

read_in_feature_counts <- function(file){
  cnt <- read_tsv(file, col_names = TRUE, comment = "#")
  cnt <- cnt %>% dplyr::select(-Chr, -Start, -End, -Strand, -Length)
  return(cnt)
}

raw_counts <- map(f_files, read_in_feature_counts)
raw_counts_df <- purrr::reduce(raw_counts, inner_join)
head(raw_counts_df)
# Assign condition (first four are controls, second four contain the expansion)
condition <- factor(c("Donor1-1_S1_R2_001", "Donor1-2_S2_R2_001", "Donor2-1_S10_R2_001", "Donor2-2_S11_R2_001","Donor3-1_S1_R2_001", "Donor3-2_S2_R2_001", "Donor4-1_S10_R2_001", "Donor4-2_S11_R2_001"))
library(DESeq2)
# Create a coldata frame and instantiate the DESeqDataSet. See ?DESeqDataSetFromMatrix
coldata <- data.frame(row.names=colnames(raw_counts_df))
dds <- DESeqDataSetFromMatrix(countData=raw_counts_df, colData=coldata, design=~condition)
Error in DESeqDataSet(se, design = design, ignoreRank) :
  some values in assay are negative
Here is the new error:
> dds <- DESeqDataSetFromMatrix(countData=raw_counts_df, colData=coldata, design=~condition)
converting counts to integer mode
Error in DESeqDataSet(se, design = design, ignoreRank) :
all variables in design formula must be columns in colData
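Both errors point at the inputs rather than the counts themselves: the featureCounts Geneid column is character, so it has to come out of countData, and design = ~condition only works if condition is an actual column of colData. A minimal sketch of one way to set this up, assuming the first column of raw_counts_df is Geneid and the first four samples are controls (the group labels here are placeholders):

library(DESeq2)

# move the gene IDs into rownames so countData is purely numeric
counts <- as.data.frame(raw_counts_df)
rownames(counts) <- counts$Geneid
counts$Geneid <- NULL

# colData must contain every variable named in the design formula;
# "control"/"expansion" are assumed labels for the two groups of four
coldata <- data.frame(
  condition = factor(rep(c("control", "expansion"), each = 4)),
  row.names = colnames(counts)
)

dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ condition)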
Suppose I have a data frame with a column which looks like:
tweets
#text
#text 2
#text 3
Using the quanteda package, I'm trying to count the number of hashtags in the data frame.
However, using the following code, I get an error:
tweet_dfm <- dfm(data, remove_punct = TRUE)
tag_dfm <- dfm_select(tweet_dfm, pattern = ('#*'))
toptag <- names(topfeatures(tag_dfm, 50))
head(toptag)
Error (on the first line of code):
Error in dfm.default(data, remove_punct = TRUE) : dfm() only works on character, corpus, dfm, tokens objects.
Any ideas on how to fix this?
You need to slice out the column of the data.frame called "tweets", using data$tweets. So:
library("quanteda")
## Package version: 2.1.2
data <- data.frame(tweets = c("#text", "#text 2", "#text 3"))
dfm(data$tweets, remove_punct = TRUE) %>%
dfm_select(pattern = ("#*")) %>%
sum()
## [1] 3
(since you wanted the total of all hashtags)
Note that remove_punct = TRUE is unnecessary here, although it also does no harm: quanteda's built-in tokeniser recognises the difference between punctuation and the hashtag character, which other tokenisers might treat as punctuation.
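If you want counts per hashtag rather than the overall total, topfeatures() from the original code works on the selected dfm:

tag_dfm <- dfm(data$tweets, remove_punct = TRUE) %>%
  dfm_select(pattern = "#*")
topfeatures(tag_dfm, 50)
## #text
##     3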
I am using this example to conduct sentiment analysis of a collection of .txt documents in R. The code is:
library(tm)
library(tidyverse)
library(tidytext)
library(glue)
library(stringr)
library(dplyr)
library(wordcloud)
require(reshape2)
files <- list.files(inputdir, pattern = "*.txt")

GetNrcSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("nrc")) %>%  # pull out only sentiment words
    count(sentiment) %>%                   # count the # of words per sentiment
    spread(sentiment, n, fill = 0) %>%     # make data wide rather than narrow
    mutate(sentiment = positive - negative) %>%  # positive - negative
    mutate(file = file) %>%                # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>%  # add the year
    mutate(city = str_match(file, "(.*?).2")[2])              # add the city
  return(sentiment)
}
The .txt files are stored in inputdir and have names AB-City.0000, where AB is an abbreviation of a country, City is a city name, and 0000 is a year (ranging from 2000 to 2017).
The function works for a single file as expected, i.e. GetNrcSentiment(files[1]) gives me a tibble with proper counts per sentiment. However, when I try to run it for the whole set, i.e.
nrc_sentiments <- data_frame()
for(i in files){
  nrc_sentiments <- rbind(nrc_sentiments, GetNrcSentiment(i))
}
I get the following error message:
Joining, by = "word"
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The exact same code works well with longer documents but gives an error when dealing with shorter texts. It seems that not all sentiments are found in small documents, so the number of columns varies from document to document, which might lead to this error, but I am not sure. I would appreciate any advice on how to fix the problem. If a sentiment is not found, I would want the entry to be zero (if that is the cause of my problem).
As an aside, the bing sentiment function runs through about two dozen files and then gives a different error, which seems to point to the same problem (no negative sentiment found?):
GetBingSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("bing")) %>%  # pull out only sentiment words
    count(sentiment) %>%                    # count the # of positive & negative words
    spread(sentiment, n, fill = 0) %>%      # make data wide rather than narrow
    mutate(sentiment = positive - negative) %>%
    mutate(file = file) %>%                 # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>%  # add the year
    mutate(city = str_match(file, "(.*?).2")[2])              # add the city
  # return our sentiment data frame
  return(sentiment)
}
Error in mutate_impl(.data, dots) :
Evaluation error: object 'negative' not found.
EDIT: Following the recommendation by David Klotz, I edited the code to:
for(i in files){
  nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
}
As a result, instead of throwing an error, the NRC function generates NA if words from a certain sentiment are not found; however, after 22 joins I get a different error:
Error in mutate_impl(.data, dots) : Evaluation error: object 'negative' not found.
The same error shows up when I run the bing function with dplyr. By the time the functions reach the 22nd document, both data frames contain columns for all sentiments. What may cause the error, and how can I diagnose it?
dplyr's bind_rows function is more flexible than rbind, at least when it comes to missing columns:
nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
The input might be missing the "negative" column that is used in the mutate() expression.
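One way to make both functions robust for short texts is to guarantee that both polarity columns exist before the mutate() step. A minimal sketch of the middle of the function, using the same tokens object as above:

sentiment <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment) %>%
  spread(sentiment, n, fill = 0)

# short texts may contain words of only one polarity, so the other
# column never appears after spread(); add it back filled with zeros
for (missing in setdiff(c("positive", "negative"), names(sentiment))) {
  sentiment[[missing]] <- 0
}

sentiment <- sentiment %>% mutate(sentiment = positive - negative)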
After reading my corpus with the Quanteda package, I get the same error when using various subsequent statements:
Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')").
For example, when using this simple statement: texts(mycorpus)[2]
My actual goal is to create a dfm (which gives me the same error message as above).
I read the corpus with this code:
mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
                                  docvarsfrom = "filenames", dvsep = "_",
                                  docvarnames = c("Date of Publication", "Length LexisNexis"),
                                  encoding = "UTF-8-BOM"))
My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.
Why am I getting this error every time? Thanks very much in advance for your help!
Response 1:
When using just readtext() I get one step further and texts(text.corpus)[1] does not yield an error.
However, when tokenizing, the same error occurs again, so:
token <- tokenize(text.corpus, removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
tokens(text.corpus)
Yields:
Error in UseMethod("tokenize") :
no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"
Error in UseMethod("tokens") :
no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"
Response 2:
Now I get these two error messages in return, which I initially also got, so I started using corpus_frame():
Error in UseMethod("tokens") : no applicable method for 'tokens'
applied to an object of class "c('corpus_frame', 'data.frame')"
In addition: Warning message: 'corpus' is deprecated.
Use 'corpus_frame' instead. See help("Deprecated")
Do I need to specify that 'tokenization' or any other step is only applied to the 'text' column and not to the entire dataset?
Response 3:
Thank you, Patrick, this does clarify things and has brought me somewhat further.
When running this:
# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
encoding = "UTF-8-BOM") %>%
corpus() %>%
tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
I get this:
Error in tokens_internal(texts(x), ...) :
the ... list does not contain 3 elements
In addition: Warning message:
removePunct is deprecated; use remove_punct instead
removeNumbers is deprecated; use remove_numbers instead
So I changed it accordingly (using remove_punct and remove_numbers) and now the code runs well.
Alternatively, I also tried this:
# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
encoding = "UTF-8-BOM") %>%
term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)
Which gives this error:
Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) :
unrecognized text filter property: 'drop_numbers'
After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!
To clarify the situation:
Version 0.9.1 of the corpus package had a function called corpus. quanteda also has a function called corpus. To avoid the name clash between the two packages, the corpus package's corpus function was deprecated and renamed to corpus_frame in version 0.9.2; it was removed in version 0.9.3.
To avoid the name clash with quanteda, either upgrade corpus to the latest version on CRAN (0.9.3), or else do
library(corpus)
library(quanteda)
instead of the other order.
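To check which version of corpus is installed:

packageVersion("corpus")
## 0.9.3 or later means the clashing corpus() function has been removed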
Now, if you want to use quanteda to tokenize your texts, follow the advice given in Ken's answer:
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis"),
encoding = "UTF-8-BOM")) %>%
corpus() %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)
You may want to use the dfm function instead of the tokens function if your goal is to get a document-by-term count matrix.
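For instance, the tokens can be piped straight into dfm(); a sketch using the same readtext call as above:

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_",
         docvarnames = c("Date of Publication", "Length LexisNexis"),
         encoding = "UTF-8-BOM") %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2) %>%
  dfm()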
If you want to use the corpus package, instead do
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis"),
encoding = "UTF-8-BOM")) %>%
term_matrix(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)
Depending on what you're trying to do, you might want to use the term_stats function instead of the term_matrix function.
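A minimal sketch, assuming term_stats() accepts the same text-filter properties as term_matrix():

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_",
         docvarnames = c("Date of Publication", "Length LexisNexis"),
         encoding = "UTF-8-BOM") %>%
  term_stats(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)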
OK, you are getting this error because (as the error message states) there is no tokens() method for a readtext object class, which is a special version of a data.frame. (Note: tokenize() is older, deprecated syntax that will be removed in the next version; use tokens() instead.)
You want this:
library("quanteda")
library("readtext")
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis"),
encoding = "UTF-8-BOM")) %>%
corpus() %>%
tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
It's the corpus() step you omitted. corpus_frame() is from a different package (my friend Patrick Perry's corpus).