Removing words from a DocumentTermMatrix - r

My friend and I are working on transforming some tweets we collected into a dtm so that we can run a sentiment analysis using machine learning in R. The task must be performed in R, because it is for an exam at our university where R is the required tool.
Initially we collected a smaller sample to test whether our code works before moving on to a larger dataset. Our problem is that we can't figure out how to remove custom words from the dtm. Our code so far looks something like this (we are primarily using the tm package):
file <- read.csv("Tmix.csv",
                 row.names = NULL, sep = ";", header = TRUE) # just for loading the dataset

tweetsCorpus <- Corpus(VectorSource(file[, 1]))

tweetsDTM <- DocumentTermMatrix(tweetsCorpus,
                                control = list(verbose = TRUE,
                                               asPlain = TRUE,
                                               stopwords = TRUE,
                                               tolower = TRUE,
                                               removeNumbers = TRUE,
                                               stemWords = FALSE,
                                               removePunctuation = TRUE,
                                               removeSeparators = TRUE,
                                               removeTwitter = TRUE,
                                               stem = TRUE,
                                               stripWhitespace = TRUE,
                                               removeWords = c("customword1", "customword2", "customword3")))
We've also tried removing the words before converting to a dtm, using the removeWords command together with all of the "removeXXX" commands in the tm package, but it doesn't seem to work.
It is important that we don't simply remove all words with, say, 5 or fewer occurrences. We need all observations except the specific terms we want to remove, for instance https-addresses and the like.
Does anyone know how we do this?
And a second question: is there an easier way to remove all words that start with "https", instead of having to write each address individually into the code? Right now, for instance, we are writing "httpstcokozcejeg", "httpstcolskjnyjyn", "httpstcolwwsxuem" as individual custom words to remove from the data.
NOTE: We know that removeWords is a terrible solution to our problem, but we can't figure out how else to do it.

You can use regular expressions, for example:
gsub("http[a-z]*","","httpstcolwwsxuem here")
[1] " here"
Assuming that you removed punctuation/digits in tweetsCorpus, you can use the following:
1- Direct gsub
tweetsCorpus <- gsub("http[a-z]*", "", tweetsCorpus[[1]][[1]])  # note: tweetsCorpus[[1]][[1]] is only the first document's text, so this overwrites the corpus with that one string
OR
2- tm::tm_map, content_transformer
library(tm)
RemoveURL <- function(x){
  gsub("http[a-z]*", "", x)
}
tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(RemoveURL))
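To tie this back to the original question about removing custom words: a minimal sketch of the usual tm approach (under the same assumption as above that punctuation and digits are removed first) is to clean the corpus with tm_map() before building the DTM; the word list is just the placeholder names from the question.
library(tm)

customWords <- c("customword1", "customword2", "customword3")  # placeholder names from the question

tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(tolower))
tweetsCorpus <- tm_map(tweetsCorpus, removePunctuation)                # so URLs collapse to "httpstco..." fragments
tweetsCorpus <- tm_map(tweetsCorpus, removeNumbers)
tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(RemoveURL))   # now "http[a-z]*" removes those fragments
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, customWords)         # drop the custom words

tweetsDTM <- DocumentTermMatrix(tweetsCorpus,
                                control = list(stopwords = TRUE))
This way the unwanted terms never make it into the DTM in the first place, instead of trying to strip them out afterwards.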

Related

R stopwords: getting rid of ALL the words starting with 'https'

I'm doing a project that includes Twitter scraping.
The problem: I don't seem to be able to remove ALL of the words that start with 'https'.
My code:
library(twitteR)
library(tm)
library(RColorBrewer)
library(e1071)
library(class)
library(wordcloud)
library(tidytext)
scraped_tweets <- searchTwitter('Silk Sonic - leave door open', n = 10000, lang='en')
# get text data from tweets
scraped_text <- sapply(scraped_tweets, function(x){x$getText()})
# removing emojis and characters
scraped_text <- iconv(scraped_text, 'UTF-8', 'ASCII')
scraped_corpus <- Corpus(VectorSource(scraped_text))
doc_matrix <- TermDocumentMatrix(scraped_corpus,
                                 control = list(removePunctuation = T,
                                                stopwords = c('https', 'http', 'sonic', 'silk',
                                                              stopwords('english')),
                                                removeNumbers = T, tolower = T))
# convert object into a matrix
doc_matrix <- as.matrix(doc_matrix)
# get word counts
head(doc_matrix,1)
words <- sort(rowSums(doc_matrix), decreasing = T)
dm <- data.frame(word = names(words), freq = words)
# wordcloud
wordcloud(dm$word, dm$freq, random.order = F, colors = brewer.pal(8, 'Dark2'))
I added 'https' and 'http' to the stopwords, but it didn't help.
I can of course clean the output with gsub, but that's not the same: I still get the rest of the link text in the output.
Are there any ideas how I could do this?
Thanks in advance.
Let's have a look at the tm documentation for the stopwords control option:
stopwords: Either a Boolean value indicating stopword removal using default language specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.
The stopwords argument does not seem to make any partial or pattern matches on the provided stopwords. It does accept a custom function, though. This is one option, but I think it is easiest to do the url removal on the character vector before even turning it into a corpus:
library(stringr)  # str_remove_all() comes from stringr

scraped_text <- sapply(scraped_tweets, function(x){x$getText()})
scraped_text <- iconv(scraped_text, 'UTF-8', 'ASCII')
# Added line for regex string removal
scraped_text <- str_remove_all(scraped_text, r"(https?://[^)\]\s]+(?=[)\]\s]))")
scraped_corpus <- Corpus(VectorSource(scraped_text))
This is a rather simple regex for url recognition, but it works reasonably well. There are more complicated ones out there, which can easily be found with a google search.
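As a quick illustration of that pattern (the sample tweet text below is made up), the URL is stripped while the surrounding words are kept:
library(stringr)

x <- "check out the new single https://t.co/abc123XYZ so good"   # made-up tweet text
str_remove_all(x, r"(https?://[^)\]\s]+(?=[)\]\s]))")
# [1] "check out the new single  so good"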

kwic() function returns less rows than it should

I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just.. works...
# create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

# tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
It's something of a guess without having a reproducible example (your input full_data, namely), but here's my best take: your kwic() call is using the default "glob" pattern matching, and what you want is a regular expression match instead.
Fix it this way:
kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
     window = 5)
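To see why valuetype matters, here is a small self-contained illustration using quanteda's built-in inaugural corpus (not the asker's data): with the default "glob" matching, a pattern without wildcards only hits tokens that equal it exactly, while "regex" also matches inside longer tokens such as "national" or "nations".
library(quanteda)

toks <- tokens(data_corpus_inaugural[1:2])

# exact-token (glob) matches only
nrow(kwic(toks, pattern = "nation", valuetype = "glob", window = 5))

# regex matching also catches "national", "nations", "nation's", ...
nrow(kwic(toks, pattern = "nation", valuetype = "regex", window = 5))
In your case a regex such as "Trabi" will also match longer tokens like "Trabis", which is closer to what the grepl() call was counting on the raw text.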

How do I upload a FASTA file to R if it contains multiplexed data?

I was hoping to use the seqinr method to upload FASTA files to R for analysis. However, they are multiplexed sequences.
library(seqinr)
dnafile <- system.file("sra_data-3.fasta", package = "seqinr")
read.fasta(file = dnafile, as.string = TRUE, forceDNAtolower = FALSE)
This code says there are no > to read, but what it is actually reading is the following, which clearly has the arrow, along with other text that I'm assuming is what it doesn't like:
'>' SRR573784.1.2 G3G26M402FVJLA length=74
TGTGAGTAGTACGGGCGGTGTGTGCCGTACCGTCAATTCCTTTAAGTTTCTGAGCGGGCTGGCAAGGCGCATAG
(Quotation marks added to show the > is there and to keep the formatting from breaking.)
Any suggestions on how to load this? I don't necessarily need the barcodes or anything, just a row for each sequence so I can differentiate the sequences. Thank you for any ideas; multiplexed data is new to me.
Note: running the read.fasta() call above only produces that error, with no other output.
Update: skipping the system.file() step and simply using
dnafile <- "file_name"
DNAfile <- read.fasta(file = dnafile, as.string = TRUE, forceDNAtolower = FALSE)
length(DNAfile)
worked just fine.
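The likely reason the first attempt failed is that system.file() only locates files shipped inside an installed package, so for your own file it returns an empty string and read.fasta() has nothing to parse. A minimal sketch with a plain file path (the file name is just the one from the question):
library(seqinr)

# point read.fasta() directly at the file on disk instead of going through system.file()
seqs <- read.fasta(file = "sra_data-3.fasta", as.string = TRUE, forceDNAtolower = FALSE)

length(seqs)       # number of sequences in the file
names(seqs)[1:3]   # sequence IDs, e.g. "SRR573784.1.2"
seqs[[1]]          # the first sequence as a single string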

Why does quanteda drop some documents when converting to topicmodels format?

I'm working with quanteda here, and finding that when I convert from a document-feature matrix to the topicmodels format I lose some documents. Does anyone know why this is, or how I can prevent it? It is causing me some problems in a later section of the analysis. This code runs from the construction of the dfm through the conversion. When I run nrow(dfm_counts2), I get 199,560 rows, but after converting to dtm_lda there are only 198,435.
dfm_counts <- corpus_raw %>%
  dfm(tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE,
      remove = stopwords_and_single, stem = FALSE,
      remove_separators = TRUE, remove_url = TRUE, remove_symbols = TRUE)

docnames(dfm_counts) <- dfm_counts@docvars$index

## trimming tokens too common or too rare to improve efficiency of modeling
dfm_counts2 <- dfm_trim(dfm_counts, max_docfreq = 0.95, min_docfreq = 0.005, docfreq_type = "prop")

dtm_lda <- convert(dfm_counts2, to = "topicmodels")
That's because after your trimming, some documents now consist of zero features. convert(x, to = "topicmodels") removes empty documents, since you cannot fit them in a topic model, and topicmodels::LDA() produces an error if you try.
In the dfm_trim() call, 199560 - 198435 = 1125 documents must have consisted entirely of features that fall outside your docfreq range.
I suspect that this will be true:
sum(ntoken(dfm_counts2) == 0) == 1125
By the way, you can rename the document names with:
docnames(dfm_counts) <- dfm_counts$index
It is better to use this accessor than to reach into the object internals with @.
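If you want the document counts to line up before the conversion, one option (a sketch, assuming dfm_counts2 as built above) is to drop the now-empty documents explicitly:
library(quanteda)

# keep only documents that still contain at least one feature after trimming
dfm_counts2 <- dfm_subset(dfm_counts2, ntoken(dfm_counts2) > 0)

dtm_lda <- convert(dfm_counts2, to = "topicmodels")
nrow(dtm_lda) == ndoc(dfm_counts2)   # the counts now match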

quanteda convert to topicmodels retaining docvars

I'm using the awesome quanteda package to convert my dfm to a topicmodels format. However, in the process I'm losing my docvars, which I need for identifying which topics are most likely prevalent in my documents. This is especially a problem given that the topicmodels package (like STM) only keeps documents with non-zero counts, so the number of documents in the original dfm and in the model output differ. Is there any way for me to correctly identify the documents in this case?
I checked your outcome. Because of your select statement you have no features left in dfm_speeches. Convert that to the "dtm" format used by topicmodels and you indeed get a document-term matrix that has no documents and no terms.
But if your selection with dfm_select results in a dfm that still has features and you then convert it to the dtm format, you will see the docvars appear.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE,
                    remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)

dfm_speeches <- dfm_select(dfm_speeches, c("Bruton", "Cowen"))

docvars(dfm_speeches)

dfmlda <- convert(dfm_speeches, to = "topicmodels")
This will then work further with topicmodels. I will admit that if you convert to a dtm for tm when you have no features, you will still see the documents appear in the dtm. I'm not sure whether there is an unintended side effect in the conversion to topicmodels when there are no features.
I don't think the problem is described clearly, but I believe I understand what it is.
A topic model's document-feature matrix cannot contain empty documents, so it returns a named vector of topics without them. But you can still work with this if you match the topics back to the document names:
# mx is a quanteda's dfm
# topic is a named vector for topics from LDA
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
Sorry, here's an example.
dfm_speeches <- dfm(data_corpus_irishbudget2010,
                    remove_punct = TRUE, remove_numbers = TRUE,
                    remove = stopwords("english")) %>%
  dfm_trim(min_termfreq = 4, max_docfreq = 10)

dfm_speeches <- dfm_select(dfm_speeches, c("corbyn", "hillary"))

library(topicmodels)

dfmlda <- convert(dfm_speeches, to = "topicmodels")
dfmlda
As you can see, the dfmlda object is empty because of the way I modified my dfm by selecting specific words that never occur in this corpus.
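For completeness, a hedged end-to-end sketch of the matching approach from the earlier snippet, assuming mx is a dfm that still has features and with k chosen arbitrarily for illustration:
library(quanteda)
library(topicmodels)

dtm <- convert(mx, to = "topicmodels")   # empty documents are dropped here
lda_fit <- LDA(dtm, k = 5)               # k = 5 is arbitrary for this sketch
topic <- topics(lda_fit)                 # named vector: document name -> most likely topic

# documents that were dropped during the conversion simply get NA instead of a topic
docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
head(docvars(mx, "topic"))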
