I am working on a project with R and I am just starting to get my hands dirty with it.
In the first part I try to clean the data in the vector msg, but when I later build the TermDocumentMatrix, those short words and punctuation characters still appear.
I would like to remove words with fewer than four letters and remove punctuation:
gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
gsub("[[:punct:]]", "", pclbyshares$msg)
corpus <- Corpus(VectorSource(pclbyshares$msg))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=120, highfreq=Inf)
You haven't stored the results of your first two gsub() calls, so in your third line, where you create corpus, you are still using the unmodified msg data. Give this a try:
msg_clean <- gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
msg_clean <- gsub("[[:punct:]]", "", msg_clean)
corpus <- Corpus(VectorSource(msg_clean))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 120, highfreq = Inf)
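To confirm the cleaning actually took effect before you start counting terms, a quick sanity check (nothing here is specific to your data):
head(msg_clean)   # the short words and the punctuation should be gone
inspect(tdm)      # prints a summary and a sample of the term-document matrix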
I'm trying to run an LDA topic modelling analysis with stm, but I have problems with my metadata: the model itself seems to run fine, but my covariate (Age) is not being read, as shown in the example below.
I have some tweets (the docu column in the Excel file) with an Age covariate that takes the values Young and Old.
Here is my data:
http://www.mediafire.com/file/5eb9qe6gbg22o9i/dada.xlsx/file
library(stm)
library(readxl)
library(quanteda)
library(stringr)
library(tm)
data <- read_xlsx("C:/dada.xlsx")
#Remove URLs
data$docu <- str_replace_all(data$docu, "https://t.co/[a-z,A-Z,0-9]*","")
data$docu <- gsub("#\\w+", " ", data$docu) # Remove hashtags
data$docu <- iconv(data$docu, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
data$docu <- gsub("#\\w+", " ", data$docu)
data$docu <- gsub("http.+ |http.+$", " ", data$docu) # Remove links
data$docu <- gsub("[[:punct:]]", " ", data$docu) # Remove punctuation)
data$docu<- gsub("[\r\n]", "", data$docu)
data$docu <- tolower(data$docu)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
data$docu <- tm::removeWords(x = data$docu, c(stopwords(kind = "SMART")))
data$docu <- gsub(" +", " ", data$docu) # General spaces (should just do all whitespaces no?)
myCorpus <- corpus(data$docu)
docvars(myCorpus, "Age") <- as.factor(data$Age)
processed <- textProcessor(data$docu, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)
out$documents
out$meta
levels(out$meta)
First_STM <- stm(documents = out$documents, vocab = out$vocab,
K = 4, prevalence =~ Age ,
max.em.its = 25, data = out$meta,
init.type = "LDA", verbose = FALSE)
As shown in the code, I tried to define Age as a factor. I think that may not even be needed, because running textProcessor might be enough. Nevertheless, when I run levels(out$meta) I get NULL, and when I then run stm to get the actual topics I get a memory allocation error.
You set your meta variable Age as a factor in this line:
docvars(myCorpus, "Age") <- as.factor(data$Age)
But you never use myCorpus after that. In the following steps you use your data frame data for the preprocessing instead. Try defining Age as a factor directly in the data frame:
data$Age <- factor(data$Age)
and only then run the preprocessing:
processed <- textProcessor(data$docu, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)
You can then look at the levels like this:
levels(out$meta$Age)
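To double-check that the covariate survived the preprocessing, a quick diagnostic (not specific to your file) is to inspect the metadata returned by prepDocuments:
str(out$meta)          # Age should appear as a factor column
table(out$meta$Age)    # counts per level, e.g. Young / Old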
I could not reproduce your memory allocation error, though; stm runs fine on my machine (Win 10 Pro, 8 GB RAM).
Hi, I'm working with the last example in this tutorial, "Topic proportions over time":
https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html
I ran it on my data with this code:
library(readxl)
library(tm)
# Import text data
tweets <- read_xlsx("C:/R/data.xlsx")
textdata <- tweets$text
#Load in the library 'stringr' so we can use the str_replace_all function.
library('stringr')
#Remove URLs
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")
textdata <- gsub("#\\w+", " ", textdata) # Remove hashtags
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new <- datatm[rowTotals> 0, ]
library("ldatuning")
library("topicmodels")
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
#####################################################
#topics by year
tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
top5termsPerTopic <- terms(ldaTopics, 7)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(k+1)] <- topicNames
# reshape data frame (melt() comes from the reshape2 package)
library(reshape2)
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "decade") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here is the Excel file with the input data:
https://www.mediafire.com/file/4w2hkgzzzaaax88/data.xlsx/file
I get the error when I run the line with the aggregate function. I can't figure out what is going wrong with aggregate: I created the "decade" variable the same way as in the tutorial, and when I print it, it looks fine; the theta variable also looks fine. I have changed the aggregate call several times, following, for example, this post:
Error in aggregate.data.frame : arguments must have same length
But I still get the same error. Please help.
I am not sure what you want to achieve with this command:
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
As far as I can see, you produce only one decade with:
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
table(tweets$decade)
# 2010
# 3481
With all the preprocessing from tweets to textdata you're producing a few empty lines. This is where your problem starts.
Textdata with its new empty lines is the basis of your corpus and your dtm. You get rid of them with the lines:
ui = unique(dtm$i)
dtm.new = dtm[ui,]
At the same time you are deleting the empty rows (documents) in the dtm, thereby changing the length of your object. This new dtm without the empty documents is then the new basis for the topic model. It comes back to haunt you when you call aggregate() with two objects of different lengths: tweets$decade still has the old length of 3481, while theta is produced by the topic model, which in turn is based on dtm.new -- remember, the one with fewer rows.
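A quick way to see the mismatch is to compare the two lengths directly (just a diagnostic, using the objects already defined above):
nrow(theta)            # number of documents kept in dtm.new and hence in the topic model
length(tweets$decade)  # number of rows in the original data frame
aggregate() requires these two to be identical.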
What I would suggest is to first add an ID column to tweets. Later on you can use the IDs to find out which texts were deleted by the preprocessing and to match the lengths of tweets$decade and theta.
I rewrote your code -- try this out:
library(readxl)
library(tm)
# Import text data
tweets <- read_xlsx("data.xlsx")
## Include ID for later
tweets$ID <- 1:nrow(tweets)
textdata <- tweets$text
#Load in the library 'stringr' so we can use the str_replace_all function.
library('stringr')
#Remove URLs
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")
textdata <- gsub("#\\w+", " ", textdata) # Remove hashtags
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See: https://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new <- datatm[rowTotals> 0, ]
library("ldatuning")
library("topicmodels")
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
#####################################################
#topics by year
tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)
id <- data.frame(ID = dtm.new$dimnames$Docs)
colnames(id) <- "ID"
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
tweets_new <- merge(id, tweets, by.x = "ID", by.y = "ID", all.x = TRUE)
# merge() coerces ID to character and sorts it lexicographically ("1", "10", "100", ...),
# so restore the document order of dtm.new (and therefore of theta) before aggregating
tweets_new <- tweets_new[match(dtm.new$dimnames$Docs, tweets_new$ID), ]
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets_new$decade), mean)
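From here you can pick up the remaining steps of the tutorial with the aligned objects. A sketch that reuses your own variable names (reshape2 and pals are assumed to be installed):
top5termsPerTopic <- terms(ldaTopics, 7)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse = " ")
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(k + 1)] <- topicNames
# reshape data frame and plot topic proportions per decade as bar plot
library(reshape2)
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
require(pals)
ggplot(vizDataFrame, aes(x = decade, y = value, fill = variable)) +
  geom_bar(stat = "identity") + ylab("proportion") +
  scale_fill_manual(values = paste0(alphabet(20), "FF")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))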
I am currently looking to perform some text mining on 25,000 YouTube comments, which I gathered using the tuber package. I am very new to coding, and with all the different information out there this can be a bit overwhelming at times. I have already cleaned the corpus that I created:
library(tm)
library(textclean)   # for replace_word_elongation further down
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Add extra stopwords
myStopwords <- c(stopwords('english'),"im", "just", "one","youre",
"hes","shes","its","were","theyre","ive","youve","weve","theyve","id")
# Remove stopwords from corpus
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z""0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)}))
# Replace word elongations ("soooo" -> "so") using the textclean package by Tyler Rinker.
corpus <- tm_map(corpus, content_transformer(replace_word_elongation))
# Creating data frame from corpus
corpus_asdataframe<-data.frame(text = sapply(corpus, as.character),stringsAsFactors = FALSE)
# Due to pre-processing some rows are empty. Therefore, the empty rows should be removed.
# Remove empty rows from data frame and "NA's"
corpus_asdataframe <-corpus_asdataframe[!apply(is.na(corpus_asdataframe) | corpus_asdataframe == "", 1, all),]
corpus_asdataframe<-as.data.frame(corpus_asdataframe)
# Create corpus of clean data frame
corpus <- Corpus(VectorSource(corpus_asdataframe$corpus_asdataframe))
Now the issue is that there are a lot of Spanish and German comments in my corpus which I would like to exclude. I thought it might be possible to download an English dictionary and use an inner join to detect English words and remove all other languages. However, I am very new to coding (I am studying business administration and never had to do anything with computer science), so my skills are not sufficient to apply this idea to my corpus (or data frame). I really hope to find a little help here. That would be very much appreciated! Thank you and best regards from Germany!
One way to do this without a dictionary join is to detect the language of each text with the cld3 package and keep only the rows classified as English. A small reproducible example:
dftest <- data.frame(
id = 1:3,
text = c(
"Holla this is a spanish word",
"English online here",
"Bonjour, comment ça va?"
)
)
library("cld3")
subset(dftest, detect_language(dftest$text) == "en")
## id text
## 1 1 Holla this is a spanish word
## 2 2 English online here
CREDIT: Ken Benoit at: Find in a dfm non-english tokens and remove them
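Applied to your own data, you could filter the comments before building the corpus. A sketch using the column name from your code; note that detect_language() returns NA when it is unsure, and the filter below drops those rows as well:
library(cld3)
lang <- detect_language(comments_final$textOriginal)
comments_en <- comments_final[!is.na(lang) & lang == "en", ]
corpus <- Corpus(VectorSource(comments_en$textOriginal))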
Has anyone managed to create a massive find/replace function or working code snippet that swaps out known bigrams in a data frame?
Here's an example. I'm able to do onesie-twosie replacements, but I really want to leverage a known lexicon of about 800 terms that I want to find and replace so they become single word units prior to DTM generation. For example, I want to turn "Google Analytics" into "google-analytics".
I know it's theoretically possible; essentially, a custom stopwords list does almost the same thing functionally, except without the replacement. And it seems silly to just have 800 gsub calls.
Here's my current code. Any help/pointers/URLs/RTFMs would be greatly appreciated.
library(tm)
mystopwords <- read.csv(stopwords.file, header = FALSE)
mystopwords <- as.character(mystopwords$V1)
mystopwords <- c(mystopwords, stopwords())
# load the file
df <- readLines(file.name)
# transform to corpus
doc.vec <- VectorSource(df)
doc.corpus <- Corpus(doc.vec)
# summary(doc.corpus)
## Hit known phrases
doc.corpus <- tm_map(doc.corpus, content_transformer(gsub), pattern = "Google Analytics", replacement = "google-analytics")
## Clean up and fix text - note, no stemming
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removePunctuation,preserve_intra_word_dashes = TRUE)
doc.corpus <- tm_map(doc.corpus, removeNumbers)
doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"),mystopwords))
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
The corpus library allows you to combine multi-word phrases into single tokens. When there are multiple matches, it chooses the longest one:
library(corpus)
text_tokens("I live in New York City, New York",
combine = c("new york city", "new york"))
# [[1]]
# [1] "i" "live" "in" "new_york_city"
# [5] "," "new_york"
By default, the connector is the underscore character (_), but you can specify an alternative using the connector argument.
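Since you specifically want hyphenated units like "google-analytics", you can override the connector in the same way the combine property is passed above (a small sketch; the output shown is what I would expect, not verified against your data):
text_tokens("We use Google Analytics daily",
            combine = "google analytics",
            connector = "-")
# should give something like:
# [[1]]
# [1] "we"                "use"               "google-analytics"  "daily"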
In your example, you could do the following to get a document-by-term matrix:
mycombine <- c("google analytics", "amazon echo") # etc.
term_matrix(doc.corpus, combine = mycombine,
drop_punct = TRUE, drop_number = TRUE,
drop = c(stopwords_en, mystopwords))
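If your ~800 phrases live in a file, there is no need to type them out; something along these lines would build the combine vector (the file name and column are hypothetical):
phrase_file <- read.csv("phrases.csv", header = FALSE, stringsAsFactors = FALSE)
mycombine <- tolower(phrase_file$V1)   # e.g. "google analytics", "amazon echo", ...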
Note also that corpus keeps intra-word hyphens, so there's no need for a preserve_intra_word_dashes option.
It can be a hassle to specify the preprocessing options in every function call. If you'd like, you can convert your corpus to a corpus_frame (a data.frame with a special text column), then set the preprocessing options (the text_filter):
corpus <- as_corpus_frame(doc.corpus)
text_filter(corpus) <- text_filter(combine = mycombine,
drop_punct = TRUE,
drop_number = TRUE,
drop = c(stopwords_en, mystopwords))
After that, you can just call
term_matrix(corpus)
There's a lot more information about corpus, including an introductory vignette, at http://corpustext.com
I am doing stemming on my dataset for sentiment analysis and I get this error message:
"Error in structure(if (length(n)) n else NA, names = x) :
'names' attribute [2] must be the same length as the vector [1]"
Please help!
myCorpus<-Corpus(VectorSource(Datasetlow_cost_airline$text))
# Convert to lower case
myCorpus<-tm_map(myCorpus,tolower)
# Remove punctuation
myCorpus<-tm_map(myCorpus,removePunctuation)
# Remove numbers
myCorpus<-tm_map(myCorpus,removeNumbers)
# Remove URLs (see ?regex and ?gsub for the pattern matching used here)
removeURL<-function(x)gsub("http[[:alnum:]]*","",x)
myCorpus<-tm_map(myCorpus,removeURL)
stopwords("english")
# Add two extra stop words: 'available' and 'via'
myStopwords<-c(stopwords("english"),"available","via","can")
# Remove stopwords from corpus
myCorpus<-tm_map(myCorpus,removeWords,myStopwords)
# Keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy<-myCorpus
# Stem word (change all the words to its root word)
myCorpus<-tm_map(myCorpus,stemDocument)
# Inspect documents (tweets) numbered 11 to 15
for(i in 11:15){
cat(paste("[[",i,"]]",sep=""))
writeLines(strwrap(myCorpus[[i]],width=73))
}
# Stem completion
myCorpus<-tm_map(myCorpus,stemCompletion,dictionary=myCorpusCopy)
There seems to be something odd about the stemCompletion function in tm version 0.6. There is a nice workaround here that I've used for this answer. In brief, replace your
# Stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy) # use spaces!
with
# Stem completion
stemCompletion_mod <- function(x, dict) {
  stems <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}
# apply workaround function
myCorpus <- lapply(myCorpus, stemCompletion_mod, myCorpusCopy)
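Note that lapply() returns a plain list rather than a tm corpus, so if you need corpus methods afterwards (for example to build a TermDocumentMatrix), you can wrap the result again; a small sketch, assuming the workaround above ran without errors:
myCorpus <- Corpus(VectorSource(sapply(myCorpus, as.character)))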
If that doesn't help then you'll need to give more details and a sample of your actual data.