I need to train a model which would perform multilabel multiclass categorization on text data.
Currently, i'm using mlr package in R. But unluckily I didn't proceed further because of the error I got it before training a model.
More specifically I'm stuck in this place:
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
and, got this error
Error in makeMultilabelTask(id = "classif", data = termsDf, target = target) :
Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
I used this example: -
Multi-label text classification using mlr package in R
Here is a complete code snippet i'm using so far,
tm <- read.csv("translate_text_V02.csv", header = TRUE,
stringsAsFactors = FALSE, na.strings = c("", "NA"))
process <- tm[, c("label", "text")]
process <- na.omit(process)
docs <- Corpus(VectorSource(process$text))
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, mystopwords)
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, removeWords, stopwords("german"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument, language = "english")
return(corpus)
}
clean_corp <- clean_corpus(docs)
terms <-DocumentTermMatrix(clean_corp)
m <- as.matrix(terms)
m <- cbind(m,process$label)
termsDf <- as.data.frame(m)
target <- unique(termsDf[,2628]) %>% as.character() %>% sort()
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
I created the data frame after Document term matrix with the label class. but I'm stuck afterwords how can I proceed further with machine learning part?
Questions for kind answer: -
How can I proceed further with the creation of DocumentTermMatrix?
How to apply the random-forest algorithm on this particular dataset?
Related
Please, I need your help to resolve this issue.
I created a spam filtering model using random forest.
I am trying to predict with the random forest model I trained and saved as .rds file.
I keep getting the error
Error in eval(predvars, data, env) : object 'check' not found
Here is my code
--- code ---
library(randomForest)
library(caret)
library(tm)
library(caTools)
library(dplyr)
library(gmodels)
#library(Boruta)
library(mlbench)
#library(ROCR)
#Type <- c("ham")
Text <- c("If you handle newsletter for can multiple websites")
df <- data.frame(Type, Text)
corpus = df$Text
#remove non ASCII codes and emoticons
corpus = gsub("[^\x01-\x7F]", "", corpus)
#remove HTML tags
corpus = gsub("<.*/>","",corpus)
#Remove all the URLs
corpus = gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", corpus)
#remove numbers
corpus = removeNumbers(corpus)
#max no of characters in a mail
max(nchar(corpus))
corpus = Corpus(VectorSource(corpus))
corpus = tm_map(corpus , tolower)
corpus = tm_map(corpus , removePunctuation)
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl=T))
corpus <- tm_map(corpus, removeURL)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "#")
corpus <- tm_map(corpus, toSpace, "\\|")
corpus = tm_map(corpus ,removeWords , stopwords("english"))
corpus = tm_map(corpus , stemDocument)
dtm = DocumentTermMatrix(corpus)
dtm
sptdm = removeSparseTerms(dtm , 0.98)
#sptdm = dtm
sptdm
emailsparse= as.data.frame(as.matrix(sptdm))
names(emailsparse) = make.names(names(emailsparse))
emailsparse$labels = as.factor(df$Type)
set.seed(123)
##loading model
RF_model <- readRDS("RF_model_2022.09.15.13.35.37.rds")
#RF_model
class(emailsparse)
summary(RF_model)
predict(RF_model, newdata =
emailsparse, type='prob')
--- end of code ---
Here is the link to the model .rds file
Please, what am I doing wrongly and how do I fix this?
I have multiple *.txt files that contain the title and texts that I want to process in R. A program below reads all the *.txt and displays the final file while skipping the first read texts.
My program is as here below. It uses for loop and I want to see all the texts
library(here)
library(glue)
library(tm)
library(SnowballC)
library(tidyverse)
library(tidytext)
all_texts <- list.files(setwd('.KCI/'), (startsWith = 'abstract'))
for(i in seq(1:length(all_texts)))
{
data <- read_tsv(all_texts[i], , show_col_types = FALSE)
corpus <- Corpus(VectorSource(data[i]))
corpus[i] <- tm_map(corpus[i], tolower)
corpus[i] <- tm_map(corpus[i], removePunctuation)
corpus[i] <- tm_map(corpus[i], removeNumbers)
corpus[i] <- tm_map(corpus[i], stripWhitespace)
corpus[i] <- tm_map(corpus[i], removeWords, c(stopwords("english"), mystopwords))
corpus[i] <- tm_map(corpus[i], stemDocument)
dtm <- DocumentTermMatrix(corpus[i])
}
This program just reads the final document but skips the previous ones. Therefore I want even other documents to be displayed before the last one.
<Title> <Year> <Text>
How is it? 1998 I am wondering if it could end like that. Therefore the deal is too good to be true
This would be a lot easier if you had provided some data.
library(tm)
library(SnowballC)
##
# two documents based on your example (t1 & t2 are identical here).
#
t1 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
t2 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
data <- list(t1,t2) # listof documents
dtm.list <- lapply(data, function(x) {
corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
DocumentTermMatrix(corpus)
})
lapply(dtm.list, inspect)
Note I left out mystopwords because you did not provide any.
In your case you could put the read_tsv(...) back into the function and use lapply(...) in the list of file names. Something like:
dtm.list <- lapply(all.texts, function(x) {
data <- read_tsv(x)
corpus <- Corpus(VectorSource(data))
...
})
Where ... are the lines of code in my example above.
If your ultimate goal is to analyze word frequency, you might be better off using ?termFreq.
I have been using the code below to load text as a corpus and using the tm package to clean the text. As a next step I am loading a dictionary and cleaning it as well. Then I am matching the words from the text with the dictionary to calculate a score. However, the matching results in a higher number of matches than actual words in the text (e.g., the competence score is 1500 but the actual number of words in the text is only 1000).
I think it is related to the stemming of the text and the dictionary as the matches are lower when there is no stemming performed.
Do you have any ideas why this is happening?
Thank you very much.
R Code
Step 1 Storing data as corpus
file.path <- file.path(here("Generated Files", "Data Preparation")) corpus <- Corpus(DirSource(file.path))
Step 2 Cleaning data
#Removing special characters
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "#")
corpus <- tm_map(corpus, toSpace, "\\|")
#Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
#Remove numbers
corpus <- tm_map(corpus, removeNumbers)
#Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#Remove your own stop word
specify your stopwords as a character vector
corpus <- tm_map(corpus, removeWords, c("view", "pdf"))
#Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
#Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)
#Text stemming
corpus <- tm_map(corpus, stemDocument)
#Unique words
corpus <- tm_map(corpus, unique)
Step 3 DTM
dtm <- DocumentTermMatrix(corpus)
Step 4 Load Dictionaries
dic.competence <- read_excel(here("Raw Data", "6. Dictionaries", "Brand.xlsx"))
dic.competence <- tolower(dic.competence$COMPETENCE)
dic.competence <- stemDocument(dic.competence)
dic.competence <- unique(dic.competence)
Step 5 Count frequencies
corpus.terms = colnames(dtm)
competence = match(corpus.terms, dic.competence, nomatch=0)
Step 6 Calculate scores
competence.score = sum(competence) / rowSums(as.matrix(dtm))
competence.score.df = data.frame(scores = competence.score)
What does competence return when you run that line? I'm not sure how your dictionary is set up, so I can't say for certain what's happening there. I brought in my own random corpus text as the primary text and brought in a separate corpus as the dictionary and your code worked great. The row names of competence.score.df were the names of the different txt files in my corpus and the scores were all in a 0-1 range.
# this is my 'dictionary' of terms:
tdm <- TermDocumentMatrix(Corpus(DirSource("./corpus/corpus3")),
control = list(removeNumbers = TRUE,
stopwords = TRUE,
stemming = TRUE,
removePunctuation = TRUE))
# then I used your programming and it worked as I think you were expecting
# notice what I used here for the dictionary
(competence = match(colnames(dtm),
Terms(tdm)[1:10], # I only used the first 10 in my test of your code
nomatch = 0))
(competence.score = sum(competence)/rowSums(as.matrix(dtm)))
(competence.score.df = data.frame(scores = competence.score))
I have a problem with the word stemming completion of my created corpus using the tm package.
Here are the most important lines of my code:
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
corpus
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
"use", "see", "used", "via", "amp")
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages or more specifically anything with a non "a-z" and "0-9" character
corpus <- tm_map(corpus, content_transformer(function(s){
gsub(pattern = '[^a-zA-Z0-9\\s]+',
x = s,
replacement = " ",
ignore.case = TRUE,
perl = TRUE)
}))
# Keep a copy of the generated corpus for stem completion later as dictionary
corpus_copy <- corpus
# Stemming words of corpus
corpus <- tm_map(corpus, stemDocument, language="english")
Now to complete the word stemming I apply stemCompletion of the tm package.
# Completing the stemming with the generated dictionary
corpus <- tm_map(corpus, content_transformer(stemCompletion), dictionary = corpus_copy, type="prevalent")
However, this is where my corpus gets destroyed and messed up and the stemCompletion does not work properly. Peculiarly, R does not indicate an error, the code runs but the result is terrible.
Does anybody know a solution for this? BTW my "comments_final" data frame consist of youtube comments, which I downloaded using the tubeR package.
Thank you so much for your help in advance, I really need help for my master's thesis thank you.
It does seem to work in a bit weird way, so I came up with my own stemCompletion function and applied it to the corpus. In your case try this:
stemCompletion2 <- function(x, dictionary) {
# split each word and store it
x <- unlist(strsplit(as.character(x), " "))
# # Oddly, stemCompletion completes an empty string to
# a word in dictionary. Remove empty string to avoid issue.
x <- x[x != ""]
x <- stemCompletion(x, dictionary=dictionary)
x <- paste(x, sep="", collapse=" ")
PlainTextDocument(stripWhitespace(x))
}
corpus <- lapply(corpus, stemCompletion2, corpus_copy)
corpus <- as.VCorpus(corpus)`
Hope this helps!
I am new in supervised methods. Here is my way to normalize my data:
corpuscleaned1 <- tm_map(AI_corpus, removePunctuation) ## Revome punctuation.
corpuscleaned2 <- tm_map(corpuscleaned1, stripWhitespace) ## Remove Whitespace.
corpuscleaned3 <- tm_map(corpuscleaned2, removeNumbers) ## Remove Numbers.
corpuscleaned4 <- tm_map(corpuscleaned3, stemDocument, language = "english") ## Remove StemW.
corpuscleaned5 <- tm_map(corpuscleaned4, removeWords, stopwords("en")) ## Remove StopW.
head(AI_corpus[[1]]$content) ## Examine original txt.
head(corpuscleaned5[[1]]$content) ## Examine clean txt.
AI_corpus <- my corpus about Amnesty Int. reports 1993-2013.
I am trying to create a dendrogram in r based off an excel sheet for use in text mining. I have one large column, each cell with a string of text. I want the smallest branch of the dendrogram to represent an individual cell, yet when I run my script I instead get a dendrogram of every word within the entire excel file. How do I fix this?
library(tm)
library(stringi)
library(proxy)
Data <- read.csv(file.choose(),header=TRUE)
docs <- Corpus(VectorSource(Data))
docs[[1]]
docs1 <- tm_map(docs, PlainTextDocument)
docs2 <- tm_map(docs1, stripWhitespace)
docs3 <- tm_map(docs2, removeWords, stopwords("english"))
docs4 <- tm_map(docs3, removePunctuation)
docs5 <- tm_map(docs4, content_transformer(tolower))
docs5[[1]]
TermMatrix <- TermDocumentMatrix(docs5)
docsdissim <- dist(as.matrix(TermMatrix), method = "euclidean")
docsdissim2 <- as.matrix(docsdissim)
docsdissim2
h <- hclust(docsdissim, method = "ward.D2")