R text mining documents from CSV file - r

First of all, my apologies for repeating a question that was asked on Aug 1 '13. I cannot comment on the original question because commenting requires 50 reputation, which I don't have. The original question is "R text mining documents from CSV file (one row per doc)".
I am trying to work with the tm package in R, and have a CSV file of article abstracts with each line being a different abstract. I want each line to be a different document within the corpus. There are 2,000 rows in my data set.
I ran the following code, as previously suggested by Ben:
# change this file location to suit your machine
file_loc <- "C:/Users/.../docs.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
corp <- Corpus(DataframeSource(x))
docs <- DocumentTermMatrix(corp)
When I check class:
# checking class
class(docs)
[1] "DocumentTermMatrix" "simple_triplet_matrix"
The problem is tm transformations do not work on this class:
# Preparing the Corpus
# Simple Transforms
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
I get this error:
Error in UseMethod("tm_map", x) :
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"
Or, with another pattern:
docs <- tm_map(docs, toSpace, "/|#|nn|")
I get the same error:
Error in UseMethod("tm_map", x) :
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"
Your help would be greatly appreciated.

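The error comes from calling tm_map on docs, which at that point is a DocumentTermMatrix; tm_map only has methods for corpus objects. Apply the transformations to the corpus (corp) first and build the DocumentTermMatrix afterwards.
In addition, the pattern in
docs <- tm_map(docs, toSpace, "/|#|nn|")
ends with an empty alternative and should be replaced with
corp <- tm_map(corp, toSpace, "/|#|\\|")
so that the last alternative matches a literal | character.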
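Because tm_map needs a corpus as its input, one way to arrange the steps is to transform the corpus first and build the DocumentTermMatrix last. The following is a minimal sketch, not the original poster's exact setup: the inline data frame and its abstract column are hypothetical stand-ins for the CSV, and VCorpus(VectorSource(...)) is used here so the example does not depend on particular column names.

```r
library(tm)

# Hypothetical stand-in for read.csv(file_loc): one abstract per row
x <- data.frame(
  abstract = c("Gene/protein study #1",
               "A second abstract | with a pipe",
               "Third abstract"),
  stringsAsFactors = FALSE
)

# One document per row of the (assumed) abstract column
corp <- VCorpus(VectorSource(x$abstract))

# Replace /, # and a literal | with a space (| must be escaped in a regex)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corp <- tm_map(corp, toSpace, "/|#|\\|")

# Build the matrix only after all corpus transformations are done
dtm <- DocumentTermMatrix(corp)
```

Checking class(corp) before each tm_map call is a quick way to confirm that the object is still a corpus.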

Related

Error in tolower(txt) non character argument in R (for textmining)

As a beginner, I'm trying to do some simple text mining (NLP) in R.
I preprocessed my data with the tm_map function and inspected it; all punctuation and numbers were removed.
I also converted the text to lower case using the tolower() function.
It worked great.
But while creating a document-term matrix, I get this error:
error in tolower(txt): non character argument
What does this error mean, and how do I proceed?
Is it related to UTF-8?
Any leads would be appreciated.
docs <- tm_map(docs, removePunctuation)
inspect(docs[1])
for(j in seq(docs)) {
docs[[j]] <- gsub("\n", " ", docs[[j]])
}
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
This all worked: my text document (which is simply an ebook) was converted to lower case, with extra whitespace, numbers, etc. removed. The next step returns the error.
# returns the above error.
dtm <- DocumentTermMatrix(docs)
The problem is not in turning your corpus into a DocumentTermMatrix. The problem lies in your for loop: it replaces each document in the corpus with a plain character vector, so the corpus no longer contains text documents.
If you want to use gsub like this, you need to use the content_transformer function.
# removes the need of the for loop and keeps everything in a corpus.
docs <- tm_map(docs, content_transformer(function(x) gsub("\n", " ", x)))
This removes the need of the loop and keeps everything as it should be. After this line you can run the rest of your code without any issue.
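As a small check of the approach above, the sketch below applies the same content_transformer call to a two-document corpus and confirms that the elements are still text documents afterwards. The sample strings are hypothetical, and VCorpus is an assumption about how the corpus was built.

```r
library(tm)

# Hypothetical two-document corpus containing newlines
docs <- VCorpus(VectorSource(c("first line\nsecond line", "another\ndocument")))

# Replace newlines with spaces while keeping each element a text document
docs <- tm_map(docs, content_transformer(function(x) gsub("\n", " ", x)))

first_text <- content(docs[[1]])                        # text of the first document
still_doc  <- inherits(docs[[1]], "PlainTextDocument")  # class check after the transform

# The matrix can now be built without the tolower error
dtm <- DocumentTermMatrix(docs)
```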

tm_map: Can I use the removeWords function with my own stopwords stored in a txt file?

I'm using the R tm package for text analysis on a Facebook group, and find that the removeWords function isn't working for me. I tried to combine the French stopwords with my own, but they still appear. So I created a file named "french.txt" with my own list and read it with the following commands:
nom_fichier <- "Analyse textuelle/french.txt"
my_stop_words <- readLines(nom_fichier, encoding="UTF-8")
Here is the data for text mining:
text <- readLines(groupe_fb_ief, encoding="UTF-8")
docs <- Corpus(VectorSource(text))
inspect(docs)
Here are the tm_map commands:
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, my_stop_words)
After applying that, it's still not working and I don't understand why. I even tried changing the order of the commands, with no result.
Do you have any idea? Is it possible to change the French stopwords within R? Where is this list located?
Thanks!!
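Rather than use removeWords, I typically use an anti_join() to remove all stop words.
library(dplyr)
library(tidytext)
# one stop word per row, in a column called "word"
stop_df <- tibble(word = my_stop_words)
# tokenise the raw text into one word per row
tokens <- tibble(text = text) %>%
  unnest_tokens(output = word, input = text, token = "words")
# anti_join keeps only the tokens that are not stop words
tokens <- anti_join(tokens, stop_df, by = "word")
This assumes that the column holding the words is called "word" in both data frames. Hope this helps.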
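If you would rather stay within tm, removeWords can take the built-in French list and a custom list combined with c(). One possible reason custom entries survive (an assumption, since the contents of french.txt are not shown) is that the entries carry stray whitespace or capital letters, so they no longer match the lower-cased text. A minimal sketch with hypothetical stop words and sentences:

```r
library(tm)

# Hypothetical contents of french.txt, with stray whitespace and capitals
my_stop_words <- c("Alors ", " merci", "groupe")

# Normalise the list so it matches the lower-cased documents
my_stop_words <- tolower(trimws(my_stop_words))

docs <- VCorpus(VectorSource(c("Alors merci pour le groupe",
                               "Donc on continue ensemble")))
docs <- tm_map(docs, content_transformer(tolower))

# stopwords("french") is a plain character vector and can be extended with c()
docs <- tm_map(docs, removeWords, c(stopwords("french"), my_stop_words))
docs <- tm_map(docs, stripWhitespace)

cleaned_first <- content(docs[[1]])
```

Printing stopwords("french") shows the built-in list as an ordinary character vector.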

Error in creating TermDocumentMatrix using tm package in R

I am unable to create a term-document matrix with the tm package in R. The following error is thrown when I try to create one from a preprocessed corpus.
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class
"character"
Below is my script that I am using. I am using R v3.4.1 with tm package v0.7-1.
data <- readLines("Data/en_US/en_US_sample.txt", n = 100)
data <- Corpus(VectorSource(data))
data <- tm_map(data, removePunctuation)
data <- tm_map(data, removeNumbers)
data <- tm_map(data, content_transformer(tolower))
data <- tm_map(data, removeWords, stopwords("en"))
data <- tm_map(data, stripWhitespace)
words <- TermDocumentMatrix("data")
I believe TermDocumentMatrix requires the corpus to be in some specific text document format, so I tried coercing my corpus to PlainTextDocument using tm_map, but that doesn't solve the problem. When I load my text data using Corpus on a VectorSource, the object created has class SimpleCorpus, which might be the problem, but I am not sure.
Any help would be much appreciated. Thanks!
You did everything right; in your last line you just accidentally passed the character string "data" (note the quotation marks) to the function TermDocumentMatrix() instead of the object data.
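A minimal sketch of the corrected call follows. The two sample sentences are hypothetical stand-ins for the lines read from en_US_sample.txt, and VCorpus is used here as an assumption; the key point is only that the object is passed without quotation marks.

```r
library(tm)

# Hypothetical sample lines standing in for the readLines() result
data <- c("The quick brown fox, 2 times!", "Jumps over the lazy dog.")

data <- VCorpus(VectorSource(data))
data <- tm_map(data, removePunctuation)
data <- tm_map(data, content_transformer(tolower))
data <- tm_map(data, removeWords, stopwords("en"))

# Pass the corpus object itself, not the string "data"
words <- TermDocumentMatrix(data)
```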

R tm document names missing

Using the R tm package, I create a corpus as usual:
mycorpus <- Corpus(DirSource(folder,pattern="txt"))
Please note that I am not using an encoding variable. summary(mycorpus) lists the document names. However, after a series of tm_map transforms:
(content_transformer(tolower),content_transformer(removeWords), stopwords("SMART"),stripWhitespace)
ending with mycorpus <- tm_map(mycorpus, PlainTextDocument) and mydtm <- DocumentTermMatrix(mycorpus, control = list(...))
the document names are missing when I run inspect(mydtm[1:10, intersect(colnames(mydtm), 'toyota')]) to get my variable of choice:
              Terms
Docs           toyota
  character(0)      0
  character(0)      0
  character(0)      0
  character(0)      0
  character(0)      1
  character(0)      0
  character(0)      0
  character(0)      0
  character(0)      1
  character(0)      0
The file names (doc ids) have disappeared. Any idea what could be causing this? More importantly, how do I reinstate the document names? Many thanks.
The code below works for a single file. You could likely use something like list.files to read all the files in a directory.
First, I would wrap the cleaning functions in a custom function. Note that the order matters, and that you have to use content_transformer if the function is not from tm.
clean.corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, custom.stopwords)
  return(corpus)
}
Then concatenate the English stop words with your custom words. This vector is used in the last step of the custom function above.
custom.stopwords <- c(stopwords('english'), 'lol', 'smh')
doc<-read.csv('coffee.csv', header=TRUE)
The CSV has one column with the text of each tweet and another column with an ID for each tweet. The file, from my workshop, is here.
The CSV file is now in memory, so the next step is to read it in tabular fashion with a specific mapping when making a corpus. Here the content is in a column called "text" and the unique ID is in a column named "id".
custom.reader <- readTabular(mapping=list(content="text", id="id"))
corpus <- VCorpus(DataframeSource(doc), readerControl=list(reader=custom.reader))
corpus<-clean.corpus(corpus)
The corpus creation uses the readerControl; once that is done you can apply the pre-processing steps. Without the reader control the package assigns character(0) as the document name.
The corpus content of document 1 can be accessed here
corpus[[1]][1]
You can review the corpus meta data for the first document with this code
corpus[[1]][2]
So I think you need to use readTabular and readerControl when constructing your corpus, no matter the source.
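In more recent tm releases (0.7-2 and later, to my knowledge) readTabular is no longer available, and DataframeSource instead expects the data frame to have columns named doc_id and text. Under that assumption, the same idea can be sketched without a custom reader. The two tweets and their IDs below are hypothetical, not taken from coffee.csv.

```r
library(tm)

# Hypothetical tweets; newer DataframeSource expects doc_id and text columns
doc <- data.frame(
  doc_id = c("t1", "t2"),
  text   = c("I love coffee lol", "toyota makes cars smh"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(doc))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "lol", "smh"))

dtm <- DocumentTermMatrix(corpus)

# Row names of the matrix, taken from the doc_id column
doc_names <- Docs(dtm)
```

If you are on an older tm version where readTabular still exists, the readerControl approach above applies instead.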
I was having the same problem and realized that it was due to tolower. Unlike removeNumbers, removePunctuation, removeWords, stemDocument and stripWhitespace, tolower is not a transformation defined in the tm package. To get a list of the transformations defined in tm that can be applied directly to a corpus, type:
getTransformations()
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
Thus, to use tolower you must first wrap it in content_transformer so that it handles corpus objects properly:
docs <- tm_map(docs, content_transformer(tolower))
The above line of code should stop the documents from being renamed to character(0).
The same trick makes any R function work with corpora. For example, for gsub the following syntax applies:
docs <- tm_map(docs, content_transformer(gsub), pattern = "internt", replacement = "internet")

How to select words from corpus for TermDocumentMatrix creation in tm

I want to retain only the pattern words (i.e. gene names that I have specified) from each document of my corpus to generate the dtm. I do not want to pre-process the documents before corpus creation; I only want to select and retain the gene names from the corpus. I have used a custom function to keep only the terms in "pattern" and remove everything else (from "How to select only a subset of corpus terms for TermDocumentMatrix creation in tm"). Here is my code.
library(tm)
library(Rstem)
library(RTextTools)
docs <- Corpus(DirSource(path of the directory))
# Custom function to keep only the terms in "pattern" and remove everything else
f <- content_transformer(function(x, pattern)regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE)))
# The pattern i want to search for
gene = "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, gene)[[1]]
However, I get the error
" Error in UseMethod("content", x) :no applicable method for 'content' applied to an object of class "character" "
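No answer is recorded here, so the following is only a hedged sketch of one possible fix, not a confirmed diagnosis. It assumes two things contribute to the error: regmatches() returns a list rather than a character vector, and the trailing [[1]] replaces the whole corpus with a single element. The sample sentences are hypothetical, and the pattern is rewritten with word boundaries and IL10 listed first so that IL10 is not matched as IL1.

```r
library(tm)

# Hypothetical documents standing in for the files in the directory
docs <- VCorpus(VectorSource(c("Levels of IL6 and TNF were raised",
                               "TGF and il2 signalling")))

# Assumed pattern: word boundaries, with IL10 before the single-digit names
gene <- "\\b(IL10|IL[1-9]|TNF|TGF|AP2|OLR1|OLR2)\\b"

# Return one character string per document instead of a list of matches
keep_matches <- content_transformer(function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  paste(unlist(hits), collapse = " ")
})

# Keep the result as a corpus (no trailing [[1]])
docs <- tm_map(docs, keep_matches, gene)

# tolower = FALSE keeps the gene names in their original case
dtm <- DocumentTermMatrix(docs, control = list(tolower = FALSE))
```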
