Text mining: how to build a term-document matrix

I am trying to load a CSV file and convert it to a term-document matrix.
Here is part of my code:
myCorpus<-read.csv('alert-sample-data-4-mining.csv', head=TRUE)
TermDocumentMatrix(myCorpus, control=list(wordLengths=c(1,Inf)))
But I get an error message that says: Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"

A few things here -- you're not loading the tm library and you're not creating a corpus. Try something like this (assuming your text data is in a field called "text" in the csv file):
library(tm)
myCorpus <- read.csv("alert-sample-data-4-mining.csv")
corpus <- Corpus(VectorSource(myCorpus$text))
TermDocumentMatrix(corpus)
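For a version that can be run without the original file, here is a minimal sketch that writes a small made-up CSV to a temporary file first. The sample alerts, the temporary path, and the column name "text" are assumptions for illustration only; the wordLengths control is the one from the question.

```r
library(tm)

# Hypothetical sample data standing in for alert-sample-data-4-mining.csv
csv_path <- tempfile(fileext = ".csv")
sample_alerts <- data.frame(text = c("disk is full",
                                     "cpu load is high",
                                     "disk failed"))
write.csv(sample_alerts, csv_path, row.names = FALSE)

# read.csv() returns a data.frame, which TermDocumentMatrix() cannot use directly
alerts <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Wrap the text column in a corpus, one document per row
corpus <- Corpus(VectorSource(alerts$text))

# wordLengths = c(1, Inf) keeps short words that the default (minimum 3) drops
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
inspect(tdm)
```

The resulting matrix has one column per CSV row and one row per term.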

Related

tm_filter with VCorpus on Kaggle gives an error

I want to filter a VCorpus (tm package) by file name.
library(tm)
# Create a tm VCorpus from a folder containing more than 100 .txt files
cuisine_corpus <- VCorpus(DirSource("~/Documents/filesfolder"))
# Keep only the documents whose ids appear in the 50 file names in `filenames`
cuisine_corpus_sub <- tm_filter(cuisine_corpus, FUN = function(x) meta(x)[["id"]] %in% filenames)
In RStudio this works well.
But on Kaggle it gives an error:
meta() only works on corpus, dfm, dictionary2, tokens objects.
Alternatively, how can I filter the files in the DirSource path?
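One possible cause, which is an assumption rather than something the question confirms, is that another package with its own meta() function (the wording of the error resembles quanteda's) is attached in the Kaggle session and masks tm's version. A minimal sketch of a workaround is to call tm::meta() explicitly. The three in-memory documents and the choice of ids below are made up so the example runs without a folder of files.

```r
library(tm)

# Hypothetical stand-in for VCorpus(DirSource(...)): three in-memory documents
cuisine_corpus <- VCorpus(VectorSource(c("pasta with tomato sauce",
                                         "sushi with rice",
                                         "tacos with salsa")))

# Collect the document ids, calling tm's meta() by its full name
ids <- unlist(lapply(cuisine_corpus, function(d) tm::meta(d, "id")))

# Hypothetical list of ids to keep (first and third document)
filenames <- ids[c(1, 3)]

# The namespace prefix avoids any meta() that another package may have masked
cuisine_corpus_sub <- tm_filter(
  cuisine_corpus,
  FUN = function(x) tm::meta(x, "id") %in% filenames
)
length(cuisine_corpus_sub)
```

To restrict the files before the corpus is built, DirSource() also takes a pattern argument (a regular expression matched against file names).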

Convert tokens to corpus

I have a variable named df, which is a character vector.
As a preprocessing step I would like to remove the standard stopwords as well as my own list of stopwords. After that I would like to create a corpus and a dfm from the result.
I use the following commands:
library(quanteda)
datastopwords_removed <- tokens_remove(tokens(df2, remove_punct = TRUE), c(stopwords("english"), mystopwords$phrases))
mycorpus <- corpus(datastopwords_removed)
myDfm <- dfm(datastopwords_removed, ngrams = c(1,5))
But the corpus() call gives this error:
Error in UseMethod("corpus") :
no applicable method for 'corpus' applied to an object of class "tokens"
How can I fix it? Also, my stopword list contains phrases with more than one token. Do these need any special handling? The code runs without an error, so I assume they are being removed.
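A minimal sketch of one way around this, assuming a recent quanteda release: dfm() accepts a tokens object directly, so the corpus() step can be skipped once the text has been tokenized. N-grams are formed here with tokens_ngrams() rather than an ngrams argument to dfm(), and multi-word stopwords are wrapped in phrase() so that they are matched as token sequences; without phrase(), a pattern such as "lazy dog" is compared against single tokens and may silently match nothing. The names txt and my_phrases and the n-gram sizes are made up for illustration.

```r
library(quanteda)

# Hypothetical input text and multi-word stopword list
txt <- c(d1 = "The quick brown fox jumps over the lazy dog.",
         d2 = "A lazy dog sleeps all day long.")
my_phrases <- c("lazy dog")

# Tokenize once; all later steps work on the tokens object
toks <- tokens(txt, remove_punct = TRUE)

# Remove multi-word phrases first, while their tokens are still adjacent
toks <- tokens_remove(toks, pattern = phrase(my_phrases))

# Then remove single-word stopwords
toks <- tokens_remove(toks, pattern = stopwords("english"))

# Build unigrams and bigrams, then the document-feature matrix
toks_ng <- tokens_ngrams(toks, n = 1:2)
my_dfm <- dfm(toks_ng)
featnames(my_dfm)
```

Removing the phrases before the single-word stopwords matters: once a stopword between two phrase tokens is dropped, or one of the phrase tokens itself is a stopword, the sequence no longer matches.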

Error in creating TermDocumentMatrix using tm package in R

I am unable to create a term-document matrix with the tm package in R. When I try to build one from a preprocessed corpus, I get the following error:
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class
"character"
Below is the script I am using, with R v3.4.1 and tm package v0.7-1.
data <- readLines("Data/en_US/en_US_sample.txt", n = 100)
data <- Corpus(VectorSource(data))
data <- tm_map(data, removePunctuation)
data <- tm_map(data, removeNumbers)
data <- tm_map(data, content_transformer(tolower))
data <- tm_map(data, removeWords, stopwords("en"))
data <- tm_map(data, stripWhitespace)
words <- TermDocumentMatrix("data")
I believe TermDocumentMatrix requires the corpus to be in some specific text document format, so I tried coercing my corpus to PlainTextDocument using tm_map, but that doesn't solve the problem. When I load my text data using Corpus on a VectorSource, the object created has the class SimpleCorpus, which might be the problem, but I am not sure.
Any help would be much appreciated. Thanks!

You did everything right. In your last line, however, you accidentally passed the character string "data" (note the quotation marks) to TermDocumentMatrix() instead of the object data.
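The fix can be sketched with a self-contained example. The two sample sentences stand in for the lines read from Data/en_US/en_US_sample.txt and are made up; VCorpus is used here instead of Corpus purely as a choice for the sketch, and the pipeline otherwise follows the question.

```r
library(tm)

# Hypothetical sample lines standing in for readLines(...)
lines <- c("The cat sat on the mat, 3 times!",
           "Dogs chase 2 cats.")

data <- VCorpus(VectorSource(lines))
data <- tm_map(data, removePunctuation)
data <- tm_map(data, removeNumbers)
data <- tm_map(data, content_transformer(tolower))
data <- tm_map(data, removeWords, stopwords("en"))
data <- tm_map(data, stripWhitespace)

# Pass the corpus object itself, not the string "data"
words <- TermDocumentMatrix(data)
inspect(words)
```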

How to select words from corpus for TermDocumentMatrix creation in tm

I want to retain only pattern words (i.e. gene names that I have specified) from each document of my corpus to generate the dtm. I do not want to pre-process the documents before corpus creation; I only want to select and retain the gene names from the corpus. I have used a custom function to keep only the terms in "pattern" and remove everything else (see: How to select only a subset of corpus terms for TermDocumentMatrix creation in tm). Here is my code.
library(tm)
library(Rstem)
library(RTextTools)
docs <- Corpus(DirSource(path of the directory))
# Custom function to keep only the terms in "pattern" and remove everything else
f <- content_transformer(function(x, pattern)regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE)))
# The pattern i want to search for
gene = "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, gene)[[1]]
However, I get this error:
Error in UseMethod("content", x) : no applicable method for 'content' applied to an object of class "character"

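Two things in the code above may contribute to the error, though this is an assumption since the question has no confirmed answer: regmatches() returns a list rather than a character vector, and the trailing [[1]] replaces the corpus with a single element. A minimal sketch of an alternative pastes the matches back into one string per line and keeps the whole corpus. The two sample documents are made up, and the added word boundaries (so that IL1 does not match inside IL10) are a change to the original pattern.

```r
library(tm)

# Hypothetical documents standing in for Corpus(DirSource(...))
docs <- VCorpus(VectorSource(c("IL6 and TNF were elevated; IL10 was not.",
                               "No cytokines here, only OLR1.")))

# Gene pattern from the question, wrapped in word boundaries (an added assumption)
gene <- "\\b(?:IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2)\\b"

# Keep only the matches, returned as a character vector (one string per input line)
keep_matches <- content_transformer(function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  vapply(hits, paste, character(1), collapse = " ")
})

# No [[1]] here: the result stays a corpus
docs <- tm_map(docs, keep_matches, gene)

# wordLengths = c(1, Inf) so that short gene names are not dropped
dtm <- DocumentTermMatrix(docs, control = list(wordLengths = c(1, Inf)))
inspect(dtm)
```
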
R text mining documents from CSV file

First of all, my apologies for repeating a question that was asked on Aug 1 '13; I cannot comment on the original question because commenting requires 50 reputation, which I don't have. The original question is "R text mining documents from CSV file (one row per doc)".
I am trying to work with the tm package in R, and have a CSV file of article abstracts with each line being a different abstract. I want each line to be a different document within the corpus. There are 2,000 rows in my data set.
I ran the following code, as previously suggested by Ben:
# change this file location to suit your machine
file_loc <- "C:/Users/.../docs.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
corp <- Corpus(DataframeSource(x))
docs <- DocumentTermMatrix(corp)
When I check class:
# checking class
class(docs)
[1] "DocumentTermMatrix" "simple_triplet_matrix"
The problem is tm transformations do not work on this class:
# Preparing the Corpus
# Simple Transforms
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
I get this error:
Error in UseMethod("tm_map", x) :
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"
Or, with another pattern:
docs <- tm_map(docs, toSpace, "/|#|nn|")
I get the same error:
Error in UseMethod("tm_map", x) :
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"
Your help would be greatly appreciated.

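The pattern in
docs <- tm_map(docs, toSpace, "/|#|nn|")
should be written as
docs <- tm_map(docs, toSpace, "/|#|\\|")
Note, however, that the error itself comes from calling tm_map on a DocumentTermMatrix. tm_map only works on a corpus, so apply the transformations to corp first and build the DocumentTermMatrix afterwards.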
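The order of operations can be sketched as follows: transform the corpus first, then build the matrix. The two-row data frame is made up; the doc_id and text columns are what recent tm versions expect from DataframeSource, which is an assumption about the installed version.

```r
library(tm)

# Hypothetical stand-in for read.csv(file_loc): one abstract per row
x <- data.frame(doc_id = c("a1", "a2"),
                text = c("first/abstract #one", "second|abstract two"),
                stringsAsFactors = FALSE)

# One document per row
corp <- VCorpus(DataframeSource(x))

# Replace "/", "#" and "|" with spaces; tm_map is applied to the corpus
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corp <- tm_map(corp, toSpace, "/|#|\\|")

# Only now build the document-term matrix
docs <- DocumentTermMatrix(corp)
inspect(docs)
```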
