I am trying to work with the tm package in R and have a CSV file of customer feedback, with each line being a different instance of feedback. I want to import all of this feedback into a corpus, but I want each line to be a different document within the corpus so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set.
Originally I did the following:
fdbk_corpus <-Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t")
This creates a corpus with 1 document and >10,000 rows, and I want >10,000 docs with 1 row each.
I imagine I could just have 10,000+ separate CSV or TXT documents within a folder and create a corpus from that... but I'm thinking there is a much simpler answer than that, reading each line as a separate document.
Here's a complete workflow to get what you want:
# change this file location to suit your machine
file_loc <- "C:\\Documents and Settings\\Administrator\\Desktop\\Book1.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
# note: recent versions of tm expect the data frame passed to DataframeSource()
# to have 'doc_id' and 'text' columns
corp <- Corpus(DataframeSource(x))
dtm <- DocumentTermMatrix(corp)
In the dtm object each row will be a doc, or a line of your original CSV file. Each column will be a word.
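As a quick sanity check on the result (a minimal sketch, assuming the dtm built above):
dim(dtm)                         # number of feedback lines by number of terms
inspect(dtm[1:5, 1:10])          # peek at the first few documents and terms
findFreqTerms(dtm, lowfreq = 50) # terms appearing at least 50 times overall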
You can also use TermDocumentMatrix() on a corpus built from your fdbk object to obtain a term-document matrix in which each column represents a customer feedback entry.
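For example, a minimal sketch assuming fdbk is a character vector with one feedback entry per element:
library(tm)
tdm <- TermDocumentMatrix(Corpus(VectorSource(fdbk)))
# terms are rows and feedback entries are columns;
# use DocumentTermMatrix() instead if you want feedback entries as rows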
I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using a single Excel file and am only interested in text mining one of its columns. That column has about 1,200 rows. I want to create a row (sentence)-term matrix, i.e. a matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because I'm only using one file, so the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix with the frequency of words in each sentence.
Thank you!
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function - see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or otherwise apply correspondence analysis instead.
library(text2vec)
docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")
# Generate document-term matrix with text2vec
tokens = docs %>%
  word_tokenizer()
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
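Since you ultimately want a matrix of 1s and 0s for PCA, here is a minimal sketch of that conversion, assuming the dtm created above (with roughly 1,200 sentences a dense matrix is manageable):
m <- as.matrix(dtm)       # dense count matrix: one row per sentence, one column per word
m_binary <- 1 * (m > 0)   # 0/1 indicator matrix: whether the word occurs in the sentence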
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
Can't add comments so here's a suggestion:
library(data.table)  # for fread
library(stringr)     # for str_count
# Read data from file using fread (for .csv)
dat <- fread(filename, <add parameters as needed - col.names, nrows etc>)
# count occurrences of "the" in the selected column for the selected rows
counts <- sapply(row_start:row_end, function(z) str_count(dat[z, selected_col_name], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to account for lowercase/uppercase letters - you can use tolower for that, as shown below. Hope this is helpful!
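For example, a small sketch of the case-insensitive variant, assuming the dat object from the snippet above and a hypothetical column name feedback_text (replace it with your own):
# count occurrences of "the" in every row of the column, ignoring case
counts_all <- str_count(tolower(dat[["feedback_text"]]), "the")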
I have a large, unorganized XML file that I need to search to determine whether certain ID numbers are in the file. I would like to use R to do so, but because of the format I am having trouble converting it to a data frame, or even to a list that I could extract to a CSV. I figured I could search easily if it were in CSV format. So I need help understanding how to convert and extract it properly, or how to search the document for values using R. Below is the code I have used to try and convert the doc, but several errors occur with my various attempts.
## Method 1. I tried to convert to a data frame, but each column is not the same length.
require(XML)
require(plyr)
file<-"EJ.XML"
doc <- xmlParse(file,useInternalNodes = TRUE)
xL <- xmlToList(doc)
data <- ldply(xL, data.frame)
datanew <- read.table(data, header = FALSE, fill = TRUE)
## Method 2. I tried to convert it to a list; the file extracts, but only 2 words of the file are listed.
data<- xmlParse("EJ.XML")
print(data)
head(data)
xml_data<- xmlToList(data)
class(data)
topxml <- xmlRoot(data)
topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
row.names=NULL)
write.csv(xml_df, file = "MyData.csv",row.names=FALSE)
I am going to do some research on how to search within R as well, but I assume the file needs to be in a data frame or a list either way. Any help is appreciated! Attached is a screenshot of the data. I am interested in finding entity ID numbers that match a list I have in an Excel doc.
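If the IDs are stored as attributes or elements in the XML, an XPath query may be simpler than flattening the whole file. A rough sketch with the XML package - the node name entity and attribute name id are only guesses based on your description, and my_ids stands for your own vector of ID numbers:
library(XML)
doc <- xmlParse("EJ.XML", useInternalNodes = TRUE)
# pull the id attribute from every <entity> node (hypothetical structure - adjust the XPath to your file)
ids <- xpathSApply(doc, "//entity", xmlGetAttr, "id")
# compare against your own vector of ID numbers
matches <- intersect(ids, my_ids)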
I have multiple txt files, each referring to a different month of the year (for many years). How could I analyze each of these files separately (text mining) from within a single corpus (or something similar), while keeping track of the month-year reference? Thank you.
Here is an example I programmed for Game of Thrones subtitles. The subtitles come as 60 text files, one file per episode, named in the form S01E01, where we wanted to keep the episode information.
The following code will read the files into a list, and will turn it into a dataframe with episode information and text. You will have to adapt it to your own problem.
library(plyr)
####### Read data ######
filenames <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=TRUE)
filenames_short <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=FALSE)
ldf <- alply(.data=filenames,.margins=1,.fun=scan,what = "character", quiet = T, quote = "")
names(ldf) <- filenames_short
# Loop over all filenames
# Turns list into two columns of a dataframe, episode and word
# create empty dataframe
df_got_subs <- as.data.frame(NULL)
for (i in 1:60) {
  # extract listname
  # vector with list name
  listenname <- filenames_short[i]
  vec_listenname <- rep.int(listenname, length(ldf[[i]]))
  # Doublecheck
  cat("listenname: ", listenname, "\n")
  # turn list element into vector
  vec_subs <- as.vector(ldf[[i]])
  # create dataframe from vectors
  df_subs <- cbind.data.frame(vec_listenname, vec_subs, stringsAsFactors = FALSE)
  # attach to the "big" dataframe
  df_got_subs <- rbind.data.frame(df_got_subs, df_subs)
}
# test datastructure
str(df_got_subs)
# change column names
colnames(df_got_subs) <- c("episode","subs")
We did the whole text mining with the tidytext package from Julia Silge. I didn't post that code because she gives much better examples in this post:
http://juliasilge.com/blog/Life-Changing-Magic/
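As a minimal sketch of the tidytext step on the data frame built above (one row per word, with the episode kept alongside):
library(tidytext)
library(dplyr)
# split the subtitle text into one word per row, keeping the episode column
tidy_subs <- df_got_subs %>%
  unnest_tokens(word, subs)
# word frequencies per episode
word_counts <- tidy_subs %>%
  count(episode, word, sort = TRUE)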
I hope this helps with your problem.
I have a list of URLs for which I have fetched the web content and loaded it into tm corpora:
library(tm)
library(XML)
link <- c(
"http://www.r-statistics.com/tag/hadley-wickham/",
"http://had.co.nz/",
"http://vita.had.co.nz/articles.html",
"http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
"http://www.analyticstory.com/hadley-wickham/"
)
create.corpus <- function(url.name){
  doc = htmlParse(url.name)
  parag = xpathSApply(doc, '//p', xmlValue)
  if (length(parag) == 0){
    parag = "empty"
  }
  cc = Corpus(VectorSource(parag))
  meta(cc, "link") = url.name
  return(cc)
}
link=catch$url
cc <- lapply(link, create.corpus)
This gives me a "large list" of corpora, one for each URL.
Combining them one by one works:
x=cc[[1]]
y=cc[[2]]
z=c(x,y,recursive=T) # preserved metadata
x;y;z
# A corpus with 8 text documents
# A corpus with 2 text documents
# A corpus with 10 text documents
But this becomes unfeasible for a list with a few thousand corpora.
So how can a list of corpora be merged into one corpus while maintaining the meta data?
You can use do.call to call c on the whole list at once (each element of cc becomes a separate argument to c):
do.call(function(...) c(..., recursive = TRUE), cc)
# A corpus with 155 text documents
I don't think tm offers any built-in function to join/merge many corpora. But after all, a corpus is a list of documents, so the question is how to transform a list of lists into a single list. I would create a new corpus from all the documents, then assign the metadata manually:
y = Corpus(VectorSource(unlist(cc)))
meta(y,'link') = do.call(rbind,lapply(cc,meta))$link
Your code does not work because catch is not defined, so I don't know exactly what that is supposed to do.
But now tm corpora can just be combined with c() to make one big corpus: https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/tm_combine
So maybe c(unlist(cc)) would work. I have no way to test if that would work though because your code doesn't run.
I was wondering if there is any chance of R's text mining package having the following feature:
myCorpus <- Corpus(DirSource(<directory-contatining-textfiles>),control=...)
# add docs
myCorpus.addDocs(DirSource(<new-dir>),control=...)
Ideally I would like to incorporate additional documents into the existing corpus.
Any help is appreciated
You should be able to just use c(), as in
> library(tm)
> data("acq")
> data("crude")
> together <- c(acq,crude)
> acq
A corpus with 50 text documents
> crude
A corpus with 20 text documents
> together
A corpus with 70 text documents
You can find more in the tm package documentation under tm_combine.
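The combined corpus can then be used like any other, for example for a document-term matrix (a minimal sketch using the sample data above):
dtm_together <- DocumentTermMatrix(together)
dim(dtm_together)  # 70 documents by the number of distinct terms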
I ran into this issue as well, in the context of big text mining data sets where it was not possible to load the entire data set at once.
Another option is possible for such big data sets: the approach is to collect a vector of one-document corpora inside a loop. After processing all documents like this, it is possible to convert this vector into one huge corpus, e.g. to create a DTM on it.
# Vector to collect the corpora:
webCorpusCollection <- c()
# Loop over raw data:
for(i in ...) {
  try({
    # Convert one document into a corpus:
    webDocument <- Corpus(VectorSource(iconv(webDocuments[i,1], "latin1", "UTF-8")))
    #
    # Do other things e.g. preprocessing...
    #
    # Store this document into the corpus vector:
    webCorpusCollection <- rbind(webCorpusCollection, webDocument)
  })
}
# Collecting done. Create one huge corpus:
webCorpus <- Corpus(VectorSource(unlist(webCorpusCollection[,"content"])))
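From there you can build the DTM mentioned above on the merged corpus (a minimal sketch, assuming the objects from the loop):
webDTM <- DocumentTermMatrix(webCorpus)
dim(webDTM)  # one row per collected document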