I am having a bit of trouble working with text files and their associated metadata. I can read in the files, pre-process them, and then convert them to the format required by the lda package I am using (following this guide by Sievert). Example below:
#Reading the files
corpus <- file.path("Folder/Fiction/texts")
corpus <- list.files(corpus)
corpus <- lapply(corpus, readLines)
***pre-processing functions removed for space***
corp.list <- strsplit(corpus, "[[:space:]]+")
# compute the table of terms:
corpterm.table <- table(unlist(corp.list))
corpterm.table <- sort(corpterm.table, decreasing = TRUE)
***removing stopwords, again removed for space***
# now put the corpus into the format required by the lda package:
getCorp.terms <- function(x) {
  index <- match(x, vocabCorp)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
corpus <- lapply(corp.list, getCorp.terms)
At this point, the corpus variable is a list of document tokens, one vector per document, but it has been detached from the file paths and file names. Here is where my problem begins: I have a csv with the metadata for the texts (their file names, titles, authors, years, genres, etc.) which I would like to associate with each vector of tokens, in order to easily model my information over time, by gender, etc.
I am unsure how to do this, but I am guessing it would need to be done as the files are being read, rather than merged in after I have manipulated the document texts. I imagine it would look something like:
corpus.f <- file.path(stuff)
corpus <- list.files(corpus.f)
corpus <- lapply(corpus, readLines)
corpus.df <- as.data.frame(c(corpus.f, corpus))
corpus.info <- read.csv(stuff.csv)
And from there using the merge or match function in combination to associate each document (or vector of document tokens) with its correct row of metadata.
Try changing to this:
pth <- file.path("Folder/Fiction/texts")
fi <- list.files(pth)
corpus <- lapply(fi, readLines)
corp.list <- strsplit(corpus, "[[:space:]]+")
corp.list <- setNames(corp.list, nm = fi)  # keep the file names as the element names
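With the file names preserved as the list names, the metadata can then be looked up by name. A minimal sketch, assuming your csv has a column (here called filename, a hypothetical name) whose entries match the names in fi:

corpus.info <- read.csv("stuff.csv", stringsAsFactors = FALSE)

# find the metadata row for each document by matching on the file name
meta.idx <- match(names(corp.list), corpus.info$filename)
corp.meta <- corpus.info[meta.idx, ]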
I'm guessing that the technique for this is similar to taking the first N characters from any dataframe, regardless of whether it is a corpus or not.
My attempt:
create.greetings <- function(corpus, create_df = FALSE) {
  for (i in length(Charlotte.corpus.raw)) {
    Doc1 <- Charlotte.corpus.raw[i]
    Word1 <- Doc1[1:25]
    Greetings[i] <- Word1
  }
  return(VCorpus)
}
Where Greetings begins as a corpus with n=6. I couldn't figure out how to make a null corpus, or a corpus of large enough characters. I have a corpus of 200 documents here (Charlotte.corpus.raw). Unlike vectors (and by extension, dataframes), there doesn't seem to be an easy way to create null corpora.
Part of the problem is that R doesn't seem to recognize the class of "document". It only recognizes corpus. That is, to R, a single document is a corpus of n=1.
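As an aside, the distinction shows up in how you index a tm corpus; a quick sketch, assuming a VCorpus named corpus:

library(tm)

class(corpus[1])      # "VCorpus" - a sub-corpus containing one document
class(corpus[[1]])    # "PlainTextDocument" - the document itself
content(corpus[[1]])  # the raw text of the first document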
Reproducible Sample:
You will need the 'tm', 'dplyr', and 'NLP' packages, as well as the more common R packages.
read.corpus <- function(directory, pattern = "", to.lower = TRUE) {
  corpus <- DirSource(directory = directory, pattern = pattern) %>%
    VCorpus  # read files and create a `VCorpus` object
  if (to.lower)
    corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase the text
  return(corpus)
}
Then run the function on any directory you have with a few txt documents and you have a corpus to work with. Then replace Charlotte.corpus.raw from above with whatever you name your corpus.
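A minimal usage sketch (the directory path is just a placeholder):

library(tm)
library(dplyr)  # provides the pipe used inside read.corpus

Charlotte.corpus.raw <- read.corpus("path/to/your/texts", pattern = ".txt")
length(Charlotte.corpus.raw)  # number of documents read in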
Each row of greetings will contain the first 25 words of each document:
greetings <- c()
for (i in 1:length(corpus)) {
  # split the document's text into words and keep the first 25
  words <- unlist(strsplit(content(corpus[[i]]), "[[:space:]]+"))
  greetings <- rbind(greetings, words[1:25])
}
I have multiple txt files, each referring to a different month of the year (over many years). How could I analyze each of these files separately (text mining) from a single corpus (or something similar), while keeping track of the month-year reference? Thank you.
Here is an example I programmed for Game of Thrones subtitles. The subtitles come as 60 text files, one file per episode, named in the form S01E01, where we wanted to keep the episode information.
The following code will read the files into a list and turn it into a dataframe with episode information and text. You will have to adapt it to your own problem.
library(plyr)
####### Read data ######
filenames <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=TRUE)
filenames_short <- list.files("Set7/Game of Thrones Subtitles", pattern="*", full.names=FALSE)
ldf <- alply(.data=filenames,.margins=1,.fun=scan,what = "character", quiet = T, quote = "")
names(ldf) <- filenames_short
# Loop over all filenames
# Turns list into two columns of a dataframe, episode and word
# create empty dataframe
df_got_subs <- as.data.frame(NULL)
for (i in 1:60) {
# extract listname
# vector with list name
listenname <- filenames_short[i]
vec_listenname <- rep.int(listenname,length(ldf[[i]]))
# Doublecheck
cat("listenname: ",listenname,"\n")
# turn list element into vector
vec_subs <- as.vector(ldf[[i]])
# create dataframe from vectors
df_subs <- cbind.data.frame(vec_listenname,vec_subs,stringsAsFactors=FALSE)
# attach to the "big" dataframe
df_got_subs <- rbind.data.frame(df_got_subs,df_subs)
}
# test datastructure
str(df_got_subs)
# change column names
colnames(df_got_subs) <- c("episode","subs")
We did the whole text mining with the tidytext package by Julia Silge. I didn't post that code because she gives much better examples in this post:
http://juliasilge.com/blog/Life-Changing-Magic/
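For completeness, a minimal sketch of the kind of tidytext tokenization she describes, applied to the dataframe built above (column names as set at the end of the loop):

library(dplyr)
library(tidytext)

got_words <- df_got_subs %>%
  unnest_tokens(word, subs) %>%  # one word per row, keeping the episode column
  anti_join(stop_words)          # remove common English stop words

# most frequent words per episode
got_words %>%
  count(episode, word, sort = TRUE)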
I hope this helps with your problem.
I could not find any previous questions posted on this, so perhaps you can help.
What is a good way to aggregate data in a tm corpus based on metadata (e.g. aggregate texts of different writers)?
There are at least two obvious ways it could be done:
A built-in function in tm that would allow a DocumentTermMatrix to be built on a metadata feature. Unfortunately I haven't been able to find one.
A way to join documents within a corpus based on some external metadata in a table. It would just use metadata to replace document-ids.
So you would have a table that contains: DocumentId, AuthorName
And a tm-built corpus that contains a number of documents. I understand it is not difficult to introduce the table as metadata for the corpus object.
A matrix can be built with the following function.
library(tm) # version 0.6, you seem to be using an older version
corpus <- Corpus(DirSource("/directory-with-texts"),
                 readerControl = list(language = "lat"))
metadata <- data.frame(DocID, Author)
#A very crude way to enter metadata into the corpus (assumes the same sequence):
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "Author") <- metadata$Author[i]
}
a_documenttermmatrix_by_DocId <- DocumentTermMatrix(corpus)
How would you build a matrix that shows term frequencies for each author, aggregating over multiple documents, instead of for each document? It would be useful to do this at this stage and not in post-processing with only a few terms.
a_documenttermmatrix_by_Author <- ?
Many thanks!
A DocumentTermMatrix is really just a matrix with fancy dressing (a Simple Triplet Matrix from the slam library) that contains term frequencies for each term and document. Aggregating data from multiple documents by author is really just summing up the rows (documents) belonging to that author. Consider formatting the matrix as a standard R matrix and use standard subsetting / aggregating methods:
# Format the document term matrix as a standard matrix.
# The rownames of m are the document ids
# The colnames of m are the individual terms
m <- as.matrix(dtm)

# Group rows (documents) by Author and sum the term frequencies for each author.
# `by` returns a list with one named vector of column sums per author.
author.list <- by(m, metadata$Author, colSums)

# Format the list as a matrix and do stuff with it
author.dtm <- matrix(unlist(author.list), nrow = length(author.list), byrow = TRUE)

# Add column names (term) and row names (author)
colnames(author.dtm) <- colnames(m)
rownames(author.dtm) <- names(author.list)

# View the resulting matrix
View(author.dtm[1:10, 1:10])
The resulting matrix will be a standard matrix where the rows are the Authors and the columns are the individual terms. You should be able to do whatever analysis you want at that point.
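As a side note (not part of the original answer), base R's rowsum() does the same aggregation in one step, summing the rows of m grouped by author:

# rows of m are documents, columns are terms; sum rows within each author
author.dtm2 <- rowsum(m, group = metadata$Author)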
I have a very crude workaround for this if the corpus text can be found in a table. It does not help much with a large corpus already in 'tm' format, but it may be handy in other cases. Feel free to improve it, as it is very crude!
custom_term_matrix <- function(author_vector, text_vector) {
  author_vector <- factor(author_vector)
  temp <- data.frame(Author = levels(author_vector), stringsAsFactors = FALSE)
  # paste together all texts belonging to the same author (separated by a space)
  for (i in 1:length(temp$Author)) {
    temp$Content[i] <- paste(as.character(text_vector[author_vector == levels(author_vector)[i]]),
                             collapse = " ")
  }
  m <- list(id = "Author", content = "Content")
  myReader <- readTabular(mapping = m)
  mycorpus <- Corpus(DataframeSource(temp),  # use the data frame built above (was: data1)
                     readerControl = list(reader = myReader))
  custom_matrix <<- DocumentTermMatrix(mycorpus, control = list(removePunctuation = TRUE))
}
There is probably a function internal to tm that I haven't been able to find, so I will be grateful for any help!
I have been reading different questions/answers (especially here and here) without managing to apply any to my situation.
I have an 11,390-row matrix with attributes id, author, text, such as:
library(tm)
m <- cbind(c("01","02","03","04","05","06"),
           c("Author1","Author2","Author2","Author3","Author3","Author4"),
           c("Text1","Text2","Text3","Text4","Text5","Text6"))
I want to create a tm corpus out of it. I can quickly create my corpus with
tm_corpus <- Corpus(VectorSource(m[,3]))
which for my 11,390-row matrix completes in
user system elapsed
2.383 0.175 2.557
But then when I try to add metadata to the corpus with
meta(tm_corpus, type="local", tag="Author") <- m[,2]
the execution time is over 15 minutes and counting (I then stopped execution).
According to the discussion here, there is a good chance of significantly decreasing the processing time by using tm_map; something like
tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])
Still I am not sure how to do this. Probably it is going to be something like
addMeta <- function(text, vector) {
  meta(text, tag = "Author") <- vector[??]
  text
}
For one thing, how do I pass to tm_map a vector of values to be assigned to each text of the corpus? Should I call the function from within a loop? Should I enclose the tm_map call within vapply?
Have you already tried the excellent readTabular?
## your sample data
matrix <- cbind(c("01","02","03","04","05","06"),
                c("Author1","Author2","Author2","Author3","Author3","Author4"),
                c("Text1","Text2","Text3","Text4","Text5","Text6"))
## simple transformations
matrix <- as.data.frame(matrix)
names(matrix) <- c("id", "author", "content")
Your ex-matrix, now a data.frame, can easily be read in as a corpus using readTabular. readTabular wants you to define a reader, which itself takes a mapping. In your mapping, "content" points to the text data and the other names - well - to meta.
## define myReader, which will be used in creation of Corpus
myReader <- readTabular(mapping=list(id="id", author="author", content="content"))
Now the creation of the Corpus is the same as always, apart from small changes:
## create the corpus
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus,
                    readerControl = list(reader = myReader))
Now have a look at the content and meta data of the first items:
lapply(tm_corpus, as.character)
lapply(tm_corpus, meta)
## output just as expected.
This should be fast, as it is part of the package and extremely adaptable. In my own project I am using this on a data.table with some 20 variables - it works like a charm.
However I cannot provide a benchmark against the answer you have already accepted. I simply guess it is faster and more efficient.
Yes, tm_map is faster and it is the way to go. You should use it here with a global counter.
auths <- paste0('Author', seq(11390))
i <- 0
tm_corpus <- tm_map(tm_corpus, function(x) {
  i <<- i + 1
  meta(x, "Author") <- m[i, 2]
  x
})
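You can then spot-check that the metadata landed where you expect, for example:

meta(tm_corpus[[1]], "Author")
meta(tm_corpus[[2]], "Author")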
Since readTabular from the tm package has been deprecated, the solution might now look like this:
matrix <- cbind(c("01","02","03","04","05","06"),
                c("Text1","Text2","Text3","Text4","Text5","Text6"))
matrix <- as.data.frame(matrix, stringsAsFactors = FALSE)
names(matrix) <- c("doc_id", "text")  # doc_id must be a unique identifier per document
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus)
inspect(tm_corpus)
meta(tm_corpus)
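If you also need the author as metadata: as far as I can tell from the current DataframeSource documentation, any columns after doc_id and text are treated as document-level metadata, so the author can ride along in the same data frame (a sketch, to be verified with meta()):

df <- data.frame(doc_id = c("01","02","03","04","05","06"),
                 text   = c("Text1","Text2","Text3","Text4","Text5","Text6"),
                 author = c("Author1","Author2","Author2","Author3","Author3","Author4"),
                 stringsAsFactors = FALSE)

tm_corpus <- Corpus(DataframeSource(df))
meta(tm_corpus)  # the document-level metadata should now include the author column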
I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.
I don't understand from the documentation how to load a text file and create the necessary objects to start using features such as....
stemDocument(x, language = map_IETF(Language(x)))
So assume that this is my doc "this is a test for R load"
How do I load the data for text processing and to create the object x?
Like @richiemorrisroe I found this poorly documented. Here's how I get my text in to use with the tm package and make the document term matrix:
library(tm) #load text mining library
setwd('F:/My Documents/My texts') #sets R's working directory to near where my files are
a <-Corpus(DirSource("/My Documents/My texts"), readerControl = list(language="lat")) #specifies the exact folder where my text file(s) is for analysis with tm.
summary(a) #check what went in
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
a <- tm_map(a, stemDocument, language = "english")
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
In this case you don't need to specify the exact file name. So long as it's the only one in the directory referred to in line 3, it will be used by the tm functions. I do it this way because I have not had any success in specifying the file name in line 3.
If anyone can suggest how to get text into the lda package I'd be most grateful. I haven't been able to work that out at all.
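For what it's worth, the lda package ships a lexicalize() helper that builds both the vocabulary and the per-document index/count pairs it expects; a minimal sketch, assuming the texts are available as a plain character vector:

library(lda)

docs <- c("this is a test for r load",
          "another short test document")

lex <- lexicalize(docs, lower = TRUE)
str(lex$documents)  # list of 2-row integer matrices, one per document
lex$vocab           # the vocabulary the indices refer to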
Can't you just use the function readPlain from the same library? Or you could just use the more common scan function.
mydoc.txt <- scan("./mydoc.txt", what = "character")
I actually found this quite tricky to begin with, so here's a more comprehensive explanation.
First, you need to set up a source for your text documents. I found that the easiest way (especially if you plan on adding more documents, is to create a directory source that will read all of your files in.
source <- DirSource("yourdirectoryname/") #input path for documents
YourCorpus <- Corpus(source, readerControl=list(reader=readPlain)) #load in documents
You can then apply the stemDocument function to your Corpus. HTH.
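For example (stemDocument relies on the SnowballC package being installed):

YourCorpus <- tm_map(YourCorpus, stemDocument, language = "english")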
I believe what you wanted to do was read individual file into a corpus and then make it treat the different rows in the text file as different observations.
See if this gives you what you want:
text <- read.delim("this is a test for R load.txt", sep = "\t")  # sep = "\t", i.e. tab-separated
text_corpus <- Corpus(VectorSource(text[[1]]),  # one document per row of the single text column
                      readerControl = list(language = "en"))
This is assuming that the file "this is a test for R load.txt" has only one column which has the text data.
Here the "text_corpus" is the object that you are looking for.
Hope this helps.
Here's my solution for a text file with a line per observation. The latest vignette on tm (Feb 2017) gives more detail.
text <- read.delim(textFileName, header=F, sep = "\n",stringsAsFactors = F)
colnames(text) <- c("MyCol")
docs <- text$MyCol
a <- VCorpus(VectorSource(docs))
The following assumes you have a directory of text files from which you want to create a bag of words.
The only change that needs to be made is replace
path = "C:\\windows\\path\\to\\text\\files\\
with your directory path.
library(tidyverse)
library(tidytext)
# create a data frame listing all files to be analyzed
all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\", # path can be relative or absolute
pattern = ".txt$", # this pattern only selects files ending with .txt
full.names = TRUE) # gives the file path as well as name
# create a data frame with one word per line
my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>% # read in each file in list
mutate(filename = basename(.x)) %>% # add the file name as a new column
unnest_tokens(word, txt)) # split each word out as a separate row
# count the total # of rows/words in your corpus
my_corpus %>%
summarize(number_rows = n())
# group and count by "filename" field and sort descending
my_corpus %>%
group_by(filename) %>%
summarize(number_rows = n()) %>%
arrange(desc(number_rows))
# remove stop words
my_corpus2 <- my_corpus %>%
anti_join(stop_words)
# repeat the count after stop words are removed
my_corpus2 %>%
group_by(filename) %>%
summarize(number_rows = n()) %>%
arrange(desc(number_rows))
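If an actual bag-of-words (document-term) matrix is needed downstream, tidytext's cast_dtm can build one from the same word counts; a short sketch:

library(tm)  # cast_dtm returns a tm DocumentTermMatrix

dtm <- my_corpus2 %>%
  count(filename, word) %>%        # word counts per file
  cast_dtm(filename, word, n)      # one row per file, one column per term

dtm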