R text file and text mining: how to load data

I am using the R package tm and I want to do some text mining. This is one document and it is treated as a bag of words.
I don't understand from the documentation how to load a text file and create the necessary objects to start using features such as:
stemDocument(x, language = map_IETF(Language(x)))
So assume that this is my document: "this is a test for R load"
How do I load the data for text processing and create the object x?

Like @richiemorrisroe I found this poorly documented. Here's how I get my text in for use with the tm package and make the document-term matrix:
library(tm) #load text mining library
setwd('F:/My Documents/My texts') #sets R's working directory to near where my files are
a <- Corpus(DirSource("/My Documents/My texts"), readerControl = list(language = "lat")) #specifies the exact folder where my text file(s) is for analysis with tm.
summary(a) #check what went in
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
a <- tm_map(a, stemDocument, language = "english")
adtm <- DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
In this case you don't need to specify the exact file name. As long as it's the only file in the directory referred to in the DirSource() call, it will be used by the tm functions. I do it this way because I have not had any success specifying the file name in that call.
If anyone can suggest how to get text into the lda package I'd be most grateful. I haven't been able to work that out at all.
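One route that might work, though I haven't verified it on my own data, is to convert the DocumentTermMatrix with dtm2ldaformat() from the topicmodels package, which should return the documents/vocab pair that the lda package expects:
library(topicmodels)             # assumes topicmodels is installed
lda_input <- dtm2ldaformat(adtm) # list with $documents and $vocab
str(lda_input$documents[[1]])    # each document is a 2 x n integer matrix
head(lda_input$vocab)            # character vector of terms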

Can't you just use the function readPlain from the same library? Or you could just use the more common scan function.
mydoc.txt <- scan("./mydoc.txt", what = "character")
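Note that scan() with what = "character" splits the file on whitespace, so you get a vector of words. A rough sketch to get from there to a one-document tm corpus (my assumption about what you're after) would be:
x <- Corpus(VectorSource(paste(mydoc.txt, collapse = " "))) # paste the words back into one document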

I actually found this quite tricky to begin with, so here's a more comprehensive explanation.
First, you need to set up a source for your text documents. I found that the easiest way (especially if you plan on adding more documents) is to create a directory source that will read in all of your files.
source <- DirSource("yourdirectoryname/") #input path for documents
YourCorpus <- Corpus(source, readerControl=list(reader=readPlain)) #load in documents
You can then apply the stemDocument function to your Corpus.
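For example, something along these lines should work (assuming the SnowballC package is installed, since tm's stemmer relies on it):
YourCorpus <- tm_map(YourCorpus, stemDocument, language = "english") # stem every document in the corpus
HTH.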

I believe what you want to do is read an individual file into a corpus and then have it treat the different rows in the text file as different observations.
See if this gives you what you want:
text <- read.delim("this is a test for R load.txt", sep = "/t")
text_corpus <- Corpus(VectorSource(text), readerControl = list(language = "en"))
This assumes that the file "this is a test for R load.txt" has only one column, which contains the text data.
Here the "text_corpus" is the object that you are looking for.
Hope this helps.

Here's my solution for a text file with a line per observation. The latest vignette on tm (Feb 2017) gives more detail.
text <- read.delim(textFileName, header = F, sep = "\n", stringsAsFactors = F)
colnames(text) <- c("MyCol")
docs <- text$MyCol
a <- VCorpus(VectorSource(docs))
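A quick sanity check, to confirm that each line became its own document:
length(a)            # number of documents should equal the number of lines
as.character(a[[1]]) # text of the first document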

The following assumes you have a directory of text files from which you want to create a bag of words.
The only change that needs to be made is to replace
path = "C:\\windows\\path\\to\\text\\files\\"
with your directory path.
library(tidyverse)
library(tidytext)
# create a data frame listing all files to be analyzed
all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\", # path can be relative or absolute
                       pattern = ".txt$",  # this pattern only selects files ending with .txt
                       full.names = TRUE)  # gives the file path as well as name
# create a data frame with one word per line
my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>% # read in each file in the list
               mutate(filename = basename(.x)) %>%               # add the file name as a new column
               unnest_tokens(word, txt))                         # split each word out as a separate row
# count the total # of rows/words in your corpus
my_corpus %>%
  summarize(number_rows = n())
# group and count by "filename" field and sort descending
my_corpus %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))
# remove stop words
my_corpus2 <- my_corpus %>%
  anti_join(stop_words)
# repeat the count after stop words are removed
my_corpus2 %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))
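If you want an actual bag-of-words matrix rather than tidy counts, one option (a sketch, assuming the tm package is also installed, since cast_dtm() produces a tm DocumentTermMatrix) is:
# count words per file and cast the counts into a document-term matrix
my_dtm <- my_corpus2 %>%
  count(filename, word) %>%
  cast_dtm(document = filename, term = word, value = n)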

Related

creating corpus from multiple txt files

I have multiple txt files and I want to end up with tidy data. To do that, I first create a corpus (I am not sure whether that is the right way to do it). I wrote the following code to get the corpus data.
folder<-"C:\\Users\\user\\Desktop\\text analysis\\doc"
list.files(path=folder)
filelist<- list.files(path=folder, pattern="*.txt")
paste(folder, "\\", filelist)
filelist<-paste(folder, "\\", filelist, sep="")
typeof(filelist)
a<- lapply(filelist,FUN=readLines)
corpus <- lapply(a ,FUN=paste, collapse=" ")
When I check class(corpus), it returns a list. From that point, how can I create tidy data?
Looking at your other question as well, you need to read up on text mining and how to read in files. Your result is currently a list object; in itself not a bad object, but not the right one for your purposes. Instead of lapply, use sapply in your last line, like this:
corpus <- sapply(a, FUN = paste, collapse = " ")
This will return a character vector. Next you need to turn this into a data.frame. I added the filelist to the data.frame to keep track of which text belongs to which document.
my_data <- data.frame(files = filelist, text = corpus, stringsAsFactors = FALSE)
and then use tidytext to continue:
library(tidytext)
tidy_text <- unnest_tokens(my_data, words, text)
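From there the usual tidytext counting verbs apply; a small sketch (the column names come from the code above):
library(dplyr)
count(tidy_text, files, words, sort = TRUE) # most frequent words per file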
Using the tm and tidytext packages
If you would use the tm package, you could read everything in like this:
library(tm)
folder <- getwd() # <-- here goes your folder
corpus <- VCorpus(DirSource(directory = folder,
                            pattern = "*.txt"))
which you could turn into tidytext like this:
library(tidytext)
tidy_corpus <- tidy(corpus)
tidy_text <- unnest_tokens(tidy_corpus, words, text)
If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.
To find all the text files within a working directory, you can use list.files with an argument:
all_txts <- list.files(pattern = ".txt$")
The all_txts object will then be a character vector that contains all your filenames.
Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.
library(tidyverse)
library(tidytext)
map_df(all_txts, ~ data_frame(txt = read_file(.x)) %>%
         mutate(filename = basename(.x)) %>%  # add the file name as a new column
         unnest_tokens(word, txt))            # split each word out as a separate row

How to apply a custom function to a quanteda corpus

I'm trying to migrate a script from using tm to quanteda. Reading the quanteda documentation, I see there is a philosophy of applying changes "downstream" so that the original corpus is unchanged. OK.
I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So I have a csv file with 2 columns: the first column is the misspelt term and the second column is the correct version of that term.
Using tm package previously I did this:
# Write a custom function to pass to tm_map
# "Spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) {
  stri_replace_all_regex(str = x,
                         pattern = paste0("\\b", lut[, 1], "\\b"),
                         replacement = lut[, 2],
                         vectorize_all = FALSE)
})
Then within my tm corpus transformations I did this:
mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))
What is the equivalent way to apply this custom function to my quanteda corpus?
It's impossible to know from your example, which leaves some parts out, whether that will work, but generally:
If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.
So in your case, assuming that mycorpus is a tm corpus, you could do this:
library("quanteda")
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
  stringi::stri_replace_all_regex(str = x,
                                  pattern = paste0("\\b", lut[, 1], "\\b"),
                                  replacement = lut[, 2],
                                  vectorize_all = FALSE)
}
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
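For completeness, spellingdoc is assumed here to be a two-column lookup table; the column names below are made up, since only the column positions matter to the function:
spellingdoc <- data.frame(misspelt = c("recieve", "teh"),   # hypothetical example rows
                          correct  = c("receive", "the"),
                          stringsAsFactors = FALSE)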
I think I found an indirect answer over here.
texts(myCorpus) <- myFunction(myCorpus)

Stemming corpus in R

I am using the library(tm) package in R to stem words in R but I am still getting different words with the same root in the document term matrix (dtm). For example, I am getting "certif" and "certifi" as different words, "categor" and "categori" as different words, "cathet" and "catheter" as different words, "character" and "characteristi" as different words, and so on. Isn't stemDocument supposed to take endings off and count them as one word? How can I fix this? This is the code I used:
docs <- Corpus(VectorSource(df$Long_Descriptor))
docs <- tm_map(docs, removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower), lazy = TRUE) %>%
  tm_map(removeWords, stopwords("english"), lazy = TRUE) %>%
  tm_map(stemDocument, language = c("english"), lazy = TRUE)
dtm <- DocumentTermMatrix(docs)

R: Combining lists with csv metadata

I am having a bit of trouble with some work on text files and the associated metadata for those files. I can read in the files, pre-process them, and then convert them to a readable format for the lda package I am using (following this guide by Sievert). Example below:
#Reading the files
corpus <- file.path("Folder/Fiction/texts")
corpus <- list.files(corpus)
corpus <- lapply(corpus, readLines)
***pre-processing functions removed for space***
corp.list <- strsplit(corpus, "[[:space:]]+")
# compute the table of terms:
corpterm.table <- table(unlist(corp.list))
corpterm.table <- sort(corpterm.table, decreasing = TRUE)
***removing stopwords, again removed for space***
# now put the corpus into the format required by the lda package:
getCorp.terms <- function(x) {
  index <- match(x, vocabCorp)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
corpus <- lapply(corp.list, getCorp.terms)
At this point, the corpus variable is a list of document tokens with a separate vector per document, but it has been detached from its file path and the name of the file. Here is where my problem begins: I have a csv with the metadata for the texts (their file names, titles, authors, years, genres, etc.) which I would like to have associated with each vector of tokens, in order to easily model my information over time, by gender, etc.
I am unsure how to do this, but am guessing it would need to be done as the files are being read, and not merged in after I have manipulated the document texts. I would imagine it would look something like:
corpus.f <- file.path(stuff)
corpus <- list.files(corpus.f)
corpus <- lapply(corpus, readLines)
corpus.df <- as.data.frame(c(corpus.f, corpus))
corpus.info <- read.csv(stuff.csv)
And from there using the merge or match function in combination to associate each document (or vector of document tokens) with its correct row of metadata.
Try changing to this:
pth <- file.path("Folder/Fiction/texts")
fi <- list.files(pth)
corpus <- lapply(fi, readLines)
corp.list <- strsplit(corpus, "[[:space:]]+")
corp.list <- setNames(object = corp.list, nm = fi) # keep the file names attached to each element
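From there, one way to attach the csv metadata is to match on the names you just set. This is only a sketch: the "filename" column is an assumption about what your csv contains, and "stuff.csv" is the placeholder path from the question.
corpus.info <- read.csv("stuff.csv", stringsAsFactors = FALSE)                # metadata, one row per file
corpus.meta <- corpus.info[match(names(corp.list), corpus.info$filename), ]  # one metadata row per document, in corpus order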

Plotting term document matrix from clipboard

I want to plot a term-document matrix, but am having trouble generating a corpus. I want to be able to generate a corpus from text selected and copied to the clipboard. For example, I want a plotted TDM from 150 paragraphs of Lorem Ipsum data.
This part here is just for drawing in word data from lipsum.com
library("tm")
#generate a corpus from clipboard
clipboard2 <- read.table("clipboard", sep = "\r")
The next part would (if it worked) split clipboard2 into a bunch of documents from which to get correlations. I think there's an easier solution here than creating documents that are then re-read back in just for the corpus's sake.
#how many docs to print out for correlation's sake
for (i in 1:10) {
  start <- floor(1 + (i - 1) * nrow(clipboard2) / 10)
  end <- i * nrow(clipboard2) / 10
  write.table(clipboard2[start:end, 1],
              paste0("C:/Users/me/Documents/", i, ".txt", collapse = ""), sep = "\t")
}
Pulling the corpus of documents into a variable. Everything from this point on works fine if I manually split the lipsum.com data into a few documents in some directory.
#Corpus collection
feedback <- Corpus(DirSource("C:/Users/me/Documents/"))
Removing words and whitespace, though there might be some redundancy here. Then creating the TDM.
#Cleanup
feedback <- tm_map(feedback, stripWhitespace)
feedback <- tm_map(feedback, tolower)
feedback <- tm_map(feedback, removeWords, stopwords("english"))
#TDM creation (redundant?)
tdm <- TermDocumentMatrix(feedback, control = list(removePunctuation = TRUE,
                                                   removeNumbers = TRUE,
                                                   stopwords = TRUE))
And finally, plotting the TDM. No issues here.
#plotting TDM
plot(tdm,
     terms = findFreqTerms(tdm, lowfreq = 70),
     corThreshold = 0.6)
It's somewhat unclear to me which part you are asking about, but as far as reading the clipboard directly into a corpus goes, you could use:
dd <- read.table("clipboard", sep="\r", stringsAsFactors=F)
feedback <- Corpus(VectorSource(dd$V1))
That will create a new document for each paragraph. But the idea is that you can use a character vector as a source, so you can collapse/merge elements in the vector first to create more complex documents.
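For example, a rough sketch that collapses every ten paragraphs into one larger document before building the corpus:
grp  <- ceiling(seq_len(nrow(dd)) / 10)           # group paragraphs in blocks of 10
docs <- tapply(dd$V1, grp, paste, collapse = " ") # paste each block into one document
feedback <- Corpus(VectorSource(docs))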
