Creating a corpus from multiple txt files - R

I have multiple txt files and I want to end up with tidy data. To do that, I first create a corpus (I am not sure whether this is the right way to do it). I wrote the following code to build the corpus:
folder<-"C:\\Users\\user\\Desktop\\text analysis\\doc"
list.files(path=folder)
filelist<- list.files(path=folder, pattern="*.txt")
paste(folder, "\\", filelist)
filelist<-paste(folder, "\\", filelist, sep="")
typeof(filelist)
a<- lapply(filelist,FUN=readLines)
corpus <- lapply(a ,FUN=paste, collapse=" ")
When I check class(corpus) it returns "list". From that point, how can I create tidy data?

Looking at your other question as well, you need to read up on text mining and how to read in files. Your result is currently a list object. That is not a bad object in itself, but it is not the right one for your purposes. Instead of lapply, use sapply in your last line, like this:
corpus <- sapply(a , FUN = paste, collapse = " ")
This will return a character vector. Next you need to turn this into a data.frame. I added the filelist to the data.frame to keep track of which text belongs to which document.
my_data <- data.frame(files = filelist, text = corpus, stringsAsFactors = FALSE)
and then use tidytext to continue:
library(tidytext)
tidy_text <- unnest_tokens(my_data, words, text)
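As a quick check (a small illustrative example, assuming the steps above ran), you can count words in the tidy result:
library(dplyr)
# most frequent words across all documents
count(tidy_text, words, sort = TRUE)
# or counted per source file
count(tidy_text, files, words, sort = TRUE)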
Using the tm and tidytext packages
If you would use the tm package, you could read everything in like this:
library(tm)
folder <- getwd() # <-- here goes your folder
corpus <- VCorpus(DirSource(directory = folder,
                            pattern = "*.txt"))
which you could turn into tidytext like this:
library(tidytext)
tidy_corpus <- tidy(corpus)
tidy_text <- unnest_tokens(tidy_corpus, words, text)

If you have text files and you want tidy data, I would go straight from one to the other and not bother with the tm package in between.
To find all the text files within a working directory, you can use list.files with an argument:
all_txts <- list.files(pattern = ".txt$")
The all_txts object will then be a character vector that contains all your filenames.
Then, you can set up a pipe to read in all the text files and unnest them using tidytext with a map function from purrr. You can use a mutate() within the map() to annotate each line with the filename, if you'd like.
library(tidyverse)
library(tidytext)
map_df(all_txts, ~ tibble(txt = read_file(.x)) %>%
         mutate(filename = basename(.x)) %>%
         unnest_tokens(word, txt))

Related

Importing multiple .txt files into R

I need to import multiple .txt files into R. Each file contains multiple sentences (for example, "On Monday, I went to the park."). I would like to import all the files at the same time and then add them to a tibble, so that I can do text analysis on them.
So far, I have tried
# create a vector of txt files
files <- list.files(pattern = "txt$")
# read all the files and create a FileName column to store filenames
files_list <- files %>%
  set_names(.) %>%
  map_df(read_table2, .id = "FileName")
my_data <- read.delim(file(files))
But I don't know how to actually load the text in each .txt file into the data. When I run this code above, it only reads in the text from one of the files, not all.
I also tried:
sapply(files, read.delim)
mainlist <- list()
for (i in 1:length(files)) {
  mainlist[[i]] <- read.delim(files[i], header = TRUE, sep = "\t")
}
And while it prints out all the info in each .txt file, when I try to put it in a tibble using
mainlist_tib <- tibble(mainlist)
the tibble is empty.
Any assistance would be greatly appreciated!
Edit: Regarding the tibble, I would like for it to have a column for the txt file name and then another column for the text from the file, and then to be able to use the unnest_tokens() function to have a tibble where each row contains only one word. Sort of like in the example from the text mining textbook by Silge and Robinson: https://www.tidytextmining.com/tidytext.html
You could try it like this:
library(dplyr)
library(purrr)
files %>%
  set_names(.) %>%
  map_dfr(~readr::read_table(., col_names = FALSE), .id = "FileName")

Naming the columns of a merged file after the folder the source file comes from

I have written a script in R that combines my text files, each containing one column of data, into a .csv file where the columns are listed beside each other. Unfortunately, my analysis software always labels the text files in the same way, so all the text files are called "List".
With the following code, I was able to combine the different text files into a .csv file:
fileList <- list.files(path = ".", recursive = TRUE, pattern = "DistList.txt", full.names = TRUE)
listData <- lapply(fileList, read.table)
names(listData) <- gsub("DistList.txt","",basename(fileList))
library(tidyverse)
library(reshape2)
bind_rows(listData, .id = "FileName") %>%
group_by(FileName) %>%
mutate(rowNum = row_number()) %>%
dcast(rowNum~FileName, value.var = "V1") %>%
select(-rowNum) %>%
write.csv(file="Result.csv")
Now, I would like to change the column names so that each one equals the name of the folder in which the corresponding text file is located. As I don't have much experience using R yet, I can't figure out how to do this.
Thank you very much for your help already in advance!
The line
names(listData) <- gsub("DistList.txt", "", basename(fileList))
should be:
names(listData) <- gsub("DistList.txt", "", fileList)
This is because basename strips all the folders, leaving only the filename "DistList.txt", which gsub then replaces with the empty string "", so every name ends up empty.
What we probably want instead is to extract the last directory name, which in this case should give something like c("C1.1", "C1.2", ...):
names(listData) <- basename(dirname(fileList))
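Putting it together with the code from your question, the full pipeline (a sketch, assuming the same folder layout as in your question) would then be:
library(tidyverse)
library(reshape2)

fileList <- list.files(path = ".", recursive = TRUE, pattern = "DistList.txt", full.names = TRUE)
listData <- lapply(fileList, read.table)
# name each element after the folder its file sits in
names(listData) <- basename(dirname(fileList))

bind_rows(listData, .id = "FileName") %>%
  group_by(FileName) %>%
  mutate(rowNum = row_number()) %>%
  dcast(rowNum ~ FileName, value.var = "V1") %>%
  select(-rowNum) %>%
  write.csv(file = "Result.csv")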

How to apply a custom function to a quanteda corpus

I'm trying to migrate a script from using tm to quanteda. Reading the quanteda documentation, I see there is a philosophy of applying changes "downstream" so that the original corpus is unchanged. OK.
I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So I have a csv file with 2 columns: the first column is the misspelt term and the second column is the correct version of that term.
Using the tm package, I previously did this:
# Write a custom function to pass to tm_map
# "Spellingdoc" is the 2 column csv
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(
  function(x, lut = spellingdoc)
    stri_replace_all_regex(str = x,
                           pattern = paste0("\\b", lut[, 1], "\\b"),
                           replacement = lut[, 2],
                           vectorize_all = FALSE)
)
Then within my tm corpus transformations I did this:
mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))
What is the equivalent way to apply this custom function to my quanteda corpus?
Impossible to know if that will work from your example, which leaves some parts out, but generally:
If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.
So in your case, assuming that mycorpus is a tm corpus, you could do this:
library("quanteda")
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
  stringi::stri_replace_all_regex(str = x,
                                  pattern = paste0("\\b", lut[, 1], "\\b"),
                                  replacement = lut[, 2],
                                  vectorize_all = FALSE)
}
myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
I think I found an indirect answer over here.
texts(myCorpus) <- myFunction(myCorpus)

R: Combining lists with csv metadata

I am having a bit of trouble working with text files and the associated metadata for those files. I can read in the files, pre-process them, and then convert them to a readable format for the lda package I am using (using this guide by Sievert). Example below:
#Reading the files
corpus <- file.path("Folder/Fiction/texts")
corpus <- list.files(corpus)
corpus <- lapply(corpus, readLines)
***pre-processing functions removed for space***
corp.list <- strsplit(corpus, "[[:space:]]+")
# compute the table of terms:
corpterm.table <- table(unlist(corp.list))
corpterm.table <- sort(corpterm.table, decreasing = TRUE)
***removing stopwords, again removed for space***
# now put the corpus into the format required by the lda package:
getCorp.terms <- function(x) {
  index <- match(x, vocabCorp)
  index <- index[!is.na(index)]
  rbind(as.integer(index - 1), as.integer(rep(1, length(index))))
}
corpus <- lapply(corp.list, getCorp.terms)
At this point, the corpus variable is a list of document tokens with a separate vector per document, but it has been detached from its file path and the name of the file. Here is where my problem begins: I have a csv with the metadata for the texts (their file names, titles, authors, years, genres, etc.) which I would like to have associated with each vector of tokens, in order to easily model my information over time, by gender, etc.
I am unsure how to do this, but I am guessing it would need to be done as the files are being read, rather than merged after I have manipulated the document texts. I imagine it would look something like:
corpus.f <- file.path(stuff)
corpus <- list.files(corpus.f)
corpus <- lapply(corpus, readLines)
corpus.df <- as.data.frame(c(corpus.f, corpus))
corpus.info <- read.csv(stuff.csv)
And from there, use the merge or match functions to associate each document (or vector of document tokens) with its correct row of metadata.
Try changing to this:
pth <- file.path("Folder/Fiction/texts")
fi <- list.files(pth)
corpus <- lapply(fi, readLines)
corp.list <- strsplit(corpus, "[[:space:]]+")
corp.list <- setNames(object = corp.list, nm = fi)
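From there, one way to attach the metadata could look like the sketch below (assuming the metadata csv has a column holding the file names; the column name filename is an assumption about your file, not something from your question):
corpus.info <- read.csv("stuff.csv", stringsAsFactors = FALSE)  # your metadata csv
# match each named element of corp.list to the metadata row with the same file name
meta_rows <- match(names(corp.list), corpus.info$filename)      # "filename" is a placeholder column name
corpus.meta <- corpus.info[meta_rows, ]
# corpus.meta now has one row per document, in the same order as corp.list,
# so corp.list[[i]] and corpus.meta[i, ] refer to the same text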

R text file and text mining...how to load data

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.
I don't understand the documentation on how to load a text file and create the necessary objects to start using features such as:
stemDocument(x, language = map_IETF(Language(x)))
So assume that this is my doc: "this is a test for R load".
How do I load the data for text processing and to create the object x?
Like #richiemorrisroe I found this poorly documented. Here's how I get my text in to use with the tm package and make the document term matrix:
library(tm) #load text mining library
setwd('F:/My Documents/My texts') #sets R's working directory to near where my files are
a <-Corpus(DirSource("/My Documents/My texts"), readerControl = list(language="lat")) #specifies the exact folder where my text file(s) is for analysis with tm.
summary(a) #check what went in
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english")) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
a <- tm_map(a, stemDocument, language = "english")
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
In this case you don't need to specify the exact file name. So long as it's the only one in the directory referred to in line 3, it will be used by the tm functions. I do it this way because I have not had any success in specifying the file name in line 3.
If anyone can suggest how to get text into the lda package I'd be most grateful. I haven't been able to work that out at all.
Can't you just use the function readPlain from the same library? Or you could just use the more common scan function.
mydoc.txt <-scan("./mydoc.txt", what = "character")
I actually found this quite tricky to begin with, so here's a more comprehensive explanation.
First, you need to set up a source for your text documents. I found that the easiest way (especially if you plan on adding more documents) is to create a directory source that will read in all of your files.
source <- DirSource("yourdirectoryname/") #input path for documents
YourCorpus <- Corpus(source, readerControl=list(reader=readPlain)) #load in documents
You can then apply the stemDocument function to your Corpus. HTH.
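For example, along the lines of the other answers here (a minimal sketch):
YourCorpus <- tm_map(YourCorpus, stemDocument, language = "english")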
I believe what you wanted to do was read an individual file into a corpus and then treat the different rows in the text file as different observations.
See if this gives you what you want:
text <- read.delim("this is a test for R load.txt", sep = "/t")
text_corpus <- Corpus(VectorSource(text), readerControl = list(language = "en"))
This is assuming that the file "this is a test for R load.txt" has only one column which has the text data.
Here the "text_corpus" is the object that you are looking for.
Hope this helps.
Here's my solution for a text file with one line per observation. The latest vignette on tm (Feb 2017) gives more detail.
text <- read.delim(textFileName, header = FALSE, sep = "\n", stringsAsFactors = FALSE)
colnames(text) <- c("MyCol")
docs <- text$MyCol
a <- VCorpus(VectorSource(docs))
The following assumes you have a directory of text files from which you want to create a bag of words.
The only change that needs to be made is to replace
path = "C:\\windows\\path\\to\\text\\files\\"
with your directory path.
library(tidyverse)
library(tidytext)
# create a data frame listing all files to be analyzed
all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\", # path can be relative or absolute
pattern = ".txt$", # this pattern only selects files ending with .txt
full.names = TRUE) # gives the file path as well as name
# create a data frame with one word per line
my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>% # read in each file in the list
                       mutate(filename = basename(.x)) %>%       # add the file name as a new column
                       unnest_tokens(word, txt))                 # split each word out as a separate row
# count the total # of rows/words in your corpus
my_corpus %>%
summarize(number_rows = n())
# group and count by "filename" field and sort descending
my_corpus %>%
group_by(filename) %>%
summarize(number_rows = n()) %>%
arrange(desc(number_rows))
# remove stop words
my_corpus2 <- my_corpus %>%
anti_join(stop_words)
# repeat the count after stop words are removed
my_corpus2 %>%
group_by(filename) %>%
summarize(number_rows = n()) %>%
arrange(desc(number_rows))
