I am doing text analysis in R. So far, I have a vector that contains the corpus, and metadata in a CSV file that I would like to merge with it. Here is how I obtain the corpus in vector form:
corpus <- VCorpus(VectorSource(alldocs)) # corpus is a vector
Here is the metadata:
metadata <- read.csv("alldocs.csv", header = TRUE, na.strings = c(""), sep = ",")
How can I combine the two? I want to combine them in order (i.e., the first document in the corpus corresponds to the first row in the csv, etc.). In the end, I want a data frame where each row corresponds to the right document from the corpus.
Update:
I was told to try to make the problem reproducible.
I started with a folder containing all the texts I have. I begin by loading them in:
alldocs <- Corpus(
DirSource("/path/file/wheredocumentsare"),
readerControl = list(reader = readPlain, language = "en", load = FALSE)
)
corpus <- VCorpus(VectorSource(alldocs)) # corpus is a vector
metadata <- read.csv("metadata.csv", header = TRUE, na.strings = c(""), sep = ",")
I would like to combine metadata and corpus. Yet when I run
fulldata <- data.frame(corpus, metadata)
I get the following error message
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class "c("VCorpus", "Corpus")" to a data.frame
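A minimal sketch of one way around the error (assuming tm is loaded and that the CSV rows are in the same order as the documents) is to pull the plain text out of the corpus first and then build the data frame:
# extract the text of each document, in corpus order
texts <- sapply(corpus, function(doc) paste(content(doc), collapse = " "))
# bind to the metadata row-wise; the number of documents must match the number of rows
fulldata <- data.frame(text = texts, metadata, stringsAsFactors = FALSE)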
Related
I have a large tokenized dfm of dimension 2656242 x 630566. I want to convert this to a matrix, but any attempt to do so gives me the following error:
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
My code so far is below:
Booker_PreSale = Samp2 %>% filter(Booking_Status=="Booker" & Pre_Post_Sale=="Pre-Sale")
Non_Booker_PreSale = Samp2 %>% filter(Booking_Status=="Non-Booker" & Pre_Post_Sale=="Pre-Sale")
data = rbind(Booker_PreSale,Non_Booker_PreSale)
data = data[,c(5,2)]
data = na.omit(data)
data$Booking_Status = as.factor(data$Booking_Status)
data$TextLength = nchar(as.character(data$comments))
library(caret)
set.seed(32984)
indexes = createDataPartition(data$Booking_Status,times = 1,
p=0.7,list = FALSE)
train = data[indexes,]
test = data[-indexes,]
library(quanteda)
train_tokens = tokens(as.character(train$comments), what = "word",
remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_hyphens = TRUE)
train_tokens = tokens_tolower(train_tokens)
train_tokens = tokens_select(train_tokens, stopwords(),
selection = "remove")
train_tokens = tokens_wordstem(train_tokens, language = "english")
train_tokens_dfm = dfm(train_tokens, tolower = FALSE)
train_tokens_matrix = as.matrix(train_tokens_dfm[,c(1:500)])
I am unable to proceed any further from this point and need some help with a way around it.
Thanks in advance.
Seems like your dfm is simply too large. Therefore, first ask yourself whether you really need to convert your dfm object to a matrix. If you want to fit a model (e.g., a topic model) that takes your tokenized documents as input, you most likely do not need to convert the dfm object to a matrix!
If you do not explicitly need a matrix, I would recommend first converting your dfm object to a non-quanteda format; this can be achieved using
non_dfm <- quanteda::convert(train_tokens_dfm, to = "lda")
You can then extract the dfm content as a list of matrices using dfm_list <- non_dfm$documents. Each list element is associated with a document and contains two rows: the first row gives the index of the token (into non_dfm$vocab), and the second row gives the number of occurrences of that token in the document. You thus have exactly the same information that is contained in a document-feature matrix.
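A small self-contained sketch of what that looks like (assuming the "lda" output format of quanteda::convert()):
library(quanteda)
toks <- tokens(c(d1 = "a b b c", d2 = "b c c d"))
small_dfm <- dfm(toks)
non_dfm <- convert(small_dfm, to = "lda")
str(non_dfm$documents)  # one two-row matrix per document: token index and count
head(non_dfm$vocab)     # the vocabulary that the indices refer to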
As part of my efforts to text-mine research papers, I am interested in looking at tf-idf values.
So far I have had difficulty using tidytext for tf-idf due to issues with columns/objects not being detected (a recurring issue on this site). Therefore I used tm's weighting and hoped to view all my results by exporting them to csv.
The limited results that I have are in the right format (paper; term; tf-idf value). Only a few of the papers, though, are available, despite the fact that the object states there are 71 documents. (One document is not readable and therefore shows up with an error that can be ignored.)
Any help is appreciated, cheers
setwd('C:\\Users\\[--myname--]\\Desktop\\Text_Mine_TestSet_1')
files <- list.files(pattern = 'pdf$')
summary(files)
corpus_a1 <- Corpus(URISource(files),
readerControl = list(reader = readPDF()))
TDM_a1 <- TermDocumentMatrix(corpus_a1, control = list(removePunctuation = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming =TRUE,
removeNumbers = TRUE))
DTM_a1 <- DocumentTermMatrix(corpus_a1, control = list(removePunctuation = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming =TRUE,
removeNumbers = TRUE))
# --------------------------
tdm_TfIdf <- weightTfIdf(TDM_a1)
tdm_TfIdf # 71 documents, 32,177 terms (could remove sparse terms here)
tdm_TfIdf %>%
View() # Odd table
inspect(tdm_TfIdf) # Shows limited output
print(tdm_TfIdf)
library(devtools)
tdm_inspect <- inspect(tdm_TfIdf)
tdm_DF <- as.data.frame(tdm_inspect, stringsAsFactors = FALSE)
tdm_DF
write.table(tdm_DF)
write.csv(tdm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\tdm_TfIdf.csv',
row.names = TRUE)
# ---------------------
# SAME ISSUE SIMPLY X and Y AXIS FLIPPED
dtm_TfIdf <- weightTfIdf(DTM_a1)
dtm_TfIdf # 71 documents, 32,177 terms (could remove sparse terms here)
dtm_TfIdf %>%
View() # Odd table
inspect(dtm_TfIdf) # Shows limited output
print(dtm_TfIdf)
dtm_inspect <- inspect(dtm_TfIdf)
dtm_DF <- as.data.frame(dtm_inspect, stringsAsFactors = FALSE)
dtm_DF
write.table(dtm_DF)
write.csv(dtm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\dtm_TfIdf.csv',
row.names = TRUE)
As stated above, only four papers and ten terms appear in the resulting csv file. I am unsure why the results are limited in this manner.
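One plausible explanation, sketched below under the assumption that the truncation comes from tm's inspect(), which prints and returns only a small sample of a large matrix: convert the full weighted matrix instead of the inspect() result (71 x 32,177 is still manageable as a dense matrix). The file name here is just an example.
# convert the whole weighted matrix rather than the inspect() sample
tdm_full_DF <- as.data.frame(as.matrix(tdm_TfIdf), stringsAsFactors = FALSE)
dim(tdm_full_DF)  # terms x documents, i.e. 32177 x 71
write.csv(tdm_full_DF, 'tdm_TfIdf_full.csv', row.names = TRUE)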
Ultimately I was able to accomplish this goal (though not another, related one I posted about concerning my work). Most importantly, I used CERMINE (https://github.com/CeON/CERMINE), whose developers I cannot thank enough and will cite in my work. It allowed me to convert my .pdf files into .txt while keeping the document format.
In regard to exporting TF-IDF values to Excel, I also had a great deal of help. This help, however, has no original reference point that I can find; I found it from someone who sourced it from someone else, and so on. After making a data frame (e.g., DF <- data.frame(x, y)), export each one as a sheet within an Excel (.xlsx) file with this code:
NB: please take credit if you wrote this script; it has been immensely useful.
xlsx.writeMultipleData <- function(file, ...) {
  require(xlsx, quietly = TRUE)
  objects <- list(...)
  fargs <- as.list(match.call(expand.dots = TRUE))
  objnames <- as.character(fargs)[-c(1, 2)]
  nobjects <- length(objects)
  for (i in 1:nobjects) {
    if (i == 1) {
      write.xlsx(objects[[i]], file, sheetName = objnames[i])
    } else {
      write.xlsx(objects[[i]], file, sheetName = objnames[i], append = TRUE)
    }
  }
}
xlsx.writeMultipleData('filename.xlsx',
Dataframe_A, Dataframe_B, etc)
I'm having trouble using a RegEx on a corpus.
I read in a couple of text documents that I converted to a corpus.
I want to display it in a TermDocumentMatrix after some pre-processing.
First I want to apply the RegEx "(\b([a-z]*)\B)" to them. For example, "the host" -> "th", "hos".
Then I want to use character n-grams with n = 1:3, so for the previous example ->
"t", "th", "h", "ho", "hos". Hence I want all character sequences that begin a word but do not include its last character.
My code so far gives me a TermDocumentMatrix with n = 1:3 on the whole corpus. However, all my approaches to adding the RegEx so far haven't been working.
I was wondering if there's a way to include it in: typedPrefix <- tokens()...
Here's the code:
# read documents
FILEDIR <- (path)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
my_corpus <- corpus(txts)
#start processing
typedPrefix <- my_corpus
typedPrefix <- tokens(gsub("\\s", "_", typedPrefix), "character", ngrams=1:3, conc="", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
dfm2 <- dfm(typedPrefix)
tdm2 <- as.TermDocumentMatrix(t(dfm2), weighting=weightTf)
as.matrix(tdm2)
#write output file
write.csv2(as.matrix(tdm2), file = "typedPrefix.csv")
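One possible approach, sketched below (not from the original post): build the prefixes in base R and hand them to quanteda as pre-tokenized documents instead of passing a RegEx to tokens().
library(quanteda)
# all prefixes of length 1 to 3 that stop before the last character of a word
prefixes_1_to_3 <- function(txt) {
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  words <- words[nchar(words) > 1]
  unlist(lapply(words, function(w) substring(w, 1, seq_len(min(3, nchar(w) - 1)))))
}
docs <- c(d1 = "the host", d2 = "another example document")
toks <- as.tokens(lapply(docs, prefixes_1_to_3))
dfm(toks)  # "the host" -> "t", "th", "h", "ho", "hos"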
I'm quite new to R and a bit stuck on what I feel is likely a common operation. I have a number of files (57, with ~1.5 billion rows cumulatively, by 6 columns) that I need to perform basic functions on. I'm able to read these files in and perform the calculations I need with no problem, but I'm tripping up on the final output. I envision the function working on one file at a time, outputting the worked file and moving on to the next.
After the calculations I would like to output 57 new .txt files, each named after the file its input data came from. So far I'm able to perform the calculations on smaller test datasets and spit out one appended .txt file, but this isn't what I want as a final output.
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#begin looping process
loop_output = lapply(files,
function(x) {
#Load 'x' file in
DF<- read.table(x, header = FALSE, sep= "\t")
#Call calculated height average a name
R_ref= 1647.038203
#Add column names to .las data
colnames(DF) <- c("X","Y","Z","I","A","FC")
#Calculate return
DF$R_calc <- (R_ref - DF$Z)/cos(DF$A*pi/180)
#Calculate intensity
DF$Ir_calc <- DF$I * (DF$R_calc^2/R_ref^2)
#Output new .txt with calculated columns
write.table(DF, file=, row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
})
My latest code endeavors have been to mess around with the initial lapply/sapply function like so:
#begin looping process
loop_output = sapply(names(files),
function(x) {
As well as the output line:
#Output new .csv with calculated columns
write.table(DF, file=paste0(names(DF), "txt", sep="."),
row.names = FALSE, col.names = FALSE, append = TRUE,fileEncoding = "UTF-8")
From what I've been reading, the file-naming step during write.table output may be one of the pieces I don't have fully aligned with the rest of the script yet. I've been looking at a lot of other questions that I felt were applicable:
Using lapply to apply a function over list of data frames and saving output to files with different names
Write list of data.frames to separate CSV files with lapply
but with no luck. I deeply appreciate any insights or pointers in the right direction on reading in x files, performing the same function on each, and then outputting the same x number of files. Thank you.
The reason the output is directed to the same file is probably that file = paste0(names(DF), "txt", sep=".") returns the same value in every iteration. That is, DF must have the same column names in every iteration, so names(DF) will be the same, and paste0(names(DF), "txt", sep=".") will be the same. Together with the append = TRUE option, the result is that all output is written to the same file.
Inside the anonymous function, x is the name of the input file. Instead of using names(DF) as a basis for the output file name you could do some transformation of this character string.
Example:
Given
x <- "/foo/raw_data.csv"
Inside the function you could do something like this
infile <- x
outfile <- file.path(dirname(infile), gsub('raw', 'clean', basename(infile)))
outfile
[1] "/foo/clean_data.csv"
Then use the new name for output, with append = FALSE (unless you need it to be true)
write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE, append = FALSE, fileEncoding = "UTF-8")
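Folded back into the original loop, the whole idea looks roughly like this (a sketch; the "calc_" prefix for the output names is just an example):
loop_output <- lapply(files, function(x) {
  DF <- read.table(x, header = FALSE, sep = "\t")
  colnames(DF) <- c("X", "Y", "Z", "I", "A", "FC")
  R_ref <- 1647.038203
  DF$R_calc <- (R_ref - DF$Z) / cos(DF$A * pi / 180)
  DF$Ir_calc <- DF$I * (DF$R_calc^2 / R_ref^2)
  # derive the output name from the input name, not from names(DF)
  outfile <- file.path(dirname(x), paste0("calc_", basename(x)))
  write.table(DF, file = outfile, row.names = FALSE, col.names = FALSE,
              append = FALSE, fileEncoding = "UTF-8")
})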
Using your code, this is the general idea:
require(purrr)
#list filenames
files <- list.files(path=, pattern="*.txt", full.names=TRUE, recursive=FALSE)
#Call calculated height average a name
R_ref= 1647.038203
dfTransform <- function(file){
colnames(file) <- c("X","Y","Z","I","A","FC")
#Calculate return
file$R_calc <- (R_ref - file$Z)/cos(file$A*pi/180)
#Calculate intensity
file$Ir_calc <- file$I * (file$R_calc^2/R_ref^2)
return(file)
}
output <- files %>%
  map(read.table, header = FALSE, sep = "\t") %>%
  map(dfTransform) %>%
  # write each transformed data frame to a file named after its input file
  # (the "_calc" suffix is just an example)
  map2(files, ~ write.table(.x, file = sub("\\.txt$", "_calc.txt", .y),
                            row.names = FALSE, col.names = FALSE,
                            fileEncoding = "UTF-8"))
I'm trying to get a count of the keywords in my corpus using the R "tm" package. This is my code so far:
# get the data strings
f<-as.vector(forum[[1]])
# replace +
f<-gsub("+", " ", f ,fixed=TRUE)
# lower case
f<-tolower(f)
# show all strings that contain mobile
mobile<- f[grep("mobile", f, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)]
text.corp.mobile <- Corpus(VectorSource(mobile))
text.corp.mobile <- tm_map(text.corp.mobile , removePunctuation)
text.corp.mobile <- tm_map(text.corp.mobile , removeWords, c(stopwords("english"),"mobile"))
dtm.mobile <- DocumentTermMatrix(text.corp.mobile)
dtm.mobile
dtm.mat.mobile <- as.matrix(dtm.mobile)
dtm.mat.mobile
This returns a table with binary results of whether a keyword appeared in one of the corpus texts or not.
Instead of getting the final result in a binary form I would like to get a count for each keyword. For example:
'car' appeared 5 times
'button' appeared 9 times
Without seeing your actual data it's a bit hard to tell, but since you just called DocumentTermMatrix, I would try something like this:
dtm.mat.mobile <- as.matrix(dtm.mobile)
# in a DocumentTermMatrix, documents are rows and terms are columns,
# so per-term counts come from the column sums
word.freqs <- sort(colSums(dtm.mat.mobile), decreasing = TRUE)
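For example, to look at the top of the ranking:
head(word.freqs, 10)   # the ten most frequent terms and their counts
word.freqs["car"]      # count for a specific keyword (if it occurs at all)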