How to convert a large tokenized dfm to matrix in R?

I have a large tokenized dfm of dimensions 2656242 x 630566. I want to convert this to a matrix, but any attempt to do so gives me the following error:
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
My code so far is as follows:
Booker_PreSale = Samp2 %>% filter(Booking_Status=="Booker" & Pre_Post_Sale=="Pre-Sale")
Non_Booker_PreSale = Samp2 %>% filter(Booking_Status=="Non-Booker" & Pre_Post_Sale=="Pre-Sale")
data = rbind(Booker_PreSale,Non_Booker_PreSale)
data = data[,c(5,2)]
data = na.omit(data)
data$Booking_Status = as.factor(data$Booking_Status)
data$TextLength = nchar(as.character(data$comments))
library(caret)
set.seed(32984)
indexes = createDataPartition(data$Booking_Status, times = 1,
                              p = 0.7, list = FALSE)
train = data[indexes,]
test = data[-indexes,]
library(quanteda)
train_tokens = tokens(as.character(train$comments), what = "word",
                      remove_numbers = TRUE, remove_punct = TRUE,
                      remove_symbols = TRUE, remove_hyphens = TRUE)
train_tokens = tokens_tolower(train_tokens)
train_tokens = tokens_select(train_tokens, stopwords(),
                             selection = "remove")
train_tokens = tokens_wordstem(train_tokens, language = "english")
train_tokens_dfm = dfm(train_tokens, tolower = FALSE)
train_tokens_matrix = as.matrix(train_tokens_dfm[,c(1:500)])
I am unable to proceed any further from this point and need some help with a way around it.
Thanks in advance.

It seems your dfm is simply too large. Therefore, first ask yourself whether you really need to convert your dfm object to a matrix at all. If you want to fit a model (e.g., a topic model) that takes your tokenized documents as input, you most likely do not need to convert the dfm object to a dense matrix!
If you do not explicitly need a matrix, I would recommend first converting your dfm object to a non-quanteda format; this can be achieved using
non_dfm <- quanteda::convert(train_tokens_dfm, to = "lda")
You can then extract the dfm content as a list of lists using dfm_list <- non_dfm$documents. Each list element is associated with one document and contains two rows: the first row gives the index of a token, and the second row gives the number of occurrences of that token in the document; the corresponding token strings are in non_dfm$vocab. You thus have exactly the same information that is contained in a document-feature matrix.
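For illustration, here is a minimal, self-contained sketch of that conversion on a toy dfm (the documents below are made up; the same two calls apply to train_tokens_dfm):
library(quanteda)
# toy stand-in for train_tokens_dfm
toy_tokens <- tokens(c(doc1 = "good room great view", doc2 = "bad room good price"))
toy_dfm <- dfm(toy_tokens)
# convert to a plain list-based ("lda"-style) representation
non_dfm <- convert(toy_dfm, to = "lda")
str(non_dfm$documents)  # one 2-row matrix per document: token index / count
head(non_dfm$vocab)     # the vocabulary that the indices refer to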

Related

Error in `rownames<-`(`*tmp*`, value = colnames(countData)): attempt to set 'rownames' on an object with no dimensions

I found exactly the same question, but without any useful answer, since the author didn't provide their files. I am using the DESeq2 library, following the manual section 3.2, "Starting from count matrices". I have my countdata and coldata imported from CSV files. I understand that the countdata file can be the problem here, but I don't understand what exactly the problem is.
My code:
library(DESeq2)
NGS <- read.csv2(paste0(datadir,"/CLN3_NGS_orig.csv"), header = T,stringsAsFactors = F)
Sinfo <- read.csv2(paste0(datadir,"/Sampleinfo.csv"), header = T,stringsAsFactors = F)
head(NGS)
head(Sinfo)
coldata <- DataFrame(Sinfo)
coldata <- lapply(coldata, as.factor)
coldata
lapply(NGSnum, class)
NGSnum <- data.frame(NGS[1], apply(NGS[2:13],2, as.numeric))
NGSFull <- DESeqDataSetFromMatrix(
  countData = NGSnum,
  colData = coldata,
  design = ~ Genotype + Treatment)
NGSFull
NGS$Genotype <- relevel(NGSFull$Genotype, "WT")
deseqNGS <- DESeq(NGS)
res <- results(deseqNGS)
res
My error after applying DESeqDataSetFromMatrix:
Error in `rownames<-`(`*tmp*`, value = colnames(countData)) :
attempt to set 'rownames' on an object with no dimensions
My coldata and countdata files on pastebin: coldata & countdata
By the way, my countdata contains transcripts, and sometimes several transcripts (ENST) correspond to a single gene (ENSG). Can DESeq2 sort this out for me and give me output only at the gene level? It is easy to map transcripts to genes, but harder to merge several entries into one.
Thank you in advance,
Kasia
As a general rule Bioconductor questions will get a lot more (relevant) attention on the Bioconductor support site link here.
However, I can attempt to give a few pointers. The error you are getting occurs because your coldata is a list instead of a DataFrame object:
coldata <- lapply(coldata, as.factor)
creates a plain list with one element per column.
There are also a few other issues that I've addressed in the code below. Most importantly, NGSnum needs to be an integer matrix. Many RNA-seq count matrices actually contain floating-point values (doubles in R), because the quantification algorithm assigns probabilities for reads that could have come from multiple genes. What I've done is round the values to turn them into integers.
library(DESeq2)
NGS <- read.csv2("Countdata10.csv", header = TRUE, stringsAsFactors = FALSE)
Sinfo <- read.csv2(paste0("Sampleinfo.csv"), header = TRUE, stringsAsFactors = FALSE)
coldata <- DataFrame(apply(X = Sinfo, MARGIN = 2, FUN = as.factor)) # apply returns a matrix-like result rather than the plain list lapply gives
NGSnum <- apply(X = NGS[,-1], MARGIN = 2, FUN = as.numeric)
NGSnum <- apply(X = NGSnum, MARGIN = 2, FUN = round)
rownames(NGSnum) <- NGS$Transcript
NGSFull <- DESeqDataSetFromMatrix(
  countData = NGSnum,
  colData = coldata,
  design = ~ Genotype + Treatment)
NGSFull$Genotype <- relevel(NGSFull$Genotype, "WT")
deseqNGS <- DESeq(NGSFull)
res <- results(deseqNGS)
res
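On the transcript-versus-gene point: DESeq2 itself will not collapse transcript-level counts to gene level. A minimal sketch of one crude way to do it, assuming a hypothetical named vector tx2gene that maps each ENST ID to its ENSG ID (the tximport package is the more standard route if you still have the quantification files):
# tx2gene is a hypothetical lookup, e.g. built from a biomaRt query:
# tx2gene <- c(ENST00000000001 = "ENSG000000000A", ENST00000000002 = "ENSG000000000A", ...)
gene_counts <- rowsum(NGSnum, group = tx2gene[rownames(NGSnum)])  # sum transcript counts per gene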

Unable to Export (or View) Full Tf-Idf Results for Text Mining

As part of my efforts to textmine research papers I am interested in looking at Tf-Idf values.
So far I have had difficulty using tidytext for tf-idf due to issues with columns/objects not being detected (a recurring issue on this site). I therefore used tm's weighting and hoped to view all my results by exporting them to CSV.
The limited results that I do get are in the right format (paper; term; tf-idf value), but only a few of the papers appear, even though the object states that there are 71 documents. (One document is not readable and therefore produces an error that can be ignored.)
Any help is appreciated, cheers
setwd('C:\\Users\\[--myname--]\\Desktop\\Text_Mine_TestSet_1')
files <- list.files(pattern = 'pdf$')
summary(files)
library(tm)
corpus_a1 <- Corpus(URISource(files),
                    readerControl = list(reader = readPDF()))
TDM_a1 <- TermDocumentMatrix(corpus_a1, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE))
DTM_a1 <- DocumentTermMatrix(corpus_a1, control = list(removePunctuation = TRUE,
                                                       stopwords = TRUE,
                                                       tolower = TRUE,
                                                       stemming = TRUE,
                                                       removeNumbers = TRUE))
# --------------------------
tdm_TfIdf <- weightTfIdf(TDM_a1)
tdm_TfIdf # 71 Documents 32,177 terms (can sparse here)
tdm_TfIdf %>%
View() # Odd table
inspect(tdm_TfIdf) # Shows limited output
print(tdm_TfIdf)
library(devtools)
tdm_inspect <- inspect(tdm_TfIdf)
tdm_DF <- as.data.frame(tdm_inspect, stringsAsFactors = FALSE)
tdm_DF
write.table(tdm_DF)
write.csv(tdm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\tdm_TfIdf.csv',
row.names = TRUE)
# ---------------------
# SAME ISSUE SIMPLY X and Y AXIS FLIPPED
dtm_TfIdf <- weightTfIdf(DTM_a1)
dtm_TfIdf # 71 Documents 32,177 terms (can sparse here)
dtm_TfIdf %>%
View() # Odd table
inspect(dtm_TfIdf) # Shows limited output
print(dtm_TfIdf)
dtm_inspect <- inspect(dtm_TfIdf)
dtm_DF <- as.data.frame(dtm_inspect, stringsAsFactors = FALSE)
dtm_DF
write.table(dtm_DF)
write.csv(dtm_DF, 'C:\\Users\\Hunter S. Baggen\\Desktop\\dtm_TfIdf.csv',
row.names = TRUE)
As stated above, four papers and ten terms appear in the resulting csv file. I am unsure why the results would be limited in this manner.
Ultimately I was able to accomplish this goal (though not another related one I posted about that is also relevant to my work). Most importantly, I used CERMINE (https://github.com/CeON/CERMINE), whose authors I cannot thank enough and will cite in my work. It allowed me to convert my .pdf files into .txt while keeping the document structure.
Regarding exporting the TF-IDF values to Excel, I also had a great deal of help. This help, however, has no original reference point that I can find; I got it from someone who sourced it from someone else, and so on. After making your data frames, export each one as a sheet within an Excel workbook with this code:
NB: please take credit if you wrote this script; it has been immensely useful.
xlsx.writeMultipleData <- function (file, ...)
{
  require(xlsx, quietly = TRUE)
  objects <- list(...)
  fargs <- as.list(match.call(expand.dots = TRUE))
  objnames <- as.character(fargs)[-c(1, 2)]
  nobjects <- length(objects)
  for (i in 1:nobjects) {
    if (i == 1)
      write.xlsx(objects[[i]], file, sheetName = objnames[i])
    else
      write.xlsx(objects[[i]], file, sheetName = objnames[i],
                 append = TRUE)
  }
}
xlsx.writeMultipleData('filename.xlsx',
Dataframe_A, Dataframe_B, etc)
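For completeness, a minimal sketch of one way to write the complete tf-idf matrix to CSV (assuming it fits in memory): inspect() only returns a small sample of a sparse TermDocumentMatrix, which is likely why the CSV built from it contained so few rows, whereas as.matrix() materialises everything:
# full term-by-document tf-idf table, not just the small sample returned by inspect()
tdm_full_DF <- as.data.frame(as.matrix(tdm_TfIdf), stringsAsFactors = FALSE)
write.csv(tdm_full_DF, 'tdm_TfIdf_full.csv', row.names = TRUE)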

Create Document Term Matrix with N-Grams in R

I am using "tm" package to create DocumentTermMatrix in R. It works well for one - gram but i am trying to create a DocumenttermMatrix of N-Grams(N = 3 for now) using tm package and tokenize_ngrams function from "tokenizers" package.
But im not able to create it.
I searched for possible solution but i didnt get much help.
For privacy reasons i can not share the data.
Here is what i have tried,
library(tm)
library(tokenizers)
data is a data frame with around 4.5k rows and 2 columns, namely "doc_id" and "text":
data_corpus = Corpus(DataframeSource(data))
Custom function for n-gram tokenization:
ngram_tokenizer = function(x){
  temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
  return(temp)
}
Control list for DTM creation:
1-gram:
control_list_unigram = list(tokenize = "words",
                            removePunctuation = FALSE,
                            removeNumbers = FALSE,
                            stopwords = stopwords("english"),
                            tolower = T,
                            stemming = T,
                            weighting = function(x)
                              weightTf(x)
                            )
For n-gram tokenization:
control_list_ngram = list(tokenize = ngram_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE,
                          stopwords = stopwords("english"),
                          tolower = T,
                          stemming = T,
                          weighting = function(x)
                            weightTf(x)
                          )
dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
dim(dtm_unigram)
dim(dtm_ngram)
The dimensions of both DTMs were the same.
Please correct me!
Unfortunately, tm has some quirks that are annoying and not always clear. First of all, custom tokenizing doesn't seem to work on corpora created with Corpus; you need to use VCorpus for this.
So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).
That is one issue tackled. The corpus will now work for tokenizing, but you will run into an issue with tokenize_ngrams. You will get the following error:
Input must be a character vector of any length or a list of character
vectors, each of which has a length of 1.
when you run this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)
To solve this, and to avoid a dependency on the tokenizers package, you can use the following function to tokenize the data:
NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}
This uses the ngrams function from the NLP package which is loaded when you load the tm package. 1:3 tells it to create ngrams from 1 to 3 words. So your control_list_ngram should look like this:
control_list_ngram = list(tokenize = NLP_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE,
                          stopwords = stopwords("english"),
                          tolower = T,
                          stemming = T,
                          weighting = function(x)
                            weightTf(x)
                          )
Personally I would use the quanteda package for all of this work. But for now this should help you.
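For comparison, a rough sketch of how the same unigram-to-trigram document-feature matrix could be built in quanteda, assuming data still has the doc_id and text columns described above:
library(quanteda)
corp <- corpus(data, docid_field = "doc_id", text_field = "text")
toks <- tokens(corp)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks, language = "english")
toks_ngram <- tokens_ngrams(toks, n = 1:3, concatenator = "_")
dfm_ngram <- dfm(toks_ngram)
dim(dfm_ngram)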

Getting started with text analysis, making a dataframe in R

I am doing text analysis in R. So far, I have a corpus built from a vector of documents, and metadata in a CSV that I would like to merge with it. Here is how I obtain the corpus from the vector:
corpus <- VCorpus(VectorSource(alldocs)) # corpus is a vector
Here is the metadata:
metadata <- read.csv("alldocs.csv", header = TRUE, na.strings = c(""), sep = ",")
How can I combine the two? I want to combine them in order (i.e., the first document in the corpus corresponds to the first row of the CSV, and so on). In the end, I want a data frame where each row corresponds to the right document from the corpus.
Update:
I was told to try to make the problem reproducible.
I started with a folder containing all the texts I have. I begin by loading them:
alldocs <- Corpus(
  DirSource("/path/file/wheredocumentsare"),
  readerControl = list(reader = readPlain, language = "en", load = FALSE)
)
corpus <- VCorpus(VectorSource(alldocs)) # corpus is a vector
metadata <- read.csv("metadata.csv", header = TRUE, na.strings = c(""), sep = ",")
I would like to combine metadata and corpus. Yet when I run
fulldata <- data.frame(corpus, metadata)
I get the following error message
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class "c("VCorpus", "Corpus")" to a data.frame
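A minimal sketch of one way to do this merge, assuming the documents in the corpus and the rows of the metadata CSV really are in the same order:
library(tm)
# pull each document's text out of the corpus as a plain character vector
doc_text <- vapply(seq_along(corpus),
                   function(i) paste(content(corpus[[i]]), collapse = "\n"),
                   character(1))
# one row per document, with the metadata columns appended in order
fulldata <- data.frame(text = doc_text, metadata, stringsAsFactors = FALSE)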

How to remove outlier values while reading a file

I have a large number of files, each in tab-delimited format. I need to apply some modeling (glm/gbm etc.) to each of these files. They are obtained from hospital data, where in exceptional cases entries may not be in the proper format. For example, when entering the glucose level for a patient, the data entry operator may enter N or A by mistake instead of an actual number.
While reading these files in a loop, I run into problems because such columns (glucose) are treated as factors when they should be numeric. It is painful to investigate each file and look for errors. I am reading the files in the following way, but it is certainly not a good approach:
read.table(fn, header = TRUE, sep= "\t" , na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A"))
Is there another function through which I can treat those values/outliers as NA?
You can flag which elements can be parsed as numbers (shown here for the glucose column; the FALSE entries are the offending values):
data = read.csv(file, as.is = TRUE, sep = '\t') # don't convert strings to factors
glucose = data$glucose
sapply(glucose, function(x) !is.na(as.numeric(x)), USE.NAMES = FALSE)
Then you can work with these indexes (interpolate or remove).
To loop the files:
files = list.files(path, pattern = '\\.csv$', full.names = TRUE) # full.names so read.csv gets the complete path
for (file in files)
{
  data = read.csv(file, sep = '\t', as.is = TRUE)
  gluc = data$glucose
  idxs = sapply(gluc, function(x) !is.na(as.numeric(x)), USE.NAMES = FALSE)
  # interpolate or remove here
}
Use the colClasses argument of read.table and friends to specify which columns should be numeric, so R does not need to guess. Note that if a column is declared numeric, values listed in na.strings become NA, while any other non-numeric entry makes read.table stop with an error that points at the offending value, so include the strings you expect (such as "N" or "A") in na.strings.
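A minimal sketch of that approach, assuming the tab-delimited file has a column literally named glucose and reusing the na.strings list from the question (plus "N", which the question mentions):
fn  <- "patient_data.txt"  # hypothetical file name
dat <- read.table(fn, header = TRUE, sep = "\t",
                  colClasses = c(glucose = "numeric"),
                  na.strings = c('', 'NEG', 'TR', 'NA', '<NA>', "Done", "D", "A", "N"))
str(dat$glucose)  # numeric, with the bad entries as NA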
