Writing a Simple Triplet Matrix to a File? - r

I am using the tm package to compute a term-document matrix for a dataset. I now need to write the term-document matrix to a file, but when I use the write functions in R I get an error.
Here is the code which I am using and the error I am getting:
data("crude")
tdm <- TermDocumentMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))
This is the error I get when I use the write.table command on this data:
Error in cat(list(...), file, sep, fill, labels, append) : argument 1 (type 'list') cannot be handled by 'cat'
I understand that tdm is an object of type Simple Triplet Matrix, but how can I write it to a simple text file?

I think I might be misunderstanding the question, but if all you want to do is export the term document matrix to a file, then how about this:
m <- as.matrix(tdm)  # dense conversion; inspect() in recent tm versions may only display a sample
DF <- as.data.frame(m, stringsAsFactors = FALSE)
write.table(DF)
Is that what you're after mate?
Hope that helps a little,
Tony Breyal

Should the file be "human-readable"? If not, use dump, dput, or save. If so, convert your list into a data.frame.
Edit: You can convert your list into a matrix if each list element is equal length by doing matrix(unlist(list.name), nrow=length(list.name[[1]])) or something like that (or with plyr).
Why aren't you doing your SVM analysis in R (e.g. with kernlab)?
Edit 2: Ok, I looked at your data, and it isn't easy to convert into a matrix because the list elements aren't equal length:
> is.list(tdm)
[1] TRUE
> str(tdm)
List of 7
$ i : int [1:1475] 15 29 151 152 173 205 215 216 227 228 ...
$ j : int [1:1475] 1 1 1 1 1 1 1 1 1 1 ...
$ v : Named num [1:1475] 3.32 4.32 2.32 2 2.32 ...
..- attr(*, "names")= chr [1:1475] "1.50" "16.00" "barrel," "barrel." ...
$ nrow : int 985
$ ncol : int 20
$ dimnames :List of 2
..$ Terms: chr [1:985] "(bpd)" "(bpd)." "(gcc)" "(it) appears to be nearing a crossroads with regard to\nderegulation, both as it pertains to investments and imports," ...
..$ Docs : chr [1:20] "127" "144" "191" "194" ...
$ Weighting: chr [1:2] "term frequency - inverse document frequency" "tf-idf"
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
In order to convert this to a matrix, you will need to either take elements of this list (e.g. i, j) or else do some other manipulation.
Edit 3: Just to conclude my commentary here: these objects are intended to be used with the inspect function (see the package vignette).
As discussed, in order to use a function like write.table, you will need to convert your list into a matrix, which requires some manipulation of that list such that you have several vectors of equal length. Looking at the structure of these tm objects: this will be very difficult to do, and I suggest you work with the helper functions that are included with that package.

dtmMatrix <- as.matrix(dtm)
write.csv(dtmMatrix, 'mydata.csv')
This certainly does the work. However, when I tried it on a very large DTM (25000 by 35000), it gave errors relating to lack of memory space.
I used the following method:
dtm <- DocumentTermMatrix(corpus)
dtm1 <- removeSparseTerms(dtm,0.998) ##max allowed sparsity 0.998
m <- as.matrix(dtm1)  # dense conversion; inspect() in recent tm versions may only display a sample
DF <- as.data.frame(m, stringsAsFactors = FALSE)
write.csv(DF,"mydata0.998sparse.csv")
Which reduced the size of the document term matrix to a great extent!
Here you can increase the max allowable sparsity (closer to 1) to include more terms in DF.
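If the dense conversion runs out of memory, another option is to write only the non-zero entries, one (term, document, value) row per entry, using the i, j and v fields shown in the str() output earlier. The following is a minimal sketch under assumptions: tdm_like is a hypothetical toy list that merely mimics those fields (it is not a real tm object), and the file name is made up for illustration.

```r
# Toy stand-in mirroring the i / j / v / dimnames fields of a simple triplet matrix
# (hypothetical data, not produced by tm)
tdm_like <- list(
  i = c(1L, 3L, 2L),          # row (term) index of each non-zero entry
  j = c(1L, 1L, 2L),          # column (document) index of each non-zero entry
  v = c(3.32, 2.00, 4.32),    # the non-zero values themselves
  nrow = 3L, ncol = 2L,
  dimnames = list(Terms = c("barrel", "oil", "price"), Docs = c("127", "144"))
)

# One row per non-zero entry: look up the term and document names by index
triplets <- data.frame(
  term  = tdm_like$dimnames$Terms[tdm_like$i],
  doc   = tdm_like$dimnames$Docs[tdm_like$j],
  value = tdm_like$v,
  stringsAsFactors = FALSE
)

# Write a tab-separated text file (example file name in the session temp directory)
out_file <- file.path(tempdir(), "tdm_triplets.txt")
write.table(triplets, file = out_file, sep = "\t", row.names = FALSE, quote = FALSE)
```

The file grows with the number of non-zero entries rather than with nrow * ncol, which is the point of the triplet layout.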

Related

Error in asMethod(object): Cholmod error 'problem too large'

I have the following object
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:120671481] 0 2 3 6 10 13 21 22 25 36 ...
..@ p : int [1:51366] 0 3024 4536 8694 3302271 3302649 5715381 5756541 5784009 5801691 ...
..@ Dim : int [1:2] 10314738 51365
..@ Dimnames:List of 2
.. ..$ : chr [1:10314738] "line1" "line2" "line3" "line4" ...
.. ..$ : chr [1:51365] "sparito" "davide," "15enne" "di" ...
.. .. ..- attr(*, ".match.hash")=Class 'match.hash' <externalptr>
..@ x : num [1:120671481] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
This object comes from the function dtm_builder of text2map package. Since I would like to remove empty rows from the matrix, I thought about using the command:
raw.sum=apply(dtm,1,FUN=sum) # sum each row of the table
dtm2=dtm[raw.sum!=0,]
Anyway, I obtained the following error:
Error in asMethod(object): Cholmod error 'problem too large' at file ..
How could I fix it?
The short answer to your problem is that you're likely converting a sparse object to a dense object. Matrix package sparse matrix classes are very memory efficient when a matrix has a lot of zeros (like a DTM) by simply not allocating memory for the zeros.
@akrun's answer should work, but there is a rowSums function in base R and a rowSums function from the Matrix package. You would need to load the Matrix package first.
Here is an example dgCMatrix (note not loading Matrix package yet)
m1 <- Matrix::Matrix(1:9, 3, 3, sparse = TRUE)
m1[1, 1:3] <- 0
class(m1)
If we use the base R rowSums you get the error:
rowSums(m1)
Error in rowSums(dtm): 'x' must be an array of at least two dimensions
If the Matrix package is loaded, rowSums will be replaced with the Matrix package's own method, which works with dgCMatrix. This is also true for the bracket operator [. If you update text2map to version 0.1.5, Matrix is loaded by default.
That is a massive DTM, so you may still run into memory issues -- which will depend on your machine. One thing to note is that removing sparse rows/columns will not help much. So, although words that occur once or twice will make up about 60% of your columns, you will reduce the size in terms of memory more by removing the most frequent words (i.e. words with a number in every row).
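Putting the pieces together, dropping empty rows without a dense conversion might look like the sketch below. It reuses the small m1 example from above and assumes non-negative entries (as in a count-based DTM), so a row sum of zero means the row is empty.

```r
library(Matrix)

# Small sparse example matrix whose first row is all zeros
m1 <- Matrix(1:9, 3, 3, sparse = TRUE)
m1[1, 1:3] <- 0

# Matrix::rowSums works on the sparse representation directly,
# unlike apply(), which tries to convert the matrix to a dense one first
row_totals <- Matrix::rowSums(m1)

# Keep only rows with a non-zero total; drop = FALSE keeps the result a matrix
m2 <- m1[row_totals != 0, , drop = FALSE]
dim(m2)
```

Calling the function as Matrix::rowSums makes the choice of method explicit, regardless of which packages happen to be attached.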

writing a loop to go through a large list with sublists and save these sublist

I would like to extract data from large list with many sub-lists called 'summary' https://www.dropbox.com/s/uiair94p0v7z2zr/summary10.csv?dl=0
This file is a compilation of dose-response curve fits by patient and drug. I am sharing a small file with just 10 patients and 105 drugs, with x and y as the readout of each fit (100 points each).
I would like to save all the fits for each patient and every drug in a separate file.
I tried to convert the list into a data frame to use the tidyverse but didn't manage. I have only just started out with R, so this is very complex for me.
for (i in 1:length(summary10))
{for (j in 1:length(summary10[[i]]))
{x1 <- summary10[[i]][[j]][[1]]
y1 <- summary10[[i]][[j]][[2]]
print(summary10[[i]][[j]]);}}
The loop works, but I don't know how to save the fits in different files so that I will be able to tell what is what. I tried something I found online, but it doesn't work:
for (i in 1:length(summary10))
{for (j in 1:length(summary10[[i]]))
{x1 <- summary10[[i]][[j]][[1]]
y1 <- summary10[[i]][[j]][[2]]
cbind(x1,y1) -> resp
write.csv(resp, file = paste0(summary[[i]], ".-csv"), row.names = FALSE)
}}
Error in file(file, ifelse(append, "a", "w")) : invalid 'description' argument
In addition:
Warning message: In if (file == "") file <- stdout() else if (is.character(file)) { : the condition has length > 1 and only the first element will be used
It's really hard to anticipate what goes wrong, when we cannot see how you made summary10. No way am I going to guess how you came from your tabular file, to a list of lists (or whatever summary10 may be).
But in the end, your error indicates that you are providing an invalid filename in the file = paste0(summary[[i]], ".-csv") argument. A first tip on debugging is simply to print to the console. Try this on for size:
cbind(x1,y1) -> resp
cat(paste0(summary[[i]], ".-csv"), '\n') # <-----
# use `cat` to print the contents of your expression to the console
write.csv(resp, file = paste0(summary[[i]], ".-csv"), row.names = FALSE)
What is it? It should evaluate to a simple string, say B.M.21.S.-csv, but it might not be the case.
At a first glance, I would guess you've misspelled your variable. summary is usually a function, whereas you might be looking for summary10. Still, the i'th element of summary10 looks like it could be a list itself, so your expression will fail to produce a simple string.
Update with summary10
I always recommend using str to examine the structure of an object. For lists, use the argument max.level to avoid printing endless nested lists:
> str(summary10, max.level=1)
List of 10
$ B-HR-25 :List of 106
$ B-SR-22 :List of 106
$ B-VHR-01:List of 106
$ B-SR-23 :List of 106
$ B-SR-24 :List of 106
$ B-HR-21 :List of 106
$ B-M-21 :List of 106
$ B-SR-21 :List of 106
$ B-MR-01 :List of 106
$ B-M-01 :List of 106
And then a step further in:
> str(summary10[[1]], max.level=2)
List of 106
$ PP242 :List of 2
..$ x: num [1:100] 1 1.1 1.2 1.32 1.45 ...
..$ y: num [1:100] 0.923 0.922 0.921 0.92 0.919 ...
$ AZD8055 :List of 2
..$ x: num [1:100] 1 1.1 1.2 1.32 1.45 ...
..$ y: num [1:100] 0.953 0.953 0.953 0.952 0.952 ...
So object summary10 is a collection of patients (lists of lists); summary10[1] is the collection containing the first patient, summary10[[1]] the first patient (a list itself) with their responses to drugs.
So what happens when you try to make a filename from summary10[[i]]? Try it, I won't print the output here. Back to str(summary10), the patients' designations ("B-HR-25", etc.) are the names of the entries. Get them with names(summary10). As an exercise, compare names(summary10), names(summary10)[1], names(summary10[1]) and names(summary10[[1]]).
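Once the names are in hand, a loop over them gives each file a readable name. The following is only a sketch: the summary10 defined here is a tiny made-up stand-in with the same patient / drug / x, y nesting as the str() output above, and the patient_drug.csv naming scheme and the temp-directory location are assumptions for illustration.

```r
# Toy stand-in for summary10: patients -> drugs -> list(x, y)  (made-up values)
summary10 <- list(
  "B-HR-25" = list(PP242   = list(x = c(1, 1.1), y = c(0.92, 0.91)),
                   AZD8055 = list(x = c(1, 1.1), y = c(0.95, 0.95))),
  "B-SR-22" = list(PP242   = list(x = c(1, 1.1), y = c(0.80, 0.79)))
)

out_dir <- tempdir()        # write into the session temp directory
written <- character(0)     # keep track of the files created

for (patient in names(summary10)) {
  for (drug in names(summary10[[patient]])) {
    fit  <- summary10[[patient]][[drug]]
    resp <- data.frame(x = fit$x, y = fit$y)
    # Build a plain character string such as "B-HR-25_PP242.csv"
    out_file <- file.path(out_dir, paste0(patient, "_", drug, ".csv"))
    write.csv(resp, file = out_file, row.names = FALSE)
    written <- c(written, out_file)
  }
}
```

Iterating over names() rather than over 1:length() means the patient and drug labels are available directly when constructing the file name.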

Access transition matrix from markovchainFit object

I want to first calculate a Markov transition matrix and then raise it to a power. To achieve the first goal I use the markovchainFit function from the markovchain package, and it returns a data.frame rather than a matrix, so I need to convert it to a matrix before I take the power.
My R code snippet is like
#################################
# Estimate Transition Matrix #
#################################
setwd("G:/Data_backup/GDP_per_Capita")
library("foreign")
library("Hmisc")
mydata <- stata.get("G:/Data_backup/GDP_per_Capita/states.dta")
mydata
library(markovchain)
library(expm)
rgdp_e=mydata[,2:7]
rgdp_o=mydata[,8:13]
createSequenceMatrix(rgdp_e)
rgdp_e_trans<-markovchainFit(data=rgdp_e,,method="bootstrap",nboot=5, name="Bootstrap Mc")
rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans))
rgdp_e_trans<-as.matrix(rgdp_e_trans)
is.matrix(rgdp_e_trans)
rgdp_e_trans %^% 1/5
rgdp_e_trans is a data frame, and I try to convert it to a numeric matrix. It seems to work when I test it using the is.matrix command. However, the final line gives me an error saying
Error in rgdp_e_trans %^% 2 :
(list) object cannot be coerced to type 'double'
After some searching on Stack Overflow, I found this question describing a similar problem, and used rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans)) to coerce the object to 'double', but it does not seem to work.
Besides, the data.frame rgdp_e_trans contains no factors or characters.
The output in the console is like
> rgdp_e=mydata[,2:7]
> rgdp_o=mydata[,8:13]
> createSequenceMatrix(rgdp_e)
Error: not compatible with STRSXP
> rgdp_e_trans<-markovchainFit(data=rgdp_e,,method="bootstrap",nboot=5, name="Bootstrap Mc")
> rgdp_e_trans
$estimate
1 2 3 4 5
1 0.6172840 0.18930041 0.09053498 0.074074074 0.02880658
2 0.1125828 0.59602649 0.28476821 0.006622517 0.00000000
3 0.0000000 0.03846154 0.60256410 0.358974359 0.00000000
4 0.0000000 0.01162791 0.03488372 0.691860465 0.26162791
5 0.0000000 0.00000000 0.00000000 0.044247788 0.95575221
> rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans))
Error: (list) object cannot be coerced to type 'double'
> rgdp_e_trans<-as.matrix(rgdp_e_trans)
> is.matrix(rgdp_e_trans)
[1] TRUE
> rgdp_e_trans %^% 1/5
Error in rgdp_e_trans %^% 1 :
(list) object cannot be coerced to type 'double'
>
Any suggestions to fix the problem, or an alternative way to calculate the power? Thank you.
Additional:
> str(rgdp_e_trans)
List of 1
$ estimate:Formal class 'markovchain' [package "markovchain"] with 4 slots
.. ..@ states : chr [1:5] "1" "2" "3" "4" ...
.. ..@ byrow : logi TRUE
.. ..@ transitionMatrix: num [1:5, 1:5] 0.617 0.113 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..@ name : chr "Bootstrap Mc"
and I comment out the as.matrix part
rgdp_e=mydata[,2:7]
rgdp_o=mydata[,8:13]
createSequenceMatrix(rgdp_e)
rgdp_e_trans<-markovchainFit(data=rgdp_e,,method="bootstrap",nboot=5, name="Bootstrap Mc")
rgdp_e_trans
str(rgdp_e_trans)
# rgdp_e_trans<-as.numeric(unlist(rgdp_e_trans))
# rgdp_e_trans<-as.matrix(rgdp_e_trans)
# is.matrix(rgdp_e_trans)
rgdp_e_trans$estimate %^% 1/5
You can access the transition matrix directly from the object returned by markovchainFit as:
rgdp_e_trans$estimate@transitionMatrix
Here rgdp_e_trans is your return value from markovchainFit, which is actually a list containing the information from the fitting process. You access the estimate item of that list by using the $ operator. The estimate object is from a formal S4 class (see e.g. Advanced R by Hadley Wickham for a description of the object systems used in R), which is why in order to access its slots you have to use the @ operator instead of the standard $ used for lists and the more common S3 objects.
If you print out the return value of as.matrix(rgdp_e_trans) it should be immediately obvious where your initial approach went wrong. In general it's a good idea to check the structure of an object with the str function - instead of relying on its print method - when you encounter unexpected results or are working with new types of objects.
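The $ / @ distinction can be tried out without the markovchain package. The sketch below uses a made-up S4 class, toyChain, that only imitates the transitionMatrix slot; it is not the real markovchain class, and the matrix values are invented. The power is computed with plain %*% so that nothing beyond base R is needed.

```r
library(methods)

# Hypothetical S4 class with a single slot, imitating the structure shown by str()
setClass("toyChain", representation(transitionMatrix = "matrix"))

# A list holding an S4 object, like the return value of a fitting function
fit <- list(
  estimate = new("toyChain",
                 transitionMatrix = matrix(c(0.9, 0.1,
                                             0.5, 0.5), nrow = 2, byrow = TRUE))
)

# $ extracts the list element, @ extracts the slot of the S4 object
P <- fit$estimate@transitionMatrix

# Two-step transition matrix via ordinary matrix multiplication
P2 <- P %*% P
```

One thing worth checking in the original code: R gives %any% operators higher precedence than /, so rgdp_e_trans %^% 1/5 is parsed as (rgdp_e_trans %^% 1) / 5, which matches the `%^% 1` shown in the error message.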

Saving integration values in an array in R

I want to save the integral values in an array, say from q=1 to q=10 in the following program. But because the output has a non-numeric part, I am not able to do so. Kindly help.
q=10
integrand<-function(x)(q*x^3)
integrate(integrand,lower=0,upper=10)
The output is 25000 with absolute error < 2.8e-10
How can I remove the non-numeric part?
str() is your friend to figure this out:
> intval <- integrate(integrand,lower=0,upper=10)
> str(intval)
List of 5
$ value : num 25000
$ abs.error : num 2.78e-10
$ subdivisions: int 1
$ message : chr "OK"
$ call : language integrate(f = integrand, lower = 0, upper = 10)
- attr(*, "class")= chr "integrate"
So you can see that it is the value member you need:
> intval$value
[1] 25000
Then:
integrand<-function(x,q=10)(q*x^3)
tmpfun <- function(q) {
integrate(integrand,lower=0,upper=10,q=q)$value
}
sapply(1:10,tmpfun)
## [1] 2500 5000 7500 10000 12500 15000 17500 20000 22500 25000
I hope this is a simplified example, because this particular answer is much more simply obtained by (1) integrating analytically and (2) realizing that a scalar multiple can be taken out of an integral: 1:10*(10^4/4) gets the same answer.
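As a variation on the sapply call, vapply declares the expected result type up front, so a non-numeric result would raise an error instead of silently producing a list. This is a minimal sketch; the names q_values and integrals are just illustrative.

```r
# Integrand with q as an explicit parameter
integrand <- function(x, q) q * x^3

q_values <- 1:10

# vapply guarantees one numeric value per q; $value drops the error estimate
integrals <- vapply(
  q_values,
  function(q) integrate(integrand, lower = 0, upper = 10, q = q)$value,
  numeric(1)
)

# Label each entry so the stored values remain self-describing
names(integrals) <- paste0("q=", q_values)
```

The result is an ordinary named numeric vector, so it can be indexed, plotted, or written to a file like any other.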

How to delete a row from a data.frame without losing the attributes

for starters: I searched for hours on this problem by now - so if the answer should be trivial, please forgive me...
What I want to do is delete a row (no. 101) from a data.frame. It contains test data and should not appear in my analyses. My problem is: Whenever I subset from the data.frame, the attributes (esp. comments) are lost.
str(x)
# x has comments for each variable
x <- x[1:100,]
str(x)
# now x has lost all comments
It is well documented that subsetting will drop all attributes - so far, it's perfectly clear. The manual (e.g. http://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.data.frame.html) even suggests a way to preserve the attributes:
## keeping special attributes: use a class with a
## "as.data.frame" and "[" method:
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x,i,...) {
r <- NextMethod("[")
mostattributes(r) <- attributes(x)
r
}
d <- data.frame(i= 0:7, f= gl(2,4),
u= structure(11:18, unit = "kg", class="avector"))
str(d[2:4, -1]) # 'u' keeps its "unit"
I am not yet so far into R to understand what exactly happens here. However, simply running these lines (except the last three) does not change the behavior of my subsetting. Using the command subset() with an appropriate vector (100-times TRUE + 1 FALSE) gives me the same result. And simply storing the attributes to a variable and restoring it after the subset, does not work, either.
# Does not work...
tmp <- attributes(x)
x <- x[1:100,]
attributes(x) <- tmp
Of course, I could write all comments to a vector (var=>comment), subset and write them back using a loop - but that does not seem a well-founded solution. And I am quite sure I will encounter datasets with other relevant attributes in future analyses.
So this is where my efforts in stackoverflow, Google, and brain power got stuck. I would very much appreciate if anyone could help me out with a hint. Thanks!
If I understand you correctly, you have some data in a data.frame, and the columns of the data.frame have comments associated with them. Perhaps something like the following?
set.seed(1)
mydf<-data.frame(aa=rpois(100,4),bb=sample(LETTERS[1:5],
100,replace=TRUE))
comment(mydf$aa)<-"Don't drop me!"
comment(mydf$bb)<-"Me either!"
So this would give you something like
> str(mydf)
'data.frame': 100 obs. of 2 variables:
$ aa: atomic 3 3 4 7 2 7 7 5 5 1 ...
..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2 2 5 4 2 1 3 5 3 ...
..- attr(*, "comment")= chr "Me either!"
And when you subset this, the comments are dropped:
> str(mydf[1:2,]) # comment dropped.
'data.frame': 2 obs. of 2 variables:
$ aa: num 3 3
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
To preserve the comments, define the function [.avector, as you did above (from the documentation) then add the appropriate class attributes to each of the columns in your data.frame (EDIT: to keep the factor levels of bb, add "factor" to the class of bb.):
mydf$aa<-structure(mydf$aa, class="avector")
mydf$bb<-structure(mydf$bb, class=c("avector","factor"))
So that the comments are preserved:
> str(mydf[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Class 'avector' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
EDIT:
If there are many columns in your data.frame that have attributes you want to preserve, you could use lapply (EDITED to include original column class):
mydf2 <- data.frame( lapply( mydf, function(x) {
structure( x, class = c("avector", class(x) ) )
} ) )
However, this drops comments associated with the data.frame itself (such as comment(mydf)<-"I'm a data.frame"), so if you have any, assign them to the new data.frame:
comment(mydf2)<-comment(mydf)
And then you have
> str(mydf2[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Classes 'avector', 'numeric' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
- attr(*, "comment")= chr "I'm a data.frame"
For those looking for the "all-in" solution based on BenBarnes' explanation, here it is.
(Give your upvote to BenBarnes' post if this works for you.)
# Define the avector-subselection method (from the manual)
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x,i,...) {
r <- NextMethod("[")
mostattributes(r) <- attributes(x)
r
}
# Assign each column in the data.frame the (additional) class avector
# Note that this will "lose" the data.frame's attributes, therefore write to a copy
df2 <- data.frame(
lapply(df, function(x) {
structure( x, class = c("avector", class(x) ) )
} )
)
# Finally copy the attribute for the original data.frame if necessary
mostattributes(df2) <- attributes(df)
# Now subselects work without losing attributes :)
df2 <- df2[1:100,]
str(df2)
The good thing: once the class has been attached to all of the data.frame's columns, subselects never drop the attributes again.
Okay - sometimes I am stunned how complicated it is to do the most simple operations in R. But I surely would not have learned about the "classes" feature if I had just marked and deleted the case in SPSS ;)
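For a quick before/after check of the avector approach, something like the following can be run in a fresh session. It is a sketch with a made-up data frame dat; the column is assigned with $ rather than rebuilt through data.frame(), which keeps the example short.

```r
# Subsetting method from the manual: restore attributes after the default "["
`[.avector` <- function(x, i, ...) {
  r <- NextMethod("[")
  mostattributes(r) <- attributes(x)
  r
}

# Made-up data with a comment on one column
dat <- data.frame(aa = c(3, 3, 4, 7), bb = c(1, 2, 3, 4))
comment(dat$aa) <- "Don't drop me!"

# Plain subsetting: the column comment is lost
plain <- dat[1:2, ]

# Add the avector class (keeping the original class), then subset again
dat$aa <- structure(dat$aa, class = c("avector", class(dat$aa)))
kept <- dat[1:2, ]

comment(plain$aa)  # dropped by the default subsetting
comment(kept$aa)   # preserved by the avector method
```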
This is solved by the sticky package. (Full Disclosure: I am the package author.) Apply the sticky() to your vectors and the attributes are preserved through subset operations. For example:
> df <- data.frame(
+ sticky = sticky( structure(1:5, comment="sticky attribute") ),
+ nonstick = structure( letters[1:5], comment="non-sticky attribute" )
+ )
>
> comment(df[1:3, "nonstick"])
NULL
> comment(df[1:3, "sticky"])
[1] "sticky attribute"
This works for any attribute and not only comment.
See the sticky package on GitHub and on CRAN for details.
I spent hours trying to figure out how to retain attribute data (specifically variable labels) when subsetting a dataframe (removing columns). The answer was so simple, I couldn't believe it. Just use the function spss.get from the Hmisc package, and then no matter how you subset, the variable labels are retained.

Resources