What determines the size of a saved object in R? - r

When I save an object from R using save(), what determines the size of the saved file? Clearly it is not the same (or close to) the size of the object determined by object.size().
Example:
I read a data frame and saved it using
snpmat=read.table("Heart.txt.gz",header=T)
save(snpmat,file="datamat.RData")
The size of the file datamat.RData is 360MB.
> object.size(snpmat)
4998850664 bytes #Much larger
Then I performed some regression analysis and obtained another data frame adj.snpmat of same dimensions (6820000 rows and 80 columns).
> object.size(adj.snpmat)
4971567760 bytes
I save it using
> save(adj.snpmat,file="adj.datamat.RData")
Now the size of the file adj.datamat.RData is 3.3GB. I'm confused why the two files are so different in size while the object.size() gives similar sizes. Any idea about what determines the size of the saved object is welcome.
Some more information:
> typeof(snpmat)
[1] "list"
> class(snpmat)
[1] "data.frame"
> typeof(snpmat[,1])
[1] "integer"
> typeof(snpmat[,2])
[1] "double" #This is true for all columns except column 1
> typeof(adj.snpmat)
[1] "list"
> class(adj.snpmat)
[1] "data.frame"
> typeof(adj.snpmat[,1])
[1] "character"
> typeof(adj.snpmat[,2])
[1] "double" #This is true for all columns except column 1

Your matrices are very different and therefore compress very differently.
SNP data contains only a few values (e.g., 1 or 0) and is also very sparse. This means that is very easy to compress. For example, if you had a matrix of all zeros, you could think of compressing the data by specifying a single value (0) as well as the dimensions.
Your regression matrix contains many different types of values, and are also real numbers (I'm assuming p-values, coefficients, etc.) This makes it much less compressible.

Related

What is the difference between "Large Matrix" and regular numeric matrix?

When a relatively large matrix is created, Rstudio marks it as a Large Matrix in its environment window:
x <- matrix(rnorm(10000 * 5000), ncol=5000)
# Large matrix (50000000 elements, 381.5 Mb)
The mode() function as expected returns "numeric" for this object:
mode(x)
## [1] "numeric"
If however I run the following command:
mode(x) <- "numeric"
Rstudio changes "Large Matrix" into a regular numeric matrix:
# x: num [1:10000, 1:5000]
So what is the difference between these 2 objects? Does this difference exist in Rstudio only or these two objects are different in R as well?
In my understanding, "Large Matrix" and matrix is the same matrix object. What matters is how these objects are displayed in the global environment in RStudio.
RStudio also distinguishes between vectors and large vectors. Consider the following vector:
n <- 256
v1 <- rnorm(n*n-5)
This vector is listed as a large vector. Now, let's decrease its size by one:
v2 <- rnorm(n*n-6)
Suddenly, it becomes a normal vector. The structure of both objects is the same (which can be verified by running str). So is their class and storage mode. What is different then? Notice that the size of v2 in memory is exactly 512 Kb.
lobstr::obj_size(v2)
>524,288 B # or exactly 512 kB
The size of v1 is slightly greater:
lobstr::obj_size(v1)
>524,296 B # or 512.0078125 KB
As far as I understand (correct me if I am wrong), for mere convenience RStudio displays objects that are greater than 512 kB differently.

R understanding behaviour of as.matrix and matrix while plotting an image

I am reading data from MNIST handwritten digit dataset. Its first column is the digit label and rest 784 columns (28 X 28) are pixel values. Pixel values are between [0, 255]. Thus, there are a total of 1+784=785 columns. To plot a row as an image, I have to convert the 784 pixel values into a matrix of 28X28. In this conversion process, I am not clear about the behaviour of as.matrix() and matrix() functions. The code and comments are as follows:
> train <- read.csv("train.csv", header=TRUE)
# Read 2nd row. Ignore the label col and read rest of 784 columns
> data<-train[2,2:785]
# Convert data into matrix of 28X28
> data<-as.matrix(data,nrow=28,ncol=28)
> dim(data)
[1] 1 784 <= Failed to convert to 28X28
> class(data)
[1] "matrix" <= But class is matrix
# as.matrix() failed. Use matrix()
> data<-matrix(data,nrow=28,ncol=28)
> dim(data)
[1] 28 28 <= Matrix conversion success
Probably, the solution is to ignore as.matrix() altogether and just use matrix(). But, it so happens that in plotting the image as.matrix() does serve as a necessary intermediary. For example, the following code works to plot the image of digit zero:
train <- read.csv("train.csv", header=TRUE)
data<-train[2,2:785]
data<-as.matrix(data,nrow=28,ncol=28)
data<-matrix(data,nrow=28,ncol=28)
##Color ramp def.
colors<-c('white','black')
cus_col<-colorRampPalette(colors=colors)
image(1:28,1:28,data,main="IInd row",col=cus_col(256))
But, in the following code where I have removed as.matrix(), the code gives an error:
> train <- read.csv("train.csv", header=TRUE)
> data<-train[2,2:785]
> data<-matrix(data,nrow=28,ncol=28)
> ##Color ramp def.
> colors<-c('white','black')
> cus_col<-colorRampPalette(colors=colors)
> image(1:28,1:28,data,main="IInd row",col=cus_col(256))
Error in is.finite(z) : default method not implemented for type 'list'
I am unable to understand the role of as.matrix() and the meaning of this error. Kindly do let me know what is wrong.
EDITED
Here is what the first three rows of data look like:
label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,pixel11,pixel12,pixel13,pixel14,pixel15,pixel16,pixel17,pixel18,pixel19,pixel20,pixel21,pixel22,pixel23,pixel24,pixel25,pixel26,pixel27,pixel28,pixel29,pixel30,pixel31,pixel32,pixel33,pixel34,pixel35,pixel36,pixel37,pixel38,pixel39,pixel40,pixel41,pixel42,pixel43,pixel44,pixel45,pixel46,pixel47,pixel48,pixel49,pixel50,pixel51,pixel52,pixel53,pixel54,pixel55,pixel56,pixel57,pixel58,pixel59,pixel60,pixel61,pixel62,pixel63,pixel64,pixel65,pixel66,pixel67,pixel68,pixel69,pixel70,pixel71,pixel72,pixel73,pixel74,pixel75,pixel76,pixel77,pixel78,pixel79,pixel80,pixel81,pixel82,pixel83,pixel84,pixel85,pixel86,pixel87,pixel88,pixel89,pixel90,pixel91,pixel92,pixel93,pixel94,pixel95,pixel96,pixel97,pixel98,pixel99,pixel100,pixel101,pixel102,pixel103,pixel104,pixel105,pixel106,pixel107,pixel108,pixel109,pixel110,pixel111,pixel112,pixel113,pixel114,pixel115,pixel116,pixel117,pixel118,pixel119,pixel120,pixel121,pixel122,pixel123,pixel124,pixel125,pixel126,pixel127,pixel128,pixel129,pixel130,pixel131,pixel132,pixel133,pixel134,pixel135,pixel136,pixel137,pixel138,pixel139,pixel140,pixel141,pixel142,pixel143,pixel144,pixel145,pixel146,pixel147,pixel148,pixel149,pixel150,pixel151,pixel152,pixel153,pixel154,pixel155,pixel156,pixel157,pixel158,pixel159,pixel160,pixel161,pixel162,pixel163,pixel164,pixel165,pixel166,pixel167,pixel168,pixel169,pixel170,pixel171,pixel172,pixel173,pixel174,pixel175,pixel176,pixel177,pixel178,pixel179,pixel180,pixel181,pixel182,pixel183,pixel184,pixel185,pixel186,pixel187,pixel188,pixel189,pixel190,pixel191,pixel192,pixel193,pixel194,pixel195,pixel196,pixel197,pixel198,pixel199,pixel200,pixel201,pixel202,pixel203,pixel204,pixel205,pixel206,pixel207,pixel208,pixel209,pixel210,pixel211,pixel212,pixel213,pixel214,pixel215,pixel216,pixel217,pixel218,pixel219,pixel220,pixel221,pixel222,pixel223,pixel224,pixel225,pixel226,pixel227,pixel228,pixel229,pixel230,pixel231,pixel232,pixel233,pixel234,pixel235,pixel236,pixel237,pixel238,pixel239,pixel240,pixel241,pixel242,pixel243,pixel244,pixel245,pixel246,pixel247,pixel248,pixel249,pixel250,pixel251,pixel252,pixel253,pixel254,pixel255,pixel256,pixel257,pixel258,pixel259,pixel260,pixel261,pixel262,pixel263,pixel264,pixel265,pixel266,pixel267,pixel268,pixel269,pixel270,pixel271,pixel272,pixel273,pixel274,pixel275,pixel276,pixel277,pixel278,pixel279,pixel280,pixel281,pixel282,pixel283,pixel284,pixel285,pixel286,pixel287,pixel288,pixel289,pixel290,pixel291,pixel292,pixel293,pixel294,pixel295,pixel296,pixel297,pixel298,pixel299,pixel300,pixel301,pixel302,pixel303,pixel304,pixel305,pixel306,pixel307,pixel308,pixel309,pixel310,pixel311,pixel312,pixel313,pixel314,pixel315,pixel316,pixel317,pixel318,pixel319,pixel320,pixel321,pixel322,pixel323,pixel324,pixel325,pixel326,pixel327,pixel328,pixel329,pixel330,pixel331,pixel332,pixel333,pixel334,pixel335,pixel336,pixel337,pixel338,pixel339,pixel340,pixel341,pixel342,pixel343,pixel344,pixel345,pixel346,pixel347,pixel348,pixel349,pixel350,pixel351,pixel352,pixel353,pixel354,pixel355,pixel356,pixel357,pixel358,pixel359,pixel360,pixel361,pixel362,pixel363,pixel364,pixel365,pixel366,pixel367,pixel368,pixel369,pixel370,pixel371,pixel372,pixel373,pixel374,pixel375,pixel376,pixel377,pixel378,pixel379,pixel380,pixel381,pixel382,pixel383,pixel384,pixel385,pixel386,pixel387,pixel388,pixel389,pixel390,pixel391,pixel392,pixel393,pixel394,pixel395,pixel396,pixel397,pixel398,pixel399,pixel400,pixel401,pixel402,pixel403,pixel404,pixel405,pixel406,pixel407,pixel408,pixel409,pixel410,pixel411,pixel412,pixel413,pixel414,pixel415,pixel416,pixel417,pixel418,pixel419,pixel420,pixel421,pixel422,pixel423,pixel424,pixel425,pixel426,pixel427,pixel428,pixel429,pixel430,pixel431,pixel432,pixel433,pixel434,pixel435,pixel436,pixel437,pixel438,pixel439,pixel440,pixel441,pixel442,pixel443,pixel444,pixel445,pixel446,pixel447,pixel448,pixel449,pixel450,pixel451,pixel452,pixel453,pixel454,pixel455,pixel456,pixel457,pixel458,pixel459,pixel460,pixel461,pixel462,pixel463,pixel464,pixel465,pixel466,pixel467,pixel468,pixel469,pixel470,pixel471,pixel472,pixel473,pixel474,pixel475,pixel476,pixel477,pixel478,pixel479,pixel480,pixel481,pixel482,pixel483,pixel484,pixel485,pixel486,pixel487,pixel488,pixel489,pixel490,pixel491,pixel492,pixel493,pixel494,pixel495,pixel496,pixel497,pixel498,pixel499,pixel500,pixel501,pixel502,pixel503,pixel504,pixel505,pixel506,pixel507,pixel508,pixel509,pixel510,pixel511,pixel512,pixel513,pixel514,pixel515,pixel516,pixel517,pixel518,pixel519,pixel520,pixel521,pixel522,pixel523,pixel524,pixel525,pixel526,pixel527,pixel528,pixel529,pixel530,pixel531,pixel532,pixel533,pixel534,pixel535,pixel536,pixel537,pixel538,pixel539,pixel540,pixel541,pixel542,pixel543,pixel544,pixel545,pixel546,pixel547,pixel548,pixel549,pixel550,pixel551,pixel552,pixel553,pixel554,pixel555,pixel556,pixel557,pixel558,pixel559,pixel560,pixel561,pixel562,pixel563,pixel564,pixel565,pixel566,pixel567,pixel568,pixel569,pixel570,pixel571,pixel572,pixel573,pixel574,pixel575,pixel576,pixel577,pixel578,pixel579,pixel580,pixel581,pixel582,pixel583,pixel584,pixel585,pixel586,pixel587,pixel588,pixel589,pixel590,pixel591,pixel592,pixel593,pixel594,pixel595,pixel596,pixel597,pixel598,pixel599,pixel600,pixel601,pixel602,pixel603,pixel604,pixel605,pixel606,pixel607,pixel608,pixel609,pixel610,pixel611,pixel612,pixel613,pixel614,pixel615,pixel616,pixel617,pixel618,pixel619,pixel620,pixel621,pixel622,pixel623,pixel624,pixel625,pixel626,pixel627,pixel628,pixel629,pixel630,pixel631,pixel632,pixel633,pixel634,pixel635,pixel636,pixel637,pixel638,pixel639,pixel640,pixel641,pixel642,pixel643,pixel644,pixel645,pixel646,pixel647,pixel648,pixel649,pixel650,pixel651,pixel652,pixel653,pixel654,pixel655,pixel656,pixel657,pixel658,pixel659,pixel660,pixel661,pixel662,pixel663,pixel664,pixel665,pixel666,pixel667,pixel668,pixel669,pixel670,pixel671,pixel672,pixel673,pixel674,pixel675,pixel676,pixel677,pixel678,pixel679,pixel680,pixel681,pixel682,pixel683,pixel684,pixel685,pixel686,pixel687,pixel688,pixel689,pixel690,pixel691,pixel692,pixel693,pixel694,pixel695,pixel696,pixel697,pixel698,pixel699,pixel700,pixel701,pixel702,pixel703,pixel704,pixel705,pixel706,pixel707,pixel708,pixel709,pixel710,pixel711,pixel712,pixel713,pixel714,pixel715,pixel716,pixel717,pixel718,pixel719,pixel720,pixel721,pixel722,pixel723,pixel724,pixel725,pixel726,pixel727,pixel728,pixel729,pixel730,pixel731,pixel732,pixel733,pixel734,pixel735,pixel736,pixel737,pixel738,pixel739,pixel740,pixel741,pixel742,pixel743,pixel744,pixel745,pixel746,pixel747,pixel748,pixel749,pixel750,pixel751,pixel752,pixel753,pixel754,pixel755,pixel756,pixel757,pixel758,pixel759,pixel760,pixel761,pixel762,pixel763,pixel764,pixel765,pixel766,pixel767,pixel768,pixel769,pixel770,pixel771,pixel772,pixel773,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,188,255,94,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,191,250,253,93,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,123,248,253,167,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,247,253,208,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,29,207,253,235,77,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,54,209,253,253,88,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,93,254,253,238,170,17,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,210,254,253,159,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,209,253,254,240,81,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,27,253,253,254,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,206,254,254,198,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,168,253,253,196,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,203,253,248,76,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22,188,253,245,93,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,103,253,253,191,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,240,253,195,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,15,220,253,253,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,94,253,253,253,94,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,89,251,253,250,131,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,214,218,95,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,18,30,137,137,192,86,72,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,86,250,254,254,254,254,217,246,151,32,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,16,179,254,254,254,254,254,254,254,254,254,231,54,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,72,254,254,254,254,254,254,254,254,254,254,254,254,104,0,0,0,0,0,0,0,0,0,0,0,0,0,61,191,254,254,254,254,254,109,83,199,254,254,254,254,243,85,0,0,0,0,0,0,0,0,0,0,0,0,172,254,254,254,202,147,147,45,0,11,29,200,254,254,254,171,0,0,0,0,0,0,0,0,0,0,0,1,174,254,254,89,67,0,0,0,0,0,0,128,252,254,254,212,76,0,0,0,0,0,0,0,0,0,0,47,254,254,254,29,0,0,0,0,0,0,0,0,83,254,254,254,153,0,0,0,0,0,0,0,0,0,0,80,254,254,240,24,0,0,0,0,0,0,0,0,25,240,254,254,153,0,0,0,0,0,0,0,0,0,0,64,254,254,186,7,0,0,0,0,0,0,0,0,0,166,254,254,224,12,0,0,0,0,0,0,0,0,14,232,254,254,254,29,0,0,0,0,0,0,0,0,0,75,254,254,254,17,0,0,0,0,0,0,0,0,18,254,254,254,254,29,0,0,0,0,0,0,0,0,0,48,254,254,254,17,0,0,0,0,0,0,0,0,2,163,254,254,254,29,0,0,0,0,0,0,0,0,0,48,254,254,254,17,0,0,0,0,0,0,0,0,0,94,254,254,254,200,12,0,0,0,0,0,0,0,16,209,254,254,150,1,0,0,0,0,0,0,0,0,0,15,206,254,254,254,202,66,0,0,0,0,0,21,161,254,254,245,31,0,0,0,0,0,0,0,0,0,0,0,60,212,254,254,254,194,48,48,34,41,48,209,254,254,254,171,0,0,0,0,0,0,0,0,0,0,0,0,0,86,243,254,254,254,254,254,233,243,254,254,254,254,254,86,0,0,0,0,0,0,0,0,0,0,0,0,0,0,114,254,254,254,254,254,254,254,254,254,254,239,86,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,182,254,254,254,254,254,254,254,254,243,70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,76,146,254,255,254,255,146,19,15,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Difference is explained in the documentation.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/matrix.html
Also all rows & columns of a matrix must have the same class (numeric, character, etc). In a dataframe, you can have some of each. Sometimes R is not able to convert different classes when using as.matrix()

Proper way to subset big.matrix

I would like to know if there is a 'proper' way to subset big.matrix objects in R. It is simple to subset a matrix but the class always reverts to 'matrix'. This isn't a problem when working with small datasets like this but with massive datasets but with extremely large datasets the subset could still benefit from the 'big.matrix' class.
require(bigmemory)
data(iris)
# I realize the warning about factors but not important for this example
big <- as.big.matrix(iris)
class(big)
[1] "big.matrix"
attr(,"package")
[1] "bigmemory"
class(big[,c("Sepal.Length", "Sepal.Width")])
[1] "matrix"
class(big[,1:2])
[1] "matrix"
I have since learned that the 'proper' way to subset a big.matrix is to use sub.big.matrix although this is only for contiguous columns and/or rows. Non-contiguous subsetting is not currently implemented.
sm <- sub.big.matrix(big, firstCol=1, lastCol=2)
It doesn't seem to be possible without calling as.big.matrix on the subset.
From the big.matrix documentation,
If x is a big.matrix, then x[1:5,] is returned as an R matrix containing the first five rows of x.
I presume this also applies to columns as well. So it seems you would need to call
a <- as.big.matrix(big[,1:2])
in order for the subset to also be a big.matrix object.
class(a)
# [1] "big.matrix"
# attr(,"package")
# [1] "bigmemory"

R - Big Data - vector exceeds vector length limit

I have the following R code:
data <- read.csv('testfile.data', header = T)
mat = as.matrix(data)
Some more statistics of my testfile.data:
> ncol(data)
[1] 75713
> nrow(data)
[1] 44771
Since this is a large dataset, so I am using Amazon EC2 with 64GB Ram space. So hopefully memory isn't an issue. I am able to load the data (1st line works).
But as.matrix transformation (2nd line errors) throws the following error:
resulting vector exceeds vector length limit in 'AnswerType'
Any clue what might be the issue?
As noted, the development version of R supports vectors larger than 2^31-1. This is more-or-less transparent, for instance
> m = matrix(0L, .Machine$integer.max / 4, 5)
> length(m)
[1] 2684354555
This is with
> R.version.string
[1] "R Under development (unstable) (2012-08-07 r60193)"
Large objects consume a lot of memory (62.5% of my 16G, for my example) and to do anything useful requires several times that memory. Further, even simple operations on large data can take appreciable time. And many operations on long vectors are not yet supported
> sum(m)
Error: long vectors not supported yet:
/home/mtmorgan/src/R-devel/src/include/Rinlinedfuns.h:100
So it often makes sense to process data in smaller chunks by iterating through a larger file. This gives full access to R's routines, and allows parallel evaluation (via the parallel package). Another strategy is to down-sample the data, which should not be too intimidating to a statistical audience.
Your matrix has more elements than the maximum vector length of 2^31-1. This is a problem because a matrix is just a vector with a dim attribute. read.csv works because it returns a data.frame, which is a list of vectors.
R> 75713*44771 > 2^31-1
[1] TRUE
See ?"Memory-limits" for more details.

tm package error "Cannot convert DocumentTermMatrix into normal matrix since vector is too large"

I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.
> corp
A corpus with 1859 text documents
> mat<-DocumentTermMatrix(corp)
> dim(mat)
[1] 1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2<-as.matrix(mat)
Fehler: kann Vektor der Größe 364.8 MB nicht allozieren # cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes
For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this?
Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
The quick and dirty way is to export your data into a sparse matrix object from an external package like Matrix.
> attributes(dtm)
$names
[1] "i" "j" "v" "nrow" "ncol" "dimnames"
$class
[1] "DocumentTermMatrix" "simple_triplet_matrix"
$Weighting
[1] "term frequency" "tf"
The dtm object has the i, j and v attributes which is the internal representation of your DocumentTermMatrix. Use:
library("Matrix")
mat <- sparseMatrix(
i=dtm$i,
j=dtm$j,
x=dtm$v,
dims=c(dtm$nrow, dtm$ncol)
)
and you're done.
A naive comparison between your objects:
> mat[1,1:100]
> head(as.vector(dtm[1,]), 100)
will each give you the exact same output.
DocumentTermMatrix uses sparse matrix representation, so it doesn't take up all that memory storing all those zeros. Depending what it is you want to do you might have some luck with the SparseM package which provides some linear algebra routines using sparse matrices..
Are you able to increase the amount of RAM available to R? See this post: Increasing (or decreasing) the memory available to R processes
Also, sometimes when working with big objects in R, I occassionally call gc() to free up wasted memory.
The number of documents should not be a problem but you may want to try removing sparse terms, this could very well reduce the dimension of document term matrix.
inspect(removeSparseTerms(dtm, 0.7))
It removes terms that has at least a sparsity of 0.7.
Another option available to you is that you specify minimum word length and minimum document frequency when you create document term matrix
a.dtm <- DocumentTermMatrix(a.corpus, control = list(weighting = weightTfIdf, minWordLength = 2, minDocFreq = 5))
use inspect(dtm) before and after your changes, you will see huge difference, more importantly you won't ruin significant relations hidden in your docs and terms.
Since you only have 1859 documents, the distance matrix you need to compute is fairly small. Using the slam package (and in particular, its crossapply_simple_triplet_matrix function), you might be able to compute the distance matrix directly, instead of converting the DTM into a dense matrix first. This means that you will have to compute the Jaccard similarity yourself. I have successfully tried something similar for the cosine distance matrix on a large number of documents.

Resources