What is the problem with initializing a matrix object to NULL and then growing it with cbind() and rbind()?
If the number of rows and columns is not known a priori, isn't it necessary to grow from NULL?
Edit: My question was prompted by the need to understand memory-efficient ways of writing R code. The matrix case is just one example; I'm really looking for suggestions about efficient ways to handle other data objects as well.
Apologies for being too abstract/generic, but I did not really have a specific problem in mind.
It would be helpful if you provided more detail about what you're trying to do.
One "problem" (if there is one?) is that every time you "grow" the matrix, you actually recreate the entire matrix from scratch, which is very memory-inefficient. There is no way to append rows or columns to a matrix in place in R.
An alternative approach would be to store each object in your local environment (with the assign() function) and then assemble your matrix at the end once you know how many objects there are (with get()).
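For illustration, here is a minimal sketch of the usual alternatives: pre-allocate when an upper bound on the size is known, or collect pieces in a list and bind them once at the end (n and p below are hypothetical dimensions standing in for the real ones):

n <- 1000; p <- 10

# Growing from NULL: every rbind() copies the whole matrix built so far.
m_grow <- NULL
for (i in seq_len(n)) {
  m_grow <- rbind(m_grow, rnorm(p))
}

# Pre-allocating: the matrix is created once and filled in place.
m_pre <- matrix(NA_real_, nrow = n, ncol = p)
for (i in seq_len(n)) {
  m_pre[i, ] <- rnorm(p)
}

# If the final size is truly unknown: collect rows in a list, bind once.
rows <- list()
for (i in seq_len(n)) {
  rows[[i]] <- rnorm(p)
}
m_once <- do.call(rbind, rows)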
Related
In R, some functions only work on a data.frame and others only on a tibble or a matrix.
Converting my data using as.data.frame or as.matrix often solves this, but I am wondering how the three are different?
Because they serve different purposes.
Short summary:
A data frame is a list of equal-length vectors. This means that adding a column is as easy as adding a vector to a list. It also means that while each column has a single data type, different columns can have different types. This makes data frames useful for data storage.
A matrix is a special case of an atomic vector that has two dimensions. This means the whole matrix has a single data type, which makes matrices useful for algebraic operations. It can also make numeric operations faster in some cases, since no type checks are needed. However, if you are careful with data frames, the difference will not be big.
A tibble is a modernized version of the data frame used in the tidyverse. Tibbles use several techniques to make them 'smarter' - for example lazy loading.
Long description of matrices, data frames and other data structures as used in R.
So to sum up: matrices and data frames are both 2D data structures. Each serves a different purpose and thus behaves differently. The tibble is an attempt to modernize the data frame and is used throughout the widespread tidyverse.
If I try to rephrase it from a less technical perspective:
Each data structure makes tradeoffs.
A data frame trades a little of its efficiency for convenience and clarity.
A matrix is efficient, but harder to wield, since it enforces restrictions on its data.
A tibble trades even more efficiency for even more convenience, while also trying to mask that tradeoff with techniques that postpone the computation to a point where the cost is less apparent.
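As a small illustration of the type restriction mentioned above (a quick sketch, not tied to any particular data set):

# A matrix forces a single type: mixing numbers and text coerces everything to character.
m <- cbind(id = 1:3, name = c("a", "b", "c"))
typeof(m)           # "character"

# A data frame keeps a separate type per column.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
sapply(df, class)   # id: "integer", name: "character"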
About the difference between data frames and tibbles, the two main differences are explained here: https://www.rstudio.com/blog/tibble-1-0-0/
Besides, my understanding is the following:
- If you subset a tibble, you always get back a tibble.
- Tibbles can have complex entries.
- Tibbles can be grouped.
- Tibbles display better.
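A short sketch of the subsetting and complex-entry points above (assuming the tibble package is installed):

library(tibble)

df <- data.frame(x = 1:3, y = letters[1:3])
tb <- as_tibble(df)

df[, "x"]   # a data frame drops a single column to a plain vector: 1 2 3
tb[, "x"]   # a tibble stays a tibble (3 x 1)

# "Complex entries": list columns are easy to create and print in a tibble.
tibble(x = 1:2, fits = list(lm(mpg ~ wt, mtcars), lm(mpg ~ hp, mtcars)))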
This question already has an answer here:
What you can do with a data.frame that you can't with a data.table?
Apparently in my last question I demonstrated confusion between data.frame and data.table. Admittedly, I didn't realize there was a distinction.
So I read the help for each, but in practical, everyday terms: what is the difference, what are the implications, and what is each used for? That would help guide me to their appropriate usage.
While this is a broad question, if someone is new to R this can be confusing and the distinction can get lost.
All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.
data.frame is part of base R.
data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.
However, that syntactic sugar is different from the standard R syntax for data.frame, while being hard for the untrained eye to distinguish at a glance. Therefore, if you read a code snippet with no other context indicating that you are working with data.tables and try to apply the code to a data.frame, it may fail or produce unexpected results. (A clear giveaway that you are working with data.tables, besides the library/require call, is the presence of the assignment operator :=, which is unique to data.table.)
With all that being said, I think it is hard to truly appreciate the beauty of data.table without experiencing the shortcomings of data.frame (for example, see the first three bullet points of #eddi's answer). In other words, I would very much suggest learning how to work with and manipulate data.frames first, then moving on to data.tables.
A few differences from my day-to-day life that come to mind (in no particular order; a short sketch follows the list):
not having to specify the data.table name over and over in expressions, which otherwise leads to clumsy syntax and silly mistakes (on the flip side, I sometimes miss the TAB completion of names)
much faster and very intuitive grouped operations via by
no more frantically hitting Ctrl-C after typing df and forgetting how large df was, since a data.table prints only its first and last rows (which also means almost never needing head)
faster and better file reading with fread
the package also provides a number of other utility functions, like %between% or rbindlist, that make life better
faster everything else, since a lot of data.frame operations copy the entire thing needlessly
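A brief sketch of the := and by points, using the built-in mtcars data (assuming data.table is installed):

library(data.table)

dt <- as.data.table(mtcars)

# Grouped computation with 'by'; columns are referenced directly inside [ ].
dt[mpg > 20, .(mean_hp = mean(hp), n = .N), by = cyl]

# := adds or modifies a column by reference, without copying the whole table.
dt[, kpl := mpg * 0.425144]

# Printing a large data.table shows only its first and last rows by default.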
They are similar. A data frame is a list of vectors of equal length, while data.table inherits from data.frame. Therefore every data.table is a data.frame, but a data.frame is not necessarily a data.table. The data.table package was written to speed up indexing, ordered joins, assignment, grouping, listing columns, and so on.
See http://datatable.r-forge.r-project.org/datatable-intro.pdf for more information.
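The inheritance is easy to see in an interactive session (a quick check, not specific to any data set):

library(data.table)

dt <- data.table(a = 1:3)
class(dt)                           # "data.table" "data.frame"
is.data.frame(dt)                   # TRUE: every data.table is a data.frame
is.data.table(data.frame(a = 1:3))  # FALSE: the reverse does not hold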
Sorry, maybe I am blind, but I couldn't find anything specific for a rather common problem:
I want to compute solve(A, b), where A is a square matrix so large that the call uses all my memory and throws an error (b is a vector of corresponding length). The matrix I have is not sparse; there are no large blocks of zeros or the like.
There must be some function out there that implements a stepwise iterative scheme, so that a solution can be found even with limited memory available.
I found several posts on sparse matrices and, of course, the Matrix package, but could not identify a function that does what I need. I have also seen this post, but biglm produces a complete linear-model fit. All I need is a simple solve. I will have to repeat that step several times, so it would be great to keep it as slim as possible.
I already worry about the "duplication of an old issue" and "look here" comments, but I would be really grateful for some help.
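For what it's worth, a minimal sketch of one such stepwise scheme, the conjugate gradient method; it assumes A is symmetric positive definite and only ever needs matrix-vector products (so A could even be supplied implicitly). This is an illustration of the idea, not a drop-in replacement for solve():

cg_solve <- function(A, b, tol = 1e-8, maxit = 1000) {
  # Conjugate gradient: assumes A is symmetric positive definite.
  x <- numeric(length(b))
  r <- b - as.vector(A %*% x)   # residual
  p <- r                        # search direction
  rs_old <- sum(r * r)
  for (i in seq_len(maxit)) {
    Ap    <- as.vector(A %*% p)
    alpha <- rs_old / sum(p * Ap)
    x     <- x + alpha * p
    r     <- r - alpha * Ap
    rs_new <- sum(r * r)
    if (sqrt(rs_new) < tol) break
    p <- r + (rs_new / rs_old) * p
    rs_old <- rs_new
  }
  x
}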
I have an MCA object generated by the MCA function in the missMDA package, which returns several types of results from Multiple Correspondence Analysis. Of these, I want to use the dist function, if appropriate, to calculate all pairwise 2D distances among the coordinates. Before I can do that, it seems that I need to figure out how to specifically reference the vectors of X and Y coordinates from this object, but when I ask for mydata$var$coord I get an unruly list of values, and I'm not sure how to convert the results to a format that the dist function can use.
I am also interested in learning how to understand the structure of different kinds of objects in general, so that I will have a clearer roadmap for referencing their components in the future (and won't have to come groveling back to all of you for help with that!).
My apologies if I haven't stated my question clearly enough. Thanks in advance!
Figured out what to do, which is (apparently) R 101:
names(mydata)
This gave me the appropriate information about the components of this object (a list of class MCA). From this, I was able to reference the components of interest:
mydata$ind$coord
mydata$var$coord
I guess it pays to be patient!
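For reference, str() is the general-purpose tool for inspecting an object's structure, and the coordinates can be passed to dist() directly (mydata here is the MCA result from the question):

str(mydata, max.level = 2)         # overview of the object's components

coords <- mydata$var$coord[, 1:2]  # first two dimensions of the coordinates
d <- dist(coords)                  # all pairwise Euclidean distances
as.matrix(d)                       # square matrix form, if that is easier to index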
I have been working with large datasets lately (more than 400 thousand lines). So far, I have been using the xts format, which worked fine for "small" datasets of a few tens of thousands of elements.
Now that the project has grown, R simply crashes when retrieving the data from the database and putting it into the xts object.
It is my understanding that R should be able to hold vectors with up to 2^32-1 elements (or 2^64-1, depending on the version). Hence, I came to the conclusion that xts might have some limitations, but I could not find the answer in the docs (maybe I was a bit overconfident about my understanding of the theoretically possible vector size).
To sum up, I would like to know:
whether xts indeed has a size limitation;
what you think is the smartest way to handle large time series (I was thinking about splitting the analysis into several smaller datasets).
I don't get an error message; R simply shuts down automatically. Is this a known behavior?
SOLUTION
The limit is the same as for R itself, and it depends on the architecture in use (64-bit or 32-bit). It is in any case extremely large.
Chunking the data is indeed a good idea, but it was not needed here.
The problem came from a bug in R 2.11.0 that was fixed in R 2.11.1: there was an issue with long date vectors (here, the indexes of the xts object).
Regarding your two questions, my $0.02:
Yes, there is a limit of 2^32-1 elements for R vectors. This comes from the indexing logic, and that reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction.
Yes, splitting things into manageable chunks is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or keep my data sets sane).
There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and confirm that you really need all 400k rows at once.
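If chunking does turn out to be necessary, xts makes it fairly painless; a small sketch with a toy series standing in for the real data:

library(xts)

idx <- seq(as.POSIXct("2023-01-01"), by = "hour", length.out = 5000)
x   <- xts(rnorm(5000), order.by = idx)

# Process the series one month at a time, then combine the small per-chunk results.
chunks        <- split(x, f = "months")   # list of monthly xts pieces
monthly_means <- sapply(chunks, function(ch) mean(coredata(ch)))
monthly_means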