What are the differences between data.frame, tibble and matrix?

In R, some functions only work on a data.frame and others only on a tibble or a matrix.
Converting my data using as.data.frame or as.matrix often solves this, but I am wondering how the three are different?

Because they serve different purposes.
Short summary:
A data frame is a list of equal-length vectors. This means that adding a column is as easy as adding a vector to a list. It also means that while each column has a single data type, different columns can be of different types. This makes data frames useful for data storage.
A matrix is a special case of an atomic vector that has two dimensions. This means the whole matrix has to have a single data type, which makes matrices useful for algebraic operations. It can also make numeric operations faster in some cases, since no per-column type checks are needed. However, if you are careful with your data frames, the difference will not be large.
A tibble is a modernized version of the data frame used in the tidyverse. Tibbles use several techniques to make them 'smarter' - for example, they are deliberately lazier (they do less, such as never changing column names or types) and stricter (they complain more, for example when a column you ask for does not exist).
A longer description of matrices, data frames and the other data structures used in R can be found in the 'Data structures' chapter of Hadley Wickham's Advanced R.
So to sum up: matrix and data frame are both 2-d data structures. Each serves a different purpose and thus behaves differently. Tibble is an attempt to modernize the data frame for the widely used tidyverse.
To rephrase it from a less technical perspective:
Each data structure makes tradeoffs.
A data frame trades a little efficiency for convenience and clarity.
A matrix is efficient, but harder to wield, since it enforces restrictions on its data.
A tibble trades even more efficiency for even more convenience, while trying to mask that tradeoff with techniques that postpone the computation to a point where it no longer looks like the tibble's fault.
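To make the first two tradeoffs concrete, here is a minimal base-R sketch (assuming R >= 4.0, so character columns are not silently converted to factors):

# Columns of a data frame keep their own types; a matrix is forced to one type.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(0.5, 0.7, 0.9))
sapply(df, class)            # integer, character, numeric - one type per column

df$flag <- c(TRUE, FALSE, TRUE)   # adding a column = adding another vector to the list

m <- as.matrix(df)           # everything is coerced to the common type...
class(m[1, 1])               # ..."character" here, so algebra is no longer possible

m2 <- matrix(1:6, nrow = 2)  # an all-numeric matrix supports algebra directly
m2 %*% t(m2)                 # 2 x 2 cross-product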

Regarding the difference between data frames and tibbles, the two main differences are explained here: https://www.rstudio.com/blog/tibble-1-0-0/
Besides that, my understanding is the following:
- If you subset a tibble, you always get back a tibble.
- Tibbles can have complex entries (e.g. list-columns).
- Tibbles can be grouped.
- Tibbles display better.
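A short sketch of those four points (this assumes the tibble and dplyr packages are installed; the column names are made up):

library(tibble)
library(dplyr)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- tibble(x = 1:3, y = c("a", "b", "c"))

class(df[, 1])     # "integer" - single-bracket subsetting drops to a vector
class(tbl[, 1])    # "tbl_df" "tbl" "data.frame" - a tibble stays a tibble

tbl$fit <- list(lm(x ~ 1, data = tbl), NULL, 1:5)   # complex (list-)entries are fine

grouped <- group_by(tbl, y)   # grouping information travels with the tibble
class(grouped)                # includes "grouped_df"

tbl   # prints only the rows/columns that fit on screen, with column types shown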

Related

R: convert data frame columns to least memory demanding data type without loss of information

My data is massive and I was wondering if there is a way I could tell R to convert each column to data types which are less memory demanding without any loss of information.
In Stata, there is a function called compress that does that. I was wondering if there is something similar in R.
I would also be grateful for any other simple advice on how to handle large datasets in R (in addition to using data.table instead of dplyr).
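There is no built-in compress() in base R, but a minimal sketch of the idea might look like the helper below (compress_columns() is a made-up name for illustration; columns containing NA or Inf are simply left untouched here):

compress_columns <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    # whole-number doubles fit into integers (4 bytes instead of 8 per value)
    if (is.double(col) && all(is.finite(col)) && all(col == trunc(col)) &&
        all(abs(col) < .Machine$integer.max)) {
      df[[nm]] <- as.integer(col)
    } else if (is.character(col) && length(unique(col)) < length(col) / 2) {
      # heavily repeated strings usually take less memory as factors
      df[[nm]] <- factor(col)
    }
  }
  df
}
# compare object.size(df) before and after to see what was gained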

I want to process tens of thousands of columns using Spark via sparklyr, but I can't

I tried using sdf_pivot() to widen a column with duplicate values into many (a very large number of) columns. I planned to use these columns as the feature space for training an ML model.
Example: I have a sequence of language elements in one column (words), which I wish to turn into a binary matrix of huge width (say, 100,000 columns) and run a sentiment analysis using logistic regression.
The first problem is that by default sparklyr does not allow me to create more than 10K columns, warning about a possible error in my design.
The second problem is that even if I override this warning and create that many columns, further calculations take forever on such wide data.
Question 1: is it good practice to build extra-wide datasets, or should I work differently with such a large feature space while still using the power of Spark's fast parallel computation?
Question 2: is it possible to construct a vector-type feature column and avoid generating a very wide matrix?
I just need a small example or practical tips to follow.
https://github.com/rstudio/sparklyr/issues/1322
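On question 2: Spark's ML feature transformers can put the whole feature space into a single vector-typed column, so the extremely wide pivot may not be needed at all. A rough, untested sketch (the connection sc, the local table my_text_data, and the columns words and sentiment are assumptions; words is assumed to hold raw text and sentiment a numeric 0/1 label):

library(sparklyr)

sc  <- spark_connect(master = "local")
sdf <- copy_to(sc, my_text_data)        # my_text_data: a local data frame (assumed)

model <- sdf %>%
  ft_tokenizer(input_col = "words", output_col = "tokens") %>%
  ft_hashing_tf(input_col = "tokens", output_col = "features",
                num_features = 2^17) %>%   # ~131k hashed features in ONE vector column
  ml_logistic_regression(label_col = "sentiment", features_col = "features")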

Creating an n-dimensional data frame in R

I have need for a data structure of at least three dimensions where the class of one of the dimensions can change. In 2 dimensions this would be a data frame. In 3 dimensions I can create an object which is a list of data frames, but will then have to implement enough of the generic functions to make the data structure usable. I will have some functions which are unaware of the 3rd dimension, aggregating the 3rd dimension down so only 2 remain. In other cases, I will have functions specifically designed to analyze the additional dimensions.
All-in-all, this seems like a hell of a lot of work. As it seems like a rather generic problem, are there any R packages or data structures which have already solved it? If my data was all of a single class, I could just use an array, but unfortunately, that's not the case.
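For what it's worth, a bare-bones sketch of the "list of data frames" idea in base R (the slice names and the aggregation rule are invented for illustration):

cube <- list(
  t1 = data.frame(id = 1:2, value = c(1.5, 2.5), label = c("a", "b")),
  t2 = data.frame(id = 1:2, value = c(3.0, 4.0), label = c("c", "d"))
)

# collapse the 3rd dimension by row-binding the slices, keeping the slice id...
flatten_cube <- function(cube) {
  do.call(rbind, Map(function(nm, df) cbind(slice = nm, df), names(cube), cube))
}

# ...so that functions unaware of the 3rd dimension can work on an ordinary 2-d frame
flat <- flatten_cube(cube)
aggregate(value ~ id, data = flat, FUN = mean)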

What is the practical difference between data.frame and data.table in R [duplicate]

This question already has an answer here:
What you can do with a data.frame that you can't with a data.table?
(1 answer)
Closed 9 years ago.
Apparently in my last question I demonstrated confusion between data.frame and data.table. Admittedly, I didn't realize there was a distinction.
So I read the help for each, but in practical, everyday terms: what is the difference, what are the implications, and what is each used for? That would help guide me to their appropriate usage.
While this is a broad question, if someone is new to R this can be confusing and the distinction can get lost.
All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.
data.frame is part of base R.
data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.
However, that syntactic sugar is different from the standard R syntax for data.frame, while being hard for the untrained eye to distinguish at a glance. Therefore, if you read a code snippet with no other context indicating you are working with data.tables and try to apply the code to a data.frame, it may fail or produce unexpected results. (A clear giveaway that you are working with data.tables, besides a library/require call, is the presence of the assignment operator :=, which is unique to data.table.)
With all that being said, I think it is hard to actually appreciate the beauty of data.table without experiencing the shortcomings of data.frame. (for example, see the first 3 bullet points of #eddi's answer). In other words, I would very much suggest learning how to work with and manipulate data.frames first then move on to data.tables.
A few differences in my day to day life that come to mind (in no particular order):
not having to specify the data.table name over and over (leading to clumsy syntax and silly mistakes) in expressions (on the flip side I sometimes miss the TAB-completion of names)
much faster and very intuitive by (i.e. grouping) operations - see the sketch after this list
no more frantically hitting Ctrl-C after typing df, forgetting how large df was (also leading to almost never using head)
faster and better file reading with fread
the package also provides a number of other utility functions, like %between% or rbindlist that make life better
faster everything else, since a lot of data.frame operations copy the entire thing needlessly
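A small sketch of those syntax points (column names invented for illustration):

library(data.table)

dt <- data.table(id = c(1, 1, 2, 2), value = c(10, 20, 30, 40))

dt[value > 15]                        # columns are visible inside [ ], no dt$ needed
dt[, double_value := value * 2]       # := adds/modifies a column by reference
dt[, .(total = sum(value)), by = id]  # grouped ("by") operations built into [ ]

# fread("big_file.csv") is the fast file reader mentioned above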
They are similar. A data frame is a list of vectors of equal length, while a data.table inherits from data.frame. Therefore every data table is also a data frame, but not every data frame is a data table. The data.table package was written to speed up indexing, ordered joins, assignment, grouping, listing columns, and so on.
See http://datatable.r-forge.r-project.org/datatable-intro.pdf for more information.

How to use the princomp() or prcomp() functions in R with large datasets, without transposing the data?

I have just started learning about PCA, and I wish to use it for a huge microarray dataset with more than 400,000 rows. My columns are samples and my rows are genes/loci. I went through some tutorials on using PCA and came across princomp() and prcomp(), and a few others.
Now, as I understand it, in order to plot "samples" in the biplot, I would need to have them in the rows and the genes/loci in the columns, and hence I would have to transpose my data before using it for PCA.
However, since there are more than 400,000 rows, I am not really able to transpose them into columns, because the number of columns is limited. So my question is: is there any way to perform a PCA on my data without transposing it, using these R functions? If not, can anyone suggest another way or method to do so?
Why do you hate transposing your data? It's easy!
If you read your data into R (for example, as the matrix microarray.data), you can transpose it with a single command:
transposed.microarray.data <- t(microarray.data)
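If the transposed matrix fits in memory, the PCA itself is then one more call (a sketch; whether to center/scale is a judgment call for your data, and biplot() over hundreds of thousands of variables will be slow):

pca <- prcomp(t(microarray.data), center = TRUE, scale. = FALSE)
summary(pca)    # variance explained per component
biplot(pca)     # samples are now in the rows, as required for the biplot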
