I am trying to understand why development shifted from the reshape package to reshape2. They seem to be functionally the same; however, I currently cannot upgrade to reshape2 because the server is running an older version of R. I am concerned that a major bug may have prompted the move to a whole new package rather than simply continuing development of reshape. Does anyone know if there is a major flaw in the reshape package?
reshape2 let Hadley reboot reshape as something much, much faster while avoiding breaking existing users' dependencies and habits. From the release announcement:
https://stat.ethz.ch/pipermail/r-packages/2010/001169.html
Reshape2 is a reboot of the reshape package. It's been over five years
since the first release of the package, and in that time I've learned
a tremendous amount about R programming, and how to work with data in
R. Reshape2 uses that knowledge to make a new package for reshaping
data that is much more focussed and much much faster.
This version improves speed at the cost of functionality, so I have
renamed it to reshape2 to avoid causing problems for existing users.
Based on user feedback I may reintroduce some of these features.
What's new in reshape2:
considerably faster and more memory efficient thanks to a much better underlying algorithm that uses the power and speed of subsetting to the fullest extent, in most cases only making a single copy of the data.
cast is replaced by two functions depending on the output type: dcast produces data frames, and acast produces matrices/arrays.
multidimensional margins are now possible: grand_row and grand_col have been dropped: now the name of the margin refers to the variable that has its value set to (all).
some features have been removed such as the | cast operator, and the ability to return multiple values from an aggregation function. I'm reasonably sure both these operations are better performed by plyr.
a new cast syntax which allows you to reshape based on functions of variables (based on the same underlying syntax as plyr).
better development practices like namespaces and tests.
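In practice the new interface looks roughly like this; a minimal sketch (not part of the announcement itself) using the built-in airquality data set:

    library(reshape2)

    # melt the built-in airquality data into long format:
    # one row per (Month, Day, measured variable, value)
    aq_long <- melt(airquality, id.vars = c("Month", "Day"))

    # dcast returns a data frame: mean of each measured variable per Month
    dcast(aq_long, Month ~ variable, fun.aggregate = mean, na.rm = TRUE)

    # acast returns the same summary as a matrix/array
    acast(aq_long, Month ~ variable, fun.aggregate = mean, na.rm = TRUE)

    # margins: naming a variable adds an "(all)" row for it
    dcast(aq_long, Month ~ variable, fun.aggregate = mean, na.rm = TRUE,
          margins = "Month")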
I'm relatively new to R programming and I've been doing some research, but I can't find an answer to this question.
Does it take more processing power to load the full tidyverse at the beginning of my code rather than loading just the dplyr package? For example, I might only need functions found in dplyr. Am I reducing the speed/performance of my code by loading the full tidyverse, which must be a larger package since it contains several other packages? Or would the processing speed be the same regardless of which package I load? From an ease-of-coding perspective I'd rather use the tidyverse since it's more comprehensive, but if I'm using more processing power, then perhaps loading the less comprehensive package is more efficient.
As NelsonGon commented, your processing speed is not reduced by loading packages. The packages themselves take a moment to attach, but that time is usually negligible, especially if you would otherwise load dplyr, tidyr, and purrr individually anyway.
Attaching more packages to the search path (with library(dplyr), for example) will not hurt your speed, but it can cause namespace conflicts down the road, where one package's function masks another's.
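For example (a small hypothetical illustration), attaching dplyr masks a couple of functions from stats, which is harmless until you expect the old behaviour:

    library(dplyr)   # attach message notes that filter and lag from stats are masked

    # filter() now means dplyr::filter(), which works on data frames
    filter(mtcars, cyl == 6)

    # the masked time-series filter is still available if you qualify it
    stats::filter(1:10, rep(1/3, 3))   # 3-point moving average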
Now, there are some benchmarks out there comparing dplyr, data.table, and base R, and dplyr tends to be slower, but YMMV. Here's one I found: https://www.r-bloggers.com/2018/01/tidyverse-and-data-table-sitting-side-by-side-and-then-base-r-walks-in/. So, if you are doing operations that take a long time, it might be worthwhile to use data.table instead.
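If you want to check on your own data, a rough benchmark sketch (the data here is made up, and results will vary with size and operation) could look like this:

    library(microbenchmark)
    library(dplyr)
    library(data.table)

    n  <- 1e6
    df <- data.frame(g = sample(letters, n, replace = TRUE), x = rnorm(n))
    dt <- as.data.table(df)

    # grouped mean in base R, dplyr, and data.table
    microbenchmark(
      base  = aggregate(x ~ g, data = df, FUN = mean),
      dplyr = df %>% group_by(g) %>% summarise(m = mean(x)),
      dt    = dt[, .(m = mean(x)), by = g],
      times = 10
    )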
When I read about Microsoft R Open I usually read that it is faster in matrix calculations than R from CRAN due to multicore support.
I understand that this can increase performance, e.g. when running regressions. Does it also significantly speed up calculations from tidyr or dplyr? The underlying question is, I guess, whether these packages rely on matrix calculations or not. More generally, do data.frames use matrix calculations under the hood? As far as I know, data.frames are a special kind of list...
Does anyone have an answer to this, theoretically and (ideally) with some benchmarks?
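On the last point, a data.frame is indeed stored as a list of equal-length column vectors rather than as a matrix, which a quick check shows:

    # a data.frame is a list of equal-length column vectors, not a matrix
    df <- data.frame(a = 1:3, b = c("x", "y", "z"))
    typeof(df)     # "list"
    is.list(df)    # TRUE
    is.matrix(df)  # FALSE
    class(df)      # "data.frame"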
Apparently in my last question I demonstrated confusion between data.frame and data.table. Admittedly, I didn't realize there was a distinction.
So I read the help for each, but in practical, everyday terms: what is the difference, what are the implications, and what is each used for? Knowing that would help guide me to their appropriate usage.
While this is a broad question, if someone is new to R it can be confusing, and the distinction between the two can get lost.
All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.
data.frame is part of base R.
data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.
However, that syntactic sugar differs from the standard R syntax for data.frame while being hard for the untrained eye to distinguish at a glance. Therefore, if you read a code snippet with no other context indicating that it operates on data.tables, and you apply that code to a data.frame, it may fail or produce unexpected results. (A clear giveaway that you are working with data.tables, besides the library/require call, is the presence of the assignment operator :=, which is unique to data.table.)
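To make that concrete, here is a small made-up example of the same grouped mean written both ways, plus the := operator that exists only for data.tables:

    library(data.table)

    df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))
    dt <- as.data.table(df)

    # data.frame / base R style
    aggregate(x ~ g, data = df, FUN = mean)

    # data.table style: the j expression is evaluated inside the table, grouped by g
    dt[, .(mean_x = mean(x)), by = g]

    # := adds a column by reference; this only works on a data.table
    dt[, x2 := x * 2]
    # df[, x2 := x * 2]   # would error: := is only defined for data.tables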
With all that being said, I think it is hard to actually appreciate the beauty of data.table without experiencing the shortcomings of data.frame. (for example, see the first 3 bullet points of #eddi's answer). In other words, I would very much suggest learning how to work with and manipulate data.frames first then move on to data.tables.
A few differences in my day-to-day life that come to mind (in no particular order; a small sketch follows the list):
not having to repeat the data.table name over and over in expressions, which with data.frames leads to clumsy syntax and silly mistakes (on the flip side, I sometimes miss TAB-completion of column names)
much faster and very intuitive grouped operations with by
no more frantically hitting Ctrl-C after typing df and forgetting how large df was; data.table prints only the first and last few rows, so I almost never need head
faster and better file reading with fread
the package also provides a number of other utility functions, like %between% or rbindlist that make life better
faster everything else, since a lot of data.frame operations copy the entire thing needlessly
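A small sketch of a few of these conveniences (the CSV file name is hypothetical):

    library(data.table)

    # fread: fast file reading that returns a data.table
    # dt <- fread("big_file.csv")        # hypothetical file

    dt <- data.table(id = 1:5, value = c(10, 25, 40, 55, 70))

    # columns are visible directly inside [ ], no dt$ prefix needed
    dt[value %between% c(20, 60)]        # rows with 20 <= value <= 60

    # := updates a column by reference, without copying the whole table
    dt[, value_scaled := value / max(value)]

    # rbindlist: fast row-binding of a list of tables
    rbindlist(list(dt, dt))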
They are similar. Data frames are lists of vectors of equal length, while data tables (data.table) inherit from data frames. Therefore data tables are data frames, but data frames are not necessarily data tables. The data.table package was written to speed up indexing, ordered joins, assignment, grouping, listing columns, and so on.
See http://datatable.r-forge.r-project.org/datatable-intro.pdf for more information.
How can I use the R packages zoo or xts with very large data sets? (100GB)
I know there are some packages such as bigrf, ff, and bigmemory that can deal with this problem, but you have to use their limited set of commands; they don't have the functions of zoo or xts, and I don't know how to make zoo or xts use them.
How can I use them?
I've seen that there are also some other options related to databases, such as sqldf and hadoopstreaming, RHadoop, and others used by Revolution R. What do you advise? Anything else?
I just want to aggregate series, clean the data, and run some cointegration tests and plots.
I would rather not have to code and implement new functions for every command I need, working on small pieces of data every time.
Added: I'm on Windows
I have had a similar problem (though I was only playing with 9-10 GB). My experience is that there is no way R can handle so much data on its own, especially since your dataset appears to contain time series data.
If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see the Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this manual may also come in handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ ).
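For instance, a minimal sketch of building a sparse matrix with Matrix (toy dimensions, nowhere near 100 GB):

    library(Matrix)

    # a mostly-zero 10000 x 10000 matrix stored in sparse form:
    # only the non-zero entries and their positions are kept in memory
    i <- sample(1e4, 1e5, replace = TRUE)   # row indices of non-zero entries
    j <- sample(1e4, 1e5, replace = TRUE)   # column indices
    x <- rnorm(1e5)                         # the non-zero values
    m <- sparseMatrix(i = i, j = j, x = x, dims = c(1e4, 1e4))

    object.size(m)   # far smaller than the ~800 MB a dense double matrix would need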
I used PostgreSQL - the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It allows you to query your PostgreSQL database; it uses SQL syntax. Data is downloaded into R as a dataframe. It may be slow (depending on the complexity of your query), but it is robust and can be handy for data aggregation.
Drawback: you would need to upload data into the database first. Your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Yet uploading "well-behaved" data into the DB is easy ( see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table? )
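A minimal sketch of letting PostgreSQL do the aggregation and pulling only the result into R (connection details and table/column names are placeholders):

    library(RPostgreSQL)

    # placeholder connection details
    con <- dbConnect(PostgreSQL(), dbname = "mydb", host = "localhost",
                     user = "me", password = "secret")

    # aggregate in the database, return only the (much smaller) result
    daily <- dbGetQuery(con, "
      SELECT date_trunc('day', ts) AS day, series_id, avg(value) AS avg_value
      FROM observations
      GROUP BY 1, 2
      ORDER BY 1, 2")

    dbDisconnect(con)

    # 'daily' is an ordinary data.frame, small enough to hand to zoo/xts
    head(daily)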
I would recommend using PostgreSQL or any other relational database for your task. I did not try Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL.
It seems like melt will reshape your data frame with id columns and stacked measured variables, after which a cast lets you perform aggregation. ddply, from the plyr package, seems to be very similar: you give it a data frame, a couple of column variables for grouping, and an aggregation function, and you get back a data frame. So how are they different, and are there any good resources/references for learning these tools besides their documentation (which, especially for reshape, is a bit difficult to follow)?
Thanks
One difference is that stats::reshape has a built-in way to handle "wide" data whereas reshape2 (cast/melt) does not. See this question for an example: Reshape in the middle
That said, stats::reshape has frustrating arguments and specializes in only one type of data transformation (albeit a common one).
plyr tends to be used in place of the apply functions, whereas reshape2 tends to replace reshape. Even though their functionalities overlap, each lends itself to particular tasks.
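To make the overlap concrete, here is the same per-month mean computed both ways; a small sketch using the built-in airquality data:

    library(reshape2)
    library(plyr)

    # reshape2: melt to long format, then aggregate while casting back to wide
    aq_long <- melt(airquality, id.vars = c("Month", "Day"))
    dcast(aq_long, Month ~ variable, fun.aggregate = mean, na.rm = TRUE)

    # plyr: split by Month, apply a summarising function, recombine
    ddply(airquality, "Month", summarise,
          Ozone = mean(Ozone, na.rm = TRUE),
          Temp  = mean(Temp,  na.rm = TRUE))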
Hadley Wickham, the author of the reshape2 and plyr packages, has a nice pdf on tidy data that's worth a read. He also has an article on plyr here: http://www.jstatsoft.org/v40/i01