Dataset file size in R, possible overhead?

I am using Stat/Transfer to convert a dataset from SAS format to R format. The file in SAS is roughly 489 MB; converted to .RData it is 520 MB. Given that the file is a data frame with 4,090,222 x 11 cells, I suppose the difference can be explained to some extent.
But when I open the converted dataset and ask R to save it, the 520 MB goes down to some 120 MB. I really don't understand how and why this is happening. I suspect data is being dropped (because the reduction is so notable), but as far as I can see, that is not the case.
I have tried all.equal(), which returns TRUE. In fact, everything I try tells me that the datasets are indeed equal... but it does not add up.
Am I making some huge mistake?
EDIT: See Gregor's point below; "problem" solved!

Just turning my comments into an answer:
R compresses data when it saves it as .RData, and actually does an impressive job of it as compared to other statistical programming languages, as demonstrated in this blog entry.
So the answer is no, you shouldn't be worried.
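
If you want to see the compression at work, here is a quick sketch; the data frame below is just a stand-in for the converted dataset, not the real one:

    # Stand-in data frame; the real question involves ~4 million rows x 11 columns.
    mydata <- data.frame(x = rnorm(1e6), y = sample(letters, 1e6, replace = TRUE))

    # save() gzip-compresses by default; compare a few settings.
    save(mydata, file = "mydata.RData")                        # default gzip compression
    save(mydata, file = "mydata_raw.RData", compress = FALSE)  # no compression
    save(mydata, file = "mydata_xz.RData", compress = "xz")    # usually smallest, slowest

    # File sizes in MB (file.size() needs R >= 3.2.0)
    file.size(c("mydata.RData", "mydata_raw.RData", "mydata_xz.RData")) / 1024^2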

Related

How to continue project in new file in R

I have a large population survey dataset for a project, and the first step is to make exclusions and produce a final dataset for analyses. To keep my work organized, I want to continue in a new file where I derive the survey variables correctly. Is there a command to carry over all the previous data and code into the new file?
I don't think I understand the problem you have. You can always create multiple .R files and split the code among them as you wish, and you can also arrange those files as you see fit in the file system (group them in the same folder with informative names and comments, etc...).
As for the data side of the problem, you can load your data into R, make any changes / filters needed, and then save it to another file with one of the billions of functions to write stuff to the disk: write.table() from base, fwrite() from data.table (which can be MUCH faster), etc...
I feel that my answer is way too obvious. When you say "project", do you mean "something I have to get done" or the actual projects that you can create in RStudio? If it's the first, then I think I have covered it. If it's the second, I never got to use that feature, so I am not going to be able to help :(
Maybe you can elaborate a bit more.
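
A minimal sketch of that load / filter / save workflow, assuming data.table is available; the file names and the exclusion rule are made up for illustration and are not from the question:

    library(data.table)

    raw <- fread("survey_raw.csv")                  # hypothetical original survey file
    analysis <- raw[age >= 18 & !is.na(outcome)]    # hypothetical exclusion criteria
    fwrite(analysis, "survey_analysis.csv")         # write the final analysis dataset

    # In the next script (the "new file"), start from the saved result:
    analysis <- fread("survey_analysis.csv")

saveRDS()/readRDS() work just as well if you want to preserve R column types exactly.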

Non-programmer, ascii file data extract (can I even learn to code?)

As the title says, I'm not a programmer. I've tried R before, got very confused and abandoned it. I'm a physician, and I do all my statistics either with SPSS or Excel. I'd like to learn some coding for when I get into problems like this:
I have an ascii file that I'd like to extract data from. The fields are contained within columns of variable width. 90% of the file is useless to me. For example, the fields I'm interested in extracting are encoded in columns 00645-00649, 03315-03319, etc. I'd like to get this into a format so I can run stats in SPSS/Excel. Should I be looking to use R, Python, something else or am I totally beyond hope?
Thanks in advance.
It's impossible to say for certain given only the information here, but the DATA LIST command in SPSS may well allow you to read the data into SPSS directly from the current file. If you can specify the column locations of the desired variables, you can specify those on that command, and SPSS will simply skip over the unnamed columns.
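
Since the thread is about R, here is, for completeness, a hedged sketch of the same idea in base R: read.fwf() reads fixed-width files, and negative widths skip columns you don't want. The file name and variable names below are placeholders; only the column positions come from the question.

    # Columns 00645-00649 and 03315-03319, per the question.
    # widths: skip 644 characters, read 5, skip 2665 (columns 650-3314), read 5.
    dat <- read.fwf("records.txt",                  # hypothetical input file
                    widths = c(-644, 5, -2665, 5),
                    col.names = c("field1", "field2"))

    # Save as CSV so it can be opened in SPSS or Excel.
    write.csv(dat, "extract.csv", row.names = FALSE)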

R Raster writeRaster doesn't overwrite

Identifying a potential bug here. When calling writeRaster with overwrite=TRUE, the new raster values remain unchanged. I originally wrote the wrong raster object, then corrected the code and wrote a new raster to the same file name. The values in the attribute table of the written file are the same as the original, even though the raster object I am writing has the correct attributes when viewed in R.
Workaround was to give the new raster a different name (or manually delete the old).
R 3.0.0, Windows 7 64-bit
Apologies to Brian, with whom I share our modeling workstation. This was my post.
Josh O'Brien - looks like you were right, something was locking the file against being overwritten. I think ArcCatalog was locking it up.
This tool has performed as expected many times since this incident.
I found the same issue.
I can confirm that if you have ArcMap open, overwrite=TRUE doesn't work in R, and it fails without any warning message.
Hope this helps other R users managing raster files.
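
A hedged sketch of the workaround discussed above (raster package); the raster object and file name are placeholders, not the originals from this thread. The conclusion here is that ArcMap/ArcCatalog held a lock on the file, so closing those comes first; deleting the old file before writing also avoids the silent failure.

    library(raster)

    r_new <- raster(nrows = 10, ncols = 10)    # stand-in for the corrected raster
    values(r_new) <- runif(ncell(r_new))

    out <- "corrected_raster.grd"              # hypothetical output path
    if (file.exists(out)) {
      ok <- file.remove(out)                   # FALSE (with a warning) if the file is locked
      if (!ok) stop("Old file is locked - close ArcMap/ArcCatalog and retry")
    }
    writeRaster(r_new, filename = out, overwrite = TRUE)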

Basic analysis of large CSV with FF package in R

I have been messing around with R for the last year and now want to go a little deeper. I want to learn more about the ff and big-data packages because I have had trouble getting through some of the documentation.
I like to learn by doing, so let's say I have a huge CSV called data.csv and it is 300 MB. It has 5 columns: Url, PR, tweets, likes, age. I want to deduplicate the list based on the URLs, then plot PR and likes on a scatter plot to see if there is any correlation. How would I go about doing that basic analysis?
I always get confused by the chunking in these big-data workflows and by the fact that you can't load everything in at once.
What are some common problems you have run into using the ff package or other big-data tools?
Is there another package that works better?
Basically any information to get started using a lot of data in R would be useful.
Thanks!
Nico
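
For what it's worth, a rough sketch of the kind of ff workflow being asked about, assuming a data.csv with the columns Url, PR, tweets, likes, age from the question (illustrative only, not a tuned recipe):

    library(ff)

    # Read the CSV into an ffdf: the data lives on disk, not in RAM.
    dat <- read.csv.ffdf(file = "data.csv", header = TRUE)

    # A single column usually fits in RAM even when the whole table does not,
    # so pull Url into memory to find the duplicate rows.
    dup  <- duplicated(dat$Url[])
    keep <- which(!dup)

    # For the scatter plot only two columns are needed; extract them for the
    # non-duplicated rows as ordinary in-memory vectors.
    pr    <- dat$PR[][keep]
    likes <- dat$likes[][keep]

    plot(pr, likes, xlab = "PR", ylab = "likes")
    cor(pr, likes, use = "complete.obs")

At 300 MB the whole file would likely also fit in RAM via data.table::fread(), which is worth trying before reaching for ff.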

XTS size limitation

I have been working with large datasets lately (more than 400 thousand rows). So far I have been using the xts format, which worked fine for "small" datasets of a few tens of thousands of elements.
Now that the project has grown, R simply crashes when retrieving the data from the database and putting it into the xts object.
It is my understanding that R should be able to hold vectors with up to 2^32-1 elements (or 2^64-1, depending on the version). Hence, I came to the conclusion that xts might have some limitations, but I could not find the answer in the docs (maybe I was a bit overconfident about my understanding of the theoretically possible vector size).
To sum up, I would like to know:
whether xts indeed has a size limitation;
what you think is the smartest way to handle large time series (I was thinking about splitting the analysis into several smaller datasets).
I don't get an error message; R simply shuts down automatically. Is this a known behavior?
SOLUTION
The limit is the same as R's, and it depends on the architecture being used (32-bit vs. 64-bit); it is in any case extremely large.
Chunking the data is indeed a good idea, but it was not needed here.
The problem came from a bug in R 2.11.0 that was fixed in R 2.11.1: there was an issue with long date vectors (here, the index of the xts object).
Regarding your two questions, my $0.02:
Yes, there is a limit of 2^32-1 elements for R vectors. This comes from the indexing logic, and that reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction.
Yes, splitting things into manageable chunks is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or keep my data sets sane).
There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and affirm that you really need all 400k rows at once.
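
To illustrate the chunking idea in point 2, a small sketch on made-up minute data (not the original 400k-row database pull):

    library(xts)

    # 400,000 one-minute observations as a stand-in series.
    idx <- seq(as.POSIXct("2010-01-01"), by = "min", length.out = 4e5)
    x   <- xts(rnorm(length(idx)), order.by = idx)

    # split.xts breaks the series into a list of smaller xts objects
    # (here one per month) that can be processed independently.
    chunks <- split(x, f = "months")
    res    <- sapply(chunks, mean)
    head(res)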
