Importing huge SAS database to RStudio - r

I have a huge dataset in SAS (about 14 GB) that I am trying to import into RStudio for some analysis, but nothing works for the import. I tried 'haven', 'sas7bdat', and 'foreign', with no luck. I converted the file to CSV (which came to about 7 GB) and tried read.csv and fread, but again nothing.
RStudio takes a very long time to process the file and in the end says it failed, with a message about not being able to allocate a vector of some size in MB (around 60 MB, depending on the method used).
Does anyone have any ideas about how to solve this?
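One approach that sometimes helps here (a sketch, not a guaranteed fix): with haven you can pull in only the columns you actually need, and recent haven versions also let you cap the number of rows, which may get a 14 GB sas7bdat under the memory limit. The file path and column names below are placeholders.

library(haven)

# read only the variables needed for the analysis; n_max caps the row count
subset_df <- read_sas(
  "huge_dataset.sas7bdat",
  col_select = c(id, outcome, treatment_date),  # placeholder column names
  n_max = 1e6
)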

Related

Writing .xlsx in R, importing into Power BI error

I'm experiencing an odd error. I have a large data frame in R (75,000 rows, 97 columns) that I need to save out and then import into Power BI.
At first I just did the simple:
library(tidyverse)
write_csv(Visits,"Visits.csv")
and while it seems to export fine and looks fine in Excel, the CSV itself is all messed up when I look at the contents in Power BI. Here's an example of what I mean:
The 'phase.x' column should only contain "follow-up" or "treatment". In Excel it looks fine, but that exact same file gets scrambled in Power BI.
I figured that, it being a comma-separated values file, there must be a stray comma somewhere, so I saved it as an .xlsx instead.
So, while in Excel, I saved that .csv as an .xlsx, and it opened fine in Power BI!
Jump forward a moment: instead of write_csv() in R, I now use write.xlsx(). But that gives me an error.
If I simply go to the file, open it in Excel, save it, and close it, the error goes away and it loads into Power BI just fine. I figure it has something to do with this question on here.
Any ideas on what I might be doing wrong as I save the file out of R? Is there some way to fix it in R so I don't have to open and re-save it every time?
In Power BI, check that your source has 'ignore quoted line breaks' enabled. I've found this is often an issue with .csv files in Power BI.
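If the problem really is embedded commas or line breaks inside text fields, another option on the R side (a sketch, assuming readr >= 2.0 for the quote argument; Visits is the data frame from the question) is to strip line breaks from character columns and/or force quoting of every field before export.

library(tidyverse)

# replace embedded line breaks with spaces, then quote every field on write
Visits_clean <- Visits %>%
  mutate(across(where(is.character), ~ str_replace_all(.x, "[\r\n]+", " ")))
write_csv(Visits_clean, "Visits.csv", quote = "all")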

read_excel is reading the data slowly

The read_excel function from the readxl package is very slow at reading Excel files with a large number of columns. The odd thing is that before reinstalling the OS I didn't have this problem.
For example, it used to take about 10 seconds to read and process 10 files; now it takes 20-30 seconds to read just 1 file.
I also tried installing the same version of R. Does anyone know what the problem could be?
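One common culprit (a guess, not a confirmed diagnosis): readxl spends a lot of time guessing column types on wide sheets. Supplying col_types explicitly, or lowering guess_max, skips most of that work. The file path and sheet below are placeholders.

library(readxl)

# read everything as text so no per-column type guessing happens;
# convert types afterwards, e.g. with readr::type_convert()
dat <- read_excel("wide_file.xlsx", sheet = 1, col_types = "text")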

Read a sample from sas7bdat file in R

I have a sas7bdat file around 80 GB in size. Since my PC has 4 GB of memory, the only option I can see is reading some of its rows. I tried the sas7bdat package in R, which gives the error "big endian files are not supported".
The read_sas() function in haven seems to work, but it only supports selecting specific columns, while I need to read a subset of rows with all columns. For example, it would be fine if I could read 1% of the data just to understand it.
Is there any way to do this? Is there a package that can handle it?
Later on I plan to read the file in parts, dividing it into 100 or so sections.
If you are on Windows you can use SAS Universal Viewer, which is free, to export the dataset to CSV. Then you can import the CSV into R in more manageable chunks using this method.
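One possible way to do the chunked import (a sketch, not necessarily the method linked above): readr can stream the exported CSV and pass each chunk to a callback, so only chunk_size rows are in memory at a time. The file name and sampling fraction are placeholders.

library(readr)

# keep roughly 1% of each chunk to build a small sample of the 80 GB table
peek <- function(chunk, pos) {
  chunk[sample(nrow(chunk), size = ceiling(0.01 * nrow(chunk))), ]
}

sample_df <- read_csv_chunked(
  "export_from_universal_viewer.csv",   # placeholder file name
  DataFrameCallback$new(peek),
  chunk_size = 100000
)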

Use readOGR to load a large spatial file in R

For my processing in R I want to read in a 20 gigabyte file. It is an XML file.
In R I cannot load it with readOGR because it is too big; I get the error "cannot allocate vector of size 99.8 Mb".
Since the file is too big, the logical next step in my mind would be to split it. But since I cannot open it in R or in any GIS package at hand, I cannot split the file before loading it. I am already using the best PC available to me.
Is there a solution?
UPDATE BECAUSE OF A COMMENT
If I use head(), my line looks like the one below. Unfortunately it does not work.
headfive <- head(readOGR('file.xml', layer = 'layername'), 5)
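A hedged alternative sketch: the sf package can pass an OGR SQL query down to GDAL, so only a limited number of features are ever returned to R. This assumes the XML file is a format GDAL reads as a layer (e.g. GML), that your GDAL build supports LIMIT in its SQL dialect, and that 'layername' matches the layer in the question; whether it avoids the memory problem depends on the driver.

library(sf)

# ask GDAL to return only the first 5 features of the layer
first_features <- st_read(
  "file.xml",
  query = "SELECT * FROM layername LIMIT 5"
)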

source() taking a long time and often crashing

I used the dump() command to dump some data frames in R.
The dump files are about 200 MB each, and one is about 1.5 GB. Later I tried to retrieve them using source(); it takes a very long time and after 3-4 hours Windows reports that R has stopped working. I am using 64-bit R 3.0.0 (I tried R 2.15.3 too) on Windows 7 with 48 GB of memory. For one of the files it threw a memory error (I don't have the log now) but loaded 4-5 of the roughly 15 datasets.
Is there any way I can load a particular dataset if I know its name?
Or is there any other way?
I have learned my lesson: in future I will probably save the command used to create the data along with the original data, or keep one dataset per dump file (or R image file).
Thank you
Use save() and load() rather than dump() and source().
save() writes out a binary representation of the data to an .RData file, which can then be loaded back in using load().
dump() converts everything to a text representation, which source() then has to parse and convert back into R objects. Both ends of that process are very inefficient.
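A short sketch of the suggested workflow. save()/load() handle several objects per file; saveRDS()/readRDS() are handy when you want to reload a single data frame by name, which also answers the question above. The object names are placeholders.

df1 <- data.frame(x = 1:5)
df2 <- data.frame(y = letters[1:5])

save(df1, df2, file = "mydata.RData")   # binary, compact, fast to reload
load("mydata.RData")                    # restores df1 and df2 by name

saveRDS(df1, "df1.rds")                 # one object per file
df1_again <- readRDS("df1.rds")         # reload just that object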
