Downloading Biggish Datasets with bigrquery - Best practices?

I'm trying to download a table of about 250k rows and 500 cols from BigQuery into R for some model building in h2o using the R wrappers. It's about 1.1 GB when downloaded from BQ.
However, it runs for a long time and then loses the connection, so the data never makes it to R (I'm rerunning now so I can get a more precise example of the error).
I'm just wondering whether using bigrquery for this is a reasonable task, or whether bigrquery is mainly for pulling smaller datasets from BigQuery into R.
Just wondering if anyone has any tips and tricks that might be useful - I'm going through the library code to try to figure out exactly how it's doing it (I was going to see if there was an option to shard out the file locally or something). But I'm not entirely sure I even know what I'm looking at.
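For context, a minimal sketch of the kind of bigrquery download being attempted might look like the following (project, dataset, and table names are placeholders, and the bq_* functions and the page_size argument are assumed from the current bigrquery API):

library(bigrquery)

# placeholders - substitute your own project / dataset / table
project <- "my-gcp-project"
sql     <- "SELECT * FROM `my-gcp-project.my_dataset.model_data_final`"

# run the query; the result lands in a (temporary) BigQuery table
tb <- bq_project_query(project, sql)

# pull the result into R; a smaller page_size can help with flaky connections
# on very wide tables, at the cost of more HTTP round trips
df <- bq_table_download(tb, page_size = 5000)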
Update:
I've gone with the quick fix of using the CLIs to download the data locally:
# export the table to sharded CSV files in Cloud Storage
bq extract blahblah gs://blah/blahblah_*.csv
# copy the shards down to the local machine
gsutil cp gs://blah/blahblah_*.csv /blah/data/
And then to read the data just use:
# get file names in case sharded across multiple files
file_names <- paste0('/blah/data/',
                     list.files(path = '/blah/data/',
                                pattern = paste0(my_lob, '_model_data_final')))
# read each file and stack them into one data frame
df <- do.call(rbind, lapply(file_names, read.csv))
It's actually a lot quicker this way - 250k rows, no problem.
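If the read.csv step itself ever becomes the bottleneck, a possibly faster variant (assuming data.table is available; the pattern below is simplified from the snippet above) is to swap in fread and rbindlist:

library(data.table)

# same sharded files as above; fread is usually much faster than read.csv
file_names <- list.files(path = '/blah/data/',
                         pattern = '_model_data_final',
                         full.names = TRUE)
df <- rbindlist(lapply(file_names, fread))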
I do find that BigQuery could do with a bit better integration into the wider ecosystem of tools out there. I love the R + Dataflow examples; definitely going to look into those a bit more.

Related

What are the different ways to update an Excel sheet from R

I am in the process of automating, through R, a number of graphs produced where I work that currently live in Excel.
Note that for now, I am not able to convince people that doing the graphs directly in R is the best solution, so the solution cannot be "use ggplot2", although I will push for it.
So in the meantime, my plan is to download, update, and tidy the data in R, then export it to an existing Excel file where the graph is already constructed.
The way I have been trying to do that is through openxlsx, which seems to be the most frequent recommendation (for instance here).
However, I am encountering an issue that I cannot solve this way (I asked a question about it there that did not inspire a lot of answers!).
Therefore, I am going to try other ways, but I seem to mainly be directed to the aforementioned solution. What are the existing alternatives?
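For concreteness, a minimal sketch of the openxlsx workflow described above, assuming the workbook already contains a sheet named "data" that the chart reads from (file, sheet, and data frame names are made up):

library(openxlsx)

# open the existing workbook that already holds the chart
wb <- loadWorkbook("report.xlsx")

# overwrite the data sheet the chart points at with the tidied data frame
writeData(wb, sheet = "data", x = tidy_df, startCol = 1, startRow = 1)

# save back to disk, keeping the rest of the workbook intact
saveWorkbook(wb, "report.xlsx", overwrite = TRUE)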

Subset of features on external memory

I have a large file that I'm not able to load, so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The xgboost documentation says that the colset argument of slice is "currently not used", and there is no mention of this feature on the GitHub page. I haven't found any other clue about how to do column subsetting with external memory.
I wish to compare models generated with different feature subsets. The only thing I could think of is to create a new file with just the features that I want to use, but that is taking a long time and will take a lot of memory... I can't help wondering if there is a better way.
P.S.: I tried the h2o package too, but h2o.importFile froze.
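A hedged sketch of the "write a reduced file" workaround mentioned above, keeping memory down by only ever reading the wanted columns with data.table::fread (file and column names are hypothetical):

library(data.table)
library(xgboost)

# read only the columns needed for this feature subset
feature_cols <- c("feat_1", "feat_7", "feat_42")
dt <- fread("big_data.csv", select = c(feature_cols, "label"))

# build the DMatrix from the reduced data and save it for reuse
dtrain <- xgb.DMatrix(data  = as.matrix(dt[, feature_cols, with = FALSE]),
                      label = dt$label)
xgb.DMatrix.save(dtrain, "dtrain_subset.buffer")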

R running very slowly after loading large datasets > 8GB

I have been unable to work in R given how slowly it is operating once my datasets are loaded. These datasets total around 8 GB. I am running on 8 GB of RAM and have adjusted memory.limit to exceed my RAM, but nothing seems to be working. Also, I have used fread from the data.table package to read these files, simply because read.table would not run.
After seeing a similar post on the forum addressing the same issue, I have attempted to run gctorture(), but to no avail.
R is running so slowly that I cannot even check the length of the list of datasets I have loaded, and I cannot View them or do any basic operation once they are loaded.
I have tried loading the datasets in 'pieces', i.e. a third of the files at a time over three passes, which seemed to make the importing run more smoothly, but has not changed anything with regard to how slow R runs afterwards.
Is there any way to get around this issue? Any help would be much appreciated.
Thank you all for your time.
The problem arises because R loads the full dataset into RAM, which mostly brings the system to a halt when you try to View your data.
If it's a really huge dataset, first make sure the data contains only the most important columns and rows. Valid columns can be identified through the domain knowledge you have about the problem. You can also try to eliminate rows with missing values.
Once this is done, depending on the size of your data, you can try different approaches. One is to use packages like bigmemory and ff. bigmemory, for example, creates a pointer object through which you can read the data from disk without loading it into memory (see the sketch below).
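A minimal sketch of the bigmemory route, assuming a large CSV of numeric columns (file names are placeholders):

library(bigmemory)

# parse the CSV once into a file-backed big.matrix; only a small
# descriptor lives in RAM, the data itself stays on disk
x <- read.big.matrix("big_data.csv", header = TRUE, type = "double",
                     backingfile = "big_data.bin",
                     descriptorfile = "big_data.desc")

# later sessions can re-attach instantly without re-reading the CSV
x <- attach.big.matrix("big_data.desc")

# work on manageable chunks, e.g. the first 10,000 rows
chunk <- x[1:10000, ]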
Another approach is parallelism (implicit or explicit). MapReduce-style tools are also very useful for handling big datasets.
For more information on these, check out this blog post on rpubs and this old but gold post from SO.

Why is collect in SparkR so slow?

I have a 500K-row Spark DataFrame that lives in a Parquet file. I'm using Spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't have the ability to do date math inside joins and filters as easily as SparkR, so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those Spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
@Will
I don't know whether the following actually answers your question, but Spark does lazy evaluation. The transformations done in Spark (or SparkR) don't really create any data; they just build a logical plan to follow.
When you run an action like collect, Spark has to fetch the data directly from the source RDDs (assuming you haven't cached or persisted the data).
If your data is not that large and can be handled easily by local R, then there is no need to go with SparkR. Another solution is to cache your data for frequent use.
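For example, in SparkR 2.0.x, caching the DataFrame before repeated collects looks roughly like this (the path is a placeholder):

library(SparkR)
sparkR.session()

# read the Parquet file as a Spark DataFrame
df <- read.parquet("/path/to/data.parquet")

# keep it in memory so repeated actions don't re-read and re-plan
cache(df)
count(df)                 # an action, forces the cache to be populated

local_df <- collect(df)   # still pays the serialization cost, but only once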
Short answer: serialization/deserialization is very slow.
See, for example, the post on my blog: http://dsnotes.com/articles/r-read-hdfs
However, it should be equally slow in both SparkR and sparklyr.

R - downloading a subset from original dataset

I'm wondering whether it is possible to download (from a website) a subset of an original dataset stored in RData format. The easiest way, of course, is to proceed in this manner:
con <- url("http://xxx.com/datasets/dataset.RData")
load(con)   # loads the saved object (here assumed to be called `dataset`) into the workspace
close(con)
subset <- dataset[dataset$var == "yyy", ]
however I'm trying to speed up my code and avoid downloading unnecessary columns.
Thanks for any feedback.
Matt
There is no mechanism for that task. There is also no mechanism for inspecting .RData files. In the past, when this has been requested, people have been advised to convert to a real database management system.
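As an illustration of the database route, a hedged sketch using RSQLite (table and column names are made up): store the data once, then pull only the rows and columns you need.

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "datasets.sqlite")

# one-off: write the full data frame into the database
# dbWriteTable(con, "dataset", dataset)

# afterwards, fetch only the columns and rows required
subset_df <- dbGetQuery(con,
  "SELECT var, value FROM dataset WHERE var = 'yyy'")

dbDisconnect(con)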
