I'm using "bigrquery" package on Rstudio Server to retrieve data from Google BigQuery. The target is querying 30~180 tables which are around 3.5GB individually. The query result is a table around 7~40 GB, which will be transformed into data frame in R and finally R-shiny application.
I want to know which way would be faster:
Using src_bigquery() + dplyr functions, and collecting the data I want at the end
Using query_exec() to get the "raw data" first, then doing all the data manipulation with dplyr
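Roughly, the two methods look like this (a minimal sketch; the project, dataset, table, and column names are placeholders for your own setup):

library(bigrquery)
library(dplyr)

# Method 1: let dplyr build the SQL, run it inside BigQuery, collect only the result
bq <- src_bigquery(project = "my-project", dataset = "my_dataset", billing = "my-project")
res1 <- tbl(bq, "events") %>%
  filter(event_date >= "2016-01-01") %>%
  group_by(user_id) %>%
  summarise(n = n()) %>%
  collect()

# Method 2: run the SQL yourself with query_exec(), then manipulate the data frame locally
sql <- "SELECT user_id, COUNT(*) AS n FROM [my_dataset.events] GROUP BY user_id"
res2 <- query_exec(sql, project = "my-project", max_pages = Inf)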
Now I am trying method 2, but I found that even though the query itself only takes around 30 s to run, retrieving the query result takes more than 10 min.
Any suggestions to accelerate this process? Or any thoughts on the comparison between methods 1 and 2?
I am trying to load a database that comes from MongoDB into R, so I can run analyses on it. The bridge between the two is an R package: Rmongo.
Because of policy rules, I cannot show the dataset or my output, so I will try to explain as clearly as possible.
My first two commands, after installing the package, are these:
mg1 <- mongoDbConnect("test", "localhost", 27018)
dbShowCollections(mg1)
This works: it shows the collections, i.e. the different variables.
Then I can use the commands provided by the Rmongo package, such as:
query = dbGetQuery(mg1, 'address_history','{}')
This normally returns a data frame with one variable per column. But because the documents are nested, I only get the first three variables (out of around fifty), since they sit at the top level of the nesting. The rest comes back as a single column containing the JSON code (for the remaining ~50 variables), which I cannot seem to turn into a data frame. If someone is familiar with this, please help me.
I have already seen a way on Stack Overflow to do it manually with gsub and pattern matching on the code, but my JSON is structured differently, and doing it manually will not work.
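One possible sketch, assuming the nested fields come back as a single JSON string column (called json here; the column and field names are placeholders): jsonlite can parse those strings into columns.

library(jsonlite)

# Combine the per-row JSON documents into one JSON array and parse it,
# flattening nested fields into ordinary columns
nested <- fromJSON(paste0("[", paste(query$json, collapse = ","), "]"), flatten = TRUE)

# Bind the parsed fields back onto the top-level columns
full <- cbind(query[, setdiff(names(query), "json")], nested)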
Furthermore, there is also another command via the Rmongo package:
query2 = dbGetQueryForKeys(mg1, 'address_history', '{}', '{address:1}')
where I can return the variable that I want. Unfortunately, because the documents are nested, it also cannot find the variables that are not at the top level of the nesting.
Is there another command or another package that I can use? I am open to any other way of getting this (very large) dataset into an R data frame, so I can run my analyses.
Thank you very much!
I just tried setting up Rmongo and mongolite for R. I got mongolite working in minutes with the starter data locally. I could not even get the data I wanted inserted using Rmongo.
I think if you try installing mongolite you will find its documentation and the package itself simpler. https://github.com/jeroen/mongolite
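For reference, a minimal mongolite sketch (using the database, collection, and port from the question; adjust as needed):

library(mongolite)

con <- mongo(collection = "address_history", db = "test",
             url = "mongodb://localhost:27018")

# find() returns a data frame; nested documents come back as nested data frames
docs <- con$find('{}')

# Flatten the nested columns into ordinary top-level columns
flat <- jsonlite::flatten(docs)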
I am using the RMySQL package together with the DBI package in R.
When I run the code,
dbReadTable(con, "data")
it is taking forever.
I think the table contains a lot of data.
Any ideas on how to speed up this process?
Thanks,
Try to get the database to do as much filtering and processing as possible. A database has many more ways to optimize operations than R, and isn't constrained by RAM nearly as severely. It also reduces the amount of data that has to travel across the network.
Common tactics are (a short sketch follows the list):
using the WHERE clause to reduce rows
explicitly list (only the necessary) columns, instead of using *
do as much aggregation in SQL as possible (e.g., GROUP BY + MAX)
use INSERT queries to write from table to table, so the data doesn't even need to pass through R.
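As a rough sketch of these ideas with DBI/RMySQL (the connection details, table, and column names are placeholders):

library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")

# Filter and aggregate in MySQL, pull only the summarised result into R
res <- dbGetQuery(con, "
  SELECT customer_id, COUNT(*) AS n_orders, MAX(order_date) AS last_order
  FROM orders
  WHERE order_date >= '2016-01-01'
  GROUP BY customer_id
")

dbDisconnect(con)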
I imagine RMySQL should be faster than the newish odbc package, but it's worth experimenting with.
What's 'forever'? 5 min or 5 hours? Are things still slow once the data get to R? If things are still too slow to be feasible, consider escalating to something like sparklyr.
I have a 500K-row Spark DataFrame that lives in a parquet file. I'm using Spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8 GB of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't let me do date math inside joins and filters as easily as SparkR does, so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those Spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
@Will
I don't know whether the following actually answers your question, but Spark performs operations lazily. The transformations done in Spark (or SparkR) don't really create any data; they just build a logical plan to follow.
When you run actions like collect(), Spark has to fetch the data directly from the source RDDs (assuming you haven't cached or persisted the data).
If your data is not that large and can easily be handled by local R, then there is no need to go with SparkR. Another option is to cache your data if you use it frequently.
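For example, a minimal SparkR sketch of caching before repeated actions (the path is a placeholder):

library(SparkR)
sparkR.session(master = "local[4]")

df <- read.df("path/to/file.parquet", source = "parquet")
cache(df)                    # or persist(df, "MEMORY_AND_DISK")
nrow(df)                     # an action that materialises the cache
local_df <- collect(df)      # later actions reuse the cached data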
Short: Serialization/deserialization is very slow.
See, for example, this post on my blog: http://dsnotes.com/articles/r-read-hdfs
However, it should be equally slow in both SparkR and sparklyr.
I'm pretty new to Tableau but have a lot of experience with R. Every time I use SCRIPT_REAL to call an R function based on Tableau aggregates, I get back a number that seems to be the closest fraction approximation. For example, if raw R gives me .741312, Tableau will spit out .777778, and so on. Does anyone have experience with this issue?
I'm pretty sure this is an aggregation issue.
From the Tableau and R Integration post by Jonathan Drummey on their community site:
Using Every Row of Data - Disaggregated Data

For accurate results for the R functions, sometimes those R functions need to be called with every row in the underlying data. There are two solutions to this:

Disaggregate the measures using Analysis->Aggregate Measures->Off. This doesn’t actually cause the measures to stop their aggregations, instead it tells Tableau to return every row in the data without aggregating by the dimensions on the view (which gives the wanted effect). Using this with R scripts can get the desired results, but can cause problems for views that we want to have R work on the non-aggregated data and then display the data with some level of aggregation.

The second solution deals with this situation: Add a dimension such as a unique Row ID to the view, and set the Compute Using (addressing) of the R script to be along that dimension. If we’re doing some sort of aggregation with R, then we might need to reduce the number of values returned by filtering them out with something like:

IF FIRST()==0 THEN SCRIPT_REAL('insert R script here') END

If we need to then perform additional aggregations on that data, we can do so with table calculations with the appropriate Compute Usings that take into account the increased level of detail in the view.
How can I use the R packages zoo or xts with very large data sets? (100GB)
I know there are some packages such as bigrf, ff, and bigmemory that can deal with this problem, but you have to use their limited set of commands; they don't have the functions of zoo or xts, and I don't know how to make zoo or xts use them.
How can I use it?
I've seen that there are also some other options, related to databases, such as sqldf and hadoopstreaming, RHadoop, or others used by Revolution R. What do you advise? Any other suggestions?
I just want to aggregate series, cleanse the data, and perform some cointegration tests and plots.
I would prefer not to have to code and implement new functions for every command I need, working with small pieces of data every time.
Added: I'm on Windows
I have had a similar problem (albeit I was only playing with 9-10 GB). My experience is that there is no way R can handle this much data on its own, especially since your dataset appears to contain time-series data.
If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see the Matrix package (http://cran.r-project.org/web/packages/Matrix/index.html); this manual may also come in handy (http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/).
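A minimal sketch of what that looks like (sizes are arbitrary):

library(Matrix)

# Store only the non-zero entries of a 1000 x 1000 matrix
m <- sparseMatrix(i = c(1, 3, 5), j = c(2, 4, 1), x = c(1.5, 2.0, 3.7),
                  dims = c(1000, 1000))
object.size(m)                      # a few KB
object.size(matrix(0, 1000, 1000))  # ~8 MB for the dense equivalent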
I used PostgreSQL - the relevant R package is RPostgreSQL (http://cran.r-project.org/web/packages/RPostgreSQL/index.html). It allows you to query your PostgreSQL database using SQL syntax. Data is downloaded into R as a data frame. It may be slow (depending on the complexity of your query), but it is robust and handy for data aggregation.
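As an illustration, a sketch of that workflow, aggregating in PostgreSQL and building an xts series from the (much smaller) result (connection details, table, and column names are placeholders):

library(RPostgreSQL)
library(xts)

con <- dbConnect(PostgreSQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")

# Let the database do the aggregation
daily <- dbGetQuery(con, "
  SELECT date_trunc('day', ts) AS day, AVG(price) AS avg_price
  FROM ticks
  GROUP BY 1
  ORDER BY 1
")
dbDisconnect(con)

# Build an xts series from the aggregated result
series <- xts(daily$avg_price, order.by = as.Date(daily$day))
plot(series)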
Drawback: you would need to upload the data into the database first. Your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Yet uploading "well-behaved" data into the DB is easy (see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table?)
I would recommend using PostgreSQL or another relational database for your task. I did not try Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL.