r - SQL on large datasets from several access databases - r

I'm working on a process improvement that will use SQL in r to work with large datasets. Currently the source data is stored in several different MS Access databases. My initial approach was to use RODBC to read all of the source data into r, and then use sqldf() to summarize the data as needed. I'm running out of RAM before I can even begin use sqldf() though.
Is there a more efficient way for me to complete this task using r? I've been looking for a way to run a SQL query that joins the separate databases before reading them into r, but so far I haven't found any packages that support this functionality.

Should your data be in a database dplyr (a part of the tidyverse) would be the tool you are looking for.
You can use it to connect to a local / remote database, push your joins / filters / whatever there and collect() the result as a data frame. You will find the process neatly summarized on http://db.rstudio.com/dplyr/
What I am not quite certain of - but it is not a R issue but rather an MS Access issue - is the means for accessing data across multiple MS Access databases.
You may need to write custom SQL code for that & pass it to one of the databases via DBI::dbGetQuery() and have MS Access handle the database link.

The link you posted looks promising. If it doesn't yield the intended results, consider linking one Access DB to all the others. Links take almost no memory. Union the links and fetch the data from there.
# Load RODBC package
library(RODBC)
# Connect to Access db
channel <- odbcConnectAccess("C:/Documents/Name_Of_My_Access_Database")
# Get data
data <- sqlQuery(channel , paste ("select * from Name_of_table_in_my_database"))
These URLs may help as well.
https://www.r-bloggers.com/getting-access-data-into-r/
How to connect R with Access database in 64-bit Window?

Related

Efficient method to move data from Oracle (SQL Developer) to MS SQL Server

Daily, I query a few tables in SQL Developer, filtering to prior day activity, adding column to date stamp the data, then export to xlsx. Then I manually import each file to a MS SQL Server via SQL Server Import and Export Wizard. Takes many clicks, much waiting...
I'm essentially creating an archive in SQL Server, the application I'm querying overwrites data daily. I'm not a DBA of either database, I use the archived data to do validations and research.
It's tough to get my org to provide additional software, I've been trying to make this work via SQL Developer, SSMS Express ed, and other standard tools.
I'm looking to make this reasonably automated, either via scripts, scheduled tasks, etc. Appreciate suggestions that would work on my current situation, but if that isn't reasonable, and there's a very reasonable alternative, I can go back to the org to request software/access/assistance.
You can use SSIS to import the data directly from Oracle to SQL Server, unless you need the .xlsx files for another purpose. You can also export from Oracle to these, then load to SQL Server from these files if you do need the files. For the date stamp column, a Derived Column can be added within a Data Flow Task using the SSIS GETDATE() function for a timestamp in order to achieve the same result. This function returns a timestamp, and if only the date is necessary the (DT_DBDATE) function can cast it to a date data type that's compatible with this data type of SQL Server. Once you have the SSIS package configured, you can schedule in to run at regular intervals as a SQL Agent job. I'd also recommend installing the SSIS catalog (SSISDB) and using this the source to run the packages from. The following links shed more light on these areas.
SSIS
Connecting to Oracle from SSIS
Data Flow Task
Derived Column Transformation
Creating SQL Server Agent Jobs for SSIS packages
SSIS Catalog
Another option that you may consider (if it is supported in SQL Express) is using the BCP utility, which can be run from command line.
The BCP utility allows you to bulk copy the data from a delimited text file into a SQL Server table.
If you go this approach, things to consider:
Number of Columns in the source file need to match the number of columns in the destination
Data types must match (or be comparable)
Typically, empty strings will be converted to nulls, so you will need to consider if the columns are nullable.
(to name a few - if you want to delve deeper, you might also need to look at custom delimiters between fields and records. Don't forget, commas and line feeds are still valid characters in char type fields).
Anyhow, maybe it will work for you, maybe not. Sure, you might still have to deal with the exporting of the data from Oracle, but it might ease the pain getting the data in.
Have a read:
https://learn.microsoft.com/en-us/sql/tools/bcp-utility?view=sql-server-2017

Is it possible to combine two different database utility like Teradata's Fast export to Greenplum's GPloader?

Usually, I use a JDBC connection with some ETL tool to move data from one database(i.e Teradata) to another database(i.e Greenplum).
However, both of these databases comes with inbuilt utilities which can load/export huge amounts of data very fast, far faster than JDBC!. But the downside as far as I am aware of is that it can do so only to/from a file.
So, if I want to use them I have to Follow a process like-
Teradata ---(Fast Export)---> File ---(Gploader)---> Greenplum
I am wondering if it is possible to skip the File part and Combine the two utilities.
Teradata ---(FastExport & Gploader)--> Greenplum.
That way I can transfer huge amounts of data very quickly!
Yes, you most certainly can. Greenplum supports all kinds of external tables. One solution is to use an External Table that executes a command. That command can be a Java program that connects to Teradata to get data and uses the FastExport option.
I wrote the tool "gplink" to do just this. It automates the creation of Greenplum External Tables for JDBC sources.
Github:
https://github.com/pivotalguru/gplink
Teradata connection example:
https://github.com/pivotalguru/gplink/blob/master/connections/teradata.properties
And my blog:
http://www.pivotalguru.com/?page_id=982

How to validate data in Teradata from Oracle

My source data is in Oracle and target data is in Teradata.Can you please provide me the easy and quick way to validate data .There are 900 tables.If possible can you provide syntax too
There is a product available known as the Teradata Gateway that works with Oracle and allows you to access Teradata in a "heterogeneous" manner. This may not be the most effective way to compare the data.
Ultimately what your requirements sound more process driven and to be done effectively would require the source data to be compared/validated as stage tables on the Teradata environment after your ETL/ELT process has completed.

Spark newbie (ODBC/SparkSQL)

I have a spark cluster setup and tried both native scala and spark sql on my dataset and the setup seems to work for the most part. I have the following questions
From an ODBC/extenal connectivity to the cluster, what should i expect?
- the admin/developer shapes the data and persists/caches a few RDDs that will be exposed? (Thinking on the lines of hive tables)
- What would be the equivalent of connecting to a "Hive metastore" in spark/spark sql?
Is thinking along the lines of hive faulted?
My other question was
- when i issue hive queries, (and say create tables and such), it uses the same hive metastore as hadoop/hive
- Where do the tables get created when i issue sql queries using sqlcontext?
- If i persist the table, it is the same concept as persisting an RDD?
Appreciate your answers
Nithya
(this is written with spark 1.1 in mind, be aware that new features tend to be added quickly, some limitations mentioned below might very well disappear at some point in the future).
You can use Spark SQL with Hive syntax and connect to Hive metastore, which will result in your Spark SQL hive commands to be executed on the same data space as if they were executed through Hive directly.
To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects that has an associated schema. You can use it just like you use any RDD, including cache and persist and the effect is the same (the fact that the data comes from hive has not influence here).
If your hive command is creating data, e.g. "CREATE TABLE ... ", the corresponding table gets created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
Executing Hive SQL through Spark provides you with all the caching benefits of Spark: executing a 2nd SQL query on the same data set within the same spark context will typically be much faster than the first query.
Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially an equivalent to HiveServer2 and thus allows you to execute SparkQL commands through a JDBC connection.
Note that not all Hive features are available (yet?), see details here.
Finally, you can also discard Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available like this is smaller than when using the Hive syntax.

Balance reads in a MongoDB replica set with rmongodb

I have a MongoDB as a replica set with one master and one slave. I am using RmongoDB and I want to explicitly send a query to each machine using a parallelized for loop.
I succesfully created a conection with all the hosts:
mongo <- mongo.create(host=c("mastermng01:27001","slavemng01:27001"),
name="myRS",
username="user",
password="pass",
db="myDB")
ns_actual <- "myDB.MyCollection"
Then, I run a query like this:
cursor <- mongo.find(mongo,ns=ns_actual,query=list(var1="value"),
options=mongo.find.slave.ok)
So far, R knows the slave hosts and it is allowed to query them. But when it is going to do it? Can I force R to balance the queries among the hosts?
Sorry, no solution so far. The underlying C connector is not supporting this functionality. There is a new mongoC library available which supports this. But moving rmongodb to this library will take a lot of time which is currently not available.

Resources