Balance reads in a MongoDB replica set with rmongodb

I have a MongoDB replica set with one master and one slave. I am using rmongodb and I want to explicitly send a query to each machine using a parallelized for loop.
I successfully created a connection with all the hosts:
mongo <- mongo.create(host = c("mastermng01:27001", "slavemng01:27001"),
                      name = "myRS",
                      username = "user",
                      password = "pass",
                      db = "myDB")
ns_actual <- "myDB.MyCollection"
Then, I run a query like this:
cursor <- mongo.find(mongo, ns = ns_actual, query = list(var1 = "value"),
                     options = mongo.find.slave.ok)
So far, R knows about the slave host and is allowed to query it. But when will it actually do so? Can I force R to balance the queries among the hosts?

Sorry, no solution so far. The underlying C connector does not support this functionality. There is a new mongo C library available which does support it, but moving rmongodb to that library will take a lot of time, which is currently not available.
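Until that happens, one possible workaround is to bypass the driver's routing entirely: open a dedicated connection per host and distribute the queries yourself, e.g. with parallel::mclapply. The sketch below reuses the host names, credentials, and collection from the question; whether your deployment accepts direct single-host connections (and whether this balances the load the way you want) is an assumption, so treat it as untested.

library(rmongodb)
library(parallel)

hosts <- c("mastermng01:27001", "slavemng01:27001")

# Query one specific host through its own direct connection.
query_host <- function(host) {
  con <- mongo.create(host = host, username = "user",
                      password = "pass", db = "myDB")
  on.exit(mongo.destroy(con), add = TRUE)
  cursor <- mongo.find(con, ns = "myDB.MyCollection",
                       query = list(var1 = "value"),
                       options = mongo.find.slave.ok)
  results <- list()
  while (mongo.cursor.next(cursor)) {
    results[[length(results) + 1]] <-
      mongo.bson.to.list(mongo.cursor.value(cursor))
  }
  mongo.cursor.destroy(cursor)
  results
}

# One query per host, run in parallel (mc.cores > 1 requires a Unix-alike).
res <- mclapply(hosts, query_host, mc.cores = 2)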


Kafka Connector for Oracle Database Source

I want to build a Kafka connector to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 and the tables have millions of records. First of all, I would like to put the minimum possible load on my database, ideally using CDC. Secondly, I would like to retrieve records based on a LastUpdate field whose value is after a certain date.
Searching the Confluent site, the only open-source connector I found was "Kafka Connect JDBC". I think this connector has no CDC mechanism, and it isn't feasible to retrieve millions of records when the connector starts for the first time. The alternative solution I considered is Debezium, but there is no Debezium Oracle connector on the Confluent site and I believe it is still in beta.
Which solution would you suggest? Is anything wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector.
For log-based CDC I am aware of a couple of options; however, some of them require a license:
1) Attunity Replicate, which allows users to use a graphical interface to create real-time data pipelines from producer systems into Apache Kafka without having to do any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and have been very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require any license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector for capturing all row-based DML changes from an Oracle database and streaming these changes to Kafka. Its change data capture logic is based on the Oracle LogMiner solution.
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as Z mainframe, iSeries, SQL Server, etc.) into Kafka.
Regardless of which source is used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is:
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at rates in the hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, despite the data having been written to numerous partitions and topics, the original total order can be known. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/

Is it possible to combine two different database utilities, like Teradata's FastExport and Greenplum's gpload?

Usually, I use a JDBC connection with some ETL tool to move data from one database (e.g. Teradata) to another (e.g. Greenplum).
However, both of these databases come with built-in utilities which can load/export huge amounts of data very fast, far faster than JDBC. The downside, as far as I am aware, is that they can only do so to/from a file.
So, if I want to use them, I have to follow a process like:
Teradata ---(FastExport)---> File ---(gpload)---> Greenplum
I am wondering if it is possible to skip the file part and combine the two utilities:
Teradata ---(FastExport & gpload)---> Greenplum
That way I could transfer huge amounts of data very quickly!
Yes, you most certainly can. Greenplum supports all kinds of external tables. One solution is to use an external table that executes a command; that command can be a Java program that connects to Teradata with the FastExport option to get the data.
I wrote the tool "gplink" to do just this. It automates the creation of Greenplum External Tables for JDBC sources.
Github:
https://github.com/pivotalguru/gplink
Teradata connection example:
https://github.com/pivotalguru/gplink/blob/master/connections/teradata.properties
And my blog:
http://www.pivotalguru.com/?page_id=982

SQL on large datasets from several Access databases in R

I'm working on a process improvement that will use SQL in R to work with large datasets. Currently the source data is stored in several different MS Access databases. My initial approach was to use RODBC to read all of the source data into R and then use sqldf() to summarize the data as needed. However, I'm running out of RAM before I can even begin to use sqldf().
Is there a more efficient way for me to complete this task using R? I've been looking for a way to run a SQL query that joins the separate databases before reading them into R, but so far I haven't found any packages that support this functionality.
Should your data be in a database, dplyr (part of the tidyverse) would be the tool you are looking for.
You can use it to connect to a local or remote database, push your joins, filters, and aggregations there, and collect() the result as a data frame. The process is neatly summarized at http://db.rstudio.com/dplyr/
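A minimal sketch of that workflow, assuming the odbc and dbplyr packages are installed and using a hypothetical DSN and hypothetical column names (how well dbplyr's generated SQL suits Access specifically is untested here):

library(DBI)
library(dplyr)   # dbplyr must also be installed for database-backed tbl()

con <- dbConnect(odbc::odbc(), dsn = "my_access_dsn")  # hypothetical DSN

result <- tbl(con, "Name_of_table_in_my_database") %>%
  filter(year == 2017) %>%              # pushed down to the database
  group_by(region) %>%
  summarise(total = sum(amount)) %>%
  collect()                             # only the summary comes into R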
What I am not quite certain of (this is not an R issue but rather an MS Access one) is the means for accessing data across multiple MS Access databases.
You may need to write custom SQL code for that and pass it to one of the databases via DBI::dbGetQuery(), letting MS Access handle the cross-database link.
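One way to do that, sketched below, relies on the Jet/ACE trick of referencing a table in another database file directly in the FROM clause; the driver name, file paths, and table names are hypothetical, so verify the syntax against your Access version:

library(DBI)

con <- dbConnect(odbc::odbc(),
                 .connection_string = paste0(
                   "Driver={Microsoft Access Driver (*.mdb, *.accdb)};",
                   "Dbq=C:\\data\\main.accdb;"))  # hypothetical path

# The [;DATABASE=...] prefix lets Access pull table_b from a second file,
# so the join runs inside Access and only the result reaches R.
sql <- "
SELECT a.id, a.value, b.category
FROM table_a AS a
INNER JOIN [;DATABASE=C:\\data\\other.accdb].table_b AS b
  ON a.id = b.id
"
joined <- dbGetQuery(con, sql)
dbDisconnect(con)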
The link you posted looks promising. If it doesn't yield the intended results, consider linking one Access DB to all the others. Links take almost no memory. Union the links and fetch the data from there.
# Load the RODBC package
library(RODBC)
# Connect to the Access database
channel <- odbcConnectAccess("C:/Documents/Name_Of_My_Access_Database")
# Get data
data <- sqlQuery(channel, "SELECT * FROM Name_of_table_in_my_database")
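Following the linking suggestion above, the union step might look like this (the linked table names are hypothetical, and it assumes the tables from the other databases have already been linked into this one Access file):

# The UNION ALL runs inside Access over the linked tables, so only the
# combined result set is pulled into R.
combined <- sqlQuery(channel,
                     "SELECT * FROM tbl_part1
                      UNION ALL
                      SELECT * FROM tbl_part2")

odbcClose(channel)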
These URLs may help as well.
https://www.r-bloggers.com/getting-access-data-into-r/
How to connect R with Access database in 64-bit Window?

How can I use multiple databases in a SiteScope monitor?

We have two databases: an Oracle 11g DB and a DB2 database. I need to run a query against the Oracle database to get a list of accounts and feed that as parameters into another DB2 query. If the DB2 query returns any results, I need to send an alert. Is this in any way possible with SiteScope (I am fairly new to SiteScope, so be gentle)? It looks like there is only room for one connection string in the SiteScope monitors. Can I create two monitors (one for DB2 and one for Oracle) and use the results of one query as a parameter in the other monitor? It looks like there are some monitor-to-monitor capabilities, but I'm still trying to understand what is possible. Thanks!
It is possible to extract the results of the first query using a custom script alert, but it is not possible to then reuse the data in another monitor.
The SiteScope (SiS) Database Query monitor has no feature to include dynamically changing data in the query it uses. Generally speaking, monitor configurations are static and can only be updated by an external agent (a user or an integration).
Thinking inside the box of one vendor, using HP Operations Orchestration (OO) would be an option to achieve your goal. You could either use OO to run the checks and send you alerts in case of a problem, or run the checks and dump the result to a file somewhere, which can then be read by SiS using a Script monitor or a Logfile monitor.

Spark newbie (ODBC/SparkSQL)

I have a Spark cluster set up and have tried both native Scala and Spark SQL on my dataset, and the setup seems to work for the most part. I have the following questions.
What should I expect from ODBC/external connectivity to the cluster?
- Does the admin/developer shape the data and persist/cache a few RDDs that will be exposed (thinking along the lines of Hive tables)?
- What would be the equivalent of connecting to a "Hive metastore" in Spark/Spark SQL?
Is thinking along the lines of Hive misguided?
My other questions:
- When I issue Hive queries (say, creating tables and such), they use the same Hive metastore as Hadoop/Hive.
- Where do the tables get created when I issue SQL queries using sqlContext?
- If I persist a table, is that the same concept as persisting an RDD?
Appreciate your answers.
Nithya
(This is written with Spark 1.1 in mind; be aware that new features tend to be added quickly, and some of the limitations mentioned below might very well disappear at some point in the future.)
You can use Spark SQL with Hive syntax and connect to a Hive metastore, which will result in your Spark SQL Hive commands being executed on the same data space as if they were executed through Hive directly.
To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects with an associated schema. You can use it just like any other RDD, including cache and persist, and the effect is the same (the fact that the data comes from Hive has no influence here).
If your Hive command creates data, e.g. "CREATE TABLE ...", the corresponding table gets created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
Executing Hive SQL through Spark provides you with all the caching benefits of Spark: executing a second SQL query on the same data set within the same Spark context will typically be much faster than the first query.
Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially an equivalent of HiveServer2 and thus allows you to execute Spark SQL commands over a JDBC connection.
Note that not all Hive features are available (yet?); see details here.
Finally, you can also discard the Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available this way is smaller than when using the Hive syntax.
