Rolling joins with database backends

Version 1.1.0 of dplyr gained features for expressing complex joins; for example, one can now express rolling joins. I believe that at the moment (as of dbplyr 2.3.0) these constructs are not translated into SQL. I have two questions:
1) I assume the plan is to provide backend translations for most of these new constructs. Is this correct?
2) If so, what would the likely translation be for the MS SQL Server backend? For example, what are possible T-SQL translations of join_by(company == id, closest(year >= since))?
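To make the shape of such a translation concrete, here is a minimal sketch of a rolling join written as plain SQL and run through Python's built-in sqlite3 module. It is not what dbplyr generates and it is SQLite rather than T-SQL; the tables sales and owners and all of their rows are hypothetical. For each left row it keeps the right row with the same id and the largest since that is still <= year, which is one reading of closest(year >= since).

```python
import sqlite3

# In-memory database with two hypothetical tables (names and data are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales  (company TEXT, year INTEGER);
CREATE TABLE owners (id TEXT, since INTEGER, owner TEXT);
INSERT INTO sales  VALUES ('A', 2005), ('A', 2012), ('B', 1999);
INSERT INTO owners VALUES ('A', 2000, 'x'), ('A', 2010, 'y'), ('B', 2001, 'z');
""")

# Rolling left join: match on company == id, then keep only the owners row
# whose 'since' is the latest one not after 'year' (correlated subquery).
ROLLING_JOIN = """
SELECT s.company, s.year, o.since, o.owner
FROM sales AS s
LEFT JOIN owners AS o
  ON o.id = s.company
 AND o.since = (
       SELECT MAX(o2.since)
       FROM owners AS o2
       WHERE o2.id = s.company AND o2.since <= s.year
     )
ORDER BY s.company, s.year
"""

rows = conn.execute(ROLLING_JOIN).fetchall()
for row in rows:
    print(row)
```

The row for company B has no owner on or before 1999, so the left join returns it with empty since and owner values. On SQL Server the same idea is often written with OUTER APPLY and TOP 1 ... ORDER BY since DESC instead of the correlated MAX subquery; whether a backend translation would take that form is an open question here.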

Related

Is there a way to run GQL queries without defining entity model in python

I have a Java, spring-data app that uses Datastore. I need a subset of this data to run analytics in a Python app. What I need in the Python app is essentially a join (yes, I can't shake the relational mindset) between two "Kinds", queried by the key of one kind.
The NDB client requires recreating the same entity models in Python before the data can be queried, which is a drag. Why can't I simply run the console version of GQL (select * from kind) from Python? Maybe I am missing something, as this sort of querying is available in almost all relational and NoSQL databases.

Your observations are correct: a GQL query cannot perform a SQL-like "join" query. This is documented on the GQL Reference page of the Python NDB/DB documentation.
If you would like this to be implemented, you can open a feature request in the Public Issue Tracker.

Kafka Connector for Oracle Database Source

I want to build a Kafka connector to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0, and the tables have millions of records. First, I would like to put minimal load on the database by using CDC. Second, I would like to retrieve only records whose LastUpdate field is later than a certain date.
Searching Confluent's site, the only open source connector I found was “Kafka Connect JDBC”. I think this connector has no CDC mechanism, and that it cannot retrieve millions of records when the connector starts for the first time. The alternative I thought of is Debezium, but there is no Debezium Oracle connector on Confluent's site, and I believe it is still in beta.
Which solution would you suggest? Is anything wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?

For query-based CDC, which is less efficient, you can use the JDBC source connector.
For log-based CDC I am aware of a few options; however, some of them require a license:
1) Attunity Replicate, which lets users create real-time data pipelines from producer systems into Apache Kafka through a graphical interface, without any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and was very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require a license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector that captures all row-based DML changes from an Oracle database and streams them to Kafka. Its change data capture logic is based on Oracle LogMiner.
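To illustrate what query-based CDC on a LastUpdate-style column amounts to, here is a minimal polling sketch. It is not how the JDBC source connector is implemented: it uses Python's built-in sqlite3 as a stand-in for Oracle, and the table records, its columns, and the helper fetch_changes are all hypothetical.

```python
import sqlite3

def fetch_changes(conn, last_seen):
    """Return rows updated after last_seen, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, payload, last_update FROM records "
        "WHERE last_update > ? ORDER BY last_update",
        (last_seen,),
    ).fetchall()
    # Advance the mark to the newest timestamp seen; keep it if nothing changed.
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# Hypothetical source table; ISO-8601 strings sort chronologically.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, "a", "2020-01-01T00:00:00"), (2, "b", "2020-01-02T00:00:00")],
)

# First poll: only rows changed after the starting date are returned.
first_batch, mark = fetch_changes(conn, "2020-01-01T00:00:00")

# A later update is picked up by the next poll because its timestamp moved.
conn.execute(
    "UPDATE records SET payload = 'a2', last_update = '2020-01-03T00:00:00' WHERE id = 1"
)
second_batch, mark = fetch_changes(conn, mark)
```

Every poll is a query against the source tables, and deletes or intermediate updates between two polls are not seen, which is part of why this approach is considered less efficient than reading the database log.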

We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as Z mainframe, I-series, SQL Server, etc.) into Kafka.
Regardless of the source used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, although the data has been written to numerous partitions and topics, the original total order can be recovered. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/

r - SQL on large datasets from several access databases

I'm working on a process improvement that will use SQL in R to work with large datasets. Currently the source data is stored in several different MS Access databases. My initial approach was to use RODBC to read all of the source data into R, and then use sqldf() to summarize the data as needed. However, I'm running out of RAM before I can even begin to use sqldf().
Is there a more efficient way to complete this task in R? I've been looking for a way to run a SQL query that joins the separate databases before reading them into R, but so far I haven't found any packages that support this.

If your data is in a database, dplyr (a part of the tidyverse) would be the tool you are looking for.
You can use it to connect to a local or remote database, push your joins, filters and so on to the database, and collect() the result as a data frame. You will find the process neatly summarized on http://db.rstudio.com/dplyr/
What I am not quite certain of (and this is an MS Access issue rather than an R issue) is how to access data across multiple MS Access databases.
You may need to write custom SQL for that, pass it to one of the databases via DBI::dbGetQuery(), and have MS Access handle the database link.

The link you posted looks promising. If it doesn't yield the intended results, consider linking one Access DB to all the others. Links take almost no memory. Union the links and fetch the data from there.
# Load the RODBC package
library(RODBC)
# Connect to the Access database (placeholder path)
channel <- odbcConnectAccess("C:/Documents/Name_Of_My_Access_Database")
# Fetch the data (placeholder table name)
data <- sqlQuery(channel, "select * from Name_of_table_in_my_database")
# Close the connection when done
odbcClose(channel)
These URLs may help as well.
https://www.r-bloggers.com/getting-access-data-into-r/
How to connect R with Access database in 64-bit Window?

Alternative to group by for cosmos db

Given that Cosmos DB does not support GROUP BY, what is a good alternative for achieving similar functionality, for example:
Select sum(*) , groupterm from tble group by groupterm
Can I efficiently achieve this in a cosmos stored procedure?

As the Cosmos DB documentation states:
Aggregation capability in SQL limited to COUNT, SUM, MIN, MAX, AVG functions. No support for GROUP BY or other aggregation functionality found in database systems. However, stored procedures can be used to implement in-the-database aggregation capability.
Can I efficiently achieve this in a cosmos stored procedure?
For .NET and Node.js
Larry Maccherone has provided a great package documentdb-lumenize which supports Aggregations (Group-by, Pivot-table, and N-dimensional Cube) and Time Series Transformations as Stored Procedures in DocumentDB.
Additionally, for Python and Scala, you could refer to azure-cosmosdb-spark.
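When the matching documents fit in memory, a simple fallback is to group on the client after the query returns. The sketch below shows only that aggregation step in plain Python; it does not call any Cosmos DB SDK, and the helper group_sum, the field names groupterm and amount, and the sample documents are all hypothetical.

```python
from collections import defaultdict

def group_sum(docs, key, value):
    """Sum docs[value] per distinct docs[key]: a client-side GROUP BY + SUM."""
    totals = defaultdict(float)
    for doc in docs:
        totals[doc[key]] += doc[value]
    return dict(totals)

# Stand-in for documents returned by an ungrouped query.
docs = [
    {"groupterm": "a", "amount": 1.5},
    {"groupterm": "b", "amount": 2.0},
    {"groupterm": "a", "amount": 3.0},
]

totals = group_sum(docs, "groupterm", "amount")
print(totals)
```

This moves every matching document over the network, so for large collections an in-database approach such as the stored procedures mentioned above is likely the better fit.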

GROUP BY is now supported in the Cosmos DB SQL API. You will need SDK version 3.3 or higher. From the documentation:
Azure Cosmos DB currently supports GROUP BY in .NET SDK 3.3 or later. Support for other language SDK's and the Azure Portal is not currently available but is planned.
https://learn.microsoft.com/en-gb/azure/cosmos-db/sql-query-group-by

At last, Azure Cosmos DB supports GROUP BY, currently in .NET SDK 3.3 or later. Support for other language SDKs and the Azure Portal is not yet available but is planned. The clause has the following syntax:
<group_by_clause> ::= GROUP BY <scalar_expression_list>
<scalar_expression_list> ::=
    <scalar_expression>
    | <scalar_expression_list>, <scalar_expression>

Spark newbie (ODBC/SparkSQL)

I have a Spark cluster set up and have tried both native Scala and Spark SQL on my dataset; the setup seems to work for the most part. I have the following questions.
For ODBC/external connectivity to the cluster, what should I expect?
- Does the admin/developer shape the data and persist/cache a few RDDs that will be exposed? (Thinking along the lines of Hive tables.)
- What would be the equivalent of connecting to a "Hive metastore" in Spark/Spark SQL?
Is thinking along the lines of Hive flawed?
My other questions:
- When I issue Hive queries (and, say, create tables and such), they use the same Hive metastore as Hadoop/Hive.
- Where do the tables get created when I issue SQL queries using sqlContext?
- If I persist the table, is it the same concept as persisting an RDD?
Appreciate your answers
Nithya

(This is written with Spark 1.1 in mind. Be aware that new features tend to be added quickly, so some limitations mentioned below might well disappear in the future.)
You can use Spark SQL with Hive syntax and connect to the Hive metastore, which results in your Spark SQL Hive commands being executed on the same data space as if they were executed through Hive directly.
To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects that has an associated schema. You can use it just like any other RDD, including cache and persist, and the effect is the same (the fact that the data comes from Hive has no influence here).
If your hive command is creating data, e.g. "CREATE TABLE ... ", the corresponding table gets created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
Executing Hive SQL through Spark provides you with all the caching benefits of Spark: executing a 2nd SQL query on the same data set within the same spark context will typically be much faster than the first query.
Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially an equivalent of HiveServer2 and thus allows you to execute Spark SQL commands through a JDBC connection.
Note that not all Hive features are available (yet?), see details here.
Finally, you can also discard Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available like this is smaller than when using the Hive syntax.
