I have GoldenGate on Oracle 12c. Is it possible to get a subset of data (like a WHERE condition) from the source to replicate? - oracle-golden-gate

I have GoldenGate on Oracle 12c. Is it possible to get a subset of data (like a WHERE condition) from the source to replicate into the target database?
Second question: Is it possible to replicate data using GoldenGate into two different databases, and if so, how? That is, from one source to two target schemas.

Yes, it is possible to get a subset of the data. Use the FILTER or WHERE clause in the Extract or Replicat parameter file.
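For example, a minimal sketch of what the filtering could look like in a parameter file (the schema, table, and column names here are hypothetical):
-- In the Extract parameter file: capture only the rows that match the condition
TABLE hr.orders, WHERE (region_id = 10);
-- In the Replicat parameter file: the same kind of condition can be applied on the apply side
MAP hr.orders, TARGET hr.orders, FILTER (amount > 1000);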
Yes, it is possible to replicate to two targets. There are two possibilities:
You can use one Extract process and two Replicat processes (both reading the same trail), or
You can create two Extract processes writing data to two trail files, with two Replicat processes.
Either way, each Replicat process writes data to a separate target database.
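A rough sketch of the first option (one Extract feeding two Replicats that read the same trail); the group names, trail name, and schema names here are hypothetical, and the connection/credential parameters are omitted:
-- Extract parameter file: one capture process writing a single trail
EXTRACT ext1
EXTTRAIL ./dirdat/aa
TABLE hr.*;
-- Replicat 1 parameter file: applies the trail to the first target database
REPLICAT rep1
MAP hr.*, TARGET hr_a.*;
-- Replicat 2 parameter file: applies the same trail to the second target database
REPLICAT rep2
MAP hr.*, TARGET hr_b.*;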

Related

DolphinDB Python API: Issues with partitionColName when writing data to dfs table with compo domain

It's my understanding that the PartitionedTableAppender method of the DolphinDB Python API can implement concurrent data writes. I'm trying to write data to a DFS table with a compo domain, where the partitions are determined by the values of "datetime" and "symbol". The data I'd like to write consists of records for 150 symbols on one day. This is what I tried:
But it seems only one partitioning column can be specified in partitionColName. Please let me know if I'm going about this the wrong way.
Just specify one partitioning column in this case, even though the table uses a compo domain. Based on the given information, it is suggested to set partitionColName to "symbol", so that the writes can run concurrently. The script still works if you set it to "datetime", but the data cannot be written concurrently, because it only contains one day's records and therefore only one partition is involved.
Refer to the basic operating rules when you are using PartitionedTableAppender:
DolphinDB does not allow multiple writers to write data to one partition at the same time. Therefore, when multiple threads are writing to the same table concurrently, it is recommended to make sure each of them writes to a different partition. The Python API provides a convenient way to do this by dividing the data by partition automatically.
With DolphinDB server version 1.30 or above, we can write to DFS tables with the PartitionedTableAppender object in the Python API. The user needs to first specify a connection pool. The system obtains the partitioning information before assigning the data to the connection pool for concurrent writing. A partition can only be written to by one connection pool at one time.
Therefore, only one partitioning column needs to be specified for a table with a compo domain. Just pick a highly differentiated partitioning column so that many partitions are created and distributed across the connections in the pool; that way the data can be written to the DFS table concurrently.
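A minimal sketch of the suggested approach, assuming a hypothetical compo-domain database at "dfs://compoDB" with a table named "pt", a local server on port 8848, and a pandas DataFrame data that has "datetime" and "symbol" columns:
import dolphindb as ddb
# Connection pool: multiple connections allow different partitions to be written in parallel
# (host, port, and credentials below are placeholders)
pool = ddb.DBConnectionPool("localhost", 8848, 20, "admin", "123456")
# Split the incoming DataFrame by "symbol" (the highly differentiated column),
# so that each connection in the pool writes to a different partition
appender = ddb.PartitionedTableAppender("dfs://compoDB", "pt", "symbol", pool)
rows_written = appender.append(data)
print(rows_written)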

How do I connect kqlmagic to more than one Log Analytics workspace at the same time?

In my Jupyter notebook, I want to run the same KQL query against different Sentinel workspaces and compare the results as data frames. Is there an easy way to have multiple workspace connections at the same time or would I need to reconnect and query each workspace individually every time I change my KQL query?
You have a few options to achieve this.
As suggested in the other answer, use a cross-workspace query, which results in a single table that includes records from all the workspaces specified; you can then split it into multiple data frames.
Create multiple connections and query each one by one. You can have multiple queries in one %%kql cell (separate each query with an empty line and assign the result of each query to a different Python variable).
Write Python code that iterates over the workspaces and uses %kql (line magic); see the sketch below.
Write Python code that iterates over the workspaces and invokes Kqlmagic with the IPython magic API.
Write Python code that iterates over the workspaces and uses the Kqlmagic module.
(I am the author of Kqlmagic.)
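For the "iterate over the workspaces" options, here is a minimal sketch of the idea; the workspace IDs are placeholders and the exact loganalytics:// connection-string format depends on your authentication method, so check the Kqlmagic documentation for the details:
from IPython import get_ipython

ipython = get_ipython()
results = {}
# Hypothetical workspace IDs; connection-string details are an assumption and depend on your auth setup
workspaces = {
    "wsA": "loganalytics://code;workspace='<workspace-A-id>';alias='wsA'",
    "wsB": "loganalytics://code;workspace='<workspace-B-id>';alias='wsB'",
}
query = "SecurityEvent | take 10"
for alias, conn in workspaces.items():
    ipython.run_line_magic("kql", conn)    # establish/switch to this workspace connection
    ipython.run_line_magic("kql", query)   # run the query against the current connection
    # Kqlmagic stores the last result in _kql_raw_result_; convert it to a pandas DataFrame
    results[alias] = ipython.user_ns["_kql_raw_result_"].to_dataframe()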
See if cross-workspace queries satisfy your requirements; there is a bit more documentation on them as well. Cross-workspace queries are for exactly what you describe: you use a union operator to link the workspaces, similar to how you would link two tables using union.
Snippet from the article:
workspace('<workspace-A>').SecurityEvent
| union workspace('<workspace-B>').SecurityEvent
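Once the combined result is in a pandas DataFrame (call it df), you can split it back into one frame per workspace; a short sketch, assuming the TenantId column identifies the source workspace (as it does for Log Analytics tables such as SecurityEvent):
# One DataFrame per source workspace, keyed by the workspace (tenant) ID
frames = {tenant_id: group for tenant_id, group in df.groupby("TenantId")}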

Parquet: concatenate or split two schemas

I have two CSV files. The first one has first_name and last_name; the second has email and phone. The two files are linked by row index (they have the same number of records). I need to save all the data in Parquet format.
First option: join the two schemas into one and save everything in a single Parquet file.
Second option: save the two schemas separately (as two Parquet files).
Given my use case, I will most likely take the second option (two files). In the end I need to query the data using various tools, most often Presto.
Question 1: Is it possible to pull data from two Parquet files (say, select first_name, email)?
Question 2: Will there be a difference in run times?
I have run some tests, but cannot come to an accurate conclusion...
You can pull data from those two files, but you need some join key in order to combine the records. If there isn't one, you might have to use row_number(), assuming the data are in the same order in both tables. Data size also matters here.
In the big data world, a denormalized format is the recommendation if you have to join those tables very frequently in your queries. This approach will give you better performance.
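If you do end up taking the first (denormalized) option, here is a minimal Python sketch of joining the two CSVs by row index and writing a single Parquet file; the file names are hypothetical, and it requires pandas with pyarrow or fastparquet installed:
import pandas as pd

names = pd.read_csv("names.csv")        # first_name, last_name
contacts = pd.read_csv("contacts.csv")  # email, phone
# The files are aligned by row index, so concatenate column-wise instead of joining on a key
people = pd.concat([names.reset_index(drop=True), contacts.reset_index(drop=True)], axis=1)
people.to_parquet("people.parquet", index=False)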

Filtering data while reading from S3 to Spark

We are moving to AWS EMR/S3 and using R for analysis (the sparklyr library). We have 500 GB of sales data in S3 containing records for multiple products. We want to analyze the data for a couple of products and read only a subset of the files into EMR.
So far my understanding is that spark_read_csv will pull in all the data. Is there a way in R/Python/Hive to read data only for the products we are interested in?
In short, CSV is at the wrong end of the efficiency spectrum for this kind of selective read.
Using data that is:
Partitioned by the column of interest (the partitionBy option of the DataFrameWriter, or the correct directory structure), or
Clustered by the column of interest (the bucketBy option of the DataFrameWriter plus a persistent metastore),
can help to narrow down the search to particular partitions in some cases, but if filter(product == p1) is highly selective, then you're likely looking at the wrong tool.
Depending on the requirements:
A proper database, or
A data warehouse on Hadoop
might be a better choice.
You should also consider choosing a better storage format (like Parquet).
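For example, a minimal PySpark sketch of the partitioning approach (the bucket paths and column names are hypothetical; sparklyr's spark_write_* functions expose a similar partition_by option):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
# One-time conversion: read the raw CSV and rewrite it as Parquet, partitioned by product
(spark.read.csv("s3://my-bucket/sales/", header=True, inferSchema=True)
    .write.partitionBy("product")
    .parquet("s3://my-bucket/sales_parquet/"))
# Later reads that filter on the partition column only touch the matching directories
subset = (spark.read.parquet("s3://my-bucket/sales_parquet/")
          .filter(col("product").isin("p1", "p2")))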

How to Combine multiple files in BizTalk?

I have multiple flat files (CSV), each with multiple records, and the files are received in random order. I have to combine their records using unique ID fields.
How can I combine them if there is no common unique field across all the files, and I don't know which one will be received first?
Here are some example files:
In reality there are 16 files, and there are many more fields and records than in this example.
I would avoid trying to do this purely in XSLT, BizTalk orchestrations, or C# code. These are fairly simple flat files: load them into SQL and create a view to join your data up.
You can still use BizTalk to pick up and load the files, and you can still use BizTalk to execute the view or stored procedure that joins the data and sends your final message.
There are a few questions that might help guide how this would work here:
When do you want to join the data together? What triggers that (a time of day, a certain number of messages received, a certain type of message, a particular record, etc)? How will BizTalk know when it's received enough/the right data to join?
What does a canonical version of this data look like? Does all of the data from all of these files truly get correlated into one entity (e.g. a "Trade" or a "Transfer" etc.)?
I'd probably start by defining my canonical entity, and then work towards getting a "complete" picture of that canonical entity, using SQL for this kind of case.
