Compare XML of two tables [PENTAHO] - bigdata

Do you have any idea how to do something like this in Pentaho?
I have two tables: the first is a source table in MSSQL and the second is a target table in DB2. The first table has a column of type XML, and we feed this data into the second table, which also has an XML column. I would like to compare in Pentaho whether the XML value in the second table matches what is in the first table.

You can use a combination of the "Multiway merge join" and "Switch/Case" steps to get the non-matching IDs from the source data. I have prepared a solution for you; the steps are below.
Get both table inputs, from MSSQL and DB2 (see the query sketch after this list)
Merge both tables' data with the condition source.XML = destination.XML using a FULL JOIN
Get the source IDs only where the ID is not present in the destination table, using Switch/Case
Select the IDs and write them to a text file
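As a minimal sketch of the two Table input queries (source_table, target_table, id and xml_col are assumed names, not from the original post), the XML columns can be serialized to plain text so the downstream steps can compare them:

-- MSSQL Table input: cast the XML column to text for comparison
SELECT id, CAST(xml_col AS NVARCHAR(MAX)) AS xml_text
FROM source_table;

-- DB2 Table input: serialize the XML column the same way
SELECT id, XMLSERIALIZE(CONTENT xml_col AS VARCHAR(32000)) AS xml_text
FROM target_table;

Both streams must be sorted on the join key (a "Sort rows" step works) before they enter the "Multiway merge join" step.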

Related

I see row count difference in Teradata tables between export result and query count

Recently I moved data from one Teradata test table to BigQuery, and I see a row count difference between TD and BQ. On checking further, I found that one of the row values is in DATE format instead of a string, because that column is the PI column and its data type is VARCHAR. This row is returned in BQ when I run a select, but not in TD, although I do see the row when I export the data to Excel. I'm really not sure what could be the reason it doesn't show up in a select statement. Please help me understand the reason, and also let me know how I can search for such problematic data when the table is too big. Thanks.
eg: create multiset table Test (a int, b varchar(20), c varchar(20), d timestamp(6)) primary index (b); /* lengths added; varchar requires a length in Teradata */
Sample data in that table was shown in a screenshot (not reproduced here).
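As a hedged sketch of how to hunt for such rows (assuming Teradata 16.0 or later, where TRYCAST is available), rows whose VARCHAR value parses as a date can be isolated directly:

-- Find rows whose VARCHAR PI column b parses as a DATE
SELECT a, b
FROM Test
WHERE TRYCAST(b AS DATE) IS NOT NULL;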

merge secondary database into main one avoiding duplicate

I have two databases with the same structure. The first is the main one, while the second gets updated periodically (in reality I have multiple "secondary" databases that I want to merge one by one into the main one).
The structure of the main and the secondary databases is identical.
I want to periodically dump all new values from the secondary database into the main one. However, the second time I do it, I want to exclude rows that were already copied the first time (and so on).
The tables in all these databases have:
an ID column set as PRIMARY KEY going from 1 to N for each database (I suspect this was a mistake, but at the moment I can't change this)
a DATE column, representing a posix timestamp (float)
some other columns
My code looks like this:
ATTACH DATABASE 'secondary.db' AS temp_db;
DROP TABLE IF EXISTS my_table_temp;
CREATE TABLE my_table_temp AS SELECT * FROM my_table;
INSERT INTO main.my_table_temp SELECT * FROM temp_db.my_table;
DELETE FROM my_table;
INSERT INTO main.my_table SELECT DISTINCT * FROM main.my_table_temp ORDER BY date;
DROP TABLE my_table_temp;
The problem is that, I suspect due to the repeated ID column, the DISTINCT clause gives me:
UNIQUE constraint failed: my_table.id
However, I don't care at all about the ID field, which could also be dropped or reset.
NOTES:
the secondary databases are constantly updated by code that, at the moment, I can't change
I initialize the "main" database by copy-pasting one of the secondary ones, to avoid regenerating the whole structure from scratch. Maybe there is a better way of doing this
Apologies if this is a naive question, but I'm very new to SQLite.
Thanks
Following the advice from @forpas, I solved this with the following code:
Assuming the columns are id, date, col1 and col2:
ATTACH DATABASE 'secondary.db' AS temp_db;
DROP TABLE IF EXISTS my_table_temp;
CREATE TABLE my_table_temp AS SELECT date, col1, col2 FROM my_table;
INSERT INTO main.my_table_temp SELECT date, col1, col2 FROM temp_db.my_table;
DROP TABLE my_table; /* my_table must be recreated, since a column was removed */
CREATE TABLE main.my_table AS SELECT DISTINCT date, col1, col2 FROM main.my_table_temp ORDER BY date;
DROP TABLE my_table_temp;
I also automated the extraction of the column names with:
SELECT name FROM PRAGMA_TABLE_INFO('my_table');
The resulting list is passed to the Python code running the script, and the column id is removed from it. Note that the second (and following) times I run this code, the column id won't be present in my_table to start with. However, this approach lets the code stay the same in both cases, whether the column id is there or not.
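As a small variation (a sketch, assuming SQLite 3.16 or later, where PRAGMA table-valued functions are available), the comma-separated column list can even be built entirely in SQL, with the id exclusion hard-coded:

-- Build the column list for my_table, excluding id
SELECT group_concat(name, ',')
FROM pragma_table_info('my_table')
WHERE name <> 'id';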
This procedure is then iterated over each table name to fully merge the two databases.

sqlite3 - the philosophy behind sqlite design for this scenario

Suppose we have a file with just one table named TableA, and this table has just one column named Text.
Let's say we populate TableA with 3,000,000 strings like these (each line a record):
Many of our patients are incontinent.
Many of our patients are severely disturbed.
Many of our patients need help with dressing.
If I save the file at this point, it is ~326 MB.
Now let's say we want to increase the speed of our queries, so we set our Text column as the primary key (or create an index on it).
If I save the file at this point, it is ~700 MB.
our query:
SELECT Text FROM "TableA" where Text like '% home %'
for the table without index: ~5.545s
for the indexed table: ~2.231s
As far as I know, when we create an index on a column or make a column the primary key, the SQLite engine doesn't need to refer to the table itself (if no other column is requested in the query); it uses the index for the query, and hence query execution gets faster.
My question is: in the scenario above, where we have just one column and set that column as the primary key, why does SQLite hold what looks like unnecessary data (in this case ~326 MB)? Why not keep just the index/primary-key data?
In SQLite, table rows are stored in the order of the internal rowid column.
Therefore, indexes must be stored separately.
In SQLite 3.8.2 or later, you can create a WITHOUT ROWID table which is stored in order of its primary key values.
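For the scenario above, a minimal sketch of such a table; the strings are then stored only once, in the primary-key B-tree:

-- SQLite 3.8.2+: the table itself is stored as an index on Text
CREATE TABLE TableA (
    Text TEXT PRIMARY KEY
) WITHOUT ROWID;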

SQL Server Stored Procedure, selecting rows from multiple tables

I have a stored procedure that pulls data from one table. Now the same data has been collected in a different table, and as the IDs are different I cannot put all the data from the second table into the first one.
So now I have to have two select statements in one stored procedure.
Although the corresponding data is the same, the column names in the two tables are different.
For instance, BRIEF_TITLE would be briefTitle in the second table.
How can I merge the data from two different tables into one?
The result is bound to an ASP.NET grid view control.
As the comments above state, you need something like this:
select BRIEF_TITLE from t1
union all
select briefTitle from t2
This will give you a column name of BRIEF_TITLE, but if you want something else, then add an alias to the first select:
select BRIEF_TITLE as ShortTitle from t1
union all
select briefTitle from t2

SQLite - select every row from all tables where a column name exists

I need to extract all rows from every table that has the column imgAssetURL, to add to a preloading system.
I think, in essence, something like:
SELECT imgAssetURL FROM *
What are my options?
The definitions for all tables are located in the sqlite_master table. You would have to read those definitions, figure out which tables have that column, and run a query on each of those.
See http://www.sqlite.org/fileformat2.html#sqlite_master
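A sketch of that approach, assuming SQLite 3.16 or later (where PRAGMA functions can be queried like tables): list the tables that contain the column, then run a per-table query on each name returned.

-- List every table that has a column named imgAssetURL
SELECT m.name
FROM sqlite_master AS m, pragma_table_info(m.name) AS p
WHERE m.type = 'table'
  AND p.name = 'imgAssetURL';

Each name returned then gets its own SELECT imgAssetURL FROM ... query, assembled and executed by the host application.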
