tFlowToIterate with Big Data Batch job

I currently have a simple Standard job in Talend doing this:
It reads a file of several lines (tHDFSInput), and for each line of the file (tFlowToIterate) it builds an "INSERT ... SELECT ... FROM" query based on what was read (tHiveRow). It works well, it's just a bit slow.
I now need to convert my Standard job into a Big Data Batch job to make it faster, and also because I've been asked to only build Big Data Batch jobs from now on.
The thing is that there are no tFlowToIterate and tHiveRow components in Big Data Batch...
How can I do this?
Thanks a lot.

Though I haven't tried this solution, I think this can help you.
1. Create the Hive table upfront.
2. Place a tHDFSConfiguration component in the job and provide the cluster details.
3. Use a tFileInputDelimited component. With its storage configuration set to the tHDFSConfiguration defined in step 2, it will read from HDFS.
4. Use a tHiveOutput component and connect tFileInputDelimited to it. In tHiveOutput you can set the table, format and save mode.

To load HDFS data into Hive without modifying it, you may be able to use a single component: tHiveLoad.
Insert your HDFS path inside the component.
tHiveLoad documentation : https://help.talend.com/reader/hCrOzogIwKfuR3mPf~LydA/ILvaWaTQF60ovIN6jpZpzg
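As a rough sketch of what tHiveLoad issues under the hood (the path and table name below are placeholders, not taken from the question):

```shell
# Hypothetical example; /user/talend/input/data.txt and my_table are placeholders.
# LOAD DATA INPATH moves the HDFS file into the table's warehouse directory
# without transforming the data.
hive -e "LOAD DATA INPATH '/user/talend/input/data.txt' INTO TABLE my_table;"
```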


How to run liquibase changelogSyncSQL or changelogSync up to a tag and/or labels?

I'm adding liquibase to an existing project where tables and data already exist. I would like to know how I might limit the scope of changelogSync[SQL] to a subset of available changes.
Background
I've run liquibase generateChangeLog to capture the current state and placed this into say src/main/resources/db/changelog/changes/V2021.04.13.00.00.00__init01.yaml.
I've also added another changeset to cover some new requirements in a new file. Let's call it src/main/resources/db/changelog/changes/V2021.04.13.00.00.00__new-feature.yaml.
I've added a main changelog file src/main/resources/db/changelog/db.changelog-master.yaml with the following contents:
databaseChangeLog:
  - includeAll:
      path: changes
      relativeToChangelogFile: true
I now want to ensure that when I run liquibase changelogSync[SQL] against a particular version of the db, the scope is limited to the first changelog (init01), thereby allowing a liquibase update or updateToTag et al. to continue from that point with the changes following init01.
I'm surprised to see that the changelogSync[SQL] commands don't seem to offer any way (that I can see from the docs) to do this.
Besides printing the SQL and manually changing it, is there something I've missed? Any suggested approaches welcome. Thanks!
What about changelogSyncToTagSQL? Wouldn't it cover your needs?
Or maybe you could try changelogSyncSQL with the additional parameters "labels" and/or "contexts"?
changelogSyncToTagSQL
context
labels
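A hedged sketch of what those two suggestions look like on the command line (the tag and label names are hypothetical, and exact flag spellings vary between Liquibase versions):

```shell
# Sync everything up to an existing tag (the tag "init01" is a placeholder):
liquibase --changeLogFile=db/changelog/db.changelog-master.yaml \
  changelogSyncToTagSQL init01 > sync-to-init01.sql

# Or restrict changelogSyncSQL by label, assuming the init01 changesets
# carry a matching label:
liquibase --changeLogFile=db/changelog/db.changelog-master.yaml \
  --labels=init01 changelogSyncSQL > sync-init01.sql
```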
As it stands, the only solution I've found is to generate the SQL and then manually edit it to filter out the changesets that don't correspond to the current schema.
Then you can apply the SQL to the db.

Kusto ingestion error "BadRequest_EmptyArchive: Empty zip archive"

I have a bunch of .csv files in Azure blob storage, and an ingestion rule to pull them into Kusto (Azure Data Explorer). This used to work, but I've been getting a lot of ingestion failures lately. ".show ingestion failures" gives me:
Details FailureKind OperationKind ErrorCode ShouldRetry IngestionProperties IngestionSourcePath
BadRequest_EmptyArchive: Empty zip archive Permanent DataIngestPull BadRequest_EmptyArchive 0 "[Format=Csv/mandatory, IngestionMapping=[{""column"":""CrashSource"",""datatype"":""string"",""Ordinal"":""0""},{""column"":""CrashType"",""datatype"":""string"",""Ordinal"":""1""},{""column"":""ReportId"",""datatype"":""string"",""Ordinal"":""2""},{""column"":""DeviceId"",""datatype"":""string"",""Ordinal"":""3""},{""column"":""DeviceSerialNumber"",""datatype"":""string"",""Ordinal"":""4""},{""column"":""DumpFilePath"",""datatype"":""string"",""Ordinal"":""5""},{""column"":""FailureXmlPath"",""datatype"":""string"",""Ordinal"":""6""},{""column"":""PROCESS_NAME"",""datatype"":""string"",""Ordinal"":""7""},{""column"":""BUILD_VERSION_STRING"",""datatype"":""string"",""Ordinal"":""8""},{""column"":""DUMP_TYPE"",""datatype"":""string"",""Ordinal"":""9""},{""column"":""PRIMARY_PROBLEM_CLASS"",""datatype"":""string"",""Ordinal"":""10""},{""column"":""IMAGE_NAME"",""datatype"":""string"",""Ordinal"":""11""},{""column"":""FAILURE_BUCKET_ID"",""datatype"":""string"",""Ordinal"":""12""},{""column"":""OS_VERSION"",""datatype"":""string"",""Ordinal"":""13""},{""column"":""TARGET_TIME"",""datatype"":""string"",""Ordinal"":""14""},{""column"":""FAILURE_ID_HASH_STRING"",""datatype"":""string"",""Ordinal"":""15""},{""column"":""FAILURE_ID_HASH"",""datatype"":""string"",""Ordinal"":""16""},{""column"":""FAILURE_ID_REPORT_LINK"",""datatype"":""string"",""Ordinal"":""17""}], ValidationPolicy=[Options=ValidateCsvInputConstantColumns, Implications=BestEffort], Tags=[ToStringEmpty], IngestIfNotExists=[ToStringEmpty], ZipPattern=[null]]" https://crashanalysisresults.blob.core.usgovcloudapi.net/datacontainer/Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv
My CSV files are not zipped in blob storage. Do I need to do something with ZipPattern to say so?
Here's what this CSV contains (many strings simplified):
CrashSource,CrashType,ReportId,DeviceId,DeviceSerialNumber,DumpFilePath,FailureXmlPath,PROCESS_NAME,BUILD_VERSION_STRING,DUMP_TYPE,PRIMARY_PROBLEM_CLASS,IMAGE_NAME,FAILURE_BUCKET_ID,OS_VERSION,TARGET_TIME,FAILURE_ID_HASH_STRING,FAILURE_ID_HASH,FAILURE_ID_REPORT_LINK
"source","type","reportid","deviceid","","dumpfilepath","failurexmlpath","process","version","1","problem class","image","bucket","version","2020-07-27T22:36:44.000Z","hash string","{b27ad816-2cb5-c004-d164-516c7a32dcef}","link"
As often happens, I seem to have found my own answer just by asking. While reviewing my question, I realized that the string ".zip" appears in the middle of my CSV file name (Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.zip.crashanalysis.csv). That made me wonder whether Kusto treats it differently because of that. I tested by taking the exact same file, renaming it "Telemetry.37c92f1a-a951-4047-b839-e685bd11758f.crashanalysis.csv", and uploading it. It was ingested successfully.
<forehead smack> I guess I will rename my files to get past this, but this sounds like a Kusto ingestion bug to me.
Thanks for reporting this, Sue.
Yes, Kusto automatically attempts to check files with ".zip" in their name. We'll check why this "trips" when the string is in the middle of the file name instead of just at its end.
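This is not Kusto's actual code, but the reported behavior is consistent with a name check that matches ".zip" anywhere in the file name rather than only as the final extension, as this small POSIX shell sketch illustrates:

```shell
# Sketch of the suspected behavior: the pattern *.zip* matches ".zip"
# anywhere in the name, not only as the trailing extension.
looks_like_zip() {
  case "$1" in
    *.zip*) echo "zip" ;;     # treated as a zip archive
    *)      echo "plain" ;;   # treated as a plain file
  esac
}

looks_like_zip "Telemetry.zip.crashanalysis.csv"   # -> zip (although it is a CSV)
looks_like_zip "Telemetry.crashanalysis.csv"       # -> plain
```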

Dynamic output issue when rowset is empty

I'm running a U-SQL script similar to this:
@output =
    SELECT Tag,
           Json
    FROM @table;

OUTPUT @output
TO @"/Dir/{Tag}/filename.txt"
USING Outputters.Text(quoting : false);
The problem is that @output is empty and the execution fails. I've already checked that if I don't use {Tag} in the output path the script works fine (it writes an empty file, but that's expected).
Is there a way to avoid the failure and simply not output anything?
Thank you
The form you are using is not yet publicly supported. Output to Files (U-SQL) documents the only supported version right now.
That said, depending on the runtime that you are using, and the flags that you have set in the script, you might be running the private preview feature of outputting to a set of files. In that case, I would expect that it would work properly.
Are you able to share a job link?

how to create session name dynamically while running a workflow

I have created a single mapping, session and workflow to load different tables that share the same data structure. I created a shell script to build a parameter file dynamically and run the workflow per table. The issue is that when I run this workflow, all session names are the same, since I created only one session, but the tables are different. I need the session names to differ, something like session_name_table_name.
Please help me solve this issue. Sorry for my bad English if it's hard to understand.
In the pmcmd command, pass the option -rin <run instance>, replacing <run instance> with your table name.
In the Workflow Monitor, the run instance name will appear in [ ] beside the workflow, e.g. wkf_s_m_load_test [T_TEST1].
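A hedged sketch of such an invocation from the driving shell script (the service, domain, folder, credentials and paths below are placeholders, not from the question):

```shell
# Hypothetical pmcmd call; -rin sets the run instance name per table, so the
# Workflow Monitor shows e.g. wkf_s_m_load_test [T_TEST1].
pmcmd startworkflow -sv INT_SVC -d DOM_DEV -u admin -p secret \
  -f MY_FOLDER -rin T_TEST1 \
  -paramfile /infa/params/param_T_TEST1.par \
  wkf_s_m_load_test
```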

I'm having problems with configuring a filter that replicates specific tables only

I am trying to use filters to select specific tables to replicate.
I tried running this with the installer
./tools/tungsten-installer --master-slave -a \
...
--svc-extractor-filters=replicate \
--property=replicator.filter.replicate.do=test,*.foo"
and got this exception in trepctl status after the master had not installed properly:
Plugin class name property is missing or null: key=replicator.filter.replicate
Which file is this properties file? How do I find it? Moreover, in specifying the settings for the filter, how do I know what exactly to put?
I discovered that I am supposed to modify the configuration template file prior to configuration, according to Issue 219, but what changes am I supposed to make in tungsten-replicator-2.0.5-diff that will later be patched into the extraction?
Issue 254 suggests that if you want to apply a filter out of the box, you can use these options with tungsten-installer:
-a --property=replicator.filter.Replicate.ignoreFilter=schema_x.tablex,schema_x,tabley,schema_y,tablez
--svc-thl-filter=Replicate
However, when I try using this for --property=replicator.filter.replicate.do, the problem is still the same:
pendingExceptionMessage: Plugin class name property is missing or null: key=replicator.filter.replicate
Your assistance will be greatly appreciated.
Rumbi
Update:
Hi
I had a look at this file: /root/tungsten/tungsten-replicator/samples/conf/filters/default/tableignore.tpl. According to this sample, a static-SERVICE_NAME.properties file is supposed to have something like this configured; please confirm if this is the correct syntax:
replicator.filter.tabledo=com.continuent.tungsten.replicator.filter.JavaScriptFilter
replicator.filter.tabledo.script=${replicator.home.dir}/samples/scripts/javascript-advanced/tabledo.js
replicator.filter.tabledo.tables=foo(database).bar(table)
replicator.stage.thl-to-dbms.filters=tabledo
However, I did not find tabledo.js (or anything similar) in the directory where tableignore.js exists. Could I please have the location of this file? If there is an alternative way of specifying --property=replicator.filter.replicate.do=test without the use of this .js file, your suggestions are most welcome.
Download the latest version of Tungsten Replicator. The missing tpl file was added about a month ago. After installation, the filtered tables should be listed in static-SERVICE_NAME.properties under the FILTERS section.
Locate your replicator configuration file in static-YOUR_SERVICE_NAME.properties, e.g.
/opt/continuent/tungsten/tungsten-replicator/conf/static-mysql2vertica.properties
Make sure the individual dbms properties are set, in particular the setting replicator.applier.dbms:
# Batch applier basic configuration information.
replicator.applier.dbms=com.continuent.tungsten.replicator.applier.batch.SimpleBatchApplier
replicator.applier.dbms.url=jdbc:mysql:thin://${replicator.global.db.host}:${replicator.global.db.port}/tungsten_${service.name}?createDB=true
replicator.applier.dbms.driver=org.drizzle.jdbc.DrizzleDriver
replicator.applier.dbms.user=${replicator.global.db.user}
replicator.applier.dbms.password=${replicator.global.db.password}
replicator.applier.dbms.startupScript=${replicator.home.dir}/samples/scripts/batch/mysql-connect.sql
# Timezone and character set.
replicator.applier.dbms.timezone=GMT+0:00
replicator.applier.dbms.charset=UTF-8
# Parameters for loading and merging via stage tables.
replicator.applier.dbms.stageTablePrefix=stage_xxx_
replicator.applier.dbms.stageDirectory=/tmp/staging
replicator.applier.dbms.stageLoadScript=${replicator.home.dir}/samples/scripts/batch/mysql-load.sql
replicator.applier.dbms.stageMergeScript=${replicator.home.dir}/samples/scripts/batch/mysql-merge.sql
replicator.applier.dbms.cleanUpFiles=false
Depending on the database you are replicating to you may have to omit/modify some of the lines.
For more information see:
https://code.google.com/p/tungsten-replicator/wiki/Replicator_Batch_Loading
I don't know whether this problem is still open.
I am using version 2.0.6-xxx, and installing the service with these parameters works for me.
I would like to point out that, as the name says, "--svc-extractor-filters" defines an extractor filter, meaning the parameters will guide the extraction of data on the master server.
If you intend to use it on the slave side, you should use "--svc-applier-filters".
The parameters
--svc-extractor-filters=replicate \
--property=replicator.filter.replicate.do=test,*.foo"
are supposed to create the following in the properties file:
# This is the filter setup.
replicator.filter.replicate=com.continuent.tungsten.replicator.filter.ReplicateFilter
replicator.filter.replicate.ignore=
replicator.filter.replicate.do=test,*.foo
And you should also be able to find the
replicator.stage.binlog-to-q.filters=replicate
parameter set.
If you intend to use this filter in the slave, please find the line with:
replicator.stage.q-to-dbms.filters=mysqlsessions,pkey,bidiSlave
and change it as
replicator.stage.q-to-dbms.filters=mysqlsessions,pkey,bidiSlave,replicate
Hope this brief description helps!
