Can you please explain the "--m 1" in the last line? - cloudera

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp --m 1

It specifies the number of map tasks used to perform the import.

It tells Sqoop to use only one mapper task to perform the import operation, which means only one part file is created.
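For comparison, here is a sketch of the same import with four mappers (the split column id is illustrative); Sqoop divides the rows between mappers based on a column, by default the table's primary key or one named with --split-by, and writes one part file per mapper:

sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--split-by id \
--m 4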

Related

Dynamic exclusion list in lsyncd

Our cloud platform is powered by OpenNebula, so we have two instances of the frontend in "cold swap". We use the lsyncd daemon to keep the instances' datastores synced, but there is a catch: we don't want to sync VM images that have a .bak extension, because another script moves all the .bak files to other storage on a schedule. The sync script's logic is: find all the .bak files in /var/lib/one/datastores/, create exclude.lst, and then start lsyncd. Seems OK until we take a look at the datastores:
oneadmin#nola:~/cluster$ dir /var/lib/one/datastores/1/
006e099c57061d87d4b8f78ec7199221
008a10fa0764c9ac8d6fb9206c9b69bd
069299977f2fea243a837efed271182f
0a73a9adf74d92b4f175abcb578cabac
0b1cc002e370e1acd880cf781df0a6fb
0b470b182ac6d554774a3615ce87e292
0c0d98d1e0aabc23ef548ddb564c578d
0c3fad9c92a8efc7e13a73d8ae85caa3
..and so on.
We solved it with this monstrous function:
function create_exclude {
    # Dump the image pool as XML, flatten each image to "ID;NAME;SOURCE",
    # keep only the .bak images whose source lives under /var/lib,
    # and write the VM folder name (8th path component) to exclude.lst
    oneimage list -x | \
    xmlstarlet sel -t -m "IMAGE_POOL/IMAGE" -v "ID" -o ";" -v "NAME" -o ";" -v "SOURCE" -o ":" | \
    sed 's/:/\n/g' | \
    awk -F";" '/\.bak;\/var\/lib/ {print $3}' | \
    cut -d / -f8 > /var/lib/one/cluster/exclude.lst
}
The result is a list of the VM folders that contain .bak images, so we can exclude those whole VM folders from syncing. That's not quite what we wanted, since the original image then stays unsynced too, but that can be worked around by restarting the lsyncd script at the moment the other script moves all the .bak files to the other storage.
Now we get to the topic of the question.
It works until a new .bak is created. There is no way to add a new entry to exclude.lst "on the go" other than stopping lsyncd and restarting the script that re-creates exclude.lst. And there is no way to catch the moment a new .bak is created, short of yet another script that polls for it periodically.
I believe a less complicated solution exists. It depends on OpenNebula, of course, particularly on the way the /datastores/ folder stores VMs.
Glad to know you are using OpenNebula to run your cloud :) Have you tried to use our Community Forum for support? I'm sure the rest of the Community will be happy to give a hand!
Cheers!
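As a possible way to catch new .bak files the moment they appear, here is a sketch using inotifywait from inotify-tools; the restart command is illustrative and depends on how lsyncd is launched on this host:

inotifywait -m -r -e create -e moved_to --format '%w%f' /var/lib/one/datastores/ | \
while read -r path; do
    case "$path" in
        *.bak)
            create_exclude              # regenerate exclude.lst with the function above
            systemctl restart lsyncd    # illustrative: restart lsyncd however it is managed here
            ;;
    esac
done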

Access a variable defined in Jenkinsfiles in Shell Script within Jenkinsfile

I am defining a shell script in one of the stages in my Jenkinsfile. How can I access a variable that I define in my Jenkinsfile from within that shell script?
In the scenario below, I am writing the value of the shell variable to a file and reading it into a Groovy variable. Is there a way to pass data from shell to Groovy without writing it to the file system?
unstash 'sources'
sh '''
source venv/bin/activate
export AWS_ROLE_ARN=arn:aws:iam::<accountid>:role/<role name>
layer_arn="$(awssume aws lambda list-layer-versions --layer-name dependencies --region us-east-1 --query \"LayerVersions[0].LayerVersionArn\" | tr -d '\"')"
echo $layer_arn > layer_arn
'''
layer_arn = readFile('layer_arn').trim()
You can call the shell step with the variable interpolated into the command string:
sh "some stuff $my_var"
Or you can define an environment variable and use it within your shell:
withEnv(["MY_VAR=${my_var}"]) {
    sh 'some stuff'
}
Regards
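For the other direction (getting the shell output back into a Groovy variable without the temp file), the sh step can return its standard output via returnStdout; a minimal sketch, using the AWS CLI's --output text in place of the tr -d '"' from the question:

layer_arn = sh(
    returnStdout: true,
    script: 'awssume aws lambda list-layer-versions --layer-name dependencies --region us-east-1 --query "LayerVersions[0].LayerVersionArn" --output text'
).trim()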

How to create a list of topics in Apache Kafka using single command

As of now I am creating topics one by one using the command below.
sh ./bin/kafka-topics --create --zookeeper localhost:2181 --topic sdelivery --replication-factor 1 --partitions 1
Since I have 200+ topics to create, is there any way to create the whole list of topics with a single command?
I am using 0.10.2 version of Apache Kafka.
This seems like more of a unix/bash question than a Kafka one: the xargs utility is specifically designed to run a command repeatedly from a list of arguments. In your specific case you could use:
cat topic_list.txt | xargs -I % -L1 sh ./bin/kafka-topics --create --zookeeper localhost:2181 --topic % --replication-factor 1 --partitions 1
If you want to do a "dry run" and see what commands will be executed you can replace the sh with echo sh.
Alternatively you can just make sure that your config files have default topic settings of --replication-factor 1 and --partitions 1 and just allow those topics to be automatically created the first time you send a message to them.
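The broker settings behind that approach (in server.properties) would look roughly like this; auto.create.topics.enable is already true by default in stock Kafka:

auto.create.topics.enable=true
num.partitions=1
default.replication.factor=1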
You could use Terraform or Kubernetes Operators which will not only help you create topics (with one command), but also manage them later if you did need to delete or modify their configs.
But without custom solutions, and purely for batch creation, you can use awk for this.
Create a file
$ cat /tmp/topics.txt
test1:1:1
test2:1:2
Then use awk's system() function to parse the file and execute the kafka-topics script for each line:
$ awk -F':' '{ system("./bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic=" $1 " --replication-factor=" $2 " --partitions=" $3) }' /tmp/topics.txt
Created topic "test1".
Created topic "test2".
And we can see the topics are created
$ ./bin/kafka-topics.sh --zookeeper localhost:2181 --list
test1
test2
Note: creating hundreds of topics this quickly might overload ZooKeeper, so it might help to append "; sleep 10" to the command inside system().
As of Kafka 3.0, the --zookeeper flag is replaced by --bootstrap-server.
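With that newer syntax, the same batch creation can be done with a plain shell loop over the topic file (the broker address is illustrative; the column order matches the topic:replication:partitions format above):

while IFS=':' read -r topic replication partitions; do
    ./bin/kafka-topics.sh --create \
        --bootstrap-server localhost:9092 \
        --topic "$topic" \
        --replication-factor "$replication" \
        --partitions "$partitions"
done < /tmp/topics.txt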

Aster Database to Hadoop using Sqoop

I am using the syntax below to read from the Teradata Aster database table transaction and load it into a Hadoop/Hive table.
I have added the following jar files to the /usr/iop/4.1.0.0/sqoop/lib folder:
terajdbc4.jar
tdgssconfig.jar
noarch-aster-jdbc-driver.jar
Syntax:
sqoop import --connect jdbc:ncluster://hostname.gm.com:2406/Database=test --username abcde --password test33 --table aqa.transaction
Error:
Warning: /usr/iop/4.1.0.0/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/12/14 15:38:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6_IBM_20
16/12/14 15:38:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/12/14 15:38:49 ERROR tool.BaseSqoopTool: Got error creating database manager: java.io.IOException: No manager for connect string: jdbc:ncluster://hostname.gm.com:2406/Database=test
at org.apache.sqoop.ConnFactory.getManager(ConnFactory.java:191)
at org.apache.sqoop.tool.BaseSqoopTool.init(BaseSqoopTool.java:256)
at org.apache.sqoop.tool.ImportTool.init(ImportTool.java:89)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:593)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Add --connection-manager <class-name> to your sqoop command if a connection manager is available for your RDBMS in Sqoop.
Otherwise, add --driver <driver-name> to your sqoop command to use the generic connection manager.
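For example, the command from the question could be retried like this, using the driver class that ships with the Aster JDBC jar and -P to prompt for the password instead of passing it on the command line:

sqoop import \
--connect jdbc:ncluster://hostname.gm.com:2406/Database=test \
--driver com.asterdata.ncluster.Driver \
--username abcde -P \
--table aqa.transaction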
You can try with JDBC jar from Aster.
Here are some steps that I followed to create an external Hive table after importing an Aster table using Sqoop:
Download JDBC jar from https://aster-community.teradata.com/docs/DOC-2254
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$PWD/noarch-aster-jdbc-driver.jar
sqoop import -D mapreduce.job.name="Sqoop Hive Import for Aster table tableName" \
  --connect "jdbc:ncluster://X.X.X.X/database" \
  --driver com.asterdata.ncluster.Driver \
  --username "user1" --password "password" \
  --query "select * from schema.table where \$CONDITIONS limit 10" \
  --split-by col1 \
  --as-avrodatafile \
  --target-dir /tmp/aster/tableName
Create an external Hive table on the target directory, or replace --as-avrodatafile with the Hive import options.
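A sketch of that last step, assuming the Avro files written by the import above; the table name and columns are placeholders and must match the real Aster schema:

CREATE EXTERNAL TABLE aster_tableName (
  col1 STRING,
  col2 STRING
)
STORED AS AVRO
LOCATION '/tmp/aster/tableName';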

Is it possible to import a gzip file into sqlite / Could I skip some columns while importing?

I tried to play around with .import but it seems to be limited to CSV and delimited files. Is it possible to import a gzip file, or at least pipe it in from the command line?
Also, could I skip some unwanted columns, as MySQL's "LOAD DATA INFILE" can?
If you don't want to use named pipes, you could also:
zcat $YOURFILE.gz | sqlite3 $YOURDB.sqlite ".import /dev/stdin $TABLENAME"
If you need to modify stuff before import, you could use perl (or awk, sed, whatever) between the zcat and sqlite commands.
For example, if your file already uses the pipe character as a delimiter and you would like to import only columns 0 to 3 and 5 to 6:
zcat $YOURFILE.gz | perl -F'\|' -anle 'print join("|", @F[0..3,5..6])' | sqlite3 $YOURDB.sqlite ".import /dev/stdin $TABLENAME"
$ mkfifo tempfile
$ zcat my_records.csv.gz > tempfile
This works like magic!
Although mkfifo does create a file entry on disk, its size stays at 0 bytes.
When you run $ zcat my_records.csv.gz > tempfile, it will block at the command prompt, waiting for a reader to open the pipe.
This allows you to run
sqlite3> .import tempfile db_table
After sqlite3 finishes importing from the named pipe, the zcat command will also finish. You can then remove the named pipe.
$ rm -f tempfile
zcat data.gz |\
cat <(echo -e ".separator ','\n.import /dev/stdin dest_table") - |\
sqlite3 db.sqlite
This works nicely (on Linux).
You can create a named pipe. It will act like a normal file, with the decompression happening on the fly, and SQLite will know nothing about it.
It turns out the example on Wikipedia even uses gzip: http://en.wikipedia.org/wiki/Named_pipe
You could write a parser for the data that converts it into a series of SQL statements. Perl is a good language for that, and it can even handle gzip'd files.
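A rough sketch of that idea using zcat and awk instead of Perl: it assumes a gzip'd comma-separated file, an existing table my_table with two columns, and plain numeric values (no quoting or escaping is handled); column 2 of the input is skipped:

zcat my_records.csv.gz | awk -F',' '
BEGIN { print "BEGIN TRANSACTION;" }
      { printf "INSERT INTO my_table VALUES (%s, %s);\n", $1, $3 }
END   { print "COMMIT;" }
' | sqlite3 my_db.sqlite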
Are you running this in a *Nix OS? If so, you could create a temporary file to hold the decompressed data:
tf="$(mktemp)" &&
zcat <my_records.csv.gz >"$tf"
sqlite3 /path/to/database.sqlite3 ".import $tf your_table"
rm -f "$tf"
