Hadoop Streaming jar not found when submitting Google Dataproc Hadoop Job? - hadoop-streaming

When trying to submit a Hadoop MapReduce job programmatically (from a Java application using the dataproc library), the job fails immediately. When submitting that exact same job through the UI, it works fine.
I've tried SSHing onto the Dataproc cluster to confirm the file exists, checking its permissions, and changing the jar reference. Nothing has worked yet.
The error I'm getting:
Exception in thread "main" java.lang.ClassNotFoundException: file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:18)
Job output is complete
When I clone the failed job in the console and look at the REST equivalent this is what I see:
POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id",
      "jobId": "jobDoesNotWork"
    },
    "placement": {
      "clusterName": "cluster-name",
      "clusterUuid": "uuid"
    },
    "submittedBy": "service-account@project.iam.gserviceaccount.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}
When I submit the job through the console it works. The REST equivalent of that job:
POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id",
      "jobId": "jobDoesWork"
    },
    "placement": {
      "clusterName": "cluster-name",
      "clusterUuid": ""
    },
    "submittedBy": "user_email_account@email.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}
I ssh’ed into the box and confirmed that the file is, in fact, present. The only difference I can really see is the “submittedBy”. One works, one doesn't. I'm guessing this is a permission thing, but I cannot seem to tell where the permissions are being pulled from in each scenario. In both cases, the Dataproc cluster is created with the same service account.
Looking at permissions for that jar on the cluster I see:
-rw-r--r-- 1 root root 133856 Nov 27 20:17 hadoop-streaming-2.8.4.jar
lrwxrwxrwx 1 root root 26 Nov 27 20:17 hadoop-streaming.jar -> hadoop-streaming-2.8.4.jar
I tried changing the mainJarFileUri from pointing explicitly at the versioned jar to pointing at the symlink (since it has open permissions), but didn't really expect that to work. And it didn't.
Does anyone with more Dataproc experience have any idea what's going on here, and how I can resolve it?

One common mistake that's easy to make in code is to call setMainClass when you intended to call setMainJarFileUri, or vice versa. The java.lang.ClassNotFoundException you received indicates that Dataproc was trying to submit that jarfile string as a class name rather than a jarfile, so somewhere along the way Dataproc thought you set main_class. You might want to double-check your code to see if this is the bug you encountered.
The reason using "clone job" in the GUI hides this problem is that the GUI tries to be more user-friendly by offering a single text box for setting either main_class or main_jar_file_uri, and it infers which one you meant by looking at the file extension. So if you submit a job with a jarfile URI in the main_class field and it fails, then click clone and submit the new job, the GUI will try to be smart, recognize that the new job actually specified a jarfile name, and correctly set the main_jar_file_uri field in the JSON request instead of main_class.
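For reference, a minimal sketch of the correct call with the google-cloud-dataproc v1 Java client (the project, region, cluster, and paths are the placeholders from the job above; treat this as an illustration, not your exact code):

import com.google.cloud.dataproc.v1.HadoopJob;
import com.google.cloud.dataproc.v1.Job;
import com.google.cloud.dataproc.v1.JobControllerClient;
import com.google.cloud.dataproc.v1.JobControllerSettings;
import com.google.cloud.dataproc.v1.JobPlacement;
import java.util.Arrays;

public class SubmitStreamingJob {
    public static void main(String[] args) throws Exception {
        // The client must target the same regional endpoint as the cluster.
        JobControllerSettings settings = JobControllerSettings.newBuilder()
            .setEndpoint("us-east1-dataproc.googleapis.com:443")
            .build();

        // The streaming jar is a jar file URI, so it belongs in
        // setMainJarFileUri -- NOT in setMainClass, which expects a class name.
        HadoopJob hadoopJob = HadoopJob.newBuilder()
            .setMainJarFileUri("file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar")
            .addAllArgs(Arrays.asList(
                "-mapper", "/bin/cat",
                "-reducer", "/bin/cat",
                "-input", "gs://input/path/",
                "-output", "gs://output/path/"))
            .build();

        Job job = Job.newBuilder()
            .setPlacement(JobPlacement.newBuilder().setClusterName("cluster-name"))
            .setHadoopJob(hadoopJob)
            .build();

        try (JobControllerClient client = JobControllerClient.create(settings)) {
            Job submitted = client.submitJob("project-id", "us-east1", job);
            System.out.println("Submitted: " + submitted.getReference().getJobId());
        }
    }
}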

Related

Ingest pipeline is not working over logs obtained from an event hub with filebeat

I am sending logs to an Azure Event Hub with Serilog (using WriteTo.AzureEventHub(eventHubClient)); after that I run a filebeat process with the azure module enabled, which ships these logs to Elasticsearch so I can explore them with Kibana.
The problem I have is that all the information goes into the field "message". I need to split the information in my logs into different fields to be able to run good queries.
The way I found was to create an ingest pipeline in Kibana and, through a grok processor, separate the fields inside the "message" and generate multiple fields. In filebeat.yml I set the pipeline name, but nothing happens; it seems the pipeline is not working.
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["localhost:9200"]
  pipeline: "filebeat-otc"
Does anybody know what I am missing? Thanks in advance.
EDIT: I will add an example of my pipeline and my data. In the simulation it works properly:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "%{TIME:timestamp}\\s%{LOGLEVEL}\\s{[a-zA-Z]*:%{UUID:CorrelationID},[a-zA-Z]*:%{TEXT:OperationTittle},[a-zA-Z]*:%{TEXT:OriginSystemName},[a-zA-Z]*:%{TEXT:TargetSystemName},[a-zA-Z]*:%{TEXT:OperationProcess},[a-zA-Z]*:%{TEXT:LogMessage},[a-zA-Z]*:%{TEXT:ErrorMessage}}"
          ],
          "pattern_definitions": {
            "LOGLEVEL": "\\[[^\\]]*\\]",
            "TEXT": "[a-zA-Z0-9- ]*"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "15:13:59 [INF] {CorrelationId:83355884-a351-4c8b-af8d-b77c48462f36,OperationTittle:Operation1,OriginSystemName:Fexa,TargetSystemName:Usina,OperationProcess:Testing Log Data,LogMessage:Esto es una buena prueba,ErrorMessage:null}"
      }
    },
    {
      "_source": {
        "message": "20:13:48 [INF] {CorrelationId:8451ee54-efca-40be-91c8-8c8e18e33f58,OperationTittle:null,OriginSystemName:Fexa,TargetSystemName:Donna,OperationProcess:Testing Log Data,LogMessage:null,ErrorMessage:null}"
      }
    }
  ]
}
It seems that when you use a module, it creates and uses its own ingest pipeline in Elasticsearch, and the pipeline option in the output is ignored.
So my solution was to modify index.final_pipeline. To do this, in Kibana I went to Stack Management / Index Management, found my index, went to Edit Settings, and set "index.final_pipeline": "the-name-of-my-pipeline".
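For reference, the same setting can also be applied from the Dev Tools console; a sketch, assuming your Filebeat index follows the default filebeat-* naming and the pipeline is the one from the question:

PUT filebeat-*/_settings
{
  "index.final_pipeline": "filebeat-otc"
}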
I hope this helps somebody.
This was thanks to leandrojmp.

How to configure dynamodb-to-lambda trigger using amplify framework/cli

The Amplify docs here say that we can configure a Lambda function as a DynamoDB trigger by running amplify add function and selecting the "Lambda Trigger" option, but when I run amplify add function (with Python selected as the runtime language) I don't get the Lambda Trigger option; I only get the "Serverless function" and "Lambda layer" options.
Please help me resolve this issue so I can access the feature.
docs snapshot - showing 4 options
my CLI snapshot - showing only 2 options
I know it works for nodejs runtime lambda, but I want this option for Python Lambda as well.
I just followed these steps with Amplify CLI version 4.50.2.
To create a Lambda function that is triggered by changes to a DynamoDB table, you can use the following command-line actions, which the CLI walks you through after entering the command below:
amplify add function
Select which capability you want to add:
❯ Lambda function (serverless function)
Provide an AWS Lambda function name:
<YourFunctionsName>
Choose the runtime that you want to use:
> NodeJS # IMPORTANT: Must be NodeJS as of now, you can change this later by manually editing ...-cloudformation-template.json file inside function directory
Choose the function template you want to use
> Lambda Trigger
What event source do you want to associate with the lambda trigger
> Amazon DynamoDB Stream
Choose a DynamoDB event source option
> Use API category graphql @model backend DynamoDB table(s) in the current Amplify project
Choose the graphql @model(s)
<Select any models (using spacebar) you want to trigger the function after editing>
Do you want to configure advanced settings?
Y # IMPORTANT: If you are using a DynamoDB event source based on a table defined by a graphql schema, you will need to give this function read access to the api resource that contains the graphql schema defining the table that drives the event
Do you want to access other resources in this project from your Lambda function?
y # See above, select your api that contains the data model and make sure that the function has at least read access.
After this, the other options (layer, call scheduling) are up to you.
After creating the function via the above CLI options, you can change the "Runtime" field inside the -cloudformation-template.json file in the function directory, e.g. if you want a Python Lambda function, change the runtime to "python3.8". You will also need to create a file called index.py inside your function's directory that has a handler(event, context) function. See the example below:
import json

def handler(event, context):
    print("Triggered via DynamoDB")
    print(event)
    return json.dumps({'status_code': 200, "message": "Received from DynamoDB"})
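For reference, a sketch of what that runtime edit looks like in the -cloudformation-template.json (these are standard AWS::Lambda::Function properties; your logical resource name may differ):

"LambdaFunction": {
  "Type": "AWS::Lambda::Function",
  "Properties": {
    "Handler": "index.handler",
    "Runtime": "python3.8"
  }
}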
After making these edits, you can run amplify push and, if you open your function in the management console online, it should show an attached DynamoDB stream.
Doesn't appear to be available anymore in the CLI codebase - see Supported-service.json deleted and replaced by supported-services.ts
https://github.com/aws-amplify/amplify-cli/commit/607ae21287941805f44ea8a9b78dd12d16d71f85#diff-a0fd8c5607fd81977cb4745b9af3af2c6649ded748991bf9968a7d782b000c6b
https://github.com/aws-amplify/amplify-cli/commits/4e974007d95c894ab4108a2dff8d5996e7e3ce25/packages/amplify-category-function/src/provider-utils/supported-services.ts
Select NodeJS and you will be able to see the Lambda Trigger option.
Just add the following to {YOUR_FUNCTION_NAME}-cloudformation-template.json, remembering to replace (YOUR_TABLE_NAME) with your table name.
"LambdaTriggerPolicyPurchase": {
"DependsOn": [
"LambdaExecutionRole"
],
"Type": "AWS::IAM::Policy",
"Properties": {
"PolicyName": "amplify-lambda-execution-policy-Purchase",
"Roles": [
{
"Ref": "LambdaExecutionRole"
}
],
"PolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"dynamodb:DescribeStream",
"dynamodb:GetRecords",
"dynamodb:GetShardIterator",
"dynamodb:ListStreams"
],
"Resource": {
"Fn::ImportValue": {
"Fn::Sub": "${apilanguageGraphQLAPIIdOutput}:GetAtt:(YOUR_TABLE_NAME):StreamArn"
}
}
}
]
}
}
},
"LambdaEventSourceMappingPurchase": {
"Type": "AWS::Lambda::EventSourceMapping",
"DependsOn": [
"LambdaTriggerPolicyPurchase",
"LambdaExecutionRole"
],
"Properties": {
"BatchSize": 100,
"Enabled": true,
"EventSourceArn": {
"Fn::ImportValue": {
"Fn::Sub": "${apilanguageGraphQLAPIIdOutput}:GetAtt:(YOUR_TABLE_NAME):StreamArn"
}
},
"FunctionName": {
"Fn::GetAtt": [
"LambdaFunction",
"Arn"
]
},
"StartingPosition": "LATEST"
}
},
I got these by creating a dummy function using the template that shows up after you choose NodeJS, and comparing its -cloudformation-template.json with my own function's.

Authorization error when trying to apply an index template to elasticsearch

We have an Elasticsearch service running in AWS; the Elasticsearch version is 7.8.0. I need to add an index template to limit the number of shards allocated to new indices when they are added.
I followed this example of how to add an index template and came up with this very simple template:
PUT _index_template/shard_limitation
{
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}
When running this request from inside Kibana's Dev Tools console I get the following error: {"Message":"Your request: '/_index_template/shard_limitation' is not allowed."}, as well as an Unauthorized - 401 icon. I'm running this command as the admin user.
I tested it locally (Elasticsearch running on my machine) and it all works fine. Any idea why this might happen?
SOLUTION:
As suggested by @Ajinkya, the correct way to do this is to not include the "_index" prefix before the template API. The correct way to achieve what I was trying to do is the following:
PUT _template/shard_limitation
{
  "index_patterns": ["some-pattern"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
The '_index_template' operation might not be supported in AWS Elasticsearch. You can check the supported operations for your AWS ES version here.
You can still use the legacy '_template' API to add an index template (note that its body takes "settings" at the top level, without the "template" wrapper used by '_index_template'):
PUT _template/shard_limitation
{
  "index_patterns": ["test*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
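Either way, you can confirm that the template was stored by reading it back from the Dev Tools console:

GET _template/shard_limitation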

Script paths into Azure Data Factory DataLakeAnalytics u-sql pipeline

I'm trying to publish a Data Factory solution with this ADF DataLakeAnalyticsU-SQL pipeline activity, following the Azure step-by-step doc (https://learn.microsoft.com/en-us/azure/data-factory/data-factory-usql-activity).
{
  "type": "DataLakeAnalyticsU-SQL",
  "typeProperties": {
    "scriptPath": "\\scripts\\111_risk_index.usql",
    "scriptLinkedService": "PremiumAzureDataLakeStoreLinkedService",
    "degreeOfParallelism": 3,
    "priority": 100,
    "parameters": {
      "in": "/DF_INPUT/Consodata_Prelios_consegna_230617.txt",
      "out": "/DF_OUTPUT/111_Analytics.txt"
    }
  },
  "inputs": [
    {
      "name": "PremiumDataLakeStoreLocation"
    }
  ],
  "outputs": [
    {
      "name": "PremiumDataLakeStoreLocation"
    }
  ],
  "policy": {
    "timeout": "06:00:00",
    "concurrency": 1,
    "executionPriorityOrder": "NewestFirst",
    "retry": 1
  },
  "scheduler": {
    "frequency": "Minute",
    "interval": 15
  },
  "name": "ConsodataFilesProcessing",
  "linkedServiceName": "PremiumAzureDataLakeAnalyticsLinkedService"
}
During publishing I got this error:
25/07/2017 18:51:59- Publishing Project 'Premium.DataFactory'....
25/07/2017 18:51:59- Validating 6 json files
25/07/2017 18:52:15- Publishing Project 'Premium.DataFactory' to Data
Factory 'premium-df'
25/07/2017 18:52:15- Value cannot be null.
Parameter name: value
Trying to figure out what could be wrong with the project, it appears that the issue resides in the activity's "typeProperties" options shown above, specifically the scriptPath and scriptLinkedService attributes. The doc says:
scriptPath: Path to folder that contains the U-SQL script. Name of the file is case-sensitive.
scriptLinkedService: Linked service that links the storage that contains the script to the data factory
Publishing the project without them (using a hard-coded script) completes successfully. The problem is that I can't figure out what exactly to put into them. I've tried several combinations of paths. The only thing I know is that the script file must be referenced locally in the solution as a dependency.
The script linked service needs to be Blob Storage, not Data Lake Storage.
Ignore the publishing error; it's misleading.
Have a linked service in your solution to an Azure Storage Account, referred to in the 'scriptLinkedService' attribute. Then in the 'scriptPath' attribute reference the blob container + path.
For example:
"typeProperties": {
"scriptPath": "datafactorysupportingfiles/CreateDimensions - Daily.usql",
"scriptLinkedService": "BlobStore",
"degreeOfParallelism": 2,
"priority": 7
},
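The 'scriptLinkedService' referenced there is a plain Azure Storage linked service; a sketch in the ADF v1 JSON format (the account name and key are placeholders):

{
  "name": "BlobStore",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
    }
  }
}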
Hope this helps.
P.S. Double-check the case sensitivity of attribute names. That can also throw unhelpful errors.

sequelize migration not working

I created a migration and ran it. It says it worked fine, but nothing happened. I don't think it is even connecting to my database.
My Migration file:
var util = require("util");

module.exports = {
  up: function(migration, DataTypes, done) {
    migration.createTable('nameOfTheNewTable', {
      attr1: DataTypes.STRING,
      attr2: DataTypes.INTEGER,
      attr3: {
        type: DataTypes.BOOLEAN,
        defaultValue: false,
        allowNull: false
      }
    }).success(function() {
      migration.describeTable('nameOfTheNewTable').success(function(attributes) {
        util.puts("nameOfTheNewTable Schema: " + JSON.stringify(attributes));
        done();
      });
    });
  },
  down: function(migration, DataTypes, done) {
    // logic for reverting the changes
  }
};
My Config.json:
{
  "development": {
    "username": "user",
    "password": "pw",
    "database": "my-db",
    "dialect": "sqlite",
    "host": "localhost"
  }
}
The command:
./node_modules/sequelize/bin/sequelize --migrate --env development
Loaded configuration file "config/config.json".
Using environment "development".
Running migrations...
20130921234513-initial.js
nameOfTheNewTable Schema: {"attr1":{"type":"VARCHAR(255)","allowNull":true,"defaultValue":null},"attr2":{"type":"INTEGER","allowNull":true,"defaultValue":null},"attr3":{"type":"TINYINT(1)","allowNull":false,"defaultValue":false}}
Completed in 8ms
I can run this over and over and the output is always the same. I've tried it on a database which I know has existing tables and tried to describe those tables, and still nothing happens.
Am I doing something wrong?
EDIT:
I'm pretty sure I'm not connecting to the db, but try as I might I cannot connect using the migration. I can connect using sqlite3 my-db.sqlite and run commands such as .tables to see tables I have created previously, but I cannot for the life of me get the "nameOfTheNewTable" table created using a migration. (I want to create indexes in the migration too.) I have tried using "development" and changing values in the config.json, such as the host and the database (my-db, ../my-db, my-db.sqlite), etc.
Here's a good example: in the config.json I put "database": "bad-db" and the output from the migration is exactly the same. When it is done, there is no bad-db.sqlite file to be found.
You need to specify the 'storage' parameter in your config.json so that Sequelize knows which file to use as the SQLite DB.
Sequelize defaults to in-memory storage for SQLite, so it's migrating an in-memory database and then exiting, effectively destroying the DB it just migrated.
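For example, the question's config.json with the storage path added (the filename here is only an assumption; point it at your actual SQLite file):

{
  "development": {
    "username": "user",
    "password": "pw",
    "database": "my-db",
    "dialect": "sqlite",
    "storage": "my-db.sqlite",
    "host": "localhost"
  }
}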
You most likely have to wait for migration.createTable to finish:
migration.createTable(/*your options*/).success(function() {
  migration.describeTable('nameOfTheNewTable').success(function(attributes) {
    util.puts("nameOfTheNewTable Schema: " + JSON.stringify(attributes));
    done();
  });
});
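For the indexes mentioned in the question, the same chaining pattern should work; a sketch using the old migrator's addIndex (the column choice is hypothetical, and this assumes addIndex is available in your Sequelize version):

migration.createTable('nameOfTheNewTable', { /* attributes as above */ }).success(function() {
  // Add an index on attr1 once the table actually exists
  migration.addIndex('nameOfTheNewTable', ['attr1']).success(function() {
    done();
  });
});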
