Getting InvalidProtocolBufferException while running oozie job - oozie

I'm getting the below exception while running the sample oozie examples.
I've modified the job.properties located at the /examples/apps/map-reduce with the appropriate nameNode and jobTracker details.
I'm using the below command to run the oozie job:
"sudo oozie job -oozie http://ip-10-0-20-143.ec2.internal:11000/oozie -config examples/apps/map-reduce/job.properties -run"
Error: E0501 : E0501: Could not perform authorization operation, Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: "ip-10-0-20-143.ec2.internal/10.0.20.143"; destination host is: "ip-10-0-20-144.ec2.internal":50070;
The hadoop core-site.xml also has the correct proxyuser details for oozie user.
Really, dont know where it is going wrong?? :(

I will answer in case someone will google up this page.
In my case the cause was in using http address for Name Node.
You should check your job configuration and if there stays something like:
nameNode=yourhostname:50070
You should change it to something like this:
nameNode=hdfs://yourhostname:8020
Check your ports first of course!
Please notice that jobTracker parameter has different notation. In my case it's:
jobTracker=yourhostname:8021
and it works fine.
Hope it helps to someone.

Related

Docker container failed to start when deploying to Google Cloud Run

I am new to GCP, and am trying to teach myself by deploying a simple R script in a Docker container that connects to BigQuery and writes the system time. I am following along with this great tutorial: https://arbenkqiku.github.io/create-docker-image-with-r-and-deploy-as-cron-job-on-google-cloud
So far, I have:
1.- Made the R script
library(bigrquery)
library(tidyverse)
bq_auth("/home/rstudio/xxxx-xxxx.json", email="xxxx#xxxx.com")
project = "xxxx-project"
dataset = "xxxx-dataset"
table = "xxxx-table"
job = insert_upload_job(project=project, data=dataset, table=table, write_disposition="WRITE_APPEND",
values=Sys.time() %>% as_tibble(), billing=project)
2.- Made the Dockerfile
FROM rocker/tidyverse:latest
RUN R -e "install.packages('bigrquery', repos='http://cran.us.r-project.org')"
ADD xxxx-xxxx.json /home/rstudio
ADD big-query-tutorial_forQuestion.R /home/rstudio
EXPOSE 8080
CMD ["Rscript", "/home/rstudio/big-query-tutorial.R", "--host", "0.0.0.0"]
3.- Successfully run the container locally on my machine, with the system time being written to BigQuery
4.- Pushed the container to my container registry in Google Cloud
When I try to deploy the container to Cloud Run (fully managed) using the Google Cloud Console I get this error: 
"Cloud Run error: Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable. Logs for this revision might contain more information. Logs URL:https://console.cloud.google.com/logs/xxxxxx"
When I review the log, I find these noteworthy entries:
1.- A warning that says "Container called exit(0)"
2.- Two bugs that say "Container Sandbox: Unsupported syscall setsockopt(0x8,0x0,0xb,0x3e78729eedc4,0x4,0x8). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/c/linux/amd64/setsockopt for more information."
When I check BigQuery, I find that the system time was written to the table, even though the container failed to deploy.
When I use the port specified in the tutorial (8787) Cloud Run throws an error about an "Invalid ENTRYPOINT".
What does this error mean? How can it be fixed? I'd greatly appreciate input as I'm totally stuck!
Thank you!
H.
The comment of John is the right source of the errors: You need to expose a webserver which listen on the $PORT and answer to HTTP/1 HTTP/2 protocols.
However, I have a solution. You can use Cloud Build for this. Simply define your step with your container name and the args if needed
Let me know if you need more guidance on this (strange) workaround.
Log information "Container Sandbox: Unsupported syscall setsockopt" from Google Cloud Run is documented as an issue for gVisor.

Grakn Error; trying to load schema for "phone calls" example

I am trying to run the example grakn migration "phone_calls" (using python and JSON files).
Before reaching there, I need to load the schema, but I am having trouble with getting the schema loaded, as shown here: https://dev.grakn.ai/docs/examples/phone-calls-schema
System:
-Mac OS 10.15
-grakn-core 1.8.3
-python 3.7.3
The grakn server is started. I checked and the 48555 TCP port is open, so I don't think there is any firewall issue. The schema file is in the same folder (phone_calls) as where the json data files is, for the next step. I am using a virtual environment. The error is below:
(project1_env) (base) tiffanytoor1#MacBook-Pro-2 onco % grakn server start
Storage is already running
Grakn Core Server is already running
(project1_env) (base) tiffanytoor1#MacBook-Pro-2 onco % grakn console --keyspace phone_calls --file phone_calls/schema.gql
Unable to create connection to Grakn instance at localhost:48555
Cause: io.grpc.StatusRuntimeException
UNKNOWN: Could not reach any contact point, make sure you've provided valid addresses (showing first 1, use getErrors() for more: Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=5f59fd46): com.datastax.oss.driver.api.core.connection.ConnectionInitException: [JanusGraph Session|control|connecting...] init query OPTIONS: error writing ). Please check server logs for the stack trace.
I would appreciate any help! Thanks!
Nevermind -- I found the solution, in case any one else runs into a similar problem. The server configuration file needs to be edited: point the data directory to your project data files (here: the phone_calls data files) & change the server IP address to your own.

Airflow SFTPHook - No hostkey for host found

I'm trying to use the Airflow SFTPHook by passing in a ssh_conn_id and I'm getting an error:
No hostkey for host myhostname found.
Using the SFTPOperator with the same ssh_conn_id however is working fine. How can I resolve this error?
Just had this issue, the simple trick is to keep your SSH connector inside airflow and to add the following in the "Extra" field :
{"no_host_key_check": true}
Hope it helps !
Edit : Indeed, it allows the man-in-the-middle attack, so even if it helps temporarily, you should get the ssh fingerprint and allow it
The SFTPOperators uses SSHHook. Hence, you should use SSHHook instead.

AWS CodeBuild failure on getting source

I have CodeBuild project that works fine.
Trying to use it in CodePipeline and it failure with empty Repository and Submitter.
Failure logs are simple as:
01:34:17
[Container] 2018/03/08 01:34:10 Waiting for agent ping

01:34:17
[Container] 2018/03/08 01:34:12 Waiting for DOWNLOAD_SOURCE
There are no any settings to adjust CodeBuild phase anywhere.
How can I fix/customise it?
Recreate the build project from within CodePipeline, so it receives the source code from the provider called "CodePipeline".
Source of the information: https://apassionatechie.wordpress.com/2018/02/08/codebuild-aws-from-codepipeline-aws/
Just if somebody would need an answer.
The issue was in not precise file naming for CodeBuild stage where CodeDeploy in it's turn won't be able to pull the ZIP file.
As a fix I've added an extra command to builspec.yml
post_build:
commands:
- zip -r Application.zip target/Application-0.0.1.war

RHadoop Stream Job Fail with Apache Oozie

I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via cli, and all of the R environment variables (fetched with Sys.getenv()) are the same, excepting there's some additional class path stuff set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overwrote the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When ran with Oozie, the command printed in the stderr fails with an exit status of 1. When I run the command on user#host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from cli.

Resources