Running Spark app on Cloudera 5 in YARN mode

I have been trying to spark-submit to my Cloudera cluster for a few weeks. I really hope someone out there knows how this works.
I created a script which calls spark-submit with all the required arguments. The screen dumps out the following lines:
Using properties file: null
Using properties file: null
Parsed arguments:
master yarn
deployMode cluster
executorMemory null
executorCores null
totalExecutorCores null
propertiesFile null
driverMemory null
driverCores null
driverExtraClassPath /home/bruce/workspace1/spark-cloudera/yarn/stable/target/spark-yarn_2.10-1.0.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-yarn-client/2.3.0-cdh5.1.0/hadoop-yarn-client-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-common/2.3.0-cdh5.1.0/hadoop-common-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-yarn-api/2.3.0-cdh5.1.0/hadoop-yarn-api-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-yarn-common/2.3.0-cdh5.1.0/hadoop-yarn-common-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/org/apache/hadoop/hadoop-auth/2.3.0-cdh5.1.0/hadoop-auth-2.3.0-cdh5.1.0.jar:/home/bruce/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar
driverExtraLibraryPath null
driverExtraJavaOptions null
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass org.apache.spark.examples.SparkPi
primaryResource file:/home/bruce/workspace1/spark-cloudera/examples/target/scala-2.10/spark-examples-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar
name org.apache.spark.examples.SparkPi
childArgs [10]
jars null
verbose true
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
The call gets stuck for a very long time and then quits with "Connection refused".
What I don't understand is that the arguments specify using the YARN client, but nowhere do they indicate how to contact the YARN ResourceManager: not the IP, not the port. The submission is made on my laptop, and the cluster is on the neighboring subnet. How does spark-submit figure out how to contact the YARN service?

From the Spark documentation:
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory
which contains the (client side) configuration files for the Hadoop
cluster. These configs are used to write to the dfs and connect to the
YARN ResourceManager.
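In other words, spark-submit never takes the ResourceManager address on the command line; it reads it from the Hadoop client configs it finds through those environment variables. A rough sketch of what that can look like on the submitting laptop (the paths and host below are placeholders, not values from the question):

# Copy the cluster's client configs (core-site.xml, yarn-site.xml, hdfs-site.xml)
# from a gateway node to the laptop, e.g. into /etc/hadoop/conf. yarn-site.xml
# must carry yarn.resourcemanager.address (default port 8032) pointing at the
# cluster's ResourceManager host.
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /path/to/spark-examples.jar 10

If those files are missing or still point at a default address, the YARN client ends up retrying a local/0.0.0.0 ResourceManager, which would be consistent with the long hang followed by "Connection refused".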

Related

Cannot setup a MySQL Backend for Airflow LocalExecutor

I need to run DAGs in parallel but do not need significant scaling, so LocalExecutor can do the job just fine. I looked through the Airflow docs and first created a MySQL database:
CREATE DATABASE airflow_db CHARACTER SET utf8;
CREATE USER <user> IDENTIFIED BY <pass>;
GRANT ALL PRIVILEGES ON airflow_db.* TO <user>;
I then modified the following parameters in the airflow.cfg file:
executor = LocalExecutor
sql_alchemy_conn = mysql+mysqlconnector://<user>:<pass>@localhost:3306/airflow_db
When I run airflow db init, I run into the following error message:
AttributeError: 'MySQLConverter' object has no attribute '_dagruntype_to_mysql'
During handling of the above exception, another exception occurred:
TypeError: Python 'dagruntype' cannot be converted to a MySQL type
Please note that nothing else in the airflow.cfg file was altered, and that using the default SequentialExecutor with SQLite lets everything run just fine. Also note that I am using Airflow version 2.2.0.
I found the solution to my own question. Instead of using the mysqlconnector, I used the pymysql driver:
pip install PyMySQL
The airflow.cfg parameters can then be adjusted as follows:
sql_alchemy_conn = mysql+pymysql://<user>:<pass>@localhost:3306/airflow_db
All else can stay the same.
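For completeness, a minimal sketch of the whole sequence (credentials and host are placeholders; airflow db check is optional and just confirms the new connection string is reachable):

pip install PyMySQL
# in airflow.cfg:
#   executor = LocalExecutor
#   sql_alchemy_conn = mysql+pymysql://<user>:<pass>@localhost:3306/airflow_db
airflow db check
airflow db init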

Docker container failed to start when deploying to Google Cloud Run

I am new to GCP, and am trying to teach myself by deploying a simple R script in a Docker container that connects to BigQuery and writes the system time. I am following along with this great tutorial: https://arbenkqiku.github.io/create-docker-image-with-r-and-deploy-as-cron-job-on-google-cloud
So far, I have:
1.- Made the R script
library(bigrquery)
library(tidyverse)
bq_auth("/home/rstudio/xxxx-xxxx.json", email="xxxx@xxxx.com")
project = "xxxx-project"
dataset = "xxxx-dataset"
table = "xxxx-table"
job = insert_upload_job(project=project, data=dataset, table=table, write_disposition="WRITE_APPEND",
values=Sys.time() %>% as_tibble(), billing=project)
2.- Made the Dockerfile
FROM rocker/tidyverse:latest
RUN R -e "install.packages('bigrquery', repos='http://cran.us.r-project.org')"
ADD xxxx-xxxx.json /home/rstudio
ADD big-query-tutorial_forQuestion.R /home/rstudio
EXPOSE 8080
CMD ["Rscript", "/home/rstudio/big-query-tutorial.R", "--host", "0.0.0.0"]
3.- Successfully ran the container locally on my machine, with the system time being written to BigQuery
4.- Pushed the container to my container registry in Google Cloud
When I try to deploy the container to Cloud Run (fully managed) using the Google Cloud Console I get this error: 
"Cloud Run error: Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable. Logs for this revision might contain more information. Logs URL:https://console.cloud.google.com/logs/xxxxxx"
When I review the log, I find these noteworthy entries:
1.- A warning that says "Container called exit(0)"
2.- Two bugs that say "Container Sandbox: Unsupported syscall setsockopt(0x8,0x0,0xb,0x3e78729eedc4,0x4,0x8). It is very likely that you can safely ignore this message and that this is not the cause of any error you might be troubleshooting. Please, refer to https://gvisor.dev/c/linux/amd64/setsockopt for more information."
When I check BigQuery, I find that the system time was written to the table, even though the container failed to deploy.
When I use the port specified in the tutorial (8787) Cloud Run throws an error about an "Invalid ENTRYPOINT".
What does this error mean? How can it be fixed? I'd greatly appreciate input as I'm totally stuck!
Thank you!
H.
John's comment points at the real source of the error: Cloud Run expects you to expose a web server that listens on $PORT and answers HTTP/1 or HTTP/2 requests.
However, there is a workaround: you can use Cloud Build for this instead. Simply define a build step with your container name and the args if needed.
Let me know if you need more guidance on this (strange) workaround.
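A rough sketch of that workaround, assuming the image was pushed as gcr.io/my-project/bigquery-r-job (a placeholder name, substitute your own image path):

cat > cloudbuild.yaml <<'EOF'
steps:
  # Run the R container to completion as a Cloud Build step instead of a
  # long-lived Cloud Run service, so nothing has to listen on $PORT.
  - name: 'gcr.io/my-project/bigquery-r-job'
EOF
gcloud builds submit --no-source --config cloudbuild.yaml

Cloud Build simply runs the container until it exits, which matches a script that writes once to BigQuery and stops.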
Log information "Container Sandbox: Unsupported syscall setsockopt" from Google Cloud Run is documented as an issue for gVisor.

Grakn Error; trying to load schema for "phone calls" example

I am trying to run the example Grakn migration "phone_calls" (using Python and JSON files).
Before getting there, I need to load the schema, but I am having trouble getting it loaded, as shown here: https://dev.grakn.ai/docs/examples/phone-calls-schema
System:
- macOS 10.15
- Grakn Core 1.8.3
- Python 3.7.3
The Grakn server is started. I checked, and TCP port 48555 is open, so I don't think there is any firewall issue. The schema file is in the same folder (phone_calls) as the JSON data files for the next step. I am using a virtual environment. The error is below:
(project1_env) (base) tiffanytoor1@MacBook-Pro-2 onco % grakn server start
Storage is already running
Grakn Core Server is already running
(project1_env) (base) tiffanytoor1@MacBook-Pro-2 onco % grakn console --keyspace phone_calls --file phone_calls/schema.gql
Unable to create connection to Grakn instance at localhost:48555
Cause: io.grpc.StatusRuntimeException
UNKNOWN: Could not reach any contact point, make sure you've provided valid addresses (showing first 1, use getErrors() for more: Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=5f59fd46): com.datastax.oss.driver.api.core.connection.ConnectionInitException: [JanusGraph Session|control|connecting...] init query OPTIONS: error writing ). Please check server logs for the stack trace.
I would appreciate any help! Thanks!
Never mind -- I found the solution, in case anyone else runs into a similar problem. The server configuration file needs to be edited: point the data directory to your project data files (here: the phone_calls data files) and change the server IP address to your own.
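A rough sketch of the kind of edit described above; the file location and key names follow a Grakn 1.8-style server/conf/grakn.properties and are assumptions to double-check against your own installation:

# In server/conf/grakn.properties (key names may differ between versions):
#   data-dir=/path/to/phone_calls/data    # point at your project's data files
#   server.host=<your-server-ip>          # the server address, set to your own
# Then restart the server and retry loading the schema:
grakn server stop
grakn server start
grakn console --keyspace phone_calls --file phone_calls/schema.gql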

AWS CodeBuild failure on getting source

I have a CodeBuild project that works fine.
When I try to use it in CodePipeline, it fails with an empty Repository and Submitter.
Failure logs are simple as:
01:34:17
[Container] 2018/03/08 01:34:10 Waiting for agent ping

01:34:17
[Container] 2018/03/08 01:34:12 Waiting for DOWNLOAD_SOURCE
There are no settings anywhere to adjust the CodeBuild phase.
How can I fix/customise it?
Recreate the build project from within CodePipeline, so it receives the source code from the provider called "CodePipeline".
Source of the information: https://apassionatechie.wordpress.com/2018/02/08/codebuild-aws-from-codepipeline-aws/
In case somebody else needs an answer:
The issue was imprecise file naming in the CodeBuild stage, which meant that CodeDeploy, in its turn, was not able to pull the ZIP file.
As a fix I've added an extra command to buildspec.yml:
post_build:
  commands:
    - zip -r Application.zip target/Application-0.0.1.war

HTTP 400 - Unable to parse remote repository npm metadata

We have two remote npm registries inside a virtual repository. One of them is the NPM Registry, the other one is from a software provider. When I add the second repository to the virtual repository, I get HTTP 400 responses at random.
For example: if I want to install a package from the npm registry, I see in the logs that Artifactory tries to get the package from the other repository (which does not have the package) and tries to parse the response as JSON. The other repository returns an HTML file, though, which results in the following error message:
2017-02-23 09:39:05,424 [http-nio-8080-exec-7112] [ERROR]
(o.a.a.n.r.NpmRemoteRepoHandler:362) - Error while parsing the response of a remote npm
JSON query on 'https://repository.domain.com/api/npm/public/file-loader':
Unexpected character ('<' (code 60)): expected a valid value (number, String, array, object,
'true', 'false' or 'null')
at [Source: org.artifactory.storage.db.binstore.service.UsageTrackingBinaryProvider$ReaderTrackingStream@7360bc6c; line: 1, column: 2]
As you can see, Artifactory is trying to get the package from the other repository. The JSON response of our Artifactory, when I try to get the package manually, is:
{
  "errors" : [ {
    "status" : 400,
    "message" : "Unable to parse remote repository npm metadata."
  } ]
}
Any help would be greatly appreciated, since this makes the NPM Registry completely useless as some requests are returning this HTTP 400 error.
FYI: we are using Artifactory Pro 4.5.1.
There are two things you should do to avoid this behavior:
1. Configure the virtual repository resolution order so the NPM registry is approached before the software provider registry. The resolution order is controlled by the order in which the repositories appear in the Selected Repositories list.
2. Use include/exclude patterns to control which packages are resolved from the software provider registry. Assuming there is a way to identify the packages that should come from the software provider, you can define patterns which limit this registry to the resolution of those packages only.
Another thing to check is whether the software provider remote repository is configured properly. Normally it should not return an HTML response for an API call.
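To check that last point from the command line, something like this should do (the registry URL is a placeholder for whatever the software provider remote repository points at):

# Request the metadata for a package straight from the provider's registry
# and look at the response headers. A misconfigured registry typically
# answers with Content-Type: text/html instead of application/json.
curl -sI '<provider-registry-url>/file-loader'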
