Data ingestion task: Hadoop running locally instead of on the remote Hadoop EMR cluster

I have set up a multi-node Druid cluster with:
1) 1 node running the coordinator and overlord (m4.xl)
2) 2 nodes each running both a historical and a middle manager (r3.2xl)
3) 1 node running the broker (r3.2xl)
I also have an EMR cluster running that I want to use for ingestion tasks. The problem is that whenever I submit a job via curl, it always starts as a local Hadoop job on the middle managers instead of being submitted to the remote EMR cluster. My data lies in S3, and S3 is also configured as deep storage.
I have also copied all the jars from EMR master to hadoop-dependencies/hadoop-client/2.7.3/
Druid version: 0.9.2
EMR version: 5.2
Please find below the indexing job spec, common runtime properties, and middle manager runtime properties.
Q1) How do I get the job to submit to the remote EMR cluster?
Q2) Logs for the indexing task are not showing up on overlord:8090. How do I enable them?
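For reference, this is roughly how I submit the task to the overlord (the host below is a placeholder):
# Submit the indexing task spec to the overlord's task endpoint.
curl -X POST -H 'Content-Type: application/json' -d @data_index.json http://<overlord-host>:8090/druid/indexer/v1/task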
File: data_index.json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://<kjcnskd>smallTest"
      }
    },
    "dataSchema": {
      "dataSource": "multi_value_test_01",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": [
          "2011-09-12/2017-09-13"
        ]
      },
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "tsv",
          "delimiter": "\u0001",
          "listDelimiter": "|",
          "columns": [
            "article_type",
            "brand",
            "gender",
            "brand_type",
            "master_category",
            "supply_type",
            "business_unit",
            "testdim",
            "date",
            "week",
            "month",
            "year",
            "style_id",
            "live_styles",
            "non_live_styles",
            "broken_style",
            "new_season_styles",
            "live_styles_qty",
            "non_live_styles_qty",
            "broken_style_qty",
            "new_season_styles_qty"
          ],
          "dimensionsSpec": {
            "dimensions": [
              "article_type",
              "brand",
              "gender",
              "brand_type",
              "master_category",
              "supply_type",
              "business_unit",
              "testdim",
              "week",
              "month",
              "year",
              "style_id"
            ]
          },
          "timestampSpec": {
            "column": "date",
            "format": "yyyyMMdd"
          }
        }
      },
      "metricsSpec": [
        {
          "name": "live_styles",
          "type": "doubleSum",
          "fieldName": "live_styles"
        },
        {
          "name": "non_live_styles",
          "type": "doubleSum",
          "fieldName": "non_live_styles"
        },
        {
          "name": "broken_style",
          "type": "doubleSum",
          "fieldName": "broken_style"
        },
        {
          "name": "new_season_styles",
          "type": "doubleSum",
          "fieldName": "new_season_styles"
        },
        {
          "name": "live_styles_qty",
          "type": "doubleSum",
          "fieldName": "live_styles_qty"
        },
        {
          "name": "broken_style_qty",
          "type": "doubleSum",
          "fieldName": "broken_style_qty"
        },
        {
          "name": "new_season_styles_qty",
          "type": "doubleSum",
          "fieldName": "new_season_styles_qty"
        }
      ]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "fs.s3.awsAccessKeyId": "XXXXXXXXXXXXXX",
        "fs.s3.awsSecretAccessKey": "XXXXXXXXXXXXXX",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "XXXXXXXXXXXXXX",
        "fs.s3n.awsSecretAccessKey": "XXXXXXXXXXXXXX",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      }
    }
  }
}
File: common.runtime.properties
#
# Licensed to Metamarkets Group Inc. (Metamarkets) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. Metamarkets licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#
# Extensions
#
# This is not the full list of Druid extensions, but common ones that people often use. You may need to change this list
# based on your particular setup.
druid.extensions.loadList=["druid-kafka-eight", "druid-s3-extensions", "druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "mysql-metadata-storage"]
# If you have a different version of Hadoop, place your Hadoop client jar files in your hadoop-dependencies directory
# and uncomment the line below to point to your directory.
druid.extensions.hadoopDependenciesDir=hadoop-dependencies/hadoop-client/2.7.3
#
# Logging
#
# Log all runtime properties on startup. Disable to avoid logging properties on startup:
druid.startup.logging.logProperties=true
#
# Zookeeper
#
druid.zk.service.host=10.0.1.152
druid.zk.paths.base=/druid
#
# Metadata storage
#
# For Derby server on your Druid Coordinator (only viable in a cluster with a single Coordinator, no fail-over):
#druid.metadata.storage.type=derby
#druid.metadata.storage.connector.connectURI=jdbc:derby://metadata.store.ip:1527/var/druid/metadata.db;create=true
#druid.metadata.storage.connector.host=metadata.store.ip
#druid.metadata.storage.connector.port=1527
# For MySQL:
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://10.0.1.140:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=druid123
# For PostgreSQL (make sure to additionally include the Postgres extension):
#druid.metadata.storage.type=postgresql
#druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
#druid.metadata.storage.connector.user=...
#druid.metadata.storage.connector.password=...
#
# Deep storage
#
# For local disk (only viable in a cluster if this is a network mount):
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
# For HDFS (make sure to include the HDFS extension and that your Hadoop config files in the cp):
#druid.storage.type=hdfs
#druid.storage.storageDirectory=/druid/segments
# For S3:
druid.storage.type=s3
druid.storage.bucket=asfvdcs
druid.storage.baseKey=druid/segments
druid.s3.accessKey=XXXXXXXXXXXX
druid.s3.secretKey=XXXXXXXXXXXX
#
# Indexing service logs
#
# For local disk (only viable in a cluster if this is a network mount):
druid.indexer.logs.type=file
druid.indexer.logs.directory=var/druid/indexing-logs
# For HDFS (make sure to include the HDFS extension and that your Hadoop config files in the cp):
#druid.indexer.logs.type=hdfs
#druid.indexer.logs.directory=/druid/indexing-logs
# For S3:
#druid.indexer.logs.type=s3
#druid.indexer.logs.s3Bucket=testashutosh
#druid.indexer.logs.s3Prefix=druid/indexing-logs
#
# Service discovery
#
druid.selectors.indexing.serviceName=druid/overlord
druid.selectors.coordinator.serviceName=druid/coordinator
#
# Monitoring
#
druid.monitoring.monitors=["com.metamx.metrics.JvmMonitor"]
druid.emitter=logging
druid.emitter.logging.logLevel=info
File: middle manager runtime.properties
druid.service=druid/middleManager
druid.port=8091
# Number of tasks per middleManager
druid.worker.capacity=3
# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task
# HTTP server threads
druid.server.http.numThreads=25
# Processing threads and buffers
druid.processing.buffer.sizeBytes=536870912
druid.processing.numThreads=2
# Hadoop indexing
druid.indexer.task.hadoopWorkingPath=hdfs://ip-10-0-1-xxx.ap-southeast-1.compute.internal:8020/tmp/druid-indexing
druid.indexer.task.defaultHadoopCoordinates=["org.apache.hadoop:hadoop-client:2.7.3"]
druid.indexer.runner.type=remote

You need to tell Druid about the Hadoop cluster. To quote the manual:
Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into conf/druid/_common/core-site.xml, conf/druid/_common/hdfs-site.xml, and so on.
If you have already done that, then it would indicate an issue with one of the config files (this happened to me).
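For example, with the config files sitting on the EMR master, something like the following pulls them onto a Druid node (a sketch; the hostname is a placeholder and /etc/hadoop/conf is the usual EMR location, adjust as needed):
# Copy the Hadoop client configs from the EMR master into Druid's common config directory.
for f in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml; do
  scp hadoop@<emr-master>:/etc/hadoop/conf/$f conf/druid/_common/$f
done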

Related

AWS Serverless how to use "sam local start-api" to debug .net core 3.1 applications

I would like to start the serverless application locally and then debug it using Visual Studio. I see the command line arguments --debug-port, --debugger-path, --debug-args and --debug-function, but no example of how these can be used for .NET Core.
This is what I'm using for Visual Studio Code. I'm on Windows using .NET Core 3.1.
Firstly, I had to download the Linux vsdbg debug files (yes, Linux, as these files will be mounted in the SAM docker container)
https://vsdebugger.azureedge.net/vsdbg-17-0-10712-2/vsdbg-linux-x64.tar.gz
Extract them into a folder, e.g. C:\vsdbg
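If it helps, the download and extraction can be scripted (a sketch, assuming curl and tar are available, e.g. in Git Bash, where /c/vsdbg maps to C:\vsdbg):
# Download the Linux vsdbg bundle and extract it into the folder that SAM will mount.
curl -L -o vsdbg-linux-x64.tar.gz https://vsdebugger.azureedge.net/vsdbg-17-0-10712-2/vsdbg-linux-x64.tar.gz
mkdir -p /c/vsdbg
tar -xzf vsdbg-linux-x64.tar.gz -C /c/vsdbg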
I have a task to launch SAM. My tasks.json looks like:
{
  "version": "2.0.0",
  "tasks": [{
    "label": "sam local api",
    "type": "shell",
    "command": "sam",
    "args": [
      "local",
      "start-api",
      "-d", "5858",
      "--template", "${workspaceFolder}/template.yaml",
      "--debugger-path", "C:\\vsdbg",
      "--warm-containers", "EAGER"
    ],
  }]
}
IMPORTANT:
** --debugger-path points to the Linux debug files folder; the SAM CLI will mount the files for you.
** I had to use --warm-containers EAGER to keep the container from closing after every request.
launch.json looks like this:
{
  "name": "sam local api attach",
  "type": "coreclr",
  "processName": "dotnet",
  "request": "attach",
  "pipeTransport": {
    "pipeCwd": "${workspaceFolder}",
    "pipeProgram": "powershell",
    "pipeArgs": [
      "-c",
      "docker exec -i $(docker ps -q -f publish=5858) ${debuggerCommand}"
    ],
    "debuggerPath": "/tmp/lambci_debug_files/vsdbg",
    "quoteArgs": false
  },
  "sourceFileMap": {
    "/var/task": "${workspaceFolder}"
  }
},
This bit, $(docker ps -q -f publish=5858), gets the ID of your Docker container by filtering on the port that you're using.
This took quite a bit of fiddling to get working; I'm surprised it isn't easier, or at least better documented.
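Before attaching, you can sanity-check that the container is up and that the debugger files were mounted, using the same port filter as launch.json (a sketch):
# List the container publishing port 5858, then confirm vsdbg is present inside it.
docker ps -f publish=5858
docker exec -i $(docker ps -q -f publish=5858) ls /tmp/lambci_debug_files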

How to untar a file second time using file spec explode:true on a Jenkins job

I'm using the file spec below in a Jenkins job to download and untar an artifact. It untars it, but that tar contains 2 other tar files and I need to untar those 2 as well. How can I apply explode: true a second time?
{
  "files": [
    {
      "pattern": "sys /${RELEASE}/${BUILD_NUMBER}/samples*.tar.gz",
      "target": "${PATH}",
      "explode": true,
      "target": " server.tar.gz",
      "explode": true
    }
  ]
}

How to make Arduino work and not give "cannot open source file "avr/pgmspace.h"" on vscode?

I'm trying to program an Arduino in VS Code. The problem is that it's giving me weird header errors:
cannot open source file "avr/pgmspace.h" (dependency of "C:\Program Files (x86)\Arduino\hardware\arduino\avr\cores\arduino\Arduino.h")
This is my arduino.json:
"board": "arduino:avr:uno"
}
This is my c_cpp_properties.json:
{
  "configurations": [
    {
      "name": "Win32",
      "includePath": [
        "C:\\Program Files (x86)\\Arduino\\tools\\**",
        "C:\\Program Files (x86)\\Arduino\\hardware\\arduino\\avr\\**"
      ],
      "forcedInclude": [
        "C:\\Program Files (x86)\\Arduino\\hardware\\arduino\\avr\\cores\\arduino\\Arduino.h"
      ],
      "intelliSenseMode": "msvc-x64",
      "compilerPath": "C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx64/x64/cl.exe",
      "cStandard": "c11",
      "cppStandard": "c++17"
    }
  ],
  "version": 4
}
It should recursively include all of the required libraries, and even if I manually try to add the path to avr/pgmspace.h or its directory, it keeps giving me the same errors.
How do I solve this error?
Based on the responses to this issue, I added "c:\\Program Files (x86)\\Arduino\\hardware\\tools\\avr\\avr\\**" to my include path in c_cpp_properties.json:
"includePath": [
"c:\\Program Files (x86)\\Arduino\\tools\\**",
"c:\\Program Files (x86)\\Arduino\\hardware\\arduino\\avr\\**",
"c:\\Program Files (x86)\\Arduino\\hardware\\tools\\avr\\avr\\**"
],
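To double-check where pgmspace.h actually lives under the Arduino installation, you can search for it (a sketch for a Git Bash / WSL-style shell; adjust the install root to match your machine):
# Locate pgmspace.h so you know which directory must be covered by includePath.
find "/c/Program Files (x86)/Arduino" -name pgmspace.h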
For anyone who stumbles here on a Mac, here is the config that works for me with the Uno:
{
  "env": {
    "arduino_path": "/Applications/Arduino.app/Contents/Java",
    "arduino_avr_include_path": "${env:arduino_path}/hardware/arduino/avr",
    "arduino_avr_include2_path": "${env:arduino_path}/hardware/tools/avr/avr/include",
    "arduino_avr_compiler_path": "${env:arduino_path}/hardware/tools/avr/bin/avr-g++"
  },
  "configurations": [
    {
      "name": "Mac",
      "defines": [
        "ARDUINO=10810",
        "__AVR_ATmega328P__",
        "UBRRH"
      ],
      "includePath": [
        "${workspaceRoot}",
        "${env:arduino_avr_include_path}/**",
        "${env:arduino_avr_include2_path}/**"
      ],
      "forcedInclude": [
        "/Applications/Arduino.app/Contents/Java/hardware/arduino/avr/cores/arduino/Arduino.h"
      ],
      "intelliSenseMode": "gcc-x64",
      "cStandard": "c11",
      "cppStandard": "c++11",
      "compilerPath": "${env:arduino_avr_compiler_path} -std=gnu++11 -mmcu=atmega328p"
    }
  ],
  "version": 4
}
The key to finding <avr/pgmspace.h> was adding the hardware/tools/avr/avr/include path.
Defining ARDUINO=10810 was identified from the output of running Arduino: Verify with the verbose flag.
Defining __AVR_ATmega328P__ was added to allow proper IntelliSense completion of the raw register macros (e.g. _BV(), OCR0A, TIMSK0, etc.); the correct macro to define was identified based on the chip being an ATMEGA328P-PU and by inspection of the file hardware/tools/avr/avr/include/avr/io.h.
The compilerPath value looks wrong, although it is only used by the IDE and not for target compilation. The documentation says:
The absolute path to the compiler you use to build your project. The extension will query the compiler to determine the system include paths and default defines to use for IntelliSense.
In any case I recommend configuring it properly; I was able to remove
"${HOME}/.arduino15/packages/arduino/tools/avr-gcc/5.4.0-atmel3.6.1-arduino2/bin/../lib/gcc/avr/5.4.0/include",
"${HOME}/.arduino15/packages/arduino/tools/avr-gcc/5.4.0-atmel3.6.1-arduino2/bin/../lib/gcc/avr/5.4.0/include-fixed",
"${HOME}/.arduino15/packages/arduino/tools/avr-gcc/5.4.0-atmel3.6.1-arduino2/bin/../lib/gcc/avr/5.4.0/../../../../avr/include"
when setting
"compilerPath": "${HOME}/.arduino15/packages/arduino/tools/avr-gcc/5.4.0-atmel3.6.1-arduino2/bin/avr-g++ -std=gnu++11 -mmcu=atmega328p",
To figure out the exact compiler that is used, turn on verbose logging in the output window:
File -> Preferences -> Settings -> Extensions -> Arduino configuration -> Log level -> verbose
which in this case should also help you figure out why the compiler does not pick up avr/pgmspace.h.
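To see the system include paths the extension would discover from that compiler, you can also query it directly (a sketch, assuming avr-g++ is on your PATH):
# Ask the compiler for its built-in include search paths; they are printed between
# "#include <...> search starts here:" and "End of search list." on stderr.
avr-g++ -x c++ -E -v - < /dev/null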
Here are my arduino.json
{
  "board": "arduino:avr:uno",
  "port": "/dev/ttyUSB0",
  "sketch": "src/myproject.ino",
  "output": "../build"
}
and c_cpp_properties.json
{
  "env": {
    "arduino.path": "${HOME}/.arduino15/packages/arduino",
    "arduino.avr.include.path": "${env:arduino.path}/hardware/avr/1.6.23",
    "arduino.avr.compiler.path": "${env:arduino.path}/tools/avr-gcc/5.4.0-atmel3.6.1-arduino2/bin/avr-g++",
    "arduino.libraries.path": "${HOME}/arduino/sketchbook/libraries",
    "dummy-last-line": "To allow the second to last line (e.g. the real last line) to always end with a comma"
  },
  "configurations": [
    {
      "name": "Linux",
      "includePath": [
        "./src",
        "./test",
        "../arduino_ci/cpp/unittest",
        "${env:arduino.libraries.path}/SmartLCD",
        "${env:arduino.libraries.path}/Chronos/src",
        "${env:arduino.libraries.path}/Time",
        "${env:arduino.libraries.path}/RTClib",
        "${env:arduino.avr.include.path}/libraries/Wire/src",
        "${env:arduino.avr.include.path}/cores/arduino",
        "${env:arduino.avr.include.path}/variants/standard"
      ],
      "browse": {
        "path": [
          "${workspaceFolder}/src"
        ],
        "limitSymbolsToIncludedHeaders": true
      },
      "defines": [
        "UBRRH"
      ],
      "intelliSenseMode": "gcc-x64",
      "compilerPath": "${env:arduino.avr.compiler.path} -std=gnu++11 -mmcu=atmega328p",
      "cStandard": "c11",
      "cppStandard": "c++11"
    }
  ],
  "version": 4
}
(the UBRRH define is for the Serial variable in HardwareSerial.h)

How to fix VirtualBox redhat-7 eth0 ONBOOT=no connectivity issue with vboxmanage tools?

I am creating a Red Hat VirtualBox box with Packer, using the template attached below. Everything is fine except that when the host is created and rebooted, the eth0 network adapter does not start, because it is created with ONBOOT=no in /etc/sysconfig/network-scripts. If I open the UI of the box and manually run ifup eth0, it starts fine, SSH becomes available, and the process completes as expected. However, I need to use this in a Jenkins pipeline, so there is no option for someone to go and start the network interface manually. The question: is there any way to change the ONBOOT option to yes for the network adapter with VBoxManage commands, or to trigger the ifup eth0 command somehow? Either option would solve the problem.
{
  "variables": {
    "build_base": ".",
    "isref_machine": "create-ova-caf",
    "build_name": "virtual-box-jenkins",
    "output_name": "packer-virtual-box",
    "disk_size": "40000",
    "ram": "1024",
    "disk_adapter": "ide"
  },
  "builders": [
    {
      "name": "{{user `build_name`}}",
      "type": "virtualbox-iso",
      "guest_os_type": "Other_64",
      "iso_url": "rhelis74_1710051533.iso",
      "iso_checksum": "",
      "iso_checksum_type": "none",
      "hard_drive_interface": "{{user `disk_adapter`}}",
      "ssh_username": "root",
      "ssh_password": "Secret1.0",
      "shutdown_command": "shutdown -P now",
      "guest_additions_mode": "disable",
      "boot_wait": "3s",
      "boot_command": ["auto<enter>"],
      "ssh_timeout": "40m",
      "headless": "true",
      "vm_name": "{{user `output_name`}}",
      "disk_size": "{{user `disk_size`}}",
      "output_directory": "{{user `build_base`}}/output-{{build_name}}",
      "format": "ovf",
      "vrdp_bind_address": "0.0.0.0",
      "vboxmanage": [
        ["modifyvm", "{{.Name}}", "--nictype1", "virtio"],
        ["modifyvm", "{{.Name}}", "--memory", "{{ user `ram`}}"]
      ],
      "skip_export": true,
      "keep_registered": true
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": ["ls"]
    }
  ]
}
To change the network interface boot setting to onboot=yes, you need to create an Anaconda kickstart script (or copy one from an existing machine and adjust the configuration in it) and pass it via the boot command:
"boot_command": [ "<esc><wait>",
"vmlinuz initrd=initrd.img net.ifnames=0 biosdevname=0 ",
"ks=hd:fd0:/anaconda-ks.cfg",
"<enter>"
],
and in the kickstart file set:
network --bootproto=dhcp --device=eth0 --onboot=on --ipv6=auto --activate
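On an already-installed RHEL machine the generated kickstart lives at /root/anaconda-ks.cfg, so one way to prepare the file referenced above is to copy it and flip the onboot flag (a sketch; the sed assumes the existing line says --onboot=off, adjust if yours differs):
# Reuse an existing kickstart and make eth0 come up at boot.
cp /root/anaconda-ks.cfg ./anaconda-ks.cfg
sed -i 's/--onboot=off/--onboot=on/' anaconda-ks.cfg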

Upload Image on TryStack Server using Packer tool

I am trying to create and upload an Ubuntu-based image to the TryStack server using Packer, on Windows. I have created a sample template and a script file for setting environment variables; provisioning is done with Chef. But when I run the packer build command I get:
1 error(s) occurred:
* Get /: unsupported protocol scheme ""
What am I missing here?
Here are the template and script files:
template.json
{
  "builders": [
    {
      "type": "openstack",
      "ssh_username": "root",
      "image_name": "sensor-cloud",
      "source_image": "66a14661-2dfb-4370-b6d4-87aaefcffdce",
      "flavor": "3",
      "availability_zone": "nova",
      "security_groups": ["mySecurityGroup"]
    }
  ],
  "provisioners": [
    {
      "type": "file",
      "source": "sensorCloudCookbook.zip",
      "destination": "/tmp/sensorCloudCookbook.zip"
    },
    {
      "type": "shell",
      "inline": [
        "curl -L https://www.opscode.com/chef/install.sh | bash"
      ],
      "execute_command": "chmod +x {{ .Path }}; sudo -E {{ .Path }}"
    },
    {
      "type": "shell",
      "inline": [
        "unzip /tmp/sensorCloudCookbook.zip -d /tmp/sensorCloudCookbook"
      ],
      "execute_command": "chmod +x {{ .Path }}; sudo -E {{ .Path }}"
    },
    {
      "type": "shell",
      "inline": [
        "chef-solo -c /tmp/sensorCloudCookbook/solo.rb -l info -L /tmp/sensorCloudLogs.txt"
      ],
      "execute_command": "chmod +x {{ .Path }}; sudo -E {{ .Path }}"
    }
  ]
}
openstack-config.sh
#!/bin/bash
# To use an OpenStack cloud you need to authenticate against the Identity
# service named keystone, which returns a **Token** and **Service Catalog**.
# The catalog contains the endpoints for all services the user/tenant has
# access to - such as Compute, Image Service, Identity, Object Storage, Block
# Storage, and Networking (code-named nova, glance, keystone, swift,
# cinder, and neutron).
#
# *NOTE*: Using the 2.0 *Identity API* does not necessarily mean any other
# OpenStack API is version 2.0. For example, your cloud provider may implement
# Image API v1.1, Block Storage API v2, and Compute API v2.0. OS_AUTH_URL is
# only for the Identity API served through keystone.
export OS_AUTH_URL=http://128.136.179.2:5000/v2.0
# With the addition of Keystone we have standardized on the term **tenant**
# as the entity that owns the resources.
export OS_TENANT_ID=trystack_tenant_id
export OS_TENANT_NAME="trystack_tenant_name"
export OS_PROJECT_NAME="trystack_project_name"
# In addition to the owning entity (tenant), OpenStack stores the entity
# performing the action as the **user**.
export OS_USERNAME="same_as_trystack_tenant_name"
# With Keystone you pass the keystone password.
echo "Please enter your OpenStack Password: "
read -sr OS_PASSWORD_INPUT
export OS_PASSWORD=$OS_PASSWORD_INPUT
# If your configuration has multiple regions, we set that information here.
# OS_REGION_NAME is optional and only valid in certain environments.
export OS_REGION_NAME="RegionOne"
# Don't leave a blank variable, unset it if it was empty
if [ -z "$OS_REGION_NAME" ]; then unset OS_REGION_NAME; fi
You need to source openstack-config.sh before running packer build.
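In other words, something like this, run from the directory containing both files (a minimal sketch):
# Load the OpenStack credentials into the current shell, then run the build.
source openstack-config.sh
packer build template.json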
