DCOS Cluster name change after installing - bigdata

I missed to update the cluster name (cluster_name) in my boot node's genconf/config.yaml before deploying the DC/OS cluster. I was wondering if there's a configuration/properties file in the nodes (or using dcos-cli or in etcd) that I need to change to update the cluster name string (that appears on the DC/OS UI).
I tried to use the command from Documentation of Mesosphere:
"dcos cluster rename name new-name"
i don't have any output when executing the command but nothing changes.

The only way to achieve the renaming of a DC/OS cluster where the information persists through upgrades is to update your config.yaml with the correct information. Then "re-install" DC/OS via the advanced install methods.
The DC/OS CLI will not work for this purpose. This command renames the cluster as your DC/OS CLI configuration sees it - the DC/OS cluster itself is not renamed. This is confusing until you look at this page where the text better reflects the purpose of this command.

Related

Apache airflow: install two instances in the same local machine

I have an Airflow instance in a local Ubuntu machine. This instance doesn't work very well, so I would like to install it again. The problem is that I can't delete the current instance, because it is used by other people, so I would like to create a new Airflow instance in the same machine to put various dags there.
How could I do it? I created a different virtual environment, but I don't know how to install a second airflow server in that environment, which works in parallel with the current airflow.
Thank you!
use different port for webserver
use different AIRFLOW_HOME variable
use different sql_alchemy_conn (to point to a different database)
copy the deployment you have to start/stop your airflow components.
Depending on your deployment you might somehew record process id of your running airflow (so called pid-files) or have some other way to determine which processes are running. But that is nothing airflow-specific, this is something that is specific for your deployment.

Passing files from a rocker container to a latex container within a gitlab-ci job

I would like to use Gitlab CI to compile a Latex article as explained in this answer on tex.stackexchange (a similar pdf generation example is shown in the gitlab documentation for artifacts). I use a special latex template given by the journal editor. My Latex article contains figures made with the R statistical software. R and Latex are two large software installations with a lot of dependencies so I decided to use two separate containers for the build, one for the statistical analysis and visualization with R and one to compile a Latex document to pdf.
Here is the content of .gitlab-ci.yml:
knit_rnw_to_tex:
image: rocker/verse:4.0.0
script:
- Rscript -e "knitr::knit('article.Rnw')"
artifacts:
paths:
- figure/
compile_pdf:
image: aergus/latex
script:
- ls figure
- latexmk -pdf -bibtex -use-make article.tex
artifacts:
paths:
- article.pdf
The knit_rnw_to_tex job executed in the R "rocker" container is successful and I can download the figure artifacts from the gitlab "jobs" page. The issue in the second job compile_pdf is that ls figure shows me an empty folder and the Latex article compilation fails because of missing figures.
It should be possible to use artifacts to pass data between jobs according to this answer and to this well explained forum post but they use only one container for different jobs. It doesn't work in my case. Probably because I use two different containers?
Another solution would be to use only the rocker/tidyverse container and install latexmk in there, but the installation of apt install latexmk fails for an unknown reason. Maybe because It has over hundred dependencies and that is to much for gitlab-CI?
The "dependencies" keyword could help according to that answer, but the artifacts are still not available when I use it.
How can I pass the artifacts from one job to the other?
Should I use cache as explained in docs.gitlab.com / caching?
Thank you for the comment as I wanted to be sure, how you do it. Example would help too, but I'll be generic for now (using docker).
To run multiple containers you need a
(The Docker executor)
To quote the documentation on it:
The Docker executor when used with GitLab CI, connects to Docker
Engine and runs each build in a separate and isolated container using
the predefined image that is set up in .gitlab-ci.yml and in
accordance in config.toml.
Workflow
The Docker executor divides the job into multiple steps:
Prepare: Create and start the services.
Pre-job: Clone, restore cache and download artifacts from previous stages. This is run on a special Docker image.
Job: User build. This is run on the user-provided Docker image.
Post-job: Create cache, upload artifacts to GitLab. This is run on a special Docker Image.
Your config.toml could look like this:
[runners.docker]
image = "rocker/verse:4.0.0"
builds_dir = /home/builds/rocker
[[runners.docker.services]]
name = "aergus/latex"
alias = "latex"
From above linked documentation:
The image keyword
The image keyword is the name of the Docker image that is present in the local Docker Engine (list all images with docker images) or any image that can be found at Docker Hub. For more information about images and Docker Hub please read the Docker Fundamentals documentation.
In short, with image we refer to the Docker image, which will be used to create a container on which your build will run.
If you don’t specify the namespace, Docker implies library which includes all official images. That’s why you’ll see many times the library part omitted in .gitlab-ci.yml and config.toml. For example you can define an image like image: ruby:2.6, which is a shortcut for image: library/ruby:2.6.
Then, for each Docker image there are tags, denoting the version of the image. These are defined with a colon (:) after the image name. For example, for Ruby you can see the supported tags at docker hub. If you don’t specify a tag (like image: ruby), latest is implied.
The image you choose to run your build in via image directive must have a working shell in its operating system PATH. Supported shells are sh, bash, and pwsh (since 13.9) for Linux, and PowerShell for Windows. GitLab Runner cannot execute a command using the underlying OS system calls (such as exec).
The services keyword
The services keyword defines just another Docker image that is run during your build and is linked to the Docker image that the image keyword defines. This allows you to access the service image during build time.
The service image can run any application, but the most common use case is to run a database container, e.g., mysql. It’s easier and faster to use an existing image and run it as an additional container than install mysql every time the project is built.
You can see some widely used services examples in the relevant documentation of CI services examples.
If needed, you can assign an alias to each service.
As for your questions:
It should be possible to use artifacts to pass data between jobs
according to this answer and to this well explained forum post but
they use only one container for different jobs. It doesn't work in my
case. Probably because I use two different containers?
The builds and cache storage (from documentation)
The Docker executor by default stores all builds in /builds/<namespace>/<project-name> and all caches in /cache (inside the container). You can overwrite the /builds and /cache directories by defining the builds_dir and cache_dir options under the [[runners]] section in config.toml. This will modify where the data are stored inside the container.
If you modify the /cache storage path, you also need to make sure to mark this directory as persistent by defining it in volumes = ["/my/cache/"] under the [runners.docker] section in config.toml.
builds_dir -> Absolute path to a directory where builds are stored in the context of the selected executor. For example, locally, Docker, or SSH.
The [[runners]] section documentation
As you may have noticed I have customized the build_dir in your toml file to /home/builds/rocker, please adjust it to your own path.
How can I pass the artifacts from one job to the other?
You can use the build_dir directive. Second option would to use Job Artifacts API.
Should I use cache as explained in docs.gitlab.com / caching?
Yes, You should use cache to store project dependencies. The advantage is that you fetch the dependencies only once from internet and then subsequent runs are much faster as they can skip this step. Artifacts are used to share results between build stages.
I hope it is now clearer and I have pointed you into right direction.
The two different images are not the cause of your problems. The artifacts are saved in one image (which seems to work), and then restored in the other. I would therefore advise against building (and maintaining) a single image, as that should not be necessary here.
The reason you are having problems is that you are missing build stages which inform gitlab about dependencies between the jobs. I would therefore advise you to specify stages as well as their respective jobs in your .gitlab-ci.yml:
stages:
- do_stats
- do_compile_pdf
knit_rnw_to_tex:
stage: do_stats
image: rocker/verse:4.0.0
script:
- Rscript -e "knitr::knit('article.Rnw')"
artifacts:
paths:
- figure/
compile_pdf:
stage: do_compile_pdf
image: aergus/latex
script:
- ls figure
- latexmk -pdf -bibtex -use-make article.tex
artifacts:
paths:
- article.pdf
Context:
By default, all artifacts of previous build stages are made available in later stages if you add the corresponding specifications.
If you do not specify any stages, gitlab will put all jobs into the default test stage and execute them in parallel, assuming that they are independent and do not require each others artifacts. It will still store the artifacts but not make them available between the jobs. This is presumably what is causing your problems.
As for the cache: Artifacts are how you pass files between build stages. Caches are for well, caching. In practice, they are used for things like external packages in order to avoid having to download them multiple times, see here. Caches are somewhat unpredictable in situations with multiple different runners. They are only used for performance reasons, and passing files between jobs using cache rather than using the artifact system is a huge anti-pattern.
Edit: I don't know precisely what your knitr setup is, but if you generate an article.tex from your article.Rnw, then you probably need to add that to your artifacts as well.
Also, services are used for things like a MySQL server for testing databases, or the dind (docker in docker) daemon to build docker images. This should not be necessary in your case. Similarly, you should not need to change any runner configuration (in their respective config.toml) from the defaults.
Edit2: I added a MWE here, which works with my gitlab setup.

Starting SparkR session using external config file

I have an RStudio driver instance which is connected to a Spark Cluster. I wanted to know if there is any way to actually connect to Spark cluster from RStudio using an external configuration file which can specify the number of executors, memory and other spark parameters. I know we can do it using the below command
sparkR.session(sparkConfig = list(spark.cores.max='2',spark.executor.memory = '8g'))
I am specifically looking for a method which takes spark parameters from an external file to start the sparkR session.
Spark uses standardized configuration layout with spark-defaults.conf used for specifying configuration option. This file should be located in one of the following directories:
SPARK_HOME/conf
SPARK_CONF_DIR
All you have to do is to configure SPARK_HOME or SPARK_CONF_DIR environment variables and put configuration there.
Each Spark installation comes with template files you can use as an inspiration.

Presto Interpreter in Zeppelin on EMR

Is it possible to add Presto interpreter to Zeppelin on AWS EMR 4.3 and if so, could someone please post the instructions? I have Presto-Sandbox and Zeppelin-Sandbox running on EMR.
There's no official Presto interpreter for Zeppelin, and the conclusion of the Jira ticket raised is that it's not necessary because you can just use the jdbc interpreter
https://issues.apache.org/jira/browse/ZEPPELIN-27
I'm running a later EMR with presto & zeppelin, and the default set of interpreters doesn't include jdbc, but it can be installed using a ssh to the master node and running
sudo /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc
Even better is to use that as a bootstrap script.
Then you can add a new interpreter in Zeppelin.
Click the login-name drop down in the top right of Zeppelin
Click Interpreter
Click +Create
Give it a name like presto, meaning you need to use %presto as a directive on the first line of a paragraph in zeppelin, or set it as the default interpreter.
The settings you need here are:
default.driver com.facebook.presto.jdbc.PrestoDriver
default.url jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889
default.user hadoop
Note there's no password provided because the EMR environment should be using IAM roles, and ppk keys etc for authentication.
You will also need a Dependency for the presto JDBC driver jar. There's multiple ways to add dependencies in Zeppelin, but one easy way is via a maven groupid:artifactid:version reference in the interpreter settings under Dependencies
e.g.
under artifact
com.facebook.presto:presto-jdbc:0.170
Note the version 0.170 corresponds to the version of Presto currently deployed on EMR, which will change in the future. You can see in the AWS EMR settings which version is being deployed to your cluster.
You can also get Zeppelin to connect directly to a catalog, or a catalog & schema by appending them to the default.url setting
As per the Presto docs for the JDBC driver
https://prestodb.io/docs/current/installation/jdbc.html
e.g. As an example, using Presto with a hive metastore with a database called datakeep
jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889/hive
OR
jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889/hive/datakeep
UPDATE Feb 2018
EMR 5.11.1 is using presto 0.187 and there is an issue in the way Zeppelin interpreter provides properties to the Presto Driver, causing an error something like Unrecognized connection property 'url'
Currently the only solutions appear to be using an older version in the artifact, or manually uploading a patched presto driver
See https://github.com/prestodb/presto/issues/9254 and https://issues.apache.org/jira/browse/ZEPPELIN-2891
In my case using an old reference to a driver (apparently must be older than 0.180) e.g. com.facebook.presto:presto-jdbc:0.179 did not work, and zeppelin gave me an error about can't download dependencies. Funny error but probably something to do with Zeppelin's local maven repo not containing this, not sure I gave up on that.
I can confirm that patching the driver works.
(Assuming you have java & maven installed)
Clone the presto github repo
Checkout the release tag e.g. git checkout 0.187
Make the edits as per that patch https://groups.google.com/group/presto-users/attach/1231343dbdd09/presto-jdbc.diff?part=0.1&authuser=0
Build the jar using mvn clean package
Copy the jar to the zeppelin machine somewhere zeppelin user has permission to read.
In the interpreter, under the Dependencies - Artifacts section, instead of a maven reference use the absolute path to that jar file.
There appears to be an issue passing the user to the presto driver, so just add it to the "default.url" jdbc connection string as a url parameter, e.g.
jdbc:presto://<YOUR EMR CLUSTER MASTER DNS>:8889?user=hadoop
Up and running. Meanwhile, it might be worth considering Athena as an alternative to Presto give it's serverless & is effectively just a fork of Presto. It does have limitation to External hive tables only, and they must be created in Athena's own catalog (or now in AWS Glue catalog, also restricted to External tables).
Chris Kang has a good post on doing that in spark-shell, http://theckang.com/2016/spark-with-presto/. I don't see you wouldn't be able to do that in Zeppelin. Another helpful post is making sure you have the right Java version in EMR, http://queirozf.com/entries/update-java-to-jdk-8-on-amazon-elastic-mapreduce. The current Presto version as of writing only runs on Java 8. I hope it sets you in the right direction.

Not able to create instances on CloudStack

I have created a Zone, Pod and Cluster on CloudStack.I have also added a host in the Cluster, added Primary Storage and Secondary Storage. But in System VMs, nothing is listed. Also, in the logs a message "No running ssvm is found, so command will be sent to LocalHostEndPoint" comes.
Somehow I deduced that due to this, template is not being added and consequently Instances can't be created as Instances use templates to add OS in VMs.
Can anybody please help to point out and sort the problem which may be the cause here.
You need to manually install the "system VM" templates. These are the images for worker VMs that CloudStack deploys to run system services. SSVM is an example of a SystemVM. It is responsible for copying templates to secondary storage.
See Prepare the System VM Template in the installation guide.

Resources