How to stop the Hive/Pig install in Amazon Data Pipeline (EMR)?

I don't need Hive or Pig, and Amazon Data Pipeline by default installs them on any EMR cluster it spins up. This makes testing take longer than it should. Any ideas on how to disable the install?

This is not possible as of today.
The only workaround would be to launch a small EMR cluster that you use for testing (for example, a single m1.small master) and reference it via a worker group ('workerGroup') rather than 'runsOn'.
Depending on the type of activities you want to use, the workerGroup field may or may not be supported. But you can always wrap everything in a script (Python, shell, or whatever) and run it with a ShellCommandActivity.
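For illustration, a rough sketch of that setup, assuming you have already downloaded Task Runner and created a credentials file (the jar version, worker group name, region, and bucket below are placeholders; adjust to whatever you actually use):

# On the long-running test cluster (e.g. its master node), start Task Runner
# and register it under a worker group name.
java -jar TaskRunner-1.0.jar \
    --config ~/credentials.json \
    --workerGroup=my-test-group \
    --region=us-east-1 \
    --logUri=s3://mybucket/taskrunner-logs

# In the pipeline definition, set workerGroup = "my-test-group" on the activity
# instead of attaching an EmrCluster resource via runsOn.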
Update (as correctly pointed out by ChristopherB):
Starting with the 3.x AMI versions, Hive and Pig are bundled in the AMI itself, so the install steps do not pull any new packages from S3; they only activate the daemons on the master node. Unless you are worried about them consuming instance resources (CPU, memory, etc.), this should be fine, as they do not take noticeable time to run.

Related

Can a serverless build service (e.g. AWS CodeBuild) be used to build Android AOSP ROMs?

As far as I understand, AWS CodeBuild is frequently used to build Android apps.
Could a serverless build service such as CodeBuild also be used to build complete custom ROMs based on AOSP?
The output should be the device-specific image files, e.g. boot.img, system.img, ...
The idea is to avoid setting up and maintaining our own (virtual) machine with the full AOSP build environment.
Maybe, but probably not. Building AOSP requires 16 GB of RAM; this is effectively a hard requirement. I've tried it with less: you can get away with 12 GB of RAM and 4 GB of swap, but 4 GB of RAM with 12 GB of swap does not work.
Anyway, why does this matter?
Because the largest compute type available for AWS CodeBuild offers 15 GB of memory.
It's also just impractical. The AOSP source tree is roughly 80 GB; it takes hours to download it all, and you don't want to do that for every build. At most, you want to sync the latest changes and move on.
AWS instances are also virtualized, which has a significant impact on build time.
As much as I love the cloud, if you want to set up a build server for AOSP, your best bet is to purchase a decent Linux workstation to act as your build server. It's some up-front cost, but you'll get it back many times over in development time saved.
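For reference, a rough sketch of what that dedicated build box ends up running, assuming the standard repo tool (the branch and lunch target below are placeholders; use whatever your device needs):

# One-time checkout (~80 GB); pick the branch you actually want to build.
mkdir -p ~/aosp && cd ~/aosp
repo init -u https://android.googlesource.com/platform/manifest -b android-11.0.0_r1
repo sync -j8

# Per build: set up the environment, pick a target, and build the images.
source build/envsetup.sh
lunch aosp_arm-eng
m -j"$(nproc)"    # produces boot.img, system.img, etc. under out/target/product/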

Single install Apache Karaf with failover configuration using shared disk

I'm looking to implement failover (master/slave) for Karaf. Our current server setup has two application servers with a shared SAN disk, where our current Java applications are installed in a single location and can be started on either machine, or on both machines at the same time.
I was looking to implement Karaf master/slave failover in a similar way (one install shared by both app servers), however I'm not sure this is a well-beaten path, and I would appreciate some advice on whether the alternatives (mentioned below) are significantly better.
Current idea for failover:
Install Karaf once on the shared SAN and set up basic file locking on this shared disk. Both application servers will effectively initiate the Karaf start script, but only one (the first) will fully start, grabbing the lock; the second will remain in standby until it grabs the lock (if the master falls over).
The main benefit I can see from this is that I only have to deploy components to one Karaf installation, and I only have one Karaf installation to manage.
Alternatives:
Install Karaf in two separate locations on the shared SAN and configure both to lock on the same lock file. Each application server then has its own Karaf instance, and thus its own start script to run. This makes our deployment slightly more complicated (two Karaf installations to manage and deploy to).
I'd be interested if anyone can point out specific concerns with the current idea.
Note: I understand that Karaf Cellar can simplify my Karaf instance management, however we would need to undertake another round of PoCs etc. to approve company use of Cellar (as a separate product). It is something I'd like to migrate to in the future.
Take a look at the documentation
This is from the documentation on how to set up a lock file for HA:
karaf.lock=true
karaf.lock.class=org.apache.karaf.main.lock.SimpleFileLock
karaf.lock.dir=<PathToLockFileDirectory>
karaf.lock.delay=10000
As can be seen there, you can also set a lock level that controls which bundle start levels do or do not start before the lock is acquired:
karaf.lock.level=50
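To make the single-shared-install idea from the question concrete, a minimal sketch follows; the SAN mount point, lock directory, and Karaf path are placeholders, not anything from your setup:

# Shared install on the SAN, visible to both application servers.
KARAF_HOME=/san/shared/apache-karaf

# Add the lock settings to the shared install's etc/system.properties,
# pointing karaf.lock.dir at a directory on the same shared disk.
cat >> "$KARAF_HOME/etc/system.properties" <<'EOF'
karaf.lock=true
karaf.lock.class=org.apache.karaf.main.lock.SimpleFileLock
karaf.lock.dir=/san/shared/karaf-lock
karaf.lock.delay=10000
karaf.lock.level=50
EOF

# On each application server, start the same shared install in the background.
# The first node to grab the lock becomes master; the other waits until the lock is free.
"$KARAF_HOME/bin/start"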

Updating code on production server when using Go

When I develop and update files on a production server with PHP, I just copy the files on the fly and everything seems to work without interrupting the server.
But to update the code of a Go server and application, I would need to kill the server, copy the source files to the server, run go install, and then start the server again. This interrupts the service, and if I do it often it is going to look very bad to the users of the service.
How can I update files without the downtime when using Go with Go's http server?
PHP is an interpreted language, which means you provide your code in source format and the PHP interpreter will read it and execute it (it may create a more compact binary form so that it doesn't have to analyze the source again when needed).
Go is a compiled language; it compiles into a native executable binary. Moreover, the binary is statically linked, which means all the code and libraries your app refers to are compiled and linked in when the executable is created. This implies you can't just "drop in" new Go modules into a running application.
You have to stop your running application and start the new version. You can however minimize the downtime: only stop the running application when the new version of the executable is already created and ready to be run. You may choose to compile it on a remote machine and upload the binary to the server, or upload the source and compile it on the server, it doesn't matter.
With this you can bring the downtime down to a few seconds at most, which your users won't notice. Also, you shouldn't be updating every hour; you can't really achieve significant updates in just an hour of coding. You could schedule updates daily (or even less frequently), and schedule them for hours when your traffic is low.
If even a few seconds downtime is not acceptable to you, then you should look for platforms which handle this for you automatically without any downtime. Check out Google App Engine - Go for example.
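A minimal sketch of that flow, assuming a Linux/amd64 server reachable over SSH and a systemd unit for the app (the hostnames, paths, and service name below are placeholders):

#!/bin/bash
set -euo pipefail

# Build the new binary locally, cross-compiling for the server's platform.
GOOS=linux GOARCH=amd64 go build -o myapp ./cmd/myapp

# Upload it next to the running binary, then swap and restart in one short step,
# so the old version keeps serving until the new one is ready to run.
scp myapp deploy@example.com:/srv/myapp/myapp.new
ssh deploy@example.com 'mv /srv/myapp/myapp.new /srv/myapp/myapp && sudo systemctl restart myapp'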
The grace library will let you do graceful restarts without annoying your users: https://github.com/facebookgo/grace
That said, in my experience restarting a Go application is so quick that, unless you run a high-traffic website, it won't cause any trouble.
First of all, don't do it in that order. Copy and install first. Then you could stop the old process and run the new one.
If you run multiple instances of your app, then you can do a rolling update, so that while you bounce one server the others are still serving. A similar approach is a blue-green deployment, which has the advantage that the code your active cluster runs is always homogeneous (whereas during a rolling deploy you'll have a mixture until every instance has rolled), and blue-green also works when you normally run only a single instance of your app (whereas rolling requires more than one). It does, however, require you to run double the instances during the blue-green switch.
One thing you'll want to take into consideration is in-flight requests: you may want to make sure that in-flight requests continue to go to the old-code servers until they're finished.
You can also look into Platform-as-a-Service solutions, which can automate a lot of this for you, plus a whole lot more. That way you're not ssh'ing into production servers and copying files around manually. The 12 Factor App principles are always a good place to start when thinking about ops.
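As a rough illustration of a blue-green switch for a single-instance setup, assuming you are willing to put a reverse proxy such as nginx in front of your Go server (the ports, config path, and service names below are placeholders):

# "blue" runs the old binary on :8080, "green" the new one on :8081.

# 1. Start the new version alongside the old one.
sudo systemctl start myapp-green    # listens on :8081

# 2. Point the proxy at the green instance and reload without dropping connections.
sudo sed -i 's/server 127.0.0.1:8080;/server 127.0.0.1:8081;/' /etc/nginx/conf.d/myapp.conf
sudo nginx -t && sudo nginx -s reload    # in-flight requests on the old upstream finish

# 3. Once traffic has drained, stop the old instance.
sudo systemctl stop myapp-blue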

use julia language without internet connection (mirror?)

Problem:
I would like to make Julia available for our developers on our corporate network, which has no internet access at all (no proxy) due to sensitive data.
As far as I understand, Julia is designed to use GitHub.
For instance, julia> Pkg.init() tries to access:
git://github.com/JuliaLang/METADATA.jl
Example:
I solved this problem for R by creating a local CRAN repository (rsync) and setting up a local webserver.
I also solved this problem for Python the same way, by creating a local PyPI repository (bandersnatch) plus a webserver.
Question:
Is there a way to create a local repository for metadata and packages for julia?
Thank you in advance.
Roman
Yes, one of the benefits from using the Julia package manager is that you should be able to fork METADATA and host it anywhere you'd like (and keep a branch where you can actually check new packages before allowing your clients to update). You might be one of the first people to actually set up such a system, so expect that you will need to submit some issues (or better yet; pull requests) in order to get everything working smoothly.
See the extra arguments to Pkg.init() where you specify the METADATA repo URL.
If you want a simpler solution to manage I would also think about having a two tier setup where you install packages on one system (connected to the internet), and then copy the resulting ~/.julia directory to the restricted system. If the packages you use have binary dependencies, you might run into problems if you don't have similar systems on both sides, or if some of the dependencies is installed globally, but Pkg.build("Pkgname") might be helpful.
This is how I solved it (for now), using the second suggestion by ivarne.
I use a two-tier setup with two networks: one connected to the internet (office network), and one air-gapped network (development network).
System information: openSuSE-13.1 (both networks), julia-0.3.5 (both networks)
Tier one (office network)
installed julia on an NFS share, /sharename/local/julia.
soft linked /sharename/local/bin/julia to /sharename/local/julia/bin/julia
appended /sharename/local/bin/ to $PATH using a script in /etc/profile.d/scriptname.sh
created /etc/gitconfig on all office-network machines with [url "https://"] insteadOf = git:// (to work around proxy-server problems with GitHub)
now every user on the office network can simply run julia
Pkg.add("PackageName") is then used to install various packages.
The two networks are connected periodically (with certain security measures ssh, firewall, routing) for automated data exchange for a short period of time.
Tier two (development network)
installed Julia on an NFS share, the same as in tier one.
When the networks are connected I use a shell script with rsync -avz --delete to synchronize the .julia directory of tier one to tier two for every user.
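That sync script boils down to something like the following; the paths, host name, and user list are placeholders for whatever your environment uses:

#!/bin/bash
# Runs on tier one while the networks are connected.
set -euo pipefail

USERS="alice bob"
SRC=/home          # tier-one home directories
DST=devgw:/home    # tier-two host, reachable only during the sync window

for u in $USERS; do
    rsync -avz --delete "$SRC/$u/.julia/" "$DST/$u/.julia/"
done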
Conclusion (so far):
It seems to work reasonably well.
As ivarne suggested, there are problems if a package's installation does more than just copy files (compilation?) on tier one; such a package won't run on tier two. But this can be resolved with Pkg.build("Pkgname").
PackageCompiler.jl seems like the best tool for using modern Julia (v1.8) on secure systems. The following approach requires a build server with the same architecture as the deployment server, something your institution probably already uses for developing containers, etc.
Build a sysimage with PackageCompiler's create_sysimage()
Upload the build (sysimage and depot) along with the Julia binaries to the secure system
Alias a script to julia, similar to the following example:
#!/bin/bash
set -Eeu -o pipefail
# Ignore any load path inherited from the environment.
unset JULIA_LOAD_PATH
# Point Julia at the uploaded project and depot.
export JULIA_PROJECT=/Path/To/Project
export JULIA_DEPOT_PATH=/Path/To/Depot
# Never let Pkg try to reach the internet.
export JULIA_PKG_OFFLINE=true
# Launch Julia with the prebuilt sysimage, passing all arguments through.
/Path/To/julia -J/Path/To/sysimage.so "$@"
I've been able to run a research pipeline on my institution's secure system, for which there is a public version of the approach.
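For completeness, the build-server side of step 1 looks roughly like this; the package names and paths below are placeholders, substitute your own project:

# On the build server (same architecture/OS as the secure system):
export JULIA_DEPOT_PATH=/Path/To/Depot
julia --project=/Path/To/Project -e '
    using Pkg; Pkg.instantiate()
    using PackageCompiler
    create_sysimage(["DataFrames", "CSV"]; sysimage_path="/Path/To/sysimage.so")
'
# Then copy /Path/To/Depot, /Path/To/sysimage.so, and the Julia binaries to the secure system.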

Rake Strategy, DotNet Implementation

When reading about and playing with Rails last year, one of the tools that made the biggest impression on me was Rake. A database versioning system that keeps all dev DBs identical, integrated right into the build... something like that would make life so much easier (and safer)!
However, one of the things that I haven't been able to figure out:
How do you move these changes to your production servers when you don't actually have access to the production servers? We have multiple servers across the country where the application is installed/upgraded by a setup package.
Note: this question is more about strategy than about Rails/Rake-specific technologies. We don't use Rails, we use .NET. But if I can figure out this publish scenario, there seem to be several tools (Migratordotnet being one) that might enable us to do something similar.
As you probably know, the standard Rails way of running migrations in production is Capistrano. It has a deploy:migrations task that runs the migrations on remote servers using ssh.
You might be able to adapt Capistrano to do what you want. It's essentially a flexible way to run commands on groups of remote servers. You need to have Ruby installed on the machine you are deploying from in order to use it, but not on the machines you are deploying to.
Your best option may be to write a custom Capistrano task to upload the setup.exe, run it, then run the migrations (perhaps using Migrator.NET).
You might be able to use something like Red Gate's SQL Compare to produce schema diff scripts that would allow you to automate the process of updating the database. I've used the tool manually to make such changes and could easily see creating a program that would run these updates as part of the upgrade process. If I were going to automate it, though, I'd design in something that would check what version of the schema was in place and run the necessary scripts in the proper order to bring it up to the desired version.
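As a rough sketch of that check-the-version-then-run-scripts-in-order idea (the SchemaVersion table, the zero-padded script naming convention, and the sqlcmd connection details are all assumptions, not part of the question):

#!/bin/bash
# Apply every migration script newer than the database's recorded schema version.
# Scripts are assumed to be named 001_init.sql, 002_add_x.sql, ...
set -euo pipefail

SERVER="localhost"
DB="MyAppDb"

# Ask the database which schema version it is currently at.
current=$(sqlcmd -S "$SERVER" -d "$DB" -h -1 -W \
          -Q "SET NOCOUNT ON; SELECT ISNULL(MAX(Version), 0) FROM SchemaVersion;" | tr -d '[:space:]')

# Apply newer scripts in order, recording each one as it succeeds.
for script in migrations/*.sql; do
    v=$((10#$(basename "$script" | cut -d_ -f1)))    # strip leading zeros
    if [ "$v" -gt "$current" ]; then
        echo "Applying $script"
        sqlcmd -S "$SERVER" -d "$DB" -b -i "$script"
        sqlcmd -S "$SERVER" -d "$DB" -b -Q "INSERT INTO SchemaVersion (Version) VALUES ($v);"
    fi
done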
