Artifactory Cross-Replication

Caveat: I'm an IT guy, not a Java programmer. So please use small words. :)
I have two Artifactory servers, with more on the way for remote sites. They are supposed to have the same information on them, so that developers at Site 1 and Site 2 (and eventually sites 3->n) can work from the "same" repo. (Named, for the sake of argument, BOB.)
Am I crazy to think that I can simply have BOB1 and BOB2 do "push" replication to each other? There's nothing in the documentation or online that indicates this should work or is a supported configuration, but I tried it and it worked. (?!)
When bringing additional sites into the loop, can we just push BOB1 -> BOB2 -> BOB3 ... BOB(n) -> BOB1?
Finally, is there an O'Reilly book or something about Artifactory that has more info than JFrog's online docs and costs less than the additional $4k they want for a support contract? I'd really like to know best practices for repository layouts and security settings.

Since I am not sure whether we are talking about a particular problem or just looking for instructions on how to do it, here's the page on replication in the Artifactory User Guide. It includes both a general explanation of how it works and detailed instructions on how to set it up.

Do not create loops in the replication architecture! The Repository Replication page warns you never to do this.
Avoid Replication Loops ("Cyclic Replication")
A replication loop ("cyclic" or "bi-directional" replication) occurs when two instances of Artifactory running on different servers are replicating content from one to the other concurrently.
For example, "Server A" is configured to replicate its repositories to "Server B", while at the same time, "Server B" is configured to replicate its repositories to "Server A".
Or "Server A" replicates to "Server B" which replicates to "Server C" which replicates back to "Server A".
We strongly recommend avoiding cyclic replication since this can have disastrous effects on your system causing loss of data, or conversely, exponential growth of disk-space usage.
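For what it's worth, push replication can also be configured via the REST API rather than the UI, which makes it easier to keep a one-directional (non-cyclic) setup consistent across sites. Below is a rough sketch only: the hosts, credentials, and cron expression are placeholders, and the endpoint and JSON field names come from the replication REST API docs as I remember them, so verify them against your Artifactory version.

# Rough sketch: configure one-directional push replication of repo BOB
# from BOB1 to BOB2 via Artifactory's replication REST API.
# Hosts, credentials, and the cron expression are placeholders.
import requests

SOURCE = "https://bob1.example.com/artifactory"   # placeholder source instance
TARGET = "https://bob2.example.com/artifactory"   # placeholder target instance
REPO = "BOB"

replication_config = {
    "url": f"{TARGET}/{REPO}",        # where the source pushes to
    "username": "replication-user",   # account on the target instance
    "password": "********",
    "enabled": True,
    "cronExp": "0 0 0/4 * * ?",       # push every four hours
    "syncDeletes": False,
    "syncProperties": True,
}

resp = requests.put(
    f"{SOURCE}/api/replications/{REPO}",
    json=replication_config,
    auth=("admin", "********"),       # admin credentials on the source
)
resp.raise_for_status()
print("Replication configured:", resp.status_code)

Running the equivalent call on each additional site (always pointing away from a single designated source, never back around in a ring) keeps the topology free of loops.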

Related

Generic Airflow data staging operator

I am trying to understand how to manage large data with Airflow. The documentation is clear that we should use external storage rather than XCom, but I cannot find any clean examples of staging data to and from a worker node.
My expectation is that there should be an operator that can run a staging-in operation, run the main operation, and then stage the data back out.
Is there such an operator or pattern? The closest I've found is the S3 File Transform operator, but it runs an executable to do the transform rather than a generic operator such as the DockerOperator, which is what we'd want to use.
Other "solutions" I've seen rely on everything running on a single host and using known paths, which isn't a production-ready solution.
Is there an operator that supports data staging, or is there a concrete example of handling large data with Airflow that doesn't rely on each operator being equipped with cloud-copying capabilities?
Yes and no. Traditionally, Airflow is mostly an orchestrator - it does not usually "do" the work itself, it tells others what to do. You very rarely need to bring actual data to an Airflow worker; the worker is mostly there to tell other systems where the data is coming from, what to do with it, and where to send it.
There are exceptions (some transfer operators actually download data from one service and upload it to another), so the data does pass through the Airflow node, but this is an exception rather than the rule (the more efficient and better pattern is to invoke an external service to do the transfer and have a sensor wait until it completes).
This is the "historical" and, to some extent, still current way Airflow operates. However, with Airflow 2 and beyond we are expanding this, and it is becoming more and more possible to implement a pattern like the one you describe; this is where XCom plays a big role.
You can - as of recently - develop custom XCom backends that allow for more than metadata sharing; they are also good for sharing the data itself. See the docs at https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html#custom-backends, this nice article from Astronomer: https://www.astronomer.io/guides/custom-xcom-backends, and a very nice talk from Airflow Summit 2021 (from last week) presenting it: https://airflowsummit.org/sessions/2021/customizing-xcom-to-enhance-data-sharing-between-tasks/ - I highly recommend watching the talk!
Looking at your pattern: the XCom pull is the staging-in, the operator's execute() is the operation, and the XCom push is the staging-out.
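To make that mapping concrete, here is a minimal sketch of such a custom XCom backend that stages pandas DataFrames to S3. The bucket name, key layout, and CSV serialization are assumptions for illustration, not anything prescribed by Airflow.

# A minimal sketch of a custom XCom backend (Airflow 2.x) that stages pandas
# DataFrames out to S3 on push and back in on pull, so only a reference string
# goes through the metadata database. Bucket name and CSV format are assumptions.
import io
import uuid

import pandas as pd
from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class S3XComBackend(BaseXCom):
    BUCKET = "my-xcom-staging-bucket"   # hypothetical bucket
    PREFIX = "xcom"

    @staticmethod
    def serialize_value(value):
        # "Staging out": upload large values and store only their S3 URI in XCom.
        if isinstance(value, pd.DataFrame):
            key = f"{S3XComBackend.PREFIX}/{uuid.uuid4()}.csv"
            S3Hook().load_string(
                string_data=value.to_csv(index=False),
                key=key,
                bucket_name=S3XComBackend.BUCKET,
            )
            value = f"s3://{S3XComBackend.BUCKET}/{key}"
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        # "Staging in": when the downstream task pulls the XCom, fetch the object back.
        value = BaseXCom.deserialize_value(result)
        if isinstance(value, str) and value.startswith("s3://"):
            bucket, _, key = value[len("s3://"):].partition("/")
            value = pd.read_csv(io.StringIO(S3Hook().read_key(key=key, bucket_name=bucket)))
        return value

The backend is enabled by pointing the xcom_backend setting in airflow.cfg (or the AIRFLOW__CORE__XCOM_BACKEND environment variable) at the class; the docs and the Astronomer guide linked above walk through this in more detail.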
This pattern will, I think, be reinforced by upcoming Airflow releases and some cool integrations that are coming. There will likely be more data-sharing options in the future, but I think they will all be based on a (perhaps slightly enhanced) XCom implementation.

Apache Nifi vs Gobblin

I am assessing a big-data project. We would need to pull lots of big data sets from various internet sources (FTP, API, etc.), do light transformations and light data quality / sanity checking (e.g. row and columnar inspections), and push them downstream. The immediate focus is batch, but we anticipate supporting streaming down the line. Ease of support at scale is an important requirement.
We are looking at Apache Nifi and Gobblin, which seem to overlap in intention. What sort of use cases fit which platform best? How would they conform to the use case above?
Thanks!
My experience is with NiFi, and I've only just had a look at Gobblin, but the main difference is that NiFi is an application in itself, whereas Gobblin is a framework.
In NiFi, you'll have a GUI with very granular authorizations that allow several users to intervene on different parts of the flow, monitor it, etc.
One other thing is that NiFi is 'always on' and 'always in production': you are potentially able to make your modifications directly on the target system, and as such there are a few safeguards in place to avoid losing data (by mistake, I mean).
So, while I think both solutions can do more or less the same thing, if you have a workflow that you only want to deploy once in a while, Gobblin might be a better fit; but if you want something where you give some users permission to intervene on parts of the flow directly in production, NiFi will be the better choice.
In the end, to keep the question oriented on programming:
NiFi allows you to program graphically, to give very granular permissions to your 'developers', as well as to update the 'program' (the NiFi flow) while it is running.
Gobblin seems (from what little I've looked up) to work by defining jobs in text files, which looks like a more 'classical' development workflow, but that may fit your usage better.

How do I email myself data from a R script?

I'm hoping to take advantage of Amazon spot instances, which come at a lower cost but can terminate at any time. I want to set things up so that I can send myself data midway through a script and pick up from there in the future.
How would I email myself a .RData file?
difficulty: The ideal solution will not involve RCurl since I am unable to install that package on my machine instance.
The same way you would on the command line -- I like the mpack binary for that, which you'll find in Debian and Ubuntu.
So save data to a file /tmp/foo.RData (or generate a temporary name) and then
system("mpack -s Data /tmp/foo.RData you@some.where.com")
in R. That assumes the EC2 instance has mail setup, of course.
Edit: Per the request for a Windows alternative, blat has been recommended by others for this task.
There is a good article on this in R News from 2007. Amongst other things, the author describes some tactics for catching errors as they occur, and automatically sending email alerts when this happens -- helpful for long simulations.
Off topic: the article also gives tips about how the linux/unix tools screen and make can be very useful for remote monitoring and automatic error reporting. These may also be relevant in cases when you are willing to let R email you.
What you're asking is probably best solved not by email but by using an EBS volume. The volume will persist regardless of the instance (note though that I'm referring to an EBS volume as opposed to an EBS-backed instance).
In another question, I mention a bunch of options for checkpointing and related tools, if you would like to use a separate function for storing your data during the processing.

Determining Website Capacity

A client of mine has a website and they need to determine how 'scalable' it currently is. What I mean by this is how many users can browse the site concurrently.
It's a custom e-commerce app in .NET, not written by myself, and the code is... well, let's just say, a bit dubious.
A much bigger company is looking to buy them / throw funding their way but they need some form of metrics to show how much load it can take before it falls apart. This big company has the ability to 'turn on the taps' to a huge user base - and obviously doesn't want to do that if the site is going to fall over with a sneeze of traffic.
What is a good metric to provide here? And how can I obtain it?
Edit: Question revised
I always use Apache's "ab" tool (ApacheBench, which ships with the Apache HTTP server).
Run it from a different machine, preferably a BSD or Linux machine with no firewall rules that would limit the tool's performance; otherwise the results might not be reliable. If you use a Windows machine, make sure it isn't one that limits the number of active TCP connections.
When using "ab", the number you're looking for is "Requests per second". Experiment with the concurrency switch to see how many concurrent users you can handle before you get a lot of errors, or before the requests per second drops rapidly.
When you notice the webserver is having serious issues, restart it and let it rest for a while before continuing the test.
You'd be better off with a hosted load test, as this might give you more insight into real-world scenarios (something like http://www.scl.com/software-quality/hosted-load-test - no experience with them though).
Furthermore: scalability, as far as I know, is not how many concurrent users can be served, but how easy it is to serve more when the site grows bigger (by adding extra servers, etc.) - how easy is it for the website to scale up, and does the codebase allow the use of an unlimited number of servers?
Well, I suppose it'll depend on what the client cares about.
Do they care about how many users can access the site at once? Report on that by running simultaneous requests from another server until it dies, then get the number.
Do they care about something else?
For me, when someone says they want it to 'scale', it really means they have no idea what they want. So talk to them and get specific details of what, exactly, they want to see 'scale'; once you find the areas to analyse, you can do so and attempt to improve them.

How would I go about figuring out the maximum load my server(s) can handle?

In Joel's article for Inc. entitled How Hard Could It Be?: The Unproven Path, he wrote:
...it turns out that Jeff and his programmers were so good that they built a site that could serve 80,000 visitors a day (roughly 755,000 page views)
How would I go about figuring out the maximum load my server(s) can handle?
Benchmarking your software is often a lot harder than it seems. Sure, it's easy to produce some numbers that say something about the performance of your software, but unless it was calculated using a very accurate representation of the actual usage patterns of your end users, it might be completely different from the actual results you will get in the wild. Websites are notoriously hard to benchmark correctly. Sure, you can run a script that measures the time it takes to generate a page but it will be a very different number from what you will see under real world usage.
In order to create a solid benchmark of what your servers can handle, you first need to figure out what the usage patterns of your users are. If your site is already running, you can easily collect this data from your logs. Next, you need to create a simulation that emulates exactly the same patterns your real users exhibit - that is: view the front page, log in, view the status page, and so forth. Different pages create different loads on the servers, so you need to fetch the correct set of pages when simulating load. Finally, you need to figure out which resources are cached by your users; you can do this again by looking through your access log or by using a tool such as Firebug.
JMeter, ab, or httperf
You can create several "stress tests" and run them as the other posters suggest.
Apache has a tool called JMeter where you can create these tests and run them several times.
http://jmeter.apache.org/
Greetings.
Jason, have you looked at the Load Test feature built into Visual Studio 2008 Team System? Check out this video to see a demo.
Edit: Here's another video that has better resolution.
Apache has a tool called ab that you can use to benchmark a server. It can simulate request loads and concurrency situations for you.
Basically you need to mimic the behavior of a user and keep ramping up the number of users being mimicked until the server response is no longer acceptable.
There are a variety of tools that can do this, but essentially you want to record the activity of a few sessions on your site and then play those sessions back (adding some randomisation to reflect real user behaviour) lots of times.
You will want to log the performance of each session and keep increasing the load until the performance becomes unacceptable; a rough sketch of that ramp-up idea follows below.
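As a rough illustration of that ramp-up idea (not a substitute for a proper tool like JMeter, ab, or httperf), a script along these lines replays a recorded session at increasing concurrency and stops when performance degrades. The URLs and thresholds are placeholders.

# Rough sketch: ramp up concurrent simulated users against a recorded set of
# URLs and stop once the error rate or response time becomes unacceptable.
# The target URLs and thresholds below are placeholders.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SESSION_URLS = [                     # a recorded "user session" to replay
    "http://localhost:8000/",
    "http://localhost:8000/products",
    "http://localhost:8000/cart",
]
MAX_AVG_SECONDS = 2.0                # acceptable average session time
MAX_ERROR_RATE = 0.05                # acceptable fraction of failed requests


def run_session():
    """Replay one user session; return (elapsed_seconds, error_count)."""
    errors = 0
    start = time.monotonic()
    for url in SESSION_URLS:
        try:
            r = requests.get(url, timeout=10)
            if r.status_code >= 500:
                errors += 1
        except requests.RequestException:
            errors += 1
    return time.monotonic() - start, errors


for users in (1, 5, 10, 25, 50, 100, 200):
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(lambda _: run_session(), range(users * 5)))
    times = [t for t, _ in results]
    error_rate = sum(e for _, e in results) / (len(results) * len(SESSION_URLS))
    avg = statistics.mean(times)
    print(f"{users:>4} users: avg session {avg:.2f}s, error rate {error_rate:.1%}")
    if avg > MAX_AVG_SECONDS or error_rate > MAX_ERROR_RATE:
        print("Performance no longer acceptable; the previous level is the practical limit.")
        break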
