I have an nginx web server that serves API calls from different User-Agents. I want to parse the nginx logs and collect statistics about API calls per User-Agent.
I'm going to write a Python script to parse the nginx access.log, like this one: https://gist.github.com/sysdig-blog/22ef4c07714b1a34fe20dac11a80c4e2#file-prometheus-metrics-python-py
Is there a more suitable solution?
I highly discourage this approach.
Parsing logs is an old task, and there are many tools out there that are more than capable of doing this in an efficient way.
For me personally, I had success with Fluentd - Open Source Data Collector, but there are more tools, depending on your specific needs.
The community around a tool, i.e. the number and quality of its plugins/add-ons, is relevant when choosing one.
So if googling "fluentd prometheus" gets you results from GitHub and from the developer itself, that might be your right course of action.
When an application doesn't expose whitebox monitoring endpoints, parsing the logs is the only solution.
From there, you have multiple choices depending on the scale and the budget of your setup:
centralizing logs (in ES, for example) using a sidecar like Filebeat to parse and ship them. You can then run queries to export statistics
log parsers that expose statistics: fluentd, telegraf, and mtail are good examples
regular runs of a script that dumps the data into a .prom file to be collected by a node exporter, which is also a cheap solution
Rolling your own script would be a last resort: use it if you need statistics you cannot get from off-the-shelf tools, or statistics that need context to be extracted. But it comes at the cost of handling painful scenarios; in your case, following the file when it rolls over can be an issue.
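As a rough illustration of the "script that dumps a prom file" option, here is a minimal sketch. It assumes nginx's default "combined" log format; the metric name nginx_requests_by_agent and the helper names are made up for the example. It does not deal with log rotation, and raw User-Agent strings can produce a very large number of label values, so you would probably want to bucket them first.

```python
import re
from collections import Counter

# Matches the tail of a "combined" format line:
# "METHOD path protocol" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_requests(lines):
    """Count requests per (user agent, status) pair; unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["agent"], match["status"])] += 1
    return counts


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_prom(counts, metric="nginx_requests_by_agent"):
    """Render the counts as text; `metric` is a hypothetical name.

    A gauge is used because the value is recomputed from the file on every
    run and can drop when the log is rotated.
    """
    out = [f"# TYPE {metric} gauge"]
    for (agent, status), n in sorted(counts.items()):
        labels = f'user_agent="{escape_label(agent)}",status="{status}"'
        out.append(f"{metric}{{{labels}}} {n}")
    return "\n".join(out) + "\n"
```

You would run this from cron, feed it the lines of access.log, and write the result of render_prom() to a .prom file in the directory that node exporter's textfile collector reads.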
Related
I need to build an LDAP proxy that I can program to inspect and modify the LDAP requests and responses - some of the LDAP requests/responses will simply be passed through, but for others I might want to send two different requests to the server and then combine the results (that's just one example - there will be other use cases).
I've looked at the proxying options documented for OpenLDAP's slapd, and I see that it has quite flexible configuration and 'overlays', but no capability to insert custom code.
So I think that's not a solution, unless slapd's source code is easy to modify, to insert my own modules plus hooks to/from the existing code (?)
An alternative would be to start with a friendly TCP/IP framework library (or even a complete TCP/IP proxy). Then I can link to an ASN.1 decoding/encoding library, and write the rest myself.
I'd prefer to avoid having to write (& learn) all the TCP/IP connection/message handling and event loop myself.
So I'm looking for the most complete starting point that does the hard work and gives me the flexibility to write what I need. Typical lazy/greedy approach :-)
Must be open source, ideally in C or C++, and I'll probably be targeting RHEL/CentOS 8 in a container.
I am using a fairly expensive external API (there's a cost per request) which makes testing code which uses it impractical.
In an ideal world, I would have a proxy server I would do my requests against which would cache each request (based on URL + query string) indefinitely and only hit the actual API server when I explicitly invalidate the cache for a given request. Is such a server available off the shelf with minimal configuration?
My current stack is Node.js, Docker, Nginx, PostgreSQL & AWS S3 (for non ephemeral state). I think Varnish might accomplish what I need but I'm not sure.
Varnish can and will accomplish that, but only if you build a 'test' API that returns some similar data you can play with. Your best bet, if you have to save money, is to query the API a few times to get different typical responses. Once you know the ballpark of what to expect from it, create some sort of dummy API, or even some static JSON or XML files that you can use to mimic it. At that point you can test Varnish and cache invalidation, and I'd be more than happy to help you with the syntax for that, given some examples of the code.
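To make the behaviour asked for in the question concrete (cache by URL + query string, refetch only after an explicit invalidation), here is a small sketch. It is not Varnish configuration; ReplayCache and its methods are hypothetical names, the cache lives in memory only, and it assumes that the order of query parameters does not matter to the API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


class ReplayCache:
    """Caches responses by URL + query string until explicitly invalidated."""

    def __init__(self, fetch):
        # fetch: callable(url) -> response body; this is what hits the real API.
        self._fetch = fetch
        self._store = {}

    @staticmethod
    def key(url):
        """Build the cache key, sorting query parameters (an assumption)."""
        parts = urlsplit(url)
        query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
        return f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"

    def get(self, url):
        """Return the cached body, calling the real API only on a miss."""
        k = self.key(url)
        if k not in self._store:
            self._store[k] = self._fetch(url)
        return self._store[k]

    def invalidate(self, url):
        """Forget one cached request so the next get() refetches it."""
        self._store.pop(self.key(url), None)
```

In tests you can pass a fake fetch function that returns canned JSON, which is the same idea as the dummy API suggested above. To keep entries across restarts you would have to persist the store somewhere yourself.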
Parse.com has a very useful tool that graphs the number of requests per second made to your application over a given time. I was wondering, for an Nginx configuration, is there any tool that does the same thing?
Using Nginx Plus would be an alternative to parsing the logs.
You can use the ngx_http_stub_status module (http://nginx.org/en/docs/http/ngx_http_stub_status_module.html) to export basic information, combined with collectd's nginx plugin (https://collectd.org/wiki/index.php/Plugin:nginx).
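If you just want to see how a requests-per-second figure falls out of stub_status, here is a minimal sketch. It assumes the plain-text layout shown in the module documentation (an "Active connections" line followed by a line with the accepts, handled and requests counters); the function names are made up, and fetching the page from your status URL is left out.

```python
import re


def parse_stub_status(text):
    """Extract the counters from a stub_status page."""
    active = int(re.search(r"Active connections:\s*(\d+)", text).group(1))
    # The counters sit alone on one line: accepts handled requests.
    counters = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    accepts, handled, requests = map(int, counters.groups())
    return {
        "active": active,
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
    }


def requests_per_second(earlier, later, interval_seconds):
    """Average request rate between two samples taken interval_seconds apart."""
    return (later["requests"] - earlier["requests"]) / interval_seconds
```

The requests counter is cumulative, so the rate is the difference between two samples divided by the time between them. collectd stores the same counter and lets the graphing side do that subtraction.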
I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this..."
Current Scenario:
Extract the Google Analytics data of the client and insert it in database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives the product ID in a request, it looks in the database, retrieves the recommended product IDs (using association rules), and sends them as the response.
The list of these product IDs is then processed at the client end to get the product details (image, price, ...) and displayed on the website.
Currently I am using PHP and MySQL with the gapi package and a REST API, with storage on Amazon EC2.
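For readers who want to picture the lookup step in the scenario above, here is a toy sketch. It uses plain co-occurrence counting as a stand-in for association rules, since the question does not say how the rules are mined; the function names and the in-memory table are assumptions, and product IDs are assumed to be of one sortable type.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(orders):
    """Count how often each pair of products appears in the same order.

    orders: iterable of collections of product IDs bought together.
    """
    table = defaultdict(Counter)
    for basket in orders:
        for a, b in combinations(sorted(set(basket)), 2):
            table[a][b] += 1
            table[b][a] += 1
    return table


def recommend(table, product_id, limit=3):
    """Return the products most often bought with product_id.

    Ties are broken by product ID so the output is stable.
    """
    ranked = sorted(
        table.get(product_id, {}).items(),
        key=lambda item: (-item[1], item[0]),
    )
    return [pid for pid, _ in ranked[:limit]]
```

The table would be built offline in a batch job and stored in the database, so the API call on page load only has to do the cheap recommend() lookup.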
My Question is:
Now, if I have to choose amongst the following, which will be the best choice to implement the above mentioned concept.
PHP with SimpleDB or BIGQuery.
R language with BIGQuery.
RHIPE-(R and hadoop ) with SimpleDB.
Apache Mahout.
Please help!
This isn't so easy to answer, because the constraints are fairly specialized.
The following considerations can be made, though:
BIGQuery is not yet public. Thus, with a small usage base, even if you are in the preview population, it will be harder to get advice on improvement.
Each of your options pairs a modeling system with a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view on the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.
As a result, this eliminates the 1st, 2nd, and 4th options.
What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.
(Update 1) Other interface options include RServe (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.
(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass to R or, if you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB - I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes already prepared with packages loaded and data in RAM; thus you would only need to pass whatever data needs to be used for prediction. If your new data is a small vector then RApache should be fine, and it seems this is correct for the data being processed in real-time.
If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel, put on top a GenericItemBasedRecommender, and use the servlet-based wrapper in the examples module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.
When you get past about 100M data points you would need to look at distributing the computation with Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too, which you can customize.
What are some good automated tools for load testing (stress testing) web applications, that do not use record and replay of HTTP network packets?
I am aware that there are numerous load testing tools on the market that record and replay HTTP network packets. But these are unsuitable for my purpose, because of this:
The HTTP packet format changes very often in our application (e.g. when we optimize an AJAX call). We do not want to adapt all test scripts just because there is a slight change in HTTP packet format.
Our test team shall not need to know any internals about our application to write their test scripts. A tool that replays HTTP packets, however, requires the team to know the format of HTTP requests and responses, such that they can adapt details of the replayed HTTP packets (e.g. user name).
The automated load testing tool I am looking for should be able to let the test team write "black box" test scripts such as:
Invoke web page at URL http://... .
First, enter XXX into text field XXX.
Then, press button XXX.
Wait until response has been received from web server.
Verify that text field XXX now contains the text XXX.
The tool should be able to simulate up to several 1000 users, and it should be compatible with web applications using ASP.NET and AJAX.
I've found JMeter to be pretty helpful. It also has a recording feature for capturing use cases, so you don't have to specify each GET/POST manually but can "click" through the use case once and then let JMeter repeat it.
http://jmeter.apache.org/
A license for it can be expensive (if you don't have MSDN), but Visual Studio 2010 Ultimate edition has a great set of load and stress testing tools that do what you describe. You can try it out for free for 90 days here.
TestMaker by PushToTest.com can run recorded scripts such as Selenium as well as many different languages like HTML, Java, Ruby, Groovy, .Net, VB, PHP, etc. It has a common reporting infrastructure and you can create load in your test lab or using cloud testing environments like EC2 for virtual test labs.
They provide free webinars on using open source testing tools on a monthly basis and there is one next Tuesday.
http://www.pushtotest.com
There are a few approaches; I've been in situations, however, where I've had to roll my own load generating utilities.
As far as your test script is concerned it involves:
sending a GET request to http://form entry page (only checking if a 200 response is given)
sending a POST request to http://form submit page with pre-generated key/value pairs for text XXX and performing a regexp check on the response
Unless your web page is complex AJAX there is no need to "simulate a button press" - this is taken care of by the POST request.
Given that your test consists of just a 2-step process there should be several automated load packages that could do this.
I've previously used httperf for load testing a large website: it can simulate a session consisting of several requests and can simulate a large number of users (i.e. sessions) simultaneously. For example, if your website generated a session cookie from the home page you could make that the first request, httperf would then use that cookie for subsequent requests, until it had finished doing the list of requests supplied.
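If you do end up rolling your own, the two-step session described above (GET the form page, POST the form, regexp-check the response, reuse the session cookie) could look roughly like the sketch below. It uses only the Python standard library and is not httperf; the /form and /submit paths, the function names and the thread-per-user model are assumptions for illustration, and threads will not scale to several thousand users without further work.

```python
import re
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.cookiejar import CookieJar


def run_session(base_url, form_fields, expected_pattern):
    """One simulated user: GET the form page, POST the form, check the reply."""
    # Each user gets its own cookie jar, so a session cookie set by the
    # first request is sent back on the second one.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    try:
        # Hypothetical paths; replace with the real form and submit URLs.
        with opener.open(f"{base_url}/form", timeout=10) as resp:
            if resp.status != 200:
                return False
        data = urllib.parse.urlencode(form_fields).encode()
        with opener.open(f"{base_url}/submit", data=data, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers URLError/HTTPError (4xx, 5xx) and socket timeouts.
        return False
    return re.search(expected_pattern, body) is not None


def run_load(base_url, form_fields, expected_pattern, users=50):
    """Run `users` sessions concurrently; return how many succeeded."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [
            pool.submit(run_session, base_url, form_fields, expected_pattern)
            for _ in range(users)
        ]
        return sum(1 for future in futures if future.result())
```

Passing data to opener.open() is what turns the second request into a POST, which is the "button press" from the server's point of view.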
What about http://watin.sourceforge.net/ ?