getting more data from nginx prometheus exporter - nginx

I'm using the nginx Prometheus exporter, but the amount of data its metrics provide is very limited. I also want information from access.log and error.log, such as how many 200s, 404s, and so on.
What would you suggest?

The richer metrics are only available with NGINX Plus, which comes at a premium. Unless you want to modify the source code, additional metrics are only available through the log files.
If you are already aggregating logs, say with Elasticsearch, you can use the corresponding exporter to extract metrics from them.
If not, there are dedicated projects such as nginxlog-exporter, or generic solutions such as mtail, where you write your own extraction rules.
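As a rough illustration of the log-based route, the nginx side usually amounts to a dedicated log format; the snippet below is only a sketch (the field selection, file path and format name are assumptions) of the kind of format an exporter like nginxlog-exporter or an mtail program could be configured to parse:

```nginx
http {
    # Hypothetical format exposing the fields log-based exporters typically need:
    # status code, bytes sent, request time and upstream response time.
    log_format exporter '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent" '
                        'rt=$request_time urt="$upstream_response_time"';

    server {
        listen 80;
        # The exporter tails this file and turns each line into labelled metrics.
        access_log /var/log/nginx/exporter_access.log exporter;
    }
}
```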
Finally, there is an intermediate solution, which is the one listed officially on the Prometheus site: extracting metrics with Lua. This is probably the most robust option, but it comes at the cost of a more involved setup.
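For the Lua route (which needs OpenResty or the lua-nginx-module), a minimal sketch using the nginx-lua-prometheus library might look roughly like this; the shared-dict size, the scrape port 9145 and the label set are assumptions, so check the library's README for the exact API:

```nginx
http {
    lua_shared_dict prometheus_metrics 10M;

    init_worker_by_lua_block {
        prometheus = require("prometheus").init("prometheus_metrics")
        -- A counter labelled by status gives the 200/404/... breakdown asked about.
        metric_requests = prometheus:counter(
            "nginx_http_requests_total", "Number of HTTP requests", {"host", "status"})
    }

    log_by_lua_block {
        metric_requests:inc(1, {ngx.var.server_name, ngx.var.status})
    }

    server {
        listen 9145;   # scrape endpoint for Prometheus
        location /metrics {
            content_by_lua_block { prometheus:collect() }
        }
    }
}
```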
It is hard to make a single suggestion. It all comes down to your time/skill/money budget and how you are using nginx. If you are using it as a proxy, Envoy is gaining traction.
In fact, your question is a bit broad, but worth an answer, because the basic monitoring available is really poor given the widespread usage nginx enjoys (IMNSHO).

Related

Advanced HTTP/2 proxy for load balancing of distributed scraping solution

I have built a distributed HTTP scraper solution that uses different "exit addresses" by design in order to balance the network load.
The solution supports IPv4, IPv6 and HTTP proxies to route the traffic.
Each processor was responsible for defining the most efficient route to balance the traffic, and this was temporarily implemented manually for prototyping. The solution is now growing, and as the number of processors increases, the load-balancing task becomes more complex, which is why I need a dedicated component for it.
I did some rather extensive research, but seem to have failed to find a solution for load balancing traffic between IPv6, IPv4 (thousands of local addresses) and public HTTP proxies. The solution needs to support weights, app-level response checks and cool-down periods.
Does anyone know of a solution that already solves this problem, before I start developing a custom one?
Thanks for your help!
If you search for load-balancing proxies you'll come across the Cache Array Routing Protocol (CARP). CARP might not be exactly what you're searching for, and there are servers dedicated solely to proxy caching, which I never knew until now.
Nevertheless, those servers have their own load balancers too, and perhaps that's a detail worth researching further.
I found a presentation that also mentions CARP as an outstanding solution: https://cs.nyu.edu/artg/internet/Spring2004/lectures/lec_8b.pdf
An example of proxy arrays in the Netra Proxy Cache Server: https://docs.oracle.com/cd/E19957-01/805-3512-10/6j3bg665f/index.html
There are also several concepts for load balancing (https://link.springer.com/article/10.1023/A:1020943021842):
The three proposed methods can broadly be divided into centralized and decentralized approaches. The centralized history (CH) method makes use of the transfer rate of each request to decide which proxy can provide the fastest turnaround time for the next job. The route transfer pattern (RTP) method learns from the past history to build a virtual map of traffic flow conditions of the major routes on the Internet at different times of the day. The map information is then used to predict the best path for a request at a particular time of the day. The two methods require a central executive to collate information and route requests to proxies. Experimental results show that self-organization can be achieved (Tsui et al., 2001). The drawback of the centralized approach is that a bottleneck and a single point of failure is created by the central executive. The decentralized approach—the decentralized history (DH) method—attempts to overcome this problem by removing the central executive and put a decision maker in every proxy (Kaiser et al., 2000b) regarding whether it should fetch a requested object or forward the request to another proxy.
As you use public proxy servers, you probably won't use decentralized history (DH), but rather centralized history (CH) or the route transfer pattern (RTP).
Perhaps it would even be useful to replace your own solution completely, e.g. with this: https://github.blog/2018-08-08-glb-director-open-source-load-balancer/. I have no particular reason for this specific example; it's just something I found among the search results.
As I'm not working with proxy servers, this post is just a collection of findings, but perhaps there is a usable detail in it for you. If not, never mind; you probably know most or all of this already and it adds nothing new for you. Note that I haven't recommended any one concrete solution.
Have you checked out this project? https://Traefik.io supports HTTP/2 and TCP load balancing. The project is open source, available on GitHub, and built using Go. I'm using it now as my reverse proxy with load balancing for almost everything.
I also wrote a small blog post on Docker and Go in which I showcase the usage of Traefik; it might also help you in your search: https://marcofranssen.nl/docker-tips-and-tricks-for-your-go-projects/
In the Traefik code base you might find your answer, or you might decide to use Traefik to achieve your goal instead of a home-grown solution.
See here for a nice explanation of the soon-to-arrive Traefik 2.0 with TCP support:
https://blog.containo.us/back-to-traefik-2-0-2f9aa17be305

How to filter HTTP requests based on body before they get to the server?

I'm using nginx to filter some requests before they get to my server, since the majority of the traffic is of no interest and I don't want to scale my servers to handle all requests when I'm actually interested in only a fraction of them. Scaling nginx is cheaper.
The problem is that nginx doesn't have (AFAIK) dynamic configuration, and we are doing only basic filtering right now, based on request parameters and the origin IP address.
Is there any nginx-like software I can use with (possibly) both capabilities described above (namely, dynamic configuration and all-powerful filtering, like regexes on the request body)?
I'm beginning to test OpenResty's Lua scripting capabilities, but that looks like a messy workaround to me. Any opinions on that?
It actually seems like OpenResty is a nice approach. More here: http://www.scalescale.com/scaling-cloudflares-massive-waf/.
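For what it's worth, a minimal sketch of body-based filtering with OpenResty could look like the snippet below (the regex, backend address and buffer size are made up). Note that ngx.req.get_body_data() returns nil when the body has been spooled to disk, so the body buffer has to be sized accordingly; the Lua code could also load its patterns from a shared dict or Redis at runtime, which is one way to get the dynamic configuration you are after.

```nginx
server {
    listen 80;
    # Keep small request bodies in memory so get_body_data() can see them.
    client_body_buffer_size 64k;

    location / {
        access_by_lua_block {
            ngx.req.read_body()
            local body = ngx.req.get_body_data()
            -- Reject requests whose body matches an uninteresting pattern.
            if body and ngx.re.match(body, "heartbeat|keep-alive-ping", "ijo") then
                return ngx.exit(ngx.HTTP_FORBIDDEN)
            end
        }
        proxy_pass http://127.0.0.1:8080;   # hypothetical application backend
    }
}
```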

Sailsjs distribution across multiple Google compute engine instances

Sails.js requires some setup to handle scaling horizontally. There are multiple ways to do this. I'm not sure whether I have done it correctly, given the poor performance during load testing. Please confirm whether I understand the setup and am doing it correctly.
I've created a load balancer on the Google platform for handling the distribution of requests across the instances. Much is said about using Nginx for distribution, but I understand Google's load balancer does all I need in this regard. Note that I use session affinity: Client IP.
I've set up config/session.js to use express-mysql-session, so MemoryStore is not used.
I haven't set up anything in config/sockets.js. My project doesn't use live chat etc. with socket.io; all requests go to Waterline for data from the db. But if this is an issue, please point me to a way to do this with a MySQL db, not Redis (or memory).
I use pm2 to keep it live and to distribute processing on an instance.
Those are the main factors I've found regarding horizontal scaling with Sails.js.

How to detect proxy requests? [duplicate]

I know this is a popular question, and I have read all the topics about it. I want to settle the matter for myself.
Goal: detect whether the user is using a proxy.
Reason: if the user is behind a proxy, don't show geo-targeted advertising. I need a boolean result.
Possible solutions:
1. Use a database of proxy IPs (e.g. MaxMind);
2. Check the Connection: keep-alive header, because cheap proxies don't use persistent connections, while all modern browsers do;
3. Check other popular headers;
4. Use JS to detect a web proxy by comparing the browser host with the real host.
Questions:
1. Can you recommend a database? I read about MaxMind, but some people wrote that it is not effective.
2. Is checking the Connection header OK?
3. Maybe I have missed something?
PS: Sorry for my English... I'm still learning it.
Option 1 you suggested is the best option. Proxy detection can be time consuming and complicated.
Since you mentioned MaxMind and your concern about its effectiveness: there are other APIs available, like GetIPIntel. It's free and very simple to use. They go beyond simple blacklists and use machine learning and probability algorithms to determine a probability value, which makes things very accurate.
Option 2 doesn't hurt to implement unless you get a lot of false positives. Options 3-4 should not be used alone because they are very easy to get around: all browser actions can be automated, and just because someone is using a proxy does not mean they're not using a real browser.
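If your front end happens to be nginx (as elsewhere on this page), the header checks from options 2-3 can be sketched roughly as below; this only catches naive proxies that add Via/X-Forwarded-For-style headers, and the backend name and port are made up:

```nginx
http {
    # Flag requests carrying headers that cheap forward proxies tend to add.
    map "$http_via$http_x_forwarded_for$http_proxy_connection" $suspected_proxy {
        default 1;
        ""      0;
    }

    upstream ad_backend {
        server 127.0.0.1:3000;   # hypothetical geo-ad service
    }

    server {
        listen 80;

        location /geo-ad {
            # Suspected proxy users get no geo-targeted advert.
            if ($suspected_proxy) {
                return 204;
            }
            proxy_pass http://ad_backend;
        }
    }
}
```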
The best way is definitely to use an API. You could use the database from MaxMind, but then you need to keep downloading that database and rely on them to keep the data up to date. And, as you said, there are questions about the accuracy of MaxMind's data.
Personally, I would recommend you try https://proxycheck.io which, full disclosure, is my own site. You get full access to everything for free: premium proxy detection and blocking with 1,000 daily queries.
You can evaluate the IP2Proxy database, which is updated daily. It detects open proxies, web proxies, Tor and VPNs. https://www.ip2location.com/database/px2-ip-proxytype-country
Checking the Connection header is inaccurate for proxy types such as VPNs.
Checking headers is easily defeated; each new generation of proxies will attempt to work around the previous generation of detection methods.
Based on our experience, the best method of proxy detection is an accurate blacklist.

Using NGINX to forward tracking data to Flume

I am working on providing analytics for our web property based on instrumentation data we collect via a simple image beacon. Our data pipeline starts with Flume, and I need the fastest possible way to parse query string parameters, form a simple text message and shove it into Flume.
For performance reasons, I am leaning towards nginx. Since serving a static image from memory is already supported, my task is reduced to handling the query string and forwarding a message to Flume. Hence the question:
What is the simplest reliable way to integrate nginx with Flume? I am thinking about using syslog (Flume supports syslog listeners), but I struggle with how to configure nginx to forward custom log messages to a syslog (or just TCP) listener running on a remote server and on a custom port. Is it possible with existing 3rd party modules for nginx or would I have to write my own?
Separately, anything existing you can recommend for writing a fast $args parser would be much appreciated.
If you think I am on a completely wrong path and can recommend something better performance-wise, feel free to let me know.
Thanks in advance!
You should parse the nginx log file the way tail -f does and then pass the results to Flume. That will be the simplest and most reliable way. The problem with third-party syslog modules is that they can block nginx and get completely stuck under high load or if something goes wrong, which is why nginx did not support syslog natively for a long time (built-in syslog output only arrived in nginx 1.7.1).
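That said, on nginx 1.7.1 and later you can ship access-log lines to a syslog listener directly, which a Flume syslog source can consume without tailing files; a rough sketch, where the collector host/port and the log fields are assumptions:

```nginx
http {
    # Only what the analytics pipeline needs: client IP, timestamp, raw query string.
    log_format beacon '$remote_addr [$msec] "$args"';

    server {
        listen 80;

        location = /beacon.gif {
            empty_gif;   # serve the 1x1 tracking pixel straight from memory
            # UDP syslog to the (hypothetical) Flume syslog source.
            access_log syslog:server=flume.example.com:5140,tag=beacon,severity=info beacon;
        }
    }
}
```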
