This question is about a bot of mine whose primary focus is scraping.
The path is mapped out correctly and it does what it needs to do.
I have tested rate limits and am certain they are not a factor: whenever a rate limit did apply, we received an actual response.
However, the web page(s) I am trying to scrape seem to have some kind of unfamiliar security measure built in, something I haven't come across before. I am wondering how it is implemented and how to deal with it appropriately.
While the scraper/bot is doing its thing, sending requests and getting responses, at random times it runs into what I suspect is a security measure. There is simply no response from the server: not a 4xx error, nothing at all.
At first sight the proxies just appear dead, but that's not it, because they are not. The proxies work just fine, and manually I can just browse the page on them, no issues here.
The server just stops giving responses.
Now to find a workaround for this, I would need to be able to tell the difference between a timeout (for my proxies) and a no response. They appear the same, but are not.
Does anyone have insight into this problem? Maybe there is a clever way to tell the two apart that I am not aware of.
Now to find a workaround for this, I would need to be able to tell the difference between a timeout (for my proxies) and a no response. They appear the same, but are not.
A timeout means the server does not respond within a specific time. No response means that the server closes the connection without sending anything back, either before the timeout occurs or after it.
The first case is easy to detect: the connection closes before the timeout. If you instead want to detect that the server will close the connection without a response only after your current timeout, then your only option is to extend the timeout. Nothing the server sends will tell you in advance that it is going to close the connection without a response at some future time.
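As a rough illustration of the first case, here is a minimal Python sketch that separates "closed before the timeout" from "nothing arrived in time" on a plain socket. The function name classify_read and the outcome labels are made up for this example, and it assumes you are reading from the socket directly rather than through an HTTP library, which would wrap these conditions in its own exceptions.

```python
import socket

def classify_read(sock, timeout_s, bufsize=4096):
    """Wait up to timeout_s for data and report what happened (hypothetical helper)."""
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(bufsize)
    except socket.timeout:
        # Nothing arrived in time; the socket cannot tell us why.
        return "timeout", b""
    except ConnectionResetError:
        # The peer (or something in between) reset the connection.
        return "reset", b""
    if data == b"":
        # recv() returning zero bytes means the peer closed the connection.
        return "closed", b""
    return "data", data
```

Only the "closed" and "reset" outcomes tell you something definite. A "timeout" still leaves open whether the peer would have answered or closed the connection later.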
And since your only connection is with the proxy, there is no real way to detect whether the problem is at the proxy or at the server. Your only hope might be to set your timeout for the proxy larger than the timeout the proxy uses while waiting for the server. This way you may get a response from the proxy indicating that the connection to the server timed out.
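A small sketch of that idea in Python, under the assumption that you know (or can estimate) how long the proxy waits for the origin server. Both function names are invented for illustration. Whether a given proxy reports an upstream timeout at all depends on the proxy, and for HTTPS tunnelled through CONNECT it usually cannot do so once the tunnel is established.

```python
def client_timeout(proxy_upstream_timeout_s, margin_s=5.0):
    """Pick a client-side timeout that outlasts the proxy's own upstream timeout."""
    return proxy_upstream_timeout_s + margin_s

def interpret_proxy_status(status_code):
    """Map an HTTP status received via the proxy to a rough diagnosis."""
    if status_code == 504:
        # 504 Gateway Timeout: the proxy waited for the origin and gave up.
        return "proxy gave up waiting for the origin server"
    if status_code == 502:
        # 502 Bad Gateway: the proxy got an invalid response from the origin.
        return "proxy got an invalid response from the origin server"
    return "no proxy-specific diagnosis"
```

If your own timeout still fires first even with the margin, you are back to the case where the proxy itself said nothing.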
They appear the same, but are not.
They are the same. There is no difference. A read timeout means that data didn't arrive within the timeout period. For whatever reason. TCP doesn't know, and can't tell you. At the C level, recv() returned -1 with errno == EAGAIN/EWOULDBLOCK. That's all the information there is.
What you are asking is tantamount to 'data didn't arrive: where didn't it arrive from?' It's not a meaningful question.
I am talking about only one case here.
client sent a request to server -> server received it and returned a response -> unfortunately the response dropped.
I have only one question about this.
Is this case even possible? If it's possible then what should the response code be, or will client simply see it as read timeout?
As I want to sync status between client/server and want 100% accuracy no matter how poor the network is, the answer to this question can greatly affect the client's 'retry on failure' strategy.
Any comment is appreciated.
Yes, the situation you have described is possible and occurs regularly. It is called "packet loss". Since the packet is lost, the response never reaches the client, so no response code could possibly be received. Web browsers will display this as "Error connecting to server" or similar.
HTTP requests and responses are generally carried inside TCP segments. If a segment carrying the HTTP response is not acknowledged within the expected time window, the sending side retransmits it. A segment will only be retransmitted a certain number of times before a timeout error occurs and the connection is considered broken or dead. (The number of attempts before TCP timeout can be configured on both the client and server sides.)
Is this case even possible?
Yes. It's easy to see why if you picture a physical cable between the client and the server. If I send a request down the cable to the server, and then, before the server has a chance to respond, unplug the cable, the server will receive the request, but the client will never "hear" the response.
If it's possible then what should the response code be, or will client simply see it as read timeout?
It will be a timeout. If we go back to our physical cable example, the client is sitting waiting for a response that will never come. Hopefully, it will eventually give up.
It depends on exactly what tool or library you're using how this is wrapped up, however - it might give you a specific error code for "timeout" or "network error"; it might wrap it up as some internal 5xx status code; it might raise an exception inside your code; etc.
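Because the client cannot tell "request never arrived" from "response got lost", one common way to make retries safe is to attach the same request ID to every retry and have the server apply each ID at most once. The following is only a toy in-memory sketch of that idea: the classes Server and LossyLink and the function send_with_retry are invented for illustration and are not part of any library or of the question's setup.

```python
import uuid

class Server:
    """Toy server that applies each request ID at most once."""
    def __init__(self):
        self.counter = 0
        self.seen = {}  # request_id -> response already sent for it

    def handle(self, request_id, amount):
        if request_id in self.seen:
            # Duplicate of a request we already applied: replay the old response.
            return self.seen[request_id]
        self.counter += amount
        self.seen[request_id] = self.counter
        return self.counter

class LossyLink:
    """Delivers every request but loses the first `drops` responses."""
    def __init__(self, server, drops):
        self.server = server
        self.drops = drops

    def send(self, request_id, amount):
        response = self.server.handle(request_id, amount)
        if self.drops > 0:
            self.drops -= 1
            raise TimeoutError("response lost")  # the client only sees a timeout
        return response

def send_with_retry(link, amount, attempts=3):
    request_id = str(uuid.uuid4())  # the same ID is reused for every retry
    for _ in range(attempts):
        try:
            return link.send(request_id, amount)
        except TimeoutError:
            continue
    raise TimeoutError("gave up after %d attempts" % attempts)
```

With two lost responses and three attempts, the client ends up with the right answer and the server has applied the change exactly once, even though it received the request three times.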
I am new to Apache Camel and Netty and this is my first project. I am trying to use Camel with the Netty component to load balance heavy traffic in a back end load test scenario. This is the setup I have right now:
from("netty:tcp://this-ip:9445?defaultCodec=false&sync=true").loadBalance().roundRobin().to("netty:tcp://backend1:9445?defaultCodec=false&sync=true", "netty:tcp://backend2:9445?defaultCodec=false&sync=true")
The issue is unexpected buffer sizes that I am receiving in the response that I see in the client system sending tcp traffic to Camel. When I send multiple requests one after the other I see no issues and the buffer size is as expected. But, when I try running multiple users sending similar requests to Camel on the same port, I intermittently see unexpected buffer sizes, sometimes 0 bytes to sometimes even greater than the expected number of bytes. I tried playing around with multiple options mentioned in the Camel-Netty page like:
Increasing backlog
keepAlive
buffersizes
timeouts
poolSizes
workerCount
synchronous
stream caching (did not work)
disabled useOriginalMessage for performance
System level TCP parameters, etc. among others.
I have yet to resolve the issue. I am not sure if I'm fundamentally missing something. I did take a look at the encoders/decoders and wondered whether they could be the issue. But I don't understand why a load balancer needs to encode/decode messages. I have worked with other load balancers which just require endpoint configurations, and hence I am assuming that Camel does not require this. Am I right? Please note that the issue is not with my client/backend: I ran a 2000 user load test from my client directly to the backend with less than 1% failures, but I see a large number of failures (though some requests do succeed) with Camel. I have the following questions:
1. Is this a valid use case for Apache Camel with Netty? Should I be looking at Mina or others?
2. Can I try to route tcp traffic to JMS or other components and then finally to the tcp endpoint?
3. Do I need encoders/decoders or should this configuration work?
4. Should I continue with this approach or try some other load balancer?
Please let me know if you have any other suggestions. TIA.
Edit1:
I also tried the same approach with netty4 and mina components. The route looks similar to the one in netty. The route with netty4 is as follows:
from("netty4:tcp://this-ip:9445?defaultCodec=false&sync=true").to("netty4:tcp://backend1:9445?defaultCodec=false&sync=true")
I read a few posts which had the same issue but did not find any solution relevant to my issue.
Edit2:
I increased the receive timeout at my client and immediately saw the rate of mismatched buffer lengths fall to less than 1%. However, the difference in response time per transaction between using Camel and not using it is huge: almost 10 times higher with Camel. Can you help me reduce the response time for each transaction? The message received back at my client varies from 5000 to 20000 bytes. Here is my latest route:
from("netty:tcp://this-ip:9445?sync=true&allowDefaultCodec=false&workerCount=20&requestTimeout=30000")
.threads(20)
.loadBalance()
.roundRobin()
.to("netty:tcp://backend-1:9445?sync=true&allowDefaultCodec=false","netty:tcp://backend-2:9445?sync=true&allowDefaultCodec=false")
I also used certain performance enhancements like:
context.setAllowUseOriginalMessage(false);
context.disableJMX();
context.setMessageHistory(false);
context.setLazyLoadTypeConverters(true);
Can you point me in the right direction about how I can reduce the individual transaction times?
For the netty4 component there is no parameter called defaultCodec. It is called allowDefaultCodec. http://camel.apache.org/netty4.html
Also, try something like this first.
from("netty4:tcp://this-ip:9445?textline=true&sync=true").to("netty4:tcp://backend1:9445?textline=true&sync=true")
The above means the data being sent is normal text. If you are sending bytes or something else you will need to provide decoding/encoding for netty to handle the data.
A side note: before running the Camel route, send test messages manually with a standard tcp tool like sockettest to verify that everything works. Then implement the same via Camel. You can find sockettest here http://sockettest.sourceforge.net/ .
I finally solved the issue with the same route settings as above. The problem was that the request and response delimiters were not configured properly. As a result, the connection was either closed too early, leading to unexpected buffer sizes, or kept waiting too long even after the entire buffer had been received, leading to high response times.
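To illustrate why the delimiter matters: TCP delivers a byte stream, so a single read can return part of a message or several messages at once, and the receiver needs some rule for where a message ends. Below is a minimal Python sketch of delimiter-based framing. It is not Camel or Netty code, the names split_frames and FrameReader are invented, and the delimiter values are assumptions since the actual delimiter used in the setup above is not stated.

```python
def split_frames(buffer, delimiter=b"\n"):
    """Return (complete_messages, leftover_bytes) for a delimiter-framed stream."""
    # Everything after the last delimiter is an incomplete message.
    *messages, leftover = buffer.split(delimiter)
    return messages, leftover

class FrameReader:
    """Accumulates chunks as they arrive and yields only complete messages."""
    def __init__(self, delimiter=b"\n"):
        self.delimiter = delimiter
        self.pending = b""

    def feed(self, chunk):
        messages, self.pending = split_frames(self.pending + chunk, self.delimiter)
        return messages
```

If the configured delimiter does not match what the peer actually sends, a reader like this either cuts messages short or keeps waiting for a terminator that never arrives, which matches the two symptoms described above.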
I'm using SignalR with Redis as a message bus on a server that sits behind an Nginx proxy for load balancing. I used SignalR's PersistentConnection class to write a simple chat program that broadcasts messages to users belonging to the same group. Users are added to a group in OnConnectedAsync, removed in OnDisconnectAsync, and the user-to-group mapping is deterministic.
Currently, the client side falls back to long polling for whatever reason (I'm not entirely sure why), and whenever the client sets up a new connection after waiting for and receiving a response, seemingly at random, the server will sometimes respond to the new connection immediately with the previous response, despite there having only been one POST.
The message IDs tend to differ by exactly one (the smaller ID coming first), with the rest of the response remaining the same. I logged some debug info and am quite positive that my override of OnReceivedAsync is sending one response per request. I tried the same implementation without the Redis message bus and got the same problem. Running locally (with long polling), however, yielded good results, so I suspect that the problem might be with the way the message bus buffers messages to refresh clients who might not be caught up, combined with some weird timing in the tearing down and setting up of connections with the Nginx load balancer. Beyond that, I am very much at a loss.
Any help would be appreciated.
EDIT: Further investigation reveals that duplication occurs at somewhat regular intervals of approximately 20-30 seconds. I'm led to believe that the message expiration in the message bus might have something to do with the bug.
EDIT: Bug can be seen here: http://tinyurl.com/9q5t3va
The server is simply broadcasting a counter being sent by the client. You will notice some responses are duplicated every 20 or so.
Reducing the number of worker processes in the IIS (6.0) Server Manager from 2 to 1 solved the problem.
We currently experience a problem with a self-written server application running on Windows (occurs on different versions). The server listens at a TCP port, accepts connections, exchanges some data and then closes the connections again. There are about 100 clients that connect from time to time.
Sometimes the server stops working: log files show that connections are still accepted, but at the first read attempt a socket error (10054 - Connection reset by peer) occurs. I don't think it is a client issue because it suddenly stops working for all clients.
We have now found out that the same problem occurs with our old server software, which is even written in another programming language. So it doesn't seem to be an error in our program. I think it has to be some kind of OS / firewall issue? Of course, firewalls have been deactivated, but that hasn't solved the issue yet.
Any ideas where to look? Wireshark logs will follow soon.
Excerpt from the log (Timestamp, Thread Id, message)
11:37:56.137 T#3960 Connection from 10.21.13.3
11:37:56.138 T#3960 Client Exception: Socket Error # 10054
Connection reset by peer.
11:37:56.138 T#3960 ClientDisconnected
11:38:00.294 T#4144 Connection from 10.21.13.3
You can see that the exception occurs almost at the same time as the connection is accepted; in this case the client reconnects after a few seconds.
A "stateful" firewall or NAT keeps track of connections, and ought to send RSTs for connections it doesn't know about. If the firewall loses track of connections for some reason, then you'll probably see random connections being reset.
Our router at work does this — it forgets about connections when the PPP connection dies, which is remarkably unhelpful when it rains and the DSL restart takes a bit too long. However, instead of resetting connections, it just drops packets (even more unhelpful!).
Sounds like a firewall or routing issue - maybe stale connections get disconnected after a timeout period. Are you using a ping/keepalive inside your protocol?
Otherwise you could use Wireshark to see what is going on.
First, thanks for many hints - I'm afraid the problem was a completely different one which you couldn't possibly solve by reading my question.
The server application uses log4net, configured with a log file and ImmediateFlush = true. If every log statement is written directly to the file and multiple socket connections occur, this slows down the whole application.
The server needed about a minute to really accept the connection. This was far longer than the timeout on the client side. So the log only showed "accepted" followed by "disconnected" - even the log was delayed!
Sorry for the inconvenience...
Have you tried changing the backlog and then seeing how much time passes or how many clients are served before this problem occurs?
You don't say what Windows versions you're using for the server, but you should be aware that the Windows TCP/IP stack behaves differently in server and client OSes. There are limits on how many simultaneous incoming connections a client OS will allow, and they are significantly lower than you might expect.
What do the logs look like from the client side?
The error states that the client is dropping the connection. If you see the same error on the client side, then it is a firewall or proxy that is dropping the connection (both sides seeing the opposite side drop the connection is indicative of a proxy/firewall).
If the error is not present on the client side, then I would say that your client side is where you will see the actual error.
How long can I expect a client/server TCP connection to last in the wild?
I want it to stay permanently connected, but things happen, so the client will have to reconnect. At what point do I say that there's a problem in the code rather than there's a problem with some external equipment?
I agree with Zan Lynx. There's no guarantee, but you can keep a connection alive almost indefinitely by sending data over it, assuming there are no connectivity or bandwidth issues.
Generally I've gone for the application level keep-alive approach, although this has usually been because it was in the client spec, so I had to do it. Just send some short piece of data every minute or two, to which you expect some sort of acknowledgement.
Whether you count one failure to acknowledge as the connection having failed is up to you. Generally this is what I have done in the past, although there was one case where I had to wait for three failed responses in a row before dropping the connection, because the app at the other end was extremely flaky about responding to "are you there?" requests.
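The "N missed acknowledgements in a row" rule can be sketched as a tiny counter. This is only an illustration in Python: the class name HeartbeatMonitor and the default of three misses are assumptions taken from the anecdote above, and the actual sending of the "are you there?" probe is left out.

```python
class HeartbeatMonitor:
    """Tracks consecutive missed acknowledgements for an application-level keep-alive."""
    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0

    def record(self, ack_received):
        """Record one probe result; return True while the connection still counts as alive."""
        if ack_received:
            self.missed = 0      # any successful ack resets the streak
        else:
            self.missed += 1
        return self.missed < self.max_missed
```

Setting max_missed=1 gives the stricter behaviour where a single missing acknowledgement counts as a failed connection.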
If the connection fails, which at some point it probably will, even with machines on the same network, then just try to reestablish it. If that fails a set number of times then you have a problem. If your connection persistently fails after it's been connected for a while then again, you have a problem. Most likely in both cases it's probably some network issue, rather than your code, or maybe a problem with the TCP/IP stack on your machine (has been known: I encountered issues with this on an old version of QNX--it'd just randomly fall over). Having said that you might have a software problem, and the only way to know for sure is often to attach a debugger, or to get some logging in there. E.g. if you can always connect successfully, but after a time you stop getting ACKs, even after reconnect, then maybe your server is deadlocking, or getting stuck in a loop or something.
What's really useful is to set up a series of long-running tests under a variety of load conditions, from just sending the keep alive are you there?/ack requests and responses, to absolutely battering the server. This will generally give you more confidence about your software components, and can be really useful in shaking out some really weird problems which won't necessarily cause a problem with your connection, although they might result in problems with the transactions taking place. For example, I was once writing a telecoms application server that provided services such as number translation, and we'd just leave it running for days at a time. The thing was that when Saturday came round, for the whole day, it would reject every call request that came in, which amounted to millions of calls, and we had no idea why. It turned out to be because of a single typo in some date conversion code that only caused a problem on Saturdays.
Hope that helps.
I think the most important idea here is theory vs. practice.
The original theory was that the connections had no lifetimes. If you had a connection, it stayed open forever, even if there was no traffic, until an event caused it to close.
The new theory is that most OS releases have turned on the keep-alive timer. This means that connections will last forever, as long as the system on the other end responds to an occasional TCP-level exchange.
In reality, many connections will be terminated after time, with a variety of criteria and situations.
One really good example is a remote client using DHCP: the lease expires and the IP address changes.
Another is firewalls, which seem to be increasingly intelligent, can tell keep-alive traffic from real data, and close connections based on any high level criteria, especially idle time.
How you want to implement reconnect logic depends a lot on your architecture, the working environment, and your performance goals.
It shouldn't really matter, you should design your code to automatically reconnect if that is the desired behavior.
There really is no way to tell. There is nothing inherent to TCP that would cause the connection to just drop after a certain amount of time. Someone on a reliable connection could have years of uptime, while someone on a different connection could have to reconnect every 5 minutes. There is no way to tell or even guess.
You will need some data going over the connection periodically to keep it alive - many OSes or firewalls will drop an inactive connection.
Pick a value. One drop every hour is probably fine. Ten unexpected connection drops in 5 minutes probably indicates a problem.
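One way to turn "pick a value" into code is a sliding-window counter over unexpected drops. The sketch below is in Python. The class name DropTracker is invented, and the defaults of ten drops in 300 seconds simply mirror the numbers above rather than any recommended standard.

```python
from collections import deque

class DropTracker:
    """Flags a problem when too many connection drops fall inside a time window."""
    def __init__(self, max_drops=10, window_s=300.0):
        self.max_drops = max_drops
        self.window_s = window_s
        self.drops = deque()  # timestamps (seconds) of recent drops

    def record_drop(self, now_s):
        """Record a drop at time now_s; return True if the drop rate looks like a problem."""
        self.drops.append(now_s)
        # Forget drops that are older than the window.
        while self.drops and now_s - self.drops[0] > self.window_s:
            self.drops.popleft()
        return len(self.drops) >= self.max_drops
```

In real use you would pass time.monotonic() as now_s each time the connection drops unexpectedly. Taking the timestamp as an argument just keeps the logic easy to test.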
TCP connections will generally last about two hours without any traffic. Either end can send keep-alive packets, which are, I think, just an ACK on the last received packet. This can usually be set per socket or by default on every TCP connection.
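For reference, this is roughly what enabling keep-alive on a single socket looks like in Python. The function name enable_keepalive and the timer values are assumptions for illustration. SO_KEEPALIVE is widely available, but the constants for tuning the timers are platform-specific, so the sketch only sets the ones the running platform exposes.

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keep-alive; tune the timers only where the platform exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These options are not available everywhere, hence the guards.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Without the per-socket tuning, the system-wide defaults apply. On Linux the default idle time before the first probe is 7200 seconds, which is where the two-hour figure comes from.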
An application level keep-alive is also possible. For a telnet style protocol like FTP, SMTP, POP or IMAP, this could be something like sending return, newline and getting back a command prompt.