I'm looking at Donne Martin's design for a web crawler.
It suggests the following network optimizations:
The Crawler Service can improve performance and reduce memory usage by keeping many open connections at a time, referred to as connection pooling
Switching to UDP could also boost performance
I don't understand either suggestion. What has connection pooling got to do with web crawling? Isn't each crawler service opening its own connection to the host it's currently crawling? What good would connection pooling do here?
And about UDP: isn't crawling a matter of issuing HTTP requests over TCP to web hosts? How is UDP relevant here?
What has connection pooling got to do with web crawling? Isn't each crawler service opening its own connection to the host it's currently crawling?
I think you are assuming that the crawler will send a request to a host only once. This is not the case: a host may have hundreds of pages that you want to crawl, and opening a new connection for each request is not efficient.
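To make that concrete, here's a minimal sketch (mine, not Donne Martin's implementation) using Python's third-party requests library, which pools connections per host through a Session; the URLs are made-up examples:

```python
# Minimal sketch: reusing one pooled connection per host while crawling
# many pages. Assumes the third-party "requests" library; the URLs are
# hypothetical examples.
import requests

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

# A Session keeps a pool of open connections (via urllib3 under the hood),
# so all three requests below can reuse the same TCP (and TLS) connection
# instead of paying the handshake cost for every page.
with requests.Session() as session:
    for url in urls:
        response = session.get(url, timeout=10)
        print(url, response.status_code, len(response.content))
```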
About UDP: isn't crawling a matter of issuing HTTP requests over TCP to web hosts? How is UDP relevant here?
Taken from the book Web Data Mining:
The crawler needs to resolve host names in URLs to IP addresses. The connections to the Domain Name System (DNS) servers for this purpose are one of the major bottlenecks of a naïve crawler, which opens a new TCP connection to the DNS server for each URL. To address this bottleneck, the crawler can take several steps. First, it can use UDP instead of TCP as the transport protocol for DNS requests. While UDP does not guarantee delivery of packets and a request can occasionally be dropped, this is rare. On the other hand, UDP incurs no connection overhead with a significant speed-up over TCP.
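To see why UDP helps here, below is a minimal sketch of a DNS lookup done as a single UDP datagram using only Python's standard library. The resolver address 8.8.8.8, the hostname, and the crude answer parsing are all assumptions of the sketch, not anything from the book:

```python
# Minimal sketch: resolving a hostname with a hand-built DNS query over UDP.
# No connection setup is needed -- one datagram out, one datagram back --
# which is the speed-up the book quote describes.
import socket
import struct

def resolve_a_record(hostname, resolver=("8.8.8.8", 53)):
    # DNS header: ID=0x1234, flags=0x0100 (recursion desired),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # Question section: length-prefixed labels, then QTYPE=1 (A), QCLASS=1 (IN).
    question = b"".join(
        bytes([len(label)]) + label.encode() for label in hostname.split(".")
    ) + b"\x00" + struct.pack(">HH", 1, 1)

    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(2.0)
        sock.sendto(header + question, resolver)   # no handshake, just one packet
        response, _ = sock.recvfrom(512)

    # Crude parsing that assumes exactly one A-record answer and no extra
    # records: the last 4 bytes of the response are then the IPv4 address.
    return socket.inet_ntoa(response[-4:])

print(resolve_a_record("example.com"))
```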
Related
This is not a hacking question.
Imagine I have an application running on a local machine which has a TCP connection to some remote server. There are numerous ways to see the packets, the obvious one being Wireshark. But is there a way to send packets to both the application and the server without getting in the middle of the two? That is, without running a proxy between them, can I programmatically send packets to the application as if they were coming from the server, and to the server as if they were coming from the application?
Let's assume that I start a server at one of the computers in my private network (192.168.10.10:9900).
Now when making a request from some other computer in the same network, how does the client computer (OS?) know which protocol to use / which protocol the server follows? [TCP or UDP]
EDIT: As mentioned in the answers, I was basically looking for a default protocol which will be used by the client in the absence of any transport protocol information.
TCP and UDP work at the transport layer of the TCP/IP model. Their main difference is that TCP has mechanisms to ensure that messages arrive, while UDP is lighter and its virtue is being faster at delivering information. Which protocol is used is always decided by the application that uses it.
So the reference to a private server at 192.168.10.10:9900 is rather vague. To be more precise, say we have an Apache web server running at 192.168.10.10:9900 (the default port when installing the server is 80, but it can be changed in the configuration).
Now, web servers (Apache, IIS, etc.) use TCP because when a client (computer, cell phone, etc.) requests a page through a browser (Chrome, Firefox, etc.), the ideal outcome is receiving the whole website and not just some pieces of it. This is why these servers chose this protocol in the first place: they want the user to end up with the complete page, even if a few extra milliseconds are sacrificed in the validations that TCP performs.
Now to the client side. A user visiting a web page from any browser (Chrome, Firefox, etc.) will use TCP, since the browser is already built to send its requests over this protocol and to receive the website's information back the same way.
This behavior repeats for any client/server application. For an example on the UDP side, consider DHCP, the service used to receive an IP address when any device connects to a Wi-Fi network. This service aims to be as fast as possible rather than as reliable as possible, since you want the device to join the network as quickly as possible, so it uses UDP; any equipment connecting to a Wi-Fi network sends its DHCP messages using this protocol.
Finally, if you want to find out quickly whether a specific application uses TCP or UDP, you can use Wireshark, which lets you inspect the messages that leave the device and shows the protocol used at each layer.
There is no reason any client would make a request to your server, so why would it care what protocol it follows? Clients don't just randomly connect to things to see if there's a server there. So it doesn't make any difference to any client.
Normally, the client computer will use the TCP protocol by default. If you start the server in UDP mode, then when you run curl -XGET 192.168.10.10:9900/test-page, it will give you back a curl: (7) Failed to connect to 192.168.10.10 port 9900: Connection refused error. You can try it yourself: run nc -lvp 9900 -u and you will see that result.
The answers here are pointing to some default protocol. It's not that. Whenever you start an application, let's say an HTTP server, the server's code opens a socket (which can be TCP or UDP); since HTTP on port 80 is a TCP protocol, the code creates a TCP socket. Similarly, for any other network application it depends on its requirements what kind of transport-layer protocol to use (TCP or UDP). A DNS client, for example, will create a UDP socket to connect to the DNS server, since DNS on port 53 is mostly over UDP. TCP and UDP have different use cases, advantages, and disadvantages, and the decision to implement a server with one or the other is taken based on those.
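To illustrate that point, here's a minimal sketch in Python: the application's own code chooses TCP or UDP by the socket type it opens; port 9900 is the example from the question:

```python
# Minimal sketch: the server's own code decides the transport protocol by
# the kind of socket it opens. Port 9900 is the example from the question;
# here we bind to all interfaces rather than a specific private address.
import socket

# TCP server: SOCK_STREAM = connection-oriented, reliable byte stream.
tcp_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_server.bind(("", 9900))
tcp_server.listen()          # clients must connect() first (three-way handshake)

# UDP server: SOCK_DGRAM = connectionless datagrams.
udp_server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_server.bind(("", 9900))  # same port number is fine; it's a different protocol
# no listen()/accept() here; clients just sendto() this address

# A client must open the matching socket type. A TCP client (e.g. curl)
# talking to a UDP-only listener gets "Connection refused", as shown above.
```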
Oddly, I didn't find this info by googling. What is the cost of establishing a connection using Unix Domain sockets versus TCP sockets?
Right now I have to do connection pooling with TCP sockets because reconnecting is quite expensive. I wonder if I can simplify my client by simply switching to Unix Domain sockets and getting rid of connection pooling.
If you look into the code, you'll see that Unix Domain sockets execute far less code than TCP sockets.
Messages sent through TCP sockets have to go all the way through the networking stack to the loopback interface (which is a virtual network interface device typically called "lo" on Unix-style systems), and then back up to the receiving socket. The networking stack code tacks on TCP and IP headers, makes routing decisions, forwards a packet to itself through "lo", then does more routing and strips the headers back off. Furthermore, because TCP is a networking protocol, the connection establishment part of it has all kinds of added complexity to deal with dropped packets. Most significantly for you, TCP has to send three messages just to establish the connection (SYN, SYN-ACK, and ACK).
Unix Domain sockets simply look at the virtual file system (or the "abstract namespace") to find the destination socket object (in RAM) and queue the message directly. Furthermore, even if you are using the file system to name your destination socket, if that socket has been accessed recently, its file system structures will be cached in RAM, so you won't have to go to disk. Establishing a connection for a Unix Domain socket involves creating a new socket object instance in RAM (i.e., the socket that gets returned by accept(), which is something that has to be done for TCP too) and storing a pointer in each of the two connected socket objects (so they each have a pointer to the other socket later when they need to send). That's pretty much it. No extra packets are needed.
By the way, this paper suggests that Unix Domain sockets are actually faster than even Pipes for data transfers:
http://osnet.cs.binghamton.edu/publications/TR-20070820.pdf
Unfortunately, they didn't do specific measurements of connection establishment costs, but as I have said, I've looked at the Linux source code and it's really quite a lot simpler than the TCP connection establishment code.
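As a rough illustration of how similar the client code looks despite the different kernel work, here's a minimal Python sketch of the two connects; the socket path /tmp/app.sock and port 5432 are hypothetical, so point them at whatever server you are actually pooling connections to:

```python
# Minimal sketch: connecting over a Unix Domain socket vs. a TCP socket.
# /tmp/app.sock and port 5432 are made-up examples.
import socket

# Unix Domain socket: connect() just links two in-kernel socket objects;
# no handshake packets, no loopback traversal.
uds = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
uds.connect("/tmp/app.sock")

# TCP socket to localhost: connect() still performs the SYN / SYN-ACK / ACK
# exchange through the loopback interface before any data can flow.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("127.0.0.1", 5432))

# The rest of the client code is identical for both: send()/recv() on the
# connected socket, so switching away from TCP is usually a one-line change.
uds.close()
tcp.close()
```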
Connecting to a server using TCP sockets may involve network traffic, as well as the TCP three-way handshake.
Local sockets (formerly known as Unix domain sockets) are all local, but need to access a physical file on disk.
If you only do local communication then local sockets might be faster as there is less overhead from the protocol. If your application needs to connect remotely then you can't use local sockets.
By the way, if you're only communicating locally, and not over a network, a pair of named pipes (or anonymous pipes if you're forking) might be even better.
I want to connect two clients (via TCP/IP sockets). The clients can discover each other using an intermediate server. Once the clients discover each other, there should not be any involvement of the server.
I made some study of this and found many people suggesting JXTA. But I'd like to create the protocol myself from scratch (because in the future I might have to implement the same thing using WebSockets as well, when my client is a browser). Currently, my clients can be desktop applications or mobile applications.
My questions are:
1) How will clients discover each other at the server? If the server sends the global IP addresses of the clients to each other, will that information be enough to create a peer-to-peer connection? What if the clients are on the same LAN and the server is on a different WAN?
2) Clients have dynamic IP addresses. Can their IP change all of a sudden even while there is an active socket?
3) Is a peer-to-peer connection reliable for transferring non-continuous data (as in a chat application)?
[NOTE: by peer-to-peer connection I mean establishing a client-server TCP/IP socket connection (using Java) by making one of the clients a temporary socket server]
Thanks in advance.
1) When the clients connect to the server they will have to notify the server of the port number they will keep open for incoming connections from other clients. The server will know each client's IP address. The server then needs to send these details to the other party/client (see the sketch after this list). The actual location of the clients does not make any difference. If two clients are on the same network, the network routers will find them and make their communication paths shorter.
2) A dynamic IP address can NOT change during an active connection; if it does, the connection will be dropped and both clients will have to re-initiate the connection through the server as in 1).
3) Yes
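Here's a minimal Python sketch of the rendezvous step from point 1; the single-line "PORT"/"PEER" message format and port 5000 are made up for illustration, and real NAT traversal needs more than this:

```python
# Minimal sketch of the rendezvous server from point 1. Each client sends a
# "PORT <n>" line naming the port it listens on; the server replies to each
# with the other's (IP, port) and then gets out of the way.
import socket

RENDEZVOUS_PORT = 5000   # hypothetical port for the discovery server

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("", RENDEZVOUS_PORT))
server.listen()

peers = []
while len(peers) < 2:
    conn, (ip, _) = server.accept()
    line = conn.makefile().readline()          # e.g. "PORT 6001\n"
    listen_port = int(line.split()[1])
    peers.append((conn, ip, listen_port))

# Exchange the details; afterwards the clients connect to each other directly.
(conn_a, ip_a, port_a), (conn_b, ip_b, port_b) = peers
conn_a.sendall(f"PEER {ip_b} {port_b}\n".encode())
conn_b.sendall(f"PEER {ip_a} {port_a}\n".encode())
conn_a.close()
conn_b.close()
server.close()
```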
Scenario: 2 network devices, each on separate private LANs. The LANs are connected by public Internet.
Device on network A is listening on a socket; network A firewall has NAT port forward setup for external access from network B's port range.
Device on network B makes outgoing connection to network A to connect to the listen socket.
Is there any difference in vulnerability between a short-term connection made for a data transfer and then dropped when complete (e.g. a few seconds), and a persistent connection which employs a keep-alive mechanism and reconnects when dropped (hours, days, ...)?
The security of actually making the connection is not part of my question.
the client will maintain a persistent connection to server
No such thing exists.
Each connection -- no matter how long it's supposed to last -- will eventually get disconnected. It may be seconds before the disconnect or centuries, but it will eventually get disconnected. Nothing is "persistent" in the sense of perpetually on.
There is no such thing as a "keep-alive mechanism". It will get disconnected.
"Assume the server authenticates the client upon connection". Can't assume that. That's the vulnerability. Unless you have a secure socket layer (SSL) to assure that the TCP/IP traffic itself is secure. If you're going to use SSL, why mess around with "keep-alive"?
When it gets disconnected, how does it get connected again? And how do you trust the connection?
Scenario One: Denial of Service.
Bad Guys are probing your socket waiting for it to accept a connection.
Your "persistent" connection goes down. (Either client crashed or you crashed or network routing infrastructure crashed. Doesn't matter. Socket dead. Must reconnect.)
Bad Guys get your listening socket first. They spoof their IP address and you think they're the client. They're in -- masquerading as the client.
The client host attempts their connection and you reject it saying they're already connected.
Indeed, this is the exact reason why folks invented and use SSL.
Based on this, you can dream up a DNS-enabled scenario that will allow Bad Guys to (a) get connected and then (b) adjust a DNS entry to make them receive connections intended for you. Now they're in the middle. Ideally, DNS security foils this, but it depends on the client's configuration. They could be open to DNS hacks, who knows?
The point is this.
Don't Count On A Persistent Connection
It doesn't exist. Everything gets disconnected and reconnected. That's why we have SSL.
The client can simply reconnect, the server must respond to the user request with the appropriate error.
False. The client cannot "simply" reconnect. Anyone can connect. Indeed, you have to assume "everyone" is trying to connect and will beat the approved client.
To be sure it's the approved client you have to exchange credentials. Essentially implementing SSL. Don't implement your own SSL. Use existing SSL.
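For example, here's a minimal sketch of "use existing SSL" with Python's standard ssl module; example.com:443 is a placeholder for whatever server the client reconnects to:

```python
# Minimal sketch: re-establishing an authenticated connection with the
# standard library's ssl module instead of a home-grown credential exchange.
# example.com:443 is a placeholder endpoint.
import socket
import ssl

context = ssl.create_default_context()        # verifies the server certificate
# For mutual authentication (proving the *client* is the approved one),
# you would also load a client certificate here:
# context.load_cert_chain("client.pem", "client.key")   # hypothetical files

with socket.create_connection(("example.com", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="example.com") as tls:
        # Every reconnect repeats this handshake, so a dropped "persistent"
        # connection can always be re-established and re-authenticated.
        print(tls.version(), tls.getpeercert()["subject"])
```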
would they have to break into a switch site?
Only in the movies. In the real world, we use packet sniffers.