Is a netflow record equal to a session? - networking

Because I don't fully understand what a session is in networking, I'm puzzled about whether a NetFlow record equals a session.
If I upload some files to a server through FTP in one go, and this produces 50
NetFlow records (same source and destination IP, but different ports), does that process equal 50 sessions, or does the whole process, once the server has closed the connections, equal only one session?
thanks a lot :)

Short answer: it all depends.
There are many factors and variables when working with network flows, whether it's Cisco's NetFlow format (in various versions), IETF's IPFIX or other similar formats. If we take a very common format, NetFlow v5, a flow is defined by a 5- or 7-tuple (depending on how detailed the definition is). These tuples are: source and destination IP address, source and destination port, and protocol (plus, for the 7-tuple, Type of Service and ingress interface index). NetFlow v5 is also a uni-directional flow protocol, meaning it treats connections coming from the server separately from those going to the server. So any IP packet matching that 5/7-tuple definition in one direction will constitute a network flow and result in a NetFlow record. All of this has to be taken into consideration when examining NetFlow data and comparing it to network communication sessions (which themselves may have different definitions depending on context).
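For illustration, here is a minimal sketch in Python (the field names are made up, not tied to any particular exporter) of how a uni-directional 5/7-tuple flow key could be derived, so that the two directions of one FTP data connection fall into two different flow records:

from collections import namedtuple

# Hypothetical packet representation; a real exporter reads these fields
# from the IP/TCP/UDP headers of each observed packet.
Packet = namedtuple("Packet", "src_ip dst_ip src_port dst_port proto tos in_iface")

def flow_key_5(pkt):
    """Classic 5-tuple: one key per direction (uni-directional flows)."""
    return (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)

def flow_key_7(pkt):
    """NetFlow v5-style 7-tuple: adds Type of Service and ingress interface."""
    return flow_key_5(pkt) + (pkt.tos, pkt.in_iface)

# The client->server and server->client halves of one FTP data connection
# produce two distinct keys, hence two separate flow records.
up = Packet("10.0.0.5", "192.0.2.10", 50321, 20, 6, 0, 1)
down = Packet("192.0.2.10", "10.0.0.5", 20, 50321, 6, 0, 2)
assert flow_key_5(up) != flow_key_5(down)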
And as if that wasn't enough, there are also implementation-specific variables and limitations that may split a session into several records. Flow exporters usually implement various timeouts so they can collect and store data efficiently. TCP sessions may stay open over a long time period, which makes it challenging for the flow generator to keep and maintain the flow in memory, so a long-lived connection can be flushed as several consecutive records. Some network flow formats have the ability to locate such split records and merge them into a single flow record.
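As a rough illustration of that merging idea (Python again; the record fields and the 30-second gap threshold are assumptions, not taken from any specific tool):

from collections import defaultdict

def merge_split_records(records, max_gap=30.0):
    """Merge flow records that share a flow key and are adjacent in time,
    on the assumption the exporter split one long-lived session.
    records: dicts with hypothetical fields key, start, end, bytes, packets.
    max_gap: assumed threshold (seconds), roughly matching the exporter's timeouts."""
    merged = defaultdict(list)
    for rec in sorted(records, key=lambda r: (r["key"], r["start"])):
        runs = merged[rec["key"]]
        if runs and rec["start"] - runs[-1]["end"] <= max_gap:
            last = runs[-1]
            last["end"] = max(last["end"], rec["end"])
            last["bytes"] += rec["bytes"]
            last["packets"] += rec["packets"]
        else:
            runs.append(dict(rec))
    return [r for runs in merged.values() for r in runs]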
So, to sum up, network analysts starting to work with network flows easily fall into the trap of thinking one record equals one session. That assumption may be true sometimes, but not always.

Related

How to get high availability when download data on 2 servers?

I have data that needs to be downloaded to a local server every 24 hours. For high availability we have provided 2 servers, to avoid failures and losing data.
My question is: What is the best approach to use the 2 servers?
My ideas are:
-Download on one server, and only if the download fails for any reason does the download continue on the other server.
-Download occurs on both servers at the same time every day.
Any advice?
In terms of your high-level approach, break it down into manageable chunks, i.e. reliable data acquisition and highly available data dissemination. I would start with the second part first, because that's the state you want to get to.
Highly available data dissemination
Working backwards (i.e. this is the second part of your problem), when offering highly-available data to consumers you have two options:
Active-Passive
Active-Active
Active-Active means you have at least two nodes servicing requests for the data, with some kind of Load Balancer (LB) in front, which allocates the requests. Depending on the technology you are using there may be existing components/solutions in that tech stack, or reference models describing potential solutions.
Active-Passive means you have one node taking all the traffic, and when that becomes unavailable requests are directed at the stand-by / passive node.
The passive node can be "hot" - ready to go - or "cold" - meaning it's not fully operational but is relatively fast and easy to stand up and start taking traffic.
In both cases, and if you have only 2 nodes, you ideally want both the nodes to be capable of handling the entire load. That's obvious for Active-Passive, but it also applies to active-active, so that if one goes down the other will successfully handle all requests.
In both cases you need some kind of network component that routes the traffic. Ideally it will be able to operate autonomously (it will have to if you want active-active load sharing), but you could have a manual / alert based process for switching from active to passive. For one thing, it will depend on what your non-functional requirements are.
Reliable data acquisition
Having figured out how you will disseminate the data, you know where you need to get it to.
E.g. if active-active, you need to get it to both at (roughly) the same time (I don't know what tolerances you can have), since you want them to serve the same, consistent data. One option to get around that issue is this:
Have the LB route all traffic to node A.
Node B performs the download.
The LB is informed that Node B successfully got the new data and is ready to serve it. LB then switches the traffic flow to just Node B.
Node A gets the updated data (perhaps from Node B, so the data is guaranteed to be the same).
The LB is informed that Node A successfully got the new data and is ready to serve it. LB then allows the traffic flow to Nodes A & B.
This pattern would also work for active-passive:
Node A is the active node, B is the passive node.
Node B downloads the new data, and is ready to serve it.
Node A gets updated with the new data (probably from node B), to ensure consistency.
Node A serves the new data.
You get the data on the passive node first so that if node A went down, node B would already have the new data. Admittedly the time-window for that to happen should be quite small.
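A very rough orchestration sketch of the active-active refresh sequence above (Python-style pseudocode; the lb, node_a and node_b objects and their methods are hypothetical stand-ins for whatever load balancer API and sync mechanism you actually use):

def refresh_data(lb, node_a, node_b):
    """Roll new data through two active nodes without serving mixed versions.
    Every object and method here is an assumed placeholder, not a real API."""
    lb.route_all_to(node_a)            # 1. drain traffic onto node A only
    node_b.download_new_data()         # 2. B fetches the daily data set
    lb.route_all_to(node_b)            # 3. B is consistent, switch traffic to it
    node_a.sync_from(node_b)           # 4. A copies from B, guaranteeing identical data
    lb.route_to_both(node_a, node_b)   # 5. resume active-active load sharing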

Network design: one-to-many high bandwidth transmissions that excludes users

If StackOverflow is the wrong Exchange for this question, please help direct me to the correct one.
Short Version
What is the best design for a networking application in which one user transmits a constant, high-bandwidth stream of data to many other addresses? The solution must not require the uploader to duplicate the packets for each recipient and preferably will not transmit to users that have not been accepted by the transmitter.
Long Version
A friend and I have written an application that enables someone to transmit data in real time to one or more recipients that he wants to receive the data. I have designed the high-level application protocol to use UDP and to encode the data so that each packet can be lost without hurting the use of the rest. This solution requires managing sockets with each user and sending each packet to every user.
The problem here is that the stream can be very high bandwidth. The user can modify the settings for how high quality the data he is sending should be, and can end up sending 6 Mbps to each user. It is unfeasible to expect a user to pay his ISP enough to be allowed to upload such a stream to the preferred minimum of four other users at a time.
We need a way for the transmitter to send a packet exactly once and have each user receive a copy.
We have looked at multicasting. It may be what we need to use in the end, but we are concerned about the fact that anyone can join any group. It would be preferable that users we do not want to see the data are not able to join in. There is also the problem that if multiple transmitters happen to use the same group, viewers may find that they are receiving multiple streams' worth of data when they only want one.
My searching has revealed something IBM published over a decade ago called Explicit Multicast (Xcast) that looks perfect, but I have yet to find any information to determine whether this technology is commonly supported. Also, I have not yet seen whether it supports datagrams.
Does anyone know the best way to design an application that meets our needs?
Please keep in mind that we have no funds to support our project. Solutions need to be free.
Edit
In the summary above, I hinted at but failed to explicitly state that this is for a real-time application. The motivating drive behind the application is to keep the clients/recipients as close together in time as is possible. If packets are lost or arrive too late to be used in keeping the server and clients in phase, they need to be disregarded. That is why I designed the application protocol on top of UDP with independent data in each packet. Even if a client receives only one packet out of 300 for a given time step, it will use what it did get.
I think that I_am_Helpful's recommendation may be a good step in the right direction (or possibly the destination). I need to do some experimentation to determine whether using a system like Spread will work. However, I do not think I can budget more than an additional 17 ms of transmission time.
If you can think of a system that enables sending unreliable datagrams to a specific group of users (like Spread) for a real-time application (unlike Spread, see p. 3), please let me know about it.
We need a way for the transmitter to send a packet exactly once and have each user receive a copy.
In my limited knowledge, I would say that Reliable Multicasting appears to be one of the viable options for broadcasting to the group. I would like to mention some possible Java APIs* which could help you achieve this:
JGroups Java API
The Spread Toolkit -> Spread consists of a library that user applications are linked with, a binary daemon which runs on each computer that is part of the processor group, and various utility and demonstration programs.
Appia
*NOTE: I have never worked with these APIs.
It would be preferable that users we do not want to see the data are not able to join in.
They do provide this feature; e.g., Spread supports thousands of groups with different sets of members. It also provides a range of reliability, ordering and stability guarantees for messages. JGroups can be used to create groups of processes whose members can send messages to each other. It also has facilities like group creation and deletion (group members can be spread across LANs or WANs).
There is also the problem that if multiple transmitters happen to use the same group, viewers may find that they are receiving multiple streams' worth of data when they only want one.
Since you can easily create multiple groups in the same network (using Spread, etc.), I believe that would no longer be an issue. It is your responsibility to sort users into the appropriate groups.
I hope the given information helps. Good LUCK.
Via multicast you achieve exactly what you want: sending each packet once. But authentication seems to be a concern for you.
One possible solution could be symmetric cryptography, where the same key is used to encrypt and decrypt. Via TCP your clients connect to a server and fetch the multicast IP address of the transmission and its associated key; then they join the multicast group and decrypt the incoming transmission.
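A minimal receiver-side sketch along those lines (Python; the control server's address, its reply format and the decrypt helper are assumptions made purely for illustration):

import socket
import struct

def fetch_group_and_key(control_host, control_port=4000):
    """Ask a (hypothetical) TCP control server for the multicast group,
    port and symmetric key. The wire format is made up: 'group:port:hexkey\n'."""
    with socket.create_connection((control_host, control_port)) as ctl:
        line = ctl.makefile().readline().strip()
    group, port, hexkey = line.split(":")
    return group, int(port), bytes.fromhex(hexkey)

def receive_stream(group, port, key, decrypt):
    """Join the multicast group and decrypt each datagram independently,
    so a lost packet never blocks the ones that follow."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4sl", socket.inet_aton(group), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        packet, _addr = sock.recvfrom(65535)
        yield decrypt(key, packet)   # decrypt() is whatever symmetric scheme you pick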
If you accept a more flexible solution, you could have a server which sends a transmission in real time to a set of distributed servers. Your clients connect to one of these distributed servers via unicast, and after authentication is done they are included in a list of receivers. Each distributed server then sends each new transmission packet to each registered client via UDP. In ordinary situations your clients would have the same experience as if it were delivered in a multicast group, but the servers will spend far more bandwidth. Multiple transmissions at a time would be allowed, which could be good for you, and you get more control, as clients can send signals to the servers, like PAUSE, etc.

bind zmq dealer to multiple repliers

I want to create a proxy server which routes incoming packets from REQ type sockets to one of the REP sockets on one of the computers in a cluster. I have been reading the guide and I think the proper structure is a combination of ROUTER and DEALER on the proxy server, where the ROUTER passes messages to the DEALER to be distributed. However, I cannot figure out how to create this connection scheme. Is this the correct architecture? If so, how do I bind a dealer to multiple addresses? The flow I envision is like this: REQ->ROUTER|DEALER->[REP, REP, ...], where only one REP socket would handle a single request.
NB: forget about packets -- think in terms of "Behaviour", that's the key
ZeroMQ is rather an abstraction layer for certain communication-behavioral patterns, so while terms like socket sound similar to what one has read/used previously, the ZeroMQ world is quite different from many points of view.
This very formalism allows ZeroMQ Formal-Communication-Patterns to grow in scale, to get assembled into higher-order patterns (for load-balancing, for fault-tolerance, for performance-scaling). Mastering this style of thinking, you forget about packets, thread-sync issues and I/O-polling, and focus on your higher-abstraction-based design -- on Behaviour -- rather than on underlying details. This makes your design both free from re-inventing the wheel & very powerful, as you re-use highly professional tools right for your problem-domain tasks.
DEALER->[REP,REP,...] Segment
That said, your DEALER-node (in fact a ZMQ socket-access-node, having the Behaviour called a "DEALER" to describe its queue/buffering style, its round-robin dispatcher, its send-out & expect-answer-in model) may .bind() to multiple localhost address:port-s, and these "service points" may also operate over different TransportClass-es -- one working over tcp://, another over inproc://, if that makes sense for your Design Architecture -- ZeroMQ empowers you to use this while staying transparently abstracted from all the "awful & dangerous" lower-level nitty-gritty.
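For example, with pyzmq (a sketch only; the endpoints are arbitrary), one DEALER socket can .bind() several service points over different transports at once:

import zmq

ctx = zmq.Context.instance()
dealer = ctx.socket(zmq.DEALER)

# One socket, several "service points" over different transport classes.
dealer.bind("tcp://127.0.0.1:5560")   # reachable over TCP
dealer.bind("ipc:///tmp/backend")     # reachable by local processes
dealer.bind("inproc://backend")       # reachable by threads in this process

# Messages sent on this socket are round-robined across all connected
# peers, regardless of which endpoint they attached through.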
ZeroMQ also allows you to reverse .connect() / .bind()
In principle, where helpful, one may reverse the .bind() and .connect() between the DEALER and a known target address of the respective REP entity.
You leave a couple details out that are important to determining the correct architecture.
When you say "from REQ type sockets to one of the REP sockets on one of the computers in a cluster", how do you determine which computer gets the message? Is it addressed to a specific computer? Does a computer announce its availability before it can receive a message? Does each message just get passed to the next one in line in a round-robin fashion? (if it's not the last one, you probably don't want a DEALER socket)
When you say "how do I bind a dealer to multiple addresses", it's not clear what you mean by "addresses"... Do you mean to say that the proxy has a unique IP address that it uses to communicate with each computer in the cluster? Or are you just wondering how to manage the connection to multiple different peers with the same socket? The former is a special case, the latter is simple.
I'm going to work with the following assumptions:
You want a worker computer from the cluster to announce its availability for work before it receives any work, and any computer in the cluster can handle any job. A faster worker, or a worker working on a smaller job, will not have to wait behind some slow worker to finish their job and get a new job first.
The proxy/broker uses a single ip interface to communicate with all workers.
If those are true, then what you want will be closer to this:
REQ->ROUTER|ROUTER->[REQ, REQ, ...]
A worker will create a request to the backend router socket to announce its availability, and await a reply with work. Once it is finished, it will create a new request with the finished work, which again announces its availability. The other half of the pattern you've already worked out.
This is the Simple Pirate Pattern from the ZMQ guide. It's a good place to start, but it's not very robust. This is in the Reliable Request-Reply Patterns section of the guide, and I suggest you read or reread that section carefully as it will guide you well. In particular, they keep refining this pattern into more and more reliable implementations and wind up with the Majordomo pattern, which is very robust and fault tolerant. You should see if you need all the features that provides or if you can scale it back a little. Either way, you should learn and understand what these patterns are doing and why before you make the choice to do something different.
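As a rough starting point, here is a stripped-down sketch of that ROUTER|ROUTER queue in pyzmq, loosely following the guide's load-balancing idea but without the heartbeating and retries that make the Pirate/Majordomo patterns robust (the ports and the b"READY" signal are arbitrary choices here):

import zmq

context = zmq.Context()
frontend = context.socket(zmq.ROUTER)   # clients (REQ) connect here
backend = context.socket(zmq.ROUTER)    # workers (REQ) connect here
frontend.bind("tcp://*:5555")           # arbitrary example ports
backend.bind("tcp://*:5556")

poller = zmq.Poller()
poller.register(backend, zmq.POLLIN)

idle_workers = []                       # identities of workers that said READY

while True:
    events = dict(poller.poll())

    if backend in events:
        # Either b"READY" or a reply: [worker_id, b"", client_id, b"", reply]
        frames = backend.recv_multipart()
        worker_id = frames[0]
        if not idle_workers:
            # First idle worker: start accepting client requests again.
            poller.register(frontend, zmq.POLLIN)
        idle_workers.append(worker_id)
        if frames[2] != b"READY":
            # Route the worker's reply back to the originating client.
            frontend.send_multipart(frames[2:])

    if frontend in events:
        # Client request: [client_id, b"", request]
        frames = frontend.recv_multipart()
        worker_id = idle_workers.pop(0)
        backend.send_multipart([worker_id, b""] + frames)
        if not idle_workers:
            # No idle workers left: stop reading client requests for now.
            poller.unregister(frontend)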

What benefit is conferred by TCP timestamp?

I have a security scan finding directing me to disable TCP timestamps. I understand the reasons for the recommendation: the timestamp can be used to calculate server uptime, which can be helpful to an attacker (good explanation under heading "TCP Timestamps" at http://www.silby.com/eurobsdcon05/eurobsdcon_silbersack.pdf).
However, it's my understanding that TCP timestamps are intended to enhance TCP performance. Naturally, in the cost/benefit analysis, performance degradation is a big, possibly too big, cost. I'm having a hard time understanding how much, if any, performance cost there is likely to be. Any nodes in the hivemind care to assist?
The answer is most succinctly expressed in RFC 1323 - Round-Trip Measurement... The introduction to the RFC also provides some relevant historical context...
Introduction
The introduction of fiber optics is resulting in ever-higher
transmission speeds, and the fastest paths are moving out of the
domain for which TCP was originally engineered. This memo defines a
set of modest extensions to TCP to extend the domain of its
application to match this increasing network capability. It is based
upon and obsoletes RFC-1072 [Jacobson88b] and RFC-1185 [Jacobson90b].
(3) Round-Trip Measurement
TCP implements reliable data delivery by retransmitting
segments that are not acknowledged within some retransmission
timeout (RTO) interval. Accurate dynamic determination of an
appropriate RTO is essential to TCP performance. RTO is
determined by estimating the mean and variance of the
measured round-trip time (RTT), i.e., the time interval
between sending a segment and receiving an acknowledgment for
it [Jacobson88a].
Section 4 introduces a new TCP option, "Timestamps", and then
defines a mechanism using this option that allows nearly
every segment, including retransmissions, to be timed at
negligible computational cost. We use the mnemonic RTTM
(Round Trip Time Measurement) for this mechanism, to
distinguish it from other uses of the Timestamps option.
The specific performance penalty you incur by disabling timestamps will depend on your specific server operating system and how you do it (for examples, see this PSC doc on performance tuning). Some OSes require that you either enable or disable all RFC 1323 options at once; others allow you to enable RFC 1323 options selectively.
If your data transfer is somehow throttled by your virtual server (maybe you only bought the cheap vhost plan), then perhaps you couldn't possibly use higher performance anyway... perhaps it's worth turning them off to try. If you do, be sure to benchmark your before and after performance from several different locations, if possible.
Why would the security people want you to disable timestamps? What possible threat could a timestamp represent? I bet the NTP crew would be unhappy with this ;^)
The TCP timestamp, when enabled, will allow you to guess the uptime of a target system (e.g. nmap -v -O <target>). Knowing how long a system has been up will enable you to determine whether security patches that require a reboot have been applied or not.
To Daniel and anyone else wanting clarification:
http://www.forensicswiki.org/wiki/TCP_timestamps
"TCP timestamps are used to provide protection against wrapped sequence numbers. It is possible to calculate system uptime (and boot time) by analyzing TCP timestamps (see below).
These calculated uptimes (and boot times) can help in detecting hidden network-enabled operating systems (see TrueCrypt), linking spoofed IP and MAC addresses together, linking IP addresses with Ad-Hoc wireless APs, etc."
It's a low-risk vulnerability noted in PCI compliance.
I got asked a similar question on this topic, today. My take is as follows:
An unpatched system is the vulnerability, not whether attacker(s) can easily find it. The solution, therefore, is to patch your systems regularly. Disabling TCP timestamps won't do anything to make your systems less vulnerable - it's simply security through obscurity, which is no security at all.
Turning the question on its head, consider scripting a solution that uses TCP timestamps to identify hosts on your network that have the longest uptimes. These will typically be your most vulnerable systems. Use this information to prioritise patching, to ensure that your network remains protected.
Don't forget that information like uptime can also be useful to your system administrators. :)
I wouldn't do it.
Without timestamps, the TCP Protection Against Wrapped Sequence numbers (PAWS) mechanism won't work. PAWS uses the timestamp option to determine whether a sudden, seemingly random sequence number change is a wrap of the 32-bit sequence space rather than an insane packet from another flow.
If you don't have this, then your TCP sessions will burp every once in a while, depending on how fast they are using up the sequence number space.
From RFC 1185:
Network    Bandwidth   Bytes/sec    Time to wrap (seconds)
ARPANET    56 kbps     7 KB/s       3*10**5 (~3.6 days)
DS1        1.5 Mbps    190 KB/s     10**4 (~3 hours)
Ethernet   10 Mbps     1.25 MB/s    1700 (~30 mins)
DS3        45 Mbps     5.6 MB/s     380
FDDI       100 Mbps    12.5 MB/s    170
Gigabit    1 Gbps      125 MB/s     17
Take 45Mbps (well within 802.11n speeds), then we would have a burp every ~380 seconds. Not horrible, but annoying.
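For reference, the figures in that table work out to 2**31 bytes (half the 32-bit sequence space) divided by the byte rate; a quick sanity check of the DS3 row:

# Time for the sequence numbers to become ambiguous at DS3 speed.
# The RFC 1185 table's values correspond to 2**31 bytes / byte rate.
bytes_per_second = 45e6 / 8            # 45 Mbps ~= 5.6 MB/s
t_wrap = 2 ** 31 / bytes_per_second
print(round(t_wrap))                   # ~382 seconds, i.e. the ~380 above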
Why would the security people want you to disable timestamps? What possible threat could a timestamp represent? I bet the NTP crew would be unhappy with this ;^)
Hmmmm, I read something about using TCP timestamps to guess the clock frequency of the sender? Maybe this is what they are scared of? I don't know ;^)
Timestamps are less important to RTT estimation than you would think. I happen to like them because they are useful in determining the RTT at the receiver or at a middlebox. However, according to the canon of TCP, only the sender needs such forbidden knowledge ;^)
The sender does not need timestamps to calculate the RTT. t1 = timestamp when I sent the packet, t2 = timestamp when I received the ACK. RTT = t2 - t1. Do a little smoothing on that and you are good to go!
...Daniel
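In the same spirit, the "little smoothing" the sender does is roughly the standard RFC 6298 estimator; a sketch of one update step (not any particular stack's implementation):

def update_rto(srtt, rttvar, rtt_sample, alpha=1/8, beta=1/4, g=0.1):
    """One step of smoothed RTT / RTT variance / RTO estimation.
    srtt, rttvar, rtt_sample are in seconds; alpha and beta are the
    standard gains from RFC 6298; g stands in for the clock granularity."""
    if srtt is None:                      # first measurement
        srtt = rtt_sample
        rttvar = rtt_sample / 2
    else:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    rto = max(1.0, srtt + max(g, 4 * rttvar))   # RFC 6298 lower bound of 1 s
    return srtt, rttvar, rto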

Lots of ports with little data, or one port with lots of data?

I've been checking out using a system called ROS (http://www.ros.org) for some work.
There are lots of different types of data that get sent between network nodes in ROS.
You define a struct of data that you want to send in a message, and ROS will handle opening a specific port between the two nodes that will only send that struct of data.
So if there are 5 different messages, there will be 5 different ports.
As opposed to this scenario, I have seen other platforms that just push all the different messages across one port. This means that there needs to be a sort of multiplexing/demultiplexing (done by some sort of message parsing on the receiver's end).
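For concreteness, the single-port approach typically just prefixes each message with a small type header and dispatches on it at the receiver; a minimal sketch (the message types and handlers are invented for illustration):

import struct

# Hypothetical message types sharing one port.
MSG_POSE, MSG_TEMP = 1, 2

def pack(msg_type, payload: bytes) -> bytes:
    # 1-byte type tag + 2-byte length header, then the payload.
    return struct.pack("!BH", msg_type, len(payload)) + payload

HANDLERS = {
    MSG_POSE: lambda p: print("pose", p),
    MSG_TEMP: lambda p: print("temperature", p),
}

def dispatch(datagram: bytes):
    msg_type, length = struct.unpack("!BH", datagram[:3])
    HANDLERS[msg_type](datagram[3:3 + length])

dispatch(pack(MSG_TEMP, b"21.5"))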
What I wonder is... which is better from a performance perspective?
Do operating systems switch based on ports quickly, so that a system like ROS doesn't have to do too much work to figure out what is in the message and interpret it?
OR
Is opening lots of ports going to mean lots of slower kernel calls, so that the overhead of many ports ends up outweighing the cost of having to work out and translate message types?
When this scales to a large amount of data at high rates and lots of different message types, there will be lots of ports. So I imagine that when scaling each of these topologies, performance will be a big factor in selecting the way to work.
I should also point out that these nodes usually exist on one small network, or most of the time on a single machine where the networking is used as a form of inter-process communication. So the transmission time is only a very small factor in the overall system timing.
ROS, being an architecture for robots, may have one node for every sensor and actuator, so depending on the complexity of your system we may be talking about 20-30 nodes pushing small-ish (100 bytes or so) messages at 10-100 Hz.
It depends. I do not know the specifics of ROS but in networking it comes down to the following constraints:
Distance: speed of light is fast but over a distance it starts making a difference
Protocol Overhead: connection oriented vs. connection-less
On the OS side, maintaining a list of free ports isn't much of an overhead - of course there is a cost to it, but everything is relative: if you are talking about a distributed system with long-distance links, then it is easy to argue that cycling through OS network ports ranks as a lower concern compared to managing communication quality.
Without a more specific question, I'll stop here.
I don't have any data on this, but it seems plausible that multiple ports might be handled more efficiently by multi-core systems, as opposed to demultiplexing within the program.
