classification of user browsing activity using machine learning - tcp

If you record all IP traffic (using Wireshark or a similar program) while browsing the internet, you'll find many packets that were not sent as part of your browsing activity.
My question is:
If you wish to classify the packets (sent from your PC) into two groups:
1) packets sent as part of your browsing activity
2) all other packets
how would you use machine learning to solve this problem?
You can assume the packet payload can't be used for this purpose because it's either encapsulated or encrypted, so only packet headers can be used, e.g. TCP window size, TCP flag bits, packet length and packet direction.

Sounds like a binary classification problem.
There are three basic approaches you might use:
Collect packets you can manually label as "browsing activity" or "other" and train a binary classifier on them (an SVM, for example).
Collect only packets that are "browsing activity" and train a one-class classifier on them (a one-class SVM, for example).
Collect all the data you can and try to cluster it into two clusters; there is a (very small, unfortunately!) chance that the division found will be the one you are looking for.
In each of the above cases you will need to prepare a set of features to represent your data. So either use a fixed set of features, or try treating the packet header as raw text and training a text-based model on it, such as a convolutional neural network.
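As a rough illustration of the first approach with a fixed feature set, here is a minimal Python sketch. The header field names (`length`, `window`, `flags`, `outgoing`), the scaling constants and the toy training packets are all made up for the example, and a nearest-centroid rule stands in for the SVM so that the sketch needs only the standard library.

```python
from statistics import mean

def header_features(pkt):
    """Map a header dict (hypothetical field names) to a fixed-length numeric vector."""
    flags = pkt["flags"]
    return [
        pkt["length"] / 1500.0,           # packet length, scaled by a typical MTU
        pkt["window"] / 65535.0,          # TCP window size, scaled by the 16-bit maximum
        1.0 if "S" in flags else 0.0,     # SYN flag
        1.0 if "A" in flags else 0.0,     # ACK flag
        1.0 if "P" in flags else 0.0,     # PSH flag
        1.0 if pkt["outgoing"] else 0.0,  # packet direction
    ]

def fit_centroids(vectors, labels):
    """Compute the mean feature vector of each class."""
    centroids = {}
    for label in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, vector):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vector, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Toy, hand-labelled packets (invented values, for illustration only).
train = [
    ({"length": 1400, "window": 64000, "flags": "PA", "outgoing": False}, "browsing"),
    ({"length": 1200, "window": 65000, "flags": "A", "outgoing": False}, "browsing"),
    ({"length": 500, "window": 64000, "flags": "PA", "outgoing": True}, "browsing"),
    ({"length": 60, "window": 1024, "flags": "S", "outgoing": True}, "other"),
    ({"length": 80, "window": 2048, "flags": "A", "outgoing": True}, "other"),
    ({"length": 70, "window": 1024, "flags": "S", "outgoing": True}, "other"),
]
centroids = fit_centroids([header_features(p) for p, _ in train], [l for _, l in train])

new_packet = {"length": 1300, "window": 64500, "flags": "PA", "outgoing": False}
print(predict(centroids, header_features(new_packet)))
```

With real captures you would replace the toy list with labelled packets and most likely swap the centroid rule for a proper classifier; the feature-extraction step stays the same.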

Trading off between User Bandwidth and Download Interval

I am designing a non-commercial open source client app which needs to download exactly 100 KB of data from a server at a regular interval and show an alert in the client app based on changes in the data. Now I need to trade off user bandwidth against the download interval.
Analysis:
If I set the interval to 1 hour, the app will download 30*24*100KB = 72MB per month.
If I set the interval to 30 mins, the app will download 30*48*100KB = 144MB per month.
And so on.
So far I am considering only the file size, whereas in practice some bandwidth will be used for control flow in addition to the data flow. For downloading a file of exactly 100 KB from a server, how much control-flow overhead should I consider in my analysis for TCP communication? Is there any guideline, reference or research on that topic?
Assume 10KB is used for control flow per download: at the 30-minute interval, the total monthly usage would include 14.4MB of extra data, which needs to be identified in my analysis.
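The arithmetic above can be checked with a short Python sketch. The function name and parameters are made up for illustration; it assumes a 30-day month and decimal units (1 MB = 1000 KB), which is what the figures above imply.

```python
def monthly_usage_mb(interval_minutes, payload_kb=100, overhead_kb=0, days=30):
    """Monthly traffic in MB for one download every `interval_minutes`."""
    downloads = days * 24 * 60 // interval_minutes
    return downloads * (payload_kb + overhead_kb) / 1000  # decimal MB

print(monthly_usage_mb(60))                   # 1-hour interval, payload only
print(monthly_usage_mb(30))                   # 30-minute interval, payload only
print(monthly_usage_mb(30, overhead_kb=10))   # 30-minute interval with 10 KB overhead
```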
Note: (1) I am limited to analysing only the client app part. (2) No changes can be made on the server side at the moment (i.e. switching from pull-based to push-based, a partial data change API etc. cannot be applied). (3) I am limited to downloading the file using TCP. (4) Although this level of granularity is not often considered in practice, let's assume that in my case the analysis needs to be granular enough that I need to know the data vs control bandwidth ratio.
If you are asking only for the TCP/IP part, the payload/PDU ratio is 1460/1500 for IPv4 and 1440/1500 for IPv6, assuming an MTU of 1500 bytes (sources: this already mentioned discussion, this other discussion, this other article).
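To turn that ratio into bytes for one 100 KB download, here is a minimal Python sketch. It assumes 20-byte TCP and IPv4 headers and a 40-byte IPv6 header with no options, takes 100 KB as 100,000 bytes, and counts only the TCP/IP headers on the data segments themselves; the function name is invented for the example.

```python
import math

def header_overhead(payload_bytes, mtu=1500, ip_header=20, tcp_header=20):
    """Return (payload bytes per segment, segment count, total TCP/IP header bytes)."""
    mss = mtu - ip_header - tcp_header           # payload that fits in one packet
    segments = math.ceil(payload_bytes / mss)    # full segments plus one partial
    header_bytes = segments * (ip_header + tcp_header)
    return mss, segments, header_bytes

print(header_overhead(100_000))                  # IPv4
print(header_overhead(100_000, ip_header=40))    # IPv6
```

This is a lower bound only: it ignores link-layer framing, ACKs travelling in the other direction and any TCP options.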
I also found this really nice page that allows you to see all the header sizes for an arbitrary protocol stack and this academic paper.
However, besides the protocol headers, there are more effects that reduce the usable bandwidth:
TCP will send additional messages, e.g. for performing a handshake when establishing the connection,
Retransmission of data may occur,
Actual frame sizes are negotiated on the lower communication layers, so TCP segments might be smaller than assumed.
In summary, this is not easy to answer precisely, because there are influences in the transmission process that are beyond your control.
Have you considered measuring the actual amount of data needed to transmit one (or more) 100KB chunk(s) of payload rather than performing a theoretical analysis?

How does a computer know what data to reassemble?

When a computer X sends data through a network to computer Y, the data goes down through the OSI layers. This is OK, I understand that. But once the data is put on the medium as electric signals, how does computer Y know what to reassemble, given that the headers and trailers generated by the OSI layers no longer exist as such once the data is on the electric medium at layer 1?
The physical layer is just 1's and 0's, as you say. The trick is that there is a pattern that tells the receiver that this is the start of a packet. This is usually referred to as 'framing'.
Once the receiver knows that, it simply reads in as many bits as it needs for the Layer 2 header, then continues with the next header, and so on.
The headers are shown clearly in typical OSI or networking diagrams, e.g. (https://www.ciscopress.com/articles/article.asp?p=2738463).
So the way the first two layers work on the receiver is:
Layer 1 just recognises whether the signal is a one or a zero and creates the stream of ones and zeros.
Layer 2 reads this stream and, when it recognises the start pattern, it knows that the following bits are the header and so on, and hence it can identify the frames.
You can see examples of start and stop patterns online, e.g. (http://sinauonline.50webs.com/Cisco/Cisco%20Exploration%20Sem1Chap7.html).
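As a concrete illustration of the layer 2 step, here is a minimal Python sketch that scans a bit stream for a start pattern and then reads a fixed-size header field. It uses the Ethernet preamble (seven bytes of alternating bits) followed by the start frame delimiter `10101011` as the start pattern; the bits are kept as a text string for readability and the helper names are made up.

```python
PREAMBLE = "10101010" * 7   # Ethernet preamble: 56 alternating bits
SFD = "10101011"            # start frame delimiter

def find_frame_start(bits):
    """Return the index of the first bit after the start pattern, or -1 if absent."""
    marker = PREAMBLE + SFD
    pos = bits.find(marker)
    return -1 if pos < 0 else pos + len(marker)

def read_field(bits, start, n_bits):
    """Read n_bits starting at `start`; return the field and the next position."""
    return bits[start:start + n_bits], start + n_bits

# Toy stream: some line noise, the start pattern, then a 48-bit destination address.
dst = "1" * 8 + "0" * 40
stream = "0010" + PREAMBLE + SFD + dst + "0101"

start = find_frame_start(stream)
field, nxt = read_field(stream, start, 48)
print(start, field == dst)
```

The same `read_field` call would be repeated for each following header field, which is the "and so on" in the description above.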

How to define topology in Castalia-3.2 for WBAN

How can I define a topology in Castalia-3.2 for WBAN?
How can I import a topology from OMNeT++ into Castalia?
Where is the topology defined in the default WBAN scenario in Castalia?
Regards,
thanks
The topology of a network is an abstraction that shows the structure of the communication links in the network. It's an abstraction because the notion of a link is an abstraction itself. There are no "real" links in a wireless network. The communication happens in a broadcast medium and there are many parameters that dictate whether a packet is received or not, such as the power of transmission, the path loss between transmitter and receiver, noise and interference, and also just luck. Still, the notion of a link can be useful in some circumstances, and some simulators use it to define simulation scenarios. You might be used to simulators in which you can draw nodes and then simply draw lines between them to define their links. This is not how Castalia models a network.
Castalia does not model links between the nodes; it models the channel and radios to get a more realistic communication behaviour.
Topology is often confused with deployment (I confuse them myself sometimes). Deployment is just the placement of nodes on the field. There are multiple ways to define deployment in Castalia, if you wish, but it is not needed in all scenarios (more on this later). People can confuse deployment with topology, because under very simplistic assumptions certain deployments lead to certain topologies. Castalia does not make these assumptions. Study the manual (especially chapter 4) to get a better understanding of Castalia's modeling.
After you have understood the modeling in Castalia, and you still want a specific/custom topology for some reason then you could play with some parameters to achieve your topology at least in a statistical sense. Assuming all nodes use the same radios and the same transmission power, then the path loss between nodes becomes a defining factor of the "quality" of the link between the nodes. In Castalia, you can define the path losses for each and every pair of nodes, using a pathloss map file.
SN.wirelessChannel.pathLossMapFile = "../Parameters/WirelessChannel/BANmodels/pathLossMap.txt"
This tells Castalia to use the specific path losses found in the file instead of computing path losses based on a wireless channel model. The deployment does not matter in this case. At least it does not matter for communication purposes (it might matter for other aspects of the simulation, for example if we are sampling a physical process that depends on location).
In our own simulations with BAN, we have defined a pathloss map based on experimental data, because other available models are not very accurate for BAN. For example, the lognormal shadowing model, which is Castalia's default, is not a good fit for BAN simulations. We did not want to enforce a specific topology, we just wanted a realistic channel model, and defining a pathloss map based on experimental data was the best way.
I have the impression though that when you say topology, you are not only referring to which nodes could communicate with which nodes, but which nodes do communicate with which nodes. This is also a matter of the layers above the radio (MAC and routing). For example it's the MAC and Routing that allow for relay nodes or not.
Note that in Castalia's current implementations of 802.15.6MAC and 802.15.4MAC, relay nodes are not allowed. So you cannot create a mesh topology with these default implementations. Only a star topology is supported. If you want something more, you'll have to implement it yourself.

Predicting/calculating congestion in telecom network

I have an application installed on my phone which provides the following details every minute: bandwidth, packet loss, signal strength, and RTT to google.com.
I am trying to predict congestion based on these 4 attributes, but somehow it doesn't look accurate to me. Previously I used only bandwidth.
I want to predict congestion at any point more accurately and would appreciate any recommendations.
I think you are saying you are trying to measure network 'responsiveness', and from these measurements get a sense of how congested the network is. You also mention you want to predict which I guess means you want to make an estimate of the future 'responsiveness' based on your measurements and observations.
The items you are measuring look sensible, although you may want to include jitter if you are interested in VoIP or other real time streamed media.
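If you do add jitter, it can be derived from the RTT samples you already collect. The Python sketch below uses the mean absolute difference between consecutive RTTs as a simple jitter measure, and an exponentially weighted moving average to smooth the noisy per-minute samples. Both function names, the `alpha` value and the sample numbers are assumptions for illustration, not part of any particular measurement app.

```python
def jitter_ms(rtts):
    """Mean absolute difference between consecutive RTT samples (a simple jitter measure)."""
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

def ewma(samples, alpha=0.2):
    """Exponentially weighted moving average; higher alpha reacts faster to new samples."""
    smoothed = []
    current = None
    for s in samples:
        current = s if current is None else alpha * s + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

rtts = [50, 60, 55, 80]   # made-up RTT samples in milliseconds
print(jitter_ms(rtts))
print(ewma(rtts))
```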
The issue you have is that there are many variables which can affect your measurements, for example:
congestion in the radio cell you are in at the time
congestion in the backhaul network
delays in the server you are using to measure the RTT
congestion or faults with the particular APN your mobile is using to access data services
network faults
As some of these occur irregularly but can have a large impact, it is quite hard to build up an accurate view of the overall network 'responsiveness' with a single handset. For example, your local cell may be busy or have a problem while other users of Google.com in other cells get perfectly good response, or Google.com may be busy or delayed while other users in your cell accessing a different server get perfectly good response.
It would likely be useful for you to look at some of the generally available web speedtest applications to see the type of information they provide - they have the advantage of being able to gather results from many thousands of users, and also generally have access to the servers to understand any issues on that side.
Depending on what you are trying to achieve it might be that a combination of measurements from one of the general speedtest services, combined with your own measurements will give you enough data to draw some sort of meaningful conclusions.

64/66b encoding

There are a few things I don't understand about 64/66bit encoding, and failed to find the answers to on the web. Any help/links would be greatly appreciated:
i) how is the start of a frame recognised? I don't think it can be by the initial 10/01 bits called the preamble on Wikipedia, because you cannot tell them apart (if an idle link is 0, then 0000 10 and 000 01 0 look rather similar). I expect the end of a frame is indicated by a control word, with the rest of the bits perhaps used for the CRC?
ii) how do the scramblers synchronise, and how do they avoid scrambling the same packet the same way? Or to put this another way, why is it not possible for a malicious user to induce substantial packet loss by carefully choosing a bad message?
iii) this might have been answered in ii), but if a packet is sent to a switch, and then onto another host, is it scrambled the same way both times?
Once again, many thanks in advance
Layers
First of all, the OSI model needs to be clear.
The Ethernet frame belongs to the data link layer, while the 64b/66b encoding is part of the physical layer (more precisely, the PCS of the physical layer).
The physical layer doesn't know anything about the start of a frame. It sees only data. (The start of an Ethernet frame is a run of data bytes which contain the preamble.)
64b/66b encoding
Now let's assume that the link is up and running.
In this case the idle link is not full of '0's. (If it were, the link wouldn't be self-synchronous.) Idle messages (idle characters and/or synchronization blocks, i.e. control information) are sent over the idle link. (The control information is encoded with the 0b10 preamble.) This is why the emitted spectrum and power dissipation don't depend on whether the link is idle or not.
So the start of a new frame works as follows:
The link sends idle information (with the 0b10 preamble).
The upper layer (data link layer) sends the frame (in 64-bit chunks of data) to the physical layer.
The physical layer sends the data (with the 0b01 preamble) over the link.
(Note that the physical layer frequently inserts control (sync) symbols into the raw frame even during a data burst.)
Synchronization
Before data transmission, a 64b/66b encoded lane must be initialized. This includes lane initialization, which performs the block synchronization. Xilinx's Aurora specification (p. 34) is an example of link initialization.
Briefly, the receiver tries to match the sync header at different bit positions, and when it matches multiple times in a row it reports link-up.
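The matching step can be sketched in Python as below. It is a simplification under stated assumptions: a valid sync header is taken to be the two-bit preamble 01 or 10 at the start of every 66-bit block, the 64 payload bits are modelled as random (as they would look after scrambling), and the receiver simply picks the bit offset with the most valid headers instead of running a real lock state machine. The function names and the test stream are invented for the example.

```python
import random

BLOCK = 66  # 2-bit sync header + 64 payload bits

def make_stream(n_blocks, offset, rng):
    """Build a toy bit stream: `offset` junk bits, then n_blocks 66-bit blocks."""
    bits = [rng.randint(0, 1) for _ in range(offset)]
    for _ in range(n_blocks):
        header = rng.choice(([0, 1], [1, 0]))   # data or control preamble
        bits += header + [rng.randint(0, 1) for _ in range(64)]
    return bits

def find_block_offset(bits):
    """Try all 66 alignments and return the one with the most valid sync headers."""
    best_offset, best_hits = 0, -1
    for offset in range(BLOCK):
        starts = range(offset, len(bits) - 1, BLOCK)
        # A header is valid when its two bits differ (01 or 10).
        hits = sum(bits[i] != bits[i + 1] for i in starts)
        if hits > best_hits:
            best_offset, best_hits = offset, hits
    return best_offset

rng = random.Random(1)
stream = make_stream(200, 17, rng)
print(find_block_offset(stream))
```

At a wrong alignment the two inspected bits are essentially random, so only about half of the candidate headers look valid; at the right alignment every header is valid, which is what makes the lock reliable after enough blocks.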
Note that the 64b/66b encoding uses a self-synchronous scrambler. This is why the scrambler itself doesn't need to know anything about where we are in the data stream. If you run a self-synchronous (de-)scrambler long enough, it produces the decoded bit stream.
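The self-synchronizing behaviour can be seen in a small Python sketch. It assumes the scrambler polynomial x^58 + x^39 + 1, which is the one IEEE 802.3 specifies for 64b/66b (the polynomial is not stated in this answer), and it works on a plain list of payload bits, ignoring the sync headers. The descrambler is deliberately started in the wrong state.

```python
import random

STATE_BITS = 58  # register length for x^58 + x^39 + 1

def scramble(bits, state):
    """Scramble bits; `state` holds the last 58 transmitted bits, newest first."""
    state = list(state)
    out = []
    for b in bits:
        s = b ^ state[38] ^ state[57]   # taps 39 and 58 bits back
        out.append(s)
        state = [s] + state[:-1]        # shift the transmitted bit in
    return out

def descramble(bits, state):
    """Descramble bits; `state` holds the last 58 received bits, newest first."""
    state = list(state)
    out = []
    for s in bits:
        out.append(s ^ state[38] ^ state[57])
        state = [s] + state[:-1]        # shift the received bit in
    return out

rng = random.Random(0)
data = [rng.randint(0, 1) for _ in range(500)]
tx_state = [1] * STATE_BITS   # arbitrary sender state
rx_state = [0] * STATE_BITS   # receiver starts out of sync on purpose
recovered = descramble(scramble(data, tx_state), rx_state)

# The first 58 bits may be wrong; after that the receiver state consists only
# of received bits, so it matches the sender and the output is correct.
print(recovered[STATE_BITS:] == data[STATE_BITS:])
```

Because the descrambler state is built purely from bits it has received, no seed exchange is needed: once 58 bits have passed through, both sides hold the same register contents.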
Maliciousness
Note that 64b/66b encoding is not encryption. This scrambling won't protect you from eavesdropping or tampering. (Encryption should be placed at a higher level of the OSI model.)
Same packet multiple times
Because the scrambler is in a different state/seed when you send the same packet a second time, the two encoded packets will differ. (Theoretically we could create packets that reset the scrambler's shift register, but we would need to take the control symbols into account, so in practice this is impossible.)

Resources