Detecting suspicious/bot IP addresses in a big access log (~30 GB) - r

I have a big access log (~30 GB) and I'm looking for ways to find suspicious/bot IP addresses. Of course, we could also replace IP with (IP + User_Agent). So my questions are:
find the average number of requests made from a single IP
find IP addresses that make more requests than average (see the previous point)
find IP addresses that make requests regularly (every hour, for example) throughout the day
your recommendations on how to detect bots
This log is rather big and I don't think R alone could process it. Should I use some kind of storage behind R (Hadoop or something similar)? I have absolutely no experience in processing/analyzing big data, so any ideas, recommendations & tutorials/articles are appreciated.

The access log probably contains a lot of data you may not need for this question. If you only care about the time of the request and the originating IP, you can easily reduce the data size by extracting those "columns" from the input before reading it into R; standard command-line tools such as cut or awk should do the trick.
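For illustration, here is a minimal sketch of that reduction-plus-aggregation step in plain Python (awk or R's data.table would do equally well). It assumes a common/combined log format in which the client IP is the first whitespace-separated field; it streams the file once and answers the first two bullet points of the question: the average number of requests per IP and the IPs above that average.

    from collections import Counter

    LOG_PATH = "access.log"              # assumption: combined log format, IP is the first field

    requests_per_ip = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            parts = line.split(maxsplit=1)
            if parts:                    # skip blank or garbled lines
                requests_per_ip[parts[0]] += 1

    # Average number of requests made from a single IP
    average = sum(requests_per_ip.values()) / len(requests_per_ip)

    # IPs that make more requests than the average, busiest first
    suspicious = {ip: n for ip, n in requests_per_ip.items() if n > average}
    for ip, n in sorted(suspicious.items(), key=lambda kv: kv[1], reverse=True)[:20]:
        print(ip, n)

Memory use is proportional to the number of distinct IPs, not to the 30 GB of log, so this is feasible on a single machine.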
If you want to keep more details, another option is to load the access log into a database and use that for further processing. 30 GB is not a lot for a database, but of course this means some additional work: designing a database schema and a way to load the data into the database.
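If you take the database route, here is a hedged sketch of what that schema and loader could look like, using SQLite purely as an example; the regular expression assumes the Apache/nginx combined log format and the file names are placeholders.

    import re
    import sqlite3

    # Assumed combined-format line:
    # IP ident user [timestamp] "request" status size "referer" "user_agent"
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) \S+ "[^"]*" "([^"]*)"')

    conn = sqlite3.connect("access_log.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS requests (
                        ip TEXT, ts TEXT, request TEXT, status INTEGER, user_agent TEXT)""")

    def rows(path):
        with open(path, encoding="utf-8", errors="replace") as log:
            for line in log:
                m = LINE_RE.match(line)
                if m:
                    yield m.groups()

    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?)", rows("access.log"))
    conn.commit()

    # Example query: requests per IP (or per IP + user agent), busiest first
    for ip, n in conn.execute(
            "SELECT ip, COUNT(*) FROM requests GROUP BY ip ORDER BY 2 DESC LIMIT 20"):
        print(ip, n)

From here you can run the per-IP aggregations in SQL and pull only the small result sets into R.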

You can also do the following types of analysis:
Getting the geolocation of IP addresses and comparing access frequency based on geolocation + the local time at that geolocation (the access frequency could be normal during the day at that location but not after midnight); a sketch of the time-based part follows this list.
If you have username information, check whether multiple IP addresses use the same username during the same time period.
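A hedged sketch of the time-based part of that idea (which also covers the "regular requests every hour" point from the question): bucket each IP's requests by hour of day and flag IPs that are active in nearly every hour, which human visitors rarely are. The timestamp position and format are assumptions based on the common log format, and the per-geolocation local-time adjustment is left out.

    from collections import defaultdict
    from datetime import datetime

    # assumption: common/combined log format, e.g.
    # 203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 ...
    hours_seen = defaultdict(set)

    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            try:
                ip = line.split(maxsplit=1)[0]
                ts = line.split("[", 1)[1].split("]", 1)[0]      # 10/Oct/2023:13:55:36 +0000
                hour = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S").hour
            except (IndexError, ValueError):
                continue                                          # skip malformed lines
            hours_seen[ip].add(hour)

    # IPs active in 20 or more distinct hours of the day look automated
    for ip, hours in hours_seen.items():
        if len(hours) >= 20:
            print(ip, sorted(hours))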
WSO2 has done some anomaly detection work using their Analytics Platform, which is pretty scalable for most anomaly detection scenarios. Check it out - http://wso2.com/analytics/solutions/fraud-and-anomaly-detection-solution/
This might be a better option than going through R, since it allows you to do complex event processing (through SQL-like queries) as well as machine learning.

You can also do the following type of analysis:
a) If the IP address is from a data center range, it is more likely to be a bot than a normal user (a sketch of this check follows below).
b) If the IP address is from a search engine range, it is highly likely to be a search engine bot.
You can get a geolocation database from IP2Location that includes usage-type information for detecting data center and search engine ranges.
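A sketch of check (a), assuming you have already exported a list of data-center CIDR ranges from such a database (IP2Location's usage-type field, or any other source); the ranges below are documentation placeholders, not real data-center ranges.

    import ipaddress

    # Placeholder ranges - replace with the data-center / search-engine ranges
    # exported from your geolocation database.
    DATACENTER_RANGES = [ipaddress.ip_network(cidr) for cidr in (
        "192.0.2.0/24",
        "198.51.100.0/24",
    )]

    def is_datacenter_ip(ip_string: str) -> bool:
        ip = ipaddress.ip_address(ip_string)
        return any(ip in net for net in DATACENTER_RANGES)

    print(is_datacenter_ip("192.0.2.15"))    # True with the placeholder ranges
    print(is_datacenter_ip("203.0.113.9"))   # False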

Check goaccess.io - it works for me, with logs for different websites distributed across several servers. It allows usage of GeoIP and identifies bots out of the box.

Check out https://ipdetective.io - it tracks IP addresses that originate from data centers, VPNs, proxies, Tor nodes and botnets. It offers a free API as well, so you can test it out.

Related

City data in Application Insights

I have multiple applications making use of Application Insights for Production Data. I'm trying to use the City telemetry field to map our current users. This data appears to be tracked very inconsistently and in most cases (> 75%) is just unavailable.
I understand some customers will be using VPNs which could affect the results, but not to the extent I'm seeing.
Here is the info from the Azure FAQ:
How are City, Country and other geo location data calculated? We look up the IP address (IPv4 or IPv6) of the web client using GeoLite2.
Browser telemetry: We collect the sender's IP address.
Server telemetry: The Application Insights module collects the client IP address. It is not collected if X-Forwarded-For is set.
You can configure the ClientIpHeaderTelemetryInitializer to take the IP address from a different header. In some systems, for example, it is moved by a proxy, load balancer, or CDN to X-Originating-IP.
Does anyone know how to improve geolocating user cities for App Insights?
IP geolocation is not 100% accurate, and you have to live with that. City-level accuracy is quite low because the information is inferred from multiple signals that change frequently. One way to improve accuracy is to use a service that aggregates data from multiple sources and does so continuously, multiple times a day.
A second way to improve the results is to filter out IPs associated with proxies, using threat data.
For both purposes, I recommend looking at Ipregistry, a service I work for:
https://api.ipregistry.co/?key=tryout
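For example, a minimal lookup against the endpoint above using only Python's standard library; the tryout key is the public demo key from the link, and the exact shape of the JSON response (city, security/proxy fields) should be inspected rather than assumed.

    import json
    import urllib.request

    # Looks up the calling machine's own IP with the public "tryout" key.
    # Appending an IP to the path is assumed to look up that address instead.
    with urllib.request.urlopen("https://api.ipregistry.co/?key=tryout") as resp:
        data = json.load(resp)

    print(json.dumps(data, indent=2))   # inspect the location and security fields returned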
It would be great if MSFT could provide an example of manually setting the location in browser telemetry. I understand the privacy concerns, but our use case is internal enterprise apps used by our field service teams. Since browsers can access the Geolocation APIs, it's probably straightforward to add that info; it's just a matter of knowing the right way to do it so that it's picked up consistently.

Network design: one-to-many high-bandwidth transmission that excludes unwanted users

If Stack Overflow is the wrong Stack Exchange site for this question, please help direct me to the correct one.
Short Version
What is the best design for a networking application in which one user transmits a constant, high-bandwidth stream of data to many other addresses? The solution must not require the uploader to duplicate the packets for each recipient, and preferably it will not transmit to users that have not been accepted by the transmitter.
Long Version
A friend and I have written an application that enables someone to transmit data in real time to one or more recipients whom he wants to receive the data. I designed the high-level application protocol to use UDP and to encode the data so that any individual packet can be lost without hurting the use of the rest. This solution requires managing a socket for each user and sending each packet to every user.
The problem is that the stream can require very high bandwidth. The user can adjust how high the quality of the data he is sending should be, and can end up sending 6 Mbps to each user. It is not feasible to expect a user to pay his ISP enough to be allowed to upload such a stream to the preferred minimum of four other users at a time.
We need a way for the transmitter to send a packet exactly once and have each user receive a copy.
We have looked at multicasting. It may be what we need to use in the end, but we are concerned about the fact that anyone can join any group. It would be preferable that users we do not want to see the data are not able to join at all. There is also the problem that if multiple transmitters happen to use the same group, viewers may find they are receiving multiple streams' worth of data when they want only one.
My searching has revealed something IBM published over a decade ago called Explicit Multicast (Xcast) that looks perfect, but I have yet to find any information to determine whether this technology is commonly supported. Also, I have not yet seen whether it supports datagrams.
Does anyone know the best way to design an application that meets our needs?
Please keep in mind that we have no funds to support our project. Solutions need to be free.
Edit
In the summary above, I hinted at but failed to state explicitly that this is a real-time application. The motivating drive behind the application is to keep the clients/recipients as close together in time as possible. If packets are lost or arrive too late to be used in keeping the server and clients in phase, they need to be disregarded. That is why I designed the application protocol on top of UDP with independent data in each packet. Even if a client receives only one packet out of 300 for a given time step, it will use what it did get.
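To make the "disregard late packets" rule concrete, here is a small sketch (not the actual application code) of the kind of receive loop being described: each datagram is assumed to carry a send-time header, and anything older than a chosen playback budget is dropped. It also assumes sender and receiver clocks are roughly synchronized.

    import socket
    import struct
    import time

    MAX_AGE = 0.050          # example budget in seconds; packets older than this are useless

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 5004))

    while True:
        packet, _addr = sock.recvfrom(2048)
        # assumed header: 8-byte big-endian double holding the sender's send time
        send_time = struct.unpack("!d", packet[:8])[0]
        if time.time() - send_time > MAX_AGE:
            continue                     # too late to keep clients in phase - drop it
        payload = packet[8:]
        # ... decode the payload and use it for the current time step ...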
I think that I_am_Helpful's recommendation may be a good step in the right direction (or possibly the destination). I need to do some experimentation to determine whether using a system like Spread will work. However, I do not think I can budget more than an additional 17 ms of transmission time.
If you can think of a system that enables sending unreliable datagrams to a specific group of users (like Spread) for a real-time application (unlike Spread, see p. 3), please let me know about it.
We need a way for the transmitter to send a packet exactly once and have each user receive a copy.
To the best of my limited knowledge, I would say that reliable multicasting appears to be one of the viable options for broadcasting to the group. I would like to mention some possible Java APIs* which could help you achieve this:
JGroups Java API
The Spread Toolkit -> Spread consists of a library that user applications are linked with, a binary daemon which runs on each computer that is part of the processor group, and various utility and demonstration programs.
Appia
*NOTE: I have never worked with these APIs.
It would be preferable that users we do not want to see the data are not able to join at all.
They do provide this feature; e.g., Spread supports thousands of groups with different sets of members. It also provides a range of reliability, ordering and stability guarantees for messages. JGroups can be used to create groups of processes whose members can send messages to each other. It also has facilities like group creation and deletion (group members can be spread across LANs or WANs).
There is also the problem that if multiple transmitters happen to use the same group, viewers may find they are receiving multiple streams' worth of data when they want only one.
Since you can easily create multiple groups in the same network (using Spread, etc.), I believe that would no longer be an issue. It is your responsibility to sort users into different groups.
I hope the given information helps. Good luck.
Via multicast you achieve exactly what you want: sending each packet once. But authentication seems to be a concern for you.
One possible solution is symmetric cryptography, where the same key is used to encrypt and decrypt. Via TCP, your clients connect to a server and fetch the multicast IP address of the transmission and its associated key; then they join the multicast group and decrypt the incoming transmission.
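A sketch of the receiving side of that idea, assuming the client has already fetched the group address, port and key over TCP. It joins the multicast group with the standard socket options and decrypts each datagram with a shared symmetric key; Fernet from the cryptography package is used here purely as an example cipher, and the group/port/key values are placeholders.

    import socket
    import struct

    from cryptography.fernet import Fernet     # pip install cryptography

    GROUP, PORT = "239.1.2.3", 5004             # placeholders - fetched from the server over TCP
    SHARED_KEY = Fernet.generate_key()          # placeholder - the real key is fetched over TCP

    fernet = Fernet(SHARED_KEY)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Join the multicast group on all interfaces.
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        ciphertext, _addr = sock.recvfrom(65535)
        plaintext = fernet.decrypt(ciphertext)  # only clients holding the key can read the stream
        # ... hand plaintext to the real-time decoder ...

Note that anyone can still join the group and receive the ciphertext; the key only controls who can make sense of it.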
If you accept a more flexible solution, you could have a server that sends the transmission in real time to a set of distributed servers. Your clients connect to one of these distributed servers via unicast, and after authentication they are included in a list of receivers. Each distributed server then sends each new transmission packet to every registered client via UDP. In ordinary situations your clients would have the same experience as if the stream were delivered via a multicast group, but the servers will spend far more bandwidth. Multiple simultaneous transmissions would be allowed, so this could suit you, and you get more control, since clients can send signals to the servers, such as PAUSE, etc.
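And a sketch of that relay idea: each distributed server receives the stream on one UDP socket and forwards every packet to its list of authenticated clients. Authentication and registration are omitted and the addresses are placeholders.

    import socket

    LISTEN_PORT = 6000                          # where the origin server sends the stream
    clients = [("198.51.100.10", 7000),         # placeholder addresses, added after authentication
               ("198.51.100.11", 7000)]

    ingress = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    ingress.bind(("", LISTEN_PORT))

    egress = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    while True:
        packet, _source = ingress.recvfrom(65535)
        for addr in clients:                    # the relay, not the uploader, pays for the fan-out
            egress.sendto(packet, addr)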

Tracking a dynamic IP address?

Is it possible for someone to track a dynamic IP address? If so, what would it take and how would it manifest?
Would the person doing so be able to log every change in your IP range and eventually end up with the whole set of IPs you are able to have?
Is it possible to make my dynamic IP change in a different pattern, say in a more extreme way, making it harder for someone to trace it as described above? Is it possible to encrypt it somehow, along with all other information such as hardware MACs / Inet MAC etc. - everything?
The answer is yes and no.
In most cases only your service provider (and law enforcement) will have a log of all the IPs you had and the start/end times of each lease. You basically can't do anything to prevent this, because they need to be able to identify you as a customer with a valid contract. This is usually done via the MAC address of the CPE equipment you get from the service provider, or via login credentials (for PPPoE, for example). There is no such thing as encrypting the IP, and changing your MAC address would not prevent the service provider from identifying you. For anyone else there is no reliable way to track you; the closest thing they can find is the scope (or scopes) from which the dynamic IP addresses are issued.
On the other hand, when you mix technology and psychology, every one of us leaves a unique fingerprint when browsing the web. If you examine the combination of software someone uses, their traffic patterns (amount of traffic, sites they visit, activity during the day), their behavior and style of writing, etc., you can not only link them to some IP address but also distinguish between different users behind the same IP address. Still, collecting this data is really hard, which makes it improbable, especially for ordinary internet users.

How do I configure OpenSplice DDS for 100,000 nodes?

What is the right approach to use to configure OpenSplice DDS to support 100,000 or more nodes?
Can I use a hierarchical naming scheme for partition names, so "headquarters.city.location_guid_xxx" would prevent packets from leaving a location, and "company.city*" would allow samples to align across a city, and so on? Or would all the nodes know about all these partitions just in case they wanted to publish to them?
The durability services will choose a master when they come up. If one durability service is running on a Raspberry Pi at a remote location over a 3G link, what is to prevent it from trying to become the master for "headquarters" and crashing?
I am experimenting with durability settings such that a remote node uses "location_guid_xxx" but the "headquarters" cloud server uses a "Headquarters" scope.
On the remote client I might do this:
<Merge scope="Headquarters" type="Ignore"/>
<Merge scope="location_guid_xxx" type="Merge"/>
so a location won't become master for the universe - but can a durability service within a location still be master for that location?
If I have 100,000 locations, does this mean I have to list all of them in the "Merge scope" entries in the ospl.xml file at headquarters? I would think this alone might limit the size of the network I can handle.
I am assuming that this product will handle this sort of Internet of Things scenario. Has anyone else tried it?
Considering the scale of your system, I think you should seriously consider using Vortex Cloud (see these slides: http://slidesha.re/1qMVPrq). Vortex Cloud will allow you to scale your system better as well as deal with NATs/firewalls. Besides that, you'll be able to use TCP/IP to communicate from your Raspberry Pi to the cloud instance, thus avoiding any problems related to NATs/firewalls.
Before getting to your durability question, there is something else I'd like to point out. If you try to build a flat system with 100K nodes, you'll generate quite a bit of discovery information. Besides generating some traffic, this will take up memory in your end applications. If you use Vortex Cloud instead, we play tricks to limit the discovery information. To give you an example, if you have a data writer matching 100K data readers, when using Vortex Cloud the data writer would only match a single endpoint, thus reducing the discovery information by a factor of 100K!
Finally, concerning your durability question, you can configure some durability services as alignee only. In that case they will never become master.
HTH.
A+

Getting current transferred MPI network communication volume

I have a question related to MPI.
In order to keep track of the communication volume used by my implementation, I would like to get the amount of data transferred so far, from the MPI process's start until the current measuring point.
I checked the specification as well as the mpi.h header file of MPICH and did not find a matching function to call or a variable that keeps track of the network transfer costs. It would, of course, be possible to implement a small traffic registry or define a macro for tracking communication sizes, but maybe it can be read out from somewhere.
Do you know a method to obtain the current transfer size? Maybe it is also possible to get this number through a system call that returns the network traffic of the process?
Is it maybe possible to access the proc information of the current process - is /proc/net maintained per process as well, such as /proc/self/net?
Thank you in advance,
Martin
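The MPI standard does not define a call that returns a per-process byte counter, so the "small traffic registry" mentioned in the question is the usual approach; in C this is typically done by intercepting the relevant calls through the PMPI profiling interface. Since the question has no accompanying source, here is only a hedged sketch of the registry idea using mpi4py, counting outgoing bytes around each send; a real version would also wrap Isend, the collectives, and so on.

    from mpi4py import MPI
    import numpy as np

    class CountingComm:
        """Thin wrapper around an MPI communicator that tallies outgoing bytes (sketch only)."""

        def __init__(self, comm):
            self.comm = comm
            self.bytes_sent = 0

        def Send(self, buf, dest, tag=0):
            self.bytes_sent += buf.nbytes       # buf is assumed to be a NumPy array here
            self.comm.Send(buf, dest=dest, tag=tag)

        def Recv(self, buf, source, tag=0):
            self.comm.Recv(buf, source=source, tag=tag)

    comm = CountingComm(MPI.COMM_WORLD)
    rank = MPI.COMM_WORLD.Get_rank()

    data = np.arange(1000, dtype="d")
    if rank == 0 and MPI.COMM_WORLD.Get_size() > 1:
        comm.Send(data, dest=1)
    elif rank == 1:
        comm.Recv(data, source=0)

    print(f"rank {rank}: {comm.bytes_sent} bytes sent so far")   # the current measure point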
