I have created a pcap file of UDP packets using Scapy and I'm using tcpreplay to send them. I've run into two odd issues:
The number of packets sent per second does not match the --pps parameter. I'm not sure whether this is explained by http://tcpreplay.appneta.com/wiki/faq.html#why-doesnt-tcpreplay-send-traffic-as-fast-as-i-told-it-to
When I send fewer packets, e.g. --pps=10, the CPU load is higher than when I send more packets, e.g. --pps=200. I was expecting the opposite.
BTW, I'm using tcpreplay version 3.4.4
Issue 1: Many --pps issues have been fixed in the latest Tcpreplay version.
Issue 2: CPU utilization is improved in the latest version, but you can still expect to see 100% CPU being reported. In reality, CPU utilization is over-reported when using the -t or --mbps=0 options. In these cases Tcpreplay yields the sending thread whenever the TX buffers are full, which effectively makes Tcpreplay the scheduler for the CPU. The result is a reported 100% CPU, yet other processes on the CPU remain responsive.
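For intuition, here is a minimal C++ sketch. It is purely illustrative and not tcpreplay's actual code; the tx_full flag is a made-up stand-in for the driver's TX-buffer-full condition. It shows why a yielding busy-wait is reported as 100% CPU even though other processes stay responsive:

    #include <atomic>
    #include <chrono>
    #include <thread>

    std::atomic<bool> tx_full{true};   // simulated "TX buffer is full" flag

    int main() {
        // Pretend the NIC drains its buffer after 100 ms.
        std::thread nic([] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            tx_full = false;
        });
        // The sender spins, but hands its timeslice back on every iteration.
        // Task managers count this as ~100% CPU because the thread is always
        // runnable, yet other processes still get the CPU whenever they need it.
        while (tx_full) {
            std::this_thread::yield();
        }
        nic.join();   // buffer "drained"; real code would send the next burst here
        return 0;
    }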
Background:
I am using the RxAndroidBle library and have a requirement to connect to multiple devices at a time and start communicating as quickly as possible. I used RxBluetoothKit for iOS, and have started to use RxAndroidBle on my Pixel 2. That worked as expected: I could establish connections to 6-8 devices, as required, in a few hundred milliseconds. However, broadening my testing to phones such as the Samsung S8 and Nexus 6P, it seems that establishing a single connection can now take upwards of 5-6 seconds instead of 50-60 ms. I will assume for the moment that the disparity comes from vendor-specific BT implementations. Ultimately, this means that connecting to, e.g., 5 devices now takes 30 seconds instead of less than 1 second.
Question:
From what I understand of the documentation and other questions asked here, RxAndroidBle queues all scanning, connecting, and communication requests and executes them serially to be safe and maintain stability, given the variety of Bluetooth implementations in the Android ecosystem. However, is there currently a way to execute the requests (namely, connecting) in parallel to accept this risk and potentially cut my total time to establish multiple connections down to whichever device takes the longest to connect?
And a side question: are there any ideas on how to diagnose what could possibly be taking 5 seconds to establish a connection with a device? Or do we simply need to accept that some phones will take that long in some instances?
However, is there currently a way to execute the requests (namely, connecting) in parallel to accept this risk and potentially cut my total time to establish multiple connections down to whichever device takes the longest to connect?
Yes. You may try to establish connections using autoConnect=true, which would prevent locking the queue for longer than a few milliseconds. The last connection should be started with autoConnect=false to kick off a scan. Some stack implementations handle this quite well, but your mileage may vary.
And a side question: are there any ideas on how to diagnose what could possibly be taking 5 seconds to establish a connection with a device?
You can check the Bluetooth HCI snoop log. Also you may try using a BLE sniffer to check what is actually happening "on-air" (e.g. an nRF51 Development Kit).
Or do we simply need to accept that some phones will take that long in some instances?
This is also an option, since there is usually little one can do about connection time. In my experience, BLE stack/firmware implementations differ wildly from one another.
I accidentally wrote a while loop that would never break in a kernel and sent it to the GPU. After 30 seconds my screens started flickering; I realised what I had done and terminated the application by force. The problem is that I had to shut down the computer afterwards to make sure the kernels were gone. My questions are:
If I forcefully terminate the program (the program that launches the kernels) without it freeing the GPU resources (releasing buffers, queues, kernels, and destroying the CL context), will the kernels still run?
If they are still running, can I do anything to stop them, say, by releasing resources I no longer have a handle to?
If you are using an NVIDIA card, then by terminating the application you will eventually free the resources on the card to allow it to run again. This is because NVIDIA has a watchdog monitor on the device (which you can turn off).
If you are using an AMD card, you are out of luck AFAIK and will have to restart the machine after every crash.
I am new to OpenCL. Can the host CPU be used only for allocating memory to the device, or can it also be used as an OpenCL compute device? (Because after the allocation is done, the host CPU would otherwise sit idle.)
You can use a CPU as a compute device. OpenCL even allows multicore/multiprocessor systems to be split into separate compute sub-devices. I like to use this feature to divide the CPUs on my system into groups based on NUMA nodes. It is also possible to split a CPU into compute devices that all share the same level of cache memory (L1, L2, L3, or L4).
You need a platform that supports it, such as AMD's SDK. I know there are ways to have Nvidia and AMD platforms on the same machine, but I have never had to do so myself.
Also, the OpenCL event/callback system allows you to use your CPU as you normally would while the GPU kernels are executing. In this way, you can run OpenMP or any other code on the host while you wait for the GPU kernel to finish.
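As a rough sketch of the NUMA partitioning mentioned above (OpenCL 1.2 C API, compiled as C++; most error handling omitted; this assumes your runtime exposes a CPU device and supports device fission):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id cpu;
        clGetPlatformIDs(1, &platform, NULL);
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL) != CL_SUCCESS) {
            printf("no CPU device exposed by this platform\n");
            return 1;
        }

        // Split the CPU device along NUMA boundaries. Affinity domains such as
        // CL_DEVICE_AFFINITY_DOMAIN_L3_CACHE work the same way if you want to
        // group cores by a shared cache level instead.
        cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
            CL_DEVICE_AFFINITY_DOMAIN_NUMA, 0
        };
        cl_uint n = 0;
        clCreateSubDevices(cpu, props, 0, NULL, &n);   // ask how many sub-devices we would get
        cl_device_id sub[16];
        if (n > 16) n = 16;
        clCreateSubDevices(cpu, props, n, sub, NULL);
        printf("CPU split into %u sub-devices\n", n);
        return 0;
    }

Each sub-device can then be given its own context and command queue so that work stays on one NUMA node.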
There's no reason the CPU has to be idle, but it needs a separate job to do. Once you've submitted work to OpenCL you can:
Get on with something else, like preparing the next set of work, or performing calculations on something completely different.
Have the CPU set up as another compute device, and so submit a piece of work to it.
Personally, I tend to need the first case more often, as it's rare that I have two independent tasks that both lend themselves to the OpenCL style. The trick is keeping things balanced so you're not waiting a long time for the GPU task to finish, or leaving the GPU idle while the CPU gets on with other work.
It's the same problem OpenGL coders have had to conquer: avoiding being CPU- or GPU-bound, and balancing between the two for best performance.
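For completeness, a small sketch that lists the devices your runtime exposes, which is how you would discover the CPU as a second compute device (assumes an OpenCL SDK whose platform ships a CPU driver, such as the ones mentioned above):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint np = 0;
        clGetPlatformIDs(8, platforms, &np);
        for (cl_uint p = 0; p < np; ++p) {
            cl_device_id devs[16];
            cl_uint nd = 0;
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devs, &nd) != CL_SUCCESS)
                continue;
            for (cl_uint d = 0; d < nd; ++d) {
                char name[256];
                cl_device_type type;
                clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                clGetDeviceInfo(devs[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
                printf("%s device: %s\n",
                       (type & CL_DEVICE_TYPE_CPU) ? "CPU" : "GPU/other", name);
            }
        }
        return 0;
    }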
I was running a benchmark on CouchDB when I noticed that even with large bulk inserts, running a few of them in parallel is almost twice as fast. I also know that web browsers use a number of parallel connections to speed up page loading.
What is the reason multiple connections are faster than one? They go over the same wire, or even to localhost.
How do I determine the ideal number of parallel requests? Is there a rule of thumb, like "threadpool size = # cores + 1"?
The gating factor is not the wire itself, which, after all, runs pretty quickly (ignoring router delays), but the software overhead at each end. Each physical transfer has to be set up, the data sent and stored, and then completely handled before anything can go the other way. So each connection is effectively synchronous, no matter what it claims to be at the socket level: one socket operating asynchronously is still moving data back and forth in a synchronous way, because the software demands synchronicity.
A second connection can take advantage of the latency (the dead time on the wire) that arises while the software is doing its thing for the first connection. So, even though each connection is synchronous, multiple connections let things happen much faster; things seem (but of course only seem) to happen in parallel.
You might want to take a look at RFC 2616, the HTTP spec. It will tell you about the interchanges that happen to get an HTTP connection going.
I can't say anything about the optimal number of parallel requests; that is a matter between the browser and the server.
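To make the latency-overlap argument concrete, here is a small C++ sketch; the 50 ms round-trip is invented for the demo (it is not a CouchDB measurement), but it shows four "requests" issued on separate threads completing in roughly the time of one rather than four:

    #include <chrono>
    #include <cstdio>
    #include <future>
    #include <thread>
    #include <vector>

    int fake_request(int id) {
        // Stand-in for network + server time of one request.
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return id;
    }

    int main() {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::future<int>> inflight;
        for (int i = 0; i < 4; ++i)
            inflight.push_back(std::async(std::launch::async, fake_request, i));
        for (auto& f : inflight) f.get();   // wait for all of them
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("4 parallel requests took ~%lld ms\n", (long long)ms);
        return 0;
    }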
Each connection consumes its own thread. Each thread gets a quantum in which to consume CPU, network, and other resources, mainly CPU.
When you start parallel calls, the threads compete for CPU time and run things "at the same time".
This is a high-level overview. I suggest reading about asynchronous calls and thread programming to understand it better.
I have an application that needs as close to real-time processing as possible, and I keep running into an unusual delay issue with both TCP and UDP. The delay occurs like clockwork and it is always the same length of time (mostly 15 to 16 ms). It occurs when transmitting to any machine (even local) and on any network (we have two).
A quick rundown of the problem:
I am always using winsock in C++, compiled in VS 2008 Pro, but I have written several programs to send and receive in various ways using both TCP and UDP. I always use an intermediate program (running locally or remotely), written in various languages (MATLAB, C#, C++), to forward the information from one program to the other. Both winsock programs run on the same machine, so they display timestamps for Tx and Rx from the same clock. I keep seeing a pattern emerge where a burst of packets is transmitted and then there is a delay of around 15 to 16 milliseconds before the next burst, despite no delay being programmed in. Sometimes it may be 15 to 16 ms between each packet instead of between bursts of packets. Other times (rarely) I will see a delay of a different length, such as ~47 ms. I always seem to receive the packets back within a millisecond of them being transmitted, though, with the same pattern of delay exhibited between the transmitted bursts.
I have a suspicion that winsock or the NIC is buffering packets before each transmit but I haven't found any proof. I have a Gigabit connection to one network that gets various levels of traffic, but I also experience the same thing when running the intermediate program on a cluster that has a private network with no traffic (from users at least) and a 2 Gigabit connection. I will even experience this delay when running the intermediate program locally with the sending and receiving programs.
I figured out the problem this morning while rewriting the server in Java. The resolution of my Windows system clock is between 15 and 16 milliseconds. That means that packets showing the same millisecond as their transmit time were actually sent at different points within a 15-16 millisecond interval, but my timestamps only increment every 15 to 16 milliseconds, so they appear the same.
I came here to answer my question and I saw the response about raising the priority of my program. So I started all three programs, went into task manager, raised all three to "real time" priority (which no other process was at) and ran them. I got the same 15 to 16 millisecond intervals.
Thanks for the responses though.
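If you want to see the granularity directly, a tiny Win32 sketch like this (illustrative only) prints how much GetTickCount() jumps each time it changes; on a default Windows timer configuration the steps are typically about 15-16 ms:

    #include <windows.h>
    #include <stdio.h>

    int main() {
        DWORD last = GetTickCount();
        for (int changes = 0; changes < 5; ) {
            DWORD now = GetTickCount();
            if (now != last) {
                // Each step is the effective resolution of timestamps taken
                // from this clock, regardless of when the packets really left.
                printf("tick advanced by %lu ms\n", now - last);
                last = now;
                ++changes;
            }
        }
        return 0;
    }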
There is always buffering involved, and it varies between hardware, drivers, OS, etc. The packet schedulers also play a big role.
If you want "hard real-time" guarantees, you probably should stay away from Windows...
What you're probably seeing is a scheduler delay: your application is waiting for other processes to finish their timeslice and give up the CPU. Standard timeslices on multiprocessor Windows range from 15 ms to 180 ms.
You could try raising the priority of your application/thread.
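If you want to try that from code rather than Task Manager, here is a minimal Win32 sketch (REALTIME_PRIORITY_CLASS typically requires administrator rights, so HIGH_PRIORITY_CLASS is a safer first step):

    #include <windows.h>

    int main() {
        // Raise the whole process, then the current thread, as suggested above.
        SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
        // ... run the send/receive loop here ...
        return 0;
    }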
Oh yeah, I know what you mean. Windows and its buffers... try adjusting the values of SO_SNDBUF on the sender and SO_RCVBUF on the receiver side. Also, check the networking hardware involved (routers, switches, media gateways) and eliminate as many of them as possible between the machines to avoid latency.
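A minimal Winsock sketch of the buffer tuning suggested above (the 1 MiB sizes are arbitrary example values, not recommendations; check the return codes in real code):

    #include <winsock2.h>
    #pragma comment(lib, "ws2_32.lib")

    int main() {
        WSADATA wsa;
        WSAStartup(MAKEWORD(2, 2), &wsa);
        SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        int sndbuf = 1 << 20;   // example: 1 MiB send buffer on the sender
        int rcvbuf = 1 << 20;   // example: 1 MiB receive buffer on the receiver
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, (const char*)&sndbuf, sizeof(sndbuf));
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, (const char*)&rcvbuf, sizeof(rcvbuf));

        closesocket(s);
        WSACleanup();
        return 0;
    }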
I ran into the same problem.
In my case, I was using GetTickCount() to get the current system time; unfortunately, it always has a resolution of 15-16 ms.
When I used QueryPerformanceCounter instead of GetTickCount(), everything was fine.
In fact, the TCP socket receives data evenly, not in one batch every 15 ms.
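For reference, a minimal sketch of timestamping with QueryPerformanceCounter, which resolves well below a millisecond, unlike GetTickCount()'s ~15-16 ms granularity (the Sleep(1) is just a stand-in for "handle one packet"):

    #include <windows.h>
    #include <stdio.h>

    int main() {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);   // counts per second
        QueryPerformanceCounter(&t0);
        Sleep(1);                           // stand-in for handling one packet
        QueryPerformanceCounter(&t1);
        double us = (t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart;
        printf("elapsed: %.1f microseconds\n", us);
        return 0;
    }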