m/m/1 Queue Examples - networking

I am having a hard time working with an M/M/1 queue (common queue architecture). I understand that
(lambda)^2/(mu*(mu-lambda)) = the average number of customers waiting in line
The part I am struggling with is that my queue is limited to 3 waiting clients; anything that arrives after that gets dropped. So how do I find the average number of customers waiting in line now?

Logically, limiting the queue makes certain queue states (i.e. > n) impossible. Thus the probabilities of being in the states ≤ n sum to 1.0.
Doing a simple Google search for "mm1 with limited queue size" the first result is a PDF that answers your question. The paper actually gives usable formulas.
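As a rough illustration of that idea, here is a minimal Python sketch. It assumes the standard finite-capacity M/M/1 result that the steady-state probability of n customers in the system is proportional to (lambda/mu)^n, and it assumes that "3 clients waiting" means 3 waiting slots plus 1 customer in service. The function name and parameters are hypothetical, and this is not taken from the PDF mentioned above.

```python
def mm1k_metrics(lam, mu, waiting_slots):
    """Steady-state metrics for an M/M/1 queue with a finite waiting room.

    Assumption: state n (customers in the system) has probability
    proportional to (lam / mu) ** n, truncated at the system capacity.
    """
    capacity = waiting_slots + 1  # waiting slots plus the customer in service
    rho = lam / mu
    weights = [rho ** n for n in range(capacity + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]  # normalised so they sum to 1.0
    # In state n >= 1, one customer is in service and n - 1 are waiting.
    avg_waiting = sum((n - 1) * p for n, p in enumerate(probs) if n >= 1)
    p_drop = probs[capacity]  # an arrival that finds the system full is dropped
    return probs, avg_waiting, p_drop


if __name__ == "__main__":
    probs, avg_waiting, p_drop = mm1k_metrics(lam=2.0, mu=3.0, waiting_slots=3)
    print("average number waiting:", avg_waiting)
    print("drop probability:", p_drop)
```

Summing over the truncated state space directly avoids a separate closed form for the special case lambda = mu.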

Related

Does pipelining affect the clock time or the cycles per instruction (CPI)?

My book mentions: "Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination. If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI."
What I fail to understand is whether pipelining affects the CPI or the clock period. With pipelining, the clock period is taken as the maximum stage delay plus the latch delay, so pipelining does affect the clock time. It also affects the CPI, because the CPI becomes 1 with pipelining. Am I missing some concept?
Executing an instruction requires a set of operations. For the sake of simplicity assume there are 5:
fetch-instruction decode-execute-memory access-write back.
This can be implemented with several schemes.
A/ Mono cycle processor
The scheme is the following:
The processor fetches an instruction, directs it to a decoder that controls a bank of multiplexers that will configure a large combinatorial datapath that will implement the instruction.
In this model, every instruction requires one cycle, and, assuming all the 5 "stages" require an equal time t, the period will be 5t.
Hence CPI=1, T=5
Actually, this was more or less the underlying model of the earlier computers in the late 40's. Besides that, no real processor has been built like that, but it is theoretically quite doable.
B/ Multi cycle processor
Compared to the previous model, you introduce registers on the datapath. First one fetches the instruction and sends it to the inputs of an automaton that will sequentially apply the computation "stages".
In that case, instructions require 5 cycles (maybe slightly fewer, as some instructions may be simpler and, for instance, skip the memory access). The period is 1t (or maybe slightly more, to take into account the register traversal time).
CPI=5, T=1
The first "true" computers were implemented like that and this was the main architectural model up to the early 80's. Nowadays several microcontrollers or, for instance, the simpler version of NIOS, are still relying on this scheme.
C/ pipeline processor
You add extra registers between the stages in order to keep track of the instruction and of all the partial results. In that case, the execution of every stage can be independent and you can execute several instructions simultaneously in different stages.
CPI becomes 1, as you can start a new instruction at every clock cycle (probably a bit more because of the hazards, but that is another story).
And T=1.
So CPI=1, T=1
(the CPI reflects the throughput increase but the execution time of a single instruction is not reduced)
So pipelining can be seen as either reducing the cycle time with respect to scheme A, or reducing the CPI with respect to scheme B. And you can also imagine an intermediate scheme (say 3 stages, with a period of 2) where pipelining will reduce both.
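To make the comparison concrete, here is a small Python sketch of the three schemes under the same idealised assumptions used above: 5 equal stages of duration t, no register overhead, and no hazards. The function name and scheme labels are made up for illustration.

```python
def total_time(n_instr, scheme, stages=5, t=1.0):
    """Idealised time to run n_instr instructions (no hazards, no latch delay)."""
    if scheme == "single_cycle":   # scheme A: CPI = 1, period = stages * t
        return n_instr * 1 * (stages * t)
    if scheme == "multi_cycle":    # scheme B: CPI = stages, period = t
        return n_instr * stages * t
    if scheme == "pipelined":      # scheme C: CPI -> 1, period = t
        # The first instruction needs `stages` cycles to fill the pipeline,
        # then one instruction completes per cycle.
        return (stages + n_instr - 1) * t
    raise ValueError("unknown scheme: " + scheme)


if __name__ == "__main__":
    for scheme in ("single_cycle", "multi_cycle", "pipelined"):
        print(scheme, total_time(1000, scheme))
```

Schemes A and B take the same total time; the pipeline approaches a 5x speed-up over either one, which is why it can be described as reducing the period relative to A or the CPI relative to B.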

Is it possible to use the TWS/IBpy interface to collect and analyze tick data?

While searching for a template to test a paper trading strategy, I stumbled on IBPy. I have gone through the initial set-up and can connect and receive updates from the server. What I would like to do is:
a) Gather ticks from 1..n symbols when new prices (bid/asks) are published
b) Store these temporarily in a vector (I guess with vector.append((bid,ask)))
c) Once the vector reaches its computational max (I need 30 seconds or a certain number of ticks) I will compute some values on vector[] and decide whether an entry is appropriate
d) If not pop(0) and keep collecting
e) exit on a stoploss or trailing profit
My questions are:
i) I have read that updates arrive every 250 ms, which is fine for my analytics, but can the program/system keep up? Different symbols update at different times, so even if symbolA only updates every 250 ms, with 10 symbols the updates may be very frequent.
ii) When I stop to make a calculation, haven't I lost updates?
If there is skeleton code for this, it would be great to mess around with it
Thanks for listening!
If you need to handle 100s of stock symbols you should have multiple (at least 2) threads. One thread pulls the incoming data from the socket, sorts the messages by message type and pushes the data to queues. Other threads wait for their respective queues to get some data and process the incoming data.
The idea is that the dispatcher thread ensures that all incoming data gets pulled from the socket as fast as possible.
Generally your PC will be able to handle anything IB will be willing to send you. If your processing does not take too much time - no locks, calls to sleep(), file operations - you can do everything in a single thread.
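A minimal sketch of the dispatcher/worker layout described above, in plain Python. It deliberately does not use any IBPy calls: the tick source is a simulated list of (symbol, bid, ask) tuples, and all names here are hypothetical. In a real program the dispatcher would be fed by whatever callback or socket reader delivers the ticks.

```python
import queue
import threading
from collections import defaultdict, deque

SENTINEL = None  # marks the end of the simulated feed


def dispatcher(source, out_queue):
    """Pull ticks from the source as fast as possible and enqueue them."""
    for tick in source:
        out_queue.put(tick)
    out_queue.put(SENTINEL)


def worker(in_queue, windows):
    """Consume ticks; ticks arriving during a slow calculation wait in the queue."""
    while True:
        tick = in_queue.get()
        if tick is SENTINEL:
            break
        symbol, bid, ask = tick
        # deque(maxlen=...) drops the oldest entry automatically (the pop(0) step).
        windows[symbol].append((bid, ask))
        # ... compute on windows[symbol] and decide on an entry here ...


def run_demo(ticks, window_size=3):
    tick_queue = queue.Queue()
    windows = defaultdict(lambda: deque(maxlen=window_size))
    threads = [
        threading.Thread(target=dispatcher, args=(iter(ticks), tick_queue)),
        threading.Thread(target=worker, args=(tick_queue, windows)),
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return {symbol: list(window) for symbol, window in windows.items()}


if __name__ == "__main__":
    feed = [("A", 1, 2), ("B", 3, 4), ("A", 5, 6), ("A", 7, 8), ("A", 9, 10)]
    print(run_demo(feed))
```

Because the queue buffers everything the dispatcher receives, stopping to make a calculation in the worker delays processing but does not lose updates.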

CPU memory access time

Does the average data and instruction access time of the CPU depend on the execution time of an instruction?
For example, if the miss ratio is 0.1, 50% of instructions need memory access, the L1 access time is 3 clock cycles, the miss penalty is 20 cycles, and instructions execute in 1 cycle, what is the average memory access time?
I'm assuming you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than with a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of the memory access -- only locality of reference will affect the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question much better because you're not being very specific. To do well in a computer architecture class, you will have to learn how to compute average access times and CPI.
Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine. For now, my answer will be in terms of cycles.
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means 10% of cache accesses miss the cache. If you're meaning 10% of instructions, then you need to halve that (because only 50% of instrs are memory ops).
Now, if you want the average CPI (cycles per instr)
CPI = mem_instr% * Avg_mem_access_time + other_instr% * Avg_instr_access_time
= 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instr execution time, you need to multiply 3 by the reciprocal of the frequency (clock speed) of your machine.
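The arithmetic above can be reproduced with a short Python sketch. The function names are made up for illustration, and the 2 GHz clock in the demo is an assumed figure that is not part of the question.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


def average_cpi(mem_fraction, mem_cycles, other_cycles):
    """Weighted CPI under the simple non-overlapping model used above."""
    return mem_fraction * mem_cycles + (1.0 - mem_fraction) * other_cycles


if __name__ == "__main__":
    mem_time = amat(hit_time=3, miss_rate=0.1, miss_penalty=20)
    cpi = average_cpi(mem_fraction=0.5, mem_cycles=mem_time, other_cycles=1)
    clock_hz = 2e9  # assumed clock speed, only to show the unit conversion
    print("AMAT (cycles):", mem_time)
    print("CPI:", cpi)
    print("seconds per instruction:", cpi / clock_hz)
```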
Comments:
Comp. Arch classes basically teach you a very simplified view of what the hardware is doing. Current architectures are much, much more complex, and such a model (i.e. the equations above) is very unrealistic. For one thing, access time to the various levels of cache can be variable (depending on where the responding cache physically is on the multi- or many-core CPU); access time to memory (which is typically 100s of cycles) is also variable, depending on contention for resources (e.g. bandwidth), etc. Finally, in modern CPUs, instructions typically execute in parallel (ILP), depending on the width of the processor pipeline. This means adding up instruction execution latencies is basically wrong (unless your processor is a single-issue processor that only executes one instruction at a time and blocks other instructions on miss events such as cache misses and branch mispredicts). However, for educational purposes and for "average" results, the equations are okay.
One more thing, if you have a multi-level cache hierarchy, then the miss_penalty of level 1 cache will be as follows:
L1$ miss penalty = L2 access time + L2_miss_rate*L2_miss_penalty
If you have an L3 cache, you do a similar thing to L2_miss_penalty and so on
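A sketch of that recursion in Python, assuming each level is described by its access time and its local miss rate (misses at that level divided by accesses that reach that level). The function name and the numbers in the demo are hypothetical.

```python
def multilevel_amat(levels, memory_time):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (access_time, local_miss_rate) tuples, L1 first.
    memory_time: cycles to reach main memory after the last cache level misses.
    """
    penalty = memory_time
    # Work from the last cache level back to L1: the miss penalty of each
    # level is the average access time of everything below it.
    for access_time, miss_rate in reversed(levels):
        penalty = access_time + miss_rate * penalty
    return penalty


if __name__ == "__main__":
    # Assumed figures: L1 = 3 cycles with 10% misses, L2 = 10 cycles with 50% misses.
    print(multilevel_amat([(3, 0.1), (10, 0.5)], memory_time=100))
```

With a single level, this reduces to the hit time plus the miss rate times the miss penalty.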

defining the time it takes to do something (latency, throughput, bandwidth)

I understand latency - the time it takes for a message to go from sender to recipient - and bandwidth - the maximum amount of data that can be transferred over a given time - but I am struggling to find the right term to describe a related thing:
If a protocol is conversation-based - the payload is split up over many to-and-fros between the ends - then latency affects 'throughput' [1].
[1] What is this called, and is there a nice concise explanation of this?
Surfing the web, trying to optimize the performance of my nas (nas4free) I came across a page that described the answer to this question (imho). Specifically this section caught my eye:
"In data transmission, TCP sends a certain amount of data then pauses. To ensure proper delivery of data, it doesn’t send more until it receives an acknowledgement from the remote host that all data was received. This is called the “TCP Window.” Data travels at the speed of light, and typically, most hosts are fairly close together. This “windowing” happens so fast we don’t even notice it. But as the distance between two hosts increases, the speed of light remains constant. Thus, the further away the two hosts, the longer it takes for the sender to receive the acknowledgement from the remote host, reducing overall throughput. This effect is called “Bandwidth Delay Product,” or BDP."
This sounds like the answer to your question.
BDP as wikipedia describes it
To conclude, it's called Bandwidth Delay Product (BDP) and the shortest explanation I've found is the one above. (Flexo has noted this in his comment too.)
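Strictly speaking, the bandwidth-delay product is the quantity bandwidth × round-trip time, that is, the amount of data that has to be "in flight" to keep the link full; the throughput drop in the quote appears when the send window is smaller than that. A rough Python sketch under a simplified model (fixed window, no loss, no slow start), with hypothetical function names and example figures:

```python
def bandwidth_delay_product_bits(bandwidth_bps, rtt_s):
    """Bits that must be in flight to keep a link of this bandwidth busy."""
    return bandwidth_bps * rtt_s


def window_limited_throughput_bps(window_bytes, rtt_s, bandwidth_bps):
    """Upper bound on throughput when at most one window is sent per round trip."""
    return min(bandwidth_bps, window_bytes * 8 / rtt_s)


if __name__ == "__main__":
    bandwidth = 100e6   # assumed 100 Mbit/s link
    rtt = 0.1           # assumed 100 ms round-trip time
    window = 65536      # assumed 64 KiB window
    print("BDP (bits):", bandwidth_delay_product_bits(bandwidth, rtt))
    print("throughput bound (bit/s):",
          window_limited_throughput_bps(window, rtt, bandwidth))
```

In this model the throughput bound falls as the round-trip time grows, even though the link bandwidth is unchanged.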
Could goodput be the term you are looking for?
According to wikipedia:
In computer networks, goodput is the application level throughput, i.e. the number of useful bits per unit of time forwarded by the network from a certain source address to a certain destination, excluding protocol overhead, and excluding retransmitted data packets.
Wikipedia Goodput link
The problem you describe arises in communications which are synchronous in nature. If there was no need to acknowledge receipt of information and it was certain to arrive then the sender could send as fast as possible and the throughput would be good regardless of the latency.
When there is a requirement for things to be acknowledged, it is this synchronisation that causes the drop in throughput. The degree to which the communication (i.e. the sending of acknowledgements) is allowed to be asynchronous controls how much it hurts the throughput.
'Round-trip time' links latency and number of turns.
Or: Network latency is a function of two things:
(i) round-trip time (the time it takes to complete a trip across the network); and
(ii) the number of times the application has to traverse it (aka turns).
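A back-of-the-envelope Python sketch of that relationship, assuming a simple model in which every turn costs one full round trip and the payload itself is sent at the link bandwidth. The function names and example figures are made up for illustration.

```python
def transfer_time_s(payload_bytes, bandwidth_bps, rtt_s, turns):
    """Time for a chatty exchange: one round trip per turn plus serialisation time."""
    return turns * rtt_s + payload_bytes * 8 / bandwidth_bps


def effective_throughput_bps(payload_bytes, bandwidth_bps, rtt_s, turns):
    """Useful bits delivered per second once the turn overhead is included."""
    return payload_bytes * 8 / transfer_time_s(payload_bytes, bandwidth_bps, rtt_s, turns)


if __name__ == "__main__":
    # Assumed figures: 1 MB payload, 8 Mbit/s link, 100 ms round trip, 20 turns.
    print(transfer_time_s(1_000_000, 8_000_000, 0.1, 20))
    print(effective_throughput_bps(1_000_000, 8_000_000, 0.1, 20))
```

With these numbers, two of the three seconds are spent waiting on turns, so the effective throughput is a third of the raw bandwidth.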

Measuring time difference between networked devices

I'm adding networked multiplayer to a game I've made. When the server sends an update packet to the client, I include a timestamp so that the client knows exactly when that information is valid. However, the server computer and the client computer might have their clocks set to different times (maybe even just a few seconds difference), so the timestamp from the server needs to be translated to the client's local time.
So, I'd like to know the best way to calculate the time difference between the server and the client. Currently, the client pings the server for a time stamp during initialization, takes note of when the request was sent and when it was answered, and guesses that the time stamp was generated roughly halfway along the journey. The client also runs 10 of these trials and takes the average.
But, the problem is that I'm getting different results over repeated runs of the program. Within each set of 10, each measurement rarely diverges by more than 400 milliseconds, which might be acceptable. But if I wait a few minutes between each run of the program, the resulting averages might disagree by as much as 2 seconds, which is not acceptable.
Is there a better way to figure out the difference between the clocks of two networked devices? Or is there at least a way to tweak my algorithm to yield more accurate results?
Details that may or may not be relevant: The devices are iPod Touches communicating over Bluetooth. I'm measuring pings to be anywhere from 50-200 milliseconds. I can't ask the users to sync up their clocks. :)
Update: With the help of the below answers, I wrote an objective-c class to handle this. I posted it on my blog: http://scooops.blogspot.com/2010/09/timesync-was-time-sink.html
I recently took a one-hour class on this and it wasn't long enough, but I'll try to boil it down to get you pointed in the right direction. Get ready for a little algebra.
Let s equal the time according to the server. Let c equal the time according to the client. Let d = s - c. d is what is added to the client's time to correct it to the server's time, and is what we need to solve for.
First we send a packet from the server to the client with a timestamp. When that packet is received at the client, it stores the difference between the given timestamp and its own clock as t1.
The client then sends a packet to the server with its own timestamp. The server sends the difference between the timestamp and its own clock back to the client as t2.
Note that t1 and t2 both include the "travel time" t of the packet plus the time difference between the two clocks d. Assuming for the moment that the travel time is the same in both directions, we now have two equations in two unknowns, which can be solved:
t1 = t - d
t2 = t + d
t1 + d = t2 - d
d = (t2 - t1)/2
The trick comes because the travel time is not always constant, as evidenced by your pings between 50 and 200 ms. It turns out to be most accurate to use the timestamps with the minimum ping time. That's because your ping time is the sum of the "bare metal" delay plus any delays spent waiting in router queues. Every once in a while, a lucky packet gets through without any queuing delays, so you use that minimum time as the most repeatable time.
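A minimal Python sketch of the calculation above, assuming each exchange yields one (t1, t2) pair as defined earlier. Since t1 + t2 equals the sum of the two travel times (the clock difference d cancels), the pair with the smallest sum is the one with the smallest round trip. The function names are hypothetical.

```python
def offset_from_exchange(t1, t2):
    """Solve t1 = t - d and t2 = t + d for d (assumes symmetric travel time)."""
    return (t2 - t1) / 2.0


def best_offset(samples):
    """Estimate d from the exchange with the smallest round trip.

    samples: iterable of (t1, t2) pairs, one per exchange.
    """
    t1, t2 = min(samples, key=lambda pair: pair[0] + pair[1])
    return offset_from_exchange(t1, t2)


if __name__ == "__main__":
    true_d = 5.0  # simulated clock difference in seconds (server minus client)
    # Simulated one-way travel times (server->client, client->server) in seconds.
    travel = [(0.20, 0.08), (0.05, 0.05), (0.12, 0.15)]
    samples = [(sc - true_d, cs + true_d) for sc, cs in travel]
    print(best_offset(samples))
```

When the two travel times differ, the estimate is off by half of their difference, which is why the least-delayed exchange tends to give the most repeatable result.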
Also keep in mind that clocks run at different rates. For example, I can reset my computer at home to the millisecond and a day later it will be 8 seconds slow. That means you have to continually readjust d. You can use the slope of various values of d computed over time to calculate your drift and compensate for it in between measurements, but that's beyond the scope of an answer here.
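For what it is worth, one simple way to estimate that slope is an ordinary least-squares line through (time, d) measurements. The sketch below is only an illustration under that assumption; it is not necessarily the method taught in the class, and the names are hypothetical.

```python
def fit_drift(times, offsets):
    """Least-squares line through (time, offset) points.

    Returns (slope, intercept): slope is the drift in seconds of offset per
    second of elapsed time. Needs at least two distinct time values.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(offsets) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, offsets))
    slope = cov / var_t
    intercept = mean_d - slope * mean_t
    return slope, intercept


def predicted_offset(slope, intercept, t):
    """Extrapolate the offset to time t, between measurements."""
    return intercept + slope * t


if __name__ == "__main__":
    times = [0.0, 60.0, 120.0, 180.0]      # seconds since the first measurement
    offsets = [2.000, 2.006, 2.012, 2.018]  # simulated values of d
    slope, intercept = fit_drift(times, offsets)
    print(slope, predicted_offset(slope, intercept, 240.0))
```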
Hope that helps point you in the right direction.
Your algorithm will not be much more accurate unless you can use some statistical methods. First of all, 10 is probably not sufficient. The first and simplest change would be to gather 100 transit time samples and toss out the x longest and shortest.
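The "toss out the x longest and shortest" step is a trimmed mean. A small Python sketch, with a hypothetical function name and made-up sample values:

```python
def trimmed_mean(samples, trim):
    """Mean of the samples after dropping the `trim` smallest and `trim` largest."""
    if 2 * trim >= len(samples):
        raise ValueError("trim removes every sample")
    kept = sorted(samples)[trim:len(samples) - trim]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    transit_times = [0.40, 0.10, 0.12, 0.11, 0.01]  # seconds, made-up values
    print(trimmed_mean(transit_times, trim=1))
```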
Another thing to add would be that both clients send their own timestamp in each packet. Then you can also calculate how different their clocks are and check the average difference between the clocks.
You can also look at SNTP and NTP implementations, as these protocols are designed to solve exactly this problem.

Resources