Find the clock cycle time for pipelined and unpipelined caches - pipeline

Consider a 2-way set-associative cache whose unit timings are given in Table 2. Consider two possible designs for this cache: unpipelined and pipelined. (30 points)
a) What would be the cycle time for both caches? What is the maximum frequency at
which the processor can run in both cases?
b) Given that all memory references for some program hit in the cache, compare the performance of the two caches in terms of cache access latency and throughput. Assume
that tag comparisons are performed using two separate comparators in both cases and
that each unit in the pipelined cache is used in only one cycle.
c) Assuming that the miss rate for both caches is 5%, and that the miss penalty is 10 ns for
the unpipelined cache and 7 ns for the pipelined cache, compare the performance of the two
caches in terms of AMAT.
Table 2. Cache Attributes
Unit                    | Delay (ns)
Cache Indexing          | 0.4
Reading Tags and Data   | 0.4
Writing Data into Cache | 0.3
Comparator              | 0.2
Block Multiplexor       | 0.2
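What follows is a hedged sketch (not part of the original exercise) of one way to combine the numbers in Table 2. It assumes the read-hit path is indexing -> reading tags/data -> comparator -> block multiplexor, ignores the write path, and packs the comparator and multiplexor into one pipeline stage so that the pipelined cycle time equals the slowest unit.
index, read, write, compare, mux = 0.4, 0.4, 0.3, 0.2, 0.2     # ns, from Table 2
hit_path = index + read + compare + mux                        # 1.2 ns read-hit path (assumption)
cycle_unpipelined = hit_path                                   # a) 1.2 ns -> max frequency ~833 MHz
cycle_pipelined = max(index, read, write, compare, mux)        # a) 0.4 ns -> max frequency 2.5 GHz
stages = 3                                                     # assumed split: index | read | compare+mux
hit_latency_pipelined = stages * cycle_pipelined               # b) 1.2 ns latency, one access per 0.4 ns
amat_unpipelined = cycle_unpipelined + 0.05 * 10               # c) 1.2 + 0.5  = 1.7 ns
amat_pipelined = hit_latency_pipelined + 0.05 * 7              # c) 1.2 + 0.35 = 1.55 ns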

Related

SQLite optimization: simultaneous search for lower and upper bounds

For a long-running algorithm based on SQLite3, I have a simple but huge table defined like this:
CREATE TABLE T(ID INTEGER PRIMARY KEY, V INTEGER);
The inner loop of the algorithm needs to find, given some integer N, the biggest ID that is less than or equal to N, the value V associated with it, as well as the smallest ID that is strictly bigger than N.
The following pair of requests does work:
SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1;
SELECT ID FROM T WHERE ID > ? LIMIT 1;
But I feel that it should be possible to merge those two requests into a single one. When SQLite has consulted the primary index to find the ID just smaller than N (first request), the next entry in the B-tree index is already the answer to the second request.
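One way to issue both lookups in a single statement (a sketch, not from the original post) is a UNION ALL over two subselects. Internally SQLite still performs two index seeks, so this is not the single B-tree walk hoped for above, but it keeps the pair in one prepared statement:
import sqlite3

conn = sqlite3.connect("data.db")  # hypothetical database file holding table T

BRACKET_SQL = (
    "SELECT ID, V FROM (SELECT ID, V FROM T WHERE ID <= ? ORDER BY ID DESC LIMIT 1) "
    "UNION ALL "
    "SELECT ID, NULL FROM (SELECT ID FROM T WHERE ID > ? ORDER BY ID LIMIT 1)"
)

def bracket(n):
    # Returns up to two rows: (biggest ID <= n, its V) and (smallest ID > n, None).
    return conn.execute(BRACKET_SQL, (n, n)).fetchall()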
To give an order of magnitude, the table T has more than one billion rows, and the inner requests will need to be executed more than 100 billion times. Hence every microsecond counts. Of course I will use a fast SSD on a server with plenty of RAM. PostgreSQL could also be an option if it is quicker for this usage without taking more disk space.
This is an answer to my own question. While I haven't yet found a better SQL request than the pair posted in the question, I made some preliminary speed measurements that slightly shifted the perspective. Here are my observations:
There is a big difference in search and insert performance depending on whether the ID values arrive in sequential or random order. In my application, IDs will be mostly sequential, but with plenty of exceptions.
Executing the pair of SQL requests takes less time than the sum of each request executed separately. This is most visible with random order. It means that when the second request runs, the part of the B-tree needed to reach the next ID is already in cache memory, so walking through the index is faster the second time.
The search and insertion times per request increase with the number of rows. In sequential order the increase is small, but in random order it is substantial. A B-tree lookup is inherently O(log N), and in addition the OS page cache becomes less effective as the file size grows.
Here are my measurements on a fast server with SSD:
# rows | Insertion (µs)  | Search (µs)
       | sequ.    random | sequ.   random
10^5   | 0.60     0.9    | 1.1     1.3
10^6   | 0.64     3.1    | 1.2     2.5
10^7   | 0.66     4.3    | 1.2     3.0
10^8   | 0.70     5.6    | 1.3     4.2
10^9   | 0.73            | 1.3     4.6
My conclusion is that SQLite's internal logic doesn't seem to be the bottleneck for the foreseen algorithm. The bottleneck for huge tables is disk access, even on a fast SSD. I don't expect better performance from another database engine, nor from a custom-made B-tree.

Does pipelining affect the clock time or the cycles-per-instruction (CPI)?

My book mentions: "Depending on what you consider as the baseline, the reduction can be viewed as decreasing the number of clock cycles per instruction (CPI), as decreasing the clock cycle time, or as a combination. If the starting point is a processor that takes multiple clock cycles per instruction, then pipelining is usually viewed as reducing the CPI."
What I fail to understand is whether pipelining affects the CPI or the clock period. With pipelining, the clock period is taken as the maximum stage delay plus the latch delay, so pipelining does affect the clock time. It also affects the CPI, since the CPI becomes 1 with pipelining. Am I missing some concept?
Executing an instruction requires a set of operations. For the sake of simplicity, assume there are 5:
instruction fetch, decode, execute, memory access, write back.
This can be implemented with several schemes.
A/ Single-cycle processor
The scheme is the following:
The processor fetches an instruction and directs it to a decoder that controls a bank of multiplexers, which configure a large combinational datapath that implements the instruction.
In this model, every instruction requires one cycle, and, assuming all 5 "stages" require an equal time t, the period will be 5t.
Hence CPI = 1, T = 5t
Actually, this was more or less the underlying model of the earliest computers in the late 40s. Apart from those, no real processor has been built like that, but it is theoretically quite doable.
B/ Multi-cycle processor
Compared to the previous model, you introduce registers on the datapath. First, the instruction is fetched and sent to the inputs of an automaton that sequentially applies the computation "stages".
In that case, instructions require 5 cycles (maybe slightly fewer, as some instructions may be simpler and, for instance, skip the memory access). The period is 1t (or maybe slightly more, to take into account the register traversal time).
CPI = 5, T = 1t
The first "true" computers were implemented like that, and this was the main architectural model up to the early 80s. Nowadays, several microcontrollers and, for instance, the simpler versions of NIOS still rely on this scheme.
C/ Pipelined processor
You add extra registers between the stages in order to keep track of the instruction and of all the partial results. In that case, the execution of every stage can be independent and you can execute several instructions simultaneously in different stages.
CPI becomes 1, as you can start a new instruction at every clock cycle (probably a bit more because of the hazards, but that is another story).
And T = 1t.
So CPI = 1, T = 1t
(the CPI reflects the throughput increase but the execution time of a single instruction is not reduced)
So pipelining can be seen either as reducing the cycle time with respect to scheme A, or as reducing the CPI with respect to scheme B. And you can also imagine an intermediate scheme (say 3 stages, with a period of 2t) where pipelining reduces both.
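A small sketch (not part of the original answer) putting numbers on the three schemes, using Time = instructions x CPI x clock period and ignoring pipeline fill and hazard stalls:
def exec_time(instructions, cpi, period):
    # Classic performance equation: total time = instruction count * CPI * clock period.
    return instructions * cpi * period

t = 1.0        # delay of one "stage", arbitrary time unit
n = 1_000_000  # hypothetical instruction count
scheme_a = exec_time(n, cpi=1, period=5 * t)  # single-cycle: CPI=1, T=5t
scheme_b = exec_time(n, cpi=5, period=1 * t)  # multi-cycle:  CPI=5, T=1t
scheme_c = exec_time(n, cpi=1, period=1 * t)  # pipelined:    CPI=1, T=1t
# scheme_a == scheme_b == 5,000,000*t, scheme_c == 1,000,000*t: a 5x throughput gain,
# while the latency of a single instruction is still 5t in the pipelined case.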

Apache Ignite 2.4 uneven partitioning of data causing nodes to run out of memory and crash

Environment:
Apache Ignite 2.4 running on Amazon Linux. The VM has 16 CPUs and 122 GB of RAM, so there is plenty of room there.
5 nodes, 12GB each
cacheMode = PARTITIONED
backups = 0
OnheapCacheEnabled = true
atomicityMode = ATOMIC
rebalanceMode = SYNC
rebalanceBatchSize = 1MB
copyOnRead = false
rebalanceThrottle = 0
rebalanceThreadPoolSize = 4
Basically we have a process that populates the cache on startup and then receives periodic updates from Kafka, propagating them to the cache.
The number of elements in the cache is more or less stable over time (there is just a little fluctuation since we have a mixture of create, update and delete events), but what we have noticed is that the distribution of data across the different nodes is very uneven, with one of the nodes having at least double the number of keys (and memory utilization) as the others. Over time, that node either runs out of memory, or starts doing very long GCs and loses contact with the rest of the cluster.
My expectation was that Ignite would balance the data across the different nodes, but reality shows something completely different. Am I missing something here? Why do we see this imbalance and how do we fix it?
Thanks in advance.
Bottom line, although our hash function had good distribution, the default affinity function was not yielding a good distribution of keys (and, consequently, memory) across the nodes in the cluster. We replaced it with a very naive one (partition # % # of nodes), and that improved the distribution quite a bit (less than 2% variance).
This is not a generic solution; it works for us because our entire cluster is in one VM and we don't use replication. For massive clusters crossing VM boundaries, and with replication, keeping the replicated data on separate servers is mandatory, and the naive approach won't cut it.
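As a toy illustration (not Ignite code) of the naive mapping mentioned above: assigning partition p to node p % node_count spreads partitions almost perfectly evenly, which is why key and memory usage evened out once keys hashed uniformly to partitions.
from collections import Counter

partitions = 1024  # hypothetical partition count
nodes = 5
assignment = {p: p % nodes for p in range(partitions)}
print(Counter(assignment.values()))
# Counter({0: 205, 1: 205, 2: 205, 3: 205, 4: 204}) -> at most one partition of skew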

CPU memory access time

Does the average data and instruction access time of the CPU depend on the execution time of an instruction?
For example, if the miss ratio is 0.1, 50% of instructions need memory access, the L1 access time is 3 clock cycles, the miss penalty is 20 cycles, and instructions execute in 1 cycle, what is the average memory access time?
I assume you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than with a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of the memory access -- only locality of reference will affect the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question much better because you're not being very specific. To do well in a computer architecture class, you will need to learn how to compute average access times and CPI.
Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine; for now, my answer will be in terms of cycles.
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means that 10% of cache accesses miss the cache. If you mean 10% of instructions, then you need to halve that (because only 50% of instructions are memory ops).
Now, if you want the average CPI (cycles per instr)
CPI = mem_instr% * Avg_mem_access_time + non_mem_instr% * Avg_instr_execution_time
= 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instr execution time, you need to multiply 3 by the reciprocal of the frequency (clock speed) of your machine.
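A quick sketch of the arithmetic above, under the same assumptions as the answer (miss rate counted per cache access, 3-cycle hit, 20-cycle miss penalty, 50% memory instructions, 1 cycle for a non-memory instruction), with a hypothetical 2 GHz clock for the last step:
hit_time, miss_rate, miss_penalty = 3, 0.10, 20
amat = hit_time + miss_rate * miss_penalty   # 3 + 0.1*20 = 5 cycles
mem_frac, other_frac = 0.5, 0.5
cpi = mem_frac * amat + other_frac * 1       # 0.5*5 + 0.5*1 = 3 cycles/instruction
clock_hz = 2e9                               # hypothetical clock speed
avg_instr_time_s = cpi / clock_hz            # 1.5e-9 s = 1.5 ns per instruction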
Comments:
Computer architecture classes basically teach you a very simplified model of what the hardware is doing. Current architectures are much, much more complex, and such a model (i.e., the equations above) is very unrealistic. For one thing, the access time to the various levels of cache can be variable (depending on where the responding cache physically sits on the multi- or many-core CPU); the access time to memory (which typically takes 100s of cycles) is also variable depending on contention for resources (e.g., bandwidth), etc. Finally, in modern CPUs instructions typically execute in parallel (ILP), depending on the width of the processor pipeline. This means that adding up instruction execution latencies is basically wrong (unless your processor is a single-issue processor that only executes one instruction at a time and blocks other instructions on miss events such as cache misses and branch mispredicts). However, for educational purposes and for "average" results, the equations are okay.
One more thing: if you have a multi-level cache hierarchy, then the miss penalty of the level-1 cache will be as follows:
L1 miss penalty = L2_access_time + L2_miss_rate*L2_miss_penalty
(where L2_miss_rate is the local miss rate of the L2). If you have an L3 cache, you expand L2_miss_penalty the same way, and so on.
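A sketch of that recursive composition with hypothetical L2/L3 numbers (only the structure matters): each level's miss penalty is the next level's access time plus that next level's local miss rate times its own miss penalty.
l3_miss_penalty = 200                                               # hypothetical DRAM cost, cycles
l3_access, l3_local_miss_rate = 40, 0.5                             # hypothetical L3 numbers
l2_miss_penalty = l3_access + l3_local_miss_rate * l3_miss_penalty  # 40 + 0.5*200 = 140 cycles
l2_access, l2_local_miss_rate = 12, 0.3                             # hypothetical L2 numbers
l1_miss_penalty = l2_access + l2_local_miss_rate * l2_miss_penalty  # 12 + 0.3*140 = 54 cycles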

Understanding CYCLE_ACTIVITY.* Haswell Performance-Monitoring Events

I'm trying to analyse an execution on an Intel Haswell CPU (Intel® Core™ i7-4900MQ) with the Top-down Microarchitecture Analysis Method (TMAM), described in Chapters B.1 and B.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual. (I adjust the Sandy Bridge formulas described in B.4 to the Haswell microarchitecture if needed.)
Therefore I measure performance counter events with perf. There are some results I don't understand:
CPU_CLK_UNHALTED.THREAD_P < CYCLE_ACTIVITY.CYCLES_LDM_PENDING
This holds only for a few measurements, but still is weird. Does the PMU count halted cycles for CYCLE_ACTIVITY.CYCLES_LDM_PENDING?
CYCLE_ACTIVITY.CYCLES_L2_PENDING > CYCLE_ACTIVITY.CYCLES_L1D_PENDING
and CYCLE_ACTIVITY.STALLS_L2_PENDING > CYCLE_ACTIVITY.STALLS_L1D_PENDING
This applies to all measurements. When there is an L1D cache miss, the load gets transferred to the L2 cache, right? So a load that missed L2 earlier also missed L1. The L1 instruction cache is not counted here, but with *_L2_PENDING being 100x or even 1000x greater than *_L1D_PENDING, that is probably not the explanation. Are the stalls/cycles being measured separately somehow? But then there is this formula:
%L2_Bound =
(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
Hence CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING is assumed (the result of the formula must be positive). (The other thing with this formula is that it should probably be CYCLES instead of STALLS. However this wouldn't solve the problem described above.) So how can this be explained?
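For completeness, a trivial sketch of evaluating the %L2_Bound formula from raw counter values (the numbers are hypothetical); a negative result is exactly the symptom described above:
stalls_l1d_pending = 4.0e8  # hypothetical CYCLE_ACTIVITY.STALLS_L1D_PENDING count
stalls_l2_pending  = 1.5e8  # hypothetical CYCLE_ACTIVITY.STALLS_L2_PENDING count
clocks             = 2.0e9  # hypothetical CPU_CLK_UNHALTED.THREAD_P count
l2_bound = (stalls_l1d_pending - stalls_l2_pending) / clocks  # 0.125 -> 12.5% L2-bound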
edit: My OS: Ubuntu 14.04.3 LTS, kernel: 3.13.0-65-generic x86_64, perf version: 3.13.11-ckt26
I'll start with the second part of the question, i.e., how CYCLE_ACTIVITY.CYCLES_L2_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L1D_PENDING, respectively.
First, note that the formula for %L2_Bound is from Section B.5 of the Intel Optimization Manual. The first paragraph of that section says:
This section covers various performance tuning techniques using
performance monitoring events. Some techniques can be adapted in
general to other microarchitectures, most of the performance events
are specific to Intel microarchitecture code name Sandy Bridge.
My first hunch was that prefetching has something to do with it (see my comment). This paragraph pushed me further in the right direction; these events may represent different things in Sandy Bridge and in Haswell. Here is what they mean on Haswell:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles with pending L1 data cache miss loads.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Cycles with pending L2 miss loads.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Execution stalls due to L1 data cache miss loads.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Number of loads missed L2.
The manual also says the counters for L2 should only be used when hyperthreading is disabled. Now here is what they mean on Sandy Bridge:
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Each cycle there was a miss-pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.CYCLES_L2_PENDING: Each cycle there was a MLC-miss pending demand load this thread, increment by 1.
CYCLE_ACTIVITY.STALLS_L1D_PENDING: Each cycle there was a miss-pending demand load this thread and no uops dispatched, increment by 1.
CYCLE_ACTIVITY.STALLS_L2_PENDING: Each cycle there was a MLC-miss pending demand load and no uops dispatched on this thread, increment by 1.
There are three important differences:
Some of the Haswell events are only valid when HT is disabled. All SNB events are valid even when HT is enabled.
CYCLE_ACTIVITY.STALLS_L2_PENDING on HSW counts the number of load misses at L2, but on SNB, it counts the number of cycles during which there was at least one demand load miss at L2.
The HSW events include all accesses, not just demand loads. In contrast, the SNB events only occur for demand loads.
On HSW, CYCLE_ACTIVITY.CYCLES_L2_PENDING can be larger than CYCLE_ACTIVITY.CYCLES_L1D_PENDING because of the miss-pending loads issued by the L1D prefetcher (and/or the L2 prefetcher(s) depending on whether the prefetcher increments the counter for the same level of cache). Similarly, while they count different things, CYCLE_ACTIVITY.STALLS_L2_PENDING can be larger than CYCLE_ACTIVITY.STALLS_L1D_PENDING due to prefetching. TLB prefetching and prefetching at other MMU caches may also impact these performance events on HSW. On the other hand, on SNB, it is guaranteed that CYCLE_ACTIVITY.STALLS_L2_PENDING < CYCLE_ACTIVITY.STALLS_L1D_PENDING, and that's why the %L2_Bound formula is valid on SNB.
Like I said in the comment, disabling HT and/or prefetching may "fix" your problem.
Actually, the Intel spec update document for the Mobile Haswell processors mentions two bugs that affect CYCLES_L2_PENDING:
HSM63: The intended behavior of CYCLES_L2_PENDING on Haswell is to count only for demand loads, but it may count inaccurately in SMT mode.
HSM80: CYCLES_L2_PENDING may overcount due to requests from the next page prefetcher.
I think you can minimize the error in CYCLES_L2_PENDING by disabling SMT (either in the BIOS or by putting the other logical core to sleep). In addition, try not to trigger the NPP. This can be achieved by avoiding locations towards the end of a virtual page where the translation of the next page is not already in the TLB hierarchy.
Related: When L1 misses are a lot different than L2 accesses… TLB related?
Regarding the first part of the question, i.e., how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLE_ACTIVITY.CYCLES_LDM_PENDING: one explanation I could think of is that CYCLE_ACTIVITY.CYCLES_LDM_PENDING counts loads issued from (some) other threads (in particular, on the same physical core), not just the halted thread. Erratum HSM146 mentions that CYCLES_LDM_PENDING may count inaccurately when the logical core is not in C0, which explains how CPU_CLK_UNHALTED.THREAD_P can be smaller than CYCLES_LDM_PENDING. Disabling HT may eliminate this inaccuracy, although the spec update document doesn't provide any workaround.
