Analysis of enhanced interleaved multithreading (W.M. Zuberek)

In interleaved multithreading, the processor switches to a different thread in every cycle. This approach eliminates the data dependencies that slow down the processor's pipeline: since consecutive instructions are issued from different threads, they have no data dependencies on one another. Typically, the number of threads is equal to the number of pipeline stages, so no inter-instruction dependencies can stall the pipeline. In pure interleaved multithreading, a thread that issues a long-latency memory operation becomes "waiting" for the result of the requested operation. If a waiting thread is selected for execution, its slot simply remains empty (i.e., no instruction is issued), which is equivalent to a single-cycle pipeline stall. Since the threads issue their instructions one after another, fewer processor cycles are lost during a long-latency operation of a single thread than in the case of uni-threaded processors.
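The slot behaviour described above can be illustrated with a small cycle-by-cycle simulation. This is only a toy sketch, not the model used to produce the results on this page: the function name, the fixed `latency`, and the per-instruction probability `p_long` of starting a long-latency operation are all assumptions made for illustration.

```python
import random

def simulate_pure_interleaved(n_slots=4, p_long=0.1, latency=20,
                              cycles=100_000, seed=1):
    """Return the fraction of cycles in which an instruction is issued.

    Hypothetical toy model: each thread owns one slot of a strict
    round-robin schedule, and a long-latency operation always takes
    `latency` cycles (an assumed, fixed value).
    """
    rng = random.Random(seed)
    ready_at = [0] * n_slots      # cycle at which each thread may issue again
    issued = 0
    for cycle in range(cycles):
        t = cycle % n_slots       # the thread changes in every cycle
        if ready_at[t] <= cycle:  # thread is active: issue one instruction
            issued += 1
            if rng.random() < p_long:
                # the instruction starts a long-latency operation,
                # so the thread is "waiting" until it completes
                ready_at[t] = cycle + latency
        # otherwise the slot stays empty: a single-cycle stall
    return issued / cycles
```

With `p_long=0.0` no thread ever waits and the utilization is 1.0; with `p_long=1.0` and the default parameters each of the 4 threads issues once every 20 cycles, giving a utilization of 0.2.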

In enhanced interleaved multithreading, additional threads are available to replace any active thread when it initiates a long-latency operation and becomes inactive until the operation completes. Consequently, processor cycles are not lost, the utilization of processors increases, and the performance of the system improves. Enhanced interleaved multithreading thus combines elements of interleaved and block multithreading within one architecture.
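The replacement of waiting threads can be sketched in the same toy style. Again, this is an illustrative sketch under stated assumptions, not the model behind the figures on this page: threads are treated as interchangeable, thread replacement is assumed to cost no cycles, the latency is fixed, and the interconnecting network is ignored entirely. The names `simulate_enhanced`, `n_extra`, `p_long` and `latency` are hypothetical.

```python
import heapq
import random

def simulate_enhanced(n_slots=4, n_extra=2, p_long=0.1, latency=20,
                      cycles=100_000, seed=1):
    """Return the fraction of cycles in which an instruction is issued.

    Hypothetical toy model: `n_slots` threads are interleaved, and
    `n_extra` additional threads can take over the slot of a thread
    that started a long-latency operation.
    """
    rng = random.Random(seed)
    occupied = [True] * n_slots   # does the slot currently hold an active thread?
    spare_ready = n_extra         # ready threads not assigned to any slot
    wake_times = []               # min-heap: completion cycles of waiting threads
    issued = 0
    for cycle in range(cycles):
        # threads whose long-latency operation has completed become ready
        while wake_times and wake_times[0] <= cycle:
            heapq.heappop(wake_times)
            spare_ready += 1
        s = cycle % n_slots
        if not occupied[s] and spare_ready > 0:
            spare_ready -= 1      # a ready thread replaces the waiting one
            occupied[s] = True
        if occupied[s]:
            issued += 1
            if rng.random() < p_long:
                heapq.heappush(wake_times, cycle + latency)
                occupied[s] = False   # refilled at the slot's next turn, if possible
        # an unoccupied slot with no ready replacement stays empty
    return issued / cycles
```

In the extreme case `p_long=1.0` with the default latency, `n_extra=0` gives a utilization of 0.2, while `n_extra=4` gives 0.4: in this simplified setting, the additional threads fill slots that would otherwise remain empty.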

The utilization of 4-thread processors, as a function of the probability of long-latency accesses to local memory and the number of additional threads, is shown in Fig.1.

Fig.1. Processor utilization for a 4-thread system.

The effect of the enhancements is more pronounced for values of this probability close to 1 (i.e., when most long-latency accesses are to local memory); for small values of this probability (in this particular case), the availability of additional threads does not have any significant effect on the utilization of processors.

Utilization of processors for an 8-thread system is shown in Fig.2. The results are better than for the 4-thread system, but the effects of enhanced multithreading are less significant than in Fig.1.

Fig.2. Processor utilization for an 8-thread system.

Both Fig.1 and Fig.2 show that the performance of processors decreases quite significantly when most long-latency operations are accesses to remote memory. This is an indication that the interconnecting network may be the limiting component of this system. This is also the reason that the enhancement of multithreading has practically no effect when the long-latency accesses are mostly to remote memory; the interconnecting network, and more precisely the delay of its switches, determines the performance of the system. Indeed, the utilization of the input switch for the 4-thread system, as a function of the probability of long-latency accesses to local memory and the number of additional threads, is shown in Fig.3 (for the 8-thread system, the utilization of the input switch is very similar to Fig.3). The region of low utilization of processors in Fig.1 and Fig.2 corresponds to almost 100% utilization of the input switches, which indicates that the switches are the bottleneck of this system, limiting its performance; the switches are simply too slow for this system.

Fig.3. Switch utilization for a 4-thread system; case a.

Fig.4 shows the utilization of switches when the switch delay is one half of that used in Fig.3 (and all other parameters are the same as in Fig.3). The utilization of processors

Fig.4. Switch utilization for a 4-thread system; case b.

Fig.4 indicates that the input switch remains the bottleneck only for very small values of the probability of long-latency accesses to local memory. Further improvement of the processor's performance (mostly for long-latency accesses to remote memory) can be obtained by using even faster switches or by using several parallel switches and sharing the load among them.

Fig.5 shows the relative improvement of the processor utilization when 4 additional threads are used; a maximum improvement of more than 40% can be achieved when approximately one half of the long-latency memory accesses are local.

Fig.5. Utilization gain for an enhanced interleaved 4-thread system.
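The page does not define the relative improvement explicitly; one common reading, assumed here, is the increase in utilization divided by the utilization without additional threads. A minimal sketch of that calculation follows, with a hypothetical function name and purely illustrative input values that are not taken from the figures.

```python
def relative_gain(u_base, u_enhanced):
    """Relative improvement of utilization (assumed definition):
    (enhanced - base) / base, e.g. 0.4 means a 40% improvement."""
    if u_base <= 0:
        raise ValueError("base utilization must be positive")
    return (u_enhanced - u_base) / u_base

# Illustrative values only (not read from Fig.1 or Fig.5):
# a base utilization of 0.5 raised to 0.7 is a gain of about 40%.
gain = relative_gain(0.5, 0.7)
```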



Copyright by W.M. Zuberek. All rights reserved.
Revised: 2003.03.25
