# Concept and Design of Exhaustive-Parallel search algorithm for Network-on-Chip

Meganathan Deivasigamani<sup>1</sup>, Shaghayeghsadat Tabatabaei<sup>2</sup>, Naveed Mustafa<sup>2</sup>, Hamza Ijaz<sup>2</sup>, Haris Bin Aslam<sup>2</sup> Shaoteng Liu<sup>2</sup> and Axel Jantsch<sup>2</sup>

Haris Bin Aslam Shaoteng Liu and Axel Jantsch

<sup>(1)</sup>Department of Electronics Engineering, MIT, Anna University, Chennai, India <sup>(2)</sup> School of ICT, Royal Institute of Technology (KTH), Stockholm, Sweden

Abstract: This paper presents the concept and design of exhaustive-parallel search algorithm for Networkon-Chip. The proposed parallel algorithm searches minimal path between source and destination in a forward-wave-propagation manner. The algorithm guarantees setup latency if the setup path exists. A high performance switch is designed to support exhaustive-parallel search algorithm. The NoC fabric is designed for 8X8 mesh architecture and its performance is evaluated.

**Keywords:** Network-on-Chip (NoC), Circuit-switch (CS), Exhaustive Parallel Search (EPS), Guaranteed Throughput (GT).

### I. INTRODUCTION

Scaling the ultra deep submicron technology drives processor development from single processoron-chip into multiprocessor systems-on-a-chip (MPSoCs)[1]. MPSoC system is highly suitable for multimedia, avionic, automated manufacturing applications. MPSoCs are typically composed of processors, DSPs, memories, hardware accelerators (video and audio decoders). Input/Output devices and application specific coprocessors. They may include homogeneous on-chip multiprocessors (CMPs) [2] or, alternatively, specialized processors. These systems need scalable, high-bandwidth network-onchip (NoC) fabric for global communications. NoC fabric eliminates the bandwidth limitations and crosstalk effects which are present in global interconnects [3]. Regular NoC structure reduce the VLSI layout complexity compared to custom routed wires [4]. The most difficult task in NoC design is providing quality-of-transmission with low-cost (area and energy) [5],[6],[7]. The QoS offers high throughput, no data loss, lower data-latency and high data-bandwidth. The hard-real time systems based on NoC require QoS [5],[6],[7]. In sequential search algorithm, the sender sends out a probe to the destination node. At every step the probe reserves a link and sets up the circuit connection incrementally. If a link is already reserved by another connection, the probe backtracks and tears down one or several links of the partially set-up connection. In this way,

the probe searches through all possible minimal paths between source and destination until either a connection is setup successfully, or no connection is found. The drawback of sequential search algorithm is that the setup latency is not guaranteed. If the network has heavy traffic congestion then it takes huge amount of setup cycles to find minimal path or to inform the source that the path does not exist. But the proposed parallel search algorithm guarantees setup latency. It sends probes along all minimum paths in parallel. The probe always traversed from source to destination in a minimal path such as forward-wave-propagation. At every switch, when a probe is received, it forwarded to the possible minimal paths in parallel. When a probe finally reaches the destination, the connection is successfully setup and an ACK is sent back to the source. The parallel search takes exactly M steps to setup the connection, where M is the distance in hops between source and destination.

## II. EXHAUSTIVE PARALLEL SEARCH METHOD

The exhaustive parallel search method has three phases of operation, namely set-up phase, datatransmission phase and release phase. In setup phase, the set-up probe sets minimal path between source and destination. In data-transmission phase, data flits are transmitted between sender and receiver through the minimal path. The third phase is release phase, after transferring complete data flits between source and destination nodes, the link resources booked by the source is released for allowing other resources to use them.

## A. Set-up phase

The general-architecture of router used for exhaustive-parallel search is shown in Figure 1. The packet, control switches are shown only in north direction. But they are present other directions as well. Each interconnect has five numbers of input and output ports. They are connected on north, east, south, west and input-core directions. Each input and output ports comprise of data, setup and control lines. The data-lines are 256-bit wide and dedicated to transmit the data between source and destination.

978-1-4577-1617-1/11/\$26.00 ©2011 IEEE 150 Authorized licensed use limited to: University of Central Florida. Downloaded on January 14,2024 at 18:24:29 UTC from IEEE Xplore. Restrictions apply. During data-transmission phase, appropriate circuit switches are closed to make continuous connection of data-lines between sender network interface and receiver network interface. After transferring complete data blocks from sender to receiver these switches are kept opened. Hence the circuit switches are only active in data-transmission phase. A 20-bit wide setup lines are used to setup the minimal path between source and destination nodes. During setup phase, packet switches passes setup flit from one interconnect to another interconnect. A 2-bit wide control lines are used to control the network interfaces in all three phases. The control-switch passes control information from one router to another router. The cross-bar switches which are used to realize circuit, packet and control switches.



Figure 1. General architecture of interconnect

The proposed parallel search algorithm sends probes along all minimum paths in parallel. At the source node, one or two probes are sent to neighboring switches. One probe is sent, if there is only one minimum path; two probes are sent if there are more than one. The probe contains the information of source address, destination address. The probes always traverse from source to destination in a minimal path such as forward-wavepropagation. At every switch, when a probe is received, one or two probes are forwarded. If two probes from the same setup procedure meet, one is canceled and the other continues. Each probe reserves a link as it moves from switch to switch and thus builds up incrementally a circuit connection. If a probe encounters a link, which is already allocated to another connection, the probe is canceled. Cancellation of a probe means, that a NACK (10) is sent back along the partially built up connection to release the reserved links. When a probe finally reaches the destination, the connection is successfully setup and an ACK is sent back to the source. The parallel search takes exactly M steps to setup the connection, if the distance between source and destination is M hops.

#### B. Control-signal

The router operates based on the two-bit control signal. Table 1 shows the status of the control signal.

| Control signal<br>value | Control signal Status     |
|-------------------------|---------------------------|
| 01                      | Setup flit                |
| 11                      | Acknowledgement           |
| 10                      | No-Acknowledgement (NACK) |
| 00                      | Not used                  |

Table 1 Status of Control signal

In setup phase, setup flit traverse along with control signal of '01'. On reception of this signal the router calculates the profitable output links to send setup flit. If the link resource is idle then it changes idle channel status into reserved status. Now the link is reserved by current setup flit, it does not allow other resources to use that link. If two set probes from the same setup procedure meet, one is canceled and the other continues. If a setup probe encounters a link, which is already allocated to another connection, the probe is canceled. Cancellation of a probe means, that a NACK signal (10) is sent back along the partially built up connection to release the reserved links. In the end-of setup phase, the destination node generates a control signal '11' to acknowledge the source that the set-up flit sent by source is reached the destination in a profitable minimal path. The acknowledgement control signal '11' also closes the circuit switches to transmit the data. Whenever the source receives the signal '11' from destination node it starts to send the data. In release phase, the destination node generates the control-signal '00' to release the link-resources booked by the current communication path.

#### C. Setup-Control Signal

Whenever the setup control signal arrives the router will perform the operation based on algorithm shown in Figure 2.

**Step-1:** The router compares the current tile address and destination address of a setup flit. If they match and there is no competition then it generates



Figure 2 Setup-phase algorithm

acknowledgement signal. If there is a competition among same producer setup probes then static priority is applied. The winner probe will send acknowledgement back to the sender and failure probe will send NACK signal back to the sourcenode. If there is a competition between two different producers then also static priority is applied to resolve conflict.

Step-2: Incase the addresses are not matched and there is no competition between other probes in X and Y direction then the setup probes are allowed to progress in preferred X and Y directions. If there is a competition on one or multiple directions by different producers setup probes then the winning probe is selected based on nearest destination address or static priority logic. If there is a competition between same sender setup probes then static priority is applied to resolve conflict.

An example which compares between exhaustive-sequential search and exhaustive-parallel search during setup phase is given below

Let us assume that in 8X8 Mesh NoC fabric. Source I wants to communicate with Destination-Resource J as shown in Figure 3. By nature of the sequential algorithm, it searches minimal X-direction to reach destination, if it fails then it searches minimal Y-direction to reach destination. Figure-3 shows that the setup probe has not found the minimal path between source and destination after 20-clock cycles. Therefore the probe continues to search the minimal path between resources I and J. Figure 4 shows even after 39 -clock cycles there is no minimal path is found by setup probe.



Figure-3 Setup probe search after 20-clock cycles

Figure-4 Setup probe search after 39-clock cycles





Figure-10 Setup probe finds minimal path between source and destination after 10-clock cycles using exhaustive parallel search method

Again the setup-probe continues to search minimal path between source-and destination. The setupprobe condition after 56-clock cycle is shown in Figure 5. The probe continues to search setup path between source and destination. The setup probe status after 71-clock cycles, 78-clock cycles and 83clock cycle are shown in Figure 6, Figure 7 and Figure 8 respectively. At the end of 88-clock cycle the setup probe finds the destination as shown in Figure 9. The exhaustive-parallel search method uses 10-clock cycles to setup path between source and destination as shown in Figure 10. The advantage of exhaustive-parallel-search is that it guarantees setup latency between source and destination.

#### **III. PERFORMANCE MEASURES**

The simulations are performed for a 2D network with a size of 8X8. The synthetic traffic generators are used to evaluate the performance of NoCs[8]. The synthetic traffic patterns are classified as uniform random traffic patterns and localized traffic patterns. Measuring setup latency and throughput are uniformly accepted metrics to evaluate the performance of the NoC[9][10]. There are three phases of operation of networks, warm-up phase, steady-state phase and drain phase. In warm-up phase the network is allowed to reach equilibrium. During steady-state phase the performance of the network is measured. Finally in drain phase the network is run long enough to allow all sent packets to reach their destinations.

The average setup-latency versus average lifetime of a data flits is shown in Figure 11. The average life time of a data flit is selected from 10ns to 1ms. The setup probes are injected based on uniform random-traffic. At lower average life time the performance of sequential search algorithm is better than parallel search algorithm because of huge amount of network activities present in parallel search due to very frequent termination of the connection. These activities degrade the performance of NOC. The performance of the exhaustive parallel search is better on longer average life time. The numbers of setup requests are reduced due to larger data-transmission time. The parallel search finds the minimal path at lowest latency. The data-latency and setup latencies are same. The maximum setup latency Vs average life time of a data flits are shown in Figure 12. The setup latency performance is shown in Figure 13. Average setup latency is measured for various injection rates (number of flits/cycle/input port) or offered load. Setup latency is measured in terms of probe clock-cycles. The characteristics are obtained for uniform-random traffic pattern and localized. The performance of localized traffic pattern is better than uniform-random traffic pattern because of 50% traffic is concentrated on internal nodes [10]. The network saturates at higher values of injection rate compared to uniform-random traffic pattern. Figure 14 shows the normalized accepted traffic Vs injection rates. The number of accepted traffic is normalized. The number of accepted traffic is better in localized traffic than uniform-random traffic. Figure 15 shows the performance of throughput (number of flits/cycle/input-port) Vs offered load. The throughput is normalized.







Figure 12 Maximum Setup latency Vs life time











Figure-15 Normalized throughput versus injection rate

#### **IV. CONCLUSION**

The concept and design of exhaustive-parallel search algorithm for Network-on-Chip has been presented. The proposed router frees the routing deadlock and livelock and also supports dynamic path setup scheme. The router achieves higher aggregate bandwidth. The NoC fabric is designed for 8X8 mesh architecture and its performance is evaluated.

#### REFERENCES

- Kangmin Lee, Se-Joong Lee and Hoi-Jun Yoo, "Low-power network-on-chip for high performance SoC Design", IEEE Transactions on Very Large Scale Integration Systems, Vol 14 No 2, pp. 148-160, Feb 2006.
- Pablo Abad, Valentin Puente, Jose' A'ngel Gregorio, "Reducing the Interconnection Network cost of chip multiprocessor", IEEE International symposium on network-on-chip, pp.183-192, 2008.
- Daniël Schinkel, Eisse Mensink, Eric A. M. Klumperink, Ed van Tuijl, nd Bram Nauta," Low-Power, High-Speed Transceivers for Network-on-Chip Communication", IEEE Transcations on Very Large Scale Integration Systems, Vol 17, No. 1, pp.12-21 Jan 2009.
- Federico Angiolini, Paolo Meloni, Salvatore M. Carta, Luigi Raffo, and Luca Benini, Layout-Aware Analysis of Networkson-Chip and Traditional Interconnects for MPSoCs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26 No.3, pp. 421-434, March 2007.
- D.Wiklund and L. Dake, SoCBUS: Switched network on chip for hard real time embedded systems, in Proc. Int. Parallel Distrib. Process. Symp., pp.8, 2003.
- P. T. Wolkotte, G. J. M. Smit, G. K. Rauwerda, and L. T. Smit, An energy-efficient reconfigurable circuit-switched network-on-chip, in Proc. IEEE Int. Parallel Distrib. ProcessSymp., pp. 155a, 2005.
- N. Truong, W. H. Cheng, T. Mohsenin, Y. Zhiyi, A. T. Jacobson G. Landge, M. J. Meeuwsen, C. Watnik, A. T. Tran, X. Zhibin, E. W. Work, J. W. Webb, P. V. Mejia, and B. M. Baas, A 167-processor computational platform in 65 nm CMOS, IEEE Journal of Solid-State Circuits, Vol.44, No. 4, pp. 1130–1144 April 2009.
- Arnab banerjee, Pascal T.Wolkotte, Robert D.Mullins, Simon.W and Gerard J.M.Smit, "An Energy and performance exploration of network-on-chip architectures", IEEE transcations on V.Large scale integration systems Vol.17, No.3, pp.319-329 March 2009.
- Partha Pratim Pande, cristian Grecu, Michael Jones, Andre' Ivanov and Resve Saleh, "Performance evaluation and design trade-offs for network-on-chip Interconnect Architectures", "IEEE transactions on computers, Vol.54, No.8, ,pp.1025-1040, Aug 2005.
- Salminen E. Lu, Z. Jantsch and C. Grecu. Network-on-Chip Benchmarking Specification, Part II: Micro-Benchmark Specification. In OCP-IP, Springer London, Limited, pp. 7–8, Feb2008.

155 Authorized licensed use limited to: University of Central Florida. Downloaded on January 14,2024 at 18:24:29 UTC from IEEE Xplore. Restrictions apply.