Graceful Deadlock-Free Fault-Tolerant Routing Algorithm
for 3D Network-on-Chip Architectures
Akram Ben Ahmed, Abderazek Ben Abdallah
The University of Aizu,
Graduate School of Computer Science and Engineering,
Adaptive Systems Laboratory,
Fukushima-ken, Aizu-Wakamatsu-shi 965-8580, Japan
E-mail: {d8141104, benab}@u-aizu.ac.jp
Abstract
Three-Dimensional Network-on-Chip (3D-NoC) has been presented as a promising
solution that merges the high parallelism of the Network-on-Chip (NoC) interconnect paradigm with
the high performance and lower interconnect power of three-dimensional integrated circuits.
However, 3D-NoC systems are exposed to a variety of manufacturing and design factors
making them vulnerable to different faults that cause corrupted message transfer or even
catastrophic system failures. Therefore, a 3D-NoC system should tolerate both transient
malfunctions and permanent physical damage.
In this paper, we present an efficient fault-tolerant routing algorithm, called Hybrid-Look-Ahead-Fault-Tolerant (HLAFT), which takes advantage of both local and look-ahead routing
to boost the performance of 3D-NoC systems while ensuring fault-tolerance. A deadlock-recovery technique associated with HLAFT, named Random-Access-Buffer (RAB), is also
presented. RAB takes advantage of look-ahead routing to detect and remove deadlock without
considerable additional hardware complexity. We implemented the proposed algorithm
Preprint version to be published in the Journal of Parallel and Distributed Computing (2014)
and deadlock-recovery technique on a real 3D-NoC architecture (3D-OASIS-NoC¹) and
prototyped it on FPGA. Evaluation results show that the proposed algorithm performs better
than XYZ, even when considering high fault-rates (i.e., ≥ 20%), and outperforms our
previously designed Look-Ahead-Fault-Tolerant routing (LAFT), with a latency/flit
reduction of up to 12.5% and a throughput enhancement of up to 11.8%, in addition
to a 7.2% dynamic-power saving thanks to the power-management module integrated with
HLAFT.
Keywords: 3D-NoC ; Architecture ; Parallel ; Adaptive ; Look-ahead routing ; Deadlock-free;
1 Introduction
During the past few decades, technology has enabled the aggressive scaling and continuous shrinkage
of transistor dimensions on modern microchips. This has made the integration of billions of transistors
on a single chip achievable [1]. As an example, the Intel Xeon processor [2] includes 2.3 billion
transistors. With such a high integration level available, the development of many cores on a single
die has become possible. For instance, the Tilera Tile64 [3] and Intel Polaris [4] contain 64 and
80 cores, respectively. As the number of cores keeps increasing, efficient and scalable interconnect
fabrics have become imperative in order to take advantage of these cores.
Traditional on-chip interconnect schemes, such as point-to-point links, shared buses, and
crossbars, can no longer provide the necessary communication among the
processor cores. Recently, Network-on-Chip (NoC) [5, 6] has been proposed as a potential solution
to the high interconnection demands arising in the nanometer era.
Based on a simple and scalable architecture platform, NoC connects processors, memories and
other custom designs together using switches to distribute packets on a hop-by-hop basis in order
to increase the bandwidth and performance and solve the interconnect bottleneck in traditional
1
This project is partially supported by Competitive research funding, Ref. P1-5, Fukushima, Japan.
bus-based systems. On the other hand, as the number of cores keeps increasing in hyper-core
systems, spreading hundreds of cores in a 2D plane may not satisfy the high requirements of
future data-intensive applications. This is mainly caused by the increasing diameter of the 2D-NoC,
which does not scale with such a large number of cores.
At the same time, three-dimensional integrated circuits (3D-ICs) [7] have attracted a lot of
attention as another potential solution to the interconnect bottleneck. Thanks to the reduced
average interconnect length, 3D-ICs can achieve higher performance and lower interconnect-power
consumption [8]. Moreover, the realization of mixed technology has
become possible [9]. Combining the NoC structure with the benefits of 3D integration offers
a promising 3D-NoC architecture. The combination of these two major trends provides a new
horizon for NoC designs to satisfy the high requirements of future large-scale applications.
As 3D-NoC architectures started to show their performance and energy-efficiency advantages over
2D-NoC systems, questions about whether they are reliable enough to sustain this performance
growth began to arise [10]. This is mainly due to challenges inherited from both 3D-ICs and NoCs.
On one side are the complex nature of 3D-IC fabrics and the continuing shrinkage of semiconductor components.
Furthermore, the significant heterogeneity of 3D chips, which are likely to mix logic layers with
memory layers and even more complex technologies, increases the probability of faults in a system
[11]. On the other side, the single-point-of-failure nature of NoCs raises a serious concern about their
reliability, as they are the sole communication medium. As a result, 3D-NoC systems are becoming
susceptible to a variety of faults caused by crosstalk, electromagnetic interference,
radiation, oxide breakdown, and so on [12]. A simple failure in a single transistor caused by one
of these factors may compromise the reliability of the entire system; the failure can manifest
as corrupted message delivery, unmet timing requirements, or even the collapse of the entire
system. Faults can occur in any component of a 3D-NoC system (e.g., links, routers, buffers,
crossbars) and their rate of occurrence depends on the design, technology, environment and
operation conditions.
From a time perspective, the duration of faults is very important, especially for real-time 3D-NoC
systems. Faults can be categorized into three main types [13]: 1) Transient faults: they occur
and remain in the system for a particular period of time before disappearing; 2) Intermittent
faults: they are transient faults which occur from time to time; 3) Permanent faults: they start
at a particular time and remain in the system until they are repaired. Consequently, reliable
3D-NoC systems should first detect the occurrence of every fault type and then
reconfigure the system resources to recover from these faults and guarantee the continued
correct functionality of the system.
Fault detection is a crucial phase in fault-tolerant 3D-NoC systems. It can be achieved
by relying on custom testing mechanisms or on other detection schemes based on codes. Codes
are the most widely used in NoC systems; they were proposed to detect and correct errors in specific
components of the system in the presence of a specific type of fault. For instance, Crosstalk
Avoidance Codes (CAC) [14] are used for transmission wires and they are considered more
efficient than the already existing methods to avoid crosstalk. For errors whose presence could
not be detected, Error Detection Codes (EDC) and Error Correcting Codes (ECC) [15] are used
to detect and correct these errors.
Detection and checking mechanisms in 3D-NoC systems are out of the scope of this paper;
we are mainly interested in avoiding the detected faults by using a light-weight adaptive routing
to prevent information loss or corruption. On the other hand, deadlock is one of the problems
that may come with adaptive routing. Deadlock occurs when packets in different buffers are
unable to progress because they depend on each other, forming a dependency cycle. Most of
the existing 3D-NoC systems use Virtual-Channels (VCs) [16] as a deadlock-avoidance technique,
while others employ Virtual-Output-Queues (VOQs) [17] or base their routing algorithms on a
turn-model that adds restrictions to the routing choices to avoid deadlock. These procedures ensure
the deadlock freedom at the expense of the hardware and implementation complexity or latency
penalty caused by the routing restrictions.
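The dependency-cycle condition described above can be illustrated with a small sketch. It assumes a hypothetical "wait-for" graph in which each buffer lists the buffers it is waiting on; the function name and the buffer labels are illustrative only and are not part of the proposed architecture.

```python
def has_dependency_cycle(waits_for):
    """Return True if the wait-for graph contains a cycle (a potential deadlock).

    waits_for maps a buffer name to the list of buffers it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    nodes = set(waits_for) | {n for deps in waits_for.values() for n in deps}
    colour = {node: WHITE for node in nodes}

    def visit(node):
        colour[node] = GREY
        for nxt in waits_for.get(node, ()):
            if colour[nxt] == GREY:  # back edge: we returned to the current path
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)


# Four buffers waiting on each other in a ring form a dependency cycle.
ring = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
# Removing one dependency breaks the cycle.
chain = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(has_dependency_cycle(ring), has_dependency_cycle(chain))  # True False
```

Removing a single edge from the ring is enough to break the cycle, which is the intuition behind recovery schemes that release one blocked request.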
Based on these facts, in this paper we propose a novel low-latency, high-throughput and
fault-tolerant routing algorithm, which we call Hybrid-Look-Ahead-Fault-Tolerant (HLAFT).
HLAFT takes advantage of the high throughput and low overhead of our previously implemented
routing algorithm, Look-Ahead-Fault-Tolerant (LAFT) [18, 19], and combines it with local
routing for better routing decisions while simultaneously guaranteeing fault tolerance with graceful
performance degradation. The proposed algorithm can detect whether the route calculated for the next
node leads to a blocked path and, if so, recalculate the route to save the unnecessary clock cycles
needed for nonminimal routing. To ensure deadlock-freedom, we employ a low-cost technique
called Random-Access-Buffer (RAB). RAB first detects the deadlock occurrence, then
drops the blocking request and looks for other requests that can be served, in order to free
some slots in the buffer and break the dependency cycle causing the deadlock. Both HLAFT and RAB
were implemented on the 3D-OASIS-NoC [20, 21] architecture, which is a natural extension of our
previous robust and flexible
2D-OASIS-NoC system [6, 23]. We prototyped the proposed system on FPGA and evaluated its
hardware complexity and performance under different fault-rates and traffic loads using Transpose
and Uniform synthetic workloads as well as Matrix-multiplication and JPEG-encoder applications.
The rest of the paper is organized as follows: In Section 2, we present some of the related
work on fault-tolerant routing algorithms used in the most popular 2D- and 3D-NoC systems. The
proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) and its main features
are explained in Section 3. In Section 4, we present the Random-Access-Buffer technique for
deadlock-recovery. Section 5 is dedicated to the evaluation methodology and results, and finally
we end the paper with the conclusion and future work in Section 6.
2 Related Work
Many works have been conducted so far to tackle fault-tolerance in NoC systems. They can
be classified according to the target system, the fault type, or the fault-handling mechanism (e.g.,
using routing algorithms or architectural solutions). The majority of the fault-tolerant solutions
were proposed for 2D-NoC systems. Some of them restrict the number of faults as a
requirement of their systems. For instance, a single link or single node failure tolerance
was presented in [24]. For n-dimensional meshes, Duato et al. [25] presented a routing algorithm
that can tolerate up to n-1 faults.
Another group of the proposed solutions removed the restriction on the number of faults and, instead,
limited the location of faults to a specific part of the network. They called these restricted
subnetworks fault-regions, which can be disabled, if necessary, to ensure fault-tolerance of the
system. These restrictions can concern the shape of the fault-regions [26], their locations
(excluding faults located at the edges of the network [27]), or the assumed absence of faults in
some specific parts of the router [28].
Some works were presented without adding any restriction to either the number of faults or
their location. For example, uLBDR [29] is a routing scheme for 2D mesh topology based on
Virtual-cut-through, which incurs a high complexity despite its great performance. The work in [30] is
based on stochastic approaches to tolerate transient and permanent faults.
Other works [31, 32] focused on the importance of adopting minimal routing algorithms to
reduce the congestion caused by the presence of faults. They proposed minimal routing schemes
that are able to adaptively route packets through the shortest paths, as long as a path exists.
The authors in [33] presented a comprehensive investigation of the implementation of 2D/3D
fault-tolerant routing algorithms. They discussed the different turn models, fault conditions, and
fault occurrence in both links and routers.
Some of the proposed fault-tolerant solutions for 3D-NoC systems primarily focused on
permanent faults occurring in Through-Silicon-Vias (TSVs). For example, Loi et al. [34]
addressed the TSV yield improvement by presenting a solution based on spare via insertion.
Pasricha et al. [35] introduced serialization to achieve also the same goal. On-chip sequential
elements protection techniques were also proposed in [36, 37] to deal with single event upsets.
The other existing works for 3D-NoC systems focused on link failure in general by adopting
fault-tolerant routing algorithms. For example, Rahmani et al. [38] presented a fault-tolerant
routing algorithm named AdaptiveZ, targeted at the Hybrid-3D-NoC architecture. Feng et al.
[41] proposed a low overhead deflection routing algorithm based on routing tables. However, the
deployment of routing tables is still costly in terms of hardware and suffers from poor scalability
due to the area required for the tables.
Nordbotten et al. [39] presented an adaptive and fault-tolerant routing for 3D meshes and tori
based on storing at each node a table for routing containing information about destination and the
list of intermediate nodes. A fault-tolerant routing algorithm for 3D torus was presented by Yamin
et al. [40]. With up to 30% faulty-node rate, the proposed algorithm can find a valid link between
two faulty nodes with a probability higher than 90%. The Planar Adaptive Routing (PAR) for large
scale 3D topologies was presented by Chien et al. [42] where they used three Virtual-Channels
(VCs) [16] for deadlock-recovery and used also some routing rules. In [43] a fault model based
on PAR, named 3D minimum-connected component (MCC), was presented. Later, Xiang et al.
[44] proposed a fault model, named planar network (PN), that needs to store much less safety
information at each fault-free node.
The problem with the solutions mentioned above is that they use VCs, which are costly in terms
of hardware and implementation complexity. This is caused by the arbitration needed to handle
the different requests coming from the multiple VCs at each input-port. Thus, they introduce a
very high hardware complexity and implementation cost, besides the energy overhead,
making them undesirable for the 3D-NoC implementation.
To avoid using VCs while ensuring deadlock-freedom, Pasricha et al. [45] extended a 2D
turn-model for partially adaptive routing to the third dimension. The proposed scheme combines
the 4N-FIRST and 4P-FIRST schemes into a lightweight 4NP-FIRST scheme. On the other hand,
this turn-model introduces some routing restrictions to prevent deadlock. These restrictions
lead to nonminimal route selection, where in some cases the packet may need many additional hops
to reach its destination.
A recent work by Akbari et al. [46] focused mainly on faulty links and proposed an algorithm
named AFRA. This algorithm routes packets through ZXY in the absence of link faults. When
faults are detected, flits are first sent to an escape node through X, and then they continue to
their destination through ZXY as in the no-fault case. The authors presented simplicity,
good performance and robustness as the main features of this algorithm. However, this solution
has two major drawbacks: first, AFRA tolerates faults only on vertical links, and horizontal links are not
taken into consideration; second, the network congestion status is not taken into consideration,
especially when more than one possible minimal path is available. In that case, a static minimal
path selection does not necessarily lead to low latency (i.e., a minimal path may be highly
congested while alternative ones are not).
Ebrahimi et al. [47] stated that AFRA requires a fault distribution mechanism to know
about the fault information on all vertical links along each row. They proposed a more reliable
routing algorithm named HamFA which does not require any fault distribution, additional fault
information, or any virtual channels. They modified the basic form of the Hamiltonian path in
order to be able to switch between high and low channel subnetworks. In fact, HamFA provides
much more reliability than AFRA; however, it requires that the faulty links should belong to the
same subnetwork in order to tolerate multiple vertical or horizontal one-faulty links. As a result,
this limits the reliability of this approach to 95% when considering the presence of a single faulty
link. In our approach, no restrictions are added to the fault locations and they can be anywhere in
the network as long as there exists a valid path between a given (source, destination) pair.
3 Hybrid-Look-Ahead-Fault-Tolerant Routing Algorithm
3.1 Look-Ahead-Fault-Tolerant routing (LAFT) drawbacks
To keep the benefits of look-ahead routing [22], the Look-Ahead-Fault-Tolerant routing algorithm
(LAFT) [18, 19] performs the routing decision for the next node, taking its link status into
consideration, and selects the best minimal path. The fault information is read by
each input-port, where LAFT handles the routing computation. The first phase of this algorithm
calculates the next-node address depending on the next-port identifier read from the flit. For a
given node wishing to send a flit to a given destination, there exist at most three possible minimal
directions, one along each of the X, Y, and Z dimensions. In the second phase, LAFT calculates
these three directions by concurrently comparing the x, y and z coordinates of both the current and
destination nodes. At the same time, as these directions are being computed, the fault-control
module reads the next-port identifier from the flit and sends the appropriate fault information to the
corresponding input-port. By the end of this second phase, LAFT has information about the next
node's fault status and also the three possible directions for minimal routing. In the next phase,
the routing selection is performed. For this decision, we adopted a set of prioritized conditions
to ensure fault-tolerance and high performance in both the presence and the absence of faults: 1)
the selected direction should ensure a minimal path, and this is given the highest priority in the
routing selection; 2) the direction with the largest next-hop path diversity should be selected; and
3) the congestion status is given the lowest priority.
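The second phase, in which the candidate minimal directions are derived by comparing coordinates, can be sketched as follows. This is only an illustration: the mapping of East/North/Up to increasing x/y/z and the function name are assumptions, not taken from the LAFT implementation.

```python
def minimal_directions(current, destination):
    """Return the minimal directions (at most three) from current to destination.

    Assumption: East/North/Up increase x/y/z and West/South/Down decrease them.
    """
    directions = []
    # One comparison per dimension; in hardware these would run concurrently.
    for cur, dst, positive, negative in zip(
        current, destination, ("East", "North", "Up"), ("West", "South", "Down")
    ):
        if dst > cur:
            directions.append(positive)
        elif dst < cur:
            directions.append(negative)
    return directions


print(minimal_directions((0, 0, 0), (2, 1, 1)))  # ['East', 'North', 'Up']
print(minimal_directions((1, 1, 1), (1, 0, 1)))  # ['South']
```

A dimension whose coordinates already match contributes no direction, which is why fewer than three candidates may be returned.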
To understand better how LAFT works, consider Fig. 1 (a). Assume that the current node
(labeled C) has received an incoming flit whose next-port identifier, calculated in the previous node,
indicates that the out-port for this flit is East (Red arrow). The next-node address is calculated
(labeled N). Three minimal directions are possible for routing: East, North or Up. The East
direction will not be selected since the link in this direction is faulty. Therefore, either North or
Up can be selected, as both are minimal and nonfaulty. In this case, the diversity priority [18]
is taken into consideration. If Up is selected, where the node in this direction is on one of the
network edges, the diversity value is equal to 2 (two possible minimal directions: East or North).
But if North is selected, the diversity value is equal to 3 (East, North or Up). Having the highest
diversity, the North out-port (Green arrow) is selected for the next node and it is embedded in the
flit to be used in the downstream node, allowing the routing calculation and switch allocation to be
performed in parallel. More details about the different steps of the algorithm can be found in
[18].
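The selection in this example can be reproduced with a short sketch. The coordinates below are hypothetical, chosen so that Up yields a diversity of 2 and North a diversity of 3 as in the text; modelling a fault as a (node, direction) pair and passing congestion as an optional per-direction score are likewise assumptions made for illustration.

```python
STEP = {"East": (1, 0, 0), "West": (-1, 0, 0), "North": (0, 1, 0),
        "South": (0, -1, 0), "Up": (0, 0, 1), "Down": (0, 0, -1)}
AXES = (("East", "West"), ("North", "South"), ("Up", "Down"))


def minimal_directions(node, dst):
    """Minimal directions from node to dst, one per differing coordinate."""
    return [pos if d > n else neg
            for n, d, (pos, neg) in zip(node, dst, AXES) if d != n]


def select_out_port(node, dst, faulty_links, congestion=None):
    """Pick a minimal, non-faulty direction: highest diversity, then least congested."""
    congestion = congestion or {}
    candidates = [d for d in minimal_directions(node, dst)
                  if (node, d) not in faulty_links]
    if not candidates:
        return None  # no valid minimal direction; nonminimal routing is not modelled

    def priority(direction):
        neighbour = tuple(a + b for a, b in zip(node, STEP[direction]))
        diversity = len(minimal_directions(neighbour, dst))
        return (-diversity, congestion.get(direction, 0))

    return min(candidates, key=priority)


# Node N with a faulty East link; Up has diversity 2, North has diversity 3.
N, DEST = (1, 0, 0), (3, 2, 1)
print(select_out_port(N, DEST, {(N, "East")}))  # North
```

Because the diversity count looks only one hop ahead and ignores faults further along the path, a choice such as North can still lead into a blocked region.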
Employing look-ahead routing reduces the router latency and improves the overall system
performance. However, because the available information is restricted to only one switch ahead, in
some cases LAFT may not select the best route. Observing the example in Fig. 1 (a), we can see
that the North route leads to a blocked path where all the possible minimal directions to the
destination are faulty. Therefore, nonminimal routing is required, costing several additional clock
cycles. When the flit arrives at the next node (labeled N), the out-port is already decided and sent to the
Switch-allocator due to the look-ahead routing constraints. This happens despite the fact that this
path is blocked and a better, blocking-free route exists (Up). Therefore, it is necessary to optimize
LAFT so that it can detect whether the route precalculated in the previous upstream node
would lead to a blocked path. To enable this, the benefits of look-ahead routing
should be combined with local routing for better routing selection.
3.2 Proposed Hybrid-Look-Ahead-Fault-Tolerant routing (HLAFT) algorithm
The proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) solves the
previously mentioned problems of LAFT. For every incoming flit, HLAFT performs a simple computation to
judge whether the precalculated Next-port identifier will lead to a blocked path. If
a nonminimal route might otherwise be required, HLAFT recomputes the route depending on the
fault status of the local and neighboring nodes. Algorithm 1 describes the proposed HLAFT routing.
At first, the fault-control module at each input-port reads the Next-port identifier and destination
address from the flit. Depending on the Next-port value, the next-node can be calculated. After
acquiring information about the next-node, the three possible minimal directions are computed
and checked to determine whether a valid minimal route leading to the destination exists. If
at least one valid minimal link leading to the destination exists, the Next-port identifier is sent to the
Switch-allocator and the look-ahead-routing-computation (LA-RC) modules at the same time. In
the opposite case (i.e., all three possible directions are faulty), a flag is issued, triggering the
local-routing-calculation module (Loc-RC) to recalculate the output-port by verifying the status of
the links of the current and neighboring nodes. The same three priorities explained in the previous
subsection (i.e., minimal valid, higher diversity, and less congested) are taken into consideration
while selecting the new out-port. The recalculated out-port issued from Loc-RC is sent to the
Switch-allocator to perform the arbitration, and also to the LA-RC module to compute the
Next-port for the next-node and continue the look-ahead routing cycle.
As illustrated in Fig. 1 (b), when the flit arrives at the node labeled C with a current out-port
North precalculated in the previous node, the fault-control module checks the fault status of the links
of the next-node in the North direction. During this check, all the possible minimal
paths are found faulty; thus, Loc-RC is triggered and computes Up as the new best route (since
East is faulty). Then, the newly calculated Up direction is sent to the LA-RC module, which issues
North as the Next-out-port.
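The check-and-recompute step described above can be sketched as follows. This is a simplified model under stated assumptions, not Algorithm 1 itself: coordinates are hypothetical, a fault is modelled as a (node, direction) pair, congestion is ignored, and the names is_blocked and hlaft_out_port are illustrative.

```python
STEP = {"East": (1, 0, 0), "West": (-1, 0, 0), "North": (0, 1, 0),
        "South": (0, -1, 0), "Up": (0, 0, 1), "Down": (0, 0, -1)}
AXES = (("East", "West"), ("North", "South"), ("Up", "Down"))


def minimal_directions(node, dst):
    return [pos if d > n else neg
            for n, d, (pos, neg) in zip(node, dst, AXES) if d != n]


def neighbour(node, direction):
    return tuple(a + b for a, b in zip(node, STEP[direction]))


def is_blocked(node, dst, faulty_links):
    """True if every minimal direction leaving node towards dst is faulty."""
    ahead = minimal_directions(node, dst)
    return bool(ahead) and all((node, d) in faulty_links for d in ahead)


def hlaft_out_port(current, precomputed_port, dst, faulty_links):
    """Return (out_port, local_rc_used) for a flit arriving at current."""
    # Fault-control check: keep the look-ahead result if the next node has a way out.
    if not is_blocked(neighbour(current, precomputed_port), dst, faulty_links):
        return precomputed_port, False
    # Flag raised: local routing recomputes from the current node's own links.
    candidates = [d for d in minimal_directions(current, dst)
                  if (current, d) not in faulty_links
                  and not is_blocked(neighbour(current, d), dst, faulty_links)]
    if not candidates:
        return None, True  # nonminimal routing would be needed (not modelled here)
    best = max(candidates,
               key=lambda d: len(minimal_directions(neighbour(current, d), dst)))
    return best, True


C, N, DEST = (1, 0, 0), (1, 1, 0), (3, 2, 1)
faults = {(C, "East"), (N, "East"), (N, "North"), (N, "Up")}
print(hlaft_out_port(C, "North", DEST, faults))  # ('Up', True)
```

With no faults present, the same call returns the precomputed port unchanged and local routing is bypassed.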
Figure 2 (a) depicts the LAFT pipeline stages, which take advantage of look-ahead routing
to parallelize the Next-Port-Calculation (NPC) and Switch-Arbitration (SA) stages. For the
proposed HLAFT algorithm, illustrated in Fig. 2 (b), we can observe first that a pipeline stage
is dedicated to the fault-control module (FTC), which issues the flag that decides whether local routing
is triggered. This simple computation is done in parallel with the Buffer Writing stage (BW) to further
reduce the computation latency in the router without creating any critical paths. If no
flag is issued, the Local-Routing-Calculation stage (Local-RC) is bypassed and the Next-Port-Calculation
and Switch-Arbitration stages are performed in parallel (NPC/SA). In the presence
of a flag, Local-RC is performed and its results are sent to the NPC/SA stage before the final
Crossbar Traversal stage (CT) handles the flit transfer to the next neighboring node.
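The two pipeline paths can be summarized with a tiny sketch. The stage labels follow the text, while the function name and the list representation are assumptions made purely for illustration.

```python
def hlaft_pipeline(flag_issued):
    """Return the sequence of pipeline stages traversed by a flit."""
    stages = ["BW+FTC"]            # fault control runs in parallel with buffer writing
    if flag_issued:
        stages.append("Local-RC")  # only entered when the fault-control flag is raised
    stages += ["NPC/SA", "CT"]     # next-port calculation in parallel with arbitration
    return stages


print(hlaft_pipeline(False))  # ['BW+FTC', 'NPC/SA', 'CT']
print(hlaft_pipeline(True))   # ['BW+FTC', 'Local-RC', 'NPC/SA', 'CT']
```

In this model the flagged path costs one extra stage, and only for flits whose precalculated route is found to be blocked.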
4 Random-Access-Buffer Scheme for Deadlock-Recovery
As is the case for every adaptive routing algorithm, the deadlock issue may arise. As we previously
mentioned in Section 1, most of the existing routing algorithms either use Virtual-channels (VCs)
or add restrictions to the routing selection to avoid deadlock. These solutions either suffer from
high implementation complexity or incur an additional delay due to the nonminimal approach. In
our case, we implemented a technique similar to VCs, but much simpler and less complex.
This technique, named Random-Access-Buffer (RAB), first detects the flit causing the
deadlock in the buffer, drops its request and then looks for another flit whose request can be
granted, in order to free some slots in the buffer and break the dependency.
Figure 3 shows an example of how RAB works. In each input-port, the RAB-controller (RAB-cntrl)
manages the detection of deadlock and handles the assignment of the wr-adr and rd-adr
addresses. The detection mechanism is based on a timer: if the request being processed is not
served after a given period of time, a flag is issued signalling the presence of a deadlock (Fig.
3 (a)). This is done by reading the sw-grnt signal received from the Switch-allocator. In this case,
the BC reads the head of the next packet in the buffer and checks whether the requested out-port is
different from the one previously flagged as blocked. When it finds a request whose channel
is free (Fig. 3 (b)), it sends a request to the Switch-allocator to be served. When the request is
granted, the flits of the granted packet are read from the buffer and the freed slots can be used to
host another incoming packet (Fig. 3 (c)). After new flits are written into the buffer, the blocked
packet is checked again (Fig. 3 (d)). The RAB-cntrl receives a grant for the requested direction
(North) and the packet is read from the buffer.
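The timer-based detection and the out-of-order read can be sketched with a small behavioural model. It is only an illustration under stated assumptions: packets are treated as single units rather than flit sequences, the Switch-allocator is reduced to a set of currently free out-ports, and the class and method names are hypothetical.

```python
class RandomAccessBuffer:
    """Behavioural sketch of an input buffer with timer-based deadlock recovery."""

    def __init__(self, timeout):
        self.timeout = timeout    # cycles to wait before flagging a deadlock
        self.packets = []         # (name, out_port) in arrival order
        self.waiting = 0          # cycles the head request has gone unserved
        self.blocked_port = None  # out-port flagged as blocked, if any

    def push(self, name, out_port):
        self.packets.append((name, out_port))

    def cycle(self, free_ports):
        """Advance one cycle; return the name of the packet sent, or None."""
        if not self.packets:
            return None
        _, head_port = self.packets[0]
        if head_port in free_ports:          # normal FIFO behaviour
            self.waiting = 0
            self.blocked_port = None
            return self.packets.pop(0)[0]
        self.waiting += 1
        if self.waiting < self.timeout:      # timer has not expired yet
            return None
        self.blocked_port = head_port        # deadlock flag raised
        # Look past the blocked head for a packet that requests another, free port.
        for index in range(1, len(self.packets)):
            name, port = self.packets[index]
            if port != self.blocked_port and port in free_ports:
                return self.packets.pop(index)[0]
        return None


rab = RandomAccessBuffer(timeout=3)
rab.push("P0", "North")
rab.push("P1", "East")
print([rab.cycle({"East"}) for _ in range(3)])  # [None, None, 'P1']
print(rab.cycle({"North"}))                     # P0
```

While the timer is running the buffer behaves like a plain FIFO; only after it expires is the packet behind the blocked head allowed to leave first.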
The timer's value is an important parameter that must be chosen carefully. This is
because the deadlock occurrence is strongly dependent on the application for any adaptive routing
algorithm. Moreover, the presence of faults in the system increases the probability of
deadlock; therefore, before selecting the value of the timer, we profiled each of the
applications used and analyzed its task graph. During this analysis we took
into consideration the fault-rate and its impact on congestion and deadlock occurrence.
In order to avoid any flit overwriting caused by mismanaging the wr-adr and rd-adr addresses,
we added a small status register (labeled SR in Figs. 3 and 4) which keeps information about the
flits being flagged (i.e., not served yet). When a flit is read from the buffer, the BC checks the
status register and makes sure not to write the incoming flit into a flagged slot. Observing Figs. 3
(1) and (2), we can see that status[0] and status[1], which keep information about the two
flits of the flagged packet, are updated to 1. As shown in Fig. 3 (3), when an incoming flit arrives at
the buffer, RAB-cntrl makes sure that it is not stored in one of these two flits' locations, but
in an empty one instead. When the request of a previously flagged flit is granted after a period
of time, the RAB-cntrl updates the status register to free the flagged
slot so it can be used by other incoming flits, as shown in Fig. 3 (4) (status[0] is updated to 0).
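The role of the status register when choosing a write address can be sketched as follows. This is a simplified illustration: the helper name is hypothetical, the register is modelled as a list of bits, and buffer-full handling is deliberately left out.

```python
def next_write_address(wr_adr, status):
    """Return the first address at or after wr_adr whose status bit is 0.

    The search wraps around the buffer; None means every slot is flagged.
    """
    depth = len(status)
    for offset in range(depth):
        candidate = (wr_adr + offset) % depth
        if status[candidate] == 0:  # slot is not holding a flagged flit
            return candidate
    return None


# Two flits of a flagged packet occupy slots 0 and 1, as in the example above.
status = [1, 1, 0, 0]
print(next_write_address(0, status))  # 2

status[0] = 0                         # the flagged flit in slot 0 has been served
print(next_write_address(0, status))  # 0
```

Clearing a status bit once the flagged flit is granted is what makes its slot available to incoming flits again.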
With this simple technique, performance and deadlock-freedom are ensured with only a small
overhead. Moreover, instead of managing many requests at the same time, as is the
case with VCs, which require additional complexity and delay for the arbitration, the RAB mechanism
handles one request at a time. Despite the delay penalty required by the timer to detect the
deadlock, this technique is still faster and simpler to implement than Virtual-channels.
Livelock freedom
As long as the chosen route is minimal, the livelock problem does not exist
either. However, it can be observed when a nonminimal direction is selected. For this reason,
some restrictions are added when selecting the nonminimal route [18]. The first restriction forbids
the flit from turning back in the direction it came from. The second one forbids selecting a
path in the opposite direction of the faulty link (i.e., if "East" is faulty then "West" should
not be selected). Adopting these restrictions guarantees the livelock freedom of HLAFT, and the
flits will continue to advance and search for a route until they find a valid link.
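The two restrictions can be expressed as a simple filter over the six directions of a 3D mesh router. The sketch below is illustrative only; the function name and the way the arrival direction and faulty links are passed in are assumptions.

```python
DIRECTIONS = ("East", "West", "North", "South", "Up", "Down")
OPPOSITE = {"East": "West", "West": "East", "North": "South",
            "South": "North", "Up": "Down", "Down": "Up"}


def nonminimal_candidates(came_from, faulty_directions):
    """Directions still allowed for a nonminimal hop.

    came_from is the direction of the neighbour the flit arrived from;
    faulty_directions holds the directions of the faulty links at this node.
    """
    forbidden = set(faulty_directions)  # a faulty link cannot be used at all
    forbidden.add(came_from)            # restriction 1: no turning back
    forbidden.update(OPPOSITE[d] for d in faulty_directions)  # restriction 2
    return [d for d in DIRECTIONS if d not in forbidden]


# Flit arrived from the South neighbour and the East link is faulty.
print(nonminimal_candidates("South", {"East"}))  # ['North', 'Up', 'Down']
```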
Power management
For the proposed system, we used a power management scheme to reduce
the dynamic power in the system. This scheme is based on clock-gating, which turns off the clock in
specific components that, during some periods of time, are in a waiting state and not
performing any computation. These components are switched off when they
are not necessary for the correct working of the system and are awakened when the system
requires their services.
We employed the power management scheme in three main components. First, when a router
detects that one of its links is faulty, it immediately switches off the clock of the entire input-port
associated with this faulty link. In this case, a significant amount of dynamic power can be
saved, since input-ports occupy the largest portion of the system power budget due to the presence
of buffers, which are considered the most power-hungry components in the entire router.
Second, we turn off the clock in the local-routing module that we added to recompute the incoming
route if necessary. Thus, the local-routing module is initially not fed by the clock. Whenever
the fault-control module signals the necessity to recompute the route, the local-routing module
is awakened and performs the necessary computation. Otherwise, it is disconnected from the
clock and the unnecessary dynamic power is spared. Finally, the last component is the RAB-manager
circuit (see Fig. 4). As we previously mentioned, the RAB deadlock-recovery technique is
triggered only in the presence of deadlock. When no blocking happens, the buffer is managed by
the conventional First-In-First-Out (FIFO) buffer manager, as shown in Fig. 4. For dynamic-power
saving, the RAB-manager is put to sleep in the absence of deadlock (deadlock-flag is equal to zero) and
works only when the timer (deadlock-flag) informs it that deadlock is occurring. In this
fashion, deadlock-recovery is ensured at low cost with no unnecessary power waste.
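The three gating conditions can be summarized as a small truth-function sketch. It assumes a seven-port router (six neighbour ports plus a local port) and hypothetical signal names; it models only when each clock is enabled, not the gating circuitry itself.

```python
PORTS = ("East", "West", "North", "South", "Up", "Down", "Local")


def clock_enables(faulty_links, recompute_flag, deadlock_flag):
    """Return which components receive the clock in the current cycle."""
    return {
        # Input-ports attached to a faulty link are switched off entirely.
        "input_ports": {port: port not in faulty_links for port in PORTS},
        # Local routing is clocked only when fault control requests a recomputation.
        "local_rc": bool(recompute_flag),
        # The RAB manager is clocked only while the deadlock flag is raised.
        "rab_manager": bool(deadlock_flag),
    }


enables = clock_enables({"East"}, recompute_flag=False, deadlock_flag=False)
print(enables["input_ports"]["East"], enables["local_rc"], enables["rab_manager"])
# False False False
```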
5 Evaluation
5.1 Evaluation methodology
Our proposed algorithm is implemented on the 3D-OASIS-NoC system [20, 21, 22], which was
designed in Verilog HDL. For the hardware synthesis we used the Altera Quartus II CAD tool, and
the PowerPlay Power Analyzer to extract both the static and dynamic power of the proposed system [48]. In our
evaluation, we conducted two experiments: in the first one, we evaluated the hardware complexity
and performance of 3D-OASIS-NoC and compared them to those of the baseline 2D-OASIS-NoC
system. In the second experiment, we evaluated the hardware complexity of the HLAFT router in
terms of area utilization, power consumption (static and dynamic), and speed. To evaluate the
performance of the proposed algorithm, we selected Matrix-multiplication [49] and JPEG-encoder
[50] as real benchmarks and also two traffic patterns: Transpose [51] and Uniform [52]. We chose
Matrix-multiplication because it is one of the most fundamental problems in computer science
and mathematics and forms the core of many important algorithms, such as those used in engineering and
image processing applications [49]. To evaluate the 3D-OASIS-NoC system’s performance with
Matrix-multiplication, we set the matrix size to 6×6. We also decided to calculate from 1 to
100 different matrices at the same time. This aims to increase the number of flits traveling through the
network at the same time and to show the impact of congestion on the performance of the proposed
system under different traffic loads.
The second application used for the evaluation is the JPEG-encoder [50], a well-known
application that is frequently used in research and includes several parallel tasks, which makes it
suitable for evaluating the performance of NoC systems. To increase the network size and
introduce more parallelism, we implemented four JPEG systems that run concurrently and whose
tasks share three memories where the input and output images are stored.
The Transpose traffic pattern is a communication method based on matrix transposition. Each
node sends messages to another node with the address of the reversed dimension index [51]. The
Transpose workload is often used to evaluate the NoC throughput and power consumption since
it creates a bottleneck due to the long communication distance between transmitter and
receiver pairs.
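As a rough illustration of this pattern, the sketch below assumes that "reversed dimension index" means reversing the coordinate tuple, so that node (x, y, z) sends to node (z, y, x). This interpretation, the helper names, and the requirement that the X and Z dimensions be equal are assumptions made here, not details taken from [51].

```python
def transpose_destination(src):
    """Destination of node (x, y, z): the coordinate tuple reversed."""
    x, y, z = src
    return (z, y, x)


def hop_count(src, dst):
    """Minimal number of hops between two nodes of a mesh."""
    return sum(abs(a - b) for a, b in zip(src, dst))


n = 4  # 4x4x4 mesh, as used for the traffic-pattern experiments
pairs = {
    (x, y, z): transpose_destination((x, y, z))
    for x in range(n) for y in range(n) for z in range(n)
}
longest = max(hop_count(s, d) for s, d in pairs.items())
print(longest)
```

Under this assumption, corner-to-corner pairs such as (0, y, 3) and (3, y, 0) are six hops apart in a 4×4×4 mesh, which is the kind of long-distance communication that stresses the network.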
The Uniform traffic pattern is a standard benchmark used in on-chip and off-chip network
routing studies which can be considered as the traffic model for well-balanced shared memory
computations [52]. Each node sends messages to other nodes with an equal probability (i.e.,
destination nodes are chosen randomly using a uniform probability distribution function).
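A minimal sketch of such a destination generator is given below. The function name, the exclusion of the source node from the candidate set, and the fixed seed are assumptions made for illustration.

```python
import random


def uniform_destination(src, nodes, rng):
    """Pick a destination uniformly at random among all nodes except src."""
    candidates = [node for node in nodes if node != src]
    return rng.choice(candidates)


# All nodes of a 4x4x4 mesh.
nodes = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]
rng = random.Random(0)  # fixed seed so the illustration is reproducible
dst = uniform_destination((0, 0, 0), nodes, rng)
print(dst)
```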
In our evaluation with the two traffic patterns, we set the network size to 4×4×4, where every
node acts as both a transmitter and a receiver. Each transmitter node injects from
10^2 to 10^5 flits into the network, while the receiver nodes verify the correctness of
the received flits.
Using these four benchmarks, we conducted two experiments: in the first one, we evaluated
the performance of 3D-OASIS-NoC when compared to the base 2D-architecture under different
routing algorithms (XY [53], XYZ [55], LA-XY [54], and LA-XYZ [22]). In the second
experiment, we evaluated the latency per flit, the throughput, and the dynamic-power of the
proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm under each of the aforementioned
applications. We observed the performance variation of the proposed system under different link
fault-rates: 0%, 1%, 5%, 10%, 15%, and 20%. The number of links in each system can be calculated
using the formula shown below [56]:
$$\#\mathrm{links} = N_1 N_2 (N_3 - 1) + N_1 N_3 (N_2 - 1) + N_2 N_3 (N_1 - 1) \qquad (1)$$
where $N_1$, $N_2$, and $N_3$ are the network’s X, Y, and Z dimensions, respectively. During the evaluation,
we divided the faults into two categories: half of the faults are permanent (present during
the whole simulation time) and the other half are transient (starting and ending at random times during the
simulation). In addition, as the fault-rate increases, we placed more faults on the flits’
paths to force nonminimal routing and observe the system behavior in a worst-case environment.
All the results obtained with HLAFT were compared with LAFT [18], LA-XYZ [22], and also
Dimension-Order-Routing XYZ [53]. Table 1 lists the configuration parameters used for
our evaluation.
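To make Eq. (1) and the fault split concrete, the following sketch evaluates the link count for a 4×4×4 mesh and divides the faulty links into permanent and transient halves. The function names and the rounding rule used to turn a fault-rate into a whole number of links are assumptions; the paper does not state how fractional counts were handled.

```python
def link_count(n1, n2, n3):
    """Number of links in an n1 x n2 x n3 mesh, following Eq. (1)."""
    return n1 * n2 * (n3 - 1) + n1 * n3 * (n2 - 1) + n2 * n3 * (n1 - 1)


def split_faults(total_links, fault_rate):
    """Return (permanent, transient) faulty-link counts.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    faulty = round(total_links * fault_rate)
    permanent = faulty // 2
    transient = faulty - permanent
    return permanent, transient


links = link_count(4, 4, 4)
print(links)                      # 144
print(split_faults(links, 0.10))  # (7, 7)
```

As a sanity check, a 2×2×2 mesh is a cube and the formula returns its 12 edges.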
5.2 3D vs. 2D NoC architectures comparison results
5.2.1 Hardware complexity evaluation
Table 2 summarizes the hardware complexity of both 2D- and 3D-OASIS-NoC architectures based
on XY, LA-XY (2D), XYZ, and LA-XYZ (3D). When we extend the 2D-architecture to the third
dimension (XY to XYZ or LA-XY to LA-XYZ), we observe an area penalty of about
37% on average. This can be explained by the two additional input-ports used by the inter-layer links
to connect adjacent layers, which require additional buffer resources and
a larger crossbar. We also evaluated the effect of look-ahead routing on both architectures.
Adopting look-ahead incurs 4.6% extra area on average, which can be
explained by the additional bits needed for the Next-port identifier. In terms of speed, the
LA-XYZ-based system is the slowest, which is caused by the additional hardware that we previously
explained. In terms of total power consumption, 3D-architectures show the best results,
with an average reduction of 5.2%. This slight reduction is the result of a tradeoff between
the additional static power caused by the additional hardware when moving to the third dimension
and the dynamic power reduction obtained thanks to the reduced number of hops in 3D-architectures, which
incurs less buffering, less arbitration, and less computation.
5.2.2 Communication latency evaluation
We evaluated the communication latency by calculating the latency per flit for each architecture
under each of the four applications. The latency per flit results are shown in Fig. 5. Extending the
system to the third dimension in the Matrix application (Fig. 5 (a)) reduces the latency/flit by 33% on
average thanks to the reduced number of hops. Furthermore, adopting look-ahead routing (LA-XY
and LA-XYZ) offers a 25% latency/flit reduction. This reduction is obtained thanks to the ability of
look-ahead routing to reduce the router delay and forward the flits to their destinations faster.
As a result, when compared to the XY-based system, the LA-XYZ-based one reduces the latency/flit by up
to 55%. Similarly, 3D-architectures reduce the latency/flit by 22.5%, 24.5%, and 26%
with the Transpose, Uniform, and JPEG applications, respectively, when compared to
2D-routing-based systems. On the other hand, employing look-ahead routing also offers 28%, 29%, and 17%
latency reduction with the Transpose, Uniform, and JPEG applications, respectively.
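The hop-count argument can be illustrated with a small calculation of the average minimal distance between distinct nodes in a 2D and a 3D mesh with the same number of nodes. The 8×8 size chosen for the 2D mesh is an assumption made only so that both meshes have 64 nodes; it is not the 2D configuration reported in the paper, and the function name is hypothetical.

```python
from itertools import product


def average_hops(dims):
    """Average minimal hop count over all ordered pairs of distinct nodes."""
    nodes = list(product(*(range(d) for d in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))
        for src in nodes
        for dst in nodes
        if src != dst
    )
    return total / (len(nodes) * (len(nodes) - 1))


hops_2d = average_hops((8, 8))     # 64-node 2D mesh (assumed size)
hops_3d = average_hops((4, 4, 4))  # 64-node 3D mesh
print(f"2D: {hops_2d:.2f} hops, 3D: {hops_3d:.2f} hops")
```

For these two sizes the averages are about 5.33 and 3.81 hops. This only models distance under all-to-all communication and ignores router delay and congestion, so it should not be read as a prediction of the measured latency figures.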
5.2.3 Throughput Evaluation
For the second 2D- and 3D-architecture evaluation, we calculated each system’s throughput.
Figures 6 (a), (b), (c), and (d) show the throughput results for each of the used applications.
The Transpose application (Fig. 6 (b)) shows the best throughput enhancement when comparing 3D-
to 2D-architectures, with a 45% improvement. This can be explained by the long-distance
communication omnipresent in this kind of traffic pattern. With this kind of communication,
3D-architectures take advantage of the reduced number of hops to forward flits to their destinations
faster than 2D-systems. When observing the effect of look-ahead routing, we noticed that
Matrix-multiplication (Fig. 6 (a)) showed the best performance, with a 32% enhancement when
comparing LA-XYZ to Dimension-Order-Routing XYZ. This improvement can be explained by
the large size of the network (3×6×6), which presents long distances between source and destination
nodes, where the router delay plays an important role in the throughput.
5.3 3D routing algorithms comparison results
The second part of the evaluation takes into consideration fault-tolerance and compares the
proposed Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT) with other 3D-routings
(XYZ, LA-XYZ, and LAFT). This comparison is made in terms of hardware complexity (area,
speed, and static-power), communication latency, throughput, dynamic-power, and reliability.
5.3.1 Hardware complexity evaluation
Table 3 presents the hardware complexity results of the 3D-OASIS-NoC router when employing
HLAFT, XYZ, LA-XYZ, and LAFT. To see the additional hardware required by the
Random-Access-Buffer (RAB) for deadlock freedom, we evaluated the proposed HLAFT router without
(HLAFT-based) and with RAB (HLAFT+RAB-based). When compared to XYZ and LAFT,
HLAFT presents an area increase of 23% and 34%, respectively. This area penalty
is a natural consequence of the additional Loc-routing module added to recompute the Next-port if needed,
and of the additional control signals added to perform the flag decision and ensure the correct
functionality of the proposed routing algorithm. Regarding the additional area required by
the adopted deadlock-recovery technique, RAB (HLAFT+RAB-based) requires less than
7.1% of extra hardware when compared to the HLAFT router with no deadlock-recovery mechanism
(HLAFT-based). In terms of static-power consumption, the addition of RAB does not
considerably affect the proposed algorithm’s power, whose static-power consumption increased
by 2.8% and 5.6% when compared to LAFT and XYZ, respectively. In terms of speed, the
XYZ-based router is the fastest of the compared routers thanks to its low area, while
HLAFT is 2.2% and 10.3% slower than the LAFT- and XYZ-based routers, respectively.
5.3.2 Communication latency evaluation
For the performance evaluation, we first calculated the latency/flit for XYZ and LA-XYZ with
a 0% fault-rate (since neither algorithm supports fault-tolerance), and for LAFT and HLAFT under
fault-rates varying between 0% and 20%, and we observed the performance variation. Figures 7
(a), (b), (c), and (d) show the average latency/flit results for each of the used applications. In the
absence of faults, LAFT and HLAFT have the same latency/flit. This means that the
Local-routing was not triggered and HLAFT has the same behavior as LAFT. In the Uniform application
(Fig. 7 (c)), the latency/flit reduction with HLAFT and LAFT can reach 48% and 33% when
compared to XYZ and LA-XYZ, respectively. This reduction is a result of the ability of both
algorithms to detect the congestion caused by such traffic and make the best routing decision
toward less congested channels. When we increased the fault-rate and placed the faulty links in some
critical locations, the previous LAFT algorithm failed and deadlock happened at 20% (Matrix and
Transpose) and starting from 10% (Uniform and JPEG) fault-rates, as illustrated in Figs. 7 (a), (b),
(c), and (d). In these figures, we can see that the latency tends to ∞, which means the failure of the
system. On the other hand, HLAFT manages to avoid deadlock and, moreover, its latency is still
smaller than that of XYZ even under a 20% fault-rate with the JPEG application (Fig. 7 (d)), while showing
a small overhead of 1.5% when compared to LA-XYZ in Transpose traffic. On average
and under a 20% fault-rate, the latency/flit of HLAFT is increased by 13.75% and 3.2% when
compared to LA-XYZ and XYZ, respectively. It is important to note that, despite their small
advantage over the proposed HLAFT algorithm, both LA-XYZ and XYZ are
static schemes that do not support fault-tolerance and can fail even in the presence of a
single faulty link.
When we re-examine the latency/flit results, we can see that even when LAFT succeeds in
avoiding deadlock, its latency is always equal to or higher than that of HLAFT. The advantage
of HLAFT over LAFT can reach 6.2% and 12.6% average latency/flit reduction under 5%
and 10% fault-rates, respectively, with the Matrix application. This performance improvement is
obtained thanks to the use of both local and look-ahead routing, which gives our system
the opportunity to recompute the route if it finds out that the already calculated one will lead to
a blocking path costing several clock cycles for nonminimal routing. In this fashion, HLAFT
optimizes the routing decision and minimizes the probability of finding a blocking path (that leads
to a nonminimal route) while keeping the routing minimal as long as one minimal path
exists. Thus, with the help of RAB, both fault-tolerance and deadlock-recovery are ensured with
no considerable performance degradation.
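The decision described in this paragraph can be sketched as follows. This is a simplified model written for illustration and is not the HLAFT hardware logic: the port names, the function names, the fixed port ordering, and the omission of mesh borders, congestion information, and the RAB mechanism are all assumptions.

```python
ALL_PORTS = ("east", "west", "north", "south", "up", "down")


def minimal_ports(cur, dst):
    """Output ports that move one hop closer to dst in a 3D mesh."""
    axis_ports = (("east", "west"), ("north", "south"), ("up", "down"))
    ports = []
    for axis, (positive, negative) in enumerate(axis_ports):
        if dst[axis] > cur[axis]:
            ports.append(positive)
        elif dst[axis] < cur[axis]:
            ports.append(negative)
    return ports


def hybrid_next_port(cur, dst, lookahead_port, faulty_ports):
    """Choose an output port from the look-ahead choice and known link faults."""
    # Keep the precomputed look-ahead decision while its link is healthy.
    if lookahead_port not in faulty_ports:
        return lookahead_port
    # Local recomputation: prefer another healthy port on a minimal path.
    for port in minimal_ports(cur, dst):
        if port not in faulty_ports:
            return port
    # No minimal path is left: fall back to a healthy nonminimal port.
    for port in ALL_PORTS:
        if port not in faulty_ports:
            return port
    return None  # every output link is faulty


print(hybrid_next_port((0, 0, 0), (2, 1, 0), "east", {"east"}))
```

In the example, the look-ahead choice "east" is faulty, so the sketch recomputes locally and selects "north", which is still on a minimal path toward the destination.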
5.3.3 Throughput Evaluation
In the second performance evaluation, we computed the throughput of each routing-based system,
as shown in Figs. 8 (a), (b), (c), and (d). Under a 0% fault-rate, the throughput enhancement
with LAFT and HLAFT can reach 53% and 23% when compared to XYZ and LA-XYZ,
respectively, with the Matrix application (Fig. 8 (a)). Under 5% and 10% fault-rates, HLAFT
keeps its advantage over XYZ for all applications, and even under a 20% fault-rate, HLAFT
still outperforms XYZ with the Matrix, Transpose, and JPEG applications, while showing a 12% throughput
degradation with the Uniform traffic pattern. When compared to LA-XYZ, HLAFT’s throughput is
the highest under 5% and 10% fault-rates for all applications except for Uniform (10%), where we
observed a slight 2.3% degradation. The final throughput comparison considered the performance
of HLAFT and LAFT. As we previously mentioned, LAFT fails to avoid deadlock when the
fault-rate exceeds 10%. For this reason, the throughput results of LAFT are not shown in these
figures and the throughput is considered equal to zero. However, HLAFT, with the help of the
RAB deadlock-recovery technique, manages to detect and eliminate the deadlock. Moreover,
even when LAFT succeeds in avoiding deadlock, its throughput is always equal to or lower than that of
HLAFT. The enhancement of HLAFT over LAFT can reach 5.4% and 11.8% under 5% and
10% fault-rates, respectively, with the Transpose application.
5.3.4 Dynamic-power evaluation
In the third part of our evaluation, we evaluated the dynamic-power of each routing-based system.
In addition, we wanted to see the effect of the employed Power-manager (PM) on dynamic-power
reduction; therefore, we evaluated the proposed HLAFT algorithm without (HLAFT) and with
(HLAFT+PM) PM. The average dynamic-power results are given in Table 4. The XYZ-based
system has the lowest dynamic-power consumption, followed by LA-XYZ with
a slight 2.3% difference. The low dynamic-power of these two deterministic routing-based systems
is natural due to their simplicity and the absence of fault-tolerance. LAFT
exhibits 7.2% and 5.1% higher average dynamic-power consumption than XYZ and LA-XYZ,
respectively, and this overhead grows as we increase the fault-rate, since more computation is
required than at low fault-rates. When comparing LAFT with HLAFT, we can observe that
the proposed algorithm, without employing any power management, presents the highest
dynamic-power consumption, with an average overhead of 5% that also increases linearly as we
increase the fault-rate. This can be explained by the power overhead caused by both Local-routing
(to recompute the route) and RAB (for deadlock-recovery), which are triggered more frequently
at high fault-rates (10% and 20%). When adopting the Power-management (PM) for HLAFT, the
dynamic-power reduction with the optimized system can reach 21.2% when compared to the
system with no PM. Furthermore, when no faults are detected in the system, HLAFT+PM has
almost the same dynamic-power consumption as LAFT since both the Local-routing module and
RAB are shut down. As we introduce faults, HLAFT+PM starts to gain an advantage over LAFT
that can reach 7.2% at a 20% fault-rate. At this high fault-rate, the computation performed
in HLAFT+PM is more important than that in LAFT, which causes the additional dynamic-power
that can be seen in HLAFT without PM. However, when integrating the PM module, the 20% of
input-ports connected to the faulty links are clock-gated, providing an important dynamic-power
saving. Despite the small area penalty, the proposed system supports fault-tolerance, provides
higher performance, and ensures deadlock-freedom. Moreover, all these benefits are obtained
with a minimal dynamic-power overhead, keeping the system fault-tolerant at low cost.
5.3.5 Reliability evaluation
In this subsection, we discuss the reliability of HLAFT. We define reliability as the capability of
the system to deliver all the packets to their destinations. If all packets are delivered except for
one, the system is considered as unreliable. We compared the results with both AFRA [46] and
HamFA [47] algorithms. The reliability results of these schemes are obtained from [47] where
Uniform traffic pattern was used and one, two, and three faulty links are considered.
HamFA could tolerate one, two, and three faulty links with 95%, 44%, and 20% reliability,
respectively, while AFRA reached 33%, 7%, and 3% for the same numbers of faulty
links [47]. For the proposed HLAFT, the system could deliver all the flits to their
destinations with no single dropped flit (100% reliability) when considering three faulty links.
Moreover, all the packets could reach their destinations when running all the used benchmarks,
even when considering a large number of faulty links that can sometimes reach 252 faulty links
(case of 20% fault rate in the Matrix-multiplication application).
We can explain this reliability by the fact that HLAFT always finds a path from source to
destination, no matter where the fault is located and how heavy the traffic is. The only possible
reason that can prevent the arrival of a given packet at its destination is the presence of deadlock.
Thanks to the adopted RAB mechanism, however, deadlock was avoided and, therefore, all packets could
reach their destinations, providing full reliability under different fault-rates and with different
injection-rates.
6 Conclusion
In this paper, we presented a novel low-latency, high-throughput, and fault-tolerant
3D-Network-on-Chip routing algorithm named Hybrid-Look-Ahead-Fault-Tolerant (HLAFT). HLAFT is a
minimal routing algorithm that combines the benefits of look-ahead routing and local-routing
for better routing choices. Furthermore, HLAFT adopts a low-cost technique, named
Random-Access-Buffer (RAB), for deadlock-recovery. We implemented the proposed algorithm on a
real 3D-NoC architecture (3D-OASIS-NoC), which has shown good performance, scalability, and
robustness. We prototyped HLAFT on an FPGA, evaluated its performance over real and large
benchmarks, and compared it with well-known 3D-NoC routing schemes.
We conducted two experiments. In the first one, we showed the benefits of adopting a
3D-architecture over a 2D one, illustrated by a 33% latency/flit reduction and a 45% throughput
enhancement. We also saw the advantage of look-ahead routing over local-routing,
reflected in an average 24.7% latency/flit reduction and a throughput enhancement that can
reach 33%.
In the second experiment, we demonstrated that the proposed HLAFT algorithm achieves a
lower latency/flit than Dimension-Order-Routing (XYZ) even at a 20% fault-rate.
Thanks to the adopted RAB deadlock-recovery technique, deadlock-freedom was ensured in
HLAFT, while the previous algorithm (LAFT) fails at a 10% fault-rate. HLAFT outperforms
LAFT even when deadlock does not occur in the latter. This advantage is reflected in
a latency/flit reduction that can reach 12.5% and a throughput enhancement reaching 11.8%.
Thanks to the adopted Power-management scheme (PM), the dynamic-power increase could be
controlled and an important saving could be obtained, reaching 7.2% when compared to the
earlier LAFT routing.
Adopting the proposed HLAFT routing introduced an area penalty of 23% and
34% extra ALUTs when compared to XYZ and LAFT, respectively, while the
adopted RAB deadlock-recovery technique cost only 7.1% extra hardware. In terms of
static-power consumption, a small 2.8% and 5.6% overhead is observed when compared to LAFT and
XYZ, respectively, and a 2.2% and 10.3% decrease in speed was also reported when compared to
the aforementioned algorithms, respectively.
As future work, an in-depth thermal power study should be conducted to observe how the
performance gain obtained with the proposed algorithm would affect this design requirement,
as it is crucial for 3D-Network-on-Chip architectures. Moreover, we want to consider the
occurrence of faults in other components of the system, such as input buffers and the crossbar. These
circuits consume a big portion of the system area, making them more vulnerable to transient and
permanent faults.
References
[1] Y. Xie, J. Cong, and S. Sapatnekar. Three-Dimensional Integrated Circuits Design.
Springer, 2010.
[2] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, R. Varada, M. Ratta, and S. Vora. A 45
nm 8-core enterprise Xeon processor. In Proc. IEEE Asian Solid-State Circuits Conf., Nov.
2009, pages 9-12.
[3] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L.
Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger,
N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 processor: A 64-core SoC with mesh interconnect. In Proc. of the IEEE International Solid-State Circuits
Conference. Digest of Technical Papers, Feb. 2008, pages 88-598.
[4] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T.
Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile
sub-100-w teraFLOPS processor in 65-nm CMOS. IEEE Journal on Solid-State Circuits,
43(1):29-41, Jan. 2008.
[5] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks, Morgan
Kaufmann, 2004.
[6] A. Ben Abdallah and M. Sowa. Basic Network-on-Chip Interconnection for Future
Gigascale MCSoCs Applications: Communication and Computation Orthogonalization.
Proceedings of The TJASSST2006 Symposium on Science, December 2006.
[7] G. Philip, B. Christopher, and P. Ramm. Handbook of 3D Integration: Technology and
Applications of 3D Integrated Circuits. Wiley-VCH, 2008.
[8] A. W. Topol, J. D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar,
G. U. Singco, A. M. Young, K. W. Guarini and M. Ieong. Three-dimensional Integrated
Circuits. IBM Journal of Research and Development, 50(4/5):491-506, July 2006.
[9] G. Sun, X. Dong, Y. Xie, J. Li and Y. Chen. A Novel 3D Stacked MRAM Cache
Architecture for CMPs. IEEE 15th International Symposium High Performance Computer
Architecture, pages 239-249, February 2009.
[10] L. Benini and G. De Micheli. Networks on Chips: Technology and Tools. Morgan
Kaufmann, 2006.
[11] I. Loi, F. Angiolini, S. Fujita, S. Mitra, L. Benini. Characterization and Implementation
of Fault-Tolerant Vertical Links for 3-D Networks-on-Chip. IEEE Trans. On CAD of
Integrated Circuits and Systems, 30(1):124-134, Jan. 2011.
[12] A. DeOrio, D. Fick, V. Bertacco, D. Sylvester, D. Blaauw, J. Hu, and G. Chen. A Reliable
Routing Architecture and Algorithm for NoCs. IEEE Trans. on CAD of Integrated Circuits
and Systems, 31(5):726-739, May 2012.
[13] A. Burns and A. Wellings. Real-Time Systems and Programming Languages Ada Real-Time
Java and C/Real-Time Posix. Addison Wesley, 2009.
[14] P. P. Pande, H. Zhu, A. Ganguly, and C. Grecu. Energy reduction through crosstalk
avoidance coding in NoC paradigm. In Proc. of the 9th EUROMICRO Conference on
Digital System Design, pages 689-695, 2006. IEEE Computer Society.
[15] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications.
Englewood Cliffs, NJ: Prentice-Hall, 1983.
[16] W. J. Dally. Virtual-channel flow control, IEEE Trans. on Parallel and Distributed Systems,
3(2):194-205, March 1992.
[17] Y. Tamir and G. L. Frazier. High-performance multiqueue buffers for VLSI communication
switches. 15th Annual International Symposium on Computer Architecture, pages 343-354,
May-June 1988.
[18] A. Ben Ahmed and A. Ben Abdallah. Architecture and Design of High-throughput, Low-latency, and Fault-Tolerant Routing Algorithm for 3D-Network-on-Chip (3D-NoC). The
Journal of Supercomputing, 66(3):1507-1532, December 2013.
[19] A. Ben Ahmed and A. Ben Abdallah. Fault-tolerant Routing Algorithm with Deadlock
Recovery Support for 3D-NoC Architectures. The 7th IEEE International Symposium on
Embedded Multicore SoCs, pages 67-72, September 2013.
[20] A. Ben Ahmed, A. Ben Abdallah, and K. Kuroda. Architecture and Design of Efficient
3D Network-on-Chip (3D NoC) for Custom Multicore SoC. IEEE Proceedings of the
5th International Conference on Broadband, Wireless Computing, Communication and
Applications, pages 67-73, November 2010.
[21] A. Ben Ahmed and A. Ben Abdallah. LA-XYZ: Low Latency, High Throughput Look-Ahead Routing Algorithm for 3D Network-on-Chip (3D-NoC) Architecture. The 6th IEEE
International Symposium on Embedded Multicore SoCs, pages 167-174, September 2012.
[22] A. Ben Ahmed and A. Ben Abdallah. Low-overhead Routing Algorithm for 3D Network-on-Chip. IEEE Proceedings of The Third International Conference on Networking and
Computing, pages 23-32, December 2012.
[23] K. Mori, A. Esch, A. Ben Abdallah and K. Kuroda. Advanced Design Issues for OASIS
Network-on-Chip Architecture. IEEE Proceedings of the 5th International Conference
on Broadband, Wireless Computing, Communication and Applications, pages 74-79,
November 2010.
[24] W. J. Dally, L. R. Dennison, D. Harris, K. Kan, and T. Xanthopoulos. The reliable router: A
reliable and high-performance communication substrate for parallel computers. In Proc. of
the First International Workshop on Parallel Computer Routing and Communication, pages
241-255, 1994.
[25] J. Duato, A theory of fault-tolerant routing in wormhole networks. IEEE Trans. Parallel
Distributed Systems, 8(8):790-802, August 1997.
[26] Z. Zhang, A. Greiner, and S. Taktak. A reconfigurable routing algorithm for fault-tolerant 2-D-mesh network-on-chip. In Proc. of the 45th ACM/IEEE Conference on Design
Automation Conference, pages 441-446, June 2008.
[27] P.-H. Sui and S.-D. Wang. Fault-tolerant wormhole routing algorithms for mesh networks.
IEEE Trans. on Computers, 44(7):848-864, January 2000.
[28] C. Liu, L. Zhang, Y. Han, and X. Li. A resilient on-chip router design through data path
salvaging. In Proc. of the 16th Asia and South Pacific Design Automation Conference, pages
437-442, January 2011.
[29] S. Rodrigo, J. Flich, A. Roca, S. Medardoni, D. Bertozzi, J. Camacho, F. Silla, and J.
Duato. Addressing manufacturing challenges with cost-efficient fault tolerant routing. 4th
ACM/IEEE International Symposium on Networks-on-Chip, pages 25-32, May 2010.
[30] P. Bogdan, T. Dumitras, and R. Marculescu. Stochastic communication: A new paradigm
for fault-tolerant networks-on-chip. VLSI Design, 2007(95348):17, 2007.
[31] M. Ebrahimi, M. Daneshtalab, J. Plosila, and H. Tenhunen. MAFA: Adaptive Fault-Tolerant
Routing Algorithm for Networks-on-Chip. In Proc. of 15th Euromicro Conference on
Digital System Design, pages 201-207, September 2012.
[32] M. Ebrahimi, M. Daneshtalab, J. Plosila, and F. Mehdipour, MD: Minimal path-based Fault-Tolerant Routing in On-Chip Networks. In Proc. of 18th Asia and South Pacific Design
Automation Conference, pages 35-40, January 2013.
[33] M. Ebrahimi. Reliable and Adaptive Routing Algorithms for 2D and 3D Networks-on-Chip.
Routing Algorithms in Networks-on-Chip, Springer, 2014.
[34] I. Loi, S. Mitra, T. H. Lee, Sh. Fujita, and L. Benini. A low-overhead fault tolerance scheme
for TSV-based 3D network on chip links. In Proc. of the 2008 IEEE/ACM International
Conference on Computer-Aided Design, pages 598-602, 2008.
[35] S. Pasricha. Exploring serial vertical interconnects for 3D ICs. In Proc. of the 46th
ACM/IEEE Design Automation Conference, pages 581-586, July 2009.
[36] M. Nicolaidis. Design for soft error mitigation. IEEE Trans. on Device and Materials
Reliability, 5(3):405-418, September 2005.
[37] F. L. Kastensmidt, L. Carro, and R. Reis. Fault-Tolerance Techniques for SRAM-Based
FPGAs. Frontiers in Electronic Testing, Springer-Verlag, 2006.
[38] A. -M. Rahmani, K. R. Vaddina, K. Latif, P. Liljeberg, J. Plosila and H. Tenhunen. Design
and Management of High-performance, Reliable and Thermal-aware 3D Networks-on-Chip. IET Circuits, Devices & Systems, 6(5):308-321, September 2012.
[39] N.A. Nordbotten, M.E. Gómez, J. Flich, P. Lopez, A. Robles, T. Skeie, O. Lysne, and
J. Duato. A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate
Nodes. In Proc. of IFIP international conference on Network and Parallel Computing,
pages 341-356, October 2004.
[40] Y. Li, Sh. Peng, and W. Chu. Adaptive box-based efficient fault-tolerant routing in 3D torus.
In Proc. of the 11th International Conference Parallel and Distributed Systems, pages 71-77, July 2005.
[41] Ch. Feng, M. Zhang, J. Li, J. Jiang, Z. Lu and A. Jantsch. A Low-Overhead Fault-Aware
Deflection Routing Algorithm for 3D Network-on-Chip. IEEE Computer Society Annual
Symposium on VLSI, pages 19-24, July 2011.
[42] A. A. Chien and J. H. Kim. Planar-adaptive Routing: Low-cost Adaptive Networks for
Multiprocessors. The 19th Annual International Symposium on Computer Architecture,
pages 268-277, 1992.
[43] Z. Jiang, J. Wu and D. Wang. A New Fault Information Model for Fault-Tolerant Adaptive
and Minimal Routing in 3-D Meshes. IEEE Transactions on Reliability, 57(1):149-162,
March 2008.
[44] D. Xiang, Y. Zhang and Y. Pan. Practical Deadlock-Free Fault-Tolerant Routing Based on
the Planar Network Fault Model. IEEE Transactions on Computers, 58(5):620-633, May
2009.
[45] S. Pasricha and Y. Zou. A Low Overhead Fault Tolerant Routing Scheme for 3D Networks-on-Chip. The 12th International Symposium on Quality Electronic Design, pages 1-8,
March 2011.
[46] S. Akbari, A. Shafieey, M. Fathy and R. Berangi. AFRA: A Low Cost High Performance
Reliable Routing for 3D Mesh NoCs. Design, Automation & Test in Europe Conference &
Exhibition, pages 332-337, March 2012.
[47] M. Ebrahimi, M. Daneshtalab, and J. Plosila. Fault-Tolerant Routing Algorithm for 3D
NoC Using Hamiltonian Path Strategy. In Proc. of Design, Automation & Test in Europe
Conference & Exhibition (DATE), pages 1601-1604, March 2013.
[48] http://www.altera.com
[49] P. Chan, K. Dai, D. Wu, J. Rao and X. Zou. The Parallel Algorithm Implementation of
Matrix Multiplication Based on ESCA. IEEE ASIA Pacific Conference on Circuits and
Systems, pages 1091-1094, December 2010.
[50] Y. L. Lee, J. W. Yang, and J. M. Jou. Design of a distributed JPEG encoder on a scalable
NoC platform. IEEE International Symposium VLSI-DAT, pages 132-135, April 2008.
[51] A. A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for
Multiprocessors. Journal of the ACM, 42(1):91-123, January 1995.
[52] R. Sivaram. Queuing delays for uniform and nonuniform traffic patterns in a MIN. ACM
SIGSIM Simulation Digest, 22(1):17-27, 1990.
[53] H. Sullivan and T. R. Bashkow. Large Scale, Homogeneous, Fully Distributed Parallel
Machine. Annual Symposium on Computer Architecture, ACM Press, pages 105-117,
March 1977.
[54] A. Ben Ahmed, K. Mori, and A. Ben Abdallah. ONoC-SPL: Customized Network-on-Chip
(NoC) Architecture and Prototyping for Data-intensive Computation Applications. IEEE
Proc. of the 4th International Conference on Awareness Science and Technology, pages
257-262, August 2012.
[55] C. H. Chao, K. Y. Jheng, H. Y. Wang, J. C. Wu and A. -Y. Wu. Traffic and Thermal-aware
Run-time Thermal Management Scheme for 3D NoC Systems. In Proc. of the ACM/IEEE
International Symposium on Networks-on-Chip (NoCS), pages 223-230, May 2010.
[56] B. Feero and P. P. Pande. Performance Evaluation for Three-Dimensional Networks-on-Chip. Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages
305-310, May 2007.
Figure 1: Example of fault-tolerant routing: (a) Look-ahead routing (LAFT) (b) Hybrid routing
(HLAFT).
Figure 2: Router pipeline stages: (a) Look-Ahead-Fault-Tolerant [18] (b) Hybrid-Look-Ahead-Fault-Tolerant.
Figure 3: Example of deadlock-recovery with Random-Access-Buffer.
Figure 4: Input buffer block diagram.
Figure 5: Latency per flit comparison results between 2-D and 3-D NoC architectures with: (a)
Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.
Algorithm 1: Hybrid-Look-Ahead-Fault-Tolerant routing algorithm (HLAFT)
// Destination address
Input: Xdest, Ydest, Zdest
// Current node address
Input: Xcur, Ycur, Zcur
// Next-port identifier
Input: Next-port
// Link status information
Input: Fault-in
// New-next-port for next node
Output: New-next-port
// Calculate the next-node address
Next ← Next-node (Xcur, Ycur, Zcur, Next-port);
// Read fault information for the next-node
Next-fault ← Next-status (Fault-in, Next-port);
// Calculate the three possible directions for the next-node
Next-dir ← poss-dir (Xdest, Ydest, Zdest, Nextx, Nexty, Nextz);
// Evaluate the flag
if (|Next-dir| == 0) then
    flag ← 1;
else
    flag ← 0;
end
// Evaluate the New-next-port
if (flag == 1) then
    // Trigger local-routing
    out-port ← local-routing (Fault-in, Xcur, Ycur, Zcur, Xdest, Ydest, Zdest);
    // Execute LAFT with the newly recomputed out-port
    New-next-port ← LAFT-routing (Xcur, Ycur, Zcur, Xdest, Ydest, Zdest, out-port, Fault-in);
else
    // Execute LAFT with the input Next-port identifier
    New-next-port ← LAFT-routing (Xcur, Ycur, Zcur, Xdest, Ydest, Zdest, Next-port, Fault-in);
end
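The control flow of Algorithm 1 can be illustrated with a small software model. The Python sketch below is only an illustration under several assumptions that are not taken from the listing: port names, the representation of Fault-in as a map from node address to its set of faulty output ports, and the bodies of poss_dir, local_routing and laft_routing are all hypothetical stand-ins. In particular, the listing computes Next-fault without showing where it is consumed, so the sketch assumes that poss-dir discards faulty directions, and it returns the out-port together with New-next-port for readability.

```python
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int, int]
FaultMap = Dict[Coord, Set[str]]  # assumed encoding: node -> faulty output ports

# Assumed port naming for a 3D mesh (not taken from the paper).
OFFSETS: Dict[str, Coord] = {
    "east": (1, 0, 0), "west": (-1, 0, 0),
    "north": (0, 1, 0), "south": (0, -1, 0),
    "up": (0, 0, 1), "down": (0, 0, -1),
}


def next_node(cur: Coord, port: str) -> Coord:
    """Address of the neighbour reached through `port` (Next-node)."""
    dx, dy, dz = OFFSETS[port]
    return (cur[0] + dx, cur[1] + dy, cur[2] + dz)


def poss_dir(dest: Coord, node: Coord, faulty: Set[str]) -> List[str]:
    """Healthy minimal ports of `node` toward `dest`, at most three (assumed poss-dir)."""
    if node == dest:
        return ["local"]  # ejection port, assumed never faulty
    pairs = [("east", "west"), ("north", "south"), ("up", "down")]
    ports = []
    for axis, (pos, neg) in enumerate(pairs):
        if dest[axis] != node[axis]:
            ports.append(pos if dest[axis] > node[axis] else neg)
    return [p for p in ports if p not in faulty]


def local_routing(fault_in: FaultMap, cur: Coord, dest: Coord) -> str:
    """Stand-in for local-routing: pick an out-port whose neighbour is not a dead end."""
    for port in poss_dir(dest, cur, fault_in.get(cur, set())):
        nxt = next_node(cur, port)
        if poss_dir(dest, nxt, fault_in.get(nxt, set())):
            return port
    raise RuntimeError("no minimal fault-free out-port (non-minimal detours not modelled)")


def laft_routing(cur: Coord, dest: Coord, out_port: str, fault_in: FaultMap) -> str:
    """Stand-in for LAFT-routing: choose the port the next node will use."""
    nxt = next_node(cur, out_port)
    candidates = poss_dir(dest, nxt, fault_in.get(nxt, set()))
    if not candidates:
        raise RuntimeError("next node has no healthy minimal port")
    return candidates[0]  # the real LAFT selection policy is not modelled here


def hlaft(cur: Coord, dest: Coord, next_port: str, fault_in: FaultMap) -> Tuple[str, str]:
    """Follows the flag logic of Algorithm 1; returns (out-port, New-next-port)."""
    nxt = next_node(cur, next_port)
    next_fault = fault_in.get(nxt, set())      # Next-status
    next_dir = poss_dir(dest, nxt, next_fault)
    flag = len(next_dir) == 0                  # no usable direction at the next node
    out_port = local_routing(fault_in, cur, dest) if flag else next_port
    return out_port, laft_routing(cur, dest, out_port, fault_in)
```

For instance, with the current node at (0, 0, 0), the destination at (1, 1, 0) and a pre-computed Next-port of east, marking the north link of node (1, 0, 0) as faulty leaves that node without a usable direction, so the flag is raised and the sketch recomputes the out-port locally before running the look-ahead step.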
Table 1: Simulation configuration.
Parameters / System                       XYZ-based       LA-XYZ-based    LAFT-based      HLAFT-based
Network Size (Mesh): JPEG                 3x3x3           3x3x3           3x3x3           3x3x3
Network Size (Mesh): Matrix               3x6x6           3x6x6           3x6x6           3x6x6
Network Size (Mesh): Transpose & Uniform  4x4x4           4x4x4           4x4x4           4x4x4
Flit size: JPEG                           27 bits         30 bits         30 bits         30 bits
Flit size: Matrix                         31 bits         34 bits         34 bits         34 bits
Flit size: Trans. & Unif.                 34 flit         34 flit         34 flit         34 flit
Header size: JPEG                         10 bits         13 bits         13 bits         13 bits
Header size: Matrix                       10 bits         13 bits         13 bits         13 bits
Header size: Transpose                    16 bits         13 bits         13 bits         13 bits
Payload size: JPEG                        16 bits         16 bits         16 bits         16 bits
Payload size: Matrix                      21 bits         21 bits         21 bits         21 bits
Payload size: Transpose                   21 bits         21 bits         21 bits         21 bits
Buffer Depth                              4               4               4               4
Switching                                 Wormhole-like   Wormhole-like   Wormhole-like   Wormhole-like
Flow control                              Stall-Go        Stall-Go        Stall-Go        Stall-Go
Scheduling                                Matrix-Arbiter  Matrix-Arbiter  Matrix-Arbiter  Matrix-Arbiter
Routing                                   XYZ             LA-XYZ          LAFT            HLAFT
Target FPGA device                        Stratix III     Stratix III     Stratix III     Stratix III
Table 2: Router hardware complexity evaluation comparison results between 2D and 3D NoC
architectures.
System / Parameter    Area (ALUTs)    Total Power (mW)    Speed (MHz)
2D (XY-based)         78201           1399.08             125.32
2D (LA-XY-based)      90576           1211.06             109.22
3D (XYZ-based)        138578          1210.77             107.34
3D (LA-XYZ-based)     145287          1121.89             100.89
Figure 6: Throughput comparison results between 2-D and 3-D NoC architectures with: (a)
Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.
Table 3: Router hardware complexity comparison results between 3D-NoC systems.
System / Parameter    Area    Static Power (mW)    Speed (MHz)
XYZ-based             2809    1258.01              194.61
LA-XYZ-based          3134    1269.76              194.09
LAFT-based            3272    1296.92              178.51
HLAFT-based           4277    1334.5               174.68
HLAFT+RAB-based       4580    1336.4               174.42
Figure 7: Latency per flit comparison results between XYZ, LA-XYZ, LAFT, and HLAFT based
3D-NoC systems with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.
Figure 8: Throughput comparison results between XYZ, LA-XYZ, LAFT, and HLAFT based
3D-NoC systems with: (a) Transpose; (b) Uniform; (c) 6 × 6 Matrix; (d) JPEG.
Table 4: Average dynamic-power evaluation under different fault-rates.
System / Fault-rate    0%        5%        10%       20%
XYZ-based              165.2     165.2     165.2     165.2
LA-XYZ-based           169.9     169.9     169.9     169.9
LAFT-based             170.09    172.92    178.51    194.61
HLAFT-based            173.7     174.61    183.68    231.43
HLAFT+PM-based         170.11    170.2     174.42    180.02
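To make the trend in Table 4 easier to read, the following Python sketch computes, for each system, the relative increase in dynamic power with respect to its own fault-free (0%) value. The numbers are copied from Table 4 as reported; the dictionary layout and the function name are illustrative choices, not part of the paper, and no unit is assumed because the metric is a ratio.

```python
from typing import Dict, List

FAULT_RATES = [0, 5, 10, 20]  # percent, the columns of Table 4

# Dynamic-power values copied from Table 4 (units as reported there).
TABLE4: Dict[str, List[float]] = {
    "XYZ": [165.2, 165.2, 165.2, 165.2],
    "LA-XYZ": [169.9, 169.9, 169.9, 169.9],
    "LAFT": [170.09, 172.92, 178.51, 194.61],
    "HLAFT": [173.7, 174.61, 183.68, 231.43],
    "HLAFT+PM": [170.11, 170.2, 174.42, 180.02],
}


def increase_over_fault_free(values: List[float]) -> List[float]:
    """Percentage increase of each entry relative to the 0% fault-rate entry."""
    base = values[0]
    return [round(100.0 * (v - base) / base, 2) for v in values]


increase = {name: increase_over_fault_free(vals) for name, vals in TABLE4.items()}

if __name__ == "__main__":
    for name, row in increase.items():
        cells = ", ".join(f"{rate}%: +{pct}%" for rate, pct in zip(FAULT_RATES, row))
        print(f"{name:10s} {cells}")
```

Read this way, the reported values give an increase of roughly 33% for HLAFT at a 20% fault-rate, against roughly 6% for HLAFT+PM and roughly 14% for LAFT, while the two non-fault-tolerant systems stay flat.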