A Design Workflow for Dynamically Reconfigurable Multi-FPGA Systems

Alessandro Panella∗, Marco D. Santambrogio‡†, Francesco Redaelli†, Fabio Cancare‡, and Donatella Sciuto†

∗ Computer Science Department - University of Illinois at Chicago, Email: apanel2@uic.edu
† Dipartimento di Elettronica e Informazione (DEI) - Politecnico di Milano,
Email: {santambr, cancare, fredaelli, sciuto}@elet.polimi.it
‡ Computer Science and Artificial Intelligence Laboratory - Massachusetts Institute of Technology, Email: santambr@mit.edu

Abstract—Multi-FPGA systems (MFS’s) represent a promising technology for various applications, such as the implementation of supercomputers and of parallel, computationally intensive emulation systems. Dynamic reconfigurability, in turn, expands the possibilities of traditional FPGAs by giving them the capability of adapting their functionality while still running, in order to cope with changes in the runtime environment. These two research directions are merged in this work, which describes a methodology for designing dynamically reconfigurable MFS’s. The proposed MFS design flow makes use of block reuse through dynamic reconfigurability to make the implementation of large systems feasible even on multi-FPGA architectures with strict physical constraints. Functional to this goal is the development of an algorithm for the extraction of the isomorphic structures of a circuit that extensively exploits the hierarchy of the design.

I. INTRODUCTION

The use of Field Programmable Gate Arrays (FPGAs) is nowadays widespread in both industry and academic research. Their computational power can be increased through the creation of clusters of chips. Besides augmenting the available physical area, this also makes it possible to exploit parallel computation on a massive scale. Such multi-FPGA systems (MFS’s) are currently used in supercomputing applications and logic emulation of custom circuits [1]. A computational paradigm attracting growing interest is reconfigurable computing (RC). An early definition given by Gerald Estrin refers to RC as the process of altering the location or the functionality of a system element, as a response to faults, changes in the environment or explicit application needs [2]. Due to their reprogrammability, FPGAs currently represent the leading technology for implementing reconfigurable systems. In recent years, the evolution of FPGA architectures has made it possible to further increase the degree of flexibility in the use of such chips. This innovation consists in the possibility of having parts of the FPGA reconfigured at run-time while others are still running, so that the execution of the system never ceases. This technique is called partial dynamic reconfigurability, as opposed to the standard static reconfigurability. A number of works about MFS design can be found in the literature (e.g. [1], [3]–[5]), but only a few approaches have been proposed that explore the field of dynamically reconfigurable MFS’s ([6], [7]). Merging the potential of MFS’s and reconfigurability is nevertheless a promising research direction. Although the area available on MFS’s is usually large, some complex applications may require even more space, thus forcing the replacement of the physical architecture with a larger one, a process that is very expensive and time-consuming.
By providing a larger virtual area, dynamic reconfigurability makes it possible to go beyond the physical space constraints of the architecture [8]. The presented work proposes a novel design methodology that exploits the dynamic reconfigurability of interconnections in MFS’s. This allows design blocks of the application to be used more than once during execution, resulting in significant area savings. To the best of our knowledge, no other work on multi-FPGA design has explored this scenario.

The remainder of this paper is organized as follows. Sections II, III, and IV present the proposed MFS design workflow and describe in detail the three phases it is composed of. Section V presents the experimental results, while Section VI briefly reviews previous works on MFS design, comparing them with the proposed methodology. Section VII concludes the paper and provides some hints for future work.

II. PROPOSED WORKFLOW AND DESIGN EXTRACTION

The approach proposed in this paper consists of a workflow for the design of MFS’s, whose abstract view is represented in Fig. 1 and is briefly described in the following.

Figure 1. Outline of the proposed multi-FPGA systems design workflow.

The input of the design process consists of a VHDL description of the application and a specification of the target multi-FPGA architecture. The VHDL code undergoes a design extraction phase, which aims at collecting the information relevant to the design structure. A global physical layout phase performs the partitioning, placement and routing of the application on the specified architecture. At this point, two situations are possible. If the application fits into the architecture, the flow ends. Otherwise, another step is undertaken, aimed at exploiting the dynamic reconfiguration of the communication infrastructure for module reuse. The output of the workflow is a new VHDL specification, describing the modules to be instantiated on each FPGA, together with information about the reconfiguration of interconnections. This process relies on existing commercial tools (e.g. Xilinx ISE) for subsequent intra-FPGA synthesis.
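The control flow just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `design_flow`, the three callables, and the toy stand-ins below are hypothetical names introduced here, not the tools used by the workflow, and a design is reduced to a dictionary of blocks with a type and a size in slices.

```python
def design_flow(design, capacity, extract, global_layout, plan_reuse):
    """Chain the three phases; each callable stands in for a real tool."""
    blocks = extract(design)                   # design extraction
    layout = global_layout(blocks, capacity)   # partitioning, placement, routing
    if layout is not None:                     # the application fits: flow ends
        return "static", layout
    return "reuse", plan_reuse(blocks, capacity)


# Toy stand-ins (assumptions): a design is a {block: (type, slices)} dict.
def extract(design):
    return dict(design)


def global_layout(blocks, capacity):
    total = sum(slices for _, slices in blocks.values())
    return sorted(blocks) if total <= capacity else None


def plan_reuse(blocks, capacity):
    # Keep a single instance of each block type.
    return sorted({ctype for ctype, _ in blocks.values()})


design = {"dec0": ("decoder", 500), "dec1": ("decoder", 500)}
fits = design_flow(design, 1200, extract, global_layout, plan_reuse)
tight = design_flow(design, 800, extract, global_layout, plan_reuse)
```

With a capacity of 1200 slices the static layout is returned, while with 800 slices the flow falls through to the reuse step.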

A. Intermediate Representation

The VHDL specification received as input is parsed and interpreted, and the result is saved in a specifically designed intermediate representation, a tree-like data structure that maintains information on both the structure and the hierarchy of the design. Hierarchy information is useful to our task for two important reasons. The first one depends on what makes the designer choose a particular design hierarchy when implementing the VHDL application. The designer follows some simple implicit rules in recursively aggregating (or splitting) components. Design blocks are built based on their functionality: if two sub-components (children) carry out operations that are conceptually portions of a larger function, they are likely to be aggregated in a bigger component (parent). Such sub-components are therefore likely to be strongly interconnected, and it is natural and favorable to exploit this information in the partitioning, placement and routing of the input circuit. The second reason is rooted in the concept of regularity: if two components belong to the same type, they are roots of two identical sub-trees in the hierarchy. Therefore, when a given operation is carried out during the execution of some algorithm in one of these sub-trees, it can be immediately replicated in the other one.
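To make the regularity argument concrete, the following Python sketch models a typed hierarchy tree and checks that two instances of the same component are roots of identical sub-trees. The `Node` class, the `shape` helper, and the toy cipher design are assumptions made for illustration; the actual intermediate representation is not specified at this level of detail.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """One block of the design hierarchy (hypothetical structure)."""
    name: str    # instance name, unique in the design
    ctype: str   # VHDL component the block is an instance of
    children: list = field(default_factory=list)


def instances_by_type(root):
    """Group every node of the tree by its component type."""
    groups = defaultdict(list)
    stack = [root]
    while stack:
        node = stack.pop()
        groups[node.ctype].append(node.name)
        stack.extend(node.children)
    return {ctype: sorted(names) for ctype, names in groups.items()}


def shape(node):
    """Canonical, name-independent description of the subtree at `node`."""
    return (node.ctype, tuple(shape(child) for child in node.children))


def make_round(name):
    # Two instances of the same component expand to identical subtrees.
    return Node(name, "round", [Node(name + "/sbox", "sbox"),
                                Node(name + "/perm", "perm")])


top = Node("top", "cipher", [make_round("r0"), make_round("r1"),
                             Node("ctrl", "fsm")])
groups = instances_by_type(top)
```

Here `groups["round"]` lists both round instances, and `shape` returns the same value for each of them, which is what allows an operation performed in one sub-tree to be replicated in the other.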

B. VHDL Preprocessing and Extraction

The extraction phase is composed of two steps. First, the VHDL specification is processed to reduce it to a pure VHDL structural description. The resulting code contains only structural statements for the intermediate nodes of the hierarchical tree, while behavioral and data-flow instructions are allowed exclusively in leaf blocks. To obtain a pure structural description, two operations are carried out for each component in the design:

1) For every process, a leaf component is created which contains the process. The process in the original file is replaced by the instantiation of this component.
2) All data-flow instructions are moved into a leaf component, and are replaced by the instantiation of that component.

Then, the specification is parsed into the intermediate representation. The estimated FPGA area occupation in number of slices is retrieved by this step, using existing FPGA synthesis tools such as Xilinx XST [9]. Each leaf of the extracted hierarchical tree is then constituted by a single VHDL process or a group of data-flow instructions. The granularity of this structure is quite coarse, especially if compared to usual gate-level netlists. The choice of handling the circuit at a process-level granularity arises from the fact that the presented workflow performs a global mapping of the application on a multi-FPGA architecture, with subsequent phases taking care of fine-grained local syntheses. In this context, dealing with a low number of relatively large design modules leads to faster results.

III. GLOBAL PHYSICAL LAYOUT

The global layout phase searches for a feasible mapping of the parsed application onto the multi-FPGA architecture received as input, while optimizing an objective, namely the interconnection length. Such a mapping assigns exactly one host FPGA of the architecture to each leaf block of the input application and routes the interconnections between any two communicating modules assigned to different chips. The cost function to be minimized is the estimated length of the interconnections between blocks assigned to different FPGAs, measured in number of hops and weighted by connection width. Let \( w(i, j) \) be the amount of communication in number of bits between nodes \( i \) and \( j \), and let \( e_i \) identify the FPGA that node \( i \) is assigned to. This cost, called the Weighted Estimated Wire Length (WEWL), is computed as follows:

\[
WEWL = \sum_{1 \leq i < j \leq n} w(i, j) d(e_i, e_j)
\]

where \( n \) is the number of leaf nodes of the application, \( w(i, j) \) is the size in bits of the interconnection between nodes \( i \) and \( j \), and \( d(e_i, e_j) \) is the estimated distance between FPGAs \( e_i \) and \( e_j \) in the architecture. Off-chip wires are undesirable since they degrade performance, constitute a source of faults, and increase the need for I/O pins [10].

In this paper, only global partitioning and placement are addressed. Global partitioning aims at creating partitions of leaf design nodes such that their size is not bigger than the area available on the FPGAs composing the architecture and the cut-size is minimized. Placement generates a one-to-one mapping between the created partitions and the FPGAs that minimizes interconnection length.

The partitioning algorithm is a bottom-up clustering that exploits the regularities extracted from the design hierarchy. At the beginning, each leaf of the design hierarchy is considered as a cluster and is assigned a type, given by the VHDL component the node is an instance of. Then, the two clusters maximizing a given closeness metric are collapsed together, provided this does not violate the maximum area and I/O pin count constraints. Let \( B = \{1, 2, ..., N\} \) be the set of all leaves of the design hierarchy and \( \mathcal{P} = \{P_1, P_2, ..., P_M\} \) a partitioning over \( B \). Given two clusters \( P, Q \in \mathcal{P} \), the metrics considered by the algorithm are:

- **Connection** (CONN): volume of communication between two clusters (in bits).

\[ CONN(P,Q) = \sum_{i \in P, j \in Q} w(i,j) \]

- **Communication Ratio** (CR): ratio between the communication volume internal to the resulting cluster (Internal Communication – IC) and the communication volume with other clusters (External Communication – EC).

\[ IC(P,Q) = \sum_{i,j \in P \cup Q, i < j} w(i,j) \]

\[ EC(P,Q) = \sum_{i \in P \cup Q, j \in B \setminus (P \cup Q)} w(i,j) \]

\[ CR(P,Q) = \frac{IC(P,Q)}{EC(P,Q)} \]

- **Communication Density** (CD): ratio between the Internal Communication and the number of edges of a hypothetical complete graph built on the resulting cluster.

\[ CD(P,Q) = \frac{IC(P,Q)}{\text{CliqueSize}(P \cup Q)} \]
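The three metrics can be sketched in Python as follows. The code follows the verbal definitions given in the list (communication between the two clusters, inside the merged cluster, and towards all other leaves). The function names, the dictionary layout of `w`, and the `best_pair` helper are assumptions made for illustration, and the I/O pin constraint is left out for brevity.

```python
from itertools import combinations


def weight(w, i, j):
    """Bits exchanged between leaves i and j (0 if not connected)."""
    return w.get((min(i, j), max(i, j)), 0)


def conn(w, p, q):
    """CONN: bits exchanged between cluster p and cluster q."""
    return sum(weight(w, i, j) for i in p for j in q)


def ic(w, p, q):
    """IC: bits exchanged inside the merged cluster p | q."""
    return sum(weight(w, i, j) for i, j in combinations(sorted(p | q), 2))


def ec(w, p, q, leaves):
    """EC: bits exchanged between p | q and every other leaf."""
    merged = p | q
    return sum(weight(w, i, j) for i in merged for j in leaves - merged)


def cr(w, p, q, leaves):
    """CR = IC / EC; infinite when the merged cluster has no external wires."""
    external = ec(w, p, q, leaves)
    return ic(w, p, q) / external if external else float("inf")


def cd(w, p, q):
    """CD = IC / number of edges of a complete graph on p | q."""
    size = len(p | q)
    return ic(w, p, q) / (size * (size - 1) // 2)


def best_pair(w, clusters, area, max_area):
    """Pair with the highest CONN whose merged area stays within max_area."""
    candidates = [(p, q) for p, q in combinations(clusters, 2)
                  if sum(area[i] for i in p | q) <= max_area]
    return max(candidates, key=lambda pq: conn(w, *pq), default=None)


# Toy instance: four leaves, connection widths in bits keyed by (i, j), i < j.
leaves = frozenset({1, 2, 3, 4})
w = {(1, 2): 8, (2, 3): 4, (3, 4): 16, (1, 4): 2}
area = {1: 100, 2: 100, 3: 100, 4: 100}
clusters = [frozenset({k}) for k in sorted(leaves)]
first_merge = best_pair(w, clusters, area, max_area=250)
```

On this toy instance the first merge selected by CONN joins leaves 3 and 4, which share the widest connection.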

Two possible cases can arise when two clusters are collapsed:

1) If the two clusters belong to the same parent in the hierarchy, other instances of the parent’s type are searched in the hierarchy to apply the same transformation. The same type is assigned to these newly created clusters. In other words, a collapse operation *induces* other ones.

2) If the two clusters do not belong to the same parent, the newly created cluster is added as a child of the root node, since such a cluster is unique and certainly not involved in any regularity pattern.

This scheme, exemplified in Fig. 2, is iterated until no more clusters can be formed or only one cluster remains. Throughout the algorithm, intermediate hierarchy nodes with a single child are dropped. Notice that the use of the hierarchy information overcomes a traditional problem of clustering algorithms, represented by the locality of closeness metrics.

The placement step is essentially a one-to-one mapping of the generated partitions onto the FPGAs of the target architecture. When dealing with a small number of chips, an optimal solution can be easily found. In general, a topology-independent iterative method can be considered. A simulated annealing (SA) approach has been developed and tested, whose objective function is the expected wire length needed for inter-FPGA communication. If partition \( K \) is assigned to chip \( c_K \), then this function is \( \sum_{I,J \in \mathcal{P}} d(c_I, c_J) CONN(I, J) \), where \( d(\cdot, \cdot) \) is the distance in number of hops between two chips.
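For a small number of chips, the placement objective can be minimized by simply trying every one-to-one mapping. The Python sketch below does this with exhaustive search, not with the simulated annealing approach mentioned above. The 2x2 mesh, the Manhattan hop distance, and the partition names are assumptions chosen for illustration.

```python
from itertools import permutations


def hops(a, b):
    """Hop distance between two chips of a mesh, given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def placement_cost(assignment, conn):
    """Sum of d(c_I, c_J) * CONN(I, J) over all connected partition pairs."""
    return sum(bits * hops(assignment[i], assignment[j])
               for (i, j), bits in conn.items())


def best_placement(partitions, chips, conn):
    """Try every one-to-one mapping; only viable for a handful of chips."""
    best = None
    for perm in permutations(chips, len(partitions)):
        assignment = dict(zip(partitions, perm))
        cost = placement_cost(assignment, conn)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


chips = [(0, 0), (0, 1), (1, 0), (1, 1)]                  # assumed 2x2 mesh
conn = {("A", "B"): 32, ("B", "C"): 8, ("A", "D"): 16}    # CONN values in bits
cost, assignment = best_placement(["A", "B", "C", "D"], chips, conn)
```

In this instance every connected pair can be placed on adjacent chips, so the optimal cost equals the total connection width. The number of mappings grows factorially with the number of chips, which is why an iterative method such as SA becomes preferable on larger architectures.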

IV. REUSE AND DYNAMIC RECONFIGURABILITY

The attempt to find a static global layout may fail, due to the bounded area available on the architecture. In such a case, a design block reuse technique is adopted. Consider a dynamically interconnected circuit structure where the nets connecting the blocks can be added and dropped at run-time. In this scenario, a block can be connected to more than one net in non-overlapping time intervals. In this way, two or more identical parts of the application can be implemented by a single block with dynamic interconnections. A crossbar topology can be used to implement this kind of circuit: the reprogrammable switch-boxes in the crossbar chip can be dynamically reconfigured to implement temporary connections, as shown in Fig. 3. Other multi-FPGA architectures (e.g. bus-based) can be considered as well. The problem of design parts

Figure 3. Input structure (a) and possible crossbar implementation (b).

the reconfigurations. The problem to be solved is therefore to find a block reuse strategy that allows the input application to fit in the architecture while minimizing the required reconfiguration time. Complicating the problem is the fact that isomorphic structures are in general overlapping: this introduces mutual constraints to be fulfilled when choosing which blocks to instantiate. However, the isomorphic structures extracted by the clustering algorithm have a peculiar nature: given two clusters, either one contains the other or they do not overlap. This is because the algorithm generates a cluster hierarchy that can be viewed as a dendrogram, such as the one shown in Fig. 4. As a consequence, any cut of the dendrogram represents a full, flat, typed specification of the input application. The number of possible cuts is exponential in the number of initial clusters; therefore, only the subset given by horizontal cuts is considered by the proposed methodology: the reuse problem is solved individually for each of these cuts, and the best solution is then returned.

Let \( n(c_i) \) be the number of occurrences of cluster type \( c_i \in C \). A solution to the problem is represented by a function \( m \) that assigns to each \( c_i \in C \) a value \( m(c_i) \in \{1, 2, ..., n(c_i)\} \), which represents the number of instances of cluster type \( c_i \) in the resulting dynamically interconnected structure. In a partial crossbar topology such as the one described above, the reconfiguration time related to a cluster is, to a good approximation, proportional to the area that has to be reconfigured. In turn, this area is proportional to the width of the external interconnections of the cluster, denoted as \( w(c_i) \). This quantity has to be multiplied by the number of reconfigurations implied by the solution, equal to \( n(c_i) - m(c_i) \). We can conclude that the actual time needed for the reconfigurations is proportional to the following quantity, which therefore has to be minimized:

\[
T_{rec} \propto \sum_{c_i \in C} [n(c_i) - m(c_i)] \cdot w(c_i).
\]

The area occupied by the resulting system has to be smaller than the overall capacity \( A \) of the architecture. Mathematically, this constraint is expressed as \( \sum_{c_i \in C} [a(c_i) \cdot m(c_i)] < A \), where \( a(c_i) \) denotes the area occupied by one instance of cluster type \( c_i \). Although the problem is NP-complete, an Integer Linear Programming (ILP) model is easily obtained from these formulae, and it has been shown to run in acceptable time (see Section V). Consider again the nature of the isomorphic structures that are candidates for reuse: they are obtained by a clustering process that tries to minimize the amount of external communication. Combined with the expression for \( T_{rec} \) above, this fact implies that the structures are well suited to the goal of minimizing the reconfiguration time.
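The optimization problem above can be illustrated with a small Python sketch that enumerates every admissible choice of \( m(c_i) \) instead of calling an ILP solver. The enumeration is a stand-in chosen to keep the example self-contained, and the function names, the tuple layout `(n, a, w)`, and the numbers are assumptions.

```python
from itertools import product


def solve_reuse(types, capacity):
    """Pick m(c) in 1..n(c) minimizing sum (n - m) * w with sum a * m < capacity.

    `types` maps a cluster type to (n, a, w): occurrences, area in slices,
    and external interconnection width in bits.
    """
    names = list(types)
    best = None
    for choice in product(*(range(1, types[c][0] + 1) for c in names)):
        area = sum(types[c][1] * m for c, m in zip(names, choice))
        if area >= capacity:
            continue  # violates the strict area constraint
        cost = sum((types[c][0] - m) * types[c][2]
                   for c, m in zip(names, choice))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)))
    return best  # None when even one instance per type does not fit


def best_cut(cuts, capacity):
    """Solve every horizontal cut and keep the cheapest feasible one."""
    solved = [(solve_reuse(types, capacity), k) for k, types in enumerate(cuts)]
    feasible = [(s[0], k, s[1]) for s, k in solved if s is not None]
    return min(feasible, default=None)


# Two hypothetical cuts of the same dendrogram.
fine = {"round": (4, 300, 64), "fsm": (1, 100, 8)}
coarse = {"half": (2, 650, 32), "fsm": (1, 100, 8)}
single = solve_reuse(fine, capacity=800)
overall = best_cut([fine, coarse], capacity=800)
```

With 800 slices available, the finer cut keeps two of the four `round` instances at a cost of 128, while the coarser cut keeps one `half` instance at a cost of 32 and is therefore the one returned by `best_cut`.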

V. EXPERIMENTS AND CASE STUDY

Four VHDL test circuits have been used for validating the global layout algorithms proposed in this paper: an encryption/decryption core (3DES), a Finite Impulse Response filter (FIR), a cypher (NOEK), and a combination of the first two (3DES+FIR). Quantitative information about these circuits is reported in Table I.

Figure 4. Example of extraction of horizontal cuts.

Table I

<table>
<thead>
<tr>
<th>Circuit</th>
<th>3DES</th>
<th>FIR</th>
<th>NOEK</th>
<th>3DES+FIR</th>
</tr>
</thead>
<tbody>
<tr>
<td>Size (slices)</td>
<td>1613</td>
<td>561</td>
<td>938</td>
<td>2141</td>
</tr>
<tr>
<td># Nodes in hierarchy</td>
<td>67</td>
<td>231</td>
<td>29</td>
<td>301</td>
</tr>
<tr>
<td># Leaves</td>
<td>52</td>
<td>211</td>
<td>25</td>
<td>264</td>
</tr>
<tr>
<td>Leaves size (slices)</td>
<td>19.21</td>
<td>2.66</td>
<td>38.32</td>
<td>8.11</td>
</tr>
<tr>
<td>Mean</td>
<td>28.5</td>
<td>4.94</td>
<td>72.45</td>
<td>27.36</td>
</tr>
</tbody>
</table>

Table II reports some numerical results\(^1\) of the execution of the MFS design workflow proposed in this paper. In particular, results are shown for partitioning using the three clustering metrics introduced in Section III: Connection, Communication Ratio, and Communication Density. For the experiments on the test circuits described above, three hypothetical FPGA dimensions have been considered: 300, 400, and 600 slices. The partitioning quality is tested against the results obtained by using the Metis partitioning algorithm. This comparison is based on the cutsize of the resulting partitioning, which is the amount of communication, in number of bits, among different partitions. Moreover, the results for the one-to-one placement and for solving the instances of the ILP model for block reuse are listed in the table.

The table shows that the Connection metric (CONN) for clustering leads to smaller cutsizes in the majority of the cases, although it sometimes implies the use of one additional partition. The time required for partitioning is reasonably low, even if it grows more than linearly with the number of nodes in the design hierarchy. These results are compared with the ones obtained using the Metis partitioning algorithm ([11]), which currently represents the state of the art for partitioning large flat netlists. In some cases the proposed clustering algorithm performs better than Metis. This happens for circuits whose leaf sizes have a high variance-to-mean ratio (i.e. 3DES and 3DES+FIR), while it is not true for circuits whose structure is more similar to a typical flat netlist. This shows that the proposed approach is promising, as it provides good results in partitioning hierarchical structures extracted from VHDL with large and irregular block dimensions. The one-to-one placement algorithm has been tested on 4-mesh multi-FPGA topologies. The running times are acceptable, and grow roughly linearly as the number of partitions increases.

The ILP model for computing the best solution in cluster reuse has been solved by considering a maximum area equal to 80% of the one actually available on the architecture. This is a common technique used to counterbalance possible estimation errors. It can be seen that, even when the number of executions is high, the running time is reasonably low. A more detailed analysis shows that this time depends on both the size of the circuit and the number of required executions of the solver, which is equal to the number of iterations performed by the clustering algorithm (i.e. the depth of the dendrogram), which in turn is bounded by the number of leaves in the design hierarchy.

In order to show how the proposed workflow works in practice, a sample case study is provided. The user needs to deploy a parallel JPEG decoder composed of two identical decoding modules on a single FPGA device.

\(^1\) All tests have been carried out using an Intel Core 2 Duo 2.2 GHz machine.

Consider now a scenario where the multi-FPGA system under examination deploys hardware applications on demand: it is likely that a certain sequence of received requests limits the currently available area. Within this environment, assume that the total available area is bounded to 3000 slices. The architecture used in these experiments is composed of Xilinx XC3S100E2 devices ([12]), chosen for their low-power and low-cost characteristics, each with an available area of 960 slices and 108 I/O pins. The best result is obtained by partitioning the application using the Connection metric, which produces five partitions with a cutsize of 198. The Metis algorithm gives a cutsize equal to 388.

Table II

<table>
<thead>
<tr>
<th>Circuit</th>
<th>3DES</th>
<th>FIR</th>
<th>NOEK</th>
<th>3DES+FIR</th>
</tr>
</thead>
<tbody>
<tr>
<td>Partition Size (slices)</td>
<td>300</td>
<td>400</td>
<td>600</td>
<td>300</td>
</tr>
<tr>
<td>Partitioning (Clustering)</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>M</td>
<td>E</td>
<td>F</td>
<td>I</td>
<td>M</td>
</tr>
<tr>
<td>Time (ms)</td>
<td>7</td>
<td>50</td>
<td>19</td>
<td>970</td>
</tr>
<tr>
<td># Partitions</td>
<td>7</td>
<td>5</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>Cutsize</td>
<td>547</td>
<td>530</td>
<td>349</td>
<td>36</td>
</tr>
<tr>
<td>Time (ms)</td>
<td>18</td>
<td>17</td>
<td>19</td>
<td>1371</td>
</tr>
<tr>
<td># Partitions</td>
<td>7</td>
<td>5</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>Cutsize</td>
<td>547</td>
<td>530</td>
<td>349</td>
<td>36</td>
</tr>
<tr>
<td>Time (ms)</td>
<td>18</td>
<td>17</td>
<td>19</td>
<td>1371</td>
</tr>
<tr>
<td># Partitions</td>
<td>7</td>
<td>5</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>Cutsize</td>
<td>547</td>
<td>530</td>
<td>349</td>
<td>36</td>
</tr>
<tr>
<td>Time (ms)</td>
<td>18</td>
<td>17</td>
<td>19</td>
<td>1371</td>
</tr>
<tr>
<td># Partitions</td>
<td>7</td>
<td>5</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>Cutsize</td>
<td>547</td>
<td>530</td>
<td>349</td>
<td>36</td>
</tr>
<tr>
<td>Time (ms)</td>
<td>18</td>
<td>17</td>
<td>19</td>
<td>1371</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Circuit</th>
<th colspan="3">3DES</th>
<th colspan="3">FIR</th>
<th colspan="3">NOEK</th>
<th colspan="3">3DES+FIR</th>
</tr>
<tr>
<th>Partition Size (slices)</th>
<th>300</th>
<th>400</th>
<th>600</th>
<th>300</th>
<th>400</th>
<th>600</th>
<th>300</th>
<th>400</th>
<th>600</th>
<th>300</th>
<th>400</th>
<th>600</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="13">1-to-1 Placement</td>
</tr>
<tr>
<td>WEWL</td>
<td>1335</td>
<td>789</td>
<td>540</td>
<td>36</td>
<td>50</td>
<td>0</td>
<td>2287</td>
<td>2061</td>
<td>1314</td>
<td>2099</td>
<td>775</td>
<td>753</td>
</tr>
<tr>
<td>Time (ms)</td>
<td>1224</td>
<td>621</td>
<td>494</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>655</td>
<td>528</td>
<td>2</td>
<td>1658</td>
<td>892</td>
<td>753</td>
</tr>
<tr>
<td colspan="13">Block Reuse (ILP)</td>
</tr>
<tr>
<td># Runs</td>
<td>42</td>
<td>34</td>
<td>46</td>
<td>186</td>
<td>186</td>
<td>187</td>
<td>22</td>
<td>23</td>
<td>22</td>
<td>228</td>
<td>229</td>
<td>230</td>
</tr>
<tr>
<td>Time (ms)</td>
<td>420</td>
<td>492</td>
<td>474</td>
<td>2103</td>
<td>2142</td>
<td>2078</td>
<td>2173</td>
<td>2325</td>
<td>2596</td>
<td>2753</td>
<td>2685</td>
<td>2753</td>
</tr>
</tbody>
</table>

Figure 5. Case study: estimated reconfiguration time for different dendrogram cuts. The value -100 indicates that no solution to the ILP model could be found.

VI. RELATED WORKS

The design of MFS’s is addressed in several works. In [13], [14], the authors propose the Virtual Wires approach to overcome pin limitations by intelligently multiplexing each physical wire among multiple logical wires. Hauck ([1], [15]) provides a complete workflow for the design of MFS’s, by proposing an integrated methodology for global partitioning, placement, and routing. The application to be mapped is recursively bi-partitioned using, in each iteration, a multilevel algorithm based on the Fiduccia-Mattheyses (FM) heuristic ([16]). The recursive bi-partitioning is driven by partitioning orderings: the first two obtained partitions are placed on the two least connected portions of the architecture, and so on recursively. This technique also implicitly provides the global routing of the application on the architecture.

Khalid’s work ([3], [17]) focuses on the evaluation of different MFS topologies rather than on the performance of the adopted algorithms. Nevertheless, he proposes a complete MFS design workflow, in which the three global layout steps are carried out sequentially. For partitioning the input application, Khalid uses a simple recursive FM bi-partitioning algorithm. Global placement is executed differently depending on the architectural topology; for instance, a force-directed approach is used for mesh topologies. To cope with the routing problem, a general topology-independent approach based on graph structures is proposed, as well as several algorithms tailored to specific topologies, which provide better results.

A direct comparison between these approaches and the proposed methodology can be carried out in terms of partitioning results. Both Hauck and Khalid use derivations of the FM heuristic, in a multilevel process and in recursive bi-partitioning, respectively. In Section V, comparisons with METIS are shown. Being a multilevel partitioning algorithm that also uses FM as a baseline, METIS represents an improvement over recursive FM and has been reported to outperform Hauck’s approach ([18]).

Other approaches focus on parts of the MFS design flow. In [19] the authors propose a genetic algorithm with a fuzzy fitness function for MFS partitioning and placement targeted to 4-mesh topologies. The authors of [20] use simulated annealing to cope with global partitioning and placement. Iterative techniques indeed seem suitable for high-dimensional and complex problems such as global layout. Works like [21] and [4] show the advantages of exploiting the design hierarchy of the application, instead of using a flat netlist representation, by providing heuristic algorithms that traverse the hierarchy structure. In [22] the authors include an HDL synthesis step in the MFS design flow: a Verilog description is analyzed and turned into a hierarchical tree, and a top-down set covering algorithm is applied to generate partitions.

The approaches presented so far in this section deal with the design of MFS’s without taking into account dynamic reconfiguration. Instead, in [6] the authors propose a partitioning and synthesis process claimed to generate dynamically reconfigurable MFS’s. The input specification is transformed into a directed task-graph, which is divided into time segments (temporal partitioning). Then, a binary non-linear programming model performs a spatial partitioning over the FPGAs of the architecture for each time segment. At run-time, after a time segment is completed, execution is stopped and all the FPGAs are reconfigured, with temporary results stored in memory. Clearly, this approach is not “dynamic” in the sense described in Section I. In order to be truly dynamic, an MFS must be partially reconfigured, so that reconfiguration times are masked and the execution never ceases.

This is exactly what the methodology described in the present work proposes, by introducing partial dynamic reconfigurability in a standard MFS design workflow. In particular, inter-FPGA interconnections are partially reconfigured with the goal of saving area through blocks reuse, without requiring the execution of the system to cease.

VII. CONCLUSIONS AND FUTURE WORKS

In this paper a novel MFS design flow has been described, which makes use of block reuse through dynamic reconfigurability to make the implementation of large systems feasible even on multi-FPGA architectures with strict physical constraints. Experimental results have been provided in Section V to validate the proposed methodology. Among other contributions, a notable novelty is the exploitation of the design hierarchy both for producing good partitioning results and for extracting isomorphic structures. Future work will deal with the improvement of the clustering algorithm by adopting more powerful clustering metrics and developing solutions to go beyond its intrinsic greediness. An algorithm for scheduling the reuse of components has to be developed, along with an effective routing methodology. Finally, the proposed approach will be used in conjunction with the virtual wires approach, which has been shown [13] to remove the need for expensive crossbar technology while increasing FPGA utilization.

REFERENCES