High Level Synthesis: from Algorithm to Digital Circuit

14 Exploiting Bit-Level Design Techniques in Behavioural Synthesis
M.C. Molina et al.

Table 14.1 Schedule and binding based on operation fragmentations (bit ranges are written as X[h:l]; the fragments are distributed over cycles 1-3)

Operation   Fragmentation                      FU
E = A×B                                        ⊗4×4
F = C×D                                        ⊗4×4
G = E+F                                        ⊕8
I = H×G     I1 = H × G[3:0]                    ⊗8×4
            I2 = H[3:0] × G[7:4]               ⊗4×4
            I3 = H[7:4] × G[7:4]               ⊗4×4
            I4 = I1[11:4] + I2                 ⊕8
            I5 = ('000' & I4[8:4]) + I3        ⊕8
L = J+K                                        ⊕8
N = L×M     N1 = L × M[11:8]                   ⊗8×4
            N2 = L × M[3:0]                    ⊗8×4
            N3 = L[3:0] × M[7:4]               ⊗4×4
            N4 = L[7:4] × M[7:4]               ⊗4×4
            N5 = N3 + N2[11:4]                 ⊕8
            N6 = N4 + ('000' & N5[8:4])        ⊕8
            N7 = N1 + ('000' & N6)             ⊕12
R = P+Q     R1 = P[11:0] + Q[11:0]             ⊕12
            R2 = P[23:12] + Q[23:12]           ⊕12

Fig. 14.2 Execution of operation N = L×M in cycle 3

Table 14.2 Main features of the synthesized implementations

               Conventional          Proposed                   Saved (%)
Datapath FUs   ⊗12×8, ⊗4×4, ⊕24      ⊗8×4, 2 ⊗4×4, ⊕12, 2 ⊕8    –
FUs area       1,401 inverters       911 inverters              35
Circuit area   3,324 inverters       2,298 inverters            30.86
Cycle length   23.1 ns               21.43 ns                   7.22

In the second implementation of this example every datapath FU is used once per cycle to execute one operation of its same width, and therefore the HW waste present in the conventional implementation has been completely removed. The circuit area (including FUs, registers, multiplexers, and controller) and the cycle length have been reduced by around 31% and 7%, respectively, as summarized in Table 14.2.

14.2 Design Techniques to Reduce the HW Waste

The first step towards reducing the HW waste present in conventional implementations is the extraction of the common operative kernel of the operations in the behavioural specification. This implies the fragmentation of compatible operations (those that can be executed over FUs of the same type) into their common operative kernel plus some glue logic. These transformations can be applied not only to trivial cases, like the compatibility between additions and subtractions, but also to more complex ones, like the compatibility between additions and multiplications.
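As a concrete instance of such a fragmentation, the composition of N = L×M from the fragments in Table 14.1 can be checked with a short behavioural sketch. Python is used purely for illustration, and L is assumed to be 8 bits wide and M 12 bits wide, as the ⊗8×4 fragments suggest; bit ranges are expressed with shifts and masks.

```python
def mult_fragmented(L, M):
    """8x12-bit multiply composed from 8x4 and 4x4 fragments, as for N = L*M in Table 14.1."""
    n1 = L * (M >> 8)                   # N1 = L x M[11:8]      (8x4 multiplier)
    n2 = L * (M & 0xF)                  # N2 = L x M[3:0]       (8x4 multiplier)
    n3 = (L & 0xF) * ((M >> 4) & 0xF)   # N3 = L[3:0] x M[7:4]  (4x4 multiplier)
    n4 = (L >> 4) * ((M >> 4) & 0xF)    # N4 = L[7:4] x M[7:4]  (4x4 multiplier)
    n5 = n3 + (n2 >> 4)                 # N5 = N3 + N2[11:4]           (8-bit adder)
    n6 = n4 + (n5 >> 4)                 # N6 = N4 + ('000' & N5[8:4])  (8-bit adder)
    n7 = n1 + n6                        # N7 = N1 + ('000' & N6)       (12-bit adder)
    # final result: N7, N5[3:0], and N2[3:0] concatenated
    return (n7 << 8) | ((n5 & 0xF) << 4) | (n2 & 0xF)
```

Each fragment fits the FU width listed in the table, and the concatenation of N7, N5[3:0], and N2[3:0] reproduces the full 20-bit product.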
The common operative kernel extraction increases the number of operations that can be executed over the same HW resources and, as a consequence, helps to increase FU reuse. In addition to the extraction of common operative kernels, several other advanced design techniques can be applied during the scheduling or allocation phases to further reduce the HW waste. These new methods handle specification operations at the bit level, and produce datapaths where the number, type, and width of the HW resources do not depend on the number, type, and width of the specification operations and variables. This independence from the description style used in the specification brings the applicability of HLS closer to inexperienced designers, and offers expert ones more freedom to specify behaviours.

14.2.1 Kernel Extraction of Compatible Operations

Compatible operations can be executed over the same FUs in order to improve HW reuse, provided that their common operative kernel has been extracted in advance. This kernel extraction can be applied to the behavioural specification as a preprocessing phase prior to the synthesis process. In this way, additive operations such as subtractions, comparisons, maximums, minimums, absolute values, data format converters, or data limiters can be replaced by additions and logic operations. More complex operations like multiplications, divisions, MACs, factorials, or power operations can be decomposed into several multiplications, additions, and some glue logic.

In order to increase the number of operations that may share one FU, it is also desirable to unify the different representation formats used in the specification. Signed operations can be transformed into several unsigned ones, e.g.
a two's complement signed multiplication of m × n bits (with m ≥ n) is transformed, using a variant of the Baugh and Wooley algorithm [1], into one multiplication of (m−1) × (n−1) bits and two additions of m and n+1 bits. Some of the transformations that can be applied to homogenize operation types and data representation formats have been described by Molina et al. [2].

14.2.2 Scheduling Techniques

The HW waste present in conventional implementations can be reduced in the scheduling phase by balancing the computational cost of the operations executed per cycle (the number of bits of every different operation type) instead of the number of operations. In order to obtain homogeneous distributions, some specification operations must be transformed into a set of new ones, whose types and widths may differ from the original.

Operation fragments inherit the mobility of the original operation and are scheduled separately, in order to balance the computational cost of the operations executed per cycle. Therefore, one original operation may be executed across a set of cycles, not necessarily consecutive (saving the partial results and carry information calculated in every cycle). This implies that every result bit is available to be used as an input operand in the cycle it is calculated, even if the operation finishes in a later cycle. Also, each fragment can begin its execution once its input operands are available, even if its predecessors finish in a later cycle. It can also occur that the first operation fragment executed is not the one that uses the LSB of the input operands. This feature does not imply that the MSB of the result can be calculated before its LSB in the case of operations with a rippling effect.

The set of different fragmentations to be applied in every case mainly depends on the type and width of the operation. In the following sections a detailed analysis of the different ways to fragment additions and multiplications is presented.
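Before turning to those fragmentations, the kernel-extraction transformations of Sect. 14.2.1 can be illustrated with a short sketch. Python is used only for illustration; the subtraction rewrite and the sign-bit corrections below are standard two's-complement identities, whereas the Baugh-Wooley variant cited above further rearranges the corrections into pure additions.

```python
def sub_via_add(a, b, w):
    """w-bit subtraction executed on an adder plus inverters: a - b = a + ~b + 1."""
    mask = (1 << w) - 1
    return (a + (~b & mask) + 1) & mask

def signed_mult_decomposed(a, b, m, n):
    """m x n two's-complement multiply built around one unsigned (m-1)x(n-1)
    multiply plus sign-bit correction terms (behavioural sketch only)."""
    a_sign, a_mag = (a >> (m - 1)) & 1, a & ((1 << (m - 1)) - 1)
    b_sign, b_mag = (b >> (n - 1)) & 1, b & ((1 << (n - 1)) - 1)
    p = a_mag * b_mag                        # the single unsigned multiply
    p += (a_sign & b_sign) << (m + n - 2)    # sign x sign partial product
    p -= (a_sign * b_mag) << (m - 1)         # corrections for the sign bits
    p -= (b_sign * a_mag) << (n - 1)
    return p
```

After such rewrites a subtraction and an addition, or a signed and an unsigned multiplication, can share the same FU.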
Other types of operations could also be fragmented in a similar way. However, their description has been omitted, as most of them can be reduced to multiplications, additions, and logic operations, and the remaining ones are less common in behavioural specifications.

14.2.2.1 Fragmentation of Additions

In order to improve the quality of circuits, some specification additions can be fragmented into several smaller additions and executed across several not necessarily consecutive cycles. In this case, the new data dependences among operation fragments (the propagation of carry signals) force the LSB of the operation to be calculated in the earliest cycle and the MSB in the latest one. The application of this design technique provides many advantages in comparison with previous ones, such as chaining, bit-level chaining, multicycle, or non-integer multicycle. Some of them are summarized below:

(a) Area reduction of the required FUs. The execution of the new operation fragments can be performed using a single adder. The adder width must equal the width of the widest addition fragment, which is always narrower than the original operation. In general, the routing resources needed to share the adder among all the addition fragments do not cancel out the benefits achieved by FU reuse. Apart from the remaining additions present in the specification, the best area reduction is obtained when all the fragments have similar widths, due to the minimization of the HW waste, which also applies to the routing resources. In the particular case where the addition fragments are scheduled in consecutive cycles, this design technique may seem similar to the multicycle technique. However, the main difference resides in the required HW: the width of the multicycle FU must match the operation width, whereas in our case it must equal the width of the widest fragment.

(b) Area reduction of the required storage units.
Not all the input operand bits of a fragmented addition are required in the first of its execution cycles, and the operand bits already added are no longer necessary in the later ones. That is, every input operand bit is needed in just one cycle, and it has to be stored only until that cycle. The storage requirements are considerably smaller than when the entire operation is executed in a later cycle, because that implies saving the complete operands during several cycles. In the case of multicycle operators, the input operands are needed during all the cycles of the execution.

(c) Cycle length reduction. The reduction of the addition widths, a consequence of the operation fragmentations, directly involves a reduction of their delays and, in many cases, also results in considerable cycle length reductions. However, this reduction is not achieved in a straightforward manner, because it also depends on both the schedule of the remaining specification operations and the set of operations chained in every cycle.

(d) Chaining of uncompleted operations. Every result fragment calculated can be used as an input operand in the cycle it has been computed. This allows a successor operation to begin in the same cycle. In this way, the calculation of addition fragments is less restrictive than the calculation of complete operations, and therefore extends the design space explored by synthesis algorithms.

Figure 14.3 graphically shows the schedule of one n-bit addition in two inconsecutive cycles. The m LSBs are calculated in cycle i, and the remaining ones in cycle i+j. The input operand bits computed in cycle i are not required in the following ones, and thus the only saved operands are A[n−1:m] and B[n−1:m]. A successor operation requiring (A + B)[m−1:0] can begin as soon as cycle i.
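The two-cycle addition of Fig. 14.3 can be sketched as follows (Python used only for illustration; m is the fragment width, and the one-bit carry is the only state kept between the two cycles):

```python
def add_in_two_cycles(a, b, m):
    """n-bit addition split as in Fig. 14.3: the m LSBs in cycle i,
    the remaining bits (plus the saved carry) in a later cycle i+j."""
    mask = (1 << m) - 1
    lo_sum = (a & mask) + (b & mask)        # cycle i: only A[m-1:0], B[m-1:0] needed
    lo, carry = lo_sum & mask, lo_sum >> m  # lo is already usable by successor ops
    hi = (a >> m) + (b >> m) + carry        # cycle i+j: only A[n-1:m], B[n-1:m] stored
    return (hi << m) | lo
```

Only the m-bit fragments and the carry bit cross the cycle boundary, which is where the storage savings of feature (b) come from.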
Although this example corresponds to the most basic fragmentation of one addition into two smaller ones, many others are also possible, including those that produce a bigger number of new operations.

Fig. 14.3 Execution of one addition in two inconsecutive cycles

14.2.2.2 Fragmentation of Multiplications

Multiplications can also be fragmented into smaller operations to be executed in several not necessarily consecutive cycles, with the same aim of reducing the circuit area. In this case, two important differences from the addition fragmentation arise:

– The fragmentation of one multiplication produces not only smaller multiplications but also additions, needed to compose the final result from the multiplication fragments.
– The new data dependences among the operation fragments also force the LSB to be calculated before the MSB. However, in this case this does not imply that the LSB must be calculated in the first cycle. In fact, the multiplication fragments can be computed in any order, as there are no data dependences among them. Only the addition fragments that produce the result bits of the original multiplication must keep their order, to propagate the carry signals adequately.

The advantages of this design technique are quite similar to those mentioned in the section above dealing with additions. For this reason, just the differences between them are cited in the following paragraphs.

(a) Area reduction in the required FUs. The execution of the new operation fragments can be performed using a multiplier and an adder, whose widths must equal the widths of the biggest multiplication and addition fragments, respectively. The total area of these two FUs and the routing resources required to share them across the execution cycles is usually considerably smaller than the area of one multiplier of the original multiplication width.

(b) Area reduction in the required storage units.
Although several multiplication fragments may require the same input operands, these are not needed during every cycle of the execution. Every input operand bit may be available in several cycles, but it only needs to be stored until the last cycle it is used as an input operand. The storage requirements are reduced compared to the execution of the entire operation in a later cycle, or to the use of a multicycle multiplier, where the operand bits must be kept stable during all the cycles needed to complete the operation.

Fig. 14.4 Executions of one multiplication in three inconsecutive cycles

Figure 14.4 illustrates two different schedules of one m×n-bit multiplication in three inconsecutive cycles. The LSBs are calculated in cycle i+j, and the remaining ones in cycle i+j+k. Note that in both cases the multiplication fragment scheduled in cycle i does not produce any result bit. However, if the execution cycles of the fragments assigned to cycles i and i+j were exchanged, the schedules would produce result bits in cycles i and i+j+k. In both schedules some of the input operand bits computed in cycle i are not required in cycle i+j, and none of them are required in cycle i+j+k. Although these examples correspond to basic fragmentations of one multiplication into two smaller ones and one addition, other more complex ones (that produce more fragments) are also possible.

14.2.3 Allocation and Binding Techniques

The allocation techniques to reduce the HW waste focus on increasing the bit-level reuse of FUs. The maximum reuse of functional resources occurs when all the datapath FUs execute an operation of their same type and width in every cycle (no datapath resource is wasted). If the allocation phase takes place after the scheduling, then the maximum degree of FU reuse achievable depends on the given schedule.
Note that the maximum bit-level reuse of FUs can only be reached once a totally homogeneous distribution of operation bits among cycles is provided. In order to get the maximum FU reuse for a given schedule, the specification operations must be fragmented to obtain the same number of operations of equal type and width in every cycle. These fragmentations include the transformation of operations into several simpler ones whose types, representations, and widths may differ from those of the original. These transformations imply that one original operation will be executed distributed over several linked FUs. Its operands will also be transmitted and stored distributed over several routing and storage resources, respectively. The type of fragmentation performed is different for every operation type.

Fig. 14.5 Execution of one addition in two adders, and two additions in one adder

14.2.3.1 Fragmentation of Additions

The minimum HW waste in datapath adders occurs when all of them execute additions of their same width in at least one cycle. For every different schedule there may exist several implementations with the minimum HW waste, and thus similar adder area. The main difference resides in the fragmentation degree of the implementations. These circuits may present some of the following features:

(a) Execution of one addition over several adders. The set of adders used to execute every addition must be linked to propagate the carry signals.
(b) Execution of one addition over a wider adder. This requires the extension of the input operands and the discarding of some result bits. This technique is used to avoid excessive fragmentation in cycles with smaller computational cost.
(c) Execution of several additions in the same adder. Input operands of different operations must be separated with zeros to avoid the carry propagation, and the corresponding bits of the adder result discarded.
At least one zero must be used to separate every two operations, such that the minimum adder width equals the sum of the addition widths plus the number of operations minus one.

Figure 14.5 shows the execution of one addition over two adders, as well as the execution of two additions over the same adder. These trivial examples provide an insight into how these two design techniques can be applied to datapaths with more adders and operations.

14.2.3.2 Fragmentation of Multiplications

Like additions, the minimum HW waste in datapath multipliers occurs when all of them calculate multiplications of their same width in at least one clock cycle. Some of the desirable features of an implementation with minimum HW waste are:

(a) Execution of one multiplication over several multipliers and adders. The set of multipliers used to execute every multiplication must be connected to adders in some way (either chained or via intermediate registers) in order to sum the partial results and obtain the final one.
(b) Execution of one multiplication over a wider multiplier. This technique is used to avoid the excessive fragmentation of multiplications in cycles with smaller computational cost.
(c) Execution of several multiplications in the same multiplier. The input operands of different operations must be separated with several zeros to avoid interferences in the calculation matrix of the different multiplications. As the number of zeros needed equals the operand widths, this technique seems appropriate only when the computational costs of the multiplications scheduled in every cycle are quite different.

Fig. 14.6 Execution of one multiplication over two multipliers and one adder, and two multiplications in one multiplier

Figure 14.6 shows the execution of one multiplication over two multipliers and one adder, as well as the execution of two multiplications over the same multiplier.
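The packing of two additions into one wider adder, separated by a single zero guard bit (feature (c) above, Fig. 14.5), can be sketched as follows (Python used only for illustration; the widths are hypothetical):

```python
def two_adds_on_one_adder(a1, b1, w1, a2, b2, w2):
    """Run two independent additions (w1- and w2-bit) on a single
    (w1+w2+1)-bit adder; one zero guard bit between the packed operands
    catches the carry-out of the first addition."""
    a = (a2 << (w1 + 1)) | a1   # pack: [ op2 | 0 | op1 ]
    b = (b2 << (w1 + 1)) | b1
    s = a + b                   # one adder operation
    s1 = s & ((1 << w1) - 1)    # first sum (its carry sits in the guard bit, discarded)
    s2 = s >> (w1 + 1)          # second sum
    return s1, s2
```

Because both packed operands hold a zero in the guard position, the carry of the first addition stops there and cannot corrupt the second result.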
Note here that the unification of operations to be executed in the same FU produces an excessive internal waste of the functional resource, which makes this technique unviable in most designs. By contrast, fragmentation does not produce any additional HW waste, and can be used to reduce the area of circuits with some HW waste in all the cycles.

14.3 Applications to Scheduling Algorithms

The scheduling techniques proposed to reduce the HW waste can be easily implemented in most HLS algorithms. This requires the common kernel extraction of the specification operations and their successive transformation into several simpler ones, to obtain balanced distributions in the number of bits computed per cycle.

This section presents a variant of the classical force-directed scheduling algorithm proposed by Paulin and Knight [4] that includes some bit-level design techniques to reduce the HW waste [3]. The intent of the original method is to minimize the HW cost subject to a given time constraint by balancing the number of operations executed per cycle. For every different type of operation, the algorithm successively selects, among all operations and all execution cycles, an (operation, cycle) pair according to an estimate of the circuit cost called force.

By contrast, the intent of the proposed variant is to minimize the HW cost by balancing the number of bits of every different operation type calculated per cycle. This method successively selects, among all operations (multiplications first and additions afterwards) and all execution cycles, a pair formed by an operation (or an operation fragment) and an execution cycle, according to a new force definition that takes into account the widths of operations. It also uses a new parameter, called bound, to decide whether an operation should be either scheduled complete, or fragmented with just one fragment scheduled in the selected cycle.
14.3.1 Force Calculation

Our redefinition of the force measure used in the classical force-directed algorithm does not consider all operations equally. Instead, it gives a different weight to every operation as a function of its computational cost, leading to a more uniform distribution of the computational costs of operations among cycles. The proposed algorithm calculates the forces using a set of new distribution graphs (DGs) that represent the distribution of the computational cost of operations. For each DG, the distribution in clock cycle c is given by:

DG(τ,c) = Σ op∈EOPτ,c  COST(op) · P(op,c)

where

• COST(op): computational cost of operation op. For additions it is defined as the width of the widest operand, and for multiplications as the product of the widths of its operands:

COST(op) = Max(width(x), width(y))   if op ≡ x + y
COST(op) = width(x) · width(y)       if op ≡ x × y

• P(op,c): probability of scheduling operation op in cycle c. It is defined as in the classical method.
• EOPτ,c (estimated operations of type τ in cycle c): set of operations of type τ whose mobility makes their scheduling in cycle c possible.

Except for the DG redefinition, the force associated with an operation-to-cycle binding is calculated as in the classical method. The smaller or more negative the force expression is, the more desirable an operation-to-cycle binding becomes. In each iteration, the algorithm selects the operation and cycle with the lowest force, thus scheduling first the operations with less mobility in the cycles with the smallest sum of computational costs, i.e. with the smallest DG(τ,c).

14.3.2 Bound Definition

Our algorithm binds operations (or fragments) to cycles subject to an upper bound that represents the most uniform distribution of the operation bits of a certain type among cycles reachable at every moment. It is used to decide whether an operation should be fragmented and how.
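The cost-weighted distribution graphs of Sect. 14.3.1 can be sketched as follows (Python used only for illustration; operations are represented as small dictionaries, and P(op,c) is taken as the uniform probability over the ASAP-ALAP mobility window, as in the classical method):

```python
def cost(op):
    """COST(op): widest operand for additions, width product for multiplications."""
    return max(op['wx'], op['wy']) if op['type'] == '+' else op['wx'] * op['wy']

def dg(ops, kind, c):
    """DG(kind, c): cost-weighted sum of the scheduling probabilities in cycle c."""
    total = 0.0
    for op in ops:
        # op belongs to EOP_kind,c if its type matches and c lies in its mobility window
        if op['type'] == kind and op['asap'] <= c <= op['alap']:
            p = 1.0 / (op['alap'] - op['asap'] + 1)   # uniform P(op, c)
            total += cost(op) * p
    return total

# illustrative operation set: one 8x4 multiplication, one 8-bit addition
ops = [
    {'type': 'x', 'wx': 8, 'wy': 4, 'asap': 1, 'alap': 2},
    {'type': '+', 'wx': 8, 'wy': 8, 'asap': 1, 'alap': 3},
]
```

For this set, dg(ops, 'x', 1) weights the 32-bit-cost multiplication by its 1/2 scheduling probability, giving 16.0.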
The initial value of the bound corresponds to the most uniform distribution of the operation bits of type τ among cycles, without considering data dependences:

bound = (Σ op∈OPτ COST(op)) / λ

where

• OPτ: set of specification operations of type τ.
• λ: circuit latency.

The perfect distribution defined by the bound is not always reachable, due to the data dependences among operations and the given time constraints. Therefore, it is updated during the scheduling in the way explained below.

If the number of bits already scheduled in the selected cycle c plus the computational cost of the selected operation op is equal to or lower than the bound, then the operation is scheduled there:

CCS(τ,c) + COST(op) ≤ bound

where

• CCS(τ,c): sum of the computational costs of the type τ operations scheduled in cycle c:

CCS(τ,c) = Σ op∈SOPτ,c COST(op)

• SOPτ,c: set of operations of type τ scheduled in cycle c.

Otherwise, the operation is fragmented and one fragment is scheduled in the selected cycle (the remaining fragments stay unscheduled). The computational cost of the fragment to be scheduled, CCostFrag, equals the value needed to reach the bound in that cycle:

CCostFrag = bound − CCS(τ,c)

In order to avoid a reduction in the mobility of the predecessors and successors of the fragmented operation, they are also fragmented.
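The bound computation and the schedule-or-fragment decision above can be sketched as follows (Python used only for illustration; operations are reduced to their computational costs, and the function names are hypothetical):

```python
def initial_bound(costs, latency):
    """bound: sum of COST(op) over all ops of one type, divided by the latency."""
    return sum(costs) / latency

def place(op_cost, ccs_c, bound):
    """Schedule the op complete in cycle c if CCS(t,c) + COST(op) <= bound;
    otherwise schedule only a fragment of cost CCostFrag = bound - CCS(t,c).
    Returns (cost scheduled now, cost left unscheduled)."""
    if ccs_c + op_cost <= bound:
        return op_cost, 0                 # scheduled complete
    frag = bound - ccs_c                  # CCostFrag
    return frag, op_cost - frag           # one fragment now, the rest stays unscheduled
```

For instance, with additions of costs 12, 12, 8, and 8 and a latency of four cycles, the initial bound is 10 bits per cycle, so a 12-bit addition selected for an empty cycle is split into a 10-bit fragment scheduled there and a 2-bit remainder.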
