High Level Synthesis: from Algorithm to Digital Circuit - P6


Fig. 3.3 The Gantt chart

In the Catapult flow, the generation of RTL is accomplished in a matter of minutes. Catapult generates VHDL, Verilog or SystemC netlists, based on user settings. Various reports are also produced, providing both hardware-centric and algorithm-centric information about the design's characteristics.

Finally, Catapult provides an integrated verification flow that automates the process of validating the HDL netlist(s) output from Catapult against the original C/C++ input. This is accomplished by wrapping the netlist output with a SystemC "foreign module" class and instantiating it along with the original C/C++ code and testbench in a SystemC design. The same input stimuli are applied to both the original and the synthesized code, and a comparator at each output validates that the outputs from both are identical (Fig. 3.4). The flow automatically generates all of the SystemC code to provide interconnect and synchronization signals, Makefiles to perform compilation, as well as scripts to drive the simulation.

Fig. 3.4 Catapult Synthesis' automatic verification flow

3.3 Coding and Optimizing a Design with Catapult Synthesis

This section provides an overview of the various controls the user can leverage to efficiently synthesize his design.

3.3.1 Coding C/C++ for Synthesis

The coding style used for functional specification is plain C++ that provides a sequential implementation of the behavior without any notion of timing or concurrency. Both the syntax and the semantics of the C++ language are fully preserved.

3.3.1.1 General Constructs and Statements

Catapult supports a very broad subset of the ANSI C++ language. The synthesized top-level C/C++ function may call other sub-functions, which may be inlined or kept as a level of hierarchy. The design may also contain static variables that keep state between invocations of the function. "if" and "switch" conditional statements are supported, as well as "for", "do" and "while" looping statements. "break", "continue" and "return" branching statements are synthesizable as well. The only noticeable restriction is that the code must be statically determinable, meaning that all its properties must be defined at compilation time. As such, dynamic memory allocation/deallocation (malloc, free, new, delete) is not supported.

3.3.1.2 Pointers

Pointers are synthesizable if they point to statically allocated objects and can therefore be converted into array indexes. Pointer arithmetic is also supported, and a pointer can point to several objects inside an array.

3.3.1.3 Classes and Templates

Compound data types such as classes, structs and arrays are fully supported for synthesis. Furthermore, parameterization through C++ templates is also supported. The combination of classes and templates provides a powerful mechanism facilitating design re-use.

The example in Fig. 3.5 gives an overview of some of the coding possibilities allowed by the Catapult synthesizable subset. A struct is defined to model an RGB pixel. The struct is templatized so users can define the actual bitwidth of the R, G and B fields. Additionally, a method is defined which returns a grayscale value from the RGB pixel. The synthesized design is the "convert_to_gray" function. It is implemented as a loop which reads RGB pixels one by one from an input array, calls the "to_gray" method to compute the result and assigns it to the output array using pointer arithmetic.

Fig. 3.5 Coding style example
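The code of Fig. 3.5 is not reproduced here; the following is a minimal sketch of that coding style, assuming the bit-accurate ac_int type described in the next subsection for the color fields. The field width, the number of pixels and the grayscale weights are illustrative choices.

#include <ac_int.h>

// Templatized RGB pixel: W is the bit-width of each color field.
template <int W>
struct rgb_t {
  ac_int<W, false> r, g, b;

  // Grayscale value from the R, G and B fields (~0.30R + 0.59G + 0.11B).
  ac_int<W, false> to_gray() const {
    ac_int<W + 10, false> acc = 77 * r + 150 * g + 29 * b;  // weights sum to 256
    return acc >> 8;                                        // divide by 256
  }
};

// Synthesized top-level function: reads RGB pixels one by one from the input
// array, converts each one and assigns the result using pointer arithmetic.
void convert_to_gray(rgb_t<8> in[64], ac_int<8, false> *out) {
  CONVERT: for (int i = 0; i < 64; i++) {
    *(out + i) = in[i].to_gray();
  }
}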
3.3.1.4 Bit-Accurate Data Types

Hardware designers are accustomed to bit-accurate datatypes in hardware design languages such as VHDL and Verilog. Similarly, bit-accurate data types are needed to synthesize area-efficient hardware from C models. The arbitrary-length bit-accurate integer and fixed-point "Algorithmic C" datatypes provide an easy way to model static bit-precision with minimal runtime overhead. Operators and methods on both the integer and fixed-point types are clearly and consistently defined so that they have well-defined simulation and synthesis semantics.

The precision of the integer type ac_int<W,S> is determined by the template parameters W (an integer that gives the bit-width) and S (a boolean that determines whether the integer is signed or unsigned). The fixed-point type ac_fixed<W,I,S,Q,O> has five template parameters which determine its bit-width, the location of the fixed point, whether it is signed or unsigned, and the quantization and overflow modes that are applied when constructing or assigning to objects of its type.

The advantages of the Algorithmic C datatypes over the existing integer and fixed-point datatypes are the following:

• Arbitrary-Length: this allows a clean definition of the semantics for all operators that are not tied to an implementation limit. It is also important for writing general IP algorithms that don't have artificial (and often hard to quantify and document) limits for precision.

• Precise Definition of Semantics: special attention has been paid to define and verify the simulation semantics and to make sure that the semantics are appropriate for synthesis. No simulation behavior has been left compiler-dependent. Also, asserts have been introduced to catch invalid code during simulation.

• Simulation Speed: the implementation of ac_int uses sophisticated template specialization techniques so that a regular C++ compiler can generate optimized assembly language that runs much faster than the equivalent SystemC datatypes. For example, ac_int of bit-widths in the range 1–32 can run 100× faster than the corresponding sc_bigint/sc_biguint datatypes and 3× faster than the corresponding sc_int/sc_uint datatypes.

• Correctness: the simulation and synthesis semantics have been verified for many size combinations using a combination of simulation and equivalence checking.

• Compilation Speed and Smaller Executables: code written using ac_int datatypes compiles 5× faster even with compiler optimizations turned on (required for fast simulation). It also produces smaller binary executables.

• Consistency: the semantics of ac_int and ac_fixed are consistent.

In addition to the Algorithmic C datatypes, Catapult Synthesis also supports the C++ native types (bool, char, short, int and long) as well as the SystemC sc_int, sc_bigint and sc_fixed types and their unsigned versions.
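As a brief illustration, the sketch below declares an ac_int and two ac_fixed objects; the widths, modes and values are illustrative and not taken from the chapter.

#include <ac_int.h>
#include <ac_fixed.h>

int main() {
  ac_int<10, false> idx = 700;                        // 10-bit unsigned integer

  // 18-bit signed fixed-point with 2 integer bits (16 fractional bits),
  // rounded on assignment (AC_RND) and saturated on overflow (AC_SAT).
  ac_fixed<18, 2, true, AC_RND, AC_SAT> gain = 0.299;
  ac_fixed<18, 2, true, AC_RND, AC_SAT> sat  = 5.0;   // saturates just below 2.0

  // Integer arithmetic is arbitrary-length: the product of two 10-bit values
  // is returned as a 20-bit ac_int, so the multiplication cannot overflow.
  ac_int<20, false> sq = idx * idx;

  // Sanity checks on the expected semantics.
  return (sq.to_uint() == 490000u && sat.to_double() < 2.0) ? 0 : 1;
}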
3.3.2 Synthesizing the Design Interface

3.3.2.1 Hardware Interface View of the Algorithm

The design interface is how a hardware design communicates with the rest of the world. In the C/C++ source code, the arguments passed to the top-level function infer the interface ports. Catapult can infer three types of interface ports:

• Input Ports transfer data from the rest of the world to the design. All inputs are either non-pointer arguments passed to the function or pointer arguments that are read only.

• Output Ports transfer data from the design to the rest of the world. Structure or pointer arguments infer output ports if the design writes to them but does not read from them.

• Inout Ports transfer data both to and from the design. These are pointer arguments that are both written and read.

3.3.2.2 Interface Synthesis

Catapult builds a correspondence between the arguments of the C/C++ function and the I/Os of the hardware design. Once this correspondence is established, the designer uses interface synthesis constraints to specify the properties of each hardware port. With this approach, designers can target and build any kind of hardware interface.

Interface synthesis directives give users control over parameters such as bandwidth, timing, handshake and other protocol aspects. This way the synthesized C/C++ algorithm remains purely functional and does not have to embed any kind of interface-specific information. The same code can be retargeted based on any interface requirement (bandwidth, protocol, etc.). Amongst other transformations and constraints, the user can for instance:

• Define full, partial or no handshake on interface signals
• Map arrays to wires, memories, busses or streams
• Control the bandwidth (bitwidth) of the hardware ports
• Add optional start/done flags to the design
• Define custom interface protocols

Hardware-specific I/O signals such as clock, reset, enable and handshaking signals do not need to be modeled either; they are added automatically based on user constraints.

3.3.3 Loop Controls

3.3.3.1 Loop Unrolling

Loop unrolling exposes parallelism that exists across subsequent iterations of a loop by partially or fully unrolling the loop. The example in Fig. 3.6 features a simple loop summing two vectors of four values.

Fig. 3.6 Unrolling defines how many times to copy the body of a loop

If the loop is kept rolled, Catapult will generate a serial architecture. As shown on the left, a single adder will be allocated to implement the four additions. The adder is therefore time-shared, and dedicated control logic is built accordingly. Assuming the mux, add and demux logic can fit in the desired clock period, four cycles are needed to compute the results.

On the right-hand side, the same design is synthesized with its loop fully unrolled. Unrolling is applied by setting a synthesis constraint and has the same effect as copying the loop body four times. Catapult can now exploit the operation-level parallelism to build a fully parallel implementation of the same algorithm. The resulting architecture requires four adders to implement the four additions and has a latency of one clock cycle.

Partial unrolling may also be used to trade off the area, power and performance of the resulting design. In the above example, an unrolling factor of 2 would cause the loop body to be copied twice and the number of loop iterations to be halved. The synthesized solution would therefore be built with two adders and have a latency of two cycles.
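The rolled loop of Fig. 3.6 might be written as follows (the function and loop names are illustrative). Note that the unrolling decision itself is not expressed in the source code but applied as a synthesis constraint on the loop, so the very same code can produce the serial, the fully parallel or the partially unrolled architecture.

// Rolled vector sum in the spirit of Fig. 3.6.
void vector_sum(int a[4], int b[4], int sum[4]) {
  SUM: for (int i = 0; i < 4; i++) {
    sum[i] = a[i] + b[i];
  }
}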
3.3.3.2 Loop Merging

Loop merging exploits loop-level parallelism. This technique applies to sequential loops and creates a single loop with the same functionality as the original loops. The transformation is used to reduce latency and area consumption in a design by allowing parallel execution, where possible, of loops that would normally execute in series. With loop merging, algorithm designers can develop their application in a very natural way, without having to worry about potential parallelism in the hardware implementation.

In Fig. 3.7, the code contains sequential loops. Sequential loops are very convenient for modeling the various processing stages of an algorithm. By enabling or disabling loop merging, the designer decides whether, in the generated hardware, the loops should run in parallel (merging enabled) or sequentially (merging disabled). With this technique, the designer maintains the readability and hardware independence of his source code. The transformation and optimization techniques in Catapult can produce a parallelized design which would otherwise have required a much more convoluted source description, as shown on the right-hand side.

Fig. 3.7 Merging parallelizes sequential loops

It should also be noted in this example that Catapult is able to appropriately optimize the intermediate data storage. When the two loops are processed sequentially, intermediate storage is needed to hold the values of "a". When the two loops are parallelized, values of "a" produced in the first loop can be consumed directly by the second loop, removing the need for storage.

3.3.3.3 Loop Pipelining

Loop pipelining provides a way to increase the throughput of a loop (or to decrease its overall latency) by initiating the next iteration of the loop before the current iteration has completed. Overlapping the execution of subsequent iterations of a loop exploits parallelism across loop iterations. The number of cycles between iterations of the loop is called the initiation interval. In many cases loop pipelining may improve resource utilization, thus increasing the performance/area metric of the design.

In the example of Fig. 3.8, a loop iteration consists of four operations: an I/O read of in[i], a multiplication with coef1, an addition with coef2, and finally an I/O write to out[i]. Assuming that each of these operations executes in a clock cycle, and if no loop constraints are applied, the design schedule will look as shown on the left-hand side. Each operation happens sequentially, and the start of a loop iteration (shown here with the red triangle) happens after the previous iteration completes. Conceptually, the pipeline initiation interval is equal to the latency of a loop iteration, in this case four cycles.

Fig. 3.8 Pipelining defines when to initiate the next iteration of a loop

By constraining the initiation interval with loop pipelining, designers determine when to start each loop iteration relative to the previous one. The schedule on the right-hand side illustrates the same loop, pipelined with an initiation interval of one cycle: the second loop iteration starts one cycle after the first one.

Pipelining a design directly impacts the data rate of the resulting hardware implementation. The first solution makes one I/O access every four cycles, while the second one makes I/O accesses every cycle. Some applications may require a given throughput, thereby dictating the initiation interval constraint. Other designs may tolerate some flexibility, allowing the designers to explore different pipelining scenarios, trading area, bandwidth utilization as well as power consumption.
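A loop shaped like the one in Fig. 3.8 is sketched below; the names and the array size are illustrative. As with unrolling, the initiation interval is not written in the source but set as a pipelining constraint on the loop in Catapult.

// Read in[i], multiply by coef1, add coef2, write out[i].
// Pipelining the MAC loop with an initiation interval of 1 starts a new
// iteration, and hence a new pair of I/O accesses, every clock cycle.
void scale_offset(const int in[64], int out[64], int coef1, int coef2) {
  MAC: for (int i = 0; i < 64; i++) {
    out[i] = in[i] * coef1 + coef2;
  }
}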
3.3.4 Hierarchical Synthesis

The proper integration of individual blocks into a sub-system is one of the major challenges in chip design. With its hierarchical synthesis capability, Catapult Synthesis can greatly simplify the design and integration tasks, building complex multi-block systems correct-by-construction.

While loop unrolling exploits instruction-level parallelism and loop merging exploits loop-level parallelism, hierarchy exploits function-level (task-level) parallelism. In Catapult, the user can specify which function calls should be synthesized as hierarchical units. The arguments of the hierarchical functions define the data flow of the system, and Catapult builds all the inter-block communication and synchronization logic.

Hierarchy generalizes the notion of pipelining, allowing different functions to run in a parallel and pipelined manner. In complex systems consisting of various processing stages, hierarchy is very useful to meet design throughput constraints. When pipelining hierarchical systems, Catapult builds a design where the execution of the various functions overlaps in time. As shown in Fig. 3.9, in the sequential source code, the three functions (stage1, stage2 and stage3) execute one after the other. In the resulting hierarchical system, the second occurrence of stage1 can start together with the first occurrence of stage2, as soon as the first occurrence of stage1 ends.

Fig. 3.9 Task-overlapping with hierarchical synthesis
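A top-level function in the spirit of Fig. 3.9 is sketched below. The stage names, array sizes and per-stage computations are illustrative placeholders; the point is that when the three calls are marked as hierarchical units, the arrays passed between them become inter-block channels and the stages can execute in an overlapped, pipelined fashion.

// Three placeholder processing stages.
static void stage1(const int in[64], int out[64]) {
  for (int i = 0; i < 64; i++) out[i] = in[i] + 1;
}
static void stage2(const int in[64], int out[64]) {
  for (int i = 0; i < 64; i++) out[i] = in[i] * 2;
}
static void stage3(const int in[64], int out[64]) {
  for (int i = 0; i < 64; i++) out[i] = in[i] - 3;
}

// Sequential source code: the arrays passed between the stages define the
// data flow that Catapult turns into inter-block communication.
void system_top(const int in[64], int out[64]) {
  int tmp1[64], tmp2[64];
  stage1(in, tmp1);
  stage2(tmp1, tmp2);
  stage3(tmp2, out);
}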
3.3.5 Technology-Driven Scheduling and Allocation

Scheduling and allocation is the process of building and optimizing the design given all the user constraints, including the specified clock period and target technology. The clock period defines the maximum register-to-register path, while the technology defines the logic delay of each design operation. The design schedule is therefore intimately tied to these clock and technology constraints (Fig. 3.10). This is fundamental to building optimized RTL implementations, allowing efficient retargeting of algorithmic specifications from one ASIC process to another, or even to FPGAs, with consistently optimal results.

Fig. 3.10 Technology-driven scheduling and allocation

This capability opens new possibilities in the field of IP and reuse. While RTL reuse can provide a quick path to the desired functionality, it often comes at the expense of suboptimal results. RTL IPs may be reused over many years. Developed on older processes, these IPs will certainly work on newer ones, but without taking advantage of higher speeds and better densities, therefore resulting in bigger and slower implementations than necessary. In contrast, Catapult can build optimized RTL designs from functional IPs for each process generation, taking reuse to a new level of efficiency.

3.4 Case Study: JPEG Encoder

In this section we will show how a sub-system such as a JPEG encoder can be synthesized with Catapult Synthesis. We chose a JPEG encoder design for this case study, as we felt that the application would be sufficiently familiar to most readers to be easily understood without extensive explanations. Moreover, such an encoder features a pedagogical mix of datapath and control blocks, giving a good overview of Catapult Synthesis' capabilities.

3.4.1 JPEG Encoder Overview

The pixel pipe of the encoder (Fig. 3.11) can be broken down into four main stages: first, the RGB to YCbCr color space conversion block; second, the DCT (discrete cosine transform); third, zigzag reordering combined with quantization; and last, the Huffman encoder.

Fig. 3.11 JPEG encoder block diagram

3.4.2 The Top Level Function

The top-level function synthesized by Catapult (Fig. 3.12) closely resembles the system block diagram. Four sub-functions implement the four processing stages of the algorithm. The sub-functions simply pass arrays to each other, mimicking the system data flow.

3.4.3 The Color Space Conversion Block

The color space conversion unit is implemented as a relatively straightforward vector multiplication. Different sets of coefficients are used for the Y, Cb and Cr components. A sketch of what such a per-pixel conversion might look like is given below.
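The following sketch shows an RGB to YCbCr conversion written as a vector multiplication with one coefficient set per output component. The coefficient values, bit-widths and rounding modes are illustrative assumptions, not the ones used in the actual case study.

#include <ac_int.h>
#include <ac_fixed.h>

typedef ac_int<8, false> uint8;
typedef ac_fixed<10, 1, true, AC_RND, AC_SAT> coef_t;   // signed coefficients in [-1, 1)

void rgb_to_ycbcr(uint8 r, uint8 g, uint8 b, uint8 &y, uint8 &cb, uint8 &cr) {
  // One set of coefficients per output component (BT.601-style values).
  const coef_t cy[3]  = { 0.299,  0.587,  0.114 };
  const coef_t ccb[3] = {-0.169, -0.331,  0.500 };
  const coef_t ccr[3] = { 0.500, -0.419, -0.081 };

  int ri = r.to_int(), gi = g.to_int(), bi = b.to_int();

  // Vector multiplication for each component; Cb and Cr are offset by 128.
  ac_fixed<20, 10, true> ay  = cy[0]*ri  + cy[1]*gi  + cy[2]*bi;
  ac_fixed<20, 10, true> acb = ccb[0]*ri + ccb[1]*gi + ccb[2]*bi + 128;
  ac_fixed<20, 10, true> acr = ccr[0]*ri + ccr[1]*gi + ccr[2]*bi + 128;

  y  = ay.to_int();
  cb = acb.to_int();
  cr = acr.to_int();
}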
