59 VLSI Architectures for Image Communications

P. Pirsch et al., "VLSI Architectures for Image Communications," 2000 CRC Press LLC.

P. Pirsch, Laboratorium für Informationstechnologie, University of Hannover
W. Gehrke, Philips Semiconductors

59.1 Introduction
59.2 Recent Coding Schemes
59.3 Architectural Alternatives
59.4 Efficiency Estimation of Alternative VLSI Implementations
59.5 Dedicated Architectures
59.6 Programmable Architectures
     Intensive Pipelined Architectures • Parallel Data Paths • Coprocessor Concept
59.7 Conclusion
Acknowledgment
References

59.1 Introduction

Video processing has been a rapidly evolving field for the telecommunications, computer, and media industries. In particular, a growing economic significance is expected for real-time video compression applications over the next years. Besides digital TV broadcasting and videophone, services such as multimedia education, teleshopping, or video mail will become audiovisual mass applications. To facilitate the worldwide interchange of digitally encoded audiovisual data, there is a demand for international standards defining coding methods and transmission formats.

International standardization committees have been working on the specification of several compression schemes. The Joint Photographic Experts Group (JPEG) of the International Standards Organization (ISO) has specified an algorithm for the compression of still images [4]. The ITU proposed the H.261 standard for video telephony and video conferencing [1]. The Moving Picture Experts Group (MPEG) of ISO has completed its first standard, MPEG-1, which will be used for interactive video and provides a picture quality comparable to VCR quality [2]. MPEG has made substantial progress on the second phase of standards, MPEG-2, which will provide audiovisual quality for both broadcast TV and HDTV [3]. Besides the availability of international standards, the successful introduction of the named services depends on the availability of VLSI components that support a cost-efficient implementation of video compression applications.

In the following, we give a short overview of recent coding schemes and discuss implementation alternatives. Furthermore, the efficiency estimation of architectural alternatives is discussed, and implementation examples of dedicated and programmable architectures are presented.

59.2 Recent Coding Schemes

Recent video coding standards are based on a hybrid coding scheme that combines transform coding and predictive coding techniques. An overview of these hybrid encoding schemes is depicted in Fig. 59.1.

FIGURE 59.1: Hybrid encoding and decoding scheme.

The encoding scheme consists of the tasks motion estimation (typically based on block-matching algorithms), computation of the prediction error, discrete cosine transform (DCT), quantization (Q), variable length coding (VLC), inverse quantization (Q⁻¹), and inverse discrete cosine transform (IDCT or DCT⁻¹). The reconstructed image data are stored in an image memory for further predictions. The decoder performs the tasks variable length decoding (VLC⁻¹), inverse quantization, and motion compensated reconstruction.
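The order of these tasks, including the local decoder loop, can be summarized in a short structural sketch. The following C fragment only illustrates the control flow of Fig. 59.1; every type and function in it is a hypothetical placeholder named after the corresponding task, not an API of any of the cited standards.

```c
/* Structural sketch of the hybrid encoder loop of Fig. 59.1 for one block.
 * All types and functions are hypothetical placeholders for the tasks of
 * the coding scheme; rate control, headers, and error handling are omitted. */
typedef struct Frame Frame;                 /* image memory (opaque here)   */
typedef struct Bitstream Bitstream;
typedef struct { short pel[8][8]; } Block;
typedef struct { int x, y; } MotionVector;

MotionVector motion_estimate(Block cur, const Frame *ref); /* block matching */
Block motion_compensate(const Frame *ref, MotionVector mv);
Block block_sub(Block a, Block b);          /* prediction error             */
Block block_add(Block a, Block b);
Block dct(Block b);                         /* DCT      (low-level task)    */
Block idct(Block b);                        /* DCT^-1                       */
Block quant(Block b);                       /* Q        (medium-level task) */
Block dequant(Block b);                     /* Q^-1                         */
void  vlc(Bitstream *bs, MotionVector mv, Block q);  /* VLC                 */
void  store(Frame *ref, Block rec);         /* image memory for prediction  */

void encode_block(Block cur, Frame *ref, Bitstream *bs)
{
    MotionVector mv = motion_estimate(cur, ref);
    Block pred = motion_compensate(ref, mv);
    Block q    = quant(dct(block_sub(cur, pred)));
    vlc(bs, mv, q);
    /* Local decoder: the reconstructed block is stored in the image
       memory and serves as reference for further predictions. */
    store(ref, block_add(pred, idct(dequant(q))));
}
```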
Generally, video processing algorithms can be classified in terms of the regularity of computation and data access. This classification leads to three classes of algorithms:

• Low-Level Algorithms — These algorithms are based on a predefined sequence of operations and a predefined amount of data at the input and output. The processing sequence of low-level algorithms is predefined and does not depend on the values of the data processed. Typical examples of low-level algorithms are block matching or transforms such as the DCT.

• Medium-Level Algorithms — The sequence and number of operations of medium-level algorithms depend on the data. Typically, the amount of input data is predefined, whereas the amount of output data varies according to the input data values. With respect to hybrid coding schemes, examples of these algorithms are quantization, inverse quantization, or variable length coding.

• High-Level Algorithms — High-level algorithms are associated with a variable amount of input and output data and a data-dependent sequence of operations. As for medium-level algorithms, the sequence of operations is highly data dependent. Control tasks of the hybrid coding scheme can be assigned to this class.

Since hybrid coding schemes are applied to different video source rates, the required absolute processing power varies in the range from a few hundred MOPS (mega operations per second) for video signals in QCIF format to several GOPS (giga operations per second) for the processing of TV or HDTV signals. Nevertheless, the relative computational power of each algorithmic class is nearly independent of the processed video format. In the case of hybrid coding applications, approximately 90% of the overall processing power is required for low-level algorithms; the amount for medium-level tasks is about 7%, and nearly 3% is required for high-level algorithms.

59.3 Architectural Alternatives

In terms of a VLSI implementation of hybrid coding applications, two major requirements can be identified. First, the high computational power requirements have to be provided by the hardware. Second, a low manufacturing cost of the video processing components is essential for the economic success of an architecture. Additionally, implementation size and architectural flexibility have to be taken into account.

Implementations of video processing applications can be based either on standard processors from workstations or PCs or on specialized video signal processors. The major advantage of standard processors is their availability: applying these architectures to the implementation of video processing hardware does not require the time-consuming design of new VLSI components. The disadvantage of this implementation strategy is the insufficient processing power of recent standard processors; video processing applications would still require the implementation of cost-intensive multiprocessor systems to meet the computational requirements. To achieve compact implementations, video processing hardware has to be based on video signal processors adapted to the requirements of the envisaged application field.

Basically, two architectural approaches for the implementation of specialized video processing components can be distinguished. Dedicated architectures aim at an efficient implementation of one specific algorithm or application. Due to the restriction of the application field, the architecture of dedicated components can be optimized by an intensive adaptation to the requirements of the envisaged application, e.g., the arithmetic operations that have to be supported, the processing power, or the communication bandwidth. Thus, this strategy will generally lead to compact implementations. The major disadvantage of dedicated architectures is their low flexibility: dedicated components can be applied to only one or a few applications.
In contrast to dedicated approaches with limited functionality, programmable architectures enable the processing of different algorithms under software control. The particular advantage of programmable architectures is their increased flexibility. Changes of architectural requirements, e.g., due to changes of algorithms or an extension of the aimed application field, can be handled by software changes; a generally cost-intensive redesign of the hardware can thus be avoided. Moreover, since programmable architectures cover a wider range of applications, they can be used for low-volume applications, where the design of function-specific VLSI chips is not an economical solution.

For both architectural approaches, the computational requirements of video processing applications demand the exploitation of the algorithm-inherent independence of the basic arithmetic operations to be performed. Independent operations can be processed concurrently, which enables a decrease of the processing time and thus an increased throughput rate. For the architectural implementation of concurrency, two basic strategies can be distinguished: pipelining and parallel processing.

In the case of pipelining, several tasks, operations, or parts of operations are processed in subsequent steps in different hardware modules. Depending on the granularity level selected for the implementation of pipelining, the intermediate data of each step are stored in registers, register chains, FIFOs, or dual-port memories. Assuming a processing time of $T_P$ for a non-pipelined processor module and $T_{D,IM}$ for the delay of the intermediate memories, we get in the ideal case the following estimate for the throughput rate $R_{T,\mathrm{pipe}}$ of a pipelined architecture applying $N_{\mathrm{pipe}}$ pipeline stages:

$$R_{T,\mathrm{pipe}} = \frac{1}{T_P/N_{\mathrm{pipe}} + T_{D,IM}} = \frac{N_{\mathrm{pipe}}}{T_P + N_{\mathrm{pipe}} \cdot T_{D,IM}} \qquad (59.1)$$

From this it follows that the major limiting factor for the maximum applicable degree of pipelining is the access delay of the intermediate memories.

The alternative to pipelining is the implementation of parallel units that process independent data concurrently. Parallel processing can be applied on the operation level as well as on the task level. Assuming the ideal case, this strategy leads to a linear increase of processing power, and we get:

$$R_{T,\mathrm{par}} = \frac{N_{\mathrm{par}}}{T_P} \qquad (59.2)$$

where $N_{\mathrm{par}}$ is the number of parallel units. Generally, both alternatives are applied for the implementation of high-performance video processing components. In the following sections, the exploitation of algorithmic properties and the application of architectural concurrency are discussed considering the hybrid coding scheme.
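The different limiting behavior of Eqs. (59.1) and (59.2) can be made concrete with a small numerical model. In the following C fragment, the values of T_P and T_D,IM are assumptions chosen only to show the saturation effect of the intermediate-memory delay; they are not figures from the text.

```c
#include <stdio.h>

/* Throughput models of Eqs. (59.1) and (59.2); all times in ns. */
static double rate_pipelined(double t_p, double t_dim, int n_pipe)
{
    /* R_T,pipe = N_pipe / (T_P + N_pipe * T_D,IM) */
    return n_pipe / (t_p + n_pipe * t_dim);
}

static double rate_parallel(double t_p, int n_par)
{
    /* R_T,par = N_par / T_P */
    return n_par / t_p;
}

int main(void)
{
    const double t_p   = 100.0; /* assumed processing time of the module */
    const double t_dim =   2.0; /* assumed delay of intermediate memories */
    for (int n = 1; n <= 64; n *= 2)
        printf("N=%2d  pipelined: %6.3f ops/ns   parallel: %6.3f ops/ns\n",
               n, rate_pipelined(t_p, t_dim, n), rate_parallel(t_p, n));
    /* As N grows, the pipelined rate saturates at 1/T_D,IM = 0.5 ops/ns,
       while the parallel rate grows linearly (ideal case). */
    return 0;
}
```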
59.4 Efficiency Estimation of Alternative VLSI Implementations

Basically, architectural efficiency can be defined as the ratio of performance over cost. To achieve a figure of merit for architectural efficiency, we assume in the following that the performance of a VLSI architecture can be expressed by the achieved throughput rate $R_T$ and that the cost is equivalent to the silicon area $A_{Si}$ required for the implementation of the architecture:

$$E = \frac{R_T}{A_{Si}} \qquad (59.3)$$

Besides the architecture, the efficiency mainly depends on the applied semiconductor technology and the design style (semi-custom, full-custom). Therefore, a realistic efficiency estimation has to consider the gains provided by the progress in semiconductor technology. A sensible way is the normalization of the architectural parameters to a reference technology. In the following we assume a reference process with a grid length $\lambda_0 = 1.0$ micron. For the normalization of silicon area, the following equation can be applied:

$$A_{Si,0} = A_{Si} \left( \frac{\lambda_0}{\lambda} \right)^{2} \qquad (59.4)$$

where the index 0 denotes the system with reference gate length $\lambda_0$. According to [7], the normalization of throughput can be performed by:

$$R_{T,0} = R_T \left( \frac{\lambda}{\lambda_0} \right)^{1.6} \qquad (59.5)$$

From Eqs. (59.3), (59.4), and (59.5), the normalization of the architectural efficiency can be derived:

$$E_0 = \frac{R_{T,0}}{A_{Si,0}} = \frac{R_T}{A_{Si}} \left( \frac{\lambda}{\lambda_0} \right)^{3.6} \qquad (59.6)$$

$E_0$ can be used for the selection of the best architectural approach out of several alternatives. Moreover, assuming a constant efficiency for a specific architectural approach leads to a linear relationship of throughput rate and silicon area, and this relationship can be applied for the estimation of the silicon area required for a specific application. Due to the power of 3.6 in Eq. (59.6), the semiconductor technology chosen for the implementation of a specific application has a significant impact on the architectural efficiency.

In the following, examples of dedicated and programmable architectures for video processing applications are presented. Additionally, the discussed efficiency measure is applied to achieve a figure of merit for silicon area estimation.
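The normalization of Eqs. (59.4) to (59.6) amounts to two scale factors. The snippet below applies them to a hypothetical design; the example numbers are assumptions chosen purely for illustration.

```c
#include <stdio.h>
#include <math.h>

#define LAMBDA_REF 1.0  /* reference gate length lambda_0, micron */

/* Eq. (59.4): silicon area normalized to the reference process */
static double area_norm(double a_si, double lambda)
{ return a_si * pow(LAMBDA_REF / lambda, 2.0); }

/* Eq. (59.5): throughput normalized to the reference process */
static double rate_norm(double r_t, double lambda)
{ return r_t * pow(lambda / LAMBDA_REF, 1.6); }

int main(void)
{
    /* Assumed example design: 0.5 micron process, 50 mm^2, 40 Mpel/s */
    double lambda = 0.5, a = 50.0, r = 40.0;
    double e0 = rate_norm(r, lambda) / area_norm(a, lambda);  /* Eq. (59.6) */
    printf("E0 = %.3f (Mpel/s)/mm^2\n", e0);
    return 0;
}
```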
59.5 Dedicated Architectures

Due to their algorithmic regularity and their high processing power requirements, the discrete cosine transform and motion estimation are the first candidates for a dedicated implementation. As typical examples, alternatives for a dedicated implementation of these algorithms are discussed in the following.

The discrete cosine transform (DCT) is a real-valued frequency transform similar to the discrete Fourier transform (DFT). When applied to an image block of size $L \times L$, the two-dimensional DCT (2D-DCT) can be expressed as follows:

$$Y_{k,l} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} x_{i,j} \cdot C_{i,k} \cdot C_{j,l} \qquad (59.7)$$

where

$$C_{n,m} = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } m = 0 \\[4pt] \cos\left[\dfrac{(2n+1)m\pi}{2L}\right] & \text{otherwise} \end{cases}$$

with
$(i, j)$ = coordinates of the pixels in the initial block,
$(k, l)$ = coordinates of the coefficients in the transformed block,
$x_{i,j}$ = value of the pixel in the initial block,
$Y_{k,l}$ = value of the coefficient in the transformed block.

Computing a 2D-DCT of size $L \times L$ directly according to Eq. (59.7) requires $L^4$ multiplications and $L^4$ additions. The required processing power for the implementation of the DCT can be reduced by exploiting the arithmetic properties of the algorithm. The two-dimensional DCT can be separated into two one-dimensional DCTs according to Eq. (59.8):

$$Y_{k,l} = \sum_{i=0}^{L-1} C_{i,k} \cdot \left( \sum_{j=0}^{L-1} x_{i,j} \cdot C_{j,l} \right) \qquad (59.8)$$

The implementation of the separated DCT requires $2L^3$ multiplications and $2L^3$ additions. As an example, the DCT implementation according to [9] is depicted in Fig. 59.2. This architecture is based on two one-dimensional processing arrays. Since the architecture is based on a pipelined multiplier/accumulator implementation in carry-save technique, vector merging adders are located at the output of each array. The results of the first 1D-DCT have to be reordered for the second 1D-DCT stage; for this purpose, a transposition memory is used. Since both one-dimensional processor arrays require identical DCT coefficients, these coefficients are stored in a common ROM.

FIGURE 59.2: Separated DCT implementation according to [9].
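As a software cross-check of Eq. (59.8), the following C program computes an L × L DCT as two passes of one-dimensional transforms (row transforms followed by column transforms, the transposition being realized by the loop order). It is a numerical reference model using the coefficient definition of Eq. (59.7), not a description of the circuit of Fig. 59.2; as in Eq. (59.7), the overall scale factor is omitted.

```c
#include <math.h>
#include <stdio.h>

#define L 8
static const double PI = 3.14159265358979323846;

/* C[n][m] of Eq. (59.7): 1/sqrt(2) for m = 0, cos((2n+1)m*pi/2L) otherwise */
static double C[L][L];

static void init_coeffs(void)
{
    for (int n = 0; n < L; n++)
        for (int m = 0; m < L; m++)
            C[n][m] = (m == 0) ? 1.0 / sqrt(2.0)
                               : cos((2 * n + 1) * m * PI / (2.0 * L));
}

/* Separated 2-D DCT, Eq. (59.8): 2*L^3 multiplications instead of the
 * L^4 required by the direct form of Eq. (59.7). */
static void dct_2d(const double x[L][L], double y[L][L])
{
    double tmp[L][L];
    for (int i = 0; i < L; i++)            /* 1-D DCT of each row       */
        for (int l = 0; l < L; l++) {
            double s = 0.0;
            for (int j = 0; j < L; j++) s += x[i][j] * C[j][l];
            tmp[i][l] = s;
        }
    for (int k = 0; k < L; k++)            /* 1-D DCT of each column    */
        for (int l = 0; l < L; l++) {
            double s = 0.0;
            for (int i = 0; i < L; i++) s += tmp[i][l] * C[i][k];
            y[k][l] = s;
        }
}

int main(void)
{
    double x[L][L], y[L][L];
    init_coeffs();
    for (int i = 0; i < L; i++)            /* arbitrary test block      */
        for (int j = 0; j < L; j++) x[i][j] = i + j;
    dct_2d(x, y);
    printf("Y[0][0] = %f\n", y[0][0]);
    return 0;
}
```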
Moving from a mathematical definition to an algorithm that minimizes the number of required calculations is a problem of particular interest in the case of transforms such as the DCT. The 1D-DCT can also be expressed by the matrix-vector product

$$[Y] = [C][X] \qquad (59.9)$$

where $[C]$ is an $L \times L$ matrix and $[X]$ and $[Y]$ are 8-point input and output vectors. As an example, with $\theta = \pi/16$, the 8-point DCT matrix can be computed as denoted in Eq. (59.10):

$$\begin{pmatrix} Y_0\\Y_1\\Y_2\\Y_3\\Y_4\\Y_5\\Y_6\\Y_7 \end{pmatrix} =
\begin{pmatrix}
\cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta \\
\cos \theta & \cos 3\theta & \cos 5\theta & \cos 7\theta & -\cos 7\theta & -\cos 5\theta & -\cos 3\theta & -\cos \theta \\
\cos 2\theta & \cos 6\theta & -\cos 6\theta & -\cos 2\theta & -\cos 2\theta & -\cos 6\theta & \cos 6\theta & \cos 2\theta \\
\cos 3\theta & -\cos 7\theta & -\cos \theta & -\cos 5\theta & \cos 5\theta & \cos \theta & \cos 7\theta & -\cos 3\theta \\
\cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta & \cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta \\
\cos 5\theta & -\cos \theta & \cos 7\theta & \cos 3\theta & -\cos 3\theta & -\cos 7\theta & \cos \theta & -\cos 5\theta \\
\cos 6\theta & -\cos 2\theta & \cos 2\theta & -\cos 6\theta & -\cos 6\theta & \cos 2\theta & -\cos 2\theta & \cos 6\theta \\
\cos 7\theta & -\cos 5\theta & \cos 3\theta & -\cos \theta & \cos \theta & -\cos 3\theta & \cos 5\theta & -\cos 7\theta
\end{pmatrix}
\begin{pmatrix} x_0\\x_1\\x_2\\x_3\\x_4\\x_5\\x_6\\x_7 \end{pmatrix} \qquad (59.10)$$

Exploiting the symmetries of this matrix, the even-numbered and odd-numbered coefficients can be computed separately from the sums and differences of mirrored input samples:

$$\begin{pmatrix} Y_0\\Y_2\\Y_4\\Y_6 \end{pmatrix} =
\begin{pmatrix}
\cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta \\
\cos 2\theta & \cos 6\theta & -\cos 6\theta & -\cos 2\theta \\
\cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta \\
\cos 6\theta & -\cos 2\theta & \cos 2\theta & -\cos 6\theta
\end{pmatrix}
\begin{pmatrix} x_0+x_7\\x_1+x_6\\x_2+x_5\\x_3+x_4 \end{pmatrix} \qquad (59.11)$$

$$\begin{pmatrix} Y_1\\Y_3\\Y_5\\Y_7 \end{pmatrix} =
\begin{pmatrix}
\cos \theta & \cos 3\theta & \cos 5\theta & \cos 7\theta \\
\cos 3\theta & -\cos 7\theta & -\cos \theta & -\cos 5\theta \\
\cos 5\theta & -\cos \theta & \cos 7\theta & \cos 3\theta \\
\cos 7\theta & -\cos 5\theta & \cos 3\theta & -\cos \theta
\end{pmatrix}
\begin{pmatrix} x_0-x_7\\x_1-x_6\\x_2-x_5\\x_3-x_4 \end{pmatrix} \qquad (59.12)$$

More generally, the matrices in Eqs. (59.11) and (59.12) can be decomposed into a number of simpler matrices, the composition of which can be expressed as a flowgraph. Many fast algorithms have been proposed. Figure 59.3 illustrates the flowgraph of B.G. Lee's algorithm, which is commonly used [10]. Several implementations using fast flowgraphs have been reported [11, 12].

FIGURE 59.3: Lee FDCT flowgraph for the one-dimensional 8-point DCT [10].

Another approach that has been used extensively is based on the technique of distributed arithmetic. Distributed arithmetic is an efficient way to compute the DCT totally or partially as scalar products. To illustrate the approach, let us compute a scalar product between two length-$M$ vectors $C$ and $X$:

$$Y = \sum_{i=0}^{M-1} c_i \cdot x_i \quad \text{with} \quad x_i = -x_{i,0} + \sum_{j=1}^{B-1} x_{i,j} \cdot 2^{-j} \qquad (59.13)$$

where the $\{c_i\}$ are $N$-bit constants and the $\{x_i\}$ are coded in $B$ bits in 2's complement. Then Eq. (59.13) can be rewritten as:

$$Y = \sum_{j=0}^{B-1} C_j \cdot 2^{-j} \quad \text{with} \quad C_{j \neq 0} = \sum_{i=0}^{M-1} c_i \, x_{i,j} \quad \text{and} \quad C_0 = -\sum_{i=0}^{M-1} c_i \, x_{i,0} \qquad (59.14)$$

The change of the summation order in $i$ and $j$ characterizes the distributed arithmetic scheme, in which the initial multiplications are distributed to another computation pattern. Since the term $C_j$ has only $2^M$ possible values (which depend on the $x_{i,j}$ values), it is possible to store these $2^M$ possible values in a ROM. An input set of $M$ bits $\{x_{0,j}, x_{1,j}, x_{2,j}, \ldots, x_{M-1,j}\}$ is used as an address, allowing retrieval of the $C_j$ value. These intermediate results are accumulated in $B$ clock cycles for producing one $Y$ value. Figure 59.4 shows a typical architecture for the computation of an $M$-input inner product. The inverter and the MUX are used for inverting the final output of the ROM in order to compute $C_0$.

FIGURE 59.4: Architecture of an M-input inner product using distributed arithmetic.

Figure 59.5 illustrates two typical uses of distributed arithmetic for computing a DCT. Figure 59.5(a) implements the scalar products described by the matrix of Eq. (59.10). Figure 59.5(b) takes advantage of a first stage of additions and subtractions and implements the scalar products described by the matrices of Eqs. (59.11) and (59.12).

FIGURE 59.5: Architecture of an 8-point one-dimensional DCT using distributed arithmetic. (a) Pure distributed arithmetic. (b) Mixed distributed arithmetic: a first stage of flowgraph decomposition followed by scalar products of shorter vectors.
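A software model of Eqs. (59.13) and (59.14) shows the mechanics of the ROM-and-accumulate scheme of Fig. 59.4. The constants and input values below are arbitrary illustrations; the result is checked against the directly evaluated scalar product.

```c
#include <stdio.h>

/* Distributed-arithmetic scalar product, Eqs. (59.13)-(59.14).
 * M = 4 inputs coded in B = 8 bit two's complement fractions; the 2^M
 * ROM entries hold all possible bit-plane sums sum_i(c_i * x_{i,j}). */
#define M 4
#define B 8

int main(void)
{
    const double c[M] = { 0.25, -0.5, 0.75, 0.125 }; /* assumed constants */
    const signed char x[M] = { 45, -96, 17, 100 };   /* x_i = value/128   */
    double rom[1 << M];

    for (int a = 0; a < (1 << M); a++) {             /* build the ROM     */
        rom[a] = 0.0;
        for (int i = 0; i < M; i++)
            if (a & (1 << i)) rom[a] += c[i];
    }

    double y = 0.0;
    for (int j = 0; j < B; j++) {        /* j = 0 is the sign bit-plane   */
        int addr = 0;
        for (int i = 0; i < M; i++)      /* bit j of each input = address */
            addr |= (((unsigned char)x[i] >> (B - 1 - j)) & 1) << i;
        double cj = (j == 0) ? -rom[addr] : rom[addr]; /* C_0 is negated  */
        y += cj / (double)(1 << j);                    /* accumulate 2^-j */
    }

    double check = 0.0;                  /* direct evaluation, for reference */
    for (int i = 0; i < M; i++) check += c[i] * x[i] / 128.0;
    printf("DA: %f   direct: %f\n", y, check);
    return 0;
}
```

The accumulation loop corresponds to the B clock cycles of the hardware scheme; the negation for j = 0 plays the role of the inverter and MUX of Fig. 59.4.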
Properties of several dedicated DCT implementations have been reported in [6]. Figure 59.6 shows the silicon area as a function of the throughput rate for selected design examples. The design parameters are normalized to a fictive 1.0 µm CMOS process according to the discussed normalization strategy. As a figure of merit, a linear relationship of throughput rate and required silicon area can be derived:

$$\alpha_{T,0} \approx 0.5\ \mathrm{mm^2/(Mpel/s)} \qquad (59.15)$$

FIGURE 59.6: Normalized silicon area and throughput for dedicated DCT circuits.

Equation (59.15) can be applied for the silicon area estimation of DCT circuits. For example, assuming TV signals according to the CCIR-601 format and a frame rate of 25 Hz, the source rate equals 20.7 Mpel/s; as a figure of merit from Eq. (59.15), a normalized silicon area of about 10.4 mm² can be derived. For HDTV signals the video source rate equals 110.6 Mpel/s, and approximately 55.3 mm² of silicon area is required for the implementation of the DCT. Assuming an economically sensible maximum chip size of about 100 mm² to 150 mm², we can conclude that the implementation of the DCT does not necessarily require the realization of a dedicated DCT chip; the DCT core can be combined with several other on-chip modules that perform additional tasks of the video coding scheme.

For motion estimation, several techniques have been proposed in the past. Today, the most important technique for motion estimation is block matching, introduced by [21]. Block matching is based on the matching of blocks between the current and a reference image. This can be done by a full (or exhaustive) search within a search window, but several other approaches have been reported that reduce the computation requirements by using an "intelligent" or "directed" search [17, 18, 19, 23, 25, 26, 27].

In the case of an exhaustive search block matching algorithm, a block of size $N \times N$ pels of the current image (reference block, denoted $X$) is matched with all the blocks located within a search window (candidate blocks, denoted $Y$). The maximum displacement will be denoted by $w$. The matching criterion generally consists in computing the mean absolute difference (MAD) between the blocks. Let $x(i, j)$ be the pixels of the reference block and $y(i, j)$ the pixels of the candidate block. The matching distance (or distortion) $D$ is computed according to Eq. (59.16); the indexes $m$ and $n$ indicate the position of the candidate block within the search window. The distortion $D$ is computed for all the $(2w + 1)^2$ possible positions of the candidate block within the search window, and the block corresponding to the minimum distortion is used for prediction. The position of this block within the search window is represented by the motion vector $v$ of Eq. (59.17).

$$D(m,n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| x(i,j) - y(i+m,\, j+n) \right| \qquad (59.16)$$

$$v = (m,\, n)\big|_{D_{\min}} \qquad (59.17)$$

The operations involved in computing $D(m, n)$ and $D_{\min}$ are associative. Thus, the order for exploring the index spaces $(i, j)$ and $(m, n)$ is arbitrary, and the block matching algorithm can be described by several different dependence graphs. As an example, Fig. 59.7 shows a possible dependence graph (DG) for small values of $w$ and $N$. In this figure, AD denotes an absolute difference and an addition, and M denotes a minimum value computation.

FIGURE 59.7: Dependence graphs of the block matching algorithm.
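Before turning to the hardware mapping, a direct software rendering of Eqs. (59.16) and (59.17) may be useful as a reference model. In this sketch, N and w are kept small for brevity, and a matching block is planted in otherwise random data so that the expected result is known.

```c
#include <stdio.h>
#include <stdlib.h>

/* Exhaustive-search block matching, Eqs. (59.16)-(59.17).
 * The N x N reference block x is matched against all (2w+1)^2 candidate
 * positions inside the search window y. */
#define N  8
#define W  4
#define SW (N + 2 * W)              /* search window edge length */

int main(void)
{
    unsigned char x[N][N], y[SW][SW];
    srand(1);
    for (int i = 0; i < SW; i++)
        for (int j = 0; j < SW; j++) y[i][j] = rand() & 255;
    for (int i = 0; i < N; i++)     /* plant the best match at (m,n)=(3,-2) */
        for (int j = 0; j < N; j++) x[i][j] = y[i + W + 3][j + W - 2];

    long dmin = -1;
    int vm = 0, vn = 0;
    for (int m = -W; m <= W; m++)
        for (int n = -W; n <= W; n++) {
            long d = 0;             /* Eq. (59.16): sum of |x - y|      */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    int diff = x[i][j] - y[i + W + m][j + W + n];
                    d += diff < 0 ? -diff : diff;
                }
            if (dmin < 0 || d < dmin) { dmin = d; vm = m; vn = n; }
        }
    printf("v = (%d,%d), Dmin = %ld\n", vm, vn, dmin);  /* Eq. (59.17) */
    return 0;
}
```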
The computations of $v(X, Y)$ and $D(m, n)$ are performed by 2-D linear DGs. The dependence graph for computing $D(m, n)$ is directly mapped into a 2-D array of processing elements (PE), while the dependence graph for computing $v(X, Y)$ is mapped into time (Fig. 59.8). In other words, block matching is performed by a sequential exploration of the search area, while the computation of each distortion is performed in parallel. Each of the AD nodes of the DG is implemented by an AD processing element (AD-PE). The AD-PE stores the value of $x(i, j)$ and receives the value of $y(m + i, n + j)$ corresponding to the current position of the reference block in the search window. It performs the subtraction and the absolute value computation, and adds the result to the partial result coming from the upper PE. The partial results are added along the columns, and a linear array of adders performs the horizontal summation of the row sums and computes $D(m, n)$. For each position $(m, n)$ of the reference block, the M-PE checks whether the distortion $D(m, n)$ is smaller than the previously smallest distortion value and, in this case, updates the register that keeps this value.

To transform this naive architecture into a realistic implementation, two problems must be solved: (1) a reduction of the cycle time and (2) the I/O management. The architecture of Fig. 59.8 implicitly supposes that the computation of $D(m, n)$ can be done combinatorially in one cycle time. While this is theoretically possible, the resulting cycle time would be very large and would increase as $2N$; thus, a pipeline scheme is generally added. This architecture also supposes that each of the AD-PEs receives a new value of $y(m + i, n + j)$ at each clock cycle.

FIGURE 59.8: Principle of the 2-D block-based architecture.

Since transmitting these $N^2$ values from an external memory at each cycle is clearly impossible, advantage must be taken of the fact that they belong to the search window. A portion of the search window of size $N \times (2w + N)$ is stored in the circuit, in a 2-D bank of shift registers able to shift in the up, down, and right directions. Each of the AD-PEs has one of these registers and can, at each cycle, obtain the value of $y(m + i, n + j)$ that it needs. To update this register bank, a new column of $2w + N$ pixels of the search area is serially entered into the circuit and inserted in the bank of registers. A mechanism must also be provided for loading a new reference block with a low I/O overhead: a double buffering of $x(i, j)$ is required, with the pixels of a new reference block serially loaded during the computation of the current reference block (Fig. 59.9).

FIGURE 59.9: Practical implementation of the 2-D block-based architecture.

Figure 59.10 shows the normalized computational rate vs. the normalized chip area for block matching circuits. Since one MAD operation consists of three basic ALU operations (SUB, ABS, ADD), we can derive from this figure, for a 1.0 micron CMOS process:

$$\alpha_{T,0} \approx 30\ \mathrm{mm^2} + 1.9\ \mathrm{mm^2/GOPS} \qquad (59.18)$$

FIGURE 59.10: Normalized silicon area and computational rate for dedicated motion estimation architectures.

The first term of this expression indicates that the block matching algorithm requires a large storage area (storage of parts of the actual and previous frames), which cannot be reduced even when the throughput is reduced. The second term corresponds to the linear dependency on the computational throughput; per giga-additions per second it has the same magnitude as that determined for the DCT, because the three operation types of the match require approximately the same expense as additions. From Eq. (59.18), the silicon area required for the dedicated implementation of the exhaustive search block matching strategy for a displacement of $\pm w$ pels can be derived as:

$$A_{Si,0} \approx 0.0057 \cdot (2w+1)^2 \cdot R_S + 30\ \mathrm{mm^2} \qquad (59.19)$$

According to Eq. (59.19), for a dedicated implementation of exhaustive search block matching for telecommunication applications, based on a source rate of $R_S = 1.01$ Mpel/s (CIF format, 10 Hz frame rate) and a maximum displacement of $w = 15$, the required silicon area can be estimated at 35.5 mm². For TV ($R_S = 10.4$ Mpel/s) the silicon area for $w = 31$ can be estimated at 265 mm². Estimating the required silicon area for HDTV signals and $w = 31$ leads to 1280 mm² for the fictive 1.0 µm CMOS process. From this it follows that the implementation for TV and HDTV applications will require the realization of a dedicated block matching chip. Assuming a recent 0.5 µm semiconductor process, the core size estimation leads to about 22 mm² for TV signals and 106 mm² for HDTV signals.
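Equation (59.19) turns this dimensioning into a one-line computation. The fragment below reproduces the estimates quoted above; the CIF and TV source rates are those given in the text, while the HDTV rate of 55.3 Mpel/s is our assumption (the luminance rate, half of the 110.6 Mpel/s total quoted earlier), chosen so that the stated 1280 mm² results.

```c
#include <stdio.h>

/* Eq. (59.19): normalized silicon area for exhaustive-search block
 * matching, A = 0.0057 * (2w+1)^2 * Rs + 30 mm^2, Rs in Mpel/s. */
static double bm_area(int w, double rs_mpel)
{
    double c = 2.0 * w + 1.0;
    return 0.0057 * c * c * rs_mpel + 30.0;
}

int main(void)
{
    printf("CIF 10 Hz, w=15: %7.1f mm^2\n", bm_area(15, 1.01));
    printf("TV,        w=31: %7.1f mm^2\n", bm_area(31, 10.4));
    printf("HDTV,      w=31: %7.1f mm^2\n", bm_area(31, 55.3)); /* assumed Rs */
    return 0;
}
```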
To reduce the high computational complexity of exhaustive search block matching, two strategies can be applied:

1. Decrease of the number of candidate blocks.
2. Decrease of the number of pels per block by subsampling of the image data.

Typically, (1) is implemented by search strategies in successive steps. As an example, a modified scheme according to the original proposal of [25] will be discussed. In this scheme, the best match $v_{s-1}$ of the previous step $s - 1$ is improved in the present step $s$ by comparison with displacements $\pm\Delta_s$. The displacement vector $v_s$ for each step $s$ is calculated according to

$$D_s(m_s, n_s) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| x(i,j) - y(i + m_s + q_m \Delta_s,\; j + n_s + q_n \Delta_s) \right|, \quad q_m, q_n \in \{-1, 0, 1\} \qquad (59.20)$$

with $(m_s, n_s) = v_{s-1}$ for $s > 1$, $(m_1, n_1) = (0, 0)$, and $v_s = (m_s, n_s)\big|_{D_{s,\min}}$.

$\Delta_s$ depends on the maximum displacement $w$ and the number of search steps $N_s$. Typically, when $w = 2^k - 1$, $N_s$ is set to $k = \log_2(w + 1)$ and $\Delta_s = 2^{k-s}$. For example, for $w = 15$, four steps with $\Delta_s = 8, 4, 2, 1$ are performed. This strategy reduces the number of candidate blocks from $(2w + 1)^2$ in the case of exhaustive search to $1 + 8 \log_2(w + 1)$; e.g., for $w = 15$ the number of candidate blocks is reduced from 961 to 33, which leads to a reduction of processing power by a factor of 29. For large block sizes $N$, the number of operations for the match can be further reduced by combining the search strategy with subsampling in the first steps. Architectures for block matching based on hierarchical search strategies are presented in [20, 22, 24, 30].
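A software model of Eq. (59.20) makes the reduced candidate count concrete. The sketch below uses random image data, so it only demonstrates the mechanics and the number of distortion evaluations; on such data the result is a local minimum of the distortion and generally differs from the exhaustive-search vector. For simplicity the step center is re-evaluated in every step (36 evaluations for the 33 distinct candidates).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hierarchical (logarithmic) search, Eq. (59.20), for w = 15:
 * four steps with Delta_s = 8, 4, 2, 1 around the best vector so far. */
#define N 8
#define W 15
#define SW (N + 2 * (W + 1))       /* window margin covers all test points */

static unsigned char x[N][N], y[SW][SW];
static long evals;

static long sad(int m, int n)      /* distortion of Eq. (59.16) */
{
    long d = 0;
    evals++;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int diff = x[i][j] - y[i + W + 1 + m][j + W + 1 + n];
            d += diff < 0 ? -diff : diff;
        }
    return d;
}

int main(void)
{
    srand(2);
    for (int i = 0; i < SW; i++)
        for (int j = 0; j < SW; j++) y[i][j] = rand() & 255;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) x[i][j] = rand() & 255;

    int vm = 0, vn = 0;                      /* v_1 starts at (0,0)      */
    for (int ds = 8; ds >= 1; ds /= 2) {     /* Delta_s = 8, 4, 2, 1     */
        long dmin = sad(vm, vn);
        int bm = vm, bn = vn;
        for (int qm = -1; qm <= 1; qm++)
            for (int qn = -1; qn <= 1; qn++) {
                if (qm == 0 && qn == 0) continue;
                long d = sad(vm + qm * ds, vn + qn * ds);
                if (d < dmin) { dmin = d; bm = vm + qm * ds; bn = vn + qn * ds; }
            }
        vm = bm; vn = bn;                    /* v_s of Eq. (59.20)       */
    }
    printf("v = (%d,%d) after %ld SAD evaluations (vs. 961)\n", vm, vn, evals);
    return 0;
}
```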
59.6 Programmable Architectures

According to the three ways of architectural optimization, i.e., adaptation, pipelining, and parallel processing, three architectural classes for the implementation of video signal processors can be distinguished:

• Intensive Pipelined Architectures — These are typically scalar architectures that achieve high clock frequencies of several hundred MHz through the exploitation of pipelining.

• Parallel Data Paths — These architectures exploit data distribution to increase computational power. Several parallel data paths are implemented on one processor die, which in the ideal case leads to a linear increase of the supported computational power. The number of parallel data paths is limited by the semiconductor process, since an increase of silicon area leads to a decrease of hardware yield.

• Coprocessor Architectures — Coprocessors are known from general processor designs and are often used for specific tasks, e.g., floating-point operations. The idea of adapting to specific tasks and increasing computational power without an increase of the required semiconductor area has been applied in several designs. Due to their high regularity and their high processing power requirements, low-level tasks are the most promising candidates for an adapted implementation. The main disadvantage of this architectural approach is the decrease of flexibility that comes with increasing adaptation.

59.6.1 Intensive Pipelined Architectures

Applying pipelining to increase the clock frequency leads to an increased latency of the circuit. For algorithms that require a data-dependent control flow, this fact might limit the performance gain. Additionally, increasing the arithmetic processing power increases the required data access rate, which generally cannot be provided by external memories. This gap between the externally provided and the internally required data access rate widens for processor architectures with high clock frequencies. To provide the high data access rate, the amount of internal memory, which provides a low access time, has to be increased for high-performance signal processors. Moreover, it is unfeasible to apply pipelining to speed up on-chip memory; thus, the minimum memory access time is another limiting factor for the maximum degree of pipelining. Finally, speed optimization is a time-consuming part of the design process, which has to be performed for every new technology generation.

Examples of video processors with high clock frequencies are the S-VSP [39] and the VSP3 [40]. Due to intensive pipelining, an internal clock frequency of up to 300 MHz can be achieved. The VSP3 consists of two parallel data paths, the Pipelined Arithmetic Logic Unit (PAU) and the Pipelined Convolution Unit (PCU) (Fig. 59.11). The relatively large on-chip data memory of 114 kbit is split into seven blocks: six data memories and one FIFO memory for external data exchange. Each of the six data memories is provided with an address generation unit (AGU), which provides the addressing modes "block", "DCT", and "zig-zag". Control is performed by a Sequence Control Unit (SCU), which involves a 1024 × 32-bit instruction memory. A Host Interface Unit (HIU) and a Timing Control Unit (TCU) for the derivation of the internal clock frequency are integrated onto the VSP3 core. The entire VSP3 consists of 1.27 million transistors, implemented in a 0.5 micron BiCMOS technology on a 16.5 × 17.0 mm² die. The VSP3 performs the processing of the CCITT H.261 tasks (neglecting Huffman coding) for one macroblock in 45 µs. Since real-time processing of 30 Hz CIF signals requires a processing time of less than 85 µs per macroblock, an H.261 coder can be implemented based on one VSP3.

FIGURE 59.11: VSP3 architecture [40].
59.6.2 Parallel Data Paths

In the previous section, pipelining was presented as a strategy for processing power enhancement. Applying pipelining leads to a subdivision of a logic operation into suboperations, which are processed in parallel at increased processing speed. An alternative to pipelining is the distribution of data among several functional units; applying this strategy leads to an implementation of parallel data paths. Typically, each data path is connected to an on-chip memory, which provides access to distributed image segments.

Generally, two controlling strategies for parallel data paths can be distinguished. An MIMD concept provides a private control unit for each data path, whereas SIMD-based controlling provides a single common controller for all parallel data paths. Compared to SIMD, the advantages of MIMD are a greater flexibility and a higher performance for complex algorithms with highly data-dependent control flow. On the other hand, MIMD requires a significantly increased silicon area. Additionally, the access rate to the program memory is increased, since several controllers have to be provided with program data. Moreover, a software-based synchronization of the data paths is more complex, whereas in an SIMD concept synchronization is performed implicitly by the hardware. Since actual hybrid coding schemes require a large amount of processing power for tasks with a data-independent control flow, a single control unit for the parallel data paths provides sufficient processor performance.

The controlling strategy nevertheless has to provide for the execution of algorithms that require a data-dependent control flow, e.g., quantization. A simple concept for implementing a data-dependent control flow is to disable the execution of instructions depending on the local data path status, as illustrated by the sketch below. In this case, the data path utilization might be significantly decreased, since several of the parallel data paths idle while others perform the processing of image data. An alternative is a hierarchical controlling concept: each data path is provided with a small local control unit of limited functionality, and the global controller initiates the execution of control sequences of the local data path controllers. To reduce the chip area required for this controlling concept, the local controller can be reduced to a small instruction memory; addressing of this memory is performed by the global control unit.
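The disable concept can be pictured in a few lines of scalar C: all lanes execute the same instruction sequence, and a per-lane status flag masks the write-back of lanes whose condition is false. The quantizer rule used here is an arbitrary illustration under the assumption of four lanes, not the quantizer of any of the cited standards.

```c
#include <stdio.h>

/* SIMD execution with local disable flags: all data paths receive the
 * same instruction stream; a lane whose condition is false suppresses
 * its write-back and idles during the data-dependent instruction. */
#define LANES 4

int main(void)
{
    int coef[LANES] = { 37, -3, 120, 2 };  /* one DCT coefficient per lane */
    int q[LANES];
    int step = 16;                         /* assumed quantizer step size  */

    int active[LANES];                     /* status flag per data path    */
    for (int p = 0; p < LANES; p++)        /* common instruction: compare  */
        active[p] = (coef[p] >= step / 2 || coef[p] <= -step / 2);
    for (int p = 0; p < LANES; p++)        /* common instruction: divide   */
        q[p] = active[p] ? coef[p] / step : 0;  /* masked lanes idle       */

    for (int p = 0; p < LANES; p++) printf("%d ", q[p]);
    printf("\n");
    return 0;
}
```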
An example of a video processor based on parallel identical data paths with a hierarchical controlling concept is the IDSP [42] (Fig. 59.12). The IDSP includes four pipelined data processing units (DPU0–DPU3), three parallel I/O ports (PIO0–PIO2), a 16-bit register file, five dual-ported memory blocks of 512 × 16 bit each, an address generation unit for the data memories, and a program sequencer with a 512 × 32-bit instruction memory and a 32 × 32-bit boot ROM. The data processing units implement a three-stage pipeline structure based on an ALU, a multiplier, and an accumulator. This data path structure is well suited for L1 and L2 norm calculations and convolution-like algorithms. The four parallel data paths support a peak computational power of 300 MOPS at a typical clock frequency of 25 MHz. The data required for parallel processing are supplied by four cache memories (CM0–CM3) and a work memory (WM). Address generation for these memories is performed by an address generation unit (AU), which supports address sequences such as block scan, bit reverse, and butterfly. The three parallel I/O units contain a data I/O port, an address generation unit, and a DMA control processor (DMAC). The IDSP integrates 910,000 transistors on 15.2 × 15.2 mm² in a 0.8 micron BiCMOS technology. For a full-CIF H.261 video codec, four IDSPs are required.

FIGURE 59.12: IDSP architecture [42].

Another example of an SIMD-based video signal processor with identical parallel data paths is the HiPAR-DSP [44] (Fig. 59.13). The processor core consists of 16 RISC data paths, controlled by a common VLIW instruction word. The data paths contain a multiplier/accumulator unit, a shift/round unit, an ALU, and a 16 × 16-bit register file. Each data path is connected to a private data cache. To support the characteristic data access patterns of several image processing tasks efficiently, a shared memory with parallel data access is integrated on-chip and provides parallel and conflict-free access to the data stored in it. The supported access patterns are "matrix", "vector", and "scalar". Data exchange with external devices is supported by an on-chip DMA unit and a hypercube interface. At present, a prototype of the HiPAR-DSP based on four parallel data paths is being implemented. This chip will be manufactured in a 0.6 micron CMOS technology and will require a silicon area of about 180 mm². One processor chip is sufficient for real-time decoding of video signals according to MPEG-2 Main Profile at Main Level; for encoding, an external motion estimator is required.

FIGURE 59.13: Architecture of the HiPAR-DSP [44].

In contrast to the SIMD-based HiPAR-DSP architecture, the TMS320C80 (MVP) is based on an MIMD approach [43]. The MVP consists of four parallel processors (PP) and one master processor (Fig. 59.14). The processors are connected to 50 kbyte of on-chip data memory via a global crossbar interconnection network. A DMA controller provides the data transfer to an external data memory, and video I/O is supported by an on-chip video interface. The master processor is a general-purpose RISC processor with an integral IEEE-compatible floating-point unit (FPU). The processor has a 32-bit instruction word, can load or store 8-, 16-, 32-, and 64-bit data sizes, and includes a 32 × 32-bit general-purpose register file. The master processor is intended to operate as the main supervisor and distributor of tasks within the chip and is also responsible for the communication with external processors. Due to the integrated FPU, the master processor can also perform tasks such as audio signal processing and 3-D graphics transformations.

The parallel processors have been designed to perform typical DSP algorithms, e.g., filtering and DCT, and to support bit and pixel manipulations for graphics applications. They contain two address units, a program flow control unit, and a data unit with a 32-bit ALU, a 16 × 16-bit multiplier, and a barrel rotator. The MVP has been designed using a 0.5 micron CMOS technology. Due to the supported flexibility, about four million transistors on a chip area of 324 mm² are required. A computational power of 2 GOPS is supported. A single MVP is able to encode CIF 30-Hz video signals according to the MPEG-1 standard.

FIGURE 59.14: TMS320C80 (MVP) [43].
as a master processor for global control and for processing of tasks such as variable length encoding and quantization To improve the performance for typical video coding schemes, the data path of the RISC core has been adapted to the requirements of quantization and variable length coding, by an extension of the basic instruction set A program RAM of size is placed on-chip and can be loaded from an external PROM during start-up The SIMD oriented arithmetic processing unit (APU) contains four parallel datapaths with a subtracter-complementermultiplier pipeline The intermediate results of the arithmetic pipelines are fed into a multi-operand accumulator with shift/limit circuitry The results of the APU can be stored in the internal local memory or read out to the external data output bus Since both RISC core and APU include a private program RAM and address generation units, these processor modules are able to work in parallel on different tasks This MIMD-like concept enables an execution of two tasks in parallel, e.g., DCT and quantization The AxPe640V is currently available in a 66-MHz version, designed in a 0.8 micron CMOS technology A QCIF-10Hz H.261 codec can be realized with a single chip To achieve higher computation power several AxPe640V can be combined to a multiprocessor system For example, three AxPe640V are required for an implementation of a CIF-10Hz codec FIGURE 59.16: AxPe640V architecture [37] The examples presented above clarify the wide range of architectural approaches for the VLSI implementation of video coding schemes The applied strategies are influenced by several demands, especially the desired flexibility of the architecture and maximum cost for realization and manufacturing Due to the high computational requirements of real time video coding, most of the presented architectures apply a coprocessor concept with flexible programmable modules in combination with 1999 by CRC Press LLC c modules that are more or less adapted to specific tasks of the hybrid coding scheme An overview of programmable architectures for video coding applications is given in [6] Equations (59.4) and (59.5) can be applied for the comparison of programmable architectures The result of this comparison is shown in Fig 59.17, using the coding scheme according to ITU recommendation H.261 as a benchmark Assuming a linear dependency between throughput rate FIGURE 59.17: Normalized silicon area and throughput (frame rate) for adapted and flexible programmable architectures for a H.261 codec and silicon area, a linear relationship corresponds to constant architectural efficiency, indicated by the two grey lines in Fig 59.17 According to these lines, two groups of architectural classes can be identified The first group consists of adapted architectures, optimized for hybrid coding applications The architectures contain one or more adapted modules for computation intensive tasks, such as DCT or block matching It is obvious that the application field of these architectures is limited to a small range of applications This limitation is avoided by the members of the second group of architectures Most of these architectures not contain function specific circuitry for specific tasks of the hybrid coding scheme Thus, they can be applied for wider variety of applications without a significant loss of sustained computational power On the other hand, these architectures are associated with a decreased architectural efficiency compared to the first group of proposed architectures: Adapted architectures achieve an 
The examples presented above clarify the wide range of architectural approaches for the VLSI implementation of video coding schemes. The applied strategies are influenced by several demands, especially the desired flexibility of the architecture and the maximum cost for realization and manufacturing. Due to the high computational requirements of real-time video coding, most of the presented architectures apply a coprocessor concept with flexibly programmable modules in combination with modules that are more or less adapted to specific tasks of the hybrid coding scheme. An overview of programmable architectures for video coding applications is given in [6].

Equations (59.4) and (59.5) can be applied for the comparison of programmable architectures. The result of this comparison is shown in Fig. 59.17, using the coding scheme according to ITU recommendation H.261 as a benchmark. Assuming a linear dependency between throughput rate and silicon area, a linear relationship corresponds to constant architectural efficiency, indicated by the two grey lines in Fig. 59.17. According to these lines, two groups of architectural classes can be identified. The first group consists of adapted architectures, optimized for hybrid coding applications; these architectures contain one or more adapted modules for computation-intensive tasks, such as DCT or block matching. It is obvious that the application field of these architectures is limited to a small range of applications. This limitation is avoided by the members of the second group, most of which do not contain function-specific circuitry for specific tasks of the hybrid coding scheme. Thus, they can be applied to a wider variety of applications without a significant loss of sustained computational power. On the other hand, these architectures show a decreased architectural efficiency compared to the first group: adapted architectures achieve an efficiency gain of roughly a factor of seven. For a typical videophone application, a frame rate of 10 Hz can be assumed; for this application, a normalized silicon area of about 130 mm² is required for adapted programmable approaches, whereas approximately 950 mm² is required for flexible programmable architectures.

FIGURE 59.17: Normalized silicon area and throughput (frame rate) for adapted and flexible programmable architectures for an H.261 codec.
