Digital Signal Processing Handbook P59

P. Pirsch, et al. "VLSI Architectures for Image Communications." 2000 CRC Press LLC. <http://www.engnetbase.com>.

VLSI Architectures for Image Communications

P. Pirsch, Laboratorium für Informationstechnologie, University of Hannover
W. Gehrke, Philips Semiconductors

59.1 Introduction
59.2 Recent Coding Schemes
59.3 Architectural Alternatives
59.4 Efficiency Estimation of Alternative VLSI Implementations
59.5 Dedicated Architectures
59.6 Programmable Architectures
Intensive Pipelined Architectures • Parallel Data Paths • Coprocessor Concept
59.7 Conclusion
Acknowledgment
References

59.1 Introduction

Video processing has been a rapidly evolving field for the telecommunications, computer, and media industries. In particular, real-time video compression applications are expected to gain growing economic significance over the next years. Besides digital TV broadcasting and videophone, services such as multimedia education, teleshopping, or video mail will become audiovisual mass applications.

To facilitate worldwide interchange of digitally encoded audiovisual data, there is a demand for international standards defining coding methods and transmission formats. International standardization committees have been working on the specification of several compression schemes. The Joint Photographic Experts Group (JPEG) of the International Standards Organization (ISO) has specified an algorithm for compression of still images [4]. The ITU proposed the H.261 standard for video telephony and video conference [1]. The Moving Picture Experts Group (MPEG) of ISO has completed its first standard MPEG-1, which will be used for interactive video and provides a picture quality comparable to VCR quality [2]. MPEG made substantial progress for the second phase of standards MPEG-2, which will provide audiovisual quality of both broadcast TV and HDTV [3].
Besides the availability of international standards, the successful introduction of the named services depends on the availability of VLSI components supporting a cost-efficient implementation of video compression applications. In the following, we give a short overview of recent coding schemes and discuss implementation alternatives. Furthermore, the efficiency estimation of architectural alternatives is discussed and implementation examples of dedicated and programmable architectures are presented.

© 1999 by CRC Press LLC

59.2 Recent Coding Schemes

Recent video coding standards are based on a hybrid coding scheme that combines transform coding and predictive coding techniques. An overview of these hybrid encoding schemes is depicted in Fig. 59.1.

FIGURE 59.1: Hybrid encoding and decoding scheme.

The encoding scheme consists of the tasks motion estimation, typically based on block matching algorithms, computation of the prediction error, discrete cosine transform (DCT), quantization (Q), variable length coding (VLC), inverse quantization ($Q^{-1}$), and inverse discrete cosine transform (IDCT or DCT$^{-1}$). The reconstructed image data are stored in an image memory for further predictions. The decoder performs the tasks variable length decoding (VLC$^{-1}$), inverse quantization, and motion compensated reconstruction.

Generally, video processing algorithms can be classified in terms of regularity of computation and data access. This classification leads to three classes of algorithms:

• Low-Level Algorithms: These algorithms are based on a predefined sequence of operations and a predefined amount of data at the input and output. The processing sequence of low-level algorithms is predefined and does not depend on the values of the data processed. Typical examples of low-level algorithms are block matching or transforms such as the DCT.
• Medium-Level Algorithms: The sequence and number of operations of medium-level algorithms depend on the data. Typically, the amount of input data is predefined, whereas the amount of output data varies according to the input data values. With respect to hybrid coding schemes, examples for these algorithms are quantization, inverse quantization, or variable length coding.
• High-Level Algorithms: High-level algorithms are associated with a variable amount of input and output data and a data-dependent sequence of operations. As for medium-level algorithms, the sequence of operations is highly data dependent. Control tasks of the hybrid coding scheme can be assigned to this class.

Since hybrid coding schemes are applied for different video source rates, the required absolute processing power varies in the range from a few hundred MOPS (Mega Operations Per Second) for video signals in QCIF format to several GOPS (Giga Operations Per Second) for processing of TV or HDTV signals. Nevertheless, the relative computational power of each algorithmic class is nearly independent of the processed video format. In case of hybrid coding applications, approximately 90% of the overall processing power is required for low-level algorithms. The amount of medium-level tasks is about 7% and nearly 3% is required for high-level algorithms.

59.3 Architectural Alternatives

In terms of a VLSI implementation of hybrid coding applications, two major requirements can be identified. First, the high computational power requirements have to be provided by the hardware. Second, low manufacturing cost of video processing components is essential for the economic success of an architecture. Additionally, implementation size and architectural flexibility have to be taken into account.

Implementations of video processing applications can either be based on standard processors from workstations or PCs or on specialized video signal processors. The major advantage of standard processors is their availability.
Application of these architectures for implementation of video processing hardware does not require the time-consuming design of new VLSI components. The disadvantage of this implementation strategy is the insufficient processing power of recent standard processors. Video processing applications would still require the implementation of cost-intensive multiprocessor systems to meet the computational requirements. To achieve compact implementations, video processing hardware has to be based on video signal processors, adapted to the requirements of the envisaged application field.

Basically, two architectural approaches for the implementation of specialized video processing components can be distinguished. Dedicated architectures aim at an efficient implementation of one specific algorithm or application. Due to the restriction of the application field, the architecture of dedicated components can be optimized by an intensive adaptation of the architecture to the requirements of the envisaged application, e.g., arithmetic operations that have to be supported, processing power, or communication bandwidth. Thus, this strategy will generally lead to compact implementations. The major disadvantage of dedicated architectures is the associated low flexibility. Dedicated components can only be applied for one or a few applications.

In contrast to dedicated approaches with limited functionality, programmable architectures enable the processing of different algorithms under software control. The particular advantage of programmable architectures is the increased flexibility. Changes of architectural requirements, e.g., due to changes of algorithms or an extension of the targeted application field, can be handled by software changes. Thus, a generally cost-intensive redesign of the hardware can be avoided.
Moreover, since programmable architectures cover a wider range of applications, they can be used for low-volume applications, where the design of function-specific VLSI chips is not an economical solution.

For both architectural approaches, the computational requirements of video processing applications demand the exploitation of the algorithm-inherent independence of the basic arithmetic operations to be performed. Independent operations can be processed concurrently, which enables a decrease of processing time and thus an increased throughput rate. For the architectural implementation of concurrency, two basic strategies can be distinguished: pipelining and parallel processing.

In case of pipelining, several tasks, operations, or parts of operations are processed in subsequent steps in different hardware modules. Depending on the selected granularity level for the implementation of pipelining, intermediate data of each step are stored in registers, register chains, FIFOs, or dual-port memories. Assuming a processing time of $T_P$ for a non-pipelined processor module and $T_{D,IM}$ for the delay of intermediate memories, we get in the ideal case the following estimation for the throughput rate $R_{T,\mathrm{Pipe}}$ of a pipelined architecture applying $N_{\mathrm{Pipe}}$ pipeline stages:

$$R_{T,\mathrm{Pipe}} = \frac{1}{\dfrac{T_P}{N_{\mathrm{Pipe}}} + T_{D,IM}} = \frac{N_{\mathrm{Pipe}}}{T_P + N_{\mathrm{Pipe}} \cdot T_{D,IM}} \tag{59.1}$$

From this it follows that the major limiting factor for the maximum applicable degree of pipelining is the access delay of these intermediate memories: for large $N_{\mathrm{Pipe}}$ the throughput rate approaches $1/T_{D,IM}$.

The alternative to pipelining is the implementation of parallel units, processing independent data concurrently. Parallel processing can be applied on operation level as well as on task level. Assuming the ideal case, this strategy leads to a linear increase of processing power and we get:

$$R_{T,\mathrm{Par}} = \frac{N_{\mathrm{Par}}}{T_P} \tag{59.2}$$

where $N_{\mathrm{Par}}$ = number of parallel units.
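The two throughput estimates of Eqs. (59.1) and (59.2) can be compared numerically. The following is a minimal sketch in Python; the function names and the timing values in the demo (a 100 ns module and a 5 ns intermediate-memory delay) are illustrative assumptions and do not come from the chapter.

```python
def pipelined_throughput(t_p, t_d_im, n_pipe):
    """Ideal throughput of a pipelined module, Eq. (59.1).

    t_p    : processing time of the non-pipelined module
    t_d_im : delay of one intermediate memory (register, FIFO, ...)
    n_pipe : number of pipeline stages (>= 1)
    """
    if n_pipe < 1:
        raise ValueError("n_pipe must be at least 1")
    return n_pipe / (t_p + n_pipe * t_d_im)


def parallel_throughput(t_p, n_par):
    """Ideal throughput of n_par parallel units, Eq. (59.2)."""
    if n_par < 1:
        raise ValueError("n_par must be at least 1")
    return n_par / t_p


def pipelining_limit(t_d_im):
    """Upper bound of Eq. (59.1) for n_pipe -> infinity."""
    return 1.0 / t_d_im


if __name__ == "__main__":
    T_P = 100e-9    # assumed value, for illustration only
    T_D_IM = 5e-9   # assumed value, for illustration only
    for n in (1, 2, 4, 8, 16, 64):
        pipe = pipelined_throughput(T_P, T_D_IM, n)
        par = parallel_throughput(T_P, n)
        print(f"N={n:3d}  pipelined={pipe:.3e}/s  parallel={par:.3e}/s")
    print(f"pipelining limit = {pipelining_limit(T_D_IM):.3e}/s")
```

With these assumed values the pipelined rate saturates toward the intermediate-memory limit, while the ideal parallel rate keeps growing linearly with the number of units.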
Generally, both alternatives are applied for the implementation of high-performance video processing components. In the following sections, the exploitation of algorithmic properties and the application of architectural concurrency are discussed considering the hybrid coding schemes.

59.4 Efficiency Estimation of Alternative VLSI Implementations

Basically, architectural efficiency can be defined by the ratio of performance over cost. To achieve a figure of merit for architectural efficiency, we assume in the following that the performance of a VLSI architecture can be expressed by the achieved throughput rate $R_T$ and that the cost is equivalent to the silicon area $A_{Si}$ required for the implementation of the architecture:

$$E = \frac{R_T}{A_{Si}} \tag{59.3}$$

Besides the architecture, efficiency mainly depends on the applied semiconductor technology and the design style (semi-custom, full-custom). Therefore, a realistic efficiency estimation has to consider the gains provided by the progress in semiconductor technology. A sensible way is the normalization of the architectural parameters according to a reference technology. In the following we assume a reference process with a grid length $\lambda_0 = 1.0$ micron. For normalization of silicon area, the following equation can be applied:

$$A_{Si,0} = A_{Si} \left(\frac{\lambda_0}{\lambda}\right)^{2} \tag{59.4}$$

where the index 0 is used for the system with reference gate length $\lambda_0$. According to [7], the normalization of throughput can be performed by:

$$R_{T,0} = R_T \left(\frac{\lambda}{\lambda_0}\right)^{1.6} \tag{59.5}$$

From Eqs. (59.3), (59.4), and (59.5), the normalization for the architectural efficiency can be derived:

$$E_0 = \frac{R_{T,0}}{A_{Si,0}} = \frac{R_T}{A_{Si}} \left(\frac{\lambda}{\lambda_0}\right)^{3.6} \tag{59.6}$$

E can be used for the selection of the best architectural approach out of several alternatives. Moreover, assuming a constant efficiency for a specific architectural approach leads to a linear relationship of throughput rate and silicon area, and this relationship can be applied for the estimation of the required silicon area for a specific application.
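The normalization of Eqs. (59.4) to (59.6) is simple enough to sketch in a few lines of Python. The function names are hypothetical, and the design used in the demo (50 Mpel/s on 20 mm² in a 0.5 micron process) is an invented example, not a data point from the chapter.

```python
REFERENCE_LAMBDA_UM = 1.0  # reference process lambda_0 used in the text


def normalize_area(area_mm2, lam_um, lam0_um=REFERENCE_LAMBDA_UM):
    """Eq. (59.4): silicon area scaled to the reference process."""
    return area_mm2 * (lam0_um / lam_um) ** 2


def normalize_throughput(rate, lam_um, lam0_um=REFERENCE_LAMBDA_UM):
    """Eq. (59.5): throughput rate scaled to the reference process."""
    return rate * (lam_um / lam0_um) ** 1.6


def normalized_efficiency(rate, area_mm2, lam_um, lam0_um=REFERENCE_LAMBDA_UM):
    """Eq. (59.6): E_0 = R_T,0 / A_Si,0."""
    return (normalize_throughput(rate, lam_um, lam0_um)
            / normalize_area(area_mm2, lam_um, lam0_um))


if __name__ == "__main__":
    # Hypothetical design: 50 Mpel/s on 20 mm^2 in a 0.5 micron process.
    rate, area, lam = 50.0, 20.0, 0.5
    print("A_Si,0 =", normalize_area(area, lam), "mm^2")
    print("R_T,0  =", normalize_throughput(rate, lam), "Mpel/s")
    print("E_0    =", normalized_efficiency(rate, area, lam), "Mpel/s per mm^2")
```

Computing the ratio from the two normalized quantities, rather than applying the exponent 3.6 directly, makes it easy to check that Eq. (59.6) is just the combination of Eqs. (59.4) and (59.5).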
Due to the power of 3.6 in Eq. (59.6), the chosen semiconductor technology for implementation of a specific application has a significant impact on the architectural efficiency.

In the following, examples of dedicated and programmable architectures for video processing applications are presented. Additionally, the discussed efficiency measure is applied to achieve a figure of merit for silicon area estimation.

59.5 Dedicated Architectures

Due to their algorithmic regularity and the high processing power required for the discrete cosine transform and motion estimation, these algorithms are the first candidates for a dedicated implementation. As typical examples, alternatives for a dedicated implementation of these algorithms are discussed in the following.

The discrete cosine transform (DCT) is a real-valued frequency transform similar to the discrete Fourier transform (DFT). When applied to an image block of size $L \times L$, the two-dimensional DCT (2D-DCT) can be expressed as follows:

$$Y_{k,l} = \sum_{i=0}^{L-1} \sum_{j=0}^{L-1} x_{i,j} \cdot C_{i,k} \cdot C_{j,l} \tag{59.7}$$

where

$$C_{n,m} = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{for } m = 0 \\[2ex] \cos\left(\dfrac{(2n+1)\,m\,\pi}{2L}\right) & \text{otherwise} \end{cases}$$

with
$(i, j)$ = coordinates of the pixels in the initial block
$(k, l)$ = coordinates of the coefficients in the transformed block
$x_{i,j}$ = value of the pixel in the initial block
$Y_{k,l}$ = value of the coefficient in the transformed block

Computing a 2D-DCT of size $L \times L$ directly according to Eq. (59.7) requires $L^4$ multiplications and $L^4$ additions.

The required processing power for the implementation of the DCT can be reduced by the exploitation of the arithmetic properties of the algorithm. The two-dimensional DCT can be separated into two one-dimensional DCTs according to Eq. (59.8):

$$Y_{k,l} = \sum_{i=0}^{L-1} C_{i,k} \cdot \left[ \sum_{j=0}^{L-1} x_{i,j} \cdot C_{j,l} \right] \tag{59.8}$$

The implementation of the separated DCT requires $2L^3$ multiplications and $2L^3$ additions. As an example, the DCT implementation according to [9] is depicted in Fig. 59.2.
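The operation counts quoted for Eqs. (59.7) and (59.8) can be checked with a small software model. The sketch below is plain Python and is only meant to illustrate the arithmetic, not the hardware structure of [9]; the function names are hypothetical, and it assumes that in the direct form each product $C_{i,k} \cdot C_{j,l}$ counts as one precomputed kernel coefficient, so that one multiply-accumulate is counted per term.

```python
import math


def dct_coeff(n, m, size):
    """C_{n,m} as defined below Eq. (59.7)."""
    if m == 0:
        return 1.0 / math.sqrt(2.0)
    return math.cos((2 * n + 1) * m * math.pi / (2 * size))


def dct2d_direct(x):
    """2D-DCT evaluated directly from Eq. (59.7); returns (Y, MAC count)."""
    size = len(x)
    macs = 0
    y = [[0.0] * size for _ in range(size)]
    for k in range(size):
        for l in range(size):
            acc = 0.0
            for i in range(size):
                for j in range(size):
                    # Kernel product treated as one stored coefficient.
                    kernel = dct_coeff(i, k, size) * dct_coeff(j, l, size)
                    acc += x[i][j] * kernel
                    macs += 1
            y[k][l] = acc
    return y, macs


def dct2d_separable(x):
    """2D-DCT as two passes of 1D-DCTs, Eq. (59.8); returns (Y, MAC count)."""
    size = len(x)
    macs = 0
    # First pass: inner sum over j for every row i.
    t = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for l in range(size):
            acc = 0.0
            for j in range(size):
                acc += x[i][j] * dct_coeff(j, l, size)
                macs += 1
            t[i][l] = acc
    # Second pass: outer sum over i for every column l.
    y = [[0.0] * size for _ in range(size)]
    for k in range(size):
        for l in range(size):
            acc = 0.0
            for i in range(size):
                acc += dct_coeff(i, k, size) * t[i][l]
                macs += 1
            y[k][l] = acc
    return y, macs


if __name__ == "__main__":
    L = 8
    block = [[(3 * i + 5 * j) % 7 for j in range(L)] for i in range(L)]
    _, direct_macs = dct2d_direct(block)
    _, separable_macs = dct2d_separable(block)
    print("direct MACs:   ", direct_macs)     # L**4
    print("separable MACs:", separable_macs)  # 2 * L**3
```

The intermediate array `t` plays the role of the data passed between the two 1D stages; in hardware this is where a reordering of the first-stage results is needed before the second stage can run.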
The architecture in Fig. 59.2 is based on two one-dimensional processing arrays. Since this architecture is based on a pipelined multiplier/accumulator implementation in carry-save technique, vector merging adders are located at the output of each array. The results of the first 1D-DCT have to be reordered for the second 1D-DCT stage. For this purpose, a transposition memory is used. Since both one-dimensional processor arrays require identical DCT coefficients, these coefficients are stored in a common ROM.

FIGURE 59.2: Separated DCT implementation according to [9].

Moving from a mathematical definition to an algorithm that minimizes the number of calculations required is a problem of particular interest in the case of transforms such as the DCT. The 1D-DCT can also be expressed by the matrix-vector product:

$$[Y] = [C]\,[X] \tag{59.9}$$

where $[C]$ is an $L \times L$ matrix and $[X]$ and $[Y]$ are 8-point input and output vectors. As an example, with $\theta = \pi/16$, the 8-point DCT can be computed as denoted in Eq. (59.10):

$$\begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \end{pmatrix} = \begin{pmatrix}
\cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta \\
\cos \theta & \cos 3\theta & \cos 5\theta & \cos 7\theta & -\cos 7\theta & -\cos 5\theta & -\cos 3\theta & -\cos \theta \\
\cos 2\theta & \cos 6\theta & -\cos 6\theta & -\cos 2\theta & -\cos 2\theta & -\cos 6\theta & \cos 6\theta & \cos 2\theta \\
\cos 3\theta & -\cos 7\theta & -\cos \theta & -\cos 5\theta & \cos 5\theta & \cos \theta & \cos 7\theta & -\cos 3\theta \\
\cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta & \cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta \\
\cos 5\theta & -\cos \theta & \cos 7\theta & \cos 3\theta & -\cos 3\theta & -\cos 7\theta & \cos \theta & -\cos 5\theta \\
\cos 6\theta & -\cos 2\theta & \cos 2\theta & -\cos 6\theta & -\cos 6\theta & \cos 2\theta & -\cos 2\theta & \cos 6\theta \\
\cos 7\theta & -\cos 5\theta & \cos 3\theta & -\cos \theta & \cos \theta & -\cos 3\theta & \cos 5\theta & -\cos 7\theta
\end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{pmatrix} \tag{59.10}$$

The rows of this matrix are symmetric for even-indexed coefficients and antisymmetric for odd-indexed coefficients. The even-indexed outputs can therefore be computed from sums of the inputs and the odd-indexed outputs from differences of the inputs:

$$\begin{pmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{pmatrix} = \begin{pmatrix}
\cos 4\theta & \cos 4\theta & \cos 4\theta & \cos 4\theta \\
\cos 2\theta & \cos 6\theta & -\cos 6\theta & -\cos 2\theta \\
\cos 4\theta & -\cos 4\theta & -\cos 4\theta & \cos 4\theta \\
\cos 6\theta & -\cos 2\theta & \cos 2\theta & -\cos 6\theta
\end{pmatrix} \begin{pmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{pmatrix} \tag{59.11}$$

$$\begin{pmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{pmatrix} = \begin{pmatrix}
\cos \theta & \cos 3\theta & \cos 5\theta & \cos 7\theta \\
\cos 3\theta & -\cos 7\theta & -\cos \theta & -\cos 5\theta \\
\cos 5\theta & -\cos \theta & \cos 7\theta & \cos 3\theta \\
\cos 7\theta & -\cos 5\theta & \cos 3\theta & -\cos \theta
\end{pmatrix} \begin{pmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{pmatrix} \tag{59.12}$$

More generally, the matrices in Eqs. (59.11) and (59.12) can be decomposed into a number of simpler matrices, the composition of which can be expressed as a flowgraph. Many fast algorithms have been proposed. Figure 59.3 illustrates the flowgraph of B.G. Lee's algorithm, which is commonly used [10]. Several implementations using fast flowgraphs have been reported [11, 12].

Another approach that has been extensively used is based on the technique of distributed arithmetic. Distributed arithmetic is an efficient way to compute the DCT totally or partially as scalar products. To illustrate the approach, let us compute a scalar product between two length-$M$ vectors $C$ and $X$:

$$Y = \sum_{i=0}^{M-1} c_i \cdot x_i \quad \text{with} \quad x_i = -x_{i,0} + \sum_{j=1}^{B-1} x_{i,j} \cdot 2^{-j} \tag{59.13}$$

where $\{c_i\}$ are $N$-bit constants and $\{x_i\}$ are coded in $B$ bits in 2s complement.
Then Eq. (59.13) can be rewritten as:

$$Y = \sum_{j=0}^{B-1} C_j \cdot 2^{-j} \quad \text{with} \quad C_{j \neq 0} = \sum_{i=0}^{M-1} c_i \, x_{i,j} \quad \text{and} \quad C_0 = -\sum_{i=0}^{M-1} c_i \, x_{i,0} \tag{59.14}$$

FIGURE 59.3: Lee FDCT flowgraph for the one-dimensional 8-point DCT [10].

The change of summing order in $i$ and $j$ characterizes the distributed arithmetic scheme, in which the initial multiplications are distributed to another computation pattern. Since the term $C_j$ has only $2^M$ possible values (which depend on the $x_{i,j}$ values), it is possible to store these $2^M$ possible values in a ROM. An input set of $M$ bits $\{x_{0,j}, x_{1,j}, x_{2,j}, \ldots, x_{M-1,j}\}$ is used as an address, allowing retrieval of the $C_j$ value. These intermediate results are accumulated in $B$ clock cycles, producing one $Y$ value. Figure 59.4 shows a typical architecture for the computation of an $M$-input inner product. The inverter and the MUX are used for inverting the final output of the ROM in order to compute $C_0$.

FIGURE 59.4: Architecture of an M-input inner product using distributed arithmetic.

Figure 59.5 illustrates two typical uses of distributed arithmetic for computing a DCT. Figure 59.5(a) implements the scalar products described by the matrix of Eq. (59.10). Figure 59.5(b) takes advantage of a first stage of additions and subtractions and the scalar products described by the matrices of Eq. (59.11) and Eq. (59.12).

Properties of several dedicated DCT implementations have been reported in [6]. Figure 59.6 shows the silicon area as a function of the throughput rate for selected design examples. The design parameters are normalized to a fictive 1.0 µm CMOS process according to the discussed normalization strategy. As a figure of merit, a linear relationship of throughput rate and required silicon area can be derived:

$$\alpha_{T,0} \approx 0.5 \ \mathrm{mm^2/(Mpel/s)} \tag{59.15}$$

Equation (59.15) can be applied for the silicon area estimation of DCT circuits.
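The linear area estimate of Eq. (59.15) amounts to a single multiplication, sketched here in Python. The function names are hypothetical. The helper `source_rate_mpel_s` rests on an assumption that is not stated in the chapter: that the source rate counts luminance and chrominance samples of a 4:2:2 signal, i.e., two samples per pel, on a 720 × 576 active picture. Under that assumption it reproduces the 20.7 Mpel/s quoted in this section for CCIR-601 signals at 25 Hz.

```python
ALPHA_T0_MM2_PER_MPEL_S = 0.5  # figure of merit from Eq. (59.15)


def source_rate_mpel_s(width, height, frame_rate_hz, samples_per_pel=2.0):
    """Video source rate in Mpel/s.

    samples_per_pel = 2.0 models 4:2:2 sampling (assumption, see lead-in).
    """
    return width * height * frame_rate_hz * samples_per_pel / 1e6


def dct_area_estimate_mm2(rate_mpel_s, alpha=ALPHA_T0_MM2_PER_MPEL_S):
    """Normalized (1.0 micron) silicon area of a DCT circuit, Eq. (59.15)."""
    return alpha * rate_mpel_s


if __name__ == "__main__":
    # Source rates as quoted in the text for TV and HDTV signals.
    for name, rate in (("TV (CCIR-601, 25 Hz)", 20.7), ("HDTV", 110.6)):
        print(f"{name}: about {dct_area_estimate_mm2(rate):.2f} mm^2")
    # Assumed picture format, for illustration only.
    print("assumed CCIR-601 rate:", source_rate_mpel_s(720, 576, 25), "Mpel/s")
```

Because the estimate is normalized to the 1.0 µm reference process, an area for another process would additionally have to be scaled back with Eq. (59.4).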
For example, assuming TV signals according to the CCIR-601 format and a frame rate of 25 Hz, the source rate equals 20.7 Mpel/s. As a figure of merit from Eq. (59.15), a normalized silicon area of about 10.4 mm² can be derived. For HDTV signals the video source rate equals 110.6 Mpel/s and approximately 55.3 mm² silicon area is required for the implementation of the DCT. Assuming an economically sensible maximum chip size of about 100 mm² to 150 mm², we can conclude that the implementation of the DCT does not necessarily require the realization of a dedicated DCT chip and that the DCT core can be combined with several other on-chip modules that perform additional tasks of the video coding scheme.

FIGURE 59.5: Architecture of an 8-point one-dimensional DCT using distributed arithmetic. (a) Pure distributed arithmetic. (b) Mixed D.A.: first stage of flowgraph decomposition products of 8 points followed by 2 times 4 scalar products of 4 points.

FIGURE 59.6: Normalized silicon area and throughput for dedicated DCT circuits.

For motion estimation, several techniques have been proposed in the past. Today, the most important technique for motion estimation is block matching, introduced by [21]. Block matching is based on the matching of blocks between the current and a reference image. This can be done by a full (or exhaustive) search within a search window, but several other approaches have been reported in order to reduce the computation requirements by using an "intelligent" or "directed" search [17, 18, 19, 23, 25, 26, 27].

In case of an exhaustive search block matching algorithm, a block of size $N \times N$ pels of the current image (reference block, denoted $X$) is matched with all the blocks located within a search window (candidate blocks, denoted $Y$). The maximum displacement will be denoted by $w$. The matching criterion generally consists in computing the mean absolute difference (MAD) between the blocks.
Let $x(i, j)$ be the pixels of the reference block and $y(i, j)$ the pixels of the candidate block. The matching distance (or distortion) $D$ is computed according to Eq. (59.16). The indexes $m$ and $n$ indicate the position of the candidate block within the search window. The distortion $D$ is computed for all the $(2w+1)^2$ possible positions of the candidate block within the search window [Eq. (59.16)] and the block corresponding to the minimum distortion is used for prediction. The position of this block within the search window is represented by the motion vector $v$ [Eq. (59.17)].

$$D(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| x(i, j) - y(i + m, j + n) \right| \tag{59.16}$$

$$v = \begin{pmatrix} m \\ n \end{pmatrix} \Bigg|_{D_{\min}} \tag{59.17}$$

The operations involved in computing $D(m, n)$ and $D_{\min}$ are associative. Thus, the order for exploring the index spaces $(i, j)$ and $(m, n)$ is arbitrary and the block matching algorithm can be described by several different dependence graphs. As an example, Fig. 59.7 shows a possible dependence graph (DG) for $w = 1$ and $N = 4$. In this figure, AD denotes an absolute difference and an addition, and M denotes a minimum value computation.

FIGURE 59.7: Dependence graphs of the block matching algorithm. The computations of $v(X, Y)$ and $D(m, n)$ are performed by 2D linear DGs.

The dependence graph for computing $D(m, n)$ is directly mapped into a 2-D array of processing elements (PE), while the dependence graph for computing $v(X, Y)$ is mapped into time (Fig. 59.8). In other words, block matching is performed by a sequential exploration of the search area, while the computation of each distortion is performed in parallel. Each of the AD nodes of the DG is implemented by an AD processing element (AD-PE). The AD-PE stores the value of $x(i, j)$ and receives the value of $y(m + i, n + j)$ corresponding to the current position of the reference block in the search window. It performs the subtraction and the absolute value computation, and adds the [...]
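As a software reference for the exhaustive search of Eqs. (59.16) and (59.17), the following minimal Python sketch evaluates all $(2w+1)^2$ candidate positions sequentially. It does not model the processing-element array described above. The function names and the storage convention are assumptions made for illustration: the search window is held as an $(N + 2w) \times (N + 2w)$ array, so that displacement $(m, n)$ with $-w \le m, n \le w$ maps to array offset $(m + w, n + w)$.

```python
def distortion(ref, window, m, n, w):
    """D(m, n) of Eq. (59.16) for one candidate position."""
    size = len(ref)
    return sum(
        abs(ref[i][j] - window[i + m + w][j + n + w])
        for i in range(size)
        for j in range(size)
    )


def full_search(ref, window, w):
    """Exhaustive block matching: returns ((m, n), D_min), cf. Eq. (59.17)."""
    best_vector, best_d = None, None
    for m in range(-w, w + 1):
        for n in range(-w, w + 1):
            d = distortion(ref, window, m, n, w)
            # Keep the first position that reaches the minimum distortion.
            if best_d is None or d < best_d:
                best_vector, best_d = (m, n), d
    return best_vector, best_d


if __name__ == "__main__":
    N, W = 4, 1  # same block size and displacement as the DG example
    # Synthetic search window with position-dependent values.
    window = [[10 * r * r + c * c for c in range(N + 2 * W)]
              for r in range(N + 2 * W)]
    # Reference block cut out of the window at displacement (1, -1).
    ref = [[window[i + 1 + W][j - 1 + W] for j in range(N)] for i in range(N)]
    print(full_search(ref, window, W))
```

The two outer loops correspond to the sequential exploration of the search area, and the double sum inside `distortion` is the part that a 2-D array of AD processing elements would evaluate in parallel.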
... die. The VSP3 performs the processing of the CCITT H.261 tasks (neglecting Huffman coding) for one macroblock in 45 µs. Since real-time processing of 30 Hz CIF signals requires a processing time of less than 85 µs for one macroblock, an H.261 coder can be implemented based on one VSP3.

59.6.2 Parallel Data Paths

In the previous section, pipelining was presented as a strategy for processing power enhancement...

... al., A video digital signal processor with a vector-pipeline architecture, IEEE J. Solid-State Circuits, 27(12), 1886–1893, Dec. 1992.
[36] Bailey, D. et al., Programmable vision processor/controller, IEEE MICRO, 12(5), 33–39, Oct. 1992.
[37] Gaedke, K., Jeschke, H. and Pirsch, P., A VLSI based MIMD architecture of a multiprocessor system of real-time video processing applications, J. VLSI Signal Processing, ...

... the algorithmic complexity of an MPEG-2 decoder for CCIR-601 signals is about half the complexity of an H.261 codec. Additionally, it has to be taken into account that the number of pixels per frame is about 5.3 times larger for CCIR signals than for CIF signals. From this the normalized implementation size of an MPEG-2 decoder for CCIR-601 signals and a frame rate of 25 Hz can be estimated to 870 mm²...

... Wohlleben, S. and Noll, T.G., CMOS VLSI implementation of the 2D-DCT with linear processor arrays, Proc. Intl. Conf. on Acoustics, Speech and Signal Processing, V3.3, 1990.
[10] Lee, B.G., A new algorithm to compute the discrete cosine transform, IEEE Trans. Acoustics, Speech and Signal Processing, 32(6), 1243–1245, Dec. 1984.
[11] Artieri, A., Macoviak, E., Jutand, F. and Demassieux, N., A VLSI...

... video signals according to the MPEG-1 standard.

59.6.3 Coprocessor Concept

Most programmable architectures for video processing applications achieve an increase of processing power by an adaptation of the architecture to the algorithmic requirements. A feasible approach is the combination of a flexible programmable processor module with one or more adapted modules. This approach leads to an increase of processing...

... strategies are presented in [20, 22, 24, 30].

59.6 Programmable Architectures

According to the three ways for architectural optimization (adaptation, pipelining, and parallel processing), three architectural classes for the implementation of video signal processors can be distinguished:

• Intensive Pipelined Architectures: These architectures are typically scalar architectures that achieve high clock frequencies...

... an unbalanced utilization of the processor modules and therefore to a limitation of the effective processing power of the chip. Applying the coprocessor concept opens up a variety of feasible architecture approaches, which differ in achievable processing power and flexibility of the architecture. In the following several architectures are presented,...

FIGURE 59.14: TMS320C80 (MVP) [43].

... technology) supports the encoding of CIF 30 Hz video signals according to the H.261 standard, including the computation-intensive exhaustive search motion estimation strategy. An overview of the complete chipset is given in [33]. The AxPe640V [37] is another typical example of the coprocessor approach (Fig. 59.16). To provide high flexibility for a broad range of video processing algorithms, the two processor modules...

... programmable. A scalar RISC core supports the processing of tasks with data-dependent control flow, whereas the typically more computation-intensive low-level tasks with data-independent control flow can be executed by a parallel SIMD module. The RISC core functions as a master processor for global control and for processing of tasks such as variable...

FIGURE 59.15: AVP encoder architecture [8].

... and HDTV applications will require the realization of a dedicated block matching chip. Assuming a recent 0.5 µm semiconductor process, the core size estimation leads to about 22 mm² for TV signals and 106 mm² for HDTV signals. To reduce the high computational complexity required for exhaustive search block matching, two strategies can be applied:
1. Decrease of the number of candidate blocks
2. Decrease ...

Date posted: 06/11/2013, 09:15
