A high performance VLSI architecture for integer motion estimation in HEVC

Xu Yuan1, Liu Jinsong1, Gong Liwei1, Zhang Zhi1, Robert K. F. Teng1,2
1 Shenzhen Key Lab of Advanced Communication and Information Processing, College of Information Engineering, Shenzhen University, Shenzhen, 518000, China
2 Department of Electrical Engineering, California State University, Long Beach, 90840, USA

Abstract

A high performance VLSI architecture for integer motion estimation (IME) in High Efficiency Video Coding (HEVC) is presented in this paper. It supports the coding tree block (CTB) structure with the asymmetric motion partition (AMP) mode. The architecture contains two parallel sub-architectures to meet 1080p@30fps real-time video coding. The size L×L of the CTB in the architecture is set to L=32 pixels by default, and it can be extended to L=64 and L=16 pixels. A serial mode decision module that finds the optimal partition mode for the architecture has also been implemented.

978-1-4673-6417-1/13/$31.00 ©2013 IEEE

Introduction

High Efficiency Video Coding (HEVC) is the recent video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardization organizations [1]. Bit-rate reduction at equal perceptual video quality has been demonstrated in HM10.0. Compared to previous video coding standards, HEVC has many new concepts, such as the quadtree structure and asymmetric motion partition (AMP) [2] in integer motion estimation (IME), resulting in higher coding efficiency and more design complexity. The IME is the critical part of video coding design because of its high memory bandwidth, high hardware cost and complex control logic. Therefore, a high performance IME architecture is important for the HEVC encoder.

Many IME VLSI architectures have been studied targeting various standards (e.g. H.264): Cao Wei et al. proposed a reconfigurable architecture for VBSME in H.264 with a memory partition scheme [3]; G. A. Ruiz et al. proposed an efficient VLSI processor including a Lagrangian cost module and a mode decision module [4]; Tuan et al. defined four levels of data-reuse scheme according to memory situations [5]. However, few architectures targeting IME in HEVC have been reported so far.

This paper studies a parallel VLSI architecture for IME in HEVC. The structure supports AMP mode and is aimed at high resolution applications. A serial mode decision module that finds the optimal partition mode has also been implemented.

HEVC Motion Estimation Theory

HEVC is the theory of motion estimation for the VLSI architecture studied in this paper. The coding object in HEVC is the CTB. Its size can be represented as L×L (L=16, 32, 64), while the traditional macroblock size is 16×16. A CTB is further partitioned into coding blocks (CBs), or kept as one CB, according to a quadtree structure, as shown in Figure 1. The root of the quadtree is the CTB. The size of CBs can be represented as M×M (M=8, 16, 32, 64). Figure 1(b) shows the corresponding trellis diagram of the quadtree decomposition; smaller CBs are typically distributed around the object boundary.

Figure 1. Example of quadtree structure: (a) quadtree structure; (b) corresponding trellis diagram (figure label: object boundary).

A CB is further partitioned into prediction blocks (PBs) through three kinds of modes for inter prediction, as shown in Figure 2. They are two square modes, M×M and M/2×M/2; two symmetric modes, M/2×M and M×M/2; and four asymmetric modes, M/4×M (L), M/4×M (R), M×M/4 (U) and M×M/4 (D). M denotes the size of the corresponding parent CB.

Figure 2. Modes for splitting a CB into PBs.

Meanwhile, three constraints must be complied with:
(1) Asymmetric motion partition mode is turned off when M=8,
(2) For reducing the memory bandwidth, 4×4 PBs are not allowed for inter prediction,
(3) 4×8 PBs and 8×4 PBs are only adopted in uni-predictive coding.

VLSI Architecture

With the newly introduced HEVC motion estimation theory, a high performance VLSI architecture for integer motion estimation has been studied. The top level of the architecture is shown in Figure 3. The on-chip RAM holds the current CTB and the search area.

Figure 3. Top-level of design (block labels: DDR, Reference Frame, Current Frame, AXI Interconnect, On-chip RAM, Search Area, Shift_Regs array, Current CTB Registers, PE array, SADs, Comparators, Best SADs, Mode Decision, RAM).

Once the size of the CTB is decided, the complexity of IME is also determined. The maximum size of the CTB is 64×64, while the default block size of the architecture is 32×32. When the size of the CTB is 64×64, a quarter down-sampling module is used to reduce the search points. This strategy strikes a good balance between hardware resources and compression quality. Other schemes, such as the full search algorithm and Level D data reuse [5], have also been used in the architecture design. The full search algorithm has the advantages of computational regularity and excellent output video quality. The Level D scheme reuses pixel data across the entire search window strips of consecutive current blocks.

A 32×33 pixel 2D three-direction shift register array is proposed to improve data fetching efficiency. Since two processing element (PE) arrays are used in parallel, two consecutive reference candidate blocks are stored in the shift register array. When the next two reference candidates are pushed into the shift register array, 32×2 pixels are updated, whereas the other 32×31 pixels are reused. In order to eliminate bubble clock cycles, a column of 33 pixels is added to the array, so the array size becomes 33×33 pixels.

The search window can be scanned in three directions: upward, downward, and left to right. When the direction is left to right, the shift register array uses one cycle to update the requested 33 pixels. For the typical search range [-24, 23], each column holds 48 candidate positions and the two PE arrays match two of them per cycle, so it takes (24+24)/2=24 clock cycles to finish the data matching of one column, and one clock cycle to shift to another column. 24×48=1152 clock cycles are needed to process one CTB.

A processing element (PE) calculates the difference of a pair of pixels, which is used for the Sum of Absolute Differences (SADs) of PBs. The size of the PB decides the number of PEs. One PE array includes 1024 PEs. Two PE arrays can concurrently calculate 2×145 SADs in one clock cycle. The contents of the 145 SADs are shown in Table 1.

Table 1. Content of 145 SADs

Block Size       Block Number    Block Size       Block Number
8×4              32              16×16
4×8              32              32×16
8×8              16              16×32
16×8                             32×8 (up)*
8×16                             32×8 (down)*
16×4 (up)*                       8×32 (left)*
16×4 (down)*                     8×32 (right)*
4×16 (left)*                     32×32
4×16 (right)*

Note: * represents AMP mode.

In order to implement the CTB quadtree partition, each PE array can be divided into 16×16 execution units (16×16 EUs), as shown in Figure 4(a). Each 16×16 EU is divided into sixteen 4×4 EUs, as shown in Figure 4(b). When the CTB size is 64×64, quarter down sampling is performed first to reduce the number of SADs from 593 to 145, as a tradeoff between performance and precision.

Figure 4. The design architecture: (a) PE array architecture; (b) 16×16 EU architecture (block labels: Reference data, Current data, EU, SADs).

Hierarchical adder trees are built to help the PE array generate 145 SADs for each shift operation, as shown in Figure 4(a) and 4(b). The best 145 SADs selected by the comparator modules are sent to the mode decision module.

The mode decision module proposed in this paper is based on the structure presented by G. A. Ruiz [4]. The improvement of the circuit is shown in Figure 5. Three extra adders are used to sum up the costs of the four 16×16 blocks, eliminating the bubble clock cycles of shifting data, and sixteen registers store the block costs to meet the requirement of HEVC.

Figure 5. Mode decision architecture (block labels: Accumulator, MUX, reg, COST, COMPARATOR).

Results and Performance Analysis

The proposed architecture has been implemented in Verilog, and simulated and verified with ModelSim. The Verilog code has been synthesized, placed and routed into a Xilinx Virtex-6 XC6VLX-550T using the Xilinx ISE tool. The synthesized results are given in Table 2.

Table 2. Resources Utilization of the FPGA

Resource          Utilization    Percentage
Slice logic       55346          16%
Slice register    19744          2.9%
BRAM              148 kB         5.2%

The performance of the IME is related to the size of the search window. The default range of displacement is [-24, 23]. It takes 1152 clock cycles to process one CTB. The architecture can process over 70K CTBs per second at a 110 MHz system clock when the initial clock cycles are taken into account. It can meet the requirement of 1080p@30fps video, which needs 61K CTBs per second.

Conclusion

A high performance parallel VLSI architecture for integer motion estimation has been studied. It can meet real-time HEVC encoding requirements for 1080p@30fps video. The architecture has been implemented on a Virtex-6 XC6VLX-550T with a 110 MHz system clock. When implemented as an ASIC, the number of parallel PE arrays could be reduced as the system clock increases.

Reference

[1] B. Bross, et al., "High Efficiency Video Coding (HEVC) Text Specification Draft 9", document JCTVC-K1003, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), (Oct. 2012).
[2] Gary J. Sullivan, et al., "Overview of the High Efficiency Video Coding (HEVC) Standard", IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1648–1667, (Dec. 2012).
[3] Cao Wei, et al., "A High-Performance Reconfigurable VLSI Architecture for VBSME in H.264", IEEE Trans. Consumer Electron., vol. 54, no. 3, pp. 1338–1345, (Aug. 2008).
[4] G. A. Ruiz and J. A. Michell, "An efficient VLSI processor chip for variable block size integer motion estimation in H.264/AVC", Signal Processing: Image Communication, vol. 26, pp. 289–303, (July 2011).
[5] Tuan J-C, et al., "On the data reuse and memory bandwidth analysis of full-search block-matching VLSI architecture", IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61–72, (2002).
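
The SAD counts quoted in the VLSI Architecture section, 145 for a 32×32 CTB and 593 for a 64×64 CTB, can be reproduced by counting PBs over the quadtree. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes that the M/2×M/2 mode is not counted separately (those blocks are the M×M PBs of the next quadtree level), that each AMP mode contributes two PBs, and that AMP is off at M=8 as stated in the constraints. The function name `count_sads` and its parameters are hypothetical.

```python
def count_sads(ctb_size, min_cb=8):
    """Count the PB SADs needed for one candidate position of a CTB.

    Assumed accounting (not taken verbatim from the paper):
      - every CB of size M contributes 1 MxM PB, 2 MxM/2 PBs, 2 M/2xM PBs
      - every CB with M > min_cb also contributes 4 AMP modes x 2 PBs
      - M/2xM/2 PBs are covered by the MxM PBs of the next quadtree level
    """
    total = 0
    cb = ctb_size
    while cb >= min_cb:
        cbs_in_ctb = (ctb_size // cb) ** 2
        per_cb = 1 + 2 + 2            # square + two symmetric modes
        if cb > min_cb:
            per_cb += 4 * 2           # AMP is turned off when M = 8
        total += cbs_in_ctb * per_cb
        cb //= 2
    return total


print(count_sads(32))  # 145
print(count_sads(64))  # 593
```

Under these assumptions the count matches both figures given in the text, which suggests why quarter down sampling of a 64×64 CTB brings the SAD count back to that of a 32×32 CTB.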

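
The throughput figures in the Results and Performance Analysis section can be checked with a few lines of arithmetic. This is a rough sketch, not a model of the actual design: it assumes that partial CTB rows of a 1080-line frame are rounded up to whole CTBs, and the quotient of clock rate and search cycles is only an upper bound because it ignores the initial clock cycles mentioned in the text. All function and variable names are hypothetical.

```python
import math


def ctbs_per_second(width, height, fps, ctb_size):
    """CTBs that must be processed per second (partial CTBs rounded up)."""
    cols = math.ceil(width / ctb_size)
    rows = math.ceil(height / ctb_size)
    return cols * rows * fps


def search_cycles(range_lo=-24, range_hi=23, pe_arrays=2):
    """Clock cycles for a full search of one CTB over a square search range."""
    positions = range_hi - range_lo + 1          # 48 positions per axis
    cycles_per_column = positions // pe_arrays   # two candidates per cycle
    return cycles_per_column * positions


required = ctbs_per_second(1920, 1080, 30, 32)   # 60 * 34 * 30
cycles = search_cycles()                         # 24 * 48
upper_bound = 110_000_000 / cycles               # ignores initial cycles

print(required, cycles, round(upper_bound))
```

With these assumptions the requirement comes out at about 61K CTBs per second and the search takes 1152 cycles per CTB, in line with the text; the rate of over 70K CTBs per second reported there lies below the idealized bound computed here, as expected once the initial cycles are included.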