Digital Signal Processing Handbook P54


de Haan, G. “Video Scanning Format Conversion and Motion Estimation,” Digital Signal Processing Handbook, Ed. Vijay K. Madisetti and Douglas B. Williams. Boca Raton: CRC Press LLC, 1999.

© 1999 by CRC Press LLC

54 Video Scanning Format Conversion and Motion Estimation

Gerard de Haan, Philips Research Laboratories

54.1 Introduction
54.2 Conversion vs. Standardization
54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals
     Temporal Interpolation • Vertical Interpolation and Interlaced Scanning
54.4 Alternatives for Sampling Rate Conversion Theory
     Simple Algorithms • Advanced Algorithms
54.5 Motion Estimation
     Pel-Recursive Estimators • Block-Matching Algorithm • Search Strategies
54.6 Motion Estimation and Scanning Format Conversion
     Hierarchical Motion Estimation • Recursive Search Block-Matching
References

54.1 Introduction

The scanning format of a video signal is a major determinant of general picture quality. Specifically, it determines such aspects as stationary and dynamic resolution, motion portrayal, aliasing, scanning structure visibility, and flicker. Various formats have been designed and standardized to strike a particular balance between quality, cost, transmission capacity, and compatibility with other standards.

The field of video scanning format conversion is concerned with the translation of video signals from one format into another. It consists of two basic parts: temporal interpolation and spatial interpolation. A particular case is de-interlacing, which poses an inseparable spatio-temporal interpolation problem.

Vertical and temporal interpolation cause practical and fundamental difficulties in achieving high-quality scanning format conversion. This is because the conditions of the sampling theorem are generally not met in video signals. If they were satisfied, standard conversions of arbitrary accuracy would be possible using suitable linear filters.
The earlier conversion methods neglected the fundamental problems and, consequently, negatively influenced the resolution and the motion portrayal. More recent algorithms apply motion vectors to predict the position of moving objects at unregistered temporal instances to improve the quality of the picture at the output format. A so-called motion estimator extracts these vectors from the input signal. The motion vectors partly solve the fundamental problems, but the demands on the motion estimator for scanning format conversion are severe.

In this section we shall first briefly indicate why we can expect that the importance of scanning format conversion will grow. Then we discuss in more detail the fundamental problems of temporal interpolation of video signals. Next we provide a concise overview of the basic methods in scanning format conversion, focused on temporal sampling rate conversion and de-interlacing. Finally, we give an overview of motion estimation algorithms, which are crucial in the more advanced scanning format convertors.

54.2 Conversion vs. Standardization

Scanning formats have been designed in the past to strike a particular compromise between quality, cost, transmission capacity, and compatibility with other standards. There were three main formats in use a decade ago: 50 Hz interlaced, 60 Hz interlaced, and 24 (or 25) Hz progressive (film). With the arrival of video-conferencing, HDTV, workstations, and PCs, many new video formats have appeared. These include low-end formats such as CIF and QCIF with smaller picture size and lower frame rates, progressive and interlaced HDTV formats at 50 Hz and 60 Hz, and other video formats used on computer workstations and enhanced television displays with field rates up to 100 Hz. It will be clear that the problem of scanning format conversion is of growing importance, despite many attempts to globally standardize video formats.
54.3 Problems with Linear Sampling Rate Conversion Applied to Video Signals

High-quality scanning format conversion is difficult to achieve, as the conditions of the sampling theorem are generally not met in video signals. The solution to sampling rate conversion (SRC) for systems satisfying the conditions of the sampling theorem is well known for arbitrary sampling ratios [1]. Figure 54.1 illustrates the procedure for a ratio of 2. To arrive at the double output sampling rate, in a first step, zero-valued samples are inserted between every pair of input samples. In a second step, a low-pass filter (LPF) at the output rate is applied to remove the first repeat spectrum from the input data.

In case of a temporal SRC, the interpolating LPF has to be a temporal LPF, i.e., a filter including picture delays. Though feasible, this makes it a fairly expensive filter. A more complicated, though still not fundamental, problem occurs at the signal acquisition stage. Since scenes do occur with almost unlimited spatial and/or temporal bandwidth, the sampling theorem requires that this signal be low-pass filtered prior to the scanning process. Interlaced scanning, as commonly applied, even demands two-dimensional prefiltering in the vertical-temporal frequency plane. In a video system, it is the camera that samples the scene in a vertical and temporal sense; therefore, the prefilter has to be realized in the optical path. Although there are considerable practical problems achieving this filtering, it would apparently bring down the problem of temporal interpolation of video images to the common sampling rate conversion problem. The next section will show, however, that in addition to the practical problems there is a fundamental problem as well.

54.3.1 Temporal Interpolation

Considering the eye’s sine-wave temporal frequency response for full brightness potential and full field display [2], as shown in Fig.
54.2, temporal prefiltering with a bandwidth of 75 Hz at first sight seems sufficient. The fundamental problem now is that the relation shown in Fig. 54.2 holds for temporal frequencies as they occur at the retina of the observer. These frequencies, however, equal the frequencies at the display only if the eye is stationary with respect to this display. Particularly with the eye tracking objects moving on the screen, this assumption is no longer valid. For a tracking observer, very high temporal frequencies on the screen can be transformed to much lower frequencies or even DC at the retina. Consequently, suppression of these frequencies with an interpolating low-pass filter results in excessive blurring of moving objects, as will be discussed next.

FIGURE 54.1: Consecutive steps in upsampling with a factor of two.

Figure 54.3 shows, in a time-discrete representation, a simple object, a square, moving with a constant velocity. Again, in this example, we consider up-sampling with a factor of two. Therefore, the true position of the object is available at every second temporal position only (e.g., the odd-numbered samples). The “tracking observer” views along the motion trajectory, represented with a line in the illustration, which results in a stationary image of the object on the retina. If the output field sampling frequency exceeds the cutoff temporal frequency of the human visual system (see footnote 1), the viewer will have the illusion that the object is continuously present. Therefore, the object is actually seen at a position corresponding with the motion trajectory. If now, e.g., in the 6th output field, the object is interpolated according to SRC theory, weighted copies of the object from surrounding fields resulting from the interpolating LPF are displayed. Figure 54.3 illustrates the case of a symmetrical transversal low-pass filter.
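The effect can be reproduced with a small numerical sketch. The code below is only an illustration under stated assumptions: the function names `upsample_temporal_by_two` and `moving_square`, the 7-tap filter `TAPS` (a common half-band interpolator, not the filter used for Figure 54.3), and the one-dimensional "picture" are all hypothetical choices. It performs the two SRC steps along the time axis (zero-field insertion, then a symmetrical transversal temporal LPF) on a square moving at 8 pel per input field.

```python
def upsample_temporal_by_two(fields, taps):
    """Double the field rate: insert zero fields, then filter along time.

    fields: list of input fields, each a list of pixel values (one image line).
    taps:   odd-length symmetrical temporal FIR with a DC gain of 2
            (hypothetical choice; any interpolating LPF could be used).
    """
    width = len(fields[0])
    # Step 1: a zero-valued field after every input field.
    stuffed = []
    for field in fields:
        stuffed.append(list(field))
        stuffed.append([0.0] * width)
    # Step 2: temporal low-pass filter at the output field rate.
    half = len(taps) // 2
    out = []
    for t in range(len(stuffed)):
        acc = [0.0] * width
        for k, h in enumerate(taps):
            src = t + k - half
            if h != 0.0 and 0 <= src < len(stuffed):
                for x in range(width):
                    acc[x] += h * stuffed[src][x]
        out.append(acc)
    return out


def moving_square(n_fields, width, start, size, velocity):
    """One image line per field containing a bright square that moves
    `velocity` pixels per input field."""
    fields = []
    for m in range(n_fields):
        line = [0.0] * width
        for x in range(start + m * velocity, start + m * velocity + size):
            line[x] = 1.0
        fields.append(line)
    return fields


# Assumed 7-tap half-band interpolator; input fields pass unchanged.
TAPS = [-1 / 16, 0.0, 9 / 16, 1.0, 9 / 16, 0.0, -1 / 16]

fields = moving_square(n_fields=6, width=80, start=10, size=4, velocity=8)
out = upsample_temporal_by_two(fields, TAPS)

# Output field 5 lies halfway between input fields 2 and 3, so the motion
# trajectory puts the square at x = 30 in that field.
for x in (18, 26, 30, 34, 42):
    print(x, out[5][x])
```

In this sketch the interpolated field contains copies of the square at the positions it had in the four surrounding input fields (x = 18, 26, 34, 42), weighted by the filter coefficients, and nothing at x = 30, where the tracking observer expects it.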
In this situation, the viewer sees the object at the correct position but also various attenuated and displaced copies (the impulse response of the interpolating temporal filter) of the object in a neighborhood. The attenuation depends on the coefficients of the interpolating filter, and the distance between the copies is related to the displacement of the moving object in a field period. For the object-tracking observer, therefore, the temporal LPF is transformed into a spatial LPF. For an object velocity of one pixel per field period (one pel/field), its frequency characteristic equals the temporal frequency characteristic of the interpolating LPF (see footnote 2). A velocity of 1 pel/field is slow motion: in broadcast picture material, velocities in a range exceeding 16 pel/field do occur. Thus, the spatial blur caused by the SRC process becomes unacceptable even for moderate object velocities.

Footnote 1: Actually the picture update frequency may be even as low as 16 Hz, to guarantee smooth perceived motion (see, e.g., [3]). The higher display rates are merely necessary to prevent the annoying large area flicker.

FIGURE 54.2: The contrast sensitivity of the human observer (y-axis) for large areas of uniform brightness, as a function of the temporal frequency (x-axis).

FIGURE 54.3: The effect of temporal interpolation for an object-tracking observer. The field numbers are counted at the output field rate.

54.3.2 Vertical Interpolation and Interlaced Scanning

Much as in the situation of field rate conversion, it may seem that sequential scan conversion is an up-sampling problem for which SRC theory provides an adequate solution. However, straightforward one-dimensional up-sampling in the vertical frequency domain is incorrect, as the data is clearly sub-Nyquist sampled due to interlace.
If, more correctly, the sequential scan conversion is considered as a two-dimensional up-sampling problem in the vertical-temporal frequency domain, we arrive at a discussion similar to the one in Section 54.3.1: the problem cannot be solved as we do not know the temporal frequency at the retina of a movement-tracking observer. It is possible to disregard this problem and to perform a two-dimensional SRC, implicitly assuming a stationary viewer and prefiltered information. Such systems were described and have been implemented for studio applications. With the older image pick-up tubes the results can be satisfactory, as these devices have a poor dynamic resolution. When modern (CCD) cameras are used, however, the limitations of the assumptions become obvious.

Footnote 2: It is assumed here that both filters are normalized to their respective sampling frequency.

54.4 Alternatives for Sampling Rate Conversion Theory

With the problem of linear interpolation of video signals clarified, we will discuss alternative algorithms developed over time. These algorithms fall into two categories. A first category simplifies the interpolation filter prescribed by SRC theory, considering that a completely correct solution is impossible anyway. The resulting “simple algorithms” are more attractive for hardware realization than the method from which they are derived and under certain conditions can perform quite similarly.

The second category includes the most “advanced algorithms” for scanning format conversion. These methods can be characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. The difference between the various options lies mainly in the number of possible directions, and dimensions, which are considered. The implementation can show various linear interpolation filters controlled by one or more detectors, or a multi-dimensional nonlinear filter that has an inherent edge adaptivity.
As this description allows a large number of algorithms, we will illustrate it with some important examples.

54.4.1 Simple Algorithms

SRC theory in the temporal and vertical frequency domain is not applicable due to the missing prefilter in common video systems. A sophisticated linear interpolation filter therefore makes little sense. Any interpolating (spatio-)temporal low-pass filter will suppress original temporal frequency components as well as aliased signal components, as they occupy, by definition, the same spectrum. As the first effect is desired and the second not, the transfer function of the filter strikes a compromise between alias and blurring. Repetition of the most recent sample in this sense is optimal for the dynamic resolution and worst for alias. A strong temporal low-pass filter suppresses much (not necessarily all) alias and yields a poor dynamic resolution.

The annoyance of the temporal alias depends on the input and output picture frequency, and particularly their difference. In the easiest case, both frequencies are high and their difference is 50 Hz or more. In the worst case, input and output picture rate are low and their difference is on the order of 10 Hz. In case of an annoying beat frequency, an interpolating LPF usually improves picture quality; otherwise the best compromise is closer to repetition of the most recent sample.

54.4.2 Advanced Algorithms

As indicated before, these methods are characterized by their common attempt to interpolate the 3-D image data in the direction in which the correlation is highest. To this end they have either an explicit or an implicit detector to find this direction. In case of (1-D) temporal interpolation the explicit detector is usually called a motion detector; for 2-D spatial interpolation it is called an edge detector; while the most advanced device, estimating the optimal spatio-temporal (3-D) interpolation direction, is usually called a motion estimator.
The interpolation filter can be recursive or transversal, and can have any number of taps, but a transversal filter with one or two taps is the most common choice. For a two-tap FIR approach we can write the interpolated video signal $F_{int}$, in picture $n$, at spatial position $\vec{x} = (x, y)^T$ as a function of the input video signal $F(\vec{x}, n)$:

$$F_{int}(\vec{x}, n) = \frac{1}{2}\left[ F\left(\vec{x} + \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix}, n + \delta_3\right) + F\left(\vec{x} - \begin{pmatrix} \delta_1 \\ \delta_2 \end{pmatrix}, n - \delta_3\right) \right] \tag{54.1}$$

In this terminology a motion detector controls $\delta_3$, an edge detector $\delta_1$ and $\delta_2$, while a motion estimator can be applied to determine $\delta_1$, $\delta_2$, and $\delta_3$.

Algorithms with a Motion Detector

To detect motion, the difference between two successive pictures is calculated. It is too simple, however, to expect this signal to become zero in a picture part without moving objects. The common problems with the detection are noise and alias. Additional problems occurring in some systems are color subcarriers causing non-stationarities in colored regions, interlace causing non-stationarities in vertically detailed picture parts, and timing jitter of the sampling clock, which is particularly harmful in detailed areas. All these problems imply that the output of the motion detector usually is not a binary, but rather a multi-level signal, indicating the probability of motion. Usual (but not always valid) assumptions made to improve the detector are:

1. Noise is small and signal is large.
2. The spectrum part around the color carrier carries no motion information.
3. Low-frequency energy in the signal is larger than in the noise and alias.
4. Moving objects are large compared to a pixel.

The general structure of the motion detector resulting from these assumptions is depicted in Figure 54.4.

FIGURE 54.4: General structure of a motion detector.

As can be seen, the difference signal is first low-pass (and carrier reject) filtered to profit from assumptions 2 and 3.
It also makes the detector less “nervous” for timing jitter in detailed areas. After the rectification another low-pass filter improves the consistency of the motion signal, based on assumption 4. Finally, the nonlinear (but monotonic) transfer function in the last block translates the signal into a probability figure for the motion, $P_m$, using assumption 1. This last function may have to be adapted to the expected noise level. Low-pass filters are not necessarily linear. More than one detector can be used, working on more than just two pictures in the neighborhood of the current image, and a logical or linear combination of their outputs may lead to a more reliable indication of motion.

The motion detector (MD) is applied to switch or fade between two processing modes, one of which is optimal for stationary and the other for moving image parts. Examples are:

• De-interlacing. The MD fades between intra-field interpolation (line averaging, or edge-dependent spatial interpolation) and inter-field interpolation (repetition of the previous field, averaging of neighboring fields, etc.).
• Field rate doubling on interlaced video. The MD fades between repetition of fields (best dynamic resolution without motion compensation for moving picture parts) and repetition of frames (best spatial resolution in stationary image parts).

To slightly elaborate on the first example of de-interlacing, we define the interpolated pixel $X_m(\vec{x}, n)$ in a moving picture part as:

$$X_m(\vec{x}, n) = \frac{1}{2}\left[ F\left(\vec{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right) + F\left(\vec{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right) \right] \tag{54.2}$$

while for stationary picture parts the interpolated pixel $X_s(\vec{x}, n)$ is taken as:

$$X_s(\vec{x}, n) = F(\vec{x}, n - 1) \tag{54.3}$$

and, taking the probability of motion $P_m$ from the motion detector into account, the output is given by:

$$F_{int}(\vec{x}, n) = P_m X_m(\vec{x}, n) + (1 - P_m) X_s(\vec{x}, n) \tag{54.4}$$

In most practical cases the output $P_m$ has a nonlinear relation with the actual probability.
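The detector structure and the fade of Eqs. (54.2) to (54.4) can be sketched for a single image line as follows. This is a minimal illustration, not the circuit of Figure 54.4: the 3-tap `[1, 2, 1]/4` low-pass filter, the `coring` and `slope` parameters of the nonlinear transfer function, and all function names are hypothetical choices, and which two pictures supply the co-sited lines `cur_line` and `ref_line` is left to the caller.

```python
def smooth3(line):
    """3-tap [1, 2, 1]/4 horizontal low-pass filter with edge replication
    (an assumed, deliberately simple stand-in for the detector's LPFs)."""
    n = len(line)
    return [
        (line[max(i - 1, 0)] + 2.0 * line[i] + line[min(i + 1, n - 1)]) / 4.0
        for i in range(n)
    ]


def motion_probability(cur_line, ref_line, coring=4.0, slope=0.125):
    """Multi-level motion signal P_m in [0, 1] for each pixel of a line.

    Stages: picture difference, low-pass, rectification, low-pass, and a
    monotonic nonlinear mapping. `coring` and `slope` are hypothetical
    parameters that would be tuned to the expected noise level.
    """
    diff = [c - r for c, r in zip(cur_line, ref_line)]
    diff = smooth3(diff)                    # low-pass before rectification
    magnitude = [abs(d) for d in diff]      # rectification
    magnitude = smooth3(magnitude)          # low-pass after rectification
    return [min(1.0, max(0.0, (m - coring) * slope)) for m in magnitude]


def deinterlace_line(above, below, previous, p_m):
    """Fade between line averaging and field insertion per pixel.

    above, below: existing lines of the current field around the missing line.
    previous:     the co-sited line from the previous field.
    p_m:          motion probability per pixel.
    """
    out = []
    for a, b, s, p in zip(above, below, previous, p_m):
        x_m = 0.5 * (a + b)                     # Eq. (54.2), moving parts
        x_s = s                                 # Eq. (54.3), stationary parts
        out.append(p * x_m + (1.0 - p) * x_s)   # Eq. (54.4)
    return out


# Stationary picture part: the previous field is inserted.
p_still = motion_probability([100.0] * 8, [100.0] * 8)
print(deinterlace_line([10.0] * 8, [30.0] * 8, [50.0] * 8, p_still))

# Strong picture difference: the output falls back to line averaging.
p_moving = motion_probability([200.0] * 8, [100.0] * 8)
print(deinterlace_line([10.0] * 8, [30.0] * 8, [50.0] * 8, p_moving))
```

Because the mapping saturates at 0 and 1, the fade degenerates into a hard switch for very small and very large differences and blends the two interpolators in between.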
Algorithms with an Edge Detector

To detect the orientation of a spatial edge, usually the differences between pairs of spatially neighboring pixels are calculated. Again it is a bit unrealistic to expect that a zero difference is a reliable indication of a spatial direction in which the signal is stationary. The same problems (noise, alias, carriers, timing jitter) occur as with motion detection. The edge detector (ED) is applied to switch or fade between at least two but usually more processing modes, each of them optimal for interpolation of a certain orientation of the spatial edge. Examples are:

• De-interlacing. The ED fades between vertical line averaging and diagonal averaging (±45°, or even more angles).
• Up-conversion to a higher resolution format. A simple bi-linear interpolation filter is applied with its coefficients adapted to the output of the edge detector.

FIGURE 54.5: Identification of pixels as applied for direction dependent spatial interpolation.

In Fig. 54.5, X is the pixel to be interpolated for the sequential scan conversion, and the result applying pixels in a neighborhood (A, B, C, D, E, and F) is either $X_a$, $X_b$, or $X_c$, where:

$$X_a = \frac{1}{2}[A + F] = \frac{1}{2}\left[ F\left(\vec{x} - \begin{pmatrix} 1 \\ 1 \end{pmatrix}, n\right) + F\left(\vec{x} + \begin{pmatrix} 1 \\ 1 \end{pmatrix}, n\right) \right] \tag{54.5}$$

and:

$$X_b = \frac{1}{2}[B + E] = \frac{1}{2}\left[ F\left(\vec{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right) + F\left(\vec{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right) \right] \tag{54.6}$$

and:

$$X_c = \frac{1}{2}[C + D] = \frac{1}{2}\left[ F\left(\vec{x} + \begin{pmatrix} +1 \\ -1 \end{pmatrix}, n\right) + F\left(\vec{x} + \begin{pmatrix} -1 \\ +1 \end{pmatrix}, n\right) \right] \tag{54.7}$$

The selection of $X_a$, $X_b$, or $X_c$ as the interpolated output $F_{int}$ is controlled by a luminance gradient indication calculated from the same neighborhood:

$$F_{int}(\vec{x}, n) = \begin{cases} X_a, & (|A - F| < |C - D| \wedge |A - F| < |B - E|) \\ X_b, & (|B - E| \le |A - F| \wedge |B - E| \le |C - D|) \\ X_c, & (|C - D| < |A - F| \wedge |C - D| < |B - E|) \end{cases} \tag{54.8}$$

In this example, the gradient is calculated on the same pixels that are used in the interpolation step. This is not necessarily the case.
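A minimal sketch of the selection rule of Eq. (54.8) for one missing pixel is given below. The function name `edge_dependent_pixel` and the list-based line representation are hypothetical. Note one assumption: the three conditions of Eq. (54.8) do not cover the case where both diagonal differences are equal and smaller than the vertical one, so the sketch falls back to vertical averaging there.

```python
def edge_dependent_pixel(above, below, x):
    """Interpolate the missing pixel at column x between two existing lines.

    Pixel naming follows Eqs. (54.5) to (54.7): A, B, C lie on the line above
    (left, centre, right) and D, E, F on the line below (left, centre, right).
    The caller must ensure 1 <= x <= len(above) - 2.
    """
    a, b, c = above[x - 1], above[x], above[x + 1]
    d, e, f = below[x - 1], below[x], below[x + 1]

    grad_a = abs(a - f)   # along the A-F diagonal
    grad_b = abs(b - e)   # vertical
    grad_c = abs(c - d)   # along the C-D diagonal

    if grad_a < grad_c and grad_a < grad_b:
        return 0.5 * (a + f)      # X_a, Eq. (54.5)
    if grad_c < grad_a and grad_c < grad_b:
        return 0.5 * (c + d)      # X_c, Eq. (54.7)
    # X_b, Eq. (54.6); also used as an assumed fallback when Eq. (54.8)
    # leaves the choice open (grad_a == grad_c < grad_b).
    return 0.5 * (b + e)


# A step edge running from upper-left to lower-right: the correct value at
# the missing pixel is 0, whereas plain line averaging would give 50.
above = [0, 0, 100, 100, 100]
below = [0, 0, 0, 0, 100]
print(edge_dependent_pixel(above, below, 2))
```

As the text notes, a practical detector would filter the gradients and the decision; the sketch takes the raw pixel differences directly.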
Similar to the earlier described motion detector, it is advantageous to filter the video signal prior to and/or after the rectification in Eq. (54.8). Also the decision, i.e., the optimal interpolation angle, can be low-pass filtered to improve the consistency of the interpolation angle. Finally, the edge-dependent interpolation can be combined with (motion adaptive or motion compensated) temporal interpolation to improve the interpolation quality of near-horizontal edges.

Implicit Detection in Nonlinear Interpolation Filters

Many nonlinear interpolation methods have been described. Most popular is the class of order statistical filters. Combinations with linear (bandsplitting) filters are known, optimizing the interpolation for individual spectrum parts. We will limit ourselves to some basic examples here.

An illustration of a basic inherently adapting filter is shown in Figure 54.6. The line to be interpolated is found as the median of the spatially neighboring lines (a and b) and the corresponding line (c) from the previous field:

$$F_{int}(\vec{x}, n) = \mathrm{median}[a, b, c] = \mathrm{median}\left[ F\left(\vec{x} + \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right),\; F\left(\vec{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right),\; F(\vec{x}, n - 1) \right] \tag{54.9}$$

with:

$$\mathrm{median}(X, Y, Z) = \begin{cases} X, & (Y \le X \le Z \vee Z \le X \le Y) \\ Y, & (X < Y \le Z \vee Z \le Y < X) \\ Z, & (\text{otherwise}) \end{cases} \tag{54.10}$$

FIGURE 54.6: Sequential scan conversion with three-tap vertical-temporal median filtering. The thin lines show which pixels are input for the median filter.

The inherent adaptation to edges is understood as follows: In case of a temporal edge (i.e., motion) larger than the spatial edge (i.e., vertical detail), the difference between a and b is relatively small compared to their difference with c. Therefore, an intra-field interpolation results (a or b is copied). In case of a non-moving vertical edge, the difference between a and b will be relatively large compared to the difference between c and a or b. In this case, the inter-field interpolation (c is copied) is most likely.
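The three-tap median of Eqs. (54.9) and (54.10) is compact enough to write out directly. The sketch below is illustrative only; the function names `median3` and `deinterlace_median` and the per-line list representation are hypothetical.

```python
def median3(x, y, z):
    """Median of three values, written exactly as the cases of Eq. (54.10)."""
    if y <= x <= z or z <= x <= y:
        return x
    if x < y <= z or z <= y < x:
        return y
    return z


def deinterlace_median(a_line, b_line, c_line):
    """Interpolate a missing line per Eq. (54.9).

    a_line, b_line: the spatially neighboring lines of the current field.
    c_line:         the corresponding line from the previous field.
    """
    return [median3(a, b, c) for a, b, c in zip(a_line, b_line, c_line)]


# Motion (temporal edge larger than vertical detail): a or b is copied.
print(median3(10, 12, 200))

# Non-moving vertical edge: the previous-field pixel c is copied.
print(median3(0, 100, 90))
```

The two calls mirror the two cases discussed above: with a = 10, b = 12, c = 200 the filter selects b (intra-field), and with a = 0, b = 100, c = 90 it selects c (inter-field).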
It is possible to combine edge detectors with nonlinear filters, e.g., a so-called weighted median filter. In a weighted median filter, the (integer) weight given to a sample indicates the number of times its value is included in the input of the filter to the ranking stage. An increase of this weight increases the chance this sample value is selected as the median. It therefore provides a method, using the output of an edge detector with uncertainties, to statistically improve the performance of the interpolation. We will again use Fig. 54.5 to identify the location of the pixels used in the interpolation. The output value for the pixel position indicated with X results as:

$$F_{int}(\vec{x}, n) = \mathrm{median}\left( A, B, C, D, E, F, \alpha \cdot X_{-1}, \beta \cdot \frac{B + E}{2} \right), \quad (\alpha, \beta \in \mathbb{N}) \tag{54.11}$$

with:

$$X_{-1} = F(\vec{x}, n - 1), \quad A = F\left(\vec{x} - \begin{pmatrix} 1 \\ 1 \end{pmatrix}, n\right), \quad B = F\left(\vec{x} - \begin{pmatrix} 0 \\ 1 \end{pmatrix}, n\right), \quad \ldots \tag{54.12}$$

as illustrated in Fig. 54.5. The weighting (α and β) implies that an assumed “important” pixel is fed more than once to the median calculating circuit:

$$\alpha \cdot A = \underbrace{A, A, \ldots, A}_{\alpha\ \text{times}} \tag{54.13}$$

The combination arises if a motion detector is used to control the weighting factors of the pixel from the previous field and that of the value found by line averaging. A large value of α increases the probability of field insertion, while a large β causes an increased probability of line averaging.

Although the examples in this section are limited to de-interlacing, it should be noted that proposals exist for field rate conversion as well.

Algorithms with a Motion Estimator

The idea to interpolate picture content in the direction in which it is most correlated can be extended to a three-dimensional case. This results in an interpolation along the motion trajectory. Figure 54.7 defines the motion trajectory as the line that connects identical picture parts in a sequence [...]
... motion vector into earlier pictures until it points (almost) to an existing pixel
3. Recursive de-interlacing of the signal

The implication of GST is that it is possible to perfectly reconstruct a signal sampled at 1/n times the Nyquist rate if n independent sets of samples describe the signal. The de-interlacing problem is a specific case for which n = 2. The required two sets are the current field and ...

... earlier text. We will make an exception, however, for temporal interpolation on interlaced signals, as this poses non-trivial problems even with knowledge of local motion.

Motion Compensated De-Interlacing

In general, the pixels required for the motion compensated interpolation do not exist in the time-discrete input signal, e.g., due to non-integer velocities. In the horizontal domain this problem can be ...

... of methods have been proposed. Two classes can be distinguished, combinations of which are possible:

• Methods that perform a postprocessing on the output vector field to improve the consistency.
• Methods in which a smoothness constraint is integrated in the estimator.

Postprocessing can be straightforward, applying basically low-pass filtering to improve the spatial and/or temporal consistency or smoothness ...

References

... Part Two. Determination of frame frequency for television in terms of flicker characteristics, Proc. of the I.R.E., 23(4), 295-310, 1935.
[2] van den Enden, A.W.M. and Verhoeckx, N.A.M., Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ.
[3] Zworykin, V.K. and Morton, G.A., Television, 2nd ed., John Wiley & Sons, New York, 1954.

...
... including smoothness constraints. Integrated solutions can be expected to realize a better performance than the straightforward representatives of the first class at a lower expense than the sophisticated processing methods. The constraint can either be explicit, e.g., by adding a “discontinuity penalty” to the error criterion of a block-matcher: C, X, n = F x, n , −F x − C, n − 1 ...

... vectors are estimated on the low frequency band. The result is used as a prediction for a more accurate estimate at the next sub-band, which contains higher frequencies, etc. At the top of the pyramid, the signal is strongly prefiltered and sub-sampled. The bandwidth of the filter increases and the sub-sampling factors decrease, going down in the hierarchy, until the full resolution is reached on the lowest ...

... correspond to the best matching candidate vectors. Sub-pixel accuracy better than a tenth of a pixel can be achieved by fitting a quadratic curve through the elements in this matrix. For interlaced video signals, p = 2 is the common choice. The peaks in the phase plane can be applied to identify the most likely candidate vectors $\vec{C}(\vec{x}, n)$ for a consecutive block-matching algorithm, evaluating all candidates ...

... block from a previous field, summed over the block $B(\vec{X})$:

$$\epsilon(\vec{C}, \vec{X}, n) = \sum_{\vec{x} \in B(\vec{X})} \mathrm{COST}\left( F(\vec{x}, n),\; F(\vec{x} - \vec{C}, n - p) \right) \tag{54.25}$$

A common choice for p is either 1 or 2, depending on whether the signal is interlaced or not. Although the COST function itself can be rather straightforward and simple to implement, the high repetition factor for this calculation creates a huge burden. To save calculational ...
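A block-matching error of the form of Eq. (54.25) can be sketched as follows. The excerpt does not specify the COST function, so the sketch assumes the absolute difference (giving a sum of absolute differences); the function names `match_error` and `best_vector`, the square block, and the exhaustive evaluation of a candidate list are likewise hypothetical choices. The field distance p is represented by whichever earlier picture is passed as `prev`.

```python
def match_error(cur, prev, bx, by, size, vector):
    """Error of candidate vector C = (cx, cy) for the block at (bx, by).

    cur, prev: pictures n and n - p as lists of rows.
    Assumes COST(u, v) = |u - v|, which the source does not prescribe.
    Candidates that would read outside `prev` get an infinite error.
    """
    cx, cy = vector
    height, width = len(prev), len(prev[0])
    if not (0 <= by - cy and by - cy + size <= height):
        return float("inf")
    if not (0 <= bx - cx and bx - cx + size <= width):
        return float("inf")
    total = 0
    for y in range(by, by + size):
        for x in range(bx, bx + size):
            # F(x, n) compared with F(x - C, n - p)
            total += abs(cur[y][x] - prev[y - cy][x - cx])
    return total


def best_vector(cur, prev, bx, by, size, candidates):
    """Return the candidate vector with the smallest match error."""
    return min(candidates,
               key=lambda c: match_error(cur, prev, bx, by, size, c))


# Synthetic test picture: a ramp shifted by C = (2, 1) between two pictures.
prev = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [[prev[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(16)]
       for y in range(16)]

candidates = [(cx, cy) for cy in range(-3, 4) for cx in range(-3, 4)]
print(best_vector(cur, prev, 6, 6, 4, candidates))
```

Evaluating every candidate for every block, as done here, is what makes the repetition factor high; the search strategies listed in the chapter outline aim to reduce the number of candidates that need to be evaluated.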

Date posted: 19/10/2013, 18:15
