Chapter 2 Theory
2.3 High Dynamic Range Imaging
2.3.1 High Dynamic Range Imaging Rendering Algorithm
HDR image rendering algorithms fall into two categories: global operators and local operators. Global operators apply the same mapping to every pixel of an image, based on the overall image content, whereas local operators vary the mapping according to spatially localized content. Although global operators are faster to compute and easier to implement, local operators allow greater dynamic range compression. The following is a brief introduction to several of these operators.
Sigmoidal Transformation: A sigmoid contrast enhancement function S(t), derived from a discrete cumulative normal function, is used to rescale lightness for gamut mapping.
This method was presented by Braun in 1999. It was later modified to compress HDR images using the logarithm of luminance.
(32)
Histogram Adjustment: Proposed by Ward in 1997, this method modifies the luminance histogram by incorporating human visual models of glare, spatial acuity, and color sensitivity, so that the imperfections of human vision are reproduced.
Figure 2-12 Dodging and burning effect
Photographic Reproduction: Different luminance mappings are applied to highlight and shadow regions to simulate the dodging-and-burning effect of traditional photography. As Figure 2-12 illustrates, dodging decreases the exposure to make the film negative brighter, while burning increases the exposure to make it darker [44]. This tone mapping technique was presented by Reinhard et al. in 2002.
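As an illustration, the global part of the photographic operator can be sketched in Python as follows. The sketch covers only the global mapping (scaling by the log-average luminance followed by L/(1+L)), not the local dodging-and-burning step. The function name, the key value 0.18, and the sample luminances are assumptions chosen for illustration.

```python
import math

def reinhard_global(luminance, key=0.18, delta=1e-6):
    """Map world luminances to display values in [0, 1).

    luminance: 2-D list of non-negative luminance values.
    key:       assumed "middle gray" target for the scene.
    delta:     small offset that avoids log(0) for black pixels.
    """
    flat = [v for row in luminance for v in row]
    # Log-average luminance of the whole image (a global statistic).
    log_avg = math.exp(sum(math.log(delta + v) for v in flat) / len(flat))
    out = []
    for row in luminance:
        scaled = [key * v / log_avg for v in row]
        # Compress: small values stay nearly linear, large values approach 1.
        out.append([s / (1.0 + s) for s in scaled])
    return out

# Hypothetical scene spanning six orders of magnitude.
hdr = [[0.01, 1.0], [100.0, 10000.0]]
ldr = reinhard_global(hdr)
```

Because the same curve is applied to every pixel, the ordering of luminances is preserved while the range is compressed into the displayable interval.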
Bilateral Filtering Technique: An edge-preserving spatial filter decomposes the image into a base layer and a detail layer. The base layer contains the large-scale variations. The overall brightness and contrast of the base layer are compressed, and the two layers are then recombined into the final image. This technique, proposed by Durand and Dorsey in 2002, reduces the overall contrast while maintaining local detail.
Local Eye Adaptation: The Naka-Rushton equation is modified to predict the responses of cones and rods, and the luminance channel is compressed according to the resulting S-shaped response function. This local eye adaptation method was presented by Ledda et al. in 2004.
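For reference, the basic Naka-Rushton response, R = Rmax * I^n / (I^n + sigma^n), can be sketched as below. This is the unmodified textbook form, not the modified version used in the local eye adaptation model; the function name, the exponent n = 1.0, and the sample values are assumptions.

```python
def naka_rushton(intensity, sigma, n=1.0, r_max=1.0):
    """Basic Naka-Rushton response.

    intensity: stimulus intensity (non-negative).
    sigma:     semi-saturation constant; the response is r_max / 2 here.
    n:         assumed sensitivity exponent.
    """
    i_n = intensity ** n
    return r_max * i_n / (i_n + sigma ** n)

# Responses for intensities one decade below, at, and above sigma.
responses = [naka_rushton(i, sigma=1.0) for i in (0.1, 1.0, 10.0)]
```

The response is half of its maximum when the intensity equals sigma, which is why shifting sigma with the adaptation level moves the S-shaped curve along the (logarithmic) intensity axis.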
The aforementioned algorithms [44] [45] are only a few of those available, but they share a core idea: adjusting the luminance range of the image to fit conventional displays, whose contrast ratio is limited. In brief, HDR images provide a more realistic visualization of the real world, closer to what people perceive. If researchers one day overcome the limited capability of displays and our limited knowledge of the human visual system, more robust models and operators can be used to improve perceptual accuracy. Finally, Figure 2-13 shows an example of an HDR image rendered from two images with different captured intensities [47].
Figure 2-13 Example of high dynamic range image by tone mapping: (a) +2 stop, (b) -2 stop, (c) HDR image
Chapter 3
Structure and Algorithms
Binocular vision is the foundation of 3D capturing with a lens array. We use the disparity between elemental images to render depth information. However, the correspondence problem is always an issue for stereo cameras. The lens array in our High Dynamic Depth Range (HDDR) system is therefore modified with a different focal arrangement, and the configurations of the temporal and spatial systems are illustrated in the first part of this chapter. Depth information is extracted by the Depth Estimation Reference Software (DERS), and a post-processing step, Depth map Fusion from Edge Exploring Thresholding (DFEET), fuses the depth maps together. In the second part of this chapter, we elaborate on each step of DERS and DFEET. Finally, the limitations of our algorithm are discussed.
3.1 3D Image Capturing with Lens Array
In conventional microscopes, depth of field decreases as magnification increases, so the focal plane must be adjusted to observe the specimen clearly. Some 3D image capturing techniques are likewise restricted to the near field by a shallow depth of field. Blurred images cause matching errors when the disparity is computed. Even for a light field camera, the reversibility of light rays is the first hypothesis. As a result, when a source point spreads into a spot, the light field function is no longer a one-to-one and onto function, so the integral fails to render the scene back. To eliminate the out-of-focus issue, the lens array can use variable focal regions. Regarding the resolution of elemental images, the fewer lenslets are used, the more depth information can be acquired.
To minimize the number of lenslets, each depth of field should be well arranged.
According to the concept and equation described in Chapter 2, depth of field is derived from depth of focus by taking longitudinal magnification into account. For convenience of formulation, we consider only a symmetrical depth of field, given by
(33)
Objects within a range of 2r are imaged clearly; the other parameters are the same as in Chapter 2. Using the above equation, different depth ranges of interest can be designed as shown in Figure 3-1. The concept of extended depth of field is that each depth of field overlaps with its closest neighbor. Ideally, objects that fall outside a depth of field are out of focus, so we can render part of the depth in each of the overlapping depths of field and then stack the parts together. By reducing the wasted depth of field in this way, the total range of clear imaging is extended, for example, from the depth of field of s1 to wider than that of s3. This concept is similar to High Dynamic Range (HDR) in photography, but for a depth map the estimated quantity is depth. We therefore coin the term High Dynamic Depth Range (HDDR) for this wide range of depth rendering.
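The stacking idea can be illustrated with a short Python sketch that merges overlapping depth ranges into one extended range. Each layer is described by a focus distance s and a half-width r, so that the layer covers [s - r, s + r]; the function name and the numerical values are hypothetical and are not taken from the experiment.

```python
def merge_depth_ranges(layers):
    """Merge overlapping depth-of-field intervals.

    layers: list of (s, r) pairs, with focus distance s and half-width r,
            so each layer images the interval [s - r, s + r] clearly.
    Returns a list of (near, far) intervals after merging overlaps.
    """
    intervals = sorted((s - r, s + r) for s, r in layers)
    merged = [list(intervals[0])]
    for near, far in intervals[1:]:
        if near <= merged[-1][1]:
            # Overlaps the previous layer: extend the current range.
            merged[-1][1] = max(merged[-1][1], far)
        else:
            # A gap: objects in between would be out of focus in every layer.
            merged.append([near, far])
    return [tuple(m) for m in merged]

# Three hypothetical layers (in cm) that overlap their neighbors by 5 cm.
layers = [(80.0, 20.0), (115.0, 20.0), (150.0, 20.0)]
extended = merge_depth_ranges(layers)
```

With overlapping neighbors the result is a single continuous range; if the layers are spaced too far apart, the function returns several disjoint ranges, which reveals a gap in the arrangement.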
Figure 3-1 Scheme of extended depth of field
However, from the standpoint of image processing, the requirement on blur is stricter. Consequently, it is better to move from the idea of depth of field to that of the Modulation Transfer Function (MTF) [48]. The MTF is the frequency response of an optical system, so it governs the sharpness of edges and the contrast of the image, since an edge is a high-frequency component. The idea of MTF also reveals that the number of different focal lengths can be reduced. We therefore begin with the simplest case: two depths of field. Note that we use the term depth of field to mean the layer within which images are “in focus”. Because of the occlusion issue, three elemental images are required to render one depth map. As a result, the lens array should comprise at least six lenslets with two focal positions. Both the spatially multiplexed and the temporally multiplexed types can realize this idea of HDDR, as illustrated in Figure 3-2 and Figure 3-3. We then render two depth maps, each well defined over part of the depth range, and finally fuse them into an HDDR depth map.
Figure 3-2 Spatial HDDR system with 2 DOF
Figure 3-3 Temporal HDDR system with 2 DOF
Of course, these two methods are not entirely equivalent, because of the deviation in height between the two depth maps. This deviation distorts objects, since they are captured from different perspectives. When the two depth maps are fused into an HDDR depth map, it becomes more difficult to reconstruct the correct shape of the objects. However, this issue is mitigated as the pitch of the lens array becomes smaller.
In addition, the result of integrating three depths of field will be demonstrated to show that the HDDR concept can be applied to a wider range as long as the number of lenslets is increased. As Figure 3-4 shows, every row of lenslets contributes one depth of field, and the total depth of field can be greatly extended by stacking the layers.
Figure 3-4 HDDR system with 3 DOF
However, the demarcation between foreground and background is ambiguous. This ambiguity stems from the degree of blur the software can tolerate. The more blurred the objects are, the less accurate the matching becomes, because there are too few feature points. In addition, a patterned ground and a terminal wall are used to ensure that there are enough feature points for the software to find corresponding points and compute the disparity. Note that we do not use a large f-number to extend the depth of field, because of its low light efficiency. If the captured intensity is to be maintained, a dim environment or fast-moving objects lead to a dilemma in choosing the exposure time.
Figure 3-5 Experiment setup
Parameter | 2 depth of field | 3 depth of field
Camera | Nikon D60 | Nikon D60
Lens array | 1x3 or 2x3 | 1x3
System | Temporal/Spatial HDDR | Temporal HDDR
f-number | F/2, F/2.8, F/22 | F/2.8, F/22
Number of elemental images | 6 | 9
Resolution of elemental images | 1200x800 | 1200x800
Pitch | 1 cm or 5 cm | 1 cm
Object distance | 87, 150 cm or 110, 208, 235 cm | 35, 76, 152 cm
Table 3-1 Experimental Parameters
Last but not least, the experimental setup and parameters are shown in Figure 3-5 and Table 3-1, respectively. In the experiment, we use a moving camera to simulate the HDDR system. Although we did not use a “real” lens array, the moving pitch of 5 cm between elemental images was determined by the size of our camera lens, so using a lens array in the future would be feasible. Figure 3-6 illustrates the analogy between the lens array system and the moving camera. We can regard the lens and the main lens as a single optical system, so the effective focal length can be calculated. Using a moving camera is therefore possible, provided that the focal length and the moving pitch are adjusted carefully. However, our camera lens is a prime lens, so we cannot change the focal length; instead, we change the image distance to sweep the focal plane. Consequently, the magnification will have to be modified when a real lens array is used in the future. As long as the resolution of the elemental images is sufficient, we can determine the resizing factor according to the depth, because the reserved range of each conventional depth map can first be acquired from edge information. In addition, when a coplanar lens array with different focal lengths is used, the image planes of the lenslets may differ. Therefore, as shown in Figure 3-7, our goal is for the depth of field of the camera to cover the variation of the image planes so that every elemental image is captured clearly. To prevent the elemental images from overlapping each other, the focal lengths and the field of view of each lenslet must be arranged carefully. To conclude, there are some differences between using a lens array and using a moving camera, but they do not contradict our original concept of the HDDR system.
Figure 3-6 Effective lens design
Figure 3-7 Overall imaging system with lens array
3.2 Algorithm
“In the logician’s voice:
‘an algorithm is a finite procedure, written in a fixed symbolic vocabulary, governed by precise instructions, moving in discrete steps, 1, 2, 3, …, whose execution requires no insight, cleverness, intuition, intelligence, or perspicuity, and that sooner or later comes to an end.’”
- “The Advent of the Algorithm” by David Berlinski, 2000
After capturing the elemental images, we have to visualize the depth information. For 3D TV transmission, the most straightforward and common way is to use a depth map.
Based on a monoscopic video, the corresponding depth map can be used to synthesize stereo image pairs for additional virtual views of the 3D scene. This technique is called Depth-Image-Based Rendering (DIBR) [49]. The overall flow for generating an HDDR depth map is shown in Figure 3-8.
Figure 3-8 Flow chart of algorithm
Our algorithm can be roughly divided into two parts. The first part uses stereo matching software to generate conventional depth maps. The second part fuses the conventional depth maps. The following two sections describe the details of the software and how all the depth maps are fused together.
3.2.1 Depth Estimation Reference Software (DERS)
The Moving Picture Experts Group (MPEG) is an authoritative group in charge of developing global standards for the compression, decompression, and coded representation of pictures and audio [50]. On September 24th, 2009, the latest version of its matching tool, the Depth Estimation Reference Software (DERS), was released at the London meeting. This version, DERS 5.0, introduced a soft-segmentation matching technique, because one of the most crucial elements of depth estimation is matching corresponding points between the stereo images. In the first version of DERS, the direct pixel matching approach was vulnerable to noise. DERS 5.0 therefore uses a weighted comparison mask to describe the significance of the pixels neighboring the processed pixel [51]. Moreover, DERS differs from other depth map generators in that it uses three input views instead of two and can perform depth estimation on image sequences. The three input views are the left, central, and right views, which are equally spaced along the horizontal baseline. The benefit of introducing the third view is that the occlusion issue is somewhat reduced.
Generally speaking, the disparity estimation in DERS includes the following steps [52]:
a. Image segmentation (optional)
Three methods can be chosen: mean shift, pyramid segmentation, and K-means clustering. However, the bandwidth parameters are fixed in the source code.
b. Pixel/block matching
Two options are available for matching: 1) The simplest approach is pixel matching, which compares intensity differences pixel by pixel. 2) The other approach is block matching, which uses a 3-by-3 window with adaptive weights to compute the cost function. The weights are given by the following formula:
(34)
where P is the center point in the processed frame, P’ is the processed point in the processed frame, W(P, P’) is the soft-segmentation mask around the center point P, and I(P) is the intensity of the image at point P. The remaining terms of the formula are the Euclidean distance between P and P’, a color similarity parameter, and a distance similarity parameter.
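To make the role of the weights concrete, the following Python sketch computes a 3-by-3 weighted matching cost. It assumes that the weights take the exponential form exp(-|I(P) - I(P’)| / gamma_c - dist(P, P’) / gamma_d), which is common for adaptive support weights; this form, the names gamma_c and gamma_d, their default values, and the sample images are assumptions and may differ from the DERS implementation.

```python
import math

def soft_weights(image, cy, cx, gamma_c=10.0, gamma_d=2.0):
    """Assumed soft-segmentation weights in a 3-by-3 window around (cy, cx)."""
    center = image[cy][cx]
    weights = {}
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            color_diff = abs(image[cy + dy][cx + dx] - center)
            dist = math.hypot(dy, dx)  # Euclidean distance to the center
            weights[(dy, dx)] = math.exp(-color_diff / gamma_c - dist / gamma_d)
    return weights

def block_cost(left, right, y, x, d, gamma_c=10.0, gamma_d=2.0):
    """Weighted mean absolute difference between left (y, x) and right (y, x - d).

    Only interior pixels are supported; border handling is omitted.
    """
    w = soft_weights(left, y, x, gamma_c, gamma_d)
    num = sum(wt * abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
              for (dy, dx), wt in w.items())
    return num / sum(w.values())

# Hypothetical pair: the right view is the left view shifted by one pixel.
left = [[10, 20, 80, 90, 30] for _ in range(3)]
right = [[20, 80, 90, 30, 30] for _ in range(3)]
cost_true = block_cost(left, right, 1, 2, 1)   # correct disparity
cost_wrong = block_cost(left, right, 1, 2, 0)  # wrong disparity
```

Pixels that are close to the center and similar in intensity receive weights near 1, so they dominate the cost, while pixels that probably belong to a different object contribute little.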
c. Cost adjustment for temporal enhancement (optional)
This function is especially useful for image sequences. It updates the cost function using block motion detection: the data cost of static blocks is set to zero to encourage the same disparity to be selected again during graph cut optimization.
d. Disparity computation using graph cuts optimization
Graph cut is one of the most common optimization methods in stereo correspondence work, and it is based on two strategies: α-β-swap moves and α-expansion moves. The key idea of graph cut is to minimize the energy function E(d).
E(d) = E_{data}(d) + \lambda E_{smooth}(d)    (35)
where λ adjusts the smoothness and d = d(x, y) is the disparity map.
When judging how well corresponding points match, measuring similarity is equivalent to measuring dissimilarity. We therefore define a cost function C(x, y, d) in disparity space that returns the dissimilarity, and Edata(d) is the summation of this cost.
E_{data}(d) = \sum_{(x,y)} C(x, y, d(x, y))    (36)
The smoothness term is a penalty function ρ(d,I) that usually depends on the differences in disparity and intensity between neighboring pixels. ρ(d,I) increases with the disparity difference but decreases where the discontinuities lie on color edges.
Esmooth(d) is then the summation of the penalty.
E_{smooth}(d) = \sum_{(x,y)} \rho(d, I)    (37)
DERS uses the α-expansion move approach for graph cut optimization.
As Figure 3-9 depicts, pixels with any label in the original labeling may be assigned the new label α. By grouping the pixels appropriately, the energy function is minimized.
Figure 3-9 Example of results from moves in the graph cut algorithm: (a) original pixel labeling, (b) after an α-expansion move
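The energy that graph cut minimizes can be evaluated for any candidate disparity map, as the following Python sketch shows. It only evaluates the data term plus λ times the smoothness term; it does not implement α-expansion. The particular penalty (absolute disparity difference, halved across a color edge), the names edge_thresh and edge_scale, and the toy data are assumptions for illustration.

```python
def data_energy(cost, disp):
    """Sum of C(x, y, d(x, y)); cost[y][x][d] is the cost volume."""
    return sum(cost[y][x][disp[y][x]]
               for y in range(len(disp)) for x in range(len(disp[0])))

def smooth_energy(disp, image, edge_thresh=20, edge_scale=0.5):
    """Sum of an assumed penalty over right and lower neighbor pairs."""
    h, w = len(disp), len(disp[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w:
                    penalty = abs(disp[y][x] - disp[ny][nx])
                    if abs(image[y][x] - image[ny][nx]) > edge_thresh:
                        # A disparity jump on a color edge is penalized less.
                        penalty *= edge_scale
                    total += penalty
    return total

def energy(cost, disp, image, lam=1.0):
    """Total energy: data term plus lam times the smoothness term."""
    return data_energy(cost, disp) + lam * smooth_energy(disp, image)

# Toy 1x4 scene with a color edge between the second and third pixel.
image = [[10, 10, 100, 100]]
cost = [[[0, 5], [0, 5], [5, 0], [5, 0]]]  # two disparity labels per pixel
disp_a = [[0, 0, 1, 1]]  # disparity jump placed on the color edge
disp_b = [[0, 1, 1, 1]]  # disparity jump placed inside a flat region
e_a = energy(cost, disp_a, image)
e_b = energy(cost, disp_b, image)
```

The labeling whose discontinuity coincides with the color edge has the lower energy, which is the behavior the smoothness penalty is meant to encourage.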
e. Refinement of created disparity maps by using plane fitting (optional)
This step is triggered when image segmentation is activated. The least squares method is used to refine the previous results of segmentation.
f. Post processing: 3-by-3 median filter
The final step further reduces noise with a median filter. However, the filter size is predetermined.
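A 3-by-3 median filter of the kind used in the post-processing step can be sketched in Python as follows. Leaving the border pixels unchanged is an assumption of this sketch, and the function name and sample data are hypothetical.

```python
def median3x3(img):
    """Apply a 3-by-3 median filter to a 2-D list; border pixels are kept."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]  # middle of the nine sorted values
    return out

# A flat disparity patch with one noisy outlier in the center.
noisy = [[5, 5, 5],
         [5, 200, 5],
         [5, 5, 5]]
clean = median3x3(noisy)
```

Unlike an averaging filter, the median removes the isolated outlier completely without spreading it into the neighboring disparities.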
3.2.2 Depth Map Fusion from Edge Exploring Thresholding (DFEET)
DFEET consists of four steps: 1) deviation correction, 2) edge searching, 3) thresholding, and 4) combination.
First, when the spatial HDDR system is used, the two depth maps are not rendered from the same height. Like elemental images, they contain disparity. We therefore have to adjust one depth map to the same height as the other. For example, if we use the depth map from the lower position as the reference, we have to shift the upper depth map downward so that the positions of objects coincide. Figure 3-10 illustrates an example of deviation correction. The red image is the original image captured from the upper position, and the green image is the shifted image. Note that deviation correction is carried out after the conventional depth map is generated. Hence, both the color image and the depth map must be shifted.
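The shifting operation itself is simple, as the following Python sketch suggests. It shifts a 2-D image downward by a given number of rows and pads the vacated rows at the top; the function name, the fill value, and the sample data are assumptions, and the amount of shift is taken as given rather than estimated.

```python
def shift_down(img, rows, fill=0):
    """Shift a 2-D image downward by `rows` pixels, padding the top with `fill`."""
    if rows <= 0:
        return [row[:] for row in img]
    width = len(img[0])
    rows = min(rows, len(img))
    padding = [[fill] * width for _ in range(rows)]
    # Rows pushed past the bottom edge are discarded.
    return padding + [row[:] for row in img[:len(img) - rows]]

# The same shift is applied to the color image and to its depth map.
upper_color = [[1, 2], [3, 4], [5, 6]]
upper_depth = [[90, 90], [40, 40], [10, 10]]
color_shifted = shift_down(upper_color, 1)
depth_shifted = shift_down(upper_depth, 1)
```

Applying one function to both inputs keeps the color image and the depth map registered with each other after the correction.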
Figure 3-10 Deviation correction
Needless to say, if the temporal HDDR system is used, no deviation correction is needed, i.e., the disparity is zero.
Second, edges are often regarded as an indication of whether objects are in focus. Consequently, we apply an edge filter to approximate the region of the depth of field. High-pass filtering is the basis of edge detection. However, a high-pass filter such as the Laplacian operator is sensitive to noise, so noise suppression is required beforehand.
Fortunately, Marr and Hildreth proposed the Laplacian of Gaussian (LoG) method to address both considerations. We can create a mask by sampling the following equation [53].
(38)
where G(x,y) is a Gaussian function. As the name implies, the operator performs Gaussian smoothing as well as Laplacian sharpening. LoG is a common approach to edge detection that is easy to implement and acceptably accurate, so we use it in our algorithm. After extracting the edges of the elemental images, we convolve the edge image with another identity matrix in order to find a representative focal point.
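Sampling a LoG mask can be sketched in Python as follows. The sketch assumes the commonly used closed form ((x^2 + y^2 - 2 sigma^2) / sigma^4) exp(-(x^2 + y^2) / (2 sigma^2)) for an unnormalized Gaussian with standard deviation sigma, and it subtracts the mean so that the coefficients sum to zero. The function names, the mask size, and the value of sigma are assumptions.

```python
import math

def log_kernel(size=5, sigma=1.0):
    """Sample an assumed LoG form on a size-by-size grid (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            row.append((r2 - 2 * sigma ** 2) / sigma ** 4
                       * math.exp(-r2 / (2 * sigma ** 2)))
        kernel.append(row)
    # Force a zero sum so that flat regions give no response.
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def convolve_valid(img, kernel):
    """Filter a 2-D list with a symmetric kernel, without border padding."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    return [[sum(kernel[j][i] * img[y + j][x + i]
                 for j in range(k) for i in range(k))
             for x in range(w - k + 1)]
            for y in range(h - k + 1)]

mask = log_kernel()
flat = [[50] * 7 for _ in range(5)]                     # no edges
step = [[0, 0, 0, 100, 100, 100, 100] for _ in range(5)]  # vertical edge
flat_response = convolve_valid(flat, mask)
step_response = convolve_valid(step, mask)
```

Because the mask is symmetric, correlation and convolution give the same result here. The flat image produces a response of essentially zero, whereas the step image produces a clear response near the edge.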
Figure 3-11 Representative point of a deviation-corrected elemental image
In Figure 3-11, apart from the line at the top, which results from the deviation correction, the green point marks a position that is rich in edge information, so we regard it as the representative focal point. Conceptually, we take this point as the location of the focal plane when the image was captured, but this is not always correct. When the content is textureless or has low contrast, the representative focal point may lie far from where we expect. This affects the accuracy of the threshold value determined in the next step.
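One possible way to locate a point that is rich in edge information is sketched below in Python. The sketch sums the absolute edge strength inside a small window and returns the window center with the largest sum. The use of an all-ones window, the window size, and the function name are assumptions of this sketch and may differ from the matrix actually used in DFEET.

```python
def representative_point(edge_map, win=3):
    """Return (row, col) of the window center with the largest edge sum."""
    h, w = len(edge_map), len(edge_map[0])
    half = win // 2
    best, best_pos = -1.0, None
    for y in range(half, h - half):
        for x in range(half, w - half):
            total = sum(abs(edge_map[y + dy][x + dx])
                        for dy in range(-half, half + 1)
                        for dx in range(-half, half + 1))
            if total > best:
                best, best_pos = total, (y, x)
    return best_pos

# Hypothetical edge map with most of its edge strength near the lower right.
edge_map = [[3, 0, 0, 0, 0],
            [0, 0, 0, 0, 0],
            [0, 0, 0, 5, 0],
            [0, 0, 4, 9, 0],
            [0, 0, 0, 0, 2]]
depth_map = [[10 * (r + c) for c in range(5)] for r in range(5)]
row, col = representative_point(edge_map)
gray_level = depth_map[row][col]  # gray level later used for thresholding
```

If the edge map is nearly empty, as for textureless content, the maximum is poorly defined, which is consistent with the limitation noted above.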
However, as mentioned before, the boundary between two depths of field is ambiguous, so some variation in the threshold value is acceptable. Only if the representative focal points deviate too much will segmentation errors occur in the depth map, because the boundary then falls outside the region of ambiguity. In that case, ill-defined objects are not filtered out by thresholding.
With regard to thresholding, once the representative focal point is found, we can read the corresponding gray level from the depth map. By the same token, the representative focal point and its corresponding gray level can be found for the other depth of field as well. As a result, if N depths of field are arranged during capture, N representative focal points are extracted. Subsequently, N-1 threshold gray values can be decided by averaging the corresponding gray levels of two