
Aberration Correcting Vision Transformers for High-Fidelity Metalens Imaging

Byeonghyeon Lee1, Youbin Kim2, Yongjae Jo2, Hyunsu Kim2, Hyemi Park2, Yangkyu Kim2, Debabrata Mandal3,
Praneeth Chakravarthula3, Inki Kim2, and Eunbyung Park1
1Yonsei University      2Sungkyunkwan University      3University of North Carolina at Chapel Hill

Teaser: Compact and Tiny Lens (Metalens) · Video Aberration Correction · 3D Reconstruction with Aberrated Images · Image Aberration Correction

Abstract

The metalens is an emerging optical system with an irreplaceable merit: it can be manufactured in ultra-thin and compact form factors, which holds great promise for various applications. Despite its advantage in miniaturization, its practicality is constrained by spatially varying aberrations and distortions that significantly degrade image quality. Several previous works have attempted to address different types of aberrations, yet most of them are designed for conventional bulky lenses and are ineffective against the severe aberrations of metalenses. While aberration correction methods designed specifically for metalenses exist, they still fall short in restoration quality. In this work, we propose a novel aberration correction framework for metalens-captured images, harnessing Vision Transformers (ViTs), which have the potential to restore metalens images with non-uniform aberrations. Specifically, we devise Multiple Adaptive Filters Guidance (MAFG), in which multiple Wiener filters enrich the degraded input images with various noise-detail balances and a cross-attention module reweights the features according to the differing degrees of aberration. In addition, we introduce a Spatial and Transposed self-Attention Fusion (STAF) module, which aggregates features from spatial self-attention and transposed self-attention modules to further improve aberration correction. We conduct extensive experiments on correcting aberrated images and videos and on clean 3D reconstruction, where the proposed method outperforms previous works by a significant margin. We further fabricate a metalens and verify the practicality of our method by restoring images captured with the manufactured metalens.

Metalens Fabrication

Figure: the fabricated metalens (left) and its fabrication process (right).

We fabricated a metalens with the PSF from Neural Nano-Optics. A metalens with a diameter of 500 µm and a focal length of 1 mm was designed by optimizing a polynomial phase equation. The SiN meta-atom library was generated using rigorous coupled-wave analysis simulations for circular pillars with a height of 750 nm. The widths of the selected meta-atoms ranged from 100 nm to 300 nm, with a lattice period of 350 nm.
To fabricate the designed metalens, a 750 nm thick SiN layer was deposited onto a SiO2 substrate using plasma-enhanced chemical vapor deposition. A 200 nm thick positive photoresist layer was spin-coated at 4000 RPM. The pattern of circular nano-pillar meta-atoms was then transferred onto the photoresist using electron beam lithography (figure right, (a)) at a dose of 3.75 C/m². To prevent charging, 100 µL of ESPACER was spin-coated at 2000 RPM for 30 seconds.
The exposed resist was developed in a 1:3 solution of methyl isobutyl ketone/isopropyl alcohol for 11 minutes. Subsequently, a 40 nm thick chromium layer was deposited as a hard mask using an electron beam evaporator (figure right, (b)). The unexposed photoresist was removed through a lift-off process in acetone at room temperature for 1 hour, leaving the Cr hard mask intact. Patterning was finalized using inductively coupled plasma etching with SF6 and C4F8 gases for 10 minutes. Finally, the Cr hard mask was removed with a chromium etchant for 5 minutes. The fabricated metalens is shown on the left side of the figure.

The figure above illustrates the image capture setup. An optical microscope system was set up to acquire images through the metalens. Images displayed on a 5.5-inch FHD display were captured with a CMOS camera coupled to a magnification system consisting of a 20x, 0.5-NA objective lens and a tube lens. Using a linear motorized stage, the metalens was positioned so that its focal plane coincided with that of the objective lens. The camera exposure time was adjusted on a white image prior to recording to prevent saturation. The point spread functions (PSFs) were then acquired with the same setup using 450 nm, 532 nm, and 635 nm lasers for calibration and for training the model. The following images show the restoration results of our model on real images captured with the fabricated metalens under this setup.

Correcting Images Captured with the Fabricated Metalens

Methods

Our model comprises Multiple Adaptive Filters Guidance (MAFG), which produces different representations with various noise-detail balances, and a Spatial and Transposed self-Attention Fusion (STAF) module, which aggregates features differently in the encoder and decoder.


Multiple Adaptive Filters Guidance (MAFG)

We propose to use multiple Wiener filters to guide aberration correction with several distinct representations. Obtaining an optimal Wiener filter with an accurate SNR is infeasible because the noise distribution is unknown in the real world. Instead of estimating the noise distribution, we adopt multiple Wiener filters with different noise-to-signal parameters \( K \). This design approximates the ideal filter by employing \( M \) filters whose \( K \) values span a wide range. The resulting representations are fed to the restoration model, where they enrich the features complementarily and in turn improve aberration correction.
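
For reference, the single-channel Wiener deconvolution underlying this guidance can be written in its standard form (notation ours):

\[
\hat{X}(u, v) = \frac{H^{*}(u, v)}{\lvert H(u, v) \rvert^{2} + K}\, Y(u, v),
\]

where \( Y \) is the Fourier transform of the aberrated image, \( H \) is the optical transfer function computed from the PSF, \( H^{*} \) is its complex conjugate, and \( K \) acts as an assumed noise-to-signal ratio. MAFG instantiates this filter \( M \) times with different values of \( K \).
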
We extend the multiple Wiener filters to Multiple Adaptive Filters Guidance (MAFG), which determines \( K \) adaptively from the image intensity. Images with higher intensity tend to have better signal quality, so brighter channels are typically less noise-dominated. We therefore penalize noise less, and preserve more detail, in bright images by adjusting \( K \) with the image intensity. We also treat each color channel differently to avoid unnecessarily suppressing high-SNR details, since metalenses exhibit wavelength-dependent chromatic aberrations that degrade each channel to a different degree.
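
A minimal NumPy sketch of this idea follows; the per-channel rescaling of \( K \) by mean intensity is our illustrative assumption, not the paper's exact formula.

```python
import numpy as np

def wiener_filter(y, psf, K):
    # Frequency-domain Wiener deconvolution with noise-to-signal ratio K.
    H = np.fft.fft2(np.fft.ifftshift(psf), s=y.shape)  # OTF from the (centered) PSF
    G = np.conj(H) / (np.abs(H) ** 2 + K)              # Wiener kernel
    return np.real(np.fft.ifft2(np.fft.fft2(y) * G))

def mafg_guidance(img, psfs, base_Ks=(1e-4, 1e-3, 1e-2)):
    # img: (H, W, 3) in [0, 1]; psfs: one (H, W) PSF per color channel.
    reps = []
    for K in base_Ks:
        chans = []
        for c in range(img.shape[-1]):
            # Brighter (higher-SNR) channels get a smaller effective K,
            # penalizing noise less and preserving more detail.
            K_c = K / max(float(img[..., c].mean()), 1e-3)
            chans.append(wiener_filter(img[..., c], psfs[c], K_c))
        reps.append(np.stack(chans, axis=-1))
    return reps  # M representations with different noise-detail balances
```
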
To dynamically integrate these diverse representations, we employ a cross-attention module that reweights the filtered features according to their relevance to the median representation. This enables spatially adaptive correction, guiding the restoration network with features emphasized based on the severity of local aberration.
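
A PyTorch sketch of this reweighting could look as follows; treating each pixel as a token over the \( M \) filter outputs and using `nn.MultiheadAttention` are simplifications we assume for brevity.

```python
import torch
import torch.nn as nn

class FilterCrossAttention(nn.Module):
    # Sketch: fuse features from M filtered inputs via cross-attention,
    # with the median-K representation serving as the query.
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feats):
        # feats: (B, M, C, H, W), features extracted from the M filtered images.
        B, M, C, H, W = feats.shape
        tokens = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, M, C)
        query = tokens[:, M // 2 : M // 2 + 1]       # median representation
        fused, _ = self.attn(query, tokens, tokens)  # reweight by relevance to it
        return fused.reshape(B, H, W, C).permute(0, 3, 1, 2)
```

Because the attention weights are computed independently at every pixel, the fusion adapts to the locally varying severity of the aberration.
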


Spatial and Transposed self-Attention Fusion (STAF)

We propose a Spatial and Transposed self-Attention Fusion (STAF) module to further improve image restoration. By leveraging both Spatial Attention (SA) and Transposed Attention (TA), the STAF module captures diverse spatial dependencies. To fully realize the potential of SA and TA in image restoration, it is important to consider the distinct roles of the encoder and decoder in Transformers. The encoder focuses on capturing global context, emphasizing the overall structure and relationships within images, which is critical for identifying patterns and features corrupted in degraded images. The decoder, meanwhile, specializes in recovering the fine local details and textures necessary for high-fidelity restoration. Therefore, rather than applying SA and TA alternately, as illustrated in Figure (a), the STAF module fuses their features, as shown in Figure (b), assigning them different weights in the encoder and decoder stages.
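
The sketch below illustrates one plausible reading of this fusion, under our assumptions (a scalar learned blending weight, initialized differently for encoder and decoder blocks; the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STAF(nn.Module):
    # Sketch: fuse spatial self-attention (pixels as tokens) with transposed
    # self-attention (channels as tokens) using a learned blending weight.
    def __init__(self, dim, num_heads=4, is_encoder=True):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.qkv = nn.Linear(dim, 3 * dim)  # projections for transposed attention
        # Assumed initialization: encoder favors global context (SA),
        # decoder favors local detail recovery (TA).
        self.alpha = nn.Parameter(torch.tensor(0.7 if is_encoder else 0.3))

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        sa, _ = self.spatial(tokens, tokens, tokens)  # spatial self-attention
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = F.softmax(q.transpose(1, 2) @ k / (H * W) ** 0.5, dim=-1)  # (B, C, C)
        ta = (attn @ v.transpose(1, 2)).transpose(1, 2)  # transposed attention
        fused = self.alpha * sa + (1 - self.alpha) * ta
        return fused.transpose(1, 2).reshape(B, C, H, W)
```
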

Quantitative Results

Image Aberration Correction

Method                  Non-uniform Aberration        Uniform Aberration, θ = 0°    Uniform Aberration, θ = 20°
                        PSNR↑   SSIM↑    LPIPS↓       PSNR↑   SSIM↑    LPIPS↓       PSNR↑   SSIM↑    LPIPS↓
Wiener deconvolution    25.54   0.5743   0.5228       27.06   0.6561   0.4458       26.05   0.6332   0.5030
Eboli et al.            15.19   0.3784   0.8270       20.73   0.5359   0.6488       18.55   0.4480   0.7973
DeblurGANv2             24.08   0.6863   0.3233       25.27   0.7338   0.3032       21.79   0.6048   0.4810
DWDN                    26.40   0.7656   0.2854       29.38   0.8318   0.2459       25.68   0.7282   0.3267
Tseng et al.            28.66   0.8045   0.2949       29.88   0.8373   0.2507       28.09   0.7841   0.3117
Restormer               27.84   0.8091   0.2753       31.09   0.8827   0.1804       28.53   0.8020   0.2871
Restormer + SWF         29.87   0.8289   0.2812       33.58   0.8804   0.2003       28.95   0.7908   0.3262
Ours                    34.29   0.8760   0.2052       35.06   0.8961   0.1763       32.32   0.8409   0.2542
* SWF refers to "Single Wiener Filter".

Video Aberration Correction

Method          DVD
                PSNR↑   SSIM↑    LPIPS↓
VRT             23.23   0.6906   0.3921
VRT w/ Ours     28.89   0.8602   0.2102

3D Reconstruction with Aberrated Images

Method          LLFF                          Tanks&Temples                 Mip-NeRF360
                PSNR↑   SSIM↑    LPIPS↓       PSNR↑   SSIM↑    LPIPS↓       PSNR↑   SSIM↑    LPIPS↓
3D-GS           17.43   0.4083   0.7428       16.81   0.3875   0.7263       19.58   0.3857   0.8166
3D-GS + Ours    25.29   0.7513   0.2138       23.17   0.6852   0.2639       25.94   0.6676   0.3693

To evaluate aberration correction across the image, video, and 3D reconstruction domains, we constructed aberrated datasets by applying point spread functions to existing clean datasets. The tables above present the quantitative results of the compared methods for each task, demonstrating the superior performance of our model. For the video and 3D reconstruction tasks, only non-uniform aberration was applied.
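
For the uniform case, this synthesis amounts to a per-channel convolution of a clean image with the measured PSF, as in the short sketch below (the non-uniform case instead varies the PSF across the field of view; the function name is ours):

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_uniform_aberration(clean, psfs):
    # clean: (H, W, 3) image in [0, 1]; psfs: list of three (h, w) PSFs.
    aberrated = np.stack(
        [fftconvolve(clean[..., c], psfs[c], mode="same") for c in range(3)],
        axis=-1,
    )
    return np.clip(aberrated, 0.0, 1.0)
```
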

BibTeX

@article{lee2024aberration,
  title={Aberration Correcting Vision Transformers for High-Fidelity Metalens Imaging},
  author={Lee, Byeonghyeon and Kim, Youbin and Jo, Yongjae and Kim, Hyunsu and Park, Hyemi and Kim, Yangkyu and Mandal, Debabrata and Chakravarthula, Praneeth and Kim, Inki and Park, Eunbyung},
  journal={arXiv preprint arXiv:2412.04591},
  year={2024}
}