
U-Net: Convolutional Networks for Biomedical Image Segmentation

by 잼민타치 2024. 10. 30.

 

 This paper introduces a new neural network architecture called U-Net, designed for biomedical image segmentation. The model uses a Contracting Path to capture context and a symmetric Expansive Path to enable precise localization, and it preserves high-resolution information through skip connections, achieving superior pixel-level segmentation compared to earlier models. The architecture also supports end-to-end training from very few annotated images.


1. U-Net Architecture

The U-Net has a U-shaped structure, consisting of 1) a Contracting Path, 2) a Bottleneck, and 3) an Expansive Path. In the architecture diagram from the paper, each blue box represents a multi-channel feature map: the number on top of the box is the number of channels, and the number at the lower edge is the feature map's spatial size.

 

(Since the input is a grayscale medical image, it has a single channel.)

Let's examine each component in more detail:

1) Contracting Path

 

  • 3x3 Convolution layer

 In the Contracting Path, a 3x3 convolution operation is performed twice in succession, generating a feature map from the input image. This operation is conducted with a stride of 1 and without padding, causing the feature map to progressively shrink. For example, with an input image of 572x572, the first convolution reduces the size to 570x570, and the second to 568x568. After each convolution operation, a ReLU activation function is applied.

 

 The ReLU function not only introduces non-linearity, enabling the model to learn complex patterns, but also mitigates the vanishing gradient problem by maintaining a gradient of 1 for positive inputs, allowing for effective training even in deeper networks.
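
To make the size arithmetic concrete, here is a minimal PyTorch sketch of this double-convolution block. The paper's original implementation used Caffe, so this is purely illustrative, and the module name DoubleConv is mine:

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two unpadded 3x3 convolutions (stride 1), each followed by ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=0),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

x = torch.randn(1, 1, 572, 572)       # one-channel 572x572 input tile
print(DoubleConv(1, 64)(x).shape)     # torch.Size([1, 64, 568, 568])
```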

  • 2x2 Max Pooling Layer

 Following the two 3x3 convolutions, a 2x2 Max Pooling layer with a stride of 2 is applied. This halves the spatial size of the feature map, keeping only the strongest activations, which reduces complexity while enlarging the effective receptive field. Pooling itself leaves the channel count unchanged; it is the first 3x3 convolution of the next stage that doubles it, so a 64-channel feature map becomes 128 channels after the following convolutions, letting the model represent more features at each deeper level.
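
Continuing the illustrative sketch above, one downsampling step pairs the pooling with the next stage's convolutions, which do the channel doubling:

```python
pool = nn.MaxPool2d(kernel_size=2, stride=2)
down = DoubleConv(64, 128)            # convolutions after pooling double the channels

f = torch.randn(1, 64, 568, 568)
p = pool(f)                           # -> (1, 64, 284, 284): spatial size halved
print(down(p).shape)                  # -> torch.Size([1, 128, 280, 280])
```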

2) Bottleneck

 In the bottleneck section, two 3x3 convolutions are performed. This is the most compressed part of the network, and ReLU activation is applied after each convolution. This compressed representation serves as a foundation for high-resolution reconstruction in the Expansive Path.

3) Expansive Path

  • 2x2 Up-convolution Layer, 3x3 Conv Layer

 The feature map, which has been downsampled to a low resolution, is upsampled back to twice its spatial size to restore fine detail. In the paper this is a 2x2 "up-convolution" (a transposed convolution), which also halves the channel count. As in the Contracting Path and the bottleneck, two 3x3 convolutions with ReLU follow. The Expansive Path also includes one additional structure, described next.
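
In PyTorch terms, the up-convolution can be sketched with a transposed convolution. Continuing the illustrative code above, the 28x28 bottleneck map from the paper's figure doubles to 56x56 while its channels halve:

```python
up = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)

b = torch.randn(1, 1024, 28, 28)      # bottleneck output in the paper's figure
print(up(b).shape)                    # -> torch.Size([1, 512, 56, 56])
```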

*Skip Architecture*

 

 The feature map created at each step of the Contracting Path is concatenated with the corresponding feature map at each stage of the Expansive Path. This allows for more accurate feature recovery in the Expansive Path, enabling precise segmentation even along boundaries. By utilizing the high-resolution information from the Contracting Path and the context patterns learned in the bottleneck, the model reflects both global structures and fine-grained boundary details.

 

 

 Using skip connections allows the Contracting Path’s feature maps to be reused in the Expansive Path, promoting efficient gradient flow and faster learning. These connections improve training stability and mitigate vanishing gradient problems even in deep layers.

 

 

 However, because the 3x3 convolutions use no padding, the Contracting and Expansive Path feature maps have different sizes, making direct concatenation impossible. The "Copy and Crop" step resolves this by cropping the center of the Contracting Path's feature map to match the Expansive Path's feature map size before concatenation.

At the first decoder stage in the paper's figure, for example, concatenation doubles the channel count (512 up-convolved channels plus 512 cropped channels give 1024) while the spatial size stays 56x56 in the Expansive Path.
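
A minimal sketch of this Copy and Crop step at that stage (center_crop is an illustrative helper of mine, not code from the paper):

```python
def center_crop(enc, target_hw):
    """Crop the center of an encoder feature map to the decoder map's size."""
    _, _, h, w = enc.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return enc[:, :, top:top + th, left:left + tw]

dec = torch.randn(1, 512, 56, 56)     # up-convolved decoder feature map
enc = torch.randn(1, 512, 64, 64)     # corresponding encoder feature map
merged = torch.cat([center_crop(enc, (56, 56)), dec], dim=1)
print(merged.shape)                   # -> torch.Size([1, 1024, 56, 56])
```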

 At the end of the Expansive Path, a 1x1 convolution maps the final feature map to one score per class for every pixel, determining the class each pixel belongs to. In a two-class setting this produces a binary map, for instance distinguishing tumor from non-tumor areas in tumor segmentation.
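
A sketch of this classification head, with the two-class output matching the paper's binary setting and the 388x388 size matching its final feature map:

```python
head = nn.Conv2d(64, 2, kernel_size=1)   # 64 channels -> 2 class scores per pixel

feat = torch.randn(1, 64, 388, 388)      # final decoder feature map
logits = head(feat)                      # -> (1, 2, 388, 388)
labels = logits.argmax(dim=1)            # -> (1, 388, 388) per-pixel class map
```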

2. Training

(In the overlap-tile figure, the blue input patch produces the yellow segmentation region, and the red input patch produces the green region.)

 

 In medical and scientific imaging, high-resolution images are often too large to process in a single pass, so they are split into patches. Naive tiling, however, cuts off context at patch borders, causing edge artifacts and inaccurate segmentation there. To address this, an overlap-tile strategy is used: the input patches overlap, while the predicted segmentation regions tile the image without overlapping.

 Instead of padding patch borders with zeros or arbitrary values, the missing context at the image border is extrapolated by mirroring the edge pixels, so all regions, including those at the image boundary, are processed with realistic surrounding context and segmented reliably.
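
This mirroring can be sketched with reflection padding; the 92-pixel margin is exactly the extra context a 572x572 input provides around a 388x388 output:

```python
import torch.nn.functional as F

tile = torch.randn(1, 1, 388, 388)       # region whose segmentation we want
inp = F.pad(tile, (92, 92, 92, 92), mode="reflect")
print(inp.shape)                         # -> torch.Size([1, 1, 572, 572])
```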

 To favor large input tiles over a large batch size, the batch was reduced to a single image. Because a single-image batch gives noisy gradient estimates, a high momentum (0.99) was used so that a large number of previously seen samples influence each update, keeping training stable.
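
In PyTorch this training setup is a one-liner; the learning rate here is illustrative, since the paper only specifies SGD with momentum 0.99:

```python
import torch.optim as optim

model = DoubleConv(1, 64)            # stand-in for a full U-Net model
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.99)
```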

 Separating touching objects of the same class, such as adjacent cells, is critical during segmentation; without special handling, two touching cells may be segmented as a single mass. U-Net addresses this with a pixel-wise loss weight map that emphasizes boundary regions: the narrow background gaps between touching cells receive large weights, forcing the model to learn the separating borders.

 The paper's weight formula makes this concrete. Here, d1(x) and d2(x) are the distances from pixel x to the border of the nearest and the second-nearest cell, respectively; w0 is a constant controlling the strength of the boundary weighting, and σ controls how quickly that extra weight decays away from the borders. wc(x) is a base weight map that corrects class imbalance by weighting pixels according to the frequency of each class (e.g., background or cell), promoting balanced learning.
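
For reference, this is Eq. 2 in the paper; the experiments set w0 = 10 and σ ≈ 5 pixels:

w(x) = wc(x) + w0 · exp( -(d1(x) + d2(x))² / (2σ²) )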

 After computing class probabilities Pk(x) for each pixel with a softmax, the model takes the probability assigned to the true label l(x) at each pixel, Pl(x)(x), and computes a weighted cross-entropy loss over all pixel positions. (Ω represents all pixels in the image.)
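
This is Eq. 1 in the paper, using the weight map w(x) from above:

E = Σx∈Ω w(x) · log( Pl(x)(x) )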

 To address the limited data in medical imaging tasks, data augmentation techniques such as shift, rotation, gray value variations (random intensity changes), and elastic deformation (smooth random distortion of the pixel grid) are applied. The paper notes that random elastic deformations are the key augmentation for microscopy images, since they mimic realistic tissue deformation.
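
A common way to implement elastic deformation is sketched below. Note that the paper samples displacements on a coarse 3x3 grid with roughly 10-pixel standard deviation and interpolates bicubically; this per-pixel variant, with illustrative alpha and sigma values, is a popular approximation rather than the paper's exact method:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=34.0, sigma=4.0, rng=None):
    """Shift pixel coordinates by a smoothed random field, then resample."""
    rng = rng or np.random.default_rng()
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [ys + dy, xs + dx], order=1, mode="reflect")
```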

3. Experiments

  • Segmentation of Neuronal Structures in Electron Microscopy (EM) Stacks (ISBI 2012)

 

U-Net achieved lower error rates than the previous best sliding-window convolutional network, reaching a warping error of 0.000353 and a Rand error of 0.0382, even without additional pre- or post-processing.

  • ISBI Cell Tracking Challenge (2014, 2015)

On the first dataset (PhC-U373), U-Net achieved 92% IoU versus 83% for the second-best algorithm; on the second dataset (DIC-HeLa), it achieved 77.5% IoU versus 46% for the next best. This demonstrates U-Net's high accuracy without any post-processing and its strong end-to-end performance on difficult microscopy images.

4. Limitations of U-Net and Improved Models

While U-Net has many strengths, it also has some limitations:

  1. Difficulty distinguishing complex boundaries: weighting boundary pixels helps, but U-Net still struggles when object boundaries are highly complex or densely packed.
  2. Limited gains from large datasets: U-Net excels in small-data regimes, but its relatively small convolutional capacity limits how well it exploits very large datasets.
  3. Limited global pattern learning: as a convolution- and pooling-based model, U-Net learns local patterns effectively but is limited in capturing global, long-range patterns.

 

To address these limitations, several improved models have emerged, including Uformer.

Uformer: A General U-Shaped Transformer for Image Restoration (CVPR 2022)

 

 The Uformer model follows U-Net's U-shaped encoder-decoder structure but extracts features with Transformer blocks instead of plain convolutions. By leveraging the self-attention mechanism of Transformers, Uformer can learn global context, making it effective on complex structures and patterns and better suited to generalizing across the diverse patterns in large datasets. Transformers tend to improve as data volume grows, performing strongly on complex and scale-variable data.

On the image restoration benchmarks reported in its paper, Uformer generally outperforms U-Net-style convolutional baselines.

 

However, Uformer also has some limitations:

  1. High data requirements: self-attention models each position's relationship with many others, so a large dataset is needed to learn meaningful relationships and patterns. With limited data, the model may overfit to specific patterns and generalize poorly.
  2. High computational complexity: the larger Uformer variants reach roughly 50M parameters (Uformer-B), several times more than compact U-Net implementations, which significantly slows training and increases cost.

5. References

U-Net: Convolutional Networks for Biomedical Image Segmentation. Ronneberger, Fischer, and Brox, MICCAI 2015. arxiv.org

Uformer: A General U-Shaped Transformer for Image Restoration. Wang et al., CVPR 2022. arxiv.org