# Image Preprocessing and Enhancement

> The resizing, grayscale conversion, denoising, and normalization steps an image goes through before it is fed to the model.

- Repository: sml2h3/ddddocr
- GitHub: https://github.com/sml2h3/ddddocr
- Human wiki: https://grok-wiki.com/public/wiki/sml2h3-ddddocr-a34dd45d9f63
- Complete Markdown: https://grok-wiki.com/public/wiki/sml2h3-ddddocr-a34dd45d9f63/llms-full.txt

## Source Files

- `ddddocr/preprocessing/image_processor.py`
- `ddddocr/preprocessing/color_filter.py`
- `ddddocr/utils/image_io.py`
- `ddddocr/utils/validators.py`

---

<details>
<summary>Related source files</summary>
The following files were used to generate this wiki page:
- [ddddocr/preprocessing/image_processor.py](ddddocr/preprocessing/image_processor.py)
- [ddddocr/preprocessing/color_filter.py](ddddocr/preprocessing/color_filter.py)
- [ddddocr/utils/image_io.py](ddddocr/utils/image_io.py)
- [ddddocr/utils/validators.py](ddddocr/utils/validators.py)
- [ddddocr/core/ocr_engine.py](ddddocr/core/ocr_engine.py)
- [ddddocr/core/detection_engine.py](ddddocr/core/detection_engine.py)
- [ddddocr/core/slide_engine.py](ddddocr/core/slide_engine.py)
</details>


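ddddocr is a CAPTCHA recognition library built on ONNX Runtime. Before an image reaches model inference, it passes through a preprocessing pipeline: image loading and format normalization, color filtering, resizing, grayscale conversion, and pixel-value normalization. Denoising and contrast enhancement are also available as separate `ImageProcessor` tools. Whatever form the caller passes in (bytes, base64, a file path, a PIL Image, or a numpy array), these steps ensure that the tensor reaching the model has a consistent shape and value range.

This page traces the preprocessing path taken by each of the three engines (OCR, object detection, and slider matching) and explains the reasoning behind each step.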

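## Image Loading and Format Normalization

ddddocr accepts image input in several formats. The `load_image_from_input()` function converts all of them into a PIL Image object:

| Input type | Handling |
|---------|---------|
| `bytes` | Loaded into a PIL Image through `io.BytesIO` |
| `str` (file path) | Opened directly with `Image.open()` |
| `str` (base64) | Base64-decoded first, then loaded |
| `pathlib.PurePath` | `Image.open()` |
| `Image.Image` | Copied, so the caller's original image is never modified |
| `np.ndarray` | Converted according to shape and dtype (grayscale/RGB/RGBA; float → uint8) |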

Sources: [ddddocr/utils/image_io.py:82-119](ddddocr/utils/image_io.py)
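The dispatch in the table can be sketched as below. This is a minimal illustration, not ddddocr's `load_image_from_input()`: the name `load_image_sketch`, the rule that a `str` naming an existing file is a path while anything else is treated as base64, and the rule that float arrays with values in `[0, 1]` are multiplied by 255 are all assumptions.

```python
import base64
import io
import os
import pathlib

import numpy as np
from PIL import Image


def load_image_sketch(data) -> Image.Image:
    """Hypothetical loader: normalize several input types to a PIL Image."""
    if isinstance(data, bytes):
        return Image.open(io.BytesIO(data))
    if isinstance(data, pathlib.PurePath):
        return Image.open(os.fspath(data))
    if isinstance(data, str):
        # Assumption: an existing file path wins; anything else is base64
        if os.path.exists(data):
            return Image.open(data)
        return Image.open(io.BytesIO(base64.b64decode(data)))
    if isinstance(data, Image.Image):
        return data.copy()  # never modify the caller's image
    if isinstance(data, np.ndarray):
        arr = data
        if np.issubdtype(arr.dtype, np.floating) and arr.max() <= 1.0:
            arr = arr * 255.0  # assumption: floats in [0, 1] are rescaled
        arr = np.clip(arr, 0, 255).astype(np.uint8)
        if arr.ndim == 2 or (arr.ndim == 3 and arr.shape[2] in (3, 4)):
            # uint8 (H, W) / (H, W, 3) / (H, W, 4) become L / RGB / RGBA
            return Image.fromarray(arr)
        raise ValueError(f"unsupported array shape {arr.shape}")
    raise TypeError(f"unsupported input type: {type(data).__name__}")


buf = io.BytesIO()
Image.new("RGB", (4, 3), (10, 20, 30)).save(buf, format="PNG")
print(load_image_sketch(base64.b64encode(buf.getvalue()).decode()).size)  # (4, 3)
```

Whatever the branch, the caller gets back a PIL Image, which is what lets every later step assume a single input type.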

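## PNG Transparent Background Handling

CAPTCHA images are sometimes RGBA PNGs with transparent backgrounds. Despite the `black` in its name, `png_rgba_black_preprocess()` fills the transparent areas with white (RGB 255, 255, 255) and returns an RGB image. In the OCR preprocessing pipeline this step is controlled by the `png_fix` parameter and runs only when the caller explicitly enables it.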

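```python
# ddddocr/utils/image_io.py:59-79
from PIL import Image


def png_rgba_black_preprocess(img: Image.Image) -> Image.Image:
    width = img.width
    height = img.height
    # Opaque white RGB canvas of the same size
    image = Image.new('RGB', size=(width, height), color=(255, 255, 255))
    # img doubles as the mask: for RGBA input paste() uses its alpha band,
    # so fully transparent pixels keep the white background
    image.paste(img, (0, 0), mask=img)
    return image
```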

Sources: [ddddocr/utils/image_io.py:59-79](ddddocr/utils/image_io.py)
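A quick check of why this step matters, using a tiny hand-built RGBA image (an illustrative input, not one from ddddocr's tests). A plain `convert('RGB')` simply drops the alpha channel, so a transparent pixel stored as black stays black. Pasting through the alpha mask onto white turns it white instead. The helper `fill_transparent_white` is a hypothetical name that repeats the logic of the excerpt above.

```python
from PIL import Image


def fill_transparent_white(img: Image.Image) -> Image.Image:
    # Same idea as the excerpt above: white canvas + alpha-masked paste
    canvas = Image.new("RGB", img.size, (255, 255, 255))
    canvas.paste(img, (0, 0), mask=img)
    return canvas


# 2x1 RGBA image: pixel 0 is fully transparent black, pixel 1 is opaque red
rgba = Image.new("RGBA", (2, 1))
rgba.putpixel((0, 0), (0, 0, 0, 0))
rgba.putpixel((1, 0), (255, 0, 0, 255))

naive = rgba.convert("RGB")           # alpha discarded
fixed = fill_transparent_white(rgba)  # alpha used as the paste mask

print(naive.getpixel((0, 0)), fixed.getpixel((0, 0)))  # (0, 0, 0) (255, 255, 255)
print(naive.getpixel((1, 0)), fixed.getpixel((1, 0)))  # (255, 0, 0) (255, 0, 0)
```

For a dark-text CAPTCHA this matters: with the naive conversion the background would turn as dark as the characters.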

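## Color Filtering (HSV)

The `ColorFilter` class filters colors in the HSV color space. The caller can name preset colors (such as `red` or `blue`) or supply custom HSV ranges. The system builds a mask, keeps the matching pixels, and sets everything else to a white background.

The built-in presets cover 10 common colors. Because hue wraps around the HSV color wheel, red sits at both ends of the hue axis and needs two ranges (0–10 and 170–180).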

```python
# ddddocr/preprocessing/color_filter.py:23-34
COLOR_PRESETS = {
    'red': [((0, 50, 50), (10, 255, 255)), ((170, 50, 50), (180, 255, 255))],
    'blue': [((100, 50, 50), (130, 255, 255))],
    'green': [((40, 50, 50), (80, 255, 255))],
    # ... 10 presets in total
}
```

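Color filtering runs as an optional step inside the OCR engine's `predict()` method, after the image is loaded and before preprocessing. If filtering raises an exception, the system prints a warning and skips the step, so recognition is not interrupted.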

Sources: [ddddocr/preprocessing/color_filter.py:19-67](ddddocr/preprocessing/color_filter.py), [ddddocr/core/ocr_engine.py:133-139](ddddocr/core/ocr_engine.py)
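The filtering idea can be sketched with numpy alone. Assumptions: the preset values use OpenCV's 8-bit HSV scale (H 0–180, S/V 0–255), which the preset numbers suggest. The RGB-to-HSV conversion is written out by hand instead of calling `cv2.cvtColor`, so the sketch has no OpenCV dependency. Only two presets are copied from the excerpt, and the names `rgb_to_hsv_cv` and `filter_colors` are hypothetical.

```python
import numpy as np

# Two of the presets from the excerpt above (OpenCV-style scale)
PRESETS = {
    "red": [((0, 50, 50), (10, 255, 255)), ((170, 50, 50), (180, 255, 255))],
    "blue": [((100, 50, 50), (130, 255, 255))],
}


def rgb_to_hsv_cv(rgb: np.ndarray) -> np.ndarray:
    """RGB uint8 (H, W, 3) -> float HSV with H in [0, 180), S and V in [0, 255]."""
    x = rgb.astype(np.float64)
    r, g, b = x[..., 0], x[..., 1], x[..., 2]
    v = x.max(axis=-1)
    c = v - x.min(axis=-1)                        # chroma
    s = np.where(v > 0, c / np.maximum(v, 1) * 255, 0)
    safe_c = np.maximum(c, 1e-12)                 # avoid division by zero
    hue_sector = np.select(
        [c == 0, v == r, v == g],
        [np.zeros_like(v), ((g - b) / safe_c) % 6, (b - r) / safe_c + 2],
        default=(r - g) / safe_c + 4,
    )
    h = hue_sector * 30                           # 60 degrees per sector, halved
    return np.stack([h, s, v], axis=-1)


def filter_colors(rgb: np.ndarray, colors) -> np.ndarray:
    """Keep pixels matching any listed preset; paint everything else white."""
    hsv = rgb_to_hsv_cv(rgb)
    mask = np.zeros(rgb.shape[:2], dtype=bool)
    for name in colors:
        for lower, upper in PRESETS[name]:
            mask |= np.all((hsv >= lower) & (hsv <= upper), axis=-1)
    out = np.full_like(rgb, 255)
    out[mask] = rgb[mask]
    return out


strip = np.array([[[200, 10, 10], [200, 10, 40], [10, 10, 200], [128, 128, 128]]],
                 dtype=np.uint8)
print(filter_colors(strip, ["red"]).tolist())
```

On this 1×4 strip both reds survive, including `(200, 10, 40)`, whose hue wraps to about 175 and is caught only by the second red range. The blue and gray pixels turn white; gray fails because its saturation is 0.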

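## OCR Engine Preprocessing Pipeline

The OCR engine's `_preprocess_image()` method is the main path an image takes before it enters the model. The pipeline runs the following steps in order: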

### 1. Resizing

The target height is fixed at 64 pixels, and the width is computed from the original aspect ratio:

```python
# ddddocr/core/ocr_engine.py:178-180
target_height = 64
target_width = int(image.size[0] * (target_height / image.size[1]))
image = ImageProcessor.resize_image(image, (target_width, target_height))
```

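`resize_image()` uses PIL's `LANCZOS` resampling filter, which gives high-quality results when downscaling. The width is not fixed but scaled proportionally, so CAPTCHAs with different aspect ratios reach the model without their characters being stretched or squashed.

For custom models, the size comes from the `image` field in the model configuration file. If the configured width is `-1`, the image is scaled proportionally to the configured height; otherwise the fixed `(width, height)` is used. When the model runs in `word` (single-word recognition) mode, the output is square.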
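The sizing rules above fit in a few lines. This is a sketch under assumptions, not ddddocr's confirmed code. The function name `ocr_target_size` is hypothetical, `image_cfg` is taken to be `[width, height]`, and in `word` mode the configured height is assumed to be the side of the square and to take precedence over the `-1` rule.

```python
from PIL import Image


def ocr_target_size(width, height, image_cfg=None, word=False):
    """Return (target_width, target_height) for the OCR model input."""
    if image_cfg is None:
        # Default model: fixed height 64, width follows the aspect ratio
        target_h = 64
        return int(width * (target_h / height)), target_h
    cfg_w, cfg_h = image_cfg
    if word:
        return cfg_h, cfg_h                      # assumed: square of side cfg_h
    if cfg_w == -1:
        return int(width * (cfg_h / height)), cfg_h
    return cfg_w, cfg_h


img = Image.new("RGB", (200, 50), "white")       # a 200x50 dummy CAPTCHA
size = ocr_target_size(*img.size)
resized = img.resize(size, Image.LANCZOS)
print(size, resized.size)                        # (256, 64) (256, 64)
```

A 200×50 image keeps its 4:1 ratio as 256×64, while a fixed `[100, 40]` configuration would force 100×40 regardless of the input shape.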

### 2. Grayscale Conversion

The default model expects single-channel input (`channel=1`), so the image is converted to grayscale:

```python
image = ImageProcessor.convert_to_grayscale(image)
```

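The conversion uses PIL's `convert('L')`, which merges the three RGB channels into one through a weighted sum. For a custom model whose `channel` is not 1, this step is skipped and the color channels are kept.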
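Pillow documents the weights of that sum as the ITU-R 601-2 luma transform, `L = R * 299/1000 + G * 587/1000 + B * 114/1000`. The sketch below compares the formula with `convert('L')` on a few pixels. It allows a difference of 1, since Pillow computes the result with integer arithmetic internally and its rounding may differ from the floating-point formula.

```python
import numpy as np
from PIL import Image

pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (100, 150, 200)]
img = Image.new("RGB", (len(pixels), 1))
img.putdata(pixels)

pil_gray = np.array(img.convert("L"), dtype=np.int32)[0]
weights = np.array([299, 587, 114]) / 1000
formula = np.array(pixels) @ weights     # ITU-R 601-2 luma

for p, a, b in zip(pixels, pil_gray, formula):
    print(p, int(a), round(float(b), 2))
```

Green contributes most to perceived brightness, so a pure green stroke ends up about five times brighter than a pure blue one after conversion.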

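### 3. Pixel-Value Normalization

After grayscale conversion, pixel values are converted from the integer range `[0, 255]` to the floating-point range `[0.0, 1.0]`: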

```python
# ddddocr/core/ocr_engine.py:199-202
img_array = np.array(image).astype(np.float32)
img_array = img_array / 255.0
```

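### 4. Dimension Reordering

The last step reshapes the array to match the ONNX model's input layout:

- **Grayscale** (2D array): `expand_dims(axis=0)` adds a channel dimension, giving `(1, H, W)`
- **Color** (3D HWC array): `transpose(2, 0, 1)` converts it to CHW, giving `(3, H, W)`
- Finally, `expand_dims(axis=0)` adds the batch dimension in both cases, giving `(1, C, H, W)`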
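Steps 3 and 4 together can be sketched with numpy as follows. The function name `to_model_input` is hypothetical. It reproduces only the `/255.0` scaling and the axis handling described above, assuming an 8-bit grayscale `(H, W)` or RGB `(H, W, 3)` array.

```python
import numpy as np


def to_model_input(arr: np.ndarray) -> np.ndarray:
    """uint8 (H, W) or (H, W, 3) -> float32 (1, C, H, W) in [0, 1]."""
    x = arr.astype(np.float32) / 255.0
    if x.ndim == 2:
        x = np.expand_dims(x, axis=0)       # (H, W) -> (1, H, W)
    else:
        x = x.transpose(2, 0, 1)            # (H, W, C) -> (C, H, W)
    return np.expand_dims(x, axis=0)        # add batch -> (1, C, H, W)


gray = np.zeros((64, 256), dtype=np.uint8)
color = np.full((64, 256, 3), 255, dtype=np.uint8)
print(to_model_input(gray).shape, to_model_input(color).shape)
# (1, 1, 64, 256) (1, 3, 64, 256)
```

Note that `transpose` moves whole channel planes rather than reshaping: the RGB values of pixel `(y, x)` end up at `[0, :, y, x]`.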

```
Original image (PIL)
    │
    ▼
┌──────────────────────┐
│ PNG alpha handling   │  ← when png_fix=True
│ (RGBA → RGB)         │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Color filter (opt.)  │  ← HSV mask filtering
│ ColorFilter          │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Resize               │  ← height 64, width proportional
│ LANCZOS resampling   │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Grayscale            │  ← when channel=1
│ RGB → L              │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Normalize pixels     │  ← float32, /255.0
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Reorder dimensions   │  ← (H,W) → (1,1,H,W)
│ add batch + channel  │
└──────────┬───────────┘
           ▼
    ONNX model inference
```

Sources: [ddddocr/core/ocr_engine.py:159-215](ddddocr/core/ocr_engine.py)

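## ImageProcessor Enhancement Tools

Besides the steps the pipeline calls automatically, the `ImageProcessor` class provides the following enhancement methods, which can be used on their own:

| Method | Function | Default parameters |
|------|------|---------|
| `enhance_contrast()` | Contrast enhancement (PIL `ImageEnhance`) | `factor=1.5` |
| `enhance_sharpness()` | Sharpness enhancement (PIL `ImageEnhance`) | `factor=1.5` |
| `remove_noise()` | Median-filter denoising (OpenCV) | `kernel_size=3` |
| `binarize_image()` | Binarization (`simple`/`otsu`/`adaptive`) | `threshold=128` |
| `normalize_image()` | Z-score normalization | `mean=0.0, std=1.0` |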

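`preprocess_for_ocr()` is a preset pipeline that chains several of these steps: RGBA handling → proportional scaling to the target height → contrast enhancement (`factor=1.2`) → median-filter denoising → grayscale conversion. It suits cases that need full preprocessing without going through ONNX inference.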
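A rough equivalent of that preset chain, written only with Pillow, looks like this. Assumptions: the name `preprocess_for_ocr_sketch` and its parameters are hypothetical, and RGBA handling reuses the white-background paste from `png_rgba_black_preprocess()`. Pillow's `ImageFilter.MedianFilter(3)` stands in for the OpenCV median filter listed in the table, so pixel values may differ slightly from ddddocr's output.

```python
from PIL import Image, ImageEnhance, ImageFilter


def preprocess_for_ocr_sketch(img: Image.Image, target_height: int = 64,
                              contrast: float = 1.2) -> Image.Image:
    # 1. RGBA -> RGB on a white background (transparent areas become white)
    if img.mode == "RGBA":
        canvas = Image.new("RGB", img.size, (255, 255, 255))
        canvas.paste(img, (0, 0), mask=img)
        img = canvas
    else:
        img = img.convert("RGB")
    # 2. Scale to the target height, keeping the aspect ratio
    w, h = img.size
    img = img.resize((int(w * target_height / h), target_height), Image.LANCZOS)
    # 3. Contrast enhancement (a factor of 1.0 would leave the image unchanged)
    img = ImageEnhance.Contrast(img).enhance(contrast)
    # 4. 3x3 median filter (Pillow stand-in for the OpenCV call)
    img = img.filter(ImageFilter.MedianFilter(3))
    # 5. Grayscale
    return img.convert("L")


src = Image.new("RGBA", (120, 40), (0, 0, 0, 0))   # fully transparent input
out = preprocess_for_ocr_sketch(src)
print(out.mode, out.size)   # L (192, 64)
```

Contrast is boosted before the median filter runs, so faint strokes are strengthened first and isolated noise pixels are then removed from the sharper image.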

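Binarization supports three methods:
- **simple**: a fixed threshold; pixels above the threshold become 255, all others 0
- **otsu**: the optimal threshold is computed automatically (Otsu's method)
- **adaptive**: adaptive thresholding, which uses a different threshold for each region of the image (Gaussian-weighted)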
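The `simple` and `otsu` methods reduce to a few lines of numpy. This sketch computes Otsu's threshold by hand, maximizing the between-class variance over the 256-bin histogram, instead of calling OpenCV. It follows OpenCV's `THRESH_BINARY` convention that pixels strictly above the threshold become 255. The function names are hypothetical, and `adaptive` is left out because it needs a Gaussian-weighted local mean for every pixel.

```python
import numpy as np


def binarize_simple(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Fixed threshold: > threshold -> 255, otherwise 0."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold t (class 0 = pixels <= t) maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # weight of class 0
    mu = np.cumsum(prob * np.arange(256))      # cumulative mean of class 0
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    valid = denom > 0
    sigma_b = np.zeros(256)
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))


# Bimodal test image: dark strokes (50) on a light background (200)
gray = np.full((10, 10), 200, dtype=np.uint8)
gray[3:7, 2:8] = 50
t = otsu_threshold(gray)
binary = binarize_simple(gray, t)
print(t, np.unique(binary))
```

Otsu's method lands between the two modes without any manual tuning, which is why it is useful when CAPTCHA backgrounds vary in brightness from one image to the next.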

Sources: [ddddocr/preprocessing/image_processor.py:124-238](ddddocr/preprocessing/image_processor.py)

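## Object Detection Engine Preprocessing

The object detection engine uses a different preprocessing strategy. `DetectionEngine.preproc()` scales the image by a single factor so that it fits inside a fixed 416×416 canvas while keeping its aspect ratio. The resized image sits in the top-left corner, the leftover area on the right or bottom is filled with the gray value 114, and the result is converted from HWC to CHW: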

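```python
# ddddocr/core/detection_engine.py:89-104 (method of DetectionEngine)
import cv2
import numpy as np


def preproc(self, img, input_size, swap=(2, 0, 1)):
    # (H, W, 3) canvas pre-filled with the padding value 114
    padded_img = np.ones((input_size[0], input_size[1], 3), dtype=np.uint8) * 114
    # A single scale factor for both axes preserves the aspect ratio
    r = min(input_size[0] / img.shape[0], input_size[1] / img.shape[1])
    resized_img = cv2.resize(img, (int(img.shape[1] * r), int(img.shape[0] * r)),
                             interpolation=cv2.INTER_LINEAR).astype(np.uint8)
    # Place the resized image in the top-left corner; the rest stays 114
    padded_img[: int(img.shape[0] * r), : int(img.shape[1] * r)] = resized_img
    padded_img = padded_img.transpose(swap)  # HWC -> CHW
    # Cast to float32 without dividing by 255
    padded_img = np.ascontiguousarray(padded_img, dtype=np.float32)
    return padded_img, r
```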

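Key differences from the OCR engine:
- OpenCV `INTER_LINEAR` interpolation (favoring speed) instead of LANCZOS
- A fixed square 416×416 input, with the aspect ratio kept by padding, instead of a fixed height with a variable width
- Three color channels are kept; there is no grayscale conversion
- Padding value 114 (a convention of the YOLO model family) rather than 0 or 255
- No `/255.0` normalization; the uint8 values are cast to float32 and fed in directly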

Sources: [ddddocr/core/detection_engine.py:89-104](ddddocr/core/detection_engine.py), [ddddocr/core/detection_engine.py:173-176](ddddocr/core/detection_engine.py)
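To see what the padding does to a concrete input, here is a numpy-only sketch of the same letterbox geometry. It swaps `cv2.resize` for nearest-neighbor index sampling so it runs without OpenCV, and the name `letterbox_sketch` is hypothetical. Only the scale factor `r`, the top-left placement, the fill value 114, and the HWC→CHW transpose follow the excerpt.

```python
import numpy as np


def letterbox_sketch(img: np.ndarray, size=(416, 416), fill=114):
    """HWC uint8 image -> (CHW float32 canvas, scale r), image at top-left."""
    h, w = img.shape[:2]
    r = min(size[0] / h, size[1] / w)           # one scale for both axes
    new_h, new_w = int(h * r), int(w * r)
    # Nearest-neighbor resize (stand-in for cv2.INTER_LINEAR)
    rows = np.minimum((np.arange(new_h) / r).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / r).astype(int), w - 1)
    resized = img[rows[:, None], cols[None, :]]
    canvas = np.full((size[0], size[1], 3), fill, dtype=np.uint8)
    canvas[:new_h, :new_w] = resized
    chw = np.ascontiguousarray(canvas.transpose(2, 0, 1), dtype=np.float32)
    return chw, r


img = np.zeros((100, 200, 3), dtype=np.uint8)   # 200 px wide, 100 px tall
chw, r = letterbox_sketch(img)
print(r, chw.shape)   # 2.08 (3, 416, 416)
```

For this wide 200×100 input, the width limits the scale (416 / 200 = 2.08), so the image fills rows 0–207 and the bottom 208 rows are padding. A tall input would be padded on the right instead.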

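## Slider Matching Engine Preprocessing

The slider matching engine does not use a deep learning model; it relies on classical image processing. The preprocessing path depends on the matching mode:

**Simple template matching**: the images are converted to grayscale and passed directly to `cv2.matchTemplate` (normalized correlation coefficient method).

**Edge-detection matching**: after grayscale conversion, Canny edge detection (thresholds 50/150) is applied, and template matching runs on the edge maps. This method is more robust to lighting changes.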
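Both matching modes end in `cv2.matchTemplate` with the normalized correlation coefficient (`TM_CCOEFF_NORMED`). A brute-force numpy version makes the score explicit. At every offset, it subtracts the mean from the template and from the window, then divides their dot product by the product of their norms. This is a slow illustrative sketch, not ddddocr's code, and the name `match_template_ccoeff_normed` is hypothetical; on real inputs the OpenCV call is the practical choice.

```python
import numpy as np


def match_template_ccoeff_normed(image: np.ndarray, templ: np.ndarray) -> np.ndarray:
    """Score map of shape (H-h+1, W-w+1); 1.0 means a perfect linear match."""
    image = image.astype(np.float64)
    t = templ.astype(np.float64)
    t = t - t.mean()
    t_norm = np.sqrt((t * t).sum())
    h, w = t.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            win = image[y:y + h, x:x + w]
            win = win - win.mean()
            denom = np.sqrt((win * win).sum()) * t_norm
            out[y, x] = (win * t).sum() / denom if denom > 0 else 0.0
    return out


rng = np.random.default_rng(0)
background = rng.integers(0, 256, size=(30, 60)).astype(np.uint8)
template = background[8:18, 25:37].copy()        # cut a 10x12 patch
scores = match_template_ccoeff_normed(background, template)
y, x = np.unravel_index(np.argmax(scores), scores.shape)
print(x, y)   # 25 8
```

Because both the window and the template are mean-subtracted and divided by their norms, a uniform brightness shift or contrast change leaves the score unchanged. The edge-map variant adds further robustness on top of that.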

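**Slider comparison** (with the gap): compute the pixel difference between the two images → convert to grayscale → binarize with a fixed threshold of 30 → denoise with a morphological closing followed by an opening → find contours → take the largest contour as the gap position.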

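```python
# ddddocr/core/slide_engine.py:162-173
import cv2
import numpy as np

# target, background: RGB uint8 arrays of the same size
diff = cv2.absdiff(target, background)                # per-pixel |target - background|
diff_gray = cv2.cvtColor(diff, cv2.COLOR_RGB2GRAY)    # collapse to one channel
_, binary = cv2.threshold(diff_gray, 30, 255, cv2.THRESH_BINARY)  # > 30 -> 255
kernel = np.ones((3, 3), np.uint8)
binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # fill small gaps
binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # remove isolated specks
```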

Sources: [ddddocr/core/slide_engine.py:119-198](ddddocr/core/slide_engine.py)
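The comparison pipeline can be reproduced without OpenCV to make each stage visible. In this sketch, all names are hypothetical. 3×3 binary dilation and erosion are implemented with shifted windows, a breadth-first connected-component search stands in for `cv2.findContours`, and the gap position is returned as the bounding box `(x, y, w, h)` of the largest component. The threshold 30 and the close-then-open order follow the excerpt above; the grayscale weights are the usual 0.299/0.587/0.114.

```python
from collections import deque

import numpy as np


def _morph(mask, reduce_fn, pad_value):
    """3x3 binary dilation (logical_or) or erosion (logical_and)."""
    p = np.pad(mask, 1, constant_values=pad_value)
    h, w = mask.shape
    windows = [p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return reduce_fn.reduce(windows)


def dilate(m):
    return _morph(m, np.logical_or, False)


def erode(m):
    return _morph(m, np.logical_and, True)


def find_gap(target: np.ndarray, background: np.ndarray, thresh: int = 30):
    """Return (x, y, w, h) of the largest differing region, or None."""
    diff = np.abs(target.astype(np.int16) - background.astype(np.int16))
    gray = diff @ np.array([0.299, 0.587, 0.114])       # RGB -> gray
    mask = gray > thresh                                 # binarize
    mask = erode(dilate(mask))                           # closing: fill holes
    mask = dilate(erode(mask))                           # opening: drop specks
    seen = np.zeros_like(mask)
    best = None
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        queue, pts = deque([(sy, sx)]), []
        seen[sy, sx] = True
        while queue:                                     # 8-connected BFS
            y, x = queue.popleft()
            pts.append((y, x))
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
        if best is None or len(pts) > len(best):
            best = pts
    if best is None:
        return None
    ys, xs = zip(*best)
    x0, y0 = int(min(xs)), int(min(ys))
    return x0, y0, int(max(xs)) - x0 + 1, int(max(ys)) - y0 + 1


background = np.zeros((40, 60, 3), dtype=np.uint8)
target = background.copy()
target[10:20, 30:42] = 200      # the gap region
target[2, 2] = 200              # single-pixel noise
print(find_gap(target, background))   # (30, 10, 12, 10)
```

The single-pixel speck at (2, 2) survives the closing but is removed by the opening, so only the 12×10 rectangle remains when the largest region is picked.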

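## Input Validation

Every image input passes through `validate_image_input()` before preprocessing, which checks that its type is supported. Color filter parameters have a dedicated validator, `validate_color_filter_params()`, which checks the HSV value ranges (H: 0–180, S/V: 0–255) and that the parameter structure is well-formed.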

Sources: [ddddocr/utils/validators.py:15-31](ddddocr/utils/validators.py), [ddddocr/utils/validators.py:83-137](ddddocr/utils/validators.py)
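A minimal validator for the HSV ranges described above might look like this. The function name `check_hsv_ranges` is hypothetical. The expected structure, a list of `(lower, upper)` pairs with each bound an `(H, S, V)` triple of integers, is an assumption modeled on the `COLOR_PRESETS` excerpt. ddddocr's `validate_color_filter_params()` may check more or different things.

```python
LIMITS = (180, 255, 255)   # max H, S, V (OpenCV-style scale)


def check_hsv_ranges(ranges) -> None:
    """Raise ValueError unless ranges is a non-empty list of in-bounds HSV pairs."""
    if not isinstance(ranges, (list, tuple)) or not ranges:
        raise ValueError("ranges must be a non-empty list of (lower, upper) pairs")
    for pair in ranges:
        if not isinstance(pair, (list, tuple)) or len(pair) != 2:
            raise ValueError(f"expected a (lower, upper) pair, got {pair!r}")
        for bound in pair:
            if not isinstance(bound, (list, tuple)) or len(bound) != 3:
                raise ValueError(f"expected an (H, S, V) triple, got {bound!r}")
            for name, value, limit in zip("HSV", bound, LIMITS):
                if not isinstance(value, int) or not 0 <= value <= limit:
                    raise ValueError(f"{name}={value!r} outside 0-{limit}")


check_hsv_ranges([((0, 50, 50), (10, 255, 255))])        # passes silently
try:
    check_hsv_ranges([((0, 50, 50), (200, 255, 255))])   # H too large
except ValueError as exc:
    print(exc)   # H=200 outside 0-180
```

Checking structure before values means a malformed argument fails with a message about its shape, not a confusing range error.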

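## Summary

ddddocr's image preprocessing is designed around the needs of each of its three engines. The OCR engine wants a fixed height, a flexible width, and single-channel normalized input. The object detection engine wants a fixed square, three-channel input prepared according to YOLO conventions. The slider engine relies entirely on classical image processing (grayscale conversion, edge detection, morphological operations). All paths share the same image loading layer (`load_image_from_input`) and validation layer (`validate_image_input`), which keeps input formats compatible while leaving each engine's preprocessing independent.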