image cleaner

Simon Martens
2025-09-15 18:32:13 +02:00
parent bcf11e4e11
commit 9960dc5e38
11 changed files with 1247 additions and 0 deletions

56
scripts/ex/.gitignore vendored Normal file

@@ -0,0 +1,56 @@
# Python Virtual Environment
venv/
env/
.env
# Python cache files
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
# Generated/processed images
demo_*
cleaned_*
comparison_*
*_cleaned_*
*_comparison_*
# Processing outputs
cleaned/
output/
results/
# Configuration files (may contain sensitive settings)
config.json
*.config.json
custom_*.json
# Temporary files
*.tmp
*.temp
.DS_Store
Thumbs.db
# IDE files
.vscode/
.idea/
*.swp
*.swo
*~
# Logs
*.log
logs/
# Test outputs
test_*
sample_output/
# Large source images (uncomment if you don't want to track originals)
# *.jpg
# *.jpeg
# *.png
# *.tif
# *.tiff

BIN
scripts/ex/1771-09b-02.jpg Normal file

Binary file not shown. Size: 5.2 MiB

BIN
scripts/ex/1772-07b-02.jpg Normal file

Binary file not shown. Size: 5.4 MiB

BIN
scripts/ex/1772-34-136.jpg Normal file

Binary file not shown. Size: 5.1 MiB

211
scripts/ex/README.md Normal file

@@ -0,0 +1,211 @@
# Historical Newspaper Image Cleaning Pipeline
This pipeline automatically cleans and enhances scanned historical newspaper images by reducing noise, improving contrast, and sharpening text for better readability.
## Features
- **Noise Reduction**: Bilateral filtering and non-local means denoising
- **Contrast Enhancement**: CLAHE and gamma correction
- **Background Cleaning**: Morphological operations to remove artifacts
- **Text Sharpening**: Unsharp masking for improved readability
- **Batch Processing**: Process entire directories efficiently
- **Interactive Tuning**: Find optimal parameters for your specific images
- **Before/After Comparisons**: Visual validation of improvements
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Process Single Image
```bash
python image_cleaner.py input_image.jpg -o cleaned_image.jpg --comparison
```
### 3. Batch Process Directory
```bash
python batch_process.py -i newspaper_scans -o cleaned_images
```
### 4. Interactive Parameter Tuning
```bash
python config_tuner.py sample_image.jpg
```
## Usage Examples
### Basic Image Cleaning
```bash
# Clean single image with default settings
python image_cleaner.py 1771-09b-02.jpg
# Clean with specific processing steps
python image_cleaner.py 1771-09b-02.jpg --steps denoise contrast sharpen
# Create before/after comparison
python image_cleaner.py 1771-09b-02.jpg -c
```
### Batch Processing
```bash
# Process all JPG files in current directory
python batch_process.py
# Process specific directory with custom output
python batch_process.py -i scans/ -o cleaned/
# Use custom configuration
python batch_process.py --config custom_config.json
# Skip comparison images for faster processing
python batch_process.py --no-comparisons
```
### Parameter Tuning
```bash
# Start interactive tuning session
python config_tuner.py sample_image.jpg
# Load existing config for fine-tuning
python config_tuner.py sample_image.jpg -c existing_config.json
```
## Configuration
### Default Parameters
The pipeline uses these default parameters optimized for newspaper scans:
```json
{
"bilateral_d": 9,
"bilateral_sigma_color": 75,
"bilateral_sigma_space": 75,
"clahe_clip_limit": 2.0,
"clahe_grid_size": [8, 8],
"gamma": 1.2,
"denoise_h": 10,
"morph_kernel_size": 2,
"unsharp_amount": 1.5,
"unsharp_radius": 1.0,
"unsharp_threshold": 0
}
```
### Parameter Descriptions
- **bilateral_d**: Neighborhood diameter for bilateral filtering (5-15)
- **bilateral_sigma_color**: Color space filter strength (50-150)
- **bilateral_sigma_space**: Coordinate space filter strength (50-150)
- **clahe_clip_limit**: Contrast limiting for CLAHE (1.0-4.0)
- **clahe_grid_size**: CLAHE tile grid size [width, height] (4-16)
- **gamma**: Gamma correction value (0.8-2.0)
- **denoise_h**: Denoising filter strength (5-20)
- **morph_kernel_size**: Morphological operation kernel size (1-5)
- **unsharp_amount**: Unsharp masking strength (0.5-3.0)
- **unsharp_radius**: Unsharp masking radius (0.5-2.0)
- **unsharp_threshold**: Unsharp masking threshold (0-10)
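The ranges above are recommendations; nothing in the scripts shown in this commit enforces them. A minimal sketch of a pre-flight check, assuming a hypothetical helper `out_of_range` and a `RECOMMENDED_RANGES` table copied from the list above (neither is part of the repository):

```python
# Hypothetical helper, not part of the repository: flag values that fall
# outside the ranges recommended in the parameter list.
RECOMMENDED_RANGES = {
    "bilateral_d": (5, 15),
    "bilateral_sigma_color": (50, 150),
    "bilateral_sigma_space": (50, 150),
    "clahe_clip_limit": (1.0, 4.0),
    "gamma": (0.8, 2.0),
    "denoise_h": (5, 20),
    "morph_kernel_size": (1, 5),
    "unsharp_amount": (0.5, 3.0),
    "unsharp_radius": (0.5, 2.0),
    "unsharp_threshold": (0, 10),
}


def out_of_range(config):
    """Return {name: value} for every parameter outside its recommended range."""
    problems = {}
    for name, (low, high) in RECOMMENDED_RANGES.items():
        if name in config and not low <= config[name] <= high:
            problems[name] = config[name]
    # clahe_grid_size is a [width, height] pair; both entries should be in 4-16.
    grid = config.get("clahe_grid_size")
    if grid is not None and not all(4 <= side <= 16 for side in grid):
        problems["clahe_grid_size"] = grid
    return problems


if __name__ == "__main__":
    print(out_of_range({"gamma": 2.5, "denoise_h": 10, "clahe_grid_size": [8, 8]}))
```

Parameters missing from the dictionary are simply skipped, so the check also works on a partial configuration.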
### Creating Custom Configurations
1. Generate default config template:
```bash
python batch_process.py --create-config
```
2. Edit `config.json` with your preferred values
3. Use custom config:
```bash
python batch_process.py --config config.json
```
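A custom file can also be generated rather than edited by hand. The sketch below assumes you only want to override a few values and still write out a complete file; `DEFAULTS` repeats the defaults listed above, and `build_config` is a hypothetical helper, not something the scripts provide:

```python
import json

# Defaults as listed in the "Default Parameters" section.
DEFAULTS = {
    "bilateral_d": 9,
    "bilateral_sigma_color": 75,
    "bilateral_sigma_space": 75,
    "clahe_clip_limit": 2.0,
    "clahe_grid_size": [8, 8],
    "gamma": 1.2,
    "denoise_h": 10,
    "morph_kernel_size": 2,
    "unsharp_amount": 1.5,
    "unsharp_radius": 1.0,
    "unsharp_threshold": 0,
}


def build_config(overrides):
    """Return a complete config: the defaults with the overrides applied on top."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # A misspelled key would otherwise be ignored silently.
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


if __name__ == "__main__":
    config = build_config({"gamma": 1.4, "clahe_clip_limit": 3.0})
    print(json.dumps(config, indent=4))
```

Redirecting the printed JSON into a file such as `custom_config.json` gives something that can be passed to `batch_process.py --config`.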
## Processing Pipeline
The image cleaning pipeline applies these steps in sequence:
1. **Noise Reduction**
- Bilateral filtering preserves edges while reducing noise
- Non-local means denoising averages similar patches to suppress remaining grain and JPEG artifacts
2. **Contrast Enhancement**
- CLAHE improves local contrast adaptively
- Gamma correction adjusts overall brightness
3. **Background Cleaning**
- Morphological operations remove small artifacts
- On color images, pixels that the opening leaves pure black are reset to white
4. **Sharpening**
- Unsharp masking enhances text edges
- Preserves fine details while reducing blur
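In `image_cleaner.py`, `process_image` checks for each step name in this fixed order, so the order in which names are passed to `--steps` selects steps but does not reorder them. A stand-alone sketch of that behaviour, in which the steps only record that they ran (`PIPELINE_ORDER` and `run_steps` are illustrative names, not part of the module):

```python
# Illustrative stand-in for the real pipeline: each step only logs its name.
PIPELINE_ORDER = ["denoise", "contrast", "background", "sharpen"]


def run_steps(requested, log):
    """Run the requested steps in pipeline order, whatever order they were given in."""
    for name in PIPELINE_ORDER:
        if name in requested:
            log.append(name)
    return log


if __name__ == "__main__":
    # Requesting "sharpen" first still runs "denoise" before it.
    print(run_steps(["sharpen", "denoise"], []))
```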
## Interactive Tuning Commands
When using `config_tuner.py`, these commands are available:
- `set <param> <value>` - Adjust parameter value
- `show` - Display current parameters
- `test [steps]` - Process with current settings
- `compare [filename]` - Save before/after comparison
- `save <filename>` - Save configuration to file
- `load <filename>` - Load configuration from file
- `presets` - Show preset configurations
- `help` - Show detailed help
- `quit` - Exit tuning session
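The `set` command has to turn the typed value into a number. The tuner in this commit treats a value containing a dot as a float, anything else as an integer, and falls back to the raw string when conversion fails. A stand-alone sketch of that rule (`parse_value` is an illustrative name, not a function in `config_tuner.py`):

```python
def parse_value(text):
    """Convert a typed value: float if it contains '.', else int, else the raw string."""
    try:
        return float(text) if "." in text else int(text)
    except ValueError:
        return text


if __name__ == "__main__":
    for raw in ["1.5", "9", "abc"]:
        print(repr(parse_value(raw)))
```

Under this rule a value in scientific notation without a dot, such as `1e3`, stays a string, so it is safer to type `1000` or `1000.0`.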
## Tips for Best Results
### For Light Damage/Noise:
- Reduce `bilateral_d` to 5-7
- Lower `denoise_h` to 5-8
- Use `clahe_clip_limit` around 1.5
### For Heavy Damage/Artifacts:
- Increase `bilateral_d` to 12-15
- Raise `denoise_h` to 15-20
- Use higher `clahe_clip_limit` (3.0-4.0)
### For Faded/Low Contrast Images:
- Increase `gamma` to 1.3-1.5
- Raise `clahe_clip_limit` to 3.0+
- Boost `unsharp_amount` to 2.0+
### For Sharp/High Quality Scans:
- Focus mainly on `denoise` and `sharpen` steps
- Skip `background` cleaning if unnecessary
- Use lighter settings to preserve quality
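To see what a change in `gamma` does before running a full image, the lookup table built in `enhance_contrast` can be reproduced in plain Python. The sketch below mirrors that formula, `((i / 255) ** (1 / gamma)) * 255`, without NumPy or OpenCV; `gamma_table` is an illustrative name:

```python
def gamma_table(gamma):
    """Build the 256-entry lookup table used for gamma correction."""
    inv_gamma = 1.0 / gamma
    # int() truncates, matching the uint8 cast applied to the table in the module.
    return [int(((i / 255.0) ** inv_gamma) * 255) for i in range(256)]


if __name__ == "__main__":
    for gamma in (0.8, 1.2, 1.5):
        table = gamma_table(gamma)
        print(f"gamma={gamma}: input 128 -> output {table[128]}")
```

With this formula, values above 1.0 brighten the mid-tones and values below 1.0 darken them, while pure black and pure white stay fixed.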
## File Structure
```
newspaper_image_cleaner/
├── image_cleaner.py # Core processing module
├── batch_process.py # Batch processing script
├── config_tuner.py # Interactive parameter tuning
├── demo.py # Demo run on the sample images
├── run.sh # Wrapper that activates the virtual environment
├── requirements.txt # Python dependencies
└── README.md # This documentation
```
## Troubleshooting
### ImportError: No module named 'cv2'
Install OpenCV: `pip install opencv-python`
### Memory Issues with Large Images
The tuner automatically resizes large images. For batch processing of very large images, consider resizing first.
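One way to pick the target size is the rule the tuner applies: scale the image so that neither side exceeds 1500 pixels while keeping the aspect ratio. A sketch of the arithmetic, with `fit_within` as an illustrative helper that is not part of the scripts:

```python
def fit_within(width, height, max_side=1500):
    """Return (width, height) scaled down so that neither side exceeds max_side."""
    if width <= max_side and height <= max_side:
        return width, height  # already small enough, leave untouched
    scale = min(max_side / width, max_side / height)
    return int(width * scale), int(height * scale)


if __name__ == "__main__":
    print(fit_within(3000, 2000))
    print(fit_within(800, 600))
```

The returned pair is in (width, height) order, which is the order `cv2.resize` expects for its target size.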
### Poor Results
Use the interactive tuner to find optimal parameters for your specific image characteristics.
## Performance
- Single 3000x2000 image: ~3-5 seconds
- Batch processing depends on image size and quantity
- Interactive tuning uses smaller images for faster feedback

162
scripts/ex/batch_process.py Executable file

@@ -0,0 +1,162 @@
#!/usr/bin/env python3
"""
Batch Processing Script for Historical Newspaper Images
Simple script to process multiple images with the newspaper cleaning pipeline.
Includes progress tracking and error handling.
"""
import os
import sys
import time
import json
from pathlib import Path
from image_cleaner import NewspaperImageCleaner, create_comparison_image
def process_batch(input_dir=".", output_dir="cleaned", config_file=None,
                  create_comparisons=True, file_pattern="*.jpg"):
    """
    Process all newspaper images in a directory.

    Args:
        input_dir: Directory containing input images
        output_dir: Directory for cleaned images
        config_file: JSON file with custom parameters
        create_comparisons: Whether to create before/after comparisons
        file_pattern: Glob pattern for files to process
    """
    # Load custom config if provided
    config = None
    if config_file:
        if os.path.exists(config_file):
            with open(config_file, 'r') as f:
                config = json.load(f)
            print(f"Loaded custom config from {config_file}")
        else:
            # Previously a missing file was ignored without any message
            print(f"Config file not found: {config_file}; using defaults")

    # Initialize cleaner
    cleaner = NewspaperImageCleaner(config)

    # Setup paths (parents=True allows nested output directories)
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    if create_comparisons:
        comparison_path = output_path / "comparisons"
        comparison_path.mkdir(exist_ok=True)

    # Find all image files; the set drops duplicates that appear when
    # "*.jpg" and "*.JPG" match the same file on a case-insensitive filesystem
    patterns = [file_pattern, "*.jpeg", "*.JPG", "*.JPEG"]
    image_files = sorted({p for pattern in patterns for p in input_path.glob(pattern)})
    if not image_files:
        print(f"No image files found in {input_dir}")
        return

    print(f"Found {len(image_files)} images to process")
    print(f"Output directory: {output_path.absolute()}")

    # Process each image
    success_count = 0
    error_count = 0
    start_time = time.time()
    for i, img_file in enumerate(image_files, 1):
        print(f"\n[{i}/{len(image_files)}] Processing: {img_file.name}")
        try:
            # Process image
            output_file = output_path / f"cleaned_{img_file.name}"
            processed, original = cleaner.process_image(img_file, output_file)

            # Create comparison if requested
            if create_comparisons:
                comp_file = comparison_path / f"comparison_{img_file.name}"
                create_comparison_image(original, processed, comp_file)

            success_count += 1
            print(f"✓ Completed: {img_file.name}")
        except Exception as e:
            error_count += 1
            print(f"✗ Error processing {img_file.name}: {e}")

    # Summary
    elapsed_time = time.time() - start_time
    print("\n" + "=" * 50)
    print("Batch Processing Complete")
    # Nested double quotes inside an f-string are a syntax error before Python 3.12
    print("=" * 50)
    print(f"Successfully processed: {success_count}")
    print(f"Errors: {error_count}")
    print(f"Total time: {elapsed_time:.1f} seconds")
    print(f"Average time per image: {elapsed_time/len(image_files):.1f} seconds")
    print(f"Output directory: {output_path.absolute()}")


def create_sample_config():
    """Create a sample configuration file for customization."""
    config = {
        "bilateral_d": 9,
        "bilateral_sigma_color": 75,
        "bilateral_sigma_space": 75,
        "clahe_clip_limit": 2.0,
        "clahe_grid_size": [8, 8],
        "gamma": 1.2,
        "denoise_h": 10,
        "morph_kernel_size": 2,
        "unsharp_amount": 1.5,
        "unsharp_radius": 1.0,
        "unsharp_threshold": 0
    }
    with open("config.json", "w") as f:
        json.dump(config, f, indent=4)
    print("Created config.json with default parameters.")
    print("Edit this file to customize processing settings.")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Batch process historical newspaper images",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
python batch_process.py # Process current directory
python batch_process.py -i scans -o clean # Process 'scans' folder
python batch_process.py --no-comparisons # Skip comparison images
python batch_process.py --config custom.json # Use custom settings
"""
    )
    parser.add_argument("-i", "--input", default=".",
                        help="Input directory (default: current directory)")
    parser.add_argument("-o", "--output", default="cleaned",
                        help="Output directory (default: cleaned)")
    parser.add_argument("-c", "--config",
                        help="JSON config file with custom parameters")
    parser.add_argument("--no-comparisons", action="store_true",
                        help="Skip creating before/after comparison images")
    parser.add_argument("--pattern", default="*.jpg",
                        help="File pattern to match (default: *.jpg)")
    parser.add_argument("--create-config", action="store_true",
                        help="Create sample config file and exit")
    args = parser.parse_args()

    if args.create_config:
        create_sample_config()
        sys.exit(0)

    process_batch(
        input_dir=args.input,
        output_dir=args.output,
        config_file=args.config,
        create_comparisons=not args.no_comparisons,
        file_pattern=args.pattern
    )

291
scripts/ex/config_tuner.py Executable file

@@ -0,0 +1,291 @@
#!/usr/bin/env python3
"""
Interactive Parameter Tuning Tool for Newspaper Image Cleaning
This tool helps you find optimal parameters for your specific images
by providing an interactive tuning interface.
"""
import cv2
import json
import numpy as np
from pathlib import Path
from image_cleaner import NewspaperImageCleaner
class ParameterTuner:
    """Interactive parameter tuning for image cleaning pipeline."""

    def __init__(self, sample_image_path):
        """Initialize with a sample image for tuning."""
        self.original = cv2.imread(str(sample_image_path))
        if self.original is None:
            raise ValueError(f"Could not load image: {sample_image_path}")

        # Resize large images for faster processing during tuning
        height, width = self.original.shape[:2]
        if height > 1500 or width > 1500:
            scale = min(1500/height, 1500/width)
            new_width = int(width * scale)
            new_height = int(height * scale)
            self.original = cv2.resize(self.original, (new_width, new_height))
            print(f"Resized image to {new_width}x{new_height} for faster tuning")

        self.current_params = self._get_default_params()
        self.cleaner = NewspaperImageCleaner(self.current_params)

    def _get_default_params(self):
        """Get default parameters as starting point."""
        return {
            'bilateral_d': 9,
            'bilateral_sigma_color': 75,
            'bilateral_sigma_space': 75,
            'clahe_clip_limit': 2.0,
            'clahe_grid_size': (8, 8),
            'gamma': 1.2,
            'denoise_h': 10,
            'morph_kernel_size': 2,
            'unsharp_amount': 1.5,
            'unsharp_radius': 1.0,
            'unsharp_threshold': 0,
        }

    def update_parameter(self, param_name, value):
        """Update a single parameter and refresh the cleaner."""
        if param_name in self.current_params:
            # Handle special cases
            if param_name == 'clahe_grid_size':
                self.current_params[param_name] = (int(value), int(value))
            else:
                self.current_params[param_name] = value
            # Update cleaner with new parameters
            self.cleaner = NewspaperImageCleaner(self.current_params)
            print(f"Updated {param_name} = {value}")
        else:
            # Report typos instead of ignoring them silently
            print(f"Unknown parameter: {param_name}")

    def process_with_current_params(self, steps=None):
        """Process the sample image with current parameters."""
        if steps is None:
            steps = ['denoise', 'contrast', 'background', 'sharpen']
        image = self.original.copy()

        # Apply processing steps
        if 'denoise' in steps:
            image = self.cleaner.reduce_noise(image)
        if 'contrast' in steps:
            image = self.cleaner.enhance_contrast(image)
        if 'background' in steps:
            image = self.cleaner.clean_background(image)
        if 'sharpen' in steps:
            image = self.cleaner.sharpen_image(image)
        return image

    def create_comparison(self, steps=None):
        """Create side-by-side comparison with current parameters."""
        processed = self.process_with_current_params(steps)

        # Create side-by-side comparison
        height = max(self.original.shape[0], processed.shape[0])
        comparison = np.hstack([
            cv2.resize(self.original, (self.original.shape[1], height)),
            cv2.resize(processed, (processed.shape[1], height))
        ])
        return comparison

    def save_comparison(self, output_path, steps=None):
        """Save comparison image to file."""
        comparison = self.create_comparison(steps)
        cv2.imwrite(str(output_path), comparison)
        print(f"Comparison saved to: {output_path}")

    def save_config(self, config_path):
        """Save current parameters to JSON config file."""
        # Convert tuple to list for JSON serialization
        config_to_save = self.current_params.copy()
        if 'clahe_grid_size' in config_to_save:
            config_to_save['clahe_grid_size'] = list(config_to_save['clahe_grid_size'])
        with open(config_path, 'w') as f:
            json.dump(config_to_save, f, indent=4)
        print(f"Configuration saved to: {config_path}")

    def load_config(self, config_path):
        """Load parameters from JSON config file."""
        with open(config_path, 'r') as f:
            loaded_params = json.load(f)

        # Convert list back to tuple if needed
        if 'clahe_grid_size' in loaded_params:
            loaded_params['clahe_grid_size'] = tuple(loaded_params['clahe_grid_size'])
        self.current_params.update(loaded_params)
        self.cleaner = NewspaperImageCleaner(self.current_params)
        print(f"Configuration loaded from: {config_path}")

    def interactive_tune(self):
        """Start interactive tuning session."""
        print("\n" + "="*60)
        print("INTERACTIVE PARAMETER TUNING")
        print("="*60)
        print("Commands:")
        print(" set <param> <value> - Set parameter value")
        print(" show - Show current parameters")
        print(" test [steps] - Test current parameters")
        print(" save <file> - Save configuration to file")
        print(" load <file> - Load configuration from file")
        print(" compare [file] - Save comparison image")
        print(" presets - Show parameter presets")
        print(" help - Show this help")
        print(" quit - Exit tuning")
        print("\nParameters you can adjust:")
        for param in self.current_params:
            print(f" {param}")

        while True:
            try:
                command = input("\ntuner> ").strip().split()
                if not command:
                    continue
                cmd = command[0].lower()

                if cmd == 'quit' or cmd == 'exit':
                    break
                elif cmd == 'show':
                    self._show_parameters()
                elif cmd == 'set' and len(command) >= 3:
                    param = command[1]
                    try:
                        value = float(command[2]) if '.' in command[2] else int(command[2])
                    except ValueError:
                        value = command[2]
                    self.update_parameter(param, value)
                elif cmd == 'test':
                    steps = command[1:] if len(command) > 1 else None
                    print("Processing with current parameters...")
                    processed = self.process_with_current_params(steps)
                    print(f"Processed image shape: {processed.shape}")
                elif cmd == 'save' and len(command) > 1:
                    self.save_config(command[1])
                elif cmd == 'load' and len(command) > 1:
                    self.load_config(command[1])
                elif cmd == 'compare':
                    output = command[1] if len(command) > 1 else "tuning_comparison.jpg"
                    self.save_comparison(output)
                elif cmd == 'presets':
                    self._show_presets()
                elif cmd == 'help':
                    self._show_help()
                else:
                    print("Unknown command. Type 'help' for available commands.")
            except KeyboardInterrupt:
                print("\nExiting tuner...")
                break
            except Exception as e:
                print(f"Error: {str(e)}")

    def _show_parameters(self):
        """Display current parameter values."""
        print("\nCurrent Parameters:")
        print("-" * 30)
        for param, value in self.current_params.items():
            print(f" {param:<20} = {value}")

    def _show_presets(self):
        """Show preset configurations for different image types."""
        presets = {
            "light_cleaning": {
                "bilateral_d": 5,
                "denoise_h": 5,
                "clahe_clip_limit": 1.5,
                "gamma": 1.1,
                "unsharp_amount": 1.2
            },
            "heavy_cleaning": {
                "bilateral_d": 15,
                "denoise_h": 15,
                "clahe_clip_limit": 3.0,
                "gamma": 1.3,
                "unsharp_amount": 2.0
            },
            "high_contrast": {
                "clahe_clip_limit": 4.0,
                "gamma": 1.4,
                "unsharp_amount": 2.5
            }
        }
        print("\nAvailable Presets:")
        print("-" * 30)
        for name, params in presets.items():
            print(f"{name}:")
            for param, value in params.items():
                print(f" {param} = {value}")
            print()

    def _show_help(self):
        """Show detailed help information."""
        help_text = """
Parameter Descriptions:
-----------------------
bilateral_d : Neighborhood diameter for bilateral filtering (5-15)
bilateral_sigma_color: Filter sigma in color space (50-150)
bilateral_sigma_space: Filter sigma in coordinate space (50-150)
clahe_clip_limit : Contrast limit for CLAHE (1.0-4.0)
clahe_grid_size : CLAHE tile grid size (4-16)
gamma : Gamma correction value (0.8-2.0)
denoise_h : Denoising filter strength (5-20)
morph_kernel_size : Morphological operation kernel size (1-5)
unsharp_amount : Unsharp masking amount (0.5-3.0)
unsharp_radius : Unsharp masking radius (0.5-2.0)
unsharp_threshold : Unsharp masking threshold (0-10)
Tips:
- Start with small adjustments (±20% of current value)
- Test frequently with 'compare' command
- Save working configurations before major changes
- Use 'test denoise' to test individual steps
"""
        print(help_text)


def main():
    """Main function for command line usage."""
    import argparse

    parser = argparse.ArgumentParser(description="Interactive parameter tuning for newspaper image cleaning")
    parser.add_argument("image", help="Sample image path for tuning")
    parser.add_argument("-c", "--config", help="Load initial config from file")
    args = parser.parse_args()

    try:
        tuner = ParameterTuner(args.image)
        if args.config:
            tuner.load_config(args.config)
        tuner.interactive_tune()
    except Exception as e:
        print(f"Error: {str(e)}")


if __name__ == "__main__":
    main()

170
scripts/ex/demo.py Executable file

@@ -0,0 +1,170 @@
#!/usr/bin/env python3
"""
Demo Script for Newspaper Image Cleaning Pipeline
This script demonstrates the cleaning pipeline on the sample images
and shows the available functionality.
"""
import sys
import os
from pathlib import Path
# Add current directory to Python path
sys.path.append(str(Path(__file__).parent))
try:
    from image_cleaner import NewspaperImageCleaner, create_comparison_image
    import cv2
    import numpy as np
    print("✓ All required libraries imported successfully")
except ImportError as e:
    print(f"✗ Import error: {e}")
    print("Please install required packages: pip install -r requirements.txt")
    sys.exit(1)


def demo_single_image(image_path):
    """Demonstrate processing a single image."""
    print(f"\n=== Processing Single Image: {image_path} ===")
    if not os.path.exists(image_path):
        print(f"Image not found: {image_path}")
        return False

    try:
        # Initialize cleaner
        cleaner = NewspaperImageCleaner()

        # Process image
        output_path = f"demo_cleaned_{Path(image_path).name}"
        processed, original = cleaner.process_image(image_path, output_path)

        # Create comparison
        comparison_path = f"demo_comparison_{Path(image_path).name}"
        create_comparison_image(original, processed, comparison_path)

        print(f"✓ Processed image saved: {output_path}")
        print(f"✓ Comparison saved: {comparison_path}")
        return True
    except Exception as e:
        print(f"✗ Error processing {image_path}: {str(e)}")
        return False


def demo_step_by_step(image_path):
    """Demonstrate individual processing steps."""
    print(f"\n=== Step-by-Step Processing: {image_path} ===")
    if not os.path.exists(image_path):
        print(f"Image not found: {image_path}")
        return

    try:
        # Load image
        original = cv2.imread(image_path)
        if original is None:
            print(f"Could not load image: {image_path}")
            return

        # Resize if too large for demo
        height, width = original.shape[:2]
        if height > 1000 or width > 1000:
            scale = min(1000/height, 1000/width)
            new_width = int(width * scale)
            new_height = int(height * scale)
            original = cv2.resize(original, (new_width, new_height))
            print(f"Resized to {new_width}x{new_height} for demo")

        cleaner = NewspaperImageCleaner()

        # Each step is applied to a fresh copy of the original, not chained
        steps = [
            ('original', original),
            ('denoised', cleaner.reduce_noise(original.copy())),
            ('contrast_enhanced', cleaner.enhance_contrast(original.copy())),
            ('background_cleaned', cleaner.clean_background(original.copy())),
            ('sharpened', cleaner.sharpen_image(original.copy()))
        ]

        # Save each step
        for step_name, image in steps:
            output_path = f"demo_step_{step_name}_{Path(image_path).name}"
            cv2.imwrite(output_path, image)
            print(f"✓ Saved {step_name}: {output_path}")
        print("✓ Individual processing steps completed")
    except Exception as e:
        print(f"✗ Error in step-by-step processing: {str(e)}")


def show_image_info():
    """Show information about available images."""
    print("\n=== Available Sample Images ===")
    image_files = []
    for ext in ['*.jpg', '*.jpeg', '*.JPG', '*.JPEG']:
        image_files.extend(Path('.').glob(ext))

    if not image_files:
        print("No image files found in current directory")
        return []

    for img_file in image_files:
        try:
            # Load image to get dimensions
            img = cv2.imread(str(img_file))
            if img is not None:
                height, width = img.shape[:2]
                file_size = img_file.stat().st_size / (1024*1024)  # MB
                print(f" {img_file.name}: {width}x{height} pixels, {file_size:.1f}MB")
            else:
                print(f" {img_file.name}: Could not load")
        except Exception as e:
            print(f" {img_file.name}: Error - {str(e)}")
    return image_files


def main():
    """Main demo function."""
    print("Historical Newspaper Image Cleaning Pipeline - Demo")
    print("=" * 55)

    # Show available images
    image_files = show_image_info()
    if not image_files:
        print("\nNo images found. Please add some image files to test.")
        return

    # Select first image for demo
    sample_image = image_files[0]
    print(f"\nUsing sample image: {sample_image.name}")

    # Demo single image processing
    success = demo_single_image(str(sample_image))
    if success:
        # Demo step-by-step processing
        demo_step_by_step(str(sample_image))
        print(f"\n=== Demo Complete ===")
        print("Generated files:")
        print(" - demo_cleaned_*.jpg (cleaned image)")
        print(" - demo_comparison_*.jpg (before/after comparison)")
        print(" - demo_step_*.jpg (individual processing steps)")
        print(f"\nNext steps:")
        print(f" - Try: python config_tuner.py {sample_image.name}")
        print(f" - Try: python batch_process.py")
        print(f" - Adjust parameters in config.json for better results")
    else:
        print("\nDemo failed. Please check your Python environment and dependencies.")


if __name__ == "__main__":
    main()

310
scripts/ex/image_cleaner.py Normal file

@@ -0,0 +1,310 @@
"""
Historical Newspaper Image Cleaning Pipeline
This module provides functions to clean and enhance scanned historical newspaper images
by reducing noise, improving contrast, and sharpening text for better readability.
"""
import cv2
import numpy as np
from PIL import Image, ImageEnhance
import os
import argparse
from pathlib import Path
class NewspaperImageCleaner:
    """
    Image processing pipeline specifically designed for historical newspaper scans.
    """

    def __init__(self, config=None):
        """Initialize with default or custom configuration."""
        # Merge over the defaults so a partial config does not raise KeyError later
        self.config = {**self._default_config(), **(config or {})}

    def _default_config(self):
        """Default processing parameters optimized for newspaper scans."""
        return {
            'bilateral_d': 9,             # Neighborhood diameter for bilateral filter
            'bilateral_sigma_color': 75,  # Filter sigma in color space
            'bilateral_sigma_space': 75,  # Filter sigma in coordinate space
            'clahe_clip_limit': 2.0,      # Contrast limiting for CLAHE
            'clahe_grid_size': (8, 8),    # CLAHE grid size
            'gamma': 1.2,                 # Gamma correction value
            'denoise_h': 10,              # Denoising filter strength
            'morph_kernel_size': 2,       # Morphological operation kernel size
            'unsharp_amount': 1.5,        # Unsharp masking amount
            'unsharp_radius': 1.0,        # Unsharp masking radius
            'unsharp_threshold': 0,       # Unsharp masking threshold
        }

    def reduce_noise(self, image):
        """
        Apply noise reduction techniques to remove speckles and JPEG artifacts.

        Args:
            image: Input BGR image
        Returns:
            Denoised image
        """
        # Bilateral filter - preserves edges while reducing noise
        bilateral = cv2.bilateralFilter(
            image,
            self.config['bilateral_d'],
            self.config['bilateral_sigma_color'],
            self.config['bilateral_sigma_space']
        )

        # Non-local means denoising for better noise reduction
        if len(image.shape) == 3:
            # Color image
            denoised = cv2.fastNlMeansDenoisingColored(
                bilateral, None,
                self.config['denoise_h'],
                self.config['denoise_h'],
                7, 21
            )
        else:
            # Grayscale image
            denoised = cv2.fastNlMeansDenoising(
                bilateral, None,
                self.config['denoise_h'],
                7, 21
            )
        return denoised

    def enhance_contrast(self, image):
        """
        Improve image contrast using CLAHE and gamma correction.

        Args:
            image: Input BGR image
        Returns:
            Contrast-enhanced image
        """
        # Convert to LAB color space for better contrast processing
        if len(image.shape) == 3:
            lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)
            l_channel, a_channel, b_channel = cv2.split(lab)
        else:
            l_channel = image

        # Apply CLAHE (Contrast Limited Adaptive Histogram Equalization).
        # tuple() covers configs loaded from JSON, where the grid size is a list.
        clahe = cv2.createCLAHE(
            clipLimit=self.config['clahe_clip_limit'],
            tileGridSize=tuple(self.config['clahe_grid_size'])
        )
        l_channel = clahe.apply(l_channel)

        # Reconstruct image
        if len(image.shape) == 3:
            enhanced = cv2.merge([l_channel, a_channel, b_channel])
            enhanced = cv2.cvtColor(enhanced, cv2.COLOR_LAB2BGR)
        else:
            enhanced = l_channel

        # Apply gamma correction through a 256-entry lookup table
        gamma = self.config['gamma']
        inv_gamma = 1.0 / gamma
        table = np.array([((i / 255.0) ** inv_gamma) * 255
                          for i in np.arange(0, 256)]).astype("uint8")
        enhanced = cv2.LUT(enhanced, table)
        return enhanced

    def clean_background(self, image):
        """
        Remove small artifacts and clean background noise.

        Args:
            image: Input image
        Returns:
            Background-cleaned image
        """
        # Convert to grayscale for morphological operations
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image

        # Morphological opening to remove small noise
        kernel = np.ones((self.config['morph_kernel_size'],
                          self.config['morph_kernel_size']), np.uint8)
        # Opening (erosion followed by dilation)
        opened = cv2.morphologyEx(gray, cv2.MORPH_OPEN, kernel)

        # If original was color, apply the mask
        if len(image.shape) == 3:
            # Create a mask and apply it to the original color image.
            # NOTE: `opened > 0` is False only where the opened image is pure
            # black (value 0), so those pixels are the ones painted white.
            mask = opened > 0
            result = image.copy()
            result[~mask] = [255, 255, 255]
            return result
        else:
            return opened

    def sharpen_image(self, image):
        """
        Apply unsharp masking to enhance text clarity.

        Args:
            image: Input image
        Returns:
            Sharpened image
        """
        # Convert to float for processing
        float_img = image.astype(np.float32) / 255.0

        # Create Gaussian blur
        radius = self.config['unsharp_radius']
        sigma = radius / 3.0
        blurred = cv2.GaussianBlur(float_img, (0, 0), sigma)

        # Unsharp masking
        amount = self.config['unsharp_amount']
        sharpened = float_img + amount * (float_img - blurred)

        # Threshold and clamp
        threshold = self.config['unsharp_threshold'] / 255.0
        sharpened = np.where(np.abs(float_img - blurred) < threshold,
                             float_img, sharpened)
        sharpened = np.clip(sharpened, 0.0, 1.0)
        return (sharpened * 255).astype(np.uint8)

    def process_image(self, image_path, output_path=None, steps=None):
        """
        Process a single image through the complete pipeline.

        Args:
            image_path: Path to input image
            output_path: Path for output image (optional)
            steps: List of processing steps to apply (optional)
        Returns:
            Tuple of (processed image array, original image array)
        """
        if steps is None:
            steps = ['denoise', 'contrast', 'background', 'sharpen']

        # Load image
        image = cv2.imread(str(image_path))
        if image is None:
            raise ValueError(f"Could not load image: {image_path}")
        original = image.copy()

        # Apply processing steps; the order is fixed regardless of the order in `steps`
        if 'denoise' in steps:
            print("Applying noise reduction...")
            image = self.reduce_noise(image)
        if 'contrast' in steps:
            print("Enhancing contrast...")
            image = self.enhance_contrast(image)
        if 'background' in steps:
            print("Cleaning background...")
            image = self.clean_background(image)
        if 'sharpen' in steps:
            print("Sharpening image...")
            image = self.sharpen_image(image)

        # Save output if path provided
        if output_path:
            cv2.imwrite(str(output_path), image)
            print(f"Processed image saved to: {output_path}")
        return image, original

    def process_directory(self, input_dir, output_dir, extensions=None):
        """
        Process all images in a directory.

        Args:
            input_dir: Input directory path
            output_dir: Output directory path
            extensions: List of file extensions to process
        """
        if extensions is None:
            extensions = ['.jpg', '.jpeg', '.png', '.tif', '.tiff']
        input_path = Path(input_dir)
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        for file_path in input_path.iterdir():
            if file_path.suffix.lower() in extensions:
                print(f"\nProcessing: {file_path.name}")
                output_file = output_path / f"cleaned_{file_path.name}"
                try:
                    self.process_image(file_path, output_file)
                except Exception as e:
                    print(f"Error processing {file_path.name}: {str(e)}")
        print(f"\nBatch processing completed. Results in: {output_dir}")


def create_comparison_image(original, processed, output_path):
    """
    Create a side-by-side comparison image.

    Args:
        original: Original image array
        processed: Processed image array
        output_path: Path to save comparison
    """
    # Resize images to same height if needed
    h1, w1 = original.shape[:2]
    h2, w2 = processed.shape[:2]
    if h1 != h2:
        height = min(h1, h2)
        original = cv2.resize(original, (int(w1 * height / h1), height))
        processed = cv2.resize(processed, (int(w2 * height / h2), height))

    # Create side-by-side comparison
    comparison = np.hstack([original, processed])
    cv2.imwrite(str(output_path), comparison)
    print(f"Comparison saved to: {output_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Clean historical newspaper images")
    parser.add_argument("input", help="Input image or directory path")
    parser.add_argument("-o", "--output", help="Output path")
    parser.add_argument("-d", "--directory", action="store_true",
                        help="Process entire directory")
    parser.add_argument("-c", "--comparison", action="store_true",
                        help="Create before/after comparison")
    parser.add_argument("--steps", nargs="+",
                        choices=['denoise', 'contrast', 'background', 'sharpen'],
                        default=['denoise', 'contrast', 'background', 'sharpen'],
                        help="Processing steps to apply")
    args = parser.parse_args()

    cleaner = NewspaperImageCleaner()
    if args.directory:
        output_dir = args.output or "cleaned_images"
        cleaner.process_directory(args.input, output_dir)
    else:
        output_path = args.output
        if not output_path:
            input_path = Path(args.input)
            output_path = input_path.parent / f"cleaned_{input_path.name}"
        processed, original = cleaner.process_image(args.input, output_path, args.steps)
        if args.comparison:
            comparison_path = Path(output_path).parent / f"comparison_{Path(args.input).name}"
            create_comparison_image(original, processed, comparison_path)


@@ -0,0 +1,5 @@
opencv-python==4.10.0.84
scikit-image==0.24.0
Pillow==10.4.0
numpy==2.1.1
matplotlib==3.9.2

42
scripts/ex/run.sh Executable file

@@ -0,0 +1,42 @@
#!/bin/bash
# Convenience script to run the image cleaning pipeline with virtual environment
# Activate virtual environment
source venv/bin/activate
# Check if any arguments provided
if [ $# -eq 0 ]; then
    echo "Historical Newspaper Image Cleaning Pipeline"
    echo "Usage examples:"
    echo " $0 demo # Run demo"
    echo " $0 clean image.jpg # Clean single image"
    echo " $0 batch # Process all images in directory"
    echo " $0 tune image.jpg # Interactive parameter tuning"
    echo " $0 python script.py [args] # Run custom Python script"
    exit 1
fi

case "$1" in
    "demo")
        python demo.py
        ;;
    "clean")
        shift
        python image_cleaner.py "$@"
        ;;
    "batch")
        shift
        python batch_process.py "$@"
        ;;
    "tune")
        shift
        python config_tuner.py "$@"
        ;;
    "python")
        shift
        python "$@"
        ;;
    *)
        python "$@"
        ;;
esac