feat: implement theme stripping system with THEME_MIN_CARDS config

This commit is contained in:
matt 2026-03-19 15:05:40 -07:00
parent 1ebc2fcb3c
commit 86ece36012
20 changed files with 6604 additions and 1364 deletions

View file

@ -8,9 +8,27 @@
- **Enhanced Quality Scoring**: Four-tier system (Excellent/Good/Fair/Poor) with 0.0-1.0 numerical scores based on uniqueness, duplication, description quality, and metadata completeness
- **CLI Linter**: `validate_theme_catalog.py --lint` flag with configurable thresholds for duplication and quality warnings, provides actionable improvement suggestions
- **Editorial Documentation**: Comprehensive guide at `docs/theme_editorial_guide.md` covering quality scoring, best practices, linter usage, and workflow examples
- **Theme Stripping Configuration**: Configurable minimum card threshold for theme retention
- **THEME_MIN_CARDS Setting**: Environment variable (default: 5) to strip themes with too few cards from catalogs and card metadata
- **Analysis Tooling**: `analyze_theme_distribution.py` script to visualize theme distribution and identify stripping candidates
- **Core Threshold Logic**: `theme_stripper.py` module with functions to identify and filter low-card-count themes
- **Catalog Stripping**: Automated removal of low-card themes from YAML catalog with backup/logging via `strip_catalog_themes.py` script
### Changed
_No unreleased changes yet_
- **Build Process Modernization**: Theme catalog generation now reads from parquet files instead of obsolete CSV format
- Updated `build_theme_catalog.py` and `extract_themes.py` to use parquet data (matches rest of codebase)
- Removed silent CSV exception handling (build now fails loudly if parquet read fails)
- Added THEME_MIN_CARDS filtering directly in build pipeline (themes below threshold excluded during generation)
- `theme_list.json` now auto-generated from stripped parquet data after theme stripping
- Eliminated manual JSON stripping step (JSON is derived artifact, not source of truth)
- **Parquet Theme Stripping**: Strip low-card themes directly from card data files
- Added `strip_parquet_themes.py` script with dry-run, verbose, and backup modes
- Added parquet manipulation functions to `theme_stripper.py`: backup, filter, update, and strip operations
- Handles multiple themeTags formats: numpy arrays, lists, and comma/pipe-separated strings
- Stripped 97 theme tag occurrences from 30,674 cards in `all_cards.parquet`
- Updated `stripped_themes.yml` log with 520 themes stripped from parquet source
- **Automatic integration**: Theme stripping now runs automatically in `run_tagging()` after tagging completes (when `THEME_MIN_CARDS` > 1, default: 5)
- Integrated into web UI setup, CLI tagging, and CI/CD workflows (build-similarity-cache)
### Fixed
_No unreleased changes yet_