mirror of
https://github.com/mwisnowski/mtg_python_deckbuilder.git
synced 2026-03-24 14:06:31 +01:00
feat: implement theme stripping system with THEME_MIN_CARDS config (#55)
* feat: implement theme stripping system with THEME_MIN_CARDS config
* fix: call build_catalog directly to avoid argparse conflicts in CI
parent 1ebc2fcb3c
commit 03e2846882
20 changed files with 6613 additions and 1364 deletions
@@ -8,9 +8,27 @@
- **Enhanced Quality Scoring**: Four-tier system (Excellent/Good/Fair/Poor) with 0.0-1.0 numerical scores based on uniqueness, duplication, description quality, and metadata completeness
- **CLI Linter**: `validate_theme_catalog.py --lint` flag with configurable thresholds for duplication and quality warnings, provides actionable improvement suggestions
- **Editorial Documentation**: Comprehensive guide at `docs/theme_editorial_guide.md` covering quality scoring, best practices, linter usage, and workflow examples
- **Theme Stripping Configuration**: Configurable minimum card threshold for theme retention
- **THEME_MIN_CARDS Setting**: Environment variable (default: 5) to strip themes with too few cards from catalogs and card metadata
- **Analysis Tooling**: `analyze_theme_distribution.py` script to visualize theme distribution and identify stripping candidates
- **Core Threshold Logic**: `theme_stripper.py` module with functions to identify and filter low-card-count themes
- **Catalog Stripping**: Automated removal of low-card themes from YAML catalog with backup/logging via `strip_catalog_themes.py` script
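
The threshold behaviour described in the entries above can be sketched roughly as follows. This is a minimal illustration, not the code in `theme_stripper.py`: the helper names `get_theme_min_cards` and `find_low_card_themes` are hypothetical, and the input is assumed to be a plain mapping of theme name to card count. Only the `THEME_MIN_CARDS` variable and its default of 5 come from this changelog. Treating values of 1 or lower as "stripping disabled" is an assumption based on the `THEME_MIN_CARDS` > 1 condition mentioned under Changed.

```python
import os
from typing import Mapping, Set

DEFAULT_THEME_MIN_CARDS = 5  # default stated in the changelog


def get_theme_min_cards(env: Mapping[str, str] = os.environ) -> int:
    """Read THEME_MIN_CARDS, falling back to the default when unset or invalid.

    Hypothetical helper; the real project may parse the setting differently.
    """
    raw = env.get("THEME_MIN_CARDS", "")
    try:
        return int(raw)
    except ValueError:
        return DEFAULT_THEME_MIN_CARDS


def find_low_card_themes(theme_counts: Mapping[str, int], min_cards: int) -> Set[str]:
    """Return the themes whose card count is below min_cards.

    Assumption: a threshold of 1 or lower disables stripping entirely.
    """
    if min_cards <= 1:
        return set()
    return {theme for theme, count in theme_counts.items() if count < min_cards}
```

With illustrative counts such as `{"Tokens": 120, "Lifegain": 42, "Obscure Theme": 3}` and the default threshold, only `"Obscure Theme"` would be flagged as a stripping candidate. Passing the environment in as a mapping keeps the helper easy to test without mutating `os.environ`.
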
### Changed
- **Build Process Modernization**: Theme catalog generation now reads from parquet files instead of obsolete CSV format
  - Updated `build_theme_catalog.py` and `extract_themes.py` to use parquet data (matches the rest of the codebase)
  - Removed silent CSV exception handling (build now fails loudly if the parquet read fails)
  - Added `THEME_MIN_CARDS` filtering directly in the build pipeline (themes below the threshold are excluded during generation)
  - `theme_list.json` is now auto-generated from stripped parquet data after theme stripping
  - Eliminated the manual JSON stripping step (JSON is a derived artifact, not the source of truth)
- **Parquet Theme Stripping**: Strip low-card themes directly from card data files
  - Added `strip_parquet_themes.py` script with dry-run, verbose, and backup modes
  - Added parquet manipulation functions to `theme_stripper.py`: backup, filter, update, and strip operations
  - Handles multiple `themeTags` formats: numpy arrays, lists, and comma/pipe-separated strings
  - Stripped 97 theme tag occurrences from 30,674 cards in `all_cards.parquet`
  - Updated `stripped_themes.yml` log with 520 themes stripped from parquet source
- **Automatic Integration**: Theme stripping now runs automatically in `run_tagging()` after tagging completes (when `THEME_MIN_CARDS` > 1, default: 5)
  - Integrated into web UI setup, CLI tagging, and CI/CD workflows (`build-similarity-cache`)
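
As an illustration of the `themeTags` handling noted above (numpy arrays, lists, and comma/pipe-separated strings), the following sketch normalizes a cell to a list of tags and then drops stripped themes. The function names `normalize_theme_tags` and `strip_theme_tags` are hypothetical and not taken from `theme_stripper.py`. The sketch assumes exact, case-sensitive tag matching and detects array-like values by their `tolist()` method so that it does not need to import numpy.

```python
import re
from typing import Any, List, Set


def normalize_theme_tags(value: Any) -> List[str]:
    """Coerce a themeTags cell into a list of trimmed, non-empty tag strings."""
    if value is None:
        return []
    if isinstance(value, str):
        # Delimited string form: split on commas and pipes.
        parts = re.split(r"[,|]", value)
    elif hasattr(value, "tolist"):
        # Array-like form (for example a numpy array).
        parts = value.tolist()
        if not isinstance(parts, list):
            parts = [parts]
    else:
        try:
            parts = list(value)
        except TypeError:
            # Non-iterable scalar such as a float NaN: treat as "no tags".
            return []
    return [str(p).strip() for p in parts if str(p).strip()]


def strip_theme_tags(value: Any, themes_to_strip: Set[str]) -> List[str]:
    """Return the normalized tags with every stripped theme removed."""
    return [tag for tag in normalize_theme_tags(value) if tag not in themes_to_strip]
```

For example, `strip_theme_tags("A, B|C", {"B"})` returns `["A", "C"]`. Normalizing to a plain list first keeps the filtering step identical regardless of how the tags were stored.
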
### Fixed
_No unreleased changes yet_