grabar-ocr
andylitalo/grabar-ocr
Manages a pipeline to digitize "Grabar" (Classical Armenian) texts
Summary
An end-to-end pipeline for digitizing Classical Armenian (Grabar) texts from scanned PDFs into a searchable PostgreSQL database of the texts and their translations. The project integrates PDF layout detection (PyMuPDF, YOLOv8), fine-tuned optical character recognition (TrOCR), AI translation (Claude/GPT-4o), and database storage, all orchestrated by Apache Airflow on a k3s Kubernetes cluster with GPU acceleration.
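The stage order described above (layout detection, OCR, translation, storage) can be sketched in plain Python. This is only an illustrative outline, not the repository's code: `PageRecord`, the stage functions, and `run_pipeline` are hypothetical names, every stage body is a placeholder standing in for the real tool, and an in-memory list stands in for PostgreSQL. In the actual project each stage would presumably run as an Airflow task rather than a direct function call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PageRecord:
    """Hypothetical record carried through the stages for one scanned page."""
    page_number: int
    regions: List[str] = field(default_factory=list)  # output of layout detection
    grabar_text: str = ""                              # output of OCR
    translation: str = ""                              # output of translation


def detect_layout(page: PageRecord) -> PageRecord:
    # Placeholder for layout detection (the project names PyMuPDF and YOLOv8).
    page.regions = [f"page{page.page_number}-region0"]
    return page


def recognize_text(page: PageRecord) -> PageRecord:
    # Placeholder for OCR (the project names a fine-tuned TrOCR model).
    page.grabar_text = " ".join(f"<ocr:{r}>" for r in page.regions)
    return page


def translate(page: PageRecord) -> PageRecord:
    # Placeholder for AI translation (the project names Claude/GPT-4o).
    page.translation = f"<translation of {page.grabar_text}>"
    return page


def store(page: PageRecord, database: List[PageRecord]) -> None:
    # Placeholder for database storage; a list stands in for PostgreSQL.
    database.append(page)


# Processing stages in the order given in the summary.
STAGES: List[Callable[[PageRecord], PageRecord]] = [
    detect_layout,
    recognize_text,
    translate,
]


def run_pipeline(page_numbers: List[int], database: List[PageRecord]) -> List[PageRecord]:
    """Run every page through each stage in order, then store the result."""
    for number in page_numbers:
        page = PageRecord(page_number=number)
        for stage in STAGES:
            page = stage(page)
        store(page, database)
    return database


if __name__ == "__main__":
    for record in run_pipeline([1, 2], []):
        print(record.page_number, record.grabar_text, record.translation)
```

Keeping each stage as a function from record to record makes the stages easy to test in isolation and to map one-to-one onto orchestrator tasks later.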