grabar-ocr
andylitalo/grabar-ocr
Manages a pipeline to digitize "Grabar" (Classical Armenian) texts
Summary
An end-to-end pipeline for digitizing Classical Armenian (Grabar) texts from scanned PDFs into a searchable PostgreSQL database of transcribed and translated text. The system uses PyMuPDF and YOLOv8 for layout detection, fine-tuned TrOCR for Armenian OCR, Claude/GPT-4 for translation, and Apache Airflow for orchestration on a k3s Kubernetes cluster with GPU support.
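
The stage ordering described above can be sketched in plain Python. This is only an illustration of the data flow, not the repository's code: the `Page` record, the stage function names, and the in-memory `db` list are all hypothetical, and each stage body is a stub standing in for the real tool (PyMuPDF/YOLOv8, TrOCR, an LLM, PostgreSQL). In the actual project these steps would presumably run as Airflow tasks rather than as a simple loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """One scanned page moving through the pipeline (hypothetical record)."""
    pdf_path: str
    page_number: int
    regions: List[str] = field(default_factory=list)  # layout regions found
    grabar_text: str = ""                             # OCR output
    translation: str = ""                             # translated text


def detect_layout(page: Page) -> Page:
    # Stub for PDF rendering (PyMuPDF) plus region detection (YOLOv8).
    page.regions = ["text_block_0"]
    return page


def run_ocr(page: Page) -> Page:
    # Stub for the fine-tuned TrOCR model: one placeholder per region.
    page.grabar_text = " ".join(f"<ocr:{r}>" for r in page.regions)
    return page


def translate(page: Page) -> Page:
    # Stub for an LLM translation call (Claude or GPT-4 in the summary).
    page.translation = f"<translation of {page.grabar_text}>"
    return page


# Stages run in the order the summary lists them.
STAGES: List[Callable[[Page], Page]] = [detect_layout, run_ocr, translate]


def process_page(page: Page, db: List[Page]) -> Page:
    """Run every stage in order, then store the result.

    `db` is an in-memory list standing in for the PostgreSQL database.
    """
    for stage in STAGES:
        page = stage(page)
    db.append(page)
    return page


if __name__ == "__main__":
    database: List[Page] = []
    result = process_page(Page("sample.pdf", 1), database)
    print(result.translation)
```

Keeping each stage as a function from `Page` to `Page` makes it straightforward to map one stage to one orchestrator task and to rerun a single stage when, for example, the OCR model is updated.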