grabar-ocr

andylitalo/grabar-ocr

Manages a pipeline to digitize "Grabar" (Classical Armenian) texts

Python · Stars: 0 · Forks: 0 · ML/AI

Summary

An end-to-end pipeline for digitizing Classical Armenian (Grabar) texts from scanned PDFs into a searchable, translated PostgreSQL database. The system uses PyMuPDF and YOLOv8 for layout detection, fine-tuned TrOCR for Armenian OCR, Claude/GPT-4 for translation, and Apache Airflow for orchestration on a k3s Kubernetes cluster with GPU support.
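The stage order described above (layout detection, OCR, translation, storage) can be sketched in plain Python. This is only an illustrative outline, not the repository's code: the `Page` structure, the stage function names, and the in-memory `rows` list standing in for PostgreSQL are all hypothetical, and each stage body is a placeholder where the real pipeline would call PyMuPDF/YOLOv8, TrOCR, or a translation model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Page:
    """One scanned page moving through the pipeline (hypothetical structure)."""
    number: int
    regions: List[str] = field(default_factory=list)   # text regions found by layout detection
    armenian: List[str] = field(default_factory=list)  # OCR output, one string per region
    english: List[str] = field(default_factory=list)   # translation, one string per region


def detect_layout(page: Page) -> Page:
    # Placeholder: the real stage would render the PDF page and detect text regions.
    page.regions = [f"page{page.number}-region{i}" for i in range(2)]
    return page


def run_ocr(page: Page) -> Page:
    # Placeholder: the real stage would run an OCR model on each region image.
    page.armenian = [f"<ocr:{region}>" for region in page.regions]
    return page


def translate(page: Page) -> Page:
    # Placeholder: the real stage would send each Armenian string to a translation model.
    page.english = [f"<en:{text}>" for text in page.armenian]
    return page


# Stages run strictly in this order; an orchestrator would schedule them as dependent tasks.
STAGES: List[Callable[[Page], Page]] = [detect_layout, run_ocr, translate]


def run_pipeline(page_numbers: List[int]) -> List[Tuple[int, str, str]]:
    """Run every stage on every page and collect (page, armenian, english) rows."""
    rows: List[Tuple[int, str, str]] = []  # stands in for the database table
    for number in page_numbers:
        page = Page(number=number)
        for stage in STAGES:
            page = stage(page)
        for armenian, english in zip(page.armenian, page.english):
            rows.append((number, armenian, english))
    return rows


if __name__ == "__main__":
    for row in run_pipeline([1, 2]):
        print(row)
```

Keeping each stage as a function that takes and returns a `Page` makes it straightforward to swap a placeholder for a real model call, or to map each stage onto a separate orchestrated task.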
