grabar-ocr

andylitalo/grabar-ocr

Manages a pipeline to digitize "Grabar" (Classical Armenian) texts

Python | Stars: 1 | Forks: 0 | ML/AI

Summary

A comprehensive pipeline for digitizing Classical Armenian (Grabar) texts from scanned PDFs into a searchable, translated PostgreSQL database. The project integrates PDF layout detection (PyMuPDF, YOLOv8), fine-tuned optical character recognition (TrOCR), AI translation (Claude/GPT-4o), and database storage, all orchestrated by Apache Airflow on a k3s Kubernetes cluster with GPU acceleration.
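The stage ordering described in the summary can be pictured with a minimal sketch in plain Python. Everything here is an assumption made for illustration: `PageRecord`, the four stage functions, and `run_pipeline` are hypothetical names, not taken from the repository, and each stage is a stub that stands in for the real tool (PyMuPDF/YOLOv8, TrOCR, Claude/GPT-4o, PostgreSQL) without calling it. In the actual project these steps are orchestrated by Apache Airflow rather than by a simple loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PageRecord:
    """One scanned page moving through the pipeline (hypothetical schema)."""
    pdf_path: str
    page_number: int
    regions: List[str] = field(default_factory=list)  # layout regions found on the page
    grabar_text: str = ""                              # OCR output
    translation: str = ""                              # translated text
    stored: bool = False                               # set once the record is "saved"
    history: List[str] = field(default_factory=list)  # names of the stages already run


def detect_layout(rec: PageRecord) -> PageRecord:
    # Stub for layout detection (PyMuPDF / YOLOv8 in the project).
    rec.regions = ["region-0"]
    return rec


def recognize_text(rec: PageRecord) -> PageRecord:
    # Stub for OCR (fine-tuned TrOCR in the project): one token per region.
    rec.grabar_text = " ".join(f"<ocr:{r}>" for r in rec.regions)
    return rec


def translate(rec: PageRecord) -> PageRecord:
    # Stub for AI translation (Claude / GPT-4o in the project).
    rec.translation = f"<translation of {rec.grabar_text}>"
    return rec


def store(rec: PageRecord) -> PageRecord:
    # Stub for database storage (PostgreSQL in the project).
    rec.stored = True
    return rec


# Stages in the order the summary lists them.
STAGES: List[Callable[[PageRecord], PageRecord]] = [
    detect_layout,
    recognize_text,
    translate,
    store,
]


def run_pipeline(rec: PageRecord) -> PageRecord:
    """Apply every stage in order, recording each stage name as it completes."""
    for stage in STAGES:
        rec = stage(rec)
        rec.history.append(stage.__name__)
    return rec


if __name__ == "__main__":
    result = run_pipeline(PageRecord(pdf_path="scan.pdf", page_number=1))
    print(result.history)
```

Keeping each stage as a function that takes and returns one record makes the stages easy to map onto separate orchestrator tasks later, since each one depends only on the output of the stage before it.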
