Docling
Open-source document parsing and conversion library for extracting structured data from PDFs and other documents

Docling is an MIT-licensed Python library developed by IBM Research for parsing and converting documents (PDF, DOCX, HTML, images) into structured formats such as Markdown, JSON, and plain text. It uses AI models to analyze document layout, extract tables, preserve formatting, and handle complex document structures. It is well suited to building RAG (Retrieval-Augmented Generation) systems, document processing pipelines, and knowledge extraction workflows that need to turn unstructured documents into machine-readable formats.
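As a rough illustration of the RAG ingestion workflow described above, the sketch below converts a document to Markdown with Docling and then splits the result into heading-based chunks. The `DocumentConverter` / `export_to_markdown()` calls follow Docling's documented quick-start usage; the `convert_to_markdown` and `chunk_by_heading` helpers and the `report.pdf` file name are hypothetical names introduced here for illustration, and the heading-based split is just one simple chunking strategy, not something Docling prescribes.

```python
import re


def convert_to_markdown(source: str) -> str:
    """Convert a local file path or URL to Markdown using Docling."""
    # Imported lazily so the chunking helper below can be used
    # (and tested) without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


def chunk_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading line.

    Hypothetical helper: a naive strategy that does not special-case
    '#' lines inside fenced code blocks.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A line such as "# Title" or "## Section" closes the previous chunk.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only whitespace.
    return [c for c in chunks if c]


# Example usage (requires `pip install docling`; "report.pdf" is a
# hypothetical input file):
# chunks = chunk_by_heading(convert_to_markdown("report.pdf"))
```

Each chunk can then be embedded and indexed by whatever retrieval stack the pipeline uses; keeping a heading at the top of every chunk preserves some of the layout context that the conversion step recovered.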
✅ Document ingestion for RAG and LLM applications
✅ PDF data extraction and table parsing
✅ Converting legacy documents to structured formats
✅ Building searchable document repositories
✅ Automated document processing pipelines
✅ Knowledge base creation from document collections
✅ Invoice and form data extraction
🧩 Apache Tika
📂 Unstructured.io
📘 PyMuPDF
🧾 pdfplumber
📊 Camelot
📋 Tabula