Parse¶
Universal data parser supporting documents, web content, structured data, emails, code, and media.
🎯 Overview¶
The Parse Module extracts structured content from raw files and data sources. It converts various file formats (PDF, DOCX, HTML, JSON, etc.) into usable text and structured data for semantic processing.
What is Document Parsing?¶
Document parsing is the process of extracting text, structure, and metadata from files. The parse module handles:

- Documents: PDF, DOCX, PPTX, Excel, TXT, RTF
- Web Content: HTML, XML, JavaScript-rendered pages
- Structured Data: JSON, CSV, XML, YAML
- Email: MIME messages with attachments
- Code: Source code parsing into ASTs
- Media: OCR for images, metadata for audio/video
Why Use the Parse Module?¶
- Universal Format Support: Handle dozens of file formats
- Structure Preservation: Maintain document structure (tables, headings, etc.)
- OCR Support: Extract text from scanned documents and images
- Metadata Extraction: Extract document metadata (author, date, etc.)
- Code Analysis: Parse source code for structure and dependencies
- Error Handling: Robust handling of corrupted or malformed files
How It Works¶
1. Format Detection: Automatically detect the file format
2. Parser Selection: Choose the appropriate parser for the format
3. Content Extraction: Extract text, structure, and metadata
4. Normalization: Normalize the extracted content
5. Output: Return structured data ready for semantic processing
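The detect-then-dispatch flow above can be sketched as follows. The extension map and parser names are illustrative stand-ins, not the module's real internals:

```python
from pathlib import Path

# Hypothetical extension -> parser mapping (steps 1 and 2 above).
PARSERS = {
    ".pdf": "DocumentParser",
    ".html": "WebParser",
    ".json": "StructuredDataParser",
    ".py": "CodeParser",
    ".eml": "EmailParser",
    ".jpg": "MediaParser",
}

def select_parser(path: str) -> str:
    """Detect the format from the file extension and pick a parser name."""
    suffix = Path(path).suffix.lower()
    return PARSERS.get(suffix, "DocumentParser")  # sensible default fallback

print(select_parser("report.pdf"))   # DocumentParser
print(select_parser("feed.json"))    # StructuredDataParser
```

In practice the detection step may also inspect file contents (magic bytes, MIME type) rather than trusting the extension alone.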
- Document Parsing: Extract text, tables, and metadata from PDF, DOCX, PPTX, Excel, and TXT
- Web Content: Parse HTML, XML, and JavaScript-rendered pages with Selenium/Playwright
- Structured Data: Handle JSON, CSV, XML, and YAML with nested structure preservation
- Email Parsing: Extract headers, bodies, attachments, and thread structure from MIME messages
- Code Analysis: Parse source code (Python, JS, etc.) into ASTs, extracting functions and dependencies
- Media Processing: OCR for images and metadata extraction for audio/video files
When to Use¶
- After Ingestion: The first step after loading raw files to convert them into usable text/data
- Data Extraction: Pulling specific fields from structured files (JSON/CSV)
- Content Analysis: Analyzing codebases or email archives
- OCR: Extracting text from scanned documents or images
⚙️ Algorithms Used¶
Document Parsing¶
- PDF: pdfplumber for precise layout preservation, table extraction, and image handling, with a fallback to PyPDF2.
- Office (DOCX/PPTX/XLSX): XML-based parsing of OpenXML formats to extract text, styles, and properties.
- OCR: Tesseract-based optical character recognition for image-based PDFs and image files.
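The pdfplumber-to-PyPDF2 fallback is an instance of a try-in-order pattern, sketched here with hypothetical stand-in extractors (the real code would call the two libraries):

```python
def try_in_order(extractors, path):
    """Run extractors in priority order and return the first success --
    the fallback strategy described above for pdfplumber -> PyPDF2."""
    last_err = None
    for name, fn in extractors:
        try:
            return name, fn(path)
        except Exception as err:
            last_err = err  # remember the failure and move on
    raise RuntimeError(f"all extractors failed: {last_err}")

# Hypothetical stand-ins; in practice these would open the PDF with
# pdfplumber and PyPDF2 respectively.
def plumber(path):
    raise ValueError("encrypted page")   # simulate a pdfplumber failure

def pypdf2(path):
    return "recovered text"

print(try_in_order([("pdfplumber", plumber), ("PyPDF2", pypdf2)], "report.pdf"))
# ('PyPDF2', 'recovered text')
```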
Web Parsing¶
- DOM Traversal: BeautifulSoup for static HTML parsing and element extraction.
- Headless Browser: Selenium/Playwright for rendering dynamic JavaScript content before extraction.
- Content Cleaning: Heuristic removal of boilerplates (navbars, footers, ads).
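The content-cleaning heuristic can be sketched with only the standard library (the module itself uses BeautifulSoup; the tag skip-list here is illustrative):

```python
from html.parser import HTMLParser

class BoilerplateStripper(HTMLParser):
    """Drop text inside nav/footer/script-style tags, keeping main content --
    a minimal version of the boilerplate-removal step described above."""
    SKIP = {"nav", "footer", "script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside skipped tags
        self.chunks = []   # collected visible text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = "<nav>Home | About</nav><p>Main article text.</p><footer>© 2024</footer>"
stripper = BoilerplateStripper()
stripper.feed(html)
print(" ".join(stripper.chunks))  # Main article text.
```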
Code Parsing¶
- AST Traversal: Abstract Syntax Tree parsing to identify classes, functions, and imports.
- Dependency Graphing: Static analysis of import statements to build dependency networks.
- Comment Extraction: Regex and parser-based extraction of docstrings and inline comments.
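For Python sources, the AST traversal described above can be sketched with the standard library's `ast` module (illustrative, not the module's own implementation):

```python
import ast

source = '''
import os
from json import loads

def load(path):
    """Read a JSON file."""
    return loads(open(path).read())
'''

tree = ast.parse(source)

# Identify function definitions (classes would use ast.ClassDef).
functions = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]

# Collect imports for dependency graphing.
imports = []
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        imports.extend(alias.name for alias in node.names)
    elif isinstance(node, ast.ImportFrom):
        imports.append(node.module)

print(functions)  # ['load']
print(imports)    # ['os', 'json']
```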
Main Classes¶
DocumentParser¶
Unified interface for document formats.
Methods:
| Method | Description |
|---|---|
| `parse_document(path)` | Auto-detect format and parse |
| `extract_text(path)` | Extract text from PDF/DOCX/HTML/TXT |
| `extract_metadata(path)` | Extract document metadata |
| `parse_batch(paths)` | Parse multiple documents |
Example:

```python
from semantica.parse import DocumentParser

parser = DocumentParser()
doc = parser.parse_document("report.pdf")
print(doc.get("metadata", {}).get("title"))
print(doc.get("full_text", "")[:100])
```
WebParser¶
Parses web content.
Methods:
| Method | Description |
|---|---|
| `parse_web_content(content, content_type)` | Parse HTML/XML |
| `extract_text(content)` | Clean text from HTML |
| `extract_links(content)` | Extract hyperlinks |
| `render_javascript(url)` | Render JS for dynamic pages |
StructuredDataParser¶
Parses data files.
Methods:
| Method | Description |
|---|---|
| `parse_data(path, data_format)` | Parse JSON/CSV/XML/YAML |
Example:

```python
from semantica.parse import StructuredDataParser

parser = StructuredDataParser()
data = parser.parse_data("data.json", data_format="json")
print(type(data.get("data")).__name__)
```
CodeParser¶
Parses source code.
Methods:
| Method | Description |
|---|---|
| `parse_code(path)` | Parse code file; returns structure, comments, dependencies |
Example:

```python
from semantica.parse import CodeParser

parser = CodeParser()
data = parser.parse_code("script.py", language="python")
print(data.get("structure", {}).get("functions", []))
print(data.get("dependencies", {}))
```
EmailParser¶
Parses email messages.
Methods:
| Method | Description |
|---|---|
| `parse_email(path)` | Parse full email (headers/body/attachments) |
| `parse_headers(path)` | Extract headers only |
| `extract_body(path)` | Extract text/HTML body |
| `analyze_thread(path)` | Thread reconstruction |
Example:

```python
from semantica.parse import EmailParser

parser = EmailParser()
email = parser.parse_email("email.eml", extract_attachments=True)
print(email.headers.subject)
print(email.body.text[:120])
```
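The MIME handling behind an email parser like this can be sketched with the standard library's `email` package (illustrative, not the module's actual code; the sample message is made up):

```python
from email import message_from_string
from email.policy import default

# A minimal raw RFC 5322 message: headers, blank line, body.
raw = """\
Subject: Quarterly report
From: alice@example.com
To: bob@example.com

Numbers attached in the next revision.
"""

# policy=default yields a modern EmailMessage with get_content().
msg = message_from_string(raw, policy=default)
print(msg["Subject"])             # Quarterly report
print(msg.get_content().strip())  # Numbers attached in the next revision.
```

Multipart messages with attachments would additionally walk `msg.iter_parts()` and check each part's content type.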
MediaParser¶
Parses media files.
Methods:
| Method | Description |
|---|---|
| `parse_media(path, media_type)` | Parse image/audio/video |
Example:

```python
from semantica.parse import MediaParser

parser = MediaParser()
image = parser.parse_media("image.jpg", media_type="image")
print(image.get("metadata", {}))
```
Format-Specific Parsers¶
`PDFParser`, `DOCXParser`, `PPTXParser`, `ExcelParser`, `HTMLParser`, `XMLParser`, `JSONParser`, `CSVParser`, `ImageParser`
Examples:

```python
from semantica.parse import DocumentParser, WebParser, StructuredDataParser

# Document
doc = DocumentParser().parse_document("document.pdf")
print(doc.get("full_text", "")[:120])

# Web
web = WebParser().parse_web_content("https://example.com", content_type="html")
print(web.get("text", "")[:120])

# Structured Data (JSON)
data = StructuredDataParser().parse_data("data.json", data_format="json")
print(list(data.get("data", {}).keys()))
```
Usage Examples¶
WebParser¶
```python
from semantica.parse import WebParser

parser = WebParser()
html = parser.parse_web_content("https://example.com", content_type="html")
links = parser.extract_links("https://example.com")
```
StructuredDataParser¶
```python
from semantica.parse import StructuredDataParser

parser = StructuredDataParser()
# Avoid names like `json` or `csv` that shadow standard-library modules.
json_data = parser.parse_data("data.json", data_format="json")
csv_data = parser.parse_data("data.csv", data_format="csv")
xml_data = parser.parse_data("data.xml", data_format="xml")
```
Configuration¶
Environment Variables¶
YAML Configuration¶
```yaml
parse:
  ocr:
    enabled: true
    language: eng
  web:
    user_agent: "MyBot/1.0"
    timeout: 30
  pdf:
    extract_tables: true
    extract_images: false
```
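A hedged sketch of consuming such a config with PyYAML (a separate install; the key names follow the sample above, and how the module itself loads config may differ):

```python
import yaml  # PyYAML, assumed installed

raw = """
parse:
  ocr:
    enabled: true
    language: eng
  pdf:
    extract_tables: true
"""

# safe_load parses YAML into plain dicts/lists without executing anything.
cfg = yaml.safe_load(raw)
ocr_cfg = cfg["parse"]["ocr"]
print(ocr_cfg["enabled"], ocr_cfg["language"])  # True eng
```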
Integration¶
Use the parser classes directly in pipelines and services; prefer them over module-level convenience functions for clearer typing and more consistent behavior.
Best Practices¶
- Disable OCR if not needed; enable it only for scanned documents.
- Use specific parser classes such as `JSONParser` or `PDFParser` when the format is known.
- Handle encodings explicitly for CSV/TXT files where auto-detection may fail.
- Clean web content with `WebParser` utilities rather than parsing raw HTML yourself.
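As an example of handling encodings explicitly for CSV (one of the best practices above), using only the standard library:

```python
import csv
import os
import tempfile

# Write a Latin-1 encoded file, then read it back with the encoding stated
# explicitly -- automatic detection can mis-guess for such files.
path = os.path.join(tempfile.gettempdir(), "clients.csv")
with open(path, "w", encoding="latin-1", newline="") as f:
    csv.writer(f).writerows([["name", "city"], ["José", "São Paulo"]])

with open(path, encoding="latin-1", newline="") as f:
    rows = list(csv.reader(f))

print(rows[1])  # ['José', 'São Paulo']
```

Reading the same file with the wrong encoding (e.g. UTF-8) would raise a `UnicodeDecodeError` or silently mangle the accented characters.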
Troubleshooting¶
Issue: `TesseractNotFoundError`. Solution: Install Tesseract OCR on your system (`apt-get install tesseract-ocr` on Debian/Ubuntu, `brew install tesseract` on macOS).
Issue: PDF tables are messy. Solution: Adjust the pdfplumber settings in the config, or use specialized table-extraction tools if the layout is complex.
See Also¶
- Ingest Module - Handles file downloading/loading
- Split Module - Chunks the parsed text
- Semantic Extract Module - Extracts entities from text
Cookbook¶
Interactive tutorials to learn document parsing:
- Document Parsing: Extract clean text from complex formats
    - Topics: OCR, PDF parsing, text extraction, format detection
    - Difficulty: Beginner
    - Use Cases: Processing documents, extracting text from various formats