Ingest¶
Universal data ingestion from files, web, feeds, streams, repos, emails, and databases.
🎯 Overview¶
The Ingest Module is the entry point for loading data into Semantica. It provides universal data ingestion from files, web content, feeds, streams, repositories, emails, databases, and more.
What is Data Ingestion?¶
Data ingestion is the process of loading data from various sources into Semantica for processing. The ingest module handles:
- File Systems: Local files, cloud storage (S3, GCS, Azure)
- Web Content: Websites, RSS feeds, APIs
- Streams: Real-time data from Kafka, RabbitMQ, etc.
- Databases: SQL and NoSQL databases
- Repositories: Git repositories (GitHub, GitLab)
- Email: IMAP, POP3 servers
- MCP: Model Context Protocol servers
Why Use the Ingest Module?¶
- Universal Support: Handle multiple data formats and sources
- Automatic Detection: Automatically detect file types and content
- Streaming Support: Process real-time data streams
- Cloud Integration: Direct support for cloud storage
- Rate Limiting: Built-in rate limiting for web crawling
- Error Handling: Robust error handling and retry logic
How It Works¶
1. Source Detection: Automatically detect the type of data source
2. Connection: Establish a connection to the source (file system, web, database, etc.)
3. Loading: Load data from the source
4. Format Detection: Detect the format of the loaded data
5. Output: Return data in a standardized format for processing (a minimal sketch follows this list)
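As a minimal sketch of this flow, the top-level ingest helper (shown under Convenience Functions below) can be pointed at a source and its standardized output iterated. Treating the result as document objects with content and metadata attributes is an assumption for illustration, not a confirmed return type:

```python
from semantica.ingest import ingest

# Source type is inferred from the input itself (file path, URL, stream URI, ...).
documents = ingest("./data/report.pdf")

# ASSUMPTION: the standardized output is iterable and exposes `content`/`metadata`;
# adapt to the actual objects your Semantica version returns.
for doc in documents:
    print(doc.metadata)
    print(doc.content[:200])
```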
- File Ingestion: Local and Cloud (S3/GCS/Azure) file processing with type detection
- Web Crawling: Scrape websites with rate limiting, robots.txt compliance, and JS rendering
- Feed Monitoring: Consume RSS/Atom feeds with real-time update monitoring
- Stream Processing: Ingest from Kafka, RabbitMQ, Kinesis, and Pulsar
- Code Repos: Clone and analyze Git repositories (GitHub/GitLab)
- Databases: Ingest tables and query results from SQL databases
When to Use¶
- Data Onboarding: The entry point for all external data into Semantica
- Crawling: Building a dataset from the web
- Monitoring: Listening to real-time data streams
- Migration: Importing legacy data from databases
⚙️ Algorithms Used¶
File Processing¶
- Magic Number Detection: Identifying file types by binary signatures, not just extensions.
- Recursive Traversal: Efficiently walking directory trees.
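Magic-number detection can be illustrated with plain standard-library Python; the signature table below is an illustrative subset, not Semantica's internal list:

```python
# Identify a file type from its leading bytes ("magic number") instead of its extension.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip-based (docx, xlsx, zip)",
    b"\x1f\x8b": "gzip",
}

def detect_type(path: str) -> str:
    with open(path, "rb") as fh:
        header = fh.read(16)
    for signature, kind in MAGIC_SIGNATURES.items():
        if header.startswith(signature):
            return kind
    return "unknown"

# A renamed PDF (e.g. saved as report.txt) is still detected as "pdf".
print(detect_type("report.pdf"))
```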
Web Ingestion¶
- Sitemap Parsing: Discovering URLs via XML sitemaps.
- DOM Extraction: Removing boilerplate (nav/ads) to extract main content.
- Politeness: Respecting robots.txt and enforcing crawl delays.
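The politeness rules boil down to checking robots.txt and pausing between requests. A self-contained sketch of the underlying idea using the standard library's urllib.robotparser (not Semantica's crawler; the URLs are illustrative):

```python
import time
import urllib.robotparser

USER_AGENT = "SemanticaBot/1.0"

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# Honor the site's declared crawl delay, falling back to a polite 1-second default.
delay = robots.crawl_delay(USER_AGENT) or 1.0

for url in ["https://example.com/", "https://example.com/admin/"]:
    if robots.can_fetch(USER_AGENT, url):
        print("fetching", url)   # fetching and content extraction would happen here
        time.sleep(delay)
    else:
        print("skipping disallowed URL:", url)
```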
Stream Processing¶
- Consumer Groups: Managing offsets and scaling consumption (Kafka).
- Backpressure: Handling high-velocity streams without crashing.
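For context, consumer groups and manual offset commits look roughly like this with the third-party kafka-python client; this illustrates the concepts, not Semantica's StreamIngestor internals, and topic/group names are illustrative:

```python
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="semantica-ingest",   # members of one group share partitions and offsets
    enable_auto_commit=False,      # commit only after successful processing
    auto_offset_reset="earliest",
    max_poll_records=100,          # bound batch size as a simple backpressure control
)

def process(payload: bytes) -> None:
    # Placeholder handler; replace with real parsing/indexing logic.
    print(f"received {len(payload)} bytes")

for message in consumer:
    process(message.value)
    consumer.commit()              # advance the group's offset after the message is handled
```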
Repository Analysis¶
- AST Parsing: Extracting code structure (classes/functions) without executing.
- Dependency Parsing: Reading requirements.txt, package.json, etc.
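The AST approach can be shown with Python's built-in ast module: code structure is read without importing or executing anything.

```python
import ast

source = """
class Loader:
    def load(self, path):
        return open(path).read()

def helper():
    pass
"""

# Walk the syntax tree and collect classes and functions without running the code.
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        print("class:", node.name)
    elif isinstance(node, ast.FunctionDef):
        print("function:", node.name)
```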
Main Classes¶
FileIngestor¶
Handles file systems and object storage.
Methods:
| Method | Description |
|---|---|
| ingest_file(path) | Process single file |
| ingest_directory(path) | Process folder |
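A minimal usage sketch based on the methods above; constructor options and return shapes are assumptions, so consult the API reference:

```python
from semantica.ingest import FileIngestor

ingestor = FileIngestor()
doc = ingestor.ingest_file("reports/q3.pdf")      # single file, type detected automatically
docs = ingestor.ingest_directory("./reports")     # recursive directory traversal
```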
WebIngestor¶
Handles web content.
Methods:
| Method | Description |
|---|---|
| ingest_url(url) | Scrape page |
| crawl(url, depth) | Recursive crawl |
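A hedged usage sketch; passing depth as a keyword argument and the example URLs are assumptions for illustration:

```python
from semantica.ingest import WebIngestor

ingestor = WebIngestor()
page = ingestor.ingest_url("https://example.com/article")   # scrape a single page
pages = ingestor.crawl("https://example.com", depth=2)       # follow links two levels deep
```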
StreamIngestor¶
Handles real-time data.
Methods:
| Method | Description |
|---|---|
| consume(topic) | Start consumption |
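A minimal sketch; broker settings are assumed to come from configuration or environment variables such as KAFKA_BOOTSTRAP_SERVERS (see Configuration below), and how messages are delivered back to you depends on the actual API:

```python
from semantica.ingest import StreamIngestor

ingestor = StreamIngestor()   # broker details from config/env, e.g. KAFKA_BOOTSTRAP_SERVERS
ingestor.consume("events")    # start consuming the "events" topic
```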
RepoIngestor¶
Handles code repositories.
Methods:
| Method | Description |
|---|---|
| ingest_repo(url) | Clone and process |
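A minimal, hedged usage sketch (the repository URL is illustrative):

```python
from semantica.ingest import RepoIngestor

ingestor = RepoIngestor()
repo = ingestor.ingest_repo("https://github.com/org/project")  # clone, then analyze code structure
```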
FeedIngestor¶
Handles RSS and Atom feeds.
Methods:
| Method | Description |
|---|---|
| ingest_feed(url) | Parse feed items |
| monitor(url) | Watch for updates |
EmailIngestor¶
Handles IMAP and POP3 servers.
Methods:
| Method | Description |
|---|---|
| connect_imap(host) | Connect to server |
| ingest_mailbox(name) | Fetch emails |
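A minimal, hedged sketch; how credentials are supplied (arguments vs. environment variables) depends on the actual API, and the host name is illustrative:

```python
from semantica.ingest import EmailIngestor

ingestor = EmailIngestor()
ingestor.connect_imap("imap.example.com")     # credentials should come from env vars, not code
messages = ingestor.ingest_mailbox("INBOX")   # fetch emails from the named mailbox
```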
DBIngestor¶
Handles SQL databases.
Methods:
| Method | Description |
|---|---|
| ingest_database(conn) | Export tables |
| execute_query(sql) | Run custom SQL |
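A minimal, hedged sketch; the connection-string format and the DATABASE_URL variable name are assumptions for illustration:

```python
import os
from semantica.ingest import DBIngestor

ingestor = DBIngestor()
conn = os.environ["DATABASE_URL"]            # keep credentials out of source code
tables = ingestor.ingest_database(conn)      # export tables
rows = ingestor.execute_query("SELECT id, title FROM articles")  # run custom SQL
```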
MCPIngestor¶
Handles Model Context Protocol servers.
Methods:
| Method | Description |
|---|---|
| connect(url) | Connect to server |
| ingest_resources() | Fetch resources |
| call_tool(name) | Execute tool |
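A minimal, hedged sketch (the server URL and tool name are illustrative):

```python
from semantica.ingest import MCPIngestor

ingestor = MCPIngestor()
ingestor.connect("http://localhost:8080")   # connect to an MCP server
resources = ingestor.ingest_resources()     # fetch the resources it exposes
result = ingestor.call_tool("search")       # execute a tool offered by the server
```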
Convenience Functions¶
from semantica.ingest import ingest
# Source type can be passed explicitly or auto-detected
ingest("doc.pdf", source_type="file")
ingest("https://google.com", source_type="web")
ingest("kafka://topic", source_type="stream")
Configuration¶
Environment Variables¶
export INGEST_USER_AGENT="SemanticaBot/1.0"
export AWS_ACCESS_KEY_ID=...
export KAFKA_BOOTSTRAP_SERVERS=localhost:9092
YAML Configuration¶
ingest:
  web:
    user_agent: "MyBot"
    rate_limit: 1.0  # seconds
  files:
    max_size: 100MB
    allowed_extensions: [.pdf, .txt, .md]
Integration Examples¶
Continuous Feed Ingestion¶
from semantica.ingest import FeedIngestor
from semantica.pipeline import Pipeline
# 1. Setup Monitor
ingestor = FeedIngestor()
pipeline = Pipeline()
def on_new_item(item):
    # 2. Trigger Pipeline on new content
    pipeline.run(item.content)
# 3. Start Monitoring
ingestor.monitor(
    "https://news.ycombinator.com/rss",
    callback=on_new_item,
    interval=300
)
Best Practices¶
- Respect Rate Limits: When crawling, always set a polite rate_limit to avoid getting banned.
- Filter Content: Use allowed_extensions or URL patterns to avoid ingesting junk data.
- Handle Failures: Stream ingestors should have error handlers for bad messages.
- Use Credentials Securely: Never hardcode API keys; use environment variables (a short sketch follows).
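For the last point, reading secrets from the environment is straightforward with the standard library; AWS_ACCESS_KEY_ID appears in the configuration above, while IMAP_PASSWORD is a hypothetical name used here for illustration:

```python
import os

# Fail fast if a required credential is missing, instead of embedding it in code.
aws_key = os.environ["AWS_ACCESS_KEY_ID"]
imap_password = os.environ.get("IMAP_PASSWORD")   # hypothetical variable name
if imap_password is None:
    raise RuntimeError("Set IMAP_PASSWORD before running the email ingestor")
```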
See Also¶
- Parse Module - Processes the raw data ingested here
- Split Module - Chunks the ingested content
- Utils Module - Validation helpers
Cookbook¶
Interactive tutorials to learn data ingestion:
- Data Ingestion: Comprehensive guide to data ingestion from multiple sources
  - Topics: File ingestion, web scraping, database integration, streams, feeds, repositories, email, MCP
  - Difficulty: Beginner
  - Use Cases: Loading data from various sources, understanding ingestion capabilities
- Multi-Source Data Integration: Merge data from disparate sources into a unified graph
  - Topics: Entity resolution, merging, fusion, multi-source integration
  - Difficulty: Advanced
  - Use Cases: Combining data from multiple sources, data fusion