Ingest

Universal data ingestion from files, web, feeds, streams, repos, emails, and databases.


🎯 Overview

The Ingest Module is the entry point for loading data into Semantica. It provides universal data ingestion from files, web content, feeds, streams, repositories, emails, databases, and more.

What is Data Ingestion?

Data ingestion is the process of loading data from various sources into Semantica for processing. The ingest module handles:

  • File Systems: Local files, cloud storage (S3, GCS, Azure)
  • Web Content: Websites, RSS feeds, APIs
  • Streams: Real-time data from Kafka, RabbitMQ, etc.
  • Databases: SQL and NoSQL databases
  • Repositories: Git repositories (GitHub, GitLab)
  • Email: IMAP, POP3 servers
  • MCP: Model Context Protocol servers

Why Use the Ingest Module?

  • Universal Support: Handle multiple data formats and sources
  • Automatic Detection: Automatically detect file types and content
  • Streaming Support: Process real-time data streams
  • Cloud Integration: Direct support for cloud storage
  • Rate Limiting: Built-in rate limiting for web crawling
  • Error Handling: Robust error handling and retry logic

How It Works

  1. Source Detection: Automatically detect the type of data source
  2. Connection: Establish connection to the source (file system, web, database, etc.)
  3. Loading: Load data from the source
  4. Format Detection: Detect the format of the loaded data
  5. Output: Return data in a standardized format for processing
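A minimal sketch of this flow using the top-level ingest helper described later on this page. The assumption here is that the call returns a list of standardized documents exposing a content attribute; the exact return shape may differ:

from semantica.ingest import ingest

# Source type is detected automatically from the input
docs = ingest("./reports/q3.pdf")

# Assumed: results are standardized documents with a `content` attribute
for doc in docs:
    print(doc.content[:200])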
  • File Ingestion: Local and cloud (S3/GCS/Azure) file processing with type detection
  • Web Crawling: Scrape websites with rate limiting, robots.txt compliance, and JS rendering
  • Feed Monitoring: Consume RSS/Atom feeds with real-time update monitoring
  • Stream Processing: Ingest from Kafka, RabbitMQ, Kinesis, and Pulsar
  • Code Repos: Clone and analyze Git repositories (GitHub/GitLab)
  • Databases: Ingest tables and query results from SQL databases

When to Use

  • Data Onboarding: The entry point for all external data into Semantica
  • Crawling: Building a dataset from the web
  • Monitoring: Listening to real-time data streams
  • Migration: Importing legacy data from databases

⚙️ Algorithms Used

File Processing

  • Magic Number Detection: Identifying file types by binary signatures, not just extensions.
  • Recursive Traversal: Efficiently walking directory trees.
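The sketch below illustrates both ideas with the standard library only: walk a directory tree and classify each file by its leading bytes rather than its extension. The signature table is a tiny illustrative subset, not what the module actually ships.

import os

# Illustrative subset of binary signatures; real detection tables are far larger
MAGIC_NUMBERS = {
    b"%PDF": "application/pdf",
    b"PK\x03\x04": "application/zip",
    b"\x89PNG": "image/png",
}

def detect_type(path):
    """Guess a file's type from its first bytes instead of its extension."""
    with open(path, "rb") as f:
        header = f.read(8)
    for signature, mime in MAGIC_NUMBERS.items():
        if header.startswith(signature):
            return mime
    return "application/octet-stream"

def walk_and_detect(root):
    """Recursively traverse a directory tree, yielding (path, detected type)."""
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield path, detect_type(path)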

Web Ingestion

  • Sitemap Parsing: Discovering URLs via XML sitemaps.
  • DOM Extraction: Removing boilerplate (nav/ads) to extract main content.
  • Politeness: Respecting robots.txt and enforcing crawl delays.
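As an illustration of the politeness step only (not the module's implementation), this standard-library sketch checks robots.txt and honors its crawl delay before fetching:

import time
import urllib.robotparser

USER_AGENT = "SemanticaBot/1.0"

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

def polite_fetch(url, fetch, default_delay=1.0):
    """Fetch a URL only if robots.txt allows it, waiting out any crawl delay."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    delay = robots.crawl_delay(USER_AGENT) or default_delay
    time.sleep(delay)
    return fetch(url)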

Stream Processing

  • Consumer Groups: Managing offsets and scaling consumption (Kafka).
  • Backpressure: Handling high-velocity streams without crashing.
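Backpressure is easiest to see with a bounded queue: when consumers fall behind, producers block instead of exhausting memory. This is a generic illustration, not the module's Kafka implementation:

import queue

# A bounded buffer between the stream and the processing code
messages = queue.Queue(maxsize=1000)

def producer(stream):
    for msg in stream:
        messages.put(msg)      # blocks when the buffer is full (backpressure)

def consumer(process):
    while True:
        msg = messages.get()
        try:
            process(msg)
        finally:
            messages.task_done()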

Repository Analysis

  • AST Parsing: Extracting code structure (classes/functions) without executing.
  • Dependency Parsing: Reading requirements.txt, package.json, etc.
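A minimal sketch of AST-based extraction using Python's built-in ast module; it lists classes and functions without importing or running the code:

import ast

def outline(source):
    """List class and function definitions found in a source string."""
    tree = ast.parse(source)
    return [
        (type(node).__name__, node.name)
        for node in ast.walk(tree)
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
    ]

print(outline("class Loader:\n    def load(self): ...\n"))
# [('ClassDef', 'Loader'), ('FunctionDef', 'load')]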

Main Classes

FileIngestor

Handles file systems and object storage.

Methods:

Method                   Description
ingest_file(path)        Process single file
ingest_directory(path)   Process folder
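A brief usage sketch; the import path and the assumption that plain path strings are accepted and parsed documents are returned go beyond the table above:

from semantica.ingest import FileIngestor  # assumed import path

ingestor = FileIngestor()

# Process one file, then a whole folder
doc = ingestor.ingest_file("doc.pdf")
docs = ingestor.ingest_directory("./reports")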

WebIngestor

Handles web content.

Methods:

Method              Description
ingest_url(url)     Scrape page
crawl(url, depth)   Recursive crawl
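A brief usage sketch (the import path and the keyword form of depth are assumptions):

from semantica.ingest import WebIngestor  # assumed import path

ingestor = WebIngestor()

# Scrape a single page
page = ingestor.ingest_url("https://example.com")

# Recursively crawl links up to two levels deep
pages = ingestor.crawl("https://example.com", depth=2)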

StreamIngestor

Handles real-time data.

Methods:

Method           Description
consume(topic)   Start consumption
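A brief usage sketch (assumed import path; how messages are delivered back, e.g. via a callback or an iterator, is not specified by the table above):

from semantica.ingest import StreamIngestor  # assumed import path

ingestor = StreamIngestor()

# Start consuming from a topic; broker details would come from configuration
# such as KAFKA_BOOTSTRAP_SERVERS (see Configuration below)
ingestor.consume("events")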

RepoIngestor

Handles code repositories.

Methods:

Method             Description
ingest_repo(url)   Clone and process
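A brief usage sketch (assumed import path, placeholder repository URL, assumed return value):

from semantica.ingest import RepoIngestor  # assumed import path

ingestor = RepoIngestor()

# Clone the repository and process its contents
repo_docs = ingestor.ingest_repo("https://github.com/org/project")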

FeedIngestor

Handles RSS and Atom feeds.

Methods:

Method             Description
ingest_feed(url)   Parse feed items
monitor(url)       Watch for updates
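A brief usage sketch for one-off parsing; continuous monitoring with monitor() is shown under Integration Examples below. The shape of the returned items is an assumption:

from semantica.ingest import FeedIngestor

ingestor = FeedIngestor()

# Parse the feed's current items once
items = ingestor.ingest_feed("https://news.ycombinator.com/rss")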

EmailIngestor

Handles IMAP and POP3 servers.

Methods:

Method                 Description
connect_imap(host)     Connect to server
ingest_mailbox(name)   Fetch emails
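A brief usage sketch (assumed import path; how credentials are supplied, e.g. extra arguments or environment variables, is not documented here, so none are shown):

from semantica.ingest import EmailIngestor  # assumed import path

ingestor = EmailIngestor()

# Connect to the IMAP server, then fetch messages from a mailbox
ingestor.connect_imap("imap.example.com")
messages = ingestor.ingest_mailbox("INBOX")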

DBIngestor

Handles SQL databases.

Methods:

Method                  Description
ingest_database(conn)   Export tables
execute_query(sql)      Run custom SQL
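A brief usage sketch (assumed import path; whether execute_query reuses the connection established by ingest_database is also an assumption):

from semantica.ingest import DBIngestor  # assumed import path

ingestor = DBIngestor()

# Export all tables reachable through a connection string
# (placeholder shown; keep real credentials in environment variables)
tables = ingestor.ingest_database("postgresql://user:password@localhost/mydb")

# Run a custom query
rows = ingestor.execute_query("SELECT id, title FROM articles")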

MCPIngestor

Handles Model Context Protocol servers.

Methods:

Method               Description
connect(url)         Connect to server
ingest_resources()   Fetch resources
call_tool(name)      Execute tool
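A brief usage sketch (assumed import path, placeholder server URL and tool name):

from semantica.ingest import MCPIngestor  # assumed import path

ingestor = MCPIngestor()

# Connect to an MCP server, fetch its resources, and invoke one of its tools
ingestor.connect("http://localhost:8080")
resources = ingestor.ingest_resources()
result = ingestor.call_tool("search")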

Convenience Functions

from semantica.ingest import ingest

# Ingest from different source types (the source type can also be auto-detected)
ingest("doc.pdf", source_type="file")
ingest("https://google.com", source_type="web")
ingest("kafka://topic", source_type="stream")

Configuration

Environment Variables

export INGEST_USER_AGENT="SemanticaBot/1.0"
export AWS_ACCESS_KEY_ID=...
export KAFKA_BOOTSTRAP_SERVERS=localhost:9092

YAML Configuration

ingest:
  web:
    user_agent: "MyBot"
    rate_limit: 1.0 # seconds

  files:
    max_size: 100MB
    allowed_extensions: [.pdf, .txt, .md]

Integration Examples

Continuous Feed Ingestion

from semantica.ingest import FeedIngestor
from semantica.pipeline import Pipeline

# 1. Setup Monitor
ingestor = FeedIngestor()
pipeline = Pipeline()

def on_new_item(item):
    # 2. Trigger Pipeline on new content
    pipeline.run(item.content)

# 3. Start Monitoring
ingestor.monitor(
    "https://news.ycombinator.com/rss",
    callback=on_new_item,
    interval=300
)

Best Practices

  1. Respect Rate Limits: When crawling, always set a polite rate_limit to avoid getting banned.
  2. Filter Content: Use allowed_extensions or URL patterns to avoid ingesting junk data.
  3. Handle Failures: Stream ingestors should have error handlers for bad messages (see the sketch after this list).
  4. Use Credentials Securely: Never hardcode API keys; use Environment Variables.
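A minimal sketch of point 3, assuming a callback-style consumer (the callback mechanism itself is an assumption):

import logging

def safe_handler(process):
    """Wrap a message handler so one malformed message does not stop consumption."""
    def handler(message):
        try:
            process(message)
        except Exception:
            logging.exception("Skipping malformed message")
    return handler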

See Also

Cookbook

Interactive tutorials to learn data ingestion:

  • Data Ingestion: Comprehensive guide to data ingestion from multiple sources
  • Topics: File ingestion, web scraping, database integration, streams, feeds, repositories, email, MCP
  • Difficulty: Beginner
  • Use Cases: Loading data from various sources, understanding ingestion capabilities

  • Multi-Source Data Integration: Merge data from disparate sources into a unified graph
  • Topics: Entity resolution, merging, fusion, multi-source integration
  • Difficulty: Advanced
  • Use Cases: Combining data from multiple sources, data fusion