Modules & Architecture

Semantica is built with a modular architecture, designed to be flexible, extensible, and scalable. This guide provides a comprehensive overview of all modules, their responsibilities, key features, and components.

About This Guide

This guide covers all 20+ core modules in Semantica, organized by their functional layer. Each module can be used independently or combined into powerful pipelines.


Module Overview

Semantica's modules are organized into six logical layers:

Layer | Modules | Description
Input Layer | Ingest, Parse, Split, Normalize | Data ingestion, parsing, chunking, and cleaning
Core Processing | Semantic Extract, Knowledge Graph, Ontology, Reasoning | Entity extraction, graph construction, inference
Storage | Embeddings, Vector Store, Graph Store, Triplet Store | Vector, graph, and triplet persistence
Quality Assurance | Deduplication, Conflicts | Data quality and consistency
Context & Memory | Context, Seed, LLM Providers | Agent memory, foundation data, and LLM integration
Output & Orchestration | Export, Visualization, Pipeline | Export, visualization, and workflow management

Input Layer

These modules handle data ingestion, parsing, chunking, and preparation.


Ingest Module

Purpose

The entry point for data ingestion. Connects to various data sources including files, web, databases, and MCP servers.

Key Features:

  • Multiple file format support (PDF, DOCX, HTML, JSON, CSV, etc.)
  • Web scraping with JavaScript rendering
  • Database integration (SQL, NoSQL)
  • Real-time streaming support
  • MCP (Model Context Protocol) server integration
  • Batch processing capabilities
  • Metadata extraction and preservation

Components:

  • FileIngestor – Read files (PDF, DOCX, HTML, JSON, CSV, etc.)
  • WebIngestor – Scrape and ingest web pages
  • FeedIngestor – Process RSS/Atom feeds
  • StreamIngestor – Real-time data streaming
  • DBIngestor – Database queries and ingestion
  • EmailIngestor – Process email messages
  • RepoIngestor – Git repository analysis
  • MCPIngestor – Connect to MCP servers for resource and tool-based ingestion

Try It:

  • Data Ingestion Cookbook: Learn to ingest from multiple sources
  • Topics: File, web, feed, stream, database ingestion
  • Difficulty: Beginner
  • Time: 15-20 minutes
  • Use Cases: Loading data from various sources
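
Quick Example (a minimal sketch; the ingest() method name and the document attributes used below are assumptions, see the API reference for the exact interface):

from semantica.ingest import FileIngestor

# Ingest every supported file under a directory (method name assumed)
ingestor = FileIngestor()
documents = ingestor.ingest("data/")

for doc in documents[:3]:
    # metadata/content attribute names assumed for illustration
    print(doc.metadata.get("source"), len(doc.content))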

API Reference: Ingest Module


Parse Module

Purpose

Extracts raw text and metadata from ingested documents. Supports OCR, table extraction, and structured data parsing.

Key Features:

  • Multiple file format support
  • OCR for images and scanned documents
  • Table extraction from PDFs and spreadsheets
  • Metadata preservation
  • Automatic format detection
  • Structured data parsing (JSON, CSV, XML)
  • Code file parsing with syntax awareness

Components:

  • DocumentParser – Main parser orchestrator
  • PDFParser – Extract text, tables, images from PDFs
  • DOCXParser – Parse Word documents
  • HTMLParser – Extract content from HTML
  • JSONParser – Parse structured JSON data
  • ExcelParser – Process spreadsheets
  • ImageParser – OCR and image analysis
  • CodeParser – Parse source code files

Try It:

  • Document Parsing Cookbook: Learn to parse various document formats
  • Topics: PDF, DOCX, HTML, JSON parsing, OCR, table extraction
  • Difficulty: Beginner
  • Time: 15-20 minutes
  • Use Cases: Extracting text from different file formats
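
Quick Example (a minimal sketch; the parse() call and the result attributes are assumptions, see the API reference for the exact interface):

from semantica.ingest import FileIngestor
from semantica.parse import DocumentParser

documents = FileIngestor().ingest("data/")  # ingest() name assumed

# DocumentParser auto-detects the format and delegates to PDFParser, DOCXParser, etc.
parser = DocumentParser()
parsed_docs = parser.parse(documents)  # parse() signature assumed

for doc in parsed_docs[:3]:
    print(doc.metadata.get("format"), len(doc.text))  # attribute names assumed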

API Reference: Parse Module


Split Module

Purpose

Comprehensive document chunking and splitting for optimal processing. Provides 15+ splitting methods including KG-aware chunking.

Key Features:

  • Multiple standard splitting methods (recursive, token, sentence, paragraph)
  • Semantic-based chunking using NLP and embeddings
  • Entity-aware chunking for GraphRAG workflows
  • Relation-aware chunking for KG preservation
  • Graph-based and ontology-aware chunking
  • Hierarchical multi-level chunking
  • Community detection-based splitting
  • Sliding window chunking with overlap
  • Table-specific chunking
  • Chunk validation and quality assessment
  • Provenance tracking for data lineage

Components:

  • TextSplitter – Unified text splitter with method parameter
  • SemanticChunker – Semantic-based chunking coordinator
  • StructuralChunker – Structure-aware chunking (headings, lists)
  • SlidingWindowChunker – Fixed-size sliding window chunking
  • TableChunker – Table-specific chunking
  • EntityAwareChunker – Entity boundary-preserving chunker
  • RelationAwareChunker – Triplet-preserving chunker
  • GraphBasedChunker – Graph structure-based chunker
  • OntologyAwareChunker – Ontology concept-based chunker
  • HierarchicalChunker – Multi-level hierarchical chunker
  • ChunkValidator – Chunk quality validation
  • ProvenanceTracker – Chunk provenance tracking

Supported Methods:

Category | Methods
Standard | recursive, token, sentence, paragraph, character, word, semantic_transformer, llm
KG/Ontology | entity_aware, relation_aware, graph_based, ontology_aware, hierarchical, community_detection, centrality_based

Try It:

  • Text Splitting Cookbook: Learn different splitting methods
  • Topics: Recursive, token, sentence splitting, entity-aware chunking
  • Difficulty: Beginner
  • Time: 15-20 minutes
  • Use Cases: Document chunking for processing
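
Quick Example (a minimal sketch; the split() call and the chunk_size/chunk_overlap parameter names are assumptions, while the method values come from the table above):

from semantica.split import TextSplitter

text = "Knowledge graphs connect entities through typed relationships. " * 50

# Standard recursive splitting; parameter names assumed
splitter = TextSplitter(method="recursive", chunk_size=512, chunk_overlap=50)
chunks = splitter.split(text)  # split() signature assumed

print(f"Produced {len(chunks)} chunks")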

Normalize Module

Purpose

Cleans, standardizes, and prepares text for semantic extraction. Handles encoding, entity names, dates, and numbers.

Key Features:

  • Text cleaning and noise removal
  • Encoding normalization (Unicode handling)
  • Entity name standardization
  • Date and number formatting
  • Language detection
  • Whitespace normalization
  • Special character handling

Components:

  • TextNormalizer – Main normalization orchestrator
  • TextCleaner – Remove noise, fix encoding
  • DataCleaner – Clean structured data
  • EntityNormalizer – Normalize entity names
  • DateNormalizer – Standardize date formats
  • NumberNormalizer – Normalize numeric values
  • LanguageDetector – Detect document language
  • EncodingHandler – Handle character encoding

Try It:

  • Data Normalization Cookbook: Learn text normalization
  • Topics: Text cleaning, encoding normalization, entity standardization
  • Difficulty: Beginner
  • Time: 15-20 minutes
  • Use Cases: Preparing text for processing
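
Quick Example (a minimal sketch; the normalize() method name is an assumption, see the API reference for the exact interface):

from semantica.normalize import TextNormalizer

normalizer = TextNormalizer()
raw = "  ACME   Corp.   was  founded on 1st Jan 1999 "
clean = normalizer.normalize(raw)  # normalize() signature assumed
print(clean)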

API Reference: Normalize Module


Core Processing Layer

These modules form the intelligence core: extracting meaning, building relationships, and inferring knowledge.


Semantic Extract Module

Purpose

The brain of Semantica. Uses LLMs and NLP to extract entities, relationships, and semantic meaning from text.

Key Features:

  • Multiple NER methods (rule-based, ML, LLM)
  • Relationship extraction with confidence scoring
  • Event extraction
  • Custom entity type support
  • Multi-language support
  • Semantic network extraction
  • Coreference resolution

Components:

  • NERExtractor – Named Entity Recognition
  • RelationExtractor – Extract relationships between entities
  • SemanticAnalyzer – Deep semantic analysis
  • SemanticNetworkExtractor – Extract semantic networks
  • EventExtractor – Extract events from text
  • CoreferenceResolver – Resolve entity coreferences

Try It:

  • Entity Extraction Cookbook: Learn entity extraction
  • Topics: Named entity recognition, entity types, extraction methods
  • Difficulty: Beginner
  • Time: 15-20 minutes
  • Use Cases: Understanding entity extraction

  • Relation Extraction Cookbook: Learn relationship extraction

  • Topics: Relationship extraction, dependency parsing, semantic role labeling
  • Difficulty: Beginner
  • Time: 15-20 minutes
  • Use Cases: Building rich knowledge graphs

Quick Example:

from semantica.semantic_extract import NERExtractor, RelationExtractor

# Extract entities
extractor = NERExtractor(method="llm", model="gpt-4")
entities = extractor.extract(normalized_docs)

# Extract relationships
relation_extractor = RelationExtractor()
relationships = relation_extractor.extract(normalized_docs, entities=entities)

for rel in relationships[:5]:
    print(f"{rel.subject.text} --[{rel.predicate}]--> {rel.object.text}")

API Reference: Semantic Extract Module


Knowledge Graph (KG) Module

Purpose

Constructs and manages knowledge graphs from extracted entities and relationships. Supports multiple backends and advanced analytics.

Key Features:

  • Graph construction from entities/relationships
  • Multiple backend support (NetworkX, Neo4j)
  • Temporal graph support
  • Graph analytics and metrics
  • Entity resolution and deduplication
  • Community detection
  • Centrality calculations
  • Path finding algorithms
  • Graph validation

Components:

  • GraphBuilder – Construct knowledge graphs
  • GraphAnalyzer – Analyze graph structure and properties
  • EntityResolver – Resolve entity conflicts and duplicates
  • ConflictDetector – Detect conflicting information
  • CentralityCalculator – Calculate node importance metrics
  • CommunityDetector – Detect community structure
  • ConnectivityAnalyzer – Analyze graph connectivity
  • SeedManager – Manage seed data for KG initialization
  • TemporalQuery – Query temporal knowledge graphs
  • Deduplicator – Remove duplicate entities/relationships

Quick Example:

from semantica.kg import GraphBuilder, GraphAnalyzer

# Build graph
builder = GraphBuilder(backend="networkx", temporal=True)
kg = builder.build(entities, relationships)

# Analyze graph
analyzer = GraphAnalyzer()
metrics = analyzer.analyze(kg)

print(f"Nodes: {metrics['nodes']}, Edges: {metrics['edges']}")
print(f"Density: {metrics['density']:.3f}")

API Reference: Knowledge Graph Module


Ontology Module

Purpose

Defines schema and structure for your knowledge domain. Generates and validates ontologies with OWL/RDF export.

Key Features:

  • Automatic ontology generation (6-stage pipeline)
  • OWL/RDF/Turtle export
  • Class and property inference
  • Ontology validation
  • Symbolic reasoning (HermiT, Pellet)
  • Version management
  • SHACL constraint support
  • Ontology merging and alignment

Components:

  • OntologyGenerator – Generate ontologies from knowledge graphs
  • OntologyValidator – Validate ontologies
  • OWLGenerator – Generate OWL format ontologies
  • PropertyGenerator – Generate ontology properties
  • ClassInferrer – Infer ontology classes
  • OntologyMerger – Merge multiple ontologies
  • ReasonerInterface – Interface with symbolic reasoners

Quick Example:

from semantica.ontology import OntologyEngine

# Initialize engine
engine = OntologyEngine(base_uri="https://example.org/ontology/")

# Generate ontology from data
ontology = engine.from_data({
    "entities": [...],
    "relationships": [...]
})

# Validate ontology
result = engine.validate(ontology)
if result.valid:
    print("Ontology is valid!")

# Export to OWL
owl_content = engine.to_owl(ontology, format="turtle")
print(f"Generated {len(owl_content)} lines of OWL")

API Reference: Ontology Module


Reasoning Module

Purpose

Infers new facts and validates existing knowledge using logical rules. Supports forward-chaining, high-performance pattern matching, and explanation generation.

Key Features:

  • Forward-chaining inference engine
  • IF-THEN rule support with variable substitution
  • High-performance Rete algorithm for large-scale rule matching
  • Natural language explanation generation for inferred facts
  • SPARQL query expansion for RDF graphs
  • Conflict detection in inferences
  • Priority-based rule execution

Components:

  • Reasoner – High-level facade for all reasoning tasks
  • ReteEngine – High-performance pattern matching (Rete algorithm)
  • ExplanationGenerator – Generate justifications for inferred facts
  • SPARQLReasoner – Query expansion for triplet stores

Quick Example:

from semantica.reasoning import Reasoner

reasoner = Reasoner()

# Add rules and facts
reasoner.add_rule("IF Person(?x) AND Parent(?x, ?y) THEN ParentOfPerson(?x, ?y)")
reasoner.add_fact("Person(Alice)")
reasoner.add_fact("Parent(Alice, Bob)")

# Perform inference
inferred = reasoner.infer_facts(["Person(Alice)", "Parent(Alice, Bob)"])
# Inferred: ["ParentOfPerson(Alice, Bob)"]

API Reference: Reasoning Module


Storage Layer

These modules handle persistence and retrieval of vectors, graphs, and triplets.


Embeddings Module

Purpose

Generates vector embeddings for text, images, and audio. Supports multiple providers with caching and batch processing.

Key Features:

  • Multiple provider support (OpenAI, Cohere, HuggingFace, Sentence Transformers)
  • Text, image, and audio embeddings
  • Multimodal embeddings
  • Batch processing
  • Caching support
  • Custom models
  • Similarity calculations

Components:

  • EmbeddingGenerator – Main embedding orchestrator
  • TextEmbedder – Generate text embeddings
  • ImageEmbedder – Generate image embeddings
  • AudioEmbedder – Generate audio embeddings
  • MultimodalEmbedder – Combine multiple modalities
  • EmbeddingOptimizer – Optimize embedding quality
  • ProviderStores – Support for OpenAI, Cohere, etc.

Quick Example:

from semantica.embeddings import EmbeddingGenerator

generator = EmbeddingGenerator(
    provider="openai",
    model="text-embedding-3-small"
)
embeddings = generator.generate(documents)

# Calculate similarity
similarity = generator.similarity(embeddings[0], embeddings[1])
print(f"Similarity: {similarity:.3f}")

API Reference: Embeddings Module


Vector Store Module

Purpose

Manages storage and retrieval of high-dimensional vectors. Supports hybrid search combining vector and keyword search.

Key Features:

  • Multiple backend support (FAISS, Weaviate, Qdrant, Milvus)
  • Hybrid search (vector + keyword)
  • Metadata filtering
  • Batch operations
  • Similarity search with scoring
  • Index management
  • Namespace support

Components:

  • VectorStore – Main vector store interface
  • FAISSStore – FAISS integration
  • WeaviateStore – Weaviate integration
  • HybridSearch – Combine vector and keyword search
  • VectorRetriever – Retrieve relevant vectors

Quick Example:

from semantica.vector_store import VectorStore, HybridSearch

vector_store = VectorStore(backend="faiss")
vector_store.store(embeddings, documents, metadata)

# Hybrid search
hybrid_search = HybridSearch(vector_store)
results = hybrid_search.search(
    query="machine learning",
    top_k=10,
    filters={"category": "AI"}
)

API Reference: Vector Store Module


Graph Store Module

Purpose

Integration with property graph databases for storing and querying knowledge graphs.

Key Features:

  • Multiple backend support (Neo4j, FalkorDB)
  • Cypher query language
  • Graph algorithms and analytics
  • Transaction support
  • Index management
  • High-performance queries
  • Batch operations

Components:

  • GraphStore – Main graph store interface
  • Neo4jStore – Neo4j database integration
  • FalkorDBStore – FalkorDB (Redis-based) integration
  • NodeManager – Node CRUD operations
  • RelationshipManager – Relationship CRUD operations
  • QueryEngine – Cypher query execution
  • GraphAnalytics – Graph algorithms and analytics

Quick Example:

from semantica.graph_store import GraphStore

store = GraphStore(backend="neo4j", uri="bolt://localhost:7687")
store.connect()

# Create nodes and relationships
alice = store.create_node(
    labels=["Person"],
    properties={"name": "Alice", "age": 30}
)
bob = store.create_node(
    labels=["Person"],
    properties={"name": "Bob", "age": 25}
)
store.create_relationship(
    start_node_id=alice["id"],
    end_node_id=bob["id"],
    rel_type="KNOWS",
    properties={"since": 2020}
)

# Query with Cypher
results = store.execute_query("MATCH (p:Person) RETURN p.name")

API Reference: Graph Store Module


Triplet Store Module

Purpose

RDF triplet store integration for semantic web applications. Supports SPARQL queries and multiple backends.

Key Features:

  • Multi-backend support (Blazegraph, Jena, RDF4J)
  • CRUD operations for RDF triplets
  • SPARQL query execution and optimization
  • Bulk data loading with progress tracking
  • Query caching and optimization
  • Transaction support
  • Store backend pattern

Components:

  • TripletStore – Main triplet store interface
  • QueryEngine – SPARQL query execution and optimization
  • BulkLoader – High-volume data loading with progress tracking
  • BlazegraphStore – Blazegraph integration
  • JenaStore – Apache Jena integration
  • RDF4JStore – Eclipse RDF4J integration
  • QueryPlan – Query execution plan dataclass
  • LoadProgress – Bulk loading progress tracking

Algorithms:

Category | Algorithms
Query Optimization | Cost estimation, query rewriting, LIMIT injection
Caching | MD5-based cache keys, LRU eviction
Bulk Loading | Batch processing, retry with exponential backoff

Quick Example:

from semantica.triplet_store import TripletStore

store = TripletStore(backend="blazegraph", endpoint="http://localhost:9999/blazegraph")

# Add triplet
result = store.add_triplet({
    "subject": "http://example.org/Alice",
    "predicate": "http://example.org/knows",
    "object": "http://example.org/Bob"
})

# Execute SPARQL
query_result = store.execute_query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")

API Reference: Triplet Store Module


Quality Assurance Layer

These modules ensure data quality, handle duplicates, and resolve conflicts.


Deduplication Module

Purpose

Comprehensive entity deduplication and merging. Detects duplicates using multiple similarity methods and merges them intelligently.

Key Features:

  • Multiple similarity methods (exact, Levenshtein, Jaro-Winkler, cosine, embedding)
  • Advanced String Matching: Jaro-Winkler by default for better company/person name resolution
  • Smart Property Handling: Neutral scoring for disjoint properties to prevent false negatives
  • Duplicate detection with confidence scoring
  • Entity merging with configurable strategies
  • Cluster-based batch deduplication
  • Provenance preservation during merges
  • Relationship preservation
  • Incremental processing support

Components:

  • DuplicateDetector – Detects duplicate entities using similarity metrics
  • EntityMerger – Merges duplicate entities using configurable strategies
  • SimilarityCalculator – Multi-factor similarity between entities
  • MergeStrategyManager – Manages merge strategies and conflict resolution
  • ClusterBuilder – Builds clusters for batch deduplication

Merge Strategies:

Strategy | Description
"keep_first" | Preserve first entity, merge others
"keep_last" | Preserve last entity, merge others
"keep_most_complete" | Preserve entity with most properties
"keep_highest_confidence" | Preserve entity with highest confidence
"merge_all" | Combine all properties and relationships

Quick Example:

from semantica.deduplication import DuplicateDetector, EntityMerger

# Detect duplicates
detector = DuplicateDetector(similarity_threshold=0.8)
duplicate_groups = detector.detect_duplicate_groups(entities)

# Merge duplicates
merger = EntityMerger(preserve_provenance=True)
merge_operations = merger.merge_duplicates(
    entities,
    strategy="keep_most_complete"
)

merged_entities = [op.merged_entity for op in merge_operations]
print(f"Reduced from {len(entities)} to {len(merged_entities)} entities")

Conflicts Module

Purpose

Detects and resolves conflicts from multiple data sources. Provides investigation guides and source tracking.

Key Features:

  • Multi-source conflict detection (value, type, relationship, temporal, logical)
  • Source tracking and provenance management
  • Conflict analysis and pattern identification
  • Multiple resolution strategies (voting, credibility-weighted, recency)
  • Investigation guide generation
  • Source credibility scoring
  • Conflict reporting and statistics

Components:

  • ConflictDetector – Detects conflicts from multiple sources
  • ConflictResolver – Resolves conflicts using various strategies
  • ConflictAnalyzer – Analyzes conflict patterns and trends
  • SourceTracker – Tracks source information and provenance
  • InvestigationGuideGenerator – Generates investigation guides

Resolution Strategies:

Strategy | Algorithm
Voting | Majority value selection using frequency counting
Credibility Weighted | Weighted average using source credibility scores
Temporal Selection | Newest/oldest value based on timestamps
Confidence Selection | Maximum confidence value selection

Quick Example:

from semantica.conflicts import ConflictDetector, ConflictResolver

detector = ConflictDetector()
conflicts = detector.detect_value_conflicts(entities, "name")

resolver = ConflictResolver()
results = resolver.resolve_conflicts(conflicts, strategy="voting")

KG Quality Assurance Module

Purpose

Comprehensive quality assessment, validation, and automated fixes for knowledge graphs.

Key Features:

  • Quality metrics calculation (overall, completeness, consistency)
  • Consistency checking (logical, temporal, hierarchical)
  • Completeness validation (entity, relationship, property)
  • Automated fixes (duplicates, inconsistencies, missing properties)
  • Quality reporting with issue tracking
  • Validation engine with rules and constraints
  • Improvement suggestions

Components:

  • KGQualityAssessor – Overall quality assessment coordinator
  • ConsistencyChecker – Consistency validation engine
  • CompletenessValidator – Completeness validation engine
  • QualityMetrics – Quality metrics calculator
  • ValidationEngine – Rule and constraint validation
  • RuleValidator – Rule-based validation
  • ConstraintValidator – Constraint-based validation
  • QualityReporter – Quality report generation
  • IssueTracker – Issue tracking and management
  • ImprovementSuggestions – Improvement suggestions generator
  • AutomatedFixer – Automated issue fixing
  • AutoMerger – Automatic merging of duplicates
  • AutoResolver – Automatic conflict resolution

Note: The KG quality assessment module has been temporarily removed and will be reintroduced in a future release.


Context & Memory Layer

These modules provide context engineering for agents and foundation data management.


Context Module

Purpose

Context engineering infrastructure for agents. Formalizes context as a graph of connections with RAG-enhanced memory. Features GraphRAG with multi-hop reasoning and LLM-generated responses.

Key Features:

  • Context graph construction from entities, relationships, and conversations
  • Agent memory management with RAG integration
  • Entity linking across sources with URI assignment
  • Hybrid context retrieval (vector + graph + memory)
  • Multi-hop reasoning through knowledge graphs
  • LLM-generated responses grounded in graph context
  • Reasoning trace showing entity relationship paths
  • Conversation history management
  • Context accumulation and synthesis
  • Graph-based context traversal

Components:

  • ContextGraph – In-memory context graph store and builder methods
  • ContextNode – Context graph node data structure
  • ContextEdge – Context graph edge data structure
  • AgentMemory – Manages persistent agent memory with RAG
  • AgentContext – High-level context interface with GraphRAG capabilities
  • ContextRetriever – Retrieves relevant context with multi-hop reasoning
  • MemoryItem – Memory item data structure
  • EntityLinker – Links entities across sources with URI assignment

Algorithms:

Category | Algorithms
Graph Construction | BFS/DFS traversal, type-based indexing
Memory Management | Vector embedding, similarity search, retention policies
Context Retrieval | Vector similarity, multi-hop graph expansion, hybrid scoring
Multi-Hop Reasoning | BFS traversal up to N hops, reasoning path construction
LLM Integration | Prompt engineering with context and reasoning paths
Entity Linking | Hash-based URI generation, text similarity matching

Quick Example:

from semantica.context import AgentContext, ContextGraph, AgentMemory
from semantica.llms import Groq
from semantica.vector_store import VectorStore
import os

# Using AgentContext with GraphRAG reasoning
context = AgentContext(
    vector_store=VectorStore(backend="faiss"),
    knowledge_graph=kg
)

# Configure LLM provider
llm_provider = Groq(
    model="llama-3.1-8b-instant",
    api_key=os.getenv("GROQ_API_KEY")
)

# Query with multi-hop reasoning and LLM-generated response
result = context.query_with_reasoning(
    query="What IPs are associated with security alerts?",
    llm_provider=llm_provider,
    max_results=10,
    max_hops=2
)

print(f"Response: {result['response']}")
print(f"Reasoning Path: {result['reasoning_path']}")
print(f"Confidence: {result['confidence']:.3f}")

# Traditional context graph and memory
graph = ContextGraph()
graph_data = graph.build_from_entities_and_relationships(entities, relationships)

memory = AgentMemory(vector_store=vs, knowledge_graph=kg)
memory_id = memory.store("User asked about Python", metadata={"type": "conversation"})
results = memory.retrieve("Python", max_results=5)

API Reference: Context Module


LLM Providers Module

Purpose

Unified interface for LLM providers. Supports Groq, OpenAI, HuggingFace, and LiteLLM (100+ LLMs) with clean imports and consistent API.

Key Features:

  • Unified Interface: Same generate() and generate_structured() methods across all providers
  • Multiple Providers: Groq, OpenAI, HuggingFace, and LiteLLM (100+ LLMs)
  • Clean Imports: Simple from semantica.llms import Groq, OpenAI, HuggingFaceLLM, LiteLLM
  • Structured Output: JSON generation support
  • API Key Management: Environment variable and direct key support
  • Error Handling: Graceful fallback when providers unavailable

Components:

  • Groq – Groq API provider for fast inference
  • OpenAI – OpenAI API provider (GPT-3.5, GPT-4, etc.)
  • HuggingFaceLLM – HuggingFace Transformers for local LLM inference
  • LiteLLM – Unified interface to 100+ LLM providers (OpenAI, Anthropic, Azure, Bedrock, Vertex AI, etc.)

Supported Providers via LiteLLM:

  • OpenAI, Anthropic, Groq, Azure, Bedrock, Vertex AI, Cohere, Mistral, and 90+ more

Quick Example:

from semantica.llms import Groq, OpenAI, HuggingFaceLLM, LiteLLM
import os

# Groq - Fast inference
groq = Groq(
    model="llama-3.1-8b-instant",
    api_key=os.getenv("GROQ_API_KEY")
)
response = groq.generate("What is AI?")

# OpenAI
openai = OpenAI(
    model="gpt-4",
    api_key=os.getenv("OPENAI_API_KEY")
)
response = openai.generate("What is AI?")

# HuggingFace - Local models
hf = HuggingFaceLLM(model_name="gpt2")  # or model="gpt2" for consistency
response = hf.generate("What is AI?")

# LiteLLM - Unified interface to 100+ LLMs
litellm = LiteLLM(
    model="openai/gpt-4o",  # or "anthropic/claude-sonnet-4-20250514", etc.
    api_key=os.getenv("OPENAI_API_KEY")
)
response = litellm.generate("What is AI?")

# Structured output
structured = groq.generate_structured("Extract entities from: Apple Inc. was founded by Steve Jobs.")

API Reference: LLM Providers Module


Seed Module

Purpose

Seed data management for initial knowledge graph construction. Builds on verified knowledge from multiple sources.

Key Features:

  • Multi-source seed data loading (CSV, JSON, Database, API)
  • Foundation graph creation from seed data
  • Seed data quality validation
  • Integration with extracted data using configurable merge strategies
  • Version management for seed sources
  • Export capabilities (JSON, CSV)
  • Schema template validation

Components:

  • SeedDataManager – Main coordinator for seed data operations
  • SeedDataSource – Seed data source definition
  • SeedData – Seed data container

Merge Strategies:

Strategy | Description
seed_first | Seed data takes precedence, extracted data fills gaps
extracted_first | Extracted data takes precedence, seed data fills gaps
merge | Property merging, seed data takes precedence for conflicts

Quick Example:

from semantica.seed import SeedDataManager

manager = SeedDataManager()
manager.register_source("entities", "json", "data/entities.json")
foundation = manager.create_foundation_graph()
validation = manager.validate_quality(foundation)

Output & Orchestration Layer

These modules handle export, visualization, and workflow management.


Export Module

Purpose

Export knowledge graphs and data to various formats for use in external tools.

Key Features:

  • Multiple export formats (JSON, RDF, CSV, OWL, GraphML, GEXF)
  • Custom export formats
  • Batch export
  • Metadata preservation
  • Streaming export for large graphs
  • Vector export support

Components:

  • JSONExporter – Export to JSON
  • RDFExporter – Export to RDF/XML
  • CSVExporter – Export to CSV
  • GraphExporter – Export to graph formats (GraphML, GEXF)
  • OWLExporter – Export to OWL
  • VectorExporter – Export vectors

Quick Example:

from semantica.export import JSONExporter, RDFExporter, CSVExporter

# Export to multiple formats
JSONExporter().export(kg, "output.json")
RDFExporter().export(kg, "output.rdf")
CSVExporter().export(kg, "output.csv")

API Reference: Export Module


Visualization Module

Purpose

Visual exploration of knowledge graphs, embeddings, and analytics data.

Key Features:

  • Interactive graph visualization
  • Embedding visualization (t-SNE, PCA, UMAP)
  • Temporal data visualization
  • Ontology visualization
  • Multiple output formats (HTML, PNG, SVG)
  • Custom styling

Components:

  • KGVisualizer – Visualize knowledge graphs
  • EmbeddingVisualizer – Visualize embeddings (t-SNE, PCA, UMAP)
  • AnalyticsVisualizer – Visualize graph analytics
  • TemporalVisualizer – Visualize temporal data
  • OntologyVisualizer – Visualize ontology structure
  • SemanticNetworkVisualizer – Visualize semantic networks

Quick Example:

from semantica.visualization import KGVisualizer, EmbeddingVisualizer

# Visualize knowledge graph
KGVisualizer().visualize(kg, output_format="html", output_path="graph.html")

# Visualize embeddings
EmbeddingVisualizer().visualize(embeddings, method="tsne", output_path="embeddings.png")

API Reference: Visualization Module


Pipeline Module

Purpose

Orchestrates workflows, connecting modules into robust, executable pipelines.

Key Features:

  • Pipeline construction DSL
  • Parallel execution
  • Error handling and recovery
  • Resource scheduling
  • Pipeline validation
  • Monitoring and logging
  • Checkpoint support

Components:

  • PipelineBuilder – Build complex pipelines
  • ExecutionEngine – Execute pipelines
  • FailureHandler – Handle pipeline failures
  • ParallelismManager – Enable parallel processing
  • ResourceScheduler – Schedule resources
  • PipelineValidator – Validate pipeline configuration

Quick Example:

from semantica.pipeline import PipelineBuilder
from semantica.ingest import FileIngestor
from semantica.parse import DocumentParser
from semantica.semantic_extract import NERExtractor

builder = PipelineBuilder()
pipeline = builder \
    .add_step("ingest", FileIngestor()) \
    .add_step("parse", DocumentParser()) \
    .add_step("extract", NERExtractor()) \
    .build()

result = pipeline.execute(sources=["data/"], parallel=True)

API Reference: Pipeline Module


Integration Patterns

Pattern 1: Complete Knowledge Graph Pipeline

Build a complete knowledge graph from documents using the full pipeline.

For complete examples, see:

  • Your First Knowledge Graph Cookbook: Complete pipeline walkthrough
  • Topics: Ingestion, parsing, extraction, graph building, embeddings
  • Difficulty: Beginner
  • Time: 20-30 minutes
  • Use Cases: Learning the complete workflow
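
A hedged end-to-end sketch assembled from the Quick Examples in this guide; the ingest(), parse(), and normalize() call names are assumptions (see each module's API reference for the exact signatures):

from semantica.ingest import FileIngestor
from semantica.parse import DocumentParser
from semantica.normalize import TextNormalizer
from semantica.semantic_extract import NERExtractor, RelationExtractor
from semantica.kg import GraphBuilder, GraphAnalyzer

# Ingest and prepare documents (ingest/parse/normalize call names assumed)
documents = FileIngestor().ingest("data/")
parsed_docs = DocumentParser().parse(documents)
normalized_docs = TextNormalizer().normalize(parsed_docs)

# Extract entities and relationships, then build and analyze the graph
entities = NERExtractor(method="llm", model="gpt-4").extract(normalized_docs)
relationships = RelationExtractor().extract(normalized_docs, entities=entities)
kg = GraphBuilder(backend="networkx").build(entities, relationships)

metrics = GraphAnalyzer().analyze(kg)
print(f"Nodes: {metrics['nodes']}, Edges: {metrics['edges']}")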

Pattern 2: Custom Pipeline with Module Selection

Build custom pipelines with specific module selections and quality assurance.

For examples, see:

  • Building Knowledge Graphs Cookbook: Advanced graph construction
  • Topics: Custom pipelines, entity merging, conflict resolution
  • Difficulty: Intermediate
  • Time: 30-45 minutes
  • Use Cases: Production graph construction

Pattern 3: GraphRAG with Hybrid Retrieval

Build GraphRAG systems with hybrid search combining vector and graph retrieval.

For complete examples, see:

  • GraphRAG Complete Cookbook: Production GraphRAG system
  • Topics: GraphRAG, hybrid retrieval, graph traversal, LLM integration
  • Difficulty: Advanced
  • Time: 1-2 hours
  • Use Cases: Production RAG applications

Pattern 4: Temporal Graph with Reasoning

Build temporal graphs with logical reasoning capabilities.

For examples, see:

  • Temporal Graphs Cookbook: Temporal graph construction
  • Topics: Time-stamped entities, temporal relationships, historical queries
  • Difficulty: Intermediate
  • Time: 30-45 minutes
  • Use Cases: Time-aware knowledge graphs


Quick Reference: All Modules

Module | Import | Main Class | Purpose
Ingest | semantica.ingest | FileIngestor | Data ingestion
Parse | semantica.parse | DocumentParser | Document parsing
Split | semantica.split | TextSplitter | Text chunking
Normalize | semantica.normalize | TextNormalizer | Data cleaning
Semantic Extract | semantica.semantic_extract | NERExtractor | Entity extraction
KG | semantica.kg | GraphBuilder | Graph construction
Ontology | semantica.ontology | OntologyGenerator | Ontology generation
Reasoning | semantica.reasoning | Reasoner | Logical inference
Embeddings | semantica.embeddings | EmbeddingGenerator | Vector generation
Vector Store | semantica.vector_store | VectorStore | Vector storage
Graph Store | semantica.graph_store | GraphStore | Graph database
Triplet Store | semantica.triplet_store | TripletStore | RDF storage
Deduplication | semantica.deduplication | DuplicateDetector | Duplicate removal
Conflicts | semantica.conflicts | ConflictDetector | Conflict resolution
Context | semantica.context | AgentContext | Agent context & GraphRAG
LLM Providers | semantica.llms | Groq, OpenAI, HuggingFaceLLM, LiteLLM | LLM integration
Seed | semantica.seed | SeedDataManager | Foundation data
Export | semantica.export | JSONExporter | Data export
Visualization | semantica.visualization | KGVisualizer | Visualization
Pipeline | semantica.pipeline | PipelineBuilder | Workflow orchestration

Next Steps

  • Welcome to Semantica: Comprehensive introduction to all modules
  • Topics: Framework overview, all modules, architecture
  • Difficulty: Beginner
  • Time: 30-45 minutes
  • Use Cases: Understanding the complete framework

  • Your First Knowledge Graph: Build your first knowledge graph

  • Topics: Complete pipeline from ingestion to graph construction
  • Difficulty: Beginner
  • Time: 20-30 minutes
  • Use Cases: Hands-on practice with all modules

Contribute

Found an issue or want to improve this guide? Contribute on GitHub