Semantica
Open Source Framework for Semantic Layer & Knowledge Engineering
Transform chaotic data into intelligent knowledge.
The missing fabric between raw data and AI engineering. A comprehensive open-source framework for building semantic layers and knowledge engineering systems that transform unstructured data into AI-ready knowledge, powering Knowledge Graph-Powered RAG (GraphRAG), AI Agents, Multi-Agent Systems, and AI applications with structured semantic knowledge.
100% Open Source • MIT Licensed • Production Ready • Community Driven
What is Semantica?
Semantica bridges the gap between raw data chaos and AI-ready knowledge. It's a semantic intelligence platform that transforms unstructured data into structured, queryable knowledge graphs powering GraphRAG, AI agents, and multi-agent systems.
What Makes Semantica Different?
Unlike traditional approaches that process isolated documents and extract text into vectors, Semantica understands semantic relationships across all content, provides automated ontology generation, and builds a unified semantic layer with production-grade QA.
| Traditional Approaches | Semantica's Approach |
|---|---|
| Process data as isolated documents | Understands semantic relationships across all content |
| Extract text and store vectors | Builds knowledge graphs with meaningful connections |
| Generic entity recognition | General-purpose ontology generation and validation |
| Manual schema definition | Automatic semantic modeling from content patterns |
| Disconnected data silos | Unified semantic layer across all data sources |
| Basic quality checks | Production-grade QA with conflict detection & resolution |
The Problem We Solve
The Semantic Gap
Organizations today face a fundamental mismatch between how data exists and how AI systems need it.
The Semantic Gap: Problem vs. Solution
Organizations have unstructured data (PDFs, emails, logs), messy data (inconsistent formats, duplicates, conflicts), and disconnected silos (no shared context, missing relationships). AI systems need clear rules (formal ontologies), structured entities (validated, consistent), and relationships (semantic connections, context-aware reasoning).
| What Organizations Have | What AI Systems Require |
|---|---|
| Unstructured Data: PDFs, emails, logs; mixed schemas; conflicting facts | Clear Rules: formal ontologies; graphs & networks |
| Messy, Noisy Data: inconsistent formats; duplicate records; missing relationships | Structured Entities: validated entities; domain knowledge |
| Disconnected, Siloed Data: data in separate systems; no shared context; isolated knowledge | Relationships: semantic connections; context-aware reasoning |
What Happens Without Semantics?
Without a semantic layer, AI systems fail in three ways:
- They Break: Systems crash due to inconsistent formats and missing structure.
- They Hallucinate: AI models generate false information without semantic context to validate outputs.
- They Fail Silently: Systems return wrong answers without warnings, leading to bad decisions.
Why? Systems have data, not semantics. They can't connect concepts, understand relationships, validate against domain rules, or detect conflicts.
The Semantica Framework
Semantica operates through three integrated layers that transform raw data into AI-ready knowledge:
- Input Layer: Universal ingestion from multiple data formats (PDFs, DOCX, HTML, JSON, CSV, databases, live feeds, APIs, streams, archives, multi-modal content) into a unified pipeline.
- Semantic Layer: Core intelligence engine performing entity extraction, relationship mapping, ontology generation, context engineering, and quality assurance. Includes advanced entity deduplication (Jaro-Winkler, disjoint property handling) to ensure a clean single source of truth.
- Output Layer: Production-ready knowledge graphs, vector embeddings, and validated ontologies that power GraphRAG systems, AI agents, and multi-agent systems.
Powers: GraphRAG, AI Agents, Multi-Agent Systems
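To make the Jaro-Winkler deduplication named in the Semantic Layer description more concrete, here is a minimal pure-Python sketch of the similarity measure. It is a textbook illustration, not Semantica's own implementation; the function names and the 0.9 duplicate threshold are assumptions chosen for the example.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match only if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Assumed threshold: treat two entity names as duplicates at or above 0.9.
DUPLICATE_THRESHOLD = 0.9
is_duplicate = jaro_winkler("Acme Corp", "Acme Corp.") >= DUPLICATE_THRESHOLD
```

In practice the threshold would be tuned per entity type, since short names produce high scores more easily than long ones.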
Semantica Processing Flow
flowchart TD
A[Raw Data Sources<br/>PDFs, Emails, Logs, Databases<br/>Multiple Formats] --> B[Input Layer<br/>Universal Data Ingestion]
B --> C[Format Detection<br/>& Parsing]
C --> D[Normalization<br/>& Preprocessing]
D --> E[Semantic Layer<br/>Core Intelligence]
E --> F[Entity Extraction<br/>NER + LLM Enhancement]
E --> G[Relationship Mapping<br/>Triplet Generation]
E --> H[Ontology Generation<br/>6-Stage Pipeline]
E --> I[Context Engineering<br/>Semantic Enrichment]
E --> J[Quality Assurance<br/>Conflict Detection]
F --> K[Output Layer]
G --> K
H --> K
I --> K
J --> K
K --> L[Knowledge Graphs<br/>Production-Ready]
K --> M[Vector Embeddings<br/>Semantic Search]
K --> N[Ontologies<br/>OWL Validated]
L --> O[Application Layer]
M --> O
N --> O
O --> P[GraphRAG Engine<br/>91% Accuracy]
O --> Q[AI Agents<br/>Persistent Memory]
O --> R[Multi-Agent Systems<br/>Shared Models]
O --> S[Analytics & BI<br/>Graph Insights]
The Semantica Solution
Semantica is an open-source framework that closes the semantic gap between real-world messy data and the structured semantic layers required by advanced AI systems β GraphRAG, agents, multi-agent systems, reasoning models, and more.
How Semantica Solves These Problems
- Efficient Embeddings: Uses FastEmbed by default for high-performance, lightweight local embedding generation (faster than sentence-transformers).
- Universal Data Ingestion: Handles multiple formats (PDF, DOCX, HTML, JSON, CSV, databases, APIs, streams) with a unified pipeline, so no custom parsers are needed.
- Automated Semantic Extraction: NER, relationship extraction, and triplet generation with LLM enhancement discover entities and relationships automatically.
- Knowledge Graph Construction: Production-ready graphs with entity resolution, temporal support, and graph analytics, giving queryable knowledge ready for AI applications.
- GraphRAG Engine: Hybrid vector + graph retrieval achieves 91% accuracy (a 30% improvement over vector-only retrieval) by combining semantic search with graph traversal for multi-hop reasoning.
- AI Agent Context Engineering: Persistent memory with RAG + knowledge graphs enables context maintenance, action validation, and structured knowledge access.
- Automated Ontology Generation: A 6-stage LLM pipeline generates OWL ontologies validated with HermiT/Pellet, eliminating manual engineering.
- Production-Grade QA: Conflict detection, deduplication, quality scoring, and provenance tracking ensure trusted, production-ready knowledge graphs.
- Pipeline Orchestration: A flexible pipeline builder with parallel execution enables scalable processing via an orchestrator-worker pattern.
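As a rough illustration of the orchestrator-worker pattern with parallel execution, the sketch below uses only the Python standard library. It does not use Semantica's pipeline builder; `run_pipeline`, `normalize`, and `tokenize` are hypothetical names standing in for real pipeline steps.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(items, steps, max_workers=4):
    """Orchestrator: fan items out to workers; each worker runs every step in order."""

    def worker(item):
        for step in steps:
            item = step(item)
        return item

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(worker, items))


# Hypothetical stand-in steps; a real pipeline would parse, extract, and so on.
def normalize(text):
    return " ".join(text.split()).lower()


def tokenize(text):
    return text.split()


results = run_pipeline(["  Hello   World ", "Semantic  LAYER"], [normalize, tokenize])
```

Threads suit I/O-bound steps such as file reads or API calls; CPU-bound steps would more likely call for a process pool.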
Core Features at a Glance
| Feature Category | Capabilities | Key Benefits |
|---|---|---|
| Data Ingestion | Multiple formats (PDF, DOCX, HTML, JSON, CSV, databases, APIs, streams, archives) | Universal ingestion, no custom parsers needed |
| Semantic Extraction | NER, relationship extraction, triplet generation, LLM enhancement | Automated discovery of entities and relationships |
| Knowledge Graphs | Entity resolution, temporal support, graph analytics, query interface | Production-ready, queryable knowledge structures |
| Ontology Generation | 6-stage LLM pipeline, OWL generation, HermiT/Pellet validation | Automated ontology creation from documents |
| GraphRAG | Hybrid vector + graph retrieval, multi-hop reasoning | 91% accuracy, 30% improvement over vector-only |
| Agent Memory | Persistent memory (Save/Load), Hybrid Retrieval (Vector+Graph), FastEmbed support | Context-aware agents with semantic understanding |
| Pipeline Orchestration | Parallel execution, custom steps, orchestrator-worker pattern | Scalable, flexible data processing |
| Quality Assurance | Conflict detection, deduplication, quality scoring, provenance | Trusted knowledge graphs ready for production |
Core Capabilities
1. Universal Data Ingestion
Process multiple file formats with intelligent semantic extraction:
- Documents
    - PDF (with OCR)
    - DOCX, XLSX, PPTX
    - TXT, RTF, ODT
    - EPUB, LaTeX, Markdown
- Web & Feeds
    - HTML, XHTML, XML
    - RSS, Atom feeds
    - JSON-LD, RDFa
    - Web scraping
- Structured Data
    - JSON, YAML, TOML
    - CSV, TSV, Excel
    - Parquet, Avro, ORC
    - SQL/NoSQL databases
- Communication
    - EML, MSG, MBOX
    - PST archives
    - Email threads
    - Attachment extraction
- Archives
    - ZIP, TAR, RAR, 7Z
    - Recursive processing
    - Multi-level extraction
- Scientific
    - BibTeX, EndNote, RIS
    - JATS XML
    - PubMed formats
    - Citation networks
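One simple way to route mixed inputs like those listed above is to group files by extension before handing them to format-specific parsers. The sketch below is an assumption about how such routing could look, not a description of Semantica's ingestion code; the table and function names are hypothetical and cover only a few of the formats.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical suffix-to-category table covering a handful of the listed formats.
CATEGORY_BY_SUFFIX = {
    ".pdf": "document", ".docx": "document", ".md": "document",
    ".html": "web", ".xml": "web",
    ".json": "structured", ".csv": "structured", ".parquet": "structured",
    ".eml": "communication", ".mbox": "communication",
    ".zip": "archive", ".tar": "archive",
    ".bib": "scientific", ".ris": "scientific",
}


def detect_category(path: str) -> str:
    """Map a file path to a format category using its case-insensitive suffix."""
    return CATEGORY_BY_SUFFIX.get(Path(path).suffix.lower(), "unknown")


def group_by_category(paths):
    """Bucket paths by detected category so each bucket can go to one parser."""
    groups = defaultdict(list)
    for p in paths:
        groups[detect_category(p)].append(p)
    return dict(groups)


batches = group_by_category(["report.PDF", "inbox.mbox", "data.csv", "notes.xyz"])
```

Extension matching is only a first pass; content sniffing would be needed for files with missing or misleading suffixes.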
2. Semantic Intelligence Engine
Transform raw text into structured semantic knowledge with state-of-the-art NLP and AI models:
- Named Entity Recognition (NER): Extract people, organizations, locations, dates, and custom entities
- Relationship Extraction: Identify semantic, temporal, and causal relationships
- Event Detection: Detect and classify events (acquisitions, partnerships, announcements)
- Coreference Resolution: Resolve pronouns and entity mentions across documents
- Triplet Extraction: Generate RDF triplets for knowledge graph construction
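To show what the output of triplet extraction can look like, here is a small sketch that represents subject-predicate-object triples and serializes them as N-Triples. The `Triple` type, the `http://example.org/` namespace, and the helper names are illustrative assumptions and are not part of Semantica's API.

```python
from typing import NamedTuple
from urllib.parse import quote


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


BASE = "http://example.org/"  # hypothetical namespace for the example


def to_iri(label: str) -> str:
    """Turn a free-text label into an IRI under BASE (spaces become underscores)."""
    return f"<{BASE}{quote(label.strip().replace(' ', '_'))}>"


def to_ntriples(triples) -> str:
    """Serialize triples as N-Triples, dropping exact duplicates."""
    unique = sorted(set(triples))
    return "\n".join(
        f"{to_iri(t.subject)} {to_iri(t.predicate)} {to_iri(t.obj)} ." for t in unique
    )


extracted = [
    Triple("Acme Corp", "acquired", "Widget Inc"),
    Triple("Acme Corp", "acquired", "Widget Inc"),  # duplicate mention
]
nt = to_ntriples(extracted)
```

Treating every object as an IRI keeps the sketch short; literal values such as dates or numbers would need quoted literals instead.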
3. Knowledge Graph Construction
Build production-ready knowledge graphs with:
- Automatic Entity Resolution: Merge duplicate entities with fuzzy matching
- Conflict Detection & Resolution: Handle contradictory information from multiple sources
- Temporal Knowledge Graphs: Track changes over time with version history
- Graph Analytics: Centrality, community detection, path finding
- Multi-Format Export: Neo4j, RDF, JSON-LD, GraphML
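Conflict detection across sources can be sketched as grouping facts by subject and predicate, then flagging any single-valued predicate that carries more than one distinct value. This is one possible approach under stated assumptions, not Semantica's algorithm; the fact tuple layout, the notion of "functional" predicates, and the majority-vote resolver are all choices made for the example.

```python
from collections import defaultdict


def find_conflicts(facts, functional_predicates):
    """Return {(subject, predicate): {value: {sources}}} where values disagree.

    facts: iterable of (subject, predicate, value, source) tuples.
    functional_predicates: predicates expected to have exactly one value.
    """
    values = defaultdict(lambda: defaultdict(set))
    for subject, predicate, value, source in facts:
        if predicate in functional_predicates:
            values[(subject, predicate)][value].add(source)
    return {key: dict(by_value) for key, by_value in values.items() if len(by_value) > 1}


def resolve_by_votes(by_value):
    """Pick the value backed by the most sources; ties go to the smallest value."""
    return max(sorted(by_value), key=lambda v: len(by_value[v]))


facts = [
    ("Acme", "founded", "1999", "a.pdf"),
    ("Acme", "founded", "1999", "b.html"),
    ("Acme", "founded", "2001", "c.csv"),
    ("Acme", "partner", "Widget", "a.pdf"),   # multi-valued, so not a conflict
    ("Acme", "partner", "Globex", "b.html"),
]
conflicts = find_conflicts(facts, functional_predicates={"founded"})
winner = resolve_by_votes(conflicts[("Acme", "founded")])
```

Keeping the source set per value preserves provenance, so a reviewer can see which documents disagree even after a winner is chosen.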
4. Ontology Generation & Management
Generate formal ontologies automatically using a 6-stage LLM-based pipeline:
- Semantic Network Parsing: Extract domain concepts
- YAML-to-Definition: Transform into class definitions
- Definition-to-Types: Map to OWL types
- Hierarchy Generation: Build taxonomic structures
- TTL Generation: Generate OWL/Turtle syntax
- Symbolic Validation: HermiT/Pellet reasoning (F1 up to 0.99)
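The hierarchy and TTL generation stages can be pictured with a small sketch that renders a class hierarchy as OWL class declarations in Turtle. It is a hand-written illustration, not the pipeline's code: the input shape, the `ex` prefix, and the `http://example.org/onto#` namespace are assumptions, and no reasoner validation is performed here.

```python
def to_turtle(hierarchy, prefix="ex", namespace="http://example.org/onto#"):
    """Render {class_name: parent_name_or_None} as OWL classes in Turtle syntax."""
    lines = [
        f"@prefix {prefix}: <{namespace}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for cls in sorted(hierarchy):
        parent = hierarchy[cls]
        if parent is None:
            # Root class: a plain declaration.
            lines.append(f"{prefix}:{cls} a owl:Class .")
        else:
            # Child class: declaration plus a subclass axiom.
            lines.append(f"{prefix}:{cls} a owl:Class ;")
            lines.append(f"    rdfs:subClassOf {prefix}:{parent} .")
    return "\n".join(lines) + "\n"


ttl = to_turtle({"Organization": None, "Company": "Organization", "Startup": "Company"})
```

The resulting text could then be loaded into an OWL reasoner for the symbolic validation stage.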
5. Hybrid Search & Retrieval
Power GraphRAG applications with:
- Vector Search: Semantic similarity using embeddings
- Graph Traversal: Multi-hop reasoning for context expansion
- Hybrid Retrieval: Combine vector + graph for improved accuracy
- Temporal Queries: Query knowledge at specific time points
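The idea behind hybrid retrieval, vector search to find seed nodes followed by graph traversal to expand context, can be sketched with toy data. The embeddings, graph, and function names below are invented for illustration and do not reflect Semantica's retrieval API or its reported accuracy.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_retrieve(query_vec, embeddings, graph, top_k=1, hops=1):
    """Rank nodes by vector similarity, then add neighbors up to `hops` edges away."""
    ranked = sorted(embeddings, key=lambda n: cosine(query_vec, embeddings[n]), reverse=True)
    seeds = ranked[:top_k]
    context = list(seeds)
    frontier = seeds
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in context:
                    context.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return context


# Toy two-dimensional embeddings and a tiny directed graph (both hypothetical).
embeddings = {"Acme": [1.0, 0.0], "Widget": [0.0, 1.0], "Globex": [0.7, 0.7]}
graph = {"Acme": ["Globex"], "Globex": ["Widget"]}

one_hop = hybrid_retrieve([0.9, 0.1], embeddings, graph, top_k=1, hops=1)
two_hop = hybrid_retrieve([0.9, 0.1], embeddings, graph, top_k=1, hops=2)
```

Here the second hop reaches a node that vector similarity alone ranks last, which is the kind of multi-hop context expansion the graph side contributes.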
Choose Your Path
- Quick Start: Get up and running with Semantica in minutes. Learn the basics of ingestion and extraction.
- Core Concepts: Deep dive into Knowledge Graphs, Ontologies, and Semantic Reasoning.
- API Reference: Detailed technical documentation for all Semantica modules and classes.
- Cookbook: Interactive tutorials, real-world examples, and 14 domain-specific cookbooks.
Installation
Semantica is published on PyPI and installs with a single command: `pip install semantica`.
To get the latest development version, or to modify the framework as a contributor, install from the local source rather than from PyPI.
Quick Example
Semantica uses a modular architecture. You can use individual modules directly for maximum flexibility:
from semantica.ingest import FileIngestor
from semantica.parse import DocumentParser
from semantica.semantic_extract import NERExtractor, RelationExtractor
from semantica.kg import GraphBuilder

# 1. Ingest documents
ingestor = FileIngestor()
documents = ingestor.ingest_directory("documents/", recursive=True)

# 2. Parse documents
parser = DocumentParser()
parsed_docs = [parser.parse_document(doc) for doc in documents]

# 3. Extract entities and relationships from each parsed document
ner = NERExtractor()
rel_extractor = RelationExtractor()
entities = []
relationships = []
for doc in parsed_docs:
    text = doc.get("full_text", "")
    doc_entities = ner.extract_entities(text)
    doc_rels = rel_extractor.extract_relations(text, entities=doc_entities)
    entities.extend(doc_entities)
    relationships.extend(doc_rels)

# 4. Build knowledge graph, merging duplicate entities
builder = GraphBuilder(merge_entities=True)
kg = builder.build_graph(entities=entities, relationships=relationships)
print(f"Created graph with {len(kg.nodes)} nodes and {len(kg.edges)} edges")
Orchestration Option
For complex workflows, you can also use the Semantica class for orchestration. See the Core Module documentation for details.
Why Semantica?
- 100% Open Source: MIT licensed. No vendor lock-in. Full transparency.
- Production Ready: Battle-tested with quality assurance, conflict resolution, and validation.
- Modular Architecture: Use only what you need. Swap components easily.
- Community Driven: Built by developers, for developers. Active Discord community.
- Comprehensive: End-to-end solution from ingestion to reasoning. No duct-taping required.
- Research-Backed: Based on the latest research in knowledge graphs, ontologies, and the semantic web.
Built For
- Data Scientists: Transform messy data into clean knowledge graphs
- Data Engineers: Build scalable data pipelines with semantic enrichment
- AI Engineers: Build GraphRAG, AI agents, and multi-agent systems
- Knowledge Engineers: Generate and manage formal ontologies
- Ontologists: Design and validate domain-specific ontologies and taxonomies
- Researchers: Analyze scientific literature and build citation networks
- ML Engineers: Create semantic features for machine learning models
- Enterprises: Unify data silos into a semantic layer
Learn More
- Getting Started Guide - Your first knowledge graph in 5 minutes
- Core Concepts - Deep dive into knowledge graphs and ontologies
- Cookbook - Real-world examples and 14 domain-specific cookbooks
- API Reference - Complete technical documentation
Recommended Cookbook Tutorials
Get hands-on with interactive Jupyter notebooks:
- Welcome to Semantica: Comprehensive introduction to all Semantica modules
    - Topics: Framework overview, all modules, architecture
    - Difficulty: Beginner
    - Use Cases: First-time users, understanding the framework
- Your First Knowledge Graph: Build your first knowledge graph from scratch
    - Topics: Entity extraction, relationship extraction, graph construction
    - Difficulty: Beginner
    - Use Cases: Learning the basics, quick start
- GraphRAG Complete: Production-ready Graph Retrieval Augmented Generation
    - Topics: GraphRAG, hybrid retrieval, vector search, graph traversal
    - Difficulty: Advanced
    - Use Cases: Building AI applications with knowledge graphs