Provenance Tracking Module

W3C PROV-O compliant provenance tracking for high-stakes domains requiring complete traceability

Overview

The Semantica provenance module provides W3C PROV-O compliant tracking for knowledge graphs, enabling complete end-to-end lineage from source documents to query responses. It is designed for high-stakes domains where every decision must be explainable and auditable.
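To make the PROV-O vocabulary concrete, here is a minimal sketch of how one tracked entity could be written as PROV-O statements. It uses plain tuples and the standard library only. The `to_prov_triples` helper and the `ex:` identifiers are hypothetical and are not part of Semantica's API; the `prov:` terms come from the W3C PROV-O namespace (`http://www.w3.org/ns/prov#`).

```python
from datetime import datetime, timezone


def to_prov_triples(entity_id, source_id, activity_id, agent_id, generated_at):
    """Return (subject, predicate, object) triples describing one derived entity."""
    return [
        (entity_id, "rdf:type", "prov:Entity"),
        (source_id, "rdf:type", "prov:Entity"),
        (activity_id, "rdf:type", "prov:Activity"),
        (agent_id, "rdf:type", "prov:Agent"),
        # The extracted entity was derived from the source document...
        (entity_id, "prov:wasDerivedFrom", source_id),
        # ...by an extraction activity that used that document...
        (entity_id, "prov:wasGeneratedBy", activity_id),
        (activity_id, "prov:used", source_id),
        # ...and that was carried out by some agent (a model or tool).
        (activity_id, "prov:wasAssociatedWith", agent_id),
        (entity_id, "prov:generatedAtTime", generated_at.isoformat()),
    ]


triples = to_prov_triples(
    entity_id="ex:entity_1",
    source_id="ex:document.pdf",
    activity_id="ex:ner_run_1",
    agent_id="ex:ner_extractor",
    generated_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

How Semantica serializes these relations internally is not specified here; the sketch only shows which PROV-O relations link an entity, the activity that produced it, and the responsible agent.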

Key Features

  • W3C PROV-O Compliant — Implements PROV-O ontology (prov:Entity, prov:Activity, prov:Agent, prov:wasDerivedFrom)
  • All 17 Modules Integrated — Complete coverage across Semantica
  • Source Tracking — Document identifiers, page numbers, sections, and direct quotes supported
  • Zero Breaking Changes — 100% backward compatible, opt-in only
  • Multiple Storage Backends — InMemory (fast) and SQLite (persistent)
  • Bridge Axiom Support — Translation chain tracking for domain transformations (L1 → L2 → L3)
  • Integrity Verification — SHA-256 checksums for tamper detection
  • Complete Lineage Tracing — End-to-end from document to response
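The integrity-verification feature can be illustrated with a short standard-library sketch. It assumes a record is hashed over a canonical JSON encoding; the `compute_checksum` and `verify` helpers are hypothetical, and Semantica's actual checksum input and storage format may differ.

```python
import hashlib
import json


def compute_checksum(record: dict) -> str:
    """Hash a canonical JSON encoding so that key order does not change the digest."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(record: dict, expected: str) -> bool:
    """Return True only if the record still matches the stored checksum."""
    return compute_checksum(record) == expected


record = {
    "entity_id": "entity_1",
    "source": "document.pdf",
    "entity_type": "organization",
}
checksum = compute_checksum(record)

# Changing any field after the checksum was stored is detected.
tampered = dict(record, source="other.pdf")

print(verify(record, checksum))    # True
print(verify(tampered, checksum))  # False
```

Sorting keys before hashing matters: two dictionaries with the same content but different insertion order would otherwise produce different digests.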

Installation

The provenance module is included with Semantica. No additional installation is required.

from semantica.provenance import ProvenanceManager

Core Components

ProvenanceManager

Central manager for all provenance tracking operations.

from semantica.provenance import ProvenanceManager

# Initialize with in-memory storage (default)
manager = ProvenanceManager()

# Initialize with persistent SQLite storage
manager = ProvenanceManager(storage_path="provenance.db")

Key capabilities:

  • Track entities and relationships with complete lineage
  • Store provenance data in memory or persistent SQLite storage
  • Query provenance information for audit and compliance
  • Maintain W3C PROV-O compliant records

Methods:

  • track_entity(entity_id, source, entity_type, **metadata): Track entity provenance
  • track_relationship(relationship_id, source, subject, predicate, obj, **metadata): Track relationship provenance
  • track_chunk(chunk_id, source_document, chunk_text, start_char, end_char, **metadata): Track document chunk provenance
  • track_property_source(entity_id, property_name, value, source, **metadata): Track property-level provenance
  • get_lineage(entity_id): Retrieve complete lineage for an entity
  • get_statistics(): Get provenance statistics
  • get_all_entries(): Retrieve all provenance entries

Storage Backends

InMemoryStorage

Fast, non-persistent storage for development and testing.

from semantica.provenance import ProvenanceManager, InMemoryStorage

manager = ProvenanceManager(storage=InMemoryStorage())

Best for:

  • Development and testing environments
  • Temporary provenance tracking
  • High-performance scenarios where persistence isn't required
  • Rapid prototyping and debugging

SQLiteStorage

Persistent storage for production use.

from semantica.provenance import ProvenanceManager, SQLiteStorage

storage = SQLiteStorage("provenance.db")
manager = ProvenanceManager(storage=storage)

Best for:

  • Production deployments requiring persistence
  • Long-term provenance storage
  • Compliance and audit requirements
  • Multi-process environments
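For readers who want a picture of what a persistent backend involves, the following is a rough sketch built on Python's standard `sqlite3` module. The table name, column names, and index names are assumptions made for illustration; they are not Semantica's actual schema.

```python
import json
import sqlite3

# ":memory:" keeps the sketch self-contained; a file path such as
# "provenance.db" would persist the data across runs.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE provenance_entries (
        entity_id       TEXT NOT NULL,
        source_document TEXT NOT NULL,
        entity_type     TEXT,
        recorded_at     TEXT NOT NULL,
        metadata        TEXT
    )
    """
)
# Indexes on the two columns most often used for lookups.
conn.execute("CREATE INDEX idx_entity_id ON provenance_entries (entity_id)")
conn.execute("CREATE INDEX idx_source ON provenance_entries (source_document)")

rows = [
    ("entity_1", "document.pdf", "organization",
     "2024-01-01T00:00:00+00:00", json.dumps({"confidence": 0.95})),
    ("entity_2", "biography.pdf", "person",
     "2024-01-01T00:00:05+00:00", json.dumps({"confidence": 0.92})),
]
with conn:  # commits on success, rolls back on error
    conn.executemany("INSERT INTO provenance_entries VALUES (?, ?, ?, ?, ?)", rows)

found = conn.execute(
    "SELECT entity_id, source_document FROM provenance_entries WHERE entity_id = ?",
    ("entity_1",),
).fetchall()
print(found)
```

Storing metadata as a JSON string keeps the schema fixed while still allowing arbitrary keys per entry, at the cost of not being able to index individual metadata fields.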

Data Schemas

ProvenanceEntry

Core data structure for provenance tracking.

from semantica.provenance import ProvenanceEntry
from datetime import datetime

entry = ProvenanceEntry(
    entity_id="entity_1",
    source="document.pdf",
    timestamp=datetime.now(),
    entity_type="named_entity",
    metadata={"text": "Apple Inc.", "confidence": 0.95}
)

SourceReference

Structured source information with page and section details.

from semantica.provenance import SourceReference

source = SourceReference(
    document="research_paper.pdf",
    page=5,
    section="Results",
    confidence=0.98
)

Module Integrations

All Semantica modules have provenance-enabled versions. Enable tracking by setting provenance=True.

Semantic Extract

from semantica.semantic_extract.semantic_extract_provenance import (
    NERExtractorWithProvenance,
    RelationExtractorWithProvenance,
    EventDetectorWithProvenance,
    CoreferenceResolverWithProvenance,
    TripletExtractorWithProvenance
)

# Named Entity Recognition with provenance
ner = NERExtractorWithProvenance(provenance=True)
entities = ner.extract(
    text="Apple Inc. was founded by Steve Jobs in Cupertino.",
    source="company_history.pdf"
)

# Access provenance manager
prov_manager = ner._prov_manager
# "entity_id" is a placeholder; pass the ID of one of the extracted entities
lineage = prov_manager.get_lineage("entity_id")

Tracks: Entity text, labels, confidence scores, source documents, character positions, extraction timestamps

LLM Providers

from semantica.llms.llms_provenance import (
    GroqLLMWithProvenance,
    OpenAILLMWithProvenance,
    HuggingFaceLLMWithProvenance,
    LiteLLMWithProvenance
)

# Groq LLM with provenance
llm = GroqLLMWithProvenance(
    provenance=True,
    model="llama-3.1-70b"
)

response = llm.generate("What is artificial intelligence?")

# Access cost and performance data
stats = llm._prov_manager.get_statistics()

Tracks: Model name, prompt/completion tokens, API costs, latency, generation parameters, prompts and responses

Pipeline Execution

from semantica.pipeline.pipeline_provenance import PipelineWithProvenance

pipeline = PipelineWithProvenance(provenance=True)
result = pipeline.run(data=input_data, source="input_file.json")

Tracks: Pipeline steps executed, duration, input/output data, execution status

Context Management

from semantica.context.context_provenance import ContextManagerWithProvenance

ctx = ContextManagerWithProvenance(provenance=True)
ctx.add_context("Relevant background information", source="knowledge_base.txt")

Tracks: Context additions, sources, timestamps

Document Ingestion

from semantica.ingest.ingest_provenance import PDFIngestorWithProvenance

ingestor = PDFIngestorWithProvenance(provenance=True)
documents = ingestor.ingest("research_paper.pdf")

Tracks: File paths, page counts, file metadata, ingestion timestamps

Embeddings Generation

from semantica.embeddings.embeddings_provenance import EmbeddingGeneratorWithProvenance

embedder = EmbeddingGeneratorWithProvenance(
    provenance=True,
    model="sentence-transformers/all-mpnet-base-v2"
)
embeddings = embedder.embed(["Text 1", "Text 2"], source="corpus.txt")

Tracks: Model name, embedding dimensions, generation timestamps

Graph Store

from semantica.graph_store.graph_store_provenance import GraphStoreWithProvenance

store = GraphStoreWithProvenance(provenance=True)
store.add_node(entity_node, source="knowledge_graph.json")

Tracks: Nodes added, node properties, graph structure changes

Vector Store

from semantica.vector_store.vector_store_provenance import VectorStoreWithProvenance

store = VectorStoreWithProvenance(provenance=True)
store.add_vectors(embedding_vectors, source="embeddings.npy")

Tracks: Vectors stored, dimensions, storage timestamps

Triplet Store

from semantica.triplet_store.triplet_store_provenance import TripletStoreWithProvenance

store = TripletStoreWithProvenance(provenance=True)
store.add_triplet("Steve_Jobs", "founded", "Apple_Inc", source="knowledge_base.ttl")

Tracks: Subject, predicate, object, confidence scores, timestamps

Other Modules

All remaining modules follow the same pattern:

  • Reasoning: ReasoningEngineWithProvenance
  • Conflicts: SourceTrackerWithUnifiedBackend
  • Deduplication: DeduplicatorWithProvenance
  • Export: ExporterWithProvenance
  • Parse: ParserWithProvenance
  • Normalize: NormalizerWithProvenance
  • Ontology: OntologyManagerWithProvenance
  • Visualization: VisualizerWithProvenance

Usage Examples

Basic Entity Tracking

from semantica.provenance import ProvenanceManager

manager = ProvenanceManager()

# Track entity
manager.track_entity(
    entity_id="entity_1",
    source="document.pdf",
    entity_type="organization",
    metadata={
        "name": "Apple Inc.",
        "confidence": 0.95,
        "extraction_method": "NER"
    }
)

# Retrieve lineage
lineage = manager.get_lineage("entity_1")
print(f"Source: {lineage['source']}")
print(f"Timestamp: {lineage['timestamp']}")
print(f"Metadata: {lineage['metadata']}")

Relationship Tracking

# Track entities
manager.track_entity("steve_jobs", "biography.pdf", "person")
manager.track_entity("apple_inc", "biography.pdf", "organization")

# Track relationship
manager.track_relationship(
    relationship_id="rel_1",
    source="biography.pdf",
    subject="steve_jobs",
    predicate="founded",
    obj="apple_inc",
    metadata={"confidence": 0.92}
)

Lineage Chain Tracking

# Create lineage chain: document → chunk → entity
manager.track_entity("doc_1", "research_paper.pdf", "document")

manager.track_chunk(
    chunk_id="chunk_1",
    source_document="doc_1",
    chunk_text="Sample text content",
    start_char=0,
    end_char=100
)

manager.track_entity(
    entity_id="entity_1",
    source="chunk_1",
    entity_type="named_entity",
    metadata={"text": "Apple"}
)

# Retrieve complete lineage
lineage = manager.get_lineage("entity_1")
print(f"Lineage chain: {lineage['lineage_chain']}")
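Conceptually, a lineage chain like the one above is resolved by following each record's source back until a root is reached. The sketch below shows that idea with a plain dictionary; `resolve_lineage` and the record layout are hypothetical and do not describe how `get_lineage` is implemented.

```python
def resolve_lineage(entries: dict, entity_id: str) -> list:
    """Follow `source` pointers from an entity back to its root source."""
    chain = []
    seen = set()
    current = entity_id
    while True:
        if current in seen:  # guard against cyclic records
            break
        seen.add(current)
        chain.append(current)
        if current not in entries:  # a root source has no entry of its own
            break
        current = entries[current]["source"]
    return chain


entries = {
    "doc_1": {"source": "research_paper.pdf"},
    "chunk_1": {"source": "doc_1"},
    "entity_1": {"source": "chunk_1"},
}

lineage_chain = resolve_lineage(entries, "entity_1")
print(" <- ".join(lineage_chain))
# entity_1 <- chunk_1 <- doc_1 <- research_paper.pdf
```

This is why the `source` of each tracked item should be the ID of the item it was derived from (for example `"chunk_1"` rather than the file name): otherwise the walk stops after one hop.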

Property-Level Provenance

from semantica.provenance import SourceReference

# Track entity
manager.track_entity("company_1", "doc.pdf", "organization")

# Track property sources
manager.track_property_source(
    entity_id="company_1",
    property_name="revenue",
    value="$394.3B",
    source=SourceReference(
        document="annual_report_2023.pdf",
        page=5,
        section="Financial Summary",
        confidence=0.98
    )
)

manager.track_property_source(
    entity_id="company_1",
    property_name="employees",
    value="500",
    source=SourceReference(
        document="company_profile.pdf",
        page=2,
        confidence=0.90
    )
)

End-to-End Workflow

from semantica.provenance import ProvenanceManager
from semantica.ingest.ingest_provenance import PDFIngestorWithProvenance
from semantica.semantic_extract.semantic_extract_provenance import NERExtractorWithProvenance
from semantica.llms.llms_provenance import GroqLLMWithProvenance
from semantica.graph_store.graph_store_provenance import GraphStoreWithProvenance

# Initialize
manager = ProvenanceManager()

# Step 1: Ingest
ingestor = PDFIngestorWithProvenance(provenance=True)
documents = ingestor.ingest("research_paper.pdf")

# Step 2: Extract
ner = NERExtractorWithProvenance(provenance=True)
entities = ner.extract(documents[0].text, source="research_paper.pdf")

# Step 3: LLM Analysis
llm = GroqLLMWithProvenance(provenance=True)
summary = llm.generate(f"Summarize: {documents[0].text[:500]}")

# Step 4: Store
graph = GraphStoreWithProvenance(provenance=True)
for entity in entities:
    graph.add_node(entity, source="research_paper.pdf")

# Step 5: Retrieve provenance
# "entity_id" is a placeholder; pass the ID of one of the extracted entities
lineage = ner._prov_manager.get_lineage("entity_id")
stats = ner._prov_manager.get_statistics()
print(f"Total operations: {stats['total_entries']}")

Bridge Axioms

Bridge axioms enable translation chain tracking across multiple abstraction layers.

from semantica.provenance.bridge_axiom import BridgeAxiom, TranslationChain

# Create bridge axiom
axiom = BridgeAxiom(
    source_layer="L1_ecological",
    target_layer="L2_financial",
    translation_rule="fish_biomass_to_revenue",
    confidence=0.89
)

# Add provenance
axiom.add_source_provenance(
    document="DOI:10.1371/journal.pone.0023601",
    location="Figure 2",
    quote="Total fish biomass increased by 463%"
)

# Create translation chain
chain = TranslationChain()
chain.add_axiom(axiom)

# Track complete chain
provenance_data = chain.get_complete_provenance()

Use Cases:

  • Blue Finance: Ecological data → Financial metrics
  • Healthcare: Clinical data → Treatment recommendations
  • Legal: Evidence → Legal conclusions
  • Pharmaceutical: Research data → Drug efficacy claims
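A translation chain is only traceable if each axiom starts at the layer where the previous one ended. The sketch below checks that property with a small dataclass. `AxiomSketch`, `layer_path`, the `L3_risk` layer, and the second translation rule are hypothetical illustrations; they are not Semantica's `BridgeAxiom` or `TranslationChain` classes.

```python
from dataclasses import dataclass


@dataclass
class AxiomSketch:
    """Hypothetical stand-in for a bridge axiom."""
    source_layer: str
    target_layer: str
    translation_rule: str
    confidence: float


def layer_path(axioms):
    """Return the ordered layers a chain passes through, or raise if links do not connect."""
    if not axioms:
        return []
    for prev, nxt in zip(axioms, axioms[1:]):
        if prev.target_layer != nxt.source_layer:
            raise ValueError(
                f"chain breaks between {prev.target_layer!r} and {nxt.source_layer!r}"
            )
    return [axioms[0].source_layer] + [a.target_layer for a in axioms]


chain = [
    AxiomSketch("L1_ecological", "L2_financial", "fish_biomass_to_revenue", 0.89),
    # Hypothetical second hop, added only to show an L1 → L2 → L3 chain.
    AxiomSketch("L2_financial", "L3_risk", "revenue_to_risk_score", 0.80),
]
path = layer_path(chain)
print(" -> ".join(path))  # L1_ecological -> L2_financial -> L3_risk
```

Rejecting disconnected chains early means a later audit never has to explain a jump between layers that no axiom justifies.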


Best Practices

1. Always Provide Source Information

# ✅ GOOD - Provides source
entities = ner.extract(text, source="document.pdf")

# ❌ BAD - No source information
entities = ner.extract(text)

2. Use Descriptive Entity IDs

# ✅ GOOD - Descriptive IDs
manager.track_entity("company_apple_inc", source, "organization")

# ❌ BAD - Generic IDs
manager.track_entity("entity_1", source, "organization")

3. Include Rich Metadata

# ✅ GOOD - Rich metadata
manager.track_entity(
    entity_id="person_steve_jobs",
    source="biography.pdf",
    entity_type="person",
    metadata={
        "full_name": "Steve Jobs",
        "birth_year": 1955,
        "confidence": 0.95,
        "extraction_method": "NER_spacy"
    }
)

4. Enable Provenance for High-Stakes Operations

# For high-stakes requirements
llm = GroqLLMWithProvenance(provenance=True)  # Track all LLM calls
ner = NERExtractorWithProvenance(provenance=True)  # Track all extractions

5. Use Persistent Storage for Production

from semantica.provenance import ProvenanceManager, SQLiteStorage

# Use SQLite for persistence
storage = SQLiteStorage("provenance.db")
manager = ProvenanceManager(storage=storage)

Performance

Benchmarks

  • Entity tracking: Fast per operation
  • Lineage retrieval: Quick retrieval for long chains
  • Batch operations: High-throughput batch processing
  • Storage: InMemory (fastest), SQLite (persistent)

Optimization Tips

  1. Batch Operations: Use batch methods for multiple entities
  2. Selective Tracking: Only track provenance for critical entities
  3. Storage Choice: Use InMemory for development, SQLite for production
  4. Index Optimization: The SQLite backend automatically creates indexes on entity_id and source_document

Compliance Standards Support

The provenance module provides technical infrastructure that supports compliance efforts:

  • W3C PROV-O — Implements PROV-O ontology data structures and relationships
  • FDA 21 CFR Part 11 — Provides audit trails, checksums, and temporal tracking for electronic records
  • SOX — Enables financial data lineage tracking and integrity verification
  • HIPAA — Supports healthcare data integrity through checksums and source tracking
  • TNFD — Enables bridge axiom tracking for nature-to-financial translations

Important: This module provides the technical capabilities for compliance. Organizations must implement additional policies, procedures, validation, and controls to meet specific regulatory requirements. Semantica does not provide regulatory certification or legal compliance guarantees.


API Reference

ProvenanceManager

__init__(storage=None, storage_path=None)

Initialize provenance manager.

Parameters:

  • storage (ProvenanceStorage, optional): Storage backend instance
  • storage_path (str, optional): Path for SQLite storage

track_entity(entity_id, source, entity_type, **metadata)

Track entity provenance.

Parameters:

  • entity_id (str): Unique identifier for entity
  • source (str): Source document or identifier
  • entity_type (str): Type of entity
  • **metadata: Additional metadata

Returns: ProvenanceEntry

track_relationship(relationship_id, source, subject, predicate, obj, **metadata)

Track relationship provenance.

Parameters:

  • relationship_id (str): Unique identifier for relationship
  • source (str): Source document
  • subject (str): Subject entity ID
  • predicate (str): Relationship type
  • obj (str): Object entity ID
  • **metadata: Additional metadata

Returns: ProvenanceEntry

track_chunk(chunk_id, source_document, chunk_text, start_char, end_char, **metadata)

Track document chunk provenance.

Parameters:

  • chunk_id (str): Unique identifier for chunk
  • source_document (str): Source document ID
  • chunk_text (str): Text content of chunk
  • start_char (int): Start character position
  • end_char (int): End character position
  • **metadata: Additional metadata

Returns: ProvenanceEntry

get_lineage(entity_id)

Retrieve complete lineage for an entity.

Parameters:

  • entity_id (str): Entity identifier

Returns: dict with lineage information

get_statistics()

Get provenance statistics.

Returns: dict with statistics (total_entries, entities, relationships, chunks)
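As a rough illustration of what such a statistics dictionary summarizes, the sketch below counts records by kind. The `summarize` helper and the `kind` field are assumptions for illustration; the real return value may contain additional keys.

```python
from collections import Counter


def summarize(entries):
    """Count provenance records by kind, mirroring the keys listed above."""
    kinds = Counter(entry["kind"] for entry in entries)
    return {
        "total_entries": len(entries),
        "entities": kinds["entity"],            # Counter returns 0 for missing kinds
        "relationships": kinds["relationship"],
        "chunks": kinds["chunk"],
    }


entries = [
    {"id": "steve_jobs", "kind": "entity"},
    {"id": "apple_inc", "kind": "entity"},
    {"id": "rel_1", "kind": "relationship"},
]
stats = summarize(entries)
print(stats)
# {'total_entries': 3, 'entities': 2, 'relationships': 1, 'chunks': 0}
```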


Testing

Run the provenance test suite:

# All provenance tests
pytest tests/provenance/ -v

# Specific test categories
pytest tests/provenance/test_manager.py -v
pytest tests/provenance/test_storage.py -v
pytest tests/provenance/test_bridge_axiom.py -v
pytest tests/provenance/test_integration.py -v

# Module integration tests
pytest tests/provenance/test_semantic_extract_provenance.py -v
pytest tests/provenance/test_llms_provenance.py -v

Troubleshooting

Provenance Not Being Tracked

# `obj` is any provenance-enabled component, such as the `ner` or `llm` instances above
# Check if provenance is enabled
print(f"Provenance enabled: {obj.provenance}")
print(f"Manager available: {obj._prov_manager is not None}")

Performance Issues

# Use batch operations
entities = [{"id": f"entity_{i}"} for i in range(1000)]
manager.track_entities_batch(entities, source="doc_1")

Storage Growing Too Large

# Use separate databases for different time periods
manager_2026 = ProvenanceManager(storage_path="provenance_2026.db")

License

MIT License - See LICENSE for details.

Support

For issues or questions, please open an issue on GitHub or join our Discord.