Split¶
Comprehensive document chunking and splitting with 15+ methods, including KG-aware, semantic, and structural chunking.
🎯 Overview¶
- **Multiple Methods**: 15+ chunking methods: recursive, semantic, entity-aware, relation-aware, and more
- **KG-Aware Chunking**: Preserve entities, relationships, and graph structure for GraphRAG workflows
- **Semantic Chunking**: Intelligent boundary detection using embeddings and NLP
- **Structural Chunking**: Respect document structure: headings, paragraphs, lists, tables
- **Quality Validation**: Chunk quality assessment and validation
- **Provenance Tracking**: Track chunk origins for data lineage
Choosing the Right Method
- Standard Documents: Use `recursive` or `sentence` for general text
- GraphRAG: Use `entity_aware` or `relation_aware` to preserve knowledge
- Semantic Coherence: Use `semantic_transformer` for topic-based chunks
- Structured Docs: Use `structural` for documents with headings/sections
- Large Documents: Use `hierarchical` for multi-level chunking
⚙️ Algorithms Used¶
Standard Splitting Algorithms¶
Purpose: Split documents into chunks using various strategies.
How it works:
- Recursive Splitting: Separator hierarchy (`"\n\n"`, `"\n"`, `" "`, `""`) with greedy splitting (see the sketch after this list)
- Token Counting: BPE tokenization using tiktoken or transformers
- Sentence Segmentation: NLTK punkt, spaCy sentencizer, or regex-based
- Paragraph Detection: Double newline detection with whitespace normalization
- Character Splitting: Fixed-size character chunks with overlap
- Word Splitting: Whitespace tokenization with word boundary preservation
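The sketch below illustrates the greedy recursive strategy in plain Python: try the coarsest separator first, pack pieces up to the target size, and recurse with finer separators when a piece is still too large. It is illustrative only (overlap handling omitted); the `recursive_split` helper is not part of the semantica API, which exposes this strategy through `TextSplitter(method="recursive")`.

```python
# Illustrative greedy recursive splitting (not the semantica implementation).
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", " ", "")):
    """Greedy recursive splitting: coarse separators first, finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: fall back to hard character slicing.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) > chunk_size:
            chunks.extend(recursive_split(piece, chunk_size, rest))  # recurse with finer separators
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = ("Paragraph one about splitting.\n\n" "Paragraph two goes on for a while. ") * 10
print([len(c) for c in recursive_split(sample, chunk_size=200)])  # every length <= 200
```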
Semantic Chunking Algorithms¶
Purpose: Intelligent boundary detection using embeddings and NLP.
How it works:
- Semantic Boundary Detection (see the sketch below):
    - Sentence transformer embeddings (384-1024 dim)
    - Cosine similarity between consecutive sentences
    - Threshold-based boundary detection (default: `0.7`)
- LLM-based Splitting:
    - Prompt engineering for optimal split point detection
    - Context window management
    - Coherence scoring
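As a rough sketch of the embedding-based approach, assuming the `sentence-transformers` package and sentences that have already been segmented: embed consecutive sentences, compare them with cosine similarity, and start a new chunk wherever similarity drops below the threshold. The helper below is illustrative, not the library's implementation.

```python
# Threshold-based semantic boundary detection (illustrative sketch).
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_boundaries(sentences, model_name="all-MiniLM-L6-v2", threshold=0.7):
    """Return indices i where a new chunk should start before sentences[i]."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences, normalize_embeddings=True)  # unit-length vectors
    boundaries = []
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))  # cosine similarity
        if similarity < threshold:  # large topic shift -> start a new chunk here
            boundaries.append(i)
    return boundaries

sentences = [
    "The cell membrane regulates what enters the cell.",
    "Mitochondria produce most of the cell's energy.",
    "The 2008 financial crisis began in the housing market.",
    "Banks had taken on large amounts of mortgage risk.",
]
print(semantic_boundaries(sentences))  # expect a boundary at the biology -> finance shift
```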
KG/Ontology Chunking Algorithms¶
Purpose: Preserve entities, relationships, and graph structure for GraphRAG workflows.
How it works:
- Entity Boundary Detection:
    - NER-based entity extraction (spaCy, LLM)
    - Entity span tracking
    - Boundary preservation (no entity splitting)
- Triplet Preservation:
    - Graph-based triplet integrity checking
    - Subject-predicate-object span tracking
    - Relationship boundary preservation
- Graph Centrality Analysis (see the sketch below):
    - Degree centrality: `C_D(v) = deg(v) / (n-1)`
    - Betweenness centrality: `C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st`
    - Closeness centrality: `C_C(v) = (n-1) / Σ_u d(v,u)`
    - Eigenvector centrality: power iteration method
- Community Detection:
    - Louvain algorithm: modularity optimization, O(n log n)
    - Leiden algorithm: improved Louvain with a refinement phase
    - Modularity: `Q = (1/2m) Σ_ij [A_ij - k_i k_j / (2m)] δ(c_i, c_j)`
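These graph measures are standard and can be explored directly with `networkx`; the snippet below is an illustrative aside (not the semantica internals) showing the listed centralities and Louvain community detection on a toy graph. It assumes networkx >= 2.8 for `louvain_communities`.

```python
# Centrality and community detection on a toy graph with networkx (illustrative only).
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()  # classic small social network

degree = nx.degree_centrality(G)             # C_D(v) = deg(v) / (n-1)
betweenness = nx.betweenness_centrality(G)   # fraction of shortest paths passing through v
closeness = nx.closeness_centrality(G)       # (n-1) / sum of shortest-path distances from v
eigen = nx.eigenvector_centrality(G)         # power iteration on the adjacency matrix

communities = louvain_communities(G, seed=42)  # modularity-optimizing community detection
print("most central node (betweenness):", max(betweenness, key=betweenness.get))
print("number of communities:", len(communities))
print("modularity Q:", round(modularity(G, communities), 3))
```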
Structural Chunking Algorithms¶
- Heading Detection: Markdown/HTML heading parsing (see the sketch after this list)
- List Detection: Ordered/unordered list identification
- Table Detection: Table boundary identification
- Section Hierarchy: Tree-based section structure
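A minimal sketch of the heading-detection step, assuming Markdown input; the `split_by_headings` helper below is hypothetical and for illustration only, whereas `StructuralChunker` (documented later) also handles HTML, lists, tables, and section hierarchies.

```python
# Heading-based structural chunking for Markdown (illustrative sketch).
import re

def split_by_headings(markdown_text):
    """Start a new chunk at every Markdown heading, keeping the heading with its section."""
    heading = re.compile(r"^(#{1,6})\s+(.*)$")
    chunks, current, current_title = [], [], None
    for line in markdown_text.splitlines():
        match = heading.match(line)
        if match:
            if current:
                chunks.append({"title": current_title, "text": "\n".join(current).strip()})
            current_title, current = match.group(2), [line]
        else:
            current.append(line)
    if current:
        chunks.append({"title": current_title, "text": "\n".join(current).strip()})
    return chunks

doc = "# Intro\nSome text.\n\n## Details\nMore text.\n"
for section in split_by_headings(doc):
    print(section["title"], "->", len(section["text"]), "chars")
```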
Validation Algorithms¶
- Chunk Size Validation: Min/max size checking
- Overlap Validation: Overlap percentage calculation
- Completeness Check: Coverage verification
- Quality Scoring: Multi-factor quality assessment (see the sketch below)
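As a rough illustration of how such checks compose, the sketch below validates plain-string chunks with character-based size and coverage heuristics; the thresholds and the toy quality score are assumptions, and the library's own checks live in `ChunkValidator` (documented below).

```python
# Toy multi-factor chunk validation (illustrative; not the ChunkValidator implementation).
def validate_chunks(chunks, original, min_size=100, max_size=2000):
    issues = []
    for i, chunk in enumerate(chunks):
        if not (min_size <= len(chunk) <= max_size):
            issues.append(f"chunk {i}: size {len(chunk)} outside [{min_size}, {max_size}]")
    # Completeness: chunks should occur in the source and cover most of it.
    covered = sum(len(c) for c in chunks if c in original)
    coverage = covered / max(len(original), 1)
    if coverage < 0.95:
        issues.append(f"coverage only {coverage:.0%} of the original text")
    quality = max(0.0, 1.0 - 0.1 * len(issues))  # toy multi-factor score
    return {"valid": not issues, "issues": issues, "quality_score": quality}

chunks = ["lorem ipsum " * 20, "dolor sit amet " * 15]
report = validate_chunks(chunks, " ".join(chunks))
print(report["valid"], report["quality_score"])
```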
Main Classes¶
TextSplitter¶
Unified text splitter with method parameter for all chunking strategies.
Methods:
| Method | Description | Algorithm |
|---|---|---|
split(text) | Split text using configured method | Method-specific algorithm |
split_documents(documents) | Batch split documents | Parallel processing |
split_with_metadata(text, metadata) | Split with metadata preservation | Metadata propagation |
validate_chunks(chunks) | Validate chunk quality | Quality assessment |
Supported Methods:
| Category | Methods |
|---|---|
| Standard | recursive, token, sentence, paragraph, character, word |
| Semantic | semantic_transformer, llm, huggingface, nltk |
| KG/Ontology | entity_aware, relation_aware, graph_based, ontology_aware |
| Advanced | hierarchical, community_detection, centrality_based, subgraph, topic_based |
Configuration Options:
TextSplitter(
method="recursive", # Chunking method
chunk_size=1000, # Target chunk size (characters/tokens)
chunk_overlap=200, # Overlap between chunks
length_function=len, # Function to measure chunk size
separators=["\n\n", "\n", " ", ""], # For recursive method
keep_separator=True, # Keep separators in chunks
add_start_index=True, # Add start index to metadata
strip_whitespace=True, # Strip whitespace from chunks
# Semantic chunking options
embedding_model="all-MiniLM-L6-v2", # For semantic_transformer
similarity_threshold=0.7, # Semantic boundary threshold
# Entity-aware options
ner_method="ml", # NER method (ml/spacy, llm, pattern)
preserve_entities=True, # Don't split entities
# LLM options
llm_provider="openai", # LLM provider
llm_model="gpt-4", # LLM model
# Graph-based options
centrality_method="degree", # Centrality measure
community_algorithm="louvain", # Community detection algorithm
)
Example:
from semantica.split import TextSplitter
# Standard recursive splitting
splitter = TextSplitter(
method="recursive",
chunk_size=1000,
chunk_overlap=200
)
chunks = splitter.split(long_text)
for i, chunk in enumerate(chunks):
print(f"Chunk {i}: {len(chunk.text)} chars")
print(f"Metadata: {chunk.metadata}")
# Entity-aware for GraphRAG
splitter = TextSplitter(
method="entity_aware",
ner_method="ml",
chunk_size=1000,
preserve_entities=True
)
chunks = splitter.split(text)
# Semantic chunking
splitter = TextSplitter(
method="semantic_transformer",
embedding_model="all-MiniLM-L6-v2",
similarity_threshold=0.7
)
chunks = splitter.split(text)
# Hierarchical chunking
splitter = TextSplitter(
method="hierarchical",
chunk_sizes=[2000, 1000, 500], # Multi-level
chunk_overlaps=[400, 200, 100]
)
chunks = splitter.split(text)
SemanticChunker¶
Semantic-based chunking using embeddings and similarity.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text) | Chunk by semantic boundaries | Embedding similarity |
find_boundaries(sentences) | Find semantic boundaries | Threshold-based detection |
calculate_similarity(sent1, sent2) | Calculate similarity | Cosine similarity |
Example:
from semantica.split import SemanticChunker
chunker = SemanticChunker(
embedding_model="all-MiniLM-L6-v2",
similarity_threshold=0.7,
min_chunk_size=100,
max_chunk_size=2000
)
chunks = chunker.chunk(long_text)
for chunk in chunks:
print(f"Chunk: {chunk.text[:100]}...")
print(f"Coherence score: {chunk.metadata.get('coherence_score')}")
EntityAwareChunker¶
Preserve entity boundaries during chunking for GraphRAG.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text, entities) | Chunk preserving entities | Entity boundary detection |
Example:
from semantica.split import EntityAwareChunker
from semantica.semantic_extract import NERExtractor
# Extract entities first
ner = NERExtractor(method="ml")
entities = ner.extract(text)
# Chunk preserving entities
chunker = EntityAwareChunker(
chunk_size=1000,
chunk_overlap=200,
ner_method="ml"
)
chunks = chunker.chunk(text, entities=entities)
for chunk in chunks:
print(f"Entities in chunk: {chunk.metadata.get('entities')}")
RelationAwareChunker¶
Preserve relationship triplets during chunking.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text, relationships) | Chunk preserving triplets | Triplet span tracking |
extract_relationships(text) | Extract relationships | Relation extraction |
validate_triplet_integrity(chunk, relationships) | Validate triplets | Integrity checking |
Example:
from semantica.split import RelationAwareChunker
from semantica.semantic_extract import RelationExtractor
# Extract relationships
rel_extractor = RelationExtractor()
relationships = rel_extractor.extract(text)
# Chunk preserving relationships
chunker = RelationAwareChunker(
chunk_size=1000,
preserve_triplets=True
)
chunks = chunker.chunk(text, relationships=relationships)
for chunk in chunks:
print(f"Relationships: {chunk.metadata.get('relationships')}")
GraphBasedChunker¶
Chunk based on graph structure and centrality.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text, graph) | Chunk by graph structure | Centrality-based |
calculate_centrality(graph) | Calculate node centrality | Degree/betweenness/closeness |
detect_communities(graph) | Detect communities | Louvain/Leiden algorithm |
Example:
from semantica.split import GraphBasedChunker
from semantica.kg import GraphBuilder
# Build graph
builder = GraphBuilder()
kg = builder.build(entities, relationships)
# Chunk by graph structure
chunker = GraphBasedChunker(
centrality_method="betweenness",
community_algorithm="louvain"
)
chunks = chunker.chunk(text, graph=kg)
for chunk in chunks:
print(f"Community: {chunk.metadata.get('community_id')}")
print(f"Centrality: {chunk.metadata.get('avg_centrality')}")
StructuralChunker¶
Structure-aware chunking respecting document hierarchy.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text) | Chunk by structure | Heading/section detection |
_extract_structure(text) | Extract structural elements | Markdown/HTML parsing |
Example:
from semantica.split import StructuralChunker
chunker = StructuralChunker(
respect_headers=True,
respect_sections=True,
max_chunk_size=2000
)
chunks = chunker.chunk(markdown_text)
for chunk in chunks:
print(f"Structure preserved: {chunk.metadata.get('structure_preserved')}")
print(f"Elements: {chunk.metadata.get('element_types')}")
HierarchicalChunker¶
Multi-level hierarchical chunking.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text) | Multi-level chunking | Recursive hierarchical split |
Example:
from semantica.split import HierarchicalChunker
chunker = HierarchicalChunker(
chunk_sizes=[2000, 1000, 500],
chunk_overlaps=[400, 200, 100],
create_parent_chunks=True
)
chunks = chunker.chunk(long_text)
for chunk in chunks:
print(f"Level: {chunk.metadata.get('level')}")
print(f"Parent: {chunk.metadata.get('parent_id')}")
print(f"Children: {chunk.metadata.get('child_ids')}")
OntologyAwareChunker¶
Chunk based on ontology concepts and relationships.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text, ontology) | Chunk by ontology concepts | Concept boundary detection |
extract_concepts(text) | Extract ontology concepts | Concept extraction |
find_concept_boundaries(text, concepts) | Find concept boundaries | Concept span checking |
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
chunk_size | int | 1000 | Target chunk size |
chunk_overlap | int | 200 | Overlap between chunks |
ontology_path | str | None | Path to ontology file (.owl, .rdf) |
preserve_concepts | bool | True | Don't split ontology concepts |
concept_extraction_method | str | "llm" | Method for concept extraction |
Example:
from semantica.split import OntologyAwareChunker
chunker = OntologyAwareChunker(
chunk_size=1000,
chunk_overlap=200,
ontology_path="domain_ontology.owl",
preserve_concepts=True,
concept_extraction_method="llm"
)
chunks = chunker.chunk(text)
for chunk in chunks:
    concepts = chunk.metadata.get('concepts', [])
    print(f"Concepts in chunk: {[c['label'] for c in concepts]}")
    print(f"Concept types: {[c['type'] for c in concepts]}")
SlidingWindowChunker¶
Fixed-size sliding window chunking with configurable step size.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk(text) | Sliding window chunking | Fixed-size window with step |
chunk_with_overlap(text) | Chunk with specific overlap | Window position calculation |
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
chunk_size | int | 1000 | Size of sliding window |
overlap | int | 0 | Overlap size |
stride | int | chunk_size - overlap | Step size |
Example:
from semantica.split import SlidingWindowChunker
# Basic sliding window
chunker = SlidingWindowChunker(
chunk_size=1000,
overlap=200
)
chunks = chunker.chunk(long_text)
for i, chunk in enumerate(chunks):
print(f"Window {i}: chars {chunk.start_index}-{chunk.end_index}")
print(f"Has overlap: {chunk.metadata.get('has_overlap')}")
# Boundary-preserving sliding window
chunks = chunker.chunk(text, preserve_boundaries=True)
TableChunker¶
Table-specific chunking preserving table structure.
Methods:
| Method | Description | Algorithm |
|---|---|---|
chunk_table(table_data) | Chunk tables | Row/Column-based splitting |
chunk_to_text_chunks(table_data) | Convert table chunks to text | Table to text conversion |
extract_table_schema(table_data) | Extract schema | Type inference and schema extraction |
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
max_rows | int | 100 | Maximum rows per table chunk |
preserve_headers | bool | True | Keep headers in each chunk |
chunk_by_columns | bool | False | Chunk by columns instead of rows |
Example:
from semantica.split import TableChunker
chunker = TableChunker(
max_rows=50,
preserve_headers=True,
chunk_by_columns=False
)
table_data = {
"headers": ["Col1", "Col2", "Col3"],
"rows": [["Val1", "Val2", "Val3"], ...]
}
# Get structured table chunks
table_chunks = chunker.chunk_table(table_data)
# Get text chunks for RAG
text_chunks = chunker.chunk_to_text_chunks(table_data)
for chunk in text_chunks:
print(f"Table chunk {chunk.metadata.get('chunk_index')}")
print(f"Rows: {chunk.metadata.get('row_count')}")
ChunkValidator¶
Validate chunk quality and completeness.
Methods:
| Method | Description | Algorithm |
|---|---|---|
validate(chunks) | Validate chunks | Multi-factor validation |
check_size(chunk) | Check size constraints | Min/max checking |
check_overlap(chunks) | Check overlap | Overlap calculation |
check_completeness(chunks, original) | Check coverage | Coverage verification |
calculate_quality_score(chunk) | Quality score | Multi-factor scoring |
Example:
from semantica.split import ChunkValidator
validator = ChunkValidator(
min_chunk_size=100,
max_chunk_size=2000,
min_overlap=50,
max_overlap=500
)
validation_result = validator.validate(chunks)
print(f"Valid: {validation_result['valid']}")
print(f"Issues: {validation_result['issues']}")
print(f"Quality score: {validation_result['quality_score']}")
ProvenanceTracker¶
Track chunk provenance for data lineage.
Methods:
| Method | Description | Algorithm |
|---|---|---|
track(chunk, source) | Track chunk origin | Provenance recording |
get_lineage(chunk_id) | Get chunk lineage | Lineage retrieval |
visualize_lineage(chunk_id) | Visualize lineage | Graph visualization |
Example:
from semantica.split import ProvenanceTracker
tracker = ProvenanceTracker()
for chunk in chunks:
    tracker.track(
        chunk=chunk,
        source={
            "document_id": "doc123",
            "file_path": "data/document.pdf",
            "page": 5,
            "timestamp": "2024-01-01T00:00:00Z"
        }
    )
# Get lineage
lineage = tracker.get_lineage(chunk.id)
print(f"Source: {lineage['source']}")
print(f"Transformations: {lineage['transformations']}")
Convenience Functions¶
Quick access to splitting operations:
from semantica.split import (
split_recursive,
split_by_tokens,
split_by_sentences,
split_by_paragraphs,
split_entity_aware,
split_relation_aware,
split_semantic_transformer,
list_available_methods
)
# List available methods
methods = list_available_methods()
print(f"Available methods: {methods}")
# Quick splitting
chunks = split_recursive(text, chunk_size=1000, chunk_overlap=200)
chunks = split_by_sentences(text, sentences_per_chunk=5)
chunks = split_entity_aware(text, ner_method="ml")
Configuration¶
Environment Variables¶
# Default settings
export SPLIT_DEFAULT_METHOD=recursive
export SPLIT_DEFAULT_CHUNK_SIZE=1000
export SPLIT_DEFAULT_CHUNK_OVERLAP=200
# Semantic chunking
export SPLIT_EMBEDDING_MODEL=all-MiniLM-L6-v2
export SPLIT_SIMILARITY_THRESHOLD=0.7
# Entity-aware
export SPLIT_NER_METHOD=ml # or spacy
export SPLIT_PRESERVE_ENTITIES=true
# LLM-based
export SPLIT_LLM_PROVIDER=openai
export SPLIT_LLM_MODEL=gpt-4
YAML Configuration¶
# config.yaml - Split Module Configuration
split:
  default_method: recursive
  chunk_size: 1000
  chunk_overlap: 200
  recursive:
    separators: ["\n\n", "\n", " ", ""]
    keep_separator: true
  semantic:
    embedding_model: all-MiniLM-L6-v2
    similarity_threshold: 0.7
    min_chunk_size: 100
    max_chunk_size: 2000
  entity_aware:
    ner_method: ml  # or spacy
    preserve_entities: true
    min_entity_gap: 50
  relation_aware:
    preserve_triplets: true
    relation_extraction_method: llm
  graph_based:
    centrality_method: betweenness
    community_algorithm: louvain
    min_community_size: 3
  hierarchical:
    levels: 3
    chunk_sizes: [2000, 1000, 500]
    chunk_overlaps: [400, 200, 100]
  validation:
    enabled: true
    min_chunk_size: 100
    max_chunk_size: 2000
    check_overlap: true
    check_completeness: true
Method Comparison¶
| Method | Best For | Pros | Cons |
|---|---|---|---|
| recursive | General text | Fast, simple | May split mid-sentence |
| sentence | Coherent chunks | Respects sentences | Variable size |
| semantic_transformer | Topic coherence | Semantic boundaries | Slower, needs embeddings |
| entity_aware | GraphRAG | Preserves entities | Requires NER |
| relation_aware | KG extraction | Preserves triplets | Requires relation extraction |
| graph_based | Graph analysis | Graph-aware | Requires graph construction |
| hierarchical | Large documents | Multi-level | More complex |
| structural | Formatted docs | Respects structure | Needs structure |
Integration Examples¶
Complete GraphRAG Pipeline¶
from semantica.split import TextSplitter
from semantica.semantic_extract import NERExtractor, RelationExtractor
from semantica.kg import GraphBuilder
from semantica.embeddings import EmbeddingGenerator
from semantica.vector_store import VectorStore
# Parse document
text = "Apple Inc. was founded by Steve Jobs in 1976..."
# Entity-aware chunking
splitter = TextSplitter(
method="entity_aware",
ner_method="llm",
chunk_size=1000
)
chunks = splitter.split(text)
# Extract from each chunk
ner = NERExtractor(method="llm")
rel_extractor = RelationExtractor()
all_entities = []
all_relationships = []
for chunk in chunks:
    entities = ner.extract(chunk.text)
    relationships = rel_extractor.extract(chunk.text, entities)
    all_entities.extend(entities)
    all_relationships.extend(relationships)
# Build knowledge graph
builder = GraphBuilder()
kg = builder.build(all_entities, all_relationships)
# Generate embeddings
embedder = EmbeddingGenerator()
embeddings = embedder.generate([chunk.text for chunk in chunks])
# Store in vector store
vector_store = VectorStore()
vector_store.store(embeddings, chunks)
Multi-Level Hierarchical Chunking¶
from semantica.split import HierarchicalChunker
chunker = HierarchicalChunker(
chunk_sizes=[4000, 2000, 1000],
chunk_overlaps=[800, 400, 200],
create_parent_chunks=True
)
chunks = chunker.chunk(very_long_document)
# Access hierarchy
for chunk in chunks:
    level = chunk.metadata['level']
    parent_id = chunk.metadata.get('parent_id')
    child_ids = chunk.metadata.get('child_ids', [])
    print(f"Level {level}: {len(chunk.text)} chars")
    if parent_id:
        print(f"  Parent: {parent_id}")
    if child_ids:
        print(f"  Children: {len(child_ids)}")
Best Practices¶
1. Choose Appropriate Chunk Size¶
# For semantic search (embeddings)
splitter = TextSplitter(method="recursive", chunk_size=512)
# For LLM context (GPT-4)
splitter = TextSplitter(method="recursive", chunk_size=4000)
# For entity extraction
splitter = TextSplitter(method="entity_aware", chunk_size=1000)
2. Use Overlap for Context¶
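Overlap repeats a slice of text between consecutive chunks so that sentences near a boundary stay visible to both neighbours. As a rule of thumb, an overlap of roughly 10-20% of the chunk size (for example, the 200-character overlap used with 1000-character chunks throughout this page) is a reasonable starting point; the snippet below uses the documented `TextSplitter` parameters.

```python
from semantica.split import TextSplitter

# ~20% overlap: boundary sentences appear in both neighbouring chunks
splitter = TextSplitter(
    method="recursive",
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split(long_text)
```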
3. Validate Chunks¶
from semantica.split import ChunkValidator
validator = ChunkValidator()
validation = validator.validate(chunks)
if not validation['valid']:
print(f"Issues: {validation['issues']}")
4. Track Provenance¶
from semantica.split import ProvenanceTracker
tracker = ProvenanceTracker()
for chunk in chunks:
    tracker.track(chunk, source={"doc_id": "123"})
Troubleshooting¶
Issue: Chunks too small/large¶
# Solution: Adjust chunk size and method
splitter = TextSplitter(
method="recursive",
chunk_size=1500, # Increase
chunk_overlap=300
)
# Or use validation
validator = ChunkValidator(min_chunk_size=500, max_chunk_size=2000)
Issue: Entities split across chunks¶
# Solution: Use entity-aware chunking
splitter = TextSplitter(
method="entity_aware",
ner_method="llm",
preserve_entities=True
)
Issue: Slow semantic chunking¶
# Solution: Use faster embedding model or batch processing
splitter = TextSplitter(
method="semantic_transformer",
embedding_model="all-MiniLM-L6-v2", # Faster model
batch_size=32 # Batch embeddings
)
Performance Tips¶
Memory Optimization¶
# Process in batches
from semantica.split import TextSplitter

def chunk_large_corpus(documents, batch_size=100):
    splitter = TextSplitter(method="recursive")
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        chunks = splitter.split_documents(batch)
        yield from chunks
Speed Optimization¶
# Use faster methods for large documents
splitter = TextSplitter(
method="recursive", # Fastest
chunk_size=1000
)
# Avoid LLM-based for large corpora
# Use semantic_transformer instead of llm
See Also¶
- Parse Module - Document parsing
- Semantic Extract Module - Entity extraction
- Knowledge Graph Module - Graph construction
- Embeddings Module - Vector generation
- Vector Store Module - Vector storage
Cookbook¶
Interactive tutorials to learn text chunking and splitting:
- Chunking and Splitting: Split documents for RAG and processing
    - Topics: Recursive character splitting, semantic splitting, token-based splitting, chunking strategies
    - Difficulty: Beginner
    - Use Cases: Preparing text for RAG, document chunking