Semantic Extract¶

Advanced information extraction system for Entities, Relations, Events, and Triplets.

🎯 Overview¶

The Semantic Extract Module extracts structured information from unstructured text. It identifies entities, relationships, events, and semantic structures that form the foundation of knowledge graphs.

What is Semantic Extraction?¶

Semantic extraction is the process of identifying meaningful information from text:

Named Entities: People, organizations, locations, dates, etc.
Relationships: Connections between entities (e.g., "founded_by", "located_in")
Events: Actions with temporal information and participants
Triplets: Subject-Predicate-Object structures for knowledge graphs
Semantic Networks: Structured networks of nodes and edges

Why Use the Semantic Extract Module?¶

Multiple Methods: Support for ML models, LLMs, and rule-based extraction
High Accuracy: LLM-based extraction for complex schemas
Flexible Configuration: Customize extraction for your domain
Confidence Scores: Get confidence scores for all extractions
Batch Processing: Efficient parallel batch processing for large datasets
Coreference Resolution: Resolve pronouns to their entity references

How It Works¶

Text Input: Receive parsed text from the parse module
Entity Extraction: Identify named entities using NER
Coreference Resolution: Resolve pronouns to entities (optional)
Relationship Extraction: Identify relationships between entities
Event Detection: Detect events with temporal information
Triplet Generation: Generate RDF triplets for knowledge graphs
Output: Return structured entities, relationships, and triplets

NER: Extract Named Entities (Person, Org, Loc) with confidence scores
Relation Extraction: Identify relationships between entities (e.g., founded_by, located_in)
Event Detection: Detect events with temporal information and participants
Coreference: Resolve pronouns ("he", "it") to their entity references
Triplet Extraction: Extract Subject-Predicate-Object triplets for Knowledge Graphs
LLM Extraction: Use LLMs to improve extraction quality and handle complex schemas
Semantic Networks: Build structured networks with nodes and edges from text

When to Use

KG Construction: Converting unstructured text into structured graph data
Text Analysis: Identifying key actors and events in documents
Search Indexing: Extracting metadata for faceted search
Data Enrichment: Adding semantic tags to content

⚙️ Algorithms Used¶

Named Entity Recognition (NER)¶

Purpose: Identify and classify named entities in text.

How it works:

Transformer Models: BERT/RoBERTa for token classification
Regex Patterns: Pattern matching for specific formats (Emails, IDs)
LLM Prompting: Zero-shot extraction for custom entity types
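
The regex route is the easiest of the three to illustrate without a model. The following is a minimal sketch using only Python's `re` module; the `Span` type, the `PATTERNS` table, and the `TICKET_ID` label are hypothetical names chosen for this example and are not part of the Semantica API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    text: str
    label: str
    start: int
    end: int


# Hypothetical patterns for illustration; tune them for your own data.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "TICKET_ID": re.compile(r"\b[A-Z]{3}-\d{3}\b"),
}


def regex_entities(text: str) -> list[Span]:
    """Return every pattern match as a labeled span, ordered by position."""
    spans = [
        Span(m.group(), label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda s: s.start)


found = regex_entities("Mail ops@example.com about ABC-123.")
print([(s.text, s.label) for s in found])
```

Pattern-based spans like these are deterministic, which is why they suit fixed formats better than free-form names.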

Relation Extraction¶

Purpose: Identify relationships between entities.

How it works:

Dependency Parsing: Analyzing grammatical structure to find subject-verb-object paths
Joint Extraction: Extracting entities and relations simultaneously
Semantic Role Labeling: Identifying "Who did What to Whom"

Coreference Resolution¶

Purpose: Resolve pronouns and other mentions to the entities they refer to.

How it works:

Mention Detection: Finding all potential references (nouns, pronouns)
Clustering: Grouping mentions that refer to the same real-world entity
Pronoun Resolution: Mapping pronouns to the most likely antecedent
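
To make the pronoun-resolution step concrete, here is a deliberately naive sketch: each pronoun is mapped to the nearest preceding entity whose type is compatible with it. This is a toy heuristic, not how `CoreferenceResolver` works internally; the `PRONOUN_LABELS` table and the function name are assumptions made for illustration, and real resolvers use learned mention-ranking or clustering models.

```python
import re

# Hypothetical compatibility table: which entity labels each pronoun may refer to.
PRONOUN_LABELS = {
    "he": {"PERSON"},
    "she": {"PERSON"},
    "him": {"PERSON"},
    "her": {"PERSON"},
    "it": {"ORG", "LOC", "PRODUCT"},
}


def resolve_pronouns_naive(text, entities):
    """Pair each pronoun with the closest preceding entity of a compatible label.

    `entities` is a list of (name, label) tuples.
    """
    # Collect (position, name, label) for every occurrence of a known entity.
    mentions = sorted(
        (m.start(), name, label)
        for name, label in entities
        for m in re.finditer(re.escape(name), text)
    )
    resolved = []
    for m in re.finditer(r"\b\w+\b", text):
        allowed = PRONOUN_LABELS.get(m.group().lower())
        if allowed is None:
            continue  # not a pronoun we handle
        candidates = [
            name for pos, name, label in mentions
            if pos < m.start() and label in allowed
        ]
        if candidates:
            resolved.append((m.group(), candidates[-1]))
    return resolved


print(resolve_pronouns_naive(
    "Steve Jobs founded Apple. He was the CEO.",
    [("Steve Jobs", "PERSON"), ("Apple", "ORG")],
))
```

Without the label check, plain recency would bind "He" to "Apple", which shows why type agreement matters even in a toy resolver.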

Triplet Extraction¶

Purpose: Extract Subject-Predicate-Object triplets for Knowledge Graphs.

How it works:

OpenIE: Open Information Extraction for arbitrary relation strings
Schema-Based: Mapping extracted relations to a predefined ontology
Reification: Handling complex relations (time, location) by creating event nodes
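
Reification is easier to follow with a small example. The sketch below expands a triplet that carries a time qualifier into several plain triples around an event node. The `Triplet` dataclass here is a local stand-in, and the predicate names (`type`, `agent`, `object`, `time`) are hypothetical; the module's actual output schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    object: str
    temporal: Optional[str] = None


def reify(triplet: Triplet, event_id: str) -> list[tuple[str, str, str]]:
    """Expand one qualified triplet into plain triples around an event node."""
    if triplet.temporal is None:
        # Nothing to qualify, so the triplet stays a single edge.
        return [(triplet.subject, triplet.predicate, triplet.object)]
    return [
        (event_id, "type", triplet.predicate),
        (event_id, "agent", triplet.subject),
        (event_id, "object", triplet.object),
        (event_id, "time", triplet.temporal),
    ]


for triple in reify(Triplet("Steve Jobs", "founded", "Apple", "1976"), "event_1"):
    print(triple)
```

The event node gives the time a place to attach, which a single subject-predicate-object edge cannot express.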

Main Classes¶

NamedEntityRecognizer¶

Coordinator for entity extraction.

Parameters:

Parameter	Type	Default	Description
`methods`	list	`["spacy"]`	Extraction methods to use
`confidence_threshold`	float	`0.5`	Minimum confidence score
`merge_overlapping`	bool	`True`	Merge overlapping entities
`include_standard_types`	bool	`True`	Include Person, Org, Location

Methods:

Method	Description
`extract_entities(text)`	Get list of entities
`add_custom_pattern(pattern)`	Add regex rule

Example:

from semantica.semantic_extract import NamedEntityRecognizer

# Basic usage
ner = NamedEntityRecognizer()
entities = ner.extract_entities("Elon Musk leads SpaceX.")
# [Entity(text="Elon Musk", label="PERSON"), Entity(text="SpaceX", label="ORG")]

# With configuration
ner = NamedEntityRecognizer(
    methods=["spacy", "rule-based"],
    confidence_threshold=0.7,
    merge_overlapping=True
)
entities = ner.extract_entities("Apple Inc. was founded in 1976.")
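
The effect of `confidence_threshold` and `merge_overlapping` can be pictured with a standalone sketch. The code below is an assumption about the general technique (threshold filtering followed by greedy overlap removal), not the library's implementation; the local `Entity` dataclass and its `start`, `end`, and `confidence` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int
    end: int  # exclusive character offset
    confidence: float


def filter_and_merge(entities, threshold=0.5):
    """Drop low-confidence entities, then keep one entity per overlapping group."""
    kept = [e for e in entities if e.confidence >= threshold]
    # Prefer higher confidence, then longer spans, when choosing among overlaps.
    kept.sort(key=lambda e: (-e.confidence, -(e.end - e.start)))
    selected = []
    for cand in kept:
        overlaps = any(cand.start < s.end and s.start < cand.end for s in selected)
        if not overlaps:
            selected.append(cand)
    return sorted(selected, key=lambda e: e.start)


raw = [
    Entity("Apple", "ORG", 0, 5, 0.80),
    Entity("Apple Inc.", "ORG", 0, 10, 0.92),
    Entity("1976", "DATE", 26, 30, 0.40),
]
print([e.text for e in filter_and_merge(raw, threshold=0.5)])
```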

NERExtractor¶

Core entity extraction implementation used by notebooks and lower-level integrations.

Parameters:

Parameter	Type	Default	Description
`method`	str or list	`"ml"`	Method(s): "ml", "llm", "pattern", "regex", "huggingface"
`entity_types`	list	`None`	Filter for specific entity types
`**config`	dict	`{}`	Method-specific config (e.g., `model`, `aggregation_strategy`, `device`)

Methods:

Method	Description
`extract(text, pipeline_id=None, **kwargs)`	Alias for `extract_entities`. Supports `max_workers`.
`extract_entities(text, pipeline_id=None, **kwargs)`	Get list of entities. Supports `max_workers`.

Example:

from semantica.semantic_extract import NERExtractor

# 1. ML (spaCy) - Default
extractor = NERExtractor(method="ml", model="en_core_web_trf")
entities = extractor.extract("Elon Musk leads SpaceX.")

# 2. LLM (OpenAI/Gemini/Groq/etc)
extractor = NERExtractor(
    method="llm", 
    provider="groq", 
    model="llama-3.3-70b-versatile",
    max_tokens=2048, # Increased output limit
    temperature=0.0
)

# 3. Regex with custom patterns
patterns = {"CODE": r"[A-Z]{3}-\d{3}"}
extractor = NERExtractor(method="regex", patterns=patterns)

# 4. Ensemble (Multiple methods)
extractor = NERExtractor(method=["ml", "llm"], ensemble_voting=True)
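
This page does not specify how `ensemble_voting` combines results. One common approach is majority voting, sketched below under that assumption: an entity is kept only if enough methods propose the same (text, label) pair. The `vote` function and `min_votes` parameter are hypothetical.

```python
from collections import Counter


def vote(predictions: dict[str, set[tuple[str, str]]], min_votes: int = 2):
    """Keep (text, label) pairs proposed by at least `min_votes` methods."""
    counts = Counter(pair for pairs in predictions.values() for pair in pairs)
    return sorted(pair for pair, n in counts.items() if n >= min_votes)


predictions = {
    "ml": {("Elon Musk", "PERSON"), ("SpaceX", "ORG")},
    "llm": {("Elon Musk", "PERSON"), ("SpaceX", "ORG"), ("leads", "ROLE")},
}
print(vote(predictions))
```

Requiring agreement trades recall for precision: the entity proposed by only one method is dropped.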

RelationExtractor¶

Extracts relationships between entities.

Parameters:

Parameter	Type	Default	Description
`method`	str	`"dependency"`	Method: "dependency", "pattern", "cooccurrence", "huggingface", "llm"
`relation_types`	list	`None`	Specific relation types to extract
`bidirectional`	bool	`False`	Extract bidirectional relations
`confidence_threshold`	float	`0.6`	Minimum confidence score
`max_distance`	int	`50`	Max token distance between entities
`**config`	dict	`{}`	Method-specific config (e.g., `model`, `device` for HuggingFace)

Methods:

Method	Description
`extract(text, entities, pipeline_id=None, **kwargs)`	Alias for `extract_relations`. Supports `max_workers`.
`extract_relations(text, entities, pipeline_id=None, **kwargs)`	Find links. Supports `max_workers`.

Example:

from semantica.semantic_extract import RelationExtractor, NamedEntityRecognizer

# First extract entities
ner = NamedEntityRecognizer()
text = "Elon Musk founded SpaceX in 2002."
entities = ner.extract_entities(text)

# Basic relation extraction
rel_extractor = RelationExtractor()
relations = rel_extractor.extract(text, entities=entities)
# [Relation(subject="Elon Musk", predicate="founded", object="SpaceX")]

# With configuration
rel_extractor = RelationExtractor(
    relation_types=["founded", "leads", "works_at"],
    confidence_threshold=0.7,
    bidirectional=False
)
relations = rel_extractor.extract(text, entities=entities)
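
As a rough picture of what a token-distance limit such as `max_distance` does, the sketch below generates candidate entity pairs and discards those that sit too far apart. It assumes each entity is reduced to the index of its first token; the `candidate_pairs` helper is hypothetical and is not taken from the library.

```python
from itertools import combinations


def candidate_pairs(entity_tokens: dict[str, int], max_distance: int = 50):
    """Return entity pairs whose token positions are at most `max_distance` apart."""
    ordered = sorted(entity_tokens.items(), key=lambda item: item[1])
    return [
        (a, b)
        for (a, pos_a), (b, pos_b) in combinations(ordered, 2)
        if pos_b - pos_a <= max_distance
    ]


# Token indices in "Elon Musk founded SpaceX in 2002 ."
positions = {"Elon Musk": 0, "SpaceX": 3, "2002": 5}
print(candidate_pairs(positions, max_distance=3))
```

A tighter limit reduces spurious pairs in long documents at the cost of missing relations that span several sentences.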

CoreferenceResolver¶

Resolves pronoun references and entity coreferences.

Parameters:

Parameter	Type	Default	Description
`method`	str or list	`None`	Underlying NER method(s)
`**config`	dict	`{}`	Configuration for NER method

Methods:

Method	Description
`resolve(text)`	Alias for `resolve_coreferences`. Get coreference chains.
`resolve_coreferences(text)`	Get coreference chains
`resolve_pronouns(text)`	Resolve pronouns to entities

Example:

from semantica.semantic_extract import CoreferenceResolver

resolver = CoreferenceResolver()
text = "Steve Jobs founded Apple. He was the CEO."

# Resolve references
chains = resolver.resolve(text)
# [CoreferenceChain(mentions=["Steve Jobs", "He"], representative="Steve Jobs")]

EventDetector¶

Identifies events with temporal information and participants.

Parameters:

Parameter	Type	Default	Description
`event_types`	list	`None`	Specific event types to detect
`extract_participants`	bool	`True`	Extract event participants
`extract_location`	bool	`True`	Extract event locations
`extract_time`	bool	`True`	Extract temporal information
`max_workers`	int	`1`	Threads for parallel batch processing

Methods:

Method	Description
`detect_events(text)`	Find events

Example:

from semantica.semantic_extract import EventDetector

detector = EventDetector(
    event_types=["launch", "acquisition", "announcement"],
    extract_participants=True,
    extract_time=True
)
events = detector.detect_events("SpaceX launched Starship on March 14, 2024.")
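
Temporal information is most useful once it is normalized. The following sketch, which is independent of `EventDetector`, pulls dates written as "Month D, YYYY" out of a sentence and converts them to ISO 8601 with the standard library. The `extract_dates` helper is hypothetical, it handles only this one date format, and it assumes English month names (the default C locale).

```python
import re
from datetime import datetime

# Matches dates such as "March 14, 2024".
DATE_RE = re.compile(r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b")


def extract_dates(text: str) -> list[str]:
    """Return ISO 8601 dates for every 'Month D, YYYY' expression in the text."""
    dates = []
    for match in DATE_RE.finditer(text):
        try:
            parsed = datetime.strptime(match.group(1), "%B %d, %Y")
        except ValueError:
            continue  # looked like a date but was not one (e.g. "Starship 14, 2024")
        dates.append(parsed.date().isoformat())
    return dates


print(extract_dates("SpaceX launched Starship on March 14, 2024."))
```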

TripletExtractor¶

Extracts RDF triplets (Subject-Predicate-Object).

Parameters:

Parameter	Type	Default	Description
`method`	str	`"pattern"`	Extraction method ("pattern", "rules", "huggingface", "llm")
`triplet_types`	list	`None`	Specific triplet types/predicates to extract
`include_temporal`	bool	`False`	Include time information
`include_provenance`	bool	`False`	Track source sentences
`**kwargs`	dict	`{}`	Configuration options (e.g., `model`, `device`)

Methods:

Method	Description
`extract(text, entities=None, relations=None, pipeline_id=None, **kwargs)`	Alias for `extract_triplets`. Supports `max_workers`.
`extract_triplets(text, entities=None, relations=None, pipeline_id=None, **kwargs)`	Get (S, P, O) tuples. Supports `max_workers`.

Example:

from semantica.semantic_extract import TripletExtractor

extractor = TripletExtractor(
    include_temporal=True,
    include_provenance=True
)
triplets = extractor.extract_triplets("Steve Jobs founded Apple in 1976.")
# [Triplet(subject="Steve Jobs", predicate="founded", object="Apple", temporal="1976")]
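
Since the triplets are described as RDF, it may help to see how (subject, predicate, object) tuples could be written out as N-Triples lines. This is a minimal sketch with a hypothetical `http://example.org/` namespace; it treats every object as an IRI rather than a literal and is not an export function provided by the module.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace for illustration


def to_ntriples(triplets, base=BASE):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    def iri(name):
        # Replace spaces and percent-encode anything unsafe in an IRI path.
        return f"<{base}{quote(name.replace(' ', '_'))}>"

    return "\n".join(f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in triplets)


print(to_ntriples([("Steve Jobs", "founded", "Apple")]))
```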

SemanticNetworkExtractor¶

Extracts structured semantic networks with nodes and edges.

Parameters:

Parameter	Type	Default	Description
`ner_method`	str	`None`	Method for node extraction
`relation_method`	str	`None`	Method for edge extraction
`max_workers`	int	`1`	Threads for parallel batch processing
`**config`	dict	`{}`	Configuration for underlying extractors

Methods:

Method	Description
`extract_network(text)`	Build network from text
`extract(text)`	Alias for `extract_network`
`export_to_yaml(network, path)`	Save network to YAML

Example:

from semantica.semantic_extract import SemanticNetworkExtractor

extractor = SemanticNetworkExtractor()
network = extractor.extract("Apple Inc. is located in Cupertino.")

# Analyze network
print(f"Nodes: {len(network.nodes)}")
print(f"Edges: {len(network.edges)}")

LLMExtraction¶

LLM-based extraction and enhancement. (Alias: LLMEnhancer)

Parameters:

Parameter	Type	Default	Description
`provider`	str	`"openai"`	LLM provider ("openai", "gemini", "anthropic", etc.)
`**config`	dict	`{}`	Model config (model name, api_key, etc.)

Methods:

Method	Description
`enhance_extractions(extractions, text)`	Enhance generic extractions
`enhance_entities(text, entities)`	Improve entity accuracy and details
`enhance_relations(text, relations)`	Improve relation detection

Example:

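from semantica.semantic_extract import LLMExtraction, NamedEntityRecognizer

# Start from entities produced by another extractor
text = "Elon Musk leads SpaceX."
entities = NamedEntityRecognizer().extract_entities(text)

extractor = LLMExtraction(provider="openai", model="gpt-4")
enhanced_entities = extractor.enhance_entities(text, entities)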

Batch Processing & Provenance¶

All extractors support batch processing for high-throughput extraction. You can pass a list of strings or a list of dictionaries (with content and id keys).

Features:

Progress Tracking: Automatically shows a progress bar for large batches.
Provenance Metadata: Each extracted item includes batch_index and document_id in its metadata.

from semantica.semantic_extract import NERExtractor

documents = [
    {"id": "doc_1", "content": "Apple Inc. was founded by Steve Jobs."},
    {"id": "doc_2", "content": "Microsoft Corporation was founded by Bill Gates."}
]

extractor = NERExtractor()
batch_results = extractor.extract(documents)

for i, doc_entities in enumerate(batch_results):
    print(f"Document {i} entities:")
    for entity in doc_entities:
        print(f"  - {entity.text} ({entity.label})")
        print(f"    Provenance: Batch Index {entity.metadata['batch_index']}, Doc ID {entity.metadata.get('document_id')}")
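
For intuition about how parallel batch processing and provenance tagging fit together, here is a standalone sketch built on `concurrent.futures`. It is not the library's implementation: `extract_capitalized` is a stand-in extractor and `run_batch` is a hypothetical helper; only the `batch_index` and `document_id` metadata keys come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_capitalized(text: str) -> list[str]:
    """Stand-in extractor: return capitalized words (hypothetical)."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def run_batch(documents, extract=extract_capitalized, max_workers=4):
    """Run `extract` over documents in parallel, tagging results with provenance."""
    def process(indexed):
        index, doc = indexed
        return [
            {
                "text": item,
                "metadata": {"batch_index": index, "document_id": doc.get("id")},
            }
            for item in extract(doc["content"])
        ]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results line up with documents.
        return list(pool.map(process, enumerate(documents)))


docs = [
    {"id": "doc_1", "content": "Apple Inc. was founded by Steve Jobs."},
    {"id": "doc_2", "content": "Microsoft Corporation was founded by Bill Gates."},
]
for doc_items in run_batch(docs):
    print(doc_items[0])
```

Threads suit this workload when the extractor spends most of its time waiting on I/O, such as remote LLM calls.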

Robust Extraction Fallbacks¶

The framework implements robust fallback chains to prevent empty results when primary methods fail (e.g., due to model unavailability or obscure text).

NER: ML/LLM -> Pattern -> Last Resort (Capitalized Words)
Relation: Primary -> Pattern -> Last Resort (Adjacency)
Triplet: Primary -> Relation-to-Triplet -> Pattern

This ensures that you almost always get some structured data, even if it requires falling back to simpler heuristics.
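
A fallback chain of this kind can be sketched in a few lines. The code below is an illustration of the idea only: the function names are hypothetical, the failing primary method is simulated, and the last resort mirrors the "capitalized words" heuristic named above.

```python
import re


def capitalized_words(text):
    """Last-resort heuristic: treat capitalized words as candidate entities."""
    return re.findall(r"\b[A-Z][a-zA-Z]+\b", text)


def ticket_ids(text):
    """Pattern stage: match a fixed ID format."""
    return re.findall(r"[A-Z]{3}-\d{3}", text)


def unavailable_model(text):
    """Simulates a primary method that fails, e.g. a missing model."""
    raise RuntimeError("model not installed")


def extract_with_fallback(text, extractors):
    """Try each (name, extractor) in order; return the first non-empty result."""
    for name, extractor in extractors:
        try:
            result = extractor(text)
        except Exception:
            # A failed method just moves us down the chain.
            continue
        if result:
            return name, result
    return "none", []


chain = [
    ("ml", unavailable_model),
    ("pattern", ticket_ids),
    ("last_resort", capitalized_words),
]
print(extract_with_fallback("Elon Musk leads SpaceX.", chain))
```

Returning the name of the stage that produced the result makes it easy to tell heuristic output apart from model output downstream.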

Usage Examples¶

from semantica.semantic_extract import (
    NamedEntityRecognizer, 
    RelationExtractor,
    TripletExtractor,
    EventDetector,
    CoreferenceResolver,
    SemanticNetworkExtractor
)

text = "Apple released the iPhone in 2007. Steve Jobs announced it at Macworld."

# Extract entities with confidence filtering
ner = NamedEntityRecognizer(confidence_threshold=0.7)
entities = ner.extract_entities(text)

# Resolve coreferences (recommended before relation extraction).
# Note: resolve() returns coreference chains; this example does not feed them into later steps.
coref = CoreferenceResolver()
resolved = coref.resolve(text)

# Extract relations
rel_extractor = RelationExtractor(confidence_threshold=0.6)
relations = rel_extractor.extract_relations(text, entities=entities)

# Extract triplets for KG
triplet_extractor = TripletExtractor(include_temporal=True)
triplets = triplet_extractor.extract_triplets(text)

# Detect events
event_detector = EventDetector(extract_time=True)
events = event_detector.detect_events(text)

# Extract semantic network
network_extractor = SemanticNetworkExtractor()
network = network_extractor.extract(text)

print(f"Entities: {len(entities)}")
print(f"Relations: {len(relations)}")
print(f"Triplets: {len(triplets)}")
print(f"Events: {len(events)}")
print(f"Network Nodes: {len(network.nodes)}")

Configuration¶

Environment Variables¶

export NER_MODEL=dslim/bert-base-NER
export RELATION_MODEL=semantica/rel-extract-v1
export EXTRACT_CONFIDENCE_THRESHOLD=0.7
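
If you need these values in your own code, they can be read with the standard library. The sketch below assumes nothing about how Semantica itself consumes the variables; `load_extract_settings` is a hypothetical helper, and the `0.5` fallback is an assumption borrowed from the `NamedEntityRecognizer` default.

```python
import os


def load_extract_settings(env=os.environ):
    """Read extraction settings from environment variables, with fallbacks."""
    return {
        "ner_model": env.get("NER_MODEL"),
        "relation_model": env.get("RELATION_MODEL"),
        # Environment values are strings, so convert the threshold explicitly.
        "confidence_threshold": float(env.get("EXTRACT_CONFIDENCE_THRESHOLD", "0.5")),
    }


print(load_extract_settings({"NER_MODEL": "dslim/bert-base-NER"}))
```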

YAML Configuration¶

semantic_extract:
  ner:
    model: dslim/bert-base-NER
    min_confidence: 0.7

  relations:
    max_distance: 50 # tokens

  coreference:
    enabled: true

Integration Examples¶

KG Population Pipeline¶

from semantica.semantic_extract import NamedEntityRecognizer, RelationExtractor, TripletExtractor
from semantica.kg import GraphBuilder

# 1. Extract
text = "Google was founded by Larry Page and Sergey Brin."
ner = NamedEntityRecognizer()
entities = ner.extract_entities(text)
triplet_extractor = TripletExtractor()
triplets = triplet_extractor.extract_triplets(text)

# 2. Populate KG using GraphBuilder
builder = GraphBuilder()
sources = [{
    "entities": entities,
    "relationships": [{"source": t.subject, "target": t.object, "type": t.predicate} for t in triplets]
}]
kg = builder.build(sources)

Best Practices¶

Resolve Coreferences: Always run coreference resolution before relation extraction to link "He" to "John Doe".
Filter Low Confidence: Set a confidence threshold (e.g., 0.7) to reduce noise.
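Use Custom Patterns: For domain-specific IDs with a fixed format (e.g., "Invoice #123"), regex is typically faster and more precise than an ML model.
Batch Processing: When processing large corpora, pass a list of documents to the extractor instead of calling it once per document.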

Cookbook¶

Interactive tutorials to learn semantic extraction:

Entity Extraction: Extract named entities from text using NER
    Topics: NER, spaCy, LLM extraction, entity types, confidence scores
    Difficulty: Beginner
    Use Cases: Identifying entities in text, building entity lists
Relation Extraction: Discover and classify relationships between entities
    Topics: Relation classification, dependency parsing, relationship types
    Difficulty: Beginner
    Use Cases: Finding relationships, building knowledge graphs
Advanced Extraction: Custom extractors, LLM-based extraction, and complex pattern matching
    Topics: Custom models, regex, LLMs, ensemble methods, domain-specific extraction
    Difficulty: Advanced
    Use Cases: Custom extraction schemas, domain-specific entities
