Normalize¶
Clean, standardize, and prepare text and data for semantic processing with comprehensive normalization capabilities.
Overview¶
- **Text Cleaning**: Remove noise, fix encoding issues, and standardize whitespace for clean text
- **Entity Normalization**: Standardize entity names, abbreviations, and formats across documents
- **Date & Time**: Parse and standardize date/time formats to ISO 8601
- **Number Normalization**: Standardize numeric values, units, and measurements
- **Language Detection**: Automatically detect document language with confidence scoring
- **Encoding Handling**: Fix character encoding issues and ensure UTF-8 compliance
Why Normalize?
Normalization is crucial for:
- Consistency: Ensure uniform data representation
- Accuracy: Improve entity extraction and matching
- Quality: Reduce noise and errors in downstream processing
- Performance: Enable better deduplication and search
Algorithms Used¶
Text Normalization¶
Purpose: Clean and standardize text for consistent processing.
How it works:
- Unicode Normalization: NFC, NFD, NFKC, NFKD forms using Unicode standard
- Whitespace Normalization: Regex-based cleanup (`\s+` → single space)
- Case Folding: Locale-aware case normalization (Unicode case folding)
- Diacritic Removal: Unicode decomposition and combining character removal
- Punctuation Handling: Smart punctuation normalization preserving sentence structure
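As a point of reference, the same steps can be sketched with Python's standard library alone; this is an illustrative approximation, not the library's implementation:

```python
import re
import unicodedata

def basic_normalize(text: str) -> str:
    # Unicode normalization: NFKC folds compatibility characters (e.g., the TM sign -> "TM")
    text = unicodedata.normalize("NFKC", text)
    # Diacritic removal: decompose (NFD), then drop combining marks
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Whitespace normalization: collapse any run of whitespace to one space
    text = re.sub(r"\s+", " ", text).strip()
    # Case folding: Unicode-aware lowercasing
    return text.casefold()

print(basic_normalize("  Cafe\u0301   LATTE\u2122 "))  # -> "cafe lattetm"
```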
Entity Normalization¶
Purpose: Standardize entity names and resolve variations to canonical forms.
How it works:
- Fuzzy Matching: Levenshtein distance with configurable threshold (default: `0.85`)
- Phonetic Matching: Soundex and Metaphone algorithms for name variants
- Abbreviation Expansion: Dictionary-based expansion with context awareness
- Canonical Form Selection: Frequency-based or confidence-based selection
- Entity Linking: Hash-based entity ID generation for cross-document linking
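The fuzzy-matching idea can be illustrated with the standard library's `difflib` (its `ratio` is a similarity score comparable to, though not identical to, normalized Levenshtein distance; the canonical list here is made up for the example):

```python
from difflib import SequenceMatcher

# Hypothetical canonical forms for the example
CANONICAL = ["Apple Inc.", "Microsoft Corporation", "Alphabet Inc."]

def match_canonical(name, threshold=0.85):
    # Score the input against every canonical form, case-insensitively
    best, best_score = None, 0.0
    for canonical in CANONICAL:
        score = SequenceMatcher(None, name.lower(), canonical.lower()).ratio()
        if score > best_score:
            best, best_score = canonical, score
    # Accept only matches at or above the configured threshold
    return best if best_score >= threshold else None

print(match_canonical("Apple, Inc"))  # -> "Apple Inc."
print(match_canonical("Pineapple"))   # -> None (below threshold)
```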
Date/Time Normalization¶
Purpose: Parse and standardize date/time formats to ISO 8601.
How it works:
- Parsing: dateutil parser with 100+ format support
- Timezone Handling: pytz for timezone conversion and DST handling
- Standardization: ISO 8601 format output (`YYYY-MM-DDTHH:MM:SSZ`)
- Relative Date Resolution: Convert "yesterday", "last week" to absolute dates
- Fuzzy Date Parsing: Handle incomplete dates (e.g., "March 2024")
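For intuition, here is a rough sketch of the parsing and standardization steps using `python-dateutil` directly; the relative-date lookup is a hand-rolled stand-in for what `DateNormalizer` handles natively:

```python
from datetime import datetime, timedelta, timezone
from dateutil import parser

RELATIVE = {"today": 0, "yesterday": -1, "tomorrow": 1}

def to_iso8601(date_str):
    key = date_str.strip().lower()
    if key in RELATIVE:
        # Relative date resolution against the current UTC time
        dt = datetime.now(timezone.utc) + timedelta(days=RELATIVE[key])
    else:
        # fuzzy=True tolerates surrounding words; `default` supplies the
        # missing day/month for incomplete dates such as "March 2024"
        dt = parser.parse(date_str, fuzzy=True, default=datetime(2024, 1, 1))
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_iso8601("Jan 1, 2024"))  # -> 2024-01-01T00:00:00Z
print(to_iso8601("March 2024"))   # -> 2024-03-01T00:00:00Z (day taken from default)
```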
Number Normalization¶
Purpose: Standardize numeric values, units, and measurements.
How it works:
- Numeric Parsing: Handle various formats (`1,000.00`, `1.000,00`, `1 000.00`)
- Unit Conversion: Standardize units (km → meters, lbs → kg)
- Scientific Notation: Parse and normalize scientific notation
- Percentage Handling: Normalize percentage representations
- Currency Normalization: Standardize currency symbols and amounts
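The separator-detection part of numeric parsing can be sketched as follows; this is an illustrative heuristic, not the library's parser:

```python
import re

def parse_number(raw):
    s = raw.strip().replace(" ", "")  # "1 000.00" -> "1000.00"
    if "," in s and "." in s:
        # Whichever separator occurs last is the decimal point
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")  # "1.000,00" -> "1000.00"
        else:
            s = s.replace(",", "")                    # "1,000.00" -> "1000.00"
    elif "," in s:
        # A lone comma followed by exactly three digits reads as a thousands
        # separator; otherwise treat it as a decimal comma
        s = s.replace(",", "") if re.search(r",\d{3}$", s) else s.replace(",", ".")
    return float(s)

for raw in ["1,000.00", "1.000,00", "1 000.00", "3,14"]:
    print(raw, "->", parse_number(raw))  # 1000.0, 1000.0, 1000.0, 3.14
```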
Language Detection¶
Purpose: Automatically detect document language with confidence scoring.
How it works:
- N-gram Analysis: Character and word n-gram frequency analysis
- Statistical Models: Language-specific statistical models
- Confidence Scoring: Probability-based confidence scores
- Multi-language Support: 100+ languages supported
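For comparison, here is how the `langdetect` package (one of the detector backends named in the configuration section below) exposes the same idea:

```python
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # pin the seed: the detector is probabilistic

for text in ["Hello, how are you?", "Bonjour, comment allez-vous?"]:
    candidates = detect_langs(text)  # ranked list of language:probability pairs
    best = candidates[0]
    print(f"{text!r} -> {best.lang} ({best.prob:.2f})")
```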
Main Classes¶
TextNormalizer¶
Main text normalization orchestrator with comprehensive cleaning capabilities.
Methods:
| Method | Description |
|---|---|
| `normalize_text(text, ...)` | Normalize single text using full pipeline |
| `clean_text(text, ...)` | Clean text (HTML removal, sanitization) |
| `standardize_format(text, format_type)` | Standardize formatting (standard/compact/preserve) |
| `process_batch(texts, ...)` | Batch normalize multiple texts |
Example:
```python
from semantica.normalize import TextNormalizer

normalizer = TextNormalizer()

# Normalize
normalized = normalizer.normalize_text("  Apple Inc. was founded in 1976.  ", case="preserve")

# Clean only
cleaned = normalizer.clean_text("<p>Hello</p>", remove_html=True)

# Batch
texts = ["Hello World", "Another Example"]
normalized_batch = normalizer.process_batch(texts, case="lower")
```
EntityNormalizer¶
Standardize entity names and resolve variations to canonical forms.
Methods:
| Method | Description |
|---|---|
| `normalize_entity(name, ...)` | Normalize entity name to canonical form |
| `resolve_aliases(name, ...)` | Resolve aliases via alias map |
| `disambiguate_entity(name, ...)` | Disambiguate using context and candidates |
| `link_entities(names, ...)` | Link a list of names to canonical forms |
Configuration Options:
```python
EntityNormalizer(
    fuzzy_matching=True,        # Enable fuzzy matching
    similarity_threshold=0.85,  # Similarity threshold (0-1)
    phonetic_matching=False,    # Enable phonetic matching
    case_sensitive=False,       # Case-sensitive matching
    preserve_case=True,         # Preserve original case in output
    expand_abbreviations=True,  # Expand common abbreviations
    canonical_dict=None,        # Custom canonical mappings
)
```
Example:
```python
from semantica.normalize import EntityNormalizer

normalizer = EntityNormalizer(similarity_threshold=0.85)

# Normalize single
canonical = normalizer.normalize_entity("Apple, Inc.")

# Link list
linked = normalizer.link_entities(["Apple Inc.", "Apple", "AAPL"], entity_type="Organization")
```
DateNormalizer¶
Parse and standardize date/time formats to ISO 8601.
Methods:
| Method | Description |
|---|---|
| `normalize_date(date_str, ...)` | Parse and normalize date |
| `normalize_time(time_str, ...)` | Normalize time-only strings |
| `parse_temporal_expression(expr)` | Parse date ranges and temporal phrases |
Configuration Options:
```python
DateNormalizer(
    output_format="ISO8601",  # ISO8601, UNIX, custom format
    timezone="UTC",           # Target timezone
    handle_relative=True,     # Parse "yesterday", "last week"
    fuzzy=True,               # Fuzzy parsing
    default_day=1,            # Default day for incomplete dates
    default_month=1,          # Default month for incomplete dates
)
```
Example:
```python
from semantica.normalize import DateNormalizer

normalizer = DateNormalizer()

dates = ["Jan 1, 2024", "01/01/2024", "yesterday"]
normalized = [normalizer.normalize_date(d) for d in dates]

time = normalizer.normalize_time("10:30 AM")
```
NumberNormalizer¶
Standardize numeric values, units, and measurements.
Methods:
| Method | Description |
|---|---|
| `normalize_number(input, ...)` | Parse and normalize number |
| `normalize_quantity(quantity, ...)` | Parse value with unit |
| `convert_units(value, from_unit, to_unit)` | Convert units |
| `process_currency(text, ...)` | Parse currency amount and code |
Example:
```python
from semantica.normalize import NumberNormalizer

normalizer = NumberNormalizer()

numbers = ["1,000.50", "50%", "1.5e3"]
normalized = [normalizer.normalize_number(n) for n in numbers]

quantity = normalizer.normalize_quantity("5 kg")
converted = normalizer.convert_units(5, "km", "m")
currency = normalizer.process_currency("$1,234.56")
```
LanguageDetector¶
Detect document language with confidence scoring.
Methods:
| Method | Description |
|---|---|
| `detect(text)` | Detect language |
| `detect_with_confidence(text)` | Detect with confidence score |
| `detect_multiple(text, top_n)` | List top-N candidate languages |
| `detect_batch(texts)` | Batch language detection |
Example:
```python
from semantica.normalize import LanguageDetector

detector = LanguageDetector()

# Detect language
texts = [
    "Hello, how are you?",
    "Bonjour, comment allez-vous?",
    "Hola, ¿cómo estás?",
    "Hallo, wie geht es dir?",
    "こんにちは、お元気ですか？"
]

for text in texts:
    result = detector.detect(text)
    print(f"{text[:30]:30} → {result['language']} ({result['confidence']:.2f})")

# Output:
# Hello, how are you?            → en (0.99)
# Bonjour, comment allez-vous?   → fr (0.98)
# Hola, ¿cómo estás?             → es (0.97)
# Hallo, wie geht es dir?        → de (0.96)
# こんにちは、お元気ですか？        → ja (0.99)
```
Configuration¶
Environment Variables¶
```bash
# Normalization settings
export NORMALIZE_DEFAULT_LOWERCASE=false
export NORMALIZE_DEFAULT_ENCODING=utf-8
export NORMALIZE_DEFAULT_TIMEZONE=UTC

# Entity normalization
export NORMALIZE_ENTITY_SIMILARITY_THRESHOLD=0.85
export NORMALIZE_ENTITY_FUZZY_MATCHING=true

# Language detection
export NORMALIZE_LANGUAGE_DETECTOR=langdetect
export NORMALIZE_LANGUAGE_CONFIDENCE_THRESHOLD=0.8
```
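If you prefer explicit wiring over relying on the module to pick these variables up, the same settings can be read manually with `os.getenv`; the defaults below mirror the documented ones:

```python
import os

from semantica.normalize import EntityNormalizer

# Defaults mirror the documented values
similarity = float(os.getenv("NORMALIZE_ENTITY_SIMILARITY_THRESHOLD", "0.85"))
fuzzy = os.getenv("NORMALIZE_ENTITY_FUZZY_MATCHING", "true").lower() == "true"

normalizer = EntityNormalizer(fuzzy_matching=fuzzy, similarity_threshold=similarity)
```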
YAML Configuration¶
```yaml
# config.yaml - Normalize Module Configuration
normalize:
  text:
    lowercase: false
    remove_punctuation: false
    fix_encoding: true
    normalize_whitespace: true
    remove_urls: false
    expand_contractions: false
  entity:
    fuzzy_matching: true
    similarity_threshold: 0.85
    phonetic_matching: false
    expand_abbreviations: true
  date:
    output_format: "ISO8601"
    timezone: "UTC"
    handle_relative: true
    fuzzy: true
  number:
    decimal_separator: "."
    thousands_separator: ","
    normalize_units: true
  language:
    detector: "langdetect"  # langdetect, fasttext
    confidence_threshold: 0.8
    fallback_language: "en"
```
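A minimal sketch of loading this file with PyYAML and forwarding a section to a normalizer follows; that the constructors accept these exact keyword arguments is an assumption based on the options listed above:

```python
import yaml  # PyYAML

from semantica.normalize import DateNormalizer, TextNormalizer

with open("config.yaml") as f:
    config = yaml.safe_load(f)["normalize"]

# Assumption: each config section maps onto the constructor's keyword arguments
text_normalizer = TextNormalizer(**config["text"])
date_normalizer = DateNormalizer(**config["date"])
```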
Integration Examples¶
Complete Document Normalization Pipeline¶
```python
from semantica.normalize import TextNormalizer, EntityNormalizer, DateNormalizer, LanguageDetector
from semantica.parse import DocumentParser

# Parse documents
parser = DocumentParser()
documents = parser.parse(["document1.pdf", "document2.docx"])

# Detect language
detector = LanguageDetector()
for doc in documents:
    lang_result = detector.detect(doc.content)
    doc.metadata["language"] = lang_result["language"]
    doc.metadata["language_confidence"] = lang_result["confidence"]

# Normalize text
text_normalizer = TextNormalizer()
for doc in documents:
    doc.content = text_normalizer.normalize_text(doc.content)

# Normalize dates in metadata
date_normalizer = DateNormalizer(output_format="ISO8601")
for doc in documents:
    if "date" in doc.metadata:
        doc.metadata["date"] = date_normalizer.normalize_date(doc.metadata["date"])

# Normalize entities
entity_normalizer = EntityNormalizer(similarity_threshold=0.85)
# ... entity normalization logic
```
Multi-Language Document Processing¶
```python
from semantica.normalize import LanguageDetector, TextNormalizer

detector = LanguageDetector()
normalizers = {
    "en": TextNormalizer(expand_contractions=True),
    "fr": TextNormalizer(remove_diacritics=False),
    "de": TextNormalizer(lowercase=False),
}

def process_multilingual_document(text):
    # Detect language
    lang_result = detector.detect(text)
    language = lang_result["language"]

    # Use language-specific normalizer
    normalizer = normalizers.get(language, TextNormalizer())
    normalized = normalizer.normalize_text(text)

    return {
        "text": normalized,
        "language": language,
        "confidence": lang_result["confidence"],
    }

# Process documents
documents = ["Hello world", "Bonjour le monde", "Hallo Welt"]
results = [process_multilingual_document(doc) for doc in documents]
```
Best Practices¶
1. Choose Appropriate Normalization Level¶
```python
# Minimal normalization for entity extraction
minimal = TextNormalizer(
    fix_encoding=True,
    normalize_whitespace=True
)

# Moderate normalization for search
moderate = TextNormalizer(
    fix_encoding=True,
    normalize_whitespace=True,
    lowercase=True,
    remove_urls=True
)

# Aggressive normalization for topic modeling
aggressive = TextNormalizer(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
    remove_urls=True,
    expand_contractions=True
)
```
2. Preserve Original Data¶
```python
# Always keep the original text
doc.original_content = doc.content
doc.content = normalizer.normalize_text(doc.content)

# Store normalization metadata
doc.metadata["normalized"] = True
doc.metadata["normalization_config"] = normalizer.config
```
3. Batch Processing for Performance¶
```python
# Batch normalize for better performance
texts = [doc.content for doc in documents]
normalized_texts = normalizer.process_batch(texts)

for doc, normalized in zip(documents, normalized_texts):
    doc.content = normalized
```
Troubleshooting¶
Common Issues¶
Issue: Encoding errors with special characters
```python
# Solution: Enable encoding fix
normalizer = TextNormalizer(fix_encoding=True)

# Or manually fix encoding
from semantica.normalize import handle_encoding
fixed_text, confidence = handle_encoding(problematic_text, operation="convert", source_encoding="latin-1")
```
Issue: Over-normalization losing important information
```python
# Solution: Use conservative settings
normalizer = TextNormalizer(
    lowercase=False,           # Keep case
    remove_punctuation=False,  # Keep punctuation
    remove_numbers=False       # Keep numbers
)
```
Issue: Slow processing for large documents
```python
# Solution: Use batch processing
normalizer = TextNormalizer()
normalized = normalizer.process_batch(documents)
```
Components¶
Key supporting classes available in semantica.normalize:
- `UnicodeNormalizer` – Unicode processing (NFC/NFD/NFKC/NFKD), special characters
- `WhitespaceNormalizer` – Line breaks, indentation, whitespace cleanup
- `SpecialCharacterProcessor` – Punctuation and diacritic handling
- `TextCleaner` – HTML removal and sanitization utilities
- `AliasResolver` – Entity alias mapping
- `EntityDisambiguator` – Context-based entity disambiguation
- `NameVariantHandler` – Title and name variant handling
- `TimeZoneNormalizer` – Timezone conversion utilities
- `RelativeDateProcessor` – Relative date expressions (e.g., "3 days ago")
- `TemporalExpressionParser` – Date range and temporal phrase parsing
- `UnitConverter` – Unit normalization and conversion
- `CurrencyNormalizer` – Currency symbol/code parsing
- `ScientificNotationHandler` – Scientific notation parsing
- `DataCleaner` – General data cleaning utilities
- `DuplicateDetector` – Duplicate record detection
- `DataValidator` – Schema-based dataset validation
- `MissingValueHandler` – Missing value strategies
- `EncodingHandler` – Encoding detection and conversion
- `MethodRegistry` – Register and retrieve custom normalization methods
- `NormalizeConfig` – Module configuration manager
Performance Tips¶
Memory Optimization¶
```python
# Process documents in chunks
def normalize_large_corpus(documents, chunk_size=1000):
    normalizer = TextNormalizer()
    for i in range(0, len(documents), chunk_size):
        chunk = documents[i:i + chunk_size]
        normalized_chunk = normalizer.process_batch(chunk)
        yield from normalized_chunk
```
Speed Optimization¶
```python
# Disable unnecessary features
fast_normalizer = TextNormalizer(
    fix_encoding=False,       # Skip if encoding is known good
    normalize_unicode=False,  # Skip if not needed
    remove_diacritics=False   # Skip if not needed
)

# Use batch processing; parallelize across workers for very large corpora
normalized_docs = fast_normalizer.process_batch(documents)
```
See Also¶
- Parse Module - Document parsing and extraction
- Semantic Extract Module - Entity and relation extraction
- Split Module - Text chunking and splitting
- Ingest Module - Data ingestion
Cookbook¶
Interactive tutorials to learn data normalization:
- Data Normalization: Pipelines for cleaning, normalizing, and preparing text
- Topics: Text cleaning, Unicode, formatting, language detection, entity normalization
- Difficulty: Beginner
- Use Cases: Data preprocessing, text cleaning, standardization