
Normalize

Clean, standardize, and prepare text and data for semantic processing with comprehensive normalization capabilities.


🎯 Overview

  • Text Cleaning: Remove noise, fix encoding issues, and standardize whitespace for clean text
  • Entity Normalization: Standardize entity names, abbreviations, and formats across documents
  • Date & Time: Parse and standardize date/time formats to ISO 8601
  • Number Normalization: Standardize numeric values, units, and measurements
  • Language Detection: Automatically detect document language with confidence scoring
  • Encoding Handling: Fix character encoding issues and ensure UTF-8 compliance

Why Normalize?

Normalization is crucial for:

  • Consistency: Ensure uniform data representation
  • Accuracy: Improve entity extraction and matching
  • Quality: Reduce noise and errors in downstream processing
  • Performance: Enable better deduplication and search

⚙️ Algorithms Used

Text Normalization

Purpose: Clean and standardize text for consistent processing.

How it works (see the sketch after this list):

  • Unicode Normalization: NFC, NFD, NFKC, NFKD forms using Unicode standard
  • Whitespace Normalization: Regex-based cleanup (`\s+` → single space)
  • Case Folding: Locale-aware case normalization (Unicode case folding)
  • Diacritic Removal: Unicode decomposition and combining character removal
  • Punctuation Handling: Smart punctuation normalization preserving sentence structure
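
A minimal sketch of the Unicode, whitespace, and case-folding steps above, using only the Python standard library; it illustrates the mechanics, not the TextNormalizer implementation itself:

import re
import unicodedata

def basic_clean(text: str) -> str:
    # NFC composition per the Unicode standard
    text = unicodedata.normalize("NFC", text)
    # Collapse any run of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    # Locale-independent Unicode case folding
    return text.casefold()

print(basic_clean("  Apple\u00A0Inc.  was   founded in 1976. "))
# apple inc. was founded in 1976.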

Entity Normalization

Purpose: Standardize entity names and resolve variations to canonical forms.

How it works (see the sketch after this list):

  • Fuzzy Matching: Levenshtein distance with configurable threshold (default: `0.85`)
  • Phonetic Matching: Soundex and Metaphone algorithms for name variants
  • Abbreviation Expansion: Dictionary-based expansion with context awareness
  • Canonical Form Selection: Frequency-based or confidence-based selection
  • Entity Linking: Hash-based entity ID generation for cross-document linking
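
A rough sketch of the fuzzy canonicalization step above, using difflib's similarity ratio as a stand-in for a Levenshtein-based score; the canonical list and threshold here are illustrative assumptions, not EntityNormalizer internals:

from difflib import SequenceMatcher

CANONICAL_FORMS = ["Apple Inc.", "Microsoft Corporation"]  # hypothetical canonical dictionary

def to_canonical(name: str, threshold: float = 0.85):
    # Return the best-scoring canonical form if it clears the similarity threshold
    best, best_score = None, 0.0
    for candidate in CANONICAL_FORMS:
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None

print(to_canonical("Apple, Inc."))  # Apple Inc.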

Date/Time Normalization

Purpose: Parse and standardize date/time formats to ISO 8601.

How it works (see the sketch after this list):

  • Parsing: dateutil parser with 100+ format support
  • Timezone Handling: pytz for timezone conversion and DST handling
  • Standardization: ISO 8601 format output (`YYYY-MM-DDTHH:MM:SSZ`)
  • Relative Date Resolution: Convert "yesterday", "last week" to absolute dates
  • Fuzzy Date Parsing: Handle incomplete dates (e.g., "March 2024")
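
As a point of reference, the dateutil parsing and incomplete-date handling described above behave roughly like this (a sketch of the underlying library, not the DateNormalizer API; the default datetime supplies the missing day for "March 2024"):

from datetime import datetime
from dateutil import parser

# The default datetime fills in fields missing from the input string
default = datetime(2024, 1, 1)

print(parser.parse("Jan 1, 2024").isoformat())                   # 2024-01-01T00:00:00
print(parser.parse("March 2024", default=default).isoformat())   # 2024-03-01T00:00:00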

Number Normalization

Purpose: Standardize numeric values, units, and measurements.

How it works (see the sketch after this list):

  • Numeric Parsing: Handle various formats (`1,000.00`, `1.000,00`, `1 000.00`)
  • Unit Conversion: Standardize units (km → meters, lbs → kg)
  • Scientific Notation: Parse and normalize scientific notation
  • Percentage Handling: Normalize percentage representations
  • Currency Normalization: Standardize currency symbols and amounts
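
A toy sketch of the numeric parsing above, assuming US-style separators ("." decimal, "," thousands); it is illustrative only and far simpler than the locale-aware handling NumberNormalizer is described as providing:

def parse_number(raw: str) -> float:
    # Handles thousands separators, percentages, and scientific notation
    raw = raw.strip()
    if raw.endswith("%"):
        return float(raw[:-1].replace(",", "")) / 100.0
    return float(raw.replace(",", ""))

print(parse_number("1,000.50"))  # 1000.5
print(parse_number("50%"))       # 0.5
print(parse_number("1.5e3"))     # 1500.0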

Language Detection

Purpose: Automatically detect document language with confidence scoring.

How it works (see the sketch after this list):

  • N-gram Analysis: Character and word n-gram frequency analysis
  • Statistical Models: Language-specific statistical models
  • Confidence Scoring: Probability-based confidence scores
  • Multi-language Support: 100+ languages supported
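
Backends such as langdetect (referenced in the configuration section below) implement exactly this statistical, n-gram-based approach; a minimal sketch using that library directly, rather than the LanguageDetector wrapper:

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

# detect_langs returns candidate languages with probabilities
for candidate in detect_langs("Bonjour, comment allez-vous?"):
    print(candidate.lang, round(candidate.prob, 2))  # e.g. fr 1.0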

Main Classes

TextNormalizer

Main text normalization orchestrator with comprehensive cleaning capabilities.

Methods:

  • `normalize_text(text, ...)`: Normalize single text using full pipeline
  • `clean_text(text, ...)`: Clean text (HTML removal, sanitization)
  • `standardize_format(text, format_type)`: Standardize formatting (standard/compact/preserve)
  • `process_batch(texts, ...)`: Batch normalize multiple texts

Example:

from semantica.normalize import TextNormalizer

normalizer = TextNormalizer()

# Normalize
normalized = normalizer.normalize_text("  Apple Inc.  was founded in 1976.  ", case="preserve")

# Clean only
cleaned = normalizer.clean_text("<p>Hello</p>", remove_html=True)

# Batch
texts = ["Hello   World", "Another   Example"]
normalized_batch = normalizer.process_batch(texts, case="lower")

EntityNormalizer

Standardize entity names and resolve variations to canonical forms.

Methods:

  • `normalize_entity(name, ...)`: Normalize entity name to canonical form
  • `resolve_aliases(name, ...)`: Resolve aliases via alias map
  • `disambiguate_entity(name, ...)`: Disambiguate using context and candidates
  • `link_entities(names, ...)`: Link a list of names to canonical forms

Configuration Options:

EntityNormalizer(
    fuzzy_matching=True,          # Enable fuzzy matching
    similarity_threshold=0.85,    # Similarity threshold (0-1)
    phonetic_matching=False,      # Enable phonetic matching
    case_sensitive=False,         # Case-sensitive matching
    preserve_case=True,           # Preserve original case in output
    expand_abbreviations=True,    # Expand common abbreviations
    canonical_dict=None           # Custom canonical mappings
)

Example:

from semantica.normalize import EntityNormalizer

normalizer = EntityNormalizer(similarity_threshold=0.85)

# Normalize single
canonical = normalizer.normalize_entity("Apple, Inc.")

# Link list
linked = normalizer.link_entities(["Apple Inc.", "Apple", "AAPL"], entity_type="Organization")

DateNormalizer

Parse and standardize date/time formats to ISO 8601.

Methods:

  • `normalize_date(date_str, ...)`: Parse and normalize date
  • `normalize_time(time_str, ...)`: Normalize time-only strings
  • `parse_temporal_expression(expr)`: Parse date ranges and temporal phrases

Configuration Options:

DateNormalizer(
    output_format="ISO8601",      # ISO8601, UNIX, custom format
    timezone="UTC",               # Target timezone
    handle_relative=True,         # Parse "yesterday", "last week"
    fuzzy=True,                   # Fuzzy parsing
    default_day=1,                # Default day for incomplete dates
    default_month=1               # Default month for incomplete dates
)

Example:

from semantica.normalize import DateNormalizer

normalizer = DateNormalizer()

dates = ["Jan 1, 2024", "01/01/2024", "yesterday"]
normalized = [normalizer.normalize_date(d) for d in dates]

time = normalizer.normalize_time("10:30 AM")

NumberNormalizer

Standardize numeric values, units, and measurements.

Methods:

  • `normalize_number(input, ...)`: Parse and normalize number
  • `normalize_quantity(quantity, ...)`: Parse value with unit
  • `convert_units(value, from_unit, to_unit)`: Convert units
  • `process_currency(text, ...)`: Parse currency amount and code

Example:

from semantica.normalize import NumberNormalizer

normalizer = NumberNormalizer()

numbers = ["1,000.50", "50%", "1.5e3"]
normalized = [normalizer.normalize_number(n) for n in numbers]

quantity = normalizer.normalize_quantity("5 kg")
converted = normalizer.convert_units(5, "km", "m")
currency = normalizer.process_currency("$1,234.56")

LanguageDetector

Detect document language with confidence scoring.

Methods:

  • `detect(text)`: Detect language
  • `detect_with_confidence(text)`: Detect with confidence score
  • `detect_multiple(text, top_n)`: List top-N candidate languages
  • `detect_batch(texts)`: Batch language detection

Example:

from semantica.normalize import LanguageDetector

detector = LanguageDetector()

# Detect language
texts = [
    "Hello, how are you?",
    "Bonjour, comment allez-vous?",
    "Hola, ΒΏcΓ³mo estΓ‘s?",
    "Hallo, wie geht es dir?",
    "γ“γ‚“γ«γ‘γ―γ€γŠε…ƒζ°—γ§γ™γ‹οΌŸ"
]

for text in texts:
    result = detector.detect(text)
    print(f"{text[:30]:30} β†’ {result['language']} ({result['confidence']:.2f})")

# Output:
# Hello, how are you?           → en (0.99)
# Bonjour, comment allez-vous?  → fr (0.98)
# Hola, ¿cómo estás?            → es (0.97)
# Hallo, wie geht es dir?       → de (0.96)
# こんにちは、お元気ですか？      → ja (0.99)

Configuration

Environment Variables

# Normalization settings
export NORMALIZE_DEFAULT_LOWERCASE=false
export NORMALIZE_DEFAULT_ENCODING=utf-8
export NORMALIZE_DEFAULT_TIMEZONE=UTC

# Entity normalization
export NORMALIZE_ENTITY_SIMILARITY_THRESHOLD=0.85
export NORMALIZE_ENTITY_FUZZY_MATCHING=true

# Language detection
export NORMALIZE_LANGUAGE_DETECTOR=langdetect
export NORMALIZE_LANGUAGE_CONFIDENCE_THRESHOLD=0.8

YAML Configuration

# config.yaml - Normalize Module Configuration

normalize:
  text:
    lowercase: false
    remove_punctuation: false
    fix_encoding: true
    normalize_whitespace: true
    remove_urls: false
    expand_contractions: false

  entity:
    fuzzy_matching: true
    similarity_threshold: 0.85
    phonetic_matching: false
    expand_abbreviations: true

  date:
    output_format: "ISO8601"
    timezone: "UTC"
    handle_relative: true
    fuzzy: true

  number:
    decimal_separator: "."
    thousands_separator: ","
    normalize_units: true

  language:
    detector: "langdetect"  # langdetect, fasttext
    confidence_threshold: 0.8
    fallback_language: "en"

Integration Examples

Complete Document Normalization Pipeline

from semantica.normalize import TextNormalizer, EntityNormalizer, DateNormalizer, LanguageDetector
from semantica.parse import DocumentParser

# Parse documents
parser = DocumentParser()
documents = parser.parse(["document1.pdf", "document2.docx"])

# Detect language
detector = LanguageDetector()
for doc in documents:
    lang_result = detector.detect(doc.content)
    doc.metadata["language"] = lang_result["language"]
    doc.metadata["language_confidence"] = lang_result["confidence"]

# Normalize text content
text_normalizer = TextNormalizer()

for doc in documents:
    doc.content = text_normalizer.normalize_text(doc.content)

# Normalize dates in metadata
date_normalizer = DateNormalizer(output_format="ISO8601")
for doc in documents:
    if "date" in doc.metadata:
        doc.metadata["date"] = date_normalizer.normalize_date(doc.metadata["date"])

# Normalize entities
entity_normalizer = EntityNormalizer(similarity_threshold=0.85)
# ... entity normalization logic

Multi-Language Document Processing

from semantica.normalize import LanguageDetector, TextNormalizer

detector = LanguageDetector()
normalizers = {
    "en": TextNormalizer(expand_contractions=True),
    "fr": TextNormalizer(remove_diacritics=False),
    "de": TextNormalizer(lowercase=False)
}

def process_multilingual_document(text):
    # Detect language
    lang_result = detector.detect(text)
    language = lang_result["language"]

    # Use language-specific normalizer
    normalizer = normalizers.get(language, TextNormalizer())
    normalized = normalizer.normalize_text(text)

    return {
        "text": normalized,
        "language": language,
        "confidence": lang_result["confidence"]
    }

# Process documents
documents = ["Hello world", "Bonjour le monde", "Hallo Welt"]
results = [process_multilingual_document(doc) for doc in documents]

Best Practices

1. Choose Appropriate Normalization Level

# Minimal normalization for entity extraction
minimal = TextNormalizer(
    fix_encoding=True,
    normalize_whitespace=True
)

# Moderate normalization for search
moderate = TextNormalizer(
    fix_encoding=True,
    normalize_whitespace=True,
    lowercase=True,
    remove_urls=True
)

# Aggressive normalization for topic modeling
aggressive = TextNormalizer(
    lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
    remove_urls=True,
    expand_contractions=True
)

2. Preserve Original Data

# Always keep original text
doc.original_content = doc.content
doc.content = normalizer.normalize_text(doc.content)

# Store normalization metadata
doc.metadata["normalized"] = True
doc.metadata["normalization_config"] = normalizer.config

3. Batch Processing for Performance

# Batch normalize for better performance
texts = [doc.content for doc in documents]
normalized_texts = normalizer.process_batch(texts)

for doc, normalized in zip(documents, normalized_texts):
    doc.content = normalized

Troubleshooting

Common Issues

Issue: Encoding errors with special characters

# Solution: Enable encoding fix
normalizer = TextNormalizer(fix_encoding=True)

# Or manually fix encoding
from semantica.normalize import handle_encoding
fixed_text, confidence = handle_encoding(problematic_text, operation="convert", source_encoding="latin-1")

Issue: Over-normalization losing important information

# Solution: Use conservative settings
normalizer = TextNormalizer(
    lowercase=False,           # Keep case
    remove_punctuation=False,  # Keep punctuation
    remove_numbers=False       # Keep numbers
)

Issue: Slow processing for large documents

# Solution: Use batch processing
normalizer = TextNormalizer()
normalized = normalizer.process_batch(documents)

Components

Key supporting classes available in semantica.normalize:

  • UnicodeNormalizer: Unicode processing (NFC/NFD/NFKC/NFKD), special characters
  • WhitespaceNormalizer: Line breaks, indentation, whitespace cleanup
  • SpecialCharacterProcessor: Punctuation and diacritic handling
  • TextCleaner: HTML removal and sanitization utilities
  • AliasResolver: Entity alias mapping
  • EntityDisambiguator: Context-based entity disambiguation
  • NameVariantHandler: Title and name variant handling
  • TimeZoneNormalizer: Timezone conversion utilities
  • RelativeDateProcessor: Relative date expressions (e.g., "3 days ago")
  • TemporalExpressionParser: Date range and temporal phrase parsing
  • UnitConverter: Unit normalization and conversion
  • CurrencyNormalizer: Currency symbol/code parsing
  • ScientificNotationHandler: Scientific notation parsing
  • DataCleaner: General data cleaning utilities
  • DuplicateDetector: Duplicate record detection
  • DataValidator: Schema-based dataset validation
  • MissingValueHandler: Missing value strategies
  • EncodingHandler: Encoding detection and conversion
  • MethodRegistry: Register and retrieve custom normalization methods
  • NormalizeConfig: Module configuration manager

Performance Tips

Memory Optimization

# Process documents in chunks
def normalize_large_corpus(documents, chunk_size=1000):
    normalizer = TextNormalizer()

    for i in range(0, len(documents), chunk_size):
        chunk = documents[i:i + chunk_size]
        normalized_chunk = normalizer.process_batch(chunk)
        yield from normalized_chunk

Speed Optimization

# Disable unnecessary features
fast_normalizer = TextNormalizer(
    fix_encoding=False,        # Skip if encoding is known good
    normalize_unicode=False,   # Skip if not needed
    remove_diacritics=False    # Skip if not needed
)

# Prefer batch processing; parallelize across batches for very large corpora
normalized_docs = normalizer.process_batch(documents)

See Also

Cookbook

Interactive tutorials to learn data normalization:

  • Data Normalization: Pipelines for cleaning, normalizing, and preparing text
  • Topics: Text cleaning, Unicode, formatting, language detection, entity normalization
  • Difficulty: Beginner
  • Use Cases: Data preprocessing, text cleaning, standardization