⚙️ Processors Module

The Processors module in EmbeddingFramework handles file ingestion, text extraction, and preprocessing before generating embeddings.


📂 File Processing

The FileProcessor class automatically detects file types and extracts text from:

- TXT
- PDF
- DOCX
- Other supported formats

Example:

from embeddingframework.processors.file_processor import FileProcessor

processor = FileProcessor()
text = processor.process_file("document.pdf")
print(text)
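One common way to implement file-type detection is to dispatch on the file suffix. The sketch below illustrates that idea only; it is not FileProcessor's actual implementation, and the names `EXTRACTORS`, `read_txt`, and `extract_text` are hypothetical. Only plain text is handled here, since PDF and DOCX extraction would need third-party libraries.

```python
from pathlib import Path
from typing import Callable, Dict


def read_txt(path: Path) -> str:
    """Read a plain-text file as UTF-8."""
    return path.read_text(encoding="utf-8")


# Hypothetical registry mapping lowercase suffixes to extractor functions.
# Extractors for .pdf or .docx would be registered here in the same way.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_txt,
}


def extract_text(path_str: str) -> str:
    """Pick an extractor from the file suffix and return the extracted text."""
    path = Path(path_str)
    suffix = path.suffix.lower()  # so "NOTE.TXT" and "note.txt" match
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"Unsupported file type: {suffix or '(no suffix)'}")
    return extractor(path)
```

Raising an error for unknown suffixes, rather than returning an empty string, keeps unsupported files from silently producing empty embeddings.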

🧹 Preprocessing

Preprocessing utilities help clean and normalize text for better embedding quality:

- Remove special characters
- Normalize whitespace
- Lowercasing
- Tokenization

Example:

from embeddingframework.utils.preprocessing import clean_text

cleaned = clean_text("Hello,   World!!!")
print(cleaned)  # "hello world"
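To make the cleaning steps concrete, here is a minimal standalone sketch built on the standard `re` module. It is an assumption about how such a function could behave, not the source of `clean_text`; the names `clean_text_sketch` and `tokenize_sketch` are hypothetical.

```python
import re
from typing import List


def clean_text_sketch(text: str) -> str:
    """Lowercase, drop special characters, and normalize whitespace."""
    lowered = text.lower()
    # Replace anything that is not a word character (letter, digit,
    # underscore) or whitespace with a space.
    no_special = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", no_special).strip()


def tokenize_sketch(text: str) -> List[str]:
    """Split cleaned text into whitespace-delimited tokens."""
    return clean_text_sketch(text).split()


print(clean_text_sketch("Hello,   World!!!"))  # hello world
print(tokenize_sketch("Hello,   World!!!"))    # ['hello', 'world']
```

Replacing special characters with a space, rather than deleting them, avoids gluing together words that were separated only by punctuation.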

✂️ Text Splitting

The framework includes text splitters that break long text into smaller chunks before embedding, with the chunk size set by the chunk_size argument:

from embeddingframework.utils.splitters import split_text

chunks = split_text("Long text...", chunk_size=500)
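The simplest splitting strategy is a fixed-size character window, optionally with overlap so that context carries across chunk boundaries. The sketch below shows that technique on its own. It does not describe how `split_text` works internally, and the `split_by_chars` name, the `overlap` parameter, and the assumption that sizes are counted in characters are all hypothetical.

```python
from typing import List


def split_by_chars(text: str, chunk_size: int = 500, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, so no trailing chunk
        # consists only of characters already covered.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_by_chars("abcdefghij", chunk_size=4))             # ['abcd', 'efgh', 'ij']
print(split_by_chars("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

A character window can cut words and sentences in half. Splitting on sentence or paragraph boundaries first and then packing those units into chunks is a common refinement.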

🔌 Extending Processors

You can create custom processors by extending the base processor class.
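The base class name and its required methods are not given here, so the following sketch defines a hypothetical `BaseProcessor` stand-in to show the general subclassing pattern. In a real project you would import the framework's own base class instead. The `process_file` method name mirrors the FileProcessor example, and `MarkdownProcessor` is purely illustrative.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseProcessor(ABC):
    """Hypothetical stand-in for the framework's base processor class."""

    @abstractmethod
    def process_file(self, path: str) -> str:
        """Return the text extracted from the file at `path`."""


class MarkdownProcessor(BaseProcessor):
    """Illustrative custom processor: strips heading markers from Markdown."""

    def process_file(self, path: str) -> str:
        raw = Path(path).read_text(encoding="utf-8")
        # Drop leading '#' heading markers and surrounding whitespace.
        lines = [line.lstrip("#").strip() for line in raw.splitlines()]
        # Skip lines that are empty after stripping.
        return "\n".join(line for line in lines if line)
```

Usage would then follow the same shape as the built-in processor: create a `MarkdownProcessor()` and call `process_file("notes.md")` on it.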