Processors Module
The Processors module in EmbeddingFramework handles file ingestion, text extraction, and preprocessing, the steps that run before embeddings are generated.
File Processing
The FileProcessor class automatically detects the file type and extracts text from:
- TXT
- PDF
- DOCX
- Other supported formats
Example:
from embeddingframework.processors.file_processor import FileProcessor
processor = FileProcessor()
text = processor.process_file("document.pdf")
print(text)
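How the type detection works is not described here. One common approach is to dispatch on the file extension, and the sketch below illustrates that idea using only the standard library. The names `EXTRACTORS`, `read_txt`, and `extract_text` are hypothetical and are not part of EmbeddingFramework; PDF and DOCX extractors are left out because they would need a third-party parser.

```python
from pathlib import Path
from typing import Callable, Dict


def read_txt(path: Path) -> str:
    # Plain text needs no parsing; decode as UTF-8 and replace invalid bytes.
    return path.read_text(encoding="utf-8", errors="replace")


# Hypothetical registry mapping a lowercase file extension to an extractor.
# PDF and DOCX extractors would be registered here as well; they are omitted
# to keep this sketch free of third-party dependencies.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".txt": read_txt,
}


def extract_text(filename: str) -> str:
    """Pick an extractor from the file extension and return the text."""
    path = Path(filename)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    return extractor(path)
```

Keeping the extractors in a dictionary means a new format can be supported by registering one more function, without touching the dispatch logic.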
Preprocessing
Preprocessing utilities clean and normalize text to improve embedding quality:
- Remove special characters
- Normalize whitespace
- Convert text to lowercase
- Tokenize
Example:
from embeddingframework.utils.preprocessing import clean_text
cleaned = clean_text("Hello, World!!!")
print(cleaned) # "hello world"
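To make the listed steps concrete, here is a minimal sketch of how such cleaning could be written with regular expressions. It is an illustration under stated assumptions, not the framework's `clean_text`: the names `clean_text_sketch` and `tokenize_sketch` are hypothetical, and the character filter is ASCII-only, so accented letters would be dropped.

```python
import re
from typing import List


def clean_text_sketch(text: str) -> str:
    """Lowercase, drop special characters, and normalize whitespace."""
    text = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def tokenize_sketch(text: str) -> List[str]:
    """Split cleaned text into whitespace-separated tokens."""
    return clean_text_sketch(text).split()


print(clean_text_sketch("Hello, World!!!"))  # hello world
print(tokenize_sketch("Hello, World!!!"))    # ['hello', 'world']
```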
Text Splitting
The framework includes text splitters that break long text into chunks of a configurable size (chunk_size) before embedding:
from embeddingframework.utils.splitters import split_text
chunks = split_text("Long text...", chunk_size=500)
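The splitting strategy is not specified here, so the following is only a sketch of one plausible approach: cut at most `chunk_size` characters at a time and prefer to break on whitespace. The function name `split_text_sketch` is hypothetical, and treating `chunk_size` as a character count is an assumption; the real `split_text` may count tokens or use a different boundary rule.

```python
from typing import List


def split_text_sketch(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to break at the last space inside the window so words
            # are not cut in half; fall back to a hard cut otherwise.
            boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


print(split_text_sketch("aaa bbb ccc", chunk_size=5))  # ['aaa', 'bbb', 'ccc']
```

Breaking on whitespace keeps words intact, at the cost of chunks that are sometimes shorter than the limit.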
Extending Processors
You can create custom processors by extending the base processor class.
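The base class is not named here, so the sketch below defines a stand-in, `BaseProcessorSketch`, to show the general pattern. Both it and `MarkdownProcessor` are hypothetical; only the `process_file` method name is taken from the FileProcessor example. Check the framework source for the actual base class and its required methods before subclassing.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseProcessorSketch(ABC):
    """Stand-in for the framework's base processor; the real interface may differ."""

    @abstractmethod
    def process_file(self, file_path: str) -> str:
        """Return the text extracted from file_path."""


class MarkdownProcessor(BaseProcessorSketch):
    """Hypothetical custom processor that strips Markdown heading markers."""

    def process_file(self, file_path: str) -> str:
        raw = Path(file_path).read_text(encoding="utf-8")
        # Drop leading '#' characters and surrounding whitespace per line.
        lines = [line.lstrip("#").strip() for line in raw.splitlines()]
        # Skip lines that are empty after stripping.
        return "\n".join(line for line in lines if line)
```

A custom processor written this way can be used wherever the calling code only relies on `process_file` returning a string.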