kara package
Subpackages
Submodules
kara.core module
Core KARA algorithm implementation.
- class kara.core.ChunkData(content: Any, splits: list[T], hash: str, document_id: int | None = None)[source]
Bases: Generic[T]
Represents a chunk with its content and metadata.
- content: Any
- document_id: int | None = None
- classmethod from_splits(splits: Sequence[T], document_id: int | None = None, serializer: Callable[[Sequence[T]], bytes] | None = None, renderer: Callable[[Sequence[T]], Any] | None = None) ChunkData[T][source]
Create ChunkData from splits.
- hash: str
- splits: list[T]
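As a rough illustration of how `from_splits` might derive the remaining fields from the splits, the sketch below mirrors the documented signature in a stand-alone dataclass. `ChunkDataSketch` is a hypothetical name, and the SHA-256 hashing, the default serializer, and the default renderer are assumptions made for this example; none of them are taken from the kara source.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Generic, List, Optional, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class ChunkDataSketch(Generic[T]):
    """Hypothetical stand-in mirroring the documented ChunkData fields."""

    content: Any
    splits: List[T]
    hash: str
    document_id: Optional[int] = None

    @classmethod
    def from_splits(
        cls,
        splits: Sequence[T],
        document_id: Optional[int] = None,
        serializer: Optional[Callable[[Sequence[T]], bytes]] = None,
        renderer: Optional[Callable[[Sequence[T]], Any]] = None,
    ) -> "ChunkDataSketch[T]":
        # Assumption: the serializer turns the splits into bytes for hashing.
        serialize = serializer or (lambda s: "".join(map(str, s)).encode("utf-8"))
        # Assumption: the renderer turns the splits into the stored content.
        render = renderer or (lambda s: "".join(map(str, s)))
        digest = hashlib.sha256(serialize(splits)).hexdigest()
        return cls(
            content=render(splits),
            splits=list(splits),
            hash=digest,
            document_id=document_id,
        )


# Identical splits produce identical hashes, regardless of the document id.
chunk_a = ChunkDataSketch.from_splits(["Hello, ", "world."], document_id=1)
chunk_b = ChunkDataSketch.from_splits(["Hello, ", "world."], document_id=2)
```

Under these assumptions the hash depends only on the serialized splits, which is what would allow an unchanged chunk to be recognized across updates.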
- class kara.core.ChunkedDocument(chunks: list[ChunkData[T]])[source]
Bases: Generic[T]
Represents the current state of the document collection.
- classmethod from_chunks(chunks: list[Any], chunker: BaseDocumentChunker[T], document_id: int | None = None) ChunkedDocument[T][source]
Create a ChunkedDocument from pre-split chunks.
- Args:
chunks: list of text chunks to include
document_id: Optional document identifier
- Returns:
ChunkedDocument with chunks created
- class kara.core.KARAUpdater(chunker: BaseDocumentChunker[T])[source]
Bases: Generic[T]
Knowledge-Aware Re-embedding Algorithm updater.
Efficiently updates document collections by minimizing embedding operations through intelligent reuse of existing chunks.
- chunker: BaseDocumentChunker[T]
- create_collection(documents: list[str]) UpdateResult[T][source]
Create a new document collection from documents.
- Args:
documents: list of document texts
- Returns:
UpdateResult with initial chunks
- max_chunk_size: int
- update_collection(current_collection: ChunkedDocument[T], documents: list[str]) UpdateResult[T][source]
Update the document collection with new documents.
- Args:
current_collection: Current document collection state
documents: list of updated document texts
- Returns:
UpdateResult with statistics and new collection
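The reuse idea can be illustrated with a deliberately simplified sketch: hash every chunk, then compare the old and new hash sets to see which chunks would need a fresh embedding. This is not the KARA algorithm itself, which also has to decide how the updated text is re-chunked; the names `chunk_hash` and `diff_chunks` and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib
from typing import Dict, List


def chunk_hash(text: str) -> str:
    """Content hash used to recognize an unchanged chunk (assumed: SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_chunks(old_chunks: List[str], new_chunks: List[str]) -> Dict[str, int]:
    """Count chunks that are new, reusable, or no longer present."""
    old_hashes = {chunk_hash(c) for c in old_chunks}
    new_hashes = {chunk_hash(c) for c in new_chunks}
    return {
        "num_added": len(new_hashes - old_hashes),    # need embedding
        "num_reused": len(new_hashes & old_hashes),   # embedding can be kept
        "num_deleted": len(old_hashes - new_hashes),  # can be dropped
    }


old = ["Intro paragraph.", "Method section.", "Old conclusion."]
new = ["Intro paragraph.", "Method section.", "New conclusion.", "Appendix."]
stats = diff_chunks(old, new)
```

In this toy run two chunks are unchanged, two are new, and one is obsolete, so only the two new chunks would have to be embedded.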
- class kara.core.UpdateResult(num_added: int = 0, num_reused: int = 0, num_deleted: int = 0, new_chunked_doc: ChunkedDocument[T] | None = None)[source]
Bases: Generic[T]
Result of a KARA update operation.
- property efficiency_ratio: float
Ratio of skipped operations to total operations.
- new_chunked_doc: ChunkedDocument[T] | None = None
- num_added: int = 0
- num_deleted: int = 0
- num_reused: int = 0
- property total_operations: int
Total number of operations performed.
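The two properties are described only in one line each, so the sketch below shows one plausible reading rather than the library's actual formulas. It assumes that every added, reused, or deleted chunk counts as one operation and that reused chunks are the "skipped" ones; `UpdateResultSketch` is a hypothetical stand-in that omits `new_chunked_doc`.

```python
from dataclasses import dataclass


@dataclass
class UpdateResultSketch:
    """Hypothetical stand-in for the documented counters."""

    num_added: int = 0
    num_reused: int = 0
    num_deleted: int = 0

    @property
    def total_operations(self) -> int:
        # Assumption: added, reused, and deleted chunks each count once.
        return self.num_added + self.num_reused + self.num_deleted

    @property
    def efficiency_ratio(self) -> float:
        # Assumption: reused chunks are the skipped operations.
        total = self.total_operations
        return self.num_reused / total if total else 0.0


result = UpdateResultSketch(num_added=2, num_reused=6, num_deleted=2)
```

Under these assumptions, `result.total_operations` is 10 and the ratio is 6 / 10; an empty result returns 0.0 rather than dividing by zero.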
kara.splitters module
Module contents
kara-toolkit - Knowledge-Aware Re-embedding Algorithm
A Python library for efficient updates to RAG document collections, minimizing embedding operations through intelligent chunk reuse.
- class kara.BaseDocumentChunker(chunk_size: int = 1000, overlap: int = 0)[source]
Bases: ABC, Generic[T]
Abstract base class for document chunkers.
- class kara.CharacterChunker(separators: list[str] | None = None, chunk_size: int = 4000, overlap: int = 0, keep_separator: bool = True)[source]
Bases: BaseDocumentChunker[str]
Recursive character-based chunker that tries multiple separators.
First splits text into smallest units using separators, then greedily merges them into chunks within the size limit.
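The split-then-merge description can be sketched in a few lines. The version below is a simplification under stated assumptions: it splits once on all separators instead of recursing, keeps each separator attached to the piece on its left, and ignores overlap. The helper names `split_keep_separator` and `greedy_merge` are hypothetical and do not come from the library.

```python
import re
from typing import List


def split_keep_separator(text: str, separators: List[str]) -> List[str]:
    """Split after each separator, keeping the separator on the left piece."""
    pattern = "(" + "|".join(re.escape(s) for s in separators) + ")"
    # With a capturing group, re.split alternates text and separator.
    parts = re.split(pattern, text)
    units: List[str] = []
    for i in range(0, len(parts), 2):
        unit = parts[i] + (parts[i + 1] if i + 1 < len(parts) else "")
        if unit:
            units.append(unit)
    return units


def greedy_merge(units: List[str], chunk_size: int) -> List[str]:
    """Append units to the current chunk until the next one would overflow."""
    chunks: List[str] = []
    current = ""
    for unit in units:
        if current and len(current) + len(unit) > chunk_size:
            chunks.append(current)
            current = ""
        current += unit
    if current:
        chunks.append(current)
    return chunks


text = "One. Two. Three. Four."
units = split_keep_separator(text, [". "])
chunks = greedy_merge(units, chunk_size=12)
# chunks == ["One. Two. ", "Three. Four."]
```

Because the separators are kept, joining the chunks reproduces the original text exactly. A unit longer than `chunk_size` still becomes its own oversized chunk in this sketch, which is the case a recursive chunker would handle by falling back to a finer separator.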
- class kara.HuggingFaceTokenChunker(model_name: str, chunk_size: int = 1000, overlap: int = 0)[source]
Bases: TokenChunker
Token chunker using Hugging Face tokenizers.
- class kara.KARAUpdater(chunker: BaseDocumentChunker[T])[source]
Bases: Generic[T]
Knowledge-Aware Re-embedding Algorithm updater.
Efficiently updates document collections by minimizing embedding operations through intelligent reuse of existing chunks.
- chunker: BaseDocumentChunker[T]
- create_collection(documents: list[str]) UpdateResult[T][source]
Create a new document collection from documents.
- Args:
documents: list of document texts
- Returns:
UpdateResult with initial chunks
- max_chunk_size: int
- update_collection(current_collection: ChunkedDocument[T], documents: list[str]) UpdateResult[T][source]
Update the document collection with new documents.
- Args:
current_collection: Current document collection state
documents: list of updated document texts
- Returns:
UpdateResult with statistics and new collection
- class kara.OpenAITokenChunker(encoding_name: str = 'cl100k_base', chunk_size: int = 1000, overlap: int = 0, allowed_special: Literal['all'] | Set[str] | None = None, disallowed_special: Literal['all'] | Collection[str] | None = None)[source]
Bases: TokenChunker
Token chunker using OpenAI's tiktoken encodings.
- class kara.TokenChunker(tokenizer_function: Callable[[str], list[int]] | None = None, chunk_size: int = 512, overlap: int = 0)[source]
Bases: BaseDocumentChunker[int]
Token-based chunker that splits text into tokens and merges them greedily.
This demonstrates how the unified chunking approach works for different unit types (tokens instead of characters).
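To make the "tokens instead of characters" point concrete, here is a minimal sketch of the same greedy idea applied to integer token ids. The toy whitespace tokenizer only stands in for a real `tokenizer_function` with the documented `Callable[[str], list[int]]` shape, and the way `overlap` is applied (each chunk repeats the last `overlap` tokens of the previous one) is an assumption, not confirmed behavior of `TokenChunker`.

```python
from typing import Callable, Dict, List


def make_toy_tokenizer() -> Callable[[str], List[int]]:
    """Hypothetical tokenizer: one integer id per distinct whitespace-separated word."""
    vocab: Dict[str, int] = {}

    def tokenize(text: str) -> List[int]:
        return [vocab.setdefault(word, len(vocab)) for word in text.split()]

    return tokenize


def chunk_tokens(tokens: List[int], chunk_size: int, overlap: int = 0) -> List[List[int]]:
    """Fill each chunk with up to chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


tokenize = make_toy_tokenizer()
tokens = tokenize("the quick brown fox jumps over the lazy dog")
token_chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
# token_chunks == [[0, 1, 2, 3], [3, 4, 5, 0], [0, 6, 7]]
```

The merge logic never inspects what a unit is, only how many there are, which is why the same approach carries over from characters to tokens.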
- class kara.UpdateResult(num_added: int = 0, num_reused: int = 0, num_deleted: int = 0, new_chunked_doc: ChunkedDocument[T] | None = None)[source]
Bases: Generic[T]
Result of a KARA update operation.
- property efficiency_ratio: float
Ratio of skipped operations to total operations.
- new_chunked_doc: ChunkedDocument[T] | None = None
- num_added: int = 0
- num_deleted: int = 0
- num_reused: int = 0
- property total_operations: int
Total number of operations performed.