kara package

Subpackages

kara.integrations package

Submodules

kara.core module

Core KARA algorithm implementation.

class kara.core.ChunkData(content: Any, splits: list[T], hash: str, document_id: int | None = None)[source]

Bases: Generic[T]

Represents a chunk with its content and metadata.

content: Any

document_id: int | None = None

classmethod from_splits(splits: Sequence[T], document_id: int | None = None, serializer: Callable[[Sequence[T]], bytes] | None = None, renderer: Callable[[Sequence[T]], Any] | None = None) → ChunkData[T][source]: Create ChunkData from splits.

hash: str

splits: list[T]
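
The `hash` field is what makes a chunk recognizable across updates: two chunks built from the same units should get the same hash. The sketch below illustrates that idea in isolation. It does not use the library; the `SketchChunk` class, the SHA-256 digest, and the default serializer and renderer are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class SketchChunk:
    """Hypothetical stand-in for a chunk record: content, units, content hash."""

    content: Any
    splits: list
    hash: str
    document_id: Optional[int] = None

    @classmethod
    def from_splits(
        cls,
        splits: Sequence[Any],
        document_id: Optional[int] = None,
        serializer: Optional[Callable[[Sequence[Any]], bytes]] = None,
        renderer: Optional[Callable[[Sequence[Any]], Any]] = None,
    ) -> "SketchChunk":
        # Assumed defaults: render units by joining them as text,
        # and serialize them as the UTF-8 bytes of that text.
        serializer = serializer or (lambda u: "".join(map(str, u)).encode("utf-8"))
        renderer = renderer or (lambda u: "".join(map(str, u)))
        digest = hashlib.sha256(serializer(splits)).hexdigest()  # assumed hash function
        return cls(
            content=renderer(splits),
            splits=list(splits),
            hash=digest,
            document_id=document_id,
        )


# Identical units give identical hashes, regardless of the owning document.
a = SketchChunk.from_splits(["Hello, ", "world."], document_id=1)
b = SketchChunk.from_splits(["Hello, ", "world."], document_id=2)
c = SketchChunk.from_splits(["Hello, ", "there."], document_id=1)
```

In this sketch the hash depends only on the serialized units, so `a` and `b` compare equal by hash while `c` does not.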

class kara.core.ChunkedDocument(chunks: list[ChunkData[T]])[source]

Bases: Generic[T]

Represents the current state of the document collection.

chunks: list[ChunkData[T]]

classmethod from_chunks(chunks: list[Any], chunker: BaseDocumentChunker[T], document_id: int | None = None) → ChunkedDocument[T][source]

Create a ChunkedDocument from pre-split chunks.

Args:
    chunks: list of text chunks to include
    document_id: Optional document identifier

Returns:
    ChunkedDocument with chunks created

get_chunk_contents() → list[Any][source]: Get all chunk contents.

get_chunk_hashes() → set[str][source]: Get all chunk hashes in the collection.

get_chunks_by_document(document_id: int) → list[ChunkData[T]][source]: Get all chunks belonging to a specific document.

get_document_ids() → set[int][source]: Get all unique document IDs in the collection.

class kara.core.KARAUpdater(chunker: BaseDocumentChunker[T])[source]

Bases: Generic[T]

Knowledge-Aware Re-embedding Algorithm updater.

Efficiently updates document collections by minimizing embedding operations through intelligent reuse of existing chunks.

chunker: BaseDocumentChunker[T]

create_collection(documents: list[str]) → UpdateResult[T][source]

Create a new document collection from documents.

Args:
    documents: list of document texts

Returns:
    UpdateResult with initial chunks

max_chunk_size: int

update_collection(current_collection: ChunkedDocument[T], documents: list[str]) → UpdateResult[T][source]

Update the document collection with new documents.

Args:
    current_collection: Current document collection state
    documents: list of updated document texts

Returns:
    UpdateResult with statistics and new collection
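
The counts reported by an update (added, reused, deleted) can be pictured as set operations over chunk hashes. The sketch below is a simplified, standalone illustration of that bookkeeping only. It is not the library's algorithm, which also has to decide how to re-chunk the updated text so that as many existing chunks as possible survive. The helper names `chunk_hash` and `diff_chunks` and the SHA-256 digest are assumptions.

```python
from __future__ import annotations

import hashlib


def chunk_hash(text: str) -> str:
    """Content hash used as a chunk's identity (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_chunks(old_chunks: list[str], new_chunks: list[str]) -> dict[str, int]:
    """Count chunks to embed, reuse, and delete by comparing hash sets."""
    old_hashes = {chunk_hash(c) for c in old_chunks}
    new_hashes = {chunk_hash(c) for c in new_chunks}
    return {
        "num_added": len(new_hashes - old_hashes),    # need a fresh embedding
        "num_reused": len(new_hashes & old_hashes),   # embedding already stored
        "num_deleted": len(old_hashes - new_hashes),  # no longer in the collection
    }


old = ["Intro paragraph.", "Method paragraph.", "Old conclusion."]
new = ["Intro paragraph.", "Method paragraph.", "New conclusion.", "Appendix."]
stats = diff_chunks(old, new)
```

Here the two unchanged paragraphs are reused, the rewritten conclusion and the new appendix are added, and the old conclusion is deleted.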

class kara.core.UpdateResult(num_added: int = 0, num_reused: int = 0, num_deleted: int = 0, new_chunked_doc: ChunkedDocument[T] | None = None)[source]

Bases: Generic[T]

Result of a KARA update operation.

property efficiency_ratio: float: Ratio of skipped operations to total operations.

new_chunked_doc: ChunkedDocument[T] | None = None

num_added: int = 0

num_deleted: int = 0

num_reused: int = 0

property total_operations: int: Total number of operations performed.

kara.splitters module

Module contents

kara-toolkit - Knowledge-Aware Re-embedding Algorithm

A Python library for efficient updates to RAG document collections, minimizing embedding operations through intelligent chunk reuse.

class kara.BaseDocumentChunker(chunk_size: int = 1000, overlap: int = 0)[source]

Bases: ABC, Generic[T]

Abstract base class for document chunkers.

abstractmethod create_chunks(text: str) → list[list[T]][source]: Split text into optimally-sized chunks.

normalize_chunk(chunk: Any) → list[T][source]: Normalize a chunk to a list of units.

render_units(units: Sequence[T]) → Any[source]: Render units for output or storage.

serialize_units(units: Sequence[T]) → bytes[source]: Serialize units to bytes for hashing.

unit_length(unit: T) → int[source]: Return the unit length for sizing and chunk limits.
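
To show how these hooks could fit together, the sketch below defines a small standalone base class with the same method names and a toy word-level subclass. It does not import the library: `SketchChunker`, `WordChunker`, and every default behavior shown (string length as unit length, concatenation as rendering, UTF-8 bytes as serialization) are assumptions.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Generic, Sequence, TypeVar

T = TypeVar("T")


class SketchChunker(ABC, Generic[T]):
    """Hypothetical stand-in that mirrors the hook names listed above."""

    def __init__(self, chunk_size: int = 1000) -> None:
        self.chunk_size = chunk_size

    @abstractmethod
    def create_chunks(self, text: str) -> list[list[T]]:
        """Split text into chunks, each a list of units."""

    def unit_length(self, unit: T) -> int:
        # Assumed default: a unit's size is its string length.
        return len(str(unit))

    def render_units(self, units: Sequence[T]) -> str:
        # Assumed default: concatenate the units back into text.
        return "".join(str(u) for u in units)

    def serialize_units(self, units: Sequence[T]) -> bytes:
        # Assumed default: the hash input is the rendered text as UTF-8.
        return self.render_units(units).encode("utf-8")


class WordChunker(SketchChunker[str]):
    """Toy subclass: each unit is a word with one trailing space."""

    def create_chunks(self, text: str) -> list[list[str]]:
        chunks: list[list[str]] = []
        current: list[str] = []
        size = 0
        for unit in (word + " " for word in text.split()):
            length = self.unit_length(unit)
            # Close the current chunk when the next unit would exceed the limit.
            if current and size + length > self.chunk_size:
                chunks.append(current)
                current, size = [], 0
            current.append(unit)
            size += length
        if current:
            chunks.append(current)
        return chunks


chunker = WordChunker(chunk_size=12)
chunks = chunker.create_chunks("alpha beta gamma delta")
rendered = [chunker.render_units(c) for c in chunks]
```

Keeping sizing, rendering, and serialization as separate hooks means a subclass only has to decide what a unit is; the same merging and hashing logic can then work for characters, words, or tokens.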

class kara.CharacterChunker(separators: list[str] | None = None, chunk_size: int = 4000, overlap: int = 0, keep_separator: bool = True)[source]

Bases: BaseDocumentChunker[str]

Recursive character-based chunker that tries multiple separators.

First splits text into smallest units using separators, then greedily merges them into chunks within the size limit.

create_chunks(text: str) → list[list[str]][source]

Split text into optimally-sized chunks.

Args:
    text: Input text to split

Returns:
    List of chunks as unit lists
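
The split-then-merge approach described above can be sketched as two small functions: one that cuts text at separators while keeping each separator attached to the preceding unit, and one that greedily packs units into chunks under a size limit. This is a flat, standalone illustration rather than the library's recursive implementation; the function names and the regex-based splitting are assumptions.

```python
from __future__ import annotations

import re


def split_keep_separator(text: str, separators: list[str]) -> list[str]:
    """Split at every separator, attaching each separator to the unit before it."""
    pattern = "(" + "|".join(re.escape(s) for s in separators) + ")"
    parts = re.split(pattern, text)
    units: list[str] = []
    # With a capturing group, re.split alternates: text, separator, text, ...
    for i in range(0, len(parts), 2):
        separator = parts[i + 1] if i + 1 < len(parts) else ""
        unit = parts[i] + separator
        if unit:
            units.append(unit)
    return units


def greedy_merge(units: list[str], chunk_size: int) -> list[list[str]]:
    """Pack consecutive units into chunks whose total length stays within chunk_size."""
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for unit in units:
        if current and size + len(unit) > chunk_size:
            chunks.append(current)
            current, size = [], 0
        current.append(unit)
        size += len(unit)
    if current:
        chunks.append(current)
    return chunks


text = "One. Two.\n\nThree. Four."
units = split_keep_separator(text, ["\n\n", ". "])
merged = greedy_merge(units, chunk_size=12)
```

Because separators stay attached to their units, joining every unit of every chunk reproduces the input text exactly. A single unit longer than `chunk_size` still becomes its own oversized chunk in this sketch.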

class kara.HuggingFaceTokenChunker(model_name: str, chunk_size: int = 1000, overlap: int = 0)[source]

Bases: TokenChunker

Token chunker using Hugging Face tokenizers.

render_units(units: Sequence[Any]) → str[source]: Render token units by decoding them as a sequence.

class kara.KARAUpdater(chunker: BaseDocumentChunker[T])[source]

Bases: Generic[T]

Knowledge-Aware Re-embedding Algorithm updater.

Efficiently updates document collections by minimizing embedding operations through intelligent reuse of existing chunks.

chunker: BaseDocumentChunker[T]

create_collection(documents: list[str]) → UpdateResult[T][source]

Create a new document collection from documents.

Args:
    documents: list of document texts

Returns:
    UpdateResult with initial chunks

max_chunk_size: int

update_collection(current_collection: ChunkedDocument[T], documents: list[str]) → UpdateResult[T][source]

Update the document collection with new documents.

Args:
    current_collection: Current document collection state
    documents: list of updated document texts

Returns:
    UpdateResult with statistics and new collection

class kara.OpenAITokenChunker(encoding_name: str = 'cl100k_base', chunk_size: int = 1000, overlap: int = 0, allowed_special: Literal['all'] | Set[str] | None = None, disallowed_special: Literal['all'] | Collection[str] | None = None)[source]

Bases: TokenChunker

Token chunker using OpenAI’s tiktoken encodings.

render_units(units: Sequence[Any]) → str[source]: Render token units by decoding them as a sequence.

class kara.TokenChunker(tokenizer_function: Callable[[str], list[int]] | None = None, chunk_size: int = 512, overlap: int = 0)[source]

Bases: BaseDocumentChunker[int]

Token-based chunker that splits text into tokens and merges them greedily.

This demonstrates how the unified chunking approach works for different unit types (tokens instead of characters).

create_chunks(text: str) → list[list[int]][source]

Split text into token-based chunks.

Args:
    text: Input text to split

Returns:
    List of chunks as token lists

unit_length(unit: int) → int[source]

Return the unit length for sizing and chunk limits.

For token-based chunking, each token counts as 1 unit, regardless of its representation (e.g., string length).
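
When every unit has length 1, a chunk's size is simply its token count. The sketch below illustrates this with a toy tokenizer that assigns one integer ID per whitespace-separated word. Both `toy_tokenize` and `chunk_tokens` are hypothetical helpers, not library functions, and the `overlap` parameter is left out.

```python
from __future__ import annotations


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical tokenizer: one integer ID per whitespace-separated word."""
    vocab: dict[str, int] = {}
    # A repeated word gets the ID it was given on first sight.
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def chunk_tokens(tokens: list[int], chunk_size: int) -> list[list[int]]:
    """Group tokens into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size)]


tokens = toy_tokenize("the cat sat on the mat")
token_chunks = chunk_tokens(tokens, chunk_size=4)
```

With a uniform unit length, greedy merging reduces to fixed-size slicing, which is why the sketch needs no running size counter. The length of the word behind each ID plays no part in the limit.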

class kara.UpdateResult(num_added: int = 0, num_reused: int = 0, num_deleted: int = 0, new_chunked_doc: ChunkedDocument[T] | None = None)[source]

Bases: Generic[T]

Result of a KARA update operation.

property efficiency_ratio: float: Ratio of skipped operations to total operations.

new_chunked_doc: ChunkedDocument[T] | None = None

num_added: int = 0

num_deleted: int = 0

num_reused: int = 0

property total_operations: int: Total number of operations performed.