kara package

Subpackages

Submodules

kara.core module

Core KARA algorithm implementation.

class kara.core.ChunkData(content: Any, splits: list[T], hash: str, document_id: int | None = None)

Bases: Generic[T]

Represents a chunk with its content and metadata.

content: Any
document_id: int | None = None
classmethod from_splits(splits: Sequence[T], document_id: int | None = None, serializer: Callable[[Sequence[T]], bytes] | None = None, renderer: Callable[[Sequence[T]], Any] | None = None) -> ChunkData[T]

Create ChunkData from splits.

hash: str
splits: list[T]
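
The from_splits signature suggests that a chunk's hash is computed from the serialized splits. The sketch below illustrates that idea only. It assumes SHA-256 as the digest and a UTF-8 join as the default serializer, neither of which is confirmed by this reference; default_serializer and hash_splits are hypothetical names, not part of kara.

```python
import hashlib
from typing import Callable, Optional, Sequence


def default_serializer(splits: Sequence[str]) -> bytes:
    # Assumption: string units are concatenated and UTF-8 encoded.
    return "".join(splits).encode("utf-8")


def hash_splits(
    splits: Sequence[str],
    serializer: Optional[Callable[[Sequence[str]], bytes]] = None,
) -> str:
    # Assumption: SHA-256; the library may use a different digest.
    payload = (serializer or default_serializer)(splits)
    return hashlib.sha256(payload).hexdigest()


# Identical splits give identical hashes; any edit changes the hash.
same_a = hash_splits(["Hello, ", "world."])
same_b = hash_splits(["Hello, ", "world."])
changed = hash_splits(["Hello, ", "World."])
```

A content-derived hash of this kind is what would let an updater recognize an unchanged chunk without comparing its full content.
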
class kara.core.ChunkedDocument(chunks: list[ChunkData[T]])

Bases: Generic[T]

Represents the current state of the document collection.

chunks: list[ChunkData[T]]
classmethod from_chunks(chunks: list[Any], chunker: BaseDocumentChunker[T], document_id: int | None = None) -> ChunkedDocument[T]

Create a ChunkedDocument from pre-split chunks.

Args:

chunks: list of text chunks to include

document_id: Optional document identifier

Returns:

ChunkedDocument with chunks created

get_chunk_contents() -> list[Any]

Get all chunk contents.

get_chunk_hashes() -> set[str]

Get all chunk hashes in the collection.

get_chunks_by_document(document_id: int) -> list[ChunkData[T]]

Get all chunks belonging to a specific document.

get_document_ids() -> set[int]

Get all unique document IDs in the collection.

class kara.core.KARAUpdater(chunker: BaseDocumentChunker[T])

Bases: Generic[T]

Knowledge-Aware Re-embedding Algorithm updater.

Efficiently updates document collections by minimizing embedding operations through intelligent reuse of existing chunks.

chunker: BaseDocumentChunker[T]
create_collection(documents: list[str]) -> UpdateResult[T]

Create a new document collection from documents.

Args:

documents: list of document texts

Returns:

UpdateResult with initial chunks

max_chunk_size: int
update_collection(current_collection: ChunkedDocument[T], documents: list[str]) -> UpdateResult[T]

Update the document collection with new documents.

Args:

current_collection: Current document collection state

documents: list of updated document texts

Returns:

UpdateResult with statistics and new collection
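
The statistics returned by an update can be pictured as a comparison of chunk hashes before and after the change. The sketch below is a simplified illustration of that bookkeeping, not the KARA algorithm itself, which also decides how to re-chunk the updated text so that reuse is maximized. chunk_hash and diff_chunks are hypothetical helpers, and SHA-256 is an assumption.

```python
import hashlib


def chunk_hash(text: str) -> str:
    # Assumption: SHA-256 over UTF-8 text stands in for the library's chunk hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_chunks(old_chunks: list[str], new_chunks: list[str]) -> dict[str, int]:
    old_hashes = {chunk_hash(c) for c in old_chunks}
    new_hashes = {chunk_hash(c) for c in new_chunks}
    return {
        "num_reused": len(new_hashes & old_hashes),   # already embedded, no work needed
        "num_added": len(new_hashes - old_hashes),    # new content that needs embedding
        "num_deleted": len(old_hashes - new_hashes),  # stale chunks that can be dropped
    }


old = ["Intro paragraph.", "Section one.", "Section two."]
new = ["Intro paragraph.", "Section one, revised.", "Section two."]
stats = diff_chunks(old, new)
```

In this example only the revised middle chunk would need a new embedding; the other two are reused.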

class kara.core.UpdateResult(num_added: int = 0, num_reused: int = 0, num_deleted: int = 0, new_chunked_doc: ChunkedDocument[T] | None = None)

Bases: Generic[T]

Result of a KARA update operation.

property efficiency_ratio: float

Ratio of skipped operations to total operations.

new_chunked_doc: ChunkedDocument[T] | None = None
num_added: int = 0
num_deleted: int = 0
num_reused: int = 0
property total_operations: int

Total number of operations performed.
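
This reference does not give the formulas behind efficiency_ratio and total_operations. The sketch below shows one plausible reading, assuming that reused chunks are the "skipped" operations and that every added, reused, or deleted chunk counts as one operation. UpdateStats is a hypothetical stand-in and may not match the library's actual computation.

```python
from dataclasses import dataclass


@dataclass
class UpdateStats:
    # Hypothetical stand-in mirroring the documented counters.
    num_added: int = 0
    num_reused: int = 0
    num_deleted: int = 0

    @property
    def total_operations(self) -> int:
        # Assumption: each added, reused, or deleted chunk is one operation.
        return self.num_added + self.num_reused + self.num_deleted

    @property
    def efficiency_ratio(self) -> float:
        # Assumption: reused chunks are the "skipped" operations.
        total = self.total_operations
        return self.num_reused / total if total else 0.0


stats = UpdateStats(num_added=1, num_reused=8, num_deleted=1)
ratio = stats.efficiency_ratio
```

Under these assumptions, a ratio close to 1.0 means that most chunks were carried over without re-embedding.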

kara.splitters module

Module contents

kara-toolkit - Knowledge-Aware Re-embedding Algorithm

A Python library for efficient updates to RAG document collections, minimizing embedding operations through intelligent chunk reuse.

class kara.BaseDocumentChunker(chunk_size: int = 1000, overlap: int = 0)

Bases: ABC, Generic[T]

Abstract base class for document chunkers.

abstractmethod create_chunks(text: str) -> list[list[T]]

Split text into optimally-sized chunks.

normalize_chunk(chunk: Any) -> list[T]

Normalize a chunk to a list of units.

render_units(units: Sequence[T]) -> Any

Render units for output or storage.

serialize_units(units: Sequence[T]) -> bytes

Serialize units to bytes for hashing.

unit_length(unit: T) -> int

Return the unit length for sizing and chunk limits.

class kara.CharacterChunker(separators: list[str] | None = None, chunk_size: int = 4000, overlap: int = 0, keep_separator: bool = True)

Bases: BaseDocumentChunker[str]

Recursive character-based chunker that tries multiple separators.

First splits text into smallest units using separators, then greedily merges them into chunks within the size limit.

create_chunks(text: str) -> list[list[str]]

Split text into optimally-sized chunks.

Args:

text: Input text to split

Returns:

List of chunks as unit lists
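
The split-then-merge behavior described for CharacterChunker can be sketched in a few lines. This is a minimal illustration using a single separator and no overlap, not the library's implementation; split_units and greedy_merge are hypothetical helpers, and the recursive fallback over several separators is left out.

```python
def split_units(text: str, separator: str = " ") -> list[str]:
    # Keep the separator attached to the preceding unit so that joining
    # all units reproduces the original text exactly.
    parts = text.split(separator)
    units = [p + separator for p in parts[:-1]] + [parts[-1]]
    return [u for u in units if u]


def greedy_merge(units: list[str], chunk_size: int) -> list[list[str]]:
    chunks: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for unit in units:
        # Close the current chunk when this unit would push it past the limit.
        if current and current_len + len(unit) > chunk_size:
            chunks.append(current)
            current, current_len = [], 0
        # A unit longer than chunk_size still gets a chunk of its own.
        current.append(unit)
        current_len += len(unit)
    if current:
        chunks.append(current)
    return chunks


text = "one two three four five"
chunks = greedy_merge(split_units(text), chunk_size=10)
```

Each chunk is a list of units, matching the "chunks as unit lists" return shape documented above.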

class kara.HuggingFaceTokenChunker(model_name: str, chunk_size: int = 1000, overlap: int = 0)

Bases: TokenChunker

Token chunker using Hugging Face tokenizers.

render_units(units: Sequence[Any]) -> str

Render token units by decoding them as a sequence.

class kara.KARAUpdater(chunker: BaseDocumentChunker[T])

Bases: Generic[T]

Knowledge-Aware Re-embedding Algorithm updater.

Efficiently updates document collections by minimizing embedding operations through intelligent reuse of existing chunks.

chunker: BaseDocumentChunker[T]
create_collection(documents: list[str]) -> UpdateResult[T]

Create a new document collection from documents.

Args:

documents: list of document texts

Returns:

UpdateResult with initial chunks

max_chunk_size: int
update_collection(current_collection: ChunkedDocument[T], documents: list[str]) -> UpdateResult[T]

Update the document collection with new documents.

Args:

current_collection: Current document collection state

documents: list of updated document texts

Returns:

UpdateResult with statistics and new collection

class kara.OpenAITokenChunker(encoding_name: str = 'cl100k_base', chunk_size: int = 1000, overlap: int = 0, allowed_special: Literal['all'] | Set[str] | None = None, disallowed_special: Literal['all'] | Collection[str] | None = None)

Bases: TokenChunker

Token chunker using OpenAI’s tiktoken encodings.

render_units(units: Sequence[Any]) -> str

Render token units by decoding them as a sequence.

class kara.TokenChunker(tokenizer_function: Callable[[str], list[int]] | None = None, chunk_size: int = 512, overlap: int = 0)

Bases: BaseDocumentChunker[int]

Token-based chunker that splits text into tokens and merges them greedily.

This demonstrates how the unified chunking approach works for different unit types (tokens instead of characters).

create_chunks(text: str) -> list[list[int]]

Split text into token-based chunks.

Args:

text: Input text to split

Returns:

List of chunks as token lists

unit_length(unit: int) -> int

Return the unit length for sizing and chunk limits.

For token-based chunking, each token counts as 1 unit, regardless of its representation (e.g., string length).
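
Because every token counts as one unit, a token chunk is limited by token count rather than by character length. The sketch below illustrates this with a toy tokenizer and no overlap. toy_tokenizer and chunk_tokens are hypothetical helpers, and a real tokenizer would produce different IDs.

```python
from typing import Callable


def toy_tokenizer(text: str) -> list[int]:
    # Hypothetical tokenizer: one integer ID per whitespace-separated word,
    # assigned in order of first appearance.
    vocab: dict[str, int] = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def chunk_tokens(
    text: str,
    tokenizer: Callable[[str], list[int]],
    chunk_size: int,
) -> list[list[int]]:
    tokens = tokenizer(text)
    # Each token has length 1, so slicing by count enforces the size limit.
    return [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size)]


chunks = chunk_tokens("to be or not to be that is", toy_tokenizer, chunk_size=3)
```

Here the eight words become eight tokens, so a limit of 3 gives chunks of 3, 3, and 2 tokens regardless of how long each word is.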

class kara.UpdateResult(num_added: int = 0, num_reused: int = 0, num_deleted: int = 0, new_chunked_doc: ChunkedDocument[T] | None = None)

Bases: Generic[T]

Result of a KARA update operation.

property efficiency_ratio: float

Ratio of skipped operations to total operations.

new_chunked_doc: ChunkedDocument[T] | None = None
num_added: int = 0
num_deleted: int = 0
num_reused: int = 0
property total_operations: int

Total number of operations performed.