KV$ Rewind
The KV$ Rewind (also called KV$ Prefix Match) feature speeds up query processing by reusing previously cached KV values. When a new query shares a common prefix with the previous query, Genie can rewind the KV cache according to the prefix match and reuse the cached values for that prefix instead of recomputing them.
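The prefix-match idea can be illustrated with a toy model. The sketch below is not Genie's implementation: `common_prefix_len`, `ToyKvCache`, and `toy_rewind_and_extend` are hypothetical names introduced only for illustration, and the "cache" stores token ids rather than real key/value tensors. It assumes that a rewind keeps the cache entries for the tokens shared with the previous query and processes only the remaining tokens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_CAPACITY 64

/* Hypothetical helper, not part of the Genie API: counts how many leading
 * token ids of the new query match the token ids already in the cache. */
size_t common_prefix_len(const int32_t *cached, size_t n_cached,
                         const int32_t *query, size_t n_query)
{
    size_t limit = n_cached < n_query ? n_cached : n_query;
    size_t i = 0;
    while (i < limit && cached[i] == query[i]) {
        ++i;
    }
    return i;
}

/* Hypothetical stand-in for a KV cache: it only records which token ids
 * currently have cached entries. */
typedef struct {
    int32_t tokens[TOY_CACHE_CAPACITY];
    size_t  length;
} ToyKvCache;

/* Rewinds the toy cache to the shared prefix, then appends the remaining
 * query tokens. Returns the number of tokens that had to be processed. */
size_t toy_rewind_and_extend(ToyKvCache *cache,
                             const int32_t *query, size_t n_query)
{
    assert(cache != NULL);
    size_t keep = common_prefix_len(cache->tokens, cache->length,
                                    query, n_query);
    cache->length = keep; /* drop everything after the matching prefix */

    size_t processed = 0;
    for (size_t i = keep;
         i < n_query && cache->length < TOY_CACHE_CAPACITY; ++i) {
        cache->tokens[cache->length++] = query[i];
        ++processed;
    }
    return processed;
}
```

In this toy model, two five-token queries that differ only in their last token cause the second call to process a single token, which is the saving the feature aims for.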
Using KV$ Rewind between queries
typedef enum {
  /// The string is the entire query/response.
  GENIE_DIALOG_SENTENCE_COMPLETE = 0,
  /// The string is the beginning of the query/response.
  GENIE_DIALOG_SENTENCE_BEGIN = 1,
  /// The string is a part of the query/response and not the beginning or end.
  GENIE_DIALOG_SENTENCE_CONTINUE = 2,
  /// The string is the end of the query/response.
  GENIE_DIALOG_SENTENCE_END = 3,
  /// The query has been aborted.
  GENIE_DIALOG_SENTENCE_ABORT = 4,
  /// Rewind the KV cache as per prefix query match before processing the query.
  GENIE_DIALOG_SENTENCE_REWIND = 5,
} GenieDialog_SentenceCode_t;

GENIE_API
Genie_Status_t GenieDialog_query(const GenieDialog_Handle_t dialogHandle,
                                 const char* queryStr,
                                 const GenieDialog_SentenceCode_t sentenceCode,
                                 const GenieDialog_QueryCallback_t callback,
                                 const void* userData);
Use the sentence code GENIE_DIALOG_SENTENCE_REWIND and pass the query string as you would for a normal query. The API will handle prefix matching and KV rewind internally.
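The call sequence can be sketched as follows: one normal query, followed by a rewind query that shares a prefix with it. The enum values and the `GenieDialog_query` parameter list are taken from the declarations above. Everything in the `#else` branch is an assumption added so that the sketch compiles without the SDK: the header name `GenieDialog.h`, the stand-in typedefs, the callback signature, `GENIE_STATUS_SUCCESS` being 0, and a stub `GenieDialog_query` that runs no model. `on_response` and `ask_then_rewind` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

#ifdef USE_GENIE_SDK
#include "GenieDialog.h" /* assumed header name for the real declarations */
#else
/* Stand-ins (assumptions) so the sketch compiles without the SDK. */
typedef int Genie_Status_t;
#define GENIE_STATUS_SUCCESS 0
typedef const void* GenieDialog_Handle_t;
typedef enum {
  GENIE_DIALOG_SENTENCE_COMPLETE = 0,
  GENIE_DIALOG_SENTENCE_BEGIN = 1,
  GENIE_DIALOG_SENTENCE_CONTINUE = 2,
  GENIE_DIALOG_SENTENCE_END = 3,
  GENIE_DIALOG_SENTENCE_ABORT = 4,
  GENIE_DIALOG_SENTENCE_REWIND = 5,
} GenieDialog_SentenceCode_t;
/* Assumed callback shape: response text, its sentence code, user data. */
typedef void (*GenieDialog_QueryCallback_t)(
    const char* response, const GenieDialog_SentenceCode_t sentenceCode,
    const void* userData);

/* Stub only: records the sentence code it was given and echoes a fixed
 * response through the callback. It does not run a model. */
static GenieDialog_SentenceCode_t g_stub_last_code;
Genie_Status_t GenieDialog_query(const GenieDialog_Handle_t dialogHandle,
                                 const char* queryStr,
                                 const GenieDialog_SentenceCode_t sentenceCode,
                                 const GenieDialog_QueryCallback_t callback,
                                 const void* userData) {
  (void)dialogHandle;
  (void)queryStr;
  g_stub_last_code = sentenceCode;
  if (callback) {
    callback("<stub response>\n", GENIE_DIALOG_SENTENCE_COMPLETE, userData);
  }
  return GENIE_STATUS_SUCCESS;
}
#endif

/* Prints each response fragment and counts callback invocations. */
static void on_response(const char* response,
                        const GenieDialog_SentenceCode_t sentenceCode,
                        const void* userData) {
  (void)sentenceCode;
  int* calls = (int*)userData; /* counter owned by the caller */
  if (calls) ++*calls;
  if (response) fputs(response, stdout);
}

/* Sends a normal query, then a rewind query sharing a prefix with it. */
Genie_Status_t ask_then_rewind(GenieDialog_Handle_t dialog,
                               const char* first, const char* second,
                               int* calls) {
  assert(first != NULL && second != NULL);
  Genie_Status_t status = GenieDialog_query(
      dialog, first, GENIE_DIALOG_SENTENCE_COMPLETE, on_response, calls);
  if (status != GENIE_STATUS_SUCCESS) return status;
  /* Same call as a normal query; only the sentence code changes. */
  return GenieDialog_query(dialog, second, GENIE_DIALOG_SENTENCE_REWIND,
                           on_response, calls);
}
```

The only difference between the two calls is the sentence code, which matches the description above: the caller passes the full new query string and leaves prefix matching to the API.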
Note
KV$ prefix match works well with the SMART_MASK KV update method. With the POINTER_SHIFT KV update method, memory register-related errors have been observed in a few cases for weight-shared bins. POINTER_SHIFT shows no issues with decoder-only models (AR1, AR8, AR128, etc.).
In genie-t2t-run, use the -w option to pass a rewind query.
For example:
./genie-t2t-run -c llama2-7b-htp.json \
    -p "Answer in one sentence, what is the capital city of India?" \
    -w "Answer in one sentence, what is the capital city of Russia?"