Genie Dialog JSON configuration string¶
The following sections describe the format of the JSON configuration string that is supplied to GenieDialogConfig_createFromJson. This JSON configuration can also be supplied to the genie-t2t-run tool.
Note
Please refer to the example configs contained in the SDK at ${SDK_ROOT}/examples/Genie/configs/.
General configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that dependencies are not specified in the schema, but are discussed in the following per-backend sections.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum":["basic"]},
"context" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"size": {"type": "integer"},
"n-vocab": {"type": "integer"},
"bos-token": {"type": "integer"},
"eos-token": {"type": "integer"}
}
},
"sampler" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"seed" : {"type": "integer"},
"temp" : {"type": "float"},
"top-k" : {"type": "integer"},
"top-p" : {"type": "float"},
"greedy" : {"type": "boolean"},
"token-penalty" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"penalize-last-n": {"type": "integer"},
"repetition-penalty": {"type": "float"},
"presence-penalty": {"type": "float"},
"frequency-penalty": {"type": "float"}
}
}
}
},
"tokenizer" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"path" : {"type": "string"}
}
},
"engine" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-threads" : {"type": "integer"},
"backend" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum" : ["QnnHtp", "QnnGenAiTransformer"]},
"QnnHtp" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"spill-fill-bufsize" : {"type": "integer"},
"data-alignment-size" : {"type": "integer"},
"use-mmap" : {"type": "boolean"},
"mmap-budget" : {"type": "integer"},
"poll" : {"type": "boolean"},
"pos-id-dim" : {"type": "integer"},
"cpu-mask" : {"type": "string"},
"kv-dim" : {"type": "integer"},
"allow-async-init" : {"type": "boolean"},
"enable-graph-switching" : {"type": "boolean"},
"skip-lora-validation" : {"type" : "boolean"},
"rope-theta" : {"type": "double"},
"shared-engine" : {"type" : "boolean"}
}
},
"QnnGenAiTransformer" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-logits" : {"type": "integer"},
"n-layer" : {"type": "integer"},
"n-embd" : {"type": "integer"},
"n-heads" : {"type": "integer"},
"n-kv-heads" : {"type": "integer"},
"kv-quantization" : {"type": "boolean"},
"enable-in-memory-kv-share" : {"type": "boolean"}
}
},
"extensions" : {"type": "string"}
}
},
"model" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum":["binary", "library"]},
"binary" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"ctx-bins" : {"type": "array", "items": {"type": "string"}}
}
},
"library" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"model-bin" : {"type": "string"}
}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::version | all backends | Version of the dialog object supported by the APIs. (1) |
| dialog::type | all backends | Type of dialog supported by the APIs. (basic) |
| dialog::stop-sequence | all backends | Stop the query when one of a set of sequences is detected in the response. Passed as an array of strings. |
| dialog::max-num-tokens | all backends | Stop the query when the maximum number of tokens has been generated in the response. |
| context::version | all backends | Version of the context object supported by the APIs. (1) |
| context::size | all backends | Context length. Maximum number of tokens to store. |
| context::n-vocab | all backends | Model vocabulary size. |
| context::bos-token | all backends | Beginning-of-sentence token. |
| context::eos-token | all backends | End-of-sentence token. Passed as an integer or an array of integers. |
| context::eot-token | all backends | End-of-turn token. |
| sampler::version | all backends | Version of the sampler object supported by the APIs. (1) |
| sampler::type | all backends | Type of sampler to use. Supported options: basic, custom. |
| sampler::callback-name | all backends | Name of the callback function to use for sampling. |
| sampler::seed | all backends | Random number generation seed for sampling. |
| sampler::temp | all backends | Sampling temperature. |
| sampler::top-k | all backends | Top-k number of samples. |
| sampler::top-p | all backends | Top-p sampling threshold. |
| sampler::greedy | all backends | Whether to use greedy or random sampling. A value of true specifies greedy sampling. |
| token-penalty::version | all backends | Version of the token-penalty object supported by the APIs. (1) |
| token-penalty::penalize-last-n | all backends | Maintains a history of the last n tokens for penalization. Zero signifies no penalty. |
| token-penalty::repetition-penalty | all backends | Penalizes the logit value by the given value for repetition. Values >1 penalize repetition; values <1 boost it. |
| token-penalty::presence-penalty | all backends | Penalizes the logit value by the given value for presence. Values >0 penalize repetition; values <0 boost it. |
| token-penalty::frequency-penalty | all backends | Penalizes the logit value by the given value for frequency. Values >0 penalize repetition; values <0 boost it. |
| tokenizer::version | all backends | Version of the tokenizer object supported by the APIs. (1) |
| tokenizer::path | all backends | Path to the tokenizer file. |
| engine::version | all backends | Version of the engine object supported by the APIs. (1) |
| engine::n-threads | all backends | Number of threads to use for KV-cache updates. |
| debug::path | all backends | File path to dump debug information. |
| debug::dump-tensors | all backends | Raw data dump of input and output tensors. |
| debug::dump-specs | all backends | Dumps input and output tensor specifications such as bitwidth, scale, offset, and dimensions. |
| debug::dump-outputs | all backends | Raw data dump of output tensors from the engine. |
| backend::version | all backends | Version of the backend object supported by the APIs. (1) |
| backend::type | all backends | Type of engine: “QnnHtp” for QNN HTP, “QnnGenAiTransformer” for the QNN GenAiTransformer backend, and “QnnGpu” for QNN GPU. |
| backend::extensions | QNN HTP | Path to the backend extensions configuration file. |
| QnnHtp::version | QNN HTP | Version of the QnnHtp object supported by the APIs. (1) |
| QnnHtp::spill-fill-bufsize | QNN HTP | Buffer size to pre-allocate for QNN HTP spill-fill. This field depends on the HTP VTCM memory size and should be set greater than the spill-fill required by each context binary in the model. Consult the QNN HTP backend documentation in the QAIRT SDK for more details. |
| QnnHtp::data-alignment-size | QNN HTP | Data is aligned by rounding the size up to the nearest multiple of the alignment number. Should typically be zero. |
| QnnHtp::use-mmap | QNN HTP | Memory-map the context binary files. Should typically be turned on. |
| QnnHtp::mmap-budget | QNN HTP | Memory-map the context binary files in chunks of the given size. Should typically be 25 MB. |
| QnnHtp::poll | QNN HTP | Specifies whether to busy-wait on threads. |
| QnnHtp::pos-id-dim | QNN HTP | Dimension of positional embeddings, usually (kv-dim) / 2. |
| QnnHtp::cpu-mask | QNN HTP | CPU affinity mask. |
| QnnHtp::kv-dim | QNN HTP | Dimension of the KV-cache embedding. |
| QnnHtp::allow-async-init | QNN HTP | Allow context binaries to be initialized asynchronously if the backend supports it. |
| QnnHtp::enable-graph-switching | QNN HTP | Enables graph switching for graphs within each context binary. |
| QnnHtp::skip-lora-validation | QNN HTP | Skips CRC validation when LoRA binary sections are applied. Refer to the QNN HTP documentation for more information. |
| QnnHtp::rope-theta | QNN HTP | Used to calculate rotary positional encodings. |
| QnnHtp::shared-engine | QNN HTP | Enables the engine to be created and managed in shared mode. “allow-async-init” must be true for engine sharing. |
| QnnGenAiTransformer::version | QNN GenAiTransformer | Version of the QnnGenAiTransformer object supported by the APIs. (1) |
| QnnGenAiTransformer::n-logits | QNN GenAiTransformer | Number of logit vectors the result will have for sampling. |
| QnnGenAiTransformer::n-layer | QNN GenAiTransformer | Number of decoder layers in the model. |
| QnnGenAiTransformer::n-embd | QNN GenAiTransformer | Size of the embedding vector for each token. |
| QnnGenAiTransformer::n-heads | QNN GenAiTransformer | Number of attention heads in the model. |
| QnnGenAiTransformer::n-kv-heads | QNN GenAiTransformer | Number of KV heads. Used for models with grouped-query attention (GQA). |
| QnnGenAiTransformer::kv-quantization | QNN GenAiTransformer | Quantize the KV cache to Q8_0_32. |
| QnnGenAiTransformer::enable-in-memory-kv-share | QNN GenAiTransformer | Enables an in-memory buffer for the KV-share dialog for better performance. |
| model::version | all backends | Version of the model object supported by the APIs. (1) |
| model::type | all backends | Type of model object: “binary” for QNN HTP and “library” for QNN GenAiTransformer. |
| model::positional-encoding | all backends | Captures positional encoding parameters for a model. |
| positional-encoding::type | all backends | Type of positional encoding. Supported types are rope, alibi, and absolute. |
| positional-encoding::rope-dim | all backends | Dimension of RoPE positional embeddings, usually (kv-dim) / 2. |
| positional-encoding::rope-theta | all backends | Used to calculate rotary positional encodings for type rope. |
| binary::version | QNN HTP | Version of the binary object supported by the APIs. (1) |
| binary::ctx-bins | QNN HTP | List of serialized model files. |
| library::version | QNN GenAiTransformer | Version of the library object supported by the APIs. (1) |
| library::model-bin | QNN GenAiTransformer | Path to the model.bin file. |
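The dialog-level stop-sequence and max-num-tokens options and the sampler's token-penalty object are documented above but not exercised by the later examples. The following sketch shows where they sit in the configuration; all values, including the stop strings (which depend on the model's chat template), are illustrative only.

```json
{
"dialog" : {
"version" : 1,
"type" : "basic",
"stop-sequence" : ["</s>", "<|endoftext|>"],
"max-num-tokens" : 256,
"sampler" : {
"version" : 1,
"seed" : 42,
"temp" : 0.8,
"top-k" : 40,
"top-p" : 0.95,
"greedy" : false,
"token-penalty" : {
"version" : 1,
"penalize-last-n" : 64,
"repetition-penalty" : 1.1,
"presence-penalty" : 0.0,
"frequency-penalty" : 0.0
}
}
}
}
```

A repetition-penalty slightly above 1 together with a non-zero penalize-last-n window is a common starting point for curbing repeated output.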
SSD-Q1 configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["ssd-q1"]},
"ssd-q1": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"ssd-version": {"type": "integer"},
"forecast-prefix": {"type": "integer"},
"forecast-token-count" : {"type": "integer"},
"forecast-prefix" : {"type": "integer"},
"forecast-prefix-name" : {"type": "string"},
"branches" : "branches": {"type": "array"},
"n-streams": {"type": "integer"},
"p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (ssd-q1) |
| ssd-q1::version | all backends | Version of the ssd-q1 object supported by the APIs. (1) |
| ssd-q1::ssd-version | all backends | Version of SSD to use; the supported version is 1. |
| ssd-q1::forecast-prefix | all backends | Length of the forecast prefix. |
| ssd-q1::forecast-token-count | all backends | Maximum number of tokens that can be forecast. |
| ssd-q1::branch-mode | all backends | Supports top-1 and all-expand branching modes. |
| ssd-q1::branches | all backends | Parallel decoding branches that will be created. |
| ssd-q1::forecast-prefix-name | all backends | Path of the forecast-prefix binary. |
| ssd-q1::n-streams | all backends | Number of parallel output streams for the same query. |
| ssd-q1::p-threshold | all backends | Probability threshold for sampling n tokens for n streams. |
SPD configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here, except “engine”, follow the general configuration schema. For SPD, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["spd"]},
"spd": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"draft-len" : {"type": "integer"}
}
},
"engine" : {
"type" : "array",
"properties" : {
{
"version" : {"type": "integer"},
"role" : {"type": "string" "enum" : ["draft", "target"]}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (spd) |
| spd::version | all backends | Version of the spd object supported by the APIs. (1) |
| spd::draft-len | all backends | Speculative decoding draft length. |
| spd::engine::0::role | all backends | Engine role. Distinguishes between multiple engines. |
LADE configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here follow the general configuration schema.
Note
Please follow these rules when setting the LADE parameters:
lade::window == lade::gcap
(lade::window + lade::gcap) * (lade::ngram - 1) <= AR-N
For example, with window = gcap = 8 and ngram = 5, (8 + 8) * (5 - 1) = 64, so the model's AR-N must be at least 64.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["lade"]},
"lade": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"update-mode": {"type": "string", "enum": ["ALWAYS_FWD_ONE", "FWD_MAX_HIT", "FWD_LEVEL"]},
"window": {"type": "integer"},
"ngram": {"type": "integer"},
"gcap": {"type": "integer"}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (lade) |
| lade::version | all backends | Version of the lade object supported by the APIs. (1) |
| lade::window | all backends | Window size used for LADE. |
| lade::ngram | all backends | Number of n-grams used for LADE. |
| lade::gcap | all backends | Gcap used for LADE. |
| lade::update-mode | all backends | Update mode to use for lookahead branch updates. (ALWAYS_FWD_ONE/FWD_MAX_HIT/FWD_LEVEL) |
Multistream configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["multistream"]},
"multistream": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-streams": {"type": "integer"},
"p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (multistream) |
| multistream::version | all backends | Version of the multistream object supported by the APIs. (1) |
| multistream::n-streams | all backends | Number of parallel output streams for the same query. |
| multistream::p-threshold | all backends | Probability threshold for sampling n tokens for n streams. |
LoRA V1 configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"engine" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"model" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum":["binary"]},
"binary" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"lora": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"alpha-tensor-name": {"type": "string"},
"lora-version": {"type": "integer"},
"adapters" : {
"type": "array",
"items" : {
"type": "object",
"properties" : {
"version" : {"type": "integer"},
"name" : {"type": "string"},
"path" : {"type": "string"}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| lora::version | all backends | Version of the lora object supported by the APIs. (1) |
| lora::alpha-tensor-name | all backends | Name of the alpha tensor. |
| lora::lora-version | all backends | Configured LoRA version. |
| adapters::version | all backends | Version of the adapters object supported by the APIs. (1) |
| adapters::name | all backends | Name of the LoRA weights adapter. |
| adapters::path | all backends | Path to the lora-weights directory. |
LoRA V2/V3 configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"engine" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"model" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum":["binary"]},
"binary" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"lora": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"alpha-tensor-name": {"type": "string"},
"adapters" : {
"type": "array",
"items" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"name" : {"type": "string"},
"alphas" : {"type": "array", "items": {"type": "string"}},
"bin-sections": {"type": "array", "items": {"type": "string"}}
}
}
},
"groups" : {
"type": "array",
"items" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"name" : {"type": "string"},
"members" : {"type": "array", "items": {"type": "string"}},
"quant-bin-sections": {"type": "array", "items": {"type": "string"}}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| lora::version | all backends | Version of the lora object supported by the APIs. (1) |
| lora::alpha-tensor-name | all backends | Name of the alpha tensor. This tensor is part of the base graph and is populated with the alpha strength value(s) during the GenieDialog_setLoraStrength API call. |
| adapters::version | all backends | Version of the adapters object supported by the APIs. (1) |
| adapters::name | all backends | Name of the LoRA adapter. |
| adapters::alphas | all backends | List of alpha names, which serve as virtual alpha tensor labels, one per adapter. The user specifies the alpha values corresponding to these names via the genie-t2t-run lora argument, and these values populate the base graph alpha tensor. Mandatory for LoRA v3; optional for LoRA v1 and v2. |
| adapters::bin-sections | all backends | List of serialized LoRA weights bins. |
| lora::groups | QNN HTP | The groups to which the LoRA adapters belong. Adapters in the same group share the same encodings bins. |
| groups::version | QNN HTP | Version of the groups object supported by the APIs. (1) |
| groups::name | QNN HTP | Name of the LoRA group. |
| groups::members | QNN HTP | Names of the adapters belonging to the LoRA group. |
| groups::quant-bin-sections | QNN HTP | List of serialized LoRA quant-only bins. |
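The LoRA example later in this document does not exercise the alphas or groups fields. The following is a minimal illustrative sketch of a LoRA V3 style configuration; the adapter, alpha, group, and file names are hypothetical placeholders, not values shipped with the SDK.

```json
{
"dialog" : {
"version" : 1,
"type" : "basic",
"engine" : {
"version" : 1,
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"lora" : {
"version" : 1,
"alpha-tensor-name" : "alpha",
"adapters" : [
{
"version" : 1,
"name" : "lora1",
"alphas" : ["lora1_alpha"],
"bin-sections" : ["lora1_weights.bin"]
}
],
"groups" : [
{
"version" : 1,
"name" : "group1",
"members" : ["lora1"],
"quant-bin-sections" : ["group1_quant.bin"]
}
]
}
}
}
}
}
}
```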
Embedding-to-Text configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum":["basic", "ssd-q1", "multistream"]},
"embedding" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"size": {"type": "integer"},
"datatype": {"type": "string", "enum":["float32", "native"]}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. |
| embedding::version | all backends | Version of the embedding object supported by the APIs. (1) |
| embedding::size | all backends | Embedding vector dimensionality. |
| embedding::datatype | all backends | Expected datatype for embedding vectors provided to GenieDialog_embeddingQuery() and the token-to-embedding conversion callback. |
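A minimal sketch of an Embedding-to-Text dialog configuration, combining the embedding object above with a basic dialog. The embedding size shown is illustrative only and must match the embedding dimensionality of the actual model.

```json
{
"dialog" : {
"version" : 1,
"type" : "basic",
"embedding" : {
"version" : 1,
"size" : 4096,
"datatype" : "float32"
}
}
}
```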
Running split LLM (Embedding + Decoder) configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum":["basic", "ssd-q1", "multistream"]},
"embedding" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type": {"type": "string", "enum":["lut", "callback"]},
"size": {"type": "integer"},
"datatype": {"type": "string", "enum":["float32", "native", "ufixed8", "ufixed16", "sfixed8", "sfixed16"]}
"lut-path" : {"type": "string"},
"quant-param" : {
"type": "object",
"properties": {
"scale" : {"type": "float"},
"offset" : {"type": "float"}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. |
| embedding::version | all backends | Version of the embedding object supported by the APIs. (1) |
| embedding::type | all backends | Type of embedding operation to configure. |
| embedding::size | all backends | Embedding vector dimensionality. |
| embedding::datatype | all backends | Datatype of the lookup table binary file. |
| embedding::lut-path | all backends | Path to the lookup table binary file. |
| quant-param::scale | all backends | Quantization scale for the lookup table binary. |
| quant-param::offset | all backends | Quantization offset for the lookup table binary. |
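A minimal sketch of a split-LLM embedding configuration using a lookup table. The path placeholder, datatype, size, and quantization parameters are illustrative assumptions and must match how the LUT binary was actually produced.

```json
{
"dialog" : {
"version" : 1,
"type" : "basic",
"embedding" : {
"version" : 1,
"type" : "lut",
"size" : 2048,
"datatype" : "ufixed16",
"lut-path" : "path_to_embedding_lut_binary",
"quant-param" : {
"scale" : 0.001,
"offset" : 0.0
}
}
}
}
```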
Eaglet configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All parameters not mentioned here, except “engine” and “context”, follow the general configuration schema. For Eaglet, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property. The “draft” engine includes a “draft-token-map” in the trimmed-vocabulary case, which applies when “n-vocab” != “draft-n-vocab”.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["eaglet"]},
"eaglet": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"eaglet-version": {"type": "integer"},
"draft-len": {"type": "integer"},
"n-branches": {"type": "integer"},
"max-tokens-target-can-evaluate": {"type": "integer"},
"draft-kv-cache": {"type": "boolean"}
}
},
"context" : {
"version" : {"type": "integer"},
"draft-n-vocab" : {"type" : "integer"}
}
"engine" : {
"type" : "array",
"properties" : {
"version" : {"type": "integer"},
"role" : {"type": "string" "enum" : ["draft", "target"]},
{
"model" : {
"type" : "array",
"properties" : {
"version" : {"type": "integer"},
"draft-token-map" : {"type" : "string"}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (eaglet) |
| eaglet::version | all backends | Version of the eaglet object supported by the APIs. (1) |
| eaglet::eaglet-version | all backends | Version of Eaglet to use; the supported version is 1. |
| eaglet::draft-len | all backends | Length of the draft sequence. |
| eaglet::n-branches | all backends | Parallel decoding branches that will be created. |
| eaglet::max-tokens-target-can-evaluate | all backends | Maximum number of tokens the target model can evaluate in a run. |
| eaglet::draft-kv-cache | all backends | Whether the draft model runs with a KV cache. |
| eaglet::engine::role | all backends | Engine role. Distinguishes between multiple engines. |
| context::draft-n-vocab | all backends | Draft model vocabulary size. |
| eaglet::engine::draft-token-map | all backends | Map from draft token ID to target token ID. |
QNN GenAITransformer backend configuration example¶
The following is an example configuration for the QNN GenAITransformer backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"context" : {
"version" : 1,
"size": 512,
"n-vocab": 32000,
"bos-token": 1,
"eos-token": 2
},
"sampler" : {
"version" : 1,
"seed" : 100,
"temp" : 1.2,
"top-k" : 20,
"top-p" : 0.75,
"greedy" : false
},
"tokenizer" : {
"version" : 1,
"path" : "path_to_tokenizer_json"
},
"engine" : {
"version" : 1,
"n-threads" : 10,
"backend" : {
"version" : 1,
"type" : "QnnGenAiTransformer",
"QnnGenAiTransformer" : {
"version" : 1,
"kv-quantization": false
}
},
"model" : {
"version" : 1,
"type" : "library",
"library" : {
"version" : 1,
"model-bin" : "path_to_model_binary_file"
}
}
}
}
}
QNN HTP backend configuration example¶
The following is an example configuration for the QNN HTP backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"context" : {
"version" : 1,
"size": 1024,
"n-vocab": 32000,
"bos-token": 1,
"eos-token": 2
},
"sampler" : {
"version" : 1,
"seed" : 42,
"temp" : 0.8,
"top-k" : 40,
"top-p" : 0.95,
"greedy" : true
},
"tokenizer" : {
"version" : 1,
"path" : "test_path"
},
"engine" : {
"version" : 1,
"n-threads" : 3,
"backend" : {
"version" : 1,
"type" : "QnnHtp",
"QnnHtp" : {
"version" : 1,
"spill-fill-bufsize" : 320000000,
"use-mmap" : true,
"mmap-budget" : 0,
"poll" : true,
"pos-id-dim" : 64,
"cpu-mask" : "0xe0",
"kv-dim" : 128,
"rope-theta" : 10000
},
"extensions" : "htp_backend_ext_config.json"
},
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"ctx-bins" : [
"file_1_of_4",
"file_2_of_4",
"file_3_of_4",
"file_4_of_4"
]
}
}
}
}
}
QNN GPU backend configuration example¶
The following is an example configuration for the QNN GPU backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"context" : {
"version" : 1,
"size": 512,
"n-vocab": 32000,
"bos-token": 1,
"eos-token": 2
},
"sampler" : {
"version" : 1,
"seed" : 100,
"temp" : 1.2,
"top-k" : 20,
"top-p" : 0.75,
"greedy" : false
},
"tokenizer" : {
"version" : 1,
"path" : "path_to_tokenizer_json"
},
"engine" : {
"version" : 1,
"n-threads" : 1,
"backend" : {
"version" : 1,
"type" : "QnnGpu"
},
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"ctx-bins" : [
"path_to_model_context_binary"
]
}
}
}
}
}
SSD-Q1 configuration example¶
The following is an example configuration for Self Speculative Decoding (SSD) for all backends. These additional parameters must be supplied along with the configuration parameters described above.
{
"dialog" : {
"version" : 1,
"type" : "ssd-q1",
"ssd-q1" : {
"version" : 1,
"ssd-version" : 1,
"forecast-token-count" : 4,
"forecast-prefix" : 16,
"forecast-prefix-name" : "forecast_prefix_name_string",
"branches" : [4, 4]
}
}
}
An example of a SSD-Q1 configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-ssd.json.
SPD configuration example¶
The following is an example configuration for Speculative Decoding (SPD) for all backends. These additional parameters must be supplied along with the configuration parameters described above. The engine parameter in an SPD config is an array of two engines. Each engine config follows the existing definition of engine, with the addition of a “role” parameter to define the role of the model.
{
"dialog" : {
"version" : 1,
"type" : "spd",
"spd" : {
"version" : 1,
"draft-len" : 7
},
"engine" : [
{
"role" : "draft"
},
{
"role" : "target"
}
]
}
}
An example of a SPD configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-draft-htp-target-htp-spd.json.
LADE configuration example¶
The following is an example configuration for LookAhead Decoding (LADE) for all backends. These additional parameters must be supplied along with the configuration parameters described above.
{
"dialog" : {
"version" : 1,
"type" : "lade",
"lade" : {
"version" : 1,
"update-mode" : "ALWAYS_FWD_ONE",
"window" : 8,
"ngram" : 5,
"gcap" : 8
}
}
}
An example of a LADE configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lade.json.
Multistream configuration example¶
The following is an example configuration for Multistream execution for all backends. These additional parameters must be supplied along with the configuration parameters described above.
{
"dialog" : {
"version" : 1,
"type" : "multistream",
"multistream" : {
"version" : 1,
"n-streams" : 8,
"p-threshold" : 0
}
}
}
An example of a Multistream configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-multistream.json.
LoRA V1 configuration example¶
The following is an example configuration for Low-Rank Adaptation (LoRA) for all backends. These additional parameters must be supplied along with the configuration parameters described above.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"engine" : {
"version" : 1,
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"lora": {
"version" : 1,
"alpha-tensor-name": "alpha",
"lora-version" : 1,
"adapters" : [
{
"version" : 1,
"name" : "lora-weights-1"
"path" : "path/to/lora-weights-1-dir/"
},
{
"version" : 1,
"name" : "lora-weights-2"
"path" : "path/to/lora-weights-2-dir/"
}
]
}
}
}
}
}
}
LoRA V1 Weights Directory Format¶
The LoRA weights directory must follow a naming convention: for each tensor whose name contains the “lora” keyword, the directory contains a file named tensor_name.raw. For example, if a tensor's name is “lora_scale”, the file name must be lora_scale.raw.
Note
The tensor name must match the tensor name in the model. If even a single LoRA tensor is missing from the lora-weights directory, the adapter will fail to apply.
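As an illustration, a lora-weights directory for the adapter in the example above might look as follows; the tensor names shown are hypothetical and must be replaced by the actual LoRA tensor names in your model:

```
path/to/lora-weights-1-dir/
├── lora_scale.raw
├── layers_0_attn_lora_weight.raw
└── layers_1_attn_lora_weight.raw
```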
LoRA V2 configuration example for HTP¶
The following is an example configuration for Low-Rank Adaptation (LoRA) for the HTP backend. These additional parameters must be supplied along with the configuration parameters described above.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"engine" : {
"version" : 1,
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"lora": {
"version" : 1,
"alpha-tensor-name": "alpha",
"adapters" : [
{
"version" : 1,
"name" : "lora1",
"bin-sections": [
"lora1_file_1_of_4.bin",
"lora1_file_2_of_4.bin",
"lora1_file_3_of_4.bin",
"lora1_file_4_of_4.bin"
]
},
{
"version" : 1,
"name" : "lora2",
"bin-sections": [
"lora2_file_1_of_4.bin",
"lora2_file_2_of_4.bin",
"lora2_file_3_of_4.bin",
"lora2_file_4_of_4.bin"
]
}
]
}
}
}
}
}
}
An example of a LoRA configuration for HTP can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lora.json.
LoRA configuration example for GenAiTransformer BE¶
The following is an example configuration for the Low-Rank Adaptation (LoRA) for GenAiTransformer backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"engine" : {
"version" : 1,
"model" : {
"version" : 1,
"type" : "library",
"library" : {
"version" : 1,
"lora": {
"version" : 1,
"alpha-tensor-name": "alpha",
"adapters" : [
{
"version" : 1,
"name" : "lora1",
"bin-sections": [
"lora1_adapter.bin"
]
},
{
"version" : 1,
"name" : "lora2",
"bin-sections": [
"lora2_adapter.bin"
]
}
]
}
}
}
}
}
}
An example of a LoRA configuration for GenAiTransformer BE can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-genaitransformer-lora.json.
Eaglet configuration example¶
The following is an example configuration for Eaglet decoding for all backends. These additional parameters must be supplied along with the configuration parameters described above. The engine parameter in an Eaglet config is an array of two engines. Each engine config follows the existing definition of engine, with the addition of a “role” parameter to define the role of the model. The “draft” engine must also include “draft-token-map” and “draft-n-vocab” to handle the trimmed or non-trimmed vocabulary case.
{
"dialog" : {
"version" : 1,
"type" : "eaglet",
"eaglet": {
"version": 1,
"eaglet-version": 1,
"draft-len": 6,
"n-branches": 8,
"max-tokens-target-can-evaluate": 32,
"draft-kv-cache": true
},
"context": {
"version": 1,
"n-vocab": 128256,
"draft-n-vocab": 3200,
},
"engine" : [
{
"role" : "draft",
"model": {
"version": 1,
"draft-token-map" : "vocab_trim_elementary16181.json"
}
},
{
"role" : "target"
}
]
}
}
An example of an Eaglet configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama3-3b/llama3-3b-eaglet-htp.json.