API¶
This section provides information about the Qualcomm® Genie C API.
Revision history¶
| QAIRT SDK Version | Description |
|---|---|
| 2.29.0 | |
| 2.28.0 | |
| 2.27.0 | |
| 2.26.0 | |
| 2.25.0 | |
| 2.23.0 | |
Thread safety¶
Please note that the Genie C API currently does not offer thread safety guarantees. Thread safety features will be provided in a future release.
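Until then, callers sharing a dialog across threads must serialize access themselves. A minimal caller-side sketch in Python (the `query` binding on the wrapped handle is hypothetical, standing in for any Genie call; the real API is C):

```python
import threading

class SerializedDialog:
    """Wraps a non-thread-safe dialog handle so all calls are serialized."""

    def __init__(self, dialog):
        self._dialog = dialog
        self._lock = threading.Lock()  # one lock guards every API call

    def query(self, prompt):
        # Only one thread at a time may touch the underlying handle.
        with self._lock:
            return self._dialog.query(prompt)
```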
Genie Dialog JSON configuration string¶
The following sections contain information pertaining to the format of the JSON configuration string that is supplied to GenieDialogConfig_createFromJson. This JSON configuration can also be supplied to the genie-t2t-run tool.
Note
Please refer to the example configs contained in the SDK at ${SDK_ROOT}/examples/Genie/configs/.
General configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that dependencies are not specified in the schema, but are discussed in the following per-backend sections.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum":["basic"]},
"context" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"size": {"type": "integer"},
"n-vocab": {"type": "integer"},
"bos-token": {"type": "integer"},
"eos-token": {"type": "integer"}
}
},
"sampler" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"seed" : {"type": "integer"},
"temp" : {"type": "float"},
"top-k" : {"type": "integer"},
"top-p" : {"type": "float"},
"greedy" : {"type": "boolean"}
}
},
"tokenizer" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"path" : {"type": "string"}
}
},
"engine" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-threads" : {"type": "integer"},
"backend" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum" : ["QnnHtp", "QnnGenAiTransformer"]},
"QnnHtp" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"spill-fill-bufsize" : {"type": "integer"},
"use-mmap" : {"type": "boolean"},
"mmap-budget" : {"type": "integer"},
"poll" : {"type": "boolean"},
"pos-id-dim" : {"type": "integer"},
"cpu-mask" : {"type": "string"},
"kv-dim" : {"type": "integer"},
"allow-async-init" : {"type": "boolean"},
"rope-theta" : {"type": "double"}
}
},
"QnnGenAiTransformer" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-logits" : {"type": "integer"},
"n-layer" : {"type": "integer"},
"n-embd" : {"type": "integer"},
"n-heads" : {"type": "integer"}
}
},
"extensions" : {"type": "string"}
}
},
"model" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum":["binary", "library"]},
"binary" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"ctx-bins" : {"type": "array", "items": {"type": "string"}}
}
},
"library" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"model-bin" : {"type": "string"}
}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::version | all backends | Version of the dialog object that is supported by the APIs. (1) |
| dialog::type | all backends | Type of dialog supported by the APIs. (basic) |
| dialog::stop-sequence | all backends | Stop the query when one of a set of sequences is detected in the response. Passed as an array of strings. |
| dialog::max-num-tokens | all backends | Stop the query when the maximum number of tokens has been generated in the response. |
| context::version | all backends | Version of the context object that is supported by the APIs. (1) |
| context::size | all backends | Context length. Maximum number of tokens to store. |
| context::n-vocab | all backends | Model vocabulary size. |
| context::bos-token | all backends | Beginning-of-sentence token. |
| context::eos-token | all backends | End-of-sentence token. Passed as an integer or an array of integers. |
| context::eot-token | all backends | End-of-turn token. |
| sampler::version | all backends | Version of the sampler object that is supported by the APIs. (1) |
| sampler::seed | all backends | Seed for sampling random number generation. |
| sampler::temp | all backends | Sampling temperature. |
| sampler::top-k | all backends | Top-k number of samples. |
| sampler::top-p | all backends | Top-p sampling threshold. |
| sampler::greedy | all backends | Selects random or greedy sampling. A value of true specifies greedy sampling. |
| tokenizer::version | all backends | Version of the tokenizer object that is supported by the APIs. (1) |
| tokenizer::path | all backends | Path to the tokenizer file. |
| engine::version | all backends | Version of the engine object that is supported by the APIs. (1) |
| engine::n-threads | all backends | Number of threads to use for KV-cache updates. |
| backend::version | all backends | Version of the backend object that is supported by the APIs. (1) |
| backend::type | all backends | Type of engine: “QnnHtp” for QNN HTP, “QnnGenAiTransformer” for the QNN GenAITransformer backend, and “QnnGpu” for QNN GPU. |
| backend::extensions | QNN HTP | Path to the backend extensions configuration file. |
| QnnHtp::version | QNN HTP | Version of the QnnHtp object that is supported by the APIs. (1) |
| QnnHtp::spill-fill-bufsize | QNN HTP | Buffer size to pre-allocate for the QNN HTP spill-fill. This field depends on the HTP VTCM memory size. It should be set greater than the spill-fill required by each context binary in the model. Consult the QNN HTP backend documentation in the QAIRT SDK for more details. |
| QnnHtp::use-mmap | QNN HTP | Memory-map the context binary files. Typically should be turned on. |
| QnnHtp::mmap-budget | QNN HTP | Memory-map the context binary files in chunks of the given size. Typically should be 25 MB. |
| QnnHtp::poll | QNN HTP | Specify whether to busy-wait on threads. |
| QnnHtp::pos-id-dim | QNN HTP | Dimension of positional embeddings, usually (kv-dim) / 2. |
| QnnHtp::cpu-mask | QNN HTP | CPU affinity mask. |
| QnnHtp::kv-dim | QNN HTP | Dimension of the KV-cache embedding. |
| QnnHtp::allow-async-init | QNN HTP | Allow context binaries to be initialized asynchronously if the backend supports it. |
| QnnHtp::rope-theta | QNN HTP | Used to calculate rotary positional encodings. |
| QnnHtp::enable-graph-switching | QNN HTP | Enables graph switching for graphs within each context binary. |
| QnnGenAiTransformer::version | QNN GenAiTransformer | Version of the QnnGenAiTransformer object that is supported by the APIs. (1) |
| QnnGenAiTransformer::n-logits | QNN GenAiTransformer | Number of logit vectors the result will contain for sampling. |
| QnnGenAiTransformer::n-layer | QNN GenAiTransformer | Number of decoder layers in the model. |
| QnnGenAiTransformer::n-embd | QNN GenAiTransformer | Size of the embedding vector for each token. |
| QnnGenAiTransformer::n-heads | QNN GenAiTransformer | Number of attention heads in the model. |
| model::version | all backends | Version of the model object that is supported by the APIs. (1) |
| model::type | all backends | Type of model object: “binary” for QNN HTP and “library” for QNN GenAiTransformer. |
| model::positional-encoding | all backends | Captures positional encoding parameters for a model. |
| positional-encoding::type | all backends | Type of positional encoding. Supported types are rope, alibi, and absolute. |
| positional-encoding::rope-dim | all backends | Dimension of RoPE positional embeddings, usually (kv-dim) / 2. |
| positional-encoding::rope-theta | all backends | Used to calculate rotary positional encodings for type rope. |
| binary::version | QNN HTP | Version of the binary object that is supported by the APIs. (1) |
| binary::ctx-bins | QNN HTP | List of serialized model files. |
| library::version | QNN GenAiTransformer | Version of the library object that is supported by the APIs. (1) |
| library::model-bin | QNN GenAiTransformer | Path to the model.bin file. |
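To reduce the chance of typos in hand-written configs, the nested structure above can be assembled programmatically and serialized. A minimal sketch for a QnnHtp “basic” dialog (the paths, token IDs, and sampler values below are placeholders, not recommendations):

```python
import json

def make_htp_config(ctx_bins, tokenizer_path, ctx_size=1024):
    """Build a minimal 'basic' dialog config dict for the QnnHtp backend."""
    return {
        "dialog": {
            "version": 1,
            "type": "basic",
            "context": {"version": 1, "size": ctx_size,
                        "n-vocab": 32000, "bos-token": 1, "eos-token": 2},
            "sampler": {"version": 1, "seed": 42, "temp": 0.8,
                        "top-k": 40, "top-p": 0.95, "greedy": False},
            "tokenizer": {"version": 1, "path": tokenizer_path},
            "engine": {
                "version": 1,
                "backend": {"version": 1, "type": "QnnHtp",
                            "QnnHtp": {"version": 1, "use-mmap": True}},
                "model": {"version": 1, "type": "binary",
                          "binary": {"version": 1, "ctx-bins": list(ctx_bins)}},
            },
        }
    }

# Serialize to the JSON string form expected by the config loader.
config_json = json.dumps(make_htp_config(["model_1_of_1.bin"], "tokenizer.json"))
```

The resulting string is what would be handed to GenieDialogConfig_createFromJson, or written to a file for genie-t2t-run.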
SSD-Q1 configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that all other parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["ssd-q1"]},
"ssd-q1": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"ssd-version": {"type": "integer"},
"forecast-prefix": {"type": "integer"},
"forecast-token-count" : {"type": "integer"},
"forecast-prefix" : {"type": "integer"},
"forecast-prefix-name" : {"type": "string"},
"branches" : "branches": {"type": "array"},
"n-streams": {"type": "integer"},
"p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (ssd-q1) |
| ssd-q1::version | all backends | Version of the ssd-q1 object that is supported by the APIs. (1) |
| ssd-q1::ssd-version | all backends | Version of SSD to be used; the supported version is 1. |
| ssd-q1::forecast-prefix | all backends | Length of the forecast prefix. |
| ssd-q1::forecast-token-count | all backends | Maximum number of tokens that can be forecast. |
| ssd-q1::branches | all backends | Parallel decoding branches that will be created. |
| ssd-q1::forecast-prefix-name | all backends | Path of the forecast-prefix binary. |
| ssd-q1::n-streams | all backends | Number of parallel output streams for the same query. |
| ssd-q1::p-threshold | all backends | Probability threshold for sampling n tokens for n-streams. |
LADE configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that all other parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["lade"]},
"lade": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"update-mode": {"type": "string", "enum": ["ALWAYS_FWD_ONE", "FWD_MAX_HIT", "FWD_LEVEL"]},
"window": {"type": "integer"},
"ngram": {"type": "integer"},
"gcap": {"type": "integer"}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (lade) |
| lade::version | all backends | Version of the lade object that is supported by the APIs. (1) |
| lade::window | all backends | Window size used for LADE. |
| lade::ngram | all backends | Number of n-grams used for LADE. |
| lade::gcap | all backends | Gcap value used for LADE. |
| lade::update-mode | all backends | Update mode used for lookahead branch updates. (ALWAYS_FWD_ONE/FWD_MAX_HIT/FWD_LEVEL) |
LoRA configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that all other parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"engine" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"model" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum":["binary"]},
"binary" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"lora": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"alpha-tensor-name": {"type": "string"},
"adapters" : {
"type": "array",
"items" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"name" : {"type": "string"},
"bin-sections": {"type": "array", "items": {"type": "string"}}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| lora::version | all backends | Version of the lora object that is supported by the APIs. (1) |
| lora::alpha-tensor-name | all backends | Name of the alpha tensor. |
| adapters::version | all backends | Version of the adapters object that is supported by the APIs. (1) |
| adapters::name | all backends | Name of the LoRA adapter. |
| adapters::bin-sections | all backends | List of serialized LoRA weight bins. |
LoRA-v1 configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that all other parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"engine" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"model" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum":["binary"]},
"binary" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"lora": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"alpha-tensor-name": {"type": "string"},
"lora-version": {"type": "integer"},
"adapters" : {
"type": "array",
"items" : {
"type": "object",
"properties" : {
"version" : {"type": "integer"},
"name" : {"type": "string"},
"path" : {"type": "string"}
}
}
}
}
}
}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| lora::version | all backends | Version of the lora object that is supported by the APIs. (1) |
| lora::alpha-tensor-name | all backends | Name of the alpha tensor. |
| lora::lora-version | all backends | Configured LoRA version. |
| adapters::version | all backends | Version of the adapters object that is supported by the APIs. (1) |
| adapters::name | all backends | Name of the LoRA weights adapter. |
| adapters::path | all backends | Path to the lora-weights directory. |
SPD configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that all other parameters not mentioned here, except “engine”, follow the general configuration schema. For SPD, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["spd"]},
"spd": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"draft-len" : {"type": "integer"}
}
},
"engine" : {
"type" : "array",
"properties" : {
{
"version" : {"type": "integer"},
"role" : {"type": "string" "enum" : ["draft", "target"]}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (spd) |
| spd::version | all backends | Version of the spd object that is supported by the APIs. (1) |
| spd::draft-len | all backends | Speculative decoding draft length. |
| spd::engine::0::role | all backends | Engine role. Distinguishes between multiple engines. |
Multistream configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that all other parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum" : ["multistream"]},
"multistream": {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-streams": {"type": "integer"},
"p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. (multistream) |
| multistream::version | all backends | Version of the multistream object that is supported by the APIs. (1) |
| multistream::n-streams | all backends | Number of parallel output streams for the same query. |
| multistream::p-threshold | all backends | Probability threshold for sampling n tokens for n-streams. |
Embedding-to-Text configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that all other parameters not mentioned here follow the general configuration schema.
{
"dialog" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string", "enum":["basic", "ssd-q1", "multistream"]},
"embedding" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"size": {"type": "integer"},
"datatype": {"type": "string", "enum":["float32", "native"]}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| dialog::type | all backends | Dialog type to be used. |
| embedding::version | all backends | Version of the embedding object that is supported by the APIs. (1) |
| embedding::size | all backends | Embedding vector dimensionality. |
| embedding::datatype | all backends | Expected datatype for embedding vectors provided to GenieDialog_embeddingQuery() and the token-to-embedding conversion callback. |
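The datatype determines how much memory each embedding vector occupies: with float32, every element is 4 bytes, so a vector needs size × 4 bytes. A small sanity-check helper (illustrative only; the byte width of the native datatype is model-dependent and is deliberately not assumed here):

```python
def embedding_buffer_bytes(size, datatype):
    """Bytes needed for one embedding vector of the given config datatype."""
    widths = {"float32": 4}  # "native" width is model-dependent, so not listed
    if datatype not in widths:
        raise ValueError(f"unknown or model-dependent datatype: {datatype}")
    return size * widths[datatype]
```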
QNN GenAITransformer backend configuration example¶
The following is an example configuration for the QNN GenAITransformer backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"context" : {
"version" : 1,
"size": 512,
"n-vocab": 32000,
"bos-token": 1,
"eos-token": 2
},
"sampler" : {
"version" : 1,
"seed" : 100,
"temp" : 1.2,
"top-k" : 20,
"top-p" : 0.75,
"greedy" : false
},
"tokenizer" : {
"version" : 1,
"path" : "path_to_tokenizer_json"
},
"engine" : {
"version" : 1,
"n-threads" : 10,
"backend" : {
"version" : 1,
"type" : "QnnGenAiTransformer",
"QnnGenAiTransformer" : {
"version" : 1
}
},
"model" : {
"version" : 1,
"type" : "library",
"library" : {
"version" : 1,
"model-bin" : "path_to_model_binary_file"
}
}
}
}
}
QNN HTP backend configuration example¶
The following is an example configuration for the QNN HTP backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"context" : {
"version" : 1,
"size": 1024,
"n-vocab": 32000,
"bos-token": 1,
"eos-token": 2
},
"sampler" : {
"version" : 1,
"seed" : 42,
"temp" : 0.8,
"top-k" : 40,
"top-p" : 0.95,
"greedy" : true
},
"tokenizer" : {
"version" : 1,
"path" : "test_path"
},
"engine" : {
"version" : 1,
"n-threads" : 3,
"backend" : {
"version" : 1,
"type" : "QnnHtp",
"QnnHtp" : {
"version" : 1,
"spill-fill-bufsize" : 320000000,
"use-mmap" : true,
"mmap-budget" : 0,
"poll" : true,
"pos-id-dim" : 64,
"cpu-mask" : "0xe0",
"kv-dim" : 128,
"rope-theta" : 10000
},
"extensions" : "htp_backend_ext_config.json"
},
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"ctx-bins" : [
"file_1_of_4",
"file_2_of_4",
"file_3_of_4",
"file_4_of_4"
]
}
}
}
}
}
QNN GPU backend configuration example¶
The following is an example configuration for the QNN GPU backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"context" : {
"version" : 1,
"size": 512,
"n-vocab": 32000,
"bos-token": 1,
"eos-token": 2
},
"sampler" : {
"version" : 1,
"seed" : 100,
"temp" : 1.2,
"top-k" : 20,
"top-p" : 0.75,
"greedy" : false
},
"tokenizer" : {
"version" : 1,
"path" : "path_to_tokenizer_json"
},
"engine" : {
"version" : 1,
"n-threads" : 1,
"backend" : {
"version" : 1,
"type" : "QnnGpu"
},
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"ctx-bins" : [
"path_to_model_context_binary"
]
}
}
}
}
}
SSD-Q1 configuration example¶
The following is an example configuration for Self Speculative Decoding (SSD), applicable to all backends. The additional parameters below must be supplied along with the configuration parameters mentioned above.
{
"dialog" : {
"version" : 1,
"type" : "ssd-q1",
"ssd-q1" : {
"version" : 1,
"ssd-version" : 1,
"forecast-token-count" : 4,
"forecast-prefix" : 16,
"forecast-prefix-name" : "forecast_prefix_name_string",
"branches" : [4, 4]
}
}
}
An example of a SSD-Q1 configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-ssd.json.
LADE configuration example¶
The following is an example configuration for LookAhead Decoding (LADE), applicable to all backends. The additional parameters below must be supplied along with the configuration parameters mentioned above.
{
"dialog" : {
"version" : 1,
"type" : "lade",
"lade" : {
"version" : 1,
"update-mode" : "ALWAYS_FWD_ONE",
"window" : 8,
"ngram" : 5,
"gcap" : 8
}
}
}
An example of a LADE configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lade.json.
LoRA configuration example for HTP¶
The following is an example configuration for Low-Rank Adaptation (LoRA) for the HTP backend. The additional parameters below must be supplied along with the configuration parameters mentioned above.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"engine" : {
"version" : 1,
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"lora": {
"version" : 1,
"alpha-tensor-name": "alpha",
"adapters" : [
{
"version" : 1,
"name" : "lora1",
"bin-sections": [
"lora1_file_1_of_4.bin",
"lora1_file_2_of_4.bin",
"lora1_file_3_of_4.bin",
"lora1_file_4_of_4.bin"
]
},
{
"version" : 1,
"name" : "lora2",
"bin-sections": [
"lora2_file_1_of_4.bin",
"lora2_file_2_of_4.bin",
"lora2_file_3_of_4.bin",
"lora2_file_4_of_4.bin"
]
}
]
}
}
}
}
}
}
An example of a LoRA configuration for HTP can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lora.json.
LoRA configuration example for GenAiTransformer BE¶
The following is an example configuration for Low-Rank Adaptation (LoRA) for the GenAiTransformer backend.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"engine" : {
"version" : 1,
"model" : {
"version" : 1,
"type" : "library",
"library" : {
"version" : 1,
"lora": {
"version" : 1,
"alpha-tensor-name": "alpha",
"adapters" : [
{
"version" : 1,
"name" : "lora1",
"bin-sections": [
"lora1_adapter.bin"
]
},
{
"version" : 1,
"name" : "lora2",
"bin-sections": [
"lora2_adapter.bin"
]
}
]
}
}
}
}
}
}
An example of a LoRA configuration for GenAiTransformer BE can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-genaitransformer-lora.json.
LoRA-v1 configuration example¶
The following is an example configuration for Low-Rank Adaptation (LoRA), applicable to all backends. The additional parameters below must be supplied along with the configuration parameters mentioned above.
{
"dialog" : {
"version" : 1,
"type" : "basic",
"engine" : {
"version" : 1,
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"lora": {
"version" : 1,
"alpha-tensor-name": "alpha",
"lora-version" : 1,
"adapters" : [
{
"version" : 1,
"name" : "lora-weights-1"
"lora-weights-dir" : "path/to/lora-weights-1-dir/"
},
{
"version" : 1,
"name" : "lora-weights-2"
"lora-weights-dir" : "path/to/lora-weights-2-dir/"
}
]
}
}
}
}
}
}
LoRA V1 Weights Directory Format¶
The LoRA weights directory must follow a naming convention: each file is named tensor_name.raw, for every tensor that has the “lora” keyword in its name. For example, if a tensor’s name is “lora_scale”, the file name must be lora_scale.raw.
Note
The tensor name should match the tensor name in the model. If even a single LoRA tensor is not found in the lora-weights directory, the adapter will fail to apply.
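The naming rule above can be sketched as a small helper that derives the expected file name for each LoRA tensor (a sketch of the stated convention, not SDK code):

```python
def expected_lora_files(tensor_names):
    """Map each tensor whose name contains 'lora' to its expected .raw file."""
    return {name: f"{name}.raw" for name in tensor_names if "lora" in name}

# Tensors without "lora" in the name need no file in the weights directory.
files = expected_lora_files(["lora_scale", "attn_bias"])
```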
SPD configuration example¶
The following is an example configuration for Speculative Decoding (SPD), applicable to all backends. The additional parameters below must be supplied along with the configuration parameters mentioned above. The engine parameter in the SPD config is an array of two engines. Each engine config follows the existing definition of engine, with the addition of a “role” parameter that defines the role of the model.
{
"dialog" : {
"version" : 1,
"type" : "spd",
"spd" : {
"version" : 1,
"draft-len" : 7
},
"engine" : [
{
"role" : "draft"
},
{
"role" : "target"
}
]
}
}
An example of a SPD configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-draft-htp-target-htp-spd.json.
Multistream configuration example¶
The following is an example configuration for Multistream execution, applicable to all backends. The additional parameters below must be supplied along with the configuration parameters mentioned above.
{
"dialog" : {
"version" : 1,
"type" : "multistream",
"multistream" : {
"version" : 1,
"n-streams" : 8,
"p-threshold" : 0
}
}
}
An example of a Multistream configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-multistream.json.
Genie Embedding JSON configuration string¶
The following sections contain information pertaining to the format of the JSON configuration string that is supplied to GenieEmbeddingConfig_createFromJson. This JSON configuration can also be supplied to the genie-t2e-run tool.
Note
Please refer to the example configs contained in the SDK at ${SDK_ROOT}/examples/Genie/configs/.
General configuration schema¶
The following provides the schema of the JSON configuration format that is provided to GenieEmbeddingConfig_createFromJson. Note that dependencies are not specified in the schema, but are discussed in the following per-backend sections.
{
"embedding" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"context" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"ctx-size": {"type": "integer"},
"n-vocab": {"type": "integer"},
"embed-size": {"type": "integer"},
"pad-token": {"type": "integer"}
}
},
"prompt" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"prompt-template" : {"type": "array", "items": {"type": "string"}}
}
},
"tokenizer" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"path" : {"type": "string"}
}
},
"truncate-input" : {"type" : "boolean"},
"engine" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-threads" : {"type": "integer"},
"backend" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum" : ["QnnHtp", "QnnGenAiTransformer"]},
"QnnHtp" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"spill-fill-bufsize" : {"type": "integer"},
"use-mmap" : {"type": "boolean"},
"allow-async-init" : {"type": "boolean"},
"pooled-output" : {"type": "boolean"},
"disable-kv-cache" : {"type": "boolean"}
}
},
"QnnGenAiTransformer" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"n-layer" : {"type": "integer"},
"n-embd" : {"type": "integer"},
"n-heads" : {"type": "integer"}
}
},
"extensions" : {"type": "string"}
}
},
"model" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"type" : {"type": "string","enum":["binary", "library"]},
"binary" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"ctx-bins" : {"type": "array", "items": {"type": "string"}}
}
},
"library" : {
"type": "object",
"properties": {
"version" : {"type": "integer"},
"model-bin" : {"type": "string"}
}
}
}
}
}
}
}
}
}
| Option | Applicability | Description |
|---|---|---|
| embedding::version | all backends | Version of the embedding object that is supported by the APIs. (1) |
| embedding::truncate-input | all backends | Allows truncation of the input when it exceeds the context length. |
| context::version | all backends | Version of the context object that is supported by the APIs. (1) |
| context::ctx-size | all backends | Context length. Maximum number of tokens to process. |
| context::n-vocab | all backends | Model vocabulary size. |
| context::embed-size | all backends | Embedding length. Embedding vector length for each token. |
| context::pad-token | all backends | Token ID of the pad token. |
| prompt::version | all backends | Version of the prompt object that is supported by the APIs. (1) |
| prompt::prompt-template | all backends | Prefix and suffix strings that will be added to each prompt. |
| tokenizer::version | all backends | Version of the tokenizer object that is supported by the APIs. (1) |
| tokenizer::path | all backends | Path to the tokenizer file. |
| engine::version | all backends | Version of the engine object that is supported by the APIs. (1) |
| engine::n-threads | all backends | Number of threads to use for KV-cache updates. |
| backend::version | all backends | Version of the backend object that is supported by the APIs. (1) |
| backend::type | all backends | Type of engine: “QnnHtp” for QNN HTP and “QnnGenAiTransformer” for the QNN GenAITransformer backend. |
| backend::extensions | QNN HTP | Path to the backend extensions configuration file. |
| QnnHtp::version | QNN HTP | Version of the QnnHtp object that is supported by the APIs. (1) |
| QnnHtp::spill-fill-bufsize | QNN HTP | Buffer size to pre-allocate for the QNN HTP spill-fill. This field depends on the HTP VTCM memory size. It should be set greater than the spill-fill required by each context binary in the model. Consult the QNN HTP backend documentation in the QAIRT SDK for more details. |
| QnnHtp::use-mmap | QNN HTP | Memory-map the context binary files. Typically should be turned on. |
| QnnHtp::allow-async-init | QNN HTP | Allow context binaries to be initialized asynchronously if the backend supports it. |
| QnnHtp::pooled-output | QNN HTP | Selects between a pooled embedding and per-token embeddings as the generation result. |
| QnnHtp::disable-kv-cache | QNN HTP | Disables the KV-cache manager, for models that have no KV cache. |
| QnnGenAiTransformer::version | QNN GenAiTransformer | Version of the QnnGenAiTransformer object that is supported by the APIs. (1) |
| QnnGenAiTransformer::n-layer | QNN GenAiTransformer | Number of decoder layers in the model. |
| QnnGenAiTransformer::n-embd | QNN GenAiTransformer | Size of the embedding vector for each token. |
| QnnGenAiTransformer::n-heads | QNN GenAiTransformer | Number of attention heads in the model. |
| model::version | all backends | Version of the model object that is supported by the APIs. (1) |
| model::type | all backends | Type of model object: “binary” for QNN HTP and “library” for QNN GenAiTransformer. |
| binary::version | QNN HTP | Version of the binary object that is supported by the APIs. (1) |
| binary::ctx-bins | QNN HTP | List of serialized model files. |
| library::version | QNN GenAiTransformer | Version of the library object that is supported by the APIs. (1) |
| library::model-bin | QNN GenAiTransformer | Path to the model.bin file. |
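As described for prompt::prompt-template, the two configured strings act as a prefix and a suffix wrapped around each prompt (e.g. the ["[CLS]", "[SEP]"] pair in the examples below). A sketch of that interpretation (an assumption based on the description above, not SDK code):

```python
def apply_prompt_template(template, prompt):
    """Wrap a prompt in the configured [prefix, suffix] pair."""
    prefix, suffix = template  # prompt-template is a two-element array
    return f"{prefix}{prompt}{suffix}"
```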
QNN GenAITransformer backend configuration example¶
The following is an example configuration for the QNN GenAITransformer backend.
{
"embedding" : {
"version" : 1,
"context": {
"version": 1,
"n-vocab": 30522,
"ctx-size": 512,
"embed-size" : 1024,
"pad-token" : 0
},
"prompt": {
"version" : 1,
"prompt-template": ["[CLS]","[SEP]"]
},
"tokenizer" : {
"version" : 1,
"path" : "test_path"
},
"truncate-input" : true,
"engine": {
"version": 1,
"n-threads" : 10,
"backend" : {
"version" : 1,
"type" : "QnnGenAiTransformer",
"QnnGenAiTransformer" : {
"version" : 1,
"n-layer": 24,
"n-embd": 1024,
"n-heads": 16
}
},
"model" : {
"version" : 1,
"type" : "library",
"library" : {
"version" : 1,
"model-bin" : "path_to_model_binary_file"
}
}
}
}
}
QNN HTP backend configuration example¶
The following is an example configuration for the QNN HTP backend.
{
"embedding" : {
"version" : 1,
"context": {
"version": 1,
"n-vocab": 30522,
"ctx-size": 512,
"embed-size" : 1024,
"pad-token" : 0
},
"prompt": {
"version" : 1,
"prompt-template": ["[CLS]","[SEP]"]
},
"tokenizer" : {
"version" : 1,
"path" : "test_path"
},
"truncate-input" : true,
"engine" : {
"version" : 1,
"backend" : {
"version" : 1,
"type" : "QnnHtp",
"QnnHtp" : {
"version" : 1,
"spill-fill-bufsize" : 0,
"use-mmap" : true,
"pooled-output" : true,
"allow-async-init": false,
"disable-kv-cache": true
},
"extensions" : "htp_backend_ext_config.json"
},
"model" : {
"version" : 1,
"type" : "binary",
"binary" : {
"version" : 1,
"ctx-bins" : [
"file_1_of_1.bin"
]
}
}
}
}
}