API

This section provides information about the Qualcomm® Genie C API.

Revision history

QAIRT SDK Version

Description

2.29.0

  • SDK:

    • Added llama-2-7b JSON config example for GPU.

  • Dialog JSON configuration:

    • Added QNN GPU engine type.

2.28.0

  • SDK:

    • Added genie-t2e-run application and sample config for GenieEmbedding.h.

    • Added llama-3-8b JSON config example for HTP.

  • Genie C API 1.2.0:

    • Added GenieEmbedding.h.

    • Added GenieDialog_applyLora and GenieDialog_setLoraStrength.

    • Added GenieDialog_tokenQuery. (Supported for basic dialog type only).

    • Added GenieDialog_save and GenieDialog_restore.

  • Dialog JSON configuration:

    • Added self-speculative decoding (SSD) dialog type.

    • Added speculative decoding (SPD) dialog type.

    • Added lookahead decoding (LADE) dialog type.

    • Added multistream dialog type.

    • Added rope-scaling.

    • Added support for multiple EOS tokens.

    • Added alibi and absolute positional encoding support.

    • Added SSD support for GenieDialog_embeddingQuery.

  • Bugfixes:

    • Fixed issue where mmap-budget was unused.

    • Fixed link issue with libGenie.so on aarch64-oe-linux-gcc11.2.

    • Fixed memory leak in model loading when setting use-mmap to false.

2.27.0

  • SDK:

    • Added genie-t2t-run source code example.

    • Added llama-3-8b JSON config example for HTP.

  • Genie C API 1.1.0:

    • Added GenieDialog_embeddingQuery API with corresponding tool and configuration support.

  • Dialog JSON configuration:

    • Added kv-share dialog type which provides support for KV cache transfer between HTP and GenAiTransformer backends.

    • Added max-num-tokens dialog configuration option.

  • Bugfixes:

    • Fixed issue where IO encodings were not updated when a LoRA adapter was applied.

    • Workaround issue where segmentation fault occurs after GenieDialog_free when using the HTP backend.

2.26.0

  • Dialog JSON configuration:

    • Added eot-token configuration option.

    • Added rope-theta configuration option.

    • Added support for async initialization and added allow-async-init config option.

    • Added stop-sequence configuration option to enable dialog query cancellation based upon response text matching.

  • Bugfixes:

    • Fixed issue where incorrect Windows .lib files were packaged.

    • Fixed issue where an unknown genie-t2t-run option did not generate an error.

    • Fixed memory allocation failures during HTP initialization.

2.25.0

  • Genie C API 1.0.0:

    • Genie C API moves into production.

  • Dialog JSON configuration:

    • Introduced GenieDialog JSON configuration format.

2.23.0

  • Genie C API 0.1.0:

    • Added GenieCommon.h and GenieDialog.h.

Thread safety

Please note that the Genie C API currently does not offer thread safety guarantees. Thread safety features will be provided in a future release.
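Until then, applications must serialize access themselves, for example by confining all Genie calls to a single worker thread. The following is a minimal sketch of that pattern, shown in Python for brevity; `run_query` is a placeholder for whatever wrapper invokes the non-thread-safe API, not an SDK function.

```python
import queue
import threading

def make_serialized_caller(run_query):
    """Confine a non-thread-safe callable to one worker thread.
    Callers on any thread enqueue work and block until the result is ready."""
    jobs = queue.Queue()

    def worker():
        # The only thread that ever touches the non-thread-safe API.
        while True:
            prompt, done = jobs.get()
            done["result"] = run_query(prompt)
            done["event"].set()

    threading.Thread(target=worker, daemon=True).start()

    def call(prompt):
        done = {"event": threading.Event()}
        jobs.put((prompt, done))
        done["event"].wait()
        return done["result"]

    return call
```

The same single-owner-thread design applies directly in C or C++ with a mutex-protected work queue.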

Genie Dialog JSON configuration string

The following sections describe the format of the JSON configuration string that is supplied to GenieDialogConfig_createFromJson. This JSON configuration can also be supplied to the genie-t2t-run tool.

Note

Please refer to the example configs contained in the SDK at ${SDK_ROOT}/examples/Genie/configs/.

General configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that dependencies are not specified in the schema, but are discussed in the following per-backend sections.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum":["basic"]},
      "context" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "size": {"type": "integer"},
          "n-vocab": {"type": "integer"},
          "bos-token": {"type": "integer"},
          "eos-token": {"type": "integer"}
        }
      },
      "sampler" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "seed" : {"type": "integer"},
          "temp" : {"type": "float"},
          "top-k" : {"type": "integer"},
          "top-p" : {"type": "float"},
          "greedy" : {"type": "boolean"}
        }
      },
      "tokenizer" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "path" : {"type": "string"}
        }
      },
      "engine" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "n-threads" : {"type": "integer"},
          "backend" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum" : ["QnnHtp", "QnnGenAiTransformer", "QnnGpu"]},
              "QnnHtp" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "spill-fill-bufsize" : {"type": "integer"},
                  "use-mmap" : {"type": "boolean"},
                  "mmap-budget" : {"type": "integer"},
                  "poll" : {"type": "boolean"},
                  "pos-id-dim" : {"type": "integer"},
                  "cpu-mask" : {"type": "string"},
                  "kv-dim" : {"type": "integer"},
                  "allow-async-init" : {"type": "boolean"},
                  "rope-theta" : {"type": "double"}
                }
              },
              "QnnGenAiTransformer" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "n-logits" : {"type": "integer"},
                  "n-layer" : {"type": "integer"},
                  "n-embd" : {"type": "integer"},
                  "n-heads" : {"type": "integer"}
                }
              },
              "extensions" : {"type": "string"}
            }
          },
          "model" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum":["binary", "library"]},
              "binary" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "ctx-bins" : {"type": "array", "items": {"type": "string"}}
                }
              },
              "library" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "model-bin" : {"type": "string"}
                }
              }
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::version

all backends

Version of dialog object that is supported by APIs. (1)

dialog::type

all backends

Type of dialog supported by APIs. (basic)

dialog::stop-sequence

all backends

Stop the query when any of a set of sequences is detected in the response. The argument is passed as an array of strings.

dialog::max-num-tokens

all backends

Stop the query when the maximum number of tokens has been generated in the response.

context::version

all backends

Version of context object that is supported by APIs. (1)

context::size

all backends

Context length. Maximum number of tokens to store.

context::n-vocab

all backends

Model vocabulary size.

context::bos-token

all backends

Beginning of sentence token.

context::eos-token

all backends

End of sentence token. The argument is passed as an integer or an array of integers.

context::eot-token

all backends

End of turn token.

sampler::version

all backends

Version of sampler object that is supported by APIs. (1)

sampler::seed

all backends

Sampling random number generation seed.

sampler::temp

all backends

Sampling temperature.

sampler::top-k

all backends

Top-k number of samples.

sampler::top-p

all backends

Top-p sampling threshold.

sampler::greedy

all backends

Selects random or greedy sampling. A value of true specifies greedy sampling.

tokenizer::version

all backends

Version of tokenizer object that is supported by APIs. (1)

tokenizer::path

all backends

Path to tokenizer file.

engine::version

all backends

Version of engine object that is supported by APIs. (1)

engine::n-threads

all backends

Number of threads to use for KV-cache updates.

backend::version

all backends

Version of backend object that is supported by APIs. (1)

backend::type

all backends

Type of backend: “QnnHtp” for QNN HTP, “QnnGenAiTransformer” for the QNN GenAiTransformer backend, and “QnnGpu” for QNN GPU.

backend::extensions

QNN HTP

Path to backend extensions configuration file.

QnnHtp::version

QNN HTP

Version of QnnHtp object that is supported by APIs. (1)

QnnHtp::spill-fill-bufsize

QNN HTP

Buffer size to pre-allocate for the QNN HTP spill fill. This field depends upon the HTP VTCM memory size. It should be set greater than the spill-fill required by each context binary in the model. Consult the QNN HTP backend documentation in the QAIRT SDK for more details.

QnnHtp::use-mmap

QNN HTP

Memory map the context binary files. Typically should be turned on.

QnnHtp::mmap-budget

QNN HTP

Memory map the context binary files in chunks of the given size. Typically should be 25MB.

QnnHtp::poll

QNN HTP

Specify whether to busy-wait on threads.

QnnHtp::pos-id-dim

QNN HTP

Dimension of positional embeddings, usually (kv-dim) / 2.

QnnHtp::cpu-mask

QNN HTP

CPU affinity mask.

QnnHtp::kv-dim

QNN HTP

Dimension of the KV-cache embedding.

QnnHtp::allow-async-init

QNN HTP

Allow context binaries to be initialized asynchronously if the backend supports it.

QnnHtp::rope-theta

QNN HTP

Used to calculate rotary positional encodings.

QnnHtp::enable-graph-switching

QNN HTP

Enables graph switching for graphs within each context binary.

QnnGenAiTransformer::version

QNN GenAiTransformer

Version of QnnGenAiTransformer object that is supported by APIs. (1)

QnnGenAiTransformer::n-logits

QNN GenAiTransformer

Number of logit vectors in the result to be used for sampling.

QnnGenAiTransformer::n-layer

QNN GenAiTransformer

Number of decoder layers in the model.

QnnGenAiTransformer::n-embd

QNN GenAiTransformer

Size of embedding vector for each token.

QnnGenAiTransformer::n-heads

QNN GenAiTransformer

Number of attention heads in the model.

model::version

all backends

Version of model object that is supported by APIs. (1)

model::type

all backends

Type of model object: “binary” for QNN HTP and “library” for QNN GenAiTransformer.

model::positional-encoding

all backends

Captures positional encoding parameters for a model.

positional-encoding::type

all backends

Type of positional encoding. Supported types are rope, alibi, and absolute.

positional-encoding::rope-dim

all backends

Dimension of Rope positional embeddings, usually (kv-dim) / 2.

positional-encoding::rope-theta

all backends

Used to calculate rotary positional encodings for type rope.

binary::version

QNN HTP

Version of binary object that is supported by APIs. (1)

binary::ctx-bins

QNN HTP

List of serialized model files.

library::version

QNN GenAiTransformer

Version of library object that is supported by APIs. (1)

library::model-bin

QNN GenAiTransformer

Path to model.bin file.
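GenieDialogConfig_createFromJson rejects a malformed configuration string at runtime, so it can be useful to sanity-check a config file up front. The following is a minimal illustrative Python sketch; the helper name and the required-key subset are assumptions for demonstration, not part of the SDK or the full schema above.

```python
import json

# Illustrative subset of required keys; consult the schema above for the full set.
REQUIRED_DIALOG_KEYS = {"version", "type", "context", "sampler", "tokenizer", "engine"}

def check_dialog_config(text: str) -> list[str]:
    """Return a list of problems found in a Genie dialog JSON config string."""
    problems = []
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    dialog = cfg.get("dialog")
    if not isinstance(dialog, dict):
        return ['missing top-level "dialog" object']
    for key in sorted(REQUIRED_DIALOG_KEYS - dialog.keys()):
        problems.append(f'dialog is missing "{key}"')
    return problems
```

An empty result means the file at least has the expected top-level shape; the SDK remains the authority on full validity.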

SSD-Q1 configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["ssd-q1"]},
      "ssd-q1": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "ssd-version": {"type": "integer"},
          "forecast-token-count" : {"type": "integer"},
          "forecast-prefix" : {"type": "integer"},
          "forecast-prefix-name" : {"type": "string"},
          "branches" : {"type": "array"},
          "n-streams": {"type": "integer"},
          "p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (ssd-q1)

ssd-q1::version

all backends

Version of ssd-q1 object that is supported by APIs. (1)

ssd-q1::ssd-version

all backends

Version of SSD to be used. The supported version is 1.

ssd-q1::forecast-prefix

all backends

Length of forecast prefix.

ssd-q1::forecast-token-count

all backends

Maximum number of tokens that can be forecast.

ssd-q1::branches

all backends

Parallel decoding branches that will be created.

ssd-q1::forecast-prefix-name

all backends

Path to the forecast-prefix binary.

ssd-q1::n-streams

all backends

Number of parallel output streams for the same query.

ssd-q1::p-threshold

all backends

Probability threshold for sampling n-tokens for n-streams.

LADE configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["lade"]},
      "lade": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "update-mode": {"type": "string", "enum": ["ALWAYS_FWD_ONE", "FWD_MAX_HIT", "FWD_LEVEL"]},
          "window": {"type": "integer"},
          "ngram": {"type": "integer"},
          "gcap": {"type": "integer"}
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (lade)

lade::version

all backends

Version of lade object that is supported by APIs. (1)

lade::window

all backends

Window size that will be used for LADE.

lade::ngram

all backends

Number of n-grams that will be used for LADE.

lade::gcap

all backends

Gcap that will be used for LADE.

lade::update-mode

all backends

Update mode used for lookahead branch updates. (ALWAYS_FWD_ONE/FWD_MAX_HIT/FWD_LEVEL)

LoRA configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "engine" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "model" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum":["binary"]},
              "binary" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "lora": {
                    "type": "object",
                    "properties": {
                      "version" : {"type": "integer"},
                      "alpha-tensor-name": {"type": "string"},
                      "adapters" : {
                        "type": "array",
                        "items" : {
                          "type": "object",
                          "properties": {
                            "version" : {"type": "integer"},
                            "name" : {"type": "string"},
                            "bin-sections": {"type": "array", "items": {"type": "string"}}
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

lora::version

all backends

Version of lora object that is supported by APIs. (1)

lora::alpha-tensor-name

all backends

Name of alpha tensor.

adapters::version

all backends

Version of adapters object that is supported by APIs.(1)

adapters::name

all backends

Name of the LoRA adapter.

adapters::bin-sections

all backends

List of serialized LoRA weight binaries.

LoRA-v1 configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "engine" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "model" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum":["binary"]},
              "binary" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "lora": {
                    "type": "object",
                    "properties": {
                      "version" : {"type": "integer"},
                      "alpha-tensor-name": {"type": "string"},
                      "lora-version": {"type": "integer"},
                      "adapters" : {
                        "type": "array",
                        "items" : {
                          "type": "object",
                          "properties" : {
                            "version" : {"type": "integer"},
                            "name" : {"type": "string"},
                            "path" : {"type": "string"}
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

lora::version

all backends

Version of lora object that is supported by APIs. (1)

lora::alpha-tensor-name

all backends

Name of alpha tensor.

lora::lora-version

all backends

Configured LoRA version.

adapters::version

all backends

Version of adapters object that is supported by APIs. (1)

adapters::name

all backends

Name of the LoRA weights adapter.

adapters::path

all backends

Path to the lora-weights directory.

SPD configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here, other than “engine”, follow the general configuration schema. For SPD, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["spd"]},
      "spd": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "draft-len" : {"type": "integer"}
        }
      },
      "engine" : {
        "type" : "array",
        "items" : {
          "type": "object",
          "properties" : {
            "version" : {"type": "integer"},
            "role" : {"type": "string", "enum" : ["draft", "target"]}
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (spd)

spd::version

all backends

Version of spd object that is supported by APIs. (1)

spd::draft-len

all backends

Speculative decoding draft length.

spd::engine::0::role

all backends

Engine role. Distinguishes between multiple engines.

Multistream configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["multistream"]},
      "multistream": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "n-streams": {"type": "integer"},
          "p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (multistream)

multistream::version

all backends

Version of multistream object that is supported by APIs. (1)

multistream::n-streams

all backends

Number of parallel output streams for the same query.

multistream::p-threshold

all backends

Probability threshold for sampling n-tokens for n-streams.

Embedding-to-Text configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum":["basic", "ssd-q1", "multistream"]},
      "embedding" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "size": {"type": "integer"},
          "datatype": {"type": "string", "enum":["float32", "native"]}
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type that has to be used.

embedding::version

all backends

Version of embedding object that is supported by APIs. (1)

embedding::size

all backends

Embedding vector dimensionality.

embedding::datatype

all backends

Expected datatype for embedding vectors provided to GenieDialog_embeddingQuery() and the token-to-embedding conversion callback.

KV-SHARE configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Parameters not mentioned here, other than “engine”, follow the general configuration schema. For KV-SHARE, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["kv-share"]},
      "engine" : {
        "type" : "array",
        "items" : {
          "type": "object",
          "properties" : {
            "version" : {"type": "integer"},
            "role" : {"type": "string", "enum" : ["primary", "secondary"]}
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (kv-share)

kv-share::engine::0::role

all backends

Engine role. Distinguishes between multiple engines.

QNN GenAITransformer backend configuration example

The following is an example configuration for the QNN GenAITransformer backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "context" : {
      "version" : 1,
      "size": 512,
      "n-vocab": 32000,
      "bos-token": 1,
      "eos-token": 2
    },
    "sampler" : {
      "version" : 1,
      "seed" : 100,
      "temp" : 1.2,
      "top-k" : 20,
      "top-p" : 0.75,
      "greedy" : false
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "path_to_tokenizer_json"
    },
    "engine" : {
      "version" : 1,
      "n-threads" : 10,
      "backend" : {
        "version" : 1,
        "type" : "QnnGenAiTransformer",
        "QnnGenAiTransformer" : {
          "version" : 1
        }
      },
      "model" : {
        "version" : 1,
        "type" : "library",
        "library" : {
          "version" : 1,
          "model-bin" : "path_to_model_binary_file"
        }
      }
    }
  }
}

QNN HTP backend configuration example

The following is an example configuration for the QNN HTP backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "context" : {
      "version" : 1,
      "size": 1024,
      "n-vocab": 32000,
      "bos-token": 1,
      "eos-token": 2
    },
    "sampler" : {
      "version" : 1,
      "seed" : 42,
      "temp" : 0.8,
      "top-k" : 40,
      "top-p" : 0.95,
      "greedy" : true
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "test_path"
    },
    "engine" : {
      "version" : 1,
      "n-threads" : 3,
      "backend" : {
        "version" : 1,
        "type" : "QnnHtp",
        "QnnHtp" : {
          "version" : 1,
          "spill-fill-bufsize" : 320000000,
          "use-mmap" : true,
          "mmap-budget" : 0,
          "poll" : true,
          "pos-id-dim" : 64,
          "cpu-mask" : "0xe0",
          "kv-dim" : 128,
          "rope-theta" : 10000
        },
        "extensions" : "htp_backend_ext_config.json"
      },
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "ctx-bins" : [
            "file_1_of_4",
            "file_2_of_4",
            "file_3_of_4",
            "file_4_of_4"
          ]
        }
      }
    }
  }
}

QNN GPU backend configuration example

The following is an example configuration for the QNN GPU backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "context" : {
      "version" : 1,
      "size": 512,
      "n-vocab": 32000,
      "bos-token": 1,
      "eos-token": 2
    },
    "sampler" : {
      "version" : 1,
      "seed" : 100,
      "temp" : 1.2,
      "top-k" : 20,
      "top-p" : 0.75,
      "greedy" : false
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "path_to_tokenizer_json"
    },
    "engine" : {
      "version" : 1,
      "n-threads" : 1,
      "backend" : {
        "version" : 1,
        "type" : "QnnGpu"
      },
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "ctx-bins" : [
            "path_to_model_context_binary"
          ]
        }
      }
    }
  }
}

SSD-Q1 configuration example

The following is an example configuration for self-speculative decoding (SSD) for all backends. These additional parameters must be supplied along with the general configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "ssd-q1",
    "ssd-q1" : {
      "version" : 1,
      "ssd-version" : 1,
      "forecast-token-count" : 4,
      "forecast-prefix" : 16,
      "forecast-prefix-name" : "forecast_prefix_name_string",
      "branches" : [4, 4]
    }
  }
}

An example of an SSD-Q1 configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-ssd.json.

LADE configuration example

The following is an example configuration for lookahead decoding (LADE) for all backends. These additional parameters must be supplied along with the general configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "lade",
    "lade" : {
      "version" : 1,
      "update-mode" : "ALWAYS_FWD_ONE",
      "window" : 8,
      "ngram" : 5,
      "gcap" : 8
    }
  }
}

An example of a LADE configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lade.json.

LoRA configuration example for HTP

The following is an example configuration for Low-Rank Adaptation (LoRA) for the HTP backend. These additional parameters must be supplied along with the general configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "engine" : {
      "version" : 1,
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "lora": {
            "version" : 1,
            "alpha-tensor-name": "alpha",
            "adapters" : [
              {
                "version" : 1,
                "name" : "lora1",
                "bin-sections": [
                  "lora1_file_1_of_4.bin",
                  "lora1_file_2_of_4.bin",
                  "lora1_file_3_of_4.bin",
                  "lora1_file_4_of_4.bin"
                ]
              },
              {
                "version" : 1,
                "name" : "lora2",
                "bin-sections": [
                  "lora2_file_1_of_4.bin",
                  "lora2_file_2_of_4.bin",
                  "lora2_file_3_of_4.bin",
                  "lora2_file_4_of_4.bin"
                ]
              }
            ]
          }
        }
      }
    }
  }
}

An example of a LoRA configuration for HTP can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lora.json.

LoRA configuration example for GenAiTransformer BE

The following is an example configuration for Low-Rank Adaptation (LoRA) for the GenAiTransformer backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "engine" : {
      "version" : 1,
      "model" : {
        "version" : 1,
        "type" : "library",
        "library" : {
          "version" : 1,
          "lora": {
            "version" : 1,
            "alpha-tensor-name": "alpha",
            "adapters" : [
              {
                "version" : 1,
                "name" : "lora1",
                "bin-sections": [
                  "lora1_adapter.bin"
                ]
              },
              {
                "version" : 1,
                "name" : "lora2",
                "bin-sections": [
                  "lora2_adapter.bin"
                ]
              }
            ]
          }
        }
      }
    }
  }
}

An example of a LoRA configuration for GenAiTransformer BE can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-genaitransformer-lora.json.

LoRA-v1 configuration example

The following is an example configuration for Low-Rank Adaptation (LoRA) for all backends. These additional parameters must be supplied along with the general configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "engine" : {
      "version" : 1,
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "lora": {
            "version" : 1,
            "alpha-tensor-name": "alpha",
            "lora-version" : 1,
            "adapters" : [
              {
                "version" : 1,
                "name" : "lora-weights-1",
                "lora-weights-dir" : "path/to/lora-weights-1-dir/"
              },
              {
                "version" : 1,
                "name" : "lora-weights-2",
                "lora-weights-dir" : "path/to/lora-weights-2-dir/"
              }
            ]
          }
        }
      }
    }
  }
}

LoRA V1 Weights Directory Format

Each file in the LoRA weights directory must be named tensor_name.raw, for every tensor that has the “lora” keyword in its name. For example, if a tensor’s name is “lora_scale”, the file name must be lora_scale.raw.

Note

Each tensor name must match the corresponding tensor name in the model. If even a single LoRA tensor is not found in the lora-weights directory, the adapter will fail to apply.
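A weights directory can be checked against this naming convention before applying an adapter. The following is a small illustrative Python sketch; the helper is hypothetical, and the tensor names would come from your model's metadata.

```python
from pathlib import Path

def missing_lora_files(weights_dir: str, tensor_names: list[str]) -> list[str]:
    """Return the LoRA tensors (names containing "lora") whose
    <tensor_name>.raw file is absent from weights_dir."""
    d = Path(weights_dir)
    return [name for name in tensor_names
            if "lora" in name and not (d / f"{name}.raw").is_file()]
```

An empty result means every LoRA tensor in the list has a matching .raw file, which is the condition the note above requires.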

SPD configuration example

The following is an example configuration for speculative decoding (SPD) for all backends. These additional parameters must be supplied along with the general configuration parameters described above. The engine parameter in the SPD config is an array of two engines. Each engine config follows the existing definition of engine, with the addition of a “role” parameter to define the role of the model.

{
  "dialog" : {
    "version" : 1,
    "type" : "spd",
    "spd" : {
      "version" : 1,
      "draft-len" : 7
    },
    "engine" : [
      {
        "role" : "draft"
      },
      {
        "role" : "target"
      }
    ]
  }
}

An example of an SPD configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-draft-htp-target-htp-spd.json.
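To illustrate how the two-engine array relates to ordinary engine configs, here is a hypothetical Python sketch that tags two pre-built engine config dicts with their SPD roles. The helper name is illustrative, and the draft-len default of 7 follows the example above; the function is not part of the SDK.

```python
def make_spd_dialog(draft_engine, target_engine, draft_len=7):
    """Assemble an SPD dialog config from two ordinary engine configs
    by adding the "role" key and the spd block."""
    draft = dict(draft_engine, role="draft")
    target = dict(target_engine, role="target")
    return {
        "dialog": {
            "version": 1,
            "type": "spd",
            "spd": {"version": 1, "draft-len": draft_len},
            "engine": [draft, target],
        }
    }
```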

Multistream configuration example

The following is an example multistream execution configuration for all backends. The parameters below are supplied in addition to the previously described configuration parameters.

{
  "dialog" : {
    "version" : 1,
    "type" : "multistream",
    "multistream" : {
      "version" : 1,
      "n-streams" : 8,
      "p-threshold" : 0
    }
  }
}

An example of a Multistream configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-multistream.json.

KV-SHARE configuration example

The following is an example KV cache sharing (KV-SHARE) configuration for all backends. The parameters below are supplied in addition to the previously described configuration parameters. The engine parameter in the KV-SHARE config is an array of two engines. Each engine config follows the existing engine definition, with an additional “role” parameter that defines the role of the engine.

{
  "dialog" : {
    "version" : 1,
    "type" : "kv-share",
    "engine" : [
      {
        "role" : "primary"
      },
      {
        "role" : "secondary"
      }
    ]
  }
}

An example of a KV-SHARE configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-genaitransformer-htp-kv-share.json.

Genie Embedding JSON configuration string

The following sections describe the format of the JSON configuration string that is supplied to GenieEmbeddingConfig_createFromJson. This JSON configuration can also be supplied to the genie-t2e-run tool.

Note

Please refer to the example configs contained in the SDK at ${SDK_ROOT}/examples/Genie/configs/.

General configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieEmbeddingConfig_createFromJson. Note that dependencies are not specified in the schema, but are discussed in the following per-backend sections.

{
  "embedding" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "context" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "ctx-size": {"type": "integer"},
          "n-vocab": {"type": "integer"},
          "embed-size": {"type": "integer"},
          "pad-token": {"type": "integer"}
        }
      },
      "prompt" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "prompt-template" : {"type": "array", "items": {"type": "string"}}
        }
      },
      "tokenizer" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "path" : {"type": "string"}
        }
      },
      "truncate-input" : {"type" : "boolean"},
      "engine" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "n-threads" : {"type": "integer"},
          "backend" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum" : ["QnnHtp", "QnnGenAiTransformer"]},
              "QnnHtp" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "spill-fill-bufsize" : {"type": "integer"},
                  "use-mmap" : {"type": "boolean"},
                  "allow-async-init" : {"type": "boolean"},
                  "pooled-output" : {"type": "boolean"},
                  "disable-kv-cache" : {"type": "boolean"}
                }
              },
              "QnnGenAiTransformer" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "n-layer" : {"type": "integer"},
                  "n-embd" : {"type": "integer"},
                  "n-heads" : {"type": "integer"}
                }
              },
              "extensions" : {"type": "string"}
            }
          },
          "model" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum":["binary", "library"]},
              "binary" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "ctx-bins" : {"type": "array", "items": {"type": "string"}}
                }
              },
              "library" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "model-bin" : {"type": "string"}
                }
              }
            }
          }
        }
      }
    }
  }
}
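To illustrate the enum-constrained fields of the schema, the following stdlib-only sketch parses a configuration string and checks them. The validation helper is an assumption for illustration; it is not part of the Genie API and covers only a fraction of the schema.

```python
import json

# Enumerations taken from the schema above.
BACKEND_TYPES = {"QnnHtp", "QnnGenAiTransformer"}
MODEL_TYPES = {"binary", "library"}

def check_embedding_config(text):
    """Parse an embedding config string and verify the enum-constrained
    fields; returns the parsed dict or raises ValueError."""
    cfg = json.loads(text)
    engine = cfg["embedding"]["engine"]
    backend_type = engine["backend"]["type"]
    model_type = engine["model"]["type"]
    if backend_type not in BACKEND_TYPES:
        raise ValueError("unknown backend type: " + backend_type)
    if model_type not in MODEL_TYPES:
        raise ValueError("unknown model type: " + model_type)
    return cfg

# A minimal configuration fragment exercising only the checked fields.
sample = """
{
  "embedding": {
    "version": 1,
    "engine": {
      "version": 1,
      "backend": {"version": 1, "type": "QnnHtp"},
      "model": {"version": 1, "type": "binary"}
    }
  }
}
"""
cfg = check_embedding_config(sample)
```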

Option

Applicability

Description

embedding::version

all backends

Version of embedding object that is supported by APIs. (1)

embedding::truncate-input

all backends

Whether to truncate the input when it exceeds the context length.

context::version

all backends

Version of context object that is supported by APIs. (1)

context::ctx-size

all backends

Context length. Maximum number of tokens to process.

context::n-vocab

all backends

Model vocabulary size.

context::embed-size

all backends

Embedding length. Embedding vector length for each token.

context::pad-token

all backends

Token id for pad token.

prompt::version

all backends

Version of prompt object that is supported by APIs. (1)

prompt::prompt-template

all backends

Prefix and suffix strings that will be added to each prompt.

tokenizer::version

all backends

Version of tokenizer object that is supported by APIs. (1)

tokenizer::path

all backends

Path to tokenizer file.

engine::version

all backends

Version of engine object that is supported by APIs. (1)

engine::n-threads

all backends

Number of threads to use for KV-cache updates.

backend::version

all backends

Version of backend object that is supported by APIs. (1)

backend::type

all backends

Backend type: “QnnHtp” for the QNN HTP backend or “QnnGenAiTransformer” for the QNN GenAiTransformer backend.

backend::extensions

QNN HTP

Path to backend extensions configuration file.

QnnHtp::version

QNN HTP

Version of QnnHtp object that is supported by APIs. (1)

QnnHtp::spill-fill-bufsize

QNN HTP

Buffer size to pre-allocate for the QNN HTP spill fill. This field depends upon the HTP VTCM memory size. It should be set greater than the spill-fill required by each context binary in the model. Consult the QNN HTP backend documentation in the QAIRT SDK for more details.

QnnHtp::use-mmap

QNN HTP

Memory-map the context binary files. Should typically be enabled.

QnnHtp::allow-async-init

QNN HTP

Allow context binaries to be initialized asynchronously if the backend supports it.

QnnHtp::pooled-output

QNN HTP

Whether to return a pooled embedding or per-token embeddings as the generation result.

QnnHtp::disable-kv-cache

QNN HTP

Disables the KV cache manager for models that do not have a KV cache.

QnnGenAiTransformer::version

QNN GenAiTransformer

Version of QnnGenAiTransformer object that is supported by APIs. (1)

QnnGenAiTransformer::n-layer

QNN GenAiTransformer

Number of decoder layers in the model.

QnnGenAiTransformer::n-embd

QNN GenAiTransformer

Size of embedding vector for each token.

QnnGenAiTransformer::n-heads

QNN GenAiTransformer

Number of attention heads in the model.

model::version

all backends

Version of model object that is supported by APIs. (1)

model::type

all backends

Type of model object: “binary” for QNN HTP or “library” for QNN GenAiTransformer.

binary::version

QNN HTP

Version of binary object that is supported by APIs. (1)

binary::ctx-bins

QNN HTP

List of serialized model files.

library::version

QNN GenAiTransformer

Version of library object that is supported by APIs. (1)

library::model-bin

QNN GenAiTransformer

Path to model.bin file.

QNN GenAITransformer backend configuration example

The following is an example configuration for the QNN GenAITransformer backend.

{
  "embedding" : {
    "version" : 1,
    "context": {
      "version": 1,
      "n-vocab": 30522,
      "ctx-size": 512,
      "embed-size" : 1024,
      "pad-token" : 0
    },
    "prompt": {
      "version" : 1,
      "prompt-template": ["[CLS]","[SEP]"]
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "test_path"
    },
    "truncate-input" : true,
    "engine": {
      "version": 1,
      "n-threads" : 10,
      "backend" : {
        "version" : 1,
        "type" : "QnnGenAiTransformer",
        "QnnGenAiTransformer" : {
          "version" : 1,
          "n-layer": 24,
          "n-embd": 1024,
          "n-heads": 16
        }
      },
      "model" : {
        "version" : 1,
        "type" : "library",
        "library" : {
          "version" : 1,
          "model-bin" : "path_to_model_binary_file"
        }
      }
    }
  }
}

QNN HTP backend configuration example

The following is an example configuration for the QNN HTP backend.

{
  "embedding" : {
    "version" : 1,
    "context": {
      "version": 1,
      "n-vocab": 30522,
      "ctx-size": 512,
      "embed-size" : 1024,
      "pad-token" : 0
    },
    "prompt": {
      "version" : 1,
      "prompt-template": ["[CLS]","[SEP]"]
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "test_path"
    },
    "truncate-input" : true,
    "engine" : {
      "version" : 1,
      "backend" : {
        "version" : 1,
        "type" : "QnnHtp",
        "QnnHtp" : {
          "version" : 1,
          "spill-fill-bufsize" : 0,
          "use-mmap" : true,
          "pooled-output" : true,
          "allow-async-init": false,
          "disable-kv-cache": true
        },
        "extensions" : "htp_backend_ext_config.json"
      },
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "ctx-bins" : [
            "file_1_of_1.bin"
          ]
        }
      }
    }
  }
}
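As a final illustration, here is a small sketch that pulls the ctx-bins list out of a parsed config like the one above and joins each entry with a base directory. Resolving the entries relative to the config file’s directory is an assumption made for this sketch, not SDK behavior; consult the SDK documentation for the actual lookup rules.

```python
from pathlib import Path

def context_binaries(cfg, base_dir):
    """Return the ctx-bins entries of a parsed embedding config joined
    with base_dir (e.g. the directory containing the config file)."""
    bins = cfg["embedding"]["engine"]["model"]["binary"]["ctx-bins"]
    return [str(Path(base_dir) / b) for b in bins]
```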