Genie Dialog JSON configuration string

The following sections describe the format of the JSON configuration string that is supplied to GenieDialogConfig_createFromJson. This JSON configuration can also be supplied to the genie-t2t-run tool.

Note

Please refer to the example configs contained in the SDK at ${SDK_ROOT}/examples/Genie/configs/.

General configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. Note that dependencies are not specified in the schema, but are discussed in the following per-backend sections.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum":["basic"]},
      "context" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "size": {"type": "integer"},
          "n-vocab": {"type": "integer"},
          "bos-token": {"type": "integer"},
          "eos-token": {"type": "integer"}
        }
      },
      "sampler" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "seed" : {"type": "integer"},
          "temp" : {"type": "float"},
          "top-k" : {"type": "integer"},
          "top-p" : {"type": "float"},
          "greedy" : {"type": "boolean"},
          "token-penalty" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "penalize-last-n": {"type": "integer"},
              "repetition-penalty": {"type": "float"},
              "presence-penalty": {"type": "float"},
              "frequency-penalty": {"type": "float"}
            }
          }
        }
      },
      "tokenizer" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "path" : {"type": "string"}
        }
      },
      "engine" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "n-threads" : {"type": "integer"},
          "backend" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum" : ["QnnHtp", "QnnGenAiTransformer"]},
              "QnnHtp" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "spill-fill-bufsize" : {"type": "integer"},
                  "data-alignment-size" : {"type": "integer"},
                  "use-mmap" : {"type": "boolean"},
                  "mmap-budget" : {"type": "integer"},
                  "poll" : {"type": "boolean"},
                  "pos-id-dim" : {"type": "integer"},
                  "cpu-mask" : {"type": "string"},
                  "kv-dim" : {"type": "integer"},
                  "allow-async-init" : {"type": "boolean"},
                  "enable-graph-switching" : {"type": "boolean"},
                  "skip-lora-validation" : {"type" : "boolean"},
                  "rope-theta" : {"type": "double"},
                  "shared-engine" : {"type" : "boolean"}
                }
              },
              "QnnGenAiTransformer" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "n-logits" : {"type": "integer"},
                  "n-layer" : {"type": "integer"},
                  "n-embd" : {"type": "integer"},
                  "n-heads" : {"type": "integer"},
                  "n-kv-heads" : {"type": "integer"},
                  "kv-quantization" : {"type": "boolean"},
                  "enable-in-memory-kv-share" : {"type": "boolean"}
                }
              },
              "extensions" : {"type": "string"}
            }
          },
          "model" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum":["binary", "library"]},
              "binary" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "ctx-bins" : {"type": "array", "items": {"type": "string"}}
                }
              },
              "library" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "model-bin" : {"type": "string"}
                }
              }
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::version

all backends

Version of dialog object that is supported by APIs.(1)

dialog::type

all backends

Type of dialog supported by APIs.(basic)

dialog::stop-sequence

all backends

Stops the query when any of a set of sequences is detected in the response. Passed as an array of strings.

dialog::max-num-tokens

all backends

Stops the query when the maximum number of tokens has been generated in the response.

context::version

all backends

Version of context object that is supported by APIs. (1)

context::size

all backends

Context length. Maximum number of tokens to store.

context::n-vocab

all backends

Model vocabulary size.

context::bos-token

all backends

Beginning of sentence token.

context::eos-token

all backends

End of sentence token. Passed as an integer or an array of integers.

context::eot-token

all backends

End of turn token.

sampler::version

all backends

Version of sampler object that is supported by APIs. (1)

sampler::type

all backends

Type of sampler to use. Supported options: basic, custom

sampler::callback-name

all backends

Name of the callback function to use for Sampling.

sampler::seed

all backends

Sampling random number generation seed.

sampler::temp

all backends

Sampling temperature.

sampler::top-k

all backends

Top-k number of samples.

sampler::top-p

all backends

Top-p sampling threshold.

sampler::greedy

all backends

Selects random or greedy sampling. A value of true specifies greedy sampling.

token-penalty::version

all backends

Version of token-penalty object that is supported by APIs. (1)

token-penalty::penalize-last-n

all backends

Number of most recent tokens whose history is maintained for penalization. Zero signifies no penalty.

token-penalty::repetition-penalty

all backends

Penalizes logits for repetition by the given value. Values >1 penalize repetition, while values <1 encourage it.

token-penalty::presence-penalty

all backends

Penalizes logits by the given value based on token presence. Values >0 penalize repetition, while values <0 encourage it.

token-penalty::frequency-penalty

all backends

Penalizes logits by the given value based on token frequency. Values >0 penalize repetition, while values <0 encourage it.

tokenizer::version

all backends

Version of tokenizer object that is supported by APIs. (1)

tokenizer::path

all backends

Path to tokenizer file.

engine::version

all backends

Version of engine object that is supported by APIs. (1)

engine::n-threads

all backends

Number of threads to use for KV-cache updates.

debug::path

all backends

File path to dump debug information.

debug::dump-tensors

all backends

Raw data dump of input and output tensors

debug::dump-specs

all backends

Dumps input and output tensor specifications, such as bw, scale, offset, and dimensions.

debug::dump-outputs

all backends

Raw data dump of output tensor from engine

backend::version

all backends

Version of backend object that is supported by APIs. (1)

backend::type

all backends

Type of backend: “QnnHtp” for QNN HTP, “QnnGenAiTransformer” for the QNN GenAiTransformer backend, and “QnnGpu” for QNN GPU.

backend::extensions

QNN HTP

Path to backend extensions configuration file.

QnnHtp::version

QNN HTP

Version of QnnHtp object that is supported by APIs. (1)

QnnHtp::spill-fill-bufsize

QNN HTP

Buffer size to pre-allocate for the QNN HTP spill fill. This field depends upon the HTP VTCM memory size. It should be set greater than the spill-fill required by each context binary in the model. Consult the QNN HTP backend documentation in the QAIRT SDK for more details.

QnnHtp::data-alignment-size

QNN HTP

Data will be aligned by rounding up the size to the nearest multiple of alignment number. Typically should be zero.

QnnHtp::use-mmap

QNN HTP

Memory map the context binary files. Typically should be turned on.

QnnHtp::mmap-budget

QNN HTP

Memory map the context binary files in chunks of the given size. Typically should be 25MB.

QnnHtp::poll

QNN HTP

Specify whether to busy-wait on threads.

QnnHtp::pos-id-dim

QNN HTP

Dimension of positional embeddings, usually (kv-dim) / 2.

QnnHtp::cpu-mask

QNN HTP

CPU affinity mask.

QnnHtp::kv-dim

QNN HTP

Dimension of the KV-cache embedding.

QnnHtp::allow-async-init

QNN HTP

Allow context binaries to be initialized asynchronously if the backend supports it.

QnnHtp::enable-graph-switching

QNN HTP

Enables graph switching for graphs within each context binary.

QnnHtp::skip-lora-validation

QNN HTP

Skips CRC validation when LoRA binary sections are applied. Please refer to QNN HTP documentation for more information.

QnnHtp::rope-theta

QNN HTP

Used to calculate rotary positional encodings.

QnnHtp::shared-engine

QNN HTP

Enables the engine to be created and managed in shared mode. “allow-async-init” must be true for engine sharing.

QnnGenAiTransformer::version

QNN GenAiTransformer

Version of QnnGenAiTransformer object that is supported by APIs. (1)

QnnGenAiTransformer::n-logits

QNN GenAiTransformer

Number of logit vectors in the result available for sampling.

QnnGenAiTransformer::n-layer

QNN GenAiTransformer

Number of decoder layers in the model.

QnnGenAiTransformer::n-embd

QNN GenAiTransformer

Size of embedding vector for each token.

QnnGenAiTransformer::n-heads

QNN GenAiTransformer

Number of attention heads in the model.

QnnGenAiTransformer::n-kv-heads

QNN GenAiTransformer

Number of KV heads. Used for models with grouped-query attention (GQA).

QnnGenAiTransformer::kv-quantization

QNN GenAiTransformer

Quantize KV Cache to Q8_0_32.

QnnGenAiTransformer::enable-in-memory-kv-share

QNN GenAiTransformer

Enables in-memory buffer for the KV-Share dialog for better performance.

model::version

all backends

Version of model object that is supported by APIs. (1)

model::type

all backends

Type of model object: “binary” for QNN HTP, “library” for QNN GenAiTransformer.

model::positional-encoding

all backends

Captures positional encoding parameters for a model.

positional-encoding::type

all backends

Type of positional encoding. Supported types are rope, alibi, and absolute.

positional-encoding::rope-dim

all backends

Dimension of Rope positional embeddings, usually (kv-dim) / 2.

positional-encoding::rope-theta

all backends

Used to calculate rotary positional encodings for type rope.

binary::version

QNN HTP

Version of binary object that is supported by APIs. (1)

binary::ctx-bins

QNN HTP

List of serialized model files.

library::version

QNN GenAiTransformer

Version of library object that is supported by APIs. (1)

library::model-bin

QNN GenAiTransformer

Path to model.bin file.
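
For illustration, a dialog fragment combining the stop options and a token penalty might look like the following (all values are hypothetical; remaining parameters follow the schema above):

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "stop-sequence" : ["</s>"],
    "max-num-tokens" : 256,
    "sampler" : {
      "version" : 1,
      "seed" : 42,
      "temp" : 0.8,
      "top-k" : 40,
      "top-p" : 0.95,
      "greedy" : false,
      "token-penalty" : {
        "version" : 1,
        "penalize-last-n" : 64,
        "repetition-penalty" : 1.1,
        "presence-penalty" : 0.0,
        "frequency-penalty" : 0.0
      }
    }
  }
}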

SSD-Q1 configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["ssd-q1"]},
      "ssd-q1": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "ssd-version": {"type": "integer"},
          "forecast-prefix": {"type": "integer"},
          "forecast-token-count" : {"type": "integer"},
          "forecast-prefix" : {"type": "integer"},
          "forecast-prefix-name" : {"type": "string"},
          "branches" : "branches": {"type": "array"},
          "n-streams": {"type": "integer"},
          "p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
        }
      },
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (ssd-q1)

ssd-q1::version

all backends

Version of ssd-q1 object that is supported by APIs.(1)

ssd-q1::ssd-version

all backends

Version of SSD to be used; the supported version is 1.

ssd-q1::forecast-prefix

all backends

Length of forecast prefix.

ssd-q1::forecast-token-count

all backends

Maximum number of tokens that can be forecast.

ssd-q1::branch-mode

all backends

Supports top-1 and all-expand branching modes.

ssd-q1::branches

all backends

Number of parallel decoding branches that will be created.

ssd-q1::forecast-prefix-name

all backends

Path of forecast-prefix binary.

ssd-q1::n-streams

all backends

Number of parallel streams of the output for same query.

ssd-q1::p-threshold

all backends

Probability threshold for sampling n-tokens for n-streams.

SPD configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here, except “engine”, follow the general configuration schema. For SPD, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["spd"]},
      "spd": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "draft-len" : {"type": "integer"}
        }
      },
      "engine" : {
        "type" : "array",
        "properties" : {
          {
            "version" : {"type": "integer"},
            "role" : {"type": "string" "enum" : ["draft", "target"]}
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (spd)

spd::version

all backends

Version of spd object that is supported by APIs.(1)

spd::draft-len

all backends

Draft length for speculative decoding.

spd::engine::0::role

all backends

Engine role. Distinguishes between multiple engines.

LADE configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here follow the general configuration schema.

Note

Please follow these rules when setting LADE parameters:

  1. lade::window == lade::gcap

  2. (lade::window + lade::gcap) * (lade::ngram - 1) <= AR-N

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["lade"]},
      "lade": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "update-mode": {"type": "string", "enum": ["ALWAYS_FWD_ONE", "FWD_MAX_HIT", "FWD_LEVEL"]},
          "window": {"type": "integer"},
          "ngram": {"type": "integer"},
          "gcap": {"type": "integer"}
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type that has to be used.(lade)

lade::version

all backends

Version of lade object that is supported by APIs.(1)

lade::window

all backends

Window size that will be used for LADE.

lade::ngram

all backends

Number of n-grams that will be used for LADE.

lade::gcap

all backends

Gcap value that will be used for LADE.

lade::update-mode

all backends

Update mode used for lookahead branch updates. (ALWAYS_FWD_ONE/FWD_MAX_HIT/FWD_LEVEL)

Multistream configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["multistream"]},
      "multistream": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "n-streams": {"type": "integer"},
          "p-threshold": {"type": "number", "minimum": 0, "maximum": 1}
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type that has to be used.(multistream)

multistream::version

all backends

Version of multistream object that is supported by APIs.(1)

multistream::n-streams

all backends

Number of parallel streams of the output for same query.

multistream::p-threshold

all backends

Probability threshold for sampling n-tokens for n-streams.
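
For illustration, a multistream dialog fragment might look like the following (values are hypothetical; remaining parameters follow the general configuration schema):

{
  "dialog" : {
    "version" : 1,
    "type" : "multistream",
    "multistream" : {
      "version" : 1,
      "n-streams" : 4,
      "p-threshold" : 0.5
    }
  }
}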

KV-SHARE configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here, except “engine”, follow the general configuration schema. For KV-SHARE, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["kv-share"]},
      "engine" : {
        "type" : "array",
        "properties" : {
          {
            "version" : {"type": "integer"},
            "role" : {"type": "string" "enum" : ["primary", "secondary"]}
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (kv-share)

kv-share::engine::role

all backends

Engine role. Distinguishes between multiple engines.
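
For illustration, a KV-SHARE dialog fragment with a primary and a secondary engine might look like the following (each engine object otherwise follows the general “engine” definition):

{
  "dialog" : {
    "version" : 1,
    "type" : "kv-share",
    "engine" : [
      {
        "role" : "primary"
      },
      {
        "role" : "secondary"
      }
    ]
  }
}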

LoRA V1 configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "engine" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "model" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum":["binary"]},
              "binary" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "lora": {
                    "type": "object",
                    "properties": {
                      "version" : {"type": "integer"},
                      "alpha-tensor-name": {"type": "string"},
                      "lora-version": {"type": "integer"},
                      "adapters" : {
                        "type": "array",
                        "items" : {
                          "type": "object",
                          "properties" : {
                            "version" : {"type": "integer"},
                            "name" : {"type": "string"},
                            "path" : {"type": "string"}
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

lora::version

all backends

Version of lora object that is supported by APIs.(1)

lora::alpha-tensor-name

all backends

Name of alpha tensor.

lora::lora-version

all backends

Configured lora version.

adapters::version

all backends

Version of adapters object that is supported by APIs.(1)

adapters::name

all backends

Name of lora weights adapters.

adapters::path

all backends

Path to lora-weights directory.
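
For illustration, a LoRA V1 model fragment with a single adapter might look like the following (the tensor name, adapter name, and path are hypothetical):

"model" : {
  "version" : 1,
  "type" : "binary",
  "binary" : {
    "version" : 1,
    "lora" : {
      "version" : 1,
      "lora-version" : 1,
      "alpha-tensor-name" : "lora_alpha",
      "adapters" : [
        {
          "version" : 1,
          "name" : "adapter_0",
          "path" : "path_to_lora_weights_dir"
        }
      ]
    }
  }
}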

LoRA V2/V3 configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "engine" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "model" : {
            "type": "object",
            "properties": {
              "version" : {"type": "integer"},
              "type" : {"type": "string","enum":["binary"]},
              "binary" : {
                "type": "object",
                "properties": {
                  "version" : {"type": "integer"},
                  "lora": {
                    "type": "object",
                    "properties": {
                      "version" : {"type": "integer"},
                      "alpha-tensor-name": {"type": "string"},
                      "adapters" : {
                        "type": "array",
                        "items" : {
                          "type": "object",
                          "properties": {
                            "version" : {"type": "integer"},
                            "name" : {"type": "string"},
                            "alphas" : {"type": "array", "items": {"type": "string"}},
                            "bin-sections": {"type": "array", "items": {"type": "string"}}
                          }
                        }
                      },
                      "groups" : {
                        "type": "array",
                        "items" : {
                          "type": "object",
                          "properties": {
                            "version" : {"type": "integer"},
                            "name" : {"type": "string"},
                            "members" : {"type": "array", "items": {"type": "string"}},
                            "quant-bin-sections": {"type": "array", "items": {"type": "string"}}
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

lora::version

all backends

Version of lora object that is supported by APIs.(1)

lora::alpha-tensor-name

all backends

Name of alpha tensor. This tensor is part of the base graph and gets populated by alpha strength value(s) during GenieDialog_setLoraStrength API call.

adapters::version

all backends

Version of adapters object that is supported by APIs.(1)

adapters::name

all backends

Name of lora adapters.

adapters::alphas

all backends

List of alpha names, which serve as virtual alpha tensor labels, one for each adapter. The user specifies the alpha values corresponding to these names via genie-t2t-run lora argument, and these values populate the base graph alpha tensor. Mandatory for LoRAv3, optional for LoRAv1 & LoRAv2.

adapters::bin-sections

all backends

List of serialized lora weights bins.

lora::groups

QNN HTP

The groups to which the LoRA adapters belong. Adapters in the same group share the same encodings bins.

groups::version

QNN HTP

Version of groups object that is supported by APIs.(1)

groups::name

QNN HTP

Name of lora group.

groups::members

QNN HTP

Name of the adapters belonging to the lora group.

groups::quant-bin-sections

QNN HTP

List of serialized lora quant-only bins.
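
For illustration, a LoRA V3 “lora” fragment with one adapter and one group might look like the following (names and bin-section file names are hypothetical):

"lora" : {
  "version" : 1,
  "alpha-tensor-name" : "lora_alpha",
  "adapters" : [
    {
      "version" : 1,
      "name" : "adapter_0",
      "alphas" : ["alpha_0"],
      "bin-sections" : ["adapter_0_weights.bin"]
    }
  ],
  "groups" : [
    {
      "version" : 1,
      "name" : "group_0",
      "members" : ["adapter_0"],
      "quant-bin-sections" : ["group_0_quant.bin"]
    }
  ]
}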

Embedding-to-Text configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum":["basic", "ssd-q1", "multistream"]},
      "embedding" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "size": {"type": "integer"},
          "datatype": {"type": "string", "enum":["float32", "native"]}
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type that has to be used.

embedding::version

all backends

Version of embedding object that is supported by APIs.(1)

embedding::size

all backends

Embedding vector dimensionality.

embedding::datatype

all backends

Expected datatype for embedding vectors provided to GenieDialog_embeddingQuery() and the token-to-embedding conversion callback.
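
For illustration, an embedding-to-text dialog fragment might look like the following (the embedding size is hypothetical; remaining parameters follow the general configuration schema):

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "embedding" : {
      "version" : 1,
      "size" : 4096,
      "datatype" : "float32"
    }
  }
}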

Running split LLM (Embedding + Decoder) configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here follow the general configuration schema.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum":["basic", "ssd-q1", "multistream"]},
      "embedding" : {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "type": {"type": "string", "enum":["lut", "callback"]},
          "size": {"type": "integer"},
          "datatype": {"type": "string", "enum":["float32", "native", "ufixed8", "ufixed16", "sfixed8", "sfixed16"]}
          "lut-path" : {"type": "string"},
          "quant-param" : {
            "type": "object",
            "properties": {
              "scale" : {"type": "float"},
              "offset" : {"type": "float"}
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type that has to be used.

embedding::version

all backends

Version of embedding object that is supported by APIs.(1)

embedding::type

all backends

Type of embedding operation to configure. Supported types are lut and callback.

embedding::size

all backends

Embedding vector dimensionality.

embedding::datatype

all backends

Datatype of the look-up table binary file.

embedding::lut-path

all backends

Path to the look up table binary file.

quant-param::scale

all backends

Quantization Scale for look-up-table binary.

quant-param::offset

all backends

Quantization Offset for look-up-table binary.
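
For illustration, an embedding fragment using a quantized look-up table might look like the following (the path, size, and quantization parameters are hypothetical):

"embedding" : {
  "version" : 1,
  "type" : "lut",
  "size" : 4096,
  "datatype" : "ufixed16",
  "lut-path" : "path_to_lut_binary",
  "quant-param" : {
    "scale" : 0.0001,
    "offset" : 0.0
  }
}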

Eaglet configuration schema

The following provides the schema of the JSON configuration format that is provided to GenieDialogConfig_createFromJson. All other parameters not mentioned here, except “engine” and “context”, follow the general configuration schema. For eaglet, “engine” becomes an array of objects instead of a single object. Each object in the “engine” array follows the “engine” definition in the general configuration schema, with the addition of a “role” property. The “draft” engine includes “draft-token-map” in the trimmed-vocabulary case, which applies when “n-vocab” != “draft-n-vocab”.

{
  "dialog" : {
    "type": "object",
    "properties": {
      "version" : {"type": "integer"},
      "type" : {"type": "string", "enum" : ["eaglet"]},
      "eaglet": {
        "type": "object",
        "properties": {
          "version" : {"type": "integer"},
          "eaglet-version": {"type": "integer"},
          "draft-len": {"type": "integer"},
          "n-branches": {"type": "integer"},
          "max-tokens-target-can-evaluate": {"type": "integer"},
          "draft-kv-cache": {"type": "boolean"}
        }
      },
      "context" : {
        "version" : {"type": "integer"},
        "draft-n-vocab" : {"type" : "integer"}
      }
      "engine" : {
        "type" : "array",
        "properties" : {
          "version" : {"type": "integer"},
          "role" : {"type": "string" "enum" : ["draft", "target"]},
          {
            "model" : {
              "type" : "array",
              "properties" : {
                "version" : {"type": "integer"},
                "draft-token-map" : {"type" : "string"}
              }
            }
          }
        }
      }
    }
  }
}

Option

Applicability

Description

dialog::type

all backends

Dialog type to be used. (eaglet)

eaglet::version

all backends

Version of eaglet object that is supported by APIs.(1)

eaglet::eaglet-version

all backends

Version of eaglet to be used; the supported version is 1.

eaglet::draft-len

all backends

Length of draft sequence

eaglet::n-branches

all backends

Number of parallel decoding branches that will be created.

eaglet::max-tokens-target-can-evaluate

all backends

Maximum number of tokens target model can evaluate in a run.

eaglet::draft-kv-cache

all backends

Whether the draft model runs with a KV cache.

eaglet::engine::role

all backends

Engine Role. Distinguish between multiple engines

context::draft-n-vocab

all backends

Draft model vocabulary size.

eaglet::engine::draft-token-map

all backends

Map for draft-token-id to target-token-id.
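
For illustration, an eaglet dialog fragment with a trimmed draft vocabulary might look like the following (all values and the token-map path are hypothetical; remaining parameters follow the general configuration schema):

{
  "dialog" : {
    "version" : 1,
    "type" : "eaglet",
    "eaglet" : {
      "version" : 1,
      "eaglet-version" : 1,
      "draft-len" : 4,
      "n-branches" : 2,
      "max-tokens-target-can-evaluate" : 16,
      "draft-kv-cache" : true
    },
    "context" : {
      "draft-n-vocab" : 16000
    },
    "engine" : [
      {
        "role" : "draft",
        "model" : {
          "draft-token-map" : "path_to_draft_token_map"
        }
      },
      {
        "role" : "target"
      }
    ]
  }
}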

QNN GenAITransformer backend configuration example

The following is an example configuration for the QNN GenAITransformer backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "context" : {
      "version" : 1,
      "size": 512,
      "n-vocab": 32000,
      "bos-token": 1,
      "eos-token": 2
    },
    "sampler" : {
      "version" : 1,
      "seed" : 100,
      "temp" : 1.2,
      "top-k" : 20,
      "top-p" : 0.75,
      "greedy" : false
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "path_to_tokenizer_json"
    },
    "engine" : {
      "version" : 1,
      "n-threads" : 10,
      "backend" : {
        "version" : 1,
        "type" : "QnnGenAiTransformer",
        "QnnGenAiTransformer" : {
          "version" : 1,
          "kv-quantization": false
        }
      },
      "model" : {
        "version" : 1,
        "type" : "library",
        "library" : {
          "version" : 1,
          "model-bin" : "path_to_model_binary_file"
        }
      }
    }
  }
}

QNN HTP backend configuration example

The following is an example configuration for the QNN HTP backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "context" : {
      "version" : 1,
      "size": 1024,
      "n-vocab": 32000,
      "bos-token": 1,
      "eos-token": 2
    },
    "sampler" : {
      "version" : 1,
      "seed" : 42,
      "temp" : 0.8,
      "top-k" : 40,
      "top-p" : 0.95,
      "greedy" : true
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "test_path"
    },
    "engine" : {
      "version" : 1,
      "n-threads" : 3,
      "backend" : {
        "version" : 1,
        "type" : "QnnHtp",
        "QnnHtp" : {
          "version" : 1,
          "spill-fill-bufsize" : 320000000,
          "use-mmap" : true,
          "mmap-budget" : 0,
          "poll" : true,
          "pos-id-dim" : 64,
          "cpu-mask" : "0xe0",
          "kv-dim" : 128,
          "rope-theta" : 10000
        },
        "extensions" : "htp_backend_ext_config.json"
      },
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "ctx-bins" : [
            "file_1_of_4",
            "file_2_of_4",
            "file_3_of_4",
            "file_4_of_4"
          ]
        }
      }
    }
  }
}

QNN GPU backend configuration example

The following is an example configuration for the QNN GPU backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "context" : {
      "version" : 1,
      "size": 512,
      "n-vocab": 32000,
      "bos-token": 1,
      "eos-token": 2
    },
    "sampler" : {
      "version" : 1,
      "seed" : 100,
      "temp" : 1.2,
      "top-k" : 20,
      "top-p" : 0.75,
      "greedy" : false
    },
    "tokenizer" : {
      "version" : 1,
      "path" : "path_to_tokenizer_json"
    },
    "engine" : {
      "version" : 1,
      "n-threads" : 1,
      "backend" : {
        "version" : 1,
        "type" : "QnnGpu"
      },
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "ctx-bins" : [
            "path_to_model_context_binary"
          ]
        }
      }
    }
  }
}

SSD-Q1 configuration example

The following is an example configuration for Self Speculative Decoding (SSD-Q1), applicable to all backends. The following additional parameters must be supplied along with the configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "ssd-q1",
    "ssd-q1" : {
      "version" : 1,
      "ssd-version" : 1,
      "forecast-token-count" : 4,
      "forecast-prefix" : 16,
      "forecast-prefix-name" : "forecast_prefix_name_string",
      "branches" : [4, 4]
    }
  }
}

An example of a SSD-Q1 configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-ssd.json.

SPD configuration example

The following is an example configuration for Speculative Decoding (SPD), applicable to all backends. The following additional parameters must be supplied along with the configuration parameters described above. The engine parameter in the SPD config is an array of two engines. Each engine config follows the existing definition of engine, with an additional “role” parameter that defines the role of the model.

{
  "dialog" : {
    "version" : 1,
    "type" : "spd",
    "spd" : {
      "version" : 1,
      "draft-len" : 7
    },
    "engine" : [
      {
        "role" : "draft"
      },
      {
        "role" : "target"
      }
    ]
  }
}

An example of a SPD configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-draft-htp-target-htp-spd.json.
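The two-engine layout above can also be assembled programmatically before serializing it to JSON. The following is a minimal sketch (not part of the Genie SDK; the helper name and the plain-dict engine arguments are illustrative assumptions), showing how the SPD-specific fields layer onto two ordinary engine configs:

```python
def make_spd_dialog(draft_engine, target_engine, draft_len=7):
    """Assemble an SPD dialog section from two engine configs.

    draft_engine / target_engine are plain dicts following the usual
    engine schema; this helper only adds the SPD-specific fields and
    the required "role" key on each engine.
    """
    return {
        "dialog": {
            "version": 1,
            "type": "spd",
            "spd": {"version": 1, "draft-len": draft_len},
            "engine": [
                {**draft_engine, "role": "draft"},
                {**target_engine, "role": "target"},
            ],
        }
    }
```

Serializing the result with `json.dumps` yields a string in the shape of the example above.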

LADE configuration example

The following is an example configuration for LookAhead Decoding (LADE), applicable to all backends. The following additional parameters must be supplied along with the configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "lade",
    "lade" : {
      "version" : 1,
      "update-mode" : "ALWAYS_FWD_ONE",
      "window" : 8,
      "ngram" : 5,
      "gcap" : 8
    }
  }
}

An example of a LADE configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lade.json.

Multistream configuration example

The following is an example configuration for Multistream execution, applicable to all backends. The following additional parameters must be supplied along with the configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "multistream",
    "multistream" : {
      "version" : 1,
      "n-streams" : 8,
      "p-threshold" : 0
    }
  }
}

An example of a Multistream configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-multistream.json.

KV-SHARE configuration example

The following is an example configuration for KV cache sharing (KV-SHARE), applicable to all backends. The following additional parameters must be supplied along with the configuration parameters described above. The engine parameter in the KV-SHARE config is an array of two engines. Each engine config follows the existing definition of engine, with an additional “role” parameter that defines the role of the engine.

{
  "dialog" : {
    "version" : 1,
    "type" : "kv-share",
    "engine" : [
      {
        "role" : "primary"
      },
      {
        "role" : "secondary"
      }
    ]
  }
}

An example of a KV-SHARE configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-genaitransformer-htp-kv-share.json.

LoRA V1 configuration example

The following is an example configuration for Low-Rank Adaptation (LoRA), applicable to all backends. The following additional parameters must be supplied along with the configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "engine" : {
      "version" : 1,
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "lora": {
            "version" : 1,
            "alpha-tensor-name": "alpha",
            "lora-version" : 1,
            "adapters" : [
              {
                "version" : 1,
                "name" : "lora-weights-1"
                "path" : "path/to/lora-weights-1-dir/"
              },
              {
                "version" : 1,
                "name" : "lora-weights-2"
                "path" : "path/to/lora-weights-2-dir/"
              }
            ]
          }
        }
      }
    }
  }
}

LoRA V1 Weights Directory Format

The LoRA weights directory must follow a naming convention: each file is named tensor_name.raw, for every tensor whose name contains the “lora” keyword. For example, if a tensor’s name is “lora_scale”, then the file name must be lora_scale.raw.

Note

Each tensor name must match the corresponding tensor name in the model. If even a single LoRA tensor is not found in the lora-weights directory, the adapter will fail to apply.
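Because a single missing file causes the whole adapter to fail, it can help to check the directory up front. The following is a minimal sketch (not part of the Genie SDK; the function name and the `model_tensor_names` argument are illustrative assumptions, since Genie resolves tensor names internally):

```python
import os

def check_lora_weights_dir(weights_dir, model_tensor_names):
    """Return the LoRA tensors that are missing a <tensor_name>.raw file.

    model_tensor_names is a hypothetical list of tensor names taken from
    the model; only names containing the "lora" keyword are checked,
    mirroring the naming convention described above.
    """
    missing = [
        name for name in model_tensor_names
        if "lora" in name
        and not os.path.isfile(os.path.join(weights_dir, name + ".raw"))
    ]
    return missing  # an empty list means every required file is present
```

Running this before loading the adapter surfaces missing files with a clear list instead of a load-time failure.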

LoRA V2 configuration example for HTP

The following is an example configuration for Low-Rank Adaptation (LoRA) on the HTP backend. The following additional parameters must be supplied along with the configuration parameters described above.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "engine" : {
      "version" : 1,
      "model" : {
        "version" : 1,
        "type" : "binary",
        "binary" : {
          "version" : 1,
          "lora": {
            "version" : 1,
            "alpha-tensor-name": "alpha",
            "adapters" : [
              {
                "version" : 1,
                "name" : "lora1",
                "bin-sections": [
                  "lora1_file_1_of_4.bin",
                  "lora1_file_2_of_4.bin",
                  "lora1_file_3_of_4.bin",
                  "lora1_file_4_of_4.bin"
                ]
              },
              {
                "version" : 1,
                "name" : "lora2",
                "bin-sections": [
                  "lora2_file_1_of_4.bin",
                  "lora2_file_2_of_4.bin",
                  "lora2_file_3_of_4.bin",
                  "lora2_file_4_of_4.bin"
                ]
              }
            ]
          }
        }
      }
    }
  }
}

An example of a LoRA configuration for HTP can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-htp-lora.json.

LoRA configuration example for GenAiTransformer BE

The following is an example configuration for Low-Rank Adaptation (LoRA) on the GenAiTransformer backend.

{
  "dialog" : {
    "version" : 1,
    "type" : "basic",
    "engine" : {
      "version" : 1,
      "model" : {
        "version" : 1,
        "type" : "library",
        "library" : {
          "version" : 1,
          "lora": {
            "version" : 1,
            "alpha-tensor-name": "alpha",
            "adapters" : [
              {
                "version" : 1,
                "name" : "lora1",
                "bin-sections": [
                  "lora1_adapter.bin"
                ]
              },
              {
                "version" : 1,
                "name" : "lora2",
                "bin-sections": [
                  "lora2_adapter.bin"
                ]
              }
            ]
          }
        }
      }
    }
  }
}

An example of a LoRA configuration for GenAiTransformer BE can be found at ${SDK_ROOT}/examples/Genie/configs/llama2-7b/llama2-7b-genaitransformer-lora.json.

Eaglet configuration example

The following is an example configuration for Eaglet decoding, applicable to all backends. The following additional parameters must be supplied along with the configuration parameters described above. The engine parameter in the Eaglet config is an array of two engines. Each engine config follows the existing definition of engine, with an additional “role” parameter that defines the role of the model. The “draft” engine must also include “draft-token-map” and “draft-n-vocab” for the trimmed/non-trimmed vocabulary case.

{
  "dialog" : {
    "version" : 1,
    "type" : "eaglet",
    "eaglet": {
      "version": 1,
      "eaglet-version": 1,
      "draft-len": 6,
      "n-branches": 8,
      "max-tokens-target-can-evaluate": 32,
      "draft-kv-cache": true
    },
    "context": {
      "version": 1,
      "n-vocab": 128256,
      "draft-n-vocab": 3200,
    },
    "engine" : [
      {
        "role" : "draft",
        "model": {
          "version": 1,
          "draft-token-map" : "vocab_trim_elementary16181.json"
        }
      },
      {
        "role" : "target"
      }
    ]
  }
}

An example of an Eaglet configuration can be found at ${SDK_ROOT}/examples/Genie/configs/llama3-3b/llama3-3b-eaglet-htp.json.