Tools¶
This page describes the tools provided in Genie.
genie-t2t-run¶
Note
The genie-t2t-run interface will have a backwards incompatible update in the next release.
Note
The -tok/--tokens_file option (token-to-token feature) is currently supported for the basic dialog type only.
The genie-t2t-run tool is a test application for running text-to-text inference on a provided LLM network.
It takes a user prompt as text and outputs the result as text.
Genie T2T Run Application¶
DESCRIPTION:
------------
Tool for text to text inference of LLMs using Genie.
REQUIRED ARGUMENTS:
-------------------
-c or --config <FILE> Dialog JSON configuration file.
OPTIONAL ARGUMENTS:
-------------------
-h or --help Show this help message and exit.
-p or --prompt <VAL> Prompt to query. Mutually exclusive with --prompt_file.
--prompt_file <FILE> Prompt to query provided as a file. Mutually exclusive with --prompt.
-e PATH or --embedding_file <FILE> Input embeddings provided as a file. Mutually exclusive with --prompt, --prompt_file and --tokens_file.
TYPE, SCALE, and OFFSET are optional parameters representing the model's input quantization encodings.
Required for lookup table requantization. Valid values of TYPE are int8, int16, uint8, uint16. The
signedness must be consistent with the lookup table encodings.
-t PATH or --embedding_table <FILE> Token-to-Embedding lookup table provided as a file. Mutually exclusive with --prompt and --prompt_file.
TYPE, SCALE, and OFFSET are optional parameters representing the lookup table's quantization encodings.
Required for lookup table requantization. Valid values of TYPE are int8, int16, uint8, uint16. The
signedness must be consistent with the input layer encodings.
-l or --lora <VAL> Apply a LoRA adapter to a dialog. VAL is ADAPTER_NAME,ALPHA_NAME,ALPHA_VALUE.
-tok PATH or --tokens_file <FILE> Input tokens provided as a file (Supported format .txt). Mutually exclusive with --prompt, --prompt_file and --embedding_file.
See Tutorials for a reference example of how to use the genie-t2t-run tool.
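As a minimal sketch of supplying a prompt via --prompt_file, the snippet below writes a prompt file and assembles the matching command line from the options documented above. The config file name, prompt text, and file paths are illustrative placeholders, not values from a real setup.

```python
# Sketch: prepare a prompt file for --prompt_file and assemble the matching
# command line. All paths and the config name are illustrative placeholders.
prompt = "What is the capital of France?"
with open("prompt.txt", "w") as f:
    f.write(prompt)

# --prompt and --prompt_file are mutually exclusive, so exactly one of the
# two input options appears in the command.
cmd = ["genie-t2t-run", "-c", "dialog-config.json", "--prompt_file", "prompt.txt"]
print(" ".join(cmd))
```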
Embedding Vector Lookup Table¶
For Embedding-to-Text models, genie-t2t-run accepts an embedding vector lookup table file for converting tokens to embedding vectors. The file is a two-dimensional array in row-major order, stored in raw binary format. The row count must equal the tokenizer vocabulary size, and the column count must equal the embedding vector length. The datatype of the array is determined by the embedding::datatype field in the Dialog JSON configuration.
Note
If a lookup table is not provided, genie-t2t-run can only generate the result of the first output token.
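The row-major lookup-table layout described above can be illustrated with a short sketch. The vocabulary size, embedding length, and float32 datatype below are assumptions for illustration; in practice they must match the tokenizer vocabulary and the embedding::datatype setting in the Dialog JSON configuration.

```python
import struct

# Illustrative dimensions; real values must match the tokenizer vocabulary
# size and the model's embedding vector length.
vocab_size = 4
embedding_length = 3

# Build a vocab_size x embedding_length table in row-major order, so that
# row i holds the embedding vector for token id i.
table = [[float(i * embedding_length + j) for j in range(embedding_length)]
         for i in range(vocab_size)]

# Write the table as raw binary (float32 assumed here for illustration).
with open("embedding_table.raw", "wb") as f:
    for row in table:
        f.write(struct.pack(f"<{embedding_length}f", *row))

# Because the layout is row-major, looking up a token id is a simple seek.
token_id = 2
row_bytes = 4 * embedding_length  # 4 bytes per float32 element
with open("embedding_table.raw", "rb") as f:
    f.seek(token_id * row_bytes)
    vector = struct.unpack(f"<{embedding_length}f", f.read(row_bytes))
print(vector)  # embedding vector for token id 2
```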
genie-t2e-run¶
The genie-t2e-run tool is a test application for running text-to-embedding inference on a provided embedding model.
It takes a user prompt as text and writes the result to an embedding file.
DESCRIPTION:
------------
Tool for text to embedding inference of encoder models using Genie.
REQUIRED ARGUMENTS:
-------------------
-c or --config <FILE> Embedding JSON configuration file.
OPTIONAL ARGUMENTS:
-------------------
-h or --help Show this help message and exit.
-p or --prompt <VAL> Prompt to query. Mutually exclusive with --prompt_file.
--prompt_file <FILE> Prompt to query provided as a file. Mutually exclusive with --prompt.
--output_file <FILE> Output file path to save embedding result. Default file is output.raw.
The output file contains the float buffer returned by the GenieEmbedding_GenerateCallback_t function.
Consult the rank and dimensions for the shape of the output.
See Tutorials for a reference example of how to use the genie-t2e-run tool.
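As a sketch of consuming the saved buffer, the snippet below writes a stand-in output.raw and reads it back as a flat float buffer. The float32 layout and the rank/dimensions used for reshaping are assumptions for illustration; the actual shape must be obtained from the model's rank and dimensions, as noted above.

```python
import struct

# Create a stand-in output.raw: a flat float buffer, as saved by
# genie-t2e-run (float32 is an assumption made for this illustration).
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
with open("output.raw", "wb") as f:
    f.write(struct.pack(f"<{len(values)}f", *values))

# Read the flat buffer back.
with open("output.raw", "rb") as f:
    data = f.read()
flat = list(struct.unpack(f"<{len(data) // 4}f", data))

# Reshape according to the (assumed) rank and dimensions, e.g. 2 x 3;
# the real shape comes from the embedding model, not from the file itself.
rows, cols = 2, 3
embedding = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
print(len(flat), len(embedding))
```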
qnn-genai-transformer-composer¶
Note
The qnn-genai-transformer-composer interface will have a backwards incompatible update in the next release.
The qnn-genai-transformer-composer tool prepares a model for inference via Genie, specifically for the GenAI Transformer backend.
GenAITransformer Composer Tool¶
DESCRIPTION:
------------
Tool to convert a supported LLM model to a binary file consumable by the Genie execution engine.
REQUIRED ARGUMENTS:
-------------------
--model <DIR> Path to the downloaded LLM model directory.
OPTIONAL ARGUMENTS:
-------------------
-h, --help Show this help message and exit.
--config_file <FILE> Path to the generic configuration.json file.
--quantize {Z4,Z4_FP16} <VAL> Quantization type. If not specified, output format will be FP32.
--export_tokenizer_json <VAL> Export the tokenizer model to HuggingFace Fast Tokenizer .json file.
The tokenizer.json file will be written to the path specified via the
--outfile parameter.
--outfile <FILE> Path to write to. Default: the path provided via the --model parameter.
--lora <DIR> Path to the LoRA adapter model directory. If specified, the --quantize option must not be used.
Note
The --export_tokenizer_json option only supports tokenizers for QWen-1 and BaiChuan-1 models.
Note
If the --config_file option is not provided, the composer reads the config.json file in the model directory given via the --model option and fetches the corresponding generic configuration file from the QNN_SDK_ROOT path accordingly.
Generic Configuration File Explanation¶
configuration.json is a JSON file that provides model information to qnn-genai-transformer-composer so that it can prepare the model for inference via Genie.
Model Params Description¶
There are 26 static model parameters, grouped into 5 categories:

- General parameters provide global information about the model.
- Size parameters describe the main model dimensions.
- Architecture parameters provide information about the transformer control flow.
- Operation parameters convey operator details.
- Tensor parameters provide details about the model tensors.
| Param-Name | DataType | Param-Description | Optional/Mandatory | Possible values |
|---|---|---|---|---|
| general.name | String | Model name in a readable form | Mandatory | |
| general.architecture | String | Model global architecture | Optional | generic, llama, qwen, gpt2. Default: generic |
| general.tokenizer | String | Tokenizer to use | Optional | none, gpt2, llama, tiktoken. Default: none |
| general.alignment | Integer | Byte alignment for each tensor within the file | Optional | Default: 32 |
| general.hf_hub_model_id | String | Model identifier in the HuggingFace repository | Optional | |
| general.output | String | Model's output | Optional | logits, embeddings. Default: logits |
| size.vocabulary | Integer | Vocabulary size | Mandatory | |
| size.context | Integer | Maximum number of tokens the transformer will consider to predict the next token | Mandatory | |
| size.embedding | Integer | Length of the embedding vector representing a token | Mandatory | |
| size.feedforward | Integer | Size of the inner layer within the feed-forward network | Mandatory | |
| architecture.num_decoders | Integer | Number of decoder layers | Mandatory | |
| architecture.num_heads | Integer | Number of attention heads | Mandatory | |
| architecture.connector | String | How the attention and feed-forward networks are connected to each other | Mandatory | sequential, parallel |
| architecture.gating | String | Gating type of the transformer | Mandatory | gated, fully-connected |
| architecture.num_kv_heads | Integer | Number of attention heads for keys and values when they differ from queries | Optional | Default: num_heads |
| operation.normalization | String | Normalization operator | Mandatory | layernorm, RMS-norm |
| operation.activation | String | Non-linear activation operator for feed-forward | Mandatory | ReLU, GeLU, SiLU |
| operation.positional_embedding | String | How positional information is handled | Mandatory | WPE, RoPE |
| operation.rope_num_rotations | Integer | Number of elements to be affected by the RoPE operation | Mandatory with "RoPE" | |
| operation.rope_complex_organization | String | How RoPE real and imaginary parts are expected to be organized in memory | Mandatory with "RoPE" when "tensor.kq_complex_organization" is not specified | AoS, SoA |
| operation.rope_scaling | Floating point | Scaling factor for the RoPE operator | Optional with "RoPE" | Default: 10000.0f |
| operation.normalization_epsilon | Floating point | Epsilon for the normalization operator | Optional | Default: 0.000001 |
| operation.attention_mode | String | How the model attends to previous and future tokens | Optional | causal, bidirectional. Default: causal |
| tensor.layer_name | String | The layer name prefix for layer tensors; "(\d+)" regex for the layer number | Mandatory | |
| tensor.kq_complex_organization | String | If RoPE real and imaginary parts are expected to be SoA, convert them to AoS | Mandatory with "RoPE" when "operation.rope_complex_organization" is not specified | SoA |
| name | String | Tensor name used in the model file | Mandatory | |
| transposed | Boolean | Whether the tensor is transposed with respect to the standard matrix multiplication convention, where the matrix is the right-hand side of the multiplication | Optional | Default: False |
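The parameters above can be collected into a configuration.json along the following lines. This is only a sketch: the values are illustrative, and the exact file layout (e.g. whether keys are written in dotted form, as below, or nested per category) should be confirmed against the generic configuration files shipped with the SDK.

```json
{
    "general.name": "example-llama-model",
    "general.architecture": "llama",
    "general.tokenizer": "llama",
    "size.vocabulary": 32000,
    "size.context": 4096,
    "size.embedding": 4096,
    "size.feedforward": 11008,
    "architecture.num_decoders": 32,
    "architecture.num_heads": 32,
    "architecture.connector": "sequential",
    "architecture.gating": "gated",
    "operation.normalization": "RMS-norm",
    "operation.activation": "SiLU",
    "operation.positional_embedding": "RoPE",
    "operation.rope_num_rotations": 128,
    "operation.rope_complex_organization": "AoS",
    "tensor.layer_name": "model.layers.(\\d+)."
}
```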
Model Tensor Description¶
Tensor normalized names are of the form tensor.identifier_weight and tensor.identifier_bias where “identifier” is one of the following:
| Tensor name | Tensor Type | Description |
|---|---|---|
| embedding_token | Model Tensor | Word token embedding tensor |
| embedding_position | Model Tensor | Word position embedding tensor (not expected with "RoPE") |
| embedding_token_type | Model Tensor | Segment distinction within the input sequence |
| embedding_normalization | Model Tensor | Embedding normalization tensor (expected with "sequential_post_layer_normalization") |
| attention_normalization | Layer Tensor | Attention input normalization tensor |
| attention_q | Layer Tensor | Attention query tensor |
| attention_k | Layer Tensor | Attention key tensor |
| attention_v | Layer Tensor | Attention value tensor |
| attention_qkv | Layer Tensor | Fused attention query-key-value tensor |
| attention_output | Layer Tensor | Attention output (projection) tensor |
| feedforward_normalization | Layer Tensor | Feed-forward input (post-attention) normalization tensor (not expected when connector is "parallel") |
| feedforward_gate | Layer Tensor | Feed-forward gate or fully-connected layer tensor |
| feedforward_up | Layer Tensor | Feed-forward up tensor (not expected with "fully-connected") |
| feedforward_output | Layer Tensor | Feed-forward output (projection or down) tensor |
| feedforward_output_normalization | Layer Tensor | Feed-forward output normalization tensor |
| output_normalization | Model Tensor | Output normalization tensor (not expected with "sequential_post_layer_normalization") |
| output | Model Tensor | Output tensor |
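To make the naming scheme concrete, the sketch below matches a layer-tensor name against a hypothetical tensor.layer_name prefix, where the "(\d+)" group captures the decoder layer number. The prefix and tensor name are illustrative, not taken from a real model file.

```python
import re

# Hypothetical layer-name prefix; the "(\d+)" group captures the decoder
# layer number, per the tensor.layer_name parameter described above.
layer_name_prefix = r"model\.layers\.(\d+)\."

# Illustrative layer-tensor name as it might appear in a model file.
tensor_name = "model.layers.17.attention_q.weight"

match = re.match(layer_name_prefix, tensor_name)
layer_index = int(match.group(1))

# Normalized names follow the tensor.identifier_weight / tensor.identifier_bias
# pattern, where "identifier" is one of the entries in the table above,
# e.g. attention_q or feedforward_gate.
normalized = "tensor.attention_q_weight"
print(layer_index, normalized)
```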