Tools

This page describes the tools provided in Genie.

genie-t2t-run

Note

The genie-t2t-run interface will have a backwards incompatible update in the next release.

Note

The -tok/--tokens_file (token-to-token feature) option is currently supported for the basic dialog type only.

The genie-t2t-run tool is a test application that performs text-to-text inference on a provided LLM network. It takes a user prompt as text and returns the result as text.

Genie T2T Run Application (diagram)

DESCRIPTION:
------------
Tool for text to text inference of LLMs using Genie.


REQUIRED ARGUMENTS:
-------------------

-c or --config                        <FILE>      Dialog JSON configuration file.

OPTIONAL ARGUMENTS:
-------------------
-h or --help                                      Show this help message and exit.

-p or --prompt                        <VAL>       Prompt to query. Mutually exclusive with --prompt_file.

--prompt_file                         <FILE>      Prompt to query provided as a file. Mutually exclusive with --prompt.

-e PATH or --embedding_file           <FILE>      Input embeddings provided as a file. Mutually exclusive with --prompt, --prompt_file and --tokens_file.
                                                  TYPE, SCALE, and OFFSET are optional parameters representing the model's input quantization encodings.
                                                  Required for lookup table requantization. Valid values of TYPE are int8, int16, uint8, uint16. The
                                                  signedness must be consistent with the lookup table encodings.

-t PATH or --embedding_table          <FILE>      Token-to-Embedding lookup table provided as a file. Mutually exclusive with --prompt and --prompt_file.
                                                  "TYPE, SCALE, and OFFSET are optional parameters representing the lookup table's quantization encodings.
                                                  Required for lookup table requantization. Valid values of TYPE are int8, int16, uint8, uint16. The
                                                  signedness must be consistent with the input layer encodings.

-l or --lora                          <VAL>       ADAPTER_NAME,ALPHA_NAME,ALPHA_VALUE Apply a LoRA adapter to a dialog.

-tok PATH or --tokens_file            <FILE>      Input tokens provided as a file (supported format: .txt). Mutually exclusive with --prompt, --prompt_file and --embedding_file.
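Requantization here means re-expressing the lookup table's integer values in the input layer's integer encoding. A minimal sketch follows, assuming the convention float = (quantized + offset) * scale; the actual convention, scales, and offsets must come from your model's encodings, and (as noted above) the signedness of the two encodings must be consistent:

```python
import numpy as np

def requantize(values, src_scale, src_offset, dst_scale, dst_offset, dst_dtype):
    """Requantize integer values from one encoding to another.

    Assumes the convention float = (quantized + offset) * scale; check this
    against your model's actual quantization encodings.
    """
    real = (values.astype(np.float64) + src_offset) * src_scale   # dequantize
    q = np.round(real / dst_scale - dst_offset)                   # requantize
    info = np.iinfo(dst_dtype)
    return np.clip(q, info.min, info.max).astype(dst_dtype)

# Hypothetical uint16 lookup-table row requantized to int8 for the input layer.
row = np.array([0, 32768, 65535], dtype=np.uint16)
row_int8 = requantize(row, src_scale=1e-4, src_offset=-32768,
                      dst_scale=3e-2, dst_offset=0, dst_dtype=np.int8)
```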

See Tutorials for a reference example of how to use the genie-t2t-run tool.

Embedding Vector Lookup Table

For Embedding-to-Text models, genie-t2t-run accepts an embedding vector lookup table file to convert tokens to embedding vectors. The file is a two-dimensional array in row-major order and raw binary format. The row count should be equal to the tokenizer vocabulary size and the column count should be equal to the embedding vector length. The datatype of the array is determined by the embedding::datatype Dialog JSON configuration.

Note

If a lookup table is not provided, genie-t2t-run can only generate the result of the first output token.

genie-t2e-run

The genie-t2e-run tool is a test application that performs text-to-embedding inference on a provided embedding model. It takes a user prompt as text and writes the result to an embedding file.

DESCRIPTION:
------------
Tool for text to embedding inference of encoder models using Genie.


REQUIRED ARGUMENTS:
-------------------

-c or --config                        <FILE>      Embedding JSON configuration file.

OPTIONAL ARGUMENTS:
-------------------
-h or --help                                      Show this help message and exit.

-p or --prompt                        <VAL>       Prompt to query. Mutually exclusive with --prompt_file.

--prompt_file                         <FILE>      Prompt to query provided as a file. Mutually exclusive with --prompt.

--output_file                         <FILE>      Output file path to save the embedding result. Default: output.raw.
                                                  The output file contains the float buffer returned by the GenieEmbedding_GenerateCallback_t function;
                                                  consult the reported rank and dimensions for the shape of the output.
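Reading the saved buffer back might look like the following minimal sketch; the shape here is hypothetical and must instead be taken from the rank and dimensions reported for the callback's buffer:

```python
import numpy as np

# Stand-in for the file genie-t2e-run would produce with --output_file output.raw.
np.arange(8, dtype=np.float32).tofile("output.raw")

# Hypothetical shape; take the real rank and dimensions from the values
# reported for the GenieEmbedding_GenerateCallback_t buffer.
DIMS = (2, 4)

embedding = np.fromfile("output.raw", dtype=np.float32).reshape(DIMS)
```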

See Tutorials for a reference example of how to use the genie-t2e-run tool.

qnn-genai-transformer-composer

Note

The qnn-genai-transformer-composer interface will have a backwards incompatible update in the next release.

The qnn-genai-transformer-composer tool prepares a model for inference via Genie, specifically for the Gen AI Transformer backend.

GenAITransformer Composer Tool (diagram)

DESCRIPTION:
------------
Tool to convert a supported LLM model to a binary file consumable by Genie execution engine.

REQUIRED ARGUMENTS:
-------------------
--model                        <DIR>           Path to the downloaded LLM model directory.

OPTIONAL ARGUMENTS:
-------------------

-h, --help                                    Show this help message and exit.

--config_file                  <DIR>          Path to the generic configuration.json

--quantize {Z4,Z4_FP16}        <VAL>          Quantization type.  If not specified, output format will be FP32.

--export_tokenizer_json        <VAL>          Export the tokenizer model to HuggingFace Fast Tokenizer .json file.
                                              The tokenizer.json file will be written to the path specified via the
                                              --outfile parameter.

--outfile OUTFILE              <FILE>         Path to write to; default: path provided in --model parameter.

--lora                         <DIR>          Path to the LoRA adapter model directory; if specified, the --quantize option must not be specified.

Note

The --export_tokenizer_json option only supports tokenizers for QWen-1 and BaiChuan-1 models.

Note

If the --config_file option is not provided, the composer reads the config.json file in the model directory given by the --model option and fetches the corresponding generic configuration file from the QNN_SDK_ROOT path.

Generic Configuration File Explanation

configuration.json is a JSON file that provides information about the model to qnn-genai-transformer-composer so that it can prepare the model for inference via Genie.

Model Params Description

There are 26 static model parameters in 5 categories:

  • General parameters provide global information about the model.

  • Size parameters inform about the main model dimensions.

  • Architecture parameters provide information about the transformer control flow.

  • Operation parameters convey operator details.

  • Tensor parameters provide details about the model tensors.

Each parameter below is listed as Param-Name (DataType, Optional/Mandatory), followed by its description and, where applicable, its possible values and default.

  • general.name (String, Mandatory): Model name in a readable form.
  • general.architecture (String, Optional): Model global architecture. Possible values: generic, llama, qwen, gpt2. Default: generic.
  • general.tokenizer (String, Optional): Tokenizer to use. Possible values: none, gpt2, llama, tiktoken. Default: none.
  • general.alignment (Integer, Optional): Byte alignment for each tensor within the file. Default: 32.
  • general.hf_hub_model_id (String, Optional): Model identifier in the HuggingFace repository.
  • general.output (String, Optional): Model's output. Possible values: logits, embeddings. Default: logits.
  • size.vocabulary (Integer, Mandatory): Vocabulary size.
  • size.context (Integer, Mandatory): Maximum number of tokens the transformer will consider to predict the next token.
  • size.embedding (Integer, Mandatory): Length of the embedding vector representing a token.
  • size.feedforward (Integer, Mandatory): Size of the inner layer within the feed-forward network.
  • architecture.num_decoders (Integer, Mandatory): Number of decoder layers.
  • architecture.num_heads (Integer, Mandatory): Number of attention heads.
  • architecture.connector (String, Mandatory): How the attention and feed-forward networks are connected to each other. Possible values: sequential, parallel.
  • architecture.gating (String, Mandatory): Gating type of the transformer. Possible values: gated, fully-connected.
  • architecture.num_kv_heads (Integer, Optional): Number of attention heads for keys and values when they differ from queries. Default: num_heads.
  • operation.normalization (String, Mandatory): Normalization operator. Possible values: layernorm, RMS-norm.
  • operation.activation (String, Mandatory): Non-linear activation operator for the feed-forward network. Possible values: ReLU, GeLU, SiLU.
  • operation.positional_embedding (String, Mandatory): How positional information is handled. Possible values: WPE, RoPE.
  • operation.rope_num_rotations (Integer, Mandatory with "RoPE"): Number of elements affected by the RoPE operation.
  • operation.rope_complex_organization (String, Mandatory with "RoPE" when "tensor.kq_complex_organization" is not specified): How the RoPE real and imaginary parts are expected to be organized in memory. Possible values: AoS, SoA.
  • operation.rope_scaling (Floating point, Optional with "RoPE"): Scaling factor for the RoPE operator. Default: 10000.0f.
  • operation.normalization_epsilon (Floating point, Optional): Epsilon for the normalization operator. Default: 0.000001.
  • operation.attention_mode (String, Optional): How the model attends to previous and future tokens. Possible values: causal, bidirectional. Default: causal.
  • tensor.layer_name (String, Mandatory): The layer name prefix for layer tensors; the "(\d+)" regex stands for the layer number.
  • tensor.kq_complex_organization (String, Mandatory with "RoPE" when "operation.rope_complex_organization" is not specified): If the RoPE real and imaginary parts are expected to be SoA, convert them to AoS. Possible value: SoA.
  • name (String, Mandatory): Tensor name used in the model file.
  • transposed (Boolean, Optional): Whether the tensor is transposed with respect to the standard matrix multiplication convention in linear algebra when the matrix is on the right-hand side of the multiplication. Default: False.
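To make the format concrete, a hypothetical configuration.json sketch using a subset of the parameters above might look as follows. All values are illustrative rather than taken from a real model, and the flat dotted-key layout is an assumption; consult the generic configuration files shipped in the QNN SDK for the authoritative structure:

```json
{
  "general.name": "example-llama-7b",
  "general.architecture": "llama",
  "general.tokenizer": "llama",
  "general.output": "logits",
  "size.vocabulary": 32000,
  "size.context": 4096,
  "size.embedding": 4096,
  "size.feedforward": 11008,
  "architecture.num_decoders": 32,
  "architecture.num_heads": 32,
  "architecture.connector": "sequential",
  "architecture.gating": "gated",
  "operation.normalization": "RMS-norm",
  "operation.activation": "SiLU",
  "operation.positional_embedding": "RoPE",
  "operation.rope_num_rotations": 128,
  "operation.rope_complex_organization": "AoS",
  "tensor.layer_name": "model.layers.(\\d+)."
}
```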

Model Tensor Description

Tensor normalized names are of the form tensor.identifier_weight and tensor.identifier_bias, where "identifier" is one of the following:

  • embedding_token (Model Tensor): Word token embedding tensor.
  • embedding_position (Model Tensor): Word position embedding tensor (not expected with "RoPE").
  • embedding_token_type (Model Tensor): Segment distinction within the input sequence.
  • embedding_normalization (Model Tensor): Embedding normalization tensor (expected with "sequential_post_layer_normalization").
  • attention_normalization (Layer Tensor): Attention input normalization tensor.
  • attention_q (Layer Tensor): Attention query tensor.
  • attention_k (Layer Tensor): Attention key tensor.
  • attention_v (Layer Tensor): Attention value tensor.
  • attention_qkv (Layer Tensor): Fused attention query-key-value tensor.
  • attention_output (Layer Tensor): Attention output or projection tensor.
  • feedforward_normalization (Layer Tensor): Feed-forward input or post-attention normalization tensor (not expected when connector is "parallel").
  • feedforward_gate (Layer Tensor): Feed-forward gate or fully-connected layer tensor.
  • feedforward_up (Layer Tensor): Feed-forward up tensor (not expected with "fully-connected").
  • feedforward_output (Layer Tensor): Feed-forward output, projection, or down tensor.
  • feedforward_output_normalization (Layer Tensor): Feed-forward output normalization tensor.
  • output_normalization (Model Tensor): Output normalization tensor (not expected with "sequential_post_layer_normalization").
  • output (Model Tensor): Output tensor.
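As an illustration of how the "(\d+)" placeholder in tensor.layer_name locates the layer number, here is a minimal sketch; the prefix and the HuggingFace-style tensor name below are hypothetical, not taken from any particular model:

```python
import re

# Hypothetical tensor.layer_name value; "(\d+)" captures the layer number
# (dots are escaped here so the pattern matches them literally).
layer_name = r"model\.layers\.(\d+)\."

# Hypothetical tensor name as it might appear in a downloaded model file.
match = re.match(layer_name, "model.layers.12.self_attn.q_proj.weight")
layer = int(match.group(1))  # layer number extracted from the tensor name
```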