qnn-genai-transformer-composer

The qnn-genai-transformer-composer tool prepares a model for inference via Genie, specifically for the Gen AI Transformer Backend.

[Figure: GenAITransformer Composer diagram]

GenAITransformer Composer Tool

DESCRIPTION:
------------
Tool to convert a supported LLM model to a binary file consumable by the Genie execution engine.

REQUIRED ARGUMENTS:
-------------------
--model                        <DIR>           Path to the downloaded LLM model directory.

OPTIONAL ARGUMENTS:
-------------------

-h, --help                                    Show this help message and exit.

--config_file                  <DIR>          Path to the generic configuration.json

--quantize {Z4,Z4_FP16,Z8,Q4}  <VAL>          Quantization type. If not specified, the output format is FP32.
                                              Q4 uses blocks of 32 elements; it provides the highest accuracy,
                                              but the lowest performance. Z4 and Z8 use blocks of 128 elements;
                                              their accuracy is sufficient for most models, with Z8 giving the
                                              highest performance.

--outfile OUTFILE              <FILE>         Path to write to; default: path provided in --model parameter.

--lora                         <DIR>          Path to the LoRA adapter model directory. If specified, the
                                              --quantize option must not be specified.

--lm_head_precision            <VAL>          Precision for the lm_head (output.weight) tensor. "FP_32" is supported.

--export_tokenizer_json                       Exports the tokenizer model to a HuggingFace Fast Tokenizer .json
                                              file. The tokenizer.json file is written to the path specified via
                                              the --outfile parameter.

--dump_lut                                    Dumps the token embedding weight as LUT.bin in the path specified via
                                              the --outfile parameter.

Note

The --export_tokenizer_json option supports tokenizers for QWen-1, BaiChuan-1, Mistral, wt19-en-de and mx-translation models.

Note

If the --config_file option is not provided, the composer reads the config.json file in the downloaded model path given via the --model option and fetches the corresponding generic configuration file from the QNN_SDK_ROOT path.
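As an illustration, a typical invocation might look like the following. The flags come from the reference above, but the model name and all paths are hypothetical placeholders:

```shell
# Quantize a downloaded LLM to Z4, write the binary to ./output, and also
# export a HuggingFace Fast Tokenizer JSON (hypothetical paths throughout).
qnn-genai-transformer-composer \
    --model ./models/my-llm \
    --config_file ./configs/configuration.json \
    --quantize Z4 \
    --export_tokenizer_json \
    --outfile ./output/my-llm.bin
```

Since --quantize and --lora are mutually exclusive, a LoRA run would drop --quantize and add --lora pointing at the adapter directory.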

Generic Configuration File Explanation

configuration.json is a JSON file that provides the qnn-genai-transformer-composer with information about the model, so that the model can be prepared for inference via Genie.

Model Params Description

There are 26 static params of the model, grouped under 5 categories:

  • General parameters provide global information about the model.

  • Size parameters describe the main model dimensions.

  • Architecture parameters describe the transformer control flow.

  • Operation parameters convey operator details.

  • Tensor parameters provide details about the model tensors.

| Param-Name | DataType | Param-Description | Optional/Mandatory | Possible values |
| --- | --- | --- | --- | --- |
| general.name | String | Model name in a readable form | Mandatory | |
| general.architecture | String | Model global architecture | Optional | generic, llama, qwen, gpt2. Default: generic |
| general.tokenizer | String | Tokenizer to use | Optional | none, gpt2, llama, tiktoken. Default: none |
| general.alignment | Integer | Byte alignment for each tensor within the file | Optional | Default: 32 |
| general.tokenizer.bos_token_id | Integer | BOS token ID | Mandatory with Encoder-Decoder models | |
| general.hf_hub_model_id | String | Model identifier in the HuggingFace repository | Optional | |
| general.output | String | Model's output | Optional | logits, embeddings. Default: logits |
| size.vocabulary | Integer | Vocabulary size | Mandatory | |
| size.context | Integer | Maximum number of tokens the transformer considers when predicting the next token | Mandatory | |
| size.embedding | Integer | Length of the embedding vector representing a token | Mandatory | |
| size.embedding_per_head | Integer | Length of the embedding vector per head | Optional | Default: size.embedding / architecture.num_heads |
| size.feedforward | Integer | Size of the inner layer within the feed-forward network | Mandatory | |
| architecture.num_decoders | Integer | Number of decoder layers | Mandatory | |
| architecture.num_heads | Integer | Number of attention heads | Mandatory | |
| architecture.num_kv_heads | Integer | Number of key-value attention heads in case of grouped-query attention | Optional | Default: architecture.num_heads |
| architecture.connector | String | How the attention and feed-forward networks are connected to each other | Mandatory | sequential, parallel |
| architecture.gating | String | Gating type of the transformer | Mandatory | gated, fully-connected |
| operation.normalization | String | Normalization operator | Mandatory | layernorm, RMS-norm |
| operation.activation | String | Non-linear activation operator for the feed-forward network | Mandatory | ReLU, GeLU, SiLU |
| operation.positional_embedding | String | How positional information is handled | Mandatory | WPE, RoPE |
| operation.rope_num_rotations | Integer | Number of elements affected by the RoPE operation | Mandatory with "RoPE" | |
| operation.rope_complex_organization | String | How RoPE real and imaginary parts are expected to be organized in memory | Mandatory with "RoPE" when tensor.kq_complex_organization is not specified | AoS, SoA |
| operation.rope_scaling | Floating point | Scaling factor for the RoPE operator | Optional with "RoPE" | Default: 10000.0 |
| operation.rope.scaling.config | Dictionary | RoPE scaling config | Optional with "RoPE" | |
| operation.rope.scaling.factor | Array | An array of RoPE scaling factors | Optional | |
| operation.normalization_epsilon | Floating point | Epsilon for the normalization operator | Optional | Default: 0.000001 |
| operation.attention_mode | String | How the model attends to previous and future tokens | Optional | causal, bidirectional. Default: causal |
| operation.word_position_embedding_offset | Integer | Offset for WPE | Mandatory with "WPE" if not 0 | |
| tensor.layer_name | String | Layer-name prefix for layer tensors; "(\d+)" serves as the regex placeholder for the layer number | Mandatory | |
| tensor.kq_complex_organization | String | If RoPE real and imaginary parts are stored as SoA, convert them to AoS | Mandatory with "RoPE" when operation.rope_complex_organization is not specified | SoA |
| name | String | Tensor name used in the model file | Mandatory | |
| transposed | Boolean | Whether the tensor is transposed with respect to the standard linear-algebra matrix-multiplication convention when the matrix is on the right-hand side of the multiplication | Optional | Default: False |
| scale | Floating point | Multiplicative coefficient applied to each tensor element before processing | Optional | Default: 1.0 |
| offset | Floating point | Additive coefficient applied to each tensor element before processing | Optional | Default: 0.0 |
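To make the parameters concrete, below is a minimal configuration.json sketch for a hypothetical llama-style decoder. All values are illustrative, the flat dotted-key layout is an assumption (the exact JSON nesting is model-specific), and the key set a real model needs follows from the table above:

```json
{
  "general.name": "my-llama-7b",
  "general.architecture": "llama",
  "general.tokenizer": "llama",
  "size.vocabulary": 32000,
  "size.context": 4096,
  "size.embedding": 4096,
  "size.feedforward": 11008,
  "architecture.num_decoders": 32,
  "architecture.num_heads": 32,
  "architecture.connector": "sequential",
  "architecture.gating": "gated",
  "operation.normalization": "RMS-norm",
  "operation.activation": "SiLU",
  "operation.positional_embedding": "RoPE",
  "operation.rope_num_rotations": 128,
  "operation.rope_complex_organization": "AoS",
  "tensor.layer_name": "model.layers.(\\d+)."
}
```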

Model Tensor Description

Tensor normalized names are of the form tensor.identifier_weight and tensor.identifier_bias where “identifier” is one of the following:

| Tensor name | Tensor Type | Description |
| --- | --- | --- |
| embedding_token | Model Tensor | Word token embedding tensor |
| embedding_position | Model Tensor | Word position embedding tensor (not expected with "RoPE") |
| embedding_token_type | Model Tensor | Segment distinction within the input sequence |
| embedding_normalization | Model Tensor | Embedding normalization tensor (expected with "sequential_post_layer_normalization") |
| attention_normalization | Layer Tensor | Attention input normalization tensor |
| attention_q | Layer Tensor | Attention query tensor |
| attention_k | Layer Tensor | Attention key tensor |
| attention_v | Layer Tensor | Attention value tensor |
| attention_qkv | Layer Tensor | Fused attention query-key-value tensor |
| attention_output | Layer Tensor | Attention output or projection tensor |
| cross_attention_normalization | Layer Tensor | Cross-attention normalization tensor |
| cross_attention_q | Layer Tensor | Cross-attention query tensor |
| cross_attention_k | Layer Tensor | Cross-attention key tensor |
| cross_attention_v | Layer Tensor | Cross-attention value tensor |
| cross_attention_output | Layer Tensor | Cross-attention output or projection tensor |
| feedforward_normalization | Layer Tensor | Feed-forward input or post-attention normalization tensor (not expected when connector is "parallel") |
| feedforward_gate | Layer Tensor | Feed-forward gate or fully-connected layer tensor |
| feedforward_up | Layer Tensor | Feed-forward up tensor (not expected with "fully-connected") |
| feedforward_up_gate | Layer Tensor | Fused feed-forward up and gate tensor (not expected with "fully-connected") |
| feedforward_output | Layer Tensor | Feed-forward output, projection, or down tensor |
| feedforward_output_normalization | Layer Tensor | Feed-forward output normalization tensor |
| output_normalization | Model Tensor | Output normalization tensor (not expected with "sequential_post_layer_normalization") |
| output | Model Tensor | Output tensor |
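Layer tensors repeat once per decoder layer, and the "(\d+)" placeholder in tensor.layer_name is how the layer index is recovered from a source-model tensor name. The sketch below illustrates this mechanism; the llama-style prefix is a hypothetical example, not something mandated by the composer:

```python
import re

# Hypothetical llama-style layer-name prefix; substitute your model's
# actual prefix. "(\d+)" captures the decoder-layer number.
LAYER_NAME = r"model\.layers\.(\d+)\."

def layer_index(tensor_name: str):
    """Return the decoder-layer index for a layer tensor, or None for a
    model-level tensor such as the token embedding or the output head."""
    m = re.match(LAYER_NAME, tensor_name)
    return int(m.group(1)) if m else None

print(layer_index("model.layers.0.self_attn.q_proj.weight"))   # → 0
print(layer_index("model.layers.31.mlp.down_proj.weight"))     # → 31
print(layer_index("model.embed_tokens.weight"))                # → None
```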

RoPE Scaling Config Description

The operation.rope.scaling.config key in the Model params table is a dictionary whose keys depend on the RoPE type. The following sections describe the keys for each RoPE type.

RoPE Type llama3

| Key | Type | Mandatory | Default value |
| --- | --- | --- | --- |
| factor | Floating point | True | |
| high_freq_factor | Floating point | True | |
| low_freq_factor | Floating point | True | |
| original_max_position_embeddings | Integer | True | |
| type | String | True | llama3 |
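As an illustration, a llama3-type operation.rope.scaling.config entry might look like the following. The keys come from the table above; the numeric values are illustrative, taken from typical Llama-3.1-style checkpoints rather than from this document:

```json
{
  "factor": 8.0,
  "high_freq_factor": 4.0,
  "low_freq_factor": 1.0,
  "original_max_position_embeddings": 8192,
  "type": "llama3"
}
```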

RoPE Type yarn

| Key | Type | Mandatory | Default value |
| --- | --- | --- | --- |
| factor | Floating point | True | |
| original_max_position_embeddings | Integer | True | |
| attention_factor | Floating point | False | |
| beta_fast | Floating point | False | 32.0 |
| beta_slow | Floating point | False | 1.0 |
| type | String | True | yarn |

RoPE Type longrope

| Key | Type | Mandatory | Default value |
| --- | --- | --- | --- |
| long_factor | Array | True | |
| short_factor | Array | True | |
| original_max_position_embeddings | Integer | True | |
| attention_factor | Floating point | False | |
| type | String | True | longrope |
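A longrope-type config differs from the other types in that long_factor and short_factor are arrays of per-frequency scaling factors. The fragment below is a sketch with the arrays truncated for readability; in a real model each array typically carries one entry per rotary frequency, and all values shown are illustrative:

```json
{
  "long_factor": [1.0, 1.05, 1.17, 1.46],
  "short_factor": [1.0, 1.0, 1.01, 1.02],
  "original_max_position_embeddings": 4096,
  "type": "longrope"
}
```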