Linting Profile

Brief

Linting mode is a performance profiling configuration for ops running on the HTP backend. The detailed profiling report provides per-op profiling results in cycle counts instead of time in microseconds. There is no direct conversion from cycle counts to microseconds because ops execute in parallel. The per-op cycle counts are therefore best used as a relative measure, to compare which ops take fewer or more cycles to finish executing. Assuming the HTP backend prerequisite is met, Linting mode is activated by passing --profiling_level=linting to snpe-net-run, or by calling the Snpe_SNPEBuilder_SetProfilingLevel API to set the profiling level to SNPE_PROFILING_LEVEL_LINTING.
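
As a rough illustration, the command line could be assembled as in the Python sketch below. Only --profiling_level=linting comes from this section; the --container, --input_list and --use_dsp options are assumed from typical snpe-net-run usage, and the file names are placeholders.

```python
import shlex

def linting_command(container, input_list):
    """Build an snpe-net-run argument list with Linting profiling enabled."""
    return [
        "snpe-net-run",
        "--container", container,      # assumed option: path to the model container
        "--input_list", input_list,    # assumed option: file listing the inputs
        "--use_dsp",                   # assumed option: run on the DSP/HTP runtime
        "--profiling_level=linting",   # the option named in this section
    ]

# Placeholder file names, for illustration only.
cmd = linting_command("model.dlc", "input_list.txt")
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run unchanged, which avoids shell quoting issues.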

Linting Profile Metrics

On the main thread, each op has to wait for some cycles since the execution of the last op before the start of its own execution. This wait period can be attributed to various factors such as scheduling or waiting for some background HVX or DMA activity to finish. Linting profiling provides the following diagnostic entries per HTP op:

  • Wait: The “Wait” entry is a foreground execution descriptor that denotes the number of cycles spent waiting on the main thread since the previous op that ran on the main thread.

  • Overlap: The “Overlap” entry is a background execution descriptor that denotes the number of cycles spent on at least one background op while this op is executing on the main thread.

  • Overlap (wait): The “Overlap (wait)” entry is a background execution descriptor. It is similar to the “Overlap” entry, except that the cycles reported in this entry correspond to the “Wait” period (i.e. cycles spent on at least one background op while the main thread was waiting).

  • Resources: The “Resources” entry lists the different resources used by the given op, namely some combination of HVX, HMX, and DMA.

Background ops that are being waited on by main thread ops are not considered as background activity and as such do not contribute to the counts reported by the overlap entries. Each of the overlap entries also has up to 10 indented lines following it indicating the names of the ops that contributed to the respective overlap cycle count. Please refer to the model optimization example below to see samples of how snpe-diagview displays the aforementioned Linting profile metrics.
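
The entries described above can be collected into one record per op. The Python sketch below parses the "Layer Times" text format shown in the example later in this section; the function name, the record layout, and the regular expressions are illustrative assumptions based on that sample, not part of any SNPE tool.

```python
import re

# Patterns for the three kinds of lines in a "Layer Times" listing.
OP_RE = re.compile(r"^(\d+): (.+) \(cycles\) : (\d+) cycles : (\w+)$")
METRIC_RE = re.compile(
    r"^(Wait \(Scheduler\)|Overlap \(wait\)|Overlap) time: (\d+) cycles$")
RESOURCES_RE = re.compile(r"^Resources:(.*)$")

def parse_layer_times(text):
    """Return one dict per op with its cycles, metrics, contributors and resources."""
    ops = []
    contributors = None  # list receiving the indented op names, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = OP_RE.match(line)
        if m:
            ops.append({"index": int(m.group(1)), "name": m.group(2),
                        "cycles": int(m.group(3)), "metrics": {},
                        "contributors": {}, "resources": []})
            contributors = None
            continue
        if not ops:
            continue  # header lines before the first op
        m = METRIC_RE.match(line)
        if m:
            ops[-1]["metrics"][m.group(1)] = int(m.group(2))
            contributors = ops[-1]["contributors"].setdefault(m.group(1), [])
            continue
        m = RESOURCES_RE.match(line)
        if m:
            ops[-1]["resources"] = [r.strip() for r in m.group(1).split(",")
                                    if r.strip()]
            contributors = None
            continue
        if contributors is not None:
            contributors.append(line)  # name of a contributing background op
    return ops

# One op copied from the Showcase Model 1 output in this section.
SAMPLE = """
  2: model_convStart_Conv2D:OpId_21 (cycles) : 147075 cycles : DSP
    Wait (Scheduler) time: 32 cycles
    Overlap time: 85292 cycles
      model_sub_sub:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 32 cycles
      model_convStart_Conv2D:OpId_21
    Resources: HVX, HMX, DMA
"""

parsed = parse_layer_times(SAMPLE)
print(parsed[0]["name"], parsed[0]["metrics"], parsed[0]["resources"])
```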

Chrometrace

Like its sibling profiling levels, Linting profile metrics are averaged across all inputs used during inference and can be viewed using the snpe-diagview tool. However, one advantage of Linting profile is the ability to export chrometrace JSON files, which can be used to visualize the op foreground and background execution and overlaps detailed by the Linting profile metrics.
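
Chrometrace files are normally opened in a trace viewer, but they can also be inspected programmatically. The Python sketch below assumes the exported file follows the common Chrome Trace Event Format (either a bare list of events or an object with a "traceEvents" list, with complete events marked "ph": "X" and carrying a "dur" field). The exact event layout that snpe-diagview emits is not specified here, and the sample events are synthetic.

```python
import json
from collections import defaultdict

def summarize_trace(trace):
    """Sum the durations of complete ("X") events per event name, largest first."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event.get("name", "?")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Synthetic events for illustration; not real snpe-diagview output.
sample = json.loads("""
[
  {"name": "conv", "ph": "X", "ts": 0,   "dur": 120, "pid": 1, "tid": 1},
  {"name": "add",  "ph": "X", "ts": 40,  "dur": 30,  "pid": 1, "tid": 2},
  {"name": "conv", "ph": "X", "ts": 130, "dur": 20,  "pid": 1, "tid": 1}
]
""")

summary = summarize_trace(sample)
for name, total in summary:
    print(name, total)
```

For a real file, the argument would be the result of json.load on the exported chrometrace.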

Model Optimization Example

In this section, we walk through an example of how Linting mode and chrometraces can be used to address a bottleneck in a simple network. The Showcase Model 1 diagram illustrates a model with two branches, each performing a couple of convolutions before their results are combined by a sub op.

Showcase Model 1

Linting Profiling Showcase Model 1

The linting profiling output by snpe-diagview for this model is given below:

...

Per-Graph Execution Times:
---------------
HTP Subnet 0: 4327266 cycles

Layer Times:
---------------
  0: Input OpId_2 (cycles) : 0 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 0 cycles
    Overlap (wait) time: 0 cycles
    Resources:
  1: OpId_0 (cycles) : 8036 cycles : DSP
    Wait (Scheduler) time: 629 cycles
    Overlap time: 4770 cycles
    Overlap (wait) time: 565 cycles
    Resources:
  2: model_convStart_Conv2D:OpId_21 (cycles) : 147075 cycles : DSP
    Wait (Scheduler) time: 32 cycles
    Overlap time: 85292 cycles
      model_sub_sub:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 32 cycles
      model_convStart_Conv2D:OpId_21
    Resources: HVX, HMX, DMA
  3: model_tf_op_layer_stride_stride:OpId_24 (cycles) : 146494 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 70807 cycles
      model_add_add:OpId_58
      Output OpId_3
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  4: model_convLeft1_Conv2D:OpId_34 (cycles) : 288249 cycles : DSP
    Wait (Scheduler) time: 425 cycles
    Overlap time: 195988 cycles
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 304 cycles
      Output OpId_3
      model_add_add:OpId_58
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  5: model_convRight1_Conv2D:OpId_41 (cycles) : 220391 cycles : DSP
    Wait (Scheduler) time: 803 cycles
    Overlap time: 135268 cycles
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 557 cycles
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  6: model_convRight2_Conv2D:OpId_48 (cycles) : 181016 cycles : DSP
    Wait (Scheduler) time: 1090 cycles
    Overlap time: 69323 cycles
      model_sub_sub:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
    Overlap (wait) time: 489 cycles
      model_sub_sub:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
    Resources: HMX, DMA
  7: model_convLeft2_Conv2D:OpId_55 (cycles) : 233736 cycles : DSP
    Wait (Scheduler) time: 1059 cycles
    Overlap time: 93020 cycles
      model_sub_sub:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 464 cycles
      model_sub_sub:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
    Resources: HMX, DMA
  8: model_sub_sub:OpId_57 (cycles) : 2165162 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 465046 cycles
      model_sub_sub:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  9: model_add_add:OpId_58 (cycles) : 525971 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 481468 cycles
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
    Overlap (wait) time: 0 cycles
    Resources: HVX
  10: Output OpId_3 (cycles) : 407091 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 115120 cycles
    Overlap (wait) time: 0 cycles
    Resources: HVX

The linting profiling chrometrace output for this model is given below:

Showcase Model 1 Chrometrace

Linting Profiling Showcase Model 1 Chrometrace

From the output, it is evident that the sub op (OpId_57) is the most significant contributor to the total execution time, at around 50%. This op also has little parallel op execution: its Overlap time is 465046 cycles, which is about 21.5% of its total execution time. Together, these indicate that this op is a bottleneck worth optimizing. We can design an equivalent model, as shown in the Showcase Model 1 Optimized diagram, by merging the two branches and replacing the sub op with a convolution whose weights are manually designed so that it performs the same task as the sub op.
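
The percentages quoted above follow directly from the cycle counts in the profile. The Python sketch below recomputes them for a few ops from the Showcase Model 1 output; the helper name and the choice of ops are illustrative.

```python
graph_cycles = 4327266  # "HTP Subnet 0" total for Showcase Model 1

# (op name, op cycles, Overlap cycles), copied from the profile above
ops = [
    ("model_convLeft1_Conv2D:OpId_34", 288249, 195988),
    ("model_sub_sub:OpId_57", 2165162, 465046),
    ("model_add_add:OpId_58", 525971, 481468),
    ("Output OpId_3", 407091, 115120),
]

def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

# Rank ops by their share of the graph total, largest first.
report = sorted(
    ((name, percent(cycles, graph_cycles), percent(overlap, cycles))
     for name, cycles, overlap in ops),
    key=lambda row: row[1],
    reverse=True,
)

for name, share, overlapped in report:
    print(f"{name}: {share:.1f}% of graph cycles, {overlapped:.1f}% overlapped")
# first line: model_sub_sub:OpId_57: 50.0% of graph cycles, 21.5% overlapped
```

An op that combines a large share of the graph total with a low overlapped percentage is the kind of candidate this section treats as a bottleneck.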

Showcase Model 1 Optimized

Linting Profiling Showcase Model 1 Optimized
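
One possible way to make a convolution behave like a sub op is sketched below in plain Python: the two branch outputs are concatenated along the channel axis and passed through a 1x1 convolution whose weights are +1 for a channel of the first branch and -1 for the matching channel of the second. This is an assumed construction for illustration; the weights used in the actual showcase model are not given in this section, and the tensor values are made up.

```python
def conv1x1(pixels, weights):
    """1x1 convolution: each output channel is a weighted sum of one pixel's input channels."""
    return [[sum(w * x for w, x in zip(row, pixel)) for row in weights]
            for pixel in pixels]

def sub_weights(channels):
    """Weight matrix [I | -I]: output c = input c minus input (c + channels)."""
    return [[1 if i == c else 0 for i in range(channels)] +
            [-1 if i == c else 0 for i in range(channels)]
            for c in range(channels)]

# Two pixels with two channels each, per branch (made-up values).
left = [[5, 7], [1, 2]]
right = [[3, 10], [4, 2]]

# Concatenate the branches along the channel axis, then apply the convolution.
merged = [l + r for l, r in zip(left, right)]
result = conv1x1(merged, sub_weights(2))

# Reference: plain elementwise subtraction.
expected = [[a - b for a, b in zip(l, r)] for l, r in zip(left, right)]
print(result == expected)
```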

The linting profiling output for this optimized model is given below:

...

Per-Graph Execution Times:
---------------
HTP Subnet 0: 1374349 cycles

Layer Times:
---------------
  0: Input OpId_2 (cycles) : 0 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 0 cycles
    Overlap (wait) time: 0 cycles
    Resources:
  1: OpId_0 (cycles) : 3500 cycles : DSP
    Wait (Scheduler) time: 1284 cycles
    Overlap time: 3221 cycles
    Overlap (wait) time: 1268 cycles
    Resources:
  2: model_convStart_Conv2D:OpId_21 (cycles) : 487448 cycles : DSP
    Wait (Scheduler) time: 32 cycles
    Overlap time: 475888 cycles
      Output OpId_3
      model_add_add:OpId_50
      model_tf_op_layer_stride_1_stride_1:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 32 cycles
      model_convStart_Conv2D:OpId_21
    Resources: HVX, HMX, DMA
  3: model_tf_op_layer_stride_1_stride_1:OpId_24 (cycles) : 10422 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 10075 cycles
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_1_stride_1:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  4: model_convCombined1_Conv2D:OpId_34 (cycles) : 337711 cycles : DSP
    Wait (Scheduler) time: 82 cycles
    Overlap time: 307394 cycles
      Output OpId_3
      model_tf_op_layer_stride_1_stride_1:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 50 cycles
      Output OpId_3
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  5: model_convCombined2_Conv2D:OpId_41 (cycles) : 295022 cycles : DSP
    Wait (Scheduler) time: 1184 cycles
    Overlap time: 286062 cycles
      model_add_add:OpId_50
      Output OpId_3
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_1_stride_1:OpId_24
    Overlap (wait) time: 1140 cycles
      model_add_add:OpId_50
      Output OpId_3
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_1_stride_1:OpId_24
    Resources: HMX, DMA
  6: model_subConv_Conv2D:OpId_48 (cycles) : 48720 cycles : DSP
    Wait (Scheduler) time: 1186 cycles
    Overlap time: 46686 cycles
      model_add_add:OpId_50
      model_tf_op_layer_stride_1_stride_1:OpId_24
      Output OpId_3
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 1142 cycles
      model_add_add:OpId_50
      Output OpId_3
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  7: model_add_add:OpId_50 (cycles) : 110698 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 108524 cycles
      model_add_add:OpId_50
      Output OpId_3
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_1_stride_1:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  8: Output OpId_3 (cycles) : 77054 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 75438 cycles
    Overlap (wait) time: 0 cycles
    Resources: HVX

The total execution time decreases significantly as a result of removing the sub op, from 4327266 to 1374349 cycles. All the ops now have a significant amount of parallel op execution, as evidenced by their respective Overlap time numbers, indicating good optimization.

The Showcase Model 2 diagram illustrates a model that is similar to the one in the Showcase Model 1 diagram. The difference is that there is a div op in place of the problematic sub op.

Showcase Model 2

Linting Profiling Showcase Model 2

The linting profiling output for this model is given below:

...

Per-Graph Execution Times:
---------------
HTP Subnet 0: 7866535 cycles

Layer Times:
---------------
  0: Input OpId_2 (cycles) : 0 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 0 cycles
    Overlap (wait) time: 0 cycles
    Resources:
  1: OpId_0 (cycles) : 8657 cycles : DSP
    Wait (Scheduler) time: 782 cycles
    Overlap time: 5155 cycles
    Overlap (wait) time: 717 cycles
    Resources:
  2: model_convStart_Conv2D:OpId_21 (cycles) : 148293 cycles : DSP
    Wait (Scheduler) time: 34 cycles
    Overlap time: 86500 cycles
      model_tf_op_layer_RealDiv_RealDiv:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 34 cycles
      model_convStart_Conv2D:OpId_21
    Resources: HVX, HMX, DMA
  3: model_tf_op_layer_stride_stride:OpId_24 (cycles) : 145084 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 70877 cycles
      model_convStart_Conv2D:OpId_21
      model_add_add:OpId_58
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  4: model_convLeft1_Conv2D:OpId_34 (cycles) : 285476 cycles : DSP
    Wait (Scheduler) time: 431 cycles
    Overlap time: 196212 cycles
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 318 cycles
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  5: model_convRight1_Conv2D:OpId_41 (cycles) : 219298 cycles : DSP
    Wait (Scheduler) time: 804 cycles
    Overlap time: 134711 cycles
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 558 cycles
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  6: model_convRight2_Conv2D:OpId_48 (cycles) : 181198 cycles : DSP
    Wait (Scheduler) time: 1083 cycles
    Overlap time: 68306 cycles
      model_tf_op_layer_RealDiv_RealDiv:OpId_57
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 476 cycles
      model_tf_op_layer_RealDiv_RealDiv:OpId_57
      Output OpId_3
    Resources: HMX, DMA
  7: model_convLeft2_Conv2D:OpId_55 (cycles) : 233731 cycles : DSP
    Wait (Scheduler) time: 1055 cycles
    Overlap time: 91960 cycles
      model_tf_op_layer_RealDiv_RealDiv:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 447 cycles
      model_tf_op_layer_RealDiv_RealDiv:OpId_57
      Output OpId_3
    Resources: HMX, DMA
  8: model_tf_op_layer_RealDiv_RealDiv:OpId_57 (cycles) : 5344081 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 528123 cycles
      model_tf_op_layer_RealDiv_RealDiv:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  9: model_add_add:OpId_58 (cycles) : 525199 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 481084 cycles
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_stride:OpId_24
      Output OpId_3
      model_add_add:OpId_58
    Overlap (wait) time: 0 cycles
    Resources: HVX
  10: Output OpId_3 (cycles) : 771320 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 115729 cycles
    Overlap (wait) time: 0 cycles
    Resources: HVX

Again, the bottleneck for this graph can be identified by examining the main and background utilization of each op. In this case, the div op is the major contributor to the overall graph execution time, taking 5344081 cycles, about 68% of the total. Only about 10% of this op’s execution has parallel background activity, which again indicates good potential for a performance gain through optimization. Replacing the div op with a mul op is an optimization strategy suggested in the best practices guidelines. The linting profiler output for the graph optimized with a mul op instead of a div op is given below:

...

Per-Graph Execution Times:
---------------
HTP Subnet 0: 2741387 cycles

Layer Times:
---------------
  0: Input OpId_2 (cycles) : 0 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 0 cycles
    Overlap (wait) time: 0 cycles
    Resources:
  1: OpId_0 (cycles) : 8067 cycles : DSP
    Wait (Scheduler) time: 735 cycles
    Overlap time: 4781 cycles
    Overlap (wait) time: 669 cycles
    Resources:
  2: model_convStart_Conv2D:OpId_21 (cycles) : 147478 cycles : DSP
    Wait (Scheduler) time: 32 cycles
    Overlap time: 86319 cycles
      model_multiply_mul:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 32 cycles
      model_convStart_Conv2D:OpId_21
    Resources: HVX, HMX, DMA
  3: model_tf_op_layer_stride_stride:OpId_24 (cycles) : 145396 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 70208 cycles
      model_convStart_Conv2D:OpId_21
      model_add_add:OpId_58
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  4: model_convLeft1_Conv2D:OpId_34 (cycles) : 287130 cycles : DSP
    Wait (Scheduler) time: 430 cycles
    Overlap time: 198222 cycles
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 308 cycles
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  5: model_convRight1_Conv2D:OpId_41 (cycles) : 219409 cycles : DSP
    Wait (Scheduler) time: 806 cycles
    Overlap time: 135286 cycles
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Overlap (wait) time: 558 cycles
      Output OpId_3
      model_tf_op_layer_stride_stride:OpId_24
      model_convStart_Conv2D:OpId_21
    Resources: HMX, DMA
  6: model_convRight2_Conv2D:OpId_48 (cycles) : 181465 cycles : DSP
    Wait (Scheduler) time: 1068 cycles
    Overlap time: 69160 cycles
      model_multiply_mul:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 467 cycles
      model_multiply_mul:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
    Resources: HMX, DMA
  7: model_convLeft2_Conv2D:OpId_55 (cycles) : 233619 cycles : DSP
    Wait (Scheduler) time: 1055 cycles
    Overlap time: 92740 cycles
      model_multiply_mul:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 445 cycles
      model_multiply_mul:OpId_57
      model_convStart_Conv2D:OpId_21
      Output OpId_3
      model_add_add:OpId_58
    Resources: HMX, DMA
  8: model_multiply_mul:OpId_57 (cycles) : 737978 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 437784 cycles
      model_multiply_mul:OpId_57
      Output OpId_3
      model_add_add:OpId_58
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_stride:OpId_24
    Overlap (wait) time: 0 cycles
    Resources: HVX
  9: model_add_add:OpId_58 (cycles) : 527450 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 481714 cycles
      model_convStart_Conv2D:OpId_21
      model_tf_op_layer_stride_stride:OpId_24
      Output OpId_3
      model_add_add:OpId_58
    Overlap (wait) time: 0 cycles
    Resources: HVX
  10: Output OpId_3 (cycles) : 249264 cycles : DSP
    Wait (Scheduler) time: 0 cycles
    Overlap time: 117890 cycles
    Overlap (wait) time: 0 cycles
    Resources: HVX
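
The two Showcase Model 2 profiles (with the div op and with the mul op) can be compared numerically from their cycle counts. The Python sketch below computes the ratio of the graph totals and the overlapped fraction of the div and mul ops. Because cycle counts do not convert directly to time, the ratio is only a relative indicator.

```python
def speedup(before, after):
    """Ratio of cycle counts before and after a change."""
    return before / after

def overlap_percent(overlap, cycles):
    """Percentage of an op's cycles that had background activity."""
    return 100.0 * overlap / cycles

# "HTP Subnet 0" totals for Showcase Model 2 with div and with mul.
div_graph, mul_graph = 7866535, 2741387

graph_speedup = speedup(div_graph, mul_graph)     # about 2.87
div_overlap = overlap_percent(528123, 5344081)    # div op, about 9.9
mul_overlap = overlap_percent(437784, 737978)     # mul op, about 59.3

print(f"graph cycles reduced {graph_speedup:.2f}x")
print(f"div op overlapped: {div_overlap:.1f}%")
print(f"mul op overlapped: {mul_overlap:.1f}%")
```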

There is a noticeable reduction in the total graph execution time, from 7866535 to 2741387 cycles, and the ops also have better background utilization, indicating better optimization than before.

Next, the Showcase Model 3 diagram illustrates a model that is similar to the one in the Showcase Model 1 Optimized diagram. The difference is that the ReLU ops have been replaced with PReLU ops.

Showcase Model 3

Linting Profiling Showcase Model 3

The linting profiler output for this model is given below:

...

Per-Graph Execution Times:
---------------
HTP Subnet 0: 2789467 cycles

Layer Times:
---------------
0: Input OpId_2 (cycles) : 0 cycles : DSP
  Wait (Scheduler) time: 0 cycles
  Overlap time: 0 cycles
  Overlap (wait) time: 0 cycles
  Resources:
1: OpId_0 (cycles) : 3411 cycles : DSP
  Wait (Scheduler) time: 1226 cycles
  Overlap time: 3173 cycles
  Overlap (wait) time: 1194 cycles
  Resources:
2: model_convStart_Conv2D:OpId_21 (cycles) : 589431 cycles : DSP
  Wait (Scheduler) time: 957 cycles
  Overlap time: 41199 cycles
    Output OpId_3
    model_add_add:OpId_54
    model_preluCombined1_add:OpId_37
    model_convStart_Conv2D:OpId_21
  Overlap (wait) time: 72 cycles
    Output OpId_3
    model_convStart_Conv2D:OpId_21
  Resources: HVX, HMX, DMA
3: model_tf_op_layer_stride_1_stride_1:OpId_24 (cycles) : 0 cycles : DSP
  Wait (Scheduler) time: 0 cycles
  Overlap time: 0 cycles
  Overlap (wait) time: 0 cycles
  Resources:
4: model_convCombined1_Conv2D:OpId_34 (cycles) : 165119 cycles : DSP
  Wait (Scheduler) time: 1089 cycles
  Overlap time: 155164 cycles
    model_preluCombined1_add:OpId_37
    Output OpId_3
    model_add_add:OpId_54
    model_convStart_Conv2D:OpId_21
  Overlap (wait) time: 977 cycles
    model_preluCombined1_add:OpId_37
    Output OpId_3
    model_add_add:OpId_54
    model_convStart_Conv2D:OpId_21
  Resources: HMX, DMA
5: model_preluCombined1_add:OpId_37 (cycles) : 27315 cycles : DSP
  Wait (Scheduler) time: 0 cycles
  Overlap time: 9431 cycles
    model_convStart_Conv2D:OpId_21
  Overlap (wait) time: 0 cycles
  Resources: HVX
6: model_convCombined2_Conv2D:OpId_43 (cycles) : 805490 cycles : DSP
  Wait (Scheduler) time: 81 cycles
  Overlap time: 251743 cycles
    model_add_add:OpId_54
    Output OpId_3
    model_preluCombined1_add:OpId_37
    model_preluCombined2_add:OpId_46
    model_convStart_Conv2D:OpId_21
  Overlap (wait) time: 62 cycles
    Output OpId_3
    model_convStart_Conv2D:OpId_21
  Resources: HMX, DMA
7: model_preluCombined2_add:OpId_46 (cycles) : 0 cycles : DSP
  Wait (Scheduler) time: 0 cycles
  Overlap time: 0 cycles
  Overlap (wait) time: 0 cycles
  Resources: HVX
8: model_subConv_Conv2D:OpId_52 (cycles) : 666721 cycles : DSP
  Wait (Scheduler) time: 34 cycles
  Overlap time: 180805 cycles
    model_add_add:OpId_54
    Output OpId_3
    model_convStart_Conv2D:OpId_21
    model_preluCombined2_add:OpId_46
  Overlap (wait) time: 13 cycles
    model_convStart_Conv2D:OpId_21
  Resources: HMX, DMA
9: model_add_add:OpId_54 (cycles) : 62806 cycles : DSP
  Wait (Scheduler) time: 0 cycles
  Overlap time: 57481 cycles
    model_add_add:OpId_54
    Output OpId_3
    model_preluCombined1_add:OpId_37
    model_preluCombined2_add:OpId_46
    model_convStart_Conv2D:OpId_21
  Overlap (wait) time: 0 cycles
  Resources: HVX
10: Output OpId_3 (cycles) : 465781 cycles : DSP
  Wait (Scheduler) time: 0 cycles
  Overlap time: 430560 cycles
  Overlap (wait) time: 0 cycles
  Resources: HVX

The usual sign of a bottleneck is present here as well: multiple ops have low parallel execution. PReLU ops are among the background ops that executed alongside these ops, and the best practices guidelines suggest replacing PReLU ops with ReLU ops. Doing so gives the same model as the one shown in the Showcase Model 1 Optimized diagram, which is much better optimized, as explained before.

Caveats

Since Linting profile is only available for HTP, non-HTP subnets will silently fall back to the next most descriptive profiling level, Detailed, while HTP subnets will be executed with Linting mode enabled as requested by the user. Additionally, for multi-subnet networks with a combination of HTP and non-HTP subnets, snpe-diagview will generate separate chrometraces only for each HTP subnet. For example, when running inference (with Linting profiling enabled) on a network with 3 HTP subnets and 2 non-HTP subnets, snpe-diagview is expected to produce 3 chrometraces when invoked with --chrometrace.
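
The fallback rule and the chrometrace count can be restated as a small Python sketch. The backend labels and the function name are hypothetical; the sketch only mirrors the behaviour described in this section and does not call any SNPE API.

```python
def plan_profiling(subnet_backends):
    """Map each subnet to its effective profiling level when Linting is requested."""
    plan = [(backend, "linting" if backend == "HTP" else "detailed")
            for backend in subnet_backends]
    # One chrometrace is expected per subnet that actually ran in Linting mode.
    chrometraces = sum(1 for _, level in plan if level == "linting")
    return plan, chrometraces

# Hypothetical network with 3 HTP subnets and 2 non-HTP subnets.
plan, chrometraces = plan_profiling(["HTP", "CPU", "HTP", "GPU", "HTP"])
print(plan)
print(chrometraces)
```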