Linting Profile
Brief
Linting mode is a performance profiling configuration for ops running on the HTP backend.
The detailed profiling report it produces gives per-op results in cycle counts instead of
time in microseconds. Cycle counts cannot be converted directly to microseconds because ops
execute in parallel. The per-op cycle counts are therefore best used as a relative measure,
to compare ops and see which of them take more or fewer cycles to finish executing. Assuming
the HTP backend prerequisite is met, Linting mode is activated by passing
--profiling_level=linting to snpe-net-run, or by calling the
Snpe_SNPEBuilder_SetProfilingLevel API to set the profiling level to
SNPE_PROFILING_LEVEL_LINTING.
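As a minimal sketch of the command-line route, the Python snippet below assembles a snpe-net-run invocation with Linting mode enabled. Only --profiling_level=linting comes from the text above. The file names model.dlc and input_list.txt are placeholders, and the --container, --input_list, and --use_dsp options are assumed from typical snpe-net-run usage, so check them against your SDK version. The command is only printed, not executed.

```python
import shlex


def build_linting_command(container, input_list):
    """Assemble a snpe-net-run command line with Linting profiling enabled."""
    return [
        "snpe-net-run",
        "--container", container,      # assumed option: path to the DLC file
        "--input_list", input_list,    # assumed option: text file listing the inputs
        "--use_dsp",                   # assumed option: run on the DSP/HTP runtime
        "--profiling_level=linting",   # taken from the text above
    ]


# Placeholder file names; substitute your own model and input list.
command = build_linting_command("model.dlc", "input_list.txt")
print(shlex.join(command))
```

Building the argument list separately from running it keeps the profiling flag easy to toggle when comparing against other profiling levels.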
Linting Profile Metrics
On the main thread, each op has to wait for some cycles after the previous op finishes before it can start executing. This wait period can be attributed to various factors such as scheduling or waiting for some background HVX or DMA activity to finish. Linting profiling provides the following diagnostic entries per HTP op:
Wait: The “Wait” entry is a foreground execution descriptor that denotes the number of cycles the op spent waiting on the main thread, counted from the previous op that ran on the main thread, before it actually started executing.
Overlap: The “Overlap” entry is a background execution descriptor that denotes the number of cycles spent on at least one background op while this op is executing on the main thread.
Overlap (wait): The “Overlap (wait)” entry is a background execution descriptor. It is similar to the “Overlap” entry, except that the cycles reported in this entry correspond to the “Wait” period (i.e. cycles spent on at least one background op while the main thread was waiting).
Resources: The “Resources” entry lists the different resources used by the given op, namely some combination of HVX, HMX, and DMA.
Background ops that are being waited on by main thread ops are not considered as background activity and as such do not contribute to the counts reported by the overlap entries. Each of the overlap entries also has up to 10 indented lines following it indicating the names of the ops that contributed to the respective overlap cycle count. Please refer to the model optimization example below to see samples of how snpe-diagview displays the aforementioned Linting profile metrics.
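The entries described above map naturally onto a small record per op. The following sketch parses snpe-diagview text in the layout used by the reports in this document into such records. The OpLint class and parse_linting function are hypothetical helpers written for illustration, not part of the SDK, and the regular expressions assume exactly the line layout shown in the example reports.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OpLint:
    name: str
    cycles: int
    wait: int = 0
    overlap: int = 0
    overlap_wait: int = 0
    overlap_ops: list = field(default_factory=list)
    overlap_wait_ops: list = field(default_factory=list)
    resources: list = field(default_factory=list)


HEADER = re.compile(r"^\d+: (.+) \(cycles\) : (\d+) cycles : \w+$")
ENTRY = re.compile(r"^(Wait \(Scheduler\)|Overlap \(wait\)|Overlap) time: (\d+) cycles$")


def parse_linting(text):
    """Parse per-op linting entries; lines before the first op are ignored."""
    ops, current, target = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            current = OpLint(header.group(1), int(header.group(2)))
            ops.append(current)
            target = None
            continue
        if current is None:
            continue
        entry = ENTRY.match(line)
        if entry:
            kind, value = entry.group(1), int(entry.group(2))
            if kind.startswith("Wait"):
                current.wait, target = value, None
            elif kind == "Overlap":
                current.overlap, target = value, current.overlap_ops
            else:
                current.overlap_wait, target = value, current.overlap_wait_ops
        elif line.startswith("Resources:"):
            names = line[len("Resources:"):].strip()
            current.resources = [r.strip() for r in names.split(",")] if names else []
            target = None
        elif target is not None:
            target.append(line)  # name of a contributing background op
    return ops


# Two ops copied from the Showcase Model 1 report.
SAMPLE = """\
1: OpId_0 (cycles) : 8036 cycles : DSP
Wait (Scheduler) time: 629 cycles
Overlap time: 4770 cycles
Overlap (wait) time: 565 cycles
Resources:
2: model_convStart_Conv2D:OpId_21 (cycles) : 147075 cycles : DSP
Wait (Scheduler) time: 32 cycles
Overlap time: 85292 cycles
model_sub_sub:OpId_57
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 32 cycles
model_convStart_Conv2D:OpId_21
Resources: HVX, HMX, DMA
"""

ops = parse_linting(SAMPLE)
for op in ops:
    print(op.name, op.cycles, op.wait, op.overlap, op.overlap_wait, op.resources)
```

The parser classifies lines by their text rather than by indentation, so it also works on output whose leading whitespace has been lost.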
Chrometrace
Like its sibling profiling levels, Linting profile metrics are averaged across all inputs used during inference and can be viewed using the snpe-diagview tool. However, one advantage of Linting profile is the ability to export chrometrace JSON files, which can be used to visualize the op foreground and background execution and overlaps detailed by the Linting profile metrics.
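Besides being opened in a trace viewer, an exported chrometrace file can be inspected programmatically. The sketch below assumes the export follows the common Chrome Trace Event Format: either a top-level list of events or an object with a traceEvents list, with complete events marked "ph": "X" and carrying a dur field. That layout and the tiny in-memory trace used here are assumptions for illustration; the fields actually emitted by snpe-diagview may differ.

```python
import json
from collections import defaultdict


def duration_by_name(trace_json):
    """Sum the 'dur' field of complete ('X') events, grouped by event name."""
    data = json.loads(trace_json)
    events = data["traceEvents"] if isinstance(data, dict) else data
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":
            totals[event["name"]] += event.get("dur", 0)
    return dict(totals)


# Hypothetical trace for illustration only, not real tool output.
sample = json.dumps({"traceEvents": [
    {"name": "opA", "ph": "X", "ts": 0, "dur": 100, "pid": 1, "tid": 1},
    {"name": "opB", "ph": "X", "ts": 40, "dur": 30, "pid": 1, "tid": 2},
    {"name": "opA", "ph": "X", "ts": 200, "dur": 50, "pid": 1, "tid": 1},
]})
totals = duration_by_name(sample)
print(totals)
```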
Model Optimization Example
In this section, we walk through an example of how Linting mode and chrometraces can be used to address a bottleneck in a simple network. The Showcase Model 1 diagram illustrates a model with two branches, each performing a couple of convolutions before their results are used in a sub operation.
Showcase Model 1
The linting profiling output by snpe-diagview for this model is given below:
...
Per-Graph Execution Times:
---------------
HTP Subnet 0: 4327266 cycles
Layer Times:
---------------
0: Input OpId_2 (cycles) : 0 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 0 cycles
Overlap (wait) time: 0 cycles
Resources:
1: OpId_0 (cycles) : 8036 cycles : DSP
Wait (Scheduler) time: 629 cycles
Overlap time: 4770 cycles
Overlap (wait) time: 565 cycles
Resources:
2: model_convStart_Conv2D:OpId_21 (cycles) : 147075 cycles : DSP
Wait (Scheduler) time: 32 cycles
Overlap time: 85292 cycles
model_sub_sub:OpId_57
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 32 cycles
model_convStart_Conv2D:OpId_21
Resources: HVX, HMX, DMA
3: model_tf_op_layer_stride_stride:OpId_24 (cycles) : 146494 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 70807 cycles
model_add_add:OpId_58
Output OpId_3
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
4: model_convLeft1_Conv2D:OpId_34 (cycles) : 288249 cycles : DSP
Wait (Scheduler) time: 425 cycles
Overlap time: 195988 cycles
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 304 cycles
Output OpId_3
model_add_add:OpId_58
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
5: model_convRight1_Conv2D:OpId_41 (cycles) : 220391 cycles : DSP
Wait (Scheduler) time: 803 cycles
Overlap time: 135268 cycles
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 557 cycles
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
6: model_convRight2_Conv2D:OpId_48 (cycles) : 181016 cycles : DSP
Wait (Scheduler) time: 1090 cycles
Overlap time: 69323 cycles
model_sub_sub:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
Overlap (wait) time: 489 cycles
model_sub_sub:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
Resources: HMX, DMA
7: model_convLeft2_Conv2D:OpId_55 (cycles) : 233736 cycles : DSP
Wait (Scheduler) time: 1059 cycles
Overlap time: 93020 cycles
model_sub_sub:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 464 cycles
model_sub_sub:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
Resources: HMX, DMA
8: model_sub_sub:OpId_57 (cycles) : 2165162 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 465046 cycles
model_sub_sub:OpId_57
Output OpId_3
model_add_add:OpId_58
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
9: model_add_add:OpId_58 (cycles) : 525971 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 481468 cycles
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
Overlap (wait) time: 0 cycles
Resources: HVX
10: Output OpId_3 (cycles) : 407091 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 115120 cycles
Overlap (wait) time: 0 cycles
Resources: HVX
The linting profiling chrometrace output for this model is given below:
Showcase Model 1 Chrometrace
From the output, it is evident that the sub op (OpId_57) is the most significant contributor to the total execution time, at around 50% (2165162 of 4327266 cycles). This op also has little parallel op execution: its Overlap time is 465046 cycles, which is about 21.5% of its total execution time. Together these make the sub op a good candidate for optimization. We can design an equivalent model, as shown in the Showcase Model 1 Optimized diagram, by merging the two branches and replacing the sub op with a convolution whose weights are manually designed so that it performs the same task as the sub op.
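The two percentages quoted for the sub op follow directly from the cycle counts in the report. A short check, with variable names chosen here for illustration:

```python
# Values copied from the Showcase Model 1 report.
total_cycles = 4327266   # HTP Subnet 0
sub_cycles = 2165162     # model_sub_sub:OpId_57
sub_overlap = 465046     # Overlap time of the sub op

share_of_total = sub_cycles / total_cycles   # fraction of graph cycles spent in the op
overlap_ratio = sub_overlap / sub_cycles     # fraction of the op's cycles with background activity

print(f"share of graph cycles: {share_of_total:.1%}")
print(f"overlap ratio:         {overlap_ratio:.1%}")
```

A large share of the graph combined with a low overlap ratio is the pattern to look for: the op is expensive, and little else runs while it executes.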
Showcase Model 1 Optimized
The linting profiling output for this optimized model is given below:
...
Per-Graph Execution Times:
---------------
HTP Subnet 0: 1374349 cycles
Layer Times:
---------------
0: Input OpId_2 (cycles) : 0 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 0 cycles
Overlap (wait) time: 0 cycles
Resources:
1: OpId_0 (cycles) : 3500 cycles : DSP
Wait (Scheduler) time: 1284 cycles
Overlap time: 3221 cycles
Overlap (wait) time: 1268 cycles
Resources:
2: model_convStart_Conv2D:OpId_21 (cycles) : 487448 cycles : DSP
Wait (Scheduler) time: 32 cycles
Overlap time: 475888 cycles
Output OpId_3
model_add_add:OpId_50
model_tf_op_layer_stride_1_stride_1:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 32 cycles
model_convStart_Conv2D:OpId_21
Resources: HVX, HMX, DMA
3: model_tf_op_layer_stride_1_stride_1:OpId_24 (cycles) : 10422 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 10075 cycles
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_1_stride_1:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
4: model_convCombined1_Conv2D:OpId_34 (cycles) : 337711 cycles : DSP
Wait (Scheduler) time: 82 cycles
Overlap time: 307394 cycles
Output OpId_3
model_tf_op_layer_stride_1_stride_1:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 50 cycles
Output OpId_3
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
5: model_convCombined2_Conv2D:OpId_41 (cycles) : 295022 cycles : DSP
Wait (Scheduler) time: 1184 cycles
Overlap time: 286062 cycles
model_add_add:OpId_50
Output OpId_3
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_1_stride_1:OpId_24
Overlap (wait) time: 1140 cycles
model_add_add:OpId_50
Output OpId_3
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_1_stride_1:OpId_24
Resources: HMX, DMA
6: model_subConv_Conv2D:OpId_48 (cycles) : 48720 cycles : DSP
Wait (Scheduler) time: 1186 cycles
Overlap time: 46686 cycles
model_add_add:OpId_50
model_tf_op_layer_stride_1_stride_1:OpId_24
Output OpId_3
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 1142 cycles
model_add_add:OpId_50
Output OpId_3
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
7: model_add_add:OpId_50 (cycles) : 110698 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 108524 cycles
model_add_add:OpId_50
Output OpId_3
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_1_stride_1:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
8: Output OpId_3 (cycles) : 77054 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 75438 cycles
Overlap (wait) time: 0 cycles
Resources: HVX
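To quantify the change between the two reports, the graph totals and the per-op Overlap figures can be compared directly. The sketch below uses only numbers copied from the two reports above; the variable names are illustrative. Because cycle counts do not convert directly to time, the ratio is reported in cycles only.

```python
# Graph totals from the two snpe-diagview reports (HTP Subnet 0).
baseline_cycles = 4327266    # Showcase Model 1
optimized_cycles = 1374349   # Showcase Model 1 Optimized

reduction = 1 - optimized_cycles / baseline_cycles
cycle_ratio = baseline_cycles / optimized_cycles

# (cycles, Overlap time) per op of the optimized model, copied from the report.
optimized_ops = {
    "model_convStart_Conv2D:OpId_21": (487448, 475888),
    "model_tf_op_layer_stride_1_stride_1:OpId_24": (10422, 10075),
    "model_convCombined1_Conv2D:OpId_34": (337711, 307394),
    "model_convCombined2_Conv2D:OpId_41": (295022, 286062),
    "model_subConv_Conv2D:OpId_48": (48720, 46686),
    "model_add_add:OpId_50": (110698, 108524),
    "Output OpId_3": (77054, 75438),
}
lowest_ratio = min(overlap / cycles for cycles, overlap in optimized_ops.values())

print(f"cycle reduction: {reduction:.1%}, baseline/optimized: {cycle_ratio:.2f}x")
print(f"lowest per-op overlap ratio: {lowest_ratio:.1%}")
```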
The total execution time decreases significantly as a result of removing the sub op, from 4327266 to 1374349 cycles (about 68% fewer). All the ops now have a significant amount of parallel op execution, as evidenced by their respective Overlap time numbers, indicating good optimization. The Showcase Model 2 diagram illustrates a model that is similar to the one in the Showcase Model 1 diagram. The difference is that there is a div op in place of the problematic sub op.
Showcase Model 2
The linting profiling output for this model is given below:
...
Per-Graph Execution Times:
---------------
HTP Subnet 0: 7866535 cycles
Layer Times:
---------------
0: Input OpId_2 (cycles) : 0 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 0 cycles
Overlap (wait) time: 0 cycles
Resources:
1: OpId_0 (cycles) : 8657 cycles : DSP
Wait (Scheduler) time: 782 cycles
Overlap time: 5155 cycles
Overlap (wait) time: 717 cycles
Resources:
2: model_convStart_Conv2D:OpId_21 (cycles) : 148293 cycles : DSP
Wait (Scheduler) time: 34 cycles
Overlap time: 86500 cycles
model_tf_op_layer_RealDiv_RealDiv:OpId_57
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 34 cycles
model_convStart_Conv2D:OpId_21
Resources: HVX, HMX, DMA
3: model_tf_op_layer_stride_stride:OpId_24 (cycles) : 145084 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 70877 cycles
model_convStart_Conv2D:OpId_21
model_add_add:OpId_58
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
4: model_convLeft1_Conv2D:OpId_34 (cycles) : 285476 cycles : DSP
Wait (Scheduler) time: 431 cycles
Overlap time: 196212 cycles
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 318 cycles
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
5: model_convRight1_Conv2D:OpId_41 (cycles) : 219298 cycles : DSP
Wait (Scheduler) time: 804 cycles
Overlap time: 134711 cycles
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 558 cycles
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
6: model_convRight2_Conv2D:OpId_48 (cycles) : 181198 cycles : DSP
Wait (Scheduler) time: 1083 cycles
Overlap time: 68306 cycles
model_tf_op_layer_RealDiv_RealDiv:OpId_57
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 476 cycles
model_tf_op_layer_RealDiv_RealDiv:OpId_57
Output OpId_3
Resources: HMX, DMA
7: model_convLeft2_Conv2D:OpId_55 (cycles) : 233731 cycles : DSP
Wait (Scheduler) time: 1055 cycles
Overlap time: 91960 cycles
model_tf_op_layer_RealDiv_RealDiv:OpId_57
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 447 cycles
model_tf_op_layer_RealDiv_RealDiv:OpId_57
Output OpId_3
Resources: HMX, DMA
8: model_tf_op_layer_RealDiv_RealDiv:OpId_57 (cycles) : 5344081 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 528123 cycles
model_tf_op_layer_RealDiv_RealDiv:OpId_57
Output OpId_3
model_add_add:OpId_58
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
9: model_add_add:OpId_58 (cycles) : 525199 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 481084 cycles
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_stride:OpId_24
Output OpId_3
model_add_add:OpId_58
Overlap (wait) time: 0 cycles
Resources: HVX
10: Output OpId_3 (cycles) : 771320 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 115729 cycles
Overlap (wait) time: 0 cycles
Resources: HVX
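One way to locate the bottleneck in a report like the one above is to rank ops by the cycles they spend with no background activity, computed as cycles minus Overlap time. This ranking is an illustrative heuristic rather than a metric reported by the tool; the numbers are copied from the Showcase Model 2 report.

```python
# (cycles, Overlap time) per op, copied from the Showcase Model 2 report.
model2_ops = {
    "model_convStart_Conv2D:OpId_21": (148293, 86500),
    "model_tf_op_layer_stride_stride:OpId_24": (145084, 70877),
    "model_convLeft1_Conv2D:OpId_34": (285476, 196212),
    "model_convRight1_Conv2D:OpId_41": (219298, 134711),
    "model_convRight2_Conv2D:OpId_48": (181198, 68306),
    "model_convLeft2_Conv2D:OpId_55": (233731, 91960),
    "model_tf_op_layer_RealDiv_RealDiv:OpId_57": (5344081, 528123),
    "model_add_add:OpId_58": (525199, 481084),
    "Output OpId_3": (771320, 115729),
}

# Cycles during which no background op was running alongside the main-thread op.
ranked = sorted(
    ((name, cycles - overlap) for name, (cycles, overlap) in model2_ops.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, uncovered in ranked[:3]:
    print(f"{uncovered:>9} cycles without overlap: {name}")
```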
Again, the bottleneck for this graph can be identified by examining the main and background utilization of each op. In this case, the div op is the major contributor to the overall graph execution time, taking 5344081 cycles, about 68% of the total. Only about 10% of this op’s execution has some parallel background activity, which again indicates good potential for a performance gain through optimization. Replacing the div op with a mul op is a suggested optimization strategy in the best practices guidelines. The linting profiler output for the graph optimized with a mul op instead of a div op is given below:
...
Per-Graph Execution Times:
---------------
HTP Subnet 0: 2741387 cycles
Layer Times:
---------------
0: Input OpId_2 (cycles) : 0 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 0 cycles
Overlap (wait) time: 0 cycles
Resources:
1: OpId_0 (cycles) : 8067 cycles : DSP
Wait (Scheduler) time: 735 cycles
Overlap time: 4781 cycles
Overlap (wait) time: 669 cycles
Resources:
2: model_convStart_Conv2D:OpId_21 (cycles) : 147478 cycles : DSP
Wait (Scheduler) time: 32 cycles
Overlap time: 86319 cycles
model_multiply_mul:OpId_57
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 32 cycles
model_convStart_Conv2D:OpId_21
Resources: HVX, HMX, DMA
3: model_tf_op_layer_stride_stride:OpId_24 (cycles) : 145396 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 70208 cycles
model_convStart_Conv2D:OpId_21
model_add_add:OpId_58
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
4: model_convLeft1_Conv2D:OpId_34 (cycles) : 287130 cycles : DSP
Wait (Scheduler) time: 430 cycles
Overlap time: 198222 cycles
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 308 cycles
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
5: model_convRight1_Conv2D:OpId_41 (cycles) : 219409 cycles : DSP
Wait (Scheduler) time: 806 cycles
Overlap time: 135286 cycles
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 558 cycles
Output OpId_3
model_tf_op_layer_stride_stride:OpId_24
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
6: model_convRight2_Conv2D:OpId_48 (cycles) : 181465 cycles : DSP
Wait (Scheduler) time: 1068 cycles
Overlap time: 69160 cycles
model_multiply_mul:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 467 cycles
model_multiply_mul:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
Resources: HMX, DMA
7: model_convLeft2_Conv2D:OpId_55 (cycles) : 233619 cycles : DSP
Wait (Scheduler) time: 1055 cycles
Overlap time: 92740 cycles
model_multiply_mul:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 445 cycles
model_multiply_mul:OpId_57
model_convStart_Conv2D:OpId_21
Output OpId_3
model_add_add:OpId_58
Resources: HMX, DMA
8: model_multiply_mul:OpId_57 (cycles) : 737978 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 437784 cycles
model_multiply_mul:OpId_57
Output OpId_3
model_add_add:OpId_58
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_stride:OpId_24
Overlap (wait) time: 0 cycles
Resources: HVX
9: model_add_add:OpId_58 (cycles) : 527450 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 481714 cycles
model_convStart_Conv2D:OpId_21
model_tf_op_layer_stride_stride:OpId_24
Output OpId_3
model_add_add:OpId_58
Overlap (wait) time: 0 cycles
Resources: HVX
10: Output OpId_3 (cycles) : 249264 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 117890 cycles
Overlap (wait) time: 0 cycles
Resources: HVX
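Comparing the div and mul reports op by op shows where the saving comes from. The sketch below matches ops by OpId, which assumes the identifiers refer to the same graph positions in both runs (in the two reports above, OpId_57 is the div op in one and the mul op in the other). All cycle counts are copied from those reports.

```python
# Per-op cycle counts keyed by OpId, from the div and mul reports.
div_cycles = {
    "OpId_0": 8657, "OpId_21": 148293, "OpId_24": 145084, "OpId_34": 285476,
    "OpId_41": 219298, "OpId_48": 181198, "OpId_55": 233731,
    "OpId_57": 5344081, "OpId_58": 525199, "OpId_3": 771320,
}
mul_cycles = {
    "OpId_0": 8067, "OpId_21": 147478, "OpId_24": 145396, "OpId_34": 287130,
    "OpId_41": 219409, "OpId_48": 181465, "OpId_55": 233619,
    "OpId_57": 737978, "OpId_58": 527450, "OpId_3": 249264,
}

# Positive values mean the op got cheaper after replacing div with mul.
saved = {op: div_cycles[op] - mul_cycles[op] for op in div_cycles}
largest = sorted(saved.items(), key=lambda item: item[1], reverse=True)

for op, delta in largest[:3]:
    print(f"{op}: {delta} cycles saved")
```

In this data the replaced op (OpId_57) and the Output op (OpId_3) account for nearly all of the difference, while the convolution ops change by at most a few thousand cycles in either direction.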
There is a noticeable reduction in the total graph execution time, from 7866535 to 2741387 cycles (about 65% fewer). The mul op also has better background utilization than the div op it replaces: about 59% of its cycles overlap with background activity, compared with about 10% for the div op. Next, the Showcase Model 3 diagram illustrates a model that is similar to the one in the Showcase Model 1 Optimized diagram. The difference is that the ReLU ops have been replaced with PReLU ops.
Showcase Model 3
The linting profiler output for this model is given below:
...
Per-Graph Execution Times:
---------------
HTP Subnet 0: 2789467 cycles
Layer Times:
---------------
0: Input OpId_2 (cycles) : 0 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 0 cycles
Overlap (wait) time: 0 cycles
Resources:
1: OpId_0 (cycles) : 3411 cycles : DSP
Wait (Scheduler) time: 1226 cycles
Overlap time: 3173 cycles
Overlap (wait) time: 1194 cycles
Resources:
2: model_convStart_Conv2D:OpId_21 (cycles) : 589431 cycles : DSP
Wait (Scheduler) time: 957 cycles
Overlap time: 41199 cycles
Output OpId_3
model_add_add:OpId_54
model_preluCombined1_add:OpId_37
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 72 cycles
Output OpId_3
model_convStart_Conv2D:OpId_21
Resources: HVX, HMX, DMA
3: model_tf_op_layer_stride_1_stride_1:OpId_24 (cycles) : 0 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 0 cycles
Overlap (wait) time: 0 cycles
Resources:
4: model_convCombined1_Conv2D:OpId_34 (cycles) : 165119 cycles : DSP
Wait (Scheduler) time: 1089 cycles
Overlap time: 155164 cycles
model_preluCombined1_add:OpId_37
Output OpId_3
model_add_add:OpId_54
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 977 cycles
model_preluCombined1_add:OpId_37
Output OpId_3
model_add_add:OpId_54
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
5: model_preluCombined1_add:OpId_37 (cycles) : 27315 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 9431 cycles
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 0 cycles
Resources: HVX
6: model_convCombined2_Conv2D:OpId_43 (cycles) : 805490 cycles : DSP
Wait (Scheduler) time: 81 cycles
Overlap time: 251743 cycles
model_add_add:OpId_54
Output OpId_3
model_preluCombined1_add:OpId_37
model_preluCombined2_add:OpId_46
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 62 cycles
Output OpId_3
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
7: model_preluCombined2_add:OpId_46 (cycles) : 0 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 0 cycles
Overlap (wait) time: 0 cycles
Resources: HVX
8: model_subConv_Conv2D:OpId_52 (cycles) : 666721 cycles : DSP
Wait (Scheduler) time: 34 cycles
Overlap time: 180805 cycles
model_add_add:OpId_54
Output OpId_3
model_convStart_Conv2D:OpId_21
model_preluCombined2_add:OpId_46
Overlap (wait) time: 13 cycles
model_convStart_Conv2D:OpId_21
Resources: HMX, DMA
9: model_add_add:OpId_54 (cycles) : 62806 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 57481 cycles
model_add_add:OpId_54
Output OpId_3
model_preluCombined1_add:OpId_37
model_preluCombined2_add:OpId_46
model_convStart_Conv2D:OpId_21
Overlap (wait) time: 0 cycles
Resources: HVX
10: Output OpId_3 (cycles) : 465781 cycles : DSP
Wait (Scheduler) time: 0 cycles
Overlap time: 430560 cycles
Overlap (wait) time: 0 cycles
Resources: HVX
The usual sign of a bottleneck is present here as well: several ops, such as convStart (OpId_21), convCombined2 (OpId_43), and subConv (OpId_52), have low parallel execution relative to their cycle counts. PReLU ops are among the background ops that executed alongside these ops, and the best practices guidelines suggest that PReLU ops should be replaced with ReLU ops. Replacing the PReLU ops with ReLU ops gives the same model as the one shown in the Showcase Model 1 Optimized diagram, which is much better optimized, as explained before: it takes 1374349 cycles compared with 2789467 cycles here.
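The ops with low parallel execution in the Showcase Model 3 report can be picked out mechanically by computing each op's overlap ratio and flagging those below a cutoff. The 50% cutoff used here is an arbitrary assumption for illustration, not a value from the guidelines; ops that report 0 cycles are skipped.

```python
# (cycles, Overlap time) per op, copied from the Showcase Model 3 report.
model3_ops = {
    "model_convStart_Conv2D:OpId_21": (589431, 41199),
    "model_tf_op_layer_stride_1_stride_1:OpId_24": (0, 0),
    "model_convCombined1_Conv2D:OpId_34": (165119, 155164),
    "model_preluCombined1_add:OpId_37": (27315, 9431),
    "model_convCombined2_Conv2D:OpId_43": (805490, 251743),
    "model_preluCombined2_add:OpId_46": (0, 0),
    "model_subConv_Conv2D:OpId_52": (666721, 180805),
    "model_add_add:OpId_54": (62806, 57481),
    "Output OpId_3": (465781, 430560),
}

THRESHOLD = 0.5  # assumed cutoff, chosen only for this example

flagged = [
    (name, overlap / cycles)
    for name, (cycles, overlap) in model3_ops.items()
    if cycles > 0 and overlap / cycles < THRESHOLD
]

for name, ratio in flagged:
    print(f"{ratio:6.1%} overlap: {name}")
```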
Caveats
Since Linting profile is only available for HTP, non-HTP subnets will silently fall back to
the next most descriptive profiling level, Detailed, while HTP subnets will be executed with
Linting mode enabled as requested by the user. Additionally, for multi-subnet networks with a
combination of HTP and non-HTP subnets, snpe-diagview will generate separate chrometraces only
for each HTP subnet. For example, when running inference (with Linting profiling enabled) on
a network with 3 HTP subnets and 2 non-HTP subnets, snpe-diagview is expected to produce 3
chrometraces when invoked with --chrometrace.
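The fallback rule and the chrometrace count from the example above can be expressed as a small sketch. The function name and the backend labels "CPU" and "GPU" are placeholders chosen for illustration; only the HTP versus non-HTP distinction comes from the text.

```python
def plan_profiling(subnet_backends):
    """Return the effective profiling level per subnet and the chrometrace count.

    HTP subnets keep Linting; any other subnet falls back to Detailed.
    One chrometrace is expected per HTP subnet.
    """
    levels = ["linting" if backend == "HTP" else "detailed" for backend in subnet_backends]
    return levels, levels.count("linting")


# A network with 3 HTP subnets and 2 non-HTP subnets, as in the example above.
levels, trace_count = plan_profiling(["HTP", "CPU", "HTP", "GPU", "HTP"])
print(levels)
print(trace_count)
```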