{
  "display_name" : "CPU Bottlenecks",
  "initial_mode_name" : "bottlenecks",
  "modes" : [
    {
      "display_name" : "CPU Bottlenecks",
      "displays" : [
        {
          "denominator" : 1,
          "elements" : [
            {
              "color" : "green",
              "metric" : "useful"
            },
            {
              "color" : "blue",
              "metric" : "processing"
            },
            {
              "color" : "yellow",
              "metric" : "delivery"
            },
            {
              "color" : "red",
              "metric" : "discarded"
            }
          ],
          "kind" : "normalized-area"
        }
      ],
      "documentation" : "The CPU Bottlenecks mode categorizes the sustainable instruction bandwidth of the CPU into four categories: one that represents useful work and three that represent various high-level sources of inefficiency.\n\nThe CPU pipeline is divided into two primary phases: _Instruction Delivery_ (obtaining instructions from memory along a predicted execution path) and _Instruction Processing_ (performing the action of the instructions). Both phases are potential sources of inefficiency, as well as the CPU-wide impact of mispredicting the execution path.\n\nMore specifically, bandwidth is analyzed at a micro-operation (µop) granularity, where instructions are typically translated into one µop, but some instructions require more than one.",
      "metrics" : [
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Useful",
          "documentation" : "The fraction of sustainable instruction bandwidth that retired, meaning it was on the correct path through the code, and therefore contributed to forward progress of the application.\n\nMore specifically, this fraction directly relates to µops per cycle (µPC) and largely to instructions per cycle (IPC). Rather than presenting the measured µPC directly, this category shows the measured µPC divided by the maximum µPC for the particular core in order to provide a fraction of achievable µPC. A high fraction indicates that the code is executing in the CPU at a high rate.\n\nTo improve performance, improve the efficiency of the algorithm or the instruction sequences you use to implement the algorithm.",
          "name" : "useful",
          "synopsis" : "Fraction of sustainable instruction bandwidth that performed useful work."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Discarded Bottleneck",
          "documentation" : "The fraction of sustainable instruction bandwidth wasted due to the effects of a branch misprediction or a pipeline restart.\n\nMore specifically, this fraction represents µop bandwidth wasted while processing down the wrong predicted path, as well as lost bandwidth required to recover from the misprediction. A high fraction indicates that the control flow or memory dependencies or both are difficult to predict and the CPU is throwing away a significant amount of predicted work.\n\nBecause the instruction window and overall speculative execution capability of the CPU tends to increase from family to family, the amount of discarded work due to similar misprediction rates may also increase.\n\nTo improve performance, focus on removing branches through conditional moves or altering data structures and decision trees for a more stable path through the code.",
          "name" : "discarded",
          "short_display_name" : "Discarded",
          "synopsis" : "Fraction of sustainable instruction bandwidth lost to incorrect speculative execution."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Processing Bottleneck",
          "documentation" : "The fraction of instruction bandwidth lost because the Instruction Processing component is executing instructions at a rate lower than the sustainable bandwidth.\n\nMore specifically, this component maintains a forward-looking set of µops, called the instruction window, that the CPU collects from along a predicted instruction path and executes those µops as they become ready. This fraction represents µop bandwidth lost when the instruction window becomes full due to slow processing. A high fraction indicates insufficient instruction-level parallelism in the code, and that the Instruction Delivery component is inserting instructions into the instruction window faster than they are completed.\n\nTo improve performance, reduce memory delays and improve available instruction-level parallelism. For example, make data structure locality improvements, prefetch data from other cache lines, and unroll loops to create additional parallel computational sequences.",
          "name" : "processing",
          "short_display_name" : "Processing",
          "synopsis" : "Fraction of sustainable instruction bandwidth lost due to slow execution of instructions."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Bottleneck",
          "documentation" : "The fraction of instruction bandwidth lost because the Instruction Delivery component did not deliver instructions at an adequate rate for the Instruction Processing component.\n\nMore specifically, the Instruction Delivery component reads bytes from memory, decodes them into instructions, and delivers them as µops into the Instruction Processing component's instruction window. This fraction represents lost bandwidth when the processing component has room in the instruction window, but new µops are not available from the predicted instruction stream. A high fraction indicates a higher Instruction Processing bandwidth might be possible if the Instruction Delivery component delivers new instructions faster.\n\nTo improve performance, reduce instruction memory delays by improving the locality of hot functions, and inline, unroll, and straighten common code sequences to create longer streams of sequentially fetched instructions. For less common code, optimize for smaller code size to improve cache performance using `-Os` with C-based languages or `-Osize` for Swift.",
          "name" : "delivery",
          "short_display_name" : "Delivery",
          "synopsis" : "Fraction of sustained bandwidth lost because insufficient instructions were provided for processing."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Cycles",
          "name" : "cycle",
          "synopsis" : "Cycles elapsed while the CPU was active."
        }
      ],
      "name" : "bottlenecks",
      "synopsis" : "Analyzes the efficiency of the sustainable CPU instruction bandwidth.",
      "thresholds" : [
        {
          "display_name" : "High Discarded",
          "documentation" : "Remove branches through conditional moves or redesign data structures and decision trees for a more stable path through the code.",
          "expression" : "discarded",
          "next_modes" : [
            "discarded_sampling"
          ],
          "synopsis" : "Incorrect speculative execution is wasting bandwidth.",
          "thresholds" : [
            0.1
          ]
        },
        {
          "display_name" : "High Processing Bottleneck",
          "documentation" : "Reduce memory delays and improve available instruction-level parallelism. For example, make data structure locality improvements, prefetch data from other cache lines, and unroll loops to create additional parallel computational sequences.",
          "expression" : "processing",
          "next_modes" : [
            "processing",
            "l1d_miss_sampling"
          ],
          "synopsis" : "Serial data dependences with possibly long memory latencies are limiting bandwidth.",
          "thresholds" : [
            0.4
          ]
        },
        {
          "display_name" : "High Delivery Bottleneck",
          "documentation" : "Reduce memory delays by improving the locality of hot functions, and inline, unroll, and straighten code sequences to create longer streams of sequentially fetched instructions.",
          "expression" : "delivery",
          "next_modes" : [
            "delivery"
          ],
          "synopsis" : "Slow instruction delivery is limiting bandwidth.",
          "thresholds" : [
            0.2
          ]
        }
      ],
      "triggers" : [

      ]
    },
    {
      "display_name" : "Instruction Delivery Bottlenecks",
      "displays" : [
        {
          "denominator" : 1,
          "elements" : [
            {
              "color" : "yellow",
              "metric" : "delivery_latency"
            },
            {
              "color" : "red",
              "metric" : "delivery_latency_icache"
            },
            {
              "color" : "orange",
              "metric" : "delivery_latency_itlb"
            },
            {
              "color" : "yellow",
              "metric" : "delivery_latency_other"
            },
            {
              "color" : "purple",
              "metric" : "delivery_bandwidth"
            },
            {
              "color" : "blue",
              "metric" : "delivery_bandwidth_taken_br"
            },
            {
              "color" : "purple",
              "metric" : "delivery_bandwidth_other"
            }
          ],
          "kind" : "normalized-area"
        }
      ],
      "documentation" : "The Instruction Delivery Bottlenecks mode examines the efficiency of the Instruction Delivery component of the CPU, which reads instruction bytes from memory, translates them into instructions, and delivers them as µops into the Instruction Processing component.\n\nThis mode further categorizes the sustainable bandwidth lost due to delivery while the Instruction Processing component is able to accept additional µops, with some categories related to latency issues and others related to reduced bandwidth issues.\n\nUnlike the Instruction Processing Bottlenecks mode, which categorizes the overall activity of the Instruction Processing component each cycle, this analysis effectively _zooms in_ on the Instruction Delivery bottleneck to reveal further categorization.",
      "metrics" : [
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Latency",
          "documentation" : "The fraction of sustainable bandwidth lost because the Instruction Delivery component experienced an event that delayed the delivery of the next instructions.\n\nTo improve performance, inline common code sequences and consolidate frequently executed functions together in memory. For less common code, optimize for smaller code size to improve cache performance using `-Os` with C-based languages or `-Osize` for Swift.",
          "name" : "delivery_latency",
          "short_display_name" : "Delivery Latency",
          "synopsis" : "Fraction of sustainable bandwidth lost because the Instruction Delivery component was unable to deliver any operations in a cycle."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Latency from Instruction Cache",
          "documentation" : "The fraction of sustainable bandwidth lost because the Instruction Delivery component experienced an L1 Instruction Cache miss while reading instruction bytes from memory.\n\nA high fraction is often associated with fetching small blocks of infrequently used instructions scattered around memory.\n\nTo improve performance, inline common code sequences and consolidate frequently executed functions together in memory. For less common code, optimize for smaller code size to improve cache performance using `-Os` with C-based languages or `-Osize` for Swift.",
          "name" : "delivery_latency_icache",
          "short_display_name" : "L1IC Latency",
          "synopsis" : "Fraction of sustainable bandwidth lost due to an L1 Instruction Cache miss."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Latency from Instruction TLB",
          "documentation" : "The fraction of sustainable bandwidth lost because the Instruction Delivery component experienced an L1 Instruction Translation Lookaside Buffer (TLB) miss while reading instruction bytes from memory.\n\nA high fraction is often associated with fetching instructions scattered across the memory space of the application.\n\nTo improve performance, consolidate frequently executed functions together in memory and inline common code sequences to create integrated streams of instructions. For less common code, optimize for smaller code size to improve cache performance using `-Os` with C-based languages or `-Osize` for Swift.",
          "name" : "delivery_latency_itlb",
          "short_display_name" : "L1ITLB Latency",
          "synopsis" : "Fraction of sustainable bandwidth lost due to an L1 Instruction Translation Lookaside Buffer (TLB) miss."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Other Latency",
          "documentation" : "The fraction of sustainable bandwidth lost because the Instruction Delivery component experienced an event that delayed the delivery of the next instructions due to an undetermined cause.\n\nIn some cases, these delays might be artifacts of the other Instruction Delivery issues, and focusing on solutions for those issues can help with this category.",
          "name" : "delivery_latency_other",
          "short_display_name" : "Other Latency",
          "synopsis" : "Fraction of sustainable bandwidth lost due to an undetermined event that delayed the delivery of the next instructions."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Bandwidth",
          "documentation" : "The fraction of the sustainable bandwidth lost because the Instruction Delivery component was unable to deliver µops at full bandwidth in a cycle.\n\nA high fraction is often associated with an instruction stream with too many taken branches, including calls and returns, leading to sequences that are shorter than the sustainable bandwidth per cycle.\n\nTo improve performance, inline, unroll, and straighten common code sequences to create longer streams of sequentially fetched instructions. For example, replace simple branches that skip assignments with conditional select or set instructions.",
          "name" : "delivery_bandwidth",
          "short_display_name" : "Delivery BW",
          "synopsis" : "Fraction of the sustainable bandwidth lost due to not delivering µops at full bandwidth."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Bandwidth from Taken Branch",
          "documentation" : "The fraction of the sustainable bandwidth lost because the Instruction Delivery component was unable to deliver µops at full bandwidth in a cycle due to taken branches.\n\nA high fraction is often associated with an instruction stream with too many taken branches, including calls and returns, leading to sequences that are shorter than the sustainable bandwidth per cycle. These are often due to short loop bodies.\n\nTo improve performance, inline, unroll, and straighten common code sequences to create longer streams of sequentially fetched instructions. For example, replace simple branches that skip assignments with conditional select or set instructions.",
          "name" : "delivery_bandwidth_taken_br",
          "short_display_name" : "BW Taken Br",
          "synopsis" : "Fraction of the sustainable bandwidth lost due to not delivering µops at full bandwidth due to taken branches."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Instruction Delivery Bandwidth due to undetermined events.",
          "documentation" : "The fraction of the sustainable bandwidth lost because the Instruction Delivery component was unable to deliver µops at full bandwidth in a cycle due to undetermined events.\n\nIn some cases, these delays might be artifacts of the other Instruction Delivery issues, and focusing on solutions for those issues can help with this category.",
          "name" : "delivery_bandwidth_other",
          "short_display_name" : "BW Other",
          "synopsis" : "Fraction of the sustainable bandwidth lost due to undetermined events."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Cycles",
          "name" : "cycle",
          "synopsis" : "Cycles elapsed while the CPU was active."
        }
      ],
      "name" : "delivery",
      "synopsis" : "Analyzes the causes of limited instruction delivery.",
      "thresholds" : [
        {
          "display_name" : "High Instruction Delivery Latency",
          "documentation" : "Inline, unroll, and straighten common code sequences to create longer streams of sequentially fetched instructions. Consolidate frequently executed portions of functions together in memory. More consistent control flow can also help with prefetching instructions. For less common code, optimize for smaller code size to improve cache performance using `-Os` with C-based languages or `-Osize` for Swift.",
          "expression" : "delivery_latency",
          "next_modes" : [

          ],
          "synopsis" : "Instruction Translation Lookaside Buffer (TLB) and Cache misses are causing delays.",
          "thresholds" : [
            0.25
          ]
        },
        {
          "display_name" : "High Instruction Delivery Latency from Instruction Cache",
          "documentation" : "Inline, unroll, and straighten common code sequences to create longer streams of sequentially fetched instructions. Consolidate frequently executed portions of functions together in memory. More consistent control flow can also help with prefetching instructions. For less common code, optimize for smaller code size to improve cache performance using `-Os` with C-based languages or `-Osize` for Swift.",
          "expression" : "delivery_latency_icache",
          "next_modes" : [

          ],
          "synopsis" : "Instruction Cache misses are causing delays.",
          "thresholds" : [
            0.2
          ]
        },
        {
          "display_name" : "High Instruction Delivery Latency from Instruction TLB",
          "documentation" : "Consolidate frequently executed portions of functions together in memory. Inline common code sequences to create integrated streams of instructions. More consistent control flow can also help with prefetching instructions. For less common code, optimize for smaller code size to improve cache performance using `-Os` with C-based languages or `-Osize` for Swift.",
          "expression" : "delivery_latency_itlb",
          "next_modes" : [

          ],
          "synopsis" : "Instruction Translation Lookaside Buffer (TLB) misses are causing delays.",
          "thresholds" : [
            0.05
          ]
        },
        {
          "display_name" : "High Instruction Delivery Other Latency",
          "documentation" : "In some cases, these delays might be artifacts of the other Instruction Delivery causes. Follow the recommendations for these other significant causes.",
          "expression" : "delivery_latency_other",
          "next_modes" : [

          ],
          "synopsis" : "An undetermined event is causing delays.",
          "thresholds" : [
            0.15
          ]
        },
        {
          "display_name" : "Bottlenecked Instruction Delivery Bandwidth",
          "documentation" : "There are too few sequential instructions between taken branches. Inline, unroll, and straighten common code sequences to create longer streams of sequentially fetched instructions.",
          "expression" : "delivery_bandwidth",
          "next_modes" : [

          ],
          "synopsis" : "Short sequential code sequences are limiting delivery.",
          "thresholds" : [
            0.15
          ]
        },
        {
          "display_name" : "Bottlenecked Instruction Delivery Bandwidth from Taken Branch",
          "documentation" : "There are too few sequential instructions between taken branches. Inline, unroll, and straighten common code sequences to create longer streams of sequentially fetched instructions.",
          "expression" : "delivery_bandwidth_taken_br",
          "next_modes" : [

          ],
          "synopsis" : "Short sequential code sequences are limiting delivery.",
          "thresholds" : [
            0.1
          ]
        },
        {
          "display_name" : "High Instruction Delivery Other Bandwidth",
          "documentation" : "In some cases, these delays might be artifacts of the other Instruction Delivery causes. Follow the recommendations for these other significant causes.",
          "expression" : "delivery_bandwidth_other",
          "next_modes" : [

          ],
          "synopsis" : "An undetermined event is causing delays.",
          "thresholds" : [
            0.1
          ]
        }
      ],
      "triggers" : [

      ]
    },
    {
      "display_name" : "Instruction Processing Bottlenecks",
      "displays" : [
        {
          "denominator" : 1,
          "elements" : [
            {
              "color" : "blue",
              "metric" : "executing"
            },
            {
              "color" : "purple",
              "metric" : "memory_miss_executing"
            },
            {
              "color" : "red",
              "metric" : "memory_miss"
            },
            {
              "color" : "orange",
              "metric" : "non_critical_memory_miss"
            },
            {
              "color" : "green",
              "metric" : "mte_tag_check"
            },
            {
              "color" : "yellow",
              "metric" : "execution_latency_without_sme"
            },
            {
              "color" : "yellow",
              "metric" : "execution_latency"
            }
          ],
          "kind" : "normalized-area"
        },
        {
          "color" : "blue",
          "kind" : "bar",
          "metric" : "sme_stream_enable"
        }
      ],
      "documentation" : "The Instruction Processing Bottlenecks mode examines the efficiency of the Instruction Processing component of the CPU.\n\nThis component maintains a forward-looking set of instructions (called the instruction window) that it collects along a predicted instruction path, and executes those instructions as they become ready. This mode categorizes overall activity of the Instruction Processing component each cycle, when there is at least 1 µop in the instruction window, into a number of categories.\n\nUnlike the Instruction Delivery Bottleneck mode which further categorizes the cause of each lost delivery, this mode examines all cycles where there is work to do in the instruction window.\n\nSpecifically, this mode helps to analyze whether the workload is more compute-bound (bottlenecked on compute instruction bandwidth and latency) or memory-bound (bottlenecked on obtaining data from the memory hierarchy). Because of the large instruction window, this is a complex question because the Instruction Processing component can often find some speculative work to do even while waiting for data to be supplied from a distant source, and can have requests outstanding for many (speculative) memory accesses at any time.",
      "metrics" : [
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Critical L1D Cache Miss",
          "documentation" : "The fraction of cycles where the instruction window contains µops, did not execute any of them, and the oldest tracked load or store operation is experiencing an L1D Cache miss.\n\nCombined with the _Critical L1D Cache Miss While Executing_ category, the combined fraction represents the overall impact of a critical memory miss, and generally reflects the degree to which the workload is memory bound.\n\nA high fraction is often associated with scattered accesses to data memory such that there is poor cache-line reuse with access patterns that are difficult for the CPU to predict.\n\nTo improve performance, reduce the working set of data and access your data in regular, strided patterns. For multi-threaded applications, ensure that independent variables that might be actively read by different threads are in separate 128B cache lines to avoid false sharing.",
          "name" : "memory_miss",
          "short_display_name" : "L1DC Crit Miss",
          "synopsis" : "Fraction of cycles where the Instruction Processing component is idle and waiting for data for the oldest tracked load or store from beyond the L1D Cache."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Critical L1D Cache Miss While Executing",
          "documentation" : "The fraction of cycles where the Instruction Processing component is executing at least 1 µop from the instruction window, while the oldest tracked load or store operation is experiencing an L1D Cache miss.\n\nCombined with the _Critical L1D Cache Miss_ category, the combined fraction represents the overall impact of a critical memory miss, and generally reflects the degree to which the workload is memory bound.\n\nA high fraction is often associated with scattered accesses to data memory such that there is poor cache-line reuse with access patterns that are difficult for the CPU to predict. However, the Instruction Processing component is still able to find µops to execute.\n\nTo improve performance, reduce the working set of data and access your data in regular, strided patterns. For multi-threaded applications, ensure that independent variables that might be actively read by different threads are in separate 128B cache lines to avoid false sharing.",
          "name" : "memory_miss_executing",
          "short_display_name" : "L1DC Crit Miss Exec",
          "synopsis" : "Fraction of cycles where the oldest tracked load or store is waiting for data from beyond the L1D Cache, but the Instruction Processing component is executing at least 1 µop from the instruction window."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Executing",
          "documentation" : "The fraction of cycles where the instruction window contains µops and executed at least one of them and the oldest load or store operations are not waiting on an L1D Cache miss.\n\nA high fraction indicates that the Instruction Processing component is making progress executing µops in the instruction window, but combined with a high Processing Bottleneck from the CPU Bottleneck Counting Mode, indicates there is insufficient parallelism to keep it very busy.\n\nTo improve performance, reduce critical paths through code sequences and parallelize code to eliminate dependencies.",
          "name" : "executing",
          "synopsis" : "Fraction of cycles where the Instruction Processing component is executing at least 1 µop from the instruction window and there are no load or store operations experiencing a Critical L1D Cache Miss."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Execution Latency",
          "documentation" : "The fraction of cycles where the instruction window contains µops, did not execute any of them, and the critical path through the code is limited by instruction latencies (possibly including load instructions that hit in the L1D Cache).\n\nTo improve performance, reduce critical paths through code sequences and parallelize code to eliminate dependencies.",
          "name" : "execution_latency_without_sme",
          "short_display_name" : "Exec Latency",
          "synopsis" : "Fraction of overall cycles where the Instruction Processing component is idle due to latencies of operations in dependent chains."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Execution Latency",
          "documentation" : "The fraction of cycles where the instruction window contains µops, did not execute any of them, and the critical path through the code is limited by instruction latencies (possibly including load instructions that hit in the L1D Cache).\n\nIf the SME ISA is in use, as indicated by _SME Streaming Enable Transitions_ metric, consult the _Streaming SME Bottlenecks_ mode.\n\nTo improve performance, reduce critical paths through code sequences and parallelize code to eliminate dependencies.",
          "name" : "execution_latency",
          "short_display_name" : "Exec Latency",
          "synopsis" : "Fraction of overall cycles where the Instruction Processing component is idle due to latencies of operations in dependent chains."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Non-Critical L1D Cache Miss",
          "documentation" : "The fraction of cycles where the instruction window contains µops, did not execute any of them, and a younger tracked load operation is experiencing an L1D Cache miss. The CPU is able to issue younger loads speculatively while waiting for other older work to complete.\n\nA high fraction is often associated with scattered accesses to data memory such that there is poor cache-line reuse with access patterns that are difficult for the CPU to predict.\n\nTo improve performance, reduce the working set of data and access your data in regular, strided patterns. For multi-threaded applications, ensure that independent variables that might be actively read by different threads are in separate 128B cache lines to avoid false sharing.",
          "name" : "non_critical_memory_miss",
          "short_display_name" : "L1DC NonCrit Miss",
          "synopsis" : "Fraction of cycles where the Instruction Processing component is waiting for data for a load other than the oldest tracked load from beyond the L1D Cache."
        },
        {
          "aggregation" : "time-weighted-average",
          "display_name" : "Critical MTE Allocation Tag Check",
          "documentation" : "The fraction of cycles where the instruction window contains µops, did not execute any of them, and the oldest tracked load or store operation is experiencing an MTE tag check delay that is limiting the speculation depth of the instruction window.\n\nIn the case of a load, the data must already be available for this to count. This category will not necessarily correlate directly with performance degradation due to MTE, as there are many activities going on in the CPU at the same time with many interactions. It will count when MTE is clearly responsible for Instruction Processing delays.\n\nA high fraction is often associated with scattered accesses to data memory such that there is poor tag cache-line reuse with access patterns that are difficult for the CPU to predict.\n\nTo improve performance, reduce the working set of data and access your data in regular, strided patterns.",
          "name" : "mte_tag_check",
          "short_display_name" : "Tag Check",
          "synopsis" : "Fraction of cycles where the Instruction Processing component is idle and waiting for the MTE tag check to complete for the oldest tracked load or store operation."
        },
        {
          "aggregation" : "sum",
          "display_name" : "SME Streaming Enable Transitions",
          "name" : "sme_stream_enable",
          "short_display_name" : "SME Strm En",
          "synopsis" : "Transitions into SME Engine Streaming Mode."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Cycles",
          "name" : "cycle",
          "synopsis" : "Cycles elapsed while the CPU was active."
        }
      ],
      "name" : "processing",
      "synopsis" : "Analyzes the causes of slow instruction processing.",
      "thresholds" : [
        {
          "display_name" : "High Executing",
          "documentation" : "Reduce critical paths through code sequences and parallelize code to eliminate dependencies. When combined with a high overall Instruction Processing bottleneck and low IPC, this indicates that instruction-level parallelism might be low.",
          "expression" : "executing",
          "next_modes" : [

          ],
          "synopsis" : "Instruction Processing is executing at least one µop.",
          "thresholds" : [
            0.7
          ]
        },
        {
          "display_name" : "SME Detected",
          "expression" : "sme_stream_enable",
          "next_modes" : [
            "sme_streaming"
          ],
          "synopsis" : "The SME Engine is enabled.",
          "thresholds" : [
            1
          ]
        },
        {
          "display_name" : "High Critical L1D Cache Misses",
          "documentation" : "Reduce the working set of data and ensure there are consistent strides in data-access patterns. For multi-threaded applications, ensure that independent variables that might be actively read by different threads are in separate 128B cache lines to avoid false sharing.",
          "expression" : "memory_miss",
          "next_modes" : [
            "l1d_miss_sampling"
          ],
          "synopsis" : "Data Cache misses are causing delays.",
          "thresholds" : [
            0.1
          ]
        },
        {
          "display_name" : "High Execution Latency",
          "documentation" : "Reduce critical paths through code sequences and parallelize code to eliminate dependencies.",
          "expression" : "execution_latency_without_sme",
          "next_modes" : [

          ],
          "synopsis" : "Instruction latencies, including load cache hits, are causing delays.",
          "thresholds" : [
            0.1
          ]
        },
        {
          "display_name" : "High Execution Latency",
          "documentation" : "Reduce critical paths through code sequences and parallelize code to eliminate dependencies.\n\nIf the SME ISA is in use, as indicated by the _SME Streaming Enable Transitions_ metric, consult the _Streaming SME Bottlenecks_ mode.",
          "expression" : "execution_latency",
          "next_modes" : [
            "sme_streaming"
          ],
          "synopsis" : "Instruction latencies, including load cache hits, are causing delays.",
          "thresholds" : [
            0.1
          ]
        }
      ],
      "triggers" : [

      ]
    },
    {
      "display_name" : "Streaming SME Bottlenecks",
      "displays" : [
        {
          "denominator" : 1,
          "elements" : [
            {
              "color" : "blue",
              "metric" : "ldst_sme_instruction_queue_full_waiting"
            },
            {
              "color" : "red",
              "metric" : "sheduler_sme_reg_data_waiting"
            },
            {
              "color" : "orange",
              "metric" : "ldst_sme_mem_data_waiting"
            }
          ],
          "kind" : "normalized-area"
        },
        {
          "color" : "purple",
          "kind" : "bar",
          "metric" : "sme_ssfp"
        }
      ],
      "documentation" : "The Streaming SME Bottlenecks mode examines inefficiencies in the Instruction Processing component related to SME instructions.",
      "metrics" : [
        {
          "aggregation" : "sum",
          "display_name" : "SME Engine Instruction Queue Full",
          "documentation" : "Cycles while the instruction queue to the SME Engine is full, no µop was issued by the scheduler, and there was no critical memory miss.\n\nThis condition is normal for SME-heavy workloads, but indicates that the core is unable to work further ahead due to the back pressure.",
          "name" : "ldst_sme_instruction_queue_full_waiting",
          "short_display_name" : "SME Inst Q Full",
          "synopsis" : "Fraction of cycles while the SME Engine instruction queue is full."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Scheduler Waiting for SME Engine Register Data",
          "documentation" : "Cycles while the CPU scheduler is waiting for register, predicate, or flag data from the SME Engine, no µop was issued by the scheduler, and there was no critical memory miss.\n\nUsing SME-generated register data in the CPU is not recommended due to high latency.",
          "name" : "sheduler_sme_reg_data_waiting",
          "short_display_name" : "Wait SME Reg Data",
          "synopsis" : "Fraction of cycles while the CPU scheduler is waiting on SME Engine register data."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Load\/Store Unit Waiting for SME Engine Memory Data",
          "documentation" : "Cycles while the Load\/Store Unit is waiting for the SME Engine to produce memory data, no µop was issued by the scheduler, and there was no critical memory miss.\n\nUsing SME-generated memory data in the CPU is not recommended due to high latency, but may not always be avoidable.",
          "name" : "ldst_sme_mem_data_waiting",
          "short_display_name" : "Wait SME Mem Data",
          "synopsis" : "Fraction of cycles while the Load\/Store Unit is waiting on SME Engine memory data."
        },
        {
          "aggregation" : "sum",
          "display_name" : "SME Scalar Floating Point Instructions",
          "name" : "sme_ssfp",
          "short_display_name" : "SME Scalar FP Ops",
          "synopsis" : "Retired scalar floating-point SME Engine instructions."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Cycles",
          "name" : "cycle",
          "synopsis" : "Cycles elapsed while the CPU was active."
        }
      ],
      "name" : "sme_streaming",
      "synopsis" : "Analyzes bottlenecks related to the SME Engine.",
      "thresholds" : [

      ],
      "triggers" : [

      ]
    },
    {
      "display_name" : "L1D Miss Sampling",
      "displays" : [
        {
          "color" : "red",
          "kind" : "bar",
          "metric" : "l1d_load_miss"
        },
        {
          "color" : "blue",
          "kind" : "bar",
          "metric" : "l1d_store_miss"
        },
        {
          "color" : "purple",
          "kind" : "bar",
          "metric" : "l1d_tlb_miss"
        }
      ],
      "documentation" : "The L1D Miss Sampling mode counts and identifies retired instructions that missed in the respective caching mechanism. These misses result in longer memory access times that may slow execution. They are often associated with scattered accesses to data memory such that there is poor cache-line reuse with access patterns that are difficult for the CPU to predict. ",
      "metrics" : [
        {
          "aggregation" : "sum",
          "display_name" : "L1D TLB Misses",
          "documentation" : "Counts and identifies retired instructions that miss in the _L1D Translation Lookaside Buffer (TLB)_, which caches virtual-to-physical address translations. Misses result in delays that may slow execution, and are often the result of scattered accesses across the application’s memory space.\n\nTo improve performance, reduce your overall data working set size by consolidating frequently-used data into fewer 16KiB pages.",
          "name" : "l1d_tlb_miss",
          "short_display_name" : "L1DTLB Misses",
          "synopsis" : "Retired instructions that miss in the _Data Translation Lookaside Buffer (TLB)_."
        },
        {
          "aggregation" : "sum",
          "display_name" : "L1D Cache Load Misses",
          "documentation" : "Counts and identifies retired load instructions that miss in the _L1D Cache_.\n\nTo improve performance, reduce the working set of data and access your data in regular, strided patterns. For multi-threaded applications, ensure that independent variables that might be actively read by different threads are in separate 128B cache lines to avoid false sharing.",
          "name" : "l1d_load_miss",
          "short_display_name" : "L1DC Ld Misses",
          "synopsis" : "Retired load instructions that miss in the _L1D Cache_."
        },
        {
          "aggregation" : "sum",
          "display_name" : "L1D Cache Store Misses",
          "documentation" : "Counts and identifies retired store instructions that miss in the _L1D Cache_.\n\nTo improve performance, reduce the working set of data and access your data in regular, strided patterns. For multi-threaded applications, ensure that independent variables that might be actively read by different threads are in separate 128B cache lines to avoid false sharing.",
          "name" : "l1d_store_miss",
          "short_display_name" : "L1DC St Misses",
          "synopsis" : "Retired store instructions that miss in the _L1D Cache_."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Cycles",
          "name" : "cycle",
          "synopsis" : "Cycles elapsed while the CPU was active."
        }
      ],
      "name" : "l1d_miss_sampling",
      "synopsis" : "Samples retired instructions that miss in the L1D TLB or L1D Cache.",
      "thresholds" : [

      ],
      "triggers" : [
        {
          "metric" : "l1d_load_miss"
        },
        {
          "metric" : "l1d_store_miss"
        },
        {
          "metric" : "l1d_tlb_miss"
        }
      ]
    },
    {
      "display_name" : "Discarded Sampling",
      "displays" : [
        {
          "color" : "purple",
          "kind" : "bar",
          "metric" : "discarded_memory"
        },
        {
          "color" : "orange",
          "kind" : "bar",
          "metric" : "discarded_branch"
        },
        {
          "color" : "red",
          "kind" : "bar",
          "metric" : "discarded_cond_branch"
        },
        {
          "color" : "yellow",
          "kind" : "bar",
          "metric" : "discarded_other_branch"
        }
      ],
      "documentation" : "The Discarded Sampling mode counts and identifies retired instructions for which the CPU made an incorrect prediction regarding branch directions or memory dependences. These are typically the result of complex or irregular execution patterns.\n\nTo improve performance, focus on removing branches through conditional moves or altering data structures and decision trees for a more stable path through the code.",
      "metrics" : [
        {
          "aggregation" : "sum",
          "display_name" : "Unpredicted Memory Dependencies",
          "documentation" : "Counts and identifies retired store instructions that caused a pipeline flush and replay of loads and subsequent instructions because of an unpredicted memory dependency.\n\nTo improve performance, access your data in more regular patterns. Avoid patterns where a load instruction only occasionally obtains data from a recent store instruction.",
          "name" : "discarded_memory",
          "short_display_name" : "Mem Dep Mispreds",
          "synopsis" : "Retired store instructions that caused unpredicted memory dependences."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Incorrectly-predicted Branches",
          "documentation" : "Counts and identifies retired branch instructions, including calls and returns, that experienced a misprediction and required younger speculative instructions to be flushed.\n\nTo improve performance, simplify branch conditions where possible and consider the use of conditional instructions such as conditional select. Ensure that calls and returns are balanced and avoid deep call stacks.",
          "name" : "discarded_branch",
          "short_display_name" : "Br Mispreds",
          "synopsis" : "Retired branch instructions that caused mispredicted speculative execution."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Incorrectly-predicted Conditional Branches",
          "documentation" : "Counts and identifies retired conditional branch instructions that experienced a misprediction and required younger speculative instructions to be flushed.\n\nTo improve performance, simplify branch conditions where possible and consider the use of conditional instructions such as conditional select.",
          "name" : "discarded_cond_branch",
          "short_display_name" : "Cond Br Mispreds",
          "synopsis" : "Retired conditional branch instructions that caused mispredicted speculative execution."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Incorrectly-predicted Other Branches",
          "documentation" : "Counts retired non-conditional branch instructions that experienced a misprediction and required younger speculative instructions to be flushed.\n\nRun in the _Discarded Indirect, Call, and Return Sampling_ mode for sampling of these additional branch types.",
          "name" : "discarded_other_branch",
          "short_display_name" : "Oth Br Mispreds",
          "synopsis" : "Retired non-conditional branch instructions that caused mispredicted speculative execution."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Cycles",
          "name" : "cycle",
          "synopsis" : "Cycles elapsed while the CPU was active."
        }
      ],
      "name" : "discarded_sampling",
      "synopsis" : "Samples retired instructions that were the source of incorrect speculative execution.",
      "thresholds" : [
        {
          "display_name" : "High Indirect Branch Mispredicts",
          "expression" : "discarded_other_branch",
          "next_modes" : [
            "discarded_indirect_sampling"
          ],
          "synopsis" : "Incorrectly-predicted indirect, call, or return branches are wasting bandwidth.",
          "thresholds" : [
            200
          ]
        }
      ],
      "triggers" : [
        {
          "metric" : "discarded_memory"
        },
        {
          "metric" : "discarded_branch"
        },
        {
          "metric" : "discarded_cond_branch"
        }
      ]
    },
    {
      "display_name" : "Discarded Indirect, Call, and Return Sampling",
      "displays" : [
        {
          "color" : "purple",
          "kind" : "bar",
          "metric" : "discarded_indirect_branch"
        },
        {
          "color" : "red",
          "kind" : "bar",
          "metric" : "discarded_call"
        },
        {
          "color" : "blue",
          "kind" : "bar",
          "metric" : "discarded_return"
        }
      ],
      "documentation" : "The Discarded Indirect, Call, and Return Sampling mode counts and identifies retired instructions for which the CPU made an incorrect prediction regarding indirect, call, and return instructions. These are typically the result of complex irregular branching patterns, such as mismatched calls and returns.\n\nTo improve performance, reduce the number of unique indirect targets where possible and pair calls with returns.",
      "metrics" : [
        {
          "aggregation" : "sum",
          "display_name" : "Incorrectly-predicted Indirect Branches",
          "documentation" : "Counts and identifies retired indirect branch instructions that experienced a misprediction and required younger speculative instructions to be flushed. Some indirect branches and calls are used to branch into dynamically-linked code and may be unavoidable.\n\nTo improve performance, reduce the number of unique indirect targets, and use properly paired calls and returns.",
          "name" : "discarded_indirect_branch",
          "short_display_name" : "Ind Br Mispreds",
          "synopsis" : "Retired branch instructions, including indirect calls and returns, that caused mispredicted speculative execution."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Incorrectly-predicted Returns",
          "documentation" : "Counts and identifies retired return instructions that experienced a misprediction and required younger speculative instructions to be flushed.\n\nThis is a subset of **Incorrectly-predicted Indirect Branches**.\n\nTo improve performance, ensure that calls and returns are balanced and avoid deep call stacks.",
          "name" : "discarded_return",
          "short_display_name" : "Ret Mispreds",
          "synopsis" : "Retired return instructions that caused mispredicted speculative execution."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Incorrectly-predicted Indirect Calls",
          "documentation" : "Counts and identifies retired indirect call instructions that experienced a misprediction and required younger speculative instructions to be flushed. Some indirect calls are used to branch into dynamically-linked code and may be unavoidable.\n\nThis is a subset of **Incorrectly-predicted Indirect Branches**.\n\nTo improve performance, reduce the number of unique indirect call targets.",
          "name" : "discarded_call",
          "short_display_name" : "Ind Call Mispreds",
          "synopsis" : "Retired indirect call instructions that caused mispredicted speculative execution."
        },
        {
          "aggregation" : "sum",
          "display_name" : "Cycles",
          "name" : "cycle",
          "synopsis" : "Cycles elapsed while the CPU was active."
        }
      ],
      "name" : "discarded_indirect_sampling",
      "synopsis" : "Samples retired indirect, call, and return instructions that were the source of incorrect speculative execution.",
      "thresholds" : [

      ],
      "triggers" : [
        {
          "metric" : "discarded_indirect_branch"
        },
        {
          "metric" : "discarded_call"
        },
        {
          "metric" : "discarded_return"
        }
      ]
    }
  ],
  "name" : "bottleneck",
  "platforms" : [
    {
      "constants" : {
        "MAP_BW" : {
          "e" : 4,
          "p" : 8
        }
      },
      "expressions" : {
        "correction" : "max(0, delivery_latency + discarded + processing + useful - 1)",
        "cycle" : "CORE_ACTIVE_CYCLE",
        "delivery" : "1 - useful - max(0, processing + discarded - correction)",
        "delivery_bandwidth" : "delivery - delivery_latency",
        "delivery_latency" : "MAP_DISPATCH_BUBBLE \/ CORE_ACTIVE_CYCLE",
        "discarded" : "max(0, (MAP_REWIND * MAP_BW) + MAP_INT_UOP + MAP_LDST_UOP + MAP_SIMD_UOP - RETIRE_UOP) \/ slot",
        "discarded_branch" : "BRANCH_MISPRED_NONSPEC",
        "discarded_call" : "BRANCH_CALL_INDIR_MISPRED_NONSPEC",
        "discarded_cond_branch" : "BRANCH_COND_MISPRED_NONSPEC",
        "discarded_indirect_branch" : "BRANCH_INDIR_MISPRED_NONSPEC",
        "discarded_memory" : "ST_MEM_ORDER_VIOL_LD_NONSPEC",
        "discarded_other_branch" : "BRANCH_MISPRED_NONSPEC - BRANCH_COND_MISPRED_NONSPEC",
        "discarded_return" : "BRANCH_RET_INDIR_MISPRED_NONSPEC",
        "l1d_load_miss" : "L1D_CACHE_MISS_LD_NONSPEC",
        "l1d_store_miss" : "L1D_CACHE_MISS_ST_NONSPEC",
        "l1d_tlb_miss" : "L1D_TLB_MISS_NONSPEC",
        "processing" : "MAP_STALL \/ CORE_ACTIVE_CYCLE",
        "slot" : "CORE_ACTIVE_CYCLE * MAP_BW",
        "useful" : "RETIRE_UOP \/ slot"
      },
      "periods" : {
        "discarded_branch" : 10000,
        "discarded_call" : 10000,
        "discarded_cond_branch" : 10000,
        "discarded_indirect_branch" : 10000,
        "discarded_memory" : 10000,
        "discarded_return" : 10000,
        "l1d_load_miss" : 10000,
        "l1d_store_miss" : 10000,
        "l1d_tlb_miss" : 10000
      },
      "platforms" : [
        "t8101",
        "t8103",
        "t6000",
        "t6001",
        "t6002"
      ],
      "supported_mode_names" : [
        "bottlenecks",
        "delivery",
        "l1d_miss_sampling",
        "discarded_sampling",
        "discarded_indirect_sampling"
      ]
    },
    {
      "constants" : {
        "MAP_BW" : {
          "e" : 5,
          "p" : 8
        }
      },
      "expressions" : {
        "correction" : "max(0, delivery_latency + discarded + processing + useful - 1)",
        "cycle" : "CORE_ACTIVE_CYCLE",
        "delivery" : "1 - useful - max(0, processing + discarded - correction)",
        "delivery_bandwidth" : "delivery - delivery_latency",
        "delivery_latency" : "MAP_DISPATCH_BUBBLE \/ CORE_ACTIVE_CYCLE",
        "discarded" : "max(0, (MAP_REWIND * MAP_BW) + MAP_INT_UOP + MAP_LDST_UOP + MAP_SIMD_UOP - RETIRE_UOP) \/ slot",
        "discarded_branch" : "BRANCH_MISPRED_NONSPEC",
        "discarded_call" : "BRANCH_CALL_INDIR_MISPRED_NONSPEC",
        "discarded_cond_branch" : "BRANCH_COND_MISPRED_NONSPEC",
        "discarded_indirect_branch" : "BRANCH_INDIR_MISPRED_NONSPEC",
        "discarded_memory" : "ST_MEM_ORDER_VIOL_LD_NONSPEC",
        "discarded_other_branch" : "BRANCH_MISPRED_NONSPEC - BRANCH_COND_MISPRED_NONSPEC",
        "discarded_return" : "BRANCH_RET_INDIR_MISPRED_NONSPEC",
        "l1d_load_miss" : "L1D_CACHE_MISS_LD_NONSPEC",
        "l1d_store_miss" : "L1D_CACHE_MISS_ST_NONSPEC",
        "l1d_tlb_miss" : "L1D_TLB_MISS_NONSPEC",
        "processing" : "MAP_STALL \/ CORE_ACTIVE_CYCLE",
        "slot" : "CORE_ACTIVE_CYCLE * MAP_BW",
        "useful" : "RETIRE_UOP \/ slot"
      },
      "periods" : {
        "discarded_branch" : 10000,
        "discarded_call" : 10000,
        "discarded_cond_branch" : 10000,
        "discarded_indirect_branch" : 10000,
        "discarded_memory" : 10000,
        "discarded_return" : 10000,
        "l1d_load_miss" : 10000,
        "l1d_store_miss" : 10000,
        "l1d_tlb_miss" : 10000
      },
      "platforms" : [
        "t8110",
        "t8112",
        "t6020",
        "t6021",
        "t6022"
      ],
      "supported_mode_names" : [
        "bottlenecks",
        "delivery",
        "l1d_miss_sampling",
        "discarded_sampling",
        "discarded_indirect_sampling"
      ]
    },
    {
      "constants" : {
        "MAP_BW" : {
          "e" : 5,
          "p" : 9
        }
      },
      "expressions" : {
        "cycle" : "CORE_ACTIVE_CYCLE",
        "delivery" : "1 - useful - discarded - processing",
        "delivery_bandwidth" : "delivery - (MAP_DISPATCH_BUBBLE \/ CORE_ACTIVE_CYCLE)",
        "delivery_latency_icache" : "MAP_DISPATCH_BUBBLE_IC \/ CORE_ACTIVE_CYCLE",
        "delivery_latency_itlb" : "MAP_DISPATCH_BUBBLE_ITLB \/ CORE_ACTIVE_CYCLE",
        "delivery_latency_other" : "(MAP_DISPATCH_BUBBLE \/ CORE_ACTIVE_CYCLE) - delivery_latency_icache - delivery_latency_itlb",
        "discarded" : "(MAP_UOP - RETIRE_UOP + (MAP_REWIND * MAP_BW)) \/ slot",
        "discarded_branch" : "BRANCH_MISPRED_NONSPEC",
        "discarded_call" : "BRANCH_CALL_INDIR_MISPRED_NONSPEC",
        "discarded_cond_branch" : "BRANCH_COND_MISPRED_NONSPEC",
        "discarded_indirect_branch" : "BRANCH_INDIR_MISPRED_NONSPEC",
        "discarded_memory" : "ST_MEM_ORDER_VIOL_LD_NONSPEC",
        "discarded_other_branch" : "BRANCH_MISPRED_NONSPEC - BRANCH_COND_MISPRED_NONSPEC",
        "discarded_return" : "BRANCH_RET_INDIR_MISPRED_NONSPEC",
        "executing" : "executing_cycle \/ processing_cycle",
        "executing_cycle" : "SCHEDULE_UOP_ANY - memory_miss_executing_cycle",
        "execution_latency_without_sme" : "(processing_cycle - SCHEDULE_UOP_ANY - LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS) \/ processing_cycle",
        "l1d_load_miss" : "L1D_CACHE_MISS_LD_NONSPEC",
        "l1d_store_miss" : "L1D_CACHE_MISS_ST_NONSPEC",
        "l1d_tlb_miss" : "L1D_TLB_MISS_NONSPEC",
        "memory_miss" : "LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS \/ processing_cycle",
        "memory_miss_executing" : "memory_miss_executing_cycle \/ processing_cycle",
        "memory_miss_executing_cycle" : "LDST_UNIT_OLD_L1D_CACHE_MISS - LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS",
        "processing" : "MAP_STALL \/ CORE_ACTIVE_CYCLE",
        "processing_cycle" : "CORE_ACTIVE_CYCLE - SCHEDULE_EMPTY",
        "slot" : "CORE_ACTIVE_CYCLE * MAP_BW",
        "useful" : "RETIRE_UOP \/ slot"
      },
      "periods" : {
        "discarded_branch" : 10000,
        "discarded_call" : 10000,
        "discarded_cond_branch" : 10000,
        "discarded_indirect_branch" : 10000,
        "discarded_memory" : 10000,
        "discarded_return" : 10000,
        "l1d_load_miss" : 10000,
        "l1d_store_miss" : 10000,
        "l1d_tlb_miss" : 10000
      },
      "platforms" : [
        "t8120",
        "t8122",
        "t6030",
        "t6031",
        "t6032",
        "t6034",
        "t8130"
      ],
      "supported_mode_names" : [
        "bottlenecks",
        "delivery",
        "processing",
        "l1d_miss_sampling",
        "discarded_sampling",
        "discarded_indirect_sampling"
      ]
    },
    {
      "constants" : {
        "MAP_BW" : {
          "e" : 5,
          "p" : 10
        }
      },
      "expressions" : {
        "cycle" : "CORE_ACTIVE_CYCLE",
        "delivery" : "MAP_DISPATCH_BUBBLE_SLOT \/ slot",
        "delivery_bandwidth" : "delivery - (delivery_latency_icache + delivery_latency_itlb + delivery_latency_other)",
        "delivery_latency_icache" : "MAP_DISPATCH_BUBBLE_IC \/ CORE_ACTIVE_CYCLE",
        "delivery_latency_itlb" : "MAP_DISPATCH_BUBBLE_ITLB \/ CORE_ACTIVE_CYCLE",
        "delivery_latency_other" : "(MAP_DISPATCH_BUBBLE \/ CORE_ACTIVE_CYCLE) - delivery_latency_icache - delivery_latency_itlb",
        "discarded" : "(MAP_UOP - RETIRE_UOP + (MAP_RECOVERY * MAP_BW)) \/ slot",
        "discarded_branch" : "BRANCH_MISPRED_NONSPEC",
        "discarded_call" : "BRANCH_CALL_INDIR_MISPRED_NONSPEC",
        "discarded_cond_branch" : "BRANCH_COND_MISPRED_NONSPEC",
        "discarded_indirect_branch" : "BRANCH_INDIR_MISPRED_NONSPEC",
        "discarded_memory" : "ST_MEM_ORDER_VIOL_LD_NONSPEC",
        "discarded_other_branch" : "BRANCH_MISPRED_NONSPEC - BRANCH_COND_MISPRED_NONSPEC",
        "discarded_return" : "BRANCH_RET_INDIR_MISPRED_NONSPEC",
        "executing" : "executing_cycle \/ processing_cycle",
        "executing_cycle" : "SCHEDULE_UOP_ANY - memory_miss_executing_cycle",
        "execution_latency" : "(processing_cycle - SCHEDULE_UOP_ANY - LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS - LD_UNIT_WAITING_YOUNG_L1D_CACHE_MISS) \/ processing_cycle",
        "l1d_load_miss" : "L1D_CACHE_MISS_LD_NONSPEC",
        "l1d_store_miss" : "L1D_CACHE_MISS_ST_NONSPEC",
        "l1d_tlb_miss" : "L1D_TLB_MISS_NONSPEC",
        "ldst_sme_instruction_queue_full_waiting" : "LDST_UNIT_WAITING_SME_ENGINE_INST_QUEUE_FULL \/ processing_cycle",
        "ldst_sme_mem_data_waiting" : "LDST_UNIT_WAITING_SME_ENGINE_MEM_DATA \/ processing_cycle",
        "memory_miss" : "LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS \/ processing_cycle",
        "memory_miss_executing" : "memory_miss_executing_cycle \/ processing_cycle",
        "memory_miss_executing_cycle" : "LDST_UNIT_OLD_L1D_CACHE_MISS - LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS",
        "non_critical_memory_miss" : "LD_UNIT_WAITING_YOUNG_L1D_CACHE_MISS \/ processing_cycle",
        "processing" : "MAP_STALL_NONRECOVERY \/ CORE_ACTIVE_CYCLE",
        "processing_cycle" : "CORE_ACTIVE_CYCLE - SCHEDULE_EMPTY",
        "sheduler_sme_reg_data_waiting" : "SCHEDULE_WAITING_SME_ENGINE_REG_DATA \/ processing_cycle",
        "slot" : "CORE_ACTIVE_CYCLE * MAP_BW",
        "sme_ssfp" : "INST_SME_ENGINE_SCALARFP",
        "sme_stream_enable" : "SME_ENGINE_SM_ENABLE",
        "useful" : "RETIRE_UOP \/ slot"
      },
      "periods" : {
        "discarded_branch" : 10000,
        "discarded_call" : 10000,
        "discarded_cond_branch" : 10000,
        "discarded_indirect_branch" : 10000,
        "discarded_memory" : 10000,
        "discarded_return" : 10000,
        "l1d_load_miss" : 10000,
        "l1d_store_miss" : 10000,
        "l1d_tlb_miss" : 10000
      },
      "platforms" : [
        "t8132",
        "t6040",
        "t6041",
        "t8140"
      ],
      "supported_mode_names" : [
        "bottlenecks",
        "delivery",
        "processing",
        "sme_streaming",
        "l1d_miss_sampling",
        "discarded_sampling",
        "discarded_indirect_sampling"
      ]
    },
    {
      "constants" : {
        "DECODE_BW" : {
          "m" : 7,
          "p" : 10
        }
      },
      "expressions" : {
        "cycle" : "CORE_ACTIVE_CYCLE",
        "delivery" : "MAP_DISPATCH_BUBBLE_SLOT \/ slot",
        "delivery_bandwidth_other" : "delivery_bandwidth_total - delivery_bandwidth_taken_br",
        "delivery_bandwidth_taken_br" : "MAP_DISPATCH_BUBBLE_TAKENBR_SLOT \/ slot",
        "delivery_bandwidth_total" : "delivery - (delivery_latency_icache + delivery_latency_itlb + delivery_latency_other)",
        "delivery_latency_icache" : "MAP_DISPATCH_BUBBLE_IC \/ CORE_ACTIVE_CYCLE",
        "delivery_latency_itlb" : "MAP_DISPATCH_BUBBLE_ITLB \/ CORE_ACTIVE_CYCLE",
        "delivery_latency_other" : "(MAP_DISPATCH_BUBBLE \/ CORE_ACTIVE_CYCLE) - delivery_latency_icache - delivery_latency_itlb",
        "discarded" : "(DECODE_UOP - RETIRE_UOP + (MAP_RECOVERY * DECODE_BW)) \/ slot",
        "discarded_branch" : "BRANCH_MISPRED_NONSPEC",
        "discarded_call" : "BRANCH_CALL_INDIR_MISPRED_NONSPEC",
        "discarded_cond_branch" : "BRANCH_COND_MISPRED_NONSPEC",
        "discarded_indirect_branch" : "BRANCH_INDIR_MISPRED_NONSPEC",
        "discarded_memory" : "ST_MEM_ORDER_VIOL_LD_NONSPEC",
        "discarded_other_branch" : "BRANCH_MISPRED_NONSPEC - BRANCH_COND_MISPRED_NONSPEC",
        "discarded_return" : "BRANCH_RET_INDIR_MISPRED_NONSPEC",
        "executing" : "executing_cycle \/ processing_cycle",
        "executing_cycle" : "SCHEDULE_UOP_ANY - memory_miss_executing_cycle",
        "execution_latency" : "(processing_cycle - SCHEDULE_UOP_ANY - LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS - LD_UNIT_WAITING_YOUNG_L1D_CACHE_MISS - LDST_OLDEST_MTE_TAG_CHECK_CYCLE) \/ processing_cycle",
        "l1d_load_miss" : "L1D_CACHE_MISS_LD_NONSPEC",
        "l1d_store_miss" : "L1D_CACHE_MISS_ST_NONSPEC",
        "l1d_tlb_miss" : "L1D_TLB_MISS_NONSPEC",
        "ldst_sme_instruction_queue_full_waiting" : "LDST_UNIT_WAITING_SME_ENGINE_INST_QUEUE_FULL \/ processing_cycle",
        "ldst_sme_mem_data_waiting" : "LDST_UNIT_WAITING_SME_ENGINE_MEM_DATA \/ processing_cycle",
        "memory_miss" : "LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS \/ processing_cycle",
        "memory_miss_executing" : "memory_miss_executing_cycle \/ processing_cycle",
        "memory_miss_executing_cycle" : "LDST_UNIT_OLD_L1D_CACHE_MISS - LDST_UNIT_WAITING_OLD_L1D_CACHE_MISS",
        "mte_tag_check" : "LDST_OLDEST_MTE_TAG_CHECK_CYCLE \/ processing_cycle",
        "non_critical_memory_miss" : "LD_UNIT_WAITING_YOUNG_L1D_CACHE_MISS \/ processing_cycle",
        "processing" : "MAP_STALL_NONRECOVERY \/ CORE_ACTIVE_CYCLE",
        "processing_cycle" : "CORE_ACTIVE_CYCLE - SCHEDULE_EMPTY",
        "sheduler_sme_reg_data_waiting" : "SCHEDULE_WAITING_SME_ENGINE_REG_DATA \/ processing_cycle",
        "slot" : "CORE_ACTIVE_CYCLE * DECODE_BW",
        "sme_ssfp" : "INST_SME_ENGINE_SCALARFP",
        "sme_stream_enable" : "SME_ENGINE_SM_ENABLE",
        "useful" : "RETIRE_UOP \/ slot"
      },
      "periods" : {
        "discarded_branch" : 10000,
        "discarded_call" : 10000,
        "discarded_cond_branch" : 10000,
        "discarded_indirect_branch" : 10000,
        "discarded_memory" : 10000,
        "discarded_return" : 10000,
        "l1d_load_miss" : 10000,
        "l1d_store_miss" : 10000,
        "l1d_tlb_miss" : 10000
      },
      "platforms" : [
        "t8142",
        "t6050"
      ],
      "supported_mode_names" : [
        "bottlenecks",
        "delivery",
        "processing",
        "sme_streaming",
        "l1d_miss_sampling",
        "discarded_sampling",
        "discarded_indirect_sampling"
      ]
    }
  ]
}