Skip to content

Commit

Permalink
[docs] document LIR attribution (#30899)
Browse files Browse the repository at this point in the history
Co-authored-by: kay-kim <[email protected]>
  • Loading branch information
mgree and kay-kim authored Jan 9, 2025
1 parent bab00e1 commit 960265c
Show file tree
Hide file tree
Showing 8 changed files with 982 additions and 0 deletions.
314 changes: 314 additions & 0 deletions doc/user/content/transform-data/troubleshooting.md
Original file line number Diff line number Diff line change
Expand Up @@ -277,6 +277,320 @@ If you've gone through the dataflow troubleshooting and do not want to make any
changes to your query, consider [sizing up your cluster](/sql/create-cluster/#size).
A larger size cluster will provision more memory and CPU resources.

## Which part of my query runs slowly or uses a lot of memory?

{{< public-preview />}}

You can [`EXPLAIN`](/sql/explain-plan/) a query to see how it will be run as a
dataflow. In particular, `EXPLAIN PHYSICAL PLAN` will show the concrete, fully
optimized plan that Materialize will run. That plan is written in our "low-level
intermediate representation" (LIR).

For [indexes](/concepts/indexes) and [materialized
views](/concepts/views#materialized-views), you can use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to attribute various performance characteristics to the operators inside your
query.

Every time you create an index or materialized view, Materialize uses
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to map the higher-level LIR operators to zero or more lower-level
dataflow operators. You can construct queries that will combine
information from
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
and other internal views to discover which parts of your query are
computationally expensive (e.g.,
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram), [`mz_introspection.mz_scheduling_elapsed`](/sql/system-catalog/mz_introspection/#mz_scheduling_elapsed))
or consuming excessive memory (e.g., [`mz_introspection.mz_arrangement_sizes`](/sql/system-catalog/mz_introspection/#mz_arrangement_sizes)).

To show how you can use
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
to attribute performance characteristics, the attribution examples in this
section reference the `wins_by_item` index (and the underlying `winning_bids`
view) from the [quickstart
guide](/get-started/quickstart/#step-2-create-the-source):

```sql
CREATE SOURCE auction_house
FROM LOAD GENERATOR AUCTION
(TICK INTERVAL '1s', AS OF 100000)
FOR ALL TABLES;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
AND mz_now() >= a.end_time
ORDER BY a.id, b.amount DESC, b.bid_time, b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```

We attribute four different kinds of performance data to parts of the
`wins_by_item` query:

- [computation time](#attributing-computation-time)
- [memory usage](#attributing-memory-usage)
- [Top-k query sizing hints](#attributing-topk-hints)
- [worker skew](#localizing-worker-skew)

### Attributing computation time

When optimizing a query, it helps to be able to attribute 'cost' to its parts,
starting with how much time is spent computing in each part overall. Materialize
reports the time spent in each _dataflow operator_ in
[`mz_introspection.mz_compute_operator_durations_histogram`](/sql/system-catalog/mz_introspection/#mz_compute_operator_durations_histogram).
By joining it with
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping),
we can attribute the time spent in each operator to the higher-level, more
intelligible LIR operators.

For example, to find out how much time is spent in each operator for the `wins_by_item` index (and the underlying `winning_bids` view), run the following query:

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
SUM(duration_ns)/1000 * '1 microsecond'::INTERVAL AS duration, SUM(count) AS count
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_compute_operator_durations_histogram mcodh
ON (mlm.operator_id_start <= mcodh.id AND mcodh.id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

The query produces results similar to the following (your specific numbers will
vary):

{{< yaml-table data="query_attribution_computation_time_output" >}}

- The `duration` column shows that the `TopK` operator is where we spend the
bulk of the query's computation time.

- Creating an index on a view starts _two_ dataflows as denoted by the two
`global_ids`:
- `u148` is the dataflow for the `winning_bids` view (installed when the index
is created), and
- `u149` is the dataflow for the `wins_by_item` index on `winning_bids` (which
arranges the results of the `winning_bids` view by the index key).

The LIR operators reported in `mz_lir_mapping.operator` are terser than those in
`EXPLAIN PHYSICAL PLAN`. Each operator is restricted to a single line of the
form `OPERATOR [INPUT_LIR_ID ...]`. For example `lir_id` 4 is the operator
`Arrange 3`, where `Arrange` is the operator that will arrange in memory the
results of `lir_id` 3 (`GET::PassArrangements u145`), which `GET`s/reads
dataflow `u145` (while not shown, `u145` is the `bids` table from the Auction
source).

To examine the query in more detail:

- The query works by finding every dataflow operator in the range
(`mz_lir_mapping.operator_id_start` inclusive to
`mz_lir_mapping.operator_id_end` exclusive) and summing up the time spent
(`SUM(duration_ns)`).

- The query joins with
[`mz_catalog.mz_objects`](/sql/system-catalog/mz_catalog/#mz_objects) to find
the actual name corresponding to the `global_id`. The `WHERE mo.name IN ...`
clause of the query ensures we only see information about this index and view.
If you leave this `WHERE` clause out, you will see information on _every_
view, materialized view, and index on your current cluster.

- The `operator` is indented using [`REPEAT`](/sql/functions/#repeat) and
`mz_lir_mapping.nesting`. The indenting, combined with ordering by
`mz_lir_mapping.lir_id` descending, gives us a tree-shaped view of the LIR
plan.

### Attributing memory usage

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

To find the memory usage of each operator for the index and view, join
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping)
with
[`mz_introspection.mz_arrangement_sizes`](/sql/system-catalog/mz_introspection/#mz_arrangement_sizes):

```sql
SELECT mo.name AS name, global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
pg_size_pretty(SUM(size)) AS size
FROM mz_introspection.mz_lir_mapping mlm
LEFT JOIN mz_introspection.mz_arrangement_sizes mas
ON (mlm.operator_id_start <= mas.operator_id AND mas.operator_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, operator, parent_lir_id, nesting
ORDER BY global_id, lir_id DESC;
```

The query produces results similar to the following (your specific numbers will
vary):

{{< yaml-table data="query_attribution_memory_usage_output" >}}

The results show:

- The [`TopK`](/transform-data/idiomatic-materialize-sql/top-k/) is
overwhelmingly responsible for memory usage.

- Arranging the `bids` table (`lir_id` 4) and `auctions` table (`lir_id` 2) are
cheap in comparison as is arranging the final output in the `wins_by_item`
index (`lir_id` 8).

### Attributing `TopK` hints

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}

The
[`mz_introspection.mz_expected_group_size_advice`](/sql/system-catalog/mz_introspection/#mz_expected_group_size_advice)
looks at your running dataflow and suggests parameters you can set. We
can attribute this to particular parts of our query using
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping):

```sql
SELECT mo.name AS name, mlm.global_id AS global_id, lir_id, parent_lir_id, REPEAT(' ', nesting * 2) || operator AS operator,
levels, to_cut, hint, pg_size_pretty(savings) AS savings
FROM mz_introspection.mz_lir_mapping mlm
JOIN mz_introspection.mz_dataflow_global_ids mdgi
ON (mlm.global_id = mdgi.global_id)
LEFT JOIN mz_introspection.mz_expected_group_size_advice megsa
ON (megsa.dataflow_id = mdgi.id AND
mlm.operator_id_start <= megsa.region_id AND megsa.region_id < mlm.operator_id_end)
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
ORDER BY mlm.global_id, lir_id DESC;
```

Each `TopK` operator will have an [associated `DISTINCT ON INPUT GROUP SIZE`
query hint](/transform-data/idiomatic-materialize-sql/top-k/#query-hints-1):

{{< yaml-table data="query_attribution_topk_hints_output" >}}

Here, the hinted `DISTINCT ON INPUT GROUP SIZE` is `255.0`. We can re-create our view and index using the hint as follows:

```sql
DROP VIEW winning_bids CASCADE;

CREATE VIEW winning_bids AS
SELECT DISTINCT ON (a.id) b.*, a.item, a.seller
FROM auctions AS a
JOIN bids AS b
ON a.id = b.auction_id
WHERE b.bid_time < a.end_time
AND mz_now() >= a.end_time
OPTIONS (DISTINCT ON INPUT GROUP SIZE = 255) -- use hint!
ORDER BY a.id,
b.amount DESC,
b.bid_time,
b.buyer;

CREATE INDEX wins_by_item ON winning_bids (item);
```

Re-running the `TopK`-hints query will show only `null` hints; there are no
hints because our `TopK` is now appropriately sized. But if we re-run our [query
for attributing memory usage](#attributing-memory-usage), we can see that our
`TopK` operator uses a third of the memory it was using before:

{{< yaml-table data="query_attribution_memory_usage_w_hint_output" >}}

### Localizing worker skew

{{< tip >}}

If you have not read about [attributing computation time](#attributing-computation-time), please do so first, as it explains some core concepts.

{{< /tip >}}


[Worker skew](/transform-data/dataflow-troubleshooting/#is-work-distributed-equally-across-workers) occurs when your data do not end up getting evenly
partitioned between workers. Worker skew can only happen when your
cluster has more than one worker. You can query
[`mz_catalog.mz_cluster_replica_sizes`](/sql/system-catalog/mz_catalog/#mz_cluster_replica_sizes)
to determine how many workers a given cluster size has; in our example, there are 4 workers.

You can identify worker skew by comparing a worker's time spent to the
overall time spent across all workers:

```sql
SELECT mo.name AS name, global_id, lir_id, REPEAT(' ', 2 * nesting) || operator AS operator,
worker_id,
ROUND(SUM(elapsed_ns) / SUM(aebi.total_ns), 2) AS ratio,
SUM(epw.elapsed_ns) / 1000 * '1 microsecond'::INTERVAL AS elapsed_ns,
SUM(aebi.total_ns) / 1000 * '1 microsecond'::INTERVAL AS avg_ns
FROM mz_introspection.mz_lir_mapping mlm
CROSS JOIN LATERAL ( SELECT SUM(elapsed_ns) AS total_ns
FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
GROUP BY worker_id) aebi
CROSS JOIN LATERAL ( SELECT worker_id, SUM(elapsed_ns) AS elapsed_ns
FROM mz_introspection.mz_scheduling_elapsed_per_worker mse
WHERE mlm.operator_id_start <= id AND id < mlm.operator_id_end
GROUP BY worker_id) epw
JOIN mz_catalog.mz_objects mo
ON (mlm.global_id = mo.id)
WHERE mo.name IN ('wins_by_item', 'winning_bids')
GROUP BY mo.name, global_id, lir_id, nesting, operator, worker_id
ORDER BY global_id, lir_id DESC;
```

{{< yaml-table data="query_attribution_worker_skew_output" >}}

The `ratio` column tells you whether a worker is particularly over- or
under-loaded:

- a `ratio` below 1 indicates a worker doing a below average amount of work.

- a `ratio` above 1 indicates a worker doing an above average amount of work.

While there will always be some amount of variation, very high ratios indicate a
skewed workload.

### Writing your own attribution queries

Materialize maps LIR nodes to ranges of dataflow operators in
[`mz_introspection.mz_lir_mapping`](/sql/system-catalog/mz_introspection/#mz_lir_mapping).
By combining information from
[`mz_catalog`](/sql/system-catalog/mz_catalog/) and
[`mz_introspection`](/sql/system-catalog/mz_introspection/),
you can better understand your dataflows' behavior. Using the above queries as a
starting point, you can build your own attribution queries. When building your own, keep the following in mind:

- `mz_lir_mapping.operator` is not stable and **should not be parsed**.

- If you want to traverse the LIR tree, use `mz_lir_mapping. parent_lir_id`.

- To request additional metadata that would be useful for us to provide,
please [contact our team](https://materialize.com/contact/).

- Include `GROUP BY global_id` to avoid mixing `lir_ids` from different
`global_id`s. Mixing `lir_id`s from different `global_id`s will produce
nonsense. `global_id`s will produce nonsense.

- Use `REPEAT(' ', 2 * nesting) || operator` and `ORDER BY lir_id DESC` to
correctly render the LIR tree.

- `mz_lir_mapping.operator_id_start` is inclusive;
`mz_lir_mapping.operator_id_end` is exclusive. If they are equal to each
other, that LIR operator does not correspond to any dataflow operators.

- To only see output for views, materialized views, and indexes you're
interested in, join with `mz_catalog.mz_objects` and restrict based on
`mz_objects.name`. Otherwise, you will see information on everything installed
in your current cluster.

## How do I troubleshoot slow queries?

Materialize stores a (sampled) log of the SQL statements that are issued against
Expand Down
Loading

0 comments on commit 960265c

Please sign in to comment.