Updates for v1.3 release.
AmitBM committed Sep 24, 2024
1 parent 5de0cca commit 7e73b85
Showing 29 changed files with 396 additions and 71 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ project (RGD)

# Define version information
set(RGD_MAJOR_VERSION 1)
set(RGD_MINOR_VERSION 2)
set(RGD_MINOR_VERSION 3)
set(RGD_PATCH_NUMBER 0)
if (NOT RGD_BUILD_NUMBER)
set(RGD_BUILD_NUMBER 0)
Expand Down
27 changes: 7 additions & 20 deletions RGD_RELEASE_NOTES.txt
Original file line number Diff line number Diff line change
@@ -1,27 +1,16 @@
Radeon™ GPU Detective v1.2 Release Notes
=======================================
Radeon™ GPU Detective v1.3 Release Notes
========================================
Radeon GPU Detective (RGD) is a tool for post-mortem analysis of GPU crashes (TDRs).
Using the tool you can capture and analyze AMD GPU crash dumps and produce information that can help narrow down the search for the crash's root cause.
Such information includes page fault details, resource details and execution markers reflecting the GPU work that was in progress at the moments leading to the crash.

Highlights
==========
This release improves the default execution markers which are baked into the AMD drivers and provide additional information even without inserting user markers. Examples of the information added:
This release adds support for Driver Experiments, a powerful new feature that lets you change the behavior and performance characteristics of your application without modifying its source code or its configuration, and that can be useful when debugging crashes.

* Index, vertex and instance counts for draw calls
* Thread group count for compute dispatches
* Improved default raytracing and mesh shader markers
* Barriers
* Queue type (Direct for graphics, Compute for compute)
If an AMD GPU crash dump (.rgd file) was captured while Driver Experiments were active, the file records which experiments were enabled, and RGD presents that list in the Driver Info section of its output.

Example output (with a single "Frame 429" user marker):

Command Buffer ID: 0x883 (Queue type: Direct)
=============================================
[>] "Frame 429"
├─[X] Draw(VertextCount=3, InstanceCount=1)
├─[X] ----------Barrier----------
└─[>] Dispatch(ThreadGroupCount=[16,16,1])
For more details about the Driver Experiments feature see the "RGD documentation" in the documentation subfolder of this repository.

Explicit exclusions
===================
Expand All @@ -33,19 +22,17 @@ Known Issues
message saying: "Summary generation failed" and clicking on "Show error" will display a text description that ends with "execution marker information missing [UMD]". As a workaround, restart RDP. A fix for this issue will be
included in an upcoming driver update - there is no need to update the tool.
* In certain cases, trying to capture a GPU crash dump of an app that has Microsoft® DRED enabled can lead to a system crash.
* In Radeon Developer Panel (RDP), it may happen that generated .rgd crash dump files appear with a wrong file size of 0 bytes.
* Attempting to capture GPU crash dumps on a system with a Ryzen CPU that includes integrated graphics (with no connected discrete Radeon GPU) may result in a BSOD.
* A system reboot is recommended after the driver installation. An invalid crash dump file may get generated when RGD workflow is executed after a fresh driver installation without a system reboot.

System Requirements
===================
* Operating system: Windows 10 or 11.

* Latest Adrenalin Software driver (minimum version 23.12.1).
* Latest Adrenalin Software driver (minimum version 24.9.1). A system reboot is recommended after the driver installation.

* GPU: Radeon™ RX 6000 series (RDNA™2) or RX 7000 series (RDNA™3) card.

* Latest RDP (Radeon Developer Panel) version, which is available as part of the Radeon Developer Tool Suite and can be downloaded from GPUOpen.com. Make sure you are using RDP v2.12.0.7 or later.
* Latest RDP (Radeon Developer Panel) version, which is available as part of the Radeon Developer Tool Suite and can be downloaded from GPUOpen.com. Make sure you are using RDP v3.2 or later.
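
The minimum versions listed above (driver 24.9.1, RDP v3.2) can be checked in a script before starting a capture session. The following is a minimal sketch, not part of RGD: the helper names `parse_version` and `meets_minimum` are hypothetical, and it assumes that version strings are plain dotted integers, optionally prefixed with "v". How you obtain the installed version strings is left open.

```python
def parse_version(text):
    """Turn a dotted version string such as '24.9.1' or 'v3.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`.

    Components are compared numerically, so '24.10.1' counts as newer than '24.9.1'.
    Shorter versions are padded with zeros, so '3.2' and '3.2.0' compare as equal.
    """
    installed_parts = parse_version(installed)
    minimum_parts = parse_version(minimum)
    width = max(len(installed_parts), len(minimum_parts))
    installed_parts += (0,) * (width - len(installed_parts))
    minimum_parts += (0,) * (width - len(minimum_parts))
    return installed_parts >= minimum_parts


# Minimum versions taken from the System Requirements list.
MIN_DRIVER_VERSION = "24.9.1"
MIN_RDP_VERSION = "3.2"

if __name__ == "__main__":
    print(meets_minimum("23.12.1", MIN_DRIVER_VERSION))
    print(meets_minimum("v3.2", MIN_RDP_VERSION))
```

Comparing integer tuples rather than raw strings avoids the lexical trap where "24.10.1" would sort before "24.9.1".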

Note that this version of RGD supports DirectX® 12 and Vulkan® applications, so you will need either a DX12 or a Vulkan application that crashes. For the best experience, it is recommended to:
* Use string markers around render passes using the AMD GPU Services (AGS) library, as these will appear in the command line tool's output and will help identify the code that was executing during the crash.
Expand Down
6 changes: 4 additions & 2 deletions documentation/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@
#import os
#import sys



# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
Expand Down Expand Up @@ -53,9 +55,9 @@
# built documents.
#
# The short X.Y version.
version = u'1.2'
version = u'1.3'
# The full version, including alpha/beta/rc tags.
release = u'1.2'
release = u'1.3'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
Expand Down
35 changes: 32 additions & 3 deletions documentation/source/help_manual.rst
Original file line number Diff line number Diff line change
Expand Up @@ -75,9 +75,36 @@ This section is titled ``SYSTEM INFO`` and includes information about the system

* **Operating system** information
* **Graphics driver** information
* List of active Driver Experiments
* Details about the installed **CPUs**
* Details about the installed **GPUs**

Driver Experiments for Crash Analysis
"""""""""""""""""""""""""""""""""""""

RGD v1.3 supports a powerful new feature called **Driver Experiments**, which lets you toggle certain driver features and optimizations that can change the behavior of your application without modifying its source code or configuration. The experiments control low-level behavior of the Radeon Adrenalin driver, such as raytracing or mesh shader support and compiler optimizations, and can be useful in debugging GPU crashes.

AMD GPU crash dumps (.rgd files) record the list of Driver Experiments that were active during the crash analysis session, so that you always have an accurate picture of the driver configuration with which your app crashed. RGD's crash analysis output summary text file will display the list of Driver Experiments that were active as part of the System Info section. This information will also be available in RGD's machine-readable JSON output file.
For more details about this feature, please refer to the :ref:`quickstart-guide`.

For a detailed description of each supported experiment, please refer to the Driver Experiments section of the `RDP documentation <https://gpuopen.com/manuals/rdp_manual/rdp_manual-index/>`_.

Here is an example of active Driver Experiments::

===========
SYSTEM INFO
===========

Driver info
===========
...
Experiments : total of 4 Driver Experiments were active while capturing the AMD GPU crash dump:
1. Disable sampler feedback support
2. Disable raytracing support
3. Disable variable rate shading
4. Hull shader wave size: Force 32 threads per wave
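
If you want to collect the active experiments from many text summaries (for example, to compare crash dumps captured under different driver configurations), the list shown above can be extracted with a few lines of script. This is a minimal sketch that assumes the text layout matches the example above: an ``Experiments :`` line followed by numbered entries. The function name `extract_driver_experiments` is hypothetical, and the layout produced when no experiments are active is not covered here. For robust tooling, the machine-readable JSON output is likely the better source.

```python
import re

# Matches numbered entries such as "2. Disable raytracing support".
_NUMBERED_ITEM = re.compile(r"^\s*\d+\.\s+(.+?)\s*$")
# Matches the header line that introduces the experiment list.
_EXPERIMENTS_HEADER = re.compile(r"^\s*Experiments\s*:")


def extract_driver_experiments(summary_text):
    """Return the experiment names listed after the 'Experiments :' line.

    Returns an empty list when the header line is not found.
    """
    experiments = []
    collecting = False
    for line in summary_text.splitlines():
        if not collecting:
            if _EXPERIMENTS_HEADER.match(line):
                collecting = True
            continue
        match = _NUMBERED_ITEM.match(line)
        if match is None:
            # The first non-numbered line ends the list.
            break
        experiments.append(match.group(1))
    return experiments


# Excerpt copied from the example output above.
SAMPLE_SUMMARY = """\
===========
SYSTEM INFO
===========

Driver info
===========
...
Experiments : total of 4 Driver Experiments were active while capturing the AMD GPU crash dump:
1. Disable sampler feedback support
2. Disable raytracing support
3. Disable variable rate shading
4. Hull shader wave size: Force 32 threads per wave
"""

if __name__ == "__main__":
    for name in extract_driver_experiments(SAMPLE_SUMMARY):
        print(name)
```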


Markers in progress
"""""""""""""""""""

Expand Down Expand Up @@ -172,7 +199,7 @@ The tree structure and contents are also configurable through the RDP options (o
Note that RGD will collapse nodes which have all of their subnodes in finished state to remove noise and improve the tree's readability.


.. image:: images/image2024-06-19-advanced-options.png
.. image:: images/rgd-advanced-options.png

Page fault summary
""""""""""""""""""
Expand Down Expand Up @@ -282,7 +309,7 @@ Let's elaborate:
but a different type of problem, e.g. a shader hang due to timeout (too long execution) or an infinite loop.


Scope of v1.2
Scope of v1.3
-------------
RGD is designed to capture **GPU crashes** on Windows. If a GPU fault (such as memory page fault or infinite loop in a shader) causes the GPU driver to not respond to the OS for some pre-determined
time period (the default on Windows is 2 seconds), the OS will detect that and attempt to restart or remove the device. This mechanism is also known as "TDR" (Timeout Detection and Recovery) and is what we
Expand All @@ -300,7 +327,6 @@ Please use CPU debugging mechanisms like Microsoft Visual Studio to investigate
Rendering code which **incorrectly uses D3D12 or Vulkan** may also fail purely on the CPU and not reach the graphics driver or the GPU.
Therefore, such crashes are not captured by RGD. They usually result in ``DXGI_ERROR_INVALID_CALL`` error code returned, and
are usually detected by the D3D12 Debug Layer.


.. note::
When debugging a problem in any D3D12 application, first **enable the D3D12 Debug Layer** and
Expand Down Expand Up @@ -331,6 +357,9 @@ Usage tips for RGD

* In Vulkan, the old device extension VK_EXT_debug_marker is also supported by RGD, but it is now deprecated in favor of the VK_EXT_debug_utils instance extension.

* **Try Crash Analysis with Driver Experiments**: If you suspect that certain optimizations or features enabled by the driver might be causing the crash,
you can try to disable them using Driver Experiments. This can help you narrow down the search for the cause of the crash.

Known issues and workarounds
----------------------------

Expand Down
29 changes: 23 additions & 6 deletions documentation/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ This guide will get you up and running with RGD, a tool for post-mortem GPU cras
.. note::
Review these requirements to make sure that this tool is relevant for your use case:

* RGD v1.1 supports **DirectX12** and **Vulkan**.
* RGD v1.3 supports **DirectX12** and **Vulkan**.
* **Windows 10 or 11**.
* **RDNA™2** (RX 6000 series) **or RDNA™3** (RX 7000 series) card.
* Must **TDR** (we don't catch it if there is no TDR).
Expand All @@ -33,19 +33,36 @@ Capture GPU crash dump

1. Before you start, if you ever changed the TdrLevel registry setting, make sure it is set to TdrLevelRecover(3).
2. Run RDP GUI app (RadeonDeveloperPanel.exe).
3. Under CAPTURE -> "Available features", enable "Crash Analysis".
3. Under CAPTURE -> "Available features", enable "Crash Analysis".

.. image:: images/image2024-enable-ca.png
.. image:: images/enable-crash-analysis.png

4. Under the "Crash Analysis" tab, make sure that the Text checkbox is checked for the automatic crash summary generation.

.. image:: images/image2024-select-text.png
.. image:: images/select-text-output-format.png

5. Run the crashing app and reproduce the TDR.
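
Step 1 above asks that ``TdrLevel`` be set to TdrLevelRecover (3). On Windows this value lives under ``HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers`` as a ``REG_DWORD``. When the value is absent, Windows uses the default, which is TdrLevelRecover. The sketch below checks this with Python's standard ``winreg`` module. It is not part of RGD or RDP, and the function names are hypothetical.

```python
import sys

# Registry location of the TDR settings (relative to HKEY_LOCAL_MACHINE).
TDR_KEY_PATH = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
TDR_LEVEL_RECOVER = 3


def is_recover_level(tdr_level):
    """Return True if the given TdrLevel allows TDR recovery.

    None means the registry value is not set, in which case Windows
    falls back to its default, TdrLevelRecover.
    """
    return tdr_level is None or tdr_level == TDR_LEVEL_RECOVER


def read_tdr_level():
    """Read TdrLevel from the registry; return None if it is not set (Windows only)."""
    import winreg  # Imported here so the module still loads on other platforms.

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrLevel")
            return value
    except FileNotFoundError:
        return None


if __name__ == "__main__" and sys.platform == "win32":
    level = read_tdr_level()
    if is_recover_level(level):
        print("TdrLevel is suitable for capturing GPU crash dumps.")
    else:
        print(f"TdrLevel is {level}; set it to {TDR_LEVEL_RECOVER} before capturing.")
```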

.. note::
You can always generate the text or JSON summary files from an .rgd file after it has been captured. This can be done either by right-clicking the .rgd file entry in RDP and using the context menu or by invoking the rgd command line tool directly (run ``rgd -h`` to see the help manual).

Capture GPU crash dump with Driver Experiments enabled
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Under CAPTURE -> "Available features", enable "Driver Experiments".

.. image:: images/enable-driver-experiments.png

2. Under the "Driver Experiments" tab, select the API you want to enable the experiments for (DirectX12 or Vulkan).

.. image:: images/driver-experiments-select-api.png

3. Under the "Driver Experiments" tab, enable/select the experiments you want to activate.

.. image:: images/driver-experiments-select-experiment.png

4. Follow the steps in the previous section to capture the GPU crash dump.

Crash analysis
^^^^^^^^^^^^^^

Expand All @@ -55,7 +72,7 @@ RGD doesn't offer a GUI tool to open these files.
Instead, you can convert them to a report in text or JSON format directly from RDP.
To do it, right-click and select “Open text summary”:

.. image:: images/image2024-open-text-summary.png
.. image:: images/open-text-summary.png


This will open the .txt crash analysis file which includes information that can help narrow down the search for the crash's root cause::
Expand Down Expand Up @@ -112,4 +129,4 @@ of Sale.

AMD, the AMD Arrow logo, Radeon, Ryzen, CrossFire, RDNA and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in
this publication are for identification purposes only and may be trademarks of their respective companies.
© 2023 Advanced Micro Devices, Inc. All rights reserved.
© 2024 Advanced Micro Devices, Inc. All rights reserved.
Binary file modified samples/sample_crash_dump.rgd
Binary file not shown.
26 changes: 25 additions & 1 deletion source/radeon_gpu_detective_backend/rgd_data_types.h
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//=============================================================================
// Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
/// @author AMD Developer Tools Team
/// @file
/// @brief global data types.
Expand All @@ -13,6 +13,9 @@
#include <unordered_map>
#include <unordered_set>

// JSON.
#include "json/single_include/nlohmann/json.hpp"

// RDF.
#include "rdf/rdf/inc/amdrdf.h"

Expand Down Expand Up @@ -61,6 +64,24 @@ static const std::unordered_set<std::string> kBarrierMarkerStrings = { kBarrierS
static const char* kChunkIdTraceProcessInfo = "TraceProcessInfo";
static const uint32_t kChunkMaxSupportedVersionTraceProcessInfo = 1;

static const char* kChunkIdDriverOverrides = "DriverOverrides";

// DriverOverrides chunk version constants.
static const uint32_t kChunkMaxSupportedVersionDriverOverrides = 3;

// DriverOverrides chunk JSON element name constants.
static const char* kJsonElemComponentsDriverOverridesChunk = "Components";
static const char* kJsonElemComponentDriverOverridesChunk = "Component";
static const char* kJsonElemStructuresDriverOverridesChunk = "Structures";
static const char* kJsonElemExperimentsDriverOverridesChunk = "Experiments";
static const char* kJsonElemSettingNameDriverOverridesChunk = "SettingName";
static const char* kJsonElemUserOverrideDriverOverridesChunk = "UserOverride";
static const char* kJsonElemWasSupportedDriverOverridesChunk = "Supported";
static const char* kJsonElemCurrentDriverOverridesChunk = "Current";
static const char* kJsonElemIsDriverExperimentsDriverOverridesChunk = "IsDriverExperiments";
static const char* kErrorMsgInvalidDriverOverridesJson = "invalid DriverOverrides JSON";
static const char* kErrorMsgFailedToParseDriverExperimentsInfo = "failed to parse Driver Experiments info";

// Represents the execution status of an execution marker.
// A marker can be in one of 3 states:
// 1. Hasn't started executing
Expand Down Expand Up @@ -184,6 +205,9 @@ struct RgdCrashDumpContents
TraceProcessInfo crashing_app_process_info;
// Mapping between command buffer ID and the indices for umd_crash_data.events array of its relevant execution marker events.
std::unordered_map<uint64_t, std::vector<size_t>> cmd_buffer_mapping;

// Driver Experiments JSON
nlohmann::json driver_experiments_json;
};

#endif // RADEON_GPU_DETECTIVE_SOURCE_RGD_DATA_TYPES_H_
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//=============================================================================
// Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
/// @author AMD Developer Tools Team
/// @file
/// @brief execution marker tree serialization.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//=============================================================================
// Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
/// @author AMD Developer Tools Team
/// @file
/// @brief execution marker tree serialization.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//=============================================================================
// Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
/// @author AMD Developer Tools Team
/// @file
/// @brief execution marker serialization.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//=============================================================================
// Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
/// @author AMD Developer Tools Team
/// @file
/// @brief execution marker serialization.
Expand Down
60 changes: 58 additions & 2 deletions source/radeon_gpu_detective_backend/rgd_parsing_utils.cpp
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//=============================================================================
// Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.
/// @author AMD Developer Tools Team
/// @file
/// @brief utilities for parsing raw data.
Expand Down Expand Up @@ -613,4 +613,60 @@ bool RgdParsingUtils::ParseTraceProcessInfoChunk(rdf::ChunkFile& chunk_file, con
ret = error_txt.str().empty();

return ret;
}
}

bool RgdParsingUtils::ParseDriverOverridesChunk(rdf::ChunkFile& chunk_file, const char* chunk_identifier, nlohmann::json& driver_experiments_json)
{
    bool ret = true;
    const int64_t kChunkCount = chunk_file.GetChunkCount(chunk_identifier);
    const int64_t kChunkIdx = 0;
    const char* kErrorMsg = "failed to extract the list of enabled Driver Experiments";
    std::stringstream error_txt;

    // Parse DriverOverrides chunk. It will not be present for the files captured with RDP 3.0 and before.
    if (kChunkCount > 0)
    {
        const uint32_t kChunkVersion = chunk_file.GetChunkVersion(chunk_identifier);
        if (kChunkVersion <= kChunkMaxSupportedVersionDriverOverrides)
        {
            // Only one DriverOverrides chunk is expected so chunk index is set to 0 (first chunk).
            assert(kChunkCount == 1);
            uint64_t payload_size = chunk_file.GetChunkDataSize(chunk_identifier, kChunkIdx);
            if (payload_size > 0)
            {
                std::string driver_overrides_json_data(payload_size, '\0');

                // Read the DriverOverrides chunk payload data.
                chunk_file.ReadChunkDataToBuffer(chunk_identifier, kChunkIdx, driver_overrides_json_data.data());
                try
                {
                    driver_experiments_json = nlohmann::json::parse(driver_overrides_json_data.data());
                }
                catch (const std::exception& e)
                {
                    error_txt << kErrorMsg << " (" << e.what() << ")";
                    RgdUtils::PrintMessage(error_txt.str().c_str(), RgdMessageType::kError, true);
                }
            }
            else
            {
                error_txt << kErrorMsg << " (invalid chunk payload size [" << kChunkIdDriverOverrides << "])";
                RgdUtils::PrintMessage(error_txt.str().c_str(), RgdMessageType::kError, true);
            }
        }
        else
        {
            error_txt << kErrorMsg << " (unsupported chunk version: " << kChunkVersion << " [" << kChunkIdDriverOverrides << "])";
            RgdUtils::PrintMessage(error_txt.str().c_str(), RgdMessageType::kError, true);
        }
    }
    else
    {
        error_txt << kErrorMsg << " (Driver Experiments information missing [" << kChunkIdDriverOverrides << "])";
        RgdUtils::PrintMessage(error_txt.str().c_str(), RgdMessageType::kError, true);
    }

    ret = error_txt.str().empty();

    return ret;
}
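
Once the chunk payload has been parsed, the active experiments can be listed by walking the JSON with the element names defined in rgd_data_types.h ("Components", "Component", "Structures", "Experiments", "SettingName", "UserOverride", "Supported", "IsDriverExperiments"). The sketch below illustrates one way to do that, in Python for brevity. The nesting shown here is an assumption made for illustration, since this commit view does not show the actual DriverOverrides schema or the code that consumes it. The sample payload, the component name and the function name `list_active_experiments` are hypothetical.

```python
import json


def list_active_experiments(driver_overrides):
    """Return (component, setting name, user override) tuples for supported experiments.

    ASSUMED layout (not confirmed by the source): a top-level "IsDriverExperiments"
    flag and a "Components" array whose entries hold a "Structures" object with an
    "Experiments" array of settings.
    """
    if not driver_overrides.get("IsDriverExperiments", False):
        return []

    active = []
    for component in driver_overrides.get("Components", []):
        structures = component.get("Structures", {})
        for setting in structures.get("Experiments", []):
            # Skip experiments that the driver reported as unsupported.
            if setting.get("Supported", False):
                active.append(
                    (
                        component.get("Component"),
                        setting.get("SettingName"),
                        setting.get("UserOverride"),
                    )
                )
    return active


# Hypothetical payload, invented for illustration only.
SAMPLE_PAYLOAD = json.loads(
    """
    {
      "IsDriverExperiments": true,
      "Components": [
        {
          "Component": "ExampleComponent",
          "Structures": {
            "Experiments": [
              {"SettingName": "Disable raytracing support",
               "UserOverride": true, "Supported": true, "Current": true},
              {"SettingName": "Disable variable rate shading",
               "UserOverride": true, "Supported": false, "Current": false}
            ]
          }
        }
      ]
    }
    """
)

if __name__ == "__main__":
    for component, name, override in list_active_experiments(SAMPLE_PAYLOAD):
        print(f"{component}: {name} (user override: {override})")
```

Using `get` with defaults keeps the walk tolerant of missing elements, which matters because older capture files may lack parts of the payload or the chunk altogether.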