Massive GDI (region) leak. Help needed. #11334

Open
kirsan31 opened this issue May 8, 2024 · 15 comments
Labels
tenet-performance Improve performance, flag performance regressions across core releases

@kirsan31
Contributor

kirsan31 commented May 8, 2024

.NET version

.NET 7 and .NET 8

Did it work in .NET Framework?

Yes

Did it work in any of the earlier releases of .NET Core or .NET 5+?

We didn't see these problems before .NET 7.

Issue description

For several months we have been trying to investigate a huge GDI (region) leak in our app. The leak is critical because it can reach the GDI limit (10k) in one day.

  • The leaking GDI objects are regions.
  • This is not a managed leak - we have no leaking managed objects in dumps.
  • It starts leaking suddenly and most of the time continues until the limit is reached or the app is restarted, but sometimes it stops.
  • The leak is related to redrawing of elements (it involves regions, after all), and most of the time it happens around RDP connect/disconnect.
  • We can't repro this leak :(
  • We once managed to catch this behavior using Performance HUD. This is quite problematic, because HUD slows down the work PC. What we saw was very strange: it felt as if things that had never leaked before started to leak. Unfortunately, the call stacks saved by HUD (*.hudinsight) do not show method names when viewed (maybe someone knows how to overcome this?) :( And even the ones I copied manually as text turned out to be truncated because of their size. So I will present here what I managed to get (sorry for this).
  1. 479 leaked regions due to redrawing (almost all are system calls):
    1
    1.csv

  2. 84 leaked regions due to closing all open child MDI forms. Closing all windows is done through the menu; when it is opened, the child menu is filled with the open forms (15 in our case). A big part of the leak is in MenuStrip -> Control.SetBoundsCore -> SetWindowPos. The call stacks ToolStripDropDownItem.OnDropDownOpened -> EtwWriteTransfer and ToolStripDropDownItem.OnDropDownClosed -> EtwWriteTransfer are full.
    01
    02
    03
    04
    2.csv
    The logic in tsmiWidow_DropDownOpened populates the child DropDownItems with 15 (in this case) items.
    The logic in tsmiWidow_DropDownClosed clears all the items previously added:

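// Remove the items added in DropDownOpened, last to first.
while (tmi.DropDownItems.Count > 0)
{
    ToolStripItem ti = tmi.DropDownItems[tmi.DropDownItems.Count - 1];
    ti.MouseDown -= ActivateW; // detach the handler before disposing
    var img = ti.Image;        // keep the image so it can be disposed after the item
    ti.Dispose();              // Dispose() also removes the item from DropDownItems
    img?.Dispose();
}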

There are no managed leaks; all objects were properly deleted (this is not always 100% true, as I explain below).
It is very strange that the leak occurs both when adding and when removing elements. In 99% of cases everything works completely correctly.
While researching, I found a small managed leak here:
timer1
timer2
This small managed leak is reproducible and can't lead to such catastrophic consequences. It can easily be fixed with a WeakReference here (I will open a PR later):

// Consider - weak reference?
private ToolStripItem? _currentItem;
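
A minimal sketch of what the WeakReference idea could look like, not the actual fix. `CurrentItemHolder` and the empty `ToolStripItem` class are hypothetical stand-ins so that the snippet compiles on its own; only the field name `_currentItem` comes from the snippet above.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for System.Windows.Forms.ToolStripItem,
// declared here only so the sketch is self-contained.
public class ToolStripItem { }

// Hypothetical holder class; in WinForms the field belongs to an internal type.
public sealed class CurrentItemHolder
{
    // A weak reference does not keep the item alive, so a disposed item
    // that is no longer referenced elsewhere can still be collected.
    private WeakReference<ToolStripItem>? _currentItem;

    public ToolStripItem? CurrentItem
    {
        // Returns null when nothing was stored or the item was already collected.
        get => _currentItem is not null && _currentItem.TryGetTarget(out ToolStripItem? item)
            ? item
            : null;
        set => _currentItem = value is null ? null : new WeakReference<ToolStripItem>(value);
    }
}
```

Callers would then have to treat a `null` result from `CurrentItem` as "no current item", since the target may disappear between two reads.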

  3. 800+ leaked regions (maybe not all 937 are leaks), including those from points 1 and 2. This is also done through the menu, where one submenu is filled and cleared with 43 items in our case, exactly as in point 2. These are all the regions that Performance HUD detected after closing all child windows. In this state our app normally consumes about 100 GDI objects, 10 of which are regions. In this situation there were about 5500 GDI objects, 5400 of which were regions.
    001
    002
    003
    004
    all.csv
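
The GDI object counts quoted above can also be sampled from inside the process without a profiler. The following is a sketch, assuming Windows and the Win32 `GetGuiResources` function from user32.dll; the class name `GuiResourceCounter` and the helper `LooksLikeLeak` are hypothetical. Note that it reports the total number of GDI objects, not regions specifically.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

public static class GuiResourceCounter
{
    private const uint GR_GDIOBJECTS = 0; // uiFlags value: count of GDI objects

    [DllImport("user32.dll", SetLastError = true)]
    private static extern uint GetGuiResources(IntPtr hProcess, uint uiFlags);

    // Windows only: number of GDI objects currently held by this process.
    public static uint GetGdiObjectCount()
    {
        using Process current = Process.GetCurrentProcess();
        return GetGuiResources(current.Handle, GR_GDIOBJECTS);
    }

    // Hypothetical helper: true when the count grew by more than
    // allowedGrowth compared with a baseline taken in a known-good state.
    public static bool LooksLikeLeak(uint baseline, uint current, uint allowedGrowth)
        => current > baseline && current - baseline > allowedGrowth;
}
```

Logging `GetGdiObjectCount()` periodically, for example from a timer, would give a timeline that can be compared with RDP connect/disconnect times.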

OS: Windows 10 Pro for Workstations 22H2.

In conclusion, it seems to me that the problem is somewhere in WinForms, or even in the OS. Any assistance in further investigation is greatly appreciated. 🙏

Steps to reproduce

--

@kirsan31 kirsan31 added the untriaged The team needs to look at this issue in the next triage label May 8, 2024
@lonitra lonitra added this to the .NET 9.0 milestone May 8, 2024
@JeremyKuhne
Member

@kirsan31 I'm very interested in looking more deeply at this. Unfortunately, I'm tied up for a number of weeks on critical BinaryFormatter work. If you have some mitigations like the WeakReference piece, I'm happy to take a look at PRs there.

When the other work is done, I can try to see what I might be able to find out.

Also, have you been able to repro the same thing with .NET 9?

@lonitra lonitra removed the untriaged The team needs to look at this issue in the next triage label May 8, 2024
@elachlan elachlan added tenet-performance Improve performance, flag performance regressions across core releases 💥 regression-release Regression from a public release labels May 9, 2024
@kirsan31
Contributor Author

kirsan31 commented May 9, 2024

@JeremyKuhne

Also, have you been able to repro the same thing with .NET 9?

I can’t reproduce this at all (no matter how hard I try). It only happens on a work machine and only during work. Moreover, the leaks have never started immediately after launching the application, only after a few days. So my ability to experiment there is very limited and I cannot use .NET 9 :(
I will continue my experiments...

@weltkante sorry to bother you, but maybe you have some ideas?


I'm tied up for a number of weeks on critical BinaryFormatter work.

By the way, I have a question on this topic that no one has answered yet.

@weltkante
Contributor

@weltkante sorry to bother you, but maybe you have some ideas?

No problem, unfortunately this is nothing I've come across in the past. So far I've always been able to rely on managed leaks and memory snapshots/dumps to compare, or on being able to reproduce the problem locally and do a time travel debug trace to inspect the unmanaged leaks. Seems like neither is an option for you.

If I had to diagnose this issue I'd probably try to isolate what affects it:

  • make a dump of the leaking process from the task manager and check for any unexpected 3rd party dlls that may have injected themselves into the process
  • set up and use a second machine. If it never happens there, the original machine may simply be broken or its Windows installation may be corrupted
  • if possible, consider some sort of virtualization for the second setup for easy transfer between machines. I've been using Hyper-V based Windows Sandbox scripts lately to set up isolated applications in tricky cases, but there may be alternatives that are easier

@weltkante
Contributor

weltkante commented May 9, 2024

Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.
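
Besides inspecting dumps, a stuck finalizer thread can be noticed at run time with a "canary" object. This is only a sketch of that general technique; `FinalizerCanary` and its members are hypothetical names and not part of WinForms or of the application discussed here.

```csharp
using System;
using System.Threading;

// Hypothetical canary: its finalizer records a timestamp and re-arms itself,
// so the timestamp keeps advancing as long as the finalizer thread makes progress.
public sealed class FinalizerCanary
{
    private static long s_lastFinalizedTicks = DateTime.UtcNow.Ticks;

    private FinalizerCanary() { }

    // Creates an unreferenced canary that becomes finalizable at the next GC.
    public static void Arm() => _ = new FinalizerCanary();

    public static DateTime LastFinalizedUtc =>
        new DateTime(Interlocked.Read(ref s_lastFinalizedTicks), DateTimeKind.Utc);

    // True when no canary finalizer has run within maxAge.
    public static bool IsStale(TimeSpan maxAge) =>
        DateTime.UtcNow - LastFinalizedUtc > maxAge;

    ~FinalizerCanary()
    {
        Interlocked.Exchange(ref s_lastFinalizedTicks, DateTime.UtcNow.Ticks);
        if (!Environment.HasShutdownStarted)
        {
            Arm(); // keep one canary pending for the next collection
        }
    }
}
```

A stale timestamp is only a hint: the finalizer runs after a garbage collection, so in an idle process with no collections the timestamp stops advancing even though nothing is stuck. Comparing it against the time of the last known collection, or forcing a collection in a diagnostic build, would make the signal more reliable.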

@kirsan31
Contributor Author

kirsan31 commented May 9, 2024

@weltkante

make a dump of the leaking process from the task manager and check for any unexpected 3rd party dlls that may have injected themselves into the process

Good idea. I just checked and everything is OK here. The probability was extremely low anyway, because the work PC has very strict restrictions on installed software and Internet use.

set up and use a second machine. If it never happens there, the original machine may simply be broken or its Windows installation may be corrupted
if possible, consider some sort of virtualization for the second setup for easy transfer between machines. I've been using Hyper-V based Windows Sandbox scripts lately to set up isolated applications in tricky cases, but there may be alternatives that are easier

Due to the specifics of our environment, there is no way for us to configure either a second physical machine or a virtual one. And users won’t approve of this; it’s easier for them to restart the application every few days :)

Oh, and make sure the finalizer thread is not stuck on something (look at a few dumps in a debugger after leaks started and check that the thread is idle or at least differs between dumps). Depending on your tooling finalizable objects may not show up as leaks in your managed analysis, but if the finalizer thread is hanging and can't finalize things anymore that may end up this way.

There's nothing wrong there: other objects are finalized normally, and so are most regions. Also, in dumps taken after a GC, the list of objects ready for finalization is empty and there is nothing extraordinary among the dead objects.

Thank you for your attention anyway 🙏

@kirsan31
Contributor Author

A small update on what I found out.

  • As a result, it is native functions that leak; very often it is SetWindowPos.
  • When running our two applications in parallel, the leaks are not the same. In one application every menu rebuild leaks, while in the other nothing leaks at all. It follows that the problem is not system-wide, but begins in a particular process once certain conditions are met (and can stop again under certain conditions).

@weltkante
Contributor

weltkante commented May 10, 2024

As a result, it is native functions that leak; very often it is SetWindowPos.

Just as a side note, to keep other people reading this from drawing the wrong conclusions: SetWindowPos can trigger a lot of messages, including callbacks into managed code, so it's unlikely to be the direct cause of the leak

@kirsan31
Contributor Author

SetWindowPos can trigger a lot of messages, including callbacks into managed code, so it's unlikely to be the direct cause of the leak

Of course that's not the direct cause of the leak. And SetWindowPos really does trigger a lot of messages, but none of them call back into managed code:
image
I pointed out this method because it is the most common last managed method in the call stacks.

@weltkante
Contributor

weltkante commented May 10, 2024

but none of them call back into managed code

Sounds weird, all WinForms controls should, at the very least, go through the managed message handler of the control. And SetWindowPos can (and usually does) trigger resize and redraw logic, both of which can have managed event handlers that need to be dispatched too, even if they are empty.

Anyways, just meant to say that seeing this method as the call root doesn't mean the problem is guaranteed to be in native code.

@lonitra
Member

lonitra commented Jul 23, 2024

@kirsan31 Do you think you could provide a consistent repro for us to investigate this?

@lonitra lonitra added the waiting-author-feedback The team requires more information from the author label Jul 23, 2024
@kirsan31
Contributor Author

@kirsan31 Do you think you could provide a consistent repro for us to investigate this?

I hope so... Currently the issue still exists (very sporadically) and I can't find the root cause :(

@dotnet-policy-service dotnet-policy-service bot added untriaged The team needs to look at this issue in the next triage and removed waiting-author-feedback The team requires more information from the author labels Jul 23, 2024
@JeremyKuhne JeremyKuhne modified the milestones: .NET 9.0, Future Jul 24, 2024
@JeremyKuhne
Member

@kirsan31 as soon as we get actionable stuff here we can assign it to whatever the current release is.

@JeremyKuhne JeremyKuhne removed the untriaged The team needs to look at this issue in the next triage label Jul 24, 2024
@kirsan31
Contributor Author

kirsan31 commented Aug 6, 2024

Once again we were able to directly catch leaks and use the performance HUD. What we found out:

  • Leaks are completely related to the RDP session; they occur during connection, disconnection and also when manipulating the RDP window; one of the leaks happened simply when minimizing the RDP window.
  • Leaks do not depend on the RDP client application.
  • Leaks do not depend on the client OS.
  • Two of our applications were launched - leaks were observed simultaneously in both.
  • As I already wrote, it feels like absolutely everything related to redrawing is starting to leak. Moreover, this is not a constant process - not every connection/disconnect/change in the RDP window entails a leak.

This time I copied (in several passes) all the stacks for the two applications from Performance HUD (if necessary, I will provide them all). But they don’t show anything new - all the stacks are some kind of drawing of a menu/tooltip, etc., and they always end with EtwWriteTransfer. Example:

leaked.mp4

From all this, and from the fact that such behavior was not observed before .NET 7, I have only two possible assumptions - either this is somehow a Windows bug/corruption (which appeared with some system update), or, after all, a regression in .NET.

Does anyone have any other ideas or tips?

P.S. Why do all messages go through the Office component ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop?
//cc @JeremyKuhne

@weltkante
Contributor

P.S. Why do all messages go through the Office component
ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop?

That's just an interface for Office/Visual Studio compatibility, and the naming is historical. WinForms uses its own implementation of the interface if Office/VS is not detected to provide one.

@JeremyKuhne
Member

ComponentManager.Microsoft.Office.IMsoComponentManager.FPushMessageLoop?

That's just an interface for Office/Visual Studio compatibility, and the naming is historical. WinForms uses its own implementation of the interface if Office/VS is not detected to provide one.

Note that this will now be turned off by default (.NET 9). It can be turned back on with the "Switch.System.Windows.Forms.EnableMsoComponentManager" switch.

Even with the stub it was a fair amount of overhead for message processing. As we had to rewrite all of our COM for ComWrappers we took the opportunity to simplify the message loop.
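
For readers who want to try the switch mentioned above, one possible way is to set it through `AppContext` at startup. This is a sketch: the switch name is taken from the comment above, while the class `MsoComponentManagerSwitch` is hypothetical, and it is an assumption that WinForms honors the value as long as it is set before any WinForms message loop starts.

```csharp
using System;

// Hypothetical helper around the switch named in the comment above.
public static class MsoComponentManagerSwitch
{
    public const string Name = "Switch.System.Windows.Forms.EnableMsoComponentManager";

    // Assumption: call this at the very start of Main, before Application.Run.
    public static void Enable() => AppContext.SetSwitch(Name, true);

    // Reports whether the switch is currently set to true for this process.
    public static bool IsEnabled() =>
        AppContext.TryGetSwitch(Name, out bool enabled) && enabled;
}
```

`AppContext.SetSwitch` only affects the current process, so this could be toggled per application when comparing behavior with and without the component manager.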
