DPP Auth Failed on dpp-enrollee example (IDFGH-9228) #10615
Comments
Update on this: Hmm. |
I'm having the same issue (dpp-enrollee yields ESP_ERR_DPP_TX_FAILURE). For me it is strangely persistent but unpredictable. That is, I've had it work correctly in the past, but somehow once a given device gets into this state it will yield this error over and over no matter what I do, even if I reset the flash. I believe @knight-ryu12 is describing something similar where a seemingly unrelated change "fixed it". Looks like others have also experienced this: https://esp32.com/viewtopic.php?t=28573. Needless to say this unpredictability makes it impossible to actually ship something that uses DPP, so we need to find some predictable way of detecting and working around this condition. For reference, my ESP32-C3-DevKit-M1U is working, my ESP32-C3-DevKit-M1 isn't. Both are using v4.4.4. |
Hi @jasta , we are trying to reproduce it locally. Are you using the example? Please share your sdkconfig as well as IDF version commit. |
Yes, I'm able to reproduce consistently with the unmodified example on the current v4.4.4 tag of esp-idf. I am using the default settings (channel 6, no provided key/device info). Here's the sdkconfig: https://gist.github.com/42ef0f07990ca812bba8b541685ef798 |
The bug really smells like a race condition IMO. I had it all working perfectly and changed nothing of significance about my app and it just started giving me ESP_ERR_DPP_TX_FAILURE over and over. Then no amount of reverting my code could fix it, including reverting all the way back to the unmodified example which is where I started. It of course used to work with the sample and even my considerably more robust full app. I suspect that the "condition" that changes to cause it to become persistently broken is literally in the air -- something about my Wi-Fi setup must be able to consistently reproduce an "unlikely" race condition outcome. No hardware has been deliberately modified or replaced since it was once working, so the only possibilities in my view are environmental (wireless signals themselves changing) or through automated software updates of either my router (Google WiFi infrastructure) or my phone (Android Pixel 6) |
Well that's interesting, even though I get ESP_ERR_DPP_TX_FAILURE pretty much every time, I now just got ESP_ERR_DPP_INVALID_ATTR. Adds quite a bit of evidence to my theory that this is a race :) I swear it is seeming like the difference between the two errors is whether I have the esp32 on my desk (DPP_INVALID_ATTR) or in my component drawer (DPP_TX_FAILURE). Bizarre :) |
As an aside, I am the one working on adding DPP support to Rust and ran into this issue maturing the implementation even though I didn't change anything functionally interesting with respect to esp_* calls: esp-rs/esp-idf-svc#228 . |
I have logs with the full debugging turned up but I'm not posting here as I believe they will contain my Wi-Fi creds. Lmk and I can share them privately or reproduce with a dummy network (though I suspect changing my network around will change the results) |
Please black out wifi cred from the logs and share the rest. |
I can't actually tell what is and isn't sensitive about this, but I'm interested enough in getting this solved that I'll risk it hehe: https://gist.github.com/09fe320e7b549967b37088170e59c5cb. Here is the updated sdkconfig after I enabled logging: https://gist.github.com/8ab02d9c9ca064861e9e1cdf22261545 I confirmed again this morning it still repros. I seemingly have a 100% reliable repro (it's happened at least the last 20 or 30 times I've tried) on a devkit module that once worked just fine. One maybe important detail though is that in the example I'm unable to scan the QR code in the console (my phone never recognizes it as a QR code), so I'm copying the QR text into qtqr and generating one there. I can confirm that if I fudge the QR text I get a different error ("No matching own bootstrapping key found as responder - ignoring message"), and then it yields ESP_ERR_DPP_INVALID_ATTR (different than the INVALID_ATTR case I got when I was randomly moving the device around physically). |
I don't know for sure, but it hard-crashes the Wi-Fi event handler afterwards. I cannot receive any events after DPP has failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my conditions being bad...) |
I see the same thing re the hard crash, but I don't believe it is related to LWIP packet fragmentation. These aren't even IP packets AFAICT, they're action frame packets in 802.11. Further, I'm seeing essentially identical behavior to you with the unmodified example, which I'm guessing is the same thing our friends at Espressif are testing with but seeing different results. Highly likely an environmental condition causing the difference. |
From my testing today, I've found that if the channel the esp is currently listening on happens to match whichever channel my phone was connected to the AP on, the chance of So from what I can tell, the issue comes down to just being purely a channel mismatch somehow and the esp not wanting to send to a different channel than it's listening to (just a wild guess on that last part though). And, it seems that there's a fix that works perfectly (from my testing at least), all it takes is enabling Multi Band Support in your sdkconfig.
Instead of only doing This fix seems to work well enough that I can even have the esp listening only on channel 10, my phone can be sitting on channel 149, and I can get the esp to connect to a network on channel 1 with no issue |
Afaik it doesn't actually crash anything, it just sets From my testing, the wifi driver and event loop are all still fully functioning even when that error happens, it's just that there are no events actually going on to show up in the log but if you set Also I doubt lwip actually has anything to do with the issue since I don't remember any mention of lwip in esp_dpp.c, and iirc lwip is a tcp/ip stack, and during dpp configuration it wouldn't make sense for there to be a tcp/ip stack considering only a limited amount of raw frames are sent between the two devices anyways. |
Ahh good catch, however what I'm seeing is that @kapilkedawat, so at the very least we've identified one clear bug that needs fixing: @gayafhannah, I'm going to go try your MBO workaround now and report back, awesome sleuthing BTW! |
@jasta I've also just noticed that since i'm not testing on the master branch and am instead on, in the master branch it seems to be called |
@gayafhannah, MBO support with v4.4.4 at least didn't seem to affect the results, though I still think you're onto something with your analysis. My network set-up is that my phone is connected to a Google Wi-Fi network on 5GHz ch 149, but the 2.4GHz network is on channel 1. I've tried a bunch of configurations of different channels for DPP to use though and still can't get it to work. I'm a bit stumped still. Note: I used idf.py menuconfig to enable MBO support so the settings should match my local checkout. I'm using the v4.4.4 tag in esp-idf. |
I know that without MBO support, i was able to get it to be more consistent at not having the error(although not 100%) if the only channel dpp was configured to listen on was the channel of the wifi network, and with my phone also connected to that network, and therefore on that same channel, which is what led me to mbo support when snooping around menuconfig. |
Disabling MBO and setting channels to 1,6 worked once, but then I rebooted the device and tried one more time and it failed. Then rebooted and tried again several times until it finally worked (which I can only do without rebooting because of my patch in I've attached an abridged log comparing the failed attempt vs the successful attempt (no reboot occurred, I just tried again on the phone): https://gist.github.com/ca10e95e786b79fe847c3a13b297e732 The interesting bit from the logs is that seemingly nothing is different about the failure case and successful one. Both using Channel-6, exchanging seemingly the same data with the same timings, etc. |
Have thrown together a pull request #10812 that should work for the latest version of espidf with the changes i made to get fully functional dpp |
Looks good, you might want to also include my fix to esp_supp_dpp_start_listen which resets the s_dpp_stop_listening flag. This is what enables retries to work as was intended by the example (i.e. not needing deinit/init before it can be retried).
Planning to retest all this on v5.0.1 but was sticking to v4.4.4 because that's where it once worked then stopped so I figured we could learn the most about what exact environmental difference actually matters. |
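For anyone following along, the retry fix mentioned above can be pictured with a tiny standalone model. This is only a sketch and not the code in esp_dpp.c: `listen_model_t`, `model_stop_listen` and `model_start_listen` are made-up names, and it assumes the real `s_dpp_stop_listening` flag gets set when listening stops and is never cleared again unless `esp_supp_dpp_start_listen` clears it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver state; not the real esp_dpp.c layout. */
typedef struct {
    bool stop_listening;   /* models s_dpp_stop_listening */
} listen_model_t;

/* Models a stop (for example after a failed auth): the flag stays set. */
void model_stop_listen(listen_model_t *m)
{
    assert(m != NULL);
    m->stop_listening = true;
}

/* Models start_listen. With clear_flag == false the stale flag blocks the
 * retry; with clear_flag == true (the proposed fix) listening resumes.
 * Returns true if listening actually starts. */
bool model_start_listen(listen_model_t *m, bool clear_flag)
{
    assert(m != NULL);
    if (clear_flag) {
        m->stop_listening = false;
    }
    return !m->stop_listening;
}
```

In this model, a second `model_start_listen(&m, false)` after a stop never listens again, which would match the "needs deinit/init before retrying" behaviour described above, while passing `true` lets the retry go through.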
I'll have to give your fix a shot in the morning. Also my guess for environmental differences causing the issues is just the phone/AP switching channels. It's the only thing I can come up with since, like I said in earlier comments, setting the listen channel to be the same as the AP and phone channel gave maybe an 80-90% success rate at least, instead of the not even 1% chance if it wasn't the same. A fun way to test it was to put all channels 1 through 13 in the channel listen list that you pass into one of the functions. I also added a little thing in esp_dpp to show the current channel as it was cycling through. If I managed to get the timing of pressing the button on my phone just right to land on channel 1 on the esp (channel 1 is also what the AP and my phone were on) then it'd almost always connect fine, but if I missed that channel (difficult to get right when the timeout is 500ms) then it'd fail to connect except for maybe exactly one time. Also if I remember correctly, when setting the channel to be near that of the AP, if the channel width for the AP is 40MHz, I noticed that even being +-2 channels from the centre channel had moderate success rates. Definitely not the same as setting the listen channel directly on top of the centre channel though. Also 40MHz wide ones were slightly less successful on the centre channel than 20MHz wide (probably because 20MHz is literally just 1 channel wide). |
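If anyone wants to repeat the "listen on every channel" experiment, here is a rough sketch of building the comma-separated list. `build_chan_list` is a hypothetical helper, not part of ESP-IDF, and the assumption is that the resulting string is what you pass as the channel list argument of `esp_supp_dpp_bootstrap_gen` in the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "first,first+1,...,last" into buf.
 * Returns the string length, or -1 if the range is invalid or buf is too small. */
int build_chan_list(int first, int last, char *buf, size_t len)
{
    assert(buf != NULL);
    if (first < 1 || last < first || len == 0) {
        return -1;
    }
    size_t used = 0;
    for (int ch = first; ch <= last; ch++) {
        /* Separator goes before every entry except the first one. */
        int n = snprintf(buf + used, len - used, "%s%d",
                         (ch == first) ? "" : ",", ch);
        if (n < 0 || (size_t)n >= len - used) {
            return -1;   /* output would have been truncated */
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling `build_chan_list(1, 13, buf, sizeof buf)` with a 64-byte buffer fills it with channels 1 through 13. Keep in mind that a longer list means the esp spends less time on any single channel per cycle.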
I'm not observing any strange side effects from the fix FWIW. And more importantly I'm noticing that my phone will actually automatically retry now with this setting on, indicating that the DPP standard actually expects some intermittent failures (at least due to RF interference would be my guess, but maybe even this channel weirdness). That suggests to me that the lack of working retry is a big part of the root cause here. Imagine what would happen to our global TCP infrastructure if you disabled packet retransmits, for example...
Definitely plausible and consistent with my experience. If I just sit there and hit Retry on my Android phone over and over with my fix it will eventually work. Not ideal UX by any stretch, though, and I'd be quite surprised if we (or maybe just Espressif) couldn't come up with a more reliable way to fix this.
I can't really say for sure, I'd probably need to dive deep into how this stuff is even supposed to function at the standard level before I could speculate why that would play a role. One curious thing I did notice however was that the QR code does in fact include the channel that the enrollee (the esp32) is listening on, so it doesn't make any sense why the configurator (the phone) would have any troubles sending/receiving messages on the correct channel. |
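To check which channels a given QR text actually advertises, a small parser like the one below can help. It is a sketch under assumptions: `dpp_uri_channels` is a made-up name, the URI is assumed to look like `DPP:C:81/6;M:...;K:...;;` with entries in the `C:` field written as `class/channel` or as a bare channel, and the sample strings in the note are invented, not taken from a real device.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extracts channel numbers from the "C:" field of a DPP bootstrapping URI.
 * Returns the number of channels stored in out (at most max), 0 if the URI
 * has no "C:" field, or -1 if the text is malformed. */
int dpp_uri_channels(const char *uri, int *out, int max)
{
    assert(uri != NULL && out != NULL);
    if (strncmp(uri, "DPP:", 4) != 0) {
        return -1;
    }
    const char *p = uri + 4;
    while (*p != '\0' && *p != ';') {          /* an empty field (";;") ends the URI */
        const char *end = strchr(p, ';');
        if (end == NULL) {
            return -1;                         /* every field must end with ';' */
        }
        if (p[0] == 'C' && p[1] == ':') {
            int count = 0;
            p += 2;
            while (p < end && count < max) {
                char *next;
                long v = strtol(p, &next, 10);
                if (next == p) {
                    return -1;
                }
                if (*next == '/') {            /* "class/channel": skip the class */
                    p = next + 1;
                    v = strtol(p, &next, 10);
                    if (next == p) {
                        return -1;
                    }
                }
                out[count++] = (int)v;
                p = (*next == ',') ? next + 1 : next;
            }
            return count;
        }
        p = end + 1;                           /* not the channel field, move on */
    }
    return 0;
}
```

For the invented text `DPP:C:81/1,81/6,81/11;K:EXAMPLEKEY;;` this yields channels 1, 6 and 11, which you can then compare against the channel the AP and phone are actually on.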
I noticed the channels being listed in the QR code as well, and your exact reasons are why I'm hesitant to say that my guess is actually correct. It's just that the pattern being there made it really seem like it has to have something to do with the channel being wrong. Especially because the error afaik is caused by the esp failing to send the packet "offchannel" and not in receiving any packets, as even with the error happening, it still successfully receives the first packet/frame from the phone just fine. |
@kapilkedawat The plot thickens trying to work around these bugs. It looks like if you try to work around this issue by calling esp_supp_dpp_deinit/init again and just starting all over, you will quickly run into a race condition in esp_dpp.c in which this assert fails: https://gist.github.com/a067c88e511dfff52e7f704d469e157f I am using unmodified ESP-IDF 4.4.4 to cause this behaviour. The reason appears to be because of an inherent race condition in which esp_supp_dpp_init makes no effort to check if the previous s_dpp_evt_queue and associated task is drained and cleaned up before proceeding to start up esp_dpp_task again. Illustrated as such:
It seems to me like the fix would be to have a condition variable and associated state that deinit sets true, SIG_DPP_DEL_TASK sets false, and init checks for / waits until it is false. A workaround that seems reliable enough is to sleep for 1s between deinit/init though of course this isn't really a fix. This also suggests to me the proposed fix of clearing s_dpp_stop_listening in esp_supp_dpp_start_listen is even more important. EDIT: Filed a separate issue for this, #10879 |
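The state tracking suggested above can be sketched as a small single-threaded model. Everything here is hypothetical (`dpp_model_t`, `model_init`, `model_deinit`, `model_task_handle_delete`), and instead of blocking on a condition variable this variant simply has init report "busy" until the task has processed the delete request, so the caller would poll or wait on whatever primitive fits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the DPP task lifecycle; not the real esp_dpp.c state. */
typedef struct {
    bool running;           /* task is alive and owns the event queue */
    bool teardown_pending;  /* deinit was requested but the task has not exited yet */
} dpp_model_t;

typedef enum { MODEL_OK = 0, MODEL_BUSY = -1 } model_err_t;

/* init refuses to start while an old task or a pending teardown still exists. */
model_err_t model_init(dpp_model_t *m)
{
    assert(m != NULL);
    if (m->running || m->teardown_pending) {
        return MODEL_BUSY;
    }
    m->running = true;
    return MODEL_OK;
}

/* deinit only posts the delete request; the task cleans up later. */
void model_deinit(dpp_model_t *m)
{
    assert(m != NULL);
    if (m->running) {
        m->teardown_pending = true;
    }
}

/* What the task would do when it handles the delete signal. */
void model_task_handle_delete(dpp_model_t *m)
{
    assert(m != NULL);
    m->running = false;
    m->teardown_pending = false;
}
```

In this model, calling `model_init` right after `model_deinit` returns `MODEL_BUSY` until `model_task_handle_delete` has run, which is the window the 1s sleep workaround is papering over.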
I have experienced issues with DPP running on IDF v5.2.1, out of 47 attempts I experienced a failure rate of 51%. 40 attempts were made with verbose logging enabled, and WPA logs disabled, and 7 attempts were made with both verbose logging and WPA logs enabled.
Logs of all failures can be found here
As @michael-betz has also experienced, changing the value of |
Hi @Kaden-mpc , I am sharing a debug build for dpp_enrollee example on v5.2.1 here with some extra logs and few fixes. The file flash_cmd contains command for flashing. Please use that to flash the device. Thanks. |
@Aditi-Lonkar I have collected 20 logs of the build you shared, 8/20 failed. I have not yet collected any network packets. I will get back to you with those, and additional logs when I get them. The logs can be found here. Thank you for jumping on this issue! |
@Kaden-mpc Thanks for trying the earlier build and sharing the logs. We are actively working on the issue; however, we are not able to reproduce the exact same results in our setup. I am attaching one more build here with some more fixes added. Will you please test and share the logs and sniffer capture with this too? The fixes include
dpp_test_build_521_2.zip Thanks. |
@Aditi-Lonkar thank you for the responsiveness, @Kaden-mpc and I have the ESP logs from the new version but do not have the sniffer capture we are able to share yet due to our company data security policy. We are working on an alternative way to get the packet capture and will get that over to you early next week. Thank you again for the help! |
@Aditi-Lonkar we have acquired logs alongside network traffic, however the access point used did not support any form of pcap, ssh, or port mirroring, and an external device was used to capture the packets using wireshark. Let us know if you require any additional logs, or if we need to acquire an AP that supports proper pcap. Additionally let us know if you have any recommendations for capturing more thorough debug information to aid with resolving issues. The logs and packet captures can be found here. Thank you for the ongoing support! |
@Aditi-Lonkar we have captured more logs, with router network sniffing as well. Out of 25 attempts, we had 2 failures. The logs can be found here. Thank you for the help! |
Hi @Kaden-mpc, Thank you for sharing the logs and helping us debug this issue further. What we need is a wireless capture where we can verify whether the mobile device is acknowledging the packets. You can see how to capture wireless packets here. In our internal testing, we have observed that sometimes the station stops receiving DPP frames altogether from the responder (ESP) device. We believe this is an issue with the mobile device, as the ESP responds to the frames within the stipulated timeframe. This behavior seems inconsistent, which explains why the connection sometimes works flawlessly and at other times fails. If you are able to consistently reproduce the issue, could you try the same test with a different responder device (not ESP) to see if the problem persists? |
@kapilkedawat I have collected more logs, and wireless packet captures using iw monitor mode, which can be found here. We will send logs and sniffs of a non-ESP device performing DPP tomorrow or early next week. Let us know if there is anything else we can provide to aid in debugging. Thanks for the support! |
@kapilkedawat We have tested a non-ESP device (Google Chromecast) using the same AP, phone, sniffer, and same relative device position. We continued retrying the same experiment until we received a failure, which took 50 attempts. Whereas with the ESP we are able to consistently reproduce roughly every 10 attempts. Additionally, the Chromecast DPP appears much faster, with the success message appearing on the phone almost instantaneously, and never shows the retry screen, whereas with the ESP DPP it almost always takes a few seconds, and every few attempts shows the retry screen (as shown in 09122024, attempts 3, 8, and 15. 09132024 attempt 11. 09172024 attempt 12. And potentially other runs where I marked it as successful without noting retries). The Chromecast sniffs can be found here. Thank you for the support! |
@Aditi-Lonkar @kapilkedawat wanted to follow up and see what other information we can provide to assist. I also tested on the Chromecast and was able to get 50 consecutive successes on the Chromecast, showing that it is not an issue with the mobile device. We look forward to hearing from you! |
Hi @ragnarmargus @Kaden-mpc, apologies for the delayed response. Yes, Chromecast sessions are faster since Chromecast has significantly more computing power compared to the ESP32. However, we are looking into optimizing the code further to improve performance, which may, in turn, reduce the error rates. This is still just a hunch, but we will try to provide another build tomorrow. |
Hi @ragnarmargus @Kaden-mpc, apologies for the delay. Could you please try the attached experimental build? It reduces the response time of DPP frame processing. |
@kapilkedawat We have collected 50 runs from the build you shared, 6 required pressing the retry button, and 1 failed. There is a noticeable increase in speed between scanning the QR code and receiving the success status. The failure that we experienced did not succeed on pressing the retry button, or rescanning the QR code; is this a fatal failure, or would rerunning the full DPP process enable another attempt at performing DPP, similarly to how Chromecast handled its failure? The logs can be found here. Thank you for the ongoing support! |
Hi @Kaden-mpc, Thank you very much for your continued testing and providing the results, really appreciate the same. The failure in attempt 46 is recoverable, and I have made some changes to handle that. Unfortunately, we cannot make the response faster than this in current CPU freq, as we are limited by the computational power available in the chip. DPP request frame parsing and the creation of the response frame involve two ECDH operations, each taking more than ~150 ms. I have created another build that includes a fix for the issue mentioned above. This build also increases the CPU frequency to 240MHz to make the operations a bit faster. Please help to test and provide the results. |
@kapilkedawat We have collected 50 more logs using your latest build. 5 of the attempts required retries, and 2 were failures. For 1 of the 2 failures, the ESP connected to the wifi successfully, but the phone showed a failure. And the other stopped responding, with no errors shown in the logs, and the sniff showing the ESP not responding. Thanks for the quick response, and for the support! |
Hi @Kaden-mpc, Thank you for your invaluable testing efforts. I think we can now focus only on the non-recoverable errors. Instance 49 is recoverable, but the timeout for that was 120 seconds. We can add some further optimizations there. I have made some adjustments to the timeout in the gas query. Please try the attached build. |
@kapilkedawat We have collected 50 more logs. 9 of the attempts required retries (2 of which required retrying twice). And we experienced another false failure, where the phone displayed failure but it connected to the wifi. Once non-recoverable errors are resolved, will progress be made on making the retries less frequent, or are we locked at the current 10-22% requiring retries? Thank you for the support! |
Hi @Kaden-mpc, thanks again for your testing efforts and for providing the logs. All the non-recoverable errors have been resolved, and the errors we are currently encountering are due to the mobile device either not replying in time or not receiving the frames (likely going off-channel). Unfortunately, there isn't much more that can be done from an optimization perspective, as we are processing the frames as fast as possible. I want to emphasize that the ESP device is following the protocol correctly, and our behavior aligns with the current upstream wpa_supplicant codebase. You will observe different behavior with various configurations, depending on how much the driver honors wpa_supplicant's requests. For example, in our internal testing with a b43 driver based chip, we encountered 0 errors out of 10. |
@kapilkedawat a sample size of 10 attempts is likely insufficient to demonstrate that the code works consistently. We are expecting this to be utilized by hundreds of customers per day. We would like to see at least 25 attempts to have a statistically relevant sample size. @Kaden-mpc has been doing 50 for even more consistent and repeatable data. The consistent 10-15% failure rate we have seen on our 50 attempt test runs does not meet the quality standards our end customers expect and is not shippable. Anything more than 1-2% failure will result in a poor experience given that the DPP sharing experience is driven by the Android Operating system and we have no way to control what that 10-15% of users experience during failure. Additionally we are having a hard time understanding how the chip is insufficient to support closer to a 98% success rate without retry as has been shown with the Chromecast. Additionally @Kaden-mpc has utilized the same phone for all testing, both with the Chromecast and ESP device showing that is not the issue. |
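To put a number on the sample size point, here is a quick back-of-the-envelope sketch. It assumes every attempt is independent and fails with the same probability, which this thread has not established, and `prob_all_succeed` is just an illustrative helper name.

```c
#include <assert.h>

/* Probability that n independent attempts all succeed when each attempt
 * fails with probability p. Assumes attempts are independent and identical. */
double prob_all_succeed(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 0);
    double q = 1.0;
    for (int i = 0; i < n; i++) {
        q *= (1.0 - p);   /* each extra attempt must also succeed */
    }
    return q;
}
```

Under those assumptions, a device with a true 10% failure rate would still pass 10 out of 10 roughly 35% of the time (0.9^10 is about 0.349), while passing 50 out of 50 would happen only about 0.5% of the time. That is why a clean run of 10 says little about whether the failure rate is anywhere near the 1-2% target.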
Hi @ryanrampage1, We conducted around 40 additional tests, and the error occurred 3 times during those retries. I would like to reiterate that the issues we faced are not due to the Espressif chip but rather because the initiator is not following the protocol correctly. For example, in the shared logs:
I’ve created a new build where the device waits for 1.6 seconds for the frame instead. Could you please try that? Also, do you have any ESP32C2 or ESP32C6 devkits available? We would like to check the behavior on those as well, as they have ECC hardware, which should allow the response to be sent much faster and see if that makes any difference. |
@kapilkedawat We are unable to capture the DPP process with your latest build:
We do not have any ESP32C2 or ESP32C6 devkits currently. Thank you for your ongoing support! |
Hi @Kaden-mpc, tried the same build and able to run the example correctly.
Could you please clean nvs and retry flashing the build? The assert is failing since the mode is not station; however, we explicitly set the mode in the example https://github.com/espressif/esp-idf/blob/release/v5.2/examples/wifi/wifi_easy_connect/dpp-enrollee/main/dpp_enrollee_main.c#L158. |
@kapilkedawat I have tried cleaning the chip, and reflashing but received the same results. I have also tried 2 brand new ESP32S3 devkits, and experienced the exact same error I have previously shared. |
@kapilkedawat I reflashed the first build that @Aditi-Lonkar sent (#10615 (comment)), which worked. Upon reflashing your build, your build then worked. It seems like previous builds had an initialization step which was being written to the nvs that is no longer working in newer builds. Completely fresh boards, and boards with a cleaned nvs cannot run your latest builds (tested flashing sep30 and newer with same results). |
@Kaden-mpc please try attached build. |
@kapilkedawat and @Aditi-Lonkar sorry for the delay here, @Kaden-mpc and I have been caught up with a few other priorities. Will run that build and report back results in the next week or two. |
Answers checklist.
IDF version.
v4.4.3
Operating System used.
Windows
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
PowerShell
Development Kit.
ESP32-WROVER-DevkitC
Power Supply used.
USB
What is the expected behavior?
I expected it to correctly authenticate with the access point.
What is the actual behavior?
It does not authenticate with the access point at all.
It seems to fail with the WPA log showing
D (66148) wpa: Mgmt Tx Status - 1, Cookie - 0x400e036c
and returns ESP_ERR_DPP_TX_FAILURE no matter which access point is used.
Steps to reproduce.
Code
It is 1:1 to the example, except I removed the key part, which should just work without it.
Debug Logs.
More Information.
Afterwards it seems to crash the Wi-Fi module.