Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DPP Auth Failed on dpp-enrollee example (IDFGH-9228) #10615

Open
3 tasks done
knight-ryu12 opened this issue Jan 25, 2023 · 67 comments
Open
3 tasks done

DPP Auth Failed on dpp-enrollee example (IDFGH-9228) #10615

knight-ryu12 opened this issue Jan 25, 2023 · 67 comments
Assignees
Labels
Status: In Progress Work is in progress Type: Bug bugs in IDF

Comments

@knight-ryu12
Copy link

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v4.4.3

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

PowerShell

Development Kit.

ESP32-WROVER-DevkitC

Power Supply used.

USB

What is the expected behavior?

I expected it to correctly authenticate with Accesspoint.

What is the actual behavior?

It does not authenticate with accesspoint at all;
Seems that it fails with WPA being D (66148) wpa: Mgmt Tx Status - 1, Cookie - 0x400e036c, and returns ESP_ERR_DPP_TX_FAILURE no matter what accesspoint it uses.

Steps to reproduce.

Code
It is 1:1 to the example, expect I removed the key part. which should just work without it.

Debug Logs.

D (64538) wpa: DPP: Rx Public Action frame: action - 9
I (64538) wpa: DPP: Authentication Request from ac:3e:b1:02:80:e0
D (64538) event: running post WIFI_EVENT:20 with handler 0x400d6adc and context 0x3ffbdcfc on loop 0x3ffbc9f0
0x400d6adc: event_handler(void*, char const*, int, void*) at C:/Users/Alma/sensor-data-update/main/main.cpp:55

E (64548) wpa: curve: secp256r1

E (64558) wpa: prime len is 32

E (64558) wpa: Q_x: - hexdump(len=32):
E (64568) wpa: 2e cc 68 18 67 c2 7a 7f d2 59 20 1f 89 c5 99 28
E (64568) wpa: d2 59 3b 12 b9 07 13 4d 04 02 9a 80 0e 57 69 5c
E (64578) wpa: Q_y: - hexdump(len=32):
E (64578) wpa: 52 00 62 4e d9 33 98 b6 79 86 f6 ba 1c 20 a0 34
E (64588) wpa: 6f ec 12 e8 8d 28 56 47 7e 6b 09 00 48 04 84 29 
E (64598) wpa: d:      - hexdump(len=32):
E (64598) wpa: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
E (64608) wpa: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
D (64558) event: running post WIFI_EVENT:20 with handler 0x400e0420 and context 0x3ffcf050 on loop 0x3ffbc9f0
0x400e0420: offchan_event_handler at C:/Users/Alma/esp443/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_dpp.c:449

D (64858) wpa: DPP: ECDH shared secret (M.x) - hexdump(len=32):
D (64858) wpa: f0 8c 6a ab fc 7b b8 4e 1b 6f f0 ff b7 2d a7 51
D (64858) wpa: 0a 20 b7 ac ae e7 de 04 58 42 a2 fd 44 22 44 9c 
D (64868) wpa: DPP: PRK = HKDF-Extract(<>, IKM=M.x) - hexdump(len=32):
D (64878) wpa: bc 82 9b f7 0b 3d 4d 33 4a 14 1f 11 9d d0 76 ff
D (64878) wpa: 76 f6 99 be c7 1f 2e 87 b2 8d d7 11 4f e5 ce 4f 
D (64888) wpa: DPP: k1 = HKDF-Expand(PRK, info, L) - hexdump(len=32):
D (64888) wpa: 0a e1 3c 28 f3 44 cf 2d 60 c0 f1 75 63 9d 19 4a
D (64898) wpa: 55 49 3f 69 61 4f 8a f0 68 22 7d ab 9e 0e 11 48 
D (64908) wpa: DDP: AES-SIV AD[0] - hexdump(len=6):
D (64908) wpa: 50 6f 9a 1a 01 00
D (64908) wpa: DDP: AES-SIV AD[1] - hexdump(len=145):
D (64918) wpa: 02 10 20 00 d1 29 db 70 63 65 7c e0 a6 69 1d 0e
D (64928) wpa: 7b b4 2e c4 dc b0 9b f1 70 94 43 b4 45 44 d9 d5
D (64928) wpa: c2 50 90 a7 01 10 20 00 ad 99 e9 8c 5d c3 e0 73 
D (64938) wpa: 5f 15 68 77 32 88 6c 42 64 18 88 a2 eb 22 85 52
D (64938) wpa: f3 15 f7 d4 16 d7 21 f7 03 10 40 00 2e cc 68 18 
D (64948) wpa: 67 c2 7a 7f d2 59 20 1f 89 c5 99 28 d2 59 3b 12
D (64958) wpa: b9 07 13 4d 04 02 9a 80 0e 57 69 5c 52 00 62 4e
D (64958) wpa: d9 33 98 b6 79 86 f6 ba 1c 20 a0 34 6f ec 12 e8 
D (64968) wpa: 8d 28 56 47 7e 6b 09 00 48 04 84 29 19 10 01 00
D (64968) wpa: 02
D (64978) wpa: DPP: AES-SIV ciphertext - hexdump(len=41):
D (64978) wpa: 5e 3c b5 ca a8 35 62 48 31 16 73 3f 86 1b 05 fd
D (64988) wpa: ca ee a0 d1 f2 ec c4 c2 a1 be f9 d3 f1 a5 f8 21
D (64988) wpa: 2d 76 12 f0 68 d6 36 44 b2 
D (64998) wpa: DPP: AES-SIV cleartext - hexdump(len=25):
D (64998) wpa: 05 10 10 00 0c fc 37 2f 56 14 8d 2e 3d e5 34 6b
D (65008) wpa: 74 4e 2e f9 06 10 01 00 02 
D (65008) wpa: DPP: I-nonce - hexdump(len=16):
D (65018) wpa: 0c fc 37 2f 56 14 8d 2e 3d e5 34 6b 74 4e 2e f9
D (65018) wpa: DPP: I-capabilities: 0x02
D (65028) wpa: DPP: Acting as Enrollee
D (65028) wpa: DPP: Build Authentication Response
D (65038) wpa: DPP: R-nonce - hexdump(len=16):
D (65038) wpa: f3 e7 c0 99 c3 ce 25 58 c4 e3 96 ab a0 47 3c f8
D (65288) wpa: DPP: Generating a keypair
E (65288) wpa: curve: secp256r1

E (65288) wpa: prime len is 32

E (65288) wpa: Q_x: - hexdump(len=32):
E (65298) wpa: 27 57 28 28 50 5c 1a a0 32 d2 b2 ba ac c4 24 9b
E (65298) wpa: f1 e5 87 a5 07 5c 3f 90 ce 43 6e 8a 89 eb eb 49 
E (65308) wpa: Q_y: - hexdump(len=32):
E (65308) wpa: 2b c3 e2 49 c8 08 4a 73 d1 ad f4 4d 90 54 40 6c
E (65318) wpa: 2f a1 32 50 a1 9e de 32 0a 11 c4 ed 61 9f 56 25 
E (65328) wpa: d:      - hexdump(len=32):
E (65328) wpa: c1 ec c8 73 ad 41 bf 26 37 2a a9 bf e5 9a d5 06 
E (65338) wpa: c1 f2 d3 81 0e 79 3c 8b dc d6 92 50 bd 1e 27 05
D (65588) wpa: DPP: ECDH shared secret (N.x) - hexdump(len=32):
D (65588) wpa: be 3d 4a 7a eb 32 19 3b 94 6a f7 70 66 d8 0c dc
D (65588) wpa: bc 32 0a 6f de 6f 70 4c 21 7b 6c 0f 42 2e 20 e8
D (65598) wpa: DPP: PRK = HKDF-Extract(<>, IKM=N.x) - hexdump(len=32):
D (65608) wpa: 4d 13 bb a1 f2 0a 4c 8d 51 0c 6c e1 7d cb f0 2c
D (65608) wpa: 65 b8 1e 4f 2c 67 06 46 bf b2 a5 0e e5 40 6f 27
D (65618) wpa: DPP: k2 = HKDF-Expand(PRK, info, L) - hexdump(len=32):
D (65618) wpa: ea 9f 7b 91 4c 0d d2 65 71 a5 29 9f a3 1f 05 ee
D (65628) wpa: 46 97 50 c3 18 32 1c 3d 1d 46 f7 d1 0f dd ef 76
D (65638) wpa: DPP: PRK = HKDF-Extract(<>, IKM) - hexdump(len=32):
D (65638) wpa: bb 7e e1 47 48 0f 02 d3 bc 1d 71 90 b8 cd 9d 73 
D (65648) wpa: b9 b1 3f 2a 34 cb 8a 67 f5 ae 43 ff f0 86 b0 7c
D (65648) wpa: DPP: ke = HKDF-Expand(PRK, info, L) - hexdump(len=32):
D (65658) wpa: ff d7 30 d0 7c 67 c5 a3 c3 bc da c9 f8 2a 6e 72 
D (65668) wpa: 16 76 98 2b d7 c7 1e bd 08 17 08 cb d9 9d 7a d5
D (65668) wpa: DPP: R-auth hash components
D (65678) wpa: DPP: hash component - hexdump(len=16):
D (65678) wpa: 0c fc 37 2f 56 14 8d 2e 3d e5 34 6b 74 4e 2e f9
D (65688) wpa: DPP: hash component - hexdump(len=16):
D (65688) wpa: f3 e7 c0 99 c3 ce 25 58 c4 e3 96 ab a0 47 3c f8 
D (65698) wpa: DPP: hash component - hexdump(len=32):
D (65698) wpa: 2e cc 68 18 67 c2 7a 7f d2 59 20 1f 89 c5 99 28 
D (65708) wpa: d2 59 3b 12 b9 07 13 4d 04 02 9a 80 0e 57 69 5c
D (65718) wpa: DPP: hash component - hexdump(len=32):
D (65718) wpa: 27 57 28 28 50 5c 1a a0 32 d2 b2 ba ac c4 24 9b 
D (65728) wpa: f1 e5 87 a5 07 5c 3f 90 ce 43 6e 8a 89 eb eb 49
D (65728) wpa: DPP: hash component - hexdump(len=32):
D (65738) wpa: c2 3f 6d 38 b9 ec 86 c0 75 d5 78 35 81 61 c4 1e 
D (65738) wpa: 5f 23 d3 3a a0 4d a1 fc 62 64 ce e5 ed a6 dc c3
D (65748) wpa: DPP: hash component - hexdump(len=1):
D (65758) wpa: 00
D (65758) wpa: DPP: R-auth - hexdump(len=32):
D (65758) wpa: 9c 68 7a de 49 3a fe 3d d4 e9 25 80 32 19 3b fe
D (65768) wpa: 43 2c 6c c1 ef c7 ae f2 c4 d6 eb 56 a0 e2 12 c2
D (65768) wpa: DPP: {R-auth}ke - hexdump(len=52):
D (65778) wpa: b6 59 c0 88 77 cc bf f5 4a 17 84 e6 56 44 cc f2
D (65778) wpa: 5c f9 89 67 ef 97 6e 43 94 a4 bb f4 46 2c a4 8b
D (65788) wpa: e2 4c 9c 5d 1c cd 0f 97 ae 80 d8 16 4e 3d ac 09
D (65798) wpa: 59 e9 4b 2e
D (65798) wpa: DPP: Status 0
D (65798) wpa: DPP: R-Bootstrap Key Hash
D (65808) wpa: DDP: AES-SIV AD[0] - hexdump(len=6):
D (65808) wpa: 50 6f 9a 1a 01 01
D (65808) wpa: DDP: AES-SIV AD[1] - hexdump(len=109):
D (65818) wpa: 00 10 01 00 00 02 10 20 00 d1 29 db 70 63 65 7c
D (65828) wpa: e0 a6 69 1d 0e 7b b4 2e c4 dc b0 9b f1 70 94 43
D (65828) wpa: b4 45 44 d9 d5 c2 50 90 a7 09 10 40 00 27 57 28 
D (65838) wpa: 28 50 5c 1a a0 32 d2 b2 ba ac c4 24 9b f1 e5 87
D (65838) wpa: a5 07 5c 3f 90 ce 43 6e 8a 89 eb eb 49 2b c3 e2
D (65848) wpa: 49 c8 08 4a 73 d1 ad f4 4d 90 54 40 6c 2f a1 32 
D (65858) wpa: 50 a1 9e de 32 0a 11 c4 ed 61 9f 56 25
D (65858) wpa: DPP: AES-SIV cleartext - hexdump(len=101):
D (65868) wpa: 07 10 10 00 f3 e7 c0 99 c3 ce 25 58 c4 e3 96 ab
D (65868) wpa: a0 47 3c f8 05 10 10 00 0c fc 37 2f 56 14 8d 2e
D (65878) wpa: 3d e5 34 6b 74 4e 2e f9 08 10 01 00 01 04 10 34 
D (65878) wpa: 00 b6 59 c0 88 77 cc bf f5 4a 17 84 e6 56 44 cc
D (65888) wpa: f2 5c f9 89 67 ef 97 6e 43 94 a4 bb f4 46 2c a4
D (65898) wpa: 8b e2 4c 9c 5d 1c cd 0f 97 ae 80 d8 16 4e 3d ac 
D (65898) wpa: 09 59 e9 4b 2e
D (65908) wpa: DPP: AES-SIV ciphertext - hexdump(len=117):
D (65908) wpa: 67 1e e0 44 0e 3e df b0 13 64 39 c7 17 84 df ff 
D (65918) wpa: 5d 44 24 40 0a 8e e8 55 8b e5 41 33 d2 13 75 19
D (65918) wpa: 53 0a f3 91 1b 4d 06 97 c7 d4 ac 95 29 2d 4a da
D (65928) wpa: f8 e2 1b a7 f5 69 a5 2a 26 5b f9 8f 6a cd a1 18 
D (65938) wpa: 3b 6f 8b 4f ed e4 15 f8 25 b0 c9 35 7f c6 15 bb
D (65938) wpa: d2 77 dc b9 33 9b 5b 73 5b 8c 81 8e 96 45 c9 08 
D (65948) wpa: d6 39 7c 20 90 4c 70 14 06 87 41 e0 96 3d 9e b7
D (65948) wpa: 5a 9e b1 4e 7e
D (65958) wpa: DPP: Authentication Response frame attributes - hexdump(len=238):
D (65958) wpa: 04 09 50 6f 9a 1a 01 01 00 10 01 00 00 02 10 20
D (65968) wpa: 00 d1 29 db 70 63 65 7c e0 a6 69 1d 0e 7b b4 2e 
D (65978) wpa: c4 dc b0 9b f1 70 94 43 b4 45 44 d9 d5 c2 50 90
D (65978) wpa: a7 09 10 40 00 27 57 28 28 50 5c 1a a0 32 d2 b2
D (65988) wpa: ba ac c4 24 9b f1 e5 87 a5 07 5c 3f 90 ce 43 6e
D (65988) wpa: 8a 89 eb eb 49 2b c3 e2 49 c8 08 4a 73 d1 ad f4
D (65998) wpa: 4d 90 54 40 6c 2f a1 32 50 a1 9e de 32 0a 11 c4
D (66008) wpa: ed 61 9f 56 25 04 10 75 00 67 1e e0 44 0e 3e df
D (66008) wpa: b0 13 64 39 c7 17 84 df ff 5d 44 24 40 0a 8e e8
D (66018) wpa: 55 8b e5 41 33 d2 13 75 19 53 0a f3 91 1b 4d 06
D (66018) wpa: 97 c7 d4 ac 95 29 2d 4a da f8 e2 1b a7 f5 69 a5
D (66028) wpa: 2a 26 5b f9 8f 6a cd a1 18 3b 6f 8b 4f ed e4 15
D (66038) wpa: f8 25 b0 c9 35 7f c6 15 bb d2 77 dc b9 33 9b 5b
D (66038) wpa: 73 5b 8c 81 8e 96 45 c9 08 d6 39 7c 20 90 4c 70
D (66048) wpa: 14 06 87 41 e0 96 3d 9e b7 5a 9e b1 4e 7e
D (66048) wpa: DPP: Mgmt Tx - MAC:ac:3e:b1:02:80:e0, Channel-6, WaitT-500        
D (66138) event: running post WIFI_EVENT:19 with handler 0x400d6adc and context 0x3ffbdcfc on loop 0x3ffbc9f0
0x400d6adc: event_handler(void*, char const*, int, void*) at C:/Users/Alma/sensor-data-update/main/main.cpp:55

D (66138) event: running post WIFI_EVENT:19 with handler 0x400e0420 and context 0x3ffcf020 on loop 0x3ffbc9f0
0x400e0420: offchan_event_handler at C:/Users/Alma/esp443/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_dpp.c:449

D (66148) wpa: Mgmt Tx Status - 1, Cookie - 0x400e036c
0x400e036c: esp_supp_rx_action at C:/Users/Alma/esp443/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_dpp.c:413

I (66148) wifi dpp-enrollee: DPP Auth failed (Reason: ESP_ERR_DPP_TX_FAILURE), retry...
D (66658) event: running post WIFI_EVENT:20 with handler 0x400d6adc and context 0x3ffbdcfc on loop 0x3ffbc9f0
0x400d6adc: event_handler(void*, char const*, int, void*) at C:/Users/Alma/sensor-data-update/main/main.cpp:55

D (66658) event: running post WIFI_EVENT:20 with handler 0x400e0420 and context 0x3ffcf050 on loop 0x3ffbc9f0
0x400e0420: offchan_event_handler at C:/Users/Alma/esp443/esp-idf/components/wpa_supplicant/esp_supplicant/src/esp_dpp.c:449

More Information.

Seems afterwards it crashes Wi-Fi module.

@knight-ryu12 knight-ryu12 added the Type: Bug bugs in IDF label Jan 25, 2023
@espressif-bot espressif-bot added the Status: Opened Issue is new label Jan 25, 2023
@github-actions github-actions bot changed the title DPP Auth Failed on dpp-enrollee example DPP Auth Failed on dpp-enrollee example (IDFGH-9228) Jan 25, 2023
@knight-ryu12
Copy link
Author

Update on this:
if, LWIP Enable reassembly incoming fragmented IP4 packets are enabled, above example will not work. if disabled, it suddenly work now.

Hmm.

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

I'm having the same issue (dpp-enrollee yields ESP_ERR_DPP_TX_FAILURE). For me it is strangely persistent but unpredictable. That is, I've had it work correctly in the past, but somehow once a given device gets into this state it will yield this error over and over no matter what I do, even if I reset the flash. I believe @knight-ryu12 is describing something similar where a seemingly unrelated change "fixed it". Looks like others have also experienced this: https://esp32.com/viewtopic.php?t=28573.

Needless to say this unpredictability makes it impossible to actually ship something that uses DPP, so we need to find some predictable way of detecting and working around this condition.

For reference, my ESP32-C3-DevKit-M1U is working, my ESP32-C3-DevKit-M1 isn't. Both are using v4.4.4.

@kapilkedawat
Copy link
Collaborator

Hi @jasta , we are trying to reproduce it locally. Are you using the example?

Please share your sdkconfig as well as IDF version commit.

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

Yes, I'm able to reproduce consistently with the unmodified example on the current v4.4.4 tag of esp-idf. I am using the default settings (channel 6, no provided key/device info). Here's the sdkconfig: https://gist.github.com/42ef0f07990ca812bba8b541685ef798

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

The bug really smells like a race condition IMO. I had it all working perfectly and changed nothing of significance about my app and it just started giving me ESP_ERR_DPP_TX_FAILURE over and over. Then no amount of reverting my code could fix it, including reverting all the way back to the unmodified example which is where I started. It of course used to work with the sample and even my considerably more robust full app. I suspect that the "condition" that changes to cause it to become persistently broken is literally in the air -- something about my Wi-Fi setup must be able to consistently reproduce an "unlikely" race condition outcome.

No hardware has been deliberately modified or replaced since it was once working, so the only possibilities in my view are environmental (wireless signals themselves changing) or through automated software updates of either my router (Google WiFi infrastructure) or my phone (Android Pixel 6)

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

Well that's interesting, even though I get ESP_ERR_DPP_TX_FAILURE pretty much every time, I now just got ESP_ERR_DPP_INVALID_ATTR. Adds quite a bit of evidence to my theory that this is a race :)

I swear it is seeming like the difference between the two errors is whether I have the esp32 on my desk (DPP_INVALID_ATTR) or in my component drawer (DPP_TX_FAILURE). Bizarre :)

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

As an aside, I am the one working on adding DPP support to Rust and ran into this issue maturing the implementation even though I didn't change anything functionally interesting with respect to esp_* calls: esp-rs/esp-idf-svc#228 .

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

I have logs with the full debugging turned up but I'm not posting here as I believe they will contain my Wi-Fi creds. Lmk and I can share them privately or reproduce with a dummy network (though I suspect changing my network around will change the results)

@kapilkedawat
Copy link
Collaborator

I have logs with the full debugging turned up but I'm not posting here as I believe they will contain my Wi-Fi creds. Lmk and I can share them privately or reproduce with a dummy network (though I suspect changing my network around will change the results)

Please black out wifi cred from the logs and share the rest.

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

I can't actually tell what is and isn't sensitive about this, but I'm interested enough in getting this solved that I'll risk it hehe: https://gist.github.com/09fe320e7b549967b37088170e59c5cb. Here is the updated sdkconfig after I enabled logging: https://gist.github.com/8ab02d9c9ca064861e9e1cdf22261545

I confirmed again this morning it still repros. I seemingly have a 100% reliable repro (it's happened at least the last 20 or 30 times I've tried) on a devkit module that once worked just fine. One maybe important detail though is that in the example I'm unable to scan the QR code in the console (my phone never recognizes it at as a QR code), so I'm copying the QR text into qtqr and generating one there. I can confirm that if I fudge the QR text I get a different error ("No matching own bootstrapping key found as responder - ignoring message"), and then it yields ESP_ERR_DPP_INVALID_ATTR (different than the INVALID_ATTR case I got when I was randomly moving the device around physically).

@knight-ryu12
Copy link
Author

I don't know for sure, but it crashes Wi-Fi event handler hardly afterwards. I cannot receive any events after DPP had failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my condition being bad...)

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

I don't know for sure, but it crashes Wi-Fi event handler hardly afterwards. I cannot receive any events after DPP had failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my condition being bad...)

I see the same thing re the hard crash, but I don't believe it is related to LWIP packet fragmentation. These aren't even IP packets AFAICT, they're action frame packets in 802.11. Further, I'm seeing essentially identical behavior to you with the unmodified example, which I'm guessing is the same thing our friends at Espressif are testing with but seeing different results. Highly likely an environmental condition causing the difference.

@gayafhannah
Copy link

From my testing today, I've found that if the channel the esp is currently listening on happens to match whichever channel my phone was connected to AP on, the chance of ESP_ERR_DPP_TX_FAILURE dramatically drops and would happen rarely.
This would explain the issue being intermittent and also not fixing itself for a while, since AP's dont really change their channels too often, if at all.
However that doesn't get us very far since it means we need to somehow know the channel before we know what AP or anything. Which also doesn't make sense since the phone already sends the first frame to the esp on the correct channel to begin with (channel 6 in the example), and then attempts to send it back to the phone on that same channel.
From what I can tell, the issue comes from the call esp_wifi_action_tx_req on line 96 inside esp_send_action_frame
Although the error isn't immediately sent to the callback in the lines below and instead that callback is sent the error only when the WIFI_EVENT_ACTION_TX_STATUS wifi event is triggered on line 447 in offchan_event_handler

So from what I can tell, the issue comes down to just being purely a channel mismatch somehow and the esp not wanting to send to a different channel than its listening to(just a wild guess on that last part though).


And, it seems that there's a fix that works perfectly(from my testing atleast), all it takes is enabling Multi Band Support in your sdkconfig.
So essentially all you have to do is:

CONFIG_WPA_11KV_SUPPORT=y
CONFIG_WPA_SCAN_CACHE=y
CONFIG_WPA_MBO_SUPPORT=y
CONFIG_WPA_DPP_SUPPORT=y

Instead of only doing CONFIG_WPA_DPP_SUPPORT=y
(11KV_SUPPORT and SCAN_CACHE are both required by MBO_SUPPORT)

This fix seems to work well enough that I can even have the esp listening only on channel 10, my phone can be sitting on channel 149, and I can get the esp to connect to a network on channel 1 with no issue
I have no idea if this is meant to fix it, or if it's going to break somewhere down the like, but nothing I can do can cause the ESP_ERR_DPP_TX_FAILURE as long as I have MBO Support enabled

@gayafhannah
Copy link

gayafhannah commented Feb 21, 2023

I don't know for sure, but it crashes Wi-Fi event handler hardly afterwards. I cannot receive any events after DPP had failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my condition being bad...)

I see the same thing re the hard crash, but I don't believe it is related to LWIP packet fragmentation. These aren't even IP packets AFAICT, they're action frame packets in 802.11. Further, I'm seeing essentially identical behavior to you with the unmodified example, which I'm guessing is the same thing our friends at Espressif are testing with but seeing different results. Highly likely an environmental condition causing the difference.

Afaik it doesnt actually crash anything, it just sets s_dpp_stop_listening to true after getting the first packet, which stops it from sending more ROC events recursively, with the intention of setting s_dpp_stop_listening to false upon a successful dpp configure, and then sending a ROC event to start the recursive loop of them using esp_supp_dpp_start_listen()

From my testing, the wifi driver and event loop are all still fully functioning even when that error happens, its just there's no events actually going on to show up in the log but if you set s_dpp_stop_listening to false and call the start_listen function again, or if you manually post events, they'll show up normally

Also I doubt lwip actually has anything to do with the issue since i dont remember any mention of lwip in esp_dpp.c, and iirc lwip is a tcp/ip stack, and during dpp configuration it wouldnt make sense for there to be a tcp/ip stack considering only a limited amount of raw frames are sent between the two devices anyways.
Changing lwip settings likely only fixes things due to how inconsistent the ESP_ERR_DPP_TX_FAILURE error is without CONFIG_WPA_MBO_SUPPORT=y since i know atleast for me changing things like that would fix it for exactly the first time i ran it after flashing and no more(even if i reset the nvs once the error reappears)

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

Afaik it doesnt actually crash anything, it just sets s_dpp_stop_listening to true after getting the first packet, which stops it from sending more ROC events recursively, with the intention of setting s_dpp_stop_listening to false upon a successful dpp configure, and then sending a ROC event to start the recursive loop of them using esp_supp_dpp_start_listen()

Ahh good catch, however what I'm seeing is that esp_supp_dpp_start_listen() does not actually once again start listening. That is, if you hit Retry in the Android Wi-Fi UI it will fail without printing any more debug output on the esp32. Looking deeper from your analysis, it looks like the reason is because esp_supp_dpp_start_listen() doesn't clear the s_dpp_stop_listening flag so even if you call esp_supp_dpp_start_listen() again the call to offchan_event_handler trying to trigger SIG_DPP_LISTEN_NEXT_CHANNEL will be silently ignored (because s_dpp_stop_listening is still true). The code is subtle in that no matter what it'll get one SIG_DPP_LISTEN_NEXT_CHANNEL event handled after esp_supp_dpp_start_listen(), but the timeout is 500ms and it will not try again even on the same channel.

@kapilkedawat, so at the very least we've identified one clear bug that needs fixing: esp_supp_dpp_start_listen() needs to call s_dpp_stop_listening = false. I've tested this locally and confirmed that it indeed lets us try again any number of times using the example code.

@gayafhannah, I'm going to go try your MBO workaround now and report back, awesome sleuthing BTW!

@gayafhannah
Copy link

@jasta I've also just noticed that since i'm not testing on the master branch and am instead on, in the master branch it seems to be called ESP_WIFI_MBO_SUPPORT in Kconfig, but WPA_MBO_SUPPORT in 5.0.0
I'll have to try grab myself the master branch and see how that goes

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

@gayafhannah, MBO support with v4.4.4 at least didn't seem to affect the results, though I still think you're onto something with your analysis. My network set-up is that my phone is connected to a Google Wi-Fi network on 5GHz ch 149, but the 2.4GHz network is on channel 1. I've tried a bunch of configurations of different channels for DPP to use though and still can't get it to work. I'm a bit stumped still.

Note: I used idf.py menuconfig to enable MBO support so the settings should match my local checkout. I'm using the v4.4.4 tag in esp-idf.

@gayafhannah
Copy link

@gayafhannah, MBO support with v4.4.4 at least didn't seem to affect the results, though I still think you're onto something with your analysis. My network set-up is that my phone is connected to a Google Wi-Fi network on 5GHz ch 149, but the 2.4GHz network is on channel 1. I've tried a bunch of configurations of different channels for DPP to use though and still can't get it to work. I'm a bit stumped still.

Note: I used idf.py menuconfig to enable MBO support so the settings should match my local checkout. I'm using the v4.4.4 tag in esp-idf.

I know that without MBO support, i was able to get it to be more consistent at not having the error(although not 100%) if the only channel dpp was configured to listen on was the channel of the wifi network, and with my phone also connected to that network, and therefore on that same channel, which is what led me to mbo support when snooping around menuconfig.

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

Disabling MBO and setting channels to 1,6 worked once, but then I rebooted the device and tried one more time and it failed. Then rebooted and tried again several times until it finally worked (which I can only do without rebooting because of my patch in esp_supp_dpp_start_listen(). How bizarre.

I've attached an abridged log comparing the failed attempt vs the successful attempt (no reboot occurred, I just tried again on the phone): https://gist.github.com/ca10e95e786b79fe847c3a13b297e732

The interesting bit from the logs is that seemingly nothing is different about the failure case and successful one. Both using Channel-6, exchanging seemingly the same data with the same timings, etc.

@gayafhannah
Copy link

Have thrown together a pull request #10812 that should work for the latest version of espidf with the changes i made to get fully functional dpp
I should note though that i haven't tried using the example in the repo and instead have been testing with my own heavily modified version, although there shouldn't be much difference in how they work in the end.
Also @jasta there may be a chance that on v4.4.4 MBO support may not work quite the same, especially considering its an entire major release difference.
99% of my testing has been done on v5.0.0

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

Have thrown together a pull request #10812 that should work for the latest version of espidf with the changes i made to get fully functional dpp

Looks good, you might want to also include my fix to esp_supp_dsp_start_listen which resets the s_dpp_stop_listening flag. This is what enables retries to work as was intended by the example (i.e. not needing deinit/init before it can be retried).

I should note though that i haven't tried using the example in the repo and instead have been testing with my own heavily modified version, although there shouldn't be much difference in how they work in the end.
Also @jasta there may be a chance that on v4.4.4 MBO support may not work quite the same, especially considering its an entire major release difference.
99% of my testing has been done on v5.0.0

Planning to retest all this on v5.0.1 but was sticking to v4.4.4 because that's where it once worked then stopped so I figured we could learn the most about what exact environmental difference actually matters.

@gayafhannah
Copy link

gayafhannah commented Feb 21, 2023

Looks good, you might want to also include my fix to esp_supp_dsp_start_listen which resets the s_dpp_stop_listening flag. This is what enables retries to work as was intended by the example (i.e. not needing deinit/init before it can be retried).

I'll have to give your fix a shot in the morning
That fix was actually the first thing I tried, and while it did allow me to retry even after significant errors, it had some... strange side effects. I can't explain what caused them exactly but that could just be because I was really fucking with esp_dpp.c to figure out just where was it all going wrong and somehow because of that little fix I was getting too many ROC events every second to even bother counting.
Hopefully that change works better with a cleaner and working esp_dpp.c than it did this morning.

Also my guess for environmental differences causing the issues is just the phone/AP switching channels. It's the only thing I can come up with since like I said in earlier comments, setting the listen channel to be the same as the ap and phone channel, maybe 80-90% success rate atleast. Instead of the not even 1% chance if it wasn't the same. A fun way to test it was to in the channel listen list that you put into one of the functions, just up all channels 1 through to 13, I also added a little thing in esp_dpp to show the current channel as it was cycling through. If I managed to get the timing pressing the button on my phone just right to land on channel 1 on the esp(channel 1 also what the ap and my phone were on) then it'd almost always connect fine, but if I missed that channel(difficult to get it right when the timeout is 500ms) then it'd fail to connect except for maybe exactly one time.

Also if I remember correctly, setting the channel to be near the same as the ap, if the channel width for the ap is 40MHz, I noticed even being +-2 channels from the centre channel, it had moderate success rates. Definitely not the same as setting the listen channel directly ontop of the centre channel though. Also 40MHz wide ones were slightly less successful on the centre channel than 20MHz wide(probably because 20MHz is literally just 1 channel wide).
I have no explanation for this stuff to do with channel width, it's just something I noticed during testing since the extender here is 40MHz wide and the main ap is 20MHz wide and I was constantly switching networks. I feel like I'm going crazy because I have no solid evidence and it could all just be a coincidence and there was some other environmental stuff going on, but I spent alot of time testing things and that pattern of width playing a small part(and being directly ontop of the right channel being an even bigger part) have been constants the entire time.
This would also likely explain why it sounds like others have been having much more inconsistent errors, since if the target ap has a wide channel and being close to a wide one increases chances but only a bit, whereas for me having a network where if I'm not right on it, it doesn't work, until I use the extender but then it works only sometimes

@jasta
Copy link
Contributor

jasta commented Feb 21, 2023

I'll have to give your fix a shot in the morning That fix was actually the first thing I tried, and while it did allow me to retry even after significant errors, it had some... strange side effects. I can't explain what caused them exactly but that could just be because I was really fucking with esp_dpp.c to figure out just where was it all going wrong and somehow because of that little fix I was getting too many ROC events every second to even bother counting. Hopefully that change works better with a cleaner and working esp_dpp.c than it did this morning.

I'm not observing any strange side effects from the fix FWIW. And more importantly I'm noticing that my phone will actually automatically retry now with this setting on, indicating that the DPP standard actually expects some intermittent failures (at least due to RF interference would be my guess, but maybe even this channel weirdness). That suggests to me that the lack of working retry is a big part of the root cause here. Imagine what would happen to our global TCP infrastructure if you disabled packet retransmits, for example...

Also my guess for environmental differences causing the issues is just the phone/AP switching channels. It's the only thing I can come up with since like I said in earlier comments, setting the listen channel to be the same as the ap and phone channel, maybe 80-90% success rate atleast. Instead of the not even 1% chance if it wasn't the same. A fun way to test it was to in the channel listen list that you put into one of the functions, just up all channels 1 through to 13, I also added a little thing in esp_dpp to show the current channel as it was cycling through. If I managed to get the timing pressing the button on my phone just right to land on channel 1 on the esp(channel 1 also what the ap and my phone were on) then it'd almost always connect fine, but if I missed that channel(difficult to get it right when the timeout is 500ms) then it'd fail to connect except for maybe exactly one time.

Definitely plausible and consistent with my experience. If I just sit there and hit Retry on my Android phone over and over with my fix it will eventually work. Not ideal UX by any stretch, though, and I'd be quite surprised if we (or maybe just Espressif) couldn't come up with a more reliable way to fix this.

Also if I remember correctly, setting the channel to be near the same as the ap, if the channel width for the ap is 40MHz, I noticed even being +-2 channels from the centre channel, it had moderate success rates. Definitely not the same as setting the listen channel directly ontop of the centre channel though. Also 40MHz wide ones were slightly less successful on the centre channel than 20MHz wide(probably because 20MHz is literally just 1 channel wide). I have no explanation for this stuff to do with channel width, it's just something I noticed during testing since the extender here is 40MHz wide and the main ap is 20MHz wide and I was constantly switching networks. I feel like I'm going crazy because I have no solid evidence and it could all just be a coincidence and there was some other environmental stuff going on, but I spent alot of time testing things and that pattern of width playing a small part(and being directly ontop of the right channel being an even bigger part) have been constants the entire time. This would also likely explain why it sounds like others have been having much more inconsistent errors, since if the target ap has a wide channel and being close to a wide one increases chances but only a bit, whereas for me having a network where if I'm not right on it, it doesn't work, until I use the extender but then it works only sometimes

I can't really say for sure, I'd probably need to dive deep into how this stuff is even supposed to function at the standard level before I could speculate why that would play a role. One curious thing I did notice however was that the QR code does in fact include the channel that the enrollee (the esp32) is listening on, so it doesn't make any sense why the configurator (the phone) would have any troubles sending/receiving messages on the correct channel.

@gayafhannah
Copy link

gayafhannah commented Feb 22, 2023

I can't really say for sure, I'd probably need to dive deep into how this stuff is even supposed to function at the standard level before I could speculate why that would play a role. One curious thing I did notice however was that the QR code does in fact include the channel that the enrollee (the esp32) is listening on, so it doesn't make any sense why the configurator (the phone) would have any troubles sending/receiving messages on the correct channel.

I noticed the channels being listed in the qr code aswell, and your exact reasons are why I'm hesitant that my guess is actually correct. It's just that pattern being there made it really seem like it has to have something to do with the channel being wrong. Especially because the error afaik is caused by the esp failing to send the packet "offchannel" and not in receiving any packets, as even with the error happening, it still successful receives the first packet/frame from the phone just phone.
I'd love to give it a try with my phone not connected to any network while doing dpp so it's not bound to a channel(although when my phone can handle 2 wifi connections like it's nothing that's hard to believe it's an issue) except thanks to Samsung's Oneui being borked, I can barely access the dpp provisioning to begin with, and I can't select any network other than the current one for provisioning, so I can't test that.

@jasta
Copy link
Contributor

jasta commented Feb 28, 2023

@kapilkedawat The plot thickens trying to work around these bugs. It looks like if you try to work around this issue by calling esp_supp_dpp_deinit/init again and just starting all over, you will quickly run into a race condition in esp_dpp.c in which this assert fails:

https://gist.github.com/a067c88e511dfff52e7f704d469e157f

I am using unmodified ESP-IDF 4.4.4 to cause this behaviour.

The reason appears to be because of an inherent race condition in which esp_supp_dpp_init makes no effort to check if the previous s_dpp_evt_queue and associated task is drained and cleaned up before proceeding to start up esp_dpp_task again. Illustrated as such:

Caller Task dppT Task #1 dppT Task #2
dpp_init()
xQueueReceive (blocked)
dpp_deinit()
Handling SIG_DPP_DEL_TASK
dpp_init()
xQueueReceive (blocked)
s_dpp_evt_queue = NULL;
xQueueReceive crashes: s_dpp_evt_queue is NULL)

It seems to me like the fix would be to have a condition variable and associated state that deinit sets true, SIG_DPP_DEL_TASK sets false, and init checks for / waits until it is false. A workaround that seems reliable enough is to sleep for 1s between deinit/init though of course this isn't really a fix.

This also suggests to me the proposed fix of clearing s_dpp_stop_listening in esp_supp_dpp_start_listen is even more important.

EDIT: Filed a separate issue for this, #10879

@Kaden-mpc
Copy link

I have experienced issues with DPP running on IDF v5.2.1, out of 47 attempts I experienced a failure rate of 51%. 40 attempts were made with verbose logging enabled, and WPA logs disabled, and 7 attempts were made with both verbose logging and WPA logs enabled.

  • IDF version: v5.2.1
  • Test device: ESP32-S3-DevKitC-1
  • Mobile device used: Motorola Edge 2021
  • Mobile device OS: Android 13

Logs of all failures can be found here


Then I enabled WPA_DEBUG_PRINT in menuconfig and re-compiled.

Surprisingly, the next 15 attempts all succeeded. Even after changing EXAMPLE_DPP_LISTEN_CHANNEL_LIST to 12 and letting the phone connect to the wifi.

Then I disabled WPA_DEBUG_PRINT, re-compiled and only 4 out of 8 attempts succeeded.

What a strange result.

With WPA_DEBUG_PRINT enabled: 7 / 8 attempts succeeded

Same setup but WPA_DEBUG_PRINT disabled: 2 / 8 attempts succeeded.

As @michael-betz has also experienced, changing the value of WPA_DEBUG_PRINT changes the results a lot, without any other code changed.

@Aditi-Lonkar
Copy link
Collaborator

Hi @Kaden-mpc , I am sharing a debug build for dpp_enrollee example on v5.2.1 here with some extra logs and few fixes.
Will you please test with this build and share the logs and sniffer capture?

dpp_test_build_s3.zip

The file flash_cmd contains command for flashing. Please use that to flash the device. Thanks.

@Kaden-mpc
Copy link

@Aditi-Lonkar I have collected 20 logs of the build you shared, 8/20 failed. I have not yet collected any network packets. I will get back to you with those, and additional logs when I get them. The logs can be found here. Thank you for jumping on this issue!

@Aditi-Lonkar
Copy link
Collaborator

@Kaden-mpc Thanks for trying the earlier build and sharing the logs. We are actively working on the issue, however we are not able to reproduce exact same results in our setup.

I am attaching one more build here with some more fixes added. Will you please test and share the logs and sniffer capture with this too?

The fixes includes

  1. retrying sending auth response for 5 times on same channel in 1second interval if we didn't get auth confirm.
  2. starting timer of 1second for getting auth confirm after getting ack for auth_response
  3. staying on current channel until the whole auth process finishes
  4. stopping listen when dpp returns failure and some reason code related changes.
  5. keep rf enable until the offchannel transmission is done
  6. starting a timer to fail if config not received in 120 secs

dpp_test_build_521_2.zip
The file flash_cmd contains command for flashing. Please use that to flash the device.

Thanks.

@ryanrampage1
Copy link

ryanrampage1 commented Sep 13, 2024

@Aditi-Lonkar thank you for the responsiveness, @Kaden-mpc and I have the ESP logs from the new version but do not have the sniffer capture we are able to share yet due to our company data security policy. We are working on an alternative way to get the packet capture and will get that over to you early next week.

Thank you again for the help!

@Kaden-mpc
Copy link

@Aditi-Lonkar we have acquired logs alongside network traffic, however the access point used did not support any form of pcap, ssh, or port mirroring, and an external device was used to capture the packets using wireshark. Let us know if you require any additional logs, or if we need to acquire a AP that supports proper pcap. Additionally let us know if you have any recommendations for capturing more thorough debug information to aid with resolving issues. The logs and packet captures can be found here. Thank you for the ongoing support!

@Kaden-mpc
Copy link

@Aditi-Lonkar we have captured more logs, with router network sniffing as well. Out of 25 attempts, we had 2 failures. The logs can be found here. Thank you for the help!

@kapilkedawat
Copy link
Collaborator

Hi @Kaden-mpc,

Thank you for sharing the logs and helping us debug this issue further. What we need is a wireless capture where we can verify whether the mobile device is acknowledging the packets. You can see how to capture wireless packets here.

In our internal testing, we have observed that sometimes the station stops receiving DPP frames altogether from the responder (ESP) device. We believe this is an issue with the mobile device, as the ESP responds to the frames within the stipulated timeframe. This behavior seems inconsistent, which explains why the connection sometimes works flawlessly and at other times fails.

If you are able to consistently reproduce the issue, could you try the same test with a different responder device (not ESP) to see if the problem persists?

@Kaden-mpc
Copy link

@kapilkedawat I have collected more logs, and wireless packet captures using iw monitor mode, which can be found here. We will send logs and sniffs of non ESP preforming DPP, tomorrow or early next week. Let us know if there is anything else we can provide to aid in debugging. Thanks for the support!

@Kaden-mpc
Copy link

@kapilkedawat We have tested a non ESP device (Google Chromecast) using the same AP, phone, sniffer, and same relative device position. We continued retrying the same experiment until we received a failure, which took 50 attempts. Whereas with the ESP we are able to consistently reproduce roughly every 10 attempts. Additionally, the Chromecast DPP appears much faster, with the success message appearing on the phone almost instantaneously, and never shows the retry screen, whereas with the ESP DPP it almost always take a few seconds, and every few attempts shows the retry screen (as shown in 09122024, attempts 3, 8, and 15. 09132024 attempt 11. 09172024 attempt 12. And potentially other runs where I marked it as successful without noting retries).

The Chromecast sniffs can be found here. Thank you for the support!

@ryanrampage1
Copy link

@Aditi-Lonkar @kapilkedawat wanted to follow up and see what other information we can provide to assist. I also tested on the Chromecast and was able to get 50 consecutive successes on the Chromecast showing that It is not an issue with the mobile device. We look forward to hearing from you!

@kapilkedawat
Copy link
Collaborator

Hi @ragnarmargus @Kaden-mpc, apologies for the delayed response. Yes, Chromecast sessions are faster since Chromecast has significantly more computing power compared to the ESP32. However, we are looking into optimizing the code further to improve performance, which may, in turn, reduce the error rates. This is still just a hunch, but we will try to provide another build tomorrow.

@kapilkedawat
Copy link
Collaborator

HI @ragnarmargus @Kaden-mpc apologies for the delay. Could you please try with the attached experimental build, this reduces response time of DPP frames processing.
test_build_30_sept.zip

@Kaden-mpc
Copy link

@kapilkedawat We have collected 50 runs from the build you shared, 6 required pressing the retry button, and 1 failed. There is a noticeable increase in speed between scanning the QR code and receiving the success status. The failure that we experienced did not succeed on pressing the retry button, or rescanning the QR code; is this a fatal failure, or would rerunning the full DPP process enable another attempt at preforming DPP, similarly to how Chromecast handled its failure? The logs can be found here. Thank you for the ongoing support!

@kapilkedawat
Copy link
Collaborator

Hi @Kaden-mpc,

Thank you very much for your continued testing and providing the results, really appreciate the same.

The failure in attempt 46 is recoverable, and I have made some changes to handle that. Unfortunately, we cannot make the response faster than this in current CPU freq, as we are limited by the computational power available in the chip. DPP request frame parsing and the creation of the response frame involve two ECDH operations, each taking more than ~150 ms.

I have created another build that includes a fix for the issue mentioned above. This build also increases the CPU frequency to 240MHz to make the operations a bit faster.

Please help to test and provide the results.

test_build_1_oct.zip

@Kaden-mpc
Copy link

@kapilkedawat We have collected 50 more logs using your latest build. 5 of the attempts required retries, and 2 were failures. For 1 of the 2 failures, the ESP connected to the wifi successfully, but the phone showed a failure. And the other stopped responding, with no errors shown in the logs, and the sniff showing the ESP not responding.

Logs can be found here

Thanks for the quick response, and for the support!

@kapilkedawat
Copy link
Collaborator

Hi @Kaden-mpc,

Thank you for your invaluable testing efforts. I think we can now focus only on the non-recoverable errors.

Instance 49 is recoverable, but the timeout for that was 120 seconds. We can add some further optimizations there. I have made some adjustments to the timeout in the gas query. Please try the attached build.

test_build_3oct.zip

@Kaden-mpc
Copy link

@kapilkedawat We have collected 50 more logs. 9 of the attempts required retires (2 of which required retrying twice). And we experienced another false failure, where the phone displayed failure but it connected to the wifi. Once non recoverable errors are resolved, will progress be made on making the retries less frequent, or are we locked at the current 10-22% requiring retries?

Logs can be found here

Thank you for the support!

@kapilkedawat
Copy link
Collaborator

Hi @Kaden-mpc, thanks again for your testing efforts and for providing the logs. All the non-recoverable errors have been resolved, and the errors we are currently encountering are due to the mobile device either not replying in time or not receiving the frames (likely going off-channel). Unfortunately, there isn't much more that can be done from an optimization perspective, as we are processing the frames as fast as possible.

I want to emphasize that the ESP device is following the protocol correctly, and our behavior aligns with the current upstream wpa_supplicant codebase. You will observe different behavior with various configurations, depending on how much the driver honors wpa_supplicant's requests. For example, in our internal testing with a b43 driver based chip, we encountered 0 errors out of 10.

@ryanrampage1
Copy link

@kapilkedawat a sample size of 10 attempts is likely insufficient to demonstrate that the code works consistently. We are expecting this to be utilized by hundreds of customers per day. We would like to see at least 25 attempts to have a statistically relevant sample size. @Kaden-mpc has been doing 50 for even more consistent and repeatable data.

The consistent 10-15% failure rate we have seen on our 50 attempt test runs does not meet the quality standards our end customers expect and is not shippable. Anything more than 1-2% failure will result in a poor experience given that the DPP sharing experience is driven by the Android Operating system and we have no way to control what that 10-15% of users experience during failure.

Additionally we are having a hard time understanding how the chip is insufficient to support closer to a 98% success rate without retry as has been shown with the Chromecast. Additionally @Kaden-mpc has utilized the same phone for all testing, both with the Chromecast and ESP device showing that is not the issue.

@kapilkedawat
Copy link
Collaborator

Hi @ryanrampage1,

We conducted around 40 additional tests, and the error occurred 3 times during those retries.

I would like to reiterate that the issues we faced are not due to the Espressif chip but rather because the initiator is not following the protocol correctly.

For example, in the shard logs:

  • Run 3: The confirm frame was sent after ~1.5 seconds. We waited for 600ms on the channel for the response (note: the upstream supplicant waits for 500ms).
  • Run 13: The initiator did not receive an authentication response, even after sending the packet 5 times. Each of these 5 retries would have included 32 hardware retries. The response was sent within 300ms, which should have been acceptable.
  • Run 21: The configuration response was sent after ~1.5 seconds from the initiator.
  • Run 28: No response was received for the configuration request in the first instance. In the second instance, the configuration response was sent after ~1.5 seconds.
  • Run 29: The authentication confirm frame was sent after ~1.5 seconds.
  • Run 32: The authentication confirm frame was sent after ~1.4 seconds.
  • Run 35: The authentication confirm frame was sent after ~1.5 seconds.
  • Run 36: The authentication confirm frame was sent after ~1.4 seconds.
  • Run 40: The authentication confirm frame was sent after ~1.5 seconds.
  • Run 47: The authentication confirm frame was sent after ~1.5 seconds.

I’ve created a new build where the device waits for 1.6 seconds for the frame instead. Could you please try that?

Also, do you have any ESP32C2 or ESP32C6 devkits available? We would like to check the behavior on those as well, as they have ECC hardware, which should allow the response to be sent much faster and see if that makes any difference.

@Kaden-mpc
Copy link

@kapilkedawat We are unable to capture the DPP process with your latest build:

�[0;32mI (514) wifi_init: WiFi IRAM OP enabled�[0m
�[0;32mI (514) wifi_init: WiFi RX IRAM OP enabled�[0m
�[0;31mE (524) wpa: DPP: failed to init as not in station mode.�[0m
ESP_ERROR_CHECK failed: esp_err_t 0xffffffff (ESP_FAIL) at 0x42009c1f
file: "/IDF/examples/wifi/wifi_easy_connect/dpp-enrollee/main/dpp_enrollee_main.c" line 158
func: dpp_enrollee_init
expression: esp_supp_dpp_init(dpp_enrollee_event_cb)

abort() was called at PC 0x4037c043 on core 0


Backtrace: 0x40375a1e:0x3fca3e50 0x4037c04d:0x3fca3e70 0x40383afd:0x3fca3e90 0x4037c043:0x3fca3f00 0x42009c1f:0x3fca3f30 0x42009daf:0x3fca4040 0x4209295f:0x3fca4060 0x4037cb59:0x3fca4090




ELF file SHA256: 7d83770fa

We do not have any ESP32C2 or ESP32C6 devkits currently.

Thank you for your ongoing support!

@kapilkedawat
Copy link
Collaborator

Hi @Kaden-mpc, tried the same build and able to run the example correctly.

 python -m serial.tools.miniterm --rts 0 --eol CR --dtr 0 --raw /dev/ttyUSB17 115200; exit
--- forcing DTR inactive
--- forcing RTS inactive
--- Miniterm on /dev/ttyUSB17  115200,8,N,1 ---
--- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x1 (POWERON),boot:0x2b (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3810,len:0x178c
load:0x403c9700,len:0x4
load:0x403c9704,len:0xcbc
load:0x403cc700,len:0x2da0
entry 0x403c9914
I (27) boot: ESP-IDF v5.2.1-1-g18a3eebeba8-dirty 2nd stage bootloader
I (27) boot: compile time Oct 10 2024 11:29:45
I (28) boot: Multicore bootloader
I (32) boot: chip revision: v0.1
I (36) boot.esp32s3: Boot SPI Speed : 80MHz
I (40) boot.esp32s3: SPI Mode       : DIO
I (45) boot.esp32s3: SPI Flash Size : 2MB
I (50) boot: Enabling RNG early entropy source...
I (55) boot: Partition Table:
I (59) boot: ## Label            Usage          Type ST Offset   Length
I (66) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (74) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (81) boot:  2 factory          factory app      00 00 00010000 00100000
I (89) boot: End of partition table
I (93) esp_image: segment 0: paddr=00010020 vaddr=3c0a0020 size=28744h (165700) map
I (131) esp_image: segment 1: paddr=0003876c vaddr=3fc98000 size=04690h ( 18064) load
I (135) esp_image: segment 2: paddr=0003ce04 vaddr=40374000 size=03214h ( 12820) load
I (140) esp_image: segment 3: paddr=00040020 vaddr=42000020 size=937c4h (604100) map
I (253) esp_image: segment 4: paddr=000d37ec vaddr=40377214 size=10d2ch ( 68908) load
I (277) boot: Loaded app from partition at offset 0x10000
I (277) boot: Disabling RNG early entropy source...
I (289) cpu_start: Multicore app
I (298) cpu_start: Pro cpu start user code
I (298) cpu_start: cpu freq: 240000000 Hz
I (298) cpu_start: Application information:
I (301) cpu_start: Project name:     dpp-enrollee
I (307) cpu_start: App version:      v5.2.1-1-g18a3eebeba8-dirty
I (313) cpu_start: Compile time:     Oct 10 2024 11:29:24
I (320) cpu_start: ELF file SHA256:  7d83770fa...
I (325) cpu_start: ESP-IDF:          v5.2.1-1-g18a3eebeba8-dirty
I (332) cpu_start: Min chip rev:     v0.0
I (336) cpu_start: Max chip rev:     v0.99 
I (341) cpu_start: Chip rev:         v0.1
I (346) heap_init: Initializing. RAM available for dynamic allocation:
I (353) heap_init: At 3FCA02F0 len 00049420 (293 KiB): RAM
I (359) heap_init: At 3FCE9710 len 00005724 (21 KiB): RAM
I (365) heap_init: At 3FCF0000 len 00008000 (32 KiB): DRAM
I (371) heap_init: At 600FE010 len 00001FD8 (7 KiB): RTCRAM
I (379) spi_flash: detected chip: generic
I (382) spi_flash: flash io: dio
W (386) spi_flash: Detected size(8192k) larger than the size in the binary image header(2048k). Using the size in the binary image header.
I (400) sleep: Configure to isolate all GPIO pins in sleep state
I (406) sleep: Enable automatic switching of GPIO sleep configuration
I (414) main_task: Started on CPU0
I (424) main_task: Calling app_main()
I (444) pp: pp rom version: e7ae62f
I (444) net80211: net80211 rom version: e7ae62f
I (454) wifi:wifi driver task: 3fcaa5bc, prio:23, stack:6656, core=0
I (464) wifi:wifi firmware version: a9f5b59
I (464) wifi:wifi certification version: v7.0
I (464) wifi:config NVS flash: enabled
I (464) wifi:config nano formating: disabled
I (464) wifi:Init data frame dynamic rx buffer num: 32
I (474) wifi:Init static rx mgmt buffer num: 5
I (474) wifi:Init management short buffer num: 32
I (474) wifi:Init dynamic tx buffer num: 32
I (484) wifi:Init static tx FG buffer num: 2
I (484) wifi:Init static rx buffer size: 1600
I (494) wifi:Init static rx buffer num: 10
I (494) wifi:Init dynamic rx buffer num: 32
I (494) wifi_init: rx ba win: 6
I (504) wifi_init: tcpip mbox: 32
I (504) wifi_init: udp mbox: 6
I (504) wifi_init: tcp mbox: 6
I (514) wifi_init: tcp tx win: 5760
I (514) wifi_init: tcp rx win: 5760
I (524) wifi_init: tcp mss: 1440
I (524) wifi_init: WiFi IRAM OP enabled
I (534) wifi_init: WiFi RX IRAM OP enabled
I (534) wpa: esp_dpp_task prio:2, stack:6656
I (584) wifi dpp-enrollee: Scan below QR Code to configure the enrollee:
I (584) QRCODE: Encoding below text with ECC LVL 0 & QR Code Version 10
I (584) QRCODE: DPP:C:81/6;M:7c:df:a1:e7:25:e8;K:MDkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDIgADFUzYHsU/u6ZN9ZDo/bvqKBkuRjmJkQs70S93fKMUIts=;;
                                              
  █▀▀▀▀▀█ █ █▀█ ▀▀▀█▀█▄▄█▀ ▄▀▀▀█ ▄█ █▀▀▀▀▀█   
  █ ███ █ ██ ▄▀▀▄ ▀▄▄  █▀█▀ █ █▀█▀▀ █ ███ █   
  █ ▀▀▀ █ ▄▀█▀█ █ ▄▄██▀▄▄▀ █▀▀▀██▄▀ █ ▀▀▀ █   
  ▀▀▀▀▀▀▀ ▀▄▀ ▀ ▀▄█▄█▄▀▄▀ ▀ █ █▄▀▄█ ▀▀▀▀▀▀▀   
  ▀█ ▄██▀▄ █▀▄▄▄▀▄█ ▄ ▄▄█ ▀ ▀█▀▀▄ ▀▄▄▀▄███▀   
  ██▀▄▀▄▀▄▄▀  ▄ ▄█▄█▄▄ ▄▀▀█▀ █ ▄  ▄ █▀▀  ▄▀   
  ▄▀█▀▀▄▀▄█▄  ▀█▀▀▄ ▀ ▄▄▀▄▄ ▄▄▄▀ ▀▀▄▄█▄ ▄█    
   ▄█ ▄ ▀ █▀▀ ▄   ▀█▄▄▀█▀█▀  ▀██▄ ▄▀  ██ ▄█   
  █▀▀▀▀ ▀▄▄▀██▀▀▄▀▄▄  ▀▄▀█  ▀▄▀▀▀▀▀▀ ▀▄▄      
  █▀█▄█▄▀▀ █▄█▀█▀▀█ ▀▄ █ ▀▀▀▄▀█▄▀▄ ▄▀ █▄▀▄█   
  ▀█  █▀▀ ▀▄█▀▄█▄▀▀ ▄ ▄▄██▄▀  █▀ ▄▀█  ▄█▀█    
  ▄██ ▀▄▀ █▄▀█▀ ▀█▄█▄ ███▀▀▀▄█ █▄▄  ▄  ▄▀▄█   
  █▀▄ ▄▀▀▀▄▄█ ▀█▄ █     ▀▄▀█▄▄▀▀▀ ▄▄▄ ▄▄█▀    
    █ ▀█▀█████▄ ▄▀ █▄ ▀█ █▄ ▄▀██▄▄▀█ ▄ ▀▀ █   
  ▄██ █▀▀█▀ █ ▀█▀▀▄▄▀ ▄▄▀    ██▄▄█ ▀ ██▄      
    ▄ █▀▀▀▄█▄ ▄ ▄ ▀█▄▄▀▄▀█ ▀▄▀▄▄  ▀█ ▄▀█  ▀   
  ▀▀ ▀▀▀▀▀█▄ ▀▀▀ █▄▄▄█▀█▀ ▀▄▀█▀█▄▄█▀▀▀██▀     
  █▀▀▀▀▀█ ▄▄█▀▀█▀▀▄ ▀▄ █▀██▀▄▀ ▄▄██ ▀ █ ▀ ▀   
  █ ███ █ ▀▀  ▄█▄▀▀▄▄  ███▄ ▀▄▄ ▄▀▀▀▀█▀▄ █▄   
  █ ▀▀▀ █ ▄▀▄█▀ ▄█▄█▄ ▄▄ ▀█▀ ▀████▄  █ ▄ ▄█   
  ▀▀▀▀▀▀▀ ▀   ▀      ▀  ▀ ▀▀  ▀▀▀▀▀ ▀    ▀    
                                              

I (864) phy_init: phy_version 640,cd64a1a,Jan 24 2024,17:28:12
I (1014) wifi:mode : sta (7c:df:a1:e7:25:e8)
I (1024) wifi:enable tsf
I (1024) wifi dpp-enrollee: Started listening for DPP Authentication
I (8144) wpa: DPP: Authentication Request from 0c:84:dc:3b:8c:8c
I (8414) wpa: DPP: Sending auth resp frame
I (8704) wpa: DPP: Authentication Confirmation from 0c:84:dc:3b:8c:8c
I (8704) wpa: DPP: GAS request to 0c:84:dc:3b:8c:8c (chan 6)
I (8754) wifi dpp-enrollee: DPP Authentication successful, connecting to AP : Nighthawk
I (8794) wifi:new:<6,0>, old:<1,0>, ap:<255,255>, sta:<6,0>, prof:1
I (8794) wifi:state: init -> auth (b0)
I (9074) wifi:state: auth -> assoc (0)
I (9094) wifi:state: assoc -> run (10)
I (9144) wifi:connected with Nighthawk, aid = 1, channel 6, BW20, bssid = da:21:f9:31:2f:b0
I (9144) wifi:security: WPA2-PSK, phy: bgn, rssi: -34
I (9144) wifi:pm start, type: 1

I (9144) wifi:dp: 1, bi: 102400, li: 3, scale listen interval from 307200 us to 307200 us
I (9154) wifi:set rx beacon pti, rx_bcn_pti: 0, bcn_timeout: 25000, mt_pti: 0, mt_time: 10000
I (9214) wifi:AP's beacon interval = 102400 us, DTIM period = 1
I (9214) wifi:<ba-add>idx:0 (ifx:0, da:21:f9:31:2f:b0), tid:6, ssn:2, winSize:64
I (10164) esp_netif_handlers: sta ip: 192.168.50.231, mask: 255.255.255.0, gw: 192.168.50.1
I (10164) wifi dpp-enrollee: got ip:192.168.50.231
I (10164) wifi dpp-enrollee: connected to ap SSID:Nighthawk password:password
I (10174) main_task: Returned from app_main()

Could you please clean nvs and retry flashing the build? Tha assert is failing since mode is not station however we explicitly set the mode in example https://github.com/espressif/esp-idf/blob/release/v5.2/examples/wifi/wifi_easy_connect/dpp-enrollee/main/dpp_enrollee_main.c#L158.

@Kaden-mpc
Copy link

@kapilkedawat I have tried cleaning the chip, and reflashing but received the same results. I have also tried 2 brand new ESP32S3 devkits, and experienced the exact same error I have previously shared.

@Kaden-mpc
Copy link

@kapilkedawat I reflashed the first build that @Aditi-Lonkar sent (#10615 (comment)), which worked. Upon reflashing your build, your build then worked. It seems like previous builds had an initialization step which was being written to the nvs that is no longer working in newer builds. Completely fresh boards, and boards with a cleaned nvs cannot run your latest builds (tested flashing sep30 and newer with same results).

@kapilkedawat
Copy link
Collaborator

@Kaden-mpc please try attached build.
15oct_build.zip

@ryanrampage1
Copy link

@kapilkedawat and @Aditi-Lonkar sorry for the delay here, @Kaden-mpc and I have been caught up with a few other priorities. Will run that build and report back results in the next week or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: In Progress Work is in progress Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests