Mysql mb4 #6409

sbulen · 2020-12-14T17:59:17Z

Partial for #7173
Fixes #6405
Fixes #7776
Fixes #5108

Overview...
Mysql charsets supported should be utf8 (for 2.1 users migrated before utf8mb4 support) and utf8mb4. The only difference between the two is entity encoding needed for 4-byte utf8 chars when not yet utf8mb4. pg supports mb4 already.

All conditional utf8 logic (utf8 vs iso-8859) is removed.

Current thinking:

Install/upgrade to utf8mb4
Needs new 'convert to mb4', like 2.0 had with utf8
Yes, db changes are necessary, index size constraints (*** Can be avoided if we have change min mysql version to 5.7 & mandate InnoDB...***)
Stripped non-utf8/utf8mb4 logic wherever found; assume utf8/utf8mb4 ONLY going forward in 2.1; Making it easier to refactor down the road

191...

Where does 191 come from? (int) 767/4... Index size max varies based on innodb vs myisam & the innodb row format, which changed over time... Some configs/versions, the max is 767, for some, it's 1000, & for the rest is 3072. 767 is a safe assumption.
To do a good a/b test, you should emulate a prefix length of 767. You can do this by using innodb tables & changing innodb_default_row_format to 'compact', which was the default prior to mysql 5.7. (Or, use MyISAM, which has a limit of 1000.)

Trying to keep field meanings straight...

global_character_set - used in http headers "UTF-8"; had been used as a flag whether utf8 conversion had been run (no change)
$txt['lang_character_set'] - used in http headers "UTF-8" (no change)
$context['character_set'] = derived from the above 2 settings, used in http headers (no change)
$db_character_set = "utf8" or "utf8mb4" (requires changes throughout...)
The above 4 really need to be in sync...
$utf8 & $context['utf8'] = flags used throughout code to determine utf8/utf8mb4 string support - most often used to add u modifier to regexes, derived from above (NO LONGER NEEDED requires changes throughout...)
$db_mb4 = flag for whether mb4 is supported; set after utf8mb4 conversion (or pg in use)... Only used in entity conversion. This really should derived in load.php, not set by user.

Read up on index prefixes here:
https://dev.mysql.com/doc/refman/8.0/en/column-indexes.html
https://dev.mysql.com/doc/refman/8.0/en/innodb-limits.html#:~:text=The%20index%20key%20prefix%20length%20limit%20is%20767%20bytes%20for,4%20bytes%20for%20each%20character.

mb4 support history:
https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

Mysql usage & support by version:
https://www.eversql.com/mysql-8-adoption-usage-rate/
https://fromdual.com/support-for-mysql-from-oracle

albertlast · 2020-12-15T10:57:25Z

In my eyes would be this a huge uplift to redurce the complexity,
by force everthing to be utf8mb4 AND to raise the min mysql version to 5.5.

So I like the pr.

Underdog-01 · 2021-01-18T02:20:10Z

I see that in your changes are using utf8mb4_general_ci and I would like to propose using utf8mb4_unicode_ci.
"general" does not use the unicode rules for sorting and may have unexpected results when used with various languages.

The added performance of using general over unicode makes little difference with modern CPU & memory power/size.

Underdog-01 · 2021-01-18T02:24:17Z

Also text fields may be affected in some circumstances.
They may require being changed to their upper tiered counterpart.
(ie. text to mediumtext)

sbulen · 2021-01-18T03:15:47Z

Also text fields may be affected in some circumstances.
They may require being changed to their upper tiered counterpart.
(ie. text to mediumtext)

Pretty sure we're good here... The settings are now by characters, not bytes.

The problems experienced db-structure-wise were the index lengths...

sbulen · 2021-01-18T03:16:48Z

I see that in your changes are using utf8mb4_general_ci and I would like to propose using utf8mb4_unicode_ci.
"general" does not use the unicode rules for sorting and may have unexpected results when used with various languages.

The added performance of using general over unicode makes little difference with modern CPU & memory power/size.

I did some tests & couldn't find an example where there was a difference.

Can you identify a scenario?

I understand the benefit in theory, just not sure it's really there...

sbulen · 2021-01-18T03:17:11Z

& THANK YOU for the feedback. I think this is an important topic.

Underdog-01 · 2021-01-18T04:35:56Z

It's data sorting for various languages that will present obstacles when using general.
A brief explanation with some examples can be found on the MySQL forum:
https://forums.mysql.com/read.php?103,187048,188748

Underdog-01 · 2021-01-18T04:55:41Z

BTW I agree that this is a very important topic.
The MySQL website states that utf8mb3 is deprecated and in the very near future will be removed
Look at the first "Note" here:
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

jdarwood007 · 2021-03-21T00:46:28Z

As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process.

live627 · 2021-03-21T05:45:40Z

Why can't this repo contin future versions?

…

On Sat, Mar 20, 2021 at 5:46 PM Jeremy D ***@***.***> wrote: As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#6409 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AADJNN4NBBOBOZGHKW5ZPMLTEU6XFANCNFSM4U3DTNPA> .

albertlast · 2021-03-21T07:44:49Z

in my eyes smf 2.1 should run only in mb4 mode and should drop mb3 support,
since it get again mysql and pg similiar and
you don't had the struggle with newer mysql version where mb4 get the new default coalation when you define a column as utf8.

sbulen · 2021-03-22T04:10:06Z

As per my other comments on this. This is a big change in how we handle the database layer. As such its too late in 2.1 to make this. So this should not make 2.1 and be closed out. Future versions of SMF we should make this from the get go with support. Of course this means that future versions of SMF will need to support the conversion to mb4 either manually or during the upgrade process.

I disagree with you on this. SMF2.1 should be mb4, IMO. But we've waited too long & we are appropriately maintaining scope. This is appropriate for 2.1.x.

There is nothing different at all in how we handle the db layer, no changes in DB calls at all. This PR mirrors how we dealt with mb3 in 2.0.

This PR is about 10% fixing bugs, 70% cleaning up old latin1 logic that should have already been removed from 2.1, and 20% migrating to mb4. But we agreed to wait after 2.1, so we are.

jdarwood007 · 2021-03-22T23:46:31Z

Well I do disagree with that because of all the alterations to the core functionality of code. But I will stand by my statement made on others. This has to make RC4 or its not making 2.1. We can't introduce this between RC4 and final.

jdarwood007 · 2021-03-22T23:46:56Z

Fyi, this is also showing as a draft PR.

sbulen · 2021-03-23T01:12:10Z

It's marked as a draft because the dev team had agreed to wait until after 2.1 goes live for mb4 support. I want to ensure it doesnt get merged prematurely.

other/Settings.php

Sources/ManageMaintenance.php

sbulen · 2021-09-24T02:55:52Z

I brought this current, in case folks wanted to test. I incorporated most feedback above. This required a rebase due to many layers of conflicts.

I have not yet responded to @live627's feedback on changes interfering with postgresql. In my testing, it hasn't - this works fine in pg as-is. Still trying to understand why, though...

albertlast · 2021-09-24T03:03:34Z

I brought this current, in case folks wanted to test. I incorporated most feedback above. This required a rebase due to many layers of conflicts.

I have not yet responded to @live627's feedback on changes interfering with postgresql. In my testing, it hasn't - this works fine in pg as-is. Still trying to understand why, though...

Exists any issue with pg?

sbulen · 2021-09-24T03:34:56Z

I haven't seen any issues with pg yet. Still early in testing, though.

albertlast · 2021-09-24T04:25:10Z

From my pov the whole topic didn't exists in pg.

Signed by Shawn Bulen, [email protected]

jdarwood007

I approving this on the grounds the code looks good. But we need to get it tested. I don't have any non english forums to test on (other than the obvious). So we need to work on ensuring stability of this and that it converts just fine.

albertlast · 2023-07-09T16:42:42Z

Smf.org would be a nice place 😀

sbulen · 2023-07-09T16:46:47Z

I did a ton of testing, but like, 2.5 years ago...

It needs to be repeated with the current codebase.

jdarwood007 · 2023-07-09T17:15:51Z

@albertlast Of course we will, once we feel its stable enough for a production environment. We eat our own food.

Signed by Shawn Bulen, [email protected]

sbulen · 2023-07-12T04:01:31Z

Just fixed an issue where log_search_subjects was still getting inappropriately truncated (byte-based substr).

sbulen · 2023-07-12T13:27:57Z

I am still re-testing this one.

Just found an issue with upgrading old forums - mainly due to index renames over the years.

jdarwood007 · 2023-07-13T00:00:07Z

Would it seem logical to track all columns that would be affected in the index, convert, then drop and recreate that index? Its a bit of a long way around, but we can find what columns are in a index.

sbulen · 2023-07-13T00:32:17Z

This PR currently drops & recreates the indexes. Both in the maint function & in the upgrader. The upgrader already tries various old names. The maint function needs to do so as well.

Also, we have a new index added after this PR (and before 2.1 release) that needs tweaking as well...

jdarwood007 · 2023-07-13T01:39:58Z

What happens for custom indexes that may not be included in this? Will the conversion fail? If it fails, I was thinking we track all indexes where we can find columns that are converted, then rebuild those indexes with a drop/create.

sbulen · 2023-07-13T02:15:54Z

Hmmm... Custom indexes might be out there. For that matter, custom columns with custom indexes might be problematic also.

To rebuild the indexes, we'd have to save off the index build info (parse it from SHOW CREATE TABLE), tweak it, & recreate the indexes. This is feeling pretty complex, for what I expect is an edge case.

My take is that covering the known variants covers us 99.9%+ of the time. The folks who have customized their DB to that degree will need to complete the conversion manually. Kinda the price to pay for making said customizations...

Signed by Shawn Bulen, [email protected]

albertlast · 2023-07-13T05:06:05Z

Sources/DbPackages-mysql.php

@@ -226,6 +226,8 @@ function smf_db_create_table($table_name, $columns, $indexes = array(), $paramet
 	$table_query .= ') ENGINE=' . $parameters['engine'];
 	if (!empty($db_character_set) && $db_character_set == 'utf8')
 		$table_query .= ' DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci';
+	else


Maybe an esle if would be better here?

It's either 'utf8' (currently an alias of 'utf8mb3') or 'utf8mb4'. These are the only two options at this point. And as noted in the description to this PR, we are defaulting to 'utf8mb4' going forward.

jdarwood007 · 2023-07-13T21:21:02Z

Hmmm... Custom indexes might be out there. For that matter, custom columns with custom indexes might be problematic also.

I don't know about custom columns. That wouldn't affect this because we didn't convert that column to mb4.

To rebuild the indexes, we'd have to save off the index build info (parse it from SHOW CREATE TABLE), tweak it, & recreate the indexes. This is feeling pretty complex, for what I expect is an edge case.

We have database tools that can list indexes. $smcFunc['db_list_indexes'] does what you need I think.

sbulen · 2023-07-16T21:45:14Z

I'll take a peek at recreating all indexes...

It's feeling like it's getting too complex for an edge case, though.

Note the CONVERT command converts all columns, and we are stepping thru all tables.

Signed by Shawn Bulen, [email protected]

jdarwood007 · 2023-07-17T14:29:13Z

Understable it can get complex with edge cases. Which is why I wondered what happens for custom indexes. I need to setup a few to test. I suspect they would be corrupted and not work. But that converting the table should still happen? Or do they need to drop first for the conversion to occur?

sbulen · 2023-07-17T16:44:00Z

I don't think the tables or indexes would get corrupted at any point.

If there is a custom index that meets the criteria (on any column, custom or not), the conversion will error out, & processing will stop. The table will not be converted. The index will need to be removed, the process restarted, and the index re-added after completion.

I doubt we will see a lot of custom indexes that meet the exact criteria. Probably none. Again, I consider this an unlikely edge case.

Sesquipedalian · 2024-06-26T15:41:25Z

Note: This PR targets release-2.1, but it has been assigned to the milestones for SMF 3.0. The PR will need to be updated and resubmitted to target release-3.0. I am leaving it open for now so that it doesn't get forgotten.

Sesquipedalian · 2024-06-29T18:44:26Z

Further note: Much has changed in SMF 3.0, and I don't want to lay the burden for reimplementing all of this on @sbulen, so I have assigned #7938 to myself. I will use this PR as a reference, since @sbulen has already thought through all the things that need to be done.

sbulen · 2024-06-30T16:49:51Z

I think the new approach should be more along the lines of what was discussed over in #7938 than the approach used here.

Specifically, if we can ensure InnoDB and COMPACT rows, then there is no need for a 191 byte limitation on the indexes. The DB structures themselves don't need to change, the DB engine & storage traits do. The real question is the best way to do that.

I suggest that as a pre-upgrader step, outside of the normal upgrade process, so we don't end up overloading the upgrader logic as we did in 2.1. Such a pre-upgrader can also check for missing standard columns, missing default values, etc., the types of things that tend to bork the upgrader.

Also:

UTF8MB4_unicode_ci
Ripping out all the old ISO-8859 logic in various places. You should be able to grep it & only find any references in the upgrader (if that...).
Similar with regex - all the logic that chooses between ISO-8859 & utf8 needs to go (like in this PR); all regexes need the "u" going forward - all those ternaries should go...
No need for a utf8 conversion form & logic (like this PR has)

Last, & possibly most important - we should get rid of all of the entity-encoding logic for multi-byte characters altogether. All those $smcFunc calls should go - it's overly complicated & we are always chasing bugs in it.

We should be storing MB3/4 chars as chars, not entities. That's the point.

Sesquipedalian · 2024-06-30T19:17:27Z

Excellent points all around. But in this PR you did ping pretty much all of the different bits that need to be addressed. Even if we address them somewhat differently for 3.0, we still need to address all these things. So it's excellent reference material.

pr-triage bot added the PR: draft label Dec 14, 2020

sbulen added MySQL Database Charset/Encoding UTF8 & mb4 encoding related issues labels Dec 14, 2020

MissAllSunday mentioned this pull request Dec 15, 2020

utfmb4 database #5108

Closed

sbulen mentioned this pull request Feb 4, 2021

fix: Limit index size #6494

Closed

live627 added this to the The future milestone Mar 2, 2021

Sesquipedalian mentioned this pull request Jun 16, 2021

Consolidates and simplifies entity handling #6786

Open

live627 previously requested changes Jul 17, 2021

View reviewed changes

other/Settings.php Show resolved Hide resolved

Sources/ManageMaintenance.php Show resolved Hide resolved

sbulen mentioned this pull request Jul 22, 2021

Clean up utf8 parameter usage #6868

Merged

sbulen force-pushed the mysql_mb4 branch from 35155c8 to 30d52bd Compare September 24, 2021 01:39

Preserve legacy settings in case used by mods

13a6271

Signed by Shawn Bulen, [email protected]

jdarwood007 approved these changes Jul 9, 2023

View reviewed changes

Only utf8 going forward

583f69f

Signed by Shawn Bulen, [email protected]

More mb4 index changes

a9bac51

Signed by Shawn Bulen, [email protected]

albertlast reviewed Jul 13, 2023

View reviewed changes

mb_substr required here for upgrades

0250924

Signed by Shawn Bulen, [email protected]

Merge remote-tracking branch 'upstream/release-2.1' into mysql_mb4

3168adb

sbulen mentioned this pull request Nov 25, 2023

Search custom index: log_search_words not built in correct collation #7830

Open

live627 mentioned this pull request Dec 4, 2023

Use UTF8MB4 everywhere #7938

Open

3 tasks

Sesquipedalian modified the milestones: 2.1.5, 3.0 Alpha 2 Mar 16, 2024

Sesquipedalian modified the milestones: 3.0 Alpha 2, 3.0 Alpha 3 Apr 22, 2024

Mysql mb4 #6409

Are you sure you want to change the base?

Mysql mb4 #6409

Conversation

sbulen commented Dec 14, 2020 • edited Loading

albertlast commented Dec 15, 2020

Underdog-01 commented Jan 18, 2021

Underdog-01 commented Jan 18, 2021

sbulen commented Jan 18, 2021

sbulen commented Jan 18, 2021

sbulen commented Jan 18, 2021

Underdog-01 commented Jan 18, 2021

Underdog-01 commented Jan 18, 2021 • edited Loading

jdarwood007 commented Mar 21, 2021

live627 commented Mar 21, 2021 via email

albertlast commented Mar 21, 2021

sbulen commented Mar 22, 2021

jdarwood007 commented Mar 22, 2021

jdarwood007 commented Mar 22, 2021

sbulen commented Mar 23, 2021

sbulen commented Sep 24, 2021 • edited Loading

albertlast commented Sep 24, 2021

sbulen commented Sep 24, 2021

albertlast commented Sep 24, 2021

jdarwood007 left a comment

Choose a reason for hiding this comment

albertlast commented Jul 9, 2023

sbulen commented Jul 9, 2023

jdarwood007 commented Jul 9, 2023

sbulen commented Jul 12, 2023

sbulen commented Jul 12, 2023

jdarwood007 commented Jul 13, 2023

sbulen commented Jul 13, 2023

jdarwood007 commented Jul 13, 2023

sbulen commented Jul 13, 2023 • edited Loading

albertlast Jul 13, 2023

Choose a reason for hiding this comment

sbulen Jul 16, 2023

Choose a reason for hiding this comment

jdarwood007 commented Jul 13, 2023 • edited Loading

sbulen commented Jul 16, 2023

jdarwood007 commented Jul 17, 2023

sbulen commented Jul 17, 2023 • edited Loading

Sesquipedalian commented Jun 26, 2024

Sesquipedalian commented Jun 29, 2024

sbulen commented Jun 30, 2024 • edited Loading

Sesquipedalian commented Jun 30, 2024 • edited Loading

sbulen commented Dec 14, 2020 •

edited

Loading

Underdog-01 commented Jan 18, 2021 •

edited

Loading

sbulen commented Sep 24, 2021 •

edited

Loading

sbulen commented Jul 13, 2023 •

edited

Loading

jdarwood007 commented Jul 13, 2023 •

edited

Loading

sbulen commented Jul 17, 2023 •

edited

Loading

sbulen commented Jun 30, 2024 •

edited

Loading

Sesquipedalian commented Jun 30, 2024 •

edited

Loading