Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089
base: master
Conversation
…fy returning arrays that might be null
…arget frameworks that support System.Text.Unicode.Utf8. Added tests to verify fallback is working.
Thanks for the PR.
I went ahead and pushed an optimization to the Term.ToString() method, since System.Text.Unicode.Utf8 supports fallback without an exception.
I also discovered that the default fallback character for encodings in .NET doesn't match that in Java, so we need to adjust the value set to IOUtils.ENCODING_UTF_8_NO_BOM to:
private static UTF8Encoding CreateUtf8Encoding()
{
    // Create a UTF8Encoding instance
    UTF8Encoding encoding = new UTF8Encoding(
        encoderShouldEmitUTF8Identifier: false, // No BOM, similar to Java
        throwOnInvalidBytes: false              // Match Java's silent handling of invalid bytes
    );

    // Set the DecoderFallback to use '\uFFFD' as the replacement character
    encoding.DecoderFallback = new DecoderReplacementFallback("\uFFFD");

    // Set the EncoderFallback to use '\uFFFD' for invalid character encoding
    encoding.EncoderFallback = new EncoderReplacementFallback("\uFFFD");

    return encoding;
}
For the Encoding-derived classes, the default fallback character is ?, but in Java it is \uFFFD. The new System.Text.Unicode.Utf8 class uses the same default fallback character as in Java.
Note that the fallback character has been defined in other places in the codebase as ?, as well. Those should be reviewed.
Since this PR is encoding-related, I think it would be fine to put them here, but feel free to open separate issues if you want to keep them separate.
@NightOwl888 In regards to:
This is only true for
I confirmed this is the case on net462 as well. The only place in non-test code where … Given that the rest of the places we use UTF8, I don't think we need to make that change.
I did a more exhaustive review and I found some additional issues that we should address.
Exception Handling
We need to fix the ExceptionExtensions class so that any existing catch blocks account for the differences in encoding fallback exception types between Java and .NET:
- In Java, CharacterCodingException subclasses IOException.
- In .NET, EncoderFallbackException and DecoderFallbackException subclass ArgumentException.
I already checked where we are catching ArgumentException explicitly, and I think we are good there.
- IsIOException() should be changed to include both DecoderFallbackException and EncoderFallbackException.
- IsIllegalArgumentException() should be changed to exclude both DecoderFallbackException and EncoderFallbackException (see the sketch below).
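To make the intent concrete, here is a minimal sketch of what the adjusted checks could look like. This is not the actual ExceptionExtensions source (the real methods cover additional exception types); it only shows the two changes described above.

using System;
using System.IO;
using System.Text;

internal static class ExceptionExtensions
{
    // Java's CharacterCodingException subclasses IOException, so treat the
    // .NET fallback exceptions as I/O exceptions for compatibility.
    public static bool IsIOException(this Exception e)
        => e is IOException
            || e is DecoderFallbackException
            || e is EncoderFallbackException;

    // Conversely, exclude the fallback exceptions here even though they
    // derive from ArgumentException in .NET.
    public static bool IsIllegalArgumentException(this Exception e)
        => e is ArgumentException
            && !(e is DecoderFallbackException)
            && !(e is EncoderFallbackException);
}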
Explicit Encoding Char Replacement
As I previously mentioned, there are a few places where we are using ? as the replacement character instead of \uFFFD. I located them and they are all in tests.
lucenenet/src/Lucene.Net.Tests.Facet/Taxonomy/WriterCache/TestCharBlockArray.cs
Lines 43 to 88 in c3b6896

// This is essentially the equivalent of
// CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
//     .onUnmappableCharacter(CodingErrorAction.REPLACE)
//     .onMalformedInput(CodingErrorAction.REPLACE);
//
// Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
//     new EncoderReplacementFallback("?"),
//     new DecoderReplacementFallback("?"));
for (int i = 0; i < n; i++)
{
    Random.NextBytes(buffer);
    int size = 1 + Random.Next(50);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
        new EncoderReplacementFallback("?"),
        new DecoderReplacementFallback("?"));
    string s = decoder.GetString(buffer, 0, size);
    array.Append(s);
    builder.Append(s);
}

for (int i = 0; i < n; i++)
{
    Random.NextBytes(buffer);
    int size = 1 + Random.Next(50);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
        new EncoderReplacementFallback("?"),
        new DecoderReplacementFallback("?"));
    string s = decoder.GetString(buffer, 0, size);
    array.Append(s);
    builder.Append(s);
}

for (int i = 0; i < n; i++)
{
    Random.NextBytes(buffer);
    int size = 1 + Random.Next(50);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
        new EncoderReplacementFallback("?"),
        new DecoderReplacementFallback("?"));

lucenenet/src/Lucene.Net.Tests.Facet/Taxonomy/WriterCache/TestCompactLabelToOrdinal.cs
Lines 46 to 65 in c3b6896

// This is essentially the equivalent of
// CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
//     .onUnmappableCharacter(CodingErrorAction.REPLACE)
//     .onMalformedInput(CodingErrorAction.REPLACE);
//
// Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
//     new EncoderReplacementFallback("?"),
//     new DecoderReplacementFallback("?"));
Random random = Random;
for (int i = 0; i < numUniqueValues;)
{
    random.NextBytes(buffer);
    int size = 1 + random.Next(buffer.Length);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
        new EncoderReplacementFallback("?"),
        new DecoderReplacementFallback("?"));
While technically they are fine because they are in tests, a common step when porting is to search the codebase to see how something was done elsewhere, and leaving these as-is could cause the pattern to be copied incorrectly.
So, let's change those tests to use StandardCharsets.UTF_8 and add a comment to indicate that configuring CodingErrorAction.REPLACE explicitly is unnecessary because it is the default behavior of StandardCharsets.UTF_8 (in .NET, anyway).
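For illustration, a hedged sketch of what one of the updated test loops might look like, assuming StandardCharsets.UTF_8 in this codebase resolves to a System.Text.Encoding (the buffer, n, array, and builder variables come from the existing test):

for (int i = 0; i < n; i++)
{
    Random.NextBytes(buffer);
    int size = 1 + Random.Next(50);
    // This test is turning random bytes into a string,
    // this is asking for trouble.
    // NOTE: Explicitly configuring CodingErrorAction.REPLACE (as the Java
    // test does) is unnecessary here because replacement is already the
    // default decoder behavior of StandardCharsets.UTF_8 in .NET.
    string s = StandardCharsets.UTF_8.GetString(buffer, 0, size);
    array.Append(s);
    builder.Append(s);
}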
Default Encoding
During review, I noticed that we have translated Charset.defaultCharset() to Encoding.GetEncoding(0). It would be better to change this to Encoding.Default. We have comments to indicate Encoding.GetEncoding(0) represents the system default in most cases, but using the property would make it clearer.
ChatGPT recommends using Encoding.Default instead of Encoding.GetEncoding(0) because apparently the latter will throw an exception in some environments on .NET Core. I tried on my system using net8.0, but it didn't throw. However, the property definitely won't throw even if ChatGPT is wrong about this.
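In other words, the suggested translation is simply (illustrative only):

using System.Text;

// Java: Charset.defaultCharset()
Encoding systemDefault = Encoding.Default;           // states the intent directly and won't throw
// instead of:
// Encoding systemDefault = Encoding.GetEncoding(0); // code page 0 = "system default", but less obvious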
Default Encoding with System.Console
Note that there is one place where the default encoding wasn't ported correctly.
TextWriter logger = Console.Out; //new StreamWriter(Console, Encoding.GetEncoding(0));
This line can be changed to:
using TextWriter logger = new StreamWriter(Console.OpenStandardOutput(), Encoding.Default) { AutoFlush = true };
ISO 8859-14 Encoding
This encoding (only the decoder) was ported from Hunspell in Lucene because neither .NET nor Java has an implementation.
Caching
Unlike the other encodings, it is not cached in .NET. Let's turn ISO8859_14Encoding into a singleton by adding a static Default property.
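A minimal sketch of the suggested singleton (member names are illustrative; the real class derives from System.Text.Encoding and contains the ported decoder):

public class ISO8859_14Encoding // : Encoding (existing base class)
{
    private static readonly ISO8859_14Encoding defaultInstance = new ISO8859_14Encoding();

    // Shared, cached instance so callers reuse one decoder rather than
    // constructing a new one each time a Hunspell dictionary declares ISO-8859-14.
    // Because Encoding already exposes a static Default property, the real
    // implementation may need the 'new' modifier to hide it.
    public static ISO8859_14Encoding Default => defaultInstance;
}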
Fallback Behavior
I ran some tests in Java and found that this decoder neither throws on nor replaces invalid input characters. Instead, it simply leaves them as-is. I checked the iso-8859-1 charset (which the Lucene implementation of iso-8859-14 extends) and it does the same thing by default in Java.
But since the user is supplying dictionaries (usually standardized ones that are downloaded from websites that gather them) it makes me wonder whether Lucene should be throwing on invalid input characters. Certainly, leaving them as-is is not a typical option in a .NET Encoding class. Maybe we should report this upstream to find out whether this was intentional or an oversight.
NOTE: Hunspell was among the most complicated parts to port in the Lucene.Net.Analysis.Common package, so it might be worth the extra effort to at least make sure users aren't feeding it invalid input.
Also, I downloaded all of the Hunspell dictionaries that Lucene uses in tests and put them in a repository here: https://github.com/NightOwl888/lucenenet-hunspell. These are not part of the main repository because it is a huge amount of binary data. The tests currently must be run manually after downloading the files. IIRC, there are still a few dictionaries that don't pass the tests in Lucene.NET, but I don't see that as a showstopper for the release because most of them work.
GB2312 Encoding
Encoding.GetEncoding("GB2312") is being called inside of a tight loop in multiple places in Lucene.Net.Analysis.SmartCn. While the instance returned from Encoding.GetEncoding("GB2312") is cached by the encoding provider, it still needs to do a dictionary lookup every time it is called. Let's change it to statically cache this instance in AbstractDictionary so the dictionary lookup only occurs once.
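A hedged sketch of that change (field and property names are illustrative; since the existing code already calls Encoding.GetEncoding("GB2312"), the required code-pages encoding provider is presumably registered elsewhere):

using System.Text;

internal abstract class AbstractDictionary
{
    // Resolve the encoding once per process; Encoding.GetEncoding performs a
    // provider/dictionary lookup on every call, which adds up inside tight loops.
    private static readonly Encoding gb2312Encoding = Encoding.GetEncoding("GB2312");

    protected static Encoding GB2312Encoding => gb2312Encoding;
}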
The change to the tests exposed something else to fix: Java's standard Charsets will throw on malformed input and unmappable characters by default, meaning …
It appears that there is a difference between the default behavior when using Charset.newDecoder() and when passing the Charset directly to new String(byte[], Charset). The default behavior of the latter is to replace malformed input with \uFFFD, as this test demonstrates:
@Test
public void testDefaultUTF8DecoderWithInvalidBytes() {
    // Define a mix of valid and invalid UTF-8 bytes
    byte[] testBytes = {
        (byte) 0xE2, (byte) 0x82, (byte) 0xAC,             // Valid UTF-8 for €
        (byte) 0xC3, (byte) 0xA9,                          // Valid UTF-8 for é
        (byte) 0xFF,                                       // Invalid byte
        (byte) 0x41,                                       // Valid ASCII for 'A'
        (byte) 0x80,                                       // Invalid byte
        (byte) 0xF0, (byte) 0x9F, (byte) 0x92, (byte) 0x96 // Valid UTF-8 for 💖
    };

    // Decode the bytes using the default UTF-8 behavior
    String decodedString = new String(testBytes, StandardCharsets.UTF_8);

    // Print each character and its code point
    System.out.println("Decoded String Code Points: ");
    for (int i = 0; i < decodedString.length(); i++) {
        System.out.printf("Character at position %d: %d%n", i, (int) decodedString.charAt(i));
    }

    // Verify the decoded string (replacement character appears at invalid bytes)
    assertEquals("€é\uFFFDA\uFFFD💖", decodedString);
}
Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT
Fixes #1076
Description
This adds an extension method to make it easy to match Java's CodingErrorAction.REPORT use in Charset, where it will throw an exception upon encountering invalid character sequences. All places where this is used in Lucene have been updated to use this extension method.
This should improve the reliability of ensuring valid input in cases where Lucene would throw on invalid input, particularly with IOUtils.GetDecodingReader, which, per its docs, was intended to throw on invalid input but was previously doing a character replacement in Lucene.NET.
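For readers who haven't looked at the diff, a hedged sketch of the idea behind such an extension method (the actual name and signature in this PR may differ):

using System.Text;

public static class EncodingExtensions
{
    // Mirrors Java's CodingErrorAction.REPORT for decoding: clone the encoding
    // (built-in instances such as Encoding.UTF8 are read-only) and configure it
    // to throw a DecoderFallbackException on invalid byte sequences.
    public static Encoding WithDecoderExceptionFallback(this Encoding encoding)
    {
        Encoding clone = (Encoding)encoding.Clone();
        clone.DecoderFallback = DecoderFallback.ExceptionFallback;
        return clone;
    }
}

A caller such as IOUtils.GetDecodingReader could then, for example, build its reader over Encoding.UTF8.WithDecoderExceptionFallback() so malformed input surfaces as an exception instead of being silently replaced.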