Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT, #1076 #1089

Draft
wants to merge 10 commits into master

Conversation

paulirwin
Contributor

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Use DecoderFallback.ExceptionFallback to match Java's CodingErrorAction.REPORT

Fixes #1076

Description

This adds an extension method that makes it easy to match Java's use of CodingErrorAction.REPORT on a Charset, where an exception is thrown upon encountering invalid character sequences. All places in Lucene where this is used have been updated to use the extension method.

This should improve reliability by ensuring valid input in cases where Lucene would throw on invalid input, particularly with IOUtils.GetDecodingReader, which, per its docs, is intended to throw on invalid input but was previously performing character replacement in Lucene.NET.
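
For illustration, a minimal sketch of what such an extension method could look like (the method name here is hypothetical; see EncodingExtensions.cs in this PR for the actual implementation):

    using System.Text;

    internal static class EncodingExtensions
    {
        // Returns a copy of the encoding whose decoder throws a DecoderFallbackException
        // on invalid byte sequences, mirroring Java's CodingErrorAction.REPORT.
        public static Encoding WithDecoderExceptionFallback(this Encoding encoding)
        {
            // The built-in Encoding instances are read-only, so clone before configuring.
            Encoding clone = (Encoding)encoding.Clone();
            clone.DecoderFallback = DecoderFallback.ExceptionFallback;
            return clone;
        }
    }

A caller would then write something like Encoding.UTF8.WithDecoderExceptionFallback() and get an exception instead of a silent '\uFFFD' replacement.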

@paulirwin paulirwin added the notes:improvement An enhancement to an existing feature label Jan 6, 2025
@paulirwin paulirwin requested a review from NightOwl888 January 6, 2025 19:36
paulirwin and others added 3 commits January 6, 2025 16:50
…arget frameworks that support System.Text.Unicode.Utf8. Added tests to verify fallback is working.
Contributor

@NightOwl888 NightOwl888 left a comment

Thanks for the PR.

I went ahead and pushed an optimization to the Term.ToString() method, since System.Text.Unicode.Utf8 supports fallback without an exception.

I also discovered that the default fallback character for encodings in .NET doesn't match that in Java, so we need to adjust the value set to IOUtils.ENCODING_UTF_8_NO_BOM to:

    private static UTF8Encoding CreateUtf8Encoding()
    {
        // Create a UTF8Encoding instance
        UTF8Encoding encoding = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false, // No BOM, similar to Java
            throwOnInvalidBytes: false             // Match Java's silent handling of invalid bytes
        );

        // Set the DecoderFallback to use '\uFFFD' as the replacement character
        encoding.DecoderFallback = new DecoderReplacementFallback("\uFFFD");

        // Set the EncoderFallback to use '\uFFFD' for invalid character encoding
        encoding.EncoderFallback = new EncoderReplacementFallback("\uFFFD");

        return encoding;
    }

For the Encoding-derived classes, the default fallback character is ?, but in Java it is \uFFFD. The new System.Text.Unicode.Utf8 class uses the same default fallback character as in Java.

Note that the fallback character has been defined in other places in the codebase as ?, as well. Those should be reviewed.

Since this PR is encoding-related, I think it would be fine to fix those here, but feel free to open separate issues if you'd prefer to keep them separate.

src/Lucene.Net/Support/Text/EncodingExtensions.cs (outdated, resolved)
@paulirwin
Contributor Author

@NightOwl888 In regard to:

For the Encoding-derived classes, the default fallback character is ?, but in Java it is \uFFFD. The new System.Text.Unicode.Utf8 class uses the same default fallback character as in Java.

This is only true for Encoding.ASCII (at least amongst those defined on Encoding). Encoding.UTF8 by default returns \uFFFD (from csharprepl):

> Encoding.ASCII.GetString(new byte[] { 0xc3 })
"?"
> Encoding.UTF8.GetString(new byte[] { 0xc3 })
"�"
> Encoding.UTF8.GetString(new byte[] { 0xc3 })[0] == '\ufffd'
true

I confirmed this is the case on net462 as well.

The only place in non-test code where Encoding.ASCII is used is in ConnectionCostsBuilder, and that doesn't matter because this PR changes it to throw on invalid characters anyway.

Given that we use UTF-8 everywhere else, I don't think we need to make that change.

@NightOwl888 NightOwl888 self-requested a review January 10, 2025 18:30
Contributor

@NightOwl888 NightOwl888 left a comment

I did a more exhaustive review and I found some additional issues that we should address.

Exception Handling

We need to fix the ExceptionExtensions class so that any existing catch blocks account for the differences in encoding fallback exceptions between Java and .NET; see the sketch after the list below.

I already checked where we are catching ArgumentException explicitly, and I think we are good there.

  • IsIOException() should be changed to include both DecoderFallbackException and EncoderFallbackException.
  • IsIllegalArgumentException() should be changed to exclude both DecoderFallbackException and EncoderFallbackException.
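
A simplified sketch of what those two checks could look like (the real ExceptionExtensions methods handle more exception types; this only illustrates the fallback-exception handling described above):

    using System;
    using System.Text;

    internal static class ExceptionExtensionsSketch
    {
        // Java's CharacterCodingException (thrown for malformed/unmappable input)
        // extends IOException, so the .NET fallback exceptions should count as I/O exceptions.
        public static bool IsIOException(this Exception e) =>
            e is System.IO.IOException
            || e is DecoderFallbackException
            || e is EncoderFallbackException;

        // Both fallback exceptions derive from ArgumentException in .NET, so they must be
        // excluded here to match Java's IllegalArgumentException semantics.
        public static bool IsIllegalArgumentException(this Exception e) =>
            e is ArgumentException
            && !(e is DecoderFallbackException)
            && !(e is EncoderFallbackException);
    }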

Explicit Encoding Char Replacement

As I previously mentioned, there are a few places where we are using ? as the replacement character instead of \uFFFD. I located them and they are all in tests.

  • // This is essentially the equivalent of
    // CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    //     .onUnmappableCharacter(CodingErrorAction.REPLACE)
    //     .onMalformedInput(CodingErrorAction.REPLACE);
    //
    // Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    //     new EncoderReplacementFallback("?"),
    //     new DecoderReplacementFallback("?"));
    for (int i = 0; i < n; i++)
    {
        Random.NextBytes(buffer);
        int size = 1 + Random.Next(50);
        // This test is turning random bytes into a string,
        // this is asking for trouble.
        Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        string s = decoder.GetString(buffer, 0, size);
        array.Append(s);
        builder.Append(s);
    }
    for (int i = 0; i < n; i++)
    {
        Random.NextBytes(buffer);
        int size = 1 + Random.Next(50);
        // This test is turning random bytes into a string,
        // this is asking for trouble.
        Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        string s = decoder.GetString(buffer, 0, size);
        array.Append(s);
        builder.Append(s);
    }
    for (int i = 0; i < n; i++)
    {
        Random.NextBytes(buffer);
        int size = 1 + Random.Next(50);
        // This test is turning random bytes into a string,
        // this is asking for trouble.
        Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
  • // This is essentially the equivalent of
    // CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    //     .onUnmappableCharacter(CodingErrorAction.REPLACE)
    //     .onMalformedInput(CodingErrorAction.REPLACE);
    //
    // Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
    //     new EncoderReplacementFallback("?"),
    //     new DecoderReplacementFallback("?"));
    Random random = Random;
    for (int i = 0; i < numUniqueValues;)
    {
        random.NextBytes(buffer);
        int size = 1 + random.Next(buffer.Length);
        // This test is turning random bytes into a string,
        // this is asking for trouble.
        Encoding decoder = Encoding.GetEncoding(Encoding.UTF8.CodePage,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));

While technically they are fine because they are in tests, during porting it is common to search the codebase to see how something was done elsewhere, and leaving these as-is could cause the pattern to be copied incorrectly.

So, let's change those tests to use StandardCharsets.UTF_8 and add a comment to indicate that configuring CodingErrorAction.REPLACE explicitly is unnecessary because it is the default behavior of StandardCharsets.UTF_8 (in .NET, anyway).

Default Encoding

During review, I noticed that we have translated Charset.defaultCharset() to Encoding.GetEncoding(0). It would be better to change this to Encoding.Default. We have comments to indicate Encoding.GetEncoding(0) represents the system default in most cases, but using the property would make it clearer.

ChatGPT recommends using Encoding.Default instead of Encoding.GetEncoding(0) because apparently the latter will throw an exception in some environments on .NET Core. I tried on my system using net8.0, but it didn't throw. However, the property definitely won't throw even if ChatGPT is wrong about this.
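
For illustration, the substitution being suggested (variable names are hypothetical):

    // Ported literally from Charset.defaultCharset():
    Encoding systemDefault = Encoding.GetEncoding(0);

    // Suggested instead: same intent, clearer, and the property won't throw.
    Encoding systemDefault2 = Encoding.Default;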

Default Encoding with System.Console

Note that there is one place where the default encoding wasn't ported correctly.

This line can be changed to:

using TextWriter logger = new StreamWriter(Console.OpenStandardOutput(), Encoding.Default) { AutoFlush = true };

ISO 8859-14 Encoding

This encoding (only the decoder) was ported from Hunspell in Lucene because neither .NET nor Java has an implementation.

Caching

Unlike the other encodings, it is not cached in .NET. Let's turn ISO8859_14Encoding into a singleton by adding a static Default property.
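
A sketch of the suggested addition, as a fragment inside the existing ISO8859_14Encoding class (field and property shape illustrative):

    // Inside ISO8859_14Encoding:
    private static readonly ISO8859_14Encoding @default = new ISO8859_14Encoding();

    /// <summary>
    /// Cached singleton instance, so callers don't construct a new instance
    /// (and its decoding table) every time the encoding is needed.
    /// </summary>
    public static ISO8859_14Encoding Default => @default;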

Fallback Behavior

I ran some tests in Java and found that this neither throws on nor replaces invalid input characters; it simply leaves them as-is. I checked the iso-8859-1 charset (which the Lucene implementation of iso-8859-14 extends), and it behaves the same way by default in Java.

But since the user is supplying dictionaries (usually standardized ones downloaded from websites that gather them), it makes me wonder whether Lucene should be throwing on invalid input characters. Certainly, leaving them as-is is not a typical option in a .NET Encoding class. Maybe we should report this upstream to find out whether this was intentional or an oversight.

NOTE: Hunspell was among the most complicated parts to port in the Lucene.Net.Analysis.Common package, so it might be worth the extra effort to at least make sure users aren't feeding it invalid input.

Also, I downloaded all of the Hunspell dictionaries that Lucene uses in tests and put them in a repository here: https://github.com/NightOwl888/lucenenet-hunspell. These are not part of the main repository because it is a huge amount of binary data. The tests currently must be run manually after downloading the files. IIRC, there are still a few dictionaries that don't pass the tests in Lucene.NET, but I don't see that as a showstopper for the release because most of them work.

GB2312 Encoding

Encoding.GetEncoding("GB2312") is being called inside of a tight loop in multiple places in Lucene.Net.Analysis.SmartCn. While the instance returned from Encoding.GetEncoding("GB2312") is cached by the encoding provider, it still needs to do a dictionary lookup every time it is called. Let's change it to statically cache this instance in AbstractDictionary so the dictionary lookup only occurs once.

src/Lucene.Net/Support/Text/EncodingExtensions.cs (outdated, resolved)
@paulirwin paulirwin marked this pull request as draft January 11, 2025 04:06
@paulirwin
Contributor Author

The change to the tests exposed something else to fix: Java's standard Charsets will throw on malformed input and unmappable characters by default, meaning CodingErrorAction.REPORT is already the default. In .NET, the default encodings replace instead of throwing, but our StandardCharsets.UTF_8/IOUtils.ENCODING_UTF_8_NO_BOM is already set up to match Java's behavior. That means we do need to explicitly use DecoderReplacementFallback for the tests mentioned above in "Explicit Encoding Char Replacement" if we use StandardCharsets.UTF_8, and we will likely need another extension method.

More importantly, this means that any place in Lucene that uses a Charset without specifying CodingErrorAction.REPLACE needs to use an encoding that throws (such as our StandardCharsets.UTF_8 support value); otherwise we are not throwing on invalid byte sequences where Lucene does (e.g., if we use Encoding.UTF8). I'll get this change made tomorrow.
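
A minimal sketch of the kind of additional extension method this might call for (name hypothetical), giving the tests a convenient way to opt back into Java's CodingErrorAction.REPLACE behavior:

    // A possible companion extension method (fragment), matching Java's CodingErrorAction.REPLACE:
    public static Encoding WithReplacementFallback(this Encoding encoding)
    {
        // Clone so the shared Encoding instance isn't mutated.
        Encoding clone = (Encoding)encoding.Clone();
        clone.DecoderFallback = new DecoderReplacementFallback("\uFFFD");
        clone.EncoderFallback = new EncoderReplacementFallback("\uFFFD");
        return clone;
    }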

@NightOwl888
Contributor

It appears that there is a difference between the default when using:

  • StandardCharsets.UTF_8
  • StandardCharsets.UTF_8.newDecoder()

The default behavior of the latter is CodingErrorAction.REPORT, but the same is not true when using the former (which is how Lucene normally passes it to other APIs).

  @Test
  public void testDefaultUTF8DecoderWithInvalidBytes() {
      // Define a mix of valid and invalid UTF-8 bytes
      byte[] testBytes = {
              (byte) 0xE2, (byte) 0x82, (byte) 0xAC, // Valid UTF-8 for €
              (byte) 0xC3, (byte) 0xA9,             // Valid UTF-8 for é
              (byte) 0xFF,                          // Invalid byte
              (byte) 0x41,                          // Valid ASCII for 'A'
              (byte) 0x80,                          // Invalid byte
              (byte) 0xF0, (byte) 0x9F, (byte) 0x92, (byte) 0x96 // Valid UTF-8 for 💖
      };

      // Decode the bytes using the default UTF-8 behavior
      String decodedString = new String(testBytes, StandardCharsets.UTF_8);

      // Print each character and its code point
      System.out.println("Decoded String Code Points: ");
      for (int i = 0; i < decodedString.length(); i++) {
          System.out.printf("Character at position %d: %d%n", i, (int) decodedString.charAt(i));
      }
      
      // Verify the decoded string (replacement character appears at invalid bytes)
      assertEquals("€é\uFFFDA\uFFFD💖", decodedString);
  }

Labels
notes:improvement An enhancement to an existing feature
Development

Successfully merging this pull request may close these issues.

Review use of Encoding fallback handling
2 participants