
Implementation of CRAM 3.1 codecs. #1714

Open
wants to merge 31 commits into base: master
Conversation

cmnbroad
Collaborator

@cmnbroad cmnbroad commented Sep 4, 2024

This is a squashed and rebased branch containing all of the changes from Yash's CRAM 3.1 codec branches. Replacement for #1618, #1644, #1663, and #1704.

@lbergelson lbergelson added the cram label Oct 1, 2024
@cmnbroad cmnbroad force-pushed the cn_cram_3_1_codecs branch 2 times, most recently from b92597b to bfb60ea Compare November 25, 2024 20:28
@jerryliu2005

Hello. Thank you very much for this PR! I've learned that this fix is critical for an IGV upgrade to support CRAM v3.1, which is very important to our projects. I am wondering if there's an estimated timeline for merging. Much appreciated!

@cmnbroad
Collaborator Author

@jerryliu2005 This branch is nearly done, but there are a few TODOs left, plus I want to do some more large-scale round trip testing. Those should get done over the next couple of weeks. However, this branch doesn't actually turn on CRAM 3.1 support - I have a separate, smaller branch for that which I'm using for testing. Once those are merged, we'll need to do an htsjdk release, and then IGV can upgrade. I'd hope/expect all of that to happen in the Jan/Feb timeframe.

@jerryliu2005

Hi @cmnbroad, thanks very much for the info! I will let the team know about the rough timeframe to be ready :)

@cmnbroad cmnbroad force-pushed the cn_cram_3_1_codecs branch from ce90f4d to 2b4ffd8 Compare January 7, 2025 22:30
@cmnbroad cmnbroad marked this pull request as ready for review January 22, 2025 14:55
Member

@lbergelson lbergelson left a comment


@cmnbroad I have a few very minor comments. I did not do a deep review.

One thing that I would really like to have is a clear indication of what exact commit of hts-specs the data came from. Maybe a top level README in the test data folder?

Comments on the compressors describing what sort of data they're intended for would be useful for the future. Things like FQZComp are non-obvious.

One thing I noticed that I hadn't thought about before: it looks like there is a lot of array creation and copying going on during compression/decompression. If we ever want to speed things up, that might be a good place to take a look. I can imagine many of the buffers could probably be reused.
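To illustrate the buffer-reuse idea, here is a minimal sketch of a reusable scratch buffer that is cleared and grown on demand rather than allocated per call. The class and method names are hypothetical, not part of htsjdk:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: a scratch buffer reused across repeated
// compression/decompression calls instead of allocating a fresh
// array for every block.
public final class ScratchBuffer {
    private ByteBuffer buffer = ByteBuffer.allocate(0).order(ByteOrder.LITTLE_ENDIAN);

    // Returns a cleared buffer with at least the requested capacity,
    // reallocating only when the current one is too small.
    public ByteBuffer ensureCapacity(final int required) {
        if (buffer.capacity() < required) {
            buffer = ByteBuffer.allocate(required).order(ByteOrder.LITTLE_ENDIAN);
        }
        buffer.clear();
        return buffer;
    }
}
```

A pattern like this would need care around thread safety (e.g. one scratch buffer per decompressor instance, or a ThreadLocal), since the returned buffer is shared mutable state.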

return i;
}

public static ByteBuffer encodePack(
Member


A comment describing what this method is for would be helpful

}

// returns a new LITTLE_ENDIAN ByteBuffer of size = bufferSize
//TODO: rename this to allocateLittleEndianByteBuffer
Member


did you want to rename these?
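For reference, a helper along the lines the TODO describes would be a one-liner; this is a sketch under the TODO's proposed name, not the actual htsjdk code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical rendering of the rename the TODO suggests.
public final class ByteBufferUtils {
    // Returns a new ByteBuffer of the given size, explicitly set to
    // LITTLE_ENDIAN (ByteBuffer.allocate defaults to BIG_ENDIAN).
    public static ByteBuffer allocateLittleEndianByteBuffer(final int bufferSize) {
        return ByteBuffer.allocate(bufferSize).order(ByteOrder.LITTLE_ENDIAN);
    }
}
```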

len |= model.getLength()[1].modelDecode(inBuffer, rangeCoder) << 8;
len |= model.getLength()[2].modelDecode(inBuffer, rangeCoder) << 16;
len |= model.getLength()[3].modelDecode(inBuffer, rangeCoder) << 24;
//TODO: it is entirely f'd up that fixedlen appears to be used as a flag elsewhere, but as an actual length here
Member


Entirely fucked up doesn't sound ideal. 🙈
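The length decoding in the diff above assembles a 32-bit little-endian value from four decoded bytes. A standalone illustration of just that bit-assembly, with the range-coder `modelDecode` calls replaced by a plain array for clarity:

```java
// Illustration only: combines four bytes (least-significant first)
// into one 32-bit little-endian integer, mirroring the shifts in
// the decoder above.
public final class LittleEndianLength {
    public static int decodeLength(final int[] bytes) {
        int len = bytes[0];
        len |= bytes[1] << 8;
        len |= bytes[2] << 16;
        len |= bytes[3] << 24;
        return len;
    }
}
```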


import java.nio.ByteBuffer;

public class FQZCompDecode {
Member


It would be good if each decompressor had a comment describing its reason for existence, like what sort of data it is good at compressing. For the ones that were implemented from the C or JavaScript code, a link to that code might be useful for future reference.
