Sequence collection data models using contig alias #105

tcezard · 2023-05-23T14:27:37Z

The current Java model in contig alias has two main entities:

Chromosome: represents a single sequence provided by GenBank and ENA
Assembly: represents a group of sequences provided by GenBank and ENA

This issue will investigate how this model can be modified to support storing and providing sequence collections for the assemblies already represented.
We need to be able to represent all 3 levels of sequence collections:

The top level digest S3LCyI788LE6vq89Tc_LojEcsMZRixzP
The compact level

{
    "sequences": "EiYgJtUfGyad7wf5atL5OG4Fkzohp2qe",
    "lengths": "5K4odB173rjao1Cnbk5BnvLt9V7aPAa2",
    "names": "g04lKdxiYtG3dOGeUC5AdKEifw65G0Wp"
}

The canonical level

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db"
  ]
}
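To make the relationship between the three levels concrete, here is a minimal Python sketch that derives the compact level and the top-level digest from the canonical level. It assumes the GA4GH sha512t24u digest (SHA-512, truncated to 24 bytes, base64url-encoded) and uses compact, key-sorted JSON as a stand-in for full JSON canonicalisation. The function names are hypothetical, and because the result depends on the exact canonicalisation rules and on whether lengths are strings or integers, it is not expected to reproduce the digests quoted above.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Assumption: compact JSON with sorted keys approximates the canonical
    # form well enough for plain ASCII strings.
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")


def to_level1(level2: dict) -> dict:
    """Compact level: one digest per attribute array of the canonical level."""
    return {attr: sha512t24u(canonical_json(values)) for attr, values in level2.items()}


def to_level0(level1: dict) -> str:
    """Top level: a single digest of the compact-level object."""
    return sha512t24u(canonical_json(level1))


level2 = {
    "lengths": ["1216", "970", "1788"],
    "names": ["A", "B", "C"],
    "sequences": [
        "76f9f3315fa4b831e93c36cd88196480",
        "d5171e863a3d8f832f0559235987b1e5",
        "b9b1baaa7abf206f6b70cf31654172db",
    ],
}
level1 = to_level1(level2)
level0 = to_level0(level1)
print(level1)
print(level0)
```

Each level is fully determined by the one below it, so only the canonical level strictly needs to be stored; the other two could be either stored or recomputed.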

Additional question: can we add another property to a sequence collection in this data model?


waterflow80 · 2023-05-24T19:38:09Z

@tcezard
The main concern when choosing the data model is to avoid data redundancy as much as possible, because multiple SeqCol objects will contain the same arrays of lengths and sequences in their canonical representation.

Storing the same data once per SeqCol object would consume a lot of database space, so I suggest using two tables for the canonical representation:

1. seqcol_lengths_and_sequences:

A table that will store the digest of the object, the sequences and the length of these sequences. So the table will be as follows:

And every time we want to add a new digest (for a new object), we just add a new column to that table and populate it based on the sequences (and lengths) that the object contains:

Populate it:

So we get:

2. seqcol_names:

A table that will map each digest and sequence with the correct name:

So the only redundancy will be in sequence_id (which references seq_id in the seqcol_lengths_and_sequences table). Since it is a simple integer, the disk space it occupies will be negligible.

For level 0 and level 1, we'll create one table per representation; handling these will be straightforward.
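The two-table proposal can be sketched with an in-memory SQLite database. This is only an illustration under assumptions: the table names, seq_id and sequence_id come from the comment, while the other column names, the membership flag stored in each per-digest column, the helper functions and the second collection's digest are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE seqcol_lengths_and_sequences (
        seq_id   INTEGER PRIMARY KEY,
        sequence TEXT NOT NULL UNIQUE,  -- sequence digest, stored once
        length   INTEGER NOT NULL
    );
    CREATE TABLE seqcol_names (
        digest      TEXT NOT NULL,      -- digest of the collection
        sequence_id INTEGER NOT NULL REFERENCES seqcol_lengths_and_sequences(seq_id),
        name        TEXT NOT NULL
    );
""")


def add_collection(conn, digest, names, lengths, sequences):
    # One column per collection digest, as proposed. The identifier is quoted
    # because digests can contain characters such as '-'.
    column = '"' + digest.replace('"', '""') + '"'
    conn.execute(
        f"ALTER TABLE seqcol_lengths_and_sequences ADD COLUMN {column} INTEGER NOT NULL DEFAULT 0"
    )
    for name, length, sequence in zip(names, lengths, sequences):
        # Insert the sequence only if it is not already stored.
        conn.execute(
            "INSERT OR IGNORE INTO seqcol_lengths_and_sequences (sequence, length) VALUES (?, ?)",
            (sequence, length),
        )
        (seq_id,) = conn.execute(
            "SELECT seq_id FROM seqcol_lengths_and_sequences WHERE sequence = ?", (sequence,)
        ).fetchone()
        conn.execute(
            f"UPDATE seqcol_lengths_and_sequences SET {column} = 1 WHERE seq_id = ?", (seq_id,)
        )
        conn.execute(
            "INSERT INTO seqcol_names (digest, sequence_id, name) VALUES (?, ?, ?)",
            (digest, seq_id, name),
        )


def canonical_level(conn, digest):
    rows = conn.execute(
        "SELECT n.name, s.length, s.sequence FROM seqcol_names AS n "
        "JOIN seqcol_lengths_and_sequences AS s ON s.seq_id = n.sequence_id "
        "WHERE n.digest = ? ORDER BY s.seq_id",
        (digest,),
    ).fetchall()
    return {
        "lengths": [r[1] for r in rows],
        "names": [r[0] for r in rows],
        "sequences": [r[2] for r in rows],
    }


seqs = [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db",
]
add_collection(conn, "S3LCyI788LE6vq89Tc_LojEcsMZRixzP", ["A", "B", "C"], [1216, 970, 1788], seqs)
# Hypothetical second collection: same sequences, different names.
add_collection(conn, "hypothetical-second-digest", ["chr1", "chr2", "chr3"], [1216, 970, 1788], seqs)
print(canonical_level(conn, "hypothetical-second-digest"))
```

In this sketch the second collection adds three rows to seqcol_names but none to seqcol_lengths_and_sequences, which is the saving the proposal is after. The sketch orders results by seq_id; how the original order of a collection would be preserved is not covered by the proposal.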

waterflow80 · 2023-06-03T20:06:36Z

I've managed to populate the md5checksum column of the chromosome table with the MD5 hash of the sequence.
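For reference, an MD5 hash of a sequence could be computed along these lines. This is a sketch, not the code actually used to populate the column: the function name is hypothetical, and stripping whitespace and upper-casing the bases before hashing is an assumption.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    # Assumption: remove whitespace/line breaks and upper-case the bases
    # so that formatting differences do not change the checksum.
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


print(sequence_md5("acgt\nACGT"))
```

The result is a 32-character hexadecimal string, the same shape as the entries in the "sequences" array of the canonical level.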

I think it's possible to use the existing contig-alias data model to implement the sequence collection specification. The evaluation of this approach is described here.

tcezard added the sequence-collections label May 23, 2023

waterflow80 mentioned this issue Jun 3, 2023

Sequence collection data models without using contig alias #106

Open
