Sequence collection data models using contig alias #105

Open
tcezard opened this issue May 23, 2023 · 2 comments

Comments

@tcezard
Member

tcezard commented May 23, 2023

The current Java model in contig alias has two main entities:

  • Chromosome: represents a single sequence provided by GenBank and ENA
  • Assembly: represents a group of sequences provided by GenBank and ENA

This issue will investigate how this model can be modified to support storing and providing sequence collections for the assemblies already represented.
We need to be able to represent all 3 levels of sequence collections:

  • The top level digest S3LCyI788LE6vq89Tc_LojEcsMZRixzP
  • The compact level
{
    "sequences": "EiYgJtUfGyad7wf5atL5OG4Fkzohp2qe",
    "lengths": "5K4odB173rjao1Cnbk5BnvLt9V7aPAa2",
    "names": "g04lKdxiYtG3dOGeUC5AdKEifw65G0Wp"
}
  • The canonical level
{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db"
  ]
}
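
The relationship between the three levels can be sketched as below. This is a minimal illustration, assuming the GA4GH-style `sha512t24u` digest (SHA-512 truncated to 24 bytes, base64url-encoded) applied to a compact, key-sorted JSON serialization. The helper names (`sha512t24u`, `canonical_json`, `to_compact`, `to_top_level`) are hypothetical, and the output has not been checked against the digests quoted above.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Truncate a SHA-512 digest to 24 bytes and base64url-encode it (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    """Serialize with sorted keys and no whitespace.

    Assumption: this approximates a canonical JSON form for the simple
    strings, lists and flat objects used here.
    """
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def to_compact(canonical: dict) -> dict:
    """Canonical level -> compact level: one digest per attribute array."""
    return {attr: sha512t24u(canonical_json(values)) for attr, values in canonical.items()}


def to_top_level(compact: dict) -> str:
    """Compact level -> top-level digest: digest of the object of attribute digests."""
    return sha512t24u(canonical_json(compact))


canonical_level = {
    "lengths": ["1216", "970", "1788"],
    "names": ["A", "B", "C"],
    "sequences": [
        "76f9f3315fa4b831e93c36cd88196480",
        "d5171e863a3d8f832f0559235987b1e5",
        "b9b1baaa7abf206f6b70cf31654172db",
    ],
}

compact_level = to_compact(canonical_level)
top_level_digest = to_top_level(compact_level)
print(compact_level)
print(top_level_digest)
```

Because each level is derived from the one below it, only the canonical level strictly needs to be stored; the other two could be recomputed or cached.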

Additional question: can we add another property to a sequence collection in this data model?

@waterflow80
Collaborator

@tcezard
The main concern for us when choosing the data model is to avoid data redundancy as much as possible, because many SeqCol objects will contain the same arrays of lengths and sequences in their canonical representation.

Storing the same data once per SeqCol object would consume a large amount of database storage.
So I suggest using two different tables for the canonical representation:

1. seqcol_lengths_and_sequences:

A table that will store the digest of the object, the sequences, and the lengths of these sequences. The table will look as follows:
Screenshot from 2023-05-24 20-20-37
Every time we want to add a new digest (for a new object), we just add a new column to that table and populate it based on the sequences (and lengths) that the object contains:
Screenshot from 2023-05-24 20-19-22
Populate it:
Screenshot from 2023-05-24 19-00-12
So we get:
Screenshot from 2023-05-24 20-17-58

2. seqcol_names:

A table that will map each digest and sequence with the correct name:
Screenshot from 2023-05-24 20-27-18
So the only redundancy will be in sequence_id (which references seq_id in the seqcol_lengths_and_sequences table). Since it is a simple integer, the extra disk space will be negligible.

For level 0 (the top-level digest) and level 1 (the compact level), we'll create one table per representation; handling these should be straightforward.
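
One possible reading of the two-table layout for the canonical level, sketched with an in-memory SQLite database. The screenshots are not reproduced here, so everything except the table names and `seq_id` / `sequence_id` is an assumption: the other column names, the `position` column used to keep the original order, and the helpers `add_collection` and `get_canonical` are hypothetical. As a simplification, membership of a sequence in a collection is derived from `seqcol_names` rather than from a per-digest column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seqcol_lengths_and_sequences (
    seq_id   INTEGER PRIMARY KEY,
    sequence TEXT NOT NULL UNIQUE,  -- digest of the sequence, stored once
    length   INTEGER NOT NULL
);
CREATE TABLE seqcol_names (
    digest      TEXT NOT NULL,      -- top-level digest of the collection
    sequence_id INTEGER NOT NULL REFERENCES seqcol_lengths_and_sequences(seq_id),
    position    INTEGER NOT NULL,   -- assumption: keeps the array order
    name        TEXT NOT NULL,
    PRIMARY KEY (digest, position)
);
""")


def add_collection(conn, digest, names, lengths, sequences):
    """Store one collection, reusing sequence rows that already exist."""
    for position, (name, length, sequence) in enumerate(zip(names, lengths, sequences)):
        conn.execute(
            "INSERT OR IGNORE INTO seqcol_lengths_and_sequences (sequence, length) VALUES (?, ?)",
            (sequence, length),
        )
        (seq_id,) = conn.execute(
            "SELECT seq_id FROM seqcol_lengths_and_sequences WHERE sequence = ?",
            (sequence,),
        ).fetchone()
        conn.execute(
            "INSERT INTO seqcol_names (digest, sequence_id, position, name) VALUES (?, ?, ?, ?)",
            (digest, seq_id, position, name),
        )


def get_canonical(conn, digest):
    """Rebuild the canonical-level arrays for one collection digest."""
    rows = conn.execute(
        "SELECT n.name, s.length, s.sequence FROM seqcol_names n "
        "JOIN seqcol_lengths_and_sequences s ON s.seq_id = n.sequence_id "
        "WHERE n.digest = ? ORDER BY n.position",
        (digest,),
    ).fetchall()
    return {
        "lengths": [row[1] for row in rows],
        "names": [row[0] for row in rows],
        "sequences": [row[2] for row in rows],
    }


sequences = [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db",
]
lengths = [1216, 970, 1788]

# Two collections sharing the same sequences under different names.
# The second digest and its names are made up for illustration.
add_collection(conn, "S3LCyI788LE6vq89Tc_LojEcsMZRixzP", ["A", "B", "C"], lengths, sequences)
add_collection(conn, "hypothetical-second-digest", ["chr1", "chr2", "chr3"], lengths, sequences)

(stored_sequences,) = conn.execute(
    "SELECT COUNT(*) FROM seqcol_lengths_and_sequences"
).fetchone()
print(stored_sequences)
print(get_canonical(conn, "hypothetical-second-digest"))
```

In this sketch the sequence and length values are stored once regardless of how many collections reference them, and each additional collection costs only one small `seqcol_names` row per sequence.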

@waterflow80
Collaborator

I've managed to populate the md5checksum column of the chromosome table with the MD5 hash of each sequence.
Screenshot from 2023-06-03 19-47-58
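
For illustration, a value for such a column could be computed roughly as follows. This is a sketch, not the code used in contig alias: the function name `md5_checksum` is hypothetical, and the normalization (removing whitespace and upper-casing before hashing) is an assumption.

```python
import hashlib


def md5_checksum(sequence: str) -> str:
    """Return the hex MD5 digest of a sequence.

    Assumption: whitespace (e.g. FASTA line breaks) is removed and the
    residues are upper-cased before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# The same sequence with different line wrapping and case yields the same checksum.
print(md5_checksum("ACGTACGT") == md5_checksum("acgt\nacgt\n"))
```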

I think it's possible to use the existing contig-alias data model to implement the sequence collection specification. The evaluation of this approach is described here.
