The current Java model in contig-alias has two main entities:
Chromosome: represents a single sequence provided by GenBank and ENA
Assembly: represents a group of sequences provided by GenBank and ENA
This issue will investigate how this model can be modified to support storing and providing sequence collections for the assemblies already represented.
We need to be able to represent all 3 levels of sequence collections:
The top level digest S3LCyI788LE6vq89Tc_LojEcsMZRixzP
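For reference, the 32-character digest above has the shape of a GA4GH `sha512t24u` digest: SHA-512 of the input, truncated to the first 24 bytes, then encoded as unpadded URL-safe base64. Below is a minimal sketch of that function, assuming this is the digest scheme we want to store. The class name `Sha512t24u` is hypothetical, and the sketch does not cover how a collection is canonicalised before it is digested.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of the existing contig-alias code base.
final class Sha512t24u {

    // SHA-512 of the UTF-8 bytes, keep the first 24 bytes,
    // encode as URL-safe base64 without padding (always 32 characters).
    static String digest(String text) {
        try {
            byte[] full = MessageDigest.getInstance("SHA-512")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            byte[] truncated = Arrays.copyOf(full, 24);
            return Base64.getUrlEncoder().withoutPadding().encodeToString(truncated);
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 must be provided by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(digest("ACGT"));
    }
}
```

Since 24 bytes always encode to exactly 32 base64 characters, a fixed-width column would be enough for any digest produced this way.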
@tcezard
The main concern for us when choosing the data model is to avoid data redundancy as much as possible.
Multiple SeqCol objects will contain, in their canonical representation, the same arrays of lengths and sequences.
Storing the same data once per SeqCol object would waste a lot of database capacity.
So I suggest using two different tables (for the canonical representation):
1. seqcol_lengths_and_sequences:
A table that stores the digest of the object, the sequences and the lengths of these sequences. The table will be as follows:
Every time we want to add a new digest (for a new object), we add a new column to that table and populate it based on the sequences (and lengths) that the object contains:
Populate it:
So we get:
2. seqcol_names:
A table that maps each digest and sequence to the correct name:
So the redundancy will be limited to sequence_id (which references seq_id in the seqcol_lengths_and_sequences table). Since it is a simple integer, the disk space it occupies will be negligible.
For level 0 and level 1, we'll create one table per representation; handling these will be straightforward.
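To make the redundancy argument concrete, here is a rough in-memory sketch of the two tables. It is only an illustration under a few assumptions: the class and field names are hypothetical, sequences are identified by a digest string, and membership of a sequence in a collection is recorded through the rows of the names table rather than through one column per digest.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the two proposed tables.
final class SeqColStore {

    // One row of seqcol_lengths_and_sequences.
    static final class SequenceRow {
        final int seqId;
        final String sequenceDigest;
        final long length;

        SequenceRow(int seqId, String sequenceDigest, long length) {
            this.seqId = seqId;
            this.sequenceDigest = sequenceDigest;
            this.length = length;
        }
    }

    // One row of seqcol_names: (collection digest, sequence_id, name).
    static final class NameRow {
        final String collectionDigest;
        final int sequenceId;
        final String name;

        NameRow(String collectionDigest, int sequenceId, String name) {
            this.collectionDigest = collectionDigest;
            this.sequenceId = sequenceId;
            this.name = name;
        }
    }

    private final Map<String, SequenceRow> sequencesByDigest = new HashMap<>();
    private final List<NameRow> names = new ArrayList<>();

    // Returns the seq_id, creating a row only if the sequence is not stored yet.
    int addSequence(String sequenceDigest, long length) {
        SequenceRow row = sequencesByDigest.get(sequenceDigest);
        if (row == null) {
            row = new SequenceRow(sequencesByDigest.size() + 1, sequenceDigest, length);
            sequencesByDigest.put(sequenceDigest, row);
        }
        return row.seqId;
    }

    // Registers a named sequence in a collection; only an integer reference is duplicated.
    void addToCollection(String collectionDigest, String name, String sequenceDigest, long length) {
        int seqId = addSequence(sequenceDigest, length);
        names.add(new NameRow(collectionDigest, seqId, name));
    }

    int sequenceRowCount() {
        return sequencesByDigest.size();
    }

    int nameRowCount() {
        return names.size();
    }

    public static void main(String[] args) {
        SeqColStore store = new SeqColStore();
        // The same sequence appears in two collections under different names.
        store.addToCollection("collectionA", "chr1", "digest-1", 1000L);
        store.addToCollection("collectionB", "1", "digest-1", 1000L);
        System.out.println(store.sequenceRowCount() + " sequence row(s), "
                + store.nameRowCount() + " name row(s)");
    }
}
```

With two collections naming the same sequence differently, the sequence and its length are stored once, and only the name rows grow with the number of collections.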
I've managed to populate the md5checksum column of the chromosome table with the MD5 hash of the sequence.
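For anyone wanting to reproduce the checksum, this is roughly how an MD5 hash of a sequence can be computed in Java. It is a sketch, not the code that was actually run: the class name `SequenceMd5` is hypothetical, and uppercasing the sequence before hashing is an assumption about the normalisation we want.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper, not the code used to populate the column.
final class SequenceMd5 {

    // Returns the MD5 of the uppercased sequence as 32 lowercase hex characters.
    static String md5Hex(String sequence) {
        try {
            // Assumption: sequences are normalised to uppercase before hashing.
            byte[] bytes = sequence.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.US_ASCII);
            byte[] hash = MessageDigest.getInstance("MD5").digest(bytes);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be provided by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("acgtACGT"));
    }
}
```

For real chromosomes the sequence would be streamed through `MessageDigest.update` in chunks rather than held in a single `String`.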
I think it's possible to use the existing contig-alias data model to implement the sequence collection specification. The evaluation of this approach is described here.
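As an illustration of what reusing the existing model could look like, the sketch below derives the three arrays of a sequence collection (names, lengths, sequences) from the chromosomes of one assembly. Everything here is an assumption for discussion: `Chromosome` is a simplified stand-in rather than the real contig-alias entity, and the `sequences` array carries the MD5 checksum only as a placeholder for whichever sequence digest we decide the collection should hold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these classes exist in contig-alias under these names.
final class SeqColFromAssembly {

    // Simplified stand-in for a chromosome row.
    static final class Chromosome {
        final String name;
        final long length;
        final String md5checksum;

        Chromosome(String name, long length, String md5checksum) {
            this.name = name;
            this.length = length;
            this.md5checksum = md5checksum;
        }
    }

    // The arrays of one collection, kept in the same order as the chromosomes.
    static final class Level2 {
        final List<String> names = new ArrayList<>();
        final List<Long> lengths = new ArrayList<>();
        final List<String> sequences = new ArrayList<>();
    }

    // Builds the parallel arrays from the chromosomes of one assembly.
    static Level2 fromChromosomes(List<Chromosome> chromosomes) {
        Level2 level2 = new Level2();
        for (Chromosome chromosome : chromosomes) {
            level2.names.add(chromosome.name);
            level2.lengths.add(chromosome.length);
            // Placeholder: the MD5 checksum stands in for the sequence digest.
            level2.sequences.add(chromosome.md5checksum);
        }
        return level2;
    }

    public static void main(String[] args) {
        Level2 level2 = fromChromosomes(Arrays.asList(
                new Chromosome("chr1", 1000L, "md5-of-chr1"),
                new Chromosome("chr2", 500L, "md5-of-chr2")));
        System.out.println(level2.names + " " + level2.lengths + " " + level2.sequences);
    }
}
```

The arrays are parallel, so index i in each of them refers to the same chromosome; that ordering would need to be stable for the digests of the arrays to be reproducible.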
Additional question: can we add another property to a sequence collection in this data model?