A proposal for Filament #130

Open
nekrut opened this issue Oct 16, 2024 · 1 comment

nekrut commented Oct 16, 2024

Filament – A large-scale structure in the universe, consisting of a network of galaxies and galaxy clusters interconnected by dark matter and gas.

Why Filament?

In the life sciences, most analyses are performed in the context of a particular organism. For example, to understand genetic variation in, say, a mosquito, you want to map reads against a particular Anopheles or Culex reference genome.

Note

There are exceptions to this "one reference" paradigm that include genome assembly analysis (a reference does not exist), as well as metagenomic/metatranscriptomic types of studies.

As a result, in most cases analysis does not actually start in Galaxy. It starts at some external resource such as NCBI, EBI, VEuPathDB, the UCSC Genome Browser and so on. Of these, only UCSC has a direct link to Galaxy. In all other cases researchers have no way of knowing that (most of) the analyses they need can be done in Galaxy.

Filament is a lightweight application that will provide access to an arbitrary set of genomic data, allow storage of additional data that does not fit into the NCBI/EBI/UCSC frameworks, and allow invoking (Galaxy) workflows.

Design overview

The first prototype of the Filament framework will be built to satisfy the needs of two projects: BRC Analytics and VGP (GenomeArk).

filament_click_flow

This is a static site populated from data as described in #135. Each page is explained below.

Organism list

The Data Explorer is a searchable list of organisms currently supported by a given Filament instance. Two views should be supported:

  1. List view
  2. Hierarchical view (tree)

List view

The list view is a simple list with a search pane on the left (just like the current https://brc-analytics.org or https://explore.anvilproject.org/datasets). In this list species names are unique: if a species has multiple assemblies associated with it, the species is listed only once. Multiple assemblies will be visible in the species view.

image

The list contains the following elements:

  1. Checkbox to enable multiple selection
  2. Species name
  3. TaxId linkable to NCBI Taxonomy
  4. # of references = the number of genomes associated with this taxon
  5. Tags = tags will enable fine-grained classification of taxa. For example: "VEuPath", "VGP", "T2T", etc.
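
As a rough illustration of how these rows could be derived, here is a minimal sketch that collapses per-assembly records into one row per species. The `Assembly` and `OrganismRow` shapes and the placeholder accessions are assumptions made for the example (the real schema is whatever #135 defines), and the NCBI Taxonomy URL pattern is the commonly used browser link rather than something specified in this proposal.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assembly:
    """Hypothetical per-assembly record; field names are assumptions."""
    accession: str
    species: str
    tax_id: int
    tags: tuple = ()


@dataclass
class OrganismRow:
    """One list-view row: a species appears once, however many assemblies it has."""
    species: str
    tax_id: int
    taxonomy_url: str
    n_references: int = 0
    tags: set = field(default_factory=set)


def taxonomy_url(tax_id):
    # Assumed link pattern for the NCBI Taxonomy browser.
    return f"https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id={tax_id}"


def build_organism_rows(assemblies):
    """Group assemblies by TaxId, counting references and merging tags."""
    rows = {}
    for asm in assemblies:
        row = rows.get(asm.tax_id)
        if row is None:
            row = OrganismRow(asm.species, asm.tax_id, taxonomy_url(asm.tax_id))
            rows[asm.tax_id] = row
        row.n_references += 1
        row.tags.update(asm.tags)
    return sorted(rows.values(), key=lambda r: r.species)


# Placeholder accessions, for illustration only.
example_rows = build_organism_rows([
    Assembly("GCF_000000001.1", "Anopheles gambiae", 7165, ("VEuPath",)),
    Assembly("GCF_000000002.1", "Anopheles gambiae", 7165, ("T2T",)),
    Assembly("GCF_000000003.1", "Culex quinquefasciatus", 7176),
])
```

Keying on TaxId rather than on the species name keeps the row unique even if assemblies spell the name differently.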

Clicking on a species name will bring the user to the "Taxa page". If the user selects multiple species using the checkboxes, this creates a "Go to Taxa page" button, which points to the "Taxa page" as well.

Tree view

The tree view will provide a hierarchical view of all data. It can be a phylogenetic tree, a treemap, etc.

image

It should also support selecting either a single species or multiple species. One way to enable this is to let the user click on a species and then add that species to a "cart".
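
One possible shape for the tree and the cart is sketched below. It assumes each species comes with a root-to-species lineage (a simplified list of names here) and that the cart is simply a set of species names. `build_tree`, `species_under` and `Cart` are hypothetical names, not part of any existing Filament code.

```python
def build_tree(lineages):
    """Nest root-to-species lineages into a dict-of-dicts tree."""
    tree = {}
    for lineage in lineages:
        node = tree
        for name in lineage:
            node = node.setdefault(name, {})
    return tree


def species_under(node):
    """Return the leaf names (species) below a node, sorted alphabetically."""
    leaves = []
    for name, child in node.items():
        if child:
            leaves.extend(species_under(child))
        else:
            leaves.append(name)
    return sorted(leaves)


class Cart:
    """A selection of species; adding the same species twice has no effect."""

    def __init__(self):
        self._items = set()

    def add(self, species):
        self._items.add(species)

    def remove(self, species):
        self._items.discard(species)

    def items(self):
        return sorted(self._items)


# Simplified lineages, for illustration only.
tree = build_tree([
    ["Culicidae", "Anopheles", "Anopheles gambiae"],
    ["Culicidae", "Anopheles", "Anopheles stephensi"],
    ["Culicidae", "Culex", "Culex quinquefasciatus"],
])
cart = Cart()
for name in species_under(tree["Culicidae"]["Anopheles"]):
    cart.add(name)
```

With this layout, clicking an internal node could add every species below it to the cart in one step, while clicking a leaf adds a single species.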

Genomes page

The Genomes page provides a detailed view of the reference genomes available within a given Filament instance. A user gets to this page from the Organism view page. If a single organism was selected on the Data Explorer page, all genomes available for this organism are listed here. If multiple organisms were selected, the page shows a list of genomes available for all selected taxa.
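
The single and multiple selection cases can share one code path. The sketch below assumes genome records are plain dictionaries with `tax_id`, `species` and `accession` keys; these names and the accessions are placeholders chosen for the example.

```python
def genomes_for_selection(genomes, selected_tax_ids):
    """Return genomes whose taxon is selected, ordered by species, then accession."""
    selected = set(selected_tax_ids)
    hits = [g for g in genomes if g["tax_id"] in selected]
    return sorted(hits, key=lambda g: (g["species"], g["accession"]))


# Placeholder records, for illustration only.
genomes = [
    {"accession": "GCF_000000003.1", "species": "Culex quinquefasciatus", "tax_id": 7176},
    {"accession": "GCF_000000002.1", "species": "Anopheles gambiae", "tax_id": 7165},
    {"accession": "GCF_000000001.1", "species": "Anopheles gambiae", "tax_id": 7165},
]

single = genomes_for_selection(genomes, [7165])          # one organism selected
multiple = genomes_for_selection(genomes, [7165, 7176])  # several organisms selected
```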

image

Each row contains the following columns:

  1. Checkbox allowing for multiple selection
  2. Action Buttons (see below)
  3. Universal assembly ID (e.g., RefSeq ID)
  4. # Scaffolds = a measure of assembly quality
  5. N50 = another measure of assembly quality (contiguity)
  6. Tags
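
For reference, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that standard definition, using made-up scaffold lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Made-up scaffold lengths: total 290, so half is 145.
# 80 alone covers 80; 80 + 70 = 150 passes 145, so N50 is 70.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```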

The Action Buttons should be a configurable set of buttons:

image

Here the buttons are:

  1. "Analyze" (see below)
  2. UCSC = Link to the UCSC Genome Browser
  3. NCBI = Link to NCBI Datasets
  4. EBI = Link to EBI
  5. PDN = Link to the future Pathogen Data Network site
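
A configurable button set could be expressed as data rather than code. The sketch below is one way to do that: each button has a label and an optional URL template filled in with the assembly accession. The `ActionButton` type, the `/analyze/...` route, and all URL templates are assumptions made for illustration; the external link patterns should be checked against each resource before use, and PDN has no template because the site does not exist yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionButton:
    label: str
    url_template: Optional[str]  # None means there is no link target yet


# Assumed defaults; a Filament instance would supply its own list.
DEFAULT_BUTTONS = [
    ActionButton("Analyze", "/analyze/{accession}"),  # hypothetical internal route
    ActionButton("UCSC", "https://genome.ucsc.edu/h/{accession}"),
    ActionButton("NCBI", "https://www.ncbi.nlm.nih.gov/datasets/genome/{accession}/"),
    ActionButton("EBI", "https://www.ebi.ac.uk/ena/browser/view/{accession}"),
    ActionButton("PDN", None),  # future Pathogen Data Network site
]


def render_buttons(accession, buttons=DEFAULT_BUTTONS):
    """Return (label, url) pairs for one genome row; url is None if unavailable."""
    return [
        (b.label, b.url_template.format(accession=accession) if b.url_template else None)
        for b in buttons
    ]


# Placeholder accession, for illustration only.
example_buttons = render_buttons("GCF_000000001.1")
```

Keeping the list in configuration lets a BRC Analytics instance and a VGP instance show different buttons without changing the page code.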

Clicking on the "Analyze" button will bring the user to the final Filament page. This page will look different depending on whether the user selected a single species or multiple species.
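
The branching could be as simple as counting distinct species in the selection. In this sketch the page names `"species"` and `"comparative"` and the `(accession, species)` pair format are hypothetical.

```python
def analyze_page(selection):
    """Pick the final page for a selection of (accession, species) pairs.

    Several assemblies of one species still count as a single species.
    """
    species = {name for _, name in selection}
    if not species:
        raise ValueError("nothing selected")
    return "species" if len(species) == 1 else "comparative"


# Placeholder accessions, for illustration only.
one_species = analyze_page([
    ("GCF_000000001.1", "Anopheles gambiae"),
    ("GCF_000000002.1", "Anopheles gambiae"),
])
two_species = analyze_page([
    ("GCF_000000001.1", "Anopheles gambiae"),
    ("GCF_000000003.1", "Culex quinquefasciatus"),
])
```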

Species page

This page will list the Galaxy workflows available for this reference as well as additional files. For prototyping the additional-files functionality we will use current data from VGP's GenomeArk. This data includes intermediate analysis datasets, haplotype assemblies, QC metrics, JBrowse2 instances generated by post-curation workflows, etc.

image

Comparative page

The multi-species analysis page will be the key component of the comparative genomics aspect of the Filament framework.

image

It will contain a list of the genomes selected on the previous page. We still need to think through which columns this list will have. One of the columns should indicate whether the species is included in a pre-computed multiple alignment generated by VGP, Zoonomia, or other projects.
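
The alignment column could be computed as a simple membership lookup. The sketch below assumes each pre-computed alignment is available as a set of species names. The alignment names and their memberships here are invented placeholders, not real VGP or Zoonomia contents.

```python
def alignment_column(selected_species, alignments):
    """Map each selected species to the sorted names of alignments that include it."""
    return {
        species: sorted(
            name for name, members in alignments.items() if species in members
        )
        for species in selected_species
    }


# Invented memberships, for illustration only.
alignments = {
    "alignment_1": {"Species A", "Species B"},
    "alignment_2": {"Species B", "Species C"},
}
column = alignment_column(["Species A", "Species B", "Species D"], alignments)
```

An empty list for a species then signals that no pre-computed alignment covers it and that one would have to be generated.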

It will allow performing analyses that involve "multiple" species such as alignment generation, visualization in comparative browsers (such as the one developed by CGR at NCBI) and performing selection analyses with tools such as HyPhy.

@nekrut nekrut converted this from a draft issue Oct 16, 2024

Smeds commented Oct 28, 2024

@nekrut How should we present species information where we have no ready assemblies? Most of the data presented in Genomes Page will be missing. Should we have an additional page with species that have genomic data and no assemblies?
