A proposal for Filament #130

nekrut · 2024-10-16T17:50:59Z

Filament – A large-scale structure in the universe, consisting of a network of galaxies and galaxy clusters interconnected by dark matter and gas.

Why Filament?

In the life sciences, most analyses are performed in the context of a specific organism. For example, to understand genetic variation in, say, a mosquito, you want to map reads against a particular Anopheles or Culex reference genome.

Note

There are exceptions to this "one reference" paradigm, including genome assembly analysis (where a reference does not exist) and metagenomic/metatranscriptomic studies.

As a result, in most cases analysis does not actually start in Galaxy. It starts at some external resource such as NCBI, EBI, VEuPathDB, UCSC Genome Browser, and so on. Of these, only UCSC has a direct link to Galaxy. In all other cases researchers have no way of knowing that (most of) the analyses they need can be done via Galaxy.

Filament is a lightweight application that will provide access to an arbitrary set of genomic data, allow storage of additional data that does not fit into the NCBI/EBI/UCSC frameworks, and allow invoking (Galaxy) workflows.

Design overview

The first prototype of the Filament framework will be built to satisfy the needs of two projects: BRC Analytics and VGP (GenomeArk).

This is a static site populated from data as described in #135. See below for an explanation of each page.

Organism list

The Data explorer is a searchable list of organisms currently supported by a given Filament instance. Two views should be supported:

List view
Hierarchical view (tree)

List view

This is a simple list with a search pane on the left (just like the current https://brc-analytics.org or https://explore.anvilproject.org/datasets). In this list species names are unique: if a species has multiple assemblies associated with it, the species is listed only once. Multiple assemblies will be visible in the species view.

The list contains the following elements:

Checkbox to enable multiple selection
Species name
TaxId linkable to NCBI Taxonomy
# of references = the number of genomes associated with this taxon
Tags = tags will enable fine-grained classification of taxa, for example "VEuPath", "VGP", "T2T", etc.
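The "one row per species" rule above can be sketched as follows. This is only an illustration under assumptions: the `Assembly` record, its field names, and the `asm-*` accessions are hypothetical and are not taken from the data model in #135.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Assembly:
    # Hypothetical record; field names are assumptions, not Filament's schema.
    accession: str
    species: str
    tax_id: int
    tags: set = field(default_factory=set)


def build_list_rows(assemblies):
    """Collapse assemblies into one list-view row per taxon."""
    grouped = defaultdict(list)
    for asm in assemblies:
        grouped[asm.tax_id].append(asm)

    rows = []
    for tax_id, group in grouped.items():
        rows.append({
            "species": group[0].species,
            "tax_id": tax_id,
            # "# of references" = number of genomes attached to this taxon
            "n_references": len(group),
            # A species row shows the union of the tags of its assemblies.
            "tags": sorted(set().union(*(a.tags for a in group))),
        })
    return sorted(rows, key=lambda row: row["species"])


# Made-up accessions, for illustration only.
example = [
    Assembly("asm-1", "Anopheles gambiae", 7165, {"VEuPath"}),
    Assembly("asm-2", "Anopheles gambiae", 7165, {"T2T"}),
    Assembly("asm-3", "Culex quinquefasciatus", 7176, {"VEuPath"}),
]
rows = build_list_rows(example)
```

Grouping by TaxId rather than by name keeps the list stable if the same species appears under slightly different spellings in the source data; whether that is the right key for Filament is an open design choice.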

Clicking on a species name will bring the user to the "Taxa page". If the user selects multiple species using the checkboxes, this creates a "Go to Taxa page" button, which points to the "Taxa page" as well.

Tree view

Tree view will provide a hierarchical view of all data. It can be a phylogenetic tree, a treemap, etc.

It should also support selecting either a single species or multiple species. One way to enable this is to let the user click on a species and then add that species to a "cart".
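A minimal sketch of the tree and "cart" ideas, assuming each species arrives with a root-to-species lineage. The function names and the `Cart` class are hypothetical, and a real tree view might equally be driven by a phylogeny rather than taxonomy ranks.

```python
def build_tree(lineages):
    """Nest root-to-species lineages into a dict-of-dicts tree."""
    tree = {}
    for lineage in lineages:
        node = tree
        for name in lineage:
            node = node.setdefault(name, {})
    return tree


def leaves(tree):
    """Yield the nodes without children, i.e. the species."""
    for name, children in tree.items():
        if children:
            yield from leaves(children)
        else:
            yield name


class Cart:
    """Keeps the species the user clicked, without duplicates, in click order."""

    def __init__(self):
        self._items = []

    def add(self, species):
        if species not in self._items:
            self._items.append(species)

    def items(self):
        return list(self._items)


tree = build_tree([
    ("Diptera", "Culicidae", "Anopheles", "Anopheles gambiae"),
    ("Diptera", "Culicidae", "Culex", "Culex quinquefasciatus"),
])
cart = Cart()
cart.add("Anopheles gambiae")
cart.add("Anopheles gambiae")  # a second click on the same species is ignored
```

The cart contents would then play the same role as the checkboxes in the list view: one item leads to a single-species page, several lead to the multi-species path.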

Genomes page

The Genomes page provides a detailed view of the reference genomes available within a given Filament instance. A user gets to this page from the Organism view page. If a single organism was selected on the Data Explorer page, all genomes available for this organism are listed here. If multiple organisms were selected, the page shows a list of genomes available for all selected taxa.

Each row contains the following columns:

Checkbox allowing for multiple selection
Action Buttons (see below)
Universal assembly ID (e.g., RefSeq ID)
# Scaffolds = a measure of assembly quality
N50 = another measure of assembly quality (contiguity)
Tags
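For reference, the two assembly metrics in the columns above can be derived from a list of scaffold lengths. The sketch below uses the usual definition of N50 (the length L such that scaffolds of length L or longer cover at least half of the assembly); the function names are illustrative, and Filament may simply read these values from upstream metadata instead of computing them.

```python
def n50(lengths):
    """Return the N50 of a non-empty collection of scaffold lengths."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def assembly_metrics(lengths):
    """Values for the '# Scaffolds' and 'N50' columns of one row."""
    return {"n_scaffolds": len(lengths), "n50": n50(lengths)}


metrics = assembly_metrics([80, 70, 50, 40, 30, 20])
```

In this example the total length is 290, and the two longest scaffolds (80 + 70 = 150) already cover more than half of it, so the N50 is 70.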

The Action Buttons should be a configurable set of buttons. In this example the buttons are:

"Analyze" (see below)
UCSC = Link to the UCSC Genome Browser
NCBI = Link to NCBI Datasets
EBI = Link to EBI
PDN = Link to the future Pathogen Data Network site
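One way to make the button set configurable per Filament instance is a list of label and URL-template pairs. This is a sketch under assumptions: the `ActionButton` class, the `/analyze/...` route, and the `example.org` URLs are placeholders and do not reflect the real link formats of UCSC, NCBI, or EBI.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass(frozen=True)
class ActionButton:
    label: str
    # None marks a button handled inside Filament rather than an external link.
    url_template: Optional[str] = None


# Hypothetical per-instance configuration with placeholder URLs.
BUTTONS = [
    ActionButton("Analyze"),
    ActionButton("UCSC", "https://example.org/ucsc?db={accession}"),
    ActionButton("NCBI", "https://example.org/ncbi/{accession}"),
    ActionButton("EBI", "https://example.org/ebi/{accession}"),
]


def render_buttons(accession, buttons=BUTTONS):
    """Return (label, href) pairs for one row of the Genomes page."""
    safe = quote(accession, safe="")  # percent-encode the ID for use in a URL
    links = []
    for button in buttons:
        if button.url_template is None:
            href = f"/analyze/{safe}"  # hypothetical internal route
        else:
            href = button.url_template.format(accession=safe)
        links.append((button.label, href))
    return links


links = render_buttons("asm-1")
```

An instance that does not need, say, the PDN link would simply leave it out of its list, and adding it later is a configuration change rather than a code change.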

Clicking on the "Analyze" button will bring the user to the final Filament page. This page will look different depending on whether the user selected a single species or multiple species.

Species page

This page will list the Galaxy workflows available for this reference, as well as additional files. To prototype the additional files functionality we will use current data from VGP's GenomeArk. This data includes intermediate analysis datasets, haplotype assemblies, QC metrics, JBrowse2 instances generated by post-curation workflows, etc.

Comparative page

The multiple-species analysis page will be the key component for the comparative genomics aspect of the Filament framework.

It will contain a list of the genomes selected on the previous page. We still need to think through which columns this list will have. One of the columns should indicate whether the species is included in a pre-computed multiple alignment generated by VGP, Zoonomia, or other projects.
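The alignment-membership column could be filled in along these lines. Everything here is an assumption made for illustration: the alignment names, the species names, and the row layout are placeholders, since the columns of this list are still undecided.

```python
def comparative_rows(selected_species, alignments):
    """One row per selected species, listing the alignments that include it."""
    rows = []
    for species in selected_species:
        included_in = sorted(
            name for name, members in alignments.items() if species in members
        )
        rows.append({
            "species": species,
            "in_precomputed_alignment": bool(included_in),
            "alignments": included_in,
        })
    return rows


# Placeholder data: alignment name -> set of species it contains.
alignments = {
    "alignment-A": {"Species one", "Species two"},
    "alignment-B": {"Species two"},
}
rows = comparative_rows(["Species one", "Species two", "Species three"], alignments)
```

Keeping the list of alignment names in the row, not just a yes/no flag, would let the page offer the matching pre-computed alignment directly instead of regenerating it.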

It will allow performing analyses that involve multiple species, such as alignment generation, visualization in comparative browsers (such as the one developed by CGR at NCBI), and selection analyses with tools such as HyPhy.

Smeds · 2024-10-28T17:12:24Z

@nekrut How should we present species information where we have no ready assemblies? Most of the data presented on the Genomes page will be missing. Should we have an additional page for species that have genomic data but no assemblies?

nekrut added this to BRC development tasks Oct 16, 2024

nekrut converted this from a draft issue Oct 16, 2024

nekrut mentioned this issue Oct 17, 2024

Refactor organism list page #136

Open

nekrut mentioned this issue Nov 5, 2024

Rendering filament pages using NCBI dataset API #157

Open

2 tasks

MillenniumFalconMechanic assigned MillenniumFalconMechanic and NoopDog Nov 18, 2024
