Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem while trying to filter gtf file by longest transcript #71

Open
eduardopdev opened this issue Jul 7, 2020 · 3 comments
Open

Comments

@eduardopdev
Copy link

First of all, i downloaded Araport11 from www.arabidopsis.org
Then I used cgat-apps to generate a gtf file out of the file i downloaded
$ cat Araport11_GFF3_genes_transposons.201606.gff | cgat gff32gtf > Araport.gtf
and had no problems.

I saw that input gtf files must be sorted. so i sorted Araport.gtf:
cat Araport.gtf | cgat gtf2gtf > Araport_sort.gtf --method=sort --sort-order=gene+transcript

and then, I filtered the sorted file by longest_transcript:
cat Araport_sort.gtf | cgat gtf2gtf > Araport_longest.gtf --method=filter --filter-method=longest-transcript

the problem is that gtf2gtf is not always selecting the transcript with the bigger length. On gene AT1G01030, for example, the sum of the length of all exons in transcript AT1G01030.1 is bigger than that of AT1G01030.2, but on Araport_longest.gtf the AT1G01030.2 transcript is selected.

Here is the AT1G01030 gene in the original Araport_sort.gtf:

Chr1 Araport11 three_prime_UTR 11649 11863 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:three_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:three_prime_UTR:1";
Chr1 Araport11 exon 11649 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:exon:3"; Parent "AT1G01030.1"; Name "NGA3:exon:3";
Chr1 Araport11 CDS 11864 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:CDS:2"; Parent "AT1G01030.1"; Name "NGA3:CDS:2";
Chr1 Araport11 five_prime_UTR 12941 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:five_prime_UTR:2"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:2";
Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1";
Chr1 Araport11 five_prime_UTR 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:five_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:1";
Chr1 Araport11 exon 11649 12354 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:4"; Parent "AT1G01030.2"; Name "NGA3:exon:4";
Chr1 Araport11 three_prime_UTR 11649 11863 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:three_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:three_prime_UTR:1";
Chr1 Araport11 CDS 11864 12354 . - 2 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:3"; Parent "AT1G01030.2"; Name "NGA3:CDS:3";
Chr1 Araport11 CDS 12424 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:1"; Parent "AT1G01030.2"; Name "NGA3:CDS:1";
Chr1 Araport11 exon 12424 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:2"; Parent "AT1G01030.2"; Name "NGA3:exon:2";
Chr1 Araport11 five_prime_UTR 12941 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:five_prime_UTR:2"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:2";
Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1";
Chr1 Araport11 five_prime_UTR 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:five_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:1";

And here is the same gene on the Araport_longest.gtf:

Chr1 Araport11 exon 11649 12354 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:4"; Parent "AT1G01030.2"; Name "NGA3:exon:4";
Chr1 Araport11 CDS 11864 12354 . - 2 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:3"; Parent "AT1G01030.2"; Name "NGA3:CDS:3";
Chr1 Araport11 CDS 12424 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:1"; Parent "AT1G01030.2"; Name "NGA3:CDS:1";
Chr1 Araport11 exon 12424 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:2"; Parent "AT1G01030.2"; Name "NGA3:exon:2";
Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1";

@Acribbs
Copy link
Contributor

Acribbs commented Jul 8, 2020

Indeed, I can see that the wrong transcript is being picked up. I will look into this further, I suspect that its to do with this bit of code:

elif args.filter_method in ("longest-transcript",

Things are quite busy at the moment, but when I have time I will use this gene and set of transcripts to debug further.

@jarobin
Copy link

jarobin commented Jul 29, 2020

Hello, I am wondering if there is an update on this issue, as I am also interested in extracting the longest transcript based on summed exon length.

@Acribbs
Copy link
Contributor

Acribbs commented Jul 29, 2020

Think this is fixed with the pull request here: #72

Haven't tested it extensively though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants