Bioperl:Genbank Seq parsing script


The  script parsing Genbank format  to CDS and protein sequences.

seq_parsing.pl

#!/usr/bin/perl -w
use strict;

use Bio::SeqIO;
use Bio::Seq;

use Data::Dumper;
# To run this, pass in two filenames
my $file = $ARGV[0];
my $ofile = $ARGV[1];
my $ofile2 = $ARGV[2];

my $input = Bio::SeqIO->new(-format => 'genbank',
       -file   => $file);
my $output = Bio::SeqIO->new(-format => 'fasta',
        -file   => ">$ofile");
my $output_protein = Bio::SeqIO->new(-format => 'fasta',
         -file   => ">$ofile2");
while( my $seq = $input->next_seq ) {
  # get all the feature from the sequence object
 my @features = $seq->get_SeqFeatures;
 foreach my $f ( @features ) {
   # test if the feature TYPE is 'CDS'
   #print "feature object is $f\n";
   #print Dumper($f), "\n";
   if( $f->primary_tag eq 'CDS' ) {
     my @tags = $f->get_all_tags;
     print "the tags are @tags\n";
     my @namelist = $f->get_tag_values('gene');
     my $name = $namelist[0];
     my ($desc) = $f->get_tag_values('product');
     # process location information
     my $location = $f->location;
     print "location str is ", $location->to_FTstring(), "\n";
     my @sublocs = $location->each_Location;
     my $cds_string;
     for my $l ( @sublocs ) {
       print $l->to_FTstring(), "\n";
       my $start = $l->start;
       my $end   = $l->end;
       my $strand= $l->strand;
       my $exon_seq = $seq->subseq($start,$end);
       print "exon_seq is $exon_seq\n";
       $cds_string .= $exon_seq;
     }
     my $cds_seq2 = $f->spliced_seq;
     my $cds_seq = Bio::Seq->new(-seq => $cds_string,
     -id  => $name,
     -desc=> $desc);    
     $output->write_seq($cds_seq);
     $output->write_seq($cds_seq2);
     $output_protein->write_seq($cds_seq->translate);
     #print $f->start, "..", $f->end, " for feature ", $name, "\n";
   }
 }
 
}

《“Bioperl:Genbank Seq parsing script”》 有 2 条评论

  1. 版主,请问,下面这一行是做什么用的呢?
    ——————
    my $cds_seq2 = $f->spliced_seq;
    —————–
    我看了关于splice_seq的说明,可是没看明白到底什么意思?

  2. 说明文档是这么解释的:可是我就是不明白什么意思。。
    ————————–
    Title : spliced_seq

    Usage : $seq = $feature->spliced_seq()
    $seq = $feature_with_remote_locations->spliced_seq($db_for_seqs)

    Function: Provides a sequence of the feature which is the most
    semantically “relevant” feature for this sequence. A
    default implementation is provided which for simple cases
    returns just the sequence, but for split cases, loops over
    the split location to return the sequence. In the case of
    split locations with remote locations, eg

    join(AB000123:5567-5589,80..1144)

    in the case when a database object is passed in, it will
    attempt to retrieve the sequence from the database object,
    and “Do the right thing”, however if no database object is
    provided, it will generate the correct number of N’s (DNA)
    or X’s (protein, though this is unlikely).

    This function is deliberately “magical” attempting to
    second guess what a user wants as “the” sequence for this
    feature

    Implementing classes are free to override this method with
    their own magic if they have a better idea what the user
    wants

    Args : [optional] A Bio::DB::RandomAccessI compliant object
    Returns : A Bio::Seq
    ————————–