The script parsing Genbank format to CDS and protein sequences.
seq_parsing.pl
#!/usr/bin/perl -w use strict; use Bio::SeqIO; use Bio::Seq; use Data::Dumper; # To run this, pass in two filenames my $file = $ARGV[0]; my $ofile = $ARGV[1]; my $ofile2 = $ARGV[2]; my $input = Bio::SeqIO->new(-format => 'genbank', -file => $file); my $output = Bio::SeqIO->new(-format => 'fasta', -file => ">$ofile"); my $output_protein = Bio::SeqIO->new(-format => 'fasta', -file => ">$ofile2"); while( my $seq = $input->next_seq ) { # get all the feature from the sequence object my @features = $seq->get_SeqFeatures; foreach my $f ( @features ) { # test if the feature TYPE is 'CDS' #print "feature object is $f\n"; #print Dumper($f), "\n"; if( $f->primary_tag eq 'CDS' ) { my @tags = $f->get_all_tags; print "the tags are @tags\n"; my @namelist = $f->get_tag_values('gene'); my $name = $namelist[0]; my ($desc) = $f->get_tag_values('product'); # process location information my $location = $f->location; print "location str is ", $location->to_FTstring(), "\n"; my @sublocs = $location->each_Location; my $cds_string; for my $l ( @sublocs ) { print $l->to_FTstring(), "\n"; my $start = $l->start; my $end = $l->end; my $strand= $l->strand; my $exon_seq = $seq->subseq($start,$end); print "exon_seq is $exon_seq\n"; $cds_string .= $exon_seq; } my $cds_seq2 = $f->spliced_seq; my $cds_seq = Bio::Seq->new(-seq => $cds_string, -id => $name, -desc=> $desc); $output->write_seq($cds_seq); $output->write_seq($cds_seq2); $output_protein->write_seq($cds_seq->translate); #print $f->start, "..", $f->end, " for feature ", $name, "\n"; } } }
《“Bioperl:Genbank Seq parsing script”》 有 2 条评论
版主,请问,下面这一行是做什么用的呢?
——————
my $cds_seq2 = $f->spliced_seq;
—————–
我看了关于splice_seq的说明,可是没看明白到底什么意思?
说明文档是这么解释的:可是我就是不明白什么意思。。
————————–
Title : spliced_seq
Usage : $seq = $feature->spliced_seq()
$seq = $feature_with_remote_locations->spliced_seq($db_for_seqs)
Function: Provides a sequence of the feature which is the most
semantically “relevant” feature for this sequence. A
default implementation is provided which for simple cases
returns just the sequence, but for split cases, loops over
the split location to return the sequence. In the case of
split locations with remote locations, eg
join(AB000123:5567-5589,80..1144)
in the case when a database object is passed in, it will
attempt to retrieve the sequence from the database object,
and “Do the right thing”, however if no database object is
provided, it will generate the correct number of N’s (DNA)
or X’s (protein, though this is unlikely).
This function is deliberately “magical” attempting to
second guess what a user wants as “the” sequence for this
feature
Implementing classes are free to override this method with
their own magic if they have a better idea what the user
wants
Args : [optional] A Bio::DB::RandomAccessI compliant object
Returns : A Bio::Seq
————————–