Bioperl:Genbank Seq parsing script
The script parsing Genbank format to CDS and protein sequences.
seq_parsing.pl
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
use Bio::Seq;
use Data::Dumper;
# To run this, pass in two filenames
my $file = $ARGV[0];
my $ofile = $ARGV[1];
my $ofile2 = $ARGV[2];
my $input = Bio::SeqIO->new(-format => 'genbank',
-file => $file);
my $output = Bio::SeqIO->new(-format => 'fasta',
-file => ">$ofile");
my $output_protein = Bio::SeqIO->new(-format => 'fasta',
-file => ">$ofile2");
while( my $seq = $input->next_seq ) {
# get all the feature from the sequence object
my @features = $seq->get_SeqFeatures;
foreach my $f ( @features ) {
# test if the feature TYPE is 'CDS'
#print "feature object is $f\n";
#print Dumper($f), "\n";
if( $f->primary_tag eq 'CDS' ) {
my @tags = $f->get_all_tags;
print "the tags are @tags\n";
my @namelist = $f->get_tag_values('gene');
my $name = $namelist[0];
my ($desc) = $f->get_tag_values('product');
# process location information
my $location = $f->location;
print "location str is ", $location->to_FTstring(), "\n";
my @sublocs = $location->each_Location;
my $cds_string;
for my $l ( @sublocs ) {
print $l->to_FTstring(), "\n";
my $start = $l->start;
my $end = $l->end;
my $strand= $l->strand;
my $exon_seq = $seq->subseq($start,$end);
print "exon_seq is $exon_seq\n";
$cds_string .= $exon_seq;
}
my $cds_seq2 = $f->spliced_seq;
my $cds_seq = Bio::Seq->new(-seq => $cds_string,
-id => $name,
-desc=> $desc);
$output->write_seq($cds_seq);
$output->write_seq($cds_seq2);
$output_protein->write_seq($cds_seq->translate);
#print $f->start, "..", $f->end, " for feature ", $name, "\n";
}
}
}
有点相关的文章
- Bioperl:对序列数据进行操作 (1.000)
- Bioperl:使用和解析 BLAST (1.000)
- Bioperl:解析blast结果 (1.000)
- Bioperl:把Genbank格式的序列转换为基因结构图 (1.000)
- bioperl提供序列模块(Seq和SeqIO)的使用 (1.000)
- 使用linux中的sed编辑器 (RANDOM - 0.500)
版主,请问,下面这一行是做什么用的呢?
——————
my $cds_seq2 = $f->spliced_seq;
—————–
我看了关于splice_seq的说明,可是没看明白到底什么意思?
说明文档是这么解释的:可是我就是不明白什么意思。。
————————–
Title : spliced_seq
Usage : $seq = $feature->spliced_seq()
$seq = $feature_with_remote_locations->spliced_seq($db_for_seqs)
Function: Provides a sequence of the feature which is the most
semantically “relevant” feature for this sequence. A
default implementation is provided which for simple cases
returns just the sequence, but for split cases, loops over
the split location to return the sequence. In the case of
split locations with remote locations, eg
join(AB000123:5567-5589,80..1144)
in the case when a database object is passed in, it will
attempt to retrieve the sequence from the database object,
and “Do the right thing”, however if no database object is
provided, it will generate the correct number of N’s (DNA)
or X’s (protein, though this is unlikely).
This function is deliberately “magical” attempting to
second guess what a user wants as “the” sequence for this
feature
Implementing classes are free to override this method with
their own magic if they have a better idea what the user
wants
Args : [optional] A Bio::DB::RandomAccessI compliant object
Returns : A Bio::Seq
————————–