转自Perlchina.org
分子生物学数据库(如GeneBank,DDBJ, EMBL,SwissProt,PIR等)包含许多种格式的序列数据,有简单的格式也有复杂的格式。简单的格式如FASTA,GCG,只包含名字和序列,复杂的格式如GeneBank, EMBL格式,包含序列,参考文献,注释等等许多信息。
FASTA格式是这个样子的:
>ID Description(Free text)
AGTGATGATAGTGAGTAGGA
>gi|number|emb|ACCESSION
AGATAGTAGGGGATAGAG
而 GeneBank,EMBL 等格式则十分复杂,此处不再列出。在日常工作中我们常常需要在这些格式中进行转换,已经有很多的程序可以进行格式转换和提取信息,但是最方便和最自由的我认为还是 Bioperl,更重要的是它能进行大批量的自动化处理,那些图形界面的工具常常是无法做到的。
要对序列进行操作,你最先应该熟悉的应该是 Bio::Seq 模块,它是 Bioperl 的核心模块。
下面的几行程序直接创建了一个 seq object,并将它打印出来。
#!/bin/perl -w
use Bio::Seq;
$seq_obj = Bio::Seq->new(-seq => “aaaatgggggggggggccccgtt”);
print $seq_obj->seq;
下一个应该接触的模块是 Bio::SeqIO,记住当你准备进行一些输入输出工作的时候,应该记得使用 ***IO 模块。 I 表示 input,O 表示 Output。在这里我们准备进行 sequence 的输入和输出,所以使用 Bio::SeqIO 模块。
以下这段代码将创建一个 seqence.fasta 格式的文件,将$seq_obj 的 new 方法所得到的参数写入该文件。
#!/bin/perl -w
use Bio::Seq;
use Bio::SeqIO;
$seq_obj = Bio::Seq->new(-seq => “aaaatgggggggggggccccgtt”,
-display_id => “#12345”,
-desc => “example 1” );
$seqio_obj = Bio::SeqIO->new(-file => ‘>sequence.fasta’,
-format => ‘fasta’ );
$seqio_obj->write_seq($seq_obj);
你将在 sequence.fasta 文件中看到这样的结果:
>#12345 example 1
aaaatgggggggggggccccgtt
当你想进行格式转换的时候,比如将上例中的 fasta 格式转换为 GenBank 格式,只要将 $seqio_obj 中的 -format 参数设为”genbank”,然后文件中输出的结果将会是这样:
LOCUS #12345 23 bp dna linear UNK
DEFINITION example 1
ACCESSION unknown
FEATURES Location/Qualifiers
BASE COUNT 4 a 4 c 12 g 3 t
ORIGIN
1 aaaatggggg ggggggcccc gtt
//
多么简洁漂亮!
Bio::Seq 和 Bio::SeqIO 还有很多方法,篇幅所限,这里不一一介绍了,可以使用 perldoc 命令直接查阅它们的完整文档,如 perldoc Bio::Seq ,非常的详细清晰。
3条回应:“Bioperl:对序列数据进行操作”
这是解析blast一程序
#! /usr/bin/perl -w
use Bio::SearchIO;
my $cutoff = ‘0.001’;
my $file = ‘ARGV[0]’;
my $in = new Bio::SearchIO(-format => ‘blast’,
-file => $file);
open (OUT,”>ARGV[1]”);
while( my $r = $in->next_result ) {
print OUT “Query is: “, $r->query_name, ” “,
$r->query_description,” “,$r->query_length,” aa\n”;
print OUT ” Matrix was “, $r->get_parameter(‘matrix’), “\n”;
while( my $h = $r->next_hit ) {
last if $h->significance > $cutoff;
print OUT “Hit is “, $h->name, “\n”;
while( my $hsp = $h->next_hsp ) {
print OUT ” HSP Len is “, $hsp->length(‘total’), ” “,
” E-value is “, $hsp->evalue, ” Bit score “,
$hsp->score, ” \n”,
” Query loc: “,$hsp->query->start, ” “,
$hsp->query->end,” “,
” Sbject loc: “,$hsp->hit->start, ” “,
$hsp->hit->end,”\n”;
}
}
}
close (OUT);
运行之后出现下列提示:
C:\Perl>E:\vm1\perl\readblast.pl C:\blast_1.txt C:\blast_2.txt
Subroutine new redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult.pm
line 162.
Subroutine algorithm redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResu
lt.pm line 248.
Subroutine algorithm_version redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
ericResult.pm line 268.
Subroutine next_hit redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResul
t.pm line 295.
Subroutine query_name redefined at C:/Perl/site/lib/Bio/Search/Result/GenericRes
ult.pm line 321.
Subroutine query_accession redefined at C:/Perl/site/lib/Bio/Search/Result/Gener
icResult.pm line 341.
Subroutine query_gi redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResul
t.pm line 362.
Subroutine query_length redefined at C:/Perl/site/lib/Bio/Search/Result/GenericR
esult.pm line 383.
Subroutine query_description redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
ericResult.pm line 404.
Subroutine database_name redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
Result.pm line 426.
Subroutine database_letters redefined at C:/Perl/site/lib/Bio/Search/Result/Gene
ricResult.pm line 449.
Subroutine database_entries redefined at C:/Perl/site/lib/Bio/Search/Result/Gene
ricResult.pm line 471.
Subroutine get_parameter redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
Result.pm line 492.
Subroutine available_parameters redefined at C:/Perl/site/lib/Bio/Search/Result/
GenericResult.pm line 507.
Subroutine get_statistic redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
Result.pm line 524.
Subroutine available_statistics redefined at C:/Perl/site/lib/Bio/Search/Result/
GenericResult.pm line 539.
Subroutine add_hit redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult
.pm line 558.
Subroutine hit_factory redefined at C:/Perl/site/lib/Bio/Search/Result/GenericRe
sult.pm line 583.
Subroutine rewind redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult.
pm line 600.
Subroutine _nexthitindex redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
Result.pm line 613.
Subroutine add_parameter redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
Result.pm line 630.
Subroutine add_statistic redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
Result.pm line 647.
Subroutine num_hits redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResul
t.pm line 664.
Subroutine hits redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult.pm
line 684.
Subroutine algorithm_reference redefined at C:/Perl/site/lib/Bio/Search/Result/G
enericResult.pm line 712.
Subroutine program_reference redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
ericResult.pm line 731.
Subroutine no_hits_found redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
Result.pm line 740.
Subroutine set_no_hits_found redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
ericResult.pm line 757.
Subroutine to_string redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResu
lt.pm line 778.
想请教怎样解决?thanks very much
这段程序本来是没错的.可能是你修改错了而已. 有变量的地方不要用单引号”,要用双引号。我修改了。已经测试有效,请看:http://www.liucheng.name/?p=875
把NCBI整个nr数据库下载到本地后 如何用bioperl 从里面提取数据呢 那几种方法我试了 貌似都不行啊