Bioperl:对序列数据进行操作


转自Perlchina.org

分子生物学数据库(如GeneBank,DDBJ, EMBL,SwissProt,PIR等)包含许多种格式的序列数据,有简单的格式也有复杂的格式。简单的格式如FASTA,GCG,只包含名字和序列,复杂的格式如GeneBank, EMBL格式,包含序列,参考文献,注释等等许多信息。

FASTA格式是这个样子的:

>ID Description(Free text)
AGTGATGATAGTGAGTAGGA
>gi|number|emb|ACCESSION
AGATAGTAGGGGATAGAG

而 GeneBank,EMBL 等格式则十分复杂,此处不再列出。在日常工作中我们常常需要在这些格式中进行转换,已经有很多的程序可以进行格式转换和提取信息,但是最方便和最自由的我认为还是 Bioperl,更重要的是它能进行大批量的自动化处理,那些图形界面的工具常常是无法做到的。

要对序列进行操作,你最先应该熟悉的应该是 Bio::Seq 模块,它是 Bioperl 的核心模块。

下面的几行程序直接创建了一个 seq object,并将它打印出来。

#!/bin/perl -w
use Bio::Seq;
$seq_obj = Bio::Seq->new(-seq => “aaaatgggggggggggccccgtt”);
print $seq_obj->seq;

下一个应该接触的模块是 Bio::SeqIO,记住当你准备进行一些输入输出工作的时候,应该记得使用 ***IO 模块。 I 表示 input,O 表示 Output。在这里我们准备进行 sequence 的输入和输出,所以使用 Bio::SeqIO 模块。

以下这段代码将创建一个 seqence.fasta 格式的文件,将$seq_obj 的 new 方法所得到的参数写入该文件。

#!/bin/perl -w

use Bio::Seq;
use Bio::SeqIO;

$seq_obj = Bio::Seq->new(-seq => “aaaatgggggggggggccccgtt”,
-display_id => “#12345”,
-desc => “example 1” );

$seqio_obj = Bio::SeqIO->new(-file => ‘>sequence.fasta’,
-format => ‘fasta’ );
$seqio_obj->write_seq($seq_obj);

你将在 sequence.fasta 文件中看到这样的结果:

>#12345 example 1
aaaatgggggggggggccccgtt

当你想进行格式转换的时候,比如将上例中的 fasta 格式转换为 GenBank 格式,只要将 $seqio_obj 中的 -format 参数设为”genbank”,然后文件中输出的结果将会是这样:

LOCUS #12345 23 bp dna linear UNK
DEFINITION example 1
ACCESSION unknown
FEATURES Location/Qualifiers
BASE COUNT 4 a 4 c 12 g 3 t
ORIGIN
1 aaaatggggg ggggggcccc gtt
//

多么简洁漂亮!

Bio::Seq 和 Bio::SeqIO 还有很多方法,篇幅所限,这里不一一介绍了,可以使用 perldoc 命令直接查阅它们的完整文档,如 perldoc Bio::Seq ,非常的详细清晰。


3条回应:“Bioperl:对序列数据进行操作”

  1. 这是解析blast一程序
    #! /usr/bin/perl -w
    use Bio::SearchIO;
    my $cutoff = ‘0.001’;
    my $file = ‘ARGV[0]’;
    my $in = new Bio::SearchIO(-format => ‘blast’,
    -file => $file);
    open (OUT,”>ARGV[1]”);
    while( my $r = $in->next_result ) {
    print OUT “Query is: “, $r->query_name, ” “,
    $r->query_description,” “,$r->query_length,” aa\n”;
    print OUT ” Matrix was “, $r->get_parameter(‘matrix’), “\n”;
    while( my $h = $r->next_hit ) {
    last if $h->significance > $cutoff;
    print OUT “Hit is “, $h->name, “\n”;
    while( my $hsp = $h->next_hsp ) {
    print OUT ” HSP Len is “, $hsp->length(‘total’), ” “,
    ” E-value is “, $hsp->evalue, ” Bit score “,
    $hsp->score, ” \n”,
    ” Query loc: “,$hsp->query->start, ” “,
    $hsp->query->end,” “,
    ” Sbject loc: “,$hsp->hit->start, ” “,
    $hsp->hit->end,”\n”;
    }
    }
    }
    close (OUT);

    运行之后出现下列提示:
    C:\Perl>E:\vm1\perl\readblast.pl C:\blast_1.txt C:\blast_2.txt
    Subroutine new redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult.pm
    line 162.
    Subroutine algorithm redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResu
    lt.pm line 248.
    Subroutine algorithm_version redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
    ericResult.pm line 268.
    Subroutine next_hit redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResul
    t.pm line 295.
    Subroutine query_name redefined at C:/Perl/site/lib/Bio/Search/Result/GenericRes
    ult.pm line 321.
    Subroutine query_accession redefined at C:/Perl/site/lib/Bio/Search/Result/Gener
    icResult.pm line 341.
    Subroutine query_gi redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResul
    t.pm line 362.
    Subroutine query_length redefined at C:/Perl/site/lib/Bio/Search/Result/GenericR
    esult.pm line 383.
    Subroutine query_description redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
    ericResult.pm line 404.
    Subroutine database_name redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
    Result.pm line 426.
    Subroutine database_letters redefined at C:/Perl/site/lib/Bio/Search/Result/Gene
    ricResult.pm line 449.
    Subroutine database_entries redefined at C:/Perl/site/lib/Bio/Search/Result/Gene
    ricResult.pm line 471.
    Subroutine get_parameter redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
    Result.pm line 492.
    Subroutine available_parameters redefined at C:/Perl/site/lib/Bio/Search/Result/
    GenericResult.pm line 507.
    Subroutine get_statistic redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
    Result.pm line 524.
    Subroutine available_statistics redefined at C:/Perl/site/lib/Bio/Search/Result/
    GenericResult.pm line 539.
    Subroutine add_hit redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult
    .pm line 558.
    Subroutine hit_factory redefined at C:/Perl/site/lib/Bio/Search/Result/GenericRe
    sult.pm line 583.
    Subroutine rewind redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult.
    pm line 600.
    Subroutine _nexthitindex redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
    Result.pm line 613.
    Subroutine add_parameter redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
    Result.pm line 630.
    Subroutine add_statistic redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
    Result.pm line 647.
    Subroutine num_hits redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResul
    t.pm line 664.
    Subroutine hits redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResult.pm
    line 684.
    Subroutine algorithm_reference redefined at C:/Perl/site/lib/Bio/Search/Result/G
    enericResult.pm line 712.
    Subroutine program_reference redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
    ericResult.pm line 731.
    Subroutine no_hits_found redefined at C:/Perl/site/lib/Bio/Search/Result/Generic
    Result.pm line 740.
    Subroutine set_no_hits_found redefined at C:/Perl/site/lib/Bio/Search/Result/Gen
    ericResult.pm line 757.
    Subroutine to_string redefined at C:/Perl/site/lib/Bio/Search/Result/GenericResu
    lt.pm line 778.

    想请教怎样解决?thanks very much

    • 这段程序本来是没错的.可能是你修改错了而已. 有变量的地方不要用单引号”,要用双引号。我修改了。已经测试有效,请看:http://www.liucheng.name/?p=875

  2. 把NCBI整个nr数据库下载到本地后 如何用bioperl 从里面提取数据呢 那几种方法我试了 貌似都不行啊