MoSDi (Motif Statistics and Discovery)


  • Define a measure of motif quality (for example, a score)
  • Find a motif with a good (or optimal) score

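Over the past few years, a variety of algorithms have been used to predict large numbers of motifs. In our research, we look for more efficient and more accurate methods for motif discovery. To this end, we introduced probabilistic arithmetic automata, a theoretical framework for computing motif statistics faster and more accurately [1]. Motif quality is measured by the p-value, and the algorithm for finding motifs that are optimal with respect to it is described in [2].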


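MoSDi is a Java-based software package for sequence motif statistics and discovery. It is currently at an experimental stage. Tools with a similar purpose include MEME and Weeder, which MoSDi is said to outperform.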


  • MoSDi Revision 468 (as of 2009/01/13): JAR-File
  • MoSDi uses jopt-simple to parse command line options. Right now, a version prior to 3.0 must be used (interface is not backwards-compatible and we haven’t ported our code yet). The last 2.x version is 2.4.1.



MoSDi is configured to look for the file jopt-simple.jar in the directory of the MoSDi JAR-file. The easiest way to satisfy this requirement is to create a link named jopt-simple.jar in the same directory as the MoSDi JAR-file:

$ cd <path-to-mosdi-jar>
$ ln -s <path-to-jopt-simple>/jopt-simple-<version>.jar jopt-simple.jar

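For testing, download this file with (toy-)example sequences (right-click and choose "Save as"). The format is one sequence per line; FASTA is not supported yet. For convenience, save the file in the directory of the MoSDi JAR-file.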
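Since FASTA input is not supported yet, existing FASTA files need to be flattened to one sequence per line first. The following is a minimal sketch of such a conversion, not part of MoSDi: the function name fasta_to_lines is hypothetical, and it assumes that header lines start with ">" and that a record's sequence lines can simply be concatenated.

```shell
# Hypothetical helper (not part of MoSDi): print one sequence per line
# for the FASTA file given as $1.
fasta_to_lines() {
  awk '
    /^>/ { if (seq != "") print seq; seq = ""; next }  # header: flush previous record
         { gsub(/[ \t\r]/, ""); seq = seq $0 }         # sequence line: strip blanks, append
    END  { if (seq != "") print seq }                  # flush the last record
  ' "$1"
}
```

A call such as `fasta_to_lines input.fasta > example-sequences` would then write the flattened sequences to a file; input.fasta is a placeholder name.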

In our toy example, we wish to examine all IUPAC motifs of length 8 with at most 2 of the characters R, Y, W, S, K, M (two-nucleotide-wildcards) and with at most 2 Ns (four-nucleotide-wildcard). At first, we need to generate a list of abelian patterns that satisfy these conditions:

$ java -jar mosdi_r468.jar iupac_abelian_gen -M 8,2,0,2 8 > abelian_patterns
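To make the constraints of the toy example concrete, the sketch below tests whether a single IUPAC motif satisfies them: length 8, at most 2 two-nucleotide wildcards and at most 2 Ns. It is an illustration only and not how MoSDi generates abelian patterns; the function name check_motif is hypothetical, and it assumes upper-case input.

```shell
# Hypothetical helper (not part of MoSDi): exit status 0 if the motif in $1
# meets the toy example's constraints, 1 otherwise.
check_motif() {
  motif=$1
  max_two=2   # at most 2 of R, Y, W, S, K, M
  max_n=2     # at most 2 Ns
  [ "${#motif}" -eq 8 ] || return 1
  # Count characters per class; $(( )) strips padding that some wc versions add.
  two=$(( $(printf '%s' "$motif" | tr -cd 'RYWSKM' | wc -c) ))
  n=$(( $(printf '%s' "$motif" | tr -cd 'N' | wc -c) ))
  # Anything outside A, C, G, T, the six two-nucleotide codes and N is rejected.
  other=$(( $(printf '%s' "$motif" | tr -d 'ACGTRYWSKMN' | wc -c) ))
  [ "$two" -le "$max_two" ] && [ "$n" -le "$max_n" ] && [ "$other" -eq 0 ]
}
```

For instance, `check_motif TAARASGA` succeeds: the motif has length 8, two two-nucleotide wildcards (R and S) and no N.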

Now we examine the example sequences with respect to these abelian patterns:

$ java -jar mosdi_r468.jar discovery -t 1e-15 -F example-sequences abelian_patterns | grep '>>' > results

The switch “-t 1e-15” tells the algorithm to look for motifs with a p-value below 1e-15. Grepping for >> saves us from the software’s rather detailed output. Note that, in order to parallelize the computation, we may split the file abelian_patterns into chunks and process them on different cores/machines.
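The chunked processing mentioned above might look like the following sketch. It assumes that each line of abelian_patterns is an independent pattern, so the file can be cut at line boundaries; the function names split_patterns and run_chunks and the chunk file names are hypothetical. The discovery command is the one used above.

```shell
# Hypothetical helpers (not part of MoSDi).

# Split pattern file $1 into chunks of $2 lines each, named ${3}aa, ${3}ab, ...
split_patterns() {
  split -l "$2" "$1" "$3"
}

# Run one discovery job per chunk in the background, then wait for all of them.
# $1 is the chunk prefix passed to split_patterns.
run_chunks() {
  for chunk in "$1"*; do
    java -jar mosdi_r468.jar discovery -t 1e-15 -F example-sequences "$chunk" \
      | grep '>>' > "results_${chunk##*/}" &
  done
  wait
}
```

With `split_patterns abelian_patterns 1000 chunk_` followed by `run_chunks chunk_`, the per-chunk outputs could afterwards be merged with `cat results_chunk_* > results`. The chunk size of 1000 is arbitrary; on several machines, each would process its own subset of the chunk files.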

Sorting results by p-value gives us the winner motif (along with some statistics):

>>p_value>> 3.617957e-37 LIN >>stats>> TAARASGA 1 9 9 17 5.170981e-02 >>poisson>> 5.170981e-02 1.000000e+00 >>runtimes>> 0.000000e+00 0.000000e+00 0.000000e+00
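One possible way to do the sorting is sketched below. It relies on the p-value being the second space-separated field, as in the line above, and on a sort implementation that supports general numeric ordering (-g, available in GNU and BSD sort), since plain -n does not understand exponents. The function name top_motifs is hypothetical.

```shell
# Hypothetical helper (not part of MoSDi): print the $2 result lines with the
# smallest p-values (default: 1) from results file $1.
top_motifs() {
  # LC_ALL=C makes sort read "." as the decimal point regardless of locale.
  LC_ALL=C sort -k2,2g "$1" | head -n "${2:-1}"
}
```

Calling `top_motifs results` prints the line of the best-scoring motif, and `top_motifs results 10` prints the ten best.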

The motif discovery process is as yet restricted to i.i.d. background models. Thus we might want to re-evaluate all returned motifs with respect to a Markovian background model. Firstly, we create a list of all returned motifs:

$ grep '>>p_value>' results | cut -d ' ' -f 5 > motifs

Then, we re-evaluate these motifs w.r.t. a third-order Markov model (M3):

$ java -jar mosdi_r468.jar calc_scores -M3 -F example-sequences motifs | grep '>>' > results_M3



[1] Tobias Marschall and Sven Rahmann. Probabilistic arithmetic automata and their application to pattern matching statistics. In Paolo Ferragina and Gad Landau, editors, Combinatorial Pattern Matching (CPM’08), volume 5029 of LNCS, pages 95-106. Springer, 2008.

[2] Tobias Marschall and Sven Rahmann. Efficient Exact Motif Discovery. Submitted.

Last changed 2009/06/09