gibenchs

Grammatical Inference Benchmarks Repository and Links

Some public positive and negative training samples for grammatical inference; let me know if you know of others…

My old site has been deleted; this page was temporarily recovered from the Wayback Machine. I’ll try to fix it soon…


Quick links:

Artificial data

Automata inference

o Let’s begin with a historical note: do you know of any GI benchmark prior to that of M. Tomita [T82]?
The examples are over a two-letter alphabet, and the target DFAs have from 1 to 4 states.
The DFAs were chosen to exhibit various configurations; the corresponding languages are listed below (a small membership-check sketch follows the list):

  • L1 : a*
  • L2 : (ab)*
  • L3 : any sentence without an odd number of consecutive a’s after an odd number of consecutive b’s.
  • L4 : any sentence over the alphabet a,b without more than two consecutive a’s.
  • L5 : any sentence with an even number of a’s and an even number of b’s.
  • L6 : any sentence such that the difference between the number of a’s and the number of b’s is a multiple of 3.
  • L7 : a*b*a*b*

I found the sample sets in a paper by A. Blair and J. Pollack [BP97] dealing with language induction with second-order recurrent neural networks… These samples are available in this targzipped file, in the Abbadingo One format.
This benchmark has been progressively extended by L. Miclet and C. de Gentile [MG94] and by P. Dupont [D94]. The new languages are:

  • L8 : a*b
  • L9 : (a* + c*)b (3 letters!)
  • L10 : (aa)*(bbb)*
  • L11 : any sentence with an even number of a’s and an odd number of b’s.
  • L12 : a(aa)*b
  • L13 : any sentence over the alphabet a,b with an even number of a’s.
  • L14 : (aa)*ba*
  • L15 : bc*b + ac*a (3 letters!)

I have obtained the set of training samples used in [D94]. They were generated by random walks in the canonical DFA of L and in that of its complement:

(1) I+ consists of a set of strings randomly generated from the canonical DFA of L, with the property that I+ is structurally complete. Let |I+|c be the sample size required to randomly obtain such a sample. Depending on the value of a parameter denoted M, the random generation is pursued until the sample size |I+| is equal to M × |I+|c.
(2) I- is constructed in the same way, but from the complementary automaton, i.e. the automaton accepting Σ* – L.

For each language, 20 training sets have been randomly generated according to this protocol. The resulting samples range from 1 to 156 strings, and their lengths vary from 1 to 1017 letters. Samples 01 to 10 were generated with M = 1, whereas samples 11 to 20 were generated with M = 3. The first 10 samples are therefore smaller than the last 10, which are more likely to be characteristic…
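To make this protocol concrete, here is a rough sketch in Python of the I+ generation step as I read it (my own reconstruction, not the code used in [D94]); the DFA representation, the stopping rule of the random walk and the structural-completeness test are simplifying assumptions, and I- would be built the same way from the complement automaton.

    import random

    def generate_positive_sample(dfa, M, stop_prob=0.2, rng=random.Random(0)):
        """Sketch of the protocol above: strings are produced by random walks in the
        canonical DFA of L until the sample is structurally complete (every transition
        exercised, every final state used as an end state); |I+|c is the size reached at
        that point, and generation continues until |I+| = M * |I+|c.  The DFA is assumed
        trimmed (no dead states) and given as a dict with keys 'start', 'finals' and
        'delta' mapping (state, symbol) -> state."""

        def random_walk():
            state, word, used = dfa["start"], [], set()
            while True:
                if state in dfa["finals"] and rng.random() < stop_prob:
                    return "".join(word), used, state
                outgoing = [(sym, tgt) for (src, sym), tgt in dfa["delta"].items() if src == state]
                if not outgoing:                  # no outgoing transition: stop here
                    return "".join(word), used, state
                sym, tgt = rng.choice(outgoing)
                word.append(sym)
                used.add((state, sym))
                state = tgt

        sample, used_transitions, used_finals = [], set(), set()
        size_c = None                             # this will be |I+|c
        while size_c is None or len(sample) < M * size_c:
            w, used, end_state = random_walk()
            sample.append(w)
            used_transitions |= used
            used_finals.add(end_state)
            if size_c is None and used_transitions == set(dfa["delta"]) and dfa["finals"] <= used_finals:
                size_c = len(sample)
        return sample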

This was my favorite benchmark for debugging my programs (ooh, did I?). You can get the compressed version of all the samples (“targzipped”) in the Abbadingo format.
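For convenience, here is a small reader sketch for the Abbadingo One format as I understand it: a header line giving the number of strings and the alphabet size, then one line per string with its label (1 or 0), its length and its symbols. If a particular file deviates from this layout, adjust accordingly.

    def read_abbadingo(path):
        """Sketch of a parser for the Abbadingo One format: header line
        '<number of strings> <alphabet size>', then one line per string
        '<label> <length> <sym_1> ... <sym_length>' with label 1 (positive) or 0 (negative)."""
        with open(path) as f:
            n_strings, alphabet_size = map(int, f.readline().split())
            positives, negatives = [], []
            for _ in range(n_strings):
                fields = f.readline().split()
                label, length = int(fields[0]), int(fields[1])
                symbols = tuple(fields[2:2 + length])
                (positives if label == 1 else negatives).append(symbols)
        return positives, negatives

    # usage: pos, neg = read_abbadingo("L1.train")   # hypothetical file name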

o You can also have a look at A. Oliveira and J. Silva’s benchmark. Here is their presentation of the benchmark generated to evaluate the BIC algorithm [OS97]:

115 finite state machines with binary inputs and outputs were randomly generated. These finite state machines were reduced and unreachable states were removed before the experiments were run. The size of the machines (after reduction) varied between 3 and 19 states. A total of 575 training sets were generated, with each training set containing twenty strings of length 30.

The benchmark is available on A. Oliveira’s home page (direct link to the targzipped file in Abbadingo format).

If the previous link doesn’t work, try this one.
o The previous benchmarks are somewhat biased toward DFAs. What about other types of generation biases? To test their algorithm DeLeTe2 [DLT04], the authors used several generation modes: from random DFAs, NFAs and regular expressions. The benchmark and the algorithm are available here as a targzipped file in Abbadingo format, or directly from Aurélien’s homepage (http://www.grappa.univ-lille3.fr/~lemay/, click on “Recherche” and then on “Annexes”).

In [CF03], we added a generation mode from random UFAs (unambiguous automata): targzipped file in Abbadingo format.

o And don’t forget the Abbadingo One problems. The benchmark is composed of 12 randomly generated DFAs whose sizes after reduction range from 63 to 519 states. The training sets, containing from 1,521 to 60,000 strings, were drawn without replacement from a uniform distribution over the set of all strings of length at most the target DFA depth plus 5. Challenging, isn’t it? The competition is now over but you can still download the challenge problems. The oracle is still answering your queries… And if you solve problem T, let us know about it :o)
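To make the drawing procedure concrete, here is a sketch of an Abbadingo-style draw (my own reformulation, not the original generation code): distinct binary strings drawn uniformly, without replacement, from all strings of length at most the target depth plus 5.

    import random

    def draw_training_strings(depth, n, rng=random.Random(0)):
        """Sketch of the drawing procedure described above: n distinct binary strings drawn
        uniformly, without replacement, from the set of all strings of length at most
        depth + 5.  A length is first picked with probability proportional to the number of
        strings of that length, then the symbols are drawn uniformly; duplicates are simply
        rejected.  Labels would then be obtained by running each string through the target
        DFA (not shown)."""
        max_len = depth + 5
        lengths = list(range(max_len + 1))
        weights = [2 ** l for l in lengths]       # number of binary strings of each length
        drawn = set()
        while len(drawn) < n:
            l = rng.choices(lengths, weights=weights, k=1)[0]
            drawn.add("".join(rng.choice("01") for _ in range(l)))
        return sorted(drawn)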

o A follow-on to Abbadingo One is the on-demand problem server: Gowachin. The same generation procedure as in Abbadingo is used, but you can specify your own target size and number of training sequences. The possibility of introducing noise on labels and symbols has also been added…

o A follow-on to Gowachin is the Learning DFA from noisy samples contest for GECCO 2004 (the closing date for this contest is Friday, May 28th, 2004). This contest focuses on a fairly small DFA (nominally 50 states), but with a high level of label noise (10%) and a moderate training sample size (5,000 strings). The programs in this contest will have to run on the same machine within the same time constraints: 10 minutes of CPU time (on a 2.4 GHz Pentium 4 running Windows or Linux)!

o The Zulu competition is about active learning of DFAs with membership queries. The competition has been closed since June 30th, 2010, but the Oracle, the problem generator and the problems used during the competition are still available.
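To give an idea of the setting, here is a tiny local stand-in for a membership-query oracle (entirely my own sketch; the real Zulu oracle is a remote server with its own protocol): it answers whether a string belongs to a hidden target DFA and keeps track of the query budget.

    class MembershipOracle:
        """Toy stand-in for a membership-query oracle: answers whether a string is in the
        hidden target language and counts how many queries have been spent.
        The hidden DFA uses the same dict representation as in the earlier sketches."""
        def __init__(self, dfa, budget):
            self.dfa, self.budget, self.used = dfa, budget, 0

        def member(self, word):
            if self.used >= self.budget:
                raise RuntimeError("query budget exhausted")
            self.used += 1
            state = self.dfa["start"]
            for sym in word:
                state = self.dfa["delta"][(state, sym)]
            return state in self.dfa["finals"]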

o Motivated by software engineering, the Stamina competition is about learning regular languages with large alphabets. The competition ends on December 31st, 2010…

Context-free inference

o A competition on learning context-free languages was organized for the International Colloquium on Grammatical Inference ICGI’04, held in Athens. Theoretical learning results are not encouraging, but what can be done in practice? More on the Omphalos site!

Context-sensitive, context-free and regular language inference

o Alexander Clark, Christophe Costa Florencio and Chris Watkins have prepared random data sets for a variety of context-sensitive, context-free and regular languages for their experiments presented in [CFW06]: Languages as Hyperplanes: grammatical inference with string kernels, ECML 2006, Berlin.

The data set is available from the GISK (Grammatical Inference with String Kernels) project page or directly here as targzipped file.

Transducer inference

o Brad and Menno, together with Dominique Estival, are organizing a new competition named Tenjinno, held in conjunction with the International Colloquium on Grammatical Inference (ICGI 2006). The aim of this competition is to improve the state of the art in the inference of formal models that can translate from one language into another.

Molecular Biology

o I’ve made a page on the Prediction of the Subcellular Location of Proteins task, based on A. Reinhardt and T. Hubbard’s data; it contains a translation of the data into Abbadingo format and gathers useful links on the subject… An easy way to test your favorite algorithm on a real task!

o Three databases at the famous UCI Repository of Machine Learning Databases:

o Major intrinsic protein (MIP) data set used to evaluate Protomata-L [CK05].

79 MIP sequences were identified by a biology expert on the basis of annotations backed by real biological experiments. Since this set was highly redundant, a subset in which the identity between each pair of proteins is below 90% was built (a small greedy-filtering sketch is given at the end of this item):

    • Set M (44 sequences)
    • A control set made of close proteins not belonging to the MIP family is also available: Set C (34 sequences)

For discrimination tasks, two subfamilies in M were considered:

    • Set W+ (24 sequences) water specific proteins
    • Set W- (16 sequences) glycerol or small molecule facilitator sequences

If the previous links do not work, try this one.
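The 90% identity filtering mentioned above can be illustrated by a simple greedy procedure; in the sketch below, identity(a, b) is a placeholder for a proper pairwise sequence identity computation (in practice it would come from an alignment tool), so this only conveys the idea behind building Set M, not how it was actually built.

    def greedy_nonredundant(sequences, identity, threshold=0.90):
        """Greedy redundancy filtering: keep a sequence only if its identity with every
        already-kept sequence is below the threshold.  `identity(a, b)` is assumed to
        return a pairwise identity in [0, 1]."""
        kept = []
        for seq in sequences:
            if all(identity(seq, other) < threshold for other in kept):
                kept.append(seq)
        return kept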

o Coiled-coil domains prediction [PLCS06].

This is more a transducer learning task than a pure grammatical inference task, but the authors reduced it to a regular language inference problem: they proposed an even linear language formulation of the problem, reduced it to a regular language learning task, and translated the result back into a transducer (used to label unknown sequences). Note that amino acid encodings of the sequences (hydrophobic/polar encoding and Dayhoff encoding) were used in the paper.
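For readers unfamiliar with the even linear trick, here is the standard folding transformation that reduces even linear language inference to regular language inference (this is the generic textbook construction, not necessarily the exact encoding used in [PLCS06]): the string is read from both ends at once, pairing the i-th symbol with the (n+1-i)-th one.

    def fold(word):
        """Fold a string so that an even linear language over the original alphabet becomes
        a regular language over the paired alphabet: the i-th symbol is paired with the
        (n+1-i)-th one; a word of odd length keeps its middle symbol as a lone final symbol."""
        n = len(word)
        pairs = [(word[i], word[n - 1 - i]) for i in range(n // 2)]
        if n % 2 == 1:
            pairs.append((word[n // 2],))   # middle symbol of an odd-length word
        return pairs

    print(fold("abcde"))   # [('a', 'e'), ('b', 'd'), ('c',)]

Learning the folded strings with a regular language learner then amounts to learning the original even linear language.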
The file coiled.txt contains the 350 proteins that were reported in the paper as the subset from SwissProt. The format is quite simple: every protein is described by three text lines with the following meaning (a small reader sketch is given after this description):

Line 1: Internal name of the protein
Line 2: Sequence of amino acid names
Line 3: annotations (C means a coiled position, while a dot means a non-coiled position, or at least a non-annotated one)
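Here is a small reader sketch for this format (my own, assuming each of the three fields fits on a single line, as the description suggests):

    def read_coiled(path):
        """Sketch of a reader for the coiled.txt format described above: each protein spans
        three lines (internal name, amino acid sequence, annotation string where 'C' marks
        a coiled position and '.' a non-coiled / non-annotated one)."""
        proteins = []
        with open(path) as f:
            lines = [line.rstrip("\n") for line in f if line.strip()]
        for i in range(0, len(lines), 3):
            name, sequence, annotation = lines[i], lines[i + 1], lines[i + 2]
            proteins.append({"name": name, "sequence": sequence, "annotation": annotation})
        return proteins

    # usage: data = read_coiled("coiled.txt")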

Delorenzi’s dataset, also used in the paper, is available from his web page by following the link “Web Marcoil”: http://www.isrec.ch/research/groups/research_groups_detail_eid_1686_lid_2.htm.

Bird songs

o Skylark problem database.
Songs from nine skylarks coming from three different localities (three birds per locality) are available (2 or 3 songs per bird). The songs are given as sequences of syllables, and the durations of the syllables are also provided. The goal is to locate the sequences specific to a bird (individual signature) and those common to all the birds of a locality (group signature or micro-dialect).
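As a naive illustration of these two goals, the sketch below works on a hypothetical structure mapping each bird to its songs (as lists of syllable labels) and each bird to its locality; it extracts the syllable n-grams found in only one bird’s songs versus those shared by all birds of a locality, ignoring syllable durations.

    from itertools import chain

    def ngrams(song, n=3):
        """All length-n windows of a song given as a sequence of syllable labels."""
        return {tuple(song[i:i + n]) for i in range(len(song) - n + 1)}

    def signatures(songs_by_bird, locality_of, n=3):
        """Naive illustration on hypothetical inputs: songs_by_bird[bird] = list of songs
        (each a list of syllables) and locality_of[bird] = locality name.  Returns the
        n-grams found in only one bird's songs (candidate individual signatures) and, for
        each locality, the n-grams shared by all its birds (candidate micro-dialect)."""
        grams = {bird: set(chain.from_iterable(ngrams(song, n) for song in songs))
                 for bird, songs in songs_by_bird.items()}
        individual = {bird: g - set(chain.from_iterable(grams[o] for o in grams if o != bird))
                      for bird, g in grams.items()}
        micro_dialect = {loc: set.intersection(*(grams[b] for b in grams if locality_of[b] == loc))
                         for loc in set(locality_of.values())}
        return individual, micro_dialect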

References:

[T82] M. Tomita. Dynamic Construction of Finite Automata from Examples Using Hill Climbing. Proc. of the 4th Annual Cognitive Science Conference, USA, pp. 105-108, 1982.
[D94] P. Dupont. Regular Grammatical Inference from Positive and Negative Samples by Genetic Search: the GIG Method. Grammatical Inference and Applications, ICGI’94, Lecture Notes in Artificial Intelligence No. 862, Springer-Verlag, pp. 236-245, 1994.
[MG94] L. Miclet and C. de Gentile. Inférence Grammaticale à partir d’Exemples et de Contre-Exemples : deux Algorithmes Optimaux (BIG et RIG) et une Version Heuristique (BRIG) [Grammatical Inference from Examples and Counter-Examples: two Optimal Algorithms (BIG and RIG) and a Heuristic Version (BRIG)]. Actes des JFA-94, Strasbourg, France, pp. F1-F13, 1994.
[BP97] A. D. Blair and J. B. Pollack. Analysis of Dynamical Recognizers. Neural Computation, 9(5), pp. 1127-1142, 1997.
[OS97] A. Oliveira and J. Silva. Efficient Search Techniques for the Inference of Minimum Size Finite Automata. Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, Fourteenth International Conference on Machine Learning (ICML-97), Nashville, Tennessee, July 12, 1997.
[CF03] F. Coste and D. Fredouille. Unambiguous automata inference by means of state-merging methods. ECML 2003.
[DLT04] F. Denis, A. Lemay and A. Terlutte. Learning regular languages using RFSAs. Theoretical Computer Science, 313(2), pp. 267-294, 2004.
[CK05] F. Coste and G. Kerbellec. A Similar Fragments Merging Approach to Learn Automata on Proteins. ECML 2005, Porto.
[CFW06] A. Clark, C. Costa Florencio and C. Watkins. Languages as Hyperplanes: grammatical inference with string kernels. ECML 2006.
[PLCS06] P. Peris, D. Lopez, M. Campos and J. Sempere. Protein Motif Prediction by Grammatical Inference. ICGI 2006.


Comments and remarks are welcome
Back to my home page
