Universal Levenshtein Automata. Building and Properties - CiteSeerX

3 downloads 149 Views 2MB Size Report
Sofia University St. Kliment Ohridski. Faculty of ...... 10 Ims s. ,n = 2. We define Mχ s : 1) χ = ϵ. Mϵ s def. = {M
Sofia University St. Kliment Ohridski Faculty of Mathematics and Informatics Department of Mathematical Logic and Applications

Universal Levenshtein Automata. Building and Properties A thesis submitted for the degree of Master of Computer Science by Petar Nikolaev Mitankin

supervisor: Dr. Stoyan Mihov

— Sofia, 2005 —

Contents 1 Introduction.

2

2 Levenshtein distances. Properties.

3

3 Nondeterministic finite Levenshtein automata for fixed word.

8

4 Deterministic finite Levenshtein automata for fixed word.

13

5 Universal Levenshtein automata.

28

∀,ms ∀,t . 6 Building of A∀,² n , An and An 6.1 Summarized pseudo code . . . 6.2 Detailed pseudo code . . . . . . 6.3 Complexity . . . . . . . . . . . 6.4 Some final results . . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

48 48 49 58 59

∀,t ∀,ms 7 Minimality of A∀,² . n , An and An

59

8 Some properties of A∀,² n .

72

1

1

Introduction.

One possible measure for the proximity of two strings is the so-called Levenshtein distance (known also as edit distance), based on primitive edit operations. Primitive edit operations are replacement of one symbol with another (substitution), deletion of a symbol, insertion of a symbol and others. The distance between two strings w and v is defined as the minimal number of the primitive edit operations that transform w into v. This master thesis gives a detailed formal review of the so-called universal Levenshtein automaton. The input word for this automaton is a sequence of bit vectors i(w, v) which is computed by given two words w and v. The automaton recognizes i(w, v) iff the distance between w and v is not greater than n. The greatest advantage of the universal Levenshtein automata A∀,χ is obn tained when we have to extract from a dictionary all words v that are close enough to a given word w. If the dictionary is repesented as a finite deterministic automaton D we can traverse parallelly the two automata A∀,χ n and D to find all these words. Description of this algorithm and its modified version called forward-backward method, which is extremely fast in practice, can be found in [MSFASLD]. Short review of the contents Section 2 - definition of three different Levenshtein distances based on the number of edit operations. Section 3 - definition of the nondeterministic LevenD,χ D,χ (w) consists (w) and proof that the language of AN shtein automaton AN n n of all strings x such that the distance between w and x is not greater than n. Section 4 - definition of the deterministic Levenshtein automaton AD,χ n (w) and D,χ D,χ proof that the languages of AN (w) and A (w) are equal. The universal n n is defined in section 5. Section 6 the algorithm Levenshtein automaton A∀,χ n for its building. Section 7 - proof that A∀,χ is minimal. Section 8 - some propn erties of A∀,² . n Remarks The aim of this master thesis is to review the deterministic Levenshtein automata and the universal Levenshtein automata presented by their authors Mihov and Schulz in [SMFSCLA] and [MSFASLD]. The main efforts in this master thesis are concentrated on the strict proofs and the details. This paper is a draft transation of the original text with additional comments and more figures. The original can be found at [ORIG]. The term Levenshtein distances is used in the text for d²L , dtL and dms L , although for the words w1 = abcd, w2 = abdc and w3 = bdac the triangle inequality is not satisfied for dtL . dtL (abcd, abdc) = 1, dtL (abdc, bdac) = 2, but dtL (abcd, bdac) = 4.

2

2

Levenshtein distances. Properties.

Let Σ be a finite set of letters. Definition 1 d²L : Σ∗ × Σ∗ → N Let v, w, v 0 , w0 ∈ Σ∗ and a, b ∈ Σ. 1) v = ² or w = ² def

d²L (v, w) = max(|v|, |w|) 2) |v| ≥ 1 and |w| ≥ 1 Let v = av 0 and w = bw0 . def

d²L (v, w) =

min(

if (a = b, d²L (v 0 , w0 ), ∞), 1 + d²L (v 0 , bw0 ), 1 + d²L (av 0 , w0 ), 1 + d²L (v 0 , w0 ) )

Notations Here and in what follows the value of the expression if (Condition, V alueIf ConditionIsT rue, V alueIf ConditionIsF alse) is V alueIf ConditionIsT rue if Condition is satisfied and V alueIf ConditionIsF alse otherwise. |x| denotes the length of x. The function d²L is called Levenshtein distance. d²L (v, w) is called Levenshtein distance between the words v and w. The Levenshtein distance between the words v and w is the minimal number of primitive edit operations that transform v into w. The primitive edit operations are deletion of a letter, insertion of a letter and substitution of one letter with another. Definition 20 ,→: Σ∗ × N → Σ∗ Let k ∈ N , x1 , x2 , ..., xk ∈ Σ and t ∈ N . ½ def ² x1 x2 ...xk ,→ t = xt+1 xt+2 ...xk

if t ≥ k otherwise

Treating the transposition of two letters also as a primitive edit operation we receive the following definition of Levenshtein distance extended with transposition: Definition 2 dtL : Σ∗ × Σ∗ → N Let v, w, v 0 , w0 ∈ Σ∗ and a, b, a1 , b1 ∈ Σ. 1) v = ² or w = ² def

dtL (v, w) = max(|v|, |w|) 2) |v| ≥ 1 and |w| ≥ 1 Let v = av 0 and w = bw0 .

3

def

dtL (v, w) =

min(

if (a = b, dtL (v 0 , w0 ), ∞), 1 + dtL (v 0 , bw0 ), 1 + dtL (av 0 , w0 ), 1 + dtL (v 0 , w0 ), if (a1 < v 0 & b1 < w0 & a = b1 & a1 = b, 1 + dtL (v ,→ 2, w ,→ 2), ∞) )

Notations We use c < d to denote that c is a prefix of d if c and d are words. The function dtL is called Levenshtein distance extended with transposition. When merging of two letters into one and splitting of one letter into two other letters are considered as primitive edit operations we use the following definition of Levenshtein distance extended with merge and split: ∗ ∗ Definition 3 dms L :Σ ×Σ →N 0 0 ∗ Let v, w, v , w ∈ Σ and a, b ∈ Σ. 1) v = ² or w = ² def

dms L (v, w) = max(|v|, |w|) 2) |v| ≥ 1 and |w| ≥ 1 Let v = av 0 and w = bw0 . def

dms L (v, w) =

min(

0 0 if (a = b, dms L (v , w ), ∞), ms 0 0 1 + dL (v , bw ), 0 0 1 + dms L (av , w ), ms 0 0 1 + dL (v , w ), 0 if (|w| ≥ 2, 1 + dms L (v , w ,→ 2), ∞), ms if (|v| ≥ 2, 1 + dL (v ,→ 2, w0 ), ∞) )

The function dms L is called Levenshtein distance extended with merge and split. Notations We use χ as a metasymbol. For example dχL denotes d²L , dtL or if χ ∈ {², t, ms}.

dms L

Proposition 1 Let χ ∈ {², t, ms} and v, w ∈ Σ∗ . Then dχL (v, w) = 0 ⇔ v = w . Proof ⇐) Let v = w = x. Using induction on |x| we prove that dχL (x, x) = 0. 1) |x| = 0 dχL (x, x) = dχL (², ²) = 0 2) Induction hypothesis: dχL (x, x) = 0 Let a ∈ Σ. We prove that dχL (ax, ax) = 0: dχL (ax, ax) =

min(

if (a = a, dχL (x, x), ∞), ... ) = 4

min(

if (a = a, 0, ∞), ... ) = 0

⇒) With induction on |v| we prove that dχL (v, w) = 0 ⇒ v = w. 1) v = ². Let dχL (v, w) = 0. dχL (v, w) = max(|v|, |w|) = 0. Hence w = ². 2) Induction hypothesis: ∀w ∈ Σ∗ (dχL (v, w) = 0 ⇒ v = w) Let a ∈ Σ and w ∈ Σ∗ . We have to prove that dχL (av, w) = 0 ⇒ av = w. Let dχL (av, w) = 0. From the definition of dχL it follows that |w| ≥ 1. Let b ∈ Σ, w0 ∈ Σ∗ and w = bw0 . From the definition of dχL it follows that a = b and dχL (v, w0 ) = 0. The induction hypothesis implies that v = w0 . Therefore av = w. Proposition 2 Let χ ∈ {², t, ms} and v, w ∈ Σ∗ . Then dχL (v, w) = dχL (w, v). The proof of the Proposition 2 is straightforward. Remark As we know Proposition 1 and Proposition 2, it remains to prove the triangle inequality for dχL ( dχL (v, w) ≤ dχL (v, x) + dχL (x, w) ) to show that dχL is distance. But this property is used nowhere in this paper. That’s why we don’t prove it. Definition 4 Let χ ∈ {², t, ms}. LχLev : N × Σ∗ → P (Σ∗ ) def

LχLev (n, w) = {v|dχL (v, w) ≤ n} We can find the definitions of L²Lev , LtLev and Lms Lev in [SMFSCLA]. Proposition 3 Let χ ∈ {², t, ms}. = k ⇒ dχL (av, w) ≤ k + 1.

dχL (v, w)

Let a ∈ Σ and v, w ∈ Σ∗ .

Then

Proof Let dχL (v, w) = k. 1) w = ² dχL (av, w) = dχL (av, ²) = k + 1 2) |w| ≥ 1 Form the definition of dχL it follows that dχL (av, w) ≤ 1 + dχL (v, w) = k + 1. Proposition 4 Let χ ∈ {², t, ms}. Let a, w1 ∈ Σ and v, w ∈ Σ∗ . Then = k ⇒ dχL (av, w1 w) ≤ k + 1.

dχL (v, w)

Proof Let dχL (v, w) = k. From the definition of dχL it follows that dχL (av, w1 w) ≤ 1 + dms L (v, w) = k + 1. Proposition 5 Let χ ∈ {², t, ms}. Let w1 ∈ Σ and v, w ∈ Σ∗ . Then = k ⇒ dχL (v, w1 w) ≤ k + 1.

dχL (v, w)

Proof Proposition 5 follows directly from Proposition 3 and Proposition 2.

5

Proposition 6 Let χ ∈ {², t, ms}. Let w1 ∈ Σ and v, w ∈ Σ∗ . Then = k ⇒ dχL (w1 v, w1 w) ≤ k.

dχL (v, w)

Proof Let dχL (v, w) = k. From the definition of dχL it follows that dχL (w1 v, w1 w) ≤ = k.

dms L (v, w)

Proposition 7 Let χ ∈ {², t, ms}. Let w ∈ Σ∗ , w = w1 w2 ...wp , p ≥ 1 and n > 0. Then S LχLev (n, w) ⊇ Σ.LχLev (n − 1, w) S Σ.LχLev (n − 1, w2 w3 ...wpS ) LχLev (n − 1, w2 w3 ...wp ) w1 .LχLev (n, w2 w3 ...wp ). Proof From Properties 3, 4, 5 and 6 it follows respectively that LχLev (n, w) ⊇ Σ.LχLev (n − 1, w) , LχLev (n, w) ⊇ Σ.LχLev (n − 1, w2 w3 ...wp ) , LχLev (n, w) ⊇ LχLev (n − 1, w2 w3 ...wp ) and LχLev (n, w) ⊇ w1 .LχLev (n, w2 w3 ...wp ) . Therefore S LχLev (n, w) ⊇ Σ.LχLev (n − 1, w) S χ Σ.LLev (n − 1, w2 w3 ...wpS ) χ LLev (n − 1, w2 w3 ...wp ) w1 .LχLev (n, w2 w3 ...wp ). We show how to extend A=

S Σ.LχLev (n − 1, w) S ) Σ.LχLev (n − 1, w2 w3 ...wpS LχLev (n − 1, w2 w3 ...wp ) w1 .LχLev (n, w2 w3 ...wp )

to LχLev (n, w). First we define Rχ as an extension of A and afterwards we prove that Rχ = LχLev . Definition 5 Let χ ∈ {², t, ms}. Rχ : N + × Σ+ → P (Σ∗ ) Let w ∈ Σ∗ , w = w1 w2 ...wp , p ≥ 1 and n ≥ 1. 1) χ = ² def

R² (n, w) =

S Σ.L²Lev (n − 1, w) S Σ.L²Lev (n − 1, w2 w3 ...wpS ) L²Lev (n − 1, w2 w3 ...wp ) w1 .L²Lev (n, w2 w3 ...wp )

6

2) χ = t S Σ.LtLev (n − 1, w) S Σ.LtLev (n − 1, w2 w3 ...wpS ) LtLev (n − 1, w2 w3 ...wp ) S w1 .LtLev (n, w2 w3 ...wp ) if (|w| ≥ 2, w2 w1 .LtLev (n − 1, w3 ...wp ), φ).

def

Rt (n, w) =

3) χ = ms def

Rms (n, w) =

S Σ.Lms Lev (n − 1, w) S ) Σ.Lms Lev (n − 1, w2 w3 ...wpS Lms Lev (n − 1, w2 w3 ...wp ) S w1 .Lms Lev (n, w2 w3 ...wp ) S Σ.Σ.Lms Lev (n − 1, w2 w3 ...wp ) if (|w| ≥ 2, Σ.Lms Lev (n − 1, w ,→ 2), φ).

Proposition 8 Let w ∈ Σ∗ , w = w1 w2 ...wp , p ≥ 1 and n ≥ 1. Then = Rχ (n, w).

LχLev (n, w)

Proof ⊇) 1) χ = ² From Proposition 7 it follows that L²Lev (n, w) ⊇ R² (n, w). 2) χ = t We have to prove that (∗t ) |w| ≥ 2 ⇒ LtLev (n, w) ⊇ w2 w1 .LtLev (n − 1, w3 ...wp ). Let |w| ≥ 2 and v ∈ LtLev (n − 1, w3 ...wp ). Hence dtL (v, w3 ...wp ) ≤ n − 1. From the definition of dtL it follows that dtL (w2 w1 v, w1 w2 w3 ...wp ) ≤ 1 + dtL (v, w3 ...wp ) ≤ n. From (∗t ) and Proposition 7 it directly follows that LtLev (n, w) ⊇ Rt (n, w). 3) χ = ms We have to prove that ms ms (∗ms 1 ) LLev (n, w) ⊇ Σ.Σ.LLev (n − 1, w2 ...wp ) and ms ) |w| ≥ 2 ⇒ L (n, w) ⊇ Σ.Lms (∗ms 2 Lev Lev (n − 1, w3 ...wp ). 3.1) First we prove (∗ms ) 1 ms Let v ∈ Lms Lev (n − 1, w2 ...wp ) and a, b ∈ Σ. Hence dL (v, w2 ...wp ) ≤ n − 1. ms ms From the definition of dms it follows that d (abv, w w ...w 1 2 p ) ≤ 1+dL (v, w2 ...wp ) ≤ L L n. 3.2) We prove (∗ms 2 ) ms Let |w| ≥ 2, v ∈ Lms Lev (n − 1, w3 ...wp ) and a ∈ Σ. Hence dL (v, w3 ...wp ) ≤ ms n − 1. From the definition of dms it follows that d (av, w w 1 2 w3 ...wp ) ≤ 1 + L L dms (v, w ...w ) ≤ n. 3 p L ms ms From (∗ms 1 ), (∗2 ) and Proposition 7 it directly follows that LLev (n, w) ⊇ ms R (n, w).

7

Therefore LχLev (n, w) ⊇ Rχ (n, w). ⊆) Let v ∈ LχLev (n, w) and dχL (v, w) = k ≤ n. 1) v = ² dχL (v, w) = |w| = k dχL (v, w2 ...wp ) = |w2 ...wp | = k − 1 ≤ n − 1 Therefore v ∈ LχLev (n − 1, w2 ...wp ). 2) |v| ≥ 1 Let |v| = t and v = v1 v2 ...vt . Hence dχL (v, w) =

min(

if (v1 = w1 , dχL (v2 ...vt , w2 ...wp ), ∞), 1 + dχL (v2 ...vt , w), 1 + dχL (v, w2 ...wp ), 1 + dχL (v2 ...vt , w2 ...wp ), ...) = k

2.1) v1 = w1 & dχL (v2 ...vt , w2 ...wp ) = k ≤ n In this case v ∈ w1 .LχLev (n, w2 ...wp ). 2.2) dχL (v2 ...vt , w) = k − 1 ≤ n − 1 In this case v ∈ Σ.LχLev (n − 1, w). 2.3) dχL (v, w2 ...wp ) = k − 1 ≤ n − 1 In this case v ∈ LχLev (n − 1, w2 ...wp ). 2.4) dχL (v2 ...vt , w2 ...wp ) = k − 1 ≤ n − 1 In this case v ∈ Σ.LχLev (n − 1, w2 ...wp ). Therefore L²Lev (n, w) ⊆ R² (n, w). 2.5) χ = t and |w| ≥ 2 & |v| ≥ 2 & v1 = w2 & v2 = w1 & dtL (v ,→ 2, w ,→ 2) = k − 1 ≤ n − 1 In this case v ∈ Σ.Σ.LtLev (n − 1, w2 ...wp ). Therefore LtLev (n, w) ⊆ Rt (n, w). 2.6) χ = ms and |w| ≥ 2 & dms L (v2 ...vt , w3 ...wp ) = k − 1 ≤ n − 1 In this case v ∈ Σ.Lms (n − 1, w ,→ 2). Lev 2.7) χ = ms and |v| ≥ 2 & dms L (v3 ...vt , w2 ...wp ) = k − 1 ≤ n − 1 In this case v ∈ Σ.Σ.Lms (n − 1, w2 ...wp ). Lev ms Therefore Lms (n, w) ⊆ R (n, w). Lev So LχLev (n, w) ⊆ Rχ (n, w).

3

Nondeterministic finite Levenshtein automata for fixed word.

Notations We denote the tuples , e >, , e > and , e > with i#e , i#e and i#e t s correspondingly. Definition 6 Let χ ∈ {², t, ms}. Let w ∈ Σ∗ and n ∈ N . We define the D,χ nondeterministic finite Levenshtein automaton AN (w). n ∗

def

D,χ D,χ N D,χ AN (w) = < Σ, QN ,I , FnN D,χ , δnN D,χ > n n

8

d B is partial function. We use the expresNotations Suppose that γ : A−→ sion !γ(π) in order to denote that γ(π) is defined and ¬!γ(π) - to denote that γ(π) is not defined. The special expression < S π1 , a, π2 >∈ δnN D,χ used for the N D,χ N D,² D,² d P (QN transition partial function δn : Qn × Σ {²} −→ ) means that n N D,χ ∗ N D,χ ∗ !δn (π1 , a) & π2 ∈ δn (π1 , a). Let |w| = p and w = w1 w2 ...wp . 1) χ = ² def

D,² QN = {i#e |0 ≤ i ≤ p & 0 ≤ e ≤ n} n def

I N D,² = {0#0 } ∗ def

FnN D,² =S{p#e |0 ≤ e ≤ n} D,² Let a ∈ Σ {²} and q1 , q2 ∈ QN . n def

< q1 , a, q2 >∈ δnN D,² ⇔ q1 = i#e & q1 = i#e+1 & a ∈ Σ or q1 = i#e & q1 = i + 1#e+1 or q1 = i#e & q1 = i + 1#e & a = wi+1 2) χ = t def

D,t D,² QN = QN n n

S

{i#e t |0 ≤ i ≤ p − 2 & 1 ≤ e ≤ n}

def

I N D,t = {0#0 } ∗ def



FnN D,t =SFnN D,² D,t . Let a ∈ Σ {²} and q1 , q2 ∈ QN n def

< q1 , a, q2 >∈ δnN D,t ⇔ < q1 , a, q2 >∈ δnN D,² or q1 = i#e & q2 = i#e+1 & a = wi+2 or t #e q1 = i#e & q = i + 2 & a = wi+1 2 t 3) χ = ms def

D,ms D,² QN = QN n n

I

S

{i#e s |0 ≤ i ≤ p − 1 & 1 ≤ e ≤ n}

N D,ms def

= {0#0 }

∗ def FnN D,ms S =

Let a ∈ Σ



FnN D,² D,ms {²} and q1 , q2 ∈ QN . n def

< q1 , a, q2 >∈ δnN D,ms ⇔ < q1 , a, q2 >∈ δnN D,² or q1 = i#e & q2 = i + 2#e+1 & a ∈ Σ or q1 = i#e & q2 = i + 1#e s & a ∈ Σ or #e q1 = i#e & q = i + 1 &a∈Σ 2 s ∗

D,χ The extended transition function δnN D,χ for AN is defined as usual. First n N D,² N D,² we define the ²-closure Cl² : Qn → P (Qn ): [ def Cl² (q) = {q} {π|∃k ≥ 0∃η1 , η2 , ..., ηk (< q, ², η1 >, < η1 , ², η2 >, ..., < ηk , ², π >∈ δnN D,χ )}

9

D,² D,² We define ²-closure for set of states ( Cl² : P (QN ) → P (QN ) ) in the n n following way: def [ Cl² (A) = Cl² (π) π∈A ∗

Let v ∈ Σ and a ∈ Σ. We define recursively the partial function ∗ D,² D,² d P (QN δnN D,χ : QN × Σ∗ −→ ): n n ∗

def

δnN D,χ (q, ²) = Cl² (q)   ¬!   ¬! def ∗ δnN D,χ (q, va) =   S  Cl² ( π∈δnN D,χ ∗ (q,v) δnN D,χ (π, a))



if ¬!δnN D,χ (q, v) N D,χ ∗ if (q, v) & S !δn N D,χ N D,χ ∗ (π, a) = φ π∈δn (q,v) δn otherwise ∗

In what follows we use the expression < π1 , v, π2 >∈ δnN D,χ to denote that ∗ ∗ !δnN D,χ (π1 , v) & π2 ∈ δnN D,χ (π1 , v). ∗

D,χ , FnN D,χ and δnN D,χ depend on the word w. When we use Remark QN n these notations the word w will be clear from the context. D,² Description of AN can be found in [MSFASLD]. n

D,² Fig. 1 AN (wS 1 w2 ...w5 ) 2 ² (Σ denotes Σ {²}.)

D,χ The transitions in the automaton AN (w) correspond with the definin χ N D,² #e tion of R : in An the transitions < i , a, i#e+1 > (a ∈ Σ) correspond to

10

Σ.L²Lev (n − 1, w). The transitions < i#e , a, i + 1#e+1 > (a ∈ Σ) - to Σ.L²Lev (n − 1, w2 w3 ...wp ). The transitions < i#e , ², i#e+1 > - to L²Lev (n − 1, w2 w3 ...wp ). And the transitions < i#e , wi , i + 1#e > - to w1 .L²Lev (n, w2 w3 ...wp ). Later we D,χ prove that L(AN (w)) = L²Lev (n, w). n

Fig. 2

Fig. 3

D,t AN (w1 w2 ...w5 ) 2

D,ms AN (w1 w2 ...w5 ) 2

11

D,χ Proposition 9 Let χ ∈ {², t, ms}. Let n ∈ N and w ∈ Σ∗ . Let i#e ∈ QN . n χ #e Then L(i ) = LLev (n − e, wi+1 ...wp ). ∗

( L(π) = {w|∃π 0 ∈ FnN D,χ (< π, w, π 0 >∈ δnN D,χ )} ) The properties L(i#e ) = L²Lev (n−e, wi+1 ...wp ), L(i#e ) = LtLev (n−e, wi+1 ...wp ) and L(i#e ) = Lms Lev (n − e, wi+1 ...wp ) are formulated in [SMFSCLA]. Proof Induction on i. 1) i = p L(p#e ) = {x|x ∈ Σ∗ & |x| ≤ n − e} = LχLev (n − e, ²) 2) 0 ≤ i ≤ p − 1 Induction hypothesis: (IH1 ) ∀j ≥ 1∀e(L(i + j #e ) = LχLev (n − e, wi+j+1 ...wp )) We prove with induction on e that L(i#e ) = LχLev (n − e, wi+1 ...wp ). 2.1) e = n L(i#n ) = wi+1 .L(i + 1#n ) =IH1 wi+1 .LχLev (0, wi+2 ...wp ) = wi+1 ...wp = LχLev (0, wi+1 ...wp ) = LχLev (n − e, wi+1 ...wp ) 2.2) 0 ≤ e ≤ n − 1 Induction hypothesis: (IH2 ) L(i#e+1 ) = LχLev (n − e − 1, wi+1 ...wp ) 2.2.1) χ = ² S S S L(i#e ) = Σ.L(i#e+1 ) Σ.L(i+1#e+1 ) L(i+1#e+1 ) wi+1 .L(i+1#e ) =IH1,2 S Σ.L²Lev (n − e − 1, wi+1 ...wp ) S Σ.L²Lev (n − e − 1, wi+2 ...wpS) L²Lev (n − e − 1, wi+2 ...wp ) wi+1 .L²Lev (n − e, wi+2 ...wp ) = R² (n − e, wi+1 ...wp ) =Proposition 8 L²Lev (n − e, wi+1 ...wp ) 2.2.2) χ = t S S S S L(i#e ) = Σ.L(i#e+1 ) Σ.L(i + 1#e+1 ) L(i + 1#e+1 ) wi+1 .L(i + 1#e ) if (i ≤ p − 2, wi+2 wi+1 .L(i + 1#e+1 ), φ) =IH1,2 S Σ.LtLev (n − e − 1, wi+1 ...wp ) S Σ.LtLev (n − e − 1, wi+2 ...wpS) LtLev (n − e − 1, wi+2 ...wp ) S wi+1 .LtLev (n − e, wi+2 ...wp ) if (|wi+1 ...wp | ≥ 2, wi+2 wi+1 .LtLev (n − e − 1, wi+3 ...wp ), φ) = Rt (n − e, wi+1 ...wp ) =Proposition 8 LtLev (n − e, wi+1 ...wp ) 2.2.3) χ = ms

12

S S S S L(i#e ) = Σ.L(iS#e+1 ) Σ.L(i + 1#e+1 ) L(i + 1#e+1 ) wi+1 .L(i + 1#e ) Σ.Σ.L(i + 1#e+1 ) if (i ≤ p − 2, Σ.L(i + 2#e+1 ), φ) =IH1,2 S Σ.Lms Lev (n − e − 1, wi+1 ...wp ) S Σ.Lms Lev (n − e − 1, wi+2 ...wp S) Lms (n − e − 1, w ...w ) i+2 p Lev S wi+1 .Lms Lev (n − e, wi+2 ...wp ) S Σ.Σ.Lms Lev (n − e − 1, wi+2 ...wp ) if (|wi+1 ...wp | ≥ 2, Σ.Lms Lev (n − e − 1, wi+3 ...wp ), φ) = Rms (n − e, wi+1 ...wp ) =Proposition 8 Lms Lev (n − e, wi+1 ...wp ) Corollary Let χ ∈ {², t, ms}, w ∈ Σ∗ and n ∈ N . Proposition 9 implies that D,χ L(AN (w)) = L(0#0 ) = LχLev (n, w). n

4

Deterministic finite Levenshtein automata for fixed word.

D,χ In this section we show a special way for determinization of AN (w). As a n D,χ result we receive the deterministic automaton An (w).

Definition 7 Let χ ∈ {², t, ms}. def

QN D,² = {i#e |i, e ∈ Z} S def QN D,t = QN D,² {i#e t |i, e ∈ Z} S #e N D,ms def N D,² Q = Q {is |i, e ∈ Z} Let n ∈ N . We define δeD,χ : QN D,χ × {0, 1}∗ → P (QN D,χ ). Let b ∈ {0, 1}∗ , k ∈ N and b = b1 b2 ...bk . 1) χ = ²  {i + 1#e } if 1 < b    #e+1  , i + 1#e+1 } if b = 0k & b 6= ² & e < n  {i def D,² #e #e+1 #e+1 #e+j−1 {i ,i + 1 ,i + j } if 0 < b & j = µz[bz = 1] δe (i , b) =  #e+1  {i } if b = ² & e < n    φ otherwise µz[A] denotes the least z such that A is satisfied. 2) χ = t 2.1)

13

 {i + 1#e }     {i#e+1 , i + 1#e+1 , i + 2#e+1 , i#e+1 }  t   #e+1 def {i , i + 1#e+1 , i + j #e+j−1 } D,t #e δe (i , b) = #e+1  , i + 1#e+1 }  {i  #e+1   {i }   φ

if 1 < b if 01 < b if 00 < b & j = µz[bz = 1] if b = 0k & b 6= ² & e < n if b = ² & e < n otherwise

2.2) def

½

δeD,t (i#e t , b) =

{i + 2#e } if 1 < b φ otherwise

3) χ = ms 3.1)  {i + 1#e }     , i + 1#e+1 , i + 2#e+1 }  {i#e+1 , i#e+1 s def D,ms #e #e+1 #e+1 {i , is , i + 1#e+1 } δe (i , b) =  #e+1  {i }    φ

if 1 < b if 00 < b ∨ 01 < b if 0 = b & e < n if ² = b & e < n otherwise

def

#e } 3.2) δeD,ms (i#e s , b) = {i + 1

The function δeD,χ is called function of the elementary transitions. Definition 8 Let χ ∈ {², t, ms}. Let w ∈ Σ∗ , n ∈ N and ∗ D,χ N D,χ =< Σ, QN ,I , FnN D,χ , δnN D,χ >. n N D,χ ∗ We define w[ ] : Qn →Σ . D,χ Let w = w1 w2 ...wp and π ∈ QN . n #e N D,χ 1) π = i ∈ Qn w[i#e ] = wi+1 wi+2 ...wi+k where k = min(n − e + 1, p − i) D,t 2) π = i#e ∈ QN t n w[i#e ] = w[i#e ]

D,χ AN (w) n

t

N D,ms 3) π = i#e s ∈ Qn w[i#e = w[i#e ] s ] The word w[π] is called relevant subword of w for π ([SMFSCLA]).

Definition 9 β : Σ × Σ∗ → {0, 1}∗ β(x, w1 w2 ...wp ) = b1 b2 ...bp where bi = 1 ⇔ x = wi . β(x, w1 w2 ...wp ) is called characteristic vector of x with respect to the word w1 w2 ...wp . Definition 10 Let χ ∈ {², t, ms}. Let w ∈ Σ∗ , n ∈ N and ∗ D,χ N D,χ =< Σ, QN ,I , FnN D,χ , δnN D,χ >. n

D,χ AN (w) n

14

D,χ D,χ We define δeD,χ : QN × Σ → P (QN ). n n def

δeD,χ (π, x) = δeD,χ (π, β(x, w[π] )) The definitions of δeD,² , δeD,t and δeD,ms are given in [SMFSCLA]. Definition 11 Let χ ∈ {², t, ms}. We define ∈ δnN D,χ & < ηs , x, q >∈ δnN D,χ ).

26

Fig. 7

Proof Proposition 16 follows directly from the definitions of δeD,χ and δnN D,χ . Proposition 17 Let χ ∈ {², t, ms}. Let w ∈ Σ∗ , n ∈ N , D,χ D,χ > and AD,χ , FnD,χ , δnD,χ >. n (w) =< Σ, Qn , I

D,χ D,χ N D,χ AN (w) =< Σ, QN ,I , FnN D,χ , δnN D,χ n n N D,χ D,χ Then L(An (w)) ⊇ L(An (w)).

#0 Proof Let x1 x2 ...xk ∈ L(AD,χ }, Mi+1 = δnD,χ (Mi , xi+1 ) n (w)). Let M0 = {0 D,χ for 0 ≤ i ≤ k − 1 and Mk ∈ Fn . We prove with induction on i that ∗ ∀π ∈ Mi (< 0#0 , x1 x2 ...xi , π >∈ δnN D,χ ) if 0 ≤ i ≤ k. 1) i = 0 < 0#0 , ², 0#0 >∈ δnN D,χ ∗ 2) Induction hypothesis: ∀π ∈ Mi (< 0#0 , x1 x2 ...xi , π >∈ δnN D,χ ) ∗ We have to prove that ∀π 0 ∈ Mi+1 (< 0#0 , x1 x2 ...xi+1 , π 0 >∈ δnN D,χ ). G [ Mi+1 = δeD,χ (q, xi+1 ) ⊆ δeD,χ (q, xi+1 ) q∈Mi

q∈Mi

Let π 0 ∈ Mi+1 . Therefore ∃q ∈ Mi (π 0 ∈ δeD,χ (q, xi+1 )). Let q be such that ∗ q ∈ Mi and π 0 ∈ δeD,χ (q, xi+1 ). Therefore < 0#0 , x1 x2 ...xi , q >∈ δnN D,χ . ∗ It follows from Proposition 16 that < q, xi+1 , π 0 >∈ δnN D,χ . Therefore < ∗ 0#0 , x1 x2 ...xi+1 , π 0 >∈ δnN D,χ . ∗ We proved that ∀π ∈TMi (< 0#0 , x1 x2 ...xi , π >∈ δnN D,χ ) if 0 ≤ i ≤ k. Let π be such that π ∈ Mk FnN D,χ (since Mk ∈ FnD,χ such π exists). Therefore ∗ < 0#0 , x1 x2 ...xk , π >∈ δnN D,χ and x1 x2 ...xk ∈ L(AD,χ n (w)). Corollary Let χ ∈ {², t, ms}. Let w ∈ Σ∗ , n ∈ N , D,χ N D,χ D,χ D,χ =< Σ, QN ,I , FnN D,χ , δnN D,χ > and AD,χ , FnD,χ , δnD,χ >. n n (w) =< Σ, Qn , I χ N D,χ It follows from Proposition 17 and Proposition 15 that LLev (n, w) = L(An (w)) = L(AD,χ n (w)). D,χ AN (w) n

27

In [SMFSCLA] we can also find proof that LχLev (n, w) = L(AD,χ n (w)). Proposition 18 Let χ ∈ {², t, ms}. Let n ∈ N and b ∈ {0, 1}∗ . Then #f D,χ #e (i , b)} 1) δeD,χ (i + t#e , b) = {j + t#f (t|s) |j(t|s) ∈ δe #f #f D,χ #e (it , b)} 2) δeD,χ (i + t#e t , b) = {j + t(t) |j(t) ∈ δe #f #f D,χ #e 3) δeD,χ (i + t#e (is , b)} s , b) = {j + t(s) |j(s) ∈ δe

Proof Proposition 18 follows directly from the definition of δeD,χ .

5

Universal Levenshtein automata.

We show that for each n ∈ N we can build finite deterministic automaton A∀,χ n in such way that: 1) when χ = ² every nonfinal state of A∀,χ is finite set that consists of n elements of the type I + i#e and every final state is finite set that consists of elements of the type M +j #f . (When χ = t there are in the states also elements of the type It + i#e and Mt + j #f , when χ = ms - of the type Is + i#e and Ms + j #f ); 2) each symbol from the input alphabet for A∀,χ is binary vector, i.e. word n from the language {0, 1}∗ ; 3) for every two words v1 v2 ...vk and w from Σ∗ we can build b = b1 b2 ...bk ∀,χ D,χ such that bi ∈ {0, 1}∗ and b ∈ L(A∀,χ n ) ⇔ v ∈ L(An (w)), i.e. b ∈ L(An ) ⇔ χ ∀ v ∈ LLev (n, w) (vi is called symbol corresponding to the word bi ). Let q0 , q1∀ , ..., with b and let q0D , q1D , ..., that we visit traversing A∀,χ qk∀ be the states of A∀,χ n n D D,χ qk be the states of An (w) that we visit traversing AD,χ n (w) with v. We build in such way that we receive qjD when we replace the parameters I in qj∀ A∀,χ n with j (if qj∀ is nonfinal) or the parameters M in qj∀ with k (if qj∀ is final). And also: qj∀ is final only if qjD is final. Notations We use expressions of the type F (I)#e , F (It )#e , F (Is )#e , F (M )#e , F (Mt )#e and F (Ms )#e to denote correspondingly tuples of the type , e >, , e >, , e >, , e >, , e > and , e >. Definition 15 Let χ ∈ {², t, ms}. Let n ∈ N . We define the finite automaton A∀,χ n . def

∀,χ A∀,χ = < Σ∀n , Q∀,χ , Fn∀,χ , δn∀,χ > n n ,I def

Σ∀n = {x|x ∈ {0, 1}+ & |x| ≤ 2n + 2} We define Isχ : 1) χ = ² def

Is² = {I + t#k | |t| ≤ k & − n ≤ t ≤ n & 0 ≤ k ≤ n} 28

Fig. 8

2) χ = t def

Ist = Is²

S

{It + t#k | |t + 1| + 1 ≤ k & − n ≤ t ≤ n − 2 & 1 ≤ k ≤ n}

Fig. 9

3) χ = ms def

Isms = Is²

Is² , n = 2

S

Ist , n = 2

{Is + t#k | |t + 1| + 1 ≤ k & − n ≤ t ≤ n − 2 & 1 ≤ k ≤ n} 29

Fig. 10

Isms , n = 2

We define Msχ : 1) χ = ² def

Ms² = {M + t#k | k ≥ −t − n & − 2n ≤ t ≤ 0 & 0 ≤ k ≤ n}

Fig. 11

Ms² , n = 2

2) χ = t 30

def

Mst = Ms²

S

{Mt + t#k | k ≥ −t − n & − 2n ≤ t ≤ −2 & 1 ≤ k ≤ n}

Mst , n = 2

Fig. 12

3) χ = ms def

Msms = Ms²

S

{Ms + t#k | k ≥ −t − n & − 2n ≤ t ≤ −1 & 1 ≤ k ≤ n}

Msms , n = 2

Fig. 13 We define | !δn∀,t (q1 , b) & q2 = δn∀,t (q1 , b)}| 187 6805 229025 7730973 267593313 9515031337 χ = ms |{< q1 , b, q2 > | !δn∀,ms (q1 , b) & q2 = δn∀,ms (q1 , b)}| 197 8307 317039 12126471 476227735

∀,t ∀,ms Minimality of A∀,² . n , An and An

χ Proposition 20 Let χ ∈ {², t, ms}. Let n ∈ N and A ∈ Istates . χ ∃b ∈ Σ∀n (!δn∀,χ (A, b) & δn∀,χ (A, b) ∈ Mstates ).

59

Then

χ Proof Let k be the least element of 5a (A). Obviously δn∀,χ (A, 1k ) ∈ Mstates .

Proposition 21 t 1) A ∈ Istates & I + i#e ∈ A ⇒ I + i + 1#e ∈ A t #e t 2) A ∈ Mstates & M + it ∈ A ⇒ M + i + 1#e ∈ A ms #e 3) A ∈ Istates & I + i#e ∈A s ∈A⇒I +i ms #e 4) A ∈ Mstates & M + i#e ∈ A ⇒ M + i ∈A s Proof S t t t 1) Let A ∈ Istates and I + i#e ∃B ∈ Istates Mstates ∃x ∈ t . Therefore S t ∀ ∀,t t ∀ Σn (δn (B, x) = A). Let B ∈ Istates Mstates and x ∈ Σn be such that δn∀,t (B, x) = A. Therefore ∃π ∈ B(I+i#e ∈ δe∀,t (π, x)∨I+i#e ∈ mn (δe∀,t (π, x), |x|)). t t #e #e ∀,t Let π ∈ B be such that I + it ∈ δe (π, x) or I + it ∈ mn (δe∀,t (π, x), |x|)). Therefore I + i + 1#e ∈ δe∀,t (π, x) or I + i + 1#e ∈ mn (δe∀,t (π, x), |x|)). It follows from I +i#e ∈ A that ¬∃π 0 ∈ A(π 0 m then it follows from 1) that L(A0 ) 6= 6∈ B) and ¬∃π 0 ∈ min(B 0 )∃j(π 0 = L(B 0 )). Obviously π 6∈ B 0 (I + i#m s m 0 ∀,ms I + js ). Since π ∈ A we have !δn (A0 , c) and M IN (δn∀,ms (A0 , c)) = m. Hence !δn∀,ms (B 0 , c) and L(δn∀,ms (A0 , c)) = L(δn∀,ms (B 0 , c)). Since π 6∈ B 0 and ¬∃π 0 ∈ min(B 0 )∃j(π 0 = I + jsm ) we have M IN (δn∀,ms (B 0 , c)) = m + 1. It follows from 1) that L(δn∀,ms (A0 , c)) 6= L(δn∀,ms (B 0 , c)). Contradiction. 3.1.3.2) π = I + i#m s . ms Let c = 02n+2 . Obviously !δn∀,ms (A, c), δn∀,ms (A, c) ∈ Istates , M IN (δn∀,ms (A, c)) = #m ∀,ms ∀,ms ∀,ms ms m, I+i ∈ δn (A, c), !δn (B, c), δn (B, c) ∈ Istates , M IN (δn∀,ms (B, c)) = #m m, I + i 6∈ δn∀,ms (B, c) and L(δn∀,ms (A, c)) = L(δn∀,ms (B, c)). Like in 3.1.3.1) we receive contradiction. 3.2) ∃π ∈ min(B)(π 6∈ min(A)) Like in 3.1) we receive contradiction. The Lemma is proved. We define f loor : P (Isχ ) × N + → P (Isχ ): def

f loor(X, 1) = min(X) def

f loor(X, i + 1) = min(X\(

[

1≤j≤i

63

f loor(X, j)))

Fig. 23

n = 5, X = {I − 2#2 , I − 1#2 , I + 1#3 , I + 2#3 , I + 5#5 }

χ d N: We define F LOOR : Istates × N + −→ ½ def e if ∃π ∈ f loor(X, s)(y(π) = e) F LOOR(X, s) = ¬! if f loor(X, s) = φ χ We prove with induction on i that ∀i∀A, B ∈ Istates (L(A) = L(B) ⇒ f loor(A, i) = f loor(B, i)). 1) i = 1 χ Let A, B ∈ Istates and L(A) = L(B). f loor(A, 1) = min(A) = min(B) = f loor(B, 1) χ 2) Induction hypothesis: ∀j ≤ i∀A, B ∈ Istates (L(A) = L(B) ⇒ f loor(A, j) = f loor(B, j)). χ We have to prove that ∀A, B ∈ Istates (L(A) = L(B) ⇒ f loor(A, i + 1) = χ f loor(B, i + 1)). Let A, B ∈ Istates and L(A) = L(B). Let us suppose that f loor(A, i + 1) 6= f loor(B, i + 1). 2.1) ∃π ∈ f loor(A, i + 1)(π 6∈ f loor(B, i + 1)) 2.1.1) ∃π ∈ f loor(A, i + 1)∃t∃r(π 6∈ f loor(B, i + 1) & π = I + t#r ) Let π = I+t#r ∈ f loor(A, i+1) and π 6∈ f loor(B, i+1). Hence !F LOOR(A, i+ 1) and !F LOOR(A, 1). Let f = F LOOR(A, i + 1) − F LOOR(A, 1) and x = 0n+t 1n+1−t . We build rows {Aα }fα=0 and {Bα }fα=0 in such way that

∀α ∈ [0, f ](

χ Aα , Bα ∈ Istates & L(Aα ) = L(Bα ) & π ∈ Aα & π 6∈ Bα & S ¬∃π 0 (π 0