The Undecidability of the Schematron Conformance Problem

by Bob Lyons
November, 2002

Is It Codeable?

Years ago, I worked with my good friend Jim at a telecom company. He wrote the requirements for new network features that I developed. I remember when he burst into my office one day to describe a great idea for a new feature. After he finished telling me about the feature, he asked "Is it codeable?" I recall being impressed that Jim, who majored in engineering rather than computer science, had an intuitive understanding that not all programming problems are solvable.

I was a computer science student when I first learned that there are math and computer science problems that are undecidable (i.e., unsolvable). Like many students, I was surprised to learn that mathematicians and computer scientists have proved the undecidability of a variety of problems. A problem is undecidable if it is not possible to devise an algorithm that solves the problem. It's widely believed that if a problem is undecidable, then it's not "codeable" (i.e., it can't be solved using software). Perhaps the most famous undecidable problem is Alan Turing's Halting Problem.

While surfing the web recently, I learned about the Post Correspondence Problem (PCP), which is another famous undecidable problem, which we'll examine below. The proof that the PCP is undecidable can be built upon the undecidability of the Halting Problem. The undecidability of the PCP has often been used to prove the undecidability of other problems. It occurred to me that the undecidability of the PCP could be used to prove the undecidability of the Schematron Conformance Problem.

In the remainder of this paper, we'll describe the Schematron Conformance Problem (SCP) and Post Correspondence Problem (PCP). We'll then prove that the Schematron Conformance Problem is undecidable (even if we restrict ourselves to Schematron schemas that don't use the document and key functions). We assume that you are familiar with the basics of Schematron, which is a powerful XML schema language in which validity constraints are defined using XPath expressions. For information about Schematron, please see the Schematron resources that are listed in the Further Reading section below.

The following is the outline of this paper:

What is the Schematron Conformance Problem (SCP)?

The Schematron Conformance Problem (SCP) is defined as follows:

Given a Schematron schema, is there an XML document that conforms to this schema?

When we say that the SCP is undecidable, we mean that it's not possible to devise an algorithm that accepts any Schematron schema as input and responds with one of the following two outputs:

  • "Yes" if there is an XML document that conforms to the Schematron schema.
  • "No" if there is no XML document that conforms to the Schematron schema.

The fact that the SCP is undecidable doesn't mean that you can't devise a specific algorithm that determines if there is an XML document that conforms to a particular schema. The undecidability of the SCP does mean that it is impossible to devise a general algorithm that can make this determination for any Schematron schema.

Note that the SCP presumes that there are Schematron schemas for which there are no conforming XML documents. In other words, the SCP presumes that you can find a Schematron schema, such that there are no XML documents that conform to this Schematron schema. For example, the following is a Schematron schema that has no conforming XML documents:

<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">

    <!-- This schematron does not match any XML document.
    -->

    <pattern name="check-root">
        <rule context="/">
            <assert test="count(*) != 1">
                Error: The number of document elements is
                equal to 1.
            </assert>
        </rule>
    </pattern>

</schema>

An XML document would conform to this Schematron schema only if the document does not have exactly one document element. However, all XML documents have exactly one document element. So, there are no XML documents that conform to this Schematron schema.

The fact that the SCP is undecidable means that a Schematron authoring tool (i.e., an editor) would not be able to analyze the user's schema and always let the user know whether or not there are any XML documents that conform to the schema.

Why is the SCP is Unsolvable?

To prove that the SCP is undecidable, we show, by way of contradiction, that if the SCP is decidable, then the PCP is decidable. But before delving into the proof, let's describe the PCP. (If you are already familiar with the PCP, then you may want to jump ahead to the Back to the Proof section.)

The Post Correspondence Problem (PCP)

In 1946, Emil Post proved that the following problem is undecidable:

Let A be an alphabet with at least two letters, and let L and M be two lists of nonempty strings over A, such that L and M have the same number of strings. We can represent L and M as follows:

  • L = ( s1, s2, s3, ..., sk )
  • M = ( t1, t2, t3, ..., tk )

Does there exist a sequence of one or more integers, which we represent as ( i1, i2, i3, ..., im ), that meet the following requirements?

  • Each of the integers is greater than or equal to one.
  • Each of the integers is less than or equal to k. (Recall that each list has k strings).
  • The concatenation of si1, si2, si3, ..., and sim is equal to the concatenation of ti1, ti2, ti3, ..., and tim.

Let's look at some examples. We'll look at three instances of the PCP; the first two have solutions and the third one has no solution.

Example #1 of a PCP Instance

Consider the following instance of the PCP:

  • Alphabet A = { a, b }
  • List L = ( a, ab )
  • List M = ( aa, b )

To enhance the readability of the solution to this PCP instance, we use different colors for the strings in each list; note that the color of the ith string of list L is the same as the color of the ith string of list M.

We see that ( 1, 2 ) is a sequence of integers that solves this PCP instance, since the concatenation of "a" and "ab" is equal to the concatenation of "aa" and "b" (i.e., aab = aab).

Other solutions include:

  • ( 1, 2, 1, 2 )
  • ( 1, 2, 1, 2, 1, 2 )
  • and so on

Example #2 of a PCP Instance

Consider the following instance of the PCP (from David Matuszek's Post's Correspondence Problem course notes):

  • Alphabet A = { a, b }
  • List L = ( a, abaaa, ab )
  • List M = ( aaa, ab, b )

A sequence of integers that solves this problem is ( 2, 1, 1, 3 ), since the concatenation of "abaaa", "a", "a" and "ab" is equal to the concatenation of "ab", "aaa", "aaa" and "b" (i.e., abaaaaaab = abaaaaaab).

Example #3 of a PCP Instance

Consider the following instance of the PCP:

  • Alphabet A = { a, b, c }
  • List L = ( a, b, c )
  • List M = ( aa, bb, cc )

There is no solution to this problem, since for any potential solution, the concatenation of the strings from list L will contain half as many letters as the concatenation of the corresponding strings from list M.

An Alternative Description of the PCP

The following is a less abstract description of the PCP:

We have a finite number of rectangles. Each rectangle consists of an upper half and a lower half. On each half of each rectangle, there is a non-empty string (e.g., "aabb"). The following is an illustration of such a rectangle:

aabb

ccab

Can one or more of these rectangles be arranged in a row (left to right), such that the following requirements are met?

  • A rectangle may appear in the row once, multiple times or not at all.
  • When constructing the row, you are free to place the rectangles in any order. However, you can't rotate any of the rectangles (e.g., you can't flip one upside down).
  • The concatenation of the strings in the upper half of the row is equal to the concatenation of the strings in the lower half of the row.

Reformulation of Example #1

So, we can reformulate our first example as the following set of rectangles:

a

aa

ab

b

The strings in the top halves of the rectangles are the strings in list L, and the strings in the bottom halves of the rectangles are the strings in list M. The first string in list L (i.e., "a") is the string in the top half of the first rectangle, and the second string in list L (i.e., "ab") is the string in the top half of the second rectangle. Likewise, the first string in list M (i.e., "aa") is the string in the bottom half of the first rectangle, and the second string in list L (i.e., "b") is the string in the bottom half of the second rectangle.

This instance of the PCP can be solved with the following row, since the concatenation of the strings in the top half of the row (i.e., "aab") is equal to the concatenation of the strings in the bottom half of the row:

a

ab

aa

b

Reformulation of Example #2

We can reformulate our second example as the following set of rectangles:

a

aaa

abaaa

ab

ab

b

This instance of the PCP can be solved with the following row, since the concatenation of the strings in the top half of the row is "abaaaaaab" and the concatenation of the strings in the bottom half of the row is also "abaaaaaab":

abaaa

a

a

ab

ab

aaa

aaa

b

Back to the Proof

Now that we understand the PCP, we can get back to our proof that the SCP is undecidable.

Recall that we assumed that the SCP is solvable (i.e., we assumed that there is an algorithm for solving the SCP). Now we extend this algorithm into a new algorithm that solves the PCP.

The input to the new algorithm is an instance of the PCP (i.e., two lists of nonempty strings over an alphabet containing at least two letters, such that the two lists have the same number of strings).

The following is the new algorithm that solves the PCP:

  1. We assume that the strings in the two lists do not contain any whitespace characters. (For PCP instances where this assumption is not true, we could always transform the PCP instance into an equivalent instance by replacing each whitespace character with a non-whitespace character.)
  2. We transform the PCP instance into a Schematron schema, such that only XML documents that contain a solution to the PCP instance will conform to the Schematron schema. We'll demonstrate in the next section how to construct such Schematron schemas.
  3. We use the algorithm that solves the SCP to determine whether or not there are any XML documents that match our Schematron schema. If there is an XML document that matches our Schematron schema, then our new algorithm responds with "Yes" (i.e., yes, there is a solution to the PCP instance). If there are no XML documents that match our Schematron schema, then our new algorithm responds with "No" (i.e., no, there is no solution to the PCP instance).

So, we now have an algorithm that solves the PCP. However, the PCP is unsolvable. So, our initial assumption (that there is an algorithm for solving the SCP) must be false. Therefore, the SCP is undecidable.

Now we'll show how a PCP instance can be transformed into a Schematron schema, such that an XML document conforms to the schema only if the XML document contains a solution to the PCP instance.

XML Document Containing a Solution from Example #2

First, we describe the format of the XML documents that contain solutions to PCP instances. The following XML document contains a solution for the PCP instance of example #2:

<?xml version="1.0"?>
<pcp-instance-solution>
    <list-1>
        <string id="2">abaaa</string>
        <string id="1">a</string>
        <string id="1">a</string>
        <string id="3">ab</string>
    </list-1>
    <list-2>
        <string id="2">ab</string>
        <string id="1">aaa</string>
        <string id="1">aaa</string>
        <string id="3">b</string>
    </list-2>
</pcp-instance-solution>

Recall that a solution to this PCP instance is the following sequence of integers: ( 2, 1, 1, 3 ). This sequence of integers is specified by the id attributes of the /pcp-instance-solution/list-1/string elements; the solution is specified again by the id attributes of the /pcp-instance-solution/list-2/string elements.

The /pcp-instance-solution/list-1 element contains the 2nd string of list L, followed by the 1st string of list L, followed by the 1st string of list L, followed by the 3rd string of list L. The /pcp-instance-solution/list-2 element contains the 2nd string of list M, followed by the 1st string of list M, followed by the 1st string of list M, followed by the 3rd string of list M.

Notice that the concatenation of the values of the /pcp-instance-solution/list-1/string elements is equal to the concatenation of the values of the /pcp-instance-solution/list-2/string elements. (In other words, "abaaaaaab" is equal to "abaaaaaab".)

You may have noticed that this XML document contains much more information than the sequence of integers that solves the problem. The extra information makes it easier for the reader to verify the solution. Also, if the document merely specified a list of integers that solved the PCP instance, then I would not have been able to write a Schematron schema that verifies that the solution is correct (unless the schema used the document function to invoke a CGI script, which would be cheating!).

Schematron Schema for Example #2

The following is a Schematron schema for example #2. An XML document will match this schema only if it provides a solution for the PCP instance of example #2.

<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">

    <!-- This Schematron schema is used to verify that 
         an XML document contains a solution to the 
         following instance of the Post Correspondence 
         Problem (PCP): 

         Alphabet A = { a, b }

         List L = ( a, abaaa, ab )

         List M = ( aaa, ab, b )
         
         An XML document conforms to this Schematron
         schema if it contains a solution to this
         PCP instance.

         If an XML document conforms to this Schematron
         schema, then the schema will not issue any
         messages when validating the document. If an 
         XML document does not conform to this Schematron 
         schema, then the schema will issue one or more 
         error messages when validating the document.
    -->

    <pattern name="check-root">
        <rule context="/">
            <assert test="pcp-instance-solution">
                The /pcp-instance-solution element was not
                found.
            </assert>
        </rule>
    </pattern>

    <pattern name="check-pcp-instance-solution">
        <rule context="/pcp-instance-solution">
            <assert test="list-1">
                The /pcp-instance-solution/list-1 element 
                is missing.
            </assert>
            <assert test="list-2">
                The /pcp-instance-solution/list-2 element 
                is missing.
            </assert>
            <assert test="count(*) = 2">
                The /pcp-instance-solution element has an
                incorrect number of subelements.
            </assert>
            <assert test="translate( 
                                 normalize-space(list-1),
                                 ' ',
                                 '' ) =
                          translate( 
                                 normalize-space(list-2), 
                                 ' ',
                                 '' )">
                The concatenation of the 
                /pcp-instance-solution/list-1/string/text()
                nodes is not equal to the 
                concatenation of the 
                /pcp-instance-solution/list-2/string/text()
                nodes.
            </assert>
        </rule>
    </pattern>

    <pattern name="check-list-1">
        <rule context=
          "/pcp-instance-solution/list-1/string[@id='1']">
            <assert test=". = 'a'">
                The value of the 
                /pcp-instance-solution/list-1/string[@id='1']
                element is not equal to 'a'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-1/string[@id='2']">
            <assert test=". = 'abaaa'">
                The value of the 
                /pcp-instance-solution/list-1/string[@id='2']
                element is not equal to 'abaaa'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-1/string[@id='3']">
            <assert test=". = 'ab'">
                The value of the 
                /pcp-instance-solution/list-1/string[@id='3']
                element is not equal to 'ab'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-1/string">
            <assert test="@id = '1' or 
                          @id = '2' or 
                          @id = '3'">
                The value of a 
                /pcp-instance-solution/list-1/string/@id
                attribute is not equal to '1', '2' or '3'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-1/text()">
            <assert test="normalize-space( . ) = ''">
                Found a non-whitespace text node as 
                child of the
                /pcp-instance-solution/list-1 element. 
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-1/*">
            <assert test="name( . ) = 'string'">
                The /pcp-instance-solution/list-1 element
                has a subelement other than a string 
                element.
            </assert>
            <assert test="@id">
                A subelement of the 
                /pcp-instance-solution/list-1 element 
                has no id attribute.
            </assert>
        </rule>
        <rule context="/pcp-instance-solution/list-1">
            <assert test="string">
                The /pcp-instance-solution/list-1 element
                does not contain any string subelements.
            </assert>
        </rule>
        <rule context="/pcp-instance-solution/list-1">
            <assert test="count( string ) =
                          count( ../list-2/string )">
                The /pcp-instance-solution/list-1 element
                does not have the same number of string 
                elements as the 
                /pcp-instance-solution/list-2 element.
            </assert>
        </rule>
    </pattern>

    <pattern name="check-list-2">
        <rule context=
          "/pcp-instance-solution/list-2/string[@id='1']">
            <assert test=". = 'aaa'">
                The value of the 
                /pcp-instance-solution/list-2/string[@id='1']
                element is not equal to 'aaa'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-2/string[@id='2']">
            <assert test=". = 'ab'">
                The value of the 
                /pcp-instance-solution/list-2/string[@id='2']
                element is not equal to 'ab'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-2/string[@id='3']">
            <assert test=". = 'b'">
                The value of the 
                /pcp-instance-solution/list-2/string[@id='3']
                element is not equal to 'b'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-2/string">
            <assert test="@id = '1' or 
                          @id = '2' or 
                          @id = '3'">
                The value of the 
                /pcp-instance-solution/list-2/string/@id
                attribute is not equal to '1', '2' or '3'.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-2/text()">
            <assert test="normalize-space( . ) = ''">
                Found a non-whitespace text node as the
                child of the
                /pcp-instance-solution/list-2 element. 
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-2/*">
            <assert test="name( . ) = 'string'">
                The /pcp-instance-solution/list-2 element
                has a subelement other than a string 
                element.
            </assert>
            <assert test="@id">
                A subelement of the 
                /pcp-instance-solution/list-2 element
                has no id attribute.
            </assert>
        </rule>
        <rule context=
          "/pcp-instance-solution/list-2">
            <assert test="string">
                The /pcp-instance-solution/list-2 element
                does not contain any string subelements.
            </assert>
        </rule>
    </pattern>

    <pattern name="check-correspondence">
        <rule context=
          "/pcp-instance-solution/list-1/string">
            <assert test=
              "@id = ../../list-2/string[ 
                   count( current()/preceding-sibling::* )
                   + 1 ]/@id">
                The value of an id attribute in a 
                /pcp-instance-solution/list-1/string 
                element is not equal to the value of the
                id attribute of the corresponding 
                /pcp-instance-solution/list-2/string 
                element.
            </assert>
        </rule>
    </pattern>

</schema>

The validations performed by this schema include the following, which verify that the solution solves the PCP instance:

  • The concatenation of the values of the string elements in the list-1 element must be equal to the concatenation of the values of the string elements in the list-2 element.
  • The value of each list-1/string element must be equal to the ith string in list L, where i is the value of the id attribute of the list-1/string element.
  • Likewise, the value of each list-2/string element must be equal to the ith string in list M, where i is the value of the id attribute of the list-2/string element.
  • The value of the @id attribute in the nth string element of the list-1 element must be equal to the value of the @id attribute in the nth string element of the list-2 element.
  • The list-1 element must contain at least one string subelement.
  • Likewise, the list-2 element must contain at least one string subelement.
  • The list-1 element must have the same number of string subelements as the list-2 element.

The schema also performs the following validations of the structure of the document:

  • The document element must be pcp-instance-solution.
  • The subelements of the pcp-instance-solution must be the list-1 and list-2 elements.
  • The subelements of the list-1 and list-2 elements must be string elements.
  • Each string element must contain an id attribute.
  • Each string element must not have any subelements and must not contain any attributes other than the id attribute.
  • Etc.

The Schematron Intersection Problem

A corollary of the undecidability of the Schematron Conformance Problem (SCP) is that the Schematron Intersection Problem (SIP) is undecidable. The SIP is defined as follows:

Given two Schematron schemas, is there an XML document that conforms to both schemas?

To prove the corollary, we show, by way of contradiction, that if the SIP is decidable, then the SCP is decidable. If the SIP is decidable, then we can devise an algorithm that accepts any two Schematron schemas as input and responds with one of the following:

  • "Yes" if there is an XML document that conforms to both schemas.
  • "No" if there is no XML document that conforms to both schemas.

Now we extend this algorithm into a new algorithm that solves the SCP.

The input to the new algorithm is a Schematron schema, which we'll call S. We pass S and another Schematron schema, which we'll call T, as inputs to the algorithm that solves the SIP. T is the following Schematron schema, which matches all XML documents:

<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">

    <!-- This schematron schema matches any XML document.
    -->

    <pattern name="check-root">
        <rule context="/">
            <assert test="count(*) = 1">
                Error: The number of document elements is
                not equal to 1.
            </assert>
        </rule>
    </pattern>

</schema>

Our new algorithm responds with one of the following:

  • "Yes" if the algorithm that solves the SIP determines that there is an XML document that conforms to both S and T. Thus, the new algorithm responds with "Yes" if there is an XML document that conforms to S.
  • "No" if the algorithm that solves the SIP determines that there is no XML document that conforms to both S and T. Thus, the new algorithm responds with "No" if there is no XML document that conforms to S, since all XML documents conform to T.

Thus, the new algorithm solves the SCP (i.e., the SCP is decidable). However, since we showed that the SCP is undecidable, our initial assumption (that the SIP is decidable) must be false. Therefore, the SIP is undecidable.

What About Other XML Schema Languages?

Is the schema conformance problem undecidable for other XML schema languages, such as XML Schema, DTD and RELAX NG? In their paper entitled What's Hard about XML Schema Constraints?, Marcelo Arenas, Wenfei Fan and Leonid Libkin show that the XML Schema Conformance Problem is undecidable.

The DTD Conformance Problem looks solvable. It seems easy to construct an XML document that conforms to a DTD, as long as the DTD contains at least one element type declaration and contains an element type declaration for each child element type that is mentioned in the element type declarations.

It is possible to write a RELAX NG schema, such that there are no XML documents that conform to the schema. The trick is to use a data type that includes two contradictory constraints (e.g., a string must consist of an even number of 'a' characters, but must be 3 characters in length).

For example, there are no XML documents that conform to the following RELAX NG schema:

<?xml version="1.0" encoding="UTF-8"?>
<element name="conflicted"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary=
           "http://www.w3.org/2001/XMLSchema-datatypes">
    <data type="string">
        <param name="pattern">(aa)+</param>
        <param name="length">3</param>
    </data>
</element>

So, the RELAX NG Conformance Problem is an interesting problem, but it doesn't seem to be unsolvable.

Summary

In this paper, we showed that the Schematron Conformance Problem (SCP) is undecidable. This is true even if we restrict ourselves to Schematron schemas that do not use the document and key functions of XPath. To prove the undecidability of the SCP, we showed that if the SCP is decidable, then the Post Correspondence Problem is decidable; however, the Post Correspondence Problem is undecidable. Therefore, the SCP must be undecidable.

The fact that the SCP is undecidable means that it's impossible to build a Schematron schema editor that evaluates the user's schema and always lets him/her know whether or not there are any XML documents that conform to the schema.

Acknowledgements

Thanks to John Cowan for his comments on DTD conformance.

Further Reading

The following resources provide information about XML schema languages:

The following resources provide information about the Post Correspondence Problem, the Halting Problem and undecidability: