Tony Graham mentioned in an email his use of Saxon’s optimization hint attribute xsl:function/@saxon:memo-function
to memo-ize the values of some functions. He had investigated it for his Open Source focheck project that checks XSL-FO scripts. I was intrigued as I had never used this technique,
and Tony kindly provided details for me and readers.
Memo-izing is where the implementation stores a function’s return value between invocations, so that it is not recalculated for the same arguments: it is a kind of single-value caching. This is a useful technique when you are validating XML documents that use subclassing (e.g. DITA, XSL-FO, or my highly generic documents), where attribute values supply the specific name for the element rather than the normal generic identifier: for example, a structured HTML document where the names of the structures you are validating are all in the @class attribute.
From Tony:
If, and it’s a big IF, your Schematron is doing a lot of running the same XPath test
expressions on the same values and you are also using either Saxon PE or Saxon EE
to run your XSLT2 binding, then you may benefit from using saxon:memo-function
(http://saxonica.com/documentation/index.html#!extensions/attributes/memo-function) to get Saxon to short-circuit reevaluating the same expression over and over again
and instead just return the result saved from the first time those parameter values
were used.
saxon:memo-function is a Saxon extension attribute that may be used with xsl:function.
With saxon:memo-function="yes", Saxon caches the result each time the function is
called with new parameter values. When the function is called again with the same
parameter values, Saxon returns the cached result for those parameter values rather
than reevaluating the function just to return the same result.
There are some caveats in the Saxon documentation about when not to use saxon:memo-function
(for example, when the function has side-effects or it accesses the current context),
but it should be generally usable with the expressions in Schematron tests.
Whether, and to what extent, using saxon:memo-function
can speed up your Schematron processing entirely depends on your Schematron and your
documents. As with most things to do with XSLT performance, you need to test it with
realistic documents and the particular XSLT processor version that you use before
you can say for sure.
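As a rough starting point for that kind of test, the following stand-alone stylesheet is a minimal sketch of a micro-benchmark. Everything in it is an assumption made for illustration: the my: namespace, the my:slow function, and the loop sizes are hypothetical and have nothing to do with focheck. It calls a deliberately expensive function 1000 times with only five distinct argument values, which is the situation where memo-izing should help most.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:my="http://example.com/ns/my">

  <xsl:output method="text"/>

  <!-- Hypothetical, deliberately slow function. Change "yes" to "no"
       and compare the elapsed time of the two runs. -->
  <xsl:function name="my:slow" as="xs:integer"
                saxon:memo-function="yes">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
        select="sum(for $i in 1 to 20000
                    return string-length(concat($input, string($i))))"/>
  </xsl:function>

  <!-- Run with "main" as the initial template; no source document is needed. -->
  <xsl:template name="main">
    <!-- Only five distinct argument values are ever passed to my:slow(). -->
    <xsl:variable name="values" as="xs:string+"
                  select="('a', 'b', 'c', 'd', 'e')"/>
    <!-- 1000 calls in total, so most calls repeat an earlier argument. -->
    <xsl:value-of
        select="sum(for $n in 1 to 1000
                    return my:slow($values[($n mod 5) + 1]))"/>
  </xsl:template>

</xsl:stylesheet>
```

A synthetic test like this only shows whether the attribute has any effect in your Saxon edition and version. It does not predict the result for a real Schematron schema, where the functions are usually much cheaper and the argument values much more varied.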
An example where saxon:memo-function
does help is the focheck framework (https://github.com/AntennaHouse/focheck) for validating XSL-FO files. XSL-FO properties are expressed as XML attributes,
which focheck needs to check. However, property values can be expressions in the
expression language defined in the XSL 1.1 Recommendation (https://www.w3.org/TR/xsl11/#d0e5032), so focheck has to evaluate the property value expressions before working out if
the result is an allowed value for the current property. As a consequence, the focheck
Schematron has hundreds of:
<let name="expression" value="ahf:parser-runner(.)"/>
where ahf:parser-runner()
is:
<!-- ahf:parser-runner($input as xs:string) as element()+ -->
<!-- Runs the REx-generated parser on $input then reduces the parse
tree to a XSL 1.1 datatype. Uses @saxon:memo-function extension
to memorize return values (when used with Saxon PE or Saxon EE)
to avoid reparsing the same strings again and again when this is
used as part of validating an entire XSL-FO document. -->
<xsl:function name="ahf:parser-runner" as="element()+"
              saxon:memo-function="yes"
              xmlns:saxon="http://saxon.sf.net/">
  <xsl:param name="input" as="xs:string" />
  ...
</xsl:function>
saxon:memo-function is used to avoid parsing the same expression over and over again
just to return the same result. There are three main reasons why this is useful: the
desire for a consistent appearance in the formatted result means that a lot of the
property values are repeated throughout the XSL-FO document; XSL-FO documents are
usually generated using XSLT, so generating the same property values for the same
element type is easy and happens a lot; and, lastly, most property values in the
XSL-FO XML are single tokens rather than complex expressions, so running the parser
on a single token that’s been seen before adds a lot of overhead compared to just
using saxon:memo-function.
To test the effect of saxon:memo-function, I validated an 808 kB XSL-FO document in
oXygen 18.1 using focheck, alternately with saxon:memo-function enabled and disabled.
To avoid any influence from oXygen possibly caching the XSLT stylesheet that implements
the parser, oXygen was restarted each time the stylesheet was changed. With
saxon:memo-function="yes", the Schematron component of validating the document took
a minimum of 2 seconds; with saxon:memo-function="no", the Schematron component took
a minimum of 5 seconds.
saxon:memo-function applies only to xsl:function. If you are doing a lot of the same
tests but don’t have an xsl:function for the tests to which you could add
saxon:memo-function, then you might want to add an xsl:function just so that you can
add saxon:memo-function to it. As before, however, you need to test with your own
documents to determine whether or not that’s useful to you.
Consider a DITA-related Schematron pattern such as:
<sch:pattern>
  <sch:rule context="*[contains(@class, ' custom/paragraph ')]">
    <sch:extends rule="custom-paragraph"/>
  </sch:rule>
  <sch:rule context="*[contains(@class, ' topic/p ')]">
    <sch:extends rule="topic-p"/>
  </sch:rule>
  ...
</sch:pattern>
There’s potentially a lot of string matching going on to determine which rule applies
to an element. If you think that there is enough repeated testing of the same values
to make it worthwhile to use saxon:memo-function, then you could change the pattern to:
<sch:pattern>
  <sch:rule context="*[my:contains(@class, ' custom/paragraph ')]">
    <sch:extends rule="custom-paragraph"/>
  </sch:rule>
  <sch:rule context="*[my:contains(@class, ' topic/p ')]">
    <sch:extends rule="topic-p"/>
  </sch:rule>
  ...
</sch:pattern>
where my:contains()
is:
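<!-- my:contains($class as xs:string?, $specialisation as xs:string) as xs:boolean -->
<!-- Memo-ized wrapper around contains(). $class is declared as xs:string?
     so that an element with no @class attribute yields false() rather
     than a type error, matching the behaviour of contains(). -->
<xsl:function name="my:contains" as="xs:boolean"
              saxon:memo-function="yes"
              xmlns:saxon="http://saxon.sf.net/">
  <xsl:param name="class" as="xs:string?" />
  <xsl:param name="specialisation" as="xs:string" />
  <xsl:sequence select="contains($class, $specialisation)" />
</xsl:function>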
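To show where such a function might live, here is a minimal sketch of a complete schema, assuming a Schematron implementation that uses the XSLT 2 query binding and passes xsl:function elements embedded in the schema through to the generated stylesheet; check whether yours does. The namespace URI bound to my:, the abstract rule and its assertion are all hypothetical and only there to make the example self-contained.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:saxon="http://saxon.sf.net/"
            queryBinding="xslt2">

  <!-- Hypothetical namespace, declared so that the my: prefix can be
       used in rule contexts and tests. -->
  <sch:ns prefix="my" uri="http://example.com/ns/my"/>

  <!-- Memo-ized wrapper around contains(). $class is xs:string? so that
       elements without a @class attribute do not cause a type error. -->
  <xsl:function name="my:contains" as="xs:boolean"
                xmlns:my="http://example.com/ns/my"
                saxon:memo-function="yes">
    <xsl:param name="class" as="xs:string?"/>
    <xsl:param name="specialisation" as="xs:string"/>
    <xsl:sequence select="contains($class, $specialisation)"/>
  </xsl:function>

  <sch:pattern>
    <!-- Hypothetical abstract rule; the assertion is illustrative only. -->
    <sch:rule abstract="true" id="topic-p">
      <sch:assert test="normalize-space(.) != ''">A paragraph should not be empty.</sch:assert>
    </sch:rule>

    <sch:rule context="*[my:contains(@class, ' topic/p ')]">
      <sch:extends rule="topic-p"/>
    </sch:rule>
  </sch:pattern>

</sch:schema>
```

The abstract rule sits in the same pattern as the rule that extends it, which is where sch:extends looks for it.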
Memoization is most applicable (perhaps only useful) where the same sets of parameter values recur often in your function calls, so that the overhead of checking and updating the cache is less than the time saved by returning the known value on the second and subsequent calls with the same parameter values.
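That trade-off can be written as a rough break-even condition. This is a simplified cost model and not a description of how Saxon works internally: the symbols are assumptions introduced here, with $N$ the total number of calls, $U$ the number of distinct sets of parameter values, $c_f$ the average cost of evaluating the function body, and $c_m$ the average per-call overhead of looking up or storing a cached result.

```latex
\begin{align*}
  T_{\text{plain}} &= N\,c_f \\
  T_{\text{memo}}  &= N\,c_m + U\,c_f \\
  T_{\text{memo}} < T_{\text{plain}}
    &\iff c_m < \Bigl(1 - \frac{U}{N}\Bigr)\,c_f
\end{align*}
```

Here $1 - U/N$ is the fraction of calls that repeat earlier parameter values. Under this model, an expensive function such as a parser run pays off even at a modest repeat rate, while a cheap function only pays off if the lookup is cheaper still.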
Now, if Saxon uses a string comparison rather than, say, a hash to find previously
used string parameters, then my DITA example would make things slower because Saxon would
be checking to the end of both strings every time instead of contains()
returning as soon as it found the second string inside the first.
For a document that contains a large enough number of paragraphs, the overhead added
by saxon:memo-function
could be outweighed by the saving in not performing as many string comparisons, but
for smaller documents containing fewer paragraphs, there might be no advantage. Whether
or not saxon:memo-function
can speed up your Schematron processing is something that you’ll have to determine
for yourself.