Parse multi-document RELAX-NG schema using libxml2

I want to convert a RELAX-NG schema to a schemaInfo object so that it can be used in CodeMirror for XML completion.

https://codemirror.net/demo/xmlcomplete.html
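To make the goal concrete, here is a minimal sketch in Python (standard library only, not libxml2) of turning a single-document RELAX-NG grammar into a dictionary shaped like the schemaInfo object used by the CodeMirror demo ("!top", plus "attrs" and "children" per element). The function name schema_info and the sample grammar are hypothetical. The sketch assumes includes have already been resolved, ignores name classes, data types and duplicate definitions, and marks every attribute as free-form.

```python
import json
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

def schema_info(rng_text):
    """Build a CodeMirror-style schemaInfo dict from one RELAX-NG document."""
    root = ET.fromstring(rng_text)
    defines = {d.get("name"): d for d in root.iter(RNG + "define")}
    info = {"!top": []}

    def collect(node, attrs, children, seen):
        # Gather attribute and child element names reachable from `node`
        # without descending into nested <element> patterns.
        for child in node:
            if child.tag == RNG + "attribute" and child.get("name"):
                attrs[child.get("name")] = None  # None: any value allowed
            elif child.tag == RNG + "element":
                name = child.get("name")
                if name and name not in children:
                    children.append(name)
            elif child.tag == RNG + "ref":
                name = child.get("name")
                if name in defines and name not in seen:
                    seen.add(name)  # guards against recursive definitions
                    collect(defines[name], attrs, children, seen)
            else:
                collect(child, attrs, children, seen)

    for element in root.iter(RNG + "element"):
        name = element.get("name")
        if not name:
            continue
        attrs, children = {}, []
        collect(element, attrs, children, set())
        entry = info.setdefault(name, {"attrs": {}, "children": []})
        entry["attrs"].update(attrs)
        entry["children"] += [c for c in children if c not in entry["children"]]

    start = root.find(RNG + "start")
    if start is not None:
        collect(start, {}, info["!top"], set())
    return info

# Hypothetical sample grammar, used only to illustrate the output shape.
SAMPLE_RNG = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="doc"/></start>
  <define name="doc">
    <element name="doc">
      <attribute name="lang"/>
      <zeroOrMore><ref name="para"/></zeroOrMore>
    </element>
  </define>
  <define name="para">
    <element name="para"><attribute name="id"/><text/></element>
  </define>
</grammar>
"""

print(json.dumps(schema_info(SAMPLE_RNG), indent=2))
```

The printed JSON can be pasted into the hint options of the editor. Restricting attribute values or handling patterns such as choice would need more work than this sketch does.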

xmllint usage

libxml2 already supports multi-document RELAX-NG schemas when validating a document like this (note that xmllint selects RELAX-NG with --relaxng; --schema is for W3C XML Schema):

xmllint --relaxng myschema.rng mydoc.xml

Question

Can libxml2 also be used to parse a multi-document schema file?

Here is an example of a multi-document schema:

Here is some libxml2 functionality I don't understand but which could be helpful:

Assumption

I think I have to convert the multi-document schema into a single document schema using tools like: https://github.com/h4l/rnginline/tree/master/rnginline

Using libxml2 directly would be great since I could then support schemas without pre-processing.

update 3.5.2016

As you can see, parsing the RELAX-NG schema shows only the top-level file; the output does not contain any of the files pulled in by the include directive in the main RELAX-NG file (note: RELAX-NG schemas can be split into several files).

<!-- XHTML Basic -->

<grammar ns="http://www.w3.org/1999/xhtml"
         xmlns="http://relaxng.org/ns/structure/1.0">

<include href="modules/datatypes.rng"/>
<include href="modules/attribs.rng"/>
<include href="modules/struct.rng"/>
<include href="modules/text.rng"/>
<include href="modules/hypertext.rng"/>
<include href="modules/list.rng"/>
<include href="modules/basic-form.rng"/>
<include href="modules/basic-table.rng"/>
<include href="modules/image.rng"/>
<include href="modules/param.rng"/>
<include href="modules/object.rng"/>
<include href="modules/meta.rng"/>
<include href="modules/link.rng"/>
<include href="modules/base.rng"/>

</grammar>
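A plain XML parse never follows href attributes, so the included files have to be visited explicitly. The sketch below, in Python with the standard library only (not libxml2), lists every file reachable through include or externalRef. The function name included_files, the load callback and the in-memory FILES table are hypothetical, and hrefs are assumed to be relative POSIX-style paths rather than general URIs.

```python
import posixpath
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

def included_files(path, load, seen=None):
    """Return every schema file reachable from `path` through <include>
    or <externalRef>, in first-visit order, starting with `path` itself."""
    seen = [] if seen is None else seen
    if path in seen:
        return seen  # already visited, also stops reference cycles
    seen.append(path)
    root = ET.fromstring(load(path))
    for node in root.iter():
        if node.tag in (RNG + "include", RNG + "externalRef") and node.get("href"):
            # Resolve the href against the directory of the referencing file.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(path), node.get("href")))
            included_files(target, load, seen)
    return seen

# Hypothetical in-memory schema set standing in for files on disk.
FILES = {
    "main.rng": '<grammar xmlns="http://relaxng.org/ns/structure/1.0">'
                '<include href="modules/a.rng"/><include href="modules/b.rng"/>'
                '</grammar>',
    "modules/a.rng": '<grammar xmlns="http://relaxng.org/ns/structure/1.0">'
                     '<include href="b.rng"/></grammar>',
    "modules/b.rng": '<grammar xmlns="http://relaxng.org/ns/structure/1.0"/>',
}

print(included_files("main.rng", FILES.__getitem__))
```

For files on disk, a loader such as lambda p: open(p, "rb").read() could be passed instead of the dictionary lookup.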

source code

/**
 * section: Tree
 * synopsis: Navigates a tree to print element names
 * purpose: Parse a file to a tree, use xmlDocGetRootElement() to
 *          get the root element, then walk the document and print
 *          all the element name in document order.
 * usage: tree1 filename_or_URL
 * test: tree1 test2.xml > tree1.tmp && diff tree1.tmp $(srcdir)/tree1.res
 * author: Dodji Seketeli
 * copy: see Copyright for the status of this software.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

#ifdef LIBXML_TREE_ENABLED


#define ANSI_COLOR_RED     "\x1b[31m"
#define ANSI_COLOR_GREEN   "\x1b[32m"
#define ANSI_COLOR_YELLOW  "\x1b[33m"
#define ANSI_COLOR_BLUE    "\x1b[34m"
#define ANSI_COLOR_MAGENTA "\x1b[35m"
#define ANSI_COLOR_CYAN    "\x1b[36m"
#define ANSI_COLOR_RESET   "\x1b[0m"


/*
 *To compile this file using gcc you can type
 *gcc `xml2-config --cflags --libs` -o xmlexample libxml2-example.c
 */

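/**
 * pad:
 * @depth: nesting depth of the node being printed.
 *
 * Returns a string of depth+1 spaces. The buffer is static, so the
 * result is only valid until the next call to pad().
 */
static const char *pad(int depth) {
  static char str[2000];
  if (depth < 0)
    depth = 0;
  if (depth > (int)sizeof(str) - 2)
    depth = (int)sizeof(str) - 2;
  for (int i = 0; i <= depth; i++) {
    str[i] = ' ';
  }
  str[depth + 1] = 0;
  return str;
}

/**
 * print_element_names:
 * @a_node: the initial xml node to consider.
 * @depth: nesting depth, used for indentation.
 *
 * Prints the names and attributes of all the xml elements
 * that are siblings or children of a given xml node.
 */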

static void
print_element_names(xmlNode * a_node, int depth)
{
    xmlNode *cur_node = NULL;

    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        if (cur_node->type == XML_ELEMENT_NODE) {
//        if (strcmp(cur_node->name, "element") == 0) {
//             printf("node type: Element, name: %s\n", cur_node->name);
            printf("%s %s\n", pad(depth), cur_node->name);
            for(xmlAttrPtr attr = cur_node->properties; NULL != attr; attr = attr->next)
            {
                printf("%s", ANSI_COLOR_MAGENTA);
                printf("%s %s: ", pad(depth), attr->name);
                xmlChar* value = xmlNodeListGetString(cur_node->doc, attr->children, 1);
                printf("%s \n", value ? (const char *)value : "");
                xmlFree(value); /* xmlNodeListGetString returns an allocated copy */
                printf("%s", ANSI_COLOR_RESET);
            }
//   }

        }

        print_element_names(cur_node->children, depth+1);
    }
}


/**
 * Simple example to parse a file called "file.xml",
 * walk down the DOM, and print the name of the
 * xml elements nodes.
 */
int
main(int argc, char **argv)
{
    xmlDoc *doc = NULL;
    xmlNode *root_element = NULL;

    if (argc != 2)
        return(1);

    /*
     * this initialize the library and check potential ABI mismatches
     * between the version it was compiled for and the actual shared
     * library used.
     */
    LIBXML_TEST_VERSION

    /*parse the file and get the DOM */
    doc = xmlReadFile(argv[1], NULL, 0);

    if (doc == NULL) {
        printf("error: could not parse file %s\n", argv[1]);
        return(1);
    }

    /*Get the root element node */
    root_element = xmlDocGetRootElement(doc);

    print_element_names(root_element, 0);

    /*free the document */
    xmlFreeDoc(doc);

    /*
     *Free the global variables that may
     *have been allocated by the parser.
     */
    xmlCleanupParser();

    return 0;
}
#else
int main(void) {
    fprintf(stderr, "Tree support not compiled in\n");
    exit(1);
}
#endif

example usage

[nix-shell:~/Desktop/projects/nlnet/nlnet]$ ./tree1 html5-rng/xhtml-basic.rng
 grammar
  ns: http://www.w3.org/1999/xhtml 
   include
   href: modules/datatypes.rng 
   include
   href: modules/attribs.rng 
   include
   href: modules/struct.rng 
   include
   href: modules/text.rng 
   include
   href: modules/hypertext.rng 
   include
   href: modules/list.rng 
   include
   href: modules/basic-form.rng 
   include
   href: modules/basic-table.rng 
   include
   href: modules/image.rng 
   include
   href: modules/param.rng 
   include
   href: modules/object.rng 
   include
   href: modules/meta.rng 
   include
   href: modules/link.rng 
   include
   href: modules/base.rng 


Solution 1:[1]

Although the question is unnecessarily lengthy, it is clear what is being asked. As of version 2.9.14, libxml2 appears unable to resolve includes other than by fetching a URL or looking in the filesystem, probably by searching the current directory for a file named by the href attribute. This may already answer the question, but it is insufficient if the schema has to be loaded from buffers in memory. A clean approach would be to supply a callback that resolves the rng:include directives, but libxml2 does not seem to provide such an API. Another approach, which could actually lead to more efficient operation, is to recursively merge the outer schema and everything it includes into a single schema without include directives. The following code worked for me when merging a schema of medium complexity (8 files). Just change the paths and filenames accordingly.

#include <memory>
#include <string>
#include <string_view>
#include <vector>
#include <stdexcept>
#include <unordered_set>
#include <filesystem>

#include <libxml/parser.h>
#include <libxml/tree.h>
#include <libxml/xmlsave.h>

using namespace std;
namespace fs = std::filesystem;

using DocPtr = std::unique_ptr<xmlDoc, decltype(&xmlFreeDoc)>;

constexpr const char* SchemaBasePath = R"(D:\Schemas)";
constexpr const char* RngSchemaFilename = "Schema.rng";
constexpr const char* MergedSchemaSavePath = R"(D:\Schemas\Schema_Merged.rng)";
constexpr const char* RngNS = "rng";
constexpr const char* RngNSHref = "http://relaxng.org/ns/structure/1.0";

struct Qualifier
{
    bool IsNamespace;
    string Name;
    string Value;
};

static DocPtr readDoc(const string_view& filepath);
static void followDoc(xmlDocPtr doc, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers);
static void followDoc(xmlNodePtr root, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers);
static void removeNode(xmlNodePtr element);
static string findHRef(const xmlNodePtr element);
static string getAttributeContent(const xmlAttrPtr attr);
static void saveDocToFile(xmlDocPtr doc, const string_view& filepath);
static void addNamespaceTo(vector<Qualifier>& qualifiers, xmlNsPtr ns);
static void addAttributeTo(vector<Qualifier>& qualifiers, xmlAttrPtr attr);

unordered_set<string> s_schemas;

int main()
{
    LIBXML_TEST_VERSION;
    auto packetRngPath = fs::u8path(SchemaBasePath) / RngSchemaFilename;
    auto packetRngDoc = readDoc(packetRngPath.u8string());

    vector<xmlNodePtr> nodes;
    vector<Qualifier> qualifiers;
    followDoc(packetRngDoc.get(), nodes, qualifiers);

    auto newDoc = DocPtr(xmlNewDoc(nullptr), &xmlFreeDoc);
    auto grammarNode = xmlNewChild((xmlNodePtr)newDoc.get(), nullptr, (const xmlChar*) "grammar", nullptr);
    if (grammarNode == nullptr)
        throw runtime_error("Can't create rng:grammar node");

    auto rngNs = xmlNewNs(grammarNode, (const xmlChar*)RngNSHref, (const xmlChar*)RngNS);
    if (rngNs == nullptr)
        throw runtime_error("Can't find or create rng namespace");
    xmlSetNs(grammarNode, rngNs);

    for (const auto& qualifier : qualifiers)
    {
        // Recreate the gathered namespaces and attributes
        if (qualifier.IsNamespace)
        {
            // An empty name stands for the default namespace (null prefix)
            xmlNewNs(grammarNode, (const xmlChar*)qualifier.Value.c_str(),
                qualifier.Name.empty() ? nullptr : (const xmlChar*)qualifier.Name.c_str());
        }
        else
        {
            xmlNewProp(grammarNode, (const xmlChar*)qualifier.Name.data(),
                (const xmlChar*)qualifier.Value.data());
        }
    }

    for (auto node : nodes)
    {
        if (xmlAddChild(grammarNode, node) == nullptr)
            throw runtime_error("Can't add child node to grammar");
    }

    // This actually fixes the copied namespaces
    // to share just one instance
    if (xmlReconciliateNs(newDoc.get(), grammarNode) == -1)
        throw runtime_error("Can't reconciliate namespaces");

    saveDocToFile(newDoc.get(), MergedSchemaSavePath);

    return 0;
}

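DocPtr readDoc(const string_view& filepath)
{
    // Callers pass views over null-terminated strings, so data() is safe here
    auto doc = DocPtr(xmlReadFile(filepath.data(), nullptr,
        XML_PARSE_NOBLANKS), &xmlFreeDoc);
    if (!doc)
        throw runtime_error("Can't read XML document");
    return doc;
}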

void followDoc(xmlDocPtr doc, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers)
{
    auto root = xmlDocGetRootElement(doc);

    // Fetch namespaces (xmlGetNsList returns null when none are in scope)
    auto namespaces = xmlGetNsList(doc, root);
    if (namespaces != nullptr)
    {
        for (unsigned i = 0; namespaces[i] != nullptr; i++)
            addNamespaceTo(qualifiers, namespaces[i]);
        xmlFree(namespaces);
    }

    // Fetch attributes
    for (xmlAttrPtr attribute = root->properties; attribute; attribute = attribute->next)
        addAttributeTo(qualifiers, attribute);

    followDoc(root, nodes, qualifiers);
}

void followDoc(xmlNodePtr root, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers)
{
    for (auto child = xmlFirstElementChild(root); child; child = xmlNextElementSibling(child))
    {
        string href;
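        // Match on the namespace URI rather than the prefix: schemas often
        // declare RELAX NG as the default namespace, where the prefix is null
        if (child->ns != nullptr
            && child->ns->href != nullptr
            && string_view((const char*)child->ns->href) == RngNSHref
            && string_view((const char*)child->name) == "include"
            && (href = findHRef(child)).length() != 0)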
        {
            if (s_schemas.find(href) == s_schemas.end())
            {
                auto schemaPath = fs::u8path(SchemaBasePath) / href;
                auto doc = readDoc(schemaPath.u8string());
                s_schemas.insert(href);
                followDoc(doc.get(), nodes, qualifiers);
            }

            continue;
        }

        auto copied = xmlCopyNode(child, 1);
        if (copied == nullptr)
            throw runtime_error("Can't copy child node");

        nodes.push_back(copied);
    }
}

void addNamespaceTo(vector<Qualifier>& qualifiers, xmlNsPtr xmlNs)
{
    // A default namespace has a null prefix; store it as an empty name
    const string prefix = xmlNs->prefix ? (const char*)xmlNs->prefix : "";
    const string href = xmlNs->href ? (const char*)xmlNs->href : "";
    for (const auto& ns : qualifiers)
    {
        // Ensure the namespace has not yet been added first
        if (ns.IsNamespace && ns.Name == prefix)
            return;
    }
    qualifiers.push_back({ true, prefix, href });
}

void addAttributeTo(vector<Qualifier>& qualifiers, xmlAttrPtr xmlAttr)
{
    for (const auto& attr : qualifiers)
    {
        // Ensure the attribute has not yet been added first
        if (!attr.IsNamespace && attr.Name == (const char*)xmlAttr->name)
            return;
    }
    qualifiers.push_back({ false, (const char*)xmlAttr->name, getAttributeContent(xmlAttr) });
}

void removeNode(xmlNodePtr element)
{
    // Unlink the element from its tree and free it.
    // Note: this helper is not called by the merge code above
    xmlUnlinkNode(element);
    xmlFreeNode(element);
}

string findHRef(const xmlNodePtr element)
{
    for (xmlAttrPtr attr = element->properties; attr; attr = attr->next)
    {
        if (string_view((const char*)attr->name) == "href")
            return getAttributeContent(attr);
    }

    return { };
}

string getAttributeContent(const xmlAttrPtr attr)
{
    xmlChar* content = xmlNodeGetContent((const xmlNode*)attr);
    if (content == nullptr)
        return { };

    unique_ptr<xmlChar, decltype(xmlFree)> contentFree(content, xmlFree);
    return string((const char*)content);
}

void saveDocToFile(xmlDocPtr doc, const string_view& filepath)
{
    auto ctx = xmlSaveToFilename(filepath.data(), "utf-8", XML_SAVE_FORMAT);
    if (ctx == nullptr || xmlSaveDoc(ctx, doc) == -1 || xmlSaveClose(ctx) == -1)
        throw runtime_error("Can't save XML document");
}
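The idea of resolving includes through a callback, so that schemas can come from memory buffers, can be sketched in a few lines of Python (standard library only, not libxml2). The names inline_includes, _expand, _drop_overridden, the load callback and the FILES table are all hypothetical. The sketch replaces each include with a div, removing from the included grammar any start or define that the include overrides. It assumes relative POSIX-style hrefs, and it does not handle datatypeLibrary inheritance or namespace prefixes used inside attribute values.

```python
import posixpath
import xml.etree.ElementTree as ET

NS = "http://relaxng.org/ns/structure/1.0"
RNG = "{%s}" % NS

def inline_includes(path, load, stack=()):
    """Return the grammar at `path` with every <include> replaced by a <div>."""
    if path in stack:
        raise ValueError("include cycle at " + path)
    root = ET.fromstring(load(path))
    _expand(root, path, load, stack + (path,))
    return root

def _drop_overridden(grammar, names, drop_start):
    # Remove components that the including file redefines.
    for child in list(grammar):
        if child.tag == RNG + "div":
            _drop_overridden(child, names, drop_start)
        elif (child.tag == RNG + "define" and child.get("name") in names) or \
             (child.tag == RNG + "start" and drop_start):
            grammar.remove(child)

def _expand(parent, path, load, stack):
    for index, node in enumerate(list(parent)):
        if node.tag != RNG + "include":
            _expand(node, path, load, stack)
            continue
        href = posixpath.normpath(
            posixpath.join(posixpath.dirname(path), node.get("href")))
        included = inline_includes(href, load, stack)
        overridden = {d.get("name") for d in node.findall(RNG + "define")}
        drop_start = node.find(RNG + "start") is not None
        _drop_overridden(included, overridden, drop_start)
        included.tag = RNG + "div"  # the included <grammar> becomes a <div>
        attrs = {k: v for k, v in node.attrib.items() if k != "href"}
        div = ET.Element(RNG + "div", attrs)
        div.append(included)
        div.extend(list(node))      # overriding definitions follow it
        parent[index] = div         # same position, so indices stay valid

# Hypothetical in-memory schema set; any callable mapping a path to XML text works.
FILES = {
    "main.rng": '<grammar xmlns="%s"><include href="mod/a.rng">'
                '<define name="x"><text/></define></include></grammar>' % NS,
    "mod/a.rng": '<grammar xmlns="%s"><start><ref name="x"/></start>'
                 '<define name="x"><empty/></define>'
                 '<include href="b.rng"/></grammar>' % NS,
    "mod/b.rng": '<grammar xmlns="%s"><define name="y"><empty/></define>'
                 '</grammar>' % NS,
}

merged = inline_includes("main.rng", FILES.__getitem__)
ET.register_namespace("", NS)
print(ET.tostring(merged, encoding="unicode"))
```

Keeping the included content inside nested div elements preserves any ns attribute on the included grammar, which a flat copy of the child nodes would lose.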

Solution 2:[2]

Can libxml2 also be used to parse a multi-document schema file?

xmllint calls the xmlRelaxNGValidateDoc function of libxml2:

xmlRelaxNGValidateDoc(xmlRelaxNGValidCtxtPtr ctxt,xmlDocPtr doc)

For example:

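#include <stdio.h>
#include <stdlib.h>

#include <libxml/parser.h>
#include <libxml/relaxng.h>

int main(int argc, char *argv[])
{
    int status;
    xmlDoc *doc;
    xmlRelaxNGPtr schema;
    xmlRelaxNGValidCtxtPtr validctxt;
    xmlRelaxNGParserCtxtPtr rngparser;

    if (argc != 3) {
        fprintf(stderr, "usage: %s document.xml schema.rng\n", argv[0]);
        return EXIT_FAILURE;
    }

    doc = xmlReadFile(argv[1], NULL, 0);
    if (doc == NULL) {
        fprintf(stderr, "could not parse %s\n", argv[1]);
        return EXIT_FAILURE;
    }

    /* the schema parser follows include and externalRef hrefs itself */
    rngparser = xmlRelaxNGNewParserCtxt(argv[2]);
    schema = xmlRelaxNGParse(rngparser);
    if (schema == NULL) {
        fprintf(stderr, "could not compile schema %s\n", argv[2]);
        xmlRelaxNGFreeParserCtxt(rngparser);
        xmlFreeDoc(doc);
        return EXIT_FAILURE;
    }
    validctxt = xmlRelaxNGNewValidCtxt(schema);

    /* 0 means valid, a positive value invalid, -1 an internal error */
    status = xmlRelaxNGValidateDoc(validctxt, doc);
    printf("status == %d\n", status);

    /* free the validation context before the schema it refers to */
    xmlRelaxNGFreeValidCtxt(validctxt);
    xmlRelaxNGFree(schema);
    xmlRelaxNGFreeParserCtxt(rngparser);
    xmlFreeDoc(doc);
    return EXIT_SUCCESS;
}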

Validates the following source:

<?xml version="1.0"?>
<root>
  <t>foo</t>
</root>

with the following schema:

<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="t">
      <ref name="tcont"/>
    </element>
  </start>
  <define name="tcont">
    <text/>
  </define>
</grammar>

The relevant distinction is between the externalRef element:

The externalRef pattern can be used to reference a pattern defined in a separate file. The externalRef element has a required href attribute that specifies the URL of a file containing the pattern. The externalRef matches if the pattern contained in the specified URL matches.

For example:

<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="root">
      <externalRef href="595792-ext.rng"/>
    </element>
  </start>
</grammar>

versus the include element:

The include element allows grammars to be merged together. A grammar pattern may have include elements as children. An include element has a required href attribute that specifies the URL of a file containing a grammar pattern. The definitions in the referenced grammar pattern will be included in grammar pattern containing the include element.

The combine attribute is particularly useful in conjunction with include. If a grammar contains multiple definitions with the same name, then the definitions must specify how they are to be combined into a single definition by using the combine attribute.

For example:

demo.rng

<?xml version="1.0" encoding="iso-8859-1"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
 datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

<include href="demo2.rng">
<define name="TEI.prose"><ref name="INCLUDE"/></define>
</include>
</grammar>

demo2.rng

<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:t="http://www.thaiopensource.com/ns/annotations" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

   <start>
         <ref name="TEI.2"/>
   </start>
   <define name="IGNORE">
      <notAllowed/>
   </define>
   <define name="INCLUDE">
      <empty/>
   </define>


  <include href="demo3.rng"/>

   <define name="TEI.2">
      <element name="TEI.2">
         <text/>
      </element>
   </define>

</grammar>

demo3.rng

<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:t="http://www.thaiopensource.com/ns/annotations" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

   <define name="TEI.prose" combine="interleave">
      <ref name="IGNORE"/>
   </define>

</grammar>
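The combine rule quoted above can be checked mechanically once the grammars have been merged. The sketch below is Python using only the standard library, not libxml2, and the function name combine_problems and the sample grammars are hypothetical. It applies the constraints as I understand them from the RELAX NG specification: for a given name, at most one define may omit combine, and choice and interleave must not be mixed. It assumes a single flat grammar in which overridden definitions have already been removed.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

def combine_problems(grammar_text):
    """Return messages for same-named definitions that cannot be combined."""
    root = ET.fromstring(grammar_text)
    by_name = defaultdict(list)
    for define in root.iter(RNG + "define"):
        by_name[define.get("name")].append(define.get("combine"))
    problems = []
    for name, combines in by_name.items():
        if len(combines) < 2:
            continue  # a single definition needs no combine attribute
        if combines.count(None) > 1:
            problems.append(name + ": more than one definition lacks combine")
        if len({c for c in combines if c is not None}) > 1:
            problems.append(name + ": mixes different combine values")
    return problems

# Hypothetical grammars used only to exercise the check.
OK_GRAMMAR = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="TEI.prose"><ref name="INCLUDE"/></define>
  <define name="TEI.prose" combine="interleave"><ref name="IGNORE"/></define>
</grammar>
"""
BAD_GRAMMAR = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="a"><empty/></define>
  <define name="a"><text/></define>
  <define name="b" combine="choice"><empty/></define>
  <define name="b" combine="interleave"><text/></define>
</grammar>
"""

print(combine_problems(OK_GRAMMAR))
print(combine_problems(BAD_GRAMMAR))
```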


Solution 3:[3]

I have implemented this idea. It seemed to me that if your validation loss has increased, you have moved to a point in N-space (N being the number of trainable parameters) that is less favorable than the point you were at in the previous epoch. It is therefore best to reset the weights to those of the previous epoch. However, the learning rate should be reduced at this point, or else the next epoch will end up at the same less favorable point. The same reasoning does NOT hold exactly when monitoring accuracy, but if that is what you want to do, the code below will do that for you. To use the callback, use the code

callbacks=[DWELL(model, monitor_acc, factor, verbose)]

where model is your compiled model and monitor_acc is a boolean. If monitor_acc is True, the training accuracy is monitored: if it has decreased on the current epoch, the model weights are set back to those of the previous epoch and the learning rate is reduced. If monitor_acc is False, the same procedure is followed whenever the validation loss on the current epoch is higher than it was on the previous epoch. factor is a float between 0 and 1. When the monitored metric does not improve on the current epoch, the learning rate is set to new_lr = current_lr * factor; I typically set factor to 0.5. verbose is a boolean. If True, a message is printed during training whenever the monitored metric does not improve, stating that the model weights have been set back to those of the previous epoch and giving the new, reduced learning rate. If False, nothing is printed. Below is an example of use:

callbacks=[DWELL(my_model, False, .5, True)]

Be sure to set callbacks=callbacks in model.fit

The code for the callback is shown below:

import numpy as np
import tensorflow as tf
from tensorflow import keras

class DWELL(keras.callbacks.Callback):
    def __init__(self, model, monitor_acc, factor, verbose):
        super(DWELL, self).__init__()
        self.model = model
        self.initial_lr = float(tf.keras.backend.get_value(model.optimizer.lr))  # save the initial learning rate
        self.lowest_vloss = np.inf  # set lowest validation loss to infinity initially
        self.best_weights = self.model.get_weights()  # set best weights to model's initial weights
        self.verbose = verbose
        self.monitor_acc = monitor_acc
        self.factor = factor  # multiplier applied to the learning rate on a reset
        self.highest_acc = 0

    def on_epoch_end(self, epoch, logs=None):  # runs at the end of each epoch
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))  # get the current learning rate
        vloss = logs.get('val_loss')  # validation loss for this epoch
        acc = logs.get('accuracy')    # training accuracy for this epoch
        if not self.monitor_acc:  # monitor validation loss
            if vloss > self.lowest_vloss:
                self.model.set_weights(self.best_weights)
                new_lr = lr * self.factor
                tf.keras.backend.set_value(self.model.optimizer.lr, new_lr)
                if self.verbose:
                    print('\n model weights reset to best weights and reduced lr to ', new_lr, flush=True)
            else:
                self.lowest_vloss = vloss
                self.best_weights = self.model.get_weights()  # remember the new best weights
        else:
            if acc < self.highest_acc:  # monitor training accuracy
                self.model.set_weights(self.best_weights)
                new_lr = lr * self.factor
                tf.keras.backend.set_value(self.model.optimizer.lr, new_lr)
                if self.verbose:
                    print('\n model weights reset to best weights and reduced lr to ', new_lr, flush=True)
            else:
                self.highest_acc = acc
                self.best_weights = self.model.get_weights()  # remember the new best weights

Below is a sample printout produced during training, showing what happens when the validation loss for the current epoch exceeds that of the previous epoch:

Epoch 23/40
25/25 [==============================] - 3s 110ms/step - loss: 0.5927 - accuracy: 0.9825 - val_loss: 0.6827 - val_accuracy: 0.9000
Epoch 24/40
24/25 [===========================>..] - ETA: 0s - loss: 0.5812 - accuracy: 0.9869
 model weights reset to best weights and reduced lr to  0.0012499999720603228
25/25 [==============================] - 2s 86ms/step - loss: 0.5821 - accuracy: 0.9869 - val_loss: 0.6846 - val_accuracy: 0.9500
Epoch 25/40
25/25 [==============================] - 2s 86ms/step - loss: 0.5646 - accuracy: 0.9958 - val_loss: 0.6772 - val_accuracy: 0.9250

My advice is to ALWAYS monitor the validation loss; it is a better measure of model performance than training accuracy. A nice feature of the callback is that, if you set monitor_acc=False, your model weights at the end of training are always set to those of the epoch with the lowest validation loss.
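The reset-and-reduce rule can be traced without Keras. The following is a minimal pure-Python sketch in which the function name dwell_schedule and the loss values are made up for illustration. Like the callback code, it compares each epoch against the lowest validation loss seen so far and multiplies the learning rate by factor whenever the loss is worse.

```python
def dwell_schedule(val_losses, initial_lr, factor):
    """Trace the action taken and the learning rate after each epoch."""
    lr = initial_lr
    lowest = float("inf")
    history = []
    for epoch, vloss in enumerate(val_losses, start=1):
        if vloss > lowest:
            lr *= factor  # weights would be reset here, lr is reduced
            history.append((epoch, "reset", lr))
        else:
            lowest = vloss  # new best epoch, weights would be saved here
            history.append((epoch, "keep", lr))
    return history

# Made-up validation losses: epoch 3 is worse than the best so far.
for row in dwell_schedule([0.70, 0.68, 0.69, 0.67], initial_lr=0.01, factor=0.5):
    print(row)
```

With these inputs only the third epoch triggers a reset, so the learning rate is halved once and stays at the reduced value afterwards.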

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Community
Solution 3