Parse multi-document RELAX NG schema using libxml2
I want to convert a RELAX NG schema into a schemaInfo object so that it can be used for XML completion in CodeMirror:
https://codemirror.net/demo/xmlcomplete.html
xmllint usage
libxml2 already supports a multi-document RELAX NG schema when it is used to validate a document like this:
xmllint --relaxng myschema.rng mydoc.xml
Question
Can libxml2 also be used to parse a multi-document schema file?
Here is an example for a multi-document schema:
Here is some libxml2 functionality I don't understand but which could be helpful:
Assumption
I think I have to convert the multi-document schema into a single document schema using tools like: https://github.com/h4l/rnginline/tree/master/rnginline
Using libxml2 directly would be great since I could then support schemas without pre-processing.
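To make the target concrete, here is a minimal sketch of deriving a CodeMirror-style schemaInfo from a single-file grammar. It is written in Python with only the standard library rather than with libxml2, and it assumes the output shape used by the linked demo (a `!top` list plus `attrs` and `children` per tag). The helper name `schema_info` is hypothetical. The sketch only looks at literal nesting: it does not follow `ref`/`define` or `include`, so elements that are not nested inside another element all end up under `!top`.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

def schema_info(rng_text):
    """Map each named <element> to its directly nested attributes and child elements."""
    info = {}
    top = []

    def visit(node, owner):
        if node.tag == RNG + "element" and "name" in node.attrib:
            name = node.attrib["name"]
            info.setdefault(name, {"attrs": {}, "children": []})
            if owner is None:
                if name not in top:
                    top.append(name)
            elif name not in info[owner]["children"]:
                info[owner]["children"].append(name)
            owner = name  # everything below belongs to this element
        elif node.tag == RNG + "attribute" and "name" in node.attrib and owner is not None:
            info[owner]["attrs"][node.attrib["name"]] = None  # None: free-form value
            return
        for child in node:
            visit(child, owner)

    visit(ET.fromstring(rng_text), None)
    info["!top"] = top
    return info
```

Serializing the returned dictionary with `json.dumps` would give an object literal that can be pasted into the editor configuration; a real converter would additionally have to resolve named patterns and included files.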
Update 3.5.2016
As you can see, parsing the RELAX NG schema as a plain XML document shows only the top-level file; the output does not contain any of the files pulled in by the include directive of the main RELAX NG file (note: RELAX NG schemas can be split into several files).
<!-- XHTML Basic -->
<grammar ns="http://www.w3.org/1999/xhtml"
xmlns="http://relaxng.org/ns/structure/1.0">
<include href="modules/datatypes.rng"/>
<include href="modules/attribs.rng"/>
<include href="modules/struct.rng"/>
<include href="modules/text.rng"/>
<include href="modules/hypertext.rng"/>
<include href="modules/list.rng"/>
<include href="modules/basic-form.rng"/>
<include href="modules/basic-table.rng"/>
<include href="modules/image.rng"/>
<include href="modules/param.rng"/>
<include href="modules/object.rng"/>
<include href="modules/meta.rng"/>
<include href="modules/link.rng"/>
<include href="modules/base.rng"/>
</grammar>
source code
/**
* section: Tree
* synopsis: Navigates a tree to print element names
* purpose: Parse a file to a tree, use xmlDocGetRootElement() to
* get the root element, then walk the document and print
* all the element name in document order.
* usage: tree1 filename_or_URL
* test: tree1 test2.xml > tree1.tmp && diff tree1.tmp $(srcdir)/tree1.res
* author: Dodji Seketeli
* copy: see Copyright for the status of this software.
*/
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>
#ifdef LIBXML_TREE_ENABLED
#define ANSI_COLOR_RED "\x1b[31m"
#define ANSI_COLOR_GREEN "\x1b[32m"
#define ANSI_COLOR_YELLOW "\x1b[33m"
#define ANSI_COLOR_BLUE "\x1b[34m"
#define ANSI_COLOR_MAGENTA "\x1b[35m"
#define ANSI_COLOR_CYAN "\x1b[36m"
#define ANSI_COLOR_RESET "\x1b[0m"
/*
*To compile this file using gcc you can type
*gcc -o xmlexample libxml2-example.c `xml2-config --cflags --libs`
*/
/**
* print_element_names:
* @a_node: the initial xml node to consider.
*
* Prints the names of the all the xml elements
* that are siblings or children of a given xml node.
*/
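/* Returns a string of depth + 1 spaces used to indent the output. */
static const char *
pad(int depth)
{
    /* static storage: the buffer must outlive the call that returns it */
    static char str[2000];
    int n = depth + 1;

    if (n < 0)
        n = 0;
    if (n > (int) sizeof(str) - 1)
        n = (int) sizeof(str) - 1;
    for (int i = 0; i < n; i++)
        str[i] = ' ';
    str[n] = '\0';
    return str;
}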
static void
print_element_names(xmlNode *a_node, int depth)
{
    xmlNode *cur_node = NULL;

    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        if (cur_node->type == XML_ELEMENT_NODE) {
            printf("%s %s\n", pad(depth), (const char *) cur_node->name);
            for (xmlAttrPtr attr = cur_node->properties; attr != NULL; attr = attr->next) {
                /* xmlNodeListGetString() returns a copy that the caller must free */
                xmlChar *value = xmlNodeListGetString(cur_node->doc, attr->children, 1);

                printf("%s", ANSI_COLOR_MAGENTA);
                printf("%s %s: ", pad(depth), (const char *) attr->name);
                printf("%s \n", value ? (const char *) value : "");
                printf("%s", ANSI_COLOR_RESET);
                if (value != NULL)
                    xmlFree(value);
            }
        }
        print_element_names(cur_node->children, depth + 1);
    }
}
/**
* Simple example to parse a file called "file.xml",
* walk down the DOM, and print the name of the
* xml elements nodes.
*/
int
main(int argc, char **argv)
{
    xmlDoc *doc = NULL;
    xmlNode *root_element = NULL;

    if (argc != 2)
        return 1;

    /*
     * This initializes the library and checks for potential ABI mismatches
     * between the version it was compiled for and the actual shared
     * library used.
     */
    LIBXML_TEST_VERSION

    /* Parse the file and get the DOM. */
    doc = xmlReadFile(argv[1], NULL, 0);
    if (doc == NULL) {
        fprintf(stderr, "error: could not parse file %s\n", argv[1]);
        xmlCleanupParser();
        return 1;
    }

    /* Get the root element node. */
    root_element = xmlDocGetRootElement(doc);
    print_element_names(root_element, 0);

    /* Free the document. */
    xmlFreeDoc(doc);

    /* Free the global variables that may have been allocated by the parser. */
    xmlCleanupParser();
    return 0;
}
#else
int main(void) {
    fprintf(stderr, "Tree support not compiled in\n");
    return 1;
}
#endif
example usage
[nix-shell:~/Desktop/projects/nlnet/nlnet]$ ./tree1 html5-rng/xhtml-basic.rng
grammar
ns: http://www.w3.org/1999/xhtml
include
href: modules/datatypes.rng
include
href: modules/attribs.rng
include
href: modules/struct.rng
include
href: modules/text.rng
include
href: modules/hypertext.rng
include
href: modules/list.rng
include
href: modules/basic-form.rng
include
href: modules/basic-table.rng
include
href: modules/image.rng
include
href: modules/param.rng
include
href: modules/object.rng
include
href: modules/meta.rng
include
href: modules/link.rng
include
href: modules/base.rng
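The listing stops at the top-level file because a generic XML parser treats include as an ordinary element. Following the references by hand is possible in principle. The following is a minimal sketch in Python (standard library only, not libxml2) that collects every file reachable through include. The function name `included_files` is hypothetical, and the sketch assumes that each href is a local path relative to the including file; `xml:base` and remote URLs are not handled.

```python
import os
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"

def included_files(path, seen=None):
    """Return every file reachable from `path` through <include href="...">, in visit order."""
    seen = [] if seen is None else seen
    path = os.path.normpath(path)
    if path in seen:  # already visited: avoids looping on cyclic includes
        return seen
    seen.append(path)
    root = ET.parse(path).getroot()
    for inc in root.iter("{%s}include" % RNG_NS):
        href = inc.get("href")
        if href:
            # resolve the href against the directory of the including file
            included_files(os.path.join(os.path.dirname(path), href), seen)
    return seen
```

Called on html5-rng/xhtml-basic.rng, this would be expected to return the main file followed by the modules it references, each of which could then be parsed in turn.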
Solution 1:[1]
Although the question is unnecessarily lengthy, it is clear what is being asked. As of version 2.9.14, libxml2 appears unable to resolve includes in any way other than fetching a URL or looking in the filesystem, probably by searching the current directory for a file named by the href attribute. This may already answer the question, but it is insufficient if the schema has to be loaded from buffers in memory. A clean approach would be to supply a callback that resolves the rng:include directives, but libxml2 does not seem to provide such an API. Another approach, which could actually lead to more efficient operation, is to recursively merge the outer schema into a single schema without include directives. The following code worked for me when merging a schema of medium complexity (8 files). Just change the paths and filenames accordingly.
#include <memory>
#include <string>
#include <string_view>
#include <vector>
#include <stdexcept>
#include <unordered_set>
#include <filesystem>
#include <libxml/parser.h>
#include <libxml/tree.h>
#include <libxml/xmlsave.h>
using namespace std;
namespace fs = std::filesystem;
using DocPtr = std::unique_ptr<xmlDoc, decltype(&xmlFreeDoc)>;
constexpr const char* SchemaBasePath = R"(D:\Schemas)";
constexpr const char* RngSchemaFilename = "Schema.rng";
constexpr const char* MergedSchemaSavePath = R"(D:\Schemas\Schema_Merged.rng)";
constexpr const char* RngNS = "rng";
constexpr const char* RngNSHref = "http://relaxng.org/ns/structure/1.0";
struct Qualifier
{
bool IsNamespace;
string Name;
string Value;
};
static DocPtr readDoc(const string_view& filepath);
static void followDoc(xmlDocPtr doc, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers);
static void followDoc(xmlNodePtr root, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers);
static void removeNode(xmlNodePtr element);
static string findHRef(const xmlNodePtr element);
static string getAttributeContent(const xmlAttrPtr attr);
static void saveDocToFile(xmlDocPtr doc, const string_view& filepath);
static void addNamespaceTo(vector<Qualifier>& qualifiers, xmlNsPtr ns);
static void addAttributeTo(vector<Qualifier>& qualifiers, xmlAttrPtr attr);
unordered_set<string> s_schemas;
int main()
{
LIBXML_TEST_VERSION;
auto packetRngPath = fs::u8path(SchemaBasePath) / RngSchemaFilename;
auto packetRngDoc = readDoc(packetRngPath.u8string());
vector<xmlNodePtr> nodes;
vector<Qualifier> qualifiers;
followDoc(packetRngDoc.get(), nodes, qualifiers);
auto newDoc = DocPtr(xmlNewDoc(nullptr), &xmlFreeDoc);
auto grammarNode = xmlNewChild((xmlNodePtr)newDoc.get(), nullptr, (const xmlChar*) "grammar", nullptr);
if (grammarNode == nullptr)
throw runtime_error("Can't create rng:grammar node");
auto rngNs = xmlNewNs(grammarNode, (const xmlChar*)RngNSHref, (const xmlChar*)RngNS);
if (rngNs == nullptr)
throw runtime_error("Can't find or create rng namespace");
xmlSetNs(grammarNode, rngNs);
for (auto qualifier : qualifiers)
{
// Recreate the gathered namespaces and attributes
if (qualifier.IsNamespace)
{
xmlNewNs(grammarNode, (const xmlChar*)qualifier.Value.data(),
(const xmlChar*)qualifier.Name.data());
}
else
{
xmlNewProp(grammarNode, (const xmlChar*)qualifier.Name.data(),
(const xmlChar*)qualifier.Value.data());
}
}
for (auto node : nodes)
{
if (xmlAddChild(grammarNode, node) == nullptr)
throw runtime_error("Can't add child node to grammar");
}
// This actually fixes the copied namespaces
// to share just one instance
if (xmlReconciliateNs(newDoc.get(), grammarNode) == -1)
throw runtime_error("Can't reconciliate namespaces");
saveDocToFile(newDoc.get(), MergedSchemaSavePath);
return 0;
}
DocPtr readDoc(const string_view& filepath)
{
return DocPtr(xmlReadFile(filepath.data(), nullptr,
XML_PARSE_NOBLANKS), &xmlFreeDoc);
}
void followDoc(xmlDocPtr doc, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers)
{
    // xmlDocGetRootElement returns nullptr for a null or empty document
    auto root = xmlDocGetRootElement(doc);
    if (root == nullptr)
        throw runtime_error("Can't read schema document or it has no root element");

    // Fetch namespaces; xmlGetNsList returns nullptr when none are in scope
    if (auto namespaces = xmlGetNsList(doc, root))
    {
        for (unsigned i = 0; namespaces[i] != nullptr; i++)
            addNamespaceTo(qualifiers, namespaces[i]);
        xmlFree(namespaces);
    }

    // Fetch attributes
    for (xmlAttrPtr attribute = root->properties; attribute; attribute = attribute->next)
        addAttributeTo(qualifiers, attribute);

    followDoc(root, nodes, qualifiers);
}
void followDoc(xmlNodePtr root, vector<xmlNodePtr>& nodes, vector<Qualifier>& qualifiers)
{
    for (auto child = xmlFirstElementChild(root); child; child = xmlNextElementSibling(child))
    {
        string href;
        // Match on the namespace URI rather than the prefix: schemas that use
        // xmlns="..." as default namespace have a null prefix
        if (child->ns != nullptr && child->ns->href != nullptr
            && string_view((const char*)child->ns->href) == RngNSHref
            && string_view((const char*)child->name) == "include"
            && (href = findHRef(child)).length() != 0)
        {
            // Follow each included file only once
            if (s_schemas.find(href) == s_schemas.end())
            {
                auto schemaPath = fs::u8path(SchemaBasePath) / href;
                auto doc = readDoc(schemaPath.u8string());
                s_schemas.insert(href);
                followDoc(doc.get(), nodes, qualifiers);
            }
            continue;
        }
        auto copied = xmlCopyNode(child, 1);
        if (copied == nullptr)
            throw runtime_error("Can't copy child node");
        nodes.push_back(copied);
    }
}
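void addNamespaceTo(vector<Qualifier>& qualifiers, xmlNsPtr xmlNs)
{
    // A default namespace (xmlns="...") has a null prefix. Skip it here:
    // the copied nodes keep their own namespace declaration
    if (xmlNs->prefix == nullptr || xmlNs->href == nullptr)
        return;
    for (const auto& ns : qualifiers)
    {
        // Ensure the namespace has not yet been added first
        if (ns.IsNamespace && ns.Name == (const char*)xmlNs->prefix)
            return;
    }
    qualifiers.push_back({ true, (const char*)xmlNs->prefix, (const char*)xmlNs->href });
}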
void addAttributeTo(vector<Qualifier>& qualifiers, xmlAttrPtr xmlAttr)
{
    for (const auto& attr : qualifiers)
    {
        // Ensure the attribute has not yet been added first
        if (!attr.IsNamespace && attr.Name == (const char*)xmlAttr->name)
            return;
    }
    qualifiers.push_back({ false, (const char*)xmlAttr->name, getAttributeContent(xmlAttr) });
}
void removeNode(xmlNodePtr element)
{
// Unlink the node from its tree and free it (helper, not used by main)
xmlUnlinkNode(element);
xmlFreeNode(element);
}
string findHRef(const xmlNodePtr element)
{
for (xmlAttrPtr attr = element->properties; attr; attr = attr->next)
{
if (string_view((const char*)attr->name) == "href")
return getAttributeContent(attr);
}
return { };
}
string getAttributeContent(const xmlAttrPtr attr)
{
xmlChar* content = xmlNodeGetContent((const xmlNode*)attr);
if (content == nullptr)
return { };
unique_ptr<xmlChar, decltype(xmlFree)> contentFree(content, xmlFree);
return string((const char*)content);
}
void saveDocToFile(xmlDocPtr doc, const string_view& filepath)
{
auto ctx = xmlSaveToFilename(filepath.data(), "utf-8", XML_SAVE_FORMAT);
if (ctx == nullptr || xmlSaveDoc(ctx, doc) == -1 || xmlSaveClose(ctx) == -1)
throw runtime_error("Can't save XML document");
}
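The merge idea can also be sketched compactly without libxml2. The following Python version (standard library only) is an illustration under stated assumptions, not a port of the code above. The names `inline_includes` and `_expand` are hypothetical. It follows my reading of the RELAX NG simplification rule that an include becomes a div holding the included grammar's components, with define and start elements written inside the include replacing same-named ones from the included file. It only filters direct children of the included grammar, ignores `xml:base`, and does not detect cyclic includes.

```python
import os
import xml.etree.ElementTree as ET

NS = "{http://relaxng.org/ns/structure/1.0}"
INCLUDE, DEFINE, START, DIV = (NS + n for n in ("include", "define", "start", "div"))

def inline_includes(path):
    """Parse `path` and return its root element with every <include> turned into a <div>."""
    root = ET.parse(path).getroot()
    _expand(root, os.path.dirname(path))
    return root

def _expand(node, base_dir):
    for child in node:
        if child.tag == INCLUDE and child.get("href"):
            # href is resolved relative to the including file
            included = inline_includes(os.path.join(base_dir, child.attrib.pop("href")))
            # definitions written inside <include> replace the included ones
            overridden = {d.get("name") for d in child.findall(DEFINE)}
            has_start = child.find(START) is not None
            # the wrapper keeps the included grammar's attributes (ns, datatypeLibrary)
            wrapper = ET.Element(DIV, included.attrib)
            for comp in included:
                if comp.tag == DEFINE and comp.get("name") in overridden:
                    continue
                if comp.tag == START and has_start:
                    continue
                wrapper.append(comp)
            child.tag = DIV
            child.insert(0, wrapper)
        else:
            _expand(child, base_dir)
```

The merged tree can be written out with `ET.tostring(root, encoding="unicode")`. Keeping the div wrappers instead of flattening everything into one grammar preserves the `ns` and `datatypeLibrary` attributes of each included file.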
Solution 2:[2]
Can libxml2 also be used to parse a multi-document schema file?
xmllint calls the xmlRelaxNGValidateDoc function of libxml2:
int xmlRelaxNGValidateDoc(xmlRelaxNGValidCtxtPtr ctxt, xmlDocPtr doc)
For example:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <libxml/xmlmemory.h>
#include <libxml/parser.h>
#include <libxml/relaxng.h>
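int main(int argc, char *argv[])
{
    int status;
    xmlDoc *doc;
    xmlRelaxNGPtr schema;
    xmlRelaxNGValidCtxtPtr validctxt;
    xmlRelaxNGParserCtxtPtr rngparser;

    if (argc != 3) {
        fprintf(stderr, "usage: %s document.xml schema.rng\n", argv[0]);
        return EXIT_FAILURE;
    }

    doc = xmlParseFile(argv[1]);
    rngparser = xmlRelaxNGNewParserCtxt(argv[2]);
    schema = rngparser ? xmlRelaxNGParse(rngparser) : NULL;
    if (doc == NULL || schema == NULL) {
        fprintf(stderr, "could not load the document or the schema\n");
        xmlRelaxNGFree(schema);
        xmlRelaxNGFreeParserCtxt(rngparser);
        xmlFreeDoc(doc);
        return EXIT_FAILURE;
    }

    validctxt = xmlRelaxNGNewValidCtxt(schema);
    /* 0 means the document is valid, a positive value is an error code */
    status = xmlRelaxNGValidateDoc(validctxt, doc);
    printf("status == %d\n", status);

    xmlRelaxNGFreeValidCtxt(validctxt);
    xmlRelaxNGFree(schema);
    xmlRelaxNGFreeParserCtxt(rngparser);
    xmlFreeDoc(doc);
    return EXIT_SUCCESS;
}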
Validates the following source:
<?xml version="1.0"?>
<root>
<t>foo</t>
</root>
with the following schema:
<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0">
<start>
<element name="t">
<ref name="tcont"/>
</element>
</start>
<define name="tcont">
<text/>
</define>
</grammar>
The difference is between support for the externalRef element:
The externalRef pattern can be used to reference a pattern defined in a separate file. The externalRef element has a required href attribute that specifies the URL of a file containing the pattern. The externalRef matches if the pattern contained in the specified URL matches.
For example:
<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0">
<start>
<element name="root">
<externalRef href="595792-ext.rng"/>
</element>
</start>
</grammar>
versus the include element:
The include element allows grammars to be merged together. A grammar pattern may have include elements as children. An include element has a required href attribute that specifies the URL of a file containing a grammar pattern. The definitions in the referenced grammar pattern will be included in grammar pattern containing the include element.
The combine attribute is particularly useful in conjunction with include. If a grammar contains multiple definitions with the same name, then the definitions must specify how they are to be combined into a single definition by using the combine attribute.
For example:
demo.rng
<?xml version="1.0" encoding="iso-8859-1"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<include href="demo2.rng">
<define name="TEI.prose"><ref name="INCLUDE"/></define>
</include>
</grammar>
demo2.rng
<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:t="http://www.thaiopensource.com/ns/annotations" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<start>
<ref name="TEI.2"/>
</start>
<define name="IGNORE">
<notAllowed/>
</define>
<define name="INCLUDE">
<empty/>
</define>
<include href="demo3.rng"/>
<define name="TEI.2">
<element name="TEI.2">
<text/>
</element>
</define>
</grammar>
demo3.rng
<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" xmlns:t="http://www.thaiopensource.com/ns/annotations" xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<define name="TEI.prose" combine="interleave">
<ref name="IGNORE"/>
</define>
</grammar>
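The combine rule quoted above can be checked mechanically once all files are at hand. The sketch below, in Python with only the standard library, groups define elements by name across several grammar texts and reports names that cannot be combined. It follows my reading of the RELAX NG rule (at most one definition per name may omit combine, and all combine values for a name must agree). The function name `combine_problems` is hypothetical, and the sketch deliberately ignores override semantics (a define nested inside include, as in demo.rng) and nested grammar scopes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RNG = "{http://relaxng.org/ns/structure/1.0}"

def combine_problems(*grammar_texts):
    """Return {define name: reason} for names whose definitions cannot be combined."""
    modes = defaultdict(list)  # name -> combine values (None when the attribute is absent)
    for text in grammar_texts:
        for define in ET.fromstring(text).iter(RNG + "define"):
            modes[define.get("name")].append(define.get("combine"))
    problems = {}
    for name, values in modes.items():
        if values.count(None) > 1:
            problems[name] = "more than one definition without combine"
        elif len({v for v in values if v is not None}) > 1:
            problems[name] = "conflicting combine values"
    return problems
```

An empty result means every repeated name states how it is to be combined; it does not by itself show that the grammar is valid.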
References
- Validating xml against relax ng in ANSI C
- RELAX NG Tutorial
- Function: xmlRelaxNGParse
- Function: xmlRelaxNGValidateDoc
- Gnome Git Repo: libxml2 - xmllint.c
- Gnome Git Repo: libxml2 - test file 595792_0.xml
- Gnome Git Repo: libxml2 - test file 595792-ext.rng
- Gnome Git Repo: libxml2 - test file 595792.rng
- Gnome Git Repo: libxml2 - test file demo.rng
- Gnome Git Repo: libxml2 - test file demo2.rng
- Gnome Git Repo: libxml2 - test file demo3.rng
- Gnome Git Repo: libxml2 - check-relaxng-test-suite.py
- Gnome Bug #512131: crash in libxml2 xmlReader with RNG validation
Solution 3:[3]
I have implemented this idea. It seemed to me that if your validation loss increased, you have moved to a point in N-space (N being the number of trainable parameters) that is less favorable than the point you were at in the previous epoch. It is therefore best to reset the weights to those of the previous epoch. At that point the learning rate should also be reduced, or else the next epoch will end up at the same less favorable point. The same reasoning does not hold exactly for accuracy, but if that is what you want to monitor, the code below will do it. To use the callback, use the code
callbacks=[DWELL(model, monitor_acc, factor, verbose)]
where model is the name of your compiled model and monitor_acc is a boolean. If monitor_acc is True, the training accuracy is monitored: if it has decreased on the current epoch, the model weights are set back to those of the previous epoch and the learning rate is reduced. If monitor_acc is False, the same procedure is followed when the validation loss is higher on the current epoch than on the previous one.
factor is a float between 0 and 1. When the monitored metric does not improve on the current epoch, the learning rate is set to new_lr = current_lr * factor. I typically set factor to 0.5.
verbose is a boolean. If True, a message is printed during training whenever the monitored metric does not improve; it reports that the model weights have been set back to those of the previous epoch and shows the new, reduced learning rate. If False, nothing is printed. Below is an example of use:
callbacks=[DWELL(my_model, False, .5, True)]
Be sure to pass callbacks=callbacks to model.fit.
The code for the callback is shown below:
import numpy as np
import tensorflow as tf
from tensorflow import keras

class DWELL(keras.callbacks.Callback):
    def __init__(self, model, monitor_acc, factor, verbose):
        super(DWELL, self).__init__()
        self.model = model
        self.initial_lr = float(tf.keras.backend.get_value(model.optimizer.lr))  # get the initial learning rate and save it
        self.lowest_vloss = np.inf  # set lowest validation loss to infinity initially
        self.best_weights = self.model.get_weights()  # set best weights to model's initial weights
        self.verbose = verbose
        self.monitor_acc = monitor_acc
        self.factor = factor  # stored so that on_epoch_end can use it
        self.highest_acc = 0

    def on_epoch_end(self, epoch, logs=None):  # method runs at the end of each epoch
        logs = logs or {}
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))  # get the current learning rate
        vloss = logs.get('val_loss')  # get the validation loss for this epoch
        acc = logs.get('accuracy')
        if not self.monitor_acc:  # monitor validation loss
            if vloss > self.lowest_vloss:
                self._reset(lr)
            else:
                self.lowest_vloss = vloss
                self.best_weights = self.model.get_weights()  # remember the improved weights
        else:  # monitor training accuracy
            if acc < self.highest_acc:
                self._reset(lr)
            else:
                self.highest_acc = acc
                self.best_weights = self.model.get_weights()  # remember the improved weights

    def _reset(self, lr):
        # go back to the best weights seen so far and shrink the learning rate
        self.model.set_weights(self.best_weights)
        new_lr = lr * self.factor
        tf.keras.backend.set_value(self.model.optimizer.lr, new_lr)
        if self.verbose:
            print('\n model weights reset to best weights and reduced lr to ', new_lr, flush=True)
Below is a sample printout produced during training that shows what results when the validation loss for the current epoch exceeds that of the previous epoch
Epoch 23/40
25/25 [==============================] - 3s 110ms/step - loss: 0.5927 - accuracy: 0.9825 - val_loss: 0.6827 - val_accuracy: 0.9000
Epoch 24/40
24/25 [===========================>..] - ETA: 0s - loss: 0.5812 - accuracy: 0.9869
model weights reset to best weights and reduced lr to 0.0012499999720603228
25/25 [==============================] - 2s 86ms/step - loss: 0.5821 - accuracy: 0.9869 - val_loss: 0.6846 - val_accuracy: 0.9500
Epoch 25/40
25/25 [==============================] - 2s 86ms/step - loss: 0.5646 - accuracy: 0.9958 - val_loss: 0.6772 - val_accuracy: 0.9250
My advice is to ALWAYS monitor the validation loss; it is a better measure of model performance than training accuracy. A nice feature of the callback is that, if you set monitor_acc=False, your model weights at the end of training are always those of the epoch with the lowest validation loss.
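The rollback rule itself does not depend on Keras. As a minimal framework-free sketch (plain Python, with a hypothetical `dwell_step` helper and a plain dictionary standing in for the callback's attributes), one end-of-epoch decision could look like this:

```python
def dwell_step(state, weights, val_loss, factor=0.5):
    """One end-of-epoch update: keep improvements, otherwise roll back and shrink lr."""
    if val_loss > state["best_loss"]:
        state["lr"] *= factor                # new_lr = current_lr * factor
        return list(state["best_weights"])   # resume from the best point seen so far
    state["best_loss"] = val_loss
    state["best_weights"] = list(weights)    # snapshot of the improved weights
    return weights
```

The function returns the weights that training should continue from, so a worse epoch costs one step but never moves the starting point away from the best weights found so far.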
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Community |
| Solution 3 | |
