NLP with low-resourced languages: beyond bean counting artefacts

Let me start the new year with a long overdue topic to write about: low-resourced languages (LRLs). If you take English as the benchmark and adopt a Calimero attitude of ‘but they have it much easier with so many resources and that’s not fair’ for computational tasks involving NLP/HLT, then you can complain that it’s not easy and unfair for 99.99% of the languages in the world. That doesn’t help to understand what ‘not easy’ really amounts to, nor what the – in many cases unrealistic – implications are of ignorant (at best) statements along the lines of a ‘just get yourself together and work a little harder’ attitude, nor just how much effort some work has taken even if it looks like mere baby steps from a highly resourced language technologies perspective.

Not just that. Let me cherry-pick a few anecdotes to illustrate. A fellow researcher I spoke with at INLG2023 told me he just gets his new, high-quality annotated data delivered and can readily use it for his research. What?!?! Really?!! Other researchers, be it in NLP or needing NLP tools for their intended task, have to collect and annotate the data themselves, clean it up, and determine its quality, or manage getting that done as part of their research. My fellow researcher was surprised that I, too, had to do all that extra work in order to have enough results to write up in a paper. The paper I was still working on at the time (presented at the 2024 edition, a year later [Mahlaza24]) included trips to the library, online searches for grammar rules, and consulting with linguists just to get a few examples and rules to begin with, to try to work out how to convert numbers to text in isiZulu, and a number of iterations to plug the gaps in the limited rules’ documentation to get to a passable level of correct output. There was no data, for good reasons; we generated that data to make it less hard for data-oriented researchers. OCR issues are documented elsewhere, and some can be gleaned from carefully reading between the lines of some of my NLG papers.

I probably could write a book about the anecdotes, but the plural of anecdote isn’t data. Were we just repeatedly unlucky? Didn’t we search hard enough? Do other people – researchers and software developers alike – working with low-resourced languages not have such problems? No, they struggle, too. For instance, having to adapt Universal Dependencies (UD) before being able to computerise sentence annotations for St. Lawrence Island Yupik [Park21].

What makes it challenging, and could one ‘low-resourced’ be even worse off than another ‘low-resourced’, akin to a very low-resourced? I’ve heard a Dutch colleague claim that Dutch was low-resourced, and he was staunchly convinced of it. I shook my head in disbelief and could not resist commenting, as he clearly didn’t know what low-resourced really is like. But it does raise the question: what is low-resourced? What are its characteristics, so that it can be distinguished from intermediately resourced and well-resourced?

They’re not new questions, and other researchers have tried to answer them with the typical approach of bean counting something: Wikipedia articles, number of corpora, number of tools, number of papers in top-level NLP conferences. They all have issues of missing things in the counting: much of the work on low-resourced languages doesn’t make it into the top-tier conference venues that take English as the gold standard, the number of tools doesn’t say anything about whether they actually work, available non-published tools are easily overlooked yet could be really useful, and Wikipedia has skewed editor issues.

For instance, media darling and recent SAICSIT emerging pioneer award winner Prof. Vukosi Marivate doesn’t get much of his work into premier international NLP conferences, even though he most definitely has been using data-driven techniques for years and managed to set up the spin-off company lelapa.ai on NLP for African languages. It is not much different for P-rated Dr Jan Buys with his papers on African languages, who also has been working on data-driven approaches for years, including having a number of papers in the main NLP conferences – just not about African languages. Our trailblazing work from before the LLM craze, having resulted in, among others, 4 INLG papers, a COLING paper, a CICLing paper & prize, and journal articles in TALLIP and LRE on NLP for African languages, is notable as well, but it mostly won’t reflect in, e.g., Joshi et al.’s top conference paper-based counting [Joshi00]. Prof. (emer.) Laurette Pretorius’ LREC papers and ZulMorph were largely before the resource indexing and open source requirements, and AwezaMed, spellcheckers for MS Office, and so on and so forth would fall through the bean counting cracks as well.

Not being indexed by the bean counting scrapers of researchers from the Global North doesn’t mean nothing is happening. Admittedly, variant spellings of names of languages and changes in names of languages hamper the bean counting approach when searching for resources—though locals know. Joshi et al.’s ‘Mbosi’ language search could have been augmented with ‘Mboshi’ and ‘Embosi’ if only they’d known, and then they would have found and included the LREC18 paper [Rialland18], for instance, but, alas.

Is there another way to capture the fuzzy notion of LRL differently? My collaborator, Langa Khumalo, and I set out to take a different, complementary, approach: contextualising the language to determine resourcedness. We focussed on three key issues:

  • What are really the distinguishing characteristics of LRLs (and, by extension, ‘non-LRLs’)?
  • What are the characteristics of levels of resourcedness?
  • Which language fits where and why?

The results are described in detail in our technical report [KeetKhumalo26]: we identified 11 dimensions of resourcedness, their components, and tentative scales or grouping buckets, matched the dimensions to Very LRL, LRL, RL, HRL, and Very HRL levels, and assessed their operationalisability with isiNdebele and several other languages.

The dimensions concern the sort of things that actually impact developing NLP tools. For instance, the number of people: fewer people are harder to find and more in demand, not to mention having tens of participants (if that) for crowdsourcing who’d need to be paid internet data upfront to do the evaluation. Or take the participants’ level of education in that language: speaking and writing a language is not the same as having a grasp deep enough to provide 100% correct feedback on the morphological analysis, say, or having received education in the language at least up to matric/high school exams. Less-than-correct feedback requires more rounds of human evaluation, or: it takes more time to carry out the evaluations, more time to analyse the data, and more remuneration for the tasks. Or the choice of grammars, or the lack thereof: taking a UD or SUD off the shelf versus digging into old books and poring over various linguistics papers to determine what it is that needs to be represented in any formalism expressive enough to capture it. Having a choice of parsers versus no or outdated software that needs to be brushed up or re-implemented first.

Summary of the dimensions and components thereof, where applicable; see paper for details. (Source: [KeetKhumalo26])

The dimensions are described, motivated, and illustrated over a good 6 pages in the paper. There may be more dimensions, but this already gives a good basis to assess and classify languages, to develop policies to benchmark and assess changes in language resourcedness, for certain people to get down from their English high horse of incorrectly judging efforts for other languages, and to make better sense of ‘LRL paper tracks’ at conferences and workshops. And perhaps anyhow to gain an appreciation of NLP activities when there’s no cornucopia of tools and datasets.

We grouped the dimensions as contributing to characterising Very LRL, LRL, RL, HRL and Very HRL. Admittedly, there’s a notable flip at the RL level that asks for more fine-grained needling and characterising. Yet, the notion of getting the ball rolling being harder than keeping it rolling and amassing more thanks to the bandwagon effect applies to many areas.

We applied the dimensions to isiNdebele, a language spoken in South Africa and Zimbabwe with about 3.7 million first/home-language speakers overall. There are newspapers, TV news bulletins, schoolbooks, a dictionary and more in isiNdebele, i.e., it is actively used in daily life. It turns out that it is in the Very LRL category, albeit noting it’s not all doom and gloom, or: there are a few resources.

Image source: https://southafrica-info.com/arts-culture/the-languages-of-south-africa/

The discussion section of the report elaborates on various aspects, including policy implications, and there’s a bonus section on nitpicking about terminology, including low-resource vs. low-resourced vs. under-resourced languages. What can I say: besides NLP, co-author Langa is the Director of the South African Centre for Digital Language Resources and I’m an ontologist. The paper’s flavour overall is distinctly on the languages side rather than computation, which may be taken as a warning or an encouragement; either way, I hope you’ll find something of interest in it. Opinions, additions, or your assessment of your language(s) of interest are welcome.

References

[Joshi00] Joshi, Pratik, et al. “The State and Fate of Linguistic Diversity and Inclusion in the NLP World”. arXiv [Cs.CL], 20 Apr. 2020, http://arxiv.org/abs/2004.09095.

[KeetKhumalo26] Keet, C.M., Khumalo, L. Contextualising levels of language resourcedness for NLP tasks. arXiv report 2309.17035, 17 January 2026. https://arxiv.org/abs/2309.17035.

[Mahlaza24] Mahlaza, Z., Magwenzi, T., Keet, C.M., Khumalo, L. Automatically Generating IsiZulu Words From Indo-Arabic Numerals. 17th International Natural Language Generation Conference (INLG’24), Tokyo, Japan, September 23-27, 2024. Association for Computational Linguistics.

[Park21] Park, Hyunji, et al. “Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik”. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, Association for Computational Linguistics, 2021.

[Rialland18] Rialland A, Adda-Decker M, Kouarata G-N, Adda G, Besacier L, et al. “Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville)”. 11th Language Resources and Evaluation Conference (LREC 2018), ELRA, May 2018, Miyazaki, Japan.

Just Turtle and RDF vs OWL examples: the CPEV and FIBO

This is a more concrete follow-up to the previous somewhat theoretical post on ontologies not being just RDF.

I tried hard to think of how other people might approach the syntax issues and assumptions. When I asked someone who’s involved in developing a vocabulary represented in a Turtle document about it, it resulted in a befuddled reply. Maybe therein lies a problem. Regardless, I’ve seen a number of people editing the text files directly rather than through a user interface that helps the modeller avoid making syntax errors. Either way, we’ll look at two documents in this post: SEMIC’s Core Public Event Vocabulary, aimed at EU public administration interoperability, because it was the most recent of all the Core Vocabularies with a published update, and the Financial Industry Business Ontology, because it has so many files. Both are actively maintained and each has a user community that, to the best of my knowledge, does not overlap with the other’s.

The Core Public Event Vocabulary CPEV

Having noted the manual editing of the serialisation, I imagine they need a plain RDF or Turtle syntax checker at least, so let’s try that. The first hit in DuckDuckGo is http://ttl.summerofcode.be/ and I copied the Core Public Event Vocabulary (CPEV) into it. And…

Yay! It is valid Turtle syntax. The verdict is satisfying, so why bother looking beyond it? Because we want that type-level vocabulary it claims to be, conformant to OWL so that anyone can use it for their ontology-driven, Semantic Web and W3C standards compliant application, be it to foster interoperability or to serve a stand-alone knowledge-driven software application (or use it in graphRAG if you insist).

Let me try another tool, Protégé, the go-to ontology editor for most ontology developers that comes with the reliable OWL API, under the assumption that I don’t have any other options. When I try to load and parse the CPEV by URL from the GitHub repository where it is published, i.e., https://github.com/SEMICeu/Core-Public-Event-Vocabulary/blob/master/releases/1.1.0/voc/core-public-event.ttl, it returns a number of syntax errors. There’s that inconvenience of uploading ontologies on GitHub: you have to click the ‘raw’ button, and then load by URL from that URL instead, being https://raw.githubusercontent.com/SEMICeu/Core-Public-Event-Vocabulary/refs/heads/master/releases/1.1.0/voc/core-public-event.ttl. It loads. Downloading the core-public-event.ttl locally and opening it also works, in the sense of it opening in Protégé without any imports. No imports were declared, either, although multiple prefixes have been used.

Inspecting the content, curiously, certain fields are empty in the header section (see image on the right). The file itself certainly does have content for those items, using FOAF. There’s also a foaf:Person class with a number of odd instances of the _:genid2147483656 variety, which it is not supposed to have, nor are those present in the ttl file that lists the actual names of the editors, so something does not add up. It might be tempting to blame the software ‘downloaded from the Internet’ that my operating system warns me about and prioritise the text file one can inspect directly in a text editor, but bear with me for a little more.

Since there’s something amiss with the persons, and so something to do with the FOAF usage, one could try to import FOAF explicitly to see what happens, or delete the empty metadata fields, or convert it to the required exchange syntax RDF/XML or another syntax to attempt to gain insight into what the issues are. Protégé’s error messages were a bit cryptic when I tried the latter with functional style syntax. I’ll save you reading through a few dead ends.

Was all this a productive use of my time? Most definitely not. A few of the erstwhile students of my ontology engineering course didn’t think so, either, and developed the OWL Classifier that finds out straight away the DL fragment your ontology is in and which OWL Species it is in, and it presents a list of violations of the other OWL Species profiles, if any, using two OWL API versions, for OWL and OWL 2. Here’s a screenshot from the downloaded CPEV ttl and one where I had imported FOAF:

The core-public-event.ttl contains 80 OWL 2 DL violations that cause it to be in OWL 2 Full only, and the one where I had imported FOAF still contained 31 violations. Importing FOAF resolved 68 “undeclared x” issues, but added 19 more punning issues, which, at some point, looked promising to possibly pursue. I started to dig in earnest.

There are undeclared classes (e.g., foaf:Person) reported, undeclared annotation properties, punning problems (e.g., accessibility and event number), and so on and so forth. However, more of a smoking gun were the two detections of “Use of reserved vocabulary for class IRI: rdf:langString”, which remain in both lists of violations: rdf:langString is not a permitted datatype for a data property range in OWL, and it shouldn’t have been reported as a class. The rdf:langString was new in RDF 1.1 of 2014, intended for the case that “A literal is a language-tagged string if the third element is present.”, yet OWL 2 was last updated in 2012. The OWL API added support for it in February 2015 nonetheless. Removing the rdf:langString from the range declarations of its two uses, with the declared data properties accessibility and event number, resolves most issues except for one. It resolves many because rdf:langString is treated as a class yet ought to be a data type, ‘confusing’ the rest of CPEV’s content.

Is a language tag for event number so relevant that we have to leave behind DL-based ontologies? No, that data type should be an xsd:integer or else rdf:PlainLiteral, and if someone wants to add ‘rd’, ‘de’, or ‘ième’ after a 3 in a user interface, then that has to be sorted out by the surface realiser, where it ought to be addressed anyway. Must the accessibility data property have a language tag? I appreciate there will be one somewhere in the information system, but a language-tagged blurb about how the venue of the event is accessible for people with mobility restrictions does not warrant a transgression into undecidability of the vocabulary. Nor is it ontologically defensible. Any information system can handle that requirement trivially without destroying ontology-based data access prospects, without forcing undecidable FOL reasoners upon us. Add another field ‘in language’ if you must (which also gives the freedom for more languages and dialects than pre-set ones, even more so with MoLA) or maybe ontolex-lemon is of use here.
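For concreteness, here’s a minimal Turtle sketch of the kind of fix I have in mind; the ex: IRIs below are made-up stand-ins for illustration, not CPEV’s actual property IRIs:

  @prefix ex:   <http://www.example.org/cpev-sketch#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  # instead of a range of rdf:langString, which is not in the OWL 2 datatype map:
  ex:eventNumber a owl:DatatypeProperty ;
    rdfs:range xsd:integer .

  ex:accessibility a owl:DatatypeProperty ;
    rdfs:range xsd:string .

  # and, if one must, record the language of the blurb separately
  ex:inLanguage a owl:DatatypeProperty ;
    rdfs:range xsd:string .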

That out of the way, the one remaining category of issue is that of anonymous individuals, which are not allowed in OWL 2 EL and OWL 2 QL – profiles that, considering that few DL-based OWL features are used, do make sense to aim for in the interest of scalable applications. The blank nodes that end up as new (anonymous) individuals in the ontology file are due to the unnecessary nesting of the list of editors, which forced the introduction of blank nodes, rather than adding the editors one by one as an individual each. That is, instead of a pattern like

  <http://www.w3.org/2001/02pd/rec54#editor> [
    a foaf:Person;
    foaf:firstName "x";
    foaf:lastName "y"
  ], [
    a foaf:Person;
    foaf:firstName "z";
    foaf:lastName "w"
  ];

in the ontology metadata, adding a separate entry for each, like

<http://www.w3.org/2001/02pd/rec54#editor> "x y" ;
<http://www.w3.org/2001/02pd/rec54#editor> "z w" ;

addresses that problem.

With that Turtle list removed, the modified CPEV file is also in all OWL 2 profiles. Is an RDF list with blank nodes worth forfeiting CPEV usage in ontology-based data access systems for, when it can be solved simply by adding each editor individually? I don’t think so. Or: what are the arguments why the listing that causes the blank nodes is preferable over scalable use of CPEV?

The Financial Industry Business Ontology FIBO

FIBO has many files in its GitHub repository, and I randomly picked one, being Legal Capacity https://github.com/edmcouncil/fibo/blob/master/FND/Law/LegalCapacity.rdf. The W3C RDF Validator is ok with it (see image on the right).

Testing it in the OWL Classifier, it also turned out to be in OWL 2 Full:

and a copy of the full text of the list of violations is pasted here to see the entire entries:

1 - Use of undeclared annotation property: owl:minQualifiedCardinality in annotation [Annotation(owl:minQualifiedCardinality "0"^^xsd:nonNegativeInteger) in AnnotationAssertion(owl:minQualifiedCardinality _:genid650 "0"^^xsd:nonNegativeInteger)]

2 - Use of undeclared annotation property: owl:minQualifiedCardinality in annotation [Annotation(owl:minQualifiedCardinality "0"^^xsd:nonNegativeInteger) in AnnotationAssertion(owl:minQualifiedCardinality _:genid649 "0"^^xsd:nonNegativeInteger)]

3 - Use of unknown datatype: rdf:langString [DatatypeDefinition(<https://www.omg.org/spec/Commons/TextDatatype/Text> DataUnionOf(rdf:langString xsd:string )) in OntologyID(OntologyIRI(<https://www.omg.org/spec/Commons/TextDatatype/>) VersionIRI(<https://www.omg.org/spec/Commons/20221101/TextDatatype/>))]

4 - Use of reserved vocabulary for annotation property IRI: owl:minQualifiedCardinality [AnnotationAssertion(owl:minQualifiedCardinality _:genid649 "0"^^xsd:nonNegativeInteger) in OntologyID(OntologyIRI(<https://spec.edmcouncil.org/fibo/ontology/FND/Utilities/Analytics/>) VersionIRI(<https://spec.edmcouncil.org/fibo/ontology/master/latest/FND/Utilities/Analytics/>))]

5 - Use of reserved vocabulary for annotation property IRI: owl:minQualifiedCardinality [AnnotationAssertion(owl:minQualifiedCardinality _:genid650 "0"^^xsd:nonNegativeInteger) in OntologyID(OntologyIRI(<https://spec.edmcouncil.org/fibo/ontology/FND/Utilities/Analytics/>) VersionIRI(<https://spec.edmcouncil.org/fibo/ontology/master/latest/FND/Utilities/Analytics/>))]

Setting aside the langString, the “AnnotationAssertion(owl:minQualifiedCardinality” is clearly the problem: cardinality constraints are not there for annotation, but to be used in class expressions, and they are reserved for it at that. This is both a problem for Legal Capacity and the Utilities/Analytics it imports. Someone had added minQualifiedCardinality as an annotation property, likely by accident:

It is not allowed to be so if it is to be a DL-based OWL ontology, because it’s reserved vocabulary. Note that “An OWL 2 Full ontology document is any RDF/XML document”, and since it validates, one could argue it’s an OWL ontology. Yet, the DL constructs used are merely ALCHIQ(D), or: most certainly using less than the OWL 2 DL features one could use. Would one want to forfeit decidable automated reasoning for an unused annotation property? I think not.
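To illustrate the difference in Turtle rather than the tool’s output, here’s a sketch with made-up ex: names (not FIBO’s actual IRIs): the first triple is the problematic declaration, the rest shows where a qualified cardinality belongs, namely inside a class expression:

  @prefix ex:   <http://www.example.org/fibo-sketch#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  # problematic: declaring reserved vocabulary as an annotation property
  owl:minQualifiedCardinality a owl:AnnotationProperty .

  # intended use: as part of a class expression
  ex:SomeClass rdfs:subClassOf [
      a owl:Restriction ;
      owl:onProperty ex:someProperty ;
      owl:minQualifiedCardinality "0"^^xsd:nonNegativeInteger ;
      owl:onClass ex:SomeFiller
    ] .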

Where to go from here?

These sorts of issues are hard to find manually, and even harder once the size and complexity of the ontology and the modules it imports increase. Eight OWL Species to choose from in a number of different concrete syntaxes doesn’t make the debugging task any easier either. That’s why my students chose to develop the tool. They did so a while ago, however: the OWL Classifier (GitHub repo) I used was developed in 2016 and only works with older JDKs due to backwards incompatibility of Java; it was a mini-project topic of the course, the CS honours students – Brian Mc George, Aashiq Parker, and Muhummad Patel – graduated and moved on, and new students want to do new things. You’ll have to set up an older version of the JDK to avail of the OWL Classifier to catch syntax violation issues with basic explanations. It doesn’t solve the syntax problem yet, but at least it pinpoints, or at least directs you to, where the violations are that make it RDF/Turtle but not the lightweight OWL many an RDF-oriented modeller is after.

Finally, I don’t want to merely complain; I want to help. Writing up this and the previous post as an unwieldy GitHub issue for one ttl/rdf file is a bit much, and, going by the ttl and rdf files that exist in disparate repos, it deserves to be known for more than one such file.

It’s good to see how much of the Semantic Web technologies actually made it into industry and public administration, especially considering all the boasting (and bullying?) by the LLM groupies. I’d like to see it take a step up towards further effective interoperability in the EU and beyond.

No, an ontology isn’t ‘just RDF’

Over the past few years that I’ve been peeking and dabbling outside computer science and the ivory tower of academia more than before, I noticed a disturbing trend, or perhaps even entrenched practice, of talk about “RDF ontologies” and of “ontologies really being no more than just RDF graphs”. But just because ontologies in OWL are expected to be serialised in RDF/XML as the required exchange syntax according to the standard – and optionally in another specified format, such as Turtle (short for Terse RDF Triple Language), OWL/XML, functional style syntax, or Manchester syntax – and OWL has a mapping into RDF, that doesn’t make them ‘RDF ontologies’. Why isn’t an ontology ‘just RDF’?

A very short non-technical answer is that while ontologies (formalised in OWL) can be seen as an RDF graph when that particular serialisation language is chosen, not all RDF graphs are ontologies, and since there are valid serialisations of OWL ontologies that are not graphs (e.g., OWL/XML), not all OWL ontologies are RDF graphs when you encounter the document. A longer explanation follows, where I’ve adapted sidebar 7 of my textbook to make its contents more suitable for a blogpost, and for that reason it also has a little from the ‘encoding peculiarities’ section and sidebar 12.

Abstract versus concrete syntax

(source: my textbook 2nd edition, p103)

As a preliminary, recall that the OWL 2 Web Ontology Language has been a W3C recommendation since 2009, with a 2nd edition in 2012, and the latest RDF 1.1 for publishing and linking data has been a recommendation since 2014 (RDF 1.2 is on the way). And, for what it’s worth (one certainly can squabble about certain parts – visuals have limitations), here’s an adjusted Semantic Web layer cake, with more standards, less crazy on the colours, and relevant DIKW pyramid concepts on the right.

Let’s consider “Figure 1 The structure of OWL 2” from the OWL 2 overview, reproduced below, and the lime-green oval in the centre with the “mapping”-labelled arrows between OWL and RDF, and, why not, also that orange rectangle at the top-centre with the “RDF” mention and the “Turtle” on the right as well. Perhaps that’s what might cause a reader to simplify it all, to equate an ontology with an RDF graph, and save their ontology with a .rdf or, equally problematic, .ttl extension instead of saving the ontology in RDF/XML format with an .owl extension to indicate it’s meant to be an ontology in OWL.

“The structure of OWL”, Figure 1 of the OWL 2 document overview.

An OWL 2 ontology is represented as an instance of the OWL 2 structural specification, which is independent of concrete OWL 2 exchange syntaxes. Put differently, an OWL 2 ontology has one structural specification and will be written and stored in one or more concrete exchange syntaxes. RDF/XML is one such exchange syntax, and the mandatory one at that; functional style syntax is an optional exchange syntax, and so are Turtle, Manchester syntax and OWL/XML. Here are two examples of those concrete syntaxes and their respective rendering in a GUI: berry being a fruiting body and carnivorous plants eating animals from a tutorial on improving an ontology, as lazy screenshots of Protégé that fit in my laptop window.
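For a taste of two of those concrete syntaxes in plain text rather than screenshots, here’s the carnivorous plants axiom sketched in Turtle, with its functional style syntax counterpart in the comment; the IRIs are made up for illustration and not copied from the tutorial ontology:

  @prefix :     <http://www.example.org/plants#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

  # functional style syntax: SubClassOf( :CarnivorousPlant ObjectSomeValuesFrom( :eats :Animal ) )
  :CarnivorousPlant a owl:Class .
  :Animal a owl:Class .
  :eats a owl:ObjectProperty .
  :CarnivorousPlant rdfs:subClassOf [
      a owl:Restriction ;
      owl:onProperty :eats ;
      owl:someValuesFrom :Animal
    ] .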

For the data and information modeller among you, you may draw a parallel with UML: there’s the standard that specifies valid UML diagrams, and for a modelling tool one can choose how to turn the UML class diagram into flat text to manipulate and store it, be it OMG’s XMI, some other XML, JSON, your home-grown pet language, or even store them in RDF by treating the diagram as a graph. Does that make the UML class diagram alternatingly a tree data structure, a graph, and a collection of name/value pairs or, schizophrenically, all of them at the same time? No, a UML class diagram remains exactly that, regardless of the implementation choice for the serialisation language.

Back to OWL. The particular structure of an OWL ontology can be mapped into an RDF graph for a concrete computer-processable serialisation of the ontology. Any Description Logic-based OWL ontology still has a direct semantics that is model-theoretic. That mapping into what is syntactically an RDF graph does not change the semantics of the ontology if it’s DL-based (i.e., in either of OWL DL, OWL Lite, OWL 2 DL, OWL 2 EL, OWL 2 QL, or OWL 2 RL), or: the ontology does not swap into a graph-based semantics by serialising it in RDF or its Turtle dialect and we still can send it to the DL-based automated reasoner.

Another way of looking at it is that for concretely writing down the ontology for computational use, we abuse/avail of some syntax that was already specified somewhere for another purpose – representing data and information on the Web – that’s reused here for a different purpose – serialising an ontology where knowledge is represented. It’s a bit like abusing UML class diagram notation to visualise key aspects of an ontology for communicative purpose because it’s around already, people are more familiar with UML notation, and it saves you inventing and explaining a new visual notation.

There are two key reasons why a distinction is made between an abstract structural specification and concrete syntaxes. First, the abstract structure serves as a pivot that then can be linked to multiple concrete syntaxes, compared to generating many-to-many mappings between all exchange syntaxes. Second, additional practical conveniences can be added to concrete syntaxes that do not affect the logical theory (the ontology). For instance, a concrete syntax may have an abbreviatedIRI feature to simplify processing long IRI strings and it may have extras for ontology annotations.

If you look at the fine print of the mapping specification from OWL into RDF, that is, not the convenient table but some parts of the surrounding text, you’ll notice the ‘snag’ that it isn’t simply 1:1. Transforming an ontology O into RDF syntax, T(O), works anyhow, yes, but it’s the “The mapping presented in Section 3 can be used to transform an RDF graph G satisfying certain restrictions into an OWL 2 DL ontology OG” that makes a difference (bold face added). Whatever is in the graph needs to adhere to what’s described in that Section 3; if it doesn’t, it’s still a graph, but it just isn’t an OWL ontology. Consequently, RDF tools great for processing lots of instances aren’t necessarily adequate for OWL ontologies – if the tool’s feature set doesn’t boast adherence to those “certain restrictions”, then they aren’t adequate as tools for ontologies for sure.

RDF Schema?

Perhaps the people who talk about ‘RDF ontologies’ mean lightweight ontologies or vocabularies in RDFS, short for RDF Schema. RDFS is based on the RDF Semantics specification and is intended for type-level information and can help guide what to add to a graph. You can declare classes, properties, class hierarchies, property hierarchies, domain and range restrictions, and a few other things like labels, see-also, and bags, but not more substantive knowledge about the subject.

It won’t let you declare characteristics of properties (e.g., inverse, transitive), nor local range restrictions (e.g., that for a class Person specifically, the property hasName has as range xsd:string), nor complex concept descriptions (e.g., that class Bicycle is defined by the union of Human-powered bicycle and Electrical bicycle), nor cardinality restrictions (e.g., each Electrical Bicycle has exactly 1 motor), nor disjointness axioms (e.g., nothing can be both Apple and Orange), not to mention that one can mess up/around, like using vocabulary of the language (e.g., stating that rdfs:Class rdfs:subClassOf ex:a).
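A small Turtle sketch of the contrast, with made-up ex: IRIs: the first two triples are roughly as far as RDFS takes you for the bicycles example, whereas the statements after them need OWL vocabulary and an OWL-conformant tool to be interpreted as intended:

  @prefix ex:   <http://www.example.org/bikes#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  # RDFS-level statements: hierarchy and a domain restriction
  ex:ElectricalBicycle rdfs:subClassOf ex:Bicycle .
  ex:hasPart rdfs:domain ex:Bicycle .

  # beyond RDFS: property characteristics, disjointness, qualified cardinality
  ex:hasPart a owl:TransitiveProperty .
  ex:Apple owl:disjointWith ex:Orange .
  ex:ElectricalBicycle rdfs:subClassOf [
      a owl:Restriction ;
      owl:onProperty ex:hasPart ;
      owl:qualifiedCardinality "1"^^xsd:nonNegativeInteger ;
      owl:onClass ex:Motor
    ] .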

If you were thinking in the direction of a schema for RDF, and so RDFS, yet an ontology regardless, then you probably had intended to say an ontology in OWL 2 Full. Reasoning over OWL 2 Full is undecidable, so it’s not as if, by forfeiting all the nice modelling features, you’d be rewarded with good performance. Or: this may not be what you really want to have.

Ontologies in data stores

Perhaps the people who talk about ‘RDF ontologies’ meant something else. There are, for lack of a better term, ‘encoding peculiarities’. I could store my ontology about, say, electrical bicycles in a relational database as well, if I so fancy. For the class hierarchy, I can create a 2-column table called Taxonomy, and store it there:

and so on for other tables, like a hasPart table with four columns: one for the whole, one for the part, and two for the basic constraints (universally or existentially quantified, number restrictions). Mathematically, that trick has turned my classes and properties into values. Not that most people would care, because we can look at it and think of it as if they were classes. Computationally, some tasks will go faster. Regardless, we can take R2RML and convert the relational database to RDF, and voila, we have the ontology as an RDF graph at the level of individuals. It’s mathematical and technological gymnastics, but anyone who understands the stretching wouldn’t talk of an RDF ontology, but keep the performance optimisation hack under wraps and of no concern to the modeller.
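For the curious, a minimal sketch of what such an R2RML mapping could look like for that 2-column Taxonomy table; the column names child and parent and the example.org namespace are assumptions for illustration:

  @prefix rr:   <http://www.w3.org/ns/r2rml#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix ex:   <http://www.example.org/bikes#> .

  # each row (child, parent) becomes a child rdfs:subClassOf parent triple
  ex:TaxonomyMap a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "Taxonomy" ] ;
    rr:subjectMap [ rr:template "http://www.example.org/bikes#{child}" ] ;
    rr:predicateObjectMap [
      rr:predicate rdfs:subClassOf ;
      rr:objectMap [ rr:template "http://www.example.org/bikes#{parent}" ; rr:termType rr:IRI ]
    ] .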

When I look at the .ttl files of the SEMIC Core Vocabularies, for instance, such as the most recent release that happens to be v1.1 of the core public event vocabulary, it looks like the intent is the first case, i.e., where an OWL ontology is serialised in Turtle, as is the case for QUDT, and others. If OWL 2 Full or any of the DL-based OWL 2 languages was intended, they should have had an .owl extension to indicate the ‘specialness’ of that Turtle file. It is not much different for the Financial Industry Business Ontology (FIBO) where, although the syntax isn’t even Turtle or simple RDF, the file extension is still .rdf rather than .owl. I don’t mean to pick on these; I just happen to know of them and they originate from different communities.

In closing

As per the Conformance (normative) section of OWL 2, there are OWL 2 Full, OWL 2 DL, OWL 2 EL, OWL 2 QL, and OWL 2 RL ontology documents. Not ‘RDF ontology’ documents. They can be serialised in, at least, RDF/XML, Functional Style Syntax, OWL/XML, Turtle, and Manchester Syntax. Let’s not conflate the ontology with merely one of its exchange syntax serialisation options. More precise terminology may help us communicate better, like tasting one’s vocabulary agreement tea.

Of course, other modelling languages exist that can be used for representing an ontology on paper or for computational use and that are also not RDF graphs, such as Common Logic. Also, a tool such as Protégé can easily convert between the exchange syntax formats specified in the OWL standard and a few others (export to LaTeX, render it visually in OntoGraf, and whatnot). If you fancy the ontology to be in OWL/XML so you can use Owlready2 in a Python programming environment, go for it – just make sure it’s an OWL 2 ontology. It’s conformance to the OWL standard that counts for all those ontologies in the Semantic Web we weave, in order to not end up knitting knots that would become too daunting to disentangle.

p.s.: I’ll have more concrete examples in the next post that I’ll finish up in a day or two, zooming in on the CPEV and FIBO.

An initial comparison of approaches to automating adding data to knowledge graphs

A bottom-up approach to knowledge graph development may be in part like bottom-up ontology development from existing and legacy resources, but it certainly also involves other tasks, and perhaps principally so. The major other task concerns the instance data that also have to turn up in the graph. Digging into a pile of instances gets one dirty hands, however, and therefore substantial work has gone into trying to automate the task of processing the source data and loading it into a knowledge graph. Different types of source formats give rise to distinct frameworks with their algorithms and mapping languages, and there can be disparate core requirements on top of that, which add to the ever-growing list of potential solutions to choose from.

For instance, converting relational database data into a graph requires a different algorithm from converting tree-shaped semi-structured data (XML) into a knowledge graph. A requirement for a high-quality graph with legal consequences in case of errors demands more quality control checks along the extract-transform-load (ETL) pathway than a community-based, best-effort, will-do graph. Control over the source, and whether the source must be maintained or can be abandoned once the data has been converted, also affects how to go about creating that knowledge graph, as does the requirement of whether all data has to be converted or only the selected part that is of interest right now.

Many algorithms and tools have been proposed to automate the task. Since this is only a blog post, I’m going to condense the extant approaches that I know of into three main groups:

  1. ETL-then-quality: first extract stuff from the unstructured, semi-structured, or structured source(s), dump the generated triples in the graph, and then clean it up for as much as feasible;
  2. Quality ETL to KG: the semi-structured or structured source is carefully mapped to the schema of the graph and then the data is loaded into the knowledge graph accordingly;
  3. Virtual KG: only a selected fragment of the structured source data is converted into the graph, which is computed on the fly upon selection.

Each approach has its own set of permutations. A feature comparison then inevitably leads to a few generic statements. I gave it a try nonetheless, as shown in the table below.

Let’s look at some of those values in the table. The “at least one” set of mappings for VKGs is there because scalable VKGs can have two sets of mappings or one set of mappings + transformations; e.g., Ontopic’s VKG has a dedicated mapping language to map the ontology to the data and then uses W3C’s R2RML mapping language to convert the query answer into RDF. This makes it exceedingly suitable for scenarios where the source data need to remain where they are and are regularly updated, resulting in always up-to-date content in the RDF graph.

The quality ETL works only with semi-structured or structured data (extraction from unstructured data would be of too low quality) and may cater for, among others, selected or all trees in the XML files or all attributes in all RDBMS tables, using any one of a variety of approaches, ranging from dedicated mapping languages, such as RML, to ad hoc code to queries. Since each ETL conversion takes time, it makes sense to use it when one wouldn’t still need the source data in that older format, though re-running the transformation is easily done.
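To give an idea of the dedicated mapping language route, here’s a minimal RML sketch for mapping one element from an XML file into the graph; the file name, XPath expressions, and target property are made up for illustration:

  @prefix rml: <http://semweb.mmlab.be/ns/rml#> .
  @prefix rr:  <http://www.w3.org/ns/r2rml#> .
  @prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
  @prefix ex:  <http://www.example.org/notices#> .

  # each /notices/notice element becomes a subject with a title
  ex:NoticeMap a rr:TriplesMap ;
    rml:logicalSource [
      rml:source "notices.xml" ;
      rml:referenceFormulation ql:XPath ;
      rml:iterator "/notices/notice"
    ] ;
    rr:subjectMap [ rr:template "http://www.example.org/notices#{@id}" ] ;
    rr:predicateObjectMap [
      rr:predicate ex:hasTitle ;
      rr:objectMap [ rml:reference "title" ]
    ] .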

For ETL-then-quality, the source format can range from unstructured text to spreadsheets and beyond, and such best-effort approaches are typically not constrained upfront by a complex domain ontology to allow for as much triple generation as possible, but rather have a schema of sorts emerge from the conversion and cleaning up afterwards. It is, perhaps, the most well known option with NLP-based (including LLM) KG construction, a dash of Schema.org to it, and Google’s knowledge panels as end-user facing rendering of the output. There are obviously lots of methods and techniques for the various subtasks, but I’m not aware of a methodology for it.

I’ve worked on both the VKG/OBDA and the quality ETL approaches, and read up on and used the latter. I don’t have a preference on the whole, because the distinct approaches serve different scenarios, although a half-baked quality improvement in the ETL-then-quality approach irks me as a job half-done at best. This is not a case of bothsidesism, but of having an understanding of their respective strengths and thereby which one to use when, and, consequently, of realising that one always can construct use cases with requirements for evaluation where one of them wins by design, as is also the case for the four approaches to VKG/OBDA. A brief overview of those considerations is included in Chapter 8 of the second edition of my ontology engineering textbook. There, I also take other features into account, such as computational complexity and the open vs. closed world assumption.

Are there other KG construction approaches with an automated component? Should there be more or other implementation-independent features to compare them on? Are some features more important than others? Is one approach much more prevalent in industry than the others, be that in number of projects, size, or funding, or in prospects for successes and failures in solving the problem? Questions, questions.

Regardless, later this week I’ll give a presentation about a particular Quality ETL to KG approach at the European Data Conference on Reference Data and Semantics (ENDORSE 2025) in the ‘AI driven data interoperability & mapping’ session. Meaningfy developed both a preliminary methodology and a supporting tool, called Mapping Workbench, that focusses on converting XML files to an ontology-based knowledge graph in this first iteration. Early results were presented at the SEMANTiCS conferences in 2023 and 2024 in the posters & demos session, and we’ve been working on more tool and methodology enhancements since then, including automation of testing whether the mappings are correct and looking into more types of data as input. If you’re interested in either the methodology or tool, be it for research or for bringing your old XML data (or another format) into the 21st-century technology ecosystem, and the 2020s really, please feel free to contact me, attend the presentation, or arrange for a meeting with demo afterwards. I’ve also installed and checked in on the conference app.

Affordable non-profit publishing being thwarted by Amazon’s antics

The second edition of my ontology engineering textbook was published earlier this month with a non-profit publisher, College Publications, to keep the book affordable. I wouldn’t be getting rich from the number of textbooks sold anyway and affordability of access to knowledge is important for a myriad of reasons. The book is not for free: printing hardcopies cost money, as does cover design, so does the server where the eBook will be stored, and so on. (Expecting it for free is unsustainable in the long run if the creation of textbooks is not to be dependent entirely on the whims of a wealthy benefactor.)

The textbook was configured to cost around 31 GBP or 43 USD or 37 EUR at the current currency exchange rates, give or take a few dollars/euros due to country-specific taxation differences and fluctuations in exchange rates. It is around that price indeed on Amazon.com and Amazon.co.uk, yet not so everywhere else I checked. Amazon.de had it on offer for 65.65 euro mid September, and Amazon.es for 93.30 euro, and likewise for Italy. Amazon apparently expects that people in the eurozone are so used to forking out exorbitant prices that they will do so for this textbook as well.

Amazon’s UK site, with the price around (north of the) expectations.
Amazon’s Germany site on the same day (16th of Sept.), selling it for 65.65 euro instead of the 37 euro of the exchange rate of the day.
Amazon Spain trying to sell it for 93.30 euro on 20 September.
Amazon Spain giving it another try, for 66.29 euro on 30 September when it didn’t sell for 93.30.

Amazon.com.au speculated with Australian dollars over the weekend, from nearly 80 to 135 and back to 83.44 AUD now (though ‘out of stock’). The newly established Amazon.co.za sets its tone by demanding twice the set price from its South African customers – reality soon acting out my SIPP/SEP/computer ethics course’s pop quiz question about what to do when textbooks cost R1500 (i.e., too expensive). 31 GBP is around R700, however.

Amazon Australia, with the revised up to 135.36 AUD on 28 September.
Amazon Australia on 30 September, priced down to $83.44.

Amazon toying with South African customers: pretending the list price to be nearly R2000 rather than the R700 it ought to be, and pretending there’s a 23% discount on what’s still twice as much as it ought to cost (screenshot d.d. 20 September).

The extra 30-60 euros go neither to College Publications nor to me, but to Amazon’s already huge profits and its obscenely rich founder. It appears that there’s nothing we—i.e., College Publications or I—can do to rein in Amazon’s Algorithm Overlords. Not right now. It would not surprise me if this sort of price manipulation also happens to other authors and publishers; a class action suit against Amazon for monopoly abuse sounds like a fine plan.

There are other online retailers, one might retort. True, but they either don’t list my book yet, are in the process of doing so (e.g., ibs.it), or followed Amazon’s lead and added their own markup on top of it (e.g., Thalia.de), except for Booktopia, which offers it for a few AUD less than Amazon’s lowest price in Australia. To the best of my knowledge, this sort of manipulation didn’t happen with the first edition of the textbook.

Algorithms are not magic, nor are they the boss; they are designed by humans who are also in charge of deploying them. Hence, the price manipulation and duping of customers in the, perhaps, ‘non core’ countries is by design. This is unethical and should be, if not already is, illegal. Meanwhile, if you’d like to buy the textbook and have limited means or don’t want to sponsor an already fat cat on principle, it can be worth it to shop around to find a better deal.

The ebook of the second edition of An Introduction to Ontology Engineering will be available soon here, where it will be free from Amazon’s algorithms. It won’t be DRM protected, which may have the result that insufficient hardcopies are going to be sold to break even. Publishing can be complicated. Either way: thank you to those who already have bought a copy. I’m sorry if you paid too much for it; I tried my best to avoid that and I hope you’ll find the contents insightful and worthwhile regardless.

Second Edition of the Ontology Engineering Textbook is available now!

front cover OE book v2

It is here: the second edition of An Introduction to Ontology Engineering, also published with the non-profit College Publications and making its way through the distribution channels (e.g., on amazon.co.uk and amazon.de through print-on-demand already).

Did I really want to write a second edition rather than move on to writing a brand new book? Not really, but it was pulling at me hard enough to go ahead with it anyway. Seven years have passed since the publication of the first edition, and everything before COVID seems so long ago. New scientific results have been obtained, new interests emerged, known gaps in the book’s content needed to be filled more urgently than before, and ontologies are on the rise again for a variety of reasons. Those ontologies may, at times, be only lightweight, incorporated in a knowledge graph, or presented as a harmless-looking low-entry controlled vocabulary, but it’s a renewed interest in modelling nonetheless. Also, I fancy the thought that my writing skills have improved over the years and that I could improve the book that way as well.

What’s the difference? What’s new?

I’ve added several recent advances and gracefully shortened or removed older material that has lost relevance. The updated and extended NLP section is but one such case (see also the recent post), the OBDA chapter another, and, yes, now there is a section on competency questions and I could not avoid writing something about LLMs either. Chapter 11 is new, consisting of integrative and advanced topics, such as ontology modularisation and ontology matching, and a few other topics.

There are also improved explanations and better contextualisation in the broader firmament, like where logics fit and which ones, the DIKW pyramid to also visually position ontologies in the knowledge layer, a new integrated diagram for ontology development methodologies (trialled in this post), and more.

Other novelties are highlighted “learning outcomes” formulated per chapter, more exercises in the book, and a new workbook that was announced recently. There are 53 new or improved figures, serialisations neatly in code listings, a prettier layout, and other enhancements. Overall, Chapters 1, 7, 8, and 11 have been revised substantially, Chapters 2, 4, 5, 6, and 9 have been updated for a considerable part, and there are additions and improvements in Chapters 3 and 10. All the enhancements caused the main content to increase by 80 pages, or about 30%, compared to the first edition.

Additional materials

The website accompanying the textbook has been growing over the years, with new ontologies for the tutorials and exercises, additional software for the exercises, and instructions for improved accessibility for the visually impaired (see here or here). The slides still have to be updated, which I’ll probably only get around to doing when the next installment of the ontology engineering course is scheduled.

I moved the answers to selected exercises from the appendix to the workbook that now also contains tutorials, in order to keep the printing costs of the book relatively low. To recap from the post about the workbook: the new tutorials are about developing good ontologies, reusing ontologies, reasoner-based use of OntoClean, and generating natural language sentences from ontologies.

Audience

The book is still unapologetically aimed at postgraduate students in computer science. There is no quick fix for developing a good ontology, and so the book doesn’t offer one either. Rather, it looks at the foundations of ontology engineering and ontology development. Not everyone needs all that, and that’s fine; some do need it to devise novel theories, methods, techniques, and tools for ontology development, and to maintain the ontology engineering software ecosystem.

What didn’t make it?

The LLM section is short. Results are coming in indeed, but there’s a difference between preliminary research results and a textbook. The section on competency questions is also short: while there are substantive results, it is not as core a topic as the others that did receive more attention. And a content request for guidelines on how to develop an ontology collaboratively could not be honoured beyond an informal sidebar, because there are no substantive, reliable, tested, and working materials on what works well. Other topics are nice, but would fit better in a more practical guide for industry, such as how to manage ontology annotations. Hard choices had to be made, for various reasons. If you would have made different choices: you always can cherry-pick from the content and supplement it with other materials—I did so with other subjects’ respective textbooks for courses I taught—or try to write your own textbook.

Is it as cheap as the first edition?

No, sorry. There’s considerably more content and thus more pages to print, the size is different, which affects the softcover hardcopy price as well, and, unlike the first edition, it’s printed in colour. The colour really helps with the explanations in a way that B&W print wouldn’t have. Then there are price fluctuations due to currency exchange fluctuations. There’s a cheaper eBook version, and that file likely will end up on a sharing site at some point. College Publications is non-profit, and so at least the cost is still substantially lower and more affordable than it otherwise would have been.

I hope you will find the updated version interesting and useful. I’m happy to receive feedback and endeavour to be responsive.

On the anatomy of different types of competency questions

Late 2024, Zubeida and I proposed the notion of different types of competency questions for ontologies [1], with each type serving a different purpose. Some youngsters attending the talk at EKAW’24 deemed it not only obvious that there are different types of questions but that surely they existed already. They didn’t. We have been working on this line of investigation since: what are the constituents of those types of competency questions that make them different from each other?

To answer that question, one might be tempted to give NLP a try or feed an LLM a large number of competency questions (CQs) and let it figure it out. We didn’t do that. We looked at the meaning behind the questions rather than just surface structures and probabilities, because we already could see that NLP wouldn’t be able to second-guess that much implicit information. (ChatGPT might have embedded an unreliable ‘top-level ontology’ of its own.)

For instance, a question about a relational property (rpRCQ), such as “If a human loves, must it also love itself?”, refers implicitly to the ‘loving’ relation and the relational property assessed (reflexivity), though one of course could do so explicitly as well, as in “Is the counsels relationship between a therapist and a patient irreflexive?”. A scoping CQ has different constituents: it must have as associated elements at least one entity, be in a particular subject domain, and be in relation to a prospective ontology; e.g., in “What can the pipes in a sewer network convey?”, the scoping entities are pipe, sewer network, and conveying something, the subject domain is hydrology, and the prospective ontology is the SewerNet ontology.

Snippet of the Scoping and Validation CQs’ elements.

We analysed all the types of CQs we had identified and determined what makes each distinct from the other types. Besides describing it, we created ER model snippets for each, both for communicative purposes and for the plan to create a database for further analysis, and wrote a corresponding formalisation to be a bit more precise for those who want it. We call the resulting model QuO, and kept it as a ‘model’ because we consider it not mature enough to call it an ontology just yet. The details are described in the recently accepted paper entitled Characterising Competency Questions for Ontologies [2] that will be presented at the 9th Cognition And OntologieS workshop (CAOS), part of the Joint Ontology Workshops (JOWO) that will take place on 8 and 9 September in Catania, Sicily, co-located with FOIS’25 that takes place from 10 to 12 September.

The paper also describes a use case-based evaluation (sewer networks), an illustration of where the different types of CQs can be used in the various ontology development processes, and examples of potential use. We close with a few notes on exemplary CQs, questions that are almost CQs, faulty CQs, and answerability, some reflections in the discussion, and conclusions.

The paper does not describe evaluations with LLMs or other NLP algorithms, but it is an obvious next step: can they be used to improve the quality of automatically generated CQs [3], diversify the stilted short phrases generated, and make the outputted questions more relevant? There are no user evaluations either, because a key question to answer first is: can it feed into a method to assist domain experts to get them to write more good, varied, and useful CQs, and if so how, or could CLaRO [4] somehow be extended and then re-evaluated? And will the types of CQs have a beneficial effect on the quality of the ontology or the CQ authoring process? Such a method is yet to be developed, and workshop papers have page limits, one which we reached already, and the writing is dense as it is. There are no benchmark evaluation reports in the paper either, because there was nothing to compare against; there’s only our extensively annotated dataset of CQs by type, i.e., the Repository of Ontology Competency Questions (ROCQS) that we now have turned into a GitHub repo, yet testing on our own highly curated set wouldn’t be convincing. And so, the evidence is our theoretical argument and the evaluation the applicability in the use case. Did I write this paragraph for a [clueless (?)/lazy (?)/limited-thinking (?)/one-who-used-a-reviewing-checklist (?)] reviewer who whined that “no quantitative evaluation or user study” was a weakness of the paper? Mainly, yes.

Selected examples of the main types of CQs, both instantiated ones and templated ones, and an illustration of problematic CQs and why they are problematic (shorthand description)

While it’s more of a theoretical advance than a practical one, practical applicability and use seem tantalisingly close and ROCQS as a dataset can be of use there. For now, there are those types of CQs, and they have shown their use in the development of the SewerNet ontology, which is a promising start. I’ll present the paper at CAOS; feel free to reach out for follow-up activities.

References

[1] Keet, C.M., Khan, Z.C. On the Roles of Competency Questions in Ontology Engineering. 24th International Conference on Knowledge Engineering and Knowledge Management (EKAW’24). Springer LNAI vol. 15370, pp123–132. Amsterdam, Netherlands, November 26-28, 2024.

[2] Keet, C.M., Khan, Z.C. Characterising competency questions for ontologies. 9th Cognition And OntologieS workshop (CAOS’25), part of JOWO’25, 8-9 September 2025, Catania, Italy. CEUR-WS (in print).

[3] Mahlaza, Z., Keet, C.M., Chahinian, N., Haydar, B. On the Feasibility of LLM-based Automated Generation and Filtering of Competency Questions for Ontologies. Proceedings of the 5th Conference on Language, Data and Knowledge 2025 (LDK’25). ACL. (in press)

[4] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). E. Garoufallou et al. (Eds.). Springer CCIS vol. 1057, 3-15. 28-31 Oct 2019, Rome, Italy.

Supercharging marketing communications with knowledge graphs

Book cover (source: Petkova’s website)

Either this post’s title or something like ‘21st century technology to manage marketing content’ seems to me a more apt subtitle for Teodora Petkova’s “Being dialogic” book that she published last year. Now the subtitle is a humble and unassuming “Marketing communications for the web of people and machines”. Teodora kindly sent me a signed copy (thank you!), and I finally made the time to read it over the past weeks. I’ll briefly review it in this post.

Why?

Why would I even bother to read something about marketing? As any regular reader of this blog knows, I could be classified, at best, as a layperson when it comes to marketing, and a disinterested one at that. Teodora’s book is definitely aimed at, and highly readable for, interested laypeople, people in marketing and SEO, communication studies experts, and the like. I read it not to dutifully return the favour (she reviewed my book), but out of curiosity to see how and why someone with a non-technical background ended up with Semantic Web technologies. Also, there’s the personal branding we all have to engage in regardless of whether we want to or not, and working at an AI startup/SME there’s no avoiding the topic of marketing either.

Structure and content of the book

The book is structured in four parts. The first two parts mainly take the sociological angle (written in sociologese) to the theme, covering traditional marketing, what communication and marketing communications are and who their participants are, and what dialogue means. Part 3 introduces the Semantic Web and knowledge graphs (KGs) from a user perspective, and part 4 combines the knowledge from the first two parts with that of the third to propose a new mixture, where marketing communication is managed by a marketing knowledge graph, with the claim that one should thereby be able to achieve more than with plain (non-KG) marketing communications alone.

Teodora proposes that a marketing KG can help annotate marketing content for better structuring and querying of that content, that it can assist with keeping track of marketing events, which are more diverse than one might think, that one can plot personalised website engagement paths, and more. For the ‘and more’ and for what sort of channels and activities exist for marketing, parts 1 and 2 of the book come in handy.
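
To make the querying part concrete with a toy example of my own (not from the book): with such a marketing KG loaded in, say, rdflib, fetching all content items annotated with a given topic, together with the channel they were published on, could look as follows; the KG file, namespace, and property names are made up for the illustration.

    # Toy sketch of querying a marketing KG; the KG file, namespace, and
    # property names are hypothetical.
    from rdflib import Graph

    g = Graph()
    g.parse("marketing-kg.ttl", format="turtle")  # a hypothetical marketing KG

    query = """
    PREFIX ex: <http://example.org/marketing#>
    SELECT ?content ?channel
    WHERE {
      ?content ex:aboutTopic ex:KnowledgeGraphs ;
               ex:publishedOn ?channel .
    }
    """
    for row in g.query(query):
        print(row.content, row.channel)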

Marketing used to be a one-way street from company to consumer, but with the “intertextual animal [the human] foraging the Web for information” (p16), there are many communication channels that can be one-way or two-way streets, Teodora argues. The netizen in “Cyberia” also obtains information from product reviews on websites, emails, and whatnot, including the new LLM-generated snippets forced upon us in Google searches. Audiences are not only the humans reading the text; there are also “algorithmic audiences” (p73) to adjust the marketing message to, and those algorithms had better be spoon-fed structured content for a higher chance of quality output being served up semi-digested to the wandering Web surfer and the purposefully information-seeking proletariat.

Teodora’s thesis is that the marketing communication on the Web is, or should be, a dialogue—of the consumer wanting to be heard, the organisation needing to listen, the messages to be adjusted to the dialogue between the actors, obtaining interactivity, mutual learning, and generating actionable information for the consumer. And that the Web, and the semantic Web in particular, as digital public sphere can enable all that. There are many more nuggets described in part 2 to substantiate the claim and to provide a broader setting, including a framework for “building dialogic relationships” on the Web (pp125-130), mass marketers as hunters versus relationship marketers as farmers growing long-term fruitful relationships, and that dialogic marketing leans towards the latter.

Part 3, as well as chapter 10, looks at the infrastructure to realise the marketing dialogue. If you already know knowledge graphs, most of it can be skimmed, except for chapter 9, “Honey, I shrunk the dialogue with schema.org!”, and its table on pp. 186-191 in particular. Apparently about half of the Web’s websites already use RDFa, microdata, or JSON-LD statements, and in that table Teodora maps the theory, with its dialogic principles and concrete mechanisms, to schema.org elements, showing with which bare minimum of semantics from schema.org those dialogic components can be implemented.
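
As a toy illustration in that spirit (my own example, not one of the table’s mappings): marking up a blog post and a reader’s comment with a bare minimum of schema.org in JSON-LD already exposes a two-way exchange in machine-readable form; the names and texts are made up.

    # Emit a minimal schema.org JSON-LD snippet for a post with a comment,
    # as one of the simplest machine-readable 'dialogic' building blocks.
    import json

    post = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": "Why a marketing knowledge graph?",
        "author": {"@type": "Person", "name": "Example Marketer"},
        "comment": [{
            "@type": "Comment",
            "text": "How does this help with personalised engagement paths?",
            "author": {"@type": "Person", "name": "Example Reader"},
        }],
    }
    print(json.dumps(post, indent=2))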

Chapters 11 and 12 focus on using KGs “to underpin relationship marketing”, “as repositories to store knowledge about customers and internal concepts”, and “as content data hubs to power content experiences” (p210) where the KG “is a potential technological solution for contextualized, personalized, and, in this sense, dialogic marketing communications” (p212). I suppose there is a fine line between more sophisticated manipulation and trustworthy persuasion of the consumer.

Chapter 12 contains several use cases, including one at Graphwise (which Ontotext is now part of), where she works. The recurring persona, Tim, is able to do a lot with the KG-powered marketing platform, to “"talk" to the content, browse it through faceted search and search for certain keywords and topics, navigating a tree of concepts. Tim is also able to ask an LLM about anything to get answers which are contextually relevant to Ontotext”. It leaves a key question unanswered: does just this make it dialogic? What I would have liked to see here are those concepts from part 2 mapped to the Ontotext solution, and that, and how, it meets being a dialogic software solution. This may be due to my technical lens through which I read the book, and perhaps the chapter’s content already suffices for readers with a marketing background.

It remains to be seen whether sophisticated KGs for marketing will result in improved, honest persuasion and in ease of finding correct and actionable information, or will turn into sneakier manipulation. One may hope for the former, but the dangers of the latter have been pointed out already. A concrete controlled experiment could be a fun follow-up to the book’s proposal.

I would have liked to see more concrete examples in detail, such as a snippet of that Ontotext marketing KG and how exactly that section of the KG solves a specific problem or manages a particular dialogue, or else with the described IKEA, BBC, or jazz cases, or even a made-up scenario and data. The book’s presentation and polishing also could have benefitted from a better editor and proofreader, but maybe I notice those things more than the average reader because I’ve been bogged down with those issues recently in my own writings.

In closing

Overall, it is an easy read and I’ve learned something from it, so it was time well spent. For anyone interested in semantic technologies, it can serve as a motivational use case for KGs and for sipping your own champagne; for marketing and communications experts, it is an entry point into technological advances that can help you do your job better.

New workbook for the Ontology Engineering textbook

The first part of my textbook revisions is complete and there’s now a first version of the accompanying workbook (also available here)! It’s designed partially so as not to substantially increase the printing costs of the impending version 2 of my textbook on ontology engineering, partially for ease of access and updates, and partially to pull in the various resources developed over the past few years since version 1.5 of the textbook was released in 2020. It is also a collaboration with Zola Mahlaza, who developed one extensive tutorial from scratch and substantially revised another.

Cover of the workbook accessible at https://people.cs.uct.ac.za/~mkeet/OEbook/OEWorkBook.pdf

What’s in the new workbook? First and foremost, there are non-trivial integrative tutorials in academic tutorial style. That is, the tutorials are of the variety of “give it a try; succeed for a bit, get stuck; try some more; and the teaching assistant or lecturer will explain a solution for some tasks; and en passant the dead ends in trying turn out to be insights useful for later exercises, tasks, and tutorials”. They may superficially look like cookbook style, but they make you engage with the material to obtain a much better understanding than rote learning or a copy-paste-press-button approach would ever be able to provide.

Unlike the revision questions and exercises in the textbook that concern the particular chapter they are part of, the tasks in the tutorials draw from content covered across at least two chapters, as described in the introduction of each tutorial.

There are four tutorials in the workbook: an introduction to developing good ontologies (mainly the ISAO23 tutorial), a new tutorial on ontology reuse, the tutorial on OntoClean using an ontology editor and an automated reasoner (cf. a paper-based exercise), and the tutorial from JOWO22 on ontology verbalisation, where you’ll build your own verbaliser. Could there be a tutorial that uses an LLM for at least one of the tasks in ontology development? Yes; it’s not included explicitly in the pdf file now, lest I have to update it every few months. An easy way to incorporate LLM use is to append an experiment to the ontology verbalisation tutorial: prompt an LLM for the verbalisation of the chosen ontology, and compare the LLM’s output with your verbaliser’s output on fluency and semantic accuracy of the generated text.
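
A rough sketch of what that appended experiment could look like is as follows, where ask_llm is a hypothetical stand-in for whichever LLM interface you have access to, and the token overlap is only a crude proxy; the fluency and semantic accuracy judgements remain yours to make.

    # Sketch of comparing your own verbaliser's output with an LLM's
    # verbalisation of the same ontology; ask_llm() is a hypothetical
    # stand-in for an actual LLM client.
    def ask_llm(prompt: str) -> str:
        # Replace with a real LLM call; the canned reply keeps the sketch runnable.
        return "Every professor teaches at least one course."

    def token_overlap(a: str, b: str) -> float:
        """Crude proxy for similarity between two sentences."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)

    my_sentences = ["Each professor teaches at least one course."]  # illustrative verbaliser output
    llm_output = ask_llm("Verbalise the axioms of the chosen ontology as "
                         "simple English sentences, one per line.")
    llm_sentences = [s for s in llm_output.splitlines() if s.strip()]

    for mine in my_sentences:
        best = max(llm_sentences, key=lambda s: token_overlap(mine, s))
        print(f"{mine!r} vs {best!r}: overlap {token_overlap(mine, best):.2f}")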

The workbook also includes the descriptions of two assignments (I’ve updated the list of mini-project topics) and answers to selected exercises of v2 of the textbook. The appendix has a brief introduction to Protégé, with a few annotated screenshots, for those who don’t want to RTFM and think they can do the first tutorial at home rather than in the lab, or who are self-study students, and it has guidance on citing related work. That last topic may seem out of place, but ontology engineering tends to be taught in late undergrad/early postgrad, where many computer science students need guidance on how to cite scientific papers, which is required and assumed knowledge for both assignments.

There likely will be typos despite our best efforts, and maybe something is not as clear as Zola and I think it is, so please feel free to contact Zola or me with feedback. I may be willing to develop a cookbook-style or an industry-level-and-quality tutorial, provided there are sufficient requests for it, but it’s not a top priority on my list of leisure time activities at the moment. Having completed the main updates to the textbook, the lion’s share of the writing time is being taken up by editing and proofreading the draft.

Progress on fully automated competency question generation with AgOCQs++

Competency questions for ontologies are useful in ontology development for setting the scope, ontology reuse, modelling, and validation, among other tasks. The problem is that no-one really wants to write them, and even when they do, the questions are of varying quality, resulting in mixed experiences. With the popularity of LLMs these days, and given that question writing is a text-based task, it’s obvious to want to try to make one of those language models do it for us. That turns out to be easier said than done.

Early results in LLM-based competency question (CQ) generation for ontologies tried prompting guided by existing ontologies [AlharbiEtAl24,PanEtAl25], generating from a mini-corpus of a handful of documents [AntiaKeet23], and recently also generating from user stories [LippolisEtAl25]. Domain experts probably aren’t too happy writing a good number of user stories either if they can barely get themselves to write single sentences for CQs. Retrofitting can be useful for ontology reuse, but that won’t help the modeller who has to develop a new ontology or a new module to extend an ontology. We’re interested in that latter scenario.

We first had a go at that setting over two years ago, which resulted in the “AgOCQs” pipeline that principally used T5, the SQuAD dataset for training, and the CLaRO controlled language for filtering out bad questions [AntiaKeet23]. AgOCQs was evaluated in a survey with humans, in one subject domain, and it had some manual processing steps. We wanted to know whether it was as domain-independent as it was expected to be and, more interestingly from a scientific viewpoint, the broad question it raised was: how to (fully) automate obtaining relevant CQs, and to do so in such a manner that the CQs can be traced to the source through the various generation and filtering steps they undergo? And, of course, a sufficient number of the CQs should be relevant: a fully automated pipeline that is fed a mere 4-6 documents, plus 10-15 minutes of checking for relevant in-domain documents, is easier than spending hours crafting questions oneself or wading through bad output for hours.

Narrowing down the broader question into more manageable chunks, we focussed on:

  1. Is the AgOCQs pipeline effective for use cases in domains other than the one it was tested on (COVID-19)?
  2. Is AgOCQs effective on other types of documents, i.e., not just scientific articles, but also standards and guidelines?
  3. What is the effect of different corpus sizes on the number and quality of the CQs generated?
  4. What exactly is the contribution of each filtering step to AgOCQs’s output?
  5. What is the effect of the SQuAD training set on the quality of the output?

We assessed and upgraded AgOCQs, and used a human evaluation in which two domain experts assessed the generated questions on scope and relevance, and an ontology expert assessed the questions on quality. We managed to answer all the specific research questions, which are described in detail in the paper “On the Feasibility of LLM-based Automated Generation and Filtering of Competency Questions for Ontologies” [MahlazaEtAl25] that was recently accepted for the 5th Conference on Language, Data and Knowledge 2025 (LDK’25), which will be held in Napoli in September.

To jump straight to the interesting results: the first updated algorithm (dubbed “AgOCQs+” for clarity and lack of imagination) is now fully automated and has traceability features throughout the pipeline to provide insights into the effects of each step (see figure below). It showed that filtering the fine-tuned T5’s output with the CLaRO controlled language for CQs for ontologies [KeetEtAl19] is good for output quality, but not for diversity in sentence structure. A corpus of a mere 4 documents was already enough to obtain a fair number of usable CQs. Genre had little effect overall: we observed better quality and scope for the questions generated from the sewer networks guidelines compared to those generated from the scientific articles, but that evened out when taking into account relevance for the chosen ontology.
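
To give an idea of what such pattern-based filtering amounts to, here is a simplified sketch; the patterns below are illustrative stand-ins, not the actual CLaRO templates, which are documented in [KeetEtAl19]:

    # Simplified sketch of filtering generated questions against a few
    # CLaRO-like sentence patterns; these regexes are illustrative stand-ins,
    # not the actual CLaRO templates.
    import re

    claro_like_patterns = [
        re.compile(r"^What is (the|a|an) .+\?$", re.IGNORECASE),
        re.compile(r"^Which .+ (has|have|is|are) .+\?$", re.IGNORECASE),
        re.compile(r"^How many .+\?$", re.IGNORECASE),
    ]

    candidates = [
        "What is the purpose of a diffuser?",             # matches: kept
        "What is the minimum height of the weir plate?",  # matches: kept
        "the average daily flow of approximately",        # no match: filtered out
    ]

    kept = [q for q in candidates if any(p.match(q) for p in claro_like_patterns)]
    print(kept)  # strict patterns improve surface form, at the cost of structural diversity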

And so we experimented some more, ending up with “AgOCQs++”, whose pipeline is shown in the figure below.

The outcome? Relaxing the CLaRO filtering resulted in more varied output (yay), but also in more bad-quality strings, which required a number of additional strategies to get rid of, so as to ensure that the good CQs won’t drown in a sea of bad questions, sentences, and gibberish (see the paper for details).

Here’s a sampling of CQs that were evaluated as within scope, relevant for the SewerNet ontology about sewer networks, and of acceptable quality:

  • What is the rated capacity of the sewage treatment plant?
  • What does the rainfall reduction method involve?
  • What is the purpose of a diffuser?
  • What is the purpose of an energy efficient treatment process?
  • What is the purpose of a storm sewer system?
  • What is the purpose of a major drainage system?
  • What is the purpose of the two wastewater cycles?
  • What is the definition of the pipe network?
  • What is the transmission of Qs?
  • What is the minimum height of the weir plate?
  • During dry weather periods, what is the average daily flow of approximately m s?
  • What is the minimum number of conduits connecting any manhole to the ground?
  • What is the purpose of the proposed SSWMS?

Conducting the evaluation the hard way with humans, as opposed to BLEU and the like, did pay off in ‘fringe benefits’, or rather: we obtained additional useful insights. First, the evaluation of CQ quality resulted in a better understanding of the features that determine CQ quality for ontologies, part of which was implemented in the ‘conceptual filtering’ stage of the pipeline. Second, the domain experts, Batoul Haydar and Nanée Chahinian, needed concrete in-domain examples of what it means for a CQ to be in or out of scope of their domain (hydrology) and relevant or irrelevant for their ontology (SewerNet), which had to be established during a conversation with them before conducting the evaluation. For instance, questions about drinking water were out of scope of sewer networks, wastewater quality measurement was within scope but not relevant for the SewerNet ontology, and a manhole cover question was both within scope and relevant. Such a human training phase is obvious in hindsight, yet I don’t recall a CQ authoring method that contains a deliberation to that effect.

Are we done now? Our paper is unlikely to be the last word on automating CQ generation. Among others, if only we had a SQuAD-like dataset of CQs for ontologies, then the pipeline should be able to output better quality CQs, because the SQuAD questions are not exemplary as CQs for ontologies due to SQuAD’s intended purpose (testing QA systems). And that, for now, is a catch-22: to create a good dataset for training or fine-tuning an LLM, we need very many CQs of good quality, and for that we need to know what good CQs are, and ideally also devise and automate quality metrics to evaluate CQs, yet the task we’re trying to farm out to the LLM is precisely to solve the lack of good-quality CQs. Second-best options for possible gains might be to fine-tune another LLM or to update the CLaRO patterns and templates by taking into account the recent quality assessment of its source data [KeetKhan24].

Meanwhile, if you want to try out AgOCQs+ or AgOCQs++: my collaborator Zola Mahlaza, the main author of the paper, made the materials available on his GitHub repo, and he’ll also present the paper at LDK’25. You can contact either of us if you’d like to know more about it.

References

[AlharbiEtAl24] Alharbi, R., Tamma, V., Grasso, F., Payne, T. The Role of Generative AI in Competency Question Retrofitting. ESWC 2024, Extended Semantic Web Conference, May 2024, Hersonissos, Greece.

[AntiaKeet23] Antia, M.-J., Keet, C.M. Automating the Generation of Competency Questions for Ontologies with AgOCQs. 5th Iberoamerican conference on Knowledge Graphs and Semantic Web (KGSWC’23). F. Ortiz-Rodriguez et al. (Eds.). Springer LNCS vol 14382, 1-15. 13-15 Nov 2023, Zaragoza, Spain.

[KeetEtAl19] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). E. Garoufallou et al. (Eds.). Springer CCIS vol. 1057, 3-15. 28-31 Oct 2019, Rome, Italy.

[KeetKhan24] Keet, C.M., Khan, Z.C. On the Roles of Competency Questions in Ontology Engineering. 24th International Conference on Knowledge Engineering and Knowledge Management (EKAW’24). Springer LNAI vol. 15370, pp. 123-132. Amsterdam, Netherlands, November 26-28, 2024.

[LippolisEtAl25] Lippolis, A.S., Ragagni, M. D., Ciancarini, P., Nuzzolese, A. G., Presutti, V. Bench4KE: Benchmarking Automated Competency Question Generation. Preprint, arXiv 2505.24554. 2025.

[MahlazaEtAl25] Mahlaza, Z., Keet, C.M., Chahinian, N., Haydar, B. On the Feasibility of LLM-based Automated Generation and Filtering of Competency Questions for Ontologies. Proceedings of the 5th Conference on Language, Data and Knowledge 2025 (LDK’25). ACL. (in press)

[PanEtAl25] Pan, X., Van Ossenbruggen, J., De Boer, V., Huang, Z. A RAG approach for generating competency questions in ontology engineering. Preprint, arXiv:2409.08820. 2025.