An initial comparison of approaches to automating the addition of data to knowledge graphs

A bottom-up approach to knowledge graph development may be partly similar to bottom-up ontology development from existing and legacy resources, but it certainly also involves other tasks, and perhaps principally so. The major other task concerns the instance data that also have to end up in the graph. Digging into a pile of instances gets one's hands dirty, however, and therefore substantial work has gone into trying to automate the processing of the source data and the loading of them into a knowledge graph. Different types of source formats give rise to distinct frameworks with their own algorithms and mapping languages, and there can be disparate core requirements on top of that, which add to the ever-growing list of potential solutions to choose from.

For instance, converting relational database data into a graph requires a different algorithm from converting tree-shaped semi-structured data (XML) into a knowledge graph. A requirement for a high-quality graph with legal consequences in case of errors demands more quality control checks along the extract-transform-load (ETL) pathway than a community-based, best-effort, will-do graph. Control over the source, and whether the source must be maintained or can be abandoned once the data has been converted, also affects how to go about creating that knowledge graph, as does whether all the data has to be converted or only the part that is of interest right now.
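To make the contrast a bit more tangible, here's a minimal sketch of the tree-to-graph case in Python with rdflib and the standard library's XML parser. Everything in it (the element names, the namespace, and the class and property choices) is made up for illustration; a real pipeline would take those schema-level decisions from a mapping specification or the target ontology.

```python
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/ontology#")

xml_data = """
<catalogue>
  <book id="b1"><title>A Book Title</title><year>2020</year></book>
  <book id="b2"><title>Another Book Title</title><year>2023</year></book>
</catalogue>
"""

g = Graph()
g.bind("ex", EX)

# Walk the XML tree and emit one resource per <book> element, with its child
# elements as data properties; choosing the class and property names is the
# part that a mapping language or the target ontology would dictate.
for book in ET.fromstring(xml_data).findall("book"):
    subject = EX[book.get("id")]
    g.add((subject, RDF.type, EX.Book))
    g.add((subject, EX.title, Literal(book.findtext("title"))))
    g.add((subject, EX.year, Literal(book.findtext("year"), datatype=XSD.gYear)))

print(g.serialize(format="turtle"))
```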

Many algorithms and tools have been proposed to automate the task. Since this is only a blog post, I’m going to condense the extant approaches that I know of into three main groups:

  1. ETL-then-quality: first extract stuff from the unstructured, semi-structured, or structured source(s), dump the generated triples in the graph, and then clean it up as much as feasible;
  2. Quality ETL to KG: the semi-structured or structured source is carefully mapped to the schema of the graph and then the data is loaded into the knowledge graph accordingly;
  3. Virtual KG: only a selected fragment of the structured source data is exposed as a graph, with the triples computed on the fly at query time.

Each approach has its own set of permutations. A feature comparison then inevitably leads to a few generic statements. I gave it a try nonetheless, as shown in the table below.

Let's look at some of those values in the table. The "at least one" set of mappings for VKGs refers to the fact that scalable VKGs can have two sets of mappings, or one set of mappings plus transformations; e.g., Ontopic's VKG has a dedicated mapping language to map the ontology to the data and then uses W3C's R2RML mapping language to convert the query answer into RDF. This makes it exceedingly suitable for scenarios where the source data need to remain where they are and are regularly updated, resulting in always up-to-date content in the RDF graph.

The quality ETL works only with semi-structured or structured data (extraction from unstructured data would be of too low quality) and may cater for, among others, selected or all trees in the XML files or all attributes in all tables of a relational database, using any one of a variety of approaches, ranging from dedicated mapping languages, such as RML, to ad hoc code to queries. Since each ETL conversion takes time, it makes sense to use this approach when one no longer needs the source data in that older format, although re-running the transformation is easily done.
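For the mapping-language flavour of this, here's a hedged sketch of the underlying idea in Python rather than in actual R2RML or RML syntax: a small declarative structure states which column provides the IRI, which class the rows instantiate, and which columns map to which properties, and a generic loop turns rows into triples. All names (the EX namespace, table columns, classes, properties) are invented for the example.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/kg#")

# A declarative mapping in the spirit of R2RML/RML, written as a plain Python
# structure for illustration: the mapping is data, the conversion loop is generic.
mapping = {
    "class": EX.Employee,
    "iri_column": "emp_id",
    "properties": {"name": EX.name, "dept": EX.worksIn},
}

rows = [  # stand-in for a cursor over a relational table or a parsed XML tree
    {"emp_id": "e101", "name": "Thandi", "dept": "d7"},
    {"emp_id": "e102", "name": "Pieter", "dept": "d3"},
]

g = Graph()
g.bind("ex", EX)
for row in rows:
    subject = EX[row[mapping["iri_column"]]]
    g.add((subject, RDF.type, mapping["class"]))
    for column, prop in mapping["properties"].items():
        g.add((subject, prop, Literal(row[column])))

print(g.serialize(format="turtle"))
```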

For ETL-then-quality, the source format can range from unstructured text to spreadsheets and beyond, and such best-effort approaches are typically not constrained upfront by a complex domain ontology, so as to allow for as much triple generation as possible; rather, a schema of sorts emerges from the conversion and the cleaning up afterwards. It is, perhaps, the best-known option, with NLP-based (including LLM-based) KG construction, a dash of Schema.org, and Google's knowledge panels as the end-user-facing rendering of the output. There are obviously lots of methods and techniques for the various subtasks, but I'm not aware of a methodology for it.

I've worked on both the VKG/OBDA and the quality ETL approaches, and have read up on and used the ETL-then-quality one. I don't have a preference on the whole, because the distinct approaches serve different scenarios, although a half-baked quality improvement in the ETL-then-quality approach irks me as a job half-done at best. This is not a case of bothsidesism, but of understanding their respective strengths and thereby which one to use when, and, consequently, that one can always construct use cases with evaluation requirements where one of them wins by design, as is also the case for the four approaches to VKG/OBDA. A brief overview of those considerations is included in Chapter 8 of the second edition of my ontology engineering textbook. There, I also take other features into account, such as computational complexity and the open vs. closed world assumption.

Are there other KG construction approaches with an automated component? Should there be more or other implementation-independent features to compare them on? Are some features more important than others? Is one approach much more prevalent in industry than the others, be that in number of projects, size, or funding, or in prospects for successes and failures in solving the problem? Questions, questions.

Regardless, later this week I'll give a presentation about a particular Quality ETL to KG approach at the European Data Conference on Reference Data and Semantics (ENDORSE 2025) in the 'AI driven data interoperability & mapping' session. Meaningfy developed both a preliminary methodology and a supporting tool, called Mapping Workbench, that focusses on converting XML files to an ontology-based knowledge graph in this first iteration. Early results were presented at the SEMANTiCS conferences in 2023 and 2024 in the posters & demos session, and we've been working on more tool and methodology enhancements since then, including automating the testing of whether the mappings are correct and looking into more types of data as input. If you're interested in either the methodology or the tool, be it for research or for bringing your old XML data (or another format) into the 21st-century technology ecosystem, and the 2020s really, please feel free to contact me, attend the presentation, or arrange for a meeting with a demo afterwards. I've also installed and checked in on the conference app.

Second Edition of the Ontology Engineering Textbook is available now!

front cover OE book v2

It is here: the second edition of An Introduction to Ontology Engineering, also published with the non-profit College Publications and making its way through the distribution channels (e.g., on amazon.co.uk and amazon.de through print-on-demand already).

Did I really want to write a second edition rather than moving on to write a brand new book? Not really, but it was pulling at me hard enough to go ahead with it anyway. Seven years have passed since the publication of the first edition, and everything before COVID seems so long ago. New scientific results have been obtained, new interests have emerged, known gaps in the book's content needed to be filled more urgently than before, and ontologies are on the rise again for a variety of reasons. Those ontologies may, at times, be only lightweight, incorporated in a knowledge graph, or presented as a harmless-looking low-entry controlled vocabulary, but it's a renewed interest in modelling nonetheless. Also, I fancy the thought that my writing skills have improved over the years and that I could improve the book that way as well.

What’s the difference? What’s new?

I've added several recent advances and gracefully shortened or removed older material that has lost relevance. The updated and extended NLP section is but one such case (see also the recent post), the OBDA chapter another, and, yes, there is now a section on competency questions and I could not avoid writing something about LLMs either. Chapter 11 is new, consisting of integrative and advanced topics, such as ontology modularisation and ontology matching, and a few other topics.

There are also improved explanations and better contextualisation in the broader firmament, like where logics fit and which ones, the DIKW pyramid to also visually position ontologies in the knowledge layer, a new integrated diagram for ontology development methodologies (trialled in this post), and more.

Other novelties are highlighted "learning outcomes" formulated per chapter, more exercises in the book, and a new workbook that was announced recently. There are 53 new or improved figures, serialisations neatly in code listings, a prettier layout, and other enhancements. Overall, Chapters 1, 7, 8, and 11 have been revised substantially, Chapters 2, 4, 5, 6, and 9 have been updated for a considerable part, and there are additions and improvements in Chapters 3 and 10. All the enhancements caused the main content to increase by 80 pages, or about 30%, compared to the first edition.

Additional materials

The website accompanying the textbook has been growing over the years, with new ontologies for the tutorials and exercises, additional software for the exercises, and instructions for improved accessibility for the visually impaired (see here or here). The slides still have to be updated, which I'll probably only get around to doing when the next installment of the ontology engineering course is scheduled.

I moved the answers to selected exercises from the appendix to the workbook that now also contains tutorials, in order to keep the printing costs of the book relatively low. To recap from the post about the workbook: the new tutorials are about developing good ontologies, reusing ontologies, reasoner-based use of OntoClean, and generating natural language sentences from ontologies.

Audience

The book is still unapologetically aimed at postgraduate students in computer science. There is no quick fix for developing a good ontology, and so the book does not offer one either. Rather, it looks at the foundations of ontology engineering and ontology development. Not everyone needs all that, and that's fine; some do need it to devise novel theories, methods, techniques, and tools for ontology development, and to maintain the ontology engineering software ecosystem.

What didn't make it?

The LLM section is short. Results are indeed coming in, but there's a difference between preliminary research results and a textbook. The section on competency questions is also short: while there are substantive results, it is not as core a topic as the others that did receive more attention. And a content request for guidelines on how to develop an ontology collaboratively could not be honoured beyond an informal sidebar, because there are no substantive, reliable, tested, and working materials on what works well. Other topics are nice, but would fit better in a more practical guide for industry, such as how to manage ontology annotations. Hard choices had to be made, for various reasons. If you would have made different choices: you can always cherry-pick from the content and supplement it with other materials (I did so for other subjects' textbooks for courses I taught), or try to write your own textbook.

Is it as cheap as the first edition?

No, sorry. There's considerably more content and thus more pages to print, the size is different, which affects the softcover hardcopy price as well, and, unlike the first edition, it's printed in colour. The colour really helps with the explanations in a way that B&W print wouldn't have. Then there are price fluctuations due to currency exchange rate fluctuations. There's a cheaper eBook version, and that file likely will end up on a sharing site at some point. College Publications is non-profit, and so at least the cost is still substantially lower and more affordable than it otherwise would have been.

I hope you will find the updated version interesting and useful. I’m happy to receive feedback and endeavour to be responsive.

Progress on fully automated competency question generation with AgOCQS++

Competency questions for ontologies are useful in ontology development for setting the scope, ontology reuse, modelling, and validation, among other tasks. The problem is that no-one really wants to write them, and even when they do, the questions are of varying quality, resulting in mixed experiences. With the popularity of LLMs these days, and given that question writing is a text-based task, it's obvious to want to try to make one of those language models do it for us. That turns out to be easier said than done.

Early results in LLM-based competency question (CQ) generation for ontologies tried prompting guided by existing ontologies [AlharbiEtAl24,PanEtAl25], based on a mini-corpus of a handful of documents [AntiaKeet23], and recently also from user stories [LippolisEtAl25]. Domain experts probably aren't too happy writing a good number of user stories either if they can barely get themselves to write single sentences for CQs. Retrofitting can be useful for ontology reuse, but that won't help the modeller if they have to develop a new ontology or a new module to extend an ontology. We're interested in that latter scenario.

We first had a go at that setting over two years ago, which resulted in the "AgOCQs" pipeline that principally used T5, the SQuAD dataset for training, and the CLaRO controlled language for filtering out bad questions [AntiaKeet23]. AgOCQs was evaluated in a survey with humans, in one subject domain, and it had some manual processing. We wanted to know whether it was as domain-independent as it was expected to be and, more interestingly from a scientific viewpoint, the broad question it raised was: how to (fully) automate and obtain relevant CQs, and to do so in such a manner that the CQs can be traced to the source through the various generation and filtering steps they undergo? And, of course, a sufficient number of the CQs should be relevant: a fully automated pipeline that is fed a mere 4-6 documents plus 10-15 minutes of checking for relevant in-domain documents is easier than spending hours crafting questions oneself or wading through bad output for hours.

Narrowing down the broader question into more manageable chunks, we focussed on:

  1. Is the AgOCQs pipeline effective for use cases in other domains than it was tested on (COVID-19)?
  2. Is AgOCQs effective on other types of documents, i.e., not just scientific articles, but also standards and guidelines?
  3. What is the effect of different corpus sizes on the number and quality of the CQs generated?
  4. What exactly is the contribution of each filtering step to AgOCQs's output?
  5. What is the effect of the SQuAD training set on the quality of the output?

We assessed and upgraded AgOCQs, and used a human evaluation where two domain experts assessed the generated questions on scope and relevance and an ontology expert assessed the questions on quality. We managed to answer all the specific research questions, which are described in detail in the paper "On the Feasibility of LLM-based Automated Generation and Filtering of Competency Questions for Ontologies" [MahlazaEtAl25] that was recently accepted for the 5th Conference on Language, Data and Knowledge 2025 (LDK'25), which will be held in Napoli in September.

To jump straight to the interesting results: the first updated algorithm—dubbed “AgOCQS+” for clarity and lack of imagination—is now fully automated and has traceability features throughout the pipeline to provide insights into the effects of each step (see figure below). It showed that filtering the trained (fine-tuned) T5 output with the CLaRO controlled language for CQs for ontologies [KeetEtAl19] is good for output quality, but not for diversity in sentence structure. A corpus of a mere 4 documents was enough to obtain a fine number of usable CQs already. Genre had little effect overall: we observed better quality and scope for the questions generated from the sewer networks guidelines compared to the questions generated from the scientific articles, but that was evened out when taking into account relevance for the chosen ontology.
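To give an idea of what such template-based filtering amounts to, here's a toy sketch in Python. The patterns below are made-up stand-ins, not the actual CLaRO templates (of which there are many more, and more fine-grained ones); the point is merely that a generated string only survives if it instantiates an allowed question pattern, which explains both the quality gain and the loss of structural diversity.

```python
import re

# Made-up stand-in patterns for illustration only; the real CLaRO templates
# are more numerous and more fine-grained (see [KeetEtAl19]).
PATTERNS = [
    re.compile(r"^What is (the |a |an )?[\w\- ]+\?$", re.IGNORECASE),
    re.compile(r"^What is the purpose of (the |a |an )?[\w\- ]+\?$", re.IGNORECASE),
    re.compile(r"^How many [\w\- ]+ does (the |a |an )?[\w\- ]+ have\?$", re.IGNORECASE),
]

def keep(question: str) -> bool:
    """Accept a generated string only if it matches one of the allowed patterns."""
    return any(p.match(question.strip()) for p in PATTERNS)

generated = [
    "What is the purpose of a diffuser?",
    "What is the minimum height of the weir plate?",
    "it rained a lot in the catchment yesterday",   # not a question at all: filtered out
    "What is the transmission of Qs?",              # fits a pattern, yet still a poor CQ
]

print([q for q in generated if keep(q)])
```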

And so we experimented some more, ending up with an “AgOCQS++” whose pipeline is shown in the figure below.

The outcome? Relaxing the CLaRO filtering resulted in more varied outputs (yay), but also more bad-quality strings, which required a number of additional strategies to get rid of, to ensure that the good CQs won't drown in a sea of bad questions, sentences, and gibberish (see the paper for details).

Here’s a sampling of CQs that were evaluated as within scope, relevant for the SewerNet ontology about sewer networks, and of acceptable quality:

  • What is the rated capacity of the sewage treatment plant?
  • What does the rainfall reduction method involve?
  • What is the purpose of a diffuser?
  • What is the purpose of an energy efficient treatment process?
  • What is the purpose of a storm sewer system?
  • What is the purpose of a major drainage system?
  • What is the purpose of the two wastewater cycles?
  • What is the definition of the pipe network?
  • What is the transmission of Qs?
  • What is the minimum height of the weir plate?
  • During dry weather periods, what is the average daily flow of approximately m s?
  • What is the minimum number of conduits connecting any manhole to the ground?
  • What is the purpose of the proposed SSWMS?

Conducting the evaluation the hard way with humans, as opposed to BLEU and the like, did pay off in 'fringe benefits'—or: we obtained additional useful insights. First, the evaluation on CQ quality resulted in a better understanding of the features that determine CQ quality for ontologies, and part of that was implemented in the 'conceptual filtering' stage of the pipeline. Second, the domain experts, Batoul Haydar and Nanée Chahinian, needed concrete in-domain examples of what it means for a CQ to be in/out of scope of their domain (hydrology) and relevant or irrelevant for their ontology (SewerNet), which had to be established during a conversation with them before conducting the evaluation. For instance, questions about drinking water were out of scope of sewer networks, wastewater quality measurement was within scope but not relevant for the SewerNet ontology, and a manhole cover question was both within scope and relevant. Such a human training phase is obvious in hindsight, yet I don't recall a CQ authoring method that contains a deliberation to that effect.

Are we done now? Our paper is unlikely to be the last word on automating CQ generation. Among others, if only we had a SQuAD-like dataset for CQs for ontologies, then the pipeline should be able to output better quality CQs, because the SQuAD questions are not exemplary as CQs for ontologies due to SQuAD's intended purpose (testing QA systems). And that, for now, is a catch-22: to create a good dataset for training/fine-tuning an LLM, we need very many CQs of good quality, and for that we need to know what good CQs are, and ideally also devise and automate quality metrics to evaluate CQs, yet the task we're trying to farm out to the LLM is to solve the lack of CQs of good quality. Second-best options for possible gains might be to fine-tune another LLM or to update the CLaRO patterns and templates by taking into account the recent quality assessment of its source data [KeetKhan24].

Meanwhile, if you want to try out AgOCQS+ or AgOCQS++: my collaborator Zola Mahlaza, as the main author of the paper, made the materials available on his GitHub repo and he’ll also present the paper at LDK’25. You can contact either of us if you’d like to know more about it.

References

[AlharbiEtAl24] Alharbi, R., Tamma, V., Grasso, F., Payne, T. The Role of Generative AI in Competency Question Retrofitting. ESWC 2024, Extended Semantic Web Conference, May 2024, Hersonissos, Greece.

[AntiaKeet23] Antia, M.-J., Keet, C.M. Automating the Generation of Competency Questions for Ontologies with AgOCQs. 5th Iberoamerican conference on Knowledge Graphs and Semantic Web (KGSWC’23). F. Ortiz-Rodriguez et al. (Eds.). Springer LNCS vol 14382, 1-15. 13-15 Nov 2023, Zaragoza, Spain.

[KeetEtAl19] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). E. Garoufallou et al. (Eds.). Springer CCIS vol. 1057, 3-15. 28-31 Oct 2019, Rome, Italy.

[KeetKhan24] Keet, C.M., Khan, Z.C. On the Roles of Competency Questions in Ontology Engineering. 24th International Conference on Knowledge Engineering and Knowledge Management (EKAW’24). Springer LNAI vol 15370, pp123–132. Amsterdam, Netherlands, November 26-28, 2024.

[LippolisEtAl25] Lippolis, A.S., Ragagni, M. D., Ciancarini, P., Nuzzolese, A. G., Presutti, V. Bench4KE: Benchmarking Automated Competency Question Generation. Preprint, arXiv:2505.24554. 2025.

[MahlazaEtAl25] Mahlaza, Z., Keet, C.M., Chahinian, N., Haydar, B. On the Feasibility of LLM-based Automated Generation and Filtering of Competency Questions for Ontologies. Proceedings of the 5th Conference on Language, Data and Knowledge 2025 (LDK’25). ACL. (in press)

[PanEtAl25] Pan, X., Van Ossenbruggen, J., De Boer, V., Huang, Z. A RAG approach for generating competency questions in ontology engineering. Preprint, arXiv:2409.08820. 2025.

Is developing an ontology from an LLM really feasible?

It is, perhaps, a contentious question. I won’t take for an answer a “yes, because LLMs are indistinguishable from magic” – I know they aren’t. It is understandable why ontology developers would want to at least try it: the ‘pre-LLM’ or ‘non-LLM’ pipeline is laborious to say the least and a shortcut would be nice. Let’s take the following comprehensive NLP pipeline as illustration:

Figure. Comprehensive NLP pipeline from text to ontology, using a variety of linguistic, statistical, and logic and evaluation techniques.

The diagram is a draft figure for v2 of my ontology engineering textbook and extends the diagram of Asim et al., who made their version in 2018 as part of their survey on using NLP for ontology development [Asim18], in the times before the meteoric rise of the LLMs. A particular pipeline will carry out some or all of these steps in the middle using one or more techniques in the boxes on the left-hand and right-hand sides.

What is it, scientifically, that makes some people (you?) believe/hope/assume that an LLM can do all those steps at once and at the same level of quality so that it can replace all those arduous subtasks? Why should an LLM be expected to be able to do so?

A few gentle considerations follow – I would like to see reasons why it might be a 'yes' or even a 'promising' avenue with LLMs rather than it looking more like a fool's errand; I'm not out to start a fight on a blog.

There's been some abuse of the term 'knowledge base' by claiming that ChatGPT is a knowledge base, and if an ontology is a knowledge base, then maybe we simply could do away with ontologies and use ChatGPT instead? Nah. The LLMs do not store the (structured) facts, (axiomatised) sentences, and rules to make the inferences, and neither does an LLM offer the reliability of returning the same answer to a query when it's posed more than once over the same version of the knowledge base. Put differently, those LLMs are, technically, not knowledge bases and they won't replace ontologies, not for reliable querying and not elsewhere where ontologies are used.

Spitting out variations and factually wrong output has been amply documented, especially for the ChatGPT toy; it's a symptom of the techniques used for LLMs to date and thus won't go away fully. Two different runs then may lead to two different ontologies, but if the LLM were to capture the facts and the consensus due to learning from lots of texts, the same ontology would have to be output each time. But also: ontology is not a democracy. If a majority is ignorant of the fact that, say, whales are mammals and wrote about their misconception, an LLM may propose them to be fish, but that would make the ontology inconsistent if the rest of the animal classification was represented properly. Inconsistent ontologies are bad computationally.

Even if the LLM were to be seen as a consensus-outputter and one were to subscribe to the premise that ontology development is about consensus-building, this doesn’t entail that the humans in the project have reached consensus just because an LLM said so. (More detailed arguments on consensus-building can be found in the papers by Fabian Neuhaus [Neuhaus22,Neuhaus23].)

Then there are arguments from natural language: there are many ways to describe the same state of affairs in natural language sentences; it is not the case that each utterance refers to a different state of affairs with different entities, entity types, relations, and relationships. The LLMs concern the language layer at best, not the underlying facts. Perhaps there are elusive locations in vector space that numerically capture the elements as 'super abstract' representations of sorts. If so: are they the same across LLMs, or ought they to be so? Shouldn't they be immutable for a defined concept with its 'bundle of features', and be independent of elusive artful prompt engineering and fine-tuning to tease it out of the LLM? (As an aside: going down the road of universals captured with LLMs would be another bridge to find or create, if possible at all.)

I readily admit I’m no expert on LLM-based ontology development, although I have been involved in research on LLM-based tasks in the realm of ontology engineering (and other subfields in computing). Those results didn’t compare gloriously with the respective alternative, which contributes to a certain dose of hesitancy. So, if you have good arguments and results, please feel free to comment and convince me.

References

[Asim18] Muhammad Nabeel Asim, Muhammad Wasim, Muhammad Usman Ghani Khan, Waqar Mahmood, Hafiza Mahnoor Abbasi. A survey of ontology learning techniques and applications. Database, Vol. 2018, 2018, bay101.

[Neuhaus22] Fabian Neuhaus and Janna Hastings. Ontology development is consensus creation, not (merely) representation. Applied Ontology, 2022, 17(4): 495–513.

[Neuhaus23] Fabian Neuhaus. Ontologies in the era of large language models – a perspective. Applied Ontology, 2023, 18(4): 399–407.

Book reviews (2024): FOCUS and Atlas of AI

It’s that time of summer holiday again: reading and reflecting on books read in the past year, having blogged reviews on and off since 2012. For this year, I’ve selected only two books, but I’ll spend a few more words on them:

The tl;dr-versions of the reviews:

  • Atlas of AI: recommendable for those who are unfamiliar with problematic societal aspects of AI. [read the review]
  • FOCUS: an award-deserving, wonderfully written page-turner about the fascinating biography of ASML. A must-read for anyone, and especially for computer scientists and engineers, managers and entrepreneurs, and politicians and R&D funders. [read the review]

Atlas of AI

Upfront, I must admit I bought the book more to look at how other people write highly rated popular science books on societal aspects of IT than expecting to read much new material, as I'm relatively well-read on the topic. It indeed is well written in the sense of easily readable, creative non-fiction, it is well organised, and it has some supporting figures. It mainly covers a broader scope, focussing on attendant aspects of several AI technologies, and so if you have been following the AI scandals reported in the media over the past years, there's nothing really new. There are no new theories, frameworks, or concepts, although the atlas metaphor is an interesting notion for charting the landscape of the myriad of facets that concern AI; it's not just the ChatGPT toy, but also the many other applications pervasive in society, the human labour involved in developing the AI applications, the collection and processing of the data used to train the models, and where and how the hardware is sourced. It's like an updated, broader-scope version of Cathy O'Neil's Weapons of math destruction.

After an introduction, the book has eight chapters, grouped by subtopic of the theme (cf. other typical structures of non-fiction books, such as, ‘problem, how we got there, solution’). Chapter 1 commences with the production process of the hardware: computing is done on machines that are ultimately built from raw materials that have been extracted elsewhere. It’s fair enough as a start to bring the topic afore, though reality is more complicated; see also the book about ASML, below.

Chapter 2 zooms in on labour, mainly referring to the "hidden" manual labour to make AI work, notably the data annotation and assessment of the algorithms' outputs to improve their performance. I know that part all too well from trying to investigate NLG for African languages, and labour exploitation issues have received ample attention in Africa since the news broke about OpenAI and Kenyan workers in click farms. Crawford also provides a platform for two catchy terms: Jonathan Sadowski's Potemkin AI, regarding the window-dressing and hiding all the manual labour it needs, and Astra Taylor's Fauxtomation, on fake automation that refers to pretend-AI whilst redistributing the manual labour (e.g., self check-outs in the supermarket). The chapter also discusses workplace monitoring using AI, up to algorithmic domination.

IBM’s diversity in faces (Source), and the resurgence of pseudo-science (craniology and the like) in facial recognition and emotion detection algorithms.

Chapter 3 looks at data used for training the models and the context where those datasets come from. Read Crawford and Paglen’s freely available Excavating AI essay for a taste of it. The chapter also spells out issues with Big Tech grabbing ever more data for training, having resulted in the “end of consent” – taking away the choice to share, or not, your personal data, and the forced labour for training (e.g., captchas and clicking images to ‘prove’ you’re not a robot) – and the “capture of the commons”.

Ever more data is used in the hope to classify more accurately, and so Chapter 4 is about classification. Crawford takes issue with those data-driven (i.e., subsymbolic) AI people and annotators who choose some odd ad hoc labels for labelling and classification buckets and harm people as a consequence. The next chapter is effectively also about classification, by zooming in on affect and using AI techniques for emotion detection. Crawford traces it all the way back to where the emotion detection research originated. In short, the foundations are murky and unscientific, and she convincingly argues that AI-driven emotion detection is a fool's errand at best.


Chapter 6 moves on to state involvement, surveillance, and so on. The conclusion chapter is mostly a summary of the main points and says very little about a way out of this mess. The coda/final chapter, called Space, consists of a few notes about the tech bros' fascination with exploring space and colonizing it as the last frontier.

Throughout the book, the "AI" thus refers to only a narrow notion of machine learning and AI, with the likes of deep learning, neural networks, and statistics on steroids. Yet there are a number of other specialisations in AI, and it doesn't sit well with me that she tars the whole field with the same brush when only some of its specialisations are causing the headline-worthy debacles.

In addition, and as Crawford readily admitted early on in the book, it is USA-centric, and very much so; hence, it's soaked in the peculiarities of the computing climate there. That is, the military-industrial complex, widespread DoD funding for computer science research over decades – some 71% already in 1987, and while there's increased funding from business nowadays, federal funding still comes mostly from the DoD, depending on how you count what by whom – and a laissez-faire attitude to privacy and policies in general.

While the UK also has their ‘soldiers’ in the lab, the EU can pride herself on the separation of civilian vs defense/military-funded research and has a number of regulations in place, such as the GDPR for privacy. Also South Africa has its GDPR-lookalike, the POPIA, to try to stem the tide of the excesses of personal data exploitation.

In addition, the forced labour is likely to be perceived differently here, not only because the clickfarm workers are geographically closer by yet don't see the benefits of their labour themselves, but also because many people use 'free' services due to limited means and poverty, and are thus exploited more and more often in the surveillance capitalism system through what has been dubbed digital colonialism (general public version of the argument here) and, sanitised, extractivism. Consequently, there are explorations into what digital sovereignty in Africa would mean generally and for business specifically, whereas such a notion is not elaborated on in Crawford's atlas. (It's definitely not a new topic.)

Nonetheless, the book can be a useful read, even if only to gloat there are at least some regulations already that also may serve as inspiration for those who want some, or as argument that more regulation is needed – under the assumption we’d want a healthy democracy and a fair and just society.

For those who find heavily footnoted, easily readable non-fiction still too much effort, or prefer to become acquainted with some of those issues in short fiction, rather, I can recommend my collection of short stories, entitled Autonomous Decisions, that is now also on sale internationally as ebook and paperback on Amazon. There are stories with panspectric surveillance, facial recognition issues, algorithmic employee and student management, and more, set in South Africa and elsewhere in the world.

~~~

FOCUS – The ASML way

Serendipity, perhaps, made me stumble on this gem in the bookshop, in an otherwise lost half hour waiting at Eindhoven train station. I can't attest to whether the machines ASML designs and builds really are the "most complex" in the world, but they're definitely extremely sophisticated and practically more important than, say, a large hadron collider – the collider wouldn't even be able to operate without ASML, nor would space stations, cars, smartphones, the machines on which the AI algorithms run, and roughly most devices that contain computer chips. ASML is the company that makes the most intricate and precise lithography machines that make computer chips, and it apparently has a market share of 90% or so. How did that happen?

Hijink's engaging company biography of how ASML came about, their growth, rise to world leader, and the geopolitical challenges that come with it, renders 'boring hardware' as something exciting. You may still think not, and I probably had a head start in being swayed to buy the book: I already knew of ASML, having grown up relatively near its HQ in Veldhoven, a friend works for one of their suppliers, and I had a quick crash course on the tech and their attempts at knowledge-driven fault management at the EKAW2024 conference. I had seen a documentary about ASML from the outsider perspective by those who don't quite like the current state of affairs (there are a few on YouTube). This book takes a different approach, thanks to having been written by a Dutch journalist who has been following ASML for many years, got access to many people to interview, and speaks the language and understands Dutch culture. The story, I can assure you, is decidedly different.

An engineer holds a microchip (source, from ASML’s basics of microchips)

It chronicles the inception, growth and world domination of the lithographer. Who backed them, funded them, who they collaborate with a lot, how they approach designing and operationalising those unparalleled machines. As penned down engagingly by Hijink, there’s a good dose of Philips heritage, national and EU funding, solid science and engineering-based design and manufacturing, collaboration with Zeiss and others, a tolerance for faults to learn from, and then some “cowboy” tactics to make and ship the machines.

A 'move fast and have a system to figure out and fix things as quickly as you can' is more like their motto than a 'move fast and break things'; it begets you a titan of a company rather than hype or user exploitation. And no, ASML didn't intend to crush the competition, according to the book; the competition couldn't keep up. ASML wanted to make the best lithography machines, in a "maker" approach, which requires investment in R&D, which they did right from the start when the founders left Philips NatLab. Hardware cannot be hollowed out like software can be; if you do, you lose. From the figures mentioned in the book, trying to get back into the game now would require huge amounts of highly-skilled people, capital, know-how, and a prospect of an ecosystem of suppliers and distribution of machines that spans decades. Most politicians and shareholders aren't keen to plan that far and wide ahead, and so the status quo is likely to remain for a while at least, causing an 'interesting' political dance between the EU, USA, and Taiwan & China, which the book covers amply as well.

The book is divided up into five parts, entitled: good idea, bad plan; the big guys; build the impossible; in the limelight; growing pains. The Dutch version has some great chapter titles with creative alliterations, like een vleugje voodoo (a dash of voodoo), prutsers in het Pentagon (bunglers in the Pentagon), de foutenfabriek (the fault factory), and other enticing ones such as buurman, wat doet u nu? (neighbour, what are you doing now? [a famous line from the movie classic Flodder]), Eerst schieten, dan mikken (shoot first, aim later), and het regent hier miljarden (it's raining billions here). The Dutch version shows mastery and craftsmanship of the language to make it an engaging read, and I hope Hijink's writing style and word choice are maintained in the translations.

Hijink does a good job at explaining key components of the sophisticated science and engineering, as well as painting a vivid picture of the company culture, with the last part of the book more about the politics they unwillingly have to deal with. It can be read as a story about setting up your own computer hardware company; it just as well can be read as a case study in bold company management and leadership.

Photo of a street in Veldhoven (source)

There's some mention of how it turned the region into what is now called the "brainport" of the Netherlands – not in the sense of the corporate social responsibility of the good years of Philips, but the impact it has with its supplier network of companies, the university, and the young, highly educated engineers it attracts; into something that the Brabanders in De Kempen are proud of, and that the whole of the Netherlands and the EU should be as well. And maybe across the pond they shouldn't be jealous or resentful or too envious, but glad there was a group of driven and highly skilled people that kept pushing the boundaries of the possible so that Moore's law applied for this long.

If I were a hardware expert, I might have more to say and perhaps complain, but I’m in the intended audience category for this book. From that perspective, it was an awesome read.

Different roles for various competency questions for ontologies

Most ontology developers know they’re supposed to write competency questions, be it manually, with the CLaRO tool [KeetEtAl19] or assisted by a language model [AlharbiEtAl24, AntiaKeet23, RebboudEtAl24]. Few enjoy the manual authoring task. Moreover: what is a good competency question (CQ) and what not, and why? Are all CQs alike? Is it only those information-seeking questions we ask before developing an ontology or also during the modelling and testing phases? What sort of information are the CQs zooming in on?

They're questions we – Zubeida Khan and I – wanted an answer to, in order to understand CQs better, which may then lead to better CQs and methods and tools for CQ authoring, and hopefully also better automated generation and evaluation of CQs. Some of the answers to the aforementioned questions can now be found in our position paper entitled "On the roles of competency questions in ontology engineering" [KeetKhan24] that was accepted at the 24th International Conference on Knowledge Engineering & Knowledge Management that will be held in Amsterdam, The Netherlands, from 26-28 November.

To analyse CQs for ontologies, we examined what can make a CQ faulty and looked into what philosophy and logic say about questions. Besides pondering about and collating experiences about what could possibly go wrong, we took the CQ dataset of [PotoniecEtAl20,WisniewskiEtAl19] that was collected 6 years ago, and annotated all its 234 CQs on whether they were problematic, and if so, why. A disappointing 53 of them had issues, 17 of which were easily fixable grammar issues, 9 were inappropriate “can I …” and “how to…” questions that no ontology will answer, and the rest had other problems, including ambiguity and vagueness that would be hard to answer even with a fuzzy ontology. Among the questions we flagged as problematic are “Can I render [it] if the software supplier goes out of business?”, “Where do I categorise bulk like [this bulk]?”, and “What are the main types of data considered?”.

Other challenges with CQs are contextual, such as a CQ being fine of itself but unanswerable for a particular ontology because of a restriction on the ontology language. For instance, the CQ "Does a narcissist love himself?" requires an ontology language in which reflexivity can be declared, and "How many legs does a human have?" needs the ability to declare a qualified cardinality constraint if it is to be answered with a number.
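To make that concrete, the two example CQs roughly correspond to axioms along the following lines in description logic notation (a hedged sketch with placeholder vocabulary; the actual formalisation depends on the ontology at hand):

\[
\mathsf{Narcissist} \sqsubseteq \exists\,\mathsf{loves}.\mathsf{Self}
\qquad\qquad
\mathsf{Human} \sqsubseteq\ {=}4\;\mathsf{hasPart}.\mathsf{Leg}
\]

The first needs a local reflexivity (Self) restriction and the second a qualified cardinality restriction; if the chosen ontology language lacks those features, the CQs remain unanswerable for that ontology no matter how well they are phrased.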

Put differently, there’s scope for improvement. Somehow.

The so-called 'question question' has been investigated on and off by philosophers for at least over a century: what is a question? There is no single theory of questions, but there are pointers that turned out to be relevant for, and applicable to, CQs for ontologies. While asking a question is an information-seeking act, there can be various goals, such as knowledge acquisition and knowledge organisation, and there may be motivations behind asking the question. As to types of questions, there's a recurring division between hypothesis-scanning and constraint-seeking types of questions, or: checking your guess versus narrowing down the search space by means of feature in- or exclusion. (See the paper for details.)

Putting it all together, we identified 5 main types of CQs: CQs for scoping the domain of an ontology, CQs for validating the content of the ontology, CQs to align an entity to a foundational ontology entity, CQs to examine an entity's metaproperty (rigidity etc.), and CQs for characteristics of relations. They are summarised and illustrated in the paper, and structured in a small taxonomy (see figure below). We do have their respective formal definitions, but not in the EKAW paper (TBC).

Main types of CQs structured in a small taxonomy depicted in EER diagram notation. (Source: [KeetKhan24])

It’s the first time that different types of CQs have been identified for ontologies and we caution not to call that an ontology just yet, as further research may reveal more types of CQs or a better characterisation.

We also developed a preliminary Repository of Competency QuestionS, ROCQS, with a total of 438 CQs, offering further examples and serving as a start for developing more of especially the under-represented types, so that they can be analysed for prospective methods and tool development. Infrastructure to easily add more CQs is planned.

Main types of CQs, their purpose, and illustration with sample questions.

The last word clearly has not been said about CQs, and, actually, on the contrary: it opens up multiple avenues for further research and tool development (see paper).

The presentation of the paper is scheduled for the 27th of November in the morning, right after the conference’s opening session. I’m looking forward to #EKAW2024 and would be happy to make time for a chat to explore the topic further!

References

[AlharbiEtAl24] Alharbi, R., Tamma, V., Grasso, F., Payne, T. The Role of Generative AI in Competency Question Retrofitting. ESWC 2024, Extended Semantic Web Conference, May 2024, Hersonissos, Greece

[AntiaKeet23] Antia, M.-J., Keet, C.M. Automating the Generation of Competency Questions for Ontologies with AgOCQs. 5th Iberoamerican conference on Knowledge Graphs and Semantic Web (KGSWC’23). F. Ortiz-Rodriguez et al. (Eds.). Springer LNCS vol 14382, 1-15. 13-15 Nov 2023, Zaragoza, Spain.

[KeetKhan24] Keet, C.M., Khan, Z.C. On the Roles of Competency Questions in Ontology Engineering. 24th International Conference on Knowledge Engineering and Knowledge Management (EKAW’24). Springer LNAI. Amsterdam, Netherlands, November 26-28, 2024. (in print)

[KeetEtAl19] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). E. Garoufallou et al. (Eds.). Springer CCIS vol. 1057, 3-15. 28-31 Oct 2019, Rome, Italy.

[RebboudEtAl24] Rebboud, Y., Tailhardat, L., Lisena, P., Troncy, R. Can LLMs Generate Competency Questions? ESWC 2024, Extended Semantic Web Conference, May 2024, Hersonissos, Greece. Hal-04564055

[PotoniecEtAl20] Potoniec, J., Wisniewski, D., Lawrynowicz, A., Keet, C.M. Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations. Data in Brief, 2020, 29: 105098.

[WisniewskiEtAl19] Wisniewski, D., Potoniec, J., Lawrynowicz, A., Keet, C.M. Analysis of Ontology Competency Questions and their Formalisations in SPARQL-OWL. Journal of Web Semantics, 2019, 59:100534.

Automatically converting numbers into words: now also in isiZulu!

Converting Indo-Arabic numerals into words, like 175 into 'one hundred seventy-five', can be useful for a range of tasks, including any text-to-speech application, even if only to make a restaurant reservation for six-thirty, and language learning of basic conversations, like telling someone your age or that the route to take is the fourth street on the right or some 15km straight ahead. How can you make the computer generate the words for you from the numbers?

For English, there's reusable code in Python and many other programming languages, and even webpages with a front-end interface (e.g., here and here), that will do just that for you. Would it be just as straightforward for less well-resourced languages, or for languages that count a little differently than English? For instance, 88 'eighty-eight' in Dutch swaps the digits around (achtentachtig 'eight and eighty'), and in French there's a multiplication embedded in quatre-vingt-huit 'four twenty eight'. On the bright side, at least it's [ eighty-eight / achtentachtig / quatre-vingt-huit ] regardless of whether you count marbles, sheep, or euros. We already knew that it wouldn't be that easy for isiZulu, the most widely spoken language in South Africa by first language speakers.
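For the well-resourced languages it really is nearly a one-liner; a small sketch with the num2words Python package (the output strings in the comments are what I'd expect, so treat them as illustrative):

```python
# pip install num2words
from num2words import num2words

print(num2words(88))                # eighty-eight
print(num2words(88, lang="nl"))     # achtentachtig
print(num2words(88, lang="fr"))     # quatre-vingt-huit
print(num2words(4, to="ordinal"))   # fourth
```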

Here's a brief overview of what our solution covers:


A taxonomy of the several types of numbers in isiZulu (Adapted from (Grout, 1893)). Green shaded boxes indicate the categories covered by our algorithms (Source: [1])

The details can be found in our recently accepted paper, entitled “Automatically Generating IsiZulu Words From Indo-Arabic Numerals” that will be presented at the 17th International Natural Language Generation Conference (INLG’24) in Tokyo later this month. I’ll introduce and illustrate some of it in the remainder of this post, as well as the text corpus we generated, consisting of 771,643 sentences (with a total of 7,533,595 tokens).

The complicated part with rules

For instance, the number 2 uses the stem –bili, with the "-" at the start still to be finalised to make a word, and how it is finalised depends on the noun class of the things that are being counted. For instance, dogs (izinja) are in noun class 10 (nc10), so then we get izinja ezimbili 'two dogs'. For two humans (abantu), we get abantu ababili, i.e., now the aba– completes the –bili, rather than the ezim- for dogs, because abantu is in nc2, not nc10. It's the same story for other numbers and noun classes. More precisely, for those cardinal numbers, the piece that goes in front is called the adjectival concord. Each noun class has its own adjectival concord. For a particular Niger-Congo B language, that string might be the same for more than one noun class, but still, there easily are 10-20 of them.

It’s a similar story for ordinal numbers, like the ‘second person’ and ‘second dog’, but then the possessive concord is used, also one for each noun class. IsiZulu has 17 noun classes.

Then there are the 10s, 100s, and 1000s. They are considered nouns, and so are categorised in a noun class as well. For instance, the noun stem for 100 is -khulu, and so two of those (i.e., 200) requires making the -khulu plural, which is amakhulu, and completing the –bili with the adjectival concord for the noun class of amakhulu, which is the concord for nc6, ama-, and so we obtain amabili. Stringing those two pieces together, the 200 in words in isiZulu is thus amakhulu amabili. The word for 10, -shumi, is also in nc5 for the singular and nc6 for the plural; the one for 1000, –nkulungwane, is in nc9 for the singular and nc10 for the plural.
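As a toy Python sketch of that piecing-together, hard-coding only the handful of stems and concord pieces mentioned in this post (so nowhere near the actual rule set in the paper, which also handles the phonological conditioning properly):

```python
# Toy illustration only: the entries below are limited to the examples in this
# post; the paper's algorithm covers all noun classes, larger numbers, and the
# phonological conditioning that shapes the concords.
CONCORD_BEFORE_BILI = {2: "aba", 6: "ama", 10: "ezim"}  # surface form before -bili
NUMBER_STEM = {2: "bili"}

NOUNS = {            # plural noun and its noun class
    "izinja": 10,    # 'dogs'
    "abantu": 2,     # 'people'
    "amakhulu": 6,   # 'hundreds'
}

def count_phrase(noun_plural: str, number: int) -> str:
    """E.g. count_phrase('izinja', 2) -> 'izinja ezimbili' ('two dogs')."""
    nc = NOUNS[noun_plural]
    return f"{noun_plural} {CONCORD_BEFORE_BILI[nc]}{NUMBER_STEM[number]}"

print(count_phrase("izinja", 2))    # izinja ezimbili
print(count_phrase("abantu", 2))    # abantu ababili
print(count_phrase("amakhulu", 2))  # amakhulu amabili, i.e., the piece for 200
```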

It’s different for different sorts of numbers, like reading a book three times, or for the third time, or all three books in the collection.

Left: screenshot of some of the regular expressions; Right: an example of the few computational shortcuts we took in the implementation. (source: screenshots taken from [1])

Even so, if you thought it doesn't look that hard: we had to piece them together from various resources, test them for correctness, and fix the regular expressions and other rules accordingly. It's the first time ever that they're documented for the linguistically most-used noun class system and in such a way that the rules result in correct output. Admittedly, the rules take a shortcut here and there with respect to the linguistics – at the end of the day, parsimony ruled more than the grammatical justifications (in the paper at least).

More elaborate examples

Let’s put the rules together. For instance, a simple ‘17 rivers’. The loose translation from the generated isiZulu into English is ‘rivers of which there are ten and then seven’: imifula ‘rivers’ emi- ‘just to agree with rivers that is quantified over’ –yi- ‘copula merging with prefix of the next word’ –shumi ‘ten’ na- ‘and’ –isikhombisa ‘seven’, ending up as imifula emiyishumi nesikhombisa. 17 dogs demands an ezi– for agreement at the start: izinja eziyishumi nesikhombisa.

A more challenging one is, e.g., '4433 supporters'. The stem for supporter is -landeli and in nc2 for the plural, and so 'supporters' becomes abalandeli. Its adjectival concord is aba-. The plural of 1000 (since there are 4 of them) is izinkulungwane. 'Four' is –ne and with the nc10 concord for the 1000 (ezin-) it becomes ezine, having dropped one n due to phonological conditioning. This gets us to abalandeli abayizinkulungwane ezine for 'four thousand supporters'. Now for the 'and four hundred': na- is 'and', 400 is amakhulu amane, making namakhulu amane (because na- ama- → nama-). The 'and thirty' amounts to 'tens of which there are three', so with the plural of ishumi being amashumi and 'three' –thathu that needs to agree with the amashumi, making amashumi amathathu. Then the final 'and three' as na + -thathu = nanthathu (with a little phonological conditioning adding the extra n). Putting all the pieces together, we obtain: abalandeli abayizinkulungwane ezine namakhulu amane namashumi amathathu nanthathu 'four thousand four hundred and thirty-three supporters'.

This is just the gist of it for cardinal numbers with agreement markers; there are the others from the figure above that our rules and algorithms can cope with just as well. We can do the word generation for numbers up to 9999, which made sense for the original use case of financial literacy. There are tons of rules for them, as documented in our paper [1]. We implemented it as well, and evaluated it in several iterations so as to get most of it correct.

Examples of components of the words and the generated words of several numbers; see paper for further details.

Scaling it up

Just the rules won't help voracious, data-demanding approaches to learn the patterns, however, and so we created a dataset from the rules as well. We basically used the same approach as in [2] and then scaled it up. That is, roughly: take a set of nouns that play the subject, a set of verbs that are semantically relevant, and then a number of objects. For instance, a [father, mother, sister, brother, uncle, nurse, doctor, teacher, …] each can [buy, lend, borrow, read, burn, … ] [a number of] [books, magazines, …], and then just compute the cartesian product of that. We had already developed the rules needed for the rest of it anyway. And so, we created a corpus of 771,643 sentences consisting of 7,533,595 tokens in total, which is freely available for further use since it was, indirectly, already paid for by taxpayers' money. It's probably also the largest corpus of isiZulu text, and of new sentences at that (cf. copying old Bible translations over and over again), and trivial to turn into a parallel corpus with a well-resourced language.
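The generation loop for such a corpus is conceptually just a cartesian product over those seed lists. A sketch of the idea (with an English placeholder lexicon, since the point here is the combinatorics; in the actual pipeline the lexicon is isiZulu and the number phrase is produced by the rules described above):

```python
from itertools import product

# English stand-ins for the isiZulu seed lists mentioned in the post.
subjects = ["father", "mother", "sister", "nurse", "teacher"]
verbs = ["buys", "lends", "borrows", "reads"]
objects = ["books", "magazines"]
numbers = range(2, 100)   # in the real corpus: generated isiZulu number words

sentences = [
    f"the {s} {v} {n} {o}"
    for s, v, n, o in product(subjects, verbs, numbers, objects)
]
print(len(sentences))   # 5 * 4 * 98 * 2 = 3920 sentences from tiny seed lists
print(sentences[0])     # the father buys 2 books
```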

The sentences are grammatically correct and good for training algorithms, albeit not all equally interesting from a human interest perspective. Among others, it has the likes of izintaba zikhusela izindlwana ezinyizinkulungwane ezinhlanu namakhulu amabili namashumi ayisishiyagalombili nesithupha ‘the mountains protect five thousand two hundred and eighty-six houses’, abafundisi babona izilwane ezifuyiweyo kayizinkulungwane eziyisithupha namakhulu amane namashumi amabili nesithupha ‘the teachers saw the domesticated animals six thousand four hundred and twenty-six times’, umfundisi wamakhulu amabili namashumi amathathu nambili ubona izihlahla ‘the two hundred and thirty-second teacher sees the trees’, and amakomidi aphatha amaphoyisa omangamakhulu amabili namashumi ayisishiyagalombili nesishiyagalolunye ‘the committees manage all two hundred and eighty-nine police officers’.

Final remarks

So, it can be done, and hopefully those who requested the feature (banks) may find a way to integrate it in their applications, be it as the rules-based approach or by training from the dataset we created. And Wikifunctions recently had a call for functions to spell out numbers, for which this may be an interesting test case.

Also, maybe we should go beyond the GitHub repo and turn it into an easily accessible web app, too, like for English. I've been told several times already that isiZulu speakers don't find it easy to pronounce the numbers in words, and then such a service can come in handy. And perhaps it works similarly for at least the other three Nguni languages, so that we might bootstrap it from the rules we so painstakingly put together and evaluated. Plenty to do still, but for now we're pleased to have passed this milestone.

References

[1] Mahlaza, Z., Magwenzi, T., Keet, C.M., Khumalo, L. Automatically Generating IsiZulu Words From Indo-Arabic Numerals. 17th International Natural Language Generation Conference (INLG24), Tokyo, Japan, September 23-27, 2024. ACL. (in print)

[2] Gilbert, N., Keet, C.M. Automating question generation and marking of language learning exercises for isiZulu. 6th International Workshop on Controlled Natural Language (CNL'18). IOS Press, FAIA vol 304, 31-40. Co. Kildare, Ireland, 27-28 August 2018.

 

Exploring the comparison of models in hydrology and computing

Visiting HydroSciences Montpellier (HSM) on the EU-funded STARWARS project on AI for stormwater and wastewater data management, I was invited to give a seminar at the research institute. On any topic, but, yeah, they're all hydrologists. The biological models that I discuss in chapter 3 of my modelling book [Keet23] aren't hydrological models, but the overall argument of the book might still work, and so maybe I would be able to link their way of modelling to conceptual data modelling and to ontologies, and illustrate differences with a task that's different from the task-based comparison illustrated in chapter 7. A plan was hatched.

Hydrology has ample textbooks and handbooks on modelling, however, not only covering the discipline as a whole, but even just one aspect of it, like rainfall-runoff. Luckily, I did find a useful condensed overview of models and modelling in hydrology [SolomatineWagener11]. Add to that the task description of a new HSM postdoc on converting old FORTRAN code into something more fit for the 21st century, which reminded me of another problem in modelling I looked into briefly in late 2003-early 2004 and that appears to have been addressed only in systems biology with SBML, and the talk was starting to take shape.

Besides scene setting, a motivation for why one would want to annotate mathematical models (create a declarative model of them, really), and an attempt at clarifying terminology about several types of ‘models’ in hydrology and computing (slide 9 of the presentation), a key aspect was to take it from there and carry a hydrological model over into conceptual data modelling and ontologies.

What does a hydrological model look like, and what are the pros and cons thereof? How to simplify it to keep the presentation to about 30-45 minutes? As illustration and task for the task-based comparison, I arbitrarily took one of the formulae from a recent paper by one of the scientists at HSM (and a great host) [ChahinianEtAl23], added the declarative part, and prettified the formula annotation with icons and colour:

Sample diagrammatic rendering of how one of the formulae in [ChahinianEtAl23] may be annotated.

It generated a few smiles of approval during the talk. Whether all their mathematical models can be represented in SBML or may need their own serialisation language is an open question.

Moving on to ‘gaps’ in that diagram, such as missing cardinality constraints, and from there to conceptual data modelling, this time it was me who had a go at filling in some details. I did not make a real effort to be correct, as the point was to illustrate what it would look like and what sort of information can now be added. Since at least a few had heard of UML, I used that notation:

UML class diagram version, with more or less plausible multiplicity constraints, data types, and a method.

That generated a question on the content of the model, along the lines of “that ‘0..1’ there, does that mean there can be no runoff measurement? There are also xyz systems that measure at multiple intervals”. Put differently: just having more features available brought to the fore a conversation about how different software systems will have different assumptions and practices encoded that more or less resemble the formulae, which can be made clear with conceptual data models upfront1.

It easily led to the topic of multiple databases and data integration, which was a useful prelude to ontologies. Ontologies were the topic of the third section, addressing limitations of conceptual data models and moving the goalposts.

The sketch for the ontology was intentionally multifaceted. There’s a tentative alignment to a foundational ontology, reuse of one of its relations, a diagrammatic sketch, and verbalisations of sample axioms. I had intentionally put a different constraint on the rainfall runoff than in the conceptual data model, just in case I needed a back-up discussion topic (which is retained in the slides). The audience noticed something was going on. One of the verbalised axioms resulted in a resolute “No, it can be 0, or none; if that axiom means what is written there afterwards, then it’s wrong”, en passant neatly demonstrating precisely some of the reasons why one should formalise the knowledge and verbalise it, so that the domain expert can validate the content of an ontology or note corrections.

Rainfall: sample sketch and some axioms for an ontology, with the offending axiom starred.

While appreciating that there’s a difference between whether any rainfall will have runoff (regarding the ontology) and whether there’s something or someone measuring the runoff and recording it in a database (for the conceptual data model), it turned out that universal quantification and 0..n, respectively, are appropriate for the two cases.
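For readers less familiar with the notation, the contrast at issue can be sketched in description logic syntax roughly as follows. The class and relation names here are hypothetical, since the slides use their own vocabulary, but the starred, ‘offending’ reading demands at least one runoff per rainfall, whereas the universally quantified reading permits there to be none:

```latex
% Hypothetical names; a sketch of the distinction, not the slides' actual axioms.
% Assumes amsmath for the align* environment.
\begin{align*}
&\text{(starred, `offending' reading: every rainfall has at least one runoff)}\\
&\mathsf{Rainfall} \sqsubseteq \exists\, \mathsf{hasRunoff}.\mathsf{Runoff}\\[4pt]
&\text{(accepted reading: universal quantification, which also holds when there is no runoff)}\\
&\mathsf{Rainfall} \sqsubseteq \forall\, \mathsf{hasRunoff}.\mathsf{Runoff}
\end{align*}
% On the conceptual data model side, the corresponding multiplicity on recorded
% runoff measurements is then 0..n rather than 1..n.
```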

Mission accomplished regarding a tiny, topical, task-based comparison of a small sampling of the extant declarative modelling languages. The proposal to separate the declarative from the imperative in their mathematical-model-based simulation tools might have been new to them, whereas for me it was a starting point from the computing viewpoint. Either way, there are multiple avenues for research to colour in the details precisely. Whether the hydrologists in the project really need to go all the way with an ontology remains to be seen, but hopefully they’re in a better position to make an informed decision now.

p.s.: if you missed the link to the slides in the text above: the pdf is available here.

References

[ChahinianEtAl23] Chahinian N, et al. (2023). Evaluation of an early flood warning system in Bamako (Mali): Lessons learned from the flood of May 2019. J Flood Risk Mgmt, 2023, 16(3): e12878.

[Keet23] Keet, C.M. The what and how of modelling information and knowledge – from mind maps to ontologies. Springer 2023.

[SolomatineWagener11] Solomatine, D.P. and Wagener, T. 2.16 – Hydrological Modeling. Treatise on Water Science, 2011, vol 2, 435-457.

1 From the computing side, we know this of course, but convincing domain experts to learn something new – and of the possible value of spending one’s time on it – takes time (as it does for a computer scientist to figure out what’s going on in the subject domain).

A review on logics for conceptual data modelling

Pablo and I thought we could write the review quickly. We probably could have done so for a superficial review, describing the popular logics, formalisation decisions, and reasoning services for conceptual data models. Those sections were the easiest sections to write as well, but reviewing some 30 years of research on only that theme was heading toward a ‘boring’ read. If the lingering draft review could have spoken to us last year, it would have begged to be fed and nurtured… and we listened, or, rather, we decided to put in some extra work.

There’s much more to the endeavour than a first glance would suggest, and so we started digging deeper to add more flavour and content. Clarifying the three main strands of logics for conceptual data modelling, for instance. Spelling out the key dimensions along which one has to make choices when formalising a conceptual data model, just in case anyone else wants to give it a try, too. Elucidating the distinctions between the two approaches to formalising the models, being rule-based and mapping-based, and where and how exactly that choice affects the outcome.

A conceptual model describing the characteristics of the two main approaches used for creating logic-based reconstructions of conceptual data models: Mapping-based and rule-based. (See paper for details)
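To give a flavour of what such a logic-based reconstruction amounts to, regardless of which of the two approaches produces it, here’s a generic illustration in description logic notation. It is not an example taken from the paper; the class, attribute, and association names are made up: a class Professor with a mandatory, single-valued attribute name of type String, and an association teaches where each Course is taught by at least one Professor.

```latex
% Generic illustration; names are made up, not taken from the paper.
% Assumes amsmath for the align* environment.
\begin{align*}
&\mathsf{Professor} \sqsubseteq (\exists\, \mathsf{name}.\mathsf{String}) \sqcap\; ({\leq}1\, \mathsf{name})\\
&\mathsf{Course} \sqsubseteq \exists\, \mathsf{teaches}^{-}.\mathsf{Professor}
\end{align*}
% The first axiom encodes the mandatory, single-valued attribute (multiplicity 1..1);
% the second encodes the 1..* multiplicity on the Course end of the teaches association.
```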

Specifically, along the way in the paper, we try to answer four questions:

  • Q1: What are the tasks and challenges in that formalisation?
  • Q2: Which logics are popular for which (sub-)aim?
  • Q3: What are the known benefits of a logic-based reconstruction in terms of the outcome and in terms of reasoning services that one may use once a CDM is formalised?
  • Q4: What are some of the outstanding problems in logic-based conceptual data modelling?

Is there anything still to do on this topic, one may wonder, considering that it has been around since the 1990s? Few, if any, will care about just another formalisation, and you’re unlikely to get that published no matter how much effort it took you. Yet Question 4 could indeed be answered, and the answer is far from a ‘no’.

We need more evidence-based research, more tools with more features, and conceptual modelling methodologies that incorporate the automated reasoner. There’s also work to do on integrating better with closely related areas, or at least offering lessons learnt and having results re-purposed, such as with ontology-based data access and with ShEx & SHACL for graphs. And one could use the logic foundations to explore new applications in contexts other than modelling itself that also need such rigour, such as automated generation and maintenance of conceptual data models, multilingual models and related tasks with controlled natural languages or summarisation (text generation from models), test data generation, and query optimisation, among others.

More details of all this can be found in the (open access) paper:  

Pablo R. Fillottrani and C. Maria Keet. Logics for Conceptual Data Modelling: A Review. In Special Issue on Trends in Graph Data and Knowledge – Part 2. Transactions on Graph Data and Knowledge (TGDK), Volume 2, Issue 1, pp. 4:1-4:30, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/TGDK.2.1.4

On comparing models

Some of the readers of this blog are interested in modelling, and then mainly conceptual data models or ontologies. There are more types of models and modelling languages as well, such as mind maps, biological models, domain-specific languages, and so on. Can you confidently say—and justify!—which one is the best? Would such an answer be so elaborate as to lean towards the idea of, and support existing calls for, modelling as a specialisation in an IT or computing degree programme, if not deserving to be a separate discipline outright? If so: why? What sets it apart, and what are the recurring themes across the various types of models and ways of modelling, and their differences? These questions are easy to ask, but far from trivial to answer. I tried anyway, to some extent at least. The latest attempt, written in an accessible way—i.e., more like popular science than textbook-like—can be found in Chapter 7 of my recent book entitled “The what and how of modelling information and knowledge: from mind maps to ontologies”, which was published by Springer (also available through Springer professional, online retailers such as Amazon, and university libraries). Instead of summarising that in this post, I did so in a guest post on Jordi Cabot’s blog, which can be read here: https://modeling-languages.com/on-comparing-modelling-languages/

Figure 1. Two example diagrams about espresso machines: a mind map and a conceptual data model. If you have no idea about what or how to compare yet: before reading about the comparisons, can you describe differences between these two examples?