Let me start the new year with a long-overdue topic to write about: low-resourced languages (LRLs). If you take English as the benchmark and adopt a Calimero attitude of ‘but they have it much easier with so many resources and that’s not fair’ for computational tasks involving NLP/HLT, then you can complain that it’s not easy and unfair for 99.99% of the languages in the world. That’s not helpful for understanding what ‘not easy’ really amounts to, nor what the – in many cases unrealistic – implications are of ignorant (at best) statements along the lines of ‘just get yourself together and work a little harder’, nor just how much effort some work has taken even if it looks like mere baby steps from a highly resourced language technologies perspective.
Not just that. Let me cherry-pick a few anecdotes to illustrate. A fellow researcher I spoke with at INLG2023 told me he just gets his new, high-quality annotated data delivered and can readily use it for his research. What?!?! Really?!! Other researchers, be it in NLP or in fields requiring NLP tools for the intended task, have to collect and annotate the data themselves, clean it up, and determine its quality, or manage to have that done as part of their research. My fellow researcher was surprised that I, too, had to do all that extra work in order to have enough results to write up in a paper. The paper I was still working on at the time (presented at the 2024 edition, a year later [Mahlaza24]) involved trips to the library, online searches for grammar rules, and consulting with linguists just to get a few examples and rules to begin with to try to work out how to convert numbers to text in isiZulu, and a number of iterations to plug the gaps in the limited rules’ documentation to get to a passable level of correct output. There was no data, for good reasons; we generated that data to make it less hard for data-oriented researchers. OCR issues are documented elsewhere, and some can be gleaned from carefully reading between the lines of some of my NLG papers.
I probably could write a book about the anecdotes, but the plural of anecdote isn’t data. Were we just repeatedly unlucky? Didn’t we search hard enough? Do other people – researchers and software developers alike – working with low-resourced languages not have such problems? No, they struggle, too. For instance, having to adapt Universal Dependencies (UD) before being able to computerise sentence annotations for St Lawrence Island Yupik [Park21].
What makes it challenging, and could one ‘low-resourced’ be even worse than another ‘low-resourced’, akin to a ‘very low-resourced’? I’ve heard a Dutch colleague claim that Dutch was low-resourced, and he was staunchly convinced of it. I shook my head in disbelief and could not resist commenting, as he clearly didn’t know what low-resourced really is like. But it does raise the question: what is low-resourced? What are its characteristics, so that it can be distinguished from intermediately resourced and well-resourced languages?
These are not new questions, and other researchers have tried to answer them with the typical approach of bean-counting something: Wikipedia articles, number of corpora, number of tools, number of papers at top-level NLP conferences. They all miss things in the counting: much of the work on low-resourced languages doesn’t make it into the top-tier conference venues that take English as the gold standard, the number of tools says nothing about whether they actually work, available unpublished tools are easily overlooked yet could be really useful, and Wikipedia has skewed-editorship issues.
For instance, media darling and recent SAICSIT emerging pioneer award winner Prof. Vukosi Marivate doesn’t get much of his work into premier international NLP conferences, even though he has most definitely been using data-driven techniques for years and managed to set up the spin-off company lelapa.ai on NLP for African languages. It is not much different for P-rated Dr Jan Buys and his papers on African languages; he, too, has been working on data-driven approaches for years, including a number of papers at the main NLP conferences – just not about African languages. Our trailblazing work from before the LLM craze, which resulted in, among others, four INLG papers, a COLING paper, a CICLing paper & prize, and journal articles in TALLIP and LRE on NLP for African languages, is notable as well, but it mostly won’t show up in, e.g., Joshi et al.’s top-conference paper-based counting [Joshi00]. Prof. (emer.) Laurette Pretorius’s LREC papers and ZulMorph largely predate the resource-indexing and open-source requirements, and AwezaMed, spellcheckers for MS Office, and so on and so forth would fall through the bean-counting cracks as well.
Not being indexed by the bean-counting scrapers of researchers from the Global North doesn’t mean nothing is happening. Admittedly, variant spellings of language names and changes in language names hamper the bean-counting approach when searching for resources – though locals know. Joshi et al.’s ‘Mbosi’ language search could have been augmented with ‘Mboshi’ and ‘Embosi’ if only they’d known, and then they would have found and included the LREC 2018 paper [Rialland18], for instance, but, alas.
Is there another way to capture the fuzzy notion of an LRL? My collaborator, Langa Khumalo, and I set out to take a different, complementary approach: contextualising the language to determine its resourcedness. We focussed on three key issues:
- What are really the distinguishing characteristics of LRLs (and, by extension, ‘non-LRLs’)?
- What are the characteristics of levels of resourcedness?
- Which language fits where and why?
The results are described in detail in our technical report [KeetKhumalo26]: we identified 11 dimensions of resourcedness, their components, and tentative scales or grouping buckets, matched the dimensions to the Very LRL, LRL, RL, HRL, and Very HRL levels, and assessed the scheme’s operationalisability with isiNdebele and several other languages.
The dimensions concern the sort of things that actually impact developing NLP tools. For instance, the number of people: fewer people are harder to find and more in demand, not to mention having only tens of participants (if that) for crowdsourcing, who would need to be paid internet data upfront to do the evaluation. Or take the participants’ level of education in that language: speaking and writing a language is not the same as having a grasp of it deep enough to provide 100% correct feedback on a morphological analysis, say, or having received education in the language at least up to matric/high-school exams. Less-than-correct feedback requires more rounds of human evaluation, or: takes more time to carry out the evaluations, more time to analyse the data, and more remuneration for the tasks. Or the choice of grammars, or the lack thereof: taking a UD or SUD off the shelf versus digging into old books and poring over various linguistics papers to determine what it is that needs to be represented in any formalism expressive enough to capture it. Or having a choice of parsers versus no or outdated software that needs to be brushed up or re-implemented first.
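To make the operationalisation a little more concrete, here’s a minimal sketch of what a crude scoring exercise could look like. To be clear: the dimension names, the 0–4 scale, and the plain averaging below are simplified stand-ins for illustration only, not the report’s actual 11 dimensions, scales, or matching of dimensions to levels.

```python
# Toy sketch only: made-up dimension names, scores, and aggregation,
# not the scheme from the technical report.
from statistics import mean

LEVELS = ["Very LRL", "LRL", "RL", "HRL", "Very HRL"]

def resourcedness_level(scores: dict[str, int]) -> str:
    """Map per-dimension scores (0 = nothing available, 4 = abundant)
    to one of the five levels via a simple average."""
    avg = mean(scores.values())      # crude aggregation, for illustration
    return LEVELS[min(int(avg), 4)]

# Hypothetical, made-up scores for some language X:
example_scores = {
    "annotated corpora": 0,
    "digitised grammars": 1,
    "NLP tools (parsers, spellcheckers, ...)": 1,
    "people available for annotation and evaluation": 1,
}

print(resourcedness_level(example_scores))  # -> Very LRL
```

The point of the sketch is merely that once the dimensions and their scales are pinned down, classifying a language becomes a largely mechanical exercise; pinning them down is where the real work lies.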

The dimensions are described, motivated, and illustrated over a good six pages in the paper. There may be more dimensions, but this already gives a good basis to assess and classify languages, to develop policies to benchmark and assess changes in language resourcedness, for certain people to get down from their English high horse of incorrectly judging efforts for other languages, and to make better sense of ‘LRL paper tracks’ at conferences and workshops. And perhaps, in any case, to gain an appreciation of NLP activities when there’s no cornucopia of tools and datasets.
We grouped the dimensions as contributing to characterising the Very LRL, LRL, RL, HRL, and Very HRL levels. Admittedly, there’s a notable flip at the RL level that asks for more fine-grained needling and characterisation. Yet, the notion that getting the ball rolling is harder than keeping it rolling and amassing more thanks to the bandwagon effect applies to many areas.
We applied the dimensions to isiNdebele, a language spoken in South Africa and Zimbabwe with about 3.7 million first/home-language speakers overall. There are newspapers, TV news bulletins, schoolbooks, a dictionary, and more in isiNdebele, i.e., it is actively used in daily life. It turns out that it is in the Very LRL category, albeit noting that it’s not all doom and gloom, or: there are a few resources.
The discussion section of the report elaborates on various aspects, including policy implications, and there’s a bonus section nitpicking about terminology, including low-resource vs. low-resourced vs. under-resourced languages. What can I say; besides NLP, co-author Langa is the Director of the South African Centre for Digital Language Resources (SADiLaR) and I’m an ontologist. The paper’s flavour overall is distinctly on the languages side rather than the computation side, which may be taken as a warning or an encouragement; either way, I hope you’ll find something of interest in it. Opinions, additions, or your assessment of your language(s) of interest are welcome.
References
[Joshi00] Joshi, Pratik, et al. “The State and Fate of Linguistic Diversity and Inclusion in the NLP World”. arXiv:2004.09095 [cs.CL], 20 Apr. 2020. http://arxiv.org/abs/2004.09095.
[KeetKhumalo26] Keet, C.M., Khumalo, L. “Contextualising levels of language resourcedness for NLP tasks”. arXiv report 2309.17035, 17 January 2026. https://arxiv.org/abs/2309.17035.
[Mahlaza24] Mahlaza, Z., Magwenzi, T., Keet, C.M., Khumalo, L. “Automatically Generating IsiZulu Words From Indo-Arabic Numerals”. 17th International Natural Language Generation Conference (INLG’24), Tokyo, Japan, September 23-27, 2024. Association for Computational Linguistics.
[Park21] Park, Hyunji, et al. “Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik”. Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, Association for Computational Linguistics, 2021.
[Rialland18] Rialland A, Adda-Decker M, Kouarata G-N, Adda G, Besacier L, et al. “Parallel Corpora in Mboshi (Bantu C25, Congo-Brazzaville)”. 11th Language Resources and Evaluation Conference (LREC 2018), ELRA, May 2018, Miyazaki, Japan.