Can AI Grasp Pacific Languages? Inside Our Push for an AI Samoan Translator

Part 1, Introducing the Project

Talofa! This is the first in a series of posts about a new AI-powered language translator built by PBDE and available on the web here. We know that many of our colleagues are very interested in AI today, and we encourage you to try the translator and contact us with feedback.

The Landscape: Bridging Gaps and Setting Expectations

As the name suggests, Pacific Broadband and Digital Equity (PBDE) focuses on digitally connecting and including Pacific island communities so that they can achieve the economic, social, and cultural outcomes that their peer communities on the mainland and around the world already enjoy. For the past few years, a big part of our focus has been broadband internet, and we are happy to say that the region is now virtually swimming in broadband resources, something we were happy to have had a small part in helping to bring about.

Looking forward, we see an issue exploding into global discussions that has the potential to be just as critical–if not more critical–to digital equity than internet access. We are speaking, of course, of Artificial Intelligence (AI). According to Google Trends, so is everybody else:

At this point, there seems to be not just a backlash to the AI hype, but also a backlash to that backlash, and possibly a third on top of that. Even so, any credible discussion of AI has to acknowledge that it has enormous transformative potential in many fields and walks of life, and that it is already having major impacts on workers around the world.

An in-depth discussion would be beyond the scope of this blog, but let’s operate from two assumptions:

  1. AI impacts upon society will be significant, far-reaching, and particularly potent in some industries, and
  2. They will not automatically confer benefits to marginalized or disconnected communities.

Based upon these assumptions, the impact of AI upon Digital Equity in the Pacific region is something we should be very interested to explore. But to do that, we wanted to start with a fundamental question: do commercial AIs even work in the languages primarily spoken by many in the region? It turns out that the answer to that question is pretty complicated.

Commercial AIs and Pacific Islands Languages

It’s pretty easy to start to answer the question, at least. Commercially available models and products such as ChatGPT (OpenAI), Claude (Anthropic), Bard (Google), Llama (Meta), and Bing (Microsoft) all offer “playground” access where users can enter text of any kind and get a response. You can therefore go to any of these products and simply ask them what they can do. Here is one such conversation with Bard:

Bard gets pretty complicated sometimes when it cannot do something. However, when asked a simple question in Samoan, the response is more abrupt.

The responses from other AI models are similarly mixed. ChatGPT and Claude seem to be most capable with Samoan, but after a bit of conversation they will frequently hit a point where they become confused and give the same sort of response: sorry, I cannot help with that, contact technical support. 

We have started with the Samoan language because of our partnership with American Samoa Community College and the integration of this work with our ongoing projects there. However, plugging in text in Chamorro, Tongan, Fijian, and so on leads to a similar patchwork of results. Bard is capable in Chuukese in a way that ChatGPT is not. Claude is facile in Tongan in a way that Bing is not. And so on.

Perhaps the most interesting part of this approach is what the AI models do when they cannot help with a question: they uniformly refer the user to a help center or other technical support website. These sites generally offer no documentation about which languages are supported, how well they are supported, or when and why support might be added.

What’s more, the models change very rapidly over time, and support can appear without notice. Unfortunately, this also means that support could disappear just as abruptly. Of course, getting in touch with any of the companies involved would be especially challenging for just about anybody in any part of the region, a problem we are certain is familiar to all of our fellow islanders–even those of us here on O‘ahu.

Building the Translator

After a few forays into language chats, we wanted to be more systematic about testing various AI language capabilities, and felt we needed a technology testbed to do it. In talking through the problems with our colleagues at the College, we decided that a translator could be both a great way to do this exploration and a way to help the college community and give it some exposure to AI.

While we won’t go into the technical details in this post, we can discuss how the translator is generally put together. We built a custom web application with the goal of providing a simple translation interface to end-users: we hope we can generate feedback and improve that interface over time. The interface works with the Application Programming Interfaces (APIs) of commercial AI models, and we designed it to make it possible to switch from one model to another. 

At the moment, we have models from OpenAI and Meta integrated, and we plan to add others as soon as they are available. We will explore ways to customize the output of most of the AIs we work with, but to start with we are using more or less the “default” settings on the models, in the hope of simply establishing a baseline.
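To make the idea of a switchable model layer concrete, here is a minimal sketch in Python. It is not the code behind our translator: every name in it (`Translator`, `build_prompt`, `fake_backend`, the `model-a` and `model-b` labels) is hypothetical, and the stand-in backends make no network calls at all. It only illustrates one way an interface could route the same request to different models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# A backend is any callable that takes a prompt and returns the model's reply.
# In a real deployment it would wrap a vendor API call; here it is a stand-in.
Backend = Callable[[str], str]


def build_prompt(text: str, source: str, target: str) -> str:
    """Compose a plain translation instruction (hypothetical wording)."""
    return f"Translate the following text from {source} to {target}:\n{text}"


@dataclass
class Translator:
    """Routes translation requests to whichever backend is currently selected."""

    backends: Dict[str, Backend] = field(default_factory=dict)
    active: str = ""

    def register(self, name: str, backend: Backend) -> None:
        self.backends[name] = backend
        if not self.active:
            self.active = name  # the first registered backend becomes the default

    def use(self, name: str) -> None:
        if name not in self.backends:
            raise KeyError(f"unknown backend: {name}")
        self.active = name

    def translate(self, text: str, source: str, target: str) -> str:
        if not self.active:
            raise RuntimeError("no backend registered")
        prompt = build_prompt(text, source, target)
        return self.backends[self.active](prompt)


def fake_backend(label: str) -> Backend:
    """Stand-in for a vendor API: tags the prompt's last line, does not translate."""
    return lambda prompt: f"[{label}] {prompt.splitlines()[-1]}"
```

With two stand-ins registered under the names `model-a` and `model-b`, calling `translate("Talofa", "Samoan", "English")` goes to `model-a` by default, and calling `use("model-b")` redirects later requests without changing anything in the interface code.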

As for the success of the platform, the results are unsurprisingly mixed. To test the platform, we have started by simply processing phrases from the Samoan-language textbook, Gagana Samoa, and observing the results. We will have a more formal analysis in the coming months, but for now we can say that the most recent OpenAI model behind ChatGPT (GPT-4) works somewhat plausibly, especially when translating Samoan to English. Even this model produces some “clunker” results, and the other models are less successful by some margin.
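The informal test described above can be pictured as a small loop: send each phrase to each model, then lay the outputs side by side for a person to review. The Python sketch below assumes a backend is any function taking `(text, source, target)`; the function names and column names are hypothetical, and no automated scoring is implied, since judging the output still takes a human reader.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List

# Assumed shape of a backend: (text, source_language, target_language) -> translation
Backend = Callable[[str, str, str], str]

FIELDS = ["phrase", "model", "output", "reviewer_notes"]


def run_baseline(
    phrases: Iterable[str],
    backends: Dict[str, Backend],
    source: str = "Samoan",
    target: str = "English",
) -> List[Dict[str, str]]:
    """Send every phrase to every backend and collect one row per pairing."""
    rows = []
    for phrase in phrases:
        for name, backend in backends.items():
            try:
                output = backend(phrase, source, target)
            except Exception as exc:  # keep going if one model fails
                output = f"ERROR: {exc}"
            rows.append(
                {"phrase": phrase, "model": name, "output": output, "reviewer_notes": ""}
            )
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Render rows as CSV text with an empty notes column for human reviewers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The resulting sheet has one row per phrase and model, so a fluent speaker can mark the “clunker” results directly in the notes column.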

By contrast, performing the same translations from English into Dutch, one of the widely spoken languages most closely related to English, yields nearly fluent results from any of the AI models. However, this is not only due to the similarities between the languages: there are massive amounts of Dutch-language text available in digital form, which is a critical requirement for training the “Large Language Models” (LLMs) that are the primary basis of today’s rapid expansion of AI tech. And, of course, there are many Dutch programmers with the inclination to train models in the language.

Translation between languages turns out to be more complicated than meets the eye. Islanders working in the legal or education professions, among others, will be very familiar with the difficulties of translating large technical documents between English and a local, indigenous language. There are semantic mismatches, structural differences, and cultural nuances that make manual translation very challenging.

But this is precisely where AI models could help. By analyzing large collections of translated text, AI systems can learn to identify common semantic gaps and propose options to bridge them. They can also highlight typical structural changes needed to convey ideas between language pairs. Ultimately, by surfacing patterns in human translations, AI could assist in developing better guidelines and methodologies for smoother manual translation.

Our translator testbed will allow us to explore whether current AI capabilities are ready to provide this kind of assistance between English and Pacific Islander languages. We hope it is a stepping stone to reducing barriers for important linguistic work in the region.

Looking Ahead

In this first post of our series, we introduced our machine translation project for Samoan and English, and discussed some core translation challenges. This is just the beginning of our investigation into AI’s capabilities with Pacific Islander languages.

In our next post, we’ll look more closely at where we see the potential for the project to grow beyond simple translation. We’ll also cover the technology in more detail, along with the way it informs our thinking about regional development issues. Finally, we’ll explore how this thinking connects with current digital equity projects and progress in the region.