Article written by Adam Wealand, Principal Product Marketing Manager, Red Hat. 

An experience can plant a seed in your mind that grows into a much bigger idea. I recently visited Japan, a beautiful country with a rich and distinctive culture, and was reminded of how deeply language and culture are intertwined. Language structure reflects and also influences how people perceive the world. For example, some cultures are more informal, while others with established social hierarchies tend to be more formal. This is manifested grammatically, as well as in the way people greet each other, how questions are formed, and the very rhythm of conversation. This experience sparked a question for me: as we build a new generation of AI, how do we train it to speak the language of all cultures?

Global AI

We often hear about AI as a revolutionary global tool, a technology for all of humanity. But if AI is learning from our collective data, whose culture is it truly learning? The promise of a universally intelligent system is powerful, but it masks a critical issue: today's most powerful AI models are not culturally neutral. They are a direct reflection of their training data, which is overwhelmingly English-centric and Western-biased.

The concept of "AI sovereignty" becomes particularly relevant in light of this Western-biased model training. It's not simply about having access to technology or hardware, but about building AI that can reflect a nation's or community's unique languages, values, and culture. We believe the key to achieving this lies in open source AI.

AI's language gap in numbers

Digital sovereignty, particularly in the context of AI, is rapidly evolving from an abstract concept to a critical, real-world issue. The numbers confirm this shift: the EU's InvestAI initiative aims to mobilize €200 billion for AI development, including €20 billion for the creation of AI "gigafactories," and corporate private investment in AI grew 44.5% from 2023 to 2024, with governments and the private sector worldwide investing billions to develop domestic AI ecosystems and secure their digital futures. As AI models continue to grow in sophistication and reach, the location and control of the data used for their training and operation have significant implications for national security, economic competitiveness, and ethical governance. This control of data is not just a theoretical problem but a tangible concern.

The foundation models we generally hear about are trained primarily on English-language data. For instance, 89.7% of the pre-training data for Meta's Llama 2 was English. Even with Llama 3.1, only 8% of its 15 trillion token dataset was non-English. Similarly, OpenAI's GPT-3 was trained on a dataset that was approximately 93% English. These statistics, drawn from the models' own data sheets, are quite revealing.

The web itself, the primary source of training data, is similarly skewed. The Common Crawl dataset, a snapshot of the internet used to train many models, is a prime example. In a recent version, 46% of the documents were in English, with the next closest languages, German and Russian, making up less than 6% each. In contrast, just under 19% of the global population speaks English, according to the CIA World Factbook 2022. 

The consequence of this imbalance goes beyond simple language translation. It shapes the model's cultural alignment. Research has shown that large language models (LLMs) tend to align with the cultural values of Western, educated, industrialized, rich, and democratic (WEIRD) societies, because that's the source of the data on which they have been trained.

Training an AI model on language can be a powerful way to represent and reproduce cultural patterns, because languages are a direct reflection of a culture's values, beliefs, and worldview. By analyzing vast amounts of text from a specific linguistic and cultural group, an AI model learns to mimic the nuances of that culture.

Dad jokes are complicated

Training an AI model involves teaching it not just vocabulary and grammar, but also the practical application of language. That means going beyond literal words to include sarcasm, irony, humor, and all the social etiquette embedded in a conversation. We can see all of this reflected in a short "Dad joke." For example, I prompted a public GPT model for a "Dad joke" and it provided the following:

I'm afraid for the calendar. Its days are numbered.

That joke might be funny in English, but it could be confusing for a non-native English speaker because the punchline relies on an idiom common in Western cultures. A feel for sarcasm and humor (sometimes found in Dad jokes) only materializes from processing extensive corpora of literature, historical documents, social media interactions, and even colloquialisms. By doing this, AI models can begin to mimic recurring themes, dominant narratives, and the underlying cognitive frameworks that shape a culture's identity.

Open sourcing a new path

Communities do not need to build their own models from scratch. The beauty of open source is that it offers an alternative path. Communities can take a powerful, open source "base model" (like Llama) and fine-tune it. This means they can further train the model on their own culturally specific data, so it learns the nuances of their language, history, and legal frameworks.
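To make this concrete, here is a minimal sketch of what that fine-tuning loop can look like, using the open source Hugging Face transformers and datasets libraries. The base model ID, corpus file name, and hyperparameters are illustrative assumptions, not a recommended recipe; a real project would add evaluation, parameter-efficient techniques such as LoRA, and far more data.

    # A minimal fine-tuning sketch: further train an open base model on a
    # community-collected, culturally specific text corpus. The model ID,
    # file name, and hyperparameters are illustrative assumptions.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    BASE_MODEL = "meta-llama/Llama-3.1-8B"  # assumed: any open base model works
    CORPUS = "community_corpus.txt"         # assumed: one document per line

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    # Load the local corpus and tokenize it for causal language modeling.
    dataset = load_dataset("text", data_files={"train": CORPUS})
    tokenized = dataset["train"].map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="culturally-tuned-model",
            num_train_epochs=1,
            per_device_train_batch_size=2,
        ),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

The heavy lifting of pre-training has already been paid for by the base model; the community only supplies the data that teaches it their language and context.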

Cultural fine-tuning is not just a theory, it's happening right now. Here are some examples:

  • Pan-African natural language processing (NLP) with Masakhane: Masakhane, which roughly translates to "We build together" in Zulu, is a grassroots, pan-African community of researchers. They are a perfect example of a community working to solve its own problems. They've created the first-ever named entity recognition (NER) dataset for 10 African languages (MasakhaNER) and have built translation models for over 30 African languages (a brief usage sketch follows this list).
  • Preserving indigenous languages: AI's application extends to safeguarding endangered languages. Projects like the Indigenous languages technology project by the National Research Council of Canada (NRC) and IBM's work with languages such as Guarani Mbya in Brazil are exciting examples of how this technology can be used to aid in cultural preservation.
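As a hedged illustration of the MasakhaNER work above, the sketch below runs named entity recognition through the Hugging Face pipeline API. The model identifier is an assumption for illustration; the Masakhane organization on the Hugging Face Hub hosts the models the community has actually released.

    # A minimal NER sketch against a MasakhaNER-style model. The model ID is
    # an illustrative assumption; check the Masakhane org on the Hugging Face
    # Hub for current releases.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model="masakhane/afroxlmr-large-ner-masakhaner-1.0_2.0",  # assumed ID
        aggregation_strategy="simple",  # merge sub-word pieces into entities
    )

    # Swahili: "My name is Amina, I live in Nairobi."
    for entity in ner("Jina langu ni Amina, ninaishi Nairobi."):
        print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))

A model like this can tag names of people and places in languages that mainstream English-centric models handle poorly, which is exactly the gap these communities are closing.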

The growing efforts of AI sovereignty

In parallel with the technical work, a broader political movement is emerging around the concept of AI sovereignty. AI sovereignty refers to a nation taking control of its own AI development so that it remains independent of other countries or regions. Sovereign AI means controlling sensitive data within national borders, maintaining strategic independence for critical systems, developing AI that reflects local cultures and aligns with national values, boosting the domestic economy, and establishing frameworks and regulations, such as the EU AI Act in the European Union.

This legal and political movement drives the work of communities like Masakhane, making it not just a good idea but a national priority for many countries. It provides the "why" for the massive undertaking of collecting local datasets and building sovereign AI capabilities. After all, a nation cannot achieve AI sovereignty if all its data is processed through foreign models that do not reflect its cultural context. Local fine-tuning of open source models helps address these policy demands.

A multilingual AI future

The default path for AI could be one of cultural homogenization, where the nuances of our global cultures are flattened by models trained on a narrow slice of human experience. By using open source tools and models, dedicated communities are building a more equitable and diverse AI ecosystem.

The principles of open source are powerful, and it's important to champion a community-driven approach to AI. When we embrace transparency, collaboration, and shared development, open source accelerates innovation, bringing together many different perspectives and contributions that can then shape the future of AI.

For example, Red Hat's involvement with projects like InstructLab and vLLM is making it possible for anyone, not just data scientists, to contribute their knowledge and expertise to LLMs. This collaborative approach helps build AI technologies that reflect a broader array of societal needs and cultural norms. It helps reduce power being concentrated in just a few hands and helps make cutting-edge advancements more accessible to everyone. 
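As one small, hedged example of what that accessibility looks like in practice: vLLM exposes an OpenAI-compatible API, so a locally hosted open model can be queried with a few lines of standard client code. The host, port, and model ID below are assumptions for illustration.

    # A minimal sketch of querying an open model served locally by vLLM
    # (started with something like `vllm serve <model-id>`), through its
    # OpenAI-compatible endpoint. Host, port, and model ID are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="ibm-granite/granite-3.1-8b-instruct",  # assumed model ID
        messages=[{"role": "user",
                   "content": "Greet a new colleague formally in Japanese."}],
    )
    print(response.choices[0].message.content)

Because the serving stack is open source, a community can swap in its own fine-tuned model without changing any of the client code.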

More models, less bias

Model bias usually originates from the data used to train a model. If a model is trained on a dataset that isn't diverse or representative of the real world, it will inevitably reflect and amplify the biases in that data. Red Hat OpenShift AI can help address bias by enabling developers to choose from a wide variety of AI models. This flexibility means that no single, potentially biased model is imposed, and users can select models best suited to their specific context, as well as models trained on more diverse datasets. OpenShift AI's open source nature also promotes transparency and enables a community of diverse contributors, further helping reduce these inherent biases.

A community-driven approach not only helps accelerate technological progress but also democratizes AI development, empowering a larger number of individuals and organizations to contribute to and benefit from these transformative technologies. The future of AI doesn't have to be a barren monoculture. Thanks to dedicated open source communities around the world, it can be a vibrant ecosystem built by all of us together.
