Organizations that adopt a data-centric mindset across their people, processes and technology will be better positioned to leverage generative AI solutions to reduce risk, optimize resources, and identify or implement best practices across their business.
Yet according to WWT's latest survey, IT leaders are least satisfied with their business intelligence and analytics capabilities when it comes to execution — a challenge that may persist over the next six to 12 months until data strategy cracks their list of top priorities.
Without quality data, generative AI will fail to deliver upon its immense potential.
For those reasons, we advise organizations looking to kickstart their AI journeys to think beyond technology and infrastructure and focus first on developing a coherent data strategy.
Remember, large language models (LLMs) do not understand; they predict, consuming massive amounts of data to generate unique responses. An LLM's effectiveness depends on the quality and quantity of the dataset used to train it.
You could have the most modern infrastructure in the world, but if you feed generative AI untrustworthy data, you will get untrustworthy responses.
As an IT leader, you need to take time to test and train AI models before you deploy them. To even get to that point, you'll need to understand how data is produced, collected, utilized and analyzed.
For many organizations, it's a matter of quality. IT leaders we engage with often know the data exists. But too often, data quality is low and incomplete, and data is located in silos.
Other key data-related questions to consider when approaching generative AI solutions include:
- Does your organization have a data governance policy or data management strategy?
- Does your organization have the appropriate data infrastructure to support the data environment (data platforms and architectures, etc.)?
- Are teams and stakeholders aligned on the critical role of clean, consistent and accurate data in advancing AI initiatives?
- Is the data being analyzed high quality and trustworthy?
- Is the data easily accessible, automated and at the appropriate level of granularity to help drive decisions?
Steps you can take to improve your data posture:
- Thoroughly understand your organization's data maturity, capabilities and roadmap.
- Evaluate what data will be of value from your organization and third-party vendors or suppliers.
- Devise a collection method to acquire and capture clean data.
- If you have the budget, invest in a robust data management tool that can organize and analyze your data and automate the process moving forward.
- Create alignment across departments. Generative AI will be a cross-functional value driver, requiring teams to collaborate on collecting and sharing their data.
- Identify proprietary or sensitive data that should be treated with greater caution.
- Understand your data privacy policies and whether any customer or employee data will be exposed.
Aside from generative AI, data governance will have an impact on a broad swath of your organization, allowing you to make strategic, data-based decisions that support your business outcomes.
Related reading:
- Manage Generative AI Before It Manages You
- Generative AI: Risks, Rewards and a Framework for Utilization
- Common Pitfalls When Getting Started With Data Governance
List of key generative AI terms
For those unfamiliar with generative AI, here's a glossary of terms that should help you gain a clearer understanding of what all the hype is about:
- LLM or large language model: A type of AI algorithm that leverages deep learning techniques to process natural language — understanding, summarizing, predicting and generating content. Modern LLMs typically have billions of parameters.
- GPT or generative pre-trained transformer: A type of LLM trained on a large corpus using the transformer neural network to generate text as a response to input.
- NLP or natural language processing: The processing of human language by a machine including parsing, understanding, generating, etc.
- Corpus: Essentially, the training data. A collection of machine-readable text structured as a dataset.
- Vector: The numerical representation of a word or phrase. A list of numbers representing different aspects of a word or phrase.
- Token: A unit of input text. A token is the smallest semantic unit defined in a document or corpus (not necessarily a word). ChatGPT, for example, has a roughly 4,000-token limit; GPT-4 permits up to 32,000 tokens.
- Parameters: The internal weights a model learns during training. For example, GPT-3, the model family underlying ChatGPT, has 175 billion parameters.
- Transformer: The neural network architecture behind LLMs. A deep learning model that uses the attention mechanism to learn how much weight to give each part of the input data.
- RL or reinforcement learning: A feedback-based machine learning paradigm where the model/agent learns to act in an environment to maximize a defined reward.
- RLHF or reinforcement learning from human feedback: A technique that trains a reward model directly from human feedback and uses the model as a reward function to optimize an agent's policy using RL.
- Inference: Running a trained model on new data to get its response or prediction.
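To make the "token" and "vector" entries above concrete, here is a minimal sketch. It deliberately simplifies: real LLMs use subword tokenizers (such as byte-pair encoding) and learned embedding vectors, whereas this toy version splits on whitespace and builds bag-of-words count vectors. All function names here are illustrative, not from any particular library.

```python
# Toy illustration of tokens and vectors, simplified from how real LLMs work.

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (a stand-in for subword tokenization)."""
    return text.lower().split()

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Map each unique token in the corpus to an integer id."""
    vocab: dict[str, int] = {}
    for doc in corpus:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text: str, vocab: dict[str, int]) -> list[int]:
    """Represent text as a vector of token counts over the vocabulary."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

corpus = ["clean data in", "untrustworthy data out"]
vocab = build_vocab(corpus)
print(vectorize("data data in", vocab))  # counts of each vocabulary token
```

The point of the sketch is the pipeline itself: text is broken into tokens, and tokens become numbers a model can compute with. Token limits like ChatGPT's roughly 4,000 apply to the tokenized form of your input, not its character count.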
This report may not be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means, including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior express written permission of WWT Research. It consists of the opinions of WWT Research and as such should not be construed as statements of fact. WWT provides the Report "AS-IS", although the information contained in the Report has been obtained from sources that are believed to be reliable. WWT disclaims all warranties as to the accuracy, completeness or adequacy of the information.