A groundbreaking experiment with ChatGPT shows the potential of generative AI—and its limits.

The rapid pace of change in generative AI (GenAI), where significant advances can happen month to month, means that executives must avoid obsessing over perfecting near-term use cases. Instead, they need to formulate lasting guiding principles for how their company uses the technology as it evolves, with the ultimate goal of creating a competitive advantage.

But how does a company truly generate competitive advantage when adopting rapidly changing GenAI? To answer that, BCG conducted a first-of-its-kind scientific experiment in which 750 BCG consultants used GPT-4 for a series of tasks that reflect part of what employees do day to day. With support from scholars at Harvard Business School, MIT Sloan, the Wharton School, and the University of Warwick, the experiment sought to answer two fundamental questions business leaders face when determining their AI strategy: How should GenAI be used in high-skilled, white-collar work? And how should companies organize themselves to extract the most value from the partnership of humans and this technology?

Exploring generative AI’s ‘capability frontier’

The experiment’s results showed that when and how GenAI should be used in white-collar work depends largely on where a given task lies relative to the technology’s “capability frontier”: either within a particular model’s competence or beyond it. The frontier is generally expanding, broadening the range of tasks GenAI can handle, but with bumps along the way where models unexpectedly fail. These fluctuations create a “jagged” capability frontier that makes it difficult for users to identify whether a given task falls within or beyond the frontier, and to make strategic decisions accordingly.

These rapid shifts in the capability frontier can be seen in the performance of OpenAI’s GPT-3.5 compared to GPT-4. The two models were released just months apart, but the capability gains were, in some cases, massive. On certain standardized tests, such as the Uniform Bar Exam used to license lawyers, performance jumped from the 10th percentile with GPT-3.5 to nearly the 90th percentile with GPT-4.

Yet adoption of the technology is complicated by the fact that, paradoxically, the capability frontier can at times contract. For example, when GPT-4 was first released in March, it was very good at identifying prime numbers, doing so with 98% accuracy. By July, just a few months later, the same test yielded only a 2% accuracy rate. What had changed? In the background, OpenAI continuously retrains its models to be safer, to correct problems, and to become, generally, more capable over time. But because these models are so large, with hundreds of billions of parameters working together to produce outputs, certain changes inadvertently degrade some abilities, and it isn’t always clear why.

When use of generative AI can backfire

We designed two experiments to evaluate how participants use generative AI on two types of tasks. The first task—termed creative product innovation—was designed to be within GPT-4’s capability frontier. It tested product ideation (“give me 10 ideas for a new shoe targeting an underserved market”), product testing (“what questions would you ask a focus group to validate your product”) and, finally, product launch (“draft a press release announcing your product’s launch”). The second task—termed business problem solving—was designed to be complex enough that GPT-4 would make errors when solving it, such that it was clearly outside GPT-4’s capability frontier. The test provided participants with financial data and interview notes from a fictitious company and asked them how best to boost company revenues and profitability.

Our experiment’s findings indicate that, on the task designed to be inside GPT-4’s capability frontier, participants using GPT-4 handily outperformed the control group, by 40%. We expected GPT-4 to be good, but we were surprised by just how good it was. In addition, the results showed that when participants operating within the model’s competence attempted to modify GPT-4’s output in the hopes of improving it, the modifications actually degraded its quality.

There were also drawbacks to consider. Although GPT-4 improved almost everyone’s performance on creative product innovation tasks, we found the group of participants using it had significantly less diversity of ideas (41% less) than the control group (driven by GPT-4 giving everyone a similar answer). This homogenization of ideas within an organization can be a big problem for companies because it dampens divergent thinking and innovation.

Perhaps more surprisingly, on the business problem-solving task outside the model’s capability frontier, the participants using GPT-4 performed significantly worse than the control group, by about 23%. That GPT-4 was not merely failing to help humans on this task but was actively hurting performance is a significant finding. Why might this be the case? From interviews with participants, we found that GPT-4 can be very persuasive, able to justify almost any recommendation, even an incorrect one. Participants using GPT-4 tended to rely heavily on its recommendations instead of applying their own critical reasoning when confronted with errors in its logic. These results show the importance of evaluating GenAI’s performance in relation to human partners when assessing its potential for competitive advantage.

What should companies do right now?

The experiment’s results show the importance of accurately locating the “jagged frontier” in order to create value. Inside the capability frontier, humans add very little value to GenAI’s output; outside it, humans working without GenAI perform better. Beyond just locating this frontier, our experiment suggests a complete rethink of how humans and GenAI should collaborate. The value at stake is clearly significant, but how can companies navigate this emerging and complex paradigm of human-GenAI collaboration?

The first and most urgent step executives must take is to establish a “generative AI lab” where each function and division within a company experiments with the latest GenAI models and analyzes the results for specific types of tasks. Is the AI output up to par? Is human intervention necessary to improve results? This type of exercise cannot be a one-time effort because, as new models are released and existing models are updated, continuous experimentation will be essential to understanding GenAI’s evolving yet jagged capability frontier.

Companies will also need to think critically about how they will build a competitive advantage through GenAI. At this phase, a company’s data strategy becomes even more important. This experiment has shown that GenAI can be a powerful tool, but it is also a tool that is widely available to everyone. To truly drive competitive advantage, companies must ensure that the technology yields firm-specific, differentiated insights by using their own proprietary data (or any other unique source of data they can create or gain access to). This is often easier said than done because companies typically lack the data infrastructure to automatically digitize, collect, clean, and store all their own data, from customer behavior data to internally generated R&D information. Building data engineering capabilities in-house to unlock that data is therefore vital for companies in the age of GenAI.

Beyond accruing their own proprietary data, companies must also explore unconventional methods to build up their data moat. In highly concentrated markets, for example, non-dominant companies rarely have sufficient scale to generate useful insights from their own proprietary data alone. In such instances, a well-thought-out data-sharing strategy, built on mutual trust and well-designed contracts, can enable non-dominant players to compete with the largest ones.

While articulating their data strategy, companies must also adapt their people strategy. Specifically, companies must think critically about how to redeploy their people on work beyond GenAI’s capability frontier. This reimagining of a company’s workforce can take various forms. For example, companies can retrain existing data scientists (a role where AI is fast gaining capabilities) as data engineers, focusing them on tasks that AI cannot do, such as setting up the data-gathering infrastructure. This shift fulfills a critical need in data engineering while ensuring that humans work beyond the capabilities of AI, strengthening the company’s competitive position.

Another shift for companies is how they organize their marketing division—because, as the BCG experiment showed, this is an area where generative AI is already exceedingly good. Instead of focusing on content creation, which AI can do very well, marketers can now focus on strategic decision-making, which AI cannot yet do. Human work can exist beyond AI’s capabilities and add value by tackling questions like: “What products should a company launch?” or “How should the company position its brand to best target millennials?”

Companies will also need to rethink their talent, hiring, and development strategies beyond redeploying the current workforce. Certain raw individual talents that were previously sought after may matter less in the future than the ability to supervise AI systems and discern when the technology has reached its limits. Current hiring processes are not designed to identify such skills. In addition, a broader question for executives to tackle: as employees move away from content creation and into new oversight roles, how can workers effectively manage the technology on tasks they have not mastered themselves?

In tandem, companies will have to redefine roles and workflows within their organizations. The prevailing wisdom holds that the best way for humans and AI to collaborate is tight, continuous collaboration, each feeding off the other. But our experiment suggests that, with the advent of generative AI, the opposite is true. On tasks where GenAI is very good, minimal human involvement is needed; in fact, results are better when humans step aside and act as supervisors, treating the model’s output as a near-final draft. Humans instead create value by acting as complementors of GenAI, pushing its capability frontier outward by taking on tasks where AI is not yet competent.

We hope that adopting this “complementor model of human-AI collaboration” will be a net positive to both individuals and companies. Individuals, now freed up from a host of daily tasks, can redirect their time, energy, and effort to take on a much broader mandate with their work and drive impact. In turn, these efficiency gains will turbocharge businesses, delivering better products and services to their customers. 

***

Generative AI presents a unique opportunity, and challenge, for business executives. For companies, the value of GenAI lies in their ability to monitor and understand its fluctuating capability frontier, so that they can rapidly deploy GenAI where the technology excels and use other means where it doesn’t. Businesses that strike this balance, while adapting their experimentation, workflows, people, and data capabilities, will create value and maximize their competitive advantage.

Read other Fortune columns by François Candelon

François Candelon is a managing director and senior partner in the Paris office of Boston Consulting Group and the global director of the BCG Henderson Institute (BHI).

Lisa Krayer is a project leader in BCG’s Washington, D.C. office and an ambassador at BHI.

Saravanan Rajendran is a project leader in BCG’s San Francisco office and an ambassador at BHI.

David Zuluaga Martinez is a partner in BCG’s Brooklyn office and an ambassador at BHI.
