Apple researchers conducted an A/B test to measure the impact of AI-generated relevance labels on App Store search rankings and app downloads. Here are the results they found.

AI-generated relevance labels slightly improved App Store search conversions

In a new study titled "Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments," a group of Apple researchers investigated whether LLMs could help improve App Store search results. Specifically, they used an LLM to generate the relevance labels on which the ranking system is trained.

As mentioned in the study, relevance is a key element in helping users find the apps they are searching for. While there are many signals that can contribute to search ranking, the researchers focused on two main signals:

  • Behavioral relevance reflects how users interact with the results; for example, whether they click on or download an app.
  • Textual relevance measures how meaningfully an app's metadata (such as name, description, and keywords) matches a user's search query.

In the study, the researchers note that while there is a wealth of data available on behavioral relevance (as it can be easily measured), the same is not true for textual relevance:

“While behavioral relevance labels are abundant, textual relevance labels produced by human judgments are much rarer. This creates a fundamental problem: high-quality textual relevance labels are scarce and expensive to produce, creating a bottleneck in scalability and giving weak power to the textual relevance objective.”

To overcome this issue, the researchers fine-tuned a 3 billion parameter LLM on existing human judgments so that it could learn to assign relevance labels to apps based on a user's search query and the app's metadata.

Subsequently, they generated millions of new relevance labels with this model and retrained the App Store ranking system using both the original data and the labels generated by the LLM.
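
The augmentation step can be pictured roughly as follows. This is a minimal sketch, not the researchers' pipeline: `toy_labeler` is a hypothetical token-overlap heuristic standing in for the fine-tuned LLM, and the record layout, the 0 to 2 label scale, the example apps, and the choice to skip pairs that humans already judged are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    query: str
    app_metadata: str
    relevance: int  # hypothetical scale: 0 = irrelevant, 1 = partial, 2 = relevant
    source: str     # "human" or "model"


def toy_labeler(query: str, app_metadata: str) -> int:
    """Hypothetical stand-in for the fine-tuned LLM: scores by token overlap."""
    query_tokens = set(query.lower().split())
    metadata_tokens = set(app_metadata.lower().split())
    if not query_tokens:
        return 0
    overlap = len(query_tokens & metadata_tokens) / len(query_tokens)
    if overlap == 1.0:
        return 2
    return 1 if overlap > 0 else 0


def augment(human_labels, unlabeled_pairs, labeler=toy_labeler):
    """Keep every human label; add model labels only for pairs humans did not judge."""
    judged = {(p.query, p.app_metadata) for p in human_labels}
    model_labels = [
        LabeledPair(q, m, labeler(q, m), "model")
        for q, m in unlabeled_pairs
        if (q, m) not in judged
    ]
    return list(human_labels) + model_labels


# Invented examples, not data from the study.
human = [LabeledPair("photo editor", "Photo Editor Pro: filters and collage", 2, "human")]
unlabeled = [
    ("photo editor", "Photo Editor Pro: filters and collage"),  # already judged, skipped
    ("photo editor", "Budget tracker and expense planner"),
    ("photo editor", "Simple photo viewer"),
]
training_set = augment(human, unlabeled)
for pair in training_set:
    print(pair.source, pair.relevance, pair.app_metadata)
```

The combined `training_set` is what a ranker would then be retrained on. In the study, the labeler is the fine-tuned 3 billion parameter model and the generated labels number in the millions.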

After completing this process, they conducted an offline evaluation and then performed a global A/B test on live App Store traffic:

“(…) The LLM-augmented model showed a statistically significant +0.24% increase in the conversion rate, defined as the proportion of search sessions with at least one app download, our primary metric. While this number may seem small, it is considered a significant improvement for a mature industrial ranking system. This gain was observed in 89% of storefronts.”

In other words, searches ranked by the LLM-augmented model were 0.24% more likely to end in an app download than searches ranked by the traditional model.
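
To make the metric concrete, the sketch below computes conversion rate as the share of search sessions with at least one download, then the relative lift between the two arms of an A/B test. The session data is invented and deliberately tiny, so the resulting lift is far larger than anything in the study. The sketch also assumes the reported +0.24% is a relative change, which the quoted passage does not spell out.

```python
def conversion_rate(downloads_per_session):
    """Share of search sessions that ended in at least one download."""
    if not downloads_per_session:
        raise ValueError("need at least one session")
    converted = sum(1 for d in downloads_per_session if d > 0)
    return converted / len(downloads_per_session)


def relative_lift(control_rate, treatment_rate):
    """Relative change of treatment over control, as a fraction."""
    return (treatment_rate - control_rate) / control_rate


# Hypothetical per-session download counts (not data from the study).
control = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0]    # 3 of 10 sessions converted
treatment = [0, 1, 0, 1, 2, 0, 1, 0, 0, 0]  # 4 of 10 sessions converted

control_rate = conversion_rate(control)
treatment_rate = conversion_rate(treatment)
lift = relative_lift(control_rate, treatment_rate)
print(f"control={control_rate:.2f} treatment={treatment_rate:.2f} lift={lift:+.1%}")
```

Note that a session with two downloads counts once, just like a session with one: the metric tracks whether a search succeeded, not how many apps were installed.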

And while a 0.24% increase may seem very small, with total App Store downloads projected to reach around 38 billion by 2025, the effect adds up quickly. In practice, this could mean tens of millions of additional downloads from App Store searches, which developers would certainly appreciate.
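
The back-of-envelope arithmetic behind that estimate can be sketched as follows. Two assumptions are doing the work here and neither comes from the study: that the +0.24% lift in conversion rate carries over one-to-one to download volume, and that some fraction `search_share` of all downloads originates from search. The shares tried below are hypothetical.

```python
TOTAL_DOWNLOADS = 38_000_000_000  # projected App Store downloads cited in the article
RELATIVE_LIFT = 0.0024            # +0.24%, assumed to carry over to download volume


def extra_downloads(search_share):
    """Additional downloads if `search_share` of all downloads come from search."""
    if not 0.0 <= search_share <= 1.0:
        raise ValueError("search_share must be between 0 and 1")
    return TOTAL_DOWNLOADS * search_share * RELATIVE_LIFT


# Hypothetical search shares; the article does not give the real figure.
for share in (0.25, 0.50, 1.00):
    millions = extra_downloads(share) / 1e6
    print(f"search share {share:.0%}: ~{millions:.1f} million extra downloads")
```

Under these assumptions the upper bound is 38 billion × 0.0024 ≈ 91 million, and even a 25% search share yields about 23 million, so the estimate stays in the tens of millions across the range.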

