Statistical Information System Collaboration Community

SDMX+AI: A path to explore generative AI for better data accessibility

In March 2024, the “SDMX+AI” workshop, co-organised by the OECD and the BIS, brought together 70 participants from national and international statistical organisations interested in exploring the opportunities in “unlocking the potential of Natural Language Processing to enhance data access”.

Read the detailed report.

 

The topic is not completely new. As far back as 2019, the OECD, Statistics Canada and Statistics Netherlands (CBS) developed a StatsBot prototype, on the intuition that combining AI with SDMX semantics should allow universal, conversational, natural-language access to the hundreds of statistical sources made available in SDMX. However, that was before the ChatGPT storm hit in 2023. As the four technology providers clearly demonstrated during the workshop, this longstanding goal now seems within reach, thanks to Retrieval-Augmented Generation (RAG) techniques and advances in the training and use of Large Language Models (LLMs). The workshop confirms the initial intuition: SDMX semantics and standard format, combined with AI, should indeed allow natural language access to statistics, and much more: “talking to the data”, as Jim Tebrake of the IMF put it.


88% of the workshop participants recognised the use case as relevant for their organisation, while also identifying other related use cases, especially in augmented metadata editing, enrichment, harmonisation, and translation/mapping. Eleven organisations confirmed they were likely to contribute financially to developing a production-grade solution. Participants agreed on the following conditions for a potential co-investment:

  • the use case is specific to official statistics;
  • there is relative consensus on the functional expectation and value to the end user;
  • the delivery of a production-grade service seems within reach;
  • the solution is inclusive, that is, deployable and/or usable in any statistical office;
  • the solution is open source, but commercial LLMs could be part of the solution;
  • the solution is cloud vendor agnostic.

It is recognised that some of these conditions may prove not to be fully achievable, or may conflict with one another; they should be treated as a basis for discussion in forging an alliance for such a co-investment approach.

What are the next steps? The preferred scenario (making the IMF “StatGPT” production-grade, reusable and aligned with the above conditions) should be explored in the coming months.

Is your organisation potentially interested in joining the alliance? Get in touch!

 


Thank you to Eric Anvar, Head of Smart Data, Statistics and Data Directorate at the OECD, for contributing to this post.