Schema Generation for Querying a Data Lakehouse

# 209





Abstract

Using Large Language Models (LLMs) to generate SQL queries has been studied extensively in recent years. A related problem is schema generation for building applications on top of a Data Lakehouse. Schema generation is harder than query generation for two reasons: a schema is a longer text that must be generated within the constraints of the existing databases, and the correctness of a generated schema is difficult to evaluate. A number of techniques that have been attempted for query generation, such as schema linking, knowledge infusion, preference learning and uncertainty quantification, are also relevant for schema generation. A natural extension of this problem is generating and consuming data from APIs. GraphQL is a query language for retrieving data from diverse sources, including databases and REST APIs. We have been exploring the use of LLMs to generate GraphQL queries and schemas to retrieve data from a Lakehouse. An open question in this space is whether existing techniques like in-context learning and fine-tuning are sufficient for this task, or whether LLMs need reasoning ability. We will present our recent work exploring this question.

Balaji Ganesan, Senior Research Engineer, IBM India Research Lab

Balaji Ganesan is a Senior Research Engineer at IBM India Research Lab (IRL), where he is part of the Data and AI department under IBM Research AI. He currently works on Large Language Models, Knowledge Graphs, LLM Agents and their applications in Semantic Automation, such as text-to-SQL and text-to-GraphQL. He has previously worked on entity matching and link prediction using Graph Neural Networks, with a focus on explainability. Before joining IBM, Balaji worked at Yahoo and at a number of startups in search and computational advertising. He graduated with a Bachelor's degree in Computer Science Engineering from the University of Madras in 2003, and a Master's degree in Computer Science from the University of Arizona in 2006.