We study the problem of optimizing data storage and access costs in the cloud while ensuring that the desired performance or latency is unaffected. First, we study how to select cloud storage tiers and compression schemes for given data partitions with temporal access predictions. Second, we propose learning the compression performance of multiple algorithms across data partitions in different formats, so that predictions can be generated on the fly as inputs to the optimizer. Third, we approach data partitioning in a fundamentally different way from the current default in most data lakes, where partitions correspond to ingestion batches. We propose access-pattern-aware data partitioning and formulate it as a constrained optimization problem. We study these problems both theoretically and empirically and show significant cost savings over platform defaults as well as the closest baselines in the literature.
Koyel Mukherjee is a Senior Research Scientist at Adobe Research, Bangalore. Her current interests are in efficient generative AI through algorithmic and learning-based approaches. Earlier, she studied several cost optimization problems in enterprise systems, such as storage costs in the cloud and data redundancy in data lakes. She regularly publishes in top CS conferences such as ICDE, SIGMOD, and ICML. Prior to Adobe, she was part of Xerox Research as well as IBM Research, Bangalore.