Keywords: Large Language Models, LLMs, Data Engineering, Tabular Data, Tables, Enterprise, Industry
TL;DR: We show how data engineering in enterprises is often more challenging than portrayed by existing research on LLMs for tabular data.
Abstract: Data engineering on tabular data is crucial for transforming raw data sources into a form suitable for downstream tasks like machine learning and analytical query processing. Since handling raw data often entails high manual overhead, automating data engineering tasks like entity matching and column type annotation has long been an active area of research. Recent work shows that Large Language Models (LLMs) achieve state-of-the-art performance on various public benchmarks, offering enterprises a promising avenue to automate processes without needing expensive, specialized solutions.
Yet the enterprise domain comes with unique challenges, such as table sizes and domain knowledge, that existing studies using LLMs overlook. To address this gap, we analyze enterprise-specific challenges related to data, tasks, and background knowledge, as well as the costs of using LLMs on tabular data. To understand how the differences between scientific benchmarks and enterprise scenarios affect LLM performance, we apply recent LLMs to various enterprise-related downstream tasks on representative customer data.
Submission Number: 16