Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning

Kyoka Ono1,2    Simon A. Lee2   
1 International Christian University, Tokyo, Japan    2 UCLA, Los Angeles, CA, USA
ICML 2024 AI4Science Workshop Paper

Dall-E's attempt at visualizing the research we performed. It makes our work look a lot cooler...

Abstract

Recent research has explored how Language Models (LMs) can be used for feature representation and prediction in tabular machine learning tasks. This involves employing text serialization and supervised fine-tuning (SFT) techniques. Despite the simplicity of these techniques, significant gaps remain in our understanding of the applicability and reliability of LMs in this context. Our study assesses how emerging LM technologies compare with traditional paradigms in tabular machine learning and evaluates the feasibility of adopting similar approaches with these advanced technologies. At the data level, we investigate various methods of representing and curating serialized tabular data and explore their impact on prediction performance. At the classification level, we examine whether text serialization combined with LMs improves performance on challenging tabular datasets (e.g., class imbalance, distribution shift, biases, and high dimensionality), and assess whether this method represents a state-of-the-art (SOTA) approach for addressing tabular machine learning challenges. Our findings reveal that current pre-trained models should not replace conventional approaches.

Text Serialization

Text serialization is the process of converting tabular data (rows and columns) into textual representations, an approach first proposed in TabLLM. A minimal sketch is shown below.
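As a concrete illustration, here is a minimal Python sketch of a "text template" style serialization, where each tabular row becomes a string of "The <column> is <value>." statements. The column names and values are illustrative assumptions, not taken from the paper's datasets.

import pandas as pd

def serialize_row(row: pd.Series) -> str:
    # Turn one tabular row into a sentence such as
    # "The age is 39. The workclass is Private. The hours_per_week is 40."
    return " ".join(f"The {col} is {val}." for col, val in row.items())

df = pd.DataFrame({
    "age": [39, 52],
    "workclass": ["Private", "Self-employed"],
    "hours_per_week": [40, 60],
})

serialized = df.apply(serialize_row, axis=1).tolist()
print(serialized[0])
# The age is 39. The workclass is Private. The hours_per_week is 40.

These serialized strings can then be fed to a language model, either as prompts or as inputs for supervised fine-tuning.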


Project Overview

In this project, we are interested in addressing two questions regarding text serialization. In the first part of our research, we examine how text serialization compares to traditional tabular machine learning paradigms in data curation. In the second part, we explore how text serialization can be used to address common challenges in tabular machine learning and whether it outperforms existing machine learning methodologies. A simple sketch of the two paradigms we compare is given below.
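The following sketch contrasts the two paradigms on a toy example; it is an illustrative, hedged snippet rather than our actual experimental setup. The toy data, the gradient-boosting baseline, and the embedding model "sentence-transformers/all-MiniLM-L6-v2" are assumptions made only to keep the snippet self-contained.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer

# Toy tabular data (illustrative only; not from the paper's benchmarks).
df = pd.DataFrame({
    "age": [39, 52, 28, 45, 33, 61],
    "hours_per_week": [40, 60, 35, 50, 45, 20],
    "label": [0, 1, 0, 1, 0, 1],
})
X_raw, y = df[["age", "hours_per_week"]], df["label"]

# One shared train/test split so both paradigms see the same rows.
idx_tr, idx_te = train_test_split(np.arange(len(df)), test_size=0.33, random_state=0, stratify=y)

# Paradigm 1: a conventional tabular model fit on the raw numeric features.
gbm = GradientBoostingClassifier(random_state=0).fit(X_raw.iloc[idx_tr], y.iloc[idx_tr])

# Paradigm 2: serialize each row to text, embed it with a pre-trained LM,
# and fit a simple classifier on top of the embeddings.
texts = [f"The age is {a}. The hours per week is {h}." for a, h in zip(df["age"], df["hours_per_week"])]
emb = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(texts)
clf = LogisticRegression(max_iter=1000).fit(emb[idx_tr], y.iloc[idx_tr])

print("Raw-feature GBM AUROC:", roc_auc_score(y.iloc[idx_te], gbm.predict_proba(X_raw.iloc[idx_te])[:, 1]))
print("LM-embedding AUROC:", roc_auc_score(y.iloc[idx_te], clf.predict_proba(emb[idx_te])[:, 1]))

In the paper itself, the comparison spans a range of datasets, serialization formats, and LMs; this snippet only conveys the overall shape of the two pipelines.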


Webpage under Construction

We apologize for the inconvenience, but we will be at the ICML 2024 AI4Science Workshop for poster session #2 in Hall A8 to present our paper. We will update this webpage with more information soon.


BibTeX

@article{ono2024text,
  title={Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine Learning},
  author={Ono, Kyoka and Lee, Simon A},
  journal={arXiv preprint arXiv:2406.13846},
  year={2024}
}

Questions

If there are any questions or concerns, please feel free to reach out to us at simonlee711@g.ucla.edu.