Tabular data constitute one of the most prevalent and practically important data formats across diverse fields, including healthcare, finance, manufacturing, and civil infrastructure management. Despite their ubiquity, learning reliable and generalizable representations from tabular data remains a fundamental challenge due to the absence of inherent structural priors and the frequent presence of noise and uncertainty in both features and labels. Unlike images or text, tabular features are heterogeneous—often mixing numerical, categorical, and ordinal variables—without spatial or sequential correlations that can guide self-supervised learning (SSL). Moreover, labels in many real-world applications, such as geotechnical risk assessment, are derived from subjective human judgment rather than objective ground truth, leading to label unreliability and inconsistency. This dissertation addresses these challenges through the development of uncertainty-aware learning frameworks that integrate robust representation learning and noise-resilient label refinement for heterogeneous tabular datasets.
The first major contribution is UA-Tab, a novel Uncertainty-Aware Self-Supervised Learning framework for tabular data. UA-Tab introduces a feature-level attention–uncertainty module that simultaneously estimates the relevance and reliability of each feature. This mechanism adaptively emphasizes informative and trustworthy attributes while suppressing noisy or ambiguous ones. Furthermore, UA-Tab replaces conventional input-level augmentations, which often distort semantics in tabular domains, with latent-space perturbation, generating semantically consistent positive pairs directly from the posterior distribution of a variational encoder. By combining contrastive similarity, reconstruction, and latent regularization losses within a unified objective, UA-Tab achieves robust and interpretable representations that remain stable under various noise conditions. Experiments on benchmark datasets demonstrate that UA-Tab consistently outperforms state-of-the-art self-supervised learning methods such as VIME, SCARF, SubTab, STab, and TabDeco, particularly when feature corruption or uncertainty is present.
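To make the idea of latent-space positive pairs concrete, the following is a minimal NumPy sketch, not the UA-Tab implementation. It assumes a variational encoder has already produced a diagonal Gaussian posterior (`mu`, `log_var`) for each row, draws two reparameterized samples as the positive pair, and scores them with an InfoNCE-style contrastive loss plus a KL regularizer. The function names, the temperature, the weight `beta`, and the random stand-in for encoder outputs are all illustrative assumptions; the reconstruction term and the attention–uncertainty module are omitted.

```python
import numpy as np


def sample_latent_views(mu, log_var, rng):
    """Draw two reparameterized samples from N(mu, diag(exp(log_var)))."""
    std = np.exp(0.5 * log_var)
    z1 = mu + std * rng.standard_normal(mu.shape)
    z2 = mu + std * rng.standard_normal(mu.shape)
    return z1, z2


def info_nce(z1, z2, temperature=0.5):
    """InfoNCE loss: row i of z1 and row i of z2 form the positive pair."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def kl_to_standard_normal(mu, log_var):
    """Mean KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    per_row = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    return np.mean(per_row)


# Stand-in for encoder outputs on a batch of 8 rows with a 4-d latent space.
rng = np.random.default_rng(0)
mu = rng.standard_normal((8, 4))
log_var = np.full((8, 4), -4.0)  # small posterior variance (assumed)

z1, z2 = sample_latent_views(mu, log_var, rng)
contrastive = info_nce(z1, z2)
kl = kl_to_standard_normal(mu, log_var)
beta = 0.1  # hypothetical regularization weight
partial_objective = contrastive + beta * kl
```

Because both views are sampled around the same posterior mean, no raw feature value is ever masked or swapped, which is the property the paragraph above attributes to latent-space perturbation.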
The second contribution, GF-KDA (Graph-Free Kernel Discriminant Analysis), addresses the problem of label refinement under noisy supervision. Conventional semi-supervised methods like label propagation or label spreading rely on explicit graph structures that are unstable in heterogeneous, high-dimensional feature spaces. GF-KDA eliminates this dependency by performing label refinement directly in kernel space using an uncertainty-aware RBF kernel that integrates mutual-information-based feature relevance and conditional-entropy-based feature uncertainty. Through iterative posterior estimation and confidence-based propagation, GF-KDA selectively spreads only high-confidence labels while suppressing noise amplification. The method consistently exhibits enhanced robustness under noisy supervision and provides more interpretable label refinement than existing semi-supervised methods such as label propagation, label spreading, self-training, and PET, across diverse noise conditions and labeling scenarios.
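As an illustration of graph-free refinement in kernel space, the sketch below builds a feature-weighted RBF kernel and applies one confidence-gated relabeling step. It is a simplified stand-in rather than the GF-KDA algorithm: the weighting rule `relevance / (1 + uncertainty)`, the leave-one-out posterior, the threshold of 0.8, and all function names are assumptions, and the per-feature relevance and uncertainty values (which the text derives from mutual information and conditional entropy) are passed in as given arrays. The method described above iterates this kind of step; only a single pass is shown.

```python
import numpy as np


def weighted_rbf_kernel(X, Y, relevance, uncertainty, gamma=1.0):
    """RBF kernel with per-feature weights (hypothetical weighting rule)."""
    w = relevance / (1.0 + uncertainty)  # relevant, reliable features count more
    diff = X[:, None, :] - Y[None, :, :]
    sq_dist = np.sum(w * diff**2, axis=2)
    return np.exp(-gamma * sq_dist)


def refine_labels(K, labels, n_classes, threshold=0.8):
    """One refinement pass: relabel only where the kernel posterior is confident."""
    K_loo = K.copy()
    np.fill_diagonal(K_loo, 0.0)  # leave-one-out: a point does not vote for itself
    one_hot = np.eye(n_classes)[labels]
    scores = K_loo @ one_hot
    posterior = scores / (scores.sum(axis=1, keepdims=True) + 1e-12)
    confident = posterior.max(axis=1) >= threshold
    refined = np.where(confident, posterior.argmax(axis=1), labels)
    return refined, posterior


# Two well-separated clusters; the third point carries a flipped label.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
noisy_labels = np.array([0, 0, 1, 1, 1, 1])
relevance = np.array([1.0, 1.0])    # assumed mutual-information scores
uncertainty = np.array([0.0, 0.0])  # assumed conditional-entropy scores

K = weighted_rbf_kernel(X, X, relevance, uncertainty)
refined, posterior = refine_labels(K, noisy_labels, n_classes=2)
```

In this toy case the mislabeled point is corrected because its neighbors agree with high confidence, while its two neighbors see a split vote and simply keep their labels, which is the sense in which low-confidence evidence is not propagated.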
Building on these two foundations, the dissertation proposes an Uncertainty-Aware Hybrid SSL Framework (here SSL denotes semi-supervised learning, in contrast to its self-supervised sense above) for nationwide geotechnical risk prediction using the Cut Slope Management System (CSMS) dataset. This hybrid framework combines UA-Tab’s uncertainty-aware embeddings and GF-KDA’s refined labels within a teacher–student knowledge-distillation paradigm. The teacher model, trained on high-confidence samples, generates both discrete risk grades (A–E) and continuous severity scores that capture the ordinal nature of slope risk. Knowledge is then distilled into a lightweight student model, enabling accurate and interpretable risk prediction even under scarce and noisy supervision. The framework demonstrates substantial improvements in classification stability, noise tolerance, and interpretability, providing a practical foundation for AI-based slope safety management.
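The teacher outputs and the distillation step can be sketched as follows. This is an illustrative NumPy example under stated assumptions, not the framework's actual formulation: the severity score is defined here as the expected grade index rescaled to [0, 1], the distillation loss is the common temperature-scaled KL divergence between teacher and student distributions, and the logits are made-up values rather than CSMS results.

```python
import numpy as np

GRADES = ["A", "B", "C", "D", "E"]  # ordinal risk grades from the text


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def severity_score(probs):
    """Expected grade index rescaled to [0, 1] (hypothetical definition)."""
    idx = np.arange(probs.shape[1])
    return probs @ idx / (probs.shape[1] - 1)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student), averaged over the batch."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=1)
    return temperature**2 * np.mean(kl)


# Made-up teacher logits for two slopes: one low risk, one high risk.
teacher_logits = np.array([[4.0, 1.0, 0.0, -1.0, -2.0],
                           [-2.0, -1.0, 0.0, 1.0, 4.0]])
teacher_probs = np.exp(log_softmax(teacher_logits))
grades = [GRADES[i] for i in teacher_probs.argmax(axis=1)]
severity = severity_score(teacher_probs)

# An untrained student (uniform logits) versus a student matching the teacher.
loss_untrained = distillation_loss(np.zeros_like(teacher_logits), teacher_logits)
loss_matched = distillation_loss(teacher_logits, teacher_logits)
```

Deriving the continuous score from the class distribution keeps the two teacher outputs consistent with each other: probability mass shifting toward grade E raises the severity even when the arg-max grade is unchanged.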
Overall, this dissertation contributes to the theoretical and practical advancement of machine learning on tabular data by introducing uncertainty as a first-class modeling principle. Through UA-Tab, GF-KDA, and their hybrid integration, it establishes a unified paradigm for uncertainty-aware representation learning and label refinement, with broad applicability to engineering, environmental, and other real-world domains where data reliability is inherently uncertain.