Uncertainty-Aware Representation and Label Refinement on Tabular Data: Methods and Engineering Applications = 정형데이터의 불확실성 인식 표현학습과 라벨 정제: 방법론 및 공학적 응용

Multilingual Abstract

Tabular data constitute one of the most prevalent and practically important data formats across diverse fields, including healthcare, finance, manufacturing, and civil infrastructure management. Despite their ubiquity, learning reliable and generalizable representations from tabular data remains a fundamental challenge because such data lack inherent structural priors and frequently contain noise and uncertainty in both features and labels. Unlike images or text, tabular features are heterogeneous, often mixing numerical, categorical, and ordinal variables, and they lack the spatial or sequential correlations that can guide self-supervised learning (SSL). Moreover, labels in many real-world applications, such as geotechnical risk assessment, are derived from subjective human judgment rather than objective ground truth, which makes them unreliable and inconsistent. This dissertation addresses these challenges by developing uncertainty-aware learning frameworks that integrate robust representation learning and noise-resilient label refinement for heterogeneous tabular datasets.
The first major contribution is UA-Tab, a novel Uncertainty-Aware Self-Supervised Learning framework for tabular data. UA-Tab introduces a feature-level attention–uncertainty module that simultaneously estimates the relevance and reliability of each feature. This mechanism adaptively emphasizes informative and trustworthy attributes while suppressing noisy or ambiguous ones. Furthermore, UA-Tab replaces conventional input-level augmentations, which often distort semantics in tabular domains, with latent-space perturbation, generating semantically consistent positive pairs directly from the posterior distribution of a variational encoder. By combining contrastive similarity, reconstruction, and latent regularization losses within a unified objective, UA-Tab achieves robust and interpretable representations that remain stable under various noise conditions. Experiments on benchmark datasets demonstrate that UA-Tab consistently outperforms state-of-the-art self-supervised learning methods such as VIME, SCARF, SubTab, STab, and TabDeco, particularly when feature corruption or uncertainty is present.
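As a rough illustration of the latent-space perturbation and unified objective described above, the following minimal sketch draws two views of each row from a Gaussian posterior and combines a contrastive term, a reconstruction term, and a KL term. The abstract does not give the exact loss forms, so the InfoNCE-style contrastive loss, cosine similarity, standard-normal prior, temperature, loss weights `w_rec` and `w_kl`, and all numeric values here are assumptions for illustration only, not the dissertation's implementation.

```python
import math
import random

def sample_view(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def info_nce(views_a, views_b, temperature=0.5):
    """Mean InfoNCE loss; (views_a[i], views_b[i]) is the positive pair for row i."""
    n = len(views_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum_exp - logits[i]
    return loss / n

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), an assumed latent regularizer."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def unified_loss(contrastive, reconstruction, kl, w_rec=1.0, w_kl=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return contrastive + w_rec * reconstruction + w_kl * kl

# Hypothetical encoder outputs (mean, log-variance) for three table rows.
rng = random.Random(0)
posterior = [([1.0, 0.0], [-6.0, -6.0]),
             ([0.0, 1.0], [-6.0, -6.0]),
             ([-1.0, -1.0], [-6.0, -6.0])]

# Two perturbed latent views per row form the positive pairs.
views_a = [sample_view(mu, lv, rng) for mu, lv in posterior]
views_b = [sample_view(mu, lv, rng) for mu, lv in posterior]

contrastive = info_nce(views_a, views_b)
kl = sum(kl_to_standard_normal(mu, lv) for mu, lv in posterior) / len(posterior)
reconstruction = mse([0.2, 0.8], [0.25, 0.7])  # hypothetical input vs. decoder output
total = unified_loss(contrastive, reconstruction, kl)
```

Because both views come from the same posterior, no input column is masked or shuffled, which is the property the abstract contrasts with input-level augmentation.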
The second contribution, GF-KDA (Graph-Free Kernel Discriminant Analysis), addresses the problem of label refinement under noisy supervision. Conventional semi-supervised methods like label propagation or label spreading rely on explicit graph structures that are unstable in heterogeneous, high-dimensional feature spaces. GF-KDA eliminates this dependency by performing label refinement directly in kernel space using an uncertainty-aware RBF kernel that integrates mutual-information-based feature relevance and conditional-entropy-based feature uncertainty. Through iterative posterior estimation and confidence-based propagation, GF-KDA selectively spreads only high-confidence labels while suppressing noise amplification. The method consistently exhibits enhanced robustness under noisy supervision and provides more interpretable label refinement than existing semi-supervised methods such as label propagation, label spreading, self-training, and PET, across diverse noise conditions and labeling scenarios.
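The confidence-based refinement loop described above can be sketched as follows. This is not the dissertation's kernel discriminant analysis: a simple kernel-weighted class vote is swapped in as the posterior estimate, and the per-feature weight `relevance * (1 - uncertainty)`, the threshold `tau`, the bandwidth `gamma`, and the toy data are all hypothetical choices. In the dissertation the relevance and uncertainty values come from mutual information and conditional entropy; here they are simply given as inputs.

```python
import math

def weighted_rbf(x, y, weights, gamma=1.0):
    """RBF kernel with a per-feature weight on each squared difference."""
    d2 = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    return math.exp(-gamma * d2)

def class_posterior(x, labeled, weights, classes, gamma=1.0):
    """Kernel-weighted vote over labeled samples, normalized to sum to 1."""
    scores = {c: 0.0 for c in classes}
    for xi, yi in labeled:
        scores[yi] += weighted_rbf(x, xi, weights, gamma)
    total = sum(scores.values())
    if total == 0.0:
        return {c: 1.0 / len(classes) for c in classes}
    return {c: s / total for c, s in scores.items()}

def refine_labels(labeled, unlabeled, weights, classes,
                  tau=0.9, max_iter=10, gamma=1.0):
    """Iteratively accept only predictions whose posterior is at least tau."""
    labeled = list(labeled)
    pending = list(unlabeled)
    for _ in range(max_iter):
        accepted, still_pending = [], []
        for x in pending:
            post = class_posterior(x, labeled, weights, classes, gamma)
            best = max(post, key=post.get)
            if post[best] >= tau:
                accepted.append((x, best))
            else:
                still_pending.append(x)  # abstain: confidence too low
        if not accepted:
            break
        labeled.extend(accepted)
        pending = still_pending
    return labeled, pending

# Hypothetical per-feature relevance and uncertainty, both in [0, 1].
relevance = [0.9, 0.1]
uncertainty = [0.1, 0.8]
weights = [r * (1.0 - u) for r, u in zip(relevance, uncertainty)]

# Feature 0 separates the classes; feature 1 is noisy and down-weighted.
labeled = [((0.0, 0.3), "A"), ((1.0, 0.7), "B")]
unlabeled = [(0.1, 0.9), (0.9, 0.1), (0.5, 0.5)]

refined, abstained = refine_labels(labeled, unlabeled, weights,
                                   classes=["A", "B"], tau=0.9, gamma=5.0)
refined_map = dict(refined)
```

In this toy run the two points near a labeled sample on the reliable feature are accepted, while the ambiguous midpoint stays unlabeled instead of receiving a low-confidence label that could spread noise.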
Building on these two foundations, the dissertation proposes an Uncertainty-Aware Hybrid SSL Framework (here SSL denotes semi-supervised learning, not self-supervised learning) for nationwide geotechnical risk prediction using the Cut Slope Management System (CSMS) dataset. This hybrid framework combines UA-Tab’s uncertainty-aware embeddings and GF-KDA’s refined labels within a teacher–student knowledge-distillation paradigm. The teacher model, trained on high-confidence samples, generates both discrete risk grades (A–E) and continuous severity scores that capture the ordinal nature of slope risk. Knowledge is then distilled into a lightweight student model, enabling accurate and interpretable risk prediction even under scarce and noisy supervision. The framework demonstrates substantial improvements in classification stability, noise tolerance, and interpretability, providing a practical foundation for AI-based slope safety management.
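One way to picture how a teacher's grade probabilities could yield both a discrete grade and a continuous severity score, and how a student could be trained to match them, is sketched below. The abstract does not specify these formulas: the expected-grade-index score, the assumption that A is the least and E the most severe grade, the temperature-softened KL distillation loss, and the example logits are all illustrative assumptions.

```python
import math

GRADES = ["A", "B", "C", "D", "E"]  # assumed order: A least severe, E most severe

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature softening."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_grade(probs):
    """Discrete grade with the highest probability."""
    return GRADES[max(range(len(probs)), key=lambda i: probs[i])]

def severity_score(probs):
    """Expected grade index scaled to [0, 1]: 0 is pure A, 1 is pure E."""
    return sum(i * p for i, p in enumerate(probs)) / (len(probs) - 1)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

# Hypothetical logits for one slope sample.
teacher_logits = [0.1, 0.5, 2.0, 1.2, 0.2]
student_logits = [0.0, 0.4, 1.5, 1.4, 0.3]

teacher_probs = softmax(teacher_logits)
grade = predicted_grade(teacher_probs)
score = severity_score(teacher_probs)
loss = distillation_loss(teacher_logits, student_logits)
```

A score derived this way moves smoothly between adjacent grades, which is one simple way to respect the ordinal structure that a plain five-way classification ignores.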
Overall, this dissertation contributes to the theoretical and practical advancement of machine learning on tabular data by introducing uncertainty as a first-class modeling principle. Through UA-Tab, GF-KDA, and their hybrid integration, it establishes a unified paradigm for uncertainty-aware representation learning and label refinement, with broad applicability to engineering, environmental, and other real-world domains where data reliability is inherently uncertain.

Korean Abstract

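Tabular data are among the most widely used data formats in diverse fields such as healthcare, finance, manufacturing, and infrastructure management. Unlike images or text, however, such data have no inherent structural properties, take a heterogeneous form that mixes numerical, categorical, and ordinal variables, and frequently contain missing values, measurement errors, and uncertainty arising from subjective judgment. For these reasons, learning reliable and generalizable representations from tabular data remains a difficult problem. In particular, when labels are based on experts' empirical judgment, as in geotechnical engineering, uncertain labels become a major cause of degraded learning performance. To address these problems, this dissertation proposes learning frameworks that account for uncertainty at both the feature level and the label level.
First, this study proposes the UA-Tab (Uncertainty-Aware Self-Supervised Learning) framework. UA-Tab introduces an integrated attention–uncertainty module that simultaneously estimates the importance and reliability of each feature, learning robust representations by emphasizing highly reliable features and suppressing noisy ones. In addition, instead of the input-level augmentation commonly used in existing self-supervised learning (SSL), it applies latent-space perturbation to generate pairs (views) suitable for contrastive learning without distorting the semantics of tabular data. UA-Tab is trained with a unified loss function that combines a similarity loss, a reconstruction loss, and the Kullback–Leibler divergence, and it showed better performance and interpretability than existing methods (VIME, SCARF, SubTab, STab, TabDeco) under various noise conditions.
Second, the GF-KDA (Graph-Free Kernel Discriminant Analysis) method is proposed to solve the label refinement problem in the presence of label noise. Existing graph-based label propagation (Label Propagation/Spreading) suffers degraded performance on high-dimensional, heterogeneous tabular data because the similarity graph is unstable. Without explicitly constructing a graph, GF-KDA defines an uncertainty-weighted RBF kernel that integrates per-feature mutual information and conditional-entropy-based uncertainty, and performs kernel discriminant analysis (KDA) that reflects feature importance and reliability. In this way it selectively propagates only high-confidence samples and suppresses the spread of label noise, achieving higher refinement accuracy and robustness than existing semi-supervised methods (Label Propagation, Label Spreading, Self-Training, PET) across various noise ratios.
Finally, an uncertainty-based hybrid semi-supervised learning framework combining the two proposed algorithms was built and applied to nationwide slope risk grade classification using data from Korea's Cut Slope Management System (CSMS). GF-KDA refines high-confidence labels on the basis of the latent representations obtained by UA-Tab, and these labels are used to train a teacher–student knowledge distillation structure, which made reliable prediction possible even when labels are incomplete. In addition, the five risk grades (A–E) were converted into a continuous risk score to improve interpretability and practical usability. Experimental results showed that the proposed framework delivers stable and interpretable performance even on real geotechnical data where uncertainty and label noise coexist, demonstrating the practical applicability of AI-based risk management in geotechnical engineering.
This study presents a new paradigm for learning on tabular data that explicitly models uncertainty. The proposed UA-Tab and GF-KDA, together with their hybrid structure, lay a theoretical and practical foundation for representation learning and label refinement in uncertain environments, and are expected to extend to diverse real-world data with inherent uncertainty in fields such as engineering, the environment, and social infrastructure.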

Table of Contents

LIST OF FIGURES v
LIST OF TABLES viii
ABSTRACT IN ENGLISH ix
1. Introduction 1
1.1 Tabular data: overview of representation and label refinement 1
1.2 Research motivation and challenges 2
1.3 Research objectives 4
1.3.1 Uncertainty-aware representation learning (UA-Tab) 4
1.3.2 Noise-resilient label refinement (GF-KDA) 5
1.3.3 Hybrid framework for geotechnical risk data 5
1.4 Thesis outline 6
1.5 Contributions 7
2. Background 9
2.1 Challenges of tabular data in machine learning 9
2.2 Self-supervised learning (SSL) for tabular data 10
2.3 Uncertainty modeling in deep learning 12
2.4 Label propagation and label spreading 14
2.5 Kernel method for semi-supervised learning 15
2.6 Noise-robust semi-supervised approaches 16
3. UA-Tab: Uncertainty-Aware Self-Supervised Representation Learning 20
3.1 Introduction 20
3.2 Overview of architecture 22
3.3 Feature attention and uncertainty modeling 23
3.4 Latent space perturbation and contrastive learning 25
3.5 Experimental setup 28
3.5.1 Dataset 28
3.5.2 Noise injection protocol 28
3.5.3 Noise injection scenarios 30
3.5.4 Evaluation protocol 31
3.6 Experimental results 32
3.6.1 Overall downstream performance under clean test (S1) 32
3.6.2 Overall downstream performance under noisy test (S2) 35
3.6.3 Feature-level uncertainty and noise alignment 38
3.6.4 Uncertainty-attention interaction across all features 41
3.6.5 Latent space representation and variance analysis 43
3.6.6 Noise type-specific behavior and error mode 45
3.6.7 Discriminative performance under high-noise condition 46
3.6.8 Ablation study 48
3.7 Discussion 50
3.7.1 UA-Tab's noise robustness: when and why it works best 50
3.7.2 Balancing feature-level and latent-level expressiveness 51
3.7.3 Comparative advantages and limitations 51
3.8 Conclusion 52
4. GF-KDA: Graph-Free Kernel Discriminant Analysis for Label Spreading 54
4.1 Introduction 54
4.2 Problem definition and overview 55
4.3 Feature-level information extraction 58
4.3.1 Mutual Information-based Feature Weights 59
4.3.2 Conditional Entropy-based Feature Uncertainty Estimation 59
4.3.3 Combined Feature Representation for GF-KDA 60
4.4 Uncertainty-aware kernel discriminant analysis 60
4.5 Iterative label refinement 61
4.6 Experimental setup 65
4.6.1 Dataset description 65
4.6.2 Noise injection strategy 65
4.6.3 Hyperparameter configuration for GF-KDA 66
4.6.4 Baseline methods for comparison 67
4.6.5 Evaluation metrics and implementation details 69
4.7 Experimental results and analysis 71
4.7.1 Robustness to increasing label noise 71
4.7.2 Performance under different labeled ratios 73
4.7.3 Comparative analysis of accuracy degradation under label noise 75
4.7.4 Convergence and robustness of GF-KDA across iterations 77
4.7.5 Noise-resilient behavior and interpretation in GF-KDA 78
4.7.6 Threshold-sensitive abstention analysis 80
4.8 Discussion 82
4.8.1 Additional Benchmarks: UCI Heart Disease and Credit Risk 82
4.8.2 Runtime and memory efficiency analysis 84
4.8.3 Advantages over graph-based methods 86
4.8.4 Limitations and future work 87
4.9 Conclusion 88
5. Uncertainty-Aware Hybrid SSL Framework for Geotechnical Risk Data 90
5.1 Introduction 90
5.2 Overview of architecture 91
5.3 Methodology 93
5.4 Datasets: CSMS 97
5.5 Data preprocessing 98
5.6 Experimental design 99
5.7 Results and discussions 101
5.7.1 Grade-wise representative risk 101
5.7.2 Feature-level effects: visualization and statistics 104
5.7.3 Sample-wise robust risk scores (distribution by grade) 113
5.7.4 Statistical validation of grade separability in predicted risk labels 115
5.7.5 Spatial pattern of predicted risk 119
6. Conclusion 123
6.1 Discussion of contributions 123
6.2 Remaining challenges and future directions 126
References 129
ABSTRACT IN KOREAN 133
