Text Augmentation for Named Entity Recognition in Materials Science = 재료 과학에서의 개체명 인식 텍스트 증강

Multilingual Abstract

When natural language processing is applied to materials science, named entity recognition (NER) is a key component for constructing structured databases from unstructured research literature. However, NER in materials science faces data scarcity for two reasons. First, NER data requires word-level labeling, which is labor-intensive and raises annotation costs. Second, it demands annotators with specialized knowledge of materials science, which raises the cost further. To address the data scarcity problem in materials science NER, we propose a data augmentation pipeline tailored to materials science. Starting from a previous framework based on masked language modeling, we propose three improvements. First, we employ a materials-specific language model to generate materials science terms. Second, we enhance generation with entity information from a materials-specific knowledge graph. Lastly, we enable entity-level generation to address a limitation of masked language modeling. Experiments on three datasets show that the proposed augmentation pipeline improves NER performance in materials science, demonstrating the effectiveness of domain-optimized augmentation strategies.
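The abstract does not spell out the mechanics of entity-level generation, so the following is only a minimal sketch of one way the idea could look, under stated assumptions. It takes a BIO-tagged sentence, wraps each entity in label markers, and collapses the whole entity span into a single mask slot, so that a generator would fill one slot per entity instead of one slot per token. The function names, the `<MAT>` marker format, the `[MASK]` string, and the `MAT` label are all hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical mask string; real models define their own


def entity_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for each BIO-tagged entity."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open span unless this tag continues it.
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


def mask_entities(tokens: List[str], tags: List[str],
                  mask_token: str = MASK) -> List[str]:
    """Add label markers and replace each whole entity with one mask slot."""
    out: List[str] = []
    cursor = 0
    for start, end, label in entity_spans(tags):
        out.extend(tokens[cursor:start])                        # context kept as-is
        out.extend([f"<{label}>", mask_token, f"</{label}>"])   # one slot per entity
        cursor = end
    out.extend(tokens[cursor:])
    return out


tokens = ["Thin", "films", "of", "barium", "titanate", "were", "annealed"]
tags = ["O", "O", "O", "B-MAT", "I-MAT", "O", "O"]
print(" ".join(mask_entities(tokens, tags)))
```

In this sketch the two-token entity "barium titanate" becomes a single slot between `<MAT>` and `</MAT>`. A token-level scheme would instead produce two independent mask positions, which fixes the length of any replacement entity in advance.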


Korean Abstract

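When natural language processing is applied to materials science, named entity recognition (NER) serves as a key component for turning unstructured research literature into structured databases. However, NER in materials science faces a data scarcity problem for two reasons. First, NER data requires word-level labeling, so annotation is labor-intensive and costly to produce. Second, this work must be carried out by people with advanced education in materials science, which raises the cost further. To address data scarcity in materials science NER, this work proposes a data augmentation pipeline optimized for the materials science domain. Starting from an existing augmentation method based on a masked language model, we introduce three improvements. First, we use a language model specialized for materials science to generate entity vocabulary. Second, we extract entity information from a materials science knowledge graph to assist vocabulary generation. Finally, we mitigate a limitation of masked language models by generating at the entity level rather than the token level. We show experimentally that the proposed augmentation pipeline can contribute to improved NER performance in materials science, demonstrating the effectiveness of domain-optimized augmentation strategies.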

Table of Contents

Abstract
Abstract (Korean)
Acknowledgment
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Related Works
2.1 Data Augmentation for NER
2.2 Natural Language Processing for Materials Science
3 Methods
3.1 Preliminary
3.2 Masking with Entity Information from Materials Science KG
3.3 Entity Level Linearization
3.4 Materials Science Language Modeling
4 Experiments Settings
4.1 Backbone Models
4.2 Hyperparameters
4.3 Benchmarks
4.4 Baselines
5 Results
5.1 Augmentation for Materials Science NER
5.2 Augmentation in Low-resource Setting
5.3 Model Selection for Augmentation
5.4 Probability Distribution in Entity Rescaling
6 Analysis
6.1 Ablation Study
6.2 Case Study
7 Limitation
8 Conclusion
References
A Implementation Details
A.1 Prompt Design for LLMs
