In applying natural language processing to materials science, named entity recognition (NER) is a key component for constructing structured databases from unstructured research literature. However, NER in materials science suffers from data scarcity for two reasons. First, NER data requires word-level labeling, which is labor-intensive and raises annotation costs. Second, annotation demands specialized knowledge of materials science, which raises the cost further. To address this data scarcity, we propose a data augmentation pipeline tailored to materials science. Starting from a previous framework based on masked language modeling, we propose three improvements. First, we employ a materials-specific language model to generate materials science terms. Second, we enhance generation with entity information from a materials-specific knowledge graph. Lastly, we enable entity-level generation to address a limitation of masked language modeling. Experiments on three datasets show that the proposed augmentation pipeline improves NER performance in materials science, supporting the effectiveness of domain-optimized augmentation strategies.
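To make the idea of entity-level augmentation concrete, the following is a minimal sketch and not the pipeline itself. It assumes BIO-tagged training sentences and replaces each whole entity span with another entity of the same type, re-deriving the labels so that the replacement may have a different number of tokens. In place of a materials-specific language model and knowledge graph, it samples from a small hand-written candidate dictionary. The entity types `MAT` and `PRO`, the function names, and the example sentence are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

def find_entity_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, type) for each BIO-tagged entity span."""
    spans = []
    start, ent_type = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        continues = label.startswith("I-") and ent_type == label[2:]
        if start is not None and not continues:
            spans.append((start, i, ent_type))
            start, ent_type = None, None
        if label.startswith("B-"):
            start, ent_type = i, label[2:]
    return spans

def augment(
    tokens: List[str],
    labels: List[str],
    candidates: Dict[str, List[List[str]]],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Replace every entity span with a same-type candidate and rebuild BIO labels.

    `candidates` is a stand-in (assumption) for entities proposed by a
    language model or retrieved from a knowledge graph.
    """
    new_tokens: List[str] = []
    new_labels: List[str] = []
    cursor = 0
    for start, end, ent_type in find_entity_spans(labels):
        # Copy the non-entity context unchanged.
        new_tokens.extend(tokens[cursor:start])
        new_labels.extend(labels[cursor:start])
        # Pick a replacement entity; keep the original if no candidate exists.
        options = candidates.get(ent_type)
        replacement = rng.choice(options) if options else tokens[start:end]
        new_tokens.extend(replacement)
        new_labels.extend(
            [f"B-{ent_type}"] + [f"I-{ent_type}"] * (len(replacement) - 1)
        )
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_labels.extend(labels[cursor:])
    return new_tokens, new_labels

# Hypothetical example sentence and candidate entities.
tokens = ["The", "band", "gap", "of", "titanium", "dioxide", "was", "measured", "."]
labels = ["O", "B-PRO", "I-PRO", "O", "B-MAT", "I-MAT", "O", "O", "O"]
candidates = {
    "MAT": [["ZnO"], ["lithium", "iron", "phosphate"]],
    "PRO": [["thermal", "conductivity"]],
}

aug_tokens, aug_labels = augment(tokens, labels, candidates, random.Random(0))
print(list(zip(aug_tokens, aug_labels)))
```

Operating on whole spans keeps the token and label sequences aligned even when a two-token material name is swapped for a one-token or three-token one. Filling individual masked tokens one at a time would not naturally allow that.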