This study proposes a methodology for constructing a new QA dataset that exploits the structural characteristics of patent documents. Existing patent search systems struggle to capture the overall context of a document because they rely on fragmentary searches over the title, abstract, and claims. To overcome this limitation, this study analyzed 3,000 patent documents from 2000 to 2021 and constructed a dataset of 1,071 question-answer pairs covering sections such as the background technology, the technical field, and the implementation details of the invention. Questions and answers were generated from the structure of each patent document using a hierarchical reasoning framework built on the EXAONE 3.5 7.8B model together with Retrieval-Augmented Generation (RAG). A KoELECTRA model trained on the constructed dataset achieved an exact match (EM) score of 0.943 and an F1 score of 0.986, a significant performance improvement over existing patent QA benchmarks. The contribution of this study is a dataset construction methodology based on the hierarchical structure of patent documents, which suggests a new direction for patent information processing.
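For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how EM and F1 are commonly computed for extractive QA, in the style of the SQuAD evaluation. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names (`normalize`, `exact_match`, `f1_score`, `evaluate`) are hypothetical, and the whitespace tokenization is a simplification (Korean QA benchmarks often score F1 at the character level instead).

```python
from collections import Counter


def normalize(text: str) -> str:
    # Assumption: lowercase and collapse whitespace only; real evaluation
    # scripts usually also strip punctuation and language-specific particles.
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 when the normalized strings are identical, otherwise 0.0.
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: list[tuple[str, str]]) -> dict[str, float]:
    # Average both metrics over (prediction, reference) pairs.
    n = len(pairs)
    return {
        "em": sum(exact_match(p, r) for p, r in pairs) / n,
        "f1": sum(f1_score(p, r) for p, r in pairs) / n,
    }
```

Under this definition, an exact match always has an F1 of 1, so the averaged F1 can never fall below the averaged EM. The gap between the two scores reflects predictions that overlap the reference answer without matching it exactly.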