To query the vast number of web pages available on the Internet, the information encoded in those pages must first be extracted and converted into structured data (e.g., relational data for SQL) or semistructured data (e.g., XML data for XQuery). In this paper, we propose a new web information extraction system, PIES, that converts web information into XML documents. PIES is based on a user-specified schema and HTML tag pattern descriptions: the web information is extracted according to the pattern descriptions and validated against the schema. We designed a new language for describing extraction rules and a new regular expression language for describing HTML tag patterns. We implemented PIES and applied it to the US patent web site to demonstrate its feasibility. It successfully extracted data for thousands of US patents and converted them into XML documents.
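
The extract, validate, and convert flow described above can be illustrated with a minimal Python sketch. It does not use the PIES rule language or its tag-pattern language: ordinary Python regular expressions stand in for the HTML tag pattern descriptions, and a small dictionary of value patterns stands in for the user-specified schema. The sample page, the field names (`number`, `title`), and the helpers `extract`, `validate`, `to_xml`, and `convert` are all hypothetical and chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page; not taken from the US patent web site.
SAMPLE_HTML = """
<html><body>
<table>
<tr><td><b>Patent No.</b></td><td>1234567</td></tr>
<tr><td><b>Title</b></td><td>Example widget</td></tr>
</table>
</body></html>
"""

# Assumed stand-in for tag pattern descriptions:
# field name -> ordinary Python regex over the raw HTML.
RULES = {
    "number": r"<b>Patent No\.</b></td>\s*<td>(.*?)</td>",
    "title": r"<b>Title</b></td>\s*<td>(.*?)</td>",
}

# Assumed stand-in for the user-specified schema:
# field name -> regex that the extracted value must fully match.
SCHEMA = {
    "number": r"\d+",
    "title": r".+",
}


def extract(html, rules):
    """Apply each pattern to the page and collect the captured values."""
    fields = {}
    for name, pattern in rules.items():
        match = re.search(pattern, html, re.DOTALL)
        if match:
            fields[name] = match.group(1).strip()
    return fields


def validate(fields, schema):
    """Return a list of schema violations (empty if the record is valid)."""
    errors = []
    for name, value_pattern in schema.items():
        if name not in fields:
            errors.append(f"missing field: {name}")
        elif not re.fullmatch(value_pattern, fields[name]):
            errors.append(f"invalid value for {name}: {fields[name]!r}")
    return errors


def to_xml(fields, root_tag="patent"):
    """Serialize the extracted fields as one XML element with child elements."""
    root = ET.Element(root_tag)
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def convert(html):
    """Extract, validate, and convert a page; raise if validation fails."""
    fields = extract(html, RULES)
    errors = validate(fields, SCHEMA)
    if errors:
        raise ValueError("; ".join(errors))
    return to_xml(fields)


if __name__ == "__main__":
    print(convert(SAMPLE_HTML))
```

Run on the sample page, the sketch produces a single `patent` element with `number` and `title` children. Keeping validation separate from extraction mirrors the division of labor described above: a page whose layout no longer matches the patterns is reported as a schema violation rather than silently converted into an incomplete XML document.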