An ontology-based information extractor for data-rich documents in the information technology domain

Jiménez Vargas, Sergio Gonzalo; González Osorio, Fabio Augusto

Files

9972-18047-1-PB.pdf (283.01 KB)

Authors

Jiménez Vargas, Sergio Gonzalo

González Osorio, Fabio Augusto

Content type

Journal article

Document language

Spanish

Publication date

2008

Abstract

This paper presents an information extraction method, suitable for data-rich documents, based on the knowledge represented in a domain ontology. The extractor combines a fuzzy string matcher and a word sense disambiguation (WSD) algorithm. The fuzzy string matcher finds mentions of terms by combining character-level and token-level similarity measures, which lets it handle non-standardized acronyms and inconsistent abbreviation styles. We propose a new prefix-sensitive character-level edit distance, called root distance, and a token-level similarity algorithm for fuzzy acronym detection. Additionally, a WSD strategy based on an ontology-based semantic relatedness measure resolves the inherent ambiguity of some entities. The WSD module finds a combination of senses over the whole document that optimizes the document's semantic coherence. Our approach seems suitable for extracting information from data-rich documents that each describe only one main object (i.e., a product). The results showed a precision of 78.9% with 99.5% recall using documents and an ontology from the laptop computer domain.
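The abstract does not define root distance or the acronym algorithm, so the following is only a rough illustration of the two ideas it names: a character-level edit distance that weighs edits near the start of a word more heavily, and a token-level check for loosely formed acronyms. The function names, the `decay` parameter, and the weighting scheme are all assumptions made for this sketch and are not taken from the paper.

```python
def prefix_weighted_edit_distance(a: str, b: str, decay: float = 0.8) -> float:
    """Levenshtein-style distance in which an edit at character index k
    costs decay**k, so edits near the prefix cost more than edits near
    the end. Hypothetical stand-in, NOT the paper's root distance."""
    def w(k: int) -> float:
        return decay ** k

    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):          # delete every character of a
        d[i][0] = d[i - 1][0] + w(i - 1)
    for j in range(1, cols):          # insert every character of b
        d[0][j] = d[0][j - 1] + w(j - 1)
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            substitution = 0.0 if same else w(min(i, j) - 1)
            d[i][j] = min(
                d[i - 1][j] + w(i - 1),        # deletion
                d[i][j - 1] + w(j - 1),        # insertion
                d[i - 1][j - 1] + substitution,
            )
    return d[-1][-1]


def is_fuzzy_acronym_of(acronym: str, phrase: str) -> bool:
    """True if the acronym's letters appear, in order, among the initials
    of the phrase's tokens. Tokens may be skipped, which tolerates
    inconsistent abbreviation styles. Assumed heuristic for illustration."""
    if len(acronym) < 2:
        return False
    initials = iter(tok[0].lower() for tok in phrase.split())
    # `ch in initials` consumes the iterator, giving an ordered subsequence test.
    return all(ch in initials for ch in acronym.lower())
```

Under this weighting, a typo in the first letter of a term separates two strings more than a typo in the last letter, which is one plausible reading of "sensitive to prefixes"; the acronym check accepts "RAM" for "random access memory" and also the shortened "HD" for "hard disk drive".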

Keywords

Knowledge Management ; Information Extraction ; Ontologies ; Fuzzy String Searching ; Word Sense Disambiguation ; Semantic Relatedness

URI

https://repositorio.unal.edu.co/handle/unal/24330

Collections

Avances en Sistemas e Informática
