Abstract
Information extraction is a basic task in document search. Traditional approaches rely on lexical and syntactic analysis to extract words and parts of speech from sentences in order to establish their semantics. In this paper, we present two machine learning-based methodologies for extracting particular information from full texts. The methodologies are based on two tasks: named entity recognition (NER) and text classification. The first step is building and cleansing the training data. We consider the tourism domain, using information about restaurants, hotels, shopping, and tourism. Our data set was generated by crawling websites. First, the tourism data is gathered and the vocabularies are built. Several minor steps follow, including sentence extraction and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, the recognition model for a given entity type can be built. In the experiments, given review texts in the tourism domain, we demonstrate how to build models that extract the desired entity (i.e., name, location, or facility) as well as the relation type, and that classify the reviews or use the classification to summarize them. Two tools, SpaCy and BERT, are used to compare performance on these tasks. On the tested data set, the accuracy of named entity recognition is up to 95% for SpaCy and up to 99% for BERT. For text classification, BERT and SpaCy yield accuracies of around 95%-98%. © 2021, ECTI Association. All rights reserved.
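The tagging step described above (matching vocabulary terms in extracted sentences to produce NER training data) can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so everything here is an assumption for illustration: the helper `tag_entities`, the vocabulary entries, the label names, and the sample sentence are hypothetical. The output uses character-offset annotations of the form `(text, {"entities": [(start, end, label)]})`, a layout commonly used for spaCy NER training examples.

```python
import re


def tag_entities(sentence, vocab):
    """Annotate vocabulary terms in a sentence with character offsets.

    Hypothetical helper: returns (sentence, {"entities": [(start, end, label)]}).
    """
    spans = []
    # Try longer terms first so a long term wins over a shorter one it contains.
    for term, label in sorted(vocab.items(), key=lambda kv: -len(kv[0])):
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            start, end = match.span()
            # Keep the match only if it does not overlap an accepted span.
            if all(end <= s or start >= e for s, e, _ in spans):
                spans.append((start, end, label))
    return sentence, {"entities": sorted(spans)}


# Hypothetical vocabulary: term -> entity label (assumed, not from the paper).
VOCAB = {
    "Sunrise Hotel": "NAME",
    "Chiang Mai": "LOCATION",
    "swimming pool": "FACILITY",
}

text = "Sunrise Hotel in Chiang Mai has a large swimming pool."
example = tag_entities(text, VOCAB)
```

Each annotated sentence produced this way could then be added to a training set for an entity recognizer. Exact string matching is the simplest possible choice here; it will miss spelling variants and ambiguous terms, which is one reason data cleansing matters before training.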