Information extraction on tourism domain using SpaCy and BERT

02/05/2022 by นพพร ม่วงระย้า

Metadata

Document Title

Information extraction on tourism domain using SpaCy and BERT

Author

Chantrapornchai C., Tunsakul A.

Name from Authors Collection

Affiliations

Faculty of Engineering, Kasetsart University, Bangkok, Thailand; National Science and Technology Development Agency (NSD), Pathum Thani, Thailand

Type

Article

Source Title

ECTI Transactions on Computer and Information Technology

ISSN

22869131

Year

2021

Volume

Issue

Page

108-122

Open Access

All Open Access, Gold

Publisher

ECTI Association

DOI

10.37936/ecti-cit.2021151.228621

Format

PDF

Abstract

Information extraction is a basic task required in document searching. Traditional approaches involve lexical and syntactic analysis to extract words and parts of speech from sentences in order to establish the semantics. In this paper, we present two machine learning-based methodologies used to extract particular information from full texts. The methodologies are based on two tasks: named entity recognition (NER) and text classification. The first step is building the training data and cleansing it. We consider the tourism domain, using information about restaurants, hotels, shopping, and tourism. Our data set was generated by crawling websites. First, the tourism data is gathered and the vocabularies are built. Several minor steps follow, including sentence extraction and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, the recognition model for a given entity type can be built. In the experiments, given review texts in the tourism domain, we demonstrate how to build a model to extract the desired entity (i.e., name, location, or facility) as well as the relation type, and how to classify the reviews or use the classification to summarize them. Two tools, SpaCy and BERT, are used to compare performance on these tasks. On the test data set, named entity recognition accuracy is up to 95% for SpaCy and up to 99% for BERT. For text classification, BERT and SpaCy yield accuracies of around 95%-98%. © 2021, ECTI Association. All rights reserved.
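
The abstract's data-preparation steps (build a vocabulary, extract sentences, tag entities to create training data) can be illustrated with a minimal sketch. This is not the authors' pipeline: the vocabulary entries, the function names `tag_sentence` and `build_training_data`, and the naive sentence splitter are all assumptions made for illustration. The output uses the `(text, {"entities": [(start, end, label)]})` tuple layout that spaCy has commonly accepted for NER training examples, and the sketch relies only on the Python standard library.

```python
import re

# Hypothetical vocabulary: surface form -> entity label (not taken from the paper).
VOCAB = {
    "Bangkok": "LOCATION",
    "Grand Palace Hotel": "NAME",
    "swimming pool": "FACILITY",
}


def tag_sentence(sentence, vocab):
    """Return (sentence, {"entities": [(start, end, label), ...]})."""
    spans = []
    # Try longer terms first so a multi-word name wins over a shorter overlap.
    for term in sorted(vocab, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            start, end = match.span()
            # Skip matches that overlap a span already accepted.
            if all(end <= s or start >= e for s, e, _ in spans):
                spans.append((start, end, vocab[term]))
    spans.sort()
    return sentence, {"entities": spans}


def build_training_data(text, vocab):
    """Split text into sentences and keep those with at least one tagged entity."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tagged = (tag_sentence(s, vocab) for s in sentences)
    return [item for item in tagged if item[1]["entities"]]


if __name__ == "__main__":
    review = "Grand Palace Hotel is in Bangkok. It has a swimming pool. We loved it!"
    for example in build_training_data(review, VOCAB):
        print(example)
```

Dictionary matching of this kind only bootstraps labels; a trained model is still needed to recognize entities that are absent from the vocabulary, which is the role the abstract assigns to SpaCy and BERT.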

License

CC BY-NC-ND

Rights

N/A

Link

https://www.scopus.com/inward/record.uri?eid=2-s2.0-85099661572&doi=10.37936%2fecti-cit.2021151.228621&partnerID=40&md5=c2cef66e9bf9f70edbf20c6680f92748

Publication Source

Scopus

Note

Full text

Download File

https://www.nstda.or.th/openarchive/download/107879/?tmstv=1674457403
