Information extraction on tourism domain using SpaCy and BERT

02/05/2022 by นพพร ม่วงระย้า

Title: Information extraction on tourism domain using SpaCy and BERT
Author: Chantrapornchai C., Tunsakul A.
Name from Authors Collection: Chantrapornchai C. (Scopus Author ID 6603855808, ORCID ID NULL) | Tunsakul A. (Scopus Author ID 57215311216, ORCID ID NULL)
Affiliations: Faculty of Engineering, Kasetsart University, Bangkok, Thailand; National Science and Technology Development Agency (NSD), Pathum Thani, Thailand
Type: Article
Source Title: ECTI Transactions on Computer and Information Technology
ISSN: 22869131
Year: 2021
Volume: 15
Issue: 1
Page: 108-122
Open Access: All Open Access, Gold
Publisher: ECTI Association
DOI: 10.37936/ecti-cit.2021151.228621
Format: PDF
Abstract: Information extraction is a basic task required in document searching. Traditional approaches involve lexical and syntax analysis to extract words and parts of speech from sentences in order to establish the semantics. In this paper, we present two machine learning-based methodologies used to extract particular information from full texts. The methodologies are based on two tasks: named entity recognition (NER) and text classification. The first step is building and cleansing the training data. We consider a tourism domain using information about restaurants, hotels, shopping, and tourism. Our data set was generated by crawling websites. First, the tourism data is gathered and the vocabularies are built. Several minor steps follow, including sentence extraction and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, the recognition model for a given entity type can be built.
From the experiments, given review texts in the tourism domain, we demonstrate how to build a model that extracts the desired entity (i.e., name, location, or facility) as well as the relation type, and how to classify the reviews or use the classification to summarize them. Two tools, SpaCy and BERT, are compared on these tasks. On the test data set, the accuracy of named entity recognition is up to 95% for SpaCy and up to 99% for BERT. For text classification, BERT and SpaCy yield accuracies of around 95%-98%. © 2021, ECTI Association. All rights reserved.
Keyword: BERT | Name entity recognition | SpaCy | Text classification
Industrial Classification: N/A
Knowledge Taxonomy Level 1: N/A
Knowledge Taxonomy Level 2: N/A
Knowledge Taxonomy Level 3: N/A
Funding Sponsor: Kasetsart University; Nvidia
License: CC BY-NC-ND
Rights: N/A
Link: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85099661572&doi=10.37936%2fecti-cit.2021151.228621&partnerID=40&md5=c2cef66e9bf9f70edbf20c6680f92748
Publication Source: Scopus
Note: Full text
Download File: https://www.nstda.or.th/openarchive/download/107879/?tmstv=1674457403
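
The abstract describes building vocabularies and then tagging entity mentions in sentences to create training data for the recognition model. A minimal sketch of that tagging step is given here, using only the Python standard library. The vocabulary entries, the label names `NAME`, `LOCATION`, and `FACILITY`, and the function `tag_sentence` are hypothetical, chosen to mirror the entity types named in the abstract. The output shape, a text paired with a list of `(start, end, label)` character offsets, resembles a common layout for NER training examples; whether the authors used this exact layout is an assumption.

```python
import re

# Hypothetical vocabulary: surface form -> entity label.
# Labels mirror the entity types named in the abstract; the entries are made up.
VOCAB = {
    "Grand Palace Hotel": "NAME",
    "Bangkok": "LOCATION",
    "swimming pool": "FACILITY",
}


def tag_sentence(sentence, vocab):
    """Return (sentence, {"entities": [(start, end, label), ...]})."""
    spans = []
    # Try longer terms first so a longer match wins over a shorter one inside it.
    for term in sorted(vocab, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            # Skip a match that overlaps a span already accepted.
            overlaps = any(
                match.start() < end and start < match.end()
                for start, end, _ in spans
            )
            if not overlaps:
                spans.append((match.start(), match.end(), vocab[term]))
    # Sort by start offset so the annotations follow reading order.
    return sentence, {"entities": sorted(spans)}


if __name__ == "__main__":
    review = "Grand Palace Hotel in Bangkok has a swimming pool."
    text, annotations = tag_sentence(review, VOCAB)
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Dictionary matching of this kind only bootstraps labels from a known vocabulary. A trained model, such as the SpaCy and BERT models compared in the paper, is what generalizes to entity mentions the vocabulary does not contain.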