A Study to Detect Multi-word Expression from Text Using Deep Learning Models
Abstract
Detecting Multi-word Expressions (MWEs) is a crucial task in Natural Language Processing (NLP), with applications in machine translation, sentiment analysis, and information retrieval. This study evaluates the performance of several deep learning models on MWE detection using two samples of different sizes drawn from the corpus of a major consumer electronics retailer. The samples contain 10,000 and 15,000 rows, with each row containing 15-20 English words. Preprocessing steps include removing special symbols and emojis, converting text to lowercase, and applying the spaCy NLP library for tokenization and part-of-speech (POS) tagging. Syntactic rules are then used to identify MWEs such as verb-noun combinations and phrasal verbs, and BIO tags (B-MWE, I-MWE, O) mark the MWE boundaries. We evaluated BERT, BERT-CRF, LSTM-CRF, and RoBERTa-CRF on the 10,000-row sample, and BERT, BERT-BiLSTM, BiLSTM-GloVe, and BiLSTM-GloVe-BiGRU on the 15,000-row sample. On the smaller sample, the transformer-based RoBERTa-CRF model achieved the best performance by combining contextual embeddings with sequential dependency modeling. On the larger sample, BERT-BiLSTM was the most effective model, showing the advantage of combining dynamic embeddings with sequential learning. In contrast, models using static embeddings, such as GloVe, showed only moderate performance, highlighting their limitations in capturing contextual nuances. Taken together, the results show that RoBERTa-CRF performed best on the smaller sample, whereas the hybrid BERT-BiLSTM, which pairs a transformer with a sequential architecture, performed best on the larger one. These findings highlight the importance of selecting a model according to dataset scale when optimizing MWE detection.
This study underscores the importance of integrating contextual and sequential deep learning techniques to improve MWE detection and provides a basis for developing more robust and scalable systems for diverse linguistic tasks.
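The preprocessing and labelling steps described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the rule set `MWE_PATTERNS`, the helper names `clean_text` and `bio_tag`, and the example sentence are all assumptions made for illustration, and the POS tags are written by hand here where the study would obtain them from spaCy.

```python
import re

# Illustrative rule set (an assumption, not the study's actual rules):
# a verb followed directly by a noun, or by a particle/adposition.
MWE_PATTERNS = {("VERB", "NOUN"), ("VERB", "ADP"), ("VERB", "PART")}


def clean_text(text):
    """Lowercase the text and drop special symbols and emojis."""
    text = text.lower()
    # Keep only ASCII letters, digits, apostrophes and whitespace.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def bio_tag(tagged_tokens):
    """Assign B-MWE / I-MWE / O tags to a list of (token, pos) pairs."""
    tags = ["O"] * len(tagged_tokens)
    i = 0
    while i < len(tagged_tokens) - 1:
        pair = (tagged_tokens[i][1], tagged_tokens[i + 1][1])
        if pair in MWE_PATTERNS:
            tags[i], tags[i + 1] = "B-MWE", "I-MWE"
            i += 2  # skip past the matched two-word expression
        else:
            i += 1
    return tags


cleaned = clean_text("Please TURN off the 📱 phone!!")
# POS tags supplied by hand; in practice they would come from a tagger.
tagged = [("please", "INTJ"), ("turn", "VERB"), ("off", "ADP"),
          ("the", "DET"), ("phone", "NOUN")]
tags = bio_tag(tagged)
print(cleaned)
print(list(zip([tok for tok, _ in tagged], tags)))
```

This sketch only matches two-token patterns. Longer expressions, or MWEs with intervening words, would need a richer rule set or a dependency-based matcher.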
Journal of Applied Data Sciences
ISSN | : | 2723-6471 (Online) |
Organized by | : | Computer Science and Systems Information Technology, King Abdulaziz University, Kingdom of Saudi Arabia. |
Website | : | http://bright-journal.org/JADS |
Email | : | taqwa@amikompurwokerto.ac.id (principal contact), support@bright-journal.org (technical issues) |
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0