Bidirectional encoders to state-of-the-art: a review of BERT and its transformative impact on natural language processing

Rajesh Gupta

DOI: https://doi.org/10.47813/2782-5280-2024-3-1-0311-0320

Keywords: BERT, machine learning, natural language processing, transformers, neural network.


Abstract

First developed in 2018 by Google researchers, Bidirectional Encoder Representations from Transformers (BERT) represents a breakthrough in natural language processing (NLP). BERT achieved state-of-the-art results across a range of NLP tasks using a single transformer-based neural network architecture. This work reviews BERT's technical approach, its performance at publication, and its significant research impact since release. We provide background on BERT's foundations, such as transformer encoders and transfer learning from universal language models. Its core technical innovations are deeply bidirectional conditioning and a masked language modeling objective in the unsupervised pretraining phase. For evaluation, BERT was fine-tuned and tested on eleven NLP tasks ranging from question answering to sentiment analysis via the GLUE benchmark, achieving new state-of-the-art results. Additionally, this work analyzes BERT's immense research influence as an accessible technique surpassing specialized models: BERT catalyzed the adoption of pretraining and transfer learning in NLP. Quantitatively, over 10,000 papers have extended BERT, and it is integrated widely across industry applications. Future directions scale BERT toward billions of parameters and multilingual representations. In summary, this work reviews the method, performance, impact, and future outlook for BERT as a foundational NLP technique.
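The masked language modeling objective mentioned above can be illustrated with a minimal sketch. In BERT's pretraining, roughly 15% of input tokens are selected for prediction; of those, 80% are replaced with a [MASK] token, 10% with a random vocabulary token, and 10% are left unchanged. The function below (names, toy vocabulary, and the 50% mask probability in the demo are illustrative, not from the paper) shows only the data-preparation step, not the transformer itself:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "sat", "mat", "the", "on"]  # illustrative only

def mask_for_mlm(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: of the selected positions, 80% become [MASK],
    10% a random token, 10% stay unchanged. Returns (masked_tokens, labels),
    where labels holds the original token at selected positions and None
    elsewhere (unselected positions do not contribute to the MLM loss)."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                        # model must recover this
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)                   # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(TOY_VOCAB))  # 10%: random token
            else:
                masked.append(tok)                    # 10%: keep original
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_for_mlm(tokens, mask_prob=0.5, rng=random.Random(0))
print(masked)
print(labels)
```

Because the model must predict the original token from both left and right context, this objective is what enables the deeply bidirectional conditioning that distinguishes BERT from left-to-right language models.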

 



Author Biography

Rajesh Gupta

Rajesh Gupta, University of Hyderabad, Hyderabad, India


