On an approach to extracting named entities from unstructured texts

A. A. Voroshilova

https://orcid.org/0000-0002-4556-813X

S. Yu. Piskorskaya

https://orcid.org/0000-0002-5589-801X

DOI: https://doi.org/10.47813/2782-5280-2023-2-2-0301-0313

Keywords: information processing, unstructured text, named entity, lexeme, hidden Markov chain.


Abstract

The article considers one possible approach to extracting named entities from unstructured texts. The most common methods for this task rely on manually created finite automata and are complex and labor-intensive. They are especially difficult to apply to multilingual texts, because every new language and every new class of entities requires a person to create a new set of templates by hand. The proposed approach instead uses machine learning. The article states the problem and describes the Markov chain model used to recognize named entities. Within this model, extracting named entities reduces to finding the most probable sequence of states that generates the observed sequence of tokens. The article describes the lexical material, including the set of features and their descriptions, and presents the decoding technique and the estimation of the model parameters. Decoding uses the Viterbi algorithm, which finds the sequence of states that maximizes the probability of generating the observed chain of symbols. The experimental results report the accuracy of lexeme type recognition for different training sample sizes and a diagram of the number of errors by lexeme class.
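
The decoding step named in the abstract can be illustrated with a minimal Python sketch of Viterbi decoding over a hidden Markov model. This is not the authors' implementation: the two states (`PERSON`, `OTHER`), the probability tables, the `unk` fallback for unseen tokens, and the sample sentence are all hypothetical values chosen only to show how the most probable state sequence is recovered from a sequence of tokens.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for `tokens` under a toy HMM."""
    if not tokens:
        return []

    # Work in log space so that long sequences do not underflow.
    def log_emit(state, token):
        # `unk` is an assumed fallback probability for tokens unseen in a state.
        return math.log(emit_p[state].get(token, unk))

    # best[t][s]: log-probability of the best path ending in state s at step t.
    best = [{s: math.log(start_p[s]) + log_emit(s, tokens[0]) for s in states}]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]

    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit(s, tokens[t])
            back[t][s] = prev

    # Trace the best path backwards from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy parameters, not taken from the article.
states = ["PERSON", "OTHER"]
start_p = {"PERSON": 0.3, "OTHER": 0.7}
trans_p = {
    "PERSON": {"PERSON": 0.5, "OTHER": 0.5},
    "OTHER": {"PERSON": 0.2, "OTHER": 0.8},
}
emit_p = {
    "PERSON": {"anna": 0.4, "voroshilova": 0.4},
    "OTHER": {"wrote": 0.3, "the": 0.3, "article": 0.3},
}

tokens = ["anna", "voroshilova", "wrote", "the", "article"]
tags = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, tags)))
```

In a trained system the tables would be estimated from a labeled training sample rather than written by hand, and the states would correspond to whatever entity classes the model distinguishes.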


Author Biographies

A. A. Voroshilova

Anna Voroshilova, Candidate of Philosophical Sciences, Associate Professor, Department of Informatics, Siberian Federal University, Krasnoyarsk, Russia

S. Yu. Piskorskaya

Svetlana Piskorskaya, Doctor of Philosophy, Professor, Director of the Institute of Social Engineering, Reshetnev Siberian State University of Science and Technologies, Krasnoyarsk, Russia



