On an approach to extracting named entities from unstructured texts

A. A. Voroshilova; S. Yu. Piskorskaya

doi:10.47813/2782-5280-2023-2-2-0301-0313

Published

2023-07-17

Issue

Vol. 2 No. 3 (2023)

Section

Education

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

The journal «Informatics. Economics. Management» publishes materials under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license, which is hosted on the official website of the non-profit corporation Creative Commons.

This means that users may copy and distribute the materials in any medium and format, adapt and transform the texts, and use the content for any purpose, including commercial ones. In return, the terms of use must be observed: credit the author of the original work and the source by giving the article's bibliographic details, provide a link to the source, and indicate what changes have been made.

How to Cite

Voroshilova, A. A., & Piskorskaya, S. Y. (2023). To one approach to extracting named entities from unstructured texts. Informatics. Economics. Management, 2(3), 0301–0313. https://doi.org/10.47813/2782-5280-2023-2-2-0301-0313

To one approach to extracting named entities from unstructured texts

A. A. Voroshilova

https://orcid.org/0000-0002-4556-813X

S. Yu. Piskorskaya

https://orcid.org/0000-0002-5589-801X

Keywords: information processing, unstructured text, named entity, lexeme, hidden Markov chain.

Abstract

The article considers one possible approach to extracting named entities from unstructured texts. It notes the complexity and labor intensity of the most common methods for solving this problem, which rely on manually created finite automata. Such methods are difficult to apply to multilingual texts, because each new language and each new class of entities requires a person to create a new set of templates by hand. The proposed approach relies on machine learning instead. The problem statement is given, and the Markov chain model used for named entity recognition is described. Under this model, extracting named entities reduces to finding the most probable sequence of states that generates a given sequence of tokens. The article describes the lexical material, including the set of features and their descriptions, and presents the decoding technique and the estimation of the model parameters. The problem is solved with the Viterbi algorithm, which finds the sequence of states that maximizes the probability of generating the observed sequence of symbols. The experimental results comprise the recognition accuracy for lexeme types at different training sample sizes and a diagram of the number of errors by lexeme class.
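The decoding step named in the abstract can be illustrated with a minimal sketch of the Viterbi algorithm for a hidden Markov model. It is not the authors' implementation. The two states (`PERSON`, `OTHER`), every probability in the toy model, and the `log_unk` floor for unseen tokens are assumptions chosen only to make the example runnable; the abstract does not specify the article's actual states, features, or parameter estimates.

```python
import math


def viterbi(tokens, states, log_start, log_trans, log_emit, log_unk=math.log(1e-6)):
    """Return the most probable state sequence for `tokens` under an HMM.

    All model parameters are log-probabilities. `log_unk` is an assumed
    floor used when a state has never emitted a given token.
    """
    if not tokens:
        return []
    # best[t][s]: log-probability of the best path ending in state s at position t
    best = [{s: log_start[s] + log_emit[s].get(tokens[0], log_unk) for s in states}]
    back = []  # back[t-1][s]: predecessor of s on that best path
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s].get(tok, log_unk)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Start from the best final state and follow the back-pointers.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v) for k, v in table.items()}


# Hypothetical toy model: the states and all numbers below are illustrative only.
states = ["PERSON", "OTHER"]
log_start = logs({"PERSON": 0.3, "OTHER": 0.7})
log_trans = logs({
    "PERSON": {"PERSON": 0.5, "OTHER": 0.5},
    "OTHER": {"PERSON": 0.2, "OTHER": 0.8},
})
log_emit = logs({
    "PERSON": {"anna": 0.5, "voroshilova": 0.5},
    "OTHER": {"wrote": 0.5, "the": 0.3, "article": 0.2},
})

tokens = ["anna", "voroshilova", "wrote", "the", "article"]
path = viterbi(tokens, states, log_start, log_trans, log_emit)
print(list(zip(tokens, path)))
```

Working with log-probabilities turns the products along a path into sums, which avoids numerical underflow on long token sequences. In a trained model the three tables would be estimated from a labeled corpus rather than written by hand.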

Author Biographies

A. A. Voroshilova

Anna Voroshilova, Candidate of Philosophical Sciences, Associate Professor, Department of Informatics, Siberian Federal University, Krasnoyarsk, Russia

S. Yu. Piskorskaya

Svetlana Piskorskaya, Doctor of Philosophy, Professor, Director of the Institute of Social Engineering, Reshetnev Siberian State University of Science and Technologies, Krasnoyarsk, Russia


The Journal is issued under the aegis of the Russian and International Union of Scientific and Engineering Public Associations
