Innovative Technologies for Creating Multilingual Audio content in the Publishing Industry

Alexey Kalmykov

Published: Nov 25, 2023

Keywords:

artificial intelligence, natural language processing, text-to-speech, multilingual audiobooks, blockchain technology

Alexey Kalmykov

Magic Dome Books s.r.o. Publishing House, Czech Republic

Abstract

This article explores how artificial intelligence (AI), natural language processing (NLP), text-to-speech (TTS), machine learning, cloud platforms, and blockchain can improve the production, accessibility, and distribution of multilingual audiobooks in the publishing sector. It combines a literature review with a case study approach to examine how these technologies are adopted and used across publishing roles and responsibilities. The results show that AI and NLP improve multilingual content generation, while TTS and machine learning enable the efficient generation of natural-sounding synthesized voices with multilingual capabilities. Social networking offers a convenient way to share content, while blockchain addresses piracy. However, ethical concerns, reliance on data, and the high cost of these solutions for smaller players are the main limitations. The findings suggest that although these technologies contribute to multilingual audio content production, their effectiveness depends on region, culture, and technology availability. Future development should prioritize language perspectives, ethical considerations, and affordability for small and medium enterprises. Combining AI with human expertise could provide the best outcome for audio content quality, cultural ingenuity, and sustainability.

How to Cite

Kalmykov, A. (2023). Innovative Technologies for Creating Multilingual Audio content in the Publishing Industry. Law, Business and Sustainability Herald, 3(4), 72–87. Retrieved from https://www.lbsherald.org/index.php/journal/article/view/70

Issue

Vol. 3 No. 4 (2023): Law, Business and Sustainability Herald

Section

Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

