A Review of Fault Tolerance Techniques in Generative Multi-Agent Systems for Real-Time Applications

Authors

  • Subhan Uddin, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
  • Babar Hussain, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
  • Sidra Fareed, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
  • Aqsa Arif, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
  • Babar Ali, School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China

DOI:

https://doi.org/10.64229/d8g06y36

Keywords:

Generative Agents, Rollback-Recovery, Human Behavior, Message-Passing Systems, Interactive Simulacra

Abstract

Generative multi-agent systems are emerging as a powerful paradigm for simulating human-like behavior in real-time applications such as interactive storytelling, virtual reality environments, and autonomous decision-making. These agents, often powered by large language models and memory systems, act independently and adapt over time. However, a critical challenge in deploying such systems is ensuring their fault tolerance: the ability to maintain operation in the presence of faults such as communication failures, memory corruption, agent crashes, or behavioral inconsistencies. This paper presents a comprehensive review of fault tolerance techniques for generative agents, focusing on methods such as memory checkpointing, agent replication, fusion-based resilience, and consistency protocols. We analyse these approaches, drawing parallels from distributed systems, and evaluate their effectiveness in maintaining operational integrity in large-scale, real-time environments. Our findings suggest that while no single technique offers a one-size-fits-all solution, a combination of methods can provide robust fault tolerance and support the scalability and reliability of generative agent systems in dynamic, fault-prone environments.

Published

2025-07-03

Section

Articles