Fail Fast, Fail Small: Designing Resilient Systems for the Future of Software Engineering

International Journal of Recent Engineering Science (IJRES)

© 2024 by IJRES Journal
Volume-11 Issue-5
Year of Publication : 2024
Authors : Jill Willard, James Hutson
DOI : 10.14445/23497157/IJRES-V11I5P106

How to Cite?

Jill Willard, James Hutson, "Fail Fast, Fail Small: Designing Resilient Systems for the Future of Software Engineering," International Journal of Recent Engineering Science, vol. 11, no. 5, pp. 51-58, 2024. Crossref, https://doi.org/10.14445/23497157/IJRES-V11I5P106

Abstract
The principles of "fail fast, fail small" have emerged as critical in modern software and system design. By planning for minor, manageable failures instead of catastrophic breakdowns, developers can ensure that systems degrade gracefully, maintaining functionality even when encountering issues. This article delves into strategies for designing resilient systems, beginning with the concept of slow degradation and distributed systems that prioritize core functions while allowing non-critical components to fail without significant user impact. The Netflix recommendation engine serves as a prime example of a system that continues to operate under failure conditions. Chaos engineering, a proactive methodology for stress-testing system robustness, is explored with real-world examples of its implementation. As AI continues to evolve, its role in identifying weaknesses and enhancing system resilience is becoming indispensable. The article highlights AI's potential to push the boundaries of chaos engineering and discusses the growing importance of hybrid cloud solutions, balancing cloud and on-premise resources for optimized resilience. Future trends emphasize the need for service scalability based on business-critical classifications, allowing systems to prioritize resources effectively. Designing systems to "fail fast, fail small" is not only about mitigating risk but also about building adaptive, future-proof architectures that anticipate the unknown.
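The graceful-degradation idea described in the abstract, where a core function keeps working while a non-critical component fails, can be sketched as follows. This is a minimal illustration only, not the authors' method or Netflix's implementation. Every name in it (`with_fallback`, `personalized`, `render_home_page`, `POPULAR_TITLES`) is hypothetical and introduced purely for the example.

```python
from typing import Callable, List

# Hypothetical static fallback list; not taken from the article.
POPULAR_TITLES: List[str] = ["Title A", "Title B", "Title C"]


def with_fallback(primary: Callable[[str], List[str]],
                  fallback: List[str]) -> Callable[[str], List[str]]:
    """Wrap a non-critical call so that its failure stays small."""
    def guarded(user_id: str) -> List[str]:
        try:
            return primary(user_id)
        except Exception:
            # Fail fast: no retries or blocking; serve the degraded result.
            return list(fallback)
    return guarded


def personalized(user_id: str) -> List[str]:
    """Stand-in for a recommendation service that is currently down."""
    raise TimeoutError("recommendation service unavailable")


def render_home_page(user_id: str,
                     recommend: Callable[[str], List[str]]) -> dict:
    """Core function: the page renders with or without personalization."""
    return {"user": user_id, "rows": recommend(user_id)}


page = render_home_page("u1", with_fallback(personalized, POPULAR_TITLES))
```

Here the page still renders when the personalized call raises. The user sees a generic list instead of an error, so the failure is contained to the non-critical feature.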

Keywords
Fail fast, Chaos engineering, System resilience, AI-driven robustness, Hybrid cloud solutions.
