Abstract

The article examines how machine learning methods are increasingly being integrated into automated testing processes, while the absence of a unified system of quality metrics complicates the assessment of their effectiveness. Given the rising cost of defect resolution and the growing share of unstable (flaky) tests, the study of quality metrics for automated tests that takes machine learning (ML) approaches into account is highly relevant. The goals of this work are to classify existing metrics and adapt them to the particular requirements of testing ML systems; in addition, an economic model is proposed for justifying decisions in CI processes. The novelty of the research lies in unifying classical, ML-oriented, and financial metrics and in evaluating their impact on practical infrastructure costs, as demonstrated through industrial cases from Facebook/Meta, Netflix, and Slack. It was found that ML approaches to automated testing based on Predictive Test Selection led to significant savings in CPU hours (Gradle, Netflix) and to reductions in the share of flaky tests, which is also confirmed by data from Facebook/Meta and Slack. It is shown that PR-AUC and other precision-recall-based metrics reflect real-world imbalanced defect classes more faithfully than ROC curves. The economic model, built on the cost coefficients C_FP and C_FN, makes it possible to compute the optimal tradeoff between test execution speed and defect detection rate and thus supports decision-making in CI/CD pipelines. Based on the quantitative data, composite metrics are proposed that combine coverage type, build stability, and an economic component. The article will be of particular interest to quality assurance engineers, project managers, and researchers in the fields of software testing and machine learning.
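
As a purely illustrative sketch of how such a cost model could be applied, rather than the article's own formulation, the Python fragment below picks the test-selection threshold that minimizes the expected cost weighted by C_FP (the cost of running a test that was not needed) and C_FN (the cost of skipping a test that would have caught a defect). The function names, coefficient values, and synthetic data are assumptions introduced here for illustration.

    import numpy as np

    def expected_cost(y_true, y_score, threshold, c_fp=1.0, c_fn=50.0):
        # y_true: 1 if the test would actually fail (defect present), else 0
        # y_score: model-predicted probability that the test fails
        # c_fp / c_fn: illustrative cost coefficients, not values from the article
        selected = y_score >= threshold            # tests the policy chooses to run
        fp = np.sum(selected & (y_true == 0))      # ran, but would have passed anyway
        fn = np.sum(~selected & (y_true == 1))     # skipped, but would have failed
        return c_fp * fp + c_fn * fn

    def optimal_threshold(y_true, y_score, c_fp=1.0, c_fn=50.0):
        # Scan candidate thresholds and keep the one with the lowest expected cost.
        candidates = np.unique(y_score)
        costs = [expected_cost(y_true, y_score, t, c_fp, c_fn) for t in candidates]
        return candidates[int(np.argmin(costs))]

    # Toy usage: 1,000 historical test outcomes with roughly a 5% failure rate,
    # i.e. the kind of class imbalance that motivates PR-AUC over ROC curves.
    rng = np.random.default_rng(0)
    y_true = (rng.random(1000) < 0.05).astype(int)
    y_score = np.clip(0.5 * y_true + 0.6 * rng.random(1000), 0.0, 1.0)
    print("cost-optimal threshold:", optimal_threshold(y_true, y_score))

In practice the coefficients would be derived from measured infrastructure and defect-resolution costs, and the same scan can be repeated per project or per CI pipeline.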

Keywords

  • automated tests
  • machine learning
  • quality metrics
  • Predictive Test Selection
  • code coverage
  • PR-AUC
  • C_FP
  • C_FN
  • CI/CD
  • build stability

References

  1. H. Krasner, “Cost of Poor Software Quality in the U.S.: A 2022 Report,” CISQ, Dec. 16, 2022. https://www.it-cisq.org/the-cost-of-poor-quality-software-in-the-us-a-2022-report/ (accessed May 02, 2025).
  2. M. Simonova, “Costly Code: The Price Of Software Errors,” Forbes, Dec. 26, 2023. https://www.forbes.com/councils/forbestechcouncil/2023/12/26/costly-code-the-price-of-software-errors/ (accessed May 03, 2025).
  3. J. M. Zhang, M. Harman, and Y. Liu, “Machine Learning Testing: Survey, Landscapes and Horizons,” arXiv, Jun. 2019, doi: https://doi.org/10.48550/arxiv.1906.10742.
  4. M. Z. Naser and A. H. Alavi, “Error Metrics and Performance Fitness Indicators for Artificial Intelligence and Machine Learning in Engineering and Sciences,” Architecture, Structures and Construction, Nov. 2021, doi: https://doi.org/10.1007/s44150-021-00015-8.
  5. M. Machalica, A. Samylkin, M. Porth, and S. Chandra, “Predictive Test Selection,” arXiv, Oct. 2018, doi: https://doi.org/10.48550/arxiv.1810.05286.
  6. F. Leinen, D. Elsner, A. Pretschner, A. Stahlbauer, M. Sailer, and E. Jürgens, “Cost of Flaky Tests in Continuous Integration: An Industrial Case Study,” Proceedings of the 2024 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 329–340, May 2024, doi: https://doi.org/10.1109/icst60714.2024.00037.
  7. O. Parry, G. Kapfhammer, M. Hilton, and P. McMinn, “Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures,” arXiv, 2025. https://arxiv.org/abs/2504.16777 (accessed May 08, 2025).
  8. R. Featherman, K. J. Yang, B. Roberts, and M. D. Ernst, “Evaluation of Version Control Merge Tools,” pp. 831–83, Oct. 2024, doi: https://doi.org/10.1145/3691620.3695075.
  9. A. Patel, “Handling Flaky Tests at Scale: Auto Detection & Suppression,” Slack Engineering, Apr. 05, 2022. https://slack.engineering/handling-flaky-tests-at-scale-auto-detection-suppression/ (accessed May 08, 2025).
  10. J. Rose, “The 2023 State of Software Delivery,” CircleCI, 2024. Accessed: May 10, 2025. [Online]. Available: https://circleci.com/landing-pages/assets/CircleCI-The-2023-State-of-Software-Delivery.pdf
  11. A. Igareta, “Navigating Imbalanced Datasets: Beyond ROC & GINI,” Klarna Engineering, Nov. 09, 2023. https://engineering.klarna.com/stop-misusing-roc-curve-and-gini-navigate-imbalanced-datasets-with-confidence-5edec4c187d7 (accessed May 13, 2025).
  12. F. Movahedi, R. Padman, and J. F. Antaki, “Limitations of receiver operating characteristic curve on imbalanced data: Assist device mortality risk scores,” The Journal of Thoracic and Cardiovascular Surgery, vol. 165, no. 4, pp. 1433–1442.e2, Apr. 2023, doi: https://doi.org/10.1016/j.jtcvs.2021.07.041.
  13. “Predictive Test Selection,” Gradle, 2025. https://gradle.com/develocity/product-tour/accelerate/predictive-test-selection/ (accessed May 17, 2025).
  14. M. Ghanem et al., “Limitations in Evaluating Machine Learning Models for Imbalanced Binary Outcome Classification in Spine Surgery: A Systematic Review,” Brain Sciences, vol. 13, no. 12, pp. 1723–1723, Dec. 2023, doi: https://doi.org/10.3390/brainsci13121723.
  15. M. Machalica, “Predictive test selection: A more efficient way to ensure reliability of code changes,” Engineering at Meta, Nov. 21, 2018. https://engineering.fb.com/2018/11/21/developer-tools/predictive-test-selection/ (accessed May 22, 2025).
  16. “Netflix - Case Study,” Gradle. https://gradle.com/customers/story/netflix/ (accessed May 24, 2025).