Evaluation Methods for Intelligent Synthesis Technology of Aerospace Control Software
-
Abstract: Program synthesis is the automatic generation of program code that satisfies user intent, derived from given specifications or requirements. With the successful application of artificial intelligence to program synthesis, intelligent program synthesis has become a new paradigm for software development. Evaluation methods for intelligent program synthesis exist, but they still have shortcomings that call for improvement. This paper surveys the evaluation criteria used for intelligent program synthesis and analyzes the current mainstream evaluation methods, and on that basis summarizes and refines the evaluation indicators. Taking the characteristics of aerospace embedded software into account, it constructs a hierarchical evaluation indicator system for the intelligent synthesis of aerospace embedded software and designs a comprehensive evaluation method for the intelligent synthesis of aerospace control software, based mainly on combining dynamic and static evaluation. In experiments, Pearson correlation coefficients are computed against human scores simulated by ChatGPT3.5: the combined dynamic and static method correlates more strongly with these scores than either dynamic or static evaluation alone, and its scores reflect the performance improvement across model iterations.
-
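Table 1. Evaluation indicators of intelligent program synthesis technology and their frequency of use

| Evaluation level | Evaluation indicator | Related literature | Frequency of use (%) |
|---|---|---|---|
| Synthesis result | Program correctness | [2–55] | 46.49 |
| Synthesis result | Program size | [3,8–12,15,16,22,23,31,33,34,39,42,46] | 9.73 |
| Synthesis result | Program similarity | [10,11,19,47,48,56] | 5.41 |
| Synthesis process | Synthesis time | [6,8,10–13,15,19–21,24,27,34,36,39,41–45,47,49,57] | 13.51 |
| Synthesis process | Number of candidate programs | [7,12–14,16,18,24,26,27,31,39–41,57] | 9.19 |
| Synthesizer training | Amount of training data | [9,14–17,33,35,36,38,44] | 6.49 |
| Other | - | [11,15,22,25,44–46,48,50,51,55,57,58] | 9.18 |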
Table 2. Experimental results

| Evaluation method | ChatGLM-6B | ChatGLM2-6B | ChatGLM3-6B |
|---|---|---|---|
| CHRF++ | 0.406261948 | 0.435981525 | 0.50124219 |
| AST_MATCH | 0.312514349 | 0.355593869 | 0.44914179 |
| DFG_MATCH | 0.517394044 | 0.548241746 | 0.60469280 |
| pass@k (k=1) | 0.040579268 | 0.093689024 | 0.62344512 |
| pass@k (k=10) | 0.135562582 | 0.192008573 | 0.79927887 |
| pass@k (k=100) | 0.257184867 | 0.307372033 | 0.83987480 |
| CodeBLEU | 0.259490443 | 0.289887388 | 0.35368544 |
| Proposed evaluation method | 0.231646673 | 0.276125765 | 0.58140539 |
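
The pass@k rows in Table 2 can be read with the following sketch in mind. It implements the unbiased pass@k estimator described by Chen et al. [61], pass@k = 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass the tests. Whether the paper uses exactly this estimator is an assumption; the function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the task, c: samples that pass, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each item is (n, c) for one task."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for two tasks, for illustration only.
    print(mean_pass_at_k([(4, 1), (4, 0)], k=2))
```

For a fixed set of samples the estimate cannot decrease as k grows, which is consistent with every model's score rising from k=1 to k=100 in Table 2.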
Table 3. Human scores simulated by ChatGPT3.5

| Model | ChatGLM-6B | ChatGLM2-6B | ChatGLM3-6B |
|---|---|---|---|
| Simulated human score | 2.361817523 | 2.758730102 | 4.199085366 |
Table 4. Pearson correlation coefficients between each evaluation method and the simulated human scores

| Evaluation method | ChatGLM-6B | ChatGLM2-6B | ChatGLM3-6B |
|---|---|---|---|
| CHRF++ | 0.340366625 | 0.460577583 | 0.382847599 |
| AST_MATCH | 0.370650915 | 0.379543662 | 0.342105016 |
| DFG_MATCH | 0.379022475 | 0.339336446 | 0.155618439 |
| pass@k (k=1) | 0.602161597 | 0.576083246 | 0.380828102 |
| pass@k (k=10) | 0.578192141 | 0.505663582 | 0.228842017 |
| pass@k (k=100) | 0.457669783 | 0.442143948 | 0.151757714 |
| CodeBLEU | 0.418648037 | 0.453922700 | 0.358580074 |
| Proposed evaluation method | 0.614578225 | 0.594448137 | 0.425688413 |
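
The correlation coefficients in Table 4 compare each automatic metric with the simulated human scores. A minimal sketch of that computation follows, assuming the coefficient is taken over per-sample score pairs for one model. The pairing granularity, the function name `pearson`, and the example values are assumptions for illustration and do not come from the paper.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two score pairs are required")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-sample scores: an automatic metric vs. simulated human ratings.
    metric_scores = [0.20, 0.35, 0.50, 0.80]
    human_scores = [2.0, 3.0, 2.5, 4.5]
    print(pearson(metric_scores, human_scores))
```

A coefficient closer to 1 means the metric ranks and spaces the samples more like the human ratings do, which is the sense in which the proposed method's row leads each column of Table 4.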
[1] YANG Mengfei, GU Bin, DUAN Zhenhua, et al. Intelligent program synthesis framework and key scientific problems for embedded software[J]. Chinese Space Science and Technology, 2022, 42(4): 1-7
[2] SHIN R, POLOSUKHIN I, SONG D. Improving neural program synthesis with inferred execution traces[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc., 2018: 8931-8940
[3] HUANG D, ZHANG R, HU X, et al. Neural program synthesis with query[C]//The 10th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2022
[4] RAMANI G, KARANDE S. Synthesis of mathematical programs from natural language specifications[OL]. arXiv preprint arXiv:2304.03287, 2023
[5] JAIN N, VAIDYANATH S, IYER A, et al. Jigsaw: large language models meet program synthesis[C]//Proceedings of the 44th International Conference on Software Engineering. Pittsburgh: ACM, 2022: 1219-1231
[6] CHRISTAKOPOULOU K, KALAI A T. Glass-box program synthesis: a machine learning approach[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 646-653
[7] ODENA A, SHI K, BIEBER D, et al. BUSTLE: bottom-up program synthesis through learning-guided exploration[C]//The 9th International Conference on Learning Representations. Austria: OpenReview.net, 2021
[8] DUMANCIC S, GUNS T, CROPPER A. Knowledge refactoring for inductive program synthesis[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Virtual Event: AAAI Press, 2021: 7271-7278
[9] ROSIN C D. Stepping stones to inductive synthesis of low-level looping programs[C]//Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu: AAAI Press, 2019: 2362-2370
[10] ZOHAR A, WOLF L. Automatic program synthesis of long programs with a learned garbage collector[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc., 2018: 2098-2107
[11] HONG J, DOHAN D, SINGH R, et al. Latent programmer: discrete latent codes for program synthesis[C]//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR, 2021: 4308-4318
[12] SHI K, DAI H J, ELLIS K, et al. CROSSBEAM: learning to search in bottom-up program synthesis[C]//The 10th International Conference on Learning Representations. Virtual Event: OpenReview.net, 2022
[13] KALYAN A, MOHTA A, POLOZOV O, et al. Neural-guided deductive search for real-time program synthesis from examples[C]//The 6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018
[14] VALKOV L, CHAUDHARI D, SRIVASTAVA A, et al. HOUDINI: lifelong learning as program synthesis[C]//The 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc., 2018: 8701-8712
[15] HANDA S, RINARD M C. Inductive program synthesis over noisy data[C]//The 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. USA: ACM, 2020: 87-98
[16] NYE M I, HEWITT L B, TENENBAUM J B, et al. Learning to infer program sketches[C]//Proceedings of the 36th International Conference on Machine Learning. Long Beach: PMLR, 2019: 4861-4870
[17] CHEN X Y, SONG D, TIAN Y D. Latent execution for neural program synthesis[C]//Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021: 22196-22208
[18] FIJALKOW N, LAGARDE G, MATRICON T, et al. Scaling neural program synthesis with distribution-based search[C]//Proceedings of the 36th AAAI Conference on Artificial Intelligence. Virtual Event: AAAI Press, 2022: 6623-6630
[19] THAKOOR S, SHAH S, RAMAKRISHNAN G, et al. Synthesis of programs from multimodal datasets[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 184-191
[20] RAZA M, GULWANI S. Automated data extraction using predictive program synthesis[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco: AAAI Press, 2017: 882-890
[21] QUIRK C, MOONEY R, GALLEY M. Language to code: learning semantic parsers for if-this-then-that recipes[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing: ACL, 2015: 878-888
[22] ZHANG Y T. Scalability and precision improvement of neural program synthesis[C]//The 35th IEEE/ACM International Conference on Automated Software Engineering. Melbourne: IEEE, 2020: 1391-1393
[23] CHASINS S, PHOTHILIMTHANA P M. Data-driven synthesis of full probabilistic programs[C]//The 29th International Conference on Computer Aided Verification. Heidelberg: Springer, 2017: 279-304
[24] SI X J, LEE W, ZHANG R, et al. Syntax-guided synthesis of datalog programs[C]//The 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. Lake Buena Vista: ACM, 2018: 515-527
[25] LAICH L, BIELIK P, VECHEV M T. Guiding program synthesis by learning to generate examples[C]//The 8th International Conference on Learning Representations. Addis Ababa: OpenReview.net, 2020
[26] ODENA A, SUTTON C. Learning to represent programs with property signatures[C]//The 8th International Conference on Learning Representations. Addis Ababa: OpenReview.net, 2020
[27] SI X J, YANG Y, DAI H J, et al. Learning a meta-solver for syntax-guided program synthesis[C]//The 7th International Conference on Learning Representations. New Orleans: OpenReview.net, 2019
[28] SHIN R, KANT N, GUPTA K, et al. Synthetic datasets for neural program synthesis[C]//The 7th International Conference on Learning Representations. New Orleans: OpenReview.net, 2019
[29] CHEN X Y, LIU C, SONG D. Execution-guided neural program synthesis[C]//The 7th International Conference on Learning Representations. New Orleans: OpenReview.net, 2019
[30] BUNEL R, HAUSKNECHT M J, DEVLIN J, et al. Leveraging grammar and reinforcement learning for neural program synthesis[C]//The 6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018
[31] POLOSUKHIN I, SKIDANOV A. Neural program search: solving programming tasks from description and examples[C]//The 6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018
[32] SHIN R, POLOSUKHIN I, SONG D. Towards specification-directed program repair[C]//The 6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018
[33] PARISOTTO E, MOHAMED A R, SINGH R, et al. Neuro-symbolic program synthesis[C]//The 5th International Conference on Learning Representations. Toulon: OpenReview.net, 2017
[34] BALOG M, GAUNT A L, BROCKSCHMIDT M, et al. DeepCoder: learning to write programs[C]//The 5th International Conference on Learning Representations. Toulon: OpenReview.net, 2017
[35] ALET F, LOPEZ-CONTRERAS J, KOPPEL J, et al. A large-scale benchmark for few-shot program induction and synthesis[C]//Proceedings of the 38th International Conference on Machine Learning. Virtual Event: PMLR, 2021: 175-186
[36] PU Y W, MIRANDA Z, SOLAR-LEZAMA A, et al. Selecting representative examples for program synthesis[C]//Proceedings of the 35th International Conference on Machine Learning. Stockholm: PMLR, 2018: 4158-4167
[37] SUN S H, NOH H, SOMASUNDARAM S, et al. Neural program synthesis from diverse demonstration videos[C]//Proceedings of the 35th International Conference on Machine Learning. Stockholm: PMLR, 2018: 4797-4806
[38] DEVLIN J, UESATO J, BHUPATIRAJU S, et al. RobustFill: neural program learning under noisy I/O[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney: PMLR, 2017: 990-998
[39] MENON A K, TAMUZ O, GULWANI S, et al. A machine learning framework for programming by example[C]//Proceedings of the 30th International Conference on Machine Learning. Atlanta: PMLR, 2013: 187-195
[40] GU X D, ZHANG H Y, KIM S. Deep code search[C]//Proceedings of the 40th International Conference on Software Engineering. Gothenburg: ACM, 2018: 933-944
[41] DESAI A, GULWANI S, HINGORANI V, et al. Program synthesis using natural language[C]//Proceedings of the 38th International Conference on Software Engineering. Austin: ACM, 2016: 345-356
[42] SHRIVASTAVA D, LAROCHELLE H, TARLOW D. Learning to combine per-example solutions for neural program synthesis[C]//Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021: 6102-6114
[43] CUI G F, ZHU H. Differentiable synthesis of program architectures[C]//Proceedings of the 35th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021: 11123-11135
[44] YANG Y D, INALA J P, BASTANI O, et al. Program synthesis guided reinforcement learning for partially observed environments[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2021: 29669-29683
[45] SHAH A, ZHAN E, SUN J J, et al. Learning differentiable programs with admissible neural heuristics[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2020: 4940-4952
[46] GUPTA K, CHRISTENSEN P E, CHEN X Y, et al. Synthesize, execute and debug: learning to repair for neural program synthesis[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Virtual Event: Curran Associates Inc., 2020: 17685-17695
[47] ELLIS K, NYE M, PU Y W, et al. Write, execute, assess: program synthesis with a REPL[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2019: 9165-9174
[48] SHIN R, ALLAMANIS M, BROCKSCHMIDT M, et al. Program synthesis and semantic parsing with learned code idioms[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2019: 10824-10834
[49] ELLIS K, MORALES L, SABLÉ-MEYER M, et al. Learning libraries of subroutines for neurally-guided Bayesian program induction[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc., 2018: 7816-7826
[50] ZHANG L, ROSENBLATT G, FETAYA E, et al. Neural guided constraint logic programming for program synthesis[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc., 2018: 1744-1753
[51] LIANG C, NOROUZI M, BERANT J, et al. Memory augmented policy optimization for program synthesis and semantic parsing[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc., 2018: 10015-10027
[52] CHEN X Y, LIU C, SHIN R, et al. Latent attention for if-then program synthesis[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: Curran Associates Inc., 2016: 4581-4589
[53] ELLIS K, SOLAR-LEZAMA A, TENENBAUM J B. Unsupervised learning by program synthesis[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montréal: MIT Press, 2015: 973-981
[54] ELLIS K, WONG C, NYE M, et al. DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning[C]//The 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. Canada: ACM, 2021: 835-850
[55] BEDNAREK J, PIASKOWSKI K, KRAWIEC K. Ain’t nobody got time for coding: structure-aware program synthesis from natural language[OL]. arXiv preprint arXiv:1810.09717, 2019
[56] MURALI V, QI L, CHAUDHURI S, et al. Neural sketch learning for conditional program generation[C]//The 6th International Conference on Learning Representations. Vancouver: OpenReview.net, 2018
[57] RAGHOTHAMAN M, WEI Y, HAMADI Y. SWIM: synthesizing what I mean-code search and idiomatic snippet synthesis[C]//2016 IEEE/ACM 38th International Conference on Software Engineering. Austin: ACM, 2016: 357-367
[58] BHUPATIRAJU S, AGRAWAL K K, SINGH R. Towards mixed optimization for reinforcement learning with program synthesis[OL]. arXiv preprint arXiv:1807.00403, 2018
[59] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia: ACL, 2002: 311-318
[60] KULAL S, PASUPAT P, CHANDRA K, et al. SPoC: search-based pseudocode to code[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver: Curran Associates Inc., 2019: 11883-11894
[61] CHEN M, TWOREK J, JUN H, et al. Evaluating large language models trained on code[OL]. arXiv preprint arXiv:2107.03374, 2021
[62] BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor: ACL, 2005: 65-72
[63] POPOVIĆ M. chrF: character n-gram F-score for automatic MT evaluation[C]//Proceedings of the 10th Workshop on Statistical Machine Translation. Lisbon: ACL, 2015: 392-395
[64] POPOVIĆ M. chrF++: words helping character n-gram[C]//Proceedings of the 2nd Conference on Machine Translation. Copenhagen: ACL, 2017: 612-618
[65] TRAN N, TRAN H, NGUYEN S, et al. Does BLEU score work for code migration?[C]//IEEE/ACM 27th International Conference on Program Comprehension. Montréal: IEEE, 2019: 165-176
[66] REN S, GUO D Y, LU S, et al. CodeBLEU: a method for automatic evaluation of code synthesis[OL]. arXiv preprint arXiv:2009.10297, 2020
[67] PAN Y, LYU C. Measuring efficient code generation with GEC[C]//The 14th Asia-Pacific Symposium on Internetware. Hangzhou: ACM, 2023: 249-258
[68] IMPROTA C. Poisoning programs by un-repairing code: security concerns of AI-generated code[C]//IEEE 34th International Symposium on Software Reliability Engineering Workshops. Florence: IEEE, 2023: 128-131
[69] SIDDIQ M L, SANTOS J C S. SecurityEval dataset: mining vulnerability examples to evaluate machine learning-based code generation techniques[C]//The 1st International Workshop on Mining Software Repositories Applications for Privacy and Security. Singapore: ACM, 2022: 29-33
[70] SU H R, AI J, YU D, et al. An evaluation method for large language models’ code generation capability[C]//The 10th International Conference on Dependable Systems and Their Applications. Tokyo: IEEE, 2023: 831-838
[71] KOVALCHUK S, FEDRUSHKOV D, LOMSHAKOV V, et al. Test-based and metric-based evaluation of code generation models for practical question answering[C]//The International Conference on Code Quality. St. Petersburg: IEEE, 2023: 73-86
-