Malware's polymorphism analysis using a hybrid machine learning algorithm approach

Halilou Claude BOBO HAMADJIDA; Aurelle TCHAGNA KOUANOU; Christian TCHAPGA TCHITO; Clarence TAMKO KOUADJO

27 Feb 2025 (modified: 05 Apr 2025), AIMS 2025 Workshop T2P Submission, CC BY 4.0

Keywords: Malware polymorphism, Tactics Techniques Procedures, Hybrid Machine Learning, Malware Classification, Cybersecurity datasets.

TL;DR: This paper presents a hybrid machine learning approach combining Fuzzy Ranking and Support Vector Machines to detect polymorphic malware with high precision, leveraging the CIC2022 dataset for enhanced cybersecurity decision-making.

Abstract: Background and Objective: Emerging technologies such as artificial intelligence (AI) continue to enhance human life and activities. Advances in information technology, communication, robotics, and Industrial Control Systems (ICS) have made powerful computational resources increasingly accessible. However, these same technologies can also help malware bypass modern cybersecurity defenses by enhancing its capabilities. Polymorphism is a key example of an advanced malware technique that can be exploited with the help of machine learning (ML). This research aims to address malware polymorphism by leveraging hybrid machine learning (HML) approaches.
Method: Building on insights from previous research, this study focuses on selecting an appropriate dataset and HML algorithm to achieve high-precision polymorphism detection. The objective is not only to detect polymorphic malware with high accuracy but also to provide real-time descriptions of malware tactics, techniques, and procedures (TTPs) to improve decision-making in cybersecurity. To implement this approach, the CIC2022 malware dataset from the Canadian Institute for Cybersecurity was cleaned, formatted, and pre-processed, then used to train an HML algorithm that combines Fuzzy Ranking (FR) and Support Vector Machine (SVM). The performance of the proposed method was evaluated using a confusion matrix, a cross-validated AUC-ROC curve, the F1 score, and the false positive rate (FPR). Finally, the trained model was tested on simulated polymorphic malware to analyze its actual TTPs.
Results: The integration of SVM with FR-based feature selection achieved an overall precision exceeding 0.95 for polymorphism detection. Furthermore, using the CIC2022 dataset, the model provided an approximate description of malware TTPs, and it achieved an even higher accuracy (0.9977) on the Microsoft BIG 2015 dataset when tested within an isolated Windows environment.
Conclusion: The proposed approach demonstrates stability, efficiency, and reliability in detecting polymorphic malware. However, there was a slight deviation from the original research hypothesis regarding the dataset used, as CIC2022 was chosen over Malimg due to accessibility constraints.
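
As an illustration of the evaluation metrics named in the abstract (confusion matrix, precision, F1 score, and false positive rate), the following minimal sketch shows how they relate to one another for a binary "polymorphic vs. benign" detector. It is not the authors' code: the names `BinaryMetrics` and `confusion_counts` are hypothetical, and the labels are toy values chosen only to make the arithmetic easy to follow, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    """Confusion-matrix counts for a binary detector (hypothetical helper)."""
    tp: int  # polymorphic samples correctly flagged
    fp: int  # benign samples wrongly flagged
    tn: int  # benign samples correctly passed
    fn: int  # polymorphic samples missed

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0

    @property
    def fpr(self) -> float:
        # Share of benign samples that were wrongly flagged.
        denom = self.fp + self.tn
        return self.fp / denom if denom else 0.0


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> BinaryMetrics:
    """Count TP/FP/TN/FN, treating `positive` as the malware class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return BinaryMetrics(tp=tp, fp=fp, tn=tn, fn=fn)


if __name__ == "__main__":
    # Toy labels (assumed for illustration): 1 = polymorphic, 0 = benign.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    m = confusion_counts(y_true, y_pred)
    print(m)
    print(f"precision={m.precision:.3f} f1={m.f1:.3f} fpr={m.fpr:.3f}")
```

Reporting FPR alongside precision is useful because a detector can reach high precision on a malware-heavy test set while still flagging a noticeable share of benign samples.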

Submission Number: 5
