A Malware Classification Method Using Visualization and Word Embedding Features

Document Type : Original Article

Authors

1 PhD student, Semnan University, Semnan, Iran

2 Associate Professor, Semnan University, Semnan, Iran

3 Professor, Semnan University, Semnan, Iran

Abstract

With the explosive growth of threats to Internet security, malware visualization for malware classification has become a promising research area in security and machine learning. This paper proposes a visualization method for malware analysis based on word embedding features of byte sequences. Given auxiliary information such as word embeddings, the basis of a strong malware classification approach is to transfer the learned information from the malware domain to the image domain, which requires modeling the correlation between these domains. However, most current methods neglect to model these relationships through embeddings, resulting in low classification performance. To address this challenge, we treat the word embedding task as semantic information extraction. The proposed method takes as input a set of embedded vectors corresponding to a malware sample and aims to learn effective representations of malware families. Word embedding is used to generate features of a malware sample by leveraging its semantics. Our results show that visual models from the image domain can be used for efficient malware classification. We evaluated our method on the Kaggle dataset of Windows PE file instances, obtaining an average classification accuracy of 0.9896 (98.96%).
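As a rough illustration of the idea in the abstract, the sketch below maps each byte of a sample to an embedding vector and stacks the vectors into a grayscale image grid that an image classifier could consume. It is only a sketch under stated assumptions: the function names, the embedding dimension, the fixed number of rows, and the random embedding table (a stand-in for embeddings actually trained on byte sequences) are all hypothetical and are not taken from the paper.

```python
import random


def build_embedding_table(dim=8, seed=0):
    """Hypothetical stand-in for a trained embedding: one random vector per byte value."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(256)]


def bytes_to_image(data, table, rows=16):
    """Map each byte to its embedding vector and stack the vectors as image rows.

    The sequence is truncated or zero-padded to `rows` bytes, so the result is
    a rows x dim grid of grayscale pixel values in the range 0..255.
    Note: padding with byte 0 makes padding indistinguishable from real 0x00 bytes.
    """
    padded = list(data[:rows]) + [0] * max(0, rows - len(data))
    grid = [table[b] for b in padded]
    # Min-max scale the embedding values to 8-bit pixel intensities.
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    span = (hi - lo) or 1.0
    return [[int(round(255 * (v - lo) / span)) for v in row] for row in grid]


table = build_embedding_table()
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03])  # begins with the "MZ" signature of a PE file
image = bytes_to_image(sample, table)
```

In practice the random table would be replaced by embeddings learned from the byte sequences of the training corpus, and the resulting grids would be fed to a convolutional network; both choices are assumptions here rather than details confirmed by the abstract.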


Volume 11, Issue 1 - Serial Number 41
No. 41, Spring
May 2023
Pages 1-13
  • Receive Date: 10 August 2021
  • Revise Date: 02 September 2022
  • Accept Date: 24 December 2022
  • Publish Date: 22 May 2023