Android malware classification using deep learning



MINISTRY OF EDUCATION AND TRAINING

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


Ph.D Nguyen Kim Khanh Ph.D Hoang Van Hiep

Hanoi−2024


DECLARATION OF AUTHORSHIP

I declare that my dissertation titled "Android malware classification using deep learning" has been entirely composed by myself, supervised by my co-supervisors, Ph.D Nguyen Kim Khanh and Ph.D Hoang Van Hiep. I affirm the following statements:

• This work was done as a part of the requirements for the degree of Ph.D at Hanoi University of Science and Technology.

• This dissertation has not previously been submitted for any degree.

• The results in my dissertation are my independent work, except where works in the collaboration have been included. Other appropriate acknowledgments are given within this dissertation by explicit references.

Ph.D NGUYEN KIM KHANH

Ph.D HOANG VAN HIEP


My dissertation was realized during my doctoral course at the School of Information and Communication Technology (SoICT), Hanoi University of Science and Technology (HUST). HUST is a special place where I accumulated immense knowledge during my Ph.D process.

A Ph.D process is not a one-man process. Therefore, I am heartily thankful to my supervisors, Ph.D Nguyen Kim Khanh and Ph.D Hoang Van Hiep, whose encouragement, guidance, and support from start to finish enabled me to develop my research skills and understanding of the subject. I have learned countless things from them. This dissertation would not have been possible without their precious support.

I would like to thank the Executive Board and all members of the Computer Engineering Department, SoICT, and HUST for their frequent support in my Ph.D course. I thank my colleagues at the Academy of Cryptography Techniques for their help.

Last but not least, I would like to thank my family, my parents, my wife, and my friends, who have supported me spiritually throughout my life. They were always there to cheer me up and stand by me through good and bad times.

Hanoi, April, 2024 Ph.D Student

LE DUC THUAN


1 OVERVIEW OF ANDROID MALWARE CLASSIFICATION BASED ON MACHINE LEARNING
1.1 Background Information
1.1.1 Android Platform
1.1.2 Overview of Android Malware
1.2 Android Malware Classification Methods
1.2.1 Signature-based Method
1.2.2 Anomaly-based Method
1.2.3 Android Malware Classification Evaluation Metrics
1.2.3.1 Metrics for the Binary Classification Problem
1.2.3.2 Metrics for Multi-labelled Classification Problem
1.2.4 Android Malware Dataset
1.3 Machine Learning-based Method for Android Malware Classification
1.4 Related Works
1.4.1 Related Works on Feature Extraction
1.4.1.1 Features Extraction Methods
1.4.1.2 Feature Augmentation Methods
1.4.1.3 Feature Selection Methods
1.4.2 Related Works on Machine Learning-based Methods
1.4.2.1 Random Forest Algorithm
1.4.2.2 Support Vector Machine
1.4.2.3 K-Nearest Neighbor Algorithm
1.4.2.4 Deep Belief Network
1.4.2.5 Convolutional Neural Network
1.4.2.6 Some Other Models
1.5 Proposed Methodology
1.6 Chapter Summary

2 PROPOSED METHODS FOR FEATURE EXTRACTION
2.1 Feature Augmentation based on Co-occurrence Matrix
2.1.1 Proposed Idea
2.1.2 Raw Feature Extraction
2.1.3 Co-occurrence Matrix Feature Computation
2.1.4 Experimental Results
2.1.4.1 Experimental Dataset
2.1.4.2 Experimental Scenario
2.1.4.3 Malware Classification based on CNN Model
2.1.4.4 Summary of Experimental Results
2.2 Feature Augmentation based on Apriori Algorithm
2.2.1 Proposed Idea
2.2.2 Apriori Algorithm
2.2.2.1 Introduction to Apriori Algorithm
2.2.2.2 Apriori Algorithm
2.2.3 Feature Set Creation
2.2.3.1 Raw Android Feature Set
2.2.3.2 The Feature Augmentation Set
2.2.3.3 Input Feature Normalization
2.2.3.4 Feature Augmentation Set
2.2.4 Experimental Results
2.2.4.1 Experimental Dataset and Scenario
2.2.4.2 Experiment based on CNN Model
2.2.4.4 Evaluation
2.3 Feature Selection Based on Popularity and Contrast Value in a Multi-objective Approach
2.3.1 Proposed Idea
2.3.2 Popularity and Contrast Computation
2.3.3 Pareto Multi-objective Optimization Method
2.3.4 Selection Function and Implementation
3.1.2 Boltzmann Machine and Deep Belief Network
3.1.2.1 Restricted Boltzmann Machine
3.1.2.2 Deep Belief Network
3.2.2.2 Raw Feature Dataset
3.2.2.3 Malware Classification using CNN Model
3.3 Proposed Method using WDCNN Model for Android Malware Classification
3.4 Applying Federated Learning Model
3.4.1 Federated Learning Model
3.4.2 Implement Federated Learning Model


LIST OF TABLES

1.4 Common API packages
1.5 Common suspicious API call
1.6 Some typical traffic flows
2.1 Details of parameters set in the CNN model
2.2 Classification with CNN model using accuracy measure (%)
2.3 Measurements evaluate effectiveness (%)
2.4 Details of parameters set in the CNN model
2.5 Classification results by CNN
2.6 Results of using CNN with measurements (%)
2.7 Details of parameters set in the CNN model for selection feature
2.8 Summary of feature evaluation measures selectivity functions (top (10)) - with API set
2.9 Summary of results with datasets and feature sets
2.10 Summary of results of proposed feature augmentation methods
3.1 Result with Acc measure (%) in scenario 1
3.2 Result with Acc measure (%) in scenario 2
3.3 Results with measures in scenario 3 (%)
3.4 Experimental results using CNN model
3.5 The datasets used for the experiment
3.6 Experimental results of Simple dataset
3.7 Experimental results of Complex dataset
3.8 Experimental results when comparing models
3.9 Accuracy comparison of models. Features: Images 128x128 + permission + API
3.10 Experimental results with scenario 3 (%)
3.11 Average set of weights (accuracy - %)
3.12 Set of weights according to the number of samples (accuracy - %)
3.13 Our proposed set of weights (accuracy - %)
3.14 Summary of results of proposed machine learning, deep learning models and comparison


LIST OF FIGURES

1.1 Architecture of Android OS system [37]
1.2 The increase of malware on Android OS
1.3 Types of malware on Android OS
1.4 Anomaly-Based Detection Technique
1.5 Overview of the problem of detecting malware on the Android
1.6 General model of feature extraction methods
1.7 Statistics of papers using machine learning and deep learning from 2019-2022 on dblp
1.8 Architecture of the CNN model [133]
2.1 Evaluation model for Android malware classification using co-occurrence matrix
2.2 Output matrix with different size
2.3 Top (10) malware families in Drebin dataset
2.4 CNN having multi-convolutional networks
2.5 The process of research and experiment using Apriori
2.6 Apply the Apriori algorithm to the feature set
2.7 Architecture of CNN model used in the experiment with Apriori
2.8 Learning method implementation results
2.9 Proposing a feature selection model
2.10 Top (20) family of malware with the most samples in the AMD dataset
2.11 Experimental model when applying feature selection algorithm
2.12 Experimental results when applying feature selection algorithm
3.1 System development and evaluation process using the DBN
3.2 Architectural diagram of DBN application in Android malware detection
3.3 The overall model of the training and classification of malware using the CNN model
3.4 Test rate according to the 10-fold
3.5 WDCNN model operation diagram
3.6 Structure and parameters of the WDCNN model
3.7 Top 20 malware family AMD and Drebin
3.8 Experimental model
3.9 Classification of malware depending on the number of labels
3.10 DEX file size by size in the Drebin dataset
3.11 Overall model using federated learning
3.12 Compare the results of the weighted aggregation methods
3.13 Classification results with influence factor


In the present day, there is a growing inclination towards the adoption of digital transformation and artificial intelligence in smart device applications across diverse operating systems. This trend aligns with the advancements of the fourth industrial revolution and is being observed in numerous domains of social and economic activity. According to the statistics [1], in June 2023 Android dominated the market for mobile operating systems with a 70.79% share. Furthermore, the Android operating system is used in a diverse range of smart devices, including but not limited to mobile phones, televisions, watches, automobiles, vending machines, and network routers. The rapid growth and variety of devices that use the Android operating system (OS) have contributed to a significant increase in the number, style, and appearance of malware. According to the statistics [2], in 2021 a total of 3.36 million pieces of malware were found in the Android OS market. This situation endangers users of mobile operating systems. Solving the problem of malware detection is, therefore, urgent and necessary. As reported in the DBLP database [3], from 2013 to 2022 there were 1,081 studies on this issue.

Two main approaches are commonly applied to detect Android malware: static and dynamic analysis. Static analysis involves inspecting a program’s executable file structure, characteristics, and source code. The advantage of static analysis is that it does not require the code to be executed (running a malware file on a real system is, of course, dangerous). By examining the decompiled code, static analysis can determine the flows and actions of the executable file and thus identify it as either malware or benign. The disadvantage, however, is that some sophisticated malware includes malicious runtime behavior that can go undetected. Dynamic analysis, on the other hand, involves executing potentially malicious code in a real or sandbox environment to monitor its behavior. The sandbox environment helps analysts examine potential threats without putting the system at risk of infection. Although dynamic analysis can detect threats that static analysis might miss, it requires more time and resources, and it may not be able to cover all the possible execution paths of the malware. In summary, static analysis helps find known threats and vulnerabilities, whereas dynamic analysis is suitable for finding new types and uncovering threats not previously documented (i.e., zero-day threats). For the problem of malware detection, dynamic analysis is recommended for organizations that need a deeper understanding of malware behavior or impact and have the necessary tools and expertise to perform it. For the problem of malware classification, static analysis is more popular due to its more straightforward implementation. This dissertation also uses static analysis as the main method for feature extraction [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].

Malware classification assigns samples to specific malware families or to the benign class. Signature-based and machine-learning-based methods have usually been used for this problem. Signature-based methods are traditional and widely used [15, 16, 17]. They rely on matching the "signature" of known malware samples with unknown ones. As mentioned in the previous paragraph, static or dynamic analysis can extract the "signature" from samples. Signature-based methods have several limitations: (i) they cannot detect new or unknown malware; (ii) they are vulnerable to obfuscation and encryption techniques used by malware authors to evade detection; and (iii) they require constant updates of the signature database. Machine-learning-based methods are emerging and promising techniques for malware classification. They use various machine learning algorithms to learn from a large set of labeled malware samples and then classify new ones based on their features. Unlike signature-based methods, they can overcome some of these challenges, such as detecting new or unknown malware, handling complex or dynamic code features, and reducing human intervention and manual analysis. However, machine-learning-based methods have some drawbacks: (i) they require more time and resources than signature-based methods, and (ii) their classification accuracy depends on the quality of the training data labels as well as on the learning model. Since machine-learning-based methods are more advanced than signature-based ones, this work focuses on the machine-learning-based method for Android malware classification.
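Limitation (i) of signature-based methods can be illustrated with a minimal sketch. It assumes, purely for illustration, that a signature is the SHA-256 digest of a sample's bytes; practical signatures are often byte patterns or rules. The function names and the byte strings are hypothetical.

```python
import hashlib


def signature(sample: bytes) -> str:
    """Return the SHA-256 hex digest, used here as a stand-in signature."""
    return hashlib.sha256(sample).hexdigest()


def is_known_malware(sample: bytes, signature_db: set) -> bool:
    """Flag a sample only if its signature is already in the database."""
    return signature(sample) in signature_db


# Hypothetical byte strings standing in for real sample contents.
known_sample = b"malicious-payload-v1"
signature_db = {signature(known_sample)}

# A variant that differs by one byte, e.g. after trivial obfuscation.
variant = b"malicious-payload-v2"

print(is_known_malware(known_sample, signature_db))  # True
print(is_known_malware(variant, signature_db))       # False
```

Under this assumption, any change to the sample yields a different digest, so the variant is missed until the database is updated.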

Machine learning is a branch of artificial intelligence that currently receives wide attention across various domains. Machine learning approaches are divided into two main categories: supervised and unsupervised. In Android malware detection and classification, unsupervised learning models do not require labeled data, meaning they can work with any Android app without knowing its class beforehand. However, unsupervised learning models can also be less reliable and explainable, as they may group apps based on arbitrary or irrelevant features or fail to capture the true characteristics of malware. Therefore, supervised learning is still more popular in applications for Android malware classification due to its more accurate and interpretable results [6, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. Based on the above analysis, this work focuses on a supervised learning model. Supervised learning requires a large and reliable dataset that labels Android apps as benign or malware. Fortunately, such datasets can be easily found on the Internet.

There are many steps involved in a machine learning problem, but two of the most important ones are data preparation and model evaluation:

• Data preparation is the process of collecting, cleaning, transforming, and selecting the data that will be used for machine learning. Data preparation is crucial because it affects the quality and performance of the machine learning model. If the data is incomplete, inaccurate, irrelevant, or inconsistent, the model cannot learn the correct patterns and make accurate predictions.

• Model evaluation measures and compares the machine learning model’s performance on unseen data. Model evaluation is important because it helps to determine how well the model generalizes to new situations and how reliable its predictions are.
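The notion of evaluating on unseen data can be sketched as follows. This is an illustrative example, not the evaluation protocol of the dissertation: the function names, the 25% hold-out ratio, and the label vectors are assumptions, with label 1 standing for malware and 0 for benign.

```python
import random


def train_test_split(samples, test_ratio=0.25, seed=0):
    """Shuffle the samples and hold out a fraction as unseen test data."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 (label 1 = malware)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical true labels and predictions on a held-out test set.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75.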

Feature extraction is one step of data preparation. For Android malware classification, the input of this step is a list of APK (Android Application Package) files, and the output is the raw features extracted from these files. With static analysis, examples of raw features are (i) permissions (a list of permissions that the app requests from the system or user, such as permission to access the Internet, contact information, etc.), (ii) API calls (a list of methods the app invokes from the Android framework or other libraries), and (iii) resources (the files that the app uses to store data or provide user interface elements, such as images, icons, sounds, strings, layouts, etc.). These features are usually presented in "string" format and thus need to be converted into numbers before being used as input for the machine learning model. Many related works have used or combined the above raw features without considering the relationships among them [4, 5, 7, 26, 28, 29]. In this dissertation, two methods are proposed for raw feature augmentation, i.e., methods that exploit the relationships between features. They are based on the observation that if a malware sample requests a specific permission, it tends to call some particular APIs; observing the co-occurrence of permissions and/or API calls may therefore help capture the relationships between individual features.
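The conversion of string features into numbers and the idea of co-occurrence can be sketched as below. This is a minimal illustration under stated assumptions, not necessarily the computation used later in the dissertation: the vocabulary, the example app, and the function names are hypothetical, and co-occurrence is taken to be the outer product of a 0/1 feature vector with itself.

```python
def encode(app_features, vocabulary):
    """Map a set of string features to a 0/1 vector over the vocabulary."""
    return [1 if name in app_features else 0 for name in vocabulary]


def co_occurrence(vector):
    """Outer product of a 0/1 vector with itself: entry (i, j) is 1
    when features i and j both occur in the same app."""
    return [[a * b for b in vector] for a in vector]


# Hypothetical vocabulary mixing permissions and one API call.
vocabulary = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "SmsManager.sendTextMessage",
]

# Hypothetical app that requests SEND_SMS and calls sendTextMessage.
app = {"android.permission.SEND_SMS", "SmsManager.sendTextMessage"}

vector = encode(app, vocabulary)
matrix = co_occurrence(vector)
print(vector)        # [0, 1, 0, 1]
print(matrix[1][3])  # 1: SEND_SMS co-occurs with sendTextMessage
print(matrix[0][1])  # 0: INTERNET is absent from this app
```

The matrix makes the pairing of a permission with an API call explicit as a single entry, which a flat feature vector does not.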

For the model evaluation, several typical machine learning models have been investigated and adopted for this problem, including SVM, RF, decision tree, KNN, Naive Bayes, etc. [14, 25, 27, 30]. Although these traditional machine learning models can achieve quite high classification accuracy, they usually focus on the malware detection problem, i.e., binary classification. In recent years, deep learning models, such as the Convolutional Neural Network (CNN), have dominated many fields of machine learning, e.g., fingerprint recognition, face recognition, voice detection, anticipation, etc. However, only a few works have tried CNN for the problem of Android malware detection and classification [31, 32, 33, 34]. The advantage of deep learning models is that they can "learn" features from raw input data. Manual feature extraction may, therefore, be unnecessary in some cases. Some research has proposed that by directly converting APK files into "images," the malware classification problem becomes an image classification one and can thus be solved by the CNN model [35, 36]. This idea performs well for malware on the Windows platform but poorly on the Android platform. The poor result on Android arises because, unlike an executable file in Windows, an APK file in Android is not a single file. Instead, it contains all the contents needed to run the application, including the Android manifest, classes.dex (compiled Java code), and resources. Therefore, simply converting an APK file into an "image" may make no sense. Even when only the classes.dex file (the compiled code) is converted into an "image," the resulting "image" may lack much of the information stored in other files, which consequently leads to a poor classification result. Therefore, to take full advantage of deep learning, namely the ability to learn hidden features of the sample files, while still improving the performance of Android malware classification, the Wide and Deep (WD) model was proposed for this problem. Experimental results on different datasets demonstrated the feasibility of the proposed idea.
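The conversion of compiled code into an "image" can be sketched as follows. This is an illustration under assumptions rather than the procedure of the cited works: bytes are laid out row by row as grayscale values in the range 0 to 255, and the input is truncated or zero-padded to a fixed size. The function name and the padding policy are hypothetical; the input bytes are a toy stand-in for the contents of a classes.dex file.

```python
def bytes_to_image(data: bytes, width: int = 128, height: int = 128):
    """Lay out bytes row by row as pixel values in 0..255.

    The input is truncated or zero-padded to exactly width * height bytes.
    """
    size = width * height
    padded = data[:size] + bytes(max(0, size - len(data)))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Toy input: DEX-style magic bytes followed by 16 arbitrary bytes.
toy_dex = b"dex\n035\x00" + bytes(range(16))
image = bytes_to_image(toy_dex, width=4, height=4)
for row in image:
    print(row)
```

Only the bytes of this one file reach the image, which is why information held in the manifest or in resources is lost in such a representation.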

In summary, this dissertation offers the following main contributions:

• Proposing feature enhancement methods:

– Feature augmentation based on co-occurrence matrix in the work [Pub.2].

– Feature augmentation based on the Apriori algorithm in the work [Pub.6].

– Feature selection based on popularity and contrast in the work [Pub.10].

• Proposing an improved model WDCNN to increase accuracy in classifying Android malware [Pub.3].
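As background for the Apriori-based augmentation named above, the following is a minimal sketch of the textbook Apriori frequent-itemset search, not the dissertation's implementation. Each app is treated as a "transaction" of features; the feature names, the example apps, and the support threshold of 0.5 are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical apps described by the features they contain.
apps = [
    {"SEND_SMS", "sendTextMessage", "INTERNET"},
    {"SEND_SMS", "sendTextMessage"},
    {"INTERNET"},
    {"SEND_SMS", "sendTextMessage", "READ_CONTACTS"},
]
frequent = apriori(apps, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

One way to use the result for augmentation, stated here as an assumption, is to add one binary feature per frequent itemset of size two or more, set to 1 when an app contains all items of that itemset.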

The dissertation employed a methodology integrating theoretical inquiry with empirical assessment to attain the previously mentioned outcomes. Initially, the dissertation conducted a comprehensive review and synthesis of pertinent literature to establish a general problem and subsequently scrutinized it to identify unresolved issues. Subsequently, the dissertation introduces several approaches to address these issues during the feature extraction and training stages. The proposed methodologies were tested using three datasets from trusted sources to assess and contrast their performance against alternative approaches. The present dissertation is organized in the following structure:

• Chapter 1. Overview of Android Malware Classification Based on Machine Learning. This chapter summarizes the research and builds a general model of Android malware classification. Based on specific industry challenges and models, the problem, the target, and the classification method are provided.

This chapter also summarizes and analyzes related studies on feature engineering, such as extraction methods, augmentation methods, and feature selection, and indicates unresolved issues. It further analyzes studies on machine learning and deep learning models relating to Android malware and points out challenges posed in model selection and augmentation.


• Chapter 2. Proposed Methods for Feature Extraction. This chapter presents the feature selection and augmentation methods proposed and developed in the dissertation. The augmentation method based on the Apriori algorithm applies association rule mining to generate new features that capture correlations between existing features and adds them to improve the feature set. The co-occurrence matrix method finds relationships between features based on their co-occurrence, yielding new features that enrich the feature set. The selection method based on popularity and contrast value defines two measures and a feature evaluation function built on them; features are selected according to the value of this function.

• Chapter 3. [...] Classification. This chapter presents the implementation of some deep learning methods for the Android malware detection problem and proposes some model enhancements for the problem. The first part of the chapter proposes and tests the application of deep belief networks and convolutional neural networks to detect Android malware. Based on the results at the beginning of the chapter showing the superiority of the CNN model in the Android malware detection problem, the dissertation proposes enhancements to the deep learning model. The last part of the chapter proposes and enhances the federated deep learning model with federated weighting methods based on the sample set size for the Android malware detection problem.
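Weighting by sample set size in federated learning can be sketched as the standard sample-weighted average of client parameters. This is a minimal illustration, not the weighting scheme proposed in the dissertation; the function name, the two clients, and their parameter vectors are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighting each client
    by the number of training samples it holds."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients with 100 and 300 local samples.
client_weights = [[1.0, 0.0], [3.0, 4.0]]
client_sizes = [100, 300]
global_weights = federated_average(client_weights, client_sizes)
print(global_weights)  # [2.5, 3.0]
```

The client with three times as many samples contributes three times as much to each aggregated parameter, in contrast to a plain average, which would give [2.0, 2.0].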


Chapter 1

OVERVIEW OF ANDROID MALWARE CLASSIFICATION BASED ON MACHINE LEARNING

Chapter 1 will provide an overview of foundational knowledge, covering aspects such as Android operating system architecture, malware, Android-specific malware, methods for classifying malware on Android, metrics used in machine learning and deep learning, and related works.

The dissertation then identifies scientific gaps within this context, serving as a basis to highlight new contributions in the subsequent chapters.

1.1 Background Information

1.1.1 Android Platform

The Android platform provides users with tools and APIs to create applications (apps) for mobile phones, televisions, smartwatches, etc.


Figure 1.1: Architecture of Android OS system [37]

• Linux Kernel

The Android operating system is built upon the Linux kernel version 2.6. All operations that are to be executed are carried out at this level. These processes include memory management, hardware communications (driver models), security tasks, and process management.

Although Android was built upon the Linux kernel, the kernel has been heavily modified These modifications are tailor-made to satisfy the characteristics of handheld devices, such as the limited nature of the CPU, memory and storage, screen size, and, most importantly, the continuous need for wireless connections.


This level contains the following components:

– Display Driver: controls the screen’s display and captures user interactions (e.g., touch, gestures).

– Camera Driver: manages the camera’s operation and receives data streams from the camera.

– Bluetooth Driver: controls the transmission and reception of Bluetooth signals.

– USB Driver: manages the functionality of USB communication ports.

– Keypad Driver: controls the keypad input.

– Wi-Fi Driver: responsible for sending and receiving Wi-Fi signals.

– Audio Driver: controls audio input and output devices, decoding audio signals to sound and vice versa.

– Binder IPC Driver: handles inter-process communication (IPC) between applications and system services.

– M-System Driver: manages reading and writing operations on memory devices like SD cards and flash drives.

– Power Management: monitors power consumption.

• Hardware Abstraction Layer – HAL

The hardware abstraction layer (HAL) provides standard interfaces that expose device hardware capabilities to the higher-level Java API framework. The HAL consists of multiple library modules, each implementing an interface for a specific hardware component, such as the camera or Bluetooth module. When a framework API makes a call to access device hardware, the Android system loads the library module for that hardware component.

• Android Runtime

The Android Runtime provides the libraries that programs written in Java need to function correctly. It has two main components, much like the Java equivalent on personal computers. The first component is the Core Library, which contains classes such as Java IO, Collections, and File Access. The second component is the Dalvik Virtual Machine, an environment for running Android applications.

• Native C/C++ Libraries

This layer comprises numerous libraries written in C/C++ to be used by software applications. These libraries are grouped into the following categories:

– System C Libraries: these libraries are based on the C standard and are used exclusively by the operating system.

– OpenGL ES: Android supports high-performance 2D and 3D graphics with the Open Graphics Library (OpenGL®), specifically, the OpenGL ES API. OpenGL is a cross-platform graphics API that specifies a standard software interface for 3D graphics processing hardware.

– Media Libraries: this collection contains various code segments to support the playback and recording of standard audio, image, and video formats.

– Web Library (LibWebCore): this component enables content viewing on the web and is used to build the web browser software (Android Browser) and for embedding into other applications. It is highly robust, supporting powerful technologies such as HTML5, JavaScript, CSS, DOM, AJAX, etc.

– SQLite Library: this is a database system that applications can utilize.

• Java API Framework

The entire feature set of the Android OS is available to you through APIs written in the Java language. These APIs form the building blocks you need to create Android apps by simplifying the reuse of core, modular system components and services, which include the following components:

– Activity Manager: manages the lifecycle of applications and provides tools to control Activities, overseeing various aspects of the application’s lifecycle and Activity Stack.

– Telephony Manager: provides tools for communication functions such as making phone calls.

– XMPP Service: facilitates real-time communication.

– Location Manager: provides access to the system location services.

– Window Manager: manages the construction and display of user interfaces and the organization and management of interfaces between applications.

– Resource Manager: handles static resources of applications, including image files, audio, layouts, and strings. It enables access to embedded resources (not code) such as strings, color settings, and UI layouts.

– Notification Manager: allows applications to display notifications to users.

– Content Providers: enables applications to publish and share data with other applications.

– View System: a collection of views used to create the application user interface.

• Application Layer

System Apps are apps that communicate with the users. Some of these apps include:

– The basic apps that come with the OS, such as Phone, Contacts, Browser, SMS, Calendar, Email, Maps, Camera, etc.

– The user-installed apps, like games, dictionaries, etc.

These applications share these characteristics:

– Written in Java or Kotlin, with extension type APK (APK file).

– When an app is run, a Virtual Machine is initialized for that runtime. The app can be an Active Program with a user interface, a background app, or a service.

– Android is a multitasking operating system, meaning users can run multiple programs and tasks simultaneously. However, for each app, there exists only one instance. This prevents the abuse of resources and generally helps the system run more efficiently.

– Applications in Android are assigned user-specific ID numbers to differentiate their privileges when accessing resources, hardware configurations, and the system.

– Android is an open-source operating system, distinguishing it from many other mobile operating systems. It allows third-party applications to run in the background. However, these background apps have a minor restriction, as they are limited to using only 5-10% of the CPU capacity. This limitation is in place to prevent monopolization of CPU resources. Background apps do not have a fixed entry point or a primary method to start execution.

1.1.2 Overview of Android Malware

• Malware Definition

According to NIST [38], malware is defined as:

“Malware, also known as malicious code, refers to a program that is covertly inserted into another program intending to destroy data, run destructive or intrusive programs, or otherwise compromise the confidentiality, integrity, or availability of the victim’s data, applications, or operating system. Malware is the most common external threat to most hosts, causing widespread damage and disruption and necessitating extensive recovery efforts within most organizations”.

From the above definition, it can be seen that malware is harmful to users and systems. Understanding malware and how to prevent it helps protect users in today’s connected environment.


• Categories of Malware

The rise of malware has accompanied the development of the internet, especially now that all activities, including social and financial ones, can be performed online and are subject to anonymous attacks with malicious intent. Malware is classified into seven types, as shown in Table 1.1 below [38, 39]:

Table 1.1: Types of malware

Viruses: viruses self-replicate by inserting copies of themselves into host programs or data files. They are often triggered through user interaction, such as opening a file or running a program. Viruses can be divided into the following two subcategories:

– Compiled Viruses: a compiled virus is executed by an operating system. Types of compiled viruses include file infector viruses, which attach themselves to executable programs; boot sector viruses, which infect the master boot records of hard drives or the boot sectors of removable media; and multipartite viruses, which combine the characteristics of file infector and boot sector viruses.

– Interpreted Viruses: interpreted viruses are executed by an application. Within this subcategory, macro viruses take advantage of the capabilities of applications’ macro programming language to infect application documents and document templates. In contrast, scripting viruses infect scripts that are understood by scripting languages processed by services on the OS.

Example: ILOVEYOU, CryptoLocker, Tinba, Welchia, Shlayer.

Worms: a worm is a self-replicating, self-contained program that usually executes itself without user intervention. Worms are divided into two categories:

– Network Service Worms: a network service worm takes advantage of a vulnerability in a network service to propagate itself and infect other systems.

– Mass Mailing Worms: a mass mailing worm is similar to an e-mail-borne virus but is self-contained rather than infecting an existing file.

Example: Stuxnet, SQL Slammer.

Trojan Horses: a Trojan horse is a self-contained, nonreplicating program that, while appearing benign, actually has a hidden malicious purpose. Trojan horses either replace existing files with malicious versions or add new ones to systems. They often deliver other attacker tools to systems.

Example: Emotet, Triada.

Spyware: spyware is malware that can run secretly on the system without notifying users. To disrupt system processes, spyware aims to collect private information and grant remote access to bad actors. Spyware is often used to steal financial information or private user information.

Example: DarkHotel, Olympic Vision, Keylogger.

Adware: adware is the most commonly used malware to collect user data on the system and provide ads to users without permission. Even though adware is not always dangerous, in some situations it can cause system crashes. It can redirect browsers to unsafe websites that host Trojans and spyware. In addition, adware is one of the reasons for system lagging.

Example: Fireball.

Ransomware: ransomware is a kind of malware that gains access to private information on the system; it encrypts data to prevent user access, and the attackers then take advantage of the situation to blackmail users. Ransomware is usually delivered as part of phishing actions. The attacker encrypts information so that it can only be opened with the attacker’s key.

Example: RYUK, Robbinhood, Clop, DarkSide.

Fileless malware: fileless malware lives inside the memory. It is processed from the victim system’s memory (not from files on the hard disk). Thus, it is harder to detect than other classic malware. It also makes the encryption process harder because fileless malware disappears when the system restarts.

Example: Astaroth.

• Android Malware Overview

Android OS has consistently held a high share of the mobile operating system market. According to the statistics of [1], in June 2023 Android held 70.79% of the mobile market. Thus, Android OS’s vulnerabilities are attractive to hackers, as all social and financial activities can now be performed on mobile devices. According to AV-Test [2], new types of malware are still being created every year, along with the development of an open-source OS like Android. The malware increase from 2013 to March 2022 is shown in Fig. 1.2.

• Android Malware Characteristics

Malware is a growing threat to every connected individual in the age of mobile phones and the internet. Because of the financial incentives, the number and complexity of Android malware are growing, making it more difficult to detect. Android malware closely resembles the varieties of malware that users might be familiar with on their desktops, but it targets Android phones and tablets. Android malware primarily steals private information, which can be as common as the user’s phone number, emails, or contacts, or as critical as financial credentials. With that data, scammers have many unlawful options that can earn them substantial money. Several signs indicate that a mobile device has been infected by malware: (1) users often see sudden pop-up advertisements on their devices; (2) mobile batteries drain faster than usual; (3) users notice applications that they did not intentionally install; and (4) some apps do not appear on the screen after installation.

Figure 1.2: The increase of malware on Android OS

Android malware appears in many forms, such as trojans, adware, ransomware, spyware, viruses, phishing apps, or worms. Kaspersky investigated widespread malware in 2020 and 2021 and categorized it (Fig. 1.3) [40]. Malware often infiltrates via various traditional sources, such as harmful downloads in emails, browsing dubious websites, or following links from unknown senders.

Figure 1.3: Types of malware on Android OS

Common sources of Android malware:


– Applications that have been infected: attackers can collect popular programs, repackage them with malware, and redistribute them through download links. This method is so effective that many fraudsters design or advertise new apps; naive users may follow customized download links and accidentally download or install malware on their devices.

– Malvertisements: malvertising is malware embedded in and distributed through advertisements. A virus will be downloaded to the user’s device if the user clicks on one of these pop-ups. Blocking ads on the Android device is an effective way to prevent this kind of malware.

– Scams: phishing attacks and other standard email- or SMS-based frauds are examples of online scams. The email or message contains a link to malware, which is installed on the phone when the user clicks the link. It is one of the most common ways to infect Android phones.

– Direct download to the device: this is the most straightforward way to infect a device with malware. The attacker only needs to connect a gadget or USB device directly to the phone and install the malicious programs. However, this is difficult in practice because the attacker rarely gains direct access to the victim’s device.

1.2 Android Malware Classification Methods

Two techniques are often used for malware detection: signature-based and anomaly-based.

A signature-based approach is often employed in commercial antivirus products, as the detection results attain high accuracy and precision. Malware behaviors or features are retained in a database of samples or characteristics. A malware detection system (a detector) analyzes and recognizes malware based on one or several characteristics that match pre-defined patterns. Malware signatures can be static, known byte sequences, or behavior characteristics, such as network behavior. However, this method cannot detect unknown or zero-day malware, as their unique traits do not exist in the program database.

On the other hand, the anomaly-based method can detect unknown suspicious behavior. This method is usually based on machine learning techniques. The difference between normal and abnormal behavior can be modeled during training. Since 2017, machine learning, and deep learning in particular, have been extensively applied for malware detection on mobile devices.


1.2.1 Signature-based Method

In this method, the signatures of known malware samples are stored in a list of known threats and their indicators of compromise (IOCs). A signature can be extracted by static or dynamic analysis. To decide whether a sample is malware, the method compares the sample’s signature with all the signatures stored in the database.
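The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the simplest possible static signature (a SHA-256 hash of the file contents); the sample bytes and the `signature_db` set are hypothetical, and real products use richer signatures than a whole-file hash.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest, used here as a static signature."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(sample: bytes, signature_db: set) -> bool:
    """Flag the sample only if its signature is already in the database."""
    return sha256_of(sample) in signature_db


# Hypothetical database built from two previously analyzed samples.
known_samples = [b"malicious payload A", b"malicious payload B"]
signature_db = {sha256_of(s) for s in known_samples}

print(is_known_malware(b"malicious payload A", signature_db))   # known sample
print(is_known_malware(b"malicious payload A!", signature_db))  # one byte added
```

The second call differs from a known sample by a single byte and is no longer matched, which is the behavior behind the method's weakness against modified and zero-day samples.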

One of the attributes of the signature-based method is high accuracy. To achieve that, indicators stored in the database must be accurate, have comprehensive coverage, and be updated regularly, as new malware emerges rapidly. On the other hand, using a signature-based method is time-consuming. The larger the number of files or apps that need to be checked, the longer the testing time required, because the system needs to sequentially decompile each app, extract features, and then compare each feature with the patterns defined in the database. The program can often combine static and dynamic signatures, e.g., data extracted from the decompiled code and behavioral data collected while the app runs. The combination provides more comprehensive coverage, but the examination time increases considerably.

Permissions, API calls, class names, intents, services, or opcode patterns are often used to spot malware. In [16], Enck et al. proposed a security service for the Android operating system called Kirin. Kirin authenticates an app at installation time using a set of protection rules designed to match the properties configured in the app. The Kirin system evaluates the configuration extracted from the installer’s manifest file and compares it with the rules set up and saved in the system.
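A rule check of this kind can be sketched as follows. This is only an illustration of the idea of matching requested permissions against stored rules: the two rules in `RULES` are hypothetical examples written for this sketch, not the actual Kirin rule set, and the function names are assumptions.

```python
# Illustrative rules only: each rule is a set of permissions that must not
# all be requested together. These are NOT the actual Kirin rules.
RULES = [
    {"android.permission.RECEIVE_SMS", "android.permission.WRITE_SMS"},
    {
        "android.permission.RECORD_AUDIO",
        "android.permission.INTERNET",
        "android.permission.READ_PHONE_STATE",
    },
]


def violated_rules(requested, rules=RULES):
    """Return every rule whose permissions are all requested by the app."""
    requested = set(requested)
    return [rule for rule in rules if rule <= requested]


def passes_install_check(requested, rules=RULES):
    """An app passes only if it violates no rule."""
    return not violated_rules(requested, rules)


# Permissions as they might be read from an app's manifest (hypothetical).
sms_app = [
    "android.permission.RECEIVE_SMS",
    "android.permission.WRITE_SMS",
    "android.permission.INTERNET",
]
print(passes_install_check(["android.permission.INTERNET"]))
print(passes_install_check(sms_app))
```

Expressing a rule as a set makes the check a subset test, so a rule fires only when every permission it names is requested.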

Batyuk et al. [17] applied static analysis to 1865 top free Android apps retrieved from the Android Market. The experiments showed that at least 167 of the analyzed apps access private information such as IMEI, IMSI, and phone numbers. Of these, 114 apps read sensitive data and immediately write them to a stream, which indicates a significant privacy concern.

Dynamic analysis is highly effective when dealing with obfuscation techniques such as polymorphism, binary packers, and encryption. However, because the app must be run (even in a virtual environment), dynamic analysis takes more time than static analysis. Chen et al. [15] proposed an approach to identify dangerous samples on Android devices using static features and dynamic patterns. The static features were acquired by decompiling APK files and extracting the connections between the app’s classes, attributes, methods, and variables. The program also analyzes function calls and the relationships between data threads while the Android app runs. All that information can be used to deduce threat patterns and check whether the app accesses private data or conducts any illegal operation, e.g., sending messages without permission or stealing confidential information. The experiments in the report show that the dynamic signature-based method detected 91.6% of the malware in 252 samples.


Figure 1.4: Anomaly-Based Detection Technique

Despite the advantages mentioned above, there are two drawbacks to the signature-based detection method: (i) it cannot detect zero-day malware, and (ii) it can easily be bypassed by code obfuscation.

1.2.2 Anomaly-based Method

An anomaly-based method uses a different approach and can resolve these problems. It relies on heuristics and empirical running processes to detect abnormal activities. The anomaly-based detection technique consists of the training and detection stages, as presented in Fig. 1.4. This technique observes the normal behaviors of the app over a period and uses the attributes of the normal model as vectors against which abnormal behaviors are compared and detected. A set of normal behavior attributes is developed in the training stage. In the detection stage, when the “vectors” of the running app deviate from the model, that app is flagged as an anomalous program. This technique allows for recognizing even unknown malware and zero-day attacks.
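The two stages can be illustrated with a deliberately simple stand-in model. The sketch below assumes a per-feature mean and standard deviation as the "normal" profile and a z-score threshold for detection; the behavior vectors, feature meanings, and the threshold of 3.0 are hypothetical, and the models discussed in this dissertation are far richer than this.

```python
import math


def train(normal_vectors):
    """Training stage: learn per-feature mean and standard deviation."""
    n = len(normal_vectors)
    dims = len(normal_vectors[0])
    means = [sum(v[i] for v in normal_vectors) / n for i in range(dims)]
    stds = [
        # Fall back to 1.0 when a feature never varies, to avoid dividing by 0.
        math.sqrt(sum((v[i] - means[i]) ** 2 for v in normal_vectors) / n) or 1.0
        for i in range(dims)
    ]
    return means, stds


def is_anomalous(vector, model, threshold=3.0):
    """Detection stage: flag a vector if any feature deviates too far."""
    means, stds = model
    return any(abs(x - m) / s > threshold for x, m, s in zip(vector, means, stds))


# Hypothetical behavior vectors: [SMS sent per hour, connections per hour].
normal = [[0, 10], [1, 12], [0, 11], [1, 9], [0, 10]]
model = train(normal)

print(is_anomalous([1, 11], model))   # close to the learned profile
print(is_anomalous([40, 10], model))  # far more SMS than any normal run
```

Nothing in the detector refers to a specific malware sample, which is why this family of methods can flag behavior that has never been seen before.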

In an anomaly-based approach, application behaviors can be extracted in three ways: static analysis, dynamic analysis, or hybrid analysis. Static analysis examines the app’s source code before installation. Dynamic analysis runs the app and collects data during execution, for example, API calls, events, etc. Hybrid methods use both.

However, the abnormal and expected behaviors of the samples are not easily separated because of the large number of behaviors extracted. There is no fixed basis for determining which behavior is normal and which is not, and it is not feasible to divide these behaviors based solely on the analyst’s experience. Machine learning models are applied during training to minimize time and increase efficiency. When applying machine learning, the number of behaviors that should be fed into the training model can be enormous, as all behaviors must be collected as features. Nowadays, many machine learning models have been applied to malware detection, such as SVM (Support Vector Machine), KNN (K-Nearest Neighbors), RF (Random Forest), etc., and the modern deep learning models DNN (Deep Neural Network), DBN (Deep Belief Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GAN (Generative Adversarial Network), etc. Those models will be discussed in a later section of the dissertation.

Schmidt et al. [41] analyzed Linux ELF (Executable and Linking Format) object files in an Android environment using the command readelf. The function calls read from the executables are compared with the malware database for classification using the Decision Tree learner (DT), Nearest Neighbor (NN) algorithm, and Rule Inducer (RI). This technique shows 96% accuracy in the detection phase with 10% false positives. Schmidt et al. extended their function call-based technique to Symbian OS [42]. They extracted function calls from binaries and applied their centroid machine, based on a lightweight clustering algorithm, to distinguish benign and malware executables. The technique provides 70-90% detection accuracy and 0-20% false positives.

Schmidt et al. [43] proposed a framework to monitor smartphones running Symbian OS and Windows Mobile OS to extract system features for detecting anomalous apps. The proposed framework is based on monitoring clients that run on mobile devices, collect data describing the system state, such as the amount of free RAM, the number of running processes, CPU usage, and the number of SMS messages in the sent directory, and send it to the Remote Anomaly Detection System (RADS). The remote server contains a database to store the received features; the detection units access the database and run machine learning algorithms, e.g., AIS or SOM, to distinguish between normal and abnormal behaviors. A meta-detection unit weighs the detection results of the different algorithms. The algorithms were executed on four feature sets of different sizes, reducing the set of features from 70 to 14, thus saving 80% of disk space and significantly reducing computation and communication costs. Consequently, the approach positively influences battery life and has a small impact on true positive detection.

Only the machine learning methods applied to malware detection on the Android system will be discussed in this dissertation The next chapter will detail the analysis to get the behaviors or features by static, dynamic, and hybrid methods.

1.2.3 Android Malware Classification Evaluation Metrics

In recognition and classification problems, some commonly used measures are Accuracy (Acc), Precision, Recall, F1-score, confusion matrix, ROC curve, Area Under the Curve (AUC), etc. For classification problems with multiple outputs, there are slight differences in the use of measures.


1.2.3.1 Metrics for the Binary Classification Problem

In the detection problem, the output has only two labels, commonly called Positive and Negative, where Positive indicates an app is malware, and Negative indicates the opposite. Hence, four definitions are provided:

• TP (True Positive): apps correctly classified as malware.

• FP (False Positive): apps mistakenly classified as malware.

• TN (True Negative): apps correctly classified as benign.

• FN (False Negative): apps mistakenly classified as benign.

While evaluating, the ratio (rate – R) of these four measures is considered:

• TPR = TP/(TP + FN): True Positive Rate.

• FNR = FN/(TP + FN): False Negative Rate.

• FPR = FP/(FP + TN): False Positive Rate.

• TNR = TN/(FP + TN): True Negative Rate.
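As a small worked example of the four rates, the sketch below computes them from confusion-matrix counts. The function name `rates` and the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def rates(tp, fp, tn, fn):
    """Return TPR, FNR, FPR and TNR from the four confusion-matrix counts."""
    positives = tp + fn  # all apps that are actually malware
    negatives = fp + tn  # all apps that are actually benign
    return {
        "TPR": tp / positives,
        "FNR": fn / positives,
        "FPR": fp / negatives,
        "TNR": tn / negatives,
    }


# Hypothetical test set: 100 malware apps and 200 benign apps.
r = rates(tp=90, fp=10, tn=190, fn=10)
print(r)
```

Because TPR and FNR share the denominator TP + FN, they always sum to 1, and the same holds for FPR and TNR.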

Among the four measures above, FNR is crucial: the higher this ratio, the less trustworthy the model is, because more malware apps are mistakenly recognized as benign. The FPR, or false alarm rate, measures how often benign apps are mistaken for malware, but it is not as important as the FNR.

The most popular and simplest measure is Accuracy (Acc), given in Equation 1.1.

\[ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1.1} \]


Acc is often used for problems where the numbers of positive and negative samples are equal. For problems with a large imbalance between the numbers of positive and negative samples, the Precision, Recall, and F1-score measures are often used.

• Precision is defined as the ratio of TP samples among those classified as positive (TP + FP). The formula for calculating Precision is shown in Equation 1.2.

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1.2} \]

• Recall is defined as the ratio of TP samples to those that are actually positive (TP + FN). The formula for calculating Recall is shown in Equation 1.3.

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1.3} \]
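To see why accuracy alone can mislead on imbalanced data, the following sketch evaluates a hypothetical detector on a test set with very few malware samples. The counts are invented for illustration, and the F1-score uses the conventional harmonic mean of precision and recall, which is assumed here to be the form intended by the text.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (conventional definition)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical imbalanced test set: 10 malware apps and 990 benign apps.
tp, fn, fp, tn = 2, 8, 2, 988

print("Acc      ", accuracy(tp, fp, tn, fn))
print("Precision", precision(tp, fp))
print("Recall   ", recall(tp, fn))
print("F1       ", f1_score(tp, fp, fn))
```

With these counts the detector misses 8 of the 10 malware apps, yet accuracy is 0.99; recall (0.2) and F1 expose the problem that accuracy hides. The helpers do no zero-division handling, so they assume at least one predicted and one actual positive.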


1.2.3.2 Metrics for Multi-labelled Classification Problem

When the classification problem has multiple output labels, it can be reduced to a detection problem for each class, considering the data belonging to the class under consideration to be positive and all the remaining data to be negative. Thus, there is a pair of precision and recall values for each class. The concepts of micro-average and macro-average are used to evaluate the classification problem.

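Micro-average precision and micro-average recall are calculated as in Equation 1.5:

\[ \text{Micro-average precision} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)}, \qquad \text{Micro-average recall} = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FN_c)} \tag{1.5} \]

where TPc, FPc, and FNc are, respectively, the TP, FP, and FN of class c.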

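Macro-average precision is the average of the per-class precisions; macro-average recall is defined in the same way (called Recall here: the average rate of correct classification over each malware class and the benign class), as given in Equation 1.6:

\[ \text{Macro-average precision} = \frac{1}{C} \sum_{c} \mathrm{Precision}_c, \qquad \text{Macro-average recall} = \frac{1}{C} \sum_{c} \mathrm{Recall}_c \tag{1.6} \]

where C is the number of classes. Of the measures above, Acc and Recall are used for the classification problem in the experiments.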

1.2.4 Android Malware Dataset

Many datasets have been published for the research community as follows:

• Contagio mobile: released in 2010 and last updated in 2010. It consists of only 189 malware samples, without benign ones. This dataset is public to users.

• Malgenome: samples were collected from 2010 to 2011 and published in 2012. The dataset contains 1260 malware samples. However, the dataset was decommissioned in 2021.

• Virusshare: this is a repository of malware. It has been publicly available to users since 2011. This dataset only includes malware files without labels.

• Drebin: samples were collected from 2010 to 2012 and published in 2014. The dataset consists of 5560 malware samples divided into 178 malware families. The number of files in each family is not balanced. Some families have only one or a few (fewer than 10) malware files, while others may have more than 1000 files. Furthermore, Drebin also provides 123,453 benign samples in the form of extracted features.

• PRAGuard: released in 2015, PRAGuard consists of 10479 malware samples without malware family labels. PRAGuard was created by obfuscating the MalGenome and Contagio Minidump data with seven different obfuscation techniques. In April 2021, this dataset was decommissioned.

• Androzoo: Androzoo was created in 2016 and is still being updated. It provides both malware and benign apps in large quantities. However, Androzoo only provides the apps, which have not been classified into families. So far, more than 20 million files are offered in the form of APKs.

• AAGM: made public in 2017, it consists of 3 categories: Adware with 250 apps, General Malware with 150 apps, and Benign with 1500 apps from Google Play.

• AMD: malware was collected from 2010 to 2016 and made public in 2017, including 25,553 files from 71 families. This dataset was decommissioned in 2021.

• CICMalDroid 2020: samples collected in 2018 and published in 2020 with a size of 13,077 files in 5 categories (Adware, Banking Malware, SMS Malware, Mobile Riskware, Benign).

• InvesAndMal (CIC MalDroid2019): samples collected in 2017 and published in 2019 with 5491 files. This dataset is divided into four categories (Adware, Ransomware, Scareware, SMS malware) and consists of 42 families within the above categories. Benign samples make up most of the dataset, accounting for 5000 samples. It is currently still public.

• MalNet2020: the dataset was published in December 2020 with 1,262,024 samples. This dataset is essentially downloaded from Androzoo but comes with features extracted as FCGs (Function Call Graphs) and images. The dataset is divided into 696 malware families and 47 malware types. The APK files cannot be downloaded directly from MalNet’s homepage (https://mal-net.org/); instead, the authors provide only the SHA256 hashes needed to download them from Androzoo.

In the experiments in the doctoral dissertation (including experiments in journal articles and conferences), the following datasets were used:

• Virusshare: in the conference paper FAIR [Pub.4], a small number of samples, including 500 (250 malware and 250 benign), were used Since the number of malware and benign programs is balanced, the only measure to apply is accuracy.


• Drebin: this is a well-known dataset used in many papers by local and foreign authors. During this research work, the Drebin dataset was used repeatedly:

– [Pub.1]: this research experimented on the entire Drebin dataset (including both the benign and malware samples provided). The article showed that using a CNN model had advantages over the original Drebin SVM model. Because the Drebin dataset has a significant imbalance of samples between families, additional measures were also applied to obtain a better evaluation.

– [Pub.2, Pub.6]: these journal articles used the entire Drebin malware set combined with 7,140 benign samples from a different source. Multiple measures were used to evaluate the feature selection in the papers.

• AMD: similar to Drebin, this dataset is widely used by researchers due to the large quantity and variety of samples.

– In [Pub.10], the 65 families with the most samples were selected for the research. The [Pub.11] study used the families of the AMD dataset with at least 20 samples (35 families).

– In [Pub.3], Drebin and AMD were employed as malware data.

The datasets are summarized in Table 1.2 as follows:

Table 1.2: Summary of Android malware datasets


Standard of dataset evaluation:

To evaluate the quality of a dataset, the dissertation uses several criteria: the number of samples, the number of labels, the distribution of samples among classes, and how recently the dataset was updated. These criteria can help ensure that the dataset is comprehensive, well-labeled, balanced, and up-to-date, which can increase the reliability and generalization of the research results.
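The criterion on sample distribution can be checked with a few lines of code. The sketch below is one simple way to quantify balance, assuming the ratio between the largest and smallest class as the indicator; the family names and counts are hypothetical, and this ratio is an illustrative choice rather than a measure defined in the dissertation.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts and the largest-to-smallest class ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical family labels for a small, heavily skewed malware set.
labels = ["family_a"] * 1000 + ["family_b"] * 50 + ["family_c"] * 5
counts, ratio = class_distribution(labels)

print(len(labels), "samples in", len(counts), "classes")
print(counts.most_common())
print("imbalance ratio:", ratio)
```

A ratio close to 1 indicates balanced classes; a large ratio signals that accuracy alone may not describe per-family performance well.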

The quality of classification depends on the dataset:

Among the above datasets, some are suitable for malware detection tasks (those that only provide malware not divided into malware families) and others for classification tasks (those in which the malware is divided into many families). Malware datasets also need to be combined with a separate benign set (which can be downloaded from sources such as Androzoo, Google Play, etc.). With the same machine learning or deep learning algorithm, each dataset gives different results. This is because the features extracted from each dataset (a set of samples) are different. Even assuming that all datasets have good quality, there is still a clear difference between datasets due to their different years of publication. Each year, Google releases new Android versions with many changes, so the features extracted from each set differ. Some datasets have specific features, such as datasets containing C++ code instead of just Java code, datasets containing obfuscated code that is not as readable as regular code, encrypted datasets, or datasets whose code has been rearranged in different positions, etc. From the above, it can be seen that the quality of each dataset significantly affects the classification quality.

Modifying and improving the dataset:

The investigation conducted in the dissertation indicates that the labeled datasets are imbalanced across malware families. The Virusshare and Androzoo datasets, which provide APK files without inherent labels, are also skewed toward specific labels when processed with labeling software. Consequently, this research has incorporated several supplementary evaluation metrics, including but not limited to precision, recall, and F1-score, to provide a more comprehensive assessment across families of varying sizes.

1.3 Machine Learning-based Method for Android Malware Classification

The problem of malware classification on the Android platform is described in Fig. 1.5. In general, there are four steps involved in Android malware classification.

1 Feature extraction

An APK file is a compressed file containing other files such as AndroidManifest.xml (later called the XML file), classes.dex (later called the DEX file), etc.

Features extracted from APK files form a dataset and serve as input to training models. Features are critical to a model and are the key components on which the model bases its decisions. Arrays of features can be collected via static analysis, dynamic analysis, or hybrid techniques and then tailored to make a feature set. For example, the DEX file can be transformed into image features, groups of features such as permissions, API calls, intents, etc., can be collected, or the code can be transformed into “smali” files. A set of extracted features can be defined as a “raw feature dataset.”
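Because an APK is a ZIP archive, its contents can be listed with standard tools. The sketch below uses Python's `zipfile` module on a tiny stand-in archive built in memory, so it runs without a real APK; the helper names `list_apk_entries` and `read_entry` and the placeholder file contents are assumptions for illustration.

```python
import io
import zipfile


def list_apk_entries(apk_bytes: bytes):
    """Return the names of the files packed inside an APK (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return apk.namelist()


def read_entry(apk_bytes: bytes, name: str) -> bytes:
    """Return the raw bytes of one file stored inside the APK."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return apk.read(name)


# Build a tiny stand-in archive so the sketch runs without a real APK.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("AndroidManifest.xml", b"<manifest/>")   # placeholder content
    z.writestr("classes.dex", b"dex\n035\x00")          # placeholder content
fake_apk = buffer.getvalue()

print(list_apk_entries(fake_apk))
print(read_entry(fake_apk, "classes.dex")[:3])
```

In a real APK the manifest is stored in a compiled binary XML format, so reading permissions or intents from it requires a decoding tool rather than a plain text parser.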

2 Feature selection

Features of the raw feature dataset are transformed into binary form, and these values can then be represented in different ways:

• Images: the text is transformed into binary or hex code, whose values are collected as image pixels.

• Frequency: the occurrence frequency of attributes (permissions, API calls, etc.) in the APK file can be transformed into features.

• Binary encoding: if the defined behavior takes place, the value is “1”; otherwise, it is “0”.
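The three representations above can be sketched as follows. This is a minimal illustration under stated assumptions: the feature vocabulary, the observed attribute list, and the image width are hypothetical, and the byte-to-pixel layout shown is one simple choice among many.

```python
# Hypothetical feature vocabulary; a real one would hold many more entries.
VOCABULARY = ["SEND_SMS", "INTERNET", "READ_CONTACTS"]


def binary_encode(observed, vocabulary=VOCABULARY):
    """1 if the attribute occurs in the app at least once, else 0."""
    present = set(observed)
    return [1 if name in present else 0 for name in vocabulary]


def frequency_encode(observed, vocabulary=VOCABULARY):
    """Number of times each attribute occurs in the app."""
    return [observed.count(name) for name in vocabulary]


def bytes_to_image(data: bytes, width: int):
    """Lay bytes out as rows of grayscale pixels (0-255), zero-padding the end."""
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Attributes as they might be extracted from one app (hypothetical).
observed = ["INTERNET", "SEND_SMS", "INTERNET"]

print(binary_encode(observed))
print(frequency_encode(observed))
print(bytes_to_image(b"\x00\x10\xff\x7f\x01", width=2))
```

Binary and frequency vectors share the same fixed length, the size of the vocabulary, which keeps the model input shape constant across apps; the image form instead depends on the file size and the chosen width.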