The goal of this project is to evaluate different machine learning techniques for classifying PE (Portable Executable) files as malicious or benign.
The data set can be found here.
Each sample has more than 70 features obtained through static analysis of its corresponding PE file (e.g. SizeOfCode, SectionMaxEntropy).
The data set contains 19,611 samples, of which 14,599 are malicious and 5,012 are benign.
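The class counts above imply a noticeable imbalance, which matters when reading accuracy figures: a trivial classifier that always answers "malicious" would already be right on the malicious share of the samples. A minimal sketch of that arithmetic, using only the counts stated above (the variable names are illustrative):

```python
# Class counts as stated in the data set description.
n_malicious = 14_599
n_benign = 5_012
n_total = n_malicious + n_benign

# Share of each class; the malicious share is also the accuracy of a
# trivial "always predict malicious" baseline.
malicious_share = n_malicious / n_total
benign_share = n_benign / n_total

print(n_total)                   # 19611
print(f"{malicious_share:.1%}")  # 74.4%
print(f"{benign_share:.1%}")     # 25.6%
```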
We used three supervised classification techniques:
- K Nearest Neighbors
- Support Vector Machine (3rd order polynomial kernel)
- Logistic Regression
We performed 5-fold cross-validation on each technique to find its optimal hyperparameter: the number K of neighbors for K Nearest Neighbors, the C value that tunes the margin for the Support Vector Machine, and the regularization parameter for Logistic Regression.
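The hyperparameter search described above could be sketched as follows with scikit-learn. This is an illustration under several assumptions, not the project's actual code: the text does not say which library was used, the data here is synthetic (generated with `make_classification` as a stand-in for the PE features), and the candidate grids, feature scaling, and stratified, shuffled folds are choices made for the example. In scikit-learn, `C` is the inverse of the regularization strength for both `SVC` and `LogisticRegression`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the PE feature matrix (assumption: not the real data).
# The class weights only roughly mimic the benign/malicious imbalance.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8,
    weights=[0.25, 0.75], random_state=0,
)

# One hyperparameter per technique; the candidate values are assumptions.
searches = {
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [1, 3, 5, 7, 9]}),
    "svm_poly3": (SVC(kernel="poly", degree=3), {"clf__C": [0.1, 1, 10]}),
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
}

# 5-fold cross-validation, stratified so each fold keeps the class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

best = {}
for name, (clf, grid) in searches.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training. Given the class imbalance, a scoring metric other than accuracy may be worth considering.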