Today’s networked world makes every system an easy target for cyberattacks. Automated tools make it easier for attackers to execute successful attacks and a new threat emerges almost every second. In this environment, it’s hard for cybersecurity to keep up. According to Cybersecurity Ventures, cybercrime is expected to cause $6 trillion (US) worth of damages globally in 2021. The damage could reach $10.5 trillion annually by 2025.
Keeping up in today’s cyber threat environment means persistently tracing and correlating millions of external and internal data points across your organization’s users and infrastructure. You clearly can’t do this with people alone; you need machine learning, which can recognize patterns and predict threats in massive data sets, all at machine speed. In this blog, I’ll discuss why machine learning (ML) is so crucial and share an example illustrating the development of an ML algorithm for identifying phishing websites.
Why machine learning
Using machine learning models, cybersecurity teams can rapidly detect threats and isolate them for in-depth investigation. Machine learning can look at groups of network requests or traffic with similar characteristics and identify anomalies. ML algorithms continuously analyze data to find patterns that help detect malware in traffic. They can also predict malicious activity and protect data by detecting suspicious user behavior.
The right ML model can detect never-before-seen malware that is attempting to run on endpoints. It can spot new malicious files and events based on the attributes and behaviors of known malware. ML techniques include dimensionality reduction (converting many dimensions into fewer ones), clustering (identifying groups of items with similar characteristics), and statistical sampling. They can also help us use statistical information to develop baselines that describe normal behavior. With a baseline in place, we can use data to identify deviations from the normal.
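The baseline idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the function name `find_anomalies`, the three-standard-deviation threshold, and the requests-per-minute numbers are all assumptions made up for the example.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observed, threshold=3.0):
    """Return observed values that lie more than `threshold` standard
    deviations away from the mean of the baseline sample."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [x for x in observed if abs(x - mu) > threshold * sigma]


# Hypothetical requests-per-minute counts from a period of normal activity.
baseline_counts = [98, 102, 101, 97, 103, 99, 100, 104, 96, 100]

# New observations to compare against the baseline.
new_counts = [101, 99, 250, 97]

anomalies = find_anomalies(baseline_counts, new_counts)
print(anomalies)  # [250]
```

A real deployment would build the baseline from far more data and would likely track several signals at once, but the principle is the same: learn what normal looks like, then flag what deviates from it.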
Phishing URL detection using machine learning
Phishing is a common type of cyberattack in which a cybercriminal sends a fraudulent message designed to deceive an individual into revealing sensitive information or installing malicious software, such as ransomware, on the target’s infrastructure.
Machine learning algorithms are among the most powerful and successful techniques for detecting phishing websites. Phishing attacks share some common characteristics, which machine learning methods can identify.
Using a dataset made up of important characteristics, or attributes, of URLs, I was able to predict phishing websites by implementing a machine learning model. For more information on the dataset, visit the UCI Machine Learning Repository. Below is the snippet of Python code. I imported this dataset into the OpenText Magellan Notebook. OpenText Magellan delivers a ready-to-use, AI-powered analytics platform that includes machine learning, data discovery, text analytics, and sophisticated visualization and dashboarding. Learn more about Magellan.
Data Exploration
The dataset has 30 features. Here I explored some of them. The dataset page has a detailed description of each feature and of how its value is derived by applying conditions (such as length, PageRank, Google index, and age) to the attributes of the target URL.
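To make the idea of rule-derived features concrete, here is a small sketch of how a few URL attributes could be turned into categorical values. The 1/0/-1 encoding, the length thresholds, and all function names are assumptions chosen for illustration; the dataset documentation remains the authority on the actual rules.

```python
import re
from urllib.parse import urlparse

# Assumed encoding for this sketch: 1 = legitimate, 0 = suspicious, -1 = phishing.
LEGITIMATE, SUSPICIOUS, PHISHING = 1, 0, -1


def url_length_feature(url):
    """Longer URLs are treated as more suspicious (thresholds are illustrative)."""
    n = len(url)
    if n < 54:
        return LEGITIMATE
    if n <= 75:
        return SUSPICIOUS
    return PHISHING


def at_symbol_feature(url):
    """An '@' in the URL can hide the real destination from the user."""
    return PHISHING if "@" in url else LEGITIMATE


def ip_address_feature(url):
    """A raw IPv4 address in place of a domain name is treated as a phishing signal."""
    host = urlparse(url).hostname or ""
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    return PHISHING if is_ip else LEGITIMATE


def extract_features(url):
    """Collect the three illustrative features into one dictionary."""
    return {
        "url_length": url_length_feature(url),
        "has_at_symbol": at_symbol_feature(url),
        "has_ip_address": ip_address_feature(url),
    }


features = extract_features("http://192.168.0.1/login")
print(features)
```

Each feature reduces a raw property of the URL to a small set of values, which is what lets a classifier learn from them directly.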
Below is the correlation heatmap; each square shows the correlation between the variables on each axis.
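The numbers behind a heatmap like this are pairwise Pearson correlation coefficients. The sketch below computes them in plain Python for three toy columns; the column names and values are invented for the example and are not taken from the dataset. In a notebook, the same matrix would more typically come from pandas `DataFrame.corr()` and be drawn with a plotting library such as seaborn.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict holding the correlation for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy data: two hypothetical features and the class label.
columns = {
    "url_length": [1, 1, -1, -1, 0, 1],
    "has_at_symbol": [1, 1, -1, -1, 1, 1],
    "label": [1, 1, -1, -1, -1, 1],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values near 1 or -1 indicate features that move closely with (or against) each other or the label; values near 0 indicate little linear relationship.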
A RandomForestClassifier was fitted on the training dataset and applied to the test dataset. The model reaches around 97 percent accuracy.
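The fit-and-evaluate step can be sketched with scikit-learn as follows. This is not the original notebook code: it uses synthetic data from `make_classification` with 30 features as a stand-in for the phishing dataset, and the split ratio, number of trees, and random seeds are assumptions, so the accuracy it prints will differ from the figure quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing dataset: 1,000 rows, 30 features, 2 classes.
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=10, random_state=42
)

# Hold out 20 percent of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the random forest on the training split only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on rows the model has not seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))
```

Keeping the test split separate from training is what makes the reported accuracy a fair estimate of how the model behaves on new URLs.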
The classification report shown below is used to measure the quality of predictions from the algorithm. It displays the model’s precision, recall and F1 score. The metrics are calculated by using true and false positives and true and false negatives. There are four ways to check if the predictions are right or wrong:
- TN / True Negative: when a case was negative and predicted negative
- TP / True Positive: when a case was positive and predicted positive
- FN / False Negative: when a case was positive but predicted negative
- FP / False Positive: when a case was negative but predicted positive
Precision: Accuracy of positive predictions.
Precision = TP/(TP + FP)
Recall: Fraction of positives that were correctly identified.
Recall = TP/(TP+FN)
The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
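As a worked example, the formulas above can be applied directly to a set of confusion-matrix counts. The counts below are made up for illustration and do not come from the model in this post; the function name `precision_recall_f1` is likewise just a label for the sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1


# Hypothetical counts for 1,000 classified URLs.
tp, fp, fn, tn = 90, 10, 30, 870

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Precision: {precision:.2f}")  # 0.90
print(f"Recall:    {recall:.2f}")     # 0.75
print(f"F1 score:  {f1:.2f}")         # 0.82
print(f"Accuracy:  {accuracy:.2f}")   # 0.96
```

Note how accuracy (0.96) looks better than recall (0.75) here: when most cases are negative, accuracy alone can hide the fact that a quarter of the positive cases were missed.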
The accuracy can be further improved by applying other algorithms or tuning the parameters; however, this blog is mainly focused on demonstrating one of the use cases leveraging ML in cybersecurity.
Data plays a vital role in machine learning, and the availability of quality data that reflects the environment will reduce false positives. As this example shows, machine learning used as a complement to existing cybersecurity practices can make them more proactive and efficient.
The OpenText™ Professional Services team has years of experience and can offer organizations multiple options for addressing cyber security objectives using AI & Analytics Services. OpenText™ EnCase™ Endpoint Security incorporates AI, automation, and machine learning to identify threats in near-real time and at scale. Visit our website to learn more about OpenText AI & Analytics Services and OpenText Security Services.
Author: Sridhar Sambarapu, Data Scientist, Professional Services – Center of Excellence