Machine Learning Features for Malicious URL Filtering- The Survey

Smaranya Dey; Eshan Jain; Arunita Das


A malicious URL is a URL created for harmful purposes, such as hosting spam, phishing pages, or misleading applications like fake antivirus software or fake codecs. Using such URLs can lead to monetary loss, theft of sensitive information such as personal details or corporate data, disruption of operations, and unauthorized access to system resources. These websites are often built to look like genuine ones in order to deceive users into installing malicious content on their systems. According to the Netcraft January 2018 Web Server Survey, there are 1.8 billion sites across 213 million unique domain names. According to the Symantec Internet Security Threat Report 2018, 1 in 13 web requests leads to malware, up 3% from 2016. The sudden rise in cyber-attacks in recent years makes addressing this problem essential for both private and public organizations. The primary objective of this paper is to provide a near-exhaustive set of meaningful features that professionals and practitioners can use in their own research and practical applications on malicious URL filtering. These features are systematically classified and described under the following categories: keyword-based features, lexical features, content-based features (HTML and JavaScript), IP address property-based features, and web-rank and score-based features. The paper also briefly discusses how URL filtering techniques have evolved. It covers traditional techniques such as URL blacklisting and heuristic approaches, and highlights their shortcomings. It then touches on newer machine learning based techniques such as cosine similarity-based URL classification, Support Vector Machines, and Neural Network based models.
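To make the keyword-based and lexical feature categories named in the abstract more concrete, the following is a minimal sketch of how such features might be computed from a raw URL string. It is not taken from the paper: the function names, the specific features, and the keyword list are all illustrative assumptions, and only the Python standard library is used.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical keyword list, chosen only for illustration (not from the paper).
SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "free")


def host_is_ip(host: str) -> bool:
    """Return True if the host part is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> dict:
    """Compute a few simple lexical and keyword-based features of a URL.

    The feature set here is an assumed example, not the survey's own list.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "host_length": len(host),
        "dot_count": host.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "hyphen_count": url.count("-"),
        "host_is_ip": host_is_ip(host),
        # Number of distinct keywords that appear anywhere in the URL.
        "keyword_hits": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


if __name__ == "__main__":
    print(lexical_features("http://192.168.0.1/login-verify"))
```

A dictionary like this could be turned into a numeric vector and passed to a classifier such as a Support Vector Machine; which features actually carry signal would have to be checked against labeled data.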

Comments: 8 Pages.


Submission history

[v1] 2019-06-21 01:17:46

Unique-IP document downloads: 234 times

Vixra.org is a pre-print repository rather than a journal. Articles hosted may not yet have been verified by peer-review and should be treated as preliminary. In particular, anything that appears to include financial or legal advice or proposed medical treatments should be treated with due caution. Vixra.org will not be responsible for any consequences of actions that result from any form of use of any documents on this website.


Artificial Intelligence
