Head of SafeDNS’s Machine Learning: “ML is not a magic wand”
In this interview, we sit down with Jurgen Lorenz, the Head of Machine Learning at SafeDNS, to gain insights into the key role that machine learning plays in the company's web filtering solutions. Jurgen sheds light on how the department identifies website categories, reveals the intricacies of training the ML models, describes the challenges of adapting the models to different contexts, explains what differentiates SafeDNS from industry competitors, and outlines SafeDNS's plans.
Well, mainly our department is responsible for identifying website categories. The ML specialists focus on crawling and parsing websites in a timely manner, processing third-party sources with thematic lists of domains, balancing and verifying records from all sources, and assembling the final database builds. These builds serve as the foundation for our work and the solutions we provide to our clients.
SafeDNS employs various models to classify texts, with a training dataset built from manually tagged websites on specific topics. Mathematical models, such as binary classifiers and neural networks for different languages, are prepared to determine whether a site corresponds to a given topic. Each prediction comes with a certain probability, and over 100 models are used to assess sites consistently. The final verdict is reached by aggregating these predictions, taking into account the trust level assigned to each data source. We use text models as well as models for image processing (specifically for identifying explicit content) and heuristics-based models for alternative site classification.
The main challenge lies in the small number of sites available as training samples in a particular language. Additional complexities arise with logographic scripts and rare dialects, and when working with regions in Asia and Southern countries. To address these challenges, we utilize synthetic data and, in some cases, translate language models from more popular languages. Working with English is advantageous, as over half of the world's internet content is in this language.
To understand site popularity and facilitate additional categorization in new regions, we analyze user logs.
Here I should add that the job takes us to the most hidden, sometimes darkest corners of the Internet. It is quite an adventure, really. Thanks to this, our solution is able to identify and categorize resources even in the rarest languages, which is definitely our advantage.
There is no surprise there. Key metrics include accuracy and error rate.
Sure, we do rely on customer feedback to initiate timely retraining of models in case of expected degradation. As I said earlier, we are able to categorize websites in quite rare languages. Actually, our clients and their feedback help us a lot in terms of recategorizing those types of resources.
I am calling it. Just kidding. First of all, SafeDNS is a company with 13 years of solid experience, a diverse client base that ensures comprehensive error correction, and trusted partners that provide us with domain list improvement.
Secondly, I would like to underline that our text models (currently we have more than 1000 of them) are trained on a huge number of different and complex resources, which means that we do not just look through the Wikipedia pages and that is it. Our crawlers, just like search engines, go through websites once a month and do so at high speed.
What else makes us different? Well, I guess, it is our unique database: it contains 2 billion URL records and includes 20% more phishing sites than other companies’ databases. Those within the industry know that URL categorization is far more complex to accomplish, so you simply cannot afford to have a smaller database.
Seems like I could go on forever answering this question, right? I just want to add one more thing: the fact that we use Passive DNS technology lets us track connections between domains from a historical perspective. For example, say you visit a random domain. We can see that a while ago there was a phishing site pointing to the same IP. The SafeDNS filtering will check that domain more often than the others and lower its rating, since its reputation is obviously questionable.
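The Passive DNS idea described above can be sketched as a lookup over historical domain-to-IP resolutions. This is a simplified illustration under stated assumptions, not SafeDNS's implementation: the `Resolution` record, the example domains and addresses, and the one-day recheck interval for suspicious domains are all hypothetical. The 30-day default mirrors the monthly crawl mentioned earlier.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Resolution:
    domain: str      # domain that was resolved
    ip: str          # IP address it pointed to
    last_seen: date  # last date this mapping was observed


def shared_ip_neighbours(domain: str, history: list[Resolution]) -> set[str]:
    """Return other domains that have ever pointed to one of this domain's IPs."""
    ip_to_domains: dict[str, set[str]] = defaultdict(set)
    for r in history:
        ip_to_domains[r.ip].add(r.domain)
    own_ips = {r.ip for r in history if r.domain == domain}
    neighbours: set[str] = set()
    for ip in own_ips:
        neighbours |= ip_to_domains[ip]
    neighbours.discard(domain)
    return neighbours


def recheck_interval_days(domain: str, history: list[Resolution],
                          known_phishing: set[str],
                          normal: int = 30, suspicious: int = 1) -> int:
    """Recheck sooner if the domain ever shared an IP with a known phishing site."""
    if shared_ip_neighbours(domain, history) & known_phishing:
        return suspicious
    return normal


# Hypothetical history using reserved documentation addresses and domains.
history = [
    Resolution("old-phish.example", "192.0.2.10", date(2022, 1, 5)),
    Resolution("new-shop.example", "192.0.2.10", date(2024, 3, 1)),
    Resolution("blog.example", "198.51.100.7", date(2024, 3, 1)),
]
known_phishing = {"old-phish.example"}
shop_interval = recheck_interval_days("new-shop.example", history, known_phishing)
blog_interval = recheck_interval_days("blog.example", history, known_phishing)
```

Here `new-shop.example` inherits suspicion from the phishing site that once used the same address, so it is rechecked daily, while `blog.example` stays on the normal schedule. A real system would presumably also weigh how long ago the overlap occurred and whether the IP belongs to shared hosting.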
And finally, if we speak about machine learning particularly, I should say that despite recent hype around this phenomenon, it is not a magic wand and requires human involvement. Even a classification accuracy of 99% in a database of 100 million records results in 1 million errors, which is a huge number. That is why we pay serious attention to the human factor. There is manual tagging and 24/7 top-notch technical support to handle error-related issues.
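The arithmetic behind that figure is easy to check. The short sketch below computes the expected number of misclassified records for a given accuracy. The function name is hypothetical, and the 99.9% case is an extra illustration that is not taken from the interview.

```python
def expected_errors(total_records: int, accuracy: float) -> int:
    """Expected number of misclassified records at a given accuracy."""
    return round(total_records * (1 - accuracy))


# The case from the interview: 99% accuracy on 100 million records.
errors_99 = expected_errors(100_000_000, 0.99)

# Hypothetical comparison: even 99.9% accuracy leaves a large backlog.
errors_999 = expected_errors(100_000_000, 0.999)
```

At 99% accuracy this gives 1,000,000 errors, and at 99.9% it still gives 100,000, which is why manual review remains part of the process.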
We are planning to move towards a list of new categories, introducing more segmented topics. While we currently have 60+ categories, we aim to expand to 120+, allowing for better segmentation of domains and increased accuracy.