How do we fill our database?

General overview

The database of the SafeDNS content filtering service is updated daily and contains more than 10 million sites. The database is replenished from SafeDNS's own data, produced by a machine learning system and a moderation service, as well as from various public and private security feeds (phishing, infected sites, botnet command-and-control servers, etc.).

Every external source is continuously checked for quality. Data from external sources is merged into a single SafeDNS database by special algorithms that eliminate duplicates and the categorization errors of individual sources.

In addition, an important part of the service database consists of resources added manually by moderators, both in response to user requests (more than 500 new sites every day) and from their own lists (to guarantee the operation of infrastructure services and programs).

The machine learning system currently holds over 100 million sites and over 450 million individual pages in its index. Our own AI/ML adds 10 to 15 million new pages to the index every day.
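As a rough illustration of the kind of merging described above, the sketch below combines several external feeds into one mapping, collapses duplicate entries, and flags domains whose sources disagree on the category. It is not SafeDNS production code; all names and data structures are illustrative assumptions.

```python
# Minimal sketch: combine per-feed {domain: category} data into one database,
# dropping duplicates and flagging conflicting categorizations for review.

def combine_feeds(feeds):
    """feeds: list of {domain: category} dicts, one per external source."""
    combined = {}   # domain -> category kept in the unified database
    conflicts = {}  # domain -> set of disagreeing categories
    for feed in feeds:
        for domain, category in feed.items():
            if domain not in combined:
                combined[domain] = category          # first time we see this domain
            elif combined[domain] != category:
                # Sources disagree: record all proposed categories for re-checking.
                conflicts.setdefault(domain, {combined[domain]}).add(category)
    return combined, conflicts

feeds = [
    {"example-phish.test": "phishing", "ads.example.test": "ads"},
    {"example-phish.test": "phishing", "ads.example.test": "malware"},
]
combined, conflicts = combine_feeds(feeds)
print(combined)   # duplicate entries collapsed into one record per domain
print(conflicts)  # {'ads.example.test': {'ads', 'malware'}} -> needs re-check
```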


How do we fill our database?

We use more than 60 sources to build the database. Some of these sources are static: they do not change on their own and need to be updated periodically (domains that are no longer valid are deleted, and categories are changed if the content has changed). The Machine Learning department handles this updating on an ongoing basis. The other type is external sources. Some of them are open, and some we purchase. They often have a narrow specialization, for example a list of phishing sites or of advertising domains. The most important sources are usually generated by our own AI/ML.
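A minimal sketch of the periodic revalidation of a static source follows, assuming a simple "does the domain still resolve?" check: dead domains are dropped, and the rest are kept for re-categorization. This is not the actual update job; names and the check itself are illustrative.

```python
import socket

def revalidate_static_source(domains):
    """Split a static source's domain list into still-valid and dead entries."""
    valid, dead = [], []
    for domain in domains:
        try:
            socket.gethostbyname(domain)   # does the domain still resolve?
            valid.append(domain)           # keep it; its category will be re-checked
        except socket.gaierror:
            dead.append(domain)            # remove it from the database
    return valid, dead

valid, dead = revalidate_static_source(["example.com", "no-such-domain.invalid"])
print("keep:", valid)
print("drop:", dead)
```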
Every day, around 10 million domains are collected from our users' logs and from external resources. These new domains go through a classification procedure in our database, where the AI/ML scans every page of each domain so the domain can be sorted into the appropriate category. We also have a re-classification procedure for outdated domains to ensure they remain correctly categorized. The AI/ML scans all pages of a domain by downloading the content and then analyzing the text, links, site-specific links, images, site engines, and so on. There are separate classifiers for different languages, such as English, Spanish, and others.
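The sketch below shows the general shape of such a per-page pipeline: download the content, extract the visible text, and route it to a language-specific classifier. The keyword tables stand in for the real ML models; every name here is an assumption made for illustration only.

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def fetch_text(url):
    """Download a page and return its visible text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="ignore")
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

# Stand-in "classifiers": one keyword-to-category table per language.
CLASSIFIERS = {
    "en": {"casino": "gambling", "pharmacy": "pharmacy", "login": "phishing"},
    "es": {"casino": "gambling", "farmacia": "pharmacy"},
}

def classify_page(url, language="en"):
    """Assign a category to a single page using the language-specific table."""
    text = fetch_text(url).lower()
    for keyword, category in CLASSIFIERS.get(language, {}).items():
        if keyword in text:
            return category
    return "uncategorized"

print(classify_page("https://example.com"))
```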
As mentioned above, multiple sources generate domains for our database, and our main goal is to classify those domains and to check the relevance of each source. Relevance is expressed by a parameter we call the weight of the source: the heavier the source, the more important it is to our database.
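One simple way such a weight could be applied, sketched below, is conflict resolution: when two sources disagree about a domain, the verdict from the heavier source wins. The weights, source names, and data structures are illustrative assumptions, not SafeDNS's actual values.

```python
# Hypothetical source weights: higher means more trusted.
SOURCE_WEIGHTS = {
    "internal_ml": 1.0,     # our own AI/ML classifier
    "moderators": 0.9,      # manually reviewed entries
    "phishing_feed": 0.6,   # narrow external feed
    "public_list": 0.3,     # generic open list
}

def resolve_category(verdicts):
    """verdicts: list of (source_name, category) tuples for one domain."""
    best_source, best_category = max(
        verdicts, key=lambda v: SOURCE_WEIGHTS.get(v[0], 0.0)
    )
    return best_category

verdicts = [("public_list", "ads"), ("internal_ml", "phishing")]
print(resolve_category(verdicts))  # "phishing" wins: internal_ml is the heavier source
```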