I would like to develop a web crawling system. How do I get a list of web URLs?

I would like to develop a web crawling system.

Any idea how to get the URLs of websites visited in Malaysia? Or how to get every domain name in the world, the way Googlebot can crawl all websites?



Solution 1:[1]

Start your crawler on a handful of sites you already know of that tend to link out to other sites. For example, you might start crawling https://example.my/

Your crawler will need a queue of URLs to crawl and a set of URLs it has already visited. As it visits each page, it should extract all the links on that page, check each one against the visited set, and add any it hasn't seen before to the queue.
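
A minimal sketch of that loop in Python, using only the standard library. The seed URL, the page limit, and the one-second politeness delay are assumptions for illustration; a real crawler would also honour robots.txt and handle redirects, encodings, and failures far more carefully.

    import time
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag

    class LinkExtractor(HTMLParser):
        """Collects absolute link targets from <a href="..."> tags."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links and drop #fragments.
                        absolute, _ = urldefrag(urljoin(self.base_url, value))
                        self.links.append(absolute)

    def crawl(seed_urls, max_pages=100):
        queue = deque(seed_urls)   # URLs waiting to be fetched
        seen = set(seed_urls)      # URLs already queued or fetched
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except Exception as exc:
                print(f"skipping {url}: {exc}")
                continue
            fetched += 1
            parser = LinkExtractor(url)
            parser.feed(html)
            for link in parser.links:
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(1)  # crude politeness delay between requests

    crawl(["https://example.my/"])  # seed site from above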

A crawler like Googlebot doesn't limit itself to crawling just one top-level domain or language. You will also need logic to determine what you don't want to crawl. For example, you might want to ignore any URLs you encounter that are not on the .my top-level domain. Alternatively, you could crawl every page you encounter, but after downloading a page, detect its content language and discard it if it isn't in Malay.
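
A sketch of both scoping rules. The in_scope check keeps the crawl on the .my country-code TLD before a URL is ever queued; detect_language is a stub standing in for whatever language-identification library you plug in after download:

    from urllib.parse import urlparse

    def in_scope(url: str) -> bool:
        """Queue-time filter: keep only hosts under the .my ccTLD."""
        host = urlparse(url).hostname or ""
        return host == "my" or host.endswith(".my")

    def detect_language(html: str) -> str:
        """Stub: a real crawler would call a language-ID library here."""
        return "ms"

    def keep_page(html: str) -> bool:
        """Download-time filter: keep pages whose content is Malay."""
        return detect_language(html) == "ms"  # "ms" is ISO 639-1 for Malay

    print(in_scope("https://example.my/page"))   # True
    print(in_scope("https://example.com/page"))  # False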

You will quickly find that crawling all Malaysian content is a huge task. It is too big to handle from a single crawler with a single internet connection. Large crawlers have to be written as distributed systems that run on many machines, possibly in several data centers. They need to support distributed data structures to communicate with each other about what has been crawled and what is available to crawl.
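
One common way for those machines to agree on who owns which URL is to partition the URL space by a stable hash of the hostname, so every page of a site lands on the same worker, which also keeps per-host politeness local to one machine. A sketch, assuming a fixed worker count:

    import hashlib
    from urllib.parse import urlparse

    def worker_for(url: str, num_workers: int) -> int:
        """Map a URL to a worker index by hashing its hostname."""
        host = urlparse(url).hostname or ""
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        # Hashing the host (not the full URL) keeps a whole site on one
        # worker, which simplifies per-host rate limiting.
        return int.from_bytes(digest[:8], "big") % num_workers

    print(worker_for("https://example.my/a", 8))
    print(worker_for("https://example.my/b", 8))  # same worker as above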

Furthermore, the web is changing all the time. If you want your archive to stay up to date, you will need a scheme for re-crawling URLs you have already visited, and you will need to decide how often each URL is worth recrawling. That will differ depending on the URL's update frequency and importance.
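
One simple heuristic for that is multiplicative adaptation of the revisit interval: check a page sooner if it changed since the last fetch, back off if it did not. The bounds and factors below are arbitrary assumptions, not tuned values:

    def next_interval(hours: float, changed: bool,
                      min_hours: float = 1.0,
                      max_hours: float = 24.0 * 30) -> float:
        """Adaptive revisit interval: recrawl changing pages more often."""
        hours = hours / 2 if changed else hours * 2
        return min(max(hours, min_hours), max_hours)

    # A page that keeps changing converges toward hourly recrawls;
    # a static page drifts toward the monthly cap.
    print(next_interval(24.0, changed=True))    # 12.0
    print(next_interval(24.0, changed=False))   # 48.0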

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1