How to allow Googlebot to crawl my Next.js app deployed to Azure App Services?
I am working on a project based on Next.js and Strapi CMS. It is deployed to Azure App Services, which pulls the Docker image from an Azure Container Registry. The deployment also includes a Front Door and a Front Door and CDN profiles resource. Front Door has been configured with the basic WAF managed rules, which are as follows:
- Cross-site scripting
- Java attacks
- Local file inclusion
- PHP injection attacks
- Remote command execution
- Remote file inclusion
- Session fixation
- SQL injection protection
- Protocol attackers
The site also does not include a robots.txt file. However, when I do a live URL test on the site through https://search.google.com/test/mobile-friendly, it says the URL is not available to Google. The site has, however, been indexed through Bing's search console.
Are there any default configurations in Azure App Services that block Googlebot from crawling the site? Or are there any other resources that could affect this?
The following are the main resources used to host this; I was not able to find any specific rule among them that might be blocking Googlebot: Azure App Service, Front Door, Front Door WAF policy, Front Door and CDN profiles, Container Registry.
I also noticed that the App Service hosting the CMS allows Googlebot to crawl the site, but the front end does not. It would be a great help if someone could guide me on the steps I need to follow in this case. As I am also somewhat new to Azure, I was not able to find the exact reason for this.
Update: I tried adding a robots.txt file to the site, and surprisingly Google was then able to reach the URL and crawl the site. However, I was under the impression that even if a site does not include a robots.txt file, Google should still be able to crawl it. If someone can explain the reason for this, it would be a great help.
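For reference, a fully permissive robots.txt is tiny and can simply be committed as a static file (in a Next.js project, typically `public/robots.txt`, which is served at `/robots.txt`). The sketch below just builds that file's content as a string; the sitemap URL is a placeholder assumption, not something from the original setup:

```typescript
// Minimal sketch: build a permissive robots.txt body. In a Next.js project
// this content would usually live in public/robots.txt; the sitemap line is
// optional and the URL here is only an example.
function buildRobotsTxt(siteOrigin: string): string {
  const lines = [
    "User-agent: *", // applies to all crawlers, including Googlebot
    "Disallow:",     // an empty Disallow value blocks nothing
    `Sitemap: ${siteOrigin}/sitemap.xml`, // placeholder sitemap URL
  ];
  return lines.join("\n") + "\n";
}

console.log(buildRobotsTxt("https://www.example.com"));
```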
Solution 1:[1]
The need for a robots.txt file
- You may not require a robots.txt file if there are no sections of your site where you want to control user-agent access.
- To see whether you have a robots.txt file, type in your root domain, then add /robots.txt to the end of the URL. For example, Google's robots file is located at google.com/robots.txt. If no .txt page shows, you do not currently have a (live) robots.txt page.
- robots.txt is a text file that webmasters use to tell web robots (mostly search engine robots) how to crawl their website's pages.
- The robots.txt file is part of the Robots Exclusion Protocol (REP), which governs how robots crawl the web, access and index material, and serve that content to people.
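The "root domain plus /robots.txt" check above can be sketched as follows; the origin used here is only an example:

```typescript
// Sketch: derive the robots.txt URL for a given site origin, as described above.
// The WHATWG URL constructor resolves the path against the origin.
function robotsUrl(origin: string): string {
  return new URL("/robots.txt", origin).toString();
}

console.log(robotsUrl("https://www.google.com")); // https://www.google.com/robots.txt
```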
You don't need a robots.txt file if you want all of Google to be able to crawl your site.
By specifying Googlebot as the user agent, you can allow or prevent Google's crawlers from accessing some of your content:
```
User-agent: Googlebot
Disallow:
```
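To make the semantics of that snippet concrete: under the Robots Exclusion Protocol, an empty `Disallow:` value under a matching `User-agent:` line blocks nothing. The checker below is a deliberately simplified illustration (the function name is made up, and it ignores `Allow:` lines, wildcards, and group merging), not a full REP parser:

```typescript
// Simplified sketch of robots.txt semantics: collect Disallow prefixes for the
// group matching the given user agent, then test the path against them.
// An empty Disallow value contributes no rule, so nothing is blocked.
function isPathAllowed(robotsTxt: string, userAgent: string, path: string): boolean {
  let applies = false;            // inside a group matching this agent?
  const disallows: string[] = []; // Disallow prefixes for that agent
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.trim();
    const colon = line.indexOf(":");
    if (colon < 0) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (key === "user-agent") {
      applies = value === "*" || value.toLowerCase() === userAgent.toLowerCase();
    } else if (applies && key === "disallow" && value !== "") {
      disallows.push(value); // "Disallow:" with no value blocks nothing
    }
  }
  return !disallows.some((prefix) => path.startsWith(prefix));
}

const robots = "User-agent: Googlebot\nDisallow:\n";
console.log(isPathAllowed(robots, "Googlebot", "/any/page")); // true
```

Real crawlers apply the full rules (Allow precedence, wildcards, longest-match); the robots.txt testing tools in Google Search Console implement those completely.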
- If Cloudflare is preventing Googlebot from indexing your site, you can adjust the following settings: turn off Cloudflare Specials under Firewall settings > Managed Rules.
- Disable the rules one at a time to avoid losing all of the other Cloudflare Specials features. Please see Cloudflare Managed Special Rules for more information.
Please refer to About robots.txt, Crawling and Indexing, and New bot protection rule for more information.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | HarshithaVeeramalla-MT |
