The IIS Search Engine Optimization Toolkit includes the Robots Exclusion feature, which manages the content of the robots.txt file for your web site, and the Sitemaps and Sitemap Indexes feature, which manages your site's sitemaps. This walkthrough explains how and why to use these features.
Background
Search engine crawlers spend limited time and resources on your web site. Therefore, it's critical to do the following:
- Prevent the crawlers from indexing content that is not important or that should not be surfaced in search result pages;
- Point the crawlers to the content that you deem most important for indexing.
Two protocols are used today to achieve these tasks: the Robots Exclusion protocol and the Sitemaps protocol.
The Robots Exclusion protocol is used to tell search engine crawlers which URLs they should NOT request when crawling a web site. The exclusion instructions are placed into a text file named robots.txt, which is located at the root of the web site. Most search engine crawlers look for this file and follow the instructions in it.
The Sitemaps protocol is used to inform search engine crawlers about URLs that are available for crawling on your web site. In addition, sitemaps provide metadata about the URLs, such as last modified time, change frequency, and relative priority. Search engines might use this metadata when indexing your web site.
Prerequisites
1. Setting up a web site or an application
To complete this walkthrough, you will need an IIS 7 hosted web site or web application that you control. If you do not have one, the easiest way to get one is to install it from the Microsoft Web Application Gallery. This walkthrough uses the popular blogging application DasBlog.
2. Analyzing the Web Site
Once you have a web site or a web application, you may want to analyze it to understand how a typical search engine will crawl its content. To do that, follow the steps outlined in the articles "Using Site Analysis to Crawl a Web Site" and "Using Site Analysis Reports". While analyzing the site, you will notice that certain URLs are available for search engines to crawl even though there is no real benefit in having them crawled or indexed. For example, login pages or resource pages should not even be requested by search engine crawlers. URLs like these should be hidden from search engines by adding them to the robots.txt file.
Managing robots.txt file
The Robots Exclusion feature is used to author a robots.txt file, which tells search engines which parts of the web site should not be crawled or indexed. The following steps describe how to use this tool.
- Open the IIS Management Console by typing INETMGR in the Start menu.
- Navigate to your web site by using the tree view on the left-hand side (for example, Default Web Site).
- Click on the Search Engine Optimization icon within the Management section:

- On the SEO main page, click on the "Add a new disallow rule" task link within the Robots Exclusion section.

Adding Disallow and Allow Rules
The "Add Disallow Rules" dialog will open automatically:

The Robots Exclusion protocol uses "Allow" and "Disallow" directives to inform search engines about the URL paths that can be crawled and the ones that cannot. These directives can be specified for all search engines or for specific crawlers, which are identified by the value they send in the User-Agent HTTP header. Within the "Add Disallow Rules" dialog you can specify which search engine crawler the directive applies to by entering the crawler's user-agent into the "Robot (User Agent)" field.
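As a rough illustration of how directives are grouped per crawler, the sketch below parses a small hand-written robots.txt with Python's standard urllib.robotparser module. The crawler names ExampleBot and OtherBot and the /drafts/ path are hypothetical and are not produced by the toolkit. The result reflects how this particular parser resolves groups; individual search engines may interpret rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for a specific crawler, one for all others.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /drafts/

User-agent: *
Disallow: /Login.aspx
"""


def build_parser(text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleBot matches its own group, so only that group's rules apply to it.
examplebot_drafts = parser.can_fetch("ExampleBot", "http://myblog/drafts/post.aspx")
examplebot_login = parser.can_fetch("ExampleBot", "http://myblog/Login.aspx")

# Any crawler without a dedicated group falls back to the "*" group.
other_drafts = parser.can_fetch("OtherBot", "http://myblog/drafts/post.aspx")
other_login = parser.can_fetch("OtherBot", "http://myblog/Login.aspx")

print(examplebot_drafts, examplebot_login, other_drafts, other_login)
```

In this sketch a crawler with its own group ignores the rules in the "*" group, which is why a rule meant for every crawler has to be repeated in each crawler-specific group.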
The URL Path tree view is used to select which URLs should be disallowed. You can choose from several options when selecting the URL paths by using the "URL structure" drop down list:
- Physical Location - you can choose the paths from the physical file system layout of your web site;
- From Site Analysis (analysis name) - you can choose paths from the virtual URL structure that was discovered when the site was analyzed with the Site Analysis tool;
- <Run new Site Analysis...> - you can run a new site analysis to get the virtual URL structure for your web site and then select URL paths from there.
If you have completed the steps described in the prerequisites section then you should already have a site analysis available. Choose it in the drop down list and then check the URLs that need to be hidden from search engines by using the checkboxes in the "URL Paths" tree view:

After selecting all the directories and files that need to be disallowed, click OK. As a result you will see the new disallow entries in the main feature view:

Also, the robots.txt file for the site will be updated (or created if it did not exist). Its content will look similar to this:
User-agent: *
Disallow: /EditConfig.aspx
Disallow: /EditService.asmx/
Disallow: /images/
Disallow: /Login.aspx
Disallow: /scripts/
Disallow: /SyndicationService.asmx/
To verify how robots.txt works, go back to the Site Analysis feature and re-run the analysis for the site. On the Reports Summary page, choose the "Links Blocked by robots.txt" report in the "Links" category. This report displays all the links that were not crawled because they are disallowed by the robots.txt file that you have just created.
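The same check can be approximated outside the toolkit. The following is a minimal sketch, assuming Python and its standard urllib.robotparser module, that lists which candidate URLs the example rules would block. The helper name blocked_urls and the URLs default.aspx and images/logo.png are hypothetical. This parser matches paths as case-sensitive prefixes, which may differ from how a given search engine treats them.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt example in this walkthrough.
ROBOTS_TXT = """\
User-agent: *
Disallow: /EditConfig.aspx
Disallow: /EditService.asmx/
Disallow: /images/
Disallow: /Login.aspx
Disallow: /scripts/
Disallow: /SyndicationService.asmx/
"""


def blocked_urls(robots_text, urls, user_agent="*"):
    """Return the URLs that the given rules disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical URLs to test against the rules.
candidates = [
    "http://myblog/default.aspx",
    "http://myblog/Login.aspx",
    "http://myblog/images/logo.png",
    "http://myblog/2009/06/02/ASPNETAndURLRewriting.aspx",
]

blocked = blocked_urls(ROBOTS_TXT, candidates)
for url in blocked:
    print("blocked:", url)
```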

Managing sitemap files
The Sitemaps and Sitemap Indexes feature is used to author sitemaps on your web site to inform search engines of the pages that should be crawled and indexed. The following steps describe how to use this tool:
- Open the IIS Management Console by typing INETMGR in the Start menu.
- Navigate to your web site by using the tree view on the left-hand side.
- Click on the Search Engine Optimization icon within the Management section:

- On the SEO main page, click on the "Create a new sitemap" task link within the Sitemaps and Sitemap Indexes section.
- The New Sitemap dialog will open automatically.

- On the Sitemap page, click the "Add URLs..." action.
Adding URLs to the sitemap
The Add URLs dialog will look similar to the following:

A sitemap file is a simple XML file that lists URLs along with some metadata, such as change frequency, last modified date, and relative priority. The "Add URLs" dialog is used to add new URL entries to the sitemap XML file. Each URL in the sitemap must be in fully qualified URI format (that is, it must include the protocol prefix and domain name). So the first thing you have to specify is the domain that will be used for the URLs that you are going to add to the sitemap.
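To make the fully qualified requirement concrete, here is a small sketch, assuming Python and its standard urllib.parse module, that combines a domain with site-relative paths and checks that the result carries both a protocol prefix and a domain name. The helper names to_absolute and is_fully_qualified are hypothetical, and the http://myblog/ base is taken from the examples in this walkthrough.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(base, path):
    """Combine a base such as 'http://myblog/' with a site-relative path."""
    return urljoin(base, path)


def is_fully_qualified(url):
    """Return True if url has an http(s) protocol prefix and a domain name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


base = "http://myblog/"
relative_path = "/2009/06/02/ASPNETAndURLRewriting.aspx"
absolute_url = to_absolute(base, relative_path)

print(absolute_url)
print(is_fully_qualified(relative_path), is_fully_qualified(absolute_url))
```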
The URL Path tree view is used to select which URLs should be added to the sitemap. You can choose from several options by using the "URL structure" drop down list:
- Physical Location - you can choose the URLs from the physical file system layout of your web site;
- From Site Analysis (analysis name) - you can choose URLs from the virtual URL structure that was discovered when the site was analyzed with the Site Analysis tool;
- <Run new Site Analysis...> - you can run a new site analysis to get the virtual URL structure for your web site and then select URL paths from there.
If you have completed the steps described in the prerequisites section then you should already have a site analysis available. Choose it in the drop down list and then check the URLs that need to be added to the sitemap.
Modify the "Change Frequency", "Last Modified Date" and "Priority" options if necessary, and then add the URLs to the sitemap by clicking OK. As a result, the sitemap.xml file will be updated (or created if it did not exist), and its content will look similar to this:
<urlset>
<url>
<loc>http://myblog/2009/03/11/CongratulationsYouveInstalledDasBlogWithWebDeploy.aspx</loc>
<lastmod>2009-06-03T16:05:02</lastmod>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>http://myblog/2009/06/02/ASPNETAndURLRewriting.aspx</loc>
<lastmod>2009-06-03T16:05:01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
</urlset>
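For readers who want to produce a comparable file from a script, the following is a minimal sketch that assumes Python and its standard xml.etree.ElementTree module; it is not how the toolkit itself generates the file. The function name build_sitemap is hypothetical. The published Sitemaps protocol declares the namespace http://www.sitemaps.org/schemas/sitemap/0.9 on the urlset element, so the sketch includes it.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol for the urlset element.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML text from dicts with a 'loc' key and optional
    'lastmod', 'changefreq' and 'priority' keys."""
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}").text = entry[tag]
    return ET.tostring(urlset, encoding="unicode")


# Entries mirror the example sitemap in this walkthrough.
entries = [
    {
        "loc": "http://myblog/2009/03/11/CongratulationsYouveInstalledDasBlogWithWebDeploy.aspx",
        "lastmod": "2009-06-03T16:05:02",
        "changefreq": "weekly",
        "priority": "0.5",
    },
    {
        "loc": "http://myblog/2009/06/02/ASPNETAndURLRewriting.aspx",
        "lastmod": "2009-06-03T16:05:01",
        "changefreq": "weekly",
        "priority": "0.5",
    },
]

sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```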
Adding sitemap location to robots.txt
Now that you have created a sitemap, you need to let search engines know where it is located so that they can start using it. The simplest way to do that is to add the sitemap location URL to the robots.txt file.
In the Sitemaps and Sitemap Indexes feature view, choose the sitemap that you have just created and then click the "Add to Robots.txt" action:

As a result, your robots.txt file will look similar to this:
User-agent: *
Disallow: /EditService.asmx/
Disallow: /images/
Disallow: /scripts/
Disallow: /SyndicationService.asmx/
Disallow: /EditConfig.aspx
Disallow: /Login.aspx
Sitemap: http://myblog/sitemap.xml
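A crawler discovers the sitemap by reading the Sitemap line. The sketch below, assuming plain Python with no external libraries, shows one simple way such a line could be extracted from the file above. The function name sitemap_locations is hypothetical, and real crawlers may parse the file more leniently or more strictly.

```python
# Content copied from the robots.txt example in this walkthrough.
ROBOTS_TXT = """\
User-agent: *
Disallow: /EditService.asmx/
Disallow: /images/
Disallow: /scripts/
Disallow: /SyndicationService.asmx/
Disallow: /EditConfig.aspx
Disallow: /Login.aspx
Sitemap: http://myblog/sitemap.xml
"""


def sitemap_locations(robots_text):
    """Return the URLs listed in 'Sitemap:' lines of robots.txt text."""
    locations = []
    for raw_line in robots_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split at the first colon only, so the URL's own colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            locations.append(value.strip())
    return locations


locations = sitemap_locations(ROBOTS_TXT)
print(locations)
```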
Registering sitemaps with search engines
In addition to adding the sitemap location to the robots.txt file, it is recommended that you submit your sitemap location URL to the major search engines. This allows you to obtain useful status information and statistics about your web site from the search engines' webmaster tools.
Summary
In this walkthrough, you have learned how to use the Robots Exclusion and the Sitemaps and Sitemap Indexes features of the IIS Search Engine Optimization Toolkit to manage the robots.txt and sitemap files on your web site. The IIS Search Engine Optimization Toolkit provides an integrated set of tools that work together to help you author the robots.txt file and sitemaps and validate their correctness before search engines start using them.