IIS Site Analysis is a tool within the IIS Search Engine Optimization (SEO) Toolkit that analyzes web sites in order to optimize their content, structure, and URLs for search engine crawlers. The tool can also be used to discover and fix common content problems that degrade the experience of the site's users. IIS Site Analysis includes a web crawler that follows all publicly available links and resources on a site and downloads the content used for the analysis.
Crawling a web site
The first step in analyzing a web site is to crawl all the resources and URLs that the site publicly exposes. The IIS Site Analysis tool does this whenever a new site analysis is created. To have IIS Site Analysis crawl a web site and collect data for analysis, follow these steps:
- Launch the SEO tool by going to Start > Program Files > IIS 7.0 Extensions and clicking on the Search Engine Optimization (SEO) Toolkit icon.
- This will automatically open the SEO main page.
- Click on the "Create a new analysis" task link within the Site Analysis section.

- In the New Analysis dialog box, enter the name that uniquely identifies the analysis report. Also, enter the URL where the crawler should begin.

Note that because we did not choose a specific Web site on the machine, it is possible to crawl any web site that is publicly accessible on the internet. Refer to the "Web Crawler Settings" section for more details about the "New Analysis" dialog box.
- Once all the parameters have been specified, click OK to start the analysis.

The two numbers reported during analysis are:
- Links Processed - the total number of links that have been crawled and downloaded by the web crawler.
- Total Links - the total number of links found so far while crawling the web site.
Note that the Web Crawler always runs on the client machine. If you connect to a remote IIS server and start a new analysis, the Web Crawler is hosted within the IIS Manager process (InetMgr.exe) on the machine that connects to the remote server. All collected data and cached web content are also kept on the file system of the client machine.
After the web site has been crawled and analyzed, the Analysis Summary view is shown. Refer to the "Using the Site Analysis Reports" article for more details on how to analyze the site for SEO and content-specific problems.
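The relationship between the two counters can be illustrated with a minimal sketch. This is not the tool's actual implementation: the `SITE` dictionary, the `crawl` function, and the `max_links` parameter are hypothetical names, and the "site" is an in-memory link graph rather than a real web server.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the links found on that page.
SITE = {
    "/": ["/about", "/products", "/style.css"],
    "/about": ["/", "/team"],
    "/products": ["/products/a", "/products/b", "/style.css"],
    "/team": [],
    "/style.css": [],
    "/products/a": [],
    "/products/b": [],
}


def crawl(start, max_links):
    """Breadth-first crawl; returns (links_processed, total_links)."""
    found = {start}           # every unique link discovered so far
    queue = deque([start])    # links discovered but not yet processed
    processed = 0
    while queue and processed < max_links:
        url = queue.popleft()
        processed += 1        # this link counts as crawled and downloaded
        for link in SITE.get(url, []):
            if link not in found:
                found.add(link)
                queue.append(link)
    return processed, len(found)


print(crawl("/", max_links=3))    # crawl stopped early by the link limit
print(crawl("/", max_links=100))  # crawl runs to completion
```

In this sketch the first counter can lag behind the second while the crawl is in progress or when a limit stops it early; the two are equal once every discovered link has been processed.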
Web Crawler Settings
Other parameters that can be specified when starting web crawling for a new analysis are:
- Maximum Number of Links - this setting controls how many unique links will be processed and downloaded from a web site during crawling. A link is any URL that is used within the page markup, including hyperlinks as well as references to image files, CSS files, and JavaScript files. Increasing this number will increase the size of the report file and will make the crawling process run longer.
- Maximum Download Size per Link - this setting controls how many kilobytes of content will be downloaded per link. Increasing this number will increase the size of the cached content stored by Site Analysis on the local file system.
- Ignore 'nofollow' attribute - the 'nofollow' attribute and the 'nofollow' meta tag are used to tell search engine crawlers not to follow certain or all hyperlinks in the page. This is commonly used as protection against spam in blog comments. If pages on your site use this attribute, then by default the hyperlinks on those pages will not be processed and analyzed during site analysis. Note that links to resources, such as images, CSS files, and JavaScript files, will still be processed. If it is necessary to analyze all the hyperlinks that use this attribute, use this setting to ignore 'nofollow' attributes and meta tags when gathering data for site analysis.
- Ignore 'noindex' meta tag - the 'noindex' meta tag is used to tell search engine crawlers not to index the content of the page. If pages on your site use this meta tag, then by default the content of those pages will not be searched for any violations. If it is necessary to analyze the content of pages that use this meta tag, use this setting to ignore the 'noindex' meta tag when gathering and processing data for site analysis.
- External Links - this drop down list can be used when your web site has sub-domains or when you want to run analysis on a particular directory within a site. This setting controls whether sub-domains should be treated as external or internal links, as well as whether subdirectories should be treated as external or internal links.
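To make the 'nofollow' and 'noindex' settings more concrete, the following is a minimal sketch of how a crawler might honor or ignore these directives. It uses Python's standard `html.parser` module; the class `RobotsDirectiveParser`, the function `analyze`, and the flags `ignore_nofollow` and `ignore_noindex` are hypothetical names chosen to mirror the settings above, not part of the SEO Toolkit.

```python
from html.parser import HTMLParser


class RobotsDirectiveParser(HTMLParser):
    """Collects hyperlinks and robots directives from a page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.noindex = False         # page-level 'noindex' meta tag seen
        self.nofollow_page = False   # page-level 'nofollow' meta tag seen
        self.links = []              # (href, has_rel_nofollow) in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            self.noindex = self.noindex or "noindex" in directives
            self.nofollow_page = self.nofollow_page or "nofollow" in directives
        elif tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel))


def analyze(html, ignore_nofollow=False, ignore_noindex=False):
    """Return (hyperlinks to follow, whether page content should be analyzed)."""
    parser = RobotsDirectiveParser()
    parser.feed(html)
    links = [
        href
        for href, nofollow in parser.links
        if ignore_nofollow or not (nofollow or parser.nofollow_page)
    ]
    analyze_content = ignore_noindex or not parser.noindex
    return links, analyze_content


PAGE = """
<html><head><meta name="robots" content="noindex"></head>
<body>
<a href="/docs">Docs</a>
<a href="http://example.com/spam" rel="nofollow">Comment link</a>
</body></html>
"""

print(analyze(PAGE))
print(analyze(PAGE, ignore_nofollow=True, ignore_noindex=True))
```

With the default flags the sketch skips the 'nofollow' hyperlink and marks the page content as excluded from analysis; with both flags set, every hyperlink is returned and the content is analyzed.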
In addition, the following generic settings can be configured for the web crawler:
- Maximum Number of Concurrent Requests - this setting controls how many concurrent requests the web crawler will make.
- Reports Location - specifies the directory on the local file system where all crawled data and cached web site content are stored.
Blocking IIS Site Analysis Web Crawler
All HTTP requests made by the IIS Site Analysis Web Crawler have the "user-agent" HTTP header set to:
"iisbot/1.0 (+http://www.iis.net/iisbot.html)"
The IIS Site Analysis Web Crawler is fully compliant with the robots exclusion protocol. This means that you can use a robots.txt file to prevent the IIS Site Analysis Web Crawler from crawling your web site. For example, you may want to do this to prevent other people from running IIS Site Analysis against your web sites.
To prevent the IIS Site Analysis Web Crawler from crawling a web site, add the following lines at the end of the robots.txt file located in the site's root directory:
User-Agent: iisbot
Disallow: /
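As a quick sanity check, the rule above can be evaluated with Python's standard `urllib.robotparser` module. This is only a sketch: it shows how Python's parser interprets the rule, which may differ in detail from how the IIS crawler matches user agents. The helper `is_allowed`, the `www.example.com` URL, and the `SomeOtherBot/2.0` agent string are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt lines from the article.
ROBOTS_TXT = """\
User-Agent: iisbot
Disallow: /
"""

# The user-agent string sent by the IIS Site Analysis Web Crawler.
IISBOT_UA = "iisbot/1.0 (+http://www.iis.net/iisbot.html)"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent, url="http://www.example.com/index.html"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed(IISBOT_UA))           # False: iisbot is disallowed everywhere
print(is_allowed("SomeOtherBot/2.0"))  # True: no rule applies to other agents
```

Because the rule names only `iisbot`, other crawlers remain unaffected unless the robots.txt file contains additional entries for them.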
Summary
You have now successfully configured IIS Site Analysis to crawl a web site and gather data about the site's content and structure. For information on how to analyze the gathered data by using Site Analysis reports, refer to "Using Site Analysis Reports".