XML Sitemap (sitemap.xml) – A Comprehensive Guide
Posted: Wed Dec 11, 2024 9:56 am
Sitemap.xml, or site map, is a file that lists the pages of a website that are important to us. It is created to make it easier for indexing robots to reach those resources, especially ones that are newly created or difficult to access because of the structure of our website or the way it is linked internally.
According to Google's guidelines, a single sitemap file should not exceed 50 MB (uncompressed) or 50,000 URLs. The sitemap file should be created in XML format.
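To make the per-file URL limit concrete, here is a minimal sketch that splits a long URL list into groups of at most 50,000 entries, one group per sitemap file. The function name chunk_urls and the example.com addresses are assumptions made for illustration, and the sketch does not check the 50 MB size limit.

```python
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit mentioned above


def chunk_urls(urls: List[str], size: int = MAX_URLS_PER_SITEMAP) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` entries."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Hypothetical example: 120,000 URLs end up in three files.
all_urls = [f"https://example.com/page-{i}" for i in range(120_000)]
chunks = list(chunk_urls(all_urls))
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```

Each group would then be written to its own sitemap file.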
Did you know that?
XML stands for Extensible Markup Language. XML files are used to convey data in a structured way, and because they are platform independent, the format is popular and universal.
As a website administrator, blogger or e-commerce owner, you may be tempted to include every subpage in the sitemap, which is a common mistake. Why is this not a good choice? What about pages such as the terms and conditions, to which all links carry the rel="nofollow" attribute? Or pages with the noindex tag? You will learn in detail which URLs should be included in the sitemap later in the article.
What data does sitemap.xml consist of?
As I mentioned earlier, the XML format allows you to present data in a systematic way. A sitemap therefore consists mainly of tags, each with a specific role. Using this format means that everyone reports information about URLs in the same way, one that crawlers can easily read. Below you will find information about the 3 most important tags, without which a sitemap cannot exist.
Important! The sitemap.xml file should be UTF-8 encoded!
Most important tags in sitemap.xml
<urlset> Encloses the whole file and references the current protocol standard. It is the opening and closing element of every sitemap.xml file, and all other tags sit inside it.
<url> The parent tag of each URL entry that we want Google's crawlers to find. To be valid, each <url> tag must contain a <loc> tag. It can also carry optional tags, which I'll discuss later.
<loc> Short for location, <loc> is the tag that gives the address of a given subpage. It should contain the full URL, including the http/https protocol.
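The three required tags fit together as in the following minimal sketch, which builds a sitemap with Python's standard xml.etree.ElementTree module and serializes it as UTF-8. The function name build_sitemap and the example.com URLs are assumptions made for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace that <urlset> uses to reference the sitemap protocol standard.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return UTF-8 encoded sitemap bytes with one <url>/<loc> pair per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address  # full URL, protocol included
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical URLs, for illustration only.
sitemap_bytes = build_sitemap(["https://example.com/", "https://example.com/blog/"])
print(sitemap_bytes.decode("utf-8"))
```

The returned bytes can be written straight to a file opened in binary mode, which keeps the UTF-8 encoding intact.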
Optional tags in sitemap.xml
<lastmod> Gives the date on which the content of a given subpage was last modified. This lets robots tell whether the content has changed since the last crawl. The value uses the W3C Datetime format, which allows the time to be omitted so that only the date is given, in the form YYYY-MM-DD.
<priority> A tag that is supposed to tell crawlers which subpages are most important to us and should be indexed first. Values range from 0.0 to 1.0, and the default priority for a subpage is 0.5. I consider this element to be of limited use, since Gary Illyes of Google confirmed in 2017 that Google's crawlers ignore it.
<changefreq> A tag that states how often a given subpage changes. It was intended to help crawlers decide how often to rescan a subpage, in line with how often its content is updated.
Valid values that can be in <changefreq>
always - documents that change every time they are opened
hourly - changes every hour
daily - changes every day
weekly - changes every week
monthly - changes every month
yearly - changes every year
never - never changed
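The optional tags and their allowed values can be combined into a single entry, as in the sketch below. It is only an illustration: the helper name url_entry and the example URL are assumptions, and the checks simply mirror the value ranges listed above.

```python
from datetime import date
import xml.etree.ElementTree as ET

# The seven values listed above as valid for <changefreq>.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def url_entry(loc, lastmod=None, changefreq=None, priority=None):
    """Build one <url> element; optional tags are added only when provided."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    if lastmod is not None:
        # Date-only W3C Datetime value: YYYY-MM-DD.
        ET.SubElement(url, "lastmod").text = lastmod.strftime("%Y-%m-%d")
    if changefreq is not None:
        if changefreq not in VALID_CHANGEFREQ:
            raise ValueError(f"invalid changefreq: {changefreq!r}")
        ET.SubElement(url, "changefreq").text = changefreq
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError(f"priority out of range: {priority!r}")
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return url


# Hypothetical entry, for illustration only.
entry = url_entry(
    "https://example.com/blog/",
    lastmod=date(2024, 12, 11),
    changefreq="weekly",
    priority=0.8,
)
entry_xml = ET.tostring(entry, encoding="unicode")
print(entry_xml)
```

Raising an error on an unknown changefreq or an out-of-range priority is a design choice of this sketch; a generator could just as well skip the offending tag.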
What URLs should be included in the sitemap?
As I mentioned at the beginning of the article, not all URLs should be in our sitemap. I often come across an incorrect configuration of this element, which can negatively affect the crawl budget of the site. So let's make sure that only valuable subpages are in the sitemap. These include primarily:
Pages that generate response code 200
Pages not blocked in robots.txt
Canonical links
Pages that are valuable to users
Pages that are not password protected or otherwise difficult to access
Depending mainly on the type of website, these will be:
Home page, category and product pages, blog posts, blog categories, FAQ pages, static (information) pages.
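The inclusion criteria above can be sketched as a simple filter. The Page record and its field names are hypothetical, standing in for whatever a crawl or CMS export provides, and the "valuable to users" criterion is left out because it needs a human judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical record describing one crawled page."""
    url: str
    status_code: int
    canonical_url: str
    blocked_by_robots: bool = False
    requires_login: bool = False


def belongs_in_sitemap(page: Page) -> bool:
    """Apply the measurable inclusion criteria from the list above."""
    return (
        page.status_code == 200                 # responds with code 200
        and not page.blocked_by_robots          # not blocked in robots.txt
        and page.canonical_url == page.url      # the URL is its own canonical
        and not page.requires_login             # not password protected
    )


# Hypothetical crawl results, for illustration only.
pages: List[Page] = [
    Page("https://example.com/", 200, "https://example.com/"),
    Page("https://example.com/old", 301, "https://example.com/new"),
    Page("https://example.com/shoes?color=red", 200, "https://example.com/shoes"),
    Page("https://example.com/account", 200, "https://example.com/account", requires_login=True),
]
sitemap_urls = [page.url for page in pages if belongs_in_sitemap(page)]
print(sitemap_urls)  # ['https://example.com/']
```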
What URLs should not be included in your sitemap?
Above I described which URLs a sitemap should contain. It is equally important to know which URLs we should leave out when creating a sitemap. These are primarily:
URLs with redirects
Pages that return 4xx and 5xx errors
Pages blocked in robots.txt
Pages tagged with noindex
Pages of little value to users (terms and conditions, privacy policies)
Pagination pages
Search results pages
Pages with filtering/sorting parameters
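Some of these exclusions (pagination, search results, filtering and sorting parameters) can be recognized from the URL alone, as in the rough sketch below. The parameter names and the /search path prefix are assumptions; every site uses its own, so they would need adjusting. Redirects, error codes, robots.txt rules and noindex tags cannot be detected this way and require fetching the page.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical query parameters used for pagination, search, sorting and filtering.
EXCLUDED_PARAMS = {"page", "q", "s", "sort", "order", "filter"}
# Hypothetical path prefix of the internal search results pages.
EXCLUDED_PATH_PREFIXES = ("/search",)


def looks_excluded(url: str) -> bool:
    """Return True when the URL pattern suggests it should stay out of the sitemap."""
    parts = urlsplit(url)
    if parts.path.startswith(EXCLUDED_PATH_PREFIXES):
        return True
    params = parse_qs(parts.query, keep_blank_values=True)
    return any(name in EXCLUDED_PARAMS for name in params)


# Hypothetical URLs, for illustration only.
candidates = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/blog?page=2",
    "https://example.com/search?q=boots",
]
kept = [url for url in candidates if not looks_excluded(url)]
print(kept)  # ['https://example.com/shoes']
```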
How to generate a sitemap? - Most popular methods
Depending on how large a website we have and what content management system (CMS) we use, generating a sitemap can be done using free tools (sitemap.xml generators) or built-in/additional tools/plugins.
How to generate sitemap.xml for WordPress?
Let's start with the most popular content management system in Poland (almost 49% market share according to data from ). The fastest and easiest way to create a sitemap is to use the Yoast SEO plugin. It creates the sitemap for us automatically; we only choose the appropriate settings and decide which resources to include. The plugin is intuitive and easy to use, and its basic version has options sufficient for most webmasters.