XML Sitemaps: Best Practices That Still Matter
XML sitemaps have been around since 2005 and are sometimes dismissed as a solved problem. Drop in a plugin, generate a sitemap, submit it to Search Console, done. That approach works for a simple blog. For an e-commerce site with tens of thousands of product pages, a multilingual site covering UAE and GCC markets in both Arabic and English, or a large content publication, a lazy sitemap implementation creates real crawl and indexing problems.
The core purpose of an XML sitemap is to tell Google about URLs it might not discover through normal link crawling, and to signal freshness for your most important pages. Google has stated that it ignores the optional priority and changefreq values, so lastmod is the only field of the three that carries weight. Done well, sitemaps accelerate indexing for new content, improve crawl budget allocation, and provide a clean audit trail of what you want indexed. Done poorly, they include redirected pages, non-canonical URLs, and outdated lastmod dates that make Google distrust the entire sitemap.
This guide covers the practices that separate effective sitemap management from the set-it-and-forget-it approach. The difference shows up in how quickly new pages appear in search results and how accurately Google's crawl budget is directed toward your highest-value content.
What Belongs in an XML Sitemap
Only include canonical, indexable, 200-status URLs. A sitemap is a positive signal to Google: here are the pages I want you to crawl and consider indexing. Redirected URLs (which waste crawl budget on following the chain), noindexed pages (which contradict the sitemap's request to index them), and URLs returning 404 errors (which signal poor site health) all degrade that signal.
Run a crawl of your site monthly and cross-reference the sitemap against the crawl results. Any URL in your sitemap that returns a non-200 status code, has a canonical pointing elsewhere, or has a noindex tag is a sitemap error. Clean sitemaps with accurate data build Google's trust in your sitemap submissions over time.
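The cross-reference described above can be sketched in a few lines of Python. This is a minimal illustration, not a finished tool: the `CrawlResult` record and the `sitemap_errors` function are hypothetical names, and the sketch assumes your crawler can export status code, canonical target, and noindex state per URL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlResult:
    """One row of crawler output (hypothetical shape, adapt to your crawler)."""
    url: str
    status: int
    canonical: Optional[str]  # rel="canonical" target, or None if absent
    noindex: bool


def sitemap_errors(sitemap_urls, crawl_results):
    """Map each problematic sitemap URL to the list of reasons it is an error."""
    by_url = {result.url: result for result in crawl_results}
    errors = {}
    for url in sitemap_urls:
        result = by_url.get(url)
        if result is None:
            # Listed in the sitemap but never reached by the crawl.
            errors[url] = ["not found in crawl"]
            continue
        problems = []
        if result.status != 200:
            problems.append(f"status {result.status}")
        if result.canonical and result.canonical != url:
            problems.append(f"canonical points to {result.canonical}")
        if result.noindex:
            problems.append("noindex")
        if problems:
            errors[url] = problems
    return errors
```

A URL with no canonical tag at all is treated as acceptable here; tighten that rule if your site policy requires a self-referencing canonical on every page.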
lastmod: Use It Accurately or Not at All
The lastmod element is supposed to indicate when the page content was last meaningfully updated. When used accurately, it helps Google prioritize re-crawling recently updated content. When every page in your sitemap shows a lastmod of today's date (a common behavior of auto-generating plugins that update lastmod on every sitemap regeneration), Google stops trusting lastmod entirely and ignores it.
Set lastmod only when content has genuinely changed: when the main body text was edited, new information was added, or significant updates were made. Do not update lastmod when only metadata, ads, or navigation elements changed. For a Dubai property listing site, lastmod should update when the listing price changes or new photos are added, not when an unrelated site element is modified.
- Update lastmod only when substantive page content changes
- Do not set lastmod to the current date on every sitemap generation
- Store lastmod values in your CMS alongside content update timestamps
- Verify that your sitemap plugin sets lastmod from the content modification time, not from the original publication date or the sitemap build time
- For pages that never change (privacy policy, terms), set lastmod to the actual last edit date and leave it
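As a rough sketch of the points above, the following Python builds a sitemap whose lastmod values come from stored content timestamps rather than from the time the sitemap is generated. The `build_urlset` function and the `(url, content_updated_at)` input shape are assumptions for illustration; in practice the timestamps would come from your CMS.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(pages):
    """Build a sitemap from (url, content_updated_at) pairs.

    content_updated_at is a timezone-aware datetime taken from the CMS,
    or None when no trustworthy timestamp exists (hypothetical input shape).
    """
    ET.register_namespace("", SITEMAP_NS)  # serialize without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, updated in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        if updated is not None:
            # W3C datetime in UTC; never substitute datetime.now() here.
            stamp = updated.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = stamp
        # With no timestamp, lastmod is omitted rather than guessed.
    return ET.tostring(urlset, encoding="unicode")
```

Omitting lastmod when the timestamp is unknown follows the "accurately or not at all" rule: a missing value is safer than an invented one.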
Sitemap Index Files for Large Sites
A single XML sitemap file can contain up to 50,000 URLs and must not exceed 50MB uncompressed. For large sites, use a sitemap index file: an XML file that lists multiple individual sitemap files. The sitemap index has limits of its own: under the sitemap protocol it can list up to 50,000 sitemap files and must also stay under 50MB. Individual sitemap files should be logically organized by content type: one for product pages, one for blog posts, one for category pages, one for location pages.
This organization makes it easy to monitor crawl coverage by content type in Search Console. If Google has indexed 95% of your blog posts but only 60% of your product pages, the separate sitemaps reveal this immediately. You can also diagnose issues with specific content types without wading through a monolithic sitemap with mixed URL types.
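One way to sketch this layout in Python is to split each content type into files of at most 50,000 URLs and then list those files in an index. The function names and the `sitemap-<type>-<n>.xml` naming scheme are illustrative assumptions, not a convention Google requires.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemap protocol


def plan_sitemaps(urls_by_type, base_url, size=MAX_URLS_PER_FILE):
    """Return {sitemap file URL: [page URLs]}, one or more files per content type."""
    plan = {}
    for content_type, urls in urls_by_type.items():
        for n, start in enumerate(range(0, len(urls), size), start=1):
            # Hypothetical naming scheme: sitemap-<type>-<n>.xml
            plan[f"{base_url}/sitemap-{content_type}-{n}.xml"] = urls[start:start + size]
    return plan


def build_index(sitemap_urls):
    """Serialize a sitemap index that points at each individual sitemap file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for url in sitemap_urls:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(index, encoding="unicode")
```

Calling `build_index(plan_sitemaps(...))` iterates over the planned file URLs, so the index always matches the files that were actually generated. The sketch checks only the URL count; a production version would also need to watch the 50MB size limit.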
Image and Video Sitemaps
Image sitemaps are an extension of regular page sitemaps that list image URLs within each page's URL entry. They help Google discover images embedded in pages that might be loaded dynamically. Older guides describe extra fields for image titles, captions, geographic location, and licensing information, but Google retired those tags in 2022 and its documentation now lists only the image URL. For a UAE tourism or real estate site with thousands of property or destination photos, image sitemaps can improve image search traffic meaningfully.
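A minimal sketch of an image sitemap entry in Python follows. It emits only the image URL, which is the field Google's image sitemap documentation currently lists. The `image_url_entry` function name is hypothetical, and the sketch returns a single `<url>` fragment; in a full sitemap the namespace declarations would normally sit on the `<urlset>` root.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def image_url_entry(page_url, image_urls):
    """Serialize one <url> entry listing the images that appear on page_url."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    entry = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page_url
    for image_url in image_urls:
        image = ET.SubElement(entry, f"{{{IMAGE_NS}}}image")
        ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(entry, encoding="unicode")
```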
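Video sitemaps use Google's video sitemap extension (the video:video element) to provide the title, description, thumbnail URL, and content or player URL for videos embedded on pages, with duration as an optional field. For sites using videos as a core content format, video sitemaps increase the likelihood of appearing in Google Video results. VideoObject is the separate schema.org structured data type placed on the page itself; it is what determines rich result eligibility, and it works alongside the sitemap rather than being part of it.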
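A video entry can be sketched the same way, using the video sitemap namespace. The `video_url_entry` function and its parameter names are hypothetical, and the sketch covers only the fields mentioned above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"


def video_url_entry(page_url, title, description, thumbnail_url, content_url,
                    duration_seconds=None):
    """Serialize one <url> entry describing a single video embedded on page_url."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("video", VIDEO_NS)
    entry = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page_url
    video = ET.SubElement(entry, f"{{{VIDEO_NS}}}video")
    fields = [
        ("thumbnail_loc", thumbnail_url),
        ("title", title),
        ("description", description),
        ("content_loc", content_url),
    ]
    if duration_seconds is not None:
        # Duration is optional and expressed in whole seconds.
        fields.append(("duration", str(int(duration_seconds))))
    for tag, value in fields:
        ET.SubElement(video, f"{{{VIDEO_NS}}}{tag}").text = value
    return ET.tostring(entry, encoding="unicode")
```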
hreflang in Sitemaps for Multilingual Sites
For sites serving content in multiple languages, hreflang annotations tell Google which page serves which language and regional audience. These annotations can be placed in the page HTML head, HTTP response headers, or in the XML sitemap. The sitemap approach is useful when you cannot easily modify the page HTML (for example, on a legacy CMS), but it requires that every URL in the hreflang group has its own sitemap entry listing all language variants, including itself.
For a UAE business with Arabic (ar-AE) and English (en-AE) versions of each page, each URL in the sitemap should include xhtml:link entries for both the Arabic and English variants, and an x-default entry pointing to the preferred default. A common error is declaring hreflang in both the sitemap and the page HTML and letting the two drift out of sync. Google treats the three methods as equivalent, so pick the one you can maintain reliably and keep it complete; if you do use more than one, make sure they agree exactly.
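The reciprocal structure is easier to see in code. This Python sketch generates one sitemap entry per language variant, each repeating the full set of alternates plus x-default. The `hreflang_urlset` function and the example URLs in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def hreflang_urlset(variants, x_default):
    """Build a sitemap for one hreflang group.

    variants maps hreflang codes (for example "ar-AE") to URLs;
    x_default is the URL served when no variant matches.
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    alternates = list(variants.items()) + [("x-default", x_default)]
    for page_url in variants.values():
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page_url
        # Every entry repeats the full set of alternates, its own URL included.
        for code, href in alternates:
            ET.SubElement(
                entry,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": code, "href": href},
            )
    return ET.tostring(urlset, encoding="unicode")
```

For example, `hreflang_urlset({"ar-AE": "https://example.com/ar/villas", "en-AE": "https://example.com/en/villas"}, "https://example.com/en/villas")` yields two `<url>` entries with three `xhtml:link` elements each.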
Submitting and Monitoring Sitemaps
Submit your sitemap index URL in Google Search Console under Sitemaps. Bing Webmaster Tools has a separate sitemap submission. For other crawlers, including AI crawlers, add a Sitemap: directive with the full absolute sitemap URL to your robots.txt file so any crawler that reads robots.txt can discover your sitemap automatically.
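To confirm that the robots.txt directive is readable the way a crawler would read it, Python's standard library parser can extract it. This is a small sketch: `discover_sitemaps` is a hypothetical helper, and it takes the robots.txt text as a string so it can be checked without a network request.

```python
from urllib.robotparser import RobotFileParser


def discover_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present (Python 3.8+).
    return parser.site_maps() or []
```

Running it against the live file is a matter of fetching `/robots.txt` first and passing the response text in.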
Check the Sitemaps report in Search Console regularly. It shows how many URLs from each sitemap Google has discovered and indexed. A large gap between submitted and indexed count is normal for new sitemaps but should narrow over time. A persistent large gap on an established site indicates quality or canonicalization problems in the sitemap URLs rather than a discovery issue.
XML sitemaps are a crawl budget and indexing tool that rewards careful management. The practices that matter most are restricting sitemaps to canonical, 200-status, indexable URLs; using lastmod accurately rather than auto-generating today's date on every entry; organizing large sites into typed sitemap files via a sitemap index; and submitting via both Search Console and robots.txt. For UAE businesses managing multilingual content across Arabic and English, hreflang in sitemaps adds a critical layer of language targeting. Done well, sitemaps directly accelerate indexing for new content and improve the efficiency of every crawl visit.
Frequently asked questions
Do I need an XML sitemap if my site is well internally linked?
Well-linked small to medium sites may not see significant benefit from a sitemap beyond what normal crawling discovers. But sitemaps become important for large sites, pages that are not well linked internally, new content that needs rapid indexing, and multilingual sites requiring hreflang signals. For most sites, the cost of maintaining a clean sitemap is low and the occasional benefit is real.
How often should I update my sitemap?
Update your sitemap whenever you add significant new content, remove large sections, or change your URL structure. Once a sitemap has been submitted, Google re-fetches it periodically, so resubmitting is mainly needed when the sitemap URL itself changes. For frequently publishing sites like news or blogs, a sitemap that updates automatically with each new post is ideal. For static or slowly changing sites, monthly reviews are sufficient.
Should I include pagination pages in my sitemap?
Generally no. Pagination pages (page 2, page 3, etc. of blog archives or category listings) have little unique content, and including them dilutes the signal your sitemap sends about which pages matter. Include only the first page of each paginated series. The paginated pages are discoverable through internal links without needing to be in the sitemap.
What happens if my sitemap contains 404 or redirected URLs?
Google will report these URLs as problems in Search Console. It will follow redirects and crawl the final destination, which costs an extra request for every redirect hop. Over time, persistent sitemap errors reduce Google's confidence in the accuracy of your sitemap data. Clean sitemaps produce better crawl efficiency.