Finding which pages on your Drupal 8 site aren't being linked to from your menu or main listings

10:15 mlncn dinarcon: Good morning! Have you heard of a module for this use case? MASS is “looking for a site map just as part of a website audit to make sure that we are reviewing all existing pages, for example if there are active pages that are navigable to directly but don’t have a path from the homepage. Is there something like that available in the back end of the website?”
10:21 dinarcon mlncn, good morning
10:22 dinarcon mlncn, not that I know
10:22 mlncn yeah haven’t been able to find anything. There should be!
10:22 dinarcon mlncn, from the homepage only?
10:22 dinarcon mlncn, most things won’t be linked from there
10:23 mlncn No, from anywhere. A path from the homepage, she said, meaning even if you click five links you can get there from the homepage.
10:23 dinarcon mlncn, feeling lucky? https://www.drupal.org/project/simple_sitemap_views
10:23 dinarcon mlncn, I guess you can take the regular XML sitemap and compare it with all routes on the system and diff that
10:24 dinarcon mlncn, drupal console has an example of how to get all routes
10:24 mlncn dinarcon: there’s better maintained sitemap modules but their purpose is very different, for giving search engines a list. Of course, maybe after you do that, you can then use standard SEO tools to identify pages that you have on your site but you don’t link to from your site
10:25 dinarcon mlncn, well, if you are allowed to follow links, then it needs crawling logic which is not as easy as diff’ing two lists
10:26 dinarcon mlncn, why not using a tool like spidermonkey?
10:26 dinarcon https://www.screamingfrog.co.uk/seo-spider/
10:26 supy-agaric Title: Screaming Frog SEO Spider Tool & Website Crawler | Screaming Frog (at www.screamingfrog.co.uk)
10:26 mlncn Really what we want is something that follows links any way it can (no knowledge of Drupal, because Drupal has like eight ways of doing links) and then something, yeah, that knows every system path that Drupal provides, and will give us the remainder of the latter— just the paths that spidering can’t reach
10:27 mlncn dinarcon: right, so something that compares the output of a sitemap with the output of spidering
10:28 dinarcon mlncn, very likely outside of drupal
10:28 dinarcon aka not a drupal module
10:28 mlncn dinarcon: but was looking for that, already built :-) In this case MASS is quite well structured so i can tell them they only have to look at the pages, which they don’t have many of
10:29 mlncn dinarcon: oh, good point. Even though i require inside-Drupal knowledge, i can get that from a standard site map, and there must be SEO tools that will compare the spidering to the site map
10:30 sfreudenberg dinarcon: hello
10:31 dinarcon mlncn, https://www.deepcrawl.com/blog/best-practice/sitemap-audits-and-advanced-configuration/
10:31 supy-agaric Title: Sitemap audits and advanced configuration - DeepCrawl (at www.deepcrawl.com)
10:31 dinarcon sfreudenberg, hello
10:31 mlncn dinarcon: thanks! I also found https://www.upbuild.io/blog/orphan-pages-find-fix-verify/
10:31 supy-agaric Title: Orphan Pages: How To Find, Fix and Verify | UpBuild (at www.upbuild.io)

That link provides a good overview of a manual process for finding orphan pages.
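As an aside, the internal half of that diff (everything the site itself knows it has) is easy to get from inside Drupal 8. Here is a minimal sketch, assuming it runs with a full Drupal bootstrap (for instance via `drush php-script`) and that listing every published node’s canonical URL is a good enough stand-in for “all routes on the system”:

```php
<?php

/**
 * Sketch: print the canonical URL of every published node, as the
 * "internal" list to diff against what external spidering can reach.
 * Assumes a bootstrapped Drupal 8 site (e.g. run via drush php-script).
 */

use Drupal\node\Entity\Node;

// All published nodes, regardless of whether anything links to them.
$nids = \Drupal::entityQuery('node')
  ->condition('status', 1)
  ->execute();

foreach (Node::loadMultiple($nids) as $node) {
  // toUrl() gives the aliased canonical URL; 'absolute' makes it a full URL.
  print $node->toUrl('canonical', ['absolute' => TRUE])->toString() . "\n";
}
```

(The chat’s “all routes” version could start from Drupal’s router.route_provider service instead, but most routes are administrative and would just be noise in this kind of audit.)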

I finally updated the client-facing support ticket.

@jburgi Useful answer:

All content types except pages automatically show up in their sections, unless (in the case of videos) they are explicitly set to “Other” (this category was added to allow videos meant only for embedding in other content, if i recall correctly).

Therefore, reviewing the 20 pages to see whether any of them are ones you’d expect to be more prominently linked should be sufficient:

https://example.org/admin/content?status=All&type=page&title=&langcode=All

Oh, also: people content (staff, board, etc.) recently had an ‘inactive’ status added, and those person pages won’t appear in lists but can still be linked to from project pages. We could quite quickly create an administrative view for you to see any people who currently won’t be listed on the /team page, and another such admin view for videos in the “Other” category (which an inspection tool likely wouldn’t give useful results for, because videos don’t really have their own URLs).


Interesting answer that’s been getting in the way of sending a quick answer:

There totally should be an easy tool to do exactly that: audit a site to make sure that all existing, active pages are linked to from the home page or another section. It would take the results of an external spidering of the site (the same thing search engines do) and compare them with an internally produced sitemap to identify orphan pages. (A couple hours of research later…) I’m pretty sure i can build one…
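For what it’s worth, the comparison part is just a set difference once you have the two lists. A rough sketch, assuming a standard sitemap.xml and a list of crawled URLs produced elsewhere (the function name here is made up):

```php
<?php

/**
 * Sketch: report sitemap URLs that external spidering never reached.
 * Assumes a standard sitemap.xml; $crawledUrls comes from the spider
 * (for example the crawl observer sketched further down).
 */
function findOrphanCandidates($sitemapUrl, array $crawledUrls)
{
  // Pull the <loc> entries out of the XML sitemap.
  $xml = simplexml_load_string(file_get_contents($sitemapUrl));
  $sitemapUrls = [];
  foreach ($xml->url as $entry) {
    $sitemapUrls[] = rtrim((string) $entry->loc, '/');
  }

  // Normalize the crawled URLs the same way, then diff: anything in the
  // sitemap that the spider never reached is a candidate orphan page.
  $crawledUrls = array_map(function ($url) {
    return rtrim($url, '/');
  }, $crawledUrls);

  return array_values(array_diff($sitemapUrls, $crawledUrls));
}

// Usage (example.org is a placeholder):
// print implode("\n", findOrphanCandidates('https://example.org/sitemap.xml', $crawledUrls));
```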

There is a paid service that claims to do this:

Search for orphan pages.

WebSite Auditor lets you find pages that aren’t linked to from other pages of your site, but do exist. They could be old pages you forgot to link to, pages missed in a site migration or redesign, or even a sign that your site has been hacked. You’ll easily find these pages in your project by the Orphan Pages tag. If they are important pages that aren’t linked to by mistake, you’d want to find them and start linking to them internally, to pass on some link juice and encourage search engines to crawl them more frequently.

That’s from https://www.link-assistant.com/news/new-seo-spider.html

I don’t see how they can find truly orphan pages, because their only way to find them seems to be spidering— they don’t have internal knowledge of the site the way Drupal can provide knowledge of itself.

It’s a downloadable desktop tool, which seems to be the case for Screaming Frog too, which comes up in a lot of articles. In any case, Screaming Frog either doesn’t have an API, or their vaunted search engine optimization isn’t enough to help me find it. So it looks like i’ll have to do the spidering myself, running it from the very site being spidered. This might also unintentionally make a decent load test… the site doing the work of spidering at the same time as it has to serve up pages to respond to its own spider.

These came up in regular search but not on Packagist.org, so i’ll only come back to them if i don’t find something suitable on Packagist: https://github.com/BruceDone/awesome-crawler and php-spider https://php.libhunt.com/php-spider-alternatives

Don’t think i’ll go with this one; it seems too simple, in that it doesn’t appear to honor or report on noindex meta tags and the like: https://packagist.org/packages/zrashwani/arachnid

This one at least mentions that it “supports a politeness policy” https://packagist.org/packages/vdb/php-spider

I think we’ll go with https://packagist.org/packages/spatie/crawler. It very clearly supports respecting (or deliberately ignoring) robots.txt and robots meta tags:

By default, the crawler will respect robots data. […] Robots data can come from either a robots.txt file, meta tags or response headers. More information on the spec can be found here: http://www.robotstxt.org/. Parsing robots data is done by our package spatie/robots-txt.
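For the “what spidering can reach” list, the rough shape would be a crawl observer that records every URL it successfully visits. A sketch based on the package’s documentation at the time (class names and method signatures follow the 4.x docs and vary between releases, so treat it as illustrative):

```php
<?php

// Sketch: collect every URL the spider can reach with spatie/crawler.
// Class names and signatures follow the 4.x docs; adjust for the installed version.

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObserver;

class UrlCollector extends CrawlObserver
{
  /** @var string[] Every URL that returned a response. */
  public $reached = [];

  public function crawled(UriInterface $url, ResponseInterface $response, ?UriInterface $foundOnUrl = null)
  {
    $this->reached[] = (string) $url;
  }

  public function crawlFailed(UriInterface $url, RequestException $requestException, ?UriInterface $foundOnUrl = null)
  {
    // Broken links are a separate (also interesting) report; ignored here.
  }
}

$collector = new UrlCollector();

// Robots data is respected by default, which keeps the crawl close to
// what a search engine would actually see.
Crawler::create()
  ->setCrawlObserver($collector)
  ->startCrawling('https://example.org'); // Placeholder URL.

// $collector->reached is the crawled list to diff against the sitemap.
```

Feeding $collector->reached into the findOrphanCandidates() sketch above would complete the comparison.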

This is frustrating. 99.99% of the time, a page should never be manually included or excluded in a sitemap— rather, if it should be excluded, it should have the robots meta tag `<meta name="robots" content="noindex">`.

Dear metatag: Why would anyone want to set token-powered tags on a per-page basis? Tokens are for types of content; if anyone wants to change it for a particular piece of content they would just override it, not set a new token!

But that’s just the beginning of the problem. It seems the only way to override particular meta tags on a per-page (per piece of content) basis is to go to Admin » Structure » Content types » [Content type] » Manage fields and add a field of type Metatag. This field is then placed on the edit form for every piece of content of that type, where the default tags can be overridden one node at a time.

The service Mauricio listed just needs to be provided with a sitemap and it does the rest. But based on its pricing, we would be providing a service valued at $10/month or more to every site using this.