NS News & Views

By Clive Norman

Using PowerShell to check all pages in a website

Tags: PowerShell, Website, Script, SysAdmin, DevOps

Recently St Mary’s Shaftesbury had a new website built (I would like to note that, whilst I was actively involved in the project management, I did not design the website).

In truth this was an enormous, and at times quite arduous, task which I shan’t digress into in this blog post.  However, part of the process involved transferring content from the previous website to the new one in a systematic and controlled manner.

For the most part this worked very well, but it was soon discovered that some “odd” pieces of code were appearing on certain pages; even more confusingly, this code only appeared on certain computers.

The bottom line was that, as part of the content-transfer process, the dreaded “copy and paste” had been used without sanitising the text (e.g. removing hidden formatting).  Stranger still, this code only manifested itself in Internet Explorer 8 – other browsers were happy to ignore it; hence the claim that it only displayed on some computers.
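For the curious, the hidden formatting left behind by Word pastes typically looks something like the fragment below (an illustrative example of Office’s conditional-comment markup, not the exact code we found) – note the “supportLists” token, which the script further down searches for:

<!--[if !supportLists]--><span style="mso-list:Ignore">1.</span><!--[endif]-->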

So, with the mystery of the rogue code resolved, the next task was to establish how to check every page on the website, using the appropriate browser.  Yes, we could sit a bored techie down to go through every page, or we could try to be clever with some PowerShell scripting.

A bit of Googling later, the script below was written.  It was actually a relatively simple process, for what is now a very effective tool.  In short, I knew that we had a sitemap of sorts – unfortunately it wasn’t a traditional sitemap.xml, which would have made life a lot easier – but it was at least a sitemap.
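As an aside, had a standard sitemap.xml been available, the list of page URLs could have been pulled straight from the XML rather than scraped from HTML – a minimal sketch, assuming a conventional sitemap at a hypothetical URL:

# Hypothetical: read page URLs from a standard sitemap.xml
[xml]$sitemap = (Invoke-WebRequest -Uri 'http://www.stmarys.eu/sitemap.xml').Content
$pageUrls = $sitemap.urlset.url | ForEach-Object { $_.loc }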

Using the PowerShell 3 Invoke-WebRequest cmdlet to fetch the sitemap, the script then grabs the href attributes from the anchor elements matching a specific CSS class (in this instance “rmLink”).  This is, technically, now a list of all pages in the website.
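If you want to verify what the filter will match before running the full script, the parsed links can be inspected interactively – a quick exploratory snippet using the same cmdlet:

# Exploratory: list the class and href of every link Invoke-WebRequest parsed
$hsg = Invoke-WebRequest -Uri 'http://www.stmarys.eu/sitemap.aspx'
$hsg.Links | Select-Object class, href | Format-Table -AutoSize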

All that is left to do is to enumerate each webpage (searching the HTML) for the “suspect” code, and to display the page URL if and when it is found!

Clear-Host

# Fetch the sitemap page and keep only the links with the 'rmLink' CSS class
$hsg = Invoke-WebRequest -Uri 'http://www.stmarys.eu/sitemap.aspx'
$links = $hsg.Links | Where-Object class -eq 'rmLink' | Select-Object href

# The hrefs are relative, so prefix each one with the site's domain
$linksWithDomain = $links | ForEach-Object { 'http://www.stmarys.eu/' + $_.href }

# Download each page and report any that contain the rogue markup
$linksWithDomain | ForEach-Object {
    $webClient = New-Object System.Net.WebClient
    $webClient.Headers.Add('user-agent', 'PowerShell Script')
    $output = $webClient.DownloadString($_.ToString())
    if ($output -like '*supportLists*') {
        'Dodgy Code On Page ' + $_.ToString()
    }
}

We did indeed locate three pages that had this rogue code present (so the time invested in creating the script was worthwhile).  We also now have a pretty neat tool for checking webpages in the future!
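To make the tool reusable beyond this one case, the same logic could be wrapped in a parameterised function – a sketch under my own naming (Find-StringInSite and its parameters are not from the original script):

# Sketch: generalise the check to any sitemap page, link class and search string
function Find-StringInSite {
    param(
        [string]$SitemapUri,  # page listing the site's links
        [string]$BaseUri,     # domain prefix for the relative hrefs
        [string]$LinkClass,   # CSS class of the sitemap's anchor elements
        [string]$Needle       # text to search each page for
    )

    $sitemap = Invoke-WebRequest -Uri $SitemapUri
    $sitemap.Links |
        Where-Object class -eq $LinkClass |
        ForEach-Object { $BaseUri + $_.href } |
        ForEach-Object {
            $webClient = New-Object System.Net.WebClient
            $webClient.Headers.Add('user-agent', 'PowerShell Script')
            if ($webClient.DownloadString($_) -like "*$Needle*") {
                "Found '$Needle' on page $_"
            }
        }
}

# Example usage, reproducing the original check:
Find-StringInSite -SitemapUri 'http://www.stmarys.eu/sitemap.aspx' `
                  -BaseUri 'http://www.stmarys.eu/' `
                  -LinkClass 'rmLink' -Needle 'supportLists'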