This little tip will hopefully be helpful to webmasters with deadbeat clients out there. Scenario: you host xyz.com for Joe Smith. The site runs on an ancient version of Joomla, which relies on an old, unsupported version of PHP. It's a pain to update and edit, and worst of all it's a huge liability. You might get paid to reactivate the site, but you haven't heard from Joe in long enough that you suspect you won't. You've decided to leave the site functional for another month. Meanwhile, you have a webserver to migrate. What is the easiest way to move the site without breaking it?
You have a couple of options:
- Run an old version of PHP. There are lots of problems with this option, which I don't think I need to go into.
- Convert the site to a static copy.
Imagine if you could have a working snapshot of the site and all of its resources. Images, JavaScript widgets, CSS, content, everything! The site is a lot harder to edit (although not impossible), but you are no longer anchored to any PHP version (heck, you don't even need PHP at all), and there is no vulnerable CMS waiting to be compromised and trash your webserver and your reputation. Brilliant! But how to do this simply?
There are plugins for various popular CMSes which claim to do this. I have tried some, with mixed results; in the end, none delivered the simplicity or the output I was hoping for.
...Enter wget! A very common Linux command-line tool which can suck down websites. Normally it just downloads a single file or page, but it turns out there is a combination of arguments that will entirely and statically duplicate a living, breathing website. It goes something like this:
wget -P . -mpck --user-agent="" -e robots=off --wait 1 -E http://www.site.com/
The gist of this command is as follows:
- -m (mirror) turns on recursion and time-stamping.
- -p (page requisites) gets things besides just the content, such as script files, CSS, images, etc.
- -c (continue) resumes a partial download of the site.
- -k (convert links) rewrites absolute links so they work on a new or local site.
- -E (adjust extension) saves pages with an .html extension so they open as plain files.
- -e robots=off ignores robots.txt files which might restrict crawling.
- --wait 1 pauses one second between requests so we don't hammer the other server too hard.
- --user-agent="" sends a blank user agent, and -P . puts everything under the current directory.
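If the short flags are hard to remember, the same invocation can be spelled out with wget's long-form options; this should be equivalent to the command above (note that --adjust-extension requires wget 1.12 or later):

# same command, long options, line-broken for readability
wget --directory-prefix=. --mirror --page-requisites --continue \
     --convert-links --adjust-extension --user-agent="" \
     -e robots=off --wait=1 http://www.site.com/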
The result is a folder containing all of the files (images, HTML, and scripts) required by the site, laid out so that it works as a 100% functional, but completely static, copy of the original site.
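Before pointing anything at the new copy, it's worth a quick local check. A minimal sketch, assuming the mirror landed in a folder named after the host (www.site.com here) and Python 3 is available on the box:

# serve the mirrored folder locally and click around to confirm links and assets resolve
cd www.site.com
python3 -m http.server 8080
# then browse to http://localhost:8080/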
Many thanks to Jon Bickar for his post, which accurately describes how to use wget this way.