
Convert your CMS site to static in seconds with a simple Linux command

This little tip will hopefully be helpful to webmasters with deadbeat clients out there.  Scenario: you host site xyz.com for Joe Smith.  The site runs on an ancient version of Joomla, which relies on an old, unsupported version of PHP.  It's a pain to update and edit, and worst of all it is a huge liability.  You might get paid to get the site reactivated, but you haven't heard from Joe in so long that you suspect you won't.  You've decided to leave the site functional for another month.  Meanwhile, you have a webserver to migrate.  What is the easiest way to move the site without breaking it?  Two options come to mind:

  1. Run an old version of PHP.  There are lots of problems with this option, which I don't think I need to go into.
  2. Convert it to a static site.

Imagine if you could have a working snapshot of the site and all of its resources: images, JavaScript widgets, CSS, content, everything!  The site is a lot harder to edit (although not impossible), but you are no longer anchored to any PHP version (heck, you don't even need PHP at all), and there is no vulnerable CMS waiting to be compromised and trash your webserver and your reputation.  Brilliant!  But how to do this simply?

There are plugins for various popular CMSes which claim to do this.  I have tried some and found mixed results; in the end, none of them was what I hoped for in terms of simplicity or end result.

...Enter wget!  A very common Linux command-line tool which can suck down websites.  Normally it just downloads a single file, but it turns out there is a particular combination of arguments which will entirely and statically duplicate a living, breathing website.  It goes something like this:


wget -P . -mpck --user-agent="" -e robots=off --wait 1 -E http://www.site.com/

The gist of this command is as follows:
We turn on mirroring (-m) for recursion and time-stamping, fetch page requisites (-p) to get things besides just the content, such as script files, CSS, and images, continue (-c) to resume a partial download of the site, and convert links (-k) so that absolute links work on a new or local site.  We also ignore robots.txt files which might restrict crawling (-e robots=off) and wait one second between requests (--wait 1) so we don't hammer the other server too hard.  The remaining options save everything under the current directory (-P .), send a blank user-agent string (--user-agent="") in case the server treats wget's default one specially, and save pages with an .html extension (-E) so they open correctly without the CMS's URL rewriting.
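
For readability, here is the same command spelled out with long-form option names (a sketch of an equivalent invocation; replace www.site.com with the actual domain):

wget --directory-prefix=. --mirror --page-requisites --continue \
     --convert-links --adjust-extension --user-agent="" \
     -e robots=off --wait=1 http://www.site.com/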

The result is a folder containing all of the files (images, HTML, and scripts) required by the site, placed perfectly to create a 100% working, but static, representation of the original site.
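
If you want to sanity-check the snapshot before moving it anywhere, one quick way (a sketch, assuming Python 3 is installed; wget will have created a folder named after the host) is to serve the mirror locally and click around:

cd www.site.com
python3 -m http.server 8000
# then browse to http://localhost:8000/ and confirm that pages, images, and CSS all load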

Many thanks to Jon Bickar for his post, which accurately describes how to use wget this way.

