
Convert your CMS site to static in seconds with a simple Linux command line

This little tip will hopefully be helpful to webmasters out there with deadbeat clients.  Scenario: you host the site xyz.com for Joe Smith.  The site runs on an ancient version of Joomla, which relies on an old, unsupported version of PHP. It's a pain to update and edit, and, worst of all, it's a huge liability.  You might get paid to get the site reactivated, but you haven't heard from Joe in long enough that you suspect you won't.  You've decided to leave the site functional for another month.  Meanwhile, you have a web server to migrate.  What is the easiest way to move the site without breaking it?

  1. Run an old version of PHP.  There are lots of problems with this option, which I don't think I need to go into.
  2. Convert it to a static site.

Imagine if you could have a working snapshot of the site and all of its resources: images, JavaScript widgets, CSS, content, everything!  The site is a lot harder to edit (although not impossible), but you are no longer anchored to any PHP version (heck, you don't even need PHP at all), and there is no vulnerable CMS waiting to be compromised and trash your web server and your reputation.  Brilliant!  But how to do this simply?

There are plugins for various popular CMSes which claim to do this.  I have tried a few of them with mixed results; in the end, none of them delivered what I hoped for in terms of simplicity or end result.

...Enter wget!  A very common Linux command-line tool which can suck down websites.  Normally it downloads just a single file, but it turns out there is a combination of arguments which, applied together, will entirely and statically duplicate a living, breathing website. It goes something like this:


wget -P . -mpck --user-agent="" -e robots=off --wait 1 -E http://www.site.com/

The gist of this command is as follows:

  -m (mirror) turns on recursion and time-stamping, so the whole site gets crawled.
  -p (page requisites) grabs everything the pages need besides the content itself: script files, CSS, images, etc.
  -c (continue) resumes a partial download of the site if the run gets interrupted.
  -k (convert links) rewrites absolute links so they work on a new or local site.
  -E (adjust extension) saves pages with an .html extension so they behave as plain static files.
  -P . places the download in the current directory, and the blank --user-agent sends an empty User-Agent string instead of wget's default, which some servers refuse.
  -e robots=off ignores robots.txt files which might restrict crawling, and --wait 1 pauses one second between requests so we don't hammer the other server too hard.
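If the single-letter soup is hard to read at a glance, here is the exact same command spelled out with long option names; it changes nothing about the behavior (on older wget builds, --adjust-extension is spelled --html-extension):

wget --directory-prefix=. --mirror --page-requisites --continue --convert-links --adjust-extension --user-agent="" -e robots=off --wait=1 http://www.site.com/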

The result is a folder containing all of the files (images, HTML, and scripts) required by the site, placed exactly where they need to be to create a 100% working, but static, representation of the original site.
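
As a quick sanity check, you can serve the mirrored folder locally before pointing anything at it.  A minimal sketch, assuming wget dropped the copy into a www.site.com directory (its default with -m) and that Python 3 happens to be on the box:

cd www.site.com
python3 -m http.server 8080

Then browse to http://localhost:8080/ and click around; every page, image, and stylesheet should load straight off the disk with no PHP in sight.  From there, copying the folder to the new web server's document root (rsync, scp, whatever you already use) is all the "migration" that's left.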

Many thanks to Jon Bickar for his post, which accurately describes how to use wget this way.

