[wget] How to store an entire website offline

wget --random-wait -r -p -e robots=off -U mozilla http://yoursite.com

Or, add it as a bash function to ~/.bash_profile (macOS) or ~/.bashrc (Linux):

# Download an entire website
# -p --page-requisites: get all the elements that compose the page (images, CSS, and so on)
# -e robots=off: tell wget not to obey the robots.txt file
# -U mozilla: identify as Mozilla to the server
# --random-wait: let wget choose a random number of seconds to wait, to avoid being blacklisted
# Other useful wget parameters:
# -k --convert-links: convert links so that they work locally, offline
# --limit-rate=20k: limit the download rate
# -b: go to the background after startup, so wget keeps running after you log out
# -o $HOME/wget_log.txt: log the output to a file

getwebsite() {
    wget --random-wait -r -p -e robots=off -U mozilla "$1"
}

To use the bash function:

getwebsite http://websitelink.com
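
If the config file was edited after the current shell was opened, reload it first so the function is defined (a small sketch; adjust the path for your setup):

# Reload the shell configuration so getwebsite is available in this session
source ~/.bashrc        # or: source ~/.bash_profile on macOS

# Then mirror the site (placeholder URL)
getwebsite http://websitelink.com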

wget options

  • --random-wait This option causes the time between requests to vary between 0.5 and 1.5 times the --wait value, in seconds

  • -r recursive retrieving

  • -p for --page-requisites, download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

  • -e for executing a command as if it were part of .wgetrc. -e robots=off is the command being passed in this instance.

  • -U is short for --user-agent, so -U mozilla is equivalent to --user-agent=mozilla. Identify as Mozilla to the HTTP server.
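
The optional flags listed in the comments above can be combined with the same base command. For example (an illustrative combination, not part of the original one-liner; the URL is a placeholder):

# Mirror a site with local link rewriting, a bandwidth cap, background mode, and a log file
wget --random-wait -r -p -k -e robots=off -U mozilla \
     --limit-rate=20k -b -o "$HOME/wget_log.txt" http://yoursite.com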