Comments - Mirroring the MariaDB Knowledge Base
Httrack does a better job than wget at creating a local mirror, since that is its purpose.
The main problem when mirroring the knowledge base is that it makes heavy use of redirections. wget follows those redirections, so you end up with lots of duplicated pages (I have even hit wget's limit of 20 redirections). Httrack, on the other hand, mirrors each redirection by creating a small HTML file with a Refresh meta tag (effectively simulating the HTTP redirect), and downloads the target page only once.
Another problem is the login page. The login link doesn't show in the mirror, but it is actually present in a div.nav-top-mobile, and as web scrapers don't care about CSS, they will follow it. The big problem is that the login link contains the originating page as a parameter, so you end up downloading the login page once for each page. So, the path /kb/user/ should always be excluded from mirroring. With wget, the option --reject-regex /kb/user/ does this. With httrack, the filter -kb-mirror.mariadb.com/kb/user/* does the same.
The mirror layout should probably be fixed to remove that useless login link.
Also, if you are only interested in one language, limiting yourself to that path (/kb/en/ for English) will significantly reduce the number of pages. You will still need the content in /kb/static/.
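For anyone who still wants to try wget, the exclusion and the language restriction above could be combined roughly as follows. This is only a sketch that I have not run against the real site: the function name, the DRY_RUN switch, the --wait=1 delay and the output directory are my own choices, and wget will still duplicate pages reached through redirections.

```shell
# Hypothetical helper: mirror only /kb/en/ (plus /kb/static/) with wget,
# skipping the per-page login links under /kb/user/.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_kb_with_wget() {
  ${DRY_RUN:+echo} wget \
    --mirror \
    --page-requisites \
    --convert-links \
    --no-host-directories \
    --directory-prefix=kb-mirror-en \
    --wait=1 \
    --include-directories=/kb/en,/kb/static \
    --reject-regex '/kb/user/' \
    http://kb-mirror.mariadb.com/kb/en/
}
```

Running `DRY_RUN=1 mirror_kb_with_wget` prints the full command line so it can be checked before any download starts.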
With all that, here is an httrack command line to mirror the knowledge base in English, into the local directory kb-mirror-en:
httrack --mirror --path kb-mirror-en --sockets=2 --structure=100 --robots=0 http://kb-mirror.mariadb.com/kb/en/ +kb-mirror.mariadb.com/kb/static/\*
The --sockets=2 option is to avoid hammering the server with too many simultaneous connections (the default is 8), and the --structure=100 option avoids creating a subdirectory named after the hostname (useless, as we are downloading from a single host only). The --clean option may be useful too, if you don't want to use the update feature of httrack.
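The command above starts from /kb/en/, so it never needs the /kb/user/ filter. If you wanted every language instead, I would expect the exclusion filter to become necessary. Here is a sketch of that variant, using only the options already mentioned; the function name, the DRY_RUN switch and the kb-mirror-all directory are hypothetical, and I have not run this one.

```shell
# Hypothetical variant: mirror all of /kb/ but exclude the login pages.
# Filters are single-quoted so the shell does not expand the '*'.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_kb_all_languages() {
  ${DRY_RUN:+echo} httrack --mirror \
    --path kb-mirror-all \
    --sockets=2 \
    --structure=100 \
    --robots=0 \
    http://kb-mirror.mariadb.com/kb/ \
    '-kb-mirror.mariadb.com/kb/user/*'
}
```

Expect a much larger mirror and a much longer run than the English-only one.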
With all that (and limiting myself to English), I ended up with a mirror containing a little less than 10,000 files (more than half of them auto-generated redirection files), for a total size of "only" 230 MiB.
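To check the share of redirection files in your own mirror, something like the following should work. It is a rough sketch that assumes the stubs are .html files containing a meta tag written as http-equiv="refresh" (matched case-insensitively); the function name is my own, and a real page that uses a meta refresh would be counted too.

```shell
# Hypothetical helper: count .html files under a mirror directory that
# contain a meta refresh tag (assumed to mark httrack redirection stubs).
count_redirect_stubs() {
  # $1: mirror directory, e.g. kb-mirror-en
  find "$1" -type f -name '*.html' \
    -exec grep -li 'http-equiv="refresh"' {} + | wc -l | tr -d ' '
}
```

For example, compare `count_redirect_stubs kb-mirror-en` with the total from `find kb-mirror-en -type f | wc -l`.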
I forgot to add that it "only" took 1h20.