How to mirror only a section of a website?

I cannot get wget to mirror a section of a website (a folder path below root) - it only seems to work from the website homepage.

I've tried many options; here is one example:

wget -rkp -l3 -np  http://somewebsite/subpath/down/here/

While I only want to mirror the content links below that URL, I also need to download all the page assets, which are not in that path.

It seems to work fine for the homepage (/), but I can't get it going for any subfolders.



Solution 1:[1]

Use the --mirror (-m) and --no-parent (-np) options, plus a few other useful ones, as in this example:

wget --mirror --page-requisites --adjust-extension --no-parent --convert-links \
     --directory-prefix=sousers http://stackoverflow.com/users

Solution 2:[2]

I usually use:

wget -m -np -p $url

Solution 3:[3]

I use pavuk to accomplish mirrors, as it seemed better suited to this purpose from the start. You can use something like this:

/usr/bin/pavuk -enable_js -fnrules F '*.php?*' '%o.php' -tr_str_str '?' '_questionmark_' \
               -norobots -dont_limit_inlines -dont_leave_dir \
               http://www.example.com/some_directory/ >OUT 2>ERR

Solution 4:[4]

Check out archivebox.io, it's an open-source, self-hosted tool that creates a local, static, browsable HTML clone of websites (it saves HTML, JS, media files, PDFs, screenshot, static assets and more).

By default, it only archives the URL you specify, but we're adding a --depth=n flag soon that will let you recursively archive links from the given URL.

Solution 5:[5]

For my use case, the --no-parent option didn't quite work.

I was trying to mirror https://www.example.com/section and URLs under it, like https://www.example.com/section/subsection. This can't be done with --no-parent: if you start at /section, wget downloads the entire site; if you start at /section/, the site redirects to /section, which is now the parent, so wget stops. Fun.

Instead, I am using --accept-regex 'https://www.example.com/(section|assets/).*'. This worked. (Although it would also download sectionfoobar, that was acceptable for me, and now we are wandering into regexp territory, which is amply covered elsewhere on SO.)
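The matching behavior of that pattern, including the sectionfoobar caveat, can be checked offline with grep -E, since wget's --accept-regex uses POSIX extended regex syntax by default. This is just a sketch using the placeholder example.com URLs from the answer:

```shell
# The pattern from the answer (dots are unescaped, so "." matches any
# character, but that is harmless for these URLs):
pattern='https://www.example.com/(section|assets/).*'

# URLs under /section and /assets/ match:
echo 'https://www.example.com/section/subsection' | grep -qE "$pattern" && echo match
echo 'https://www.example.com/assets/style.css'   | grep -qE "$pattern" && echo match

# ...but so does /sectionfoobar, the caveat noted above:
echo 'https://www.example.com/sectionfoobar' | grep -qE "$pattern" && echo match

# Other paths do not match:
echo 'https://www.example.com/other' | grep -qE "$pattern" || echo no-match
```

If the sectionfoobar overlap is a problem, anchoring the section name, e.g. 'https://www\.example\.com/(section(/|$)|assets/)', would exclude it.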

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 kenorb
Solution 2 ninjalj
Solution 3 rubo77
Solution 4 Nick Sweeting
Solution 5 chx