How to ignore specific types of files when downloading with wget?

How do I make wget skip .jpg and .png files? I want to download only .html files.

I am trying:

wget  -R index.html,*tiff,*pdf,*jpg -m http://example.com/

but it's not working.



Solution 1:[1]

Use the

 --reject jpg,png  --accept html

options to exclude or include files with certain extensions; see http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options.

Put patterns containing wildcard characters in quotes; otherwise your shell may expand them before wget sees them. See http://www.gnu.org/software/wget/manual/wget.html#Types-of-Files
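A small sketch of why the quoting matters, runnable without wget or network access. The helper show_args and the scratch files a.jpg and b.jpg are made up for this demonstration; the script only prints the arguments a program would receive.

```shell
#!/bin/sh
# show_args is a hypothetical helper: it prints each argument it receives in
# brackets, standing in for wget so that nothing is downloaded.
show_args() {
    printf '[%s]' "$@"
    printf '\n'
}

# Scratch directory with two made-up files for the glob to match.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.jpg" "$demo_dir/b.jpg"

# Unquoted: the shell replaces *.jpg with the matching file names.
unquoted=$(cd "$demo_dir" && show_args -R *.jpg)
# Quoted: the pattern is passed through unchanged.
quoted=$(cd "$demo_dir" && show_args -R "*.jpg")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

rm -rf "$demo_dir"
```

When the current directory contains matching files, the unquoted form hands the program file names instead of the pattern, so the reject list no longer means what was intended.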

Solution 2:[2]

# -r : recursive
# -nH : do not create host-prefixed directories
# -nd : save all files to the current directory
# -np : never ascend to the parent directory when retrieving recursively
# -R : reject (do not download) files matching these patterns
# -A : accept only files matching these patterns (*.html in this case)

For instance:

wget -r -nH -nd -np -A "*.html" -R "*.gz, *.tar"  http://www1.ncdc.noaa.gov/pub/data/noaa/1990/
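To preview how an accept list and a reject list combine, here is a rough imitation in plain shell. It assumes the behaviour described in the wget manual: list elements containing *, ?, [ or ] are treated as patterns, and all other elements as suffixes. The functions in_list and would_download are hypothetical helpers, not part of wget, and they do not model everything wget does during a recursive run.

```shell
#!/bin/sh
# in_list NAME LIST: succeed if NAME matches any comma-separated element of LIST.
# Elements containing * ? [ or ] are used as glob patterns, others as suffixes.
in_list() {
    il_name=$1
    il_rest=$2
    while [ -n "$il_rest" ]; do
        il_item=${il_rest%%,*}                    # take the first element
        case $il_rest in
            *,*) il_rest=${il_rest#*,} ;;         # drop it and its comma
            *)   il_rest= ;;
        esac
        il_item=${il_item#"${il_item%%[! ]*}"}    # trim leading spaces
        [ -n "$il_item" ] || continue
        case $il_item in
            *\**|*\?*|*\[*|*\]*)                  # wildcard present: glob match
                case $il_name in $il_item) return 0 ;; esac ;;
            *)                                    # no wildcard: suffix match
                case $il_name in *"$il_item") return 0 ;; esac ;;
        esac
    done
    return 1
}

# would_download NAME ACCEPT REJECT: an empty list means "no restriction".
would_download() {
    wd_name=$1; wd_accept=$2; wd_reject=$3
    if [ -n "$wd_accept" ] && ! in_list "$wd_name" "$wd_accept"; then
        return 1
    fi
    if [ -n "$wd_reject" ] && in_list "$wd_name" "$wd_reject"; then
        return 1
    fi
    return 0
}

for f in index.html notes.txt 1990.tar backup.gz; do
    if would_download "$f" "*.html" "*.gz, *.tar"; then
        echo "keep $f"
    else
        echo "skip $f"
    fi
done
```

With -A "*.html" in place, the reject list only matters for names that also pass the accept list, which is why the two options are often used one at a time.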

Solution 3:[3]

Worked example to download all files excluding archives:

wget -r -k -l 7 -E -nc \
 -R "*.gz, *.tar, *.tgz, *.zip, *.pdf, *.tif, *.bz, *.bz2, *.rar, *.7z" \
 -erobots=off \
 --user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" \
 http://misis.ru/
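A long reject list is easier to maintain when it is built from a plain list of extensions. The following is a minimal sketch; the variable names reject_exts and reject_list are made up here, and the script prints the command rather than running it.

```shell
#!/bin/sh
# Extensions to skip, taken from the reject list in the command above.
reject_exts="gz tar tgz zip pdf tif bz bz2 rar 7z"

# Join them into a comma-separated list of patterns: *.gz,*.tar,...
reject_list=""
for ext in $reject_exts; do
    reject_list="${reject_list:+$reject_list,}*.$ext"
done

# Print the command instead of running it, so it can be reviewed first.
printf 'wget -r -k -l 7 -E -nc -erobots=off -R "%s" http://misis.ru/\n' "$reject_list"
```

Adding or removing an extension then means editing one word in reject_exts instead of a quoted, comma-separated string.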

Solution 4:[4]

This is what I get from wget --help:

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions.
  -R,  --reject=LIST               comma-separated list of rejected extensions.
       --accept-regex=REGEX        regex matching accepted URLs.
       --reject-regex=REGEX        regex matching rejected URLs.
       --regex-type=TYPE           regex type (posix|pcre).
  -D,  --domains=LIST              comma-separated list of accepted domains.
       --exclude-domains=LIST      comma-separated list of rejected domains.
       --follow-ftp                follow FTP links from HTML documents.
       --follow-tags=LIST          comma-separated list of followed HTML tags.
       --ignore-tags=LIST          comma-separated list of ignored HTML tags.
  -H,  --span-hosts                go to foreign hosts when recursive.
  -L,  --relative                  follow relative links only.
  -I,  --include-directories=LIST  list of allowed directories.
  --trust-server-names             use the name specified by the redirection
                                   url last component.
  -X,  --exclude-directories=LIST  list of excluded directories.
  -np, --no-parent                 don't ascend to the parent directory.

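So you can use -R or --reject to reject extensions this way (the short option takes its list as a separate argument, without an = sign, which would otherwise become part of the first pattern):

wget -R "index.html,*.tiff,*.pdf,*.jpg" http://example.com/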

In my case, this is the final command I used to recursively download and update non-HTML files from an indexed website directory:

wget -N -r -np -nH --cut-dirs=3 -nv -R "*.htm*,*.html" http://example.com/1/2/3/
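The help output above also lists --accept-regex and --reject-regex, which match against URLs rather than file name extensions. Below is a sketch for trying out such a regex locally before handing it to wget. The pattern in reject_re and the helper url_rejected are hypothetical, and grep -E is used here only as a stand-in for wget's posix regex type, so treat the result as an approximation.

```shell
#!/bin/sh
# Hypothetical pattern: reject URLs ending in .jpg, .jpeg or .png,
# optionally followed by a query string.
reject_re='\.(jpe?g|png)(\?.*)?$'

# url_rejected URL: succeed if the URL matches reject_re (checked with grep -E).
url_rejected() {
    printf '%s\n' "$1" | grep -Eq "$reject_re"
}

for url in \
    http://example.com/index.html \
    http://example.com/img/photo.jpg \
    'http://example.com/img/logo.png?v=2'
do
    if url_rejected "$url"; then
        echo "reject $url"
    else
        echo "keep   $url"
    fi
done

# The corresponding wget call (not run here) would look like:
#   wget -r -np --reject-regex "$reject_re" http://example.com/
```

A regex is handy when the unwanted files are served with query strings or other URL parts that a plain extension list might not match.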

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Jet Blue
Solution 3
Solution 4