.htaccess: use variables to check if a subpage exists

I have a one-page, JS-based website. It has different "subsites" and "categories", but since everything is controlled via JS, and since there are about 30 subsites and 50 categories, I never created a "normal" subsite structure such as

www.example.com/subsite/category

Instead I have only the main site

www.example.com

and that's all, everything else is controlled via JS.

But I want to rank better on Google, and for that I need to create subpages as well. I want to keep the JS-based behavior, and that part is already able to handle the different URLs (www.example.com/subsite/category) correctly: it checks the URL, extracts the subsite and the category, and passes them to the right JS as parameters. So my one-page site acts like a multi-page site, and that works fine.
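
As a rough illustration of the URL handling described above, a minimal sketch might look like this. The function name parseRoute is hypothetical and is not the site's actual code; it only shows how a path could be split into a subsite and a category.

```javascript
// Hypothetical helper: split a URL path such as "/foosite/foocategory"
// into its subsite and category segments.
function parseRoute(pathname) {
  // Dropping empty segments means leading, trailing, and repeated slashes
  // are all ignored here.
  const segments = pathname.split("/").filter((s) => s.length > 0);
  return {
    subsite: segments[0] ?? null,
    category: segments[1] ?? null,
  };
}

// In a browser this would typically receive window.location.pathname.
console.log(parseRoute("/foosite/barcategory"));
```

Because this sketch ignores empty segments, it cannot tell /foosite/barcategory apart from //foosite///barcategory; that distinction would have to be made elsewhere.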

At this point my .htaccess rewrites every request that does not map to an existing directory or file to the home page, keeping the URL itself unchanged, so the JS can use it properly.

RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule . index.html [L]

But I also want to handle non-existent subsites and categories: they should return a 404. And I need to handle this via .htaccess.

So I thought maybe there is a way to define the existing subsites and categories as variables in .htaccess, so that all of their combinations are accepted and everything else goes to 404.

For example (in JS, since I don't know how to handle this in .htaccess):

var subsitesArray = ["foosite", "barsite"];
var categoriesArray = ["foocategory", "barcategory"];

So the valid URLs would be:

www.example.com/foosite/foocategory
www.example.com/foosite/barcategory
www.example.com/barsite/foocategory
www.example.com/barsite/barcategory

All others would be invalid, so 404.
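
To make the intended rule concrete, here is a small sketch of the same check in JavaScript. It reuses the two arrays from the example above; the helper name isValidCombination is hypothetical. The point is that each part is checked against its own list, so the combinations never have to be enumerated.

```javascript
const subsitesArray = ["foosite", "barsite"];
const categoriesArray = ["foocategory", "barcategory"];

const validSubsites = new Set(subsitesArray);
const validCategories = new Set(categoriesArray);

// Hypothetical helper: a URL is valid when both parts are known,
// so no list of all subsite/category pairs is needed.
function isValidCombination(subsite, category) {
  return validSubsites.has(subsite) && validCategories.has(category);
}

console.log(isValidCombination("foosite", "barcategory")); // true
console.log(isValidCombination("foosite", "unknown"));     // false
```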

If I had to list all URLs manually, that would be 30*50 URLs... That's way too many.

Is it possible to solve it somehow in .htaccess?

UPDATE #3: Please update the code to support the following points:

  • The /site1 ... /site30 and /category1 ... /category50 subsites exist on the server (each of these directories contains an index.html), so the .htaccess rules should not forward them to index.html but should let the "physical" files be served.
  • So only the /site1/category1 ... /site1/category50 ... /site30/category1 ... /site30/category50 variants should be rewritten to index.html.
  • www.example.com/////site1///category2 (with repeated / characters in between) is still accepted, but should not be.
  • A link that ends with a / character is not accepted, but should be: www.example.com/site1/category is accepted, whereas www.example.com/site1/category/ (ending with /) is not.
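
The four points above can be summarised as a shape test on the raw request path. The following is only a sketch in JavaScript to pin down the expected behaviour, not the .htaccess solution itself. The name shouldRewriteToIndex is hypothetical, and excluding dots from the segments is an assumption made so that file names are never caught. Whether the two segments are valid sites and categories is a separate check.

```javascript
// Hypothetical mirror of the requirements, applied to the raw path exactly
// as typed. Accepts exactly "/<site>/<category>" with an optional single
// trailing slash; a lone "/<site>" and repeated slashes are rejected.
const FRONT_CONTROLLER_PATH = /^\/[^/.]+\/[^/.]+\/?$/;

function shouldRewriteToIndex(rawPath) {
  return FRONT_CONTROLLER_PATH.test(rawPath);
}

for (const path of ["/site1/category2", "/site1/category2/", "/site1", "/////site1///category2"]) {
  console.log(path, shouldRewriteToIndex(path));
}
```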

Could you please update the code? These would be the final modifications, and then it would work perfectly.

Thank you in advance.



Solution 1:[1]

This appears to be a continuation of your earlier question, except that previously "site" and "category" were separated by a hyphen in the URL, and both could seemingly contain a hyphen themselves (at least in your example), which would have made checking them independently in .htaccess pretty much impossible.

www.example.com/<site>/<category>

Following this URL pattern, you can validate the <site> and <category> separately, so you end up with 30 + 50 (80) directives, as opposed to 30 * 50 (1500) directives if you were to check every combination one by one.

For example, you could do something like the following in .htaccess (this replaces your existing rule):

DirectoryIndex index.html

RewriteEngine On

# If the request is not of the form "/site" or "/site/category" then stop here
RewriteRule !^[^/.]+(/[^/.]+)?$ - [L]

# Validate "site" (first path segment)
RewriteCond $1 !=site1
RewriteCond $1 !=site2
RewriteCond $1 !=site3
# etc.
RewriteCond $1 !=site30
RewriteRule ^([^/.]+) - [R=404]

# Validate "category" (second path segment)
RewriteCond $1 !=category1
RewriteCond $1 !=category2
RewriteCond $1 !=category3
# etc.
RewriteCond $1 !=category50
RewriteRule ^[^/.]+/([^/.]+)$ - [R=404]

# Front-controller
RewriteRule . index.html [L]

UPDATE: I removed the filesystem check entirely in favour of a rule at the top of the file that checks the format of the URL-path (i.e. RewriteRule !^[^/.]+(/[^/.]+)?$ - [L]). If the requested URL-path does not match a URL of the form /<value> or /<value>/<value> (the ! prefix negates the pattern), then the remaining directives are skipped entirely and the request is not rewritten to index.html.
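
To see which paths that first rule lets through, the same regular expression can be tried in JavaScript. This is only an illustration with made-up sample paths; the constant name is hypothetical.

```javascript
// Same regular expression as the first RewriteRule pattern. In .htaccess the
// pattern is matched against the URL-path without its leading slash.
const SITE_OR_SITE_CATEGORY = /^[^/.]+(\/[^/.]+)?$/;

const samples = ["site1", "site1/category2", "site1/category2/", "css/style.css", "a/b/c"];
for (const path of samples) {
  // The rule is negated (!), so a non-matching path stops processing and is
  // left untouched; a matching path continues to the validation rules.
  const action = SITE_OR_SITE_CATEGORY.test(path) ? "validated" : "skipped";
  console.log(`${path} -> ${action}`);
}
```

Note that a path with a trailing slash does not match this pattern, so it is skipped rather than validated.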

!=site1 - The ! prefix negates the expression, so the condition is successful when the expression does not match. The = prefix-operator makes this a lexicographical string comparison (exact match) rather than a regex. The conditions (RewriteCond directives) are implicitly AND'd, so the rule is triggered only when every condition is successful, i.e. when $1 equals none of the listed values.
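
Put differently, a stack of AND'd "not equal" conditions is a "not in the list" test. A minimal JavaScript sketch of that equivalence, using a hypothetical three-item list in place of the real 30 sites:

```javascript
// Hypothetical stand-in for the values listed in the RewriteCond directives.
const allowed = ["site1", "site2", "site3"];

// Each condition "$1 !=x" succeeds when the value differs from x.
// The conditions are AND'd, so the rule fires only if every one succeeds.
function allConditionsSucceed(value) {
  return allowed.every((x) => value !== x);
}

console.log(allConditionsSucceed("site2"));  // false: one condition fails, no 404
console.log(allConditionsSucceed("site99")); // true: every condition succeeds, 404
```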

$1 in each rule contains the value of the captured group from that rule's RewriteRule pattern. In the first rule this is the first path-segment (the "site") and in the second rule it is the second path-segment (the "category").

If the "site" does not match in the first rule then a 404 is immediately triggered. If the "category" does not match in the second rule (only processed if the "site" is valid) then a 404 is immediately triggered.
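
The overall order of decisions can be sketched as a single JavaScript function. This is a simulation for reasoning about the rules, not something Apache runs; decide and the short lists are hypothetical stand-ins for the real 30 sites and 50 categories.

```javascript
// Hypothetical stand-ins for the real lists.
const sites = new Set(["site1", "site2", "site3"]);
const categories = new Set(["category1", "category2", "category3"]);

// Mirrors the order of the rules: format check, site check, category check,
// then the front-controller rewrite. urlPath has no leading slash.
function decide(urlPath) {
  const match = /^([^/.]+)(?:\/([^/.]+))?$/.exec(urlPath);
  if (!match) return "pass-through"; // not of the form site or site/category
  const [, site, category] = match;
  if (!sites.has(site)) return "404"; // "site" validation
  // "category" validation, only relevant when a second segment is present
  if (category !== undefined && !categories.has(category)) return "404";
  return "index.html"; // front-controller
}

for (const p of ["site1/category2", "site1", "nope/category1", "site1/nope", "css/style.css"]) {
  console.log(`${p} -> ${decide(p)}`);
}
```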

RewriteRule ^([^/.]+) - [R=404]

This rule captures the first path segment in the URL-path and stores it in the $1 backreference. So, given a URL of the form /site/category (or just /site), this captures site. The preceding RewriteCond directives then check whether $1 contains one of the expected values. If all the preceding conditions are successful (i.e. the first path segment does not match any of the permitted "sites") then a 404 is triggered. Note that I've restricted the path segment so it can no longer contain dots (this is the same pattern as used in the first rule).

RewriteRule ^[^/.]+/([^/.]+)$ - [R=404]

This is similar to the rule above, except that it captures the second path segment (i.e. the "category") in the $1 backreference. The preceding conditions then validate this. This rule is only processed if the preceding rule is not successful (i.e. the first path segment matches a valid "site").

Define the custom error document as your index.html front-controller (in which you re-check the requested URL in JavaScript). You can then customise the response using JavaScript, as with all your other URLs. This directive should go at the top of the file:

ErrorDocument 404 /index.html

UPDATE / ERROR CORRECTION:

An earlier revision of the code in this answer contained the following filesystem check:

RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d [OR]
RewriteRule ^ - [L]

There should be no OR flag on the 2nd condition! The code above no longer contains this block.

The OR on the 2nd condition would have prevented the rest of the code from doing anything, so any site/category URL would have resulted in a 404.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
