The recent update of this website (mentioned here) caused many pages of the previous version to disappear (I renamed a few things and reorganized most of the information on display, which broke many old links that referred to content that had been moved somewhere else). However, having been recently influenced by some discussions about the preservation of information on the Internet (see this blog post by David Madore (in French)), I decided to task myself with implementing redirections from old dead links to their corresponding shiny new pages.
To be fair, I didn’t really believe that anyone had ever saved any of my pages to their bookmarks: so, if I had had to guess, I probably would have bet that no one would need or notice these redirections in the first place. But, just in case, my principles required me to make them; and, being mindful of SEO[1], I knew that Google prefers redirections to 404 errors. All in all, providing some kind of continuity between the old and the new website seemed like a good idea, and one that wouldn’t be too hard to implement: it was just a matter of writing some substitutions with regex[2], so nothing could go wrong… could it?
Oh, sweet summer child.
— Me, to my past self.
Let’s just say that everything did not go according to plan: in the end, I spent four or five hours trying to get these redirections working with a
Some mistakes could have easily been avoided and definitely were on me. This is, of course, despite me trying my best to be a model student: for example, before uploading my
.htaccess file, I locally simulated resolutions of URLs, and doing so enabled me to discover some silly typos in my regex. After I had fixed them all, I was sure that my substitutions were correct; and naively, I had hoped that it would be sufficient testing and that everything would work smoothly on the first try after the update: “Oh, sweet summer child” (bis repetita). Let’s just say that it did not.
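For the curious, this kind of local simulation can be sketched in a few lines of Python. This is only a guess at the general idea, not my actual script: the `RULES` table and the `simulate` helper are hypothetical names, and Python's `re` module stands in for Apache's regex engine. Note that it checks the regex and nothing else; it knows nothing about what is stored on the server.

```python
import re

# Hypothetical rule table: (pattern, substitution) pairs mirroring RewriteRule lines.
RULES = [
    (r"^publications/.*", "/publications"),
]


def simulate(path, rules=RULES):
    """Return the substitution of the first matching rule, or None if no rule matches.

    `path` is written without a leading slash, as in a per-directory
    (.htaccess) context.
    """
    for pattern, substitution in rules:
        match = re.search(pattern, path)
        if match:
            # Expand Apache-style $1, $2, ... back-references (empty if unset).
            return re.sub(
                r"\$(\d)",
                lambda m: match.group(int(m.group(1))) or "",
                substitution,
            )
    return None


print(simulate("publications/second-paper.html"))  # /publications
print(simulate("talks"))  # None
```

A check like this catches typos in a pattern, but a rule that passes here can still behave differently once Apache gets involved.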
Indeed, in the beautiful world of computer tinkering, it so happens that the very same
.htaccess file on two different websites can lead to two different behaviors, which is something my kind of testing had failed to take into account. Let me provide an example of that: in this day and age, custom called for removing
.html extensions from URLs so that the reader can go to
acallard.net/about-me to learn about my childhood, instead of having to type
acallard.net/about-me.html in their address bar. But there was a slight deficiency in my very narrow knowledge of Apache configuration files: I did not know that Apache automatically adds a trailing
/ to any URL that points to something that resolves to a directory.
The trouble is that I have a
talks webpage (redirecting to
talks.html), which contains a (somewhat incomplete and partial) list of the academic talks I have given somewhere and sometime in my life; and that I also have a
talks/ directory, that contains the slides of said talks. So, when typing
acallard.net/talks, Apache would redirect to
acallard.net/talks/ and display the
talks/ directory instead of the nice homonymous
talks.html page that I had spent so much time and craft preparing: one may find this very funny, but somehow at 10pm the humor of the situation eluded me.
This problem was quickly solved by (googling) adding
DirectorySlash Off in the root
.htaccess file. No biggie. Yet, the impact on my morale was not completely negligible: the first problems I had met with my
.htaccess configuration were not even about the redirections that I mentioned above in my introduction (and which, I assure you, despite appearances to the contrary, are the topic of this entry), but rather about some obscure details of my basic URL rewriting. As a consequence, my hopes that my wonderfully prepared update would work wonders on the first try were somewhat lessened. Or, to be blunt: I knew by then that the easy update that I had planned would turn into a nightmare[3].
Trouble in paradise
After the initial (and probably predictable) troubles (that I had failed to foresee in my (by then long dead) stubborn optimism), it was now time to see how well my redirections (from the old URLs of the previous website to the new ones) were performing. And… They were not working at all!
Let me explain what these redirections are. Before the last update, I had a dedicated page for each of my publications (so I had
publications/second-paper.html, etc.). When developing the new website, I decided that this was one click too many for the reader, and I settled for having a single
publications.html page that would contain all of them. So I wrote in my
RewriteRule ^publications/.* /publications [L,R=301]
to redirect any URL beginning with
publications/[insert anything here] to
publications.html itself. And guess what? It failed miserably: when typing
acallard.net/publications/whatever, Apache kept the URL intact (instead of redirecting to
publications.html) and the website was then nicely answering with a 404 error (as it was supposed to, since
publications/whatever no longer pointed to any actual HTML page).
I took a long breath: it was definitely strange, but I had been editing the
.htaccess file at the last possible moment (which means: it was late), so this was probably just a silly typo. I temporarily rolled everything back, and then started to do some sanity checks on my redirection rules. Here’s the thing: I couldn’t see any mistake in my configuration, and local simulations said that these redirections were rewriting URLs as expected. Another breath: this was getting weird. I tried a bunch of different ideas, and then the funniest thing happened: I realized that the rule above would work with the previous version of the website (so, the old HTML files), but not with the newer one.
Let me be perfectly clear: with the same
.htaccess file in both cases, my redirections worked perfectly fine if my root folder contained the old HTML files; but as soon as I deleted them to put the new HTML files instead, the redirections stopped working. Rolling back: redirections worked. Updating again: they no longer did.
THIS WAS NOT WHAT WAS SUPPOSED TO BE HAPPENING.
Let’s skip to the end of the story (this post is becoming too long anyway): here is my understanding of what happened. Assume that, for some reason, I want to redirect everything in the directory
index.html (or any other page, really, even one that does not exist:
index.html has no role in this whatsoever). Then I (frustratingly) write (at 1am):
RewriteRule ^a/b/.* index.html [L,R=301]
It turns out that this RewriteRule works if and only if… the directory
a/ exists in the root directory!
I don’t know how to explain how utterly stupid I think this is, for so many reasons I don’t even know where to begin! This is about URL rewriting, which really is just a special case of text rewriting: why should rewriting some text depend on what is actually stored on my drive? Even then, assume that for some reason
RewriteRule “checks” the existence of the source (which kind of defeats its point, but whatever) before rewriting anything: in this case, can anybody explain to me why the existence of the directory
a/b/ has no influence whatsoever on the redirections? For all that matters,
a/ can be empty. It boils down to this: if
a/ exists, then
/a/b/[anything] redirects to
a/ does not exist, then the
RewriteRule doesn’t do anything at all.
Even though I did not enjoy in the slightest where my conclusions were taking me, my honed detective skills left me with only one explanation: directories in the root folder have a special role in redirections.
By then, I had gotten a few hours of sleep and I had started annoying people during coffee breaks at work the next morning. As the reactions of said colleagues were more along the lines of “why do you even bother with this stuff?” than “of course, let me help you here”, I looked online and found the following post on serverfault: “RewriteRule only works if folder does not exist” (let’s note that this is the exact opposite of the problem I had, but at this point I was ready to read anything remotely related to my issue). This post refers to a paragraph of the Apache documentation that I had completely missed:
mod_rewrite tries to guess whether you have specified a file-system path or a URL-path by checking to see if the first segment of the path exists at the root of the file-system. For example, if you specify a Substitution string of
/www/file.html, then this will be treated as a URL-path unless a directory named
www exists at the root of your file-system […].
Now, that explains things. Well… at least it’s something. Calling it an explanation may be a bit of a stretch: it seems weird that there’s no way to pass a flag forcing Apache to recognize a path as filesystem instead of URL (or the other way around), so that it can only fall back on guessing the author’s intention based on a seemingly questionable heuristic (checking whether the first segment of the path exists on your drive does not look like a good idea to my ignorant mind). Why this choice? It was unavoidable that, eventually, somebody (like me) would stumble upon this issue without having read the proper documentation: they would then waste a couple of hours not understanding why their very good configuration files don’t work as expected, and get very, very frustrated.
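To make the quoted heuristic concrete, here is a toy model of it in Python. It is only my reading of the paragraph above, certainly not how mod_rewrite is actually implemented: `guessed_kind` is a made-up name, and `root` is an assumption standing in for whatever directory Apache checks the first segment against.

```python
import os
import tempfile


def guessed_kind(substitution, root):
    """Toy model of the quoted heuristic.

    The substitution counts as a file-system path if its first segment
    exists under `root` (an assumed stand-in for the directory Apache
    looks in), and as a URL-path otherwise.
    """
    first_segment = substitution.strip("/").split("/", 1)[0]
    if first_segment and os.path.exists(os.path.join(root, first_segment)):
        return "file-system path"
    return "URL-path"


with tempfile.TemporaryDirectory() as root:
    # No publications/ directory yet.
    before = guessed_kind("/publications", root)
    # Create an empty publications/ directory and ask again.
    os.mkdir(os.path.join(root, "publications"))
    after = guessed_kind("/publications", root)

print(before)  # URL-path
print(after)   # file-system path
```

The substitution string is the same in both calls; only the contents of the directory change, and that alone flips the guess.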
As I was complaining[4] to a colleague, they said that there probably was a very good reason at the time for people to settle on this solution: that is probably very true, though I have no idea what such a reason could be; and despite some googling, I did not find any answer to my question. In fact, the issue is very rarely mentioned on the web at all. Once you know of it, of course, it becomes easy to see that it is mentioned in the “Substitution” paragraph of the “RewriteRule Directive” section. But I guess that this kind of configuration file has the same pitfall that plagues poor souls trying to get through their first merge errors in git: it is very easy to search the documentation if you already know what you’re looking for; but if not, good luck.
What do I do now?
Obviously, I solved the problem by creating an empty directory
publications/ which magicked the issue away! I should be happy… But nah. I’m mostly frustrated.
I mean, for starters, I don’t even know why my rewriting rule
RewriteRule ^publications/.* /publications [L,R=301] only works if Apache interprets it in terms of filesystem paths and not as URL paths; and I find not knowing to be very disappointing. I also don’t know what the rule would be if I wanted it to work with URL paths instead. I don’t even know if there is any way to force Apache to read my RewriteRule as a filesystem path and not a URL one (I mean, any way that doesn’t consist in creating the empty directory
publications/: I already did that).
If anyone has any input to provide about this, I’d be happy to hear it: call it sunk cost fallacy or misplaced curiosity if you want, but I lost a couple of hours of my time to a weird edge case that nobody is talking about, so for my own peace of mind I’d really like to know!
Why is my computer so mean to me?
To be perfectly clear: when I talk about SEO here, it is more a matter of principles (I don’t like Google pointing to dead links, it’s not nice to the people browsing it) than a matter of rank: I don’t care about online ranking at all. I already don’t have any claim to celebrity with my work; it would be ridiculous to have some with this blog! ↩
I should probably mention that I will be teaching regex (among other things) this year: that may have given me a somewhat unwarranted boost of confidence in my
I was done with the HTML+CSS at roughly 10pm, so I thought I could publish everything online and then go to bed. Instead, I spent three hours trying to get my redirections to a working state while testing “in production”, gave up around 1am, and completely rewrote my
.htaccess file the next morning. ↩
In case this post didn’t help the realization dawning on you: I’m very good at complaining! ↩