Lights out… and back on again

This web site went down for a day. It's not the strangest technical problem I've ever seen, but it's a good example of that frustrating kind where, even after it's been remediated and things are back to normal, the cause isn't obvious, so there's no surefire way to guard against a recurrence.

On the morning of Saturday 21 January 2023, I opened the Seafile shared file-hosting app on my phone to check some notes I'd been drafting there. Those notes are stored remotely by the Seafile server-side software on my virtual private server (VPS), an Internet-connected virtual machine (VM, a sort of virtual computer that can operate alongside other VMs on shared computer hardware) that I rent at a monthly rate. But all I got was a connection error. Concerned, I tried loading my blog, also hosted on the VPS, in a browser. That didn't work either. I looked at my dashboard on the hosting provider's web site. It showed the VPS as shut off.


Well, I couldn't imagine why it would be shut off. I hadn't requested anything like that. And I hadn't changed anything since I connected the day before. So I just turned it back on. But no dice. Still nothing connecting, no response when I tried to ping the IP address of the VPS, no route for me to issue terminal commands to it. Both small kids in the house were awake at this point, and I was the only adult awake with them, so I was only taking the limited diagnostic steps I could take with half-glances at my phone in the middle of other stuff. Still, I sent a support inquiry to the hosting provider. I was hoping to hear that it was a temporary degradation of data center connectivity or I just needed to update payment info or something. Anything that would make it not really my problem.

But no, when I got a response, the ball was back in my court. I was politely reminded that the VPS hosting is for experts who diagnose their own server issues, and that there was an emergency console I could use as an alternative way to send terminal commands, and if that should fail there was a rescue mode that would suspend my VPS and give me a kind of barebones, temporary rescue server I could use to investigate and edit the VPS's files.

Eventually I wasn't the only adult around anymore, so I grabbed my laptop and tried the emergency console, but it crashed while loading. On to the rescue mode, then. I managed to connect to the virtual disk containing the VPS files, but not without the rescue system complaining about and automatically cleaning up some “orphaned inodes.” This indicated that something had gone mildly wrong with the way the files were stored on the disk. I began poking around in system log files looking for something that would tell me why the VPS shut down. But where I expected to see some error message to explain this, I found only a log where the last messages were a refused incoming connection from a user called pivpn at a particular IP address, followed at some later time by an incomprehensible log entry containing a mix of numbers, symbols, and letters from various scripts, including ő and Ə.

At this point I had exhausted most of my diagnostic options. All I had gathered was that something had caused my VPS to start writing information to the disk incorrectly and to shut off, and now it wasn't booting correctly. I sent a desperate follow-up to my support contact hoping they had access to some kind of hypervisor log or something, some information from outside the VPS that would tell me something definitive like “The VPS ran out of memory and had to be shut down,” but this went unanswered. I had seen all the info there was to see.

[Photo: children dipping their hands into a whirlpool generator]

I felt a little down on myself at that point. What did I think I was going to accomplish, playing at sysadmin, especially when I ought to be spending time with the kids? Maybe it would be better if I scrapped the idea of running a web site again and cancelled my VPS subscription. But I figured it was better to put the whole thing aside for a while and let the frustration fade before I made a decision like that. Anyhow, we had committed ourselves to take the kids to the science museum.

So I let the web site stay offline while the kids ran around the interactive exhibits. They had a blast, and I got a good workout trying to corral the two-year-old.

Back at home I decided to give up on diagnosis and switch to remediation. There was no more I was going to learn about why the site's VPS went offline, so it was time to find the most expedient way to bring it back up.

I launched the rescue system again and began downloading a bunch of files. I had all the data for my blog stored in a database system called MySQL. A system like MySQL is supposed to be kind of opaque in that it has its own format for storing and organizing data in an optimized way, and instead of working with or knowing anything about this format, the user or any application that connects to the database uses a relatively simple language called Structured Query Language (SQL) to fetch information from or add information to the database. The usual way to back up an SQL database like this is to perform an “SQL dump,” which means asking the database system to output a list of SQL commands that could recreate all the information in the database on a new system. I hadn't got around to setting up automatic SQL dumps to back up my blog, so I didn't have anything like that. But I hoped that if I kept a copy of all of MySQL's files on the disk, the contents of a directory called /var/lib/mysql/, I could somehow restore my blog's data from that. I also backed up the static images I was serving from the root domain to embed in blog posts, and the files for my Seafile server.
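For the curious, here is roughly what an automated SQL dump could look like. This is only a minimal sketch in Python, not something I had running: the function names and the all-databases file naming are made up for illustration, and it assumes the mysqldump tool is installed and can find its credentials on its own (for example in an option file).

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def dump_path(backup_dir: Path, now: datetime) -> Path:
    """Return a timestamped file name, e.g. all-databases-20230121T060000Z.sql."""
    return backup_dir / f"all-databases-{now.strftime('%Y%m%dT%H%M%SZ')}.sql"


def dump_command(target: Path) -> list[str]:
    """Build the mysqldump invocation (credentials assumed to come from an option file)."""
    return [
        "mysqldump",
        "--all-databases",        # dump every database, not just the blog's
        "--single-transaction",   # take a consistent snapshot of InnoDB tables
        f"--result-file={target}",
    ]


def run_dump(backup_dir: Path) -> Path:
    """Create backup_dir if needed, run mysqldump, and return the new dump file."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = dump_path(backup_dir, datetime.now(timezone.utc))
    subprocess.run(dump_command(target), check=True)  # raises if mysqldump fails
    return target
```

Something like run_dump(Path("/var/backups/mysql")), with that directory being a hypothetical choice, could then be called once a day from a scheduled job. The resulting .sql file is plain text, so it can be copied off the server like any other file.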

Then I deleted the whole VPS and opened up a new one. The process of actually installing the software I had running on the VPS before was faster than it might have been otherwise, because my hosting provider has a NixOS VPS image ready to go, and I had the whole system configuration saved as a file that I could use with NixOps to enact that configuration from my laptop. That part took a matter of minutes.

The thing that seemed a lot trickier was extracting the blog data (mostly the posts themselves) from the MySQL files. I poked around in some forums and got the idea of copying over to the new VPS just the files for the blog database itself and some general MySQL files with names that began with ib. That seemed too simple to me. Surely there could be some peculiarity of MySQL installed on one machine that might make it unable to use these files created on another. But I tried it anyway. MySQL refused to start after that, complaining of corrupt formatting in those ib files. So I returned it to the way it was before, and this time only copied over the files for the blog database itself. This time MySQL started, and was able to show me a list of all the data tables in the blog database, but couldn't actually access anything in those tables. Trying to load my blog in the browser at this point yielded a server error. So this time I decided to try something that I didn't really expect to work at all: I replaced the whole /var/lib/mysql/ directory on the new VPS with the one I had copied from the old VPS. It worked perfectly. The whole blog loaded in my browser again right away. A similar approach restored everything on my Seafile server.
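The approach that worked, swapping in the whole directory at once, can be sketched like this. It is an illustration under assumptions rather than a record of the exact commands used: the function name and the ".replaced" suffix are hypothetical, and it assumes the MySQL service is stopped before the swap and started again afterwards.

```python
import shutil
from pathlib import Path


def restore_data_dir(saved_copy: Path, data_dir: Path) -> Path:
    """Replace data_dir with saved_copy, keeping the old contents beside it.

    Assumes the database server is stopped while this runs.
    Returns the path where the previous data directory was set aside.
    """
    set_aside = data_dir.with_name(data_dir.name + ".replaced")
    if set_aside.exists():
        raise FileExistsError(f"{set_aside} already exists; move it away first")
    if data_dir.exists():
        data_dir.rename(set_aside)  # keep the fresh install's files, just in case
    # Copy everything together: per-database directories and the shared ib* files.
    shutil.copytree(saved_copy, data_dir)
    return set_aside
```

Setting the fresh directory aside instead of deleting it leaves a way back if the server refuses to start. Note that copying files this way does not carry over file ownership, so the restored directory probably needs to be handed back to whatever user the database server runs as before restarting it.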

Not knowing what caused this outage means I don't really know what can be done to prevent it from happening again. It's in this kind of situation that people who work with computers sometimes like to trot out unlikely-sounding explanations to cover their asses. Maybe a stray cosmic ray hit the server hardware just so and flipped an important bit from 1 to 0. Maybe someone walked through the data center in clothes that produced a static discharge, corrupting some storage. Maybe some NSA agent doesn't like the font I use on my blog and used an undisclosed vulnerability in the Nginx web server to corrupt crucial system files. Suggesting any of these would be a fancy way to basically shrug my shoulders, absolve myself of any responsibility, and acknowledge that I don't have a way to keep it from happening again.

But I can at least take some steps that will hopefully make it easier to recover if it does happen again. Turns out my hosting provider doesn't charge all that much extra per month to automatically keep several daily snapshots of the virtual disk, so if I notice within a week that something like this has happened, I can quickly revert to the state things were in before. And maybe I'll get around to automating those SQL dumps.
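If those SQL dumps do get automated, they will pile up unless something clears out the old ones. Here is a small sketch of a retention step in Python, mirroring the week of snapshots. The function name is hypothetical, and it assumes the dump files end in .sql and carry a sortable timestamp in their names, so that sorting by name is the same as sorting by age.

```python
from pathlib import Path


def prune_old_dumps(backup_dir: Path, keep: int = 7) -> list[Path]:
    """Delete all but the newest `keep` .sql files and return the ones removed."""
    # Newest first, relying on timestamped names such as dump-20230121.sql.
    dumps = sorted(backup_dir.glob("*.sql"), key=lambda p: p.name, reverse=True)
    removed = dumps[keep:]
    for old in removed:
        old.unlink()
    return removed
```

Run after each daily dump, this would leave roughly a week of history on disk at any time.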

Where not otherwise noted, the content of this blog is written by Dominique Cyprès and licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.