Logbook Outage Update
First, thanks to all for your patience. While we're waiting on the database to rebuild, I'd like to take this moment to provide some greater insight into what happened.
Last month, our cloud service provider, Amazon, distributed a set of mandatory security patches to all of our servers. These patches required system reboots and we never knew exactly when they would come. The "reboot day" came and went with little fanfare. We did notice a few problems on the Forums server which were immediately corrected. Everything else survived the reboot and was working properly, including the Logbook server.
This week, while performing routine health checks on our servers, we noticed that a key disk drive on the Logbook machine was filling up and had reached 86% full. While looking at ways to remedy this, we found that the error log file on the Logbook server had grown to an enormous size, nearly 300 gigabytes. The log indicated an internal fault in the Logbook database, which was causing a flood of warning messages. Despite these warnings, the Logbook server was apparently running normally.
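For those who want to run a similar check on their own machines, here is a minimal sketch of the kind of health check described above. It is not our actual tooling: the function names, the 80% disk threshold, and the 1 GiB log limit are all illustrative assumptions.

```python
import os
import shutil


def disk_percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def oversized_files(paths, limit_bytes):
    """Return (path, size_in_bytes) pairs for existing files larger than limit_bytes."""
    result = []
    for p in paths:
        if os.path.isfile(p):
            size = os.path.getsize(p)
            if size > limit_bytes:
                result.append((p, size))
    return result


def health_report(mount, log_paths, max_percent=80.0, max_log_bytes=1024**3):
    """Collect human-readable warnings; thresholds are hypothetical defaults."""
    warnings = []
    pct = disk_percent_used(mount)
    if pct >= max_percent:
        warnings.append(f"{mount} is {pct:.0f}% full")
    for p, size in oversized_files(log_paths, max_log_bytes):
        warnings.append(f"{p} has grown to {size / 1024**3:.1f} GiB")
    return warnings
```

Run on a schedule with the real mount point and log paths filled in, a check like this would flag a runaway error log long before the disk approaches capacity.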
Although the server was still running and the data was (apparently) still intact, we shut the server down as a precautionary measure. At first, we didn't know if this was a problem that we could correct immediately, or one which would require an extended outage. Accordingly, our first announcement indicated that the outage was less severe than it was, based on what we knew at the time. When we discovered that a full database reload was the only fix, we amended our announcement to more accurately convey the situation.
The problem was extremely esoteric, and we believe it was caused by last month's forced reboot. The most trustworthy fix was to dump the entire database and reload it into a fresh database engine. This is technically easy but very time-consuming, as there are now in excess of 72 million QSOs (stored in 67 gigabytes) to insert and index back into the database. The time needed to re-insert these records is not precisely known; however, at this moment, about 18 hours later, the database operation appears to be about 50% complete. Clearly, the Logbook won't be back this afternoon, and the reload may well take another 18 hours to complete.
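The "another 18 hours" figure is a simple linear extrapolation from the progress so far. As a rough sketch (the function names are illustrative, and it assumes the reload proceeds at a constant rate, which index building may not do):

```python
def estimate_remaining_hours(elapsed_hours, fraction_done):
    """Linear extrapolation: assumes the remaining work runs at the same rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1 - fraction_done) / fraction_done


def records_per_second(total_records, fraction_done, elapsed_hours):
    """Average insert rate implied by the progress so far."""
    return total_records * fraction_done / (elapsed_hours * 3600)


# Figures from the update above: about 50% done after about 18 hours,
# with roughly 72 million QSOs to reload.
print(estimate_remaining_hours(18, 0.5))               # 18.0
print(round(records_per_second(72_000_000, 0.5, 18)))  # 556
```

In other words, the reload has averaged roughly 550 QSOs per second so far, so the estimate will shift if that rate changes.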
While we regret the outage, we still believe that, given the circumstances, the precautionary shutdown was the best course to take to avoid the loss of QSO data. For the IT professionals among us: the same issue exists even with replicated databases, as the error seems to propagate to replication servers. We don't use replication; however, we do snapshot the entire 500GB database every night, and we have several weeks' worth of back copies to draw upon in the event of a more substantial disaster.
Again, we thank you for your patience and understanding as we are working diligently to bring the server back online.
73, -fred
Fred Lloyd, AA7BQ