I had a stressful end of workday yesterday. I accidentally deleted the database on three live sites, including our most important e-commerce site, and had to scramble to restore from backups. I was setting up a new set of servers. I cloned some old ones to get the same OS versions and all the setup done quickly. The only problem with that approach is that I then had to delete the sites from the new servers one by one, to ensure everything was removed correctly by my automated scripts. This goes kinda slow, and I could probably come up with a better way for this type of situation if I’m going to do it more often. Anyway, since the servers were just cloned and still had the IPs from the old server set, the sites couldn’t connect to the new database server, preventing my removal script from deleting the site database automatically. I had to go in and do DROP DATABASE manually for each. Probably out of muscle memory of connecting to the server with those sites, I logged in to the original, live one without noticing, and dropped three databases before noticing the server name.
Uh oh. I checked the sites listed on my terminal as done so far, and indeed, they were throwing 500 errors. Panic, especially since my boss had just recently come back to the office. Since I rarely have to restore databases, I had to remind myself where the backups were stored. I found the most recent, which was from four hours prior. We do full backups every six hours. So some data would be lost. I hoped nothing important, but that was something to worry about once the sites were back up. I had to gunzip the backups, log in to MySQL, create the database, USE it, and run source on the file. Rushing, I once forgot to switch to the new database and overwrote the wrong one, forcing me to redo it. Since the databases were now new, I had to look up the CREATE USER and GRANT commands needed to allow the sites to connect, fill them in with the correct values, and run them for each site. After all this, I checked several pages on each site to verify they were working. Phew.
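For next time, here is a rough Python sketch of those restore steps written down as a plan, so I don't have to reconstruct them under pressure. It only builds the statements and the shell command as strings; it does not run anything. The function name, the example names, and the ALL PRIVILEGES grant are assumptions for illustration, not what our setup actually uses. The main idea is to name the target database on the mysql command line instead of relying on a USE typed by hand. If the dump file itself contains USE statements, those would still take over.

```python
import re
import shlex

# Hypothetical helper: names and privilege level are placeholders.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")

def restore_plan(dump_gz, database, user, host):
    """Return (setup_sql, load_cmd) for restoring one gzipped SQL dump."""
    for name in (database, user):
        if not SAFE_NAME.match(name):
            raise ValueError(f"refusing unusual name: {name!r}")
    setup_sql = [
        f"CREATE DATABASE `{database}`;",
        # The password is left as a placeholder to fill in by hand.
        f"CREATE USER IF NOT EXISTS '{user}'@'{host}' IDENTIFIED BY '<password>';",
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO '{user}'@'{host}';",
    ]
    # Passing the database name to the mysql client selects it up front,
    # so the dump cannot land in whichever database happened to be active.
    load_cmd = f"gunzip -c {shlex.quote(dump_gz)} | mysql {shlex.quote(database)}"
    return setup_sql, load_cmd
```

Printing the plan for each site and reading it over before pasting anything would have caught my wrong-database mistake.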
Then it was time to figure out whether anything had happened since the backup that would be missing from the data. If any orders or anything like that had come in, it might be difficult to reproduce them properly. Luckily, only one of the sites does e-commerce at all, one’s barely used, and one doesn’t have a lot of dynamic changes other than sending newsletters. To figure out what was done in the missing data timeframe, I looked at the site access logs, particularly for POST requests. I had to weed out bot spam, logins, and other unimportant requests. Lucky for me, the e-commerce site was the only one that had anything done at all, and no orders. Basically two things happened: one was a comment on an order line item and the other a modification to a product. Phew.
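The log search could be scripted roughly like this. It is a minimal sketch that assumes the access logs are in the common/combined log format, which I haven't confirmed for every site. The function name and the ignored path prefixes are made up for the example. Bot spam would still need weeding out by eye.

```python
import re
from datetime import datetime, timezone

# Matches the start of a common/combined log format line (an assumption):
# ip ident user [timestamp] "METHOD path protocol" status
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})')

def posts_since(lines, cutoff, ignore_prefixes=("/login",)):
    """Return (time, ip, path, status) for POST requests at or after cutoff."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        ip, stamp, method, path, status = m.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        if method != "POST" or when < cutoff:
            continue
        if path.startswith(tuple(ignore_prefixes)):
            continue  # e.g. logins, which do not change site data
        hits.append((when, ip, path, int(status)))
    return hits
```

The cutoff has to be a timezone-aware datetime set to the time of the backup, for example `datetime(2024, 3, 12, 14, 0, tzinfo=timezone.utc)`, so it can be compared against the log timestamps.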
The end of day was approaching, and people were leaving. I had been considering whether to tell my boss right away or wait until the backups were restored and I had searched for the data changes that would need to be restored on top. I was sort of dreading having to tell him and trying to figure out the best way. When I was ready, he was out sitting at the conference table. I went out, and at first he started to say goodbye as if I were heading off, but I sat down next to him with my laptop. I told him I had messed up the Sweet Modern database and had to restore from a backup from earlier that day. I told him what I had found in the access logs for missing changes. He said the line item comment was likely added by our sales lady, so he’d tell her she’d need to re-add it the next day. He determined the product that was modified was one sold in person that day, so I just had to set the inventory to zero. He then went on to talking about other stuff, as if no biggie. So, phew again.
Things ended up fine, but that was quite stressful in the moment. I was tired when I headed home. In the future I will have to try not to rush doing things like this. It might be nice to do the database backups more frequently, but doing full database dumps actually takes a fair amount of processor power on the DB server, and I’m not sure if we’d want to have it going more often. Then again, I was very lucky this time with only two relatively minor transactions lost. I will have to look at that and see if maybe it can be done more efficiently or if it isn’t even high enough load to worry about.