SAN FRANCISCO -- Facebook MySQL database engineer Rob Wultsch has one primary suggestion for managing thousands of servers: keep it simple, stupid.
KISS is a popular acronym, and at Facebook, Wultsch said, the principle is key to provisioning servers quickly, reducing single points of failure and increasing automation.
“Simple systems are easy to put back together,” he said. “Complex systems are not.”
One of Wultsch’s main tenets is that servers fail, and there’s nothing anyone can do about it. So database and systems administrators should counter that with simplicity: Have as few hardware stock keeping units (SKUs) as possible, find operating systems and firmware that work, and stick with them as long as possible. Variety and upgrades are scary.
“It’s better if you have one SKU,” he said, “or maybe a big box and a little box SKU. Then you can just replace them, and prayer doesn’t have to be part of the equation.”
Wultsch said the MySQL database servers at Facebook are better organized, with far fewer single points of failure, than those at his previous job as a MySQL database administrator (DBA) at GoDaddy.com. As the server environment grows larger, homogeneity at the hardware, operating system, database and software levels is key to managing changes, handling backups and rolling back when necessary.
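To make the homogeneity point concrete, here is a minimal Python sketch of how an administrator might audit a fleet for stray hardware, operating system or database versions. It is not Facebook's tooling: the inventory records, field names and values below are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical inventory; hosts, field names and values are made up for illustration.
inventory = [
    {"host": "db001", "sku": "big-box", "os": "os-a", "mysql": "5.1"},
    {"host": "db002", "sku": "big-box", "os": "os-a", "mysql": "5.1"},
    {"host": "db003", "sku": "big-box", "os": "os-b", "mysql": "5.1"},
    {"host": "db004", "sku": "little-box", "os": "os-a", "mysql": "5.1"},
]


def variant_counts(hosts, fields=("sku", "os", "mysql")):
    """Count how many hosts share each distinct value of each field."""
    return {field: Counter(h[field] for h in hosts) for field in fields}


def outliers(hosts, field):
    """Return the names of hosts whose value for `field` is not the most common one."""
    counts = Counter(h[field] for h in hosts)
    most_common_value, _ = counts.most_common(1)[0]
    return [h["host"] for h in hosts if h[field] != most_common_value]


if __name__ == "__main__":
    for field, counts in variant_counts(inventory).items():
        print(field, dict(counts), "outliers:", outliers(inventory, field))
```

The fewer distinct values each field reports, the closer the fleet is to the "one SKU" ideal Wultsch described. Any host listed as an outlier is a candidate for replacement or reprovisioning.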
Wultsch gave some other suggestions for handling large database server environments:
- Make sure there are warm spare servers. Have many of them, sitting in the rack, ready to be used. “This goes back to the fact that backups fail,” he said.
- Be able to quickly provision and reprovision hosts. “At my last job it took hours to put a new OS on a server. At my new job it’s easy: run one command, go to lunch and by the time I get back it’s probably done.”
- Have external support. “It’s nice to be able to call out when things go very wrong.”
Managing people in large MySQL database server environments is as important as the technical lessons, if not more so, Wultsch said.
First, burnout is “very real.”
“Being a DBA is long hours. It’s high-stress, even though it’s pretty good pay,” he said. “So a lot of people don’t want to do it.”
Wultsch said that a DBA in large-scale environments must be able to program well in order to properly manage thousands of servers. At Facebook, DBAs must be able to program in a “P language” such as Perl or Python, have good Linux and Bash scripting skills and decent database knowledge.
Just as important is being humble.
“We deal with too much to know anything amazingly well,” he said. “We do what we think is right and sometimes we’re wrong.”
Wultsch added that the ramp-up at a large-scale company is long. At Facebook, it is usually six months before a new DBA can do anything useful and a full year to be fully up to speed. He said it was the same situation at GoDaddy.com.
Finally, he said that mistakes happen, and it should be an organization’s goal to minimize them. That is done through policy: by not calling database administrators in the middle of the night every time something small goes awry.
“When people don’t sleep because they get called every time a monitor blips, they tend to make more mistakes,” he said.
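One way such a policy could be encoded, offered only as a rough sketch, is to page someone only after a monitor has failed several checks in a row, so that a single blip wakes no one. The function name and the threshold of three are assumptions made for this example; Wultsch did not describe how Facebook's alerting works.

```python
def should_page(check_results, threshold=3):
    """Decide whether to page the on-call DBA.

    `check_results` is a list of booleans, oldest first, where True means the
    monitor check passed. Page only if the most recent `threshold` checks all
    failed. The threshold value is an illustrative assumption.
    """
    if len(check_results) < threshold:
        return False
    return all(not ok for ok in check_results[-threshold:])


if __name__ == "__main__":
    print(should_page([True, False, True]))          # a single blip
    print(should_page([True, False, False, False]))  # a sustained failure
```

Under these assumptions, the first call does not page and the second does. A real deployment would also have to weigh how long a sustained outage can go unnoticed against how much sleep the policy protects.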