I’ve seen the inside of lots and lots of businesses over the past decade or so. Though the technology has changed dramatically in many areas of the data center, the general, high-level methodologies for building a sane environment still apply. While the implementation is typically done by systems administrators, it also helps if their managers, and the people who hire systems administrators, are at least moderately clueful.
And while it’s true that the high-level methodologies haven’t changed, the ways in which people abuse or neglect them haven’t really changed much either. Here’s a list of things to avoid, and things to jump on, when building and growing your environment.
- Don’t hire systems administrators: I have to say, this is a problem I used to see only in very small businesses that couldn’t afford a sysadmin, and perhaps didn’t warrant having one full-time, but that seems to be changing. I have now spoken with perhaps four managers in as many months who really should have at least one full-time sysadmin, and instead are just assigning systems tasks to volunteers from the engineering team. The results are disastrous, and they get worse as time passes and the environment grows. At the very least, develop an ongoing relationship with a systems guru and bring them in for projects as they arise; that’s worlds better than the results you get from doling out the tasks to people who don’t really have a systems background. The problem isn’t that any given task is “hard”. Most aren’t. The problem is multi-faceted: first, if something goes wrong, developers typically don’t have the background to understand the implications and impact of that. They also probably don’t have the experience to fix the problem quickly. Further, they may not know of a resource for authoritative information on the topic (hint: online forums are not typically the best source of authoritative information for complex systems issues).
- Don’t automate: I’m an advocate of investing in automation very early and very often. If you’re just starting your business, and it relies heavily on technology, the first wave of systems you buy should include a machine to facilitate automated installation in some form. The reasons for this are many, but a couple of big ones are:
- Consistency: an automated install regime makes your system builds easily repeatable, so base installs are identical and you alter only those parts of the install that support some unique service the system provides. You can then be sure that, to a very large degree, all of your machines are identical in terms of the packages installed, the base service configurations, the user environment, etc. (a quick way to verify this is sketched just after this list).
- Server-to-sysadmin ratio: automating just about anything in your systems environment cuts the hours devoted to that task. Automate the installation, backups, monitoring, log rotation, etc., and each system administrator you hire can manage a larger number of machines.
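To make the consistency point concrete, here is a minimal sketch of the kind of drift check that an identical base build makes possible. It assumes SSH key-based access to each machine and an RPM-based distribution; the host names are hypothetical.

```python
#!/usr/bin/env python3
"""Spot package drift between hosts built from the same base install.

A minimal sketch: assumes SSH key-based access and an RPM-based distro;
the host names below are hypothetical.
"""
import subprocess

HOSTS = ["web01", "web02", "db01"]  # hypothetical host names

def package_set(host):
    """Return the set of installed packages (name-version-release) on a host."""
    out = subprocess.check_output(["ssh", host, "rpm", "-qa"])
    return set(out.decode().split())

baseline = package_set(HOSTS[0])
for host in HOSTS[1:]:
    drift = package_set(host) ^ baseline  # packages present on only one of the two
    if drift:
        print("%s differs from %s: %s" % (host, HOSTS[0], ", ".join(sorted(drift))))
```

If every machine came from the same automated install, this prints nothing; the moment someone hand-installs a package on one box, it shows up here.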
- Make security an afterthought: Security should be a very well thought out component of everything you do. People have been preaching this for eons, and yet it’s pretty clear to me that plenty (I might even say a majority) of businesses don’t even practice the basics, like keeping on top of security updates to systems and applications, removing or archiving user accounts when people leave the company, and setting up a secure means of remote access. Security breaches are typically a nightmare. There are a lot of questions to answer when one happens, and most of them take time and manual drudgery to answer. In addition, machines need to be reimaged, data needs to be recovered, audits need to take place, and, of course, the big one: everyone has to spend time in meetings to talk about all of this, and then, magically, projects are planned to ensure that it never happens again… until it does, because the only time those projects gain priority is when a breach occurs. Get on top of it now, and save yourself the headaches, the costs, and the other (way bigger) potential disasters.
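As one small, concrete example of the basics: the sketch below flags login accounts that aren’t on an approved roster. The roster file is hypothetical (one username per line), and it assumes a Unix box where UIDs below 1000 are system accounts.

```python
#!/usr/bin/env python3
"""Flag login accounts that are not on an approved roster.

A minimal sketch: the roster path is hypothetical, and UIDs below 1000
are treated as system accounts and skipped.
"""
import pwd

ROSTER = "/etc/approved_users.txt"  # hypothetical list kept in sync with HR

with open(ROSTER) as f:
    approved = {line.strip() for line in f if line.strip()}

for entry in pwd.getpwall():
    if entry.pw_uid >= 1000 and entry.pw_name not in approved:
        print("account %s (uid %d) is not on the approved roster"
              % (entry.pw_name, entry.pw_uid))
```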
- Don’t plan for failure: “plan for failure” is a phrase presently bandied about in relation to building scalable, reliable services on large numbers of machines, but it applies just as well to good old infrastructure services. For example, for some reason there are managers out there who demand that DNS be handled by in-house systems, and then end up with a single DNS server for their entire domain! I’m not kidding! I’ve seen that twice in the past year, and that’s two times too many. For every service you deploy, make a list of all of the interdependencies that arise from the use of that service. Then determine your tolerance for downtime for that particular service, taking into account the other services that go away if this one is unavailable. Why? Because it’s going to fail. If the service itself doesn’t fail, something else will, like the hard drive, and your end users won’t know or care about the difference. They’ll just know the service is gone.
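For the DNS example, even a trivial check catches the single-server trap. This sketch shells out to dig (assumed to be installed); the default domain is just a placeholder.

```python
#!/usr/bin/env python3
"""Warn when a zone is served by fewer than two nameservers.

A minimal sketch: shells out to dig (assumed installed); the default
domain is a placeholder.
"""
import subprocess
import sys

domain = sys.argv[1] if len(sys.argv) > 1 else "example.com"
out = subprocess.check_output(["dig", "+short", "NS", domain])
nameservers = [line for line in out.decode().splitlines() if line.strip()]

if len(nameservers) < 2:
    print("WARNING: %s lists %d nameserver(s); a single failure takes out DNS"
          % (domain, len(nameservers)))
else:
    print("%s is served by %d nameservers" % (domain, len(nameservers)))
```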
- Don’t communicate: as environments grow and become more complex, changes, tweaks, and modifications will be required. No systems environment I’ve ever seen is static. You’re going to want to implement services a little differently to offer more security, more reliability, a better overall user experience, or whatever. Communicating with the people you serve about these changes should be part of the planning for projects like this. Users should know that a change is happening, how and when it’s happening, how they’ll be affected while it happens, and, hopefully, how their lives will be better because of it. In addition, you should be thinking about how you’ll communicate with your systems team, and your users, in the event of a catastrophe. Do you have an out-of-band mechanism for communication that doesn’t depend on your network being alive? If, I dunno, you lost commercial power, say, causing your UPS to kick on and immediately blow up, leaving your entire data center completely dark, and you were the only person in the building, how would you get in touch with people to help you through the disaster? Note that this implies your email server, automated phone routing, etc., are down as well.
- Don’t utilize metrics: “If you can’t measure it, you can’t manage it” holds true in systems management as well. How do you know when you’ll need more disk? How do you know when your database server will start to perform badly? How do you know when to think about that reverse proxy for your web servers? You need to measure everything. Resource utilization metrics are key to growing your environment in a cost-effective way while still giving services the resources they need.
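To make the “more disk” question answerable, you only need samples over time. Here’s a sketch that appends a timestamped utilization reading to a CSV; run it from cron and you have a trend line to project against. The mount point and output path are hypothetical, and a real metrics system (RRDtool and friends) does far more.

```python
#!/usr/bin/env python3
"""Append a timestamped disk-utilization sample to a CSV for trending.

A minimal sketch meant to run from cron; the filesystem and output path
below are hypothetical.
"""
import csv
import os
import time

MOUNT = "/var"                       # hypothetical filesystem to watch
LOGFILE = "/var/tmp/disk_usage.csv"  # hypothetical output file

st = os.statvfs(MOUNT)
total = st.f_blocks * st.f_frsize
free = st.f_bavail * st.f_frsize
pct_used = 100.0 * (total - free) / total

with open(LOGFILE, "a") as f:
    csv.writer(f).writerow([int(time.time()), MOUNT, "%.1f" % pct_used])
```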
- Don’t utilize monitoring: not metrics in this case, but service and system availability monitoring. What’s the cost to you if your website is unavailable for one minute? One hour? One day? If the database server behind your web site goes down at 10pm and nobody knows about it until 8am, how does that affect your business? In reality, you *always* have availability monitoring: in the absence of any other solution, your customers become your alert mechanism. And what does that cost you in terms of how the service you provide is perceived? Monitoring can be non-trivial to set up, but it is absolutely essential in almost all environments.
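A toy version of availability monitoring fits in a few lines, and while it’s no substitute for a real monitoring system, it shows the shape of the thing. The URL and addresses below are hypothetical, and it assumes a mail server listening on localhost.

```python
#!/usr/bin/env python3
"""Poll a URL and send an alert email when it stops answering.

A toy sketch of availability monitoring, not a real monitoring system;
the URL and addresses are hypothetical, and a local mail server is assumed.
"""
import smtplib
import urllib.request
from email.message import EmailMessage

URL = "https://www.example.com/"   # hypothetical service to watch
ALERT_TO = "oncall@example.com"    # hypothetical alert recipient

def site_is_up(url, timeout=10):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

if not site_is_up(URL):
    msg = EmailMessage()
    msg["Subject"] = "ALERT: %s is not responding" % URL
    msg["From"] = "monitor@example.com"
    msg["To"] = ALERT_TO
    msg.set_content("The availability check for %s failed." % URL)
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)
```

Run something like this from cron every few minutes and your customers stop being your alert mechanism.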
- Don’t use revision control: Revision control can get you and your team out of more headaches than I can list here. I’m not even going to tell you which tools to consider, because if you’re running an environment without revision control, almost anything is better than what you have. Revision control can hold different versions of all of the configuration files on your systems, the documentation for all of your systems, all of the code written in your environment (e.g. the scripts used for system automation), and your automated installation template files. It can also be part of the chain of tools used to perform rollouts of new applications in a sane way (it can also be used to do rollouts in insane ways, but that’s another post). Revision control is equal parts disaster recovery, convenience, accountability, consistency, and control. And to the extent that activity can be measured, it also provides metrics.
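One low-effort way to start is to keep /etc itself under revision control and snapshot it after changes. A minimal sketch, assuming git as the tool and that the repository has already been initialized in /etc:

```python
#!/usr/bin/env python3
"""Snapshot /etc into revision control after a change.

A minimal sketch, assuming git and that 'git init' has already been run
in /etc; run it from cron or a change wrapper.
"""
import getpass
import socket
import subprocess

REPO = "/etc"  # assumes the repository already exists here

def git(args, check=True):
    """Run a git command inside the repository."""
    return subprocess.run(["git"] + args, cwd=REPO, check=check)

git(["add", "--all"])  # stage new, changed, and deleted files
message = "config snapshot on %s by %s" % (socket.gethostname(), getpass.getuser())
# commit exits non-zero when there is nothing new to record, so don't treat that as fatal
git(["commit", "-m", message], check=False)
```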
- Don’t use configuration management: Depending on the size of the environment, this can come down to something as simple as an NFS mount or a set of imaging templates, with maybe some rsync-ish scripts around (in a revision control system!), or it can get complex, involving tools like Puppet or CFEngine, along with others depending on your platform and other constraints. The idea, though, is to abstract away some of the low-level, manual drudgery that goes along with systems management, so that you can, to quote the Puppet website, “focus more on how things should be done and less on doing them.” This ties in nicely with revision control, recoverability, consistency, automation, and increasing your server-to-sysadmin ratio.
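To illustrate the “describe the state, let the tool converge it” idea without tying it to any particular product: the toy sketch below rewrites a file only when it differs from the desired content. Real tools like Puppet and CFEngine do vastly more (packages, services, dependencies, reporting); the managed file and its contents here are hypothetical.

```python
#!/usr/bin/env python3
"""Illustrate the declarative idea behind configuration management.

A toy sketch only; the managed file and its contents are hypothetical.
"""

DESIRED = {
    "/etc/motd": "Managed by the sysadmin team; local edits will be overwritten.\n",
}

def ensure_file(path, content):
    """Rewrite the file only if it is missing or differs from the desired content."""
    try:
        with open(path) as f:
            if f.read() == content:
                return False  # already converged; nothing to do
    except OSError:
        pass  # missing or unreadable: fall through and (re)create it
    with open(path, "w") as f:
        f.write(content)
    return True

for path, content in DESIRED.items():
    changed = ensure_file(path, content)
    print("%s: %s" % (path, "converged" if changed else "already in desired state"))
```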
- Don’t be a people person: All of IT, not just systems management, has historically had trouble communicating with business-unit personnel. Likewise, business personnel often have no idea how to communicate with IT personnel. No matter which side of the fence you fall on, being good with people (communicating well, perceiving and predicting the needs of others) will be very beneficial in winning consensus and gaining support for your projects. If nothing else, your career benefits from a reputation for good interpersonal skills and being an “all around good guy.” I know this is a completely non-technical item, but in my experience some of the biggest problems in systems management are not about technology, but rather about people.