"Our admins will go in and look at the logs to see what happened before a server locked up," says Steve Luciano, Network Administrator for New Pig Corporation, an industrial safety and plant maintenance vendor headquartered in Tipton, Pennsylvania. "But it's difficult to keep on top of all the servers amongst everything else they have to do."
New Pig searched for a means of presenting storage and networking information from disparate sources in a useful and centralized format. This led to the company acquiring and installing Event Log Management (ELM) software.
The key element to track when managing storage systems is, of course, the disk drives.
"You have to understand that disk drives are like light bulbs," says Paul Santeler, VP of Management Networking and High Availability Products Group at Hewlett-Packard. "They will fail. It is how well prepared you are when one fails that makes the difference between a well-run or poorly run data center."
To help administrators prepare for impending failures, disks use a system called Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.). S.M.A.R.T. monitors up to thirty different items within the drive, including seek time, head flying height, the time it takes to spin a disk up to its rated speed, and the internal temperature of the drive.
S.M.A.R.T. analyzes all these monitored elements and creates an overall health assessment for the drive based on algorithms the manufacturer establishes for that particular model. When a device appears to be approaching the failure point, S.M.A.R.T. alerts the administrator, ideally in enough time to back up the drive and replace it. If the disk is part of a RAID array, there is an additional level of protection.
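The threshold-style health assessment S.M.A.R.T. performs can be sketched in a few lines. This is a minimal illustration, not any manufacturer's actual algorithm: the attribute names, readings, and thresholds are hypothetical, and the rule used (an attribute is failing once its normalized value drops to or below its threshold) is a simplification.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # normalized reading; higher means healthier
    threshold: int  # failure threshold set by the drive maker


def failing_attributes(attrs):
    """Return names of attributes at or below their failure threshold."""
    return [a.name for a in attrs if a.value <= a.threshold]


def drive_health(attrs):
    """Collapse the per-attribute checks into one overall verdict."""
    return "FAILING" if failing_attributes(attrs) else "OK"


# Hypothetical readings for one drive (illustrative numbers only).
sample = [
    SmartAttribute("seek_time", value=70, threshold=30),
    SmartAttribute("spin_up_time", value=24, threshold=25),
    SmartAttribute("temperature", value=60, threshold=40),
]

print(drive_health(sample), failing_attributes(sample))
```

In this made-up sample only the spin-up reading has crossed its threshold, which is enough to flag the whole drive.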
"When there is a failure coming, the S.M.A.R.T. drive passes that information to the RAID controller," says Santeler. "But RAID does its own analysis as well, monitoring hundreds or thousands of things on the drive itself to try to see as a whole what might cause failure."
But drive status is just one part of ensuring the availability and performance of storage systems. A complete picture requires an end-to-end view of the entire process as it affects end users. It is therefore wise to keep tabs on other sources of information as well, including:
FECN/BECN - FECNs (Forward Explicit Congestion Notifications) and BECNs (Backward Explicit Congestion Notifications) are Frame Relay messages that notify the receiving (FECN) or sending (BECN) device that there is congestion in the network.
SNMP - SNMP (Simple Network Management Protocol) lets administrators monitor and manage such items as CPU utilization, available disk space, temperature, up or down status of devices, connections or services, excessive errors on switches/routers, server fan failure, and bandwidth utilization.
Security Threats - This includes password hacking, stealth and port scans on firewalls, application failures due to viruses, and login authentication failures stored in firewall or other security logs.
While all this information should make it easy to proactively manage storage and network systems, the usual problem is too much information. Even a medium-sized network can have hundreds of separate logs, and each of these logs contains more information than can easily be digested and acted on. This is where Event Log Management (ELM) tools help out. Examples of ELMs include Adiscon GmbH's Event Reporter; Somix Technologies, Inc.'s Logalot; TNT Software's ELM Log Manager; GFI Software's LANGuard; and RGE, Inc.'s IPSentry.
ELMs aggregate all the information contained in the Event Logs and Syslogs into a single database and present that information in a single interface. While this is easier than logging on to each piece of equipment individually to view its logs, the real value in ELMs lies in their ability to winnow down the information to a manageable level.
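The aggregation step can be pictured with a small sketch. It assumes entries have already been collected from each source; the host names, timestamps, messages, and table layout are all hypothetical, and an in-memory SQLite database stands in for whatever store a real ELM product uses.

```python
import sqlite3

# Hypothetical entries as they might arrive from two kinds of source.
event_log = [("fileserver1", "2003-05-01 09:14:02", "ERROR", "Disk 2 bad block")]
syslog = [("switch-a", "2003-05-01 09:14:05", "WARNING", "CRC errors on port 7")]


def build_store(*sources):
    """Load entries from every source into one in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE events "
        "(source TEXT, host TEXT, ts TEXT, severity TEXT, message TEXT)"
    )
    for source_name, entries in sources:
        db.executemany(
            "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
            [(source_name, *entry) for entry in entries],
        )
    db.commit()
    return db


db = build_store(("eventlog", event_log), ("syslog", syslog))

# One query now spans every device, ordered by time.
for row in db.execute("SELECT ts, host, severity, message FROM events ORDER BY ts"):
    print(row)
```

Once everything sits in one table, narrowing the view becomes a matter of adding conditions to a single query rather than visiting each device.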
ELMs store all log entries, but since the vast majority of entries are routine items that never need to be seen, the non-essential entries can be configured not to show up on the management console. For events that do require intervention, administrators can set appropriate alerting and escalation policies.
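A rough sketch of that split between storing, displaying, and escalating follows. The severity levels, the console cutoff, and the recipient names are assumptions made for illustration; real ELM products define their own levels and policy formats.

```python
# Hypothetical severity ranking.
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "ERROR": 2, "CRITICAL": 3}

# Hypothetical policy: lowest severity shown on the console, and who is
# notified at each level.
CONSOLE_MIN = "WARNING"
ESCALATION = {
    "ERROR": ["admin-on-call"],
    "CRITICAL": ["admin-on-call", "is-manager"],
}


def visible_on_console(entry):
    """Routine entries stay stored but are hidden from the console."""
    return SEVERITY_RANK[entry["severity"]] >= SEVERITY_RANK[CONSOLE_MIN]


def recipients(entry):
    """Look up who should be alerted for this entry, if anyone."""
    return ESCALATION.get(entry["severity"], [])


log = [
    {"severity": "INFO", "message": "Backup job completed"},
    {"severity": "WARNING", "message": "Disk 80 percent full"},
    {"severity": "CRITICAL", "message": "RAID array degraded"},
]

stored = list(log)  # every entry is kept
console = [e for e in stored if visible_on_console(e)]

for entry in console:
    print(entry["message"], "->", recipients(entry))
```

Here all three entries are retained, two reach the console, and only the critical one triggers notifications.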
New Pig, for example, uses Logalot for ELM. "If you have a problem with a switch and are getting a lot of Cyclic Redundancy Check (CRC) errors, it won't send a hundred e-mails," says Luciano, "but they all get tallied on the bulletin board so I can go there to view them."
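The behavior Luciano describes, one notification but a full tally, might look something like the sketch below. It is not Logalot's implementation; the class name, the device name, and the rule of notifying only on the first occurrence are all assumptions chosen to illustrate the idea.

```python
from collections import Counter


class AlertBoard:
    """Tally every alert, but notify only on the first occurrence of each kind."""

    def __init__(self):
        self.tally = Counter()
        self.sent = []  # stands in for outgoing e-mail

    def report(self, device, problem):
        key = (device, problem)
        self.tally[key] += 1
        if self.tally[key] == 1:
            self.sent.append(f"{problem} on {device}")


board = AlertBoard()
for _ in range(100):
    board.report("switch-a", "CRC error")

# One message went out; the board still shows all one hundred errors.
print(len(board.sent), board.tally[("switch-a", "CRC error")])
```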
Having all alerts available in a single console makes it easier to quickly track down the source of a problem. For instance, knowing that you have simultaneous alerts from the Intrusion Detection System and from the database server indicating excessive CPU utilization provides a quicker answer to what is happening than if you had to track down each individually.
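One simple way to surface such coincidences is to pair up alerts from different sources that fall within a short time window. The sketch below assumes a five-minute window and uses hypothetical source names and timestamps; it illustrates the idea rather than how any particular ELM product correlates events.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed correlation window


def correlated(alerts, window=WINDOW):
    """Return pairs of alerts from different sources raised within `window`."""
    ordered = sorted(alerts, key=lambda a: a["time"])
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["time"] - first["time"] > window:
                break  # later alerts are even further away
            if second["source"] != first["source"]:
                pairs.append((first["source"], second["source"]))
    return pairs


# Hypothetical alerts gathered in the central console.
alerts = [
    {"source": "ids", "time": datetime(2003, 5, 1, 10, 0), "message": "Port scan detected"},
    {"source": "db-server", "time": datetime(2003, 5, 1, 10, 2), "message": "CPU above 95 percent"},
    {"source": "backup", "time": datetime(2003, 5, 1, 11, 30), "message": "Job finished late"},
]

print(correlated(alerts))
```

Only the intrusion-detection and database alerts are paired, since the backup alert arrives well outside the window.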
"Before, it was a matter of not really knowing what was going on or why things were happening," Luciano says. "Now, when the IS manager wants to find out what is going on with the network, she can go to the bulletin board and see all the active situations that are going on."
With storage growing at 50 to 100 percent annually in many organizations, ELM tools certainly won't solve all problems. They do, however, simplify the often overwhelming business of dealing with multitudes of alerts, alarms, and events. ELMs allow the administrator to set alerting parameters for storage resources (such as disk space, fragmentation levels, and disk performance criteria) and gather those alerts into one central repository. At the end of the day, that means the most vital alerts come to your immediate attention, while the mass of duplicate or less important events remains hidden until you need to drill down to learn more about a specific situation.