Tuesday, September 26, 2006

Too Much of a Good Thing: Managing Information Overload in Storage Management

http://www.gfi.com/news/en/esmlaunch.htm
http://www.enterpriseitplanet.com/storage/features/article.php/2215321

When managing storage and other network elements, you can easily end up with far too much of a good thing. Servers, routers, switches, desktops, firewalls, intrusion detection systems - each produces a wealth of information detailing every aspect of its performance, as well as the performance of related network elements. The result is an overwhelming amount of data. A vast sea of unimportant alerts within device-specific logs masks a handful of vital alerts that require immediate analysis, coordination and priority attention by administrators.

"Our admins will go in and look at the logs to see what happened before a server locked up," says Steve Luciano, Network Administrator for New Pig Corporation an industrial safety and plant maintenance vendor headquartered in Tipton, Pennsylvania. "But is difficult to keep on top of all the servers amongst everything else they have to do."

New Pig searched for a means of presenting storage and networking information from disparate sources in a useful and centralized format. This led to the company acquiring and installing Event Log Management (ELM) software.

Mother Lode

The key element to track when managing storage systems is of course the disk drives.

"You have to understand that disk drives are like light bulbs," says Paul Santeler, VP of Management Networking and High Availability Products Group at Hewlett-Packard Company (Palo Alto, CA). "They will fail. It is how well prepared you are when one fails that makes the difference between a well-run or poorly run data center."

To help administrators prepare for an impending failure, disks use a system called Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.). S.M.A.R.T. monitors up to thirty different items within the drive - such as seek time, head flying height, the amount of time it takes to spin a disk up to its rated speed and internal drive temperature.

S.M.A.R.T. analyzes all these monitored elements and creates an overall health assessment for the drive based on algorithms the manufacturer establishes for that particular model. When it appears a device is approaching the failure point, the device should alert the administrator in enough time to back up the drive and replace it. If the disk is part of a RAID array, there is an additional level of protection.
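
As a rough illustration of the health assessment described above, the following Python sketch flags a drive when any monitored attribute falls to or below its threshold. It is a simplified model, not any manufacturer's actual algorithm: the Attribute class, the assess_drive function, the attribute names and all of the numbers are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    value: int      # normalized reading; higher is assumed healthier
    threshold: int  # hypothetical manufacturer-set failure threshold


def assess_drive(attributes):
    """Return an overall status plus the names of attributes at or below threshold."""
    failing = [a.name for a in attributes if a.value <= a.threshold]
    status = "FAILING" if failing else "OK"
    return status, failing


# Hypothetical readings for three of the monitored items named above.
readings = [
    Attribute("seek_time", value=70, threshold=30),
    Attribute("spin_up_time", value=25, threshold=25),
    Attribute("temperature", value=60, threshold=40),
]

status, failing = assess_drive(readings)
print(status, failing)
```

In this made-up data set only the spin-up reading has reached its threshold, so the drive as a whole is reported as failing and the administrator knows which attribute triggered the warning.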

"When there is a failure coming, the S.M.A.R.T. drive passes that information to the RAID controller," says Santeler. "But RAID does its own analysis as well, monitoring hundreds or thousands of things on the drive itself to try to see as a whole what might cause failure."

But drive status is just one part of ensuring the availability and performance of storage systems. A complete picture requires an end-to-end view of the entire process as it affects end users. Therefore, it is wise to also keep tabs on other sources of information, including:

FECN/BECN - FECNs (Forward Explicit Congestion Notifications) and BECNs (Backward Explicit Congestion Notifications) are signals on a frame relay network indicating that there is a congestion problem.

SNMP - SNMP (Simple Network Management Protocol) lets administrators monitor and manage such items as CPU utilization, available disk space, temperature, up or down status of devices, connections or services, excessive errors on a switch/router, server fan failure and bandwidth utilization.

Security Threats - These include password hacking, stealth and port scans on firewalls, application failures due to viruses, and login authentication failures, all of which are recorded in firewall or other security logs.

Alert Reduction

While all this information should make it easy to manage the network proactively, the problem in most cases is simply too much information. Even a medium-sized network can have hundreds of separate logs, and each of these logs holds more information than can easily be digested and acted on. This is where ELMs help out. These products include Adiscon GmbH's (Erftstadt, Germany) Event Reporter, Somix Technologies, Inc.'s (Sanford, ME) Logalot, TNT Software's (Vancouver, WA) ELM Log Manager, GFI Software's (Cary, NC) LANGuard and RGE, Inc.'s (Danville, IN) IPSentry.

ELMs aggregate all the information contained in the Event Logs and Syslogs into a single database and present that information in a single interface. While this is easier than having to individually log onto each piece of equipment to view the logs, their real value lies in their ability to winnow down the information to a manageable level.
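
The aggregation step can be pictured with a short Python sketch that copies entries from two log sources into one table and then queries them together. This is only an illustration under stated assumptions, not how any of the products named above is implemented: the table layout, the open_store and ingest helpers, and the sample entries are all hypothetical.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one table shared by every log source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (ts REAL, host TEXT, source TEXT, severity TEXT, message TEXT)"
    )
    return conn


def ingest(conn, source, records):
    """Copy records from one log source into the shared table."""
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
        [(r["ts"], r["host"], source, r["severity"], r["message"]) for r in records],
    )


# Hypothetical entries standing in for an Event Log and a syslog feed.
event_log = [{"ts": 10.0, "host": "db01", "severity": "info", "message": "Service started"}]
syslog = [
    {"ts": 12.0, "host": "sw01", "severity": "warning", "message": "CRC errors on port 3"},
    {"ts": 15.0, "host": "fw01", "severity": "critical", "message": "Port scan detected"},
]

conn = open_store()
ingest(conn, "eventlog", event_log)
ingest(conn, "syslog", syslog)

# One query now covers every device and can leave out routine entries.
important = conn.execute(
    "SELECT host, message FROM events WHERE severity != 'info' ORDER BY ts"
).fetchall()
```

All three entries stay in the table, but the query returns only the two that are not routine, which mirrors the winnowing described above.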

Although the ELM stores all the log entries, it allows the administrator to set policies for how these alerts are handled. The vast majority of log entries are routine items that never need to be seen, and so these don't have to show up on the management console. But when something does require intervention, administrators can set the appropriate alerting and escalation policies.
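
A policy of this kind might look like the Python sketch below, which maps each severity to an action and escalates critical events that nobody has acknowledged in time. The severity names, the actions, the 900-second timeout and the handle function are assumptions chosen for illustration, not settings from any particular ELM product.

```python
# Hypothetical policy: what happens to an entry of each severity.
POLICY = {"info": "store", "warning": "console", "critical": "notify"}


def handle(event, now, ack_timeout=900):
    """Return the action for an event; unacknowledged critical events escalate."""
    action = POLICY.get(event["severity"], "console")  # unknown severities stay visible
    overdue = now - event["ts"] >= ack_timeout
    if action == "notify" and overdue and not event.get("acknowledged", False):
        return "escalate"
    return action


# A critical event raised at t=0 and still unacknowledged at t=1000 seconds.
print(handle({"severity": "critical", "ts": 0}, now=1000))
```

Routine entries are stored without reaching the console, while an ignored critical event moves up the chain once the timeout passes.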

New Pig, for example, uses Logalot for ELM. "If you have a problem with a switch and are getting a lot of Cyclic Redundancy Check (CRC) errors, it won't send a hundred e-mails," says Luciano, "but they all get tallied on the bulletin board so I can go there to view them."
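
The behavior Luciano describes, where every error is counted but only one message goes out, can be sketched in Python as follows. This is not Logalot's implementation; the AlertTally class and the device name are hypothetical, and a real tool would presumably also reset the tally after some interval.

```python
from collections import Counter


class AlertTally:
    """Count every alert, but queue a notification only for the first of each kind."""

    def __init__(self):
        self.counts = Counter()
        self.notifications = []

    def record(self, device, kind):
        key = (device, kind)
        self.counts[key] += 1
        if self.counts[key] == 1:
            self.notifications.append(f"{kind} on {device}")


tally = AlertTally()
for _ in range(100):
    tally.record("switch-1", "CRC error")  # hypothetical device name

print(len(tally.notifications), tally.counts[("switch-1", "CRC error")])
```

A hundred CRC errors produce a single queued notification, while the full count remains available for review.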

Having all the alerts in a single console makes it easier to quickly track down the source of a problem. For instance, seeing an alert from the Intrusion Detection System arrive at the same time as one indicating excessive CPU utilization on a database server gives a quicker answer to what is happening than if you had to track down both individually.
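
One simple way to surface such coincidences is to group alerts that arrive close together in time and keep only the groups that span more than one source. The Python sketch below does this with a 60-second window; the correlate function, the window size and the sample alerts are assumptions for illustration, and real products may correlate on other criteria.

```python
def correlate(events, window=60.0):
    """Group events separated by at most `window` seconds, keeping multi-source groups."""
    groups = []
    for event in sorted(events, key=lambda e: e["ts"]):
        if groups and event["ts"] - groups[-1][-1]["ts"] <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    # A group is interesting only if it combines alerts from different sources.
    return [g for g in groups if len({e["source"] for e in g}) > 1]


# Hypothetical alerts: two arrive together, one arrives much later on its own.
alerts = [
    {"ts": 100.0, "source": "ids", "host": "db01", "message": "Intrusion attempt"},
    {"ts": 130.0, "source": "snmp", "host": "db01", "message": "CPU utilization high"},
    {"ts": 1000.0, "source": "snmp", "host": "web01", "message": "Fan failure"},
]

related = correlate(alerts)
```

With this sample data the intrusion alert and the CPU alert come back as one related group, and the isolated fan failure is left out.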

"Before it was a matter of really not knowing what was going on or why things were happening," Luciano says. "Now when the IS manager wants to find out what is going on with the network she can go to the bulletin board and see all the active situations that are going on now."

Simplification

With storage growing at 50 to 100 percent annually in many organizations, ELM tools certainly won't solve all your problems. But they do simplify the business of dealing with a multitude of alerts, alarms and events. ELM allows the administrator to set alerting parameters for storage resources (such as disk space, fragmentation levels and disk performance criteria) and gather those alerts into one central repository. At the end of the day, that means that the most vital alerts come to your immediate attention while the multitude of duplicative or less important events are available, if you need to drill down further to find out more about specific situations.
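
To make the idea of alerting parameters for disk space concrete, here is a small Python sketch that turns a usage figure into a severity. The 80 and 95 percent thresholds and the function names are hypothetical; only shutil.disk_usage is a real standard-library call.

```python
import shutil


def classify_usage(used, total, warn_pct=80.0, crit_pct=95.0):
    """Map disk usage to a severity using hypothetical percentage thresholds."""
    pct = 100.0 * used / total
    if pct >= crit_pct:
        return "critical"
    if pct >= warn_pct:
        return "warning"
    return "info"


def check_path(path):
    """Classify the real usage of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total)
```

Calling check_path on a volume returns a severity that could then be stored in the central repository alongside alerts from other sources.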
