effective logging management system & policy?

zhuang · 2011-11-10 12:16:43 +08:00 · 3149 views
I've long been dreaming about an effective system & policy for log management. Ideally, it should be *usable under extreme conditions.

Yes, I'm talking about USABILITY. I guess most system administrators, experienced or newbie, have failed at disaster recovery at least once, even when armed with thousands of backups. Backups are important, but what is usually missing is the path back to a working system.

When it comes to logging systems and policies, the question becomes: are you ready for the crime scene investigation? Unfortunately, this is not a joke to me. I often see myself as a detective, or sometimes a firefighter. Imagine this scenario: a server is down, and it's not one you maintain. Now it's your job to find out what happened and to make it right by any means. You'd better be fast.

So you grab a copy of the log files, expecting to find some obvious clues. I have to admit I'd take a deep breath before diving into such a deep sea of information. Wait a moment, you think you've got all the available log files? You are too naive. Unix and Linux systems, whether commercial or free distros, vary widely from one to the next.

Take a typical Linux-based web server as an example. You might first check the log rotation configs and estimate the time interval in which the exceptions occurred. The syslog is the general one, but it is far from enough; the web server and database daemons each have their own. As you dig into the problem, you may need network-related logs as well, say iptables and so on. If nothing seems weird, you may take the package management system into consideration. Sometimes account auditing will force you to check the su/secure/auth logs. In a fatal condition like a hacker intrusion, these logs are probably no longer reliable, and you first have to make sure no rootkit exists. By the way, if the machine is unluckily kernel-hardened, all of this work can take three times as long, or even more, before you can get close to your target. (A rough sketch of this first pass follows below.)
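To make that concrete, here is roughly the first-pass checklist I mean. This is only a sketch; the paths and log names below are assumptions that differ between Debian-style and RHEL-style systems:

    # Where do the logs live, and how far back do they go?
    $ cat /etc/logrotate.conf /etc/logrotate.d/*
    # The general system log (name varies: syslog on Debian, messages on RHEL)
    $ less /var/log/syslog /var/log/messages
    # Daemon-specific logs (paths depend on how the services were set up)
    $ less /var/log/nginx/error.log /var/log/mysql/error.log
    # Kernel ring buffer, including netfilter/iptables log targets
    $ dmesg | tail -n 50
    # Account auditing (auth.log on Debian, secure on RHEL)
    $ less /var/log/auth.log /var/log/secure
    # Verify packaged binaries before trusting any of the above
    $ rpm -Va            # RHEL-style; on Debian, debsums -s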

Remember I said *alienation? Some developers have tried hard to keep the management work easy and clear, so applause to the Gentoo community. Commercial vendors could do better; Mac OS X seems to reflow log information system-wide. But I still have complaints. To Solaris: why the hell are there 30+ directories under /var/log/? To HP: can you explain your philosophy? If your logging system is defined by roles like admins/users, who the hell is the network user named nettl? Could a log filename be uglier than nettl.LOG000? And to AIX: has your proprietary implementation brought you business success?

Do blame me for my dirty words. Actually, I tried hard to stay calm. This kind of extra work f*cks me over so often, and there is no pleasure in it at all.

Now we are just about to read the logs, but usually several hours have already passed. As far as I can tell, cat/grep/tail are among the most powerful tools for log analysis, especially if you are familiar with regular expressions. When troubleshooting, no visual solution, like a web search engine wired to a log database, can provide more detail.
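For example, carving a suspect time window out of a rotated set is a one-liner. The timestamps and filenames here are made up for illustration:

    # Error patterns across current and rotated (gzipped) kernel logs
    $ zgrep -h -E 'segfault|oom-killer' /var/log/kern.log*
    # A 15-minute window around the incident, by timestamp prefix
    $ grep -E '^Nov 10 12:(0[0-9]|1[0-5])' /var/log/syslog
    # Watch a daemon log live while you poke at the service
    $ tail -f /var/log/nginx/error.log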

If you happen to know something about software development, you must know that end users rarely understand what the errors mean. Nor do system administrators. A more common case is this: you sorted the logs by level, and some FATAL entries appeared interesting. But a 30-minute investigation turned out to be a waste of time, because it was either a segmentation fault or an out-of-memory failure. Profiling a web application is, of course, another topic entirely.
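In shell terms, that half hour usually starts with something like the following. The filename and the level keyword are assumptions; every framework spells them differently:

    # How bad is it, numerically?
    $ grep -c 'FATAL' /var/log/app.log
    # Read each FATAL entry with a little surrounding context
    $ grep -B 2 -A 5 'FATAL' /var/log/app.log | less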

Believe me, this is not the worst case. Some logs are simply unreadable, since they were never written for system administrators. Among the readable lines, finding what is really useful is something of a word-guessing game. A log file is typically rotated at 500KB or once per week, so reading it through is mission impossible. What I can do is try different keyword combinations; if I'm lucky enough, there will be some hints. (Web application coders may understand this well: if someone used automated SQL-injection scripts and broke into the system, you would probably have to read every HTTP request to locate your bug.)
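That keyword game looks roughly like this. The patterns are a hypothetical starting set for SQL-injection hunting, not a complete one:

    # Sweep current and rotated access logs for common injection fingerprints
    $ zgrep -i -E 'union.+select|information_schema|sleep\(' /var/log/nginx/access.log*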

That is my story of why logging systems fail. They log well enough, but they are not handy enough to reproduce the crime scene. I wonder if you have any advice or solutions. Thank you.


P.S. I originally posted this article to my mailing list. I will post a Chinese summary later, when I get to my PC.