On Fri, Apr 8, 2011 at 11:55 AM, Perry Krug <email@example.com> wrote:
Hey guys, I'm working on analyzing some AOL log files and noticed that I only have about two days' worth of data in the logs. I assume this is because the logs are reaching the 100MB threshold and getting rolled over.
What are our options for changing this? I think it would be nice to a) increase the 100MB limit to something more like 1GB and/or b) force the system to keep 7 (or maybe 10) days' worth of logs.
The SASL binary logs are only rolled over by size, not length of time. It's really good for embedded systems where you have a limited amount of space and never want to fill a disk under any circumstances, but it's not necessarily as good for us. OTOH, it at least makes our space usage for logs predictable.
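For reference, OTP's multi-file log handler is sized through SASL application environment variables, so both knobs Perry asked about already exist; a sys.config fragment along these lines (directory and values illustrative, not our shipped defaults) would raise the per-file size and keep more files around:

```erlang
%% sys.config fragment -- illustrative values only
[{sasl, [
   {error_logger_mf_dir, "/var/log/membase"},  %% where the rotated logs live
   {error_logger_mf_maxbytes, 104857600},      %% 100 MB per file
   {error_logger_mf_maxfiles, 10}              %% keep 10 files before wrapping
 ]}].
```

Note this still rotates purely by size; with 10 files of 100 MB each we'd cap at roughly 1 GB of logs, which at current log volume should comfortably cover a week, but there's no time-based guarantee.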
The current setup makes it quite hard to track back any issue that happened more than two days ago, and also makes it hard to track the growth/change of statistics.
Overall, I suspect we need to seriously rethink (or at least rework) our logging and supportability mechanisms in the very near future. Some thoughts:
-As I've brought up before, we really need a "user visible" log that can be quickly parsed and understood. We've discussed taking the UI log output and making it longer and saveable...that would be a great start
Totally agree. I've been adding more user-visible logs as I go. These are currently gossiped around; moving them to CouchDB would make me feel a lot more comfortable about significantly expanding retention and number of these.
-It's my understanding that 1.7 will be storing stats in CouchDB...is this correct? If so, does it make sense to export these stats when doing the collect_info into some sort of csv so that they can be easily graphed and trended?
1.7 may not store the stats in CouchDB, but exporting stats into CSV form is as easy with Mnesia stats as it is with CouchDB stats (though the former requires writing Erlang code to do it while CouchDB could potentially be accessed directly from Python). Once stats are in CouchDB we could easily just replicate them to a central server. Exporting to CSV sounds like a good idea in the near term.
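The CSV export itself is trivial once the samples are in hand; a minimal Python sketch, assuming the stat samples arrive as flat dicts (the field names below are made up for illustration, not the product's actual stat schema):

```python
import csv
import io

def stats_to_csv(stat_docs, fields):
    """Flatten a list of per-sample stat documents (dicts) into CSV text.

    stat_docs: dicts roughly like what CouchDB's _all_docs?include_docs=true
    would hand back; field names here are illustrative only.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for doc in stat_docs:
        writer.writerow(doc)
    return buf.getvalue()

# Hypothetical samples -- one row per stats tick per node
samples = [
    {"ts": 1302280000, "node": "ns_1@10.0.0.1", "ops": 1200, "mem_used": 512},
    {"ts": 1302280060, "node": "ns_1@10.0.0.1", "ops": 1350, "mem_used": 530},
]
print(stats_to_csv(samples, ["ts", "node", "ops", "mem_used"]))
```

Something like this could hang off collect_info so support gets a graphable file per bucket without touching Erlang.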
-A hard look at some customer log files would be useful to see where we can compact or change some of the output. For example, there are lots of messages that could be replaced with the standard "this message has been repeated x times". Also, a number of memcached log messages are currently being recorded at "INFO" level but should be "ERROR" (they contain the word FATAL...that means bad to me, right?)
Memcached's default logger doesn't distinguish message types. It should really prefix the log messages with the type of output. Once that happens, we could easily modify ns_port_server to separate the types by prefix and log them at different levels.
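The ns_port_server side of that could be as simple as a prefix-to-level table; here's a Python sketch of the idea (the real code would be Erlang, and the prefix scheme is hypothetical until memcached actually tags its output):

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("memcached")

# Hypothetical prefixes memcached might emit once its logger tags output;
# the actual scheme would be whatever memcached settles on.
PREFIX_LEVELS = {
    "FATAL:": logging.CRITICAL,
    "ERROR:": logging.ERROR,
    "WARN:":  logging.WARNING,
    "INFO:":  logging.INFO,
}

def route_line(line):
    """Log one line of memcached output at the level implied by its prefix."""
    for prefix, level in PREFIX_LEVELS.items():
        if line.startswith(prefix):
            log.log(level, line[len(prefix):].strip())
            return level
    log.info(line.strip())  # untagged output stays at INFO, as today
    return logging.INFO
```

That would let "FATAL"-bearing messages land at ERROR/CRITICAL in our logs instead of being buried at INFO.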
-It would be very helpful to have the cluster-wide metrics available in the logs so that support could easily look at the whole system rather than having to add up each node (which gets tough at 10 nodes, let alone 100)
Once we have stats being aggregated ahead of time rather than on demand in the GUI, we can do this. Right now all the aggregation is happening in Menelaus and only when you're looking at them, so we don't have access to aggregated stats to log them. This will probably need to happen anyway to make stats scale to large clusters.
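Once the per-node samples are collected ahead of time, the aggregation itself is just a sum over nodes; a Python sketch of that step (node and stat names are illustrative):

```python
from collections import Counter

def aggregate_cluster_stats(per_node_stats):
    """Sum numeric stat counters across all nodes in a cluster.

    per_node_stats maps node name -> {stat_name: value}; the stat
    names here are made up for illustration, not our real stat keys.
    """
    total = Counter()
    for stats in per_node_stats.values():
        total.update(stats)  # Counter.update adds values key by key
    return dict(total)

cluster = {
    "ns_1@10.0.0.1": {"ops": 1200, "curr_items": 50000},
    "ns_1@10.0.0.2": {"ops": 900,  "curr_items": 48000},
}
print(aggregate_cluster_stats(cluster))
```

Of course rates and gauges need different treatment than monotonic counters, but the shape of the problem is the same: aggregate once, then both the GUI and the logs read the aggregate.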
-A reworking of the collect_info (and some test cases to make sure it continues to work across versions...because we've broken it before) is needed.
-It could contain some auto-upload functionality (to our S3 bucket)
-It could be linked to directly from the "generate diagnostic report" in the UI
-the formatting of how and where information is displayed could be improved (breaking up the output into multiple files rather than one large 'membase.log' file)
Agreed on all of these, though we probably don't want to ship a script that can write directly to our S3 bucket; it should go through some kind of messaging system so we can control access and re-point it.
I know everyone's got lots on their plates, but supportability and serviceability will become increasingly important if we want to be able to scale our support resources and make our customers comfortable with the function of the product.
I totally agree.
I "think" this could be a great project for someone to work on as they learn the inner workings of the product and how to diagnose issues, etc.
Maybe Dale, Aaron, or Volker could take some of this on.