Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.5.3
    • Fix Version/s: feature-backlog
    • Component/s: ns_server
    • Security Level: Public
    • Labels:

      Description

      On Fri, Apr 8, 2011 at 11:55 AM, Perry Krug <perry@couchbase.com> wrote:
      Hey guys, I'm working on analyzing some AOL log files and noticed that I only have about two days' worth of data in the logs. I assume this is because the logs are reaching the 100 MB threshold and getting rolled over.

      What are our options for changing this? I think it would be nice to a) increase the 100 MB limit to more like 1 GB and/or b) force the system to keep 7 (or maybe 10) days' worth of logs.

      The SASL binary logs are only rolled over by size, not length of time. It's really good for embedded systems where you have a limited amount of space and never want to fill a disk under any circumstances, but it's not necessarily as good for us. OTOH, it at least makes our space usage for logs predictable.
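To illustrate the trade-off between the two rollover policies, here is a minimal sketch using Python's standard `logging.handlers` module. It is an analogy only, not the SASL configuration ns_server uses; the function names, the backup count of 10, and the 7-day retention are assumptions chosen to mirror the numbers discussed above.

```python
import logging.handlers


def make_size_handler(path, max_bytes=100 * 1024 * 1024, backups=10):
    """Roll over when the file reaches max_bytes (the current, size-only policy).

    Disk usage is bounded, but the time span covered depends on log volume.
    The backup count is a hypothetical value.
    """
    return logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups, delay=True
    )


def make_time_handler(path, days_to_keep=7):
    """Roll over at midnight and keep a fixed number of days.

    The time span covered is predictable, but disk usage is not bounded.
    """
    return logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", backupCount=days_to_keep, delay=True
    )


def worst_case_bytes(max_bytes, backups):
    """Upper bound on disk usage for the size-based policy (active file plus backups)."""
    return max_bytes * (backups + 1)
```

With these assumed defaults the size-based policy never exceeds 1100 MB on disk, while the time-based policy keeps a week of history at whatever size the log volume produces.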

      The current setup makes it quite hard to track back for any issue that happened more than 2 days ago and also makes it hard to track the growth/change of statistics.

      Overall, I suspect we need to seriously rethink (or at least rework) our logging and supportability mechanism in the very near future. Some thoughts:
      -As I've brought up before, we really need a "user visible" log that can be quickly parsed and understood. We've discussed taking the UI log output and making it longer and saveable...that would be a great start

      Totally agree. I've been adding more user-visible logs as I go. These are currently gossiped around; moving them to CouchDB would make me feel a lot more comfortable about significantly expanding retention and number of these.

      -It's my understanding that 1.7 will be storing stats in CouchDB...is this correct? If so, does it make sense to export these stats when doing the collect_info into some sort of csv so that they can be easily graphed and trended?

      1.7 may not store the stats in CouchDB, but exporting stats into CSV form is as easy with Mnesia stats as it is with CouchDB stats (though the former requires writing Erlang code to do it while CouchDB could potentially be accessed directly from Python). Once stats are in CouchDB we could easily just replicate them to a central server. Exporting to CSV sounds like a good idea in the near term.
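As a rough sketch of what a CSV export could look like once the samples have been pulled out of whichever store holds them, the Python below writes one row per sample. The sample layout (a dict with a `timestamp` key plus arbitrary stat names) and the function name are assumptions for illustration; they are not the actual Mnesia or CouchDB schema.

```python
import csv


def stats_to_csv(samples, out):
    """Write stat samples to `out` as CSV, one row per timestamp.

    `samples` is a list of dicts, each with a "timestamp" key and any
    number of stat names (hypothetical layout). Stats missing from a
    sample are written as empty cells so columns stay aligned.
    """
    stat_names = sorted({k for s in samples for k in s if k != "timestamp"})
    writer = csv.DictWriter(out, fieldnames=["timestamp"] + stat_names, restval="")
    writer.writeheader()
    for sample in sorted(samples, key=lambda s: s["timestamp"]):
        writer.writerow(sample)
```

A file written this way opens directly in a spreadsheet or plotting tool, which is the "easily graphed and trended" use case raised above.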

      -A hard look at some customer log files would be useful to see where we can compact / change some of the output. For example, there are lots of messages that could be replaced with the standard "this message has been repeated x times". Also, there are a number of memcached log messages that are currently being recorded as "INFO" level but should be "ERROR" (they contain the word FATAL...that means bad to me, right?)
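The "repeated x times" idea can be sketched in a few lines of Python. This assumes the lines being compared are identical strings, so any timestamp would need to be stripped before comparison; the function name and the wording of the summary line are illustrative.

```python
from itertools import groupby


def collapse_repeats(lines):
    """Collapse runs of identical consecutive messages into one line plus a count."""
    out = []
    for message, run in groupby(lines):
        extra = sum(1 for _ in run) - 1  # repeats beyond the first occurrence
        out.append(message)
        if extra:
            out.append(f"(this message has been repeated {extra} more time(s))")
    return out
```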

      Memcached's default logger doesn't distinguish message types. It should really prefix the log messages with the type of output. Once that happens, we could easily modify ns_port_server to separate the types by prefix and log them at different levels.
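A sketch of the prefix-based separation described above, written in Python rather than Erlang for brevity. It assumes a hypothetical `LEVEL: message` line format, which memcached does not currently emit; the prefix table and function name are likewise assumptions.

```python
import logging

# Hypothetical prefixes; memcached would first have to emit them.
PREFIX_LEVELS = {
    "FATAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def classify(line, default=logging.INFO):
    """Return (level, message) for a line of the assumed form "LEVEL: message".

    Lines without a recognized prefix keep the default level and are
    returned unchanged apart from surrounding whitespace.
    """
    prefix, sep, rest = line.partition(":")
    level = PREFIX_LEVELS.get(prefix.strip().upper())
    if sep and level is not None:
        return level, rest.strip()
    return default, line.strip()
```

Falling back to the default level for unprefixed lines keeps the behavior unchanged for output that has not been converted yet.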

      -It would be very helpful to have the cluster-wide metrics available in the logs so that support could easily look at the whole system rather than having to add up each node (which gets tough at 10, let alone 100 nodes)

      Once we have stats being aggregated ahead of time rather than on demand in the GUI, we can do this. Right now all the aggregation is happening in Menelaus and only when you're looking at them, so we don't have access to aggregated stats to log them. This will probably need to happen anyway to make stats scale to large clusters.
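For illustration, aggregating ahead of time could be as simple as summing each stat across nodes and emitting the result as one log line. The per-node dict layout and the function names below are assumptions, and plain summation is only appropriate for counter-like stats (ratios or percentages would need a different rule).

```python
from collections import Counter


def aggregate_cluster(per_node):
    """Sum each stat across nodes.

    `per_node` maps a node name to a dict of stat name -> number
    (hypothetical layout). Stats missing on a node count as zero.
    """
    total = Counter()
    for stats in per_node.values():
        total.update(stats)
    return dict(total)


def format_cluster_line(per_node):
    """Render the aggregate as a single line suitable for a log file."""
    totals = aggregate_cluster(per_node)
    body = " ".join(f"{name}={totals[name]}" for name in sorted(totals))
    return f"cluster nodes={len(per_node)} {body}"
```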

      -A reworking of the collect_info (and some test cases to make sure it continues to work across versions...because we've broken it before) is needed.
      -It could contain some auto-upload functionality (to our S3 bucket)
      -it could be linked to directly from the "generate diagnostic report" in the UI
      -the formatting of how and where information is displayed could be improved (breaking up the output into multiple files rather than one large 'membase.log' file)
      -...

      Agreed on all of these, though we probably don't want to provide a script that can write stuff directly to our S3 bucket; it should go through some kind of messaging system so we can control access and re-point it.

      I know everyone's got lots on their plates, but supportability and serviceability will become increasingly important if we want to be able to scale our support resources and make our customers comfortable with the function of the product.

      I totally agree.

      I "think" this could be a great project for someone to work on as they learn the inner workings of the product and how to diagnose issues, etc.

      Maybe Dale, Aaron, or Volker could take some of this on.

          Activity

          perry Perry Krug added a comment -

          A Pivotal Tracker story has been created for this Issue: http://www.pivotaltracker.com/story/show/14344887

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          IMHO this needs to be split into smaller, clearly defined and manageable tickets.

          perry Perry Krug added a comment -

          "Latest" description/spec: http://hub.internal.couchbase.com/confluence/display/supp/Logging+Improvement+Project

          It still needs to be broken down by PM into manageable tasks to be further defined and implemented.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Perry, may I ask you to revise that wiki page? We completed a ton of these items. If something still does not hold, we need to know. What is indeed done needs to be removed in order to avoid clutter and confusion.

          perry Perry Krug added a comment -

          Okay, we will do that. I think there is still a lot of work that does need to be done, but I appreciate the fact that it is cluttered with already completed tasks.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          I assure you, you will be surprised how much of this is actually "completed". It may not be good enough, but we provide a ton of what you asked for (which was how many years ago?)

          perry Perry Krug added a comment -

          Hehe, it was about 2 years ago. At the moment, the biggest challenge for us (the field/support/customers) is in being able to actually read and understand the logs. I definitely believe that the information is in there, but the Erlang format especially makes it very challenging to read and understand what is going on and when it is not working (there is also an extremely large amount of "spamming" in the logs when something is not working properly).

            People

            • Assignee: Don Pinto
            • Reporter: Perry Krug
            • Votes: 0
            • Watchers: 2