Couchbase Server / MB-4785

Meaningful alert when low-level packet corruption occurs on a node

    Details

      Description

      Logs showed apparent low-level corruption of network data. The symptom is nodes going up and down. It is not clear in the UI that this is happening on only 2 nodes, that it is low-level corruption, or that these nodes are consistently having a problem and need to be failed over. No information bubbles up about why a node flaps up and down, or how to report the problem to the data center or to Amazon (in this case, EC2).

      We need a clear alert to the user suggesting that a troublesome node be failed over. Ideally it would include concrete examples of the corrupt data to pass on to data center ops.


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Are you sure this is really critical?
        TimSmith Tim Smith (Inactive) added a comment -

        My priority calibration may be off here. It is OK for product management or whoever to re-triage this request based on a larger picture of priorities.

        Tim

        farshid Farshid Ghods (Inactive) added a comment -

        I would actually rephrase this bug to say that the node status should turn red when ns_server detects corruption during send/receive, and change the issue type from enhancement to bug.

        The fact that this happened in an EC2 environment makes it more important.
        peter peter added a comment -

        Maybe this has been resolved by the recent infinity fixes in Erlang.
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Not fixed.

        We do think we've fixed the cause of this.

        But if it happens again, the only thing we'll see is the node turning red for a moment in the UI.

        Unfortunately, Erlang doesn't give us a way to monitor and react to this particular condition. We'll just see a disconnect, with no way to know why.

        So the only way to fix this seems to be extending the Erlang VM.
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - edited

        We haven't fixed it.

        We think we have fixed CBSE-whatever by working around some unknown subtle bug in infinity trapping via signals that is specific to Linux on EC2 (or to any Linux, or any Xen; we don't know).

        This particular request is to make the condition where low-level Erlang code detects packet corruption and disconnects a pair of nodes visible to the end user, particularly via an alert. That makes sense to me.

        Regarding what Farshid said: we do mark the node as red, but the next second we re-establish the connection and things work again, until it happens the next time.
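        The flapping described above (node red for a moment, then reconnected) is the kind of condition a simple flap detector could surface. A minimal sketch, assuming health observations arrive as timestamped up/down samples; the threshold, window, and class are illustrative, not an existing ns_server interface:

```python
class FlapDetector:
    """Flags a node as flapping if its up/down state changes too often."""

    def __init__(self, max_transitions=4, window=60.0):
        self.max_transitions = max_transitions  # transitions before alerting
        self.window = window                    # look-back window in seconds
        self.transitions = []                   # timestamps of state changes
        self.state = None                       # last observed up/down state

    def observe(self, timestamp, up):
        """Record one health observation; return True if the node is flapping."""
        if self.state is not None and up != self.state:
            self.transitions.append(timestamp)
        self.state = up
        # Keep only transitions inside the window.
        cutoff = timestamp - self.window
        self.transitions = [t for t in self.transitions if t >= cutoff]
        return len(self.transitions) >= self.max_transitions
```

        A UI could then raise one persistent "node X is flapping, consider failover" alert instead of a red badge that vanishes on reconnect.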
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        I also think we can fix it, though not necessarily in a future-proof or pleasant way. We can grep for the message that Erlang logs via the error-logging facility, which our logger implementation intercepts. That seems like the only path (short of modifying the Erlang VM) that can produce alerts from this kind of event.
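        The grep-the-log approach could look roughly like the following sketch. The corruption patterns and the alert shape are assumptions for illustration; the real messages the Erlang VM emits would have to be confirmed from actual ns_server logs:

```python
import re

# Illustrative guesses at what the Erlang VM might log when the
# distribution layer drops a connection; the actual messages would
# need to be confirmed from real logs.
CORRUPTION_PATTERNS = [
    re.compile(r"bad distribution header", re.IGNORECASE),
    re.compile(r"connection to node .* lost", re.IGNORECASE),
]

def scan_for_corruption(lines):
    """Return the log lines that look like low-level packet corruption."""
    return [line for line in lines
            if any(p.search(line) for p in CORRUPTION_PATTERNS)]

def make_alert(node, hits):
    """Build a user-facing alert suggesting failover, with evidence attached."""
    if not hits:
        return None
    return {
        "severity": "critical",
        "node": node,
        "message": (f"Node {node} saw {len(hits)} suspected packet-corruption "
                    f"event(s); consider failing it over."),
        "evidence": hits[:5],  # concrete examples to pass to data center ops
    }
```

        Attaching the matched lines as evidence addresses the original request for concrete examples of the corrupt data.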
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Depends on non-poor-man's alerts

        bshumate Brian Shumate added a comment -

        It would be helpful if network errors or network-partition conditions could be logged and represented much like the uptime command's load averages, i.e. the number of network issues in the last 5/15/30 minutes, somewhere in the web console UI.
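        Brian's load-average analogy could be sketched as sliding-window event counters. The window sizes mirror his 5/15/30-minute suggestion; the class and its API are hypothetical, not part of Couchbase:

```python
import time
from collections import deque

class NetworkIssueCounter:
    """Counts network-issue events over uptime-style sliding windows."""

    def __init__(self, windows=(5 * 60, 15 * 60, 30 * 60)):
        self.windows = windows  # window lengths in seconds
        self.events = deque()   # timestamps of recorded issues, oldest first

    def record(self, timestamp=None):
        """Record one network-issue event (defaults to now)."""
        self.events.append(time.time() if timestamp is None else timestamp)

    def counts(self, now=None):
        """Return per-window issue counts, keyed by window length in seconds."""
        now = time.time() if now is None else now
        # Drop events older than the largest window.
        oldest = now - max(self.windows)
        while self.events and self.events[0] < oldest:
            self.events.popleft()
        return {w: sum(1 for t in self.events if t >= now - w)
                for w in self.windows}
```

        The UI could then show something like "network issues: 1 / 2 / 3 (5/15/30 min)" next to each node.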

          People

          • Assignee: don Don Pinto
          • Reporter: TimSmith Tim Smith (Inactive)
          • Votes: 1
          • Watchers: 4
