Logs showed low-level corruption in network data. The visible symptom is nodes repeatedly going up and down. The UI does not make clear that this is happening on only 2 nodes, that the cause is low-level corruption, or that these nodes have a persistent problem and need to be failed over. Nothing bubbles up about why a node flaps up and down, or how to report the problem to the data center or to Amazon (in this case the cluster is on EC2).
Need a clear alert to the user that suggests failing over a troublesome node. Ideally the alert would include concrete examples of the corrupt data, so they can be passed on to data center ops.
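
A minimal sketch of how such an alert could be driven, in Python. This is not the product's implementation: the names `FlapDetector` and `failover_alert`, the 10-minute window, and the threshold of 4 transitions are all assumptions made for illustration. The idea is to count up/down state changes per node inside a sliding window, flag only the nodes that exceed the threshold, and attach any captured evidence of corruption to the alert text.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Counts up/down transitions per node within a sliding time window.

    Hypothetical helper; window and threshold defaults are assumptions.
    """

    def __init__(self, window_seconds=600, max_transitions=4):
        self.window_seconds = window_seconds
        self.max_transitions = max_transitions
        self._last_state = {}                    # node -> last observed state
        self._transitions = defaultdict(deque)   # node -> transition timestamps

    def record(self, node, is_up, timestamp):
        """Record an observation; only a change of state counts as a transition."""
        previous = self._last_state.get(node)
        self._last_state[node] = is_up
        if previous is not None and previous != is_up:
            self._transitions[node].append(timestamp)

    def flapping_nodes(self, now):
        """Return {node: transition_count} for nodes at or over the threshold."""
        flapping = {}
        for node, times in self._transitions.items():
            # Drop transitions that have fallen out of the window.
            while times and now - times[0] > self.window_seconds:
                times.popleft()
            if len(times) >= self.max_transitions:
                flapping[node] = len(times)
        return flapping


def failover_alert(node, transition_count, window_seconds, evidence=()):
    """Build alert text; `evidence` holds sample log lines supplied by the caller."""
    lines = [
        f"Node {node} changed state {transition_count} times in the last "
        f"{window_seconds} s. Consider failing it over."
    ]
    for sample in evidence:
        lines.append(f"  evidence: {sample}")
    return "\n".join(lines)
```

A caller would feed `record` from whatever health check already marks nodes up or down, then call `flapping_nodes` periodically and show one `failover_alert` per flagged node. Keeping the count per node is what lets the alert say that only specific nodes are affected rather than the whole cluster.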