Details
Epic
Resolution: Fixed
Critical
Security Level: Public
Done
Description
As a user or new employee (in any technical position), I would like to have a starting point for support operations on Membase, which should include but not be limited to:
(a separate task would be opened for each item)
1. Checklist / best practices for deployment
a. Capacity planning - examples of usage and recommended deployments (EC2 / physical nodes)
b. Linux - Set up swap (give examples), set up swappiness (Dustin, can you give a guideline for that?)
c. Windows - we need to gain more experience here!
d. EC2 best practices (we also need solutions for disappearing nodes)
e. Physical nodes best practices
f. Which UI stats should I watch (early warnings of bad weather)
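For item 1b, a minimal sketch of how the current swap and swappiness settings could be inspected on a Linux node. It only reads values and does not suggest a target; the actual guideline is still an open question in this ticket. The function name show_swap_settings and the PROC_ROOT override are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: report swap size and swappiness on a Linux node.
# PROC_ROOT is a hypothetical override so the function can be tried
# against copies of the files instead of the live /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

show_swap_settings() {
    # vm.swappiness: higher values make the kernel swap more eagerly
    echo "swappiness: $(cat "$PROC_ROOT/sys/vm/swappiness")"
    # SwapTotal and SwapFree lines from meminfo (reported in kB)
    grep -E '^Swap(Total|Free):' "$PROC_ROOT/meminfo"
}
```

Calling show_swap_settings on a node prints three lines. Changing the value at runtime is done as root with `sysctl -w vm.swappiness=N`, and it can be persisted with a `vm.swappiness = N` line in /etc/sysctl.conf; N is deliberately left open here.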
2. Detailed procedure for recovering a slow machine
a. Shut down the node
b. Backup the DB files
c. Vacuum the files (if fragmented)
d. Run the script to delete dead vBuckets. Perry, Dustin: as simple as this procedure is, I would like to have a well-defined script that non-Membase engineers would feel comfortable and safe running. Bhawana will be the first user to verify it.
e. Restart the node and observe all of the below
f. Definition of a healthy cluster (memory under control, replication is done, write queues are empty or draining fast, etc.)
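As a starting point for the well-defined script requested in item 2, here is a rough skeleton of steps a through e. It defaults to a dry run that only prints the commands it would execute. Every default below is an assumption and not a confirmed value: the data directory, the backup directory, the init script path, the .mb file suffix, and the idea that the data files are SQLite files that can be vacuumed with the sqlite3 command-line tool. Step d is left as a placeholder because the dead-vBucket script is not defined yet.

```shell
#!/bin/sh
# Sketch of recovery steps a-e. All defaults are ASSUMPTIONS to be
# replaced with the real values for the deployment.
DRY_RUN="${DRY_RUN:-1}"                            # 1 = print commands only
DB_DIR="${DB_DIR:-/var/opt/membase/data}"          # hypothetical data directory
DB_SUFFIX="${DB_SUFFIX:-.mb}"                      # assumed data file suffix
BACKUP_DIR="${BACKUP_DIR:-/var/backups/membase}"   # hypothetical backup target
SERVICE="${SERVICE:-/etc/init.d/membase-server}"   # assumed init script

run() {
    # Print the command; execute it only when DRY_RUN is not 1.
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

recover_node() {
    run "$SERVICE" stop || return 1                # a. shut down the node
    run mkdir -p "$BACKUP_DIR" || return 1
    run cp -a "$DB_DIR" "$BACKUP_DIR/" || return 1 # b. back up the DB files
    for f in "$DB_DIR"/*"$DB_SUFFIX"; do           # c. vacuum each data file
        [ -e "$f" ] || continue
        run sqlite3 "$f" 'VACUUM;' || return 1
    done
    # d. delete dead vBuckets: script still to be defined
    echo "# step d skipped: dead vBucket script not defined yet"
    run "$SERVICE" start                           # e. restart the node
}
```

Running recover_node with the defaults prints the planned commands without touching the node; setting DRY_RUN=0 executes them and stops at the first failing step.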
3. Replication control
A tool to stop, pause, and start replication in single/dual mode, either on the whole cluster or towards a specific machine
4. Backup and recovery
a. Customers must perform a backup procedure
b. I believe we already have a good solution for both backup and recovery; this should be well documented and tested.
c. Backup should include defragmentation of files
5. Monitor the product for dummies:
Until we have monitoring in place, I would like to have scripts in the management directory that provide info about the whole cluster. For example, get the list of servers from the REST API and return the status of all dispatchers on the cluster. Here is the list of commands I ran constantly yesterday. Perry, since these are "support" scripts, I would like you to own this and use the help of the dev team where needed. Please refine my list or add to it as you learn more about the product.
a. Dispatcher
for i in { list of servers on the cluster }; do echo machine on $i; /opt/membase/bin/ep_engine/management/stats $i:11210 dispatcher; done
b. ep-engine Stats
for i in
c. Tap
for i in { list of servers on the cluster }; do echo machine on $i; /opt/membase/bin/ep_engine/management/stats $i:11210 tap | egrep "rec_fetched|pending_backfill"; done
d. vBucketCtl
for i in { list of servers on the cluster }; do echo machine on $i; /opt/membase/bin/ep_engine/management/vbucketctl $i:11210 list | awk '{print $3}' | sort -n | uniq -c; done
e. Machine stats
Not sure how to do that, but we should be able to see the basic parameters of the top command along with the memcached process
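The one-liners in item 5 could be folded into a small helper along the following lines. This is a sketch, not a finished support script. It assumes the stats tool path and port 11210 used in the commands above, and it expects the caller to supply the server list in a SERVERS variable, since fetching it from the REST API is not worked out here. The function names cluster_stats and machine_stats are hypothetical. machine_stats is one possible answer to item e: it takes a single batch-mode snapshot from top and lists the memcached process with ps, on the local machine only.

```shell
#!/bin/sh
# Sketch: run one stats group against every server in the cluster.
# SERVERS is a space-separated host list supplied by the caller (assumption).
STATS="${STATS:-/opt/membase/bin/ep_engine/management/stats}"
PORT="${PORT:-11210}"

cluster_stats() {
    # $1 = stats group, e.g. dispatcher or tap.
    # Each output line is prefixed with the host it came from.
    group="$1"
    for host in $SERVERS; do
        "$STATS" "$host:$PORT" "$group" 2>&1 |
            awk -v h="$host" '{ print h ": " $0 }'
    done
}

machine_stats() {
    # Item e (local machine only): summary header from one batch-mode
    # run of top, followed by CPU and memory figures for memcached.
    top -b -n 1 | head -n 5
    ps -C memcached -o pid,pcpu,pmem,rss,vsz,comm
}
```

With SERVERS set to the cluster's hosts, `cluster_stats dispatcher` corresponds to item a, and `cluster_stats tap | egrep "rec_fetched|pending_backfill"` corresponds to item c. Running machine_stats on every node would need some remote execution mechanism such as ssh, which is not assumed to be set up here.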