Details
- Bug
- Resolution: Cannot Reproduce
- Major
- 1.6.5
- Security Level: Public
- None
- An Amazon ec2 m1.xlarge instance, running ami-6832d801. That AMI is Ubuntu Karmic 9.10 64-bit, running on instance-store in us-east-1
Description
My project is considering Membase as a replacement for Memcache in our cluster of production servers (we're hoping for a drop-in replacement that can give us transparent consistent hashing and replication).
The cache is used to hold objects fetched from Amazon S3. The objects are lzo compressed while they live in S3 and in the cache, range from about 3 to 60 KB in (compressed) size, and carry the standard lzo checksums. To be clear, the cache never sees any uncompressed data, but the checksums enable the downstream API processes to detect corruption.
While investigating a switch to Membase, we got as far as bootstrapping a testbed server running a local, unclustered Membase instance with an accompanying Moxi. While using our warmup script to put the testbed server through its paces, we noticed that, as the cache warmed, requests began reliably failing with checksum errors (which in our experience otherwise occur only very rarely).
Using tshark, I was able to observe cache misses for a given key and the subsequent insertions filling that key, and later see a cache hit return a subtly different value. In the handful of cases observed, the corrupted values never changed length but had one to three different bytes near the end, usually within the last 20 or so. Exactly which values become corrupted is not predictable, but the warmup script is long enough to induce some checksum failures nearly every time it is run.
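For anyone who wants to repeat the comparison, here is a rough sketch of how the differing bytes can be located once the stored value and the returned value have each been dumped to a file. The function name and the two file arguments are hypothetical; extracting the payloads from the pcap is a separate step that is not shown.

```shell
#!/usr/bin/env bash
# Sketch: list the byte offsets at which two equal-length payloads differ,
# together with each offset's distance from the end of the value.
# Both arguments are hypothetical dumps of a stored and a returned value.
diff_offsets() {
    local stored="$1" returned="$2"
    local size other
    # Arithmetic expansion strips any padding that wc adds on some platforms
    size=$(( $(wc -c < "$stored") ))
    other=$(( $(wc -c < "$returned") ))
    if [ "$size" -ne "$other" ]; then
        echo "length mismatch: $size vs $other" >&2
        return 2
    fi
    # cmp -l prints one line per differing byte:
    #   <1-based offset> <octal byte in file 1> <octal byte in file 2>
    # cmp exits non-zero when the files differ, so its status is ignored here
    { cmp -l "$stored" "$returned" || true; } | while read -r offset old new; do
        echo "offset=$offset from_end=$((size - offset)) stored=$old returned=$new"
    done
}
```

Calling `diff_offsets stored.bin returned.bin` (both file names are placeholders) prints nothing when the two values are identical.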
Here is the part of our bash script that installs and configures Membase on the testbed servers:
dpkg -i some/path/to/membase-server-community_x86_64_1.6.5.deb
export PATH="$PATH:/opt/membase/1.6.5/bin/cli"

# Create an area for membase to put its on-disk storage
mkdir -p /mnt/membase
chown membase:membase /mnt/membase
membase node-init -c $PRIVATE_HOSTNAME:8091 \
    --node-init-data-path=/mnt/membase

membase cluster-init -c $PRIVATE_HOSTNAME:8091 \
    --cluster-init-username=user \
    --cluster-init-password=passwd \
    --cluster-init-ramsize=6000

membase bucket-create -c $PRIVATE_HOSTNAME:8091 -u user -p passwd \
    --bucket=default \
    --bucket-type=membase \
    --bucket-port=11222 \
    --bucket-ramsize=6000 \
    --bucket-replica=3
Membase itself was installed through a local copy of the membase-server-community_x86_64_1.6.5.deb package. The md5, 69708294d3a68382dbb5f548ead4ddf6, matches the one listed on the site. I've just noticed that version 1.6.5.3 has been released since our first investigation; I will see if I can reproduce the issue with that version as well.
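The md5 comparison above could be repeated on each testbed server with something like the following. This is only a sketch: `verify_md5` is a hypothetical helper name, and it assumes GNU `md5sum` is available, as it is on Ubuntu.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if a file's MD5 digest equals an expected hex string.
# verify_md5 is a hypothetical helper, not part of the Membase tooling.
verify_md5() {
    local file="$1" expected="$2"
    local actual
    # md5sum prints "<digest>  <filename>"; keep only the digest field
    actual=$(md5sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}
```

For the package used here, the call would be `verify_md5 some/path/to/membase-server-community_x86_64_1.6.5.deb 69708294d3a68382dbb5f548ead4ddf6`, which returns a non-zero status on a mismatch.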
Are there any logs or debug command output that you would like me to post? I can also dig up the tshark pcap files showing the corrupt values coming out if needed.