NoSQL

Couchbase Smart Client Failure Scenarios

 

The Couchbase java smart client communicates directly with the cluster to maintain awareness of data location. To do so it gathers information about the cluster architecture from a manually maintained configuration file listing all the nodes. The smart client configuration is done within the Java code and does not have a pre-designated file while the Moxi configuration is generally installed at /opt/moxi/etc/moxi-cluster.cfg

Assuming the smart client is on a separate server from the affected node there are two situations where communication between the client and a specific node might be interrupted.

In the first scenario, a node may fail. If so, the rest of the cluster will detect that from standard heartbeat checks, which are built in to Couchbase, and map its data to the replica nodes. The smart client is informed of the remappings and should be able to find all identified data again. There are known bugs with some client versions (e.g. 1.0.3) -- if you experience timeouts with the client, be sure you’re using the latest build. We also recommend that you use autofailover and that you test your email alerts. You must manually rebalance after recovery; this does not happen on its own.

In the second and more common scenario a network or DNS outage has occurred. If a node is unreachable by one or more clients, yet all nodes can still talk to that node, there is no built-in mechanism for the cluster to remap data from that node to other nodes.

Additionally, there is no built-in mechanism for the smart client to reroute traffic so you will experience timeouts in this situation.  When the network issue resolves the client should stop presenting errors.

Consider scripting a heartbeat check to run on your app servers that use the Couchbase CLI and specify failover procedures.

Couchbase rebalance freeze issue

We came across a Couchbase bug during a rebalance while upgrading online to 1.8.1 from 1.8.0.  

Via the UI, we upgraded our first node, re-added it to the cluster, and then set the rebalance off.  It was progressing fine, then stopped around 48% for all nodes.  The tap and disk queues were quiet and there were no servers in pending rebalance.  The upgraded node was able to service requests, but with only a small percentage of the items relative to the other nodes.  The cluster as a whole did not suffer in performance during this issue though there are some spikes in cpu during any rebalance.  

We decided to stop the rebalance, wait a few minutes, then rebalance and we see it is moving again, progressing beyond what it was.  It stopped again, now at 75%. Let sit for 7 mins, then hit Stop Rebalance and Rebalance. Not progressing at all now.

Couchbase support pointed to a bug where if there are empty vbuckets, rebalancing can hang.  This is fixed in 2.0.  The work around solution is to populate buckets with a minimum of 2048 short time to live (TTL >= (10 minutes per upgrade + (2 x rebalance_time)) x num_nodes) items so all vbuckets have something in them.  We then populated all buckets successfully and were able to restart the rebalance process which completed fine.

Reference:

http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-getting-started-upgrade-online.html

Distributed Counter Performance in HBase - Part 1

Recently I was tasked with setting up an HBase cluster to test compared against Amazon's DynamoDB offering. The client had tested that it worked well for up to 10k updates/sec, but were concerned about the cost. I set up a 7-node HBase cluster in the following configuration:

  • Node1: NameNode, HMaster, Zookeeper
  • Nodes 2-7: DataNodes, RegionServers
Every node was configured with the following hardware:
  • 32GB RAM
  • 4x Intel Xeon E7320 2.13GHz
  • 4x SAS SEAGATE  ST3300657SS drives (JBOD configuration, no RAID)
  • 4x 1GB ethernet NICs in 2x2 bonded interfaces (only one used by HBase)
HBase was configured (almost) as follows:
  • hbase-env:
    • export HBASE_HEAPSIZE=8000
    • export HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
    • export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" 
    • export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
    • export HBASE_MASTER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
    • export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
    • export HBASE_THRIFT_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
    • export HBASE_ZOOKEEPER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
  • hbase-site:
    • dfs.support.append =  true
    • hbase.cluster.distributed =  true
    • hbase.tmp.dir =  /var/lib/hbase/tmp
    • hbase.client.write.buffer =  8388608
    • hbase.regionserver.handler.count =  20
The Thrift interface was started on the HMaster for a client program to connect and do work.

 

A single table was created with (almost) the following definition:

 

{NAME => 'timelesstest', DEFERRED_LOG_FLUSH => 'true', FAMILIES => [{NAME => 'family', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

 

A simple Python program using the happybase HBase client was written to increment a single counter in that table as fast as possible. At first, we could only achieve around 700 increments per second, no matter how many client programs we ran. Looking at the HBase console, it was apparent that only one node was doing any work (which is as expected), but we expected better than 700 increments per second.

 

I did some analysis of the cluster. The CPU, Disk, and RAM footprints were all very low. The only indication the HBase cluster was doing work at all was that the three RegionServers hosting the table timelesstest were doing about 10x the interrupts/second of the others. Until I looked at the network. There was about 150KB/sec being transferred in and out of HMaster and the primary RegionServer for timelesstest, and about 380KB/sec for the replica RegionServers. At this point, I guessed that HBase was committing every single transaction coming through. The documentation pointed out a setting I could change: "hbase.regionserver.optionallogflushinterval", but tweaks to that value did nothing.

 

As I was speaking with Ryan Rawson, one of the original HBase committers, he pointed out that the setting only applied to a given table if the DEFERRED_LOG_FLUSH is set at the table level, which the documentation didn't make very clear[1]. So I simply ALTERed the timelesstest to enable the deferred log flushing, et voila! The number of increments per second a single client could achieve skyrocketed to 2k/sec. We added six more clients to the mix and achieved a sustained 10.5k/sec increments spiking at times up to 11.5k increments/sec.

 

Since we were incrementing only a single counter, this means that we were only using about 1/6th the power of the cluster, since there were 6 RegionServers present. We will soon be running more tests, with thousands of counters and perhaps dozens of clients. I suspect we will quickly run into a bottleneck on the Thrift server and may need to open that bottleneck up to achieve the theoretical max of about 60k increments/sec on this cluster (if that is actually achievable, which it may not due to there being a 3x replication factor for the table).

 

Stay tuned!

 

[1] I let the HBase documentation guys know about the confusion. Hopefully future HBase administrators won't have to worry about this gotcha anymore.

Exploring Configuration Management with Ansible

 

What is Ansible?

Ansible is a configuration management and deployment system, like Puppet, Capistrano, Fabric, and Chef. Its aim is to be radically simple and let you use your existing scripts to help with cluster configuration and software deployment whenever possible. Here are the ways that Ansible differentiates itself.

Simplicity

Ansible does not include a client/server architecture with pull-based clients (although in more recent versions, it does include pull-based configuration and deployment). Rather, it uses pre-existing network infrastructure: SSH. Every company has SSH installed on their cluster servers, and Ansible simply rides on top of this infrastructure to get the code and configuration out to the nodes.

Language Agnostic

You can write modules for Ansible in whatever language you desire. You simply tie into its API and go. If you have wanted to use configuration management tools like Chef, but felt put off by the need to learn Ruby, this would be an ideal choice for you.

Configuration Management and Deployment

Ansible playbooks can run steps in a defined order between roles, allowing you to be sure that, for example, the database is set up before the WWW servers start and attempt to create their database schemas. 

Why Recommend Ansible?

Ansible fills a very useful niche. It is good for companies that have grown to the point where configuration management and deploy tools would save time and help manage complexity, but not large enough to hire specialised Ruby-speaking configuration management experts.

It is good for companies that already have invested effort into an SSH infrastructure for inter-node communication. Because Ansible piggybacks onto your already-existing SSH communications infrastructure, there is no need to build large, complex server infrastructure and schedule client runs.

Finally, Ansible is good for companies that don't have a lot of time to invest into learning a new tool. There is no need to spend weeks reading tutorials and watching videos from conferences; the tool is radically simple. A few hours with the documentation, and you're off and running.

If you'd like to hear more about how Palomino can help you with installing and configuring Ansible, feel free to contact us! If you have experience with Ansible, feel free to comment, and let's start a dialogue.

An overview of Riak

 

At PalominoDB, we constantly evaluate new technologies and database options for our clients’ environments. NoSQL databases are especially popular right now, and Riak is an increasingly-recommended option for highly available, fault-tolerant scenarios. Moss Gross attended an introductory workshop, and shares his findings here. For more on Riak, please see the Basho wiki.

What is Riak?

Some of the key features of Riak include:

- license is apache2

- key-value store

- masterless

- distributed

- fault-tolerant

- eventually consistent (Given a sufficiently long time, all nodes in the system will agree. Depending on your requirements, this could be a determining factor on whether you should use Riak)

- highly available

- scalable

- each node has one file

- supports map/reduce for spreading out query processing among multiple processes

- The database itself is written in Erlang.  There are currently clients written in Ruby, Java, Javascript, Python and many other languages.

 

Key Concept Definitions

 

Node:  One running instance of the Riak binary, which is an Erlang virtual machine.

Cluster:  A group of Riak nodes.  A cluster is almost always in a single datacenter.

Riak Object: One fundamental unit of replication.  An object includes a key, a value, and may include one or more pieces of metadata.

Value:  Since it is a binary blob, it can be any format.  The content-type of the value is given in the object's metadata.

Metadata:  Additional headers attached to a Riak Object.  Types of metadata include content-type, links, secondary indexes

Vector Clock: An opaque indicator that establishes the temporality of a Riak object.  This is used internally by Erlang, and the actual timestamp is not intended to be read by the user from the vector clock.

Bucket: Namespace.  This can be thought of as an analog to a table, however Riak namespaces are unlike tables in any way.  They indicate a group of Riak objects that share a configuration.

Link:  Unidirectional pointer from one Riak object to another.  Since an object can have multiple links, bi-directional behavior is possible.  These links use the HTTP RFC.

Secondary Index:  An index used to tag an object for a fast lookup. This is useful for one-to-many relationships

 

 

More Key Concepts

 

Riak uses consistent hashing, which can be thought of as a ring which maps all the possible Riak objects, the number of which can be up to 2^160.

 

Partition: Logical Division of the ring.  This corresponds to logical storage units on the disk.   The number of partitions must be a power of two, and in practice should be generally very high.  The default number of partitions is 64, and this is considered a very small number.

 

 

Operations:

 

To insert data, the basic operation is a PUT request to one of the hosts.  The bucket is indicated in the address, and metadata is added via headers to the request.  For example: curl -v http://host:8091:/buckets/training/keys/my-first-object -X PUT -H "content-type: text/plain" -d "My first key"

 

To insert data without a key, use a POST request.   To retrieve an object, use GET, and to delete, use DELETE.

 

Riak doesn't have any inherent locking.  This must be handled in the application layer.

 

Administration:

 

Configuration:

Riak has two configuration files, located in /etc/ by default.

 

vm.args: identifies the node to itself and other clusters.  The name of the node is of the form 'name@foo', where 'name' is a string and 'foo' can be an ip or a hostname, and it must resolve to a machine, using /etc/hosts or DNS, etc.

 

app.config:  identifies the ringstate directory and the addresses and ports that the node listens on.

 

 

Logs:

 

There are four main logs

 

console.log:  All the INFO, WARN, and ERR messages from the node.

crash.log: crash dumps from the Erlang VM

error.log:  just the ERR messages from the node

run_erl.log: logs the start and stop of the master process

 

Diagnostics:

 

Things to check for proper operation:

 

Locally from the machine the node is on, you can run the command 'riak-admin status'. This gives one minutes stats for the node by name and cluster status

 

Cluster Status:

nodename: 'xxxx' (compare to what the rest of the cluster things it should be)

connected_nodes (verify it's what's expected)

ring_members (includes OOC members)

ring_ownership (has numbers that should show a general balance in

indexes across all of the nodes, if one is significantly different,

indicates a problem)

 

Storage Backend

 

Riak is very versatile in that you have many choices as to what to use for your storage backend.  Some of the possibilities include memory, Bitcask (a homegrown solution written in Erlang, which keeps just a hashmap of keys in memory), LevelDB (which will allow secondary indexing, uses compaction, and does not require you to store all keys in memory), or a combination.  They recommend actually using different storage engines for different buckets if it seems more appropriate.

 

Creating backups in Riak is relatively straightforward:

 

1. stop node

2. copy data directory (if not in memory)

3. start node back up

4. let read repair handle the rest

 

 

Monitoring and Security

 

- There is no built-in security in Riak - it's up to the administrator to add access restrictions via http authentication, firewalls, etc. The commercial product, Riak CS, has ACLs and security built-in however.

- JMX monitoring is built-in and can be enabled in the application configuration - just specify the port.

- MapReduce can be enabled by a setting in app.config.  JavaScript or Erlang can be used.

 

Redis Persistence

Clients often ask us about the benefits of using key-value stores, such as Redis, for high-volume environments. One of the key benefits that is often cited is the durability offered by Redis persistence (as described in detail here).

Developer Salvatore ‘antirez’ Sanfilippo delves into the topic on his blog. Here are some key questions you should consider as you evaluate Redis in your own environment.

Redis replication (and one of two methods of persistence) is achieved using the AOF (append-only file), which logs all statements that modify data. In the example in the article, there is a DELETE issued on a non-existent key, and that statement doesn't get logged nor replicated to a slave Redis. But in real life, masters and slaves get out-of-sync, and it can be handy to have a statement that does nothing on the master, but when replicated to the slave, has the tangible effect of moving the master and slave closer toward convergence. It seems at least it should be an option to have "no-effect" statements replicate from the master to the slave.

When the AOF gets too large, an AOF rewrite occurs. This is the minimal set of statements needed to log to reproduce the data set as it is in memory:

You may wonder what happens to data that is written to the server while the rewrite is in progress. This new data is simply also written to the old (current) AOF file, and at the same time queued into an in-memory buffer, so that when the new AOF is ready we can write this missing part inside it, and finally replace the old AOF file with the new one.

So what happens when we run out of RAM? That would be a very interesting behaviour to have defined unambiguously. He also doesn't talk about how long it takes to write the in-memory delta to the AOF before the swap-over. I suspect during that time, the server will be largely unresponsive (or should be, to preserve data integrity). On a busy server, there might be millions of entries in the in-memory delta buffer once the new AOF is finished writing.

If you think Redis having persistence means you can serve data from disk, you'd be wrong. The point of persistence is just to get the in-memory-only dataset back after a crash or restart.

AOF rewrites are generated only using sequential I/O operations, so the whole dump process is efficient even with rotational disks (no random I/O is performed). This is also true for RDB snapshots generation. The complete lack of Random I/O accesses is a rare feature among databases, and is possible mostly because Redis serves read operations from memory, so data on disk does not need to be organized for a random access pattern, but just for a sequential loading on restart.

So Redis really is just (a fast) memcached but with some persistence methods.

There is an option for fsync'ing data to the AOF in various ways. Be careful, the "appendfsync everysec" setting is actually worst-case every TWO seconds. It's only every second on average.

When you set appendfsync always, Redis still doesn't do an fsync after every write. If there are multiple threads writing, it'll batch the writes together doing what he calls "group commit." That is, every thread that performs a write in the current event loop will get written at the same time at the end of the event loop. I can't think of any downside to this, as I don't think clients get their response that data was written until the end of the event loop.

Restarting Redis requires either re-loading an RDB (Redis snapshot) or replaying AOF transactions to get to the state before the server was stopped. Redis is an in-memory database, so as you might expect, the start-up times are fairly onerous.

Redis server will load an RDB file at the rate of 10 ~ 20 seconds per gigabyte of memory used, so loading a dataset composed of tens of gigabytes can take even a few minutes.

That's the best case, as we note from the following:

Loading an AOF file [takes] twice per gigabyte in Redis 2.6, but of course if a lot of writes reached the AOF file after the latest compaction it can take longer (however Redis in the default configuration triggers a rewrite automatically if the AOF size reaches 200% of the initial size).

It isn't pretty, but I'm really glad the author is giving so much transparency here. An optimisation mentioned is to run a Redis slave and have it continue serving the application while the Redis master is restarted.

The author notes that in high-volume environments, a traditional RDBMS can in theory serve reads from the moment it's started, in practice that can cause the database to become fairly unresponsive as the disks seek like wild to pull in the required data. He further notes that Redis, once it starts serving reads, serves them at full speed.

At Palomino, we are dedicated to bringing rigor in benchmarking and analysis to the DBMSs that are our core compentencies. In a future post, watch for us to test Redis and put some solid numbers behind some of this. We are interested in seeing how Redis performs during AOF rewrites and how long it takes to start up on typical modern hardware at typical modern loads.

Building libMemcached RPMs

A client running CentOS 5.4 Amazon EC2 instances needed the latest libMemcached version installed. With the inclusion of the "make rpm" target, libMemcached makes it easy to build the libMemcached RPMs by doing the following:

Spin up a new CentOS Amazon EC2 instance,

As root on the new instance:

yum install @development-tools
yum install fedora-packager
/usr/sbin/useradd makerpm

Now change to the makerpm user and build the RPMs:

su - makerpm
rpmdev-setuptree
tree
wget http://launchpad.net/libmemcached/1.0/1.0.2/+download/libmemcached-1.0.2.tar.gz
tar -zxf libmemcached-1.0.2.tar.gz
./configure && make rpm
find . -name '*rpm*'

References:
http://libmemcached.org
http://fedoraproject.org/wiki/How_to_create_an_RPM_package
https://launchpad.net/libmemcached/+download

New England Database Summit

The New England Database Summit is an all day conference-style event where participants from the research community and industry in the New England area can come together to present ideas and discuss their research and experiences working with on data-related problems.  It is an academic conference with applications to real life, and includes any type of database.

The 5th annual NEDB will be held in Cambridge, MA MIT (in 32-123) on Friday, February 3, 2012.  Anyone who would like is welcome to present a poster (registration required), or submit a short paper for review.  We plan to accept 8--10 papers for presentation (15 minutes) at the meeting.   All posters will be accepted.

For more details, and to register and / or upload a paper, see:

http://db.csail.mit.edu/nedbday12/

Syndicate content
Website by Digital Loom