August 2012

Productivity Squared: Character Graphs in-Terminal

In the course of a large cluster database administrator's job, there are dozens of times a week it can be useful to visualise some data. You're constantly working with machines that have hundreds of databases, directories, files, log files with often millions of entries each.

Wouldn't it be nice if you could visualise these situations?

  • You have dozens of directories full of files. What's the relative size of each?
  • You have a logfile with 1.5M entries. You grep out "ERROR" and that is 50,000 timestamped entries. What times of day have the most errors?
  • You have a process list with 30,000 entries, comprised of 7 unique commands. What is the relative frequency of each command?
  • You have a database with 10,000 tables, each having several million rows. Which are the 30 biggest, and what is their relative size?
  • You have slow query logs spanning a couple of weeks. You wish to know the days and times-of-day that had the most slow queries.

Typically, you can use awk, sort, grep -c, and uniq -c to give some output that contains "keys" and "values" and you can eyeball it to get a general sense. In fact, if the output is decimal, the number of digits is a base-10 logarithmic graph of the relative sizes of things. But sometimes that isn't good enough. For example, a 7x difference will not be obvious from eyeballing a bunch of numbers.

You can copy the output, paste into a CSV, load it up in Libre Office Calc, and generate graphs if you want to spend another couple minutes every command to generate the graphs. Wouldn't it be nice to just do it right there in the terminal where you already are?

If you did web searches for this sort of thing, up until a couple of weeks ago, the best you'd've found was a page referencing an amazing awk script that takes a list of numeric keys and presents an ASCII histogram. However, it couldn't take non-numeric keys, and it could not help with several of the use-cases presented above.

Enter the Palomino-written "distribution." It will take a large input, munge it, tokenise it, tally it, and give you beautiful in-terminal graphs with ANSI colour coding, unicode partial-width characters, and more. So go to the project page linked, download it, and play with it. You'll find plenty of examples, and using a little imagination, you will begin to get a different view of the log files you're constantly working with.

Happy administering those database clusters!

Distributed Counter Performance in HBase - Part 1

Recently I was tasked with setting up an HBase cluster to test compared against Amazon's DynamoDB offering. The client had tested that it worked well for up to 10k updates/sec, but were concerned about the cost. I set up a 7-node HBase cluster in the following configuration:

  • Node1: NameNode, HMaster, Zookeeper
  • Nodes 2-7: DataNodes, RegionServers
Every node was configured with the following hardware:
  • 32GB RAM
  • 4x Intel Xeon E7320 2.13GHz
  • 4x SAS SEAGATE  ST3300657SS drives (JBOD configuration, no RAID)
  • 4x 1GB ethernet NICs in 2x2 bonded interfaces (only one used by HBase)
HBase was configured (almost) as follows:
  • hbase-env:
    • export HBASE_HEAPSIZE=8000
    • export HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
    • export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" 
    • export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
    • export HBASE_MASTER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
    • export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
    • export HBASE_THRIFT_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
    • export HBASE_ZOOKEEPER_OPTS="$HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
  • hbase-site:
    • dfs.support.append =  true
    • hbase.cluster.distributed =  true
    • hbase.tmp.dir =  /var/lib/hbase/tmp
    • hbase.client.write.buffer =  8388608
    • hbase.regionserver.handler.count =  20
The Thrift interface was started on the HMaster for a client program to connect and do work.

 

A single table was created with (almost) the following definition:

 

{NAME => 'timelesstest', DEFERRED_LOG_FLUSH => 'true', FAMILIES => [{NAME => 'family', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

 

A simple Python program using the happybase HBase client was written to increment a single counter in that table as fast as possible. At first, we could only achieve around 700 increments per second, no matter how many client programs we ran. Looking at the HBase console, it was apparent that only one node was doing any work (which is as expected), but we expected better than 700 increments per second.

 

I did some analysis of the cluster. The CPU, Disk, and RAM footprints were all very low. The only indication the HBase cluster was doing work at all was that the three RegionServers hosting the table timelesstest were doing about 10x the interrupts/second of the others. Until I looked at the network. There was about 150KB/sec being transferred in and out of HMaster and the primary RegionServer for timelesstest, and about 380KB/sec for the replica RegionServers. At this point, I guessed that HBase was committing every single transaction coming through. The documentation pointed out a setting I could change: "hbase.regionserver.optionallogflushinterval", but tweaks to that value did nothing.

 

As I was speaking with Ryan Rawson, one of the original HBase committers, he pointed out that the setting only applied to a given table if the DEFERRED_LOG_FLUSH is set at the table level, which the documentation didn't make very clear[1]. So I simply ALTERed the timelesstest to enable the deferred log flushing, et voila! The number of increments per second a single client could achieve skyrocketed to 2k/sec. We added six more clients to the mix and achieved a sustained 10.5k/sec increments spiking at times up to 11.5k increments/sec.

 

Since we were incrementing only a single counter, this means that we were only using about 1/6th the power of the cluster, since there were 6 RegionServers present. We will soon be running more tests, with thousands of counters and perhaps dozens of clients. I suspect we will quickly run into a bottleneck on the Thrift server and may need to open that bottleneck up to achieve the theoretical max of about 60k increments/sec on this cluster (if that is actually achievable, which it may not due to there being a 3x replication factor for the table).

 

Stay tuned!

 

[1] I let the HBase documentation guys know about the confusion. Hopefully future HBase administrators won't have to worry about this gotcha anymore.

Benchmarking NDB Against InnoDB on a Write-Intensive Workload

Last month, we evaluated Amazon's new SSD offerings with an extensive series of performance benchmarks.

As a followup, we've prepared a second series of benchmarks that specifically explore performance issues with a write-intensive workload on both NDB and InnoDB storage engines.

Download a free PDF of our findings, and as always, we welcome your feedback, comments and questions below.

Cascade Replication and Delayed servers on PostgreSQL 9.2

Postgresql 9.2 will be released with a new replication feature: Cascade Replication. It will improve streaming between wide networks, shortening distances between the servers. For example: a master server in San Francisco, a slave in Chicago and a cascade server in Barcelona.
But it has other benefit, e.g. when you need to add nodes to the cluster, you can to take backups from your slave without directly affecting the master in terms of I/O of traffic while also minimizing the connections to it.
The setup is easy and intuitive. Like from a slave from a master, but with the difference on the primary_conninfo, which will point to the slave server.
Things to consider when cascading replication:
  • One of the 9.0 limitations in replication -cleaned on 9.1- is the problem that when a master is vacuuming a big table that affects a query that is being executed on the slave. That feature can be enabled with the hot_standby_feedback on the slave server and will add more communication between the servers (if you have very slow connection, maybe you want to disable it)
  • If you have synchronous replication it will only affect 1 slave server against the master, the cascade slave won’t be affected by this setup (cascade server cannot be synchronous).
  • The control function pg_xlog_replay_pause (which stops applying the new incoming wal records) affects only the server in which you are executing it. If the server in which you are executing the control function has cascade servers, those servers will continue streaming and applying the records (occasionally you can have a=c but b!=a, besides replication has a->b->c chain), cause that function avoids the replay, not the streaming shipping. If you want to stop the cascading to all the hosts (not only replication, although the streaming), just stop the slave or change the primary_conninfo in the cascade servers. That is useful if you run BI queries on the cascade servers and don’t want to be affected by writes, so you can delay completely the replication until finish your reports.
  • If you want to use archiving, it must be set up from the master server. Archiving is necessary if you want to enlarge the availability in case of a long downtime of a slave. If you want a delayed server or if you want an incremental backup for PITR in case of disaster.
In our test we’ll have 5 servers, that I’ll label with the port number and data directory to avoid confusions. In this case you can run postgres in same port in different hosts, but just for make clear the difference I will use different ports between them:
Master (8888, data_master) Slave(8889 data_slave) Cascade Slave(8890 data_cascade_slave) delayed standby server using omnipitr-restore (8891 data_delayed) Delayed slave hot standby server using a handmade script (8892 data_streaming_delayed).
To take a backup for replication, first create a user with replication grants and add it to pg_hba.conf in the master. From psql, run:
    CREATE USER repl WITH REPLICATION PASSWORD ‘password’;

pg_hba.conf:
local   replication     repl                            md5
host    replication     repl        127.0.0.1/32            md5
host    replication      repl        0.0.0.0/0               md5

*NOTE: if you want to setup several slaves in one move, make a backup locally in the master, compress it and then ship it among the slaves. Compress and backup command:
bin/pg_basebackup -Ft -p8888 -Urepl -Ddata_slave -z -Z9


Servers configuration
Master (8888) configuration:
postgresql.conf:
wal_level = hot_standby
max_wal_senders=3
wal_keep_segments = 30
archive_mode=on
archive_command = 'cp %p /opt/pg92beta/archives_master/%f'
pg_hba.conf:
local   replication     repl                                md5
host    replication     repl        127.0.0.1/32            md5
host    replication      repl        0.0.0.0/0               md5
max_wal_senders should be a value -at least- the number of slaves connected to this server (cascade servers doesn’t count, cause they connect to the other slaves). If you don’t know how many slaves you need, you can setup a higher value (like 10 for example). That won’t affect performance and the process of wal sender only will work if another slave is brought up.
wal_keep_segments is optional, but is a very comfortable setup when we need to have a large available segments on the master to update the slaves. If one slave gets into a not-so-long downtime it can get updated using directly the streaming replication, instead the archiving files. Is not a time value, and this is important to keep in mind. Keeping 30 segments maybe for a high load will represent just a few hours of activity and in a low activity server maybe a few days. archive_mode and archive_command are in charge of copy those segments to other destination to be used for incremental backups or update slaves that were in long period of downtime.
Slave (8889) Configuration:
postgresql.conf:
wal_level = hot_standby
max_wal_senders=3
wal_keep_segments = 30
hot_standby = on
hot_standby_feedback = on
recovery.conf:
standby_mode = on
primary_conninfo = 'user=postgres host=localhost port=8888'
trigger_file=/tmp/promote_me

This is a classic setup of a asynchronous slave server through streaming replication. No delay specified, just replicate as fast as it can. The hot_standby_feedback option, allows to communicate with the master, in case of long running queries on the slave and to avoid vacuums during its execution on the master. Cascade slave (8890) configuration:
recovery.conf:
standby_mode = on
primary_conninfo = 'user=postgres host=localhost port=8889'
Configuration of this host is like every slave, the difference is that the primary_conninfo will point to the slave server where you have taken the backup.
Delayed Slave (8891) configuration:
recovery.conf:
restore_command = '/opt/pg92beta/omniti/bin/omnipitr-restore -D /opt/pg92beta/data_delayed -l /opt/pg92beta/omniti-restore.log -s /opt/pg92beta/archives_master -w 300 %f %p'
archive_cleanup_command = '/opt/pg92beta/bin/pg_archivecleanup /opt/pg92beta/archives_master %r'
standby_mode = on
You can download omnipitr-restore at https://github.com/omniti-labs/omnipitr . Some documentation here at OmniPitr with delay.
restore_command is in charge to apply the corresponding segments and the option “-w 300” is the “wait” to apply the delay. The server will be on standby mode, which means that is not accessible to query it.

Streaming delayed hot-standby slave (8892) configuration:
Configuration is almost the same as other slaves, just keep an eye on: postgresql.conf:
#...same configuration as the other slaves...
hot_standby_feedback = off   #not necessary cause the server is delayed

Script for delayed using recovery_target_time:
#!/bin/bash
 
HOME_PG=/opt/pg92beta
DATA_PG=$HOME_PG/data_streaming_delayed
DELAYED_DATE=$(date "+%Y-%m-%d %H:%M:%S %Z" --date="5 minutes ago")
RECOVERY_FILE_LOCATION=$DATA_PG/recovery.conf
RECOVERY_FILE_LOCATION_DONE=$DATA_PG/recovery.done
CONTROL_COMMAND=$HOME_PG/control_streaming_delayed.sh
PORT=8892
 
 
res=$($CONTROL_COMMAND stop)
 
if [ -f $RECOVERY_FILE_LOCATION_DONE ] ; then
 mv $RECOVERY_FILE_LOCATION_DONE $RECOVERY_FILE_LOCATION
fi
 
sed -i "s/.*recovery_target_time[ |\=].*/recovery_target_time =\'$DELAYED_DATE\'/g" $RECOVERY_FILE_LOCATION
 
#Taken from the doc:
# If this recovery target is not the desired stopping point,
# then shutdown the server, change the recovery target settings
# to a later target and restart to continue recovery.
res=$($CONTROL_COMMAND start)
control_streaming_delayed.sh is just a one line script (--mode=fast is required to stop queries when restarted):
su postgres -c "bin/pg_ctl -D data_streaming_delayed --mode=fast -l logfile_streaming_delayed $1"
This configuration has its cons:
The “delay” frame isn’t exact. Why? You need to cron the script - which has the recovery_target_time change- each x period of time. During that period of time, the delay will increase until you re-execute the script. So, if you have configured a “30 minutes ago” and your script is executed each 5 minutes, your delay will be between 30 and 35 minutes. The scripts executes a stop/start, which is necessary to load the new configuration. DO NOT RUN A RELOAD, because the slave will stop the recovery mode. Due this, you will need to have in mind if you want to make queries in that server.
But its advantages are:
You can query the server. You can just stop the refresh of the recovery_target_time for make your reports against that server. Due that the server replay is paused, you will experience more performance in your queries.

After every start, you will find something like:
LOG:  database system was shut down in recovery at 2012-07-04 11:14:39 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/1E000020
LOG:  consistent recovery state reached at 0/1E010B48
LOG:  database system is ready to accept read only connections
LOG:  recovery stopping before commit of transaction 14024, time 2012-07-04 11:14:18.319328+00
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.

Do not execute pg_xlog_replay_resume() if you want to keep recovering. If you execute it, will replay the last records and start as a stand alone server, which isn’t our objective. That function will update until the last record received and will continue doing that until something stops the replay.
Let’s see an example:
root@lapp /opt/pg92beta# bin/psql -p8888 -Upostgres
psql (9.2beta2)
Type "help" for help.
 
postgres=# insert into check_time values (now());
INSERT 0 1
postgres=# select * from check_time ;
            i
----------------------------
2012-07-04 11:19:47.667041
2012-07-04 11:43:23.794384
(2 rows)
 
postgres=# \q
root@lapp /opt/pg92beta# ./change_recovery_target_time.sh
root@lapp /opt/pg92beta# tail data_streaming_delayed/pg_log/postgresql-2012-07-04_11
postgresql-2012-07-04_110653.log  postgresql-2012-07-04_110853.log  postgresql-2012-07-04_111440.log  postgresql-2012-07-04_114335.log
postgresql-2012-07-04_110808.log  postgresql-2012-07-04_111315.log  postgresql-2012-07-04_111954.log
root@lapp /opt/pg92beta# tail data_streaming_delayed/pg_log/postgresql-2012-07-04_114335.log
LOG:  database system was shut down in recovery at 2012-07-04 11:43:33 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/1E000020
LOG:  consistent recovery state reached at 0/1E012734
LOG:  database system is ready to accept read only connections
LOG:  recovery stopping before commit of transaction 14027, time 2012-07-04 11:43:23.797137+00
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.
root@lapp /opt/pg92beta# bin/psql -p8892 -Upostgres
psql (9.2beta2)
Type "help" for help.
 
postgres=# select * from check_time ;
            i
----------------------------
2012-07-04 11:19:47.667041
(1 row)
 
postgres=# \q

For large amounts of data, maybe you want to have more large delays (1 day, 6 hours) due to the spended time to start an instance. Based on some experience, 1 day shuold be find to detect problems or to work with reports.
http://pgxn.org/search?q=delay&in=docs
http://pgxnclient.projects.postgresql.org/install.html
http://vibhorkumar.wordpress.com/2011/11/15/new-replication-and-recovery-features-in-postgresql-9-1/

Exploring Configuration Management with Ansible

 

What is Ansible?

Ansible is a configuration management and deployment system, like Puppet, Capistrano, Fabric, and Chef. Its aim is to be radically simple and let you use your existing scripts to help with cluster configuration and software deployment whenever possible. Here are the ways that Ansible differentiates itself.

Simplicity

Ansible does not include a client/server architecture with pull-based clients (although in more recent versions, it does include pull-based configuration and deployment). Rather, it uses pre-existing network infrastructure: SSH. Every company has SSH installed on their cluster servers, and Ansible simply rides on top of this infrastructure to get the code and configuration out to the nodes.

Language Agnostic

You can write modules for Ansible in whatever language you desire. You simply tie into its API and go. If you have wanted to use configuration management tools like Chef, but felt put off by the need to learn Ruby, this would be an ideal choice for you.

Configuration Management and Deployment

Ansible playbooks can run steps in a defined order between roles, allowing you to be sure that, for example, the database is set up before the WWW servers start and attempt to create their database schemas. 

Why Recommend Ansible?

Ansible fills a very useful niche. It is good for companies that have grown to the point where configuration management and deploy tools would save time and help manage complexity, but not large enough to hire specialised Ruby-speaking configuration management experts.

It is good for companies that already have invested effort into an SSH infrastructure for inter-node communication. Because Ansible piggybacks onto your already-existing SSH communications infrastructure, there is no need to build large, complex server infrastructure and schedule client runs.

Finally, Ansible is good for companies that don't have a lot of time to invest into learning a new tool. There is no need to spend weeks reading tutorials and watching videos from conferences; the tool is radically simple. A few hours with the documentation, and you're off and running.

If you'd like to hear more about how Palomino can help you with installing and configuring Ansible, feel free to contact us! If you have experience with Ansible, feel free to comment, and let's start a dialogue.

Syndicate content
Website by Digital Loom