MySQL

Benchmarking Postgres on AWS 4,000 PIOPs EBS instances

Introduction

Disk I/O is frequently the performance bottleneck with relational databases. With AWS recently releasing 4,000 PIOPs EBS volumes, I wanted to do some benchmarking with pgbench and PostgreSQL 9.2. Prior to this release the maximum available I/O capacity was 2,000 IOPs per volume. EBS IOPs are read and written in 16Kb chunks with their performance limited by both the I/O capacity of the EBS volumes and the network bandwidth between an EC2 instance and the EBS network. My goal isn't to provide a PostgreSQL tuning guide, an EC2 tuning guide, or a database deathmatch complete with graphs; I'll just be displaying what kind of performance is available out-of-the-box without substantive tuning. In other words, this is an exploratory benchmark not a comparative benchmark. I would have liked to compare the performance of 4,000 PIOPs EBS volumes with 2,000 PIOPs EBS volumes, but I ran out of time so that will have to wait for a following post.

Setup

Region

I conducted my testing in AWS' São Paulo region. One benefit of testing in sa-east-1 is that spot prices for larger instances are (anecdotally) more stable than in us-east. Unfortunately, sa-east-1 doesn't have any cluster compute (CC) instances available. CC instances have twice the bandwidth to the EBS network than non-CC EC2 instances. That additional bandwidth allows you to construct larger software RAID volumes. My cocktail napkin calculations show that it should be possible to reach 50,000 PIOPs on an EBS-backed CC instance without much of a problem.

EC2 instances

I tested with three EC2 instances: an m1.large from which to run pgbench, an m2.2xlarge with four EBS volumes, and an m1.xlarge with one EBS volume. All EBS volumes are 400GB with 4,000 provisioned IOPs. The m1.large instance was an on-demand instance; the other instances  — the pgbench target database servers — were all spot instances with a maximum bid of $0.05. (In one case our first spot instance was terminated, and we had to rebuild it). Some brief testing showed that having an external machine driving the benchmark was critical for the best results.

Operating System

All EC2 instances are running Ubuntu 12.10. A custom sysctl.conf tuned the Sys V shared memory as well as set swappiness to zero and memory overcommit to two.

kernel.shmmax = 13355443200
kernel.shmall = 13355443200
vm.swappiness = 0
vm.overcommit_memory = 2

Packages

The following packages were installed via apt-get:

  • htop
  • xfsprogs
  • debian-keyring
  • mdadm
  • postgresql-9.2
  • postgresql-contrib-9.2

In order to install the postgresql packages a pgdb.list file containing

deb http://apt.postgresql.org/pub/repos/apt/ squeeze-pgdg main

was placed in /etc/apt/sources.list.d and the following commands were run:

gpg --keyserver pgp.mit.edu --recv-keys ACCC4CF8
gpg --armor --export ACCC4CF8 | apt-key add -
apt-get update

RAID and Filesystems

For the one volume instance, I simply created an XFS file system and mounted it on /mnt/benchmark.

mkdir /mnt/benchmark
mkfs.xfs /dev/svdf 
mount -t xfs /dev/svdf /mnt/benchmark
echo "/dev/svdf    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab

For the four volume instance it was only slightly more involved. mkfs.xfs analyzes the underlying disk objects and determines the appropriate values for stride and width. Below are the commands for assembling a four volume mdadm software RAID array that is mounted on boot (assuming you've attached the EBS volumes to your EC2 instance). Running dpkg-reconfigure rebuilds the initrd image.

mkdir /mnt/benchmark
mdadm --create /dev/md0 --level=0 --raid-volumes=4 /dev/svdf /dev/svdg /dev/svdh /dev/svdi
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
mkfs.xfs /dev/md0
echo "/dev/md0    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab
dpkg-reconfigure mdadm

Benchmarking

pgbench is a utlity included in the postgresql-contrib-9.2 package. It approximates the TPC-B benchmark and can be looked at as a database stress test whose output is measured in transactions per second. It involves a significant amount of disk I/O with transactions that run for relatively short amounts of time. vacuumdb was run before each pgbench iteration. For each database server pgbench was run mimicking 16 clients, 32 clients, 48 clients, 64 clients, 80 clients, and 96 clients. At each of those client values, pgbench iterated ten times in steps of 100 from 100 to 1,000 transactions per client. It's important to realize that pgbench's stress test is not typical of a web application workload; most consumer facing web applications could achieve much higher rates than those mentioned here. The only pgbench results against AWS/EBS volumes that I'm-aware-of/is-quickly-googleable is from early 2012 and, at its best, achieves rates 50% slower than the lowest rates found here. I drove the benchmark using a very small, very unfancy bash script. An example of the pgbench commandline would be:

pgbench -h $DBHOST -j4 -r -Mextended -n -c48 -t600 -U$DBUSER

m1.xlarge with single 4,000 PIOPs volume

The maximum transaction volume for this isntance was when running below 48 concurrent clients and under 500 transactions per client. While the transaction throuput never dropped precipitously at any point, loads outside of that range exhibited varying performance. Even at its worst, though, this instance handled between 600-700 transactions/second.

m2.2xlarge with four 4,000 PIOPs volumes

I was impressed; at no point did the benchmark stress this instance — the tps rate was between 1700-1900 in most situations with peaks up to 2200 transactions per second. If I was asked to blindly size a "big" PostgreSQL database server running on AWS this is probably where I would start. It's not so large that you have operational issues like worrying about MTBFs for ten volume RAID arrays or trying to snapshot 4TB of disk space, but it is large enough to absorb a substantial amount of traffic.

Graphs and Tabular Data

single-4K-volume tps

The spread of transactions/second irrespective of number of clients.

Box plot of transactions per second. Single 4K volume

Data grouped by number of concurrent clients with each bar representing an increase in 100 transactions per second ranging from 100 to 1,000.

Bar graph of transactions per second grouped by concurrent clients. Single 4K volume

Progression of tps by individual level of concurrency. The x-axis tick marks measure single pgbench runs from 100 transactions per client to 1,000 transactions per client.

Six subgraphs of transactions per second by each level of concurrency. Single 4K volume

Raw tabular data

txns/client1002003004005006007008009001000
clients
1614551283118365311975336311009923648
321500124212327577476301067665688709
482818648997051029749736593766641
6494412817041010739596778662820612
808158931055809597801684708736663
96939889774772798682725662776708

four-4,000-PIOPs-volumes tps

Again, a box plot of the data with a y-axis of transactions/second.

Box plot of transactions per second. Four 4,000 PIOPs volumes

Grouped by number of concurrent clients between 100 and 1,000 transactions per client.

Bar graph of transactions per second grouped by concurrent clients. Four 4,000 PIOPs volumes

TPS by number of concurrent clients. The x-axis ticks mark pgbench runs progressing from 100 transactions per client to 1,000 transactions per client.

Six subgraphs of transactions per second by each level of concurrency. Four 4,000 PIOPs volumes

Tabular data m2.2xlarge with four 4,000 PIOPs EBS volumes

txns/client1002003004005006007008009001000
clients
161487161718771415138818821897177112671785
321804208321601791125919972230150117171918
481810215212961951211717751709180318171847
641810158015682056181117841849190919421658
801802204414672142164518961933174018211851
961595140320471731178318591708189617511801

Nagios Check Calculated on Mysql Server Variables

Nagios Check For Calculating Based on Mysql Server Variables

Recently we needed to make a change for a client to one of our mysql monitoring tools so I thought it would be a good opportunity to highlight the tool and discuss some of the changes that I made.

You can access the tool on Github from our public repository here.

Before this change if you were using either the "varcomp" or "lastrun-varcomp" modes, it would only return a WARNING if your criteria for comparison were exceeded. In the new version, both WARNING and CRITICAL states can return to nagios. Here is an example: Let's say you want to alert on maximum connections, but the number of maximum connections is different for different hosts. Instead of writing distinct per/host checks, you can use this check to do a simple calculation and alert on the result of that calculation. In this case you want to send a warning when your connections exceeds 75% of max connections and alert when it reaches 80%. The entry in your nagios config file might look something like this:

mysql_health_check.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ --mode varcomp --expression "Threads_connected / max_connections * 100" --comparison_warning="> '$ARG3$'" --comparison_critical "> '$ARG4$' --shortname percent_max_connections

Where the expression flag uses the names that are returned from any of "FULL PROCESSLIST", "ENGINE INNODB STATUS", "GLOBAL VARIABLES", "GLOBAL STATUS", or "SLAVE STATUS". The comparison_warning and comparison_critical flags are going to be evaluated in perl, so ensure that it is a valid perl expression (in this case you could use either > or -gt. You'll definitely need to test out your commands with a few different use cases to ensure you have good syntax. When a varcomp or lastrun-varcomp check is run, the results are kept in a local cache file against which you can make comparisons. So, to give a ridiculous example, if you want to ensure that the number of open table definitions didn't increase too much between samples, you could do something like this

mysql_health_check.pl -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$ --mode lastrun-varcomp --expression "current{Open_table_definitions} - lastrun{Open_table_definitions}" --comparison_warning="> 10" --comparison_critical "> 20" --shortname increase_in_open_table_defs

Where the notation current{} and lastrun{} refer to which sample time you want. You can see more details on all the features of the script in the README on the plugin page. We welcome any comments and suggestions.

Benchmarking NDB vs Galera

Inspired by the benchmark in this post, we decided to run some NDB vs Galera benchmarks for ourselves.

We confirmed that NDB does not perform well using m1.large instances. In fact, it’s totally unacceptable -  no setup should ever have a minimum latency of 220ms - so m1.large instances are not an option. Apparently the instances get CPU bound, but CPU utilization never goes above ~50%. Maybe top/vmstat can’t be trusted in this virtualized environment?

So, why not use m1.xlarge instances? This sounds like a better plan!

As in the original post, our dataset is 15 tables of 2M rows each, created with:

./sysbench --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --mysql-table-engine=ndbcluster --mysql-user=user --mysql-host=host1 prepare

Benchmark against NDB was executed with:

for i in 8 16 32 64 128 256

do

./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306  --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i run > ndb_2_nodes_$i.txt

done

After we shutdown NDB, we started Galera and recreated the table, but found that running sysbench was failing. A suggestion from Hingo was to use --oltp-auto-inc=off, which worked.

Our benchmark against NDB was executed with:

for i in 8 16 32 64 128 256

do

./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306  --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i --oltp-auto-inc=off run > galera_2_nodes_$i.txt

done

Below are the graphs of average throughput at the end of 10 minutes, and 95% response time.

 

 

 

 

Galera clearly performs better than NDB with 2 instances!

But things become very interesting when we graph the reports generated every 10 seconds.

 

 

 

 

 

Surprised, right? What is that?

Here we see that even if the workload fits completely in the buffer pool, the high number of TPS causes aggressive flushing.

We assume the benchmark in the Galera blog post was CPU bound, while in our benchmark the behavior is I/O bound.

We then added another 2 more nodes (m1.xlarge instances), but kept the dataset at 15 tables x 2M rows , and re-ran the benchmark with NDB and Galera. Performance on Galera gets stuck, due to I/O. Actually, with Galera, we found that performance on 4 nodes was worse than with 2 nodes; we assume this is caused by the fact that the whole cluster goes at the speed of the slower node.

Performance on NDB keeps growing as new nodes are added, so we added another 2 nodes for just NDB (6 nodes total).

 

 

 

 

The graphs show that NDB scales better than Galera, which is not what we expected to find.

It is perhaps unfair to say that NDB scales better than Galera, but rather that NDB checkpoint causes less stress on I/O than InnoDB checkpoint, thus the bottleneck is on InnoDB and not Galera itself. To be more precise, the bottleneck is on slow I/O.

The follow graph shows the performance with 512 threads and 4 nodes (NDB and Galera) or 6 nodes (only NDB). Data collected every 30 seconds.

"When the Nerds Go Marching In"

Palomino was honored to serve as part of the team of technologists on President Obama's re-election campaign. Atlantic Magazine ran a fascinating piece about Narwhal, the sophisticated data architecture that enabled the campaign to track voters, volunteers and online trends.

Palomino CEO Laine Campbell joined the team in Chicago for the final days of the campaign, ensuring maximum uptime and performance on the MySQL databases. Afterwards, President Obama thanked her for Palomino's contributions.

When Disaster Strikes: Hurricane Sandy

 

We devoted the Palomino Newsletter this month to the important topic of disaster recovery, in light of the challenges posed by Hurricane Sandy. If you're not already receiving our newsletter, you can subscribe here.


Hurricane Sandy has been on many people's minds of late; mine not least.  Having lived the last 4 years of my life in Manhattan and on the Jersey Shore, the loss of lives, the destruction of homes, business and memories, and the disruption of so much has me in shock.  I grew up in Louisiana, and hurricanes were a way of life.  You didn't do something hoping that a hurricane would not come by.  You assumed a hurricane would come.  At least, that's how I was taught.  That's the mentality I try to bring into my architectures, my process and my planning as well.  So, when hurricane Sandy bore down on the East Coast, my alarm bells started ringing, just as my email started exploding.  Every one of our US-East Amazon customers was in danger.  Who knew when power would go out?  And when it would come back?

Palomino is proud to be an Amazon Web Services consulting partner. That said, we recognize that Amazon has had its share of instability.  A few weeks ago, US-East experienced some significant EBS latency and unavailability.  We've lost availability zones.  We've lost regions.  We've found availability zones inexplicably unpredictable in terms of latency and availability.  Amazon forces us to think resiliently.  Not in preventing disasters, but weathering them, bouncing back, and being ready.  Some say this is an issue with Amazon.  That the unreliability is a drawback.  Perhaps I'm the eternal optimist, but I simply see it as a way to force rigor in anticipating, documenting and practicing our availability and business continuity plans.

None of this is new or incredibly enlightening.  Any operations person worth their salt thinks of failure and what can go wrong, and they think of it often.  So what's the point here?  I thought I'd share the war stories of the weekend to help cast a light on varying degrees of preparation.

 

Client One

Client One contacted us.  They had anticipated the problem and already been preparing to create multi-region EC2 environments; Sandy just accelerated things.  This client is in RDS, Amazon's Relational Database as a Service - in this case MySQL as a service.  RDS is such a convenient tool, until it isn't.  One of the big drawbacks? No cross-region support.  Yes, you can use Multi-AZ replication for Master availability across availability zones.  Yes, you can also create replicas in multiple availability zones.  If you do both of these things, you've got a certain level of fault tolerance in place.  You can still get hurt if your master does a multi-AZ failover.  All of your replicas will break, as RDS doesn't take into account the ability to move manually to the next binlog when a master crashes before closing their binlogs.  Thus, you are without replicas.  Not great.  But you have a working master.  Similarly, you have multiple replicas across AZs, to tolerate those failures.  But cross-region?  Nothing.

So, we had to dump all of our RDS instances and load them into RDS in another region.  Parallel dumps and loads were kicked off, accelerating the very painful process of a logical rebuild of a system.  We used SSD ephemeral storage on EC2 to speed this up as well.  The process still took 2 days.  OpenVPNs had to be set up with mappings for port 3306 to allow replication.  If this hadn't already been in process before Sandy was a threat, we never would have been ready in time.  We still had and have issues.  You can't replicate from RDS in one region to another.  Custom ETL must be created in order to keep each table as in sync as possible.  We'd done this work in a previous plan to move off of RDS, mapping tables to one of three categories - static (read-only), insert only, or upd/del.  Static just needs to be monitored for changes.  Insert only can be kept close to fresh with high water marks and batch inserts.  Transactional requires keys on updated at and created at fields, and confidence in the values in those fields.  Deletes present even bigger problems.  Digging in further is out of scope here, but consider it a future topic.

Summary: Client One was in-process for multi-region disaster recovery (DR).  A fire-drill occurred, and people had to work long, long hours doing tedious work.  But, had Sandy hit their region with the force it hit further north, we'd have been ready.

 

Client Two

Client Two contacted us also.  They had known that they were at risk, but they were small, they were pushing new features and refactoring applications, and DR was far out on their roadmap.  They too, were on RDS.  They could not afford the amount of custom work our larger clients requested, so we had to create a best effort approach.  RDS instances were created in Portland, along with cache servers, transaction engines, web services and the rest of the stack. Amazon Machine Images (AMIs) were kicked out, and we built a dump and copy process across regions.  There would be data loss, up to many hours, if the region went down and never came up.  But they would not be dead in the water.  Data loss can be mitigated by more frequent dumps and copies, but not eliminated completely.

Summary: Client Two had no plans for multi-region DR.  They had taken a conscious risk.  Luckily they had the talent and agility of a small company and could move fast with our help.  Failing over would have hurt, but they'd still be alive.

 

Client Three

We reached out proactively to Client Three. They had put together a multi-region plan for critical systems last year before we started working with them, which included scripts to rapidly build out new clusters of Hadoop based systems.  It was supposed to just work.  When we started working with Client Three, we’d scheduled our DR review, testing and modernizing for our Q4 checklist.  Too little, too late, right?  Sure enough, things didn't "just work".  It wasn't horrible, but a weekend of cleaning up, rescripting and fixing problems as they rose occurred.  But had we had to fail over? They would've been ready.

Summary: Client Three had anticipated and architected DR, but they hadn’t tested it.  Luckily we had the days before the storm to test and to fix this.  If they hadn't planned at all, I'm not sure we would've made it.

 

It’s also worth remembering that you are not alone in these shared environments.  All weekend shops were staking claims on instances and storage, and building out.  Rolling out resources got slower, and if you didn't claim, you'd lose out.  This has to be considered in your plans.  

 

To recap: Palomino loves AWS.  We're a consulting partner and have helped many clients in many different business models deploy, scale and perform in AWS.  But DR is not a luxury anymore.  It's a necessity.  Architectures have to take multi-AZ and multi-region plans into consideration in the beginning.  Many people use AWS so they save money on hardware.  They get upset when you point out the labor and extra instances needed to guarantee they can weather these storms.  But it's a hard reality.  It's one of the reasons we only recommended RDS in early phases, when downtime is tolerable.  Good configuration management also means you can deploy a skeleton infrastructure in another region; you can explode that to a full-blown install with ease.  But you have to practice, and you have to move fast.  If you think your region can go down, go to DEFCON and push the buttons.  If you're wrong, you can always tear back down.

Anticipate.
Plan.
Build it early.
Automate it.
Test it.
Test it.
Test it.
Test it.

If you haven't been able to donate to the Red Cross or other institutions helping our fellow brothers and sisters in the Northeast and in the Caribbean, please take some time to do so.  Having lost property and cared for loved ones displaced by Katrina, and now hearing so many horror stories from New Jersey and New York, I urge everyone to donate money, donate shelter, donate time and skills if you have them.  

 

Jetpants Usage and Installation Gotchas

If you regularly manage enormous data sets, you've probably heard about Tumblr's exciting new toolkit called Jetpants, which automates common processes for MySQL data management, most notably in the area of rebalancing shards for more efficient scaling. In evaluating and using Jetpants in various environments, we identified some interesting installation and usage gotchas, and we've documented those here - let us know if you find other issues worth exploring.  

Jetpants Installation/Usage Gotchas for Ubuntu 12.04:

Jetpants has to-date only been tested on RHEL/CentOS distributions, though working with the author (Evan Elias of Tumblr, thank you much!) I was able to get it running on Ubuntu 12.04 and Mint Maya/13. There will be more work to do, since RHEL/CentOS report back a service status ("service mysql status" output, see later in this post) differently than Ubuntu/Mint, and there may be other differences as well.

For your Jetpants Console workstation, install Ruby 1.9.3 (also 1.9.1 will be installed), and check /etc/alternatives/ruby points at ruby1.9.3.  Also be sure you installed the MySQL client libraries.

# apt-get install ruby1.9.3 rubygems libmysqlclient-dev
# ln -sf /usr/bin/ruby1.9.3 /etc/alternatives/ruby
# ln -sf /usr/bin/gem1.9.3 /etc/alternatives/gem

After that, do gem install Jetpants. You may get scary warnings about inability to convert ASCII to UTF8 characters during documentation installation as below:

# gem install jetpants

Building native extensions.  This could take a while...
Fetching: sequel-3.39.0.gem (100%)
Fetching: net-ssh-2.5.2.gem (100%)
Fetching: coderay-1.0.7.gem (100%)
Fetching: slop-3.3.3.gem (100%)
Fetching: method_source-0.8.gem (100%)
Fetching: pry-0.9.10.gem (100%)
Fetching: highline-1.6.15.gem (100%)
Fetching: colored-1.2.gem (100%)
Fetching: jetpants-0.7.4.gem (100%)
Successfully installed mysql2-0.3.11
Successfully installed sequel-3.39.0
Successfully installed net-ssh-2.5.2
Successfully installed coderay-1.0.7
Successfully installed slop-3.3.3
Successfully installed method_source-0.8
Successfully installed pry-0.9.10
Successfully installed highline-1.6.15
Successfully installed colored-1.2
Successfully installed jetpants-0.7.4
10 gems installed
Installing ri documentation for mysql2-0.3.11...
Installing ri documentation for sequel-3.39.0...
Installing ri documentation for net-ssh-2.5.2...
unable to convert "\xE7" from ASCII-8BIT to UTF-8 for lib/net/ssh/authentication/pageant.rb, skipping
Installing ri documentation for coderay-1.0.7...
Installing ri documentation for slop-3.3.3...
Installing ri documentation for method_source-0.8...
Installing ri documentation for pry-0.9.10...
Installing ri documentation for highline-1.6.15...
Installing ri documentation for colored-1.2...
Installing ri documentation for jetpants-0.7.4...
Installing RDoc documentation for mysql2-0.3.11...
Installing RDoc documentation for sequel-3.39.0...
Installing RDoc documentation for net-ssh-2.5.2...
unable to convert "\xE7" from ASCII-8BIT to UTF-8 for lib/net/ssh/authentication/pageant.rb, skipping
Installing RDoc documentation for coderay-1.0.7...
Installing RDoc documentation for slop-3.3.3...
Installing RDoc documentation for method_source-0.8...
Installing RDoc documentation for pry-0.9.10...
Installing RDoc documentation for highline-1.6.15...
Installing RDoc documentation for colored-1.2...
Installing RDoc documentation for jetpants-0.7.4...

Be sure you've created /var/www writeable by the user you'll execute Jetpants as, and be sure you have an /etc/jetpants.yaml that is similar to the template provided on the Jetpants website.

And now you're ready to run Jetpants!

$ jetpants 

Tasks:

  jetpants activate_slave            # turn a standby slave into an active slave

  jetpants clone_slave               # clone a standby slave

  jetpants console                   # Jetpants interactive console

  jetpants destroy_slave             # remove a standby slave from its pool

  jetpants help [TASK]               # Describe available tasks or one specific task

  jetpants pools                     # display a full summary of every pool in the topology

  jetpants pools_compact             # display a compact summary (master, name, and size) of every pool in the topology

  jetpants promotion                 # perform a master promotion, changing which node is the master of a pool

  jetpants pull_slave                # turn an active slave into a standby slave

  jetpants rebuild_slave             # export and re-import data set on a standby slave

  jetpants regen_config              # regenerate the application configuration

  jetpants shard_cutover             # truncate the current last shard range, and add a new shard after it

  jetpants shard_offline             # mark a shard as offline (not readable or writable)

  jetpants shard_online              # mark a shard as fully online (readable and writable)

  jetpants shard_read_only           # mark a shard as read-only

  jetpants shard_split               # shard split step 1 of 4: spin up child pools with different portions of data set

  jetpants shard_split_child_reads   # shard split step 2 of 4: move reads to child shards

  jetpants shard_split_child_writes  # shard split step 3 of 4: move writes to child shards

  jetpants shard_split_cleanup       # shard split step 4 of 4: clean up data that replicated to wrong shard

  jetpants summary                   # display information about a node in the context of its pool

  jetpants weigh_slave               # change the weight of an active slave

As mentioned before, Jetpants is currently only well-tested for RHEL/CentOS distributions. RHEL/CentOS report back a service status ("service mysql status" output) differently than Ubuntu/Mint. Following is the requirements of Jetpants and actual output:

/sbin/service mysql status # check if mysql daemon running; output must include the string 'not running' if mysql is not running

This is how it works on both Ubuntu 12.04 and Mint Maya/13:

# service mysql stop
mysql stop/waiting

# service mysql status
mysql stop/waiting
# service mysql start
mysql start/running, process 31346
# service mysql status
mysql start/running, process 31346

I decided to check the return code, but it's always 0 regardless of the service state, so we must parse the output in Jetpants. See the following output:

# service mysql status ; echo $?
mysql stop/waiting
0
# service mysql start
mysql start/running, process 31629
# service mysql status ; echo $?
mysql start/running, process 31629
0

To support Debian-alikes, Jetpants must check for the other string, since the strings being output by "service" makes it impossible to check for a single substring across distros.

Over the coming days, I will be identifying other CentOS/RHEL-specific code and writing patches to make it work with at least also Ubuntu 12.04. In the future, if you're trying to port Jetpants to another distribution, you'll be able to use my pull request as a template for the areas that are distribution-specific.

When running Jetpants in EC2, there are some more gotchas (or perhaps they're just "be sure to follow best-practices"):

1. You must allow root login to your cluster. Generate a privkey/pubkey pair and allow root logins. 

      a. Edit ~root/.ssh/authorized_keys and get rid of the part where it has the "command=..." that prints an error and logs you out.

     b. Edit /etc/cloud/cloud.cfg and set disable_root: 0

     c.  Edit /etc/ssh/sshd_config and set PermitRootLogin without-password

2. You must be sure your master/slaves replication topology uses only IP addresses. If your slaves have, eg "Master_Host: ec2-5-2-3-8.compute.amazonaws.com" then Jetpants will fail. IP addressses is how you want your cluster configured in production, so do it, but it may surprise you if you're trying to build a small test cluster.

3. You must allow root login from localhost on all your MySQL instances. In other words, this command must return valid output on your slaves: mysql -e 'show slave status\G' - note that you can put root login credentials in ~root/.my.cnf rather than allow passwordless MySQL root logins.

In general, it's good to remember that Jetpants is a set of utilities for controlling an actual large MySQL cluster, so it will not work so well for tiny little test deploys where you've cut corners, and this is how it should be. Here are more Jetpants Environment Requirements/Limitations, from the documentation:

1. Need pigz installed on your cluster. If you use the Palomino Cluster Tool to build your shards, pigz will be installed for you.

2. No auto_increment on sharded tables. Must have ID generator.

3. PK of all sharded tables should start with the shard key. This will vastly improve performance of long-running processes like shard splits.

4. Port :3306 everywhere. One instance per node.

5. (Probably) doesn't work with MyISAM.

Benchmarking NDB Against InnoDB on a Write-Intensive Workload

Last month, we evaluated Amazon's new SSD offerings with an extensive series of performance benchmarks.

As a followup, we've prepared a second series of benchmarks that specifically explore performance issues with a write-intensive workload on both NDB and InnoDB storage engines.

Download a free PDF of our findings, and as always, we welcome your feedback, comments and questions below.

Exploring Configuration Management with Ansible

 

What is Ansible?

Ansible is a configuration management and deployment system, like Puppet, Capistrano, Fabric, and Chef. Its aim is to be radically simple and let you use your existing scripts to help with cluster configuration and software deployment whenever possible. Here are the ways that Ansible differentiates itself.

Simplicity

Ansible does not include a client/server architecture with pull-based clients (although in more recent versions, it does include pull-based configuration and deployment). Rather, it uses pre-existing network infrastructure: SSH. Every company has SSH installed on their cluster servers, and Ansible simply rides on top of this infrastructure to get the code and configuration out to the nodes.

Language Agnostic

You can write modules for Ansible in whatever language you desire. You simply tie into its API and go. If you have wanted to use configuration management tools like Chef, but felt put off by the need to learn Ruby, this would be an ideal choice for you.

Configuration Management and Deployment

Ansible playbooks can run steps in a defined order between roles, allowing you to be sure that, for example, the database is set up before the WWW servers start and attempt to create their database schemas. 

Why Recommend Ansible?

Ansible fills a very useful niche. It is good for companies that have grown to the point where configuration management and deploy tools would save time and help manage complexity, but not large enough to hire specialised Ruby-speaking configuration management experts.

It is good for companies that already have invested effort into an SSH infrastructure for inter-node communication. Because Ansible piggybacks onto your already-existing SSH communications infrastructure, there is no need to build large, complex server infrastructure and schedule client runs.

Finally, Ansible is good for companies that don't have a lot of time to invest into learning a new tool. There is no need to spend weeks reading tutorials and watching videos from conferences; the tool is radically simple. A few hours with the documentation, and you're off and running.

If you'd like to hear more about how Palomino can help you with installing and configuring Ansible, feel free to contact us! If you have experience with Ansible, feel free to comment, and let's start a dialogue.

Palomino Evaluates Amazon’s New High I/O SSD Instances

Amazon Web Services (AWS) recently introduced a new storage layer with SSD as its backend. Because their previous EBS storage has been challenging in the areas of I/O throughput and stability, we were extremely excited to run comparative benchmarks to offer a recommendation for customers considering upgrading to SSD. 

The opportunity to expand I/O throughput horizontally can create a longer runway for sharding (distributing data-sets across multiple MySQL logical clusters in order to reduce write I/O), which can be quite compelling. 

Our extensive tests, described in detail in our Evaluation Report prepared by René Cannao and Laine Campbell (click here for a free PDF), illustrate the potentially enormous gains in throughput, performance and cost for companies scaling MySQL in Amazon. Palomino would strongly recommend that clients consider upgrading to SSD storage. Learn more.

Optimizing your MySQL Tuning Methodology

 

Optimizing your queries

There are two general methods for creating a query plan for a query. A rule-based optimizer goes through your query, sees if you're doing X or Y, and then optimizes for method A or B depending. Rule-based optimizers grow rather large as there are many possible cases to account for. Sometime in the past couple of decades, rule-based optimizers fell out of favour and cost-based optimizers took over instead. At the moment the query is submitted, the optimizer does some simple calculations of costs of various ways of doing the query and picks the probably-best. It turns out the simplicity of the optimizer combined with the "it generally works" nature of the beast means it totally wins the popularity contest.

 

My Tuning Methodology, the Overview

With this in mind, I wanted to evaluate my own tuning methodology. At first blush, I thought my method rather coarse and unrefined. I generally look at every variable in my.cnf, do a quick search if I don't know what it is, set it to something that “seems about right,” and then go to the next one.

 

My tuning experience is almost exclusively in high-volume shops, which isn’t representative of the many different MySQL installations of different sizes in the world. In high-volume environments, there are no good rules of thumb for tunings, because if you followed any such rules, your MySQLs would crash into the dust. So this rather naïve approach is actually a good start. Because there's that word again. Rule.

 

You can tune the my.cnf by having several rules of thumb handy. A very long ruleset of "if this, then that" either documented or as a mental checklist. This works well on MySQL installations that are typical, but it isn't my methodology. Mine is iterative and experimentally-based.

 

My Tuning Methodology, the Details

So now I've got a my.cnf that I’ve touched line-by-line and picked values that seemed reasonable. Anyone who's ever administered a MySQL knows this my.cnf is going to be a disaster if left unchecked. But here's where my magic begins and why my methodology works in high-volume environments where no rules of thumb apply.

 

I place the my.cnf onto a machine that is taking representative load. That is, it receives queries much like it would if it were performing its production role. There are a lot of ways of doing this, but my favorite is to put the slave into a production pool at a small fraction (perhaps 1%) of the workload of a currently-production slave. 

 

Next, I point trending software at the slave. What do I monitor? That's easy. EVERYTHING. Every single statistic I can think of, I graph, from the OS and DBMS. Everything sar would ever tell you. Everything from "show global status". Everything in log files I can turn into numbers.

 

Then I measure the "black-box expected throughput" of the machine. That is, it's getting a fraction of the query traffic, so is it performing obviously better than the other hosts performing full work? Say this machine is supposed to be able to answer its typical queries in 250ms. Is it doing it in at most 250ms, and perhaps better? If so, I increase the load on the machine. I keep increasing the load until it starts to struggle. Say I get it up to 150% normal load, and it's now taking 260ms to answer its queries. Now I know it's "stressed."

 

At this point, I start to look at every statistic I've graphed and look for things at their limits. Is CPU maxed out? Is disk I/O at a plateau? Are read_rnd_next handler stats off-the-charts? I find it useful to have the same metrics coming from an existing reference slave, but sometimes they are obviously in a bad state (like if read_rnd_next is pushing 100,000,000/s for example, we probably need some indexes).

 

From here, I begin to get an idea of tunings or, more precisely, "experiments I can run." Maybe the tables opened is going crazy. So I tune table_open_cache, restart the server, and see if the graph corrects (secondary concern) and if my server is responding better, say 10ms faster per query (primary concern, which is most-correctly measured as "does the application seem happy with this server in its current state?"). Maybe disk I/O is crazy, but InnoDB free pages is nothing. That suggests an experiment to increase innodb_buffer_pool_size.

 

These are perhaps obvious examples, but there can be some subtler and surprising tuning choices to be made when you start viewing graphs of statistics you don't even know the meaning of. For example, suppose you didn't know what handler_read_rnd_next meant. You look at it and see it's 10,000x larger than handler_read_next which is 50x larger than handler_read_first. You start looking up what those stats mean, et voila, you've made a useful discovery.

 

At this point, I've formulated a hypothesis, which sounds a bit like this: "The graphs indicate that my system is doing suboptimal X. There is a tuning A that affects X. Thus I will change its current value from M to N, and I expect it will perform more optimally in X." I make the change, and verify the evidence on the graphs. Most importantly, the primary concern is that the application, the thing using my DBMS, should perform better. If the graphs indicate everything is better, but the application is performing worse, we must accept the failed experiment!

 

This leads us to an anecdote about InnoDB tuning.

 

Why Experimental Evidence Matters

At a previous company, I was running several 100% InnoDB MySQL servers in a particular slave query pool, and was getting a particular performance characteristic. I was looking at my.cnf and noted innodb_buffer_pool_size was rather small (2GB out of 24GB total available RAM) and that I had an extremly low cache hit rate. I formulated the obvious hypothesis: that I should increase the buffer pool to decrease my deplorable cache hit rate. Upon increasing the innodb_buffer_pool_size to 6GB (I believe in changing things in small increments), however, I discovered that although my cache hit rate much nicer, my query latency was also up (and the application was clearly suffering).

 

The first thing to do was tune it back to 2GB to ensure things returned to status quo. Then I formulated a new experiment which made no sense, but I had to try: I tuned innodb_buffer_pool_size to 1GB... Performance INCREASED over the original 2GB setting!

 

Further experimentation settled on about a 750MB innodb_buffer_pool_size being optimal for the machines in this particular query pool (that is, smaller values began leading to worse performance again).

 

This was EXTREMELY counter-intuitive, as it should be to you if you're a MySQL DBA. It took me a long time to accept the truth of this situation and move on. I formulated many other hypotheses in the process of trying to understand this phenomenon. MyISAM? Bad trending data? Matter/antimatter collision? Practical joke by my co-workers who were in the datacenter siphoning off the blue smoke from my machines?

 

To add to the mystery, I observed two additional phenomena: First, in other slave pools, increased innodb_buffer_pool_size indeed had the expected result: better cache hit rate and better performance. Second, several months later these machines were replaced with a new set of hardware that was configured similarly in RAM/disk, but were AMD processors instead of Intel. Upon loading them up with MySQL and putting them into the pool, they performed extremely badly compared to their peers. It took almost as long to discover the obvious, that increasing the innodb_buffer_pool_size decreased latency on this configuration, as it took to originally ignore this intuition. But the FACTS, that is, the graphs coming out of the trending software didn't lie. On this new configuration, we could see the performance increase as innodb_buffer_pool_size was tuned up. I believe we settled on about 18GB of total 24GB RAM as the optimal tuning.

 

Wrapping up the Experiment, Standardize on your Configuration

At some point, after staring at graphs, modifying parameters, altering tables, changing sysctl.conf settings, remounting filesystems, it starts to be more difficult to suggest tunings that will have a great effect. Keep in mind that the business has its own goals. After a while, getting another 5% performance out of your servers isn't worth the same to the business as, say, helping the developers push out the next release. So recognize that although you could spend the next several months staring at graphs and performing experiments (which are fun), at this point you should be satisfied with your increased performance and get back to the daily grind. Roll your tunings out to the pool of DBMSes, monitor the cluster to be sure that, on the whole it's performing as well as the individual test machine was (This is important! Sometimes a subset of the machines will behave in a deviant fashion differently than you expect! Be prepared for it!), and if so, move on to something else more important. Don't worry. Changes in workload or hardware configuration or application code will necessitate revisiting this procedure in the future.

 

Syndicate content
Website by Digital Loom