Blog

Insights from the PgCon 2013

PgCon 2013 was attended by 256 people across the globe.  Attendees had the opportunity to enjoy tutorials, talks and an excellent unconference (this last deserves a special mention).

I lectured a talk related with Full text search using Sphinx and Postgres (you can find the slides at http://t.co/lgFoLq37EC, and all of the talks have been recorded).  The quality of the talks in general was quite good, but I don't want to repeat what you will find in other posts.

The unconference was attended quite late into the evening. You can find a schedule of it, as well as the minutes of some of the talks that happened (and others that didn't also) here.

There was a special emphasis on the pluggable storage feature, albeit most agree that it will be a very difficult feature to implement in the near versions. A topic related to this, was the Foreign Data Wrapper enhancements.

Pluggable Storage engine was extended after. The main reason of why everybody agrees with this feature, is because an API for the storage will allow companies to collaborate with code and avoid forks to other projects.

There was a long discussion also about migrations on the hall, using pg_upgrade.

The features about replication were bi-directional and logical replication.

Full text search unconference discussion was pretty interesting. Oleg Bartunov and Alexander showed a really interesting work coming up for optimizing GIN indexes. According to their benchmarks, Postgres could improve the performance significantly.

There were a lot of discussion I missed, due the wide number of tracks and "hall spots". But th emajority of attendees I heard agreed that the unconference was quite exciting and granted the possibility to bring many new ideas.

 

 

 

 

Supporting Feminism in Technology - Part 1

 

I've been contemplating the topic of feminism and misogyny in the technology field a lot of late.  This blog post is a culmination of a significant amount of thought and reflection on the topic of women in technology.  At Palomino, I focus a lot on the values around bringing underserved populations into technology.  Women, people from working class or impoverished backgrounds, people who are gay and lesbian or transgendered, and people of latino and african american backgrounds are traditionally highly underrepresented in the US technological workforce.  Palomino, even being a woman owned business, is not exempt from this issue.  One of my goals is to build up not just Palomino's DBA and engineer population to reflect higher percentages of these populations, but to support more people in the entire community in having these opportunities.  I'd like to focus on gender in this conversation, though I have much to say on the topics of race and class as well.  In fact, they are all interrelated.


To dig into the topic, one of the first things to consider and that people ask, is why is this a big deal.  And at the top level, without any time put into consideration, I can see why someone might feel this way.  After all, if a DBA is good, why does their gender, race or background matter?  And, if you are simply considering the output of an individual or organization, this is pretty true.  But, there is more.  My hope is that people who focus on the importance of open source software and free access to technology would understand the importance of building larger populations of women engineers and administrators, but that has proven itself to not be true.


The fact is, that on a macro level, one of the largest ways to get more people from underserved populations into jobs such as database administration, infrastructure architecture and software engineering is to provide them with mentors and role models who have already broken through the barriers to make it.  And, we do exist.  Palomino and Blue Gecko were both built by women.  Oracle's Ace and Ace Director list has about 12 (of 390) women on it.  I have been meeting more people of color and women working on senior teams of clients.  I do see potential role models and mentors out there.  But, don't get too excited.  There are still plenty of  opportunities for improving how we build welcoming workplaces for the talented and diverse engineers already out there making their way in the field.  


There's also the selfish part of the equation.  More often than not, when I find women and people of color in the wild, with successful records as engineers and operators, they tend to be extremely good at their jobs.  They tend to be excellent communicators, good with clients, detailed with project planning and highly technically competent.  Is this because of more innate talent?  No.  It is because the amount of willpower, inner strength, self-confidence and chutzpah required to succeed for these people is much higher than the predominant demographic of engineers.


1. Mindfulness in Language and Communication - I find this is particularly true in remote workforces, such as Palomino where the entire culture is often built about word choice and expression of ideas.  There are obvious cases, such as how often people start email threads with "Gentlemen".  Then, there is simply the propagation of cultures around masculinity, or "brogramming".  This is more delicate.  After all, there are plenty of women, myself included, who enjoy conversations around traditionally masculine pursuits and endeavors.  I lean more towards not an exclusion of topic, but an inclusion and mindfulness of those who might feel left out.  Do you ask women about their favorite sports teams?  Do you keep an eye out for folks who might retreat from certain topics and adjust your conversations accordingly?  Shutting off social conversation is not generally helpful, but as leaders in organizations, it is a responsibility to help guide conversations to be as inclusive and supportive as possible to all staff.  And of course, any traditionally sexist, racist or classist conversations need to be privately nipped in the bud immediately as a manner of course.  Creating space for other conversations outside of traditionally masculine ones to occur is also critical.  Ask people who are not from the dominant race/class/gender in your organization about their weekends and pastimes.  Don't assume a woman is interested in knitting, but give her a chance to express what she likes.  She might surprise you and your team with the diverse range of interests that might be brought up.


2. Examine the Gendered Roles and Behaviors - Go to most tech sites and look at their team pages.  I'm willing to bet that if you are looking at client facing positions that require emotional intelligence and empathy, you will find more women than in the technical fields.  Palomino is no exception.  Our project and account management teams are all female.  Our office manager is male, however.  Ultimately, I don't recommend the policing of the gender of individual roles, but I do believe it's important to examine key expectations and behaviors around staff.  For instance, it is common practice to assume engineers and administrators do not have the emotional/social capacity to interact with users/clients.  So, organizations put account managers or project managers in between, who are often female and thus considered more socially and emotionally adept.  Rarely is it considered a priority to encourage the technical staff to step up, improve their soft skills such as empathy and to interact directly with the client base.  Instead, we build a culture of mothering, which is harmful to all parties involved.


Additionally, do we value the roles that are more empathetic, client facing and emotionally intelligent?  People always discuss how hard it is to retain and find good DBAs, and their salaries, power and "catering to" reflect this in the organization.  While a good PM may not be as hard to find, they are still just as valuable to an organization.  Do you take these roles for granted, or do you also make them feel as important, valued and encouraged as your more technical staff?  Do you let mediation fall to these same people, or do you encourage all staff to develop their skills in negotiation and conflict resolution?  


3. The Devil is in the Details - At the recent Percona Live conference, T-Shirts were given to all attendees.  When asked if there were women's sizes, the organizers stated they were unisex.  Unisex is not actually unisex.  It is men's, and not designed for women's bodies.  These details, while not large individually, add up to a feeling of being an add on; just as much as lack of kosher meals, or wheelchair ramps far from the main entrance can cause one to feel like an afterthought.   Take the extra step to define and socialize your diversity policy and your code of conduct.  O'Reilly has a great code of conduct at http://oreilly.com/conferences/code-of-conduct.html.  Note, that defining the code of conduct or the diversity policy is not enough.  You need to talk to people about these things and engage them.  When you are discussing policies around employees, or evaluating a new client, think about how this fits in to your policies.  When you are planning a company offsite, organizing a conference or writing a blog post, think about these policies.  Who will be involved or affected by your choices?  What can you do to make them feel more included?  Take the time to really think about this.


4. Recruiting - This is a challenging position, and one that I've had to consider for quite some time.  At Palomino, I'd say we get perhaps 1 out of 20 applicants who are women via our natural model of letting people come to us via word of mouth.  That is obviously a horrible ratio.  Too often, people just say "well, if women don't interview how can we hire them?".  That's a cop out.  Most hiring managers know that you don't get A players from a passive recruiting strategy.  This is just as true for getting women to interview for technical positions.  You need to spend time going to events such as the ADA Initiative Unconference (http://sf.adacamp.org), Women Powering Technology Summit (http://www.witi.com) and sponsoring, speaking and getting involved.  There are numerous meetups, from Girls who Code in NYC, Girls in Tech in Las Vegas and Women in Tech in SF.  Additionally, you should be going through LinkedIn to find women and contacting them.  Even if they are not interested, by building a network that includes more and more women, you are improving the possibility that you will find the right women for your organization.  Get out there and speak at meetups, start some introductory courses for women coming out of college and continue to build that network.  There is no reason to stay at a 5% rate of interviews, but you have to work!


This is part 1 in 2 parts.  I'd like to focus next on some ways in which dialogue around the conversations can go wrong, and how to discuss and respond to conversations around feminism and misogyny in a constructive manner.  I do look forward to feedback and conversations around the topic, and I thank you for your time in reading and considering this.

Benchmarking Postgres on AWS 4,000 PIOPs EBS instances

Introduction

Disk I/O is frequently the performance bottleneck with relational databases. With AWS recently releasing 4,000 PIOPs EBS volumes, I wanted to do some benchmarking with pgbench and PostgreSQL 9.2. Prior to this release the maximum available I/O capacity was 2,000 IOPs per volume. EBS IOPs are read and written in 16Kb chunks with their performance limited by both the I/O capacity of the EBS volumes and the network bandwidth between an EC2 instance and the EBS network. My goal isn't to provide a PostgreSQL tuning guide, an EC2 tuning guide, or a database deathmatch complete with graphs; I'll just be displaying what kind of performance is available out-of-the-box without substantive tuning. In other words, this is an exploratory benchmark not a comparative benchmark. I would have liked to compare the performance of 4,000 PIOPs EBS volumes with 2,000 PIOPs EBS volumes, but I ran out of time so that will have to wait for a following post.

Setup

Region

I conducted my testing in AWS' São Paulo region. One benefit of testing in sa-east-1 is that spot prices for larger instances are (anecdotally) more stable than in us-east. Unfortunately, sa-east-1 doesn't have any cluster compute (CC) instances available. CC instances have twice the bandwidth to the EBS network than non-CC EC2 instances. That additional bandwidth allows you to construct larger software RAID volumes. My cocktail napkin calculations show that it should be possible to reach 50,000 PIOPs on an EBS-backed CC instance without much of a problem.

EC2 instances

I tested with three EC2 instances: an m1.large from which to run pgbench, an m2.2xlarge with four EBS volumes, and an m1.xlarge with one EBS volume. All EBS volumes are 400GB with 4,000 provisioned IOPs. The m1.large instance was an on-demand instance; the other instances  — the pgbench target database servers — were all spot instances with a maximum bid of $0.05. (In one case our first spot instance was terminated, and we had to rebuild it). Some brief testing showed that having an external machine driving the benchmark was critical for the best results.

Operating System

All EC2 instances are running Ubuntu 12.10. A custom sysctl.conf tuned the Sys V shared memory as well as set swappiness to zero and memory overcommit to two.

kernel.shmmax = 13355443200
kernel.shmall = 13355443200
vm.swappiness = 0
vm.overcommit_memory = 2

Packages

The following packages were installed via apt-get:

  • htop
  • xfsprogs
  • debian-keyring
  • mdadm
  • postgresql-9.2
  • postgresql-contrib-9.2

In order to install the postgresql packages a pgdb.list file containing

deb http://apt.postgresql.org/pub/repos/apt/ squeeze-pgdg main

was placed in /etc/apt/sources.list.d and the following commands were run:

gpg --keyserver pgp.mit.edu --recv-keys ACCC4CF8
gpg --armor --export ACCC4CF8 | apt-key add -
apt-get update

RAID and Filesystems

For the one volume instance, I simply created an XFS file system and mounted it on /mnt/benchmark.

mkdir /mnt/benchmark
mkfs.xfs /dev/svdf 
mount -t xfs /dev/svdf /mnt/benchmark
echo "/dev/svdf    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab

For the four volume instance it was only slightly more involved. mkfs.xfs analyzes the underlying disk objects and determines the appropriate values for stride and width. Below are the commands for assembling a four volume mdadm software RAID array that is mounted on boot (assuming you've attached the EBS volumes to your EC2 instance). Running dpkg-reconfigure rebuilds the initrd image.

mkdir /mnt/benchmark
mdadm --create /dev/md0 --level=0 --raid-volumes=4 /dev/svdf /dev/svdg /dev/svdh /dev/svdi
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
mkfs.xfs /dev/md0
echo "/dev/md0    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab
dpkg-reconfigure mdadm

Benchmarking

pgbench is a utlity included in the postgresql-contrib-9.2 package. It approximates the TPC-B benchmark and can be looked at as a database stress test whose output is measured in transactions per second. It involves a significant amount of disk I/O with transactions that run for relatively short amounts of time. vacuumdb was run before each pgbench iteration. For each database server pgbench was run mimicking 16 clients, 32 clients, 48 clients, 64 clients, 80 clients, and 96 clients. At each of those client values, pgbench iterated ten times in steps of 100 from 100 to 1,000 transactions per client. It's important to realize that pgbench's stress test is not typical of a web application workload; most consumer facing web applications could achieve much higher rates than those mentioned here. The only pgbench results against AWS/EBS volumes that I'm-aware-of/is-quickly-googleable is from early 2012 and, at its best, achieves rates 50% slower than the lowest rates found here. I drove the benchmark using a very small, very unfancy bash script. An example of the pgbench commandline would be:

pgbench -h $DBHOST -j4 -r -Mextended -n -c48 -t600 -U$DBUSER

m1.xlarge with single 4,000 PIOPs volume

The maximum transaction volume for this isntance was when running below 48 concurrent clients and under 500 transactions per client. While the transaction throuput never dropped precipitously at any point, loads outside of that range exhibited varying performance. Even at its worst, though, this instance handled between 600-700 transactions/second.

m2.2xlarge with four 4,000 PIOPs volumes

I was impressed; at no point did the benchmark stress this instance — the tps rate was between 1700-1900 in most situations with peaks up to 2200 transactions per second. If I was asked to blindly size a "big" PostgreSQL database server running on AWS this is probably where I would start. It's not so large that you have operational issues like worrying about MTBFs for ten volume RAID arrays or trying to snapshot 4TB of disk space, but it is large enough to absorb a substantial amount of traffic.

Graphs and Tabular Data

single-4K-volume tps

The spread of transactions/second irrespective of number of clients.

Box plot of transactions per second. Single 4K volume

Data grouped by number of concurrent clients with each bar representing an increase in 100 transactions per second ranging from 100 to 1,000.

Bar graph of transactions per second grouped by concurrent clients. Single 4K volume

Progression of tps by individual level of concurrency. The x-axis tick marks measure single pgbench runs from 100 transactions per client to 1,000 transactions per client.

Six subgraphs of transactions per second by each level of concurrency. Single 4K volume

Raw tabular data

txns/client1002003004005006007008009001000
clients
1614551283118365311975336311009923648
321500124212327577476301067665688709
482818648997051029749736593766641
6494412817041010739596778662820612
808158931055809597801684708736663
96939889774772798682725662776708

four-4,000-PIOPs-volumes tps

Again, a box plot of the data with a y-axis of transactions/second.

Box plot of transactions per second. Four 4,000 PIOPs volumes

Grouped by number of concurrent clients between 100 and 1,000 transactions per client.

Bar graph of transactions per second grouped by concurrent clients. Four 4,000 PIOPs volumes

TPS by number of concurrent clients. The x-axis ticks mark pgbench runs progressing from 100 transactions per client to 1,000 transactions per client.

Six subgraphs of transactions per second by each level of concurrency. Four 4,000 PIOPs volumes

Tabular data m2.2xlarge with four 4,000 PIOPs EBS volumes

txns/client1002003004005006007008009001000
clients
161487161718771415138818821897177112671785
321804208321601791125919972230150117171918
481810215212961951211717751709180318171847
641810158015682056181117841849190919421658
801802204414672142164518961933174018211851
961595140320471731178318591708189617511801

Put Opsview Hosts Into Downtime via the Shell

Recently a client of ours who used opsview to manage their resources needed to place some of their hosts into downtime in conjunction with some other cron-scheduled tasks. In order to implement that functionality, I created this simple script that should work with most installations of opsview, or with a few modifications, can be modified to be used with other, similar REST interfaces. To use, modify the 5 variables at the top of the script as necessary. The url and username are what come with the default installation of opsview. Modify CURL if it's in a different place on your system. Then, to use, for example: opsview_rest_api_downtime.sh -p Pa5sw0rd -h host_name_in_opsview -c create -t 2 Where host_name is the hostname as defined in opsview, not necessarily the same as its actual hostname.
#!/bin/bash
#
# create or delete downtime for a single host using opsview curl rest api
 
CURL=/usr/bin/curl
OPSVIEW_HOSTNAME="opsview.example.com"
USERNAME=apiuser
URL="/rest/downtime"
hours_of_downtime=2
 
usage()
{
    echo "Usage: $0 -p <opsview apiuser password> -h <host> -c (create|delete) [-t <hours_of_downtime>]"
    exit 1
}
 
while getopts p:h:t:c: opt
do
    case $opt in 
      p) password=$OPTARG;;
      h) host=$OPTARG;;
      t) hours_of_downtime=$OPTARG;;
      c) command=$OPTARG;;
      \?) usage;;
    esac
done
 
 
if [ "x$password" = "x" ] || [ "x$host" = "x" ] || [ "x$command" = "x" ]
then
    usage
fi
 
# LOGIN
 
token_response=`$CURL -s -H 'Content-Type: application/json' https://$OPSVIEW_HOSTNAME/rest/login -d "{\"username\":\"$USERNAME\",\"password\":\"$password\"}"`
token=`echo $token_response | cut -d: -f 2 | tr -d '"{}'`
if [ ${#token} -ne 40 ]
then
    echo "$0: Invalid apiuser login. Unable to $command downtime."
    exit 1
fi
 
 
if [ "$command" = "create" ]
then
    # create downtime - POST
    starttime=`date +"%Y/%m/%d %H:%M:%S"` 
    endtime=`date +"%Y/%m/%d %H:%M:%S" -d "$hours_of_downtime hours"`
    comment="$0 api call"
    data="{\"starttime\":\"$starttime\",\"endtime\":\"$endtime\",\"comment\":\"$comment\"}"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" https://$OPSVIEW_HOSTNAME$URL?host=$host -d "$data"`
    exit_status=$?
else
    # delete downtime - DELETE
    params="host=$host"
    result=`$CURL -s -H "Content-Type: application/json" -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $token" -X DELETE https://$OPSVIEW_HOSTNAME$URL?$params`
    exit_status=$?
fi
echo "$result" | grep $host > /dev/null
host_in_output=$?
if [ "$exit_status" -ne "0" ] || [ "$host_in_output" -ne "0" ]
then
  echo "Unable to $command downtime for $host.  Result of call:"
  echo $result
  exit 1
fi

Benchmarking NDB vs Galera

Inspired by the benchmark in this post, we decided to run some NDB vs Galera benchmarks for ourselves.

We confirmed that NDB does not perform well using m1.large instances. In fact, it’s totally unacceptable -  no setup should ever have a minimum latency of 220ms - so m1.large instances are not an option. Apparently the instances get CPU bound, but CPU utilization never goes above ~50%. Maybe top/vmstat can’t be trusted in this virtualized environment?

So, why not use m1.xlarge instances? This sounds like a better plan!

As in the original post, our dataset is 15 tables of 2M rows each, created with:

./sysbench --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --mysql-table-engine=ndbcluster --mysql-user=user --mysql-host=host1 prepare

Benchmark against NDB was executed with:

for i in 8 16 32 64 128 256

do

./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306  --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i run > ndb_2_nodes_$i.txt

done

After we shutdown NDB, we started Galera and recreated the table, but found that running sysbench was failing. A suggestion from Hingo was to use --oltp-auto-inc=off, which worked.

Our benchmark against NDB was executed with:

for i in 8 16 32 64 128 256

do

./sysbench --report-interval=30 --test=tests/db/oltp.lua --oltp-tables-count=15 --oltp-table-size=2000000 --rand-init=on --oltp-read-only=off --rand-type=uniform --max-requests=0 --mysql-user=user --mysql-port=3306  --mysql-host=host1,host2 --mysql-table-engine=ndbcluster --max-time=600 --num-threads=$i --oltp-auto-inc=off run > galera_2_nodes_$i.txt

done

Below are the graphs of average throughput at the end of 10 minutes, and 95% response time.

 

 

 

 

Galera clearly performs better than NDB with 2 instances!

But things become very interesting when we graph the reports generated every 10 seconds.

 

 

 

 

 

Surprised, right? What is that?

Here we see that even if the workload fits completely in the buffer pool, the high number of TPS causes aggressive flushing.

We assume the benchmark in the Galera blog post was CPU bound, while in our benchmark the behavior is I/O bound.

We then added another 2 more nodes (m1.xlarge instances), but kept the dataset at 15 tables x 2M rows , and re-ran the benchmark with NDB and Galera. Performance on Galera gets stuck, due to I/O. Actually, with Galera, we found that performance on 4 nodes was worse than with 2 nodes; we assume this is caused by the fact that the whole cluster goes at the speed of the slower node.

Performance on NDB keeps growing as new nodes are added, so we added another 2 nodes for just NDB (6 nodes total).

 

 

 

 

The graphs show that NDB scales better than Galera, which is not what we expected to find.

It is perhaps unfair to say that NDB scales better than Galera, but rather that NDB checkpoint causes less stress on I/O than InnoDB checkpoint, thus the bottleneck is on InnoDB and not Galera itself. To be more precise, the bottleneck is on slow I/O.

The follow graph shows the performance with 512 threads and 4 nodes (NDB and Galera) or 6 nodes (only NDB). Data collected every 30 seconds.

"When the Nerds Go Marching In"

Palomino was honored to serve as part of the team of technologists on President Obama's re-election campaign. Atlantic Magazine ran a fascinating piece about Narwhal, the sophisticated data architecture that enabled the campaign to track voters, volunteers and online trends.

Palomino CEO Laine Campbell joined the team in Chicago for the final days of the campaign, ensuring maximum uptime and performance on the MySQL databases. Afterwards, President Obama thanked her for Palomino's contributions.

Chef Cookbooks for HBase on CentOS Released

At Palomino, we've been hard at work building the Palomino Cluster Tool. Its goal is to let you build realistically-sized[1] and functionally-configured[2] distributed databases in a matter of hours instead of days or weeks as it is at present. Today marks another milestone toward that goal as we release our Chef Cookbook for building HBase on CentOS!

 

Background

Riot Games was kind enough to open source their Chef Cookbook for building a Hadoop cluster. Although the code wasn't in a state that would produce a functional cluster, and it was almost entirely undocumented, it was a great start.

Recently I was tasked with building an HBase cluster on CentOS using Chef. Although I've written a Cookbook (three times!) to do so, my code was never fully optimized. It could build a cluster, but only with hard-coded configuration parameters, or it produced a cluster that was running in a non-realistic non-production configuration.

Using the Riot Games Cookbook and the lessons I'd learned in the past, I whipped it into shape. I not only modified it to produce a functional cluster in a non-Riot environment, but also to build HBase on top of that! There are over 800 changes in the diff and documentation on how to use it.

 

Source Code

Here you can find the newest Chef Cookbook for HBase on CentOS. Here you can find the original Ansible Playbooks for HBase on Ubuntu. If you would like to use this code to build your own cluster, you are encouraged to join the mailing list to get help and advice from your peers.

 

Notes

[1] A distributed database can be tested functionally by installing on a single machine, but when it comes time to run benchmarks, or to discover the other 90% of functionality that only appears in a distributed setup, you will want to have the database installed on many machines, preferably dozens.

[2] Many projects seem to stop short of installing the database in a way that would let you benchmark it. Perhaps there are shortcuts taken like putting all database files into /tmp, or disabling logging, or removing tricky/subtle components in the interest of simplicity. The Palomino Cluster Tool provides you with a cluster that's actually ready for production. Sure, you still have to edit the configurations a little, but a good base generic configuration is provided.

Couchbase rebalance freeze issue

We came across a Couchbase bug during a rebalance while upgrading online to 1.8.1 from 1.8.0.  

Via the UI, we upgraded our first node, re-added it to the cluster, and then set the rebalance off.  It was progressing fine, then stopped around 48% for all nodes.  The tap and disk queues were quiet and there were no servers in pending rebalance.  The upgraded node was able to service requests, but with only a small percentage of the items relative to the other nodes.  The cluster as a whole did not suffer in performance during this issue though there are some spikes in cpu during any rebalance.  

We decided to stop the rebalance, wait a few minutes, then rebalance and we see it is moving again, progressing beyond what it was.  It stopped again, now at 75%. Let sit for 7 mins, then hit Stop Rebalance and Rebalance. Not progressing at all now.

Couchbase support pointed to a bug where if there are empty vbuckets, rebalancing can hang.  This is fixed in 2.0.  The work around solution is to populate buckets with a minimum of 2048 short time to live (TTL >= (10 minutes per upgrade + (2 x rebalance_time)) x num_nodes) items so all vbuckets have something in them.  We then populated all buckets successfully and were able to restart the rebalance process which completed fine.

Reference:

http://www.couchbase.com/docs/couchbase-manual-1.8/couchbase-getting-started-upgrade-online.html

Benchmarking NDB Against InnoDB on a Write-Intensive Workload

Last month, we evaluated Amazon's new SSD offerings with an extensive series of performance benchmarks.

As a followup, we've prepared a second series of benchmarks that specifically explore performance issues with a write-intensive workload on both NDB and InnoDB storage engines.

Download a free PDF of our findings, and as always, we welcome your feedback, comments and questions below.

Testing and Analyzing Performance with Benchmarks

Generic benchmark tools can be very useful for testing performance on your system. These benchmark tools normally have a set of predefined workloads, but often they don't match your specific workload in useful ways.

One of the best ways to reproduce your workload is to have a good sense of the application that uses the database and how it manages requests to the database. If this is not an option, it is also possible to analyze traffic and to find the most common queries, and use those to define the most common workload.

You can analyze traffic in many ways, from tcpdump to general log, from binlog (only for DML statements) to slow query log.

Afterwards it is possible to analyze them with pt-query-digest (or the obsolete mk-query-digest) to find the most common and/or heavy queries.

In the system we analyze here, the workload was mainly write intensive and involved just 4 tables:

  • tableA was receiving single-row INSERT statements;
  • for each insert on tableA , on average 200 INSERTs were performed in the other 3 tables, distributed as follows: 100 on tableB, 95 on tableC, 5 on tableD (to be more specific , for each INSERT on tableB there is an INSERT either on tableC or tableD).

 

The system also receives SELECT statements, but in a very small number and very simple primary key lookup.

To simulate the workload, we generated a simple perl script that spawns a certain number of threads that perform the DML statements, and other threads that perform the SELECT statements.

At regular intervals, the script prints statistics and progress.

The benchmark test was executed in a setup with 2 hosts: one host where the client was running, and another host where the servers were running.

The RDBMS tested were: MariaDB 5.2.3 with TokuDB 5.2.7 and InnoDB, and Percona 5.5.20.

Additionally, Percona 5.5.20 was tested as multiple instances running on the same hosts.

 

The goal of the first benchmark test was to compare TokuDB against InnoDB for this specific workload.

We executed MariaDB with TokuDB with the following (simple) config file:

[mysqld] 
user=mysql 
table_open_cache=1024 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
datadir=/localfio/datadir
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M 
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/tokudb 
 

 

We found the performance of InnoDB significantly better compared than TokuDB in this instance, though this test - where the dataset fits almost entirely in memory - does not show the real power of TokuDB, which excels at insertion rate at scale. Because these tables have very few indexes, TokuDB and Fractal tree indexes weren't very efficient. Furthermore, the benchmarks were running on FusionIO, which meant that performance on InnoDB didn't degrade much as on spinning disks. We excluded TokuDB out from the next benchmark tests because they are all cases which are not well-suited for TokuDB’s strengths.

We temporarily abandoned MariaDB, and tested Percona 5.5.20 with the following config file:

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=2G
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 

 

We tried various innodb_flush_method attempts, and the graphs show that O_DIRECT performs slightly better than the default fsync(), even if the benchmark shows a weird bootstrap. We also tried ALL_O_DIRECT, which performed badly.

 

Additionally, we tried innodb_log_block_size=4096 instead of the default 512, but nothing changed: insert rate wasn't affected.

 

One of the goals of this benchmark was to test if running multiple mysqld instances on the same host performs better than a single mysqld instance.

On this specific hardware, the answer seems to be yes. Configuring 8 mysqld instances with the same config file listed below (but different paths and ports), throughput is significantly higher. Note that innodb_buffer_pool_size was set to 256M to try to stress the IO subsystem.

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 
 

 

All the above tests were executed using 36 client connections for writes and 36 client connections for reads.

 

We then ran a new cycle of tests, but instead of using 36 x 2 connections, we used 80 x 2 (80 for writes and 80 for reads).

 

 

With 80 connections, throughput was higher than with 36 connections, but at nearly regular intervals we found performance dropping. This seems independent from the size of the buffer pool.

It is interesting to note that with only one mysqld instance, FusionIO was performing at 4.7k – 4.8k IOPS, while with 8 mysqld instances FusionIO was performing at 27k – 29k IOPS. As expected, with a small buffer pool performance tends to slowly degrade when the data doesn't fit in memory.

We tried various values of innodb_write_io_threads, but this didn't make any difference, since the Redo Log was the most written and not the tablespaces.

To better analyze the throughput, we reduced the sample time to 10 seconds and reran the test:

 

 

It is clear that throughput drops from time to time, and for a nearly constant amount of time. While the test was running, we tried to monitor the mysqld instances, but there was no clear indication of why they were stalling. The Redo Log wasn't anywhere close to full and InnoDB wasn't performing aggressive flushing. The amount of data read from disk was pretty low but the amount of data written was spiking. Yet, the writes weren't coming from InnoDB.

The reason for the stalls became apparent when we analyzed the content of /proc/meminfo: the Linux Virtual Memory (VM) subsystem was performing dirty pages flushing!

We changed the dirty_background_ratio from 10 (the default) to 1 , and reran the test.

sysctl -w vm.dirty_background_ratio=1

 

Throughput is now way more stable, although performance has dropped by 2.8%. It is interesting to note that throughput drops at nearly the same time no matter the value of dirty_background_ratio.

A quick analysis of MySQL source code shows that binlog are synced to disk when closed, therefore the drops in throughput may be caused by the flush of binary logs.

We then raised vm.dirty_background_ratio up to 10 (the default value) and lowered max_binlog_size from 1G to 64M.

 

 

Throughput doesn't drop drastically as in the two previous tests, but goes up and down at more regular intervals.

At the end of this test, performance with max_binlog_size=64M is ~4% lower than the initial test with max_binlog_size=1G (in both cases, vm.dirty_background_ratio=10).

The last setup of 8 instances with a 256M buffer pool each and max_binlog_size=64M was then compared with a new setup:  4 instances with a 512M buffer pool each (2GB total in both cases) and max_binlog_size=64M:

 

 

An interesting outcome from this last test is that total throughput raised by around 4% (that was originally lost using binlogs of 64M) and that the total number of IOPS dropped to ~16k, leaving room for more IO in case of a different workload.

We then ran a new test using only 2 mysqld instances. It shows what was already easy to guess when running a similar test with only one mysqld instance: a lower number of mysqld instances can't fully utilize IO capacity and therefore has lower throughput.

 

Conclusions (most of them are as expected) for this specific workload and on this specific hardware:

O_DIRECT performs slightly better than the default fsync for innodb_flush_method .

A high number of clients provides more throughput than a smaller number of clients: not enough tests were performed to find the optimal number of clients.

Throughput reduces when data doesn't fit in the buffer pool.

A high number of mysqld instances running on the same server are able to better utilize the number of IOPS that FusionIO is able to provide (perhaps, it should be a very bad idea to run multiple mysqld instances on the same spinning disk or array)

The sync of binlog during binlog rotation are able to stall the system. Lowering dirty_background_ration or max_binlog_size is able to stabilize the throughput.

Syndicate content
Website by Digital Loom