At PalominoDB, we are not just another remote DBA; we are an integral part of your webops, DBA and BI teams.

Blog

Testing and Analyzing Performance with Benchmarks

Generic benchmark tools can be very useful for testing performance on your system. These benchmark tools normally have a set of predefined workloads, but often they don't match your specific workload in useful ways.

One of the best ways to reproduce your workload is to have a good sense of the application that uses the database and how it manages requests to the database. If this is not an option, it is also possible to analyze traffic and to find the most common queries, and use those to define the most common workload.

You can analyze traffic in many ways, from tcpdump to general log, from binlog (only for DML statements) to slow query log.

Afterwards it is possible to analyze them with pt-query-digest (or the obsolete mk-query-digest) to find the most common and/or heavy queries.

In the system we analyze here, the workload was mainly write intensive and involved just 4 tables:

  • tableA was receiving single-row INSERT statements;
  • for each insert on tableA , on average 200 INSERTs were performed in the other 3 tables, distributed as follows: 100 on tableB, 95 on tableC, 5 on tableD (to be more specific , for each INSERT on tableB there is an INSERT either on tableC or tableD).

 

The system also receives SELECT statements, but in a very small number and very simple primary key lookup.

To simulate the workload, we generated a simple perl script that spawns a certain number of threads that perform the DML statements, and other threads that perform the SELECT statements.

At regular intervals, the script prints statistics and progress.

The benchmark test was executed in a setup with 2 hosts: one host where the client was running, and another host where the servers were running.

The RDBMS tested were: MariaDB 5.2.3 with TokuDB 5.2.7 and InnoDB, and Percona 5.5.20.

Additionally, Percona 5.5.20 was tested as multiple instances running on the same hosts.

 

The goal of the first benchmark test was to compare TokuDB against InnoDB for this specific workload.

We executed MariaDB with TokuDB with the following (simple) config file:

[mysqld] 
user=mysql 
table_open_cache=1024 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
datadir=/localfio/datadir
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M 
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/tokudb 
 

 

We found the performance of InnoDB significantly better compared than TokuDB in this instance, though this test - where the dataset fits almost entirely in memory - does not show the real power of TokuDB, which excels at insertion rate at scale. Because these tables have very few indexes, TokuDB and Fractal tree indexes weren't very efficient. Furthermore, the benchmarks were running on FusionIO, which meant that performance on InnoDB didn't degrade much as on spinning disks. We excluded TokuDB out from the next benchmark tests because they are all cases which are not well-suited for TokuDB’s strengths.

We temporarily abandoned MariaDB, and tested Percona 5.5.20 with the following config file:

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=2G
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 

 

We tried various innodb_flush_method attempts, and the graphs show that O_DIRECT performs slightly better than the default fsync(), even if the benchmark shows a weird bootstrap. We also tried ALL_O_DIRECT, which performed badly.

 

Additionally, we tried innodb_log_block_size=4096 instead of the default 512, but nothing changed: insert rate wasn't affected.

 

One of the goals of this benchmark was to test if running multiple mysqld instances on the same host performs better than a single mysqld instance.

On this specific hardware, the answer seems to be yes. Configuring 8 mysqld instances with the same config file listed below (but different paths and ports), throughput is significantly higher. Note that innodb_buffer_pool_size was set to 256M to try to stress the IO subsystem.

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 
 

 

All the above tests were executed using 36 client connections for writes and 36 client connections for reads.

 

We then ran a new cycle of tests, but instead of using 36 x 2 connections, we used 80 x 2 (80 for writes and 80 for reads).

 

 

With 80 connections, throughput was higher than with 36 connections, but at nearly regular intervals we found performance dropping. This seems independent from the size of the buffer pool.

It is interesting to note that with only one mysqld instance, FusionIO was performing at 4.7k – 4.8k IOPS, while with 8 mysqld instances FusionIO was performing at 27k – 29k IOPS. As expected, with a small buffer pool performance tends to slowly degrade when the data doesn't fit in memory.

We tried various values of innodb_write_io_threads, but this didn't make any difference, since the Redo Log was the most written and not the tablespaces.

To better analyze the throughput, we reduced the sample time to 10 seconds and reran the test:

 

 

It is clear that throughput drops from time to time, and for a nearly constant amount of time. While the test was running, we tried to monitor the mysqld instances, but there was no clear indication of why they were stalling. The Redo Log wasn't anywhere close to full and InnoDB wasn't performing aggressive flushing. The amount of data read from disk was pretty low but the amount of data written was spiking. Yet, the writes weren't coming from InnoDB.

The reason for the stalls became apparent when we analyzed the content of /proc/meminfo: the Linux Virtual Memory (VM) subsystem was performing dirty pages flushing!

We changed the dirty_background_ratio from 10 (the default) to 1 , and reran the test.

sysctl -w vm.dirty_background_ratio=1

 

Throughput is now way more stable, although performance has dropped by 2.8%. It is interesting to note that throughput drops at nearly the same time no matter the value of dirty_background_ratio.

A quick analysis of MySQL source code shows that binlog are synced to disk when closed, therefore the drops in throughput may be caused by the flush of binary logs.

We then raised vm.dirty_background_ratio up to 10 (the default value) and lowered max_binlog_size from 1G to 64M.

 

 

Throughput doesn't drop drastically as in the two previous tests, but goes up and down at more regular intervals.

At the end of this test, performance with max_binlog_size=64M is ~4% lower than the initial test with max_binlog_size=1G (in both cases, vm.dirty_background_ratio=10).

The last setup of 8 instances with a 256M buffer pool each and max_binlog_size=64M was then compared with a new setup:  4 instances with a 512M buffer pool each (2GB total in both cases) and max_binlog_size=64M:

 

 

An interesting outcome from this last test is that total throughput raised by around 4% (that was originally lost using binlogs of 64M) and that the total number of IOPS dropped to ~16k, leaving room for more IO in case of a different workload.

We then ran a new test using only 2 mysqld instances. It shows what was already easy to guess when running a similar test with only one mysqld instance: a lower number of mysqld instances can't fully utilize IO capacity and therefore has lower throughput.

 

Conclusions (most of them are as expected) for this specific workload and on this specific hardware:

O_DIRECT performs slightly better than the default fsync for innodb_flush_method .

A high number of clients provides more throughput than a smaller number of clients: not enough tests were performed to find the optimal number of clients.

Throughput reduces when data doesn't fit in the buffer pool.

A high number of mysqld instances running on the same server are able to better utilize the number of IOPS that FusionIO is able to provide (perhaps, it should be a very bad idea to run multiple mysqld instances on the same spinning disk or array)

The sync of binlog during binlog rotation are able to stall the system. Lowering dirty_background_ration or max_binlog_size is able to stabilize the throughput.

Securing MySQL on Linux Systems

 

MySQL is one of the most, if not the most, popular relational databases chosen by internet based startups in the last decade. Although it is a very robust platform and offers many of the necessary features to support the database needs of today's internet giants, it also suffers from some security issues that must be addressed prior to production use.

This article will discuss the necessary steps to secure a basic MySQL installation and additionally cover more advanced topics for general database security as would be applied to general production environments. 

Functionality

The following general statements apply to this security discussion:

This is a discussion of MySQL as installed on Linux; Windows, OSX, Solaris, and BSD installations may differ and the differences are not covered here.

MySQL will be handling generic web traffic as is commonly found on PHP, Python, Perl, and Java web applications. 

The standard, and included, MySQL administration tools will be used as necessary for daily contact with the database server. Non-standard tools are not discussed. 

Any remote connections not done over SSH, either explicitly or via tunnel, are assumed to be a security risk and are not advised, as they lack encryption and allow passwords to be read in transit by potential attackers. 

 

Pre-Installation Security Recommendations

Before installing any software it is a good idea to harden the operating system. This includes filesystem security features:

Encrypted filesystems to prevent inspection of data by unauthorized parties

Intrusion detection systems like Tripwire, which watch for changes to critical system files 

Port lockdown via firewall software to prevent access to system services

Strong root and OS-user passwords

Disallowing OS login by application users: set to “nologin” or “/bin/false” 

Setting up sudo for privileged accounts and requiring a password for sudo access

Running scheduled rootkit detection scripts

Running an operating system level Access Control List process: SELinux extensions or AppArmor. These programs will restrict system or server processes from accessing data and resources that is not explicitly defined. Due to general misunderstanding or lack of desire to maintain the access controls, these programs are very often disabled by system administrators. 

Post Install Security 

After MySQL is installed, either via RPM, Deb package, or other means, there are different approaches to securing the initial database. The first option is to execute the script provided with MySQL, named “mysql_secure_installation”. This will go through the following steps, which can also be taken manually if you prefer:

Checking for existence of the Root password, and if not found it will set one

mysql> SELECT * FROM mysql.users WHERE User=’root’;

mysql> UPDATE mysql.users SET Password=password(‘your password here’) WHERE User=’root’; 

Delete anonymous users

mysql> DELETE FROM mysql.users WHERE User=’’;

Removing remote access for the root account

mysql> DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');

Removing the default “test” database

mysql> DROP database test;

Reloading the user privilege tables

mysql> FLUSH PRIVILEGES;

 

By default, MySQL will run via the “mysqld_safe” which contrary to the name does not make MySQL safer for security reasons aside from ensuring that the mysqld process is not running under the root user. The mysqld_safe script provides the functionality as follows: “Script to start the MySQL daemon and restart it if it dies unexpectedly”. As such, if one attempts to run mysqld as the root user it will complain and refuse to start. If you are troubleshooting the mysqld process and want to run without mysqld_safe you can run it as follows, via sudo. Your binary location may differ from /usr/local/bin/mysqld so replace as necessary.  

$>  sudo -u mysql /usr/local/bin/mysqld

 

MySQL Configuration Considerations

There are several configuration settings that can further secure the database during operation. These settings will be found in the /etc/my.cnf or /etc/mysql/my.cnf file depending on the version of Linux being used. 

 

old-passwords: this allows MySQL to create and authenticate users via the outdated and insecure password hashing from version 4 and older. It is strongly recommended against using this in production.

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_old-passwords

sql-mode: this allows the administrator to run MySQL in various operating modes. There are many options but for security the recommended minimum setting is “NO_AUTO_CREATE_USER” or “TRADITIONAL”

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_sql-mode

skip-grant-tables: this starts the mysqld process without any authentication tables, which is useful if one needs to recover or reset a lost root password. However, if you see this option enabled on a production server that is not undergoing recovery it should be seen as a critical security issue and dealt with immediately. 

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_skip-grant-tables

bind-address: this can be used to restrict network and socket access to specific interfaces, thus ensuring that traffic can only originate through the desired interface. An example is a server with multiple network interfaces (eth0, eth1, eth2) that resides on multiple subnets; in some architectures the database will answer only to application servers or users on a specific subnet and thus the mysqld process needs to be restricted to listen for connections on only the required subnet. 

http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_bind-address

 

Data Encryption

MySQL offers in-row data encryption functions. These can be used to ensure that even in the event of a security breach, that the attacker cannot access the data in the database tables without an encryption key. Although not a MySQL specific functionality, the SQL functions for AES_ENCRYPT and AES_DECRYPT, along with a multitude of HASH mechanisms exist for employing in-row data encryption. Read more about this topic here: http://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html

 

Database User and Connection Security Considerations

MySQL users, by definition, can connect to the database server and access data. How much or how little access your users are granted will determine the level of security compromise that they pose to potential attackers. The most insecure user that can be created is a user with wildcard remote host connection ability (the ability to connect from any location, any host, any IP address) who has SUPER and GRANT options; this user can create other users and has the ability to shut down the database server process. As such, it is wise to limit the amount of access that a user is granted when creating accounts. Here are some things to keep in mind:

Limit user access to specific schemas: granting global access to a user allows the user to access all current and future schemas as opposed to a single defined schema. 

Limit originating users connections from specific IP addresses or subnets instead of allowing connections from any host or any network. 

Where possible limit users to connect only from localhost (127.0.0.1) and advise ssh tunnels to be utilized to gain access to localhost. 

Do not grant SUPER privileges to non-administrative users

Do not grant replication privileges to non replication process users.

When granting replication processes limit them to only replication privileges.

When possible, do not use hostnames for host connection privileges, instead specify IP addresses. This removes the risk of DNS spoofing and removes the requirement for a DNS lookup during connection initiation which saves time and resources.

Require strong passwords for all users and rotate passwords on a defined schedule.

If running connections over the internet without a VPN, SSH tunnel, or other encrypted means, consider using SSL for all connections. Otherwise passwords can be seen in transit by attackers, similar to FTP. 

 

References for further reading:

Redis Persistence

Clients often ask us about the benefits of using key-value stores, such as Redis, for high-volume environments. One of the key benefits that is often cited is the durability offered by Redis persistence (as described in detail here).

Developer Salvatore ‘antirez’ Sanfilippo delves into the topic on his blog. Here are some key questions you should consider as you evaluate Redis in your own environment.

Redis replication (and one of two methods of persistence) is achieved using the AOF (append-only file), which logs all statements that modify data. In the example in the article, there is a DELETE issued on a non-existent key, and that statement doesn't get logged nor replicated to a slave Redis. But in real life, masters and slaves get out-of-sync, and it can be handy to have a statement that does nothing on the master, but when replicated to the slave, has the tangible effect of moving the master and slave closer toward convergence. It seems at least it should be an option to have "no-effect" statements replicate from the master to the slave.

When the AOF gets too large, an AOF rewrite occurs. This is the minimal set of statements needed to log to reproduce the data set as it is in memory:

You may wonder what happens to data that is written to the server while the rewrite is in progress. This new data is simply also written to the old (current) AOF file, and at the same time queued into an in-memory buffer, so that when the new AOF is ready we can write this missing part inside it, and finally replace the old AOF file with the new one.

So what happens when we run out of RAM? That would be a very interesting behaviour to have defined unambiguously. He also doesn't talk about how long it takes to write the in-memory delta to the AOF before the swap-over. I suspect during that time, the server will be largely unresponsive (or should be, to preserve data integrity). On a busy server, there might be millions of entries in the in-memory delta buffer once the new AOF is finished writing.

If you think Redis having persistence means you can serve data from disk, you'd be wrong. The point of persistence is just to get the in-memory-only dataset back after a crash or restart.

AOF rewrites are generated only using sequential I/O operations, so the whole dump process is efficient even with rotational disks (no random I/O is performed). This is also true for RDB snapshots generation. The complete lack of Random I/O accesses is a rare feature among databases, and is possible mostly because Redis serves read operations from memory, so data on disk does not need to be organized for a random access pattern, but just for a sequential loading on restart.

So Redis really is just (a fast) memcached but with some persistence methods.

There is an option for fsync'ing data to the AOF in various ways. Be careful, the "appendfsync everysec" setting is actually worst-case every TWO seconds. It's only every second on average.

When you set appendfsync always, Redis still doesn't do an fsync after every write. If there are multiple threads writing, it'll batch the writes together doing what he calls "group commit." That is, every thread that performs a write in the current event loop will get written at the same time at the end of the event loop. I can't think of any downside to this, as I don't think clients get their response that data was written until the end of the event loop.

Restarting Redis requires either re-loading an RDB (Redis snapshot) or replaying AOF transactions to get to the state before the server was stopped. Redis is an in-memory database, so as you might expect, the start-up times are fairly onerous.

Redis server will load an RDB file at the rate of 10 ~ 20 seconds per gigabyte of memory used, so loading a dataset composed of tens of gigabytes can take even a few minutes.

That's the best case, as we note from the following:

Loading an AOF file [takes] twice per gigabyte in Redis 2.6, but of course if a lot of writes reached the AOF file after the latest compaction it can take longer (however Redis in the default configuration triggers a rewrite automatically if the AOF size reaches 200% of the initial size).

It isn't pretty, but I'm really glad the author is giving so much transparency here. An optimisation mentioned is to run a Redis slave and have it continue serving the application while the Redis master is restarted.

The author notes that in high-volume environments, a traditional RDBMS can in theory serve reads from the moment it's started, in practice that can cause the database to become fairly unresponsive as the disks seek like wild to pull in the required data. He further notes that Redis, once it starts serving reads, serves them at full speed.

At Palomino, we are dedicated to bringing rigor in benchmarking and analysis to the DBMSs that are our core compentencies. In a future post, watch for us to test Redis and put some solid numbers behind some of this. We are interested in seeing how Redis performs during AOF rewrites and how long it takes to start up on typical modern hardware at typical modern loads.

The Postgres-XC 1.0 beta release overview

 

When I heard about this project a year ago, I was really excited about it. Many cluster-wide projects based on Postgres were developed very slowly, based on older (i.e. Postgres-R http://www.postgres-r.org/) or proprietary (i.e. Greenplum) versions. The features that this project hoped to achieve were ambitious, as we’ll detail in this post. And best of all - this project is based on the 9.1 Postgresql version, which is really up-to-date (at the moment of writing this post, this is the last stable version).

If you are interested in a serious project for scaling horizontally your PostgreSQL architecture, you may visit the official website at http://postgres-xc.sourceforge.net/ and take a look. 

For those who are interested, there will be a tutorial at PgCon this year.  As a brief overview, I will try to give you a broad idea for those who want to get involved in the project.

 

What Postgres-XC can do:

  • Support multi-master architecture. Data nodes can contain part or all of the data of a relationship. 
  • Transparent view to application from any master. The application only needs to interact with the cluster through coordinators.
  • Distribute/Replicate per relation (replication, round robin (by default if any unique column is specified), by hash (by default if a unique is specified), by modulo or a set to a group or node)
  • Parallel transaction execution among cluster nodes.

 

What Postgres-XC cannot do:

  • Support triggers (may be supported in future releases).
  • Distribute a table with more than one parent. 

 

Before you start:

  You need to install the most recent versions of Flex and Bison. That’s important because in the last tarball, ‘./configure’ won’t raise error if they are missing, and the error will be prompted once you execute ‘make’. You will need readline-dev and zlib-dev (not mandatory but strongly recommended).

According to the documentation, Postgres-XC should be compatible with Linux platforms based upon Intel x86_64 CPU. 

The documentation needs to be improved, so we advise you to try the steps directly and read the help prompted by the commands. For example, the initdb command in the documentation is incomplete, “--nodename” is mandatory in this version. This project is new and has only a few contributors to date, but we hope its community keeps growing. Most importantly, it is great that a beta release was launched earlier than we expected.

 

Elements of architecture

 

  • GTM (Global Transaction Manager)

+ Realize that I say only GTM, not GTMs. Only one GTM can be the manager. For redundancy, you have GTM-Standby and to improve performance and failover GTM-Proxies.

+ The GTM serializes all the transaction processing and can limit the whole scalability if you implement it primitively. This should not be used in slow/wide networks and is recommended to involve the fewest number of switches between GTM and coordinators. The proxies reduce the iteration with the GTM and improve the performance. 

+ Uses MVCC technology. That means that it will still use the same control for the concurrency as Postgres.

+ This is the first thing you will need to configure. If you set up everything in the same machine, you will need to create a separate folder for the configuration and files.

 

  • Coordinator/s

+ Interface for applications (like a Postgres backend). 

+ It has its own pooler (yes, and I think this is great, avoiding more complexity in big environments).

+ Doesn’t store any data. The queries are executed in the datanodes, but...

+ … it has its own data folder (for global catalogs).

  • Datanode/s

+ Stores the data.

+ It receives the petition with a GXID (Global Transaction ID) and Global Snapshot to allow requests from several coordinators.

Both Datanodes and Coordinator use their own data directory, so keep this in mind this if you are setting both up on the same machine. 

Configuring several GTM-Proxies will improve the scalability, shrinking the I/O in the GTM. Plus, you can configure the GTM-Standby to avoid a SPOF of the general manager. It not only provides the GXID, it also receives the node registration (you can trace your log or check the file inside the gtm folder called register.node, it’s binary but is readable) and most importantly, it holds the snapshot of the current status of all the transactions.

Coordinators can point to all the datanodes and can point to the same datanode (as Oracle RAC, but we’re not sure if all the features included in that solution will be available for Postgres-XC). Coordinators connect to the proxies (if you have already configured them) or the main GTM.

Hope you enjoyed the read. We are preparing more cool stuff about this amazing solution, keep in touch!

New versions of PgPool released - 3.1.3 & 3.0.7


This essential tool for Postgres architectures is continually improving, and is now available in its new releases. Both are bugfix versions.
 
For those unfamiliar with the tool, it is a middleware with functionality as a load balancer, pooler*  and/or replication system for PostgreSQL databases. The 3.1.x versions are compatible with Postgres 9.x, whose streaming replication feature was pushed to Pgpool developers to take advantage of it.   This allows the tool to balance queries without using the pgpool-replication technique.
 
In the 3.1.3/3.0.7 fixes we have:

  • Allow multi statement transactions in master/slave mode: Transactions with BEGIN, since 3.1 were sent to the slaves/standby servers as well. This brings non desirable effects when the transaction contains DELETE/INSERT/UPDATE, due to the fact that standbys cannot execute writable SQL. (3.1.3 fix)
  • Important fixes for failover detection and execution. (3.1.3 fix)
  • Added m4 files to avoid problems when compiling in older operating systems.
  • Fixed hangup on PREPARE errors.
  • Fixed memory leak in reset queries.


If you are running 3.1.x against Postgres 9 databases, we strongly recommend you upgrade PgPool due to the fixing in the multi statement feature.

For more information http://www.pgpool.net/mediawiki/index.php/Main_Page

* If you need only a connection pooler for Postgres, I prefer PgBouncer http://wiki.postgresql.org/wiki/PgBouncer. It is lightweight and more specific and simply works, without as much configuration.

MySQL Conference and Expo - 2012 and our company offsite

Last week was a huge week for PalominoDB.  I will admit to being cautiously optimistic about Percona taking over this conference - one of the biggest parts of the year for the MySQL community.  That being said, the conference was done quite well.  I really applaud Percona for making it happen.  In particular, the dot org pavillion was an excellent addition.  While I was stuck at the booth most of the event, I had a great view for the sheer variety of attendees coming through, and had the privilege of participating in a number of excellent conversations.

I noticed a number of patterns that seemed to prevail among the conversations.  Folks seem eager to move on to MySQL 5.5, finally comfortable with its stability.  Administrators are eager to learn more about MariaDB and Drizzle, and how they can differentiate themselves from the Oracle variants.  Partitioning is more prevalent as datasets grow, and sharding is becoming almost commonplace.  Now the focus is more on the challenging questions around HA - multi-master, synchronous replication and multi-datacenter installations.  People seem more interested in commercial additions around the MySQL ecosystem such as Tungsten, TokuDB, ScaleArc, Scalebase and Clustrix.  

René Cannao, one of our senior administrators did a great tutorial on understanding performance through measurement, benchmarking and profiling.  Feedback has been great, and we look forward to continuing to evolve our benchmarking and profiling methods.  Please keep an eye out.

Additionally, we were able to announce a very exciting partnership with SkySQL.  PalominoDB focuses on operational excellence by providing top DBAs to our clients.  We dig in to our clients' architectures and improve them, maintain them and help redesign them as they grow.  Our oncall services are top notch, regularly answering pages in under 5 minutes - and always providing a talented and experienced DBA on the other end of the phone.  What we don't do is software support.  It's just not our experience.  We can tune it, run it and grow it, but for clients who need to dig into code, fix bugs and really provide deep internals knowledge in MySQL - SkySQL are who we turn to.   We are also quite excited to help augment SkySQL's excellent services with our own - to create some of the happiest customers out there.

We are thrilled to see our partnership ecosystem grow, our service offerings expand and knowledge of our brand and the quality of services continue to improve.  I can't help but glow with pride at the reputation PalominoDB has built - through our DBAs, our clients and our partners.  Being a part of the MySQL  conference and expo only cemented this pride in community, and pride in our work.  Thank you to everyone who has helped us get there.

Thank you all of you!

Exploring a new feature of 9.2: Index-only scans

We, like other Postgres DBAs worldwide, have been waiting for the 9.2 release for some time, specifically for the index-only scan feature, which will help reduce I/O by preventing unnecessary access to heap data if you only need data from the index.

Besides 9.2 is still in development, it is possible to download a version for testing at http://www.postgresql.org/download/snapshots/ . It's important to note that it doesn’t add new behaviours, but improves the way that indexes are used.

How can we test this feature? We created a ‘big’ table starting with 10 million+ records with random values.

  • Extract a few elements, using a where clause.
  • Use aggregations.
  • Use partitioning plus index only scans.



When is this feature most useful?:

  • When you select only the columns that are specified in the index definition,  including those which are at the condition part of the query.
  • If a vacuum was executed previously. Thus happens because the scan can skip the “heap fetch” if the TID references a heap [table] page on which all tuples are known visible to everybody (src/backend/executor/nodeIndexonlyscan.c).


So, the main table is:


CREATE TABLE lot_of_values AS SELECT i, clock_timestamp() t1, random() r1, random() r2, random() r3, clock_timestamp() + (round(random()*1000)::text || ' days')::interval d1 from generate_series(1,10000000) i(i);
ALTER TABLE lot_of_values ADD PRIMARY KEY(i);

CREATE INDEX CONCURRENTLY ON lot_of_values (d1);



Something interesting: due to some improvements in write performance, we realized that 9.2 demonstrated better timing compared with 9.1.3 (~200k in 9.1, ~170k ms on 9.2). The index creation was slightly better on 9.2.

The table will contain data like this:

stuff=# \x
Expanded display is on.
stuff=# select * from lot_of_values limit 1;
-[ RECORD 1 ]---------------------
i  | 1
t1 | 2012-04-18 08:37:14.426624+00
r1 | 0.571268450468779
r2 | 0.222371176816523
r3 | 0.72282966086641
d1 | 2012-08-17 08:37:14.426713+00


Ok, let’s start with some examples. As we previously explained, we need to specify columns that are only in the index. You can’t use columns from 2 different indexes. The next example is a clear fail:


stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
             QUERY PLAN                                
---------------------------------------------
Seq Scan on lot_of_values  (cost=0.00..263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)

stuff=# set enable_seqscan=off;
SET
stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
                                      QUERY PLAN                                       
---------------------------------------------
Seq Scan on lot_of_values  (cost=10000000000.00..10000263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)



We don’t have indexes at r1 and it isn’t part of the index!

The next example is another fail, using a column that is defined in another index or directly not defined in any index:


stuff=# explain select i,  d1 from lot_of_values where i between 12345 and 23456;
                                         QUERY PLAN                                           
---------------------------------------------
Index Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=12)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



The next example is the correct case:


stuff=# explain select i from lot_of_values where i between 12345 and 23456;
                                           QUERY PLAN                                             
---------------------------------------------
Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=4)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



Also, we can try with a non-pk index:


stuff=# explain select min(d1), max(d1) from lot_of_values ;
                                          QUERY PLAN                                                             
---------------------------------------------
Result  (cost=6.93..6.94 rows=1 width=0)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan Backward using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
(9 rows) stuff=# explain select min(i), max(i), avg(i) from lot_of_values where i between 1234 and 2345;
                                             QUERY PLAN                                               
---------------------------------------------
Aggregate  (cost=66.90..66.91 rows=1 width=4)
  ->  Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..58.21 rows=1159 width=4)
        Index Cond: ((i >= 1234) AND (i <= 2345))
(3 rows)


The aggregation cases are special.  Index-only scans are not useful for count(*) without condition, because the index scan needs to check the visibility of the tuple, which makes it expensive. So, if you need to count the entire table, a sequential scan must be perfomed.

Just for testing purposes, we’ll try to “turn off” the seqscan node, to force an Index-only scan:


stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
               QUERY PLAN                                                       
---------------------------------------------
Aggregate  (cost=213496.00..213496.01 rows=1 width=4) (actual time=57865.943..57865.946 rows=1 loops=1)
  Output: count(i)
  Buffers: shared hit=2380 read=86116
  ->  Seq Scan on public.lot_of_values  (cost=0.00..188496.00 rows=10000000 width=4) (actual time=0.667..30219.806 rows=10000000 loops=1)
        Output: i, t1, r1, r2, r3, d1
        Buffers: shared hit=2380 read=86116
Total runtime: 57866.166 ms
(7 rows) stuff=# set enable_seqscan=off;
SET
stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
                                                          QUERY PLAN                                    
---------------------------------------------
Aggregate  (cost=351292.03..351292.04 rows=1 width=4) (actual time=64094.544..64094.547 rows=1 loops=1)
  Output: count(i)
  Buffers: shared read=110380
  ->  Index Only Scan using lot_of_values_pkey on public.lot_of_values  (cost=0.00..326292.03 rows=10000000 width=4) (actual time=38.773..35825.761 rows=10000000 loops=1)
        Output: i
        Heap Fetches: 10000000
        Buffers: shared read=110380
Total runtime: 64094.777 ms
(8 rows)



After a Vacuum, the plan changed and the cost drops to 262793.04.

For partitioning, as we expected, this works as well (in this example, we’ll use another table called ‘persons’ with ‘dni’ as PK column):


coches=# explain (analyze true, costs true, buffers true, timing true, verbose true)  select dni from persons where dni between 2100111 and 2110222;
Result  (cost=0.00..168.62 rows=22 width=8) (actual time=61.468..61.468 rows=0 loops=1)
  Output: persons.dni
  Buffers: shared hit=43 read=1
  ->  Append  (cost=0.00..168.62 rows=22 width=8) (actual time=61.442..61.442 rows=0 loops=1)
        Buffers: shared hit=43 read=1
        ->  Seq Scan onpersons  (cost=0.00..0.00 rows=1 width=8) (actual time=0.156..0.156 rows=0 loops=1)
              Output:persona.dni
              Filter: (((persona.dni)::bigint >= 2100111) AND ((persona.dni)::bigint <= 2110222))
        ->  Index Only Scan using persons_200_pkey on persons_200 persons  (cost=0.00..8.38 rows=1 width=8) (actual time=0.405..0.405 rows=0 loops=1)
              Output:persona.dni
              Index Cond: ((persons.dni >= 2100111) AND (persons.dni <= 2110222))
              Heap Fetches: 0
              Buffers: shared hit=5

…. LOT OF PARTITIONS ….
Total runtime: 11.045 ms
(114 rows)


Conclusion: this feature adds one of the most exciting performance improvements in Postgres. We see improvements at as much as 30% so far, and look forward to seeing how this scales as 9.2 becomes production-ready.


Automatically Bring Up A New Mongod Node in AWS

One of the most desirable features of MongoDB is its ability to automatically recover from a failed node.  Setting this up with AWS instances can raise challenges, depending on how you're addressing your machines.  We see many examples online of people using IP addresses in their configurations, like this:
cfg = {
    _id : "rs_a",
    members : [
        {_id : 0, host : "10.2.3.4", priority : 1},
        {_id : 1, host : "10.2.3.5", priority : 1},
        {_id : 2, host : "10.2.3.6", priority : 0}
    ]
}
Which is fine for giving examples, or maybe creating a single cluster that you control the hardware.  But in the ephemeral world we live in with virtual instances where IPs change constantly, this is just not feasible.  You need to use hostnames, and have DNS set up.  Well, that's easier said than done, and so I've tried to come up with a proof of concept of automatically assigning hostnames for new instances in AWS. I hope that some of my experiences will be valuable in setting up your MongoDB clusters.
Here is the scenario:  a replication set with three nodes set up in AWS using ec2 instances.  One of the nodes goes down.  You want to be able to bring that same node back - or a replacement node - with a minimum of configuration changes.  To make this happen, you need to prepare for this event beforehand.  All of this is done on Amazon's default OS, which is based on CentOS.
The MongoDB pages on setting up instances in AWS here and here
are very valuable in setting up most of it, however the details of how to set up hostnames could use some elaboration.  For one thing, it doesn't really help to use the public domain addresses into your zone files - if you bring up a new instance, you'll get a new public domain name, and you'll need to go back into your zone file and update it.  I wanted to minimize the reconfiguration necessary when bringing a new instance up to replace one that went down.
My next reference is Marius Ducea's very helpful article How To update DNS hostnames automatically for your Amazon EC2 instances. One of the great things I discovered here was the really useful tool Amazon provides to get info about the instance that you're on.  Basically, curl http://169.254.169.254/latest/meta-data/ to get a list of available parameters, and then add that parameter on to the end of the url to get its value.  I made a little bash function that I used often while going through this:
------
meta () {
  curl http://169.254.169.254/latest/meta-data/$1
  echo
}
------
Stick that in your .bashrc or similar file, and then "meta" will list all parameters, and e.g. "meta local-ipv4" will give you your local ip.
The first thing you want to set up is your DNS server.  Of course you must have redundancy, with at least a primary and a backup, spread across different regions.  For the purposes of this demonstration, however, I just had a single DNS server.  Obviously, do not do this in production!

Here are the significant changes when setting up your bind server:
1.  In the {{options}} block, ensure your "listen-on" block will work for all servers who need DNS.  In my case, I set it to "any".  You may need to do something more restrictive.
2. {{allow-recursion}} also needs to be set for the same servers.
3. As in Ducea's article, add your keys and your key blocks.  You'll need to copy these keys to your instance, but be careful with leaving copies around, and ensure the keys' permissions are as strict as possible.
4. Add blocks for your internal and external zones.  You can probably get by with just an internal zone, but you may find it more convenient to have both zones.
5. Each of the zones will need to reference the key.
6. The purpose of the "control" block in there is so that I can use rndc to control my named server.  Although not technically necessary, it becomes useful if you need to edit your zone files after any dynamic updates have been made.  You'll need to run "rndc freeze" before, and "rndc thaw", otherwise the hanging .jnl files will make named complain.  You'll also need to add an /etc/rndc.conf file, which has at a minimum a "key" and an "options" block.  See here for details on setting it up on CentOS systems.
/etc/named.conf:
-------
//
// named.conf
//
// Provided by Red Hat bind package to configure the ISC BIND named(8) DNS
// server as a caching only nameserver (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
options {
// listen-on port 53 { 127.0.0.1; };
listen-on port 53 { any; };
listen-on-v6 port 53 { ::1; };
directory "/var/named";
dump-file "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
allow-query     { localhost; };
allow-recursion     { any; };
recursion yes;
dnssec-enable yes;
dnssec-validation yes;
dnssec-lookaside auto;
/* Path to ISC DLV key */
bindkeys-file "/etc/named.iscdlv.key";
};
controls {   
inet 127.0.0.1 allow { localhost; } 
keys { palomino.mongo.; }; 
};
key palomino.mongo. {
algorithm HMAC-MD5;
secret "SEcreT FRom Your Generated Key file";
};
logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};
zone "." IN {
type hint;
file "named.ca";
};
include "/etc/named.rfc1912.zones";
zone "int.palomino.mongo" IN {
  type master;
  file "int.palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; }; // this should be local network only
  //allow-transfer { 10.160.34.184; };
};
zone "palomino.mongo" IN {
  type master;
  file "palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; };
  //allow-transfer { 10.160.34.184; };
};
-------
In this example, I chose an obviously not public domain name to avoid confusion.  A feature not in this example, but you will need, are allow-transfer lines for dealing with communication with your other nameserver(s).  Also note, that in this case I used the same keys for both dynamic updates from the other hosts and for rdnc updates.  You may decide to do otherwise.  Finally, you may want to add reverse-lookup zones as well.
External Zone File: palomino.mongo.zone 
------
$ORIGIN .
$TTL 86400 ; 1 day
palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
2012032919 ; serial
21600      ; refresh (6 hours)
3600       ; retry (1 hour)
604800     ; expire (1 week)
86400      ; minimum (1 day)
)
NS dns1.palomino.mongo.
A local.ip.of.this.instance
$ORIGIN palomino.mongo.
$TTL 60 ; 1 minute
dns1 A public.ip.of.this.instance
-------
Internal Zone File: int.palomino.mongo.zone 
-------
$ORIGIN .
$TTL 86400 ; 1 day
int.palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
2012032917 ; serial
21600      ; refresh (6 hours)
3600       ; retry (1 hour)
604800     ; expire (1 week)
86400      ; minimum (1 day)
)
NS dns1.palomino.mongo.
A local.ip.of.this.instance
$ORIGIN int.palomino.mongo.
$TTL 60 ; 1 minute
dns1 A local.ip.of.this.instance
-------
You don't need to put any other hosts in here but your DNS server(s).  All the other hosts will get in there dynamically.  The local ip of this DNS server is what you'll need to bootstrap your DNS when you start your other hosts.  Start your DNS server.
Next, for each mongod host, I used a script similar to Ducea's, with the exception that it will take the local ip of the DNS server in addition to the hostname.  Also, I insert the DNS server's IP into the /etc/resolv.conf.  This is probably a bit of a hack.  I don't doubt there's a more efficient way to do this.  In this example, I am setting the user-data of the ec2 instance to "host:DNS-local-ip", so, for example --user-data "ar1:10.1.2.3".  Hopefully this will make more sense later.
------
update_dns.sh:
------
#!/bin/bash
USER_DATA=`/usr/bin/curl -s http://169.254.169.254/latest/user-data`
HOSTNAME=`echo $USER_DATA | cut -d : -f 1`
DNS_IP=`echo $USER_DATA | cut -d : -f 2`
line="nameserver $DNS_IP"
# this should skip if there is no DNS_IP sent
if  ! grep "$line" /etc/resolv.conf 1>/dev/null
then
    sed -i"" -e "/nameserver/i$line" /etc/resolv.conf
fi
DNS_KEY=/etc/named/Kpalomino.mongo.+157+29687.private
DOMAIN=palomino.mongo
#set also the hostname to the running instance
hostname $HOSTNAME.$DOMAIN
PUBIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/public-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone $DOMAIN
update delete $HOSTNAME.$DOMAIN A
update add $HOSTNAME.$DOMAIN 60 A $PUBIP
send
EOF
LOCIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/local-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone int.$DOMAIN
update delete $HOSTNAME.int.$DOMAIN A
update add $HOSTNAME.int.$DOMAIN 60 A $LOCIP
send
EOF
--------
This file can be anywhere, I happened to put it in /etc/named, where I also put the keys.  This will be run by root, and contains secrets, so you should chmod 700 those three files.
The Mongo instances:
I used the excellent guide on the MongoDB site to set up a single instance. Go through many of the steps the entire process, including Launching an Instance, Configure Storage, and up to Install and Configure MongoDB.  Now, for this example, we are only setting up a ReplicaSet. Sharding will be done at another time.  The main thing we want to do here is to be able to create a generic image that you can kick off and automatically set its DNS and connect to the rest of the Replica Set, with as little human interaction as possible.
As in the example given, in /etc/mongod.conf, the only changes I made were:
fork = true
dbpath=/data
replSet = rs_a
So that when this mongod starts up, it will already be configured to be a member of rs_a.  Some other changes that can be made here would be to generate these changes dynamically from user-data sent to the instance, similar to the dns update script above, but for this case, I left it hard-coded.
You can go ahead and start up mongod, connect to it and ensure it functions.  The final piece of the puzzle is to ensure our dns update script runs when the host starts up.  At first, I had it running as part of /etc/rc.local, as in Ducea's page.  However, I soon found out that means that DNS would then get updated _after_ mongod has already started, and then mongod couldn't find any of the new hosts.
So, to ensure that dns is updated before mongod starts, I added these lines to /etc/init.d/mongod, right at the beginning of the start() function:
-------
start()
{
  echo "Updating dns: "
  /etc/named/update_dns.sh
  echo -n $"Starting mongod: "
--------
At this point, we are almost ready to save our image.  If you did a test and ran mongod, your /data directory will already have journal and/or other files in there, which might have data in them we don't necessarily want on startup.  So, stop mongod and remove all your data:
-----
service mongod stop
cd /data
rm -rf *
-------
Now your instance is ready to make an image.  You can stop it now (otherwise it would reboot when you create the image).  All you need now is the id of your instance.  Create your image:
ec2-create-image <instance-id> --name mongod-replset-server --description "Autoboot image of mongod server"
Once you get an image id from that, we can very easily set up our replica set.  All we need to bootstrap this is the local IP of your DNS server.  For this example, my hostnames will be ar1.palomino.mongo, ar2.palomino.mongo, and ar3.palomino.mongo (public), and ar1.int.palomino.mongo, ar2.int.palomino.mongo, and ar3.int.palomino.mongo .  In this example, my DNS server's local IP address is 10.160.121.203
[mgross@dev ~]$ for i in 1 2 3
> do
> ec2-run-instances <ami-image-id> -n 1 -k <your-keypair-name> -d ar$i:10.160.121.203 -t m1.large -z <your-availability-zone> -g <your-security-group-id>
> done
This will launch 3 instances of the mongod server, which are expecting to be in a replset named "rs_a" (from the setting we added to /etc/mongod.conf above).  When they start up, right before mongod starts, they'll set the hostnames provided, connect to the DNS server and set their public and private IPs.
Now if you use the nameserver that you set up, you should be able to connect with one of the hostnames, for example ar1.palomino.mongo
mongo ar1.palomino.mongo
set up your replica set:
> rs.config()
null
> cfg = {
...     _id : "rs_a",
...     members : [
...         {_id : 0, host : "ar1.int.palomino.mongo", priority : 1},
...         {_id : 1, host : "ar2.int.palomino.mongo", priority : 1},
...         {_id : 2, host : "ar3.int.palomino.mongo", priority : 0}
...     ]
... }
{
"_id" : "rs_a",
"members" : [
{
"_id" : 0,
"host" : "ar1.int.palomino.mongo",
"priority" : 1
},
{
"_id" : 1,
"host" : "ar2.int.palomino.mongo",
"priority" : 1
},
{
"_id" : 2,
"host" : "ar3.int.palomino.mongo",
"priority" : 0
}
]
}
> rs.initiate(cfg);
{
"info" : "Config now saved locally.  Should come online in about a minute.",
"ok" : 1
}
> rs.status();
{
"set" : "rs_a",
"date" : ISODate("2012-03-30T04:20:49Z"),
"myState" : 2,
"members" : [
{
"_id" : 0,
"name" : "ar1.int.palomino.mongo:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"self" : true
},
{
"_id" : 1,
"name" : "ar2.int.palomino.mongo:27017",
"health" : 1,
"state" : 6,
"stateStr" : "UNKNOWN",
"uptime" : 2,
"optime" : {
"t" : 0,
"i" : 0
},
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
"pingMs" : 0,
"errmsg" : "still initializing"
},
{
"_id" : 2,
"name" : "ar3.int.palomino.mongo:27017",
"health" : 1,
"state" : 5,
"stateStr" : "STARTUP2",
"uptime" : 4,
"optime" : {
"t" : 0,
"i" : 0
},
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
"pingMs" : 0,
"errmsg" : "initial sync need a member to be primary or secondary to do our initial sync"
}
],
"ok" : 1
}
PRIMARY> rs.status();
{
"set" : "rs_a",
"date" : ISODate("2012-03-30T04:21:46Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "ar1.int.palomino.mongo:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"self" : true
},
{
"_id" : 1,
"name" : "ar2.int.palomino.mongo:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 59,
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
"pingMs" : 0
},
{
"_id" : 2,
"name" : "ar3.int.palomino.mongo:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 61,
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
"pingMs" : 0
}
],
"ok" : 1
}
PRIMARY> 
--------
Now, to demonstrate our automatic new host configuration, let's completely take out one of our hosts and replace it with a brand new instance:
[root@ar1 ~]# shutdown -h now
We can see that ar2 has been elected primary:
[ec2-user@ar2 ~]$ mongo
MongoDB shell version: 2.0.4
connecting to: test
PRIMARY> rs.status()
{
"set" : "rs_a",
"date" : ISODate("2012-03-30T04:25:24Z"),
"myState" : 1,
"syncingTo" : "ar1.int.palomino.mongo:27017",
"members" : [
{
"_id" : 0,
"name" : "ar1.int.palomino.mongo:27017",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:23:38Z"),
"pingMs" : 0,
"errmsg" : "socket exception"
},
{
"_id" : 1,
"name" : "ar2.int.palomino.mongo:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"self" : true
},
{
"_id" : 2,
"name" : "ar3.int.palomino.mongo:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 267,
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:25:24Z"),
"pingMs" : 14
}
],
"ok" : 1
}
--------
Now, just spin up another instance that will replace ar1:
> ec2-run-instances ami-cda2fa88 -n 1 -k engineering-west -d ar1:10.160.121.203 -t m1.large -z us-west-1b -g mm-database
With no further action, your instance will acquire the new hostname, get polled by the new primary ar2, and become part of the Replica Set again:
------
PRIMARY> rs.status()
{
"set" : "rs_a",
"date" : ISODate("2012-03-30T04:34:09Z"),
"myState" : 1,
"syncingTo" : "ar1.int.palomino.mongo:27017",
"members" : [
{
"_id" : 0,
"name" : "ar1.int.palomino.mongo:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 77,
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
"pingMs" : 0
},
{
"_id" : 1,
"name" : "ar2.int.palomino.mongo:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"self" : true
},
{
"_id" : 2,
"name" : "ar3.int.palomino.mongo:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 792,
"optime" : {
"t" : 1333081239000,
"i" : 1
},
"optimeDate" : ISODate("2012-03-30T04:20:39Z"),
"lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
"pingMs" : 19
}
],
"ok" : 1
}
PRIMARY> 
----------
Now, this exercise had a lot missing from it:  most notably multiple DNS servers for redundancy, and sharding.  But I hope that this was helpful in seeing how you can use dns servers to make failover of a mongodb host a little closer to automatic.  I've run this scenario several times, and I did have some issues the first time I was setting up the replica set on the cluster.  Once that got going, though, the failover and promotion of the brand new host seemed to work fine.  I welcome your comments.

Avoid Gotchas While Starting up with MongoDB

 

When starting to work with a new technology in a development or sandbox environment, we tend to run things  -  as much as possible - using their default settings.  This is completely understandable, as we're not familiar with what all of the options are.  With MongoDB, there are some important issues you might face if you leave your setup in the default configuration.  Here are some of our experiences getting started trying out MongoDB on virtual machines both locally and on some Amazon EC2 instances.

 

Versions

Ensure that the version of MongoDB that you're using has the features you're going to use.  For example, if you installed the "mongodb" package from the default debian repositories as of this writing (v 1.4.4) , you'd find that some of the commands used for sharding are available, but sharding isn't actually supported in that version, so this incompatibility is masked.  Ensure you get the latest (as of this writing, it's 2.0.2) by adding  downloads-distro.mongodb.org/repo/ubuntu-upstart to your repos and installing the package named mongodb-10gen.

Check that your OS version is compatible with what you're going to use it for.  MongoDB recommends a 64-bit OS due to file-size limitations in 32-bit OS's.

 

Why is it taking so long to start up?

If you're on a 64-bit install, version 1.9.2+, journaling is enabled by default.  MongoDB may decide it needs to preallocate the journal files, and it waits until the files are allocated before it starts listening on the default port.  This could be on the order of tens of minutes. You definitely don't need this option if you're just kicking the tires and trying to see what mongodb can do.  Restart with --nojournal.

  • What type of filesystem is your database directory mounted on?

A lot of the popular Linux AMIs available in Amazon's list have the ext3 filesystem mounted by default.  It's recommended to use ext4 or xfs filesystems for your database directory, due to the file allocation speed.  This is especially noticeable if MongoDB starts allocating journal files, as in the above. If you're using an AWS instance, you'll avoid this problem if you set up a RAID10 filesystem for your data directory, as shown here.

 

Disk Space

Another issue is disk space.  If you leave settings to their default and you're on a VM or a machine with limited disk space, you could very well start hitting your limits soon.  Even if you are starting up a configserver, it will end up taking up another 3GB if you're not careful.  Our recommendation is to use the --smallfiles flag as you're starting, stopping, and configuring, until you figure out what you're doing. As an example, we followed this page to create a sharded database on a debian VM with about 16GB of disk space, and it quickly ballooned to this:

moss@moss-debian:~/mongodb$ du -h .
3.1G ./a/journal
204K ./a/moveChunk/test.data
208K ./a/moveChunk
4.0K ./a/_tmp
3.3G ./a
3.1G ./b/journal
4.0K ./b/_tmp
3.3G ./b
3.1G ./config/journal
4.0K ./config/_tmp
3.3G ./config
9.7G .

 

Bottom line: use "--smallfiles" in your command line flags or in your /etc/mongodb.conf files until you are actually running in an environment that has the required disk space.

 

Splitting and Balancing

As Jeremy Zawodny outlines in his excellent blog post “MongoDB Pre-Splitting for Faster Data Loading and Importing”, it is important to understand how MongoDB manages shards. By default, documents are grouped into 200MB chunks which are mapped to shards, and then moves those chunks between shards as the balancer attempts to manage the load. If you’re doing a large data migration, however, this can be tricky. Check Jeremy’s post for some great advice on pre-splitting while maintaining acceptable performance levels.

 

Spaces in Configuration Files

It's not in the documentation anywhere, but another gotcha was in a configuration file - for example a line like this, with multiple values for a single parameter:

configdb = ip-10-172-31-109.us-west-1.compute.internal,ip-10-170-209-104.us-west-1.compute.internal,ip-10-172-169-38.us-west-1.compute.internal

If you have spaces before or after the ",", the setting will not parse.  Just ensure it's a single string with no spaces as it is above.

When is MongoDB the right choice for your business? We explore detailed use cases.

As part of PalominoDB’s “Mongo Month”, we’re reviewing use cases for where we’d recommend choosing MongoDB and where we’d consider other technologies.

Because every environment and architecture is different, and every business has unique needs, we work carefully with clients to find the best solution for the particular challenges they face. Sometimes MongoDB or one of the other open-source technologies we build and support is most appropriate; sometimes an RDBMS is most appropriate; and often a hybrid environment makes the most sense for a client’s specific needs.

Our partners at 10gen lay out the more typical use cases, and we find there are often additional factors to consider, such as:

  • Risk tolerance for bugs and unmapped behaviors
  • Availability requirements 
  • Access- and location-based requirements
  • Security requirements
  • Existing skill sets and tooling
  • Existing architecture and infrastructure
  • Growth expectations and the timeline therein
  • Support? Community? Start-up? Enterprise Class?

Below, we’ll review some of the specific use cases our clients face, and we’ll explore how MongoDB and/or other technologies might address these most appropriately.

 

Archiving

Craigslist is one of the most famous implementations of MongoDB for archiving - in this case, old posts.  The schemaless component makes it easy for the datastore to grow as the data model evolves.  Because MongoDB does not offer some features, such as compression, that other tools like Cassandra offer natively, Craigslist has created patterns for workarounds to address issues such as presplitting data to avoid chunks migrating to various shards or using ZFS with LZO compression.

MongoDB and Cassandra both suit this use case well.  The choice in a situation like this is often determined by in-house skillsets and preferences, as well as the actual size and amount of data that needs to be managed (stay tuned for a future post about the complexities of data management at various stages of data volume and scale). Additional considerations for Cassandra (and HBase or any other Java-based DBMS) include JVM management and tuning, which can be quite challenging for operations teams unused to working with this issue.

other solutions:

  • Cassandra

pros: Native compression reduces complexity and growth is easier to manage with new nodes. Cassandra is arguably the better choice for massive scale (both in storage growth and in throughput of inserts) and does multi-datacenter installations well.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Content Management

Schema evolution is a huge win in the content management field.  10gen’s production deployments show numerous successful use cases.  MongoDB’s rich querying capabilities and the document store’s ability to support hierarchical models are all perfect for a CMS. For a great real world example, see the extensive documentation on the Wordnik deployment, and how they manage 1.2TB of data across five billion records.

other solutions:

  • MySQL or PostgreSQL w/caching and read distribution

pros: Skillsets are more readily available, and can support existing tools for managing the RDBMS infrastructure.  Reads are easily scaled in known patterns in the relational database world.

cons: Schema modifications can prove expensive and hard to integrate into publishing workflows.  Failover is not automatic. 

 

Online Gaming

MongoDB works very well with small reads and writes, assuming you manage to keep your working set in RAM.  Compounding that with ease of schema evolution and the replica set functionality for easy read scale and failover creates a solid case to investigate MongoDB.  In fact, this rule can be pushed out to any situation where writes can grow out of hand for a single database instance.  When you find yourself needing to support growth of writes, those writes being small and numerous, you need to ask if you want to a) design the writes away (easier said than done) b) functionally partition the workload or c) shard.  MongoDB’s autosharding is nice, though not perfect - and there are gotchas PalominoDB can assist you with.  Depending on other variables mentioned earlier, MongoDB might be a solid fit for this part of your workload. Disney’s Interactive Media Group offered a great presentation at the annual MongoSV conference on how they use MongoDB in their environment. 

other solutions:

  • MySQL or PostgreSQL with sharding

pros: Skillsets are more readibly available, and can support existing tools for managing the RDBMS infrastructure.  

cons: Games require significant schema modifications in early iterations, and these can prove expensive and impactful.  Extensive development and QA hours and increased complexity come hand in hand with writing your own sharding.

  • Cassandra

pros: Skillsets are more readibly available, and can support existing tools for managing the RDBMS infrastructure.  Reads are easily scaled in known patterns in the relational database world.

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Log Centralization

Asynchronous (and speedy!) writes, capped collections for space management and the flexibility of a schemaless database (since attributes in logs tend to pop up like mushrooms after a rainstorm) are often cited as key benefits for using MongoDB to store logs.  Admittedly, one could build a queue to push data asynchronously into an RDBMS, and partitioning plus a few scripts could duplicate the space management features of a capped collection.

MongoDB and Cassandra both do this well.  Cassandra is arguably the better choice for massive scale and works well in a multi-datacenter environment.  However, MongoDB is much easier to use, manage and query. The size of the client, the skill-set on hand and the availability needs will help us here.

other solutions

  • Percona’s XtraDB w/socket handler and MySQL partitioning, XML datatypes

pros: MySQL familiarity, reuse of MySQL infrastructure (backups, monitoring etc...)

cons: Socket Handler is still somewhat buggy, partitioning requires scripts and XML is not easily queried.

  • Cassandra w/TTL data

pros: multi-datacenter support makes this more viable than MongoDB.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Queue Implementation

MongoDB implements its internal replication using tailable cursors and capped collections, and the same features are useful to build simple persistent network-based queuing (distributed message-passing) implementations rather than using a “proper” queueing package. One such implementation is described in detail on Captain Codeman's Blog. Building your own queueing mechanism on top of a DBMS can be suboptimal, but one organization did so because they already had MongoDB expertise on-staff and had difficult performance problems with ActiveMQ.

other solutions

  • ActiveMQ, RabbitMQ. 

pros: proper queueing solutions. Known and documented problems or solutions.

cons: more brittle, more difficult to set up, and less performant if your queueing needs are extremely simple.

 

In summary, MongoDB, either on its own or in a hybrid environment with other technologies, is a wonderful choice for many of the most common use cases and business challenges our clients face. Helping clients make these complex decisions, and working together on the installation, management and optimization of these tools is our core business, and we encourage you to get in touch if you’d like to explore using MongoDB in your own environment.

Website by Digital Loom