Blog

The Postgres-XC 1.0 beta release overview

 

When I heard about this project a year ago, I was really excited about it. Many cluster-wide projects based on Postgres were developed very slowly, based on older (i.e. Postgres-R http://www.postgres-r.org/) or proprietary (i.e. Greenplum) versions. The features that this project hoped to achieve were ambitious, as we’ll detail in this post. And best of all - this project is based on the 9.1 Postgresql version, which is really up-to-date (at the moment of writing this post, this is the last stable version).

If you are interested in a serious project for scaling horizontally your PostgreSQL architecture, you may visit the official website at http://postgres-xc.sourceforge.net/ and take a look. 

For those who are interested, there will be a tutorial at PgCon this year.  As a brief overview, I will try to give you a broad idea for those who want to get involved in the project.

 

What Postgres-XC can do:

  • Support multi-master architecture. Data nodes can contain part or all of the data of a relationship. 
  • Transparent view to application from any master. The application only needs to interact with the cluster through coordinators.
  • Distribute/Replicate per relation (replication, round robin (by default if any unique column is specified), by hash (by default if a unique is specified), by modulo or a set to a group or node)
  • Parallel transaction execution among cluster nodes.

 

What Postgres-XC cannot do:

  • Support triggers (may be supported in future releases).
  • Distribute a table with more than one parent. 

 

Before you start:

  You need to install the most recent versions of Flex and Bison. That’s important because in the last tarball, ‘./configure’ won’t raise error if they are missing, and the error will be prompted once you execute ‘make’. You will need readline-dev and zlib-dev (not mandatory but strongly recommended).

According to the documentation, Postgres-XC should be compatible with Linux platforms based upon Intel x86_64 CPU. 

The documentation needs to be improved, so we advise you to try the steps directly and read the help prompted by the commands. For example, the initdb command in the documentation is incomplete, “--nodename” is mandatory in this version. This project is new and has only a few contributors to date, but we hope its community keeps growing. Most importantly, it is great that a beta release was launched earlier than we expected.

 

Elements of architecture

 

  • GTM (Global Transaction Manager)

+ Realize that I say only GTM, not GTMs. Only one GTM can be the manager. For redundancy, you have GTM-Standby and to improve performance and failover GTM-Proxies.

+ The GTM serializes all the transaction processing and can limit the whole scalability if you implement it primitively. This should not be used in slow/wide networks and is recommended to involve the fewest number of switches between GTM and coordinators. The proxies reduce the iteration with the GTM and improve the performance. 

+ Uses MVCC technology. That means that it will still use the same control for the concurrency as Postgres.

+ This is the first thing you will need to configure. If you set up everything in the same machine, you will need to create a separate folder for the configuration and files.

 

  • Coordinator/s

+ Interface for applications (like a Postgres backend). 

+ It has its own pooler (yes, and I think this is great, avoiding more complexity in big environments).

+ Doesn’t store any data. The queries are executed in the datanodes, but...

+ … it has its own data folder (for global catalogs).

  • Datanode/s

+ Stores the data.

+ It receives the petition with a GXID (Global Transaction ID) and Global Snapshot to allow requests from several coordinators.

Both Datanodes and Coordinator use their own data directory, so keep this in mind this if you are setting both up on the same machine. 

Configuring several GTM-Proxies will improve the scalability, shrinking the I/O in the GTM. Plus, you can configure the GTM-Standby to avoid a SPOF of the general manager. It not only provides the GXID, it also receives the node registration (you can trace your log or check the file inside the gtm folder called register.node, it’s binary but is readable) and most importantly, it holds the snapshot of the current status of all the transactions.

Coordinators can point to all the datanodes and can point to the same datanode (as Oracle RAC, but we’re not sure if all the features included in that solution will be available for Postgres-XC). Coordinators connect to the proxies (if you have already configured them) or the main GTM.

Hope you enjoyed the read. We are preparing more cool stuff about this amazing solution, keep in touch!

Exploring a new feature of 9.2: Index-only scans

We, like other Postgres DBAs worldwide, have been waiting for the 9.2 release for some time, specifically for the index-only scan feature, which will help reduce I/O by preventing unnecessary access to heap data if you only need data from the index.

Besides 9.2 is still in development, it is possible to download a version for testing at http://www.postgresql.org/download/snapshots/ . It's important to note that it doesn’t add new behaviours, but improves the way that indexes are used.

How can we test this feature? We created a ‘big’ table starting with 10 million+ records with random values.

  • Extract a few elements, using a where clause.
  • Use aggregations.
  • Use partitioning plus index only scans.



When is this feature most useful?:

  • When you select only the columns that are specified in the index definition,  including those which are at the condition part of the query.
  • If a vacuum was executed previously. Thus happens because the scan can skip the “heap fetch” if the TID references a heap [table] page on which all tuples are known visible to everybody (src/backend/executor/nodeIndexonlyscan.c).


So, the main table is:


CREATE TABLE lot_of_values AS SELECT i, clock_timestamp() t1, random() r1, random() r2, random() r3, clock_timestamp() + (round(random()*1000)::text || ' days')::interval d1 from generate_series(1,10000000) i(i);
ALTER TABLE lot_of_values ADD PRIMARY KEY(i);

CREATE INDEX CONCURRENTLY ON lot_of_values (d1);



Something interesting: due to some improvements in write performance, we realized that 9.2 demonstrated better timing compared with 9.1.3 (~200k in 9.1, ~170k ms on 9.2). The index creation was slightly better on 9.2.

The table will contain data like this:

stuff=# \x
Expanded display is on.
stuff=# select * from lot_of_values limit 1;
-[ RECORD 1 ]---------------------
i  | 1
t1 | 2012-04-18 08:37:14.426624+00
r1 | 0.571268450468779
r2 | 0.222371176816523
r3 | 0.72282966086641
d1 | 2012-08-17 08:37:14.426713+00


Ok, let’s start with some examples. As we previously explained, we need to specify columns that are only in the index. You can’t use columns from 2 different indexes. The next example is a clear fail:


stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
             QUERY PLAN                                
---------------------------------------------
Seq Scan on lot_of_values  (cost=0.00..263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)

stuff=# set enable_seqscan=off;
SET
stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
                                      QUERY PLAN                                       
---------------------------------------------
Seq Scan on lot_of_values  (cost=10000000000.00..10000263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)



We don’t have indexes at r1 and it isn’t part of the index!

The next example is another fail, using a column that is defined in another index or directly not defined in any index:


stuff=# explain select i,  d1 from lot_of_values where i between 12345 and 23456;
                                         QUERY PLAN                                           
---------------------------------------------
Index Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=12)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



The next example is the correct case:


stuff=# explain select i from lot_of_values where i between 12345 and 23456;
                                           QUERY PLAN                                             
---------------------------------------------
Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=4)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



Also, we can try with a non-pk index:


stuff=# explain select min(d1), max(d1) from lot_of_values ;
                                          QUERY PLAN                                                             
---------------------------------------------
Result  (cost=6.93..6.94 rows=1 width=0)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan Backward using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
(9 rows) stuff=# explain select min(i), max(i), avg(i) from lot_of_values where i between 1234 and 2345;
                                             QUERY PLAN                                               
---------------------------------------------
Aggregate  (cost=66.90..66.91 rows=1 width=4)
  ->  Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..58.21 rows=1159 width=4)
        Index Cond: ((i >= 1234) AND (i <= 2345))
(3 rows)


The aggregation cases are special.  Index-only scans are not useful for count(*) without condition, because the index scan needs to check the visibility of the tuple, which makes it expensive. So, if you need to count the entire table, a sequential scan must be perfomed.

Just for testing purposes, we’ll try to “turn off” the seqscan node, to force an Index-only scan:


stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
               QUERY PLAN                                                       
---------------------------------------------
Aggregate  (cost=213496.00..213496.01 rows=1 width=4) (actual time=57865.943..57865.946 rows=1 loops=1)
  Output: count(i)
  Buffers: shared hit=2380 read=86116
  ->  Seq Scan on public.lot_of_values  (cost=0.00..188496.00 rows=10000000 width=4) (actual time=0.667..30219.806 rows=10000000 loops=1)
        Output: i, t1, r1, r2, r3, d1
        Buffers: shared hit=2380 read=86116
Total runtime: 57866.166 ms
(7 rows) stuff=# set enable_seqscan=off;
SET
stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
                                                          QUERY PLAN                                    
---------------------------------------------
Aggregate  (cost=351292.03..351292.04 rows=1 width=4) (actual time=64094.544..64094.547 rows=1 loops=1)
  Output: count(i)
  Buffers: shared read=110380
  ->  Index Only Scan using lot_of_values_pkey on public.lot_of_values  (cost=0.00..326292.03 rows=10000000 width=4) (actual time=38.773..35825.761 rows=10000000 loops=1)
        Output: i
        Heap Fetches: 10000000
        Buffers: shared read=110380
Total runtime: 64094.777 ms
(8 rows)



After a Vacuum, the plan changed and the cost drops to 262793.04.

For partitioning, as we expected, this works as well (in this example, we’ll use another table called ‘persons’ with ‘dni’ as PK column):


coches=# explain (analyze true, costs true, buffers true, timing true, verbose true)  select dni from persons where dni between 2100111 and 2110222;
Result  (cost=0.00..168.62 rows=22 width=8) (actual time=61.468..61.468 rows=0 loops=1)
  Output: persons.dni
  Buffers: shared hit=43 read=1
  ->  Append  (cost=0.00..168.62 rows=22 width=8) (actual time=61.442..61.442 rows=0 loops=1)
        Buffers: shared hit=43 read=1
        ->  Seq Scan onpersons  (cost=0.00..0.00 rows=1 width=8) (actual time=0.156..0.156 rows=0 loops=1)
              Output:persona.dni
              Filter: (((persona.dni)::bigint >= 2100111) AND ((persona.dni)::bigint <= 2110222))
        ->  Index Only Scan using persons_200_pkey on persons_200 persons  (cost=0.00..8.38 rows=1 width=8) (actual time=0.405..0.405 rows=0 loops=1)
              Output:persona.dni
              Index Cond: ((persons.dni >= 2100111) AND (persons.dni <= 2110222))
              Heap Fetches: 0
              Buffers: shared hit=5

…. LOT OF PARTITIONS ….
Total runtime: 11.045 ms
(114 rows)


Conclusion: this feature adds one of the most exciting performance improvements in Postgres. We see improvements at as much as 30% so far, and look forward to seeing how this scales as 9.2 becomes production-ready.


Automatically Bring Up A New Mongod Node in AWS

One of the most desirable features of MongoDB is its ability to automatically recover from a failed node. Setting this up with AWS instances can raise challenges, depending on how you're addressing your machines. We see many examples online of people using IP addresses in their configurations, like this:
cfg = {
    _id : "rs_a",
    members : [
        {_id : 0, host : "10.2.3.4", priority : 1},
        {_id : 1, host : "10.2.3.5", priority : 1},
        {_id : 2, host : "10.2.3.6", priority : 0}
    ]
}
Which is fine for giving examples, or maybe creating a single cluster that you control the hardware. But in the ephemeral world we live in with virtual instances where IPs change constantly, this is just not feasible. You need to use hostnames, and have DNS set up. Well, that's easier said than done, and so I've tried to come up with a proof of concept of automatically assigning hostnames for new instances in AWS. I hope that some of my experiences will be valuable in setting up your MongoDB clusters.
Here is the scenario: a replication set with three nodes set up in AWS using ec2 instances. One of the nodes goes down. You want to be able to bring that same node back - or a replacement node - with a minimum of configuration changes. To make this happen, you need to prepare for this event beforehand. All of this is done on Amazon's default OS, which is based on CentOS.
The MongoDB pages on setting up instances in AWS here and here
are very valuable in setting up most of it, however the details of how to set up hostnames could use some elaboration. For one thing, it doesn't really help to use the public domain addresses into your zone files - if you bring up a new instance, you'll get a new public domain name, and you'll need to go back into your zone file and update it. I wanted to minimize the reconfiguration necessary when bringing a new instance up to replace one that went down.
My next reference is Marius Ducea's very helpful article How To update DNS hostnames automatically for your Amazon EC2 instances. One of the great things I discovered here was the really useful tool Amazon provides to get info about the instance that you're on. Basically, curl http://169.254.169.254/latest/meta-data/ to get a list of available parameters, and then add that parameter on to the end of the url to get its value. I made a little bash function that I used often while going through this:
meta () {
  curl http://169.254.169.254/latest/meta-data/$1
  echo
}
Stick that in your .bashrc or similar file, and then "meta" will list all parameters, and e.g. "meta local-ipv4" will give you your local ip. The first thing you want to set up is your DNS server. Of course you must have redundancy, with at least a primary and a backup, spread across different regions. For the purposes of this demonstration, however, I just had a single DNS server. Obviously, do not do this in production! Here are the significant changes when setting up your bind server: 1. In the {{options}} block, ensure your "listen-on" block will work for all servers who need DNS. In my case, I set it to "any". You may need to do something more restrictive. 2. {{allow-recursion}} also needs to be set for the same servers. 3. As in Ducea's article, add your keys and your key blocks. You'll need to copy these keys to your instance, but be careful with leaving copies around, and ensure the keys' permissions are as strict as possible. 4. Add blocks for your internal and external zones. You can probably get by with just an internal zone, but you may find it more convenient to have both zones. 5. Each of the zones will need to reference the key. 6. The purpose of the "control" block in there is so that I can use rndc to control my named server. Although not technically necessary, it becomes useful if you need to edit your zone files after any dynamic updates have been made. You'll need to run "rndc freeze" before, and "rndc thaw", otherwise the hanging .jnl files will make named complain. You'll also need to add an /etc/rndc.conf file, which has at a minimum a "key" and an "options" block. See here for details on setting it up on CentOS systems. /etc/named.conf:
//
// named.conf
//
// Provided by Red Hat bind package to configure the ISC BIND named(8) DNS
// server as a caching only nameserver (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
options {
// listen-on port 53 { 127.0.0.1; };
 listen-on port 53 { any; };
 listen-on-v6 port 53 { ::1; };
 directory  "/var/named";
 dump-file  "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";
 allow-query     { localhost; };
 allow-recursion     { any; };
 recursion yes;
 dnssec-enable yes;
 dnssec-validation yes;
 dnssec-lookaside auto;
 /* Path to ISC DLV key */
 bindkeys-file "/etc/named.iscdlv.key";
};
controls {   
 inet 127.0.0.1 allow { localhost; } 
 keys { palomino.mongo.; }; 
};
key palomino.mongo. {
algorithm HMAC-MD5;
secret "SEcreT FRom Your Generated Key file";
};
logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};
zone "." IN {
 type hint;
 file "named.ca";
};
include "/etc/named.rfc1912.zones";
zone "int.palomino.mongo" IN {
  type master;
  file "int.palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; }; // this should be local network only
  //allow-transfer { 10.160.34.184; };
};
zone "palomino.mongo" IN {
  type master;
  file "palomino.mongo.zone";
  allow-update { key palomino.mongo.; };
  allow-query     { any; };
  //allow-transfer { 10.160.34.184; };
};
------- In this example, I chose an obviously not public domain name to avoid confusion. A feature not in this example, but you will need, are allow-transfer lines for dealing with communication with your other nameserver(s). Also note, that in this case I used the same keys for both dynamic updates from the other hosts and for rdnc updates. You may decide to do otherwise. Finally, you may want to add reverse-lookup zones as well. External Zone File: palomino.mongo.zone
$ORIGIN .
$TTL 86400 ; 1 day
palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
 2012032919 ; serial
 21600      ; refresh (6 hours)
 3600       ; retry (1 hour)
 604800     ; expire (1 week)
 86400      ; minimum (1 day)
 )
 NS dns1.palomino.mongo.
 A local.ip.of.this.instance
$ORIGIN palomino.mongo.
$TTL 60 ; 1 minute
dns1 A public.ip.of.this.instance
Internal Zone File: int.palomino.mongo.zone
$ORIGIN .
$TTL 86400 ; 1 day
int.palomino.mongo IN SOA dns1.palomino.mongo. mgross.palomino.mongo. (
 2012032917 ; serial
 21600      ; refresh (6 hours)
 3600       ; retry (1 hour)
 604800     ; expire (1 week)
 86400      ; minimum (1 day)
 )
 NS dns1.palomino.mongo.
 A local.ip.of.this.instance
$ORIGIN int.palomino.mongo.
$TTL 60 ; 1 minute
dns1 A local.ip.of.this.instance
You don't need to put any other hosts in here but your DNS server(s). All the other hosts will get in there dynamically. The local ip of this DNS server is what you'll need to bootstrap your DNS when you start your other hosts. Start your DNS server. Next, for each mongod host, I used a script similar to Ducea's, with the exception that it will take the local ip of the DNS server in addition to the hostname. Also, I insert the DNS server's IP into the /etc/resolv.conf. This is probably a bit of a hack. I don't doubt there's a more efficient way to do this. In this example, I am setting the user-data of the ec2 instance to "host:DNS-local-ip", so, for example --user-data "ar1:10.1.2.3". Hopefully this will make more sense later. update_dns.sh:
#!/bin/bash
USER_DATA=`/usr/bin/curl -s http://169.254.169.254/latest/user-data`
HOSTNAME=`echo $USER_DATA | cut -d : -f 1`
DNS_IP=`echo $USER_DATA | cut -d : -f 2`
line="nameserver $DNS_IP"
# this should skip if there is no DNS_IP sent
if  ! grep "$line" /etc/resolv.conf 1>/dev/null
then
    sed -i"" -e "/nameserver/i$line" /etc/resolv.conf
fi
DNS_KEY=/etc/named/Kpalomino.mongo.+157+29687.private
DOMAIN=palomino.mongo
#set also the hostname to the running instance
hostname $HOSTNAME.$DOMAIN
PUBIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/public-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone $DOMAIN
update delete $HOSTNAME.$DOMAIN A
update add $HOSTNAME.$DOMAIN 60 A $PUBIP
send
EOF
LOCIP=`/usr/bin/curl -s http://169.254.169.254/latest/meta-data/local-ipv4`
cat<<EOF | /usr/bin/nsupdate -k $DNS_KEY -v
server dns1.int.$DOMAIN
zone int.$DOMAIN
update delete $HOSTNAME.int.$DOMAIN A
update add $HOSTNAME.int.$DOMAIN 60 A $LOCIP
send
EOF
This file can be anywhere, I happened to put it in /etc/named, where I also put the keys. This will be run by root, and contains secrets, so you should chmod 700 those three files. The Mongo instances: I used the excellent guide on the MongoDB site to set up a single instance. Go through many of the steps the entire process, including Launching an Instance, Configure Storage, and up to Install and Configure MongoDB. Now, for this example, we are only setting up a ReplicaSet. Sharding will be done at another time. The main thing we want to do here is to be able to create a generic image that you can kick off and automatically set its DNS and connect to the rest of the Replica Set, with as little human interaction as possible. As in the example given, in /etc/mongod.conf, the only changes I made were:
fork = true
dbpath=/data
replSet = rs_a
So that when this mongod starts up, it will already be configured to be a member of rs_a. Some other changes that can be made here would be to generate these changes dynamically from user-data sent to the instance, similar to the dns update script above, but for this case, I left it hard-coded. You can go ahead and start up mongod, connect to it and ensure it functions. The final piece of the puzzle is to ensure our dns update script runs when the host starts up. At first, I had it running as part of /etc/rc.local, as in Ducea's page. However, I soon found out that means that DNS would then get updated _after_ mongod has already started, and then mongod couldn't find any of the new hosts. So, to ensure that dns is updated before mongod starts, I added these lines to /etc/init.d/mongod, right at the beginning of the start() function:
start()
{
  echo "Updating dns: "
  /etc/named/update_dns.sh
  echo -n $"Starting mongod: "
At this point, we are almost ready to save our image. If you did a test and ran mongod, your /data directory will already have journal and/or other files in there, which might have data in them we don't necessarily want on startup. So, stop mongod and remove all your data:
service mongod stop
cd /data
rm -rf *
Now your instance is ready to make an image. You can stop it now (otherwise it would reboot when you create the image). All you need now is the id of your instance. Create your image:
ec2-create-image <instance-id> --name mongod-replset-server --description "Autoboot image of mongod server"
Once you get an image id from that, we can very easily set up our replica set. All we need to bootstrap this is the local IP of your DNS server. For this example, my hostnames will be ar1.palomino.mongo, ar2.palomino.mongo, and ar3.palomino.mongo (public), and ar1.int.palomino.mongo, ar2.int.palomino.mongo, and ar3.int.palomino.mongo . In this example, my DNS server's local IP address is 10.160.121.203
[mgross@dev ~]$ for i in 1 2 3
> do
> ec2-run-instances <ami-image-id> -n 1 -k <your-keypair-name> -d ar$i:10.160.121.203 -t m1.large -z <your-availability-zone> -g <your-security-group-id>
> done
This will launch 3 instances of the mongod server, which are expecting to be in a replset named "rs_a" (from the setting we added to /etc/mongod.conf above). When they start up, right before mongod starts, they'll set the hostnames provided, connect to the DNS server and set their public and private IPs. Now if you use the nameserver that you set up, you should be able to connect with one of the hostnames, for example ar1.palomino.mongo mongo ar1.palomino.mongo set up your replica set:
> rs.config()
null
> cfg = {
...     _id : "rs_a",
...     members : [
...         {_id : 0, host : "ar1.int.palomino.mongo", priority : 1},
...         {_id : 1, host : "ar2.int.palomino.mongo", priority : 1},
...         {_id : 2, host : "ar3.int.palomino.mongo", priority : 0}
...     ]
... }
{
 "_id" : "rs_a",
 "members" : [
 {
 "_id" : 0,
 "host" : "ar1.int.palomino.mongo",
 "priority" : 1
 },
 {
 "_id" : 1,
 "host" : "ar2.int.palomino.mongo",
 "priority" : 1
 },
 {
 "_id" : 2,
 "host" : "ar3.int.palomino.mongo",
 "priority" : 0
 }
 ]
}
> rs.initiate(cfg);
{
 "info" : "Config now saved locally.  Should come online in about a minute.",
 "ok" : 1
}
> rs.status();
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:20:49Z"),
 "myState" : 2,
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 6,
 "stateStr" : "UNKNOWN",
 "uptime" : 2,
 "optime" : {
 "t" : 0,
 "i" : 0
 },
 "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
 "pingMs" : 0,
 "errmsg" : "still initializing"
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 5,
 "stateStr" : "STARTUP2",
 "uptime" : 4,
 "optime" : {
 "t" : 0,
 "i" : 0
 },
 "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:20:49Z"),
 "pingMs" : 0,
 "errmsg" : "initial sync need a member to be primary or secondary to do our initial sync"
 }
 ],
 "ok" : 1
}
PRIMARY> rs.status();
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:21:46Z"),
 "myState" : 1,
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 59,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
 "pingMs" : 0
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 61,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:21:45Z"),
 "pingMs" : 0
 }
 ],
 "ok" : 1
}
PRIMARY> 
-------- Now, to demonstrate our automatic new host configuration, let's completely take out one of our hosts and replace it with a brand new instance:
[root@ar1 ~]# shutdown -h now
We can see that ar2 has been elected primary:
[ec2-user@ar2 ~]$ mongo
MongoDB shell version: 2.0.4
connecting to: test
PRIMARY> rs.status()
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:25:24Z"),
 "myState" : 1,
 "syncingTo" : "ar1.int.palomino.mongo:27017",
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 0,
 "state" : 8,
 "stateStr" : "(not reachable/healthy)",
 "uptime" : 0,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:23:38Z"),
 "pingMs" : 0,
 "errmsg" : "socket exception"
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 267,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:25:24Z"),
 "pingMs" : 14
 }
 ],
 "ok" : 1
}
Now, just spin up another instance that will replace ar1: > ec2-run-instances ami-cda2fa88 -n 1 -k engineering-west -d ar1:10.160.121.203 -t m1.large -z us-west-1b -g mm-database With no further action, your instance will acquire the new hostname, get polled by the new primary ar2, and become part of the Replica Set again:
PRIMARY> rs.status()
{
 "set" : "rs_a",
 "date" : ISODate("2012-03-30T04:34:09Z"),
 "myState" : 1,
 "syncingTo" : "ar1.int.palomino.mongo:27017",
 "members" : [
 {
 "_id" : 0,
 "name" : "ar1.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 77,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
 "pingMs" : 0
 },
 {
 "_id" : 1,
 "name" : "ar2.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 1,
 "stateStr" : "PRIMARY",
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "self" : true
 },
 {
 "_id" : 2,
 "name" : "ar3.int.palomino.mongo:27017",
 "health" : 1,
 "state" : 2,
 "stateStr" : "SECONDARY",
 "uptime" : 792,
 "optime" : {
 "t" : 1333081239000,
 "i" : 1
 },
 "optimeDate" : ISODate("2012-03-30T04:20:39Z"),
 "lastHeartbeat" : ISODate("2012-03-30T04:34:08Z"),
 "pingMs" : 19
 }
 ],
 "ok" : 1
}
PRIMARY> 
Now, this exercise had a lot missing from it: most notably multiple DNS servers for redundancy, and sharding. But I hope that this was helpful in seeing how you can use dns servers to make failover of a mongodb host a little closer to automatic. I've run this scenario several times, and I did have some issues the first time I was setting up the replica set on the cluster. Once that got going, though, the failover and promotion of the brand new host seemed to work fine. I welcome your comments.

When is MongoDB the right choice for your business? We explore detailed use cases.

As part of PalominoDB’s “Mongo Month”, we’re reviewing use cases for where we’d recommend choosing MongoDB and where we’d consider other technologies.

Because every environment and architecture is different, and every business has unique needs, we work carefully with clients to find the best solution for the particular challenges they face. Sometimes MongoDB or one of the other open-source technologies we build and support is most appropriate; sometimes an RDBMS is most appropriate; and often a hybrid environment makes the most sense for a client’s specific needs.

Our partners at 10gen lay out the more typical use cases, and we find there are often additional factors to consider, such as:

  • Risk tolerance for bugs and unmapped behaviors
  • Availability requirements 
  • Access- and location-based requirements
  • Security requirements
  • Existing skill sets and tooling
  • Existing architecture and infrastructure
  • Growth expectations and the timeline therein
  • Support? Community? Start-up? Enterprise Class?

Below, we’ll review some of the specific use cases our clients face, and we’ll explore how MongoDB and/or other technologies might address these most appropriately.

 

Archiving

Craigslist is one of the most famous implementations of MongoDB for archiving - in this case, old posts.  The schemaless component makes it easy for the datastore to grow as the data model evolves.  Because MongoDB does not offer some features, such as compression, that other tools like Cassandra offer natively, Craigslist has created patterns for workarounds to address issues such as presplitting data to avoid chunks migrating to various shards or using ZFS with LZO compression.

MongoDB and Cassandra both suit this use case well.  The choice in a situation like this is often determined by in-house skillsets and preferences, as well as the actual size and amount of data that needs to be managed (stay tuned for a future post about the complexities of data management at various stages of data volume and scale). Additional considerations for Cassandra (and HBase or any other Java-based DBMS) include JVM management and tuning, which can be quite challenging for operations teams unused to working with this issue.

other solutions:

  • Cassandra

pros: Native compression reduces complexity and growth is easier to manage with new nodes. Cassandra is arguably the better choice for massive scale (both in storage growth and in throughput of inserts) and does multi-datacenter installations well.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Content Management

Schema evolution is a huge win in the content management field.  10gen’s production deployments show numerous successful use cases.  MongoDB’s rich querying capabilities and the document store’s ability to support hierarchical models are all perfect for a CMS. For a great real world example, see the extensive documentation on the Wordnik deployment, and how they manage 1.2TB of data across five billion records.

other solutions:

  • MySQL or PostgreSQL w/caching and read distribution

pros: Skillsets are more readily available, and can support existing tools for managing the RDBMS infrastructure.  Reads are easily scaled in known patterns in the relational database world.

cons: Schema modifications can prove expensive and hard to integrate into publishing workflows.  Failover is not automatic. 

 

Online Gaming

MongoDB works very well with small reads and writes, assuming you manage to keep your working set in RAM.  Compounding that with ease of schema evolution and the replica set functionality for easy read scale and failover creates a solid case to investigate MongoDB.  In fact, this rule can be pushed out to any situation where writes can grow out of hand for a single database instance.  When you find yourself needing to support growth of writes, those writes being small and numerous, you need to ask if you want to a) design the writes away (easier said than done) b) functionally partition the workload or c) shard.  MongoDB’s autosharding is nice, though not perfect - and there are gotchas PalominoDB can assist you with.  Depending on other variables mentioned earlier, MongoDB might be a solid fit for this part of your workload. Disney’s Interactive Media Group offered a great presentation at the annual MongoSV conference on how they use MongoDB in their environment. 

other solutions:

  • MySQL or PostgreSQL with sharding

pros: Skillsets are more readibly available, and can support existing tools for managing the RDBMS infrastructure.  

cons: Games require significant schema modifications in early iterations, and these can prove expensive and impactful.  Extensive development and QA hours and increased complexity come hand in hand with writing your own sharding.

  • Cassandra

pros: Skillsets are more readibly available, and can support existing tools for managing the RDBMS infrastructure.  Reads are easily scaled in known patterns in the relational database world.

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Log Centralization

Asynchronous (and speedy!) writes, capped collections for space management and the flexibility of a schemaless database (since attributes in logs tend to pop up like mushrooms after a rainstorm) are often cited as key benefits for using MongoDB to store logs.  Admittedly, one could build a queue to push data asynchronously into an RDBMS, and partitioning plus a few scripts could duplicate the space management features of a capped collection.

MongoDB and Cassandra both do this well.  Cassandra is arguably the better choice for massive scale and works well in a multi-datacenter environment.  However, MongoDB is much easier to use, manage and query. The size of the client, the skill-set on hand and the availability needs will help us here.

other solutions

  • Percona’s XtraDB w/socket handler and MySQL partitioning, XML datatypes

pros: MySQL familiarity, reuse of MySQL infrastructure (backups, monitoring etc...)

cons: Socket Handler is still somewhat buggy, partitioning requires scripts and XML is not easily queried.

  • Cassandra w/TTL data

pros: multi-datacenter support makes this more viable than MongoDB.  

cons: These will vary by use case, and will be addressed in a further post.  A generalization is that MongoDB is easier to setup and manage in smaller environments or for companies constrained by resources.

 

Queue Implementation

MongoDB implements its internal replication using tailable cursors and capped collections, and the same features are useful to build simple persistent network-based queuing (distributed message-passing) implementations rather than using a “proper” queueing package. One such implementation is described in detail on Captain Codeman's Blog. Building your own queueing mechanism on top of a DBMS can be suboptimal, but one organization did so because they already had MongoDB expertise on-staff and had difficult performance problems with ActiveMQ.

other solutions

  • ActiveMQ, RabbitMQ. 

pros: proper queueing solutions. Known and documented problems or solutions.

cons: more brittle, more difficult to set up, and less performant if your queueing needs are extremely simple.

 

In summary, MongoDB, either on its own or in a hybrid environment with other technologies, is a wonderful choice for many of the most common use cases and business challenges our clients face. Helping clients make these complex decisions, and working together on the installation, management and optimization of these tools is our core business, and we encourage you to get in touch if you’d like to explore using MongoDB in your own environment.

Considerations about text searchs in big fields and planner costs

Imagine the following scenario: you have a table with a small set of records with columns containing  tsvector and text data type. But, text fields has almost 20 megabytes of text or more.

So, what happens? Postgres planner checks the amount of reltuples (tuples) in the relation to calculate the cost of data extraction. If the amount of tuples reached by the query is higher than the 20% of the total result set or the amount of blocks in the relation is too small, then the planner will choose sequential scan, otherwise random access will be the chosen one.

But, in that case we face a problem. The amount of rows and blocks in the relation is not the “total” content. That happens because Postgres use a technique called TOAST (The Oversized  Attribute Storage Technique) which store all the data that is over 200 bytes, apart from the block. So, in the relation you will have only the firsts bytes of your column data. You can modify some options about how the columns get toasted (the level of compression), but the default behaviour is compress and store apart. This methodology is pretty cool, because if you have i.e. 500MB of data in only one field, your relation will contain all that data in the same file and in the most usual cases that is not convenient.

So, meanwhile you have small number of rows, the amount of data to read from disk is quite big in relation of the data stored in the relfilenode (the data file). But for FTS, we need to dig into all the content and for make the searchs faster, we need GIN or GIST indexes. This type of indexes, stores the data inside the index (differently from B-Tree indexes), making possible the search using the index.

So, you want to access by index to your data, but the planner estimates that the amount of rows is so small that is better to access by sequential scan instead of use index. So, the only way to force the read of index is *SET enable_seqscan TO off;* before execute your query. This variable is per session, so you might want to add it inside the transaction and set again to “on” for other queries.

But, by the way, once your table is populated with enough amount of data, you wouldn’t need anymore this execution, but if your table is quite static, could be a little fix. Besides this, it doesn’t represents a problem of performance, but this kind of things could show you how the planner works.

Let’s see an example:

CREATE TABLE documents_en (
   author_id char(2),
   text_id serial,
   author varchar(45) NOT NULL,
   title varchar(127) NOT NULL,
   content text NOT NULL,
   cont_ftsed tsvector  ,
   PRIMARY KEY (author_id, text_id)
);

CREATE FUNCTION ftsed_upd_en() RETURNS trigger AS $$
DECLARE
BEGIN
NEW.cont_ftsed:=(setweight(to_tsvector('english', coalesce(NEW.title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(NEW.author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(NEW.content, '')), 'C'));
RETURN NEW;
END
$$
LANGUAGE plpgsql;

CREATE TRIGGER tgr_ins_ftsed_en BEFORE INSERT ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE TRIGGER tgr_upd_ftsed_en BEFORE UPDATE OF author,title,content
ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE INDEX documents_en ON nspc_es USING gin(
   (setweight(to_tsvector('english', coalesce(title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'C'))
);

CREATE INDEX ix_ftsed ON documents_en USING GIST(cont_ftsed);


Before continuing, we need to clarify 3 things: text_id is related with author_id, but just for testing we use it as a serial, to avoid collisions in the primary key; we use triggers to automatically set up the tsvector column; the last thing is the indexes, I created 2 indexes to show the examples. IMHO I don’t recommend functional indexes, for performance purposes, but is still an option.

To fill up the table, I’ve made a script to catch up all the READMEs in the server and put them into the DB.


#!/bin/bash
FILENAME="$@"
TMPBFF=temp_file
PSQL=/opt/pg92dev/bin/psql
WORDS=words

for _FILE_ in $FILENAME
do
 egrep  -oi '[a-z]+' $_FILE_ | egrep -i '[a-z]+{4}+' > $WORDS
 AUTHOR=$(cat $WORDS | sort -u | head -n2 | awk '{printf("%s ", $0) }')
 AUTHOR_ID=${AUTHOR:0:2}
 TITLE=$(cat ${WORDS} | sort | uniq -c | sort -nr | head -n3 | awk '{printf("%s ", $2)}')
 BODY=$(cat $WORDS | sort -u | awk '{printf("%s ", $0) }')

 $PSQL -Upostgres fts -c "INSERT INTO documents_en (author_id, author, title, content) VALUES ('${AUTHOR_ID}', '${AUTHOR}', '${TITLE}','${BODY}' );"
done


I know, is a bit tricky, but useful for this test. to execute it, you’ll need something like:

locate README.txt | xargs ./script

First let’s get some stats from the table:

fts=# select count(*) from documents_en ;
count : 22
fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty :  24 kB (size of the table)

fts=# select relpages from pg_class where relname = 'documents_en';
relpages :3 (8kb each)

Now, a simple query with a EXPLAIN ANALYZE:

fts=# EXPLAIN ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..
3.27 rows=1 width=34) (actual time=0.228..0.464 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime:
0.501 ms
(4 rows)


Oh! It’s using sequential scan (as we expect ), but now let’s add a trick:

fts=# SET enable_seqscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..
8.27 rows=1 width=34) (actual time=0.127..0.191 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime:
0.280 ms
(4 rows)

What happens? The cost is higher but is faster? The fast answer is YES, is possible. Cost is based on estimations in the amount of blocks and type of access. Random access use to has 4:1 more cost in relation with the sequential accesses. In this particular case, the table is small so the planner estimates that is cheaper to read it without indexes. BUT that’s not the real problem, the real one is that the data is already TOASTED outside the “data file” and this data is not considered by the planner in this case.

Another question is: can we improve the access to this data? The answer is YES, if you are not complaining about the storage. What you can do is:

fts=# alter table documents_en alter column cont_ftsed set storage external;
ALTER TABLE

That will uncompress the data of the column cont_ftsed and force to store it outside the table (this is not really at this way, but for be didactic we will explain it like this).

fts=# vacuum  full analyze documents_en;
VACUUM
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.071..0.127 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime:
0.178 ms
(4 rows)

fts=# SET enable_seqscan TO on;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.067..0.292 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime:
0.329 ms
(4 rows)

In this case we use VACUUM because the SET STORAGE option doesn’t make any changes to the previous stored data, and its effects are after. So with vacuum, we are forcing to apply the changes in all the table.

Ok, what happen if the table gets filled with more data? Automatically the planner will start to use the index:

fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty
----------------
1744 kB
(1 row)
fts=# SELECT count(*)  FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
count
-------
   45
(1 row)


fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                     QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents_en  (cost=4.38..
47.02 rows=13 width=31) (actual time=1.091..5.404 rows=45 loops=1)
  Recheck Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
  ->  Bitmap Index Scan on ix_ftsed_en  (cost=0.00..4.37 rows=13 width=0) (actual time=1.003..1.003 rows=125 loops=1)
        Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
Total runtime:
5.489 ms
(6 rows)

fts=#
SET enable_bitmapscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                         QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..
56.50 rows=13 width=31) (actual time=0.307..5.286 rows=45 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
Total runtime:
5.390 ms
(4 rows)

We don’t see a huge difference in terms of time between each other, but we still facing the question about the relation between cost/real-performance.

But, don’t forget the functional index! I think you will like the following example:

fts=# analyze;
ANALYZE
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                             QUERY PLAN                                                                                                                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on documents_en  
(cost=0.00..292.20 rows=17 width=7) (actual time=2.960..3878.627 rows=45 loops=1)
  Filter: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author
, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 2238
Total runtime:
3878.714 ms
(4 rows)

fts=#
SET enable_seqscan TO off;
SET
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                                  QUERY PLAN                                                                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on documents_en  
(cost=272.15..326.46 rows=17 width=7) (actual time=0.937..1.026 rows=45 loops=1)
  Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(
author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  ->  Bitmap Index Scan on textsearch_en  (cost=0.00..272.15 rows=17 width=0) (actual time=0.916..0.916 rows=45 loops=1)
        Index Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALE
SCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquer
y)
Total runtime:
1.109 ms
(5 rows)

Again! But hits time it is more stronger the difference in terms of time. Is not so pretty, ugh? So, It seems that you will need to be careful if you use searchs in big columns, that are toasted. Maybe you should check in you production database if you need to add some tweaks to get the maximum performance to your Postgres instance.

The version used on this post is PostgreSQL 9.2devel on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.2 20111027 (Red Hat 4.6.2-1), 64-bit. Besides this, I made tests in 9.1 and still see the same thing’.

Happy hacking!





Master/Slave Replication In Limited Bandwidth Scenarios

Database replication is quite an important component of the database architectures of many of our clients. Managed properly, it offers real-time redundancy and fault-tolerance, and allows for increased scalability. When replication goes wrong, however, it can turn into a morass of data conflicts, lag and strange little mysteries to distract your DBA team.

PalominoDB recently helped a client out of one of these strange little replication mysteries, and we document it here to help you avoid getting stuck in the same sort of situation.

The symptoms: Our slave database (which we’ll call DB_Slave) was lagging. The logs revealed that the value of Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0.

What we knew: DB_Slave connects to the master (which we’ll call DB_Master), and DB_Master sends binlog events to DB_Slave.

Because this data is pushed from the master rather than pulled from the slave, there was a small window where, in theory, a replication channel could be broken, and DB_Slave wouldn’t notice until "slave-net-timeout" seconds (with a default of 3600) passed.

We also knew that Seconds_Behind_Master is calculated not as the difference between the current  datetime and the datetime of the binlog event being processed, but as the difference between the time in master binlog and current executed relay log. That means that if DB_Master is not sending binlog events to DB_Slave (for example, due to a network issue), the slave could be lagging behind the master but not be aware of it at all.

Further, if DB_Slave then received some binlog events from DB_Master, and realized at that point that it was lagging, it could still fail to notice if DB_Master once again stopped pushing binlog events. 

Investigation: Our theory that there was in fact a network issue was supported by some evidence from the logs. We saw that SHOW SLAVE STATUS didn't show any progress in Master_Log_File and Read_Master_Log_Pos : that is, DB_Slave was not getting any data from DB_Master. 

We also found this in the processlist of DB_Master:

104062295       slave_user      [IP Address]:10124     NULL    Binlog Dump     1140    Writing to net  NULL

The status "Writing to net" means that DB_Master is sending (or trying to send) data to DB_Slave.

In a healthy replication channel, the status would be "Has sent all binlog to slave; waiting for binlog to be updated".

 

We suspected this might be an issue related to network bandwidth, but when we ran a series of STOP SLAVE / START SLAVE processes, we noticed that the number of threads in status “Writing to net” was increasing, with up to 20 concurrent open threads. Connections simply weren’t being closed. Further investigation revealed that those connections were staying in CLOSE_WAIT status.

Based on our observations we concluded that the limited network bandwidth was at fault, and we needed a solution to compress the transmission between the master and slave databases. Luckily MySQL provides a convenient parameter, which is dynamic and is set on the slave: slave_compressed_protocol=1

Here is how to implement it:

 

mysql> show global variables like 'slave_compressed_protocol';

+---------------------------+-------+

| Variable_name             | Value |

+---------------------------+-------+

| slave_compressed_protocol | OFF   |

+---------------------------+-------+

1 row in set (0.00 sec)

mysql> stop slave;

Query OK, 0 rows affected (0.25 sec)

 

mysql> set global slave_compressed_protocol=1;

Query OK, 0 rows affected (0.00 sec)

mysql> start slave;

Query OK, 0 rows affected (0.00 sec)
mysql> show global variables like 'slave_compressed_protocol';

 

+---------------------------+-------+

| Variable_name             | Value |

+---------------------------+-------+

| slave_compressed_protocol | ON    |

+-----------------------------------+

Resolution: We set slave_compressed_protocol=1 and restarted the slave. Replication started catching up at a very surprising speed. Even though it was catching up, we noticed the same behavior we’d noticed in the past: Seconds_Behind_Master was jumping from 0 to a random high value, and then back to 0, the io_thread was behind, and we didn’t see a high load on the server. 

As you can see in these graphs, network traffic did not increase much, but CPU usage and commands/sec had surprisingly high jumps.

In short, DB_Slave was processing more queries because the IO thread was receiving more data from DB_Master, thanks to compression. 

Hashing Algorithm in MySQL PASSWORD()

Recently we had a question from a customer: what is the hashing algorithm implemented in PASSWORD() ?

The manual doesn't give a straight answer in any of these two pages:

http://dev.mysql.com/doc/refman/5.1/en/encryption-functions.html#function_password

http://dev.mysql.com/doc/refman/5.1/en/password-hashing.html

 

It is enough to dig a bit more to find the solution in http://forge.mysql.com/wiki/MySQL_Internals_ClientServer_Protocol#4.1_and_later that specifies "mysql.user.Password stores SHA1(SHA1(password))" .

 

Instead of blindly trusting the documentation (even if I believe it is correct), I did some tests and was confused by the first result:

mysql> SELECT PASSWORD("this_is_a_random_string") `pass`\G 

pass: *12E76A751EFA43A177049262A2EE36DA327D8E50

mysql> SELECT SHA1(SHA1("this_is_a_random_string")) `pass`\G 

pass: 9b653fd9fb63e1655786bfa3b3e00b0913dfc177

So it looked like SHA1(SHA1(password)) wasn't PASSWORD(password)), at least in this test.

The best documentation ever is the source code, so I read the source code and understood why my previous test was incorrect: the second SHA1() is applied to the binary data returned by the first SHA1() and not to its hex representation. Therefore in SQL I have to UNHEX() it before applying the second SHA1. In fact: 

mysql> SELECT SHA1(UNHEX(SHA1("this_is_a_random_string"))) `pass`\G 

pass: 12e76a751efa43a177049262a2ee36da327d8e50

 

So yes, I confirmed that mysql.user.password stores SHA1(SHA1(password)) . I also hope this post is useful to understand how MySQL implements PASSWORD().

Using HAProxy for MySQL failovers

There are a number of solutions available for MySQL architecture that provide automatic failover, but the vast majority involve potentially significant and complex changes to existing configurations. Fortunately, HAProxy can be leveraged for this purpose with a minimum of impact to the current system.  Alex Williams posted this clever solution a couple years ago in his blog.  Here we take a closer look at the details of implementing it into an already existing system.

HAProxy is a freeware load balancer, proxy, and high availability application for the TCP and HTTP protocols.  Since it was built mostly to handle web traffic, it has robust rule writing using HTTP components to check on the health of systems.  The key to Alex's solution was creating xinetd daemons on the database servers that send out HTTP messages.  The HTTP check to determine database statuses works like this:

  1. The HAProxy server sends HTTP requests to the database servers at configured intervals to specified ports
  2. The /etc/services file on the database servers maps those ports to services in their /etc/xinetd.d directory
  3. The services can call any specified script, so we build scripts that connect to the databases and checks for whatever conditions we choose.

The services then return an OK or a Service Unavailable response per the conditions.  Code for these scripts is in included in the Alex's article.

Our database configuration for this implementation is two pairs of Master-Slave databases in an Active-Passive relationship with Master-Master replication between the sets.  Siloing the passive Master-Slave provides a hot spare as well as continuous up-time during deployments provided we have a means of swapping the active/passive roles of each pair.  To accomplish the latter, we built two HAProxy configuration files, haproxy_pair1.cfg and haproxy_pair2.cfg, the only difference between the two being which Master-Slave pair is active.  Having the two files indicate which pair is active also allows immediate visibility of the current configuration.

Just as in Alex's sample configuration, our web application uses two DNS entries, appmaster and appslave to call writes and reads respectively.  The IPs for these addresses are then attached to our HAProxy server, allowing HAProxy to bind to them when it starts and then route them to the appropriate database server.

On our HAProxy server we then customized the /etc/init.d/haproxy script to handle an additional parameter of "flipactive" which provides us the capability to swap the database pairs:

flipactive() {
        # first detect which cfg file haproxy is using
        ps_out=`ps -ef |grep haproxy|grep cfg`
        # if pair2 then use pair1
        if [[ "$ps_out" =~ pair2 ]]  then
             # the -sf does a friendly reread of the config file
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair1.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        # if pair1 then use pair2
        else
             /usr/local/sbin/haproxy -f /etc/haproxy/haproxy_pair2.cfg -p /var/run/haproxy.pid -st $(cat /var/run/haproxy.pid)
        fi
}

The rest of the configuration details are covered pretty well in Alex's article as well as in the HAProxy documentation.

We successfully developed and implemented this technique with the devops team at SlideShare last month to build rolling DDL scripting to multiple databases using Capistrano.  It allowed them to have explicit control over which database was being updated, thereby giving them the means necessary to update one database while other served the current application code.

Last Day at PalominoDB

I have been the Community Liaison and a Senior DBA at PalominoDB for 15 months, and doing remote DBA work for 4 years.  In that time I have learned that "consultant" need not be a dirty word, and that in the DBA world it is actually extremely valuable to have a remote DBA with lots of outside experience, and a team of remote DBAs for when your primary contact is sick or goes on holiday.

As with everything, there are downsides to remote database management.  Even though there is a lot of architecture experience among the remote DBAs I know, we are not often invited to architecture meetings.  This is because time is the unit of currency, and while sitting in an hour-long meeting to give 5 minutes of feedback can save hours down the road, it's hard to see that.  Many clients have gotten around this by having all DDL checked and performed by remote DBAs, and that helps a lot.

There is also no ownership - we can recommend solutions and technologies, but the client makes the actual decision about whether something needs to be done or not.  I look forward to actually owning architecture after 4 years of making "strong recommendations".

Since folks will ask, I have taken a job as a Senior DBA/Architect with Mozilla, starting Monday.  A former co-worker told me about the job; I was not particularly looking for anything, but I was intrigued.

I have said before that it is hard to find a good in-house DBA if you are not a huge company like Facebook or Google or Yahoo, and that is still true.  At Mozilla, they are 100% open and public about their ideas, and they do a lot of behind-the-scenes work.  Sound familiar?

They also allow their developers to develop on whatever platforms work best.  Their biggest database is their crash reporting database (and they do read it, so do submit your crashes).  They have MySQL, PostgreSQL, and MongoDB, and are starting to move some applications around, as developers are not always aware of what platforms will work best.  There is another DBA, so I will not be alone, but I expect to be just as challenged at Mozilla as I have been at PalominoDB and Pythian.

So, to keep up with PalominoDB, you can:

- like the PalominoDB page on Facebook

- follow @palominodb on Twitter

- connect with Laine on LinkedIn

- follow PalominoDB on LinkedIn

- continue to read PalominoDB's blog and Planet MySQL

 

To keep up with me, you can:

- follow @sheeri on Twitter

- read my blog and Planet MySQL

- connect with me on LinkedIn

- subscribe to the OurSQL podcast (details at http://www.oursql.com/?page_id=2)

Non-deterministic Functions and the Binary Log

I wrote this post because I ran across this issue when debugging why tables with triggers/functions that were not getting replicated to slaves. The problem appears when binlog_format is set up  as STATEMENT. You can check the ‘change log’ checking the following link for more information about default values [1].

If you have non-deterministic functions  [2]  that insert or modify your data with dynamic SQL like this :

<CODE>
DELIMITER $$
CREATE FUNCTION `insert_dynamic_data` (in_param VARCHAR(30)) RETURNS bigint(20)
BEGIN
 ....
 INSERT INTO `table_to_ins` (param) VALUES(in_param);
 ...
 /* You can also add here @@warning_count or row_count() check
     and return -1 to catch errors*/
 return LAST_INSERT_ID();
END
$$
</CODE>

The problem starts when you plan to user “replication” through binary logs events. In that cases, you must consider:

"If a stored program that modifies data is nondeterministic, it is not repeatable. This can result in different data on a master and slave, or cause restored data to differ from the original data."  [3]

When you create a function, by default is NOT DETERMINISTIC, which means that for each execution could the code will be executed again. In the other hand, DETERMINISTIC will check if the parameters are the same and if is that condition comes true, will return the same result as before without executing again the code.

If you want to create the function with the binlog_format in STATEMENT and log_bin activated without the log_bin_trust_function_creators, you will get the following error:

Version 5.1.41:
mysql> DELIMITER $$
mysql> CREATE FUNCTION fx_pru () RETURNS int
    -> BEGIN
    ->  insert into prueba select 'po',round(rand()*100), rand();
    -> RETURN 1;
    -> END
    -> $$
ERROR 1418 (HY000): This function has none of DETERMINISTIC, NO SQL, or READS SQL DATA in its declaration and binary logging is enabled (you *might* want to use the less safe log_bin_trust_function_creators variable)

What happened so? The problem was that the funciton were created previous to activate the log_bin.

You have to ways to fix it: activating binlog_format as ROW or declare in your /etc/my.cnf the log-bin-trust-function-creators which allows insertions in the binary log without enforcing deterministic property. In the specified case, we cannot declare this function as DETERMINISTIC because it uses dynamic values in the parameters  (always we expect different values and different results).

What happens with the execution of the function? Basically, it creates a lock contention ( if you usually use Nagios, you will see a mysql-lock-contention alarm), but MySQL will not raise any error or alarm, which causes confusion when trying to find the error. Other statements that run outside the function will be executed normally. In the previous example, this function was only for inserts, but every statement inside the function will not work. Derived  from this problem, statements inside the function will not be executed directly.

You wouldn't  see any errors. The only thing you could realize is the lock contention. The lock is generated because the statement hangs a lock into the table trying to reach a commit that never comes.

It is a bug? No, is a feature. It prevents unknown and potentially slow executions from being replicated, causing lagging into the replication process.

I recommend a link about this and related problems with triggers: [4]

1: http://www.filehippo.com/es/download_mysql/changelog/6962/
2: http://en.wikipedia.org/wiki/Nondeterministic_algorithm
3: http://dev.mysql.com/doc/refman/5.1/en/stored-programs-logging.html
4: http://dba.stackexchange.com/questions/321/dynamic-sql-in-mysql-stored-routines

Syndicate content
Website by Digital Loom