Emanuel Calvo

Insights from the PgCon 2013

PgCon 2013 was attended by 256 people across the globe.  Attendees had the opportunity to enjoy tutorials, talks and an excellent unconference (this last deserves a special mention).

I lectured a talk related with Full text search using Sphinx and Postgres (you can find the slides at http://t.co/lgFoLq37EC, and all of the talks have been recorded).  The quality of the talks in general was quite good, but I don't want to repeat what you will find in other posts.

The unconference was attended quite late into the evening. You can find a schedule of it, as well as the minutes of some of the talks that happened (and others that didn't also) here.

There was a special emphasis on the pluggable storage feature, albeit most agree that it will be a very difficult feature to implement in the near versions. A topic related to this, was the Foreign Data Wrapper enhancements.

Pluggable Storage engine was extended after. The main reason of why everybody agrees with this feature, is because an API for the storage will allow companies to collaborate with code and avoid forks to other projects.

There was a long discussion also about migrations on the hall, using pg_upgrade.

The features about replication were bi-directional and logical replication.

Full text search unconference discussion was pretty interesting. Oleg Bartunov and Alexander showed a really interesting work coming up for optimizing GIN indexes. According to their benchmarks, Postgres could improve the performance significantly.

There were a lot of discussion I missed, due the wide number of tracks and "hall spots". But th emajority of attendees I heard agreed that the unconference was quite exciting and granted the possibility to bring many new ideas.

 

 

 

 

Benchmarking Postgres on AWS 4,000 PIOPs EBS instances

Introduction

Disk I/O is frequently the performance bottleneck with relational databases. With AWS recently releasing 4,000 PIOPs EBS volumes, I wanted to do some benchmarking with pgbench and PostgreSQL 9.2. Prior to this release the maximum available I/O capacity was 2,000 IOPs per volume. EBS IOPs are read and written in 16Kb chunks with their performance limited by both the I/O capacity of the EBS volumes and the network bandwidth between an EC2 instance and the EBS network. My goal isn't to provide a PostgreSQL tuning guide, an EC2 tuning guide, or a database deathmatch complete with graphs; I'll just be displaying what kind of performance is available out-of-the-box without substantive tuning. In other words, this is an exploratory benchmark not a comparative benchmark. I would have liked to compare the performance of 4,000 PIOPs EBS volumes with 2,000 PIOPs EBS volumes, but I ran out of time so that will have to wait for a following post.

Setup

Region

I conducted my testing in AWS' São Paulo region. One benefit of testing in sa-east-1 is that spot prices for larger instances are (anecdotally) more stable than in us-east. Unfortunately, sa-east-1 doesn't have any cluster compute (CC) instances available. CC instances have twice the bandwidth to the EBS network than non-CC EC2 instances. That additional bandwidth allows you to construct larger software RAID volumes. My cocktail napkin calculations show that it should be possible to reach 50,000 PIOPs on an EBS-backed CC instance without much of a problem.

EC2 instances

I tested with three EC2 instances: an m1.large from which to run pgbench, an m2.2xlarge with four EBS volumes, and an m1.xlarge with one EBS volume. All EBS volumes are 400GB with 4,000 provisioned IOPs. The m1.large instance was an on-demand instance; the other instances  — the pgbench target database servers — were all spot instances with a maximum bid of $0.05. (In one case our first spot instance was terminated, and we had to rebuild it). Some brief testing showed that having an external machine driving the benchmark was critical for the best results.

Operating System

All EC2 instances are running Ubuntu 12.10. A custom sysctl.conf tuned the Sys V shared memory as well as set swappiness to zero and memory overcommit to two.

kernel.shmmax = 13355443200
kernel.shmall = 13355443200
vm.swappiness = 0
vm.overcommit_memory = 2

Packages

The following packages were installed via apt-get:

  • htop
  • xfsprogs
  • debian-keyring
  • mdadm
  • postgresql-9.2
  • postgresql-contrib-9.2

In order to install the postgresql packages a pgdb.list file containing

deb http://apt.postgresql.org/pub/repos/apt/ squeeze-pgdg main

was placed in /etc/apt/sources.list.d and the following commands were run:

gpg --keyserver pgp.mit.edu --recv-keys ACCC4CF8
gpg --armor --export ACCC4CF8 | apt-key add -
apt-get update

RAID and Filesystems

For the one volume instance, I simply created an XFS file system and mounted it on /mnt/benchmark.

mkdir /mnt/benchmark
mkfs.xfs /dev/svdf 
mount -t xfs /dev/svdf /mnt/benchmark
echo "/dev/svdf    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab

For the four volume instance it was only slightly more involved. mkfs.xfs analyzes the underlying disk objects and determines the appropriate values for stride and width. Below are the commands for assembling a four volume mdadm software RAID array that is mounted on boot (assuming you've attached the EBS volumes to your EC2 instance). Running dpkg-reconfigure rebuilds the initrd image.

mkdir /mnt/benchmark
mdadm --create /dev/md0 --level=0 --raid-volumes=4 /dev/svdf /dev/svdg /dev/svdh /dev/svdi
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
mkfs.xfs /dev/md0
echo "/dev/md0    /mnt/benchmark    xfs    defaults    1 2" >> /etc/fstab
dpkg-reconfigure mdadm

Benchmarking

pgbench is a utlity included in the postgresql-contrib-9.2 package. It approximates the TPC-B benchmark and can be looked at as a database stress test whose output is measured in transactions per second. It involves a significant amount of disk I/O with transactions that run for relatively short amounts of time. vacuumdb was run before each pgbench iteration. For each database server pgbench was run mimicking 16 clients, 32 clients, 48 clients, 64 clients, 80 clients, and 96 clients. At each of those client values, pgbench iterated ten times in steps of 100 from 100 to 1,000 transactions per client. It's important to realize that pgbench's stress test is not typical of a web application workload; most consumer facing web applications could achieve much higher rates than those mentioned here. The only pgbench results against AWS/EBS volumes that I'm-aware-of/is-quickly-googleable is from early 2012 and, at its best, achieves rates 50% slower than the lowest rates found here. I drove the benchmark using a very small, very unfancy bash script. An example of the pgbench commandline would be:

pgbench -h $DBHOST -j4 -r -Mextended -n -c48 -t600 -U$DBUSER

m1.xlarge with single 4,000 PIOPs volume

The maximum transaction volume for this isntance was when running below 48 concurrent clients and under 500 transactions per client. While the transaction throuput never dropped precipitously at any point, loads outside of that range exhibited varying performance. Even at its worst, though, this instance handled between 600-700 transactions/second.

m2.2xlarge with four 4,000 PIOPs volumes

I was impressed; at no point did the benchmark stress this instance — the tps rate was between 1700-1900 in most situations with peaks up to 2200 transactions per second. If I was asked to blindly size a "big" PostgreSQL database server running on AWS this is probably where I would start. It's not so large that you have operational issues like worrying about MTBFs for ten volume RAID arrays or trying to snapshot 4TB of disk space, but it is large enough to absorb a substantial amount of traffic.

Graphs and Tabular Data

single-4K-volume tps

The spread of transactions/second irrespective of number of clients.

Box plot of transactions per second. Single 4K volume

Data grouped by number of concurrent clients with each bar representing an increase in 100 transactions per second ranging from 100 to 1,000.

Bar graph of transactions per second grouped by concurrent clients. Single 4K volume

Progression of tps by individual level of concurrency. The x-axis tick marks measure single pgbench runs from 100 transactions per client to 1,000 transactions per client.

Six subgraphs of transactions per second by each level of concurrency. Single 4K volume

Raw tabular data

txns/client1002003004005006007008009001000
clients
1614551283118365311975336311009923648
321500124212327577476301067665688709
482818648997051029749736593766641
6494412817041010739596778662820612
808158931055809597801684708736663
96939889774772798682725662776708

four-4,000-PIOPs-volumes tps

Again, a box plot of the data with a y-axis of transactions/second.

Box plot of transactions per second. Four 4,000 PIOPs volumes

Grouped by number of concurrent clients between 100 and 1,000 transactions per client.

Bar graph of transactions per second grouped by concurrent clients. Four 4,000 PIOPs volumes

TPS by number of concurrent clients. The x-axis ticks mark pgbench runs progressing from 100 transactions per client to 1,000 transactions per client.

Six subgraphs of transactions per second by each level of concurrency. Four 4,000 PIOPs volumes

Tabular data m2.2xlarge with four 4,000 PIOPs EBS volumes

txns/client1002003004005006007008009001000
clients
161487161718771415138818821897177112671785
321804208321601791125919972230150117171918
481810215212961951211717751709180318171847
641810158015682056181117841849190919421658
801802204414672142164518961933174018211851
961595140320471731178318591708189617511801

Cascade Replication and Delayed servers on PostgreSQL 9.2

Postgresql 9.2 will be released with a new replication feature: Cascade Replication. It will improve streaming between wide networks, shortening distances between the servers. For example: a master server in San Francisco, a slave in Chicago and a cascade server in Barcelona.
But it has other benefit, e.g. when you need to add nodes to the cluster, you can to take backups from your slave without directly affecting the master in terms of I/O of traffic while also minimizing the connections to it.
The setup is easy and intuitive. Like from a slave from a master, but with the difference on the primary_conninfo, which will point to the slave server.
Things to consider when cascading replication:
  • One of the 9.0 limitations in replication -cleaned on 9.1- is the problem that when a master is vacuuming a big table that affects a query that is being executed on the slave. That feature can be enabled with the hot_standby_feedback on the slave server and will add more communication between the servers (if you have very slow connection, maybe you want to disable it)
  • If you have synchronous replication it will only affect 1 slave server against the master, the cascade slave won’t be affected by this setup (cascade server cannot be synchronous).
  • The control function pg_xlog_replay_pause (which stops applying the new incoming wal records) affects only the server in which you are executing it. If the server in which you are executing the control function has cascade servers, those servers will continue streaming and applying the records (occasionally you can have a=c but b!=a, besides replication has a->b->c chain), cause that function avoids the replay, not the streaming shipping. If you want to stop the cascading to all the hosts (not only replication, although the streaming), just stop the slave or change the primary_conninfo in the cascade servers. That is useful if you run BI queries on the cascade servers and don’t want to be affected by writes, so you can delay completely the replication until finish your reports.
  • If you want to use archiving, it must be set up from the master server. Archiving is necessary if you want to enlarge the availability in case of a long downtime of a slave. If you want a delayed server or if you want an incremental backup for PITR in case of disaster.
In our test we’ll have 5 servers, that I’ll label with the port number and data directory to avoid confusions. In this case you can run postgres in same port in different hosts, but just for make clear the difference I will use different ports between them:
Master (8888, data_master) Slave(8889 data_slave) Cascade Slave(8890 data_cascade_slave) delayed standby server using omnipitr-restore (8891 data_delayed) Delayed slave hot standby server using a handmade script (8892 data_streaming_delayed).
To take a backup for replication, first create a user with replication grants and add it to pg_hba.conf in the master. From psql, run:
    CREATE USER repl WITH REPLICATION PASSWORD ‘password’;

pg_hba.conf:
local   replication     repl                            md5
host    replication     repl        127.0.0.1/32            md5
host    replication      repl        0.0.0.0/0               md5

*NOTE: if you want to setup several slaves in one move, make a backup locally in the master, compress it and then ship it among the slaves. Compress and backup command:
bin/pg_basebackup -Ft -p8888 -Urepl -Ddata_slave -z -Z9


Servers configuration
Master (8888) configuration:
postgresql.conf:
wal_level = hot_standby
max_wal_senders=3
wal_keep_segments = 30
archive_mode=on
archive_command = 'cp %p /opt/pg92beta/archives_master/%f'
pg_hba.conf:
local   replication     repl                                md5
host    replication     repl        127.0.0.1/32            md5
host    replication      repl        0.0.0.0/0               md5
max_wal_senders should be a value -at least- the number of slaves connected to this server (cascade servers doesn’t count, cause they connect to the other slaves). If you don’t know how many slaves you need, you can setup a higher value (like 10 for example). That won’t affect performance and the process of wal sender only will work if another slave is brought up.
wal_keep_segments is optional, but is a very comfortable setup when we need to have a large available segments on the master to update the slaves. If one slave gets into a not-so-long downtime it can get updated using directly the streaming replication, instead the archiving files. Is not a time value, and this is important to keep in mind. Keeping 30 segments maybe for a high load will represent just a few hours of activity and in a low activity server maybe a few days. archive_mode and archive_command are in charge of copy those segments to other destination to be used for incremental backups or update slaves that were in long period of downtime.
Slave (8889) Configuration:
postgresql.conf:
wal_level = hot_standby
max_wal_senders=3
wal_keep_segments = 30
hot_standby = on
hot_standby_feedback = on
recovery.conf:
standby_mode = on
primary_conninfo = 'user=postgres host=localhost port=8888'
trigger_file=/tmp/promote_me

This is a classic setup of a asynchronous slave server through streaming replication. No delay specified, just replicate as fast as it can. The hot_standby_feedback option, allows to communicate with the master, in case of long running queries on the slave and to avoid vacuums during its execution on the master. Cascade slave (8890) configuration:
recovery.conf:
standby_mode = on
primary_conninfo = 'user=postgres host=localhost port=8889'
Configuration of this host is like every slave, the difference is that the primary_conninfo will point to the slave server where you have taken the backup.
Delayed Slave (8891) configuration:
recovery.conf:
restore_command = '/opt/pg92beta/omniti/bin/omnipitr-restore -D /opt/pg92beta/data_delayed -l /opt/pg92beta/omniti-restore.log -s /opt/pg92beta/archives_master -w 300 %f %p'
archive_cleanup_command = '/opt/pg92beta/bin/pg_archivecleanup /opt/pg92beta/archives_master %r'
standby_mode = on
You can download omnipitr-restore at https://github.com/omniti-labs/omnipitr . Some documentation here at OmniPitr with delay.
restore_command is in charge to apply the corresponding segments and the option “-w 300” is the “wait” to apply the delay. The server will be on standby mode, which means that is not accessible to query it.

Streaming delayed hot-standby slave (8892) configuration:
Configuration is almost the same as other slaves, just keep an eye on: postgresql.conf:
#...same configuration as the other slaves...
hot_standby_feedback = off   #not necessary cause the server is delayed

Script for delayed using recovery_target_time:
#!/bin/bash
 
HOME_PG=/opt/pg92beta
DATA_PG=$HOME_PG/data_streaming_delayed
DELAYED_DATE=$(date "+%Y-%m-%d %H:%M:%S %Z" --date="5 minutes ago")
RECOVERY_FILE_LOCATION=$DATA_PG/recovery.conf
RECOVERY_FILE_LOCATION_DONE=$DATA_PG/recovery.done
CONTROL_COMMAND=$HOME_PG/control_streaming_delayed.sh
PORT=8892
 
 
res=$($CONTROL_COMMAND stop)
 
if [ -f $RECOVERY_FILE_LOCATION_DONE ] ; then
 mv $RECOVERY_FILE_LOCATION_DONE $RECOVERY_FILE_LOCATION
fi
 
sed -i "s/.*recovery_target_time[ |\=].*/recovery_target_time =\'$DELAYED_DATE\'/g" $RECOVERY_FILE_LOCATION
 
#Taken from the doc:
# If this recovery target is not the desired stopping point,
# then shutdown the server, change the recovery target settings
# to a later target and restart to continue recovery.
res=$($CONTROL_COMMAND start)
control_streaming_delayed.sh is just a one line script (--mode=fast is required to stop queries when restarted):
su postgres -c "bin/pg_ctl -D data_streaming_delayed --mode=fast -l logfile_streaming_delayed $1"
This configuration has its cons:
The “delay” frame isn’t exact. Why? You need to cron the script - which has the recovery_target_time change- each x period of time. During that period of time, the delay will increase until you re-execute the script. So, if you have configured a “30 minutes ago” and your script is executed each 5 minutes, your delay will be between 30 and 35 minutes. The scripts executes a stop/start, which is necessary to load the new configuration. DO NOT RUN A RELOAD, because the slave will stop the recovery mode. Due this, you will need to have in mind if you want to make queries in that server.
But its advantages are:
You can query the server. You can just stop the refresh of the recovery_target_time for make your reports against that server. Due that the server replay is paused, you will experience more performance in your queries.

After every start, you will find something like:
LOG:  database system was shut down in recovery at 2012-07-04 11:14:39 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/1E000020
LOG:  consistent recovery state reached at 0/1E010B48
LOG:  database system is ready to accept read only connections
LOG:  recovery stopping before commit of transaction 14024, time 2012-07-04 11:14:18.319328+00
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.

Do not execute pg_xlog_replay_resume() if you want to keep recovering. If you execute it, will replay the last records and start as a stand alone server, which isn’t our objective. That function will update until the last record received and will continue doing that until something stops the replay.
Let’s see an example:
root@lapp /opt/pg92beta# bin/psql -p8888 -Upostgres
psql (9.2beta2)
Type "help" for help.
 
postgres=# insert into check_time values (now());
INSERT 0 1
postgres=# select * from check_time ;
            i
----------------------------
2012-07-04 11:19:47.667041
2012-07-04 11:43:23.794384
(2 rows)
 
postgres=# \q
root@lapp /opt/pg92beta# ./change_recovery_target_time.sh
root@lapp /opt/pg92beta# tail data_streaming_delayed/pg_log/postgresql-2012-07-04_11
postgresql-2012-07-04_110653.log  postgresql-2012-07-04_110853.log  postgresql-2012-07-04_111440.log  postgresql-2012-07-04_114335.log
postgresql-2012-07-04_110808.log  postgresql-2012-07-04_111315.log  postgresql-2012-07-04_111954.log
root@lapp /opt/pg92beta# tail data_streaming_delayed/pg_log/postgresql-2012-07-04_114335.log
LOG:  database system was shut down in recovery at 2012-07-04 11:43:33 UTC
LOG:  entering standby mode
LOG:  redo starts at 0/1E000020
LOG:  consistent recovery state reached at 0/1E012734
LOG:  database system is ready to accept read only connections
LOG:  recovery stopping before commit of transaction 14027, time 2012-07-04 11:43:23.797137+00
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.
root@lapp /opt/pg92beta# bin/psql -p8892 -Upostgres
psql (9.2beta2)
Type "help" for help.
 
postgres=# select * from check_time ;
            i
----------------------------
2012-07-04 11:19:47.667041
(1 row)
 
postgres=# \q

For large amounts of data, maybe you want to have more large delays (1 day, 6 hours) due to the spended time to start an instance. Based on some experience, 1 day shuold be find to detect problems or to work with reports.
http://pgxn.org/search?q=delay&in=docs
http://pgxnclient.projects.postgresql.org/install.html
http://vibhorkumar.wordpress.com/2011/11/15/new-replication-and-recovery-features-in-postgresql-9-1/

The Postgres-XC 1.0 beta release overview

 

When I heard about this project a year ago, I was really excited about it. Many cluster-wide projects based on Postgres were developed very slowly, based on older (i.e. Postgres-R http://www.postgres-r.org/) or proprietary (i.e. Greenplum) versions. The features that this project hoped to achieve were ambitious, as we’ll detail in this post. And best of all - this project is based on the 9.1 Postgresql version, which is really up-to-date (at the moment of writing this post, this is the last stable version).

If you are interested in a serious project for scaling horizontally your PostgreSQL architecture, you may visit the official website at http://postgres-xc.sourceforge.net/ and take a look. 

For those who are interested, there will be a tutorial at PgCon this year.  As a brief overview, I will try to give you a broad idea for those who want to get involved in the project.

 

What Postgres-XC can do:

  • Support multi-master architecture. Data nodes can contain part or all of the data of a relationship. 
  • Transparent view to application from any master. The application only needs to interact with the cluster through coordinators.
  • Distribute/Replicate per relation (replication, round robin (by default if any unique column is specified), by hash (by default if a unique is specified), by modulo or a set to a group or node)
  • Parallel transaction execution among cluster nodes.

 

What Postgres-XC cannot do:

  • Support triggers (may be supported in future releases).
  • Distribute a table with more than one parent. 

 

Before you start:

  You need to install the most recent versions of Flex and Bison. That’s important because in the last tarball, ‘./configure’ won’t raise error if they are missing, and the error will be prompted once you execute ‘make’. You will need readline-dev and zlib-dev (not mandatory but strongly recommended).

According to the documentation, Postgres-XC should be compatible with Linux platforms based upon Intel x86_64 CPU. 

The documentation needs to be improved, so we advise you to try the steps directly and read the help prompted by the commands. For example, the initdb command in the documentation is incomplete, “--nodename” is mandatory in this version. This project is new and has only a few contributors to date, but we hope its community keeps growing. Most importantly, it is great that a beta release was launched earlier than we expected.

 

Elements of architecture

 

  • GTM (Global Transaction Manager)

+ Realize that I say only GTM, not GTMs. Only one GTM can be the manager. For redundancy, you have GTM-Standby and to improve performance and failover GTM-Proxies.

+ The GTM serializes all the transaction processing and can limit the whole scalability if you implement it primitively. This should not be used in slow/wide networks and is recommended to involve the fewest number of switches between GTM and coordinators. The proxies reduce the iteration with the GTM and improve the performance. 

+ Uses MVCC technology. That means that it will still use the same control for the concurrency as Postgres.

+ This is the first thing you will need to configure. If you set up everything in the same machine, you will need to create a separate folder for the configuration and files.

 

  • Coordinator/s

+ Interface for applications (like a Postgres backend). 

+ It has its own pooler (yes, and I think this is great, avoiding more complexity in big environments).

+ Doesn’t store any data. The queries are executed in the datanodes, but...

+ … it has its own data folder (for global catalogs).

  • Datanode/s

+ Stores the data.

+ It receives the petition with a GXID (Global Transaction ID) and Global Snapshot to allow requests from several coordinators.

Both Datanodes and Coordinator use their own data directory, so keep this in mind this if you are setting both up on the same machine. 

Configuring several GTM-Proxies will improve the scalability, shrinking the I/O in the GTM. Plus, you can configure the GTM-Standby to avoid a SPOF of the general manager. It not only provides the GXID, it also receives the node registration (you can trace your log or check the file inside the gtm folder called register.node, it’s binary but is readable) and most importantly, it holds the snapshot of the current status of all the transactions.

Coordinators can point to all the datanodes and can point to the same datanode (as Oracle RAC, but we’re not sure if all the features included in that solution will be available for Postgres-XC). Coordinators connect to the proxies (if you have already configured them) or the main GTM.

Hope you enjoyed the read. We are preparing more cool stuff about this amazing solution, keep in touch!

New versions of PgPool released - 3.1.3 & 3.0.7


This essential tool for Postgres architectures is continually improving, and is now available in its new releases. Both are bugfix versions.
 
For those unfamiliar with the tool, it is a middleware with functionality as a load balancer, pooler*  and/or replication system for PostgreSQL databases. The 3.1.x versions are compatible with Postgres 9.x, whose streaming replication feature was pushed to Pgpool developers to take advantage of it.   This allows the tool to balance queries without using the pgpool-replication technique.
 
In the 3.1.3/3.0.7 fixes we have:

  • Allow multi statement transactions in master/slave mode: Transactions with BEGIN, since 3.1 were sent to the slaves/standby servers as well. This brings non desirable effects when the transaction contains DELETE/INSERT/UPDATE, due to the fact that standbys cannot execute writable SQL. (3.1.3 fix)
  • Important fixes for failover detection and execution. (3.1.3 fix)
  • Added m4 files to avoid problems when compiling in older operating systems.
  • Fixed hangup on PREPARE errors.
  • Fixed memory leak in reset queries.


If you are running 3.1.x against Postgres 9 databases, we strongly recommend you upgrade PgPool due to the fixing in the multi statement feature.

For more information http://www.pgpool.net/mediawiki/index.php/Main_Page

* If you need only a connection pooler for Postgres, I prefer PgBouncer http://wiki.postgresql.org/wiki/PgBouncer. It is lightweight and more specific and simply works, without as much configuration.

Exploring a new feature of 9.2: Index-only scans

We, like other Postgres DBAs worldwide, have been waiting for the 9.2 release for some time, specifically for the index-only scan feature, which will help reduce I/O by preventing unnecessary access to heap data if you only need data from the index.

Besides 9.2 is still in development, it is possible to download a version for testing at http://www.postgresql.org/download/snapshots/ . It's important to note that it doesn’t add new behaviours, but improves the way that indexes are used.

How can we test this feature? We created a ‘big’ table starting with 10 million+ records with random values.

  • Extract a few elements, using a where clause.
  • Use aggregations.
  • Use partitioning plus index only scans.



When is this feature most useful?:

  • When you select only the columns that are specified in the index definition,  including those which are at the condition part of the query.
  • If a vacuum was executed previously. Thus happens because the scan can skip the “heap fetch” if the TID references a heap [table] page on which all tuples are known visible to everybody (src/backend/executor/nodeIndexonlyscan.c).


So, the main table is:


CREATE TABLE lot_of_values AS SELECT i, clock_timestamp() t1, random() r1, random() r2, random() r3, clock_timestamp() + (round(random()*1000)::text || ' days')::interval d1 from generate_series(1,10000000) i(i);
ALTER TABLE lot_of_values ADD PRIMARY KEY(i);

CREATE INDEX CONCURRENTLY ON lot_of_values (d1);



Something interesting: due to some improvements in write performance, we realized that 9.2 demonstrated better timing compared with 9.1.3 (~200k in 9.1, ~170k ms on 9.2). The index creation was slightly better on 9.2.

The table will contain data like this:

stuff=# \x
Expanded display is on.
stuff=# select * from lot_of_values limit 1;
-[ RECORD 1 ]---------------------
i  | 1
t1 | 2012-04-18 08:37:14.426624+00
r1 | 0.571268450468779
r2 | 0.222371176816523
r3 | 0.72282966086641
d1 | 2012-08-17 08:37:14.426713+00


Ok, let’s start with some examples. As we previously explained, we need to specify columns that are only in the index. You can’t use columns from 2 different indexes. The next example is a clear fail:


stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
             QUERY PLAN                                
---------------------------------------------
Seq Scan on lot_of_values  (cost=0.00..263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)

stuff=# set enable_seqscan=off;
SET
stuff=# explain select i,  d1 from lot_of_values where round(r1*100) < 10;
                                      QUERY PLAN                                       
---------------------------------------------
Seq Scan on lot_of_values  (cost=10000000000.00..10000263496.00 rows=3333333 width=12)
  Filter: (round((r1 * 100::double precision)) < 10::double precision)
(2 rows)



We don’t have indexes at r1 and it isn’t part of the index!

The next example is another fail, using a column that is defined in another index or directly not defined in any index:


stuff=# explain select i,  d1 from lot_of_values where i between 12345 and 23456;
                                         QUERY PLAN                                           
---------------------------------------------
Index Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=12)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



The next example is the correct case:


stuff=# explain select i from lot_of_values where i between 12345 and 23456;
                                           QUERY PLAN                                             
---------------------------------------------
Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..450.83 rows=11590 width=4)
  Index Cond: ((i >= 12345) AND (i <= 23456))
(2 rows)



Also, we can try with a non-pk index:


stuff=# explain select min(d1), max(d1) from lot_of_values ;
                                          QUERY PLAN                                                             
---------------------------------------------
Result  (cost=6.93..6.94 rows=1 width=0)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
  InitPlan 2 (returns $1)
    ->  Limit  (cost=0.00..3.46 rows=1 width=8)
          ->  Index Only Scan Backward using lot_of_values_d1_idx on lot_of_values  (cost=0.00..34634365.96 rows=10000000 width=8)
                Index Cond: (d1 IS NOT NULL)
(9 rows) stuff=# explain select min(i), max(i), avg(i) from lot_of_values where i between 1234 and 2345;
                                             QUERY PLAN                                               
---------------------------------------------
Aggregate  (cost=66.90..66.91 rows=1 width=4)
  ->  Index Only Scan using lot_of_values_pkey on lot_of_values  (cost=0.00..58.21 rows=1159 width=4)
        Index Cond: ((i >= 1234) AND (i <= 2345))
(3 rows)


The aggregation cases are special.  Index-only scans are not useful for count(*) without condition, because the index scan needs to check the visibility of the tuple, which makes it expensive. So, if you need to count the entire table, a sequential scan must be perfomed.

Just for testing purposes, we’ll try to “turn off” the seqscan node, to force an Index-only scan:


stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
               QUERY PLAN                                                       
---------------------------------------------
Aggregate  (cost=213496.00..213496.01 rows=1 width=4) (actual time=57865.943..57865.946 rows=1 loops=1)
  Output: count(i)
  Buffers: shared hit=2380 read=86116
  ->  Seq Scan on public.lot_of_values  (cost=0.00..188496.00 rows=10000000 width=4) (actual time=0.667..30219.806 rows=10000000 loops=1)
        Output: i, t1, r1, r2, r3, d1
        Buffers: shared hit=2380 read=86116
Total runtime: 57866.166 ms
(7 rows) stuff=# set enable_seqscan=off;
SET
stuff=# explain (analyze true, costs true, buffers true, timing true, verbose true) select count(i) from lot_of_values;
                                                          QUERY PLAN                                    
---------------------------------------------
Aggregate  (cost=351292.03..351292.04 rows=1 width=4) (actual time=64094.544..64094.547 rows=1 loops=1)
  Output: count(i)
  Buffers: shared read=110380
  ->  Index Only Scan using lot_of_values_pkey on public.lot_of_values  (cost=0.00..326292.03 rows=10000000 width=4) (actual time=38.773..35825.761 rows=10000000 loops=1)
        Output: i
        Heap Fetches: 10000000
        Buffers: shared read=110380
Total runtime: 64094.777 ms
(8 rows)



After a Vacuum, the plan changed and the cost drops to 262793.04.

For partitioning, as we expected, this works as well (in this example, we’ll use another table called ‘persons’ with ‘dni’ as PK column):


coches=# explain (analyze true, costs true, buffers true, timing true, verbose true)  select dni from persons where dni between 2100111 and 2110222;
Result  (cost=0.00..168.62 rows=22 width=8) (actual time=61.468..61.468 rows=0 loops=1)
  Output: persons.dni
  Buffers: shared hit=43 read=1
  ->  Append  (cost=0.00..168.62 rows=22 width=8) (actual time=61.442..61.442 rows=0 loops=1)
        Buffers: shared hit=43 read=1
        ->  Seq Scan onpersons  (cost=0.00..0.00 rows=1 width=8) (actual time=0.156..0.156 rows=0 loops=1)
              Output:persona.dni
              Filter: (((persona.dni)::bigint >= 2100111) AND ((persona.dni)::bigint <= 2110222))
        ->  Index Only Scan using persons_200_pkey on persons_200 persons  (cost=0.00..8.38 rows=1 width=8) (actual time=0.405..0.405 rows=0 loops=1)
              Output:persona.dni
              Index Cond: ((persons.dni >= 2100111) AND (persons.dni <= 2110222))
              Heap Fetches: 0
              Buffers: shared hit=5

…. LOT OF PARTITIONS ….
Total runtime: 11.045 ms
(114 rows)


Conclusion: this feature adds one of the most exciting performance improvements in Postgres. We see improvements at as much as 30% so far, and look forward to seeing how this scales as 9.2 becomes production-ready.


Considerations about text searchs in big fields and planner costs

Imagine the following scenario: you have a table with a small set of records with columns containing  tsvector and text data type. But, text fields has almost 20 megabytes of text or more.

So, what happens? Postgres planner checks the amount of reltuples (tuples) in the relation to calculate the cost of data extraction. If the amount of tuples reached by the query is higher than the 20% of the total result set or the amount of blocks in the relation is too small, then the planner will choose sequential scan, otherwise random access will be the chosen one.

But, in that case we face a problem. The amount of rows and blocks in the relation is not the “total” content. That happens because Postgres use a technique called TOAST (The Oversized  Attribute Storage Technique) which store all the data that is over 200 bytes, apart from the block. So, in the relation you will have only the firsts bytes of your column data. You can modify some options about how the columns get toasted (the level of compression), but the default behaviour is compress and store apart. This methodology is pretty cool, because if you have i.e. 500MB of data in only one field, your relation will contain all that data in the same file and in the most usual cases that is not convenient.

So, meanwhile you have small number of rows, the amount of data to read from disk is quite big in relation of the data stored in the relfilenode (the data file). But for FTS, we need to dig into all the content and for make the searchs faster, we need GIN or GIST indexes. This type of indexes, stores the data inside the index (differently from B-Tree indexes), making possible the search using the index.

So, you want to access by index to your data, but the planner estimates that the amount of rows is so small that is better to access by sequential scan instead of use index. So, the only way to force the read of index is *SET enable_seqscan TO off;* before execute your query. This variable is per session, so you might want to add it inside the transaction and set again to “on” for other queries.

But, by the way, once your table is populated with enough amount of data, you wouldn’t need anymore this execution, but if your table is quite static, could be a little fix. Besides this, it doesn’t represents a problem of performance, but this kind of things could show you how the planner works.

Let’s see an example:

CREATE TABLE documents_en (
   author_id char(2),
   text_id serial,
   author varchar(45) NOT NULL,
   title varchar(127) NOT NULL,
   content text NOT NULL,
   cont_ftsed tsvector  ,
   PRIMARY KEY (author_id, text_id)
);

CREATE FUNCTION ftsed_upd_en() RETURNS trigger AS $$
DECLARE
BEGIN
NEW.cont_ftsed:=(setweight(to_tsvector('english', coalesce(NEW.title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(NEW.author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(NEW.content, '')), 'C'));
RETURN NEW;
END
$$
LANGUAGE plpgsql;

CREATE TRIGGER tgr_ins_ftsed_en BEFORE INSERT ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE TRIGGER tgr_upd_ftsed_en BEFORE UPDATE OF author,title,content
ON documents_en
FOR EACH ROW EXECUTE PROCEDURE ftsed_upd_en();

CREATE INDEX documents_en ON nspc_es USING gin(
   (setweight(to_tsvector('english', coalesce(title,   '')), 'A') ||
    setweight(to_tsvector('english', coalesce(author,  '')), 'B') ||
    setweight(to_tsvector('english', coalesce(content, '')), 'C'))
);

CREATE INDEX ix_ftsed ON documents_en USING GIST(cont_ftsed);


Before continuing, we need to clarify 3 things: text_id is related with author_id, but just for testing we use it as a serial, to avoid collisions in the primary key; we use triggers to automatically set up the tsvector column; the last thing is the indexes, I created 2 indexes to show the examples. IMHO I don’t recommend functional indexes, for performance purposes, but is still an option.

To fill up the table, I’ve made a script to catch up all the READMEs in the server and put them into the DB.


#!/bin/bash
FILENAME="$@"
TMPBFF=temp_file
PSQL=/opt/pg92dev/bin/psql
WORDS=words

for _FILE_ in $FILENAME
do
 egrep  -oi '[a-z]+' $_FILE_ | egrep -i '[a-z]+{4}+' > $WORDS
 AUTHOR=$(cat $WORDS | sort -u | head -n2 | awk '{printf("%s ", $0) }')
 AUTHOR_ID=${AUTHOR:0:2}
 TITLE=$(cat ${WORDS} | sort | uniq -c | sort -nr | head -n3 | awk '{printf("%s ", $2)}')
 BODY=$(cat $WORDS | sort -u | awk '{printf("%s ", $0) }')

 $PSQL -Upostgres fts -c "INSERT INTO documents_en (author_id, author, title, content) VALUES ('${AUTHOR_ID}', '${AUTHOR}', '${TITLE}','${BODY}' );"
done


I know, is a bit tricky, but useful for this test. to execute it, you’ll need something like:

locate README.txt | xargs ./script

First let’s get some stats from the table:

fts=# select count(*) from documents_en ;
count : 22
fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty :  24 kB (size of the table)

fts=# select relpages from pg_class where relname = 'documents_en';
relpages :3 (8kb each)

Now, a simple query with a EXPLAIN ANALYZE:

fts=# EXPLAIN ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..
3.27 rows=1 width=34) (actual time=0.228..0.464 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime:
0.501 ms
(4 rows)


Oh! It’s using sequential scan (as we expect ), but now let’s add a trick:

fts=# SET enable_seqscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..
8.27 rows=1 width=34) (actual time=0.127..0.191 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime:
0.280 ms
(4 rows)

What happens? The cost is higher but is faster? The fast answer is YES, is possible. Cost is based on estimations in the amount of blocks and type of access. Random access use to has 4:1 more cost in relation with the sequential accesses. In this particular case, the table is small so the planner estimates that is cheaper to read it without indexes. BUT that’s not the real problem, the real one is that the data is already TOASTED outside the “data file” and this data is not considered by the planner in this case.

Another question is: can we improve the access to this data? The answer is YES, if you are not complaining about the storage. What you can do is:

fts=# alter table documents_en alter column cont_ftsed set storage external;
ALTER TABLE

That will uncompress the data of the column cont_ftsed and force to store it outside the table (this is not really at this way, but for be didactic we will explain it like this).

fts=# vacuum  full analyze documents_en;
VACUUM
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                       QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..8.27 rows=1 width=34) (actual time=0.071..0.127 rows=2 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 1
Total runtime:
0.178 ms
(4 rows)

fts=# SET enable_seqscan TO on;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                             QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
Seq Scan on documents_en  (cost=0.00..3.27 rows=1 width=34) (actual time=0.067..0.292 rows=2 loops=1)
  Filter: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 20
Total runtime:
0.329 ms
(4 rows)

In this case we use VACUUM because the SET STORAGE option doesn’t make any changes to the previous stored data, and its effects are after. So with vacuum, we are forcing to apply the changes in all the table.

Ok, what happen if the table gets filled with more data? Automatically the planner will start to use the index:

fts=# select pg_size_pretty(pg_relation_size('documents_en'));
pg_size_pretty
----------------
1744 kB
(1 row)
fts=# SELECT count(*)  FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
count
-------
   45
(1 row)


fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                     QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on documents_en  (cost=4.38..
47.02 rows=13 width=31) (actual time=1.091..5.404 rows=45 loops=1)
  Recheck Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
  ->  Bitmap Index Scan on ix_ftsed_en  (cost=0.00..4.37 rows=13 width=0) (actual time=1.003..1.003 rows=125 loops=1)
        Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
Total runtime:
5.489 ms
(6 rows)

fts=#
SET enable_bitmapscan TO off;
SET
fts=# explain ANALYZE  SELECT author, title
   FROM documents_en WHERE cont_ftsed @@ to_tsquery('english','editor&find');
                                                         QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_ftsed_en on documents_en  (cost=0.00..
56.50 rows=13 width=31) (actual time=0.307..5.286 rows=45 loops=1)
  Index Cond: (cont_ftsed @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Index Recheck: 80
Total runtime:
5.390 ms
(4 rows)

We don’t see a huge difference in terms of time between each other, but we still facing the question about the relation between cost/real-performance.

But, don’t forget the functional index! I think you will like the following example:

fts=# analyze;
ANALYZE
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                             QUERY PLAN                                                                                                                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on documents_en  
(cost=0.00..292.20 rows=17 width=7) (actual time=2.960..3878.627 rows=45 loops=1)
  Filter: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(author
, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  Rows Removed by Filter: 2238
Total runtime:
3878.714 ms
(4 rows)

fts=#
SET enable_seqscan TO off;
SET
fts=# explain analyze select author_id, text_id from documents_en

where (((setweight(to_tsvector('english'::regconfig, COALESCE(title, ''::character varying)::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, COALESCE(author, ''::character varying)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char"))) @@ to_tsquery('english','editor&find');
                                                                                                                                                                                  QUERY PLAN                                                                                                                                                                                 
------------------------------------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on documents_en  
(cost=272.15..326.46 rows=17 width=7) (actual time=0.937..1.026 rows=45 loops=1)
  Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALESCE(
author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquery)
  ->  Bitmap Index Scan on textsearch_en  (cost=0.00..272.15 rows=17 width=0) (actual time=0.916..0.916 rows=45 loops=1)
        Index Cond: (((setweight(to_tsvector('english'::regconfig, (COALESCE(title, ''::character varying))::text), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (COALE
SCE(author, ''::character varying))::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, COALESCE(content, ''::text)), 'C'::"char")) @@ '''editor'' & ''find'''::tsquer
y)
Total runtime:
1.109 ms
(5 rows)

Again! But hits time it is more stronger the difference in terms of time. Is not so pretty, ugh? So, It seems that you will need to be careful if you use searchs in big columns, that are toasted. Maybe you should check in you production database if you need to add some tweaks to get the maximum performance to your Postgres instance.

The version used on this post is PostgreSQL 9.2devel on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.6.2 20111027 (Red Hat 4.6.2-1), 64-bit. Besides this, I made tests in 9.1 and still see the same thing’.

Happy hacking!





Non-deterministic Functions and the Binary Log

I wrote this post because I ran across this issue when debugging why tables with triggers/functions that were not getting replicated to slaves. The problem appears when binlog_format is set up  as STATEMENT. You can check the ‘change log’ checking the following link for more information about default values [1].

If you have non-deterministic functions  [2]  that insert or modify your data with dynamic SQL like this :

<CODE>
DELIMITER $$
CREATE FUNCTION `insert_dynamic_data` (in_param VARCHAR(30)) RETURNS bigint(20)
BEGIN
 ....
 INSERT INTO `table_to_ins` (param) VALUES(in_param);
 ...
 /* You can also add here @@warning_count or row_count() check
     and return -1 to catch errors*/
 return LAST_INSERT_ID();
END
$$
</CODE>

The problem starts when you plan to user “replication” through binary logs events. In that cases, you must consider:

"If a stored program that modifies data is nondeterministic, it is not repeatable. This can result in different data on a master and slave, or cause restored data to differ from the original data."  [3]

When you create a function, by default is NOT DETERMINISTIC, which means that for each execution could the code will be executed again. In the other hand, DETERMINISTIC will check if the parameters are the same and if is that condition comes true, will return the same result as before without executing again the code.

If you want to create the function with the binlog_format in STATEMENT and log_bin activated without the log_bin_trust_function_creators, you will get the following error:

Version 5.1.41:
mysql> DELIMITER $$
mysql> CREATE FUNCTION fx_pru () RETURNS int
    -> BEGIN
    ->  insert into prueba select 'po',round(rand()*100), rand();
    -> RETURN 1;
    -> END
    -> $$
ERROR 1418 (HY000): This function has none of DETERMINISTIC, NO SQL, or READS SQL DATA in its declaration and binary logging is enabled (you *might* want to use the less safe log_bin_trust_function_creators variable)

What happened so? The problem was that the funciton were created previous to activate the log_bin.

You have to ways to fix it: activating binlog_format as ROW or declare in your /etc/my.cnf the log-bin-trust-function-creators which allows insertions in the binary log without enforcing deterministic property. In the specified case, we cannot declare this function as DETERMINISTIC because it uses dynamic values in the parameters  (always we expect different values and different results).

What happens with the execution of the function? Basically, it creates a lock contention ( if you usually use Nagios, you will see a mysql-lock-contention alarm), but MySQL will not raise any error or alarm, which causes confusion when trying to find the error. Other statements that run outside the function will be executed normally. In the previous example, this function was only for inserts, but every statement inside the function will not work. Derived  from this problem, statements inside the function will not be executed directly.

You wouldn't  see any errors. The only thing you could realize is the lock contention. The lock is generated because the statement hangs a lock into the table trying to reach a commit that never comes.

It is a bug? No, is a feature. It prevents unknown and potentially slow executions from being replicated, causing lagging into the replication process.

I recommend a link about this and related problems with triggers: [4]

1: http://www.filehippo.com/es/download_mysql/changelog/6962/
2: http://en.wikipedia.org/wiki/Nondeterministic_algorithm
3: http://dev.mysql.com/doc/refman/5.1/en/stored-programs-logging.html
4: http://dba.stackexchange.com/questions/321/dynamic-sql-in-mysql-stored-routines

Indexing text columns with GIST or GIN to optimize LIKE, ILIKE using pg_trgm in Postgres 9.1 (Part 1)

 
“The pg_trgm module provides functions and operators for determining the similarity of ASCII alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.”
 
This is the introduction to the official documentation of the extension at [1]. Note, that I used the EXTENSION terminology instead of CONTRIB, as in PostgreSQL 9.1.  Now we’ll use the CREATE EXTENSION command to include this module in the database. This new methodology allows us to manage modules installations/uninstallations with only a few commands.

The idea of this post is to show you how KNN GIST and pg_trgm could be used together to obtain interesting results. First, let’s start with some basic elements of pg_trgm.

Installation

Installing the module is easy. If you are installing through the use of source code, you must compile the module and once you get access to the database execute:

CREATE EXTENSION pg_trgm;

That’s it! Installation complete!

What is a Trigram and how use them?


A trigram is a group of three consecutive characters in a string that can be used to detect the similarity of two words (for example) or the ‘distance’ between them.

When we talk about ‘distance’, 0 means in the same place and 1 is very far. When we talk about similarity, 1 is equal and 0 is totally different. In other words, distance is 1 minus the similarity value. These concepts about distance and similarity, are necessary to start without confusion.

What’s is a trigram? A trigram is a group of three consecutive characters from a string, used to know the similarity between two strings by counting the trigrams they share.

A trigram looks like this:

palominodb=# select show_trgm('PalominoDB CO');
                              show_trgm
-----------------------------------------------------------------------
{"  c","  p"," co"," pa",alo,"co ","db ",ino,lom,min,nod,odb,omi,pal}
(1 row)

In the pg_trgm extension we have functions and operators. show_limit and set_limit are functions used to set up and show the similarity threshold for the % operator. This operator takes the form “string1 % string2” and returns a boolean type (“t” if the similarity is greater than the similarity threshold, otherwise it returns “f”).

palominodb=# select show_limit();
show_limit
       0.4

palominodb=# select set_limit(0.3), show_limit();
set_limit | show_limit
         0.3 |    0.3

In the following example we’ll see the use of each one. Operator % will return true if the similarity of the strings is greater than similarity threshold returned by show_limit function. In this example, both string are equal, in consequence, the operation will return true:

palominodb=# select similarity('Palomino','Palomino')  AS Similarity,
                                  'Palomino'<->'Palomino'              AS distance,
                                  'Palomino' % 'Palomino'              AS SimilarOrNot;
-[ RECORD 1 ]+--
similarity          | 1
distance     | 0
similarornot     | t

Index Support and usage


Now let’s discuss combining GIST or GIN and pg_trgm. pg_trgm includes an operator class to support searches using similarity, like, or ilike operators. GIN and GIST have several differences. If you don’t know which to choose, just remember a few rules: GIN searches quicker than GIST but is slower to update; if you have a write-intensive table use GIST. GIN is better for static data. Please be aware, however, that they don’t support exact matching with the equals operator!  You can do an exact match using like/ilike with no wildcards.  If you want to use the equals operator(=), you must create a standard BTREE index on the pertinent column.

In the following examples, we’ll show a table only with a GIST index. As you can see, if you want to match the exact value with equal operator, it will scan the whole table:

palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto = 'Palomino';
                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on texto_busqueda  (cost=0.00..90.15 rows=1 width=136) (actual time=16.835..16.846 rows=1 loops=1)
  Filter: (texto = 'Palomino'::text)
Total runtime: 17.094 ms
(3 rows)

But, if we use LIKE operator, index scan will be activated:

palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto like 'Palomino';
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using texto_busqueda_texto_idx on texto_busqueda  (cost=0.00..8.27 rows=1 width=136) (actual time=0.374..1.780 rows=1 loops=1)
  Index Cond: (texto ~~ 'Palomino'::text)
Total runtime: 1.979 ms
(3 rows)


palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto like '%Palomino%';
                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Scan using texto_busqueda_texto_idx on texto_busqueda  (cost=0.00..8.27 rows=1 width=136) (actual time=0.171..1.732 rows=1 loops=1)
  Index Cond: (texto ~~ '%Palomino%'::text)
Total runtime: 1.882 ms
(3 rows)



To use an index for  match equal strings, we need to create a BTREE index. But in case of BTREE there is a limitation of 8191 bytes per index row. So, if you have very large text columns you will not allowed to create a BTREE index without using functional indexes.

We get this result because, unlike BTREE indexes, the search string is not left-anchored.

The creation of indexes with the pg_trgm operator class is simple:

CREATE INDEX ON texto_busqueda USING GIST(texto gist_trgm_ops);
or
CREATE INDEX ON texto_busqueda USING GIN(texto gin_trgm_ops);

If you want a more comprehensive understanding of GIST or GIN implementation on Postgres, you can download the the source code and read  src/backend/access/gist/README and src/backend/access/gin/README.


Another useful technique

Combining % operator to get the strings that have a similarity greater than the established threshold and similarity function -that returns the similarity-, we can get ordered from the most similar to the less one discarding all the strings that aren’t similar enough:

palominodb=# SELECT ctid, similarity(texto, 'Palominodb') AS simil
palominodb-#  FROM texto_busqueda
palominodb-#  WHERE texto % 'Palominodb'
palominodb-#  ORDER BY simil DESC;
 ctid  |  simil
--------+----------
(55,3) | 0.666667
(1 row)

The same query, but with its EXPLAIN plan shows that the generated index is used in the condition:

palominodb=# EXPLAIN ANALYZE  SELECT ctid, similarity(texto, 'Palominodb') AS sml
palominodb-#  FROM texto_busqueda
palominodb-#  WHERE texto % 'Palominodb'
palominodb-#  ORDER BY sml DESC;
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Sort  (cost=14.26..14.27 rows=3 width=138) (actual time=3.428..3.437 rows=1 loops=1)
  Sort Key: (similarity(texto, 'Palominodb'::text))
  Sort Method: quicksort  Memory: 17kB
  ->  Bitmap Heap Scan on texto_busqueda  (cost=4.28..14.24 rows=3 width=138) (actual time=3.336..3.383 rows=1 loops=1)
        Recheck Cond: (texto % 'Palominodb'::text)
        ->  Bitmap Index Scan on texto_busqueda_texto_idx  (cost=0.00..4.28 rows=3 width=0) (actual time=3.278..3.278 rows=1 loops=1)
              Index Cond: (texto % 'Palominodb'::text)
Total runtime: 3.578 ms
(8 rows)


Well, this is the first part. Hope you enjoyed the reading and I’ll wait your comments and feedback!

 


[1] http://www.postgresql.org/docs/9.1/static/pgtrgm.html

Postgresql 9.1 - Part 1: General Features

General scope

 
Postgresql 9.1 runs over the theme “features, innovation and extensibility” and it really does. This version was born to overcome Postgresql 9.0 ‘s limitations and known bugs in replication. If you are developing over 9.0, it’s time to think seriously about preparing your code for Postgresql 9.1.
 
The intent of this series of posts are not to be another release features posts. I offer a vision based on my personal experience and focus on the features that I saw exciting for the most of the projects where I’m involved. If you want to read an excellent general article about the new features of this version, web to [2].
 
At the moment of this post, the last PostgreSQL version is 9.1.1 . It includes 11 commits to fix GIST memory leaks, VACUUM improvements, catalog fixes and others. A description of the minor release can be check at [3].
 
The main features included are:

  • Synchronous Replication
  • Foreign Data Support
  • Per Column collation support
  • True SSI (Serializable Snapshot Isolation)
  • Unlogged tables for ephemeral data
  • Writable Common Table Expressions
  • K-nearest-neighbor added to GIST indexes
  • Se-linux integration with the SECURITY LEVEL command
  • Update the PL/Python server-side language
  • To come: PGXN Client for install extensions easily from the command line. More information: http://pgxn.org/faq/  The source will be on https://github.com/pgxn/pgxn-client



Some of these features could be considered minor, but many think they are very cool while using 9.1  in their environments.

Considerations before migrating


If you are an old Pg user, you may already know the migration risks listed on the next page. Still, I advise that you note and carefully learn about these risks. Many users freeze their developments to older versions simply because they didn’t know how to solve new issues. The most notable case is when 8.3 stopped using implicit casts for some datatypes and many queries didn’t work as a result.  



There are some important changes that could affect your queries, so take a pen and note:

  • The default value of standard_conforming_strings is now turned on by default. That means that backslashes are normal characters (which is the SQL standard behavior). So, if you have backslashes in your SQL code, you must add E’’ strings. For example: E’Don’t’
  • Function-style and attribute-style data type casts were disallowed for composite types. If you have code like value_composite.text or text(value_composite), you will need to use CAST or :: operator.
  • Whereas before the checks were skipped, domains are now based on arrays when they are updated, which results in a rechecking of the constraints.
  • String_to_array function returns now an empty array for a zero-length string (before it returned NULL). The same function splits into characters if you use the NULL separator.
  • The inclusion of the INSTEAD OF action for triggers will require you to recheck the logic of your triggers.
  • If you are an actual 9.0 replication user, you may know that in 9.1 you can control the side effects of VACUUM operations during big queries execution and replication. This is a really important improvement. Basically, if you run a big query in the slave server and the master starts a VACUUM process, the slave server can request the master postpone the cleanup of death rows that are being used by the query.


Brief description of main features


Don’t worry about the details, we’ll cover each feature in future posts.

  • Synchronous Replication
    • This feature enhances the durability of the data. Only one server can be synchronous with the master, the rest of the replicated servers will be asynchronous. If the actual synchronous server goes down, another server will become synchronous (using a list of servers in synchronous_standby_names).  Failover is not automatic, so you must use external tools to activate the standby sync server, one of the most popular is pgpool [4].
  • Foreign Data Support
    • The feature of Foreign Data Wrappers has been included since 8.4, but now it is possible to reach data from any database where a plugin exists. Included in the contribs, is a file called file_fwd, which connects CSV files to a linked table. Basically it provides an interface to connect to external data. In my opinion, this is perhaps one of the most useful features of this versions, especially if you have different data sources in your environment.
  • Serializable Snapshot Isolation
    • This new level of serialization is the strictest. Postgres now supports READ COMMITED, REPEATABLE READ (old serializable) and SERIALIZABLE. It uses predicate locking to keep the lock if the write would have an impact on the result. You will not need explicit locks to use this level, due to the automatic protection provided.
  • Unlogged tables
    • Postgres uses the WAL log to have a log of all the data changes to prevent data loss and guarantee consistency in the event of a crash, but it consumes resources and sometimes we have data that we can recover from other sources or that is ephemeral. In these cases, creation of unlogged tables allows the database to have tables without logging into the WAL, reducing the writes to disk. Otherwise, this data will not be replicated, due to the mechanism of replication used by Postgres (through WAL records shipping).
  • Writable Common Table Expressions
    • CTE was included in 8.4 version, but in this version, it was improved to allow you to use writes inside the CTE (WITH clause). This could save a lot of code in your functions.
  • K-nearest-neighbor added to GIST indexes
    • Postgres supports multiple types of indexes; one of them is GiST (Generalized Search Tree). With 9.1, we can define a ‘distance’ for datatypes and use it for with a GiST index. Right now, this feature is implemented for point, pg_trgm contrib and others btree_gist datatypes. The operator for distance is <-> . Another feature you will enjoy is that LIKE and ILIKE operators can use the tgrm index without scanning the whole table.
  • SE-Linux integration
    • Postgres is now the first database to be fully integrated with military security-grade. SECURITY LABEL applies a security label to a database object. This facility is intended to allow integration with label-based mandatory access control (MAC) systems such as SE-Linux instead of the more traditional access control - discretionary with users and groups. (DAC).



References:

[1] http://www.postgresql.org/docs/9.1/static/release-9-1.html
[2] http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.1
[3] http://www.postgresql.org/docs/9.1/static/release-9-1-1.html
[4] http://pgpool.projects.postgresql.org/


Syndicate content
Website by Digital Loom