Process

When Disaster Strikes: Hurricane Sandy

 

We devoted the Palomino Newsletter this month to the important topic of disaster recovery, in light of the challenges posed by Hurricane Sandy. If you're not already receiving our newsletter, you can subscribe here.


Hurricane Sandy has been on many people's minds of late; mine not least.  Having lived the last 4 years of my life in Manhattan and on the Jersey Shore, I am in shock at the loss of lives, the destruction of homes, businesses and memories, and the disruption of so much.  I grew up in Louisiana, and hurricanes were a way of life.  You didn't do something hoping that a hurricane would not come by.  You assumed a hurricane would come.  At least, that's how I was taught.  That's the mentality I try to bring into my architectures, my process and my planning as well.  So, when Hurricane Sandy bore down on the East Coast, my alarm bells started ringing, just as my email started exploding.  Every one of our US-East Amazon customers was in danger.  Who knew when power would go out?  And when it would come back?

Palomino is proud to be an Amazon Web Services consulting partner. That said, we recognize that Amazon has had its share of instability.  A few weeks ago, US-East experienced some significant EBS latency and unavailability.  We've lost availability zones.  We've lost regions.  We've found availability zones inexplicably unpredictable in terms of latency and availability.  Amazon forces us to think resiliently.  Not in preventing disasters, but weathering them, bouncing back, and being ready.  Some say this is an issue with Amazon.  That the unreliability is a drawback.  Perhaps I'm the eternal optimist, but I simply see it as a way to force rigor in anticipating, documenting and practicing our availability and business continuity plans.

None of this is new or incredibly enlightening.  Any operations person worth their salt thinks of failure and what can go wrong, and they think of it often.  So what's the point here?  I thought I'd share the war stories of the weekend to help cast a light on varying degrees of preparation.

 

Client One

Client One contacted us.  They had anticipated the problem and had already been preparing to create multi-region EC2 environments; Sandy just accelerated things.  This client is in RDS, Amazon's Relational Database Service - in this case MySQL as a service.  RDS is such a convenient tool, until it isn't.  One of the big drawbacks? No cross-region support.  Yes, you can use Multi-AZ replication for master availability across availability zones.  Yes, you can also create replicas in multiple availability zones.  If you do both of these things, you've got a certain level of fault tolerance in place.  You can still get hurt if your master does a Multi-AZ failover.  All of your replicas will break, as RDS gives you no way to move them manually to the next binlog when a master crashes before closing its binlogs.  Thus, you are without replicas.  Not great.  But you have a working master.  Similarly, you have multiple replicas across AZs, to tolerate those failures.  But cross-region?  Nothing.

So, we had to dump all of our RDS instances and load them into RDS in another region.  Parallel dumps and loads were kicked off, accelerating the very painful process of a logical rebuild of a system.  We used SSD ephemeral storage on EC2 to speed this up as well.  The process still took 2 days.  OpenVPNs had to be set up with mappings for port 3306 to allow replication.  If this hadn't already been in process before Sandy was a threat, we never would have been ready in time.  We still had and have issues.  You can't replicate from RDS in one region to another.  Custom ETL must be created in order to keep each table as in sync as possible.  We'd done this work in a previous plan to move off of RDS, mapping tables to one of three categories - static (read-only), insert-only, or update/delete.  Static tables just need to be monitored for changes.  Insert-only tables can be kept close to fresh with high water marks and batch inserts.  Update/delete (transactional) tables require keys on their updated-at and created-at fields, and confidence in the values in those fields.  Deletes present even bigger problems.  Digging in further is out of scope here, but consider it a future topic.
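To make the high water mark idea for insert-only tables concrete, here is a minimal Python sketch.  It is not the ETL we built for Client One: it uses two in-memory SQLite databases as stand-ins for the source and destination RDS instances, and the events table, its columns and the copy_new_rows function are all hypothetical names chosen for illustration.  It also assumes the table has a monotonically increasing integer key.

```python
import sqlite3

def copy_new_rows(src, dst, batch_size=1000):
    """Copy rows from src.events to dst.events above dst's high water mark."""
    # High water mark: the largest id already present on the destination.
    (hwm,) = dst.execute("SELECT COALESCE(MAX(id), 0) FROM events").fetchone()
    copied = 0
    while True:
        # Fetch the next batch of rows the destination has not seen yet.
        rows = src.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (hwm, batch_size),
        ).fetchall()
        if not rows:
            return copied
        dst.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", rows)
        dst.commit()
        hwm = rows[-1][0]  # advance the mark to the last id copied
        copied += len(rows)

def make_db():
    # Hypothetical insert-only table; SQLite stands in for MySQL here.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    return db

source, replica = make_db(), make_db()
source.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [("row %d" % i,) for i in range(2500)],
)
source.commit()
first_pass = copy_new_rows(source, replica)    # initial load, in batches
source.execute("INSERT INTO events (payload) VALUES ('late arrival')")
source.commit()
second_pass = copy_new_rows(source, replica)   # picks up only the new row
```

Run on a schedule, each pass copies only what arrived since the previous pass, which is why this category is the easy one.  Updates and deletes cannot be caught this way, since they do not move the mark.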

Summary: Client One was in-process for multi-region disaster recovery (DR).  A fire-drill occurred, and people had to work long, long hours doing tedious work.  But, had Sandy hit their region with the force it hit further north, we'd have been ready.

 

Client Two

Client Two contacted us also.  They had known that they were at risk, but they were small, they were pushing new features and refactoring applications, and DR was far out on their roadmap.  They, too, were on RDS.  They could not afford the amount of custom work our larger clients requested, so we had to create a best-effort approach.  RDS instances were created in Portland, along with cache servers, transaction engines, web services and the rest of the stack. Amazon Machine Images (AMIs) were kicked out, and we built a dump and copy process across regions.  There would be data loss, up to many hours, if the region went down and never came up.  But they would not be dead in the water.  Data loss can be mitigated by more frequent dumps and copies, but not eliminated completely.

Summary: Client Two had no plans for multi-region DR.  They had taken a conscious risk.  Luckily they had the talent and agility of a small company and could move fast with our help.  Failing over would have hurt, but they'd still be alive.

 

Client Three

We reached out proactively to Client Three. They had put together a multi-region plan for critical systems last year before we started working with them, which included scripts to rapidly build out new clusters of Hadoop-based systems.  It was supposed to just work.  When we started working with Client Three, we’d scheduled our DR review, testing and modernizing for our Q4 checklist.  Too little, too late, right?  Sure enough, things didn't "just work".  It wasn't horrible, but it took a weekend of cleaning up, rescripting and fixing problems as they arose.  But had we had to fail over?  They would've been ready.

Summary: Client Three had anticipated and architected DR, but they hadn’t tested it.  Luckily we had the days before the storm to test and to fix this.  If they hadn't planned at all, I'm not sure we would've made it.

 

It’s also worth remembering that you are not alone in these shared environments.  All weekend, shops were staking claims on instances and storage, and building out.  Rolling out resources got slower, and if you didn't claim, you'd lose out.  This has to be considered in your plans.  

 

To recap: Palomino loves AWS.  We're a consulting partner and have helped many clients in many different business models deploy, scale and perform in AWS.  But DR is not a luxury anymore.  It's a necessity.  Architectures have to take multi-AZ and multi-region plans into consideration from the beginning.  Many people use AWS so they save money on hardware.  They get upset when you point out the labor and extra instances needed to guarantee they can weather these storms.  But it's a hard reality.  It's one of the reasons we recommend RDS only in early phases, when downtime is tolerable.  Good configuration management also means you can deploy a skeleton infrastructure in another region; you can explode that to a full-blown install with ease.  But you have to practice, and you have to move fast.  If you think your region can go down, go to DEFCON and push the buttons.  If you're wrong, you can always tear back down.

Anticipate.
Plan.
Build it early.
Automate it.
Test it.
Test it.
Test it.
Test it.

If you haven't been able to donate to the Red Cross or other institutions helping our fellow brothers and sisters in the Northeast and in the Caribbean, please take some time to do so.  Having lost property and cared for loved ones displaced by Katrina, and now hearing so many horror stories from New Jersey and New York, I urge everyone to donate money, donate shelter, donate time and skills if you have them.  

 

Mystery Solved: Replication lag in InnoDB

 

While running a backup with XtraBackup against a slave server we noticed that replication was lagging significantly. The root cause wasn't clear, but we noticed that DML statements from replication were just hanging for a long time. Replication wasn't always hanging, but it happened so frequently that a 24 hour backup caused replication to lag 11 hours.

The first hypothesis was that all the writes generated from replication (relay log, bin log, redo log, etc.) were causing too much I/O contention while XtraBackup was reading the files from disk. However, the redo log wasn't hitting 75% full, which meant that InnoDB wasn't doing aggressive flushing - some other contention was causing replication to stall.
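As a rough illustration of how redo log usage can be checked, the Python sketch below computes the checkpoint age from the LOG section of "SHOW ENGINE INNODB STATUS" and expresses it as a percentage of total redo capacity.  Several things here are assumptions: the sample text and its numbers are made up, the parser expects the single-number LSN format printed by MySQL 5.5, and capacity is approximated as innodb_log_file_size times the number of log files (2 by default), which slightly overstates the usable space.

```python
import re

def checkpoint_age_pct(status_text, log_file_size, files_in_group=2):
    """Return redo log usage (checkpoint age) as a percentage of redo capacity."""
    lsn = int(re.search(r"Log sequence number\s+(\d+)", status_text).group(1))
    ckpt = int(re.search(r"Last checkpoint at\s+(\d+)", status_text).group(1))
    # Approximate capacity: size of one log file times the number of files.
    capacity = log_file_size * files_in_group
    return 100.0 * (lsn - ckpt) / capacity

# Illustrative excerpt only; the numbers are invented for the example.
sample = """
---
LOG
---
Log sequence number 5000000000
Log flushed up to   4999990000
Last checkpoint at  4932891136
"""

# 64 MiB of un-checkpointed redo against 2 x 128M log files.
pct = checkpoint_age_pct(sample, 128 * 1024 * 1024)
```

With these invented numbers the result is 25%, well below the 75% mark mentioned above.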

After various tests, we found that disabling innodb_auto_lru_dump solved the issue. It wasn’t entirely clear what the relation was between the lru dump and replication lag during backup, but it was very easy to reproduce. Enabling lru dump at runtime was immediately causing replication to lag, and disabling it restored replication back to normal.

Also, when innodb_auto_lru_dump was enabled we noticed that from time to time the simple command "SHOW ENGINE INNODB STATUS" was hanging for 2-3 minutes.

To attempt to reproduce the issue outside this production environment, we tried to run various benchmarks using sysbench, with and without auto lru dump. The sbtest table (~20GB on disk) was created using the following command:

sysbench --test=oltp --mysql-table-engine=innodb --mysql-user=root --oltp-table-size=100000000 prepare

The InnoDB settings were:

innodb_buffer_pool_size = 10G

innodb_flush_log_at_trx_commit = 2

innodb_thread_concurrency = 0

innodb_flush_method=O_DIRECT

innodb_log_file_size=128M

innodb_file_per_table

 

The various benchmarks were run using:

- read-only workload vs read-write workload;

- small buffer pool vs large buffer pool (from 2G to 30G)

- small number of threads vs large number of threads

 

None of the above benchmarks showed any significant difference with auto lru dump enabled or disabled. Perhaps these workloads were not really reproducing our environment where we were getting issues with auto lru dump. We therefore started a new series of benchmarks with only one thread doing mainly writes - this is the workload we expect in a slave used only for replication and backups.

The workload with sysbench was modified to perform more writes than reads, yet the result of the benchmark didn't change a lot - enabling or disabling lru dump wasn't producing any significant change in performance. The problem with this benchmark was that it was generating too many writes and filling the redo log. InnoDB was then doing aggressive flushing and this was a bottleneck that was hiding any effect caused by the lru dump.

To prevent the redo log from filling too quickly, we had to change the workload to read a lot of pages, reduce the buffer pool from 30G to 4G, and restart mysqld before every test, prewarming the buffer pool with:

select sql_no_cache count(*), sum(length(c)) FROM sbtest where id between 1 and 20000000;

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=1000 run

 

innodb_auto_lru_dump=0:    transactions: (7.26 per sec.)

innodb_auto_lru_dump=1:    transactions: (6.93 per sec.)

 

This was not a huge difference, but we finally saw some effect of the auto_lru_dump.

It became apparent that the number of transactions per second in the above benchmark was really low because the number of random reads from disk was the bottleneck. To remove this bottleneck, we removed innodb_flush_method=O_DIRECT (therefore using the default flush method), and then ran the following to load the whole table into the OS cache (not into the buffer pool).

 

dd if=sbtest/sbtest.ibd of=/dev/null bs=1M

 

To prevent the redo log from filling up, we also changed the innodb_log_file_size from 128M to 1G.

With these changes - always using a buffer pool of 4G, restarting mysqld before each test, and prewarming the buffer pool with "select sql_no_cache count(*), sum(length(c)) FROM sbtest where id between 1 and 20000000" - we reran the same test, changing the number of requests:

10K transactions:

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=10000 run

 

innodb_auto_lru_dump=0:    transactions: (243.22 per sec.)

innodb_auto_lru_dump=1:    transactions: (230.62 per sec.)

 

50K transactions:

sysbench --num-threads=1 --test=oltp --mysql-user=root --oltp-table-size=100000000 --oltp-index-updates=10 --oltp-non-index-updates=10 --oltp-point-selects=1 --max-requests=50000 run

 

innodb_auto_lru_dump=0:    transactions: (194.31 per sec.)

innodb_auto_lru_dump=1:    transactions: (175.69 per sec.)

 

 

With innodb_auto_lru_dump=1, performance drops by 5-10%!

 

After this, we wanted to run a completely different test with no writes, only reads.

innodb_auto_lru_dump didn't show any difference when sysbench was executed with a read-only workload, and we believe the reason is simply that sysbench wasn't changing many pages in the buffer pool. The easiest way to change pages in the buffer pool is to perform a full scan of a large table with a small buffer pool. We set innodb_flush_method=O_DIRECT, since otherwise reads from the OS cache were too fast and we couldn't detect any effect of innodb_auto_lru_dump. With innodb_buffer_pool_size=4G, and restarting mysqld after each test, this was the result of a full table scan:

 

With innodb_auto_lru_dump=0 :

mysql> select sql_no_cache count(*), sum(length(c)) FROM sbtest;

+-----------+----------------+

| count(*)  | sum(length(c)) |

+-----------+----------------+

| 100000000 |      145342938 |

+-----------+----------------+

1 row in set (3 min 27.22 sec)

 

 

With innodb_auto_lru_dump=1 :

mysql> select sql_no_cache count(*), sum(length(c)) FROM sbtest;

+-----------+----------------+

| count(*)  | sum(length(c)) |

+-----------+----------------+

| 100000000 |      145342938 |

+-----------+----------------+

1 row in set (3 min 38.43 sec)

 

Again, innodb_auto_lru_dump=1 affects performance, increasing the execution time by ~5%.
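For reference, the percentages quoted in this post can be recomputed from the raw figures above.  The short Python sketch below does only that arithmetic; the function names are ours, and the inputs are the sysbench and full table scan results already reported.

```python
def pct_lower_throughput(baseline_tps, tps):
    """Percentage drop in transactions per second relative to the baseline."""
    return 100.0 * (baseline_tps - tps) / baseline_tps

def pct_longer_runtime(baseline_seconds, seconds):
    """Percentage increase in execution time relative to the baseline."""
    return 100.0 * (seconds - baseline_seconds) / baseline_seconds

# sysbench results: (innodb_auto_lru_dump=0, innodb_auto_lru_dump=1)
drop_1k = pct_lower_throughput(7.26, 6.93)        # 1K requests, O_DIRECT
drop_10k = pct_lower_throughput(243.22, 230.62)   # 10K requests
drop_50k = pct_lower_throughput(194.31, 175.69)   # 50K requests

# Full table scan: 3 min 27.22 sec vs 3 min 38.43 sec
scan_off = 3 * 60 + 27.22
scan_on = 3 * 60 + 38.43
scan_slowdown = pct_longer_runtime(scan_off, scan_on)

for label, value in [("1K", drop_1k), ("10K", drop_10k),
                     ("50K", drop_50k), ("scan", scan_slowdown)]:
    print("%-5s %.1f%%" % (label, value))
```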

It is also important to note that innodb_auto_lru_dump seems to affect performance only for some specific workload scenarios. In fact, the majority of the benchmarks we ran weren't showing any performance effect caused by innodb_auto_lru_dump.

 

Testing and Analyzing Performance with Benchmarks

Generic benchmark tools can be very useful for testing performance on your system. These benchmark tools normally have a set of predefined workloads, but often they don't match your specific workload in useful ways.

One of the best ways to reproduce your workload is to have a good sense of the application that uses the database and how it manages requests to the database. If this is not an option, it is also possible to analyze traffic and to find the most common queries, and use those to define the most common workload.

You can analyze traffic in many ways, from tcpdump to general log, from binlog (only for DML statements) to slow query log.

Afterwards it is possible to analyze them with pt-query-digest (or the obsolete mk-query-digest) to find the most common and/or heavy queries.

In the system we analyze here, the workload was mainly write intensive and involved just 4 tables:

  • tableA was receiving single-row INSERT statements;
  • for each INSERT on tableA, on average 200 INSERTs were performed in the other 3 tables, distributed as follows: 100 on tableB, 95 on tableC, 5 on tableD (to be more specific, for each INSERT on tableB there is an INSERT either on tableC or tableD).

 

The system also receives SELECT statements, but only a very small number of them, and they are very simple primary key lookups.

To simulate the workload, we wrote a simple Perl script that spawns a certain number of threads that perform the DML statements, and other threads that perform the SELECT statements.

At regular intervals, the script prints statistics and progress.
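The Perl script itself is not reproduced here.  As an illustration of the statement mix only, the Python sketch below generates the sequence of tables touched for each tableA insert, under the assumption that each tableB insert is paired with tableD 5% of the time and with tableC otherwise.  It does not connect to a database or spawn threads, and the function name and parameters are ours.

```python
import random

def tables_for_one_insert(rng, fanout=100, d_fraction=0.05):
    """Return the tables written for one tableA insert and its follow-on inserts."""
    targets = ["tableA"]
    for _ in range(fanout):
        targets.append("tableB")
        # Each tableB insert is paired with either tableC or tableD.
        targets.append("tableD" if rng.random() < d_fraction else "tableC")
    return targets

rng = random.Random(42)  # fixed seed so runs are repeatable
counts = {}
for _ in range(1000):
    for table in tables_for_one_insert(rng):
        counts[table] = counts.get(table, 0) + 1

# Print simple statistics, as the real script did at regular intervals.
for table in sorted(counts):
    print("%s: %d inserts" % (table, counts[table]))
```

Over many iterations the mix converges on roughly 100 tableB, 95 tableC and 5 tableD inserts per tableA insert, which is the distribution described above.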

The benchmark test was executed in a setup with 2 hosts: one host where the client was running, and another host where the servers were running.

The RDBMS tested were: MariaDB 5.2.3 with TokuDB 5.2.7 and InnoDB, and Percona 5.5.20.

Additionally, Percona 5.5.20 was tested as multiple instances running on the same hosts.

 

The goal of the first benchmark test was to compare TokuDB against InnoDB for this specific workload.

We executed MariaDB with TokuDB with the following (simple) config file:

[mysqld] 
user=mysql 
table_open_cache=1024 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
datadir=/localfio/datadir
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M 
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/tokudb 
 

 

We found the performance of InnoDB significantly better than that of TokuDB in this instance, though this test - where the dataset fits almost entirely in memory - does not show the real power of TokuDB, which excels at insertion rate at scale. Because these tables have very few indexes, TokuDB and Fractal Tree indexes weren't very efficient. Furthermore, the benchmarks were running on FusionIO, which meant that performance on InnoDB didn't degrade as much as it would on spinning disks. We excluded TokuDB from the next benchmark tests because they are all cases that are not well suited to TokuDB’s strengths.

We temporarily abandoned MariaDB, and tested Percona 5.5.20 with the following config file:

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=2G
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 

 

We tried various innodb_flush_method settings, and the graphs show that O_DIRECT performs slightly better than the default fsync(), even if the benchmark shows a weird bootstrap. We also tried ALL_O_DIRECT, which performed badly.

 

Additionally, we tried innodb_log_block_size=4096 instead of the default 512, but nothing changed: insert rate wasn't affected.

 

One of the goals of this benchmark was to test if running multiple mysqld instances on the same host performs better than a single mysqld instance.

On this specific hardware, the answer seems to be yes. With 8 mysqld instances configured from the config file listed below (but with different paths and ports for each), throughput is significantly higher. Note that innodb_buffer_pool_size was set to 256M to try to stress the IO subsystem.

[mysqld] 
user=mysql 
table_open_cache=256 
max_connections=128 
query_cache_size=0 
innodb_file_per_table 
log_bin 
innodb_flush_log_at_trx_commit=1 
innodb_buffer_pool_size=256M
innodb_log_buffer_size=8M 
innodb_log_file_size=1024M 
basedir=/usr/local/mysql 
port=3306
datadir=/localfio/MULTI/db00 
socket=/localfio/MULTI/db00/mysql.sock 
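One way to produce the per-instance files is to generate them from a template.  The Python sketch below renders 8 configs from the settings shown above.  Only the first instance's port and paths (3306 and /localfio/MULTI/db00) come from our setup as listed; the numbering of the other instances (db01 to db07, ports 3307 to 3313) is an assumption made for this example.

```python
# Settings shared by every instance, as in the config file above.
BASE_SETTINGS = [
    "user=mysql",
    "table_open_cache=256",
    "max_connections=128",
    "query_cache_size=0",
    "innodb_file_per_table",
    "log_bin",
    "innodb_flush_log_at_trx_commit=1",
    "innodb_buffer_pool_size=256M",
    "innodb_log_buffer_size=8M",
    "innodb_log_file_size=1024M",
    "basedir=/usr/local/mysql",
]

def render_instance(index, base_port=3306, root="/localfio/MULTI"):
    """Render the config text for one mysqld instance."""
    # Assumed naming scheme: db00, db01, ... with consecutive ports.
    datadir = "%s/db%02d" % (root, index)
    lines = ["[mysqld]"] + BASE_SETTINGS + [
        "port=%d" % (base_port + index),
        "datadir=%s" % datadir,
        "socket=%s/mysql.sock" % datadir,
    ]
    return "\n".join(lines) + "\n"

configs = [render_instance(i) for i in range(8)]
print(configs[7])
```

Each rendered string can then be written to its own file and passed to mysqld with --defaults-file.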
 

 

All the above tests were executed using 36 client connections for writes and 36 client connections for reads.

 

We then ran a new cycle of tests, but instead of using 36 x 2 connections, we used 80 x 2 (80 for writes and 80 for reads).

 

 

With 80 connections, throughput was higher than with 36 connections, but at nearly regular intervals we found performance dropping. This seems independent from the size of the buffer pool.

It is interesting to note that with only one mysqld instance, FusionIO was performing at 4.7k – 4.8k IOPS, while with 8 mysqld instances FusionIO was performing at 27k – 29k IOPS. As expected, with a small buffer pool performance tends to slowly degrade when the data doesn't fit in memory.

We tried various values of innodb_write_io_threads, but this didn't make any difference, since the Redo Log was the most written and not the tablespaces.

To better analyze the throughput, we reduced the sample time to 10 seconds and reran the test:

 

 

It is clear that throughput drops from time to time, and for a nearly constant amount of time. While the test was running, we tried to monitor the mysqld instances, but there was no clear indication of why they were stalling. The Redo Log wasn't anywhere close to full and InnoDB wasn't performing aggressive flushing. The amount of data read from disk was pretty low but the amount of data written was spiking. Yet, the writes weren't coming from InnoDB.

The reason for the stalls became apparent when we analyzed the content of /proc/meminfo: the Linux Virtual Memory (VM) subsystem was performing dirty pages flushing!
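A minimal Python sketch of that kind of check follows.  It parses /proc/meminfo-style text and pulls out the Dirty and Writeback counters, which are the fields to watch for page cache flushing.  The sample text and its numbers are invented for illustration, and the function names are ours.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])  # values are reported in kB
    return values

def dirty_summary(text):
    """Return the amount of dirty and in-flight writeback memory, in kB."""
    info = parse_meminfo(text)
    return {
        "dirty_kb": info.get("Dirty", 0),
        "writeback_kb": info.get("Writeback", 0),
    }

# Invented sample; on a Linux host, read the real file instead:
#   text = open("/proc/meminfo").read()
sample = """MemTotal:       49456788 kB
MemFree:         1203340 kB
Cached:         38120500 kB
Dirty:           2483212 kB
Writeback:        512040 kB
"""
summary = dirty_summary(sample)
print(summary)
```

Sampling these two values every second or so while the benchmark runs makes it easy to see whether throughput drops line up with bursts of writeback.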

We changed vm.dirty_background_ratio from 10 (the default) to 1, and reran the test.

sysctl -w vm.dirty_background_ratio=1

 

Throughput is now way more stable, although performance has dropped by 2.8%. It is interesting to note that throughput drops at nearly the same time no matter the value of dirty_background_ratio.

A quick analysis of the MySQL source code shows that binlogs are synced to disk when closed, therefore the drops in throughput may be caused by the flush of binary logs.

We then raised vm.dirty_background_ratio up to 10 (the default value) and lowered max_binlog_size from 1G to 64M.

 

 

Throughput doesn't drop drastically as in the two previous tests, but goes up and down at more regular intervals.

At the end of this test, performance with max_binlog_size=64M is ~4% lower than the initial test with max_binlog_size=1G (in both cases, vm.dirty_background_ratio=10).

The last setup of 8 instances with a 256M buffer pool each and max_binlog_size=64M was then compared with a new setup:  4 instances with a 512M buffer pool each (2GB total in both cases) and max_binlog_size=64M:

 

 

An interesting outcome from this last test is that total throughput rose by around 4% (recovering what was originally lost by using binlogs of 64M) and that the total number of IOPS dropped to ~16k, leaving room for more IO in case of a different workload.

We then ran a new test using only 2 mysqld instances. It shows what was already easy to guess when running a similar test with only one mysqld instance: a lower number of mysqld instances can't fully utilize IO capacity and therefore has lower throughput.

 

Conclusions (most of them are as expected) for this specific workload and on this specific hardware:

  • O_DIRECT performs slightly better than the default fsync for innodb_flush_method.
  • A high number of clients provides more throughput than a smaller number of clients; not enough tests were performed to find the optimal number of clients.
  • Throughput drops when data doesn't fit in the buffer pool.
  • A high number of mysqld instances running on the same server can better utilize the IOPS that FusionIO is able to provide (it would probably be a very bad idea to run multiple mysqld instances on the same spinning disk or array).
  • The sync of the binlog during binlog rotation can stall the system. Lowering dirty_background_ratio or max_binlog_size stabilizes the throughput.

How to recreate an InnoDB table after the tablespace has been removed

Does your error log ever get flooded with errors like this one?

 

[ERROR] MySQL is trying to open a table handle but the .ibd file for
table my_schema/my_logging_table does not exist.
Have you deleted the .ibd file from the database directory under
the MySQL datadir, or have you used DISCARD TABLESPACE?
See http://dev.mysql.com/doc/refman/5.0/en/innodb-troubleshooting.html how you can resolve the problem.

 

No? That is great!

 

We had a case where, in order to quickly solve a disk space issue, a SysAdmin decided to remove the biggest file in the filesystem, and of course this was an InnoDB table used for logging.

That is, he ran:

shell> rm /var/lib/mysql/my_schema/my_logging_table.ibd

He could have run TRUNCATE TABLE, but that's another story.

 

The results were not ideal:

  1. The table did not exist anymore.

  2. Errors in the application while trying to write to the table.

  3. MySQL flooding the error log.

 

The solution for this problem is to:

  • run DISCARD TABLESPACE (InnoDB will remove insert buffer entries for that tablespace);

  • run DROP TABLE (InnoDB will complain that the .ibd file doesn't exist, but it will remove the table from the internal data dictionary);

  • recover the CREATE TABLE statement from the latest backup ( you have backups, right? );

  • issue the CREATE TABLE statement to recreate the table.

 

Example:

mysql> ALTER TABLE my_logging_table DISCARD TABLESPACE;

Query OK, 0 rows affected (0.05 sec)

 

In the error log you will see something like:

InnoDB: Error: cannot delete tablespace 251
InnoDB: because it is not found in the tablespace memory cache.
InnoDB: Warning: cannot delete tablespace 251 in DISCARD TABLESPACE.
InnoDB: But let us remove the insert buffer entries for this tablespace.

 

mysql> DROP TABLE my_logging_table;

Query OK, 0 rows affected (0.16 sec)

 

In the error log you will see something like:

InnoDB: Error: table 'my_schema/my_logging_table'
InnoDB: in InnoDB data dictionary has tablespace id 251,
InnoDB: but tablespace with that id or name does not exist. Have
InnoDB: you deleted or moved .ibd files?
InnoDB: This may also be a table created with CREATE TEMPORARY TABLE
InnoDB: whose .ibd and .frm files MySQL automatically removed, but the
InnoDB: table still exists in the InnoDB internal data dictionary.
InnoDB: Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/innodb-troubleshooting.html
InnoDB: for how to resolve the issue.
InnoDB: We removed now the InnoDB internal data dictionary entry
InnoDB: of table `my_schema/my_logging_table`.

 

And finally:

mysql> CREATE TABLE `my_logging_table` (
(... omitted ...)
-> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Query OK, 0 rows affected (0.05 sec)
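For the step of recovering the CREATE TABLE statement from a backup, a short Python sketch is shown below.  It assumes the backup is a mysqldump-style SQL file, where each definition starts with CREATE TABLE followed by the backquoted table name and ends with a line beginning with a closing parenthesis.  The sample dump text and its column definitions are hypothetical, as is the function name.

```python
import re

def extract_create_table(dump_text, table):
    """Return the CREATE TABLE statement for `table` from a SQL dump, or None."""
    pattern = (
        r"CREATE TABLE `" + re.escape(table) + r"` \("  # statement start
        r".*?\n\)"                                       # up to the closing line
        r"[^;]*;"                                        # table options and ';'
    )
    match = re.search(pattern, dump_text, re.DOTALL)
    return match.group(0) if match else None

# Hypothetical excerpt of a dump file; the column definitions are made up.
dump = """DROP TABLE IF EXISTS `other_table`;
CREATE TABLE `other_table` (
  `id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `my_logging_table`;
CREATE TABLE `my_logging_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `message` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
"""
statement = extract_create_table(dump, "my_logging_table")
print(statement)
```

The extracted statement can then be pasted into the mysql client for the final CREATE TABLE step.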

 

Of course, the final step is a stern talking-to with the SysAdmin.

Advocating For Our Clients - Part 1 Cultural

 

What do companies need from their database professional, MySQL or otherwise? How can we exceed those expectations as a remote team?  This is my first in a series of blog posts discussing exactly how we do so at PalominoDB - regardless of whether the technology is MySQL, MongoDB, Cassandra, Oracle or otherwise.

Cultural:  

The majority of our clients are start-ups.  Some are small teams experiencing their first three year growth pains, while others are in the three to seven year period, have proven the effectiveness of their business model, yet retain a strong sense of start-up culture and personality.  When looking for staff, their focus is rightly on people who have passion, drive and personalities that fit with their unique corporate cultures.  How can a remote resource, much less one that is not dedicated to one company full-time, understand not just clients’ product and technology, but the people, the schedules and the drive that support clients' success?

Quite often clients want PalominoDB to have a single point of contact who gives us individual tasks and who functions as a filter between their organization and ours.  While we will work with whatever model is requested, this method builds a certain level of isolation that can limit our effectiveness in the bigger picture.  Being in operations requires a certain push and pull with engineering and product groups to meet business demands and ensure availability and performance. We also require knowledge of a company's business goals and project portfolio.  Otherwise, how can we react with urgency at the appropriate times?  How can we know which issues require escalation and which require push back? 

Once we understand company strategy and priorities, we can start to tailor the decisions we make to our clients’ needs. For example, if we know a particular system is crucial to the success of a client's key project, we are much more inclined to work until 2 am to complete the project or to meet a release deadline. If we know that two months of late nights and weekend work are crucial to helping a client with a customer launch, to beating the competition and to growing successfully, my staff and I willingly devise a plan to support that customer. However, if we perceive that a customer’s demands for last-minute changes or large amounts of off-hours work come from poor planning, poor communication or a poorly prioritized product plan, we are much more inclined to put our efforts into improving the underlying processes around change management and project planning. 

As operational professionals, we understand the importance of urgency and the product delivery speed that the modern start-up must work with.  Because of the breadth of our experience, we also know that if production teams had their way, all tasks would be P1s, all reports would be real time and there would never be any downtime (and all work would happen for free), and we act in accordance with this desire to the best of our ability.  When we know that the task we are working on is crucial to our clients’ ability to maintain their competitive edge, we are motivated to work the 12 hour days needed to get it done.  Alternatively, when we know a date is flexible, we can choose not to tax our operations team and incur the risks associated with working too many hours and making crucial decisions under fatigue. As CEO and principal at PalominoDB, it is my job to work within my clients’ availability and take care of my staff.  Work-life balance is not simply a concept to which I pay lip-service; I believe that a happy, rested and alert operations staff is essential to customer up-time and to keeping human mistakes to a minimum. 

Another question we often get asked is how we, as remote team members, align with a client’s business so that we can support it at its pace and intensity.  We’ve had the most success with regular knowledge-shares and participation in operational team meetings.  While taking part in our clients’ company-wide sessions has been unnecessary, we have found that attending operations and architecture team meetings where information can be shared down and around is an excellent start.  Getting onto operational team distribution lists is another method we use to learn about what is going on.  Obviously, every DBA on our team cannot do these things for every one of PalominoDB's clients, but the primary DBA assigned to a particular client can, and, as they filter out relevant details, they can share pertinent information with the rest of the team. 

Having that primary DBA serve as a client’s advocate is crucial, and a point I will continually discuss in my writings.  It is the primary DBA who asks questions when information is not forthcoming, who reviews the org charts and introduces themselves to engineers, project managers and QA/release folks.  The primary DBA gets contact info from all of these folks, documents it in CRM, and plugs it into GTalk, Skype or whatever medium is appropriate.  The primary DBA will hang out in a client’s IRC and Campfire rooms and soak up everything they see.  The primary DBA even reads PowerPoints (yes really)! Finally, and most importantly, the primary DBA makes on-site visits.  PalominoDB’s operations team makes it a point to try and come on-site at least once every two months.  Some of that time is spent in meetings and some of that time is spent simply dining or hanging out.  Regardless, these on-site visits allow us all to connect, to put names to faces, and to get to know each other. It helps to ensure that our clients understand that PalominoDB is not a faceless company full of replaceable DBAs.  We are a company made up of individuals with skills, quirks, personalities (usually BIG ONES), and we know our clients are the same. 

Does this take time? Yes. However, I ask our clients to think about the savings in cost they accrue by using us instead of maintaining a full-time staff.  The extra time spent on meetings, emails and IRC conversations does not add much in overall cost, yet it is invaluable when building relationships.  Constant contact replaces the water-cooler meetings and impromptu conversations at lunch. That small investment of time in camaraderie and in team-building pays off in more ways than you can imagine.

The Prototype

The first start-up stage I’ve worked within is the prototype phase. Within this phase, traffic is not yet an issue for performance or scale; it’s all about functionality. Low traffic and small datasets can hide atrocious code quite easily. The nice thing about this stage is that you should not have to invest a lot of time or money into your database and instead can focus on functionality and business development. Over-engineering at this point can be a devastating waste of very precious resources.
