Blog

Getting a List of Users From the MySQL General Log

From time to time, organizations want to know if there are any users that are not used.  For clients using MySQL 5.1 and up, that can handle a 1% overhead for general logging, we will enable the general log for a period of time and analyze the users that connect.

Note: we have some extremely busy clients, and we very rarely have a problem turning the general log on, other than making sure we're rotating and compressing logs so we do not run out of disk space.

Once we have the logs, I run this little perl tool I made -- I call it genlog_users.pl:

#!/usr/bin/perl

 

my $infile=$ARGV[0];

my %seen=();

my @uniq=();

 

open (INPUT, "<$infile");

while (<INPUT>) {

  my $line=$_;

  if ($line=~/Connect/) {

    if ($line=~/(\S*@\S*)/) { push(@uniq, $1) unless $seen{$1}++; }

 } # end if line matches Connect

}

close INPUT;

open (OUTPUT, ">>..users.txt");

$,="\n";

print OUTPUT (keys %seen);

print OUTPUT ("\n");

close OUTPUT;

 

----------

I hope it is useful for whoever stumbles on this; I know it has been useful for me in the past -- it's just doing some string matching, and I bet if I used Python it would be done in half the lines, but it's already fewer than 20 lines, so it's pretty small to begin with.

PalominoDB Percona Live: London Slides are up!

 

Percona Live: London was a rousing success for PalominoDB.  I was sad that I could not attend, but I got a few people who sent "hellos" to me via my coworkers.  But on to the most important stuff -- slides from our presentations are online!

René Cannao spoke about MySQL Backup and Recovery Tools and Techniques (description) slides (PDF)

 

Jonathan delivered a 3-hour tutorial about Advanced MySQL Scaling Strategies for Developers (description) slides (PDF)

Enjoy!

Automatic Downtimes in Nagios Without Using Time Periods

Monitoring systems are a great thing, and we rely heavily on Nagios here at PalominoDB.  We also rely heavily on xtrabackup for (mostly) non-locking, "warm" backups of MySQL.  In order to get a consistent backup with a proper binlog position on a slave, xtrabackup stops replication for a short period of time.  If the monitoring system catches this, the pager will go off, and usually in the middle of the night.

Nagios can have specific windows when it is on or off for a particular host or service, but you have to remember that when you change the time the backup runs.  I prefer to call a script from cron, just before I call the backup script, so that I can easily see that they are related.

Note that this script works even if you have Nagios password protected with .htaccess, because wget allows for a username and password.  This script is not perfect - there is not a lot of error checking, and log files are hard-coded.  But it does what it needs to.

The script is called like this:

./scheduleDowntimeByService service length host

Sample cron entries (note that the service name depends on what you are checking and the service names themselves):

00 5 * * 0 /home/palominodb/bin/scheduleDowntimeByService.pl MySQL+Replication+Lag 3600 hostname

 

00 5 * * 0 /home/palominodb/bin/scheduleDowntimeByService.pl MySQL+Slave+IO+Thread 3600 hostname

 

00 5 * * 0 /home/palominodb/bin/scheduleDowntimeByService.pl MySQL+Slave+SQL+Thread 3600 hostname

Here is the script itself:

 

cat scheduleDowntimeByService.pl 

#!/usr/bin/perl

use strict;

use POSIX;

 

my $service = shift @ARGV;

my $length = shift @ARGV;

my $host = shift @ARGV;

 

unless ($length ) {

        $length = 3600;

}

 

 

my $startTime = time();

my $endTime = $startTime + $length;

 

my $nagios_start = POSIX::strftime("%m-%d-%Y %H:%M:00", localtime($startTime));

my $nagios_end = POSIX::strftime("%m-%d-%Y %H:%M:00", localtime($endTime));

 

$nagios_start =~ s@:@%3A@g;

$nagios_start =~ s@-@%2D@g;

$nagios_start =~ s@ @%20@g;

$nagios_end =~ s@:@%3A@g;

$nagios_end =~ s@-@%2D@g;

$nagios_end =~ s@ @%20@g;

 

my $URL = 'https://monitoring.company.com/nagios//cgi-bin/cmd.cgi?fixed=1&start_time=' .  $nagios_start . '&end_time=' . $nagios_end . '&cmd_typ=56&cmd_mod=2&host=' . $host . '&service=' . $service . '&com_data=Backups&btnSubmit=Commit';

 

my $cmd = "/usr/bin/wget --quiet --user=user --password=PASS -O /tmp/nagios_downtime.html '$URL'"

;

open ( L, ">>/tmp/nagios_downtime.log" );

print L print $cmd . "\n";

print L `$cmd`;

close L;

Exporting the mysql.slow_log table into slow query log format

Using pt-query-digest is an excellent way to perform a SQL review. However, sometimes you don't have access to the slow_query_log_file. For example, when MySQL runs on Amazon RDS, the slow_query_log_file is unavailable (see the RDS FAQ). To get around this, export the mysql.slow_log table.

To export the data, run the following SQL command from an instance that has database access. The data is exported into a MySQL slow-log format, which pt-query-digest can analyze:

mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT( '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H%i%s'), '\n', '# User@Host: ', user_host, '\n', '# Query_time: ', TIME_TO_SEC(query_time),  '  Lock_time: ', TIME_TO_SEC(lock_time), '  Rows_sent: ', rows_sent, '  Rows_examined: ', rows_examined, '\n', sql_text, ';' ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log

The -s (--silent) option is used to suppress echoing of the query.

The -r (--raw) option is used to disable character escaping making \n into an actual newline, otherwise it's echoed as '\n'.

Once the export is complete, run pt-query-digest to do the analysis. A simple review command is:

pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt

query-digest.txt is now ready for review.

I won USD$100 and you can, too

Last week, I won USD$100*.  Gazzang is giving away $100 every Friday in October, in honor of National Cyber Security Awareness Month.  There are only 2 more weeks left -- all you have to do is tweet @Gazzang_Inc one way to improve your own cyber security.

My winning tweet:

 one way is to secure MySQL! With Gazzang or the tips in  or .

The tip can be any kind of cyber security.  I believe you have to tweet on Friday, though.....so wait until tomorrow!

* I donated my winnings to http://www.technocation.org

Important Caveats When Using pt-online-schema-change

 

Using pt-online-schema-change is a great way to change a table without downtime or switching over to a secondary database first.  While using it on production systems, we found some interesting issues to consider.

 

1) There were foreign_keys referencing the table we were altering.  There are two options to handle this.  They are specified with "--update-foreign-keys-method rebuild_constraints|drop_old_table".  One is to rebuild the tables with foreign keys.  This does lock the tables, and will block reads and writes while it is happening.  I chose to use drop_old_table.  This method disables foreign key checks, then drops the original table and renames the new table before reenabling foreign key checks.  Normally, pt-online-schema-change just renames the tables, then drops the old table, so locking is minimal.  We did find that there was locking while dropping the original table and renaming the new table.  This is something to be cautious with, especially with large tables.  In theory, you could run an online change to drop the foreign key constraints on the child tables prior to working on the parent table.

 

2) When testing, this ran in 15 minutes against an unladen database and took over 24 minutes against production.  You could add --sleep n, where n is a number of seconds, if you need it to add less load to production.  

 

3) One limit we also saw is that pt-online-schema-change will only work on tables with at least one unique single column index, such as a single column primary key.  Often partitioned tables don't have a unique single column index.

 

 

Here is some of the output of a sample pt-online-schema-change...

 
 
## Install pt-online-schema-change and pre-requisites...
wget percona.com/get/pt-online-schema-change
chmod 755 pt-online-schema-change
yum install perl-DBI
yum install perl-DBD-MySQL
 
time ./pt-online-schema-change h=127.0.0.1,D=my_schema,t=my_table -P3306 --alter "add column signature varbinary(255) NULL DEFAULT NULL; add column signature_version varchar(4) NULL DEFAULT NULL" --child-tables fk_table1,fk_table2 --update-foreign-keys-method drop_old_table -pXXXX --bin-log
# 2011-10-14T03:20:28 ./pt-online-schema-change started
# 2011-10-14T03:20:28 USE `my_schema`
# 2011-10-14T03:20:28 Alter table my_table using temporary table __tmp_my_table
# 2011-10-14T03:20:28 Checking if table my_table can be altered
# 2011-10-14T03:20:28 SHOW TRIGGERS FROM `my_schema` LIKE 'my_table'
# 2011-10-14T03:20:28 Table my_table can be altered
# 2011-10-14T03:20:28 Chunk column id, index PRIMARY
# 2011-10-14T03:20:28 Chunked table my_table into 25685 chunks
# 2011-10-14T03:20:28 User-specified child tables: fk_table1, fk_table2
# 2011-10-14T03:20:28 Starting online schema change
# 2011-10-14T03:20:28 CREATE TABLE `my_schema`.`__tmp_my_table` LIKE `my_schema`.`my_table`
# 2011-10-14T03:20:28 ALTER TABLE `my_schema`.`__tmp_my_table` add column signature varbinary(255) NULL DEFAULT NULL
# 2011-10-14T03:20:29 ALTER TABLE `my_schema`.`__tmp_my_table`  add column signature_version varchar(4) NULL DEFAULT NULL
# 2011-10-14T03:20:30 Shared columns: id, col1, col2, col3, col4, col5
# 2011-10-14T03:20:30 Calling OSCCaptureSync::capture()
# 2011-10-14T03:20:30 CREATE TRIGGER mk_osc_del AFTER DELETE ON `my_schema`.`my_table` FOR EACH ROW DELETE IGNORE FROM `my_schema`.`__tmp_my_table` WHERE `my_schema`.`__tmp_my_table`.id = OLD.id
# 2011-10-14T03:20:30 CREATE TRIGGER mk_osc_upd AFTER UPDATE ON `my_schema`.`my_table` FOR EACH ROW REPLACE INTO `my_schema`.`__tmp_my_table` (id, col1, col2, col3, col4, col5) VALUES (NEW.id, NEW.col1, NEW.col2, NEW.col3, NEW.col4, NEW.col5)
# 2011-10-14T03:20:30 CREATE TRIGGER mk_osc_ins AFTER INSERT ON `my_schema`.`my_table` FOR EACH ROW REPLACE INTO `my_schema`.`__tmp_my_table` (id, col1, col2, col3, col4, col5) VALUES(NEW.id, NEW.col1, NEW.col2, NEW.col3, NEW.col4, NEW.col5)
# 2011-10-14T03:20:30 Calling CopyRowsInsertSelect::copy()
Copying rows:   1% 24:52 remain
Copying rows:   3% 25:01 remain
...
Copying rows:  96% 00:43 remain
Copying rows:  98% 00:15 remain
# 2011-10-14T03:44:14 Calling OSCCaptureSync::sync()
# 2011-10-14T03:44:14 Renaming foreign key constraints in child table
# 2011-10-14T03:44:14 SET foreign_key_checks=0
# 2011-10-14T03:44:14 DROP TABLE `my_schema`.`my_table`
# 2011-10-14T03:44:49 RENAME TABLE `my_schema`.`__tmp_my_table` TO `my_schema`.`my_table`
# 2011-10-14T03:44:49 Calling CopyRowsInsertSelect::cleanup()
# 2011-10-14T03:44:49 Calling OSCCaptureSync::cleanup()
# 2011-10-14T03:44:49 DROP TRIGGER IF EXISTS `my_schema`.`mk_osc_del`
# 2011-10-14T03:44:49 DROP TRIGGER IF EXISTS `my_schema`.`mk_osc_ins`
# 2011-10-14T03:44:49 DROP TRIGGER IF EXISTS `my_schema`.`mk_osc_upd`
# 2011-10-14T03:44:49 ./pt-online-schema-change ended, exit status 0
 
real    24m20.777s
user    0m3.936s
sys     0m1.216s

Indexing text columns with GIST or GIN to optimize LIKE, ILIKE using pg_trgm in Postgres 9.1 (Part 1)

 
“The pg_trgm module provides functions and operators for determining the similarity of ASCII alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.”
 
This is the introduction to the official documentation of the extension at [1]. Note, that I used the EXTENSION terminology instead of CONTRIB, as in PostgreSQL 9.1.  Now we’ll use the CREATE EXTENSION command to include this module in the database. This new methodology allows us to manage modules installations/uninstallations with only a few commands.

The idea of this post is to show you how KNN GIST and pg_trgm could be used together to obtain interesting results. First, let’s start with some basic elements of pg_trgm.

Installation

Installing the module is easy. If you are installing through the use of source code, you must compile the module and once you get access to the database execute:

CREATE EXTENSION pg_trgm;

That’s it! Installation complete!

What is a Trigram and how use them?


A trigram is a group of three consecutive characters in a string that can be used to detect the similarity of two words (for example) or the ‘distance’ between them.

When we talk about ‘distance’, 0 means in the same place and 1 is very far. When we talk about similarity, 1 is equal and 0 is totally different. In other words, distance is 1 minus the similarity value. These concepts about distance and similarity, are necessary to start without confusion.

What’s is a trigram? A trigram is a group of three consecutive characters from a string, used to know the similarity between two strings by counting the trigrams they share.

A trigram looks like this:

palominodb=# select show_trgm('PalominoDB CO');
                              show_trgm
-----------------------------------------------------------------------
{"  c","  p"," co"," pa",alo,"co ","db ",ino,lom,min,nod,odb,omi,pal}
(1 row)

In the pg_trgm extension we have functions and operators. show_limit and set_limit are functions used to set up and show the similarity threshold for the % operator. This operator takes the form “string1 % string2” and returns a boolean type (“t” if the similarity is greater than the similarity threshold, otherwise it returns “f”).

palominodb=# select show_limit();
show_limit
       0.4

palominodb=# select set_limit(0.3), show_limit();
set_limit | show_limit
         0.3 |    0.3

In the following example we’ll see the use of each one. Operator % will return true if the similarity of the strings is greater than similarity threshold returned by show_limit function. In this example, both string are equal, in consequence, the operation will return true:

palominodb=# select similarity('Palomino','Palomino')  AS Similarity,
                                  'Palomino'<->'Palomino'              AS distance,
                                  'Palomino' % 'Palomino'              AS SimilarOrNot;
-[ RECORD 1 ]+--
similarity          | 1
distance     | 0
similarornot     | t

Index Support and usage


Now let’s discuss combining GIST or GIN and pg_trgm. pg_trgm includes an operator class to support searches using similarity, like, or ilike operators. GIN and GIST have several differences. If you don’t know which to choose, just remember a few rules: GIN searches quicker than GIST but is slower to update; if you have a write-intensive table use GIST. GIN is better for static data. Please be aware, however, that they don’t support exact matching with the equals operator!  You can do an exact match using like/ilike with no wildcards.  If you want to use the equals operator(=), you must create a standard BTREE index on the pertinent column.

In the following examples, we’ll show a table only with a GIST index. As you can see, if you want to match the exact value with equal operator, it will scan the whole table:

palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto = 'Palomino';
                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on texto_busqueda  (cost=0.00..90.15 rows=1 width=136) (actual time=16.835..16.846 rows=1 loops=1)
  Filter: (texto = 'Palomino'::text)
Total runtime: 17.094 ms
(3 rows)

But, if we use LIKE operator, index scan will be activated:

palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto like 'Palomino';
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using texto_busqueda_texto_idx on texto_busqueda  (cost=0.00..8.27 rows=1 width=136) (actual time=0.374..1.780 rows=1 loops=1)
  Index Cond: (texto ~~ 'Palomino'::text)
Total runtime: 1.979 ms
(3 rows)


palominodb=# EXPLAIN ANALYZE  SELECT id, texto FROM texto_busqueda WHERE texto like '%Palomino%';
                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Index Scan using texto_busqueda_texto_idx on texto_busqueda  (cost=0.00..8.27 rows=1 width=136) (actual time=0.171..1.732 rows=1 loops=1)
  Index Cond: (texto ~~ '%Palomino%'::text)
Total runtime: 1.882 ms
(3 rows)



To use an index for  match equal strings, we need to create a BTREE index. But in case of BTREE there is a limitation of 8191 bytes per index row. So, if you have very large text columns you will not allowed to create a BTREE index without using functional indexes.

We get this result because, unlike BTREE indexes, the search string is not left-anchored.

The creation of indexes with the pg_trgm operator class is simple:

CREATE INDEX ON texto_busqueda USING GIST(texto gist_trgm_ops);
or
CREATE INDEX ON texto_busqueda USING GIN(texto gin_trgm_ops);

If you want a more comprehensive understanding of GIST or GIN implementation on Postgres, you can download the the source code and read  src/backend/access/gist/README and src/backend/access/gin/README.


Another useful technique

Combining % operator to get the strings that have a similarity greater than the established threshold and similarity function -that returns the similarity-, we can get ordered from the most similar to the less one discarding all the strings that aren’t similar enough:

palominodb=# SELECT ctid, similarity(texto, 'Palominodb') AS simil
palominodb-#  FROM texto_busqueda
palominodb-#  WHERE texto % 'Palominodb'
palominodb-#  ORDER BY simil DESC;
 ctid  |  simil
--------+----------
(55,3) | 0.666667
(1 row)

The same query, but with its EXPLAIN plan shows that the generated index is used in the condition:

palominodb=# EXPLAIN ANALYZE  SELECT ctid, similarity(texto, 'Palominodb') AS sml
palominodb-#  FROM texto_busqueda
palominodb-#  WHERE texto % 'Palominodb'
palominodb-#  ORDER BY sml DESC;
                                                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Sort  (cost=14.26..14.27 rows=3 width=138) (actual time=3.428..3.437 rows=1 loops=1)
  Sort Key: (similarity(texto, 'Palominodb'::text))
  Sort Method: quicksort  Memory: 17kB
  ->  Bitmap Heap Scan on texto_busqueda  (cost=4.28..14.24 rows=3 width=138) (actual time=3.336..3.383 rows=1 loops=1)
        Recheck Cond: (texto % 'Palominodb'::text)
        ->  Bitmap Index Scan on texto_busqueda_texto_idx  (cost=0.00..4.28 rows=3 width=0) (actual time=3.278..3.278 rows=1 loops=1)
              Index Cond: (texto % 'Palominodb'::text)
Total runtime: 3.578 ms
(8 rows)


Well, this is the first part. Hope you enjoyed the reading and I’ll wait your comments and feedback!

 


[1] http://www.postgresql.org/docs/9.1/static/pgtrgm.html

Why are your indexes larger than your actual data?

I sometimes encounter this phenomenon when a database has tables in which the indexes are larger—sometimes much larger—than the actual data stored in the table. Often, the table becomes unwieldy, which generates a number of questions: Should I partition? Should I shard? Should I use a different database entirely?

 

The problem is especially common with InnoDB tables because of:

* Clustered indexes - in Innodb, the primary key is included in every secondary key. If the primary key is large, it could have a detrimental effect on the entire table.

* Slight index overhead - "each index record contains a five-byte header that may be preceded by a variable-length header. The header is used to link together consecutive records, and also in row-level locking." 

* The indexes are fragmented (here’s what you can do about this problem).

When comparing to MyISAM, which compresses its string indexes, InnoDB looks quite big.

 

Here is an actual example:

 mysql> SELECT engine, count(*) tables, concat(round(sum(table_rows)/1000000,2),'M') rows, concat(round(sum(data_length)/(1024*1024*1024),2),'G') data, concat(round(sum(index_length)/(1024*1024*1024),2),'G') idx, concat(round(sum(data_length+index_length)/(1024*1024*1024),2),'G') total_size, round(sum(index_length)/sum(data_length),2) idxfrac

FROM information_schema.TABLES GROUP BY engine

ORDER BY sum(data_length+index_length) DESC LIMIT 10;

+------------+--------+---------+--------+---------+------------+---------+

| engine     | tables | rows    | data   | idx     | total_size | idxfrac |

+------------+--------+---------+--------+---------+------------+---------+

| InnoDB     |    960 | 278.31M | 75.76G | 119.27G | 195.03G    |    1.57 |

| MyISAM     |   1054 | 140.75M | 16.43G | 8.21G   | 24.64G     |    0.50 |

| MRG_MYISAM |     18 | 4.46M   | 0.48G  | 0.00G   | 0.48G      |    0.00 |

| MEMORY     |     44 | NULL    | 0.00G  | 0.00G   | 0.00G      |    NULL |

| NULL       |      7 | NULL    | NULL   | NULL    | NULL       |    NULL |

+------------+--------+---------+--------+---------+------------+---------+

 

mysql> SELECT CONCAT(table_schema, '.', table_name),  CONCAT(ROUND(table_rows / 1000000, 2), 'M')  rows,  CONCAT(ROUND(data_length / ( 1024 * 1024 * 1024 ), 2), 'G')                    DATA, CONCAT(ROUND(index_length / ( 1024 * 1024 * 1024 ), 2), 'G')  idx, CONCAT(ROUND(( data_length + index_length ) / ( 1024 * 1024 * 1024 ), 2), 'G') total_size, ROUND(index_length / data_length, 2) idxfrac FROM  information_schema.TABLES ORDER  BY data_length + index_length DESC LIMIT  3;

+---------------------------------------+--------+-------+--------+------------+---------+

| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx    | total_size | idxfrac |

+---------------------------------------+--------+-------+--------+------------+---------+

| db1.big_old_table              | 48.48M | 8.38G | 27.25G | 35.63G     |    3.25 |

| db1.big_new_table            | 38.63M | 6.32G | 18.13G | 24.44G     |    2.87 |

| db1.another_big_table       | 17.83M | 2.14G | 9.92G  | 12.06G     |    4.64 |

+---------------------------------------+--------+-------+--------+------------+---------+

 

Warning: be careful when you run the above queries. Try to run them on an available and non-production server whenever possible as they may seriously slow down your server.

 

As you can see, the differences between the index sizes vary between 3-4.5 times the size of the data.

 

My experience suggests the root causes of why the indexes are bigger than the data, could be any of the following:

1) Too many indexes

2) Indexes on big columns - like varchar(255)

3) Redundant or duplicate indexes

4) Combination of all of the above

 

And what are the root causes of these problems? Here are some potential answers, garnered, once again, from personal experience:

1) A happy trigger on the phpmyadmin "add index" button

2) When the table was small, some queries were slow and indexes were added indiscriminately until these queries were fast (which explains the previous point)

3) Someone tried numerous different combinations of multi-column indexes to speed up different queries using the same table(s)

4) Additional indexes could have been used to solve bad (after-the-fact) schema design issues

 

Please note that there could be more causes.  This is only a small sample of possibilities.

 

How to solve the problem?

This is something I am still learning, but so far I have discovered that:

1) You need to find out which indexes are not being used any more. To do that, you can use pt-index-usage. What you would need is the general log (or your slow log at long_query_time=0) and you would need to run it through pt-index-usage. This will generate a list of ALTER TABLE statements. This tool can also generate the findings or the decisions for the results into database of your choice (meaning you need to give it a database server if you want that extra data).

This report logs the indexes that were used by the queries in the general/slow log and suggests that you DROP the ones that are not. What's important to remember is that queries which did not run during the time you took your log - such as weekly or monthly reports - will not show up in the results.

I find this tool useful, because I can use it with the same log I used for pt-digest-query. However, I do not blindly run all the drop table statements, I just use them to help me understand the indexes better.

I sometimes use pt-duplicate-key-checker, but usually, pt-index-usage finds all or almost all of what pt-duplicate-key-checker finds.

 

2) I pick a table from the top table query I ran and investigate the indexes on it. I look for duplicate indexes or an index with similar columns (such as index1(a,b), index2(a), index3(b,a,c)) and try to determine why they are there.

Next, I try to go over the queries that access the table from the general/slow log and see how they use the indexes. There are no real short cuts to this process. Usually, I focus on low hanging fruit and I am very careful not to negatively impact other queries that I may not know about.

What I try to do is gather the uniqueness/cardinality of the indexes and determine how they help queries.

 

Here is an example to explain better:

show create table users(

user_id INT(4) UNSIGNED AUTOINC,

fullname VARCHAR(255) DEFAULT '',

active ENUM('Y','N') DEFAULT 'Y',

PRIMARY KEY(id),

KEY index1(fullname)

);

 

Let’s pretend that the users table has 100k rows.  Your query searches on the `fullname` column, which also has an index. 

In the case of most queries, the names may be unique enough to give back a few rows. But names like "John Smith" may give back an unwieldy number of results. If you are only interested in active users, then you can make a change like this:

 

ALTER TABLE users DROP INDEX index1, ADD INDEX index1(fullname,active);

 

By doing so, you increase the cardinality of the index because the additional column `active` may reduce the number of results yielded. In the case of the example, there may be over 100 John Smiths, but only 30 that are active.

 

3) Try to reduce the size of the queries. In the case of the earlier example, `fullname` is a varchar(255). If we use UTF8 for the table, then the index for it will use 767byes. We must determine what the fewest amount of characters that we can get away with is. This could be difficult if we don’t have any data to use for comparison.

If I were running this query, I would use the following process to  determine what that number might be:

 

mysql> set @unique_total=0;

mysql> select count(distinct fullname) into @unique_total from users;

0 row in set (6 min 28.29 sec)

 

mysql> select @unique_total;

+---------------+

| @unique_total |

+---------------+

|        100448 |

+---------------+

1 row in set (0.00 sec)

 

mysql> select round(count(distinct left(fullname,200))/@unique_total*100,2) as c200, round(count(distinct left(fullname,100))/@unique_total*100,2) as c100,

    -> round(count(distinct left(fullname,75))/@unique_total*100,2) as c75, round(count(distinct left(fullname,50))/@unique_total*100,2) as c50,

    -> round(count(distinct left(fullname,30))/@unique_total*100,2) as c30, round(count(distinct left(fullname,20))/@unique_total*100,2) as c20 from users;

+--------+--------+--------+-------+-------+-------+

| c200   | c100   | c75    | c50   | c30   | c20   |

+--------+--------+--------+-------+-------+-------+

| 100.00 | 100.00 | 100.00 | 99.97 | 99.81 | 92.86 |

+--------+--------+--------+-------+-------+-------+

1 row in set (4 min 13.49 sec)

 

You can see that if I limit the index to 30 characters, it will be very precise. If I try a lower number, it will become less and less precise. Technically, I would aim for an index that’s limited to over 95% of the existing data.

 

If you set the number for the index’s characters too low, this may have a negative effect.

To explain what I mean by that, let’s take the full name of the actor Daniel Michael Blake Day-Lewis, and let’s say that our index is limited to 20 characters.  If we search “Daniel Michael Blake Day-Lewis,” we may get back results from the database for:

Daniel Michael Blake Day-Lewis

Daniel Michael Blake

Daniel Michael Blake Smith

Daniel Michael Blake Willis

 

The database will have to further search these rows after it has been brought up from the storage engine. So using a limited index in this case will require the database to do additional work.

 

4) If at all possible, try to change a VARCHAR column/index that has a defined number of rows to an ENUM (1 or 2 bytes) or normalize it to another table and keep the id (usually an INT) in the original table. Doing so can save a ton of space (remember the VARCHAR and UTF8 comment I made above?). Check the cardinality of your VARCHAR column using SHOW INDEX and see if that column’s cardinality is between 0-255 or 0-65535 to determine if it’s a good candidate for this conversion.

 

There are many more techniques for and things you should know about going over indexes. I hope that this post will encourage you to learn more of them. If you are interested in doing so, I will be speaking at Percona Live London, were I will cover this subject (and others) in more depth. Hope to see you there!

TechRepublic Article: "Are tech certs valuable or not?"

I normally do not repost articles and make teeny tiny blog posts (I have noticed a few bloggers who do that, and it bothers me), but I found a short article on TechRepublic called Are tech certs valuable or not? to be quite interesting.

The author did an informal poll of a few CBS interactive higher-ups, and it was interesting to note that in addition to the expected "it depends" and "experience is the most important thing" and "beware the person with many certs and no experience" answers, the MySQL certification was specifically noted as being good:

Doug Lane, Director of Site Reliability and Performance Engineering, says, “The MySQL DBA certifications are valuable to me, since we support so many instances of that. Frankly, the Java and MS certs became less valuable due to the number of those who were just paper certified but not actually practitioners.”

Interestingly, he says the reason the MySQL certifications are valuable is because CBS interactive supports so many instances of MySQL, but he implies that folks who have MySQL DBA certifications also have experience (as opposed to Java/MS certs).

Change Views DEFINER without ALTER VIEW : how to fix thousands of views!

Recently I faced an interesting challenge: modify over 26k views on a single mysqld instance.

Altering the view one by one was far from an easy task, so a I had to look for an efficient way to do it. Read to find out more.

 

Preface:

Views have security context and their own access control.

When a view is created, a DEFINER is assigned to it (by default, the user that creates the view), and a SQL SECURITY that specifies the security context (definer by default).

Assume you use 'user123'@'192.168.1.%' , and you issue the follows (a very simple view):

CREATE VIEW view1 SELECT * FROM tbl1;

 

Behind the scene, this becomes:

CREATE ALGORITHM=UNDEFINED DEFINER=`user123`@`192.168.1.%` SQL SECURITY DEFINER VIEW `view1` AS select <omitted> from `tbl1` ;

 

 

Problem:

 

So far, all good.

What if your application change user to `user456`@`192.168.1.%` ?

The result can be very disappointing (even if expected):

mysql> SELECT * FROM view1;

ERROR 1045 (28000): Access denied for user 'user456'@'192.168.1.%' (using password: YES)

 

What if you use a highly privileged user like root?

mysql> SELECT * FROM view1;

ERROR 1449 (HY000): The user specified as a definer ('user123'@'192.168.1.%') does not exist

Not much more luck, we are sort of stuck until we change permission.

 

ALTER VIEW has the same syntax of CREATE VIEW , ( just 's/CREATE/VIEW/' ) , therefore there is no way to change only the definer: all the statement that define the view need to be re-issued.

 

Even if recreating the view is an easy task, isn't that easy if you thousands and thousands of views to fix.

 

Solution:

There is a dirty way to do this, but it does its job!

Each view is a .frm file in the database directory.

Changing the definer is easy as editing the file changing definer_user and/or definer_host .

 

This is the procedure to update all the views, no matter their number (26k views updates in only few minutes):

shell> cd /var/lib/mysql

shell> for i in `find . -name "*frm"` ; do if [ `cat $i | egrep '^definer_user=user123$' | wc -l` -ne 0 ]; then echo $i ; fi ; done > /tmp/list_views.txt

# don't forget to make a backup of your views!

shell> tar cf /tmp/views_backup.tar /tmp/list_views.txt

shell> for i in `cat /tmp/list_views.txt` ; do tar rf /tmp/views_backup.tar $i ; done

shell> for i in `cat /tmp/list_views.txt` ; do cp -a $i $i.old && sed -e 's/^definer_user=user123$/definer_user=user456/' $i.old > $i && rm $i.old ; done

mysql> FLUSH TABLES;

 

Describing the above procedure:

- /tmp/list_views.txt lists all the views that need to be modified ;

- /tmp/views_backup.tar is a backup with all the frm listed in above list ;

- for each view (file) : copy the .frm file as .frm.old , applies the change using sed , remove the .frm.old file

- close all opened tables (included view) so the changes take effects.

 

References:

http://dev.mysql.com/doc/refman/5.5/en/create-view.html

http://dev.mysql.com/doc/refman/5.5/en/stored-programs-security.html

http://dev.mysql.com/doc/refman/5.5/en/alter-view.html

http://forge.mysql.com/wiki/MySQL_Internals_File_Formats

Syndicate content
Website by Digital Loom