Repost: Using MySQL 5.7 Virtual Columns to Improve Query Performance
27 February 2018
Original: Using MySQL 5.7 Virtual Columns to Improve Query Performance

Introduction

About two years ago I published a post about virtual columns in MySQL 5.7. Since then it has become one of my favorite features of the MySQL 5.7 release. The reason is simple: with the help of virtual columns, we can create fine-grained indexes that can significantly increase query performance. I'm going to show you some tricks that can potentially fix slow reporting queries with GROUP BY and ORDER BY.

The Problem

Recently I was working with a customer who was struggling with this query:

SELECT
    CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
    COUNT(*) as 'No. of API Calls',
    AVG(ExecutionTime) as 'Avg. Execution Time',
    COUNT(distinct AccountId) as 'No. Of Accounts',
    COUNT(distinct ParentAccountId) as 'No. Of Parents'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))
HAVING COUNT(*) >= 1 ;

The query was running for more than an hour, and it used up all the space in the tmp directory (it needed temporary files to complete the sort).
The table structure is as follows:

CREATE TABLE `ApiLog` (
        `Id` int(11) NOT NULL AUTO_INCREMENT,
        `ts` timestamp DEFAULT CURRENT_TIMESTAMP,
        `ServerName` varchar(50)  NOT NULL default '',
        `ServerIP` varchar(50)  NOT NULL default '',
        `ClientIP` varchar(50)  NOT NULL default '',
        `ExecutionTime` int(11) NOT NULL default 0,
        `URL` varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL,
        `Verb` varchar(16)  NOT NULL,
        `AccountId` int(11) NOT NULL,
        `ParentAccountId` int(11) NOT NULL,
        `QueryString` varchar(3000) NOT NULL,
        `Request` text NOT NULL,
        `RequestHeaders` varchar(2000) NOT NULL,
        `Response` text NOT NULL,
        `ResponseHeaders` varchar(2000) NOT NULL,
        `ResponseCode` varchar(4000) NOT NULL,
        ... -- other fields removed for simplicity
        PRIMARY KEY (`Id`),
        KEY `index_timestamp` (`ts`),
        ... -- other indexes removed for simplicity
        ) ENGINE=InnoDB;

We found that the query was not using the index on the timestamp field (ts):

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call', COUNT(*)  as 'No. of API Calls',  avg(ExecutionTime) as 'Avg. Execution Time', count(distinct AccountId) as 'No. Of Accounts',  count(distinct ParentAccountId) as 'No. Of Parents'  FROM ApiLog  WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'  GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))  HAVING COUNT(*)  >= 1\G
*************************** 1. row ***************************
         id: 1
select_type: SIMPLE
      table: ApiLog
 partitions: NULL
       type: ALL
possible_keys: ts
        key: NULL
    key_len: NULL
        ref: NULL
       rows: 22255292
   filtered: 50.00
      Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

The reason is simple: the number of rows matching the filter is so large that an index scan becomes inefficient (or at least the optimizer thinks so):

mysql> select count(*) from ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59' ;
+----------+
| count(*) |
+----------+
|  7948800 |
+----------+
1 row in set (2.68 sec)

The total number of rows is 21998514, so the query needs to scan 36% of the table (7948800/21998514). (Translator's note: when the estimated scan exceeds roughly 20%–30% of the table, the optimizer usually falls back to a full table scan even if an index exists.)
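Rather than take the optimizer's word for it, we can force the timestamp index and compare plans; a minimal sketch, reusing the index_timestamp key from the table definition above:

mysql> explain SELECT count(*) FROM ApiLog FORCE INDEX (index_timestamp)
    WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59';

With FORCE INDEX the plan switches to a range scan, but fetching roughly a third of the table through a secondary index is usually slower than a full scan, which is exactly why the optimizer avoids it.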

In this case, we have a number of ways to proceed (two of them are sketched after the list):

  • Create a combined index on the timestamp column plus the GROUP BY columns;
  • Create a covered index (including all the fields in the query);
  • Create an index on the GROUP BY columns only;
  • Create an index for a loose index scan.
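For illustration, here is roughly what the first and third options could look like; a sketch only, where the index names and the 255-character prefix are my own arbitrary choices (and, as explained next, neither actually solves the problem):

alter table ApiLog add key idx_ts_verb_url (ts, Verb, URL(255));  -- option 1
alter table ApiLog add key idx_verb_url (Verb, URL(255));         -- option 3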

However, if we look closer at the GROUP BY part of the query, we quickly realize that none of these solutions will work. Here is our GROUP BY:

GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))

There are two problems here:

  1. It is a calculated field, so MySQL can't simply scan an index on verb + url. It first needs to concatenate the two fields and then group on the resulting string, which means the index won't be used;
  2. The url is defined as varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL and can't be indexed in full (even with innodb_large_prefix=1, which is the default when utf8mb4 is enabled). We can create a partial (prefix) index, but that won't help the GROUP BY optimization.

Here is my attempt to add a full index on the url column, with innodb_large_prefix=1:

mysql> alter table ApiLog add key verb_url(verb, url);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
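The arithmetic behind the error: with utf8mb4 each character can take up to 4 bytes, so a full index on url alone would need 3000 × 4 = 12000 bytes, well above InnoDB's 3072-byte key limit even with innodb_large_prefix=1.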

Well, changing the GROUP BY from CONCAT(verb, ' - ', replace(url,'.xml','')) to GROUP BY verb, url would help (assuming we shrink the field definition from varchar(3000) to something smaller, whether or not the application allows it). However, it would change the results, as the url would no longer have its .xml extension removed.

The Solution

The good news is that in MySQL 5.7 we have virtual columns, so we can create a virtual column on top of CONCAT(verb, ' - ', replace(url,'.xml','')). The best part: we don't have to store the full string (which can be longer than 3000 bytes). We can use an MD5 hash (or a longer hash, e.g. SHA1/SHA2) as the GROUP BY target.
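To see why varbinary(16) is enough, here is a quick demonstration (the verb and url literals are made-up examples): MD5 returns a 32-character hexadecimal string, and unhex packs it into 16 bytes of binary:

mysql> select md5(CONCAT('GET', ' - ', replace('/customer.xml','.xml',''))) as hex_hash,
    length(unhex(md5(CONCAT('GET', ' - ', replace('/customer.xml','.xml',''))))) as bytes;
    -- bytes is 16, so varbinary(16) holds the full hash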

Here is the solution:

alter table ApiLog add verb_url_hash varbinary(16) GENERATED ALWAYS AS (unhex(md5(CONCAT(verb, ' - ', replace(url,'.xml',''))))) VIRTUAL;
alter table ApiLog add key (verb_url_hash);

So what we did here is:

  1. Declare the virtual column with type varbinary(16);
  2. Define it as the MD5 hash of CONCAT(verb, ' - ', replace(url,'.xml','')), using unhex to convert the 32-character hexadecimal string into 16 bytes of binary (a quick sanity check follows the list);
  3. Create an index on that virtual column.
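Since the column is declared VIRTUAL, its value is computed on read and adds no storage to the row itself; only the index materializes the hash. A hypothetical sanity check on one arbitrary row:

mysql> select Verb, URL, hex(verb_url_hash) from ApiLog limit 1;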

Now we can change the query to GROUP BY the verb_url_hash column:

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml',''))
    AS 'API Call', COUNT(*)  as 'No. of API Calls', avg(ExecutionTime) as 'Avg. Execution Time',
    count(distinct AccountId) as 'No. Of Accounts',
    count(distinct ParentAccountId) as 'No. Of Parents' FROM ApiLog
    WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
    GROUP BY verb_url_hash HAVING COUNT(*)  >= 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in
GROUP BY clause and contains nonaggregated column 'ApiLog.ApiLog.Verb'
which is not functionally dependent on columns in GROUP BY clause;
this is incompatible with sql_mode=only_full_group_by
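As an alternative to relaxing sql_mode, MySQL 5.7 also provides ANY_VALUE(), which suppresses the only_full_group_by check for a given expression; a hypothetical, shortened variant of the query:

SELECT ANY_VALUE(CONCAT(verb, ' - ', replace(url,'.xml',''))) AS 'API Call',
       COUNT(*) as 'No. of API Calls'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY verb_url_hash
HAVING COUNT(*) >= 1;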

MySQL 5.7 enables strict sql_mode by default; here we simply change it for this query only.
Now the explain plan looks much better:

mysql> select @@sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> set sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected (0.00 sec)
mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call', COUNT(*)  as 'No. of API Calls',  avg(ExecutionTime) as 'Avg. Execution Time', count(distinct AccountId) as 'No. Of Accounts',  count(distinct ParentAccountId) as 'No. Of Parents'  FROM ApiLog  WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'  GROUP BY verb_url_hash HAVING COUNT(*)  >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash
          key: verb_url_hash
      key_len: 19
          ref: NULL
         rows: 22008891
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

MySQL can avoid the sort, which makes it much faster. It will still end up scanning the whole table in index order, though. The response time is clearly better: about 38 seconds instead of more than an hour.

Covered Index

Now we can attempt to create a covered index, which will be quite large:

mysql> alter table ApiLog add key covered_index (`verb_url_hash`,`ts`,`ExecutionTime`,`AccountId`,`ParentAccountId`, verb, url);
Query OK, 0 rows affected (1 min 29.71 sec)
Records: 0  Duplicates: 0  Warnings: 0

We added verb and url, so beforehand I had to remove COLLATE utf8mb4_unicode_ci from the table definition. Now the explain plan shows that we are using the covered index:

mysql> explain SELECT  CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',  COUNT(*) as 'No. of API Calls',  AVG(ExecutionTime) as 'Avg. Execution Time',  COUNT(distinct AccountId) as 'No. Of Accounts',  COUNT(distinct ParentAccountId) as 'No. Of Parents'  FROM ApiLog  WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'  GROUP BY verb_url_hash  HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash,covered_index
          key: covered_index
      key_len: 3057
          ref: NULL
         rows: 22382136
     filtered: 50.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

The response time dropped to about 12 seconds! However, the index is significantly larger than the verb_url_hash-only index (16 bytes per record).
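To quantify that trade-off, the persistent statistics tables can report the on-disk size of each index; a sketch, assuming innodb_stats_persistent is enabled (the 5.7 default):

mysql> select index_name, round(stat_value * @@innodb_page_size / 1024 / 1024) as size_mb
    from mysql.innodb_index_stats
    where table_name = 'ApiLog' and stat_name = 'size';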

Conclusion

Generated columns in MySQL 5.7 provide a valuable way to improve query performance. If you have an interesting case, please share it in the comments.