Building a New Cloud Data Warehouse with Databend and Tencent Cloud COS

2022/4/20 8:12:31

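This article demonstrates how to use Databend on top of Tencent Cloud COS to build a new kind of data warehouse and its compute layer. If you are looking for a low-cost, high-performance, elastic data warehouse, Databend offers a cloud-native solution built on object storage. Databend currently supports writing data via stream load, copy into from stage, and insert, and it can be deployed in standalone or cluster mode. For further support, add us on WeChat: 82565387. The article is long, so bookmarking it and reading it on a PC is recommended.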

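About Databend

Databend is an open-source, modern data warehouse written in Rust and designed entirely around object storage. It offers fast elastic scaling and aims to deliver an on-demand, pay-as-you-go Data Cloud experience. Its main features:

•Vectorized Execution and a Pull&Push-Based Processor Model

•True separation of storage and compute: high performance, low cost, on-demand and pay-as-you-go usage

•Full database support, compatible with the MySQL and ClickHouse protocols and with SQL over HTTP

•Complete transaction support, with features such as Data Time Travel and Database Zero Clone

•Multi-tenant reads, writes, and sharing on a single copy of the data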

github repo: https://github.com/datafuselabs/databend

Docs : https://databend.rs

For the Databend architecture diagram, see: https://databend.rs/doc/

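Tencent Cloud COS

Cloud Object Storage (COS) is a distributed storage service from Tencent Cloud with no directory hierarchy and no data format restrictions. It can hold massive amounts of data and is accessed over HTTP/HTTPS. COS buckets have no capacity limit and need no partition management, which suits scenarios such as CDN data distribution, Cloud Infinite data processing, and data lakes for big data computing and analytics.

Website: https://cloud.tencent.com/product/cos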

Test environment

Beijing region: CVM SA2.8XLARGE64 & COS (ap-beijing)

Operating system: ubuntu-20

Databend: binary release v0.6.99-nightly

Download: https://repo.databend.rs/databend/v0.6.99-nightly/databend-v0.6.99-nightly-x86_64-unknown-linux-gnu.tar.gz

Installation and deployment for this test follow: https://databend.rs/doc/deploy/cos

For cluster deployment, see: https://databend.rs/doc/deploy/cluster_minio

Test data

wget --no-check-certificate --continue https://transtats.bts.gov/PREZIP/\
On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{1987..2021}_{1..12}.zip
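
The {1987..2021}_{1..12} pattern in the wget command above is bash brace expansion, so the command depends on being run from bash. As a minimal sketch, the same list of URLs can be produced with plain POSIX loops, which also makes it easy to inspect the list before downloading. The function name list_ontime_urls is an assumption made for this illustration; the base URL and file name pattern are taken from the command above.

```shell
# Print one download URL per line for every month in the given year range.
# list_ontime_urls is a hypothetical helper name, not part of the original article.
list_ontime_urls() {
    base="https://transtats.bts.gov/PREZIP"
    prefix="On_Time_Reporting_Carrier_On_Time_Performance_1987_present"
    year=$1
    while [ "$year" -le "$2" ]; do
        month=1
        while [ "$month" -le 12 ]; do
            printf '%s/%s_%s_%s.zip\n' "$base" "$prefix" "$year" "$month"
            month=$((month + 1))
        done
        year=$((year + 1))
    done
}

# Show the first URL of the range used in this article.
list_ontime_urls 1987 2021 | head -n 1
```

The full range 1987 to 2021 yields 35 x 12 = 420 URLs. The list could then be piped into the same wget invocation, for example with xargs -n 1, instead of relying on brace expansion.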

Table schema: cat create_ontime.sql

CREATE TABLE ontime
(
    Year                            UInt16 NOT NULL,
    Quarter                         UInt8 NOT NULL,
    Month                           UInt8 NOT NULL,
    DayofMonth                      UInt8 NOT NULL,
    DayOfWeek                       UInt8 NOT NULL,
    FlightDate                      Date NOT NULL,
    Reporting_Airline               String NOT NULL,
    DOT_ID_Reporting_Airline        Int32 NOT NULL,
    IATA_CODE_Reporting_Airline     String NOT NULL,
    Tail_Number                     String NOT NULL,
    Flight_Number_Reporting_Airline String NOT NULL,
    OriginAirportID                 Int32 NOT NULL,
    OriginAirportSeqID              Int32 NOT NULL,
    OriginCityMarketID              Int32 NOT NULL,
    Origin                          String NOT NULL,
    OriginCityName                  String NOT NULL,
    OriginState                     String NOT NULL,
    OriginStateFips                 String NOT NULL,
    OriginStateName                 String NOT NULL,
    OriginWac                       Int32 NOT NULL,
    DestAirportID                   Int32 NOT NULL,
    DestAirportSeqID                Int32 NOT NULL,
    DestCityMarketID                Int32 NOT NULL,
    Dest                            String NOT NULL,
    DestCityName                    String NOT NULL,
    DestState                       String NOT NULL,
    DestStateFips                   String NOT NULL,
    DestStateName                   String NOT NULL,
    DestWac                         Int32 NOT NULL,
    CRSDepTime                      Int32 NOT NULL,
    DepTime                         Int32 NOT NULL,
    DepDelay                        Int32 NOT NULL,
    DepDelayMinutes                 Int32 NOT NULL,
    DepDel15                        Int32 NOT NULL,
    DepartureDelayGroups            String NOT NULL,
    DepTimeBlk                      String NOT NULL,
    TaxiOut                         Int32 NOT NULL,
    WheelsOff                       Int32 NOT NULL,
    WheelsOn                        Int32 NOT NULL,
    TaxiIn                          Int32 NOT NULL,
    CRSArrTime                      Int32 NOT NULL,
    ArrTime                         Int32 NOT NULL,
    ArrDelay                        Int32 NOT NULL,
    ArrDelayMinutes                 Int32 NOT NULL,
    ArrDel15                        Int32 NOT NULL,
    ArrivalDelayGroups              Int32 NOT NULL,
    ArrTimeBlk                      String NOT NULL,
    Cancelled                       UInt8 NOT NULL,
    CancellationCode                String NOT NULL,
    Diverted                        UInt8 NOT NULL,
    CRSElapsedTime                  Int32 NOT NULL,
    ActualElapsedTime               Int32 NOT NULL,
    AirTime                         Int32 NOT NULL,
    Flights                         Int32 NOT NULL,
    Distance                        Int32 NOT NULL,
    DistanceGroup                   UInt8 NOT NULL,
    CarrierDelay                    Int32 NOT NULL,
    WeatherDelay                    Int32 NOT NULL,
    NASDelay                        Int32 NOT NULL,
    SecurityDelay                   Int32 NOT NULL,
    LateAircraftDelay               Int32 NOT NULL,
    FirstDepTime                    String NOT NULL,
    TotalAddGTime                   String NOT NULL,
    LongestAddGTime                 String NOT NULL,
    DivAirportLandings              String NOT NULL,
    DivReachedDest                  String NOT NULL,
    DivActualElapsedTime            String NOT NULL,
    DivArrDelay                     String NOT NULL,
    DivDistance                     String NOT NULL,
    Div1Airport                     String NOT NULL,
    Div1AirportID                   Int32 NOT NULL,
    Div1AirportSeqID                Int32 NOT NULL,
    Div1WheelsOn                    String NOT NULL,
    Div1TotalGTime                  String NOT NULL,
    Div1LongestGTime                String NOT NULL,
    Div1WheelsOff                   String NOT NULL,
    Div1TailNum                     String NOT NULL,
    Div2Airport                     String NOT NULL,
    Div2AirportID                   Int32 NOT NULL,
    Div2AirportSeqID                Int32 NOT NULL,
    Div2WheelsOn                    String NOT NULL,
    Div2TotalGTime                  String NOT NULL,
    Div2LongestGTime                String NOT NULL,
    Div2WheelsOff                   String NOT NULL,
    Div2TailNum                     String NOT NULL,
    Div3Airport                     String NOT NULL,
    Div3AirportID                   Int32 NOT NULL,
    Div3AirportSeqID                Int32 NOT NULL,
    Div3WheelsOn                    String NOT NULL,
    Div3TotalGTime                  String NOT NULL,
    Div3LongestGTime                String NOT NULL,
    Div3WheelsOff                   String NOT NULL,
    Div3TailNum                     String NOT NULL,
    Div4Airport                     String NOT NULL,
    Div4AirportID                   Int32 NOT NULL,
    Div4AirportSeqID                Int32 NOT NULL,
    Div4WheelsOn                    String NOT NULL,
    Div4TotalGTime                  String NOT NULL,
    Div4LongestGTime                String NOT NULL,
    Div4WheelsOff                   String NOT NULL,
    Div4TailNum                     String NOT NULL,
    Div5Airport                     String NOT NULL,
    Div5AirportID                   Int32 NOT NULL,
    Div5AirportSeqID                Int32 NOT NULL,
    Div5WheelsOn                    String NOT NULL,
    Div5TotalGTime                  String NOT NULL,
    Div5LongestGTime                String NOT NULL,
    Div5WheelsOff                   String NOT NULL,
    Div5TailNum                     String NOT NULL
);

Create the table:

cat create_ontime.sql | mysql -h127.0.0.1 -P3307 -uroot

Loading the data

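cat load_ontime.sh

#!/bin/bash
# Usage: ./load_ontime.sh zip_dir
echo "unzip ontime ,input your ontime zip dir: ./load_ontime.sh zip_dir"

# Extract only the CSV file from each archive into ./dataset, 4 archives in parallel.
ls "$1"/*.zip |xargs -I{} -P 4 bash -c "echo {}; unzip -q {} '*.csv' -d ./dataset"

if [ $? -eq  0 ];
then
    echo "unzip success"
else
    echo "unzip was wrong!!!"
    exit 1
fi

# Create the ontime table through Databend's MySQL-compatible port.
cat create_ontime.sql |mysql -h127.0.0.1 -P3307 -uroot
if [ $? -eq  0 ];
then
    echo "Ontime table create success"
else
    echo "Ontime table create was wrong!!!"
    exit 1
fi

# Upload the CSV files through the streaming load endpoint, 8 files in parallel,
# skipping the header row of each file, and time the whole load.
time ls ./dataset/*.csv|xargs -P 8 -I{} curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@{}" -XPUT http://localhost:8081/v1/streaming_load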

Usage

./load_ontime.sh <directory containing the ZIP files>
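
Before starting a long upload, it can help to preview the request that the last line of load_ontime.sh issues for each CSV file. The sketch below only prints the curl command and sends nothing. The helper name print_load_cmd and the file name example.csv are assumptions made for this illustration; the headers and the endpoint are copied from the script above.

```shell
# Print (do not run) the streaming load request for one CSV file.
# $1: path to the CSV file
# $2: optional endpoint, defaulting to the one used in load_ontime.sh
print_load_cmd() {
    printf 'curl -H "insert_sql:insert into ontime format CSV" -H "skip_header:1" -F "upload=@%s" -XPUT %s\n' \
        "$1" "${2:-http://localhost:8081/v1/streaming_load}"
}

# example.csv is a made-up file name used only for this preview.
print_load_cmd ./dataset/example.csv
```

Printing the command first makes it easy to check the target host and port when Databend does not run on localhost.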

Test queries on the ontime dataset

Q1 Total number of flights per day of the week, 2000 to 2008

(0.494 sec., 143.75 million rows/sec., 431.25 MB/sec)

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-----------+---------+
| DayOfWeek | c       |
+-----------+---------+
|         5 | 8732422 |
|         1 | 8730614 |
|         4 | 8710843 |
|         3 | 8685626 |
|         2 | 8639632 |
|         7 | 8274367 |
|         6 | 7514194 |
+-----------+---------+
7 rows in set (0.50 sec)
Read 71000000 rows, 213 MB in 0.494 sec., 143.75 million rows/sec., 431.25 MB/sec.

mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: DayOfWeek:UInt8, count() as c:UInt64                                                                                                                                                                                                                  |
|   Sort: count():UInt64                                                                                                                                                                                                                                            |
|     AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                      |
|       AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                  |
|         Filter: ((Year >= 2000) and (Year <= 2008))                                                                                                                                                                                                               |
|           ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8], statistics: [read_rows: 71000000, read_bytes: 213000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4], filters: [((Year >= 2000) AND (Year <= 2008))]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
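
The statistics in the ReadDataSource line show how much data the query has to touch: 71 of 207 partitions are scanned, presumably because the Year filter is pushed down to the storage layer. As a rough sketch, the figures can be turned into a scan ratio and an average partition size with awk. The helper names scan_ratio and rows_per_partition are assumptions made for this illustration.

```shell
# Share of partitions actually scanned, as a percentage with one decimal.
scan_ratio() {
    awk -v scanned="$1" -v total="$2" 'BEGIN { printf "%.1f\n", scanned / total * 100 }'
}

# Average number of rows per scanned partition.
rows_per_partition() {
    awk -v rows="$1" -v scanned="$2" 'BEGIN { printf "%d\n", rows / scanned }'
}

scan_ratio 71 207                # prints 34.3
rows_per_partition 71000000 71   # prints 1000000
```

In other words, roughly two thirds of the partitions are skipped for this query, and each scanned partition holds one million rows.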

Q2 Number of flights delayed by more than 10 minutes per day of the week, 2000 to 2008

(0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.)

mysql> SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-----------+---------+
| DayOfWeek | c       |
+-----------+---------+
|         5 | 2175733 |
|         4 | 2012848 |
|         1 | 1898879 |
|         7 | 1880896 |
|         3 | 1757508 |
|         2 | 1665303 |
|         6 | 1510894 |
+-----------+---------+
7 rows in set (0.54 sec)
Read 71000000 rows, 497 MB in 0.543 sec., 130.71 million rows/sec., 914.95 MB/sec.

mysql> explain SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY DayOfWeek ORDER BY c DESC;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: DayOfWeek:UInt8, count() as c:UInt64                                                                                                                                                                                                                                                            |
|   Sort: count():UInt64                                                                                                                                                                                                                                                                                      |
|     AggregatorFinal: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                                                                |
|       AggregatorPartial: groupBy=[[DayOfWeek]], aggr=[[count()]]                                                                                                                                                                                                                                            |
|         Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008))                                                                                                                                                                                                                                   |
|           ReadDataSource: scan schema: [Year:UInt16, DayOfWeek:UInt8, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 497000000, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 4, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
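
The read_bytes values for Q1 and Q2 follow directly from the projected columns: only the columns in the scan schema are read, and for fixed-width types the volume is read_rows multiplied by the sum of the column widths. A small sketch of that arithmetic is shown below. The helper names are assumptions made for this illustration, and the calculation applies only to fixed-width columns; String columns, as in Q3, are variable-length and cannot be estimated this way.

```shell
# Width in bytes of the fixed-size column types that appear in Q1 and Q2.
type_width() {
    case "$1" in
        UInt8)  echo 1 ;;
        UInt16) echo 2 ;;
        Int32)  echo 4 ;;
        *)      echo "unsupported type: $1" >&2; return 1 ;;
    esac
}

# Sum of the widths of the given column types.
row_width() {
    total=0
    for t in "$@"; do
        w=$(type_width "$t") || return 1
        total=$((total + w))
    done
    echo "$total"
}

# Expected read_bytes = read_rows * row width.
expected_read_bytes() {
    rows=$1
    shift
    width=$(row_width "$@") || return 1
    echo $((rows * width))
}

expected_read_bytes 71000000 UInt16 UInt8          # Q1 scan schema, prints 213000000
expected_read_bytes 71000000 UInt16 UInt8 Int32    # Q2 scan schema, prints 497000000
```

Both results match the read_bytes reported in the two explain outputs, which shows that adding the DepDelay filter in Q2 costs exactly one extra Int32 column per row.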

Q3 Number of delays by origin airport, 2000 to 2008, top 10

(0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.)

mysql> SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
+--------+--------+
| Origin | c      |
+--------+--------+
| ORD    | 860911 |
| ATL    | 831822 |
| DFW    | 614403 |
| LAX    | 402671 |
| PHX    | 400475 |
| LAS    | 362026 |
| DEN    | 352893 |
| EWR    | 302267 |
| DTW    | 296832 |
| IAH    | 290729 |
+--------+--------+
10 rows in set (0.69 sec)
Read 71000000 rows, 1.21 GB in 0.679 sec., 104.59 million rows/sec., 1.78 GB/sec.

mysql> explain SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year >= 2000 AND Year <= 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10                                                                                                                                                                                                                                                                                                     |
|   Projection: Origin:String, count() as c:UInt64                                                                                                                                                                                                                                                              |
|     Sort: count():UInt64                                                                                                                                                                                                                                                                                      |
|       AggregatorFinal: groupBy=[[Origin]], aggr=[[count()]]                                                                                                                                                                                                                                                   |
|         AggregatorPartial: groupBy=[[Origin]], aggr=[[count()]]                                                                                                                                                                                                                                               |
|           Filter: (((DepDelay > 10) and (Year >= 2000)) and (Year <= 2008))                                                                                                                                                                                                                                   |
|             ReadDataSource: scan schema: [Year:UInt16, Origin:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1271665856, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 14, 31], filters: [(((DepDelay > 10) AND (Year >= 2000)) AND (Year <= 2008))]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q4 Number of delays by carrier in 2007

(0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;
+---------+---------+
| Carrier | count() |
+---------+---------+
| WN      |  296451 |
| AA      |  179769 |
| MQ      |  152293 |
| OO      |  147019 |
| US      |  140199 |
| UA      |  135061 |
| XE      |  108571 |
| EV      |  104055 |
| NW      |  102206 |
| DL      |   98427 |
| CO      |   81039 |
| YV      |   79553 |
| FL      |   64583 |
| OH      |   60532 |
| AS      |   54326 |
| B6      |   53716 |
| 9E      |   48578 |
| F9      |   24100 |
| AQ      |    6764 |
| HA      |    4059 |
+---------+---------+
20 rows in set (0.19 sec)
Read 15000000 rows, 240 MB in 0.188 sec., 79.77 million rows/sec., 1.28 GB/sec.

mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, count() FROM ontime WHERE DepDelay>10 AND Year = 2007 GROUP BY Carrier ORDER BY count() DESC;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, count():UInt64                                                                                                                                                                                                                                |
|   Sort: count():UInt64                                                                                                                                                                                                                                                                                   |
|     AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]]                                                                                                                                                                                                                           |
|       AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[count()]]                                                                                                                                                                                                                       |
|         Filter: ((DepDelay > 10) and (Year = 2007))                                                                                                                                                                                                                                                      |
|           ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((DepDelay > 10) AND (Year = 2007))]] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q5 Share of delayed flights by carrier in 2007, in per mille

(0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| EV      | 363.53123668047823 |
| AS      |  339.1453631738303 |
| US      |  288.8039271022377 |
| AA      |  283.6112877194699 |
| MQ      |  281.7663100792978 |
| B6      |  280.5745625489684 |
| UA      | 275.63356884257615 |
| YV      | 270.25567158804466 |
| OH      |  256.4567516268981 |
| WN      | 253.62165713752844 |
| CO      | 250.77750030171651 |
| XE      | 249.71881878589517 |
| NW      | 246.56113247419944 |
| F9      | 246.52209492635023 |
| OO      | 245.90051515354253 |
| FL      |  245.4143692596491 |
| DL      | 206.82764258051773 |
| 9E      | 187.66780889391967 |
| AQ      |  145.9016393442623 |
| HA      |  72.25634178905207 |
+---------+--------------------+
20 rows in set (0.27 sec)
Read 15000000 rows, 240 MB in 0.265 sec., 56.58 million rows/sec., 905.28 MB/sec.

mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year=2007 GROUP BY Carrier ORDER BY c3 DESC;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64                                                                                                                                                                   |
|   Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64                                                                                                                                                                                                                            |
|     Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy)                                                                                                                                                               |
|       AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                            |
|         AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                        |
|           Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy)                                                                                                                                                                          |
|             Filter: (Year = 2007)                                                                                                                                                                                                                                                      |
|               ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 15000000, read_bytes: 250239306, partitions_scanned: 15, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [(Year = 2007)]] |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
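
The expression avg(cast(DepDelay>10 as Int8))*1000 in Q5 turns each flight into a 0/1 flag, averages the flags to get the fraction of delayed flights, and scales the result to per mille. The same calculation can be sketched with awk on a handful of made-up delay values. The helper name permille_delayed and the sample numbers are assumptions made for this illustration and are not taken from the dataset.

```shell
# Read one departure delay (in minutes) per line on stdin and print the
# share of values above the threshold, in per mille.
permille_delayed() {
    awk -v threshold="$1" '
        { n++; if ($1 + 0 > threshold + 0) delayed++ }
        END { if (n > 0) printf "%.2f\n", delayed / n * 1000 }
    '
}

# Four made-up delays: 12 and 30 exceed 10 minutes, so 2 of 4 flights count.
printf '5\n12\n0\n30\n' | permille_delayed 10   # prints 500.00
```

Read this way, the value 363.53 for EV in the result above means that about 36% of its 2007 flights departed more than 10 minutes late.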

Q6 Share of delayed flights by carrier, 2000 to 2008, in per mille

(0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| AS      | 293.05649076611434 |
| EV      |  282.0709981074399 |
| YV      |  270.3897636688929 |
| B6      | 257.40594891667007 |
| FL      | 249.28742951361826 |
| XE      | 246.59005902424192 |
| MQ      |  245.3695989400477 |
| WN      | 233.38127235928863 |
| DH      | 227.11013827345042 |
| F9      | 226.08455653226812 |
| UA      | 224.42824657703645 |
| OH      | 215.52882835147614 |
| AA      | 211.97122176454556 |
| US      | 206.60330294168244 |
| HP      | 205.31690167066455 |
| OO      |  202.4243177198239 |
| NW      |  191.7393936377831 |
| TW      |  188.6912623180138 |
| DL      | 187.84162871590732 |
| CO      | 187.71301306878976 |
| 9E      |  181.6396991511518 |
| RU      | 181.46244295416398 |
| TZ      |  176.8928125899626 |
| AQ      | 145.65911608293766 |
| HA      |  79.38672451825789 |
+---------+--------------------+
25 rows in set (0.94 sec)
Read 71000000 rows, 1.14 GB in 0.935 sec., 75.95 million rows/sec., 1.22 GB/sec.

mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(cast(DepDelay>10 as Int8))*1000 AS c3 FROM ontime WHERE Year>=2000 AND Year <=2008 GROUP BY Carrier ORDER BY c3 DESC;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(cast((DepDelay > 10) as Int8)) * 1000) as c3:Float64                                                                                                                                                                                          |
|   Sort: (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64                                                                                                                                                                                                                                                   |
|     Expression: IATA_CODE_Reporting_Airline:String, (avg(cast((DepDelay > 10) as Int8)) * 1000):Float64 (Before OrderBy)                                                                                                                                                                                      |
|       AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                                                   |
|         AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(cast((DepDelay > 10) as Int8))]]                                                                                                                                                                                               |
|           Expression: IATA_CODE_Reporting_Airline:String, cast((DepDelay > 10) as Int8):Int8 (Before GroupBy)                                                                                                                                                                                                 |
|             Filter: ((Year >= 2000) and (Year <= 2008))                                                                                                                                                                                                                                                       |
|               ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.003 sec., 0 rows/sec., 0 B/sec.

Q7 Average departure delay per carrier, 2000-2008 (scaled by 1000)

(0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.)

mysql> SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
+---------+--------------------+
| Carrier | c3                 |
+---------+--------------------+
| B6      | 16789.739456036365 |
| NW      | 11717.623092632819 |
| F9      | 11232.889558936127 |
| XE      | 17092.548853057146 |
| YV      |  17971.53933699898 |
| US      |   11868.7097884053 |
| RU      | 12556.249210602802 |
| AS      | 14735.545887755581 |
| HA      |  6851.555976883671 |
| OH      | 12655.103820799075 |
| UA      | 14594.243159716054 |
| TZ      | 12618.760195758565 |
| EV      | 16374.703330010156 |
| HP      | 11625.682112859839 |
| DH      | 15311.949983190174 |
| DL      | 10943.456441165357 |
| 9E      | 13091.087573576122 |
| FL      | 15192.451732538268 |
| MQ      | 14125.201554023559 |
| AQ      |  7323.278123603293 |
| OO      | 11600.594852741107 |
| AA      |  13508.78515494305 |
| TW      | 10842.722114986364 |
| WN      | 10484.932610056378 |
| CO      | 12671.595978518368 |
+---------+--------------------+
25 rows in set (0.74 sec)
Read 71000000 rows, 1.14 GB in 0.727 sec., 97.6 million rows/sec., 1.56 GB/sec.

mysql> explain SELECT IATA_CODE_Reporting_Airline AS Carrier, avg(DepDelay) * 1000 AS c3 FROM ontime WHERE Year >= 2000 AND Year <= 2008 GROUP BY Carrier;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: IATA_CODE_Reporting_Airline as Carrier:String, (avg(DepDelay) * 1000) as c3:Float64                                                                                                                                                                                                           |
|   Expression: IATA_CODE_Reporting_Airline:String, (avg(DepDelay) * 1000):Float64 (Before Projection)                                                                                                                                                                                                      |
|     AggregatorFinal: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]]                                                                                                                                                                                                                      |
|       AggregatorPartial: groupBy=[[IATA_CODE_Reporting_Airline]], aggr=[[avg(DepDelay)]]                                                                                                                                                                                                                  |
|         Filter: ((Year >= 2000) and (Year <= 2008))                                                                                                                                                                                                                                                       |
|           ReadDataSource: scan schema: [Year:UInt16, IATA_CODE_Reporting_Airline:String, DepDelay:Int32], statistics: [read_rows: 71000000, read_bytes: 1179110760, partitions_scanned: 71, partitions_total: 207], push_downs: [projections: [0, 8, 31], filters: [((Year >= 2000) AND (Year <= 2008))]] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
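
The plan above reports `partitions_scanned: 71` out of `partitions_total: 207`, so the `Year` filter lets the engine skip most of the table. The short Python sketch below only redoes that arithmetic with the figures printed in the Q7 plan and result footer. The variable names are illustrative, and the throughput it derives can differ slightly from the reported value because the elapsed time is rounded.

```python
# Figures copied from the Q7 EXPLAIN output and the Q7 result footer.
partitions_scanned = 71
partitions_total = 207
read_rows = 71_000_000
elapsed_sec = 0.727  # rounded value printed by the client

# Share of partitions that survive the Year filter.
scanned_fraction = partitions_scanned / partitions_total

# Average number of rows in each scanned partition.
rows_per_partition = read_rows / partitions_scanned

# Scan throughput implied by the rounded elapsed time.
rows_per_sec = read_rows / elapsed_sec

print(f"partitions scanned: {scanned_fraction:.1%}")
print(f"rows per scanned partition: {rows_per_partition:,.0f}")
print(f"throughput: {rows_per_sec / 1e6:.1f} million rows/sec")
```

Roughly one third of the partitions are read, which is why Q7 touches 71 million rows instead of the full 201,816,232.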

Q8 Average departure delay per year

(1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.)

mysql> SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
+------+--------------------+
| Year | avg(DepDelay)      |
+------+--------------------+
| 1987 | 12.380385692195556 |
| 1988 |  7.345867511864449 |
| 1989 |   8.81845473300008 |
| 1990 |  7.966702606180775 |
| 1991 |  6.940411174086677 |
| 1992 |  6.687364706154975 |
| 1993 |  7.207721091071671 |
| 1994 |  7.758752042452116 |
| 1995 |  9.328649903752932 |
| 1996 |  11.14468468976826 |
| 1997 |  9.919225483813925 |
| 1998 | 10.884314711941435 |
| 1999 | 11.567390524113748 |
| 2000 | 13.456897681824556 |
| 2001 | 10.895474364001354 |
| 2002 |   9.97856700710386 |
| 2003 |  9.778465263372038 |
| 2004 | 11.936799840656898 |
| 2005 |  12.60167890747495 |
| 2006 | 14.237297887039372 |
| 2007 | 15.431738868356579 |
| 2008 | 14.654588068064287 |
| 2009 | 13.168984006133062 |
| 2010 | 13.202976628175891 |
| 2011 | 13.496191548097778 |
| 2012 | 13.155971481255131 |
| 2013 | 14.901210490900201 |
| 2014 | 15.513697266113969 |
| 2015 | 14.638336410280733 |
| 2016 | 14.643883269504837 |
| 2017 |  15.70225324299191 |
| 2018 |  16.16188254545747 |
| 2019 | 16.983263489524507 |
| 2020 | 10.624498278073712 |
| 2021 | 15.289615417399649 |
+------+--------------------+
35 rows in set (1.04 sec)
Read 201816232 rows, 1.21 GB in 1.030 sec., 195.93 million rows/sec., 1.18 GB/sec.

mysql> explain SELECT Year, avg(DepDelay) FROM ontime GROUP BY Year;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: Year:UInt16, avg(DepDelay):Float64                                                                                                                                                                   |
|   AggregatorFinal: groupBy=[[Year]], aggr=[[avg(DepDelay)]]                                                                                                                                                      |
|     AggregatorPartial: groupBy=[[Year]], aggr=[[avg(DepDelay)]]                                                                                                                                                  |
|       ReadDataSource: scan schema: [Year:UInt16, DepDelay:Int32], statistics: [read_rows: 201816232, read_bytes: 1210897392, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 31]] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
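
Every GROUP BY plan in this article contains an `AggregatorPartial` step followed by an `AggregatorFinal` step. The Python sketch below illustrates the general idea of such two-phase aggregation for `avg`: each chunk of rows is folded into a per-key (sum, count) state, and the states are merged before the division. It is a conceptual illustration only, not Databend's implementation; the function names and the sample rows are made up.

```python
from collections import defaultdict


def partial_avg(rows):
    """Partial phase: fold one chunk of (key, value) rows into per-key [sum, count]."""
    state = defaultdict(lambda: [0, 0])
    for key, value in rows:
        state[key][0] += value
        state[key][1] += 1
    return state


def final_avg(partial_states):
    """Final phase: merge the partial states, then divide sum by count per key."""
    merged = defaultdict(lambda: [0, 0])
    for state in partial_states:
        for key, (total, count) in state.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical (Year, DepDelay) rows split across two chunks.
chunk_a = [(1987, 10), (1987, 20), (1988, 5)]
chunk_b = [(1987, 30), (1988, 15)]

result = final_avg([partial_avg(chunk_a), partial_avg(chunk_b)])
print(result)
```

Keeping (sum, count) rather than a running average is what makes the partial results mergeable in any order, which is the property a parallel or distributed executor needs.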

Q9 Number of flights per year

(0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.)

mysql> SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;
+------+---------+
| Year | c1      |
+------+---------+
| 1987 |  440403 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 5967780 |
| 2002 | 5271359 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
| 2015 | 5819079 |
| 2016 | 5617658 |
| 2017 | 5674621 |
| 2018 | 7213446 |
| 2019 | 7422037 |
| 2020 | 4688354 |
| 2021 | 5443512 |
+------+---------+
35 rows in set (0.52 sec)
Read 201816232 rows, 403.63 MB in 0.509 sec., 396.54 million rows/sec., 793.08 MB/sec.

mysql> explain SELECT Year, count(*) as c1 FROM ontime GROUP BY Year;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: Year:UInt16, count() as c1:UInt64                                                                                                                                               |
|   AggregatorFinal: groupBy=[[Year]], aggr=[[count()]]                                                                                                                                       |
|     AggregatorPartial: groupBy=[[Year]], aggr=[[count()]]                                                                                                                                   |
|       ReadDataSource: scan schema: [Year:UInt16], statistics: [read_rows: 201816232, read_bytes: 403632464, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
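
The `read_bytes` figures in the Q8 and Q9 plans can be reproduced from the scan schema: for these fixed-width integer columns they equal the row count multiplied by the summed column widths. The sketch below checks this in Python. This is an observation drawn from the numbers printed above, not a documented formula, and the helper name `estimated_bytes` is hypothetical.

```python
TOTAL_ROWS = 201_816_232  # read_rows in the Q8 and Q9 plans

# Widths in bytes of the fixed-size integer types in the scan schemas.
WIDTH = {"UInt16": 2, "Int32": 4}


def estimated_bytes(column_types, rows=TOTAL_ROWS):
    """Rows multiplied by the total width of the projected columns."""
    return rows * sum(WIDTH[t] for t in column_types)


q8_bytes = estimated_bytes(["UInt16", "Int32"])  # Year, DepDelay
q9_bytes = estimated_bytes(["UInt16"])           # Year

# Compare with read_bytes as printed in the two plans.
print(q8_bytes == 1_210_897_392)
print(q9_bytes == 403_632_464)
```

Both comparisons hold, which shows how strongly column projection drives the amount of data a query has to read: Q9 needs only two bytes per row.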

Q10 Average number of flights per month delayed by 15 minutes or more

(0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.)

mysql> SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+-------------------+
| avg(cnt)          |
+-------------------+
| 81474.99019607843 |
+-------------------+
1 row in set (0.90 sec)
Read 201816232 rows, 1.41 GB in 0.891 sec., 226.44 million rows/sec., 1.59 GB/sec.

mysql> explain SELECT avg(cnt) FROM (SELECT Year,Month,count(*) AS cnt FROM ontime WHERE DepDel15=1 GROUP BY Year,Month) a;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(cnt):Float64                                                                                                                                                                                                                                        |
|   AggregatorFinal: groupBy=[[]], aggr=[[avg(cnt)]]                                                                                                                                                                                                                  |
|     AggregatorPartial: groupBy=[[]], aggr=[[avg(cnt)]]                                                                                                                                                                                                              |
|       Projection: Year:UInt16, Month:UInt8, count() as cnt:UInt64                                                                                                                                                                                                   |
|         AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                                                                                  |
|           AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                                                                              |
|             Filter: (DepDel15 = 1)                                                                                                                                                                                                                                  |
|               ReadDataSource: scan schema: [Year:UInt16, Month:UInt8, DepDel15:Int32], statistics: [read_rows: 201816232, read_bytes: 1412713624, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2, 33], filters: [(DepDel15 = 1)]] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q11 Average number of flights per month

(0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.)

mysql> SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------+
| avg(c1)           |
+-------------------+
| 494647.6274509804 |
+-------------------+
1 row in set (0.57 sec)
Read 201816232 rows, 605.45 MB in 0.561 sec., 359.58 million rows/sec., 1.08 GB/sec.

mysql> explain SELECT avg(c1) FROM (SELECT Year,Month,count(*) AS c1 FROM ontime GROUP BY Year,Month) a;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection: avg(c1):Float64                                                                                                                                                                                       |
|   AggregatorFinal: groupBy=[[]], aggr=[[avg(c1)]]                                                                                                                                                                 |
|     AggregatorPartial: groupBy=[[]], aggr=[[avg(c1)]]                                                                                                                                                             |
|       Projection: Year:UInt16, Month:UInt8, count() as c1:UInt64                                                                                                                                                  |
|         AggregatorFinal: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                                |
|           AggregatorPartial: groupBy=[[Year, Month]], aggr=[[count()]]                                                                                                                                            |
|             ReadDataSource: scan schema: [Year:UInt16, Month:UInt8], statistics: [read_rows: 201816232, read_bytes: 605448696, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [0, 2]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
7 rows in set (0.02 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
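
Q10 and Q11 can be read together: dividing the two monthly averages gives the share of flights that departed at least 15 minutes late. The Python sketch below does that division with the values printed above. It assumes both subqueries produce the same number of (Year, Month) groups, that is, every month contains at least one delayed flight; the article does not state this explicitly.

```python
avg_delayed_per_month = 81474.99019607843  # Q10 result
avg_flights_per_month = 494647.6274509804  # Q11 result
total_rows = 201_816_232                   # rows read by both queries

# Number of (Year, Month) groups implied by the Q11 average.
months = total_rows / avg_flights_per_month

# Share of delayed flights, valid under the equal-group-count assumption.
delayed_share = avg_delayed_per_month / avg_flights_per_month

print(f"months covered: {months:.0f}")
print(f"share of flights with DepDel15 = 1: {delayed_share:.2%}")
```

Under that assumption the data set spans 408 months, and roughly one flight in six was delayed by 15 minutes or more.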

Q12 Top 10 city pairs by number of direct flights

(2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.)

mysql> SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+-------------------+-------------------+--------+
| OriginCityName    | DestCityName      | c      |
+-------------------+-------------------+--------+
| San Francisco, CA | Los Angeles, CA   | 514878 |
| Los Angeles, CA   | San Francisco, CA | 512147 |
| New York, NY      | Chicago, IL       | 456042 |
| Chicago, IL       | New York, NY      | 448756 |
| Chicago, IL       | Minneapolis, MN   | 437913 |
| Minneapolis, MN   | Chicago, IL       | 433688 |
| Los Angeles, CA   | Las Vegas, NV     | 428942 |
| Las Vegas, NV     | Los Angeles, CA   | 422825 |
| New York, NY      | Boston, MA        | 419405 |
| Boston, MA        | New York, NY      | 416324 |
+-------------------+-------------------+--------+
10 rows in set (2.94 sec)
Read 201816232 rows, 8.54 GB in 2.930 sec., 68.87 million rows/sec., 2.91 GB/sec.

mysql> explain SELECT OriginCityName, DestCityName, count(*) AS c FROM ontime GROUP BY OriginCityName, DestCityName ORDER BY c DESC LIMIT 10;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10                                                                                                                                                                                                                            |
|   Projection: OriginCityName:String, DestCityName:String, count() as c:UInt64                                                                                                                                                        |
|     Sort: count():UInt64                                                                                                                                                                                                             |
|       AggregatorFinal: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]]                                                                                                                                                    |
|         AggregatorPartial: groupBy=[[OriginCityName, DestCityName]], aggr=[[count()]]                                                                                                                                                |
|           ReadDataSource: scan schema: [OriginCityName:String, DestCityName:String], statistics: [read_rows: 201816232, read_bytes: 9829664815, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15, 24]] |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.00 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q13 Top 10 origin cities by number of flights

(1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.)

mysql> SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-----------------------+----------+
| OriginCityName        | c        |
+-----------------------+----------+
| Chicago, IL           | 12545243 |
| Atlanta, GA           | 10900284 |
| Dallas/Fort Worth, TX |  9011081 |
| Houston, TX           |  6844476 |
| Los Angeles, CA       |  6695628 |
| New York, NY          |  6309911 |
| Denver, CO            |  6283055 |
| Phoenix, AZ           |  5658884 |
| Washington, DC        |  4998047 |
| San Francisco, CA     |  4673365 |
+-----------------------+----------+
10 rows in set (1.23 sec)
Read 201816232 rows, 4.27 GB in 1.223 sec., 165.05 million rows/sec., 3.49 GB/sec.

mysql> explain SELECT OriginCityName, count(*) AS c FROM ontime GROUP BY OriginCityName ORDER BY c DESC LIMIT 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10                                                                                                                                                                                                   |
|   Projection: OriginCityName:String, count() as c:UInt64                                                                                                                                                    |
|     Sort: count():UInt64                                                                                                                                                                                    |
|       AggregatorFinal: groupBy=[[OriginCityName]], aggr=[[count()]]                                                                                                                                         |
|         AggregatorPartial: groupBy=[[OriginCityName]], aggr=[[count()]]                                                                                                                                     |
|           ReadDataSource: scan schema: [OriginCityName:String], statistics: [read_rows: 201816232, read_bytes: 4914707403, partitions_scanned: 207, partitions_total: 207], push_downs: [projections: [15]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.

Q14 Total number of rows in the ontime table

(0.002 sec., 443.51 rows/sec., 443.51 B/sec.)

mysql> SELECT count(*) FROM ontime;
+-----------+
| count()   |
+-----------+
| 201816232 |
+-----------+
1 row in set (0.01 sec)
Read 1 rows, 1 B in 0.002 sec., 443.51 rows/sec., 443.51 B/sec.

mysql> explain SELECT count(*) FROM ontime;
+-----------------------------------------------------------------------------------------------------------------------------------------+
| explain                                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| Projection: count():UInt64                                                                                                              |
|   Projection: 201816232 as count():UInt64                                                                                               |
|     Expression: 201816232:UInt64 (Exact Statistics)                                                                                     |
|       ReadDataSource: scan schema: [dummy:UInt8], statistics: [read_rows: 1, read_bytes: 1, partitions_scanned: 1, partitions_total: 1] |
+-----------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Read 0 rows, 0 B in 0.002 sec., 0 rows/sec., 0 B/sec.
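
The plan shows that `count(*)` is answered from exact statistics (`Expression: 201816232:UInt64 (Exact Statistics)`) rather than from a table scan, which is why the query reads a single dummy row. As a sanity check, the Python sketch below sums the per-year counts from the Q9 result and compares the total with that figure; the list is copied from the Q9 output above.

```python
# Per-year flight counts from the Q9 result, 1987 through 2021.
yearly_counts = [
    440403, 5202096, 5041200, 5270893, 5076925, 5092157, 5070501,
    5180048, 5327435, 5351983, 5411843, 5384721, 5527884, 5683047,
    5967780, 5271359, 6488540, 7129270, 7140596, 7141922, 7455458,
    7009726, 6450285, 6450117, 6085281, 6096762, 6369482, 5819811,
    5819079, 5617658, 5674621, 7213446, 7422037, 4688354, 5443512,
]

total = sum(yearly_counts)

# Value returned by SELECT count(*) FROM ontime in Q14.
exact_statistics_count = 201_816_232

print(len(yearly_counts), total, total == exact_statistics_count)
```

The scanned totals from Q9 and the metadata-based count from Q14 agree.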

