单机 8 核 32G, 用什么技术才可以做单表 20 亿数据的过滤查询?

outdoorlife

2022-01-14 11:38:20 +08:00

本质上就是拆表

你把这些数据想象成放在文本数据里，用 Grep 去查需要多久。数据库比较重查询效率低，文本直接搜索效率最高，但是他的效率也在于磁盘 IO 和 CPU 自身上。

所以你的问题优化本质就在于拆表。

找到可以拆的维度比如时间，然后缩小范围，然后再去查询，速度会好很多，毕竟不用从 1-20 亿这样排下去，可能是 1-1000 万，但是有 200 个。

我们称为百库百表。

另外单机程序上限受 IO 影响

多机拆了才更好（ HADOOP 原理）

可能你去买 I5 CPU SSD ，16G 内存的台式机来 20 台，每台放 1 亿数据，这效率你就可以接受了。

none

2022-01-14 11:39:51 +08:00

一般建议单表数据记录不超过 500W 条，应该考虑分库分表。

outdoorlife

2022-01-14 11:40:34 +08:00

我们以前线上用户有 1000 万。

用户中心含积分，历史记录，上亿数据。

百库百表上了 15 台数据库 DELL 高配，

结果 MYCAT 成为了我们的瓶颈。中间件扛不住高并发。

C10K 经常出现。后来在 DNS 层面弄了好几个登陆中心，每个再连接一个 MYCAT

meeop

2022-01-14 12:02:21 +08:00

根据[ transaction 的 hash , from addr, to addr ]这三个字段简单加 3 个索引不行吗? 看起来就是 o1 级别查询,亿级别应该是百毫秒左右出结果吧

cxh116

2022-01-14 12:02:52 +08:00

@outdoorlife 目前是三机械硬盘 raid 0,速度太勉强了.

主要是 tx 表不像 block 表, 使用自增 number 做主键,而是使用 hex hash 做主键,单机分库分表 IO 瓶颈感觉意义好像不大,多机的话三件成本高,没赚钱先亏个几百块一个月还是撑不住.

ch2

2022-01-14 12:17:20 +08:00

只能上 clickhouse 了

dayeye2006199

2022-01-14 13:15:40 +08:00

试试这几个：

1. 你的查询基本上都是 = 条件，所以考虑用 hash index ，不用 btree index ，大概有 10-20%的性能提升；
2. 使用 CLUSTER- 基本就是重排数据记录，让相近的数据在同一个 cluster 里面；这样应对比如查询所有 from_addr = xxxx 的记录的时候有奇效，因为这些记录都是邻居；
3. 使用 PARTITION - 思路和 sharding 接近，但是不需要引入多机；方便 query plan 并行化执行

outdoorlife

2022-01-14 13:53:11 +08:00

@meeop

亿级数据查询百毫秒出结果。。。。

我有点被吓到了，请问你用的是 Oracle 的大型机柜服务器么？天玑那种？一套近 50 万人民币的那种。

dexterzzz

2022-01-14 13:53:25 +08:00

sql server 列存储索引表

outdoorlife

2022-01-14 13:58:41 +08:00

@cxh116

Mycat 数据库中间件就是干这个活的。

当然如果想的话，最新的 SQL SERVER 集群也支持的。

其实就是你的数据写进数据库前先进行一次取模去处，然后再落盘，到时候查的时候他是派任务下去查，查到有的就吐回来，没有的就不吐回来。

所以可以充分把硬件用起来。

至于那位说 “亿级别数据库查询只要几毫秒的”

我知道的话，可能要用到 Oracle 近 50 万的天玑数据库。

你还是别考虑了。

shyrock

2022-01-14 14:06:41 +08:00

不是很能理解 hash 这种几乎不重复的主键，加了索引后为什么不是秒查到 value 。
你把你的查询语句用 explain 跑一下呢，看一下耗时在哪里？

dreamramon

2022-01-14 14:08:30 +08:00

上 clickhouse 秒出

GM

2022-01-14 14:23:56 +08:00

TDEngine 试试，我看他们吹牛逼吹得很厉害，几亿数据毫秒级时间查到

tanhui2333

2022-01-14 14:28:04 +08:00

@none mysql 都这么弱了吗

superchijinpeng

2022-01-14 14:33:06 +08:00

ck

cxh116

2022-01-14 14:35:23 +08:00

@outdoorlife

Table "public.transactions"
Column | Type | Collation | Nullable | Default
-----------------------------+--------------------------+-----------+----------+---------
hash | character varying(66) | | |
nonce | bigint | | |
transaction_index | bigint | | |
from_address | character varying(42) | | |
to_address | character varying(42) | | |
value | numeric(38,0) | | |
gas | bigint | | |
gas_price | bigint | | |
input | text | | |
receipt_cumulative_gas_used | bigint | | |
receipt_gas_used | bigint | | |
receipt_contract_address | character varying(42) | | |
receipt_root | character varying(66) | | |
receipt_status | bigint | | |
block_timestamp | timestamp with time zone | | not null |
block_number | bigint | | |
block_hash | character varying(66) | | |
max_fee_per_gas | bigint | | |
max_priority_fee_per_gas | bigint | | |
transaction_type | bigint | | |
receipt_effective_gas_price | bigint | | |
Indexes:
"transactions_block_number" btree (block_number DESC)
"transactions_block_timestamp_idx" btree (block_timestamp DESC)
"transactions_from_address_block_timestamp_index" btree (from_address, block_timestamp DESC)
"transactions_hash" UNIQUE, btree (block_timestamp DESC, hash)
"transactions_to_address_block_timestamp_index" btree (to_address, block_timestamp DESC)
Triggers:
ts_insert_blocker BEFORE INSERT ON transactions FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Number of child tables: 70 (Use \d+ to list them.)

EXPLAIN select * from transactions where hash='0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8' limit 1;

-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Limit (cost=0.28..853977.96 rows=1 width=886)
-[ RECORD 2 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Append (cost=0.28..59778438.09 rows=70 width=886)
-[ RECORD 3 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_149_chunk_transactions_hash on _hyper_2_149_chunk (cost=0.28..50.48 rows=1 width=659)
-[ RECORD 4 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 5 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_151_chunk_transactions_hash on _hyper_2_151_chunk (cost=0.42..2496.59 rows=1 width=718)
-[ RECORD 6 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 7 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_153_chunk_transactions_hash on _hyper_2_153_chunk (cost=0.42..28396.85 rows=1 width=774)
-[ RECORD 8 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 9 ]----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_155_chunk_transactions_hash on _hyper_2_155_chunk (cost=0.42..20489.98 rows=1 width=752)
-[ RECORD 10 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 11 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_157_chunk_transactions_hash on _hyper_2_157_chunk (cost=0.42..30703.97 rows=1 width=792)
-[ RECORD 12 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 13 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_159_chunk_transactions_hash on _hyper_2_159_chunk (cost=0.43..34880.92 rows=1 width=800)
-[ RECORD 14 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 15 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_161_chunk_transactions_hash on _hyper_2_161_chunk (cost=0.43..42360.63 rows=1 width=792)
-[ RECORD 16 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 17 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_163_chunk_transactions_hash on _hyper_2_163_chunk (cost=0.43..49846.27 rows=1 width=775)
-[ RECORD 18 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 19 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_165_chunk_transactions_hash on _hyper_2_165_chunk (cost=0.43..41498.32 rows=1 width=815)
-[ RECORD 20 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)
-[ RECORD 21 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | -> Index Scan using _hyper_2_167_chunk_transactions_hash on _hyper_2_167_chunk (cost=0.43..33756.32 rows=1 width=813)
-[ RECORD 22 ]---------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: ((hash)::text = '0xd22b255ddcce6af9daa39fb0b2857096314f56b3a92908a9258695b1f4591ad8'::text)

buddyy

2022-01-14 15:24:28 +08:00

我是不是瞎了，没看到有 hash 字段开头的索引呢？
如果没有的话，这肯定很慢啊。

buddyy

2022-01-14 15:26:57 +08:00

还有 hash 这个字段长度太长了，可以考虑限制 hash 字段的索引长度，可以让一个 page 上容纳更多的记录。

buddyy

2022-01-14 15:32:19 +08:00

如果把 hash 作为主键，这也是不合理的，写入过程中会有大量的叶分裂。

cxh116

2022-01-14 16:08:21 +08:00

@dayeye2006199 @buddyy

感谢感谢,针对网友们的建议,做出以下改变尝试.

1. 改成 ClickHouse ,主键默认使用的是稀疏索引.
2. 使用 Table 自带的 Partition ,取 hash 前 8 位做分区.