[Closed][BE] Optimize hot rank updates #3617
Currently, hot ranks are only updated in our scheduled tasks for content up to a week old. This means that if content older than a week has not decayed to a hot rank of 0 yet (or if for some reason initial hot rank calculation does not work on an incoming old post, like in #3428), this content will not have its hot rank updated until Lemmy is restarted.
This PR aims to speed up the hot rank updates so we can stop limiting the scheduled task to only the past week.
Basically, this is a small tweak to the hot rank update query - it no longer updates rows where the hot rank has already decayed to 0. This significantly speeds up the whole process, as the update itself seems to be quite expensive. Here are some query plans for a batch of 1000 old rows:
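The shape of the change can be sketched roughly as follows (a simplified reconstruction based on the query plans below; the actual table, batching, and rank-calculation details in the PR may differ):

```sql
WITH batch AS (
    -- Lock a batch of rows to update, newest first
    SELECT a.id
    FROM comment_aggregates a
    WHERE a.published > now() - '10 years'::interval
    ORDER BY a.published DESC
    LIMIT 1000
    FOR UPDATE
)
UPDATE comment_aggregates a
SET hot_rank = /* recalculated hot rank */ 0
FROM batch
WHERE a.id = batch.id
  -- The key tweak: skip rows whose hot rank has already decayed to 0,
  -- so the expensive UPDATE touches far fewer rows.
  AND a.hot_rank != 0;
```

The `hot_rank != 0` filter is what produces the `Filter: (hot_rank <> 0)` line in the optimized plan, turning most of the batch into no-ops.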
**Current `main` branch query**
```
Update on comment_aggregates a  (cost=50.33..2667.91 rows=1000 width=38) (actual time=0.840..216.193 rows=1000 loops=1)
  CTE batch
    ->  Limit  (cost=0.43..49.91 rows=1000 width=18) (actual time=0.026..6.813 rows=1000 loops=1)
          ->  LockRows  (cost=0.43..48479.60 rows=979788 width=18) (actual time=0.026..6.693 rows=1000 loops=1)
                ->  Index Scan Backward using idx_comment_aggregates_published on comment_aggregates a_1  (cost=0.43..38681.72 rows=979788 width=18) (actual time=0.015..5.817 rows=1000 loops=1)
                      Index Cond: (published > (now() - '10 years'::interval))
  ->  Nested Loop  (cost=0.42..2618.00 rows=1000 width=38) (actual time=0.810..211.210 rows=1000 loops=1)
        ->  CTE Scan on batch  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.032..7.533 rows=1000 loops=1)
        ->  Index Scan using comment_aggregates_pkey on comment_aggregates a  (cost=0.42..2.35 rows=1 width=26) (actual time=0.159..0.159 rows=1 loops=1000)
              Index Cond: (id = batch.id)
Planning Time: 2.284 ms
Execution Time: 216.383 ms
```
**Optimized query in this PR**
```
CTE Scan on batch  (cost=2480.28..2500.28 rows=1000 width=12) (actual time=0.027..6.247 rows=1000 loops=1)
  CTE batch
    ->  Limit  (cost=0.43..49.91 rows=1000 width=18) (actual time=0.023..5.944 rows=1000 loops=1)
          ->  LockRows  (cost=0.43..48479.60 rows=979788 width=18) (actual time=0.022..5.860 rows=1000 loops=1)
                ->  Index Scan Backward using idx_comment_aggregates_published on comment_aggregates a  (cost=0.43..38681.72 rows=979788 width=18) (actual time=0.015..5.316 rows=1000 loops=1)
                      Index Cond: (published > (now() - '10 years'::interval))
  CTE updated
    ->  Update on comment_aggregates a_1  (cost=0.42..2430.37 rows=0 width=0) (actual time=3.219..3.219 rows=0 loops=1)
          ->  Nested Loop  (cost=0.42..2430.37 rows=247 width=38) (actual time=3.216..3.216 rows=0 loops=1)
                ->  CTE Scan on batch batch_1  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.008..0.303 rows=1000 loops=1)
                ->  Index Scan using comment_aggregates_pkey on comment_aggregates a_1  (cost=0.42..2.35 rows=1 width=26) (actual time=0.003..0.003 rows=0 loops=1000)
                      Index Cond: (id = batch_1.id)
                      Filter: (hot_rank <> 0)
                      Rows Removed by Filter: 1
Planning Time: 0.270 ms
Execution Time: 9.584 ms
```
As you can see, the optimized query runs more than an order of magnitude faster (roughly 216 ms vs. 10 ms for a batch of 1000 old rows). On lemm.ee, with this change, the scheduled task now processes hot ranks for all history in under 30 seconds. As a result, I think it should be safe to stop limiting the hot rank update task to just the previous week.