But if you are doing indexing, why do you necessarily care that it's specificall...

enragedcacti · on May 15, 2024

> But if you are doing indexing, why do you necessarily care that it's specifically chronological?

edit: on second review if live insertions were occurring then this code would have an edge case, however the indexing job has an endtime presumably chosen where they can be sure no more inserts will occur. Given that the choice to use a timestamp probably has to do with the fact that there are 4 different tables being indexed and you would otherwise have to track their IDs individually.

original:

    id, timestamp
    1 , 1000
    2 , 1001
    4 , 1002
    --------- limit stops here
    5 , 1002
    3 , 1003

ID only: If we use ID > 4 as our start point for next time and ID 3 was not inserted yet then we will have missed it

createdAt only: If we use createdAt > 1002 then we will skip ID 5 next batch

OPs strategy: Even if we use createdAt > 1002 and it skips ID 5 it will be caught by createdAt = 1002 AND ID > 4. The Order by createdAt asc, id asc guarantees that if the limit splits 4 and 5 that we see 4 first and thus don't miss 5. I think this does still miss the case where ID 4 is inserted after ID 5 however.

mewpmewp2 · on May 15, 2024

> I think this does still miss the case where ID 4 is inserted after ID 5 however.

Yeah, and I would think that is very likely to happen given it would be the same timestamp.

So the whole thing still seems like a flawed, and unnecessarily complex solution to me, which should just use one simple unique sortable field to do all of it.

Like my intuitive guess is that maybe the solution could save maximally 20% - 40% of the same edge case, which doesn't seem like a good solution. It is not going to solve the problem. It's just adding complexity that can cause other problems.

So if postgres does the type of caching where it allocates 10 auto incr IDs to each process, which causes sometimes IDs being out of order, then normally it would be just enough to wait after these allocations have performed and index then, you are not going to miss any rows.

I would assume these processes have some form of timeout if there was a case where they couldn't assign one of those IDs, and then this ID would just maybe not exist or if there was a mechanism to reallocate that would work too, but none the less, I think some form of postgres sortable unique id would have to work by itself.