Implementing a time-warp-safe ID generator

I recently had a chance to dive deeper into how time warps work with time corrections, and I am curious: what is the right way these days to implement a unique, sortable ID generator?

So, my goal is to have IDs with the following properties:

  1. Monotonically increasing values;
  2. Time-warp safe (even if a time correction occurs in the backward direction, the next value should be greater than the previous one);
  3. Restart-safe (even after a restart, newly generated IDs should be greater than those generated before the restart);
  4. (Optionally) the ability to extract the creation time from an ID.

I have already read Erlang -- Time and Time Correction in Erlang.

So, the first solution, which I believe is correct within a single runtime, is to generate the ID as a
{erlang:monotonic_time(), erlang:unique_integer([monotonic]), erlang:time_offset()} tuple.

This approach is time-warp safe: it keeps producing increasing references even if a time correction occurs, and the event time can still be extracted smoothly.
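For concreteness, a minimal sketch of this single-runtime approach (the function names new_id/0 and id_to_system_time/1 are mine, not from any library):

new_id() ->
    %% Sorts by monotonic time first; the monotonic unique integer breaks ties.
    {erlang:monotonic_time(),
     erlang:unique_integer([monotonic]),
     erlang:time_offset()}.

%% Approximate creation time: monotonic time + time offset = system time.
id_to_system_time({Mono, _Uniq, Offset}) ->
    erlang:convert_time_unit(Mono + Offset, native, millisecond).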

Meanwhile, this approach doesn’t work well when the generated IDs are supposed to be saved to some persistent storage (dets / mnesia / etc.).

After restarting Erlang, the values produced by erlang:monotonic_time() can be less than the latest value of erlang:monotonic_time() before the restart.

The other approach is to use {erlang:system_time(), erlang:unique_integer([monotonic])}, but if I understand correctly, erlang:system_time() can leap backwards if a time correction occurs in the backward direction.

So, what is the right approach to generating sortable IDs for systems that are expected to be restarted?

How unique? Do IDs need to be unique wrt IDs generated by other nodes? Or are they only unique wrt a specific node (allowing for restarts)?

The simplest solution, which doesn’t provide your goal (4), is to have two counters, C for current and N for next. C =< N is an invariant, and N is persisted in secondary storage. When C == N you increment N by some suitable amount and persist that. On restart you resume from N (so some values are skipped). We’ve been using this for some of our ID-generation for many years now.
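If it helps, a rough sketch of that two-counter scheme, assuming a dets table as the secondary storage (the table name id_counter and the ?BATCH size are made up for illustration):

-define(BATCH, 1000).

init(File) ->
    {ok, id_counter} = dets:open_file(id_counter, [{file, File}]),
    Next = case dets:lookup(id_counter, next) of
               [{next, N}] -> N;
               []          -> 0
           end,
    %% Resume from the persisted N; unused values below it are simply skipped.
    persist(Next + ?BATCH),
    {Next, Next + ?BATCH}.                     % {C, N}

next_id({C, N}) when C < N ->
    {C, {C + 1, N}};
next_id({N, N}) ->
    %% C reached N: bump and persist N before handing out the current value.
    persist(N + ?BATCH),
    {N, {N + 1, N + ?BATCH}}.

persist(N) ->
    ok = dets:insert(id_counter, {next, N}),
    ok = dets:sync(id_counter).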

You could maintain a virtual timestamp by having system time in UTC and a local time offset as persisted values. If you detect that system time went backwards you increase the local time offset to maintain the invariant that system time + local time offset is non-decreasing. You’d also need a counter to allow enumerating enough IDs per system time sampling + persist interval.
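A sketch of that virtual-timestamp idea; persisting the new offset and adding the per-interval counter are left to the caller, and the millisecond unit is just an assumption:

%% Invariant: system time + local offset never decreases.
virtual_time(LastVirtual, LocalOffset) ->
    Now = erlang:system_time(millisecond),
    case Now + LocalOffset of
        V when V > LastVirtual ->
            {V, LocalOffset};
        _ ->
            %% System time went backwards: grow the offset so the virtual
            %% clock keeps moving forward; the new offset must be persisted.
            NewOffset = LastVirtual + 1 - Now,
            {LastVirtual + 1, NewOffset}
    end.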

Either way you need persistent storage.


Does this code not provide the guarantees that you need? I believe it does unless I’m misunderstanding some fundamentals of Erlang/BEAM.

% Generate a timestamp using the monotonic clock plus a cached time offset
-spec mono_time() -> integer().
mono_time() ->
    Offset = get(cached_time_offset),
    case Offset == undefined of
        true ->
            TimeOffset = erlang:time_offset(),
            put(cached_time_offset, TimeOffset),
            mono_time();
        false ->
            Offset + erlang:monotonic_time()
    end.

best regards,

– Ben Scherrey

@mikpe Thanks for your ideas!
It led me to another idea which probably removes the theoretical bottlenecks.

So, yeah, we are also trying to avoid too many disk operations during ID generation, as we have quite a huge amount of events (let’s say millions of IDs every few seconds, generated by different processes). Because of this, we also don’t want to funnel counter updates through some separate process, as it could theoretically become a bottleneck.

But your answer led me in another direction: use a counter stored on disk and increase it only during startup and on offset changes.
So, the solution I can see is the following (a rough sketch follows the list):

  1. A process that starts when the application starts. When it starts, it creates a dets/mnesia ordered_set table (if not present) and increases the last counter stored on disk (if one already exists).

The counter term will look like {LastCounterValueAsInteger + 1, erlang:time_offset()}.

Additionally, it keeps that new current value in memory with persistent_term:put/2.

  2. That process also subscribes to offset changes via erlang:monitor(time_offset, clock_service). When an offset change occurs and it receives {'CHANGE', MonitorReference, time_offset, clock_service, NewTimeOffset}, it also increases the counter and saves the new offset (writing it to the dets/mnesia table and updating it in persistent_term).

  3. The ID generation function may then look as follows:

{persistent_term:get(time_shift_counter), erlang:monotonic_time(), erlang:unique_integer([monotonic])}

So, when we generate an ID, we avoid any disk operations or counter updates.

So, after the system restarts, even if erlang:monotonic_time() is lower than the monotonic time from the previous run, the new IDs will still be ordered higher, because the value we get from persistent_term:get(time_shift_counter) will be higher, and that value is the first element of the ID tuple.

  4. When we need to get the timestamp for an ID, for example {4, -576460393040854069, -576460752303416027}, we can simply look up the offset stored for 4 in the dets/mnesia table and add it to -576460393040854069 (monotonic time + time offset = system time).
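To make the steps above concrete, here is a compressed sketch under those assumptions; the module, table, and key names (id_epoch, epochs, time_shift_counter) are illustrative, and error handling and supervision are omitted:

-module(id_epoch).
-behaviour(gen_server).
-export([start_link/0, new_id/0, id_to_system_time/1]).
-export([init/1, handle_info/2, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    {ok, epochs} = dets:open_file(epochs, [{file, "epochs.dets"}]),
    Last = case dets:lookup(epochs, last) of
               [{last, N}] -> N;
               []          -> 0
           end,
    Epoch = Last + 1,
    store_epoch(Epoch, erlang:time_offset()),
    %% Get notified whenever the runtime's time offset changes.
    erlang:monitor(time_offset, clock_service),
    {ok, Epoch}.

handle_info({'CHANGE', _Ref, time_offset, clock_service, NewOffset}, Epoch) ->
    NewEpoch = Epoch + 1,
    store_epoch(NewEpoch, NewOffset),
    {noreply, NewEpoch}.

handle_call(_Req, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% Bump the epoch on disk, remember its offset, and cache it in memory.
store_epoch(Epoch, Offset) ->
    ok = dets:insert(epochs, [{last, Epoch}, {Epoch, Offset}]),
    ok = dets:sync(epochs),
    persistent_term:put(time_shift_counter, Epoch).

%% ID generation: no disk access, no shared counter updates.
new_id() ->
    {persistent_term:get(time_shift_counter),
     erlang:monotonic_time(),
     erlang:unique_integer([monotonic])}.

%% Creation time: the offset stored for the epoch plus the monotonic part
%% (monotonic time + time offset = system time).
id_to_system_time({Epoch, Mono, _Uniq}) ->
    [{Epoch, Offset}] = dets:lookup(epochs, Epoch),
    erlang:convert_time_unit(Mono + Offset, native, millisecond).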

Meanwhile, I feel it is a little bit over-engineered, and I am still trying to figure out a simpler solution for what seems like quite a trivial task… Also, it won’t work in a multi-node environment, so an additional routine will be needed to handle that…

…I smell AdTech :slight_smile:

But your answer led me in another direction: use a counter stored on disk and increase it only during startup and on offset changes.

I’m lazy, so I would look to OS guarantees. Mostly because I want to ignore the harder problem of guaranteeing that state makes it to disk consistently and reliably; leave that to the DB people :slight_smile:

Taking Linux as an example here but each OS provides its own options. Of course the suitability of this depends on whether you have a persistent server or if your entire runtime is transient.

If you include the OS process PID (you can bump the value at which PIDs wrap if need be), you effectively get an incrementing counter for free. This captures whether your runtime restarted.

If you want to detect a system reboot, then boot_id is perfect to mix in. Of course it is a random value (non-sortable for your use case), but you may want to squirrel it away as part of something you can store on disk safely: if you cannot read the version recorded on disk, you just assume there was a reboot.

You can, though, infer a reboot counter via your filesystem’s mount count:

$ sudo tune2fs -l /dev/sda1 | grep 'Mount count'
Mount count:              9

Even if you remount without a reboot, it does not matter as long as the counter goes upwards.
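For illustration, a small sketch of reading those OS markers from Erlang on Linux; the boot_id path is standard, but mixing it in via erlang:phash2 and using os:getpid/0 for the PID are my assumptions:

%% boot_id is random (not sortable), so it is only hashed to detect a reboot;
%% the OS process PID captures a runtime restart.
os_markers() ->
    {ok, BootId} = file:read_file("/proc/sys/kernel/random/boot_id"),
    {erlang:phash2(BootId), list_to_integer(os:getpid())}.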

So, when we generate an ID, we avoid any disk operations or counter updates.

If you are planning to generate millions per second, there is a warning in the manpage about using the monotonic option with erlang:unique_integer(); this is because it has to be strictly synchronised between all your CPU cores.

I personally would avoid using it.

Meanwhile, I feel it is a little bit over-engineered

I think you need to step back a moment and meditate on…the speed of light is a limit…CPU ticks take time…memory bandwidth is not infinite.

These of course sound stupid to say out loud, but it is easy to forget their impact.

So the question is what is actually the time ordering of the events coming in?

When they hit the NIC?
When they hit the OS?
When they get routed into your applications network buffer?
When you actually pop the packet out of the buffer?

Now throw in that these events are reordered in the majority of environments, as your network card (and the switch it is plugged into) is multi-queued, etc. Erlang also has inherent jitter: it may pop the packets off in one order, but your processes get scheduled in an order different to how they were popped.

You are talking on the order of 1M events per second, so we are talking about microseconds as your base resolution. What is the impact to your service if some of the events are jittered by 10µs in the ordering of their timestamps? What about 100µs? Some environments do have this constraint (finance is one space I can think of); yours may not, and since strict time ordering does not seem to be a hard requirement for you, I suspect it is not?

As I treat time as quantised (having discrete values), I would probably recommend something more like the following, since applications should just embrace jitter (like you are embracing time correction):

{node(),RebootCount,Pid,erlang:monotonic_time(),erlang:unique_integer()}

Of course you could push node() through erlang:phash2 if you really needed an integer here.
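As a sketch of that shape, with the phash2 variant applied and RebootCount being whatever restart/reboot counter you settle on:

make_id(RebootCount) ->
    {erlang:phash2(node()),        % node as an integer, if you need one
     RebootCount,
     self(),
     erlang:monotonic_time(),
     erlang:unique_integer()}.     % no [monotonic]: avoids cross-core sync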

I would then include a timestamp separately, with erlang:time_offset() coupled to it; probably as an event in your event stream that applies to all events after it… which will itself be jittered… :slight_smile:
