Auto increment keys in ets set table

juhlig · July 24, 2023, 11:51am

Since “monotonic” and/or “contiguous” are not among your list of properties, why not use erlang:unique_integer/0,1? Stay away from the monotonic option if speed is high on your list of goals, though: it makes this otherwise very fast operation quite expensive.

The number of unique integers that this function can generate is 2⁶⁴ - 1 when using monotonic (which you shouldn’t if you can), and (NumberOfSchedulers + 1) * (2⁶⁴ - 1) without it (see here). So, while there is not an infinite amount of unique integers here, it should be large enough for almost anything you can bring to the table.

That said, the integers this function returns are unique only per runtime instance, so you might get duplicates when using it on multiple nodes. But you can use a tuple like {node(), erlang:unique_integer()} to work around that. Both are fast operations, one resulting in an atom and the other in an integer, so memory usage shouldn’t be an issue, either.
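
A minimal sketch of the tuple-key idea above, for illustration only. The module name, the table name and the helper functions are made up for this example; only node/0, erlang:unique_integer/0 and the ets calls come from OTP.

```erlang
-module(unique_key_sketch).
-export([new_table/0, insert/2]).

%% Create a public set table; the table name is a made-up example.
new_table() ->
    ets:new(tuple_data, [set, public]).

%% Insert Value under a freshly generated key and return that key.
%% erlang:unique_integer/0 is unique only within this runtime instance,
%% so the node name is added to keep keys distinct across nodes.
insert(Tab, Value) ->
    Key = {node(), erlang:unique_integer()},
    true = ets:insert_new(Tab, {Key, Value}),
    Key.
```

Keys built this way are neither contiguous nor ordered, and they are not safe to reuse across a restart of the same node.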

Inforista · July 24, 2023, 1:33pm

Thanks - it looks interesting

But if I do a backup of the ets tables to dets tables, shut the system down (kill erlang), and later start the system again and load the tables back into memory … then is there a risk of duplicates, or can erlang:unique_integer handle this?

juhlig · July 24, 2023, 2:02pm

Hm, good point, and no, it can’t

Inforista · July 24, 2023, 2:16pm

Ok, thank you for clarifying this. But I will definitely remember erlang:unique_integer() for projects where I don’t have to use persistent counters like in this tuplespace project.

juhlig · July 24, 2023, 3:01pm

You could use something other than or in combination with the node name in the tuples, like a UUID-based integer that you create once at the startup of your application on a node, store in persistent_term, and use to “tag” inserts from that node. If the node/application goes down and later comes back up, another tag-integer should be created on it, so even if the calls to erlang:unique_integer/0,1 themselves create duplicates with the previous runtime of the node, the different tag-integer would ensure that the ID tuple is unique. (Just thinking out loud)

juhlig · July 24, 2023, 3:13pm

crypto:strong_rand_bytes is probably good enough to provide you with a sufficiently unique tag-integer, but don’t take my word for it =^^=
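
One way the tag-integer idea could look, as a rough sketch rather than a vetted design. The module name, the persistent_term key and the choice of 128 random bits are assumptions made for this example.

```erlang
-module(tagged_id_sketch).
-export([init_tag/0, next_id/0]).

%% Hypothetical persistent_term key for the per-run tag.
-define(TAG_KEY, {?MODULE, tag}).

%% Call once at application startup: draw 128 random bits and keep them
%% in persistent_term for cheap reads from any process.
init_tag() ->
    <<Tag:128>> = crypto:strong_rand_bytes(16),
    persistent_term:put(?TAG_KEY, Tag).

%% The tag differs between runs (with overwhelming probability), so an
%% integer repeated by erlang:unique_integer/0 after a restart still
%% yields a different ID tuple.
next_id() ->
    {persistent_term:get(?TAG_KEY), erlang:unique_integer()}.
```

Uniqueness across restarts here is probabilistic, not guaranteed: it rests on two runs never drawing the same random tag.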

max-au · July 25, 2023, 5:02am

After re-reading several times, I realised that your goal is to implement some sort of “term sharing” mechanism. The intent is to de-duplicate tuples in various tables by storing a cheap “reference”, and having a dictionary containing reference-counted tuples.

I recall doing something similar a long time ago. However, my requirements were relaxed, so I took a different approach and used phash2 of the [Term] instead of a unique reference. It worked for me, because hash collisions were relatively rare, and I didn’t need a bidirectional map.
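
The hash-as-reference approach might look roughly like the sketch below. This is not the original code from that project: the module, the table layout {Hash, Term, RefCount} and the collision handling are assumptions chosen for illustration.

```erlang
-module(phash_share_sketch).
-export([new/0, share/2, fetch/2]).

%% Dictionary table; rows are {Hash, Term, RefCount} (assumed layout).
new() ->
    ets:new(shared_terms, [set, public]).

%% Store Term under its hash and bump its reference count.
%% Returns {ok, Hash}, or {error, collision} if a different term
%% already occupies that hash.
share(Tab, Term) ->
    Hash = erlang:phash2(Term),
    case ets:lookup(Tab, Hash) of
        [] ->
            true = ets:insert(Tab, {Hash, Term, 1}),
            {ok, Hash};
        [{Hash, Term, _Count}] ->
            _ = ets:update_counter(Tab, Hash, {3, 1}),
            {ok, Hash};
        [{Hash, _Other, _Count}] ->
            {error, collision}
    end.

%% Resolve a hash back to the stored term.
fetch(Tab, Hash) ->
    case ets:lookup(Tab, Hash) of
        [{Hash, Term, _Count}] -> {ok, Term};
        [] -> error
    end.
```

The lookup followed by an insert is not atomic, so concurrent writers would need to go through a single owner process or a similar guard.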

Your solution is to have a bi-map, or a table with a secondary index. To my knowledge, there is no canonical implementation. One can simulate a secondary index by adding a separate table (bidirectional mapping).

That’s what you’re doing with tuple_id_table. It is a secondary index for the id column of the tuple_data_main_table. When processes are accessing the main table concurrently, it’s possible to have inconsistencies. For example, a tuple may have been deleted from the main table but still be present in the index. This one can be easily resolved by checking whether the Id still exists in the main table after fetching it from the index table.
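
The check described above could be sketched as follows. The actual layout of tuple_data_main_table and tuple_id_table is not shown in this thread, so the row shapes used here ({Key, Id, Data} in the main table, {Id, Key} in the index) are assumptions.

```erlang
-module(index_lookup_sketch).
-export([lookup_by_id/3]).

%% Assumed layout: Main holds {Key, Id, Data}, Index holds {Id, Key}.
%% The index hit is trusted only if the main table still has a row
%% for that key carrying the same Id.
lookup_by_id(Main, Index, Id) ->
    case ets:lookup(Index, Id) of
        [{Id, Key}] ->
            case ets:lookup(Main, Key) of
                [{Key, Id, Data}] -> {ok, Key, Data};
                _ -> not_found   % stale index entry
            end;
        [] ->
            not_found
    end.
```

A stale index entry is simply reported as not_found here; cleaning it up is left to the caller.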

Inforista · July 25, 2023, 7:03am

Thank you for feedback

Yes, I am working on a “sort of term sharing” system, but more advanced.

I call my system Inforistaspace, and the end goal is a process coordination language so that all kinds of processes, no matter what language they are programmed in, can communicate and coordinate easily through an API over the internet.

Based on the idea of Linda Tuplespace it facilitates things like:

Coordination of master worker tasks (parallel computing)
Distributed data structures and data sharing
Timing / synchronisation between processes
Notification systems
Detection of co-occurrences

… and a lot more

The goal is to make it easier for programmers to share and coordinate data between systems of any kind.

Maria-12648430 · July 27, 2023, 7:29pm

That sounds pretty interesting. Can I see some code, like, on GitHub or something?

Inforista · July 28, 2023, 5:29am

Hi Maria,

Yes it is interesting

Right now the project will not be open source. It will be a core product of an upcoming business (inforista.ai, site not public yet) with two main goals:

Helping businesses improve and optimize their communication and coordination.
Development of a new and fundamentally different way of constructing AI systems: one that is self-learning without the need for big-data training, and also reversible and explainable, with no black-box limitations and biases.

The business has a stated goal of:

Help protect our planet and improve life for everybody. We will not build systems to support negative things like war, advertising, surveillance or gambling.

BUT when the Inforistaspace shared memory and process coordination language is ready, there will be free developer access for non-commercial use of it.

And of course, most of the codebase will be Erlang

Maria-12648430 · July 28, 2023, 11:50am

Exciting

Ah, pity… I would have loved to see this grow and how it turns out.

Is that a one-man show?

Nothing to say against that

Hm, I wonder… while you’re not going to build such systems yourself, how will you ensure that it won’t be used by others for those ends?

Best choice you could make, for sure

Jokes aside though, let me give you some advice and heads-ups, based on what I gathered from the previous posts, ie without seeing actual code:

You mentioned backing up the ETS to DETS tables in case of shutdown. Just be aware that, unlike ETS tables, DETS files are limited to 2GB in size each. So if you have some huge tables, maybe after running your service for some time, this may hit you unawares and $IMPORTANT_STUFF may get unexpectedly lost.

ets:tab2file/1,2 is not subject to such limits (AFAIK), and faster too: on my system, ets:to_dets/2, used on an ETS table with ~2.5GB of data in it, fails after 37s, while ets:tab2file/2 succeeds after ~6s.
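
For reference, a small sketch of a dump/restore round trip with ets:tab2file/3 and ets:file2tab/2. The module and function names are invented for this example, and whether the extra verification options are worth their cost for tables of this size is an open question.

```erlang
-module(tab2file_sketch).
-export([save/2, load/1]).

%% Dump Tab to File. extended_info adds an object count and an MD5 sum
%% so that the file can be verified when it is read back.
save(Tab, File) ->
    ets:tab2file(Tab, File, [{extended_info, [object_count, md5sum]}]).

%% Recreate the table from File, verifying the stored checks.
%% Returns {ok, Tab} or {error, Reason}.
load(File) ->
    ets:file2tab(File, [{verify, true}]).
```

Note that ets:file2tab recreates the table with its original name and options, so loading a named table fails if a table with that name already exists.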

I don’t know if ets:tab2file has any disadvantages compared to ets:to_dets in this use case, though. Maybe somebody else may shed some light on this?
Have you thought about what could/should happen if the computer hosting your ETS tables gets hit by lightning (quoting Joe Armstrong)? I mean, by using ETS (or DETS for that matter), you’re basically limited to 1 node on that front, unless you implement some replication scheme.

How important is the data in those tables, can it be restored from somewhere, on another machine if need be, if it is important? How important is consistency between the index and main table? You may want to think about checking and repairing the index after a restart, and how much overhead (read “extra downtime”) that imposes on your service.
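
As an illustration of what a repair step after a restart could involve, here is a sketch that rebuilds the index from scratch. The table layout ({Key, Id, Data} in the main table, {Id, Key} in the index) and all names are assumptions, since the real tables are not shown in this thread.

```erlang
-module(index_repair_sketch).
-export([rebuild_index/2]).

%% Assumed layout: Main holds {Key, Id, Data}, Index holds {Id, Key}.
%% Drops every index entry and regenerates them from the main table.
%% Meant to run before other processes start using the tables.
%% Returns the number of index entries written.
rebuild_index(Main, Index) ->
    true = ets:delete_all_objects(Index),
    ets:foldl(fun({Key, Id, _Data}, Count) ->
                      true = ets:insert(Index, {Id, Key}),
                      Count + 1
              end, 0, Main).
```

This is a full scan of the main table, so the extra downtime grows with the table size.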

Inforista · July 28, 2023, 6:10pm

Hi Maria

Thank you for feedback and ideas - I will try to clarify

Yes, right now it is a one-man show building the core system in Erlang. Looking ahead, more involvement will naturally be needed.

You are right, this is difficult. Ensuring that others won’t use the technology for negative things like war, advertising, surveillance and gambling is not an easy task. Different steps must be pursued:

legal protection … luckily this new way of doing computing/AI is already patented, in most parts of the world now.
strong focus on only allowing access via API
security and distribution of core system components

I already use ets:tab2file to save data from RAM to disk and back again in the system. It works nicely, and I won’t have problems with the 2 GB limit since data is spread over multiple ets tables in the system.
Right now the system handles 50 million operations on the Inforistaspace in 72 seconds. And it takes 14 seconds to move 5 GB of data from RAM to disk files.

I agree about the lightning issue. I love Joe Armstrong’s book, so sad he is not here anymore.
I am thinking about two scenarios:

non-critical business process coordination => just have the system on one node
critical applications => replicate the inforistaspace on two nodes with automatic failover if lightning hits.

Good point. My goal is to handle consistency between tables with an internal, specially built software transactional mechanism, without limiting concurrent process access to the ets tables too much.