Use cases for duplicate_bag tables?

Maria-12648430 · January 3, 2024, 1:26pm

I have lately been discussing ets table types with a colleague, and we found that we couldn’t come up with any use cases for the duplicate_bag type, not even any far-fetched contrived ones.

The stumbling block is always the fact that, while one can insert any number of identical objects in such a table, it is not possible to delete or update just a single one: It’s always either all or nothing.
And while it might be possible to dream up something remotely useful involving traversal, there is no order to the objects, either.

So, much as we tried, we couldn’t find any case where duplicate_bag tables would be a fitting solution.

LostKobrakai · January 3, 2024, 1:48pm

phoenix_pubsub does use duplicate_bag as discussed in The Road to 2 Million Websocket Connections in Phoenix - Phoenix Blog

Maria-12648430 · January 3, 2024, 3:00pm

Hm, that is interesting, thanks

However, reading the surrounding material, it sounds more like a hack that was luckily possible, not a typical use case?

sverker · January 3, 2024, 7:54pm

Well, there is insert order at lookup :

1> T = ets:new(x,[duplicate_bag]).
2> ets:insert(T, {key, 1}).
3> ets:insert(T, {key, 2}).
4> ets:insert(T, {key, 1}).
5> ets:insert(T, {key, 3}).
6> ets:lookup(T, key).
[{key,1},{key,2},{key,1},{key,3}]

Don’t know if that makes the duplicate_bag more useful.

Maria-12648430 · January 4, 2024, 8:58am

Indeed it does, a little Does the same apply for take/2? By quickly trying it out, it looks like it does, but the documentation says nothing in that line, so I’m not sure.

A problem I see with insertion order only present in lookup (and maybe take) is that you never know how much you might get. Traversal with match/1,3 and select/1,3 would alleviate this by limiting the result at a time, but alas, for those insertion order for same keys seems not to be preserved

Maria-12648430 · January 4, 2024, 12:32pm

@sverker On a slightly different track… In the article that @LostKobrakai linked, they were able to boost performance by going for duplicate_bag instead of a bag table, which was possible just like that because the surrounding application was guaranteed to not insert any duplicates in the first place. I assume that performance boost came about because, other that with bag tables, there are no checks for duplicate objects needed in a duplicate_bag table?

TD5 · April 2, 2024, 6:50pm

Chiming in to say I make heavy use of duplicate_bag because it works out MUCH faster than bag as the number of entries grows (esp. as the number of entries for the same key grows), and often it’s safe to use duplicate_bag even if you don’t need duplicates, because the surrounding context prevents duplicates occurring anyway. Seems like you were spot on with your assumption. Now, if we could have a bag implementation with similar performance…

Maria-12648430 · April 3, 2024, 7:35am

Well, that would mean optimizing this part in erl_db_hash.c:

    else if (tb->common.status & DB_BAG) {
        struct tmp_uncomp_term* tmp = NULL;
	HashDbTerm** qp = bp;
	q = b;
	do {
	    if (db_terms_eq(&tb->common,
                            &value_to_insert->dbterm,
                            &q->dbterm,
                            &tmp)) {
		if (is_pseudo_deleted(q)) {
                    INC_NITEMS(tb, lck_ctr, hval);
                    q->pseudo_deleted = 0;
		    ASSERT(q->hvalue == hval);
		    if (q != b) { /* must move to preserve key insertion order */
			*qp = q->next;
			q->next = b;
			*bp = q;
		    }
		}
                free_term(tb, value_to_insert);
                free_tmp_uncomp_term(&tb->common, tmp);
		goto Ldone;
	    }
	    qp = &q->next;
	    q = *qp;
            (*consumed_reds_p)++;
	}while (q != NULL && has_key(tb,q,key,hval));
        free_tmp_uncomp_term(&tb->common, tmp);
    }

… to look more like this:

    /*else DB_DUPLICATE_BAG */