Total term order

Questions

  • Should we have a total term order?
  • Should we complete the set of “exact” comparison operators?

Background

Today, Erlang has got these comparison operators: <, =<, >, >=, ==, and /=. These form a complete set that adheres to a term order where number() < atom() < reference() < fun() < port() < pid() < tuple() < map() < [] < [_|_] < bitstring().

We also have the “exact” operators =:= and =/=, a.k.a. the match tests.

Terms of the same type have a defined order, for some types documented and for others just defined, so e.g. all references have a total order between themselves.

Integer+float conflation

What is peculiar about the current term order is that it conflates integer() | float() into number(), so integers and floats of equal value are considered equal: e.g. 3 == 3.0
and 2.0 < 3 are both true. This may be practical in many cases, but it does not constitute a total term order since terms that are different (of different type, for starters) compare equal.

The “exact” operators distinguish between integers and floats, so 3 =:= 3.0 is false.
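
For illustration, in the shell the facts above look like this:

1> 3 == 3.0.
true
2> 2.0 < 3.
true
3> 3 =:= 3.0.
false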

Ordered data structures

The lack of a total term order complicates order-dependent features such as lists:sort/1, which has to treat 3 and 3.0 as the same value. It makes the best of the situation and promises a stable sort, and lists:usort/1 promises to keep the first instance of “equal” elements. The behavior varies between different data structures such as ordsets, gb_trees, ETS, and so on.
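
For example, which of 3 and 3.0 you get back from lists:usort/1 then depends only on which one came first (results following the documented first-instance promise):

1> lists:usort([3, 3.0]).
[3]
2> lists:usort([3.0, 3]).
[3.0]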

In fact, it is hard to implement a data structure, such as a tree, that can handle all term types. You either have to ignore the integer+float conflation, which most have done, or you could work hard to implement a total term order of your own. The problem with the latter is that you would have to traverse down to the bottom of all compound types to implement a total term order function, so I have not seen that done…
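
To give an idea of what such a user-level function would involve, here is a minimal sketch (hypothetical function name; it splits integer/float ties by letting the integer sort closer to zero, as discussed below, and it only recurses into lists, so tuples and maps would need similar clauses):

exact_compare(A, B) when is_number(A), is_number(B) ->
    if
        A < B -> less;
        A > B -> greater;
        A =:= B -> equal;
        %% Equal numeric value but different types:
        %% treat the integer as infinitesimally closer to zero
        is_integer(A), A >= 0 -> less;
        is_integer(A) -> greater;
        is_integer(B), B >= 0 -> greater;
        is_integer(B) -> less;
        true -> equal
    end;
exact_compare([HA | TA], [HB | TB]) ->
    case exact_compare(HA, HB) of
        equal -> exact_compare(TA, TB);
        Order -> Order
    end;
exact_compare(A, B) when A < B -> less;
exact_compare(A, B) when A > B -> greater;
exact_compare(_, _) -> equal.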

Maps key order

For maps we have a special situation, because to make them behave nicely in order-dependent data structures we had to give them a defined order, so with respect to map keys they have a total order where integer() < float() < other types. Therefore #{1 => a} /= #{1.0 => a} and (surprise!) #{2 => a} < #{1.0 => a} are both true. This does not apply to map values, so #{1 => 2} == #{1 => 2.0} is true while #{1 => 2} =:= #{1 => 2.0} is false. This oddity is not much of a problem since it is weird in itself to say that some map is less than some other map.
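
Spelled out in the shell, the examples above become:

1> #{1 => a} /= #{1.0 => a}.
true
2> #{2 => a} < #{1.0 => a}.
true
3> #{1 => 2} == #{1 => 2.0}.
true
4> #{1 => 2} =:= #{1 => 2.0}.
false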

Glitch at zero

The IEEE floating-point standard has got both a zero and a negative zero, and dictates that they are equal, except that when calculating with them the sign should be preserved. This is more useful when you also have positive and negative infinity, which Erlang does not allow. Nevertheless, Erlang has got both 0.0 and -0.0. For historical reasons they not only compare equal, but also match equal, so 0.0 =:= -0.0 is true. One could argue that == should have them as equal but =:= should not, but when we fixed how -0.0 was handled, code that used a case match on 0.0 to avoid division by zero was already common, so we sadly had to make 0.0 exactly equal to -0.0.

Proposals

We could:

  • Introduce a total term order.
  • Introduce operators :<, =:<, >:, and >:= that, together with =:= and =/=, give a complete set of operators for such a term order.
  • Eventually introduce options or functions for existing data structures such as ETS, gb_trees, and lists:sort/1 that adhere to the total term order, to facilitate handling all types of terms nicely.

Total term order

A total term order could be the existing maps key order.

Another close-at-hand possibility would be to do a minimal extension of the current term order (the one that conflates integers and floats): if an integer and a float compare equal, then the integer is considered closer to zero. We can here name this the exact term order, since we are aiming at completing the “exact comparison” operators.

There might be other possible total term orders, but I cannot think of one that is obviously better in any regard.

Pros and cons

The maps key order exists already, and has good performance since converting between floats and integers is avoided.

The exact term order is a minimal extension to the existing term order, and it is arguably intuitive.

The maps key order would have 3 :< 2.0 as true. This can be either utterly bewildering or just a small price to pay for performance, depending on whether you want just any term order or not just any term order.

The exact term order would have 1 :< 1.0, -1.0 :< -1, 0 :< 0.0 and (surprise!) 0 :< -0.0 as true, since -0.0 has to match 0.0. This can either be the only small glitch with this order, or just too ugly to be implemented, depending on your point of view.
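
To make the difference concrete, here is how a handful of numbers would sort, smallest first, under each candidate order (derived from the definitions above):

maps key order:  -1, 0, 1, 3, -1.0, 0.0, 1.0, 2.0
exact order:     -1.0, -1, 0, 0.0, 1, 1.0, 2.0, 3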

The exact term order must be much slower than the maps key order for mixed floats and integers (in some position in a compound term), but only slightly slower than the existing term order.

Follow-up questions

Should the exact term order still use the maps key order for map keys, so that the sort order is the same for < as for :<, or should the exact term order be used here too? E.g. should #{1.0 => a} :< #{2 => a} be true when < has it as false? Should we have performance and consistency between operators, or consistency within the new operators?

Should we implement new operators? We could instead implement a function, either erlang:exact_compare/2 or erlang:maps_key_compare/2, and use that. But new operators optimize better, especially in the JIT, and are easier to use. Completing =:= and =/= would be fulfilling in itself.
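
For a taste of the difference, a BIF-based sort could look something like this (hypothetical sketch; the less | equal | greater result convention of erlang:exact_compare/2 is an assumption here):

Sorted = lists:sort(fun(A, B) -> erlang:exact_compare(A, B) =/= greater end, List)

versus, with a proposed operator:

Sorted = lists:sort(fun(A, B) -> A =:< B end, List)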

Comments, anyone?

Which combination should an EEP aim for?

8 Likes

Maybe a poll would be a good fit for this topic?

1 Like

I would like motivations, and think there are too many disparate questions for a simple poll…

6 Likes

From my point of view the main benefit of a total term order would be better consistency between data structures and the potential for increased performance, so I would choose the maps key order option. I don’t see a problem with 3 :< 2.0, because when you want to do a numerical comparison of that kind, you would use the existing operators anyway.

3 Likes

Instead of :< and >: it is easier to read <:< and >:>.

Otherwise, it looks like a good idea.

1 Like

I’ve had two areas where I’ve run into ordering issues:

  • Database backends, specifically ordered sets, where I’ve wanted to preserve Erlang term ordering. For this, I created sext (Sortable EXTernal term format), where the serialized form sorts the same way (by some reasonable interpretation) as the original terms would. Obviously, the float/integer duality cannot be solved in a library, but some practical assumptions can be made. The sext serialization format isn’t pretty, and quite bulky, but I challenge anyone to make it more compact. :wink: Anyway, for this use-case, it’s possible to expect the user to not be unforgivably stupid (especially not relying on the sort order of floats mixed with integers as primary keys).

  • Hashing, e.g. in blockchains. At Aeternity, we’ve had a few embarrassing bugs where the smart contract VM used maps as an underlying data structure, and forgot to ensure the sort order of the elements in one or two places. When this has been discussed before, one suggestion has been to add an option to term_to_binary, which serializes terms in an order-preserving fashion - not least, sorting map elements.

1 Like

A common mistake today when writing range checks for characters or integers is to forget the is_integer/1 test. For example:

classify(C) when $A =< C, C =< $Z -> uppercase;
classify(C) when $a =< C, C =< $z -> lowercase;
classify(_) -> other.

This function will gladly accept 65.0 and 77.777 and say that they are uppercase letters. Letting floats slip by could mean a crash much later, in code that is not prepared to handle floats. The JIT must also emit more verbose code that can handle both integers and floats.

To reject floats the function must be written like this:

classify(C) when is_integer(C), $A =< C, C =< $Z -> uppercase;
classify(C) when is_integer(C), $a =< C, C =< $z -> lowercase;
classify(C) when is_integer(C) -> other.

If we adopt new operators that use the map key order, the function can be written simply like this and still reject floats:

classify(C) when $A =:< C, C =:< $Z -> uppercase;
classify(C) when $a =:< C, C =:< $z -> lowercase;
classify(C) when is_integer(C) -> other.

If the new operators use the map key order, they can be nicely optimized by the JIT. Using the exact term order for the new operators would not give the JIT any more opportunities for optimizations.

3 Likes

There is a typo, should be #{1.0 => a}.

That aside, a total term order, where either each integer compares less than any float or vice versa, is probably the cleanest solution. Unfortunately, it is also the most likely to break existing code, and somewhat counter-intuitive. So, IMO it should be accompanied by the new operators that you suggested (@okeuday, I find >: more readable than >:>, but that may only be my personal preference) and only employed when those operators are used, not with the existing comparison operators like >.

The exact term order is more intuitive, but it is still tricky. Consider >:=. If you say “… if an integer and a float compare equal, the integer is considered closer to zero…”, what should 1 >:= 1.0 give you? 1 and 1.0 could be seen as equal. But 1 is closer to zero than 1.0, so they are not equal?

2 Likes

Precisely! The new operators must harmonize with the existing =:= and =/=. Because 1 =:= 1.0 is false, they are not equal in this “exact term order”, and therefore 1 >:= 1.0 must be false and 1 :< 1.0 must be true. And the same happens in this case for the “maps key order”, since any integer is strictly less than any float.

All operators for an order must fit together, and all others can be defined from any one of >:, >:=, :<, or =:<. E.g. A =:= B can be defined as A >:= B and B >:= A, or as not (A >: B or B >: A).
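
As a sketch, defining two of the others from >: alone (function names are just illustrative):

exact_eq(A, B)  -> not (A >: B) andalso not (B >: A).  %% corresponds to =:=
exact_lte(A, B) -> not (A >: B).                       %% corresponds to =:<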

Note: the documentation for lists:sort/2 defines an “ordering function” as for example '=<'/2, which is sufficient for sorting.

2 Likes

Fixed, plus one more… Thank you

2 Likes

We have == /= > < >= =<.

We also have =:= =/= and add >: :< >:= =:<, or
we also have =:= =/= and add >:> <:< >:= =:<.

I just want to see them next to each other to get a taste…

A >: B      A >:= B      B :< A      B =:< A      vs.
A >:> B     A >:= B      B <:< A     B =:< A

Personally I think the extra angle bracket is redundant. Double angles give me an association to a pipe or shift more than to comparison.

2 Likes

Although that is an easy mistake to make, it is largely a beginner’s mistake that most non-beginners know about.

Note that the maps key order opens up for a possibly worse mistake:

foo(V) when 3 =:< V, V :< 5.5 -> in_range;
foo(V) when is_number(V) -> out_of_range.

That function would say all integers from 3 and above and all floats less than 5.5 are in_range, for example 17 and 1.0.

Also, with the maps key order it would indeed become possible to write integer (or float) range checks without testing the type since a range within a type also asserts the type.

The mistake you bring up might then become more common, because then there is the mistake of using the =< and < operators and forgetting the type check, since that is not needed for =:< and :< in a range check…

Optimization is nice, but with a type guard you would get optimization also for the exact term order.

I think we should select a language feature based on how it fits in the language, and maybe on “beauty”, rather than on optimization possibilities.

3 Likes

I never saw any benefit in this “numerical comparison” which considers integer-valued floats to be equal to the corresponding integers, and the ensuing mess it made of the term order. So I would welcome a change which puts all integers before all floats, or vice versa. To avoid breaking existing code it will need new syntax, but as long as it’s not too verbose that’s fine.

2 Likes

I have seen it done in OTP code. It is easy to accidentally leave the test out even if you know that it is needed. I am also not sure how many non-beginners are really aware of the issue. After all, floats are not widely used in Erlang, so it is quite easy to forget that they exist at all.

It would probably be a good idea for the compiler to warn about that kind of guard. The type would look weird in Dialyzer, too, so it would be easy to notice the mistake.

Yes, I see that as a feature. Do you see that as a bug?

Maybe. I think that most developers that would use the new operators would stop using the old ones altogether (except when dealing explicitly with numeric values).

That’s true, but you get the same optimizations with the standard term order. My point is that by using the map key order you could actually gain something (more concise code). By adopting the exact term order you don’t really gain much. You basically get a total term order useful for sorting. It doesn’t give you any new capabilities for range checks or anything else.

1 Like

Would it be possible to combine multiple checks, like 3 < V < 5, in a guard? I always missed the ability to do that. If so, the type equality could be checked there.

2 Likes

No, not a bug, more like a possibly dubious feature, just because it happens to work to omit a type guard in this particular case.

I do not like leaning on an external type checker, and it might be hard to emit warnings with good precision. Is this confusing?

foo(V) when 2 =:< V, V =:< 5 -> in;
foo(V) when V :< 0; 7 :< V -> out;
foo(V) when is_integer(V) -> neither.

That one would say 2, 3, 4, 5 are in, integers below 0 are out, integers above 7 are out, 0, 1, 6, 7 are neither, and all other terms, such as floats, are out, not an exception. In other words: you cannot just omit the type test for an open range like integer() :< V, and neither for any open range involving a float.

Omitting the type guard only works for closed ranges on integers or floats, and for left-open integer ranges. Not for right-open integer ranges, nor for (right- or left-)open float ranges.
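
A sketch of that distinction, assuming the map key order for the new operators (function names just for illustration):

closed(V)     when 3 =:< V, V =:< 7 -> ok.  %% matches only the integers 3..7
left_open(V)  when V =:< 7          -> ok.  %% matches only integers =< 7
right_open(V) when 3 =:< V          -> ok.  %% matches integers >= 3, but also all floats, atoms, pids, ...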

So possibly dubious, or at least annoyingly tricky.

1 Like

I mentioned using >:> and <:< instead of >: and :< because it is important not to confuse them with > and <. The syntax is a bit unusual because of other usage of >> and << in programming languages, but it would help to ensure the exact comparison operators clearly stand out in the source code.

1 Like

So what is your verdict? :grinning:

Are the possible ways to misuse the map key order operators sufficient reason to reject them and their potential for doing good?

Yes, omitting type checks should only be done for closed ranges with the upper and lower bounds having the same type. I would strongly advise against omitting the type test with any open ranges, and would never do so in my own Erlang code. (Taking advantage of left-open integer ranges in the compiler and JIT is fair play, though.)

Speaking as the self-appointed attorney for the defense, I am not sure that over-generalizing the omission of type tests from closed ranges to open ranges would be a common mistake. It might not be more common than forgetting the is_integer/1 type test with the current operators. If those new mistakes are made, I am not sure we are any worse off than with the current mistakes.

I think that the first mistake you pointed out, wanting to use a closed float range but by mistake making one of the bounds an integer, is the most serious of the new possible mistakes. That can be caught with a compiler warning, and it seems that I have just volunteered to implement that warning.

1 Like

To be honest, I’m not a huge fan of this proposal. It makes the language much more complex without a very clear benefit - writing a data structure with a total order is a relatively niche task - useful for library authors, but not done much by end users. This just reminds me of the two existing sets of operators for booleans - and and or vs andalso and orelse - I believe everybody will agree this situation is less than desirable. Even the current duality of == and =:= is frequently a source of confusion and bugs.

From an end-user perspective, what I would find infinitely more useful is operators that compare strictly, simply raising on different data types. This would be similar to the semantics of comparison operators in some of the prominent dynamic programming languages - e.g. Python 3 (Python 2 had semantics close to current Erlang) or Ruby. This also very cleanly solves the issue of “integer ranges” indicated above.
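
As a hypothetical sketch of what such a strictly typed comparison could look like, written as a plain function (name and set of clauses purely illustrative):

strict_less(A, B) when is_integer(A), is_integer(B) -> A < B;
strict_less(A, B) when is_float(A), is_float(B)     -> A < B;
strict_less(A, B) when is_atom(A), is_atom(B)       -> A < B;
%% ... one clause per type ...
strict_less(A, B) -> error(badarg, [A, B]).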

If there’s a need for a total order operations, for building data structures, I’d consider introducing them as BIFs, rather than operators. Such use-cases are relatively infrequent and IMO it’s fine if the usage is somewhat more verbose.

Finally, I’d consider if perhaps the -feature mechanism can be used here to “fix” semantics of existing language constructs, rather than expanding the language and introducing new operators (IMO the worst kind of expansion - because operators are generally hard to Google for, research, and communicate about).

7 Likes

The old “Concurrent Programming in ERLANG” book describes the comparison operators =:= and =/= as being “exact”, and all others as being “coerce” comparison operators. In order to be consistent with that, I’ll call the existing term order the coerce term order.

I don’t consider the exact term order (as described by Raimo in the initial post) a new order but merely an extension of the coerce term order. The only thing it does is split ties where values compare as equal in the coerce term order but are of different numeric types.

The exact term order is very intuitive. The reason is that numbers will still compare according to numeric value, just as they always have, until it comes to splitting ties. I find it very unlikely that anyone who has gotten the difference between the coerce and exact order explained once will find it hard to understand what code using the different operators actually will do.

We also won’t need to see this explanation in Erlang introductory courses :smile: :

If the language never had allowed operations mixing integer and float operands, I would have found the map key order quite okay. However, that ship has sailed. Operating on a mixture of integers and floats has been allowed since at least the 90s, so this is where we are, and numeric comparison is part of it.

The map key order already exists and is documented, so why not just use it?

I’d say that where it is used today, it does not mess up the intuitiveness of comparison at all. Comparing two maps is very different from comparing two numbers. As it is now, the map key order is more or less hidden away in a dark corner of map internals.

But you get better performance with the map key order!

An argument for the map key order that has (frequently) been given is that you should not mix integers and floats anyway, so it doesn’t matter that it isn’t intuitive that 1.0 is larger than 2. In light of that, the argument that we would get a performance gain from using the map key order loses weight quite considerably, since the performance gain is only for comparisons of integers with floats.

How about using exact term order also for comparing keys in maps when using the exact comparison operators?

When using the exact term order as the total term order we have “A < B implies A :< B” and “A > B implies A >: B”, which I find very nice and which probably is another way of saying that it is intuitive. Those implications will not hold if the total term order uses the map key order.

If we use the exact term order as the total term order and also use different key orders when ordering maps, depending on whether we compare using the coerce operators or the exact operators, the implications above would not be true when A and B are maps, which would be quite unfortunate. That is, I think that maps should still be compared using the map key order even when using the exact total term order.

Is there a need for a total order?

I’d say yes. As it is today we’ve gotten key value stores using different comparisons, mainly based on whether the internal data structure needs to be ordered or not. This is quite unfortunate since they become incompatible with each other. If we had had a total term order from the beginning, I don’t think we would have gotten this amount of incompatible key value stores. (This is also what triggered this discussion internally to begin with.)

I also find it more useful to introduce the operators than a BIF. My guess is that a BIF won’t attract as many users as the operators will when implementing key value stores.

An operator failing when comparing terms of different types might perhaps be useful for something, but not for this (ordered data structures of terms).

1 Like