In Erlang/OTP 27, +0.0 will no longer be exactly equal to -0.0

Currently, the floating point numbers 0.0 and -0.0 have distinct internal representations. That can be seen if they are converted to binaries:

1> <<0.0/float>>.
<<0,0,0,0,0,0,0,0>>
2> <<-0.0/float>>.
<<128,0,0,0,0,0,0,0>>

However, when they are matched against each other or compared using the =:= operator, they are considered to be equal. Thus, 0.0 =:= -0.0 currently returns true.
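For readers who want to check the sign of a zero without relying on =:=, here is a minimal sketch that inspects the encoded bits instead. The module name zero_bits and both function names are made up for illustration and are not part of OTP; the sketch only assumes the 64-bit encoding shown above.

```erlang
-module(zero_bits).
-export([bits/1, is_negative_zero/1]).

%% Return the 64-bit IEEE 754 encoding of a float as an unsigned integer.
bits(F) when is_float(F) ->
    <<N:64>> = <<F/float>>,
    N.

%% True only for -0.0: sign bit set and every other bit clear.
%% The comparison is between binaries, so it does not depend on how
%% =:= treats the two float zeros.
is_negative_zero(F) when is_float(F) ->
    <<F/float>> =:= <<1:1, 0:63>>.
```

Because only the binaries are compared, this should behave the same before and after the change.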

@nzok considers this behaviour to be a bug. We also consider it a bug, but until recently we were not sure that fixing it and introducing an incompatibility would be worth it.

A recent bug found by erlfuzz made us reconsider. An optimization in the compiler to share identical code would consider the code for the two clauses in this function to be identical:

f(_V0, _V0) ->
    -0.0;
f(_, _) ->
    0.0.

and essentially rewrite it to:

f(_, _) ->
    0.0.

Fixing this optimization while =:= considers 0.0 and -0.0 to be equal would be cumbersome and could make the compiler slower. It is also likely that other optimizations in the compiler are affected and would have to be fixed in similarly cumbersome ways.

Therefore, the OTP Technical Board decided that in Erlang/OTP 27, we will change +0.0 =:= -0.0 so that it will return false, and matching positive and negative 0.0 against each other will also fail. When used as map keys, 0.0 and -0.0 will be considered to be distinct.

The == operator will continue to return true for 0.0 == -0.0.

To help to find code that might need to be revised, in OTP 27 there will be a new compiler warning when matching against 0.0 or comparing to that value using the =:= operator. The warning can be suppressed by matching against +0.0 instead of 0.0.

We also plan to introduce the same warning in OTP 26.1, but it will be disabled by default. Anyone who suspects they have code that might be affected can turn on that warning in OTP 26.1.

25 Likes

Interesting … For people unfamiliar with the topic, could you also please describe (or link to an article on) why <<-0.0/float>> is represented as <<128,0,0,0,0,0,0,0>>? I think many people, especially newcomers, would be confused and assume that -0.0 and +0.0 are always treated as the same value, i.e. 0.0.

If I understand correctly, regardless of the value (0.0, 12.34, -98.76, …), the first byte of every float is 0 (positive) or 128 (negative), and Erlang previously treated them like this:

defmodule Example do
  def negative?(<<128, _data::8*7>>), do: true
  def negative?(<<_sign, _data::8*7>>), do: false

  def positive?(<<0, _data::8*7>>), do: true
  def positive?(<<_sign, _data::8*7>>), do: false

  def same?(<<_first_sign, 0::8*7>>, <<_second_sign, 0::8*7>>), do: true
  def same?(same, same), do: true
  def same?(_left, _right), do: false
end

and the change amounts to removing the first clause of the Example.same?/2 function, right?

1 Like

Here is a link to Wikipedia’s description:

Yes, the most significant bit is the sign bit. Your example has the wrong sizes for the segments, but it does seem that you have correctly understood both the current behavior and the new behavior. Here is a corrected Erlang version of your same?/2 function:

is_same(<<_:1, 0:63>>, <<_:1, 0:63>>) -> true;
is_same(Same, Same) -> true;
is_same(_, _) -> false.

3 Likes

Since 8-bit 0 is 00000000 and 128 is 10000000, both examples are good. The <<0>> is <<0:8>> and <<128>> is <<1:1, 0:7>>:

1> <<0, 0:56>> =:= <<0.0/float>>.             
true
2> <<128, 0:56>> =:= <<-0.0/float>>.
true
3> <<0:1, 0:63>> =:= <<0.0/float>>. 
true
4> <<1:1, 0:63>> =:= <<-0.0/float>>.  
true
5> <<0, 0:56>> == <<0:1, 0:63>>.
true
6> <<128, 0:56>> == <<1:1, 0:63>>.
true

For those who do not understand:

  1. My example has:
    a) 8 bits of 0 + 56 bits of 0 (<<0, 0:56>> or <<0, 0::56>> in Elixir)
    b) 8 bits of 128 + 56 bits of 0 (<<128, 0:56>> or <<128, 0::56>> in Elixir)
  2. Your example has:
    a) 1 bit of 0 and 63 bits of 0 (<<0:1, 0:63>> or <<0::1, 0::63>> in Elixir)
    b) 1 bit of 1 and 63 bits of 0 (<<1:1, 0:63>> or <<1::1, 0::63>> in Elixir)

So:

1> <<0>> =:= <<0:8>>.
true
2> <<128>> =:= <<1:1, 0:7>>.
true
3> <<0, 0:56>> =:= <<0:1, 0:63>>.
true
4> <<128, 0:56>> =:= <<1:1, 0:7, 0:56>>.
true
5> <<1:1, 0:7, 0:56>> =:= <<1:1, 0:63>>.
true

Also this article may be helpful especially for new developers:

3 Likes

No, they work for 0.0 but to be picky they drag in another 7 bits that will either be significant (positive? and negative?) or insignificant (same?) which will give funny results for other numbers. For example, -2.0 would be considered non-negative by negative?, and equal to 2.0 according to same?.
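To make the -2.0 case concrete, here is a small sketch in Erlang that contrasts matching the whole first byte with matching only the sign bit. The module and function names are hypothetical; negative_by_byte/1 is meant to mirror the first clause of the Elixir negative?/1 above, not to be a recommended implementation.

```erlang
-module(sign_demo).
-export([negative_by_byte/1, negative_by_bit/1]).

%% Flawed: matches the whole first byte, so besides the sign bit it also
%% requires the top seven exponent bits to be zero.
negative_by_byte(F) when is_float(F) ->
    case <<F/float>> of
        <<128, _:56>> -> true;
        _ -> false
    end.

%% Matches only the most significant bit, which is the sign bit.
negative_by_bit(F) when is_float(F) ->
    case <<F/float>> of
        <<1:1, _:63>> -> true;
        _ -> false
    end.
```

The first byte of -2.0 is 192 (sign bit plus the top exponent bit), so negative_by_byte(-2.0) returns false while negative_by_bit(-2.0) returns true.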

6 Likes

ok, now I understand why my code was wrong

My thinking was on the right track, but I assumed that all of the first 8 bits are used to store the sign. That’s why I asked whether it would always be 0 or 128. That would be an obvious waste of memory; the only relevant bit is the first one, i.e. <<relevant::1, rest::63>>.

An image says more than a thousand words! :smiling_imp:

3 Likes

I’m not a floating point expert, but, per the Comparison section on the Wikipedia page on signed zero, won’t this break IEEE 754 compatibility?

3 Likes

I’m also not a floating point expert. I’d be interested to hear from the core team on this.

According to https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf, “Comparisons shall ignore the sign of zero (so +0 = −0).” So, the standard does indeed say this.

I suspect the idea is that == will still show them equal, and that satisfies the constraint that "negative zero and positive zero should compare as equal with the usual (numerical) comparison operators, like the == operators of C and Java," as the Wikipedia page says. Though Erlang’s == also equates integer 0 and floating point 0.

Erlang’s exact equality operator =:= will distinguish them, but maybe that is regarded as outside the standard’s requirement?

2 Likes

No, arithmetic comparison (==, >, =<, et al) will still consider them equal. They will only be considered unequal in exact comparisons (=:=, =/=) which are an entirely different beast.

IEEE 754 distinguishes between 0.0 and -0.0 on purpose because it’s actually useful in many contexts, and people have complained about our behavior of randomly losing (or gaining!) the sign of zero many times. The bug referenced in the first post was just the straw that broke the camel’s back.

Why? Consider the following:

foo(X, Y) when X =:= Y ->
    Y;
foo(... snip ...) ->
    ... snip ...

When X and Y are the exact same it shouldn’t matter whether we return X or Y from the first clause: if the compiler deduces that it’s cheaper to return X it should be free to do so, but 0.0 =:= -0.0 makes that impossible because we could return 0.0 when the user wanted -0.0 or vice versa.

This affects all code in the compiler that has to reason about exact equality. We cannot blindly assume that A and B represent the same term unless we know for certain that they’re not a zero float, so the compiler skips these optimizations when this isn’t the case. This is not much fun since the highly dynamic nature of Erlang means that we rarely have this information, but we can live with it.

So far so good, right? We know about this problem and have taken measures to combat it. This is where the linked bug comes in. In our SSA form the code looks like this:

function `t`:`f`(_0, _1) {
0: %% Basic block 0
  @ssa_bool = bif:'=:=' _1, _0
  br @ssa_bool, ^4, ^3

4: %% Basic block 4
  ret `-0.0`

3:
  ret `0.0`
}

We have an optimization for sharing basic blocks that more or less just checks whether two blocks contain the same instructions and de-duplicates them if they do. Because 0.0 =:= -0.0, the return instructions in blocks 3 and 4 are considered equal and are thus shared, returning the wrong result.

To make this respect the sign bit of 0.0 we would have to implement our own comparisons manually to recursively dig into arbitrarily complex terms and treat 0.0 and -0.0 differently. This is not very difficult per se, but the rub is that we did not expect this to be a problem here. To solve it in this manner we would have to identify all the places in the compiler where this is necessary, which would quickly become a game of whack-a-mole that we’re sure would cost a lot of precious time and would risk introducing bugs of its own.

As Björn mentioned we’ve considered -0.0 =:= 0.0 to be a bug for quite some time but just haven’t found it big enough of a deal to risk changing until now. People are using floats far more often now than they used to on our platform (the recently launched Elixir Nx being one big user) so we can’t just stick our heads in the sand and pretend it’s not an issue anymore. We’re faced with the options of:

  1. Giving up all attempts to distinguish 0.0 and -0.0, throwing everyone who needs that distinction under the bus.
  2. Having the compiler team bend over backwards trying to make the compiler 0.0-aware everywhere. We are just two people who split our time on many other things as well so we’d rather not have to do this.
  3. Changing exact comparisons (=:=) to consider them different, keeping arithmetic comparison the same.

As we consider the behavior itself a bug and it’s not only the compiler that’s affected, we’re attempting option 3 in the hopes that it won’t cause too many problems elsewhere. If it does we’ll have to backtrack before the OTP 27 release (or perhaps punt it to OTP 28 or even later).

Yes, == and friends will work just as they have before. It’s only exact comparisons that will change.

This is a rather annoying wrinkle: people who have used X =:= Y as both an arithmetic equality and implicit type test cannot blindly rewrite their code to X == Y, and will have to figure out whether the type test is unnecessary or rewrite it as X == Y andalso (is_float(X) == is_float(Y)) or similar.
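The rewrite mentioned above could be wrapped in a small helper. This is only a sketch: the module name num_eq and the function same_number/2 are hypothetical, and restricting it to numbers with a guard is an added assumption, since == itself is defined for all terms.

```erlang
-module(num_eq).
-export([same_number/2]).

%% Arithmetic equality that additionally requires both arguments to be
%% floats or both to be integers. Unlike =:=, it does not look at the
%% sign of zero, because == treats 0.0 and -0.0 as equal.
same_number(X, Y) when is_number(X), is_number(Y) ->
    X == Y andalso (is_float(X) == is_float(Y)).
```

With this helper, same_number(0.0, -0.0) is true while same_number(0, 0.0) is false, which is what X =:= Y gave for those arguments before the change.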

Judging by a survey of the float-heaviest code bases we know of we don’t expect this to become a problem, but you never know, and we’re well aware that we may have to change our course after the release candidates for OTP 27. We’re making noise now in the hopes that we’ll catch these issues early.

11 Likes

Ahh, understood. I was getting tripped up on the exact-equals. I really appreciate the detailed explanation. :pray:

3 Likes

Thank you! This is very helpful and interesting.

I am not heavily impacted personally, but I think the long lead time is a good idea.

I’ve probably used =:= or =/= as an implicit type test as you mention. I also give an introductory homework assignment that asks students to solve the quadratic equation. Normal solutions, including mine, pattern match on 0.0 for the coefficient of the x^2 term.

I assume this kind of code will have to go from

f(0.0) -> %...

to

f(X) when X == 0.0 -> %...

(or separate positive and negative 0.0 or do explicit type checks if that matters).

This will also arise in unit tests, because ?assertEqual(0.0, StudentAnswer) currently succeeds for both positive and negative 0.0 and will fail in the future. My unit tests for the above-mentioned homework will require some changes (to use, e.g., ?assert(StudentAnswer == 0.0)).

For me, it’s probably just a handful of relatively short files to look through, but I can imagine it will take others a fair amount of time and testing.

Again, thank you so much for the clarifications!

3 Likes

Out of curiosity, would it make sense to introduce a feature in OTP 26.x maybe to preemptively enable this change?

2 Likes

This seems to have blown up elsewhere on the internet (hi Hacker News!) and a lot of people misunderstand what this change means. Since I can’t reply everywhere I’ll try to explain it here in a manner that hopefully makes more sense for people who are unfamiliar with Erlang:

Like Prolog before us we only have one data type, terms. This means that the language is practically untyped. There are only terms and while you can certainly categorize them if you wish, there are no types in the sense most people use that term.

Functions have a domain over which terms they operate, and going outside their domain results in an exception that’s often analogous to a type error (for example trying to add a list to an integer) which can be mistaken for having traditional types. To make things more confusing, we have functions that can tell you which pre-defined category a term belongs to, like is_atom returning true for all terms that are in the atom space and false for all terms outside of it.

This mixup is so prevalent that even our documentation refers to these pre-defined categories as “types” despite them being nothing more than value spaces, but it’s important to remember that at the end of the day we only have one data type, and that many functions are defined for all terms.

The arithmetic equality operator (==) returns whether two terms are considered arithmetically equal (for clarity it’s defined for all combinations of terms). Primitive non-numeric terms are simply checked for identity, compound terms are compared recursively, and any numeric terms (floats and integers) are compared according to the rules listed in the documentation. This operator will remain unchanged in the future and 0.0 == -0.0 will continue to hold.

This operator covers just about all common uses, but is not enough for code that needs to reason about terms in the general sense, for example sets, memoization, or other things you would use generics for in another language.

For that we have the exact equality operator (=:=) which returns whether two terms are indistinguishable. That is, for any terms X and Y, X =:= Y returns whether f(X) =:= f(Y) for all pure functions f.

Since there exist several f that distinguish f(0.0) from f(-0.0), we either have to conclude that those functions are broken (where does that leave copysign?) or say that 0.0 =:= -0.0 should not return true.
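As an illustration of such an f, here is a hand-rolled copysign written with the bit syntax. It is a sketch under the assumption that floats are encoded as 64-bit IEEE 754 values, as shown at the top of the thread; the module name sign_copy is made up and this is not an OTP function.

```erlang
-module(sign_copy).
-export([copysign/2]).

%% Return Magnitude with its sign bit replaced by the sign bit of Sign.
copysign(Magnitude, Sign) when is_float(Magnitude), is_float(Sign) ->
    <<_:1, Rest:63>> = <<Magnitude/float>>,   % drop the old sign bit
    <<SignBit:1, _:63>> = <<Sign/float>>,     % take the new sign bit
    <<Result/float>> = <<SignBit:1, Rest:63>>,
    Result.
```

Here copysign(1.0, 0.0) gives 1.0 and copysign(1.0, -0.0) gives -1.0, so the function tells the two zeros apart even though they compare equal with ==.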

We could make it consistent by removing all the things that let us observe the difference, but that removes functionality that people rely on so it’s not much of an option. So what we’ve done so far is to try to sweep these differences deep under the rug and hope no one notices, one small patch at a time.

Since we’ve never exposed copysign or the other IEEE functions that allow you to observe the sign of zero, it has kind-of-sort-of worked as long as people stayed entirely within Erlang-land. Unfortunately with the rising popularity of GPGPU in general and the Nx library in particular, this bug has been rearing its ugly head with increasing regularity.

Not only because the compiler flubs the constants 0.0 and -0.0 like in the examples earlier in the thread, but because application code does the same. If you want to memoize the result of the aforementioned f(0.0) it will be confused with f(-0.0) unless you invoke some arcane nonsense to keep them apart.
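One possible form of that workaround, sketched here with hypothetical names, is to key the cache on the float's encoded bits rather than on the float itself. The sketch only handles a float at the top level of the key (floats nested inside tuples or lists are left alone), and it assumes no other key happens to have the shape {float_bits, Binary}.

```erlang
-module(memo_key).
-export([key/1]).

%% Map a term to a cache key that keeps 0.0 and -0.0 apart regardless of
%% how =:= treats them, by using the 64-bit encoding of the float.
key(F) when is_float(F) -> {float_bits, <<F/float>>};
key(Term) -> Term.
```

A cache held in a map could then store results with Cache#{memo_key:key(X) => Result}, and the entries for 0.0 and -0.0 would not overwrite each other.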

The compiler is just where this bug is most visible. While we can certainly try to make the compiler distinguish between these values at all times there’s nothing we can do for application code, hence our attempt at breaking backwards compatibility. If it fails we’re most likely going to have to throw in the towel on this bug.

Yes.

The compiler will raise a warning whenever 0.0 is used in this manner, so our hope is that it will not be too difficult to find where to change the code.

Yes, we’ll look into it.

13 Likes