This library lets you work with data that is too large to fit in RAM, or even infinite, the same way you would work with lists using the lists module.
Let’s say you have a 100 GB text file whose lines may or may not be numbers, and we want to calculate the total sum of the numbers (why not?):
foo
1234
321
barbar
baz
543
If this file were just 1 MB, you would simply read it whole into memory, split it by \n and process the resulting list using lists:filter and lists:sum or lists:foldl:
% get lines
{ok, Bin} = file:read_file("1mb_file.txt"),
Lines = binary:split(Bin, <<"\n">>, [global]),
MatchingLines = lists:filter(
    fun(Line) ->
        % does the line contain only digits?
        case re:run(Line, "^[0-9]+$") of
            nomatch ->
                false;
            {match, _} ->
                true
        end
    end, Lines),
% convert to integers
Integers = lists:map(fun erlang:binary_to_integer/1, MatchingLines),
% sum the integers
Sum = lists:sum(Integers).
But what are we going to do if the file does not fit into RAM? Then we would have to read the file line by line and write a complex, nested, non-reusable, hand-written tail-recursive function to get the same result. Something like:
sum_file(Filename) ->
    % open in binary mode so read_line returns binaries
    {ok, F} = file:open(Filename, [read, binary, raw, read_ahead]),
    try process(F, 0)
    after file:close(F)
    end.

process(F, Sum) ->
    case file:read_line(F) of
        {ok, Line0} ->
            % read_line keeps the trailing newline; strip it
            Line = string:trim(Line0, trailing, "\n"),
            case re:run(Line, "^[0-9]+$") of
                nomatch ->
                    process(F, Sum);
                {match, _} ->
                    process(F, Sum + binary_to_integer(Line))
            end;
        eof ->
            Sum
    end.
iterator.erl lets you use the same higher-order functions as in the lists module (or your own) without reading the whole input at once:
LinesIterator = file_line_iterator("100GB_file.txt"),  % see the repo README for the implementation
MatchingIterator =
    iterator:filter(
        fun(Line) ->
            case re:run(Line, "^[0-9]+$") of
                nomatch ->
                    false;
                {match, _} ->
                    true
            end
        end, LinesIterator).
IntegerIterator = iterator:map(fun erlang:binary_to_integer/1, MatchingIterator).
Sum = iterator:fold(fun erlang:'+'/2, 0, IntegerIterator).
The code looks almost identical to the first example (the iterator module is used in place of lists), but it never holds more than one line in memory at a time!
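Conceptually, such an iterator can be modelled as a zero-argument fun that, when called, returns either none or {Value, NextIterator}. Here is a minimal hand-rolled sketch of that idea (not the library's actual internals; the module name lazy and its function names are made up for illustration):

```erlang
-module(lazy).
-export([from_list/1, map/2, filter/2, fold/3]).

%% An iterator is a fun of arity 0 returning none | {Value, NextIterator}.

from_list([]) -> fun() -> none end;
from_list([H | T]) -> fun() -> {H, from_list(T)} end.

%% Apply F to each element, lazily: nothing runs until the iterator is forced.
map(F, It) ->
    fun() ->
        case It() of
            none -> none;
            {V, Next} -> {F(V), map(F, Next)}
        end
    end.

%% Keep only elements for which Pred returns true, lazily.
filter(Pred, It) ->
    fun() -> filter_next(Pred, It()) end.

filter_next(_Pred, none) -> none;
filter_next(Pred, {V, Next}) ->
    case Pred(V) of
        true -> {V, filter(Pred, Next)};
        false -> filter_next(Pred, Next())
    end.

%% Folding forces the whole (finite) iterator, one element at a time.
fold(F, Acc, It) ->
    case It() of
        none -> Acc;
        {V, Next} -> fold(F, F(V, Acc), Next)
    end.
```

Because each step only produces the next element on demand, memory use stays constant no matter how long the underlying stream is.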
The library also includes iterator_pmap: a parallel version of lists:map that processes the input iterator’s elements on a pool of worker processes and streams the results into an output iterator (there are ordered and unordered versions):
1> I = iterator:from_list(lists:seq(1, 100)).
...
2> I1 = iterator_pmap:pmap(fun(T) -> timer:sleep(T), T end, I, #{ordered => false}).
...
3> timer:tc(fun() -> iterator:to_list(I1) end).
{559483, [1,2,3,4,5|...]}
So it takes just ~559 ms to process the whole list on 10 (configurable) workers, instead of the ~5 s that a sequential lists:map needs:
4> timer:tc(fun() -> lists:map(fun(T) -> timer:sleep(T), T end, lists:seq(1, 100)) end).
{5151070, [1,2,3,4,5|...]}
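For intuition, the core of an ordered parallel map can be sketched with plain processes (a toy illustration only; toy_pmap is a made-up name, and unlike the real iterator_pmap it spawns one process per element with no bounded worker pool, so it only works on finite lists):

```erlang
-module(toy_pmap).
-export([pmap/2]).

%% Spawn one process per element, then collect the replies in input order
%% by selectively receiving on each unique reference.
pmap(F, List) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn(fun() -> Parent ! {Ref, F(X)} end),
                Ref
            end || X <- List],
    [receive {Ref, Result} -> Result end || Ref <- Refs].
```

iterator_pmap improves on this pattern by keeping the number of in-flight workers bounded, which is what makes it usable on huge or infinite iterators.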
At Klarna we mostly use it to run batch maintenance jobs and to iterate through database tables.
It is available on Hex, with documentation on HexDocs and source code on GitHub.