How to write kmrs better?


% Run this in erlang shell:
% c("kmer.erl").
% kmer:calculate(Number).

get_timestamp() ->
  {Mega, Sec, Micro} = os:timestamp(),
  (Mega*1000000 + Sec)*1000 + round(Micro/1000).

calculate(N) ->
  Now = get_timestamp(),
  Kmers = calculate(N, string:copies("A", N), string:copies("T", N)),
  Delta = get_timestamp() - Now,
  io:fwrite("Nummer of generated k-mers: ~p - took ~pms~n", [Kmers, Delta]).

calculate(N, Start, Stop) -> calculate(N, Start, Stop, 1).

calculate(N, Start, Stop, C) -> if
  Start == Stop -> C;
  true -> calculate(N, next(N, Start), Stop, C + 1)

next(N, Start) -> next(N, 1, Start, "T").

next(N, I, Start, "T") ->
  C = string:sub_string(Start, I, I),
  New = convert(C),
  String = string:left(Start, I - 1) ++ New ++ string:right(Start, N - I),
  next(N, I + 1, String, C);
next(_, _, Start, _) -> 
  % io:fwrite("~p~n", [Start]),

convert(S) -> case S of
  "A" -> "C";
  "C" -> "G";
  "G" -> "T";
  "T" -> "A";
  _Else -> " "

can I somehow write this better/faster?


What is this supposed to do? :thinking:


My 2 cents:

  • You are measuring time by hand with get_timestamp/0. You can user timer:tc/2 to do this for you.
  • In next/4 you should utilize pattern matching on the first character of Start and than construct a list recursively. Extracting n-th charachter by hand, and than recreating the whole string with string:left(Start, I - 1) ++ New ++ string:right(Start, N - I), is O(n), so basically next is O(n^2).
  • Last parameter of next is always 1 character long. Do you even need it to be a list?
  • Same for convert, why not using just a character? I guess this is some genetic computing, so if you don’t expect input that is not A, C, G or T, just let it crash. Also, you can use pattern matching in the function argument.
  • Also, you are using some obsolete functions from string module. This surely works, but take a look into the documentation for substitute functions.
  • you can use pattern matching in calculate/4 in a following way:
calculate(_, Stop, Stop, C) -> C; %% This ensures 2nd and 3rd argument is equal
calculate(N, Start,Stop,C) -> calcualate(N, next(N,Start), Stop, C+1).

So next function can be something like this:

next(N, Start) -> next(N,1,Start,'T').

next(N, I, [C|Rest], 'T') ->
    [convert(C) | next(N, I+1,Rest,C)];
next(_, _, Start, _) ->

convert('A') -> 'C';
convert('C') -> 'G';
convert('G') -> 'T';
convert('T') -> 'A'.

edit: Also, you can get rid of the first argument in next/2, next/4, calculate/3 and calculate/4 if you write it like this. Suggested function is not tail recursive, so a really big lenght of the input would probably crush it, but can easily be rewrited to tail recursive function by additng the accumulator argument.


Define “faster”, and also OTP version you’re using (also, shameless plug, you probably want to use erlperf to answer a question “which version of my code is faster”).

Very obvious thing is not to use string though.