How to write kmrs better?

-module(kmer).
-export([calculate/1]).

% Run this in erlang shell:
% c("kmer.erl").
% kmer:calculate(Number).

get_timestamp() ->
  {Mega, Sec, Micro} = os:timestamp(),
  (Mega*1000000 + Sec)*1000 + round(Micro/1000).

calculate(N) ->
  Now = get_timestamp(),
  Kmers = calculate(N, string:copies("A", N), string:copies("T", N)),
  Delta = get_timestamp() - Now,
  io:fwrite("Nummer of generated k-mers: ~p - took ~pms~n", [Kmers, Delta]).

calculate(N, Start, Stop) -> calculate(N, Start, Stop, 1).

calculate(N, Start, Stop, C) -> if
  Start == Stop -> C;
  true -> calculate(N, next(N, Start), Stop, C + 1)
end.

next(N, Start) -> next(N, 1, Start, "T").

next(N, I, Start, "T") ->
  C = string:sub_string(Start, I, I),
  New = convert(C),
  String = string:left(Start, I - 1) ++ New ++ string:right(Start, N - I),
  next(N, I + 1, String, C);
next(_, _, Start, _) -> 
  % io:fwrite("~p~n", [Start]),
  Start.

convert(S) -> case S of
  "A" -> "C";
  "C" -> "G";
  "G" -> "T";
  "T" -> "A";
  _Else -> " "
end.

can I somehow write this better/faster?

2 Likes

What is this supposed to do? :thinking:

2 Likes

My 2 cents:

  • You are measuring time by hand with get_timestamp/0. You can user timer:tc/2 to do this for you.
  • In next/4 you should utilize pattern matching on the first character of Start and than construct a list recursively. Extracting n-th charachter by hand, and than recreating the whole string with string:left(Start, I - 1) ++ New ++ string:right(Start, N - I), is O(n), so basically next is O(n^2).
  • Last parameter of next is always 1 character long. Do you even need it to be a list?
  • Same for convert, why not using just a character? I guess this is some genetic computing, so if you don’t expect input that is not A, C, G or T, just let it crash. Also, you can use pattern matching in the function argument.
  • Also, you are using some obsolete functions from string module. This surely works, but take a look into the documentation for substitute functions.
  • you can use pattern matching in calculate/4 in a following way:
calculate(_, Stop, Stop, C) -> C; %% This ensures 2nd and 3rd argument is equal
calculate(N, Start,Stop,C) -> calcualate(N, next(N,Start), Stop, C+1).

So next function can be something like this:

next(N, Start) -> next(N,1,Start,'T').

next(N, I, [C|Rest], 'T') ->
    [convert(C) | next(N, I+1,Rest,C)];
next(_, _, Start, _) ->
    Start.

convert('A') -> 'C';
convert('C') -> 'G';
convert('G') -> 'T';
convert('T') -> 'A'.

edit: Also, you can get rid of the first argument in next/2, next/4, calculate/3 and calculate/4 if you write it like this. Suggested function is not tail recursive, so a really big lenght of the input would probably crush it, but can easily be rewrited to tail recursive function by additng the accumulator argument.

3 Likes

Define “faster”, and also OTP version you’re using (also, shameless plug, you probably want to use erlperf to answer a question “which version of my code is faster”).

Very obvious thing is not to use string though.

3 Likes