'Sequence representation method for a random sequences

I have a set of sequences of ABCD units, for example let A = 0, B = 1, C = 2, D = 4. I can represent a sequence AAABBBCCCDDD as a vector [000111222444] with no problem. However, what if I only know the % of ABCD units, but the sequence is inherently random (statistically random) the distribution of ABCD could be anything. What would be the best way to represent such a sequence in Python?

The end goal would be to feed such sequences to the machine learning model.

Thank you for your help!

Vito



Solution 1:[1]

You can generate random sequences with the right proportion of letters A, B, C, D like so:

import random

length = 10

ratio_a = 0.4
ratio_b = 0.3
ratio_c = 0.2
ratio_d = 0.1

population = (
    'A' * round(ratio_a * length) + 
    'B' * round(ratio_b * length) +
    'C' * round(ratio_c * length) +
    'D' * round(ratio_d * length)
)

seq = random.sample(population, k=length)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jacques Gaudin