'String compression using repeated chars count

I'm going through Cracking the Code and there is this question where they ask to write a method for string compression so:

aabbccccaa 

Would become:

a2b1c4a2

I came up with this:

''.join(y+str.count(y) for y in set(str))

But my output was:

a5c4b1

Could someone point me in the clean direction?

Sorry for bad edits, I'm on a cellphone



Solution 1:[1]

You could use groupby to do the work for you:

>>> from itertools import groupby
>>> s = 'aabbccccaa'
>>> ''.join(k + str(sum(1 for x in g)) for k, g in groupby(s))
'a2b2c4a2'

Solution 2:[2]

I thought I'd give a modification based on niemmi's answer that is forward and reversible up to 0x200 occurences of a character and is testable for correctness:

output = []
for k,g in groupby(txt):
    k = bytes(k,'ascii')
    g = sum(1 for x in g)
    if g > 0xff:
        output.append(b''.join(k,b'0xff'))
        by = bytes([0xff-g])
        output.append(b''.join(k,by))
    else:
        by = bytes([g])
        output.append(b''.join((k,by)))

compressed = b''.join(output)

decompressed = []
for char, count in zip(compressed[::2], compressed[1::2]):
    decompressed.append(chr(char)*count)

decompressed = ''.join(decompressed)

The compression can easily be extended up to any number of occurences beyond 0x200 using the above logic, but the decompression always stays the same for use with any number of occurences. The reason for using bytes is so that you can actually get compression if you were to use a file instead of a python datatype. However this is limited to ascii so you will need to preprocess like so:

import string
txt = open('to_compress.txt').read()
txt = "".join(filter(lambda x: x in string.printable, txt))

Furthermore, you will want to test that it works.

print("Same?", txt == decompressed)
>>> Same? True

You may want to install the pip package burrowswheeler, then you can just:

to_compress = burrowswheeler.transform(txt)
... compress decompress...
burrowswheeler.inverse(decompressed)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 niemmi
Solution 2