Collecting data first in Python to conduct operations

I was given the following problem, where I had to make the output for log_data match expected_result. The code below is my solution, edited to include comments containing the feedback I received:

import collections

 
log_data = """1.1.2014 12:01,111-222-333,454-333-222,COMPLETED
1.1.2014 13:01,111-222-333,111-333,FAILED
1.1.2014 13:04,111-222-333,454-333-222,FAILED
1.1.2014 13:05,111-222-333,454-333-222,COMPLETED
2.1.2014 13:01,111-333,111-222-333,FAILED
"""
 
expected_result = {
    "111-222-333": "40.00%",
    "454-333-222": "66.67%",
    "111-333" : "0.00%"
}
 
def compute_success_ratio(logdata):
    #! better option to use .splitlines()
    #! or even better recognize the CSV structure and use csv.reader
    entries = logdata.split('\n')
    #! interesting choice to collect the data first
    #! which could result in explosive growth of memory hunger, are there
    #! alternatives to this structure?
    complst = []
    faillst = []
    #! probably no need for attaching `lst` to the variable name, no? 
 
    for entry in entries:
        #! variable naming could be clearer here
        #! a good way might involve destructuring the entry like:
        #! _, caller, callee, result
        #! which also avoids using magic indices further down (-1, 1, 2)
        ent = entry.split(',')
        if ent[-1] == 'COMPLETED':
            #! complst.extend(ent[1:3]) for even more brevity
            complst.append(ent[1])
            complst.append(ent[2])
        elif ent[-1] == 'FAILED':
            faillst.append(ent[1])
            faillst.append(ent[2])
 
    #! variable postfix `lst` could let us falsely assume that the result of set()
    #! is a list.
    numlst = set(complst + faillst)
 
    #! good use of collections.Counter,
    #! but: Counter() already is a dictionary, there is no need to convert it to one
    comps = dict(collections.Counter(complst))
    fails = dict(collections.Counter(faillst))
    #! variable naming overlaps with global, and doesn't make sense in this context
    expected_result = {}
 
    for e in numlst:
        #! good: dealt with possibility of a number not showing up in `comps` or `fails`
        #! bad: using a try/except block to deal with this when a simpler .get(e, 0)
        #! would've allowed dealing with this more elegantly
        try:
            #! variable naming not very expressive
            rat = float(comps[e]) / float(comps[e] + fails[e]) * 100
            perc = round(rat, 2)
            #! here we are rounding twice, and then don't use the formatting string
            #! to attach the % -- '{:.2f}%'.format(perc) would've been the right
            #! way if one doesn't know percentage formatting (see below)
            expected_result[e] = "{:.2f}".format(perc) + '%'
            #! a generally better way would be to either
            #! from __future__ import division
            #! or to compute the ratio as 
            #! ratio = float(comps[e]) / (comps[e] + fails[e])
            #! and then use percentage formatting for the ratio
            #! "{:.2%}".format(ratio) 
        except KeyError:
            expected_result[e] = '0.00%'
 
    return expected_result
 
if __name__ == "__main__":
    assert(compute_success_ratio(log_data) == expected_result)
 
#! overall
#! + correct 
#! ~ implementation not optimal, relatively wasteful in terms of memory 
#! - variable naming inconsistent, overly shortened, not expressive
#! - some redundant operations
#! + good use of standard library collections.Counter
#! ~ code could be a tad bit more idiomatic

I have understood some of the problems, such as the variable naming conventions and avoiding the try/except block where possible.

However, I fail to understand how using csv.reader improves the code. Also, how am I supposed to understand the comment about collecting the data first? What could the alternatives be? Could anybody shed some light on these two issues?



Solution 1:[1]

split('\n') and splitlines() create a copy of your data where each line is a separate item in a list. Since you only need to pass over the data once, rather than access lines randomly, this is wasteful compared to csv.reader, which can return one line at a time. The other benefit of using the reader is that you don't have to split the data into lines and the lines into columns manually.
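
For illustration, a minimal sketch of that streaming style, reusing log_data from the question (the variable names follow the reviewer's destructuring suggestion):

import csv
from io import StringIO

# csv.reader yields one already-split row at a time, so there is no
# manual splitting and no list of all lines held in memory at once
for _, caller, callee, result in csv.reader(StringIO(log_data)):
    print(caller, callee, result)  # e.g. 111-222-333 454-333-222 COMPLETED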

The comment about data collection refers to the fact that you add all the completed and failed items to two lists. Let's say that item 111-333 completes five times and fails twice. Your data would look something like this:

complst = ['111-333', '111-333', '111-333', '111-333', '111-333']
faillst = ['111-333', '111-333']

You don't need those repeating items, so you could have used Counter directly, without collecting the items into lists first, and save a lot of memory.
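
For example, here is a hedged sketch (Python 3, reusing log_data and expected_result from the question) that counts as it goes, so memory scales with the number of distinct phone numbers rather than the number of log lines:

import collections
import csv
from io import StringIO

completed = collections.Counter()
failed = collections.Counter()

for _, caller, callee, result in csv.reader(StringIO(log_data)):
    if result == 'COMPLETED':
        completed.update([caller, callee])  # count both numbers at once
    elif result == 'FAILED':
        failed.update([caller, callee])

# a Counter returns 0 for missing keys, so no try/except is needed
ratios = {}
for number in completed.keys() | failed.keys():
    total = completed[number] + failed[number]
    ratios[number] = '{:.2%}'.format(completed[number] / total)

assert ratios == expected_result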

Here's an alternative implementation that uses csv.reader and collects success and failure counts into a dict, where the item name is the key and the value is a list [success count, failure count]:

from collections import defaultdict
import csv
from io import StringIO

log_data = """1.1.2014 12:01,111-222-333,454-333-222,COMPLETED
1.1.2014 13:01,111-222-333,111-333,FAILED
1.1.2014 13:04,111-222-333,454-333-222,FAILED
1.1.2014 13:05,111-222-333,454-333-222,COMPLETED
2.1.2014 13:01,111-333,111-222-333,FAILED
"""

RESULT_STRINGS = ['COMPLETED', 'FAILED']
counts = defaultdict(lambda: [0, 0])
for _, *params, result in csv.reader(StringIO(log_data)):
    try:
        index = RESULT_STRINGS.index(result)
        for param in params:
            counts[param][index] += 1
    except ValueError:
        pass # Skip line in case last column is not in RESULT_STRINGS

result = {k: '{0:.2f}%'.format(v[0] / sum(v) * 100) for k, v in counts.items()} 

Note that the above will only work on Python 3, since it relies on extended iterable unpacking (_, *params, result) and on true division in v[0] / sum(v) * 100.
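
As a quick sanity check (assuming expected_result from the question is in scope), the computed mapping matches the expected one:

assert result == expected_result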

Solution 2:[2]

Alternatively, pandas looks like a good fit for this purpose, if you are OK with using it.

import pandas as pd

log_data = pd.read_csv('data.csv', header=None)  # assumes the log has been saved to data.csv
log_data.columns = ['date', 'key1','key2','outcome']

meltedData = pd.melt(log_data, id_vars=['date','outcome'], value_vars=['key1','key2'],
              value_name = 'key') # we transpose the keys here
meltedData['result'] = [int(x.lower() == 'completed') for x in meltedData['outcome']] # add summary variable

groupedData = meltedData.groupby(['key'])['result'].mean()
groupedDict = groupedData.to_dict()

print(groupedDict)

Result:

{'111-333': 0.0, '111-222-333': 0.40000000000000002, '454-333-222': 0.66666666666666663}
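
The grouped means are plain ratios; if you need the exact string format from expected_result, one extra formatting pass (a sketch using the percentage formatting mentioned in the review feedback) converts them:

formatted = {key: '{:.2%}'.format(ratio) for key, ratio in groupedDict.items()}
# {'111-333': '0.00%', '111-222-333': '40.00%', '454-333-222': '66.67%'}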

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: niemmi
[2] Solution 2: Oxymoron88