Unrecognized type error (pyshark capfile) numba?

import pyshark
import pandas as pd
import numpy as np
from multiprocessing import Pool
import re
import sys
from numba import jit

temp_array  = []

cap = np.array(pyshark.FileCapture(sys.argv[1]))
#print(cap._extract_packet_json_from_data(cap[0]))

def parse(capture):
    packet_raw = [i.strip('\r').strip('\t').split(':') for i in str(capture).split('\n')]
    packet_raw = map(lambda num: [num[0].replace('(', ''), num[1].strip(')').replace('(', '')]
                     if len(num) == 2 else [num[0], ':'.join(num[1:])], [i for i in packet_raw])
    raw = list(packet_raw)[:-1]
    cols = [i[0] for i in raw]
    vals = [i[1] for i in raw]
    temp_array.append(dict(zip(cols, vals)))
    return dict(zip(cols, vals))

@jit(nopython=True)
def preprocess_dataset(x):
    count = 0
    temp = []
    #p = Pool(5)
    #r = p.map(parse,cap)
    #p.close()
    #p.join()
    #print(r)
    try:
        for i in cap:
            temp.append(parse(i))
            count += 1
    except Exception:
        print("somethin")
    #print(r)
    data = pd.DataFrame(temp)
    print(data)
    data = data[['Packet Length','.... 0101 = Header Length','Protocol','Time to Live','Source Port','Length','Time since previous frame in this TCP stream','Window']]
    data = data.rename(columns={".... 0101 = Header Length": 'Header Length'})
    filtr = ["".join(re.findall(r'\d.',str(i))) for i in data['Time since previous frame in this TCP stream']]
    data['Time since previous frame in this TCP stream'] = filtr
    data.to_csv('data.csv')

preprocess_dataset(1000000)

11: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  cap = np.array(pyshark.FileCapture(sys.argv[1]))
Traceback (most recent call last):
  File "/root/ddos-detect/shark.py", line 47, in <module>
    preprocess_dataset(1000000)
  File "/usr/local/lib/python3.9/dist-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/usr/local/lib/python3.9/dist-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'parse': Cannot determine Numba type of <class 'function'>

File "shark.py", line 34:
def preprocess_dataset(x):
       for i in cap:
          temp.append(parse(i))



Solution 1:[1]

Numba does not support Pandas. This is explicitly stated on the first page of the documentation. Additionally, it cannot directly call pure-Python functions in nopython mode, which is what the "Untyped global name 'parse'" error is reporting. Thus, parse would need to be decorated with @jit too. Finally, Numba barely supports strings: only basic functions are supported, and they are not fast. In short, Numba is not the right tool here. Consider using Numpy or maybe Cython. PyPy might help too.
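As a rough illustration of the point above, here is a minimal sketch of the same parsing done in plain Python with no @jit decorator. It is not the asker's exact logic: the function names parse_packet_text and preprocess, and the SAMPLE text, are made up for this example, and the sample only imitates "Key: value" lines rather than reproducing real pyshark output.

```python
def parse_packet_text(text):
    """Turn 'Key: value' lines of a packet's text dump into a dict.

    Hypothetical helper: it assumes each field sits on its own line
    and that the first ':' separates the field name from its value.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue  # skip blank lines and lines without a field separator
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def preprocess(packet_texts):
    # Plain Python loop, deliberately without @jit: string handling and
    # dicts of arbitrary fields are not a good fit for Numba's nopython mode.
    return [parse_packet_text(text) for text in packet_texts]


# Made-up sample text for illustration only.
SAMPLE = "\tTime to Live: 64\n\tProtocol: TCP (6)\n\tSource Port: 443\n\tWindow: 501\n"

rows = preprocess([SAMPLE])
print(rows)
```

With real captures, the input strings would presumably come from str(packet) for each packet yielded by pyshark.FileCapture, and the resulting list of dicts can be passed to pd.DataFrame as in the question. Only the undecorated, plain-Python structure is the point here.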

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jérôme Richard