What's the best approach for hashing or otherwise shortening a large number of arbitrarily long strings in Python or pandas?
Say I have 10 GB of UTF-8 text organized as two-column CSV tables with arbitrarily long values. I want to load the data and do search and group operations such as pandas.Series.isin() and pandas.DataFrame.groupby(). The string values have arbitrary length, but I expect most of them to be between 10 and 10,000 characters, and most will be natural English text. I have three questions:
- Is hashing the string values a good approach to speed up computation, or are there other recommended methods?
- Is the Python built-in hash() function useful here, or do I need a more robust algorithm?
- If I were working with much more data, how would I estimate the risk of collisions for different methods? Put another way, would I need to change my approach if I were working with 100 GB of text? What about 10,000 GB?
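For context, here is a minimal sketch of the workflow the question describes: replace each long string with a fixed-width integer key, then run isin() and groupby() on the keys instead of the raw strings. The file name data.csv, the column names, and the query strings are hypothetical. hashlib.blake2b is used in place of the built-in hash() only because hash() output is salted per process (PYTHONHASHSEED) and so is not stable across runs; this is an illustration, not a recommended answer. The last few lines show the usual birthday-bound approximation for collision risk.

```python
# Sketch only: file name, column names, and query strings are assumptions.
import hashlib

import numpy as np
import pandas as pd


def stable_hash64(s: str) -> int:
    """Return a 64-bit integer digest of a string.

    Unlike the built-in hash(), this is stable across processes and
    Python versions because it does not depend on PYTHONHASHSEED.
    """
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little", signed=True)


# Load the two-column CSV (hypothetical file and column names).
df = pd.read_csv("data.csv", names=["key", "value"])

# Replace the long strings with fixed-width 64-bit integer hashes.
df["value_hash"] = df["value"].map(stable_hash64).astype(np.int64)

# Membership test against a set of query strings, done on the hashes.
queries = ["some text", "another text"]            # hypothetical queries
query_hashes = {stable_hash64(q) for q in queries}
mask = df["value_hash"].isin(query_hashes)

# Group rows by the hashed value instead of the raw string.
counts = df.groupby("value_hash").size()

# Birthday-bound estimate of collision risk for n distinct strings
# hashed into a b-bit space: P(collision) ~= n**2 / 2**(b + 1).
n = 10_000_000            # assumed number of distinct strings
b = 64
p_collision = n**2 / 2 ** (b + 1)
print(f"approximate collision probability: {p_collision:.2e}")
```

With these assumed numbers (10 million distinct strings, 64-bit hashes), the birthday approximation gives a collision probability on the order of 1e-6; the question is whether that kind of estimate, and the built-in hash(), hold up at 100 GB and beyond.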
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow