What's the best approach for hashing or otherwise shortening a large number of arbitrarily long strings in Python or pandas?

Say I have 10GB of UTF-8 text organized in two-column CSV tables. I want to load the data and run search and grouping operations such as pandas.Series.isin() and pandas.DataFrame.groupby(). The string values have arbitrary length, but I expect most of them to be between 10 and 10,000 characters, and most values will be natural English text. I have three questions:
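
For reference, here is a minimal sketch of the kind of workflow I have in mind (the file name and column names are placeholders, not real data):

    import pandas as pd

    # Two-column CSV with potentially very long string values
    df = pd.read_csv("data.csv", usecols=["key", "value"])

    # Membership test against a set of query strings
    queries = ["some string", "another string"]
    mask = df["value"].isin(queries)

    # Group rows that share the same (possibly very long) string value
    counts = df.groupby("value").size()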

  1. Is hashing the string values a good approach to speed up computation, or are there other recommended methods? (A sketch of what I have in mind follows this list.)
  2. Is the Python built-in hash() function useful here, or do I need a more robust algorithm?
  3. If I were working with much more data, how would I estimate the risk of collisions for different methods? Put another way, would I need to change my approach if I were working with 100GB of text? What about 10,000GB? (The collision-estimate sketch below is the kind of reasoning I'm after.)
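
To make questions 1 and 2 concrete, this is roughly the kind of hashing I am considering: replace each string with a fixed-width integer key before calling isin() or groupby(). The column names are placeholders, and hashlib.blake2b is only one example of a hash that is stable across processes, unlike the built-in hash(), which is randomized per process for strings unless PYTHONHASHSEED is fixed:

    import hashlib
    import pandas as pd

    def stable_hash64(s):
        # 8-byte blake2b digest -> 64-bit integer; stable across runs,
        # unlike the built-in hash() for str, which is salted per process.
        digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    df = pd.DataFrame({"value": ["short text", "a much longer piece of text ..."]})
    df["value_hash"] = df["value"].map(stable_hash64)

    # Search and group on the compact integer column instead of the raw strings
    queries = {stable_hash64(q) for q in ["short text"]}
    mask = df["value_hash"].isin(queries)
    counts = df.groupby("value_hash").size()

If there is a more idiomatic pandas way to get stable fixed-width keys (for example pandas.util.hash_pandas_object), that would answer question 1 as well.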

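For question 3, this back-of-the-envelope birthday-bound calculation is the kind of estimate I mean; the distinct-value counts in the loop are illustrative guesses, not measurements of my data:

    import math

    def collision_probability(n_distinct, hash_bits):
        # Birthday bound: with n distinct values and a b-bit hash, the chance of
        # at least one collision is roughly 1 - exp(-n*(n-1) / 2**(b+1)).
        return 1.0 - math.exp(-n_distinct * (n_distinct - 1) / 2 ** (hash_bits + 1))

    for n in (10**7, 10**8, 10**9):  # plausible distinct-value counts (guesses)
        print(n, collision_probability(n, 32), collision_probability(n, 64))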

Sources

This question follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow