Simplest way to sort a list of headlines?
I'm pretty out of my depth here — hoping this is alright to post. I have a list of 1000 or so headlines. I'm trying to identify headlines that are about the same thing but worded differently.
Hoping to be pointed in the direction of the least difficult way to do this, find out if there are any existing tools out there for this, find relevant tutorials, etc. I've been Googling but haven't found anything on this specifically, possibly because I'm missing the vocab to describe it. (In an ideal world, there'd be some online tool for this that I wouldn't have to code, but I will try to code it if necessary.) Thanks.
Solution 1:[1]
One way you could solve this, at least to a rough approximation:
1. Count the total number of occurrences of each word across the entire list.
2. Group together words that share the same root, e.g. walks, walking, walked, and add their counts together.
3. Sort this frequency list from most common to least common.
4. For word-group 1 in the frequency list, take the set of headlines that contain it at least once and sort them by how many times it occurs.
5. Repeat (4) for word-group 2 in the frequency list, and so on through the end of the frequency list.
6. You would now have a short list of related headlines for each word-group. Browse some of these yourself to see if there are some meaningfully similar ones.
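The steps above can be sketched in a few lines of Python. This is only a rough illustration, not a definitive implementation: `crude_root`, `tokenize`, and `group_headlines` are hypothetical helper names, the suffix-stripping in `crude_root` is a very crude stand-in for a real stemmer, and the optional `stopwords` parameter and the "at least two headlines" cutoff are my own assumptions rather than part of the answer.

```python
import re
from collections import Counter


def crude_root(word):
    """Rough stand-in for a stemmer (assumption): strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three letters remain, to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(headline):
    """Lowercase a headline, split it into words, and reduce each word to its root."""
    return [crude_root(w) for w in re.findall(r"[a-z']+", headline.lower())]


def group_headlines(headlines, stopwords=frozenset()):
    """Return (root, headlines) pairs, most frequent root first.

    stopwords is an optional set of roots to ignore (e.g. "the", "of");
    it is an addition for convenience, not part of the described steps.
    """
    tokens = [tokenize(h) for h in headlines]

    # Steps 1-2: count occurrences per root across the whole list.
    counts = Counter(t for toks in tokens for t in toks if t not in stopwords)

    groups = []
    # Step 3: walk the roots from most common to least common.
    for root, _ in counts.most_common():
        # Steps 4-5: headlines containing this root, with their occurrence count.
        members = [
            (toks.count(root), h)
            for toks, h in zip(tokens, headlines)
            if root in toks
        ]
        if len(members) < 2:  # a group of one headline is not useful (assumption)
            continue
        members.sort(key=lambda m: -m[0])  # most occurrences first
        groups.append((root, [h for _, h in members]))
    return groups


if __name__ == "__main__":
    sample = [
        "Mayor walks out of budget meeting",
        "Budget meeting ends after mayor walked out",
        "Storm floods downtown streets",
    ]
    # Step 6: browse each group by eye.
    for root, members in group_headlines(sample, stopwords={"of", "out", "after"}):
        print(root, "->", members)
```

For real data, swapping `crude_root` for a proper stemmer (for example the Porter stemmer shipped with NLTK) and passing a standard stopword list would likely give cleaner groups, at the cost of installing a library.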
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
