While using WordCloud for Python, why is the letter "S" included in the frequency count for the cloud?
I'm getting to know the WordCloud package for Python and I'm testing it with the Moby Dick text from NLTK. A snippet of it is as follows:
As you can see from the highlights in the image, all of the possessive apostrophes have been escaped to "/'S", and WordCloud seems to be including this in the frequency count as "S":
[Image: frequency distribution of words]
Of course this causes an issue: "S" gets a high count, and all the other words' frequencies are skewed in the cloud.
In a tutorial that I'm following for the same Moby Dick text, the WordCloud doesn't seem to count the "S". Am I missing an attribute somewhere, or do I have to manually remove "/'s" from my string?
Below is a summary of my code:
```python
import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Tokenized words from the Gutenberg corpus, re-joined into one string
example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```
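For what it's worth, you can confirm what the cloud is counting without plotting anything: `WordCloud.process_text` returns the word-to-count mapping that `generate()` builds internally. A minimal sketch, reusing `novel_as_string` from the snippet above:

```python
# Inspect the token counts WordCloud derives from the input string
freqs = WordCloud().process_text(novel_as_string)
top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)  # if "S" appears near the top, the input tokenization is to blame
```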
Solution 1:[1]
It looks like your input is part of the problem. If you inspect it like so,
```python
corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
words = [word for word in corpus]
print(words[215:230])
```
You get
```
['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.', 'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']
```
You can do a few things to try and overcome this. You could just filter out tokens shorter than two characters, as shown below:
```python
words = [word for word in corpus if len(word) > 1]
```
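Plugged back into the original pipeline, that might look like the following (a sketch reusing the variable names from the question; it also happens to drop the single-character punctuation tokens):

```python
# Drop one-character tokens such as the stray "S" before joining
words = [word for word in corpus if len(word) > 1]
novel_as_string = " ".join(words)
wordcloud = WordCloud().generate(novel_as_string)
```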
You could also try a different file provided by NLTK, or read the input raw and decode it properly.
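For the raw route, `nltk.corpus.gutenberg.raw` returns the untokenized text as a single string, so the possessive apostrophes never get split out in the first place, and WordCloud's default tokenizer keeps apostrophes inside words. A minimal sketch of that approach:

```python
import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# raw() returns the file contents as one string, with possessives
# like "Richardson's" left attached to their words
raw_text = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")

wordcloud = WordCloud().generate(raw_text)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```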
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | cssko |
