How to get vocabulary after Sklearn ColumnTransformer

I want to get the vocabulary after fitting a ColumnTransformer.

This is my code:

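from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

features = df[["content", "numeric1", "numeric2"]]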
results = df["label"]

features = features.to_numpy()
results = results.to_numpy()

# Creating vectorizer
transformerVectoriser = ColumnTransformer(transformers=[('vector_char', TfidfVectorizer(analyzer='char', ngram_range=(2, 6), max_features = 2500, lowercase = True), 0),
                                                        ('vector_word_1', TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_features = 10000, lowercase = True), 0),
                                                        ('vector_word_2', TfidfVectorizer(analyzer='word', ngram_range=(2, 2), max_features = 4500, lowercase = True), 0),
                                                        ('vector_word_3', TfidfVectorizer(analyzer='word', ngram_range=(3, 3), max_features = 750, lowercase = True), 0)],
                                          remainder='passthrough'
                                          )

print(transformerVectoriser.vocabulary_)

I'm getting this error:

AttributeError: 'ColumnTransformer' object has no attribute 'vocabulary_'

I have also tried this:

features = transformerVectoriser.fit_transform(features)
print(features.vocabulary_)

But I'm getting this error:

raise AttributeError(attr + " not found")
AttributeError: vocabulary_ not found

I have also tried this:

transformerVectoriser.fit(features)  
print("Stem vocabulary:")
print(transformerVectoriser.transformers_[0].vocabulary_)

Error: AttributeError: 'tuple' object has no attribute 'vocabulary_'

And this:

transformed_features = transformerVectoriser.fit_transform(features) 

print("Stem vocabulary:")
print(transformed_features.transformers_[0].vocabulary_)

Error: AttributeError: transformers_ not found


Solution 1:[1]

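Each of the four individual transformers in your ColumnTransformer has its own vocabulary; the ColumnTransformer itself has no vocabulary_ attribute, and neither does the matrix returned by fit_transform. After fitting, transformerVectoriser.transformers_ is a list of (name, fitted transformer, columns) tuples, so you need a second index [1] to reach the fitted vectorizer inside each tuple, i.e.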

transformerVectoriser = ColumnTransformer(transformers=[('vector_char', TfidfVectorizer(analyzer='char', ngram_range=(2, 6), max_features = 2500, lowercase = True), 0),
                                                        ('vector_word_1', TfidfVectorizer(analyzer='word', ngram_range=(1, 1), max_features = 10000, lowercase = True), 0),
                                                        ('vector_word_2', TfidfVectorizer(analyzer='word', ngram_range=(2, 2), max_features = 4500, lowercase = True), 0),
                                                        ('vector_word_3', TfidfVectorizer(analyzer='word', ngram_range=(3, 3), max_features = 750, lowercase = True), 0)],
                                          remainder='passthrough'
                                          )

transformerVectoriser.fit(features)  
# or transformed_features = transformerVectoriser.fit_transform(features) 

print("Character n-gram vocabulary:")
print(transformerVectoriser.transformers_[0][1].vocabulary_)
print("~~")
print("Word vocabulary:")
print(transformerVectoriser.transformers_[1][1].vocabulary_)
print("~~")
print("Bigram vocabulary:")
print(transformerVectoriser.transformers_[2][1].vocabulary_)
print("~~")
print("Trigram vocabulary:")
print(transformerVectoriser.transformers_[3][1].vocabulary_)
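
If you would rather not rely on positional indexes, the fitted vectorizers can also be looked up by the names given in the transformers list, via named_transformers_. Below is a minimal sketch of that, assuming a small made-up array in place of your df and shortened vectorizer settings; the variable names ct, vocabularies and word_vocab are only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in (assumption) for df[["content", "numeric1", "numeric2"]].to_numpy()
features = np.array(
    [
        ["the cat sat", 1.0, 2.0],
        ["the dog sat", 3.0, 4.0],
        ["a cat ran", 5.0, 6.0],
    ],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[
        ("vector_char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3)), 0),
        ("vector_word_1", TfidfVectorizer(analyzer="word", ngram_range=(1, 1)), 0),
    ],
    remainder="passthrough",
)
ct.fit(features)

# transformers_ holds (name, fitted transformer, columns) tuples.
# The remainder entry has no vocabulary_, so it is skipped here.
vocabularies = {
    name: fitted.vocabulary_
    for name, fitted, _columns in ct.transformers_
    if hasattr(fitted, "vocabulary_")
}

# Same fitted object, looked up by name instead of by position.
word_vocab = ct.named_transformers_["vector_word_1"].vocabulary_

for name, vocab in vocabularies.items():
    print(name, len(vocab))
```

Note that the indexes stored in each vocabulary_ are local to that vectorizer. They are not column positions in the combined matrix returned by the ColumnTransformer, which stacks the outputs of the transformers one after another.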

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
