UMAP & Matplotlib: UMAP seems to place new points badly every time

I have an unsupervised sentiment analysis problem in Python. I used the sentence-transformers library to get embeddings of tweets (some of the text samples are pulled directly from Twitter). I read that dimensionality reduction is important for this kind of task and that UMAP is well suited to it.

The overall problem is that when I want to get a new embedding for a new test tweet, umap seems to give weird coordinates. I'll walk through the code so anyone reading this can understand.

I created a list of tweets: 10 positive, 10 neutral, and 10 negative, all in a simple dataframe.
I'm using the all-mpnet-base-v2 sentence transformer model. Code below:

model_st = SentenceTransformer('all-mpnet-base-v2')

The model encodes the dataframe's tweets, which gives me embeddings of size 768. I then fit UMAP on those embeddings. The code is:

umap_obj = umap.UMAP(n_neighbors=30, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit(embeddings)

umap_obj.embedding_

And this gives the result which is:

array([[ 7.043991 , 10.03341  ],
       [ 6.4562964,  9.504029 ],
       [ 6.7481065, 11.092019 ],
       [ 7.3372607, 11.114787 ],
       [ 7.890366 , 10.493936 ],
       [ 6.298611 , 10.29068  ],
       [ 6.4775186,  9.898772 ],
       [ 8.703255 ,  9.839503 ],
       [ 6.850452 , 10.553306 ],
       [ 7.1775093, 10.757572 ],
       [ 8.61553  ,  8.281198 ],
       [ 7.665401 ,  8.742563 ],
       [ 8.105979 ,  8.283659 ],
       [ 8.412901 ,  8.686226 ],
       [ 7.604193 ,  8.318158 ],
       [ 7.5261774,  9.969134 ],
       [ 7.7710595,  9.204553 ],
       [ 8.022583 ,  9.164099 ],
       [ 7.2784944,  8.836557 ],
       [ 9.169669 ,  9.772636 ],
       [ 9.370931 , 10.3363   ],
       [ 8.465871 , 10.676252 ],
       [ 8.5332   , 11.112685 ],
       [ 8.1095495, 11.277469 ],
       [ 8.147169 , 10.263562 ],
       [ 9.059501 , 11.015707 ],
       [ 8.97215  , 10.662908 ],
       [ 8.142927 ,  9.835047 ],
       [ 8.697013 , 10.231923 ],
       [ 8.514813 ,  9.202326 ]], dtype=float32)

Great! 2d coordinates.

I wanted to use a simple clustering algorithm for this so I used k means from the sklearn.cluster library. Code below:

amount_of_clusters = 3

k_means_model = KMeans(n_clusters=amount_of_clusters, random_state=1234)

k_means_model.fit(umap_obj.embedding_)

k_means_model.labels_

I did this because the ".labels_" attribute gives the cluster label of each point, and those labels are what I'll group the points by later on. The labels were:

array([0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 1], dtype=int32)
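As a rough sanity check on these labels, the sketch below counts how many tweets from each group of ten landed in each cluster. It assumes the tweets are stored in three consecutive blocks of ten (the question does not state the order), and `block` and `table` are names introduced here for illustration only.

```python
import numpy as np

# Labels copied from the k-means output above.
labels = np.array([0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2, 2, 2,
                   2, 2, 2, 2, 2, 2, 2, 1])

# ASSUMPTION: the 30 tweets are stored as three consecutive blocks of ten.
block = np.repeat(np.arange(3), 10)

# table[i, j] = number of tweets from block i that were assigned to cluster j.
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (block, labels), 1)
print(table)
```

Under that ordering assumption, each block is dominated by a single cluster, with a few tweets spilling into the neighbouring ones.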

I put the new 2d coordinates, along with the labels k means found, into a new dataframe. It was a quick way to separate the points by label. Code below:

k_means_df = pd.DataFrame(umap_obj.embedding_, columns=['x', 'y'])

k_means_df['labels'] = k_means_model.labels_

k_means_df

And gives the result:

    x           y           labels
0   7.043991    10.033410   0
1   6.456296    9.504029    0
2   6.748106    11.092019   0
3   7.337261    11.114787   0
4   7.890366    10.493936   2
5   6.298611    10.290680   0
6   6.477519    9.898772    0
7   8.703255    9.839503    2
8   6.850452    10.553306   0
9   7.177509    10.757572   0
10  8.615530    8.281198    1
11  7.665401    8.742563    1
12  8.105979    8.283659    1
13  8.412901    8.686226    1
14  7.604193    8.318158    1
15  7.526177    9.969134    0
16  7.771060    9.204553    1
17  8.022583    9.164099    1
18  7.278494    8.836557    1
19  9.169669    9.772636    2
20  9.370931    10.336300   2
21  8.465871    10.676252   2
22  8.533200    11.112685   2
23  8.109550    11.277469   2
24  8.147169    10.263562   2
25  9.059501    11.015707   2
26  8.972150    10.662908   2
27  8.142927    9.835047    2
28  8.697013    10.231923   2
29  8.514813    9.202326    1

In THIS step, I get separate lists of the coordinates for labels 0, 1, and 2. I create a list of colors for the plt.scatter() function and then add some simple code to SHOW the plot. Code below:

zero_x_points = k_means_df[k_means_df['labels'] == 0]['x'].tolist()
zero_y_points = k_means_df[k_means_df['labels'] == 0]['y'].tolist()

one_x_points = k_means_df[k_means_df['labels'] == 1]['x'].tolist()
one_y_points = k_means_df[k_means_df['labels'] == 1]['y'].tolist()

two_x_points = k_means_df[k_means_df['labels'] == 2]['x'].tolist()
two_y_points = k_means_df[k_means_df['labels'] == 2]['y'].tolist()

colors = ['#fc0505', '#0514fc', '#00920d']

fig, ax = plt.subplots(figsize=(20, 10))

plt.scatter(zero_x_points, zero_y_points, color=colors[0])
plt.scatter(one_x_points, one_y_points, color=colors[1])
plt.scatter(two_x_points, two_y_points, color=colors[2])

plt.colorbar()

The result is in the following picture:

Doesn't look too bad. Now onto the real issue, which is prediction.
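For reference, the per-label plotting above can be collapsed into a loop. This is only a sketch: `plot_clusters` is a hypothetical helper name, not part of matplotlib, and the three rows used to call it are taken from the table above purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_clusters(coords, labels, colors, new_point=None):
    """Scatter 2-D coords coloured by cluster label; optionally mark a new point in black."""
    fig, ax = plt.subplots(figsize=(20, 10))
    for label in np.unique(labels):
        mask = labels == label
        ax.scatter(coords[mask, 0], coords[mask, 1],
                   color=colors[label], label=f'cluster {label}')
    if new_point is not None:
        ax.scatter(new_point[0], new_point[1], color='black', label='new point')
    ax.legend()
    return ax


# Three rows from the table above (index 0, 10 and 20), for illustration only.
coords = np.array([[7.043991, 10.03341], [8.61553, 8.281198], [9.370931, 10.3363]])
labels = np.array([0, 1, 2])
ax = plot_clusters(coords, labels, ['#fc0505', '#0514fc', '#00920d'])
```

With the variables from the question, the call would presumably be `plot_clusters(umap_obj.embedding_, k_means_model.labels_, colors)` followed by `plt.show()`.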

The KEY thing to note about this test tweet is that I ran it AFTER testing a simple tweet, "Awesome game" (which is why the variable you'll see is called "dummy_tweet_two"). This weird result has been occurring for the past four days. This tweet is literally copied and pasted from the original list of tweets (the positive ones, to be exact), so I can be absolutely sure that the result is nonsensical and that I'm definitely doing something wrong. Code below:

# Make tweet.
dummy_tweet_two = "I'd like to slow down time so I can spend more hours on this. #CyberpunkGame"

# Encode it.
dummy_tweet_encoded_two = model_st.encode([dummy_tweet_two])

dummy_tweet_coords_two = umap_obj.transform(dummy_tweet_encoded_two)

print(f'Dummy tweet two coordinates: {dummy_tweet_coords_two}')

And this gives the result which is:

Dummy tweet coords plane: [[ 7.748943 12.07401 ]]

The plot of the result (the new coordinate is in black):

This is the problem, and it doesn't make any sense. As I said, the tweet in the dummy_tweet_two variable was in the original list of positive tweets. There's no way it should be placed away from literally every group.
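To put a number on "away from every group", the sketch below measures the distance from the new point to the mean of each cluster, using the coordinates and labels printed above. The per-cluster means are an approximation computed here and are not guaranteed to equal `k_means_model.cluster_centers_` exactly. The names `centroids`, `distances`, `nearest` and `spread` are introduced for illustration.

```python
import numpy as np

# 2-D coordinates and k-means labels copied from the output above.
coords = np.array([
    [7.043991, 10.03341], [6.4562964, 9.504029], [6.7481065, 11.092019],
    [7.3372607, 11.114787], [7.890366, 10.493936], [6.298611, 10.29068],
    [6.4775186, 9.898772], [8.703255, 9.839503], [6.850452, 10.553306],
    [7.1775093, 10.757572], [8.61553, 8.281198], [7.665401, 8.742563],
    [8.105979, 8.283659], [8.412901, 8.686226], [7.604193, 8.318158],
    [7.5261774, 9.969134], [7.7710595, 9.204553], [8.022583, 9.164099],
    [7.2784944, 8.836557], [9.169669, 9.772636], [9.370931, 10.3363],
    [8.465871, 10.676252], [8.5332, 11.112685], [8.1095495, 11.277469],
    [8.147169, 10.263562], [9.059501, 11.015707], [8.97215, 10.662908],
    [8.142927, 9.835047], [8.697013, 10.231923], [8.514813, 9.202326],
])
labels = np.array([0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 2,
                   2, 2, 2, 2, 2, 2, 2, 2, 2, 1])

# Coordinates UMAP returned for the test tweet.
new_point = np.array([7.748943, 12.07401])

# Mean position of each cluster (approximates the k-means centres).
centroids = np.array([coords[labels == k].mean(axis=0) for k in range(3)])

# Distance from the new point to each cluster mean, and the closest cluster.
distances = np.linalg.norm(centroids - new_point, axis=1)
nearest = int(np.argmin(distances))

# For scale: the farthest any training point sits from its own cluster mean.
spread = np.linalg.norm(coords - centroids[labels], axis=1).max()

print(distances, nearest, spread)
```

With these values the nearest cluster mean is cluster 2. Comparing `distances` against `spread` gives a rough sense of whether the point is an outlier relative to every group. `k_means_model.predict(dummy_tweet_coords_two)` would give scikit-learn's own nearest-centre assignment.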

Solution 1:^[1]

Paul Brodersen's earlier comments put me on the right track about why my old implementation wasn't placing points properly. My conclusion is that it was because I didn't call the fit function on the umap object with every test point included.

First, the fit function fits X into an embedded space (see the docstring here: https://umap-learn.readthedocs.io/en/latest/_modules/umap/umap_.html#UMAP.fit). That means my regular points and my test points were technically in two different spaces.

Code example of what I mean:

# The tweet to use. This is already present in the list_of_tweets variable.
new_tweet = "Wonderful game, story was absolutely brilliant."

# Get the embedded version.
new_tweet_embedding = model_st.encode(new_tweet)


# Add the new embedding to the old.
new_embeddings = []

for embedding in embeddings:
  new_embeddings.append(embedding)

new_embeddings.append(new_tweet_embedding)

print(f'Old tweet embedding (inside new list): {new_embeddings[0][:5]}\n\n')
print(f'New tweet embedding (inside new list): {new_embeddings[-1][:5]}')

# Create the new umap object that'll REALLY fit x into an embedded space.
new_umap_obj = umap.UMAP(n_neighbors=31, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit(new_embeddings)

new_umap_obj.embedding_
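The append loop above can also be written as a single stacking step. The sketch below wraps "stack, refit, split" into a hypothetical helper, `refit_with_new_points`. So that it runs without umap-learn, it uses a plain PCA stand-in (`PcaStandIn`, also invented here) and random vectors in place of the real tweet embeddings; none of these names come from the original code.

```python
import numpy as np


class PcaStandIn:
    """Stand-in reducer (plain PCA via SVD) so the sketch runs without umap-learn."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit_transform(self, X):
        X = np.asarray(X, dtype=float)
        centered = X - X.mean(axis=0)
        # Rows of vt are the principal directions, strongest first.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:self.n_components].T


def refit_with_new_points(embeddings, new_embeddings, make_reducer):
    """Refit a reducer on old + new embeddings; return (old_coords, new_coords)."""
    old = np.asarray(embeddings)
    new = np.atleast_2d(new_embeddings)  # accept a single 1-D embedding too
    combined = np.vstack([old, new])
    coords = make_reducer().fit_transform(combined)
    return coords[:len(old)], coords[len(old):]


# ASSUMPTION: random vectors stand in for the 30 tweet embeddings of size 768.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 768))
new_tweet_embedding = embeddings[3].copy()  # a tweet already in the list

old_coords, new_coords = refit_with_new_points(
    embeddings, new_tweet_embedding, PcaStandIn)
print(old_coords.shape, new_coords.shape)
```

With UMAP, the factory would presumably be something like `lambda: umap.UMAP(n_components=2, min_dist=0.0, metric='cosine', random_state=42)`. Unlike the PCA stand-in, UMAP does not guarantee that a duplicated tweet lands on exactly the same coordinates. Each refit also produces a new layout, so a k-means model fitted on the earlier coordinates would need refitting as well.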

Keep in mind that the "new_tweet" string, "Wonderful game, story was absolutely brilliant.", is already in the main list of tweets that produced the first 2d plot in my original question, so I expect it to land in a similar position.

The tweet that is in the main list has the following coordinates thanks to umap: x=10.252, y=19.692.

The "new_tweet" variable I tested has these coordinates thanks to umap: x=9.935, y=18.440
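As a quick worked check, the Euclidean distance between those two positions can be computed directly. Only the two coordinate pairs come from the text above; the variable names are illustrative.

```python
import math

original = (10.252, 19.692)  # tweet already in the main list
retested = (9.935, 18.440)   # same tweet appended as "new_tweet"

# Straight-line distance between the two 2-D positions.
distance = math.dist(original, retested)
print(round(distance, 3))
```

That comes to roughly 1.29 units. Whether that counts as close depends on the overall spread of the refitted plot, which these two coordinates alone do not show.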

Obviously very similar, and the following picture shows they're close. I'll circle them just to be clear.

The test point is in black and the original is next to it in red. As Paul already said, though, the coordinates won't match exactly because of underfitting, which is understandable given how little data there is.

So that addresses why my points weren't making sense and how to fix it. I tried the exact same process on 3 separate tweets and none of them were oddly placed either. The tweets:

test_tweets = ["Great game, glad I managed to give it a shot. Good job to the devs.",
               "A lot of room to upgrade I'd say, but this isn't really BAD.",
               "A complete buggy and glitchy waste of time. Game even crashed during the loading screen once."]

The result:

Hope this helps anyone who manages to have this same problem.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Omar Moodie
