'How to plot R's vegan::procrustes using Python (SciPy, NumPy, and Matplotlib)?

I'm following the vegan tutorial on procrustes analysis:

https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/procrustes

# Load data
library(vegan)
data(varespec)

# Ordinations
vare.dist <- vegdist(wisconsin(varespec))
mds.null <- monoMDS(vare.dist, y = cmdscale(vare.dist))
mds.alt <- monoMDS(vare.dist)

# Run procrustes
vare.proc <- procrustes(mds.alt, mds.null)

# Summary of procrustes
summary(vare.proc)
Call:
procrustes(X = mds.alt, Y = mds.null)

Number of objects: 24    Number of dimensions: 2

Procrustes sum of squares:
 13.99951
Procrustes root mean squared error:
 0.7637492
Quantiles of Procrustes errors:
       Min         1Q     Median         3Q        Max
0.08408327 0.23038165 0.49316643 0.65854198 1.99105837

Rotation matrix:
          [,1]       [,2]
[1,] 0.9761893  0.2169202
[2,] 0.2169202 -0.9761893

 of averages:
              [,1]          [,2]
[1,] -2.677059e-17 -5.251452e-18

Scaling of target:
[1] 0.6455131

plot(vare.proc)

enter image description here

Now with Python I can choose either:

# Returns
mtx1: array_like
A standardized version of data1.

mtx2: array_like
The orientation of data2 that best fits data1. Centered, but not necessarily .

disparity: float
 as defined above.

R(N, N): ndarray
The matrix solution of the orthogonal Procrustes problem. Minimizes the Frobenius norm of (A @ R) - B, subject to R.T @ R = I.

scale: float
Sum of the singular values of A.T @ B.

My questions:

  • Which implementation is more similar to vegan::procrustes, scipy.spatial.procrustes or scipy.linalg.orthogonal_procrustes?
  • How is the output used to generate the figure above (in Matplotlib not via Rpy2)?

EDIT:

Using @Kat's answer for the plotting and my additional answer for the SciPy/NumPy implementation, one should be able to reproduce the VEGAN analysis in Python exactly using only standard packages from PyData.

Edit 2:

Just in case it helps anyone, I've implemented both Procrustes and Protest (along with the plotting method described here) from vegan in my soothsayer python package: https://github.com/jolespin/soothsayer/blob/60accb669062c559ba785be201deddeede26a049/soothsayer/ordination/ordination.py#L715



Solution 1:[1]

It's not exactly the same, but I think it's pretty close.

I brought the varespec data into Python for this. Then tried to mimic the actions that the vegan package (and your code) took. Along those lines, I stuck to the object names you used (for the most part).

There has to be an easier way to do this.

import scipy.spatial.distance as ssd
import sklearn.manifold as sm #  MDS, smacof
import procrustes
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

wVares_py = r.wVares # I brought in the varespec data from R (named wVares)
# find the distance
vare_dist = ssd.pdist(wVares_py, 'braycurtis')
# make it square (Required for MDS and smaof
vare_dist = ssd.squareform(vare_dist)

mds = sm.MDS(metric = False, dissimilarity = 'precomputed')
mds_alt = mds.fit_transform(vare_dist)
mds_null = sm.smacof(vare_dist, metric = False, init = mds_alt)

vare_proc = procrustes.rotational(mds_alt, mds_null[0], translate = False)

# create an array that represents the axes at the origin, then rotate them
xAx = np.array([[-.4, 0],[.4, 0]])
yAx = np.array([[0, .4], [0, -.4]])

# create the rotated origins
xHat = vare_proc["t"] * xAx
yHat = vare_proc["t"] * yAx

# add line segments
vareA = pd.DataFrame(vare_proc.new_a, columns = ['x1', 'y1'])
vareB = pd.DataFrame(vare_proc.new_b, columns = ['x2', 'y2'])
df = pd.concat([vareA, vareB], axis = 1)

# now find the differences
df['dx'] = df.x2 - df.x1
df['dy'] = df.y2 - df.y1

Now everything's set for graphing.

plt.figure(figsize = (10, 10), dpi = 100)
fig, ax = plt.subplots()
# plot the points
# plot the points
ax.scatter(vare_proc.new_a[:, 0], 
           vare_proc.new_a[:, 1], 
           marker = 'o', 
           label = 'a', 
           c = '#cc5500')
# plot the x, y axes
ax.axhline(y=0, color='k')
ax.axvline(x=0, color='k')
# plot the rotated origin
ax.axline(xHat[:,0], xHat[:,1], linestyle = '-.')
ax.axline(yHat[:,0], yHat[:,1], linestyle = '-.')
# add arrows to show the direction per point
ax.quiver(df.x1, df.y1, df.dx, df.dy, units='xy', color = 'gray', scale=.7)
plt.show()

enter image description here

Solution 2:[2]

Here is how to reproduce the vegan procrustes analysis in Python using only SciPy and NumPy. Please refer to @Kat answers above for context on how to plot:

# Get X and Y which are the first 2 embeddings
# -------------------------------------------
# X <- scores(X, display = scores, ...)
mds_null = pd.read_csv("https://pastebin.com/raw/c1Zwb4pu", index_col=0, sep="\t")
X_original = mds_null.values
# X_original[:5]
# array([[-0.28600328, -0.23135498],
#        [ 0.00510282, -0.39170001],
#        [ 0.80888622,  1.18656269],
#        [ 1.44897916, -0.18887702],
#        [ 0.54496396, -0.09403705]])

# Y <- scores(X, display = scores, ...)
mds_alt = pd.read_csv("https://pastebin.com/raw/UMGLgXmu", index_col=0, sep="\t")
Y_original = mds_alt.values
# Y_original[:5]
# array([[ 0.2651293 ,  0.09755548],
#        [ 0.17639036,  0.45328207],
#        [-0.16167398, -1.42304932],
#        [-1.44706047,  0.3966337 ],
#        [-0.49825949,  0.15239689]])


# Procrustes in VEGAN
# ----------------
procrustes_result = vegan.procrustes(X_original, Y_original, scale=True, symmetric=False, scores="sites")

# Rotation matrix from VEGAN
A = np.array(dict(procrustes_result.items())["rotation"])
# A
# # array([[-0.98171787, -0.19034186],
# #        [ 0.19034186, -0.98171787]])

# Center the X and Y data
# -----------------------
X_mean = X_original.mean(axis=0)
X = X_original - X_mean
Y_mean = Y_original.mean(axis=0)
Y = Y_original - Y_mean

# # Rotation matrix from SciPy
# ----------------------------
A, sca = orthogonal_procrustes(X,Y)
# A
# array([[-0.98171787,  0.19034186],
#        [-0.19034186, -0.98171787]])


# Manual calculation
# ------------------
# np.linalg.svd returns:
# u{ (…, M, M), (…, M, K) } array
# Unitary array(s). The first a.ndim - 2 dimensions have the same size as those of the input a. The size of the last two dimensions depends on the value of full_matrices. Only returned when compute_uv is True.

# s(…, K) array
# Vector(s) with the singular values, within each vector sorted in descending order. The first a.ndim - 2 dimensions have the same size as those of the input a.

# vh{ (…, N, N), (…, K, N) } array
# Unitary array(s). The first a.ndim - 2 dimensions have the same size as those of the input a. The size of the last two dimensions depends on the value of full_matrices. Only returned when compute_uv is True.

XY = np.dot(X.T, Y) # crossprod(X,Y)
U,s,Vh = np.linalg.svd(XY)
V = Vh.T
A = np.dot(V, U.T)
# A
# array([[-0.98171787, -0.19034186],
#        [ 0.19034186, -0.98171787]])

# Reproduce the Yrot object
# -------------------------
scale = True
symmetric = False

# Yrot from VEGAN
Yrot = np.array(dict(procrustes_result.items())["Yrot"])
# Yrot[:5]
# array([[-0.20919556, -0.12656386],
#        [-0.07519809, -0.41418755],
#        [-0.09706039,  1.23572329],
#        [ 1.29483042, -0.098617  ],
#        [ 0.44844991, -0.04740275]])

# NumPy implementation
# ctrace <- function(MAT) sum(MAT^2)
def ctrace(MAT):
    return np.sum(MAT**2)

# if (scale) {
#     c <- sum(sol$d)/ctrace(Y)
if scale:
    c = np.sum(s)/ctrace(Y)

# Yrot <- c * Y %*% A
Yrot = c * np.dot(Y,A)
Yrot[:5]
# array([[-0.20919556, -0.12656386],
#        [-0.07519809, -0.41418755],
#        [-0.09706039,  1.23572329],
#        [ 1.29483042, -0.098617  ],
#        [ 0.44844991, -0.04740275]])

Solution 3:[3]

Another Implementation of PROTEST for python

Thanks for sharing! In case anyone is interested, here is my implementation of PROTEST that I worked out upon @O.rka's work after cross-checking literature and the VEGAN implementation.

from tqdm import trange

def protest(X, Y, no_permutations=999, with_replacement=False):
    '''Performs procrustean test (PROTEST).

    H_0: "X and Y are no more related than random datasets would be."
    
    Args:
        X (np.ndarray):             Matrix of shape (n x p11), where n is the number of observations
        Y (np.ndarray):             Matrix of shape (n x p12)
        no_permutations (int):      Number of permutations to consider
        with_replacement (bool):    If set to True, permutations are sampled with replacement.

    Returns:
        test_statistic (float):          Procrustean correlation coefficient
        p_value (float):    P-Value of PROTEST
        RSS (float):        Residual Sum of Squares of prucrustes rotations

    References:
        Gower & Dijksterhuis (2004):    "Procrustes Problems"
        Gower (1971):                   "Statistical methods of comparing different multivariate analyses 
                                            of the same data", p. 138-149
        Peres-Neto & Jackson (2001):    "How well do multivariate data sets match? The advantages of 
                                            a Procrustean superimposition approach over the Mantel test"   
    '''
    n = X.shape[0]
    assert n == Y.shape[0], 'X has to be of shape (n x p1) and Y has to be of shape (n x p2).'

    # Get machine float error
    EPS = np.sqrt(np.finfo(float).eps)

    # # standardize the matrices using Gower-Transformation (Gower 1971)
    X, Y = gower_trafo(X), gower_trafo(Y)

    # calculate test statistic & residual sum of squares
    test_statistic = procrustes_corr(X, Y)
    RSS = np.sum(X**2) + np.sum(
        Y**2) - 2 * test_statistic  # c.f. Gower & Dijksterhuis (2004), equation 4.3

    # Get integer index for permutations
    index = np.arange(0, n)

    # Permute and calculate goodness of fit
    permutations = list()
    for rs in trange(no_permutations):
        permutated_idx = np.random.RandomState(rs).choice(index, replace=with_replacement, size=n)
        corr = procrustes_corr(X[permutated_idx], Y)
        permutations.append(corr)
    permutations = np.array(permutations)

    p_value = (np.sum(permutations >= test_statistic - EPS) + 1) / (no_permutations + 1)

    # Print Output
    print(f'Performed PROTEST with {no_permutations} permutations.')
    print(f'Procrustean correlation coefficient: {test_statistic}')
    print(f'p-Value: {p_value}')
    print(f'Residual Sum of Squares: {RSS}')

    return test_statistic, p_value, RSS


def procrustes_corr(X, Y):
    '''Calculates the procrustean correlation coefficient.
    
    Args:
        X (np.ndarray):   Gower-transformed Matrix of shape (n x p1), where n is the number of observations
        Y (np.ndarray):   Gower-transformed Matrix of shape (n x p2)

    Returns:
        corr (float):    The goodness of fit of the orthogonal procrustes rotation. Procrustean form of Corelation coefficient.

    References:
        Peres-Neto & Jackson (2001):    "How well do multivariate data sets match? The advantages of 
                                        a Procrustean superimposition approach over the Mantel test"       
    '''
    XY = np.dot(X.T, Y)
    _, s, _ = np.linalg.svd(XY)

    # for performing PROTEST (Peres-Neto & Jackson 2001)
    corr = np.trace(np.diag(s))

    return corr


def gower_trafo(A):
    '''Standardizes Matrix A using Gower-Transformation.

    Args:
        A (np.ndarray):     Matrix of shape (n x p), where n is the number of observations

    Returns:
        A (np.ndarray):     Standardized Matrix A

    References:
        Gower (1971):       "Statistical methods of comparing different multivariate analyses 
                             of the same data", p. 138-149
    '''
    A = A - A.mean(axis=0, keepdims=True)
    A = A / np.sqrt(np.sum(A**2))
    return A

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Kat
Solution 2 O.rka
Solution 3 SirIcarus