Why does my do-it-myself Lasso give different solution paths from scikit-learn's lasso_path?

I tried recreating Lasso from scratch in Python, with some modifications, as the class below:

from numpy import zeros

class custom_lasso():

    def __init__(self, lr, iter, lamb):
        self.eta = lr
        self.max_iter = iter
        self.alpha = lamb

    # Function for model training

    def fit(self, X, Y):

        # no_of_training_examples, no_of_features
        self.n, self.p = X.shape
        # weight initialization
        self.B, self.X, self.Y = zeros(self.p), X, Y

        # gradient descent learning
        for i in range(self.max_iter):
            self.update_weights(i)

        return self

    # Helper function to update weights in gradient descent

    def update_weights(self, j):

        # calculate gradients
        dB = self.X.transpose() @ (self.X @ self.B - self.Y)
        if j:
            dB += abs(self.B)/self.B * self.alpha
        # update weights
        self.B -= self.eta * dB
        return self
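For completeness, here is a minimal self-contained sketch of how the training loop behaves, with the update rule inlined on synthetic data (the data, shapes, and seed are illustrative assumptions; `np.sign(B)` stands in for `abs(B)/B`, which it equals for nonzero entries):

```python
import numpy as np

# illustrative synthetic data
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
true_b = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_b + 0.1 * rng.standard_normal(n)

# same update rule as custom_lasso.update_weights, inlined
eta, alpha, max_iter = 1e-3, 1.0, 1000
B = np.zeros(p)
for j in range(max_iter):
    dB = X.T @ (X @ B - y)          # gradient of the squared-error term
    if j:                            # skipped on the first step, where B is all zeros
        dB += alpha * np.sign(B)     # subgradient of the L1 term
    B -= eta * dB
```

Even with a small learning rate, none of the coefficients come out exactly zero in this run.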

given that I aim to minimize

0.5 * sum_{i=1}^n (y_i - sum_{j=1}^p x_{ij} b_j)^2 + alpha * sum_{j=1}^p |b_j|

with respect to all the b_j.

Problem: When I compare the solution paths with scikit-learn's lasso_path, the resulting coefficients are quite different. For example, with n = 100, p = 5 and alpha = 1, lasso_path gives b_2 = 0 but my from-scratch lasso gives b_2 = 0.5962:

[plot comparing the coefficient paths from lasso_path and the custom implementation]

Where do the differences come from? Is there any more code I should include? Thanks a lot.



Solution 1:[1]

You're doing gradient descent on the objective function, but the objective is not smooth because of the L1 penalty: it is not differentiable wherever a coefficient is zero.

I also don't see why there's an if here (if the penalty term were never added, you would just be solving ordinary least squares, given a small enough eta):

        if j:
            dB += abs(self.B)/self.B * self.alpha

In order to obtain a converging solver, you need to apply proximal gradient descent, i.e. apply a soft-thresholding step after each gradient step:

gd_update = B - eta * X.T @ (X @ B - y)
B = np.sign(gd_update) * np.maximum(0, np.abs(gd_update) - alpha * eta)

For it to converge, you need eta < 2 / np.linalg.norm(X, ord=2)**2, as for standard gradient descent.
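Putting it together, a minimal self-contained sketch of the resulting solver might look like this (the function name and the synthetic data are illustrative assumptions; note the np.sign factor, which soft-thresholding needs to restore the sign after shrinking the magnitude):

```python
import numpy as np

def ista_lasso(X, y, alpha, max_iter=1000):
    """Proximal gradient descent (ISTA) for
    0.5 * ||y - X b||^2 + alpha * ||b||_1."""
    # step size from the smoothness constant of the quadratic part
    eta = 1.0 / np.linalg.norm(X, ord=2) ** 2
    B = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # gradient step on the smooth squared-error term
        gd_update = B - eta * X.T @ (X @ B - y)
        # soft-thresholding: shrink magnitudes, keep signs, zero out small entries
        B = np.sign(gd_update) * np.maximum(0.0, np.abs(gd_update) - alpha * eta)
    return B

# illustrative synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
b_true = np.array([2.0, 0.0, -1.5, 0.0, 1.0])
y = X @ b_true + 0.1 * rng.standard_normal(100)
B = ista_lasso(X, y, alpha=1.0)
```

When comparing against scikit-learn, keep in mind that its Lasso and lasso_path minimize (1/(2n)) * ||y - Xb||^2 + alpha * ||b||_1 per the scikit-learn documentation, so an alpha in the objective above corresponds to alpha / n there.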

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: P. Camilleri