Node.js can't handle HTTP responses successfully, (probably) due to a high volume of promises - growth problems

I am finding myself blind as to the root cause of this problem. Any direction is very welcome. I've also run some tests, which I show here.

A) --- General Scenario -----

I have two clusters on GCP: Scheduler (2-15 servers) and Consumer (6-100 servers). The Scheduler gets the tasks that need to be done and sends them to the Consumer.

There are about 2.2K messages per minute from Scheduler to Consumer, and as more Schedulers are created, the tasks are divided among more of them.

The problem:

  1. The Consumer processes 100% of the requests in under 3 seconds.
  2. In about 0.5% of cases, the Scheduler logs show a timeout even though the Consumer processed the request in under 2 seconds. I am sure of this because each request has a unique ID.
  3. The Axios timeout is set to 20 seconds.
  4. When the errors are logged in the Scheduler, the elapsed time from the start of the request until the exception handling is about 45 seconds, so way more than the 20-second Axios timeout.
  5. Both clusters are on Google's GCP local network, so I am assuming the network itself is not the problem.
  6. Scheduler CPU usage is rising but not necessarily overwhelmed, i.e. it stays below 80%.
  7. The Scheduler gets up to 100 tasks and processes them 15 at a time (Promise.all()), waits for the responses, then takes the next 15, and so on.
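
For reference, the "15 at a time with Promise.all()" pattern described in the list above can be sketched roughly as follows. This is only an illustration, not the actual Scheduler code: `processInBatches` and its parameters are hypothetical names, and the real worker would be the Axios call to the Consumer.

```typescript
// Hypothetical helper (not the real Scheduler code): run `worker` over `tasks`
// in fixed-size batches, waiting for the whole batch before starting the next.
async function processInBatches<T, R>(
    tasks: T[],
    batchSize: number,
    worker: (task: T) => Promise<R>,
): Promise<R[]> {
    const results: R[] = [];
    for (let start = 0; start < tasks.length; start += batchSize) {
        const batch = tasks.slice(start, start + batchSize);
        // Every request in the batch starts at once; the slowest one
        // decides when the next batch may begin.
        const batchResults = await Promise.all(batch.map(worker));
        results.push(...batchResults);
    }
    return results;
}
```

Note that with this pattern a single slow (or timed-out) request holds back the entire next batch, which is one way the elapsed time seen by the caller can grow beyond the per-request timeout.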

What this scenario suggests to me is that Node.js on the Scheduler is not able to process the responses that the Consumers send back, and is somehow missing some replies.

So, the hypothesis goes like this:

The Scheduler is able to send a high volume of messages but is unable to get the results, because the Axios timeout expires before the event loop gets to the 200 response from the Consumer.
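
One way this hypothesis could be checked is to measure how long the event loop is being held up while the requests are in flight. The sketch below uses `monitorEventLoopDelay` from Node's built-in `perf_hooks` module; `measureLoopDelay` is a hypothetical wrapper name, and the 10 ms sampling resolution is an arbitrary choice. It has not been run against the real Scheduler.

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Hypothetical helper: samples event-loop delay while `work` runs and
// reports the worst and 99th-percentile delay in milliseconds.
async function measureLoopDelay<T>(
    work: () => Promise<T>,
): Promise<{ result: T; maxMs: number; p99Ms: number }> {
    const histogram = monitorEventLoopDelay({ resolution: 10 });
    histogram.enable();
    try {
        const result = await work();
        // The histogram reports nanoseconds; convert to milliseconds.
        return {
            result,
            maxMs: histogram.max / 1e6,
            p99Ms: histogram.percentile(99) / 1e6,
        };
    } finally {
        histogram.disable();
    }
}
```

If the reported delays are in the range of seconds while a batch is running, that would support the idea that responses arrive in time but are handled late. If they stay in the low milliseconds, the cause is more likely somewhere else.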

B) ------ The Tests ------

We could reproduce the problem in tests. We created a sender and a receiver. The sender just sends an incremental number to the receiver, with up to 100 requests in flight at a time, 10000 requests in total.

There are several cases where the receiver says it processed a specific task (number) but the sender reports a timeout.

Here are the sender and the receiver:

Sender

import Aigle from 'aigle';
import axios from 'axios';
import * as _ from 'lodash';


async function main(): Promise<void> {
    const TIMEOUT: number = 1000; // Axios timeout per request, in milliseconds
    const LIMIT: number = 10000; // total number of requests to send
    const items: number[] = _.range(LIMIT);
    const REMOTE_MACHINE: string = ''; // receiver host; must be filled in before running
    const hash: { [key: string]: true } = {}; // indexes of requests that failed on the sender

    if (!REMOTE_MACHINE) throw new Error('No remote machine');

    try {
        // Send the requests with at most 100 in flight at any time
        await Aigle
            .resolve(items)
            .mapLimit(100, async (item, index) => {
                try {
                    const response = await axios.get(`http://${REMOTE_MACHINE}:4000/${index}`, { timeout: TIMEOUT });
                    print({ type: 'response', index });
                    return response;
                } catch (err) {
                    // Record the failure (including timeouts) and keep going
                    hash[index] = true;
                    print({ type: 'err', index });
                    return err;
                }
            });

        console.log('Amount errors', Object.keys(hash).length);

        // Final request with index LIMIT tells the receiver to write its report
        await axios.get(`http://${REMOTE_MACHINE}:4000/${LIMIT}`, { timeout: TIMEOUT });
    } catch (err) {
        console.log({ err });
    }

    function print(item: any) {
        return console.log(item);
    }
}
main();

Receiver

const express = require('express')
const fs = require('fs')


function main() {
    const app = express()
    let count = 0;
    const LIMIT = 10000;
    const hash = {} // numbers received so far

    app.get('/:num?', async (req, res) => {
        // if (req.params.num !== count.toString()) console.log(req.params.num)
        count++;
        hash[req.params.num] = true;
        // The sender's last request carries LIMIT: report which numbers never arrived
        if (Number(req.params.num) >= LIMIT) {
            const difference = Array(LIMIT).fill(0).map((_, index) => index).filter((_, index) => !hash[index])

            console.log({ difference });
            fs.writeFileSync('out', JSON.stringify({ difference, hash }))
        }
        console.log({ count, body: req.params.num })
        res.send({ message: 'Out' });
    });
    app.listen(4000, () => console.log('listening'))
}

main()
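
To list the "false" timeouts after a run, the sender's error `hash` can be compared with the `hash` the receiver writes to the `out` file. A minimal sketch, assuming both hashes have been loaded into memory as plain objects keyed by the request number (`falseTimeouts` is a hypothetical helper, not part of the test code above):

```typescript
// Hypothetical helper: returns the request numbers that the sender recorded as
// failed even though the receiver recorded them as received.
function falseTimeouts(
    senderErrors: { [key: string]: true },
    receiverSeen: { [key: string]: true },
): string[] {
    return Object.keys(senderErrors).filter((id) => receiverSeen[id] === true);
}

// Example with made-up data: requests 7 and 9 failed on the sender,
// but only 9 was actually received by the receiver.
console.log(falseTimeouts({ '7': true, '9': true }, { '8': true, '9': true }));
```

Requests that appear in the sender's errors but not in the receiver's hash never arrived at all, which would point to a different problem than requests that were received and still timed out.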

So the questions are:

  1. What could cause the Scheduler to falsely report a timeout?
  2. Is there a safe parallelism threshold that we must respect? Is it bound to parallel processing and/or I/O?
  3. Is there any pattern I should use?


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow