Node.js: scanning a directory tree is slow as hell

I have an NW.js app that simply scans a directory tree recursively and gets the stats for each file/directory. It also computes an MD5 hash for each file.

I have 29k files in 850 folders, 120GB of data in total.

After almost 7 minutes, my code had scanned only 4080 of the 29k files.

How can it be so slow? Is there something I can do to improve performance? Otherwise, Node would be useless to me...

What is surprising is that it took "only" 7 seconds to scan 1k files. Why does it take 60 times longer to scan only 4 times as many files?

When I check the processes, I can see that Node's RAM usage swings a lot: between 20MB and 400MB (it fluctuates both ways). But the CPU usage is stuck at 1%.

It is weird, because I don't think I am allocating that much RAM. Actually, I don't allocate anything! Please see my code below.

if (process.argv.length < 3)
    process.exit();


var fs = require('fs');
var md5 = require('md5');
var md5File = require('md5-file');

var iTotal = 0;
var iNbFiles = 0;
var iNbFolders = 0;

var iBegin = Date.now();

var App =
{
    scan: function(path)
    {
        var items = fs.readdirSync(path);
        var i, item, stats, fullPath, isFolder, fileMD5;
        var len = items.length;
        var md5Hash = md5(path);

        for (i = 0; i < len; i++)
        {
            item = items[i];
            fullPath = path + '/' + item;
            // lstatSync, not statSync: statSync follows symlinks, so isSymbolicLink() would never be true
            stats = fs.lstatSync(fullPath);
            if (stats.isSymbolicLink())
                continue;

            isFolder = stats.isDirectory();
            if (!isFolder)
            {
                fileMD5 = md5File(fullPath);
                iNbFiles++;
            }
            else
            {
                fileMD5 = null;
                iNbFolders++;
            }

            iTotal++;
            process.send({_type: 'item', name: item, path: path, path_md5: md5Hash, full_path: fullPath, file_md5: fileMD5, stats: stats, is_folder: isFolder});
            if (isFolder)
                App.scan(path + '/' + item);
        }

        process.send({_type: 'temp', total: iTotal, files: iNbFiles, folders: iNbFolders, elapsed: (Date.now() - iBegin)});
    }
};

App.scan(process.argv[2]);

// Send the final and definitive value of "total"
process.send({_type: 'total', total: iTotal, files: iNbFiles, folders: iNbFolders});

process.exit();


Solution 1:[1]

Use an existing Node module to walk the tree, such as https://github.com/jprichardson/node-fs-extra#walk

"Actually, I don't allocate anything!"

No, you are allocating: every object or variable you create takes memory. Also, md5-file reads each file through a stream to calculate the checksum, so the entire content of every one of your files has to pass through the CPU and memory. You are using the sync version of MD5, so only one file is hashed at a time. You also use recursion and you have many files, which means that if the stack runs out, you will get an error. I think you are hitting this error and just don't see it, or you are running the script silently without any progress feedback.

Use an async directory read and an async MD5 calculation. The best solution is to use a pool of worker processes (for example, 6 workers on a 6-core CPU), push the data to these workers, and let them calculate the MD5.
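As a rough illustration of the "async directory read and async MD5" advice, here is a minimal sketch that uses only Node's built-in fs, path and crypto modules instead of md5-file. It assumes a Node version that has fs.promises and the withFileTypes option of readdir. The names md5OfFile, scan and onItem are made up for this example, and onItem stands in for the process.send call in the question. It is a sketch under those assumptions, not a tuned implementation.

```javascript
'use strict';
const fs = require('fs');
const path = require('path');
const crypto = require('crypto');

// Hash one file by streaming it: only one chunk is held in memory at a time,
// and the event loop stays free while the disk is being read.
function md5OfFile(filePath) {
    return new Promise(function (resolve, reject) {
        const hash = crypto.createHash('md5');
        fs.createReadStream(filePath)
            .on('error', reject)
            .on('data', function (chunk) { hash.update(chunk); })
            .on('end', function () { resolve(hash.digest('hex')); });
    });
}

// Walk the tree with an explicit list of pending folders instead of recursion.
// onItem is a hypothetical callback; in the question it would call process.send.
async function scan(root, onItem) {
    const pending = [root];
    const totals = { total: 0, files: 0, folders: 0 };

    while (pending.length > 0) {
        const dir = pending.pop();
        const entries = await fs.promises.readdir(dir, { withFileTypes: true });

        for (const entry of entries) {
            if (entry.isSymbolicLink()) continue; // skip symlinks, as in the question

            const fullPath = path.join(dir, entry.name);
            const isFolder = entry.isDirectory();
            const stats = await fs.promises.lstat(fullPath);
            let fileMD5 = null;

            if (isFolder) {
                pending.push(fullPath); // visited on a later loop iteration
                totals.folders++;
            } else if (entry.isFile()) {
                fileMD5 = await md5OfFile(fullPath);
                totals.files++;
            }
            // Sockets, FIFOs and similar entries are reported with file_md5 = null.
            totals.total++;
            onItem({ name: entry.name, path: dir, full_path: fullPath,
                     file_md5: fileMD5, stats: stats, is_folder: isFolder });
        }
    }
    return totals;
}
```

In the question's child process this could be driven with something like scan(process.argv[2], function (item) { process.send(item); }), sending the totals once the returned promise resolves. Files are still hashed one after another here; hashing several at once, or handing paths to worker processes as suggested above, would be a further step.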

Update 1

Recursion memory leak example:

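var i = 0;
function inc() {
    i++;
    // Build a 40,000-character string on every call
    var s = "";
    for (var n = 0; n < 4000; n++) { s += "0123456789"; }
    // No base case: every call recurses again and never returns
    inc();
}
inc();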

Open the task manager and run this code in a browser, and you will see how quickly memory consumption grows.
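For reference, here is what "the stack ending" looks like in Node.js: unbounded recursion is stopped by the engine with a RangeError, which can be caught. This is a small sketch; the names recurseForever, depth and caught are made up for the illustration, and the depth reached varies with the Node version and stack size.

```javascript
'use strict';

let depth = 0;

// Recursion with no base case, like inc() above but without the string work.
function recurseForever() {
    depth++;
    recurseForever();
}

let caught = null;
try {
    recurseForever();
} catch (err) {
    // V8 reports "Maximum call stack size exceeded" as a RangeError.
    caught = err;
}

console.log('RangeError caught:', caught instanceof RangeError);
console.log('calls made before the stack ran out:', depth);
```

Whether the scan in the question ever gets that deep depends on how deeply the folders are nested, since App.scan adds one stack frame per folder level.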

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
