Low resource usage when using dedupe python

I need to find duplicates in a large dataset, so I'm testing the dedupe Python library.

I know it is recommended for small datasets, so I thought a powerful machine could improve performance. I have a machine with 56 GB of RAM, and I'm running a test similar to "csv_example" on a dataset with 200,000 rows. It works, but memory usage is very low, and so is CPU usage.

It seems to take too long in the blocking stage:

INFO:dedupe.blocking:10000, 110.6458142 seconds
INFO:dedupe.blocking:20000, 300.6112282 seconds
INFO:dedupe.blocking:30000, 557.1010122 seconds
INFO:dedupe.blocking:40000, 915.3087222 seconds

Could anyone help me improve resource usage, or tell me whether there is a library or setting that makes the program use more of the available resources?



Solution 1:[1]

Most likely your stage name contains an illegal character. Serverless auto-generates a name for your S3 bucket based on your stage name. If you look at the generated template file, you will see the full export, which will look something like the following:

"ServerlessDeploymentBucketName": {
  "Value": "api-deployment",
  "Export": {
    "Name": "sls-api_stage-ServerlessDeploymentBucketName"
  }
}

The way around this (assuming you don't want to change your stage name) is to explicitly set the output by adding something like this to your serverless config (in this case the illegal character was the underscore):

resources: {
  Outputs: {
    ServerlessDeploymentBucketName: {
      Export: {
        Name: `sls-${stageKey.replace('api_', 'api-')}-ServerlessDeploymentBucketName`
      }
    }
  }
}

Unfortunately, this has to be done for every export. A better option is to change your stage name so that it contains no illegal characters.
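If you do keep the stage name, the override above can be generated rather than written by hand for each export. The following is only a rough sketch: `toExportName` and `exportOverrides` are hypothetical helpers, not part of Serverless, and it assumes that export names may contain only letters, digits, colons, and hyphens, so check that against the error message you actually get.

```javascript
// Hypothetical helper (not a Serverless API): build an export name in the
// same "sls-<stage>-<output key>" shape as the override above.
function toExportName(stageKey, outputKey) {
  // Assumption: only letters, digits, colons and hyphens are accepted,
  // so every other character (e.g. the underscore) becomes a hyphen.
  const safeStage = stageKey.replace(/[^A-Za-z0-9:-]/g, '-');
  return `sls-${safeStage}-${outputKey}`;
}

// Hypothetical helper: build the Outputs override for a list of output keys,
// so the replacement does not have to be repeated by hand for each export.
function exportOverrides(stageKey, outputKeys) {
  const outputs = {};
  for (const key of outputKeys) {
    outputs[key] = { Export: { Name: toExportName(stageKey, key) } };
  }
  return { Outputs: outputs };
}

// Example with the stage key from the template above.
console.log(toExportName('api_stage', 'ServerlessDeploymentBucketName'));
// → sls-api-stage-ServerlessDeploymentBucketName
```

The object returned by `exportOverrides` has the same shape as the `resources` value shown earlier, so it could be assigned there in a JavaScript or TypeScript config; a plain serverless.yml would still need the names written out.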

Solution 2:[2]

I ran into this same problem.

In serverless.yml, I changed the service name from lambda_function to lambdaFunction.

The error was solved and it deployed correctly.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1
Solution 2   Luis Morales