Convert doc/docx to pdf in AWS Lambda?
Tried:
- Premade Lambda application docx-to-pdf (the application is no longer deployable): https://github.com/NativeDocuments/docx-to-pdf-on-AWS-Lambda
- Installing comtypes.client and win32com.client (neither seems to work once deployed in Lambda). Getting Error: Unable to import module 'lambda_function': cannot import name 'COMError'
Possibilities:
- Convert the doc file to PDF in browser JS when I get it from S3.
- Fix either comtypes or win32com in the deployment package somehow. Python 3.6 is being used.
import json
import urllib.parse
import logging
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.exceptions import ClientError
import lxml
import comtypes.client
import io
import os
import sys
import threading
from docx import Document

wdFormatPDF = 17  # Word's SaveAs constant for PDF output
s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        # Creating the Document
        f = io.BytesIO(response['Body'].read())
        document = Document(f)
        # Code for formatting my document object in this hidden section.
        document.save('/tmp/' + key)
        pdfkey = key.split(".")[0] + ".pdf"
        # The following function is supposed to convert my doc to pdf
        doctopdf('/tmp/' + key, '/tmp/' + pdfkey)
        # PDF file is then saved to s3
        s3.upload_file('/tmp/' + pdfkey, 'output', pdfkey)
    except ClientError as e:
        logging.error(e)
        raise e

def doctopdf(in_file, out_file):
    word = comtypes.client.CreateObject('Word.Application')
    doc = word.Documents.Open(in_file)
    doc.SaveAs(out_file, FileFormat=wdFormatPDF)
    doc.Close()
    word.Quit()
Solution 1:[1]
I also ran into this problem of converting a Word document (doc/docx) to PDF or another document type. I solved it with LibreOffice on Python 3.8 (it also works with Python 3.6 and 3.7), using subprocess in AWS Lambda.
Basically, this setup picks your file from S3 via the input event, converts it to PDF, and puts the converted file back into the same S3 location. Let's walk through the setup.
For this setup, we need the LibreOffice executable accessible from Lambda. To achieve this, we will make use of a Lambda Layer. Now, you have two options:
- You can create your own AWS Lambda layer and upload layer.tar.br.zip (you can download this archive from the shelfio GitHub repository),
- or you can use a Layer ARN directly in your Lambda:
  - Layer ARN for Python 3.6 and 3.7
  - Layer ARN for Python 3.8
It's time to create Lambda (dependency package).
- Create fonts/fonts.conf at the root of your lambda folder with the following content (assuming LibreOffice will be extracted under the /tmp/instdir directory):
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <dir>/tmp/instdir/share/fonts/truetype</dir>
  <cachedir>/tmp/fonts-cache/</cachedir>
  <config></config>
</fontconfig>
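For fontconfig to actually find this file at runtime, the process environment usually has to point at it before soffice is launched. The sketch below is an assumption about that wiring (FONTCONFIG_FILE is a standard fontconfig variable; redirecting HOME to the writable /tmp is a common Lambda workaround), not part of the original answer:

```python
import os

def configure_fonts(task_root="/var/task"):
    """Point fontconfig at the bundled fonts.conf before launching soffice.

    The variable names here are assumptions: FONTCONFIG_FILE is a standard
    fontconfig environment variable, and HOME must be a writable directory
    for LibreOffice to start inside Lambda.
    """
    os.environ["FONTCONFIG_FILE"] = os.path.join(task_root, "fonts", "fonts.conf")
    os.environ["HOME"] = "/tmp"
    return os.environ["FONTCONFIG_FILE"]

print(configure_fonts("/var/task"))  # → /var/task/fonts/fonts.conf
```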
- Paste the following code into your lambda_function.py file:
import os
from io import BytesIO
import tarfile
import boto3
import subprocess
import brotli

libre_office_install_dir = '/tmp/instdir'

def load_libre_office():
    if os.path.exists(libre_office_install_dir) and os.path.isdir(libre_office_install_dir):
        print('We have a cached copy of LibreOffice, skipping extraction')
    else:
        print('No cached copy of LibreOffice exists, extracting tar stream from Brotli file.')
        buffer = BytesIO()
        with open('/opt/lo.tar.br', 'rb') as brotli_file:
            decompressor = brotli.Decompressor()
            while True:
                chunk = brotli_file.read(1024)
                buffer.write(decompressor.decompress(chunk))
                if len(chunk) < 1024:
                    break
        buffer.seek(0)
        print('Extracting tar stream to /tmp for caching.')
        with tarfile.open(fileobj=buffer) as tar:
            tar.extractall('/tmp')
        print('Done caching LibreOffice!')
    return f'{libre_office_install_dir}/program/soffice.bin'

def download_from_s3(bucket, key, download_path):
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, download_path)

def upload_to_s3(file_path, bucket, key):
    s3 = boto3.client("s3")
    s3.upload_file(file_path, bucket, key)

def convert_word_to_pdf(soffice_path, word_file_path, output_dir):
    conv_cmd = f"{soffice_path} --headless --norestore --invisible --nodefault --nofirststartwizard --nolockcheck --nologo --convert-to pdf:writer_pdf_Export --outdir {output_dir} {word_file_path}"
    response = subprocess.run(conv_cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if response.returncode != 0:
        response = subprocess.run(conv_cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if response.returncode != 0:
            return False
    return True

def lambda_handler(event, context):
    bucket = event["document_bucket"]
    key = event["document_key"]
    key_prefix, base_name = os.path.split(key)
    download_path = f"/tmp/{base_name}"
    output_dir = "/tmp"
    download_from_s3(bucket, key, download_path)
    soffice_path = load_libre_office()
    is_converted = convert_word_to_pdf(soffice_path, download_path, output_dir)
    if is_converted:
        file_name, _ = os.path.splitext(base_name)
        upload_to_s3(f"{output_dir}/{file_name}.pdf", bucket, f"{key_prefix}/{file_name}.pdf")
        return {"response": "file converted to PDF and available at same S3 location of input key"}
    else:
        return {"response": "cannot convert this document to PDF"}
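The decompress-and-cache step in load_libre_office can be exercised locally without the Brotli layer. The sketch below substitutes stdlib gzip for brotli purely for illustration (brotlipy may not be installed locally), but the streaming read / extract / cache-check logic is the same:

```python
import gzip
import io
import os
import tarfile
import tempfile

def extract_archive(archive_path, install_dir):
    # Skip extraction when a cached copy already exists, same idea as
    # load_libre_office(); gzip stands in for Brotli here.
    if os.path.isdir(install_dir):
        return install_dir
    buffer = io.BytesIO()
    with gzip.open(archive_path, 'rb') as f:
        while True:
            chunk = f.read(1024)
            if not chunk:
                break
            buffer.write(chunk)
    buffer.seek(0)
    with tarfile.open(fileobj=buffer) as tar:
        tar.extractall(os.path.dirname(install_dir))
    return install_dir

# Build a tiny tar.gz fixture mimicking the LibreOffice archive, then extract it.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'instdir')
os.makedirs(os.path.join(src, 'program'))
open(os.path.join(src, 'program', 'soffice.bin'), 'w').close()
archive = os.path.join(tmp, 'lo.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    tar.add(src, arcname='instdir')
dest = os.path.join(tmp, 'out', 'instdir')
os.makedirs(os.path.join(tmp, 'out'))
extract_archive(archive, dest)
print(os.path.exists(os.path.join(dest, 'program', 'soffice.bin')))  # → True
```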
- Build the brotlipy dependency in a Linux environment (the targeted Lambda runtime is Amazon Linux) and copy the brotli folder from its site-packages into your Lambda package.
At the end, the directory structure of your lambda (dependency package) should look like this:
.
+-- brotli/*
+-- fonts
| +-- fonts.conf
+-- lambda_function.py
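Packaging this tree into the deployment zip can be scripted; below is a minimal sketch using only the stdlib (the file names mirror the layout above, and Lambda expects lambda_function.py at the archive root):

```python
import os
import tempfile
import zipfile

def build_package(root_dir, zip_path):
    # Zip every file under root_dir, preserving paths relative to root_dir
    # so that lambda_function.py sits at the archive root.
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _, filenames in os.walk(root_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, root_dir))
    return zip_path

# Demo with a throwaway tree that mirrors the structure above.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'fonts'))
open(os.path.join(root, 'lambda_function.py'), 'w').close()
open(os.path.join(root, 'fonts', 'fonts.conf'), 'w').close()
out_dir = tempfile.mkdtemp()
pkg = build_package(root, os.path.join(out_dir, 'package.zip'))
print(sorted(zipfile.ZipFile(pkg).namelist()))  # → ['fonts/fonts.conf', 'lambda_function.py']
```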
You can use the following input event to invoke this Lambda handler, if your file's S3 URI is s3://my-bucket-name/dir/file.docx:
{
  "document_bucket": "my-bucket-name",
  "document_key": "dir/file.docx"
}
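Given that event, the handler derives the output key with plain path operations; a small sketch of just that mapping (no AWS calls involved):

```python
import os

def pdf_output_key(document_key):
    # Mirror the handler: split off the directory prefix, swap the extension.
    key_prefix, base_name = os.path.split(document_key)
    file_name, _ = os.path.splitext(base_name)
    return f"{key_prefix}/{file_name}.pdf"

print(pdf_output_key("dir/file.docx"))  # → dir/file.pdf
```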
Cheers! and let me know if you face any issue, would be happy to assist :)
Solution 2:[2]
Unfortunately, I do not have enough reputation to simply upvote or comment on abhinav's answer, but all credit to him for this answer.
I followed his instructions using Python 3.8, used the ARN for the Lambda Layer specific to my region, and it seemed to function perfectly. I pip-installed brotlipy on my Ubuntu subsystem and created the folder structure specified.
I adapted his lambda_handler function slightly, as below, and added an S3 trigger to the Lambda. I found that the original function would recursively trigger itself, as it was writing the PDFs back to the same S3 bucket that was triggering it. The code below writes to a separate S3 bucket named 'pdf.output'.
import urllib.parse  # needed for unquoting the S3 object key from the trigger event

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    output_bucket = 'pdf.output'
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    key_prefix, base_name = os.path.split(key)
    download_path = f"/tmp/{base_name}"
    output_dir = "/tmp"
    download_from_s3(bucket, key, download_path)
    soffice_path = load_libre_office()
    is_converted = convert_word_to_pdf(soffice_path, download_path, output_dir)
    if is_converted:
        file_name, _ = os.path.splitext(base_name)
        upload_to_s3(f"{output_dir}/{file_name}.pdf", output_bucket, f"{key_prefix}/{file_name}.pdf")
        return {"response": "file converted to PDF and available in the output bucket"}
    else:
        return {"response": "cannot convert this document to PDF"}
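Writing to a separate bucket is one way to break the recursive trigger. An alternative (a sketch of my own, not from the original answers) is to keep a single bucket but have the handler skip events for objects it produced itself, e.g. anything that is already a PDF or lives under a hypothetical output prefix:

```python
def should_convert(key, output_prefix="pdf-output/"):
    # Ignore objects the function wrote itself (already-converted PDFs, or
    # anything under the output prefix) so the S3 trigger cannot loop.
    if key.lower().endswith(".pdf"):
        return False
    if key.startswith(output_prefix):
        return False
    return True

print(should_convert("dir/file.docx"))        # → True
print(should_convert("pdf-output/file.pdf"))  # → False
```

Calling should_convert(key) at the top of lambda_handler and returning early on False keeps the single-bucket setup safe.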
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Richtea88 |
