How to extract metadata from docx file using Python?
How would I extract metadata (e.g. FileSize, FileModifyDate, FileAccessDate) from a docx file?
Solution 1:[1]
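You could use python-docx. A python-docx Document object has a core_properties attribute (a property, not a method) you can utilise. It exposes 15 metadata attributes such as author, category, etc. Note that these are the properties stored inside the document itself, not file-system attributes such as file size or access time.
The code below extracts some of these properties into a Python dictionary: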
import docx

def getMetaData(doc):
    # Copy selected core properties of the document into a plain dict.
    metadata = {}
    prop = doc.core_properties
    metadata["author"] = prop.author
    metadata["category"] = prop.category
    metadata["comments"] = prop.comments
    metadata["content_status"] = prop.content_status
    metadata["created"] = prop.created
    metadata["identifier"] = prop.identifier
    metadata["keywords"] = prop.keywords
    metadata["last_modified_by"] = prop.last_modified_by
    metadata["language"] = prop.language
    metadata["modified"] = prop.modified
    metadata["subject"] = prop.subject
    metadata["title"] = prop.title
    metadata["version"] = prop.version
    return metadata

file_path = "example.docx"  # hypothetical path; replace with your own .docx file
doc = docx.Document(file_path)
metadata_dict = getMetaData(doc)
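The fields named in the question (FileSize, FileModifyDate, FileAccessDate) look like file-system attributes rather than document properties, so core_properties will not return them. A minimal sketch of one way to read them with the standard library's os.stat follows. The function name get_file_metadata and the dictionary keys are my own choices for illustration, and timestamps are converted to UTC here as an assumption about what is wanted.

```python
import os
from datetime import datetime, timezone

def get_file_metadata(path):
    """Return size and timestamps that the file system reports for `path`.

    Hypothetical helper for illustration; the key names are arbitrary.
    """
    stat = os.stat(path)
    return {
        # Size of the file on disk, in bytes.
        "size_bytes": stat.st_size,
        # Last content modification time, as a timezone-aware UTC datetime.
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        # Last access time; some systems update this lazily or not at all.
        "accessed": datetime.fromtimestamp(stat.st_atime, tz=timezone.utc),
    }
```

Calling get_file_metadata(file_path) on the same path used above would give a second dictionary that could be merged with metadata_dict if you want everything in one place.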
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Davide Fiocco |
