Docx Redlines (Tracked Changes) for the Python Ecosystem
The main goal of this project is to address the significant gap in the open-source ecosystem around .docx document comparison tools. Currently, the process of comparing and generating redline documents (documents that highlight changes between versions) is complex and largely dominated by commercial software. These tools, while effective, often come with cost barriers and limitations in terms of accessibility and integration flexibility.
Python-Redlines aims to democratize the ability to run tracked change redlines for .docx, providing the open-source community with a tool to create .docx redlines without the need for commercial software. This will let more legal hackers and hobbyist innovators experiment and create tooling for enterprise and legal use.
The Open-XML-PowerTools project historically offered a solid foundation for working with .docx files and has an excellent (if imperfect) comparison engine in its WmlComparer class. However, Microsoft archived the repository almost five years ago, and a forked repo is not being actively maintained: its most recent commits date from 2 years ago and its issues list is disabled.
As a first step, our project aims to bring the existing capabilities of WmlComparer into the Python world. Thankfully, Open-XML-PowerTools is fully cross-platform, as it is written in .NET and compiles with the still-maintained .NET 8. The resulting binaries can be compiled for the latest versions of Windows and Linux (Ubuntu specifically, though other distributions should work fine too).
The initial release has a single engine, XmlPowerToolsEngine, which is just a Python wrapper for a simple C# utility written to leverage WmlComparer for 1-to-1 redlines. We hope this provides a stop-gap capability to Python developers seeking .docx redline capabilities.
Note that we don't plan to fork or maintain Open-XML-PowerTools. Version 4.4.0, which appears to be compatible only with Open XML SDK < 3.0.0, works for now, but it needs to be made compatible with the latest versions of the Open XML SDK to extend its life. There are also some issues that the sole maintainer of Open-XML-PowerTools seems unlikely to fix, and understanding the existing code base is no small task.
Looking towards the future, rather than reverse engineer WmlComparer and maintain a C# codebase, we envision a comparison engine written in Python. We've done some experimentation with xmldiff as the engine to compare the underlying XML of .docx files. Specifically, we've built a prototype that unzips .docx files, executes an XML comparison using xmldiff, and then reconstructs a tracked changes .docx with the proper Open XML (OOXML) tracked change tags. Preliminary experimentation with this approach has shown promise, indicating its feasibility for handling modifications such as simple span inserts and deletes.
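To give a feel for what "tracked change tags" means in practice, here is a minimal sketch of the reconstruction step only. It is not the project's prototype: it swaps xmldiff for the standard library's `difflib` on the plain text of a single paragraph, and the helper names (`add_run`, `redline_paragraph`) and the fixed date are assumptions made for illustration. It shows how inserted and deleted spans map onto WordprocessingML `w:ins` and `w:del` elements.

```python
import difflib
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return the namespace-qualified name for a w: element or attribute."""
    return f"{{{W_NS}}}{tag}"


def add_run(parent, text, deleted=False):
    """Append a run; deleted text must live in w:delText rather than w:t."""
    run = ET.SubElement(parent, w("r"))
    node = ET.SubElement(run, w("delText" if deleted else "t"))
    node.set(XML_SPACE, "preserve")
    node.text = text
    return run


def redline_paragraph(old, new, author="author_tag", date="2024-01-01T00:00:00Z"):
    """Hypothetical helper: word-level diff of two strings as one w:p element."""
    old_words, new_words = old.split(), new.split()
    para = ET.Element(w("p"))
    change_id = 0
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            add_run(para, " ".join(old_words[i1:i2]) + " ")
        if op in ("delete", "replace"):
            change_id += 1
            attrs = {w("id"): str(change_id), w("author"): author, w("date"): date}
            add_run(ET.SubElement(para, w("del"), attrs),
                    " ".join(old_words[i1:i2]) + " ", deleted=True)
        if op in ("insert", "replace"):
            change_id += 1
            attrs = {w("id"): str(change_id), w("author"): author, w("date"): date}
            add_run(ET.SubElement(para, w("ins"), attrs),
                    " ".join(new_words[j1:j2]) + " ")
    return para


if __name__ == "__main__":
    para = redline_paragraph("the quick fox", "the slow fox")
    print(ET.tostring(para, encoding="unicode"))
```

A real engine would also have to carry run formatting, handle changes that cross paragraph boundaries, and write the result back into the .docx archive, which is where most of the corner cases live.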
However, this ambitious endeavor is not without its challenges. The intricacies of .docx files and the potential for complex, corner-case scenarios necessitate a thoughtful and thorough development process. In the interim, WmlComparer is a great solution, as it has clearly been built to account for many such corner cases through a development process shaped by issues discovered by a large user base. The xmldiff engine will take some time to reach a level of maturity similar to WmlComparer. At the moment it is NOT included.
The Open-XML-PowerTools engine we're using in the initial releases requires .NET to run (don't worry, this is very well-supported cross-platform at the moment). Our builds are targeting x86-64 Linux and Windows, however, so you'll need to modify the build script and build new binaries if you want to target another runtime / architecture.
You can follow Microsoft's instructions for your Linux distribution
You can follow Microsoft's instructions for your Windows version
At the moment, we are not distributing via PyPI. You can easily install directly from this repo, however.
pip install git+https://github.com/JSv4/Python-Redlines
You can add this as a dependency like so
python_redlines @ git+https://github.com/JSv4/Python-Redlines@v0.0.1
If you just want to use the tool, jump into our quickstart guide.
XmlPowerToolsEngine is a Python wrapper class for the redlines C# command-line tool, the source of which is available in ./csproj/Program.cs. The redlines utility and wrapper let you compare two .docx files and show the differences as tracked changes (a "redline" document).
The redlines C# utility is a command-line tool that requires four arguments:
author_tag - A tag to identify the author of the changes.
original_path.docx - Path to the original document.
modified_path.docx - Path to the modified document.
redline_path.docx - Path where the redlined document will be saved.
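As a rough illustration of the argument order listed above, the following sketch builds and runs the command with `subprocess`. The location of the compiled binary depends on your build, so `binary` is a placeholder you must supply, and both function names are hypothetical rather than part of the library.

```python
import subprocess


def build_redlines_command(binary, author_tag, original, modified, redline):
    """Assemble the four positional arguments in the order the utility expects."""
    return [str(binary), author_tag, str(original), str(modified), str(redline)]


def run_redlines_cli(binary, author_tag, original, modified, redline):
    """Run the utility and raise CalledProcessError on a non-zero exit code."""
    cmd = build_redlines_command(binary, author_tag, original, modified, redline)
    return subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.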
The Python wrapper, XmlPowerToolsEngine, and its main method, run_redlines(), simplify the use of redlines by orchestrating its execution with Python and letting you pass in bytes or file paths for the original and modified documents.
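One way a wrapper can accept either bytes or file paths is to normalize both to a path before calling the command-line tool. The sketch below illustrates that idea only; `as_docx_path` is a hypothetical helper and is not taken from the XmlPowerToolsEngine source.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def as_docx_path(source):
    """Yield a filesystem path for `source`, which may be bytes or a path.

    Bytes are written to a temporary directory that is removed on exit;
    paths are passed through untouched.
    """
    if isinstance(source, (bytes, bytearray)):
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "input.docx"
            path.write_bytes(source)
            yield path
    else:
        yield Path(source)


if __name__ == "__main__":
    with as_docx_path(b"not a real docx") as path:
        print(path.name, path.stat().st_size)
```

The temporary file exists only inside the `with` block, so the command-line tool has to be invoked before the block exits.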
The project is structured as follows:
├── csproj/
│   ├── bin/
│   ├── obj/
│   ├── Program.cs
│   ├── redlines.csproj
│   └── redlines.sln
│
├── docs/
│   ├── developer-guide.md
│   └── quickstart.md
│
├── src/
│   └── python_redlines/
│       ├── bin/
│       │   └── .gitignore
│       ├── dist/
│       │   ├── .gitignore
│       │   ├── linux-x64-0.0.1.tar.gz
│       │   └── win-x64-0.0.1.zip
│       ├── __about__.py
│       ├── __init__.py
│       └── engines.py
│
└── tests/
    ├── fixtures/
    ├── test_openxml_differ.py
    └── __init__.py
src/python_redlines/: Contains the Python wrapper code.
dist/: Contains the zipped C# binaries for different platforms.
bin/: Target directory for extracted binaries.
tests/: Contains test cases and fixtures for the wrapper.
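The relationship between dist/ and bin/ can be sketched as follows: pick the archive that matches the current platform and unpack it into the target directory. This is an assumption-laden illustration, not the wrapper's actual code; the function names are hypothetical, and the archive naming simply mirrors the two files shown in the tree above.

```python
import platform
import tarfile
import zipfile
from pathlib import Path

# Archive name templates, mirroring the two files shipped in dist/
ARCHIVES = {
    "linux": "linux-x64-{version}.tar.gz",
    "windows": "win-x64-{version}.zip",
}


def archive_name(version, system=None, machine=None):
    """Return the archive file name for this platform (x86-64 only)."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    if system not in ARCHIVES or machine not in ("x86_64", "amd64"):
        raise RuntimeError(f"No prebuilt binaries for {system}/{machine}")
    return ARCHIVES[system].format(version=version)


def extract_binaries(dist_dir, bin_dir, version, system=None, machine=None):
    """Unpack the matching archive from dist_dir into bin_dir."""
    archive = Path(dist_dir) / archive_name(version, system, machine)
    bin_dir = Path(bin_dir)
    bin_dir.mkdir(parents=True, exist_ok=True)
    if archive.name.endswith(".zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(bin_dir)
    else:
        with tarfile.open(archive, "r:gz") as tf:
            tf.extractall(bin_dir)
    return bin_dir
```

On any other runtime or architecture the sketch raises an error, which matches the note earlier in this README that you would need to build new binaries yourself.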
If you want to contribute to the library or want to dive into some of the C# packaging architecture, go to our developer guide.