- Published on
Why Some XLSX Files Failed in Azure Document Intelligence and How I Fixed It
- Authors

- Name
- Daisuke Kobayashi
- https://twitter.com
Recently, when building RAG environments at work, I have often used azure-search-openai-demo as the base. In that setup, some .xlsx files failed in Azure Document Intelligence with UnsupportedContent. This post summarizes the cause and the fix.
Key points
- Azure Document Intelligence supports
XLSX, but some.xlsxfiles still producedUnsupportedContent. - The failing files contained
xl/embeddings/, which meant embedded OLE objects were still present. That was the critical difference from files that succeeded. - Once I detected
xl/embeddings/and removed unnecessary objects exposed throughOLEObjects()andOLEFormat, the files could be parsed successfully.
Background
While building a RAG environment with azure-search-openai-demo, I was using Azure Document Intelligence to analyze Excel files.
Some of the source files were older .xls files or macro-enabled .xlsm files, so I inserted a conversion step before analysis to turn them into .xlsx. Since Azure Document Intelligence supports XLSX, I assumed that normalizing everything into a supported format would be enough.
However, even after passing all files through the same conversion script, only some .xlsx files still failed. On the azure-search-openai-demo side, the failures appeared as InvalidRequest or UnsupportedContent, and the final message looked like this.
Content is not supported: The input content is corrupted or format is invalid.
Because some files succeeded, I suspected the problem was not the conversion process itself but a difference inside the resulting .xlsx files, so I started inspecting them.
Details
Since .xlsx files are ZIP archives, you can inspect their contents directly. When I checked the files that failed, I found that they still contained xl/embeddings/.
xl/embeddings/ stores embedded objects inside Excel files. In this case, only the files containing those embedded elements failed in Azure Document Intelligence, and once I removed them, parsing started to succeed. That is why I concluded this was the root cause.
The Excel object model also provides APIs related to embedded OLE objects, such as OLEObjects for worksheet OLE objects and OLEFormat for shapes that carry OLE data. On the OOXML and Open Specifications side, Excel also defines OLE data items. Reference: MS-XLSX: oleItem
So instead of handing the converted .xlsx files to Azure Document Intelligence as-is, I sanitized them in advance. In sanitize_xlsx.py, the processing flow is:
- Open the
.xlsxas a ZIP archive and check whetherxl/embeddings/exists. - Only for those files, open the workbook through Excel COM.
- Delete each sheet's
OLEObjects()and anyShapethat exposesOLEFormat. - Save the file and pass it to Azure Document Intelligence.
The code below also includes backup and rollback handling, but the part that directly solved the issue was the combination of detecting xl/embeddings/ and deleting the embedded OLE objects.
Here is the sanitize_xlsx.py script I actually used.
Open sanitize_xlsx.py
from __future__ import annotations
import argparse
import shutil
import sys
import time
import zipfile
from pathlib import Path
from typing import List, Tuple
try:
import win32com.client # type: ignore
except ImportError:
win32com = None # type: ignore
# ============================================================
# Inspection
# ============================================================
def is_valid_xlsx(path: Path) -> bool:
try:
with zipfile.ZipFile(path):
return True
except Exception:
return False
def has_embeddings(path: Path) -> bool:
try:
with zipfile.ZipFile(path) as z:
return any(name.startswith("xl/embeddings/") for name in z.namelist())
except Exception:
return False
# ============================================================
# Excel Sanitization
# ============================================================
def open_excel():
excel = win32com.client.Dispatch("Excel.Application") # type: ignore
excel.Visible = False
excel.DisplayAlerts = False
try:
excel.AskToUpdateLinks = False
except Exception:
pass
try:
excel.EnableEvents = False
except Exception:
pass
return excel
def remove_embedded_objects(path: Path) -> Tuple[int, int]:
deleted_ole = 0
deleted_shapes = 0
excel = open_excel()
try:
wb = excel.Workbooks.Open(str(path), ReadOnly=False, CorruptLoad=1)
try:
for ws in wb.Worksheets:
# --- OLEObjects ---
try:
count = ws.OLEObjects().Count
except Exception:
count = 0
for i in range(count, 0, -1):
try:
ws.OLEObjects(i).Delete()
deleted_ole += 1
except Exception:
pass
# --- Shapes with OLEFormat ---
try:
shapes = ws.Shapes
scount = shapes.Count
except Exception:
scount = 0
for i in range(scount, 0, -1):
try:
shape = shapes.Item(i)
try:
_ = shape.OLEFormat
shape.Delete()
deleted_shapes += 1
except Exception:
pass
except Exception:
pass
wb.Save()
finally:
wb.Close(SaveChanges=True)
finally:
excel.Quit()
return deleted_ole, deleted_shapes
# ============================================================
# File Processing
# ============================================================
def sanitize_file(path: Path, dry_run: bool) -> Tuple[str, str]:
if not is_valid_xlsx(path):
return "skipped", "invalid xlsx (zip error)"
if not has_embeddings(path):
return "skipped", "no embeddings"
if dry_run:
return "dry-run", "DRY-RUN"
backup_path = path.with_suffix(path.suffix + ".bak")
if backup_path.exists():
return "skipped", "backup already exists"
# --- Backup ---
shutil.copy2(path, backup_path)
try:
deleted_ole, deleted_shapes = remove_embedded_objects(path)
still_has = has_embeddings(path)
return "sanitized", (
f"sanitized "
f"(deleted_ole={deleted_ole}, "
f"deleted_shapes={deleted_shapes}, "
f"still_has_embeddings={still_has})"
)
except Exception as e:
# Rollback
restored_backup = False
try:
if path.exists():
path.unlink()
shutil.copy2(backup_path, path)
restored_backup = True
except Exception:
pass
return "failed", f"FAILED (restored_backup={restored_backup}) error={e}"
def process_folder(root: Path, dry_run: bool, sleep: float):
files = list(root.rglob("*.xlsx"))
sanitized: List[Tuple[Path, str]] = []
failed: List[Tuple[Path, str]] = []
dry_run_hits: List[Tuple[Path, str]] = []
skipped = 0
print(f"Scanning {len(files)} xlsx files under {root}")
for file in files:
status, msg = sanitize_file(file, dry_run)
if status == "sanitized":
sanitized.append((file, msg))
print(f"[SANITIZE] {file}\n -> {msg}")
elif status == "dry-run":
dry_run_hits.append((file, msg))
print(f"[DRY-RUN] {file}\n -> {msg}")
elif status == "failed":
failed.append((file, msg))
print(f"[FAILED] {file}\n -> {msg}")
else:
skipped += 1
if sleep:
time.sleep(sleep)
print("\n==============================")
print("SUMMARY")
print(f"sanitized: {len(sanitized)}")
print(f"dry-run: {len(dry_run_hits)}")
print(f"failed: {len(failed)}")
print(f"skipped: {skipped}")
if sanitized:
print("\nSanitized files:")
for p, _ in sanitized:
print(f"- {p}")
if failed:
print("\nFailed files:")
for p, _ in failed:
print(f"- {p}")
# ============================================================
# CLI
# ============================================================
def main():
parser = argparse.ArgumentParser(
description="Sanitize xlsx files by removing embedded OLE objects (xl/embeddings)."
)
parser.add_argument("root", help="Root directory to scan")
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--sleep", type=float, default=0.0)
args = parser.parse_args()
if sys.platform != "win32":
print("This script requires Windows (Excel COM).")
sys.exit(1)
if win32com is None:
print("pywin32 not installed. Run: pip install pywin32")
sys.exit(1)
root = Path(args.root).resolve()
if not root.is_dir():
print("Invalid directory.")
sys.exit(1)
process_folder(root, args.dry_run, args.sleep)
if __name__ == "__main__":
main()