Published on

Why Some XLSX Files Failed in Azure Document Intelligence and How I Fixed It

Authors

Recently, when building RAG environments at work, I have often used azure-search-openai-demo as the base. In that setup, some .xlsx files failed in Azure Document Intelligence with UnsupportedContent. This post summarizes the cause and the fix.

Key points

  • Azure Document Intelligence supports XLSX, but some .xlsx files still produced UnsupportedContent.
  • The failing files contained xl/embeddings/, which meant embedded OLE objects were still present. That was the critical difference from files that succeeded.
  • Once I detected xl/embeddings/ and removed unnecessary objects exposed through OLEObjects() and OLEFormat, the files could be parsed successfully.

Background

While building a RAG environment with azure-search-openai-demo, I was using Azure Document Intelligence to analyze Excel files.

Some of the source files were older .xls files or macro-enabled .xlsm files, so I inserted a conversion step before analysis to turn them into .xlsx. Since Azure Document Intelligence supports XLSX, I assumed that normalizing everything into a supported format would be enough.

However, even after passing all files through the same conversion script, only some .xlsx files still failed. On the azure-search-openai-demo side, the failures appeared as InvalidRequest or UnsupportedContent, and the final message looked like this.

Content is not supported: The input content is corrupted or format is invalid.

Because some files succeeded, I suspected the problem was not the conversion process itself but a difference inside the resulting .xlsx files, so I started inspecting them.

Details

Since .xlsx files are ZIP archives, you can inspect their contents directly. When I checked the files that failed, I found that they still contained xl/embeddings/.

xl/embeddings/ stores embedded objects inside Excel files. In this case, only the files containing those embedded elements failed in Azure Document Intelligence, and once I removed them, parsing started to succeed. That is why I concluded this was the root cause.

The Excel object model also provides APIs related to embedded OLE objects, such as OLEObjects for worksheet OLE objects and OLEFormat for shapes that carry OLE data. On the OOXML and Open Specifications side, Excel also defines OLE data items. Reference: MS-XLSX: oleItem

So instead of handing the converted .xlsx files to Azure Document Intelligence as-is, I sanitized them in advance. In sanitize_xlsx.py, the processing flow is:

  1. Open the .xlsx as a ZIP archive and check whether xl/embeddings/ exists.
  2. Only for those files, open the workbook through Excel COM.
  3. Delete each sheet's OLEObjects() and any Shape that exposes OLEFormat.
  4. Save the file and pass it to Azure Document Intelligence.

The code below also includes backup and rollback handling, but the part that directly solved the issue was the combination of detecting xl/embeddings/ and deleting the embedded OLE objects.

Here is the sanitize_xlsx.py script I actually used.

Open sanitize_xlsx.py
from __future__ import annotations

import argparse
import shutil
import sys
import time
import zipfile
from pathlib import Path
from typing import List, Tuple

try:
    import win32com.client  # type: ignore
except ImportError:
    win32com = None  # type: ignore


# ============================================================
# Inspection
# ============================================================


def is_valid_xlsx(path: Path) -> bool:
    try:
        with zipfile.ZipFile(path):
            return True
    except Exception:
        return False


def has_embeddings(path: Path) -> bool:
    try:
        with zipfile.ZipFile(path) as z:
            return any(name.startswith("xl/embeddings/") for name in z.namelist())
    except Exception:
        return False


# ============================================================
# Excel Sanitization
# ============================================================


def open_excel():
    excel = win32com.client.Dispatch("Excel.Application")  # type: ignore
    excel.Visible = False
    excel.DisplayAlerts = False
    try:
        excel.AskToUpdateLinks = False
    except Exception:
        pass
    try:
        excel.EnableEvents = False
    except Exception:
        pass
    return excel


def remove_embedded_objects(path: Path) -> Tuple[int, int]:
    deleted_ole = 0
    deleted_shapes = 0

    excel = open_excel()
    try:
        wb = excel.Workbooks.Open(str(path), ReadOnly=False, CorruptLoad=1)
        try:
            for ws in wb.Worksheets:
                # --- OLEObjects ---
                try:
                    count = ws.OLEObjects().Count
                except Exception:
                    count = 0

                for i in range(count, 0, -1):
                    try:
                        ws.OLEObjects(i).Delete()
                        deleted_ole += 1
                    except Exception:
                        pass

                # --- Shapes with OLEFormat ---
                try:
                    shapes = ws.Shapes
                    scount = shapes.Count
                except Exception:
                    scount = 0

                for i in range(scount, 0, -1):
                    try:
                        shape = shapes.Item(i)
                        try:
                            _ = shape.OLEFormat
                            shape.Delete()
                            deleted_shapes += 1
                        except Exception:
                            pass
                    except Exception:
                        pass

            wb.Save()
        finally:
            wb.Close(SaveChanges=True)
    finally:
        excel.Quit()

    return deleted_ole, deleted_shapes


# ============================================================
# File Processing
# ============================================================


def sanitize_file(path: Path, dry_run: bool) -> Tuple[str, str]:
    if not is_valid_xlsx(path):
        return "skipped", "invalid xlsx (zip error)"

    if not has_embeddings(path):
        return "skipped", "no embeddings"

    if dry_run:
        return "dry-run", "DRY-RUN"

    backup_path = path.with_suffix(path.suffix + ".bak")

    if backup_path.exists():
        return "skipped", "backup already exists"

    # --- Backup ---
    shutil.copy2(path, backup_path)

    try:
        deleted_ole, deleted_shapes = remove_embedded_objects(path)

        still_has = has_embeddings(path)

        return "sanitized", (
            f"sanitized "
            f"(deleted_ole={deleted_ole}, "
            f"deleted_shapes={deleted_shapes}, "
            f"still_has_embeddings={still_has})"
        )

    except Exception as e:
        # Rollback
        restored_backup = False
        try:
            if path.exists():
                path.unlink()
            shutil.copy2(backup_path, path)
            restored_backup = True
        except Exception:
            pass
        return "failed", f"FAILED (restored_backup={restored_backup}) error={e}"


def process_folder(root: Path, dry_run: bool, sleep: float):
    files = list(root.rglob("*.xlsx"))

    sanitized: List[Tuple[Path, str]] = []
    failed: List[Tuple[Path, str]] = []
    dry_run_hits: List[Tuple[Path, str]] = []
    skipped = 0

    print(f"Scanning {len(files)} xlsx files under {root}")

    for file in files:
        status, msg = sanitize_file(file, dry_run)

        if status == "sanitized":
            sanitized.append((file, msg))
            print(f"[SANITIZE] {file}\n  -> {msg}")
        elif status == "dry-run":
            dry_run_hits.append((file, msg))
            print(f"[DRY-RUN] {file}\n  -> {msg}")
        elif status == "failed":
            failed.append((file, msg))
            print(f"[FAILED] {file}\n  -> {msg}")
        else:
            skipped += 1

        if sleep:
            time.sleep(sleep)

    print("\n==============================")
    print("SUMMARY")
    print(f"sanitized: {len(sanitized)}")
    print(f"dry-run: {len(dry_run_hits)}")
    print(f"failed: {len(failed)}")
    print(f"skipped: {skipped}")

    if sanitized:
        print("\nSanitized files:")
        for p, _ in sanitized:
            print(f"- {p}")

    if failed:
        print("\nFailed files:")
        for p, _ in failed:
            print(f"- {p}")


# ============================================================
# CLI
# ============================================================


def main():
    parser = argparse.ArgumentParser(
        description="Sanitize xlsx files by removing embedded OLE objects (xl/embeddings)."
    )
    parser.add_argument("root", help="Root directory to scan")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--sleep", type=float, default=0.0)

    args = parser.parse_args()

    if sys.platform != "win32":
        print("This script requires Windows (Excel COM).")
        sys.exit(1)

    if win32com is None:
        print("pywin32 not installed. Run: pip install pywin32")
        sys.exit(1)

    root = Path(args.root).resolve()

    if not root.is_dir():
        print("Invalid directory.")
        sys.exit(1)

    process_folder(root, args.dry_run, args.sleep)


if __name__ == "__main__":
    main()

References