Extract ChemDraw files from *.docx on a Mac

Recently, I came across a paper and wanted to work with the published molecules. In cheminformatics, working with molecules means loading them from a file, i.e., from a machine-readable representation. Since transcribing them from the printed pictures in the paper is cumbersome, I wrote to the author and asked whether I could have them as files. He sent me a Word document with embedded ChemDraw images. Great: ChemDraw can save molecules in different file formats, e.g., SMILES. So I got the trial version for Mac. Had I been working on Windows, it would have been easy, since one can apparently just copy the images and paste them into ChemDraw. However, Microsoft Office for Mac is not as cooperative, probably because the OS handles copying embedded files differently. Anyhow, the molecules can still be extracted manually on the Mac.

Requirements

  • p7zip (via Homebrew)
  • unzip (pre-installed)
  • bash (default shell)
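
Before starting, it may be worth checking that the tools are actually on the PATH. The following is a minimal sketch; the helper name missing_tools is my own invention, and I am assuming that the Homebrew p7zip package provides the 7z command.

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper name, not part of any tool.)
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools unzip 7z)"
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
    # Assumption: Homebrew is installed and its p7zip formula ships 7z.
    echo "p7zip can be installed with: brew install p7zip"
fi
```

If nothing is printed, both unzip and 7z were found.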

Procedure

  1. Unzip the .docx file into its own directory (a .docx file is a zip archive).
    • mkdir extracted
    • cp file.docx extracted/
    • cd extracted
    • unzip file.docx
  2. The images are then found in extracted/word/media, the binary ChemDraw files in extracted/word/embeddings.
    • cd word/embeddings
  3. Rename the binary files from .bin to .7z.
    • for file in *.bin; do mv "$file" "${file%.bin}.7z"; done
  4. Extract each file with 7z into its own directory (the extracted files all have the same name, so they would otherwise overwrite each other), then move each result into the current directory under a new name.
    • for file in *.7z; do DIR="${file%.*}"; echo "$DIR"; mkdir "$DIR"; 7z x -o"$DIR" "$file"; mv "$DIR/CONTENTS" "$DIR.cdx"; done
  5. Then move them anywhere you want (the target directory must already exist).
    • mv *.cdx anywhere
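
The steps above could be collected into a single shell function, roughly as follows. This is only a sketch under the same assumptions as the manual procedure: unzip and 7z are on the PATH, the embedded objects live in word/embeddings as *.bin files, and each one stores its drawing in a stream called CONTENTS. The function name extract_cdx and the default output directory cdx are my own choices.

```shell
# Sketch: pull embedded ChemDraw drawings out of a .docx as .cdx files.
# Usage: extract_cdx file.docx [output-directory]
extract_cdx() {
    local docx="$1" outdir="${2:-cdx}"
    [ -f "$docx" ] || { echo "no such file: $docx" >&2; return 1; }

    # Work in a temporary directory so nothing next to the document is touched.
    local workdir
    workdir="$(mktemp -d)" || return 1

    # Step 1: a .docx is a zip archive, so unzip can unpack it.
    if ! unzip -q "$docx" -d "$workdir"; then
        rm -rf "$workdir"
        return 1
    fi

    mkdir -p "$outdir"
    local bin name
    for bin in "$workdir"/word/embeddings/*.bin; do
        [ -e "$bin" ] || continue            # the glob matched nothing
        name="$(basename "$bin" .bin)"

        # Step 3: rename .bin to .7z, as in the manual procedure.
        mv "$bin" "$workdir/$name.7z"

        # Step 4: extract into a directory of its own, because every
        # archive contains a file with the same name.
        7z x -y -o"$workdir/$name" "$workdir/$name.7z" >/dev/null || continue

        # Objects without a CONTENTS stream (assumed not to be ChemDraw
        # drawings) are skipped.
        if [ -f "$workdir/$name/CONTENTS" ]; then
            mv "$workdir/$name/CONTENTS" "$outdir/$name.cdx"
            echo "$outdir/$name.cdx"
        fi
    done

    rm -rf "$workdir"
}
```

After sourcing the function, a call such as extract_cdx file.docx molecules should leave one .cdx file per embedded drawing in the molecules directory and print their paths.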