Extracting embedded files using XMP metadata


Table of Contents

1. Extracting embedded files using XMP metadata - Crane-XMPFileExtract.xsl
1.1. Obtaining XMP metadata from a PDF file
1.2. Example XMP
1.3. Stylesheet invocation parameters
1.4. Extraction logic

1. Extracting embedded files using XMP metadata - Crane-XMPFileExtract.xsl

Filename: Crane-XMPFileExtract.xsl

$Id: Crane-XMPExtract.xsl,v 1.9 2011/01/29 18:40:37 admin Exp $

This stylesheet locates files embedded in the input Adobe XMP metadata file, following Crane's convention of storing each file as escaped clear text in an XMP element:

  • each file to be extracted as clear text is found in either the element:
    <c:file xmlns:c="http://www.CraneSoftwrights.com/ns/XMP/">...</c:file>
    or the element:
    <c:file xmlns:c="http://www.cranesoftwrights.com/ns/XMP/">...</c:file>
    The prefix is irrelevant; only the namespace URI matters. Note that Adobe Acrobat 8 appears to require the lower-case domain name in the URI sometimes, but not always. If you have any idea why this is, please let us know.
  • when the uri= attribute is absent, the file is written to standard output (the attribute may be absent only when the XMP contains a single <c:file> element)
  • when the uri= attribute is present, the file is written to the given filename
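
The matching rule above can be illustrated outside XSLT. The following is a minimal Python sketch, not the stylesheet's own code: it collects every file element in either spelling of the namespace URI, whatever prefix is used. The function name find_embedded_files and the sample data are hypothetical, and the sketch assumes uri= is an unqualified attribute on the element itself.

```python
import xml.etree.ElementTree as ET

# Both spellings of the namespace URI named in the text above.
CRANE_NAMESPACES = (
    "http://www.CraneSoftwrights.com/ns/XMP/",
    "http://www.cranesoftwrights.com/ns/XMP/",
)


def find_embedded_files(xmp_text):
    """Return a list of (uri, content) pairs, one per embedded file element.

    uri is None when the attribute is absent. Assumption: uri= is an
    unqualified attribute on the file element.
    """
    root = ET.fromstring(xmp_text)
    found = []
    for element in root.iter():
        # ElementTree spells qualified names as "{namespace-uri}local-name",
        # so the prefix chosen in the document never affects the match.
        if any(element.tag == "{%s}file" % ns for ns in CRANE_NAMESPACES):
            found.append((element.get("uri"), element.text or ""))
    return found


# Hypothetical sample: two files, different prefixes, both URI spellings.
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="">
      <c:file xmlns:c="http://www.cranesoftwrights.com/ns/XMP/" uri="a.txt">first</c:file>
      <q:file xmlns:q="http://www.CraneSoftwrights.com/ns/XMP/" uri="b.txt">second</q:file>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

embedded = find_embedded_files(SAMPLE)
```

Here embedded holds one pair per element, in document order, regardless of whether the element used the c or the q prefix.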

1.1. Obtaining XMP metadata from a PDF file

The input to this stylesheet is the XMP metadata exported or extracted from a PDF file.

1.1.1. Exporting XMP using Adobe Acrobat

With Adobe Acrobat 8.2.5, export the XMP as follows:

  • use File/Properties to open up the document properties
  • open the Description tab
  • press the Additional Metadata button to open the XMP dialogue box
  • in the left pane select Advanced
  • press the Save... button to export the information to the file that will be the input to this stylesheet

1.1.2. Extracting XMP using Crane-PDF2XMP.py

This package includes the Crane-PDF2XMP.py program written in the Python language for extracting the first embedded XMP packet found in a PDF document.

The program reads the PDF file from standard input and writes the first XMP packet found in it to standard output. It returns a non-zero exit code if there is a problem. A typical invocation is:

python Crane-PDF2XMP.py <j.pdf >j.xmp

The program's source code documents how the XMP content of a PDF file is found by its signature processing instruction.
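
For readers without the source at hand, the idea of locating a packet by its signature processing instruction can be sketched as follows. This is an illustrative Python sketch and not the code of Crane-PDF2XMP.py: the function name first_xmp_packet is hypothetical, and it assumes the metadata stream is stored uncompressed in the PDF, so the packet wrapper is visible in the raw bytes.

```python
import re

# An XMP packet opens with an "<?xpacket begin=...?>" processing instruction
# and closes with an "<?xpacket end=...?>" one. The non-greedy quantifiers
# stop at the first closing instruction, so only the first packet is matched.
PACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def first_xmp_packet(pdf_bytes):
    """Return the first XMP packet found in pdf_bytes, or None if absent.

    Assumption: the packet is not compressed inside the PDF.
    """
    match = PACKET.search(pdf_bytes)
    return match.group(0) if match else None


# Hypothetical stand-in for the bytes of a PDF holding two packets.
FAKE_PDF = (
    b"%PDF-1.4\nstream\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"/>'
    b'<?xpacket end="w"?>'
    b"\nendstream\n"
    b'<?xpacket begin="" id="other"?>second<?xpacket end="r"?>'
)

packet = first_xmp_packet(FAKE_PDF)
```

A command-line wrapper in the same spirit would pass sys.stdin.buffer.read() to the function, write the result to sys.stdout.buffer, and exit with a non-zero status when the result is None.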

1.2. Example XMP

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-702">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <!-- PDFA ID -->
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>

    <!-- XMP -->
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate/>
      <xmp:ModifyDate/>
    </rdf:Description>
    
    <!-- PDF -->
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <pdf:Keywords>Testing, PDF/A</pdf:Keywords>
      <pdf:Producer/>
    </rdf:Description>
    
    <!-- dublin core -->
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">dc:title - should appear in pdf info.title</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>dc:creator - should appear in pdf info.author</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:description>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">dc:description - should appear in pdf info.subject</rdf:li>
        </rdf:Alt>
      </dc:description>
    </rdf:Description>

    <!--file schema-->
    <rdf:Description rdf:about=""
                   xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
                   xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
                   xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#"
                   xmlns:pdfaType="http://www.aiim.org/pdfa/ns/type#">

      <!-- Container for all embedded extension schema descriptions -->
      <pdfaExtension:schemas>
        <rdf:Bag>
          <rdf:li rdf:parseType="Resource">  
            <!-- Optional description of schema -->
            <pdfaSchema:schema>Crane file container schema</pdfaSchema:schema>
            
            <!-- Schema namespace URI -->
            <pdfaSchema:namespaceURI>http://www.cranesoftwrights.com/ns/XMP/</pdfaSchema:namespaceURI>
            
            <!-- Preferred schema namespace prefix -->
            <pdfaSchema:prefix>c</pdfaSchema:prefix>
            
            <!-- Description of schema properties -->
            <pdfaSchema:property>
              <rdf:Seq>
                <rdf:li rdf:parseType="Resource">
                  <pdfaProperty:name>file</pdfaProperty:name>
                  <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                  <pdfaProperty:category>external</pdfaProperty:category>
                  <pdfaProperty:description>Embedded file</pdfaProperty:description>
                </rdf:li>
              </rdf:Seq>
            </pdfaSchema:property>
            <pdfaSchema:valueType>
              <rdf:Seq>
                <rdf:li rdf:parseType="Resource">
                  <pdfaType:type>Text</pdfaType:type>
                  <pdfaType:description>Embedded file</pdfaType:description>
                </rdf:li>
              </rdf:Seq>
            </pdfaSchema:valueType>
          </rdf:li>
        </rdf:Bag>
      </pdfaExtension:schemas>
    </rdf:Description>
    
    <!--file content-->
    <rdf:Description rdf:about="">
      <c:file xmlns:c="http://www.cranesoftwrights.com/ns/XMP/">
        ...file content as text with any XML markup escaped...
      </c:file>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

1.3. Stylesheet invocation parameters

method="text" (xsl:output)

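The key to this approach is that the content of the element is a simple text file, even if it contains markup characters. With the text output method, the escaped markup in the element is written out as literal characters.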
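
The effect of escaping can be seen with any XML parser. The following Python sketch uses a hypothetical embedded file to show that the parser sees the element content as one run of text with no child elements, and that the recovered text is the original file with its markup restored.

```python
import xml.etree.ElementTree as ET

# Hypothetical embedded file: a tiny XML document whose markup characters
# have been escaped so that it travels as plain text inside the XMP.
XMP_FRAGMENT = (
    '<c:file xmlns:c="http://www.cranesoftwrights.com/ns/XMP/">'
    "&lt;greeting&gt;Hello &amp;amp; welcome&lt;/greeting&gt;"
    "</c:file>"
)

element = ET.fromstring(XMP_FRAGMENT)

# The parser undoes one level of escaping: the element has no children,
# only text, and that text is the embedded file exactly as it was authored.
child_count = len(element)
extracted = element.text
print(extracted)  # prints: <greeting>Hello &amp; welcome</greeting>
```

Note that the ampersand inside the embedded file remains escaped once, as it must be for the extracted file to be well-formed XML in its own right.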

1.4. Extraction logic

match="/" (xsl:template)

Extract all files from the XMP content. If only one file is embedded and its uri= attribute is absent, the content is written to standard output. If more than one file is embedded, each file must have a uri= attribute that specifies the filename for export.

The logic tries the extensions available for writing output files so that it works with as many XSLT processors as possible.
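
The output rule described above can be sketched independently of any XSLT processor. This Python sketch is illustrative only: the function name write_embedded_files is hypothetical, the uri= value is assumed to be a local file path, and raising an error when a uri= is missing among several files is an assumption, since the behaviour of the stylesheet in that case is not stated here.

```python
import io
import os
import sys
import tempfile


def write_embedded_files(files, stdout=None):
    """Write each (uri, content) pair to its destination.

    A pair whose uri is None goes to standard output, which is allowed
    only when it is the sole file. Assumption: uri is a local file path.
    """
    out = stdout if stdout is not None else sys.stdout
    # Check the rule before writing anything, so a bad input writes nothing.
    if len(files) > 1 and any(uri is None for uri, _ in files):
        raise ValueError("each file needs a uri= when more than one is embedded")
    for uri, content in files:
        if uri is None:
            out.write(content)
        else:
            with open(uri, "w", encoding="utf-8") as handle:
                handle.write(content)


# One file without uri=: the content goes to (captured) standard output.
captured = io.StringIO()
write_embedded_files([(None, "only file\n")], stdout=captured)
single_output = captured.getvalue()

# Two files with uri=: each is written to its own filename.
with tempfile.TemporaryDirectory() as folder:
    path_a = os.path.join(folder, "a.txt")
    path_b = os.path.join(folder, "b.txt")
    write_embedded_files([(path_a, "first"), (path_b, "second")])
    with open(path_b, encoding="utf-8") as handle:
        second_content = handle.read()

# Two files, one without uri=: rejected under the assumption stated above.
try:
    write_embedded_files([(None, "x"), ("b.txt", "y")])
    rejected = False
except ValueError:
    rejected = True
```

Checking the rule before the loop keeps the sketch from producing partial output when the input breaks the rule.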