Filename: Crane-XMPFileExtract.xsl
$Id: Crane-XMPExtract.xsl,v 1.9 2011/01/29 18:40:37 admin Exp $
This stylesheet locates files embedded in the input Adobe XMP metadata file, following Crane's convention of storing each file as escaped clear text in an XMP element:
<c:file xmlns:c="http://www.CraneSoftwrights.com/ns/XMP/">...</c:file>
<c:file xmlns:c="http://www.cranesoftwrights.com/ns/XMP/">...</c:file>
If the uri= attribute is absent, write the file to standard output (note this attribute cannot be absent if there is more than one <c:file> element in the XMP).
If the uri= attribute is present, write the file to the given filename.
The input to this stylesheet is the XMP metadata exported or extracted from a PDF file.
When using Adobe Acrobat 8.2.5 this is accomplished as follows:
Select File/Properties to open the document properties.
Select the Description tab.
Press the Additional Metadata button to open the XMP dialogue box.
Select Advanced.
Press the Save... button to export the information to the file that is input to this stylesheet.
This package includes the Crane-PDF2XMP.py program, written in the Python language, for extracting the first embedded XMP packet found in a PDF document.
The standard input is the PDF file and the standard output is the first XMP fragment found therein. A non-zero error code is returned if there is a problem. The invocation follows this simple approach:
python Crane-PDF2XMP.py <j.pdf >j.xmp
The program source code has related documentation regarding finding XMP content of a PDF file by its signature processing instruction.
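The idea of locating the packet by its signature processing instruction can be sketched as follows. This is not the source of Crane-PDF2XMP.py, only a minimal illustration; the function name first_xmp_packet is hypothetical, and the sketch assumes the XMP packet is stored in the PDF without compression or encryption so that its bytes can be found by a plain search.

```python
BEGIN_MARK = b"<?xpacket begin="   # signature that opens an XMP packet
END_MARK = b"<?xpacket end="       # signature that closes an XMP packet


def first_xmp_packet(data):
    """Return the first XMP packet found in data (bytes), or None.

    Hypothetical helper: assumes the packet is stored unfiltered,
    so a plain byte search is enough to find it.
    """
    start = data.find(BEGIN_MARK)
    if start < 0:
        return None
    end = data.find(END_MARK, start)
    if end < 0:
        return None
    close = data.find(b"?>", end)  # end of the closing processing instruction
    if close < 0:
        return None
    return data[start:close + 2]
```

A command-line wrapper in the spirit of the invocation above could read sys.stdin.buffer, pass the bytes to first_xmp_packet, write the result to sys.stdout.buffer, and exit with a non-zero code when None is returned.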
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-702">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <!-- PDFA ID -->
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
    <!-- XMP -->
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate/>
      <xmp:ModifyDate/>
    </rdf:Description>
    <!-- PDF -->
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <pdf:Keywords>Testing, PDF/A</pdf:Keywords>
      <pdf:Producer/>
    </rdf:Description>
    <!-- dublin core -->
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">dc:title - should appear in pdf info.title</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>dc:creator - should appear in pdf info.author</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:description>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">dc:description - should appear in pdf info.subject</rdf:li>
        </rdf:Alt>
      </dc:description>
    </rdf:Description>
    <!--file schema-->
    <rdf:Description rdf:about=""
        xmlns:pdfaExtension="http://www.aiim.org/pdfa/ns/extension/"
        xmlns:pdfaSchema="http://www.aiim.org/pdfa/ns/schema#"
        xmlns:pdfaProperty="http://www.aiim.org/pdfa/ns/property#"
        xmlns:pdfaType="http://www.aiim.org/pdfa/ns/type#">
      <!-- Container for all embedded extension schema descriptions -->
      <pdfaExtension:schemas>
        <rdf:Bag>
          <rdf:li rdf:parseType="Resource">
            <!-- Optional description of schema -->
            <pdfaSchema:schema>Crane file container schema</pdfaSchema:schema>
            <!-- Schema namespace URI -->
            <pdfaSchema:namespaceURI>http://www.cranesoftwrights.com/ns/XMP/</pdfaSchema:namespaceURI>
            <!-- Preferred schema namespace prefix -->
            <pdfaSchema:prefix>c</pdfaSchema:prefix>
            <!-- Description of schema properties -->
            <pdfaSchema:property>
              <rdf:Seq>
                <rdf:li rdf:parseType="Resource">
                  <pdfaProperty:name>file</pdfaProperty:name>
                  <pdfaProperty:valueType>Text</pdfaProperty:valueType>
                  <pdfaProperty:category>external</pdfaProperty:category>
                  <pdfaProperty:description>Embedded file</pdfaProperty:description>
                </rdf:li>
              </rdf:Seq>
            </pdfaSchema:property>
            <pdfaSchema:valueType>
              <rdf:Seq>
                <rdf:li rdf:parseType="Resource">
                  <pdfaType:type>Text</pdfaType:type>
                  <pdfaType:description>Embedded file</pdfaType:description>
                </rdf:li>
              </rdf:Seq>
            </pdfaSchema:valueType>
          </rdf:li>
        </rdf:Bag>
      </pdfaExtension:schemas>
    </rdf:Description>
    <!--file content-->
    <rdf:Description rdf:about="">
      <c:file xmlns:c="http://www.cranesoftwrights.com/ns/XMP/">
        ...file content as text with any XML markup escaped...
      </c:file>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
The key to this approach is that the content of the element is a simple text file, even if it contains markup characters.
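To illustrate why escaping makes the element content a simple text file, the following sketch wraps a text that contains markup characters in a c:file element and recovers it unchanged by parsing. The helper names wrap_as_c_file and unwrap_c_file are hypothetical and are not part of the Crane package; only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace URI as it appears in the sample XMP above.
C_NS = "http://www.cranesoftwrights.com/ns/XMP/"


def wrap_as_c_file(text):
    """Return a <c:file> element string holding text with markup escaped."""
    # escape() replaces &, < and > so the content is plain character data.
    return '<c:file xmlns:c="%s">%s</c:file>' % (C_NS, escape(text))


def unwrap_c_file(xml_fragment):
    """Parse a <c:file> element string and return its text content."""
    # The XML parser undoes the escaping, giving back the original text.
    return ET.fromstring(xml_fragment).text or ""
```

Because the embedded markup is escaped, the parser sees a single text node inside c:file rather than child elements, which is why the content can be written out as-is.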
Extract all files from the XMP content. If only one file is embedded and its uri= attribute is absent, the content is written to standard output. If more than one file is embedded, each file needs a uri= attribute to specify the filename for export. The logic uses processor-specific extensions to cover as many XSLT processors as possible.
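The extraction rules described above can be sketched in Python as follows. This is an illustration of the rules only, not a port of the stylesheet: the function names are hypothetical, the uri attribute is assumed to be an unqualified attribute on the c:file element, and both spellings of the namespace URI shown in this documentation are accepted as an assumption.

```python
import sys
import xml.etree.ElementTree as ET

# Assumption: accept both spellings of the namespace URI shown in this document.
C_NAMESPACES = (
    "http://www.cranesoftwrights.com/ns/XMP/",
    "http://www.CraneSoftwrights.com/ns/XMP/",
)


def find_embedded_files(xmp_text):
    """Return (uri_or_None, content) pairs for each c:file, in document order."""
    root = ET.fromstring(xmp_text)
    wanted = {"{%s}file" % ns for ns in C_NAMESPACES}
    return [(e.get("uri"), e.text or "") for e in root.iter() if e.tag in wanted]


def extract_files(xmp_text, stdout=sys.stdout):
    """Write each embedded file out and return the number of files handled."""
    files = find_embedded_files(xmp_text)
    # With more than one embedded file, every one must name its destination.
    if len(files) > 1 and any(uri is None for uri, _ in files):
        raise ValueError("each <c:file> needs uri= when more than one is embedded")
    for uri, content in files:
        if uri is None:
            stdout.write(content)          # single file without uri=
        else:
            with open(uri, "w", encoding="utf-8") as out:
                out.write(content)         # file named by uri=
    return len(files)
```

The check for a missing uri= runs before anything is written, so an invalid input with several files produces no partial output.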