The ISO Has Been Studying ZIP
Annex A of "New Work Item Proposal on Document Packaging" (April 12, 2010), ISO/IEC JTC 1/SC 34 N 1414, said:
Today many electronic documents are embodied not in wholly proprietary formats, but in formats built on the foundation of standards.
One increasingly common approach is to specify formats in which XML documents and other digital resources are stored together in an archive based on a minimal implementation of what is known as the “ZIP” format.
Examples of document-centric formats which take this approach include:
• ISO/IEC 26300 (Open Document Format for Office Applications)
• ISO/IEC 29500 (Office Open XML)
• EPUB ( standardized by The International Digital Publishing Forum http://www.idpf.org/ )
• W3C Widget Packaging and Configuration
• ADL SCORM ( http://www.adlnet.gov/Technologies/scorm/ )
Note also that ITTF makes documents available in ZIP‐compatible packages.
However, despite the widespread use of the ZIP format, it has never been standardized.
Such a pervasive format as ZIP would benefit greatly from being an International Standard. In practice, formats using ZIP for document packaging use a small and well-established subset of the overall current non-standard technology which can be quickly standardized. SC 34 has had strong indications from its experts and liaisons that a standardized, ZIP-compatible Document Packaging format would be of immense value, and wishes to ballot this NP to gather member body feedback.
Resolution 2 ("Initiation of Study Period for ‘Zip’ format") of the ISO/IEC JTC 1/SC 34 Plenary Meeting, Tokyo, Japan, 2010-09-10 included the following:
SC 34 accepts the WG 1 recommendation contained in SC 34 N 1494 to initiate a study period with aim of establishing a firmer rationale for standardization of aspects of the “ZIP” format.
SC 34 asks WG 1 that a report be submitted in time for consideration at the SC 34 meetings in Prague in 2011-03 and that time be allocated to this activity during the WG 1 meeting in Beijing in 2010-12.
WG 1 recommends that SC 34 send the following liaison statement to SC 29/WG 11:
The Zip Study Period has drawn to a close in SC 34. The latest status report is contained in document SC 34 N 1577, and shows that the principal output of the study period was a proposed New Work Item Proposal (see SC 34 N 1575).
SC 34 experts share SC 29/WG 11’s interest in possible future enhancements to packaging technologies, and have proposed the New Work as a multi-part standard, the first part of which is proposed to identify core packaging technology, and reference PKWare’s “appnote” using the RER mechanism as described in the Standing Document on Normative Referencing in the JTC 1 Directives. The intention is that future work – such as in areas suggested by SC 29/WG 11 – may create new Parts of this multi-part Standard.
On SC 29/WG 11’s specific technical points – First, SC 34 experts share a desire to specify fully how content within archives may be referenced, preferably using IRIs; secondly, on the question of moving the central directory, SC 34 experts expressed the view that the specification of the core technology must maintain 100% compatibility with existing archives, which precludes re-specifying where the central directory is placed in an archive. Some experts wondered whether it would be possible, however, to establish a convention whereby the central directory – or something equivalent to it – was stored as the first item in an archive.
Based on discussions within the SC34 ZIP Study Group, there is consensus that the best way to achieve our technical objectives is to have PKWARE continue its maintenance of the ZIP Application Note. We have no desire or interest in duplicating that effort. But we do see that there are benefits in a proposed multi-part standard to address the full set of desired capabilities to be built on top of the ZIP Application Note.
The present proposal is for Part 1, the core specification of the Document Container File. The creation of this standard will require the creation and approval of Referencing Explanatory Report for PKWARE’s ZIP Application Note, and that by itself should bring NBs greater assurance regarding stability of reference and intellectual property rights, making ZIP-based Document Container Files more easily used by other International Standards.
The “Resolutions of the ISO/IEC JTC 1/SC 34 Plenary Meeting, Prague, Czech Republic, 2011-04-01” are in SC 34 N 1611. Resolution 4 was as follows:
Resolution 4: New Work Item Proposal for “Document Container File — Part 1: Core”
SC 34 thanks WG 1 for preparing a report on Study Period for “zip” format contained in SC 34 N 1577. SC 34 instructs its Secretariat to circulate New Work Item Proposal contained in SC 34 N 1575 to the SC 34 members for a three-month ballot and to submit it to JTC 1 for concurrent review.
The ballot is in SC 34 N 1621.
The ZIP format has been around since 1989. According to PKWARE, Inc., which introduced it:
The .ZIP format remains one of the most widely used file formats for cross-platform interoperability. Leading industry standards including the ECMA Office Open XML and the OASIS Open Document Format for Office Applications incorporate ZIP technology. Other major industry standards incorporating ZIP include the JAVA JAR specification, Sharable Content Object Reference Model (ADL-SCORM) and EPUB.
As this suggests, if you want to know what your favourite word processor is doing to your words, you should have some awareness of ZIP. The ZIP format, in general, has provided "data compression, file management, and data encryption within a portable archive format." It currently provides at least some of that to the XML buried in your .docx or .odt file.
The PK in PKWARE is due to the late Phil Katz. As the Wikipedia article about him notes, his efforts where ZIP is concerned were to a large extent inspired by litigation. (Yay, lawyers!) Wikipedia notes:
Katz received positive publicity by releasing the APPNOTE.TXT specification documenting the ZIP file format, and declaring that the ZIP file format would always be free for competing software to implement.
What do we find about this declaration on the PKWARE website? In the article above, PKWARE says:
Some ZIP technology is covered by patents or pending patents. If your application requires use of ZIP technology that is covered under a patent, PKWARE does provide reasonable and non discriminatory licensing.
Additional details are in the current Application Note: "APPNOTE.TXT – .ZIP File Format Specification", version 6.3.2 (September 28, 2007). Section X ("Incorporating PKWARE Proprietary Technology into Your Product") says:
PKWARE is committed to the interoperability and advancement of the .ZIP format. PKWARE offers a free license for certain technological aspects described above under certain restrictions and conditions. However, the use or implementation in a product of certain technological aspects set forth in the current APPNOTE, including those with regard to strong encryption, patching, or extended tape operations requires a license from PKWARE. Please contact PKWARE with regard to acquiring a license.
That’s all I could find on the current website. Long ago, however, Robert A Freed, who was not affiliated with PKWARE but claimed to have permission, posted (on the Usenet newsgroup comp.sys.ibm.pc) part of the DISCLAIM.DOC file that PKWARE had distributed on January 11, 1989:
The file format of the files created by these programs, which file format is original with the first release of this software, is hereby dedicated to the public domain. Further, the filename extension of ".ZIP", first used in connection with data compression software on the first release of this software, is also hereby dedicated to the public domain, with the fervent and sincere hope that it will not be attempted to be appropriated by anyone else for their exclusive use, but rather that it will be used to refer to data compression and librarying software in general, of a class or type which creates files having a format generally compatible with this software.
FORMAT.DOC was posted later in the same thread.
So are you confused yet? Rick Jelliffe offers a useful discussion of the ambiguities in "Is ZIP in the public domain or not?" (June 22, 2010) (oreilly.com). He suggests:
So what safe profile of ZIP would that give us? That would be approximately ZIP 2 (which is what everyone implements) plus support for UTF-8 names (unpatentable). The only thing that might need to be looked into is whether the ZIP64 (archives longer than 4gig feature of ZIP4.5) was included. I believe this is what the proponents of the proposed ISO ZIP effort have been thinking about.
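If you want to see how close a given archive comes to that kind of profile, here is a minimal Python sketch using the standard zipfile module. The reading of the profile (only stored or DEFLATE entries, no encryption, UTF-8 names allowed) is my own assumption, not Jelliffe's definition or anything SC 34 has adopted, and the function names and the sample archive are made up for illustration.

```python
import io
import zipfile

# Assumed reading of "approximately ZIP 2 plus UTF-8 names":
# entries are either stored or DEFLATE-compressed, and none are encrypted.
ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}
ENCRYPTED_FLAG = 0x0001  # general purpose bit 0: entry is encrypted
UTF8_FLAG = 0x0800       # general purpose bit 11: name is stored as UTF-8


def profile_violations(archive):
    """Return (entry name, reason) pairs for entries outside the assumed profile."""
    problems = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.compress_type not in ALLOWED_METHODS:
                problems.append((info.filename, "compression method %d" % info.compress_type))
            if info.flag_bits & ENCRYPTED_FLAG:
                problems.append((info.filename, "encrypted"))
    return problems


def utf8_named_entries(archive):
    """Return the names of entries whose UTF-8 name flag is set."""
    with zipfile.ZipFile(archive) as zf:
        return [info.filename for info in zf.infolist() if info.flag_bits & UTF8_FLAG]


# A small in-memory archive standing in for a real document package.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", "<doc/>")
    zf.writestr("r\u00e9sum\u00e9.xml", "<doc/>")

print(profile_violations(sample))
print(utf8_named_entries(sample))
```

Nothing here checks for ZIP64, patching, or the other licensed features; it only looks at two header fields per entry, which is enough to flag the obvious departures.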
A couple of interesting recent law journal articles are Philip Johnson, “Dedicating Copyright to the Public Domain” Modern Law Review 71(4):587-610 (July 2008); and Timothy K. Armstrong, “Shrinking the Commons: Termination of Copyright Licenses and Transfers for the Benefit of the Public” Harvard Journal on Legislation 47(2):359-423 (Summer 2010).
Whatever the status of the format from time to time, one thing that’s always been clear is that PKWARE’s pkzip and pkunzip programs were not in the public domain. PKWARE has introduced new versions of the programs over time, as well as new versions of the format. The most convenient history of the versions of each is given in two pages from answers.com: "Pkzip 1.1 Version history" and "ZIP (file format) 4 Version history". Here are some highlights:
• 1989: PKZIP 0.8 and ZIP 1.0
• 1993: PKZIP 2.0 and ZIP 2.0: Introduction of the DEFLATE compression method
• 2001: PKZIP 4.5 and ZIP 4.5: Inclusion of ZIP64 archives support
• 2002: PKZIP 5.0 and ZIP 5.0: Inclusion of the DES, 3DES, RC2 and RC4 encryption formats
• 2007: SecureZIP 11.0 and ZIP 6.3.0: Inclusion of UTF-8 name support
Versions of "appnote.txt" back to 4.5 (11/01/2001) are available in PKWARE’s "Application Note Archives" (pkware.com). One earlier version from the PKWARE site, not numbered but dated September 1, 1998, is available courtesy of waybackmachine.org.
Versions dated September 15, 1996 and May 31, 1997 are available from Info-Zip. A Usenet comp.os.os2.apps posting by Chris Waters on October 20, 1992, suggests there was a lot of cooperation between PKWARE and Info-Zip with respect to the ZIP 2.0 standard. In fact, Info-Zip, in version 5.0 of its "zip" and "unzip" software, actually used version 2.0 of the ZIP format while PKUNZIP 2.0 was still in beta. To see the documentation, you can download unz50x32.exe from the University of Potsdam. The collaboration apparently continues: “Making ‘Zip’ an International Standard?”. A number of versions of the Info-Zip appnote are listed in the file: http://www.info-zip.org/doc/README.
There may be more versions of appnote.txt at ftp://ftp.uu.net/pub/archiving/zip/doc/, but the traditional User Name: anonymous, Password: email@address doesn’t seem to work anymore.
It’s easier to find the documentation which came with the old programs. For example, I downloaded version 2.04, pkz204g.exe, from Duke University. If memory serves, that was a particularly popular and stable version. I extracted MANUAL.DOC, and found the following at page 85:
There are many different methods of compression. In the history of PKZIP alone there have been seven different methods to date. The .ZIP file format was designed so that additional methods of compression can be added as they are developed. In this way the .ZIP file format will never need to be abandoned. If you attempt to extract a .ZIP file that was created with version 2.0 or higher with a lower version of PKUNZIP you will receive the message "Don’t Know How to Handle" for every file compressed with a more advanced algorithm.
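The extensibility described in the manual shows up as two fields stored with every entry: a compression method number and a "version needed to extract". As a rough sketch (not a description of how PKUNZIP itself decides what it can handle), Python's zipfile module exposes both. The method numbers in the table come from the Application Note; the function name and the sample file names are hypothetical.

```python
import io
import zipfile

# Method numbers as listed in the Application Note (only a few shown here).
METHOD_NAMES = {0: "stored", 8: "deflated", 12: "bzip2", 14: "lzma"}


def describe_entries(archive):
    """Return (name, method, version needed to extract) for each entry."""
    rows = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            method = METHOD_NAMES.get(info.compress_type,
                                      "unknown (%d)" % info.compress_type)
            # The version field is stored as ten times the version number.
            major, minor = divmod(info.extract_version, 10)
            rows.append((info.filename, method, "%d.%d" % (major, minor)))
    return rows


# An in-memory archive with one stored and one DEFLATE-compressed entry.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr("plain.txt", "hello " * 50, compress_type=zipfile.ZIP_STORED)
    zf.writestr("packed.txt", "hello " * 50, compress_type=zipfile.ZIP_DEFLATED)

for row in describe_entries(sample):
    print(row)
```

An extractor that meets a method number it does not recognise can still read the entry's name and size, which is presumably what lets an older program report the problem file by file instead of rejecting the whole archive.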
The following was on page 89:
Because PKWARE has dedicated the .ZIP file format to the public domain, it is possible for other people to write programs which can read .ZIP files.
NOTE THAT THE PKZIP, PKUNZIP, PKSFX PROGRAMS AND THEIR ASSOCIATED SOURCE CODE AND SUPPORT PROGRAMS ARE THE EXCLUSIVE PROPERTY OF PKWARE INC. AND ARE NOT PUBLIC DOMAIN SOFTWARE PROGRAMS. …
Extraction and Compression programs not developed by PKWARE may not be completely compatible with the .ZIP file standard.
This illustrates two key points. First, PKWARE still regarded version 2.0 of the file format as being in the public domain. Second, the format was designed in such a way as to be compatible with future methods of compression.
The DEFLATE Compression Method
I suppose it’s still possible now, as it was in my day, to get a graduate degree in Information Studies or Library Science without ever having heard the name "Claude Shannon", or the term "information theory". Be that as it may, Wikipedia tells us that "data compression", in information theory, is "the process of encoding information using fewer bits than the original representation would use." "Lossless data compression" is "a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data."
The "DEFLATE" method was the lossless compression algorithm developed by Phil Katz for PKZIP 2.0. One of the more definitive statements of the method is by Peter Deutsch, "DEFLATE Compressed Data Format Specification version 1.3" (May 1996): RFC 1951. The abstract says:
This specification defines a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding, with efficiency comparable to the best currently available general-purpose compression methods. … The format can be implemented readily in a manner not covered by patents.
Phil Katz is acknowledged as the designer of the format. The 1952 article of D.A. Huffman, and the 1977 article of J. Ziv and A. Lempel, are listed in the bibliography. Wikipedia also has nice pages on LZ77 and Huffman coding.
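To make "lossless" concrete, here is a small Python sketch that runs some repetitive text through DEFLATE and back again. It relies on the zlib module from the standard library, asking for a bare DEFLATE stream with no wrapper (a negative wbits value); the helper names and the sample text are mine.

```python
import zlib


def deflate(data):
    """Compress bytes into a bare DEFLATE stream (no zlib or gzip wrapper)."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate(stream):
    """Reconstruct the original bytes from a bare DEFLATE stream."""
    return zlib.decompress(stream, -15)


original = b"the quick brown fox jumps over the lazy dog. " * 40
packed = deflate(original)
restored = inflate(packed)

print(len(original), "bytes in,", len(packed), "bytes out")
print("round trip exact:", restored == original)
```

Repetition is what LZ77 exploits, so text like this shrinks a great deal; data that is already compressed or random may not shrink at all.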
Interestingly, one of the stated purposes of the DEFLATE format was to be "compatible with the file format produced by the current widely used gzip utility, in that conforming decompressors will be able to read data produced by the existing gzip compressor." Two companions of RFC 1951 were Peter Deutsch, "GZIP file format specification version 4.3" (May 1996): RFC 1952; and Peter Deutsch and Jean-Loup Gailly, "ZLIB Compressed Data Format Specification version 3.3" (May 1996): RFC 1950. Although both use DEFLATE, GZIP and ZLIB wrap the compressed data in their own headers and trailers, so their files are incompatible alternatives to ZIP. They have homes at gnu.org and zlib.net respectively.
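The difference between these formats is in the packaging, not the compression. As an illustration only, the following Python sketch uses the standard zlib module to produce the same payload three ways: as a bare DEFLATE stream, in a zlib wrapper, and in a gzip wrapper. The wbits values that select each wrapper are zlib's own convention; the helper names and the payload are hypothetical.

```python
import gzip
import zlib


def compress_with(wbits, data):
    """Compress data with zlib's DEFLATE engine; wbits selects the wrapper."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


def reads_as_zlib(stream):
    """Report whether a stream decodes as a zlib-wrapped stream."""
    try:
        zlib.decompress(stream)
        return True
    except zlib.error:
        return False


payload = b"ZIP, GZIP and ZLIB all carry DEFLATE data. " * 20

raw_stream = compress_with(-15, payload)   # bare DEFLATE, as a ZIP entry holds it
zlib_stream = compress_with(15, payload)   # zlib header + DEFLATE + Adler-32 check
gzip_stream = compress_with(31, payload)   # gzip header + DEFLATE + CRC-32 and size

# Each stream decodes with the reader that matches its wrapper.
assert zlib.decompress(raw_stream, -15) == payload
assert zlib.decompress(zlib_stream) == payload
assert gzip.decompress(gzip_stream) == payload

# Peeling the 2-byte header and 4-byte trailer off the zlib stream by hand
# leaves an ordinary DEFLATE stream.
assert zlib.decompress(zlib_stream[2:-4], -15) == payload

# The wrappers themselves are not interchangeable.
print("gzip stream readable as zlib:", reads_as_zlib(gzip_stream))
```

None of the three is an archive: they compress a single stream of bytes, whereas ZIP adds file names, a central directory, and per-entry metadata around its DEFLATE data.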
A widely used format that is compatible with ZIP is the Sun-Oracle Java ARchive (JAR) specification, which says: "A JAR file is essentially a zip file that contains an optional META-INF directory."
Using the Format
Because the ZIP 2.0 format is in the public domain, there are many implementations. Wikipedia, naturally, has a "comparison of file archivers" page.
The law school IT folks put ZipCentral on the Windows XP machine in my office, and it works just fine. My Ubuntu machine has File Roller. Of course the Ubuntu documentation of the command line utilities is always interesting too. Here are the manpages for zip, unzip, jar -jar, gzip, and gunzip. Have a particular look at the acknowledgements on the zip manpage.
You can create a minimal .docx file and use one of these tools to see the directory structure:
• _rels
• • .rels
• docProps
• • app.xml
• • core.xml
• word
• • _rels
• • • document.xml.rels
• • theme
• • • theme1.xml
• • document.xml
• • fontTable.xml
• • settings.xml
• • styles.xml
• • webSettings.xml
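You don't need a graphical tool for this. As a minimal sketch, Python's standard zipfile module will list the entries of any ZIP-based document. So that it runs without a real file, the example builds a small stand-in archive in memory; the entry names imitate a typical .docx layout, the contents are placeholders, and the result is not a valid Word document. To inspect a real file, pass its path (for instance a hypothetical document.docx) to list_entries instead.

```python
import io
import zipfile


def list_entries(archive):
    """Return the sorted entry names stored in a ZIP-based document."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(zf.namelist())


def build_stand_in():
    """Build an in-memory archive whose entry names imitate a .docx layout."""
    names = [
        "[Content_Types].xml",
        "_rels/.rels",
        "docProps/app.xml",
        "docProps/core.xml",
        "word/_rels/document.xml.rels",
        "word/document.xml",
        "word/styles.xml",
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.writestr(name, "<placeholder/>")  # placeholder content only
    return buffer


for name in list_entries(build_stand_in()):
    print(name)
```

Note that the archive stores flat path names with forward slashes; the "directories" you see in an archive viewer are reconstructed from those paths.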
The directory structure of a minimal .odt file looks like this:
• Configurations2
• • accelerator
• • • current.xml
• META-INF
• • manifest.xml
• Thumbnails
• • thumbnail.png
The META-INF directory holding manifest.xml recalls the JAR layout quoted above.
So ZIP is like prose. You can be using it all the time without even knowing.