Wednesday, January 2, 2008

Science Commons » Protocol for Implementing Open Access Data

Protocol for Implementing Open Access Data

Status of this Memo

This memo provides information for the Internet community interested in distributing data or databases under an “open access” structure. There are several definitions of “open” and “open access” on the Internet, including the Open Knowledge Definition and the Budapest Declaration on Open Access; the protocol laid out herein is intended to conform to the Open Knowledge Definition and extend the ideas of the Budapest Declaration to data and databases.

This memo does not specify an Internet standard of any kind, but does specify the requirements for gaining and using the Science Commons Open Access Data Mark and metadata, by using legal tools and norms that conform to the protocol specified. This memo is available under the Creative Commons Attribution 3.0 (unported jurisdiction) license and will be submitted to the World Wide Web Consortium for consideration.

The terms MUST, MUST NOT, and SHOULD are used herein as defined in RFC 2119 (“Key words for use in RFCs to Indicate Requirement Levels”).

1. Intellectual foundation for the protocol

The motivation behind this memorandum is interoperability of scientific data.

The volume of scientific data, and the interconnectedness of the systems under study, make integration of data a necessity. For example, life scientists must integrate data from across biology and chemistry to comprehend disease and discover cures, and climate change scientists must integrate data from wildly diverse disciplines to understand our current state and predict the impact of new policies.

The technical challenge of such integration is significant, although emerging technologies appear to be helping. But the forest of terms and conditions around data makes integration difficult to perform legally in many cases. One approach might be to develop and recommend a single license: any data with this license can be integrated with any other data under this license.

But this approach, which implicitly builds on intellectual property rights and the ideas of licensing as understood in software and culture, is difficult to scale for scientific uses. There are too many databases under too many terms already, and it is unlikely that any one license or suite of licenses will have the correct mix of terms to gain critical mass and allow massive-scale machine integration of data.

Therefore we instead lay out principles for open access data and a protocol for implementing those principles, and we distribute an Open Access Data Mark and metadata for use on databases and data available under a successful implementation of the protocol.

1.2. Scope

The Science Commons open access database protocol is specifically limited in scope to provide the legal functions necessary to create a legal tool. Tools created under conforming implementations will create the foundation to legally integrate a database or data product available under a tool conforming to the protocol with another database or data product available under a tool conforming to the protocol. There are no mechanisms to manage transfer or negotiations of rights unrelated to integration (for example, patent rights over uses of the data). Legal tools conforming to the database protocol can cover any kind of database or data product.

2. Open Access Data Mark and metadata

Any implementation of the Science Commons Database Protocol may be submitted to Science Commons for certification as a conforming implementation. The submitted implementation will be reviewed by Science Commons for conformance to the Protocol and a public opinion will be returned. Implementations found to conform to the Protocol will be authorized to use the Science Commons Open Access Data trademarks (icons and phrases) and metadata on databases available under conforming implementations of the protocol. These marks will be maintained by Creative Commons and released in conjunction with the CC0 project icons and metadata.

The review process is in development and will be announced in 2008.

3. Principles of open access data

Legal tools for an open access data sharing protocol must be developed with three key principles in mind:

3.1 The protocol must promote legal predictability and certainty.
3.2 The protocol must be easy to use and understand.
3.3 The protocol must impose the lowest possible transaction costs on users.

These principles are motivated by Science Commons’ experience in distributing a database licensing Frequently Asked Questions (FAQ) file. Scientists are uncomfortable applying the FAQ because they find it hard to apply the distinction between what is copyrightable and what is not copyrightable, among other elements. A lack of simplicity restricts usage and as such restricts the open access flow of data. Thus any usage system must be both legally accurate and very simple for scientists, reducing or eliminating the need to make the distinction between copyrightable and non-copyrightable elements.

The terms also need to satisfy the norms and expectations of the disciplines providing the database. This makes a single license approach difficult – archaeology data norms for citation will differ from those in physics, and yet again from those in biology, and yet again from those in the cultural or educational spaces. But those norms must be attached in a form that imposes the lowest possible costs on users (now and in the future).

4. Implementing the Science Commons Database Protocol for open access data

4.1 Converge on the public domain by waiving all rights based on intellectual property

The conflict between simplicity and legal certainty can be best resolved by a twofold measure: 1) a reconstruction of the public domain and 2) the use of scientific norms to express the wishes of the data provider.

Reconstructing the public domain can be achieved through the use of a legal tool (waiving the relevant rights on data and asserting that the provider makes no claims on the data).

Requesting behavior, such as citation, through norms rather than as a legal requirement based on copyright or contracts, allows for different scientific disciplines to develop different norms for citation. This allows for legal certainty without constraining one community to the norms of another.

Thus, to facilitate data integration and open access data sharing, any implementation of this protocol MUST waive all rights necessary for data extraction and re-use (including copyright, sui generis database rights, claims of unfair competition, implied contracts, and other legal rights), and MUST NOT apply any obligations on the user of the data or database such as “copyleft” or “share alike”, or even the legal requirement to provide attribution. Any implementation SHOULD define a non-legally binding set of citation norms in clear, lay-readable language.

4.2 Converge on the public domain by waiving other statutory or intellectual property rights.

In many jurisdictions there are other rights, in addition to copyright, that may apply. For example, sui generis rights apply in the European Union, and uncopyrightable databases may be protected in some countries under unfair competition laws.

Thus, to facilitate data integration and open access data sharing, any implementation MUST include waivers of sui generis and other legal grounds for database protection.

4.3 Converge on the public domain by imposing no contractual controls.

There is always the possibility of using contract, rather than intellectual property or statutory rights, to apply terms to databases. This fails to provide legal certainty, ease of use, or low transaction costs, as it forces scientists to either hire a lawyer or interpret contracts themselves.

Thus, to facilitate data integration and open access data sharing, any implementation MUST affirmatively declare that contractual constraints do not apply to the database.

4.4 Provide for interoperation with databases and data not available under the Science Commons Open Access Data Protocol through metadata

There will be significant amounts of data that is not or cannot be made available under this protocol. In such cases, it is desirable that the owner provides metadata (as data) under this protocol so that the existence of the non-open access data is discoverable.

Thus, to provide for interoperation with non-open access data, any implementation of this protocol MUST NOT enable assertions of copyright, sui generis rights, or any form of contractual control on digital identifiers and metadata describing non-open access data.
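
The requirements in sections 4.1 through 4.4 can be read as a checklist. The following is a minimal sketch of such a checklist in Python, not part of the protocol and not the Science Commons review process: the `LegalTool` class, its field names, and the `check_conformance` function are hypothetical names introduced here for illustration only, and a real conformance review is a legal judgment rather than a boolean test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegalTool:
    """Hypothetical summary of a legal tool; every field name is an assumption."""
    waives_copyright: bool
    waives_sui_generis_rights: bool
    waives_other_database_protections: bool  # e.g. unfair competition, implied contract
    declares_no_contractual_constraints: bool
    imposes_share_alike: bool
    requires_legal_attribution: bool
    restricts_metadata_or_identifiers: bool
    defines_citation_norms: bool  # a SHOULD, not a MUST


def check_conformance(tool):
    """Return (violations, recommendations) as two lists of strings.

    Violations correspond to MUST / MUST NOT statements in section 4;
    recommendations correspond to the single SHOULD in section 4.1.
    """
    violations = []
    if not tool.waives_copyright:
        violations.append("4.1: MUST waive copyright")
    if not tool.waives_sui_generis_rights:
        violations.append("4.2: MUST waive sui generis database rights")
    if not tool.waives_other_database_protections:
        violations.append("4.2: MUST waive other legal grounds for database protection")
    if not tool.declares_no_contractual_constraints:
        violations.append("4.3: MUST declare that contractual constraints do not apply")
    if tool.imposes_share_alike:
        violations.append("4.1: MUST NOT impose copyleft or share-alike obligations")
    if tool.requires_legal_attribution:
        violations.append("4.1: MUST NOT legally require attribution")
    if tool.restricts_metadata_or_identifiers:
        violations.append("4.4: MUST NOT enable control over identifiers and metadata")

    recommendations = []
    if not tool.defines_citation_norms:
        recommendations.append("4.1: SHOULD define non-binding citation norms")
    return violations, recommendations


# Two hypothetical tools: a full waiver with citation norms, and a
# share-alike license that legally requires attribution.
full_waiver = LegalTool(True, True, True, True, False, False, False, True)
share_alike_license = LegalTool(True, True, True, True, True, True, False, False)

for name, tool in [("full waiver", full_waiver), ("share-alike", share_alike_license)]:
    violations, recommendations = check_conformance(tool)
    print(name, "conforms" if not violations else violations, recommendations)
```

Keeping the SHOULD separate from the MUST and MUST NOT checks mirrors the RFC 2119 distinction the memo relies on: a tool that omits citation norms still conforms, while a tool that makes attribution a legal requirement does not.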

5. Issues in database “licensing”

“Licensing” a database typically means that the “copyrightable elements” of a database are made available under a copyright license like the CC licenses or the GNU Free Documentation License (FDL). The Science Commons Database FAQ, in its first iteration, recommended this method. That recommendation is now withdrawn for the following reasons.

The licensing approach is marked by the conflict between legal accuracy and simplicity. It is difficult for seasoned attorneys skilled in database practice to determine with accuracy where copyright begins and ends in many databases – much more so for non-lawyers.

As Abraham Lincoln famously noted in the United States, a house divided against itself cannot stand – it must become all of one or the other. A database divided into copyrightable and non-copyrightable elements suffers a similar fate: the user tends to assume that all is under copyright or none is under copyright. And that assumption dictates which part of the “license” the user decides to comply with.

There are at least three significant problems with this approach based on using intellectual property rights to enforce norms of attribution, share-alike, or other terms.

5.1 Category errors

Any solution based on rights will result in categorization errors: the application of obligations based on copyright in situations where it is not necessary (for example, a share-alike license on the copyrightable elements may be falsely assumed to operate on the factual contents of a database). In the reverse, a user might assume that the “Facts Are Free” status of the non-copyrightable elements extends to the entire database and inadvertently infringe.

We do not know what courts will decide in the future. But it is conceivable that in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database. If intellectual property rights are involved, that query might well trigger requirements carrying a stiff penalty for failure, including such problems as a copyright infringement lawsuit.

These interpretative problems are exacerbated by differences among countries over the standards for copyright protection for databases, by the existence of sui generis database rights, and by the difficulty of interpreting contractual language.

For these reasons, solutions based on selective waiving of intellectual property rights fail to provide a high degree of legal certainty and ease of use.

5.2 False expectations

There is also the problem of false expectations. Many users choose to apply common-use licenses such as the GPL and CC in order to declare their intent: thus, a user might choose to apply a “copyleft” term to the copyrightable elements of a database, in hopes that those elements result in additional open access database elements coming online. But a user would be able to extract the entire contents (to the extent those contents are uncopyrightable factual content) and republish those contents without observing the copyleft or share-alike terms. The data provider, based on our research, is likely to feel “tricked” by this outcome. That is not a desired result.

For this reason, the use of such licenses fails to provide a high degree of ease of use and legal certainty.

5.3 Attribution stacking

Last, there is a problem of cascading attribution if attribution is required as part of a license approach. In a world of database integration and federation, attribution can easily cascade into a burden for scientists if a category error is made. Would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets? How does this relate to the evolved norms of citation within a discipline, and does the attribution requirement indeed conflict with accepted norms in some disciplines? Failing to give attribution to all 40,000 sources could, at worst, be the basis for a copyright infringement suit; at best, the requirement imposes a significant transaction cost on the scientist using the data.

Therefore, a legal obligation to give attribution violates the principle of low transaction costs.
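
To make the scale of the problem concrete, the following is a small illustrative sketch, not drawn from the protocol itself. It assumes a hypothetical model in which each data set has one depositor and a derived data set inherits the attribution obligations of everything it was built from; the function `stacked_attributions` and the data set and depositor names are invented for this example.

```python
def stacked_attributions(dataset, sources, depositor):
    """Collect every upstream depositor that `dataset` would have to credit.

    `sources` maps a data set to the data sets it was derived from, and
    `depositor` maps each data set to its depositor. Obligations are
    followed transitively, which is what makes them "stack".
    """
    credited = set()
    seen = set()
    pending = list(sources.get(dataset, ()))
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        credited.add(depositor[current])
        pending.extend(sources.get(current, ()))
    return credited


# A hypothetical federated query across 40,000 single-depositor data sets.
sources = {"query_result": [f"dataset_{i}" for i in range(40_000)]}
depositor = {f"dataset_{i}": f"depositor_{i}" for i in range(40_000)}

credits = stacked_attributions("query_result", sources, depositor)
print(len(credits))  # prints 40000
```

Under this model, any database later derived from `query_result` would inherit the same 40,000 obligations plus its own, which is the cascade described above.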

6. Protocol maintenance and future versions

This protocol is maintained by the Science Commons project at Creative Commons. Please refer all comments on the protocol to wilbanks (AT) creativecommons (DOT) org.
