Survivability of Blockchain Systems

Part 1: The Goals of the Internet Architecture

Currently there is considerable interest (real and hype) in blockchain systems as a promising technology for the future infrastructure of a global value-exchange network – or what some refer to as the “Internet of value”. The original blockchain idea of Haber and Stornetta is now a fundamental construct within most blockchain systems, starting with the Bitcoin system which first adopted the idea and deployed it in a digital currency context.

Many parallels have been drawn between blockchain systems and the Internet. However, many of these comparisons fail to understand the fundamental goals of the Internet architecture as promoted and led by DARPA, and thus fail to fully appreciate how these goals have shaped the Internet to achieve its success as we see it today. There was a pressing need in the Cold War period of the 1960s and 1970s to develop a new communications network architecture that did not previously exist, one that would allow communications to survive in the face of attacks.

If blockchain technology seeks to be a fundamental component of the future global distributed network of commerce and value, then its architecture must also satisfy the same fundamental goals of the Internet architecture.

In considering the future direction for blockchain systems generally, it is useful to recall and understand the goals of the Internet architecture as defined in the early 1970s as a project funded by DARPA. The definition of the Internet as viewed in the late 1980s is the following: it is “a packet switched communications facility in which a number of distinguishable networks are connected together using packet switched communications processors called gateways which implement a store and forward packet-forwarding algorithm”.

It is important to remember that the design of the ARPANET and the Internet favored military values (e.g. survivability, flexibility, and high performance) over commercial goals (e.g. low cost, simplicity, or consumer appeal), which in turn has affected how the Internet has evolved and has been used. This emphasis was understandable given the Cold War backdrop to the packet-switching discourse throughout the 1960s and 1970s.

The DARPA view at the time was that there are seven (7) goals of the Internet architecture, with the first three being fundamental to the design, and the remaining four being second level goals. The following are the fundamental goals of the Internet in the order of importance:

  1. Survivability: Internet communications must continue despite loss of networks or gateways. This is the most important goal of the Internet, especially if it was to be the blueprint for military packet switched communications facilities. This meant that if two entities are communicating over the Internet, and some failure causes the Internet to be temporarily disrupted and reconfigured to reconstitute the service, then the entities communicating should be able to continue without having to reestablish or reset the high level state of their conversation. Therefore, to achieve this goal, the state information which describes the on-going conversation must be protected. But more importantly, in practice this explicitly meant that it is acceptable to lose the state information associated with an entity if, at the same time, the entity itself is lost.
  2. Variety of service types: The Internet must support multiple types of communications service. What was meant by “multiple types” is that at the transport level the Internet architecture should support different types of services distinguished by differing requirements for speed, latency and reliability. Indeed it was this goal that resulted in the separation into two layers of the TCP layer and IP layer, and the use of bytes (not packets) at the TCP layer for flow control and acknowledgement.
  3. Variety of networks: The Internet must accommodate a variety of networks. The Internet architecture must be able to incorporate and utilize a wide variety of network technologies, including military and commercial facilities.

The remaining four goals of the Internet architecture are: (4) distributed management of resources, (5) cost effectiveness, (6) ease of attaching hosts, and (7) accountability in resource usage. Over the following decades these second level goals have been addressed in different ways.

[The latest version of the full paper can be downloaded here]

Open Algorithms (OPAL): Key Concepts

The following are the key concepts and principles underlying the open algorithms paradigm:

  • Moving the algorithm to the data: Instead of pulling raw data into a centralized location for processing, it is the algorithms that should be sent to the data repositories and be processed there.
  • Raw data must never leave its repository: Raw data must never be exported from its repository, and must always be under the control of its owner or the owner of the data repository.
  • Vetted algorithms: Algorithms must be vetted to be “safe” from bias, discrimination, privacy violations and other unintended consequences. The data owner (data provider) must ensure that the algorithms which it authors/publishes have been thoroughly analyzed for safety and privacy-preservation (i.e. fairness, accountability and transparency in Machine Learning).
  • Provide only safe answers: When executing an algorithm on a data-set, the data-repository must always provide responses that are deemed “safe” from a privacy perspective. Responses must not release or leak personally identifying information (PII) without the consent of the user (subject). This may imply that a data repository returns only aggregate answers.
  • Trust Networks (Data Federation): In a group-based information sharing configuration – referred to as Data Sharing Federation – algorithms must be vetted collectively by the trust network members. The operational aspects of the federation should be governed by a legal trust framework for data federation.
  • Consent for algorithm execution: Data repositories that hold subject data should obtain consent from the subject when the subject’s data is to be included in a given algorithm execution.
  • Decentralized Data: By leaving raw data in its repository, the OPAL paradigm points towards a decentralized architecture for data stores.
  • Personal Data Stores: Decentralized data architectures should also incorporate the notion of personal data stores (PDS) as a legitimate data repository end-point.
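
The principles above can be illustrated with a minimal sketch. It assumes a hypothetical registry of vetted algorithms (`VETTED_ALGORITHMS`) and a hypothetical minimum group size (`MIN_GROUP_SIZE`); neither name comes from OPAL itself. The repository keeps its rows private, refuses any algorithm that has not been vetted, and returns only an aggregate answer.

```python
from statistics import mean

# Hypothetical registry of vetted algorithms; the names are illustrative only.
VETTED_ALGORITHMS = {
    "mean_age": lambda rows: mean(r["age"] for r in rows),
    "count": lambda rows: len(rows),
}

MIN_GROUP_SIZE = 3  # assumed threshold below which no answer is released


class DataRepository:
    """Holds raw rows privately and answers only vetted, aggregate queries."""

    def __init__(self, rows):
        self._rows = list(rows)  # raw data stays inside the repository

    def execute(self, algorithm_name):
        # Only algorithms that were vetted beforehand may run on the data.
        if algorithm_name not in VETTED_ALGORITHMS:
            raise PermissionError(f"algorithm not vetted: {algorithm_name}")
        # Refuse to answer when the group is too small to hide individuals.
        if len(self._rows) < MIN_GROUP_SIZE:
            raise ValueError("too few records to return a safe aggregate")
        return VETTED_ALGORITHMS[algorithm_name](self._rows)


repo = DataRepository([{"age": 30}, {"age": 40}, {"age": 50}])
result = repo.execute("mean_age")  # the algorithm travels to the data
```

A real deployment would also need consent checks and a far more careful notion of a "safe" answer than a simple group-size threshold.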

Public keys on blockchains: confusing existence with trust

Today Identity and Access Management (IAM) represents a core component of the Internet infrastructure, without which users would not be able to obtain online services in a timely and scalable manner. Enterprise IAM infrastructures are well integrated into other enterprise infrastructure services, such as directory services, which provide control over employees and assets. In the case of Consumer IAM most end-users are oblivious to the underlying identity federation infrastructures that allow them to perform Web Single Sign-On (SSO) to various online services and which enable them to grant their mobile apps access to various personal resources (e.g. contacts list, calendar, etc).

The recent emergence of the Bitcoin system has created various discourses on the role of “blockchain identity”. Here the three notable fundamental features of Bitcoin are its combined use of:

  • peer-to-peer network of physically distributed mining nodes,
  • consensus-based transaction status agreement algorithm and
  • restrictive scripting language (opcodes) for transaction expression.

These three aspects of the Bitcoin system provide mining nodes with true independence in processing transactions, subject only to the 51% majority requirement of the consensus scheme. It is precisely this node-independence that translates to “user independence” in the sense of the user not being beholden to any one mining node (or a small minority of nodes) in the Bitcoin system.

However, it is this “user independence” (in the context of Bitcoin) that has led many to incorrectly extrapolate (speculate) that the same degree of user independence can be achieved in all DLTs (distributed ledger technologies), something that is not necessarily true of DLTs in general. The Bitcoin system is an instance of a DLT, but not all proposed DLTs possess the three fundamental features of Bitcoin.

Furthermore, many commentators have equated “user independence” (in Bitcoin) with “individual empowerment” in DLTs in general, a leap in speculation that goes too far and which has led to confusion among non-technical audiences.

This misunderstanding regarding individual empowerment is exacerbated when the use of self-issued public-key pairs (in the Bitcoin system) is extrapolated to mean that these self-issued keys can be used as a “digital identity” for individuals in general. More specifically, the use of self-issued public-key pairs has led many to deduce (incorrectly) that a public-key used in the Bitcoin system can be “trusted” as a “digital identity” simply because it has been recorded on a transaction-block which has been replicated by all nodes on the peer-to-peer network.

That is, the existence of a key in a transaction block is being confused with trust in the provenance and ownership of that key.
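
The distinction can be made concrete with a toy sketch. Everything here is hypothetical: random bytes stand in for a self-issued public key, a Python list stands in for the replicated ledger, and `attestations` stands in for whatever business or legal process would bind a key to an owner. The point is only that the existence check and the trust check are different questions.

```python
import hashlib
import secrets

ledger = []        # toy stand-in for a replicated, append-only ledger
attestations = {}  # hypothetical: key id -> party vouching for its ownership


def key_id(public_key: bytes) -> str:
    return hashlib.sha256(public_key).hexdigest()


def self_issue_public_key() -> bytes:
    # Random bytes playing the role of a self-issued public key.
    return secrets.token_bytes(32)


def record_on_ledger(public_key: bytes) -> None:
    # The ledger records the key without asking who controls it.
    ledger.append(key_id(public_key))


def key_exists(public_key: bytes) -> bool:
    return key_id(public_key) in ledger


def key_is_vouched_for(public_key: bytes) -> bool:
    # Trust in provenance needs something outside the ledger entry itself.
    return key_id(public_key) in attestations


key = self_issue_public_key()
record_on_ledger(key)
exists = key_exists(key)           # True: the key is on the ledger
trusted = key_is_vouched_for(key)  # False: nobody has vouched for its owner
```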

Some have even coined the term “trustless” when referring to the peer-to-peer network of mining nodes, forgetting that high-value transaction networks are built on both technical-trust and legal-trust — both leading to business and social trust.

It is worth recalling that this problem of digital identity versus public keys first emerged in the mid-1990s in the context of self-signed X.509 certificates, Simple PKI (RFC2693) and the Pretty Good Privacy (PGP) system (RFC1991). Although an implementation of the PGP system may provide technical-trust, the PGP proposal was never broadly adopted by industry due to the lack of a corresponding model for business and legal trust.

Blockchains: Evidence of Mediated Computation

In writing the report of the Kantara BSC group (Blockchain and Smart Contracts) – a group that has been meeting bi-weekly for the past 7 months – we have come across numerous use-cases proposed by members who are looking closely at blockchain technology (or more generally at distributed ledger technology).

To enable classification of these use-cases, some criteria were agreed upon that would highlight the features of blockchain systems. Since the attraction of blockchain technology (and more generally of distributed ledgers) lies in its empowering parties to transact without the need for a single (or few) intermediary, the following criteria have helped the team classify the received use-cases:

  • Individuals controlling their own data: Does the use-case seek to empower individuals to begin with, and does blockchain technology help to achieve that goal?
  • Individuals rising to the level of a “peer” in transactions with others: Does the use-case require individuals to function at a peer-level (or can the same outcome be achieved using other paradigms), and does blockchain technology help to achieve that goal?
  • Evidence of mediated computation: Does the use-case require immutable evidence that a neutral third party (e.g. some computer, somewhere) mediated the transaction, without which the transaction outcome would be worthless to the transacting parties?

The last criterion points to a feature of blockchain technology that is often overlooked. In many discourses regarding applications of blockchain technology, authors take for granted (or forget) that the blockchain system consists of a network of peer-to-peer nodes which perform some computation (e.g. proof of work mining) towards the completion of a transaction. As such, one or more of these nodes are in fact performing mediated computation (to some degree) and at the same time provide evidence of this mediated act.

If evidence of mediated computation is crucial to the acceptance of a transaction, it implies that stronger forms of technical-trust must be produced by the entity (i.e. node; server; device) that is performing the computation. New forms of remote attestation may need to be devised, something along the lines of the SGX architecture, which provides evidence that a given computation was performed within a secure enclave.

This raises another prospect: different nodes on a blockchain system may offer different levels of trustworthy computation, each with an associated cost (i.e. tiers of trusted computation services on the P2P network).

Core Identity Issuers (Part II)

Continuing from the previous post (Part I of the Core Identity series), the goal of a Core Identity Issuer (CoreID Issuer) is to collate sufficient data – aggregate data and non-PII data – from members of a given Circle of Trust in order to create a Core Identity and Core Identifier for a given user (see Figure).

The Issuer performs this task as a trusted member of the Circle of Trust, governed by rules of operations (i.e. legal contract) and with the consent of the user. Architectures and techniques such as MIT OPAL/Enigma can be used here in order for the CoreID Issuer to obtain privacy-preserving aggregate data from the various sources who are members of the Circle of Trust.

 

[Figure: coreid-issuer-v03.png]

The goals of the Core-ID Issuer within a Circle of Trust are as follows:

  • Onboard a member-user: The Issuer’s primary function is to on-board users who are known to the CoT community, and who have requested and consented to the creation of a Core Identity.
  • Collate PII-free data into a Core Identity: The Issuer obtains aggregate data and other PII-free data regarding the user from members of the CoT. This becomes the core identity for the user, which is retained by the Issuer for the duration selected by the user. The Issuer must keep the core identity as secret, accessible only to the user.
  • Generate Core Identifier (unlinkable): For a given user and their core identity, the Issuer generates a core identifier (e.g. random number) that must be unlinkable to the core identity. Note that a core identifier must not be used in a transaction. The core identifier value may be contained in a signed certificate or other signed data structure, with the Persona Provider as its intended audience (see Figure).
  • Issue Core Assertions regarding the Core Identifier: The purpose of the Issuer generating a core identifier is to allow PII-free core-assertions regarding the user to be created. These signed core assertions must retain the privacy of the user, and must declare assertions about the core identifier.
  • Interface with Persona Providers: The Issuer’s main audience is the Persona Provider, who must operate with the Issuer under a legal trust framework that calls out user privacy as a strict requirement. The Issuer must make available the necessary issuance end-points (i.e. APIs) as well as validation end-points to the Persona Provider. In some cases, from an operational deployment view the Issuer and Persona Provider may be co-located or even tightly coupled under the same provider entity, although the functional difference and boundaries are clear.
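
As a rough sketch of the identifier and assertion steps above: the core identifier is drawn at random, independently of any personal data, so that it cannot be linked back to the core identity, and the Issuer then signs a PII-free claim about that identifier. All function names here are hypothetical, and an HMAC with an Issuer-held key stands in for a real digital signature (a real Issuer would more likely use public-key signatures so that the Persona Provider can validate without sharing a secret).

```python
import hashlib
import hmac
import json
import secrets

# Hypothetical Issuer key; HMAC is only a stand-in for a real signature scheme.
ISSUER_SIGNING_KEY = secrets.token_bytes(32)


def generate_core_identifier() -> str:
    # Random value, not computed from personal data, hence unlinkable to it.
    return secrets.token_hex(16)


def issue_core_assertion(core_identifier: str, claim: dict) -> dict:
    # The payload mentions only the core identifier and a PII-free claim.
    payload = json.dumps(
        {"core_identifier": core_identifier, "claim": claim}, sort_keys=True
    )
    tag = hmac.new(ISSUER_SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def validate_core_assertion(assertion: dict) -> bool:
    # Validation end-point: recompute the tag and compare in constant time.
    expected = hmac.new(
        ISSUER_SIGNING_KEY, assertion["payload"].encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, assertion["tag"])


core_id = generate_core_identifier()
assertion = issue_core_assertion(core_id, {"member_of_cot": True})
is_valid = validate_core_assertion(assertion)
```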

Core Identities and Transaction Identifiers for Blockchains

Etymology: Middle French identité, from Late Latin identitat-, identitas, probably from Latin identidem repeatedly, contraction of idem et idem, literally, same and same (Merriam-Webster Dictionary).

Identity is about trusted data — trusted personal data. Human beings live within social constructs and communities. People who know me can vouch for me. Organizations that know me can issue assertions or attestations about me.

At the heart of all this is the notion of the core identity, something that is inherent as part of me and inalienable from me.

There are a number of key concepts and principles underlying the notion of core identities, core identifiers and personas (see Figure).

[Figure: coreid-persona-concept-v04.png]

 

A fundamental concept is that of derived identities and derived identifiers, which provide not only privacy to their user (the person or organization that they represent) but also a degree of defense in the case of attacks against identity providers (e.g. identity theft) and a safety net in the rare case of weaknesses within the underlying cryptographic implementations.

I would argue that transaction identities and transaction identifiers are the forms of identities that should be used on the Internet and that they must be derived (e.g. cryptographically) from a core identity which itself must be kept as private or secret. Should a transaction identity be compromised or stolen, it can be placed on a public “blacklist” and a new one be derived for the user. The derivation process or algorithms must maintain the privacy of the user and the secrecy of the user’s core identity.

Some definitions:

We define identity as the collective aspect of the set of characteristics or features by which a thing (e.g. human; device; organization) is recognizable and distinguishable from another. In the context of a human person, the individuality of a person plays an important role in that it allows a community of people to recognize the distinct characteristics of an individual person and consider the person as a persisting entity.

  • Core identity: The collective aspect of the set of characteristics (as represented by personal data) by which a person is uniquely recognizable, and from which a unique core identifier may be generated based on the set of relevant personal data.

Thus, for example, the set of transaction data associated with a person can be collated and be used to create a core identity that distinguishes that person from others. Out of this set of transaction data, a core identifier may be generated and be held as a joint secret by its issuer and the person. The core identity pertaining to a person must be kept secret, and not be used in transactions.

  • Core Identifier: A secret data item (e.g. string) or secret mechanism (e.g. crypto function) that uniquely identifies a person or entity. The core identifier must be immutable, must be kept secret, and must never be used directly in transactions.
  • Persona: A persona is defined by and created based on a collection of attributes used in a given context or a given relationship. Thus, a person may have a work-persona, home-persona, social-persona, and others.  Each of these personas is context-dependent and involves only the relevant subset of the core identity characteristics of that person. A person may have one or more personas.
  • Transaction Identity and Identifier: When an individual seeks to perform a transaction (e.g. on the Internet, on a blockchain or other transacting mediums) he or she chooses a relevant persona and derives from that persona a transaction identity (and corresponding digital identifier) to be used in the transaction. A transaction identity may be short-lived and may even be created only for that single transaction instance. A useful analogy is that of credit card numbers, which may be used at Point of Sale (POS) locations without the user needing to provide any additional identifiers (i.e. reveal data from their core identity such as Social Security Number) and which may be replaced at any time without impacting the user’s core identity.
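
One possible way to derive transaction identifiers cryptographically is sketched below. This is an assumption on my part rather than a prescribed scheme: it uses HMAC-SHA256 as a one-way derivation function, with hypothetical labels (`persona:`, `tx:`) and a simple counter. The core secret never appears in a transaction, and a compromised identifier can be blacklisted and replaced by deriving the next one.

```python
import hashlib
import hmac
import secrets


def derive_persona_key(core_secret: bytes, persona: str) -> bytes:
    # One-way derivation: the persona key does not reveal the core secret.
    return hmac.new(core_secret, b"persona:" + persona.encode(), hashlib.sha256).digest()


def derive_transaction_identifier(persona_key: bytes, counter: int) -> str:
    # Each counter value yields a fresh identifier for the same persona.
    return hmac.new(persona_key, b"tx:" + str(counter).encode(), hashlib.sha256).hexdigest()


blacklist = set()  # public list of compromised transaction identifiers

core_secret = secrets.token_bytes(32)  # kept secret, never used in a transaction
work_key = derive_persona_key(core_secret, "work")

tx_id_0 = derive_transaction_identifier(work_key, 0)
blacklist.add(tx_id_0)  # suppose this identifier was stolen
tx_id_1 = derive_transaction_identifier(work_key, 1)  # replacement identifier
```

Because the derivation is deterministic, the user can re-derive any identifier from the core secret without storing it, while an observer who sees `tx_id_0` and `tx_id_1` has no practical way to link them or to recover the core secret.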

Query Smart Contracts: Bringing the Algorithm to the Data

One paradigm shift being championed by the MIT OPAL/Enigma community is that of using “pre-fabricated” queries (e.g. SQL queries) that have been analyzed by experts and have been vetted to be “safe” from the perspective of privacy-preservation. The term “Open Algorithm” (OPAL) here implies that the vetted queries (“algorithms”) are made open by publishing them, allowing other experts to review them and allowing other researchers to make use of them in their own context of study.

The next step in the Open Algorithms paradigm is the use of smart contracts to capture these safe algorithms in the form of executable queries residing in a legally binding digital contract.

What I’m proposing is the following: instead of a centralized data processing architecture, the P2P nodes (e.g. in a blockchain) offer the opportunity for data (user data and organizational data) to be stored by these nodes and be processed in a privacy-preserving manner, accessible via well-known APIs and authorization tokens, with smart contracts used to let the “query meet the data”.

In this new paradigm of privacy-preserving data sharing, we “move the algorithm to the data” where queries and subqueries are computed by the data repositories (nodes on the P2P network). This means that repositories never release raw data and that they perform the algorithm/query computation locally, producing aggregate answers only. This approach of moving the algorithm to the data provides data-owners and other joint rights-holders the opportunity to exercise control over data release, and thus offers a way forward to provide the highest degree of privacy-preservation while allowing data to still be effectively shared.

This paradigm requires that queries be decomposed into one or more subqueries, where each subquery is sent to the appropriate data repository (a node on the P2P network) and executed at that repository. This allows each data repository to evaluate received subqueries in terms of “safety” from a privacy and data leakage perspective.
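
As a small illustration of decomposition, consider a query for the mean of a column across several repositories. A minimal sketch, with hypothetical function and repository names, splits it into one (sum, count) subquery per repository, runs each subquery where the data lives, and combines only the partial aggregates.

```python
def decompose_mean_query(column, repository_ids):
    """Split 'mean of column' into one (sum, count) subquery per repository."""
    return [
        {"repository": rid, "column": column, "ops": ("sum", "count")}
        for rid in repository_ids
    ]


def run_subquery(subquery, local_rows):
    # Executed at the repository: only aggregates leave, never the rows.
    values = [row[subquery["column"]] for row in local_rows]
    return {"sum": sum(values), "count": len(values)}


def combine(partials):
    # The querier sees partial sums and counts, not individual records.
    total = sum(p["sum"] for p in partials)
    n = sum(p["count"] for p in partials)
    return total / n


# Toy data standing in for two separate repositories.
REPOSITORIES = {
    "repo_a": [{"income": 10}, {"income": 20}],
    "repo_b": [{"income": 30}, {"income": 40}, {"income": 50}],
}

subqueries = decompose_mean_query("income", REPOSITORIES)
partials = [run_subquery(sq, REPOSITORIES[sq["repository"]]) for sq in subqueries]
overall_mean = combine(partials)
```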

Furthermore, safe queries and subqueries can be expressed in the form of a Query Smart Contract (QSC) that legally binds the querier (person or organization), the data repository and other related entities.

A query smart contract that has been vetted to be safe can be stored on nodes of the P2P network (e.g. blockchain). This allows Queriers to not only search for useful data (as advertised by the metadata in the repositories) but also search for prefabricated safe QSCs that are available throughout the P2P network and that match the intended application. Such a query smart contract will require that identity and authorization requirements be encoded within the contract.

A node on the P2P network may act as a Delegate Node in the completion of a subquery smart contract.  A delegate node works on a subquery by locating the relevant data repositories, sending the appropriate subquery to each data repository, and receiving individual answers and collating the results received from these data repositories for reporting to the (paying) Querier.

A Delegate Node that seeks to fulfill a query smart contract should only do so when all the conditions of the contract have been fulfilled (e.g. QSC has valid signature; identity of Querier is established; authorization to access APIs at data repositories has been obtained; payment terms have been agreed, etc.). A hierarchy of delegate nodes may be involved in the completion of a given query originating from the Querier entity. The remuneration scheme for all Delegate Nodes and the data repositories involved in a query is outside the scope of the current use-case.
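
The gating behaviour of a Delegate Node might look something like the following sketch. The `QuerySmartContract` fields simply mirror the example conditions listed above, and `send_subquery` is a hypothetical callback standing in for the real call to a data repository; none of these names come from an actual QSC specification.

```python
from dataclasses import asdict, dataclass


@dataclass
class QuerySmartContract:
    # Each flag mirrors one of the example conditions named in the text.
    has_valid_signature: bool
    querier_identity_established: bool
    api_authorization_obtained: bool
    payment_terms_agreed: bool

    def unmet_conditions(self):
        return [name for name, ok in asdict(self).items() if not ok]


def fulfil_contract(contract, repositories, send_subquery):
    # The Delegate Node refuses to proceed unless every condition holds.
    unmet = contract.unmet_conditions()
    if unmet:
        raise PermissionError("unmet contract conditions: " + ", ".join(unmet))
    # Send the subquery to each repository and collate the answers.
    answers = [send_subquery(repo) for repo in repositories]
    return {"answers": answers, "repositories_contacted": len(answers)}


ready = QuerySmartContract(True, True, True, True)
report = fulfil_contract(
    ready,
    ["repo_a", "repo_b"],
    lambda repo: {"repository": repo, "count": 10},  # stand-in repository call
)
```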

What and why: MIT Enigma

I often get asked to provide a brief explanation about MIT Enigma — notably what it is, and why it is important particularly in the current age of P2P networking and blockchain technology.  So here’s a brief summary.

The MIT Enigma system is part of a broader initiative at MIT Connections Science called the Open Algorithms for Equity, Accountability, Security, and Transparency (OPAL-EAST).

The MIT Enigma system employs two core cryptographic constructs simultaneously atop a Peer-to-Peer (P2P) network of nodes. These are secret-sharing (à la Shamir’s Linear Secret Sharing Scheme (LSSS)) and multiparty computation (MPC). Although secret sharing and MPC have been topics of research for the past two decades, the innovation that MIT Enigma brings is the notion of employing these constructions on a P2P network of nodes (such as the blockchain) while providing “Proof-of-MPC” (like proof of work) that a node has correctly performed some computation.

In secret-sharing schemes, a given data item is “split” into a number of ciphertext pieces (called “shares”) that are then stored separately. When the data item needs to be reconstituted or reconstructed, a minimum or “threshold” number of shares need to be obtained and merged together again in a reverse cryptographic computation. For example, in Naval parlance this is akin to needing 2 out of 3 keys in order to perform some crucial task (e.g. activate the missile). Some secret sharing schemes possess the feature that some primitive arithmetic operations can be performed on shares (shares “added” to shares) yielding a result without the need to fully reconstitute the data items first. In effect, this feature allows operations to be performed on encrypted data (similar to homomorphic encryption schemes).
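
To make the threshold idea concrete, here is a minimal sketch of Shamir's scheme over a prime field. It is a textbook illustration, not the Enigma implementation; the choice of prime and the function names are my own assumptions. It shows the 2-out-of-3 case and the additive property mentioned above: adding two sets of shares point-wise gives shares of the sum.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an illustrative choice


def split(secret, threshold, num_shares):
    """Hide `secret` as the constant term of a random polynomial of degree threshold-1."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule, highest coefficient first
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


def add_shares(a, b):
    """Add two share sets point-wise; the result shares the sum of the secrets."""
    return [(xa, (ya + yb) % PRIME) for (xa, ya), (_, yb) in zip(a, b)]


shares = split(1234, threshold=2, num_shares=3)
recovered = reconstruct(shares[:2])  # any 2 of the 3 shares suffice

summed = add_shares(split(100, 2, 3), split(23, 2, 3))
recovered_sum = reconstruct(summed[:2])  # computed without rebuilding 100 or 23
```

The modular inverse via `pow(den, -1, PRIME)` needs Python 3.8 or later.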

The MIT Enigma system proposes to use a P2P network of nodes to randomly store the relevant shares belonging to data items. In effect, the data owner no longer needs to keep a centralized database of data-items (e.g. health data) and instead would transform each data item into shares and disperse these on the P2P network of nodes. Only the data owner would know the locations of the shares, and can fetch these from the nodes as needed. Since each of these shares appears as garbled ciphertext to the nodes, the nodes are oblivious to their meaning or significance. A node in the P2P network would be remunerated for storage costs and the store/fetch operations.

The second cryptographic construct employed in MIT Enigma is multiparty computation (MPC). The study of MPC schemes seeks to address the problem of a group of entities needing to share some common output (e.g. result of computation) whilst keeping their individual data items secret. For example, a group of patients may wish to collaboratively compute their average blood pressure, but without any patient sharing raw data about their own blood pressure.
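
The blood pressure example can be sketched with additive secret sharing, one of the simplest MPC building blocks. This is a single-process simulation under the assumption that all parties follow the protocol honestly; the modulus and function names are illustrative. Each patient splits their reading into random-looking shares, one per party, and each party only ever sums the shares it receives.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus, much larger than any sum of readings


def additive_shares(value, n):
    """Split `value` into n shares that sum to it modulo MODULUS."""
    parts = [secrets.randbelow(MODULUS) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % MODULUS)
    return parts


def secure_average(private_values):
    n = len(private_values)
    # Row i holds the shares patient i hands out; column j is what party j receives.
    distributed = [additive_shares(v, n) for v in private_values]
    # Each party adds up the shares it received; a single share reveals nothing.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Only the partial sums are published and combined into the total.
    total = sum(partial_sums) % MODULUS
    return total / n


readings = [120, 135, 110]  # each value is known only to its own patient
average = secure_average(readings)
```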

The MIT Enigma system combines the use of MPC schemes with secret-sharing schemes, effectively allowing some computations to be performed using the shares that are distributed on the P2P network. The combination of these 3 computing paradigms (secret-sharing, MPC and P2P nodes) opens new possibilities in addressing the current urgent issues around data privacy and the growing liabilities on the part of organizations who store or work on large amounts of data.

New Principles for Privacy-Preserving Queries for Distributed Data

Here are the three (3) principles for privacy-preserving computation based on the Enigma P2P distributed multi-party computation model:

(a) Bring the Query to the Data: The current model is for the querier to fetch copies of all the data-sets from the distributed nodes, import the data-sets into the big data processing infrastructure, and then run queries. Instead, break up the query into components (sub-queries) and send the query pieces to the corresponding nodes on the P2P network.

(b) Keep Data Local: Raw data must never leave the node, its physical location, or the control of its owner. Instead, nodes that carry relevant data-sets execute sub-queries and report on the result.

(c) Never Decrypt Data: Homomorphic encryption remains an open field of study. However, certain types of queries can be decomposed into rudimentary operations (such as additions and multiplications) on encrypted data that would yield equivalent answers to the case where the query was run on plaintext data.