Provenance and SHACL

Note: This post is an adaptation of Section 1.4 of my PhD thesis, and serves as a general intro to our paper on the subject. The purpose of publishing it as a standalone blogpost is to further disseminate my writings. It is part of a short series of posts representing the Introduction of my thesis:

In the previous posts, we have primarily discussed the fundamental semantics for SHACL. Here, we will explore another useful semantics: data provenance for SHACL. Some elements have already been discussed in a previous post.

Given a SHACL shapes graph and a data graph that conforms to that shapes graph, we want to know what subset of the data graph is relevant to decide that it conforms. This subset of the data is called the data provenance.

There are many intuitions of what data is relevant. The provenance could be seen as the subset of the graph "traced out" by the shape by following the values of sh:path in the property shapes. Another intuition is that we would like to have a subset of the graph that the validator "looked at" while validating the data. These notions are useful but imprecise, as will be demonstrated later in this section. Our approach is to at least take into consideration the so-called sufficiency property of the provenance. The sufficiency property states, informally, that the resulting provenance still conforms to the shapes graph. Or, at the level of shapes (without targets), all nodes from the input data graph still satisfies the same shapes in the provenance. Sufficiency tells us that our provenance is actually relevant in a very precise way — clearly, data that contributes to conformance must be relevant.

Our discussion will focus on the provenance of a shape without target declarations. So, given a data graph G, a shape s, and a node n from \(G\) that conforms to \(s\), we will define the provenance which we call the neighborhood of n in G according to s. This can then later be generalized for shapes graphs (including target declarations) as opposed to just shapes.

Neighborhoods by example

We will use the following data graph for our examples:

:user_a a :Admin;
  :accesses :resource1 .

:user_b a :Admin;
  :accesses :resource1 ;
  :accesses :resource2 .

:user_c a :User;
  :accesses :resource1 ;
  :accesses :resource2 .

:resource1 a :Resource .
:resource2 a :Resource .

Suppose we have a shape defining a standard resource as one that is accessed by at least one user (or admin):

:standardResource a sh:NodeShape ;
  sh:class :Resource ;
  sh:property [
    sh:path [ sh:inversePath :accesses ] ;
    sh:minCount 1 ;
  ] .

Clearly, both :resource1 and :resource2 satisfy the shape. We will look at the neighborhood of :resource1. Looking at the intuition behind neighborhoods, the triple stating that :resource1 is a :Resource is definitely in the neighborhood, as well as at least one of the triples stating that it is accessed by a user. However, there is no order on the triples defined in RDF, so picking any of the three triples is somewhat arbitrary. We could define provenance to be non-deterministic, and say that there are three neighborhoods. However, our approach is to have a deterministic provenance definition and thus, we choose to define the neighborhood as:

:resource1 a :Resource .
:user_a :accesses :resource1 .
:user_b :accesses :resource1 .
:user_c :accesses :resource1 .

Note that this does not follow the intuition that we want to have the subgraph that the validator looks at. Any reasonable validator would only need to consider one :accesses triple to know that :resource1 conforms to :standardResource.

We will continue with another example. Consider the shape defining unpopular resources as resources that are accessed by at most two users:

:unpopularResource a sh:NodeShape ;
  sh:class :resource ;
  sh:property [
    sh:path [ sh:inversePath :accesses ] ;
    sh:maxCount 2 ;
  ] .

Only :resource2 satisfies this shape. There are multiple options again for what the neighborhood could be. There are two deterministic options. Either the neighborhood contains all :accesses triples or none! According to our intuition of wanting to "trace out" the shape the former is the suiting definition. However, our definition will choose to not include any :accesses triples in this case. Our design philosophy here is that neighborhoods should be somehow minimal. Choosing not to include any :accesses triples is minimal, and still makes :resource2 conform to the shape. Although this may be counterintuitive, this design choice makes sense in light of sufficiency. We will demonstrate a similar situation with a slightly more complex shape containing the same ideas:

:specialResource a sh:NodeShape ;
  sh:class :resource ;
  sh:property [
    sh:path [ sh:inversePath :accesses ]

    sh:minCount 1 ;

    sh:qualifiedValueShape [ sh:not [ sh:class :admin ]] ;
    sh:qualifiedMaxCount 1 ;
  ]

This shape consists out of three constraints. First, the node must be a resource. Second, the resource must be accessed by at least one user. Finally, at most one user that accesses this resource may not be an admin.

The resources described by this shape are special in the sense that they may only be accessed by admins, with at most one exception.

It is clear from the previous discussion that the neighborhood at least includes the following:

:resource2 a :resource .

:user_b :accesses :resource2 .
:user_c :accesses :resource2 .

This is due to the first two constraints. However, what do we do with the qualified max count? To include what is mentioned in the shape would result in adding the following triple to the neighborhood:

:user_c a :user .

Indicating that :user_c is not an :admin. However, this is exactly the opposite of the information we need for conformance. What we do need is the triple stating that :user_b is indeed an :admin. The idea here is that we want to include in our provenance all data proving that a certain node conforms to a shape. Furthermore, in a shape where only the third constraint is specified, returning nothing also seems to be a possibility for the definition of neighborhood. However, we designed our neighborhood to contain all necessary information to decide conformance, even when other parts of the graph are added. Indeed, adding the triples

:user_b :accesses :resource2 .
:user_c :accesses :resource2 .

would break this property. Having this property is technically necessary but also useful — it allows for neighborhood engines to be flexible and add more to the provenance, while never breaking sufficiency.

This example illustrates the intricacies of defining provenance that is sufficient. Here, we discussed the (qualified) cardinality constraint components, but defining neighborhoods for all shapes requires thinking about every feature in SHACL and choosing a suitable definition for each of them while keeping in mind sufficiency and our two design principles: determinism and minimality.

Shape Fragments

This definition of neighborhoods can be viewed as an additional semantics for SHACL defining a retrieval mechanism. We call these retrieval semantics shape fragments.

Given a shape and a data graph, the shape fragment is simply the neighborhood of all nodes conforming to that shape. When there are targeting declarations, we only consider the nodes that are targeted (and include the information that targets them). It can similarly be defined for sets of shapes, or for shapes graphs.

One can imagine the usefulness of such a retrieval mechanism. When dealing with large RDF graphs, a shapes graph may only describe part of it that is relevant to the intended usage. Retrieving the shape fragment of the data graph then gives us a (possibly) smaller RDF graph which is easier to process, and still satisfies the constraints while containing relevant data.