Provenance and SHACL
Note: This post is an adaptation of Section 1.4 of my PhD thesis, and serves as a general intro to our paper on the subject. The purpose of publishing it as a standalone blogpost is to further disseminate my writings. It is part of a short series of posts representing the Introduction of my thesis:
- Section 1.1: SHACL in a nutshell
- Section 1.2: Expressiveness
- Section 1.3: Recursion
- Section 1.4: Provenance (this post)
In the previous posts, we have primarily discussed the fundamental semantics for SHACL. Here, we will explore another useful semantics: data provenance for SHACL. Some elements have already been discussed in a previous post.
Given a SHACL shapes graph and a data graph that conforms to that shapes graph, we want to know what subset of the data graph is relevant to decide that it conforms. This subset of the data is called the data provenance.
There are many intuitions of what data is relevant. The provenance
could be seen as the subset of the graph "traced out" by the shape by
following the values of sh:path
in the property shapes. Another
intuition is that we would like to have a subset of the graph that the
validator "looked at" while validating the data. These notions are
useful but imprecise, as will be demonstrated later in this
section. Our approach is to at least take into consideration the
so-called sufficiency property of the provenance. The sufficiency
property states, informally, that the resulting provenance still
conforms to the shapes graph. Or, at the level of shapes (without
targets), all nodes from the input data graph still satisfies the same
shapes in the provenance. Sufficiency tells us that our provenance is
actually relevant in a very precise way — clearly, data that
contributes to conformance must be relevant.
Our discussion will focus on the provenance of a shape without target declarations. So, given a data graph G, a shape s, and a node n from \(G\) that conforms to \(s\), we will define the provenance which we call the neighborhood of n in G according to s. This can then later be generalized for shapes graphs (including target declarations) as opposed to just shapes.
Neighborhoods by example
We will use the following data graph for our examples:
:user_a a :Admin; :accesses :resource1 . :user_b a :Admin; :accesses :resource1 ; :accesses :resource2 . :user_c a :User; :accesses :resource1 ; :accesses :resource2 . :resource1 a :Resource . :resource2 a :Resource .
Suppose we have a shape defining a standard resource as one that is accessed by at least one user (or admin):
:standardResource a sh:NodeShape ; sh:class :Resource ; sh:property [ sh:path [ sh:inversePath :accesses ] ; sh:minCount 1 ; ] .
Clearly, both :resource1
and :resource2
satisfy the shape. We will
look at the neighborhood of :resource1
. Looking at the intuition
behind neighborhoods, the triple stating that :resource1
is a
:Resource
is definitely in the neighborhood, as well as at least one
of the triples stating that it is accessed by a user. However, there
is no order on the triples defined in RDF, so picking any of the three
triples is somewhat arbitrary. We could define provenance to be
non-deterministic, and say that there are three
neighborhoods. However, our approach is to have a deterministic
provenance definition and thus, we choose to define the neighborhood
as:
:resource1 a :Resource . :user_a :accesses :resource1 . :user_b :accesses :resource1 . :user_c :accesses :resource1 .
Note that this does not follow the intuition that we want to have the
subgraph that the validator looks at. Any reasonable validator would
only need to consider one :accesses
triple to know that :resource1
conforms to :standardResource
.
We will continue with another example. Consider the shape defining unpopular resources as resources that are accessed by at most two users:
:unpopularResource a sh:NodeShape ; sh:class :resource ; sh:property [ sh:path [ sh:inversePath :accesses ] ; sh:maxCount 2 ; ] .
Only :resource2
satisfies this shape. There are multiple options again
for what the neighborhood could be. There are two deterministic
options. Either the neighborhood contains all :accesses
triples or
none! According to our intuition of wanting to "trace out" the shape
the former is the suiting definition. However, our definition will
choose to not include any :accesses
triples in this case. Our design
philosophy here is that neighborhoods should be somehow
minimal. Choosing not to include any :accesses
triples is minimal, and
still makes :resource2
conform to the shape. Although this may be
counterintuitive, this design choice makes sense in light of
sufficiency. We will demonstrate a similar situation with a slightly
more complex shape containing the same ideas:
:specialResource a sh:NodeShape ; sh:class :resource ; sh:property [ sh:path [ sh:inversePath :accesses ] sh:minCount 1 ; sh:qualifiedValueShape [ sh:not [ sh:class :admin ]] ; sh:qualifiedMaxCount 1 ; ]
This shape consists out of three constraints. First, the node must be a resource. Second, the resource must be accessed by at least one user. Finally, at most one user that accesses this resource may not be an admin.
The resources described by this shape are special in the sense that they may only be accessed by admins, with at most one exception.
It is clear from the previous discussion that the neighborhood at least includes the following:
:resource2 a :resource . :user_b :accesses :resource2 . :user_c :accesses :resource2 .
This is due to the first two constraints. However, what do we do with the qualified max count? To include what is mentioned in the shape would result in adding the following triple to the neighborhood:
:user_c a :user .
Indicating that :user_c
is not an :admin
. However, this is exactly the
opposite of the information we need for conformance. What we do
need is the triple stating that :user_b
is indeed an
:admin
. The idea here is that we want to include in our
provenance all data proving that a certain node conforms to a
shape. Furthermore, in a shape where only the third constraint is
specified, returning nothing also seems to be a possibility for the
definition of neighborhood. However, we designed our neighborhood to
contain all necessary information to decide conformance, even when
other parts of the graph are added. Indeed, adding the triples
:user_b :accesses :resource2 . :user_c :accesses :resource2 .
would break this property. Having this property is technically necessary but also useful — it allows for neighborhood engines to be flexible and add more to the provenance, while never breaking sufficiency.
This example illustrates the intricacies of defining provenance that is sufficient. Here, we discussed the (qualified) cardinality constraint components, but defining neighborhoods for all shapes requires thinking about every feature in SHACL and choosing a suitable definition for each of them while keeping in mind sufficiency and our two design principles: determinism and minimality.
Shape Fragments
This definition of neighborhoods can be viewed as an additional semantics for SHACL defining a retrieval mechanism. We call these retrieval semantics shape fragments.
Given a shape and a data graph, the shape fragment is simply the neighborhood of all nodes conforming to that shape. When there are targeting declarations, we only consider the nodes that are targeted (and include the information that targets them). It can similarly be defined for sets of shapes, or for shapes graphs.
One can imagine the usefulness of such a retrieval mechanism. When dealing with large RDF graphs, a shapes graph may only describe part of it that is relevant to the intended usage. Retrieving the shape fragment of the data graph then gives us a (possibly) smaller RDF graph which is easier to process, and still satisfies the constraints while containing relevant data.