Shape Fragments and Subgraphs
In this post I'd like to explain some of the ideas from our paper, starting small to illustrate the design choices and considerations. The paper is titled Data Provenance for SHACL (formerly Shape Fragments) and in its essence, it describes how the constraint language SHACL can be used to specify subsets of an RDF graph.
The idea described previously seems simple. We take a data graph, we take a shapes graph, then we "trace out" the shapes from the shapes graph from the data graph and, voila, we're done with it. Of course, this leaves out what it means to "trace out" shapes from a graph. Consider the following example shape:
:SocialShape a sh:PropertyShape ; sh:targetClass foaf:Person ; sh:path foaf:knows ; sh:minCount 1 .
Intuitively, you want only consider nodes that adhere to the shape and
you want to "follow properties" to create a subgraph. In this example,
you take all triples where the subject is the focus node and which
also have foaf:knows
as a predicate. Consider the following data
graph:
:maxime a foaf:Person ; foaf:givenName "Maxime" ; foaf:knows :thomas ; foaf:knows :anastasia ; foaf:knows :jan .
Using the ideas formulated earlier, we obtain the subgraph:
:maxime foaf:knows :thomas ; foaf:knows :anastasia ; foaf:knows :jan .
We believe this to be a simple and intuitive definition for the given
shape. You may argue that because :SocialShape
mentioned the
rdf:class
, we also want it in the subgraph. We would agree and have
defined it as such in the paper.
You can imagine trying to define subgraphs for every possible SHACL
construct, but few definitions are as straightforward as the one
discussed above. The one defined above isn't even that obvious. You
could ask yourself: why does the subgraph contain all outgoing
foaf:knows
edges? One philosophy could be to let the subgraph only
contain "just enough". After all, :SocialShape
only says there needs
to be at least one.
This brings us to one of the principles we followed when defining the
subgraphs: determinism. In the :SocialShape
case, because triples in
an RDF graph are not sorted, the only choice we have is returning all
the triples.
Let's consider another example:
:AntisocialShape a sh:PropertyShape ; sh:targetClass foaf:Person ; sh:path foaf:knows ; sh:maxCount 2 .
Here, you are antisocial when you know at most two others. Consider a new data graph:
:bob a foaf:Person ; foaf:givenName "Bob" ; foaf:knows :alice ; foaf:knows :tim .
What would be a natural definition for a subgraph given our
:AntisocialShape
? Keeping in mind our principle of determinism, we are
left with two choices. Both contain the triple :bob a foaf:Person
as
we discussed earlier. The first option is the subgraph constaining all
foaf:knows
triples where :bob
is the subject. The second option
contains only the above mentioned triple. Both options seem
reasonable, and we opted for the latter.
The reason being that this somehow comes closer to another underlying principle: minimality. We chose the smallest subgraph we can while somehow still maintaining the essence of the original data graph.
This leads to the following observation: the empty subgraph also is minimal and deterministic. This is the essence of the major open problem within this work: we want formally defined "postulates" which lead us to a definition of Shape Fragments. The principles of determinism and minimality are just design guidelines.
Nevertheless, I believe these design guidelines together with our
proposal for the definition of subgraphs are reasonable. This believe
is strengthened by one of the formal contributions of the paper: the
Sufficiency Lemma (and its corollary). The notion of Sufficiency is
borrowed from work on database Provenance, which is highly
relevant. Informally, the lemma states that for every SHACL construct
(like the ones demonstrated by :SocialShape
and :AntisocialShape
)
defines a subgraph which contains enough triples such that the shape
used for defining it still holds for the same nodes in the
subgraph. Even when you add more triples from the original graph to
the subgraph. In short, shapes that hold in the original graph, also
hold in every subgraph that contains at least the triples provided by
our definition. The "at least" is important here. For example, it
means that the choice of minimality in the subgraph definition of
:AntisocialShape
is not necessary for our result to hold.
Finally I would like to note that the Suffiency property of our definition captures some intuition that subgraphs given by shapes need to still adhere to these shapes. This gives us a stronger link to the definition of SHACL as a constraint language.
Hopefully this short post raises your interest in our work. There are interesting problems to solve here, both for more theoretically minded people (I'm thinking of the above mentioned postulates and other properties of the subgraphs) and more practically minded peope (Relating to the implementation and use-case side of things). The paper also discusses some preliminary results on implemenatation which may be of interest.