1. What is ShEx?
2. Where is EaaSI using ShEx?
We use ShEx for data modeling, for testing whether a graph conforms to a schema, and to drive elements of the WikiDP (https://wikidp.org) user interface.
In this post I'll walk through an example of how we use ShEx to validate entity data from Wikidata.
To fit the blog post format, I'm presenting a general description of this approach; if you'd like more detail, you can find additional information here. (https://zenodo.org/record/1214521)
There are currently more than 3,000 items for file formats in Wikidata. Users create these items and add statements to them as they edit, so coverage varies widely: some items carry only a single statement (see Figure 1), while others carry many dozens (see Figure 2).
To investigate data quality across all file format items, we can use ShEx.
I wrote a ShEx schema describing the expected structure of a minimal file format item; you can consult it here. (https://github.com/shexSpec/schemas/blob/master/Wikidata/DigitalPreservation/wikidataFileFormat.shex)
Using this schema, we can test all file format items for conformance and see where gaps exist in the data.
The schema begins with prefix declarations for all referenced namespaces (see Figure 3).
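As an illustration of what such declarations look like (the full list is in the linked schema and in Figure 3), these are the standard Wikidata namespaces:

```
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
```

Here `wd:` abbreviates Wikidata entities (Q numbers) and `wdt:` abbreviates direct-claim properties (P numbers).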
The schema also has a start declaration, which indicates the shape at which validation begins.
In the <#wikidata-file_format> shape I list all of the Wikidata predicates I expect to find on file format items. The P numbers are identifiers for Wikidata properties (see Figure 4).
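A minimal sketch of such a shape follows; the real schema is in the linked repository, and the specific constraints shown here (P31 "instance of" pointing at Q235557 "file format", P1195 "file extension") are illustrative rather than a copy of the actual schema:

```
start = @<#wikidata-file_format>

<#wikidata-file_format> {
  wdt:P31   [wd:Q235557] ;     # instance of: file format (required)
  wdt:P1195 xsd:string *       # file extension: zero or more string values
}
```

Each line pairs a predicate with a value constraint (a value set in brackets, or a datatype) and an optional cardinality operator (`?` for optional, `*` for zero or more, `+` for one or more; the default is exactly one).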
In Figure 4, the schema is shown on the left-hand side, with comments explaining it on the right.
The schema describes every expected predicate, the datatype of the values that predicate takes, and its cardinality. For an overview of how ShEx schemas work, consult the ShEx Primer. (http://shex.io/shex-primer/)
The ShEx technical standard currently has five actively maintained implementations.
Using any of these tools, we can test entity data from Wikidata for conformance to this schema, and so investigate the data quality of the 3,000+ file format items in greater detail.
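For example, one of those implementations is PyShEx, a Python library. The following is a sketch of how a single item might be checked with it, under the assumption that `ShExEvaluator` accepts URLs for both the schema and the RDF data (the choice of Q42332, the item for PDF, is just an example; any file format QID would do):

```python
# Sketch: validate one Wikidata item against the file format schema
# using PyShEx (pip install PyShEx). Network access is required.
from pyshex import ShExEvaluator

SCHEMA_URL = (
    "https://raw.githubusercontent.com/shexSpec/schemas/master/"
    "Wikidata/DigitalPreservation/wikidataFileFormat.shex"
)
# Turtle export of the item's entity data; Q42332 (PDF) is an example.
DATA_URL = "https://www.wikidata.org/wiki/Special:EntityData/Q42332.ttl"

results = ShExEvaluator(
    rdf=DATA_URL,
    schema=SCHEMA_URL,
    focus="http://www.wikidata.org/entity/Q42332",
).evaluate()

for r in results:
    if r.result:
        print(r.focus, "conforms")
    else:
        print(r.focus, "does not conform:", r.reason)
```

Looping this over all 3,000+ file format QIDs (e.g. from a SPARQL query) would produce the kind of gap report described above.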