1. What is ShEx?
Shape Expressions, commonly known as ShEx, is a constraint language for modeling and validating RDF data. As one paper puts it (http://ceur-ws.org/Vol-2042/paper40.pdf), ShEx is to RDF what XML Schema is to XML.
A great place to get an overview of ShEx is the Primer (http://shex.io/shex-primer/).
RDF stands for Resource Description Framework. RDF is a widely used data model on the web. The concepts used in RDF are introduced in this specification document: https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
Using RDF makes it easier to integrate data sets. For EaaSI specifically, working with RDF allows us to combine our data with data from Wikidata.
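
To make the integration point concrete, here is a minimal sketch in Python. It assumes an RDF graph can be treated as a set of (subject, predicate, object) triples, so that combining two data sets is a set union. Every identifier in it (the ex: names, wd:Q_EXAMPLE, the values) is invented for illustration and is not taken from EaaSI or Wikidata.

```python
# Illustrative only: all identifiers and values below are made up.
# An RDF graph is modeled here as a set of (subject, predicate, object) triples.

local_graph = {
    ("ex:format1", "ex:label", "Example Format"),
    ("ex:format1", "ex:sameAs", "wd:Q_EXAMPLE"),
}

wikidata_graph = {
    ("wd:Q_EXAMPLE", "ex:label", "Example Format"),
    ("wd:Q_EXAMPLE", "ex:extension", "exm"),
}


def merge(*graphs):
    """Combine graphs by set union; identical triples collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Return the sorted (predicate, object) pairs stated about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


combined = merge(local_graph, wikidata_graph)
print(len(combined), "triples after merging")
print(describe(combined, "wd:Q_EXAMPLE"))
```

Because both data sets use the same triple structure, no schema mapping is needed before they can be queried together. A real project would use an RDF library rather than bare tuples.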
2. Where is EaaSI using ShEx?

We use ShEx for modeling data, for testing the conformance of a graph to a schema, and for driving elements of the WikiDP (https://wikidp.org) user interface.
In this post I’ll discuss an example of how we use ShEx for validation of entity data from Wikidata.

Because this is a blog post, I am presenting only a general description of the approach. If you'd like more details, you will find additional information at https://zenodo.org/record/1214521.

There are currently more than 3,000 items for file formats in Wikidata. Users create items and add statements to them as they edit. Some of these items have only one statement in Wikidata. For example, see Figure 1.

Figure 1: A file format item from Wikidata that currently has one statement

Other items have many dozens of statements. For example, see Figure 2.

Figure 2: A file format item from Wikidata that currently has dozens of statements.

To investigate the data quality across all file format items, we can use ShEx. I wrote a ShEx schema to describe the expected structure for a minimal file format item. You can consult the schema at https://github.com/shexSpec/schemas/blob/master/Wikidata/DigitalPreservation/wikidataFileFormat.shex

By using this schema we can test all file format items for conformance and see where gaps exist in the data. The schema begins with prefix declarations for all referenced namespaces (see Figure 3).

Figure 3: The preamble of a ShEx schema for a file format item on Wikidata.
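
As a rough illustration of what prefix declarations do, the following Python sketch reads PREFIX lines and expands a prefixed name into a full IRI. The preamble text is an assumption written for this example, not a copy of the schema in Figure 3, and the helper names parse_prefixes and expand are hypothetical. The wd: and wdt: namespace IRIs are the ones Wikidata uses for entities and direct properties.

```python
import re

# Illustrative preamble; the real schema's preamble may declare other prefixes.
PREAMBLE = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
"""

# Matches lines of the form: PREFIX name: <iri>
PREFIX_RE = re.compile(r"^\s*PREFIX\s+([A-Za-z][\w-]*)?:\s*<([^>]*)>\s*$")


def parse_prefixes(text):
    """Map each declared prefix to its namespace IRI."""
    prefixes = {}
    for line in text.splitlines():
        match = PREFIX_RE.match(line)
        if match:
            prefixes[match.group(1) or ""] = match.group(2)
    return prefixes


def expand(name, prefixes):
    """Turn a prefixed name such as wdt:P31 into a full IRI."""
    prefix, _, local = name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local


prefixes = parse_prefixes(PREAMBLE)
print(expand("wdt:P31", prefixes))
```

Declaring the prefixes once keeps the rest of a schema short: each property can then be written as wdt: followed by its P number rather than as a full IRI.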

This schema has a start declaration, which tells a validator which shape to begin with when it checks a node in the graph.

In the shape for <#wikidata-file_format> I list all of the Wikidata predicates I expect to be used on file format items. All of the P numbers are identifiers for Wikidata properties (see Figure 4).

Figure 4: A section of a ShEx schema for a file format item on Wikidata.

On the left-hand side is the schema; on the right-hand side are comments explaining the schema. The schema describes all of the expected predicates, the data types of the values expected to be used with those predicates, and their cardinalities. For an overview of how ShEx schemas work, consult the ShEx Primer (http://shex.io/shex-primer/).
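
The idea of predicates, value types, and cardinalities can be sketched in a few lines of Python. This is a simplified stand-in for what a ShEx validator does, not one of the ShEx implementations. The shape below is an assumption made for illustration: its property IDs and cardinalities are placeholders and do not reproduce the schema in Figure 4. The names TripleConstraint, SHAPE, and check are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TripleConstraint:
    predicate: str            # prefixed property name, e.g. "wdt:P31"
    kind: str                 # expected value kind: "iri" or "literal"
    min_count: int = 1        # fewest values allowed
    max_count: Optional[int] = 1  # most values allowed; None means unbounded


# Placeholder shape: IDs and cardinalities are illustrative only.
SHAPE = [
    TripleConstraint("wdt:P31", "iri", 1, None),        # one or more, like `+`
    TripleConstraint("wdt:P1195", "literal", 0, None),  # zero or more, like `*`
    TripleConstraint("wdt:P1163", "literal", 0, 1),     # optional, like `?`
]


def check(item, shape):
    """Return a list of problems; an empty list means the item conforms."""
    problems = []
    for c in shape:
        values = item.get(c.predicate, [])
        count = len(values)
        if count < c.min_count:
            problems.append(f"{c.predicate}: expected at least {c.min_count}, found {count}")
        if c.max_count is not None and count > c.max_count:
            problems.append(f"{c.predicate}: expected at most {c.max_count}, found {count}")
        for kind, text in values:
            if kind != c.kind:
                problems.append(f"{c.predicate}: value {text!r} should be of kind {c.kind}")
    return problems


# Invented items: each maps a predicate to a list of (kind, value) pairs.
sparse_item = {"wdt:P1195": [("literal", "exm")]}
rich_item = {
    "wdt:P31": [("iri", "wd:Q_EXAMPLE")],
    "wdt:P1195": [("literal", "exm"), ("literal", "exmp")],
    "wdt:P1163": [("literal", "application/example")],
}

print(check(sparse_item, SHAPE))
print(check(rich_item, SHAPE))
```

The sketch leaves out much of what ShEx offers, such as value sets, nested shapes, and closed shapes, but it shows how a list of constraints turns into a yes-or-no conformance answer with reasons attached.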

The ShEx technical standard currently has five actively maintained implementations.

Using any of these software tools, it is possible to use this schema to test for conformance of entity data from Wikidata. This way we can investigate the data quality of the 3,000+ file format items in greater detail.
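
Once each item has been checked, the results can be tallied to show where gaps cluster. The Python sketch below assumes a validator has already produced, for each item, a list of the predicates it lacks. The item IDs and results are invented for illustration, and summarize is a hypothetical helper.

```python
from collections import Counter

# Hypothetical validator output: item ID -> predicates the item is missing.
results = {
    "item-A": [],
    "item-B": ["wdt:P31"],
    "item-C": ["wdt:P31", "wdt:P1195"],
}


def summarize(results):
    """Return the conforming item IDs and a count of each missing predicate."""
    conforming = sorted(item for item, missing in results.items() if not missing)
    gaps = Counter(p for missing in results.values() for p in missing)
    return conforming, gaps


conforming, gaps = summarize(results)
print(f"{len(conforming)} of {len(results)} items conform")
for predicate, count in gaps.most_common():
    print(f"{predicate}: missing from {count} item(s)")
```

A tally like this points editors to the properties most often absent, which is one way to decide where to add statements first.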

Learn More:
ShEx homepage: http://shex.io/
ShEx Primer: http://shex.io/shex-primer/

Preferred citation:

Thornton, Katherine. (2019, April 23). Using ShEx to Investigate Data about Software and File Formats in Wikidata. Software Preservation Network. https://www.softwarepreservationnetwork.org/using-shex-to-investigate-data-about-software-and-file-formats-in-wikidata/