“Is it a bird, is it a plane?” Here’s what Datopian founder & CTO, Rufus Pollock, had to say about data quality at SEMIC 2020
In case you missed it, here’s a transcript of Rufus’ insights during the closing debate of the first digital edition of the EU’s annual semantic interoperability conference (SEMIC). Alongside fellow panelists Heather Broomfield, Ruben Verborgh and chairman David Osimo, Rufus argued his case for why data quality should be at the forefront of EU data policy if we are to achieve true interoperability.
David: Tell us a bit about what you learned during your many years’ experience in data sharing and reuse - what works, what doesn’t?
Rufus: It is worth looking at where we are in the journey to data nirvana. The first part of that journey is just being able to have data. There is much still to do here, whether it’s open data or non-open data, in data availability. There is still a long way to go in that journey, but a lot has been achieved. It’s not a live issue; it’s something we need to keep working on, but the data availability debate has largely been won in many ways.
Step two in the journey is what you could call the data quality part of this. One thing I would say here is that I want to distinguish quality very briefly, in the sense that it can mean a multitude of things. I think it’s useful to break it down into three levels, crudely. One level would be what I call data tidiness. Many datasets that you get - eg. something as simple as a spreadsheet, which we can all be familiar with - are not tidy, it turns out [Rufus laughs]. Even if they are machine readable, so in a digital form, they’re missing columns, they’re missing rows, there are blank rows at the top, there are footnotes shoved in the dataset. You just have to look at the quality of a lot of data online, eg. on government open data portals, or even internally… at Datopian, an offshoot of the Open Knowledge Foundation, we do a lot of internal data management with companies, governments and others. The quality of internal data in tidiness is often a big issue.
Above tidiness you would have what I call syntax. This is the basic information about the column in your spreadsheet: is it a date, is it a number, is it a bird, is it a plane? It’s what would allow you to load that data automatically into something like a database or between spreadsheets in a consistent way. This is what, by the way, is missing from CSV data, if people are familiar with it; it’s all strings, meaning you don’t know if it’s a date, a number, or so on.
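To make the syntax point concrete, here is a minimal Python sketch (an illustration, not something shown in the debate). It reads a small CSV, where every cell arrives as a string, and guesses a type for each cell. The sample data, the function names and the four type labels are all assumptions made for this example.

```python
import csv
import io
from datetime import date

# Hypothetical sample: every value below reaches the program as a plain string.
SAMPLE = "day,cases\n2020-10-01,123\n2020-10-02,abc\n"


def infer_type(value):
    """Guess whether one CSV cell is an 'integer', 'number', 'date' or 'string'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        date.fromisoformat(value)  # accepts ISO dates such as 2020-10-01
        return "date"
    except ValueError:
        return "string"


def column_types(text):
    """Map each column name to the set of cell types found in that column."""
    reader = csv.DictReader(io.StringIO(text))
    types = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            types[name].add(infer_type(value or ""))
    return types


if __name__ == "__main__":
    for name, found in column_types(SAMPLE).items():
        print(name, sorted(found))
```

A column that reports more than one type, as the `cases` column does here, is a candidate syntax error. The sketch assumes well-formed rows with the same number of cells as the header.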
Then we have semantics, the pinnacle of data quality. Not only do we know our data is tidy, not only do we know that this column is a number, but we know that this column is GDP per capita (or whatever). Ultimately, I would say that still, in my practical experience, when you look at these things, they are relevant because they allow us to use tools. The reason we care about data quality is that it allows us to automate some part of what we were doing manually with data. If you look at many organisations, I would say we are still often at stage one or two of that pyramid: at the tidiness or syntax level. Basic errors arise at that kind of level.
Just to give a very concrete example, relating to tidiness and tooling: recently, it turned out that Covid numbers were being incorrectly reported in the UK because people used an old Excel format that was truncating the files. You may think this is dumb, but it’s not actually that uncommon.
We have made a lot of progress, but the basic layer, tidiness and syntax, is still a big issue. If I were to have a recommendation for policymakers, it would be that focusing on tooling and the practical problems - the use cases people have - is really, really essential. Making it real, being what I call agile data (just like we have agile software, ie. based on user stories and rapid iteration) is important. That can even apply to standards; I’ve been in standards processes where they have a six-month deadline and need to ship a standard (or whatever). You wouldn’t build a tool this way.
From left to right: David Osimo, Director of Research at the Lisbon Council for Economic Competitiveness and Social Renewal in Brussels; Heather Broomfield, Senior Advisor at the Resource Centre for Data Sharing of the Digitalisation Agency, Norway; [Osimo]; Ruben Verborgh, Professor of Decentralised Web Technology at the IDLab of Ghent University; Rufus Pollock.
David: What are your key messages for building effective data spaces?
Rufus: There’s quality in terms of is the data actually accurate, and there’s what I would call technical quality - is the data actually consumable by a computer or by tools relatively easily, or by a programmer using tools? If you are not testing and assessing quality, you are not going to drive quality in any way. This comes back to my point that tidiness and syntax are machine-checkable. It’s more difficult for a machine to know whether this is GDP per capita than it is for it to check whether a column contains numbers, eg. whether a column has 123 in it rather than abc.
One thing we spent a lot of time on at Datopian and OKF over the past 15 years is a stack called Frictionless Data. This is not semantics; you can plug semantics in, but it’s just syntax and it’s just tidiness, and it’s mainly tools. That’s the other thing I want to say: forget standards to some extent. I love rough consensus and running code. I’ve seen specs written that you couldn’t write a validator for or that it would take a PhD to write that validator. That is not a useful spec. Rough consensus and running code. One thing we put a lot of effort into at Frictionless [Data] was the tooling, in that regard.
One thing I want to say is: be inspired by code. One thing I like to talk about is continuous data integration. For anyone not familiar with it, continuous integration in software is the practice of, every time you make a change to the code, you integrate it, you run tests on it, you automatically run your test suite. I would love a world where, every time someone is pushing some data into some system, that data is validated as part of the process. This doesn’t necessarily mean it’s blocked on it, but it means that if it does fail the basic test, you get some report on it. That’s something we could do today. We do it at Frictionless Data, where we actually have a system that lets you push CSVs to GitHub automatically and, like you have with code testing, gives you a report on whether your data failed the test or not. This is at the level of syntax or tidiness; are there blank rows or columns and is this column that is supposed to be a date actually a date?
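As a rough illustration of what such a check could look like, here is a minimal sketch using only the Python standard library. It is not the Frictionless Data tooling; the function name `validate_csv`, its `date_columns` parameter and the sample data are assumptions made for this example. It reports blank rows, rows with the wrong number of cells, and values in a date column that are not ISO dates, and it returns a report rather than blocking anything.

```python
import csv
import io
from datetime import date


def validate_csv(text, date_columns=()):
    """Return a list of readable problems; an empty list means the data passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    missing = [name for name in date_columns if name not in header]
    if missing:
        return [f"missing date columns: {missing}"]

    problems = []
    # Row numbers are 1-based and count the header as row 1.
    for row_no, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"row {row_no}: blank row")
            continue
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} cells, found {len(row)}"
            )
            continue
        for name in date_columns:
            value = row[header.index(name)]
            try:
                date.fromisoformat(value)
            except ValueError:
                problems.append(
                    f"row {row_no}: {name!r} is not an ISO date: {value!r}"
                )
    return problems


if __name__ == "__main__":
    # Hypothetical upload with one blank row and one bad date.
    upload = "date,amount\n2020-10-01,10\n\nnot-a-date,5\n"
    for problem in validate_csv(upload, date_columns=["date"]):
        print(problem)
```

Run on every push, for example from a continuous integration job, a function like this gives the publisher a report without necessarily rejecting the upload.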
I mean, you wouldn’t believe this. Just to go back a few years, I did a lot of work with open data with the UK government - I don’t know whether it’s still up to date - and there was a point where they were supposed to publish spending data in CSVs with about eight columns. Out of about 2000 CSVs, not one of them actually managed to comply with the basic standard, that they have the columns in the right order with the right names with the right data. It was incredible.
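A header check of this kind takes only a few lines. The sketch below is an illustration under stated assumptions: the eight column names are hypothetical placeholders, not the actual UK spending standard, and `header_problems` is a name invented for this example.

```python
import csv
import io

# Hypothetical column names for illustration only; not the real standard.
EXPECTED_HEADER = [
    "Department", "Entity", "Date", "Expense Type",
    "Expense Area", "Supplier", "Transaction Number", "Amount",
]


def header_problems(text, expected=EXPECTED_HEADER):
    """Compare the first CSV row with the expected header and list any problems."""
    actual = next(csv.reader(io.StringIO(text)), [])
    if actual == expected:
        return []
    problems = []
    missing = [name for name in expected if name not in actual]
    extra = [name for name in actual if name not in expected]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not missing and not extra:
        problems.append("same column names, but not in the expected order or count")
    return problems


if __name__ == "__main__":
    print(header_problems("Department,Total\n"))
```

Checking the names and their order is the cheap part; whether each column then holds the right data is a separate check on the rows.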
Continuous data integration: you want automated tooling validating technical quality. That will have a huge impact. Even on educating users - many users who are pushing data, particularly in public administrations, are not necessarily always expert; they might be uploading an Excel file. If you have good feedback from those systems, you can educate the publishers quite rapidly. They want to publish good data, but they don’t have the feedback and you don’t have the time and resources to do data training, but if you have that in your tooling, then you can give error messages and so on.
So, I guess what I would say is, have some way to automate validation and other tools that help you publish good quality data. Focus on tidiness and syntax to some extent at this point. If you have semantics, then absolutely, go for it, that would be a huge win. Adopt best practice from software, namely continuous integration and user-driven work. When you are doing this stuff, have real use cases you are applying to and see how that is going. This is much more exciting and much more effective. Agile, even including rough consensus and running code. We can have specs, rather than standards, which we iterate on rapidly with the tools.
Participant: What about running tests first and then collecting the data?
Rufus: Data systems in organisations are of all kinds of scale. One thing I’ve noticed, and that I’ve been a fan of for a long time, is reusing software infrastructure. I’ve noticed a lot of people publishing data on GitHub and versioning data that way. For example, large amounts of Covid data have been published that way. I have been doing this for about 13 years. One of the reasons I wanted to say this is because yes, you can write tests, but you want some way that they are going to be automatically run - these things go together.
You should be thinking about how data is flowing in your own organisation and how you can intervene to test that. I don’t know the specifics of the UK Covid case, but, in a perfect world, that Excel file that came in every day - even if it went into some system where there was a trigger every time it got uploaded - would go through a prewritten test. The test might be: we expect X amount of cases per day and, if we are outside of that range, then it isn’t necessarily wrong, but some alert should happen. That would be great. Test-driven data development, you could call it. Some data is just getting published, but this matters particularly when data is getting consumed. In the real world, in some way, you want data production to be driven by demand, by use cases. I think that would be very, very powerful. In that sense, it is something you can start doing; you could write basic tests and have them run.
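A prewritten range test of the kind described could be as small as the following sketch. The function names, the dates and the thresholds are all hypothetical, chosen only to illustrate the idea; an out-of-range value produces an alert for review rather than an automatic rejection.

```python
def check_daily_cases(count, expected_low, expected_high):
    """Return an alert message if count is outside the expected range, else None."""
    if expected_low <= count <= expected_high:
        return None
    return (
        f"daily cases {count} outside expected range "
        f"{expected_low}-{expected_high}; please review"
    )


def run_on_upload(daily_counts, expected_low, expected_high):
    """Check every day in an upload and collect alerts instead of failing outright."""
    alerts = []
    for day, count in daily_counts.items():
        message = check_daily_cases(count, expected_low, expected_high)
        if message:
            alerts.append(f"{day}: {message}")
    return alerts


if __name__ == "__main__":
    # Hypothetical figures: the second day looks suspiciously low.
    upload = {"2020-10-01": 7000, "2020-10-02": 12}
    for alert in run_on_upload(upload, expected_low=1000, expected_high=20000):
        print(alert)
```

Hooked to whatever trigger fires when the daily file arrives, a check like this would flag a sudden drop such as one caused by a truncated file.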
What I am also trying to say is that, overall, there is a mentality that I would like to call agile data, that we start to adopt some of the mentality we had in software, like automation and automated testing, but also things like version control, which is a really hot topic in data science at the moment. In general, you also want to move out of a world of Excel. These things go together: it is hard to do automated testing when you can’t track changes, or you can’t see what happened, or you can’t go back and see when a bug arose. I just want to emphasise at the end that quality relates to adopting a new mindset, an agile data mindset, that would apply across all tooling - and that’s with simple tools that we have, from software and elsewhere, that we can apply and adopt.
© Datopian (CC Attribution-Sharealike (by-sa))