
Open data and data privacy: how can information governance play a role?

Steven De Costa is the co-steward of the CKAN Project and Executive Director at Link Digital. He joined Anthony and Kris to dig deep into open data, its past and future, and its utility for governments and large organizations, and the implications for privacy and security. Could information governance provide the tools to ensure open data is useful for governments now, and in the AI future? And would a Croissant help?

They also discuss:

  • What is open data?
  • How do you manage security and privacy of open data sets?
  • Exploring CKAN, the open data management tool
  • Real world applications for open data
  • AI and open data: the future implications
  • How does Croissant provide a solution?
  • The role of governance in AI and open data

Transcript

Anthony: [00:00:00] Welcome to FILED, a monthly conversation with those at the convergence of data privacy, data security, data regulations, records, and governance. I'm Anthony Woodward, the CEO of RecordPoint, and with me today is my co-host, Kris Brown, our Executive Vice President of Partners, Evangelism and Solution Engineering.

How are you, Kris?  

Kris: I'm excellent. No smack talk about my title today. I'm terribly disappointed.  

Anthony: Yeah, look, the joke's a bit old now, isn't it? So, we're all getting on with it. But look, I think we've got a really exciting one today, Kris. We're talking to Steven De Costa, who is the Executive Director of Link Digital and the co-steward of the CKAN project.

Steven: Thank you very much for having me, and I'm very well, it's a very crisp morning here in Canberra.  

Kris: Awesome. So, Steven, let's be open and honest. We were talking about it before we kicked off recording. There's a bit of a relationship here. I've worked for you in the past, I think it was probably a hundred years ago, and I was doing a bunch of terrible HTML coding for a previous business of yours.

But today what I'm really looking forward to is we're gonna talk about open government, and more importantly, its connection to information governance. Certainly, I think in conversations we've had in the past, I think we've got the right person for the job. So, maybe for all of the audience members you can give us a little bit of a background on sort of how you came to be, you know, what Link Digital is and, and, then, you know, talk a little bit about how you got interested in open data.

Steven: Yeah, sure. Well, growing up in Canberra, so Link Digital is a digital agency and one of our first projects actually in 2001 was with the National Portrait Gallery. And I think you would even remember back then they used a thing called KE EMu for their knowledge management. So, working with data, working with collections, kind of like right from the start, that was a real key interest for me.

If you've ever built a website, which is what we did for our first decade of operations, it's always a pain to get the content from the clients. And so, the art is gold. When you wanna create something substantial, working with the collection agencies and museums and galleries and things like that was always great ‘cos the data is so amazing, and it comes with pictures.

That really drove interest. But then around 2012, I think it was, we worked with a hackathon group called GovHack and we were sponsoring that, and as part of that we got involved with building data.gov.au on CKAN. So, that's where we got interested. And my genuine interest sort of came at that moment.

I did economics at university, and open data was brought in as this non-rivalrous information good, which I remembered from lectures. And I kind of got it that public data is in the public interest, and it can be used for public reuse. So, suddenly my personal experience of valuing data as a web dev guy and then making that data more available for more people.

Anthony: Yeah. And it's a meaty topic when we start to dive into open data a bit. But it'd be good to unlock that a little bit for the audience. When we talk about open data, what do we mean? Because we can't just say, you know, I don't think government or anybody who's working in the open data ecosystem can go, look, here's all this, we don't care.

There are privacy issues, security issues, a bunch of things going on. So, what does that look like? What do you mean by open? How do we unpack that?

Steven: Yeah, well maybe I talk about it from three perspectives. So, from a kind of scientific perspective, you can think about the FAIR data principles: findable, accessible, interoperable, reusable.

Largely, it's about making things reproducible so that you can recreate other people's research, which might have really important impacts if you wanna redo that same research in a different context. For government, around 2009 with the Obama administration, it's really around open government, so participatory democracy, transparency of government institutions and that kind of activity.

So, public institutions, the resilience of public institutions and democratic integrity, all that kind of stuff is also interlocked with the open data concept. And then lastly, if you think about Airbnb and Uber and this sort of social contracting model where you reduce the cost of party A and party B because the intermediary requirements of how they can trust each other or interoperate, a lot of that can be supported by open data as well, reducing the cost of contracting.

So, they're the three primary ways that folks have been getting into it and leveraging it. That's the way I still think about it.  

Anthony: And it makes perfect sense. If we drill in a little bit, where are the big successes that have occurred? Because there, I know there have been a few and then there's been some other bigger failures.

Where are the big successes that you've been involved with, where we've seen the application of open data?  

Steven: Well, I mean, it's really a social tool as much as it is a technical tool. Organizations grow, and they structure themselves around their remit, and sometimes the effect of their work is pretty hard to put forward.

They have policy documents, they have annual reports, they have all sorts of things, but if you wanna really bring the receipts, it's in the data of the operations, the programs that they're running. So, one of our best examples, or my favorite example, is the Pacific Community: the Pacific Data Hub, 22 nations.

And largely there's a lot of research going on in that area of the world. There's a lot of threats with regard to climate change, and the evidence of that sort of research is really important to bring the lobbying capacity of that region and support the economic development, protect environments, all sorts of things.

So, that project, that data portal has grown significantly over the years and it's a key asset now for that region. It helps 'em with their lobbying, helps 'em with their evidence, and helps 'em with their policy development over time.  

Anthony: Yeah, cool. And which is all great, but how do you start to deal with the issues of security and privacy with these data sets?

We've been talking a little bit on this podcast about DOGE and what's happening in the US, and you know, the principles that sit around, which is an open data play in some ways, and there are some ways to think about that, but how does open data deal with those risks?  

Steven: Institutions that wanna publish data sets will typically have their own internal controls, and they control, manage and operate their data internally: how they work with data, how they collect data, how they govern those collections. And that can be the full pipeline. As you would obviously know, records management is key to that as well.

The records of how to collect and manage data sets is also a record that gets managed. It's the governance model, the policies, the procedures, the, you know, ISMS, the security management framework. So, these are all things that get managed within organizations, but just like in the real world, when you wanna cross a border and go into a different territory, a more public territory or a foreign space, you typically have declarations.

You typically have ways and means by which you do release the data. Public data and CKAN are largely about published artifacts that go through that workflow, go through that pipeline. It harvests from different information sources and it has API points so that you can publish into it via API. So, in my view, the source of truth, source of governance, source of management is always local to the security policies.

And then if you have a public publishing agent such as CKAN, you make sure that border control is well within your territory.
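To make the "border control" idea concrete, here is a minimal sketch of publishing into CKAN via its Action API only after an internal release check has passed. The site URL, the token, and the `cleared_for_release`, `slug`, `description` and `organization` field names on the source record are assumptions made up for illustration; `package_create` and its `name`, `title`, `notes` and `owner_org` parameters are part of CKAN's documented API. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical values: replace with a real CKAN site and API token.
CKAN_URL = "https://ckan.example.org"
API_TOKEN = "replace-me"


def build_package_create_request(record: dict) -> urllib.request.Request:
    """Build (but do not send) a package_create call for a cleared record."""
    # "cleared_for_release" is a made-up field standing in for whatever
    # approval the internal governance process records.
    if not record.get("cleared_for_release"):
        raise ValueError("record has not passed internal release checks")

    payload = {
        "name": record["slug"],  # URL-safe dataset identifier
        "title": record["title"],
        "notes": record["description"],
        "owner_org": record["organization"],
    }
    return urllib.request.Request(
        url=f"{CKAN_URL}/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": API_TOKEN,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would perform the actual publish. Keeping the check on the source side mirrors the point that governance stays local and only the published artifact crosses the border.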

Kris: So, you've jumped a little bit ahead there, Steve, but in starting to talk about CKAN and that publication.

So, maybe give us, just again, for the audience, I would think that open data is probably a fairly familiar concept for the people who are listening to FILED, but CKAN is probably a little bit different and, you being the co-steward, and I'll dive into that in a little bit, but talk to us just a little bit about, what is CKAN and how is it used?

Steven: Yeah, well, it emerged out of the Open Knowledge Foundation. So, it emerged at a time when the economic debate, policy debate was occurring around open data, open government, coming from an organization that is advocating for more open knowledge, whether that's in government or it's in galleries, libraries, and museums everywhere.

So, CKAN emerged as like this way of easily publishing heterogeneous collections of data sets. It's unstructured, wildly diverse. Different publishers, different organizations and hierarchies of organizations, but then trying to find a singular schema for access and discovery methods, right? So, it's a pretty simple application.

It's just a create, read, update, delete kind of application. But then all the wiring is around data management, data access. So, how it emerged really is it was at the point where it had the right intersection between political will, bureaucratic will, technical will. And it's got the kind of the winds of civic hackers and developers and all sorts behind it.

So, I would say its path dependency now is pretty solid. It's not so much the tech, but the people behind the tech and the communities behind the tech that make it continue to be successful, which is, you know, why we're integrating with other similar initiatives like MLCommons Croissant.

 

Kris: Beautiful. You don't necessarily realize how much of this is out there. Like, being a parent with a number of children, I use My School to understand which schools are doing well, how students are comparing in terms of NAPLAN and other results, which is sort of the Australian schooling system.

Having lived in London more recently, like a number of applications that I would just use on a regular basis around train timetables and other things all feed off of these open data portals. They're constantly evolving. They're constantly adding more value. So, even just the Bureau of Meteorology, for example, having different types of weather apps. I'm a bit of a boatie, so having an app that's a little bit more focused on wind and tide and swell,

as opposed to just, you know, is it gonna be sunny? Is it raining? These types of applications are able to kick on. And that sounds all very, you know, linked heavily to government. But I was reading in the preparation for this that Lego is a huge user of this. Do you have other examples or do you have some detail there about, you know, why this makes sense for them?

Steven: Yeah, well, Lego's a bit of a special case. It was the first time in my life that I was actually a hero to my kids with the job that I had, ‘cos I got to fly to Berlin and do a presentation to Lego and talk about CKAN, and they got to come with me, and yeah, Lego World's been amazing.

But in that team, basically they had traditional business intelligence type folks, and then they had a new team, a big data team. So, the office of the CO is looking, well, how do we bring these two types of teams together? Different sorts of data teams with different approaches. So, CKAN was brought in as a lightweight thing to sit atop the two different pieces of infrastructure, the different technical toolings, I guess you'd say.

And particularly in corporate settings, that’s a pretty common use case for CKAN. Like I said, it's just a CRUD application. It can sit on top, and it can harvest from different sources. So, if you want a clear picture fairly quickly, you can drop CKAN in and get it. You don't have to go through a big internal change management program.

That's sort of the thing, right? So, between those two teams, you get a lot of visibility of each other's data sets. I wouldn't say that's a common use case everywhere, but that's how we got invited into that opportunity at the time.

Kris: It's interesting hearing some of the words you use there though, right?

Like it's a large organization, very unstructured data sets everywhere, trying to get governance. That sort of starts to, to ring true again for the audience here at FILED. Hopefully from a business sense, they're paying you in Lego sets and you have the Death Star and all the other cool ones, right?  

Steven: Yeah, they have build tables there so you can just walk around.

You can build whatever you want. And I did get a Lego Technic motorbike. A BMW.

Kris: Yeah, nice. That's exactly how it should be. Fantastic. And look, I think that tie there, and certainly we were talking about this earlier. Do you have an opinion on how that then ties in, for more of this audience, to that information governance?

Like we sort of spoke very quickly about privacy and security. Is it better if an organization has a good handle on its data before they get to dropping these things in? So, does it make it easier from those workflows into CKAN?

Steven: Well, yeah. I'd say transparency is never bad, but transparency without oversight is kind of a little bit pointless.

And even better than that, transparency with oversight with a planned management framework for governing that oversight, it's even better. So, you might rely on public oversight in the sort of open government models, but within a corporate setting, they still need to have their own oversight models applied.

Just the sake of transparency to make it more discoverable is sort of okay. Largely, again, CKAN can be used. ‘Cos you know we're web developers traditionally, so some of the social engineering that might come with it is kind of key. Integrating it with other discovery models, like who are those BI application developers, those BI folks, and then having profiles of the data prep that they do or the data insights that they generate, and then helping people find the best people in the organization by dataset or by data visualization is kind of a good thing to have in terms of organizational discovery as well.

But yeah, without governance, internal data catalogs are kind of up for the wind, whoever, however they might use them.  

Anthony: And that's an interesting point. Should we think about CKAN as just another form of data catalog or is it deeper than that? What's the extremities of those things as we start to think about it?

Steven: Well, there's a couple of things that it's not trying to be. Even though it can integrate with those things like data discovery, like a deep interrogation of your data assets, it's not trying to be that, largely because there are those barriers of different protocols, different security models, different sorts of handshakes

to get into and understand data. Kind of bringing back that synthesized view to enable metadata, for example, on the fly is not really what CKAN is built for at the moment, but it is built for the concept of information systems doing that and then surfacing a packaged data set, a package declaration of a schema with a data dictionary, and then making that discoverable in something like CKAN. The sweet spot is really, like I said, that light touch where you can put something over an existing structure and make it more discoverable, particularly if you've got different organizational governance models.

So, in government there is a particular, kind of obvious one, different government departments. So, if you have a jurisdictional data portal, like a state or province, or you have a national data portal, usually the ministries or agencies are very different organizations, with different remits and different legislative requirements as well.

Because most countries have this constitutional sort of structure that precipitates down to legislative acts, that creates the ministries, creates the programs, creates the projects, creates the data. So, if you have this jurisdictional thing, it might be at this legislative level of an --- for nationhood.

CKAN can sit on top of that, represent all of the kind of structural integrity of that jurisdiction, and then have the heterogeneous collection of data for discovery.

Anthony: And CKAN's not the only set of standards out there, right? Socrata, I think, is one of the other ones that I've come across, right, that are trying to do these things with different governments, with different government agencies. One of our customers is the City of New York. We have been working in New York City around different ways to push data in and out, and there's quite a lot of different challenges out there.

Is there going to be some harmonization of some of these different approaches, or do you see them as quite divergent?  

Steven: Oh, they're all very different. So, DKAN is like a Drupal-based version of CKAN, and its data store is pretty much the same sort of schema, the same sort of approach in that sense. It was put forward because at the time, government agencies were using Drupal, and they would be more inclined to have PHP support teams. So, a Drupal-based version of CKAN made some sense. With regard to Socrata, that would be the one you're thinking of that got bought by Tyler. There's Opendatasoft as well, which is a French-based one.

A couple of these emerged at the same time as CKAN at the nexus of the political interest, the economic interest. A lot of the discussion, integration with hackathons and civic tech. But in large part, you know, national government is a small sector, like it's a very small market. And getting into governments all around the world is very difficult.

You need proximity and local understanding. So, it's very difficult to do a global open data platform as sort of like a SaaS without a hell of a lot of salespeople and a hell of a lot of contracts, and a hell of a lot of heavy lifting. And by the time you've done all that, you don't really know.

You don't have a product that's gonna be unique to that jurisdiction and that customer, and you're not gonna have the retention or the stickiness with that customer. Anyone with proximity to that customer is gonna take it. I think a lot of those global approaches end up pivoting into something that's a bit more specific, like smart city dashboards or particular financial dashboards or something that internal and sticky to that organization, the way they structure and publish their data.

CKAN just remains agnostic and ‘cos it's open source, it can get that local deployment, that local expertise that really understands how to support their government. Civic tech can pick it up and deploy it. No, it makes sense. You're speaking to the preacher, not even the converted. So, I have to have bias.

Anthony: It's really good though to hear your view on that, where there are some similarities, but there's clear differences around how these things operate. We had an earlier guest on the FILED podcast talking about CDR, so the customer data projects happening in Australia, and it was really interesting to talk about how they're approaching data.

Do you see an interplay with where you're taking CKAN and where this is going to things like customer data records and other backbone processes that are occurring within financial institutions that governments are actually coordinating. Like where's the extremity of that in your view?  

Steven: Again, to me, not to say it's not possible, but working with data is different to publishing data. And so, something like customer records, you need to know that that customer data is held with integrity. Customers change and they have rights and identities that change over time as well. It's not just static data, it's active data.

When it becomes published as an artifact, it's like a canonical, reusable, interoperable thing. So, it's a difference between a tool that might help you write books and compose magazines versus a library which just catalogs them and makes them available once that's all done. So, you can say that libraries could become the places where writers come to collect and talk about how to write good books.

I think that's the same with CKAN. I think that there's this notion of community that emerges around libraries. Every community has a library, and big cities of the world have amazing libraries. It becomes the heart of that city, and I think the open data portal has the potential to be that for

digital communities as well, that it's not just the data that's there, it's not just the record and the receipts that are there. There’s a collective interest in coming together and actually talking about community, talking about concepts and using the data to form that conversation. So, that's where I see CKAN going.

Anthony: I love that analogy. That library analogy is fantastic. But switching gears slightly and talking about where things are going, we really have to bring AI and large language models and GPTs and whatever other three-letter acronym into this conversation because it's just such a hot topic, I think, out in the information and data world.

Where do you see that evolution? What's the impact of this conversation around data sets and archival data? Is it starting to be used inside of machine learning and those other processes?  

Steven: Yeah, sure. Well, we spoke earlier about, what is open data? And I gave three definitions of it, but the fourth emerges when you think about AI and observability.

So, to me, AI, when it first kicked off the GenAI, I was thinking, this is great. GenAI is not the starting point because agency is gonna require an understanding of what's pro and contra to one's agency, and if you're gonna have that, then you need to know what's pro or contra to collective agency because those two usually butt up against each other.

The observability though, as we see in the conversation around AI and agentic AI, agents that act with intent on behalf of others, becomes kind of a key point. So, the openness that we can observe in the real world and the discernment that AI can have with that means that it can actually collect open data, it can collect data from the environment, and then potentially publish that out as an artifact that anyone might be able to then reuse.

So, that to me is gonna be a key point and how we manage that ethically and how we manage that with intent and how we manage that so that the agency of everyone is protected or whether it's privacy or whether it's sort of like the discreteness of individuals is somehow managed into the collective interests.

I think that's gonna be a key thing that absolutely has to happen as we develop AI and machine learning is part of that. The point about Croissant is you can have data that you used for training, but then you have to have data that will be used to validate that. And between those two, especially with the government, you need to know that that's within ethical boundaries.

That's kind of open and transparent so people can check it out. And certainly, as we start to look at what government does, it's not like with DOGE where you have to find efficiency everywhere. Government's not necessarily about efficiency to say, what's the best way to solve this problem?

You need to solve the problem, not just for 80%, but for the whole 100%, right? That last 20% is actually the biggest job for government. How do we look after those folks? When we start to employ AI, we're gonna have to really look at the data that we're using to train models that are gonna support everyone.

'Cos the data will inherently probabilistically come up with solutions for the 80%, not the 20%.  

Anthony: Yeah, and it creates such black box problems, right? So, you have these complex systems that sit in front of that, whether that's, or systems we have, workflows we have today. How do you see that we're able to create more opaqueness in the data rather than necessarily having to expose the black box of that?

And so, like to draw on that analogy to make it clearer, you know, you have this situation with ChatGPT or Anthropic or some of these other brands you've thrown around, where they will say to you, we cannot explain how this comes up with answers. It just does, trust us. But they also don't want to create any opaqueness around what data sets were given, what is the layering of that data, what is the tuning of the models?

That's obviously not gonna work when we come to creating standards in a government world. So, there is a real conflict there between creating these open sets of data that can be leveraged to create agent endpoints and, as you say, you know, humans and their data and the agency of that data.

How does that come together in your view? And are you looking at solutions in that realm?  

Steven: So, how does it come together? Well, first of all, governments will start to regulate and publish data sets that are appropriate for machine learning in particular context.

So, if AI is gonna be regulated around ethical use, then those data sets become important. We wanna make them discoverable so they'd be reusable for others. Different collections of their data sets might be wired into different models so that they can be used in combination. You might get meta-analysis as more governments in more sectors publish more say infrastructure orientated or mobility orientated ML data sets.

So, that's how that kind of comes together from a community of practice: ethical, you know, government, community-endorsed type practice. Then there's the large-scale activity that's going on at the moment, which is like the gold rush for data. And the reason why it's such a black box is everyone's soaking up as much as they can to get as much inference as they can for as much data as they can.

‘Cos they're looking for those nuggets, those seams of gold, that others won't get ‘cos they don't have all the data. I don't know if that gold rush is gonna get that kind of data or the kind of insight. I think it's gonna be more likely to refine your data sets such that the data sets are so pure that the models become inherently valuable.

So, again, this comes down to where I'm thinking on the last part of your kind of questions. A big question is observability in the right context. The discernment of those environments means that we can get data that can then be used to kind of interoperate via AI with others, but that's also privacy.

Anthony: And I think there's another project that you've been working on, I believe, from what I've read, and I haven't had a chance to play with it properly and would love you to talk some more about it.

It starts to give us ways, or at least a library of tools, to create better representations of that data, understanding some of these constraints. Is that fair?

Steven: Yeah. I mean, the Croissant thing for CKAN is just a pretty easy integration. It plugs straight into the DCAT extension that's already there.

So, DCAT will just publish out profiles, schemas for the data catalog, and we've got one there for schema.org. So, you can index your CKAN site and it'll go into the Google dataset search. The same thing can happen with Croissant now, so you can export out your data sets with the right sort of profile that helps other ML researchers and such discover data sets that are suitable for training and then suitable for validation. And then any of that kind of additional metadata that becomes more valuable over time, the provenance, the lineage and where this data has come from, the consent models that have been applied, things like that, that all becomes more declarative, universally discoverable. So, I think that's part of the solution, but it's largely a workflow thing.

It helps those ML researchers discover this across the globe. So, CKAN catalogs are helpful for that. But where I see this going, in more of a sci-fi sort of concept, and the way that I'm really excited about, is this objective observation that we can create data that is uniquely ours. We can create our own data sets and our own kind of lived experiences that are observed such that we can then create our own agentic profiles that become more interoperable with others, you might say.

I think the concept of data is not just for organizations and institutions. I think the concept of data in the future of AI interoperability with individuals and with homes and domiciles is in my concept. We need to be able to push back with our own agents, and our own agents have to have the data about us to negotiate on our behalf with the agentic others that are aiming to provide us with services and all sorts.
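As a rough illustration of the export Steven describes, the sketch below maps a CKAN-style dataset record onto a simplified Croissant-like JSON-LD description. It is not the code the CKAN extension uses, and the `@context` is trimmed down from what the Croissant specification lists, so treat the output as indicative only. The input keys (`name`, `notes`, `license_url`, `resources`, and each resource's `id`, `name`, `url`, `mimetype`) follow CKAN's usual dataset fields; the function name is made up for this example.

```python
import json


def ckan_package_to_croissant(pkg: dict) -> dict:
    """Map a CKAN-style dataset dict to a simplified Croissant-like JSON-LD dict."""
    return {
        # Trimmed context: the full Croissant context defines many more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": pkg["name"],
        "description": pkg.get("notes", ""),
        "license": pkg.get("license_url", ""),
        # Each CKAN resource becomes one downloadable file object.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": res["id"],
                "name": res.get("name", res["id"]),
                "contentUrl": res["url"],
                "encodingFormat": res.get("mimetype", "application/octet-stream"),
            }
            for res in pkg.get("resources", [])
        ],
    }


if __name__ == "__main__":
    # Made-up sample record for demonstration only.
    sample = {
        "name": "rainfall-2024",
        "notes": "Monthly rainfall totals.",
        "license_url": "https://creativecommons.org/licenses/by/4.0/",
        "resources": [
            {
                "id": "rainfall-csv",
                "name": "rainfall.csv",
                "url": "https://ckan.example.org/rainfall.csv",
                "mimetype": "text/csv",
            }
        ],
    }
    print(json.dumps(ckan_package_to_croissant(sample), indent=2))
```

Provenance, lineage and consent details, which Steven mentions as increasingly valuable, would need additional fields that this sketch leaves out.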

Kris: I just stepped into 3025, Steve, that's exactly what just happened there.  

Anthony: 3025. I'm pretty sure Steve described just next year, Kris.

Kris: Highly likely. Correct.  

And so, I think to get to the end here, I think we've had a great conversation. And as you said, Steve, there's probably lots of other pieces that we can dive into.

For me, there's a ton of convergence here. You know, there's that governance and usability that you've spoken about. There's the policy and the technology pieces that, you know, we're trying to bake into that publishing pipeline, right? Like to really understand the responsibility of AI.

So, you know, responsible AI being a key objective of these. Even just for the audience, simple things like: this data set was collected from an 80%, you know, male population, for example, means that there's a genuine bias that's going to be baked into that. And as you say, that 80 versus 20, it's the last mile, it's that last 20%

that is the goal of government, because they're there to govern a hundred percent. And I think that that's where this gets super exciting. To give you an opportunity, a little bit of that piece here to wrap up, what do you see as the advantage of AI-ready public data?

And you've probably given us a little bit of a vision there as to where you go, and what role do you see Croissant and CKAN playing in society? Like, really draw that out for us, ‘cos you know, we've got a world now where there's just too much data for humans to analyze altogether.

So, how's that wrap up for us? Here's that huge, broad question again, and, if you can, how does information governance underpin this?

Steven: Yeah, so, let's see if I can wrap it up. I see a world where AI is based on statistical probabilities, right? The most probable solutions will be proposed, but most people in one way or another are actually statistical outliers.

So, the role of governance is to make sure that we govern for discreteness, we govern for individuals. We govern for the many that are actually very uncommon. The uncommon sense basically is what we're looking for. And common sense, if that's taken over by AI, then it's not gonna be that helpful.

There's gonna be some aspect in everyone's life that is very uncommon, and we're gonna be beating ourselves up against a wall to try and get that addressed if we don't actually address that. So, things like Croissant, things like CKAN, things like data collections and governance that move us to cover those, those gaps in society are gonna be really important as we move forward.

No matter how much we automate, no matter how much we might wanna optimize, we're the worst of us if we don't pay attention to the best of us. I like that.  

Anthony: No, that's great. Look, there's so much more. I really appreciate you spending some time with us, Steven. I'd love to drill in and maybe I'll come down and find you in Canberra and buy you a beer and I'd probably need a whiteboard but be around a whiteboard and we can map through some things here.

We had a lot of conversations here that I’d love to pick up other pieces on. Yeah, I really appreciate you coming on the podcast today.  

Steven: Well, thanks Anthony and thanks Kris. It's fun. I love talking about this stuff and it's great to wake up in the morning and talk about some more.

That’s terrific.  

Anthony: Look, thanks everyone for listening. I'm Anthony Woodward.

Kris: And I'm Kris Brown. We'll see you next time on FILED.
