On March 21, President Biden warned of cyberattacks from Russia and reiterated the need to improve the state of domestic cybersecurity. We live in a world where adversaries have many ways to infiltrate our systems. As a result, today’s security professionals need to act under the premise that no part of a network should be trusted. Malicious actors increasingly have free rein in cyberspace, so failure must be presumed at each node. This is known as a ‘zero trust’ architecture. In the digital world, in other words, we must now presume the enemy is everywhere and act accordingly.
A recent executive order from the Biden administration specifically calls for a zero-trust approach to securing the United States government’s data, building on the Department of Defense’s own zero-trust strategy released earlier this year.
The digital world is now so fundamentally insecure that a zero-trust strategy is warranted anywhere computing is taking place — with one exception: data science.
It is not yet possible to accept the tenets of zero trust while also enabling data science activities and the AI systems they give rise to. This means that just as calls for the use of AI are growing, so too is the gap between the demands of cybersecurity and an organization’s ability to invest in data science and AI.
Finding a way to apply evolving security practices to data science has become the most pressing policy issue in the world of technology.
The problem with zero trust for data
Data science rests on human judgment, which is to say that in the process of creating analytic models, someone, somewhere must be trusted. How else can we take large volumes of data, assess the value of the data, clean and transform the data, and then build models based on the insights the data hold?
If we were to completely remove any trusted actors from the lifecycle of analytic modeling, as is the logical conclusion of the zero-trust approach, that lifecycle would collapse — there would be no data scientist to engage in the modeling.
In practice, data scientists spend only about 20% of their time engaged in what might be considered “data science.” The other 80% of their time is spent on more painstaking activities such as evaluating, cleaning, and transforming raw datasets to make data ready for modeling — a process that, collectively, is referred to as “data munging.”
Data munging is at the heart of all analytics. Without munging, there are no models. And without trust, there can be no munging. Munging requires raw access to data, it requires the ability to change that data in a variety of unpredictable ways, and it frequently requires unconstrained time spent with the raw data itself.
Now, compare the requirements of munging to the needs of zero trust. Here, for example, is how the National Institute of Standards and Technology (NIST) describes the process of implementing zero trust in practice:
…protections usually involve minimizing access to resources (such as data and compute resources and applications/services) to only those subjects and assets identified as needing access as well as continually authenticating and authorizing the identity and security posture of each access request…
By this description, for zero trust to work, every request to access data must be individually and continually authenticated (“is the requester who they claim to be?”) and authorized (“should the requested access be granted or not?”). In practice, this is akin to inserting administrative oversight between a writer and their keyboard, reviewing and approving every key before it is punched. Put more simply, the need to munge, which demands pure, unadulterated access to raw data, undermines every basic requirement of zero trust.
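To make the NIST description concrete, here is a minimal sketch of a per-request check. Everything in it is hypothetical: the `AccessRequest` fields, the `GRANTS` table, and the `authorize` function are illustrative names rather than part of any NIST specification or product, and authentication is reduced to boolean flags that a real identity system would have to supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """One request to read a dataset (all fields are hypothetical)."""
    user: str
    token_valid: bool        # assumed to come from an identity provider
    device_compliant: bool   # assumed to come from a device posture check
    resource: str


# Hypothetical grants: each user may read only the datasets listed here.
GRANTS = {
    "alice": {"claims_2021"},
    "bob": {"claims_2021", "claims_2022"},
}


def authorize(request: AccessRequest, grants: dict) -> bool:
    """Evaluate a single request; nothing carries over from earlier requests."""
    if not request.token_valid:
        return False  # authentication failed
    if not request.device_compliant:
        return False  # security posture check failed
    # Authorization: is this resource among those the user needs?
    return request.resource in grants.get(request.user, set())


allowed = authorize(AccessRequest("alice", True, True, "claims_2021"), GRANTS)
denied = authorize(AccessRequest("alice", True, True, "claims_2022"), GRANTS)
```

Because the check runs on every request, a data scientist iterating over raw data would trigger it hundreds of times in a single session, which is the friction described above.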
So, what to do?
Zero trust for data science
There are three fundamental tenets that can help to realign the emerging requirements of zero trust to the needs of data science: minimization, distributed data, and high observability.
We start with minimization, a concept already embedded into a host of data protection laws and regulations and a longstanding principle within the information security community. The principle of minimization mandates that no more data is ever accessible than is needed for specific tasks. This ensures that if a breach does occur, there are some limits to how much data is exposed. If we think in terms of “attack surfaces,” minimization ensures that the attack surface is as shallow as possible: any successful attack is blunted because, even once successful, the attacker will not have access to all the underlying data, only some of it.
This means that before data scientists engage with raw data, they should justify how much data and in what form they need it. Do they need full social security numbers? Rarely. Do they need full birth dates? Sometimes. Hashing, or other basic anonymization or pseudonymization practices, should be applied as widely as possible as a baseline defensive measure. Ensuring that basic minimization practices are applied to the data will serve to blunt the impact of any successful attack, constituting the first and best way to apply zero trust to data science.
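As a rough illustration of what baseline minimization could look like before data reaches a data scientist, the sketch below replaces a social security number with a keyed hash and coarsens a birth date to a year. The record layout, the field names, and the `minimize_record` helper are assumptions made for this example. A keyed hash (HMAC) is used instead of a plain hash because the space of possible social security numbers is small enough to enumerate. In practice the key would live in a secrets manager, not in code.

```python
import hashlib
import hmac


def pseudonymize(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest that stands in for the raw identifier."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize_record(record: dict, key: bytes) -> dict:
    """Keep only what modeling needs (hypothetical field names)."""
    return {
        "patient_token": pseudonymize(record["ssn"], key),
        "birth_year": record["birth_date"][:4],  # assumes ISO dates, YYYY-MM-DD
        "diagnosis": record["diagnosis"],
    }


# Illustrative key and record only; neither is real data.
key = b"example-key-for-illustration-only"
raw = {"ssn": "078-05-1120", "birth_date": "1984-06-02", "diagnosis": "J45"}
minimized = minimize_record(raw, key)
```

The same input always maps to the same token, so records can still be joined across tables without exposing the underlying number.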
There are times when minimization might not be possible, given the needs of the data scientist and their use case. At times in the healthcare and life sciences space, for example, there is no way around using patient or diagnostic data for modeling. In this case, the following two tenets are even more important.
The tenet of distributed data requires the decentralized storage of data to limit the impact of any one breach. If minimization keeps the attack surface shallow, distributed data ensures that the surface is as wide as possible, increasing the time and resource costs required for any successful attack.
For example, while a variety of departments and agencies in the US government have been subject to massive hacks, one organization has not: Congress. This is not because the First Branch itself has mastered the nuances of cybersecurity better than its peers but simply because there is no such thing as “Congress” from a cybersecurity perspective. Each of its 540-plus offices manages its own IT resources separately, meaning an intruder would need to successfully hack into hundreds of separate environments rather than just one. As Dan Geer warned nearly two decades ago, diversity is among the best protections against single-source failures. The more distributed the data, the harder it will be to centralize and therefore compromise, and the more protected it will be over time.
However, a warning: Diverse computing environments are complex, and complexity itself is costly in terms of time and resources. Embracing this type of diversity in many ways cuts against the trend towards the adoption of single cloud compute environments, which are designed to simplify IT needs and move organizations away from a siloed approach to data. Data mesh architectures are helping to make it possible to retain decentralized architecture while unifying access to data through a single data access layer. However, some limits on distributed data might be warranted in practice. And this brings us to our last point: high observability.
High observability is the monitoring of as many activities in cyberspace as is possible, enough to be able to form a compelling baseline for what counts as “normal” behavior so that meaningful deviations from this baseline can be spotted. This can be applied at the data layer, tracking what the underlying data looks like and how it might be changing over time. It can be applied to the query layer, understanding how and when the data is being queried, for what reason, and what each individual query looks like. And it can be applied to the user layer, understanding which individual users are accessing the data and when, and monitoring these elements both in real-time and during audits.
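One simple way to picture a baseline at the query layer is a z-score over historical counts, as in the sketch below. The metric (rows returned per day by one user), the threshold of three standard deviations, and the `is_anomalous` function are all assumptions chosen for illustration; real monitoring would likely use richer features and models.

```python
from statistics import mean, stdev


def is_anomalous(history: list, observed: float, threshold: float = 3.0) -> bool:
    """Flag `observed` if it sits more than `threshold` standard deviations
    from the mean of `history` (a deliberately simple baseline)."""
    if len(history) < 2:
        return False  # too little data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return observed != mu  # any change from a constant baseline stands out
    return abs(observed - mu) / sigma > threshold


# Hypothetical daily row counts returned to one user over a working week.
rows_per_day = [100, 110, 95, 105, 90]
typical_day = is_anomalous(rows_per_day, 102)
bulk_export = is_anomalous(rows_per_day, 5000)
```

A day close to the historical average passes quietly, while a sudden bulk export stands far outside the baseline and would be surfaced for review.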
At a basic level, some data scientists, somewhere, must be fully trusted if they are to successfully do their job, and observability is the last and best defense organizations have to secure their data, ensuring that any compromise is detected even if it cannot be prevented.
Note that observability is only protective in layers. Organizations must track each layer and their interactions to fully understand their threat environment and to protect their data and analytics. For example, anomalous activity at the query layer might be reasonable in light of the user activity (is it the user’s first day on the job?) or due to changes to the data itself (did the data drift so significantly that a more expansive query was needed to determine how the data changed?). Only by understanding how changes and patterns at each layer interact can organizations develop a sufficiently broad understanding of their data to implement a zero-trust approach while enabling data science in practice.
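The interaction between layers can be sketched as a small triage rule. The three input signals and the outcome labels below are hypothetical, and a real system would weigh far more context, but the shape of the logic follows the examples above: an unusual query is interpreted in light of what the user and data layers report.

```python
def triage(query_anomalous: bool, user_is_new: bool, data_drifted: bool) -> str:
    """Combine signals from the query, user, and data layers (illustrative)."""
    if not query_anomalous:
        return "ok"
    if user_is_new or data_drifted:
        # Another layer offers a plausible explanation, so ask a human to
        # confirm rather than treating the query as an incident.
        return "review"
    return "alert"  # unusual query with no explanation from the other layers


first_day_on_job = triage(query_anomalous=True, user_is_new=True, data_drifted=False)
unexplained = triage(query_anomalous=True, user_is_new=False, data_drifted=False)
```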
Adopting a zero-trust approach to data science environments is admittedly far from straightforward. To some, applying the tenets of minimization, distributed data, and high observability to these environments might seem impossible, at least in practice. But if you don’t take steps to secure your data science environment, the difficulties of applying zero trust to that environment will only become more acute over time, rendering entire data science programs and AI systems fundamentally insecure. This means that now is the time to get started, even if the path forward is not yet fully clear.
Matthew Carroll is CEO of Immuta.