Decision Ready Data, What is Data Health/Bias?
- by Tallgrass
- July 20, 2020
There are simple tests to evaluate the health of data. When a decision is made, can you measure compliance? If you can, then your data’s bias is within the threshold of that decision. But not all decisions are created equal: tactical decisions have a short loop to follow before they are complied with. Strategic decisions are complex and influence many decision loops, some years into the future.
Most strategic decisions are abandoned before they can be measured!
The resistance to this level of decision is unmeasurable, or is it? Bias clouds our performance in ways that are hard to identify. Only when bias can be measured thoroughly can it be removed and/or controlled. When it is controlled, data can speak and reveal the performance stories contained therein.
We are at a critical juncture with data. Machine learning and AI are driving more and more of our core processes, and soon our decisions. If we do nothing, they will inherit these dysfunctions. Is it too late? How do you know?
Let’s start our discussion with one of the most complex challenges of data health: bias.
Bias, by definition, means to have a subjective inclination. When it comes to data, no one is safe from data bias. There are many reasons why data bias is the norm; we will cover them, and more importantly what we can do to identify it, measure it, and keep it at bay. You might think that human bias is at the foundation of data bias, but you would be mostly wrong. It’s not what we humans do directly to data, but what we don’t do as the events that create and capture it evolve.
Where does Data Bias Come From?
Before we talk about how existing data is applied, let’s talk about where it starts to go wrong within the source systems themselves. Semantics is one of the key reasons data bias exists. This type of bias happens in a couple of ways: one moment something means one thing; the next, it is redefined as something else, or data from two different sources is equated as if it meant the same thing.
Consider a process change that splits a behavior into a new state but does not redefine or acknowledge the changes to the existing one. If you now treat the legacy codes as if they carried the same meaning, you introduce a myriad of unintended consequences. Most times, more rules are put in place to circumvent and conceal the imbalance caused in the inheriting processes and KPIs.
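A minimal sketch of that failure mode, using an invented ticketing example: suppose a process change splits the legacy "CLOSED" status into "RESOLVED" and "CANCELLED", but the old rows are never re-coded. All field names and statuses here are illustrative assumptions, not from any real system.

```python
# Invented ticket data spanning the process change.
tickets = [
    {"id": 1, "status": "CLOSED"},     # legacy era: could be either outcome
    {"id": 2, "status": "CLOSED"},
    {"id": 3, "status": "RESOLVED"},   # post-change era
    {"id": 4, "status": "CANCELLED"},
]

# Naive KPI: equate legacy CLOSED with the new RESOLVED code.
# The resolution rate silently absorbs work that may have been cancelled.
naive_resolved = sum(t["status"] in ("CLOSED", "RESOLVED") for t in tickets)
print(naive_resolved / len(tickets))   # 0.75 -- inflated

# Safer: keep legacy rows in their own bucket until they are reconciled.
buckets = {}
for t in tickets:
    key = "LEGACY_CLOSED" if t["status"] == "CLOSED" else t["status"]
    buckets[key] = buckets.get(key, 0) + 1
print(buckets)  # {'LEGACY_CLOSED': 2, 'RESOLVED': 1, 'CANCELLED': 1}
```

The second approach makes the ambiguity visible instead of concealing it behind a rule, which is exactly the reconciliation the text argues for.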
The impact on KPIs is only the immediate and most evident result; how the change relates to dependent processes within the workflow is almost impossible to see. There is a ripple effect on processes, even when a change to the underlying logic is subtle. Over time these changes turn into logical time bombs that cannot be disregarded, and a painful reconciliation is required.
The second form of semantic data bias occurs when two similar but independent processes claim the same meaning but are defined differently. In basic legal contracting there are classic terms that cause a lot of trouble, like “failure” and “repair”, among too many to list. Not to state the obvious, but how we define something will have huge implications for how it is applied later. The company with the strongest legal team will eventually win the final definition.
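To make the “failure” example concrete, here is a small sketch with invented downtime data and invented thresholds: two departments count “failures” from the same events and get different answers, because each owns its own definition.

```python
# Invented maintenance events; the field names and thresholds are
# illustrative assumptions, not from any real contract.
events = [
    {"asset": "pump-1", "downtime_hours": 0.5},
    {"asset": "pump-2", "downtime_hours": 6.0},
    {"asset": "pump-3", "downtime_hours": 30.0},
]

# Operations' definition: any unplanned downtime counts as a failure.
ops_failures = sum(e["downtime_hours"] > 0 for e in events)

# Legal's definition: only downtime over 24 hours triggers the
# contractual "failure" clause.
legal_failures = sum(e["downtime_hours"] > 24 for e in events)

print(ops_failures, legal_failures)  # 3 1 -- same data, two "failure" metrics
</imports>```

Until the two definitions are reconciled, any report that joins or compares these counts is biased by whichever definition happened to feed it.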
Next, let’s cover the typical case of evolved data bias. Here, company A purchases company B, and like events are not fully appreciated when the companies are conjoined. Even when a common event exists, rarely will it be defined the same way. It’s here that we witness the fine art of sausage making in an attempt to “smooth out” the unintended consequences and extrapolate unbiased information.
Statistical modeling exists to correct bias by taking a normalized sample, but even then the bias is inherited and the outcomes will prove invalid. Tomorrow’s information will be brought to you by impressive technologies like machine learning and AI. Machine learning built on bias will hide it in equally innovative ways. As we add more layers to our data, it will be harder to evaluate outcomes when they fail to relate to our actions.
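The point about sampling can be shown in a few lines. In this invented example, the source system under-records events from one region, so the stored data is skewed 80/20 even though reality is 50/50; a perfectly random sample faithfully reproduces the skew rather than correcting it.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Stored data already skewed 80/20 by under-recording of region B,
# even though the real-world split is 50/50 (invented numbers).
stored = ["A"] * 800 + ["B"] * 200

# A "normalized" random sample inherits the skew of its source.
sample = random.sample(stored, 100)
share_a = sum(r == "A" for r in sample) / len(sample)
print(round(share_a, 2))  # roughly 0.8, not the true 0.5
```

Sampling only controls sampling error; it cannot recover information the source system never captured.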
Ironically, resolving data bias is not a technical challenge but a cultural one. It takes a very brave executive to want to pull back the layers of the extrapolated results that drive their company. Most likely they will not like what they find, but until they do, further layers will be created that inherit the dysfunction.
When change, true change, is requested, there are simple cascading laws of Data Health that can be put into place.
A Quick Overview, TG’s Cascading Laws of Data Health
Availability – The expectation and delivery of data is traced for consistency per event or job.
Linkages – Joins between tables are logged after the availability check above has been performed.
Field Consistency – Fields are validated for population, masks, spot metric calculation, and type.
Business Rules and Behavior – As data passes the initial three levels of health, this layer allows complex rules to be tested and validated across tables. The opportunity of this layer is proving to be endless. Combined with machine learning, it can align process-level alerts to top-down business rules when they are out of compliance, and even prescribe corrective actions.
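The cascade above can be sketched as a short pipeline in which each law only runs if the previous one passed. The table names, fields, and rules below are invented for illustration; a real implementation would draw them from the source systems and the business.

```python
# Invented sample tables for the sketch.
orders = [{"order_id": 1, "customer_id": 10, "amount": 99.5},
          {"order_id": 2, "customer_id": 11, "amount": 42.0}]
customers = [{"customer_id": 10}, {"customer_id": 11}]

def availability(tables):
    # Law 1: the expected tables arrived non-empty for this job.
    return all(len(rows) > 0 for rows in tables.values())

def linkages(tables):
    # Law 2: every order joins to a known customer.
    known = {c["customer_id"] for c in tables["customers"]}
    return all(o["customer_id"] in known for o in tables["orders"])

def field_consistency(tables):
    # Law 3: fields are populated and of the expected type.
    return all(isinstance(o["amount"], float) and o["amount"] >= 0
               for o in tables["orders"])

def business_rules(tables):
    # Law 4: a cross-table rule, e.g. total order value within a sane bound.
    return sum(o["amount"] for o in tables["orders"]) < 1_000_000

def run_cascade(tables):
    # Checks cascade: a failure at one layer stops the layers below it.
    for law in (availability, linkages, field_consistency, business_rules):
        if not law(tables):
            return f"failed at {law.__name__}"
    return "healthy"

print(run_cascade({"orders": orders, "customers": customers}))  # healthy
```

Because the checks are ordered, an alert at the business-rules layer can be trusted to mean a genuine rule violation, not a missing file or a broken join masquerading as one.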