Best Practices for Building a Single Source of Truth: What Is It, Who Owns It, and Where Did It Come From?

Many of our clients are interested in building, or have already built, a single source of truth for their data. This does not mean that all data is stored in one massive data lake. It means that for every data element there is only one place that is the undisputed primary location for storing, editing, and deleting that data, and all other locations depend on or refer to that primary. 

Beacon’s architecture separates workloads and data, making it easier to designate those primary data sources, and bringing your data to life by connecting it with the models and analytics that give it meaning. As our Blackstone Case Study summarized, users should not have to worry about where the data is coming from, how to get it, or whether it is the current version.

Building your own single source of truth requires four essential elements:

Data dictionary for every data point and calculated metric

Data lineage process for every value

Data model for relationships

Data governance that is clear and automated

Data dictionary

  • What is it?
  • Who owns it?

The master data dictionary contains a clear definition of every data point and calculated metric in your single source of truth, as well as its validation criteria. For example, what type of data is it, what is the range of possible values, and how are null values to be treated? Each definition should also include governance information, such as the person or department responsible for the data, how frequently it is recalculated or updated, and the approval process for changes. If the data comes from a model or algorithm, there should be links to the source code, documentation, and testing procedures, including the approval date and version numbers. The dictionary should be human readable, so that non-technical users can easily understand it and find their way through it.

For example, every company in an investment portfolio or potential trade will have a data element of earnings before interest and taxes (EBIT). A dictionary entry for this value could be:

Data Element: Quarterly earnings before interest and taxes (EBIT)

Company: Amazon.com Inc.

Current Value: $7.681 B

Currency: USD

Calculation: Revenue – Operating Expenses (interest and taxes are not deducted). This is a calculated metric based on two other data elements.

Source: Refinitiv Company Fundamentals, updated daily after market close
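
To make the dictionary idea concrete, here is a minimal Python sketch of one entry with its validation criteria. It is an illustration only, not Beacon's implementation: the class name, field names, the owner, and the range bounds are all assumptions introduced for this example, while the EBIT value, currency, calculation, and source come from the entry above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """One data dictionary record. All field names are illustrative."""
    name: str
    data_type: type
    currency: str
    source: str
    owner: str                          # person or department responsible
    update_frequency: str
    calculation: Optional[str] = None   # None for raw, non-calculated elements
    min_value: Optional[float] = None   # None means no lower bound
    max_value: Optional[float] = None   # None means no upper bound
    nullable: bool = False

    def validate(self, value) -> List[str]:
        """Return a list of validation problems (empty if the value passes)."""
        if value is None:
            return [] if self.nullable else [f"{self.name}: null value not allowed"]
        if not isinstance(value, self.data_type):
            return [f"{self.name}: expected {self.data_type.__name__}, "
                    f"got {type(value).__name__}"]
        problems = []
        if self.min_value is not None and value < self.min_value:
            problems.append(f"{self.name}: {value} is below {self.min_value}")
        if self.max_value is not None and value > self.max_value:
            problems.append(f"{self.name}: {value} is above {self.max_value}")
        return problems


ebit_entry = DictionaryEntry(
    name="Quarterly EBIT",
    data_type=float,
    currency="USD",
    source="Refinitiv Company Fundamentals",
    owner="Finance Data Team",          # hypothetical owner
    update_frequency="daily after market close",
    calculation="Revenue - Operating Expenses",
    min_value=-1e12,                    # hypothetical sanity bounds
    max_value=1e12,
)

print(ebit_entry.validate(7.681e9))     # value from the example entry
print(ebit_entry.validate(None))        # nulls are rejected for this element
```

Keeping the validation rules on the entry itself means the same definition that people read is the one that automated checks enforce.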

Data lineage

  • Where did it come from?
  • When was it last updated?
  • Is it still fresh?

The data lineage process keeps track of when the value was entered into its primary location in the single source of truth, where it came from, and an authorization record identifying who approved it and when. Each element should also have a timestamp or version from the source system, including details of, or links to, any extract, transform, and load (ETL) code or method. If the value was calculated, there should be runtime links to all of the input parameters or dependencies used in the model or algorithm. If the timestamp on any of the inputs is newer than the timestamp on the output being used, the output value should be automatically marked as stale. (Dependency graphs are an excellent way to keep track of these relationships.) Finally, no one should be able to bypass or override the lineage controls in the production environment.

Continuing with the example, the lineage of Amazon’s EBIT could be:

Source: Refinitiv Company Fundamentals, updated daily after market close

Value date: 2023-06-30 17:00:00

Current value timestamp: 2023-08-03 22:15:08

Prior entries: Values and timestamps of any earlier values that were different from the current value

Stale?: No
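
The staleness rule described above (an output is stale when any input has a newer timestamp) can be sketched in a few lines of Python. The function names and the input timestamps are assumptions made for illustration; only the EBIT timestamp is taken from the example.

```python
from datetime import datetime
from typing import Dict, List


def stale_inputs(output_ts: datetime, input_ts: Dict[str, datetime]) -> List[str]:
    """Return the names of inputs updated after the output was computed."""
    return sorted(name for name, ts in input_ts.items() if ts > output_ts)


def is_stale(output_ts: datetime, input_ts: Dict[str, datetime]) -> bool:
    """An output is stale if at least one input is newer than it."""
    return bool(stale_inputs(output_ts, input_ts))


# Timestamp of the current EBIT value, from the lineage example.
ebit_ts = datetime(2023, 8, 3, 22, 15, 8)

# Hypothetical timestamps for the two inputs EBIT is calculated from.
inputs = {
    "Revenue": datetime(2023, 8, 3, 22, 15, 2),
    "Operating Expenses": datetime(2023, 8, 3, 22, 15, 2),
}
print(is_stale(ebit_ts, inputs))        # both inputs are older than the output

# A later feed updates Revenue, so the stored EBIT no longer reflects its inputs.
inputs["Revenue"] = datetime(2023, 8, 4, 22, 15, 0)
print(stale_inputs(ebit_ts, inputs))
```

Returning the list of offending inputs, rather than only a flag, makes it easier to route the problem to the right owner.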

Data model

  • What is it related to?
  • Where does it go?

The data model gathers related information together into tables or other constructs. Each table should make sense as a standalone unit and avoid duplication of data across multiple tables. For example, EBIT is calculated from Revenue and Operating Expenses, and is not stored as a value on its own. Important or meaningful relationships between this table and other data elements are collected here, such as many-to-many and one-to-many mappings. Front-end systems and reporting tools should act as a thin layer on top of a robust data model, and not be used as the data model by themselves, to avoid placing form over function.

Continuing with the Amazon example, the data model could contain:

Parent values: Revenue, Operating Expenses

Child values: Earnings per share, Quarterly growth

Related values: GAAP operating income
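
One simple way to hold these parent and child relationships is a small dependency graph. The Python sketch below is an assumption-laden illustration, not a prescribed schema: the `PARENTS` mapping and function names are invented for this example, and the relationships are the ones listed above.

```python
from collections import deque
from typing import List

# Each calculated value maps to the values it is computed from.
PARENTS = {
    "EBIT": ["Revenue", "Operating Expenses"],
    "Earnings per share": ["EBIT"],
    "Quarterly growth": ["EBIT"],
}


def children_of(name: str) -> List[str]:
    """Values that are computed directly from `name`."""
    return sorted(child for child, parents in PARENTS.items() if name in parents)


def downstream(name: str) -> List[str]:
    """Every value that depends on `name`, directly or indirectly."""
    seen = set()
    queue = deque([name])
    while queue:
        current = queue.popleft()
        for child in children_of(current):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(children_of("EBIT"))    # direct children from the example
print(downstream("Revenue"))  # everything affected when Revenue changes
```

A traversal like `downstream` is what lets a change to Revenue flag EBIT and everything built on it, which ties the data model back to the lineage checks.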

Data governance

  • Who changed what, when?
  • How do we validate it?
  • What were the earlier values?

Similar to financial regulations that govern business practices, data governance establishes rules and guidelines for how data should be handled, stored, and utilized. It ensures that the source data is reliable and that any calculations used are correct. Equipped with this information, data owners can quickly track down errors, omissions, or other issues and correct them in an appropriate time frame. Governance should include automated validations for as many aspects of the data as possible, and prompt escalation of any failures.

When data corrections or updates are made, only validated and authorized processes should make the change, to avoid overwriting good data with data of uncertain lineage. The most effective single source systems don’t overwrite existing data at all, but instead add a new entry marked with an “as of” time. This enables users to “time travel” through the data and faithfully reproduce reports and analytics from an earlier date and time. Finally, another important principle of a successful single source of truth is that models, reports, and other data usages should never depend directly on data stored in the uncontrolled and unchecked data lake, but should always get it from the primary source.

Continuing with the Amazon example, the governance entry could contain:

Update process: Automated data feed from Refinitiv data plugin, account details

Current value timestamp: 2023-08-03 22:15:08

Override authorizations: IDs of people/groups who can change this entry

Manually entered?: No
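
The "as of" approach described above can be sketched as an append-only store in Python. This is a simplified, single-timeline illustration under stated assumptions: the `AsOfStore` class and its methods are hypothetical, the later revised value is invented, and a full bi-temporal model would track a second time axis as well.

```python
from bisect import bisect_right
from datetime import datetime


class AsOfStore:
    """Append-only store: new entries are added, earlier ones never change."""

    def __init__(self):
        self._history = {}  # key -> list of (as_of, value), ordered by as_of

    def record(self, key, as_of, value):
        """Add a new entry; refuse anything that would rewrite history."""
        entries = self._history.setdefault(key, [])
        if entries and as_of <= entries[-1][0]:
            raise ValueError("as_of must be later than the latest entry")
        entries.append((as_of, value))

    def value_as_of(self, key, when):
        """Return the value that was current at `when`, or None if none existed."""
        entries = self._history.get(key, [])
        index = bisect_right([ts for ts, _ in entries], when)
        return entries[index - 1][1] if index else None


store = AsOfStore()
# Value and timestamp from the example entry.
store.record("AMZN.EBIT", datetime(2023, 8, 3, 22, 15, 8), 7.681e9)
# Hypothetical later revision, added alongside the original rather than over it.
store.record("AMZN.EBIT", datetime(2023, 8, 10, 22, 15, 0), 7.700e9)

# "Time travel": reproduce what a report run on 5 August would have seen.
print(store.value_as_of("AMZN.EBIT", datetime(2023, 8, 5)))
print(store.value_as_of("AMZN.EBIT", datetime(2023, 8, 11)))
```

Because nothing is overwritten, a report can be rerun for any earlier date and still return the values that were current at that time.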

The truth is out there

In today’s data-driven world, a reliable and accurate single source of truth is essential for making informed decisions and deriving meaningful insights. The four essential elements of building a single source of truth—data dictionary, data lineage, data model, and data governance—form the foundation for a robust and cohesive data ecosystem. By focusing on these elements, businesses can streamline their data management processes, reduce redundancy and discrepancies, and ensure that data elements have a designated and trustworthy primary location.

Companies that are embracing the single source of truth concept, like Blackstone, spend less time worrying about the inner workings of data sources, enabling users to ask more questions, get faster results, and derive deeper insights on potential investments. Beacon’s database implementation and bi-temporal data object model are essential to this approach: revised values and their relevant dates are added to the database without replacing earlier values, preserving the full history of an instrument.

Ready to take the next step towards building your single source of truth? Reach out to our team of experts at Beacon Platform to discover how the Beacon architecture can revolutionize your data management and analytics capabilities.