
7 Data Sins Series: Losing Track of Your Data

Written by Neil Sandle - Director, Product Management | 28-Oct-2024 17:46:43

Mistakes and changes - we all make them, every day and at every moment of the day. If we didn't, there would be no need for a backspace key on your keyboard, right? And so we correct them, constantly.

But when those changes and corrections affect transactions, investments, portfolios, index constituents, benchmarks, investment classifications, fundamental investment data, research data, report data, compliance data, or the like, we had better track and trace all the changes we feel are necessary to make (for the better), right?

 

The Need for Comprehensive Traceability

For information to be reliable, we should be able to trace what attribute changed, when exactly it happened, who did it, and even (this is where it gets really interesting for the auditors) the reason why we changed it. That last piece of crucial information is what often gets missed, yet it is essential for keeping trust in data and processes. We need to be able to log the reason for a change when we make it, regardless of whether the change is applied automatically or manually. In fact, in some cases we should have a 'four-eyes principle' in place before we officially validate any change.
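
To make this tangible, here is a minimal sketch in Python of what one such audit entry could capture. It is purely illustrative - the `ChangeRecord` class and its field names are our own invention for this post, not a description of any particular product's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    """One audited correction: the what, when, who and why of a change."""
    entity_id: str        # e.g. an instrument or portfolio identifier
    attribute: str        # what changed
    old_value: object     # the value being corrected (kept, never discarded)
    new_value: object     # the correction itself
    changed_by: str       # who made the change
    reason: str           # why it was made - the bit auditors care about most
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    approved_by: Optional[str] = None   # second pair of eyes; stays None until validated

    def is_validated(self) -> bool:
        # Four-eyes principle: the change only counts once someone else signs it off.
        return self.approved_by is not None and self.approved_by != self.changed_by
```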

Next, when any of these necessary corrections affect other data attributes further downstream, we ought to be able to record that too and carry the meta-information about the changes forward. In short: we should track what we are doing and make it easy to retrace every step.

From a data management system perspective, this means we don't throw any data away, not even the wrong data that is being corrected; we keep every incremental change, regardless of whether it is market data or reference data. You can imagine that this causes data stores to grow quite big, quite fast. All the more reason to have a scalable solution.
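
As a toy illustration of that "never throw anything away" principle, imagine a store where a correction simply appends a new version of a value instead of overwriting the old one. This is a deliberately simplified sketch under our own naming, not a production design:

```python
from collections import defaultdict


class AppendOnlyStore:
    """Toy append-only store: corrections add versions, nothing is ever deleted."""

    def __init__(self):
        # key -> list of (value, change_metadata) tuples, oldest first
        self._versions = defaultdict(list)

    def put(self, key: str, value, change_metadata: dict) -> None:
        # Even a 'wrong' value stays in history; the correction is just a newer version.
        self._versions[key].append((value, change_metadata))

    def current(self, key: str):
        # The latest version is what downstream consumers normally see.
        return self._versions[key][-1][0]

    def history(self, key: str):
        # The full lineage of the attribute: every value it ever had, with its metadata.
        return list(self._versions[key])


store = AppendOnlyStore()
store.put("XS1234567890/coupon", 4.25, {"changed_by": "ops_user", "reason": "initial load"})
store.put("XS1234567890/coupon", 4.75, {"changed_by": "ops_user", "reason": "vendor correction"})
# history() still shows the original 4.25 alongside the correction that replaced it.
```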

The ability to track, trace, and see how data has morphed in-house from source to end consumption is called data lineage, and it is becoming ever more important as data consumption grows and the breadth of sources comes under greater scrutiny. This matters not just for auditing purposes, but increasingly for your own understanding of how reliable any data source actually is and how often it needs correcting.

 

Enhancing Transparency for New Data Types

Good data lineage avoids wasting time looking for changes and creates more overall visibility into the movement of mutations across the entire data supply chain. Gresham's Prime EDM makes this very visible. This is key because, with the fast-growing number of commercial data providers that have started serving the investment industry in recent years, the number of different data sources one has to manage (including one's own in-house views) has mushroomed.

And yes, a lot of what is being offered today could help reveal otherwise hidden alpha, or positive and negative market externalities, in the future. Therefore, the aim for most buy-side participants is simply to keep up with researching this range of new data types and ultimately integrate many of these sources into their investment process.

Here at Gresham, we refer to these industry efforts as horizontal integrations (as opposed to vertical integrations of the various analytical tasks and disciplines across the firm), and we foresee that in the very near future most investment managers will be looking to horizontally integrate the following data types seamlessly, including their data lineage:

  • Price data
  • Reference data
  • (In-house) Factor data
  • Ratings data
  • Benchmark data
  • ESG data
  • Fund data
  • Alternative data

We also see growing demand to give business users more ready self-service access to these data types, instead of having them wait for the IT department to facilitate every possible data link separately. This liberation of the data empowers the business analyst to detect, reveal, and create information. Because, as we saw in the first topic of our 7 Data Sins series, "Can't see the Wood for the GREEN trees!", real information can only emerge when multiple data elements come together.

 

Balancing Data Access with Data Lineage

The aim of horizontal integration is therefore to expedite the ability of Analysts, Portfolio Managers, Fund Managers, Risk Managers (and anyone else, really) to become ever more self-reliant in sourcing, aggregating, and combining these sets of relevant data to create the piece of information they need.

But all this in-house data liberation and user enablement doesn't mean users should become 'too self-serving', in the sense that they can alter certain pieces of data to their personal benefit without anybody else knowing or being able to trace it. This is where data lineage comes in again, and most professional investors and regulators consider it paramount to have it in your architectural setup.

Another thing to recognize in the market is the significant number of US asset managers that have quite recently started to catch up with their European counterparts on making ESG-conscious investment decisions. Having arrived somewhat late to the party, they now need to consider how to embed this type of data afresh into the many steps of their investment cycle, such as research, asset allocation, portfolio construction, risk, performance, compliance, and reporting.

 

The Competitive Edge of Raw ESG Data

Unlike what one might think about arriving almost too late to the game, it is the European head start that might actually turn out to be the disadvantage compared to these late arrivals, as it seems to miss out on the clear advantage of leveraging the latest data science technology. R and Python in particular have rapidly evolved into the new industry standard. Their open-source libraries, such as SciPy, NumPy, and pandas, harness enormous power in dealing with one-dimensional and two-dimensional data. And with open-source Python libraries such as TensorFlow, Theano, and scikit-learn, the proverbial door to using AI within the firm is wide open. That is, provided you have a strong architecture that caters for all the data flow. It is one of the key reasons why Gresham chose Cassandra and Spark for its data warehouse architecture, to natively run Python and R.
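
As a trivial illustration of why these libraries caught on (standard, publicly documented pandas usage, nothing Gresham-specific, and the instruments below are made up), a few lines are enough to work with labelled one- and two-dimensional data:

```python
import pandas as pd

# One-dimensional, labelled data: closing prices for a single (made-up) instrument.
dates = pd.date_range("2024-01-01", periods=3)
prices = pd.Series([101.2, 100.8, 102.5], index=dates, name="close")

# Two-dimensional data: several instruments side by side in one frame.
frame = pd.DataFrame(
    {"ACME": [101.2, 100.8, 102.5], "GLOBEX": [55.1, 55.4, 54.9]},
    index=dates,
)

daily_returns = frame.pct_change().dropna()  # vectorised across the whole frame
print(daily_returns.mean())                  # average daily return per instrument
```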

From a business perspective, however, these new technologies (albeit very interesting in their own right) aren't half as interesting as the prospect of adopting fresh new types of ESG data to gain a competitive advantage. What many managers have come to realize is that when a majority of them adopt the same standard vendor scores, it gets harder to differentiate. Consequently, more of them now prefer to take in so-called raw ESG data rather than the predefined ratings most vendors are trying to prescribe.

But the argument goes that if even more raw data comes in, and you have to classify, sift, clean, correct, and enrich this supply of data as it flows through the entire ecosystem of the firm, then data lineage (i.e. tracking the what, when, who, and why of the changes) becomes exponentially more important, and asset managers need a data management system that keeps pace with these requirements.
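
To close with a sketch of that last point: each cleaning or enrichment step on a raw ESG record could append its lineage metadata (the what, when, who, and why) rather than silently replacing values. The function and field names below are hypothetical, chosen only to illustrate the idea:

```python
from datetime import datetime, timezone


def apply_step(record: dict, attribute: str, new_value, user: str, reason: str) -> dict:
    """Return a new version of the record with the change and its lineage appended."""
    updated = dict(record)
    updated["lineage"] = list(record.get("lineage", [])) + [{
        "attribute": attribute,
        "old_value": record.get(attribute),
        "new_value": new_value,
        "changed_by": user,
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }]
    updated[attribute] = new_value
    return updated


# A raw ESG data point straight from a vendor, before any in-house treatment.
raw = {"issuer": "ACME Corp", "scope1_emissions_t": None, "lineage": []}

cleaned = apply_step(
    raw, "scope1_emissions_t", 12450.0,
    user="esg_analyst_1", reason="Backfilled from the issuer's 2023 sustainability report",
)
# 'cleaned' now carries both the corrected figure and the full story of how it got there.
```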