The biggest data analytics limitation we currently see playing out is the difficulty of obtaining and moving real-time data into an analytics platform. The challenge lies in legacy applications and data on platforms that are not easily enabled or prepared for real-time movement into a data analytics platform. In addition, many modern data stores operate at a scale beyond the capability of earlier-generation data movement technologies.
This challenge is significant because, to truly understand operational data, real-time is an all-or-nothing proposition. If some segments of data do not reach your analytics platform in real time, the full picture cannot be known. Even worse, mixing real-time data with stale, batch-loaded data will skew the results both before and after each batch load.
Imagine, for instance, a debt collection scenario in which delinquent customers are uploaded to a list in real time, but their payments are only updated on the system via a daily batch. Some customers could be told they are delinquent even though they have already paid. If payments were also synced in real time rather than in a nightly batch, the teams handling delinquencies would have a better chance of working them accurately. On top of that, the institution could pull accurate, real-time reports on the state of delinquent customers.
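A minimal sketch of that scenario makes the risk concrete. The customer IDs and data below are illustrative, not from any real system: the delinquency list arrives in real time, while payments reach the analytics store only through a nightly batch.

```python
# Payments the batch-loaded analytics store knows about (last sync: midnight).
batch_synced_payments = {"cust_42"}

# Payments that have actually occurred, including one made this morning.
actual_payments = {"cust_42", "cust_99"}

# Real-time delinquency feed received at 10:00 this morning.
delinquent_feed = ["cust_42", "cust_99"]

# Who gets a collection call depends entirely on which payment view you join:
batch_view = [c for c in delinquent_feed if c not in batch_synced_payments]
real_time_view = [c for c in delinquent_feed if c not in actual_payments]

print(batch_view)      # ['cust_99']  <- already paid, but contacted anyway
print(real_time_view)  # []           <- correct: no one left to contact
```

The batch view wrongly flags the customer who paid after the last sync; the real-time view does not.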
The best way to get around these limitations is to use a mechanism, such as change data capture (CDC), that allows even legacy systems to support real-time data movement. Regardless of the platform, when the data changes, a change event is created in the CDC pipeline, which then cascades that change to any systems that require it. This means you can forgo the batch-based approach and keep all systems in sync in real time. At that point, you no longer have the discrepancy between real-time data and stale, batch-based data that could skew the overall picture you are trying to assess through your data analytics.
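The cascade described above can be sketched as a tiny in-process pipeline. The class and method names here are illustrative, not any particular CDC product's API: every write to the source emits a change event that is fanned out to subscribed downstream systems.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChangeEvent:
    table: str
    key: str
    new_value: dict

class CdcPipeline:
    """Fans each captured change event out to every subscribed system."""
    def __init__(self):
        self.subscribers: list[Callable[[ChangeEvent], None]] = []

    def subscribe(self, handler: Callable[[ChangeEvent], None]) -> None:
        self.subscribers.append(handler)

    def publish(self, event: ChangeEvent) -> None:
        for handler in self.subscribers:
            handler(event)

class SourceTable:
    """A source system whose writes are captured as change events."""
    def __init__(self, name: str, pipeline: CdcPipeline):
        self.name, self.pipeline, self.rows = name, pipeline, {}

    def upsert(self, key: str, value: dict) -> None:
        self.rows[key] = value
        self.pipeline.publish(ChangeEvent(self.name, key, value))

# A downstream analytics store stays in sync without any batch load.
analytics_store = {}

def update_analytics(event: ChangeEvent) -> None:
    analytics_store[(event.table, event.key)] = event.new_value

pipeline = CdcPipeline()
pipeline.subscribe(update_analytics)

payments = SourceTable("payments", pipeline)
payments.upsert("cust_99", {"paid": True})
print(analytics_store)  # {('payments', 'cust_99'): {'paid': True}}
```

Real CDC tools capture changes from database transaction logs rather than application writes, but the propagation pattern is the same: one change event, cascaded to every consumer.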
The best way to use data analytics correctly is to ensure that you are feeding as much data as possible into the platform and drawing accurate conclusions from what can be seen. Accuracy has several facets. Data needs to show up, but it shouldn't show up twice (or more), and it should arrive in the order in which it was created, or any time-series analysis will never return accurate results. As humans, we sometimes see patterns that are simply figments, and we are not immune to that when looking at data analytics. If you have an assumption, test it to make sure that the hypothesis you are deriving from the data holds. The old phrase "correlation does not imply causation" rings especially true in data analytics and the assumptions we make around certain trends.
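Those two accuracy facets, no duplicates and correct ordering, can be enforced with a simple pre-processing step. The event shape (`id`, `ts`, `value`) is hypothetical: duplicates are dropped by event id, and the stream is re-sorted by creation time before any time-series analysis.

```python
# A raw event stream delivered out of order, with one duplicate delivery.
raw_events = [
    {"id": "e2", "ts": 2, "value": 105},
    {"id": "e1", "ts": 1, "value": 100},
    {"id": "e2", "ts": 2, "value": 105},  # duplicate delivery
    {"id": "e3", "ts": 3, "value": 95},
]

# Drop duplicates: keep the first occurrence of each event id.
seen = set()
deduped = []
for event in raw_events:
    if event["id"] not in seen:
        seen.add(event["id"])
        deduped.append(event)

# Restore creation order before running any time-series analysis.
ordered = sorted(deduped, key=lambda e: e["ts"])
print([e["value"] for e in ordered])  # [100, 105, 95]
```

Without this step, the duplicate would inflate aggregates and the out-of-order arrival would corrupt any trend calculation.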