For many people, the coronavirus pandemic offered their first glimpse into the world of data analysis and modeling — and some of the challenges with data quality.
Chris Moore, Director of Solution Engineering at Trifacta, shares his insights into those challenges, including preparing data for analysis, cleaning and structuring data, and focusing on context.
Are there challenges with receiving data from different sources?
Moore: Absolutely. We’ve seen a growing number of datasets made available for COVID research and response efforts, and most of that data comes from different sources. A lot of it has questionable quality and underlying structural issues, and it is difficult for non-data scientists to analyze. With the success of most initiatives completely dependent on data quality, it can be difficult for data scientists to ensure quality when drawing disparate data from multiple sources. As a result, over 80 percent of the effort in any data-driven project is spent cleaning and preparing the data that will be used.
What are data scientists doing to ensure data quality?
Moore: To ensure data quality, data scientists must assess the contents of the data, structure it and blend it with other sources, and then clean up anomalies and engineer features so that the data feeding their models is feature-rich and properly structured. This can be approached with the same process a data scientist would use when identifying and defining any critical data element for analysis. Using this approach, data scientists should consider the following questions before making data-driven decisions:
- Is the data element critical to the success of the outcome?
- What is the source of the data, and is it considered trustworthy, accurate, and consistent?
- Is it relevant for the time period and question being analyzed?
- In terms of quality, is the data complete, properly formatted, and free of outliers and erroneous inconsistencies?
- Do you have visibility into, or a solid understanding of, the lineage of the various datasets that are feeding these metrics?
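The quality checks behind these questions can be automated. As a minimal sketch (not Trifacta's actual product, just an illustrative pandas script with a hypothetical `assess_quality` helper and toy case-count data), a first pass might measure completeness, record types, and flag statistical outliers for each critical column:

```python
import pandas as pd

def assess_quality(df, critical_columns):
    """Run basic quality checks over the critical columns of a dataset."""
    report = {}
    for col in critical_columns:
        series = df[col]
        report[col] = {
            # Share of non-null values (1.0 means fully complete)
            "completeness": 1.0 - series.isna().mean(),
            "dtype": str(series.dtype),
        }
        # Flag numeric values more than 3 standard deviations from the mean
        if pd.api.types.is_numeric_dtype(series):
            z = (series - series.mean()) / series.std()
            report[col]["outliers"] = int((z.abs() > 3).sum())
    return report

# Toy COVID-style daily case counts with one missing value
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02",
                            "2020-03-03", "2020-03-04"]),
    "cases": [10, 12, None, 11],
})
report = assess_quality(df, ["cases"])
```

A report like this makes the "is it complete and properly formatted?" question answerable at a glance rather than by manual inspection.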
Are there tools that can help ensure data quality?
Moore: Yes. Using data preparation platforms such as Trifacta, data scientists and data engineers can collaborate to clean and structure the data and engineer features with a visual, machine-learning-driven interface that provides interactive guidance and real-time validation to ensure high-quality data. These systems also let them set up data pipelines that automate feeding the machine learning and AI models, with monitoring and alerting built in.
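The pipeline pattern Moore describes, where validation gates and alerting sit between raw data and the models, can be sketched in plain Python. This is a simplified illustration, not Trifacta's API; the `validate` and `run_pipeline` functions and the quality gates are hypothetical:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def validate(df):
    """Reject batches that fail basic quality gates before they reach a model."""
    problems = []
    if df["cases"].isna().any():
        problems.append("missing case counts")
    if (df["cases"] < 0).any():
        problems.append("negative case counts")
    return problems

def run_pipeline(raw):
    """Validate a batch; alert and stop on failure, otherwise add features."""
    problems = validate(raw)
    if problems:
        # In a real system this would page an operator or halt the schedule
        log.warning("batch rejected: %s", "; ".join(problems))
        return None
    # Example engineered feature: 7-day rolling average of daily cases
    return raw.assign(cases_7d=raw["cases"].rolling(7, min_periods=1).mean())

clean = run_pipeline(pd.DataFrame({"cases": [5, 7, 6, 9]}))
bad = run_pipeline(pd.DataFrame({"cases": [5, -1, 6]}))
```

The key design point is that bad batches never silently flow downstream: the pipeline either produces validated, feature-engineered data or raises an alert.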
What are the most important things that data scientists and IT health leaders need to consider as they analyze data, model predictions, and pass information on COVID-19 to the public with confidence?
Moore: Now more than ever, it’s critical to have a solid understanding of the context of the data — how it was assembled, the metadata around each feature, when it was last updated, and so on. Decision-makers should make sure they understand the context of the data they’re using before confidently making decisions from it.
Another component of this process that IT health leaders should consider is comparing their datasets against those from other outlets. With the number of datasets growing exponentially during this time, it is important to identify questionable data quality and underlying structural issues. If an organization has a dataset showing trends that differ massively from multiple other public datasets containing similar information, it should question the validity of that dataset before using it in any data-driven initiative.
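The cross-checking Moore recommends can be made concrete. A minimal sketch, assuming pandas and a hypothetical `trend_agreement` helper with made-up series (the source names and threshold are illustrative, not from any real dataset): compare the day-over-day trend of a candidate dataset against reference datasets covering the same period, and flag references it disagrees with.

```python
import pandas as pd

def trend_agreement(candidate, references, threshold=0.7):
    """Return names of reference series whose trend the candidate disagrees with.

    Pearson correlation of day-over-day changes captures trend shape
    while ignoring differences in absolute counts between sources.
    """
    disagreements = []
    for name, ref in references.items():
        corr = candidate.diff().corr(ref.diff())
        if corr < threshold:
            disagreements.append(name)
    return disagreements

idx = pd.date_range("2020-04-01", periods=6)
candidate = pd.Series([100, 120, 140, 165, 190, 220], index=idx)
references = {
    # Tracks the candidate's trend closely despite a level offset
    "public_source_a": pd.Series([99, 119, 139, 164, 189, 219], index=idx),
    # Wildly divergent trend — a signal the candidate (or this source) is suspect
    "public_source_b": pd.Series([300, 280, 260, 500, 120, 90], index=idx),
}
suspect = trend_agreement(candidate, references)
```

Comparing changes rather than raw counts matters here: two sources can legitimately report different absolute numbers (different catchment areas, reporting lags) while still agreeing on the trend.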
What advice can you offer software developers whose clients need their help with data analysis?
Moore: Ultimately, when dealing with any data, but especially data for COVID-19 research, it’s important to develop a good understanding of it. By looking at the data’s quality, statistics, and shape, analysts can gain insights more quickly. In this process, it’s important to fail early, as failure in the beginning stages helps data scientists identify what data isn’t helpful to their research. With tools that support quick iteration, analysts can minimize the effort needed to find the right data. This helps them develop insights and make decisions around their COVID-19 research using insightful, high-quality data.
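That first look at a dataset's quality, statistics, and shape can be a one-function habit. As a minimal sketch with pandas (the `quick_profile` helper and the toy data are illustrative, not a specific tool's API):

```python
import pandas as pd

def quick_profile(df):
    """First-pass profile: shape, per-column null share, and numeric summary."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "null_share": df.isna().mean().to_dict(),   # fraction missing per column
        "numeric_summary": df.describe().to_dict(), # count/mean/std/quartiles
    }

df = pd.DataFrame({
    "region": ["A", "B", "A", None],
    "cases": [10, 15, None, 20],
})
profile = quick_profile(df)
```

Running a profile like this before any modeling is the cheap "fail early" step: it surfaces missing values and odd distributions in seconds, before hours are invested in data that turns out to be unusable.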