A Guide to Improving Data Integrity and Adoption
A Case Study in Verifying Usage Data

Jessica Roper

Beijing · Boston · Farnham · Sebastopol · Tokyo

A Guide to Improving Data Integrity and Adoption, by Jessica Roper.

Copyright © 2017 O’Reilly Media, Inc. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Colleen Lobner
Copyeditor: Octal Publishing Services
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

December 2016: First Edition

Revision History for the First Edition
2016-12-12: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. A Guide to Improving Data Integrity and Adoption, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97052-2
[LSI]

Table of Contents

A Guide to Improving Data Integrity and Adoption
  Validating Data Integrity as an Integral Part of Business
  Using the Case Study as a Guide
  An Overview of the Usage Data Project
  Getting Started with Data
  Managing Layers of Data
  Performing Additional Transformation and Formatting
  Starting with Smaller Datasets
  Determining Acceptable Error Rates
  Creating Work Groups
  Reassessing the Value of Data Over Time
  Checking the System for Internal Consistency
  Verifying Accuracy of Transformations and Aggregation Reports
  Allowing for Tests to Evolve
  Implementing Automation
  Conclusion
  Further Reading

A Guide to Improving Data Integrity and Adoption

In most companies, quality data is crucial to measuring success and planning for business goals. Unlike sample datasets in classes and examples, real data is messy and requires processing and effort to be utilized, maintained, and trusted. How do we know if the data is accurate or whether we can trust final conclusions? What steps can we take to not only ensure that all of the data is transformed correctly, but also to verify that the source data itself can be trusted as accurate? How can we motivate others to treat data and its accuracy as a priority? What can we do to expand the adoption of data?
Validating Data Integrity as an Integral Part of Business

Data can be messy for many reasons. Unstructured data such as log files can be complicated to understand and to parse for information. A lot of data, even when structured, is still not standardized. For example, parsing text from online forums can be complicated and might need to include logic to accommodate slang such as “bad ass,” which is a positive phrase built from negative words. The system creating the data can also make it messy, because different languages have different expectations for design, such as Ruby on Rails, which requires a separate table to represent many-to-many relationships.

Implementation or design can also lead to messy data. For example, the process or code that creates data and the database storing that data might use incompatible formats. Or, the code might store a set of values as one column instead of many columns. Some languages parse and store values in a format that is not compatible with the databases used to store and process it, such as YAML (YAML Ain’t Markup Language), which is not a valid data type in some databases and is stored instead as a string. Because this format is intended to work much like a hash with key-and-value pairs, searching with the database language can be difficult. Also, code design can inadvertently produce a table that holds data for many different, unrelated models (such as categories, address, name, and other profile information) that is also self-referential.

For example, the dataset in Table 1-1 is self-referential, wherein each row has a parent ID representing the type or category of the row. The value of the parent ID refers to the ID column of the same table. In Table 1-1, all information around a “User Profile” is stored in the same table, including labels for profile values, resulting in some values representing labels, whereas others represent final values for those labels. The data in Table 1-1 shows that “Mexico” is a “Country,” part of the “User Profile,” because the parent ID of “Mexico” is 11, the ID for “Country,” and so on. I’ve seen this kind of example in the real world, and this format can be difficult to query. I believe this relationship was mostly the result of poor design. My guess is that, at the time, the idea was to keep all “profile-like” things in one table and, as a result, relationships between different parts of the profile also needed to be stored in the same place.

Table 1-1. Self-referential data example (source: Jessica Roper and Brian Johnson)

ID    Parent ID    Value
16    11           Mexico
11    …            Country
…     NULL         User Profile
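To make the querying difficulty concrete, here is a minimal sketch, assuming a hypothetical in-memory SQLite table built from rows like those in Table 1-1. The table name, the ID assigned to “User Profile,” and the query are all illustrative assumptions, not the schema from the case study; the point is that even a simple “which country is in the user profile?” lookup needs a self-join because labels and values live in the same table.

```python
import sqlite3

# A minimal, hypothetical version of the self-referential "profile" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile (id INTEGER, parent_id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO profile VALUES (?, ?, ?)",
    [(1, None, "User Profile"),   # assumed ID for the top-level label (not given in Table 1-1)
     (11, 1, "Country"),          # label row: its parent is the profile itself
     (16, 11, "Mexico")],         # value row: its parent is the "Country" label
)

# Because labels and values share one table, every lookup needs a self-join:
# join each value row to its parent row to recover the label it belongs to.
rows = conn.execute("""
    SELECT labels.value AS label, vals.value AS value
    FROM profile AS vals
    JOIN profile AS labels ON vals.parent_id = labels.id
    WHERE labels.value = 'Country'
""").fetchall()

print(rows)  # [('Country', 'Mexico')]
```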
Data quality is important for a lot of reasons, chiefly that it’s difficult to draw valid conclusions from partial or inaccurate data. With a dataset that is too small, skewed, inaccurate, or incomplete, it’s easy to draw invalid conclusions. Organizations that make data quality a priority are said to be data driven; to be a data-driven company means priorities, features, products used, staffing, and areas of focus are all determined by data rather than intuition or personal experience. The company’s success is also measured by data. Other things that might be measured include ad impression inventory, user engagement with different products and features, user-base size and predictions, revenue predictions, and most successful marketing campaigns. To affect data priority and quality will likely require some work to make the data more usable and reportable, and it will almost certainly require working with others within the organization.

Using the Case Study as a Guide

In this report, I will follow a case study from a large and critical data project at Spiceworks, where I’ve worked for the past seven years as part of the data team, validating, processing, and creating reports. Spiceworks is a software company that aims to be “everything IT for everyone IT,” bringing together vendors and IT pros in one place. Spiceworks offers many products, including an online community for IT pros to research and collaborate with colleagues and vendors, a help desk with a user portal, network monitoring tools, network inventory tools, user management, and much more.

Throughout much of the case study project, I worked with other teams at Spiceworks to understand and improve our datasets. We have many teams and applications that either produce or consume data, from the network-monitoring tool and online community that create data, to the business analysts and managers who consume data to create internal reports and prove return on investment to customers. My team helps to analyze and process the data to provide value and enable further utilization by other teams and products via standardizing, filtering, and classifying the data. (Later in this report, I will talk about how this collaboration with other teams is a critical component to achieving confidence in the accuracy and usage of data.)

This case study demonstrates Spiceworks’ process for checking each part of the system for internal and external consistency. Throughout the discussion of the usage data case study, I’ll provide some quick tips to keep in mind when testing data, and then I’ll walk through strategies and test cases to verify raw data sources (such as parsing logs) and work with transformations (such as appending and summarizing data). I will also use the case study to talk about vetting data for trustworthiness and explain how to use data monitors to identify anomalies and system issues for the future. Finally, I will discuss automation and how you can automate different tests at different levels and in different ways. This report should serve as a guide for how to think about data verification and analysis and some of the tools that you can use to determine whether data is reliable and accurate, and to increase the usage of data.

An Overview of the Usage Data Project

The case study, which I’ll refer to as the usage data project, or UDP, began with a high-level goal: to determine usage across all of Spiceworks’ products and to identify page views and trends by our users. The need for this new processing and data collection came after a long road of hodge-podge reporting wherein individual teams and products were all measured in different ways. Each team and department collected and assessed data in its own way—how data was measured in each team could be unique. Metrics became increasingly important for us to measure success and determine which features and products brought the most value to the company and, therefore, should have more resources devoted to them.

The impetus for this project was partially due to company growth—Spiceworks had reached a size at which not everyone knew exactly what was being worked on and how the data from each place correlated to their own. Another determining factor was inventory—to improve and increase our inventory, we needed to accurately determine feature priority and value.
We also needed to utilize and understand our users and audience more effectively to know what to show, to whom, and when (such as whether to display ads or send emails). When access to this data occurred at an executive level, it was even more necessary to be able to easily compare products and understand the data as a whole to answer questions like: “How many total active users do we have across all of our products?” and “How many users are in each product?” It wasn’t necessary to understand how each product’s data worked. We also needed to be able to do analysis on cross-product adoption and usage.

The product-focused reporting and methods of measuring performance that were already in place made comparison and analysis of products impossible. The different data pieces did not share the same mappings, and some were missing critical statistics such as which specific user was active on a feature. We thus needed to find a new source for data (discussed in a moment).

[…] into table format, more information was appended, and the tables were summarized for easier analysis. One of the goals for these more processed tables, which included any categorizations and final filtering of invalid data such as that from development environments, was to indicate total page views of a product per day by user. We verified that rows were unique across user ID, page categorization, and date. Usually at this point, we had the data in some sort of database, which allowed us to do this check by simply writing a query that selected and counted the columns that should be unique and ensured that the resulting counts were equal. As Figure 1-5 illustrates, you also can run this check in Excel by using the Remove Duplicates action, which will return the number of duplicate rows that are removed. The goal is for zero rows to be removed, showing that the data is unique.

Figure 1-5. Using Excel to test for duplicates (source: Jessica Roper)
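A minimal sketch of the database version of this uniqueness check, assuming a hypothetical page_views table keyed by user, page category, and date (the file, table, and column names are stand-ins, not the actual UDP schema): group by the columns that should be unique and confirm that no key appears more than once.

```python
import sqlite3

conn = sqlite3.connect("usage.db")  # hypothetical database file holding the parsed tables

# Count how many (user, page category, date) combinations appear more than once.
# The key columns are assumed non-NULL; the goal is zero duplicated keys.
dupes = conn.execute("""
    SELECT user_id, page_category, view_date, COUNT(*) AS n
    FROM page_views
    GROUP BY user_id, page_category, view_date
    HAVING COUNT(*) > 1
""").fetchall()

print(f"{len(dupes)} duplicated keys found")
assert not dupes, "page_views is not unique by user, category, and date"
```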
Verifying Accuracy of Transformations and Aggregation Reports

Because all of the data needed was not already in the logs, we added more to the parsed logs and then translated and aggregated the results so that everything we wanted to filter and report on was included. Data transformations like appended information, aggregation, and translation require more validation. Anything converted or categorized needs to be individually checked.

In the usage data project, we converted each timestamp to a date and had to ensure that the timestamps were converted correctly. One way we did this was to manually find the first and last log entries for a day in the logs and compare them to the first and last entries in the parsed data tables. This test revealed an issue with a time zone difference between the logs and the database system, which shifted and excluded results for several hours. To account for this, we processed all logs for a given day as well as the following day, and then filtered the results based on the date after adjusting for the time zone differences.

We also validated all data appended to the original source. One piece of data that we appended was location information for each page view, based on the IP address. To do this, we used a third-party company that provides an application program interface (API) to correlate IP addresses with location data, such as country, for further processing and geographical analysis. For each value, we verified that the source of the appended data matched what was found in the final table. For example, we ensured that the country and user information appended was correct by comparing the source location data from the third party and user data to the final appended results. We did this by joining the source data to the parsed dataset and comparing values.

For the aggregations, we checked that raw row counts from the parsed advertising service log tables matched the sum of the aggregate values. In this case, we wanted to roll up our data by pages viewed per user in each product, requiring validation that the total count of rows parsed matched the summary totals stored in the aggregate table.

Part of the UDP required building aggregated data for the reporting layer, generically customized for the needs of the final reports. In most cases, consumers of data (individuals, other applications, or custom tools) will need to transform, filter, and aggregate data for the unique needs of the report or application. We created this transformation for them in a way that allowed the final product to easily filter and further aggregate the data (in this case, we used a business intelligence software). Those final transformations also required validation for completeness and accuracy, such as ensuring that any total summaries equal the sum of their parts, nothing was double counted, and so on.

The goal for this level of testing is to validate that aggregate and appended data has as much integrity as the initial dataset. Here are some questions that you should ask during this process (a sketch of such checks follows the list):

• If values are split into columns, do the columns add up to the total?
• Are any values negative that should not be, such as a calculated “other” count?
• Is all the data from the source found in the final results?
• Is there any data that should be filtered out but is still present?
• Do appended counts match totals of the original source?
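The sketch below shows what a few of these checks can look like in practice, again against hypothetical raw and aggregate tables (raw_page_views and daily_page_views, with invented column names) rather than the actual UDP schema: the aggregate total must account for every raw row, and per-column breakdowns must add up to their stored totals without going negative.

```python
import sqlite3

conn = sqlite3.connect("usage.db")  # hypothetical database with raw and aggregate tables

# 1. The aggregate table should account for every raw row exactly once.
raw_total = conn.execute("SELECT COUNT(*) FROM raw_page_views").fetchone()[0]
agg_total = conn.execute("SELECT SUM(page_views) FROM daily_page_views").fetchone()[0]
assert raw_total == agg_total, f"aggregate misses rows: raw={raw_total}, agg={agg_total}"

# 2. Values split into columns should add up to the stored total,
#    and no derived "other" count should ever be negative.
bad_rows = conn.execute("""
    SELECT *
    FROM daily_page_views
    WHERE page_views <> new_views + refresh_views + other_views
       OR other_views < 0
""").fetchall()
assert not bad_rows, f"{len(bad_rows)} aggregate rows fail the column-sum check"
```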
As an example of the last point (appended counts matching the totals of the original source), when dealing with Spiceworks’ advertising service, there are a handful of sources and services that can make a request for an ad and cause a log entry to be added. Some different kinds of requests included new organic pageviews, ads refreshing automatically, and requests from pages with ad-block software. One test we included checked that the total requests equaled the sum of the different request types. As we built this report and continued to evolve our understanding of the data and the requirements for the final results, the reportable tables and tests also evolved. The test process itself helped us to define some of this when outliers or unexpected patterns were found.

Allowing for Tests to Evolve

It is common for tests to evolve as more data is introduced and consumed and therefore better understood. As specific edge cases and errors are discovered, you might need to add more automation or processes. One such case I encountered was caused by the fact that every few years there are 53 weeks in the calendar year. This extra “week” (it is approximately half a week, actually) results in an extra week in December and 14 weeks in the last quarter. When this situation occurred for the first time after building our process, the reporting for the last quarter of the year as well as for the following quarter were incorrect. When the issue and cause were discovered, special logic for the process and new test cases were added to account for this unexpected edge case.

For scrubbing transformations or clustering of data, your tests should search through all unique possible options, filtering one piece at a time. Look for under folding, whereby data has not clustered/grouped everything it should have, and over folding, which is when things are over-grouped or categorized where they should not be [2]. Part of the aggregations for this project required us to classify URLs based on the different products of which they were a part. To test and scrub these, we first broke apart the required URL components to ensure that all variations were captured. For example, one of the products that required its own category was an “app center” where users can share and download small applications or plug-ins for our other products; to test this, we began by searching for all URLs that had “app” and “center” in the URL. We did not require “app center,” “app%center,” or other combined variations, because we wanted to make no assumptions about the format of the URL. By searching in this more generic way, we were able to identify many URLs with formats of “appcenter,” “app-center,” and “app center.” Next, we looked for URLs that matched only part of the string. In this case, we found the URL “/apps” by looking for URLs that had the word “app” but not “center.” This test identified several URLs that looked similar to other app center URLs, but after further investigation were found to be part of another product. This allowed us to add automated tests that ensured those URLs were always categorized correctly and separately. Categorizing this data required using the acceptable error rate to identify what should be used to create the logic. In this case, we did not need to focus on getting down and dirty with our “long tail”—what was usually thousands of pages that only have a handful of views. A few page views account for well below one thousandth of a percent and would provide virtually no value even if scrubbed. Most of the time the URLs are still incorporated by the other logic created.
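A rough sketch of this kind of scrubbing pass, assuming a simple list of URL strings (the sample URLs and patterns mirror the app center example but are purely illustrative): search for the pieces of the name independently so no URL format is assumed, then look at partial matches so both under folding and over folding surface for review.

```python
import re

# Hypothetical sample of URLs pulled from the parsed logs.
urls = [
    "/appcenter/plugins", "/app-center/reviews", "/app center/install",
    "/apps/inventory", "/community/topics", "/appcenter",
]

# Search for the name's pieces independently so no URL format is assumed.
both_parts = [u for u in urls if re.search(r"app", u) and re.search(r"center", u)]

# URLs matching only part of the string are candidates for over folding:
# they look similar but might belong to a different product (e.g., "/apps").
partial_only = [u for u in urls if re.search(r"app", u) and not re.search(r"center", u)]

print("candidate app center URLs:", both_parts)
print("review for possible over folding:", partial_only)
```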
Checking for External Consistency: Analyzing and Monitoring Trends

The last two components of data validation, vetting data with trend analysis and monitoring, are the most useful for determining data reliability and helping to ensure continued validity. This layer is heavily dependent on the kind of data that is going to be reported on and what data is considered critical for any analysis. It is part of maintaining and verifying the reportable data layer, especially when data comes from external or multiple sources.

First is vetting data by comparing what was collected to other related data sources to comprehensively cover known boundaries. This helps to ensure the data is complete and that other data sources correlate and agree with the data being tested. Of course, other data sources will not represent the exact same information, but they can be used to check things such as whether trends over time match, whether total unique values are within the bounds of expectations, and so on. For example, in Spiceworks’ advertising service log data, there should not be more active users than the total users registered. Even further, active users should not be higher than the total users that have logged in during the time period. The goal is to verify the data against any reliable external data possible.

External data could be from a source such as Google Analytics, which is a reliable source for page views, user counts, and general usage, with some data available for free. We used external data available to us to compare total active users and page views over time for many products. Even public data such as general market share comparisons can be a good option; compare sales records to product usage, active application counts, and associated users to total users, and so on. Checking against external sources is just a different way to think about the data and other data related to it. It provides boundaries and expectations for what the data should look like and edge-case conditions that might be encountered. Some things to include are comparing total counts, averages, and categories or classifications. In some cases, the counts from the external source might be summaries or estimates, so it’s important to understand those values as well, to determine if inconsistencies among datasets indicate an error.

For the UDP, we were fortunate to have many internal sources of data that we used to verify that our data was within expected bounds and matched trends. One key component is to compare trends. We compared data over time for unique user activity, total page views, and active installations (and users related to those installations) and checked our results against available data sources (a hypothetical example is depicted in Figure 1-6).

Figure 1-6. Hypothetical example of vetting data trends (source: Jessica Roper and Brian Johnson)

We aimed to answer questions such as the following (a sketch of this kind of comparison appears after the list):

• Does the number of active users over time correlate to the unique users seen in our other stats?
• Do the total page views correlate to the usage we see from users in our installation stats?
• Can all the URLs we see in our monitoring tools and request logs be found in the new dataset?
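A small sketch of this sort of cross-source vetting, with made-up monthly series for two related metrics (active users from the new usage data versus logged-in users from an existing internal source); the numbers and tolerances are invented: the boundary check enforces that active users never exceed the comparison population, and the trend check flags months where the two series move in clearly different directions.

```python
# Hypothetical monthly totals: the new usage data versus an existing internal source.
active_users = [10_200, 10_950, 11_400, 11_300, 12_100, 12_800]
users_logged_in = [14_000, 14_600, 15_100, 15_000, 15_900, 16_700]

# Boundary check: there should never be more active users than users who logged in.
for month, (active, logged_in) in enumerate(zip(active_users, users_logged_in)):
    assert active <= logged_in, f"month {month}: active users exceed logged-in users"

# Trend check: month-over-month changes should generally move in the same direction.
def deltas(series):
    return [b - a for a, b in zip(series, series[1:])]

for month, (da, dl) in enumerate(zip(deltas(active_users), deltas(users_logged_in)), start=1):
    if (da > 0) != (dl > 0):
        print(f"month {month}: trends diverge (active {da:+}, logged-in {dl:+})")
```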
During this comparison, we found that several pages were not being tracked with the system. This led us to work with the appropriate development teams to start tracking those pages, and to determine what the impact would be on the final analysis.

The total number of active users for each product was critical for reporting teams and project managers. During testing, we found some products only had data available to indicate the number of active installations and the total number of users related to the installation. As users change jobs and add installations, they can be a user in all of those applications, making the user-to-installation relationship many-to-many. Some application users also misunderstood the purpose of adding new users, which is meant to be for all IT pros providing support to the people in their companies. However, in some cases an IT pro might add not only all other support staff, but also include all end users, who won’t actually use the application and therefore are never active on the installation itself. In this circumstance, they were adding the set of end users they support, but those end users are neither expected to interact directly with the application nor are they considered official users of it.

We wanted to define what assumptions were testable in the new dataset from the ad service. For example, at least one user should be active on every installation; otherwise no page views could be generated. Also, there should not be more active users than the total number of users associated with an installation. After we defined all the expectations for the data, we built several tests. One tested that each installation had fewer active users than the total associated with it. More important, however, we also tested that the total active users trended over time was consistent with trends for total active installations and total users registered to the application (Figure 1-6). We expected the trend to be consistent between the two values and follow the same patterns of usage for time of day, day of week, and so on. The key to this phase is having as much understanding as possible of the data, its boundaries, how it is represented, and how it is produced so that you know what to expect from the test results. Trends usually will match generally, but in my experience, it’s rare for them to match exactly.

Performing Time–Series Analyses

The next step is time–series analysis—understanding how the data behaves over time and what “makes sense” for the dataset. Time–series analysis provides insights needed to monitor the system over the long term and to validate data consistency. This sort of analysis also verifies the data’s accuracy and reliability. Are there large changes from month to month or week to week? Is change consistent over time?
One way to verify whether a trend makes sense is by looking for expected anomalies such as new product launch dates causing spikes, holidays causing a dip, known outage times, and expected low-usage periods (i.e., early AM hours). A hypothetical example is provided in Figure 1-7. This can also help identify other issues such as missing data or even problems in the system itself. After you understand trends and how they change over time, you might find it helpful to implement alerts that ensure the data fits within expected bounds. You can do this by checking for thresholds being crossed, or by verifying that new updates to the dataset grow or decline at a rate that is similar to the average seen across the previous few datasets.

Figure 1-7. Hypothetical example of vetting page view counts over time (source: Jessica Roper and Brian Johnson)

For example, in the UDP, we looked at how page views by product change over time by month compared to the growth of the most recent month. We verified that the change we saw from month to month as new data arrived was stable over time, and that dips or spikes were seen only when expected (e.g., when a new product was launched). We used the average over several months to account for anomalies caused by months with several holidays during the week. We wanted to identify thresholds and data existence expectations. During this testing, we found several issues, including a failing log copy process and products that stopped sending up data to the system. This test verified that each product was present in the final dataset. Using this data, we were able to identify a problem with ad server tracking in one of our products before it caused major problems. This kind of issue was previously difficult to detect without time–series analysis.

We knew the number of active installations for different products and the total users associated with each of those installations, but we could not determine which users were actually active before the new data source was created. To validate the new data and these active user counts, we ensured that the total number of users we saw making page views in each product was higher than the total number of installations, but lower than the total associated users, because not all users in an installation would be active.

Putting the Right Monitors in Place

The time–series analysis was key to identifying the kinds of monitors needed, such as ones for user and client growth. It also identified general usage trends to expect, such as average page views per product. Monitors are used to test new data being appended to and created by the system in the future; one-time or single historical reports will not require monitoring.

One thing we had to account for when creating monitors was traffic changes throughout the week, such as significant drops on the weekends. A couple of trend complications we had to deal with were weeks that have holidays and general annual trends such as drops in traffic in December and during the summer. It is not enough to verify that the month looks similar to the month before it or that a week has similar data to the week before; we also had to determine a list of known holidays, add indicators to those dates when the monitors are triggered, and compare averages over a reasonable amount of time.
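A simplified sketch of such a monitor, with invented numbers, an invented tolerance, and an assumed holiday flag rather than the actual UDP thresholds: the newest month is compared against the trailing average, and any alert is annotated with an indicator when a known holiday falls in that period so a human can decide whether the dip is expected.

```python
from statistics import mean

# Hypothetical monthly page view totals; the last value is the month being checked.
monthly_page_views = [1_520_000, 1_480_000, 1_555_000, 1_210_000]
holiday_in_newest_month = True      # assumed flag derived from a maintained list of known holidays
tolerance = 0.15                    # allow +/-15% drift from the trailing average

history, newest = monthly_page_views[:-1], monthly_page_views[-1]
baseline = mean(history)
drift = (newest - baseline) / baseline

if abs(drift) > tolerance:
    # The holiday indicator does not mute the alert; it only adds context for a human reviewer.
    indicator = " (known holiday in this period)" if holiday_in_newest_month else ""
    print(f"ALERT: newest month is {drift:+.1%} versus the trailing average{indicator}")
```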
It is important to note that we did not allow holidays to mute errors; instead, we added indicators and high-level data trend summaries to the monitor errors that allowed us to easily determine whether an alert could be ignored.

Some specific monitors we added included looking at total page views over time and ensuring that the total was close to the average total over the previous three months. We also added the same monitors for the total page views of each product and category, which tracked that all categories collect data consistently. This also ensured that issues in the system creating the data were monitored and that changes such as accidental removal of tracking code would not go unnoticed. Other tests included looking at these same trends for totals and by category for registered users and visitors to ensure that tracking around users remained consistent. We added many tests around users because knowing active users and their demographics was critical to our reporting. The main functionality for monitors is to ensure that critical data continues to have the integrity required. A large change in a trend is an indicator that something might not be working as expected in all parts of the system. A good rule of thumb for what defines a “large” change is when the data in question is outside one to two standard deviations from the average. For example, we found one application that collected the expected data for three months while in beta, but when the final product was deployed, the tracking was removed. Our monitors discovered this issue by detecting a drop in total page views for that product category, allowing us to dig in and correct the issue before it had a large impact.

There are other monitors we added that do not focus heavily on trends over time. Rather, they ensured that we would see the expected number of total categories and that the directory containing all the files being processed had the minimum number of expected files, each with the minimum expected size. This was determined to be critical because we found one issue in which some log files were not properly copied for parsing and therefore significant portions of data were missing for a day. Missing even only a few hours of data can have large effects on different product results, depending on what part of the day is missing from our data. These monitors helped us to ensure data copy processes and sources were updated correctly and provided high-level trackers to make sure the system is maintained.
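As a sketch of that file-level monitor, assuming the copied logs land in a single directory and that reasonable minimums are known (the path and thresholds here are placeholders): check that enough files arrived and that none of them is suspiciously small before the long-running process starts.

```python
from pathlib import Path

log_dir = Path("/data/logs/incoming")   # placeholder path for the copied log files
min_expected_files = 24                 # placeholder: e.g., one file per hour of the day
min_expected_bytes = 1_000_000          # placeholder minimum size per file

log_files = sorted(log_dir.glob("*.log"))
too_small = [f for f in log_files if f.stat().st_size < min_expected_bytes]

# Fail loudly before processing: missing or truncated files mean missing hours of data.
assert len(log_files) >= min_expected_files, (
    f"only {len(log_files)} log files found, expected at least {min_expected_files}")
assert not too_small, f"suspiciously small log files: {[f.name for f in too_small]}"
```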
As with other testing, the monitors can change over time. In fact, we did not start out with a monitor to ensure that all the files being processed were present and the correct sizes. The monitor was added when we discovered data missing after running a very long process. When new data or data processes are created, it is important to use them skeptically until no new issues or questions are found for a reasonable amount of time. This is usually related to how the processed data is consumed and used.

Much of the data I work with at Spiceworks is produced and analyzed monthly, so we closely and heavily manually monitor the system until the process has run fully and successfully for several months. This included working closely with our analysts as they worked with the data to find any potential issues or remaining edge cases in the data. Anytime we found a new issue or unexpected change, a new monitor was added. Monitors were also updated over time to be more tolerant of acceptable changes. Many of these monitors were less around the system (there are different kinds of tests for that), and more about the data integrity and ensuring reliability.

Finally, another way to monitor the system is to “provide end users with a dead-easy way to raise an issue the moment an inaccuracy is discovered,” and, even better, let them fix it. If you can provide a tool that both allows a user to report on data as well as make corrections, the data will be able to mature and be maintained more effectively. One tool we created at Spiceworks helped maintain how different products are categorized. We provided a user interface with a database backend that allowed interested parties to update classifications of URLs. This created a way to dynamically update and maintain the data without requiring code changes and manual updates.

Yet another way we did this was to incorporate regular communications and meetings with all of the users of our data. This included our financial planning teams, business analysts, and product managers. We spent time understanding the way the data would be used and what the end goals were for those using it. In every application, we included a way to give feedback on each page, usually through a form that includes all the page’s details. Anytime the reporting tool did not have enough data results for the user, we gave them an easy way to connect with us directly to help obtain the necessary data.

Implementing Automation

At each layer of testing, automation can help ensure long-term reliability of the data and quickly identify problems during development and process updates. This can include unit tests, trend alerts, or anything in between. These are valuable for products that are being changed frequently or require heavy monitoring. In the UDP, we automated almost all of the tests around transformations and aggregations, which allowed for shorter test cycles while iterating through the process and provided long-term stability monitoring of the parsing process in case anything changes in the future or a new system needs to be tested.

Not all tests need to be automated or created as monitors. To determine which tests should be automated, I try to focus on three areas:

• Overall totals that indicate system health and accuracy
• Edge cases that have a large effect on the data
• How much effect code changes can have on the data

There are four general levels of testing, and each of these levels generally describes how the tests are implemented:

Unit
Unit tests focus on single complete components in isolation.

Integration
Integration tests focus on two components working together to build a new or combined dataset.

System
System-level tests verify the infrastructure and overall process itself as a whole.

Acceptance
Acceptance tests validate data as reasonable before publishing or appending datasets.

In the UDP, because having complete sets of logs was critical, a separate system-level test was created to run before the rest of the process to ensure that data for each day and hour could be identified in the log files. This approach further ensured that critical and difficult-to-find errors would not go unnoticed. Other tests we focused on were between transformations of the data, such as comparing initial parsed logs as well as aggregate counts of users and total page views. Some tests, such as categorization verification, were only done manually, because most changes to the process should not affect this data and any change in categorization would require more manual testing either way.

Different tests require different kinds of automation; for example, we created an automated test to validate the final reporting tables, which included a column for total impressions as well as the breakdown for type of impression based on that impression being caused by a new page view versus an ad refresh, and so on. This test was implemented as a unit test to ensure that at a low level the total was equal to the sum of the page view types.
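A minimal sketch of that kind of unit test, written with Python’s unittest against a hypothetical reporting row (the column names are stand-ins for the real reporting schema): the total impression count must equal the sum of the impression-type breakdown.

```python
import unittest

def impression_breakdown_is_consistent(row):
    """Return True when the stored total equals the sum of its breakdown columns."""
    return row["total_impressions"] == (
        row["new_page_view_impressions"]
        + row["ad_refresh_impressions"]
        + row["other_impressions"]
    )

class ReportingTableTests(unittest.TestCase):
    def test_total_equals_sum_of_impression_types(self):
        # A hypothetical row from the final reporting table.
        row = {
            "total_impressions": 120,
            "new_page_view_impressions": 90,
            "ad_refresh_impressions": 25,
            "other_impressions": 5,
        }
        self.assertTrue(impression_breakdown_is_consistent(row))

if __name__ == "__main__":
    unittest.main()
```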
Another unit test included creating samples for the log parsing logic, including edge cases as well as both common and invalid examples. These were fed through the parsing logic after each change to it as we discovered new elements of the data. One integration test included in the automation suite was the test to ensure that country data from the third-party geographical dataset was valid and present. The automated tests for data integrity and reliability using monitors and trends were done at the acceptance level after processing, to ensure valid data that followed the expected patterns before publishing it. Usually when automated tests are needed, there will be some at every level.

It is helpful to document test suites and coverage, even if they are not automated immediately or at all. This makes it easy to review tests and coverage, as well as allow for new or inexperienced testers, developers, and so on, to assist in automation and manual testing. Usually, I just record tests as they are manually created and executed. This helps to document edge cases and other expectations and attributes of the data.

As needed, when critical tests were identified, we worked to automate those tests to allow for faster iterations working with the data. Because almost all code changes required some regression testing, covering critical and high-level tests automatically provided easy smoke testing for the system and gave some confidence in the continued integrity of the data when changes were made.

Conclusion

Having confidence in data accuracy and integrity can be a daunting task, but it can be accomplished without having a Ph.D. or a background in data analysis. Although you cannot use some of these strategies in every scenario or project, they should provide a guide for how you think about data verification, analysis, and automation, as well as give you the tools and ways to think about data to be able to provide confidence that the data you’re using is trustworthy. It is important that you become familiar with the data at each layer and create tests between each transformation to ensure consistency in the data. Becoming familiar with the data will allow you to understand what edge cases to look for, as well as trends and outliers to expect. It will usually be necessary to work with other teams and groups to improve and validate data accuracy (a quick drink never hurts to build rapport). Some ways to make this collaboration easier are to understand what the focus is for those you are collaborating with and to show how the data can be valuable for those teams to use themselves. Finally, you can ensure and monitor reliability through automation of process tests and acceptance tests that verify trends and boundaries and also allow the data collection processes to be converted and iterated on easily.

Further Reading

Peters, M. (2013). “How Do You Know If Your Data Is Accurate?” Retrieved December 12, 2016, from http://bit.ly/2gJz84p
Polovets, L. (2011). “Data Testing Challenge.” Retrieved December 12, 2016, from http://bit.ly/2hfakCF
Chen, W. (2010). “How to Measure Data Accuracy?” Retrieved December 12, 2016, from http://bit.ly/2gj2wxp
Chen, W. (2010). “What’s the Root Cause of Bad Data?” Retrieved December 12, 2016, from http://bit.ly/2hnkm7x
Jain, K. (2013). “Being paranoid about data accuracy!” Retrieved December 12, 2016, from http://bit.ly/2hbS0Kh

About the Author

Since graduating from the University of Texas at Austin with a BS in computer science, Jessica Roper has worked as a software developer working with data to maintain, process, scrub, warehouse, test, report on, and create products for it. She is an avid mentor and teacher, taking any opportunity available to share knowledge. Jessica is currently a senior developer in the data analytics division of Spiceworks, Inc., a network used by IT professionals to stay connected and monitor their systems. Outside of her technical work, she enjoys biking, swimming, cooking, and traveling.
