Big Data All-Stars: Real-World Stories and Wisdom from the Best in Big Data

Presented by Datanami. Sponsored by MapR.

Introduction

Those of us looking to take a significant step towards creating a data-driven business sometimes need a little inspiration from those who have traveled the path we are looking to tread. This book presents a series of real-world stories from people on the big data frontier who have moved beyond experimentation to creating sustainable, successful big data solutions within their organizations.

Read these stories to get an inside look at nine "big data all-stars" who have been recognized by MapR and Datanami as having achieved great success in the expanding field of big data. Use the examples in this guide to help you develop your own methods, approaches, and best practices for creating big data solutions within your organization. Whether you are a business analyst, data scientist, enterprise architect, IT administrator, or developer, you'll gain key insights from these big data luminaries—insights that will help you tackle the big data challenges you face in your own company.

Table of Contents

How comScore Uses Hadoop and MapR to Build its Business
Michael Brown, CTO at comScore
comScore uses MapR to manage and scale their Hadoop cluster of 450 servers, create more files, process more data faster, and produce better streaming and random I/O results. MapR allows comScore to easily access data in the cluster and just as easily store it in a variety of warehouse environments.

Making Good Things Happen at Wells Fargo
Paul Cao, Director of Data Services for Wells Fargo's Capital Markets business
Wells Fargo uses MapR to serve the company's data needs across the entire banking business, which involve a variety of data types including reference data, market data, and structured and unstructured data, all under the same umbrella. Using NoSQL and Hadoop, their solution requires the utmost in security, ease of ingest, ability to scale, high performance, and—particularly important for Wells Fargo—multi-tenancy.

Coping with Big Data at Experian—"Don't Wait, Don't Stop"
Tom Thomas, Director of IT at Experian
Experian uses MapR to store in-bound source data. The files are then available for analysts to query with SQL via Hive, without the need to build and load a structured database. Experian is now able to achieve significantly more processing power and storage space, and clients have access to deeper data.

Trevor Mason and Big Data: Doing What Comes Naturally
Trevor Mason, Vice President Technology Research at IRI
IRI used MapR to maximize file system performance, facilitate the use of a large number of smaller files, and send files via FTP from the mainframe directly to the cluster. With Hadoop, they have been able to speed up data processing while reducing mainframe load, saving more than $1.5 million.

Leveraging Big Data to Economically Fuel Growth
Kevin McClowry, Director of Analytics Application Development at TransUnion
TransUnion uses a hybrid architecture made of commercial databases and Hadoop so that their analysts can work with data in a way that was previously out of reach. The company is introducing the analytics architecture worldwide and sizing it to fit the needs and resources of each country's operation.

Making Big Data Work for a Major Oil & Gas Equipment Manufacturer
Warren Sharp, Big Data Engineer at National Oilwell Varco (NOV)
NOV created a data platform for time-series data from sensors and control systems to support deep analytics and machine learning. The organization is now able to build, test, and deliver complicated condition-based maintenance models and applications.

The NIH Pushes the Boundaries of Health Research with Data Analytics
Chuck Lynch, Chief Knowledge Officer at the National Institutes of Health
The National Institutes of Health created a five-server cluster that enables the office to effectively apply analytical tools to newly-shared data. NIH can now do things with health science data it couldn't before, and in the process, advance medicine.

Keeping an Eye on the Analytic End Game at UnitedHealthcare
Alex Barclay, Vice President of Advanced Analytics at UnitedHealthcare
UnitedHealthcare uses Hadoop as a basic data framework and built a single platform equipped with the tools needed to analyze information generated by claims, prescriptions, plan participants, care providers, and claim review outcomes. They can now identify mispaid claims in a systematic, consistent way.

Creating Flexible Big Data Solutions for Drug Discovery
David Tester, Application Architect at Novartis Institutes for Biomedical Research
Novartis Institutes for Biomedical Research built a workflow system that uses Hadoop for performance and robustness. Bioinformaticians use their familiar tools and metadata to write complex workflows, and researchers can take advantage of the tens of thousands of experiments that public organizations have conducted.

How comScore Uses Hadoop and MapR to Build its Business
Michael Brown, CTO at comScore

When comScore was founded in 1999, Mike Brown, the company's first engineer, was immediately immersed in the world of Big Data. The company was created to provide digital marketing intelligence and digital media analytics in the form of custom solutions in online audience measurement, e-commerce, advertising, search, video and mobile. Brown's job was to create the architecture and design to support the founders' ambitious plans.

It worked. Over the past 15 years comScore has built a highly successful business and a customer base composed of some of the world's top companies—Microsoft, Google, Yahoo!, Facebook, Twitter, craigslist, and the BBC, to name just a few. Overall the company has more than 2,100 clients worldwide, with measurements derived from 172 countries and 43 markets reported.

To service this extensive client base, well over 1.8 trillion interactions are captured monthly, equal to about 40% of the monthly page views of the entire Internet. This is Big Data on steroids. Brown, who was named CTO in 2012, continues to grow and evolve the company's IT infrastructure to keep pace with this constantly increasing data deluge. "We were a Dell shop from the beginning. In 2002 we put together our own grid processing stack to tie all our systems together in order to deal with the fast-growing data volumes," Brown recalls.

Introducing Unified Digital Measurement

In addition to its ongoing business, in 2009 the company embarked on a new initiative called Unified Digital Measurement (UDM), which directly addresses the frequent disparity between census-based site analytics data and panel-based audience measurement data. UDM blends these two approaches into a "best of breed" methodology that combines person-level measurement from the two-million-person comScore global panel with census-informed consumption to account for 100 percent of a client's audience. UDM helped prompt a new round of IT infrastructure upgrades.

"The volume of data was growing rapidly and processing requirements were growing dramatically as well," Brown says. "In addition, our clients were asking us to turn the data around much faster. So we looked into building our own stack again, but decided we'd be better off adopting a well accepted, open source, heavy duty processing model—Hadoop."

With the implementation of Hadoop, comScore continued to expand its server cluster. Multiple servers also meant they had to solve the Hadoop shuffle problem: during the high-volume, parallel processing of data sets coming in from around the world, data is scattered across the server farm, and to count the number of events, all this data has to be gathered, or "shuffled," into one location.

comScore needed a Hadoop platform that could not only scale, but also provide data protection and high availability while being easy to use. It was requirements like these that led Brown to adopt the MapR distribution for Hadoop. He was not disappointed—by using the MapR distro, the company is able to more easily manage and scale their Hadoop cluster, create more files and process more data faster, and produce better streaming and random I/O results than other Hadoop distributions. "With MapR we see a 3X performance increase running the same data and the same code—the jobs just run faster." In addition, the MapR solution provides the requisite data protection and disaster recovery functions: "MapR has built in to the design an automated DR strategy," Brown notes.

Solving the Shuffle

He said they leveraged a feature in MapR known as volumes to directly address the shuffle problem. "It allows us to make this process run superfast. We reduced the processing time from 36 hours to three hours—no new hardware, no new software, no new anything, just a design change. This is just what we needed to colocate the data for efficient processing."

Using volumes to optimize processing was one of several unique solutions that Brown and his team applied to processing comScore's massive amounts of data. Another innovation is pre-sorting the data before it is loaded into the Hadoop cluster. Sorting improves the data's storage compression ratio from the usual 3:1 to a highly compressed 8:1 with no data loss. And this leads to a cascade of benefits: more efficient processing with far fewer IOPS, less data to read from disk, and less equipment, which, in turn, means savings on power, cooling and floor space.
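
The effect of pre-sorting on compression is easy to see on synthetic data: when similar records end up next to each other, a general-purpose compressor finds longer repeated runs. The sketch below is purely illustrative (made-up page-view records and zlib, not comScore's actual data, codec, or ratios).

```python
import random
import zlib

random.seed(0)

# Synthetic "page view" records: site, path, and user drawn from fixed pools.
sites = [f"site{i:03d}.example" for i in range(50)]
paths = [f"/page/{i}" for i in range(200)]
records = [
    f"{random.choice(sites)},{random.choice(paths)},user{random.randrange(100000)}"
    for _ in range(200_000)
]

def compressed_size(rows):
    """Size in bytes of the newline-joined rows after zlib compression."""
    return len(zlib.compress("\n".join(rows).encode(), 6))

unsorted_size = compressed_size(records)
sorted_size = compressed_size(sorted(records))  # pre-sort before writing

print(f"unsorted: {unsorted_size / 1e6:.2f} MB compressed")
print(f"sorted:   {sorted_size / 1e6:.2f} MB compressed")
print(f"improvement: {unsorted_size / sorted_size:.2f}x")
```

The absolute numbers depend entirely on the data; the point is simply that ordering records so similar values sit next to each other gives the compressor more redundancy to exploit, which is the mechanism behind the 3:1 versus 8:1 figures Brown describes.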

"HDFS is great internally," says Brown. "But to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount HDFS as NFS and then use native tools whether they're in Windows, Unix, Linux or whatever. NFS allowed our enterprise to easily access data in the cluster and just as easily store it in a variety of warehouse environments."

For the near future, Brown says the comScore IT infrastructure will continue to scale to meet new customer demand. The Hadoop cluster has grown to 450 servers with 17,000 cores and more than 10 petabytes of disk. MapR's distro of Hadoop is also helping to support a major new product announced in 2012 and enjoying rapid growth. Known as validated Campaign Essentials (vCE), the new measurement solution provides a holistic view of campaign delivery and a verified assessment of ad-exposed audiences via a single, third-party source. vCE also allows the identification of non-human traffic and fraudulent delivery.

Start Small

When asked if he had any advice for his peers in IT who are also wrestling with Big Data projects, Brown commented, "We all know we have to process mountains of data, but when you begin developing your environment, start small. Cut out a subset of the data and work on that first while testing your code and making sure everything functions properly. Get some small wins. Then you can move on to the big stuff."

Making Good Things Happen at Wells Fargo
Paul Cao, Director of Data Services for Wells Fargo's Capital Markets business

When Paul Cao joined Wells Fargo several years ago, his timing was perfect. Big Data analytic technology had just made a major leap forward, providing him with the tools he needed to implement an ambitious program designed to meet the company's analytic needs.

Wells Fargo is big—a nationwide, community-based financial services company with $1.8 trillion in assets. It provides its various services through 8,700 locations as well as on the Internet and through mobile apps. The company has some 265,000 employees and offices in 36 countries. They generate a lot of data.

Cao has been working with data for twenty years. Now, as the Director of Data Services for Wells Fargo's Capital Markets business, he is creating systems that support the Business Intelligence and analytic needs of its far-flung operations.

Meeting Customer and Regulatory Needs

"We receive massive amounts of data from a variety of different systems, covering all types of securities (equity, fixed income, FX, etc.) from around the world," Cao says. "Many of our models reflect the interactions between these systems—it's multi-layered. The analytic solutions we offer are not only driven by customers' needs, but by regulatory considerations as well.

"We serve the company's data needs across the entire banking business and so we work with a variety of data types including reference data, market data, structured and unstructured data, all under the same umbrella," he continues. "Because of the broad scope of the data we are dealing with, we needed tools that could handle the volume, speed and variety of data as well as all the requirements that had to be met in order to process that data. Just one example is market tick data. For North American cash equities, we are dealing with up to three million ticks per second, a huge amount of data that includes all the different price points for the various equity stocks and the movement of those stocks."

Enterprise NoSQL on Hadoop

Cao says that given his experience with various Big Data solutions in the past and the recent revolution in the technology, he and his team were well aware of the limitations of more traditional relational databases. So they concentrated their attention on solutions that support NoSQL and Hadoop. They wanted to deal with vendors like MapR that could provide commercial support for the Hadoop distribution rather than relying on open source channels. The vendors had to meet criteria such as their ability to provide the utmost in security, ease of ingest, ability to scale, high performance, and—particularly important for Wells Fargo—multi-tenancy.

Cao explains that he is partnering with the Wells Fargo Enterprise Data & Analytics and Enterprise Technology Infrastructure teams to develop a platform servicing many different kinds of capital markets related data, including files of all sizes and real-time and batch data from a variety of sources within Wells Fargo. Multi-tenancy is a must to cost-efficiently and securely share IT resources and allow different business lines, data providers and data consumer applications to coexist on the same cluster with true job isolation and customized security. The MapR solution, for example, provides powerful features to logically partition a physical cluster to provide separate administrative control, data placement, job execution and network access.

Dramatic Change to Handling Data

"The new technology we are introducing is not an incremental change—this is a dramatic change in the way we are handling data," Cao says. "Among our challenges is to get users to accept working with the new Hadoop and NoSQL infrastructure, which is so different from what they were used to. Within Data Services, we have been fortunate to have people who not only know the new technology, but really know the business. This domain expertise is essential to an understanding of how to deploy and apply the new technologies to solve essential business problems and work successfully with our users."

When asked what advice he would pass on to others working with Big Data, Cao reiterates his emphasis on gaining a solid understanding of the new technologies along with a comprehensive knowledge of their business domain. "This allows you to marry business and technology to solve business problems," he concludes. "You'll be able to understand your users' concerns and work with them to make good things happen."

Leveraging Big Data to Economically Fuel Growth
Kevin McClowry, Director of Analytics Application Development at TransUnion

"We are trying to foster innovation and growth. Embracing these new Big Data platforms and architectures has helped lay that foundation. Nurturing the expertise and creativity within our analysts is how we'll build on that foundation."

And when asked what he would say to other technologists introducing Big Data into their organizations, McClowry advises, "You'll be tempted and pressured to over-promise the benefits. Try to keep the expectations grounded in reality. You can probably reliably avoid costs, or rationalize a few silos to start out, and that can be enough as a first exercise. But along the way, acknowledge every failure and use it to drive the next foray into the unknown; those unintended consequences are often where you get the next big idea. And as far as technologies go, these tools are not just for the techno-giants anymore and there is no indication that there will be fewer of them tomorrow. Organizations need to understand how they will and will not leverage them."

Making Big Data Work for a Major Oil & Gas Equipment Manufacturer
Warren Sharp, Big Data Engineer at National Oilwell Varco (NOV)

Big data requires a big vision. This was one of the primary reasons that Warren Sharp was asked to join National Oilwell Varco (NOV) a little over six months ago. NOV is a worldwide leader in the design, manufacture and sale of equipment and components used in oil and gas drilling and production operations, and the provision of oilfield services to the upstream oil and gas industry.

Sharp, whose title is Big Data Engineer in NOV's Corporate Engineering and Technology Group, honed his Big Data analytic skills with a previous employer—a leading waste management company that was collecting information about driver behavior by analyzing GPS data for 15,000 trucks around the country.

The goals are more complicated and challenging at NOV. Says Sharp, "We are creating a data platform for time-series data from sensors and control systems to support the deep analytics and machine learning. This platform will efficiently ingest and store all time-series data from any source within the organization and make it widely available to tools that talk Hadoop or SQL. The first business use case is to support Condition-Based Maintenance efforts by making years of equipment sensor information available to all machine learning applications from a single source."

MapR at NOV

For Sharp, using the MapR data platform was a given—he was already familiar with its features and capabilities. Coincidentally, his boss-to-be at NOV had already come to the same conclusion six months earlier and made MapR a part of their infrastructure. "Learning that MapR was part of the infrastructure was one of the reasons I took the job," comments Sharp. "I realized we had compatible ideas about how to solve Big Data problems."

"MapR is relatively easy to install and set up, and the POSIX-compliant NFS-enabled clustered file system makes loading data onto MapR very easy," Sharp adds. "It is the quickest way to get started with Hadoop and the most flexible in terms of using ecosystem tools. The next step was to figure out which tools in the Hadoop ecosystem to include to create a viable solution."
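
Because an NFS-mounted cluster behaves like an ordinary POSIX filesystem, "loading data" can be as simple as writing or copying files with standard tools. The sketch below illustrates the idea in Python; the mount point, directory layout, and field names are hypothetical, not NOV's actual conventions.

```python
import csv
import shutil
from pathlib import Path

# Hypothetical NFS mount point; a real deployment would use whatever path the
# cluster is mounted at (often something like /mapr/<cluster-name>/...).
CLUSTER_ROOT = Path("/mapr/my-cluster/sensor-data")

def land_readings(source: str, day: str, readings) -> None:
    """Write one day's readings for one source as a plain CSV on the cluster.

    No Hadoop client or HDFS export step is involved; ordinary file APIs work
    because the mount is POSIX-compliant.
    """
    out_dir = CLUSTER_ROOT / f"source={source}" / f"date={day}"
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "readings.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["timestamp", "tag", "value"])
        writer.writerows(readings)

def copy_existing_file(local_path: str, source: str, day: str) -> None:
    """Existing files can simply be copied onto the mount with standard tools."""
    dest = CLUSTER_ROOT / f"source={source}" / f"date={day}" / Path(local_path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(local_path, dest)

if __name__ == "__main__":
    land_readings("rig-042", "2015-11-16",
                  [("2015-11-16T00:00:00Z", "mud_pump_pressure", 3120.5)])
```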

Querying OpenTSDB

The initial goal was to load large volumes of data into OpenTSDB, a time series database. However, Sharp realized that other Hadoop SQL-based tools could not query the native OpenTSDB data table easily, so he designed a partitioned Hive table to store all ingested data as well. This hybrid storage approach supported options to negotiate the tradeoffs between storage size and query time, and has yielded some interesting results. For example, Hive allowed data to be accessed by common tools such as Spark and Drill for analytics with query times in minutes, whereas OpenTSDB offered near-instantaneous visualization of months and years of data. The ultimate solution, says Sharp, was to ingest data into a canonical partitioned Hive table for use by Spark and Drill, and to use Hive to generate files for the OpenTSDB import process.

Coping with Data Preparation

Storage presented another problem. "Hundreds of billions of data points use a lot of storage space," he notes. "Storage space is less expensive now than it's ever been, but the physical size of the data also affects read times of the data while querying. Understanding the typical read patterns of the data allows us to lay down the data in MapR in a way to maximize the read performance. Moreover, partitioning data by its source and date leads to compact daily files." Sharp found that both the ORC (Optimized Row Columnar) format and Spark were essential tools for handling time-series data and analytic queries over larger time ranges.
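
A minimal PySpark sketch of that layout idea follows: readings are written as ORC, partitioned by source and date, so a query over a time range only scans the matching directories. The column names, paths, and values are illustrative only; NOV's actual schema, Hive table definitions, and OpenTSDB export step are not shown here.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-history-sketch").getOrCreate()

# One small batch of readings (in practice this would come from the ingest feed).
batch = (spark.createDataFrame(
            [("2015-11-16 00:00:00", "mud_pump_pressure", 3120.5,
              "rig-042", "2015-11-16")],
            ["ts", "tag", "value", "source", "dt"])
         .withColumn("ts", F.to_timestamp("ts"))
         .withColumn("dt", F.to_date("dt")))

# Store as ORC, laid out by source and date; each (source, dt) pair becomes its
# own directory of compact columnar files.
(batch.write
      .mode("append")
      .partitionBy("source", "dt")
      .orc("/data/sensor_history"))

# An analytic query over a month of data reads only the matching dt partitions.
history = spark.read.orc("/data/sensor_history")
monthly = (history
           .where((F.col("dt") >= "2015-11-01") & (F.col("dt") <= "2015-11-30"))
           .groupBy("source", "tag")
           .agg(F.avg("value").alias("avg_value")))
monthly.show()
```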

Bottom Line

As a result of his efforts, he has created a very compact, lossless storage mechanism for sensor data. Each terabyte of storage has the capacity to store from 750 billion up to trillions of data points. This is equivalent to 20,000 to 150,000 sensor-years of 1 Hz data, and will allow NOV to store all sensor data on a single MapR cluster. "Our organization now has platform data capabilities to enable Condition-Based Maintenance," Sharp says. "All sensor data are accessible by any authorized user or application at any time for analytics, machine learning, and visualization with Hive, Spark, OpenTSDB and other vendor software. The Data Science and Product teams have all the tools and data necessary to build, test, and deliver complicated CBM models and applications."

Becoming a Big Data All Star

When asked what advice he might have for other potential Big Data All Stars, Sharp comments, "Have a big vision. Use cases are great to get started; vision is critical to creating a sustainable platform."

"Learn as much of the ecosystem as you can, what each tool does and how it can be applied. End-to-end solutions won't come from a single tool or implementation, but rather by assembling the use of a broad range of available Big Data tools to create solutions."

The NIH Pushes the Boundaries of Health Research with Data Analytics
Chuck Lynch, Chief Knowledge Officer at National Institutes of Health

Few things probably excite a data analyst more than data on a mission, especially when that mission has the potential to literally save lives. That fact might make the National Institutes of Health the mother lode of gratifying work projects for the data analysts who work there. In fact, the NIH is 27 separate Institutes and Centers under one umbrella title, all dedicated to the most advanced biomedical research in the world. At approximately 20,000 employees strong, including some of the most prestigious experts in their respective fields, the NIH is generating a tremendous amount of data on healthcare research. From studies on cancer, to infectious diseases, to AIDS, to women's health issues, the NIH probably has more data on each topic than nearly anyone else. Even the agency's library—the National Library of Medicine—is the largest of its kind in the world.

Data Lake Gives Access to Research Data

'Big data' has been a very big thing for the NIH for some time. But this fall the NIH will benefit from a new ability to combine and compare separate institute grant data sets in a single 'data lake'. With the help of MapR, the NIH created a five-server cluster—with approximately 150 terabytes of raw storage—that will be able to "accumulate that data, manipulate the data and clean it, and then apply analytics tools against it," explains Chuck Lynch, a senior IT specialist with the NIH Office of Portfolio Analysis, in the Division of Program Coordination, Planning, and Strategic Initiatives.

If Lynch's credentials seem long, they actually get longer. Add to the above the Office of the Director, which coordinates the activities of all of the institutes. Each individual institute in turn has its own director, and a separate budget, set by Congress.

"What the NIH does is basically drive the biomedical research in the United States in two ways," Lynch explains. "There's an intramural program where we have the scientists here on campus doing biomedical research in laboratories. They are highly credentialed and many of them are world famous."

"Additionally, we have an extramural program where we issue billions of dollars in grants to universities and to scientists around the world to perform biomedical research—both basic and applied—to advance different areas of research that are of concern to the nation and to the world," Lynch says.

This is all really great stuff, but it just got a lot better. The new cluster enables the office to effectively apply analytical tools to the newly-shared data. The hope is that the NIH can now do things with health science data it couldn't before, and in the process advance medicine.

Expanding Access to 'Knowledge Stores'

As Lynch notes, 'big data' is not about having volumes of information. It is about the ability to apply analytics to data to find new meaning and value in it. That includes the ability to see new relationships between seemingly unrelated data, and to discover gaps in those relationships. As Lynch describes it, analytics helps you better know what you don't know. If done well, big data raises as many questions as it provides answers, he says.

The challenge for the NIH was the large number of institutes collecting and managing their own data. Lynch refers to them as "knowledge stores" of the scientific research being done. "We would tap into these and do research on them, but the problem was that we really needed to have all the information at one location where we could manipulate it without interfering with the [original] system of record," Lynch says. "For instance, we have an organization that manages all of the grants, research, and documentation, and we have the Library of Medicine that handles all of the publications in medicine. We need that information, but it's very difficult to tap into those resources and have it all accumulated to do the analysis that we need to. So the concept that we came up with was building a data lake," Lynch recalls. That was exactly one year ago, and the NIH initially hoped to undertake the project itself.

"We have a system here at NIH that we're not responsible for called Biowulf, which is a play on Beowulf. It's a high speed computing environment but it's not data intensive. It's really computationally intensive," Lynch explains. "We first talked to them but we realized that what they had wasn't going to serve our purposes."

So the IT staff at NIH worked on a preliminary design, and then engaged vendors to help formulate a more formal design. From that process the NIH chose MapR to help it develop the cluster. "We used end-of-year funding in September of last year to start the procurement of the equipment and the software," Lynch says. "That arrived here in the November/December timeframe and we started to coordinate with our office of information technology to build the cluster out. Implementation took place in the April to June timeframe, and the cluster went operational in August."

Training Mitigates Learning Curve

"What we're doing is that we're in the process of testing the system and basically wringing out the bugs," Lynch notes. "Probably the biggest challenge that we've faced is our own learning curve; trying to understand the system. The challenge that we have right now, as we begin to put data into the system, is how we want to deploy that data. Some of the data lends itself to the different elements of the MapR ecosystem. What should we be putting it into—not just raw data, but should we be using Pig or Hive or any of the other ecosystem elements?"

Key to the project's success so far, and going forward, is training. "Many of the people here are biomedical scientists. The vast majority of them have PhDs in biomedical science or chemistry or something. We want them to be able to use the system directly," Lynch says. "We had MapR come in and give us training and also give our IT people training on administering the MapR system and using the tools."

But that is the beginning of the story, not the conclusion. As Lynch notes, "The longer journey is now to use it for true big data analysis; to find tools that we can apply to it; to index; to get metadata; to look at the information that we have there and to start finding things in the data that we had never seen before."

"Our view is that applying big data analytics to the data that we have will help us discover relationships that we didn't realize existed," Lynch continues.

"Success for us is being able to answer the questions being given to us by senior leadership," Lynch says. "For example, is the research that we're doing in a particular area productive? Are we getting value out of research? Is there something that we're missing? Is there overlap in the different types of research, or are there gaps in the research? And in what we are funding, are we returning value to the public?"

Next Steps

So what is the next step for the NIH? "To work with experts in the field to find better ways of doing analysis with the data and make it truly a big data environment as opposed to just a data lake," Lynch says. "That will involve a considerable amount of effort and will take us some time to put that together. I think what we're interested in doing is comparing and contrasting different methods and analytic techniques. That is the long haul."

"The knowledge that we're dealing with is so complex," Lynch concludes. "In this environment it is a huge step forward and I think it is going to resonate with the biomedical community at large. There are other biomedical organizations that look to NIH to drive methods and approaches and best practices. I think this is the start of a new best practice."

Keeping an Eye on the Analytic End Game at UnitedHealthcare
Alex Barclay, Vice President of Advanced Analytics at UnitedHealthcare

When Alex Barclay received his Ph.D. in mathematics from the University of California, San Diego in 1999, he was already well on his way to a career focused on big data. Barclay brought his interest and expertise in analytics to Fair Isaac, a software analytics company, and then Experian, the credit reporting company. Then, two years ago, he joined UnitedHealthcare and brought his experience with Big Data and a mature analytic environment to help advance the company's Payment Integrity analytic capabilities.

UnitedHealthcare offers the full spectrum of health benefit programs to individuals, employers, military service members, retirees and their families, and Medicare and Medicaid beneficiaries, and contracts directly with more than 850,000 physicians and care professionals, and 6,000 hospitals and other care facilities nationwide. For Barclay, as Vice President of Advanced Analytics for the company's Payment Integrity organization, these numbers translated into huge amounts of data flowing into and through the company. When he first surveyed the internal big data landscape, he found an ad hoc approach to analytics characterized by data silos and a heavily rule-based, fragmented data environment.

Building a New Environment

"In the other environments I've been in, we had an analytic sandbox development platform and a comprehensive, integrated analytic delivery platform," said Barclay. "This is what I wanted to set up for UnitedHealthcare. The first thing I did was partner with our IT organization to create a big data environment that we didn't have to re-create every year—one that would scale over time."

The IT and analytic team members used Hadoop as the basic data framework and built a single platform equipped with the tools needed to analyze information generated by claims, prescriptions, plan participants and contracted care providers, and associated claim review outcomes. "We spent about a year pulling this together—I was almost an IT guy as opposed to an analytics guy," Barclay added.

Rather than tackling a broad range of organizational entities, Barclay has concentrated his team's efforts on a single, key function—payment integrity. "The idea is to use analytics to ensure that once a claim is received we pay the correct amount, no more, no less, including preventing fraudulent claims," he said. The Payment Integrity organization handles more than a million claims every day. The footprint for this data is about 10 terabytes and growing, according to Barclay. It is also complex; data is generated by 16 different platforms, so although the claim forms are similar, they are not the same and must be rationalized.
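
To make the rationalization step concrete, the sketch below shows one common pattern: per-platform field names are mapped onto a single canonical claim schema before analysis. The platform names, field names, and rules here are invented for illustration; the article does not describe UnitedHealthcare's actual schemas.

```python
import pandas as pd

# Hypothetical field-name mappings for two of the source platforms.
FIELD_MAPS = {
    "platform_a": {"clm_id": "claim_id", "svc_dt": "service_date", "billed": "billed_amount"},
    "platform_b": {"ClaimNo": "claim_id", "DateOfService": "service_date", "ChargeAmt": "billed_amount"},
}

CANONICAL_COLUMNS = ["claim_id", "service_date", "billed_amount", "source_platform"]

def rationalize(frame: pd.DataFrame, platform: str) -> pd.DataFrame:
    """Rename one platform's columns to the canonical schema and normalize types."""
    out = frame.rename(columns=FIELD_MAPS[platform])
    out["service_date"] = pd.to_datetime(out["service_date"])
    out["billed_amount"] = pd.to_numeric(out["billed_amount"])
    out["source_platform"] = platform
    return out[CANONICAL_COLUMNS]

a = pd.DataFrame({"clm_id": ["A-1"], "svc_dt": ["2015-09-01"], "billed": ["125.00"]})
b = pd.DataFrame({"ClaimNo": ["B-9"], "DateOfService": ["2015-09-02"], "ChargeAmt": [480.0]})

claims = pd.concat(
    [rationalize(a, "platform_a"), rationalize(b, "platform_b")],
    ignore_index=True,
)
print(claims)
```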

Another major challenge to revamping the organization's approach to big data was finding the right tools. "The tools landscape for analytics is very dynamic—it changes practically every day," said Barclay. "What's interesting is that some of the tools we were looking at two years ago and rejected because they didn't yet have sufficient capability for our purposes, have matured over time and now can meet our needs."

Embracing Big Data

"It's working," said Barclay. "We have been able to help identify mispaid claims in a systematic, consistent way. And we have encouraged the company to embrace big data analytics and move toward broadening the landscape to include other aspects of our business including clinical, care provider networks, and customer experience.

"We emphasize innovation. For example, we apply artificial intelligence and deep learning methodologies to better understand our customers and meet their needs. We are broadening our analytic scope to look beyond claims and understand our customers' health care needs. We're really just at the beginning of a long and rewarding Big Data journey."

When asked if he had any advice for his fellow data analysts who might be implementing a big data analytic solution, Barclay replied: "Be patient. Start slow and grow with a clear vision of what you want to accomplish. If you don't have a clearly defined use case, you can get lost in the mud in a hurry. With Payment Integrity, in spite of some early challenges, we created something that is real and has tangible payback. We always knew what the end game was supposed to look like."

Creating Flexible Big Data Solutions for Drug Discovery
David Tester, Application Architect at Novartis Institutes for Biomedical Research

He didn't know it at the time, but when high school student David Tester acquired his first "computer"—a TI-82 graphing calculator from Texas Instruments—he was on a path that would inevitably lead to Big Data. Tester's interest in computation continued through undergraduate and graduate school, culminating in a Ph.D. from the University of Oxford. He followed a rather unusual path, investigating where formal semantics and logic interact with statistical reasoning. He notes that this problem is one that computers are still not very good at solving as compared to people. Although computers are excellent for crunching statistics and for following an efficient chain of logic, they still fall short where those two tasks combine: using statistical heuristics to guide complex chains of logic.

Data Science Techniques for Drug Discovery

Since joining the Novartis Institutes for Biomedical Research over two years ago, Tester has been making good use of his academic background. He works as an application architect charged with devising new applications of data science techniques for drug research. His primary focus is on genomic data—specifically Next Generation Sequencing (NGS) data, a classic Big Data application.

In addition to dealing with vast amounts of raw heterogeneous data, one of the major challenges facing Tester and his colleagues is that best practices in NGS research are an actively moving target. Additionally, much of the cutting-edge research requires heavy interaction with diverse data from external organizations.

For these reasons, making the most of the latest NGS research in the literature ends up having two major parts. Firstly, it requires workflow tools that are robust enough to process vast amounts of raw NGS data yet flexible enough to keep up with quickly changing research techniques. Secondly, it requires a way to meaningfully integrate data from Novartis with data from these large external organizations—such as 1000 Genomes, NIH's GTEx (Genotype-Tissue Expression) and TCGA (The Cancer Genome Atlas)—paying particular attention to clinical, phenotypical, experimental and other associated data. Integrating these heterogeneous datasets is labor intensive, so they only want to do it once. However, researchers have diverse analytical needs that can't be met with any one database. These seemingly conflicting requirements suggest a need for a moderately complex solution.

Finding the Answers

To solve the first part of this NGS Big Data problem, Tester and his team built a workflow system that allows them to process NGS data robustly while being responsive to advances in the scientific literature. Although NGS data involves high data volumes that are ideal for Hadoop, a common problem is that researchers have come to rely on many tools that simply don't work on native HDFS. Since these researchers previously couldn't use systems like Hadoop, they have had to maintain complicated 'bookkeeping' logic to parallelize for optimum efficiency on traditional HPC.

This workflow system uses Hadoop for performance and robustness and MapR to provide the POSIX file access that lets bioinformaticians use their familiar tools. Additionally, it uses the researchers' own metadata to allow them to write complex workflows that blend the best aspects of Hadoop and traditional HPC. As a result of their efforts, the flexible workflow tool is now being used for a variety of different projects across Novartis, including video analysis, proteomics, and metagenomics. An additional benefit is that the integration of data science infrastructure into pipelines built partly from legacy bioinformatics tools can be achieved in days, rather than months.

Built-in Flexibility

For the second part of the problem—the "integrating highly diverse public datasets" requirement—the team used Apache Spark, a fast, general engine for large-scale data processing. Their specific approach to dealing with heterogeneity was to represent the data as a vast knowledge graph (currently trillions of edges) that is stored in HDFS and manipulated with custom Spark code. This use of a knowledge graph lets Novartis bioinformaticians easily model the complex and changing ways that biological datasets connect to one another, while the use of Spark allows them to perform graph manipulations reliably and at scale.
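
The sketch below illustrates the general pattern of a knowledge graph held as an edge table and manipulated with Spark: edges are (subject, predicate, object) rows, and an analyst-facing view is derived by filtering and joining slices of that table. The predicates, identifiers, and derived schema are invented for illustration and are not Novartis's actual data model.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("knowledge-graph-sketch").getOrCreate()

# A toy edge list in (subject, predicate, object) form.
edges = spark.createDataFrame(
    [
        ("GENE:BRCA1",   "expressed_in",    "TISSUE:breast"),
        ("GENE:BRCA1",   "associated_with", "DISEASE:breast_carcinoma"),
        ("COMPOUND:X17", "targets",         "GENE:BRCA1"),
    ],
    ["subject", "predicate", "object"],
)

# One derived "endpoint schema": for each disease-associated gene, list the
# compounds that target it, built by joining two slices of the same edge table.
gene_disease = (edges.where(F.col("predicate") == "associated_with")
                     .select(F.col("subject").alias("gene"),
                             F.col("object").alias("disease")))
compound_gene = (edges.where(F.col("predicate") == "targets")
                      .select(F.col("subject").alias("compound"),
                              F.col("object").alias("gene")))

targets_by_disease = gene_disease.join(compound_gene, on="gene", how="inner")

# In a real pipeline this derived table would be written out (for example as
# ORC, or exported to an analyst's preferred database) rather than displayed.
targets_by_disease.show(truncate=False)
```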

On the analytics side, researchers can access data directly through a Spark API or through a number of endpoint databases with schemas tailored to their specific analytic needs. Their toolchain allows entire schemas with hundreds of billions of rows to be created quickly from the knowledge graph and then imported into the analyst's favorite database technologies.

Spark and MapR Enable Research Advantage

As a result, these combined Spark and MapR-based workflow and integration layers allow the company's life science researchers to meaningfully take advantage of the tens of thousands of experiments that public organizations have conducted—a significant competitive advantage.

"In some ways I feel that I've come full circle from my days at Oxford by using my training in machine learning and formal logic and semantics to bring together all these computational elements," Tester adds. "This is particularly important because, as the cost of sequencing continues to drop exponentially, the amount of data that's being produced increases. We will need to design highly flexible infrastructures so that the latest and greatest analytical tools, techniques and databases can be swapped into our platform with minimal effort as NGS technologies and scientific requirements change. Designing platforms with this fact in mind eases user resistance to change and can make interactions between computer scientists and life scientists more productive."

To his counterparts wrestling with the problems and promise of Big Data in other companies, Tester says that if all you want to do is increase scale and bring down costs, Hadoop and Spark are great. But most organizations' needs are not so simple. For more complex Big Data requirements, the many tools that are available can be looked at merely as powerful components that can be used to fashion a novel solution. The trick is to work with those components creatively, in ways that are sensitive to the users' needs, by drawing upon non-Big Data branches of computer science like artificial intelligence and formal semantics—while also designing the system for flexibility as things change. He thinks that this is an ultimately more productive way to tease out the value of your Big Data implementation.

Summary

Whether you are a business analyst, data scientist, enterprise architect, IT administrator, or developer, the examples in this guide provide concrete applications of big data technologies to help you develop your own methods, approaches, and best practices for creating big data solutions within your organization. Moving beyond experimentation to implementing sustainable big data solutions is necessary to impact the growth of your business.

Are you or your colleagues Big Data All-Stars? If you are pushing the boundaries of big data, we'd love to hear from you. Drop us a note at BigDataAllStars@mapr.com and tell us about your journey or share how your colleagues are innovating with big data.

Originally published in Datanami:
Creating Flexible Big Data Solutions for Drug Discovery. Datanami, January 19, 2015.
Coping with Big Data at Experian—"Don't Wait, Don't Stop". Datanami, September 1, 2014.
Trevor Mason and Big Data: Doing What Comes Naturally. Datanami, October 20, 2014.
Leveraging Big Data to Economically Fuel Growth. Datanami, November 18, 2014.
How comScore Uses Hadoop and MapR to Build its Business. Datanami, December 1, 2014.
Keeping an Eye on the Analytic End Game at UnitedHealthcare. Datanami, September 7, 2015.
The NIH Pushes the Boundaries of Health Research with Data Analytics. Datanami, September 21, 2015.
Making Big Data Work for a Major Oil & Gas Equipment Manufacturer. Datanami, November 16, 2015.
Making Good Things Happen at Wells Fargo. Datanami, January 11, 2016.

Presented by Datanami. Sponsored by MapR.
