PRGX is in the business of making sure its clients, which include 75% of the top 20 global retailers, don't let money slip through the cracks. Now the audit recovery leader is doing the same for itself with its move from a SQL-based data system to one based on Hadoop.
The business, which looks for money that may have been missed in negotiated contracts, pricing, procurements, and payments, got its start in the 1970s, before the whole process was computerized. Back then, auditors had boxes of documents delivered to warehouses and spent countless hours looking through the papers inside.
Today PRGX receives more than 2 million client files each year, amounting to 2.3 petabytes of data from customers (whom the company can't name for a media story), in formats that include Microsoft Excel spreadsheets, EDI, XML, comma-delimited flat files, and PDFs. The data is derived from sources such as purchasing, payment, receiving, deals, point of sale, and emails.
Data can arrive via FTP, email, flash drives, DVDs, hard drives, and tape. PRGX ingests the data and mines it for overpayments, and then retrieves the supporting information to prove customers' claims.
Working with this huge volume of data comes with challenges. PRGX has been keeping the data that is mined in SQL on an IBM AS/400 -- an expensive setup for such big data.
Recently, PRGX evaluated moving its data analysis and recovery audit services off SQL on IBM AS/400 systems and onto a platform that would provide less expensive storage and considerably faster data analysis.
SQL on AS/400 may have been a great solution for the 1990s, when it was designed, but it limited PRGX's ability to rerun jobs with new queries, according to Jonathon Whitton, director of data services at PRGX. Whitton told InformationWeek about PRGX's data analysis modernization project in an interview.
The project started small in the third quarter of 2013.
"We started on old machines under a desk," Whitton said. "What that showed us was that we could do things faster than in the SQL environment."
Building on that early success, Whitton's group started working on a formal proof of concept in 2014, choosing the Cloudera Hadoop stack, along with Talend for unzipping and decrypting client data, cleaning that data, and getting it ready to load into the Hadoop cluster.
The Tech Stack
The process for structured data is different from the one for unstructured data. For structured data, PRGX's data services team uses HDFS (Hadoop Distributed File System) for long-term data storage, Apache Hive and Apache Spark for batch processing, and Apache Impala for investigative queries. PRGX then uses Talend and Apache Sqoop to load the data and then export it to the relational database management system.
For unstructured data such as email, PRGX loads data to HBase for long-term storage via Apache Thrift and uses Apache Tika to read email data from headers, bodies, and attachments. The email system also relies on a series of tools for working with .NET technology.
The Hadoop system launched to production in 2015. By the end of this year, Whitton said, PRGX expects to have 60% of its accounts moved over to the Talend and Cloudera stack.
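To make the header, body, and attachment split concrete, here is a minimal sketch in Python. It is not PRGX's pipeline and it does not use Apache Tika: it swaps in Python's standard-library email package as a stand-in, and the sample message, addresses, and function names (build_sample, extract_parts) are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build a made-up message with one attachment (illustrative data only)."""
    msg = EmailMessage()
    msg["From"] = "buyer@example.com"
    msg["To"] = "vendor@example.com"
    msg["Subject"] = "Q3 rebate terms"
    msg.set_content("Rebate is 2% on orders over 10,000 units.")
    msg.add_attachment(
        b"sku,qty\nA1,12000\n",
        maintype="application",
        subtype="octet-stream",
        filename="orders.csv",
    )
    return msg.as_bytes()


def extract_parts(raw: bytes) -> dict:
    """Split a raw message into the three areas the article mentions."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        # Selected headers, converted to plain strings.
        "headers": {name: str(msg[name]) for name in ("From", "To", "Subject")},
        # Plain-text body, or an empty string if the message has none.
        "body": body.get_content().strip() if body else "",
        # (filename, raw bytes) for each attachment.
        "attachments": [
            (part.get_filename(), part.get_content())
            for part in msg.iter_attachments()
        ],
    }


if __name__ == "__main__":
    parts = extract_parts(build_sample())
    print(parts["headers"])
    print(parts["body"])
    print([name for name, _ in parts["attachments"]])
```

In a setup like the one described, each of the three pieces would presumably be stored separately so auditors can search them; how PRGX lays that out in HBase is not stated in the interview.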
The data services group is delivering jobs an average of 10 times faster than it did on the old infrastructure, Whitton said, and that gives auditors more time to work with the data.
"We have one thing that was taking 160 hours that finished in 8 hours," he said. That faster processing is enabling business users to look deeper in their audits. "A lot of times we'll get requests back from the audit that they want to change one thing and run the job again -- say a shipping date."
Whitton's group serves only internal clients, but he said that since the system migration began his group has been getting more requests, because those clients have more time to work with data.
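The kind of rerun request Whitton describes, changing one input such as a shipping date and running the job again, is cheap when the job takes that input as a parameter. The sketch below illustrates the idea only: it uses an in-memory SQLite database as a stand-in for the cluster's query engine, and the table, columns, sample rows, and function names are all hypothetical.

```python
import sqlite3
from datetime import date


def load_sample(conn: sqlite3.Connection) -> None:
    """Create a tiny, made-up shipments table (illustrative data only)."""
    conn.execute(
        "CREATE TABLE shipments "
        "(invoice_id TEXT, ship_date TEXT, billed REAL, contracted REAL)"
    )
    conn.executemany(
        "INSERT INTO shipments VALUES (?, ?, ?, ?)",
        [
            ("INV-1", "2015-01-10", 105.0, 100.0),
            ("INV-2", "2015-02-20", 200.0, 200.0),
            ("INV-3", "2015-03-05", 330.0, 300.0),
        ],
    )


def run_audit(conn: sqlite3.Connection, ship_date_from: date) -> list:
    """Return (invoice_id, overbilled amount) for shipments on or after the date."""
    return conn.execute(
        "SELECT invoice_id, billed - contracted FROM shipments "
        "WHERE ship_date >= ? AND billed > contracted "
        "ORDER BY invoice_id",
        (ship_date_from.isoformat(),),  # ISO dates compare correctly as text
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_sample(conn)
    # First run, then the same job again with only the shipping date changed.
    print(run_audit(conn, date(2015, 1, 1)))
    print(run_audit(conn, date(2015, 2, 1)))
```

Because only the date argument changes between runs, a request like this becomes a second call rather than a new job, which matters most when each run takes hours rather than days.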
"The accounts we've moved over, we've found more revenue for them. We are able to answer more questions than we were able to answer before."
Hadoop's lower cost of storage has also allowed PRGX to keep more data available online. Before the migration, Whitton's group had to go to the archive, often on tape, and recover data from it in order to run another query. Now it doesn't have to. Whitton said that the company normally has about two petabytes of data online and between six and ten petabytes in longer-term storage.
"It's very expensive to store in SQL. We would have to rotate data in and out," he said. Now his group is able to keep three years' worth of client data online at all times.
Whitton's team inside data services has had just over 100 people working on development, with 30 devoted to this migration. PRGX's IT team numbers 250 out of a companywide headcount of about 1,400.
Another result of the improvements to the data stack is that PRGX's technology workers now have more time to work with the data, which only serves to increase revenues for the company.
The company has kept its existing staff on board through the switch to these new technologies by training them in big data skills.
Whitton said that the plan for this big data project going forward is to democratize it. His team is working to "open it up to add much more self-service, and have the senior audit staff being able to work directly with the data on the cluster."
He says he believes that this change will happen sooner rather than later because the company has more work than it can handle.