<P> <strong>InformationWeek Stories by Mark Johnson</strong> <P> http://www.informationweek.com <P> Copyright 2012, UBM LLC.
<P> <strong>Big Data Fakers: 5 Warning Signs</strong> <P> 2013-04-16T09:06:00Z <P> Data falsification at research institutions to make results look better is nothing new. Here's what it can teach us about misuse of big data in business. <P> http://www.informationweek.com/big-data/news/big-data-analytics/big-data-fakers-5-warning-signs/240152921?cid=RSSfeed_IWK_Authors
<p>Data fabrication and falsification pose a major problem in academic research, especially for projects funded by government agencies. Large fines and research moratoria await individuals and institutions caught cheating. The extent to which this problem also occurs in the amorphous world of big data is difficult to assess, but it is worth evaluating given the embarrassments in academia and the likelihood that motivations to cheat are universal.</p> <P> Universities are increasingly cognizant of the problem, and their compliance offices are taking aggressive steps to demonstrate to funding agencies that they are handling it vigilantly and proactively. 
<P> At the University of Central Florida, in response to a request from senior management for a seminar on data fabrication and falsification, I developed a two-hour module addressing scientific misconduct and compensatory measures. Graduate students are required to attend the seminar to be officially admitted to Ph.D. candidacy. <P> <strong>[ Big data is not as daunting as you think. Read <a href="http://www.informationweek.com/software/business-intelligence/microsoft-goes-after-3-big-data-myths/240150462?itc=edit_in_body_cross">Microsoft Goes After 3 Big Data Myths</a>. ]</strong> <P> Although I suspect <em>InformationWeek</em> readers are better informed than our Ph.D. students about data fabrication and falsification scandals, I thought I'd share some of the preliminary conclusions from my seminar on academia's dealings with data misconduct. <P> Here are some of the most egregious cases I came across. We'll start with five all-star perps, researchers who have made an embarrassing name for themselves by falsifying or misrepresenting data. Then we'll move on to five types of big-data people or scenarios that should make you suspicious enough to do some additional digging. <P> <strong>1. Eric "Massage Muscles not Data" Poehlman.</strong> <P> This University of Vermont kinesiologist was the <a href="http://www.nytimes.com/2006/10/22/magazine/22sciencefraud.html?pagewanted=1&_r=2&">first researcher to earn a federal prison term</a> -- 366 days -- owing to extensive data fabrications. If the data did not support his hypothesis, he changed it to suit his purposes. Credit should be given to his graduate student/technician Walter DeNino, who had the courage and fortitude to question the honesty of his supervisor's analyses. Poehlman cited the need to fund his lab as motivation for tampering with the data to keep the funding flowing. <P> <strong>2. Yoshitaka "Retracto" Fujii.</strong> 
<P> Fujii, an anesthesiologist at Toho University, likely holds the <a href="http://retractionwatch.wordpress.com/2012/03/07/major-fraud-probe-of-japanese-anesthesiologist-yoshitaka-fujii-may-challenge-retraction-record/">all-time record for retracted papers</a>, with 172 found to be bogus by an expert panel and thus in various stages of retraction. The panel found that 126 of his randomized controlled studies -- double blind, no less -- "were totally fabricated." Some of his co-authors were in fact unaware that they were even co-authors because he forged their signatures. <P> <strong>3. Dipak "Sommelier" Das.</strong> <P> Das, a researcher at the Cardiovascular Research Center at the University of Connecticut, avoided detection for many years because the results of his studies -- <a href="http://www.examiner.com/article/new-scandal-misconduct-found-for-resveratrol-benefits-red-wine">a glass of red wine per day is good for health</a> -- were so comforting. Who wanted to overturn this result? He eventually was caught tinkering with Western blots, a type of figure for identifying proteins. Das unsuccessfully tried to transfer the blame to his students, one of whom admitted that he changed a figure the way Das wanted him to. <P> <strong>4. Diederik "Media Dude" Stapel.</strong> <P> This Tilburg University researcher studied human phenomena of great topical interest -- <a href="http://slatest.slate.com/posts/2011/11/03/fake_science_dutch_psychologist_made_up_results.html">bias and stereotypes</a> -- leading to numerous interviews with the mainstream media regarding his findings. Unfortunately, because he was the sole proprietor of his data, much of it faked from his office, it took years before his falsehoods were discovered. <P> <strong>5. Eric "Not So" Smart.</strong> <P> For at least 10 years, Smart falsified data in grant proposals and publications in his areas, cardiovascular disease and diabetes. 
A key problem area was again Western blots, and he also reported results on genetically engineered mice -- "knockout" mice -- <a href="http://www.the-scientist.com/?articles.view/articleNo/33464/title/A-Decade-of-Misconduct/">that did not exist</a>. Some of these publications garnered over one hundred citations and he drew funding to the University of Kentucky to the tune of $8 million. Smart resigned from the university and evidently works now as a science teacher in the Lexington area. <P> These are just five of the bad actors; there may be many more world-class data fabricators or manipulators we do not know about. The Department of Health and Human Services <a href="http://ori.dhhs.gov/case_summary">maintains a list</a> that currently has 43 individuals with active administrative actions against them, a data falsification wall of shame if you will. Publicizing the guilty parties, their crimes and the corresponding penalties is in stark contrast to the old days of handling data fraud cases internally and quietly -- and ineffectively. <P> Publicizing infractions eliminates some repeat offenders, but there are some obvious warning signs and common sense measures that companies can use to prevent or reduce problems of data fabrication. The following red flags, drawn from funded academic research, are likely the same for big-data applications. <P> <strong>1. Data Emperor.</strong> <P> When just one person has access to and control of the data and this person blocks others from looking at it, there might be a problem. If the "emperor" is a department that resists data inspection, this too could be a red flag. Stapel is the poster boy of data emperors, faking data from the comfort of his office; 51 retractions and counting. <P> <strong>2. Superman/Superwoman. </strong> <P> Astonishing, almost unbelievable output beyond what seems to be humanly possible should raise suspicions. 
Although there might be geniuses at your company, output that is three or more times anyone else's is worth exploring. At a Massachusetts crime lab, Annie Dookhan was processing 9,000 samples per year while her colleagues did on the order of 3,000. She was faking the results and was fired along with her clueless supervisors. Another faker, Robert Slutsky, was publishing a paper on average every 10 days. John Darsee was considered a brilliant cardiologist at Harvard Medical School until it was discovered that much of his data was faked. Darsee started his meteoric rise at Notre Dame as an undergraduate, <a href="http://www.nytimes.com/1983/04/24/us/doctor-ousted-by-harvard-also-suspected-of-falsifying-research-at-emory.html">reporting experiments on rats that would have been impossible to conduct</a>. With funding and accolades rolling in, his supervisors apparently operated under the "ignorance is bliss" paradigm. <P> <strong>3. Chaos As Cover. </strong> <P> Disorganized data structures with a concurrent lack of traceability of materials make it difficult to manage and later to audit a project, much less detect fabrication or falsification. The "sommelier" Dipak Das provides an example of this style. <P> <strong>4. Cherry Picking. </strong> <P> A devious antithesis to fabricating data is to filter the data to select a subset that fits the desired hypothesis. When this is the sole motivation for subsetting a data set, the action rises to the level of fabrication. The difficulty here in assessing the "crime" is to distinguish incompetence from intent to deceive. <P> <strong>5. Too Good To Be True. </strong> <P> If the results and conclusions are spectacularly wonderful and pleasing in an area where previous successes were epsilon-incremental, perhaps a closer look is warranted. 
The cloning wizard -- Hwang Woo Suk of Seoul National University, who became a temporary national hero before humiliation -- and the cold-fusion guys, Stanley Pons and Martin Fleischmann, come to mind. Hwang, along with 24 rapidly distancing coauthors, was ultimately revealed to have <a href="http://www.nature.com/news/specials/hwang/index.html">faked his cloning</a>, resulting in articles from the journal <em>Science</em> being retracted. Pons and Fleischmann were shown to be sloppy in the laboratory, but there was no evidence of fabrication. Carl Sagan popularized the phrase, "extraordinary claims require extraordinary evidence," so those with such claims ought to be eager and willing to provide the evidence. <P> What if there are no red flags? How can data fabrication and falsification be detected? Whistleblowers could help, but the corporate culture must be willing to protect those who come forward. Furthermore, protection from false accusations -- which mire honest analysts in distraction while the cheats zoom ahead via their shortcuts -- also is needed. A third-party audit could provide both detection and deterrence. <P> Some situations are not so black and white. Recently I started my data ethics seminar with a plot summary of a book. A young man leaves his home country on a large ship that is carrying wild animals. The ship sinks and the boy ends up in a lifeboat, which he shares with a wild cat. After many days at sea, he is rescued and the disposition of the cat is unknown to the rescuers. Sounds like Yann Martel's <em>Life of Pi</em>, right? What you might not know is that Martel was inspired by a review of Dr. Moacyr Scliar's book, <em>Max and the Cats</em>, published in 1988 in Portuguese. Scliar's story had a panther rather than a tiger and Max was fleeing from Germany rather than India, among other differences. There was no specific plagiarism but the plots are very similar. 
Scliar graciously complimented Martel's book, while the Brazilian press was less generous. I wonder if the Man Booker Prize would have gone to Martel if the panel had known about this inspirational precedent. Had Martel <a href="http://flcenterlitarts.wordpress.com/2011/02/28/lets-honor-brazilian-great-moacyr-scliar-by-reading-his-books/">acknowledged Scliar's prior work</a> initially, the controversy might have been averted. <P> In a follow-up column I plan to talk about data fabrication and falsification in the corporate world. If you have any examples of big data fabrication in the business world -- suitably sanitized for anonymity, of course -- please share them with me either via the comments section below or email. Thanks in advance!
<P> <strong>Hurricane Sandy: Big Data Predicted Big Power Outages</strong> <P> 2012-11-12T12:07:00Z <P> What can be learned for future weather events? For starters, simulations must happen quickly and no single forecasting model will do. <P> http://www.informationweek.com/news/240115312?cid=RSSfeed_IWK_Authors
<P> Hurricane Sandy, followed by the November nor'easter, delivered a one-two punch that rivals the Katharine Hepburn hurricane of 1938 and the 1821 Norfolk-Long Island storm that temporarily created North and South Manhattan Islands. These back-to-back storms provide a silver lining: a chance to showcase Big Data Analytics (coupled with catastrophe models) in forecasting events, assessing their impacts and mitigating future events. 
<P> Before getting into the technical side of Big Data Analytics, I must first note that the current level of pain experienced by the folks without power on Long Island is worse than it should have been. My research collaborator Chuck Watson of <a href="http://www.kinanco.com/">Kinetic Analysis Corporation</a> did a pilot study back in 2006 for the Long Island Power Authority (LIPA) consisting of some simulations using a hypothetical storm quite similar to Hurricane Sandy. His results showed the vulnerability of their grid and indicated a prolonged recovery period. LIPA's response was dismissive: it claimed both that the outages were overestimated and that its power-restoration capability was underestimated. LIPA claimed that it would take no more than 10 days to restore power! Here we are six years later and 150,000 LIPA customers are without power two weeks after landfall. Chuck notes that some customers <a href="http://satblog.methaz.org">may not even have power before Thanksgiving</a>. <P> We would hope that decision makers could learn the lessons from previous disasters and take advantage of analytics that point the way to mitigation. Hurricane Hugo (1989) power outages dragged on for weeks, so there was historical precedent. Validation is an essential aspect of hurricane modeling and the professionals in the hazard business do not make forecasts to feed the hype-machine. We are happy to discuss the assumptions of the models and their implementation; we are very disappointed when recommendations are dismissed merely because the deciders do not like the results -- let them ignore at their own peril. <P> <strong>[ Big Data is playing several roles in the wake of Hurricane Sandy. Read <a href="http://www.informationweek.com/big-data/news/big-data-analytics/big-data-supports-superstorm-sandy-relie/240044405?itc=edit_in_body_cross">Big Data Supports Superstorm Sandy Relief Efforts</a>. 
]</strong> <P> Getting back to the analytics, a prerequisite for doing hurricane modeling is having the capability to deal with massive databases integrated with geographic information systems. The databases include atmospheric conditions (current and recent wind speeds, pressures, temperatures), ocean temperatures, terrain elevation and land coverage for the hazard simulation and for the exposures (building types, heights, value, and so forth). The atmospheric databases require continual updating as new data arrives (from the hurricane hunters or satellite reconnaissance). <P> Another requirement is to have the computer firepower to perform numerical simulations in a timely fashion. The National Hurricane Center and the media are accustomed to a six-hour schedule of position updates and revised forecasts. This dictates that the track forecast, given updated conditions, must be completed before the next forecast announcement. There is also something to be said for simple, quick forecasting models that could perform updates at faster than six-hour intervals. The six-hour window is awkward for fast moving storms (Sandy was clipping along at around 30 mph at times). <P> <center><img src="http://twimgs.com/informationweek/1350/sandy_tracks.jpg" width="595" height="484" alt="Sandy Tracks" hspace="0" vspace="0" border="0" style="margin-bottom:7px;" /><br /></center></p> <P> How did track forecasting play out for Hurricane Sandy? Sandy was identified by the National Hurricane Center as a tropical depression on October 22. Sandy subsequently walloped the Caribbean as a hurricane causing at least 50 deaths and considerable pain to the already devastated Haiti. The storm weakened over the mountains of eastern Cuba and then strengthened again while straddling the Gulf Stream. The U.S. media began to get excited when the European model ECMWF (European Centre for Medium-Range Weather Forecasts) showed a track that could make landfall in the mid-Atlantic states. 
Such a track is unusual since historically these hurricanes tend to re-curve to the northeast, becoming fish storms. The ECMWF-predicted landfall near New York City was five days out, which is well beyond the reliable forecast window (five-day error of at least 500 miles). If the ECMWF track turned out correct (nailed the landfall location), luck would be partly responsible. Moreover, when the collection of forecast track models is in disagreement, as it was at this point in time, it is ludicrous to bet on one specific track model. Of course, one must be vigilant and pay attention to subsequent track updates, but the evacuation decision for coastal/low-lying regions of the mid-Atlantic and New England states could wait. There is a tendency by some news media folks to pay extra attention to the scariest scenarios. <P> There is a slew of forecasting models for tracks. Also of interest but a lot harder to forecast is the intensity of storms (the sudden strengthening, weakening, eye wall collapse and reformation aspects of hurricanes are less well understood than tracks). The baseline, no-skill track forecast model is known as CLIPER (CLImatology and PERsistence); it is a statistical track forecast model based on the historical record with minimal inputs for prediction (date, position, intensity). Other track models that show up on forecast maps include GFS, GFDL, UKMET, BAMM, LBAR, etc. Having the various forecast tracks available from multiple media sources (Weather Channel, NHC news bulletins, weather websites) helps to educate the general public and demonstrates the difficulty in forecasting. <P> How did the track models perform for Hurricane Sandy? The statistical forecast model CLP5 (similar to CLIPER) was awful, reflecting the fact that most storms like Sandy re-curve to the northeast and spare the northeastern United States. ECMWF did fairly well for both Sandy and the nor'easter, but then GFDL was not bad either. The attached figure (kindly provided by C. 
Watson) shows the track map for October 25. Several viable track models have forecast tracks all over the place! <P> There has been some silly media coverage that the good performance of ECMWF (for a few specific time points) is a reflection of the downtrodden state of U.S. forecasting efforts. Searching for the Holy Grail single best model is misguided. Complex phenomena require multi-pronged research attacks. A track model that does well for part of the lifetime of a storm could really tank for another part of it. Evaluating the model predictions following an event is a necessity in order to learn from the event. We cannot tell in advance which track models are going to work best -- hence, a research effort is required. We are also heavily involved in damage assessment and utility restoration. I hope to comment on these areas of applications from an analytics perspective in future columns. <P> <em>Dr. Mark E. Johnson is Professor of Statistics at the University of Central Florida in Orlando. He is a Fellow of the American Statistical Association, an elected member of the International Statistical Institute, and a Chartered Statistician with the Royal Statistical Society. Mark does extensive consulting in the area of catastrophic risks (especially hurricanes) and regularly is retained as an expert witness in legal cases.</em>
<P> <strong>Big Data Education: When Should It Start?</strong> <P> 2012-11-02T10:50:00Z <P> When elementary school students find data that interests them, they're ready to learn basic statistics concepts. The key: make data analysis relevant to young learners. <P> http://www.informationweek.com/news/240012780?cid=RSSfeed_IWK_Authors
<P> How early should one's big data education begin? If we followed the classical music paradigm, then <a href="http://www.rps.psu.edu/probing/inutero.html">in utero</a> is not too early. But what genre of music is most suitable for future big data analysts? Perhaps improvisational jazz, to foster exploratory analysis? 
Sousa marches, to inspire dedicated data preparation? Honky-tonk -- well, maybe not. <P> A related and more pertinent question is when should one's data analysis education begin? A few years ago, I visited my daughter in Japan, where she was teaching English as a second language via the wonderful <a href="http://www.jetprogramme.org">JET program</a>. In a third-grade mathematics class, the day's lesson involved collecting data on the favorite sports of each student in the class. Each student in the class of about 35 kids came to the front of the class, picked a magnetic plaque with the name of their favorite sport (soccer, running, table tennis, etc.) and put it on the blackboard. <P> In short order, the teacher constructed a physical histogram corresponding to this categorical variable. The frequency counts showed some variability, and it was also evident that the proportions varied by gender. By the end of the class the students had developed, very painlessly, a good feel for histogram counts and variability. The exercise was fun and interactive, and the learning was implicit rather than authoritarian. Every kid in the class had a very good chance of retaining the gist of the lesson indefinitely. Data analysis education should commence on the first occasion that data is collected. <P> <strong>[ Some high school teachers are addressing the anticipated shortage of data scientists now. Read more at <a href="http://www.informationweek.com/big-data/news/big-data-analytics/should-high-schools-teach-big-data/240009120">Should High Schools Teach Big Data?</a> ]</strong> <P> Aside from plotting data, I observed some other features of the school's operations that would have an indirect bearing on the students' capabilities to work in the area of data analytics. Lunch was consumed not in a cafeteria but in individual classrooms. 
A few students were sent to pick up the food in the school kitchen, others donned aprons and became servers, and lunch was not over until everything served was consumed. The dishes were collected, and a cleanup crew of students marched the used dishes back to the kitchen. <P> Then something even more remarkable happened: After lunch, each student went to an assigned area of the school campus to clean it. Using brooms, sponges or other cleaning materials, each student performed their assigned duty. Only when this activity was complete could the kids go to the playground for a brief play recess. <P> This discipline and attention to detail with exhaustive cleaning also corresponds to data preparation, where the entirety of the data file is examined, cleaned, imputed and prepared for analysis. I observed no resistance, dawdling, or impertinence. (The only regrettable part of the visit was a school newspaper photo, taken unbeknownst to me, in which I am evidently impatiently checking my watch during recess.) <P> I am not advocating the imposition of janitorial duties on elementary school students -- just commenting on my observations and speculating that these kids could do backroom data preparation jobs. <P> Getting back to the question of when to start big data education: I contend that the best time to commence data analysis education is when the student encounters data of interest. A classic example of introducing histograms is to march a large class of captive statistics students to a field and arrange them in columns of comparable heights -- <a href="http://www.biostat.jhsph.edu/bstcourse/bio751/papers/bimodalHeight.pdf">a living histogram</a>. Surely, the students participating will remember the experience and maybe even recall something about bi-modality. That such an exercise is needed at the college level suggests that earlier opportunities to teach statistics had not been exploited. 
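<P> The blackboard tally of favorite sports described earlier can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the votes below are made-up sample values, not data from the classroom, and the helper name text_histogram is hypothetical.

```python
from collections import Counter


def text_histogram(votes):
    """Return one text bar per category, most frequent first."""
    counts = Counter(votes)
    if not counts:
        return []  # no votes, nothing to draw
    width = max(len(name) for name in counts)
    lines = []
    # most_common() sorts by descending count; ties keep first-seen order
    for name, count in counts.most_common():
        lines.append(f"{name.ljust(width)} | {'#' * count} ({count})")
    return lines


# Hypothetical votes: each entry is one student's favorite sport.
votes = ["soccer", "running", "soccer", "table tennis",
         "soccer", "running", "table tennis", "soccer"]

for line in text_histogram(votes):
    print(line)
```

<P> Each row plays the role of one column of magnetic plaques: the length of the bar is the frequency count for that sport.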
<P> Advanced Placement (AP) statistics courses are available at the high school level and have experienced increasing enrollment since their inception in 1997. A test score of at least 4 (out of 5) is required to get college credit at some universities. Only about one-third of those taking the AP statistics test achieve this level, so there is no guarantee of meeting this threshold after taking the course. My own limited experience with college students who have taken AP statistics in high school is that they are in my introductory statistics class because they did not place out of the requirement. Moreover, they may bring some unfortunate statistical baggage with misconceptions about statistics -- for example, they remember some stuff about t-tests and Z-things but do not really understand what they were doing back then. To them, statistics is somehow a set of formulas awaiting injection of numbers and is thus the epitome of boredom. <P> I personally would like to see statistical concepts introduced in the context of applications of interest to the student, regardless of their age or grade level. Wouldn't it be great if our kids could get some statistical feedback every time they conquer the next level of Angry Birds or Mario Brothers? They could see how they are doing in each level and how they compare to other players of their age. <P> Learning some elementary statistics in a play environment is painless and generates some interest in the summaries. Long ago I developed some intuition on probability and statistics via dice games (Monopoly and Risk) and cards (War and Pinochle). Data from electronic games or social media sites are more the realm of interest of K-12 and college kids. Augmenting and hopefully enhancing these experiences with related statistics and some analyses would be a plus for future, more important activities.