How early should one's big data education begin? If we followed the classical music paradigm, then in utero is not too early. But what genre of music is most suitable for future big data analysts? Perhaps improvisational jazz, to foster exploratory analysis? Sousa marches, to inspire dedicated data preparation? Honky-tonk -- well, maybe not.
A related and more pertinent question is when one's data analysis education should begin. A few years ago, I visited my daughter in Japan, where she was teaching English as a second language through the wonderful JET program. In a third-grade mathematics class, the day's lesson involved collecting data on each student's favorite sport. Each of the roughly 35 students came to the front of the room, picked a magnetic plaque bearing the name of their favorite sport (soccer, running, table tennis, etc.) and put it on the blackboard.
In short order, the teacher constructed a physical histogram (strictly, a bar chart) of this categorical variable. The frequency counts showed some variability, and it was also evident that the proportions varied by gender. By the end of the class the students had developed, very painlessly, a good feel for frequency counts and variability. The exercise was fun and interactive, and the learning was implicit rather than authoritarian. Every kid in the class had a very good chance of retaining the gist of the lesson indefinitely. Data analysis education should commence on the first occasion that data are collected.
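For readers who want to try the blackboard exercise on a computer, here is a minimal sketch in Python. The picks listed are invented for illustration (they are not the class's actual choices), and the function names `tally`, `tally_by_gender` and `text_chart` are hypothetical helpers rather than anything the teacher used.

```python
from collections import Counter

# Hypothetical (sport, gender) picks, invented purely for illustration.
picks = [
    ("soccer", "boy"), ("soccer", "boy"), ("soccer", "girl"),
    ("running", "girl"), ("running", "girl"), ("running", "boy"),
    ("table tennis", "girl"), ("table tennis", "boy"),
    ("soccer", "boy"), ("running", "girl"),
]

def tally(picks):
    """Count how many students chose each sport."""
    return Counter(sport for sport, _ in picks)

def tally_by_gender(picks):
    """Count choices per sport separately for each gender."""
    by_gender = {}
    for sport, gender in picks:
        by_gender.setdefault(gender, Counter())[sport] += 1
    return by_gender

def text_chart(counts):
    """Render counts as rows of marks, like plaques on a blackboard."""
    width = max(len(sport) for sport in counts)
    return "\n".join(
        f"{sport:<{width}} | {'#' * n}" for sport, n in counts.most_common()
    )

counts = tally(picks)
by_gender = tally_by_gender(picks)

print(text_chart(counts))
for gender, gender_counts in by_gender.items():
    print(f"\n{gender}:")
    print(text_chart(gender_counts))
```

Printing one chart per gender is the software analogue of noticing, at the blackboard, that the proportions differ between boys and girls.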
Aside from plotting data, I observed some other features of the school's operations that would have an indirect bearing on the students' capabilities to work in the area of data analytics. Lunch was consumed not in a cafeteria but in individual classrooms. A few students were sent to pick up the food in the school kitchen, others donned aprons and became servers, and lunch was not over until everything served was consumed. The dishes were collected, and a cleanup crew of students marched the used dishes back to the kitchen.
Then something even more remarkable happened: After lunch, each student went to an assigned area of the school campus to clean it. Using brooms, sponges or other cleaning materials, each student performed their assigned duty. Only when this activity was complete could the kids go to the playground for a brief play recess.
This discipline and attention to detail in exhaustive cleaning has a parallel in data preparation, where the entire data file is examined and cleaned, missing values are imputed, and the result is prepared for analysis. I observed no resistance, dawdling, or impertinence. (The only regrettable part of the visit was a school newspaper photo, taken unbeknownst to me, in which I am evidently impatiently checking my watch during recess.)
I am not advocating the imposition of janitorial duties on elementary school students -- just commenting on my observations and speculating that these kids could do backroom data preparation jobs.
Getting back to the question of when to start big data education: I contend that the best time to commence data analysis education is when the student encounters data of interest. A classic example of introducing histograms is to march a large class of captive statistics students to a field and arrange them in columns of comparable heights -- a living histogram. Surely, the students participating will remember the experience and maybe even recall something about bimodality. That such an exercise is needed at all suggests that earlier opportunities to teach statistics had not been exploited.
Advanced Placement (AP) statistics courses are available at the high school level and have experienced increasing enrollment since their inception in 1997. A test score of at least 4 (out of 5) is required to get college credit at some universities. Only about one-third of those taking the AP statistics test achieve this level, so taking the course is no guarantee of meeting the threshold. My own limited experience with college students who have taken AP statistics in high school is that they are in my introductory statistics class because they did not place out of the requirement. Moreover, they may bring some unfortunate statistical baggage in the form of misconceptions -- for example, they remember some stuff about t-tests and Z-things but do not really understand what they were doing back then. To them, statistics is somehow a set of formulas awaiting injection of numbers and is thus the epitome of boredom.
I personally would like to see statistical concepts introduced in the context of applications of interest to the student, regardless of their age or grade level. Wouldn't it be great if our kids could get some statistical feedback every time they conquer the next level of Angry Birds or Mario Brothers? They could see how they are doing in each level and how they compare to other players of their age.
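The kind of feedback I have in mind could be as simple as a percentile rank. The following Python sketch is one possible way to compute it; the scores are invented for illustration, `percentile_rank` is a hypothetical helper, and no real game exposes its data in this form as far as I know.

```python
def percentile_rank(score, peer_scores):
    """Percentage of peer scores below `score`, counting ties as half."""
    if not peer_scores:
        raise ValueError("peer_scores must not be empty")
    below = sum(1 for s in peer_scores if s < score)
    ties = sum(1 for s in peer_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(peer_scores)

# Hypothetical scores on one level for players of the same age
# (the player's own score is included in the list).
peer_scores = [1200, 1500, 1500, 1800, 2100, 2400, 2700, 3000]
my_score = 2100

rank = percentile_rank(my_score, peer_scores)
print(f"Percentile rank among players your age: {rank:.0f}")
```

Counting ties as half is only one common convention; a game could just as well report the share of players strictly below the score.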
Learning some elementary statistics in a play environment is painless and generates some interest in the summaries. Long ago I developed some intuition about probability and statistics via dice games (Monopoly and Risk) and cards (War and Pinochle). Data from electronic games or social media sites are closer to the interests of K-12 and college kids. Augmenting, and hopefully enhancing, these experiences with related statistics and some analyses would be good preparation for the more important activities that come later.