# Learn Data Science: Would You Drive A Ferrari to the Grocery Store?

*One of my clients is IBM, which is effectively having its global staff learn data science to one degree or another. Below I present an article that I published for one of the internal IBM blogs, aimed at staff members with little to no exposure to data science.*

As IBM focuses more on being an IoT and Cognitive entity, many of us are being asked to think of ways to use machine learning within our teams and products. The recent partnership with The Weather Company also presents various opportunities for a cognitive approach. But along with this new focus comes a dizzying combination of terms: analytics, data science, machine learning, predictive algorithms, cognitive programming, etc. In this article, I will try to present a fundamental approach to all of these items and what they may mean for you and your teams within IBM.

If you are already a data scientist or Watson specialist at IBM, then this article is probably not for you; I would expect that you already understand these concepts. This article is for everyone else: the sales teams, the programmers, the DBAs, the trainers, and so on. There are very few corners of IBM that won’t be touched by a cognitive focus and many of us have been asked to think about it in some way or another. Let this article be an introductory guide to further your learning — however deep it may need to be in your given position — and help you understand what a cognitive approach means to you. If you finish this article and want to dig deeper, I provide some learning resources at the end too.

Let’s establish some norms before we begin. While it can be somewhat of a moving target, the *en vogue* term for what we are about to discuss is **data science**. Throughout this article, you can take data science, machine learning, algorithmic programming, predictive models, etc., to all mean the same thing. In the cases where the terms need to be differentiated, I will make that clear. Also, when I refer to Watson, I am referring to the *Watson Analytics* version that is available to all of us as an online tool. Obviously, Watson gets more complex than that web package, but Watson Analytics is an excellent way to get working with these concepts and can make for great demos for your teams and/or clients.

At the time of this writing, all kinds of berries are in season. Blueberries, strawberries, raspberries and more can be easily found in grocery stores. I will illustrate these data science concepts with our fictional business, The BIG BLUEberry Farm (see what I did there?).

### It All Begins With Data

All data science efforts begin with … data. So at The Farm, we create all kinds of records for our daily sales from our roadside stand. We record how many cases of each berry were sold and at what price, whether we had a special sale offering that day or not, the day of the week, the outdoor temperature, the number of cars that stopped by, the number of signs we had along the highway, and so on. In the real world, data can be generated by sensors, phone surveys, automated processes or various other methods. This is known as **data collection** and can be done in any number of ways. But we keep things humble at The Farm and just collect it with pen and paper, which ironically is just as effective.

At the end of each month, we stop to look at the number of data points that we have and get a general feel for our data. This is known as **exploratory data analysis** or **EDA**. We don’t quite need reports from our data just yet; we just want to make sure that we are getting what we need and that the data makes sense overall. We decide whether we are getting a good amount of data and maybe make some changes to how and what we are collecting. At this stage, EDA leads to a more refined set of data and improved data collection.

At the end of the farming season, we have collected quite a bit of data. To continue our EDA, we start taking averages of items or maybe take a look at the spread of numbers in front of us to get a range for our sales figures or items along those lines. In some cases we may find that we are missing data and can track it down (such as historical weather data) or we may have to estimate how many signs we put out along the highway that one weekend because that data is fuzzy in our records. This is where our EDA efforts start to be more of an exercise in **data wrangling** and it too leads to a more refined set of data. In other cases, you might want to modify your data, such as switching your daily temperatures from Fahrenheit to Celsius; these are known as **data transformations**.
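To make these steps a little more concrete, here is a minimal Python sketch of the three ideas above: a quick EDA summary, a bit of data wrangling to fill in a missing value, and a Fahrenheit-to-Celsius data transformation. The records, the field names and the "use the average of the known days" estimate are all assumptions made up for illustration; they are not how Watson or any particular tool handles these steps.

```python
from statistics import mean

# Hypothetical daily records from the roadside stand.
# None marks a value that is missing from our pen-and-paper notes.
records = [
    {"day": "Sat", "temp_f": 77.0, "signs": 4, "cases_sold": 30},
    {"day": "Sun", "temp_f": 86.0, "signs": None, "cases_sold": 42},
    {"day": "Mon", "temp_f": 68.0, "signs": 2, "cases_sold": 18},
]


def fahrenheit_to_celsius(temp_f):
    """Data transformation: convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32) * 5 / 9


# EDA: get a feel for the range and average of our sales figures.
sales = [r["cases_sold"] for r in records]
summary = {"min": min(sales), "max": max(sales), "mean": mean(sales)}

# Data wrangling: estimate the missing sign count from the days we do know.
# (One simple assumption among many possible ways to fill a gap.)
known_signs = [r["signs"] for r in records if r["signs"] is not None]
estimated_signs = round(mean(known_signs))

cleaned = []
for r in records:
    row = dict(r)  # copy so the original records stay untouched
    if row["signs"] is None:
        row["signs"] = estimated_signs
    row["temp_c"] = round(fahrenheit_to_celsius(row["temp_f"]), 1)
    cleaned.append(row)

print(summary)
for row in cleaned:
    print(row)
```

The point is less the code itself than the order of operations: look at the data first, patch the holes you find, and only then reshape it into the form you want to analyze.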

Within Watson, these items are handled by the orange data tiles, where you upload your **data sets** and give them appropriate column names and the like. In some cases, you may come across the concept of **learning sets** when reading about data science. Data sets are, in effect, learning sets, since models learn from them in order to make inferences. In most cases, you can use the terms interchangeably.

Keep in mind that the amount of effort for these data steps will vary based on your project and clients. The data collected by automated sensors with millions of records will require a different approach from, say, data you collect from a small sample of hospital patients on a paper form.

### Analytics & Explorations

Unless you went on to study statistics in an advanced role, your study of it probably ended with terms like mean, median, mode and distribution. Maybe you worked with some histograms and some scatter plots too. These concepts remain an important part of statistics and you should think of them as **analytics**. At The Farm, we look at things like our average sales and the number and type of berries that were sold by the day of the week. We might even create a line chart that shows the average temperature for each day that our roadside stand was open. If we go on to create a series of charts and graphs to show our data, then that would be an exercise in **data visualization**, or “data viz” if you want to sound even more modern.
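If you want to see what this kind of analytics looks like outside of a spreadsheet, here is a small Python sketch. It computes the mean, median and mode of some sales figures, then the average sales by day of the week, and finishes with a very crude text "chart". The sales numbers are invented for illustration, and the text chart is only a stand-in for a real charting tool.

```python
from collections import defaultdict
from statistics import mean, median, mode

# Hypothetical (day of week, cases sold) pairs from The Farm.
daily_sales = [
    ("Sat", 30), ("Sun", 42), ("Mon", 18),
    ("Sat", 34), ("Sun", 38), ("Mon", 18),
]

# Basic analytics over all of the sales figures.
cases = [sold for _, sold in daily_sales]
overall = {"mean": mean(cases), "median": median(cases), "mode": mode(cases)}

# Average sales grouped by day of the week.
by_day = defaultdict(list)
for day, sold in daily_sales:
    by_day[day].append(sold)
average_by_day = {day: mean(values) for day, values in by_day.items()}

# A bare-bones "data viz": one '#' for every two cases sold on average.
print(overall)
for day, avg in average_by_day.items():
    print(f"{day}: {'#' * round(avg / 2)} ({avg})")
```

Notice that everything here describes what has already happened at the stand; nothing in it says anything about next week.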

Within Watson, these items can be handled with the blue tiles known as **Explorations**. You can quickly create any number of charts and graphs to get a visual view into your data. To be clear, analytics can still play a vital role in data science. Most advanced analytics would fall under the umbrella of **applied statistics**; they can be useful, but they are not quite predictive or cognitive.

If your study of Watson and our berry data ended here, then you would be selling yourself short. In fact, I would say that this would be a misuse of Watson and what it can do for teams and clients within IBM. Would you drive a Ferrari to do your weekly grocery shopping? I mean you *could* do that, it’s not illegal. But driving a Ferrari on residential streets at 25 miles-per-hour wouldn’t even come close to exercising the full power of a Ferrari, would it?

The power of Watson really comes to light when you start using it for **predictive models** and algorithms about the future. Analytics is about reporting on *what has already happened*. Data science, and the resulting algorithms, help us understand *what will most likely happen* in the future. Watson helps us all unleash predictive models on our data and allows us to make informed decisions about the future.

Think of it this way: Everything we have looked at so far (data sets, analytics and data visualizations) can also be handled by virtually any modern spreadsheet program like Microsoft Excel or even Google Sheets. If you are looking at a potential cognitive project and all your ideas end with items that can be solved with a spreadsheet, then I would encourage you to dig a little deeper and start thinking about things that you want to know about the future of your data, not just what has already happened.

### Predictive Models

When you click on the green tiles of Watson, the **Prediction** functions, then you have stepped into the real power of Watson and what will eventually become the future of IBM. Now, we are no longer on a grocery run; we are now behind the wheel of our Ferrari with our foot pressing down towards 145 miles-per-hour!

Coming back to The Farm, we must first determine the questions about the future that we want our data to answer. These can be items like:

- What are my predicted sales for the month of October?
- How many cars will stop by on a rainy day? How many will make purchases?
- Does having a special offering on strawberries lead to more blueberry sales?
- Are there improvements that I can make to my advertising budget?
- Given certain weather patterns, which fertilizer combination will lead to a bigger harvest of raspberries next year?

Each of these questions can be traced down to one or two variables in our data that we want to predict, such as sales, customer count or expenses. Within Watson, these are known as **targets**. Within a predictive algorithm, the target is the item that we are solving for. When you tell Watson which target you want to predict, it goes to work for you and attempts to find **interactions** between all your data points and that target. The end result is a model, our predictive algorithm, that we can use to make decisions about the future.

So how does Watson generate these predictive models? Well, that’s a complex question that can’t be answered within a single article. But rest assured that there are tried-and-true statistics principles powering Watson. From working with it on various data sets, I have seen Watson use everything from linear regression to decision trees and support vector machines, among various other models. A proper data science approach also uses cross-validation, a method that tests proposed models against held-back data with a known result. Watson iterates through these concepts and models and applies them to your learning sets.
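To take some of the mystery out of two of the terms above, here is a minimal Python sketch of a one-variable linear regression checked with a simple form of cross-validation. It is emphatically not how Watson works internally; it is just the textbook idea, written with the standard library. The temperature and sales figures, the function names `fit_line` and `cross_validate`, and the choice of four folds are all assumptions for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


def cross_validate(xs, ys, k):
    """k-fold cross-validation: returns the mean absolute prediction error.

    Each fold is held back once, the line is fitted on the remaining data,
    and the fitted line is scored against the held-back known results.
    """
    errors = []
    for fold in range(k):
        train_x = [x for i, x in enumerate(xs) if i % k != fold]
        train_y = [y for i, y in enumerate(ys) if i % k != fold]
        held_back = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        slope, intercept = fit_line(train_x, train_y)
        errors.extend(abs(slope * x + intercept - y) for x, y in held_back)
    return mean(errors)


# Hypothetical data: outdoor temperature (Celsius) and cases of berries sold.
# Here "cases" plays the role of the target.
temps_c = [15, 18, 20, 22, 25, 27, 30, 32]
cases = [20, 24, 27, 31, 35, 38, 43, 45]

slope, intercept = fit_line(temps_c, cases)
mae = cross_validate(temps_c, cases, k=4)

print(f"Predicted cases at 28 C: {slope * 28 + intercept:.1f}")
print(f"Average error on held-back days: {mae:.2f} cases")
```

The cross-validation number is what keeps us honest: a model that only looks good on the data it was fitted on is not much use for questions about October.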

### But Wait, What About Cognitive?

I want to close out this article by looking at this concept that is being thrown around quite a bit: *cognitive*. So what exactly does it mean to have cognitive project teams and products? Isn’t that the same as something that is predictive? No, they are not the same. They are definitely kissing cousins, but not the same concept.

Once you have a predictive model, you can use that model to make adjustments to your future actions. That gives you more refined data to work with and it allows you to improve your model. Once you create revised models based on previous models, then you have closed the cycle where your models are learning from each other and improving. That, in effect, is the cognitive cycle: models learning from themselves and/or other models.

The best example of a cognitive model is a self-driving car. This type of project requires a cognitive approach because there are so many variables to driving a car — road conditions, actions by others, highway signage and speeds among other things. You don’t teach a car the various ways a stop sign can appear or where it can be placed; that would be a long, ridiculous list of IF THEN ELSEIF THEN ELSEIF statements and highly inefficient. Instead, you teach a car the rules of the road and then you let the car learn from itself and its successes and, for better or worse, its mistakes. A self-driving car is given a series of predictive models that it can use along with various pieces of input — mostly from cameras and sensors — that it uses to determine how to drive itself on the road. The iteration through versions of models, which the car creates itself, is the cognitive solution for a self-driving car.

### So What Now?

My hope is that this article sets you on a path to think about your projects and clients in a slightly different light. Sources of data are all around us; it’s up to us to see them and determine what they can do for us. What data is your project collecting? How can it be useful in the future? What other sources of data can you combine with your project to reach some meaningful results? Not all data is good predictive data; is my data better suited for analytics or predictive models? If you can start with questions like these, and then keep on digging, you should be well on your way to developing some predictive and cognitive solutions here at IBM.

### Further Reading to Learn Data Science

If you want to continue taking a high-level view into data science, then I would recommend *The Master Algorithm* by Pedro Domingos. It walks through various examples of data science, organized around five basic approaches to algorithm development. Understanding these five “tribes” of machine learning can help you see how to apply them to various learning challenges out there. This book is ideal for managers, sales or other roles in IBM that may only need a high-level view into predictive models.

If you are a programmer, or just want to take a deeper dive into the actual coding of data science algorithms, I would recommend *An Introduction to Statistical Learning* by James, Witten, Hastie, and Tibshirani. It has long been considered the “bible” when it comes to machine learning. You’ll have to know the R programming language to work with the examples, but if you can master this book then you are well on your way to becoming pretty dangerous with some predictive models!