In our technology-driven era, the hype around artificial intelligence, machine learning and data analytics software keeps growing. These words conjure up visions of self-driving cars, humanoid robots and even Terminator’s Skynet. The truth is a lot simpler: whether it sits inside analytics software or a business intelligence tool, machine learning uses programming techniques to process large amounts of data and extract insights. This means that even something as simple as a linear regression with one independent and one dependent variable is a form of machine learning, given a large enough training dataset.
The 2 Types of Data
Before we discuss the intricacies of how these algorithms are implemented for the purposes of machine learning in business or pricing, it is important to understand the two types of data that feed these algorithms on a basic level.
Labelled data refers to data that is clearly marked and tagged with identifiers. These identifiers could be the column headers in an Excel sheet or, in the case of your Google Photos archive, the split between two folders: one with pictures featuring yourself and another without. With labelled data, you can train an algorithm to classify whether you are in a picture by showing it images that contain your face and images that do not.
Unlabelled data, as the name suggests, is missing the identifier tag. In the photo folder example, you have a folder featuring pictures of yourself in a group of other people. You may want to know if you can use these pictures to identify your close friends. This process is a lot more complicated than the first one. Because we do not have clear labels on how to group the pictures, we have no idea who the actual close friends are, which pictures they are featured in and which ones they are not. This data is unlabelled and difficult to work with.
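One way to picture the difference is to look at how the two kinds of data might be stored. The sketch below assumes each photo has already been reduced to two numeric features, which is a hypothetical simplification (real image features are far richer). The names `labelled_photos`, `unlabelled_photos` and `is_labelled` are made up for illustration.

```python
# Labelled data: each photo's (hypothetical) feature vector is paired with a tag.
labelled_photos = [
    ([0.9, 0.8], "me"),
    ([0.1, 0.2], "not_me"),
]

# Unlabelled data: the same kind of features, but no tag attached.
unlabelled_photos = [
    [0.7, 0.6],
    [0.2, 0.3],
]


def is_labelled(dataset):
    """Return True when every record is a (features, label) pair."""
    return all(isinstance(record, tuple) and len(record) == 2 for record in dataset)


print(is_labelled(labelled_photos))
print(is_labelled(unlabelled_photos))
```

The only structural difference is the missing tag, yet that tag is what tells an algorithm what the "right answer" looks like.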
How Machine Learning Works
Machine learning is the process of using a specific dataset, or “input”, to train an algorithm for a specific function. In a simple linear regression with just one x variable, the algorithm tries to determine the relationship between the x and the y variables. Suppose you have two weeks’ worth of data with which to train your model: the algorithm looks over the past two weeks to learn the relationship between the variables and expresses it as a line of best fit.
Now imagine that you have two years’ worth of data with which to train the model: your regression reads data from the past two years to learn the same relationship. All else being equal, we can expect the algorithm to estimate and predict the relationship more reliably with the second dataset.
Because there is more data to read from, random noise in individual observations has less influence on the fitted line, and the prediction error tends to shrink. In this case, machine learning is the process of increasing the accuracy of the prediction by having the algorithm examine more data.
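As a rough illustration of the two-weeks versus two-years comparison, the sketch below fits a line of best fit to simulated data of each length. Everything in it is an assumption made for illustration: the function names `fit_line` and `simulate`, the "true" slope and intercept, the noise level, and the use of a simple day counter as the x variable. It is a minimal pure-Python least-squares fit, not a production model.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def simulate(days, true_slope=2.0, true_intercept=10.0, noise=5.0, seed=0):
    """Generate one noisy observation per day (all parameters are hypothetical)."""
    rng = random.Random(seed)
    xs = list(range(days))
    ys = [true_slope * x + true_intercept + rng.gauss(0, noise) for x in xs]
    return xs, ys


# Compare a two-week history with a two-year history.
for days in (14, 730):
    xs, ys = simulate(days)
    slope, intercept = fit_line(xs, ys)
    print(f"{days:>3} days: slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the noise is random, the exact numbers depend on the seed, but the estimate from the longer history will typically sit much closer to the slope used to generate the data.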
The 2 Types of Machine Learning
Supervised learning is implemented on data that is already labelled, such as volume data that already has meaningful tags associated with it. It covers two distinct types of tasks: regression, which predicts a number, and classification, which predicts a category. So, when you examine two folders on Google Photos, one featuring photos of yourself and one without, you can use those labelled pictures to train an algorithm to determine whether you are present in a photo. This is a classification task.
- The objective is clear: you want to predict if you are featured in a specific photo
- The accuracy of the results from supervised algorithms is easy to measure: are you in the photo or not?
- It requires a fully labelled dataset, which means more work up front in data collection
- It requires a large amount of data to make effective predictions
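To make the photo classification example concrete, here is a minimal sketch of a supervised classifier. It assumes each photo has already been turned into two numeric features (a hypothetical stand-in for real image features) and uses a nearest-centroid rule, chosen here only because it is short. Real photo recognition uses far more sophisticated models. The function names and the sample values are invented for illustration.

```python
import math


def train_centroids(samples, labels):
    """Average the feature vectors of each label into one centroid per label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Pick the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


def accuracy(centroids, samples, labels):
    """Share of samples whose predicted label matches the true label."""
    correct = sum(predict(centroids, f) == l for f, l in zip(samples, labels))
    return correct / len(labels)


# Hypothetical labelled training photos (the "two folders").
train_x = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.75], [0.1, 0.2], [0.2, 0.1], [0.15, 0.3]]
train_y = ["me", "me", "me", "not_me", "not_me", "not_me"]

# Hypothetical held-out photos whose labels we also know, used to measure accuracy.
test_x = [[0.7, 0.9], [0.3, 0.2]]
test_y = ["me", "not_me"]

model = train_centroids(train_x, train_y)
print(predict(model, [0.7, 0.9]))
print(accuracy(model, test_x, test_y))
```

Because the held-out photos come with known labels, checking the predictions against them gives a direct accuracy figure, which is the measurability advantage listed above.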
Unsupervised learning occurs when your data is unlabelled. Machine learning tasks performed on this type of data include clustering and segmentation. In our example, your Google Photos account holds photos with no labels, such as those of your friends, so there is no known answer for an algorithm to learn to predict. It is possible, however, to group these images by their similarity to each other. This could mean grouping images with similar visual elements, similar sizes, similar colours, or similar contours (artificial environments vs organic environments). By creating clusters of photos based on similarities, you can then investigate whether these clusters reveal something about the subject.
For example, if you cluster your Google Photos images into groups, you might find that the biggest cluster features images of yourself outdoors. You could then assume that you enjoy spending time outdoors, thus effectively learning from the process of clustering and segmentation.
- Unsupervised techniques are faster to implement than supervised methods
- Since you do not enter this analysis with a clear objective, it opens up the possibility of the results highlighting unique and disruptive findings
- There is no clear way to measure the accuracy of your results
- This method is not particularly suited to data with a high number of variables, so it typically requires more data cleaning and preparation beforehand
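The clustering idea described above can be sketched with k-means, one common clustering technique (the article does not name a specific algorithm, so this choice is an assumption). The photo features, here imagined as brightness and the share of green in the image, are hypothetical, as are the function name `kmeans` and the naive choice of starting centroids.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    # Naive initialisation (an assumption for brevity): use the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: math.dist(centroids[c], p)) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical unlabelled photos: (brightness, share of green in the image).
photos = [
    (0.9, 0.8), (0.3, 0.1), (0.85, 0.7), (0.2, 0.15),
    (0.8, 0.75), (0.25, 0.05), (0.95, 0.85),
]

assignments, centroids = kmeans(photos, k=2)
sizes = Counter(assignments)
biggest = sizes.most_common(1)[0][0]
print("cluster sizes:", dict(sizes))
print("centroid of biggest cluster:", centroids[biggest])
```

The algorithm only returns groups and their centres. Deciding that the bright, green-heavy cluster means "outdoors" is an interpretation a person adds afterwards, which is why accuracy is hard to measure here.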
Of course, there is much more to machine learning, but now you have a basic language to speak about these concepts. In the next article, we will look at how certain machine learning techniques can be used and applied successfully in pricing and revenue management.
About the Author
Farhan Ahmed is an Associate at Revenue Management Labs. Revenue Management Labs helps companies develop and execute practical solutions to maximize long-term revenue and profitability. Connect with Farhan at [email protected]