You can read Part 1 of this series here
Here we go again, a week and a bit later than I expected. #startup-life #oscp-life.
Let’s dive straight into it. Last week we left off after examining a bit of the threat landscape and the approaches to modelling threats. This week we are going to examine Machine Learning as well as how and where it is used. This understanding is key to modelling threats against it. This one is going to be a long one.
Content warning - Math adjacency is unavoidable.
So what the hell is machine learning?
First, a few bits of groundwork. In short, machine learning is the study and implementation of software algorithms that improve automatically through experience. An algorithm in this instance is a programmatic means of solving a problem; it is also referred to as a model.
Experience, for an algorithm, means exposure to training data.
Training data is the initial set of data used to fit the parameters of the chosen algorithm to the selected task. It is the initial shaping material that forms order, prediction and decision making from the primordial algorithm.
Parameters are the weights, variables and values that the machine learning algorithm can adjust internally to fit its output to the expected output for a given input. For example:
- Weights in a neural network
- Support vectors in a support vector machine (SVM)
- Coefficients in a linear or logistic regression.
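To make "parameters" concrete, here is a minimal sketch using scikit-learn (the toy data is invented for illustration): the coefficient and intercept learned by a linear regression are exactly the kind of internal values the algorithm adjusts during fitting.

```python
# Hypothetical example: the learned parameters of a linear regression
# are its coefficient(s) and intercept (assumes scikit-learn is installed).
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1, so the fitted parameters should land there.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # learned weight, close to 2.0
print(model.intercept_)  # learned bias, close to 1.0
```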
The fit of a model represents how well it generalises to data similar to its initial training data. Fit is a lot like the Goldilocks porridge analogy. A model can be overfit, meaning it has learned the training data too closely (noise included) and performs poorly on new data; underfit, meaning it is too simple to capture the underlying structure of the data; or just right. Fit is usually verified using a validation dataset.
A good fit is the lukewarm porridge.
The validation dataset is an additional dataset that has the same statistical distribution as the training dataset but is distinct in content. It is used to adjust for better fit without overfitting. In models where it is relevant, this is achieved by tweaking hyperparameters.
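A minimal sketch of checking fit against a held-out set, assuming scikit-learn (the dataset and split are arbitrary choices): an unconstrained decision tree can memorise its training data, and the gap between training and validation scores is the tell-tale sign of overfitting.

```python
# Sketch: comparing training vs validation accuracy to spot overfitting
# (scikit-learn assumed; the synthetic dataset is illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# An unconstrained tree will fit the training set almost perfectly...
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print("train:", model.score(X_train, y_train))       # near 1.0
print("validation:", model.score(X_val, y_val))      # typically lower
```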
Hyperparameters are properties that govern the overall machine learning process but are not learned from the training data itself. They are set before training, and they affect the speed and quality of the learning process (and, through it, the quality of the final model). Examples include:
- The learning rate for training a neural network
- The C and sigma values for SVMs
- K in k-nearest neighbours
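The validation set is also where hyperparameters get chosen. As a hedged sketch with scikit-learn (the candidate values of k are arbitrary): k in k-nearest neighbours is never learned during fitting, so we sweep it by hand and keep whichever value scores best on validation data.

```python
# Sketch: picking the hyperparameter k for k-nearest neighbours by
# comparing validation scores (assumes scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

scores = {}
for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_val, y_val)   # validation accuracy per k

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```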
What about the algorithms?
Alright, so we have the components of the models, but what about the underlying algorithms themselves? For starters, there are five main types of machine learning algorithm: supervised, semi-supervised, unsupervised, reinforcement and self-supervised. Each of these, with its relevant applications, is outlined below:
Supervised
Supervised algorithms are taught by example. The person training the algorithm provides it with a dataset of known inputs and expected outputs. Throughout the training process, the operator corrects the model's output until sufficient accuracy and performance are achieved.
There are two main use cases for supervised learning:
Classification: Classification is where the algorithm assigns observed values to discrete categories. Examples of this include spam filtering and image recognition.
Regression: Regression tasks revolve around estimating and understanding the relationships among variables. The analysis starts with a dependent variable and a set of independent variables that influence it. This is used to learn from past data and is analogous to forecasting. An example of this is stock price prediction.
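A hedged sketch of both supervised use cases, using scikit-learn (the tiny datasets are invented for illustration): classification maps inputs to discrete classes, regression to continuous values.

```python
# Classification vs regression in miniature (assumes scikit-learn;
# toy data stands in for real spam scores / price histories).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: labelled examples -> a discrete class (spam / not spam).
X_cls = np.array([[0.1], [0.2], [0.8], [0.9]])
y_cls = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[0.05], [0.95]]))  # expect class 0, then class 1

# Regression: labelled examples -> a continuous value (a price forecast).
X_reg = np.array([[1.0], [2.0], [3.0]])
y_reg = np.array([10.0, 20.0, 30.0])
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[4.0]]))  # close to 40.0
```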
Semi-supervised
Semi-supervised algorithms are similar in form to supervised ones. The difference is that the source data is a mix of both labelled (known) and unlabelled (unknown) examples. Semi-supervised algorithms are largely used to label a dataset, and are typically implemented as a combination of supervised and unsupervised techniques.
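As a minimal sketch of the idea, assuming scikit-learn (the one-dimensional toy data is invented): unlabelled points are marked with -1, and label propagation spreads the few known labels to them.

```python
# Semi-supervised labelling in miniature: only two points are labelled,
# and LabelPropagation infers labels for the rest (assumes scikit-learn).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two well-separated groups of points; -1 means "label unknown".
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation()
model.fit(X, y)
print(model.transduction_)  # inferred labels for every point
```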
Unsupervised
Unsupervised algorithms start with unlabelled data and no initial understanding of the world. They take that input and attempt to derive relationships from it. These relationships take the form of clusters or groups that attempt to describe the structure of the data. The more data the algorithm is fed, the more accurate the model becomes.
Example use cases include:
Data Analytics: This might include grouping customers based on their purchase history and demographics, then identifying relationships between groups with similar purchasing preferences. This is a perfect problem for unsupervised learning, as the target groups are unknown in advance.
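A hedged clustering sketch with k-means, assuming scikit-learn (the number of clusters and the toy "customer" points are illustrative assumptions): no labels go in, and the algorithm discovers the two groups on its own.

```python
# Unsupervised grouping: k-means finds clusters without any labels
# (assumes scikit-learn; the 2-D points stand in for customer features).
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points, standing in for two customer segments.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
print(labels)  # first three points share one label, last three the other
```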
Reinforcement
Reinforcement learning relies upon a regimented learning process. The algorithm is provided with a set of actions, parameters and end states. It interprets these rules, explores different options and outcomes, monitors the result of each, and attempts to converge on the optimal solution. This approach learns by trial and error, building on past experience and adapting to the situation it is in.
Example use cases include:
Natural Advertising: A reinforcement algorithm can be used to select dynamically changing advertising content based on a user's profile.
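The trial-and-error loop above can be sketched in a few lines of plain Python with tabular Q-learning (the corridor environment, learning rate and discount are all illustrative assumptions, not a production setup): the agent starts knowing nothing, explores, and gradually learns that stepping right reaches the reward.

```python
# Tiny trial-and-error sketch: tabular Q-learning on a 1-D corridor.
# Pure Python; the environment and constants are made up for illustration.
import random

N_STATES = 5           # positions 0..4; the reward sits at state 4
ACTIONS = (-1, +1)     # step left or step right
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(500):                        # episodes of trial and error
    s = 0
    while s != N_STATES - 1:
        if random.random() < 0.2:           # explore a random action
            a = random.choice(ACTIONS)
        else:                               # exploit current estimates
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Nudge the estimate toward the observed outcome (lr=0.5, gamma=0.9).
        best_next = max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += 0.5 * (r + 0.9 * best_next - q[(s, a)])
        s = s2

# After training, the greedy policy should head right from every state.
print([max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)])
```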
Self-supervised
Self-supervised algorithms automate data labelling. The algorithm uses the data it has already processed and understood to form an understanding of new data it is presented with. It feeds these self-generated labels into a traditional supervised process, creating a hybrid that is capable of inferring components of missing data.
Example use cases include:
- All of the supervised learning use cases, without being limited by the cost of manual labelling.
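A rough sketch of the self-labelling idea, assuming scikit-learn (the masked-column pretext task and the synthetic data are illustrative choices, not a canonical method): the "label" is derived from the data itself by hiding one column and predicting it from the others, so no human labelling is involved.

```python
# Self-supervision in miniature: mask one feature and use it as the
# training target (assumes scikit-learn; data is synthetic).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
data[:, 2] = data[:, 0] + data[:, 1]   # hidden structure to recover

# No human labels: the masked column itself acts as the supervision signal.
X, y = data[:, :2], data[:, 2]
model = LinearRegression().fit(X, y)
print(model.predict([[1.0, 2.0]]))  # close to 3.0
```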
An addendum - What’s this data and data labelling stuff?
So we’ve skirted around this question. What is the data that these machine learning algorithms are using?
Well that depends on the use case. It might be some of the below items:
- Images (photos, X-rays)
- Video (live surveillance footage, autonomous-car video)
- Audio (music recordings, voice prints)
- Text (books, articles)
- Physibles (depth data, barometric data, time-of-flight data)
Depending on the kind of algorithm and its intent, the form these data sources take will vary.
So what about labelling?
Well, again, it depends on the algorithmic intent. But labels are exactly what they sound like: labels. They identify what a piece of data represents and mark it as such. This is important for supervised, semi-supervised and self-supervised datasets, because each of these requires some ground truth to start from, and labels are just that.
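Concretely, a labelled dataset is just inputs paired with the answers we want the model to reproduce. A tiny sketch (the spam-filter examples are invented):

```python
# A labelled dataset in its simplest form: each input paired with
# a known answer (the examples here are made up for illustration).
emails = [
    "win a free prize now",     # inputs the model will learn from
    "meeting moved to 3pm",
]
labels = ["spam", "not spam"]   # one known answer per input

dataset = list(zip(emails, labels))
print(dataset[0])  # the first (input, label) pair
```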
Who, what, why, how?
Alright so this is a lot of information… What do we do from here? Well now we have a really rough idea about what an algorithm is. But we still need to work out how they work. Well, this is where the next post comes in. Next time we are going to combine threat modelling with an idea of how the algorithms work. This will help identify attack vectors and solidify how exploitation and defence might look. See you in a few…