Understanding and Protecting Against Adversarial Attacks in Machine Learning


Machine learning has found applications in numerous fields, including banking, healthcare, retail, transportation, and autonomous vehicles. Machine learning allows computers to learn new skills without being explicitly programmed for each task, so they can make reliable predictions from observed patterns. The model (algorithm) is given data to learn from; it recognises trends in the data and produces forecasts. At the outset of training, the model is fed training data and makes predictions from it. We then fine-tune the model until it reaches the target level of accuracy, incorporate new data to keep it correct, and retrain it until the desired result is achieved.

An adversarial machine learning attack is a method for tricking a deep learning model into making misleading predictions. The adversary's goal is to make the model fail in some way.

How Adversarial Attacks Exploit Machine Learning

The ability of classifiers (models) to generate accurate predictions is largely attributable to the availability of large training datasets. During training, a function measuring the classification error is minimised: by tweaking the classifier's parameters, you optimise this function and reduce the prediction error on the training data.

Adversarial attacks exploit this same learning process to increase the likelihood of errors, either by feeding maliciously crafted data to a previously trained model or by injecting faulty or misrepresented data during training.

One indication of the rise of adversarial attacks is the growth in publications: in 2014 there were almost no papers on the topic on the arXiv preprint server, while today there are over a thousand academic articles discussing adversarial attacks and presenting concrete case studies. Researchers from Google and New York University published a 2013 paper titled "Intriguing properties of neural networks," which demonstrated the core concept of adversarial attacks on neural networks.

The topics of adversarial attacks and countermeasures are increasingly prevalent at conferences like Black Hat, DEF CON, ICLR, and others.

Forms of Adversarial Attacks

Adversarial attack vectors come in many shapes and sizes.

Evasion – As the name implies, evasion attacks are performed against trained models in order to elude detection. The adversary crafts new inputs that trick a trained model into making mistakes. This is the most common method of attack.

Poisoning – Poisoning attacks take place during the training stage. The adversary feeds the model polluted (misrepresented or inaccurate) data while it is being trained, in order to trick it into making false predictions later.
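To make poisoning concrete, here is a minimal sketch, entirely hypothetical: a toy 1-D classifier learns a decision threshold, and flipping the labels of a few training points near the boundary drags the learned threshold away from the true one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Clean 1-D training set: the true class is 1 whenever x > 0.
X = rng.normal(size=200)
y = (X > 0).astype(int)

def train_threshold(X, y):
    """Fit the threshold t minimising training error for 'predict 1 iff x > t'."""
    candidates = np.sort(X)
    errors = [np.mean((X > t).astype(int) != y) for t in candidates]
    return candidates[int(np.argmin(errors))]

clean_t = train_threshold(X, y)

# Poisoning: the adversary flips the labels of points just above zero,
# dragging the learned threshold upward and away from the true boundary.
y_poisoned = y.copy()
y_poisoned[(X > 0) & (X < 0.5)] = 0
poisoned_t = train_threshold(X, y_poisoned)
print(poisoned_t > clean_t)  # the poisoned model draws the boundary too high
```

The poisoned model now misclassifies every clean input between the true boundary and the shifted one, which is exactly the effect the attacker wanted.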

Model extraction – The adversary interacts with a model in production and attempts to recreate it locally, creating a substitute model that is 99.9% consistent with the model in production. In other words, for all intents and purposes, the clone is indistinguishable from the original. This is also known as a "model stealing" attack.

How Do We Create Adversarial Examples?

Machine learning uses both supervised learning, in which input and output data are known, and unsupervised learning, in which input data is analysed for hidden patterns or inherent structure. Central to training is the loss function, which quantifies how well the prediction model estimates the target value or result.

A faulty prediction results in a loss. Simply put, the loss quantifies how poorly a model performed on a particular data point: it is 0 if the model's forecast is accurate, and it grows as the prediction worsens. When training a model, you want to settle on a collection of weights and biases that results in minimal average loss.
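As a minimal illustration of this idea, the sketch below computes the cross-entropy loss (a standard classification loss) for one confident-and-correct prediction and one confident-and-wrong prediction. The probability vectors are made up for the example.

```python
import numpy as np

def cross_entropy_loss(probs, true_label):
    """Loss is near 0 when the model assigns high probability to the
    true class, and grows as that probability shrinks."""
    return -np.log(probs[true_label])

# A confident, correct prediction incurs almost no loss...
good = cross_entropy_loss(np.array([0.05, 0.9, 0.05]), true_label=1)
# ...while a confident, wrong prediction is penalised heavily.
bad = cross_entropy_loss(np.array([0.9, 0.05, 0.05]), true_label=1)
print(round(good, 3), round(bad, 3))
```

Training drives the average of this quantity down; an adversary, as described next, pushes it back up.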

In an adversarial attack, the adversary contaminates the input data or alters the projected outcome to fool the system. Attacks can be either targeted or untargeted. In a targeted attack, the goal is to trick a model into a specific inaccurate prediction by injecting noise into its input. In an untargeted attack, the adversary's goal is to discover any input that fools the model.

Here are a few illustrations:

  • A self-driving automobile may be fooled into misreading a stop sign as a speed-limit sign with just a few strips of tape.
  • Harvard researchers successfully tricked a medical imaging system into labelling a harmless mole as cancerous.
  • An adversary may perturb an audio waveform so that a speech-to-text transcription neural network interprets it as whatever sentence they want.
  • Deep facial recognition networks are vulnerable to attacks using specially crafted eyeglass frames.

These adversarial scenarios make clear that deploying deep neural network models in operational systems poses a significant security risk.

Adversarial Perturbation

An adversarial perturbation is any change to a clean image that preserves the semantics of the original input yet misleads a machine learning model. To achieve this, the adversary calculates the derivative of the classification function, corrupts the input image with the resulting noise, and feeds it back to the algorithm. The adversarial image is a subtly altered version of the original input.

The following are examples of common forms of adversarial attack:

The L-BFGS method uses non-linear gradient-based numerical optimisation to minimise the amount of noise introduced into the image. While it does a good job of producing adversarial samples, it requires a lot of processing power.

The Fast Gradient Sign Method (FGSM) is a simple and fast gradient-based approach that produces adversarial examples while minimising the maximum perturbation applied to any pixel needed to induce misclassification.
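A minimal sketch of the FGSM idea, using a toy logistic-regression "model" rather than a deep network so the gradient can be written by hand: each feature is nudged by eps in the direction that increases the loss. The weights and inputs are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """One FGSM step: move every input feature by eps in the
    direction that increases the loss, i.e. along sign(dL/dx)."""
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w          # dL/dx for binary cross-entropy
    return x + eps * np.sign(grad_x)

# A toy model that correctly places x in class 1...
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, -0.5]), 1.0
x_adv = fgsm_perturb(x, y, w, b, eps=0.9)

before = sigmoid(np.dot(w, x) + b) > 0.5      # original input: class 1
after = sigmoid(np.dot(w, x_adv) + b) > 0.5   # perturbed input: flipped
print(before, after)  # True False
```

Note that the same sign-of-gradient step is what real FGSM implementations apply to every pixel of an image, with the gradient obtained by backpropagation.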

The Jacobian-based Saliency Map Attack (JSMA) differs from FGSM in that it employs feature selection to minimise the number of features altered to cause misclassification. Flat perturbations are added iteratively to features in decreasing order of saliency value. It requires more processing power than FGSM, but perturbs far fewer features.

The DeepFool attack is an untargeted adversarial sample generation method that minimises the Euclidean distance between the perturbed sample and its original counterpart. Perturbations are introduced iteratively based on estimated decision boundaries between classes. Although it is computationally more expensive than FGSM and JSMA, it produces adversarial samples with smaller perturbations.

The Carlini & Wagner (C&W) attack is a variant of the L-BFGS attack that removes the box constraints and employs alternative objective functions. The adversarial examples it produces have defeated state-of-the-art defences such as defensive distillation and adversarial training. Both its ability to generate adversarial examples and its ability to overcome adversarial defences are impressive.

Generative adversarial networks (GANs) have been used to produce adversarial attacks by pitting two neural networks against each other: one performs the role of generator, the other that of discriminator. In this zero-sum game, the generator attempts to generate samples that the discriminator will misclassify, while the discriminator tries to tell fake samples from real ones. Training a generative adversarial network can be quite computationally unstable.

Attack Types: Black Box vs. White Box

An adversary can launch two sorts of attacks, depending on their prior knowledge of the target model:

Black-box attack
In a black-box attack, the adversary has no information about the model (such as its parameters or the architecture of the neural network) or the training dataset; they are limited to observing the model's outputs. This makes it the most challenging setting to exploit, but also the most realistic. The adversary must generate adversarial examples either with a substitute model built from scratch or with no model at all.
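A minimal sketch of a query-only black-box attack, with all details invented for illustration: the attacker never sees the hidden weights, only the probability returned by `predict_proba`, and runs a random search that keeps any small perturbation lowering the model's confidence in the true class (in the spirit of simple query-based attacks).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden model: the attacker can query predict_proba but never sees _w.
_w = np.array([1.5, -2.0, 0.5])
def predict_proba(x):
    return sigmoid(np.dot(_w, x))

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0, 0.5])        # confidently classified as class 1
best, best_p = x.copy(), predict_proba(x)

for _ in range(500):
    # Keep any random perturbation that lowers confidence in class 1.
    cand = best + rng.normal(0, 0.05, size=3)
    p = predict_proba(cand)
    if p < best_p:
        best, best_p = cand, p

print(predict_proba(x) > 0.5, best_p < 0.5)  # original in class 1, attack flipped it
```

The attacker needed nothing but output queries, which is precisely what makes the black-box setting realistic for deployed models.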

White-box attack
In a white-box attack, the adversary is familiar with all aspects of the model being attacked, including its architecture, input and output parameters, and training dataset. With this knowledge the adversary can craft hostile samples tailored to the target model. Such attacks are also called adaptive, gradient-based, or iterative attacks. "Adaptive" refers to the adversary's capacity to adjust the attack in response to feedback from the model: the attacker probes the model with crafted inputs, observes how it performs, and uses that feedback to better circumvent the model's safeguards. This procedure is iterated until the adversary identifies inputs that consistently mislead the model. Because the adversary can adjust the attack strategy in real time, these attacks are very difficult to defend against; worse, knowledge of the model's internal structure and parameters lets the adversary build model-specific attacks.

Current state-of-the-art defence techniques are helpless against an adaptive white-box attack; building defences that fully protect a model from an adaptive attacker remains difficult.

Black box
  • Adversary knowledge – Restricted knowledge, from being able to observe only the network's output on probed inputs.
  • Strategy – Based on a greedy local search generating an implicit approximation to the actual gradient with respect to the current output by observing changes in input.

White box
  • Adversary knowledge – Detailed knowledge of the network architecture and of the parameters resulting from training.
  • Strategy – Based on the gradient of the network's loss function with respect to the input.

Black-box attacks: a few examples

Below are some of the most common real-world manifestations of black-box attacks:

Physical Attacks
In a physical attack, 'physically' fabricated artefacts are added to the input in order to fool the model. These attacks are typically rather simple to grasp. One study from Carnegie Mellon University showed, for instance, that an attacker wearing specially printed colourful eyeglass frames could fool a facial recognition model.

Out of Distribution (OOD) Attack
Out-of-distribution (OOD) attacks are another form of black-box attack: the adversary feeds the model data points that lie outside the typical range of its training data.

To solve a problem, machine learning models must be trained on a dataset that accurately represents the problem domain. An attacker can try to fool a model by feeding it data that does not fit this distribution. In real-world applications like autonomous vehicles, medical diagnostics, and fraud detection, the resulting inaccurate outputs could have disastrous consequences.
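One cheap, widely used heuristic for spotting such inputs is to flag predictions where the model's maximum softmax probability is low. The sketch below is illustrative only; the logits and the 0.8 threshold are invented values, not a standard setting.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def flag_if_ood(logits, threshold=0.8):
    """Max-softmax-probability heuristic: inputs on which the model is
    unusually unconfident are flagged as possibly out-of-distribution.
    The 0.8 threshold is an illustrative choice, not a standard value."""
    return softmax(logits).max() < threshold

print(flag_if_ood(np.array([9.0, 1.0, 0.5])))   # confident -> in-distribution
print(flag_if_ood(np.array([1.1, 1.0, 0.9])))   # diffuse -> flagged as OOD
```

Flagged inputs can then be rejected or routed to a human reviewer instead of being acted on blindly.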

When Can We Have Faith in Machine Learning?

How can we trust machine learning when it makes more and more sophisticated judgements for us?

Questions like this get to the heart of trust’s fundamentals.

  • Can I have faith in what I’m creating?
  • Is your faith in my work justified?
  • Do we have faith in what we are creating as a group?

The three most crucial characteristics to consider when answering the questions above are clarity, competence, and alignment.

Clarity means the ability to communicate effectively and be understood. It is about asking ourselves whether the motivations behind our choices are sound; greater clarity helps humans make better judgements. There must be no confusion about the metric(s) we will use.


Competence is the state of being well equipped to do a task, whether that task requires knowledge, judgement, skill, or strength. Competency assessment is the key to success in machine learning, and calls for more systematic testing of trained models: offline training tells us very little about how the system is likely to behave in the wild, and benchmark and test datasets are only a rough representation of what may occur in the real world.


Alignment is the hardest part. It is when people, organisations, states, and the like all work toward a single goal. Every decision you make when designing systems has an effect on people, so you need to agree with them on the balance of concerns and on the answer to the question, "Does my system share the cause or viewpoint I hope it has?" The dataset itself is one of the crucial decisions determining a machine learning model's performance: a diverse and comprehensive dataset is essential to avoid bias and the perpetuation of stereotypes.

How Do We Defend Against Adversarial Attacks?

Although total prevention is impossible, a mix of defensive and offensive strategies can protect against adversarial attacks. Bear in mind that white-box attacks can bypass defences that have been shown effective against black-box attacks.

To counteract attacks, a defensive machine learning system can use denoising and verification ensembles.

Denoising Ensembles
A denoising algorithm is used to clean up signals or images. A "denoising ensemble" is a machine learning technique for improving the accuracy of denoising algorithms.

In a denoising ensemble, multiple denoising algorithms are trained on the same input data but with different initialisations, architectures, or hyperparameters. The idea is that combining many algorithms, each with its own strengths and weaknesses, yields a more accurate final denoised output.

Denoising ensembles have been applied effectively to several tasks, including image denoising, speech denoising, and signal denoising.
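The ensemble idea can be sketched with very simple ingredients. Here three moving-average filters of different widths stand in for differently configured denoisers, and their outputs are averaged; everything (the sine signal, the noise level, the window sizes) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 2 * np.pi, 200))
noisy = signal + rng.normal(0, 0.3, size=signal.shape)

def moving_average(x, k):
    """A simple denoiser: each width-k moving average has its own
    bias/variance trade-off, standing in for differently configured models."""
    return np.convolve(x, np.ones(k) / k, mode="same")

# Ensemble: average the outputs of several differently parameterised denoisers.
members = [moving_average(noisy, k) for k in (3, 5, 9)]
ensemble = np.mean(members, axis=0)

mse = lambda a, b: float(np.mean((a - b) ** 2))
print(mse(ensemble, signal) < mse(noisy, signal))  # ensemble reduces the error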

Verification Ensembles
A "verification ensemble" is a machine learning approach that aims to increase the accuracy with which verification models decide whether two inputs belong to the same class. In a facial recognition system, for example, a verification model checks whether two photos of a face show the same person.

Methods for building a verification ensemble range from simply averaging the results of the individual verifiers to employing a voting mechanism that selects the result with the highest consensus. The use of a verification ensemble has been shown to improve performance on verification tasks such as face recognition, speaker verification, and signature verification.
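The voting variant is easy to sketch. The example below combines the boolean verdicts of three hypothetical verifiers by majority vote, with ties resolved conservatively as rejection; the verdicts themselves are made up.

```python
from collections import Counter

def majority_verdict(verdicts):
    """Combine independent verifiers by majority vote; ties fall back
    to rejection, the conservative choice in a verification setting."""
    counts = Counter(verdicts)
    return counts[True] > counts[False]

# Three hypothetical face verifiers disagree on whether two images match:
print(majority_verdict([True, True, False]))   # accepted by majority
print(majority_verdict([True, False, False]))  # rejected
```

An attacker now has to fool a majority of the verifiers simultaneously rather than any single one.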

A bit-plane classifier extracts information and recognises patterns or features in an image by analysing the distribution of image data across its bit planes. Bit-plane analysis is central to image-processing operations such as compression and feature extraction. Bit-plane classifiers are useful for pinpointing the parts of an image an attacker might exploit: robust bit-plane classifiers can be trained to ignore the more easily attacked parts of an image in favour of those with fewer weak spots.
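Bit-plane decomposition itself is a one-liner worth seeing. The sketch below splits a tiny, made-up 8-bit image into its bit planes and reconstructs it from only the top four, a crude way of discarding the fine detail where small adversarial perturbations tend to live.

```python
import numpy as np

def bit_planes(img):
    """Decompose an 8-bit image into its 8 bit planes. Plane 7 (the most
    significant bit) carries coarse structure; the low planes carry the
    fine detail that small adversarial perturbations tend to live in."""
    return [(img >> b) & 1 for b in range(8)]

img = np.array([[200, 50], [120, 255]], dtype=np.uint8)
planes = bit_planes(img)

# Reconstructing from only the top 4 planes zeroes the least significant
# bits, i.e. any perturbation smaller than 16 intensity levels vanishes.
coarse = sum(planes[b] << b for b in range(4, 8)).astype(np.uint8)
print(coarse)  # [[192  48] [112 240]]
```

A downstream classifier fed `coarse` instead of `img` never sees the low-order bits an attacker could cheaply manipulate.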

As sophisticated adversarial attackers grow better at what they do, having a variety of denoisers and verifiers increases the likelihood of successfully resisting an attack. By acting as multiple gatekeepers, a diversified set of denoisers and verifiers makes it more difficult for the opponent to carry out an attack.

Different forms and intensities of noise may call for specialised denoisers: a denoiser can specialise in removing, say, high- or low-frequency noise. Employing a large collection of denoisers improves the model's ability to filter out a broad variety of noise.

Likewise, certain verifiers may be better suited to particular kinds of data or faults than others: some may be more adept at finding syntactic mistakes, others at finding semantic ones. A large and varied group of verifiers greatly improves the model's ability to detect and correct a broad variety of faults and to guarantee the reliability of its output.

Conduct Adversarial Training
In machine learning, adversarial training is used to make a model more resilient to attacks from malicious actors. Adversarial perturbations are deliberately introduced into the training data so as to lead the model into mistakes; the adversarial examples are then added to the original training set, and the model is trained on both sets of data, which ultimately leads to better learning and more reliable behaviour.

The following are the stages of the process:

  • Generate adversarial examples: A number of methods can be used, including the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and the Carlini & Wagner (C&W) attack. These techniques adjust the input data so as to maximise the model's loss function.
  • Augment the training data: The adversarial examples and their labels are added to the existing training data.
  • Model training: Standard optimisation methods, such as stochastic gradient descent (SGD), are used to train the model on the augmented dataset.
  • Model evaluation: The model is tested on a dataset that includes both clean and adversarial examples.
  • Iterate: The stages above are repeated over numerous epochs, with further adjustments to the adversarial inputs, to progressively strengthen the model's resistance to adversarial attacks.
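The steps above can be sketched end-to-end on a toy problem. The loop below adversarially trains a hand-rolled logistic regression: each epoch it generates FGSM perturbations of the inputs, augments the batch with them, and takes one gradient step on the combined data. The dataset, learning rate, and epsilon are all invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy linearly separable task

w, b, lr, eps = np.zeros(2), 0.0, 0.5, 0.1
for epoch in range(100):
    # 1. Generate adversarial examples: one FGSM step per input.
    p = sigmoid(X @ w + b)
    grad_x = (p - y)[:, None] * w
    X_adv = X + eps * np.sign(grad_x)
    # 2. Augment the training set with the perturbed copies.
    X_aug = np.vstack([X, X_adv])
    y_aug = np.concatenate([y, y])
    # 3. One full-batch gradient step on the augmented data.
    p_aug = sigmoid(X_aug @ w + b)
    w -= lr * (X_aug.T @ (p_aug - y_aug)) / len(y_aug)
    b -= lr * float(np.mean(p_aug - y_aug))

accuracy = float(np.mean(((X @ w + b) > 0) == (y == 1)))
print(accuracy > 0.9)  # the adversarially trained model still fits the task
```

In practice the same loop runs with a deep network and an autodiff framework supplying `grad_x`, but the structure (attack, augment, update, repeat) is identical.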

Despite its limitations, adversarial training is a powerful method for making machine learning models more resistant to attack.

Adversarial perturbations for attacking the original model may also be generated with a surrogate classifier. When the original model is too complicated or difficult to attack directly, a surrogate model serves as a more manageable alternative: a new neural network is trained to closely mimic the original model's structure and behaviour, and this stand-in is then used to produce adversarial perturbations for evaluating the original model's resistance to attack.

Assess Risk
Before a model is deployed, the risks it faces must be assessed, including the likelihood and severity of an adversarial attack.

Determine what kinds of malicious attacks the model could face in production: evasion, poisoning, and model extraction are all possibilities.

Evaluate the likelihood and impact of each identified attack. The likelihood may depend on factors such as the attacker's access to the machine learning model, their familiarity with the model's design, and the difficulty of mounting the attack; the severity may be measured by the cost of inaccurate predictions or the damage to the model's reputation.

Create safeguards to reduce the severity and frequency of the identified attacks. Possible measures include using several machine learning models for prediction, restricting access to the model, and applying input preprocessing techniques to identify potentially hostile inputs.

Identify malicious attacks by monitoring the model's output and activity: the model's decision-making process or the distribution of its input data can be watched for telltale signs of an attack.

As new attack scenarios appear or the deployment environment evolves, it is crucial to continually reassess and adapt these techniques.

Validate Training Data
Validating the data used to train a model entails double-checking and confirming it. This procedure may include data preparation, cleaning, augmentation, and quality checks; normalisation and compression are necessary preprocessing steps. Input normalisation, in which the data is preprocessed to fit a given distribution, is another viable option: it can help mitigate adversarial attacks that introduce small perturbations into the input data. Data encryption and sanitisation further strengthen the training data, and it is also worth considering training the model in an offline environment.
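As a minimal sketch of such input preprocessing (clipping plus 8-bit re-quantisation, in the spirit of feature-squeezing defences, with the input values invented for the example):

```python
import numpy as np

def preprocess(img):
    """Two cheap input-sanitisation steps: clip values into the valid
    pixel range [0, 1], then re-quantise to 8-bit levels. Quantisation
    wipes out perturbations much smaller than one quantisation step."""
    clipped = np.clip(img, 0.0, 1.0)
    return np.round(clipped * 255.0) / 255.0

x = np.array([0.5, 1.2, -0.1])
x_adv = x + 0.001          # a perturbation well below the quantisation step
print(np.array_equal(preprocess(x), preprocess(x_adv)))  # perturbation erased
```

This is no defence against larger perturbations, but it raises the floor: an attacker must now perturb inputs by at least a visible quantisation step to affect the model.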

Regular inspection of the training data for contamination improves the model's resistance to adversarial attacks.

The sophistication of machine learning models will only increase over time, and as adversarial attacks grow in sophistication too, defences will need to be multifaceted. Training deep learning models on adversarial examples helps improve resilience, but existing implementations of this technique have not completely solved the problem, and scaling remains difficult.

Defending against adversarial attacks is a continual process: machine learning professionals must stay on high alert, regularly subject their systems to attack simulations, and adjust their defence techniques in response to changing threats.
