## Summer research intern projects for Cambridge students (2022)

This summer, together with Prof. Robert Mullins, we plan to supervise around four students for summer internships in our group, focusing on AutoML and ML security research.

We have listed around seven projects so that students can pick the ones that interest them most, but we are also happy to hear self-proposed projects if they are relevant to our research.

More details are available at http://to.eng.cam.ac.uk/teaching/urops/projects.html

**We advertise on the UROP page, but we also welcome MPhil or Part III graduates to spend the summer with us!**

Due to high demand, we will hold a short interview with every candidate at the end of April.

So, send me an email with your CV (yaz21@cam.ac.uk) if you are interested!

**Searchformer: search for the right attention**

The Transformer is a model architecture that relies on a so-called “attention” mechanism. Since the model was introduced [1], many different schemes have been proposed to simplify its implementation, e.g. Reformer [2], Linformer [3], Performer [4] and Longformer [5]. Intuitively, no single ‘best’ Transformer architecture exists for all NLP tasks and datasets. In fact, these mechanisms perform differently across a variety of language tasks.

This project will focus on using Network Architecture Search methods to automatically find the most appropriate attention mechanisms given the task and data.

[1] https://arxiv.org/abs/1706.03762

[2] https://arxiv.org/abs/2001.04451

[3] https://arxiv.org/abs/2006.04768

[4] https://arxiv.org/abs/2009.14794

[5] https://arxiv.org/abs/2004.05150

**Model backdooring through pre-processing**

A backdoor is a covert functionality in a machine learning model that causes it to produce incorrect outputs on inputs containing a certain “trigger” feature chosen by the attacker [1].

PyTorch provides an illustration of its supported image transforms in the torchvision documentation.

This project will explore whether it is possible to add backdoors to machine learning models through data pre-processing (e.g. using standard pre-processing to insert backdoor triggers).

The obvious first step is to show that some of these pre-processing transforms can be used to insert backdoor triggers. It is popular today for researchers to publish their ‘optimised’ series of pre-processing steps, a practice known as auto-augmentation [2]. Further steps of this project include making most of the existing pre-processing transforms differentiable and trying to optimise for a global backdoor trigger that works with all models. The threat model is that such an evil auto-augmentation policy could be published online, and anyone who trains their models with it would then introduce a backdoor trigger into their model.
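As a rough illustration of that first step, the sketch below hides a trigger inside something that looks like an ordinary augmentation function. It is a toy under stated assumptions, not the project's method: images are plain nested lists, the trigger is a bright square in one corner, and the relabelling to a target class follows the data-poisoning style of [1]. The names `stamp_trigger`, `poisoned_augment`, `poison_rate` and `target_label` are all hypothetical.

```python
import random

def stamp_trigger(image, patch_value=1.0, patch_size=2):
    """Return a copy of `image` with a small square in the bottom-right
    corner set to `patch_value` (a hypothetical, illustrative trigger)."""
    stamped = [row[:] for row in image]
    height, width = len(image), len(image[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            stamped[r][c] = patch_value
    return stamped

def poisoned_augment(image, label, target_label, poison_rate, rng):
    """Looks like a normal augmentation step, but with probability
    `poison_rate` it stamps the trigger and relabels the sample."""
    if rng.random() < poison_rate:
        return stamp_trigger(image), target_label
    return image, label

# Toy data: 1000 copies of a blank 4x4 image, all with label 3.
rng = random.Random(0)
clean = [[0.0] * 4 for _ in range(4)]
dataset = [(clean, 3)] * 1000

poisoned = [
    poisoned_augment(img, y, target_label=7, poison_rate=0.1, rng=rng)
    for img, y in dataset
]
num_poisoned = sum(1 for _, y in poisoned if y == 7)
print(f"{num_poisoned} of {len(poisoned)} samples carry the trigger")
```

A model trained on `poisoned` would see the corner patch co-occur with label 7, which is the association a backdoor relies on. Making the trigger the result of differentiable transforms, as the later steps propose, is not covered by this sketch.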

[1] https://arxiv.org/abs/1708.06733

[2] https://arxiv.org/abs/1911.06987

**Knowledge distillation for efficient transformers**

Transformer models are now widely used in many language processing tasks, yet their run-time cost prevents them from being deployed on a wide range of devices. The most computationally heavy block in a Transformer is the multi-head self-attention module. This module takes three inputs Q, K and V, and calculates the output as Y=softmax(QK^T)V [1]. The output is therefore a weighted sum of the value vectors in V, with the post-softmax values acting as the weights. Naturally, we should be able to distill these post-softmax values before they are multiplied with V.
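To make the formula concrete, here is a minimal, dependency-free sketch of a single attention head that returns both the output Y and the post-softmax weights. It follows the simplified formula in the text, so the 1/sqrt(d_k) scaling used in [1] is left out; the helper names and the toy Q, K, V values are illustrative assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def attention(q, k, v):
    """Return (Y, P) with P = softmax(QK^T) applied row-wise and Y = PV."""
    scores = matmul(q, transpose(k))
    probs = [softmax(row) for row in scores]
    return matmul(probs, v), probs

# Toy inputs: 2 queries, 3 keys/values, dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

y, probs = attention(q, k, v)
for row in probs:
    print([round(p, 3) for p in row])  # each row is a distribution over the keys
```

Each row of `probs` sums to one, so every output row in `y` is a convex combination of the rows of `v`. These rows are the "post-softmax values" that a distillation scheme could target.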

The project will explore how existing knowledge distillation techniques [2] can help us to utilise this post-softmax probability vector. The objective is to distill a smaller model without sacrificing too much performance. The smaller model would use fewer attention heads, a smaller hidden dimension (allowing smaller matrix multiplications), or more efficient number representations such as fixed-point quantization.
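One possible way to use the post-softmax probabilities is to penalise the divergence between the teacher's and the student's attention distributions. The sketch below is only one option under several assumptions: both models expose pre-softmax score matrices of the same shape (how to align different numbers of heads is left open), the loss is a row-wise KL divergence, and the `temperature` parameter is borrowed from the softened softmax of [2]. The function names are hypothetical.

```python
import math

def softmax(row, temperature=1.0):
    """Numerically stable softmax with a softening temperature."""
    peak = max(row)
    exps = [math.exp((x - peak) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """Mean row-wise KL between teacher and student attention distributions.

    Both arguments are pre-softmax score matrices (lists of rows) that are
    assumed to have the same shape."""
    losses = []
    for t_row, s_row in zip(teacher_scores, student_scores):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)

# Toy score matrices: 2 queries attending over 3 keys.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student = [[1.0, 0.8, 0.2], [0.1, 0.9, 0.7]]

print(attention_distillation_loss(teacher, student))  # positive mismatch
print(attention_distillation_loss(teacher, teacher))  # zero for identical maps
```

In practice a term like this would be added to the student's task loss, and the relative weighting between the two is a design choice the project would need to study.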

[1] https://arxiv.org/abs/1706.03762

[2] https://arxiv.org/abs/1503.02531

**Varying search spaces for Network Architecture Search**

Neural Network Architecture Search (NAS) methods [1] are algorithmic solutions to automate the manual process of designing the architectures of neural networks (e.g. “placement and routing” of different neural network layers). There is now also a growing interest in making NAS methods training-free [2].

In this project, we are interested in utilising one of these existing training-free NAS methods, but exploring how we might make the search space more dynamic. Most NAS methods today operate on a fixed search space, ignoring the fact that different designs of the search space can heavily affect the search quality. For instance, a search over possible activation functions might have a limited impact compared to a search over possible convolution types. This project will focus on building an extended training-free NAS that can also discover a customized search space using the loss landscape along different search dimensions.
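The idea that some search dimensions matter more than others can be illustrated with a small sketch. It measures, for each dimension, how much a proxy score changes when only that dimension is varied. Everything here is an assumption for illustration: `toy_proxy` is a made-up stand-in for a training-free score such as the one in [2], its numbers are not measurements, and the score spread is only a crude substitute for a loss-landscape analysis.

```python
import itertools
import statistics

def dimension_impact(search_space, proxy_score):
    """For each search dimension, average the spread (max - min) of the
    proxy score when only that dimension changes and the rest are fixed."""
    names = list(search_space)
    impact = {}
    for name in names:
        others = [n for n in names if n != name]
        spreads = []
        for combo in itertools.product(*(search_space[n] for n in others)):
            fixed = dict(zip(others, combo))
            scores = [proxy_score({**fixed, name: choice})
                      for choice in search_space[name]]
            spreads.append(max(scores) - min(scores))
        impact[name] = statistics.mean(spreads)
    return impact

# Hypothetical search space with two dimensions.
space = {
    "activation": ["relu", "gelu"],
    "conv": ["standard", "depthwise", "dilated"],
}

def toy_proxy(arch):
    """Made-up score; real use would call a training-free NAS metric."""
    conv_score = {"standard": 1.0, "depthwise": 0.4, "dilated": 0.7}[arch["conv"]]
    act_score = {"relu": 0.0, "gelu": 0.05}[arch["activation"]]
    return conv_score + act_score

impact = dimension_impact(space, toy_proxy)
for name, value in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.2f}")
```

A dynamic search space could then spend more of its budget on high-impact dimensions and freeze or drop low-impact ones, which is the kind of behaviour the project aims to explore.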

[1] https://arxiv.org/abs/1806.09055

[2] https://arxiv.org/abs/2006.04647

**Dynamic Sequence Pruning in Transformers**

There has been a series of schemes that explore how the efficiency of model inference may be improved by embracing the concept of dynamic computation [1, 2]. The idea is that not all computational components should be active all the time: components that are not responsible for a particular set of inputs should be turned off when performing network inference.

This project will explore how dynamic computation can help Transformer models to become more efficient. It is likely that not all tokens in a sequence are important, in which case we can apply a dynamic pruning technique [1] to them.
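A minimal sketch of input-dependent token pruning is given below. The importance signal used here, the total attention each token receives, is an assumption chosen for illustration and is not taken from [1]; `keep_ratio` and the function names are likewise hypothetical.

```python
import math

def attention_received(probs):
    """Importance of token j = total attention mass it receives,
    i.e. the sum of column j of a row-stochastic attention matrix."""
    return [sum(col) for col in zip(*probs)]

def prune_tokens(tokens, importance, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices], kept_indices

tokens = ["the", "cat", "sat", "down"]
# Toy attention matrix: row i is how token i distributes its attention.
probs = [
    [0.10, 0.60, 0.20, 0.10],
    [0.10, 0.50, 0.30, 0.10],
    [0.05, 0.45, 0.40, 0.10],
    [0.10, 0.50, 0.30, 0.10],
]

importance = attention_received(probs)
kept, kept_indices = prune_tokens(tokens, importance, keep_ratio=0.5)
print(kept, kept_indices)  # → ['cat', 'sat'] [1, 2]
```

Because the scores depend on the input, a different sentence would keep a different subset of positions, and later layers would only process the surviving tokens.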

[1] https://arxiv.org/abs/1810.05331

[2] https://arxiv.org/abs/2003.05997

**Data-model co-optimization using blackbox optimizers**

[1] https://research.google/pubs/pub46180/

**Graph-Connected Neural Networks**

The idea of learning on top of a graph has attracted much attention (e.g. GCN [1], GAT [2]). However, most of these graph neural networks assume the existence of a pre-defined embedding vector on each graph node without thinking carefully about the amount of pre-training required to obtain a good node embedding.

This project will investigate what would happen if we relax the constraint of having an embedding vector on graph nodes. We will explore and design neural networks together with new datasets. The implemented neural networks would be graph-connected, and the student would make a detailed comparison between the proposed method and existing GNNs.
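To show where the embedding assumption enters, the sketch below implements one round of neighbourhood averaging, a simplified form of the propagation used in GNNs. It is not the exact GCN rule of [1]: the symmetric degree normalisation, the learnable weight matrix and the non-linearity are all omitted, and the graph and feature values are toy assumptions.

```python
def mean_aggregate(adjacency, features):
    """One round of neighbourhood averaging with self-loops.

    adjacency: dict mapping each node to a list of its neighbours
    features:  dict mapping each node to its pre-defined embedding vector
    """
    updated = {}
    for node, neighbours in adjacency.items():
        group = [node] + list(neighbours)
        dim = len(features[node])
        updated[node] = [
            sum(features[n][d] for n in group) / len(group)
            for d in range(dim)
        ]
    return updated

# Toy path graph a - b - c, with a 2-dimensional embedding on every node.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

updated = mean_aggregate(adjacency, features)
for node, vector in updated.items():
    print(node, [round(x, 3) for x in vector])
```

The function cannot run without an entry in `features` for every node, which is exactly the constraint the project proposes to relax.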

[1] https://arxiv.org/abs/1609.02907

[2] https://arxiv.org/abs/1710.10903