This is a list of Research in CMS projects from summer 2018.

- Structured high-dimensional changepoint estimation
- How Long Before We Are There ?
- Practical aspects of $k$ nearest neighbour methods
- Oracle complexity for convex optimization on the real line

## Structured high-dimensional changepoint estimation

Contact Name | Richard Samworth and Tengyao Wang |

Contact Email | r.samworth@statslab.cam.ac.uk, T.Wang@statslab.cam.ac.uk |

Company Name | Statistical Laboratory |

Address | Wilberforce Road, Cambridge, CB3 0WB |

Period of the Project | 8 weeks, from second week of July onwards |

Project Open to | Undergraduates |

Deadline to Register Interest | 3 March |

Brief Description of the Project |
These days, it is very common to monitor simultaneously many attributes that may vary over time. Times at which the underlying data generating mechanism changes, known as changepoints, are of particular interest in many applications. For instance, in internet security monitoring, changepoints may indicate a distributed denial of service attack, while in functional Magnetic Resonance Imaging studies, a changepoint may indicate a significant neurological activity. Frequently, changes occur only in a relatively sparse subset of the attributes. For instance, in the case of stock price data, it may well be the case that market shocks affect only a particular industry sector. A recent method, called `inspect', has been proposed to estimate sparse changes in the mean vector of a high-dimensional data stream (Wang and Samworth, 2018). The aim of this project is to generalise the `inspect' methodology to take advantage of known or hypothesised structures in the mean changes. Examples include spatial clustering of changes in fMRI or other images and sector grouping in financial time series. This project would be particularly suitable for a statistician who has just finished Part II (though it may be accessible to a very strong Part IB student too). It has the potential to involve a mixture of methodological development, generalisation of existing theoretical results, implementation of algorithms in R and applications to real-world datasets, though the weights of this mixture can be tailored to skills and interests of the student involved. Wang, T., Samworth, R. J. (2018) High-dimensional change point estimation via sparse projection. J. Roy. Statist. Soc., Ser. B, 80, 57–83. |

Skills Required | Statistics; some linear algebra, analysis and probability. Enthusiasm for a genuine research project. |

Skills Desired | Previous experience of R |

## How Long Before We Are There ?

Contact Name | Herbert E. Huppert |

Contact Email | heh1@cam.ac.uk |

Company Name | DAMTP |

Address | Wilberforce Road, Cambridge, CB3 0WB |

Period of the Project | 8 weeks |

Project Open to | Undergraduates |

Deadline to Register Interest | 3 March |

Brief Description of the Project | This project would be stimulating for anyone interested in solving numerically nonlinear partial differential equations relevant to Biology, Earth Sciences, Elasticity, Fluid Mechanics, …… The point is that there are a huge number of nonlinear pdes that do not have analytic solutions. Often one can find similarity solutions, which reduce the number of independent variables, but still leads, generally, to a nonlinear equation whose solutions are independent of the initial conditions. But what role do they play? It is generally stated that the similarity solution agrees with the (not determined) exact solution when some variable, t say, obeys t >> t_1. But what is t_1? How does it depend on the initial conditions? How large must t be for the similarity solution to be within 15, 10, 5, 1, 0.1, ….. percent of the real solution? And how does this depend on the parameters of the problem? The student will consider a series of typical, fundamental problems using a new numerical program and approach I have developed. A conscientious effort over the summer should lead to two first class scientific papers to be published in top quality journals. |

Skills Required | Undergraduate understanding of nonlinear partial differential equations; ability to augment and write code to numerically solve numerous differential equations. |

Skills Desired | Ambition and pleasure to undertake exciting research project; compatibility with (many) other members of the group; desire to write influential papers which will have wide uptake. |

## Practical aspects of $k$ nearest neighbour methods

Contact Name | Richard Samworth (project to be jointly supervised with Tom Berrett) |

Contact Email | r.samworth@statslab.cam.ac.uk |

Company Name | Statistical Laboratory |

Address | Wilberforce Road, Cambridge, CB3 0WB |

Period of the Project | 8 weeks, from 7 July |

Project Open to | Undergraduates |

Deadline to Register Interest | 3 March |

Brief Description of the Project |
Nearest neighbour methods are a family of techniques whose wide-ranging influence can be felt in many areas of data science. Their use in statistics dates back at least as far as Fix and Hodges (1951) in which the search for a fully nonparametric classification rule led the authors to propose a $k$-nearest neighbour classifier and density estimator. A large part of the popularity of nearest neighbour methods is undoubtedly due to their simplicity and analytic tractability, which make efficient practical implementation possible and open up the challenge of providing a thorough theoretical understanding. For their use in recent work on entropy estimation, classification and independence testing see Berrett, Samworth and Yuan (2018), Cannings, Berrett and Samworth (2017) and Berrett and Samworth (2017). As with many nonparametric techniques there is a tuning parameter, in this case $k$, whose value may affect the performance of the procedures significantly. Heuristically speaking, in many applications $k$ can be seen as controlling a bias--variance trade-off where larger values of $k$ result in larger bias and smaller values of $k$ result in larger variance. Theoretical results provide some insight into the optimal choice of $k$ (e.g. Cannings, Berrett and Samworth, 2017), but in practice $k$ is usually chosen heuristically or empirically, often by cross-validation (where possible). For this project the student would investigate the practical considerations of nearest neighbour methods in one or more of entropy estimation, classification and independence testing. This will principally involve the choice of $k$, but could also include choice of weights in weighted nearest neighbour methods or a study of whether standardising the data (prewhitening) reduces error. One further possibility is to explore the use of approximate nearest neighbours and the inherent trade-off between computational cost and statistical accuracy (e.g. Muja and Lowe, 2014). References: Berrett, T. B. and Samworth, R. J. (2017) Nonparametric independence testing via mutual information. Available at arXiv:1711.06642. Berrett, T. B., Samworth, R. J. and Yuan, M. (2018) Efficient multivariate entropy estimation via $k$-nearest neighbour distances. Ann. Statist., to appear. Cannings, T. I., Berrett, T. B. and Samworth, R. J. (2017) Local nearest neighbour classification with applications to semi-supervised learning. Available at arXiv:1704.00642. Fix, E. and Hodges, J. L. (1951) Discriminatory analysis -- nonparametric discrimination: Consistency properties. Technical Report number 4, USAF School of Aviation Medicine, Randolph Field, Texas. Muja, M. and Lowe, D.~G. (2014) Scalable nearest neighbour algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell., 36, 2227-2240. |

Skills Required | Statistics, and an enthusiasm to learn about nonparametric Statistics, real analysis |

Skills Desired | Some previous experience of R would be desirable, but not essential |

## Oracle complexity for convex optimization on the real line

Contact Name | Hamza Fawzi |

Contact Email | hf323@cam.ac.uk |

Company Name | DAMTP |

Address | DAMTP F0.16, Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA |

Period of the Project | 8 weeks between late June and 30 September |

Project Open to | Undergraduates, Part III (master's) students, PhD Students |

Deadline to Register Interest | 3 March |

Brief Description of the Project | The goal of the project is to get a tight complexity bound for the problem of minimizing a convex function on the real line. The complexity is defined in terms of the number of queries to the derivative of the function. It is known that the number of queries needed is a constant multiple of log(1/epsilon) where epsilon is the desired accuracy. However the exact constant is unknown. The goal of the project is to get the best possible constant. |

Skills Required | |

Skills Desired |