Data Analysis
Generalities
This course is intended to provide a (somewhat formal) basis for the use of statistics in physics. Relevant information:
Lecture Notes
The lecture notes written by Wes Metzger (an emeritus staff member of the Experimental HEP department)
will serve as the main guideline for the material that deals strictly with statistics.
There are many alternative sources of information about statistics; besides the books mentioned below, there are e.g. the lecture notes by
Els de Wolf (an emeritus Nikhef staff member).
Material covered
Legend for the items below: M = lecture notes by Metzger; C = book by Cowan. The numbers will typically refer to section numbers in these documents.
Note that, insofar as future lectures are concerned, this information pertains to the material covered in the previous academic year; it should however provide a fair impression of the upcoming lecture series.
In particular, the material of Week 7 may change.
- Week 1:
- Basics of probability (M2 except 2.2.5; C1.1-5)
- Specific probability distribution functions (M3.2-6, M3.8, M3.11; C1.1-5, C2.1-5, C2.7)
- Week 2 (note that the Law of Large Numbers and the topic of Bayesian reasoning are not discussed at this level in either M or C; for these see the slides posted on BlackBoard):
- Characteristic function (M2.2.5; C10.1-2)
- Central Limit Theorem, Law of Large Numbers (M5; C10.3)
- Bayesian reasoning, continued
- Week 3:
- Test statistic, estimator; bias, consistency (M7, M8.2.1-2; C5)
- Method of Moments (M8.3; C8)
- Information and the likelihood function (M8.2.5; C6.6)
- Cramér-Rao lower variance bound and the exponential family (M8.2.6-7)
- Week 4:
- Maximum Likelihood Method (M8.4.1-7; C6, except C6.8, C6.11, C6.12)
- Least Squares Method (M8.5.1-5, M8.5.9; C7)
- Week 5:
- Confidence Intervals (M9.1-6, M9.11; C9.1-4, C9.7-8)
- Unphysical regions (C6.13)
- Week 6:
- Hypothesis testing basics (simple versus composite, size, power, ...) (M10.2, M10.3.1-4; C4.1)
- Simple hypotheses and the Neyman-Pearson lemma (M10.4.1; C4.3)
- Simple null hypothesis versus composite alternative hypothesis; exponential family (M10.4.2)
- Likelihood ratio for composite hypotheses; asymptotic statistics (M10.4.3)
- χ² test; Pearson's χ² test for binned distributions (M10.6.3, M10.6.5; C4.7)
- Non-parametric testing: Run test (M10.6.6)
- Non-parametric testing: Kolmogorov-Smirnov test (M10.6.7)
- Week 7:
Language specific items
As specified in the Study Prospectus, it should be possible to complete the computer-based exercises using any of the C, C++, Python, R, or MATLAB/Octave programming languages or packages.
Some (hopefully useful) suggestions for (freely available) software that can be used in some of these languages:
- C++: ROOT. This is a large package used widely in particle physics, and of interest to anyone (about to be) involved in particle physics data analysis.
It is freely available and comes with its own visualisation code, but it is a rather large code base. Version 6.08/00 is presently installed on the faculty's Linux machines.
Alternatively, the Boost libraries, and in particular the Math/Statistical Distributions library, may be of interest. Boost also involves a fairly large amount
of code, though, and is not presently installed (to my knowledge) on the university's Linux systems.
- Python: SciPy, NumPy, and matplotlib provide a lot of statistics, array, and visualisation functionality beyond what is available in bare Python;
they are installed on the university's Linux systems.
Other suggestions are most welcome! For instance, there is a high-level Python interface to ROOT called rootpy, which one may use to import ROOT's functionality into the Python world (with the possible benefit of being able to use compiled code for CPU-intensive computations).
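To give an impression of what the Python option mentioned above looks like in practice, here is a minimal sketch using NumPy and SciPy. It is only an illustration, not part of the course material: the function name summarize_sample and all parameter values (sample size, mean, width, seed) are arbitrary choices made for this example. It draws Gaussian random numbers, computes the sample mean with its standard error, and runs a Kolmogorov-Smirnov test against the distribution the numbers were drawn from.

```python
import numpy as np
from scipy import stats


def summarize_sample(n=1000, mu=5.0, sigma=2.0, seed=42):
    """Draw n Gaussian values and return a few summary statistics.

    All default values are arbitrary and serve only as an illustration.
    """
    rng = np.random.default_rng(seed)           # reproducible random numbers
    x = rng.normal(loc=mu, scale=sigma, size=n)

    mean = x.mean()
    std = x.std(ddof=1)                         # ddof=1 uses the n-1 denominator
    err = std / np.sqrt(n)                      # standard error of the mean

    # Kolmogorov-Smirnov test of the sample against the generating Gaussian
    ks = stats.kstest(x, "norm", args=(mu, sigma))
    return mean, std, err, ks.pvalue


if __name__ == "__main__":
    mean, std, err, pvalue = summarize_sample()
    print(f"mean = {mean:.3f} +- {err:.3f}, std = {std:.3f}, KS p-value = {pvalue:.3f}")
```

A histogram of the generated values could be added with matplotlib; this is left out here to keep the sketch short.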
Examination
The exam will consist of two parts: a take-home part (testing practical problem-solving skills) and an oral examination (testing mostly theoretical knowledge).
Exercise material
Some of the exercises require computer programming, and students are expected to be able to do this (using any of MATLAB, C, C++, R, or Python; these skills will also be needed for the take-home part of the exam).
Corresponding material will be made available on the Brightspace site.
Teaching assistance will be provided by Marion Missio.
Literature
The following books are good references:
- G. Cowan, Statistical Data Analysis, 1998, Oxford University Press, ISBN: 978-0198501558 (paperback).
This is a somewhat basic book and does not cover the recent developments; however, it offers a nice pedagogical introduction.
- O. Behnke, K. Kröninger, G. Schott, and T. Schörner-Sadenius (eds.), Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods,
2013, Wiley, ISBN: 978-3-527-41058-3 (paperback), 978-3-527-65343-0 (e-book).
This book is also available electronically through the university library.
It is more modern, has a somewhat wider scope and, despite its title, should be useful for other physics specialisations besides particle physics (notably astrophysics).
In addition, there are a host of other books covering more or less the same material. Here are a few:
- F. James, Statistical Methods in Experimental Physics (second edition), 2006, World Scientific, ISBN: 981-270-527-9 (paperback), 981-256-795-X (hardcover).
This fairly recent book focuses mostly on the foundations of statistics and makes for somewhat terse reading. However, it provides useful references.
- J.A. Rice, Mathematical Statistics and Data Analysis (third edition), 2007, Thomson, ISBN: 0-534-39942-8.
This book offers a fairly pedagogical introduction and again has a wider scope (notably, it goes beyond the assumption that data are always independent and identically distributed),
but apart from this remains fairly basic.