Astrostatistics and Data Mining

GREAT Summer School on Astrostatistics and Data Mining

The preliminary schedule for the Joint Workshop and Summer School can be consulted here. Please treat this program as preliminary during March and the beginning of April.

You can find a brief list of introductory concepts, with links to useful Wikipedia articles, here.

Models: Specification, complexity and choice (David Hogg)

What is a model? What freedoms does a model have and how can we capture that? Are qualitatively different models comparable? What is the difference between a likelihood and a probability for a model or for model parameters? How do we decide among models that are qualitatively similar but quantitatively different? How do we decide among models that are qualitatively different? The most important content will be conveyed through a lab session in which participants pair-code solutions to some model selection problems.

Table of contents
- Lecture 0: (to be provided in advance as links or bibliography if needed)
- Lecture 1: Model specification and likelihood formulation
- Lecture 2: Model complexity and choice
- Lecture 3: (pair-coding) Model selection workshop
- Lecture 4: (pair-coding) workshop continued
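
As a flavour of the kind of problem the model selection workshop could involve, here is a minimal sketch that compares polynomial models of increasing degree on simulated data. It is not taken from the lectures: the simulated straight-line data, the known Gaussian noise level `sigma`, and the use of the Bayesian Information Criterion (BIC) as the score are all assumptions made for illustration, and BIC is only one of several possible ways to trade goodness of fit against model freedom.

```python
import numpy as np

def gaussian_log_likelihood(y, model, sigma):
    """Log-likelihood of data y given model predictions and known Gaussian noise sigma."""
    resid = (y - model) / sigma
    return -0.5 * np.sum(resid**2 + np.log(2.0 * np.pi * sigma**2))

def bic_for_degree(x, y, sigma, degree):
    """Fit a polynomial by weighted least squares and return its BIC."""
    weights = np.full_like(x, 1.0 / sigma)          # polyfit expects 1/sigma weights
    coeffs = np.polyfit(x, y, degree, w=weights)
    max_log_like = gaussian_log_likelihood(y, np.polyval(coeffs, x), sigma)
    n_params = degree + 1                           # one coefficient per power of x
    return n_params * np.log(len(x)) - 2.0 * max_log_like

# Simulated data (an assumption for this sketch): a straight line plus noise.
rng = np.random.default_rng(42)
x = np.linspace(-1.0, 1.0, 50)
sigma = 0.1
y = 1.0 + 2.0 * x + rng.normal(0.0, sigma, size=x.size)

# Lower BIC is preferred; extra parameters must earn their penalty.
bics = {degree: bic_for_degree(x, y, sigma, degree) for degree in range(6)}
best_degree = min(bics, key=bics.get)
print(bics, best_degree)
```

The constant model (degree 0) is heavily disfavoured here because it cannot follow the slope, while higher degrees improve the fit only marginally and pay a penalty of `log(N)` per extra parameter.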

Knowledge Discovery and Data Mining (Giuseppe Longo)

Feature selection: filter approach, wrapper approach, PCA, Diffusion Maps. Supervised classification: the curse of dimensionality, bias-variance trade-off, the kernel trick, support vector machines, cross-validation, evaluation of classifiers. Unsupervised classification: taxonomy, evaluation measures.

Table of contents
- Lecture 0: (to be provided in advance as links or bibliography if needed)
- Lecture 1: what is data mining
- Lecture 2: feature selection and dimensionality reduction
- Lecture 3: classification tasks and supervised methods
- Lecture 4: clustering methods
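
Cross-validation and the evaluation of classifiers, listed among the topics above, can be illustrated with a short sketch. The example below is an assumption-laden toy, not course material: it uses two simulated Gaussian classes and a nearest-centroid classifier (chosen only because it needs nothing beyond NumPy) and estimates its accuracy with k-fold cross-validation. The function names are hypothetical.

```python
import numpy as np

def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test point the label of the closest class centroid."""
    labels = np.unique(train_y)
    centroids = np.array([train_x[train_y == lab].mean(axis=0) for lab in labels])
    # Distance from every test point to every centroid: shape (n_test, n_labels).
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]

def k_fold_accuracy(x, y, k=5, seed=0):
    """Return the held-out accuracy of each of k cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = nearest_centroid_predict(x[train_idx], y[train_idx], x[test_idx])
        scores.append(np.mean(pred == y[test_idx]))
    return np.array(scores)

# Simulated two-class data (an assumption for this sketch).
rng = np.random.default_rng(1)
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
class_b = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2))
x = np.vstack([class_a, class_b])
y = np.array([0] * 100 + [1] * 100)

scores = k_fold_accuracy(x, y)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores gives a rough idea of how much the accuracy estimate depends on the particular split between training and test data.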

Statistical Image Analysis (Robert Lupton)

The source detection problem, source modelling, catalogue cross correlations, combination of images...

Table of contents
- Lecture 0: (to be provided in advance as links or bibliography if needed)
- Lecture 1: The Sampling Theorem and Image Resampling
- Lecture 2: Object Detection and Measurement as Statistical Estimation
- Lecture 3: Workshop: object detection and measurement
- Lecture 4: (workshop continued, if needed)
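
To make the idea of object detection as statistical estimation concrete, here is a minimal sketch of a matched filter on a simulated image. It is an illustration under stated assumptions rather than the method taught in the lectures: the image contains a single Gaussian source on white Gaussian noise, the point spread function (PSF) is known exactly, and the correlation is circular because it is done with FFTs. All function names are hypothetical.

```python
import numpy as np

def gaussian_psf(shape, centre, width):
    """Unit-sum circular Gaussian evaluated on a pixel grid."""
    yy, xx = np.indices(shape)
    r2 = (yy - centre[0])**2 + (xx - centre[1])**2
    psf = np.exp(-0.5 * r2 / width**2)
    return psf / psf.sum()

def significance_map(image, kernel, noise_sigma):
    """Correlate the image with the PSF and express the result in units of its noise."""
    # Circular cross-correlation via the FFT; the kernel must be centred on pixel (0, 0).
    filtered = np.fft.irfft2(np.fft.rfft2(image) * np.conj(np.fft.rfft2(kernel)),
                             s=image.shape)
    # For white noise, the filtered image has variance sigma^2 * sum(kernel^2).
    return filtered / (noise_sigma * np.sqrt(np.sum(kernel**2)))

# Simulated image (an assumption for this sketch): one source plus white noise.
rng = np.random.default_rng(3)
shape, noise_sigma, width = (64, 64), 1.0, 2.0
true_position, flux = (20, 30), 50.0
image = rng.normal(0.0, noise_sigma, size=shape)
image += flux * gaussian_psf(shape, true_position, width)

# Build the kernel at the image centre, then shift its peak to pixel (0, 0).
kernel = np.fft.ifftshift(gaussian_psf(shape, (32, 32), width))
snr = significance_map(image, kernel, noise_sigma)
peak = np.unravel_index(np.argmax(snr), snr.shape)
print(peak, snr[peak])
```

The brightest pixel of the source is only about twice the noise level, but the filtered map combines all pixels under the PSF, so the peak significance is considerably higher than any single pixel would suggest.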

Technical aspects of the analysis of petabyte-size databases (Matthew Graham)

It would take over 33 years to watch a 1 PB MP3 movie, yet within the decade data sets of this size will be as everyday a feature of astronomical life as astro-ph or APOD. This section will cover the practical aspects of handling petascale (and larger) data sets and streams, including the new computational approaches needed to work with them, from an astronomer's perspective.

Table of contents
- Lecture 0: (to be provided in advance as links or bibliography if needed)
  - How big is a petabyte?
  - Big data sets en route: astronomy, other sciences
- Lecture 1: How to store a petabyte
  - What do you store?
  - Cost and performance of storage
  - Databases: relational vs non-relational, indexing
- Lecture 2: How to work with a petabyte
  - Distribution
  - Divide and conquer: MapReduce, Hadoop (how to sort 1 PB)
  - Putting things together: PIG
- Lecture 3: How to analyze a petabyte
  - Random access
  - Characterizing data
  - Streaming statistics
- Ideas for pair-coding examples (to be discussed with SOC / other lecturers).
  - Coding up a simple analysis routine using Hadoop
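
The streaming statistics and divide-and-conquer themes above can be sketched without a cluster. The example below is plain Python rather than Hadoop, and is only a stand-in under that assumption: each chunk of data is summarised in a single pass with Welford's online algorithm (playing the role of a map step), and the partial summaries are combined with the standard pairwise merge formula (playing the role of a reduce step). The class name `RunningStats` is hypothetical.

```python
import statistics
from dataclasses import dataclass

@dataclass
class RunningStats:
    """Single-pass mean and variance that never stores the data."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Welford's update for one new value."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def merge(self, other):
        """Combine two partial summaries, e.g. from different workers."""
        total = self.count + other.count
        if total == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / total
        m2 = self.m2 + other.m2 + delta**2 * self.count * other.count / total
        return RunningStats(total, mean, m2)

    @property
    def variance(self):
        """Unbiased sample variance (NaN until two values have been seen)."""
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
chunks = [data[:3], data[3:]]          # pretend these live on different machines

partials = []
for chunk in chunks:                   # "map": summarise each chunk independently
    stats = RunningStats()
    for value in chunk:
        stats.update(value)
    partials.append(stats)

merged = partials[0].merge(partials[1])  # "reduce": combine the summaries
print(merged.mean, merged.variance, statistics.variance(data))
```

Because each summary is just three numbers, the amount of data moved between workers does not grow with the size of the chunks.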

Time series analysis (Suzanne Aigrain)

This section will cover common tools for exploring and characterising time series and ensembles thereof. The first two lectures are devoted to time-domain and frequency-domain techniques respectively, and cover some frequently used exploratory methods. Particular attention will be devoted to the treatment of stochastic processes and mixtures of stochastic and periodic processes.

Table of contents
- Lecture 0: (to be provided in advance as links or bibliography if needed)
  - stationarity, autocorrelation function, (discrete) Fourier transform, window function
  - properties of the Gaussian distribution
- Lecture 1: Time-domain analysis
  - autocorrelation techniques
  - common time-domain filters
  - stochastic processes: ARIMA models, Gaussian processes
- Lecture 2: Frequency analysis
  - noise properties in the frequency domain
  - periodic signal detection
  - time-frequency analysis, wavelet transforms
- Lecture 3: Ensembles of time series
  - principal component analysis in the time and frequency domains
  - classification and clustering
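
Two of the Lecture 0 concepts, the autocorrelation function and the discrete Fourier transform, can be tried out in a few lines. The sketch below is illustrative only and rests on several assumptions: the series is evenly sampled with no gaps, the signal is a single sinusoid in white Gaussian noise, and the simple periodogram normalisation used here is one of several conventions. The function names are hypothetical.

```python
import numpy as np

def autocorrelation(series):
    """Sample autocorrelation at lags 0..N-1 for an evenly sampled series."""
    centred = series - series.mean()
    full = np.correlate(centred, centred, mode="full")
    acf = full[full.size // 2:]        # keep the non-negative lags
    return acf / acf[0]                # normalise so that lag 0 equals 1

def periodogram(series, dt):
    """Squared modulus of the discrete Fourier transform, with its frequencies."""
    centred = series - series.mean()
    power = np.abs(np.fft.rfft(centred))**2 / series.size
    freqs = np.fft.rfftfreq(series.size, d=dt)
    return freqs, power

# Simulated light curve (an assumption for this sketch).
rng = np.random.default_rng(0)
dt, true_period = 0.1, 5.0
t = np.arange(1000) * dt
flux = np.sin(2.0 * np.pi * t / true_period) + rng.normal(0.0, 0.5, size=t.size)

acf = autocorrelation(flux)
freqs, power = periodogram(flux, dt)
peak_freq = freqs[1:][np.argmax(power[1:])]   # skip the zero-frequency bin
lag_one_period = int(round(true_period / dt))
print(1.0 / peak_freq, acf[lag_one_period])
```

The periodic signal shows up twice: as a peak in the periodogram near a frequency of `1 / true_period`, and as a positive autocorrelation at a lag of one period. Unevenly sampled data, which are common in astronomy, need other estimators than the ones sketched here.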
