School of Science
STATISTICAL SCIENCES
Course unit
INFORMATIC METHODS FOR STATISTICS AND DATA SCIENCE
SCP7081820, A.A. 2018/19

Information concerning the students who enrolled in A.Y. 2018/19

Information on the course unit
Degree course Second cycle degree in
STATISTICAL SCIENCES
SS1736, Degree course structure A.Y. 2014/15, A.Y. 2018/19
N0
Number of ECTS credits allocated 9.0
Type of assessment Mark
Course unit English denomination INFORMATIC METHODS FOR STATISTICS AND DATA SCIENCE
Website of the academic structure http://www.stat.unipd.it/studiare/ammissione-laurea-magistrale
Department of reference Department of Statistical Sciences
E-Learning website https://elearning.unipd.it/stat/course/view.php?idnumber=2018-SS1736-000ZZ-2018-SCP7081820-N0
Mandatory attendance No
Language of instruction Italian
Branch PADOVA
Single Course unit The Course unit can be attended under the option Single Course unit attendance
Optional Course unit The Course unit can be chosen as Optional Course unit

Lecturers
Teacher in charge MASSIMO MELUCCI ING-INF/05

ECTS: details
Type Scientific-Disciplinary Sector Credits allocated
Educational activities in elective or integrative disciplines ING-INF/05 Data Processing Systems 9.0

Course unit organization
Period Second semester
Year 1st Year
Teaching method Classroom lectures

Type of hours | Credits | Teaching hours | Hours of individual study | Shifts
Lecture | 9.0 | 64 | 161.0 | No shifts

Calendar
Start of activities 25/02/2019
End of activities 14/06/2019

Examination board
Board From To Members of the board
2 Examination board A.Y. 2018/19 01/10/2018 30/09/2019 MELUCCI MASSIMO (Chair)
MORO MICHELE (Full member)
ZINGIRIAN NICOLA (Full member)

Syllabus
Prerequisites: The prerequisites are basic but essential: the fundamentals of data structures (variables, files, vectors, matrices), algorithms, computer science, and database management systems.
Knowledge of a programming language is useful but not strictly necessary. Knowledge of R is discouraged.
Target skills and knowledge: We aim to provide effective knowledge of computational methods, so that students acquire greater competence in Statistics than an IT specialist and greater competence in Computer Science than a statistician. Particular emphasis will be placed on programming and data management, and on moving beyond the style of software writing encouraged by languages such as R and by pre-packaged software.
Examination methods: Given the nature and methods of teaching, the exam will be oral and will be based on the discussion of a mini-project that explores the topics covered in greater depth.
The "mini-project" is a Data Science application project. It is chosen and carried out independently by a group of one, two or three students. The aim of the project is to put into practice the course contents presented during the lessons. Written documentation must be submitted in digital format; a template is provided.
The group must be able to explain the problems, methodologies, tools and results of its own mini-project. The explanation will be verified through a questionnaire given to the members of the other groups.
The theme _must_ be chosen from the following:
1. Data streams (Stream Processing).
2. Frequent itemsets in data streams.
3. FP-growth algorithm for mining frequent itemsets.
4. Clustering of images / music / films.
5. Image / music / film recommendation.
6. Perceptrons and SVM.
7. Stacked grouping.
8. Spectral analysis of graphs.
9. SimRank.
10. Topics search engine.
11. CUR decomposition.
12. Topic-sensitive PageRank.
13. Link Spam and Link Farm.
14. Advertising with variable budgets (BALANCE algorithm).
Whatever the theme, the mini-project can be:
• descriptive: the report presents the problem, the approaches and the computational aspects; take care not to translate blindly from English, and instead aim to convey the essential aspects as if preparing study material for an exam; developing example software is recommended;
• empirical: the report illustrates the results of using one or more methods, without necessarily showing the superiority of one method over another; software must be developed and the datasets used must be made available;
• experimental: experiments are planned and carried out on public data collections in order to show the superiority of one method over another; the developed software must also be delivered; the report must describe the experiments in enough detail to make them reproducible;
• theoretical: the report illustrates the theoretical and formal properties of methods, models and algorithms through theorems or in-depth, rigorous discussion; although software is not strictly necessary in this case, it is advisable to support the discussion with empirical evidence (see above);
• methodological: a methodology is designed, i.e. a set of coordinated methods for solving a problem and achieving results; the methodology must be implemented as a series of working programs and must be documented in the report.
There are some requirements:
• the application software must be developed in Python; other tools are allowed, but only in a supporting role, such as R for statistical analysis and graphics; programs and data must be delivered in compressed archives or folders named after the group;
• the software must be written cleanly and must be commented in English or Italian; the names of objects and functions must be self-explanatory; the file names of programs and data must also be self-explanatory;
• the final application must be accompanied by a file named README.txt that briefly describes the files and how to use them.
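As an illustration of the requirements above, the following is a minimal sketch of what clean, commented Python with self-explanatory names might look like. The function name, the variable names and the toy data are hypothetical and are not taken from the course material.

```python
"""Count in how many baskets each item occurs (hypothetical example)."""
from collections import Counter


def count_item_frequencies(baskets):
    """Return a Counter mapping each item to the number of baskets containing it."""
    item_frequencies = Counter()
    for basket in baskets:
        # A set removes duplicates, so an item is counted once per basket.
        item_frequencies.update(set(basket))
    return item_frequencies


# Toy data, invented for this sketch.
example_baskets = [["bread", "milk"], ["bread", "beer", "bread"], ["milk"]]
item_frequencies = count_item_frequencies(example_baskets)
```

A README.txt accompanying such a file would list it by name and state in one or two lines how to run it and what input it expects.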
Assessment criteria: We will evaluate the understanding of the problems and the ability to find and design automated solutions for the organization, management and analysis of data, in order to carry out the tasks described in the course contents and required by the mini-project discussed in the oral exam.
Course unit contents: 1. Introduction to Python: environment, constructs, first examples.
2. Collection, organization and management of large volumes of data: pattern matching, parsing (XML, CSV).
3. Basic data structures: lists, hashes, graphs, trees.
4. Fundamental algorithms: recursion, searching, sorting.
5. Distributed architectures with MapReduce.
6. Representation and indexing, retrieval and ranking.
7. Networks, links and click-through: WWW, Link Analysis, HITS, PageRank.
8. Decomposition and dimensionality reduction.
9. Frequent itemsets.
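To give a flavour of the laboratory style, the following is a minimal sketch of PageRank (item 7) computed by power iteration in plain Python. It is an illustration only, not the course's reference implementation: the function name, the damping value of 0.85, the iteration count and the toy graph are assumptions, and every linked page is assumed to appear as a key of the input dictionary.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    out_links maps each page to the list of pages it links to.
    Assumption: every linked page is also a key of out_links.
    """
    pages = list(out_links)
    page_count = len(pages)
    scores = {page: 1.0 / page_count for page in pages}
    for _ in range(iterations):
        # Every page receives the same "teleport" share.
        new_scores = {page: (1.0 - damping) / page_count for page in pages}
        for page, targets in out_links.items():
            if targets:
                # A page splits its damped score evenly among its out-links.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A dangling page spreads its damped score over all pages.
                share = damping * scores[page] / page_count
                for target in pages:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Toy graph, invented for this sketch: A links to B and C, B to C, C to A.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
toy_scores = pagerank(toy_graph)
```

In this toy graph page C collects links from both A and B, so it ends up with the highest score, while B, linked only from A, gets the lowest.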
Planned learning activities and teaching methods: The contents will be covered mainly in laboratory form, by developing programs and using software libraries in Python.
Methodological elements will be introduced so that students understand the underlying issues, can design and implement projects, and can use the tools with awareness.
Additional notes about suggested reading: Teaching material will be distributed during the lessons in addition to the reference texts. Some texts, especially those for programming and data management, will be indicated at the beginning of the lessons.
Textbooks (and optional supplementary readings)
  • Melucci, Massimo, Information Retrieval. --: Franco Angeli, 2013.
  • Aho, Alfred; Ullman, Jeffrey D., Fondamenti di informatica. --: Zanichelli, --. English version available at http://infolab.stanford.edu/~ullman/focs.html [visited in April 2018]
  • Leskovec, Jure; Rajaraman, Anand; Ullman, Jeffrey D., Mining Massive Datasets. --: Cambridge University Press, 2014. Available at http://www.mmds.org [visited in April 2018]

Innovative teaching methods: Software or applications used
  • LaTeX