Internet Measurement and Data Analysis

September 25, 2013 – January 8, 2014

Lecturer

Dr. Kenjiro Cho
Guest Professor
Faculty of Environment and Information Studies
Keio University

 

Course Summary

It has become possible to access a huge amount of diverse data through the Internet. This allows us to obtain new knowledge and create new services, leading to innovations known as “Big Data” or “Collective Intelligence”. To understand such data and use it as a tool, one needs a good understanding of the technical background in statistics, machine learning, and computer network systems.

This class gives an overview of large-scale data analysis on the Internet and teaches the basic skills needed to extract new knowledge from massive amounts of information in the forthcoming information society.

Theme, Goals, Methods

In this class, you will learn methods for collecting and analyzing data on the Internet, gaining knowledge and understanding of networking technologies and large-scale data analysis.

Each class covers specific topics, presenting both the technologies and the theories behind them. In addition to the lecture, each class includes programming exercises for building data analysis skills.

Textbooks, References

The lecture slide materials will be provided online.
ruby: http://www.ruby-lang.org/
gnuplot: http://gnuplot.info/
[1] Mark Crovella and Balachander Krishnamurthy. Internet measurement: infrastructure, traffic, and applications. Wiley, 2006.
[2] Pang-Ning Tan, Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2006.
[3] Raj Jain. The art of computer systems performance analysis. Wiley, 1991.
[4] Toby Segaran. Programming Collective Intelligence. O’Reilly Media, 2007.
[5] Allen B. Downey. Think Stats: Probability and Statistics for Programmers. O’Reilly Media, 2011.
[6] Chris Sanders. Practical Packet Analysis, 2nd Edition. No Starch Press, 2011.

Prerequisites

The prerequisites for the class are basic programming skills and a basic knowledge of statistics.

In the exercises and assignments, you will need to write programs to process large data sets, using the Ruby scripting language and the Gnuplot plotting tool. To understand the theoretical aspects, you will need basic knowledge about algebra and statistics. However, the focus of the class is to understand how mathematics is used for engineering applications.

 

Lecture Schedule

 You can join the live sessions (Schedule TBD)
◯ Learn from the archived video

 

TYPE # DATE TOPIC / THEME
   1 Sep 25, 2013 (Wed) Introduction
– Big Data and Collective Intelligence
– Internet measurement
– Large-scale data analysis
– exercise: introduction to the Ruby scripting language
   2 Oct 2, 2013 (Wed) Data and variability
– Summary statistics
– Sampling
– How to make good graphs
– exercise: graph plotting with Gnuplot
  ◯  3 Oct 9, 2013 (Wed) Data recording and log analysis
– Network management tools
– Data format
– Log analysis methods
– exercise: log data and regular expressions
  ◯  4 Oct 16, 2013 (Wed) Distribution and confidence intervals
– Normal distribution
– Confidence intervals and statistical tests
– Distribution generation
– exercise: confidence intervals
– assignment 1
  ◯ 5 Oct 23, 2013 (Wed) Diversity and complexity
– Long tail
– Web access and content distribution
– Power-law and complex systems
– exercise: power-law analysis
   6 Oct 30, 2013 (Wed) Correlation
– Online recommendation systems
– Distance
– Correlation coefficient
– exercise: correlation analysis
   7 Nov 6, 2013 (Wed) Multivariate analysis
– Data sensing
– Linear regression
– Principal Component Analysis
– exercise: linear regression
   8 Nov 13, 2013 (Wed) Time-series analysis
– Internet and time
– Network Time Protocol
– Time series analysis
– exercise: time-series analysis
– assignment 2
  ◯  9 Nov 27, 2013 (Wed) Topology and graph
– Routing protocols
– Graph theory
– exercise: shortest-path algorithm
   10 Dec 4, 2013 (Wed) Anomaly detection and machine learning
– Anomaly detection
– Machine Learning
– Spam filtering and Bayes' theorem
– exercise: naive Bayesian filter
   11 Dec 11, 2013 (Wed) Data Mining
– Pattern extraction
– Classification
– Clustering
– exercise: clustering
   12 Dec 18, 2013 (Wed) Search and Ranking
– Search systems
– PageRank
– exercise: PageRank algorithm
   13 Dec 25, 2013 (Wed) Scalable measurement and analysis
– Distributed parallel processing
– Cloud computing technology
– MapReduce
– exercise: MapReduce algorithm
   14 Jan 8, 2014 (Wed) Privacy Issues
– Internet data analysis and privacy issues
– Summary of the class