COMP 3400: Data Preparation Techniques

Skills Developed

  • Data preprocessing and cleaning
  • Data scaling, normalization, and discretization
  • Machine learning model implementation (supervised and unsupervised)
  • Dimensionality reduction techniques
  • Feature selection methods
  • Data integration and encoding
  • Large-scale data processing with MapReduce


Course Contents

  • Data Manipulation and Visualization:
      • IPython and Jupyter notebooks
      • NumPy and Pandas for data manipulation
      • Matplotlib for data visualization


  • Data Cleaning and Preprocessing:
      • Identifying and handling outliers and duplicates
      • Data scaling techniques (e.g., min-max, standard, robust scalers)
      • Normalization methods (range, clipping, log, z-score)
      • Discretization techniques (equal-width and equal-frequency binning)
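
The scaling, normalization, and discretization topics above can be illustrated with a minimal sketch. It is written in plain Python so it runs without extra libraries; in practice scikit-learn scalers would normally be used instead. The function names and the `ages` sample are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 25, 30, 40, 58]  # hypothetical sample data
scaled = min_max_scale(ages)
standardized = z_score(ages)
bins = equal_width_bins(ages, 4)
```

Here `scaled` runs from 0.0 to 1.0, `standardized` has mean zero, and `bins` groups the ages into four intervals of width 10.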


  • Machine Learning with Scikit-learn:
      • Supervised learning (Bayesian, k-Nearest Neighbors, Decision Trees, Linear Models)
      • Unsupervised learning (K-means, DBSCAN)
      • Impact of data preprocessing on model performance
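
One way to see how preprocessing can affect a model is a small nearest-neighbour experiment. The sketch below is a plain-Python 1-NN classifier, not scikit-learn's implementation, and the income/age data and the "young"/"senior" labels are made up for illustration. Without scaling, the income feature dominates the Euclidean distance because its numeric range is much larger.

```python
import math

def nearest_label(train, query):
    """Return the label of the training point closest to query (1-NN, Euclidean)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]

def fit_min_max(points):
    """Learn a (min, max) pair for each feature from the training points."""
    return [(min(col), max(col)) for col in zip(*points)]

def apply_min_max(point, ranges):
    """Rescale one point feature by feature into [0, 1] using the learned ranges."""
    return tuple(
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, (lo, hi) in zip(point, ranges)
    )

# Hypothetical training data: (income in dollars, age in years) -> label
train = [
    ((30000, 25), "young"),
    ((30200, 28), "young"),
    ((31000, 60), "senior"),
    ((30800, 57), "senior"),
]
query = (30700, 26)

# Raw features: distances are driven almost entirely by income.
raw_prediction = nearest_label(train, query)

# Scaled features: income and age contribute on comparable scales.
ranges = fit_min_max([features for features, _ in train])
scaled_train = [(apply_min_max(f, ranges), label) for f, label in train]
scaled_prediction = nearest_label(scaled_train, apply_min_max(query, ranges))
```

In this toy setup the raw features yield "senior" for a 26-year-old, while the scaled features yield "young". The scaling ranges are learned from the training points only and then applied to the query.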


  • Advanced Data Analysis Techniques:
      • Dimensionality reduction (PCA, t-SNE)
      • Feature selection methods
      • Data integration and encoding (join, merge, concatenation, one-hot encoding)
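
The integration and encoding ideas listed above can be sketched in a few lines. This is a plain-Python illustration of an inner join followed by one-hot encoding; with Pandas, `merge` and `get_dummies` would typically do the same job. The `customers` and `orders` records and both helper functions are hypothetical.

```python
def inner_join(left, right, key):
    """Combine rows from left and right that share the same value for key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def one_hot(values):
    """Return the sorted categories and one 0/1 indicator vector per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical tables that share an "id" column
customers = [
    {"id": 1, "city": "Paris"},
    {"id": 2, "city": "Paris"},
    {"id": 3, "city": "Lima"},
]
orders = [
    {"id": 1, "total": 20.0},
    {"id": 3, "total": 35.5},
    {"id": 4, "total": 12.0},
]

# Only ids present in both tables (1 and 3) survive an inner join.
joined = inner_join(customers, orders, "id")

# Turn the categorical "city" column into indicator columns.
categories, encoded = one_hot([row["city"] for row in joined])
```

Each encoded row contains exactly one 1, in the position of that row's category.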


  • Big Data Processing:
      • MapReduce framework
      • Apache Spark basics for scaling data analysis
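
The MapReduce model can be sketched on a single machine as a word count with separate map, shuffle, and reduce steps. This is a conceptual illustration in plain Python, not the Hadoop or Spark API; the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)

# Hypothetical input split into two "documents"
documents = ["big data needs big tools", "data tools scale"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
```

In a real cluster the map and reduce calls run in parallel on different machines, and the framework performs the shuffle between them.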