Skills Developed
- Data preprocessing and cleaning
- Data scaling, normalization, and discretization
- Machine learning model implementation (supervised and unsupervised)
- Dimensionality reduction techniques
- Feature selection methods
- Data integration and encoding
- Large-scale data processing with MapReduce
Course Contents
- Data Manipulation and Visualization:
  - IPython and Jupyter notebooks
  - NumPy and Pandas for data manipulation
  - Matplotlib for data visualization
- Data Cleaning and Preprocessing:
  - Identifying and handling outliers and duplicates
  - Data scaling techniques (e.g., min-max, standard, robust scalers)
  - Normalization methods (range, clipping, log, z-score)
  - Discretization techniques (equal-width and equal-frequency binning)
- Machine Learning with Scikit-learn:
  - Supervised learning (Bayesian, k-Nearest Neighbors, Decision Trees, Linear Models)
  - Unsupervised learning (K-means, DBSCAN)
  - Impact of data preprocessing on model performance
- Advanced Data Analysis Techniques:
  - Dimensionality reduction (PCA, t-SNE)
  - Feature selection methods
  - Data integration and encoding (join, merge, concatenation, one-hot encoding)
- Big Data Processing:
  - MapReduce framework
  - Apache Spark basics for scaling data analysis
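
As a rough illustration of the scaling, normalization, and discretization topics listed under Data Cleaning and Preprocessing, the sketch below works through min-max scaling, z-score standardization, and equal-width binning using only the Python standard library. It is not taken from the course materials: the function names and the sample data are hypothetical, and in practice the course would more likely rely on Scikit-learn or Pandas for these steps.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width (indices 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample with one outlier (50.0).
data = [2.0, 4.0, 6.0, 8.0, 50.0]

scaled = min_max_scale(data)
standardized = z_score(data)
bins = equal_width_bins(data, n_bins=3)

print(scaled)
print(standardized)
print(bins)
```

With this sample, the single outlier stretches the range so far that equal-width binning places the four small values in the first bin and leaves the middle bin empty, which is one reason outlier handling and the choice of scaler can affect downstream model performance.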