This article introduces key concepts in machine learning such as feature selection, feature engineering, data cleaning, training sets, and test sets. Furthermore, we will explain the types of data you may encounter and how to handle potential issues.
The Basics of Machine Learning
Machine learning, a subset of Artificial Intelligence (AI), leverages data to solve complex problems for which traditional programming techniques are inadequate. An early application of machine learning is the email spam filter, which achieves significantly better accuracy than the older algorithms it replaced.
Machine learning is highly data-dependent, and issues such as insufficient data, poor data quality, incorrect or missing values, irrelevant data, and duplicate values can pose significant challenges. Fortunately, there are techniques to mitigate these data-related issues, as we will see later in this article.
If you’re new to machine learning, the first term to understand is ‘dataset’: a collection of data values stored in a format such as a CSV file or spreadsheet. Each column is a ‘feature’, and each row is a ‘datapoint’ containing a specific value for each feature. In a customer dataset, for instance, each row represents one customer.
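To make this concrete, here is a minimal sketch of inspecting a dataset with pandas. The customer data and its column names are made up purely for illustration; a real dataset would typically be loaded from a file (for example with pd.read_csv).

```python
import pandas as pd

# A tiny, made-up customer dataset: columns are features, rows are datapoints.
df = pd.DataFrame({
    "age": [34, 52, 29],
    "country": ["DE", "US", "JP"],
    "total_spend": [120.50, 340.00, 89.99],
})

print(list(df.columns))   # the features: ['age', 'country', 'total_spend']
print(df.shape)           # (3 datapoints, 3 features)
print(df.iloc[0])         # one datapoint: every feature value for a single customer
```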
Diverse Types of Machine Learning
Machine learning primarily encompasses three categories: Supervised learning, Unsupervised learning, and Semi-supervised learning. However, combinations of these are also possible.
Supervised Learning
In supervised learning, each datapoint in the dataset has a label identifying its contents. For example, in the popular MNIST dataset, each 28×28 grayscale image contains a single handwritten digit labeled 0 through 9. Similarly, the Titanic dataset records features for each passenger, such as gender, cabin class, and ticket price, along with a survival label. Both lend themselves to classification tasks: a model is trained on the labeled data and then used to predict the class of each row in a test dataset.
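As a rough illustration of this workflow (not the MNIST or Titanic pipelines themselves), the sketch below trains a classifier on scikit-learn's small built-in Iris dataset and predicts the class of each row in a held-out test set; scikit-learn is assumed to be installed, and the model choice is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small labeled dataset: features X, labels y.
X, y = load_iris(return_X_y=True)

# Hold out part of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a classifier on the labeled training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict the class of each row in the test set.
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```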
Unsupervised Learning
Unsupervised learning deals with unlabeled data and is commonly used for clustering and dimensionality reduction tasks. The most widely used algorithms include the following (a short clustering sketch follows the list):
- Clustering: k-Means, Hierarchical Cluster Analysis (HCA), Expectation Maximization.
- Dimensionality Reduction: PCA (Principal Component Analysis), Kernel PCA, LLE (Locally Linear Embedding), t-SNE (t-distributed Stochastic Neighbor Embedding).
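To ground the clustering idea, here is a minimal k-Means sketch using scikit-learn on synthetic, unlabeled 2D points; the data and the choice of three clusters are assumptions made only for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate unlabeled 2D points grouped around three centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Cluster the points without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

print(cluster_ids[:10])          # cluster assigned to the first ten points
print(kmeans.cluster_centers_)   # coordinates of the learned cluster centers
```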
Anomaly detection, another critical unsupervised task, is particularly useful for fraud detection and outlier detection.
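One way anomaly detection can look in practice is the sketch below, which uses scikit-learn's IsolationForest on synthetic data; the data and the contamination rate are illustrative guesses, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly "normal" points around the origin, plus a few far-away outliers.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=9.0, size=(5, 2))
X = np.vstack([normal, outliers])

# Fit an Isolation Forest; it flags unusual points as -1 and normal points as 1.
detector = IsolationForest(contamination=0.03, random_state=42)
labels = detector.fit_predict(X)

print("Points flagged as anomalies:", int((labels == -1).sum()))
```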
Semi-Supervised Learning
Semi-supervised learning, a combination of supervised and unsupervised learning, involves some labeled and some unlabeled datapoints. One technique involves using the labeled data to classify (label) the unlabeled data, after which you can apply a classification algorithm.
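In code, that technique (often called pseudo-labeling) might look like the sketch below: most of the Iris labels are hidden, a model is trained on the few that remain, the model's own predictions label the rest, and a final classifier is trained on the combined data. The masking scheme and model choice are purely illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Pretend most labels are unknown: keep only every tenth label.
labeled_mask = np.zeros(len(y), dtype=bool)
labeled_mask[::10] = True

# Step 1: train on the small labeled portion.
model = LogisticRegression(max_iter=1000)
model.fit(X[labeled_mask], y[labeled_mask])

# Step 2: use the model to assign pseudo-labels to the unlabeled portion.
pseudo_labels = model.predict(X[~labeled_mask])

# Step 3: retrain a classifier on the combined (true + pseudo) labels.
X_all = np.vstack([X[labeled_mask], X[~labeled_mask]])
y_all = np.concatenate([y[labeled_mask], pseudo_labels])
final_model = LogisticRegression(max_iter=1000).fit(X_all, y_all)
```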
Feature Selection and Extraction
Feature selection (choosing the most informative of the existing features) and feature extraction (deriving new features from the originals) are crucial parts of preparing a dataset for machine learning, and both can be accomplished with a variety of algorithms. They help reduce dimensionality, improve model accuracy, and minimize computational demands.
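As a rough sketch of both ideas with scikit-learn, the example below keeps the two most informative original features (selection) and, separately, derives two new components with PCA (extraction); the choice of two features and two components is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Feature selection: keep the two features most strongly related to the labels.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))

# Feature extraction: compress the original features into two new components.
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(X)
print("Variance explained by the two components:", pca.explained_variance_ratio_)
```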
Linear Regression in Machine Learning
Despite its inception over 200 years ago, linear regression remains a core technique for solving basic problems in statistics and machine learning. Its simplicity and effectiveness, especially in predicting continuous outcomes, explain its wide usage. Libraries such as TensorFlow typically use Mean Squared Error (MSE) as the cost function: the best-fitting line through the data points in a 2D plane (or hyperplane in higher dimensions) is the one that minimizes the MSE.
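For a concrete feel of MSE, the tiny sketch below scores one candidate line against a handful of made-up points; the best-fitting line is the one that drives this average squared error as low as possible.

```python
# Score the candidate line y = 2x + 1 against a few made-up (x, observed y) points.
points = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8)]

slope, intercept = 2.0, 1.0
squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in points]
mse = sum(squared_errors) / len(squared_errors)
print(mse)   # the best-fitting line is the one that makes this value smallest
```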
Application in Python using NumPy and Keras
Python libraries such as NumPy and Keras offer robust functionality for executing linear regression tasks. Python’s simple syntax, coupled with the versatility of these libraries, makes it an excellent choice for machine learning applications.
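Here is one possible end-to-end sketch: the same noisy toy data is fitted first with NumPy's closed-form least squares and then with a one-unit Keras model trained to minimize MSE. The data, learning rate, and epoch count are illustrative choices, not recommendations, and TensorFlow is assumed to be installed.

```python
import numpy as np
from tensorflow import keras

# Toy data: y is roughly 2x + 1 plus noise (values chosen only for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200).astype("float32").reshape(-1, 1)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.shape).astype("float32")

# NumPy: closed-form least-squares fit of a straight line.
slope, intercept = np.polyfit(x.ravel(), y.ravel(), deg=1)
print(f"NumPy fit: slope={slope:.2f}, intercept={intercept:.2f}")

# Keras: a single Dense unit with no activation is a linear model (y = w*x + b),
# trained here by minimizing the MSE cost function.
model = keras.Sequential([keras.Input(shape=(1,)), keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.1), loss="mse")
model.fit(x, y, epochs=200, verbose=0)

w, b = model.layers[-1].get_weights()
print(f"Keras fit: slope={w[0][0]:.2f}, intercept={b[0]:.2f}")
```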
This article only scratches the surface of machine learning. There are numerous algorithms and techniques waiting to be explored and mastered. Some of these will be discussed in greater detail in subsequent chapters.