Recent technological advances in data science have fuelled the rise of Big Data, which is far too large for humans to compute and analyse without errors. To meet this challenge, data scientists turn to an artificial intelligence technique called machine learning, which uses programmed algorithms that receive and analyse input data to predict output values within an acceptable range. As new data is fed to these machine learning algorithms, they learn and optimize their operations to improve performance, developing ‘intelligence’ over time. To achieve this technology, where the rules no longer need to be explicitly programmed, data scientists rely on the machine learning algorithms explained below.
Machine learning algorithms
Machine learning algorithms are grouped either by learning style (supervised, unsupervised, semi-supervised, reinforcement) or by similarity in form or function. In this article, we will use the latter method, as it groups machine learning algorithms based on how they work. By learning style, however, the algorithms listed below fall under the supervised and unsupervised types.
1. Linear Regression (Supervised Learning/Regression)
Linear regression is the most basic type of regression and allows us to understand the relationship between two continuous variables. In machine learning, we have a set of input variables (x) that are used to determine an output variable (y), and the goal is to quantify the relationship between them. In linear regression, the relationship between the input variable (x) and the output variable (y) is expressed as an equation of the form y = a + bx. The goal of linear regression is therefore to find the values of the coefficients a and b, where a is the intercept and b is the slope of the line.
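As a quick illustration (not part of the original explanation), the sketch below fits such a line with scikit-learn; the library choice and the data points are assumptions made purely for demonstration.

# Minimal linear regression sketch (illustrative, made-up data)
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # input variable x
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])             # output variable y

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)              # estimated a
print("slope b:", model.coef_[0])                    # estimated b
print("prediction for x=6:", model.predict([[6.0]])[0])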
2. Logistic Regression (Supervised learning – Classification)
While linear regression predictions are continuous values (e.g., rainfall in cm), logistic regression predictions are discrete values (e.g., whether a student passed or failed) obtained after applying a transformation function.
Logistic regression is best suited for binary classification: data sets where y = 0 or 1, where 1 denotes the possibility for it to be true. For example, in predicting whether an event will occur or not, there are only two possibilities: that it occurs (which we denote as 1) or that it does not (0).
Logistic regression uses the logistic function h(x) = 1 / (1 + e^(-x)), which forms an S-shaped curve.
In logistic regression, the output takes the form of a probability, which lies in the range 0-1 (unlike linear regression, where the output is the predicted value itself).
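For concreteness, here is a minimal sketch of binary classification with scikit-learn’s LogisticRegression on made-up pass/fail data; the numbers are invented for illustration only.

# Minimal logistic regression sketch (illustrative, made-up pass/fail data)
import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])     # 1 = passed, 0 = failed

clf = LogisticRegression().fit(hours_studied, passed)
print(clf.predict_proba([[3.5]])[0, 1])   # probability of passing, between 0 and 1
print(clf.predict([[3.5]])[0])            # discrete class after thresholding at 0.5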
3. Naive Bayes (Supervised Learning – Classification)
The Naïve Bayes classifier is a machine learning algorithm based on Bayes’ theorem; it treats every feature as independent of every other feature. It allows us to predict a class/category, based on a given set of features, using probability.
Despite its simplicity, the classifier does surprisingly well and is often used because it can outperform far more sophisticated classification methods.
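A minimal sketch of such a classifier, assuming scikit-learn’s GaussianNB and invented data, might look like this:

# Minimal Naive Bayes sketch using Gaussian features (illustrative only)
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two numeric features per sample; labels are two arbitrary classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
y = np.array(["A", "A", "B", "B"])

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 2.0]]))        # most probable class
print(nb.predict_proba([[1.1, 2.0]]))  # per-class probabilities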
4. Classification and Regression Trees (CART) or Decision Trees
A decision tree is a flow-chart-like tree structure that uses a branching method to illustrate every possible outcome of a decision. Each node within the tree represents a test on a specific variable, and each branch is an outcome of that test.
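For illustration, the following sketch trains a small decision tree with scikit-learn on made-up data; both the library and the data are assumptions for demonstration purposes.

# Minimal decision tree sketch (illustrative, made-up data)
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # two test variables per sample
y = [0, 0, 0, 1]                        # outcome of the decision

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree))                # the flow-chart-like structure as text
print(tree.predict([[1, 1]]))           # follow the branches to a prediction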
5. KNN (K-Nearest Neighbors: Supervised Learning)
The K-Nearest-Neighbors algorithm estimates how likely a data point is to be a member of one group or another. It does this by looking at the data points nearest to the point in question. For example, if a point lies on a grid and the algorithm is trying to determine which group it belongs to (Group A or Group B, say), it looks at the K data points closest to it and assigns the point to the group to which the majority of those neighbours belong.
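A minimal KNN sketch, assuming scikit-learn and invented grid points, could look like this:

# Minimal K-Nearest Neighbors sketch with K=3 (illustrative, made-up grid points)
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],       # three points from Group A
     [6, 6], [6, 7], [7, 6]]       # three points from Group B
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2]]))   # majority of the 3 nearest neighbours -> "A"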
6. Apriori (unsupervised)
The Apriori principle states that if an itemset is frequent, then all of its subsets must also be frequent. The Apriori algorithm is used on a transactional database to mine frequent itemsets and then generate association rules that suggest other items. It is popularly used in market basket analysis, where one checks for combinations of products that frequently co-occur in the database. In general, the association rule ‘if a person purchases item X, then he purchases item Y’ is written as X -> Y. For example, if a customer buys milk and sugar, then he is likely to buy coffee as well.
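To make the notions of support and rules concrete, here is a tiny hand-rolled sketch (not the full Apriori algorithm, and not from the original article) over a few invented transactions:

# Tiny sketch of the ideas behind Apriori: itemset support and rule confidence
# (hand-rolled for illustration; real work would use a dedicated implementation)
transactions = [
    {"milk", "sugar", "coffee"},
    {"milk", "sugar", "coffee"},
    {"milk", "bread"},
    {"sugar", "coffee"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Support of {milk, sugar} and confidence of the rule {milk, sugar} -> {coffee}
antecedent, consequent = {"milk", "sugar"}, {"coffee"}
print("support:", support(antecedent))
print("confidence:", support(antecedent | consequent) / support(antecedent))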
7. K-means (Unsupervised Learning – Clustering)
The K Means Clustering machine learning algorithm is a type of unsupervised learning, which is used to categorize unlabeled data, i.e. data without defined categories or groups. The algorithm works by finding groups within the data, with the number of groups represented by the variable K. It then works iteratively to assign each data point to one of K groups based on the features provided.
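A minimal K-means sketch, assuming scikit-learn and made-up unlabeled points, might look like this:

# Minimal K-means sketch with K=2 (illustrative, made-up unlabeled points)
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 1.8],     # one natural group
              [8, 8], [8.5, 9], [9, 8]])      # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # the K group centres found iteratively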
8. PCA (Principal Component Analysis)
PCA is a machine learning algorithm used to explore and visualize data by reducing the number of variables. It does this by capturing the maximum variance in the data in a new coordinate system whose axes are called ‘principal components’. Each component is a linear combination of the original variables, and the components are orthogonal to one another, meaning the correlation between them is zero. The first principal component captures the direction of maximum variability in the data; the second captures the largest share of the remaining variance while being uncorrelated with the first.
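For illustration only, the following sketch uses scikit-learn’s PCA to reduce three invented variables to two principal components:

# Minimal PCA sketch reducing 3 variables to 2 principal components (illustrative)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))         # made-up data with 3 original variables

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # new coordinates along the principal components
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component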
9. Random Forest (Supervised Learning – Classification/Regression)
Random forests, or ‘random decision forests’, are an ensemble learning method, combining multiple decision trees to generate better results for classification, regression and other tasks. Each individual tree is a weak classifier, but combined with the others it can produce excellent results. The algorithm starts with a decision tree (a tree-like graph or model of decisions), and an input is entered at the top. It then travels down the tree, with the data being segmented into smaller and smaller sets based on specific variables.
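A minimal random forest sketch, assuming scikit-learn and invented data, could look like this:

# Minimal random forest sketch: an ensemble of decision trees (illustrative data)
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.1, 0.9], [0.9, 0.1]]
y = [0, 1, 1, 0, 1, 1]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(len(forest.estimators_))        # the individual trees being combined
print(forest.predict([[0.8, 0.2]]))   # ensemble prediction for a new input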
10. Support Vector Machine Algorithm (Supervised Learning – Classification)
Support Vector Machine algorithms are supervised learning models that analyse data for classification and regression analysis. They essentially filter data into categories, which is achieved by providing a set of training examples, each marked as belonging to one of two categories. The algorithm then builds a model that assigns new values to one category or the other.
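To illustrate, here is a minimal SVM sketch assuming scikit-learn’s SVC and made-up training examples:

# Minimal support vector machine sketch for binary classification (illustrative data)
from sklearn.svm import SVC

X = [[1, 1], [1, 2], [2, 1],     # training examples marked as category 0
     [6, 6], [6, 7], [7, 6]]     # training examples marked as category 1
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear").fit(X, y)
print(svm.predict([[2, 2], [6.5, 6.5]]))   # assign new values to a category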
Conclusion
Clearly, there is a lot to consider when choosing the right machine learning algorithms for your business’s analytics. The right choice depends on several factors, including, but not limited to, data size, quality and diversity, as well as the answers the business wants to derive from that data. Additional considerations include accuracy, training time, the number of parameters, the number of data points and much more. Choosing the right algorithm is therefore a combination of business needs, specifications, experimentation and the time available.