
Gaussian Naive Bayes, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Oct, 2024


CLASSIFICATION ALGORITHM

Bell-shaped assumptions for better predictions

⛳️ More CLASSIFICATION ALGORITHM, explained:
· Dummy Classifier
· K Nearest Neighbor Classifier
· Bernoulli Naive Bayes
▶ Gaussian Naive Bayes
· Decision Tree Classifier
· Logistic Regression
· Support Vector Classifier
· Multilayer Perceptron (soon!)

Building on our previous article about Bernoulli Naive Bayes, which handles binary data, we now explore Gaussian Naive Bayes for continuous data. Unlike the binary approach, this algorithm assumes each feature follows a normal (Gaussian) distribution.

Here, we'll see how Gaussian Naive Bayes handles continuous, bell-shaped data, ringing in accurate predictions, all without getting into the intricate math of Bayes' Theorem.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Like other Naive Bayes variants, Gaussian Naive Bayes makes the "naive" assumption of feature independence. It assumes that the features are conditionally independent given the class label.

However, while Bernoulli Naive Bayes is suited to datasets with binary features, Gaussian Naive Bayes assumes that the features follow a continuous normal (Gaussian) distribution. Although this assumption may not always hold in reality, it simplifies the calculations and often leads to surprisingly accurate results.

Bernoulli NB assumes binary data, Multinomial NB works with discrete counts, and Gaussian NB handles continuous data assuming a normal distribution.

Throughout this article, we'll use this artificial golf dataset (made by the author) as an example. This dataset predicts whether a person will play golf based on weather conditions.

Columns: 'Rainfall' (in mm), 'Temperature' (in Celsius), 'Humidity' (in %), 'WindSpeed' (in km/h) and 'Play' (Yes/No, target feature)
# IMPORTING DATASET #
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

dataset_dict = {
'Rainfall': [0.0, 2.0, 7.0, 18.0, 3.0, 3.0, 0.0, 1.0, 0.0, 25.0, 0.0, 18.0, 9.0, 5.0, 0.0, 1.0, 7.0, 0.0, 0.0, 7.0, 5.0, 3.0, 0.0, 2.0, 0.0, 8.0, 4.0, 4.0],
'Temperature': [29.4, 26.7, 28.3, 21.1, 20.0, 18.3, 17.8, 22.2, 20.6, 23.9, 23.9, 22.2, 27.2, 21.7, 27.2, 23.3, 24.4, 25.6, 27.8, 19.4, 29.4, 22.8, 31.1, 25.0, 26.1, 26.7, 18.9, 28.9],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'WindSpeed': [2.1, 21.2, 1.5, 3.3, 2.0, 17.4, 14.9, 6.9, 2.7, 1.6, 30.3, 10.9, 3.0, 7.5, 10.3, 3.0, 3.9, 21.9, 2.6, 17.3, 9.6, 1.9, 16.0, 4.6, 3.2, 8.3, 3.2, 2.2],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Set feature matrix X and target vector y
X, y = df.drop(columns='Play'), df['Play']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
print(pd.concat([X_train, y_train], axis=1), end='\n\n')
print(pd.concat([X_test, y_test], axis=1))

Gaussian Naive Bayes works with continuous data, assuming each feature follows a Gaussian (normal) distribution.

  1. Calculate the probability of each class in the training data.
  2. For each feature and class, estimate the mean and variance of the feature values within that class.
  3. For a new instance:
    a. For each class, calculate the probability density function (PDF) of each feature value under the Gaussian distribution of that feature within the class.
    b. Multiply the class probability by the product of the PDF values for all features.
  4. Predict the class with the highest resulting probability.

Gaussian Naive Bayes uses the normal distribution to model the likelihood of different feature values for each class. It then combines these likelihoods to make a prediction.
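
The four steps above can be sketched in plain Python. This is only a minimal illustration, not the article's or scikit-learn's implementation: the function names (`fit_gaussian_nb`, `gaussian_pdf`, `predict_gaussian_nb`), the toy data, and the fixed variance floor of `1e-9` are all assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Steps 1-2: estimate class priors, per-feature means and variances."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Population variance (divide by n); the 1e-9 floor is an assumed
        # safeguard against zero variance, chosen for this sketch only.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(columns, means)
        ]
        model[label] = {"prior": n / len(y), "means": means, "variances": variances}
    return model

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict_gaussian_nb(model, x_new):
    """Steps 3-4: prior times product of PDFs, then pick the largest score."""
    scores = {}
    for label, params in model.items():
        score = params["prior"]
        for x, m, v in zip(x_new, params["means"], params["variances"]):
            score *= gaussian_pdf(x, m, v)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical two-feature toy data, not the golf dataset
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y_toy = ["No", "No", "No", "Yes", "Yes", "Yes"]

model = fit_gaussian_nb(X_toy, y_toy)
label, scores = predict_gaussian_nb(model, [2.9, 0.6])
print(label, scores)
```

The new point sits close to the "Yes" cluster, so its PDF values under the "Yes" Gaussians are far larger than under the "No" Gaussians, and "Yes" gets the higher score.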

Transforming non-Gaussian distributed data

Recall that this algorithm naively assumes that all input features follow a Gaussian (normal) distribution.

Since we're not really sure about the distribution of our data, especially for features that clearly don't follow a Gaussian distribution, applying a power transformation (like Box-Cox) before using Gaussian Naive Bayes can be helpful. This approach can make the data more Gaussian-like, which aligns better with the assumptions of the algorithm.

All columns are scaled using a power transformation and then standardized. Note that the code below uses PowerTransformer's default method, Yeo-Johnson, rather than Box-Cox: Box-Cox requires strictly positive values, and 'Rainfall' contains zeros.
from sklearn.preprocessing import PowerTransformer

# Initialize and fit the PowerTransformer (default method: Yeo-Johnson)
pt = PowerTransformer(standardize=True) # Standard scaling already included
X_train_transformed = pt.fit_transform(X_train)
X_test_transformed = pt.transform(X_test)

Now we're ready for the training.

1. Class Probability Calculation: For each class, calculate its probability: (Number of instances in this class) / (Total number of instances)

from fractions import Fraction

def calc_target_prob(attr):
    total_counts = attr.value_counts().sum()
    prob_series = attr.value_counts().apply(lambda x: Fraction(x, total_counts).limit_denominator())
    return prob_series

print(calc_target_prob(y_train))

2. Feature Probability Calculation: For each feature and each class, calculate the mean (μ) and standard deviation (σ) of the feature values within that class using the training data. Then, calculate the probability using the Gaussian Probability Density Function (PDF) formula.

For each weather condition, determine the mean and standard deviation for both "YES" and "NO" instances. Then calculate their PDF using the PDF formula for the normal/Gaussian distribution.
The same process is applied to all the other features.
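
For reference, the Gaussian PDF referred to here can be written as follows, where μ and σ are the mean and standard deviation of one feature within one class and x is the feature value being scored:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

The leading factor $1/(\sigma\sqrt{2\pi})$ and the denominator $2\sigma^2$ are the two constants computed per feature and class in the code for this step.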
def calculate_class_probabilities(X_train_transformed, y_train, feature_names):
    classes = y_train.unique()
    equations = pd.DataFrame(index=classes, columns=feature_names)

    for cls in classes:
        # Rows of the transformed training data that belong to this class
        X_class = X_train_transformed[y_train == cls]
        mean = X_class.mean(axis=0)
        std = X_class.std(axis=0)
        k1 = 1 / (std * np.sqrt(2 * np.pi))  # PDF leading coefficient
        k2 = 2 * (std ** 2)                  # PDF exponent denominator

        for i, column in enumerate(feature_names):
            equation = f"{k1[i]:.3f}·exp(-(x-({mean[i]:.2f}))²/{k2[i]:.3f})"
            equations.loc[cls, column] = equation

    return equations

# Use the function with the transformed training data
equation_table = calculate_class_probabilities(X_train_transformed, y_train, X.columns)

# Display the equation table
print(equation_table)

3. Smoothing: Gaussian Naive Bayes uses a unique smoothing approach. Unlike Laplace smoothing in other variants, it adds a tiny value (0.000000001 times the largest variance) to all variances. This prevents numerical instability from division by zero or very small numbers.
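
As a rough illustration of this smoothing step, the sketch below computes the tiny constant from the largest per-feature variance of the whole training set and adds it to a class's variances. The helper names (`smoothing_epsilon`, `smooth`) and the toy numbers are hypothetical; this mirrors the idea rather than reproducing scikit-learn's code.

```python
from statistics import pvariance

def smoothing_epsilon(X, var_smoothing=1e-9):
    """var_smoothing times the largest per-feature variance across all rows of X."""
    return var_smoothing * max(pvariance(col) for col in zip(*X))

def smooth(class_variances, epsilon):
    """Add the same tiny constant to every per-class feature variance."""
    return [v + epsilon for v in class_variances]

# Hypothetical data: the first feature is constant, so its variance is exactly 0
X_toy = [[1.0, 10.0], [1.0, 20.0], [1.0, 30.0]]

eps = smoothing_epsilon(X_toy)
smoothed = smooth([0.0, 66.7], eps)
print(eps, smoothed)
```

Without the added constant, the zero variance of the first feature would cause a division by zero in the PDF; after smoothing it is small but strictly positive.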

Given a new instance with continuous features:

1. Probability Collection:
For each possible class:
· Start with the probability of this class occurring (class probability).
· For each feature in the new instance, calculate the probability density function of that feature within the class.

For ID 14, we calculate the PDF of each feature for both "YES" and "NO" instances.

2. Score Calculation & Prediction:
For each class:
· Multiply the class probability and all the collected PDF values together.
· The result is the score for this class.
· The class with the highest score is the prediction.
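
Multiplying many small PDF values can underflow when there are many features, so implementations commonly add logarithms instead of multiplying. The sketch below shows that variant, plus a normalization that turns the scores into probabilities summing to 1. The function names and the prior/PDF numbers are made up for illustration and are not taken from the golf dataset.

```python
import math

def log_score(prior, pdf_values):
    """log(prior) + sum of log(PDF values): the log of the product score."""
    return math.log(prior) + sum(math.log(p) for p in pdf_values)

def normalize(log_scores):
    """Convert log scores into posterior probabilities that sum to 1."""
    top = max(log_scores.values())  # subtract the max for numerical stability
    shifted = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}

# Hypothetical class priors and per-feature PDF values
log_scores = {
    "Yes": log_score(0.6, [0.30, 0.25, 0.40, 0.35]),
    "No": log_score(0.4, [0.10, 0.45, 0.20, 0.15]),
}

posteriors = normalize(log_scores)
prediction = max(posteriors, key=posteriors.get)
print(prediction, posteriors)
```

Because the logarithm is monotonic, the class with the highest log score is the same class that would win with the plain product.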

from scipy.stats import norm

def calculate_class_probability_products(X_train_transformed, y_train, X_new, feature_names, target_name):
    classes = y_train.unique()

    # Create column names using actual feature names
    column_names = [target_name] + list(feature_names) + ['Product']

    probability_products = pd.DataFrame(index=classes, columns=column_names)

    for cls in classes:
        X_class = X_train_transformed[y_train == cls]
        mean = X_class.mean(axis=0)
        std = X_class.std(axis=0)

        # Class (prior) probability: fraction of training rows in this class
        prior_prob = np.mean(y_train == cls)
        probability_products.loc[cls, target_name] = prior_prob

        feature_probs = []
        for i, feature in enumerate(feature_names):
            prob = norm.pdf(X_new[0, i], mean[i], std[i])
            probability_products.loc[cls, feature] = prob
            feature_probs.append(prob)

        # Score: prior times the product of all feature PDF values
        product = prior_prob * np.prod(feature_probs)
        probability_products.loc[cls, 'Product'] = product

    return probability_products

# Assuming X_new is your new sample reshaped to (1, n_features)
X_new = np.array([-1.28, 1.115, 0.84, 0.68]).reshape(1, -1)

# Calculate probability products
prob_products = calculate_class_probability_products(X_train_transformed, y_train, X_new, X.columns, y.name)

# Display the probability product table
print(prob_products)

For this particular dataset, this accuracy is considered quite good.
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Initialize and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train_transformed, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test_transformed)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy:.4f}")

GaussianNB is known for its simplicity and effectiveness. The main things to remember about its parameters are:

  1. priors: This is the most notable parameter, similar to Bernoulli Naive Bayes. In most cases, you don't need to set it manually. By default, it's calculated from your training data, which often works well.
  2. var_smoothing: This is a stability parameter that you rarely need to adjust (the default is 0.000000001).

The key takeaway is that this algorithm is designed to work well out-of-the-box. In most situations, you can use it without worrying about parameter tuning.
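
If you do want to override the defaults, a small sketch like the one below shows both parameters in use. The toy arrays and the choice of equal priors are assumptions for illustration only; `class_prior_` is the fitted attribute that reports which priors the model ended up using.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: three samples of class 0, two of class 1
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7]])
y_toy = np.array([0, 0, 0, 1, 1])

# Default: priors are estimated from the class frequencies in the training data
default_model = GaussianNB().fit(X_toy, y_toy)

# Override: force equal priors and state var_smoothing explicitly
custom_model = GaussianNB(priors=[0.5, 0.5], var_smoothing=1e-9).fit(X_toy, y_toy)

print(default_model.class_prior_)
print(custom_model.class_prior_)
```

Setting `priors` is mostly useful when the class balance in the training sample is known to differ from the balance expected at prediction time.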

Pros:

  1. Simplicity: Remains easy to implement and understand.
  2. Efficiency: Remains swift in training and prediction, making it suitable for large-scale applications with continuous features.
  3. Flexibility with Data: Handles both small and large datasets well, adapting to the scale of the problem at hand.
  4. Continuous Feature Handling: Thrives with continuous, real-valued features, making it ideal for classification tasks where features vary on a continuum.

Cons:

  1. Independence Assumption: Still assumes that features are conditionally independent given the class, which might not hold in all real-world scenarios.
  2. Gaussian Distribution Assumption: Works best when feature values truly follow a normal distribution. Non-normal distributions may lead to suboptimal performance (but this can be mitigated with the power transformation we've discussed).
  3. Sensitivity to Outliers: Can be significantly affected by outliers in the training data, as they skew the mean and variance calculations.
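
To make the outlier point concrete, here is a tiny sketch with made-up wind speed values (not from the golf dataset) showing how a single extreme reading shifts the estimated mean and inflates the variance that the Gaussian for that feature would use.

```python
from statistics import mean, pvariance

# Hypothetical wind speed readings for one class
wind = [2.0, 2.5, 3.0, 3.5, 4.0]
wind_with_outlier = wind + [30.0]

clean_mean, clean_var = mean(wind), pvariance(wind)
outlier_mean, outlier_var = mean(wind_with_outlier), pvariance(wind_with_outlier)

print(clean_mean, clean_var)
print(outlier_mean, outlier_var)
```

With the outlier included, the fitted bell curve is centered far from where most readings lie and is much wider, so typical values receive lower density than they should.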

Gaussian Naive Bayes stands as an efficient classifier for a wide range of applications involving continuous data. Its ability to handle real-valued features extends Naive Bayes beyond binary-feature problems, making it a go-to choice for numerous applications.

While it makes some assumptions about the data (feature independence and normal distribution), when these conditions are met it gives robust performance, making it a favorite among both beginners and seasoned data scientists for its balance of simplicity and power.

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the dataset
dataset_dict = {
'Rainfall': [0.0, 2.0, 7.0, 18.0, 3.0, 3.0, 0.0, 1.0, 0.0, 25.0, 0.0, 18.0, 9.0, 5.0, 0.0, 1.0, 7.0, 0.0, 0.0, 7.0, 5.0, 3.0, 0.0, 2.0, 0.0, 8.0, 4.0, 4.0],
'Temperature': [29.4, 26.7, 28.3, 21.1, 20.0, 18.3, 17.8, 22.2, 20.6, 23.9, 23.9, 22.2, 27.2, 21.7, 27.2, 23.3, 24.4, 25.6, 27.8, 19.4, 29.4, 22.8, 31.1, 25.0, 26.1, 26.7, 18.9, 28.9],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'WindSpeed': [2.1, 21.2, 1.5, 3.3, 2.0, 17.4, 14.9, 6.9, 2.7, 1.6, 30.3, 10.9, 3.0, 7.5, 10.3, 3.0, 3.9, 21.9, 2.6, 17.3, 9.6, 1.9, 16.0, 4.6, 3.2, 8.3, 3.2, 2.2],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

df = pd.DataFrame(dataset_dict)

# Prepare data for model
X, y = df.drop('Play', axis=1), (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

# Apply PowerTransformer
pt = PowerTransformer(standardize=True)
X_train_transformed = pt.fit_transform(X_train)
X_test_transformed = pt.transform(X_test)

# Train the model
nb_clf = GaussianNB()
nb_clf.fit(X_train_transformed, y_train)

# Make predictions
y_pred = nb_clf.predict(X_test_transformed)

# Check accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
