The aim of this article is to give an introduction to naive Bayesian classification using scikit-learn. Naive Bayesian classification is a simple probabilistic classification method based on Bayes’ theorem with strong (so-called naive) independence assumptions between the features. In this article, we will use it to build a basic text prediction system: we will predict equity codes in a search-form fashion (i.e. the prediction starts as soon as the user starts typing).
Preparing the data
The data used in this article is a list of equity codes coming from the Euronext website (https://www.euronext.com/en/equities/directory). We only kept the codes column from the Euronext CSV file.
The first step is to import the necessary packages and to load the data.
import pandas as pd
import numpy as np
import copy
from sklearn.naive_bayes import GaussianNB
# Load the equity codes into a pandas DataFrame
equities_codes_df = pd.read_csv('/Users/rekahalmai/Documents/euronext_equities_codes.csv')
The next step is to build features for the equity codes. One way to do this is to consider the situation where a user starts typing a code in a search form and gets recommended a code. In this situation, the user gets a different recommendation each time he types a new letter, so the features of a code can be thought of as the ordered sequence of prefixes of the code. For example, since we lower-case the codes, the features of ‘AXA’ are [‘a’, ‘ax’, ‘axa’]. Let’s write two functions that will allow us to build the dictionary codes_features_dict holding a list of features for every code.
# Builds the list of prefix features from a code
def get_features(code):
    features = []
    prefix = ''
    for c in str.lower(code):  # code to lower case --> we won't make a difference between 'AXA' and 'axa'
        prefix += c
        features.append(prefix)
    return features
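A quick check of the function (note that, because we lower-case the input, the features come back in lower case):
get_features("AXA")  # returns ['a', 'ax', 'axa']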
# Builds a dictionary holding the list of features for every code
def create_codes_features_dict(codes_df):
    codes_features = {}
    for index, row in codes_df.iterrows():
        code = str(row.iloc[0])  # the code is in the first column
        # We use the row index provided by the DataFrame as the key of our dictionary
        codes_features[index] = get_features(code)
    return codes_features
codes_features_dict = create_codes_features_dict(equities_codes_df)
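As a quick sanity check, we can print the features of the first few codes (the exact output depends on the rows of your CSV file):
for index in list(codes_features_dict)[:3]:
    print(index, codes_features_dict[index])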
Now, in order to train our naive Bayesian classifier, every training example in our set should have the same number of features, which is not the case with the lists returned by get_features().
Let’s write a function that gives us the complete set of features over all our codes.
def create_set_of_features():
    s = set()
    for features in codes_features_dict.values():
        s = s.union(set(features))
    return s
set_of_features = create_set_of_features()
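The size of this set gives the dimension of the feature vectors we build in the next step:
print(len(set_of_features))  # one dimension per distinct prefix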
Finally, for every code we want to create a vector of features (of the size of the previous set). The vector will hold 0 if the feature is missing from the code and 1 if it is present. We collect the data in the dictionary data_dict.
def create_default_feature_dict():
    default_feature_dict = {}
    for feature in set_of_features:
        default_feature_dict.setdefault(feature, 0)
    return default_feature_dict
def create_data_dict():
    default_feature_dict = create_default_feature_dict()
    data_dict = {}
    # Start every code with an all-zero feature vector
    for index, row in equities_codes_df.iterrows():
        data_dict[index] = copy.deepcopy(default_feature_dict)
    # Set the features present in each code to 1
    for index, features_list in codes_features_dict.items():
        for feature in features_list:
            data_dict[index][feature] = 1
    return data_dict
data_dict = create_data_dict()
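We can check that a code’s vector holds a 1 exactly at its own prefixes; a minimal sanity check, assuming the first row of the DataFrame holds a valid code:
first_code = str(equities_codes_df.iloc[0, 0])
assert all(data_dict[0][f] == 1 for f in get_features(first_code))
assert sum(data_dict[0].values()) == len(get_features(first_code))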
Training
We now have everything ready to start the learning phase. But first, we have to separate our training data (i.e. the feature vectors) and our training labels (i.e. the code indexes) into two separate arrays.
def get_training_data_vectors(data):
    X_list = []
    for v in data.values():
        X_list.append(list(v.values()))
    return np.array(X_list)
X = get_training_data_vectors(data_dict)
Y = np.array(list(data_dict.keys()))
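At this point, X is a matrix with one row per code and one column per feature, and Y holds the corresponding DataFrame indexes:
print(X.shape)  # (number of codes, number of features)
print(Y.shape)  # (number of codes,)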
Now it’s time to train our naive Bayesian classifier. We will use the Gaussian Naive Bayes algorithm for classification (i.e. the likelihood of the features is assumed to be normally distributed).
clf = GaussianNB()
clf.fit(X, Y)
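As a side note, GaussianNB assumes normally distributed features, which is a loose fit for our 0/1 vectors. scikit-learn’s BernoulliNB is designed for binary features and could be a natural alternative; a minimal sketch (not used in the rest of the article):
from sklearn.naive_bayes import BernoulliNB

clf_bernoulli = BernoulliNB()  # models each feature as a Bernoulli (0/1) variable
clf_bernoulli.fit(X, Y)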
Testing
We can now test our classifier. For that purpose, we write two functions: get_features_vector() returns the features vector corresponding to the input code, and get_prediction() returns the predicted code (i.e. the class with the maximum probability).
def get_features_vector(code):
    features = get_features(code)
    features_dict = create_default_feature_dict()
    for f in features:
        if f in features_dict:
            features_dict[f] = 1
    return list(features_dict.values())
def get_prediction(code):
    features_vector = get_features_vector(code)
    code_index = clf.predict([features_vector])
    return equities_codes_df.iloc[code_index]
Testing examples
get_prediction("X")
get_prediction("XP")
get_prediction("A")
get_prediction("AC")
get_prediction("ACC")
get_prediction("M")
get_prediction("ML")
get_prediction("MLA")
get_prediction("MLAC")
get_prediction("")
The last test predicts the code “AB” even though the input was an empty string. We see here that our classifier falls back to this code when none of the input’s features are present in our set of features. Let’s write a function that displays the vector of probabilities for a given input.
def get_prediction_probabilities(code):
    features_vector = get_features_vector(code)
    return clf.predict_proba([features_vector])
print(get_prediction_probabilities(""))
When we input an empty string, our Bayesian classifier tells us that the best code (i.e. the one with the highest probability) is “AB” (with a probability of about 2%).
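Since a search form would typically display several suggestions, we can extend the idea to return the k most probable codes. Here is a minimal sketch, assuming the DataFrame has a default integer index (so that the class labels returned in clf.classes_ can be passed to iloc):
def get_top_k_predictions(code, k=5):
    features_vector = get_features_vector(code)
    probabilities = clf.predict_proba([features_vector])[0]
    # clf.classes_ holds the DataFrame indexes; sort them by decreasing probability
    top_k_indexes = clf.classes_[np.argsort(probabilities)[::-1][:k]]
    return equities_codes_df.iloc[top_k_indexes]

print(get_top_k_predictions("A"))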