Expectation Maximization for MAP estimation

 

“Expectation is the root of a heartache.” – William Shakespeare 

 

Expectation–maximization (EM) is an iterative method for finding the maximum likelihood estimate of a parameter θ of a parametric probability distribution when the model involves unobserved (latent) variables. The algorithm alternates between computing a conditional expectation and solving a maximization problem, hence its name.

The general setup is as follows.

Suppose we have some observed data y with a parametric density p(y|θ), and some complete data x that we wish we had observed, with parametric density p(x|θ).

Given that we only observe y, the goal is to find the maximum likelihood estimate (MLE) of θ, that is:

θ_MLE = arg max_θ p(y|θ)
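
As a toy numeric illustration of this arg max (unrelated to the face data used later), here is a minimal sketch that recovers the MLE of a Poisson rate by brute force over a grid and compares it with the closed-form answer, the sample mean; the observations below are made up:

import numpy as np

# Made-up observed counts y, assumed i.i.d. Poisson(theta)
y = np.array([3, 7, 4, 6, 5])

# Log-likelihood log p(y | theta), dropping the constant -sum(log(y_i!))
def log_likelihood(theta):
    return np.sum(y * np.log(theta) - theta)

# Brute-force arg max over a grid of candidate values of theta
thetas = np.linspace(0.1, 15, 1000)
theta_mle = thetas[np.argmax([log_likelihood(t) for t in thetas])]

print(theta_mle)   # close to 5.0
print(y.mean())    # the closed-form MLE of a Poisson rate is the sample mean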

The algorithm can be described in the following steps:

Step 1. Initialize θ with an initial guess θ(0).

Step 2. Given the observed data y, and pretending for the moment that our current guess θ(m) is correct, we compute the conditional probability distribution p(x|y,θ(m)).

Step 3.  Using the conditional probability distribution p(x|y,θ(m)) calculated in Step 2, we construct the conditional expected log-likelihood, which is called the Q-function, expressed as below:

Q(θ | θ(m)) = E_{X|y,θ(m)} [ log p(X|θ) ]

Step 4. We find the θ that maximizes the Q-function. The result is our new estimate θ(m+1).

Step 5. We go back to Step 2 and iterate until convergence (for example, we can fix an epsilon and stop once the parameters change by less than epsilon between iterations).

So the Expectation Maximization algorithm can be described as two alternating steps, the E (expectation) step and the M (maximization) step. It is most often used for mixture models, but in what follows I will use it for MAP (maximum a posteriori) estimation in a non-negative matrix factorization.
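
To make the two alternating steps concrete before moving to the NMF case, here is a minimal, self-contained sketch of EM for a two-component 1D Gaussian mixture; all numbers, names and the fixed unit variances are choices made for this illustration only, not part of the model used below:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D data from two Gaussians (the algorithm does not know which point came from which)
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

mu = np.array([-1.0, 1.0])    # initial guesses for the two means
pi = np.array([0.5, 0.5])     # initial mixing weights
eps = 1e-6                    # convergence threshold of Step 5

for _ in range(200):
    # E-step: responsibility of each component for each point, p(component k | y_i, current params)
    dens = pi * np.exp(-0.5 * (y[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: parameters that maximize the expected complete-data log-likelihood
    mu_new = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi_new = resp.mean(axis=0)

    # Step 5: stop once the parameters move by less than eps
    converged = np.max(np.abs(mu_new - mu)) < eps
    mu, pi = mu_new, pi_new
    if converged:
        break

print(mu, pi)    # should end up close to (-2, 3) and (0.4, 0.6)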

Implementation

This section is based on (but implemented a bit differently from) this article: https://www.hindawi.com/journals/cin/2009/785152/. Check it out, it is really interesting!

Consider the following probabilistic non-negative matrix factorization (NMF) model, where W and H are the factor matrices whose product approximates V (for f = 1, …, F, n = 1, …, N, k = 1, …, K):

w_{fk} ~ G(w_{fk}; α_w, β_w),   h_{kn} ~ G(h_{kn}; α_h, β_h),   v_{fn} ~ PO(v_{fn}; Σ_k w_{fk} h_{kn})

where G and PO denote the gamma and the Poisson distributions, respectively.
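
To make the generative story concrete, here is a minimal sketch that samples a small V from this model; the sizes and hyperparameters below are made up for the illustration:

import numpy as np

rng = np.random.default_rng(0)
F, K, N = 6, 3, 8                    # made-up sizes for the illustration
alpha_w, beta_w = 1.0, 1.0           # shape / scale of the gamma prior on W
alpha_h, beta_h = 1.0, 1.0           # shape / scale of the gamma prior on H

W = rng.gamma(alpha_w, beta_w, size=(F, K))   # w_fk ~ G(alpha_w, beta_w)
H = rng.gamma(alpha_h, beta_h, size=(K, N))   # h_kn ~ G(alpha_h, beta_h)
V = rng.poisson(W @ H)                        # v_fn ~ PO(sum_k w_fk h_kn)

print(V)   # a small non-negative integer matrix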

Suppose also that we have a latent variable S, with

s_{fkn} ~ PO(s_{fkn}; w_{fk} h_{kn}),   and   v_{fn} = Σ_k s_{fkn},

where PO(s; λ) denotes the Poisson distribution of a random variable s with non-negative rate λ:

PO(s; λ) = exp(−λ) λ^s / s!,   s = 0, 1, 2, …

Summing independent Poisson variables gives a Poisson variable again, so this latent decomposition is consistent with the model for v_{fn} above. Moreover, conditionally on the observation, the latent sources follow a multinomial distribution:

p(s_{f1n}, …, s_{fKn} | v_{fn}, W, H) = M(s_{f1n}, …, s_{fKn}; v_{fn}, p_1, …, p_K),   with   p_k = w_{fk} h_{kn} / Σ_j w_{fj} h_{jn}.

Since S given V is multinomial, we can say that E[s_{fkn} | v_{fn}, W, H] = v_{fn} p_k.
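
As a quick numeric sanity check of this expectation (a sketch reusing the toy W, H and V sampled above), the expected latent sources can be computed with numpy broadcasting, and summing them over k recovers v_{fn} exactly:

# E[s_fkn | v_fn, W, H] = v_fn * w_fk * h_kn / sum_j w_fj h_jn
WH = W @ H                                                    # (F, N) matrix of sums over k
E_S = W[:, :, None] * H[None, :, :] * (V / WH)[:, None, :]    # (F, K, N) array of expectations

# Summing the expected sources over k gives back the observed v_fn
print(np.allclose(E_S.sum(axis=1), V))   # True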

Once we have defined the latent variable S and its conditional distribution given W, H and V, we can proceed with the Expectation Maximization algorithm.

The Expectation Maximization algorithm in this case

E-step

We define the E-Step as follows:

E[s_{fkn} | v_{fn}, W^(t), H^(t)] = v_{fn} · w_{fk}^(t) h_{kn}^(t) / Σ_j w_{fj}^(t) h_{jn}^(t)

M-step

The M-step maximizes the expected complete-data log-posterior with respect to W, which gives the following update (writing [WH]_{fn} = Σ_j w_{fj} h_{jn}):

w_{fk} ← ( α_w − 1 + w_{fk} Σ_n h_{kn} v_{fn} / [WH]_{fn} ) / ( β_w + Σ_n h_{kn} )

Similarly, taking the partial derivative of the objective with respect to H, we obtain by symmetry:

h_{kn} ← ( α_h − 1 + h_{kn} Σ_f w_{fk} v_{fn} / [WH]_{fn} ) / ( β_h + Σ_f w_{fk} )
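
In matrix form, one such MAP update only takes a few lines. This sketch (reusing the toy matrices and hyperparameters sampled above) applies the same multiplicative update that the em_algo function further below repeats at every iteration:

# One MAP update of W and H in matrix form (the same formulas as above), using the current W, H for both updates
WH = W @ H                                                                        # current approximation of V
W_new = (alpha_w - 1 + W * ((V / WH) @ H.T)) / (beta_w + np.ones(V.shape) @ H.T)  # update of W
H_new = (alpha_h - 1 + H * (W.T @ (V / WH))) / (beta_h + W.T @ np.ones(V.shape))  # update of H
W, H = W_new, H_new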

We can illustrate the graphical model as follows:

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

G=nx.DiGraph()
G.add_nodes_from(['S', 'W', 'H','V'])
G.add_edges_from([('W','S'),('H','S'),('S','V')])
nx.draw(G, with_labels=True, font_size=18, node_size=2000, node_color='#CDC1C5')
plt.title("Directed graph model of the MAP estimation with latent variable S", fontsize=20)
plt.show()
(Output: the directed graphical model of the MAP estimation, with arrows W → S, H → S and S → V.)
I will use the AT&T Database of Faces, which contains face images of 40 people, with 10 images per person. In total there are 400 images in the dataset; each image is 92 x 112 pixels, with 256 gray levels per pixel.
# First open the data, save the np.array of pixels as V 
import scipy.io as sio
data = sio.loadmat('attfaces.mat')
V = data["V"]
# Plot all the images in a first, simple way:
# reorganize the data to show each image next to another 
def all_photo_visualisation(input_matrix):
    """
    Input : the input np.array
    
    Returns: Reorganizes the columns of the input np.array to (92*112) np.arrays, 
    concatenate these lines and return a graph showing all images. 
    """
    v_act = np.array(())
    V_images = (input_matrix[:,0].reshape((92,112)))
    v_act = np.array(())

    for i in range(1,input_matrix.shape[1]):
        v_act = input_matrix[:,i].reshape((92,112))
        V_images = np.concatenate((V_images, v_act), axis=0)

    list_i = [i*2 for i in range(1,20)]

    v_act = V_images[range(92*20), :]
    for i in list_i:
        v_act2 = V_images[range(92*i*10, 92*(i+2)*10, ), :]
        v_act = np.concatenate((v_act, v_act2), axis=1)
    
    fig = plt.figure()
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    fig.set_size_inches(26, 30)
    plt.imshow(v_act.T, interpolation='none',cmap='gray')
all_photo_visualisation(V)

(Output: the 400 face images of the dataset, tiled into a 20 x 20 grid.)

EM algorithm implementation

# First I define a class for the model 
class NMF_EM(object):
    """
    V - a matrix of dimension (f, n) that we try to estimate
    W, H- the non-negative matrix factorisation of V, dimensions (f, K), (K, n) respectively
    K - the number of trends 
    alpha_w, alpha_h, beta_w, beta_h the constant terms of the shape and scale of the gamma distributions of W and H respectively 
    """
    def __init__(self, V, W, H, K, alpha_w, alpha_h, beta_w, beta_h):
        self.V = V
        self.W = W
        self.H = H
        self.K = K
        self.alpha_h = alpha_h
        self.alpha_w = alpha_w
        self.beta_w = beta_w
        self.beta_h = beta_h
        self.n_rows, self.n_cols = V.shape
# The EM algo 
def em_algo(Nmf_Class, max_iters):
    """
    Inputs: a NMF_EM class with parameters: 
    V - a matrix of dimension (f, n) that we try to estimate
    W, H- the non-negative matrix factorisation of V, dimensions (f, K), (K, n) respectively
    K - the number of trends 
    
    Returns: iterates over W, H to get better estimates
    """
    # Denominator initialization
    den = (Nmf_Class.W).dot(Nmf_Class.H)
    
    # Initialization of the updates np.arrays
    W_new = (Nmf_Class.W).copy()
    H_new = (Nmf_Class.H).copy()

    # Temporary np.arrays to facilitate the calculations
    temp_den_w = (Nmf_Class.W).copy()
    temp_den_h = (Nmf_Class.H).copy()
    
    for i in range(max_iters):
        W_new = (Nmf_Class.alpha_w-1) * np.ones((Nmf_Class.W).shape) + np.multiply((Nmf_Class.W), ((Nmf_Class.V)/ den).dot((Nmf_Class.H).T))
        temp_den_w = Nmf_Class.beta_w * np.ones((Nmf_Class.W).shape) + np.ones((Nmf_Class.V).shape).dot((Nmf_Class.H).T)
        W_new = W_new/ temp_den_w
        
        H_new = (Nmf_Class.alpha_h - 1) * np.ones((Nmf_Class.H).shape) + np.multiply(Nmf_Class.H, ((Nmf_Class.W).T.dot(Nmf_Class.V/ den)))
        temp_den_h = Nmf_Class.beta_h * np.ones((Nmf_Class.H).shape) + (Nmf_Class.W).T.dot(np.ones((Nmf_Class.V).shape))
        H_new = H_new/ temp_den_h
        
        Nmf_Class.W = W_new
        Nmf_Class.H = H_new 
        
        den = W_new.dot(H_new)
        
    return Nmf_Class.W, Nmf_Class.H
# Now define a function that generates W and H randomly 
def generate_data(K=25, f=92*112, n=400, alpha_w=1, beta_w=1, alpha_h=1, beta_h=1):
    """
    Inputs: 
    K - nb of features 
    alpha_w, alpha_h - shape parameter of the gamma distribution of W and H
    beta_w, beta_h - scale parameter of the gamma distribution of W and H 
    """
    # Set prior parameters 
    Alpha_w = alpha_w * np.ones([f, K]) # shape
    Beta_w = beta_w * np.ones([f, K]) # scale
    Alpha_h = alpha_h * np.ones([K, n]) # shape
    Beta_h = beta_h * np.ones([K, n]) # scale

    # Generate a random W and H matrix
    W = np.random.gamma(Alpha_w, Beta_w)
    H = np.random.gamma(Alpha_h, Beta_h)
    
    return W, H
# We initialize randomly W_0 and H_0, following a Gamma distribution
W_0, H_0 = generate_data(K=25, f=92*112, n=400, alpha_w=1, beta_w=1, alpha_h=1, beta_h=1)
# Initialise a class. 
# We initialize here the value of alpha and beta for W and H that we will use in the em algorithm
First_model = NMF_EM(V, W_0, H_0, K=25, alpha_w=1, alpha_h=1, beta_w=1, beta_h=1)
# We test the model on the initialized class with alpha_w = alpha_h = beta_w = beta_h = 1  
W_new, H_new = em_algo(First_model, 100)
# Reorganize the data of the W matrix to show each feature (we reuse the original all_photo_visualisation function,
# but make the number of photos per row and per column configurable.)
def all_feature_visualisation(input_matrice, nb_col, nb_row):
    """
    Input : np_array (updated W matrix)
    
    Returns: Reorganizes the columns of W into (92,112) np.array, 
    concecanate the np.arrays to plot a 5*5 together 
    """
    v_act = np.array(())
    V_images = (input_matrice[:,0].reshape((92,112)))
    v_act = np.array(())

    for i in range(1,input_matrice.shape[1]):
        v_act = input_matrice[:,i].reshape((92,112))
        V_images = np.concatenate((V_images, v_act), axis=0)

    list_i = [i for i in range(1, nb_row)]

    v_act = V_images[range(92*nb_col), :]
    for i in list_i:
        v_act2 = V_images[range(92*i*nb_col, 92*(i+1)*nb_col, ), :]
        v_act = np.concatenate((v_act, v_act2), axis=1)
    
    fig = plt.figure()
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    fig.set_size_inches(14, 20)
    plt.imshow(v_act.T, interpolation='none',cmap='gray')
all_feature_visualisation(W_new,5,5)
(Output: the 25 learned features, i.e. the columns of W reshaped into 92 x 112 images, shown as a 5 x 5 grid.)
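
As a last sanity check (a sketch, assuming the variables from the cells above are still in memory), we can compare one original face with its reconstruction from the learned factors:

# Compare one original face with its reconstruction from the learned factors
V_hat = W_new.dot(H_new)                 # approximate reconstruction of V

idx = 0                                  # index of the face to inspect (arbitrary choice)
fig, axes = plt.subplots(1, 2, figsize=(6, 4))
axes[0].imshow(V[:, idx].reshape((92, 112)).T, interpolation='none', cmap='gray')
axes[0].set_title("Original")
axes[1].imshow(V_hat[:, idx].reshape((92, 112)).T, interpolation='none', cmap='gray')
axes[1].set_title("Reconstruction W_new.dot(H_new)")
for ax in axes:
    ax.axis('off')
plt.show()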

 
