A semi-supervised model uses the unlabeled data to extract latent features and pairs these with the labels to learn an associated classifier.

A hidden (latent) feature discriminative model (Model 1):

The model provides an embedding, or feature representation, of the data of all taxpayers. These features are then used to train a separate classifier. The information acquired allows related observations to be clustered together in the hidden (latent) space.

A deep generative model of both audited and unaudited taxpayers' data provides a more robust set of hidden (latent) features. The generative model used is:

p(z) = N(z|0, I); p_θ(x|z) = f(x; z, θ), (1)

where f(x; z, θ) is a Gaussian distribution whose probabilities are formed by a non-linear function (a deep neural network), with parameters θ, of a set of hidden (latent) variables z.

Approximate samples from the posterior distribution (the probability distribution that represents the updated beliefs about the parameters after the model has seen the data) over the hidden (latent) variables p(z|x) are used as features to train a classifier, such as a Support Vector Machine (SVM), that predicts whether a material audit yield will result if a taxpayer is audited (y). This approach enables the classification to be performed in a lower-dimensional space, since we typically use hidden (latent) variables whose dimensionality is much less than that of the observations.
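As an illustrative sketch of this pipeline (every function here is hypothetical: the encoder is a fixed toy map rather than a trained network, and a simple nearest-centroid rule stands in for the SVM), posterior samples of z serve as the feature vectors on which the classifier is trained:

```python
import math
import random

random.seed(0)

def encode(x):
    """Hypothetical stand-in for a trained encoder: maps a taxpayer
    record x to the (mean, std) of the Gaussian posterior q(z|x).
    A fixed linear map is used here purely for illustration."""
    mu = [0.5 * x[0] - 0.2 * x[1], 0.3 * x[1]]
    sigma = [0.1, 0.1]
    return mu, sigma

def latent_features(x):
    """Approximate posterior sample z ~ q(z|x), used as a feature vector."""
    mu, sigma = encode(x)
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def train_centroids(records, labels):
    """Nearest-centroid classifier in latent space (a stand-in for the SVM)."""
    sums, counts = {}, {}
    for x, y in zip(records, labels):
        z = latent_features(x)
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, v in enumerate(z):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose latent centroid is nearest."""
    z = latent_features(x)
    return min(centroids, key=lambda y: math.dist(z, centroids[y]))
```

Because the classifier operates on z rather than x, its input dimensionality is that of the latent space, not of the raw taxpayer record.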

These low-dimensional embeddings should now also be more easily separable, since we make use of independent hidden (latent) Gaussian posteriors whose parameters are formed by a sequence of non-linear transformations of the data.

Generative semi-supervised model (Model 2):

A probabilistic model describes the data as being generated by a hidden (latent) class variable y in addition to a continuous hidden (latent) variable z. The model used is:

p(y) = Cat(y|π); p(z) = N(z|0, I); p_θ(x|y, z) = f(x; y, z, θ), (2)

Chapter 2. Approach

where Cat(y|π) is the multinomial distribution, the class labels y are treated as hidden (latent) variables if no class label is available, and z are additional hidden (latent) variables. These hidden (latent) variables are marginally independent.

As in Model 1, f(x; y, z, θ) is a Gaussian distribution, parameterized by a non-linear function (a deep neural network) of the hidden (latent) variables.
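The generative process of Model 2 can be sketched by ancestral sampling: draw the class y, draw the continuous latent z, then form the observation from both. The class prior PI and the decoder below are illustrative toy choices, not the trained model:

```python
import math
import random

random.seed(1)

# Illustrative class prior Cat(y|pi) over, e.g.,
# "no material yield" (0) and "material yield" (1).
PI = [0.7, 0.3]

def sample_y():
    """Draw y from the multinomial (categorical) prior Cat(y|pi)."""
    u, acc = random.random(), 0.0
    for y, p in enumerate(PI):
        acc += p
        if u < acc:
            return y
    return len(PI) - 1

def sample_z(dim=2):
    """Draw the continuous hidden (latent) variable z from N(0, I)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def decode(y, z):
    """Toy stand-in for the decoder mean f(x; y, z, theta):
    a non-linear map of the class label and the continuous latent."""
    return [math.tanh(z[0] + y), math.tanh(z[1] - y)]

def sample_x():
    """Ancestral sampling: y, then z, then x given (y, z)."""
    y, z = sample_y(), sample_z()
    return y, decode(y, z)
```

Note that y and z are drawn independently, reflecting their marginal independence in the model.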

Since most labels y are unobserved, we integrate over the class of any unlabeled data during the inference process, thus performing classification as inference (deriving logical conclusions from premises known or assumed to be true). The inferred posterior distribution is used to obtain any missing labels.
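A minimal sketch of classification as inference: for an unlabeled x, approximate p(y|x) ∝ p(y) p(x|y) for each class by Monte Carlo, marginalizing the continuous latent z. The unit-variance Gaussian decoder with mean z + y is an illustrative stand-in for the trained network, and a uniform class prior is assumed so that p(y) cancels:

```python
import math
import random

random.seed(5)

def log_lik(x, mu):
    """Log density of x under a unit-variance diagonal Gaussian centred at mu."""
    return sum(-0.5 * (xi - mi) ** 2 - 0.5 * math.log(2 * math.pi)
               for xi, mi in zip(x, mu))

def infer_label(x, classes=(0, 1), n_samples=200):
    """Approximate p(y|x) by marginalizing z with Monte Carlo samples.
    The same z samples are reused for every class (common random numbers)
    to reduce the variance of the comparison."""
    scores = {y: 0.0 for y in classes}
    for _ in range(n_samples):
        z = [random.gauss(0.0, 1.0) for _ in range(len(x))]
        for y in classes:
            # Toy decoder mean: z shifted by the class label.
            scores[y] += math.exp(log_lik(x, [zi + y for zi in z]))
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}
```

The returned dictionary is the inferred class posterior; its argmax supplies the missing label.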

Stacked generative semi-supervised model: The two models can be stacked together: Model 1 first learns a new hidden (latent) representation z1 using its generative model, and the generative semi-supervised Model 2 is then trained using z1 instead of the raw data x.

The outcome is a deep generative model with two layers:

p_θ(x, y, z1, z2) = p(y) p(z2) p_θ(z1|y, z2) p_θ(x|z1),

where the priors p(y) and p(z2) equal those of y and z above, and both p_θ(z1|y, z2) and p_θ(x|z1) are parameterized as deep neural networks.
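Ancestral sampling from the stacked model follows the factorization directly: first y and z2 from their priors, then z1 given (y, z2), then x given z1. The two conditional means below are toy stand-ins for the trained networks parameterizing p_θ(z1|y, z2) and p_θ(x|z1):

```python
import random

random.seed(2)

def sample_stacked(dim=2):
    """Ancestral sample from p(y) p(z2) p(z1|y, z2) p(x|z1).
    Uniform p(y) and simple Gaussian conditionals are illustrative only."""
    y = random.choice([0, 1])                             # p(y)
    z2 = [random.gauss(0.0, 1.0) for _ in range(dim)]     # p(z2) = N(0, I)
    z1 = [random.gauss(v + y, 0.1) for v in z2]           # toy p(z1|y, z2)
    x = [random.gauss(2.0 * v, 0.1) for v in z1]          # toy p(x|z1)
    return y, z2, z1, x
```

Note that x depends on y only through z1, exactly as in the factorization above.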

The computation of the exact posterior distribution is not easily managed because of the nonlinear, non-conjugate dependencies between the random variables. To allow for scalable inference and parameter learning, recent advances in variational inference (Kingma and Welling, 2014; Rezende et al., 2014) are utilized: a fixed-form distribution q_φ(z|x) with parameters φ is introduced that approximates the true posterior distribution p(z|x).

The variational principle is used to derive a lower bound on the marginal likelihood of the model. Maximizing this variational bound simultaneously fits the model and minimizes the difference between the approximate posterior and the true posterior. The approximate posterior distribution q_φ(·) is constructed as an inference or recognition model (Dayan, 2000; Kingma and Welling, 2014; Rezende et al., 2014; Stuhlmuller et al., 2013).
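The lower bound in question is the standard evidence lower bound, obtained by applying Jensen's inequality to the marginal likelihood:

```latex
\log p_\theta(x)
  = \log \int p_\theta(x, z)\, dz
  = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]
  \ge \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right].
```

The gap between the two sides is exactly the KL divergence from q_φ(z|x) to the true posterior p(z|x), which is why maximizing the bound also improves the posterior approximation.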

The use of an inference network introduces a set of global variational parameters φ, allowing for fast inference at both training and testing time, since the cost of inference is amortized over the posterior estimates for all hidden (latent) variables through the parameters of the inference network. An inference network is introduced for all hidden (latent) variables; these networks are parameterized as deep neural networks, and their outputs form the parameters of the distribution q_φ(·).

For the latent-feature discriminative model (Model 1), we use a Gaussian inference network q_φ(z|x) for the hidden (latent) variable z. For the generative semi-supervised model (Model 2), an inference model is introduced for the hidden (latent) variables z and y, which is assumed to have the factorized form q_φ(z, y|x) = q_φ(z|x) q_φ(y|x), specified as Gaussian and multinomial distributions respectively.

Model 1: q_φ(z|x) = N(z|μ_φ(x), diag(σ²_φ(x))), (3)

Model 2: q_φ(z|y, x) = N(z|μ_φ(y, x), diag(σ²_φ(x))); q_φ(y|x) = Cat(y|π_φ(x)), (4)


where σ_φ(x) is a vector of standard deviations, π_φ(x) is a probability vector, and the functions μ_φ(x), σ_φ(x) and π_φ(x) are represented as MLPs (multi-layer perceptrons).
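A small sketch of such an inference network, with toy random weights rather than trained ones: one shared hidden layer feeds three heads, where μ_φ(x) is unconstrained, σ_φ(x) is made positive via a softplus, and π_φ(x) is made a probability vector via a softmax:

```python
import math
import random

random.seed(3)

IN, HID, LATENT, CLASSES = 2, 4, 2, 2

def rand_matrix(rows, cols):
    """Toy weight initialization; a real model would learn these."""
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

W1 = rand_matrix(HID, IN)
W_MU, W_SIG, W_PI = (rand_matrix(LATENT, HID),
                     rand_matrix(LATENT, HID),
                     rand_matrix(CLASSES, HID))

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def inference_net(x):
    """One-hidden-layer MLP emitting the parameters of q_phi(.):
    mu(x) unconstrained, sigma(x) positive (softplus),
    pi(x) a probability vector (softmax)."""
    h = [math.tanh(v) for v in matvec(W1, x)]
    mu = matvec(W_MU, h)
    sigma = [math.log1p(math.exp(v)) for v in matvec(W_SIG, h)]  # softplus
    logits = matvec(W_PI, h)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    pi = [e / sum(exps) for e in exps]
    return mu, sigma, pi
```

The output-head activations enforce the constraints of equations (3)-(4) by construction: standard deviations are always positive and class probabilities always sum to one.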

Generative Semi-supervised Model Objective: When the label corresponding to a data point is observed, the variational bound is:

log p_θ(x, y) ≥ E_{q_φ(z|x,y)}[log p_θ(x|y, z) + log p_θ(y) + log p(z) − log q_φ(z|x, y)] = −L(x, y), (5)
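The expectation in equation (5) can be estimated by Monte Carlo, sampling z from the recognition model and averaging the bracketed terms. In this sketch, `encode` and `decode` are hypothetical stand-ins for the trained inference and generative networks, and the likelihood is taken as a unit-variance Gaussian:

```python
import math
import random

random.seed(4)

def log_normal(v, mu, sigma):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return sum(-0.5 * math.log(2 * math.pi * s * s) - (vi - m) ** 2 / (2 * s * s)
               for vi, m, s in zip(v, mu, sigma))

def bound_estimate(x, y, encode, decode, log_p_y, n_samples=50):
    """Monte Carlo estimate of the right-hand side of Eq. (5):
    E_q[log p(x|y,z) + log p(y) + log p(z) - log q(z|x,y)]."""
    total = 0.0
    for _ in range(n_samples):
        mu, sigma = encode(x, y)
        # Sample z ~ q(z|x, y) via the reparameterization z = mu + sigma * eps.
        z = [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        lik_mu = decode(y, z)
        total += (log_normal(x, lik_mu, [1.0] * len(x))          # log p(x|y,z)
                  + log_p_y                                      # log p(y)
                  + log_normal(z, [0.0] * len(z), [1.0] * len(z))  # log p(z)
                  - log_normal(z, mu, sigma))                    # -log q(z|x,y)
    return total / n_samples
```

Averaging over samples of z gives a stochastic but unbiased estimate of the bound, which is what the gradient-based optimizer below actually works with.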

The objective function is minimized by resorting to AdaGrad, a gradient-descent-based optimization algorithm that automatically tunes the learning rate based on the observed geometry of the data. AdaGrad is designed to perform well on datasets with infrequently occurring features.
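The core of the AdaGrad update can be sketched in a few lines: each parameter accumulates its squared gradients, and its effective step size shrinks as that accumulator grows, so rarely-updated parameters keep comparatively large learning rates:

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update, in place.
    accum[i] holds the running sum of squared gradients for parameter i;
    dividing by its square root gives the per-parameter learning rate."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum
```

For example, minimizing the toy objective f(w) = w² (gradient 2w) from w = 3 drives w steadily toward zero while the step sizes decay automatically.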