- Sen Yuan ([email protected])
- John Lazarsfeld (jlazarsfeld.github.io)
One approach to calibrating predictions from differentially private aggregate data is to utilize a large number of breakdown buckets. This approach has been explored in the context of charting out the future of advertising in a post-cookie world: Criteo Competition.
The benefit of utilizing noisy breakdowns is that a private measurement standard designed to support ad measurement use-cases will already be capable of producing such noisy breakdown buckets. The main disadvantages of bucketization are:
- High cardinality, large amount of LR training features and large number of model parameters.
- Bucketization coarsens the original granular features.
We propose a new LR training algorithm (WALR) that exhibits the following properties:
- Weighted aggregation: the WALR model assumes a private measurement system can achieve global weighted aggregation over all attribution outcomes (conversion or no conversion). The weights are to be supplied by the API consumer.
- Global DP: Gaussian noise will be injected to weighted aggregates before the private measurement system releases the outputs to the API consumer.
- One time aggregation: Per training cadence, the aforementioned computation and addition of gaussian noise only needs to be performed one time.
- Consume granular features: Granular features can be perfectly utilized without any kind of discretization.
The design and key privacy-related properties of the WALR model are based on the derivation for non-private, gradient-based LR model training. We re-derive these details and introduce notation before discussing privacy aspects and implementation details of the model.
The following gives an overview of the standard (non-private) gradient derivation for logistic regression models, which is the foundation of WALR:
- Our training dataset is comprised of N, k-dimensional (normalized) feature vectors
$X^{(1)}, X^{(2)}, ... , X^{(N)}$ ,
and N binary labels
$y^{(1)}, y^{(2)}, ... , y^{(N)} \in \{0, 1\}$ . - We want to train a logistic regression model
$\theta \in R^k$ . - We consider the cost function that is the average loss across all examples
$L(\theta, \{ X^{(i)} \}, \{ y^{(i)} \} )=\frac{1}{N} \cdot \sum\limits_{i=1}^{n}l_{i}(\theta,X^{(i)},y^{(i)})$
where each$l_i$ is the cross entropy loss
$l_i(\theta, X^{(i)}, y^{(i)}) = -[y^{(i)} \log(p_i) + (1-y^{(i)})\log(1-p_i)]$ .
Here, we let$p_i = \sigma(\theta^TX^{(i)})$ , where$\sigma(\cdot)$ denotes the sigmoid function. - The gradient of
$L$ with regard to$\theta$ is then given by
$\nabla L(\theta)=(\frac{1}{N} \cdot \sum\limits_{i=1}^{N} \sigma(\theta^TX^{(i)}) X^{(i)}) - (\frac{1}{N} \cdot \sum\limits_{i=1}^{N} y^{(i)} X^{(i)} )$ . - In the absence of any computational or privacy constraints, the model can be trained via full-batch gradient descent of the form, where here
$\text{lr}$ is the learning rate:-
initialize model vector
$\theta$ -
while not converged:
$\text{set } \theta := \theta - \text{lr} \cdot ((\frac{1}{N} \cdot \sum\limits_{i=1}^{N} \sigma(\theta^T X^{(i)}) X^{(i)} ) - \frac{1}{N} \cdot \sum\limits_{i=1}^{N} y^{(i)} X^{(i)} ))$ -
output
$\theta$
-
The WALR model assumes the presence of a private measurement system. This private measurement system has access to conversion labels (to perform aggregation and add DP noise), but the API consumer does not.
To this end, the key observation in the calculation of
This means that iterative optimization algorithms like gradient descent only require computing dot-product once.
In the context of model training and label privacy, this motivates viewing
where
In this perspective, gradient descent-based training algorithms that (re)-compute
-
Question: why is this quantity referred to as "dot-product"?
Answer: Note that the expression
$\frac{1}{N} \sum\limits_{i=1}^{N} y^{(i)} X^{(i)}$
is a linear combination of k-dimensional vectors, where k is the number of features. If we let
$X = (X^{(1)}, ..., X^{(N)}) \in [0, 1]^{N*k}$ and$y = (y^{(1)}, ..., y^{(N)})^T \in \{0, 1\}^N$
denote the (N * k)-dimensional feature matrix and N-dimensional label vector respectively, then we have
which is a matrix-vector multiply. Thus every i'th coordinate of this sum is the dot product between the i'th row vector of
Thus the dot-product vector can be interpreted as
Note: the exact naming convention may be subject to change.
We can also ensure that the final model vector
Our privacy model assumes the private measurement system has access to training labels.
Our definition of
In the presence of a private measurement system that computes the (non-private) dot-product vector, label-DP can be achieved via a single-occurrence output perturbation of the form:
where
In total, the WALR model will consider the following noisy gradient approximation
which is both label-blind (i.e., no direct label access required) and label-DP.
We emphasize that the noisy-dot-product only needs to be computed once. When training the WALR model, we will assume that this vector is computed by the private measurement system and supplied as a parameter to the training algorithm.
-
Question: what is the variance of each normal RV in the
$\text{gaussian-noise}$ vector?
Answer: Using the classic Gaussian Mechanism for differential privacy (see Dwork+11 textbook), we can ensure$\text{noisy-dot-product}$ is a label-DP approximation of$\text{dot-product}$ by setting each coordinate of$\text{gaussian-noise}$ to have variance calibrated to the l2-sensitivity$s$ of$\text{dot-product}$ .
To this end, under the assumption that all feature vectors$X^{(i)}$ have coordinates in the range$[0, 1]$ , we have the following upper bound on$s^2$ :
$s^2 = \frac{1}{N^2}\max\limits_{i \in [N]} \lVert X^{(i)} \rVert^2 \leq \frac{k}{N^2}$ .
This comes from the fact that when a single binary label vector$y^{(i)}$ flips its value, then the difference in the dot-product sum is simply the feature vector$X^{(i)}$ corresponding to that sample. After factoring out the$\frac{1}{N}$ multiplicative term, we have that the L2 (squared) sensitivity is equal to the maximum (over all samples) squared 2-norm of any feature vector. Because all features are normalized in the [0,1] range, the upper bound follows.
Then each coordinate of the vector gaussian-noise can be set as an iid, zero-mean normal random variable with variance
$\sigma^2 = s^2 \cdot \frac{2\log(1.25/\delta)}{\epsilon^2}$
where$(\epsilon, \delta)$ are the privacy parameters.
It follows by the privacy guarantee of the Gaussian Mechanism that$\text{noisy-dot-product}$ is$(\epsilon, \delta)\text{-label-DP}$ . -
Question: if
$\text{noisy-} \nabla L(\theta)$ is used in place of$\nabla L(\theta)$ within a gradient descent update rule, how many times do we need to add noise? Is the final output model vector$\theta$ private?
Answer: The$\text{gaussian-noise}$ vector need only be generated and computed once in order to produce the label-DP$\text{noisy-dot-product}$ vector.
Any additional computations used to update$\text{noisy-} \nabla L(\theta)$ (as in a gradient descent procedure) will still be label-DP (with the same privacy parameters) due to the Post Processing Theorem of DP (see Dwork+11 text). -
Question: is
$\text{noisy-dot-product}$ efficiently computable?
Answer: Yes, computing this vector requires just one pass through the set of feature vectors, and$k$ random draws from a Gaussian distribution. Assuming the private measurement system is MPC based, the division byN
need not be computed within MPC, and can instead be performed by the API consumer.
To train the WALR model, our implementation will apply a gradient descent procedure that uses the vector
Recall that in the absence of practical computational constraints, this label-DP training algorithm could proceed using the following "full-batch" gradient descent approach:
- initialize model vector
$\theta$ - while not converged:
$\text{set } \theta := \theta - lr \cdot ((\frac{1}{N} \cdot \sum\limits_{i=1}^{N} p_i X^{(i)} ) - \text{noisy-dot-product})$ - output
$\theta$
As mentioned, we assume the vector while
loop requires computing the term
To circumvent this computational issue, we use a simple "hybrid"-minibatch gradient computation that we describe here.
First, we define the terms
which means that we can express
To avoid computing
where the sum is over every index
Then the corresponding "hybrid" and "noisy-hybrid"-minibatch gradient steps become:
-
Question: why is this referred to as a "hybrid" minibatch?
Answer: The term "hybrid" is used to emphasize the fact that the left-hand term is an average over a minibatch of size$m \leq N$ , while the right-hand term is an average over the entire set of$N$ samples.
This is in contrast to a full-batch gradient (where both terms would average over all$N$ samples) and a traditional minibatch gradient (where both terms would average over the same minibatch of size$m$ ).
Intuitively, the full-batch, private, one-time computation of the right hand term (i.e., the dot-product or noisy-dot-product) should lead to better approximations of the true gradients at each optimization step, and the minibatch computation of the left hand term should lead to computational gains.
The final noisy-hybrid-minibatch gradient descent procedure for WALR model training will then proceed by:
- initialize model vector
$\theta$ - while not converged:
sample minibatch of size$m$ , and
$\text{set } \theta := \theta - lr \cdot ( \text{noisy-hybrid-} \nabla L(\theta))$ - output
$\theta$
Many thanks to original authors Sen Yuan & John Lazarsfeld, who developed the WALR algorithm.
Thanks also to reviewers Prasad Buddhavarapu and Huanyu Zhang who helped review this algorithm and writeup.
We would also like to recognize Kamalika Chaudhuri, Claire Monteleoni and Anand D. Sarwate, the authors of “Differentially Private Empirical Risk Minimization”. The WALR algorithm can be viewed as a label DP version of the objective perturbation in DP machine learning, as described in this paper and the follow-up works on this topic.
Finally, we would like to thank Criteo for hosting the “Criteo Privacy Preserving ML Competition @ AdKDD”, and for their continued publications on their research into privacy-preserving machine learning. These contributions have continued to help move the industry forward in this emerging space.