Some snippets I liked….
QT:{{”
To implement a function to check whether a dataframe satisfies k-Anonymity, we loop over the rows; for each row, we query the dataframe to see how many rows match its values for the quasi-identifiers. If the number of rows in any group is less than k, the dataframe does not satisfy k-Anonymity for that value of k, and we return False. Note that in this simple definition, we consider all columns to contain quasi-identifiers; to limit our check to a subset of all columns, we would need to replace the df.columns expression with something else.
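The loop described above might be sketched as follows; the function name, its `quasi_identifiers` parameter, and the toy dataframe are illustrative, not from the text:

```python
import pandas as pd

def is_k_anonymous(df, k, quasi_identifiers=None):
    """Check whether df satisfies k-Anonymity for the given quasi-identifiers.

    If quasi_identifiers is None, every column is treated as a quasi-identifier
    (mirroring the df.columns default mentioned in the text).
    """
    if quasi_identifiers is None:
        quasi_identifiers = df.columns
    for _, row in df.iterrows():
        # Count how many rows match this row's values on all quasi-identifiers
        matches = (df[quasi_identifiers] == row[quasi_identifiers]).all(axis=1)
        if matches.sum() < k:
            return False
    return True

# Toy example: the age=41 row is unique, so 2-Anonymity fails
df = pd.DataFrame({'age': [30, 30, 41], 'zip': [12345, 12345, 67890]})
print(is_k_anonymous(df, 1))  # True
print(is_k_anonymous(df, 2))  # False
```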
…
A function which satisfies differential privacy is often called a mechanism. We say that a mechanism F satisfies ε-differential privacy if for all neighboring datasets x and x′, and all possible sets of outputs S (where S refers to "sets of outputs of the mechanism"):

Pr[F(x) ∈ S] ≤ e^ε · Pr[F(x′) ∈ S]
The parameter ε in the definition is called the privacy parameter or the privacy budget. ε provides a knob to tune the "amount of privacy" the definition provides. Small values of ε require F to provide very similar outputs when given similar inputs, and therefore provide higher levels of privacy; large values of ε allow less similarity in the outputs, and therefore provide less privacy.
Note that F is typically a randomized function, which has many possible outputs under the same input. Therefore, the probability distribution describing its outputs is not just a point distribution.
In the definition of ε-differential privacy, the probability is taken over the randomness of the algorithm itself—that is, over the internal randomness used by the privacy mechanism to produce an output.
The important implication of this definition is that F's output will be pretty much the same, with or without the data of any specific individual. In other words, the randomness built into F should be "enough" so that an observed output from F will not reveal which of x or x′ was the input. Imagine that my data is present in x but not in x′. If an adversary can't determine which of x or x′ was the input to F, then the adversary can't tell whether or not my data was present in the input – let alone the contents of that data.
…
According to the Laplace mechanism, for a function f(x) which returns a number, the following definition of F satisfies ε-differential privacy:

F(x) = f(x) + Lap(s/ε)

where s is the sensitivity of f, and Lap(s/ε) denotes sampling from the Laplace distribution with center 0 and scale s/ε.
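The definition translates almost directly into code; `laplace_mech` is an illustrative name for this sketch:

```python
import numpy as np

def laplace_mech(true_answer, sensitivity, epsilon):
    """Laplace mechanism: add noise drawn from Lap(sensitivity / epsilon)
    to the true answer of a numeric query."""
    return true_answer + np.random.laplace(loc=0, scale=sensitivity / epsilon)

# Noisy release of a count of 100 with sensitivity 1 and epsilon 0.1
print(laplace_mech(100, sensitivity=1, epsilon=0.1))
```

Because the noise is centered at 0, the noisy answers are unbiased: averaged over many (hypothetical) runs they concentrate around the true answer, while any single release hides individual contributions.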
The sensitivity of a function f is the amount f's output changes when its input changes in a minimal way. Intuitively, for a simple function with one numeric input, we think of the scenario where the input increases or decreases (changes) by 1.
However, more generally, we define it in terms of adjacent dataset inputs.
Two datasets are said to be adjacent if they differ in the data of exactly one individual. This could mean adding or removing a single row (in the add-remove model) or changing a single row (in the substitution model). This notion of adjacency defines the smallest possible difference between datasets, and it forms the basis for reasoning about privacy guarantees.
The global sensitivity of a function is then generally defined as the maximum amount its output can change between any pair of adjacent input datasets.
Sensitivity is a complex topic, and an integral part of designing differentially private algorithms; we will have much more to say about it later. For now, we will just point out that counting queries always have a sensitivity of 1: if a query counts the number of rows in the dataset with a particular property, and then we modify exactly one row of the dataset, then the query’s output can change by at most 1.
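The sensitivity-1 claim for counting queries can be checked empirically on a toy dataframe (the dataframe and query here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [30, 45, 52, 29]})

def count_over_40(d):
    # A counting query: number of rows with a particular property
    return int((d['age'] >= 40).sum())

full = count_over_40(df)
# Removing any single row (an adjacent dataset in the add-remove model)
# changes the count by at most 1
for i in range(len(df)):
    neighbor = df.drop(index=i)
    assert abs(count_over_40(neighbor) - full) <= 1
```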
Thus we can achieve differential privacy for our example query by using the Laplace mechanism with sensitivity 1 and an ε of our choosing. For now, let's pick ε = 0.1. We can sample from the Laplace distribution using NumPy's random.laplace.
import numpy as np

sensitivity = 1
epsilon = 0.1
# Count rows with Age >= 40, then add Laplace noise with scale sensitivity/epsilon
adult[adult['Age'] >= 40].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)
…
Sequential Composition
The first major property of differential privacy is sequential composition [11, 12], which bounds the total privacy cost of releasing multiple results of differentially private mechanisms on the same input data. Formally, the sequential composition theorem for differential privacy says that:
…
Sequential composition is a vital property of differential privacy because it enables the design of algorithms that consult the data more than once. Sequential composition is also important when multiple separate analyses are performed on a single dataset, since it allows individuals to bound the total privacy cost they incur by
participating in all of these analyses. The bound on privacy cost given by sequential composition is an upper bound – the actual privacy cost of two particular differentially private releases may be smaller than this, but never larger.
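A minimal sketch of how the composition bound is used in practice, assuming a simple Laplace-noised count (the helper name and numbers are illustrative):

```python
import numpy as np

def noisy_count(true_count, epsilon, sensitivity=1):
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    return true_count + np.random.laplace(loc=0, scale=sensitivity / epsilon)

# Two separate analyses consult the same data
epsilon1, epsilon2 = 0.1, 0.2
true_count = 1000
release1 = noisy_count(true_count, epsilon1)  # satisfies epsilon1-DP
release2 = noisy_count(true_count, epsilon2)  # satisfies epsilon2-DP

# By sequential composition, publishing BOTH releases satisfies
# (epsilon1 + epsilon2)-differential privacy -- an upper bound on total cost
total_budget = epsilon1 + epsilon2
print(total_budget)
```

The actual privacy loss of the pair of releases may be smaller than `total_budget`, but sequential composition guarantees it is never larger.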
“}}