Implementation of new TRL vectors #12
So the user gets a choice to select from the vaguer options above, or to give the exact date, in which case a vector from the 2nd table is selected?
Anyway, how to estimate? Just count the keys on the most recent day and divide by a padding factor? But then the (non-published) padding factor becomes the critical value.
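As a minimal sketch of that naive estimate (the padding factor of 5 below is a made-up assumption, since the real value isn't published):

```python
import numpy as np

# keys_per_day[d] = number of diagnosis keys published for day d (newest last);
# the counts here are made-up example values
keys_per_day = np.array([46, 22, 14, 10, 7])

PADDING_FACTOR = 5  # assumption: each upload is padded to this many keys

# naive estimate: keys on the most recent day divided by the padding factor
estimated_uploaders = keys_per_day[-1] / PADDING_FACTOR
print(f"estimated uploaders on the most recent day: {estimated_uploaders:.1f}")
```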
Yes, this is exactly what I read out of the CWA source code.
Maybe someone from the CWA developers (@thomasaugsten @nickguendling) can briefly confirm/explain this behavior.
According to private communications, the generation of fake diagnosis keys will be terminated in the near future.
Here is the pull request, for a better understanding:
@micb25 @thomasaugsten At the moment,
For the current fixed TRL profile, we could just use the sum of the TRL 6 column, because "6" currently is unique in the TRL vector and indicates the "freshest" uploaded key. About the last line:
10/5 = 2 users apparently had GAEN last active on 2020-09-27 and then decided on 2020-10-09 to upload their keys. So far, so good. This nicely illustrates what we are up against in the future:
Approach 2) could find users missed in approach 1) who had GAEN deactivated for at least one day, but only if all other users had their phones active on these days; otherwise the algorithm would attribute those "otherwise missed" keys to the "normal flow" of keys. What do you think? P.S. …
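As a sketch of the TRL-6 counting idea from the comment above (using the day-by-level `keys` count matrix format that appears later in this thread; the two rows are taken from it):

```python
import numpy as np

# rows: days, columns: TRL 0-8
keys = np.array([
    [0, 106, 101,  86, 85, 203, 648,   0, 144],
    [0,  68,  69, 131,  0,  77, 182, 247, 485],
])

# In the current fixed TRL profile, level 6 appears exactly once per vector
# (it marks the freshest uploaded key), so each TRL-6 key equals one upload.
estimated_uploads = keys[:, 6].sum()
print(f"estimated uploads across these days: {estimated_uploads}")
```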
Two quick questions about the package from 2020-10-21, as I'm trying to understand how the TRL vectors are applied (I'm posting here, as I'm unsure whether it is tool-related or server-related):
- Where does the single …
- Why are there only 6 …
It doesn't match the profiles at the top of this issue, correct.
Thanks! That explains it.

Disclaimer: everything following is highly speculative and is just a test to see what is possible without manual optimization.

Basically, what we are trying to do is pattern matching. Feel free to skip to Implementation if you don't care for my clumsy explanations paired with awkward formatting ;)

### Theoretical Explanation

What we are trying to find is the minimal (? more on that later) set of TRL profiles uploaded that matches the daily published distribution of uploaded keys. We have the published key counts $k_{d,\ell}$ for every day $d$ and TRL level $\ell$. Let's introduce a variable $trv_i \geq 0$ for every vector $i$, counting how many people uploaded that vector. What we are trying to minimize is the sum over our variables, $\sum_i trv_i$. What's missing are constraints: the selected vectors must explain all published keys. So we add a constraint per day and level that at least $k_{d,\ell}$ of the chosen vectors put a key with level $\ell$ on day $d$,
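Written out (my reconstruction from the code below, so treat the notation as editorial), the ILP is:

```latex
\begin{aligned}
\min_{trv} \quad & \sum_{i} trv_i \\
\text{s.t.} \quad & \sum_{i \,:\, v_i[d] = \ell} trv_i \;\geq\; k_{d,\ell}
  \quad \text{for every day } d \text{ and level } \ell, \\
& trv_i \in \mathbb{Z}_{\geq 0},
\end{aligned}
```

where $v_i[d]$ is the TRL that vector $i$ assigns to day $d$.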
so for each day we now just need these additional constraints, so that all published keys are covered (skip to Implementation if you want).

### Implementation

So let's implement it in Python using our favourite ILP solver library ;) (read: the first and only one I got running out of the box without booting Linux). I'll demonstrate it here using the latest package from …
now we parse that matrix into a numpy array:

```
keys =
[[ 0 106 101 86 85 203 648 0 144]
 [ 0 68 69 131 0 77 182 247 485]
 [ 0 44 118 40 0 46 72 326 497]
 [ 0 104 32 28 4 40 83 358 470]
 [ 0 97 17 25 36 244 311 180 191]
 [ 0 96 43 219 277 24 203 75 149]
 [ 0 333 83 196 97 97 86 60 110]
 [ 0 412 280 9 147 6 76 45 75]
 [ 0 683 77 78 58 2 60 36 46]
 [ 0 747 135 2 44 5 39 33 22]
 [ 0 864 50 2 38 0 26 20 14]
 [ 0 909 42 2 22 3 20 6 10]
 [ 0 941 28 4 11 2 17 3 7]]
```

now we load all currently possible transmission risk vectors into another matrix:

```
trvs =
[[6 6 4 6 5 1 1 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[8 7 3 8 6 1 1 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1]
[8 7 2 8 7 1 1 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1]
[8 7 1 8 7 2 1 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1 1]
[5 6 1 7 7 3 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1 1]
[3 4 1 5 6 4 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1 1]
[1 3 1 3 5 5 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1 1]
[1 2 1 2 4 6 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1 1]
[1 1 1 1 3 6 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1 1]
[1 1 1 1 2 7 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1 1]
[1 1 1 1 1 7 2 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1 1]
[1 1 1 1 1 6 3 1 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1 1]
[1 1 1 1 1 6 4 1 1 1 1 1 1 1 1 2 4 6 7 8 8 8 6 5 3 2 2 1 1]]
```

(note: we cut the latest key for every vector, as today's key doesn't get published right away. if we think about this further, this means my solution more or less depends on us being able to filter out the single key that gets published the next day at 3am; otherwise the pattern matching will of course go haywire, which could possibly even be the case in this example)

let's create a solver:

```python
from typing import List

import numpy as np
from ortools.linear_solver import pywraplp

solver: pywraplp.Solver = pywraplp.Solver.CreateSolver("SCIP")
infinity = solver.infinity()
```

create our variables for every possible transmission vector:

```python
vars: List[pywraplp.Variable] = []
for i in range(trvs.shape[1]):
    # trv_i = number of people who submitted vector i
    trvi: pywraplp.Variable = solver.IntVar(0, infinity, f"trv{i}")
    vars.append(trvi)
```

now we need to assign each variable to every field in `keys` that its vector touches:

```python
# key_vars[day][level] collects all vector variables that contribute a key
# with this TRL level on this day
key_vars = np.zeros((keys.shape[0], keys.shape[1], 1), dtype=int).tolist()
for i, var in enumerate(vars):
    for day in range(keys.shape[0]):
        key_vars[day][trvs[day, i]].append(var)
```

all that is left now is to add the constraint that we need at least as many uploaded vectors as published keys for every day and level:

```python
for day in range(len(key_vars)):
    for level in range(len(key_vars[day])):
        # adding the constraint
        solver.Add(sum(key_vars[day][level]) >= int(keys[day, level]))
```

short debug print:

```python
# 29, because there are 29 possible different vectors, including the pre-1.5
# one, which to my understanding could still get published
print("Number of variables =", solver.NumVariables())
print("Number of constraints =", solver.NumConstraints())  # 117
```

let's define our minimization goal:

```python
solver.Minimize(sum(vars))
```

and then solve the problem:

```python
status = solver.Solve()  # 1373 vectors uploaded (note: I removed one 7 by hand)
print(f"took: {solver.wall_time()}")  # 30 ms (wall_time is in milliseconds)
```

let's print the suggested distribution (trv[i] corresponds to column i in the trvs matrix above):
if we now calculate everything backwards and "simulate" a submission of the calculated number of keys, we arrive at the following expected distribution:

```
predicted_keys =
[[ 0 106 101 86 85 203 648 0 144]
[ 0 78 83 131 0 86 203 247 545]
[ 0 66 125 55 0 46 90 408 583]
[ 0 131 32 28 4 55 83 474 566]
[ 0 124 17 26 37 244 401 317 207]
[ 0 119 53 222 360 24 333 75 187]
[ 0 367 117 262 177 148 87 86 129]
[ 0 454 453 9 203 7 106 46 95]
[ 0 883 94 133 86 9 67 55 46]
[ 0 956 219 16 46 5 64 42 25]
[ 0 1166 52 3 55 16 33 26 22]
[ 0 1212 59 2 28 3 42 6 21]
[ 0 1267 28 4 12 2 23 7 30]]
```

calculating the missing keys:

```
predicted_keys - keys =  # missing keys
[[ 0 0 0 0 0 0 0 0 0]
[ 0 10 14 0 0 9 21 0 60]
[ 0 22 7 15 0 0 18 82 86]
[ 0 27 0 0 0 15 0 116 96]
[ 0 27 0 1 1 0 90 137 16]
[ 0 23 10 3 83 0 130 0 38]
[ 0 34 34 66 80 51 1 26 19]
[ 0 42 173 0 56 1 30 1 20]
[ 0 200 17 55 28 7 7 19 0]
[ 0 209 84 14 2 0 25 9 3]
[ 0 302 2 1 17 16 7 6 8]
[ 0 303 17 0 6 0 22 0 11]
[ 0 326 0 0 1 0 6 4 23]]
```
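(for reference: the predicted_keys matrix above can be recomputed from the solver solution; the original snippet isn't shown, so this is my sketch of the "backwards" simulation:)

```python
predicted_keys = np.zeros_like(keys)
for i, var in enumerate(vars):
    n = int(var.solution_value())  # people assigned to vector i
    for day in range(keys.shape[0]):
        # each of these people contributes one key at this day's TRL level
        predicted_keys[day, trvs[day, i]] += n
```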
missing number of keys: `3548`
missing number of keys per uploaded vector: `2.58`, which fits very nicely with the average in the graph of "number of keys published per user per day"

I'd argue it works quite ok, except for the …

btw, we can also minimize the number of missing (surplus) keys directly:

```python
min_fun = 0
for day in range(len(key_vars)):
    for level in range(len(key_vars[day])):
        min_fun += sum(key_vars[day][level]) - int(keys[day, level])
solver.Minimize(min_fun)
```

this results in the exact same distribution (in this case).

### alternative idea

we can invert the constraints and convert the problem into a maximization problem:

```python
for day in range(len(key_vars)):
    for level in range(len(key_vars[day])):
        # solver.Add(sum(key_vars[day][level]) >= int(keys[day, level]))
        solver.Add(sum(key_vars[day][level]) <= int(keys[day, level]))

solver.Maximize(sum(vars))
```

this results in the maximum number of keys we can pattern match with only "complete" vectors, where no key was skipped:

```python
status = solver.Solve()  # 1009 vectors are present without any skipped keys
```
distribution:

```
trv0 213.0
trv1 180.0
trv2 68.0
trv3 13.0
trv4 73.0
trv5 9.0
trv7 4.0
trv8 29.0
trv9 76.0
trv10 65.0
trv11 74.0
trv12 58.0
trv13 44.0
trv14 36.0
trv15 21.0
trv16 11.0
trv17 6.0
trv18 3.0
trv19 2.0
trv20 5.0
trv22 2.0
trv23 2.0
trv24 4.0
trv25 2.0
trv26 1.0
trv27 8.0
```

```
predicted_keys =
[[ 0 76 80 58 68 147 471 0 109]
[ 0 55 57 112 0 58 147 184 396]
[ 0 44 100 36 0 44 62 282 441]
[ 0 97 26 21 4 36 73 329 423]
[ 0 94 13 20 29 234 292 151 176]
[ 0 96 34 219 265 24 159 74 138]
[ 0 333 83 196 65 88 85 58 101]
[ 0 409 263 2 147 3 73 44 68]
[ 0 665 76 78 58 2 56 36 38]
[ 0 737 135 0 44 5 38 30 20]
[ 0 864 50 2 36 0 26 20 11]
[ 0 906 42 2 21 2 20 6 10]
[ 0 941 24 4 11 2 17 3 7]]
```
```
predicted_keys - keys =  # keys which were not yet assigned a vector, as no more vectors can be found that would explain any combination of columns of keys
[[ 0 -30 -21 -28 -17 -56 -177 0 -35]
[ 0 -13 -12 -19 0 -19 -35 -63 -89]
[ 0 0 -18 -4 0 -2 -10 -44 -56]
[ 0 -7 -6 -7 0 -4 -10 -29 -47]
[ 0 -3 -4 -5 -7 -10 -19 -29 -15]
[ 0 0 -9 0 -12 0 -44 -1 -11]
[ 0 0 0 0 -32 -9 -1 -2 -9]
[ 0 -3 -17 -7 0 -3 -3 -1 -7]
[ 0 -18 -1 0 0 0 -4 0 -8]
[ 0 -10 0 -2 0 0 -1 -3 -2]
[ 0 0 0 0 -2 0 0 0 -3]
[ 0 -3 0 0 -1 -1 0 0 0]
[ 0 0 -4 0 0 0 0 0 0]]
```

maybe those keys can then be matched with another algorithm.

TLDR: please publish some raw data and/or implement an official dashboard with the official data ;)

ideas for improvements:
I'm open to suggestions/improvements, as I have never worked with ILPs in practice before and it's been a few years since I dealt with them in theory.
Publishing of data is planned; also be aware of the new package with federation keys.
That is quite an interesting approach. With regression problems in mind, I think a least-squares formulation might also be natural for the estimation. I gave it a quick shot (without a lot of validation so far) based on the data that you posted, so that the results can be compared directly. I attached a LaTeX description: estimation.pdf. I used cvxpy, as the SCIP license is not very permissive (free for academic use only). Code to do the least-squares estimation:

```python
import cvxpy
import numpy as np
TOTALDAYS = 13
TOTALLEVELS = 8
# rows: days 1-13, columns: risk levels 1-8
# example from https://github.com/mh-/diagnosis-keys/issues/12#issuecomment-714272588
counts = np.array([[ 106, 101, 86, 85, 203, 648, 0, 144],
[ 68, 69, 131, 0, 77, 182, 247, 485],
[ 44, 118, 40, 0, 46, 72, 326, 497],
[ 104, 32, 28, 4, 40, 83, 358, 470],
[ 97, 17, 25, 36, 244, 311, 180, 191],
[ 96, 43, 219, 277, 24, 203, 75, 149],
[ 333, 83, 196, 97, 97, 86, 60, 110],
[ 412, 280, 9, 147, 6, 76, 45, 75],
[ 683, 77, 78, 58, 2, 60, 36, 46],
[ 747, 135, 2, 44, 5, 39, 33, 22],
[ 864, 50, 2, 38, 0, 26, 20, 14],
[ 909, 42, 2, 22, 3, 20, 6, 10],
[ 941, 28, 4, 11, 2, 17, 3, 7]])
# taken from from https://github.com/mh-/diagnosis-keys/issues/12#issuecomment-714272588
# but transposed in comparison to the table there
trvectors = [
[6, 8, 8, 8, 5, 3, 1, 1, 1, 1, 1, 1, 1],
[5, 7, 7, 7, 6, 4, 3, 2, 1, 1, 1, 1, 1],
[4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[6, 8, 8, 8, 7, 5, 3, 2, 1, 1, 1, 1, 1],
[5, 6, 7, 7, 7, 6, 5, 4, 3, 2, 1, 1, 1],
[1, 1, 1, 2, 3, 4, 5, 6, 6, 7, 7, 6, 6],
[8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1],
[8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1, 1],
[8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1, 1],
[6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1, 1],
[5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1, 1],
[3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1, 1],
[2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1, 1],
[2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2, 1],
[1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4, 2],
[1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6, 4],
[1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7, 6],
[1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8, 7],
[1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8, 8],
[1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8],
[1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8, 8],
[1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6, 8],
[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 6],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 5],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]
def trvector_to_matrix(trvector):
    T = np.zeros([TOTALDAYS, TOTALLEVELS])
    for i, level in enumerate(trvector):
        T[i, level-1] = 1.0  # level-1 to switch between 1- and 0-based indexing
    return T
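# Editorial note: trvector_to_matrix(trvectors[0]) is a TOTALDAYS x TOTALLEVELS
# 0/1 matrix with exactly one 1 per row; row d has the 1 in column (level - 1),
# i.e. the TRL of day d. Stacking the flattened matrices as columns of A below
# makes (A @ m)[d * TOTALLEVELS + l] the predicted key count on day d, level l+1.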
# set up matrix A
A = np.array([
    trvector_to_matrix(trvector).flatten() for trvector in trvectors
]).T
# assert that A is non-degenerate
# if np.linalg.matrix_rank(A) < min(A.shape):
#     raise ValueError(F"Transmission Risk Matrix is rank deficient! Rank {np.linalg.matrix_rank(A)}, shape {A.shape}")
# set up vector n
n = counts.flatten()
# formulate as MIQP and solve with cvxpy
m = cvxpy.Variable(len(trvectors), integer=True)
objective = cvxpy.Minimize(cvxpy.sum_squares(A @ m - n))
prob = cvxpy.Problem(objective, [m >= 0])
prob.solve(solver=cvxpy.ECOS_BB)
m_opt = np.rint(m.value) # round to nearest integer, solution violates integrality by about tol = 1e-8
print(F"Solved Mixed-Integer Least Squares with status {prob.status} and residual norm {np.sqrt(prob.value)}\n")
print("Number of users per transmission vectors")
for i, trvector in enumerate(trvectors):
    print(F"{trvector}: {m_opt[i]}")
print(F"Sum of users {np.linalg.norm(m_opt, 1)}")
print("\nResidual counts ")
print((A.dot(m_opt) - n).reshape(counts.shape))
# formulate as MIQP with additional constraint A m - n >= 0
m = cvxpy.Variable(len(trvectors), integer=True)
objective = cvxpy.Minimize(cvxpy.sum_squares(A @ m - n))
prob = cvxpy.Problem(objective, [m >= 0, A @ m - n >= 0])
prob.solve(solver=cvxpy.ECOS_BB)
m_opt = np.rint(m.value) # round to nearest integer, solution violates integrality by about tol = 1e-8
print(F"\n\nVariant 2: With monotonicity constraint.\n Solved Mixed-Integer Least Squares with status {prob.status} and residual norm {np.sqrt(prob.value)}\n")
print("Number of users per transmission vectors")
for i, trvector in enumerate(trvectors):
    print(F"{trvector}: {m_opt[i]}")
print(F"Sum of users {np.linalg.norm(m_opt, 1)}")
print("\nResidual counts ")
print((A.dot(m_opt) - n).reshape(counts.shape))
```
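Since estimation.pdf isn't inlined here, this is the problem the code solves, restated from the code itself (the notation is mine):

```latex
\min_{m \in \mathbb{Z}^{k}} \; \lVert A m - n \rVert_2^2
\quad \text{s.t.} \quad m \geq 0
```

where column $i$ of $A$ is the flattened day-by-level indicator matrix of transmission risk vector $i$ (so $k$ is the number of candidate vectors) and $n$ is the flattened matrix of published key counts. Variant 2 adds the constraint $A m - n \geq 0$, i.e. every published key must be explained by some vector.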
Are we now using new TRL vectors or not? Did CWA change their mind after all? The new documentation PDFs seem to suggest so. I'm very confused by the calculation update to v2. Here are the three PDFs that should be relevant to this discussion:
What I find difficult to square is this: right now, the only report type is "confirmed test", so according to the overview/client-interpretation PDFs that should mean TRLs of 7 or 8. Yet in reality we find TRLs of 0-8. Is the explanation that, right now, TRLs are still calculated as before, but the encoding is prepared for a switchover to a different way of calculating TRLs? I'd be very interested in your view of this @micb25 @janpf @mh- @thomasaugsten
For ENF v1, the server distributes TRL 1-8; for ENF v2, the TRL is encoded in DSOS (days since onset of symptoms) and report type, and the mapping to TRL happens on the client.
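To illustrate the mechanism (the values and names in this sketch are purely hypothetical, not the actual CWA mapping, which comes from the server configuration):

```python
# Hypothetical client-side mapping from (report type, days since onset of
# symptoms) to a transmission risk level. Illustrative only.
CONFIRMED_TEST = "CONFIRMED_TEST"

def map_to_trl(report_type: str, days_since_onset: int) -> int:
    if report_type != CONFIRMED_TEST:
        return 1  # hypothetical fallback for other report types
    # hypothetical decay: highest risk around symptom onset
    return max(1, 8 - abs(days_since_onset))

print(map_to_trl(CONFIRMED_TEST, 0))  # 8 at onset
print(map_to_trl(CONFIRMED_TEST, 5))  # 3 further away
```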
@thomasaugsten Thanks! You say the TRL is encoded in DSOS and report type. Does that mean that all that has changed so far is the encoding, not the calculation/determination of the TRL? And that what's discussed in this issue is still WIP and not live?
TRL profiles based on symptoms have been included since v1.5, i.e. since 19.10.20.
The next version of the CWA will contain new TRL profiles that depend on the symptom status of the patient and the onset of symptoms. This will significantly increase the difficulty of accurately estimating the number of people who shared their diagnosis keys. However, I think it should still be possible to estimate that number with a certain accuracy.