nip45: restrict hyperloglog to two hardcoded use cases with deterministic offset for now.

fiatjaf 2024-12-07 07:38:40 -03:00
parent 545281646c
commit bf0d740d12

45.md

@@ -43,10 +43,10 @@ This is so it enables merging results from multiple relays and yielding a reason
This section describes the steps a relay should take in order to return HLL values to clients.
-1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32th ascii character as a nibble (a half-byte, a number between 0 and 16) and add `8` to it to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number `16`;
+1. Upon receiving a filter, if it is eligible (see below) for HyperLogLog, compute the deterministic `offset` for that filter (see below);
2. Initialize 256 registers to `0` for the HLL value;
3. For all the events that are to be counted according to the filter, do this:
-1. Read byte at position `offset` of the event `pubkey`, its value will be the register index `ri`;
+1. Read the byte at position `offset` of the event `pubkey`, its value will be the register index `ri`;
2. Count the number of leading zero bits starting at position `offset+1` of the event `pubkey` and add `1`;
3. Compare that with the value stored at register `ri`; if the new number is bigger, store it (see the sketch after these steps).
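For illustration, here is a minimal sketch of the register-update steps above, written in Go. No particular relay library is assumed and the names are made up for this example; it also assumes `offset` has already been computed for an eligible filter (see the eligibility section further below) and that positions refer to byte positions within the 32-byte binary `pubkey`:

```go
// Sketch of the relay-side register update for one counted event.
// Assumes `offset` was already derived from the filter (8..23 for the
// eligible filters below) and that positions are byte positions.
package hll

import "math/bits"

// AddPubkey updates the 256 HLL registers with one event's pubkey.
func AddPubkey(registers *[256]byte, pubkey [32]byte, offset int) {
	ri := pubkey[offset] // register index: the byte at position `offset`

	// Count leading zero bits starting at byte position offset+1, then add 1.
	zeroBits := 0
	for _, b := range pubkey[offset+1:] {
		if b != 0 {
			zeroBits += bits.LeadingZeros8(b)
			break
		}
		zeroBits += 8
	}
	value := byte(zeroBits + 1)

	// Keep the biggest value ever seen for this register.
	if value > registers[ri] {
		registers[ri] = value
	}
}
```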
@@ -56,6 +56,13 @@ On the client side, these HLL values received from different relays can be merge
And finally, the absolute count can be estimated by running some methods I don't dare to describe here in English; it's better to check some implementation source code (also, there can be different ways of performing the estimation, with different quirks applied on top of the raw registers).
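The merge step is the standard HyperLogLog merge (for each register, keep the highest value seen), and the raw estimate below is the classic Flajolet et al. formula for 256 registers; as the paragraph above says, real implementations apply their own small- and large-range corrections on top of this raw value. A rough sketch, with illustrative names:

```go
// Client-side merge and raw estimate for 256-register HLL values.
package hll

import "math"

// Merge combines two HLL values by keeping the highest value of each register.
func Merge(a, b [256]byte) [256]byte {
	for i := range a {
		if b[i] > a[i] {
			a[i] = b[i]
		}
	}
	return a
}

// RawEstimate is the uncorrected Flajolet et al. estimate; implementations
// add their own bias/range corrections on top of this.
func RawEstimate(registers [256]byte) float64 {
	m := 256.0
	alpha := 0.7213 / (1 + 1.079/m)
	sum := 0.0
	for _, v := range registers {
		sum += math.Exp2(-float64(v))
	}
	return alpha * m * m / sum
}
```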
+### Filter eligibility and `offset` computation
+This NIP defines (for now) two filters eligible for HyperLogLog (a sketch of the check follows this list):
+- `{"#p": ["<pubkey>"], "kinds": [3]}`, i.e. a filter for `kind:3` events with a single `"p"` tag, which means the client is interested in knowing how many people "follow" the target `<pubkey>`. In this case the `offset` will be given by reading the character at position `32` of the hex `<pubkey>` value as a base-16 number and then adding `8` to it.
+- `{"#e": ["<id>"], "kinds": [7]}`, i.e. a filter for `kind:7` events with a single `"e"` tag, which means the client is interested in knowing how many people have reacted to the target event `<id>`. In this case the `offset` will be given by reading the character at position `32` of the hex `<id>` value as a base-16 number and then adding `8` to it.
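A possible sketch of the eligibility check and `offset` computation for these two filter shapes. The `Filter` struct and the function name are illustrative, not taken from any specific library, and the hex `<pubkey>`/`<id>` values are assumed to be the usual 64-character form:

```go
// Eligibility check and offset computation for the two hardcoded filters.
package hll

import "strconv"

// Filter is an illustrative stand-in for a parsed REQ filter.
type Filter struct {
	Kinds []int
	Tags  map[string][]string // e.g. {"p": ["<pubkey>"]} or {"e": ["<id>"]}
}

// HLLOffset returns (offset, true) when the filter matches one of the two
// eligible shapes, reading the hex character at position 32 of the referenced
// pubkey/id as a base-16 number and adding 8 to it.
func HLLOffset(f Filter) (int, bool) {
	if len(f.Kinds) != 1 || len(f.Tags) != 1 {
		return 0, false
	}

	var ref string
	switch {
	case f.Kinds[0] == 3 && len(f.Tags["p"]) == 1:
		ref = f.Tags["p"][0] // "how many follow this pubkey?"
	case f.Kinds[0] == 7 && len(f.Tags["e"]) == 1:
		ref = f.Tags["e"][0] // "how many reacted to this event?"
	default:
		return 0, false
	}

	if len(ref) != 64 {
		return 0, false
	}

	nibble, err := strconv.ParseUint(ref[32:33], 16, 8)
	if err != nil {
		return 0, false
	}
	return int(nibble) + 8, true
}
```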
### Attack vectors
One could mine a pubkey with a certain number of zero bits in the exact place where the HLL algorithm described above would look for them, in order to artificially make its reaction or follow "count" more than others. For this to work, a different pubkey would have to be created for each different target (event id, followed profile, etc). This approach is not very different from creating tons of new pubkeys and using them all to send likes or follow someone in order to inflate their number of followers. The solution is the same in both cases: clients should not fetch these reaction counts from open relays that accept everything; they should base their counts on relays that perform some form of filtering that makes it more likely that only real humans, and not bots or artificially-generated pubkeys, are able to publish there.