additional notes about escaping to ensure correct event IDs

This commit is contained in:
mleku 2024-08-17 11:42:07 +01:00
parent 2c7e2af15f
commit e34653ad04
No known key found for this signature in database

38
01.md
View File

@ -42,6 +42,8 @@ To obtain the `event.id`, we `sha256` the serialized event. The serialization is
] ]
``` ```
### String Escapes
To prevent implementation differences from creating a different event ID for the same event, the following rules MUST be followed while serializing: To prevent implementation differences from creating a different event ID for the same event, the following rules MUST be followed while serializing:
- UTF-8 should be used for encoding. - UTF-8 should be used for encoding.
- Whitespace, line breaks or other unnecessary formatting should not be included in the output JSON. - Whitespace, line breaks or other unnecessary formatting should not be included in the output JSON.
@ -54,6 +56,42 @@ To prevent implementation differences from creating a different event ID for the
- A backspace, (`0x08`), use `\b` - A backspace, (`0x08`), use `\b`
- A form feed, (`0x0C`), use `\f` - A form feed, (`0x0C`), use `\f`
In addition, implementations should retain all other escape sequences
without modification due a normalization to one scheme affecting event IDs
in the absence of a normative marker to specify the one being used,
because there is three forms of escaping other than the single letter C
style as above:
- `\uXX` - 8 bit hex
- `\uXXXX` - 16 bit hex
- `\XXX` - 24 bit octal
Implementations *could* make this a part of their internal data structure
but the primary directive is that the submitted event string encoding MUST
be the same after marshalling it back to JSON, thus it is simpler to just
leave them alone.
There can also be HTML entities, but these do not need special handling due
to their not being based on the reverse solidus " \ ". Longer `\u` codes are
possible, according to UTF-8 rules but few implementations use them and a
parser that accepts the `\u` prefix without modification will accept 2, 4, 6
or 8 hex digits or even incorrect values that don't include reverse solidus.
A parser can thus make a special case for `\u` and `\[0-9]` and cover all cases.
Because of the absence of sentinels to signify which scheme should be used, and
to conserve space on the most frequently occurring control characters, `\n`,
`\t` and `\\`, this specification uses the C-style escapes, and so any escapes
like these three common types above, should not be modified to ensure that the
canonical form of the event that determines the event ID hash is consistent
across implementations.
As a rule, data that is intended to represent binary should be either
encoded in hexadecimal or standard JSON Base64. Wherever possible, as with
the `e` and `p` tags, specifications that put binary data in a specific
format in fields of tags should make it simple for implementations to store
the data in binary format in the runtime to conserve memory and improve
matching performance, at a very low processing cost.
### Tags ### Tags
Each tag is an array of one or more strings, with some conventions around them. Take a look at the example below: Each tag is an array of one or more strings, with some conventions around them. Take a look at the example below: