additional notes about escaping to ensure correct event IDs

2024-12-23 08:55:52 -05:00 · 2024-08-17 11:42:07 +01:00 · 2024-08-17 11:42:07 +01:00 · e34653ad04
commit e34653ad04
parent 2c7e2af15f
1 changed files with 38 additions and 0 deletions
--- a/01.md
+++ b/01.md
@ -42,6 +42,8 @@ To obtain the `event.id`, we `sha256` the serialized event. The serialization is
 ]
 ```
 ### String Escapes
 To prevent implementation differences from creating a different event ID for the same event, the following rules MUST be followed while serializing:
 - UTF-8 should be used for encoding.
 - Whitespace, line breaks or other unnecessary formatting should not be included in the output JSON.
@ -54,6 +56,42 @@ To prevent implementation differences from creating a different event ID for the
  - A backspace, (`0x08`), use `\b`
  - A form feed, (`0x0C`), use `\f`
 In addition, implementations should retain all other escape sequences
 without modification due a normalization to one scheme affecting event IDs
 in the absence of a normative marker to specify the one being used,
 because there is three forms of escaping other than the single letter C
 style as above:
 - `\uXX` - 8 bit hex
 - `\uXXXX` - 16 bit hex
 - `\XXX` - 24 bit octal
 Implementations *could* make this a part of their internal data structure
 but the primary directive is that the submitted event string encoding MUST
 be the same after marshalling it back to JSON, thus it is simpler to just
 leave them alone.
 There can also be HTML entities, but these do not need special handling due
 to their not being based on the reverse solidus " \ ". Longer `\u` codes are
 possible, according to UTF-8 rules but few implementations use them and a
 parser that accepts the `\u` prefix without modification will accept 2, 4, 6
 or 8 hex digits or even incorrect values that don't include reverse solidus.
 A parser can thus make a special case for `\u` and `\[0-9]` and cover all cases.
 Because of the absence of sentinels to signify which scheme should be used, and
 to conserve space on the most frequently occurring control characters, `\n`,
 `\t` and `\\`, this specification uses the C-style escapes, and so any escapes
 like these three common types above, should not be modified to ensure that the
 canonical form of the event that determines the event ID hash is consistent
 across implementations.
 As a rule, data that is intended to represent binary should be either
 encoded in hexadecimal or standard JSON Base64. Wherever possible, as with
 the `e` and `p` tags, specifications that put binary data in a specific
 format in fields of tags should make it simple for implementations to store
 the data in binary format in the runtime to conserve memory and improve
 matching performance, at a very low processing cost.
 ### Tags
 Each tag is an array of one or more strings, with some conventions around them. Take a look at the example below: